Stochastic Gradient Descent

Stochastic gradient descent, usually shortened to SGD, is the optimization algorithm at the heart of most modern machine learning, and of deep learning in particular. It is the engine that turns a pile of data and a randomly initialized model into one that has actually learned something.

Ordinary gradient descent improves a model by computing the gradient of the error over the entire training set and taking a small step in the direction that reduces that error. For large datasets this is prohibitively slow, because every single update requires a full pass over all the data. SGD makes a pragmatic trade: at each step it estimates the gradient from just one randomly chosen example, or in practice a small “mini-batch,” and updates the model immediately. Each estimate is noisy, but the updates are cheap and frequent, and over many steps the noise tends to average out while the model makes rapid progress. The mathematical foundation for why such noisy, incremental updates converge, provided the step size is reduced appropriately over time, traces back to the 1951 stochastic approximation work of Herbert Robbins and Sutton Monro.
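
The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name sgd_linear_fit, its parameters (lr, batch_size, epochs, seed), and the toy dataset are all assumptions made up for this example. It fits a straight line by mini-batch SGD on mean squared error.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on mean squared error.

    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # fresh random order of examples each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Estimate the gradient of the squared error from this batch only.
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * err * xs[i]
                grad_b += 2.0 * err
            grad_w /= len(batch)
            grad_b /= len(batch)
            # Update immediately, without waiting for the rest of the data.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data generated from y = 3x + 1 (noise-free, to keep the example simple).
xs = [i / 10.0 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Each pass over the 20 examples produces five parameter updates instead of one, which is the trade the paragraph describes: every individual gradient is only an estimate, but the model is corrected far more often. On this toy problem the fitted w and b should end up close to 3 and 1.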

The noise is not purely a cost. It is widely believed to help SGD escape poor solutions and find flat, well-generalizing minima, which is part of why SGD with momentum remains a strong baseline even against more elaborate adaptive optimizers. The step size (known as the learning rate) and the batch size are the key knobs, and tuning them is one of the practical arts of training models.
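
To make the momentum variant mentioned above concrete, here is a small sketch of one common formulation, the classical "heavy-ball" update, in which a running velocity accumulates past gradients. The function name momentum_step and the values lr=0.1 and momentum=0.9 are assumptions chosen for illustration, and for simplicity the gradient here is exact rather than a noisy mini-batch estimate.

```python
def momentum_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update in the classical (heavy-ball) form.

    The names and default values are illustrative assumptions.
    """
    # The velocity is a decaying sum of past gradient steps.
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    # The parameters move along the velocity, not the raw gradient.
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 5.
params, velocity = [5.0], [0.0]
for _ in range(200):
    grads = [2.0 * p for p in params]
    params, velocity = momentum_step(params, grads, velocity)
print(params[0])
```

In a real training loop, grads would come from a mini-batch, so lr and the batch size together determine how large and how noisy each step is, while momentum smooths successive noisy steps into a steadier direction.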

For a general reader, SGD is worth knowing because it is the reason large models can be trained at all: rather than waiting to consider all the data before acting, it learns a little from each scrap and keeps moving, which is exactly what makes training at internet scale feasible.