Weight Decay and L2 Regularization

Weight decay and L2 regularization are among the oldest and simplest tools for keeping a machine learning model from overfitting. Both work by discouraging the model’s weights from growing too large, on the principle that a model with smaller, more modest parameters tends to be simpler and to generalize better to new data.

L2 regularization expresses this as an extra term added to the loss function: a penalty proportional to the sum of the squared weights. Because the optimizer minimizes the loss, this penalty pulls the weights toward zero as a side effect of training. Weight decay states the same intent more directly: at every update step, multiply each weight by a number slightly less than one, literally shrinking it a little. For plain stochastic gradient descent the two formulations are mathematically equivalent once the decay factor is matched to the penalty strength and the learning rate: a penalty of (λ/2) times the sum of squared weights, trained with learning rate η, corresponds to multiplying each weight by 1 − ηλ at every step. This is why the terms have long been used interchangeably.
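The equivalence under plain SGD can be checked numerically. The following is a minimal sketch, not a reference implementation: the one-dimensional quadratic data loss, its target value of 3.0, and the hyperparameters are all assumptions chosen for illustration. It applies the two update rules side by side and records the final weights.

```python
def grad_loss(w):
    # Gradient of a toy data loss 0.5 * (w - 3.0)**2 (the target 3.0 is arbitrary).
    return w - 3.0

def sgd_l2_step(w, lr, lam):
    # L2 regularization: the penalty (lam / 2) * w**2 adds lam * w to the gradient.
    return w - lr * (grad_loss(w) + lam * w)

def sgd_decay_step(w, lr, lam):
    # Weight decay: shrink the weight by (1 - lr * lam), then take a plain
    # gradient step using the gradient at the unshrunk weight.
    return (1.0 - lr * lam) * w - lr * grad_loss(w)

w_l2 = w_wd = 1.0
for _ in range(100):
    w_l2 = sgd_l2_step(w_l2, lr=0.1, lam=0.01)
    w_wd = sgd_decay_step(w_wd, lr=0.1, lam=0.01)

print(w_l2, w_wd)
```

The two printed values should agree up to floating-point rounding, since expanding either update gives w − ηλw − η∇L(w).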

The equivalence quietly breaks down for adaptive optimizers like Adam, which rescale each parameter’s update by its own gradient history. As Loshchilov and Hutter showed in their 2017 “Decoupled Weight Decay Regularization” paper, a penalty folded into the loss (true L2) has its gradient rescaled along with everything else, so parameters with a history of large gradients are regularized less than parameters with small ones. Applying decay as a separate shrinkage step (true weight decay) bypasses the rescaling and shrinks every weight at the same relative rate, as intended. Their AdamW variant, built on this distinction, has become a common default for training large models.
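The difference can be seen in a small simulation. This is a hedged sketch under strong assumptions: a single scalar weight, a hypothetical helper `run_adam`, and a synthetic data gradient that simply alternates in sign (so the data alone gives the weight no consistent direction). The hyperparameters are illustrative, and the decoupled branch follows the update form described in the paper rather than any particular library's code.

```python
import math

def run_adam(grad_scale, l2=0.0, decoupled=0.0, steps=1000,
             lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # Hypothetical helper: Adam on one scalar weight starting at 1.0.
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        # Synthetic zero-mean data gradient whose size is set by grad_scale.
        g = grad_scale * (-1.0) ** t
        # L2 regularization: the penalty gradient joins the data gradient
        # and is therefore rescaled by Adam's second-moment estimate.
        g += l2 * w
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        # Decoupled weight decay: shrink w directly, outside the rescaling.
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * decoupled * w
    return w

# Same starting weight, data gradients that differ in size by a factor of 100.
l2_small = run_adam(0.1, l2=0.1)
l2_large = run_adam(10.0, l2=0.1)
wd_small = run_adam(0.1, decoupled=0.1)
wd_large = run_adam(10.0, decoupled=0.1)

print("Adam + L2:       ", l2_small, l2_large)
print("Decoupled decay: ", wd_small, wd_large)
```

Under these assumptions, the L2 runs should end far apart, with the large-gradient weight staying much closer to its starting value, while the two decoupled runs should end at almost the same value regardless of gradient scale.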

For a general reader, weight decay is a clean illustration of a core regularization idea: preferring simpler models guards against memorizing noise. It is also a reminder that even a textbook technique can hide a subtlety that only surfaces when the surrounding machinery changes.
