Weight Initialization (Xavier and He)

Weight initialization is the choice of what values to give a neural network’s connection weights before training begins. It sounds like a minor detail, but it is not. If the initial weights are too large, signals and gradients grow exponentially as they pass through many layers; if they are too small, they shrink toward zero. In either case a deep network fails to train. The goal of a good initialization scheme is to set the random starting weights so that the variance of the signals stays roughly constant as data flows forward through the layers and as gradients flow backward.
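The effect of scale can be seen in a small experiment. The sketch below is illustrative only: it uses a stack of purely linear layers with no activation function, and the width, depth, and the helper name forward_variances are assumptions chosen for the demonstration. It pushes one random input through 20 layers and compares three weight scales. With a standard deviation of 1/sqrt(width) the variance is expected to stay near its starting value; halving or doubling that scale changes the variance by a factor of about 0.25 or 4 per layer.

```python
import math
import random
import statistics


def forward_variances(weight_std, width=64, depth=20, seed=0):
    """Return the activation variance after each of `depth` linear layers.

    Hypothetical helper for illustration: every layer is a dense
    width-by-width matrix with weights drawn from N(0, weight_std**2).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    variances = []
    for _ in range(depth):
        # Each output unit is a freshly sampled weighted sum of all inputs.
        x = [sum(rng.gauss(0.0, weight_std) * xi for xi in x)
             for _ in range(width)]
        variances.append(statistics.pvariance(x))
    return variances


WIDTH = 64
scale = 1.0 / math.sqrt(WIDTH)  # keeps the variance roughly constant

too_small = forward_variances(0.5 * scale)[-1]
balanced = forward_variances(scale)[-1]
too_large = forward_variances(2.0 * scale)[-1]

print(f"half scale:   variance after 20 layers = {too_small:.3e}")
print(f"1/sqrt(n):    variance after 20 layers = {balanced:.3e}")
print(f"double scale: variance after 20 layers = {too_large:.3e}")
```

The same reasoning applies to gradients flowing backward, which is why initialization schemes try to balance both directions.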

Two schemes became standard. The first, often called Xavier or Glorot initialization, came from the 2010 paper “Understanding the difficulty of training deep feedforward neural networks” by Xavier Glorot and Yoshua Bengio, which analyzed why deep networks with sigmoid-like activations were so hard to train and proposed drawing each layer’s initial weights with a variance of 2 divided by the sum of the layer’s input and output counts. The second, He initialization, came from the 2015 paper “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification” by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. It adjusted the scaling for rectified linear units, which zero out roughly half of their inputs, by using a variance of 2 divided by the number of inputs. That made it possible to train very deep rectifier networks from scratch and contributed to the first reported result to surpass the estimated human-level accuracy on the ImageNet classification benchmark.
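The two rules can be written in a few lines. The sketch below is an illustration, not either paper's code: it uses the Gaussian form of both schemes (Glorot and Bengio's paper states its rule as a uniform distribution with the same variance), and the function names xavier_normal, he_normal, and relu_stack_signal, along with the width and depth, are assumptions made for the example. Here fan_in and fan_out are a layer's input and output counts, so Xavier uses a variance of 2 / (fan_in + fan_out) and He uses 2 / fan_in. The experiment passes a random input through 20 ReLU layers and compares how much signal survives under each rule.

```python
import math
import random


def xavier_normal(fan_in, fan_out, rng):
    """Gaussian Xavier/Glorot weights: variance 2 / (fan_in + fan_out)."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


def he_normal(fan_in, fan_out, rng):
    """Gaussian He weights: variance 2 / fan_in, sized for ReLU layers."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


def relu_stack_signal(init_fn, width=128, depth=20, seed=0):
    """Mean squared activation after `depth` ReLU layers (hypothetical demo)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        weights = init_fn(width, width, rng)
        # Dense layer followed by ReLU, which zeroes negative pre-activations.
        x = [max(0.0, sum(w * xi for w, xi in zip(row, x)))
             for row in weights]
    return sum(v * v for v in x) / len(x)


he_signal = relu_stack_signal(he_normal)
xavier_signal = relu_stack_signal(xavier_normal)

print(f"He init, 20 ReLU layers:     mean square = {he_signal:.3e}")
print(f"Xavier init, 20 ReLU layers: mean square = {xavier_signal:.3e}")
```

For square layers the Xavier variance is half the He variance, so under ReLU the signal is expected to lose about half its magnitude at every layer, while the He scaling is designed to keep it roughly level.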

These initialization rules, alongside techniques like batch normalization and residual connections, are part of why depth stopped being a barrier. Today they are built into deep learning frameworks as sensible defaults that most practitioners never have to think about.

Why a business reader should care: weight initialization is one of the quiet mathematical insights that made deep networks trainable at all, a precondition for essentially every modern AI system rather than an optional refinement.

Last verified June 7, 2026