The problem
The activation functions post showed that sigmoid's derivative is at most 0.25, causing gradients to shrink exponentially through depth. But even with ReLU (gradient = 1 for positive inputs), the signal can still vanish or explode. The weight matrices are the other half of the story.
Consider a 10-layer network where every weight is drawn from a distribution that's slightly too wide or too narrow. Each layer multiplies the signal by a random matrix, and the scale of those entries determines whether the output variance grows, shrinks, or stays constant. After 10 multiplications, even a small per-layer bias compounds exponentially. If the variance shrinks by 20% per layer, after 10 layers you have $0.8^{10} \approx 0.11$ of the original signal. After 50 layers, $0.8^{50} \approx 0.00001$. The signal is effectively dead, and no gradient can revive it.
The fix is to choose the weight scale so that the variance multiplier is exactly 1. That is the entire subject of this post.
Variance of a linear layer
The question is simple: if the inputs to a layer have some variance, what variance will the outputs have? If the output variance is smaller, the signal is shrinking. If larger, it is growing. We want them to be equal.
A single neuron in a layer computes:
$\textcolor{#1D9E75}{y} = \sum_{i=1}^{\textcolor{#7F77DD}{n_\text{in}}} \textcolor{#E8725C}{w_i} \, \textcolor{#E07B9D}{x_i} + \textcolor{#8CB4D5}{b}$
| $\textcolor{#1D9E75}{y}$ | Output of one neuron (before activation). |
| $\textcolor{#E8725C}{w_i}$ | Weight connecting input $i$ to this neuron. |
| $\textcolor{#E07B9D}{x_i}$ | The $i$-th input activation. |
| $\textcolor{#7F77DD}{n_\text{in}}$ | Fan-in: number of inputs to this neuron. |
| $\textcolor{#8CB4D5}{b}$ | Bias. A constant, so $\textcolor{#18B8C8}{\text{Var}}(\textcolor{#1D9E75}{y} + \textcolor{#8CB4D5}{b}) = \textcolor{#18B8C8}{\text{Var}}(\textcolor{#1D9E75}{y})$. Adding a constant shifts the mean but does not change the spread. We can drop it from the analysis. |
Both $\textcolor{#E8725C}{w_i}$ and $\textcolor{#E07B9D}{x_i}$ are random variables (the weights are random at initialization, the inputs are random data), so $\textcolor{#1D9E75}{y}$ is a random variable too. We want to know how $\textcolor{#18B8C8}{\text{Var}}(\textcolor{#1D9E75}{y})$ relates to $\textcolor{#18B8C8}{\text{Var}}(\textcolor{#E07B9D}{x})$. Three assumptions make the math tractable:
- Weights $\textcolor{#E8725C}{w_i}$ are i.i.d. with mean 0 and variance $\textcolor{#18B8C8}{\sigma_w^2}$
- Inputs $\textcolor{#E07B9D}{x_i}$ are i.i.d. with mean 0 and variance $\textcolor{#18B8C8}{\sigma_x^2}$
- Weights and inputs are independent of each other
Step 1: variance of a single product
When two independent, zero-mean random variables are multiplied, the variance of their product is the product of their variances:
$\textcolor{#18B8C8}{\text{Var}}(\textcolor{#E8725C}{w_i}\,\textcolor{#E07B9D}{x_i}) = \textcolor{#18B8C8}{\text{Var}}(\textcolor{#E8725C}{w_i}) \cdot \textcolor{#18B8C8}{\text{Var}}(\textcolor{#E07B9D}{x_i}) = \textcolor{#18B8C8}{\sigma_w^2} \cdot \textcolor{#18B8C8}{\sigma_x^2}$
| $\textcolor{#18B8C8}{\text{Var}}(\textcolor{#E8725C}{w_i}\,\textcolor{#E07B9D}{x_i})$ | Variance of one weight-times-input term. |
| $\textcolor{#18B8C8}{\sigma_w^2}$ | Variance of the weight distribution (the thing we control). |
| $\textcolor{#18B8C8}{\sigma_x^2}$ | Variance of the input activations (from the previous layer). |
This identity holds because $E[\textcolor{#E8725C}{w_i}] = 0$ and $E[\textcolor{#E07B9D}{x_i}] = 0$. The general formula $\text{Var}(AB) = \text{Var}(A)\text{Var}(B) + \text{Var}(A)E[B]^2 + E[A]^2\text{Var}(B)$ simplifies when both means are zero.
Step 2: variance of the sum
The neuron's output $\textcolor{#1D9E75}{y}$ is a sum of $\textcolor{#7F77DD}{n_\text{in}}$ terms. Each term $\textcolor{#E8725C}{w_i}\,\textcolor{#E07B9D}{x_i}$ uses a different weight, so the terms are independent. For independent random variables, the variance of their sum equals the sum of their variances:
$\textcolor{#18B8C8}{\text{Var}}(\textcolor{#1D9E75}{y}) = \sum_{i=1}^{\textcolor{#7F77DD}{n_\text{in}}} \textcolor{#18B8C8}{\text{Var}}(\textcolor{#E8725C}{w_i}\,\textcolor{#E07B9D}{x_i}) = \sum_{i=1}^{\textcolor{#7F77DD}{n_\text{in}}} \textcolor{#18B8C8}{\sigma_w^2} \cdot \textcolor{#18B8C8}{\sigma_x^2} = \textcolor{#7F77DD}{n_\text{in}} \cdot \textcolor{#18B8C8}{\sigma_w^2} \cdot \textcolor{#18B8C8}{\sigma_x^2}$
The last step is just adding the same value $\textcolor{#7F77DD}{n_\text{in}}$ times. Every term has the same variance because the weights are identically distributed and the inputs are identically distributed.
| $\textcolor{#18B8C8}{\text{Var}}(\textcolor{#1D9E75}{y})$ | Variance of the neuron's output. |
| $\textcolor{#7F77DD}{n_\text{in}} \cdot \textcolor{#18B8C8}{\sigma_w^2}$ | The per-layer variance multiplier. This single number determines everything. |
Step 3: the preservation condition
For the signal to pass through a layer without growing or shrinking, we need the output variance to equal the input variance:
$\textcolor{#7F77DD}{n_\text{in}} \cdot \textcolor{#18B8C8}{\sigma_w^2} \cdot \textcolor{#18B8C8}{\sigma_x^2} = \textcolor{#18B8C8}{\sigma_x^2}$
The $\textcolor{#18B8C8}{\sigma_x^2}$ cancels from both sides:
$\textcolor{#7F77DD}{n_\text{in}} \cdot \textcolor{#18B8C8}{\sigma_w^2} = 1 \quad \Longrightarrow \quad \textcolor{#18B8C8}{\sigma_w^2} = \frac{1}{\textcolor{#7F77DD}{n_\text{in}}}$
Each layer's output is the next layer's input. So the multiplier $\textcolor{#7F77DD}{n} \cdot \textcolor{#18B8C8}{\sigma_w^2}$ compounds. After $\textcolor{#8CB4D5}{L}$ layers:
$\textcolor{#18B8C8}{\text{Var}}(\textcolor{#1D9E75}{y_L}) = \left(\textcolor{#7F77DD}{n} \cdot \textcolor{#18B8C8}{\sigma_w^2}\right)^{\textcolor{#8CB4D5}{L}} \cdot \textcolor{#18B8C8}{\text{Var}}(\textcolor{#E07B9D}{x})$
If $\textcolor{#7F77DD}{n} \cdot \textcolor{#18B8C8}{\sigma_w^2} > 1$, exponential growth. If $< 1$, exponential decay. If $= 1$, stable propagation. The entire weight initialization problem reduces to making this product equal 1.
Xavier/Glorot initialization
The forward-pass analysis above gives $\textcolor{#18B8C8}{\sigma_w^2} = 1/\textcolor{#7F77DD}{n_\text{in}}$. Running the same derivation in reverse (how does gradient variance propagate backward?) gives $\textcolor{#18B8C8}{\sigma_w^2} = 1/\textcolor{#D36BE0}{n_\text{out}}$. These two conditions can't both be satisfied unless $\textcolor{#7F77DD}{n_\text{in}} = \textcolor{#D36BE0}{n_\text{out}}$, so Glorot and Bengio (2010) compromise with the harmonic mean:
$\textcolor{#18B8C8}{\sigma_w^2} = \frac{2}{\textcolor{#7F77DD}{n_\text{in}} + \textcolor{#D36BE0}{n_\text{out}}}$
| $\textcolor{#7F77DD}{n_\text{in}}$ | Fan-in: number of inputs to the layer. |
| $\textcolor{#D36BE0}{n_\text{out}}$ | Fan-out: number of outputs. |
| $\frac{2}{\textcolor{#7F77DD}{n_\text{in}} + \textcolor{#D36BE0}{n_\text{out}}}$ | Harmonic mean of the forward and backward conditions. |
In practice, weights are sampled from either a normal or uniform distribution with this variance:
$\textcolor{#E8725C}{w} \sim \mathcal{N}\!\left(0,\; \frac{2}{\textcolor{#7F77DD}{n_\text{in}} + \textcolor{#D36BE0}{n_\text{out}}}\right) \qquad \text{or} \qquad \textcolor{#E8725C}{w} \sim U\!\left[-\sqrt{\frac{6}{\textcolor{#7F77DD}{n_\text{in}} + \textcolor{#D36BE0}{n_\text{out}}}},\;\; \sqrt{\frac{6}{\textcolor{#7F77DD}{n_\text{in}} + \textcolor{#D36BE0}{n_\text{out}}}}\right]$
The $\sqrt{6}$ in the uniform version comes from the variance of $U[-a, a]$ being $a^2/3$: setting $a^2/3 = 2/(\textcolor{#7F77DD}{n_\text{in}} + \textcolor{#D36BE0}{n_\text{out}})$ gives $a = \sqrt{6/(\textcolor{#7F77DD}{n_\text{in}} + \textcolor{#D36BE0}{n_\text{out}})}$.
The derivation assumed a linear activation. This is why Xavier works for sigmoid and tanh: near the origin (where properly initialized activations live), both functions are approximately linear. But it breaks for ReLU, because ReLU is not linear around zero. It kills half the signal.
He/Kaiming initialization
ReLU zeroes out all negative values. If the pre-activation $\textcolor{#1D9E75}{y}$ is symmetric around zero (which it is, given zero-mean weights and inputs), exactly half the values are killed. The surviving half retains its variance, but averaged over the whole output, the variance halves:
$\textcolor{#18B8C8}{\text{Var}}\!\left(\textcolor{#D85A30}{\text{ReLU}}(\textcolor{#1D9E75}{y})\right) = \frac{1}{2} \cdot \textcolor{#18B8C8}{\text{Var}}(\textcolor{#1D9E75}{y})$
As shown in the activation functions post, ReLU's gradient is 1 for positive inputs and 0 for negative inputs. With zero-centered pre-activations, the expected gradient is $\frac{1}{2}$, so each layer halves the variance. Combining this with the linear-layer variance from before:
$\textcolor{#18B8C8}{\text{Var}}(\textcolor{#1D9E75}{y_{l+1}}) = \frac{1}{2} \cdot \textcolor{#7F77DD}{n_\text{in}} \cdot \textcolor{#18B8C8}{\sigma_w^2} \cdot \textcolor{#18B8C8}{\text{Var}}(\textcolor{#1D9E75}{y_l})$
For variance preservation, the per-layer multiplier must equal 1:
$\frac{1}{2} \cdot \textcolor{#7F77DD}{n_\text{in}} \cdot \textcolor{#18B8C8}{\sigma_w^2} = 1 \quad \Longrightarrow \quad \textcolor{#18B8C8}{\sigma_w^2} = \frac{2}{\textcolor{#7F77DD}{n_\text{in}}}$
| $\frac{2}{\textcolor{#7F77DD}{n_\text{in}}}$ | He initialization variance. The 2 in the numerator compensates for ReLU killing half the distribution. |
He et al. (2015) showed this allows training networks with 30+ layers where Xavier fails. The same idea extends to Leaky ReLU: if the negative slope is $\alpha$, the variance factor becomes $(1 + \alpha^2)/2$ instead of $1/2$, giving $\textcolor{#18B8C8}{\sigma_w^2} = 2 / ((1 + \alpha^2) \cdot \textcolor{#7F77DD}{n_\text{in}})$.
Switch between activations to see the difference. With linear, Xavier ($1/n$) keeps variance flat and He ($2/n$) causes it to grow. With ReLU, He stays flat and Xavier decays (each layer halves the variance because of the missing factor of 2). Tanh behaves close to linear near the origin. GELU sits between tanh and ReLU: it suppresses slightly more than half the distribution, so He slightly overshoots and Xavier slightly undershoots.
PyTorch in practice
PyTorch's nn.Linear defaults to Kaiming uniform initialization. This is why you can stack nn.Linear + nn.ReLU layers and things "just work": PyTorch already applies the $\textcolor{#D85A30}{2/\textcolor{#7F77DD}{n_\text{in}}}$ variance from the He derivation above. You only need to change initialization explicitly when using a non-ReLU activation or when you want the normal (Gaussian) variant instead of uniform:
import torch.nn as nn
# Default: Kaiming uniform (designed for ReLU)
layer = nn.Linear(256, 128)
# Xavier normal (for tanh/sigmoid)
nn.init.xavier_normal_(layer.weight)
# He/Kaiming normal (for ReLU)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
# He for Leaky ReLU (accounts for negative slope)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='leaky_relu')
For a full network, apply initialization in a loop after construction:
class MLP(nn.Module):
def __init__(self, dims):
super().__init__()
layers = []
for i in range(len(dims) - 1):
layers.append(nn.Linear(dims[i], dims[i + 1]))
if i < len(dims) - 2:
layers.append(nn.ReLU())
self.net = nn.Sequential(*layers)
for m in self.modules():
if isinstance(m, nn.Linear):
nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
nn.init.zeros_(m.bias)
model = MLP([784, 256, 256, 256, 10])
The mode='fan_in' argument (default) preserves variance in the forward pass. Use mode='fan_out' if you care more about gradient variance in the backward pass. For most architectures, fan_in is the right choice.
Other methods
LeCun initialization predates Xavier: $\textcolor{#18B8C8}{\sigma_w^2} = 1/\textcolor{#7F77DD}{n_\text{in}}$. It is the forward-pass-only version of Xavier (equivalent when $\textcolor{#7F77DD}{n_\text{in}} = \textcolor{#D36BE0}{n_\text{out}}$). Used with SELU activations, which are designed to be self-normalizing.
Orthogonal initialization constructs the weight matrix as a random orthogonal matrix (via QR decomposition of a random Gaussian matrix). An orthogonal matrix preserves norms exactly, not just in expectation. This is particularly useful for RNNs, where the same weight matrix is applied at every time step and even small per-step variance changes compound over hundreds of steps.
What about GELU? GELU soft-gates rather than hard-zeroing, so its variance factor is ~0.425 (slightly less than ReLU's 0.5). Neither Xavier nor He perfectly preserves variance for GELU. The "correct" init would be $\textcolor{#18B8C8}{\sigma_w^2} \approx 2.35/\textcolor{#7F77DD}{n_\text{in}}$, but nobody uses this in practice because GELU networks (transformers) always include LayerNorm, which corrects the drift at every layer.
Normalization layers reduce sensitivity. Batch normalization, layer normalization, and group normalization re-center and re-scale activations at every layer, partially correcting bad initialization on the fly. This is why modern architectures (transformers with LayerNorm, ResNets with BatchNorm) are less sensitive to initialization than the theory predicts. But initialization still matters for the first few training steps and for architectures without normalization.
| Method | $\textcolor{#18B8C8}{\sigma_w^2}$ | Best for |
|---|---|---|
| Xavier/Glorot | $\frac{2}{\textcolor{#7F77DD}{n_\text{in}} + \textcolor{#D36BE0}{n_\text{out}}}$ | sigmoid, tanh |
| He/Kaiming | $\frac{2}{\textcolor{#7F77DD}{n_\text{in}}}$ | ReLU, Leaky ReLU |
| LeCun | $\frac{1}{\textcolor{#7F77DD}{n_\text{in}}}$ | SELU |
| Orthogonal | N/A (preserves norms exactly) | RNNs |