Why nonlinearity
A neural network layer does two things: multiply by a weight matrix and add a bias. That is a linear transformation (strictly an affine one, because of the bias). The question is: what happens when you stack two of them?
Layer 1 produces $\textcolor{#E8725C}{W_1}\textcolor{#E07B9D}{x} + \textcolor{#7F77DD}{b_1}$. Layer 2 takes that as input:
$\textcolor{#E8725C}{W_2}\left(\textcolor{#E8725C}{W_1}\textcolor{#E07B9D}{x} + \textcolor{#7F77DD}{b_1}\right) + \textcolor{#7F77DD}{b_2} = \underbrace{\textcolor{#E8725C}{W_2}\textcolor{#E8725C}{W_1}}_{\text{one matrix}}\textcolor{#E07B9D}{x} + \underbrace{\textcolor{#E8725C}{W_2}\textcolor{#7F77DD}{b_1} + \textcolor{#7F77DD}{b_2}}_{\text{one bias}}$
| $\textcolor{#E8725C}{W_1}, \textcolor{#E8725C}{W_2}$ | Weight matrices for layers 1 and 2. Each layer has its own. |
| $\textcolor{#7F77DD}{b_1}, \textcolor{#7F77DD}{b_2}$ | Bias vectors for each layer. |
| $\textcolor{#E07B9D}{x}$ | The input vector to the network. |
Two layers collapsed into one. This generalizes to any number of layers: $\textcolor{#E8725C}{W_L} \cdots \textcolor{#E8725C}{W_2}\textcolor{#E8725C}{W_1}$ is always a single matrix. Depth without nonlinearity is an illusion.
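The collapse can be checked numerically. The sketch below uses small hand-picked matrices (illustrative values, not from the post) and plain Python lists, and compares the two-layer computation with the single merged layer.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Illustrative values chosen for this example only.
W1, b1 = [[1.0, -2.0], [0.5, 3.0]], [0.1, -0.2]
W2, b2 = [[2.0, 0.0], [-1.0, 1.0]], [0.3, 0.4]
x = [1.5, -0.5]

# Two stacked layers with no activation in between.
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One layer built from the merged matrix W2 W1 and merged bias W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layers)
print(one_layer)
```

The two printed vectors should agree up to floating-point rounding, for any choice of weights and input.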
Now insert an activation function $\textcolor{#C9A227}{f}$ between the layers. Layer 1 produces $\textcolor{#C9A227}{f}(\textcolor{#E8725C}{W_1}\textcolor{#E07B9D}{x} + \textcolor{#7F77DD}{b_1})$, and layer 2 computes $\textcolor{#E8725C}{W_2} \cdot \textcolor{#C9A227}{f}(\textcolor{#E8725C}{W_1}\textcolor{#E07B9D}{x} + \textcolor{#7F77DD}{b_1}) + \textcolor{#7F77DD}{b_2}$. The function $\textcolor{#C9A227}{f}$ sits inside the composition, so the matrices cannot be multiplied together. The output is no longer a linear function of the input. The demo below shows a classification problem that no linear model can solve. With a nonlinear activation function between layers, the network bends the decision boundary to separate the classes. Without it, you get a straight line.
The function explorer
Before diving into each activation function individually, here is an interactive explorer. Switch between functions and drag along the curve to see the output and gradient at each point. The dashed line is the derivative.
Sigmoid
$\textcolor{#C9A227}{\sigma}\left(\begin{bmatrix} -1 & 2 \\\\ 0.5 & -3 \end{bmatrix}\right) = \begin{bmatrix} 0.27 & 0.88 \\\\ 0.62 & 0.05 \end{bmatrix}$
We want a function that squashes any real number into the range $(0, 1)$. One natural path: start with odds. If the probability of an event is $p$, the odds are $\frac{p}{1-p}$. The log-odds (or logit) is $\log\frac{p}{1-p}$, which ranges over all real numbers. Inverting that relationship (solving for $p$) gives the logistic function:
$\textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x}) = \frac{1}{1 + \textcolor{#7F77DD}{e}^{-\textcolor{#E07B9D}{x}}}$
| $\textcolor{#E07B9D}{x}$ | The input (pre-activation). Can be any real number from $-\infty$ to $+\infty$. |
| $\textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})$ | The output. Always in $(0, 1)$. Behaves like a probability: 0 means "off," 1 means "on." |
| $\textcolor{#7F77DD}{e}^{-\textcolor{#E07B9D}{x}}$ | The exponential decay term. When $\textcolor{#E07B9D}{x}$ is large and positive, this is near 0, so the output is near 1. When $\textcolor{#E07B9D}{x}$ is large and negative, this dominates, pushing the output toward 0. |
Deriving the derivative
The derivative tells us how sensitive the output is to small changes in input. We need $\textcolor{#8CB4D5}{\sigma'(x)}$, and the derivation is a clean application of the quotient rule.
Write $\textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x}) = \frac{\textcolor{#1D9E75}{1}}{\textcolor{#D85A30}{1 + e^{-x}}}$. The quotient rule says $\left(\frac{u}{v}\right)' = \frac{u'v - uv'}{v^2}$:
$\textcolor{#8CB4D5}{\sigma'(x)} = \frac{\textcolor{#1D9E75}{0} \cdot \textcolor{#D85A30}{(1+e^{-x})} - \textcolor{#1D9E75}{1} \cdot \textcolor{#D85A30}{(-e^{-x})}}{\textcolor{#D85A30}{(1+e^{-x})^2}} = \frac{e^{-x}}{(1+e^{-x})^2}$
Now notice that $\frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}}$. The first factor is $\textcolor{#C9A227}{\sigma(x)}$, and the second is $\frac{(1+e^{-x}) - 1}{1+e^{-x}} = 1 - \textcolor{#C9A227}{\sigma(x)}$. So:
$\textcolor{#8CB4D5}{\sigma'(x)} = \textcolor{#C9A227}{\sigma(x)}\left(1 - \textcolor{#C9A227}{\sigma(x)}\right)$
| $\textcolor{#8CB4D5}{\sigma'(x)}$ | The derivative (gradient) of sigmoid. Tells you how much the output changes per unit change in input. |
| $\textcolor{#C9A227}{\sigma(x)}(1 - \textcolor{#C9A227}{\sigma(x)})$ | A product of the output and its complement. Maximum when $\textcolor{#C9A227}{\sigma(x)} = 0.5$ (i.e., when $\textcolor{#E07B9D}{x} = 0$), which gives $0.5 \times 0.5 = 0.25$. |
This elegant form is why sigmoid was popular: the derivative is trivially cheap to compute once you already have the output. But it also reveals the problem. The maximum value is 0.25, and it drops off sharply as you move away from zero.
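As a rough sketch of how this looks in code, the snippet below implements sigmoid and its derivative with the standard library only. The branch on the sign of the input and the helper `finite_diff` are additions for this example: the first avoids overflow in `math.exp` for large negative inputs, the second is a central-difference check of the closed-form derivative.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def finite_diff(f, x, h=1e-5):
    """Central-difference estimate of f'(x), used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-1.0, 2.0, 0.5, -3.0):
    print(x, round(sigmoid(x), 2), round(sigmoid_grad(x), 3),
          round(finite_diff(sigmoid, x), 3))
```

In a real backward pass the forward output would be cached, so the derivative costs one subtraction and one multiplication.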
Concrete values
Using the same inputs from the section preview:
| $\textcolor{#E07B9D}{x}$ | $\textcolor{#C9A227}{\sigma(x)}$ | $\textcolor{#8CB4D5}{\sigma'(x)}$ | |
|---|---|---|---|
| $-1$ | $0.27$ | $0.197$ | Moderate input, gradient still reasonable |
| $2$ | $0.88$ | $0.105$ | Approaching saturation, gradient below half its peak |
| $0.5$ | $0.62$ | $0.235$ | Near zero, close to peak gradient of $0.25$ |
| $-3$ | $0.05$ | $0.045$ | Deep in saturation, gradient nearly gone |
The derivative peaks at $0.25$ (when $\textcolor{#E07B9D}{x} = 0$) and drops off sharply in both directions. By $\textcolor{#E07B9D}{x} = -3$, the gradient is $0.045$: the function has saturated. Stack 10 layers of this and the activation derivatives alone scale the gradient by at most $0.25^{10} \approx 10^{-6}$. This is the vanishing gradient problem.
The problems with sigmoid in hidden layers:
Vanishing gradients. The derivative peaks at 0.25, not 1. Every layer multiplies the gradient by at most 0.25. After 10 layers, the activation derivatives scale the gradient by at most $0.25^{10} \approx 10^{-6}$. Early layers barely learn.
Not zero-centered. Sigmoid outputs are always positive (between 0 and 1). This means that, for any given neuron in the next layer, the gradients with respect to its incoming weights all share the same sign, forcing zig-zag updates during optimization.
Expensive. Computing $e^{-x}$ is slower than the single comparison that ReLU needs.
Tanh
$\textcolor{#7F77DD}{\tanh}\left(\begin{bmatrix} -1 & 2 \\\\ 0.5 & -3 \end{bmatrix}\right) = \begin{bmatrix} -0.76 & 0.96 \\\\ 0.46 & -1.00 \end{bmatrix}$
Tanh is sigmoid's zero-centered cousin. It squashes to $(-1, 1)$ instead of $(0, 1)$:
$\textcolor{#7F77DD}{\tanh}(\textcolor{#E07B9D}{x}) = \frac{\textcolor{#7F77DD}{e}^{\textcolor{#E07B9D}{x}} - \textcolor{#7F77DD}{e}^{-\textcolor{#E07B9D}{x}}}{\textcolor{#7F77DD}{e}^{\textcolor{#E07B9D}{x}} + \textcolor{#7F77DD}{e}^{-\textcolor{#E07B9D}{x}}}$
| $\textcolor{#E07B9D}{x}$ | The input (pre-activation). |
| $\textcolor{#7F77DD}{\tanh}(\textcolor{#E07B9D}{x})$ | The output. Always in $(-1, 1)$. Zero-centered, unlike sigmoid. |
Connection to sigmoid
Tanh and sigmoid are not just similar; one is a linear transformation of the other. Starting from $2\textcolor{#C9A227}{\sigma}(2\textcolor{#E07B9D}{x}) - 1$:
$2\textcolor{#C9A227}{\sigma}(2\textcolor{#E07B9D}{x}) - 1 = \frac{2}{1 + e^{-2x}} - 1 = \frac{2 - (1 + e^{-2x})}{1 + e^{-2x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}}$
Multiply numerator and denominator by $e^{x}$:
$\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \textcolor{#7F77DD}{\tanh}(\textcolor{#E07B9D}{x})$
So $\textcolor{#7F77DD}{\tanh}(\textcolor{#E07B9D}{x}) = 2\textcolor{#C9A227}{\sigma}(2\textcolor{#E07B9D}{x}) - 1$. Tanh is a rescaled, shifted sigmoid.
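The identity is easy to confirm numerically. The sketch below compares Python's built-in `math.tanh` with the rescaled sigmoid at a few arbitrary test points; the helper name `tanh_via_sigmoid` is made up for this example.

```python
import math

def sigmoid(x):
    """Plain logistic function; fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh written as a rescaled, shifted sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x, math.tanh(x), tanh_via_sigmoid(x))
```

The two columns should match up to floating-point rounding.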
Deriving the derivative
Apply the quotient rule to $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ or, more quickly, combine $\tanh'(x) = \text{sech}^2(x)$ with the identity $\tanh^2(x) + \text{sech}^2(x) = 1$:
$\textcolor{#8CB4D5}{\tanh'(x)} = 1 - \textcolor{#7F77DD}{\tanh}^2(\textcolor{#E07B9D}{x})$
| $\textcolor{#8CB4D5}{\tanh'(x)}$ | The derivative of tanh. Peaks at 1.0 when $\textcolor{#E07B9D}{x} = 0$ (four times stronger than sigmoid's 0.25). |
| $1 - \textcolor{#7F77DD}{\tanh}^2(\textcolor{#E07B9D}{x})$ | One minus the square of the output. Maximum when $\textcolor{#7F77DD}{\tanh(x)} = 0$ (at the origin), minimum when $|\textcolor{#7F77DD}{\tanh(x)}| \to 1$ (in the tails). |
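For readers who want the quotient-rule route spelled out, here is a short worked version. The shorthand $u$ and $v$ is introduced only for this derivation:

$u = e^{x} - e^{-x}, \quad v = e^{x} + e^{-x}, \quad u' = v, \quad v' = u$

$\tanh'(x) = \frac{u'v - uv'}{v^2} = \frac{v^2 - u^2}{v^2} = 1 - \left(\frac{u}{v}\right)^2 = 1 - \tanh^2(x)$

Since $v^2 - u^2 = 4$, the same expression equals $\frac{4}{(e^{x} + e^{-x})^2} = \text{sech}^2(x)$, which agrees with the identity route.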
Concrete values
Using the same reference inputs:
| $\textcolor{#E07B9D}{x}$ | $\textcolor{#7F77DD}{\tanh(x)}$ | $\textcolor{#8CB4D5}{\tanh'(x)}$ | |
|---|---|---|---|
| $-1$ | $-0.76$ | $0.420$ | Moderate input, gradient still healthy |
| $2$ | $0.96$ | $0.071$ | Near saturation, gradient dropping fast |
| $0.5$ | $0.46$ | $0.786$ | Near origin, gradient close to peak of $1.0$ |
| $-3$ | $-1.00$ | $0.010$ | Fully saturated, gradient nearly zero |
Comparison with sigmoid
At the origin, tanh has a gradient of $\textcolor{#8CB4D5}{1.0}$, four times stronger than sigmoid's $\textcolor{#8CB4D5}{0.25}$. This means gradients flow more freely near the center. But tanh still saturates: by $|\textcolor{#E07B9D}{x}| = 2$, the gradient has dropped to 0.071. By $|\textcolor{#E07B9D}{x}| = 5$, it is below 0.001. The tails are at least as flat as sigmoid's: at $|\textcolor{#E07B9D}{x}| = 2$, tanh's gradient of 0.071 is already below sigmoid's 0.105.
Tanh is still used inside LSTM and GRU cells (for the candidate and output values, which need to lie in $(-1, 1)$), and occasionally in the hidden layers of small networks. For deep networks, it has the same fundamental issue as sigmoid: saturating tails kill the gradient.
ReLU
$\textcolor{#D85A30}{\text{ReLU}}\left(\begin{bmatrix} -1 & 2 \\\\ 0.5 & -3 \end{bmatrix}\right) = \begin{bmatrix} 0 & 2 \\\\ 0.5 & 0 \end{bmatrix}$
ReLU (Rectified Linear Unit) is the function that made deep learning practical:
$\textcolor{#D85A30}{\text{ReLU}}(\textcolor{#E07B9D}{x}) = \max(0, \textcolor{#E07B9D}{x})$
| $\textcolor{#E07B9D}{x}$ | The input (pre-activation). |
| $\textcolor{#D85A30}{\text{ReLU}}(\textcolor{#E07B9D}{x})$ | The output. Equal to $\textcolor{#E07B9D}{x}$ when positive, 0 when negative. No upper bound. |
Dead simple. If the input is positive, pass it through. If it is negative, output zero.
The derivative is a step function
$\textcolor{#8CB4D5}{\text{ReLU}'(x)} = \begin{cases} \textcolor{#D85A30}{1} & \text{if } \textcolor{#E07B9D}{x} > 0 \\\\ \textcolor{#D85A30}{0} & \text{if } \textcolor{#E07B9D}{x} < 0 \end{cases}$
| $\textcolor{#8CB4D5}{\text{ReLU}'(x)}$ | The derivative. Either 1 or 0. No saturation in the positive regime, no expensive computation. |
| $\textcolor{#E07B9D}{x} = 0$ | Technically undefined (the kink). In practice, frameworks like PyTorch define it as 0. The choice doesn't affect training because with continuous-valued inputs, landing on exactly $0.000...$ has probability zero. |
With zero-centered inputs, roughly half the pre-activations are positive (gradient $= 1$) and half are negative (gradient $= 0$). The expected gradient across the layer is $\frac{1}{2}$, and because half the units are zeroed, ReLU roughly halves the mean squared value of the signal at each layer. This has direct consequences for weight initialization.
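The halving can be checked empirically. The sketch below assumes standard-normal pre-activations (an assumption made for this example, not something the text specifies) and measures the mean square of the signal before and after ReLU. For a zero-mean input, the mean square is the variance.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n = 200_000
# Zero-centered pre-activations: standard normal samples (assumed distribution).
z = [random.gauss(0.0, 1.0) for _ in range(n)]
a = [max(0.0, v) for v in z]  # ReLU applied element-wise

frac_active = sum(v > 0 for v in z) / n   # share of units with gradient 1
mean_sq_in = sum(v * v for v in z) / n    # estimate of E[z^2]
mean_sq_out = sum(v * v for v in a) / n   # estimate of E[ReLU(z)^2]
ratio = mean_sq_out / mean_sq_in

print(frac_active, ratio)
```

Both printed numbers should land close to one half, which is the factor that initialization schemes designed for ReLU compensate for.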
Why this fixes vanishing gradients
The key insight: in the positive regime, the gradient is exactly $\textcolor{#D85A30}{1}$. No matter how deep the network, the gradient passes through unchanged. Let's trace the gradient through a 5-layer network where each layer computes $\textcolor{#E07B9D}{z_l} = \textcolor{#E8725C}{W_l} \cdot \textcolor{#C9A227}{f_{l-1}} + \textcolor{#7F77DD}{b_l}$ (the pre-activation), then applies the activation: $\textcolor{#C9A227}{f_l} = \textcolor{#D85A30}{\text{ReLU}}(\textcolor{#E07B9D}{z_l})$. By the chain rule, the gradient of the loss with respect to a parameter in layer 1 includes the product of activation derivatives at every layer:
$\textcolor{#8CB4D5}{\frac{\partial \text{loss}}{\partial W_1}} = \textcolor{#8CB4D5}{\frac{\partial \text{loss}}{\partial \textcolor{#C9A227}{f_5}}} \cdot \textcolor{#D85A30}{\text{ReLU}'}(\textcolor{#E07B9D}{z_5}) \cdot \textcolor{#E8725C}{W_5} \cdot \textcolor{#D85A30}{\text{ReLU}'}(\textcolor{#E07B9D}{z_4}) \cdot \textcolor{#E8725C}{W_4} \cdot \textcolor{#D85A30}{\text{ReLU}'}(\textcolor{#E07B9D}{z_3}) \cdot \textcolor{#E8725C}{W_3} \cdot \textcolor{#D85A30}{\text{ReLU}'}(\textcolor{#E07B9D}{z_2}) \cdot \textcolor{#E8725C}{W_2} \cdot \textcolor{#D85A30}{\text{ReLU}'}(\textcolor{#E07B9D}{z_1}) \cdot \textcolor{#E07B9D}{x}$
If every pre-activation $\textcolor{#E07B9D}{z_l}$ is positive, every $\textcolor{#D85A30}{\text{ReLU}'}(\textcolor{#E07B9D}{z_l}) = 1$. The gradient is not attenuated by the activations at all. Compare this to sigmoid, where each $\textcolor{#C9A227}{\sigma'}(\textcolor{#E07B9D}{z_l}) \leq 0.25$: multiplying five of them gives at most $0.25^5 \approx 0.000977$. With ReLU, the five activation derivative factors contribute a product of $1^5 = 1$. This is why deep networks started working.
The dead neuron problem
The catch: if a neuron's pre-activation $\textcolor{#E07B9D}{z} = \textcolor{#E8725C}{W} \cdot \textcolor{#E07B9D}{x} + \textcolor{#7F77DD}{b}$ is negative for every training example, then:
- Output: $\textcolor{#D85A30}{\text{ReLU}}(\textcolor{#E07B9D}{z}) = 0$ always
- Gradient: $\textcolor{#8CB4D5}{\text{ReLU}'(z)} = 0$ always
- Weight update: $\Delta \textcolor{#E8725C}{W} = -\eta \cdot \textcolor{#8CB4D5}{0} = 0$
The neuron produces no output, receives no gradient, and never updates its weights. It is permanently dead. A large learning rate can push weights into a region where the pre-activation is always negative, killing the neuron. With unlucky initialization, a significant fraction of neurons can start dead.
Concrete example: suppose a neuron has weights $\textcolor{#E8725C}{W} = [-3, -2]$ and bias $\textcolor{#7F77DD}{b} = -1$. For any input $\textcolor{#E07B9D}{x}$ with non-negative components (as is typical after a preceding ReLU), the pre-activation is $-3x_1 - 2x_2 - 1$, which is always negative. The neuron will never fire, will never receive a gradient, and will never learn.
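The dead neuron from this example can be traced in a few lines. The weights and bias come from the text; the input vectors are hypothetical non-negative samples chosen for illustration.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Convention used here: derivative 0 at z == 0.
    return 1.0 if z > 0 else 0.0

W = [-3.0, -2.0]
b = -1.0

# Hypothetical non-negative inputs, as would come out of a preceding ReLU.
inputs = [[0.0, 0.0], [0.5, 0.1], [1.0, 2.0], [10.0, 0.0]]

pre_acts = [W[0] * x[0] + W[1] * x[1] + b for x in inputs]
outputs = [relu(z) for z in pre_acts]
local_grads = [relu_grad(z) for z in pre_acts]

for x, z, out, g in zip(inputs, pre_acts, outputs, local_grads):
    # The gradient for each weight is upstream * relu_grad(z) * x_i,
    # so it is zero whenever relu_grad(z) is zero.
    print(x, z, out, g)
```

Every row shows a negative pre-activation, a zero output, and a zero local gradient, so no update ever reaches `W` or `b`.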
Leaky ReLU and PReLU
$\textcolor{#E07B9D}{\text{LeakyReLU}}\left(\begin{bmatrix} -1 & 2 \\\\ 0.5 & -3 \end{bmatrix}\right) = \begin{bmatrix} -0.01 & 2 \\\\ 0.5 & -0.03 \end{bmatrix}$
Leaky ReLU fixes the dead neuron problem with a small slope for negative inputs:
$\textcolor{#E07B9D}{\text{LeakyReLU}}(\textcolor{#E07B9D}{x}) = \begin{cases} \textcolor{#E07B9D}{x} & \text{if } \textcolor{#E07B9D}{x} > 0 \\\\ \textcolor{#E07B9D}{\alpha}\textcolor{#E07B9D}{x} & \text{if } \textcolor{#E07B9D}{x} \le 0 \end{cases}$
| $\textcolor{#E07B9D}{\alpha}$ | The negative slope. Typically $0.01$. Small enough that the function is "nearly" ReLU, but nonzero so gradients flow. |
The derivative
$\textcolor{#8CB4D5}{\text{LeakyReLU}'(x)} = \begin{cases} \textcolor{#E07B9D}{1} & \text{if } \textcolor{#E07B9D}{x} > 0 \\\\ \textcolor{#E07B9D}{\alpha} & \text{if } \textcolor{#E07B9D}{x} \le 0 \end{cases}$
The gradient is never zero. In the positive regime it is 1 (same as ReLU). In the negative regime it is $\textcolor{#E07B9D}{\alpha}$, typically 0.01.
Concrete comparison
Using the reference inputs with $\textcolor{#E07B9D}{\alpha} = 0.01$:
| $\textcolor{#E07B9D}{x}$ | $\textcolor{#D85A30}{\text{ReLU}}$ | $\textcolor{#E07B9D}{\text{LeakyReLU}}$ | $\textcolor{#8CB4D5}{\text{Gradient}}$ | |
|---|---|---|---|---|
| $-1$ | $0$ | $-0.01$ | $0$ vs $0.01$ | ReLU dead, Leaky still learning |
| $2$ | $2$ | $2$ | $1$ vs $1$ | Identical in positive regime |
| $0.5$ | $0.5$ | $0.5$ | $1$ vs $1$ | Identical in positive regime |
| $-3$ | $0$ | $-0.03$ | $0$ vs $0.01$ | ReLU dead, Leaky still learning |
The difference is stark for negative inputs. ReLU outputs zero and receives zero gradient. Leaky ReLU produces a tiny output and receives a small gradient. $0.01$ is small, but it is infinitely larger than $0$.
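A minimal sketch of Leaky ReLU and its derivative, using the same reference inputs. The default `alpha=0.01` follows the typical value quoted above; the function names are chosen for this example.

```python
def leaky_relu(x, alpha=0.01):
    """Pass positive inputs through; scale non-positive inputs by alpha."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Slope is 1 for positive inputs and alpha otherwise."""
    return 1.0 if x > 0 else alpha

for x in (-1.0, 2.0, 0.5, -3.0):
    # Columns: input, plain ReLU, Leaky ReLU, Leaky ReLU gradient.
    print(x, max(0.0, x), leaky_relu(x), leaky_relu_grad(x))
```

Treating `alpha` as a trainable value instead of a constant is what turns this into PReLU.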
PReLU (Parametric ReLU) makes $\textcolor{#E07B9D}{\alpha}$ a learnable parameter. The network decides for itself how much of the negative signal to let through. In practice, the learned $\textcolor{#E07B9D}{\alpha}$ values are often small (on the order of 0.01 to 0.25), suggesting that a small leak is usually enough.
GELU
$\textcolor{#1D9E75}{\text{GELU}}\left(\begin{bmatrix} -1 & 2 \\\\ 0.5 & -3 \end{bmatrix}\right) = \begin{bmatrix} -0.16 & 1.95 \\\\ 0.35 & 0.00 \end{bmatrix}$
What if instead of a hard cutoff at zero (ReLU), we used a soft, probabilistic gate? That is the core idea behind GELU (Gaussian Error Linear Unit).
The Gaussian CDF
Start with $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$, the cumulative distribution function (CDF) of the standard normal distribution. This is an S-shaped curve from 0 to 1, centered at 0. It answers the question: "If I draw a random number from a standard Gaussian, what is the probability it is less than $\textcolor{#E07B9D}{x}$?" The dotted curve below is the familiar bell curve (the PDF $\phi(x)$). The solid green curve is the CDF: the running integral of that bell curve, accumulating probability from left to right. Drag to see both values at any point.
| $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$ | Gaussian CDF. $\textcolor{#1D9E75}{\Phi}(-\infty) = 0$, $\textcolor{#1D9E75}{\Phi}(0) = 0.5$, $\textcolor{#1D9E75}{\Phi}(\infty) = 1$. Smooth S-curve, similar to sigmoid but with a different functional form. |
When $\textcolor{#E07B9D}{x}$ is large and positive, $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x}) \approx 1$ (almost certainly less than $x$). When $\textcolor{#E07B9D}{x}$ is large and negative, $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x}) \approx 0$.
The GELU formula
GELU uses the CDF as a soft gate: scale $\textcolor{#E07B9D}{x}$ by $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$, which is near 1 for large positive inputs (let them through) and near 0 for large negative inputs (suppress them):
$\textcolor{#1D9E75}{\text{GELU}}(\textcolor{#E07B9D}{x}) = \textcolor{#E07B9D}{x} \cdot \textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$
| $\textcolor{#E07B9D}{x}$ | The input. |
| $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$ | The gate: a smooth value between 0 and 1 that determines how much of $\textcolor{#E07B9D}{x}$ passes through. |
| $\textcolor{#1D9E75}{\text{GELU}}(\textcolor{#E07B9D}{x})$ | The output: the input scaled by its Gaussian gate. |
The intuition: for large positive $\textcolor{#E07B9D}{x}$, the gate $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x}) \approx 1$, so the input passes through nearly unchanged (like ReLU). For large negative $\textcolor{#E07B9D}{x}$, the gate $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x}) \approx 0$, so the output is nearly zero (also like ReLU). But near zero, the transition is smooth rather than a hard kink, and small negative values can pass through with reduced magnitude.
The tanh approximation
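The Gaussian CDF has no closed form in elementary functions. It can be evaluated through the error function, $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x}) = \frac{1}{2}\left[1 + \text{erf}\left(\textcolor{#E07B9D}{x}/\sqrt{2}\right)\right]$, but many implementations use a cheaper tanh approximation instead: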
$\textcolor{#1D9E75}{\text{GELU}}(\textcolor{#E07B9D}{x}) \approx 0.5\textcolor{#E07B9D}{x}\left(1 + \textcolor{#7F77DD}{\tanh}\left[\textcolor{#D85A30}{\sqrt{\frac{2}{\pi}}}\left(\textcolor{#E07B9D}{x} + \textcolor{#18B8C8}{0.044715}\textcolor{#E07B9D}{x}^3\right)\right]\right)$
| $\textcolor{#D85A30}{\sqrt{2/\pi}}$ | $\approx 0.7979$. A scaling constant chosen so that the tanh-based gate has the same slope at the origin as the Gaussian CDF. |
| $\textcolor{#18B8C8}{0.044715}$ | A fitted cubic correction that improves the approximation accuracy. |
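To get a feel for how close the approximation is, the sketch below compares it with a reference value computed from Python's `math.erf`, which gives the Gaussian CDF to floating-point accuracy. The function names are chosen for this example.

```python
import math

def phi_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_exact(x):
    """GELU as x * Phi(x)."""
    return x * phi_cdf(x)

def gelu_tanh(x):
    """Tanh approximation with the sqrt(2/pi) scale and 0.044715 cubic term."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

for x in (-3.0, -1.0, 0.5, 2.0):
    print(x, gelu_exact(x), gelu_tanh(x), abs(gelu_exact(x) - gelu_tanh(x)))
```

On these inputs the two versions differ by less than one part in a thousand in absolute terms, which is why the approximation is usually considered good enough for training.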
Concrete values
Using the reference inputs:
| $\textcolor{#E07B9D}{x}$ | $\textcolor{#1D9E75}{\Phi(x)}$ | $\textcolor{#1D9E75}{\text{GELU}(x)}$ | $\textcolor{#D85A30}{\text{ReLU}(x)}$ | |
|---|---|---|---|---|
| $-1$ | $0.159$ | $-0.16$ | $0$ | GELU lets a small negative through; ReLU blocks entirely |
| $2$ | $0.977$ | $1.95$ | $2$ | Nearly identical in the positive regime |
| $0.5$ | $0.691$ | $0.35$ | $0.5$ | Soft gating scales down small positives |
| $-3$ | $0.001$ | $-0.004$ | $0$ | Gate nearly shut, both close to zero |
The key difference from ReLU shows at $x = -1$: ReLU gives exactly $0$, while GELU gives $-0.16$. The gate closes gradually instead of snapping shut at zero.
The derivative
The derivative of $\textcolor{#1D9E75}{\text{GELU}}(\textcolor{#E07B9D}{x}) = \textcolor{#E07B9D}{x} \cdot \textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$ uses the product rule:
$\textcolor{#8CB4D5}{\text{GELU}'(x)} = \textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x}) + \textcolor{#E07B9D}{x} \cdot \textcolor{#1D9E75}{\phi}(\textcolor{#E07B9D}{x})$
| $\textcolor{#1D9E75}{\phi}(\textcolor{#E07B9D}{x})$ | The Gaussian PDF (probability density function): $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$. This is the derivative of $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$. |
The derivative is smooth everywhere, unlike ReLU's discontinuity at zero. At $\textcolor{#E07B9D}{x} = 0$: $\textcolor{#8CB4D5}{\text{GELU}'(0)} = 0.5 + 0 \cdot \phi(0) = 0.5$. For large positive $\textcolor{#E07B9D}{x}$, it approaches 1 (like ReLU). For large negative $\textcolor{#E07B9D}{x}$, it approaches 0.
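The product-rule formula can be checked against a numerical derivative. This is a sketch using only the standard library; `finite_diff` is a helper introduced here for the comparison.

```python
import math

def phi_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def phi_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    return x * phi_cdf(x)

def gelu_grad(x):
    """Product rule: Phi(x) + x * phi(x)."""
    return phi_cdf(x) + x * phi_pdf(x)

def finite_diff(f, x, h=1e-5):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x, gelu_grad(x), finite_diff(gelu, x))
```

Two details show up in the output: the derivative is slightly negative at $x = -1$ (the function is still descending into its dip there), and it slightly exceeds 1 at $x = 2$ before settling toward 1 for larger inputs.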
GELU is used in BERT, GPT-2, GPT-3, Vision Transformer (ViT), and many other transformer architectures. It is often reported to slightly outperform ReLU in these models, possibly because the smooth transition interacts better with residual connections and layer normalization than ReLU's hard kink at zero.
Swish / SiLU
$\textcolor{#18B8C8}{\text{Swish}}\left(\begin{bmatrix} -1 & 2 \\\\ 0.5 & -3 \end{bmatrix}\right) = \begin{bmatrix} -0.27 & 1.76 \\\\ 0.31 & -0.14 \end{bmatrix}$
Swish (also called SiLU, Sigmoid Linear Unit) is a self-gated activation:
$\textcolor{#18B8C8}{\text{Swish}}(\textcolor{#E07B9D}{x}) = \textcolor{#E07B9D}{x} \cdot \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})$
| $\textcolor{#E07B9D}{x}$ | The input. |
| $\textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})$ | Sigmoid, acting as a smooth gate between 0 and 1. |
| $\textcolor{#18B8C8}{\text{Swish}}(\textcolor{#E07B9D}{x})$ | The output: input multiplied by its sigmoid gate. |
Notice the structural parallel: GELU = $\textcolor{#E07B9D}{x} \cdot \textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$ and Swish = $\textcolor{#E07B9D}{x} \cdot \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})$. Same idea, different gate function. The Gaussian CDF $\textcolor{#1D9E75}{\Phi}$ and sigmoid $\textcolor{#C9A227}{\sigma}$ are both smooth S-curves from 0 to 1. They differ in shape (sigmoid has heavier tails), but the gating mechanism is identical.
Deriving the derivative
Apply the product rule to $\textcolor{#E07B9D}{x} \cdot \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})$:
$\textcolor{#8CB4D5}{\text{Swish}'(x)} = \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x}) + \textcolor{#E07B9D}{x} \cdot \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})\left(1 - \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})\right)$
| $\textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})$ | The first term: the gate value itself. From the product rule, this is $\frac{d}{dx}[x] \cdot \sigma(x) = 1 \cdot \sigma(x)$. |
| $\textcolor{#E07B9D}{x} \cdot \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})(1 - \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x}))$ | The second term: $x$ times the derivative of sigmoid. From the product rule, this is $x \cdot \frac{d}{dx}[\sigma(x)]$. |
This can also be written as $\textcolor{#8CB4D5}{\text{Swish}'(x)} = \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x}) + \textcolor{#18B8C8}{\text{Swish}}(\textcolor{#E07B9D}{x}) \cdot (1 - \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x}))$, which is cheap to compute once you already have the forward pass values.
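The two forms of the derivative can be compared directly. In the sketch below, `swish_grad_from_forward` (a name invented for this example) takes the cached sigmoid and Swish values as arguments, mimicking a backward pass that reuses forward results.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_grad_product_rule(x):
    """sigma(x) + x * sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def swish_grad_from_forward(s, y):
    """Same derivative from cached values: s = sigma(x), y = swish(x)."""
    return s + y * (1.0 - s)

for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    s, y = sigmoid(x), swish(x)
    print(x, swish_grad_product_rule(x), swish_grad_from_forward(s, y))
```

The second form needs no extra exponential, only values the forward pass already produced.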
Concrete values and the non-monotonic dip
Using the reference inputs:
| $\textcolor{#E07B9D}{x}$ | $\textcolor{#C9A227}{\sigma(x)}$ | $\textcolor{#18B8C8}{\text{Swish}(x)}$ | $\textcolor{#D85A30}{\text{ReLU}(x)}$ | |
|---|---|---|---|---|
| $-1$ | $0.27$ | $-0.27$ | $0$ | Swish dips negative; ReLU is flat at zero |
| $2$ | $0.88$ | $1.76$ | $2$ | Gate nearly open, close to linear |
| $0.5$ | $0.62$ | $0.31$ | $0.5$ | Soft gating scales down small positives |
| $-3$ | $0.05$ | $-0.14$ | $0$ | Gate closing, output recovering toward zero |
Unlike ReLU (which is monotonically non-decreasing), Swish dips below zero for negative inputs, reaching a minimum of about $-0.278$ near $x \approx -1.28$. For more negative values, the sigmoid gate closes faster than $|x|$ grows, so the output rises back toward zero. This non-monotonicity is unusual. Swish and GELU both have this dip, allowing a small negative "bump" before gating back to near zero. Empirically, this seems to help during training, perhaps by providing richer gradient information near the origin.
Swish was popularized by an automated search over candidate activation functions at Google Brain (the same $x \cdot \sigma(x)$ form had been proposed earlier under the name SiLU), which is a satisfying detail: a neural network found a good activation function for neural networks.
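The location and depth of the dip can be found numerically. The sketch below runs a ternary search on the interval $[-5, 0]$, which works because Swish has a single minimum there; the interval and iteration count are choices made for this example.

```python
import math

def swish(x):
    return x / (1.0 + math.exp(-x))

# Ternary search for the minimum of a function with one dip on [lo, hi].
lo, hi = -5.0, 0.0
for _ in range(200):
    m1 = lo + (hi - lo) / 3.0
    m2 = hi - (hi - lo) / 3.0
    if swish(m1) < swish(m2):
        hi = m2  # the minimum lies to the left of m2
    else:
        lo = m1  # the minimum lies to the right of m1

x_min = 0.5 * (lo + hi)
print(x_min, swish(x_min))
```

The same search applied to GELU would locate its shallower dip.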
The vanishing gradient problem
The choice of activation function determines whether gradients survive the trip through many layers. To see why, start from the chain rule.
Chain rule through $L$ layers
Consider a network $\textcolor{#C9A227}{f} = \textcolor{#C9A227}{f_L} \circ \textcolor{#C9A227}{f_{L-1}} \circ \cdots \circ \textcolor{#C9A227}{f_1}$, where each layer $\textcolor{#C9A227}{f_l}$ applies a linear transformation followed by an activation. The gradient of the loss with respect to weights in the first layer is a product of partial derivatives, one per layer:
$\textcolor{#8CB4D5}{\frac{\partial \text{loss}}{\partial W_1}} = \textcolor{#8CB4D5}{\frac{\partial \text{loss}}{\partial \textcolor{#C9A227}{f_L}}} \cdot \textcolor{#8CB4D5}{\frac{\partial \textcolor{#C9A227}{f_L}}{\partial \textcolor{#C9A227}{f_{L-1}}}} \cdot \textcolor{#8CB4D5}{\frac{\partial \textcolor{#C9A227}{f_{L-1}}}{\partial \textcolor{#C9A227}{f_{L-2}}}} \cdots \textcolor{#8CB4D5}{\frac{\partial \textcolor{#C9A227}{f_2}}{\partial \textcolor{#C9A227}{f_1}}} \cdot \textcolor{#8CB4D5}{\frac{\partial \textcolor{#C9A227}{f_1}}{\partial W_1}}$
Each factor $\textcolor{#8CB4D5}{\frac{\partial \textcolor{#C9A227}{f_l}}{\partial \textcolor{#C9A227}{f_{l-1}}}}$ includes the activation function's derivative at that layer. If the activation is sigmoid, the activation part of each factor is $\textcolor{#C9A227}{\sigma'(z_l)} \leq 0.25$. If it is ReLU in the positive regime, it is $\textcolor{#D85A30}{\text{ReLU}'(z_l)} = 1$.
The exponential decay
The gradient reaching layer 1 is proportional to the product of all these activation derivatives, so the per-layer derivative compounds across layers. Here is every function from this post, showing the product of activation derivatives that survives $L$ layers when each layer contributes the typical derivative listed in the second column:
| Activation | Typical $\textcolor{#8CB4D5}{f'(x)}$ | $L = 5$ | $L = 10$ | $L = 20$ | $L = 50$ |
|---|---|---|---|---|---|
| Sigmoid | $\sim 0.20$ | $3.2 \times 10^{-4}$ | $1.0 \times 10^{-7}$ | $1.0 \times 10^{-14}$ | $1.1 \times 10^{-35}$ |
| Tanh | $\sim 0.42$ | $0.013$ | $1.7 \times 10^{-4}$ | $2.9 \times 10^{-8}$ | $1.5 \times 10^{-19}$ |
| ReLU | $1.0$ | $1.0$ | $1.0$ | $1.0$ | $1.0$ |
| Leaky ReLU | $1.0$ | $1.0$ | $1.0$ | $1.0$ | $1.0$ |
| GELU | $1.0$ | $1.0$ | $1.0$ | $1.0$ | $1.0$ |
| Swish | $1.0$ | $1.0$ | $1.0$ | $1.0$ | $1.0$ |
The "typical" column uses realistic derivative values at common pre-activation magnitudes, not theoretical maxima. Sigmoid's max derivative is 0.25 (at $x=0$), but most neurons operate away from zero where it is even lower. Tanh peaks at 1.0 (at $x=0$), but by $|x|=1$ it has already dropped to 0.42, and it falls off rapidly from there. Both saturate, but sigmoid is far worse.
ReLU and Leaky ReLU have a gradient of exactly 1.0 in the positive regime, and GELU and Swish approach 1.0 for large positive inputs, so none of them saturates there. The gradient passes through essentially unchanged regardless of depth. This is not the whole story: the weight matrices $\textcolor{#E8725C}{W_l}$ also appear in the chain and can cause exploding or vanishing gradients on their own. But eliminating the activation-induced decay was the critical breakthrough.
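The entries in the table are just powers of the per-layer derivative. A short sketch that recomputes them, using the typical values from the table as inputs (the dictionary keys are labels chosen for this example):

```python
# Typical per-layer activation derivatives, taken from the table.
typical = {"sigmoid": 0.20, "tanh": 0.42, "relu (positive regime)": 1.0}
depths = (5, 10, 20, 50)

# Product of L identical per-layer factors is the factor raised to the power L.
products = {name: [d ** L for L in depths] for name, d in typical.items()}

for name, values in products.items():
    print(name, [f"{v:.1e}" for v in values])
```

This ignores the weight matrices entirely, so it isolates the part of the decay that comes from the activation function alone.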
The demo below makes this concrete. Watch how the gradient magnitude changes at each layer as you increase network depth.
With sigmoid, the gradient is multiplied by at most 0.25 at each layer. Ten layers deep, the signal reaching the first layer is essentially zero. With ReLU, the gradient passes through at full strength for positive activations. This is the single biggest reason ReLU replaced sigmoid in deep networks.
How to choose
| ReLU | Default for CNNs and most feedforward networks. Simple, fast, works. Use unless you have a reason not to. |
| GELU | Default for transformers. Used in BERT, GPT, ViT. Slightly better than ReLU for attention-based architectures. |
| Swish/SiLU | Alternative to GELU with similar properties. Used in EfficientNet and some diffusion models. |
| Leaky ReLU | When you observe dead neurons with ReLU. Common in GANs. |
| Sigmoid | Output layer for binary classification. Gates in LSTMs. Not for hidden layers. |
| Tanh | Candidate and output activations inside LSTMs/GRUs (the gates themselves use sigmoid). Occasionally in small RNNs. Not for deep hidden layers. |
The trend is clear: older activations (sigmoid, tanh) saturate and kill gradients. Modern activations (ReLU, GELU, Swish) avoid saturation in the positive regime. The remaining differences are about smoothness, the behavior near zero, and how well they interact with normalization and residual connections.
```python
import torch.nn as nn

# CNNs
cnn_block = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(),  # ReLU for conv layers
    nn.Conv2d(64, 128, 3, padding=1),
    nn.ReLU(),
)

# Transformers
transformer_ffn = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),  # GELU for transformer FFN
    nn.Linear(3072, 768),
)
```