Why do neural networks need activation functions?

Without activation functions, every layer in a neural network performs a linear transformation (matrix multiply + bias). Composing linear functions produces another linear function, so no matter how many layers you stack, the network can only learn linear relationships. Activation functions introduce nonlinearity between layers, allowing the network to approximate arbitrarily complex functions.

What is the vanishing gradient problem?

The vanishing gradient problem occurs when gradients shrink exponentially as they propagate backward through many layers. Sigmoid's maximum derivative is 0.25, so after 10 layers the gradient can shrink by a factor of 0.25^10 (about 0.000001). This makes early layers learn extremely slowly or not at all. ReLU avoids this because its gradient is 1 for positive inputs, allowing gradients to flow unchanged.

What is the difference between ReLU and GELU?

ReLU (Rectified Linear Unit) outputs max(0, x): it passes positive values unchanged and zeroes out negatives. GELU (Gaussian Error Linear Unit) is a smooth approximation that allows small negative values through, weighted by the probability that a Gaussian random variable would be less than x. GELU is the default in transformer architectures (BERT, GPT) while ReLU remains the default for CNNs.

What is a dead neuron in ReLU?

A dead neuron is a ReLU unit whose input is always negative, producing zero output and zero gradient. Once dead, the neuron stops learning because no gradient flows through it. This can happen when a large gradient update pushes the weights so that the pre-activation is always negative for every training example. Leaky ReLU and PReLU fix this by allowing a small nonzero gradient for negative inputs.

Which activation function should I use?

ReLU is the safe default for convolutional networks and most feedforward architectures. GELU is the standard for transformer models (BERT, GPT, ViT). Sigmoid and tanh are used for gating mechanisms in RNNs and LSTMs, not as hidden-layer activations. Swish/SiLU is an alternative to GELU with similar properties. When in doubt, start with ReLU and switch to GELU if you're building a transformer.

Activation Functions

Why nonlinearity

A neural network layer does two things: multiply by a weight matrix and add a bias. That is a linear transformation. The question is: what happens when you stack two of them?

Layer 1 produces $\textcolor{#E8725C}{W_1}\textcolor{#E07B9D}{x} + \textcolor{#7F77DD}{b_1}$. Layer 2 takes that as input:

$\textcolor{#E8725C}{W_2}\left(\textcolor{#E8725C}{W_1}\textcolor{#E07B9D}{x} + \textcolor{#7F77DD}{b_1}\right) + \textcolor{#7F77DD}{b_2} = \underbrace{\textcolor{#E8725C}{W_2}\textcolor{#E8725C}{W_1}}_{\text{one matrix}}\textcolor{#E07B9D}{x} + \underbrace{\textcolor{#E8725C}{W_2}\textcolor{#7F77DD}{b_1} + \textcolor{#7F77DD}{b_2}}_{\text{one bias}}$

$\textcolor{#E8725C}{W_1}, \textcolor{#E8725C}{W_2}$	Weight matrices for layers 1 and 2. Each layer has its own.
$\textcolor{#7F77DD}{b_1}, \textcolor{#7F77DD}{b_2}$	Bias vectors for each layer.
$\textcolor{#E07B9D}{x}$	The input vector to the network.

Two layers collapsed into one. This generalizes to any number of layers: $\textcolor{#E8725C}{W_L} \cdots \textcolor{#E8725C}{W_2}\textcolor{#E8725C}{W_1}$ is always a single matrix. Depth without nonlinearity is an illusion.
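A quick numerical check of the collapse, as a minimal sketch in plain Python. The matrices, biases, input, and helper names (`matvec`, `matmul`, `add`) are made up for illustration; any values would do.

```python
# Two stacked linear layers versus one merged linear layer.
# All numbers and helper names here are illustrative assumptions.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

W1, b1 = [[1.0, -2.0], [0.5, 3.0]], [0.1, -0.2]
W2, b2 = [[2.0, 1.0], [-1.0, 4.0]], [0.3, 0.0]
x = [0.7, -1.5]

# Layer 1 followed by layer 2, with no activation in between.
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# A single layer using the merged matrix W2 W1 and merged bias W2 b1 + b2.
W_merged = matmul(W2, W1)
b_merged = add(matvec(W2, b1), b2)
collapsed = add(matvec(W_merged, x), b_merged)

print(stacked)
print(collapsed)
```

Both printed vectors should agree up to floating-point rounding, which is the algebra above carried out on concrete numbers.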

Now insert an activation function $\textcolor{#C9A227}{f}$ between the layers. Layer 1 produces $\textcolor{#C9A227}{f}(\textcolor{#E8725C}{W_1}\textcolor{#E07B9D}{x} + \textcolor{#7F77DD}{b_1})$, and layer 2 computes $\textcolor{#E8725C}{W_2} \cdot \textcolor{#C9A227}{f}(\textcolor{#E8725C}{W_1}\textcolor{#E07B9D}{x} + \textcolor{#7F77DD}{b_1}) + \textcolor{#7F77DD}{b_2}$. The function $\textcolor{#C9A227}{f}$ sits inside the composition, so the matrices cannot be multiplied together. The output is no longer a linear function of the input. The demo below shows a classification problem that no linear model can solve. With a nonlinear activation function between layers, the network bends the decision boundary to separate the classes. Without it, you get a straight line.

The function explorer

Before diving into each activation function individually, here is an interactive explorer. Switch between functions, drag along the curve to see the output and gradient at each point. The dashed line is the derivative.

Sigmoid

$\textcolor{#C9A227}{\sigma}\left(\begin{bmatrix} -1 & 2 \\ 0.5 & -3 \end{bmatrix}\right) = \begin{bmatrix} 0.27 & 0.88 \\ 0.62 & 0.05 \end{bmatrix}$

We want a function that squashes any real number into the range $(0, 1)$. One natural path: start with odds. If the probability of an event is $p$, the odds are $\frac{p}{1-p}$. The log-odds (or logit) is $\log\frac{p}{1-p}$, which ranges over all real numbers. Inverting that relationship (solving for $p$) gives the logistic function:

$\textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x}) = \frac{1}{1 + \textcolor{#7F77DD}{e}^{-\textcolor{#E07B9D}{x}}}$

$\textcolor{#E07B9D}{x}$	The input (pre-activation). Can be any real number from $-\infty$ to $+\infty$.
$\textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})$	The output. Always in $(0, 1)$. Behaves like a probability: 0 means "off," 1 means "on."
$\textcolor{#7F77DD}{e}^{-\textcolor{#E07B9D}{x}}$	The exponential decay term. When $\textcolor{#E07B9D}{x}$ is large and positive, this is near 0, so the output is near 1. When $\textcolor{#E07B9D}{x}$ is large and negative, this dominates, pushing the output toward 0.
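To see that the logistic function really is the inverse of the log-odds, here is a small sketch using only the standard library. The function names `logit` and `sigmoid` and the sample probabilities are choices made for this example.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Round trip: probability -> log-odds -> probability.
for p in (0.05, 0.27, 0.5, 0.88):
    print(p, round(logit(p), 3), round(sigmoid(logit(p)), 6))

# The four inputs from the matrix at the top of this section.
preview = [round(sigmoid(x), 2) for x in (-1, 2, 0.5, -3)]
print(preview)
```

This direct form is fine for moderate inputs. For very large negative `x`, `math.exp(-x)` can overflow, so production code usually branches on the sign of `x`.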

Deriving the derivative

The derivative tells us how sensitive the output is to small changes in input. We need $\textcolor{#8CB4D5}{\sigma'(x)}$, and the derivation is a clean application of the quotient rule.

Write $\textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x}) = \frac{\textcolor{#1D9E75}{1}}{\textcolor{#D85A30}{1 + e^{-x}}}$. The quotient rule says $\left(\frac{u}{v}\right)' = \frac{u'v - uv'}{v^2}$:

$\textcolor{#8CB4D5}{\sigma'(x)} = \frac{\textcolor{#1D9E75}{0} \cdot \textcolor{#D85A30}{(1+e^{-x})} - \textcolor{#1D9E75}{1} \cdot \textcolor{#D85A30}{(-e^{-x})}}{\textcolor{#D85A30}{(1+e^{-x})^2}} = \frac{e^{-x}}{(1+e^{-x})^2}$

Now notice that $\frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}}$. The first factor is $\textcolor{#C9A227}{\sigma(x)}$, and the second is $\frac{(1+e^{-x}) - 1}{1+e^{-x}} = 1 - \textcolor{#C9A227}{\sigma(x)}$. So:

$\textcolor{#8CB4D5}{\sigma'(x)} = \textcolor{#C9A227}{\sigma(x)}\left(1 - \textcolor{#C9A227}{\sigma(x)}\right)$

$\textcolor{#8CB4D5}{\sigma'(x)}$	The derivative (gradient) of sigmoid. Tells you how much the output changes per unit change in input.
$\textcolor{#C9A227}{\sigma(x)}(1 - \textcolor{#C9A227}{\sigma(x)})$	A product of the output and its complement. Maximum when $\textcolor{#C9A227}{\sigma(x)} = 0.5$ (i.e., when $\textcolor{#E07B9D}{x} = 0$), which gives $0.5 \times 0.5 = 0.25$.

This elegant form is why sigmoid was popular: the derivative is trivially cheap to compute once you already have the output. But it also reveals the problem. The maximum value is 0.25, and it drops off sharply as you move away from zero.
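The closed form can be checked against a numerical derivative. This is a sketch, not a benchmark: `central_difference` is a helper written for this example, and the step size `h` is an arbitrary small value.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Closed form from the derivation: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-5):
    """Numerical derivative, used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x, round(sigmoid_grad(x), 3), round(central_difference(sigmoid, x), 3))
```

The two columns should match to the printed precision, with the largest value at `x = 0`.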

Concrete values

Using the same inputs from the section preview:

$\textcolor{#E07B9D}{x}$	$\textcolor{#C9A227}{\sigma(x)}$	$\textcolor{#8CB4D5}{\sigma'(x)}$	Notes
$-1$	$0.27$	$0.197$	Moderate input, gradient still reasonable
$2$	$0.88$	$0.105$	Approaching saturation, gradient below half its peak
$0.5$	$0.62$	$0.235$	Near zero, close to peak gradient of $0.25$
$-3$	$0.05$	$0.045$	Deep in saturation, gradient nearly gone

The derivative peaks at $0.25$ (when $\textcolor{#E07B9D}{x} = 0$) and drops off sharply in both directions. By $\textcolor{#E07B9D}{x} = -3$, the gradient is $0.045$: the function has saturated. Stack 10 layers of this and the gradient shrinks by $0.25^{10} \approx 10^{-6}$. This is the vanishing gradient problem.

The problems with sigmoid in hidden layers:

Vanishing gradients. The derivative peaks at 0.25, not 1. Every layer multiplies the gradient by at most 0.25. After 10 layers, the gradient shrinks by $0.25^{10} \approx 10^{-6}$. Early layers barely learn.
Not zero-centered. Sigmoid outputs are always positive (between 0 and 1). This means the gradients for the weights in the next layer are either all positive or all negative, forcing zig-zag updates during optimization.
Expensive. Computing $e^{-x}$ is slower than the single comparison that ReLU needs.

Tanh

$\textcolor{#7F77DD}{\tanh}\left(\begin{bmatrix} -1 & 2 \\ 0.5 & -3 \end{bmatrix}\right) = \begin{bmatrix} -0.76 & 0.96 \\ 0.46 & -1.00 \end{bmatrix}$

Tanh is sigmoid's zero-centered cousin. It squashes to $(-1, 1)$ instead of $(0, 1)$:

$\textcolor{#7F77DD}{\tanh}(\textcolor{#E07B9D}{x}) = \frac{\textcolor{#7F77DD}{e}^{\textcolor{#E07B9D}{x}} - \textcolor{#7F77DD}{e}^{-\textcolor{#E07B9D}{x}}}{\textcolor{#7F77DD}{e}^{\textcolor{#E07B9D}{x}} + \textcolor{#7F77DD}{e}^{-\textcolor{#E07B9D}{x}}}$

$\textcolor{#E07B9D}{x}$	The input (pre-activation).
$\textcolor{#7F77DD}{\tanh}(\textcolor{#E07B9D}{x})$	The output. Always in $(-1, 1)$. Zero-centered, unlike sigmoid.

Connection to sigmoid

Tanh and sigmoid are not just similar; one is a linear transformation of the other. Starting from $2\textcolor{#C9A227}{\sigma}(2\textcolor{#E07B9D}{x}) - 1$:

$2\textcolor{#C9A227}{\sigma}(2\textcolor{#E07B9D}{x}) - 1 = \frac{2}{1 + e^{-2x}} - 1 = \frac{2 - (1 + e^{-2x})}{1 + e^{-2x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}}$

Multiply numerator and denominator by $e^{x}$:

$\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \textcolor{#7F77DD}{\tanh}(\textcolor{#E07B9D}{x})$

So $\textcolor{#7F77DD}{\tanh}(\textcolor{#E07B9D}{x}) = 2\textcolor{#C9A227}{\sigma}(2\textcolor{#E07B9D}{x}) - 1$. Tanh is a rescaled, shifted sigmoid.
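The identity is easy to confirm numerically. A minimal sketch, with `tanh_via_sigmoid` being a name invented for this example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh written as a rescaled, shifted sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x, round(math.tanh(x), 6), round(tanh_via_sigmoid(x), 6))
```

The library `math.tanh` and the sigmoid-based version should agree up to rounding error at every input.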

Deriving the derivative

Using the quotient rule on $\frac{e^x - e^{-x}}{e^x + e^{-x}}$, or more quickly, using $\frac{d}{dx}\tanh(x) = \text{sech}^2(x)$ together with the identity $\tanh^2(x) + \text{sech}^2(x) = 1$:

$\textcolor{#8CB4D5}{\tanh'(x)} = 1 - \textcolor{#7F77DD}{\tanh}^2(\textcolor{#E07B9D}{x})$

$\textcolor{#8CB4D5}{\tanh'(x)}$	The derivative of tanh. Peaks at 1.0 when $\textcolor{#E07B9D}{x} = 0$ (four times stronger than sigmoid's 0.25).
$1 - \textcolor{#7F77DD}{\tanh}^2(\textcolor{#E07B9D}{x})$	One minus the square of the output. Maximum when $\textcolor{#7F77DD}{\tanh(x)} = 0$ (at the origin), minimum when $|\textcolor{#7F77DD}{\tanh(x)}| \to 1$ (in the tails).

Concrete values

Using the same reference inputs:

$\textcolor{#E07B9D}{x}$	$\textcolor{#7F77DD}{\tanh(x)}$	$\textcolor{#8CB4D5}{\tanh'(x)}$	Notes
$-1$	$-0.76$	$0.420$	Moderate input, gradient still healthy
$2$	$0.96$	$0.071$	Near saturation, gradient dropping fast
$0.5$	$0.46$	$0.786$	Near origin, gradient close to peak of $1.0$
$-3$	$-1.00$	$0.010$	Fully saturated, gradient nearly zero

Comparison with sigmoid

At the origin, tanh has a gradient of $\textcolor{#8CB4D5}{1.0}$, four times stronger than sigmoid's $\textcolor{#8CB4D5}{0.25}$. This means gradients flow more freely near the center. But tanh still saturates: by $|\textcolor{#E07B9D}{x}| = 2$, the gradient has dropped to 0.071. By $|\textcolor{#E07B9D}{x}| = 5$, it is below 0.001. The tails are at least as flat as sigmoid's: tanh's gradient decays like $e^{-2|x|}$, while sigmoid's decays like $e^{-|x|}$.
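A side-by-side look at the two derivatives, as a small sketch. The sample points are picked to match the ones quoted in the text.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 1.0, 2.0, 5.0):
    print(f"x={x}: sigmoid'={sigmoid_grad(x):.4f}  tanh'={tanh_grad(x):.4f}")
```

Near the origin tanh has the larger gradient, but far out in the tail its gradient is the smaller of the two, so the advantage only holds for inputs that stay close to zero.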

Tanh is still used as the activation in LSTM and GRU cells (where the output needs to be in $(-1, 1)$), and occasionally in the hidden layers of small networks. For deep networks, it has the same fundamental issue as sigmoid: saturating tails kill the gradient.

ReLU

$\textcolor{#D85A30}{\text{ReLU}}\left(\begin{bmatrix} -1 & 2 \\ 0.5 & -3 \end{bmatrix}\right) = \begin{bmatrix} 0 & 2 \\ 0.5 & 0 \end{bmatrix}$

ReLU (Rectified Linear Unit) is the function that made deep learning practical:

$\textcolor{#D85A30}{\text{ReLU}}(\textcolor{#E07B9D}{x}) = \max(0, \textcolor{#E07B9D}{x})$

$\textcolor{#E07B9D}{x}$	The input (pre-activation).
$\textcolor{#D85A30}{\text{ReLU}}(\textcolor{#E07B9D}{x})$	The output. Equal to $\textcolor{#E07B9D}{x}$ when positive, 0 when negative. No upper bound.

Dead simple. If the input is positive, pass it through. If it is negative, output zero.

The derivative is a step function

$\textcolor{#8CB4D5}{\text{ReLU}'(x)} = \begin{cases} \textcolor{#D85A30}{1} & \text{if } \textcolor{#E07B9D}{x} > 0 \\ \textcolor{#D85A30}{0} & \text{if } \textcolor{#E07B9D}{x} < 0 \end{cases}$

$\textcolor{#8CB4D5}{\text{ReLU}'(x)}$	The derivative. Either 1 or 0. No saturation in the positive regime, no expensive computation.
$\textcolor{#E07B9D}{x} = 0$	Technically undefined (the kink). In practice, frameworks like PyTorch define it as 0. The choice doesn't affect training because with continuous-valued inputs, landing on exactly $0.000...$ has probability zero.

With zero-centered inputs, roughly half the pre-activations are positive (gradient $= 1$) and half are negative (gradient $= 0$). The expected gradient across the layer is $\frac{1}{2}$, and zeroing the negative half also roughly halves the mean square of the signal at each layer. This has direct consequences for weight initialization.
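A quick simulation of that halving, as a sketch under one loud assumption: the pre-activations are independent standard normal draws, which real layers only approximate. The sample size and seed are arbitrary.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def relu(z):
    return max(0.0, z)

# Assumption for this sketch: pre-activations are standard normal.
zs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Fraction of units with gradient 1 (positive pre-activation).
frac_active = sum(z > 0 for z in zs) / len(zs)

# Mean square of the signal before and after ReLU.
second_moment_in = sum(z * z for z in zs) / len(zs)
second_moment_out = sum(relu(z) ** 2 for z in zs) / len(zs)
ratio = second_moment_out / second_moment_in

print(frac_active, ratio)
```

Both printed numbers should land close to 0.5. The second one is the quantity that initialization schemes designed for ReLU compensate for.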

Why this fixes vanishing gradients

The key insight: in the positive regime, the gradient is exactly $\textcolor{#D85A30}{1}$. No matter how deep the network, the gradient passes through unchanged. Let's trace the gradient through a 5-layer network where each layer computes $\textcolor{#E07B9D}{z_l} = \textcolor{#E8725C}{W_l} \cdot \textcolor{#C9A227}{f_{l-1}} + \textcolor{#7F77DD}{b_l}$ (the pre-activation), then applies the activation: $\textcolor{#C9A227}{f_l} = \textcolor{#D85A30}{\text{ReLU}}(\textcolor{#E07B9D}{z_l})$. By the chain rule, the gradient of the loss with respect to a parameter in layer 1 includes the product of activation derivatives at every layer:

$\textcolor{#8CB4D5}{\frac{\partial \text{loss}}{\partial W_1}} = \textcolor{#8CB4D5}{\frac{\partial \text{loss}}{\partial \textcolor{#C9A227}{f_5}}} \cdot \textcolor{#D85A30}{\text{ReLU}'}(\textcolor{#E07B9D}{z_5}) \cdot \textcolor{#E8725C}{W_5} \cdot \textcolor{#D85A30}{\text{ReLU}'}(\textcolor{#E07B9D}{z_4}) \cdot \textcolor{#E8725C}{W_4} \cdot \textcolor{#D85A30}{\text{ReLU}'}(\textcolor{#E07B9D}{z_3}) \cdot \textcolor{#E8725C}{W_3} \cdot \textcolor{#D85A30}{\text{ReLU}'}(\textcolor{#E07B9D}{z_2}) \cdot \textcolor{#E8725C}{W_2} \cdot \textcolor{#D85A30}{\text{ReLU}'}(\textcolor{#E07B9D}{z_1}) \cdot \textcolor{#E07B9D}{x}$

If every pre-activation $\textcolor{#E07B9D}{z_l}$ is positive, every $\textcolor{#D85A30}{\text{ReLU}'}(\textcolor{#E07B9D}{z_l}) = 1$. The gradient is not attenuated by the activations at all. Compare this to sigmoid, where each $\textcolor{#C9A227}{\sigma'}(\textcolor{#E07B9D}{z_l}) \leq 0.25$: multiplying five of them gives at most $0.25^5 \approx 0.000977$. With ReLU, the five activation derivative factors contribute a product of $1^5 = 1$. This is why deep networks started working.

The dead neuron problem

The catch: if a neuron's pre-activation $\textcolor{#E07B9D}{z} = \textcolor{#E8725C}{W} \cdot \textcolor{#E07B9D}{x} + \textcolor{#7F77DD}{b}$ is negative for every training example, then:

Output: $\textcolor{#D85A30}{\text{ReLU}}(\textcolor{#E07B9D}{z}) = 0$ always
Gradient: $\textcolor{#8CB4D5}{\text{ReLU}'(z)} = 0$ always
Weight update: $\Delta \textcolor{#E8725C}{W} = -\eta \cdot \textcolor{#8CB4D5}{0} = 0$

The neuron produces no output, receives no gradient, and never updates its weights. It is permanently dead. A large learning rate can push weights into a region where the pre-activation is always negative, killing the neuron. With unlucky initialization, a significant fraction of neurons can start dead.

Concrete example: suppose a neuron has weights $\textcolor{#E8725C}{W} = [-3, -2]$ and bias $\textcolor{#7F77DD}{b} = -1$. For any input $\textcolor{#E07B9D}{x}$ with non-negative components (as is typical after a preceding ReLU), the pre-activation is $-3x_1 - 2x_2 - 1$, which is always negative. The neuron will never fire, will never receive a gradient, and will never learn.
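The same neuron in code, as a minimal sketch. The input range, the sample count, and the upstream gradient of 1.0 are assumptions made for illustration; only the weights and bias come from the example above.

```python
import random

random.seed(0)

W = [-3.0, -2.0]
b = -1.0

def pre_activation(x):
    return sum(w * xi for w, xi in zip(W, x)) + b

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

# Hypothetical inputs: non-negative, as they would be after a preceding ReLU.
inputs = [[random.uniform(0.0, 5.0), random.uniform(0.0, 5.0)] for _ in range(1000)]
zs = [pre_activation(x) for x in inputs]

# dL/dW_i = upstream * ReLU'(z) * x_i. The upstream value is an arbitrary stand-in.
upstream = 1.0
grads = [[upstream * relu_grad(z) * xi for xi in x] for z, x in zip(zs, inputs)]

largest_z = max(zs)
any_nonzero_grad = any(g != 0.0 for row in grads for g in row)
print(largest_z, any_nonzero_grad)
```

The largest pre-activation never rises above the bias of -1, so every weight gradient is zero no matter which input is presented.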

Leaky ReLU and PReLU

$\textcolor{#E07B9D}{\text{LeakyReLU}}\left(\begin{bmatrix} -1 & 2 \\ 0.5 & -3 \end{bmatrix}\right) = \begin{bmatrix} -0.01 & 2 \\ 0.5 & -0.03 \end{bmatrix}$

Leaky ReLU fixes the dead neuron problem with a small slope for negative inputs:

$\textcolor{#E07B9D}{\text{LeakyReLU}}(\textcolor{#E07B9D}{x}) = \begin{cases} \textcolor{#E07B9D}{x} & \text{if } \textcolor{#E07B9D}{x} > 0 \\ \textcolor{#E07B9D}{\alpha}\textcolor{#E07B9D}{x} & \text{if } \textcolor{#E07B9D}{x} \le 0 \end{cases}$

$\textcolor{#E07B9D}{\alpha}$

The negative slope. Typically $0.01$. Small enough that the function is "nearly" ReLU, but nonzero so gradients flow.

The derivative

$\textcolor{#8CB4D5}{\text{LeakyReLU}'(x)} = \begin{cases} \textcolor{#E07B9D}{1} & \text{if } \textcolor{#E07B9D}{x} > 0 \\ \textcolor{#E07B9D}{\alpha} & \text{if } \textcolor{#E07B9D}{x} \le 0 \end{cases}$

The gradient is never zero. In the positive regime it is 1 (same as ReLU). In the negative regime it is $\textcolor{#E07B9D}{\alpha}$, typically 0.01.

Concrete comparison

Using the reference inputs with $\textcolor{#E07B9D}{\alpha} = 0.01$:

$\textcolor{#E07B9D}{x}$	$\textcolor{#D85A30}{\text{ReLU}}$	$\textcolor{#E07B9D}{\text{LeakyReLU}}$	$\textcolor{#8CB4D5}{\text{Gradient (ReLU vs Leaky)}}$	Notes
$-1$	$0$	$-0.01$	$0$ vs $0.01$	ReLU dead, Leaky still learning
$2$	$2$	$2$	$1$ vs $1$	Identical in positive regime
$0.5$	$0.5$	$0.5$	$1$ vs $1$	Identical in positive regime
$-3$	$0$	$-0.03$	$0$ vs $0.01$	ReLU dead, Leaky still learning

The difference is stark for negative inputs. ReLU outputs zero and receives zero gradient. Leaky ReLU produces a tiny output and receives a small gradient. $0.01$ is small, but it is infinitely larger than $0$.
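The comparison can be reproduced in a few lines. This is a plain-Python sketch of the two functions and their gradients. The function names are chosen for this example, and the gradient at exactly zero follows the $x \le 0$ branch of the formulas above.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, slope alpha otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

# The four reference inputs used throughout the post.
for x in (-1.0, 2.0, 0.5, -3.0):
    print(x, relu(x), leaky_relu(x), relu_grad(x), leaky_relu_grad(x))
```

For the two negative inputs, `relu_grad` returns 0.0 while `leaky_relu_grad` returns `alpha`, so the weight update is small but not zero.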

PReLU (Parametric ReLU) makes $\textcolor{#E07B9D}{\alpha}$ a learnable parameter. The network decides for itself how much of the negative signal to let through. In practice, the learned $\textcolor{#E07B9D}{\alpha}$ values tend to be small (0.01 to 0.25), confirming that a small leak is all you need.

GELU

$\textcolor{#1D9E75}{\text{GELU}}\left(\begin{bmatrix} -1 & 2 \\ 0.5 & -3 \end{bmatrix}\right) = \begin{bmatrix} -0.16 & 1.95 \\ 0.35 & 0.00 \end{bmatrix}$

What if instead of a hard cutoff at zero (ReLU), we used a soft, probabilistic gate? That is the core idea behind GELU (Gaussian Error Linear Unit).

The Gaussian CDF

Start with $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$, the cumulative distribution function (CDF) of the standard normal distribution. This is an S-shaped curve from 0 to 1, centered at 0. It answers the question: "If I draw a random number from a standard Gaussian, what is the probability it is less than $\textcolor{#E07B9D}{x}$?" The dotted curve below is the familiar bell curve (the PDF $\phi(x)$). The solid green curve is the CDF: the running integral of that bell curve, accumulating probability from left to right. Drag to see both values at any point.

$\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$

Gaussian CDF. $\textcolor{#1D9E75}{\Phi}(-\infty) = 0$, $\textcolor{#1D9E75}{\Phi}(0) = 0.5$, $\textcolor{#1D9E75}{\Phi}(\infty) = 1$. Smooth S-curve, similar to sigmoid but with a different functional form.

When $\textcolor{#E07B9D}{x}$ is large and positive, $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x}) \approx 1$ (almost certainly less than $x$). When $\textcolor{#E07B9D}{x}$ is large and negative, $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x}) \approx 0$.

The GELU formula

GELU uses the CDF as a soft gate: scale $\textcolor{#E07B9D}{x}$ by $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$, which is near 1 for large positive inputs (let them through) and near 0 for large negative inputs (suppress them):

$\textcolor{#1D9E75}{\text{GELU}}(\textcolor{#E07B9D}{x}) = \textcolor{#E07B9D}{x} \cdot \textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$

$\textcolor{#E07B9D}{x}$	The input.
$\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$	The gate: a smooth value between 0 and 1 that determines how much of $\textcolor{#E07B9D}{x}$ passes through.
$\textcolor{#1D9E75}{\text{GELU}}(\textcolor{#E07B9D}{x})$	The output: the input scaled by its Gaussian gate.

The intuition: for large positive $\textcolor{#E07B9D}{x}$, the gate $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x}) \approx 1$, so the input passes through nearly unchanged (like ReLU). For large negative $\textcolor{#E07B9D}{x}$, the gate $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x}) \approx 0$, so the output is nearly zero (also like ReLU). But near zero, the transition is smooth rather than a hard kink, and small negative values can pass through with reduced magnitude.

The tanh approximation

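The Gaussian CDF has no closed form in elementary functions. It can be evaluated exactly through the error function, $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x}) = \frac{1}{2}\left[1 + \text{erf}\left(\textcolor{#E07B9D}{x}/\sqrt{2}\right)\right]$, but many implementations instead use a cheaper tanh approximation: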

$\textcolor{#1D9E75}{\text{GELU}}(\textcolor{#E07B9D}{x}) \approx 0.5\textcolor{#E07B9D}{x}\left(1 + \textcolor{#7F77DD}{\tanh}\left[\textcolor{#D85A30}{\sqrt{\frac{2}{\pi}}}\left(\textcolor{#E07B9D}{x} + \textcolor{#18B8C8}{0.044715}\textcolor{#E07B9D}{x}^3\right)\right]\right)$

$\textcolor{#D85A30}{\sqrt{2/\pi}}$	$\approx 0.7979$. A scaling constant that maps the Gaussian CDF onto the tanh range.
$\textcolor{#18B8C8}{0.044715}$	A fitted cubic correction that improves the approximation accuracy.
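To get a feel for how close the approximation is, here is a sketch that compares it with the exact form computed through `math.erf`. The function names are chosen for this example, and the four test points are the reference inputs used throughout the post.

```python
import math

def gaussian_cdf(x):
    """Standard normal CDF written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_exact(x):
    return x * gaussian_cdf(x)

def gelu_tanh(x):
    """Tanh approximation with the constants sqrt(2/pi) and 0.044715."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-3.0, -1.0, 0.5, 2.0):
    exact, approx = gelu_exact(x), gelu_tanh(x)
    print(x, round(exact, 4), round(approx, 4), f"{abs(exact - approx):.1e}")
```

At these inputs the two versions differ by less than one part in a thousand in absolute terms, which is why the approximation is usually considered interchangeable with the exact form.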

Concrete values

Using the reference inputs:

$\textcolor{#E07B9D}{x}$	$\textcolor{#1D9E75}{\Phi(x)}$	$\textcolor{#1D9E75}{\text{GELU}(x)}$	$\textcolor{#D85A30}{\text{ReLU}(x)}$	Notes
$-1$	$0.159$	$-0.16$	$0$	GELU lets a small negative through; ReLU blocks entirely
$2$	$0.977$	$1.95$	$2$	Nearly identical in the positive regime
$0.5$	$0.691$	$0.35$	$0.5$	Soft gating scales down small positives
$-3$	$0.001$	$-0.004$	$0$	Gate nearly shut, both close to zero

The key difference from ReLU shows at $x = -1$: ReLU gives exactly $0$, while GELU gives $-0.16$. The gate does not fully shut at the boundary.

The derivative

The derivative of $\textcolor{#1D9E75}{\text{GELU}}(\textcolor{#E07B9D}{x}) = \textcolor{#E07B9D}{x} \cdot \textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$ uses the product rule:

$\textcolor{#8CB4D5}{\text{GELU}'(x)} = \textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x}) + \textcolor{#E07B9D}{x} \cdot \textcolor{#1D9E75}{\phi}(\textcolor{#E07B9D}{x})$

$\textcolor{#1D9E75}{\phi}(\textcolor{#E07B9D}{x})$

The Gaussian PDF (probability density function): $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$. This is the derivative of $\textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$.

The derivative is smooth everywhere, unlike ReLU's discontinuity at zero. At $\textcolor{#E07B9D}{x} = 0$: $\textcolor{#8CB4D5}{\text{GELU}'(0)} = 0.5 + 0 \cdot \phi(0) = 0.5$. For large positive $\textcolor{#E07B9D}{x}$, it approaches 1 (like ReLU). For large negative $\textcolor{#E07B9D}{x}$, it approaches 0.
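The product-rule result can be checked against a numerical derivative. A minimal sketch: `central_difference` is a helper written for this example, and the step size is an arbitrary small value.

```python
import math

def gaussian_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x):
    return x * gaussian_cdf(x)

def gelu_grad(x):
    """Product rule: Phi(x) + x * phi(x)."""
    return gaussian_cdf(x) + x * gaussian_pdf(x)

def central_difference(f, x, h=1e-5):
    """Numerical derivative, used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, -1.0, 0.0, 0.5, 2.0, 6.0):
    print(x, round(gelu_grad(x), 4), round(central_difference(gelu, x), 4))
```

Two details show up in the output that the limits alone do not reveal: the derivative is slightly negative around $x = -1$ (the downward slope into the dip), and it rises a little above 1 around $x = 2$ before settling back toward 1.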

GELU is used in BERT, GPT-2, GPT-3, Vision Transformer (ViT), and many other transformer architectures. It slightly outperforms ReLU in these models, likely because the smooth transition handles the residual connections and layer normalization better than ReLU's hard kink at zero.

Swish / SiLU

$\textcolor{#18B8C8}{\text{Swish}}\left(\begin{bmatrix} -1 & 2 \\ 0.5 & -3 \end{bmatrix}\right) = \begin{bmatrix} -0.27 & 1.76 \\ 0.31 & -0.14 \end{bmatrix}$

Swish (also called SiLU, Sigmoid Linear Unit) is a self-gated activation:

$\textcolor{#18B8C8}{\text{Swish}}(\textcolor{#E07B9D}{x}) = \textcolor{#E07B9D}{x} \cdot \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})$

$\textcolor{#E07B9D}{x}$	The input.
$\textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})$	Sigmoid, acting as a smooth gate between 0 and 1.
$\textcolor{#18B8C8}{\text{Swish}}(\textcolor{#E07B9D}{x})$	The output: input multiplied by its sigmoid gate.

Notice the structural parallel: GELU = $\textcolor{#E07B9D}{x} \cdot \textcolor{#1D9E75}{\Phi}(\textcolor{#E07B9D}{x})$ and Swish = $\textcolor{#E07B9D}{x} \cdot \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})$. Same idea, different gate function. The Gaussian CDF $\textcolor{#1D9E75}{\Phi}$ and sigmoid $\textcolor{#C9A227}{\sigma}$ are both smooth S-curves from 0 to 1. They differ in shape (sigmoid has heavier tails), but the gating mechanism is identical.

Deriving the derivative

Apply the product rule to $\textcolor{#E07B9D}{x} \cdot \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})$:

$\textcolor{#8CB4D5}{\text{Swish}'(x)} = \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x}) + \textcolor{#E07B9D}{x} \cdot \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})\left(1 - \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})\right)$

$\textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})$	The first term: the gate value itself. From the product rule, this is $\frac{d}{dx}[x] \cdot \sigma(x) = 1 \cdot \sigma(x)$.
$\textcolor{#E07B9D}{x} \cdot \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x})(1 - \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x}))$	The second term: $x$ times the derivative of sigmoid. From the product rule, this is $x \cdot \frac{d}{dx}[\sigma(x)]$.

This can also be written as $\textcolor{#8CB4D5}{\text{Swish}'(x)} = \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x}) + \textcolor{#18B8C8}{\text{Swish}}(\textcolor{#E07B9D}{x}) \cdot (1 - \textcolor{#C9A227}{\sigma}(\textcolor{#E07B9D}{x}))$, which is cheap to compute once you already have the forward pass values.
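Both ways of writing the derivative can be checked against each other and against a numerical derivative. A sketch with function names invented for this example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_grad_product_rule(x):
    """sigma(x) + x * sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def swish_grad_from_forward(x):
    """sigma(x) + Swish(x) * (1 - sigma(x)), reusing forward-pass values."""
    s = sigmoid(x)
    return s + swish(x) * (1.0 - s)

def central_difference(f, x, h=1e-5):
    """Numerical derivative, used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x,
          round(swish_grad_product_rule(x), 4),
          round(swish_grad_from_forward(x), 4),
          round(central_difference(swish, x), 4))
```

All three columns should agree. The second form is the one a backward pass would typically favor, since `sigmoid(x)` and `swish(x)` are already available from the forward pass.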

Concrete values and the non-monotonic dip

Using the reference inputs:

$\textcolor{#E07B9D}{x}$	$\textcolor{#C9A227}{\sigma(x)}$	$\textcolor{#18B8C8}{\text{Swish}(x)}$	$\textcolor{#D85A30}{\text{ReLU}(x)}$	Notes
$-1$	$0.27$	$-0.27$	$0$	Swish dips negative; ReLU is flat at zero
$2$	$0.88$	$1.76$	$2$	Gate nearly open, close to linear
$0.5$	$0.62$	$0.31$	$0.5$	Soft gating scales down small positives
$-3$	$0.05$	$-0.14$	$0$	Gate closing, output recovering toward zero

Unlike ReLU (which is monotonically non-decreasing), Swish dips below zero for negative inputs, reaching a minimum of about $-0.278$ near $x \approx -1.28$. For more negative values, the sigmoid gate closes faster than $|x|$ grows, so the output rises back toward zero. This non-monotonicity is unusual. Both Swish and GELU have this dip: a small negative "bump" before the gate pulls the output back to near zero. Empirically, this seems to help during training, perhaps by providing richer gradient information near the origin.
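The location of the dip can be found numerically. The sketch below uses a simple ternary search, which is a choice made for this example rather than anything specific to Swish. It relies on the assumption that the function has a single dip on the search interval $[-5, 0]$.

```python
import math

def swish(x):
    return x / (1.0 + math.exp(-x))

def ternary_search_min(f, lo, hi, iters=200):
    """Locate the minimum of a function with a single dip on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2  # the minimum cannot lie to the right of m2
        else:
            lo = m1  # the minimum cannot lie to the left of m1
    return (lo + hi) / 2.0

x_min = ternary_search_min(swish, -5.0, 0.0)
print(round(x_min, 3), round(swish(x_min), 3))
```

The search should land near the values quoted above. Swapping in a GELU implementation for `swish` would locate GELU's dip the same way.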

Swish was discovered through automated search (neural architecture search) at Google Brain, which is a satisfying detail: a neural network found a good activation function for neural networks.

The vanishing gradient problem

The choice of activation function determines whether gradients survive the trip through many layers. To see why, start from the chain rule.

Chain rule through $L$ layers

Consider a network $\textcolor{#C9A227}{f} = \textcolor{#C9A227}{f_L} \circ \textcolor{#C9A227}{f_{L-1}} \circ \cdots \circ \textcolor{#C9A227}{f_1}$, where each layer $\textcolor{#C9A227}{f_l}$ applies a linear transformation followed by an activation. The gradient of the loss with respect to weights in the first layer is a product of partial derivatives, one per layer:

$\textcolor{#8CB4D5}{\frac{\partial \text{loss}}{\partial W_1}} = \textcolor{#8CB4D5}{\frac{\partial \text{loss}}{\partial \textcolor{#C9A227}{f_L}}} \cdot \textcolor{#8CB4D5}{\frac{\partial \textcolor{#C9A227}{f_L}}{\partial \textcolor{#C9A227}{f_{L-1}}}} \cdot \textcolor{#8CB4D5}{\frac{\partial \textcolor{#C9A227}{f_{L-1}}}{\partial \textcolor{#C9A227}{f_{L-2}}}} \cdots \textcolor{#8CB4D5}{\frac{\partial \textcolor{#C9A227}{f_2}}{\partial \textcolor{#C9A227}{f_1}}} \cdot \textcolor{#8CB4D5}{\frac{\partial \textcolor{#C9A227}{f_1}}{\partial W_1}}$

Each factor $\textcolor{#8CB4D5}{\frac{\partial \textcolor{#C9A227}{f_l}}{\partial \textcolor{#C9A227}{f_{l-1}}}}$ includes the activation function's derivative at that layer. If the activation is sigmoid, each such factor is at most $\textcolor{#C9A227}{\sigma'(z_l)} \leq 0.25$. If it is ReLU in the positive regime, each factor is $\textcolor{#D85A30}{\text{ReLU}'(z_l)} = 1$.

The exponential decay

The gradient reaching layer 1 is proportional to the product of all these activation derivatives, so whatever each layer contributes compounds across depth. Here is every function from this post, showing the gradient factor that survives $L$ layers when each layer contributes the typical derivative in the first column:

	Typical $\textcolor{#8CB4D5}{f'(x)}$	$L = 5$	$L = 10$	$L = 20$	$L = 50$
Sigmoid	$\sim 0.20$	$3.2 \times 10^{-4}$	$1.0 \times 10^{-7}$	$1.0 \times 10^{-14}$	$1.1 \times 10^{-35}$
Tanh	$\sim 0.42$	$0.013$	$1.7 \times 10^{-4}$	$2.9 \times 10^{-8}$	$1.5 \times 10^{-19}$
ReLU	$1.0$	$1.0$	$1.0$	$1.0$	$1.0$
Leaky ReLU	$1.0$	$1.0$	$1.0$	$1.0$	$1.0$
GELU	$1.0$	$1.0$	$1.0$	$1.0$	$1.0$
Swish	$1.0$	$1.0$	$1.0$	$1.0$	$1.0$

The "typical" column uses realistic derivative values at common pre-activation magnitudes, not theoretical maxima. Sigmoid's max derivative is 0.25 (at $x=0$), but most neurons operate away from zero where it is even lower. Tanh peaks at 1.0 (at $x=0$), but by $|x|=1$ it has already dropped to 0.42, and it falls off rapidly from there. Both saturate, but sigmoid is far worse.
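The compounding is just repeated multiplication, so the table rows are easy to recompute. A sketch, where the per-layer values 0.20 and 0.42 are the "typical" figures quoted above and the dictionary names are made up for this example:

```python
# Typical per-layer activation derivative, taken from the table's first column.
typical_derivative = {
    "Sigmoid": 0.20,
    "Tanh": 0.42,
    "ReLU (positive regime)": 1.0,
}
depths = (5, 10, 20, 50)

# Surviving gradient factor after L layers is derivative ** L.
surviving = {
    name: {L: d ** L for L in depths}
    for name, d in typical_derivative.items()
}

for name, row in surviving.items():
    print(name, "  ".join(f"L={L}: {v:.1e}" for L, v in row.items()))
```

This ignores the weight matrices entirely and isolates the contribution of the activation function alone, which is the same simplification the table makes.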

ReLU and Leaky ReLU have a gradient of exactly 1.0 in the positive regime, and GELU and Swish approach 1.0 there, so none of them saturates for positive inputs. The gradient passes through essentially unchanged regardless of depth. This is not the whole story: the weight matrices $\textcolor{#E8725C}{W_l}$ also appear in the chain and can cause exploding or vanishing gradients on their own. But eliminating the activation-induced decay was the critical breakthrough.

The demo below makes this concrete. Watch how the gradient magnitude changes at each layer as you increase network depth.

With sigmoid, the gradient is multiplied by at most 0.25 at each layer. Ten layers deep, the signal reaching the first layer is essentially zero. With ReLU, the gradient passes through at full strength for positive activations. This is the single biggest reason ReLU replaced sigmoid in deep networks.

How to choose

ReLU	Default for CNNs and most feedforward networks. Simple, fast, works. Use unless you have a reason not to.
GELU	Default for transformers. Used in BERT, GPT, ViT. Slightly better than ReLU for attention-based architectures.
Swish/SiLU	Alternative to GELU with similar properties. Used in EfficientNet and some diffusion models.
Leaky ReLU	When you observe dead neurons with ReLU. Common in GANs.
Sigmoid	Output layer for binary classification. Gates in LSTMs. Not for hidden layers.
Tanh	Candidate-state and output activations in LSTMs/GRUs (the gates themselves use sigmoid). Occasionally in small RNNs. Not for deep hidden layers.

The trend is clear: older activations (sigmoid, tanh) saturate and kill gradients. Modern activations (ReLU, GELU, Swish) avoid saturation in the positive regime. The remaining differences are about smoothness, the behavior near zero, and how well they interact with normalization and residual connections.

import torch.nn as nn

# CNN block: ReLU after each convolution
cnn_block = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(),               # ReLU for conv layers
    nn.Conv2d(64, 128, 3, padding=1),
    nn.ReLU(),
)

# Transformer feed-forward block: GELU between the two projections
transformer_ffn = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),               # GELU for transformer FFN
    nn.Linear(3072, 768),
)
