What is the difference between L1 and L2 loss?

L1 loss sums the absolute values of the errors: Σ|yᵢ - ŷᵢ|. L2 loss sums the squares: Σ(yᵢ - ŷᵢ)². Averaged over the data points, these become Mean Absolute Error (MAE) and Mean Squared Error (MSE). L2 penalizes large errors disproportionately more than small ones, making it sensitive to outliers. L1 penalizes every error in proportion to its size, making it more robust to outliers.

Why is L2 loss sensitive to outliers?

Because squaring an error amplifies large values. An error of 10 contributes 100 to L2 loss but only 10 to L1 loss. This means a single outlier can dominate the L2 objective, pulling the entire model toward it. L1 sees the same outlier as just a moderately large error, so the model stays closer to the majority of the data.

When should I use L1 vs L2 loss?

Use L2 (MSE) as the default for clean data with roughly Gaussian errors. Use L1 (MAE) when your data has outliers, heavy-tailed noise, or occasional spikes (sensor data, financial data). Use Huber loss when you're unsure: it behaves like L2 for small errors and L1 for large ones.

What is Huber loss?

Huber loss is a hybrid of L1 and L2. For errors smaller than a threshold delta, it's quadratic (like L2). For errors larger than delta, it's linear (like L1). This gives you L2's smoothness near zero (stable gradients) and L1's robustness for large errors (outlier resistance). The delta parameter controls the transition point.

Why does the L1 gradient have a discontinuity at zero?

The derivative of |e| is sign(e): +1 for positive errors, -1 for negative errors. At exactly e=0, the derivative is undefined because the absolute value function has a sharp corner (a kink). This can cause oscillation near the optimum during gradient descent, which is one reason some practitioners prefer smooth alternatives like Huber loss.

L1 vs L2 Loss: MAE and MSE

You have a model. It makes predictions. Some are close, some are off. You need a single number that captures "how wrong is this model overall?" That number is the loss.

The two most common choices: square the errors (L2), or take their absolute value (L1). The "L1" and "L2" refer to the norm used to measure the error vector. They sound almost the same. They are not.

Squared error (L2 loss)

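One of the most widely used loss functions in machine learning, and the usual default for regression. Take each error, square it, add them up (dividing by the number of points gives the mean squared error, MSE, which has the same minimizer):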

$\textcolor{#D85A30}{\text{L2}} = \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert_2^2 = \sum(y_i - \hat{y}_i)^2$

$y_i$	The actual value. What really happened.
$\hat{y}_i$	The predicted value. What your model said would happen.
$y_i - \hat{y}_i$	The error (residual). Positive means the model undershot, negative means it overshot.

Squaring does two things: it makes all errors positive (no cancellation between over- and under-predictions), and it makes large errors count much more than small ones. An error of 10 contributes 100 to the loss. An error of 1 contributes 1. The big error is 10x larger, but it contributes 100x more to the loss.

Absolute error (L1 loss)

Take the absolute value instead of squaring:

$\textcolor{#4A9EDE}{\text{L1}} = \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert_1 = \sum|y_i - \hat{y}_i|$

An error of 10 contributes 10. An error of 1 contributes 1. Large errors still matter more, but proportionally, not quadratically.
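To make the arithmetic concrete, here is a minimal sketch in plain Python. The helper names `l1_loss` and `l2_loss` and the two example errors are illustrative choices, not part of any library.

```python
def l1_loss(errors):
    """Sum of absolute errors."""
    return sum(abs(e) for e in errors)

def l2_loss(errors):
    """Sum of squared errors."""
    return sum(e * e for e in errors)

errors = [1, 10]  # one small error, one large error

l1_total = l1_loss(errors)  # 1 + 10
l2_total = l2_loss(errors)  # 1 + 100

# Share of the total loss that comes from the large error
l1_share = abs(errors[1]) / l1_total
l2_share = errors[1] ** 2 / l2_total

print(f"L1 total: {l1_total}, share from the large error: {l1_share:.0%}")
print(f"L2 total: {l2_total}, share from the large error: {l2_share:.0%}")
```

With these two errors, the large one accounts for about 91% of the L1 total but about 99% of the L2 total.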

Below: both loss functions plotted as a function of the error. Notice how L2 curves upward aggressively while L1 grows at a steady rate. At an error of 4, L2 is already at 16 while L1 is at 4.

The outlier test

Here's where the difference matters. Below: eight points roughly on a line, with both an L2 regression line and an L1 regression line fit to them. They start out almost identical.

Now drag one point far from the others.

The L2 line chases the outlier. The L1 line barely moves. Why?

L2 squares the error, so one point far away creates a massive loss term. The optimizer will distort the entire fit to reduce that one huge penalty. L1 just sees a moderately larger absolute error. Not worth wrecking the fit for the other seven points.

This is the core tradeoff: L2 is efficient when errors are well-behaved (no outliers). L1 is robust when they're not.
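The same effect shows up in the simplest possible model, a single constant prediction. A standard result is that the constant minimizing the sum of squared errors is the mean, while a constant minimizing the sum of absolute errors is the median. The sketch below uses made-up values to show how each responds to one outlier; it is a stand-in for the regression demo, not a reproduction of it.

```python
from statistics import mean, median

clean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
with_outlier = clean[:-1] + [100.0]  # move the last point far from the rest

# Best constant under L2 is the mean; under L1 it is the median.
l2_fit_clean, l1_fit_clean = mean(clean), median(clean)
l2_fit_out, l1_fit_out = mean(with_outlier), median(with_outlier)

print(f"L2 fit: {l2_fit_clean} -> {l2_fit_out}")
print(f"L1 fit: {l1_fit_clean} -> {l1_fit_out}")
```

Here the L2 fit moves from 4.5 to 16.0, while the L1 fit stays at 4.5.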

The gradient tells the story

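The loss function determines the gradient, and the gradient determines how the optimizer updates the model. Writing $e = y_i - \hat{y}_i$ for a single error, here is the derivative of each loss with respect to $e$, which is what each loss function tells the optimizer to do: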

$\textcolor{#D85A30}{\text{L2 gradient}} = 2e \quad \small\textsf{proportional to error}$

$\textcolor{#4A9EDE}{\text{L1 gradient}} = \text{sign}(e) \quad \small\textsf{constant: always} \pm 1$

L2's gradient is proportional to the error. A large error produces a large gradient, so the optimizer rushes to fix it. A small error produces a small gradient. This is why L2 overreacts to outliers: the outlier's error is huge, its gradient dominates, and the optimizer prioritizes it above everything else.

L1's gradient is always $\pm 1$ regardless of error size. The optimizer treats a small error with the same urgency as a large one. This is why L1 doesn't chase outliers: it doesn't "see" them as more urgent than any other error.

The tradeoff: L1's gradient has a discontinuity at zero (the absolute value has a sharp corner there). Because the gradient's magnitude never shrinks, a fixed step size can make the optimizer hop back and forth across the solution instead of settling smoothly. What happens when the error is exactly zero? The gradient is technically undefined. In practice, libraries like PyTorch define sign(0) = 0: if the prediction is exactly right, the gradient is zero, so the optimizer doesn't push it in either direction. In floating point this almost never triggers (values rarely land on exactly 0.0), but zero is a valid subgradient of $|e|$ at the corner, so the convention is mathematically sound.
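A rough way to see both behaviors is to run plain gradient descent directly on a single error $e$ with a fixed step size. This is a toy sketch: the step size, starting error, and step count are arbitrary choices, and a real optimizer updates model weights, not the error itself.

```python
def sign(e):
    """Sign with the sign(0) = 0 convention."""
    return (e > 0) - (e < 0)

def l2_grad(e):
    return 2 * e    # proportional to the error

def l1_grad(e):
    return sign(e)  # always -1, 0, or +1

lr = 0.4            # fixed step size (illustrative)
e_l2 = e_l1 = 3.0   # both start with the same error

for _ in range(50):
    e_l2 -= lr * l2_grad(e_l2)
    e_l1 -= lr * l1_grad(e_l1)

print(f"L2 error after 50 steps: {e_l2:.2e}")
print(f"L1 error after 50 steps: {e_l1:+.2f}")
```

Under L2 the error shrinks by a constant factor each step and ends up vanishingly small. Under L1 each step moves by exactly 0.4, so the error ends up hopping between roughly +0.2 and -0.2. A decaying learning rate is one common way to damp this.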

Huber loss: the compromise

What if you want L2's smoothness for small errors and L1's robustness for large ones? That's Huber loss:

$\textcolor{#1D9E75}{\text{Huber}(e)} = \frac{e^2}{2} \quad \small\textsf{if } |e| \le \textcolor{#1D9E75}{\delta}$

$\textcolor{#1D9E75}{\text{Huber}(e)} = \textcolor{#1D9E75}{\delta} \cdot \left(|e| - \frac{\textcolor{#1D9E75}{\delta}}{2}\right) \quad \small\textsf{if } |e| > \textcolor{#1D9E75}{\delta}$

Below $\textcolor{#1D9E75}{\delta}$, it's quadratic (like L2): smooth, stable gradients. Above $\textcolor{#1D9E75}{\delta}$, it's linear (like L1): outliers don't dominate. The green dots mark the transition points. Drag the slider to see how $\textcolor{#1D9E75}{\delta}$ controls where you stop trusting the error and start treating it as a potential outlier.
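The two branches translate directly into code. The sketch below is a plain-Python version for a single error; the function names `huber` and `huber_grad` are illustrative, and the default `delta=1.0` is an arbitrary choice.

```python
def huber(e, delta=1.0):
    """Huber loss for a single error e."""
    if abs(e) <= delta:
        return 0.5 * e * e                 # quadratic region
    return delta * (abs(e) - 0.5 * delta)  # linear region

def huber_grad(e, delta=1.0):
    """Derivative of the Huber loss with respect to e."""
    if abs(e) <= delta:
        return e                           # like L2 (without the factor of 2)
    return delta if e > 0 else -delta      # like L1, scaled by delta

for e in [0.5, 1.0, 3.0, -10.0]:
    print(e, huber(e), huber_grad(e))
```

At $|e| = \delta$ both branches give the same value and the same slope, which is why the curve has no corner at the transition. Past that point the gradient is capped at $\pm\delta$, so an error of -10 pulls no harder than an error of -1.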

Beyond lines: fitting a signal

Linear regression makes the difference clear, but how does this play out with a neural network on a more complex task? Below: an MLP ($1 \to 64 \to 64 \to 1$ with tanh activations) learns a composite sinusoidal signal. Both networks have identical architecture and identical initial weights. The only difference is the loss function.

The data has a few outlier points (red dots) where the measured value is far from the true signal. Hit train and watch both networks learn simultaneously.

Same architecture, same data, same optimizer, same learning rate. The loss function is the only variable. Two things to notice:

L2 converges faster. Its gradient is proportional to error ($2e$), so large early errors produce large updates. L1's gradient is constant ($\pm 1$), so it learns at a steadier pace.
L2 settles at a higher error floor. It spends capacity fitting outliers, distorting the curve away from the true signal. L1 largely ignores them and reaches a lower final error.

When to use which

L2 (MSE): The default. Clean data, Gaussian-distributed errors, no outliers. Used in most regression tasks. Gives the optimizer a strong signal for large errors and lets it settle precisely for small ones.

L1 (MAE): Robust choice. Sensor data with noise spikes, financial data with tail events, anything where occasional extreme values shouldn't dominate the fit. The same L1 norm, applied to the model weights as a penalty rather than to the errors, also encourages sparsity (see: lasso regression).

Huber: When you're not sure. Common in reinforcement learning (smooth Bellman error), robust regression, and anywhere you want smooth optimization near zero without outlier sensitivity. The $\textcolor{#1D9E75}{\delta}$ parameter lets you tune the tradeoff.

You can also switch loss functions during training: start with L2 for fast early convergence (large errors produce large gradients), then switch to L1 for fine-tuning to avoid overfitting outliers. This is different from Huber, which blends based on error magnitude at each point. Switching is based on training phase.
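One way to sketch the phase-based switch is a small helper that picks the loss from the epoch number. Everything here is hypothetical: the helper `loss_for_epoch`, the `switch_epoch` value, and the fixed example error are for illustration only. In a PyTorch training loop the same pattern could select between `F.mse_loss` and `F.l1_loss`.

```python
def l1(e):
    return abs(e)

def l2(e):
    return e * e

def loss_for_epoch(epoch, switch_epoch):
    """Return L2 before switch_epoch, L1 from switch_epoch onward."""
    return l2 if epoch < switch_epoch else l1

switch_epoch = 3  # hypothetical: in practice, tune this on validation data
error = 4.0       # a fixed example error, just to show which loss is active

for epoch in range(6):
    loss_fn = loss_for_epoch(epoch, switch_epoch)
    print(epoch, loss_fn.__name__, loss_fn(error))
```

Note that the loss value jumps at the switch (16.0 to 4.0 for this example error), so loss curves from the two phases are not directly comparable.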

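import torch
import torch.nn.functional as F

# Example tensors (illustrative values); in practice these come from your model and dataset
predictions = torch.tensor([2.5, 0.0, 2.0, 8.0])
targets = torch.tensor([3.0, -0.5, 2.0, 7.0])

# L2: mean squared error (averages over elements by default)
loss = F.mse_loss(predictions, targets)

# L1: mean absolute error (averages over elements by default)
loss = F.l1_loss(predictions, targets)

# Huber: quadratic for |error| <= delta, linear beyond it
loss = F.huber_loss(predictions, targets, delta=1.0)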