A naming collision that trips everyone up: "L1 loss" and "L1 regularization" both start with "L1," but they do different things to different quantities. The "L1" and "L2" refer to the norm being used, not the application.
| | Applied to | Norm notation | Purpose |
|---|---|---|---|
| L1/L2 loss | prediction errors ($\mathbf{y} - \hat{\mathbf{y}}$) | $\lVert \mathbf{y} - \hat{\mathbf{y}} \rVert_1$ or $\lVert \mathbf{y} - \hat{\mathbf{y}} \rVert_2^2$ | measure how wrong the model is |
| L1/L2 regularization | model weights ($\mathbf{w}$) | $\lVert \mathbf{w} \rVert_1$ or $\lVert \mathbf{w} \rVert_2^2$ | penalize model complexity |
Same norms, different targets. You can combine any loss with any regularization. The previous post covered the loss side. This one covers regularization.
The problem
A model with enough parameters can memorize any dataset. Fit a degree-8 polynomial to 15 noisy points and it will chase the noise, contorting itself into shapes that have nothing to do with the underlying pattern. (With only 9 coefficients it cannot pass through all 15 points exactly, but it gets close enough to do damage.)
It scores very well on training data and terribly on anything new. This is overfitting: the model learned the noise, not the signal.
Regularization fixes this by adding a penalty term that punishes large weights. The model now has to balance two objectives: fit the data and keep the weights small.
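To make the degree-8 example concrete, here is a minimal NumPy sketch. The sine-shaped signal, the noise level, the seed, and the penalty strength `lam=1e-3` are all assumptions chosen for illustration, and for simplicity the penalty is applied to every coefficient, including the constant term.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed ground-truth signal for this illustration
    return np.sin(2 * np.pi * x)

# 15 noisy training points, plus a dense noise-free grid for testing
x_train = np.sort(rng.uniform(0.0, 1.0, size=15))
y_train = true_fn(x_train) + rng.normal(scale=0.3, size=15)
x_test = np.linspace(0.0, 1.0, 200)
y_test = true_fn(x_test)

def design(x, degree=8):
    # Columns 1, x, x^2, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

def fit_poly(x, y, lam, degree=8):
    # Penalized least squares via an augmented system:
    # minimizing ||Xw - y||^2 + lam * ||w||^2 equals ordinary least squares
    # on [X; sqrt(lam) * I] and [y; 0]. lam=0 recovers the unpenalized fit.
    X = design(x, degree)
    p = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return w

def mse(w, x, y):
    return float(np.mean((design(x, len(w) - 1) @ w - y) ** 2))

w_plain = fit_poly(x_train, y_train, lam=0.0)
w_ridge = fit_poly(x_train, y_train, lam=1e-3)

for name, w in [("no penalty", w_plain), ("penalized", w_ridge)]:
    print(f"{name:>10}: train MSE={mse(w, x_train, y_train):.4f}, "
          f"test MSE={mse(w, x_test, y_test):.4f}, "
          f"||w||={np.linalg.norm(w):.1f}")
```

The unpenalized fit always has the lower training error, and the penalized fit always has the smaller weight norm. Whether the penalty helps on test data depends on the data and on `lam`, which is why it is tuned in practice.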
Ridge: L2 regularization
Add the sum of squared weights to the loss:
$\textcolor{#D85A30}{\text{Ridge}} = \text{Loss} + \lambda \lVert \mathbf{w} \rVert_2^2 = \text{Loss} + \lambda \sum w_i^2$
Large weights get penalized quadratically. A weight of 10 contributes 100 to the penalty, a weight of 1 contributes 1. The optimizer shrinks everything toward zero, but the pull of the penalty (its gradient, $2\lambda w_i$) fades in proportion to the weight, so weights approach zero without quite reaching it.
With squared-error loss, Ridge has a clean closed-form solution: $\mathbf{w} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$. Ridge is a special case of Tikhonov regularization, which is why you will also see it under that name. The $\lambda I$ term stabilizes the matrix inversion, making Ridge well-behaved even when features are correlated or the system is underdetermined.
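The closed form is short enough to check directly. The sketch below assumes squared-error loss summed over samples and no intercept, on synthetic data made up for the illustration. It solves the linear system rather than forming the inverse, then compares against scikit-learn's `Ridge` with `fit_intercept=False`.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic data (an assumption for this example)
X = rng.normal(size=(50, 5))
w_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ w_true + rng.normal(scale=0.1, size=50)

lam = 1.0

# Closed form: solve (X^T X + lam * I) w = X^T y
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn minimizes ||y - Xw||^2 + alpha * ||w||^2, so alpha = lam here
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.round(w_closed, 4))
print(np.round(w_sklearn, 4))
```

Using `np.linalg.solve` instead of an explicit inverse is the usual choice: it is cheaper and numerically safer.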
Lasso: L1 regularization
Replace the squared penalty with an absolute value penalty:
$\textcolor{#4A9EDE}{\text{Lasso}} = \text{Loss} + \lambda \lVert \mathbf{w} \rVert_1 = \text{Loss} + \lambda \sum |w_i|$
The gradient of $|w|$ is $\pm 1$ for any $w \neq 0$ (constant, regardless of how small $w$ is). So unlike Ridge, the penalty pushes on small weights just as hard as large ones. Weights that aren't pulling their weight (contributing enough to reduce the loss) get driven to exactly zero.
This is automatic feature selection. Lasso doesn't just shrink the model, it simplifies it by eliminating irrelevant features entirely. No separate feature selection step needed.
The tradeoff: Lasso has no general closed-form solution (the absolute value isn't differentiable at zero). It's solved iteratively, typically with coordinate descent.
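Coordinate descent for Lasso fits in a few lines. This is a minimal sketch, not a production solver: it assumes no intercept, a fixed number of sweeps instead of a convergence check, and the objective scaling scikit-learn uses, $\frac{1}{2n}\lVert \mathbf{y} - X\mathbf{w} \rVert_2^2 + \alpha \lVert \mathbf{w} \rVert_1$. The helper names `soft_threshold` and `lasso_cd` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, t):
    # Shrink z toward zero by t, and clip to exactly zero inside [-t, t]
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=500):
    """Cyclic coordinate descent for (1/(2n))||y - Xw||^2 + alpha * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y - X @ w
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * w[j]      # remove feature j's contribution
            rho = X[:, j] @ residual / n    # correlation with partial residual
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
            residual -= X[:, j] * w[j]      # add the updated contribution back
    return w

# Synthetic data (an assumption for this example): three null features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
w_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0])
y = X @ w_true + rng.normal(scale=0.1, size=200)

w_cd = lasso_cd(X, y, alpha=0.1)
w_sk = Lasso(alpha=0.1, fit_intercept=False, tol=1e-10,
             max_iter=10_000).fit(X, y).coef_

print(np.round(w_cd, 4))
print(np.round(w_sk, 4))
```

The exact zeros come from `soft_threshold`: whenever a feature's correlation with the partial residual is at most `alpha` in magnitude, its weight is set to 0.0, not merely to something small.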
Why L1 produces sparsity
This is the key insight, and the geometry makes it clear.
Think of regularization as a constraint: instead of "minimize loss + penalty," think "minimize loss subject to the weights staying inside a budget region." For Ridge, that region is the L2 ball ($\lVert \mathbf{w} \rVert_2^2 \le t$). For Lasso, it's the L1 ball ($\lVert \mathbf{w} \rVert_1 \le t$).
The loss function has elliptical contours centered at the unconstrained optimum. The regularized solution is where those contours first touch the constraint boundary.
A circle is smooth everywhere. The contours almost always touch it at an off-axis point, meaning both weights are nonzero. A diamond has corners on the axes. The contours are much more likely to first touch at a corner, meaning one weight is exactly zero. In higher dimensions this effect is stronger: an n-dimensional L1 ball has 2n corners, all on coordinate axes.
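A one-weight version makes the corner effect concrete. As a simplified sketch (the scalar $z$ stands for the unregularized optimum and the factor $\tfrac{1}{2}$ is a convenience, both assumptions for illustration), compare the two penalized problems:

$\textcolor{#D85A30}{\text{Ridge}}: \min_w \tfrac{1}{2}(w - z)^2 + \lambda w^2 \;\Rightarrow\; w = \frac{z}{1 + 2\lambda}$

$\textcolor{#4A9EDE}{\text{Lasso}}: \min_w \tfrac{1}{2}(w - z)^2 + \lambda |w| \;\Rightarrow\; w = \operatorname{sign}(z)\,\max(|z| - \lambda,\, 0)$

With $z = 0.3$ and $\lambda = 0.5$, Ridge gives $w = 0.15$ while Lasso gives exactly $w = 0$. Ridge rescales, so it only reaches zero if $z$ is already zero. Lasso subtracts a fixed amount and clips, so any $|z| \le \lambda$ lands on zero.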
Coefficient paths
Another way to see the difference: watch the coefficients as $\lambda$ increases from zero (no regularization) to a large value (heavy regularization).
Ridge coefficients shrink smoothly toward zero. They get small but never vanish. Lasso coefficients hit zero one by one as $\lambda$ increases. Each zero-crossing is a feature being dropped from the model.
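The paths are easy to trace numerically. The sketch below uses synthetic data and a log-spaced grid of penalty strengths (both assumptions for illustration) and counts the nonzero coefficients at each step. Note that scikit-learn scales the `Ridge` and `Lasso` objectives differently, so the same `alpha` value is not directly comparable between the two; only the shape of each path matters here.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data (an assumption for this example): 4 useful, 4 null features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
w_true = np.array([3.0, -2.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + rng.normal(scale=0.5, size=200)

alphas = np.logspace(-3, 1, 9)  # 0.001 up to 10
ridge_nonzero, lasso_nonzero = [], []

for alpha in alphas:
    ridge_coef = Ridge(alpha=alpha).fit(X, y).coef_
    lasso_coef = Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_
    ridge_nonzero.append(int(np.count_nonzero(ridge_coef)))
    lasso_nonzero.append(int(np.count_nonzero(lasso_coef)))

for alpha, r, l in zip(alphas, ridge_nonzero, lasso_nonzero):
    print(f"alpha={alpha:8.3f}  ridge nonzero={r}  lasso nonzero={l}")
```

Ridge keeps all eight coefficients nonzero across the grid, while the Lasso count falls as `alpha` grows and reaches zero once `alpha` exceeds the largest feature-target correlation.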
The Bayesian connection
If you read the MLE to MAP post, you saw that L2 regularization falls out of placing a Gaussian prior on the weights. L1 regularization corresponds to a Laplace prior: same idea, different distribution.
$\textcolor{#D85A30}{\text{Ridge}} \leftrightarrow \text{Gaussian prior: } p(w) \propto \exp\left(-\frac{w^2}{2\sigma^2}\right)$
$\textcolor{#4A9EDE}{\text{Lasso}} \leftrightarrow \text{Laplace prior: } p(w) \propto \exp\left(-\frac{|w|}{b}\right)$
The Laplace distribution has a sharper peak at zero and heavier tails. It says: "most weights should be zero or near-zero, but the few that matter can be large." The Gaussian says: "all weights should be moderate." This is exactly the behavior we see: Lasso produces sparse solutions, Ridge produces small-but-dense solutions.
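To spell out the correspondence, here is a short sketch assuming the loss is the negative log-likelihood and the weights have independent priors (the mapping from prior scale to $\lambda$ depends on that scaling). The MAP estimate minimizes the negative log-posterior:

$-\log p(\mathbf{w} \mid \text{data}) = \underbrace{-\log p(\text{data} \mid \mathbf{w})}_{\text{Loss}} - \sum_i \log p(w_i) + \text{const}$

$\text{Gaussian prior: } -\sum_i \log p(w_i) = \frac{1}{2\sigma^2} \sum_i w_i^2 + \text{const} \;\Rightarrow\; \lambda = \frac{1}{2\sigma^2}$

$\text{Laplace prior: } -\sum_i \log p(w_i) = \frac{1}{b} \sum_i |w_i| + \text{const} \;\Rightarrow\; \lambda = \frac{1}{b}$

A tighter prior (smaller $\sigma$ or $b$) means a larger $\lambda$.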
Elastic Net: the compromise
Just like Huber loss splits the difference between L1 and L2 loss, Elastic Net combines both regularizers:
$\textcolor{#1D9E75}{\text{Elastic Net}} = \text{Loss} + \lambda\left(\alpha \lVert \mathbf{w} \rVert_1 + (1-\alpha) \lVert \mathbf{w} \rVert_2^2\right)$
At $\alpha=1$ it's pure Lasso. At $\alpha=0$ it's pure Ridge. In between, you get Lasso's sparsity with Ridge's stability. This matters when features are correlated: Lasso tends to arbitrarily pick one from a correlated group and zero out the rest. Elastic Net keeps correlated features together.
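The grouping effect shows up most clearly in the extreme case of two identical features. The sketch below builds such a dataset synthetically (an assumption for illustration) and fits both models with scikit-learn. One naming trap: scikit-learn's `alpha` is the overall strength ($\lambda$ above) and `l1_ratio` is the mixing weight ($\alpha$ above), and its objective uses slightly different scaling constants than the formula here.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Synthetic data (an assumption for this example):
# columns 0 and 1 are exact copies, column 2 is unrelated noise
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
X = np.column_stack([z, z, rng.normal(size=n)])
y = 2.0 * z + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-10,
                  max_iter=10_000).fit(X, y)

print("Lasso:      ", np.round(lasso.coef_, 3))
print("Elastic Net:", np.round(enet.coef_, 3))
```

With identical columns the Lasso objective cannot tell the two apart, so how the weight is divided between them is left to the solver and typically ends up concentrated on one. The L2 term in Elastic Net makes the solution unique, and by symmetry the two copies get equal weight.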
When to use which
Ridge: the default for regularization. Stable, differentiable, closed-form solution. Use when you believe most features matter but want to prevent overfitting.
Lasso: when you suspect many features are irrelevant and want the model to select the important ones automatically. Common in high-dimensional settings (genomics, text) where the number of features far exceeds the number of samples ($p \gg n$).
Elastic Net: when features are correlated and you want both sparsity and group stability. The safe middle ground.
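from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# Synthetic data so the snippet runs on its own; swap in your own X, y
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)
# Ridge (L2): alpha is scikit-learn's name for the regularization strength
ridge = Ridge(alpha=1.0).fit(X, y)
# Lasso (L1)
lasso = Lasso(alpha=1.0).fit(X, y)
# Elastic Net: l1_ratio is the L1/L2 mixing weight (0 = Ridge-like, 1 = Lasso)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)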