What is the difference between L1 and L2 regularization?

L2 regularization (Ridge) adds a penalty of λΣwᵢ² to the loss. It shrinks all weights toward zero but never sets them exactly to zero. L1 regularization (Lasso) adds λΣ|wᵢ|. It can drive weights to exactly zero, performing automatic feature selection. The difference comes from the geometry of their constraint regions: L2's circle has no corners, while L1's diamond has corners on the axes where weights are zero.

Is L1 regularization the same as L1 loss?

No. They use the same mathematical norm (absolute value) but apply it to different things. L1 loss (MAE) measures prediction errors: Σ|yᵢ - ŷᵢ|. L1 regularization (Lasso) penalizes weight magnitudes: λΣ|wⱼ|. You can combine any loss function with any regularization. The 'L1' and 'L2' refer to the norm used, not the application.

Why does Lasso produce sparse weights?

The L1 constraint region is a diamond (in 2D) or cross-polytope (in higher dimensions). When you minimize a smooth loss function subject to this constraint, the optimum tends to land at a corner of the diamond. Corners sit on coordinate axes, which means one or more weights are exactly zero. The L2 constraint region is a sphere with no corners, so the optimum almost never lands on an axis.

What is Elastic Net?

Elastic Net combines L1 and L2 regularization: α·λΣ|wⱼ| + (1-α)·λΣwⱼ². The mixing parameter α controls the balance. At α=1 it is pure Lasso; at α=0 it is pure Ridge. Elastic Net gets Lasso's sparsity with Ridge's stability, which is particularly useful when features are correlated (Lasso tends to pick one arbitrarily; Elastic Net keeps groups together).

What is the Bayesian interpretation of regularization?

L2 regularization is equivalent to placing a Gaussian prior on the weights (centered at zero). L1 regularization is equivalent to a Laplace prior (sharper peak at zero, heavier tails). The regularization strength λ controls how tight the prior is. This connection is derived through MAP estimation: maximizing the posterior is equivalent to minimizing the loss plus the regularization penalty.

Lasso vs Ridge: L1 and L2 Regularization

A naming collision that trips everyone up: "L1 loss" and "L1 regularization" both start with "L1," but they do different things to different quantities. The "L1" and "L2" refer to the norm being used, not the application.

| | Applied to | Norm notation | Purpose |
|---|---|---|---|
| L1/L2 loss | prediction errors ($\mathbf{y} - \hat{\mathbf{y}}$) | $\lVert \mathbf{y} - \hat{\mathbf{y}} \rVert_1$ or $\lVert \mathbf{y} - \hat{\mathbf{y}} \rVert_2^2$ | measure how wrong the model is |
| L1/L2 regularization | model weights ($\mathbf{w}$) | $\lVert \mathbf{w} \rVert_1$ or $\lVert \mathbf{w} \rVert_2^2$ | penalize model complexity |

Same norms, different targets. You can combine any loss with any regularization. The previous post covered the loss side. This one covers regularization.
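To make the separation concrete, here is a minimal sketch in NumPy. The function names and the toy numbers are illustrative assumptions, not part of any library: the loss functions only ever see predictions, the penalty functions only ever see weights, and any pair can be plugged into the same objective.

```python
import numpy as np

def l1_loss(y, y_hat):
    """Sum of absolute prediction errors: the L1 norm of the residual."""
    return np.sum(np.abs(y - y_hat))

def l2_loss(y, y_hat):
    """Sum of squared prediction errors: the squared L2 norm of the residual."""
    return np.sum((y - y_hat) ** 2)

def l1_penalty(w):
    """L1 norm of the weights (the Lasso penalty, before scaling by lambda)."""
    return np.sum(np.abs(w))

def l2_penalty(w):
    """Squared L2 norm of the weights (the Ridge penalty, before scaling by lambda)."""
    return np.sum(w ** 2)

def objective(loss_fn, penalty_fn, y, y_hat, w, lam):
    """Any loss can be paired with any penalty."""
    return loss_fn(y, y_hat) + lam * penalty_fn(w)

# Toy values, chosen only for illustration
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])
w = np.array([0.5, -2.0])

# L2 loss combined with an L1 penalty (the usual Lasso regression setup)
mixed = objective(l2_loss, l1_penalty, y, y_hat, w, lam=0.1)
print(mixed)
```

Swapping `l2_loss` for `l1_loss`, or `l1_penalty` for `l2_penalty`, changes one argument and nothing else.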

The problem

A model with enough parameters can memorize any dataset. Fit a degree-14 polynomial (15 coefficients) to 15 noisy points and it will thread through every single one, contorting itself into shapes that have nothing to do with the underlying pattern.

It scores perfectly on training data and terribly on anything new. This is overfitting: the model learned the noise, not the signal.
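A small experiment shows the effect. This is a sketch under assumptions of my own: a sine curve as the underlying pattern, Gaussian noise with standard deviation 0.3, and degree 14, which is the lowest degree with enough coefficients to pass through 15 points exactly.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    """Assumed underlying pattern for this demo."""
    return np.sin(2 * np.pi * x)

# 15 noisy training points
x_train = np.linspace(0.0, 1.0, 15)
y_train = true_fn(x_train) + rng.normal(scale=0.3, size=x_train.size)

# Fresh noisy points from the same process, placed between the training points
x_test = (x_train[:-1] + x_train[1:]) / 2
y_test = true_fn(x_test) + rng.normal(scale=0.3, size=x_test.size)

# Degree 14 means 15 coefficients: one per training point
poly = Polynomial.fit(x_train, y_train, deg=14)

train_mse = np.mean((poly(x_train) - y_train) ** 2)
test_mse = np.mean((poly(x_test) - y_test) ** 2)
print(f"train MSE: {train_mse:.2e}")
print(f"test MSE:  {test_mse:.2e}")
```

The training error is essentially zero because the polynomial interpolates the noise, while the error on the fresh points is orders of magnitude larger.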

Regularization fixes this by adding a penalty term that punishes large weights. The model now has to balance two objectives: fit the data and keep the weights small.

Ridge: L2 regularization

Add the sum of squared weights to the loss:

$\textcolor{#D85A30}{\text{Ridge}} = \text{Loss} + \lambda \lVert \mathbf{w} \rVert_2^2 = \text{Loss} + \lambda \sum w_i^2$

Large weights get penalized quadratically: a weight of 10 contributes 100 to the penalty, while a weight of 1 contributes 1. The optimizer shrinks everything toward zero, but the pull of the penalty (its gradient, $2\lambda w$) fades in proportion to the weight, so weights get small without ever quite reaching zero.

For squared-error loss, Ridge has a clean closed-form solution: $\mathbf{w} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$. This is the simplest case of Tikhonov regularization, which is why Ridge also goes by that name. For any $\lambda > 0$, the $\lambda I$ term makes $X^\top X + \lambda I$ invertible, which stabilizes the solve and keeps Ridge well-behaved even when features are correlated or the system is underdetermined.
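The closed form translates almost directly into NumPy. The sketch below assumes squared-error loss and no intercept term; `ridge_closed_form` is a name made up for this example, and the data is random. It deliberately uses an underdetermined system (more features than samples) to show that the solve still goes through.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solving the linear system is more stable than forming the inverse explicitly
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)

# Underdetermined: 5 samples, 8 features, so X^T X on its own is singular
X = rng.normal(size=(5, 8))
y = rng.normal(size=5)

w_small = ridge_closed_form(X, y, lam=0.1)
w_large = ridge_closed_form(X, y, lam=10.0)

print("rank of X^T X:", np.linalg.matrix_rank(X.T @ X))
print("||w|| at lambda=0.1: ", np.linalg.norm(w_small))
print("||w|| at lambda=10.0:", np.linalg.norm(w_large))
```

Increasing `lam` shrinks the norm of the solution, but the entries stay nonzero.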

Lasso: L1 regularization

Replace the squared penalty with an absolute value penalty:

$\textcolor{#4A9EDE}{\text{Lasso}} = \text{Loss} + \lambda \lVert \mathbf{w} \rVert_1 = \text{Loss} + \lambda \sum |w_i|$

The gradient of $|w|$ is $\pm 1$ for any $w \neq 0$: constant, regardless of how small $w$ is. So unlike Ridge, the penalty pushes on small weights just as hard as large ones. Weights that don't reduce the loss enough to justify that constant push get driven to exactly zero.

This is automatic feature selection. Lasso doesn't just shrink the model, it simplifies it by eliminating irrelevant features entirely. No separate feature selection step needed.
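A one-dimensional worked case makes the contrast with Ridge concrete. The setup is an assumption for illustration: a single weight $w$ and a quadratic loss $\tfrac{1}{2}(w - z)^2$, so $z$ is the value the weight would take with no regularization.

$\text{Ridge: } \min_w \tfrac{1}{2}(w - z)^2 + \lambda w^2 \;\Rightarrow\; w^\star = \dfrac{z}{1 + 2\lambda}$

$\text{Lasso: } \min_w \tfrac{1}{2}(w - z)^2 + \lambda |w| \;\Rightarrow\; w^\star = \operatorname{sign}(z)\,\max(|z| - \lambda,\ 0)$

Ridge rescales $z$ by a factor strictly between 0 and 1, so the result is nonzero whenever $z$ is. Lasso subtracts a fixed amount $\lambda$ and clamps at zero, so any weight with $|z| \le \lambda$ lands on exactly zero.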

The tradeoff: Lasso has no general closed-form solution, because the absolute value isn't differentiable at zero. It's solved iteratively, typically with coordinate descent.
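Coordinate descent can be sketched in a few lines. The version below is a minimal illustration, not a production solver: it assumes the objective $\tfrac{1}{2}\lVert \mathbf{y} - X\mathbf{w} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_1$ (the factor of one half is a convention chosen here), runs a fixed number of sweeps instead of checking convergence, and uses synthetic data in which only two of five features matter.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; anything inside [-lam, lam] becomes zero."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize 0.5 * ||y - X w||^2 + lam * ||w||_1 one coordinate at a time."""
    n_features = X.shape[1]
    w = np.zeros(n_features)
    col_sq_norms = np.sum(X ** 2, axis=0)
    for _ in range(n_iters):
        for j in range(n_features):
            # Residual with feature j's current contribution removed
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual
            # Exact minimizer over w[j] with all other weights held fixed
            w[j] = soft_threshold(rho, lam) / col_sq_norms[j]
    return w

# Synthetic data: features 0 and 3 carry signal, the rest are irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w_hat = lasso_coordinate_descent(X, y, lam=10.0)
print(w_hat)
```

Each single-coordinate update is a one-dimensional problem with an exact solution, which is how the solver sidesteps the non-differentiability at zero. The irrelevant features come out as exact zeros, not merely small numbers.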

Why L1 produces sparsity

This is the key insight, and the geometry makes it clear.

Think of regularization as a constraint: instead of "minimize loss + penalty," think "minimize loss subject to the weights staying inside a budget region." For Ridge, that region is the L2 ball ($\lVert \mathbf{w} \rVert_2^2 \le t$). For Lasso, it's the L1 ball ($\lVert \mathbf{w} \rVert_1 \le t$).

For squared-error loss, the loss function has elliptical contours centered at the unconstrained optimum. The regularized solution is where those contours first touch the constraint boundary.

A circle is smooth everywhere. The contours almost always touch it at an off-axis point, meaning both weights are nonzero. A diamond has corners on the axes. The contours are much more likely to first touch at a corner, meaning one weight is exactly zero. In higher dimensions this effect is stronger: an n-dimensional L1 ball has 2n corners, all on coordinate axes.
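The geometric argument can be checked numerically in 2D. The simulation below is a rough sketch under assumptions chosen for this example: random elliptical losses whose unconstrained optimum sits at distance 3 from the origin, unit-sized constraint regions, and a brute-force search over a grid of boundary points. The exact fractions depend on how the random losses are drawn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Boundary of the unit L2 ball (circle); the grid includes the four axis points
theta = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
# Rescale each point so |w1| + |w2| = 1: the boundary of the unit L1 ball (diamond)
diamond = circle / np.sum(np.abs(circle), axis=1, keepdims=True)

def constrained_minimizer(boundary, A, center):
    """Brute-force the boundary point with the lowest quadratic loss."""
    diff = boundary - center
    losses = np.einsum("ij,jk,ik->i", diff, A, diff)
    return boundary[np.argmin(losses)]

def is_on_axis(w, tol=1e-9):
    """True when at least one weight is (numerically) zero."""
    return np.min(np.abs(w)) < tol

n_trials = 500
l1_hits = 0
l2_hits = 0
for _ in range(n_trials):
    # Random elliptical loss (w - c)^T A (w - c) with its minimum c outside both regions
    M = rng.normal(size=(2, 2))
    A = M.T @ M + 0.1 * np.eye(2)
    angle = rng.uniform(0.0, 2.0 * np.pi)
    center = 3.0 * np.array([np.cos(angle), np.sin(angle)])
    l1_hits += is_on_axis(constrained_minimizer(diamond, A, center))
    l2_hits += is_on_axis(constrained_minimizer(circle, A, center))

l1_fraction = l1_hits / n_trials
l2_fraction = l2_hits / n_trials
print(f"L1 solutions with a zero weight: {l1_fraction:.1%}")
print(f"L2 solutions with a zero weight: {l2_fraction:.1%}")
```

On the diamond, a large share of the random losses end up at a corner. On the circle, a solution lands on an axis only when the grid happens to round it there.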

Coefficient paths

Another way to see the difference: watch the coefficients as $\lambda$ increases from zero (no regularization) to a large value (heavy regularization).

Ridge coefficients shrink smoothly toward zero. They get small but never vanish. Lasso coefficients hit zero one by one as $\lambda$ increases. Each time a coefficient reaches zero, a feature drops out of the model.
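The two kinds of path can be traced by hand in a simplified setting. This sketch assumes orthonormal features and the objective $\tfrac{1}{2}\lVert \mathbf{y} - X\mathbf{w} \rVert_2^2$ plus penalty, in which case each coefficient has its own closed form: Ridge divides the unregularized value by $1 + 2\lambda$, and Lasso soft-thresholds it by $\lambda$. The starting coefficients in `w_ols` are hypothetical.

```python
import numpy as np

# Hypothetical unregularized coefficients for three features
w_ols = np.array([3.0, -1.5, 0.5])
lambdas = np.array([0.0, 0.25, 1.0, 2.0, 4.0])

# Assuming orthonormal features, each coefficient can be traced independently
ridge_path = np.array([w_ols / (1.0 + 2.0 * lam) for lam in lambdas])
lasso_path = np.array(
    [np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0) for lam in lambdas]
)

for lam, ridge_w, lasso_w in zip(lambdas, ridge_path, lasso_path):
    print(f"lambda={lam:<5} ridge={np.round(ridge_w, 3)} lasso={lasso_w}")

# Number of features still in the model at each lambda
ridge_nonzero = np.count_nonzero(ridge_path, axis=1)
lasso_nonzero = np.count_nonzero(lasso_path, axis=1)
print("ridge nonzero counts:", ridge_nonzero)
print("lasso nonzero counts:", lasso_nonzero)
```

The Ridge counts stay at 3 for every value of `lam`, while the Lasso counts step down as the smallest coefficients are eliminated first.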

The Bayesian connection

If you read the MLE to MAP post, you saw that L2 regularization falls out of placing a Gaussian prior on the weights. L1 regularization corresponds to a Laplace prior: same idea, different distribution.

$\textcolor{#D85A30}{\text{Ridge}} \leftrightarrow \text{Gaussian prior: } p(w) \propto \exp\left(-\frac{w^2}{2\sigma^2}\right)$

$\textcolor{#4A9EDE}{\text{Lasso}} \leftrightarrow \text{Laplace prior: } p(w) \propto \exp\left(-\frac{|w|}{b}\right)$

The Laplace distribution has a sharper peak at zero and heavier tails. It says: "most weights should be zero or near-zero, but the few that matter can be large." The Gaussian says: "all weights should be moderate." This is exactly the behavior we see: Lasso produces sparse solutions, Ridge produces small-but-dense solutions.
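The correspondence can be written out in a few lines. This sketch assumes an independent prior on each weight and takes the loss to be the negative log-likelihood with no extra scaling; $\mathcal{D}$ denotes the training data.

$-\log p(\mathbf{w} \mid \mathcal{D}) = \underbrace{-\log p(\mathcal{D} \mid \mathbf{w})}_{\text{Loss}} - \sum_i \log p(w_i) + \text{const}$

$\text{Gaussian prior: } -\log p(w_i) = \dfrac{w_i^2}{2\sigma^2} + \text{const} \;\Rightarrow\; \lambda = \dfrac{1}{2\sigma^2}$

$\text{Laplace prior: } -\log p(w_i) = \dfrac{|w_i|}{b} + \text{const} \;\Rightarrow\; \lambda = \dfrac{1}{b}$

Under these assumptions, a tighter prior (smaller $\sigma$ or $b$) means a larger $\lambda$, so stronger regularization is the same thing as a stronger prior belief that the weights are near zero.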

Elastic Net: the compromise

Just like Huber loss splits the difference between L1 and L2 loss, Elastic Net combines both regularizers:

$\textcolor{#1D9E75}{\text{Elastic Net}} = \text{Loss} + \lambda\left(\alpha \lVert \mathbf{w} \rVert_1 + (1-\alpha) \lVert \mathbf{w} \rVert_2^2\right)$

At $\alpha=1$ it's pure Lasso. At $\alpha=0$ it's pure Ridge. In between, you get Lasso's sparsity with Ridge's stability. This matters when features are correlated: Lasso tends to arbitrarily pick one from a correlated group and zero out the rest. Elastic Net keeps correlated features together.
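The two endpoints are easy to verify by writing the penalty out. This is a minimal sketch of the formula above; `elastic_net_penalty` and the weight vector are made up for illustration.

```python
import numpy as np

def elastic_net_penalty(w, lam, alpha):
    """lam * (alpha * ||w||_1 + (1 - alpha) * ||w||_2^2)."""
    l1 = np.sum(np.abs(w))
    l2_squared = np.sum(w ** 2)
    return lam * (alpha * l1 + (1.0 - alpha) * l2_squared)

# Illustrative weights
w = np.array([0.5, -2.0, 0.0])

pure_lasso = elastic_net_penalty(w, lam=1.0, alpha=1.0)  # only the L1 term remains
pure_ridge = elastic_net_penalty(w, lam=1.0, alpha=0.0)  # only the squared L2 term remains
blend = elastic_net_penalty(w, lam=1.0, alpha=0.5)       # equal mix of the two

print(pure_lasso, pure_ridge, blend)
```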

When to use which

Ridge: the default for regularization. Stable, differentiable, closed-form solution. Use when you believe most features matter but want to prevent overfitting.

Lasso: when you suspect many features are irrelevant and want the model to select the important ones automatically. Common in high-dimensional settings (genomics, text) where the number of features far exceeds the number of samples ($p \gg n$).

Elastic Net: when features are correlated and you want both sparsity and group stability. The safe middle ground.

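import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Placeholder data so the snippet runs; substitute your own X and y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# In scikit-learn, `alpha` plays the role of λ (up to scaling) and `l1_ratio` is the mixing weight α
# Ridge (L2)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso (L1)
lasso = Lasso(alpha=1.0).fit(X, y)

# Elastic Net
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)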