What is Maximum A Posteriori (MAP) estimation?

MAP estimation finds the model parameters that maximize the posterior probability: the most probable parameters given the data AND prior beliefs. It differs from MLE (Maximum Likelihood Estimation) by incorporating a prior distribution over parameters. The MAP estimate equals the mode of the posterior distribution.

How is L2 regularization equivalent to a Gaussian prior?

Adding a penalty term lambda ||w||^2 to a least-squares loss is equivalent, up to a constant scale factor and an additive constant, to adding the negative log of a Gaussian prior N(0, sigma^2 I) on the weights, where sigma^2 = sigma_n^2 / lambda and sigma_n^2 is the noise variance of the data. The regularization strength lambda is proportional to the inverse prior variance: stronger regularization means a tighter prior centered at zero.

What is the difference between MAP and MLE?

MLE finds parameters that maximize the probability of the observed data alone. MAP multiplies in a prior distribution over parameters before maximizing. When the prior is uniform (flat, no opinion), MAP reduces to MLE. When the prior is informative (e.g., Gaussian centered at zero), MAP shrinks parameters toward the prior mean, which is exactly what regularization does.

Why does regularization prevent overfitting from a Bayesian perspective?

A Gaussian prior centered at zero encodes the belief that model weights should be small unless the data strongly disagrees. The model needs substantial evidence to justify large weights. In low-data or high-noise regimes, the prior dominates and keeps weights small, preventing the wild oscillations that characterize overfitting. The regularization strength controls this prior-vs-data tradeoff.

Can other regularization methods be interpreted as Bayesian priors?

Yes. L1 regularization (lasso) corresponds to a Laplace prior, which encourages sparsity. Elastic net corresponds to a prior whose density is the product of a Gaussian term and a Laplace term. Dropout can be interpreted as approximate Bayesian inference. The general pattern holds: an additive penalty in the loss function corresponds to a multiplicative prior factor in the probability model, because exponentiating the negative penalty gives an (unnormalized) prior density.

From MLE to MAP: L2 Regularization is Bayesian in Disguise

Suppose you have data: house prices and their square footages. You fit a line, price = w * sqft. What slope w best explains your data?

Maximum likelihood estimation

For each possible value of w, you can ask: if the true slope were this, how likely is it that I'd see the data I actually observed? That question is what the likelihood measures.

Imagine trying w = 200 (dollars per square foot). You draw the line and look at how far each actual sale price falls from it. Some are close, some are far. Now try w = 50. The gaps are huge: houses sold for way more than that line predicts. Try w = 500. Now the line overshoots everything. Somewhere in between, there's a w where the data lines up best. That's your Maximum Likelihood Estimate (MLE).

To make "lines up best" precise, we need a model for the gaps between predicted and actual values. The simplest one: each error (aka residual) is drawn independently from a Gaussian (bell curve) centered at zero, with standard deviation $\textcolor{#E07B9D}{\sigma_n}$. This $\textcolor{#E07B9D}{\sigma_n}$ captures how noisy the data is. Houses don't sell for exactly w * sqft; they scatter around that line by roughly $\textcolor{#E07B9D}{\sigma_n}$ dollars. Given that noise model, and because the errors are independent, the likelihood of the full dataset is the product of the individual Gaussian densities, one per data point:

$\textcolor{#D85A30}{p(\textcolor{#18B8C8}{D} \mid \textcolor{#ffffff}{w})} = \prod \mathcal{N}(\textcolor{#18B8C8}{y_i} \mid \textcolor{#ffffff}{w}\textcolor{#18B8C8}{x_i},~ \textcolor{#E07B9D}{\sigma_n}^2)$

$\textcolor{#18B8C8}{D}$	Observed data: pairs of (x, y) values. The house prices and square footages you collected.
$\textcolor{#ffffff}{w}$	Weight parameter. Each value of w implies a different line y = wx. We're scanning over all of them.
$\textcolor{#E07B9D}{\sigma_n}$	Noise standard deviation. How much each sale price scatters around the true line.
$\textcolor{#D85A30}{p(D \mid w)}$	Likelihood. "If w were this value, how probable is the data I observed?" The peak is the MLE.

Below: the top panel shows data points (drag them) with a Gaussian bell at each point, centered on the regression line. Those bells are the noise model: each data point is drawn from a Gaussian centered at the predicted value. The bottom panel shows each point's individual likelihood contribution (thin cyan curves) and their product (bold orange). The peak of the product is the MLE.
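
The scan over candidate slopes can also be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the panels: the synthetic data, the assumed slope `true_w`, the noise level, and the grid range are all made up for the example. It works with the log of the likelihood, which peaks at the same w as the likelihood itself.

```python
import numpy as np

# Hypothetical synthetic data (not from the article): square footage in
# thousands of sq ft, price in thousands of dollars.
rng = np.random.default_rng(0)
x = rng.uniform(0.8, 3.0, size=20)
true_w, sigma_n = 200.0, 30.0            # assumed slope and noise level
y = true_w * x + rng.normal(0.0, sigma_n, size=x.size)

def log_likelihood(w, x, y, sigma_n):
    """Sum of log N(y_i | w * x_i, sigma_n^2) over all data points."""
    resid = y - w * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma_n**2)
                  - resid**2 / (2 * sigma_n**2))

# Scan candidate slopes and keep the one with the highest log-likelihood.
w_grid = np.linspace(0.0, 400.0, 4001)
ll = np.array([log_likelihood(w, x, y, sigma_n) for w in w_grid])
w_mle_grid = w_grid[np.argmax(ll)]

# For this one-parameter Gaussian model the maximizer has a closed form.
w_mle_exact = np.sum(x * y) / np.sum(x**2)

print(w_mle_grid, w_mle_exact)
```

The grid answer and the closed-form answer should agree to within the grid spacing.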

MLE is clean and direct. It finds the w that makes your observed data most probable, no opinions required. With plenty of data, this works well.

But MLE has no memory of the world. It only listens to the data in front of it. When data is scarce or noisy, it can latch onto patterns that aren't real. A polynomial fit through five noisy points can swing wildly between them, producing a "perfect" fit that's useless for prediction. MLE sees a great fit. You see overfitting.

What if you could also tell the model: "I expect the weights to be reasonable numbers, not extreme ones"?

The prior

That's exactly what a prior does. Before looking at any data, you encode a belief about what w should look like. If you think extreme values are unlikely but don't have strong opinions beyond that, a Gaussian centered at zero captures this:

$\textcolor{#7F77DD}{p(\textcolor{#ffffff}{w})} = \frac{1}{\sqrt{2\pi\textcolor{#7F77DD}{\sigma}^2}} \cdot \exp\left(-\frac{\textcolor{#ffffff}{w}^2}{2\textcolor{#7F77DD}{\sigma}^2}\right)$

$\textcolor{#ffffff}{w}$	Model weight. In the house price example, the slope (dollars per square foot). In a neural network, every connection has one.
$\textcolor{#7F77DD}{\sigma}$	Prior standard deviation. How spread out your belief is before seeing data. Small $\sigma$: "I'm fairly sure w is near zero." Large $\sigma$: "w could be anything."
$\textcolor{#7F77DD}{p(w)}$	Prior probability density. Highest at w = 0, decaying symmetrically outward.

Note: this $\textcolor{#7F77DD}{\sigma}$ is not the same as $\textcolor{#E07B9D}{\sigma_n}$ from the likelihood. That one measured noise in the data (how much sale prices scatter around the true line). This one measures how spread out your belief about the weight itself is. Two different sources of uncertainty. They'll meet again in the punchline.

Drag the slider to see how $\textcolor{#7F77DD}{\sigma}$ reshapes the prior. A tight prior (small $\textcolor{#7F77DD}{\sigma}$) concentrates probability mass near zero. A wide prior (large $\textcolor{#7F77DD}{\sigma}$) is nearly flat, encoding minimal opinion.

The Python computing this:

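import numpy as np
sigma = 1.0  # prior standard deviation (example value; the slider sets this)
w = np.linspace(-5, 5, 300)  # grid of candidate weights
pdf = np.exp(-w**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)  # N(0, sigma^2) density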

Putting it together

Now we have two pieces of information. The likelihood says how well each w explains the data. The prior says what we believed about w before seeing any data. Neither one alone gives the full picture.

Bayes' theorem is the rule for combining them. Multiply the likelihood by the prior, and you get the posterior: your updated belief about w after accounting for both the data and your prior expectations.

$\textcolor{#1D9E75}{p(\textcolor{#ffffff}{w} \mid \textcolor{#18B8C8}{D})} \propto \textcolor{#D85A30}{p(\textcolor{#18B8C8}{D} \mid \textcolor{#ffffff}{w})} \cdot \textcolor{#7F77DD}{p(\textcolor{#ffffff}{w})}$

$\textcolor{#1D9E75}{p(w \mid D)}$	Posterior: our updated belief about w after seeing the data. This is what we actually care about.
$\textcolor{#D85A30}{p(D \mid w)}$	Likelihood: how well w explains the observed data. Pulls the estimate toward the best fit.
$\textcolor{#7F77DD}{p(w)}$	Prior: what we believed about w before seeing data. Pulls the estimate toward the prior's center (zero, in our case).

The height of the posterior at any w tells you how plausible that w is, given both the data and the prior. A w where the curve is tall is more plausible than one where it's low. But at the end of the day, you need to pick a single slope to make predictions with. The MAP estimate (Maximum A Posteriori) picks the most plausible one: the w where the posterior is tallest.

It's a compromise. The likelihood pulls toward the best fit. The prior pulls toward its center (zero here, since that's where we centered it). MAP lands somewhere in between, depending on which force is stronger.

When the prior is wide ($\textcolor{#7F77DD}{\sigma} \to \infty$), it barely pulls at all, so MAP $\approx$ MLE. The data dominates. When the prior is tight (small $\textcolor{#7F77DD}{\sigma}$), it pulls hard, and MAP gets dragged toward the prior's center.

Try cranking the prior $\textcolor{#7F77DD}{\sigma}$ down to 0.3 and watch the green posterior snap toward the prior's center, dragging the MAP estimate with it. Then push $\textcolor{#7F77DD}{\sigma}$ up to 4.0 and watch MAP converge on the MLE. That's the prior-vs-data tug of war.
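
The same tug of war can be reproduced numerically. The sketch below is an illustration under assumed values (the five data points, the noise level, and the grid range are hypothetical): it adds the log-likelihood and log-prior on a grid of w values and reports where the sum peaks for a tight, a medium, and a wide prior.

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
sigma_n = 0.5                            # assumed noise standard deviation

def map_estimate(x, y, sigma_n, sigma):
    """Grid-based MAP: argmax of log-likelihood + log-prior (constants dropped)."""
    w_grid = np.linspace(-1.0, 4.0, 50001)
    resid = y[None, :] - w_grid[:, None] * x[None, :]
    log_lik = -np.sum(resid**2, axis=1) / (2 * sigma_n**2)
    log_prior = -w_grid**2 / (2 * sigma**2)
    return w_grid[np.argmax(log_lik + log_prior)]

w_mle = np.sum(x * y) / np.sum(x**2)     # closed-form MLE for y = w * x

for sigma in (0.3, 1.0, 4.0):
    w_map = map_estimate(x, y, sigma_n, sigma)
    print(f"sigma={sigma}: MAP={w_map:.4f}, MLE={w_mle:.4f}")
```

As the prior widens, the printed MAP value should move away from zero and approach the MLE.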

The punchline

Here's where the connection to regularization appears. We want the w where the posterior is tallest. For our one-parameter example, that's just calculus: take the derivative, set it to zero, solve. One equation, one unknown. But a neural network has millions of weights. Setting all the partial derivatives to zero gives you a system of millions of coupled equations, and solving it directly means inverting an n-by-n matrix, which costs $O(n^3)$ operations. For n = 10 million weights, that's on the order of $10^{21}$ operations, so it's not happening. And that's still assuming the model is linear. With nonlinear activations (ReLU, etc.), there is no closed-form solution at all. So instead, you rewrite the problem as a loss function and walk downhill with gradient descent: compute the gradient, take a small step, repeat. That's why ML frameworks think in terms of losses rather than equations to solve.

To get there, take the negative log of the posterior. The log turns the product (likelihood times prior) into a sum. The negation flips the peak into a valley. Now instead of finding the highest point, we're finding the lowest point of this expression:

$-\log \textcolor{#1D9E75}{p(\textcolor{#ffffff}{w}|\textcolor{#18B8C8}{D})} = -\log \textcolor{#D85A30}{p(\textcolor{#18B8C8}{D}|\textcolor{#ffffff}{w})} - \log \textcolor{#7F77DD}{p(\textcolor{#ffffff}{w})} + \text{const}$

Now look at what each term becomes. Up to additive constants that don't depend on $\textcolor{#ffffff}{w}$ (and so don't move the minimum), the negative log of the Gaussian likelihood is proportional to the sum of squared residuals (how far the data is from the line), and the negative log of the Gaussian prior is proportional to $\textcolor{#ffffff}{w}^2$ (how far the weight is from zero). Dropping those constants leaves two penalties, added together:

$= \textcolor{#D85A30}{\frac{\sum(\textcolor{#18B8C8}{y_i} - \textcolor{#ffffff}{w}\textcolor{#18B8C8}{x_i})^2}{2\textcolor{#E07B9D}{\sigma_n}^2}} + \textcolor{#7F77DD}{\frac{\textcolor{#ffffff}{w}^2}{2\textcolor{#7F77DD}{\sigma}^2}}$

$= \frac{1}{2\textcolor{#E07B9D}{\sigma_n}^2} \cdot \left[\textcolor{#D85A30}{\sum(\textcolor{#18B8C8}{y_i} - \textcolor{#ffffff}{w}\textcolor{#18B8C8}{x_i})^2} + \frac{\textcolor{#E07B9D}{\sigma_n}^2}{\textcolor{#7F77DD}{\sigma}^2} \cdot \textcolor{#ffffff}{w}^2\right] \quad \small\textsf{factor out } \frac{1}{2\sigma_n^2}$

$= \frac{1}{2\textcolor{#E07B9D}{\sigma_n}^2} \cdot \left[\textcolor{#D85A30}{\sum(\textcolor{#18B8C8}{y_i} - \textcolor{#ffffff}{w}\textcolor{#18B8C8}{x_i})^2} + \textcolor{#C9A227}{\lambda}\textcolor{#ffffff}{w}^2\right] \quad \small\textsf{where } \textcolor{#C9A227}{\lambda} = \frac{\textcolor{#E07B9D}{\sigma_n}^2}{\textcolor{#7F77DD}{\sigma}^2}$

Read that last line again. It's a sum of squared errors plus $\textcolor{#C9A227}{\lambda}$ times $w^2$. That is ridge regression (L2-regularized least squares). The + λw² term that people add to their loss function to "prevent overfitting" didn't come from nowhere. It fell out of Bayes' theorem the moment you chose a Gaussian prior on the weights.
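
To connect this back to the downhill walk described earlier, here is a minimal sketch of gradient descent on that ridge loss for the one-parameter model. The data, the value of `lam`, the step size `lr`, and the iteration count are assumptions picked for the example. Because the model is linear, a closed-form answer exists and serves as a check.

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
lam = 1.0                                # assumed regularization strength

def loss_grad(w):
    """Derivative of sum((y - w*x)^2) + lam * w^2 with respect to w."""
    return -2.0 * np.sum(x * (y - w * x)) + 2.0 * lam * w

w = 0.0                                  # starting guess
lr = 0.01                                # assumed step size
for _ in range(500):
    w -= lr * loss_grad(w)               # take a small step downhill

# Closed-form minimizer, available here only because the model is linear.
w_closed = np.sum(x * y) / (np.sum(x**2) + lam)
print(w, w_closed)
```

With a nonlinear model only the loop would remain; the closed-form line would have no counterpart.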

And the two sigmas meet: $\textcolor{#C9A227}{\lambda} = \textcolor{#E07B9D}{\sigma_n}^2 / \textcolor{#7F77DD}{\sigma}^2$. The regularization strength is the ratio of the noise variance to the prior variance. Noisy data or a tight prior means more regularization. Clean data or a weak prior (wider distribution) means less.

$\textcolor{#C9A227}{\lambda} = \frac{\textcolor{#E07B9D}{\sigma_n}^2}{\textcolor{#7F77DD}{\sigma}^2}$	Regularization strength. High noise or a tight prior means more regularization. Clean data or a weak prior (wider distribution) means less.

The two panels below visualize the same objective. Left: the regularization view (data loss + penalty). Right: the Bayesian view (negative log-posterior). They always minimize at the same w, because they are the same math.

Drag $\textcolor{#C9A227}{\lambda}$ and watch both curves deform in lockstep. The minimum stays at the same w in both panels, always.
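
The lockstep behavior can be checked numerically as well. The sketch below evaluates both views on the same grid of w values, using hypothetical data and assumed values for the two sigmas, and compares where each curve bottoms out.

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
sigma_n, sigma = 0.5, 1.0                # assumed noise and prior std
lam = sigma_n**2 / sigma**2              # lambda = sigma_n^2 / sigma^2

w_grid = np.linspace(-1.0, 4.0, 5001)
sq_err = np.sum((y[None, :] - w_grid[:, None] * x[None, :])**2, axis=1)

# Regularization view: data loss plus L2 penalty.
ridge_loss = sq_err + lam * w_grid**2

# Bayesian view: negative log-posterior, up to an additive constant.
neg_log_post = sq_err / (2 * sigma_n**2) + w_grid**2 / (2 * sigma**2)

# The curves differ only by the positive factor 1 / (2 * sigma_n^2),
# so they reach their minimum at the same w.
w_ridge = w_grid[np.argmin(ridge_loss)]
w_bayes = w_grid[np.argmin(neg_log_post)]
print(w_ridge, w_bayes)
```

Rescaling a curve by a positive constant stretches it vertically but cannot move its minimum, which is why the two panels always agree.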

Try it yourself

Below: polynomial regression on draggable data. The orange curve is the MLE fit (no regularization). The green curve is the MAP/ridge fit. Crank the degree up to 10+ and watch MLE go wild while MAP stays smooth. That's the Gaussian prior doing its job: large weights need strong evidence.
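
A non-interactive version of the same experiment might look like the sketch below. The data (a noisy sine wave), the degree, and the value of `lam` are assumptions for illustration, and the penalty is applied to every coefficient including the constant term, which is a simplification. The helper `fit` is a hypothetical name, not part of any library.

```python
import numpy as np

# Hypothetical data: 12 noisy samples of a sine wave.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=x.size)

degree = 10
X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x^2, ..., x^10

def fit(X, y, lam):
    """Minimize ||y - X w||^2 + lam * ||w||^2 via the normal equations."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

w_mle = np.linalg.lstsq(X, y, rcond=None)[0]    # no penalty: plain least squares
w_map = fit(X, y, lam=0.1)                      # assumed lambda for illustration

print("||w_mle|| =", np.linalg.norm(w_mle))
print("||w_map|| =", np.linalg.norm(w_map))
```

The ridge weights have a smaller norm than the unregularized ones, at the cost of a slightly worse fit to the training points. That trade is the prior at work.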

What this means

When you pick $\textcolor{#C9A227}{\lambda}$, you are implicitly choosing how much you trust your prior (weights should be small) versus your data. This is not a hack. It is Bayesian inference with a specific prior.

The pattern generalizes. L1 regularization (lasso) is MAP with a Laplace prior, which pushes weights to exactly zero (sparsity). Elastic net is MAP with a prior whose density is the product of a Gaussian term and a Laplace term. Dropout can be framed as approximate variational inference. Every regularizer you use is encoding a belief about what "reasonable" parameters look like.
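
To make the lasso case concrete, the derivation from the punchline can be repeated with a Laplace prior in place of the Gaussian one. This is a sketch under the same noise model; the scale parameter $b$ and the name $\lambda_1$ are introduced here for illustration and do not appear elsewhere in this article.

$p(w) = \frac{1}{2b} \exp\left(-\frac{|w|}{b}\right)$

$-\log p(w \mid D) = \frac{\sum(y_i - w x_i)^2}{2\sigma_n^2} + \frac{|w|}{b} + \text{const}$

$= \frac{1}{2\sigma_n^2} \cdot \left[\sum(y_i - w x_i)^2 + \lambda_1 |w|\right] + \text{const} \quad \small\textsf{where } \lambda_1 = \frac{2\sigma_n^2}{b}$

The penalty is now proportional to $|w|$ rather than $w^2$, which is the L1 (lasso) objective. As before, noisier data or a tighter prior (smaller $b$) means a larger penalty.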

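The next time you tune weight_decay, you are adjusting a prior variance (for a fixed noise level, and exactly so for plain SGD, where weight decay coincides with an L2 penalty). Might as well know what you are doing.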

The connection between MAP and L2 regularization is nicely covered in this video by DataMListic.