A norm is a way to measure the "size" of a vector. You encounter two constantly in machine learning: the L1 norm and the L2 norm. They show up in loss functions, regularization, distance metrics, and optimization theory. They sound similar but measure fundamentally different things.
The Lp norm
The Lp norm of a vector $\mathbf{x}$, written $\textcolor{#C9A227}{\lVert \mathbf{x} \rVert_p}$, measures the "size" of that vector using a parameter $p \ge 1$. For a vector $\mathbf{x}$ with components $x_1, x_2, \ldots, x_n$:
$\textcolor{#C9A227}{\lVert \mathbf{x} \rVert_p} = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}$
Plugging in specific values of p gives the three norms you see everywhere:
| p | Norm and meaning |
|---|---|
| p = 1 | $\lVert \mathbf{x} \rVert_1 = \lvert x_1 \rvert + \lvert x_2 \rvert + \cdots + \lvert x_n \rvert$. Sum of absolute values. |
| p = 2 | $\lVert \mathbf{x} \rVert_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$. The familiar Euclidean length. |
| p = $\infty$ | $\lVert \mathbf{x} \rVert_\infty = \max(\lvert x_1 \rvert, \lvert x_2 \rvert, \ldots, \lvert x_n \rvert)$. The largest absolute component. |
The intuition for L-infinity: as p grows, raising each $|x_i|$ to a higher and higher power makes the largest component dominate the sum. The smaller components become negligible. In the limit, only the single largest absolute component survives, so the constraint $\lVert \mathbf{x} \rVert_\infty \le 1$ just means each coordinate must independently lie in $[-1, 1]$.
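If you want the limit spelled out, here is a short sketch of the standard argument. The symbol $m$ is my own shorthand for the largest absolute component, $m = \max_i |x_i|$, and $n$ is the number of components. The sum contains $m^p$ as one of its terms, and none of its $n$ terms is larger than $m^p$, so:

$m^p \le \sum_{i=1}^{n} |x_i|^p \le n \, m^p$

Taking the $p$-th root of each part:

$m \le \lVert \mathbf{x} \rVert_p \le n^{1/p} \, m$

Since $n^{1/p} \to 1$ as $p \to \infty$, the norm is squeezed toward $m$, which is exactly $\lVert \mathbf{x} \rVert_\infty$.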
Same vector, different rulers. The vector (3, 4) has L1 norm 7, L2 norm 5, and L-infinity norm 4.
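To check those numbers yourself, here is a minimal sketch in plain Python. The function name `lp_norm` is my own choice for illustration, not a library API, and it assumes p is either at least 1 or `math.inf`.

```python
import math

def lp_norm(x, p):
    """Lp norm of a sequence of numbers, for p >= 1 or p = math.inf."""
    if p == math.inf:
        # Limit case: only the largest absolute component counts.
        return max(abs(v) for v in x)
    if p < 1:
        raise ValueError("p must be >= 1 for this to be a norm")
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = (3, 4)
l1 = lp_norm(x, 1)            # |3| + |4|
l2 = lp_norm(x, 2)            # sqrt(3^2 + 4^2)
linf = lp_norm(x, math.inf)   # max(|3|, |4|)
print(l1, l2, linf)

# As p grows, the Lp norm drifts down toward the largest component.
for p in (1, 2, 4, 8, 16, 32):
    print(p, lp_norm(x, p))
```

The loop at the end is the "same vector, different rulers" idea in motion: the printed values start at the L1 norm and settle toward the L-infinity norm as p increases.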
Manhattan vs Euclidean distance
Norms define distance: $d(\mathbf{a}, \mathbf{b}) = \lVert \mathbf{a} - \mathbf{b} \rVert$. The norm you pick determines the geometry.
L2 distance is the length of the straight line between two points (Euclidean). L1 distance is the sum of axis-aligned steps (Manhattan, because you walk along a grid like city blocks; it took me embarrassingly long to realize this was the origin of the name). Same two points, different paths, different numbers.
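A small sketch makes the difference concrete. The two points and the function names `manhattan` and `euclidean` are picked purely for illustration.

```python
import math

def manhattan(a, b):
    """L1 distance: total of the axis-aligned steps between a and b."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def euclidean(a, b):
    """L2 distance: length of the straight line between a and b."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

a = (1, 2)
b = (4, 6)

d1 = manhattan(a, b)   # 3 steps along x, then 4 steps along y
d2 = euclidean(a, b)   # hypotenuse of the 3-4-5 triangle
print(d1, d2)
```

The L1 distance can never be smaller than the L2 distance, since walking the grid is at best as short as cutting straight across.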
The unit ball
The unit ball is the set of all vectors with norm $\le 1$. Its shape reveals everything about how the norm behaves.
$B_p = \lbrace \mathbf{x} : \textcolor{#C9A227}{\lVert \mathbf{x} \rVert_p} \le 1 \rbrace$
At p=1, the unit ball is a diamond. The corners sit on the axes, where one coordinate is $\pm 1$ and the others are zero. At p=2, it is a circle with no corners. At p=$\infty$, it is a square (a hypercube in higher dimensions).
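One way to get a feel for the three shapes is to test a few points for membership. This is a rough sketch; `in_unit_ball` is a hypothetical helper and the points are chosen by hand to land in different regions.

```python
import math

def in_unit_ball(x, p):
    """True if x lies in the Lp unit ball (norm <= 1)."""
    if p == math.inf:
        norm = max(abs(v) for v in x)
    else:
        norm = sum(abs(v) ** p for v in x) ** (1.0 / p)
    return norm <= 1.0

points = [
    (1.0, 0.0),   # a corner of the diamond, on the boundary of all three
    (0.6, 0.6),   # outside the diamond, inside the circle and the square
    (0.9, 0.9),   # outside the diamond and the circle, inside the square
    (1.1, 0.0),   # outside all three
]

membership = {
    pt: [in_unit_ball(pt, p) for p in (1, 2, math.inf)]
    for pt in points
}
for pt, flags in membership.items():
    print(pt, flags)
```

The pattern in the output reflects that the diamond sits inside the circle, which sits inside the square.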
The corners matter. When you optimize subject to a norm constraint (stay inside the unit ball), the solution tends to land where the level sets of the objective first touch the constraint boundary. A diamond has corners on the axes, so that first contact often happens at a corner, where some coordinates are exactly zero. A circle has no corners, so the contact point generally has every coordinate nonzero. This geometry, together with how each norm grows (linearly for L1, quadratically for squared L2), is why:
- L1 loss penalizes errors in proportion to their size (linear growth, constant-magnitude gradient)
- L1 regularization produces sparse models (weights land at diamond corners)
- L2 loss amplifies large errors (quadratic growth)
- L2 regularization shrinks all weights but zeros out none (circle has no corners)
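The sparsity contrast in the list above can be seen in a one-dimensional sketch. A few assumptions to flag: this uses the penalized form (add the norm to the objective with a weight `lam`) rather than the constrained form described above, the objective is a simple quadratic around a hypothetical unpenalized weight `a`, and the names `l1_shrink` and `l2_shrink` are mine. The closed-form minimizers are the standard ones for this toy problem.

```python
def l1_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w) (soft-thresholding)."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # small weights are set exactly to zero

def l2_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2 (uniform scaling)."""
    return a / (1.0 + lam)

# Hypothetical unpenalized weights and penalty strength, for illustration only.
unpenalized = [3.0, -0.4, 0.2, -2.0]
lam = 0.5

w_l1 = [l1_shrink(a, lam) for a in unpenalized]
w_l2 = [l2_shrink(a, lam) for a in unpenalized]
print("L1:", w_l1)
print("L2:", w_l2)
```

With the L1 penalty, every weight whose magnitude is at most `lam` becomes exactly zero and the rest move toward zero by `lam`. With the L2 penalty, every weight is scaled by the same factor, so nothing that started nonzero ends at zero.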
Notation used in this series
The rest of the posts in this series use norms frequently. Here is a quick reference:
$\textcolor{#4A9EDE}{\lVert \mathbf{x} \rVert_1} = \sum|x_i| \quad \small\textsf{L1 norm}$
$\textcolor{#D85A30}{\lVert \mathbf{x} \rVert_2} = \sqrt{\sum x_i^2} \quad \small\textsf{L2 norm}$
$\textcolor{#D85A30}{\lVert \mathbf{x} \rVert_2^2} = \sum x_i^2 \quad \small\textsf{squared L2 norm (no root)}$
In equations, you will often see the simplified forms $\sum|w_i|$ and $\sum w_i^2$. These are the L1 and squared L2 norms of the weight vector $\mathbf{w}$, written without the norm bars for brevity.
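In code, those simplified forms are just two sums. A minimal sketch, with a made-up weight vector `w`:

```python
import math

# Hypothetical weight vector, for illustration only.
w = [0.5, -1.5, 2.0]

l1_penalty = sum(abs(wi) for wi in w)   # sum |w_i|, the L1 norm
l2_squared = sum(wi ** 2 for wi in w)   # sum w_i^2, the squared L2 norm
l2_norm = math.sqrt(l2_squared)         # the L2 norm itself (with the root)

print(l1_penalty, l2_squared, l2_norm)
```

Note that the squared form skips the square root entirely, which is part of why it is the more common one in regularization terms.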