Norms: Measuring Size and Distance

What Lp norms are, why L1 and L2 measure different things, and how the unit ball shape explains everything from Manhattan distance to sparsity.

April 30, 2026
ml, linear-algebra, norms, interactive

A norm is a way to measure the "size" of a vector. You encounter two constantly in machine learning: the L1 norm and the L2 norm. They show up in loss functions, regularization, distance metrics, and optimization theory. They sound similar but measure fundamentally different things.

The Lp norm

The Lp norm of a vector $\mathbf{x}$, written $\textcolor{#C9A227}{\lVert \mathbf{x} \rVert_p}$, measures the "size" of that vector using a parameter $p \ge 1$. For a vector $\mathbf{x}$ with components $x_1, x_2, \ldots, x_n$:

$\textcolor{#C9A227}{\lVert \mathbf{x} \rVert_p} = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}$

Setting p = 1 or p = 2, or taking the limit as p goes to infinity, gives the three norms you see everywhere:

p = 1: $\lVert \mathbf{x} \rVert_1 = |x_1| + |x_2| + \cdots + |x_n|$. Sum of absolute values.
p = 2: $\lVert \mathbf{x} \rVert_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$. The familiar Euclidean length.
p = $\infty$: $\lVert \mathbf{x} \rVert_\infty = \max(|x_1|, |x_2|, \ldots, |x_n|)$. The largest component in absolute value.

The intuition for L-infinity: as p grows, raising each $|x_i|$ to a higher and higher power makes the largest component dominate the sum. The smaller components become negligible. In the limit, only the single largest component survives, so the constraint $\lVert \mathbf{x} \rVert_\infty \le 1$ just means each axis is independently clamped to [-1, 1].
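One way to make that limit precise, as a short sketch: write $M = \max_i |x_i|$ (a symbol introduced here for convenience) and assume $M > 0$. Factoring $M$ out of the sum gives

$\lVert \mathbf{x} \rVert_p = M \left(\sum_{i=1}^{n} \left(\frac{|x_i|}{M}\right)^p\right)^{1/p}$

Every ratio $|x_i| / M$ is at most 1, and at least one of them equals 1, so the sum inside the parentheses lies between 1 and $n$. That sandwiches the norm:

$M \le \lVert \mathbf{x} \rVert_p \le n^{1/p} M$

Since $n^{1/p} \to 1$ as $p \to \infty$, both bounds squeeze down to $M$, which is the L-infinity norm.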

Same vector, different rulers. The vector (3, 4) has L1 norm 7, L2 norm 5, and L-infinity norm 4.
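To check those three numbers, here is a minimal sketch in plain Python. The helper name `norm` and the use of `math.inf` to select the max-norm are choices made for this example, not part of any particular library.

```python
import math

def norm(x, p):
    """Lp norm of a sequence of numbers; p = math.inf selects the max-norm."""
    if p == math.inf:
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

v = [3, 4]
l1 = norm(v, 1)           # |3| + |4|
l2 = norm(v, 2)           # sqrt(3^2 + 4^2)
linf = norm(v, math.inf)  # max(|3|, |4|)

print(l1, l2, linf)
```

The three results come out in decreasing order. That ordering is general: for any vector, the Lp norm never increases as p grows.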

Manhattan vs Euclidean distance

Norms define distance: $d(\mathbf{a}, \mathbf{b}) = \lVert \mathbf{a} - \mathbf{b} \rVert$. The norm you pick determines the geometry.

L2 distance is the length of the straight line between two points (Euclidean). L1 distance is the sum of axis-aligned steps. It is called Manhattan distance because you walk along a grid like city blocks (it took me embarrassingly long to realize that this is where the name comes from). Same two points, different paths, different numbers.
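As a small illustration, the sketch below computes both distances between the same pair of points. The function names and the example points are made up for this snippet.

```python
import math

def manhattan(a, b):
    """L1 distance: total of the axis-aligned steps from a to b."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def euclidean(a, b):
    """L2 distance: length of the straight line from a to b."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

a, b = (1, 2), (4, 6)       # the difference a - b is (-3, -4)
d1 = manhattan(a, b)        # 3 blocks one way plus 4 blocks the other
d2 = euclidean(a, b)        # hypotenuse of a 3-4-5 triangle

print(d1, d2)
```

The Manhattan distance is never smaller than the Euclidean one, since the grid path can only be as long as the straight line or longer.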

The unit ball

The unit ball is the set of all vectors with norm $\le 1$. Its shape tells you most of what matters about how the norm behaves.

$B_p = \lbrace \mathbf{x} : \textcolor{#C9A227}{\lVert \mathbf{x} \rVert_p} \le 1 \rbrace$

In two dimensions, the unit ball at p=1 is a diamond. Its corners sit on the axes, where one coordinate is $\pm 1$ and the others are zero. At p=2, it is a circle with no corners (a sphere in higher dimensions). At p=$\infty$, it is a square (a hypercube in higher dimensions).
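A quick way to get a feel for these shapes is to test whether a given point falls inside each ball. The sketch below does that in plain Python. `norm` and `in_unit_ball` are hypothetical helpers written for this example, and the test points are arbitrary.

```python
import math

def norm(x, p):
    """Lp norm; p = math.inf selects the max-norm."""
    if p == math.inf:
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def in_unit_ball(x, p):
    """True if x lies in the closed unit ball of the Lp norm."""
    return norm(x, p) <= 1.0

# A point on the diagonal: outside the diamond, inside the circle and the square.
diagonal = (0.6, 0.6)
membership = {p: in_unit_ball(diagonal, p) for p in (1, 2, math.inf)}

# A point on an axis: a corner of the diamond, and on the boundary of all three.
corner = (1, 0)
corner_membership = {p: in_unit_ball(corner, p) for p in (1, 2, math.inf)}

print(membership)
print(corner_membership)
```

The diagonal point shows how the balls nest: the diamond sits inside the circle, which sits inside the square. All three touch only at the axis points such as (1, 0).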

The corners matter. When you optimize subject to a norm constraint (stay inside the unit ball), the solution tends to land where the constraint boundary first meets the objective. A diamond has corners on the axes, so solutions often have some coordinates that are exactly zero. A circle has no corners, so solutions generally have all nonzero coordinates. This geometry, together with how fast each norm grows, is why:

  • L1 loss treats all errors equally (constant gradient, linear growth)
  • L1 regularization produces sparse models (weights land at diamond corners)
  • L2 loss amplifies large errors (quadratic growth)
  • L2 regularization shrinks all weights but zeros out none (circle has no corners)
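The corner effect can be seen in the simplest possible setting: maximizing a linear objective $\mathbf{c} \cdot \mathbf{x}$ over a unit ball. This is a deliberately simplified stand-in for a real training loss, and the function names and the vector `c` below are assumptions made for the sketch. Over the L1 ball the maximum sits at a corner of the diamond. Over the L2 ball it sits at the boundary point that lines up with `c`.

```python
import math

def argmax_over_l1_ball(c):
    """Maximize c . x subject to ||x||_1 <= 1 (assumes c is not all zeros).

    The optimum is a corner of the diamond: all weight goes on the
    coordinate with the largest |c_i|, signed to match c_i.
    """
    k = max(range(len(c)), key=lambda i: abs(c[i]))
    x = [0.0] * len(c)
    x[k] = math.copysign(1.0, c[k])
    return x

def argmax_over_l2_ball(c):
    """Maximize c . x subject to ||x||_2 <= 1 (assumes c is not all zeros).

    The optimum is c rescaled to unit length, so every nonzero c_i
    yields a nonzero coordinate.
    """
    length = math.sqrt(sum(ci * ci for ci in c))
    return [ci / length for ci in c]

c = [0.5, -2.0, 1.0]
x_l1 = argmax_over_l1_ball(c)
x_l2 = argmax_over_l2_ball(c)

print("L1 ball:", x_l1)
print("L2 ball:", x_l2)
```

With this `c`, the L1 solution has two coordinates that are exactly zero, while the L2 solution has none. If two entries of `c` tie for the largest magnitude, the L1 optimum is no longer unique, and this sketch simply returns one of the optimal corners.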

Notation used in this series

The rest of the posts in this series use norms frequently. Here is a quick reference:

$\textcolor{#4A9EDE}{\lVert \mathbf{x} \rVert_1} = \sum|x_i| \quad \small\textsf{L1 norm}$

$\textcolor{#D85A30}{\lVert \mathbf{x} \rVert_2} = \sqrt{\sum x_i^2} \quad \small\textsf{L2 norm}$

$\textcolor{#D85A30}{\lVert \mathbf{x} \rVert_2^2} = \sum x_i^2 \quad \small\textsf{squared L2 norm (no root)}$

In equations, you will often see the simplified forms $\sum|w_i|$ and $\sum w_i^2$. These are the L1 and squared L2 norms of the weight vector $\mathbf{w}$, written without the norm bars for brevity.

FAQ

What is a norm in linear algebra?

A norm is a function that assigns a non-negative length or size to a vector. It must satisfy three properties: (1) non-negativity (the norm is zero only for the zero vector), (2) absolute homogeneity (scaling a vector by a factor c scales its norm by |c|), and (3) the triangle inequality (the norm of a sum is at most the sum of the norms). Different norms measure size differently, leading to different notions of distance and geometry.

What is the difference between L1 and L2 norm?

The L1 norm sums absolute values: ||x||_1 = |x_1| + |x_2| + ... + |x_n|. The L2 norm takes the square root of the sum of squared values: ||x||_2 = sqrt(x_1^2 + x_2^2 + ... + x_n^2). L1 corresponds to Manhattan distance (walking along a grid), L2 to Euclidean distance (straight line). The L1 unit ball is a diamond, the L2 unit ball is a circle.

What is the Lp norm?

The Lp norm generalizes L1 and L2: ||x||_p = (|x_1|^p + |x_2|^p + ... + |x_n|^p)^(1/p) for p >= 1. At p=1 you get L1 (diamond unit ball), at p=2 you get L2 (circular unit ball), and as p approaches infinity you get the L-infinity norm (square unit ball, max of absolute values). The unit ball shape smoothly morphs between these extremes.

Why does the L1 norm encourage sparsity?

The L1 unit ball is a diamond (cross-polytope) with sharp corners that sit on the coordinate axes. When you constrain an optimization to stay within the L1 ball, solutions tend to land at these corners, where one or more coordinates are exactly zero. This geometric property is why L1 regularization (Lasso) produces sparse models.