What is dimensionality reduction in machine learning?

Dimensionality reduction is the process of projecting high-dimensional data into a lower-dimensional space (usually 2D or 3D) while preserving as much meaningful structure as possible. It is used for visualization, noise removal, and as a preprocessing step before training models. The three most common methods are PCA (linear), t-SNE (nonlinear, local structure), and UMAP (nonlinear, both local and global structure).

What is the difference between PCA, t-SNE, and UMAP?

PCA is linear: it finds orthogonal axes of maximum variance and projects data onto them. It is fast, deterministic, and preserves global structure, but cannot capture nonlinear relationships. t-SNE preserves local neighborhoods by matching pairwise similarity distributions, producing well-separated clusters but often distorting global distances. UMAP builds a fuzzy topological graph and optimizes a low-dimensional layout to match it, preserving both local and global structure better than t-SNE, and running faster.

When should I use t-SNE vs UMAP?

Use t-SNE when you care primarily about local cluster separation and are working with a small dataset (under ~10,000 points). Use UMAP when you need to preserve global relationships between clusters, when you have a larger dataset, or when you need faster computation. UMAP also supports supervised and semi-supervised modes and can produce UMAP embeddings incrementally for new data.

What does the perplexity parameter do in t-SNE?

Perplexity controls the effective number of neighbors each point considers when computing pairwise similarities. Low perplexity (5-10) focuses on very local structure, producing tight micro-clusters. High perplexity (30-50) considers broader neighborhoods, producing smoother, more globally coherent embeddings. A common default is 30. The interactive demo above lets you see the effect directly.

Is PCA still useful if I have nonlinear data?

Yes. PCA is the right first step for most dimensionality reduction tasks. It is fast, deterministic, and interpretable. Even when data has nonlinear structure, PCA reveals the dominant axes of variance and how much of the total variance is captured. If PCA produces well-separated clusters, you may not need a nonlinear method at all. When it does not, that is a clear signal that t-SNE or UMAP will help.

Dimensionality Reduction

The problem

A shape sampled as an SDF on a 24x24x24 grid lives in 13,824 dimensions. A word embedding might have 768. A gene expression profile, 20,000. Most of those dimensions are redundant, correlated, or noise. Dimensionality reduction finds a compact representation that preserves the distinctions that matter: which shapes are similar, which gene profiles cluster together, which embeddings are neighbors.

A 2D scatter plot is just one expression of that compressed representation, useful because humans can look at it. But the same principle drives preprocessing pipelines (remove noise before training), feature compression (fewer dimensions, faster models), and latent space design (autoencoders, diffusion models). The question is always the same: what counts as "structure worth keeping"? Different methods give different answers.

Three Gaussian clusters in 3D, projected to 2D by three different methods. Click between them and watch the points rearrange. PCA preserves the overall spread. t-SNE and UMAP pull the clusters apart more aggressively, sacrificing global distances for local clarity.

PCA: the linear baseline

PCA finds orthogonal directions of maximum variance and projects data onto them. The first principal component captures the most spread, the second captures the most remaining spread perpendicular to the first, and so on.

Imagine a cloud of 2D points. Some directions through the cloud have a lot of spread (points are far apart), and others have very little (points are bunched together). If you had to summarize the cloud with a single arrow, you would point it in the direction where the data varies the most. That arrow is the first principal component (PC1). The second principal component (PC2) is the direction with the most remaining spread, perpendicular to the first.

Click "project onto PC1" to collapse every point onto the PC1 line. The cyan dots are the projected positions: 2D data compressed to 1D. The dashed lines are what gets lost. PCA chooses the direction that makes those dashed lines as short as possible.

Centering the data

Before finding directions, shift the entire cloud so its center sits at the origin. Compute the mean and subtract it from every point:

$\textcolor{#7F77DD}{\boldsymbol{\mu}} = \frac{1}{\textcolor{#8CB4D5}{M}}\sum_{i=1}^{\textcolor{#8CB4D5}{M}} \mathbf{x}_i$

$\textcolor{#7F77DD}{\boldsymbol{\mu}}$	The mean (center of mass). Computed element by element across all data points.
$\textcolor{#8CB4D5}{M}$	Number of data points.
$\mathbf{x}_i$	A single data point.

Why center? PCA finds directions of maximum spread from a reference point. If the data is off-center, the strongest "direction" would just point from the origin toward the cloud, which tells you nothing about the cloud's shape. Centering removes that offset so every direction PCA finds reflects genuine variation in the data.

Variance and the covariance matrix

Variance ($\textcolor{#E8725C}{\sigma^2}$) measures how spread out values are along a single axis. Take the distance of each value from the mean, square it, and average:

$\textcolor{#E8725C}{\sigma_x^2} = \text{var}(x) = \frac{1}{\textcolor{#8CB4D5}{M}}\sum_{i=1}^{\textcolor{#8CB4D5}{M}}(x_i - \textcolor{#7F77DD}{\mu_x})^2$

Covariance ($\textcolor{#D36BE0}{\sigma_{xy}}$) measures how much two axes vary together. Replace the squared single-axis term with a product across both axes:

$\textcolor{#D36BE0}{\sigma_{xy}} = \text{cov}(x, y) = \frac{1}{\textcolor{#8CB4D5}{M}}\sum_{i=1}^{\textcolor{#8CB4D5}{M}}(x_i - \textcolor{#7F77DD}{\mu_x})(y_i - \textcolor{#7F77DD}{\mu_y})$

Positive covariance means x and y tend to move together. Negative means opposite directions. Near zero means independent.

The covariance matrix $\textcolor{#C9A227}{\mathbf{C}}$ packages these into one object: variances on the diagonal, covariances off-diagonal:

$\textcolor{#C9A227}{\mathbf{C}} = \frac{1}{\textcolor{#8CB4D5}{M}}\textcolor{#E07B9D}{\tilde{\mathbf{X}}}^\top \textcolor{#E07B9D}{\tilde{\mathbf{X}}} = \begin{bmatrix} \textcolor{#E8725C}{\sigma_x^2} & \textcolor{#D36BE0}{\sigma_{xy}} \\\\ \textcolor{#D36BE0}{\sigma_{xy}} & \textcolor{#E8725C}{\sigma_y^2} \end{bmatrix}$

$\textcolor{#C9A227}{\mathbf{C}}$	Covariance matrix ($D \times D$). Diagonal: variances ($\textcolor{#E8725C}{\sigma^2}$). Off-diagonal: covariances ($\textcolor{#D36BE0}{\sigma_{xy}}$).
$\textcolor{#E07B9D}{\tilde{\mathbf{X}}}$	Centered data matrix ($\textcolor{#8CB4D5}{M} \times D$). Each row is one data point with the mean subtracted.

Here is the key property: if you pick any unit-length direction $\mathbf{u}$ and project the data onto it, the variance of the projected values is $\mathbf{u}^\top \textcolor{#C9A227}{\mathbf{C}} \mathbf{u}$. PCA asks: which $\mathbf{u}$ maximizes this? The answer is an eigenvector of $\textcolor{#C9A227}{\mathbf{C}}$.

Why? Because $\textcolor{#C9A227}{\mathbf{C}}$ is symmetric ($C_{ij} = C_{ji}$). Symmetric matrices have a special guarantee: their eigenvectors are perpendicular to each other, and the eigenvalues are real numbers. The eigenvectors form a clean set of axes, and the eigenvalue of each axis is the variance along that axis. The direction with the largest eigenvalue has the most variance, which is exactly what PCA is looking for.

Eigenvectors: the directions that matter

A matrix transforms vectors: it stretches them, rotates them, or both. Most vectors get changed in complicated ways. But certain special directions only get stretched (or shrunk), with no rotation at all. These are the eigenvectors, and the stretch factor for each one is its eigenvalue:

$\textcolor{#C9A227}{\mathbf{C}}\mathbf{u} = \textcolor{#E8725C}{\lambda} \mathbf{u}$

Read this as: "when I apply $\textcolor{#C9A227}{\mathbf{C}}$ to the vector $\mathbf{u}$, the result points in the same direction as $\mathbf{u}$, just scaled by $\textcolor{#E8725C}{\lambda}$." Most vectors would get rotated. Eigenvectors are the special ones that don't.

Three vectors start on the unit circle: u (white, drag to rotate), PC1, and PC2. Slide the transform to apply $\textcolor{#C9A227}{\mathbf{C}}$. The eigenvectors stretch along their line without rotating. The white vector gets knocked off its original direction.

$\mathbf{u}_k$	The $k$-th principal direction (eigenvector of $\textcolor{#C9A227}{\mathbf{C}}$). The direction where the data has the $k$-th most spread.
$\textcolor{#E8725C}{\lambda_k}$	Eigenvalue: the variance of the data along $\mathbf{u}_k$. Larger means more important.

Sort eigenvectors by eigenvalue (largest first) and you get the principal components in order of importance:

PC1: largest eigenvalue, direction of maximum variance.
PC2: second largest, maximum remaining variance, perpendicular to PC1.
PC3, PC4, ...: each captures the next largest chunk, always perpendicular to all previous.

Projection and reconstruction

To compress a data point, project its centered vector onto each principal direction. The dot product gives a single scalar: how far along that direction the point sits.

$\textcolor{#18B8C8}{c_k} = \langle \mathbf{x} - \textcolor{#7F77DD}{\boldsymbol{\mu}},~ \mathbf{u}_k \rangle$

To reconstruct, multiply each coefficient by its direction and add back the mean:

$\textcolor{#18B8C8}{\hat{\mathbf{x}}} = \textcolor{#7F77DD}{\boldsymbol{\mu}} + \sum_{k=1}^{K} \textcolor{#18B8C8}{c_k} \mathbf{u}_k$

$\textcolor{#18B8C8}{c_k}$	Coefficient for the $k$-th component. A single scalar encoding "how much of direction $k$ is in this data point."
$K$	Number of components kept. More components, better reconstruction, larger representation.
$\textcolor{#18B8C8}{\hat{\mathbf{x}}}$	The reconstructed data point. An approximation of $\mathbf{x}$ using only $K$ numbers plus the shared mean and directions.

With all $D$ components ($K = D$), the reconstruction is exact. With fewer, the error is exactly the variance in the dropped components. PCA guarantees this is the smallest possible error for any linear projection of rank $K$.

PCA is fast, deterministic, and interpretable. It has a closed-form solution (eigendecomposition or SVD). The same input always produces the same output. But it can only find linear structure. If your data lives on a curved manifold (concentric rings, a swiss roll, a branching tree), PCA will flatten that structure and mix things together.

t-SNE: preserving neighborhoods

t-SNE (t-distributed Stochastic Neighbor Embedding) asks a different question: instead of maximizing variance, can we place points in 2D so that nearby points in high-D stay nearby in low-D?

The answer is built from four ideas: turning distances into similarities, matching those similarities in a low-dimensional embedding, measuring the mismatch with KL divergence, and fixing it with gradient descent. Each step follows from the previous one.

Distance vs. similarity

A distance is large when points are far apart. A similarity is large when points are close. Every nonlinear embedding method needs to convert distances into similarities, because the optimization target is "make the similarity structure match," not "make the raw distances match."

The raw squared Euclidean distance between two points $\textcolor{#E07B9D}{\mathbf{x}_i}$ and $\textcolor{#E07B9D}{\mathbf{x}_j}$ is:

$\textcolor{#CC4444}{d_{ij}^2} = \lVert \textcolor{#E07B9D}{\mathbf{x}_i} - \textcolor{#E07B9D}{\mathbf{x}_j} \rVert^2 = \sum_{d=1}^{D} (x_{i,d} - x_{j,d})^2$

$\textcolor{#CC4444}{d_{ij}^2}$	Squared Euclidean distance between points $i$ and $j$.
$\textcolor{#E07B9D}{\mathbf{x}_i}, \textcolor{#E07B9D}{\mathbf{x}_j}$	Data points in the original $D$-dimensional space.
$D$	Number of dimensions in the original space.

This distance is zero when the points overlap and grows without bound as they separate. We need to flip it: zero distance should become maximum similarity, and large distances should decay to near-zero similarity. The tool for this is the Gaussian kernel.

The Gaussian kernel

Apply a Gaussian (bell curve) to the squared distance:

$\textcolor{#C9A227}{K_{ij}} = \exp\!\left(-\frac{\textcolor{#CC4444}{d_{ij}^2}}{2\textcolor{#7F77DD}{\sigma}^2}\right)$

$\textcolor{#C9A227}{K_{ij}}$	Unnormalized similarity between points $i$ and $j$. Ranges from 1 (identical) to 0 (infinitely far).
$\textcolor{#CC4444}{d_{ij}^2}$	Squared Euclidean distance between the two points.
$\textcolor{#7F77DD}{\sigma}$	Bandwidth parameter. Controls how quickly similarity decays with distance.

Why the Gaussian specifically? It is the maximum-entropy distribution for a given mean and variance. That means it encodes the least amount of assumptions about the data beyond "nearby points are similar." Points within about one $\textcolor{#7F77DD}{\sigma}$ get high similarity. Beyond two or three $\textcolor{#7F77DD}{\sigma}$, similarity drops to effectively zero. The bandwidth acts as a soft cutoff: a small $\textcolor{#7F77DD}{\sigma}$ means only very close points count as neighbors, while a large one spreads the neighborhood wider.

The left panel below shows distances from a reference point $\textcolor{#E07B9D}{\mathbf{x}_i}$ to each neighbor. The right panel maps those distances onto the Gaussian kernel curve: nearby points land high on the curve (strong similarity), distant points land near zero. Drag the reference point and adjust $\textcolor{#7F77DD}{\sigma}$ to see how the bandwidth reshapes the neighborhood.

Conditional probability $\textcolor{#E8725C}{p_{j|i}}$

The raw kernel values $\textcolor{#C9A227}{K_{ij}}$ tell us that point $j$ is "more similar" to $i$ than point $k$ is, but they don't tell us how much more. A kernel value of 0.8 means nothing on its own because the scale depends entirely on $\textcolor{#7F77DD}{\sigma_i}$. More importantly, you can't compare raw values across different reference points. A point in a dense cluster might have five neighbors all scoring 0.7, while an isolated point has one neighbor scoring 0.7 and nothing else nearby. The kernel treats both "0.7" relationships the same, but structurally they are very different: the first is one of many equally strong connections, the second is the only connection that matters.

Normalizing fixes this. Dividing each kernel value by the sum of all kernel values from point $i$ gives us a probability distribution: "given point $i$, what fraction of its total similarity budget goes to point $j$?" Now the values are comparable across points. A point in a dense cluster spreads its probability mass across many neighbors, while an isolated point concentrates it on the few neighbors it has. This is exactly the local structure we want to preserve.

Click different points below to see this in action. When you select a point in the dense cluster (A through E), its probability mass spreads across several nearby neighbors. Select one of the sparse points (F or G) and almost all of the budget goes to its single close neighbor. Adjust $\textcolor{#7F77DD}{\sigma}$ to see how bandwidth changes the shape of each distribution.

Formally, the conditional probability that point $i$ would pick point $j$ as its neighbor is:

$\textcolor{#E8725C}{p_{j|i}} = \frac{\exp\!\left(-\lVert \textcolor{#E07B9D}{\mathbf{x}_i} - \textcolor{#E07B9D}{\mathbf{x}_j} \rVert^2 / 2\textcolor{#7F77DD}{\sigma}^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert \textcolor{#E07B9D}{\mathbf{x}_i} - \textcolor{#E07B9D}{\mathbf{x}_k} \rVert^2 / 2\textcolor{#7F77DD}{\sigma}^2\right)}$

$\textcolor{#E8725C}{p_{j\|i}}$	Probability that point $i$ picks point $j$ as its neighbor. Sums to 1 over all $j \neq i$.
$\textcolor{#E07B9D}{\mathbf{x}_i}$	The "reference" point. The Gaussian is centered here.
$\textcolor{#E07B9D}{\mathbf{x}_j}$	A candidate neighbor.
$\textcolor{#7F77DD}{\sigma}$	Bandwidth. Controls the neighborhood radius. We will see shortly how this gets set.
$\sum_{k \neq i}$	The denominator: sum of all kernel values from $i$ to every other point. Makes the probabilities sum to 1.

The denominator is just a normalizing constant. It ensures we get a valid probability distribution: all the $\textcolor{#E8725C}{p_{j|i}}$ values for a given $i$ add up to 1. Notice that $\textcolor{#E8725C}{p_{i|i}} = 0$ by convention (a point is not its own neighbor).

Toy example: computing $\textcolor{#E8725C}{p_{j|i}}$ by hand

Take five points in 2D:

$\textcolor{#E07B9D}{\mathbf{x}_1} = (0, 0), \quad \textcolor{#E07B9D}{\mathbf{x}_2} = (1, 0), \quad \textcolor{#E07B9D}{\mathbf{x}_3} = (3, 0), \quad \textcolor{#E07B9D}{\mathbf{x}_4} = (0, 4), \quad \textcolor{#E07B9D}{\mathbf{x}_5} = (5, 5)$

Compute the conditional distribution from $\textcolor{#E07B9D}{\mathbf{x}_1} = (0,0)$ with $\textcolor{#7F77DD}{\sigma_1} = 1.5$.

Step 1: squared distances from $\textcolor{#E07B9D}{\mathbf{x}_1}$.

Pair	$\textcolor{#CC4444}{d_{1j}^2}$
$\textcolor{#E07B9D}{\mathbf{x}_1} \to \textcolor{#E07B9D}{\mathbf{x}_2}$	$(1-0)^2 + (0-0)^2 = \textcolor{#CC4444}{1}$
$\textcolor{#E07B9D}{\mathbf{x}_1} \to \textcolor{#E07B9D}{\mathbf{x}_3}$	$(3-0)^2 + (0-0)^2 = \textcolor{#CC4444}{9}$
$\textcolor{#E07B9D}{\mathbf{x}_1} \to \textcolor{#E07B9D}{\mathbf{x}_4}$	$(0-0)^2 + (4-0)^2 = \textcolor{#CC4444}{16}$
$\textcolor{#E07B9D}{\mathbf{x}_1} \to \textcolor{#E07B9D}{\mathbf{x}_5}$	$(5-0)^2 + (5-0)^2 = \textcolor{#CC4444}{50}$

Step 2: apply the Gaussian kernel with $2\textcolor{#7F77DD}{\sigma_1}^2 = 2(\textcolor{#7F77DD}{1.5})^2 = \textcolor{#7F77DD}{4.5}$.

Pair	$\exp(-\textcolor{#CC4444}{d^2} / \textcolor{#7F77DD}{4.5})$	$\textcolor{#C9A227}{K_{1j}}$
$\to \textcolor{#E07B9D}{\mathbf{x}_2}$	$\exp(-\textcolor{#CC4444}{1}/\textcolor{#7F77DD}{4.5}) = \exp(-0.222)$	$\textcolor{#C9A227}{0.801}$
$\to \textcolor{#E07B9D}{\mathbf{x}_3}$	$\exp(-\textcolor{#CC4444}{9}/\textcolor{#7F77DD}{4.5}) = \exp(-2.0)$	$\textcolor{#C9A227}{0.135}$
$\to \textcolor{#E07B9D}{\mathbf{x}_4}$	$\exp(-\textcolor{#CC4444}{16}/\textcolor{#7F77DD}{4.5}) = \exp(-3.556)$	$\textcolor{#C9A227}{0.029}$
$\to \textcolor{#E07B9D}{\mathbf{x}_5}$	$\exp(-\textcolor{#CC4444}{50}/\textcolor{#7F77DD}{4.5}) = \exp(-11.11)$	$\textcolor{#C9A227}{\approx 0}$

Step 3: normalize. The denominator is $\textcolor{#C9A227}{0.801} + \textcolor{#C9A227}{0.135} + \textcolor{#C9A227}{0.029} + \textcolor{#C9A227}{0} = \textcolor{#8CB4D5}{0.965}$.

	$\textcolor{#C9A227}{K_{1j}} / \textcolor{#8CB4D5}{0.965}$	$\textcolor{#E8725C}{p_{j\|1}}$
$\textcolor{#E8725C}{p_{2\\|1}}$	$\textcolor{#C9A227}{0.801} / \textcolor{#8CB4D5}{0.965}$	$\textcolor{#E8725C}{0.830}$
$\textcolor{#E8725C}{p_{3\\|1}}$	$\textcolor{#C9A227}{0.135} / \textcolor{#8CB4D5}{0.965}$	$\textcolor{#E8725C}{0.140}$
$\textcolor{#E8725C}{p_{4\\|1}}$	$\textcolor{#C9A227}{0.029} / \textcolor{#8CB4D5}{0.965}$	$\textcolor{#E8725C}{0.030}$
$\textcolor{#E8725C}{p_{5\\|1}}$	$\textcolor{#C9A227}{0} / \textcolor{#8CB4D5}{0.965}$	$\textcolor{#E8725C}{\approx 0}$

Point $\textcolor{#E07B9D}{\mathbf{x}_2}$ (distance $\textcolor{#CC4444}{1}$) gets $\textcolor{#E8725C}{83\%}$ of the probability mass. Point $\textcolor{#E07B9D}{\mathbf{x}_3}$ (distance $\textcolor{#CC4444}{3}$) gets $\textcolor{#E8725C}{14\%}$. Point $\textcolor{#E07B9D}{\mathbf{x}_4}$ (distance $\textcolor{#CC4444}{4}$) gets $\textcolor{#E8725C}{3\%}$. Point $\textcolor{#E07B9D}{\mathbf{x}_5}$ (distance $\textcolor{#CC4444}{\sqrt{50} \approx 7.07}$) is so far away it gets essentially zero. The Gaussian has converted the distance ordering into a probability distribution concentrated on the nearest neighbors.

Perplexity: how $\textcolor{#7F77DD}{\sigma}$ gets chosen

So far we have been treating $\textcolor{#7F77DD}{\sigma}$ as a single knob. But a single bandwidth can't handle a dataset where some regions are dense and others are sparse. In a dense cluster, a large $\textcolor{#7F77DD}{\sigma}$ would blur together points that should be distinct neighbors. In a sparse region, a small $\textcolor{#7F77DD}{\sigma}$ would make the Gaussian so narrow that it can't reach any neighbors at all.

t-SNE solves this by giving each point its own bandwidth $\textcolor{#7F77DD}{\sigma_i}$. But rather than making you set $N$ different bandwidths, it lets you specify how many effective neighbors each point should have. This is the perplexity parameter. You say "I want each point to attend to roughly 30 neighbors," and the algorithm finds the $\textcolor{#7F77DD}{\sigma_i}$ that achieves that for each point individually. Points in dense regions get a small $\textcolor{#7F77DD}{\sigma_i}$ (many close neighbors, so a tight Gaussian is enough to cover 30 of them). Points in sparse regions get a large $\textcolor{#7F77DD}{\sigma_i}$ (neighbors are farther away, so the Gaussian must spread wider to reach the same count).

But what does "effectively 30 neighbors" mean when the probability distribution is continuous, not a hard cutoff? This is where entropy comes in. The Shannon entropy of a distribution measures how spread out it is:

$H(P_i) = -\sum_{j \neq i} \textcolor{#E8725C}{p_{j|i}} \log_2 \textcolor{#E8725C}{p_{j|i}}$

If all the probability lands on a single neighbor, entropy is 0. If probability is spread uniformly over 30 neighbors, entropy is $\log_2 30 \approx 4.91$. Perplexity exponentiates this back to a count:

$\textcolor{#E8725C}{\text{Perp}(P_i)} = 2^{H(P_i)}$

$\textcolor{#E8725C}{\text{Perp}(P_i)}$	Perplexity of the distribution centered on point $i$. Interpretable as the effective number of neighbors.
$H(P_i)$	Shannon entropy (in bits). Higher entropy means probability is more spread out.

So perplexity literally counts how many neighbors the distribution is effectively spread over. A perplexity of 30 means the probability distribution "looks like" it's attending to about 30 neighbors, even though it doesn't hard-cut at 30.

The user sets a target perplexity (typically 5 to 50). For each point $i$, a binary search adjusts $\textcolor{#7F77DD}{\sigma_i}$ until the actual perplexity matches the target. This is how the algorithm adapts to local density without the user having to manually tune anything per-point.

In the toy example above, the perplexity for $\textcolor{#E07B9D}{\mathbf{x}_1}$ with $\textcolor{#7F77DD}{\sigma_1} = 1.5$ would be: $H = -(\textcolor{#E8725C}{0.830} \log_2 \textcolor{#E8725C}{0.830} + \textcolor{#E8725C}{0.140} \log_2 \textcolor{#E8725C}{0.140} + \textcolor{#E8725C}{0.030} \log_2 \textcolor{#E8725C}{0.030} + 0 \cdot \ldots) = -(-0.268 - 0.403 - 0.152) = \textcolor{#8CB4D5}{0.823}$ bits, giving perplexity $2^{\textcolor{#8CB4D5}{0.823}} \approx \textcolor{#8CB4D5}{1.77}$. That is roughly 2 effective neighbors, which makes sense: the probability is concentrated on $\textcolor{#E07B9D}{\mathbf{x}_2}$ with some mass on $\textcolor{#E07B9D}{\mathbf{x}_3}$. If we wanted a perplexity of $\textcolor{#8CB4D5}{3}$, the binary search would increase $\textcolor{#7F77DD}{\sigma_1}$ to spread the distribution more evenly.

Remember the goal: t-SNE builds a probability distribution $\textcolor{#E8725C}{P}$ over pairs in high-D, then tries to find a low-dimensional layout whose distribution $\textcolor{#18B8C8}{Q}$ matches it. Perplexity controls what $\textcolor{#E8725C}{P}$ looks like, which determines what structure the embedding tries to preserve. At low perplexity, each point only attends to a handful of nearest neighbors, so the embedding preserves very local structure but fragments larger patterns into isolated clumps. At high perplexity, each point considers a wider neighborhood, and the embedding preserves more of the global ring structure at the cost of smearing fine detail. There is no single "correct" value. The default of 30 is a reasonable starting point for most datasets.

Symmetrization

The conditional probabilities are asymmetric: $\textcolor{#E8725C}{p_{j|i}} \neq \textcolor{#E8725C}{p_{i|j}}$ in general. Point $i$ might consider $j$ a close neighbor (small $\textcolor{#7F77DD}{\sigma_i}$, dense region) while $j$ barely notices $i$ (large $\textcolor{#7F77DD}{\sigma_j}$, sparse region). t-SNE symmetrizes by averaging:

$\textcolor{#E8725C}{p_{ij}} = \frac{\textcolor{#E8725C}{p_{j|i}} + \textcolor{#E8725C}{p_{i|j}}}{2\textcolor{#8CB4D5}{N}}$

$\textcolor{#E8725C}{p_{ij}}$	Symmetric pairwise similarity. The joint probability of points $i$ and $j$ being neighbors.
$\textcolor{#E8725C}{p_{j\|i}}, \textcolor{#E8725C}{p_{i\|j}}$	Conditional probabilities from each direction.
$\textcolor{#8CB4D5}{N}$	Total number of data points. Dividing by $2\textcolor{#8CB4D5}{N}$ ensures $\sum_{i,j} \textcolor{#E8725C}{p_{ij}} = 1$.

Why symmetrize? Two reasons. First, it makes the math cleaner: the KL divergence objective is over a single joint distribution, not $\textcolor{#8CB4D5}{N}$ separate conditional distributions. Second, it prevents outliers from being ignored. If $i$ is an outlier, $\textcolor{#E8725C}{p_{j|i}}$ is spread thinly over many points, but some nearby point $j$ in the dense cluster will still have a nontrivial $\textcolor{#E8725C}{p_{i|j}}$. The average ensures the outlier retains at least some similarity to its nearest neighbors.

The crowding problem and the Student-t kernel

Now we need to place points in 2D and define similarities there. The naive approach would use another Gaussian kernel in the low-dimensional space. But this fails badly due to the crowding problem.

Consider the volume of a thin spherical shell at distance $r$ in $d$ dimensions. It grows as $r^{d-1}$. In 100 dimensions, a shell at radius 10 has roughly $10^{99}$ times the volume of a shell at radius 1. That means in high-D, there is exponentially more "room" at moderate distances than at small distances. Many points can be moderate-distance neighbors without crowding each other.

In 2D, the shell at radius 10 has only $10^1 = 10$ times the volume of the shell at radius 1. There is far less room. If you try to embed all those moderate-distance high-D neighbors at moderate distances in 2D, they physically cannot fit. They all crowd into the center, crushing the structure.

The solution: use a Student-t distribution with one degree of freedom (which is a Cauchy distribution) instead of a Gaussian in the low-dimensional space:

$\textcolor{#18B8C8}{q_{ij}} = \frac{(1 + \lVert \textcolor{#18B8C8}{\mathbf{y}_i} - \textcolor{#18B8C8}{\mathbf{y}_j} \rVert^2)^{-1}}{\sum_{k \neq l}(1 + \lVert \textcolor{#18B8C8}{\mathbf{y}_k} - \textcolor{#18B8C8}{\mathbf{y}_l} \rVert^2)^{-1}}$

$\textcolor{#18B8C8}{q_{ij}}$	Low-dimensional similarity between points $i$ and $j$. Analogous to $\textcolor{#E8725C}{p_{ij}}$ but computed from the 2D positions.
$\textcolor{#18B8C8}{\mathbf{y}_i}, \textcolor{#18B8C8}{\mathbf{y}_j}$	Positions of points $i$ and $j$ in the 2D embedding.
$(1 + \lVert \cdot \rVert^2)^{-1}$	Student-t kernel (Cauchy). Decays as $1/r^2$ for large $r$, much slower than $\exp(-r^2)$.
$\sum_{k \neq l}$	Normalization sum over all pairs. Makes $\textcolor{#18B8C8}{q_{ij}}$ a proper distribution.

The key difference from a Gaussian: the Student-t has heavy tails. A Gaussian drops to near-zero by $3\sigma$. The Student-t decays polynomially as $1/r^2$, meaning distant points still have nonzero similarity. This gives far-apart points room to spread out in 2D without collapsing the local structure. Nearby points (where $(1 + r^2)^{-1} \approx 1$) are still mapped accurately. Distant points (where the heavy tail provides slack) can spread much farther apart than a Gaussian would allow. This is the "t" in t-SNE: the Student-t kernel that fixes the crowding problem.

KL divergence: the objective

t-SNE has two distributions: $\textcolor{#E8725C}{P}$ (high-D similarities, fixed) and $\textcolor{#18B8C8}{Q}$ (low-D similarities, depends on the embedding positions $\textcolor{#18B8C8}{\mathbf{y}}$). The goal is to make $\textcolor{#18B8C8}{Q}$ match $\textcolor{#E8725C}{P}$. The mismatch is measured by KL divergence.

KL divergence (Kullback-Leibler divergence) measures how different one probability distribution is from another:

$\textcolor{#D36BE0}{\text{KL}}(\textcolor{#E8725C}{P} \| \textcolor{#18B8C8}{Q}) = \sum_{i \neq j} \textcolor{#E8725C}{p_{ij}} \log \frac{\textcolor{#E8725C}{p_{ij}}}{\textcolor{#18B8C8}{q_{ij}}}$

$\textcolor{#D36BE0}{\text{KL}}(\textcolor{#E8725C}{P} \\| \textcolor{#18B8C8}{Q})$	KL divergence from $\textcolor{#E8725C}{P}$ to $\textcolor{#18B8C8}{Q}$. Always $\geq 0$, equals 0 only when $\textcolor{#E8725C}{P} = \textcolor{#18B8C8}{Q}$.
$\textcolor{#E8725C}{p_{ij}}$	High-D similarity (fixed). Computed once from the original data.
$\textcolor{#18B8C8}{q_{ij}}$	Low-D similarity (variable). Depends on the current 2D positions $\textcolor{#18B8C8}{\mathbf{y}}$.

KL divergence is asymmetric: $\textcolor{#D36BE0}{\text{KL}}(\textcolor{#E8725C}{P} \| \textcolor{#18B8C8}{Q}) \neq \textcolor{#D36BE0}{\text{KL}}(\textcolor{#18B8C8}{Q} \| \textcolor{#E8725C}{P})$. The direction matters enormously. Look at the formula: each term is weighted by $\textcolor{#E8725C}{p_{ij}}$. When $\textcolor{#E8725C}{p_{ij}}$ is large (points are true neighbors) but $\textcolor{#18B8C8}{q_{ij}}$ is small (the embedding placed them far apart), the ratio $\textcolor{#E8725C}{p}/\textcolor{#18B8C8}{q}$ is large and the $\log$ blows up. Big penalty. When $\textcolor{#E8725C}{p_{ij}}$ is small (points are not neighbors) but $\textcolor{#18B8C8}{q_{ij}}$ is large (the embedding placed them close), the $\textcolor{#E8725C}{p_{ij}}$ weight in front suppresses the term. Small penalty.

This asymmetry is why t-SNE preserves local neighborhoods faithfully but can distort global distances: it cares a lot about pulling true neighbors together, but barely cares if non-neighbors end up close. Two clusters that appear adjacent in a t-SNE plot may or may not be truly related in the original space.

The gradient: attractive and repulsive forces

Minimizing KL divergence by gradient descent moves the 2D points. The gradient with respect to position $\textcolor{#18B8C8}{\mathbf{y}_i}$ is:

$\frac{\partial \textcolor{#D36BE0}{\text{KL}}}{\partial \textcolor{#18B8C8}{\mathbf{y}_i}} = 4 \sum_{j \neq i} \left(\textcolor{#E8725C}{p_{ij}} - \textcolor{#18B8C8}{q_{ij}}\right) \left(1 + \lVert \textcolor{#18B8C8}{\mathbf{y}_i} - \textcolor{#18B8C8}{\mathbf{y}_j} \rVert^2\right)^{-1} \left(\textcolor{#18B8C8}{\mathbf{y}_i} - \textcolor{#18B8C8}{\mathbf{y}_j}\right)$

$\textcolor{#E8725C}{p_{ij}} - \textcolor{#18B8C8}{q_{ij}}$	The mismatch. Positive when the points should be closer ($\textcolor{#E8725C}{p_{ij}} > \textcolor{#18B8C8}{q_{ij}}$). Negative when they should be farther apart.
$(1 + \lVert \textcolor{#18B8C8}{\mathbf{y}_i} - \textcolor{#18B8C8}{\mathbf{y}_j} \rVert^2)^{-1}$	The Student-t kernel weight. Nearby points feel stronger forces.
$\textcolor{#18B8C8}{\mathbf{y}_i} - \textcolor{#18B8C8}{\mathbf{y}_j}$	Direction of the force: along the line connecting $i$ and $j$.

This gradient has a physical interpretation. Each pair of points exerts a spring-like force on each other:

Attractive force (when $\textcolor{#E8725C}{p_{ij}} > \textcolor{#18B8C8}{q_{ij}}$): points that should be neighbors but are too far apart in 2D get pulled together.
Repulsive force (when $\textcolor{#18B8C8}{q_{ij}} > \textcolor{#E8725C}{p_{ij}}$): points that are too close in 2D relative to their true similarity get pushed apart.

The balance between these forces produces the final layout. In the early iterations, t-SNE uses "early exaggeration" (multiplying all $\textcolor{#E8725C}{p_{ij}}$ by a factor of 4-12) to strengthen attractive forces, which helps clusters form quickly. Later, the exaggeration is removed and the repulsive forces refine the spacing.

Caveats

t-SNE is non-convex: different random initializations produce different embeddings. Cluster sizes and inter-cluster distances in t-SNE plots are not meaningful. Two clusters that appear the same size may have very different true sizes. Two clusters that appear far apart may not actually be more different than two that appear close. Read the local neighborhoods, not the global layout.

UMAP: graphs and topology

UMAP (Uniform Manifold Approximation and Projection) shares t-SNE's goal of preserving local structure, but approaches it through a different mathematical framework: fuzzy topology. The practical differences are that UMAP better preserves global structure, runs faster on large datasets, and uses a cross-entropy objective instead of KL divergence.

Building the k-NN graph

The first step is to find each point's $\textcolor{#8CB4D5}{k}$ nearest neighbors (controlled by n_neighbors, analogous to perplexity in t-SNE).

Using the same five toy points from before:

With $\textcolor{#8CB4D5}{k} = 2$ (each point keeps its 2 nearest neighbors):

Point	Neighbors (by distance)
$\textcolor{#E07B9D}{\mathbf{x}_1}$	$\textcolor{#E07B9D}{\mathbf{x}_2}$ (dist 1), $\textcolor{#E07B9D}{\mathbf{x}_3}$ (dist 3)
$\textcolor{#E07B9D}{\mathbf{x}_2}$	$\textcolor{#E07B9D}{\mathbf{x}_1}$ (dist 1), $\textcolor{#E07B9D}{\mathbf{x}_3}$ (dist 2)
$\textcolor{#E07B9D}{\mathbf{x}_3}$	$\textcolor{#E07B9D}{\mathbf{x}_2}$ (dist 2), $\textcolor{#E07B9D}{\mathbf{x}_1}$ (dist 3)
$\textcolor{#E07B9D}{\mathbf{x}_4}$	$\textcolor{#E07B9D}{\mathbf{x}_1}$ (dist 4), $\textcolor{#E07B9D}{\mathbf{x}_2}$ (dist $\sqrt{17} \approx 4.12$)
$\textcolor{#E07B9D}{\mathbf{x}_5}$	$\textcolor{#E07B9D}{\mathbf{x}_3}$ (dist $\sqrt{29} \approx 5.39$), $\textcolor{#E07B9D}{\mathbf{x}_4}$ (dist $\sqrt{26} \approx 5.10$)

Points $\textcolor{#E07B9D}{\mathbf{x}_1}$, $\textcolor{#E07B9D}{\mathbf{x}_2}$, $\textcolor{#E07B9D}{\mathbf{x}_3}$ form a tightly connected group. Points $\textcolor{#E07B9D}{\mathbf{x}_4}$ and $\textcolor{#E07B9D}{\mathbf{x}_5}$ are more isolated, only connected to the dense group through their nearest-neighbor edges. This k-NN graph is the skeleton of the embedding.

Exponential decay weights

UMAP assigns edge weights using an exponential decay from each point's nearest neighbor:

$\textcolor{#1D9E75}{w_{i \to j}} = \exp\!\left(-\frac{\max(0,\; \textcolor{#CC4444}{d_{ij}} - \textcolor{#CC4444}{\rho_i})}{\textcolor{#7F77DD}{\sigma_i}}\right)$

$\textcolor{#1D9E75}{w_{i \to j}}$	Directed edge weight from point $i$ to point $j$. Ranges from 1 (nearest neighbor) to 0 (far away).
$\textcolor{#CC4444}{d_{ij}}$	Distance from $i$ to $j$ (not squared, unlike t-SNE).
$\textcolor{#CC4444}{\rho_i}$	Distance from $i$ to its nearest neighbor. This ensures every point has at least one connection with weight 1.
$\textcolor{#7F77DD}{\sigma_i}$	Bandwidth for point $i$. Found by binary search, analogous to t-SNE's $\textcolor{#7F77DD}{\sigma_i}$, calibrated so that $\sum_j \textcolor{#1D9E75}{w_{i \to j}} = \log_2(\textcolor{#8CB4D5}{k})$.

The $\textcolor{#CC4444}{\rho_i}$ offset is important. It subtracts the distance to the nearest neighbor, so the nearest neighbor always gets weight 1 regardless of how isolated the point is. This ensures that even outliers in sparse regions remain connected to the graph. Without it, an isolated point might have all its weights decay to near-zero, effectively disconnecting from the rest of the data.

Fuzzy set union (symmetrization)

The k-NN graph is directed: point $i$ might consider $j$ a neighbor, but $j$ might not reciprocate. UMAP symmetrizes using a fuzzy set union, which is the probabilistic OR:

$\textcolor{#1D9E75}{w_{ij}} = \textcolor{#1D9E75}{w_{i \to j}} + \textcolor{#1D9E75}{w_{j \to i}} - \textcolor{#1D9E75}{w_{i \to j}} \cdot \textcolor{#1D9E75}{w_{j \to i}}$

$\textcolor{#1D9E75}{w_{ij}}$	Symmetric edge weight. The probability that at least one direction considers this a neighbor edge.
$\textcolor{#1D9E75}{w_{i \to j}}$	Directed weight from $i$ to $j$.
$\textcolor{#1D9E75}{w_{j \to i}}$	Directed weight from $j$ to $i$.

Why this formula instead of a simple average? Think of each directed weight as a probability: "the probability that $i$ considers $j$ a neighbor." The probability that at least one direction says "neighbor" is the union of two independent events: $P(A \cup B) = P(A) + P(B) - P(A)P(B)$. This is the product t-conorm from fuzzy set theory.

The result: if either direction says "neighbor," the edge exists with high weight. If $\textcolor{#1D9E75}{w_{i \to j}} = 0.9$ and $\textcolor{#1D9E75}{w_{j \to i}} = 0.1$, the symmetric weight is $0.9 + 0.1 - 0.09 = 0.91$. The strong direction dominates. Compare this to t-SNE's simple average, which would give $0.5$. UMAP's approach preserves connectivity more aggressively: it only takes one direction to see the relationship for the edge to survive.

The cross-entropy objective

UMAP optimizes a cross-entropy between the high-D graph weights $\textcolor{#1D9E75}{w_{ij}}$ and the low-D similarities $\textcolor{#1D9E75}{v_{ij}}$:

$\textcolor{#1D9E75}{\text{CE}} = \sum_{i \neq j} \left[ \textcolor{#1D9E75}{w_{ij}} \log \frac{\textcolor{#1D9E75}{w_{ij}}}{\textcolor{#1D9E75}{v_{ij}}} + (1 - \textcolor{#1D9E75}{w_{ij}}) \log \frac{1 - \textcolor{#1D9E75}{w_{ij}}}{1 - \textcolor{#1D9E75}{v_{ij}}} \right]$

$\textcolor{#1D9E75}{\text{CE}}$	Fuzzy set cross-entropy. The mismatch between the high-D and low-D graphs.
$\textcolor{#1D9E75}{w_{ij}}$	High-D edge weight (fixed). From the symmetrized k-NN graph.
$\textcolor{#1D9E75}{v_{ij}}$	Low-D edge weight (variable). Typically $\textcolor{#1D9E75}{v_{ij}} = (1 + a \lVert \textcolor{#18B8C8}{\mathbf{y}_i} - \textcolor{#18B8C8}{\mathbf{y}_j} \rVert^{2b})^{-1}$, a smooth approximation of the Student-t kernel with tunable parameters $a$ and $b$.

Compare this to t-SNE's KL divergence. There are two terms in each summand:

First term ($\textcolor{#1D9E75}{w_{ij}} \log \frac{\textcolor{#1D9E75}{w_{ij}}}{\textcolor{#1D9E75}{v_{ij}}}$): penalizes when $\textcolor{#1D9E75}{w_{ij}}$ is large but $\textcolor{#1D9E75}{v_{ij}}$ is small. True neighbors placed far apart. This is the attractive term, similar to what KL divergence does.
Second term ($(1 - \textcolor{#1D9E75}{w_{ij}}) \log \frac{1 - \textcolor{#1D9E75}{w_{ij}}}{1 - \textcolor{#1D9E75}{v_{ij}}}$): penalizes when $\textcolor{#1D9E75}{w_{ij}}$ is small but $\textcolor{#1D9E75}{v_{ij}}$ is large. Non-neighbors placed too close. This is the repulsive term, and it is what KL divergence mostly lacks.

KL divergence weights everything by $\textcolor{#E8725C}{p_{ij}}$, so terms where $\textcolor{#E8725C}{p_{ij}} \approx 0$ (non-neighbors) contribute almost nothing. Cross-entropy explicitly includes the complement $(1 - w)$, so non-neighbor pairs that end up close get a real penalty. This is the core mathematical reason UMAP preserves global structure better than t-SNE: it actively prevents non-neighbors from collapsing together, not just passively ignoring them.

Attractive and repulsive forces in SGD

UMAP optimizes with stochastic gradient descent, sampling edges rather than computing the full gradient over all pairs (which would be $O(\textcolor{#8CB4D5}{N}^2)$).

Attractive step: sample an edge $(i, j)$ with $\textcolor{#1D9E75}{w_{ij}} > 0$. Move $\textcolor{#18B8C8}{\mathbf{y}_i}$ and $\textcolor{#18B8C8}{\mathbf{y}_j}$ closer together.
Repulsive step: for the same point $i$, sample a random non-neighbor $l$ (negative sampling). Push $\textcolor{#18B8C8}{\mathbf{y}_i}$ and $\textcolor{#18B8C8}{\mathbf{y}_l}$ apart.

The negative sampling trick makes UMAP scale to millions of points. Instead of computing repulsive forces between all $O(\textcolor{#8CB4D5}{N}^2)$ pairs, it samples a small constant number of repulsive interactions per edge. The attractive forces come from the k-NN graph (sparse, $O(\textcolor{#8CB4D5}{N} \cdot \textcolor{#8CB4D5}{k})$ edges), and the repulsive forces are approximated by sampling. This is why UMAP is dramatically faster than t-SNE for large datasets.

The learning rate decreases over epochs, and the embedding converges when attractive and repulsive forces roughly balance at each point.

Side-by-side comparison

The difference is clearest on data with nonlinear structure. Concentric rings in 3D are a classic test case: PCA, being linear, projects them on top of each other. t-SNE and UMAP unravel them.

Three panels, same data. PCA mixes the rings because it can only find linear projections. t-SNE separates them into distinct clusters. UMAP separates them while preserving the relative ring ordering. Click regenerate to see different random samples.

When to use which

	PCA	t-SNE	UMAP
Type	Linear	Nonlinear	Nonlinear
Preserves	Global variance	Local neighborhoods	Local + global
Deterministic	Yes	No	No
Speed	Very fast	Slow ($O(n^2)$)	Fast (approx. NN)
Interpretable axes	Yes (PCs)	No	No
Best for	Initial exploration, preprocessing	Cluster visualization	Large-scale visualization
Distances meaningful	Yes	No (only neighbors)	Roughly
Objective	Max variance (eigendecomposition)	$\textcolor{#D36BE0}{\text{KL}}(\textcolor{#E8725C}{P} \\| \textcolor{#18B8C8}{Q})$	$\textcolor{#1D9E75}{\text{CE}}(\textcolor{#1D9E75}{W}, \textcolor{#1D9E75}{V})$
Kernel (low-D)	N/A (linear projection)	Student-t $(1 + r^2)^{-1}$	$(1 + ar^{2b})^{-1}$
Non-neighbor penalty	N/A	Weak (via $\textcolor{#E8725C}{p_{ij}}$ weighting)	Strong (explicit CE term)

Start with PCA. If the result already separates your classes or shows clear structure, stop there. If clusters overlap or nonlinear patterns are hidden, try UMAP (faster, better global structure) or t-SNE (sometimes better local separation at small scale).

All three are projection methods, not clustering algorithms. They show you structure that exists in the data. They do not create it. If you see clusters in a t-SNE plot, verify them with a proper clustering method before drawing conclusions.

What else is out there

PCA, t-SNE, and UMAP are the workhorses, but the landscape is wider.

Classical linear methods. Multidimensional scaling (MDS) predates t-SNE and works directly on a distance matrix: it finds low-dimensional coordinates that preserve pairwise distances as faithfully as possible. Classical MDS on Euclidean distances is equivalent to PCA, but it generalizes to arbitrary distance metrics. Independent component analysis (ICA) finds axes that maximize statistical independence rather than variance, useful when you care about separating mixed signals rather than capturing spread. Linear discriminant analysis (LDA) is supervised: given class labels, it finds projections that maximize between-class separation relative to within-class spread.

Manifold learning. Before t-SNE, several methods tried to learn nonlinear structure. Isomap replaces Euclidean distances with geodesic distances along the data manifold (shortest paths through a neighbor graph), then applies MDS. Locally linear embedding (LLE) reconstructs each point as a weighted combination of its neighbors and preserves those weights in the embedding. Laplacian eigenmaps build a neighbor graph, weight edges by similarity, and embed using the eigenvectors of the graph Laplacian. These methods work well on clean manifolds (e.g., the Swiss roll) but tend to be fragile on noisy real-world data, which is why t-SNE and UMAP largely replaced them.

Deep learning. Autoencoders learn a compression function by training a neural network to reconstruct its input through a bottleneck layer. The bottleneck activations are the embedding. Unlike PCA, autoencoders can learn nonlinear mappings, and unlike t-SNE/UMAP, they produce a parametric encoder you can apply to new data without re-running the whole algorithm. Variational autoencoders (VAEs) add a probabilistic prior to the bottleneck, which regularizes the latent space and makes it smoother.

Recent alternatives. TriMap adds a triplet-based objective (point A is closer to B than to C) and tends to preserve global structure better than t-SNE. PaCMAP explicitly balances near-pair, mid-near-pair, and far-pair forces during optimization, avoiding some of t-SNE's tendency to shatter global structure. PHATE uses diffusion distances and is designed for biological data with branching trajectories.