The autoencoder
An autoencoder is a neural network that learns to compress its input and then reconstruct it. The architecture has three parts: an encoder that maps the input to a small latent representation, a bottleneck that forces compression, and a decoder that reconstructs the original input from the compressed form.
The input $\textcolor{#E07B9D}{x}$ passes through the encoder, which progressively narrows the representation down to the bottleneck $\textcolor{#7F77DD}{z}$. The decoder then widens it back to reconstruct $\textcolor{#E07B9D}{\hat{x}}$.
$\textcolor{#7F77DD}{z} = \textcolor{#C9A227}{f_\phi}(\textcolor{#E07B9D}{x}) \qquad \textcolor{#E07B9D}{\hat{x}} = \textcolor{#D85A30}{g_\theta}(\textcolor{#7F77DD}{z})$
| Symbol | Meaning |
|---|---|
| $\textcolor{#E07B9D}{x}$ | Input (for MNIST: a 784-dimensional vector, the flattened 28x28 image). |
| $\textcolor{#C9A227}{f_\phi}$ | Encoder network with parameters $\phi$. Compresses the input. |
| $\textcolor{#7F77DD}{z}$ | Latent code. The bottleneck representation. If this is 2D, we have compressed 784 dimensions down to 2. |
| $\textcolor{#D85A30}{g_\theta}$ | Decoder network with parameters $\theta$. Reconstructs the input from the latent space representation. |
| $\textcolor{#E07B9D}{\hat{x}}$ | Reconstruction. The decoder's best attempt at recovering the original input. |
The network is trained to minimize the reconstruction error, the difference between the input and the output:
$\textcolor{#D36BE0}{\mathcal{L}_\text{recon}} = \frac{1}{N} \sum_{i=1}^{N} \| \textcolor{#E07B9D}{x^{(i)}} - \textcolor{#E07B9D}{\hat{x}^{(i)}} \|^2$
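To make the formula concrete, here is a minimal sketch in plain Python (deliberately without PyTorch, so it runs with no dependencies). The function name `recon_loss` and the tiny three-pixel "images" are invented for illustration only.

```python
def recon_loss(xs, x_hats):
    """Mean over the batch of the squared L2 distance between x and x_hat."""
    assert len(xs) == len(x_hats)
    total = 0.0
    for x, x_hat in zip(xs, x_hats):
        # Squared Euclidean norm ||x - x_hat||^2 for one example
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(xs)


# Two toy "images" with three pixels each (illustrative values only)
batch = [[0.0, 0.5, 1.0], [1.0, 1.0, 0.0]]
recons = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0]]

loss = recon_loss(batch, recons)
print(loss)  # (0.25 + 1.0) / 2 = 0.625
```

A perfect reconstruction gives a loss of exactly zero, which is the case the bottleneck is designed to make hard.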
The bottleneck is the entire point. If $\textcolor{#7F77DD}{z}$ had the same dimensionality as $\textcolor{#E07B9D}{x}$, the network could learn the identity function and the loss would be zero. By making $\textcolor{#7F77DD}{z}$ much smaller, the encoder is forced to discover a compact representation that preserves the most important information. This is learned dimensionality reduction, the same problem the dimensionality reduction post approached with PCA, t-SNE, and UMAP, but here the mapping is a neural network.
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z
The encoder narrows 784 → 256 → 128 → 2, the decoder widens 2 → 128 → 256 → 784. The symmetric architecture is a common convention but not a requirement. The decoder can be wider, deeper, or use entirely different layer sizes. What matters is that the bottleneck is small enough to force compression. The final sigmoid constrains outputs to [0, 1] to match pixel values.
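As a rough sanity check on those layer sizes, the following plain-Python sketch counts the weights and biases implied by each stack of fully connected layers. The helper names `linear_params` and `mlp_params` are made up for this example and are not part of PyTorch.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(sizes):
    """Total parameters of a stack of linear layers with the given widths."""
    return sum(linear_params(a, b) for a, b in zip(sizes, sizes[1:]))


encoder_sizes = [784, 256, 128, 2]
decoder_sizes = [2, 128, 256, 784]

enc = mlp_params(encoder_sizes)
dec = mlp_params(decoder_sizes)
print(enc, dec, enc + dec)
print("compression factor:", encoder_sizes[0] // encoder_sizes[-1])
```

Almost all of the parameters sit in the two layers that touch the 784-dimensional image, while the bottleneck itself contributes only a few hundred.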
The problem with autoencoders
An autoencoder learns to reconstruct training data, but nothing in its loss constrains how the latent space is organized. Points are placed wherever reconstruction loss is lowest, with no constraint on the overall layout. This means:
- Gaps. Regions of latent space between clusters may decode to nonsense. The decoder was never trained on inputs from those regions.
- No sampling. You cannot generate new data by drawing a random $\textcolor{#7F77DD}{z}$ because you have no idea which regions of latent space produce valid outputs.
- No smooth interpolation. Moving between two encoded points may pass through dead zones where the decoder produces garbage.
Compare the two scatter plots. The autoencoder (left) places digit clusters wherever it wants. The VAE (right) organizes them into a smooth, roughly Gaussian distribution. This structure is not an accident. It is enforced by the loss function.
The variational autoencoder
The key idea: instead of encoding each input to a single vector $\textcolor{#7F77DD}{z}$, encode it as a probability distribution over latent vectors. The encoder outputs two vectors, a mean $\textcolor{#1D9E75}{\mu}$ and a log-variance $\log \textcolor{#18B8C8}{\sigma^2}$, which together define a Gaussian:
$\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x}) = \mathcal{N}\!\left(\textcolor{#1D9E75}{\mu_\phi}(\textcolor{#E07B9D}{x}),\; \textcolor{#18B8C8}{\sigma_\phi^2}(\textcolor{#E07B9D}{x})\right)$
| Symbol | Meaning |
|---|---|
| $\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})$ | The encoder's output distribution. For each input, it produces a Gaussian in latent space. |
| $\textcolor{#1D9E75}{\mu_\phi}(\textcolor{#E07B9D}{x})$ | Mean vector. Where the distribution is centered. |
| $\textcolor{#18B8C8}{\sigma_\phi^2}(\textcolor{#E07B9D}{x})$ | Variance vector. How spread out the distribution is along each latent dimension (the diagonal of the covariance matrix; the dimensions are treated as independent). |
During training, we sample $\textcolor{#7F77DD}{z}$ from this distribution and pass it to the decoder. But sampling is a problem: drawing a random sample is not a differentiable operation, so gradients cannot be backpropagated through it to the encoder parameters.
The reparameterization trick
Instead of sampling $\textcolor{#7F77DD}{z} \sim \mathcal{N}(\textcolor{#1D9E75}{\mu}, \textcolor{#18B8C8}{\sigma^2})$ directly, reparameterize the sample as a deterministic function of the network outputs plus external noise:
$\textcolor{#7F77DD}{z} = \textcolor{#1D9E75}{\mu} + \textcolor{#18B8C8}{\sigma} \odot \textcolor{#8CB4D5}{\epsilon}, \qquad \textcolor{#8CB4D5}{\epsilon} \sim \mathcal{N}(0, I)$
| Symbol | Meaning |
|---|---|
| $\textcolor{#8CB4D5}{\epsilon}$ | Noise sampled from a standard normal. This is the only stochastic part. |
| $\odot$ | Element-wise multiplication. |
Now, for any fixed draw of $\textcolor{#8CB4D5}{\epsilon}$, $\textcolor{#7F77DD}{z}$ is a deterministic and differentiable function of $\textcolor{#1D9E75}{\mu}$ and $\textcolor{#18B8C8}{\sigma}$, which are outputs of the encoder, so gradients flow through both. The randomness comes only from $\textcolor{#8CB4D5}{\epsilon}$, which does not depend on any parameters.
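The trick can be checked numerically without any deep learning library. The sketch below, in plain Python, draws reparameterized samples and confirms that they have the intended mean and variance, then uses finite differences to show that the derivative of the sample is 1 with respect to the mean and equal to the noise with respect to the standard deviation. The values of `mu`, `sigma`, and `eps` are arbitrary choices for illustration.

```python
import random


def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, element-wise. Deterministic once eps is fixed."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


rng = random.Random(0)  # fixed seed for repeatability
mu, sigma = [1.0, -2.0], [0.5, 2.0]

samples = []
for _ in range(20000):
    eps = [rng.gauss(0.0, 1.0) for _ in mu]  # the only stochastic step
    samples.append(reparameterize(mu, sigma, eps))

n = len(samples)
mean = [sum(z[j] for z in samples) / n for j in range(2)]
var = [sum((z[j] - mean[j]) ** 2 for z in samples) / n for j in range(2)]
print(mean, var)  # close to mu and to sigma squared

# Finite-difference derivatives for one fixed eps
eps = [0.3, -1.2]
h = 1e-6
z0 = reparameterize(mu, sigma, eps)
z_mu = reparameterize([m + h for m in mu], sigma, eps)
z_sigma = reparameterize(mu, [s + h for s in sigma], eps)
dz_dmu = [(a - b) / h for a, b in zip(z_mu, z0)]
dz_dsigma = [(a - b) / h for a, b in zip(z_sigma, z0)]
print(dz_dmu, dz_dsigma)  # approximately [1, 1] and eps
```

Because both derivatives are well defined for every draw of the noise, an automatic differentiation framework can backpropagate through the sample to the encoder outputs.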
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        # Shared encoder trunk; the two heads below split off from its output
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        std = torch.exp(0.5 * logvar)         # sigma = exp(log(sigma^2) / 2)
        z = mu + std * torch.randn_like(std)  # reparameterization trick
        return self.decoder(z), mu, logvar
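Notice that the encoder splits at the end: the shared layers output a hidden vector, then two separate linear layers produce $\textcolor{#1D9E75}{\mu}$ and $\log \textcolor{#18B8C8}{\sigma^2}$. The network predicts the log-variance rather than the variance because a linear layer can output any real number, whereas a variance must be positive. Exponentiating the log-variance always yields a valid positive variance, and $\sigma = \exp(\tfrac{1}{2} \log \sigma^2)$ gives the standard deviation used in the code.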
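The factor of 0.5 in `torch.exp(0.5 * logvar)` is easy to misread, so here is a small plain-Python sketch of the same conversion. The helper name `std_from_logvar` is made up for this example.

```python
import math


def std_from_logvar(logvar):
    """sigma = exp(0.5 * log(sigma^2)); positive for every real input."""
    return math.exp(0.5 * logvar)


# Any real-valued network output maps to a valid, strictly positive sigma
for logvar in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(logvar, std_from_logvar(logvar))

# A variance of 4 corresponds to a standard deviation of 2
print(std_from_logvar(math.log(4.0)))
```

A log-variance of 0 corresponds to unit variance, which is exactly the value the prior asks for.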
Deriving the loss: the ELBO
The VAE is trained with two objectives pulling in opposite directions:
- Reconstruct well. The decoder should be able to recover $\textcolor{#E07B9D}{x}$ from $\textcolor{#7F77DD}{z}$.
- Stay close to the prior. The encoder's distribution $\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})$ should not stray too far from a standard normal $\textcolor{#D85A30}{p}(\textcolor{#7F77DD}{z}) = \mathcal{N}(0, I)$.
These two objectives come from a single principled derivation. We want to maximize the probability of the data under our model. Start with the marginal log-likelihood:
$\log \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x}) = \log \int \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x} \mid \textcolor{#7F77DD}{z}) \, \textcolor{#D85A30}{p}(\textcolor{#7F77DD}{z}) \, d\textcolor{#7F77DD}{z}$
This integral is intractable (we would need to evaluate the decoder at every possible $\textcolor{#7F77DD}{z}$). Instead, introduce the encoder $\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})$ as an approximation to the true posterior, and derive a lower bound.
Step 1: introduce the encoder
Multiply and divide by $\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})$ inside the integral:
$\log \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x}) = \log \int \textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x}) \, \frac{\textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x} \mid \textcolor{#7F77DD}{z}) \, \textcolor{#D85A30}{p}(\textcolor{#7F77DD}{z})}{\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})} \, d\textcolor{#7F77DD}{z}$
Step 2: apply Jensen's inequality
Since log is concave, $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$:
$\log \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x}) \geq \mathbb{E}_{\textcolor{#C9A227}{q_\phi}}\!\left[\log \frac{\textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x} \mid \textcolor{#7F77DD}{z}) \, \textcolor{#D85A30}{p}(\textcolor{#7F77DD}{z})}{\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})}\right]$
Step 3: split the log
$= \mathbb{E}_{\textcolor{#C9A227}{q_\phi}}\!\left[\log \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x} \mid \textcolor{#7F77DD}{z})\right] + \mathbb{E}_{\textcolor{#C9A227}{q_\phi}}\!\left[\log \frac{\textcolor{#D85A30}{p}(\textcolor{#7F77DD}{z})}{\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})}\right]$
The second term is the negative KL divergence $-D_\text{KL}(\textcolor{#C9A227}{q_\phi} \| \textcolor{#D85A30}{p})$. So the Evidence Lower Bound (ELBO) is:
$\textcolor{#8CB4D5}{\text{ELBO}} = \underbrace{\mathbb{E}_{\textcolor{#C9A227}{q_\phi}}\!\left[\log \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x} \mid \textcolor{#7F77DD}{z})\right]}_{\textcolor{#D36BE0}{\text{reconstruction}}} - \underbrace{D_\text{KL}\!\left(\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x}) \;\|\; \textcolor{#D85A30}{p}(\textcolor{#7F77DD}{z})\right)}_{\textcolor{#18B8C8}{\text{regularization}}}$
| Term | Meaning |
|---|---|
| $\textcolor{#D36BE0}{\text{reconstruction}}$ | How well the decoder reconstructs the input from the sampled latent code. In practice, this is the negative of the binary cross-entropy or MSE between $\textcolor{#E07B9D}{x}$ and $\textcolor{#E07B9D}{\hat{x}}$. |
| $\textcolor{#18B8C8}{\text{regularization}}$ | How far the encoder's distribution is from the prior $\mathcal{N}(0, I)$. This pulls every encoded distribution toward the prior (means toward the origin, variances toward 1), which discourages gaps in the latent space. |
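Maximizing the ELBO is equivalent to minimizing $\textcolor{#D36BE0}{\mathcal{L}_\text{recon}} + \textcolor{#18B8C8}{\mathcal{L}_\text{KL}}$, where $\textcolor{#D36BE0}{\mathcal{L}_\text{recon}}$ now denotes the negative expected log-likelihood $-\mathbb{E}_{\textcolor{#C9A227}{q_\phi}}\!\left[\log \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x} \mid \textcolor{#7F77DD}{z})\right]$ (binary cross-entropy for a Bernoulli decoder, or MSE up to scaling and constants for a Gaussian one) and $\textcolor{#18B8C8}{\mathcal{L}_\text{KL}}$ is the KL term.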
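The bound can be verified exactly on a toy model where the marginal likelihood is known. The sketch below assumes a one-dimensional linear-Gaussian model, $p(z) = \mathcal{N}(0, 1)$ and $p(x \mid z) = \mathcal{N}(z, 1)$, chosen purely for illustration (it is not the neural decoder used in this post). For that model $p(x) = \mathcal{N}(0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$, so both sides of the inequality have closed forms.

```python
import math


def log_marginal(x):
    """log p(x) for the toy model, where p(x) = N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0


def elbo(x, m, s2):
    """ELBO for q(z|x) = N(m, s2) in the toy model, in closed form."""
    # E_q[log p(x|z)] with p(x|z) = N(z, 1), using E[(x - z)^2] = (x - m)^2 + s2
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # KL(N(m, s2) || N(0, 1))
    kl = 0.5 * (m * m + s2 - 1.0 - math.log(s2))
    return recon - kl


x = 1.5
exact = elbo(x, m=x / 2, s2=0.5)  # q equals the true posterior: bound is tight
loose = elbo(x, m=0.0, s2=1.0)    # q equals the prior: valid but looser bound
print(log_marginal(x), exact, loose)
```

The gap between the log-likelihood and the ELBO is the KL divergence from the encoder's distribution to the true posterior, which is why it vanishes only when the two match.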
Closed-form KL divergence
When both distributions are Gaussian, the KL divergence has a closed form. For a single input with a $d$-dimensional latent space:
$\textcolor{#18B8C8}{\mathcal{L}_\text{KL}} = D_\text{KL}\!\left(\mathcal{N}(\textcolor{#1D9E75}{\mu}, \textcolor{#18B8C8}{\sigma^2}) \;\|\; \mathcal{N}(0, I)\right) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log \textcolor{#18B8C8}{\sigma_j^2} - \textcolor{#1D9E75}{\mu_j}^2 - \textcolor{#18B8C8}{\sigma_j^2}\right)$
| Term | Effect |
|---|---|
| $\textcolor{#1D9E75}{\mu_j}^2$ | Penalizes the mean for moving away from 0. |
| $\textcolor{#18B8C8}{\sigma_j^2}$ | Penalizes large variances: this term grows linearly as the distribution gets wider. |
| $\log \textcolor{#18B8C8}{\sigma_j^2}$ | Penalizes small variances: $-\log \textcolor{#18B8C8}{\sigma_j^2}$ grows without bound as the variance shrinks toward 0. Together, the per-dimension contribution $\sigma_j^2 - \log \sigma_j^2 - 1$ is minimized (at zero) when $\sigma_j^2 = 1$. |
In PyTorch, using log-variance directly:
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    recon = F.binary_cross_entropy(recon_x, x, reduction='sum')   # negative log-likelihood
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # closed-form KL
    return recon + kl
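One way to gain confidence in the closed form is to compare it against a Monte Carlo estimate of the same quantity, $\mathbb{E}_q[\log q(z) - \log p(z)]$. The sketch below does this in plain Python; the function names and the example values of `mu` and `logvar` are assumptions made for illustration.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I))."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def kl_monte_carlo(mu, logvar, n=50000, seed=0):
    """Estimate the same KL as the average of log q(z) - log p(z) for z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        for m, lv in zip(mu, logvar):
            s = math.exp(0.5 * lv)
            z = m + s * rng.gauss(0.0, 1.0)
            log_q = -0.5 * (math.log(2 * math.pi) + lv + ((z - m) / s) ** 2)
            log_p = -0.5 * (math.log(2 * math.pi) + z * z)
            total += log_q - log_p
    return total / n


mu, logvar = [0.5, -1.0], [0.0, math.log(0.25)]
closed = kl_to_standard_normal(mu, logvar)
estimate = kl_monte_carlo(mu, logvar)
print(closed, estimate)  # the two values should agree to about two decimals
```

The closed form is both exact and far cheaper than sampling, which is why the training loss uses it directly.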
Why the latent space is smooth
The two loss terms create a tension that produces structure:
- $\textcolor{#18B8C8}{\mathcal{L}_\text{KL}}$ alone would collapse everything to a single standard normal. Every input would encode to the same distribution, destroying all information.
- $\textcolor{#D36BE0}{\mathcal{L}_\text{recon}}$ alone would push each input's distribution to a narrow spike far from everything else (this is the autoencoder solution).
Together, the encoder must spread its distributions enough to overlap with the prior while keeping them distinct enough for the decoder to tell apart. The result: digit classes form overlapping clouds that tile the latent space smoothly. Similar digits (3 and 8, 4 and 9) end up nearby because that lets both losses be low.
Move your mouse across the latent plane. The decoder runs in real time in your browser (pure JS matrix multiplication, no server). Switch between the AE and VAE decoder to see the difference: the VAE produces recognizable digits across the entire plane, while the AE has dead zones that produce noise.
Interpolation and generation
Because the VAE's latent space is smooth, you can do two things that autoencoders cannot:
- Interpolation. Walk a straight line between two latent codes. At every point along the line, the decoder produces a plausible output. The transition is gradual, not abrupt.
- Generation. Sample $\textcolor{#7F77DD}{z} \sim \mathcal{N}(0, I)$ and decode. Because the KL term pulled the training distribution toward the prior, random samples from the prior land in regions the decoder has seen.
The autoencoder interpolation passes through blurry, broken intermediate frames. The VAE interpolation produces smooth transitions where each frame looks like a plausible digit.
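Both operations happen entirely in latent space before the decoder is called. The plain-Python sketch below shows one way to build the straight-line path and to draw a sample from the prior. The helper names and the two example codes are hypothetical, and each resulting vector would still need to be passed through a trained decoder to produce an image.

```python
import random


def interpolate(z_start, z_end, steps):
    """Return `steps` latent codes evenly spaced on the segment z_start -> z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for k in range(steps):
        t = k / (steps - 1)  # runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


def sample_prior(latent_dim, rng):
    """Draw one latent code from the standard normal prior N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]


# Hypothetical 2D codes for two encoded digits
z_a, z_b = [-1.0, 0.5], [2.0, -1.5]
path = interpolate(z_a, z_b, steps=5)
for z in path:
    print(z)

z_new = sample_prior(2, random.Random(0))
print(z_new)
```

Linear interpolation is the simplest choice; it works for both models, but only the VAE's regularized latent space makes the intermediate points likely to decode well.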
Training loop
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_data = datasets.MNIST('.', train=True, download=True,
                            transform=transforms.ToTensor())
loader = DataLoader(train_data, batch_size=256, shuffle=True)

vae = VAE(latent_dim=2)
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)

for epoch in range(1, 51):
    vae.train()
    total_loss = 0
    for x, _ in loader:
        x = x.view(-1, 784)
        recon, mu, logvar = vae(x)
        loss = vae_loss(recon, x, mu, logvar)
        opt.zero_grad()
        loss.backward()
        opt.step()
        total_loss += loss.item()
    print(f"Epoch {epoch}: {total_loss / len(train_data):.2f}")
A common training issue with VAEs is posterior collapse: the encoder learns to ignore the input and outputs $\textcolor{#1D9E75}{\mu} = 0$, $\textcolor{#18B8C8}{\sigma^2} = 1$ for everything. The KL loss is then zero, but $\textcolor{#7F77DD}{z}$ carries no information about $\textcolor{#E07B9D}{x}$, so the decoder can only produce the same blurry, average-looking output for every input. A practical mitigation is KL warmup: multiply $\textcolor{#18B8C8}{\mathcal{L}_\text{KL}}$ by a coefficient $\beta$ that ramps from 0 to 1 over the first several epochs, letting the encoder and decoder learn a useful representation before the KL penalty takes full effect.
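A warmup schedule of this kind can be sketched in a few lines of plain Python. The linear shape, the 10-epoch default, and the function names are assumptions chosen for illustration; other schedules (for example cyclical ones) are also used in practice.

```python
def kl_weight(epoch, warmup_epochs=10):
    """Linear warmup: beta rises from 0 to 1 over `warmup_epochs`, then stays at 1."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, epoch / warmup_epochs)


def warmup_loss(recon, kl, epoch, warmup_epochs=10):
    """Total loss with the KL term scaled by the current beta."""
    return recon + kl_weight(epoch, warmup_epochs) * kl


for epoch in (0, 5, 10, 50):
    print(epoch, kl_weight(epoch))
```

To use this with the training loop above, the loss function would need to return the reconstruction and KL terms separately so that only the KL term is scaled. Since that loop counts epochs from 1, passing `epoch - 1` makes the ramp start at exactly 0.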
Connection to dimensionality reduction
The encoder of a trained autoencoder is a nonlinear dimensionality reduction method. How does it compare to PCA, t-SNE, and UMAP from the dimensionality reduction post?
| Method | Linear? | Learnable? | Invertible? | Generative? |
|---|---|---|---|---|
| PCA | Yes | No | Yes (approximate) | No |
| t-SNE | No | No | No | No |
| UMAP | No | No | No | No |
| Autoencoder | No | Yes | Yes (decoder) | No |
| VAE | No | Yes | Yes (decoder) | Yes |
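PCA finds the best linear compression in the least-squares sense. t-SNE and UMAP find good low-dimensional layouts for visualization but are not built around a learned decoder: standard t-SNE has no mapping for new points at all, and although some UMAP implementations (such as umap-learn) can project new points and offer an approximate inverse transform, reconstruction is not what they are optimized for. The autoencoder learns a nonlinear compression with a decoder that can reconstruct, but its latent space is unstructured. The VAE adds structure: a smooth, continuous latent space that supports generation and interpolation.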
Beyond MNIST
The 2D latent space in these demos is deliberately tiny so that it is easy to visualize. Practical VAEs typically use far larger latent spaces, often on the order of 64 to 512 dimensions. The same principles carry over to several extensions:
- $\beta$-VAE uses a hyperparameter $\beta > 1$ on the KL term to encourage disentangled latent dimensions, where each dimension controls a single factor of variation (e.g., rotation, width, style).
- VQ-VAE (Vector Quantized VAE) replaces the continuous Gaussian latent space with a discrete codebook: each encoder output is snapped to its nearest codebook entry. This avoids much of the blurriness of Gaussian VAEs, and discrete-latent autoencoders of this kind underpin image and audio generators such as the original DALL-E (which uses a closely related discrete VAE) and Jukebox.
- Diffusion models can be seen as a hierarchy of VAEs with many latent layers, where the "encoding" process is a fixed noise schedule rather than a learned encoder. The connection is through the same ELBO framework.
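Of these extensions, the VQ-VAE's quantization step is the easiest to show in isolation. The plain-Python sketch below implements only the nearest-codebook lookup; the function name `quantize` and the four-entry codebook are hypothetical, and everything that makes a VQ-VAE trainable (learning the codebook, passing gradients through the discrete choice) is deliberately left out.

```python
def quantize(z_e, codebook):
    """Return (index, entry) of the codebook vector closest to z_e in squared L2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))
    return index, codebook[index]


# Hypothetical codebook with four 2D entries
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
index, z_q = quantize([0.9, 0.2], codebook)
print(index, z_q)  # 1 [1.0, 0.0]
```

The decoder then receives the selected entry rather than the raw encoder output, so every latent code it ever sees comes from a finite set.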