What is the difference between an autoencoder and a VAE?

An autoencoder maps inputs to a fixed latent vector and reconstructs from it. A variational autoencoder (VAE) maps inputs to a distribution (mean and variance) in latent space, samples from that distribution, and reconstructs. The VAE adds a KL divergence penalty that pulls the encoded distributions toward a standard normal, producing a smooth latent space where nearby points decode to similar outputs and you can sample new data by drawing from the prior.

What is the ELBO in variational autoencoders?

The Evidence Lower Bound (ELBO) is a lower bound on the log-likelihood of the data. The two differ by exactly the KL divergence between the approximate posterior q(z|x) and the true posterior p(z|x): log p(x) = ELBO + D_KL(q(z|x) || p(z|x)). Maximizing the ELBO therefore raises the data likelihood and shrinks that gap at the same time. The ELBO decomposes into a reconstruction term (how well the decoder reconstructs the input) minus a KL divergence term (how far the encoder's distribution is from the prior).

Why does the VAE use the reparameterization trick?

Sampling z ~ N(mu, sigma^2) is a stochastic operation that blocks gradient flow. The reparameterization trick rewrites the sample as z = mu + sigma * epsilon where epsilon ~ N(0,1). Now z is a deterministic function of mu and sigma (which are network outputs), plus external noise. Gradients flow through mu and sigma to the encoder, enabling end-to-end backpropagation.

What is KL divergence and why does the VAE minimize it?

KL divergence measures how different one probability distribution is from another. In a VAE, the KL term D_KL(q(z|x) || p(z)) penalizes the encoder for producing distributions that deviate from the prior N(0,I). This regularizes the latent space: it prevents the encoder from mapping each input to a narrow spike far from the origin, ensuring the latent space is smooth and continuous.

Autoencoders & VAEs

The autoencoder

An autoencoder is a neural network that learns to compress its input and then reconstruct it. The architecture has three parts: an encoder that maps the input to a small latent representation, a bottleneck that forces compression, and a decoder that reconstructs the original input from the compressed form.

The input $\textcolor{#E07B9D}{x}$ passes through the encoder, which progressively narrows the representation down to the bottleneck $\textcolor{#7F77DD}{z}$. The decoder then widens it back to reconstruct $\textcolor{#E07B9D}{\hat{x}}$.

$\textcolor{#7F77DD}{z} = \textcolor{#C9A227}{f_\phi}(\textcolor{#E07B9D}{x}) \qquad \textcolor{#E07B9D}{\hat{x}} = \textcolor{#D85A30}{g_\theta}(\textcolor{#7F77DD}{z})$

$\textcolor{#E07B9D}{x}$	Input (for MNIST: a 784-dimensional vector, the flattened 28x28 image).
$\textcolor{#C9A227}{f_\phi}$	Encoder network with parameters $\phi$. Compresses the input.
$\textcolor{#7F77DD}{z}$	Latent code. The bottleneck representation. If this is 2D, we have compressed 784 dimensions down to 2.
$\textcolor{#D85A30}{g_\theta}$	Decoder network with parameters $\theta$. Reconstructs the input from the latent space representation.
$\textcolor{#E07B9D}{\hat{x}}$	Reconstruction. The decoder's best attempt at recovering the original input.

The network is trained to minimize the reconstruction error, the difference between the input and the output:

$\textcolor{#D36BE0}{\mathcal{L}_\text{recon}} = \frac{1}{N} \sum_{i=1}^{N} \| \textcolor{#E07B9D}{x^{(i)}} - \textcolor{#E07B9D}{\hat{x}^{(i)}} \|^2$

The bottleneck is the entire point. If $\textcolor{#7F77DD}{z}$ had the same dimensionality as $\textcolor{#E07B9D}{x}$, the network could learn the identity function and the loss would be zero. By making $\textcolor{#7F77DD}{z}$ much smaller, the encoder is forced to discover a compact representation that preserves the most important information. This is learned dimensionality reduction, the same problem the dimensionality reduction post approached with PCA, t-SNE, and UMAP, but here the mapping is a neural network.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

The encoder narrows 784 → 256 → 128 → 2, the decoder widens 2 → 128 → 256 → 784. The symmetric architecture is a common convention but not a requirement. The decoder can be wider, deeper, or use entirely different layer sizes. What matters is that the bottleneck is small enough to force compression. The final sigmoid constrains outputs to [0, 1] to match pixel values.
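To make the reconstruction loss concrete, here is a small dependency-free sketch that evaluates $\mathcal{L}_\text{recon}$ on a toy batch. The helper name `reconstruction_loss` and the three-pixel "images" are invented for illustration; the real model computes the same quantity over 784-dimensional tensors.

```python
def reconstruction_loss(inputs, reconstructions):
    """Mean over the batch of the squared L2 distance ||x - x_hat||^2."""
    total = 0.0
    for x, x_hat in zip(inputs, reconstructions):
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(inputs)


# Toy batch of N = 2 "images" with 3 pixels each (hypothetical values).
batch = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.0]]
perfect = [row[:] for row in batch]            # exact copies of the inputs
blurry = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]    # the same gray output for every input

loss_perfect = reconstruction_loss(batch, perfect)   # 0.0
loss_blurry = reconstruction_loss(batch, blurry)     # (0.5 + 0.75) / 2 = 0.625
```

A decoder that returns the same gray image for every input pays a nonzero loss, which is the pressure that forces the bottleneck to carry information about each input.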

The problem with autoencoders

An autoencoder learns to reconstruct training data, but its latent space has no structure. Points are placed wherever minimizes reconstruction loss, with no constraint on the overall layout. This means:

Gaps. Regions of latent space between clusters may decode to nonsense. The decoder was never trained on inputs from those regions.
No sampling. You cannot generate new data by drawing a random $\textcolor{#7F77DD}{z}$ because you have no idea which regions of latent space produce valid outputs.
No smooth interpolation. Moving between two encoded points may pass through dead zones where the decoder produces garbage.

Compare the two scatter plots. The autoencoder (left) places digit clusters wherever it wants. The VAE (right) organizes them into a smooth, roughly Gaussian distribution. This structure is not an accident. It is enforced by the loss function.

The variational autoencoder

The key idea: instead of encoding each input to a single vector $\textcolor{#7F77DD}{z}$, encode it as a distribution over latent vectors. The encoder outputs two vectors, a mean $\textcolor{#1D9E75}{\mu}$ and a log-variance $\log \textcolor{#18B8C8}{\sigma^2}$, which together define a Gaussian with independent dimensions:

$\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x}) = \mathcal{N}\!\left(\textcolor{#1D9E75}{\mu_\phi}(\textcolor{#E07B9D}{x}),\; \textcolor{#18B8C8}{\sigma_\phi^2}(\textcolor{#E07B9D}{x})\right)$

$\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})$	The encoder's output distribution. For each input, it produces a Gaussian in latent space.
$\textcolor{#1D9E75}{\mu_\phi}(\textcolor{#E07B9D}{x})$	Mean vector. Where the distribution is centered.
$\textcolor{#18B8C8}{\sigma_\phi^2}(\textcolor{#E07B9D}{x})$	Variance vector. How spread out the distribution is.

During training, we sample $\textcolor{#7F77DD}{z}$ from this distribution and pass it to the decoder. But sampling is a problem: you cannot backpropagate through a random operation.

The reparameterization trick

Instead of sampling $\textcolor{#7F77DD}{z} \sim \mathcal{N}(\textcolor{#1D9E75}{\mu}, \textcolor{#18B8C8}{\sigma^2})$ directly, reparameterize the sample as a deterministic function of the network outputs plus external noise:

$\textcolor{#7F77DD}{z} = \textcolor{#1D9E75}{\mu} + \textcolor{#18B8C8}{\sigma} \odot \textcolor{#8CB4D5}{\epsilon}, \qquad \textcolor{#8CB4D5}{\epsilon} \sim \mathcal{N}(0, I)$

$\textcolor{#8CB4D5}{\epsilon}$	Noise sampled from a standard normal. This is the only stochastic part.
$\odot$	Element-wise multiplication.

Now $\textcolor{#7F77DD}{z}$ is a deterministic function of $\textcolor{#1D9E75}{\mu}$ and $\textcolor{#18B8C8}{\sigma}$, which are outputs of the encoder. Gradients flow through both. The randomness comes only from $\textcolor{#8CB4D5}{\epsilon}$, which does not depend on any parameters.

class VAE(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)
        return self.decoder(z), mu, logvar

Notice the encoder splits at the end: the shared layers output a hidden vector, then two separate linear layers produce $\textcolor{#1D9E75}{\mu}$ and $\log \textcolor{#18B8C8}{\sigma^2}$. We predict log-variance rather than variance because a linear layer can output any real number. Every real number is a valid log-variance, and exponentiating it always yields a positive variance, so no extra constraint on the network output is needed.
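The claim that gradients flow through $\mu$ and $\sigma$ can be checked numerically without any framework. The sketch below handles one latent dimension, holds $\epsilon$ fixed, and compares analytic derivatives of $z$ against finite differences. The function name `reparameterize` and the values of `mu` and `logvar` are assumptions chosen for illustration.

```python
import math
import random


def reparameterize(mu, logvar, eps):
    """z = mu + sigma * eps, with sigma recovered from the log-variance."""
    std = math.exp(0.5 * logvar)
    return mu + std * eps


random.seed(0)
eps = random.gauss(0.0, 1.0)   # the only random draw; held fixed below
mu, logvar = 0.3, -1.2         # hypothetical encoder outputs for one latent dimension

# Analytic gradients of z with eps treated as a constant.
dz_dmu = 1.0
dz_dlogvar = 0.5 * math.exp(0.5 * logvar) * eps

# Central finite differences should agree with the analytic values.
h = 1e-6
fd_mu = (reparameterize(mu + h, logvar, eps)
         - reparameterize(mu - h, logvar, eps)) / (2 * h)
fd_logvar = (reparameterize(mu, logvar + h, eps)
             - reparameterize(mu, logvar - h, eps)) / (2 * h)
```

With $\epsilon$ fixed, $z$ is an ordinary differentiable function of the encoder outputs, which is what lets autograd backpropagate through the sampling step.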

Deriving the loss: the ELBO

The VAE is trained with two objectives pulling in opposite directions:

Reconstruct well. The decoder should be able to recover $\textcolor{#E07B9D}{x}$ from $\textcolor{#7F77DD}{z}$.
Stay close to the prior. The encoder's distribution $\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})$ should not stray too far from a standard normal $\textcolor{#D85A30}{p}(\textcolor{#7F77DD}{z}) = \mathcal{N}(0, I)$.

These two objectives come from a single principled derivation. We want to maximize the probability of the data under our model. Start with the marginal log-likelihood:

$\log \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x}) = \log \int \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x} \mid \textcolor{#7F77DD}{z}) \, \textcolor{#D85A30}{p}(\textcolor{#7F77DD}{z}) \, d\textcolor{#7F77DD}{z}$

This integral is intractable (we would need to evaluate the decoder at every possible $\textcolor{#7F77DD}{z}$). Instead, introduce the encoder $\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})$ as an approximation to the true posterior, and derive a lower bound.

Step 1: introduce the encoder

Multiply and divide by $\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})$ inside the integral:

$\log \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x}) = \log \int \textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x}) \, \frac{\textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x} \mid \textcolor{#7F77DD}{z}) \, \textcolor{#D85A30}{p}(\textcolor{#7F77DD}{z})}{\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})} \, d\textcolor{#7F77DD}{z}$

Step 2: apply Jensen's inequality

Since log is concave, $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$:

$\log \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x}) \geq \mathbb{E}_{\textcolor{#C9A227}{q_\phi}}\!\left[\log \frac{\textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x} \mid \textcolor{#7F77DD}{z}) \, \textcolor{#D85A30}{p}(\textcolor{#7F77DD}{z})}{\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})}\right]$

Step 3: split the log

$= \mathbb{E}_{\textcolor{#C9A227}{q_\phi}}\!\left[\log \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x} \mid \textcolor{#7F77DD}{z})\right] + \mathbb{E}_{\textcolor{#C9A227}{q_\phi}}\!\left[\log \frac{\textcolor{#D85A30}{p}(\textcolor{#7F77DD}{z})}{\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x})}\right]$

The second term is the negative KL divergence $-D_\text{KL}(\textcolor{#C9A227}{q_\phi} \| \textcolor{#D85A30}{p})$. So the Evidence Lower Bound (ELBO) is:

$\textcolor{#8CB4D5}{\text{ELBO}} = \underbrace{\mathbb{E}_{\textcolor{#C9A227}{q_\phi}}\!\left[\log \textcolor{#D85A30}{p_\theta}(\textcolor{#E07B9D}{x} \mid \textcolor{#7F77DD}{z})\right]}_{\textcolor{#D36BE0}{\text{reconstruction}}} - \underbrace{D_\text{KL}\!\left(\textcolor{#C9A227}{q_\phi}(\textcolor{#7F77DD}{z} \mid \textcolor{#E07B9D}{x}) \;\|\; \textcolor{#D85A30}{p}(\textcolor{#7F77DD}{z})\right)}_{\textcolor{#18B8C8}{\text{regularization}}}$

$\textcolor{#D36BE0}{\text{reconstruction}}$	How well the decoder reconstructs the input from the sampled latent code. In practice, this is the negative of the binary cross-entropy (a Bernoulli decoder) or, up to constants, of the MSE (a Gaussian decoder with fixed variance) between $\textcolor{#E07B9D}{x}$ and $\textcolor{#E07B9D}{\hat{x}}$.
$\textcolor{#18B8C8}{\text{regularization}}$	How far the encoder's distribution is from the prior $\mathcal{N}(0, I)$. This pulls every encoded distribution toward the origin, preventing gaps in the latent space.

Maximizing the ELBO is equivalent to minimizing $\textcolor{#D36BE0}{\mathcal{L}_\text{recon}} + \textcolor{#18B8C8}{\mathcal{L}_\text{KL}}$.
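The bound can be verified exactly on a model small enough to enumerate. The sketch below assumes a toy model with a binary latent $z \in \{0, 1\}$ and made-up probabilities, not the Gaussian latent used in this post, so that $\log p(x)$ is a two-term sum instead of an intractable integral. It checks that the ELBO never exceeds $\log p(x)$ and that the gap equals the KL divergence from $q$ to the true posterior.

```python
import math

prior = [0.5, 0.5]          # p(z) for z in {0, 1}
likelihood = [0.8, 0.1]     # p(x | z) for one fixed observation x (made-up values)

evidence = sum(p * l for p, l in zip(prior, likelihood))             # p(x)
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]    # p(z | x)


def kl(q, p):
    """KL divergence between two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def elbo(q):
    """Reconstruction term minus KL to the prior, for an approximate posterior q."""
    recon = sum(qi * math.log(li) for qi, li in zip(q, likelihood))
    return recon - kl(q, prior)


q_rough = [0.6, 0.4]                                  # an imperfect "encoder"
gap_rough = math.log(evidence) - elbo(q_rough)        # equals kl(q_rough, posterior)
gap_exact = math.log(evidence) - elbo(posterior)      # zero up to rounding
```

The gap closes only when $q$ matches the true posterior, which is why improving the encoder tightens the bound.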

Closed-form KL divergence

When both distributions are Gaussian, the KL divergence has a closed form. For a single input with a $d$-dimensional latent space:

$\textcolor{#18B8C8}{\mathcal{L}_\text{KL}} = D_\text{KL}\!\left(\mathcal{N}(\textcolor{#1D9E75}{\mu}, \textcolor{#18B8C8}{\sigma^2}) \;\|\; \mathcal{N}(0, I)\right) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log \textcolor{#18B8C8}{\sigma_j^2} - \textcolor{#1D9E75}{\mu_j}^2 - \textcolor{#18B8C8}{\sigma_j^2}\right)$

$\textcolor{#1D9E75}{\mu_j}^2$	Penalizes the mean for moving away from 0.
$\textcolor{#18B8C8}{\sigma_j^2}$	Penalizes large variance (distributions that are too wide).
$\log \textcolor{#18B8C8}{\sigma_j^2}$	Penalizes small variance (distributions that are too narrow). Together, the per-dimension contribution $\sigma_j^2 - \log \sigma_j^2$ is minimized at $\sigma_j^2 = 1$.

In PyTorch, using log-variance directly:

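import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction term: BCE summed over pixels and over the batch
    recon = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # Closed-form KL with sigma^2 = exp(logvar), summed over latent dims and the batch
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl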
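As a sanity check on the closed form, the sketch below compares it with a direct numerical integration of $\int q(z) \log \frac{q(z)}{p(z)} \, dz$ for a single latent dimension. It uses only the standard library; the function names, the integration range, and the example values $\mu = 1.5$, $\sigma^2 = 0.25$ are arbitrary choices for illustration.

```python
import math


def kl_closed_form(mu, logvar):
    """Per-dimension KL( N(mu, sigma^2) || N(0, 1) ), with sigma^2 = exp(logvar)."""
    return -0.5 * (1.0 + logvar - mu ** 2 - math.exp(logvar))


def log_normal_pdf(z, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)


def kl_numeric(mu, logvar, half_width=12.0, steps=20000):
    """Midpoint-rule estimate of the integral of q(z) * log(q(z) / p(z))."""
    var = math.exp(logvar)
    std = math.sqrt(var)
    lo = mu - half_width * std
    dz = 2.0 * half_width * std / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz
        log_q = log_normal_pdf(z, mu, var)
        log_p = log_normal_pdf(z, 0.0, 1.0)
        total += math.exp(log_q) * (log_q - log_p) * dz
    return total


kl_at_prior = kl_closed_form(0.0, 0.0)                 # q equals the prior, so KL is 0
kl_example = kl_closed_form(1.5, math.log(0.25))
kl_example_numeric = kl_numeric(1.5, math.log(0.25))   # agrees with kl_example
```

The KL is zero only when the encoder's distribution equals the prior, and it grows whether the variance moves above or below 1.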

Why the latent space is smooth

The two loss terms create a tension that produces structure:

$\textcolor{#18B8C8}{\mathcal{L}_\text{KL}}$ alone would collapse everything to a single standard normal. Every input would encode to the same distribution, destroying all information.

$\textcolor{#D36BE0}{\mathcal{L}_\text{recon}}$ alone would push each input's distribution to a narrow spike far from everything else (this is the autoencoder solution).

Together, the encoder must spread its distributions enough to overlap with the prior while keeping them distinct enough for the decoder to tell apart. The result: digit classes form overlapping clouds that tile the latent space smoothly. Similar digits (3 and 8, 4 and 9) end up nearby because that lets both losses be low.

Move your mouse across the latent plane. The decoder runs in real time in your browser (pure JS matrix multiplication, no server). Switch between the AE and VAE decoder to see the difference: the VAE produces recognizable digits across the entire plane, while the AE has dead zones that produce noise.

Interpolation and generation

Because the VAE's latent space is smooth, you can do two things that autoencoders cannot:

Interpolation. Walk a straight line between two latent codes. At every point along the line, the decoder produces a plausible output. The transition is gradual, not abrupt.

Generation. Sample $\textcolor{#7F77DD}{z} \sim \mathcal{N}(0, I)$ and decode. Because the KL term pulled the training distribution toward the prior, random samples from the prior land in regions the decoder has seen.

The autoencoder interpolation passes through blurry, broken intermediate frames. The VAE interpolation produces smooth transitions where each frame looks like a plausible digit.
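Both operations reduce to simple arithmetic on latent codes. The sketch below builds a straight-line path between two codes and draws a fresh code from the prior, using only the standard library. The helper names and the two example codes are hypothetical; with a trained model, each resulting vector would be converted to a tensor and passed through the decoder.

```python
import random


def interpolate(z_start, z_end, steps):
    """Return `steps` latent codes evenly spaced on the line from z_start to z_end.

    Assumes steps >= 2 so that both endpoints are included.
    """
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


def sample_prior(latent_dim, rng):
    """Draw one z ~ N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]


# Hypothetical 2-D latent means of two encoded digits.
z_a, z_b = [-1.0, 0.5], [2.0, -1.5]
path = interpolate(z_a, z_b, steps=5)

rng = random.Random(0)
z_new = sample_prior(2, rng)
# Each code in `path`, and `z_new`, would then be decoded by the trained decoder.
```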

Training loop

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_data = datasets.MNIST('.', train=True, download=True,
                            transform=transforms.ToTensor())
loader = DataLoader(train_data, batch_size=256, shuffle=True)

vae = VAE(latent_dim=2)
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)

for epoch in range(1, 51):
    vae.train()
    total_loss = 0
    for x, _ in loader:
        x = x.view(-1, 784)
        recon, mu, logvar = vae(x)
        loss = vae_loss(recon, x, mu, logvar)
        opt.zero_grad()
        loss.backward()
        opt.step()
        total_loss += loss.item()
    print(f"Epoch {epoch}: {total_loss / len(train_data):.2f}")

A common training issue with VAEs is posterior collapse: the encoder learns to ignore the input and outputs $\textcolor{#1D9E75}{\mu} = 0$, $\textcolor{#18B8C8}{\sigma^2} = 1$ for everything. The KL loss drops to zero, but the latent code then carries no information about the input, so every reconstruction degenerates toward the same blurry average. A practical fix is KL warmup: multiply $\textcolor{#18B8C8}{\mathcal{L}_\text{KL}}$ by a coefficient $\beta$ that ramps from 0 to 1 over the first several epochs, letting the model learn to use the latent code before the full KL penalty kicks in.
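One possible warmup schedule is a linear ramp. The sketch below is an assumption rather than a prescribed recipe: the function names, the linear shape, and the default of 10 warmup epochs are all illustrative choices. Epochs are counted from 1, as in the training loop above.

```python
def kl_weight(epoch, warmup_epochs=10):
    """Linear ramp: 0 at epoch 1, reaching 1 at epoch `warmup_epochs` and staying there."""
    if warmup_epochs <= 1:
        return 1.0
    return min(1.0, (epoch - 1) / (warmup_epochs - 1))


def warmup_loss(recon, kl, epoch, warmup_epochs=10):
    """Total loss with the KL term scaled by the current beta."""
    return recon + kl_weight(epoch, warmup_epochs) * kl


# Beta values for epochs 1..7 with a 5-epoch warmup.
schedule = [round(kl_weight(e, warmup_epochs=5), 2) for e in range(1, 8)]
# schedule == [0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0]
```

To use it, the loss function would need to return the reconstruction and KL terms separately so that they can be combined with the weight for the current epoch.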

Connection to dimensionality reduction

The encoder of a trained autoencoder is a nonlinear dimensionality reduction method. How does it compare to PCA, t-SNE, and UMAP from the dimensionality reduction post?

Method	Linear?	Learnable?	Invertible?	Generative?
PCA	Yes	No	Yes (approximate)	No
t-SNE	No	No	No	No
UMAP	No	No	No	No
Autoencoder	No	Yes	Yes (decoder)	No
VAE	No	Yes	Yes (decoder)	Yes

PCA finds the best linear compression. t-SNE and UMAP find good low-dimensional layouts for visualization, but standard t-SNE cannot map new points, and neither method is built around reconstructing the originals (UMAP implementations can project new points, though that is not their main purpose). The autoencoder learns a nonlinear compression with a decoder that can reconstruct, but its latent space is unstructured. The VAE adds structure: a smooth, continuous latent space that supports generation and interpolation.

Beyond MNIST

The 2D latent space in these demos is deliberately tiny so that it is easy to visualize. In practice, VAEs typically use much larger latent spaces, often 64 to 512 dimensions. The same principles apply:

$\beta$-VAE uses a hyperparameter $\beta > 1$ on the KL term to encourage disentangled latent dimensions, where each dimension controls a single factor of variation (e.g., rotation, width, style).

VQ-VAE (Vector Quantized VAE) replaces the continuous Gaussian latent space with a discrete codebook: each encoder output is snapped to its nearest codebook entry. This avoids much of the blurriness of Gaussian VAEs, and discrete latent models of this kind underpin image and audio generators such as the original DALL-E and Jukebox.
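The nearest-entry lookup at the heart of vector quantization can be sketched in a few lines. This covers only the lookup: training a VQ-VAE also requires a way to pass gradients through the discrete choice and extra loss terms for the codebook, which are omitted here. The function name `quantize` and the four-entry codebook are made up for illustration.

```python
def quantize(z_e, codebook):
    """Return (index, entry) of the codebook vector closest to z_e in squared L2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))
    return index, codebook[index]


# Hypothetical codebook with four 2-D entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [2.0, -1.0]]
index, z_q = quantize([0.8, 1.3], codebook)   # nearest entry is codebook[1]
```

The decoder then sees only `z_q`, so the latent representation of an input is effectively the integer `index`.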

Diffusion models can be seen as a hierarchy of VAEs with many latent layers, where the "encoding" process is a fixed noise schedule rather than a learned encoder. The connection is through the same ELBO framework.