Lecture 4: Latent Diffusion and Cutting-Edge Diffusion Architectures#
This notebook gives a structured technical overview of modern diffusion architecture design, with explicit links to research papers and Hugging Face model repositories.
Scope and Date#
This overview is current as of April 9, 2026 and emphasizes open, practically relevant model families.
1. Why Latent Diffusion Was a Turning Point#
Pixel-space diffusion is expensive because denoising runs over high-dimensional tensors.
Latent diffusion introduces an autoencoder bottleneck: an encoder \(E\) compresses an image \(x\) into a compact latent \(z = E(x)\), and a decoder \(D\) reconstructs \(\hat{x} = D(z)\).

Diffusion training is performed in latent space:

\[
L_{\mathrm{LDM}} = \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}\Big[ \big\| \epsilon - \epsilon_\theta(z_t, t, c) \big\|_2^2 \Big],
\]

where \(z_0 = E(x)\) and \(c\) is conditioning (text, image, depth, etc.).
Key advantages:
- Much lower compute/memory cost.
- Easier scaling to high-resolution generation.
- Modular conditioning interfaces (cross-attention, ControlNet-like side channels).
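The objective above can be sketched numerically. This is a minimal, illustrative NumPy toy, not a real model: a random linear map stands in for the VAE encoder \(E\), and the "denoiser" is an untrained placeholder that predicts zeros, so the \(\epsilon\)-MSE should sit near the latent dimensionality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only): a random linear "encoder" stands in for the VAE E.
d_pixel, d_latent, n = 64, 8, 512
E = rng.normal(size=(d_pixel, d_latent)) / np.sqrt(d_pixel)

x = rng.normal(size=(n, d_pixel))   # fake "images"
z0 = x @ E                          # latents z_0 = E(x)

# Forward noising at one timestep t with cumulative signal level alpha_bar.
alpha_bar = 0.5
eps = rng.normal(size=z0.shape)
zt = np.sqrt(alpha_bar) * z0 + np.sqrt(1 - alpha_bar) * eps

# The epsilon-prediction objective: || eps - eps_theta(z_t, t, c) ||^2.
# eps_theta here is an untrained placeholder (predicts zeros), so the loss
# should land near E[||eps||^2] = d_latent.
eps_pred = np.zeros_like(zt)
loss = np.mean(np.sum((eps - eps_pred) ** 2, axis=1))
print(f"latent dim: {d_latent}, epsilon-MSE of trivial predictor: {loss:.2f}")
```

The point of the sketch is the shapes: every denoiser call operates on `(n, 8)` latents rather than `(n, 64)` pixels, which is exactly where the compute savings come from.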
2. Canonical Papers You Should Know#
| Topic | Paper |
|---|---|
| DDPM objective foundation | Denoising Diffusion Probabilistic Models (2006.11239) |
| Fast deterministic sampling | Denoising Diffusion Implicit Models (2010.02502) |
| Latent diffusion | High-Resolution Image Synthesis with Latent Diffusion Models (2112.10752) |
| Diffusion Transformers (DiT) | Scalable Diffusion Models with Transformers (2212.09748) |
| Consistency models | Consistency Models (2303.01469) |
| Latent consistency | Latent Consistency Models (2310.04378) |
| Score-SDE view | Score-Based Generative Modeling through Stochastic Differential Equations (2011.13456) |
| Flow matching | Flow Matching for Generative Modeling (2210.02747) |
3. Cutting-Edge Open Models (Paper + Hugging Face Links)#
Text-to-Image and Image Editing#
| Model family | Paper / technical report | Hugging Face |
|---|---|---|
| Stable Diffusion v1 (LDM era) | High-Resolution Image Synthesis with Latent Diffusion Models (2112.10752) | CompVis/stable-diffusion-v1-4 |
| SDXL | SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (2307.01952) | stabilityai/stable-diffusion-xl-base-1.0 |
| Stable Diffusion 3.5 | Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (2403.03206) | stabilityai/stable-diffusion-3.5-large |
| FLUX.1 Kontext (in-context image editing) | FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space | black-forest-labs/FLUX.1-Kontext-dev |
| LCM acceleration adapters | Latent Consistency Models (2310.04378) | latent-consistency/lcm-lora-sdxl |
Text-to-Video / Video Foundation Diffusion#
| Model family | Paper / technical report | Hugging Face |
|---|---|---|
| CogVideoX | CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer (2408.06072) | THUDM/CogVideoX-5b |
| HunyuanVideo | HunyuanVideo: A Systematic Framework For Large Video Generative Models (2412.03603) | tencent/HunyuanVideo |
| LTX-Video | LTX-Video: Realtime Video Latent Diffusion (2501.00103) | Lightricks/LTX-Video |
| Wan2.1 | Wan: Open and Advanced Large-Scale Video Generative Models (2503.20314) | Wan-AI/Wan2.1-T2V-14B |
4. Architecture Patterns Behind Current SOTA#
- Latent-space operation: aggressive token compression with strong VAEs.
- Transformer denoisers (DiT/MMDiT variants): better scaling behavior than older UNet-only stacks.
- Flow/ODE viewpoints: rectified-flow and flow-matching styles for faster or cleaner trajectories.
- Few-step distillation: LCM/ADD-style acceleration for practical latency.
- Multimodal conditioning: richer text encoders, prompt rewriting, and control adapters.
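The "transformer denoiser" pattern starts from tokenization: a DiT-style model patchifies the VAE latent into a sequence of tokens. The shapes below are our own illustrative assumptions (latent channels, patch size), not those of any specific released model; the reshape/transpose dance is the standard patchify trick.

```python
import numpy as np

# Sketch of DiT-style tokenization (illustrative shapes, not a real model):
# a VAE latent map of shape (C, H, W) is split into p x p patches, each
# flattened into one transformer token of dimension C * p * p.
C, H, W, p = 4, 32, 32, 2          # assumed latent shape and patch size
latent = np.zeros((C, H, W))

tokens = (
    latent.reshape(C, H // p, p, W // p, p)
          .transpose(1, 3, 0, 2, 4)          # -> (H/p, W/p, C, p, p)
          .reshape((H // p) * (W // p), C * p * p)
)
print(tokens.shape)  # 256 tokens of dim 16 feed the denoising transformer
```

Because attention cost grows quadratically in token count, stronger VAE compression and larger patch sizes directly buy sequence length, which is why "stronger latent tokenization" keeps appearing in frontier model reports.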
import numpy as np
import matplotlib

# Compatibility shim for environments whose matplotlib lacks RcParams._get.
if not hasattr(matplotlib.rcParams, '_get'):
    matplotlib.rcParams._get = matplotlib.rcParams.get

import matplotlib.pyplot as plt

plt.style.use('default')
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(0)
5. Toy Code: Why Latent Space Helps#
We simulate compression + latent denoising with a linear toy autoencoder (PCA-style) to illustrate compute reduction intuition.
# Generate data that lies near a 2-D plane embedded in 12 dimensions.
n, d = 2600, 12
z_true = rng.normal(size=(n, 2))
W = rng.normal(size=(2, d))
X = z_true @ W + 0.2 * rng.normal(size=(n, d))

# Linear "VAE": a PCA encoder/decoder built from the top-2 right singular vectors.
Xc = X - X.mean(axis=0, keepdims=True)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Wenc = Vt[:2].T
z = Xc @ Wenc

# "Diffuse" then "denoise" in latent space: add noise, then shrink toward the mean.
z_noisy = z + 0.55 * rng.normal(size=z.shape)
z_denoised = 0.78 * z_noisy
Xrec = z_denoised @ Wenc.T + X.mean(axis=0, keepdims=True)

print('Pixel-space dim :', d)
print('Latent dim      :', z.shape[1])
print('Compression     :', f"{d / z.shape[1]:.1f}x")
print('Toy recon MSE   :', np.mean((X - Xrec) ** 2))
6. Few-Step vs Many-Step Sampling Intuition#
Distillation and deterministic samplers reduce denoising calls. The speed-quality frontier is model-dependent, but the principle is universal:
- Many steps: usually higher fidelity and diversity.
- Few steps: lower latency and better deployment viability.
# Step-coarsening illustration.
dense = np.arange(100)
sparse_12 = np.linspace(99, 0, 12, dtype=int)
sparse_4 = np.linspace(99, 0, 4, dtype=int)
print('Dense schedule length:', len(dense))
print('12-step schedule     :', sparse_12)
print('4-step schedule      :', sparse_4)
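To see why fewer steps costs accuracy, it helps to integrate a toy deterministic trajectory. This sketch uses a hand-picked linear ODE as a stand-in for a learned probability-flow ODE (real samplers integrate a model-defined ODE/SDE, and `mu` here is an assumption, not anything learned); coarser Euler schedules drift further from the exact endpoint.

```python
import numpy as np

# Toy deterministic "sampler": integrate the linear ODE dz/dt = -(z - mu),
# which pulls a noisy latent toward a target mean mu. Each Euler step plays
# the role of one denoiser call.
mu = 2.0
z_init = -5.0

def euler_sample(n_steps):
    z, h = z_init, 1.0 / n_steps
    for _ in range(n_steps):
        z = z + h * (mu - z)
    return z

exact = mu + (z_init - mu) * np.exp(-1.0)   # closed-form solution at t = 1
for n_steps in (100, 12, 4):
    z = euler_sample(n_steps)
    print(f"{n_steps:>3} steps -> z = {z:.4f}, |error vs exact| = {abs(z - exact):.4f}")
```

The error shrinks monotonically as steps increase; distillation methods (LCM/ADD-style) attack the same trade-off by training a model whose few-step trajectory lands near the many-step endpoint, rather than by naive coarsening.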
7. Toy Code: Video Latent Denoising Intuition#
This is a minimal spatiotemporal denoising sketch (not a full video diffusion model), included to connect concepts used by modern video systems.
# A moving Gaussian blob: the "content" drifts rightward over time.
Tvid, H, W = 12, 26, 26
video = np.zeros((Tvid, H, W))
yy, xx = np.indices((H, W))
for t in range(Tvid):
    cx = 6 + t
    cy = 13
    video[t] = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / 20.0)

noisy = video + 0.34 * rng.normal(size=video.shape)

# Weighted averaging over neighboring frames stands in for a learned
# spatiotemporal denoiser: temporal coherence helps recover the signal.
den = noisy.copy()
for t in range(1, Tvid - 1):
    den[t] = 0.22 * noisy[t - 1] + 0.56 * noisy[t] + 0.22 * noisy[t + 1]
fig, ax = plt.subplots(3, 4, figsize=(10.8, 7.8))
idx = np.linspace(0, Tvid - 1, 4, dtype=int)
for j, t in enumerate(idx):
    ax[0, j].imshow(video[t], cmap='magma', vmin=0, vmax=1)
    ax[0, j].set_title(f'Clean t={t}')
    ax[0, j].axis('off')
    ax[1, j].imshow(noisy[t], cmap='magma', vmin=0, vmax=1)
    ax[1, j].set_title(f'Noisy t={t}')
    ax[1, j].axis('off')
    ax[2, j].imshow(den[t], cmap='magma', vmin=0, vmax=1)
    ax[2, j].set_title(f'Denoised t={t}')
    ax[2, j].axis('off')
plt.tight_layout()
plt.show()
8. How to Read New Diffusion Papers Efficiently#
For any new model, extract these first:
- Representation: pixel or latent? Spatial only or spatiotemporal?
- Backbone: UNet, DiT, MMDiT, or hybrid?
- Objective: \(\epsilon\) / \(x_0\) / \(v\) / flow-matching / consistency?
- Sampler: SDE, ODE, DDIM-like, or distilled few-step?
- Compute profile: training budget, inference steps, VRAM footprint.
This checklist makes comparisons much less ambiguous.
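One lightweight way to apply the checklist is to record each paper as a small structured note. The record type and example values below are our own illustration, not a standard schema.

```python
from dataclasses import dataclass

# A small record type for paper notes, mirroring the checklist above.
@dataclass
class DiffusionPaperNotes:
    representation: str   # 'pixel' or 'latent'; spatial or spatiotemporal
    backbone: str         # e.g. 'UNet', 'DiT', 'MMDiT', 'hybrid'
    objective: str        # 'epsilon', 'x0', 'v', 'flow-matching', 'consistency'
    sampler: str          # 'SDE', 'ODE', 'DDIM-like', 'distilled few-step'
    inference_steps: int  # part of the compute profile

# Example entry: rough notes for the original latent diffusion setup.
ldm = DiffusionPaperNotes(
    representation='latent (spatial)',
    backbone='UNet',
    objective='epsilon',
    sampler='DDIM-like',
    inference_steps=50,
)
print(ldm)
```

Filling one of these per paper forces you to answer the five questions explicitly, which is exactly what makes cross-model comparisons less ambiguous.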
Final Summary#
- Latent diffusion remains the key systems-level idea enabling high-resolution and multimodal generation.
- Current frontier models combine transformer denoisers, stronger latent tokenization, and faster samplers/distillation.
- Video diffusion has rapidly advanced through large open models with explicit spatiotemporal latent design.
- You now have paper and Hugging Face starting points for both foundational and cutting-edge model families.