
Lecture 4: Latent Diffusion and Cutting-Edge Diffusion Architectures#


This notebook gives a structured technical overview of modern diffusion architecture design, with explicit links to research papers and Hugging Face model repositories.

Scope and Date#

This overview is curated as of April 9, 2026 and emphasizes open, practically relevant model families.

1. Why Latent Diffusion Was a Turning Point#

Pixel-space diffusion is expensive because denoising runs over high-dimensional tensors.

Latent diffusion introduces an autoencoder bottleneck:

\[ x \xrightarrow{E} z, \qquad z \xrightarrow{\text{diffusion}} \hat z, \qquad \hat z \xrightarrow{D} \hat x. \]

Diffusion training is performed in latent space:

\[ \mathcal L_{\text{latent-diffusion}} = \mathbb E_{z_0,\epsilon,t}\left[\|\epsilon-\epsilon_\theta(z_t,t,c)\|_2^2\right], \]

where \(z_0=E(x)\) and \(c\) is conditioning (text, image, depth, etc.).

Key advantages:

  1. Much lower compute/memory cost.

  2. Easier scaling to high-resolution generation.

  3. Modular conditioning interfaces (cross-attention, ControlNet-like side channels).
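To make the training objective above concrete, here is a minimal NumPy sketch of the epsilon-prediction loss in latent space. The linear `eps_theta` is a hypothetical stand-in for a real denoising network, and the variance-preserving noising schedule is illustrative, not taken from any specific model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "latents": pretend z0 = E(x) for a batch of 8 samples, latent dim 4.
z0 = rng.normal(size=(8, 4))             # clean latents
eps = rng.normal(size=z0.shape)          # Gaussian noise
t = rng.uniform(0.1, 0.9, size=(8, 1))   # per-sample timestep in (0, 1)

# Simple variance-preserving noising: z_t = sqrt(1 - t) * z0 + sqrt(t) * eps.
z_t = np.sqrt(1.0 - t) * z0 + np.sqrt(t) * eps

# Hypothetical linear "denoiser" eps_theta(z_t, t) conditioned on the timestep.
W = 0.1 * rng.normal(size=(5, 4))        # weights over the stacked [z_t, t]
def eps_theta(z_t, t):
    inp = np.concatenate([z_t, t], axis=1)
    return inp @ W

# Epsilon-prediction MSE, matching the latent-diffusion loss above.
loss = np.mean((eps - eps_theta(z_t, t)) ** 2)
print(f"toy latent-diffusion loss: {loss:.4f}")
```

A real model replaces `eps_theta` with a UNet or transformer and adds conditioning \(c\) via cross-attention; the loss itself keeps exactly this shape.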

2. Canonical Papers You Should Know#

4. Architecture Patterns Behind Current SOTA#

  1. Latent-space operation: aggressive token compression with strong VAEs.

  2. Transformer denoisers (DiT/MMDiT variants): better scaling behavior than older UNet-only stacks.

  3. Flow/ODE viewpoints: rectified-flow and flow-matching styles for faster or cleaner trajectories.

  4. Few-step distillation: LCM/ADD-style acceleration for practical latency.

  5. Multimodal conditioning: richer text encoders, prompt rewriting, and control adapters.
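Pattern 3 can be illustrated with a tiny rectified-flow example. The setup below is a toy sketch (all numbers illustrative): interpolate data and noise along straight lines, and note that a straight trajectory can be reversed exactly in a single Euler step when the velocity is known, which is the intuition behind few-step flow samplers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Rectified flow: straight-line interpolation between data and noise.
x0 = rng.normal(loc=2.0, size=(1000, 2))   # toy "data" samples
x1 = rng.normal(size=(1000, 2))            # pure-noise samples
t = rng.uniform(size=(1000, 1))            # random interpolation times

x_t = (1.0 - t) * x0 + t * x1              # point on the straight path
v_target = x1 - x0                         # constant velocity along the path

# Flow matching regresses a model onto v_target; a constant-mean baseline
# shows what that regression target looks like.
v_pred = np.broadcast_to(v_target.mean(axis=0), v_target.shape)
fm_loss = np.mean((v_pred - v_target) ** 2)
print(f"flow-matching loss of mean-velocity baseline: {fm_loss:.4f}")

# Because trajectories are straight, one Euler step with the true velocity
# maps noise back to data exactly: x1 - v = x1 - (x1 - x0) = x0.
x0_rec = x1 - 1.0 * v_target
print("one-step recovery exact:", np.allclose(x0_rec, x0))
```

Trained models only approximate the true velocity, so real few-step sampling is not exact; the straight-path structure is what makes it a good approximation.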

import numpy as np
import matplotlib.pyplot as plt

plt.style.use('default')
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(0)

5. Toy Code: Why Latent Space Helps#

We simulate compression and latent denoising with a linear toy autoencoder (PCA-style) to build intuition for why operating in a compressed latent space reduces compute.

n, d = 2600, 12                                  # samples, ambient ("pixel") dim
z_true = rng.normal(size=(n, 2))                 # 2-D ground-truth latents
W = rng.normal(size=(2, d))                      # linear "decoder" to pixel space
X = z_true @ W + 0.2 * rng.normal(size=(n, d))   # noisy high-dimensional data

# PCA-style linear encoder: top-2 principal directions of the centered data.
Xc = X - X.mean(axis=0, keepdims=True)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Wenc = Vt[:2].T

z = Xc @ Wenc                                    # encode: pixel -> latent
z_noisy = z + 0.55 * rng.normal(size=z.shape)    # corrupt latents with noise
z_denoised = 0.78 * z_noisy                      # crude linear shrinkage "denoiser"
Xrec = z_denoised @ Wenc.T + X.mean(axis=0, keepdims=True)  # decode back

print('Pixel-space dim :', d)
print('Latent dim      :', z.shape[1])
print('Compression     :', f"{d / z.shape[1]:.1f}x")
print('Toy recon MSE   :', np.mean((X - Xrec) ** 2))
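The compute argument becomes stark once you count self-attention cost over tokens, which scales as O(N²) in token count N. The numbers below are illustrative assumptions (a 512×512 image, an 8× downsampling VAE, 2×2 patchification as in DiT-style backbones, and worst-case one-token-per-pixel attention in pixel space), not figures from any specific model:

```python
# Token counts: pixel-space attention vs latent-space attention.
pix_tokens = 512 * 512            # one token per pixel (worst case)
lat_hw = 512 // 8                 # 8x-downsampling VAE -> 64x64 latents
lat_tokens = (lat_hw // 2) ** 2   # 2x2 patchify -> 32x32 = 1024 tokens

# Self-attention scales as O(N^2) in the number of tokens N.
ratio = (pix_tokens ** 2) / (lat_tokens ** 2)
print(f"pixel tokens : {pix_tokens}")
print(f"latent tokens: {lat_tokens}")
print(f"attention-cost ratio: {ratio:.0f}x")
```

Real pixel-space models mitigate this with downsampling inside the UNet, so the true gap is smaller, but the quadratic token dependence is why latent compression matters so much for transformer denoisers.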

6. Few-Step vs Many-Step Sampling Intuition#

Distillation and deterministic samplers reduce denoising calls. The speed-quality frontier is model-dependent, but the principle is universal:

  • many steps: usually higher fidelity and diversity,

  • few steps: lower latency and better deployment viability.

# Step-coarsening illustration.
dense = np.arange(100)
sparse_12 = np.linspace(99, 0, 12, dtype=int)
sparse_4 = np.linspace(99, 0, 4, dtype=int)

print('Dense schedule length:', len(dense))
print('12-step schedule    :', sparse_12)
print('4-step schedule     :', sparse_4)
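Why do fewer steps cost quality? Deterministic samplers numerically integrate an ODE, and coarser schedules mean larger discretization error. As a toy analogy (not an actual diffusion sampler), integrating dx/dt = -x with Euler steps shows the error growing as the step count shrinks:

```python
import numpy as np

# Integrate dx/dt = -x from t=0 to t=1 with n Euler steps; the exact
# solution is x(1) = exp(-1). Fewer steps -> larger discretization error.
def euler_integrate(n_steps, x0=1.0):
    x, dt = x0, 1.0 / n_steps
    for _ in range(n_steps):
        x += dt * (-x)    # Euler update with the "velocity field" -x
    return x

exact = np.exp(-1.0)
for n in (100, 12, 4):
    approx = euler_integrate(n)
    print(f"{n:3d} steps -> error {abs(approx - exact):.5f}")
```

Distillation attacks this from the other side: rather than integrating more finely, it trains a model whose few large steps directly approximate the many-step trajectory.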

7. Toy Code: Video Latent Denoising Intuition#

This is a minimal spatiotemporal denoising sketch (not a full video diffusion model), included to connect concepts used by modern video systems.

Tvid, H, W = 12, 26, 26                 # frames, height, width
video = np.zeros((Tvid, H, W))
yy, xx = np.indices((H, W))

# Clean signal: a Gaussian blob drifting one pixel rightward per frame.
for t in range(Tvid):
    cx = 6 + t
    cy = 13
    video[t] = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / 20.0)

noisy = video + 0.34 * rng.normal(size=video.shape)

# "Denoise" with a fixed temporal [0.22, 0.56, 0.22] smoothing kernel,
# a crude stand-in for learned spatiotemporal denoising.
den = noisy.copy()
for t in range(1, Tvid - 1):
    den[t] = 0.22 * noisy[t - 1] + 0.56 * noisy[t] + 0.22 * noisy[t + 1]

fig, ax = plt.subplots(3, 4, figsize=(10.8, 7.8))
idx = np.linspace(0, Tvid - 1, 4, dtype=int)
for j, t in enumerate(idx):
    ax[0, j].imshow(video[t], cmap='magma', vmin=0, vmax=1)
    ax[0, j].set_title(f'Clean t={t}')
    ax[0, j].axis('off')

    ax[1, j].imshow(noisy[t], cmap='magma', vmin=0, vmax=1)
    ax[1, j].set_title(f'Noisy t={t}')
    ax[1, j].axis('off')

    ax[2, j].imshow(den[t], cmap='magma', vmin=0, vmax=1)
    ax[2, j].set_title(f'Denoised t={t}')
    ax[2, j].axis('off')

plt.tight_layout()
plt.show()

8. How to Read New Diffusion Papers Efficiently#

For any new model, extract these first:

  1. Representation: pixel or latent? spatial only or spatiotemporal?

  2. Backbone: UNet, DiT, MMDiT, hybrid?

  3. Objective: \(\epsilon\) / \(x_0\) / \(v\) / flow-matching / consistency?

  4. Sampler: SDE, ODE, DDIM-like, distilled few-step?

  5. Compute profile: training budget, inference steps, VRAM footprint.

This checklist makes comparisons much less ambiguous.

Final Summary#

  • Latent diffusion remains the key systems-level idea enabling high-resolution and multimodal generation.

  • Current frontier models combine transformer denoisers, stronger latent tokenization, and faster samplers/distillation.

  • Video diffusion has rapidly advanced through large open models with explicit spatiotemporal latent design.

  • You now have paper and Hugging Face starting points for both foundational and cutting-edge model families.