
Lecture 1: The Essence of Generative AI#


This lecture builds the mathematical and intuitive foundation used in the rest of the series.

We establish what it means to model a data distribution, why likelihood is central, and how sampling connects equations to generation.

Learning Goals#

  1. Distinguish discriminative and generative modeling in probabilistic terms.

  2. Understand density modeling, likelihood, and maximum-likelihood estimation.

  3. Build intuition for KL divergence as mismatch between distributions.

  4. Connect latent-variable thinking to modern generative models.

  5. Prepare rigorously for VAE and diffusion derivations in the next lectures.

1. What Is a Generative Model?#

A model is generative if it defines a probability law over data and supports sample generation.

We write:

\[ x \sim p_\theta(x). \]

For supervised tasks, discriminative models focus on \(p_\theta(y\mid x)\). Generative models instead attempt to represent structure of \(x\) itself.

This distinction matters because generation requires learning how data is distributed, not only where class boundaries lie.

2. Dataset Likelihood and Maximum Likelihood Estimation#

Suppose we observe a dataset:

\[ \mathcal D = \{x_i\}_{i=1}^N. \]

Assuming i.i.d. samples under \(p_\theta(x)\):

\[ p_\theta(\mathcal D) = \prod_{i=1}^N p_\theta(x_i). \]

Taking logs converts products into sums:

\[ \log p_\theta(\mathcal D) = \sum_{i=1}^N \log p_\theta(x_i). \]

MLE chooses parameters that maximize this objective:

\[ \theta^* = \arg\max_\theta \sum_{i=1}^N \log p_\theta(x_i). \]

Conceptually, MLE pushes high probability mass toward observed samples.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde, multivariate_normal

plt.style.use('default')
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(5)
# 2D multimodal toy dataset used throughout Lecture 1.
means = np.array([[-2.1, -0.4], [0.7, 2.2], [2.3, -1.6]])
covs = np.array([
    [[0.32, 0.10], [0.10, 0.42]],
    [[0.30, -0.12], [-0.12, 0.34]],
    [[0.44, 0.05], [0.05, 0.24]],
])
probs = np.array([0.34, 0.41, 0.25])

n = 3200
comp = rng.choice(3, size=n, p=probs)
X = np.zeros((n, 2))
for k in range(3):
    idx = comp == k
    X[idx] = rng.multivariate_normal(means[k], covs[k], size=idx.sum())

fig, ax = plt.subplots(figsize=(6.2, 5.2))
ax.scatter(X[:, 0], X[:, 1], s=7, alpha=0.3)
ax.set_title('Observed Data Samples from an Unknown Distribution')
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.axis('equal')
plt.show()

3. Why Likelihood Optimization Works#

Minimizing the negative log-likelihood is equivalent to minimizing the empirical cross-entropy between the data and the model.

In expectation (under true data distribution \(p_\text{data}\)):

\[ \mathbb E_{x\sim p_\text{data}}[-\log p_\theta(x)] = H(p_\text{data}) + D_{\mathrm{KL}}(p_\text{data}\|p_\theta). \]

Since \(H(p_\text{data})\) does not depend on \(\theta\), maximizing likelihood is equivalent to minimizing \(D_{\mathrm{KL}}(p_\text{data}\|p_\theta)\), the divergence from the data distribution to the model.

This is the formal reason MLE is a principled objective for generative modeling.
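The decomposition above can be checked numerically. Below is a small sketch (illustrative values, not from the lecture) that verifies \(\mathbb E_p[-\log q] = H(p) + D_{\mathrm{KL}}(p\|q)\) for two hand-picked discrete distributions:

```python
import numpy as np

# Two small discrete distributions standing in for p_data and p_theta.
p = np.array([0.5, 0.3, 0.2])   # "data" distribution
q = np.array([0.4, 0.4, 0.2])   # "model" distribution

cross_entropy = -(p * np.log(q)).sum()   # E_p[-log q]
entropy = -(p * np.log(p)).sum()         # H(p)
kl = (p * np.log(p / q)).sum()           # KL(p || q)

# The identity holds exactly: cross-entropy = entropy + KL.
assert np.isclose(cross_entropy, entropy + kl)
print(cross_entropy, entropy, kl)
```

Because \(H(p)\) is fixed by the data, only the KL term responds to changes in \(q\); this is why driving down cross-entropy drives the model toward the data distribution.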

# MLE for a single Gaussian on the multimodal dataset.
# This intentionally underfits to illustrate model-family mismatch.
mu_hat = X.mean(axis=0)
cov_hat = np.cov(X.T, bias=True)  # bias=True divides by N, matching the MLE (not the unbiased N-1 estimator)

rv_hat = multivariate_normal(mean=mu_hat, cov=cov_hat)
ll = rv_hat.logpdf(X).mean()

xx, yy = np.meshgrid(np.linspace(-5, 5, 220), np.linspace(-5, 5, 220))
pts = np.column_stack([xx.ravel(), yy.ravel()])
zz = rv_hat.pdf(pts).reshape(xx.shape)

fig, ax = plt.subplots(1, 2, figsize=(12.5, 4.8))
ax[0].scatter(X[:, 0], X[:, 1], s=5, alpha=0.22)
ax[0].set_title('Data (Multimodal)')
ax[0].axis('equal')

cs = ax[1].contour(xx, yy, zz, levels=10, cmap='viridis')
ax[1].clabel(cs, inline=True, fontsize=8)
ax[1].scatter(X[:, 0], X[:, 1], s=4, alpha=0.08, color='black')
ax[1].set_title(f'Single Gaussian MLE Fit (avg log-likelihood={ll:.3f})')
ax[1].axis('equal')

plt.tight_layout()
plt.show()

The previous plot illustrates an important modeling lesson:

  • Optimization can be correct.

  • But if the model class is too simple, it still cannot represent the true data geometry.

This motivates richer model families (latent-variable models, autoregressive models, diffusion models).

4. Sampling as the Operational Meaning of “Generative”#

A model is practically useful only if we can sample from it.

In many models, generation means running a stochastic program:

\[ z \sim p(z),\qquad x \sim p_\theta(x\mid z). \]

In other models (like diffusion), generation is iterative denoising from random noise.

Either way, generation always involves transforming random seeds into structured outputs.

# 1D density estimation and sampling demonstration.
x1 = X[:, 0]
kde = gaussian_kde(x1)
grid = np.linspace(x1.min() - 1.0, x1.max() + 1.0, 600)
dens = kde(grid)

# Approximate sampling from estimated density by CDF inversion.
pdf = dens / np.trapz(dens, grid)
cdf = np.cumsum(pdf)
cdf /= cdf[-1]

u = rng.uniform(size=4500)
x_samp = np.interp(u, cdf, grid)

fig, ax = plt.subplots(1, 2, figsize=(12.4, 4.2))
ax[0].hist(x1, bins=80, density=True, alpha=0.48, label='Observed data')
ax[0].plot(grid, pdf, lw=2.0, color='black', label='Estimated density')
ax[0].set_title('Density Estimation in 1D')
ax[0].legend()

ax[1].hist(x_samp, bins=80, density=True, alpha=0.55, label='Generated samples')
ax[1].plot(grid, pdf, lw=2.0, color='black', label='Target density')
ax[1].set_title('Sampling from Learned/Estimated Distribution')
ax[1].legend()

plt.tight_layout()
plt.show()

5. Gaussian Geometry: Mean, Variance, and Covariance#

Why are Gaussians everywhere in generative models?

  1. Closed-form manipulations are available.

  2. Noise injection naturally uses Gaussian perturbations.

  3. Latent-variable posteriors are often approximated as diagonal Gaussians.

In two dimensions, covariance creates ellipses aligned to principal uncertainty directions.

mean = np.array([0.0, 0.0])
cov = np.array([[2.0, 1.3], [1.3, 1.1]])
rv = multivariate_normal(mean=mean, cov=cov)

xx, yy = np.meshgrid(np.linspace(-4, 4, 240), np.linspace(-4, 4, 240))
pts = np.column_stack([xx.ravel(), yy.ravel()])
zz = rv.pdf(pts).reshape(xx.shape)

S = rng.multivariate_normal(mean, cov, size=1400)

fig, ax = plt.subplots(figsize=(6.4, 5.3))
cs = ax.contour(xx, yy, zz, levels=10, cmap='viridis')
ax.clabel(cs, inline=True, fontsize=8)
ax.scatter(S[:, 0], S[:, 1], s=6, alpha=0.25)
ax.set_title('Multivariate Gaussian: Level Sets and Samples')
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.axis('equal')
plt.show()
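The claim that the ellipses align with "principal uncertainty directions" can be made concrete with an eigendecomposition. A minimal sketch, reusing the same covariance matrix as the plot above: the eigenvectors of the covariance give the ellipse axes, and the square roots of the eigenvalues give the standard deviations along them.

```python
import numpy as np

cov = np.array([[2.0, 1.3], [1.3, 1.1]])
# eigh returns ascending eigenvalues and orthonormal eigenvector columns
# for a symmetric matrix such as a covariance.
eigvals, eigvecs = np.linalg.eigh(cov)

for lam, v in zip(eigvals, eigvecs.T):
    print(f"principal direction {v}, std along it {np.sqrt(lam):.3f}")
```

Note that the eigenvalues sum to the trace (total variance) and multiply to the determinant, two invariants that survive any rotation of coordinates.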

6. Latent Variables as Compression of Structure#

A latent-variable model writes:

\[ p_\theta(x) = \int p_\theta(x\mid z)p(z)\,dz. \]

Interpretation:

  • \(z\) is a compressed stochastic representation.

  • \(p(z)\) defines where latent codes live.

  • \(p_\theta(x\mid z)\) decodes latent structure into observations.

This decomposition is the conceptual bridge to VAEs.


# Hand-crafted latent generator for intuition (not learned).
Z = rng.normal(size=(3600, 2))
A = np.array([[1.7, 0.4], [-0.3, 1.25]])
nonlinear = np.column_stack([
    0.7 * np.sin(1.25 * Z[:, 0]),
    0.55 * np.cos(1.8 * Z[:, 1]),
])
Xg = Z @ A.T + nonlinear + 0.10 * rng.normal(size=(len(Z), 2))

fig, ax = plt.subplots(1, 2, figsize=(12.4, 4.9))
ax[0].scatter(Z[:, 0], Z[:, 1], s=5, alpha=0.30)
ax[0].set_title(r'Latent samples $z \sim \mathcal N(0, I)$')
ax[0].set_xlabel('$z_1$')
ax[0].set_ylabel('$z_2$')
ax[0].axis('equal')

ax[1].scatter(Xg[:, 0], Xg[:, 1], s=5, alpha=0.30, color='#DD8452')
ax[1].set_title(r'Decoded data samples $x = g_\theta(z)$')
ax[1].set_xlabel('$x_1$')
ax[1].set_ylabel('$x_2$')
ax[1].axis('equal')

plt.tight_layout()
plt.show()
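The marginal-likelihood integral \(p_\theta(x) = \int p_\theta(x\mid z)\,p(z)\,dz\) can also be read as an expectation over the prior, which suggests a Monte Carlo estimate. Below is a sketch on a toy model (chosen for illustration, not the generator above) where the marginal is available in closed form, so the estimate can be checked: with \(z \sim \mathcal N(0,1)\) and \(x\mid z \sim \mathcal N(z, \sigma^2)\), the marginal is \(\mathcal N(0, 1+\sigma^2)\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 0.5
x = 1.2

# Monte Carlo: p(x) = E_{z ~ p(z)}[ p(x | z) ], averaged over prior draws.
z = rng.normal(size=200_000)
mc_estimate = norm.pdf(x, loc=z, scale=sigma).mean()

# Closed-form marginal for this toy model.
exact = norm.pdf(x, loc=0.0, scale=np.sqrt(1.0 + sigma**2))
print(mc_estimate, exact)
```

In high dimensions this naive estimator becomes hopeless, because almost no prior draws explain a given \(x\); that failure is exactly what motivates the learned encoder and the ELBO in Lecture 2.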

7. Generative Model Families (Preview)#

Autoregressive#

Factorize with chain rule and sample one coordinate/token at a time.

\[ p(x)=\prod_i p(x_i\mid x_{<i}). \]
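The chain-rule factorization can be exercised on the smallest possible example. Below is a sketch (toy case, assumed for illustration) that samples a correlated 2D Gaussian one coordinate at a time: first \(x_1 \sim p(x_1)\), then \(x_2 \sim p(x_2 \mid x_1)\) using the standard Gaussian conditioning formulas.

```python
import numpy as np

rng = np.random.default_rng(1)
cov = np.array([[2.0, 1.3], [1.3, 1.1]])  # target joint covariance (zero mean)

n = 100_000
# Step 1: marginal of the first coordinate.
x1 = rng.normal(0.0, np.sqrt(cov[0, 0]), size=n)
# Step 2: conditional of the second coordinate given the first.
cond_mean = (cov[1, 0] / cov[0, 0]) * x1
cond_var = cov[1, 1] - cov[1, 0] ** 2 / cov[0, 0]
x2 = rng.normal(cond_mean, np.sqrt(cond_var))

X_ar = np.column_stack([x1, x2])
print(np.cov(X_ar.T))  # empirical covariance; compare against cov
```

Autoregressive models generalize this recipe: replace the closed-form Gaussian conditionals with learned conditionals \(p_\theta(x_i \mid x_{<i})\), sampled in the same left-to-right order.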

Variational Autoencoders (Lecture 2)#

Learn an encoder and decoder by maximizing ELBO.

Diffusion / Score Models (Lecture 3)#

Learn to invert gradual noising through iterative denoising.

Each family is a different computational strategy for approximating the same target: realistic sampling from high-dimensional data distributions.

8. Common Failure Modes to Keep in Mind#

  • Mode dropping / mode averaging: model misses or blurs modes.

  • Mismatch between training objective and sample quality: likelihood and perceptual quality can diverge.

  • Poor latent geometry: latent interpolation may not correspond to meaningful data transitions.

These issues motivate the design choices in VAEs and diffusion systems.

Summary#

  • Generative modeling means learning a distribution over data, not only decision boundaries.

  • Likelihood optimization is principled through KL decomposition.

  • Gaussian structure and latent variables recur because they offer tractability and geometric interpretability.

  • You now have the conceptual and probabilistic foundation needed for VAE ELBO derivation and implementation.

Next#

Continue to Lecture 2: Variational Autoencoders in Depth.