Thesis. A Research Scientist, GenAI loop is five overlapping interviews stacked on top of each other — coding, ML/CV fundamentals, research dive, system design, behavioral.
Main technical point. Almost every modern generative system collapses to three primitives: a forward corruption process, a denoiser/velocity that reverses it, and a sampler that integrates the reverse dynamics. VAE, DDPM, score, flow matching, rectified flow, DDIM, CFG, DPS are all specific instantiations of those primitives.
Practical implication. Revise in the order you’d build a system: warm up with Python, cover ML and classical CV foundations, master the transformer, stack on diffusion and flow models, then climb to text-to-image and video, and only then worry about serving and RL alignment. This post is that order.
⚠️ Not exhaustive. I wrote this down from memory after my own full-time interview prep cycle. It is biased toward the things that worked for me and that came up repeatedly across different research labs I interviewed with — so there are whole areas (classical RL, speech, 3D, retrieval systems, robotics) that are barely here, simply because they didn’t show up in my loops. Use it as one data point among several, not as a complete syllabus.
🤝 Suggest additions. This is a living document. If you know a better resource, a missed topic, a cleaner derivation, or a recent paper that should be in the pointer set, please open an issue on the repo or email me at saini.2@utexas.edu. I’ll keep merging good additions in.
0. Framing and the full outline
My loops decomposed into the same ladder every time. Knowing which round you are in keeps your answers at the right altitude.
- Coding. Usually one LeetCode-medium plus one applied-ML question (“implement MLE for a Gaussian,” “write the forward pass for scaled dot-product attention,” “compute KV-cache memory for a given model shape”).
- ML / CV fundamentals. Losses, MLE vs MAP, KL divergence, SVM hinge loss, clustering, classical CV (SSIM/VIF/LPIPS, histogram equalization, Sobel), CNN output shape, BatchNorm vs LayerNorm, self-supervised learning.
- Research dive. Depth-first on one of your papers. Derive the loss, ablate components, predict what breaks under a change.
- System design. Build a T2I or T2V product end-to-end — data → latent space → backbone → training → eval → serving — in 45 minutes on a whiteboard.
- Paper trace + behavioral. Contrast 5–8 landmark papers in the team’s area in 30 seconds each; be ready with a 30-second, 2-minute, and 10-minute pitch of your own work.
The rest of this note is the ladder itself. The order matters — each section builds on the last.
Full outline — click any entry to jump
- Coding tier. Python internals, CNN/attention-forward-pass warmups, DSA cadence, ML-system-design checklist.
- ML fundamentals. MLE/MAP, loss zoo (L1/L2/CE/hinge/triplet/KL), normalization, clustering, SVM.
- GenAI fundamentals. Generative vs discriminative, likelihood-based vs likelihood-free family tree.
- Classical CV. Histogram equalization, Sobel, Gaussian/median filters, NLM, SIFT, color theory, color spaces, Retinex.
- CV quality. PSNR, SSIM, VIF, LPIPS/DISTS, NR-IQA, HDR quality, FID/FVD, CLIPScore/HPS/ImageReward, VBench/PhysGenBench, MLLM-as-judge.
- CNNs & SSL. ConvNet primer, contrastive and masked SSL.
- Transformers. Scaled dot-product attention, multi-head, cross, masked, RoPE 1D/2D/3D, ALiBi, FlashAttention, GQA/MLA, KV cache.
- VAE & ELBO. Two derivations, reparameterization, VQ-VAE, SAE.
- DDPM. Forward chain, reverse parameterization, \(\epsilon\)-prediction, Tweedie.
- Score & SDE. DSM, EBM, VP/VE SDE, probability-flow ODE.
- Flow matching & rectified flow. Linear interpolant, reflow, mean flow, logit-normal time.
- Sampling & guidance. DDIM, Euler/Midpoint/Heun, CFG, classifier guidance, DPS, MPGD.
- Conditioning. AdaIN/FiLM/AdaLN, cross-attention, ControlNet, T2I-Adapter, LoRA, MM-DiT.
- T2I design space. SD1 → SD3 → Flux → Nano Banana; why CLIP needed T5.
- Video generation. Temporal attention, LVDM, inverse tone mapping (SDR→HDR).
- Discrete diffusion. Transition matrix, MLDM/LLaDA.
- LLMs & nanoGPT. Tokenization, causal LM, sampling.
- Evaluation metrics. When each metric saturates.
- Training lifecycle. Pretraining, mid-training, post-training (SFT, preference tuning, RLHF), and the diffusion-specific analogues.
- Training nuances. Mixed precision, ZeRO, FSDP, gradient accumulation, EMA.
- Inference nuances. Quantization, distillation, pruning, continuous batching, vLLM/TensorRT-LLM.
- RL alignment (details). PPO, DPO, GRPO, DAPO, Best-of-N.
- Paper trace + behavioral.
1. Coding tier
Python essentials
Interviewers love a 5-minute sanity check on Python internals before anything else.
- Execution model. Python source → lexer → parser → AST → bytecode → CPython VM (stack-based). PyPy adds JIT on top; CPython does not.
- PEP 8. 4-space indent, 79-char lines, `snake_case` variables, `CamelCase` classes.
- Memory. Reference counting + generational GC. When the refcount hits zero, the object is deallocated; the cyclic collector handles reference cycles.
- `*args`, `**kwargs`. Variable positional (tuple) and keyword (dict) arguments.
Applied-ML warmups (run these in your head)
MLE for a Gaussian — derive and code. \(\nabla_\theta \log\prod_i f(x_i;\theta)=0\) gives closed-form \(\hat\mu=\tfrac{1}{N}\sum x_i,\ \hat\sigma^2=\tfrac{1}{N}\sum(x_i-\hat\mu)^2\).
```python
import numpy as np

def gaussian_mle(x):
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()
    return mu, sigma2

# MAP with Gaussian prior N(mu0, tau2) on mu, known sigma2
def gaussian_map_mean(x, sigma2, mu0, tau2):
    n = len(x)
    return (mu0 / tau2 + x.sum() / sigma2) / (1 / tau2 + n / sigma2)
```
CNN output shape. For input width \(W\), filter \(F\), stride \(S\), padding \(P\):
\[ O = \left\lfloor\frac{W - F + 2P}{S}\right\rfloor + 1 \]
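A two-line helper to sanity-check the formula (the 224 → 112 ResNet-stem call is just an illustration):

```python
def conv_out(W, F, S=1, P=0):
    # floor((W - F + 2P) / S) + 1
    return (W - F + 2 * P) // S + 1

conv_out(224, 7, S=2, P=3)  # ResNet stem: 7x7 conv, stride 2, pad 3 -> 112
```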
k-th smallest. Heap, \(O(n\log k)\):
```python
import heapq

def kth_smallest(nums, k):
    return heapq.nsmallest(k, nums)[-1]
```
Naive attention forward pass — the single most common whiteboard ask.
```python
import torch, math

def attention(q, k, v, mask=None):
    # q, k, v: (B, H, T, d_k)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if mask is not None:  # causal mask shape (T, T)
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attn = torch.softmax(scores, dim=-1)
    return attn @ v, attn
```
DSA cadence
What worked for me: 1 hour/day of LeetCode plus 2 structured hours, 2–3 problems daily, 300 most-frequent list over ~2 months[1]. Pattern list:
- Two pointers & sliding window (longest substring, k-distinct, min window).
- Binary search on the answer (Koko bananas, split array largest sum).
- Heap / priority queue (top-k, merge-k-lists, schedulers).
- Monotonic stack (daily temperatures, largest rectangle).
- Dynamic programming — 1D, 2D, on intervals, trees, bitmask.
- Graph BFS/DFS, Dijkstra, topological sort, union-find.
- Backtracking (permutations, combinations, N-queens).
- Bit manipulation and prefix-sum tricks.
ML system-design checklist (T2I/T2V)
- Data. Crawl → filter (CLIP relevance, aesthetics, dedup, NSFW), recaption with a VLM.
- Latent space. Train VAE / VQ-VAE / advanced VAE with perceptual + adversarial loss.
- Backbone. MM-DiT scale; positional encoding (RoPE 2D/3D); FSDP or ZeRO-3 sharding.
- Training. Flow matching with logit-normal time; EMA; mixed precision.
- Eval harness. Prompt set × metrics (CLIPScore, HPSv2, ImageReward, VBench, MLLM-judge).
- Inference. Heun/DPM-Solver; step distillation; FP8/INT8; CFG annealing; continuous batching for AR heads.
- Safety. Prompt pre-filter, output post-filter, concept-erasing LoRA.
Deep dive: NeetCode Coding Interview Roadmap · Karpathy — Neural Networks: Zero to Hero for the applied-ML coding muscle.
2. Machine learning fundamentals
MLE and MAP
\[ \theta^\text{MLE} = \arg\max_\theta \prod_i p(x_i\mid\theta), \quad \theta^\text{MAP} = \arg\max_\theta \prod_i p(x_i\mid\theta)\,p(\theta). \]
Practical identity: maximizing the likelihood equals minimizing the negative log-likelihood, which equals minimizing cross-entropy against the empirical data distribution.
The loss-function zoo
- \(L_1\) vs \(L_2\). \(L_1\) promotes sparsity and robustness to outliers; \(L_2\) is smoother and mean-seeking. Image reconstruction typically uses a convex mix.
- Cross-entropy. Canonical classification/token-level loss.
- Hinge (SVM). \(\arg\min \tfrac{1}{2}\|w\|^2 + C\sum_i \max(0, 1 - y_i(w^\top x_i - b))\).
- Triplet. \(\max(0, d(A,P) - d(A,N) + m)\). Enforces anchor–positive closer than anchor–negative by margin \(m\).
- KL divergence. \(D_\text{KL}(p\Vert q)=\sum p\log(p/q)\). Non-negative, asymmetric, zero iff equal.
```python
import torch.nn.functional as F

def kl_divergence(p, q, eps=1e-12):  # both shape (..., C)
    return (p * (p.clamp_min(eps).log() - q.clamp_min(eps).log())).sum(-1)

def hinge_loss(scores, y):  # y in {-1, +1}
    return F.relu(1 - y * scores).mean()

def triplet_loss(a, p, n, margin=1.0):
    d_ap = (a - p).norm(dim=-1)
    d_an = (a - n).norm(dim=-1)
    return F.relu(d_ap - d_an + margin).mean()
```
Normalization
- BatchNorm. Normalize per-channel across the batch. Depends on batch statistics — breaks with very small batches.
- LayerNorm. Normalize per-sample across features. Batch-size independent; the default for transformers.
- RMSNorm. LayerNorm without mean subtraction; \(\tfrac{1}{\sqrt{\text{RMS}(x)^2+\varepsilon}}\cdot x\). Used in LLaMA-family models.
- GroupNorm. Middle ground used inside UNets and DiT blocks.
Clustering (quick)
- K-means. Hard assignment via nearest centroid; iterative \((\text{assign}, \text{update})\).
- GMM. Soft clustering with \(K\) Gaussians \((\mu_k,\Sigma_k,\pi_k)\) via EM; ellipsoidal clusters.
- DBSCAN. Density-based; arbitrary-shape clusters; noise is a first-class output.
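A minimal NumPy sketch of K-means as Lloyd iterations (no empty-cluster handling; `kmeans` is just an illustrative name):

```python
import numpy as np

def kmeans(x, k, iters=100, seed=0):
    # x: (N, d); alternate hard assignment and centroid update
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        labels = ((x[:, None] - centroids[None]) ** 2).sum(-1).argmin(-1)
        centroids = np.stack([x[labels == j].mean(0) for j in range(k)])
    return labels, centroids
```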
SVM and kernels
Find the hyperplane \(w^\top x + b\) with max margin. Non-linear case: transform data via \(\phi\) into a higher-dim space where classes are linearly separable; the kernel trick avoids constructing \(\phi\) explicitly via \(K(x,y) = \phi(x)^\top\phi(y)\) (linear, polynomial, RBF).
Deep dive: Stanford CS229 (Andrew Ng) for the math · StatQuest with Josh Starmer for 10-minute intuition videos on MLE, PCA, SVM, GMM, clustering.
3. Generative modeling fundamentals
Generative modeling learns \(p(x)\) or \(p(x,y)\); discriminative modeling learns \(p(y\mid x)\) directly.
Family tree
- Likelihood-based: VAE, autoregressive, diffusion, normalizing flows, energy-based models.
- Likelihood-free: GANs — min-max adversarial, strong images at small scale but prone to mode collapse; WGAN-GP mitigates with a Lipschitz critic and gradient penalty.
Why the shift? Sampling directly from \(p(x)\) in high-dim is intractable; simpler to transport a Gaussian base to the data via a tractable forward corruption and a learned reverse — the diffusion/flow-matching recipe (sections 9–11).
Autoregressive factorization
\[ p_\theta(x) = p_\theta(x_1)\,p_\theta(x_2\mid x_1)\cdots p_\theta(x_L\mid x_{<L}) \]
Loss collapses to token-level CE. Sampling: top-\(k\) (restrict to top \(k\) logits) or nucleus (top-\(p\)) sampling.
Deep dive: Lilian Weng — From Autoencoder to Beta-VAE · HF CV Course — Generative Models (Unit 5).
4. Classical computer vision
Image processing primitives
- Histogram equalization. Remap intensities through the CDF of the image’s histogram: \(\text{img}'[i,j] = \text{CDF}(\text{img}[i,j])\cdot 255\). Boosts contrast on low-dynamic-range imagery.
- Sobel edges. \(G_x=\begin{bmatrix}-1&0&1\\-2&0&2\\-1&0&1\end{bmatrix},\ G_y=G_x^\top,\ |G|=\sqrt{G_x^2+G_y^2}\).
- Gaussian filter. Low-pass \(\tfrac{1}{16}\begin{bmatrix}1&2&1\\2&4&2\\1&2&1\end{bmatrix}\) smooths additive noise.
- Median filter. Best for salt-and-pepper noise (Gaussian blur makes it worse).
- Non-Local Means (NLM). Compare patches not pixels; average similar patches within a search window.
- SIFT. Scale-invariant keypoint detection via Difference-of-Gaussians; assign orientation; 128-dim descriptor.
```python
import numpy as np
from scipy.ndimage import convolve

def sobel(img):
    Gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
    Gy = Gx.T
    gx, gy = convolve(img, Gx), convolve(img, Gy)
    return np.sqrt(gx**2 + gy**2)

def hist_eq(img):  # img: uint8
    hist, _ = np.histogram(img.ravel(), 256, (0, 256))
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) * 255 / (cdf.max() - cdf.min())
    return cdf[img].astype(np.uint8)
```
Color theory and color spaces
Human vision perceives surface color in context (Retinex): the retina + cortex compare reflectance against the surroundings across three wavelength channels (L/M/S) rather than reading raw luminance. That is color constancy — an apple looks red under bright or dim light.
- sRGB / Rec. 709. 8-bit SDR; ~1/3 of visible gamut.
- Rec. 2020. 10/12-bit; wide-gamut HDR.
- XYZ (CIE 1931). Device-independent linear space.
- LAB. Perceptually uniform: \(L\) lightness, \(a\) green–red, \(b\) blue–yellow.
Three knobs define a color space: primaries (gamut vertices), white point (e.g. D65), transfer function (OETF/EOTF: gamma ~2.2 for sRGB; PQ/HLG for HDR).
Deep dive: Szeliski — Computer Vision: Algorithms and Applications (2nd ed., free PDF) · Cambridge in Colour tutorials for color theory, gamut, transfer functions.
5. Perceptual quality assessment
Full-reference (FR)
- PSNR. \(10\log_{10}(255^2/\text{MSE})\). Pixel fidelity; weak human correlation.
- SSIM. Luminance × contrast × structure: \(\text{SSIM}(x,y)=\frac{(2\mu_x\mu_y+C_1)(2\sigma_{xy}+C_2)}{(\mu_x^2+\mu_y^2+C_1)(\sigma_x^2+\sigma_y^2+C_2)}\) — a minimal sketch follows this list.
- VIF.[2] Visual Information Fidelity: model images as natural-scene statistics (NSS) via Gaussian Scale Mixtures in a wavelet basis; HVS has internal noise \(n\); distortion adds attenuation \(g\) and noise \(v\); compute MI ratio \(\text{VIF} = I(C;F_\text{dist})/I(C;F_\text{ref})\).
- LPIPS. Feature-space distance over a frozen VGG with learned per-layer weights.
- DISTS. Structure + texture similarity in learned feature space; robust to texture shifts.
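A minimal NumPy sketch of PSNR and a global (single-window) SSIM — the real metric averages over an 11×11 Gaussian sliding window, so treat this as the intuition, not a reference implementation:

```python
import numpy as np

def psnr(x, y, peak=255.0):
    mse = ((x.astype(float) - y.astype(float)) ** 2).mean()
    return 10 * np.log10(peak**2 / mse)

def ssim_global(x, y, C1=(0.01 * 255) ** 2, C2=(0.03 * 255) ** 2):
    x, y = x.astype(float), y.astype(float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + C1) * (2 * cov + C2)
    den = (mx**2 + my**2 + C1) * (x.var() + y.var() + C2)
    return num / den
```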
No-reference (NR)
- NSS-based: NIQE, BRISQUE, MSCN.
- Learned: Re-IQA, CONTRIQUE, DOVER[3] (decouples aesthetic and technical quality for video), Q-Align, DEQA (MLLM-based).
Generative-model metrics
- FID / FVD. Fréchet distance of Inception (image) or I3D (video) features; saturates for SOTA T2I.
- CLIPScore. Cosine of CLIP text–image embeddings; good for alignment, weak for realism.
- HPSv2 / ImageReward.[4] Preference models learned from large-scale human-preference comparisons over generated images.
- VBench / VBenchv2.[5] Multi-facet video eval (subject identity, motion smoothness, dynamic degree, spatial relations).
- PhysGenBench / VideoPhy2. Physics commonsense for video.
- MLLM-as-judge. Cheap holistic eval with a capable VLM (Gemini, Qwen-VL, GPT-4o).
HDR-specific
Standard SDR metrics computed on tonemapped HDR previews are misleading. Use HDR-VDP, PU-encoded PSNR/SSIM, or MLLM-based HDR-aware judges. See HDR-Q slides for a CVPR 2026 take on this.
Deep dive: UT LIVE Lab publications for the IQA/VQA canon (SSIM, VIF, BRISQUE, NIQE, VMAF) · LPIPS repo + paper · VBench project page.
6. CNNs and self-supervised learning
CNN primer
Stanford CS231n is still the canonical course for this layer of the stack[6]. Quick hits:
- Convolution as learnable filter banks; parameter sharing + spatial equivariance.
- Receptive field grows with depth; dilated convs expand it cheaply.
- ResNet identity shortcuts enable very deep training; ConvNeXt revisits CNNs with transformer-style design.
Self-supervised learning
- Contrastive: SimCLR, MoCo. InfoNCE loss: \(\mathcal{L} = -\log\tfrac{\exp(q\cdot k^+/\tau)}{\sum_i \exp(q\cdot k_i/\tau)}\).
- BYOL / DINO. Non-contrastive; online + EMA target networks.
- Masked image modeling (MAE). Mask 75% of patches, reconstruct; strong downstream classification.
- CLIP. Contrastive on (image, caption) pairs; InfoNCE across the batch.
```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, k_neg, tau=0.07):  # shapes: (B,d), (B,d), (B,K,d)
    logits_pos = (q * k_pos).sum(-1, keepdim=True) / tau
    logits_neg = (q.unsqueeze(1) * k_neg).sum(-1) / tau
    logits = torch.cat([logits_pos, logits_neg], dim=-1)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive is index 0
    return F.cross_entropy(logits, labels)
```
Deep dive: Stanford CS231n for CNNs · Lilian Weng — Contrastive Representation Learning · Meta AI — DINO blog.
7. Transformers
Attention in full
\[ \text{Attn}(Q,K,V) = \text{softmax}\!\left(\tfrac{QK^\top}{\sqrt{d_k}}\right)V. \]
Why \(\sqrt{d_k}\)? The dot product of two \(d_k\)-dim unit-variance vectors has variance \(d_k\). Without scaling, softmax saturates into near-one-hot distributions and gradients vanish on the un-picked tokens.
Cross-attention: \(Q\) from decoder sequence, \(K,V\) from encoder sequence (e.g., T5 tokens in a T2I UNet / DiT).
Masked attention: add a \(-\infty\) bias to forbidden positions before softmax. Causal masking is the lower-triangular case.
```python
import torch, math
import torch.nn.functional as F

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.dk = n_heads, d_model // n_heads
        self.qkv = torch.nn.Linear(d_model, 3 * d_model)
        self.out = torch.nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = [t.view(B, T, self.h, self.dk).transpose(1, 2) for t in (q, k, v)]
        s = q @ k.transpose(-2, -1) / math.sqrt(self.dk)
        if mask is not None:
            s = s.masked_fill(mask == 0, float('-inf'))
        a = F.softmax(s, dim=-1)
        y = (a @ v).transpose(1, 2).contiguous().view(B, T, D)
        return self.out(y)
```
Sparse, Flash, Linear, GQA, MLA
- Sparse attention. Block-sparse; each token attends to \(n\) blocks out of full set → \(O((T^2/n)d)\). Heads can learn different patterns.
- Native Sparse Attention (NSA). Combines token compression (coarse windowed pooling of K/V), selection (top-\(k\) via importance scores), and sliding window (local); the three paths are gated-summed.
- Flash Attention.[7] Tiling + online (safe) softmax reduces memory from \(O(N^2)\) to \(O(N)\). Key identity: maintain a running max \(m_i\) and denominator \(d_i\) so `exp(x - m)` never overflows, then rescale on each new tile.
- Linear attention. \(\text{softmax}(QK^\top)\approx\phi(Q)\phi(K)^\top\); per-query cost \(O(nd)\). Good for long sequences and streaming, bad when you need sharp attention.
- Grouped Query Attention (GQA).[8] Share K/V across groups of query heads; e.g. 16 query heads, 4 K/V heads. LLaMA-2/3 70B use one K/V head per 8 query heads.
- Multi-head Latent Attention (MLA).[9] Compress K/V into a low-rank latent then project back at attention time; ~4–10× KV-cache reduction.
Positional encoding
- Sinusoidal. \(\text{PE}(\text{pos}, 2i)=\sin(\text{pos}/10000^{2i/d})\). Length-independent. Added to input embeddings.
- RoPE[10]. Rotate \(q, k\) by a position-dependent angle before the dot product. \(q^\top k\) becomes sensitive only to the relative angle. 2D RoPE splits embedding into \((x, y)\) chunks for images; 3D RoPE further for video \((t, x, y)\).
- ALiBi[11]. Add linear negative bias \(-m\cdot|i-j|\) in attention logits; acts as soft local attention; extrapolates cleanly to longer sequences.
```python
import torch

def rope_1d(x, theta=10000.0):  # x: (B, T, d) with d even
    B, T, d = x.shape
    pos = torch.arange(T, device=x.device).float()[:, None]
    i = torch.arange(d // 2, device=x.device).float()
    freq = 1.0 / (theta ** (2 * i / d))
    angles = pos * freq  # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    xr = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return xr.flatten(-2)
```
KV cache memory
\[ \text{mem} = 2 \times \text{bytes} \times n_\text{layers} \times d_\text{model} \times \text{seq\_len} \times \text{batch}. \]
30B params with \(n_\text{layers}=48,\ d_\text{model}=7168,\ \text{fp16} (2\text{ bytes})\), seq-len 1,024, batch 128 → KV cache ≈ 180 GB — roughly 3× the model weights. That memory pressure is why GQA/MLA and paged KV caches dominate production serving.
```python
def kv_cache_bytes(layers, d_model, seq_len, batch, dtype_bytes=2):
    return 2 * dtype_bytes * layers * d_model * seq_len * batch

kv_cache_bytes(48, 7168, 1024, 128) / 1e9  # ~180 GB
```
FFN, SwiGLU, norms
FFN is \(\sigma(xW_1)W_2\); SwiGLU replaces the activation with a gated Swish: \(\text{SwiGLU}(x)=\text{Swish}(xW+b)\otimes(xV+c)\), with \(\text{Swish}(x)=x\,\sigma(\beta x)\). Ubiquitous in LLaMA-family models.
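A minimal SwiGLU block as a sketch (PyTorch's `silu` is Swish with \(\beta=1\); LLaMA-style models also drop the biases):

```python
import torch
import torch.nn.functional as F

class SwiGLU(torch.nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w = torch.nn.Linear(d_model, d_ff, bias=False)   # gated branch
        self.v = torch.nn.Linear(d_model, d_ff, bias=False)   # linear branch
        self.out = torch.nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.out(F.silu(self.w(x)) * self.v(x))
```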
Deep dive: Jay Alammar — The Illustrated Transformer · The Annotated Transformer (Harvard NLP) · Karpathy — “Let’s build GPT: from scratch” · Sebastian Raschka — A Visual Guide to Attention Variants.
8. VAE and the ELBO
VAE[12]: encoder \(q_\phi(z\mid x)\), decoder \(p_\theta(x\mid z)\). Evidence lower bound:
\[ \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - D_\text{KL}\big(q_\phi(z\mid x)\,\Vert\,p(z)\big). \]
Derivation A (Jensen):
\[ \log p_\theta(x) = \log\int q_\phi(z\mid x)\tfrac{p_\theta(x,z)}{q_\phi(z\mid x)}dz \ge \mathbb{E}_{q_\phi}\log\tfrac{p_\theta(x,z)}{q_\phi(z\mid x)}. \]
Derivation B (chain rule):
\[ \log p_\theta(x) = \mathbb{E}_{q_\phi}\log\tfrac{p_\theta(x,z)}{q_\phi(z\mid x)} + D_\text{KL}\big(q_\phi(z\mid x)\,\Vert\,p_\theta(z\mid x)\big). \]
The gap (right term) tightens as the encoder matches the true posterior.
Reparameterization trick & closed-form KL
For \(q_\phi(z\mid x)=\mathcal{N}(\mu_\phi,\sigma_\phi^2 I)\) and \(p(z)=\mathcal{N}(0,I)\):
\[ z = \mu_\phi + \sigma_\phi\odot\epsilon,\ \epsilon\sim\mathcal{N}(0,I),\quad D_\text{KL} = \tfrac{1}{2}\sum_d(\mu_d^2 + \sigma_d^2 - 1 - 2\log\sigma_d). \]
```python
import torch

class VAE(torch.nn.Module):
    def __init__(self, enc, dec):
        super().__init__()
        self.enc, self.dec = enc, dec

    def forward(self, x):
        mu, log_sigma = self.enc(x).chunk(2, dim=-1)
        z = mu + log_sigma.exp() * torch.randn_like(mu)  # reparameterize
        x_rec = self.dec(z)
        kl = 0.5 * (mu.pow(2) + (2 * log_sigma).exp() - 1 - 2 * log_sigma).sum(-1)
        rec = ((x - x_rec) ** 2).flatten(1).sum(-1)
        return (rec + kl).mean(), x_rec
```
VQ-VAE and advanced VAEs
Vanilla VAE is blurry (L2 mean-seeking), vague in semantics, and prone to posterior collapse when the decoder is too strong. VQ-VAE[13] replaces the continuous latent with a discrete codebook (nearest-neighbor quantize, straight-through gradient) so the latent is transformer-friendly and sharp. SD3/Flux-era “advanced VAEs” add BN in latent space, more channels, REPA-style alignment to large vision-model embeddings, and spatial packing.
Deep dive: Lilian Weng — From Autoencoder to Beta-VAE · Doersch — Tutorial on VAEs (arXiv 1606.05908).
9. DDPM
Forward Markov chain[14]:
\[ q(x_t\mid x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I),\quad q(x_t\mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t)I). \]
\(\bar\alpha_t = \prod_{s\le t}(1-\beta_s)\). One-shot corruption: \(x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon\).
Training objective
\[ \mathcal{L}_\text{simple}(\theta) = \mathbb{E}_{t,x_0,\epsilon}\!\left[\|\epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}x_0+\sqrt{1-\bar\alpha_t}\epsilon,t)\|^2\right]. \]
```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alphas_bar):
    t = torch.randint(0, len(alphas_bar), (x0.size(0),), device=x0.device)
    ab = alphas_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    eps_pred = model(xt, t)
    return F.mse_loss(eps_pred, eps)
```
Ancestral sampler
\[ x_{t-1} = \tfrac{1}{\sqrt{\alpha_t}}\!\left(x_t - \tfrac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\right) + \sigma_t z. \]
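A minimal loop implementing that update, assuming `betas` and `alphas_bar` are 1-D tensors and `model` is the \(\epsilon\)-network (using the common \(\sigma_t^2=\beta_t\) variance choice):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, betas, alphas_bar, shape, device):
    xt = torch.randn(shape, device=device)
    for t in reversed(range(len(betas))):
        beta, ab = betas[t], alphas_bar[t]
        eps = model(xt, torch.full((shape[0],), t, device=device))
        mean = (xt - beta / (1 - ab).sqrt() * eps) / (1 - beta).sqrt()
        z = torch.randn_like(xt) if t > 0 else torch.zeros_like(xt)
        xt = mean + beta.sqrt() * z
    return xt
```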
Tweedie’s formula
\[ \hat x_{0\mid t} = \tfrac{1}{\sqrt{\bar\alpha_t}}(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)). \]
The single identity that powers DDIM, classifier guidance in data space, DPS, and every “predict \(\hat x_0\)” parameterization.
Deep dive: Lilian Weng — What are Diffusion Models? · Calvin Luo — Understanding Diffusion Models: A Unified Perspective · HF Diffusion Models Course.
10. Score-based models and the SDE umbrella
Score \(s(x) = \nabla_x\log p(x)\). Denoising score matching[15] sidesteps the intractable partition function of energy-based models \(p_\theta(x)=\tfrac{1}{Z(\theta)}\exp(-f_\theta(x))\):
\[ \mathcal{L}(\theta) = \tfrac{1}{2}\mathbb{E}_{x,\tilde x}\!\left[\|s_\theta(\tilde x) - \nabla_{\tilde x}\log q(\tilde x\mid x)\|^2\right]. \]
Song et al.[16] unified DDPM and NCSN under a continuous-time SDE \(dx_t = f(x_t,t)dt + g(t)dw_t\) with reverse \(dx_t = [f(x_t,t) - g^2(t)\nabla_x\log p_t(x_t)]dt + g(t)d\bar w_t\).
| Discrete → continuous | Drift & diffusion | SDE |
|---|---|---|
| NCSN → VE-SDE | \(f=0,\ g=\sqrt{d\sigma^2(t)/dt}\) | \(dx=g(t)\,dw\) |
| DDPM → VP-SDE | \(f=-\tfrac{1}{2}\beta(t)x,\ g=\sqrt{\beta(t)}\) | \(dx=-\tfrac{1}{2}\beta(t)x\,dt+\sqrt{\beta(t)}\,dw\) |
The probability-flow ODE drops the stochastic term: \(dx_t = [f(x_t,t) - \tfrac{1}{2}g^2(t)\nabla_x\log p_t(x_t)]dt\). Deterministic, preserves marginals, exact likelihood.
Deep dive: Yang Song — Generative Modeling by Estimating Gradients of the Data Distribution · Karras et al. EDM.
11. Flow matching and rectified flow
Flow matching[17] learns a velocity field \(v_\theta(x,t)\) transporting base \(p_0\) to data \(p_1\). Rectified Flow[18] uses the linear coupling \(x_t=(1-t)x_0+tx_1\), target \(v^\star(x,t)=\mathbb{E}[x_1-x_0\mid x_t=x,t]\):
\[ \mathcal{L}_\text{RF}(\theta)=\mathbb{E}_{(x_0,x_1)\sim\pi,\,t\sim\rho}\!\left[\|v_\theta(x_t,t)-(x_1-x_0)\|^2\right]. \]
```python
import torch
import torch.nn.functional as F

def rf_loss(model, x1):
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), device=x1.device)  # uniform
    tb = t.view(-1, 1, 1, 1)
    xt = (1 - tb) * x0 + tb * x1
    v_target = x1 - x0
    v_pred = model(xt, t)
    return F.mse_loss(v_pred, v_target)
```
Reflow. After first training, retrain on self-generated pairs \((x_0,\hat x_1)\). Curvature drops, few-step sampling becomes viable. For a dedicated walk-through see my Rectified Flow note.
Logit-normal time. Uniform \(t\) under-trains the middle of the trajectory. SD3 samples \(t\sim \sigma(\mathcal{N}(\mu,\sigma^2))\) to push mass where errors accumulate.
Mean flow. Predict average velocity over an interval \([t_1, t_2]\); approximates single-step generation.
Deep dive: NeurIPS 2024 Tutorial — Flow Matching for Generative Modeling · Google DeepMind — Diffusion Meets Flow Matching blog.
12. Sampling and guidance
DDIM
DDIM[19] replaces the ancestral Markov update with a deterministic step using \(\hat x_{0\mid t}\):
\[ x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat x_{0\mid t} + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t,t) + \sigma_t z. \]
\(\sigma_t=0\) → deterministic DDIM (same marginals, skip allowed). DDPM reverse with the diffusion term (\(g(t)d\bar w_t\)) versus DDIM reverse without it is the distinction to keep in mind.
```python
import torch

@torch.no_grad()
def ddim_sample(model, alphas_bar, steps, shape, device, eta=0.0):
    xt = torch.randn(shape, device=device)
    ts = torch.linspace(len(alphas_bar) - 1, 0, steps + 1).long().to(device)
    for i in range(steps):
        t_now, t_nxt = ts[i], ts[i + 1]
        ab, ab_nxt = alphas_bar[t_now], alphas_bar[t_nxt]
        eps = model(xt, t_now.expand(shape[0]))
        x0_hat = (xt - (1 - ab).sqrt() * eps) / ab.sqrt()
        sigma = eta * ((1 - ab_nxt) / (1 - ab) * (1 - ab / ab_nxt)).sqrt()
        noise = torch.randn_like(xt) if eta > 0 else 0
        xt = ab_nxt.sqrt() * x0_hat + (1 - ab_nxt - sigma**2).sqrt() * eps + sigma * noise
    return xt
```
ODE solvers
Given the PF-ODE / FM velocity, higher-order solvers buy quality at fixed step count:
- Euler. \(x_{t+\Delta t} = x_t + \Delta t\,v_\theta(x_t,t)\) (FM convention, \(t:0\to1\) from noise to data).
- Midpoint. Half-step then evaluate at midpoint.
- Heun. Euler step, re-evaluate, average (2nd order).
- DPM-Solver / EDM[20]. Exploit the semi-linear structure; strong baselines.
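A sketch of Heun over a flow-matching velocity field, in the same FM convention as above (`v_model` is a placeholder callable):

```python
import torch

@torch.no_grad()
def heun_sample(v_model, shape, device, steps=30):
    x = torch.randn(shape, device=device)
    ts = torch.linspace(0, 1, steps + 1, device=device)
    for i in range(steps):
        t, t_nxt = ts[i], ts[i + 1]
        dt = t_nxt - t
        v1 = v_model(x, t.expand(shape[0]))
        v2 = v_model(x + dt * v1, t_nxt.expand(shape[0]))  # Euler predictor
        x = x + dt * 0.5 * (v1 + v2)                       # trapezoidal corrector
    return x
```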
Classifier-free guidance (CFG)
Train one network on both conditional \(\epsilon_\theta(x,t,c)\) and unconditional \(\epsilon_\theta(x,t,\varnothing)\) by randomly dropping \(c\). At inference:
\[ \tilde\epsilon = \epsilon_\theta(x,t,\varnothing) + w\,(\epsilon_\theta(x,t,c) - \epsilon_\theta(x,t,\varnothing)). \]
```python
import torch

@torch.no_grad()
def cfg_eps(model, x, t, c, w=7.5):
    eps_c = model(x, t, c)
    eps_unc = model(x, t, None)
    return eps_unc + w * (eps_c - eps_unc)
```
On flow models, naive CFG amplifies trajectory curvature; my Rectified-CFG++[21] fixes this with a predictor-corrector step.
Classifier guidance, DPS, MPGD
Classifier guidance.[22] \(\nabla_x\log p(x\mid c) = \nabla_x\log p(c\mid x) + \nabla_x\log p(x)\). Uses a differentiable classifier.
DPS.[23] For inverse problems \(y = A(x)+n\): \(\nabla_{x_t}\log p(y\mid x_t)\approx -\tfrac{1}{\sigma_y^2}\nabla_{x_t}\|y - A(\hat x_{0\mid t})\|^2\). Goes off-manifold.
MPGD.[24] Project the guidance update back onto the manifold via a VAE/autoencoder tangent. Extra forward pass; big quality win.
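A sketch of the DPS likelihood gradient via Tweedie \(\hat x_{0\mid t}\) plus autograd; `A` is an assumed differentiable forward operator, and the caller subtracts a scaled version of this gradient from the DDIM/ancestral update:

```python
import torch

def dps_grad(model, xt, t, y, A, alphas_bar):
    xt = xt.detach().requires_grad_(True)
    ab = alphas_bar[t]
    eps = model(xt, t)
    x0_hat = (xt - (1 - ab).sqrt() * eps) / ab.sqrt()  # Tweedie
    residual = ((y - A(x0_hat)) ** 2).sum()
    return torch.autograd.grad(residual, xt)[0]
```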
Deep dive: Sander Dieleman — Guidance: a cheat code for diffusion models · Sander Dieleman — Perspectives on diffusion.
13. Conditioning in generative models
Three families; real systems combine them.
- Architectural injection.
  - Concatenation \([x, c]\) — GAN/AR defaults.
  - Cross-attention — T2I workhorse.
  - Conditional norms:
    - AdaIN: \(\gamma(c)\tfrac{x-\mu(x)}{\sigma(x)}+\beta(c)\).
    - FiLM: \(c\to(\gamma,\beta)\to\gamma\odot x+\beta\).
    - AdaLN / AdaLN-Zero: LayerNorm with scale/shift from \(c\); DiT/MM-DiT.
- Guidance methods. CFG, classifier guidance, attention-guided sampling.
- Spatial / structural conditioning. ControlNet[25] (trainable encoder copy injected through zero-initialized convs) and T2I-Adapter[26] (lightweight side network) for edges, depth, pose. LoRA[27] adapts a frozen backbone through low-rank updates — a minimal sketch below.
```python
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in base.parameters():
            p.requires_grad_(False)
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.02)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```
MM-DiT (SD3 / Flux)
Caption encoded jointly by CLIP-L/14, CLIP-G/14, and T5-XXL. Pooled CLIP + sinusoidal timestep → MLP → AdaLN-Zero modulation. T5 token sequence is concatenated with image patch tokens; joint self-attention across the combined sequence, with per-stream norms and projections. AdaLN-Zero initializes modulation scale to zero so blocks act as identity at the start of training.
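A minimal AdaLN-Zero sketch of the modulation pattern described above (`block` stands for the attention or MLP sub-block; exact gating details vary across DiT/MM-DiT implementations):

```python
import torch

class AdaLNZero(torch.nn.Module):
    def __init__(self, d_model, d_cond):
        super().__init__()
        self.norm = torch.nn.LayerNorm(d_model, elementwise_affine=False)
        self.mod = torch.nn.Linear(d_cond, 3 * d_model)
        torch.nn.init.zeros_(self.mod.weight)  # block acts as identity at init
        torch.nn.init.zeros_(self.mod.bias)

    def forward(self, x, c, block):  # x: (B, T, d), c: (B, d_cond)
        shift, scale, gate = self.mod(c)[:, None].chunk(3, dim=-1)
        return x + gate * block(self.norm(x) * (1 + scale) + shift)
```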
Deep dive: ControlNet paper · HF PEFT — LoRA conceptual guide · SD3 / MM-DiT paper.
14. Text-to-image design space
Every modern T2I/T2V decomposes on three axes. Memorize this grid.
| Model | Training | Latent | Backbone | Text |
|---|---|---|---|---|
| SD 1/2 | DDPM | KL-VAE | UNet | CLIP |
| SD 3 / Flux 1 | Flow matching | Advanced VAE | DiT / MM-DiT | CLIP + T5 |
| Flux 2 / Z-Image / Qwen-Image | Flow matching | Advanced VAE | MM-DiT | LLM/VLM/MLLM + MM-DiT |
| Transfusion / Hunyuan 3.0 | Flow matching | Advanced VAE | Native-MM | Native MM tokens |
| Nano Banana / GPT-4o Image | FM + diffusion head | Advanced VAE | Native-MM | Native MM + tools |
Why CLIP alone stalled as text encoder
- Spatial relations (“dog left of cat”) — weak.
- Negation (“room without a window”) — fails.
- Counting and fine-grained attributes — fails.
- 77-token cap — too short.
T5-XXL handles long captions, dense semantics, and compositional detail; SD3/Flux use both (CLIP for pooled, T5 for token sequence). Flux 2-era models drop CLIP and use a single LLM/VLM encoder.
Scaling rule
Rule-of-thumb (Chinchilla): compute-optimal training tokens ≈ 20× #parameters. Holds well for text; for T2I, data ceilings kick in as resolution climbs.
Deep dive: Stability AI — SD3 research blog · Black Forest Labs — Flux announcement · Transfusion paper (Meta).
15. Video generation and inverse problems
Temporal attention strategies
Joint 3D attention is prohibitive. Alternatives:
- Window attention. Current frame + last \(k\) neighbors.
- Sliding window. Stride < window; smoother transitions.
- Factorized spatial–temporal. Alternate spatial and temporal blocks (Lumiere, Hunyuan-Video).
- Token compression. Compress older frames into fewer K/V tokens.
Latent video diffusion
Shared outline across Movie Gen, Sora, Veo, Hunyuan-Video, Wan, OpenSora:
- 3D VAE (or causal 3D VAE) to compress space + time.
- MM-DiT backbone with alternating / joint 3D attention.
- Flow matching + CFG at inference.
- Camera / motion conditioning via cross-attention with motion tokens.
Inverse Tone Mapping (SDR→HDR)
From my LumaFlux line of work:
- Gain-map learning. Predict per-pixel gain scaling SDR luminance.
- Diffusion-prior ITM. Start from SDR conditioning, generate the HDR residual; Physically-Guided Adaptation (PGA) + Perceptual Cross-Modulation (PCM).
- Rational Quadratic Spline (RQS) decoder. Learnable invertible tone curve.
Deep dive: OpenAI — Sora technical report · Hunyuan-Video tech report · Meta — Movie Gen.
16. Discrete diffusion
For tokens/molecules, noising is random state transitions. Transition matrix \(Q\) gives
\[ x_t\mid x_{t-1}\sim\text{Cat}(p=x_{t-1}Q_t),\quad x_t\mid x_0\sim\text{Cat}(p=x_0\bar Q_t). \]
Reverse posterior
\[ q(x_{t-1}\mid x_t,x_0) = \text{Cat}\!\left(x_{t-1};\ p = \frac{x_tQ_t^\top\odot x_0\bar Q_{t-1}}{x_0\bar Q_t x_t^\top}\right). \]
MLDM, LLaDA[28], and the dLLM line apply this to text generation — the language-side analogue of image diffusion.
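A sketch of the absorbing-state ("mask") special case these models build on, where \(Q_t\) moves probability mass onto a single `[MASK]` token (`mask_id` is a placeholder):

```python
import torch

def mask_forward(x0, t, mask_id):
    # x0: (B, T) token ids; t: (B,) corruption levels in [0, 1]
    drop = torch.rand(x0.shape, device=x0.device) < t[:, None]
    return torch.where(drop, torch.full_like(x0, mask_id), x0)
```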
Deep dive: Austin et al. — Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM) · LLaDA paper.
17. LLMs, GPT, nanoGPT
All modern decoder-only LLMs are thin wrappers on the same stack: tokenizer → embedding → \(N\)× (RMSNorm, MHA with RoPE and GQA/MLA, RMSNorm, SwiGLU FFN) → LM head. Karpathy’s nanoGPT[29] is the smallest fluent reference — read it end-to-end.
```python
import torch

class GPTBlock(torch.nn.Module):  # pre-norm; reuses MultiHeadAttention from section 7
    def __init__(self, d, h):
        super().__init__()
        self.ln1 = torch.nn.LayerNorm(d)
        self.attn = MultiHeadAttention(d, h)
        self.ln2 = torch.nn.LayerNorm(d)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(),
            torch.nn.Linear(4 * d, d),
        )

    def forward(self, x, mask):
        x = x + self.attn(self.ln1(x), mask)
        x = x + self.ffn(self.ln2(x))
        return x
```
Sampling
- Top-k. Keep highest-probability \(k\) tokens, renormalize.
- Nucleus (top-p). Smallest set with cumulative probability \(\ge p\).
- Temperature. Scale logits by \(1/T\) before softmax.
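A sketch combining the three knobs in the list above on a batch of final-position logits:

```python
import torch

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):  # logits: (B, V)
    logits = logits / temperature
    if top_k is not None:
        kth = logits.topk(top_k, dim=-1).values[..., -1:]
        logits = logits.masked_fill(logits < kth, float('-inf'))
    if top_p is not None:
        probs, idx = logits.softmax(-1).sort(-1, descending=True)
        drop = probs.cumsum(-1) - probs > top_p  # keep smallest set with cum >= p
        logits = logits.masked_fill(drop.scatter(-1, idx, drop), float('-inf'))
    return torch.multinomial(logits.softmax(-1), 1)
```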
Deep dive: Karpathy — Let’s build GPT from scratch · Stanford CS336 — Language Modeling from Scratch.
18. Evaluation metrics (consolidated)
Section 5 covered image-quality metrics; this is the cross-modality summary.
| Modality | Fidelity | Alignment | Preference | Holistic |
|---|---|---|---|---|
| Image gen | FID, PSNR, SSIM, VIF, LPIPS | CLIPScore, VQAScore | HPSv2, ImageReward | MLLM-judge, HEIM |
| Video gen | FVD, CLIP-Temp/FrameSim | CLIPScore over frames | (learned) | VBench v2, PhysGenBench, VideoPhy2 |
| LLM | perplexity | BLEU, ROUGE | ArenaHard, AlpacaEval | MMLU, GSM8K, HELM |
Deep dive: Stanford HELM and HEIM for holistic LLM / image eval · VBench leaderboard for video.
19. Training lifecycle: pretraining → mid-training → post-training
Modern generative models (LLMs and diffusion/flow image/video models) follow a three-stage pipeline. This section is a pointer map — know what each stage does, what data it consumes, and what objective it optimizes.
Pretraining
Learn a world model by predicting the next piece of content on an enormous, minimally curated corpus.
- Objective (LLM). Next-token cross-entropy \(\mathcal{L} = -\mathbb{E}_{x}\sum_{t}\log p_\theta(x_t\mid x_{<t})\).
- Objective (diffusion/FM). \(\epsilon\)-prediction or velocity regression (sections 9 and 11). Trained on web-scale image–text pairs (LAION-5B, DataComp, CommonPool, internal crawls).
- Data. FineWeb / FineWeb-Edu, RedPajama, RefinedWeb, The Stack, plus code/math specialty mixes. Quality filtering (CLIP score, perplexity, fastText) and heavy deduplication matter more than raw volume.
- Compute profile. Long horizon (weeks–months), thousands of accelerators, LR warmup + cosine decay, effective batches of ~4M tokens, FP8/BF16 mixed precision, ZeRO-3 or FSDP sharding.
- Eval. Perplexity on held-out mix; a small suite of in-domain probes (MMLU, HumanEval for LLMs; small FID/CLIPScore for image models). Full eval comes later.
Mid-training (a.k.a. continued pretraining, annealed pretraining)
A second, smaller pretraining stage that reshapes the model’s capability profile before alignment. Much more impactful than the literature suggests.
- Long-context extension. Extend context from 4K/8K to 32K → 128K → 1M. Techniques: RoPE base rescaling (NTK-aware, YaRN), PI (position interpolation), needle-in-haystack eval.
- Capability injection. Upsample math (OpenWebMath, NuminaMath), code (The Stack v2), reasoning chains. The “annealing” trick: near the end of pretraining, switch the data mix heavily toward high-quality sources and drop LR sharply.
- Domain adaptation. Medical / legal / finance / multilingual corpora.
- For image/video models: resolution curriculum (64 → 256 → 1024 → native), aspect-ratio bucketing, human-aesthetic filtering, synthetic-caption rewriting with a VLM.
Post-training
Turn a raw next-token predictor / image denoiser into something that actually follows instructions and matches human preferences.
1. Supervised Fine-Tuning (SFT)
Continue the pretraining objective on a curated set of (instruction, high-quality response) pairs. Scale: 10K–1M examples; lower LR (1e-5 to 5e-6 range); 1–3 epochs. Loss is the same token-level CE but only on the response tokens (mask the prompt).
```python
import torch.nn.functional as F

def sft_loss(logits, labels, response_mask):
    # logits: (B, T, V); labels: (B, T); response_mask: (B, T), 1 where response
    shift_logits = logits[:, :-1].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    shift_mask = response_mask[:, 1:].contiguous().float()
    ce = F.cross_entropy(shift_logits.transpose(1, 2), shift_labels, reduction='none')
    return (ce * shift_mask).sum() / shift_mask.sum().clamp_min(1)
```
Common data: FLAN, Tulu, OpenAssistant, Alpaca, ShareGPT, instruction-synth with a strong teacher (“distillation SFT”). Synthetic data is now the majority of most SFT pipelines.
2. Preference data collection
Triples \((x, y_w, y_l)\) with a human or AI preference over responses \(y_w \succ y_l\). Sources: human labelers (UltraFeedback), strong teacher models (RLAIF), or self-play with a reward model.
3. Reward modeling + RLHF (PPO)
Fit a scalar reward model \(r_\phi(x,y)\) to preferences via a Bradley–Terry logistic loss \(\mathcal{L}_{RM} = -\log\sigma(r_\phi(x,y_w) - r_\phi(x,y_l))\). Then PPO the policy with \(r_\phi\) plus a per-token KL anchor \(r_t = r_\phi(x,y) - \beta\log\tfrac{\pi_\theta(y_t\mid\cdot)}{\pi_\text{ref}(y_t\mid\cdot)}\). Full math and loss in section 22.
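The RM loss is two lines (`r_chosen`, `r_rejected` are scalar-head outputs on the two responses):

```python
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # Bradley–Terry: -log sigmoid(r_w - r_l)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```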
4. Direct Preference Optimization (DPO)
Skip the reward model. Optimize the policy directly on preference pairs (derivation in section 22). In practice DPO is the baseline for almost every open-source post-training pipeline because it is stable, single-loop, and reward-model-free.
5. GRPO / DAPO — critic-free RL
Sample \(N\) responses per prompt, normalize rewards within the group, apply PPO-style token-level clipping. GRPO removed the value network entirely; DAPO further fixes clipping, sampling, and gradient-dilution pathologies. See section 22.
6. Rejection sampling / Best-of-N
Generate \(N\) candidates, score each with a reward model, keep the best. Embarrassingly simple, widely used as both a baseline and for generating SFT data for the next iteration (“rejection-sampling fine-tuning”, RSFT).
7. Constitutional AI / RLAIF
Replace human preference labels with AI-generated critiques against a written constitution. Scales preference collection to millions of examples cheaply; most frontier labs use a hybrid of human + AI labels.
Post-training for diffusion / flow models
The same ideas transfer — “response” becomes “generated image/video” and the reward is an image preference / aesthetic / alignment score (ImageReward, HPSv2, PickScore, VQAScore).
- SFT for diffusion. Fine-tune with the standard diffusion loss on a curated high-quality subset (e.g., aesthetically filtered, human-preferred).
- Diffusion-DPO. DPO adapted to the diffusion objective: compare noise predictions on winning vs losing images per timestep.
- DRaFT / ReFL / AlignProp. Backprop a differentiable reward through truncated sampling; very effective at preference alignment but compute-heavy.
- DPOK. PPO-style policy gradient on the diffusion sampler.
- LoRA preference tuning. Do any of the above through low-rank adapters for cheap style / aesthetic personalization.
Deep dive: Nathan Lambert — The RLHF Book · Interconnects — Post-Training 101 · HF Alignment Handbook · DeepSeek-R1 technical report.
20. Training nuances
Mixed precision
FP32 master copy + FP16/BF16 computation; loss scaling to prevent underflow of small gradients; BF16 is preferred (same exponent range as FP32, no loss scaling needed).
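A minimal fp16 AMP training step as a sketch (with bf16 you can usually drop the `GradScaler`); `model`, `opt`, and `loader` are assumed to exist:

```python
import torch

scaler = torch.cuda.amp.GradScaler()
for x, y in loader:
    opt.zero_grad(set_to_none=True)
    with torch.autocast('cuda', dtype=torch.float16):
        loss = model(x, y)
    scaler.scale(loss).backward()  # scale up to avoid fp16 gradient underflow
    scaler.step(opt)               # unscales; skips the step on inf/nan
    scaler.update()
```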
Data-parallel vs ZeRO vs FSDP
- DDP. Every GPU has a full model copy; gradients all-reduced each step.
- ZeRO-1/2/3[30]: shard optimizer state, then gradients, then parameters across GPUs.
- FSDP[31]: PyTorch-native ZeRO-3 equivalent; all-gather + reduce-scatter per layer; integrates with activation checkpointing.
Memory budget back-of-envelope
At fp16 training with Adam: params (2 bytes) + grads (2 bytes) + Adam moments (8 bytes, fp32) ≈ 12 bytes per parameter, plus activations (depth-dependent) — and 16 bytes if you also keep an fp32 master copy. A 7B model needs ~84 GB just for weights, grads, and optimizer state — before activations. FSDP/ZeRO-3 shards that across N GPUs.
Gradient accumulation & EMA
Accumulate gradients over \(k\) micro-batches before stepping the optimizer → effective batch \(B\cdot k\) at no extra memory. EMA of weights (\(\theta_\text{ema}\leftarrow\alpha\theta_\text{ema}+(1-\alpha)\theta\), \(\alpha\approx0.999\)) is non-optional for diffusion training; sample with EMA weights.
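A minimal accumulation loop tying the two together (the `EMA` tracker it calls is defined just below; `model`, `opt`, `loader` assumed):

```python
accum = 8  # effective batch = micro-batch x 8
opt.zero_grad(set_to_none=True)
for i, (x, y) in enumerate(loader):
    loss = model(x, y) / accum  # scale so accumulated grads match the full-batch mean
    loss.backward()
    if (i + 1) % accum == 0:
        opt.step()
        opt.zero_grad(set_to_none=True)
        ema.update(model)       # EMA tracker defined below
```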
```python
import torch

class EMA:
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = {n: p.detach().clone() for n, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model):
        for n, p in model.named_parameters():
            self.shadow[n].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)
```
Deep dive: HF — The Ultra-Scale Playbook (Training LLMs on GPU Clusters) · PyTorch FSDP Advanced Tutorial.
21. Inference nuances
Quantization
FP32 → FP16 (×2) → INT8 (×4) → INT4 (×8). Two common schemes:
- Zero-point (asymmetric). Map \([x_\min, x_\max]\) to \([0, 2^b-1]\); preserves zero exactly — important for ReLU activations.
- Absolute-max (symmetric). Map \([-|x|_\max, |x|_\max]\) to \([-2^{b-1}, 2^{b-1}-1]\).
```python
import torch

def quantize_int8_symmetric(x):
    scale = x.abs().max() / 127
    q = (x / scale).round().clamp(-128, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale
```
Activation-aware schemes (AWQ, GPTQ, AQLM) keep sensitive channels at higher precision; LLM.int8()[32] handles outlier features in fp16 while quantizing the rest.
Pruning and distillation
- Pruning. Zero the lowest-magnitude \(x\%\) of weights per layer; structured pruning zeros entire heads/channels. Retrain or LoRA-adapt to recover.
- Distillation. Student learns \(\text{CE}(\text{student}, \text{true}) + \lambda\,D_\text{KL}(\text{student}\Vert\text{teacher})\).
- Distribution-matching distillation. Collapses multi-step image gen to 1–4 steps by matching output distributions at each noise level. Consistency models and LCM live in this family.
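A sketch of the distillation loss exactly as written above, with the usual temperature \(T\) and \(T^2\) gradient rescaling:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    ce = F.cross_entropy(student_logits, labels)
    s_log = F.log_softmax(student_logits / T, dim=-1)
    t_log = F.log_softmax(teacher_logits / T, dim=-1)
    kl = (s_log.exp() * (s_log - t_log)).sum(-1).mean() * T * T  # KL(student || teacher)
    return ce + lam * kl
```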
Serving stacks
- vLLM. Paged KV cache + continuous batching; the SOTA OSS LLM server.
- TensorRT-LLM. NVIDIA’s fused-kernel engine; best single-node throughput.
- Continuous batching. Merge new requests into in-flight batches at every decode step; throughput ↑ by 2–5× over static batching.
Deep dive: Anyscale — Continuous batching for LLM inference · vLLM — PagedAttention blog.
22. RL alignment — PPO, DPO, GRPO, DAPO in detail
PPO + RLHF
Three models: policy, reward, value. Clipped surrogate objective:
\[ \mathcal{L}_\text{PPO} = \mathbb{E}\!\left[\min\!\left(r_t A_t,\ \text{clip}(r_t, 1-\varepsilon, 1+\varepsilon) A_t\right)\right], \quad r_t = \tfrac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}. \]
Advantage from GAE; KL penalty \(-\beta D_\text{KL}(\pi_\theta\Vert\pi_\text{ref})\) anchors the policy.
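The clipped surrogate in code (token-level log-probs; `adv` from GAE):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    ratio = (logp_new - logp_old).exp()
    return -torch.min(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv).mean()
```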
DPO
DPO[33] aligns on preference pairs \((y_w, y_l)\) without a reward model:
\[ \mathcal{L}_\text{DPO} = -\mathbb{E}\!\left[\log\sigma\!\left(\beta\log\tfrac{\pi_\theta(y_w)}{\pi_\text{ref}(y_w)} - \beta\log\tfrac{\pi_\theta(y_l)}{\pi_\text{ref}(y_l)}\right)\right]. \]
```python
import torch.nn.functional as F

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    # pol_w, ref_w: log-probs under policy / reference for the chosen response
    # pol_l, ref_l: same for the rejected response
    return -F.logsigmoid(beta * ((pol_w - ref_w) - (pol_l - ref_l))).mean()
```
GRPO
GRPO[34] drops the critic. For each prompt, sample \(N\) responses, score, and normalize within group:
\[ A_i = \tfrac{r_i - \text{mean}(R_\text{group})}{\text{std}(R_\text{group})+\varepsilon}. \]
```python
def grpo_advantages(rewards):  # rewards: (G,)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```
DAPO
DAPO addresses four GRPO failure modes[35]:
- Wasted learning signal (GRPO discards \(A=0\) samples).
- Fixed \(\varepsilon\) clipping is too aggressive for high-reward samples → Clip-Higher: asymmetric range \((1-\varepsilon_\text{low},\ 1+\varepsilon_\text{high})\).
- Redundant sampling → dynamic sampling (keep only samples with non-trivial advantage).
- Gradient dilution in long sequences → token-level gradient loss (average over all tokens across batch, not per-sample then over samples).
Best-of-N
Sample \(N\) outputs at \(T=0.7\), rank by reward model, keep the best. Embarrassingly simple, hard to beat at small \(N\).
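As a sketch (`generate` and `reward` are placeholder callables for the sampler and the reward model):

```python
def best_of_n(generate, reward, prompt, n=8):
    candidates = [generate(prompt, temperature=0.7) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```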
Deep dive: HF blog — From GRPO to DAPO and GSPO · HF TRL library docs (DPO, PPO, GRPO implementations) · algoroxyolo — RL reading list (2026).
23. Paper trace and behavioral
For image-gen / quality / video roles, I lead with three papers — one minute each.
- Rectified-CFG++ (NeurIPS 2025). “Standard CFG breaks on flow models because it amplifies trajectory curvature. We introduce a predictor-corrector guidance with zero extra training cost that fixes artifacts on Flux, SD3, and Lumina-Next, improves text rendering, and comes with theoretical guarantees.” — blog.
- HDR-Q (CVPR 2026). “First multimodal LLM for HDR video quality assessment, with HAPO: contrastive KL, dual-entropy regularization, and SigLIP-2 HDR-aware encoding.” — slides.
- LumaFlux. “SDR-to-HDR inverse tone mapping using Flux 12B; 17M trainable parameters via PGA, PCM, and a Rational Quadratic Spline decoder.” — slides.
Rehearse three lengths: 30-second, 2-minute, 10-minute. Let the interviewer pick.
Checklist for the night before
- Python basics (GC, PEP 8, `*args`/`**kwargs`) — 60 seconds.
- MLE for a Gaussian with code.
- CNN output formula with one worked example.
- Scaled dot-product attention from scratch — derive and code.
- RoPE 1D/2D/3D, ALiBi, sinusoidal: pros/cons each.
- KV cache back-of-envelope (30B, 48-layer, 7168-dim).
- FlashAttention online softmax identity.
- VAE ELBO both derivations in 60 seconds.
- DDPM forward + reverse + \(\epsilon\) loss with code.
- Tweedie and \(\hat x_{0\mid t}\).
- VP-SDE vs VE-SDE table.
- Rectified flow loss, reflow, logit-normal time.
- DDIM step (stochastic ↔ deterministic).
- CFG in \(\epsilon\) and score form.
- DPS vs MPGD.
- T2I design-space table.
- SSIM, PSNR, VIF intuition; when FID saturates.
- Retinex and color spaces; PQ vs HLG.
- PPO / DPO / GRPO / DAPO one-liners with DAPO fixes.
- Three papers of my own, three durations each.
- One LeetCode-medium on the morning of.
Curated reading list
Resources I came back to repeatedly during prep. Grouped by ladder stage.
Foundations (DL / CUDA / CV)
- Bishop & Bishop — Deep Learning: Foundations and Concepts (2023). Modern, textbook-level replacement for Goodfellow; full HTML is free.
- Dive into Deep Learning (d2l.ai). Continuously updated; interactive PyTorch/JAX code alongside every chapter.
- Karpathy — Neural Networks: Zero to Hero + nanoGPT / nanochat. Micrograd → makemore → GPT → tokenizers → nanochat, end-to-end and all runnable.
- Lilian Weng’s Lil’Log. Still unmatched as a survey blog for attention, VAEs, diffusion, SSL, RLHF, alignment.
- Jay Alammar. Visual explanations of transformers, LLMs, embeddings.
- 3Blue1Brown — Neural Networks series. The best visual intuition for backprop and attention.
- Chris Olah’s blog and Distill. Mechanistic / interpretability intuition that still transfers.
- Stanford CS336 — Language Modeling from Scratch (Percy Liang, Tatsu Hashimoto, 2024/2025). The single best modern “build an LLM end-to-end” course.
- Stanford CS25 — Transformers United. Guest-lecture series with authors of many of the papers you’ll be asked about.
- Stanford CS231n (latest semester). Still canonical for CNNs and the vision stack.
- NVIDIA CUDA C++ Programming Guide + the GPU Mode lecture series / Discord. Modern, kernel-writing-focused CUDA learning path that supersedes the older tutorials.
- Hugging Face Computer Vision Course for hands-on CV with modern backbones.
SSL & GANs
- Self-Supervised Representation Learning — Lilian Weng.
- Contrastive Representation Learning — Lilian Weng.
- awesome-self-supervised-learning.
- Goodfellow NIPS 2016 GAN Tutorial.
Transformers, LLMs, GPT
- Attention Is All You Need.
- The Illustrated Transformer.
- Sebastian Raschka — A Visual Guide to Attention Variants — MHA, MQA, GQA, MLA, sliding-window, Flash with diagrams.
- Hugging Face LLM Course.
- karpathy/nanoGPT, ng-video-lecture.
- karpathy/nanochat — minimal end-to-end ChatGPT stack.
- Stanford CS224N.
Diffusion / Flow Matching / Rectified Flow
- What are Diffusion Models? — Lilian Weng.
- Understanding Diffusion Models: A Unified Perspective — Calvin Luo.
- Hugging Face Diffusion Models Course.
- Flow Matching for Generative Modeling — Lipman et al..
- Flow Straight and Fast (Rectified Flow) — Liu et al..
- Rectified Flow official repo.
- NeurIPS 2024 Tutorial — Flow Matching for Generative Modeling.
- MIT 6.S184 “The Principles of Diffusion Models and Flow Matching” (spring 2026) — course notes are the densest single source I know.
Training large models
- DeepSpeed, ZeRO tutorial.
- HF Accelerate FSDP docs, PyTorch Advanced FSDP.
- Efficient Training of Large Language Models on Distributed Infrastructure (2024 survey).
- Hugging Face Ultra-Scale Playbook — training LLMs on GPU clusters.
- Hugging Face Smol Training Playbook — secrets to building world-class small LLMs.
Inference
- LLM.int8() — Dettmers et al..
- HF × bitsandbytes 8-bit integration.
- Continuous batching for LLM inference.
- vLLM, TensorRT-LLM.
RL alignment
- RLHF Book — Nathan Lambert.
- From GRPO to DAPO and GSPO — HF blog.
- algoroxyolo — RL reading list (2026) — a curated, modern RL-for-LLMs reading map.
Interview question banks
- alirezadir/Machine-Learning-Interviews.
- amitshekhariitbhu/machine-learning-interview-questions.
- Devinterview-io/llms-interview-questions.
- LLM Interview Q&A Hub.
- llmgenai/LLMInterviewQuestions.
- a-tabaza/genai_interview_questions.
- awesome-generative-ai-guide.
- A Book for ML/DL Interview Questions (arXiv).
Coding practice
- NeetCode 150.
- LeetCode — target the 300 most-frequent list.
- FAANG-Coding-Interview-Questions.
- Graph series (Aditya Verma), DP series, Recursion series.
References
- NeetCode — LeetCode pattern-based prep.
- Sheikh & Bovik, Image Information and Visual Quality (VIF), IEEE TIP 2006.
- Wu et al., DOVER: Exploring Video Quality Assessment Through Aesthetic and Technical Perspectives, 2022.
- Wu et al., Human Preference Score v2 (HPSv2), 2023.
- Huang et al., VBench: Comprehensive Benchmark Suite for Video Generative Models, 2023.
- Stanford CS231n — Convolutional Neural Networks for Visual Recognition.
- Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention, 2022.
- Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models, 2023.
- DeepSeek-V2 Technical Report (introduces MLA), 2024.
- Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding, 2021.
- Press et al., Train Short, Test Long: Attention with Linear Biases (ALiBi), 2021.
- Kingma & Welling, Auto-Encoding Variational Bayes, 2013.
- van den Oord et al., Neural Discrete Representation Learning (VQ-VAE), 2017.
- Ho et al., Denoising Diffusion Probabilistic Models, 2020.
- Song & Ermon, Generative Modeling by Estimating Gradients of the Data Distribution, 2019.
- Song et al., Score-Based Generative Modeling through Stochastic Differential Equations, 2021.
- Lipman et al., Flow Matching for Generative Modeling, 2022.
- Liu et al., Flow Straight and Fast: Rectified Flow, 2022.
- Song et al., Denoising Diffusion Implicit Models, 2020.
- Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models (EDM), 2022.
- Saini et al., Rectified CFG++ for Flow Based Models, NeurIPS 2025.
- Dhariwal & Nichol, Diffusion Models Beat GANs on Image Synthesis (Classifier Guidance), 2021.
- Chung et al., Diffusion Posterior Sampling for General Noisy Inverse Problems (DPS), 2022.
- He et al., Manifold Preserving Guided Diffusion (MPGD), 2023.
- Zhang et al., Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet), 2023.
- Mou et al., T2I-Adapter, 2023.
- Hu et al., LoRA: Low-Rank Adaptation of Large Language Models, 2021.
- Nie et al., Large Language Diffusion Models (LLaDA), 2025.
- karpathy/nanoGPT.
- Rajbhandari et al., ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2019.
- Zhao et al., PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel, 2023.
- Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022.
- Rafailov et al., Direct Preference Optimization, 2023.
- Shao et al., DeepSeekMath (GRPO), 2024.
- From GRPO to DAPO and GSPO — HF Blog (Yihua Zhang).