Prepping for a Research Scientist, GenAI Position — A Pointer Notebook

A topic-pointer notebook for Research Scientist / GenAI loops focused on image generation, perceptual quality, and video processing. Ordered progressively — coding warmups, then ML and CV fundamentals, then transformers, then diffusion and flow models, then T2I and video, and finally the training and inference tier.

Apr 2026 · Tags: Interview Prep, Diffusion, Flow Models, T2I, Video, VQA

Thesis. A Research Scientist, GenAI loop is five overlapping interviews stacked on top of each other — coding, ML/CV fundamentals, research dive, system design, behavioral.

Main technical point. Almost every modern generative system collapses to three primitives: a forward corruption process, a denoiser/velocity that reverses it, and a sampler that integrates the reverse dynamics. VAE, DDPM, score, flow matching, rectified flow, DDIM, CFG, DPS are all specific instantiations of those primitives.

Practical implication. Revise in the order you’d build a system: warm up with Python, cover ML and classical CV foundations, master the transformer, stack on diffusion and flow models, then climb to text-to-image and video, and only then worry about serving and RL alignment. This post is that order.

📓 How to read this post. This is a pointer notebook — a map of topics you should know, with short explanations, minimal code stubs, and links to the places where each topic is actually taught in depth. It does not contain the detailed material itself. Each section is meant to jog your memory, show you roughly how the pieces connect, and point you at a better resource for the full derivation. Treat it like an index card, not a textbook.

⚠️ Not exhaustive. I wrote this down from memory after my own full-time interview prep cycle. It is biased toward the things that worked for me and that came up repeatedly across different research labs I interviewed with — so there are whole areas (classical RL, speech, 3D, retrieval systems, robotics) that are barely here, simply because they didn’t show up in my loops. Use it as one data point among several, not as a complete syllabus.

🤝 Suggest additions. This is a living document. If you know a better resource, a missed topic, a cleaner derivation, or a recent paper that should be in the pointer set, please open an issue on the repo or email me at saini.2@utexas.edu. I’ll keep merging good additions in.

0. Framing and the full outline

My loops decomposed into the same ladder every time. Knowing which round you are in keeps your answers at the right altitude.

  • Coding. Usually one LeetCode-medium plus one applied-ML question (“implement MLE for a Gaussian,” “write the forward pass for scaled dot-product attention,” “compute KV-cache memory for a given model shape”).
  • ML / CV fundamentals. Losses, MLE vs MAP, KL divergence, SVM hinge loss, clustering, classical CV (SSIM/VIF/LPIPS, histogram equalization, Sobel), CNN output shape, BatchNorm vs LayerNorm, self-supervised learning.
  • Research dive. Depth-first on one of your papers. Derive the loss, ablate components, predict what breaks under a change.
  • System design. Build a T2I or T2V product end-to-end — data → latent space → backbone → training → eval → serving — in 45 minutes on a whiteboard.
  • Paper trace + behavioral. Contrast 5–8 landmark papers in the team’s area in 30 seconds each; be ready with a 30-second, 2-minute, and 10-minute pitch of your own work.

The rest of this note is the ladder itself. The order matters — each section builds on the last.

Full outline — click any entry to jump

  1. Coding tier. Python internals, CNN/attention-forward-pass warmups, DSA cadence, ML-system-design checklist.
  2. ML fundamentals. MLE/MAP, loss zoo (L1/L2/CE/hinge/triplet/KL), normalization, clustering, SVM.
  3. GenAI fundamentals. Generative vs discriminative, likelihood-based vs likelihood-free family tree.
  4. Classical CV. Histogram equalization, Sobel, Gaussian/median filters, NLM, SIFT, color theory, color spaces, Retinex.
  5. CV quality. PSNR, SSIM, VIF, LPIPS/DISTS, NR-IQA, HDR quality, FID/FVD, CLIPScore/HPS/ImageReward, VBench/PhysGenBench, MLLM-as-judge.
  6. CNNs & SSL. ConvNet primer, contrastive and masked SSL.
  7. Transformers. Scaled dot-product attention, multi-head, cross, masked, RoPE 1D/2D/3D, ALiBi, FlashAttention, GQA/MLA, KV cache.
  8. VAE & ELBO. Two derivations, reparameterization, VQ-VAE, SAE.
  9. DDPM. Forward chain, reverse parameterization, \(\epsilon\)-prediction, Tweedie.
  10. Score & SDE. DSM, EBM, VP/VE SDE, probability-flow ODE.
  11. Flow matching & rectified flow. Linear interpolant, reflow, mean flow, logit-normal time.
  12. Sampling & guidance. DDIM, Euler/Midpoint/Heun, CFG, classifier guidance, DPS, MPGD.
  13. Conditioning. AdaIN/FiLM/AdaLN, cross-attention, ControlNet, T2I-Adapter, LoRA, MM-DiT.
  14. T2I design space. SD1 → SD3 → Flux → Nano Banana; why CLIP needed T5.
  15. Video generation. Temporal attention, LVDM, inverse tone mapping (SDR→HDR).
  16. Discrete diffusion. Transition matrix, MLDM/LLaDA.
  17. LLMs & nanoGPT. Tokenization, causal LM, sampling.
  18. Evaluation metrics. When each metric saturates.
  19. Training lifecycle. Pretraining, mid-training, post-training (SFT, preference tuning, RLHF), and the diffusion-specific analogues.
  20. Training nuances. Mixed precision, ZeRO, FSDP, gradient accumulation, EMA.
  21. Inference nuances. Quantization, distillation, pruning, continuous batching, vLLM/TensorRT-LLM.
  22. RL alignment (details). PPO, DPO, GRPO, DAPO, Best-of-N.
  23. Paper trace + behavioral.

1. Coding tier

Python essentials

Interviewers love a 5-minute sanity check on Python internals before anything else.

  • Execution model. Python source → lexer → parser → AST → bytecode → CPython VM (stack-based). PyPy adds JIT on top; CPython does not.
  • PEP 8. 4-space indent, 79-char lines, snake_case variables, CamelCase classes.
  • Memory. Reference counting + generational GC. When refcount hits zero, the object is deallocated; the cyclic collector handles reference cycles.
  • *args, **kwargs. Variable positional (tuple) and keyword (dict) arguments.
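The `*args` / `**kwargs` bullet as a two-line sanity check (the function name is illustrative):

```python
def report(*args, **kwargs):
    # args arrives as a tuple, kwargs as a dict
    return len(args), sorted(kwargs)

report(1, 2, 3, lr=0.1, epochs=5)   # (3, ['epochs', 'lr'])
```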

Applied-ML warmups (run these in your head)

MLE for a Gaussian — derive and code. \(\nabla_\theta \log\prod_i f(x_i;\theta)=0\) gives closed-form \(\hat\mu=\tfrac{1}{N}\sum x_i,\ \hat\sigma^2=\tfrac{1}{N}\sum(x_i-\hat\mu)^2\).

import numpy as np

def gaussian_mle(x):
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()
    return mu, sigma2

# MAP with Gaussian prior N(mu0, tau2) on mu, known sigma2
def gaussian_map_mean(x, sigma2, mu0, tau2):
    n = len(x)
    return (mu0 / tau2 + x.sum() / sigma2) / (1 / tau2 + n / sigma2)

CNN output shape. For input width \(W\), filter \(F\), stride \(S\), padding \(P\):

\[ O = \left\lfloor\frac{W - F + 2P}{S}\right\rfloor + 1 \]
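The same formula as a one-liner (helper name is mine); the classic check is the ResNet stem, a 7×7 conv with stride 2 and padding 3 on a 224-wide input:

```python
def conv_out(W, F, S=1, P=0):
    # floor((W - F + 2P) / S) + 1
    return (W - F + 2 * P) // S + 1

conv_out(224, 7, S=2, P=3)   # 112
```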

k-th smallest. Heap, \(O(n\log k)\):

import heapq
def kth_smallest(nums, k):
    return heapq.nsmallest(k, nums)[-1]

Naive attention forward pass — the single most common whiteboard ask.

import torch, math

def attention(q, k, v, mask=None):
    # q, k, v: (B, H, T, d_k)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if mask is not None:                       # causal mask shape (T, T)
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attn = torch.softmax(scores, dim=-1)
    return attn @ v, attn

DSA cadence

What worked for me: 1 hour/day of LeetCode plus 2 structured hours, 2–3 problems daily, 300 most-frequent list over ~2 months[1]. Pattern list:

  • Two pointers & sliding window (longest substring, k-distinct, min window).
  • Binary search on the answer (Koko bananas, split array largest sum).
  • Heap / priority queue (top-k, merge-k-lists, schedulers).
  • Monotonic stack (daily temperatures, largest rectangle).
  • Dynamic programming — 1D, 2D, on intervals, trees, bitmask.
  • Graph BFS/DFS, Dijkstra, topological sort, union-find.
  • Backtracking (permutations, combinations, N-queens).
  • Bit manipulation and prefix-sum tricks.

ML system-design checklist (T2I/T2V)

  1. Data. Crawl → filter (CLIP relevance, aesthetics, dedup, NSFW), recaption with a VLM.
  2. Latent space. Train VAE / VQ-VAE / advanced VAE with perceptual + adversarial loss.
  3. Backbone. MM-DiT scale; positional encoding (RoPE 2D/3D); FSDP or ZeRO-3 sharding.
  4. Training. Flow matching with logit-normal time; EMA; mixed precision.
  5. Eval harness. Prompt set × metrics (CLIPScore, HPSv2, ImageReward, VBench, MLLM-judge).
  6. Inference. Heun/DPM-Solver; step distillation; FP8/INT8; CFG annealing; continuous batching for AR heads.
  7. Safety. Prompt pre-filter, output post-filter, concept-erasing LoRA.

Deep dive: NeetCode Coding Interview Roadmap · Karpathy — Neural Networks: Zero to Hero for the applied-ML coding muscle.

2. Machine learning fundamentals

MLE and MAP

\[ \theta^\text{MLE} = \arg\max_\theta \prod_i p(x_i\mid\theta), \quad \theta^\text{MAP} = \arg\max_\theta \prod_i p(x_i\mid\theta)\,p(\theta). \]

Practical identity: maximizing the likelihood equals minimizing the negative log-likelihood, which equals minimizing cross-entropy against the empirical data distribution.

The loss-function zoo

  • \(L_1\) vs \(L_2\). \(L_1\) promotes sparsity and robustness to outliers; \(L_2\) is smoother and mean-seeking. Image reconstruction typically uses a convex mix.
  • Cross-entropy. Canonical classification/token-level loss.
  • Hinge (SVM). \(\arg\min \tfrac{1}{2}\|w\|^2 + C\sum_i \max(0, 1 - y_i(w^\top x_i - b))\).
  • Triplet. \(\max(0, d(A,P) - d(A,N) + m)\). Enforces anchor–positive closer than anchor–negative by margin \(m\).
  • KL divergence. \(D_\text{KL}(p\Vert q)=\sum p\log(p/q)\). Non-negative, asymmetric, zero iff equal.
import torch.nn.functional as F

def kl_divergence(p, q, eps=1e-12):              # both shape (..., C)
    return (p * (p.clamp_min(eps).log() - q.clamp_min(eps).log())).sum(-1)

def hinge_loss(scores, y):                        # y in {-1, +1}
    return F.relu(1 - y * scores).mean()

def triplet_loss(a, p, n, margin=1.0):
    d_ap = (a - p).norm(dim=-1)
    d_an = (a - n).norm(dim=-1)
    return F.relu(d_ap - d_an + margin).mean()

Normalization

  • BatchNorm. Normalize per-channel across the batch. Depends on batch statistics — breaks with very small batches.
  • LayerNorm. Normalize per-sample across features. Batch-size independent; the default for transformers.
  • RMSNorm. LayerNorm without mean subtraction; \(\tfrac{1}{\sqrt{\text{RMS}(x)^2+\varepsilon}}\cdot x\). Used in LLaMA-family models.
  • GroupNorm. Middle ground used inside UNets and DiT blocks.
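A minimal sketch of the two transformer-relevant norms from the list above, with the learnable gain/bias omitted:

```python
import torch

def layer_norm(x, eps=1e-5):
    # per-sample: subtract the mean and divide by the std over the feature dim
    mu = x.mean(-1, keepdim=True)
    var = x.var(-1, unbiased=False, keepdim=True)
    return (x - mu) / (var + eps).sqrt()

def rms_norm(x, eps=1e-5):
    # no mean subtraction; scale by root-mean-square only
    return x / (x.pow(2).mean(-1, keepdim=True) + eps).sqrt()
```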

Clustering (quick)

  • K-means. Hard assignment via nearest centroid; iterative \((\text{assign}, \text{update})\).
  • GMM. Soft clustering with \(K\) Gaussians \((\mu_k,\Sigma_k,\pi_k)\) via EM; ellipsoidal clusters.
  • DBSCAN. Density-based; arbitrary-shape clusters; noise is a first-class output.
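The k-means (assign, update) loop in its entirety, a common whiteboard warmup (the helper is my sketch; empty clusters keep their old centroid):

```python
import numpy as np

def kmeans(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    cent = x[rng.choice(len(x), k, replace=False)]   # init from data points
    for _ in range(iters):
        # assign: nearest centroid by squared distance
        d = ((x[:, None] - cent[None]) ** 2).sum(-1)
        lbl = d.argmin(1)
        # update: mean of each cluster (keep old centroid if a cluster empties)
        cent = np.stack([x[lbl == j].mean(0) if (lbl == j).any() else cent[j]
                         for j in range(k)])
    return lbl, cent
```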

SVM and kernels

Find the hyperplane \(w^\top x + b\) with max margin. Non-linear case: transform data via \(\phi\) into a higher-dim space where classes are linearly separable; the kernel trick avoids constructing \(\phi\) explicitly via \(K(x,y) = \phi(x)^\top\phi(y)\) (linear, polynomial, RBF).
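The kernel trick in code: an RBF Gram matrix computed without ever constructing \(\phi\) (function name is mine):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # K[i, j] = exp(-gamma * ||x_i - y_j||^2), via broadcasting
    sq = ((X[:, None] - Y[None]) ** 2).sum(-1)
    return np.exp(-gamma * sq)
```

Diagonal entries are 1 (distance zero to itself) and the matrix is symmetric positive semi-definite, which is what makes it a valid kernel.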

Deep dive: Stanford CS229 (Andrew Ng) for the math · StatQuest with Josh Starmer for 10-minute intuition videos on MLE, PCA, SVM, GMM, clustering.

3. Generative modeling fundamentals

Generative modeling learns \(p(x)\) or \(p(x,y)\); discriminative modeling learns \(p(y\mid x)\) directly.

Family tree

  • Likelihood-based: VAE, autoregressive, diffusion, normalizing flows, energy-based models.
  • Likelihood-free: GANs — min-max adversarial, strong images at small scale but prone to mode collapse; WGAN-GP mitigates with a Lipschitz critic and gradient penalty.

Why the shift? Sampling directly from \(p(x)\) in high-dim is intractable; simpler to transport a Gaussian base to the data via a tractable forward corruption and a learned reverse — the diffusion/flow-matching recipe (sections 9–11).

Autoregressive factorization

\[ p_\theta(x) = p_\theta(x_1)\,p_\theta(x_2\mid x_1)\cdots p_\theta(x_L\mid x_{<L}) \]

Loss collapses to token-level CE. Sampling: top-\(k\) (restrict to top \(k\) logits) or nucleus (top-\(p\)) sampling.
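Top-\(k\) and nucleus filtering in one helper (1-D logits; the name and tie-handling details are my sketch):

```python
import torch

def filter_logits(logits, top_k=0, top_p=1.0):
    out = logits.clone()
    if top_k > 0:
        # keep only the k largest logits
        kth = torch.topk(out, top_k).values[-1]
        out[out < kth] = float('-inf')
    if top_p < 1.0:
        srt, idx = out.sort(descending=True)
        probs = srt.softmax(-1)
        cum = probs.cumsum(-1)
        # drop a token once the mass before it already exceeds p
        drop = cum - probs > top_p
        out[idx[drop]] = float('-inf')
    return out

# sample: torch.multinomial(filter_logits(logits, top_p=0.9).softmax(-1), 1)
```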

Deep dive: Lilian Weng — From Autoencoder to Beta-VAE · HF CV Course — Generative Models (Unit 5).

4. Classical computer vision

Image processing primitives

  • Histogram equalization. Remap intensities through the CDF of the image’s histogram: \(\text{img}'[i,j] = \text{CDF}(\text{img}[i,j])\cdot 255\). Boosts contrast on low-dynamic-range imagery.
  • Sobel edges. \(G_x=\begin{bmatrix}-1&0&1\\-2&0&2\\-1&0&1\end{bmatrix},\ G_y=G_x^\top,\ |G|=\sqrt{G_x^2+G_y^2}\).
  • Gaussian filter. Low-pass \(\tfrac{1}{16}\begin{bmatrix}1&2&1\\2&4&2\\1&2&1\end{bmatrix}\) smooths additive noise.
  • Median filter. Best for salt-and-pepper noise (Gaussian blur makes it worse).
  • Non-Local Means (NLM). Compare patches not pixels; average similar patches within a search window.
  • SIFT. Scale-invariant keypoint detection via Difference-of-Gaussians; assign orientation; 128-dim descriptor.
import numpy as np
from scipy.ndimage import convolve

def sobel(img):
    Gx = np.array([[-1,0,1],[-2,0,2],[-1,0,1]])
    Gy = Gx.T
    gx, gy = convolve(img, Gx), convolve(img, Gy)
    return np.sqrt(gx**2 + gy**2)

def hist_eq(img):                                  # img: uint8
    hist, _ = np.histogram(img.ravel(), 256, (0,256))
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) * 255 / (cdf.max() - cdf.min())
    return cdf[img].astype(np.uint8)

Color theory and color spaces

Human vision perceives surface color as context (Retinex): the retina + cortex compare reflectance against surroundings across three wavelength channels (L/M/S) rather than raw luminance. That is color constancy — an apple is red under bright or dim light.

  • sRGB / Rec. 709. 8-bit SDR; ~1/3 of visible gamut.
  • Rec. 2020. 10/12-bit; wide-gamut HDR.
  • XYZ (CIE 1931). Device-independent linear space.
  • LAB. Perceptually uniform: \(L\) lightness, \(a\) green–red, \(b\) blue–yellow.

Three knobs define a color space: primaries (gamut vertices), white point (e.g. D65), transfer function (OETF/EOTF: gamma ~2.2 for sRGB; PQ/HLG for HDR).

Deep dive: Szeliski — Computer Vision: Algorithms and Applications (2nd ed., free PDF) · Cambridge in Colour tutorials for color theory, gamut, transfer functions.

5. Perceptual quality assessment

Full-reference (FR)

  • PSNR. \(10\log_{10}(255^2/\text{MSE})\). Pixel fidelity; weak human correlation.
  • SSIM. Luminance × contrast × structure: \(\text{SSIM}(x,y)=\frac{(2\mu_x\mu_y+C_1)(2\sigma_{xy}+C_2)}{(\mu_x^2+\mu_y^2+C_1)(\sigma_x^2+\sigma_y^2+C_2)}\).
  • VIF.[2] Visual Information Fidelity: model images as natural-scene statistics (NSS) via Gaussian Scale Mixtures in a wavelet basis; HVS has internal noise \(n\); distortion adds attenuation \(g\) and noise \(v\); compute MI ratio \(\text{VIF} = I(C;F_\text{dist})/I(C;F_\text{ref})\).
  • LPIPS. Feature-space distance over a frozen VGG with learned per-layer weights.
  • DISTS. Structure + texture similarity in learned feature space; robust to texture shifts.
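PSNR plus a single-window SSIM sketch straight from the formula above; note the real SSIM averages this statistic over local 11×11 Gaussian windows rather than computing it globally:

```python
import numpy as np

def psnr(x, y, peak=255.0):
    mse = ((x.astype(float) - y.astype(float)) ** 2).mean()
    return 10 * np.log10(peak ** 2 / mse)

def ssim_global(x, y, peak=255.0):
    # one-window SSIM over the whole image; illustration only
    C1, C2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    x, y = x.astype(float), y.astype(float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cxy + C2)) / ((mx**2 + my**2 + C1) * (vx + vy + C2))
```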

No-reference (NR)

  • NSS-based: NIQE, BRISQUE, MSCN.
  • Learned: Re-IQA, CONTRIQUE, DOVER[3] (decouples aesthetic and technical quality for video), Q-Align, DEQA (MLLM-based).

Generative-model metrics

  • FID / FVD. Frechet distance of Inception (image) or I3D (video) features; saturates for SOTA T2I.
  • CLIPScore. Cosine of CLIP text–image embeddings; good for alignment, weak for realism.
  • HPSv2 / ImageReward.[4] Learned preference models on pick-a-pic data.
  • VBench / VBenchv2.[5] Multi-facet video eval (subject identity, motion smoothness, dynamic degree, spatial relations).
  • PhysGenBench / VideoPhy2. Physics commonsense for video.
  • MLLM-as-judge. Cheap holistic eval with a capable VLM (Gemini, Qwen-VL, GPT-4o).
When metrics saturate. FID, FVD, and CLIPScore are all learned and don’t reward novel outputs. On modern T2I/T2V, they’re close to noise past a threshold. Holistic, grounded, human-aligned metrics (HPSv2, ImageReward, VBenchv2, PhysGenBench, MLLM-judge) are the current default for paper-grade eval.

HDR-specific

Standard SDR metrics computed on tonemapped HDR previews are misleading. Use HDR-VDP, PU-encoded PSNR/SSIM, or MLLM-based HDR-aware judges. See HDR-Q slides for a CVPR 2026 take on this.

Deep dive: UT LIVE Lab publications for the IQA/VQA canon (SSIM, VIF, BRISQUE, NIQE, VMAF) · LPIPS repo + paper · VBench project page.

6. CNNs and self-supervised learning

CNN primer

Stanford CS231n is still the canonical course for this layer of the stack[6]. Quick hits:

  • Convolution as learnable filter banks; parameter sharing + spatial equivariance.
  • Receptive field grows with depth; dilated convs expand it cheaply.
  • ResNet identity shortcuts enable very deep training; ConvNeXt revisits CNNs with transformer-style design.

Self-supervised learning

  • Contrastive: SimCLR, MoCo. InfoNCE loss: \(\mathcal{L} = -\log\tfrac{\exp(q\cdot k^+/\tau)}{\sum_i \exp(q\cdot k_i/\tau)}\).
  • BYOL / DINO. Non-contrastive; online + EMA target networks.
  • Masked image modeling (MAE). Mask 75% of patches, reconstruct; strong downstream classification.
  • CLIP. Contrastive on (image, caption) pairs; InfoNCE across the batch.
import torch, torch.nn.functional as F

def info_nce(q, k_pos, k_neg, tau=0.07):           # shapes: (B,d),(B,d),(B,K,d)
    logits_pos = (q * k_pos).sum(-1, keepdim=True) / tau
    logits_neg = (q.unsqueeze(1) * k_neg).sum(-1) / tau
    logits = torch.cat([logits_pos, logits_neg], dim=-1)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)

Deep dive: Stanford CS231n for CNNs · Lilian Weng — Contrastive Representation Learning · Meta AI — DINO blog.

7. Transformers

Attention in full

\[ \text{Attn}(Q,K,V) = \text{softmax}\!\left(\tfrac{QK^\top}{\sqrt{d_k}}\right)V. \]

Why \(\sqrt{d_k}\)? The dot product of two \(d_k\)-dim unit-variance vectors has variance \(d_k\). Without scaling, softmax saturates into near-one-hot distributions and gradients vanish on the un-picked tokens.

Cross-attention: \(Q\) from decoder sequence, \(K,V\) from encoder sequence (e.g., T5 tokens in a T2I UNet / DiT).

Masked attention: add a \(-\infty\) bias to forbidden positions before softmax. Causal masking is the lower-triangular case.

import torch, math, torch.nn.functional as F

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.dk = n_heads, d_model // n_heads
        self.qkv = torch.nn.Linear(d_model, 3 * d_model)
        self.out = torch.nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = [t.view(B, T, self.h, self.dk).transpose(1, 2) for t in (q, k, v)]
        s = q @ k.transpose(-2, -1) / math.sqrt(self.dk)
        if mask is not None:
            s = s.masked_fill(mask == 0, float('-inf'))
        a = F.softmax(s, dim=-1)
        y = (a @ v).transpose(1, 2).contiguous().view(B, T, D)
        return self.out(y)

Sparse, Flash, Linear, GQA, MLA

  • Sparse attention. Block-sparse; each token attends to \(n\) blocks out of full set → \(O((T^2/n)d)\). Heads can learn different patterns.
  • Native Sparse Attention (NSA). Combines token compression (coarse windowed pooling of K/V), selection (top-\(k\) via importance scores), and sliding window (local); the three paths are gated-summed.
  • Flash Attention.[7] Tiling + online (safe) softmax reduces memory from \(O(N^2)\) to \(O(N)\). Key identity: maintain running max \(m_i\) and denominator \(d_i\) so exp(x - m) never overflows, then rescale on each new tile.
  • Linear attention. \(\text{softmax}(QK^\top)\approx\phi(Q)\phi(K)^\top\); per-query cost \(O(nd)\). Good for long sequences and streaming, bad when you need sharp attention.
  • Grouped Query Attention (GQA).[8] Share K/V across groups of query heads; e.g. 16 query heads, 4 K/V heads. LLaMA uses 1:8.
  • Multi-head Latent Attention (MLA).[9] Compress K/V into a low-rank latent then project back at attention time; ~4–10× KV-cache reduction.
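A GQA sketch via `repeat_interleave`: each K/V head is shared by a group of query heads, which is exactly the KV-cache saving (function name is mine):

```python
import torch, math

def gqa_attention(q, k, v, groups):
    # q: (B, Hq, T, d); k, v: (B, Hkv, T, d) with Hq = groups * Hkv
    B, Hq, T, d = q.shape
    # broadcast each K/V head across its group of query heads
    k = k.repeat_interleave(groups, dim=1)
    v = v.repeat_interleave(groups, dim=1)
    s = q @ k.transpose(-2, -1) / math.sqrt(d)
    return s.softmax(-1) @ v
```

With `groups=1` this reduces to ordinary multi-head attention; only the un-expanded K/V need to live in the cache.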

Positional encoding

  • Sinusoidal. \(\text{PE}(\text{pos}, 2i)=\sin(\text{pos}/10000^{2i/d})\). Length-independent. Added to input embeddings.
  • RoPE[10]. Rotate \(q, k\) by a position-dependent angle before the dot product. \(q^\top k\) becomes sensitive only to the relative angle. 2D RoPE splits embedding into \((x, y)\) chunks for images; 3D RoPE further for video \((t, x, y)\).
  • ALiBi[11]. Add linear negative bias \(-m\cdot|i-j|\) in attention logits; acts as soft local attention; extrapolates cleanly to longer sequences.
def rope_1d(x, theta=10000.0):                     # x: (B, T, d) with d even
    B, T, d = x.shape
    pos = torch.arange(T, device=x.device).float()[:, None]
    i = torch.arange(d // 2, device=x.device).float()
    freq = 1.0 / (theta ** (2 * i / d))
    angles = pos * freq                            # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    xr = torch.stack([x1*cos - x2*sin, x1*sin + x2*cos], dim=-1)
    return xr.flatten(-2)

KV cache memory

\[ \text{mem} = 2 \times \text{bytes} \times n_\text{layers} \times d_\text{model} \times \text{seq\_len} \times \text{batch}. \]

30B params with \(n_\text{layers}=48,\ d_\text{model}=7168,\ \text{fp16} (2\text{ bytes})\), seq-len 1,024, batch 128 → KV cache ≈ 180 GB — roughly 3× the model weights. That memory pressure is why GQA/MLA and paged KV caches dominate production serving.

def kv_cache_bytes(layers, d_model, seq_len, batch, dtype_bytes=2):
    return 2 * dtype_bytes * layers * d_model * seq_len * batch

kv_cache_bytes(48, 7168, 1024, 128) / 1e9          # ~180 GB

FFN, SwiGLU, norms

FFN is \(\sigma(W_1 x)W_2\); SwiGLU replaces \(\sigma\) with a gated Swish: \((xW + b)\otimes\text{Swish}(zV + c)\), Swish \(=x\sigma(\beta x)\). Ubiquitous in LLaMA.
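A minimal SwiGLU FFN in the LLaMA style (bias-free projections, \(\beta=1\) Swish, i.e. SiLU):

```python
import torch

class SwiGLU(torch.nn.Module):
    # FFN(x) = W2( Swish(W1 x) * (V x) )
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = torch.nn.Linear(d_model, d_ff, bias=False)   # gate path
        self.v  = torch.nn.Linear(d_model, d_ff, bias=False)   # value path
        self.w2 = torch.nn.Linear(d_ff, d_model, bias=False)
    def forward(self, x):
        return self.w2(torch.nn.functional.silu(self.w1(x)) * self.v(x))
```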

Deep dive: Jay Alammar — The Illustrated Transformer · The Annotated Transformer (Harvard NLP) · Karpathy — “Let’s build GPT: from scratch” · Sebastian Raschka — A Visual Guide to Attention Variants.

8. VAE and the ELBO

VAE[12]: encoder \(q_\phi(z\mid x)\), decoder \(p_\theta(x\mid z)\). Evidence lower bound:

\[ \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - D_\text{KL}\big(q_\phi(z\mid x)\,\Vert\,p(z)\big). \]

Derivation A (Jensen):

\[ \log p_\theta(x) = \log\int q_\phi(z\mid x)\tfrac{p_\theta(x,z)}{q_\phi(z\mid x)}dz \ge \mathbb{E}_{q_\phi}\log\tfrac{p_\theta(x,z)}{q_\phi(z\mid x)}. \]

Derivation B (chain rule):

\[ \log p_\theta(x) = \mathbb{E}_{q_\phi}\log\tfrac{p_\theta(x,z)}{q_\phi(z\mid x)} + D_\text{KL}\big(q_\phi(z\mid x)\,\Vert\,p_\theta(z\mid x)\big). \]

The gap (right term) tightens as the encoder matches the true posterior.

Reparameterization trick & closed-form KL

For \(q_\phi(z\mid x)=\mathcal{N}(\mu_\phi,\sigma_\phi^2 I)\) and \(p(z)=\mathcal{N}(0,I)\):

\[ z = \mu_\phi + \sigma_\phi\odot\epsilon,\ \epsilon\sim\mathcal{N}(0,I),\quad D_\text{KL} = \tfrac{1}{2}\sum_d(\mu_d^2 + \sigma_d^2 - 1 - 2\log\sigma_d). \]

class VAE(torch.nn.Module):
    def __init__(self, enc, dec):
        super().__init__(); self.enc, self.dec = enc, dec
    def forward(self, x):
        mu, log_sigma = self.enc(x).chunk(2, dim=-1)
        z = mu + log_sigma.exp() * torch.randn_like(mu)   # reparameterize
        x_rec = self.dec(z)
        kl = 0.5 * (mu.pow(2) + (2*log_sigma).exp() - 1 - 2*log_sigma).sum(-1)
        rec = ((x - x_rec) ** 2).flatten(1).sum(-1)
        return (rec + kl).mean(), x_rec

VQ-VAE and advanced VAEs

Vanilla VAE is blurry (L2 mean-seeking), vague in semantics, and prone to posterior collapse when the decoder is too strong. VQ-VAE[13] replaces the continuous latent with a discrete codebook (nearest-neighbor quantize, straight-through gradient) so the latent is transformer-friendly and sharp. SD3/Flux-era “advanced VAEs” add BN in latent space, more channels, REPA-style alignment to large vision-model embeddings, and spatial packing.

Deep dive: Lilian Weng — From Autoencoder to Beta-VAE · Doersch — Tutorial on VAEs (arXiv 1606.05908).

9. DDPM

Forward Markov chain[14]:

\[ q(x_t\mid x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I),\quad q(x_t\mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t)I). \]

\(\bar\alpha_t = \prod_{s\le t}(1-\beta_s)\). One-shot corruption: \(x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon\).

[Figure: the chain \(x_0 \rightarrow \cdots \rightarrow x_{t-1} \rightarrow x_t \rightarrow \cdots \rightarrow x_T\), with fixed forward noising kernels \(q(x_t\mid x_{t-1})\) and a learned reverse \(p_\theta(x_{t-1}\mid x_t)\).]
The forward kernel is analytic; only the reverse is learned. Training is one-shot at a random \(t\) thanks to the closed-form marginal.

Training objective

\[ \mathcal{L}_\text{simple}(\theta) = \mathbb{E}_{t,x_0,\epsilon}\!\left[\|\epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}x_0+\sqrt{1-\bar\alpha_t}\epsilon,t)\|^2\right]. \]

def ddpm_loss(model, x0, alphas_bar):
    t = torch.randint(0, len(alphas_bar), (x0.size(0),), device=x0.device)
    ab = alphas_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    eps_pred = model(xt, t)
    return F.mse_loss(eps_pred, eps)

Ancestral sampler

\[ x_{t-1} = \tfrac{1}{\sqrt{\alpha_t}}\!\left(x_t - \tfrac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\right) + \sigma_t z. \]

Tweedie’s formula

\[ \hat x_{0\mid t} = \tfrac{1}{\sqrt{\bar\alpha_t}}(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)). \]

The single identity that powers DDIM, classifier guidance in data space, DPS, and every “predict \(\hat x_0\)” parameterization.
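Tweedie in code, with a round-trip sanity check (helper name is mine): corrupt a known \(x_0\) with the closed-form marginal, then recover it exactly from the true noise.

```python
import torch

def tweedie_x0(xt, eps_pred, ab_t):
    # x0_hat = (x_t - sqrt(1 - ab_t) * eps) / sqrt(ab_t)
    return (xt - (1 - ab_t).sqrt() * eps_pred) / ab_t.sqrt()

# round trip: xt = sqrt(ab)*x0 + sqrt(1-ab)*eps  =>  tweedie_x0(xt, eps) == x0
```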

Deep dive: Lilian Weng — What are Diffusion Models? · Calvin Luo — Understanding Diffusion Models: A Unified Perspective · HF Diffusion Models Course.

10. Score-based models and the SDE umbrella

Score \(s(x) = \nabla_x\log p(x)\). Denoising score matching[15] sidesteps the intractable partition function of energy-based models \(p_\theta(x)=\tfrac{1}{Z(\theta)}\exp(-f_\theta(x))\):

\[ \mathcal{L}(\theta) = \tfrac{1}{2}\mathbb{E}_{x,\tilde x}\!\left[\|s_\theta(\tilde x) - \nabla_{\tilde x}\log q(\tilde x\mid x)\|^2\right]. \]

Song et al.[16] unified DDPM and NCSN under a continuous-time SDE \(dx_t = f(x_t,t)dt + g(t)dw_t\) with reverse \(dx_t = [f(x_t,t) - g^2(t)\nabla_x\log p_t(x_t)]dt + g(t)d\bar w_t\).

| Discrete → Continuous | Coefficients | SDE |
| --- | --- | --- |
| NCSN → VE-SDE | \(f=0,\ g=\sqrt{d\sigma^2(t)/dt}\) | \(dx = g(t)\,dw\) |
| DDPM → VP-SDE | \(f=-\tfrac{1}{2}\beta(t)x,\ g=\sqrt{\beta(t)}\) | \(dx = -\tfrac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)}\,dw\) |

The probability-flow ODE drops the stochastic term: \(dx_t = [f(x_t,t) - \tfrac{1}{2}g^2(t)\nabla_x\log p_t(x_t)]dt\). Deterministic, preserves marginals, exact likelihood.

Deep dive: Yang Song — Generative Modeling by Estimating Gradients of the Data Distribution · Karras et al. EDM.

11. Flow matching and rectified flow

Flow matching[17] learns a velocity field \(v_\theta(x,t)\) transporting base \(p_0\) to data \(p_1\). Rectified Flow[18] uses the linear coupling \(x_t=(1-t)x_0+tx_1\), target \(v^\star(x,t)=\mathbb{E}[x_1-x_0\mid x_t=x,t]\):

\[ \mathcal{L}_\text{RF}(\theta)=\mathbb{E}_{(x_0,x_1)\sim\pi,\,t\sim\rho}\!\left[\|v_\theta(x_t,t)-(x_1-x_0)\|^2\right]. \]

def rf_loss(model, x1):
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), device=x1.device)         # uniform
    tb = t.view(-1, 1, 1, 1)
    xt = (1 - tb) * x0 + tb * x1
    v_target = x1 - x0
    v_pred = model(xt, t)
    return F.mse_loss(v_pred, v_target)

Reflow. After first training, retrain on self-generated pairs \((x_0,\hat x_1)\). Curvature drops, few-step sampling becomes viable. For a dedicated walk-through see my Rectified Flow note.

Logit-normal time. Uniform \(t\) under-trains the middle of the trajectory. SD3 samples \(t\sim \sigma(\mathcal{N}(\mu,\sigma^2))\) to push mass where errors accumulate.
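The sampler itself is two lines (defaults are illustrative; SD3 tunes \(\mu\) and \(\sigma\)):

```python
import torch

def logit_normal_t(n, mu=0.0, sigma=1.0, device=None):
    # t = sigmoid(z), z ~ N(mu, sigma^2): mass concentrates mid-trajectory
    z = mu + sigma * torch.randn(n, device=device)
    return torch.sigmoid(z)
```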

Mean flow. Predict average velocity over an interval \([t_1, t_2]\); approximates single-step generation.

Deep dive: NeurIPS 2024 Tutorial — Flow Matching for Generative Modeling · Google DeepMind — Diffusion Meets Flow Matching blog.

12. Sampling and guidance

DDIM

DDIM[19] replaces the ancestral Markov update with a deterministic step using \(\hat x_{0\mid t}\):

\[ x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat x_{0\mid t} + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t,t) + \sigma_t z. \]

\(\sigma_t=0\) → deterministic DDIM (same marginals, step-skipping allowed). The distinction to keep in mind: the DDPM reverse keeps the stochastic diffusion term \(g(t)\,d\bar w_t\), while the deterministic DDIM reverse drops it.

@torch.no_grad()
def ddim_sample(model, alphas_bar, steps, shape, device, eta=0.0):
    xt = torch.randn(shape, device=device)
    ts = torch.linspace(len(alphas_bar) - 1, 0, steps + 1).long().to(device)
    for i in range(steps):
        t_now, t_nxt = ts[i], ts[i + 1]
        ab, ab_nxt = alphas_bar[t_now], alphas_bar[t_nxt]
        eps = model(xt, t_now.expand(shape[0]))
        x0_hat = (xt - (1 - ab).sqrt() * eps) / ab.sqrt()
        sigma = eta * ((1 - ab_nxt) / (1 - ab) * (1 - ab / ab_nxt)).sqrt()
        noise = torch.randn_like(xt) if eta > 0 else 0
        xt = ab_nxt.sqrt() * x0_hat + (1 - ab_nxt - sigma**2).sqrt() * eps + sigma * noise
    return xt

ODE solvers

Given the PF-ODE / FM velocity, higher-order solvers buy quality at fixed step count:

  • Euler. \(x_{t-\Delta t} = x_t + \Delta t\,v_\theta(x_t,t)\).
  • Midpoint. Half-step then evaluate at midpoint.
  • Heun. Euler step, re-evaluate, average (2nd order).
  • DPM-Solver / EDM[20]. Exploit the semi-linear structure; strong baselines.
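A Heun sampler for a flow-matching velocity field, integrating \(dx/dt = v_\theta(x,t)\) from noise (\(t=0\)) to data (\(t=1\)); the model signature is an assumption:

```python
import torch

@torch.no_grad()
def heun_sample(v_model, x, steps=20):
    ts = torch.linspace(0, 1, steps + 1, device=x.device)
    for i in range(steps):
        t0, t1 = ts[i], ts[i + 1]
        dt = t1 - t0
        v0 = v_model(x, t0.expand(x.size(0)))
        x_euler = x + dt * v0                       # predictor (Euler step)
        v1 = v_model(x_euler, t1.expand(x.size(0)))
        x = x + dt * 0.5 * (v0 + v1)                # corrector (average slopes)
    return x
```

Dropping the corrector line recovers plain Euler; the averaged slope is what makes Heun second-order.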

Classifier-free guidance (CFG)

Train one network on both conditional \(\epsilon_\theta(x,t,c)\) and unconditional \(\epsilon_\theta(x,t,\varnothing)\) by randomly dropping \(c\). At inference:

\[ \tilde\epsilon = \epsilon_\theta(x,t,\varnothing) + w\,(\epsilon_\theta(x,t,c) - \epsilon_\theta(x,t,\varnothing)). \]

@torch.no_grad()
def cfg_eps(model, x, t, c, w=7.5):
    eps_c   = model(x, t, c)
    eps_unc = model(x, t, None)
    return eps_unc + w * (eps_c - eps_unc)

On flow models, naive CFG amplifies trajectory curvature; my Rectified-CFG++[21] fixes this with a predictor-corrector step.

Classifier guidance, DPS, MPGD

Classifier guidance.[22] \(\nabla_x\log p(x\mid c) = \nabla_x\log p(c\mid x) + \nabla_x\log p(x)\). Uses a differentiable classifier.

DPS.[23] For inverse problems \(y = A(x)+n\): \(\nabla_{x_t}\log p(y\mid x_t)\approx -\tfrac{1}{\sigma_y^2}\nabla_{x_t}\|y - A(\hat x_{0\mid t})\|^2\). Goes off-manifold.

MPGD.[24] Project the guidance update back onto the manifold via a VAE/autoencoder tangent. Extra forward pass; big quality win.

[Figure: DPS pushes \(\hat x_{0\mid t}\) off the data manifold; MPGD projects the update back onto it via an autoencoder.]
Same gradient direction, different landing point.

Deep dive: Sander Dieleman — Guidance: a cheat code for diffusion models · Sander Dieleman — Perspectives on diffusion.

13. Conditioning in generative models

Three families; real systems combine them.

  1. Architectural injection
    • Concatenation \([x, c]\) — GAN/AR defaults.
    • Cross-attention — T2I workhorse.
    • Conditional norms:
      • AdaIN: \(\gamma(c)\tfrac{x-\mu(x)}{\sigma(x)}+\beta(c)\).
      • FiLM: \(c\to(\gamma,\beta)\to\gamma\odot x+\beta\).
      • AdaLN / AdaLN-Zero: LayerNorm with scale/shift from \(c\); DiT/MM-DiT.
  2. Guidance methods: CFG, classifier guidance, attention-guided sampling.
  3. Spatial / structural conditioning
    • ControlNet[25]: locked base UNet + trainable copy added via zero-initialized 1×1 convs.
    • T2I-Adapter[26]: lightweight side network added at intermediate layers.
    • LoRA[27]: \(W + \Delta W = W + AB\); \(A\) Gaussian-init, \(B\) zero. Trainable params = \(r(d_\text{in}+d_\text{out})\) per layer.
class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in base.parameters(): p.requires_grad_(False)
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.02)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r
    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

MM-DiT (SD3 / Flux)

Caption encoded jointly by CLIP-L/14, CLIP-G/14, and T5-XXL. Pooled CLIP + sinusoidal timestep → MLP → AdaLN-Zero modulation. T5 token sequence is concatenated with image patch tokens; joint self-attention across the combined sequence, with per-stream norms and projections. AdaLN-Zero initializes modulation scale to zero so blocks act as identity at the start of training.
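The AdaLN-Zero mechanism described above can be sketched roughly as follows (a simplified single-gate version; real DiT/MM-DiT blocks emit six modulation signals, shift/scale/gate for both attention and MLP):

```python
import torch

class AdaLNZero(torch.nn.Module):
    """AdaLN-Zero sketch: conditioning vector c -> (shift, scale, gate).
    The modulation linear is zero-initialized, so gate == 0 at init and
    the wrapped block behaves as the identity at the start of training."""
    def __init__(self, d):
        super().__init__()
        self.norm = torch.nn.LayerNorm(d, elementwise_affine=False)
        self.mod = torch.nn.Linear(d, 3 * d)
        torch.nn.init.zeros_(self.mod.weight)
        torch.nn.init.zeros_(self.mod.bias)

    def forward(self, x, c, block):
        shift, scale, gate = self.mod(c).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift   # conditioned normalization
        return x + gate * block(h)               # gated residual; identity at init
```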

Deep dive: ControlNet paper · HF PEFT — LoRA conceptual guide · SD3 / MM-DiT paper.

14. Text-to-image design space

Every modern T2I/T2V system decomposes along four axes — training objective, latent space, backbone, and text encoder. Memorize this grid.

| Model | Training | Latent | Backbone | Text |
| --- | --- | --- | --- | --- |
| SD 1/2 | DDPM | VQ-GAN | UNet | CLIP |
| SD 3 / Flux 1 | Flow matching | Advanced VAE | DiT / MM-DiT | CLIP + T5 |
| Flux 2 / Z-Image / Qwen-Image | Flow matching | Advanced VAE | MM-DiT | LLM/VLM/MLLM + MM-DiT |
| Transfusion / Hunyuan 3.0 | Flow matching | Advanced VAE | Native-MM | Native MM tokens |
| Nano Banana / GPT-4o Image | FM + diffusion head | Advanced VAE | Native-MM | Native MM + tools |

Why CLIP alone stalled as text encoder

  • Spatial relations (“dog left of cat”) — weak.
  • Negation (“room without a window”) — fails.
  • Counting and fine-grained attributes — fails.
  • 77-token cap — too short.

T5-XXL handles long captions, dense semantics, and compositional detail; SD3/Flux use both (CLIP for pooled, T5 for token sequence). Flux 2-era models drop CLIP and use a single LLM/VLM encoder.

Scaling rule

Rule-of-thumb (Chinchilla): training tokens ≈ 20× #parameters. Holds well for text; for T2I, data ceilings kick in as resolution climbs.

Deep dive: Stability AI — SD3 research blog · Black Forest Labs — Flux announcement · Transfusion paper (Meta).

15. Video generation and inverse problems

Temporal attention strategies

Joint 3D attention is prohibitive. Alternatives:

  • Window attention. Current frame + last \(k\) neighbors.
  • Sliding window. Stride < window; smoother transitions.
  • Factorized spatial–temporal. Alternate spatial and temporal blocks (Lumiere, Hunyuan-Video).
  • Token compression. Compress older frames into fewer K/V tokens.
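Factorized spatial–temporal attention reduces to two reshapes around ordinary attention — a minimal sketch with illustrative names (`spatial_attn` and `temporal_attn` are any modules mapping `(N, L, D) -> (N, L, D)`):

```python
import torch

def factorized_st_attention(x, spatial_attn, temporal_attn):
    """x: (B, T, S, D) video tokens — T frames, S spatial tokens per frame.
    Spatial pass attends within each frame; temporal pass attends across
    frames at each fixed spatial location."""
    B, T, S, D = x.shape
    # spatial: fold frames into the batch dimension
    x = spatial_attn(x.reshape(B * T, S, D)).reshape(B, T, S, D)
    # temporal: fold spatial locations into the batch dimension
    x = x.transpose(1, 2).reshape(B * S, T, D)
    x = temporal_attn(x).reshape(B, S, T, D).transpose(1, 2)
    return x
```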

Latent video diffusion

Shared outline across Movie Gen, Sora, Veo, Hunyuan-Video, Wan, OpenSora:

  1. 3D VAE (or causal 3D VAE) to compress space + time.
  2. MM-DiT backbone with alternating / joint 3D attention.
  3. Flow matching + CFG at inference.
  4. Camera / motion conditioning via cross-attention with motion tokens.

Inverse Tone Mapping (SDR→HDR)

From my LumaFlux line of work:

  • Gain-map learning. Predict per-pixel gain scaling SDR luminance.
  • Diffusion-prior ITM. Start from SDR conditioning, generate the HDR residual; Physically-Guided Adaptation (PGA) + Perceptual Cross-Modulation (PCM).
  • Rational Quadratic Spline (RQS) decoder. Learnable invertible tone curve.

Deep dive: OpenAI — Sora technical report · Hunyuan-Video tech report · Meta — Movie Gen.

16. Discrete diffusion

For tokens/molecules, noising is random state transitions. Transition matrix \(Q\) gives

\[ x_t\mid x_{t-1}\sim\text{Cat}(p=x_{t-1}Q_t),\quad x_t\mid x_0\sim\text{Cat}(p=x_0\bar Q_t). \]
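A minimal sketch of the forward corruption for a uniform-transition matrix (names illustrative; `Q_bar` plays the role of \(\bar Q_t\)):

```python
import torch

def uniform_Q_bar(K, beta_bar):
    """Cumulative uniform-transition matrix over K states: keep the
    original token with prob 1 - beta_bar, else resample uniformly."""
    return (1 - beta_bar) * torch.eye(K) + beta_bar * torch.ones(K, K) / K

def d3pm_forward(x0, Q_bar):
    """x0: (B,) integer tokens. Row x0[i] of Q_bar is the categorical
    distribution of x_t given x_0 = x0[i]; sample x_t from it."""
    probs = Q_bar[x0]                        # (B, K)
    return torch.multinomial(probs, 1).squeeze(-1)
```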

Reverse posterior

\[ q(x_{t-1}\mid x_t,x_0) = \text{Cat}\!\left(p = \tfrac{x_t Q_t^\top \odot x_0\bar Q_{t-1}}{x_0\bar Q_t x_t^\top}\right). \]

MDLM, LLaDA[28], and the dLLM line apply this to text generation — the language-side analogue of image diffusion.

Deep dive: Austin et al. — Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM) · LLaDA paper.

17. LLMs, GPT, nanoGPT

All modern decoder-only LLMs are thin wrappers on the same stack: tokenizer → embedding → \(N\)× (RMSNorm, MHA with RoPE and GQA/MLA, RMSNorm, SwiGLU FFN) → LM head. Karpathy’s nanoGPT[29] is the smallest fluent reference — read it end-to-end.

class GPTBlock(torch.nn.Module):
    def __init__(self, d, h):
        super().__init__()
        self.ln1 = torch.nn.LayerNorm(d)
        self.attn = MultiHeadAttention(d, h)
        self.ln2 = torch.nn.LayerNorm(d)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(d, 4*d), torch.nn.GELU(),
            torch.nn.Linear(4*d, d),
        )
    def forward(self, x, mask):
        x = x + self.attn(self.ln1(x), mask)
        x = x + self.ffn(self.ln2(x))
        return x

Sampling

  • Top-k. Keep highest-probability \(k\) tokens, renormalize.
  • Nucleus (top-p). Smallest set with cumulative probability \(\ge p\).
  • Temperature. Scale logits by \(1/T\) before softmax.
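The three sampling knobs compose into one function — a minimal single-token, unbatched sketch (`sample_next` is an illustrative name):

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Temperature scaling -> top-k filter -> nucleus filter -> sample.
    logits: (V,). Filtered tokens are set to -inf before softmax."""
    logits = logits / temperature
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float('-inf'))
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum = torch.softmax(sorted_logits, dim=-1).cumsum(-1)
        remove = cum > top_p
        remove[1:] = remove[:-1].clone()   # keep the first token crossing p
        remove[0] = False
        logits[sorted_idx[remove]] = float('-inf')
    return torch.multinomial(F.softmax(logits, dim=-1), 1).item()
```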

Deep dive: Karpathy — Let’s build GPT from scratch · Stanford CS336 — Language Modeling from Scratch.

18. Evaluation metrics (consolidated)

Section 5 covered image-quality metrics; this is the cross-modality summary.

| Modality | Fidelity | Alignment | Preference | Holistic |
| --- | --- | --- | --- | --- |
| Image gen | FID, PSNR, SSIM, VIF, LPIPS | CLIPScore, VQAScore | HPSv2, ImageReward | MLLM-judge, HEIM |
| Video gen | FVD, CLIP-Temp/FrameSim | CLIPScore over frames | (learned) | VBench v2, PhysGenBench, VideoPhy2 |
| LLM | perplexity | BLEU, ROUGE | ArenaHard, AlpacaEval | MMLU, GSM8K, HELM |

Deep dive: Stanford HELM and HEIM for holistic LLM / image eval · VBench leaderboard for video.

19. Training lifecycle: pretraining → mid-training → post-training

Modern generative models (LLMs and diffusion/flow image/video models) follow a three-stage pipeline. This section is a pointer map — know what each stage does, what data it consumes, and what objective it optimizes.

Figure (pipeline schematic): Pretraining (broad web-scale data; next-token / ε-loss) → Mid-training (long-context, math/code, higher-res, domain mix) → Post-training (SFT → preferences → RL; align to humans/tasks). “Raw model” → “capable model” → “useful model”; compute shifts right, data quality shifts right, data volume shifts left.
Pretraining dominates compute; post-training dominates quality gains. Mid-training is the bridge that extends context, injects math/code/reasoning, and raises image resolution.

Pretraining

Learn a world model by predicting the next piece of content on an enormous, minimally curated corpus.

  • Objective (LLM). Next-token cross-entropy \(\mathcal{L} = -\mathbb{E}_{x}\sum_{t}\log p_\theta(x_t\mid x_{<t})\).
  • Objective (diffusion/FM). \(\epsilon\)-prediction or velocity regression (sections 9 and 11). Trained on web-scale image–text pairs (LAION-5B, DataComp, CommonPool, internal crawls).
  • Data. FineWeb / FineWeb-Edu, RedPajama, RefinedWeb, The Stack, plus code/math specialty mixes. Quality filtering (CLIP score, perplexity, fastText) and heavy deduplication matter more than raw volume.
  • Compute profile. Long horizon (weeks–months), thousands of accelerators, LR warmup + cosine decay, effective batches of ~4M tokens, FP8/BF16 mixed precision, ZeRO-3 or FSDP sharding.
  • Eval. Perplexity on held-out mix; a small suite of in-domain probes (MMLU, HumanEval for LLMs; small FID/CLIPScore for image models). Full eval comes later.

Mid-training (a.k.a. continued pretraining, annealed pretraining)

A second, smaller pretraining stage that reshapes the model’s capability profile before alignment. Much more impactful than the literature suggests.

  • Long-context extension. Extend context from 4K/8K to 32K → 128K → 1M. Techniques: RoPE base rescaling (NTK-aware, YaRN), PI (position interpolation), needle-in-haystack eval.
  • Capability injection. Upsample math (OpenWebMath, NuminaMath), code (The Stack v2), reasoning chains. The “annealing” trick: near the end of pretraining, switch the data mix heavily toward high-quality sources and drop LR sharply.
  • Domain adaptation. Medical / legal / finance / multilingual corpora.
  • For image/video models: resolution curriculum (64 → 256 → 1024 → native), aspect-ratio bucketing, human-aesthetic filtering, synthetic-caption rewriting with a VLM.
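The NTK-aware base-rescaling trick mentioned above can be sketched in one line. The exponent form below is the commonly cited variant; treat it as an assumption rather than the exact recipe of any specific model:

```python
def ntk_rope_base(base, scale, head_dim):
    """NTK-aware RoPE base rescaling (sketch): to extend context by
    `scale`x, grow the rotary base so low frequencies get interpolated
    while high frequencies stay nearly intact. Common form:
    base * scale ** (d / (d - 2)) for head dimension d."""
    return base * scale ** (head_dim / (head_dim - 2))

# e.g. a 4K -> 32K extension (scale = 8) on 128-dim heads rescales
# the usual base of 10000 to roughly 10000 * 8^(128/126)
```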

Post-training

Turn a raw next-token predictor / image denoiser into something that actually follows instructions and matches human preferences.

1. Supervised Fine-Tuning (SFT)

Continue the pretraining objective on a curated set of (instruction, high-quality response) pairs. Scale: 10K–1M examples; lower LR (1e-5 to 5e-6 range); 1–3 epochs. Loss is the same token-level CE but only on the response tokens (mask the prompt).

def sft_loss(logits, labels, response_mask):
    # logits: (B, T, V); labels: (B, T); response_mask: (B, T) 1 where response
    shift_logits = logits[:, :-1].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    shift_mask = response_mask[:, 1:].contiguous().float()
    ce = F.cross_entropy(shift_logits.transpose(1, 2), shift_labels, reduction='none')
    return (ce * shift_mask).sum() / shift_mask.sum().clamp_min(1)

Common data: FLAN, Tulu, OpenAssistant, Alpaca, ShareGPT, instruction-synth with a strong teacher (“distillation SFT”). Synthetic data is now the majority of most SFT pipelines.

2. Preference data collection

Triples \((x, y_w, y_l)\) with a human or AI preference over responses \(y_w \succ y_l\). Sources: human labelers (UltraFeedback), strong teacher models (RLAIF), or self-play with a reward model.

3. Reward modeling + RLHF (PPO)

Fit a scalar reward model \(r_\phi(x,y)\) to preferences via a Bradley–Terry logistic loss \(\mathcal{L}_{RM} = -\log\sigma(r_\phi(x,y_w) - r_\phi(x,y_l))\). Then PPO the policy with \(r_\phi\) plus a per-token KL anchor \(r_t = r_\phi(x,y) - \beta\log\tfrac{\pi_\theta(y_t\mid\cdot)}{\pi_\text{ref}(y_t\mid\cdot)}\). Full math and loss in section 22.
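The Bradley–Terry reward-model loss is two lines — a sketch assuming per-example scalar rewards for chosen/rejected responses:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry logistic loss: -log sigma(r_w - r_l), averaged over
    the batch. r_chosen / r_rejected: (B,) scalar rewards from r_phi."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```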

4. Direct Preference Optimization (DPO)

Skip the reward model. Optimize the policy directly on preference pairs (derivation in section 22). In practice DPO is the baseline for almost every open-source post-training pipeline because it is stable, single-loop, and reward-model-free.

5. GRPO / DAPO — critic-free RL

Sample \(N\) responses per prompt, normalize rewards within the group, apply PPO-style token-level clipping. GRPO removed the value network entirely; DAPO further fixes clipping, sampling, and gradient-dilution pathologies. See section 22.

6. Rejection sampling / Best-of-N

Generate \(N\) candidates, score each with a reward model, keep the best. Embarrassingly simple, widely used as both a baseline and for generating SFT data for the next iteration (“rejection-sampling fine-tuning”, RSFT).

7. Constitutional AI / RLAIF

Replace human preference labels with AI-generated critiques against a written constitution. Scales preference collection to millions of examples cheaply; most frontier labs use a hybrid of human + AI labels.

Post-training for diffusion / flow models

The same ideas transfer — “response” becomes “generated image/video” and the reward is an image preference / aesthetic / alignment score (ImageReward, HPSv2, PickScore, VQAScore).

  • SFT for diffusion. Fine-tune with the standard diffusion loss on a curated high-quality subset (e.g., aesthetically filtered, human-preferred).
  • Diffusion-DPO. DPO adapted to the diffusion objective: compare noise predictions on winning vs losing images per timestep.
  • DRaFT / ReFL / AlignProp. Backprop a differentiable reward through truncated sampling; very effective at preference alignment but compute-heavy.
  • DPOK. PPO-style policy gradient on the diffusion sampler.
  • LoRA preference tuning. Do any of the above through low-rank adapters for cheap style / aesthetic personalization.
Rule of thumb. Pretraining buys capability; mid-training buys context and domain competence; post-training buys usefulness and safety. Most recent frontier wins have come from better post-training (o1/o3-style reasoning via RL, DeepSeek-R1 via GRPO on verifiable rewards), not from bigger pretraining.

Deep dive: Nathan Lambert — The RLHF Book · Interconnects — Post-Training 101 · HF Alignment Handbook · DeepSeek-R1 technical report.

20. Training nuances

Mixed precision

FP32 master copy + FP16/BF16 computation; loss scaling to prevent underflow of small gradients; BF16 is preferred (same exponent range as FP32, no loss scaling needed).

Data-parallel vs ZeRO vs FSDP

  • DDP. Every GPU has a full model copy; gradients all-reduced each step.
  • ZeRO-1/2/3[30]: shard optimizer state, then gradients, then parameters across GPUs.
  • FSDP[31]: PyTorch-native ZeRO-3 equivalent; all-gather + reduce-scatter per layer; integrates with activation checkpointing.

Memory budget back-of-envelope

At fp16 training with Adam: params (2 B) + grads (2 B) + Adam moments (8 B) ≈ 12 bytes per parameter, plus activations (depth-dependent). A 7B model needs ~84 GB just for weights, grads, and optimizer state — before activations. FSDP/ZeRO-3 shards that across N GPUs.
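The arithmetic above as a one-liner (`train_memory_gb` is an illustrative helper, not a real profiler):

```python
def train_memory_gb(n_params, bytes_per_param=12, n_gpus=1):
    """Back-of-envelope training memory: fp16 params (2) + fp16 grads (2)
    + Adam moments (8) = ~12 bytes/param, sharded across n_gpus under
    ZeRO-3/FSDP. Activations are excluded on purpose."""
    return n_params * bytes_per_param / n_gpus / 1e9

# 7B model: ~84 GB on one GPU; ~10.5 GB/GPU sharded across 8
```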

Gradient accumulation & EMA

Accumulate gradients over \(k\) micro-batches before stepping the optimizer → effective batch \(B\cdot k\) at no extra memory. EMA of weights (\(\theta_\text{ema}\leftarrow\alpha\theta_\text{ema}+(1-\alpha)\theta\), \(\alpha\approx0.999\)) is non-optional for diffusion training; sample with EMA weights.

class EMA:
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = {n: p.detach().clone() for n, p in model.named_parameters()}
    @torch.no_grad()
    def update(self, model):
        for n, p in model.named_parameters():
            self.shadow[n].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)
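The accumulation loop itself, as a sketch (`train_step` and its arguments are illustrative; note the 1/k loss scaling that keeps the accumulated gradient equal to that of the full effective batch):

```python
import torch

def train_step(model, optimizer, micro_batches, loss_fn, accum_steps):
    """Accumulate gradients over accum_steps micro-batches, then take a
    single optimizer step — effective batch B*k at micro-batch memory."""
    optimizer.zero_grad()
    total = 0.0
    for xb, yb in micro_batches:
        loss = loss_fn(model(xb), yb) / accum_steps  # scale so grads sum correctly
        loss.backward()                              # grads accumulate in .grad
        total += loss.item()
    optimizer.step()
    return total
```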

Deep dive: HF — The Ultra-Scale Playbook (Training LLMs on GPU Clusters) · PyTorch FSDP Advanced Tutorial.

21. Inference nuances

Quantization

FP32 → FP16 (×2) → INT8 (×4) → INT4 (×8). Two common schemes:

  • Zero-point (asymmetric). Map \([x_\min, x_\max]\) to \([0, 2^b-1]\); preserves zero exactly — important for ReLU activations.
  • Absolute-max (symmetric). Map \([-|x|_\max, |x|_\max]\) to \([-2^{b-1}, 2^{b-1}-1]\).
def quantize_int8_symmetric(x):
    scale = x.abs().max() / 127
    q = (x / scale).round().clamp(-128, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale
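For contrast, the zero-point (asymmetric) scheme from above — a sketch mapping to uint8 so that real zero lands on an exact integer:

```python
import torch

def quantize_uint8_asymmetric(x):
    """Zero-point scheme: map [x_min, x_max] onto [0, 255]. Real zero
    maps to the integer zero_point exactly — useful for ReLU activations."""
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp_min(1e-12) / 255
    zero_point = (-x_min / scale).round().clamp(0, 255)
    q = (x / scale + zero_point).round().clamp(0, 255).to(torch.uint8)
    return q, scale, zero_point

def dequantize_uint8(q, scale, zero_point):
    return (q.float() - zero_point) * scale
```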

Activation-aware schemes (AWQ, GPTQ, AQLM) keep sensitive channels at higher precision; LLM.int8()[32] handles outlier features in fp16 while quantizing the rest.

Pruning and distillation

  • Pruning. Zero the lowest-magnitude \(x\%\) of weights per layer; structured pruning zeros entire heads/channels. Retrain or LoRA-adapt to recover.
  • Distillation. Student learns \(\text{CE}(\text{student}, \text{true}) + \lambda\,D_\text{KL}(\text{teacher}\Vert\text{student})\) on temperature-softened logits.
  • Distribution-matching distillation. Collapses multi-step image gen to 1–4 steps by matching output distributions at each noise level. Consistency models and LCM live in this family.
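A sketch of the distillation objective with temperature-softened logits (names illustrative; the \(T^2\) factor keeps gradient magnitudes comparable across temperatures):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """CE to ground truth + lam * T^2 * KL(teacher || student) on
    logits softened by temperature T. Shapes: (B, V) logits, (B,) labels."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),   # input: student log-probs
        F.log_softmax(teacher_logits / T, dim=-1),   # target: teacher log-probs
        log_target=True, reduction='batchmean',
    )
    return ce + lam * T * T * kl
```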

Serving stacks

  • vLLM. Paged KV cache + continuous batching; the SOTA OSS LLM server.
  • TensorRT-LLM. NVIDIA’s fused-kernel engine; best single-node throughput.
  • Continuous batching. Merge new requests into in-flight batches at every decode step; throughput ↑ by 2–5× over static batching.

Deep dive: Anyscale — Continuous batching for LLM inference · vLLM — PagedAttention blog.

22. RL alignment — PPO, DPO, GRPO, DAPO in detail

PPO + RLHF

Three models: policy, reward, value. Clipped surrogate objective:

\[ \mathcal{L}_\text{PPO} = \mathbb{E}\!\left[\min\!\left(r_t A_t,\ \text{clip}(r_t, 1-\varepsilon, 1+\varepsilon) A_t\right)\right], \quad r_t = \tfrac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}. \]

Advantage from GAE; KL penalty \(-\beta D_\text{KL}(\pi_\theta\Vert\pi_\text{ref})\) anchors the policy.
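A minimal sketch of the clipped surrogate as a loss to minimize (names illustrative; advantages come from GAE upstream):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate (negated, so minimizing it maximizes the
    objective). logp_*: per-token log-probs under current / behavior policy."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```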

DPO

DPO[33] aligns on preference pairs \((y_w, y_l)\) without a reward model:

\[ \mathcal{L}_\text{DPO} = -\mathbb{E}\!\left[\log\sigma\!\left(\beta\log\tfrac{\pi_\theta(y_w)}{\pi_\text{ref}(y_w)} - \beta\log\tfrac{\pi_\theta(y_l)}{\pi_\text{ref}(y_l)}\right)\right]. \]

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    # pol_w, ref_w: log-probs under policy / reference for chosen
    # pol_l, ref_l: same for rejected
    return -F.logsigmoid(beta * ((pol_w - ref_w) - (pol_l - ref_l))).mean()

GRPO

GRPO[34] drops the critic. For each prompt, sample \(N\) responses, score, and normalize within group:

\[ A_i = \tfrac{r_i - \text{mean}(R_\text{group})}{\text{std}(R_\text{group})+\varepsilon}. \]

def grpo_advantages(rewards):                      # rewards: (G,)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

DAPO

DAPO addresses four GRPO failure modes[35]:

  1. Wasted learning signal (when every response in a group gets the same reward, all advantages are zero and the prompt contributes no gradient).
  2. Fixed \(\varepsilon\) clipping is too aggressive for high-reward samples → Clip-Higher: asymmetric clip range \((1-\varepsilon_\text{low},\ 1+\varepsilon_\text{high})\) with \(\varepsilon_\text{high} > \varepsilon_\text{low}\).
  3. Redundant sampling → dynamic sampling (keep only samples with non-trivial advantage).
  4. Gradient dilution in long sequences → token-level gradient loss (average over all tokens across batch, not per-sample then over samples).
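The sample-level vs token-level averaging difference (fix 4) is easiest to see in code — a toy sketch with illustrative names:

```python
import torch

def sample_level_loss(tok_loss, mask):
    """GRPO-style: average per sample, then over samples. Long responses
    get the same total weight as short ones (gradient dilution)."""
    per_sample = (tok_loss * mask).sum(-1) / mask.sum(-1).clamp_min(1)
    return per_sample.mean()

def token_level_loss(tok_loss, mask):
    """DAPO-style: average over all valid tokens in the batch, so every
    token contributes equally regardless of sequence length."""
    return (tok_loss * mask).sum() / mask.sum().clamp_min(1)
```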

Best-of-N

Sample \(N\) outputs at \(T=0.7\), rank by reward model, keep the best. Embarrassingly simple, hard to beat at small \(N\).

Deep dive: HF blog — From GRPO to DAPO and GSPO · HF TRL library docs (DPO, PPO, GRPO implementations) · algoroxyolo — RL reading list (2026).

23. Paper trace and behavioral

For image-gen / quality / video roles, I lead with three papers — one minute each.

  1. Rectified-CFG++ (NeurIPS 2025). “Standard CFG breaks on flow models because it amplifies trajectory curvature. We introduce a predictor-corrector guidance with zero extra training cost that fixes artifacts on Flux, SD3, and Lumina-Next, improves text rendering, and comes with theoretical guarantees.”blog.
  2. HDR-Q (CVPR 2026). “First multimodal LLM for HDR video quality assessment, with HAPO: contrastive KL, dual-entropy regularization, and SigLIP-2 HDR-aware encoding.”slides.
  3. LumaFlux. “SDR-to-HDR inverse tone mapping using Flux 12B; 17M trainable parameters via PGA, PCM, and a Rational Quadratic Spline decoder.”slides.

Rehearse three lengths: 30-second, 2-minute, 10-minute. Let the interviewer pick.

Checklist for the night before

  • Python basics (GC, PEP 8, *args/**kwargs) — 60 seconds.
  • MLE for a Gaussian with code.
  • CNN output formula with one worked example.
  • Scaled dot-product attention from scratch — derive and code.
  • RoPE 1D/2D/3D, ALiBi, sinusoidal: pros/cons each.
  • KV cache back-of-envelope (30B, 48-layer, 7168-dim).
  • FlashAttention online softmax identity.
  • VAE ELBO both derivations in 60 seconds.
  • DDPM forward + reverse + \(\epsilon\) loss with code.
  • Tweedie and \(\hat x_{0\mid t}\).
  • VP-SDE vs VE-SDE table.
  • Rectified flow loss, reflow, logit-normal time.
  • DDIM step (stochastic ↔ deterministic).
  • CFG in \(\epsilon\) and score form.
  • DPS vs MPGD.
  • T2I design-space table.
  • SSIM, PSNR, VIF intuition; when FID saturates.
  • Retinex and color spaces; PQ vs HLG.
  • PPO / DPO / GRPO / DAPO one-liners with DAPO fixes.
  • Three papers of my own, three durations each.
  • One LeetCode-medium on the morning of.

Curated reading list

Resources I came back to repeatedly during prep. Grouped by ladder stage.

Live companion: my running Literature notebook on Notion has the paper-by-paper notes (HDR, SDR-to-HDR, diffusion, flow matching, video ITM, multimodal diffusion) along with tags, figures, and summaries that feed this post.

Foundations (DL / CUDA / CV)

General note. Most newer Stanford courses cover the foundations well and the lectures are on YouTube — just search the topic and pick the most recent offering. The list below is what I keep coming back to, but the honest meta-advice is: find the latest semester of CS229 (ML), CS231n (CV), CS224N (NLP), CS25 (Transformers United), CS336 (LLMs from Scratch), or CS330 (multitask & meta-learning), open the course website alongside the YouTube playlist, and read.

SSL & GANs

Transformers, LLMs, GPT

Diffusion / Flow Matching / Rectified Flow

Training large models

Inference

RL alignment

Interview question banks

Coding practice

References

  1. NeetCode — LeetCode pattern-based prep.
  2. Sheikh & Bovik, Image Information and Visual Quality (VIF), IEEE TIP 2006.
  3. Wu et al., DOVER: Exploring Video Quality Assessment Through Aesthetic and Technical Perspectives, 2022.
  4. Wu et al., Human Preference Score v2 (HPSv2), 2023.
  5. Huang et al., VBench: Comprehensive Benchmark Suite for Video Generative Models, 2023.
  6. Stanford CS231n — Convolutional Neural Networks for Visual Recognition.
  7. Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention, 2022.
  8. Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models, 2023.
  9. DeepSeek-V2 Technical Report (introduces MLA), 2024.
  10. Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding, 2021.
  11. Press et al., Train Short, Test Long: Attention with Linear Biases (ALiBi), 2021.
  12. Kingma & Welling, Auto-Encoding Variational Bayes, 2013.
  13. van den Oord et al., Neural Discrete Representation Learning (VQ-VAE), 2017.
  14. Ho et al., Denoising Diffusion Probabilistic Models, 2020.
  15. Song & Ermon, Generative Modeling by Estimating Gradients of the Data Distribution, 2019.
  16. Song et al., Score-Based Generative Modeling through Stochastic Differential Equations, 2021.
  17. Lipman et al., Flow Matching for Generative Modeling, 2022.
  18. Liu et al., Flow Straight and Fast: Rectified Flow, 2022.
  19. Song et al., Denoising Diffusion Implicit Models, 2020.
  20. Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models (EDM), 2022.
  21. Saini et al., Rectified CFG++ for Flow Based Models, NeurIPS 2025.
  22. Dhariwal & Nichol, Diffusion Models Beat GANs on Image Synthesis (Classifier Guidance), 2021.
  23. Chung et al., Diffusion Posterior Sampling for General Noisy Inverse Problems (DPS), 2022.
  24. He et al., Manifold Preserving Guided Diffusion (MPGD), 2023.
  25. Zhang et al., Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet), 2023.
  26. Mou et al., T2I-Adapter, 2023.
  27. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models, 2021.
  28. Nie et al., Large Language Diffusion Models (LLaDA), 2025.
  29. karpathy/nanoGPT.
  30. Rajbhandari et al., ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2019.
  31. Zhao et al., PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel, 2023.
  32. Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022.
  33. Rafailov et al., Direct Preference Optimization, 2023.
  34. Shao et al., DeepSeekMath (GRPO), 2024.
  35. From GRPO to DAPO and GSPO — HF Blog (Yihua Zhang).