Rectified Flow: A Technical Note on Objectives, Geometry, and Training

This post is a dense technical walkthrough of Rectified Flow: the exact regression objective, what it means in matrix form, why trajectory curvature dominates low-step sampling error, and which training tricks matter in practice.

Dec 2025 · Flow Models · Technical Note

Thesis. Rectified Flow is best understood as supervised regression of a time-indexed velocity field under a chosen coupling between noise and data distributions.

Main technical point. The objective is simple, but quality at low solver steps is controlled by trajectory curvature, not only endpoint correctness.

Practical implication. Better time sampling, loss weighting, and reflow-style rectification often buy more than adding another large architectural block.

1) Setup: Coupling, Interpolation, and Targets

Let \(x_0 \sim p_0\) be a base sample (usually Gaussian), and \(x_1 \sim p_1\) be a data sample. A coupling \(\pi(x_0, x_1)\) defines how these endpoints are paired[1]. Given \(t \in [0,1]\), define the linear interpolation

\[ x_t = (1-t)x_0 + t x_1. \]

The instantaneous displacement target is

\[ u(x_0,x_1) = x_1 - x_0. \]

Rectified Flow trains a neural vector field \(v_\theta(x,t)\) to predict this displacement from \((x_t,t)\). If \(v_\theta\) matches the conditional mean velocity, integrating \(\dot{x}_t = v_\theta(x_t,t)\) pushes \(p_0\) toward \(p_1\) with straightened trajectories[1][2].
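One training draw can be sketched in a few lines. This is an illustrative helper, not from the original: `rf_pair` is a hypothetical name, and it assumes the independent coupling (fresh Gaussian noise paired with each data sample); other couplings only change how \((x_0, x_1)\) are drawn together.

```python
import numpy as np

def rf_pair(x1, rng):
    """One training draw under the independent coupling (illustrative helper)."""
    x0 = rng.normal(size=x1.shape)           # base sample from N(0, I)
    t = rng.uniform(size=(x1.shape[0], 1))   # one t per example, broadcastable
    xt = (1.0 - t) * x0 + t * x1             # linear interpolation x_t
    u = x1 - x0                              # displacement target u(x0, x1)
    return xt, t, u
```

Note the identity \(x_t + (1-t)u = x_1\), which is also the endpoint formula used later for auxiliary losses.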

[Figure: curved transport vs. rectified transport, each mapping noise to data.]
Geometric view: low-step ODE solvers approximate straight trajectories much better than curved ones.

2) Loss Function and the Regression Optimum

The basic objective is

\[ \mathcal{L}_{\mathrm{RF}}(\theta) = \mathbb{E}_{(x_0,x_1)\sim\pi,\; t\sim\rho} \bigl[\|v_\theta(x_t,t) - (x_1-x_0)\|_2^2\bigr]. \]

Here \(\rho(t)\) is the time-sampling distribution (uniform is the default; non-uniform choices are often better in practice). The minimizer at each \((x,t)\) is the conditional expectation

\[ v^\star(x,t) = \mathbb{E}[x_1-x_0 \mid x_t=x,\; t]. \]

This follows from the projection theorem in \(L_2\):

\[ \mathbb{E}\|v-u\|^2 = \mathbb{E}\|v-v^\star\|^2 + \mathbb{E}\|u-v^\star\|^2, \]

so optimization can only reduce the first term; the second is irreducible variance from ambiguous pairings. This decomposition is useful when debugging: if training loss plateaus early with large data variance, the bottleneck may be coupling noise, not model capacity.
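The Pythagorean decomposition can be checked numerically. The sketch below uses a discrete feature as a stand-in for \((x_t, t)\): the group-wise mean plays the role of \(v^\star\), and the cross term vanishes for any other predictor that depends only on the feature.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=10_000)            # discrete stand-in for (x_t, t)
u = x.astype(float) + rng.normal(size=10_000)  # noisy regression target

# L2-optimal predictor: the conditional mean v*(x) within each group
vstar = np.array([u[x == k].mean() for k in range(3)])[x]

# Any other predictor that depends only on x, e.g. a uniformly biased one
v = vstar + 0.5

lhs = np.mean((v - u) ** 2)
rhs = np.mean((v - vstar) ** 2) + np.mean((u - vstar) ** 2)
```

The second `rhs` term is the irreducible variance: no predictor of \(x\) alone can reduce it.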

In matrix form for a batch of size \(B\), define \(X_t \in \mathbb{R}^{B\times d}\), \(U \in \mathbb{R}^{B\times d}\), and \(V_\theta(X_t,t)\in\mathbb{R}^{B\times d}\). Then

\[ \mathcal{L}_B(\theta)=\frac{1}{B}\|V_\theta(X_t,t)-U\|_F^2. \]

This Frobenius form makes clear that RF training is a standard least-squares regression over features \((x_t,t)\), which is why most optimization tricks transfer directly from diffusion training pipelines[5].
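The equivalence of the per-sample and Frobenius forms is a one-liner to verify; the names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
B, d = 8, 5
V = rng.normal(size=(B, d))   # predicted velocities, one row per sample
U = rng.normal(size=(B, d))   # displacement targets

per_sample = np.sum((V - U) ** 2, axis=1).mean()    # mean of ||v_i - u_i||^2
frobenius = np.linalg.norm(V - U, "fro") ** 2 / B   # (1/B) ||V - U||_F^2
```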

3) Matrix Sanity Check: Deterministic Linear Map

Consider a deterministic linear transport \(x_1 = A x_0\), with \(x_0 \sim \mathcal{N}(0, I)\). Then

\[ x_t = \bigl((1-t)I + tA\bigr)x_0 = M_t x_0, \quad M_t := (1-t)I+tA. \]

The target displacement is \(u=(A-I)x_0\), so expressed in terms of \(x_t\):

\[ u = (A-I)M_t^{-1}x_t = K_t x_t, \quad K_t := (A-I)M_t^{-1}. \]

Therefore the optimal velocity is exactly linear in \(x_t\) at each \(t\). If you fit a linear model class

\[ v_W(x,t)=W_t x, \]

then \(W_t=K_t\) is the minimizer. With samples stacked as rows of \(X_t\) and \(U\), the normal equations recover it in closed form from finite data whenever \(X_t^\top X_t\) is full rank:

\[ W_t^{\star} = (X_t^\top X_t)^{-1}X_t^\top U = K_t^\top. \]

This toy case explains why RF often converges quickly on low-dimensional synthetic distributions: the regression target is well-conditioned and nearly linear. In high-dimensional image manifolds, the same formula holds locally, but \(K_t\) varies sharply across regions, which is where network capacity and time embeddings matter.
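The closed form is easy to verify end to end. The sketch below builds the deterministic coupling, computes \(K_t\) analytically, and checks that least squares on samples recovers it (in the row-stacked convention the fitted matrix is \(K_t^\top\)); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, t = 3, 50_000, 0.3
A = rng.normal(size=(d, d))

x0 = rng.normal(size=(n, d))
x1 = x0 @ A.T                      # deterministic linear transport x1 = A x0
xt = (1 - t) * x0 + t * x1         # rows are x_t = M_t x0
u = x1 - x0                        # rows are u = (A - I) x0

Mt = (1 - t) * np.eye(d) + t * A
Kt = (A - np.eye(d)) @ np.linalg.inv(Mt)   # analytic optimal map: u = K_t x_t

# Normal equations on finite data (row convention: U = X_t W, so W = K_t^T)
W = np.linalg.solve(xt.T @ xt, xt.T @ u)
```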

4) Sampling Dynamics and Why Curvature Matters

Generation solves the ODE

\[ \frac{dx_t}{dt} = v_\theta(x_t,t), \qquad x_{t=0}\sim p_0. \]

Euler integration with \(N\) steps uses

\[ x_{k+1} = x_k + \Delta t\, v_\theta(x_k,t_k), \qquad \Delta t = 1/N. \]
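For reference, the fixed-step Euler sampler is a few lines; `euler_sample` is an illustrative name, and `v` stands in for \(v_\theta\).

```python
import numpy as np

def euler_sample(v, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.copy(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v(x, k * dt)   # one Euler step at t_k = k * dt
    return x
```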

The second time derivative along the trajectory is

\[ \ddot{x}_t = \partial_t v_\theta(x_t,t) + J_{v_\theta}(x_t,t)\,v_\theta(x_t,t), \]

and this directly controls local truncation error:

\[ e_{\text{local}} = \frac{\Delta t^2}{2}\ddot{x}_t + \mathcal{O}(\Delta t^3). \]

The practical message is simple: for a fixed step budget, you want small effective curvature. Reflow/rectification methods can be interpreted as pushing the learned coupling toward self-consistency so that the solver follows straighter paths[1].
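The curvature argument can be seen directly on toy fields: a constant (zero-curvature) field makes Euler exact at any step count, while a field with \(J_v v \neq 0\) shows the expected first-order error decay. The helper name `euler` is illustrative.

```python
import numpy as np

def euler(v, x0, n):
    x, dt = x0, 1.0 / n
    for k in range(n):
        x = x + dt * v(x, k * dt)
    return x

# Zero curvature: constant v makes Euler exact regardless of step count
straight = euler(lambda x, t: 1.0, 0.0, 4)

# Curved trajectory: dx/dt = x has solution e^t; halving dt roughly halves the error
errs = [abs(euler(lambda x, t: x, 1.0, n) - np.e) for n in (16, 32)]
```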

[Figure: curved path vs. rectified path under coarse Euler; the curved path shows a large discretization gap, the rectified path a small one.]
Same compute budget, different geometry. The RF quality-speed tradeoff is often a curvature-control problem.

5) Reflow Objective and Practical Loss Variants

A common rectification loop is:

  1. Train base \(v_{\theta_0}\) with the standard RF objective.
  2. Sample \(x_0\sim p_0\), then integrate \(\dot{x}=v_{\theta_0}(x,t)\) to obtain \(\hat{x}_1=\Phi_{\theta_0}^{0\to1}(x_0)\).
  3. Use pairs \((x_0,\hat{x}_1)\) to retrain a new field \(v_{\theta_1}\).

This process is closely related to reflow/trajectory-straightening ideas in RF literature and to practical guidance adaptations in conditional flow sampling[1][4].
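Step 2 of the loop above, generating the deterministic coupling by integrating the base field, can be sketched as follows; `make_reflow_pairs` is a hypothetical name and `v_theta` stands in for the trained base model.

```python
import numpy as np

def make_reflow_pairs(v_theta, n_pairs, d, n_steps=64, seed=0):
    """Integrate the base field with Euler to build (x0, x1_hat) pairs."""
    x0 = np.random.default_rng(seed).normal(size=(n_pairs, d))
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v_theta(x, k * dt)   # transport from t=0 to t=1
    return x0, x                          # deterministic coupling for retraining
```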

The retrained objective is

\[ \mathcal{L}_{\mathrm{reflow}}(\theta) = \mathbb{E}_{x_0\sim p_0,\; t\sim\rho} \left[\left\|v_\theta\bigl((1-t)x_0+t\hat{x}_1,t\bigr) - (\hat{x}_1-x_0)\right\|^2\right]. \]

In addition, weighted losses are usually more stable than plain MSE in large latent spaces:

\[ \mathcal{L}_{\lambda}(\theta) = \mathbb{E}\left[\lambda(t)\,\|v_\theta(x_t,t)-u\|_2^2\right], \qquad \lambda(t)=\frac{1}{t(1-t)+\tau}. \]

The \(\lambda(t)\) factor up-weights boundary regions where errors are most visible. In practice, choose \(\tau\in[10^{-3},10^{-2}]\) to avoid exploding weights.

From a broader viewpoint, this sits inside the stochastic-interpolant family: change the interpolation law and you change both the conditional targets and the geometry of the learned field[3].

Parameterization identities (useful in implementation)

If \(v\) is predicted, then endpoint estimates follow directly:

\[ \hat{x}_1 = x_t + (1-t)\,v_\theta(x_t,t), \qquad \hat{x}_0 = x_t - t\,v_\theta(x_t,t). \]

These formulas are often used for auxiliary losses: e.g., reconstruction losses on \(\hat{x}_1\), or consistency regularizers between endpoint predictions at neighboring times.
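The identities are exact when \(v\) equals the true pairwise velocity \(x_1-x_0\), which makes them easy to unit-test:

```python
import numpy as np

rng = np.random.default_rng(2)
x0, x1 = rng.normal(size=4), rng.normal(size=4)
t = 0.37
xt = (1 - t) * x0 + t * x1

v = x1 - x0                  # exact velocity at (x_t, t) for this pairing
x1_hat = xt + (1 - t) * v    # forward endpoint estimate
x0_hat = xt - t * v          # backward endpoint estimate
```

With a learned \(v_\theta\) the estimates are approximate, which is exactly what auxiliary losses exploit.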

Minimal training snippet with weighting and stable target scaling

import torch

def rf_weighted_loss(model, x1, tau=1e-2):
    # Independent coupling: fresh Gaussian noise paired with each data sample
    x0 = torch.randn_like(x1)
    # Beta(0.9, 0.9) time sampling emphasizes boundaries without singular weights
    t = torch.distributions.Beta(0.9, 0.9).sample((x1.size(0),)).to(x1.device)
    t = t.view(-1, *([1] * (x1.dim() - 1)))  # broadcast over non-batch dims

    xt = (1.0 - t) * x0 + t * x1
    target_v = x1 - x0
    pred_v = model(xt, t)

    # lambda(t) = 1 / (t(1-t) + tau); tau caps the weight near t = 0 and t = 1
    w = 1.0 / (t * (1.0 - t) + tau)
    per_example = ((pred_v - target_v) ** 2).flatten(1).mean(1)
    return (w.flatten() * per_example).mean()

6) Practical Notes (Detailed Checklist)

  • Time sampling: Uniform \(t\sim U[0,1]\) is a clean baseline, but beta distributions such as Beta(0.9, 0.9) often improve endpoint behavior. If you observe noisy textures near the final steps, increase mass near \(t\approx1\).
  • Loss scale management: In latent diffusion backbones, \(\|x_1-x_0\|\) can vary by channel and resolution. Normalize targets per channel or use adaptive gradient clipping; otherwise optimization focuses on high-variance channels.
  • EMA is non-optional: Maintain EMA weights for sampling. For RF, EMA typically improves perceived smoothness and text alignment because it reduces high-frequency oscillation in \(v_\theta\).
  • Solver choices: Euler is useful for debugging, but midpoint/Heun usually gives better quality at fixed steps. If the model is trained for 8-16 step inference, verify quality with the exact solver used in deployment, not only with high-step validation.
  • Guidance scaling: In conditional models, high classifier-free guidance can bend trajectories and amplify curvature terms. A practical compromise is guidance annealing: lower scale early, higher scale near \(t\to1\).
  • Reflow scheduling: One reflow pass often gives a large gain; additional passes may have diminishing returns. Measure both FID-like scores and user-facing prompt adherence before paying extra retraining cost.
  • Batch construction: Randomly permuting \(x_1\) each step is cheap and surprisingly effective in unconditional settings. For conditional tasks, pair inside each condition bucket (same text class or style bin) to reduce target variance.
  • Matrix diagnostics: Track \(\|J_v\|_F\) proxies (finite differences) and trajectory curvature statistics during validation. If curvature grows while training loss decreases, expect low-step sampling regressions.
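The curvature diagnostic from the last bullet can be implemented with a single extra field evaluation per validation point. This is a sketch under the stated assumptions: `curvature_proxy` is a hypothetical name, and it approximates \(\ddot{x}_t = \partial_t v + J_v v\) by stepping both \(x\) and \(t\) along the flow direction.

```python
import numpy as np

def curvature_proxy(v, x, t, eps=1e-4):
    """Finite-difference estimate of ||x_ddot|| along the trajectory.

    x_ddot = d_t v + J_v v  ~=  (v(x + eps * v, t + eps) - v(x, t)) / eps.
    """
    v0 = v(x, t)
    v1 = v(x + eps * v0, t + eps)
    return np.linalg.norm(v1 - v0) / eps
```

For the autonomous linear field \(v(x,t)=x\), the proxy returns \(\|J_v v\| = \|x\|\), which makes it easy to sanity-check before wiring it into validation.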

References

  1. Liu et al., Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow, 2022.
  2. Lipman et al., Flow Matching for Generative Modeling, 2022.
  3. Albergo and Vanden-Eijnden, Stochastic Interpolants: A Unifying Framework for Flows and Diffusions, 2023.
  4. Saini et al., Rectified CFG++ for Flow Based Models, NeurIPS 2025.
  5. Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models, 2022.