Trajectory Lock-in
Greedy unmasking reveals "easy" tokens first (articles, function words) while delaying structurally critical pivots. Under masked diffusion's locked-in property, early wrong commits cannot be corrected — errors cascade globally.
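For concreteness, here is a minimal sketch of one confidence-greedy unmasking step of the kind critiqued here, assuming a Hugging-Face-style model whose output exposes per-position `logits`; `MASK_ID`, the per-step budget `k`, and the function name are illustrative, not the paper's code.

```python
import torch

MASK_ID = 126336  # illustrative [MASK] token id; use the model's actual mask id

@torch.no_grad()
def greedy_unmask_step(model, x, k=8):
    """Commit the k most confident masked positions; commits are never revisited."""
    probs = model(x).logits.softmax(dim=-1)          # (1, L, V)
    conf, pred = probs.max(dim=-1)                   # per-position confidence and argmax token
    masked = (x == MASK_ID)
    conf = conf.masked_fill(~masked, float("-inf"))  # only masked positions compete
    top = conf[0].topk(min(k, int(masked.sum())))
    x[0, top.indices] = pred[0, top.indices]         # "easy" tokens win; pivots are deferred
    return x
```

Because committed tokens are never re-masked in this sampler, a wrong commit at a structurally critical position persists for the rest of decoding, which is the failure mode described above.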
arXiv 2602.00250 · cs.LG · 2026
¹The University of Texas at Austin  ²Google
Masked Diffusion Models (MDMs) enable parallel, revisable decoding — but greedy confidence-based sampling leads to trajectory lock-in where early errors cascade. TABES introduces Backward-on-Entropy (BoE) Steering, a training-free framework that uses a single backward pass to select tokens that most reduce future masked uncertainty, yielding a superior quality–compute Pareto frontier.
TIS_i = −⟨g_i, Δe_i⟩ scores each masked token by how much revealing it reduces future masked entropy. Derived from a first-order Taylor expansion with bounded approximation error (Theorem 3.1).
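A hedged sketch of how such a score can be obtained with a single backward pass, under the assumption that g_i is the gradient of the total entropy of the still-masked positions with respect to the input embedding at position i, and Δe_i is the embedding shift from the mask token to the model's current prediction at i; the `inputs_embeds` / `mask_id` interface is a Hugging-Face-style assumption, not the paper's implementation.

```python
import torch

def tis_scores(model, embed, x, mask_id):
    """TIS_i = -<g_i, delta_e_i> for every masked position i, via one backward pass."""
    e = embed(x).detach().requires_grad_(True)              # (1, L, d) input embeddings
    probs = model(inputs_embeds=e).logits.softmax(dim=-1)   # forward through the MDM
    masked = (x == mask_id)
    ent = -(probs * probs.clamp_min(1e-9).log()).sum(-1)    # per-position predictive entropy
    (g,) = torch.autograd.grad(ent[masked].sum(), e)        # one backward pass -> all g_i
    pred = probs.argmax(-1)                                  # token each position would commit to
    delta_e = embed(pred) - embed(x)                         # delta_e_i = e(pred_i) - e(mask)
    tis = -(g * delta_e).sum(-1)                             # inner product per position
    return tis.masked_fill(~masked, float("-inf"))           # rank masked positions only
```

Positions with the largest score are those whose reveal is predicted, to first order, to most reduce the uncertainty of everything still masked; selecting them, rather than the raw highest-confidence tokens, is the steering signal.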
Sparse adjoint primitive that restricts backward computation to the active candidate set. Reduces backward complexity from O(L²d) to O(|A|·L·d) while keeping forward predictions exactly unchanged.
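One way to approximate the active-set restriction in ordinary autograd is sketched below: build the candidate set A from a cheap forward pass, then make only those embedding rows differentiable leaves, so gradients are materialized for |A| positions and the forward prediction is untouched. This mirrors the interface but does not reimplement the attention-level adjoint that yields the O(|A|·L·d) bound; `rho` stands in for the active fraction ρ, and the model/embedding calls follow the same assumptions as the sketch above.

```python
import math
import torch

def sparse_backward_tis(model, embed, x, mask_id, rho=0.25):
    """Score only an active candidate set A of masked positions (|A| = ceil(rho * #masked))."""
    with torch.no_grad():                                   # cheap pass to pick candidates
        probs0 = model(x).logits.softmax(-1)
    masked = (x == mask_id)
    conf = probs0.max(-1).values.masked_fill(~masked, float("-inf"))
    k = max(1, math.ceil(rho * int(masked.sum())))
    active = conf[0].topk(k).indices                        # candidate positions A

    e = embed(x).detach()
    e_act = e[0, active].clone().requires_grad_(True)       # only |A| rows are grad leaves
    e_in = e.clone()
    e_in[0, active] = e_act                                 # forward predictions unchanged
    probs = model(inputs_embeds=e_in).logits.softmax(-1)
    ent = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    (g_act,) = torch.autograd.grad(ent[masked].sum(), e_act)

    pred = probs.argmax(-1)[0, active]
    delta_e = embed(pred) - embed(x[0, active])             # (|A|, d)
    return active, -(g_act * delta_e).sum(-1)               # TIS for the active set only
```

Setting rho = 1.0 recovers scoring of every masked position; smaller values shrink the candidate set, which is the knob examined in the ablations below.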
BoE achieves a superior Pareto frontier for inference-time scaling on LLaDA-8B and LLaDA-1.5, consistently improving over confidence, margin, entropy, ReMDM, and LookUM baselines under compute-matched decoding.
The backward-on-entropy scoring (TIS) is the primary driver of accuracy gains. Removing it drops GSM8K from 73.9 to baseline levels.
Without the sparse backward, runtime increases by ~40% (2.71 vs. 1.94 hrs) with negligible accuracy change; restricting the backward to the active set is what makes BoE practical.
Active fraction ρ provides a smooth speed–accuracy knob: ρ = 0.25 recovers near-full accuracy (73.7 vs. 73.9) at minimal overhead.
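Purely illustrative arithmetic for how ρ sets the per-step candidate-set size (and hence the backward work); the accuracy and runtime figures quoted above are the paper's and are not reproduced by this snippet.

```python
import math

num_masked = 64  # hypothetical number of masked positions in a decoding block
for rho in (0.1, 0.25, 0.5, 1.0):
    k = max(1, math.ceil(rho * num_masked))
    print(f"rho={rho:>4}: score {k} of {num_masked} masked positions in the backward pass")
```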