arXiv 2602.00250  ·  cs.LG  ·  2026

TABES: Trajectory-Aware Backward-on-Entropy Steering for Masked Diffusion Models

Shreshth Saini¹, Avinab Saha², Balu Adsumilli², Neil Birkbeck², Yilin Wang², Alan C. Bovik¹

¹The University of Texas at Austin    ²Google

Masked Diffusion Models (MDMs) enable parallel, revisable decoding — but greedy confidence-based sampling leads to trajectory lock-in where early errors cascade. TABES introduces Backward-on-Entropy (BoE) Steering, a training-free framework that uses a single backward pass to select tokens that most reduce future masked uncertainty, yielding a superior quality–compute Pareto frontier.

📄 arXiv Paper · 💻 Code (Coming Soon)
Training-Free · Model-Agnostic · Gradient-Guided · Inference-Time Scaling

Method Overview

[Figure: BoE Steering Pipeline. Forward pass → candidate prefilter → surrogate backward → TIS scoring → unmask top-b tokens]
BoE Steering Pipeline. At each denoising step, BoE runs a standard denoiser forward pass, pre-filters candidates, constructs a relaxed surrogate state with soft embeddings, computes a single backward signal, and scores candidates by Token Importance Score (TIS) — the predicted reduction in future masked entropy.
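To make the pipeline concrete, here is a minimal sketch of one BoE decoding step in PyTorch. It is a reading of the caption above, not the paper's code: the `denoiser` is assumed to accept token ids or an `inputs_embeds` tensor and return per-position logits, `embed` is assumed to be its token-embedding table, and `prefilter_k`, `top_b`, and all helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def boe_step(denoiser, embed, x, mask_pos, prefilter_k=64, top_b=4):
    """One BoE steering step (illustrative sketch, not the paper's code).

    x        : (L,) token ids, with the [MASK] id at masked positions
    mask_pos : (M,) indices of still-masked positions
    """
    # 1. Standard denoiser forward pass: per-position distributions.
    with torch.no_grad():
        probs = F.softmax(denoiser(x.unsqueeze(0)).squeeze(0), dim=-1)  # (L, V)

    # 2. Pre-filter: keep the prefilter_k most confident masked positions.
    conf = probs[mask_pos].max(dim=-1).values
    cand = mask_pos[conf.topk(min(prefilter_k, len(mask_pos))).indices]

    # 3. Relaxed surrogate state: replace masked embeddings with the
    #    expected ("soft") embedding so the state is differentiable.
    e_soft = embed(x).detach().clone()
    e_soft[mask_pos] = probs[mask_pos] @ embed.weight.detach()
    e_soft.requires_grad_(True)

    # 4. Single backward pass: gradient of total masked-position entropy
    #    with respect to the surrogate embeddings.
    with torch.enable_grad():
        p = F.softmax(denoiser(inputs_embeds=e_soft.unsqueeze(0)).squeeze(0), dim=-1)
        H = -(p[mask_pos] * p[mask_pos].clamp_min(1e-9).log()).sum()
        (g,) = torch.autograd.grad(H, e_soft)                           # (L, d)

    # 5. TIS_i = -<g_i, delta_e_i>, where delta_e_i is the embedding change
    #    from committing position i to its argmax token.
    hard = embed(probs[cand].argmax(dim=-1)).detach()                   # (|A|, d)
    tis = -(g[cand] * (hard - e_soft[cand].detach())).sum(dim=-1)

    # 6. Unmask the top-b candidates by predicted entropy reduction.
    pick = cand[tis.topk(min(top_b, len(cand))).indices]
    x = x.clone()
    x[pick] = probs[pick].argmax(dim=-1)
    return x
```

Step 4 is the single backward pass; since step 5 only ever reads the candidate rows g[cand], that backward is exactly where a sparse primitive such as ActiveQueryAttention (below) pays off.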

Key Ideas

🔒 Trajectory Lock-in

Greedy unmasking reveals "easy" tokens first (articles, function words) while delaying structurally critical pivots. Under masked diffusion's locked-in property, early wrong commits cannot be corrected — errors cascade globally.

🎯 Token Importance Score

TIS_i = −⟨g_i, Δe_i⟩ scores each masked token by how much revealing it reduces future masked entropy. Derived from a first-order Taylor expansion with bounded approximation error (Theorem 3.1).
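The derivation behind the score is short. Writing H(e) for the total predictive entropy at the still-masked positions as a function of the input embeddings e (notation assumed here, not taken from the paper), a first-order expansion gives:

```latex
H(e + \Delta e_i) \;\approx\; H(e) + \langle \nabla_e H(e),\, \Delta e_i \rangle,
\qquad
\mathrm{TIS}_i \;:=\; H(e) - H(e + \Delta e_i) \;\approx\; -\langle g_i,\, \Delta e_i \rangle,
```

where g_i is the slice of ∇_e H(e) at position i and Δe_i is the embedding change from committing position i to a concrete token. One gradient prices every candidate at once, which is why a single backward pass suffices; on this reading, the bounded error of Theorem 3.1 would control the second-order Taylor remainder.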

ActiveQueryAttention

Sparse adjoint primitive that restricts backward computation to the active candidate set. Reduces backward complexity from O(L²d) to O(|A|·L·d) while keeping forward predictions exactly unchanged.
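One way such a primitive could be realized, sketched here under our own assumptions (single head, unbatched, eager PyTorch rather than a fused kernel; the function and argument names are hypothetical):

```python
import torch

def active_query_attention(q, k, v, active):
    """Attention whose backward path covers only the active query rows.

    q, k, v : (L, d) per-head projections
    active  : (|A|,) indices of candidate positions
    """
    scale = q.size(-1) ** -0.5

    # Dense forward with autograd disabled: no graph is recorded here,
    # so these rows contribute nothing to the backward pass.
    with torch.no_grad():
        out = torch.softmax((q @ k.T) * scale, dim=-1) @ v

    # Recompute just the |A| active query rows with autograd enabled;
    # these are the only attention ops on the backward path, costing
    # O(|A| * L * d) instead of O(L^2 * d).
    attn = torch.softmax((q[active] @ k.T) * scale, dim=-1)   # (|A|, L)
    out = out.clone()
    out[active] = attn @ v
    return out
```

The recomputed rows match the dense ones (up to kernel-level floating-point reduction order), which is how forward predictions stay unchanged; only the surrogate forward of step 4 above would route through this path, while ordinary decoding keeps the dense kernel.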

The Problem: Why Greedy Unmasking Fails

[Figure: greedy local-confidence unmasking vs. BoE steering]
Left: Greedy confidence-based schedules unmask easy tokens first, delaying load-bearing pivots. Early wrong commits become locked-in. Right: BoE prioritizes tokens with highest TIS — those that most reduce future uncertainty — yielding stable trajectories and faster entropy reduction.
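For contrast with the `boe_step` sketch above, the greedy baseline on the left amounts to replacing the TIS ranking with a purely local confidence ranking (same assumed names):

```python
# Greedy confidence baseline: commit wherever the model is locally
# surest right now, with no look-ahead over remaining masked positions.
conf = probs[mask_pos].max(dim=-1).values
pick = mask_pos[conf.topk(top_b).indices]   # boe_step ranks by TIS instead
```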

Results

BoE achieves a superior Pareto frontier for inference-time scaling on LLaDA-8B and LLaDA-1.5, consistently improving over confidence, margin, entropy, ReMDM, and LookUM baselines under compute-matched decoding.

[Figure: pass@1 bar chart comparing BoE against Confidence and LookUM baselines on MBPP, HumanEval, GSM8K, MATH500, and Sudoku]
LLaDA-8B pass@1 accuracy (L=256). BoE achieves best results on MBPP, GSM8K, and Sudoku while remaining competitive on HumanEval and MATH500. All methods are training-free.

What Matters? Ablation Highlights

🔬 Gradient signal is key

The backward-on-entropy scoring (TIS) is the primary driver of accuracy gains. Removing it drops GSM8K from 73.9 to baseline levels.

🏎️ ActiveQueryAttention essential

Without it, runtime increases ~40% (2.71 vs 1.94 hrs) with negligible accuracy change. Sparse backward makes BoE practical.

🎛️ ρ = 0.25 is a sweet spot

Active fraction ρ provides a smooth speed–accuracy knob. ρ = 0.25 recovers near-full accuracy (73.7 vs. 73.9) at minimal overhead.
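In the terms of the `boe_step` sketch above, ρ would simply set the candidate budget handed to the pre-filter (a hypothetical mapping, not the paper's interface):

```python
# Active fraction rho -> candidate set size |A|. The sparse backward is
# O(|A| * L * d), so its cost scales roughly linearly in rho.
rho = 0.25                                   # reported sweet spot
prefilter_k = max(1, int(rho * len(mask_pos)))
```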