arXiv 2602.00250  ·  cs.LG  ·  2026

TABES: Trajectory-Aware Backward-on-Entropy Steering for Masked Diffusion Models

Shreshth Saini¹, Avinab Saha², Balu Adsumilli², Neil Birkbeck², Yilin Wang², Alan C. Bovik¹

¹The University of Texas at Austin    ²Google

Masked Diffusion Models (MDMs) enable parallel, revisable decoding — but greedy confidence-based sampling leads to trajectory lock-in where early errors cascade. TABES introduces Backward-on-Entropy (BoE) Steering, a training-free framework that uses a single backward pass to select tokens that most reduce future masked uncertainty, yielding a superior quality–compute Pareto frontier.

📄 arXiv Paper · 💻 Code (Coming Soon)
Training-Free · Model-Agnostic · Gradient-Guided · Inference-Time Scaling

Method Overview

[Figure: BoE Steering Pipeline. Forward pass → candidate prefilter → surrogate backward → TIS scoring → unmask top-b tokens]
BoE Steering Pipeline. At each denoising step, BoE runs a standard denoiser forward pass, pre-filters candidates, constructs a relaxed surrogate state with soft embeddings, computes a single backward signal, and scores candidates by Token Importance Score (TIS) — the predicted reduction in future masked entropy.
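To make the pipeline concrete, here is a minimal sketch of one BoE decoding step in PyTorch. It is a reading of the caption above, not the paper's code: the `denoiser` is assumed to accept token ids or an `inputs_embeds` tensor and return per-position logits, `embed` is assumed to be its token-embedding table, and `prefilter_k`, `top_b`, and all helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def boe_step(denoiser, embed, x, mask_pos, prefilter_k=64, top_b=4):
    """One BoE steering step (illustrative sketch, not the paper's code).

    x        : (L,) token ids, with the [MASK] id at masked positions
    mask_pos : (M,) indices of still-masked positions
    """
    # 1. Standard denoiser forward pass: per-position distributions.
    with torch.no_grad():
        probs = F.softmax(denoiser(x.unsqueeze(0)).squeeze(0), dim=-1)  # (L, V)

    # 2. Pre-filter: keep the prefilter_k most confident masked positions.
    conf = probs[mask_pos].max(dim=-1).values
    cand = mask_pos[conf.topk(min(prefilter_k, len(mask_pos))).indices]

    # 3. Relaxed surrogate state: replace masked embeddings with the
    #    expected ("soft") embedding so the state is differentiable.
    e_soft = embed(x).detach().clone()
    e_soft[mask_pos] = probs[mask_pos] @ embed.weight.detach()
    e_soft.requires_grad_(True)

    # 4. Single backward pass: gradient of total masked-position entropy
    #    with respect to the surrogate embeddings.
    with torch.enable_grad():
        p = F.softmax(denoiser(inputs_embeds=e_soft.unsqueeze(0)).squeeze(0), dim=-1)
        H = -(p[mask_pos] * p[mask_pos].clamp_min(1e-9).log()).sum()
        (g,) = torch.autograd.grad(H, e_soft)                           # (L, d)

    # 5. TIS_i = -<g_i, delta_e_i>, where delta_e_i is the embedding change
    #    from committing position i to its argmax token.
    hard = embed(probs[cand].argmax(dim=-1)).detach()                   # (|A|, d)
    tis = -(g[cand] * (hard - e_soft[cand].detach())).sum(dim=-1)

    # 6. Unmask the top-b candidates by predicted entropy reduction.
    pick = cand[tis.topk(min(top_b, len(cand))).indices]
    x = x.clone()
    x[pick] = probs[pick].argmax(dim=-1)
    return x
```

Step 4 is the single backward pass; since step 5 only ever reads the candidate rows g[cand], that backward is exactly where a sparse primitive such as ActiveQueryAttention (below) pays off.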

Key Ideas

🔒 Trajectory Lock-in

Greedy unmasking reveals "easy" tokens first (articles, function words) while delaying structurally critical pivots. Under masked diffusion's locked-in property, early wrong commits cannot be corrected — errors cascade globally.

🎯 Token Importance Score

TIS_i = −⟨g_i, Δe_i⟩ scores each masked token by how much revealing it reduces future masked entropy. Derived from a first-order Taylor expansion with bounded approximation error (Theorem 3.1).
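The derivation behind the score is short. Writing H(e) for the total predictive entropy at the still-masked positions as a function of the input embeddings e (notation assumed here, not taken from the paper), a first-order expansion gives:

```latex
H(e + \Delta e_i) \;\approx\; H(e) + \langle \nabla_e H(e),\, \Delta e_i \rangle,
\qquad
\mathrm{TIS}_i \;:=\; H(e) - H(e + \Delta e_i) \;\approx\; -\langle g_i,\, \Delta e_i \rangle,
```

where g_i is the slice of ∇_e H(e) at position i and Δe_i is the embedding change from committing position i to a concrete token. One gradient prices every candidate at once, which is why a single backward pass suffices; on this reading, the bounded error of Theorem 3.1 would control the second-order Taylor remainder.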

ActiveQueryAttention

Sparse adjoint primitive that restricts backward computation to the active candidate set. Reduces backward complexity from O(L²d) to O(|A|·L·d) while keeping forward predictions exactly unchanged.
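One way such a primitive could be realized, sketched here under our own assumptions (single head, unbatched, eager PyTorch rather than a fused kernel; the function and argument names are hypothetical):

```python
import torch

def active_query_attention(q, k, v, active):
    """Attention whose backward path covers only the active query rows.

    q, k, v : (L, d) per-head projections
    active  : (|A|,) indices of candidate positions
    """
    scale = q.size(-1) ** -0.5

    # Dense forward with autograd disabled: no graph is recorded here,
    # so these rows contribute nothing to the backward pass.
    with torch.no_grad():
        out = torch.softmax((q @ k.T) * scale, dim=-1) @ v

    # Recompute just the |A| active query rows with autograd enabled;
    # these are the only attention ops on the backward path, costing
    # O(|A| * L * d) instead of O(L^2 * d).
    attn = torch.softmax((q[active] @ k.T) * scale, dim=-1)   # (|A|, L)
    out = out.clone()
    out[active] = attn @ v
    return out
```

The recomputed rows match the dense ones (up to kernel-level floating-point reduction order), which is how forward predictions stay unchanged; only the surrogate forward of step 4 above would route through this path, while ordinary decoding keeps the dense kernel.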

The Problem: Why Greedy Unmasking Fails

[Figure: greedy local-confidence unmasking vs. BoE steering]
Left: Greedy confidence-based schedules unmask easy tokens first, delaying load-bearing pivots. Early wrong commits become locked-in. Right: BoE prioritizes tokens with highest TIS — those that most reduce future uncertainty — yielding stable trajectories and faster entropy reduction.
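For contrast with the `boe_step` sketch above, the greedy baseline on the left amounts to replacing the TIS ranking with a purely local confidence ranking (same assumed names):

```python
# Greedy confidence baseline: commit wherever the model is locally
# surest right now, with no look-ahead over remaining masked positions.
conf = probs[mask_pos].max(dim=-1).values
pick = mask_pos[conf.topk(top_b).indices]   # boe_step ranks by TIS instead
```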

Results

BoE achieves a superior Pareto frontier for inference-time scaling on LLaDA-8B and LLaDA-1.5, consistently improving over confidence, margin, entropy, ReMDM, and LookUM baselines under compute-matched decoding.

[Figure: pass@1 bar chart comparing BoE against Confidence and LookUM baselines on MBPP, HumanEval, GSM8K, MATH500, and Sudoku]
LLaDA-8B pass@1 accuracy (L=256). BoE achieves best results on MBPP, GSM8K, and Sudoku while remaining competitive on HumanEval and MATH500. All methods are training-free.

What Matters? Ablation Highlights

🔬 Gradient signal is key

The backward-on-entropy scoring (TIS) is the primary driver of accuracy gains. Removing it drops GSM8K from 73.9 to baseline levels.

🏎️ ActiveQueryAttention essential

Without it, runtime increases ~40% (2.71 vs 1.94 hrs) with negligible accuracy change. Sparse backward makes BoE practical.

🎛️ ρ = 0.25 is a sweet spot

Active fraction ρ provides a smooth speed–accuracy knob. ρ = 0.25 recovers near-full accuracy (73.7 vs. 73.9) at minimal overhead.
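In the terms of the `boe_step` sketch above, ρ would simply set the candidate budget handed to the pre-filter (a hypothetical mapping, not the paper's interface):

```python
# Active fraction rho -> candidate set size |A|. The sparse backward is
# O(|A| * L * d), so its cost scales roughly linearly in rho.
rho = 0.25                                   # reported sweet spot
prefilter_k = max(1, int(rho * len(mask_pos)))
```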