CVPR 2026

HDR-Q: Seeing Beyond 8 Bits

Multimodal LLMs for HDR Video Quality Assessment
with HDR-Aware Policy Optimization

Shreshth Saini

Laboratory for Image & Video Engineering (LIVE) · The University of Texas at Austin

Advised by Prof. Alan C. Bovik

Can AI see and reason about
HDR video quality?

Today's answer: No. Every existing model was built for the 8-bit SDR world.

What starts here changes the world

HDR Is Not "Better SDR" — It's a Different Signal Space

Every modern phone captures 10-bit HDR by default.

YouTube, Instagram, TikTok — billions of daily HDR uploads. HDR10 supports 10-bit depth, BT.2020 wide color gamut, PQ perceptual quantizer. Peak luminance ≥1000 nits vs SDR ~100 nits.

New perceptual phenomena SDR models cannot see:

Highlight clipping · Near-black banding · Color blooming · PQ quantization artifacts · Exposure flicker · Wide-gamut chroma shifts
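These distortions live in the PQ (SMPTE ST 2084) signal domain rather than in gamma-coded sRGB. As a concrete reference, here is a minimal Python sketch of the standard PQ inverse EOTF (absolute luminance → normalized code value); the function names are illustrative:

```python
# SMPTE ST 2084 (PQ) constants
M1 = 2610 / 16384        # ≈ 0.1593
M2 = 2523 / 4096 * 128   # ≈ 78.84
C1 = 3424 / 4096         # ≈ 0.8359
C2 = 2413 / 4096 * 32    # ≈ 18.85
C3 = 2392 / 4096 * 32    # ≈ 18.69

def pq_encode(luminance_nits: float) -> float:
    """Map absolute luminance (0..10000 nits) to a normalized PQ signal in [0, 1]."""
    y = max(luminance_nits, 0.0) / 10000.0
    num = C1 + C2 * y ** M1
    den = 1.0 + C3 * y ** M1
    return (num / den) ** M2

def pq_to_10bit(luminance_nits: float) -> int:
    """Quantize the PQ signal to a 10-bit code value (0..1023)."""
    return round(pq_encode(luminance_nits) * 1023)
```

Under this curve, 100-nit SDR white already lands near code 520 of 1023: roughly the entire upper half of the 10-bit range encodes highlights that an 8-bit sRGB pipeline never represents.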

Core argument:

Standard vision encoders (SigLIP, CLIP) process images in 8-bit sRGB. They are structurally blind to HDR-specific distortions. Not a training gap — a representation gap.

HDR vs SDR

HDR preserves luminance and color detail that SDR collapses

HDR-Q (CVPR '26) · LIVE Lab, UT Austin

Where Current Methods Break

No existing method combines HDR-aware perception with quality reasoning

Method families compared on HDR input, perceptual grounding, reasoning, continuous MOS, and interpretability; SROCC on HDR content:

Method Family                  SROCC
Classical (BRISQUE, VMAF)      0.41
Deep VQA (FastVQA, DOVER)      ~0.51
HDR-Specific (HIDRO-VQA)       0.85
MLLM-VQA (Q-Insight, DeQA)     ~0.52
HDR-Q (Ours)                   0.92

The gap HDR-Q fills:

HDR-aware visual perception + continuous quality prediction + interpretable chain-of-thought reasoning. No prior method has all three.


Three Fundamental Obstacles

Each requires a dedicated solution — together they define the HDR-Q architecture

O1: SDR-pretrained vision encoders (blind to 10-bit PQ)

O2: Continuous MOS from token space (bridging autoregressive generation and regression)

O3: Modality neglect (GRPO ignores HDR tokens)
Evidence: GRPO with HDR input → 0.875 SROCC. SDR-only → 0.891. Adding HDR made it worse.


Beyond8Bits: The Foundation

First large-scale HDR-UGC quality dataset — the training ground for HDR-Q

44,276 HDR-UGC video clips
6,861 unique source videos
1.5M+ human quality ratings
~35 ratings per video

Format: 10-bit HEVC, PQ transfer, BT.2020 · 360p–1080p · 0.2–5 Mbps bitrate ladder

Sources: 2,253 crowdsourced (diverse UGC) + 4,608 Vimeo CC (nature, outdoor, low-light)

Quality control: HDR10 display verification · Qualification quiz · Golden set (SROCC 0.85) · SUREAL MOS aggregation · Inter-subject SROCC 0.90
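The golden-set and inter-subject checks above are rank-correlation thresholds. A self-contained Spearman (SROCC) implementation for such a check, shown here as a sketch (a real pipeline would typically call scipy.stats.spearmanr):

```python
def _ranks(values):
    """Average ranks (1-based); tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def srocc(x, y):
    """Spearman rank-order correlation = Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because only ranks matter, any monotone relation between a rater's scores and the golden MOS yields SROCC = 1, which is why it is the standard agreement metric here.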

Beyond8Bits

Dataset diversity, HDR vs SDR characteristics, and HDR-Q performance gains


HDR-Q: Architecture Overview


Solves O1: HDR-Aware Encoder

SigLIP-2 + dual-domain contrastive learning. Native 10-bit PQ. Dual HDR + SDR pathways.

Solves O3: HAPO Training

Contrastive KL + dual entropy + entropy-weighted advantage. Forces HDR modality attention.

Solves O2: Structured Output

Ovis2.5 (9B) + Rank-4 LoRA. <think> reasoning + <answer> MOS score. Gaussian reward σ=3.


HDR-Aware Vision Encoder

Contrastive HDR/SDR discrimination

L_ctr = max(0, δ + D(E(x_HDR), E(c)) − D(E(x_SDR), E(c)))

Pull HDR close to caption, push SDR away

Full encoder loss

L_enc = L_sigmoid + λ_ctr · L_ctr

Semantic alignment + HDR discrimination
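A minimal NumPy sketch of this triplet-style margin objective (pull the HDR embedding toward the caption embedding, push the SDR embedding away), assuming cosine distance for D and precomputed embeddings; all names and the margin value are illustrative:

```python
import numpy as np

def cosine_distance(a, b):
    """D(a, b) = 1 - cosine similarity."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

def contrastive_margin_loss(e_hdr, e_sdr, e_cap, delta=0.2):
    """Margin loss: zero once the HDR embedding is at least `delta` closer
    to the caption than the SDR embedding is; positive otherwise."""
    return max(0.0, delta + cosine_distance(e_hdr, e_cap)
                    - cosine_distance(e_sdr, e_cap))
```

Minimizing this loss drives the encoder to place HDR-specific evidence, exactly the content absent from the tone-mapped SDR copy, into the caption-aligned embedding space.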

Design justifications

Why SigLIP-2? Strong semantic priors, multimodal-compatible.
Why contrastive? HDR info = what's in HDR but absent in SDR.
Why 10-bit PQ input? Tone-mapping destroys the signal we need.
Captions by Qwen2.5-VL-72B for quality-aware descriptions.

Encoder training

SigLIP-2 contrastive finetuning with matched HDR-SDR pairs and quality-aware captions


The Modality Neglect Problem

The most surprising finding — and the core motivation for HAPO

HDR-Q (SDR input only)

0.8914

Standard GRPO (HDR+SDR input)

0.8753

Adding HDR made it worse.

Why this happens:

  • GRPO treats all tokens equally — no modality importance signal
  • Autoregressive generation enables text-context shortcuts
  • SDR features are familiar; HDR features are foreign → model ignores the foreign
  • Result: higher input dimensionality, lower information utilization

HAPO fixes this: 0.9206 SROCC

By explicitly rewarding different outputs for HDR vs SDR inputs.


HAPO — Core Mechanism: HDR-SDR Contrastive KL

If the model ignores HDR → identical outputs for HDR & SDR → D_KL ≈ 0. We maximize this divergence.

K_HDR(θ) = D_KL(π_θ^HDR ‖ π_θ^SDR)

Maximized with coefficient γ = 0.5 in the HAPO objective

Formal guarantee:   I_θ*(output; HDR | SDR) ≥ γ · K_HDR(θ*) − κ

Mutual information between output and HDR content is lower-bounded. The policy is mathematically guaranteed to use HDR information.
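The quantity being maximized can be sketched over a single next-token distribution (NumPy; the logits here are hypothetical stand-ins for the policy's outputs under HDR vs SDR conditioning):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def contrastive_kl(logits_hdr, logits_sdr):
    """K_HDR-style divergence D_KL(pi^HDR || pi^SDR) for one next-token position."""
    p = softmax(np.asarray(logits_hdr, dtype=float))
    q = softmax(np.asarray(logits_sdr, dtype=float))
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

When the policy ignores the HDR tokens, the two conditionings produce the same logits and the divergence is zero, so rewarding a large `contrastive_kl` directly penalizes the ignore-HDR shortcut.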


The HAPO Objective

J_HAPO(θ) = E[ Σ_t min(ρ_t · Ã_t, clip(ρ_t, 1−ε, 1+ε) · Ã_t) ] − β · D_KL(π_θ^HDR ‖ π_ref) + γ · K_HDR − H_dual

Ã: entropy-weighted advantage (HEW), λ_HEW = 0.3
−β · D_KL: reference KL for stability, β = 0.02
+γ · K_HDR: contrastive KL for HDR grounding, γ = 0.5
−H_dual: dual entropy against collapse, η₁ = 0.01, η₂ = 0.05
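A scalar NumPy sketch of how the four terms combine for one sampled completion; the per-token ratios, advantages, and the three scalar regularizer values are assumed precomputed, and the signature is illustrative:

```python
import numpy as np

def hapo_objective(ratios, advantages, kl_ref, k_hdr, h_dual,
                   eps=0.1, beta=0.02, gamma=0.5):
    """Sketch of the HAPO objective for one completion:
    clipped surrogate - beta * reference KL + gamma * contrastive KL - dual entropy.
    `ratios` are per-token importance ratios pi_theta / pi_old;
    `advantages` are the entropy-weighted (HEW) advantages."""
    ratios = np.asarray(ratios, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(ratios * adv, clipped * adv).sum()
    return float(surrogate - beta * kl_ref + gamma * k_hdr - h_dual)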
HDR-Q (CVPR '26)LIVE Lab, UT Austin
What starts here changes the world

Training Pipeline & Reward Design

Stage 1: Modality Alignment

Full HAPO with γ=0.5. Curated Beyond8Bits subset with matched HDR-SDR pairs. Goal: teach model to attend to HDR-specific information. Both stages use RL — no SFT stage.

Stage 2: Quality Calibration

Full Beyond8Bits corpus. Score reward as primary signal, reduced γ. Goal: prediction accuracy while preserving HDR grounding from Stage 1.

Why two stages?

Modality alignment and quality calibration are competing objectives. Aggressive contrastive training degrades absolute accuracy; pure quality training enables modality neglect. Sequential prioritization resolves the tension.

Composite Reward

R = wfmt·Rfmt + wsc·Rscore + wself·Rself

Rfmt: Binary — valid <think>/<answer> tags = 1
Rscore: Gaussian — exp(−(ŝ−s*)²/2σ²), σ=3. Smooth, differentiable.
Rself: Majority-vote consistency across K=8 completions.
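A Python sketch of the three reward components; the weights and the 1-point consistency bin are illustrative assumptions, while σ = 3, K = 8 completions, and the binary format reward come from the design above:

```python
import math
import re
from collections import Counter

def format_reward(completion: str) -> float:
    """R_fmt: 1 if the completion has well-formed <think>/<answer> blocks, else 0."""
    ok = re.search(r"<think>.*</think>\s*<answer>.*</answer>", completion, re.DOTALL)
    return 1.0 if ok else 0.0

def score_reward(predicted: float, target: float, sigma: float = 3.0) -> float:
    """R_score: Gaussian proximity reward exp(-(s_hat - s*)^2 / (2 sigma^2))."""
    return math.exp(-((predicted - target) ** 2) / (2 * sigma ** 2))

def self_consistency_reward(scores, this_score, tol: float = 1.0) -> float:
    """R_self: 1 if this completion's score falls in the majority bin
    across the K sampled completions (scores binned to width `tol`)."""
    bins = Counter(round(s / tol) for s in scores)
    majority_bin, _ = bins.most_common(1)[0]
    return 1.0 if round(this_score / tol) == majority_bin else 0.0

def composite_reward(completion, predicted, target, all_scores,
                     w_fmt=0.2, w_sc=0.6, w_self=0.2):
    """R = w_fmt * R_fmt + w_sc * R_score + w_self * R_self (weights illustrative)."""
    return (w_fmt * format_reward(completion)
            + w_sc * score_reward(predicted, target)
            + w_self * self_consistency_reward(all_scores, predicted))
```

The Gaussian shape matters: unlike an exact-match reward, it gives smooth, dense credit for near-misses, so policy gradients point toward the target MOS instead of vanishing.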

Config: Ovis2.5 (9B) + Rank-4 LoRA · K=8 · ε=0.1 · 4×H200 · BF16 · T=8 frames · 10-bit PQ native input

Main Results: Beyond8Bits Benchmark

Method                   SROCC↑   PLCC↑   RMSE↓
BRISQUE                  0.410    0.469   11.70
CONTRIQUE                0.625    0.605   15.02
CONVIQT                  0.799    0.810    8.48
HIDRO-VQA                0.851    0.878    6.09
Q-Insight (best MLLM)    0.517    0.562   20.78
HDR-Q (SDR only)         0.891    0.890    7.42
HDR-Q (Full)             0.921    0.912    5.16
Cross-dataset results:
  Beyond8Bits: 0.921
  LIVE-HDR: 0.908 (zero-shot)
  SFV+HDR: 0.725 (zero-shot)

Zero-shot generalization without fine-tuning on target datasets


Qualitative: HDR-Q vs Baseline Reasoning


HDR-Q identifies true HDR artifacts (highlight preservation, banding, color fidelity). Ovis2.5 baseline hallucinates non-existent issues.


Ablation: Every Component Is Justified

Variant                    SROCC   RMSE    Tok. H
GRPO baseline              0.81    10.73   0.20
+ HDR Encoder              0.83     8.96   0.24
HAPO w/o Contrastive KL    0.86     7.10   0.29
HAPO w/o HEW               0.88     6.11   0.27
HAPO w/o Dual Ent.         0.91     5.82   0.26
HDR-Q (Full)               0.92     5.15   0.33

1. Contrastive KL — CRITICAL

0.92 → 0.86 without. Largest single drop. Prevents modality neglect.

2. HEW — Token-level credit assignment

0.92 → 0.88. Directs gradient to quality-relevant tokens.

3. HDR Encoder — Foundation

0.83 → 0.81. Essential for 10-bit PQ representation.

4. Dual Entropy — Stability

0.92 → 0.91. Prevents collapse, maintains exploration.

Token entropy: 0.20 → 0.33. Model becomes more uncertain at quality-critical decisions, not less. This is healthy.


Training Dynamics: HAPO vs GRPO

Entropy

Token entropy during training — GRPO collapses, HAPO maintains healthy exploration

GRPO failure mode

Entropy drops → deterministic → ignores visual modality → text-context shortcuts

HAPO stabilization

Dual entropy maintains H ≈ 0.33 at quality-critical positions. Healthy exploration preserved.

Reasoning efficiency

CoT: 168 → 137 tokens (−18%). More concise, more focused. Boilerplate removed, quality observations retained.
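The entropy being tracked here is ordinary Shannon entropy of the policy's next-token distribution; a NumPy sketch of the per-position diagnostic (function names illustrative):

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (nats) of the next-token distribution at one position."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                       # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(np.clip(p, 1e-12, None))).sum())

def mean_entropy(logit_rows):
    """Average token entropy across a sequence of positions."""
    return sum(token_entropy(r) for r in logit_rows) / len(logit_rows)
```

Averaged over quality-critical positions, this is the H ≈ 0.33 statistic above: a collapsing GRPO policy drives it toward 0, while the dual-entropy term holds it at a healthy level.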


What We Learn & Remaining Challenges

Key insights

  • Modality neglect is the bottleneck, not encoder capacity. The contrastive KL mechanism is necessary and sufficient for HDR grounding.
  • Token-level entropy is a diagnostic signal. Increasing entropy at decision points = better visual grounding. Collapsing entropy = text shortcuts.
  • Two-stage RL outperforms SFT→RL. Both stages use RL, ensuring HDR grounding from initialization.

Remaining challenges

  • Temporal reasoning is limited to T=8 frame sampling. Long-range temporal quality patterns may be missed.
  • Extreme out-of-distribution content (e.g., synthetic HDR, gaming HDR) not well represented in training.
  • Inference cost — MLLM inference is slower than lightweight VQA models. Not yet real-time.

Key Takeaways

1

First MLLM for HDR video quality. Prior MLLMs: 0.52 SROCC. HDR-Q: 0.92.

2

HAPO: principled solution to modality neglect. Three mechanisms, formal MI guarantee, all ablated.

3

Strong zero-shot generalization. 0.908 LIVE-HDR, 0.725 SFV+HDR — no fine-tuning.

4

Perception-grounded reasoning. CoT references HDR-specific phenomena baseline models cannot articulate.

Next: HDR-Q as reward model for HDR generation & restoration → closing the perception-generation loop.
