Shreshth Saini

I am a third-year PhD student at the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin, advised by Prof. Alan C. Bovik. My research focuses on the theoretical foundations of generative models (e.g., flows, diffusion models, and MLLMs) and their applications in efficient sampling, image/video quality assessment (QA), editing, and inverse problems (e.g., inverse tone mapping). During my PhD, I have been collaborating with the YouTube/Google Media Algorithms team. I will be joining Google Research as a student researcher on the LUMA team starting June 2025.

Before starting my PhD at UT Austin, I worked as a Research Engineer (AI) at ARKRAY, Inc. and as a Machine Learning Engineer at BioMind AI. At both companies, I developed novel and scalable AI solutions for medical image analysis.

I received my Bachelor's degree in Electrical Engineering from IIT Jodhpur. I was fortunate to be advised by Prof. Mengling Feng (NUS), Prof. Aditya Nigam (IIT Mandi), and Prof. Anil K. Tiwari (IIT Jodhpur) throughout my undergraduate research.

Contact  /  Google Scholar  /  LinkedIn  /  X  /  GitHub  /  CV (Dated 😞)

Updates
  • Jun 2025: Excited to join Google Research as Student Researcher in the LUMA team!
  • Mar 2025: Honored to be appointed as Assistant Director of LIVE at UT Austin!
  • Jun 2024: Joined Amazon as Applied Scientist-II Intern in the Perception team!
  • Jan 2024: Joined Alibaba US as Research Intern to work on diffusion models!
  • Nov 2023: Paper accepted at the 3rd Workshop on Image/Video/Audio Quality in Computer Vision and Generative AI, WACV 2024!
  • Jun 2023: Completed first-ever large-scale HDR subjective perceptual quality study on Amazon Mechanical Turk!
  • Aug 2022: Joined LIVE at UT Austin as PhD student under Prof. Alan C. Bovik!
  • Aug 2022: Awarded prestigious Engineering Graduate Fellowship until 2027!
  • Feb 2022: Joined BioMind, Singapore as Research Engineer/Machine Learning Engineer.
  • Aug 2020: Started as Research Engineer (AI) at Arkray, Inc.
  • May 2019: Research Assistant at National University of Singapore (NUS) under Prof. Mengling "Mornin" Feng.
  • Aug 2018: Undergraduate Researcher at Image Processing and Computer Vision Lab, IIT Jodhpur under Prof. Anil Kumar Tiwari.
  • May 2018: Research Intern at The Multimedia Analytics, Networks and Systems Lab, IIT Mandi under Prof. Aditya Nigam.

Research & Development Experience
Services

Reviewer: ICLR (2025), IEEE Trans. on Multimedia (2024), ICML (2025), CVPR (2025).
Assistant Director: LIVE at UT Austin (2025-Present).

Research Publications

(Recent - Generative AI / IQA / VQA)
Rectified CFG++ for Flow-Based Models
S Saini, S Gupta, AC Bovik.
In Preparation - Target: NeurIPS 2025
PDF / ArXiv

Rectified CFG++ enhances conditional image generation with Rectified Flow models by adaptively correcting the latent trajectory. This method improves visual coherence and alignment with text prompts, outperforming existing samplers in generation quality and efficiency.

LGDM: Latent Guidance in Diffusion Models for Perceptual Evaluations
S Saini, R Liao, Y Ye, AC Bovik.
Under Review - ICML 2025
PDF / ArXiv

This study investigates how the priors learned by diffusion models can be exploited for perceptually consistent image quality assessment. Guiding evaluation with these latent priors aligns quality predictions more closely with human perception, yielding more accurate and reliable assessments.

Reasoning Through Perceptual Quality for UGC-HDR Videos
S Saini, N Birkbeck, Y Wang, B Adsumilli, AC Bovik.
Under Review
PDF / ArXiv / Page / Code

In this work, we introduce a 40K-video UGC-HDR subjective video quality database and use chain-of-thought (CoT) prompting in MLLMs for zero-shot perceptual video quality assessment. To our knowledge, this is the first large-scale subjective database for UGC-HDR videos; it will support the development of objective metrics that accurately predict subjective quality scores.

BrightRate: Quality Assessment for User-Generated HDR Videos
S Saini, B Chen, N Birkbeck, Y Wang, B Adsumilli, AC Bovik
Under Review: ICCV 2025
PDF / ArXiv / Page / Code

BrightRate is designed for quality assessment in user-generated HDR videos, focusing on unique challenges like varying content and capture conditions. It offers a reliable way to evaluate and enhance the viewing experience of HDR content.

CHUG: Crowdsourced User-Generated HDR Video Quality Dataset
S Saini, N Birkbeck, Y Wang, B Adsumilli, AC Bovik
ICIP 2025
PDF / ArXiv / Page / Code

CHUG is a crowdsourced dataset for HDR video quality, addressing the need for diverse, real-world content. It aids in developing more accurate and robust quality assessment models.

Contrastive HDR-VQA: Deep Contrastive Representation Learning for High Dynamic Range Video Quality Assessment
S Saini, A Saha, AC Bovik.
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, Waikoloa, Hawaii
ArXiv / Data / Code

Contrastive HDR-VQA introduces a deep contrastive representation learning approach for high dynamic range video quality assessment. By learning robust representations through contrastive learning, the method achieves state-of-the-art performance in predicting the quality of HDR videos.

ITM-DM: Using Diffusion Models for UGC-Video Inverse Tone Mapping
S Saini, N Birkbeck, Y Wang, B Adsumilli, AC Bovik.
Ongoing Work - YouTube

This research explores the application of diffusion models for inverse tone mapping in user-generated content (UGC) videos. The ITM-DM approach leverages diffusion models to enhance the visual quality of UGC videos by effectively performing inverse tone mapping, thereby improving the viewing experience.

Prime-EditBench: A Real World Benchmark for Image and Video Editing Task using Diffusion Models
S Saini, P Korus, S Jin, AC Bovik.
Preprint: Amazon - Internal

Prime-EditBench is introduced as a real-world benchmark designed to evaluate the performance of image and video editing tasks using diffusion models. This benchmark provides a standardized platform for assessing the capabilities of these models in practical editing scenarios, facilitating advancements in the field.

Projects
Problems and Solutions on General Inverse Problems (CSE 393P)
GitHub Repository

This repository contains problems and solutions related to general inverse problems, as part of the CSE 393P course. It includes implementations and analyses of various inverse problem-solving techniques.

An Efficient Approach to Super-Resolution with Fine-Tuning Diffusion Models
Shreshth Saini, Yu-Chih Chen, Krishna Srikar Durbha
GitHub / PDF

Implementation of an efficient SR3 diffusion model for super-resolution. This project explores the potential of pre-trained diffusion models to enhance generalization ability and reduce computation costs in image super-resolution tasks.

Zero-DA: Zero-shot Diffusion Model for Video Animation
Shreshth Saini, Krishna Srikar Durbha
GitHub / PDF

Zero-shot Diffusion Model for Video Animation (Zero-DA) adapts image generation models to video production. This framework tackles the challenge of maintaining temporal uniformity across video frames using hierarchical cross-frame constraints.

UBR: Unbiased and Robust Recidivism Prediction
Shreshth Saini, Albert Joe, Jiachen Wang, SayedMorteza Malaekeh
GitHub / PDF

This project aims to mitigate the inherent bias in recidivism score predictions by leveraging machine learning techniques to rectify and minimize biases towards gender and racial/ethnic groups.

Optical-Flow-Free Video Frame Interpolation
Shreshth Saini, Krishna Srikar Durbha
GitHub / PDF

This project proposes the use of transformers to learn long-range interactions with mutual self-attention between frames as a surrogate for motion estimation in video frame interpolation.

Research Publications (old)

(Medical AI)
M2SLAe-Net: Multi-Scale Multi-Level Attention Embedded Network for Retinal Vessel Segmentation
S Saini, G Agrawal.
IEEE International Symposium on Biomedical Imaging (IEEE ISBI), 2021, Nice (Acropolis), France
PDF / ArXiv / Page / Code

M2SLAe-Net introduces a multi-scale multi-level attention embedded network for improved retinal vessel segmentation. By integrating attention mechanisms at multiple scales and levels, the network achieves enhanced accuracy and robustness in segmenting retinal vessels, aiding in the diagnosis of various eye diseases.

(M)SLAe-Net: Multi-Scale Multi-Level Attention Embedded Network for Retinal Vessel Segmentation
S. Saini, G. Agrawal.
9th IEEE International Conference on Healthcare Informatics (IEEE ICHI), 2021 (Oral Presentation), Victoria, British Columbia, Canada
PDF / ArXiv

This paper presents (M)SLAe-Net, a multi-scale multi-level attention embedded network designed for precise retinal vessel segmentation. The network's architecture allows it to capture intricate details of retinal vessels, making it a valuable tool for early detection and diagnosis of retinal diseases.

B-SegNet: Branched-SegMentor Network for Skin Lesion Segmentation
S Saini, YS Jeon, M Feng.
Association for Computing Machinery Conference on Health, Inference, and Learning (ACM CHIL), 2021 (full Oral Presentation)
PDF

B-SegNet introduces a branched SegMentor network for accurate skin lesion segmentation. By employing a branched architecture, the network effectively captures both local and global features of skin lesions, leading to improved segmentation performance and aiding in the diagnosis of skin cancer.

Detector-SegMentor Network for Skin Lesion Localization and Segmentation
S Saini, D Gupta, AK Tiwari.
National Conference on Computer Vision, Pattern Recognition, Image Processing, & Graphics (NCVPRIPG), 2019 (Oral Presentation), sister conference of ICVGIP
ArXiv

This paper presents a detector and SegMentor network for simultaneous skin lesion localization and segmentation. The network combines detection and segmentation tasks to provide a comprehensive solution for skin lesion analysis, enabling accurate localization and precise segmentation of lesions for improved diagnostic accuracy.

Journals
PixISegNet: pixel-level iris segmentation network using convolutional encoder–decoder with stacked hourglass bottleneck
RR Jha, G Jaswal, S Saini, D Gupta, A Nigam.
IET Biometrics (The Institution of Engineering and Technology), 2019
PDF

PixISegNet introduces a pixel-level iris segmentation network that utilizes a convolutional encoder-decoder architecture with a stacked hourglass bottleneck. This network achieves precise iris segmentation by effectively capturing both local and global features, making it suitable for various biometric applications.

Book Chapters
Iris Segmentation in the Wild using Encoder-Decoder based Deep Learning Techniques
S Saini, D Gupta, RR Jha, G Jaswal, A Nigam.
AI and Deep Learning in Biometric Security: Trends, Potential, and Challenges, CRC Press (Taylor & Francis Group), 2020
PDF

This book chapter explores the use of encoder-decoder based deep learning techniques for iris segmentation in unconstrained environments. The proposed methods effectively handle challenges such as variations in lighting, occlusion, and off-angle images, making them suitable for real-world biometric applications.