I am a third-year PhD student in the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin, advised by Prof. Alan C. Bovik. My research focuses on the theoretical foundations of generative models (e.g., flows, diffusion models, and MLLMs) and their applications in efficient sampling, image/video quality assessment (QA), editing, and inverse problems (e.g., inverse tone mapping, ITM).
During my PhD, I collaborate with the YouTube/Google Media Algorithms team. I will be joining Google Research as a student researcher on the LUMA team starting June 2025.
Before starting my PhD at UT Austin, I worked as a Research Engineer (AI) at Arkray, Inc. and as a Machine Learning Engineer at BioMind AI. In both roles, I developed novel, scalable AI solutions for medical image analysis.
Aug 2018 - Aug 2020: Undergraduate Researcher, Image Processing and Computer Vision Lab, IIT Jodhpur
May 2018 - Aug 2018: Research Intern, The Multimedia Analytics, Networks and Systems Lab, IIT Mandi
Applied Scientist Intern | Amazon (Perception Team), Seattle, Washington | June 2024 – August 2024
Worked with the Perception team on large-scale synthetic data generation
Developed a novel editing benchmark and a T2I-based diffusion model for consistent image/video editing and generation
Planning to conduct an image and video editing challenge and workshop
Research Intern | Alibaba Group, Sunnyvale, California | January 2024 – May 2024
Developed generalizable and robust vision-model-based video quality assessment (VQA) methods
Used diffusion model priors as perceptual-consistency signals for IQA (paper under review)
Co-Founder | Short-X, Austin, Texas | January 2023 – January 2024
Short-X aims to automate the arduous task of producing short-form content from traditional long-form content
Built Short-X's core AI models and pipelines, covering transcription, extraction of semantically meaningful and unique highlights, pause removal, speaker identification, and smart vertical cropping
Graduate Research Assistant | Laboratory for Image and Video Engineering, UT Austin, Austin, Texas | August 2022 – Present
Developing scalable vision models for HDR videos, for tasks such as ITM/TM, gamut expansion, and quality assessment
Created the largest HDR-SDR dataset of short-form videos (publicly available)
Developing video quality assessment methods for HDR videos that use non-linear expansion of the extremes of local luminance levels
Machine Learning Engineer | BioMind (Products), Singapore | February 2022 – June 2022
Developed SOTA multimodal DL models for segmentation and classification of 25+ tumor/non-tumor classes
Used TFRecords for memory-intensive 4D datasets and proposed a multi-task model for tumor prediction
Research Engineer – AI | Arkray, Inc., Kyoto, Japan (Remote) | August 2020 – December 2021
Proposed semi-supervised DL models that learn from large volumes of private, unlabelled, and noisy 2D data
Deployed models in products: the UrineSediment Analyzer and the automated BodyFluid Analyzer (Aution EYE)
Research Assistant | National University of Singapore, Singapore | May 2019 – July 2019 | Supervisor: Dr. Mengling 'Mornin' Feng
Developed a novel deep learning architecture for large-scale public health datasets
Published SOTA results at low cost for skin lesion analysis
Undergraduate Researcher | Image Processing and Computer Vision Lab, IIT Jodhpur, Jodhpur, India | August 2018 – August 2020 | Supervisor: Dr. Anil Kumar Tiwari
Worked on ML methods for AI-based diagnosis and treatment support
Developed DL models for retinal vessel and skin lesion segmentation, and for diagnosis of the left atrium in 3D GE-MRIs
Research Intern | The Multimedia Analytics, Networks and Systems Lab, IIT Mandi, Mandi, India | May 2018 – July 2018 | Supervisor: Dr. Aditya Nigam
Developed a novel CNN model for iris segmentation that uses cascaded hourglass modules at the bottleneck of an encoder-decoder design
Services
Reviewer: ICLR (2025), ICML (2025), CVPR (2025), and IEEE Transactions on Multimedia (2024).
Assistant Director: LIVE at UT Austin (2025-Present).
Volunteer at Internal Workshop on Deep Learning (IWDL), India (2018).
Established and ran the LAMBDA Lab at IITJ (2018-2020).
Overall Head of Entrepreneurship and Innovation Cell at IITJ (2018-2019).
Assistant Head of Counselling Services at IITJ (2018-2019).
Rectified CFG++ enhances conditional image generation with Rectified Flow models by adaptively correcting the latent trajectory. This method improves visual coherence and alignment with text prompts, outperforming existing samplers in generation quality and efficiency.
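To illustrate the setting Rectified CFG++ operates in, here is a minimal sketch of classifier-free guidance inside an Euler sampler for a rectified flow. The velocity network, the guidance scale, and all function names are hypothetical stand-ins; the adaptive trajectory correction that distinguishes Rectified CFG++ from plain CFG is deliberately omitted.

```python
import numpy as np

def toy_velocity(x, t, cond):
    """Stand-in for a learned rectified-flow velocity network v(x, t | cond).

    Here the 'model' simply drifts x toward the conditioning vector
    (or toward zero when cond is None, the unconditional branch).
    """
    target = cond if cond is not None else np.zeros_like(x)
    return target - x

def cfg_rectified_flow_sample(x0, cond, steps=100, guidance=2.0):
    """Euler integration of a rectified flow with classifier-free guidance.

    At each step the conditional and unconditional velocities are mixed:
        v = v_uncond + guidance * (v_cond - v_uncond)
    Plain CFG is shown; a corrected sampler would additionally adjust
    the latent trajectory at each step.
    """
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_c = toy_velocity(x, t, cond)   # conditional velocity
        v_u = toy_velocity(x, t, None)   # unconditional velocity
        x = x + dt * (v_u + guidance * (v_c - v_u))
    return x
```

With guidance > 1, the sample is pushed past the purely conditional trajectory, which is what over-saturation artifacts in CFG stem from and what corrected samplers aim to tame.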
This study investigates the exploitation of diffusion model priors to achieve perceptual consistency in image quality assessment. By leveraging the inherent priors learned by diffusion models, the assessment of image quality is made more aligned with human perception, leading to more accurate and reliable evaluations.
In this work, we propose a 40K-video UGC-HDR subjective video quality database and use chain-of-thought (CoT) prompting in MLLMs for zero-shot perceptual video quality assessment. This is the first and only large-scale subjective database for UGC-HDR videos; it will help in developing objective metrics that accurately predict subjective quality scores.
BrightRate is designed for quality assessment in user-generated HDR videos, focusing on unique challenges like varying content and capture conditions. It offers a reliable way to evaluate and enhance the viewing experience of HDR content.
CHUG is a crowdsourced dataset for HDR video quality, addressing the need for diverse, real-world content. It aids in developing more accurate and robust quality assessment models.
Contrastive HDR-VQA introduces a deep contrastive representation learning approach for high dynamic range video quality assessment. By learning robust representations through contrastive learning, the method achieves state-of-the-art performance in predicting the quality of HDR videos.
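As a rough illustration of the contrastive-representation idea, here is a minimal NumPy sketch of an InfoNCE-style loss between two batches of embeddings. The embedding dimensions, temperature, and function name are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE (NT-Xent-style) loss between two batches of embeddings.

    z_a[i] and z_b[i] are embeddings of two views of the same clip
    (positives); all other pairs in the batch serve as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positive pairs lie on the diagonal
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls matched views together and pushes mismatched clips apart, which is the mechanism by which contrastive pretraining yields quality-aware representations.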
This research explores the application of diffusion models for inverse tone mapping in user-generated content (UGC) videos. The ITM-DM approach leverages diffusion models to enhance the visual quality of UGC videos by effectively performing inverse tone mapping, thereby improving the viewing experience.
Prime-EditBench is introduced as a real-world benchmark designed to evaluate the performance of image and video editing tasks using diffusion models. This benchmark provides a standardized platform for assessing the capabilities of these models in practical editing scenarios, facilitating advancements in the field.
This repository contains problems and solutions related to general inverse problems, as part of the CSE 393P course. It includes implementations and analyses of various inverse problem-solving techniques.
Implementation of an efficient SR3 diffusion model for super-resolution. This project explores the potential of pre-trained diffusion models to improve generalization and reduce computational cost in image super-resolution tasks.
Zero-shot Diffusion Model for Video Animation (Zero-DA) adapts image generation models to video production. This framework tackles the challenge of maintaining temporal uniformity across video frames using hierarchical cross-frame constraints.
This project aims to mitigate the inherent bias in recidivism score predictions by leveraging machine learning techniques to rectify and minimize biases towards gender and racial/ethnic groups.
This project proposes the use of transformers to learn long-range interactions with mutual self-attention between frames as a surrogate for motion estimation in video frame interpolation.
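The attention-as-motion idea above can be sketched as cross-frame scaled dot-product attention over patch embeddings: each patch of one frame softly matches patches of the other, standing in for an explicit flow field. Shapes, the bidirectional averaging, and all names here are simplifying assumptions, not the project's actual architecture.

```python
import numpy as np

def cross_frame_attention(q_frame, k_frame):
    """Scaled dot-product attention from patches of one frame (queries)
    to patches of another (keys/values): a soft correspondence map that
    can act as a surrogate for motion estimation.

    q_frame, k_frame: (num_patches, dim) patch embeddings.
    Returns q_frame's patches reconstructed from k_frame's patches.
    """
    d = q_frame.shape[1]
    scores = q_frame @ k_frame.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # each row: soft matches
    return attn @ k_frame

def interpolate_midframe(frame0, frame1):
    """Toy middle-frame estimate: average the two attention-warped
    counterparts, mimicking bidirectional warping without optical flow."""
    return 0.5 * (cross_frame_attention(frame0, frame1)
                  + cross_frame_attention(frame1, frame0))
```

Because the attention rows are full softmax distributions rather than one-hot matches, occlusions and large displacements are handled softly, which is the appeal of attention over classical flow in this setting.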
M2SLAe-Net introduces a multi-scale multi-level attention embedded network for improved retinal vessel segmentation. By integrating attention mechanisms at multiple scales and levels, the network achieves enhanced accuracy and robustness in segmenting retinal vessels, aiding in the diagnosis of various eye diseases.
This paper presents (M)SLAe-Net, a multi-scale multi-level attention embedded network designed for precise retinal vessel segmentation. The network's architecture allows it to capture intricate details of retinal vessels, making it a valuable tool for early detection and diagnosis of retinal diseases.
B-SegNet introduces a branched SegMentor network for accurate skin lesion segmentation. By employing a branched architecture, the network effectively captures both local and global features of skin lesions, leading to improved segmentation performance and aiding in the diagnosis of skin cancer.
This paper presents a detector and SegMentor network for simultaneous skin lesion localization and segmentation. The network combines detection and segmentation tasks to provide a comprehensive solution for skin lesion analysis, enabling accurate localization and precise segmentation of lesions for improved diagnostic accuracy.
PixISegNet introduces a pixel-level iris segmentation network that utilizes a convolutional encoder-decoder architecture with a stacked hourglass bottleneck. This network achieves precise iris segmentation by effectively capturing both local and global features, making it suitable for various biometric applications.
This book chapter explores the use of encoder-decoder based deep learning techniques for iris segmentation in unconstrained environments. The proposed methods effectively handle challenges such as variations in lighting, occlusion, and off-angle images, making them suitable for real-world biometric applications.