Yuanhan (John) Zhang

Hi! I'm Yuanhan Zhang (here is the standard Chinese pronunciation of my given name). I am a third-year PhD student at MMLab@NTU, supervised by Prof. Ziwei Liu.

My research interests lie in computer vision and deep learning. I focus on bringing foundation models, from vision to multimodal, into real-world use: benchmarking their performance and adapting them via parameter-efficient tuning, in-context learning, and instruction tuning.

Email (yuanhan002@e.ntu.edu.sg)  /  Google Scholar  /  Twitter  /  GitHub

Portrait of Yuanhan Zhang
News
  • [2025-07] We release the Video Thinking Test (📽️ Video-TT 📽️), a holistic benchmark that assesses the correctness and robustness of advanced video reasoning and understanding in MLLMs, compared with humans.
  • Older News & Activities
Publications
Video-TT teaser
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding
Yuanhan Zhang*, Yunice Chew*, Yuhao Dong, Aria Leo, Bo Hu, Ziwei Liu
ICCV, 2025
PDF / Dataset and Code

A holistic benchmark that assesses the correctness and robustness of advanced video reasoning and understanding in MLLMs, compared with humans.

LLaVA-Video teaser
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
TMLR, 2025
PDF / Dataset, Model and Code

Fully open-sourced video LMM (code, model, and data) with competitive performance.

LLaVA-OneVision teaser
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
TMLR, 2025
PDF / Dataset and Code

A family of LMMs consolidating insights into data, models, and visual representations.

LLaVA-NeXT-Interleave teaser
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li*, Renrui Zhang*, Hao Zhang*, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma
ICLR, 2025 (Spotlight)
PDF / Dataset and Code

Tackling multi-image, video, and 3D in large multimodal models.

MMBench logo
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu*, Haodong Duan*, Yuanhan Zhang*, Bo Li*, Songyang Zhang*, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin
ECCV, 2024 (Oral)
PDF / Dataset and Code

Benchmarking 20 abilities of vision-language models.

Octopus logo
Octopus: Embodied Vision-Language Programmer from Environmental Feedback
Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
ECCV, 2024
PDF / Dataset and Code

An embodied VLM trained with reinforcement learning from environmental feedback (RLEF), strong at embodied visual planning and programming.

FunQA logo
FunQA: Towards Surprising Video Comprehension
Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, Ziwei Liu
ECCV, 2024
PDF / Dataset and Code

A benchmark of humorous, creative, and magic videos for challenging video understanding.

Otter brand title
Otter: A Multi-modal Model with In-context Instruction Tuning
Bo Li*, Yuanhan Zhang*, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu
TPAMI
PDF / Dataset and Code

A vision-language model with in-context instruction tuning.

Visual Prompt Retrieval teaser
What Makes Good Examples for Visual In-Context Learning?
Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
NeurIPS, 2023
PDF / Code

Retrieving prompts for visual in-context learning.

Learning without Forgetting teaser
Learning without Forgetting for Vision-Language Models
Da-Wei Zhou, Yuanhan Zhang, Yan Wang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu
TPAMI
PDF / Code

Learning without forgetting for vision-language models.

NOAH teaser
Neural Prompt Search
Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
TPAMI
PDF / Project Page / Code

Searching prompt modules for parameter-efficient transfer learning.

3D point cloud KD teaser
3D Point Cloud Pre-training with Knowledge Distillation from 2D Images
Yuan Yao, Yuanhan Zhang, Zhenfei Yin, Jiebo Luo, Wanli Ouyang, Xiaoshui Huang
ICME, 2023
PDF / Code

3D point cloud pre-training with knowledge distillation from 2D images.

OmniBenchmark teaser
Benchmarking Omni-Vision Representation through the Lens of Visual Realms
Yuanhan Zhang, Zhenfei Yin, Jing Shao, Ziwei Liu
ECCV, 2022
PDF / Project Page / Leaderboard / Challenge: ImageNet-1k Pretrain Track / Challenge: Open-Pretrain Track / Dataset and Code

A new benchmark for evaluating vision foundation models, together with a supervised contrastive learning framework.

CelebA-Spoof logo
CelebA-Spoof: Large-Scale Face Anti-Spoofing Dataset with Rich Annotations
Yuanhan Zhang, Zhenfei Yin, Yidong Li, Guojun Yin, Junjie Yan, Jing Shao, Ziwei Liu
ECCV, 2020
PDF / Dataset / Demo / Code

Large-scale face anti-spoofing dataset.

Activities
Public Office Hour

Last updated in Aug. 2025.

Homepage credits: Jon Barron.