Yuanhan (John) Zhang

Hi! I'm Yuanhan Zhang (here is the standard Chinese pronunciation of my given name). I am a third-year PhD student at MMLab@NTU, supervised by Prof. Ziwei Liu.

My research interests lie in computer vision and deep learning. I focus on bringing foundation models, from vision to multimodal, into real-world use: benchmarking their performance and adapting them via parameter-efficient tuning, in-context learning, and instruction tuning.

Email (yuanhan002@e.ntu.edu.sg)  /  Google Scholar  /  Twitter  /  GitHub

Portrait of Yuanhan Zhang
News
  • [2025-07] We release the Video Thinking Test (📽️ Video-TT 📽️), a holistic benchmark that assesses the correctness and robustness of advanced video reasoning and understanding in MLLMs, compared with humans.
  • Older News & Activities
Publications
Video-TT teaser
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding
Yuanhan Zhang*, Yunice Chew*, Yuhao Dong, Aria Leo, Bo Hu, Ziwei Liu
ICCV, 2025
PDF / Dataset and Code

A holistic benchmark that assesses the correctness and robustness of advanced video reasoning and understanding in MLLMs, compared with humans.

LLaVA-Video teaser
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
TMLR, 2025
PDF / Dataset, Model and Code

Fully open-sourced video LMM (code, model, and data) with competitive performance.

LLaVA-OneVision teaser
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
TMLR, 2025
PDF / Dataset and Code

A family of LMMs consolidating insights into data, models, and visual representations.

LLaVA-NeXT-Interleave teaser
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li*, Renrui Zhang*, Hao Zhang*, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma
ICLR, 2025 (Spotlight)
PDF / Dataset and Code

Tackling multi-image, video, and 3D in large multimodal models.

MMBench logo
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu*, Haodong Duan*, Yuanhan Zhang*, Bo Li*, Songyang Zhang*, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin
ECCV, 2024 (Oral)
PDF / Dataset and Code

Benchmarking 20 abilities of vision-language models.

Octopus logo
Octopus: Embodied Vision-Language Programmer from Environmental Feedback
Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
ECCV, 2024
PDF / Dataset and Code

An embodied VLM trained with reinforcement learning from environmental feedback (RLEF), strong at embodied visual planning and programming.

FunQA logo
FunQA: Towards Surprising Video Comprehension
Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, Ziwei Liu
ECCV, 2024
PDF / Dataset and Code

A benchmark of humorous, creative, and magic videos for challenging video understanding.

Otter brand title
Otter: A Multi-modal Model with In-context Instruction Tuning
Bo Li*, Yuanhan Zhang*, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu
TPAMI
PDF / Dataset and Code

A vision-language model with in-context instruction tuning.

Visual Prompt Retrieval teaser
What Makes Good Examples for Visual In-Context Learning?
Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
NeurIPS, 2023
PDF / Code

Retrieving prompts for visual in-context learning.

Learning without Forgetting teaser
Learning without Forgetting for Vision-Language Models
Da-Wei Zhou, Yuanhan Zhang, Yan Wang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu
TPAMI
PDF / Code

Learning without forgetting for vision-language models.

NOAH teaser
Neural Prompt Search
Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
TPAMI
PDF / Project Page / Code

Searching prompt modules for parameter-efficient transfer learning.

3D point cloud KD teaser
3D Point Cloud Pre-training with Knowledge Distillation from 2D Images
Yuan Yao, Yuanhan Zhang, Zhenfei Yin, Jiebo Luo, Wanli Ouyang, Xiaoshui Huang
ICME, 2023
PDF / Code

3D point cloud pre-training with knowledge distillation from 2D images.

OmniBenchmark teaser
Benchmarking Omni-Vision Representation through the Lens of Visual Realms
Yuanhan Zhang, Zhenfei Yin, Jing Shao, Ziwei Liu
ECCV, 2022
PDF / Project Page / Leaderboard / Challenge: ImageNet-1k Pretrain Track / Challenge: Open-Pretrain Track / Dataset and Code

A new benchmark for evaluating vision foundation models, together with a supervised contrastive learning framework.

CelebA-Spoof logo
CelebA-Spoof: Large-Scale Face Anti-Spoofing Dataset with Rich Annotations
Yuanhan Zhang, Zhenfei Yin, Yidong Li, Guojun Yin, Junjie Yan, Jing Shao, Ziwei Liu
ECCV, 2020
PDF / Dataset / Demo / Code

Large-scale face anti-spoofing dataset.

Activities
Public Office Hour

Last updated in Aug. 2025.

Homepage credits: Jon Barron.