Yuanhan (John) Zhang

Hi! I'm Yuanhan Zhang (here is the standard Chinese pronunciation of my first name: Yuanhan). I am a third-year PhD student at MMLab@NTU, supervised by Prof. Ziwei Liu.

My research interests lie in computer vision and deep learning. In particular, I am focused on adapting foundation models—from vision to multi-modal—for real-world exploration. This involves benchmarking model performance and adapting models through parameter-efficient tuning, in-context learning, instruction tuning, and preference modeling.

Email (yuanhan002@e.ntu.edu.sg)  /  Google Scholar  /  Twitter  /  Github

News
  • [2024-10] We updated LLaVA-Video (formerly LLaVA-NeXT-Video), releasing both the model and the data.
  • [2024-08] We released LLaVA-OneVision, an LMM that excels across single-image, multi-image, and video tasks!
  • [2024-07] IJCV Outstanding Reviewer Award 2023.
  • [2024-07] NOAH was finally accepted to TPAMI.
  • [2024-07] Three papers were accepted at ECCV 2024.
  • [2024-06] We're organizing a workshop on Prompting in Vision at CVPR 2024.
  • [2024-05] LLaVA-NeXT-Video is released. Our team continues to build the most powerful open-source large multimodal model!
  • Older News & Activities
Pre-prints
LLaVA-NeXT: Open Large Multimodal Models
Ongoing project, 2024
Project Page / Model / Code

A powerful open-source large multimodal model. I worked on video understanding and evaluation.

LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
arXiv Preprint, 2024
PDF / Dataset, Model and Code

A fully open-sourced video LMM with competitive performance; the code, model, and data are all released.

LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
arXiv Preprint, 2024
PDF / Dataset and Code

A family of LMMs developed by consolidating insights into data, models, and visual representations.

Otter: A multi-modal model with in-context instruction tuning
Bo Li*, Yuanhan Zhang*, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu
arXiv Preprint, 2023
PDF / Dataset and Code

A vision-language model with in-context instruction tuning.

Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy
Yuanhan Zhang, Qinghong Sun, Yichun Zhou, Zexin He,
Zhenfei Yin, Kun Wang, Lu Sheng, Yu Qiao, Jing Shao, Ziwei Liu
arXiv Preprint, 2022
PDF / Project Page / Demo / Code

Four times larger than ImageNet and two times larger than Objects365; built with active learning.

Publications
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin
ECCV, 2024 (Oral)
PDF / Dataset and Code

Benchmarking vision-language models across 20 ability dimensions.

Octopus: Embodied Vision-Language Programmer from Environmental Feedback
Jingkang Yang, Yuhan Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
ECCV, 2024
PDF / Dataset and Code

An embodied vision-language model trained with RLEF (reinforcement learning with environmental feedback), excelling at embodied visual planning and programming.

FunQA: Towards Surprising Video Comprehension
Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, Ziwei Liu
ECCV, 2024
PDF / Dataset and Code

FunQA benchmarks humorous, creative, and magic videos for challenging video-comprehension tasks.

Knowledge augmented instruction tuning for zero-shot animal species recognition
Zalan Fabian, Zhongqi Miao, Chunyuan Li, Yuanhan Zhang, Ziwei Liu, Andrés Hernández, Andrés Montes-Rojas, Rafael Escucha, Laura Siabatto, Andrés Link, Pablo Arbeláez, Rahul Dodhia, Juan Lavista Ferres
Instruction Tuning and Instruction Following Workshop @ NeurIPS, 2023
PDF

A knowledge-augmented vision-language model for AI-assisted wildlife conservation.

What Makes Good Examples for Visual In-Context Learning?
Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
NeurIPS, 2023
PDF / Code

Retrieving prompts for visual in-context learning.

Neural Prompt Search
Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
TPAMI, 2024
PDF / Project Page / Code

Searching prompt modules for parameter-efficient transfer learning.

3D Point Cloud Pre-training with Knowledge Distillation from 2D Images
Yuan Yao, Yuanhan Zhang, Zhenfei Yin, Jiebo Luo, Wanli Ouyang, Xiaoshui Huang
ICME, 2023
PDF / Code

Pre-training 3D point cloud models by distilling knowledge from pre-trained 2D image models.

Benchmarking Omni-Vision Representation through the Lens of Visual Realms
Yuanhan Zhang, Zhenfei Yin, Jing Shao, Ziwei Liu
ECCV, 2022
PDF / Project Page / Leaderboard / Challenge: ImageNet1k-Pretrain Track / Challenge: Open-Pretrain Track / Dataset and Code

A new benchmark for evaluating vision foundation models, plus a new supervised contrastive learning framework.

CelebA-Spoof: Large-Scale Face Anti-Spoofing Dataset with Rich Annotations
Yuanhan Zhang, Zhenfei Yin, Yidong Li, Guojun Yin, Junjie Yan, Jing Shao, Ziwei Liu
ECCV, 2020
PDF / Dataset / Demo / Code

A large-scale face anti-spoofing dataset with rich annotations.

Activities
Public Office Hour

Last updated in Jun. 2024.

Homepage credits: Jon Barron.