Yuanhan (John) Zhang
Hi! I'm Yuanhan Zhang (here is the standard Chinese pronunciation of my first name: Yuanhan), a third-year PhD student at MMLab@NTU, supervised by Prof. Ziwei Liu.
My research interests lie in computer vision and deep learning. In particular, I focus on adapting foundation models, from vision to multi-modal, for real-world exploration. This involves benchmarking model performance and adapting models through parameter-efficient tuning, in-context learning, instruction tuning, and preference modeling.
Email (yuanhan002@e.ntu.edu.sg)  / 
Google Scholar  / 
Twitter  / 
Github
- [2024-10] We update LLaVA-Video (formerly LLaVA-NeXT-Video), releasing both the model and the data.
- [2024-08] We release LLaVA-OneVision, an LMM that excels across single-image, multi-image, and video tasks!
- [2024-07] IJCV Outstanding Reviewer Award 2023.
- [2024-07] Finally, NOAH has been accepted to TPAMI.
- [2024-07] Three papers are accepted at ECCV 2024.
- [2024-06] We're organizing a workshop on Prompting in Vision at CVPR 2024.
- [2024-05] LLaVA-NeXT-Video is released. Our team continues to build the most powerful open-source large multimodal models!
Older News & Activities
LLaVA-NeXT: Open Large Multimodal Models
Ongoing project, 2024
Project page /
Model /
Code
A powerful open-source large multimodal model. I worked on video understanding and evaluation.
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang,
Jinming Wu,
Wei Li,
Bo Li,
Zejun Ma,
Ziwei Liu,
Chunyuan Li
arXiv Preprint, 2024
PDF /
Dataset, Model and Code
A fully open-sourced video LMM with competitive performance; code, model, and data are all released.
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li,
Yuanhan Zhang,
Dong Guo,
Renrui Zhang,
Feng Li,
Hao Zhang,
Kaichen Zhang,
Yanwei Li,
Ziwei Liu,
Chunyuan Li
arXiv Preprint, 2024
PDF /
Dataset and Code
A family of LMMs developed by consolidating insights into data, models, and visual representations.
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li*,
Yuanhan Zhang*,
Liangyu Chen,
Jinghao Wang,
Fanyi Pu,
Jingkang Yang,
Chunyuan Li,
Ziwei Liu
arXiv Preprint, 2023
PDF /
Dataset and Code
A vision-language model with in-context instruction tuning.
Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy
Yuanhan Zhang,
Qinghong Sun,
Yichun Zhou,
Zexin He,
Zhenfei Yin,
Kun Wang,
Lu Sheng,
Yu Qiao,
Jing Shao,
Ziwei Liu
arXiv Preprint, 2022
PDF /
Project Page /
Demo /
Code
4 times larger than ImageNet and 2 times larger than Objects365; built by active learning with human-machine synergy.
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu,
Haodong Duan,
Yuanhan Zhang,
Bo Li,
Songyang Zhang,
Wangbo Zhao,
Yike Yuan,
Jiaqi Wang,
Conghui He,
Ziwei Liu,
Kai Chen,
Dahua Lin
ECCV, 2024 (Oral)
PDF /
Dataset and Code
Benchmarking 20 fine-grained abilities of vision-language models.
Octopus: Embodied Vision-Language Programmer from Environmental Feedback
Jingkang Yang,
Yuhao Dong,
Shuai Liu,
Bo Li,
Ziyue Wang,
Chencheng Jiang,
Haoran Tan,
Jiamu Kang,
Yuanhan Zhang,
Kaiyang Zhou,
Ziwei Liu
ECCV, 2024
PDF /
Dataset and Code
An embodied vision-language model trained with reinforcement learning from environmental feedback (RLEF), excelling at embodied visual planning and programming.
FunQA: Towards Surprising Video Comprehension
Binzhu Xie,
Sicheng Zhang,
Zitang Zhou,
Bo Li,
Yuanhan Zhang,
Jack Hessel,
Jingkang Yang,
Ziwei Liu
ECCV, 2024
PDF /
Dataset and Code
FunQA benchmarks challenging video comprehension on humorous, creative, and magic videos.
Knowledge Augmented Instruction Tuning for Zero-Shot Animal Species Recognition
Zalan Fabian,
Zhongqi Miao,
Chunyuan Li,
Yuanhan Zhang,
Ziwei Liu,
Andrés Hernández,
Andrés Montes-Rojas,
Rafael Escucha,
Laura Siabatto,
Andrés Link,
Pablo Arbeláez,
Rahul Dodhia,
Juan Lavista Ferres
Instruction Tuning and Instruction Following Workshop @ NeurIPS, 2023
PDF
A knowledge augmented vision-language model for AI conservation.
What Makes Good Examples for Visual In-Context Learning?
Yuanhan Zhang,
Kaiyang Zhou,
Ziwei Liu
NeurIPS, 2023
PDF /
Code
Retrieving prompts for visual in-context learning.
Neural Prompt Search
Yuanhan Zhang,
Kaiyang Zhou,
Ziwei Liu
TPAMI, 2024
PDF /
Project Page /
Code
Searching prompt modules for parameter-efficient transfer learning.
3D Point Cloud Pre-training with Knowledge Distillation from 2D Images
Yuan Yao,
Yuanhan Zhang,
Zhenfei Yin,
Jiebo Luo,
Wanli Ouyang,
Xiaoshui Huang
ICME, 2023
PDF /
Code
Pre-training 3D point cloud models by distilling knowledge from 2D images.
Benchmarking Omni-Vision Representation through the Lens of Visual Realms
Yuanhan Zhang,
Zhenfei Yin,
Jing Shao,
Ziwei Liu
ECCV, 2022
PDF /
Project Page /
Leaderboard /
Challenge: ImageNet1k-Pretrain Track /
Challenge: Open-Pretrain Track /
Dataset and Code
A new benchmark for evaluating vision foundation models and a new supervised contrastive learning framework.
CelebA-Spoof: Large-Scale Face Anti-Spoofing Dataset with Rich Annotations
Yuanhan Zhang,
Zhenfei Yin,
Yidong Li,
Guojun Yin,
Junjie Yan,
Jing Shao,
Ziwei Liu
ECCV, 2020
PDF /
Dataset /
Demo /
Code
A large-scale face anti-spoofing dataset with rich annotations.
Last updated in Oct. 2024.
Homepage credits: Jon Barron.