Video Thinking Test

A Holistic Benchmark for Advanced Video Reasoning and Understanding

Yuanhan Zhang*, Yunice Chew*, Yuhao Dong, Aria Leo, Bo Hu, Ziwei Liu

* Equal contribution

ICCV 2025




Paper | Dataset | Examples

Abstract

We introduce the Video Thinking Test (Video-TT), a benchmark designed to assess whether video LLMs can interpret real-world videos as effectively as humans. Video-TT 1) differentiates between errors due to inadequate frame sampling and genuine gaps in understanding complex visual narratives, and 2) evaluates robustness against natural adversarial questions. Video-TT comprises 1,000 YouTube Shorts videos, each with one open-ended question and four adversarial questions that probe visual and narrative complexity. Our evaluation shows a significant gap between video LLMs and human performance, underscoring the need for benchmarks like Video-TT to advance video understanding.

Video Demo

Data Annotation Pipeline

[Figure: Data annotation pipeline diagram]

The dataset is built through a multi-step annotation and verification pipeline:

  • Ensuring Complexity: Visual complexity measures how visually challenging a video is, based on factors such as unclear or unusual content, fast motion, complex object arrangements, and visual illusions that hinder recognition. Narrative complexity reflects how cognitively demanding the storyline is, including elements such as plot twists, montage-style editing, subtle technical manipulations, and reliance on world knowledge for full comprehension.
  • Primary Question Annotation: Annotators select videos and create QA pairs requiring either visual or narrative complexity. A question is retained only if at least one top model (GPT-4o, LLaVA-Video, Qwen2.5-VL) fails to answer it correctly.
  • Answer & Rationale: Annotators provide the correct answer, a detailed reasoning process, and a critique of incorrect model responses.
  • Sampling Check: Each question must be answerable from 80 uniformly sampled frames, ensuring reliance on visual rather than auditory cues (see the frame-sampling sketch after this list).
  • Adversarial Question Expansion: Annotators create four challenging variants per primary question based on model failures, with minimal edits to the original answer and rationale.
  • Alignment Check: A consensus-based process among three annotators ensures consistency. Questions without agreement are discarded.
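
For concreteness, the following is a minimal sketch of the 80-frame uniform sampling behind the Sampling Check. It assumes a local video file and uses OpenCV; the benchmark's own tooling and decoding settings are not specified here, so the file name and function are illustrative only.

```python
# Minimal sketch: uniformly sample 80 frames from a video (assumes OpenCV).
import cv2
import numpy as np


def sample_uniform_frames(video_path: str, num_frames: int = 80):
    """Return `num_frames` frames sampled at evenly spaced indices."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        raise ValueError(f"Could not read frame count from {video_path}")

    # Evenly spaced indices from the first to the last frame.
    indices = np.linspace(0, total - 1, num_frames).astype(int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)  # BGR ndarray of shape (H, W, 3)
    cap.release()
    return frames


# Hypothetical usage; "example_short.mp4" is a placeholder file name.
frames = sample_uniform_frames("example_short.mp4")
```

A question passes this check only if annotators can still answer it from these frames alone, which rules out questions that depend on audio or on moments a uniform sampler would miss.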

Dataset Statistics

The dataset includes 5,000 question-answer pairs across 1,000 videos. Questions were first grouped by reasoning level (element, event, or plot) based on video content, then further categorized by type of inquiry (e.g., Attributes, Localization). When a specific complexity factor appeared frequently within a category (e.g., over 50 instances), it was promoted to a sub-category (e.g., Element Attributes–Illusion). In total, 18 distinct question types were identified.
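
The promotion rule can be made concrete with a short sketch; the field names, labels, and exact threshold handling here are hypothetical and only illustrate the grouping logic.

```python
# Sketch of the sub-category promotion rule: a complexity factor appearing
# frequently (e.g., over 50 times) within a category becomes a sub-category
# such as "Element Attributes-Illusion". Field names are hypothetical.
from collections import Counter

PROMOTION_THRESHOLD = 50


def assign_question_types(questions):
    """questions: list of dicts with 'category' and 'complexity_factor' keys."""
    counts = Counter((q["category"], q["complexity_factor"]) for q in questions)
    types = []
    for q in questions:
        key = (q["category"], q["complexity_factor"])
        if counts[key] > PROMOTION_THRESHOLD:
            types.append(f'{q["category"]}-{q["complexity_factor"]}')  # promoted sub-category
        else:
            types.append(q["category"])  # stays in the parent category
    return types
```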

Performance

Among video-language models, performance varies widely. Open-source models such as InternVL-2.5-8B perform well on straightforward questions (65.7% on Correctly-Led) but struggle with misleading prompts (24.5% on Wrongly-Led). LLaVA-Video-72B emerges as the strongest open-source model overall. Proprietary models such as GPT-4o and Gemini Pro outperform most open-source models, with GPT-4o showing greater robustness to misleading prompts (67.5% Correctly-Led, 39.8% Wrongly-Led), though still far behind human-level reasoning. Notably, LLaVA-Video-72B approaches GPT-4o's accuracy in multiple-choice settings but falls short on the primary open-ended questions, highlighting both a limitation of current open-source systems and a bias in existing benchmarks that overemphasize multiple-choice formats.

On natural adversarial robustness, humans remain the gold standard at 64.4% accuracy. GPT-4o ranks highest among models at 36.0% but still lags significantly behind. Open-source models such as InternVL-2.5-8B perform poorly (10.9%), with even larger variants offering only minimal gains. These results underscore the difficulty of building models that resist adversarial perturbations and the need for more rigorous benchmarks targeting open-ended and robustness-centric tasks.
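
To make the Correctly-Led / Wrongly-Led comparison concrete, here is a minimal sketch of tabulating accuracy per question condition; the record fields ('condition', 'correct') are hypothetical and not the benchmark's official output format.

```python
# Sketch: accuracy broken down by question condition (hypothetical record format).
from collections import defaultdict


def accuracy_by_condition(records):
    """records: iterable of dicts with a 'condition' label and a boolean 'correct'."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["condition"]] += 1
        hits[r["condition"]] += int(r["correct"])
    return {cond: hits[cond] / totals[cond] for cond in totals}


# Hypothetical usage with two toy records.
preds = [
    {"condition": "correctly_led", "correct": True},
    {"condition": "wrongly_led", "correct": False},
]
print(accuracy_by_condition(preds))  # {'correctly_led': 1.0, 'wrongly_led': 0.0}
```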

Acknowledgement

This study is supported by the Ministry of Education, Singapore, under its MOE AcRF Tier 2 (MOE-T2EP20221-0012, MOE-T2EP20223-0002), and under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

Homepage credits: Panda-70M