We introduce the Video Thinking Test (Video-TT), a benchmark designed to assess whether video LLMs can interpret real-world videos as effectively as humans. Video-TT 1) differentiates between errors due to inadequate frame sampling and genuine gaps in understanding complex visual narratives, and 2) evaluates robustness against natural adversarial questions. Video-TT comprises 1,000 YouTube Shorts videos, each with one open-ended question and four adversarial questions that probe visual and narrative complexity. Our evaluation shows a significant gap between video LLMs and human performance, underscoring the need for benchmarks like Video-TT to advance video understanding.
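To make the benchmark's structure concrete, the sketch below shows one way a single Video-TT annotation record could be laid out. The field names and placeholder strings are illustrative assumptions, not the released schema.

```python
# A minimal, hypothetical record layout for one Video-TT video.
# Field names and placeholder strings are assumptions, not the official schema;
# the structure follows the description above: one primary open-ended question
# plus four natural adversarial questions per video.
record = {
    "video_id": "<youtube_shorts_id>",
    "open_ended_question": "<primary question about the video's visual/narrative content>",
    "reference_answer": "<human-written answer>",
    "adversarial_questions": [
        "<natural adversarial variant 1>",
        "<natural adversarial variant 2>",
        "<natural adversarial variant 3>",
        "<natural adversarial variant 4>",
    ],
}
```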
The dataset is built through a multi-step annotation and verification pipeline.
The dataset includes 5,000 question-answer pairs across 1,000 videos. Questions were first grouped by reasoning level (element, event, or plot) based on the video content, and then categorized by the type of inquiry (e.g., Attributes, Localization). When a specific complexity factor appeared frequently within a category (e.g., over 50 instances), it was promoted to a sub-category (e.g., Element Attributes–Illusion). In total, 18 distinct question types were identified.
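As a rough illustration of the promotion rule described above, the sketch below counts how often each complexity factor appears within a (level, category) group and promotes the frequent ones to sub-categories. The data structures and field names are assumptions; only the 50-instance threshold comes from the text.

```python
from collections import Counter

# Illustrative sketch of the sub-category promotion rule. `annotations` is an
# assumed list of dicts with the fields used below.
PROMOTION_THRESHOLD = 50

def promote_subcategories(annotations):
    """Return {(level, category): [complexity factors frequent enough to promote]}."""
    counts = Counter(
        (a["level"], a["category"], a["complexity_factor"])
        for a in annotations
        if a.get("complexity_factor")
    )
    subcategories = {}
    for (level, category, factor), n in counts.items():
        if n > PROMOTION_THRESHOLD:  # e.g. promoted to "Element Attributes-Illusion"
            subcategories.setdefault((level, category), []).append(factor)
    return subcategories
```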
Among video-language models, performance varies widely. Open-source models such as InternVL-2.5-8B handle straightforward questions well (65.7% on Correctly-Led) but struggle with misleading prompts (24.5% on Wrongly-Led). LLaVA-Video-72B is the strongest open-source model overall. Proprietary models such as GPT-4o and Gemini Pro outperform most open-source models, with GPT-4o showing greater robustness to misleading prompts (67.5% Correctly-Led, 39.8% Wrongly-Led), though still far behind human-level reasoning. Notably, LLaVA-Video-72B approaches GPT-4o's accuracy in multiple-choice settings but falls short on the primary open-ended questions, highlighting both a limitation of current open-source systems and a bias in existing benchmarks that overemphasize multiple-choice formats.

On natural adversarial robustness, humans remain the gold standard at 64.4% accuracy. GPT-4o ranks highest among models at 36.0% but still lags significantly. Open-source models such as InternVL-2.5-8B perform poorly (10.9%), and even larger variants offer minimal gains. These results underscore the difficulty of building models that resist natural adversarial perturbations and the need for more rigorous benchmarks targeting open-ended, robustness-centric evaluation.
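For reference, the snippet below is a minimal sketch of how per-split accuracy (e.g., Correctly-Led vs. Wrongly-Led) could be aggregated from per-question judgments. The input format and field names are assumptions; the actual evaluation scores open-ended answers by judging rather than relying on a precomputed correctness flag.

```python
from collections import defaultdict

def split_accuracy(results):
    """Aggregate accuracy per question split (e.g. 'correctly_led', 'wrongly_led').

    `results` is an assumed list of dicts such as
    {"split": "wrongly_led", "correct": True}; in practice each open-ended
    answer would first be judged against the reference answer.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["split"]] += 1
        hits[r["split"]] += int(r["correct"])
    return {split: hits[split] / totals[split] for split in totals}
```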
This study is supported by the Ministry of Education, Singapore, under its MOE AcRF Tier 2 (MOE-T2EP20221-0012, MOE-T2EP20223-0002), and under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

Homepage credits: Panda-70M