VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

Jinho Park¹ · Youbin Kim¹ · Hogun Park¹ · Eunbyung Park²†
¹ Sungkyunkwan University · ² Yonsei University
† Corresponding author
Explore the benchmark

Pick a taxonomy, see a task

Each combination of Spatial scale × Perspective × Scene dynamics maps to one task. 12 tasks × 4 themes shown.

Abstract

A new paradigm for benchmark construction

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. However, existing spatio-temporal reasoning benchmarks primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities.

In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct it, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs.

We establish a comprehensive 3 × 2 × 2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics, paired with a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.

TL;DR

VGenST-Bench is a video benchmark that uses video generative models to actively synthesize controlled scenarios for evaluating spatio-temporal reasoning in MLLMs.

Overview of VGenST-Bench. (A) Dataset generation: given input themes, our multi-agent pipeline jointly synthesizes videos paired with scene graphs, scenarios, and QA sets. (B) Task & level design: videos are organized along a 3 × 2 × 2 taxonomy with one spatio-temporal task assigned per cell, and QA pairs follow a three-level hierarchy. (C) Benchmark statistics: 1,200 videos and 33K QA pairs spanning 12 task types and 12 QA types.
Method

A multi-agent construction pipeline

Starting from a theme, four agents (Scene Graph, Scenario, Video, and QA) operate in sequence, followed by a two-stage human quality-control review.

VGenST-Bench construction pipeline. Starting from a theme, four agents operate in sequence. The Scene Graph Agent produces a structured scene graph specifying objects and spatial composition; the Scenario Agent expands it into a temporally grounded scenario with reasoning goal and timeline; the Video Agent synthesizes the corresponding image and video through generative models; and the QA Agent generates base MCQs from a task–QA applicability matrix and reformats each into three variants.
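
The four-stage flow described above can be sketched as a chain of functions, each consuming the previous stage's output. This is only an illustrative sketch: every class name, field, function, and placeholder value below (including the `"kitchen"` theme and the example applicability entries) is hypothetical and not taken from the paper, the agents are stubs rather than model calls, and the three variants are generic placeholders because the page does not say what they are.

```python
from dataclasses import dataclass


@dataclass
class SceneGraph:
    theme: str
    objects: list[str]
    relations: list[tuple[str, str, str]]  # (subject, relation, object)


@dataclass
class Scenario:
    scene_graph: SceneGraph
    reasoning_goal: str
    timeline: list[str]


@dataclass
class VideoSample:
    scenario: Scenario
    image_path: str
    video_path: str


@dataclass
class QAItem:
    qa_type: str
    base_question: str
    variants: list[str]


def scene_graph_agent(theme: str) -> SceneGraph:
    # Stub: a real agent would generate objects and spatial composition.
    return SceneGraph(theme, ["cup", "table"], [("cup", "on", "table")])


def scenario_agent(graph: SceneGraph) -> Scenario:
    # Stub: a real agent would produce a reasoning goal and a timeline.
    return Scenario(graph, "placeholder goal", ["t0: initial", "t1: final"])


def video_agent(scenario: Scenario) -> VideoSample:
    # Stub: a real agent would call image and video generative models.
    return VideoSample(scenario, "placeholder.png", "placeholder.mp4")


def qa_agent(sample: VideoSample, task: str,
             applicability: dict[str, list[str]]) -> list[QAItem]:
    # One base question per QA type that applies to the task,
    # each reformatted into three (placeholder) variants.
    items = []
    for qa_type in applicability.get(task, []):
        base = f"[{qa_type}] placeholder question about {sample.video_path}"
        variants = [f"{base} (variant {i})" for i in (1, 2, 3)]
        items.append(QAItem(qa_type, base, variants))
    return items


def run_pipeline(theme: str, task: str,
                 applicability: dict[str, list[str]]):
    graph = scene_graph_agent(theme)
    scenario = scenario_agent(graph)
    sample = video_agent(scenario)
    return sample, qa_agent(sample, task, applicability)


# Hypothetical usage: the applicability entries here are made up.
sample, items = run_pipeline(
    "kitchen", "MC", {"MC": ["Object Existence", "Object Counting"]}
)
```

The human quality-control review mentioned above would sit after `run_pipeline` and is not modeled here.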

Tasks & QA Types

VGenST-Bench organizes 12 reasoning tasks under a 3 × 2 × 2 video taxonomy (spatial scale × perspective × scene dynamics), paired with 12 QA types in a three-level cognitive hierarchy that progresses from low-level perception to high-level spatio-temporal reasoning.

12 Tasks

Organized under a 3 × 2 × 2 taxonomy: spatial scale × perspective × scene dynamics.
Spatial scale | Egocentric, Static | Egocentric, Dynamic | Exocentric, Static | Exocentric, Dynamic
Figural | MC: Multi-Container Attribute Mapping | QC: Quantity Change Tracking | CI: Container Intersection Inference | CM: Causal Mapping
Vista | DE: Direction Estimation | IO: Interacted Object Identification | HO: Height Ordering | VI: Visibility Identification
Environmental | DS: Directional Signage Grounding | RV: Relative Velocity Identification | LS: Landmark Spatial Composition | BT: Behavioral Trigger Identification
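
As a small illustration of how the grid above maps each taxonomy cell to exactly one task, the following sketch enumerates the 3 × 2 × 2 combinations and pairs them with the task codes in the order the grid lists them. The variable and function names are hypothetical; only the axis values and task codes come from this page.

```python
from itertools import product

SPATIAL_SCALES = ("Figural", "Vista", "Environmental")
PERSPECTIVES = ("Egocentric", "Exocentric")
SCENE_DYNAMICS = ("Static", "Dynamic")

# Task codes in row-major order of the grid: one row per spatial scale,
# columns ordered as (Egocentric, Static), (Egocentric, Dynamic),
# (Exocentric, Static), (Exocentric, Dynamic).
TASK_CODES = (
    "MC", "QC", "CI", "CM",   # Figural
    "DE", "IO", "HO", "VI",   # Vista
    "DS", "RV", "LS", "BT",   # Environmental
)

# product() varies the last axis fastest, which matches the column order.
TAXONOMY = dict(
    zip(product(SPATIAL_SCALES, PERSPECTIVES, SCENE_DYNAMICS), TASK_CODES)
)


def task_for(scale: str, perspective: str, dynamics: str) -> str:
    """Return the task code assigned to one taxonomy cell."""
    return TAXONOMY[(scale, perspective, dynamics)]
```

For example, `task_for("Vista", "Exocentric", "Static")` looks up the Height Ordering cell and returns `"HO"`.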

12 QA Types

Organized along a three-level cognitive hierarchy (3 + 6 + 3).
L1 Visual Perception: Object Existence · Object Attribute Recognition · 2D Frame Localization
L2 Scene Understanding: Identity Tracking · Action Recognition · Object Counting · Temporal Ordering · Camera Motion · Spatial Layout
L3 Spatio-Temporal Reasoning: Perspective-Taking · Counterfactual Reasoning · Predictive Reasoning
Each task uses a subset of QA types — see the task–QA applicability matrix in the paper appendix.
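
One way the three-level hierarchy could support a per-level diagnosis is sketched below. The QA type names and their levels come from the list above; the `accuracy_by_level` helper and the `(qa_type, is_correct)` result format are assumptions made for illustration, not the benchmark's actual evaluation code.

```python
QA_HIERARCHY = {
    1: ["Object Existence", "Object Attribute Recognition",
        "2D Frame Localization"],
    2: ["Identity Tracking", "Action Recognition", "Object Counting",
        "Temporal Ordering", "Camera Motion", "Spatial Layout"],
    3: ["Perspective-Taking", "Counterfactual Reasoning",
        "Predictive Reasoning"],
}

# Reverse lookup: QA type name -> hierarchy level.
QA_LEVEL = {qa: level for level, types in QA_HIERARCHY.items() for qa in types}


def accuracy_by_level(results):
    """Aggregate (qa_type, is_correct) pairs into accuracy per level.

    Hypothetical helper: the result format is assumed for this sketch.
    """
    totals = {}
    for qa_type, is_correct in results:
        level = QA_LEVEL[qa_type]
        hits, count = totals.get(level, (0, 0))
        totals[level] = (hits + int(is_correct), count + 1)
    return {level: hits / count for level, (hits, count) in sorted(totals.items())}
```

Grouping scores this way separates failures in low-level perception (L1) from failures in higher-level reasoning (L3), which is the kind of decoupling the hierarchy is designed for.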
Citation

BibTeX

vgenst-bench.bib
@misc{park2026vgenstbenchbenchmarkspatiotemporalreasoning,
  title         = {VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis},
  author        = {Jinho Park and Youbin Kim and Hogun Park and Eunbyung Park},
  year          = {2026},
  eprint        = {2605.22570},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.22570},
}