VGenST-Bench

Abstract

A new paradigm for benchmark construction

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. However, existing spatio-temporal reasoning benchmarks primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities.

In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct it, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs.

We establish a comprehensive 3 × 2 × 2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics, paired with a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.

TL;DR

VGenST-Bench is a video benchmark that uses video generative models to actively synthesize controlled scenarios for evaluating spatio-temporal reasoning in MLLMs.

**Overview of VGenST-Bench.** (A) Dataset generation: given input themes, our multi-agent pipeline jointly synthesizes videos paired with scene graphs, scenarios, and QA sets. (B) Task & level design: videos are organized along a 3 × 2 × 2 taxonomy with one spatio-temporal task assigned per cell, and QA pairs follow a three-level hierarchy. (C) Benchmark statistics: 1,200 videos and 33K QA pairs spanning 12 task types and 12 QA types.

Method

A multi-agent construction pipeline

Starting from a theme, four agents (Scene Graph, Scenario, Video, and QA) operate in sequence, followed by a two-stage human quality-control review.

**VGenST-Bench construction pipeline.** Starting from a theme, four agents operate in sequence. The Scene Graph Agent produces a structured scene graph specifying objects and spatial composition; the Scenario Agent expands it into a temporally grounded scenario with reasoning goal and timeline; the Video Agent synthesizes the corresponding image and video through generative models; and the QA Agent generates base MCQs from a task–QA applicability matrix and reformats each into three variants.
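The sequential hand-off between the four agents can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the class names, field names, function names, and variant labels are hypothetical, and each agent is a stub returning fixed toy content rather than calling any generative model. The human quality-control review is not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


# All names below are hypothetical placeholders, not the authors' code.
@dataclass
class SceneGraph:
    theme: str
    objects: List[str]
    relations: List[Tuple[str, str, str]]  # (subject, relation, object)


@dataclass
class Scenario:
    scene_graph: SceneGraph
    reasoning_goal: str
    timeline: List[str]  # ordered event descriptions


@dataclass
class VideoSample:
    scenario: Scenario
    image_path: str
    video_path: str


@dataclass
class QAItem:
    question: str
    options: List[str]
    answer_index: int
    variant: str


def scene_graph_agent(theme: str) -> SceneGraph:
    # Stub: a real agent would query a model; this returns fixed toy content.
    return SceneGraph(theme, ["cup", "table"], [("cup", "on", "table")])


def scenario_agent(graph: SceneGraph) -> Scenario:
    # Stub: expands the graph into a reasoning goal plus an ordered timeline.
    timeline = [f"{s} is {r} {o}" for s, r, o in graph.relations]
    return Scenario(graph, "track object placement", timeline)


def video_agent(scenario: Scenario) -> VideoSample:
    # Stub: a real agent would call image and video generators.
    stem = scenario.scene_graph.theme.replace(" ", "_")
    return VideoSample(scenario, f"{stem}.png", f"{stem}.mp4")


def qa_agent(sample: VideoSample,
             variants=("variant_a", "variant_b", "variant_c")) -> List[QAItem]:
    # Stub: one base MCQ per timeline event, reformatted into three variants.
    # The variant labels are placeholders; the source does not name them here.
    items = []
    for event in sample.scenario.timeline:
        for name in variants:
            items.append(QAItem(f"Is it true that {event}?", ["Yes", "No"], 0, name))
    return items


def build_sample(theme: str):
    # Run the four agents in the order described above.
    graph = scene_graph_agent(theme)
    scenario = scenario_agent(graph)
    sample = video_agent(scenario)
    return sample, qa_agent(sample)


sample, qa_items = build_sample("kitchen counter")
print(sample.video_path, len(qa_items))
```

Each stage consumes only the output of the previous one, which is what makes it possible to keep the scene graph, scenario, video, and QA set aligned for a given theme.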

Tasks & QA Types

VGenST-Bench organizes 12 reasoning tasks under a 3 × 2 × 2 video taxonomy (spatial scale × perspective × scene dynamics), paired with 12 QA types in a three-level cognitive hierarchy that progresses from low-level perception to high-level spatio-temporal reasoning.

12 Tasks

Organized under a 3 × 2 × 2 taxonomy: spatial scale × perspective × scene dynamics.

| Spatial scale | Egocentric, Static | Egocentric, Dynamic | Exocentric, Static | Exocentric, Dynamic |
|---|---|---|---|---|
| Figural | `MC` Multi-Container Attribute Mapping | `QC` Quantity Change Tracking | `CI` Container Intersection Inference | `CM` Causal Mapping |
| Vista | `DE` Direction Estimation | `IO` Interacted Object Identification | `HO` Height Ordering | `VI` Visibility Identification |
| Environmental | `DS` Directional Signage Grounding | `RV` Relative Velocity Identification | `LS` Landmark Spatial Composition | `BT` Behavioral Trigger Identification |
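As a quick check on the taxonomy, the sketch below enumerates the 3 × 2 × 2 cells and maps each to its task code and name as listed in the table. The variable and function names (`TASKS`, `task_for`, and so on) are hypothetical choices for this illustration; only the axis values, codes, and task names come from the table.

```python
from itertools import product

SCALES = ("Figural", "Vista", "Environmental")
PERSPECTIVES = ("Egocentric", "Exocentric")
DYNAMICS = ("Static", "Dynamic")

# (spatial scale, perspective, scene dynamics) -> (code, task name)
TASKS = {
    ("Figural", "Egocentric", "Static"): ("MC", "Multi-Container Attribute Mapping"),
    ("Figural", "Egocentric", "Dynamic"): ("QC", "Quantity Change Tracking"),
    ("Figural", "Exocentric", "Static"): ("CI", "Container Intersection Inference"),
    ("Figural", "Exocentric", "Dynamic"): ("CM", "Causal Mapping"),
    ("Vista", "Egocentric", "Static"): ("DE", "Direction Estimation"),
    ("Vista", "Egocentric", "Dynamic"): ("IO", "Interacted Object Identification"),
    ("Vista", "Exocentric", "Static"): ("HO", "Height Ordering"),
    ("Vista", "Exocentric", "Dynamic"): ("VI", "Visibility Identification"),
    ("Environmental", "Egocentric", "Static"): ("DS", "Directional Signage Grounding"),
    ("Environmental", "Egocentric", "Dynamic"): ("RV", "Relative Velocity Identification"),
    ("Environmental", "Exocentric", "Static"): ("LS", "Landmark Spatial Composition"),
    ("Environmental", "Exocentric", "Dynamic"): ("BT", "Behavioral Trigger Identification"),
}

# Every cell of the 3 x 2 x 2 grid has exactly one task assigned.
cells = list(product(SCALES, PERSPECTIVES, DYNAMICS))
assert len(cells) == 12 and set(cells) == set(TASKS)


def task_for(scale: str, perspective: str, dynamics: str):
    """Look up the (code, name) pair for one taxonomy cell."""
    return TASKS[(scale, perspective, dynamics)]


print(task_for("Vista", "Exocentric", "Dynamic"))
# ('VI', 'Visibility Identification')
```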

12 QA Types

Organized along a three-level cognitive hierarchy (3 + 6 + 3).

L1 Visual Perception

Object Existence · Object Attribute Recognition · 2D Frame Localization

L2 Scene Understanding

Identity Tracking · Action Recognition · Object Counting · Temporal Ordering · Camera Motion · Spatial Layout

L3 Spatio-Temporal Reasoning

Perspective-Taking · Counterfactual Reasoning · Predictive Reasoning

Each task uses a subset of QA types — see the task–QA applicability matrix in the paper appendix.
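One way to use the hierarchy for diagnosis is to group per-question results by level, so that perception errors can be separated from reasoning errors. The sketch below is an assumption about how such a breakdown might be computed, not the authors' evaluation code: the names `QA_HIERARCHY`, `LEVEL_OF`, and `accuracy_by_level` are hypothetical, and the `toy` results are made up purely to exercise the function.

```python
from collections import defaultdict

# Level -> (level name, QA types), as listed above (3 + 6 + 3).
QA_HIERARCHY = {
    "L1": ("Visual Perception",
           ["Object Existence", "Object Attribute Recognition", "2D Frame Localization"]),
    "L2": ("Scene Understanding",
           ["Identity Tracking", "Action Recognition", "Object Counting",
            "Temporal Ordering", "Camera Motion", "Spatial Layout"]),
    "L3": ("Spatio-Temporal Reasoning",
           ["Perspective-Taking", "Counterfactual Reasoning", "Predictive Reasoning"]),
}

# Reverse index: QA type -> level.
LEVEL_OF = {qa: level for level, (_, types) in QA_HIERARCHY.items() for qa in types}
assert len(LEVEL_OF) == 12


def accuracy_by_level(results):
    """Aggregate (qa_type, is_correct) pairs into per-level accuracy."""
    totals = defaultdict(lambda: [0, 0])  # level -> [correct, total]
    for qa_type, is_correct in results:
        level = LEVEL_OF[qa_type]
        totals[level][0] += int(is_correct)
        totals[level][1] += 1
    return {level: correct / n for level, (correct, n) in sorted(totals.items())}


# Made-up toy results, not benchmark numbers.
toy = [
    ("Object Existence", True),
    ("Object Counting", True),
    ("Object Counting", False),
    ("Predictive Reasoning", False),
]
print(accuracy_by_level(toy))
# {'L1': 1.0, 'L2': 0.5, 'L3': 0.0}
```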

Citation

BibTeX


@misc{park2026vgenstbenchbenchmarkspatiotemporalreasoning,
  title         = {VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis},
  author        = {Jinho Park and Youbin Kim and Hogun Park and Eunbyung Park},
  year          = {2026},
  eprint        = {2605.22570},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.22570},
}
