Agentic 3D Spatial Reasoning

Skill-3D

Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Skill-3D Evolving Loop
01

Scene & Task

Query type · target objects · required evidence

02

Scene Memory

Rollouts · tool outputs · success / failure traces

03

Skill Library

Workflows + failure lessons

04

Policy Training

Agentic SFT + GRPO

39% → 78% Effective Tool Usage on VSI-Bench
+67% Gemini-3-Flash gain on MMSI-Bench
+43% Qwen3-VL-8B gain on VSI-Bench

Motivation

Tool availability is not enough.

Figure: Radar chart comparing GPT-5.4, Think3D, and Skill-3D.
Skill-3D extracts scene-aware skills to guide tool use for 3D spatial reasoning.

Uniform Tool Use

Existing agents often apply fixed tool workflows across different spatial tasks, causing mismatched evidence and limited gains.

Scene-Aware Skills

Skill-3D retrieves task-specific skills to select the right tools and ground answers in relevant 3D evidence.

Method

Scene Memory, Skill Library, and skill-guided post-training.

Overview of Skill-3D: scene-aware skill extraction, skill-guided inference, and skill-guided agentic post-training.
01

Scene-Aware Skill Extraction

Successful rollouts become reusable workflows; failures become lessons.
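
The extraction step above can be illustrated with a minimal sketch. The page does not publish the Scene Memory or Skill Library schema, so the `Rollout` and `Skill` classes, their fields, and the `extract_skills` function are all hypothetical names chosen for illustration; the only behavior taken from the text is that successful rollouts contribute workflows and failed ones contribute lessons.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Rollout:
    """One agent attempt on a scene/task pair (hypothetical schema)."""
    task_type: str          # e.g. "appearance_order"
    tool_calls: list[str]   # tool names in call order
    success: bool
    note: str = ""          # short description of what went wrong, if anything


@dataclass
class Skill:
    """Reusable workflows plus failure lessons for one task type (hypothetical)."""
    task_type: str
    workflows: list[list[str]] = field(default_factory=list)
    lessons: list[str] = field(default_factory=list)


def extract_skills(rollouts: list[Rollout]) -> dict[str, Skill]:
    """Group rollouts by task type: successes add workflows, failures add lessons."""
    library: dict[str, Skill] = {}
    for r in rollouts:
        skill = library.setdefault(r.task_type, Skill(r.task_type))
        if r.success:
            # Keep each distinct tool sequence once.
            if r.tool_calls not in skill.workflows:
                skill.workflows.append(r.tool_calls)
        elif r.note and r.note not in skill.lessons:
            skill.lessons.append(r.note)
    return library
```

Keying the library by task type is one simple design choice; a real system could just as well key skills by scene properties or by required evidence.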

02

Skill-Guided Inference

Skill-3D retrieves relevant skills and uses a compact subset to guide tool use and reasoning.
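
As a rough illustration of retrieving a compact subset of skills, the sketch below ranks skills by word overlap with the query and keeps the top `k`. The page does not say how Skill-3D scores relevance, so Jaccard overlap is a stand-in technique, and `retrieve_skills`, the skill descriptions, and the default `k` are assumptions made for this example.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_skills(query: str, library: dict[str, str], k: int = 2) -> list[str]:
    """Return up to k skill names whose descriptions best overlap the query.

    `library` maps a skill name to a free-text description (hypothetical format).
    """
    q = set(query.lower().split())
    scored = [
        (jaccard(q, set(desc.lower().split())), name)
        for name, desc in library.items()
    ]
    # Highest score first; break ties alphabetically so results are deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for score, name in scored[:k] if score > 0]
```

Capping the result at `k` mirrors the idea of passing only a compact subset of skills to the agent rather than the whole library.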

03

Skill-Guided Agentic Post-Training

Skill-guided trajectories train compact agents to select skills, call tools, and ground answers through SFT and GRPO.
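
For readers unfamiliar with GRPO, its core idea is to score each sampled trajectory relative to the other trajectories drawn for the same prompt. The sketch below shows only that normalization step. The reward values are placeholders, the page does not describe Skill-3D's reward design, and the use of population standard deviation with a small `eps` is an assumption rather than the authors' setting.

```python
from __future__ import annotations

from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    that beat their siblings get a positive signal and the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder rewards for four rollouts of one question (1.0 = correct answer).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.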

Results

Consistent gains across closed- and open-source agents.

Comprehensive Evaluation

VSI-Bench, BLINK, CV-3D, and MMSI-Bench

Closed-source evaluation across GPT-4o, GPT-5.4, Gemini-2.5-Pro, and Gemini-3-Flash.

Closed-source agents

Skill-3D consistently outperforms non-agentic, direct tool-use, and Think3D baselines across four closed-source MLLM agents. The gains are most pronounced on VSI-Bench, where different spatial categories require task-specific evidence such as object grounding, depth estimation, and multi-view verification.

Open-source agents

Skill-guided post-training transfers to compact Qwen3-VL-4B/8B agents. The results show that scene-aware tool-use behavior can be internalized by smaller policies through agentic SFT and GRPO, improving skill selection, tool usage, and evidence-grounded answering.

Analysis

Skill-3D improves tool-use quality, not just tool-call frequency.

Effective Tool Usage

Values by benchmark, in the order GPT-5.4 / Think3D / Skill-3D:

VSI-Bench: 39.2% / 58.5% / 78.7%
BLINK: 36.4% / 54.0% / 79.2%
CV-3D: 31.8% / 48.3% / 87.5%
MMSI-Bench: 30.5% / 45.6% / 80.3%
Effective Tool Usage improves across VSI-Bench, BLINK, CV-3D, and MMSI-Bench.

Task-Aligned Tool Distribution

Bars show tool-call frequency by task group for GPT-5.4, Think3D, and Skill-3D.

Depth / Distance / Size: 18%, 61%, 8%, 10%, 7%, 67%

Spatial Relation / Direction: 15%, 57%, 26%, 7%, 3%, 45%
Tool usage becomes more task-aligned for depth/distance/size and direction reasoning tasks.

Qualitative Cases

Case Studies

Case 1 · Appearance Order
What will be the first-time appearance order of the following categories in the video: basket, door, pillow, laptop? Select from the following choices. (A) basket, pillow, door, laptop (B) pillow, door, laptop, basket (C) basket, door, pillow, laptop (D) door, basket, pillow, laptop
#dy_14 · appearance_order #st_06 · temporal_cue
Object Detection · Image Segmentation
Object detection localizes pillow, door, laptop, and basket candidates throughout the video, while segmentation refines their boundaries and ensures consistent object identity across views. The temporal-cue skill then tracks the first frame in which each segmented object appears. The pillow is detected in the initial bedroom scene, followed by the door as the camera moves through the room. The laptop becomes visible later on the desk, and the basket only appears in the final views. Therefore, the first-time appearance order is B. pillow → door → laptop → basket.
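
The temporal-cue step in this case can be sketched as a first-seen lookup over per-frame detections. The function name, the input format, and the frame indices below are invented for illustration; only the resulting order (pillow, door, laptop, basket) comes from the case study.

```python
from __future__ import annotations


def first_appearance_order(
    detections: dict[int, set[str]], categories: list[str]
) -> list[str]:
    """Order categories by the first frame in which each one is detected.

    `detections` maps a frame index to the labels detected in that frame.
    Categories that are never detected are omitted from the result.
    """
    first_seen: dict[str, int] = {}
    for frame in sorted(detections):
        for label in detections[frame]:
            if label in categories and label not in first_seen:
                first_seen[label] = frame
    # Break same-frame ties by the order in which categories were listed.
    return sorted(first_seen, key=lambda c: (first_seen[c], categories.index(c)))


# Hypothetical frame indices; real values would come from the detector.
detections = {
    0: {"pillow"},
    12: {"pillow", "door"},
    30: {"laptop"},
    55: {"basket", "door"},
}
order = first_appearance_order(detections, ["basket", "door", "pillow", "laptop"])
```

With these made-up detections, `order` is pillow, door, laptop, basket, which corresponds to option (B).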

Citation

BibTeX

@article{skill3d2026,
  title     = {Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning},
  author    = {Haoyuan Li and Zhengdong Hu and Jun Wang and Hehe Fan and Yi Yang},
  journal   = {arXiv preprint arXiv:2606.07436},
  year      = {2026}
}