From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while built on top of general Vision-Language Models (VLMs), still fall short of robust zero-shot performance due to the scarcity and heterogeneity of embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data construction pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validate FSD's capabilities in both "seeing" and "doing", achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on VABench, our proposed and more challenging benchmark. We also verify FSD's zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real-robot settings. Experimental results show that FSD achieves a 40.6% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.

Method Overview

FSD Framework
Overview of the FSD Framework. FSD unlocks visual aids reasoning and generation through Spatial Relationship-Focused CoT, demonstrating generalization capabilities that enable zero-shot robot manipulation and strong performance across multiple benchmarks.


🎯 Motivation & Challenges

Current Vision-Language-Action (VLA) models fall short of robust zero-shot performance due to two fundamental limitations (the "Seeing to Doing" issue):

  • (1) Scarcity: Robotics data remains limited compared to language and vision datasets, preventing the scaling laws observed in those domains from emerging
  • (2) Heterogeneity: Significant variation across robot platforms makes end-to-end learning from vision to diverse action outputs challenging

🔬 Our Solution: FSD Pipeline

We present FSD (From Seeing to Doing), a novel pipeline that addresses generalization challenges through spatial relationship reasoning and visual aids generation. Our approach leverages VLMs' visual understanding capabilities, augmented with step-by-step reasoning to extract unified mid-level structural representations independent of robot embodiment.

FSD Pipeline
FSD Pipeline Overview. Inspired by the process of human reasoning, FSD uses a spatial relationship graph as an anchor to derive a visual chain-of-thought reasoning process for visual trace generation.

📍 Visual Aids Representation

Our mid-level representation includes spatial affordance boxes/points and visual traces, each represented as marked coordinates within visual images. These visual aids provide expressive spatial information, a compact representation, and embodiment-independent guidance.
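
To make this concrete, here is a minimal sketch of how such visual aids could be represented in code. The class and field names below are illustrative assumptions for exposition, not FSD's actual interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative sketch only: the class and field names are assumptions for
# exposition, not FSD's actual data format.

Point = Tuple[int, int]  # (x, y) pixel coordinates in the image

@dataclass
class SpatialAffordance:
    """Where to act: a box and/or point marked on the image."""
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)
    point: Point                    # e.g., a grasp or placement point

@dataclass
class VisualTrace:
    """How to move: an ordered sequence of 2D waypoints in image space."""
    waypoints: List[Point]

@dataclass
class VisualAids:
    """Mid-level, embodiment-independent guidance produced by the model."""
    instruction: str
    affordance: SpatialAffordance
    trace: VisualTrace
```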


🏗️ Three Core Components

1. Spatial Relationship-Focused Visual Chain-of-Thought (Sr-CoT)

Conducts multi-step reasoning anchored by object coordinates and spatial relationships, treating visual aid generation as a reasoning process. This enables the model to understand complex spatial configurations and generate appropriate visual guides.
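
As a rough illustration of what such a reasoning trace might look like, the prompt skeleton below spells out the stages (objects → spatial relationship graph → affordance → visual trace). The wording and step structure are our assumptions, not FSD's exact prompts.

```python
# Illustrative prompt skeleton for the Sr-CoT stages described above. The
# wording and step structure are assumptions for exposition, not the exact
# prompts used to train FSD.

SR_COT_PROMPT = """\
Instruction: {instruction}

Step 1 (Objects): List the task-relevant objects and their image
coordinates as [x, y] points or [x1, y1, x2, y2] boxes.

Step 2 (Spatial relationship graph): Describe the pairwise spatial
relations (left of, on top of, inside, behind, ...) between these objects,
anchored to the coordinates from Step 1.

Step 3 (Spatial affordance): Based on these relations, output the box/point
where the manipulation should happen.

Step 4 (Visual trace): Output an ordered list of 2D waypoints [[x, y], ...]
describing how the end effector should move toward the goal region.
"""

def build_sr_cot_query(instruction: str) -> str:
    """Fill the staged reasoning template with a task instruction."""
    return SR_COT_PROMPT.format(instruction=instruction)
```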

2. Weak-to-Strong Data Construction Pipeline

Combines large-scale embodied datasets with common sense data, establishing a weak-to-strong capability enhancement training process. This systematic approach captures fine-grained spatial relationships and manipulation skills across diverse scenarios.
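
The snippet below sketches one way such a staged curriculum could be organized in code; the data-source names and the two-stage split are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical sketch of a weak-to-strong training schedule. The data-source
# names and the two-stage split are assumptions, not the paper's exact recipe.

STAGES = [
    # Stage 1 ("weak"): broad common-sense and spatial QA data that teaches
    # coordinate-grounded spatial reasoning.
    {"name": "spatial_commonsense", "sources": ["general_vqa", "spatial_qa"]},
    # Stage 2 ("strong"): large-scale embodied data annotated with spatial
    # affordances and visual traces for fine-grained manipulation guidance.
    {"name": "embodied_visual_aids", "sources": ["oxe", "bridgedata", "droid"]},
]

def training_schedule(stages=STAGES):
    """Yield (stage_name, data_source) pairs in curriculum order."""
    for stage in stages:
        for source in stage["sources"]:
            yield stage["name"], source

for stage_name, source in training_schedule():
    print(f"train on {source} during stage '{stage_name}'")
```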

3. Self-Consistency Mechanism

Aligns understanding and generation by binding spatial coordinates with specific visual signals. This ensures robust and consistent decision-making across different scenarios by maintaining coherence between spatial reasoning and visual perception.
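
A minimal sketch of such a coordinate-visual consistency check is shown below, assuming the model also has access to a bounding box for the referenced object; it illustrates the idea rather than FSD's actual training objective.

```python
import math

# Minimal sketch of a coordinate-visual consistency check. The box/point
# representation and the distance threshold are assumptions for illustration.

def point_in_box(point, box):
    """True if an (x, y) point lies inside an (x1, y1, x2, y2) box."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def is_self_consistent(affordance_point, trace, target_box, max_start_gap=50.0):
    """Check that predicted coordinates agree with the visual evidence: the
    affordance point falls inside the referenced object's box, and the visual
    trace starts near the affordance point (distances in pixels)."""
    if not point_in_box(affordance_point, target_box):
        return False
    if trace and math.dist(trace[0], affordance_point) > max_start_gap:
        return False
    return True

# Example: a prediction whose trace starts at the affordance point is kept.
assert is_self_consistent((120, 200), [(120, 200), (160, 180)], (100, 180, 150, 230))
```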

VABench: Visual Aids Generation Benchmark

We propose VABench, a challenging benchmark with 300 manually annotated problems from real-world and simulation datasets (OXE, BridgeData, Droid). VABench evaluates models' abilities to generate spatial affordance and visual traces from natural language instructions, using metrics including point accuracy, trajectory MAE/RMSE, and GPT-based qualitative scoring.
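
For reference, the sketch below shows plausible implementations of these metrics; the exact trajectory alignment and resampling protocol used by VABench is our assumption.

```python
import numpy as np

# Plausible reference implementations of VABench-style metrics. The exact
# trajectory alignment/resampling protocol used in the paper is an assumption.

def resample(trace: np.ndarray, n: int) -> np.ndarray:
    """Linearly resample an (m, 2) waypoint array to n points."""
    t_old = np.linspace(0.0, 1.0, len(trace))
    t_new = np.linspace(0.0, 1.0, n)
    return np.stack([np.interp(t_new, t_old, trace[:, d]) for d in range(2)], axis=1)

def trace_mae_rmse(pred: np.ndarray, gt: np.ndarray, n: int = 32):
    """Pixel-space MAE and RMSE between predicted and ground-truth traces."""
    p, g = resample(pred, n), resample(gt, n)
    err = np.linalg.norm(p - g, axis=1)  # per-waypoint Euclidean error
    return err.mean(), np.sqrt((err ** 2).mean())

def point_accuracy(pred_points, gt_boxes) -> float:
    """Fraction of predicted (x, y) points that fall inside their target region
    (approximated here with ground-truth boxes)."""
    hits = [
        x1 <= x <= x2 and y1 <= y <= y2
        for (x, y), (x1, y1, x2, y2) in zip(pred_points, gt_boxes)
    ]
    return sum(hits) / len(hits)
```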


FSD VABench Results
FSD Prediction Results on VABench. Our model demonstrates strong performance in generating spatial affordance and visual traces from natural language instructions across diverse scenarios.


Evaluation of Spatial Understanding and Reasoning Capabilities

🏆 Outstanding Performance on Spatial Reasoning Benchmarks

We evaluate FSD-13B on 5 spatial reasoning benchmarks (CVBench, CRPE, SAT, BLINK, EmbSp), where it achieves the best overall rank of 1.3 across all benchmarks.


Table 1: Performance comparison on 5 spatial reasoning benchmarks

| Model | CVBench Count | CVBench 2DRel | CVBench 3DDep | CVBench Avg. | CRPE Exist. | CRPE Subj. | CRPE Pred. | CRPE Avg. | SAT Val | SAT Real | BLINK Count | BLINK MV | BLINK RelDepth | BLINK Avg. | EmbSp Test | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4V | 62.4 | 71.1 | 79.8 | 70.4 | 90.6 | 76.7 | 65.1 | 75.2 | 44.8 | 50.7 | 60.8 | 55.6 | 59.7 | 62.2 | 36.1 | - |
| GPT-4o | 65.9 | 85.5 | 87.8 | 79.4 | 93.3 | 81.9 | 71.8 | 80.2 | 49.4 | 57.5 | 49.2 | 60.2 | 74.2 | 63.2 | 49.1 | - |
| LLaVA-1.5-13B | 58.2 | 46.6 | 53.0 | 51.4 | 88.7 | 57.4 | 54.2 | 63.9 | 51.4 | 41.6 | 45.0 | 41.4 | 53.2 | 52.4 | 35.1 | 4.8 |
| SAT-Dynamic-13B | 61.5 | 89.7 | 80.7 | 76.2 | 87.5 | 60.6 | 57.6 | 67.7 | 87.7 | 54.9 | 35.8 | 44.4 | 73.4 | 55.0 | 51.3 | 2.8 |
| RoboPoint-13B | 56.5 | 77.2 | 81.5 | 68.2 | 93.2 | 66.3 | 62.4 | 73.2 | 53.3 | 46.6 | 48.3 | 44.4 | 62.0 | 55.1 | 51.4 | 2.8 |
| ASMv2-13B | 58.9 | 68.9 | 68.9 | 66.4 | 92.1 | 69.2 | 59.0 | 71.4 | 63.9 | 46.7 | 59.2 | 44.4 | 56.5 | 56.3 | 57.4 | 3.1 |
| FSD-13B | 62.4 | 86.5 | 88.0 | 80.9 | 94.0 | 75.2 | 65.1 | 76.2 | 73.2 | 63.3 | 60.0 | 46.6 | 70.2 | 63.8 | 63.3 | 1.3 |

🔥 Excellent Object Reference and Free Space Localization Performance

FSD excels in object reference and free space localization tasks. For object reference (RoboRefit), FSD achieves 56.7% accuracy, surpassing GPT-4o (15.3%) and the strongest baseline RoboPoint (49.8%). On free space reference (Where2Place), FSD performs competitively at 45.8%, essentially matching RoboPoint (46.0%) while substantially outperforming the remaining baselines.


Table 2: Performance comparison on object reference and free space localization

| Benchmark | GPT-4o | SpaceLLaVA | LLaVA-NeXT-34B | SpatialBot-3B | ASMv2-13B | RoboBrain-7B | RoboPoint-13B | FSD-13B |
|---|---|---|---|---|---|---|---|---|
| RoboRefit | 15.3 | 21.3 | 19.9 | 23.6 | 48.4 | 10.1 | 49.8 | 56.7 |
| Where2Place | 29.1 | 11.8 | 15.0 | 15.0 | 22.0 | 16.6 | 46.0 | 45.8 |

RoboRefit Results
RoboRefit Task Performance. Visual examples showing FSD's superior object reference capabilities compared to baseline methods.

Where2Place Results
Where2Place Task Performance. Examples demonstrating FSD's ability to identify optimal placement locations in free space.


🎯 Outstanding VABench Performance

On our proposed VABench benchmark, FSD demonstrates exceptional visual aids generation capabilities. FSD achieves 61.82% accuracy on VABench-Point, significantly outperforming GPT-4o (9.30%) and RoboPoint (19.09%). On VABench-VisualTrace, FSD excels with an RMSE of 78.26 and a GPT Score of 6.21.


Table 3: Performance comparison on VABench

(a) VABench-Point

| Model | Accuracy ↑ |
|---|---|
| GPT4o | 9.30 |
| ASMv2 | 10.07 |
| RoboPoint | 19.09 |
| FSD | 61.82 |

(b) VABench-VisualTrace

| Model | RMSE ↓ | MAE ↓ | GPT Score ↑ |
|---|---|---|---|
| GPT4o | 136.13 | 113.53 | 4.37 |
| DINOv2 Predictor | 128.32 | 117.49 | 4.01 |
| FSD | 78.26 | 63.44 | 6.21 |

Real Robot Manipulation Results

FSD demonstrates superior performance in real robot manipulation, achieving a 72% success rate across 8 diverse tasks and showing significant improvements in zero-shot generalization compared to baseline methods.

Real Robot Setup
Real Robot Experimental Setup for the 8 real-world manipulation tasks.

Fold the towel from right to left

Move cucumber between pot and bowl

Move strawberry on the left of brown basket

Pick up sponge and place it outside plate

Pick up strawberry

Place egg in green pot

SimplerEnv Simulation Results

FSD achieves 40.6% average success rate on SimplerEnv benchmark, demonstrating superior zero-shot performance compared to baseline methods. The results validate FSD's effectiveness in simulation environments with diverse manipulation tasks.


Table 4: SimplerEnv Evaluation on WidowX Robot

Baseline results are taken from Qu et al. (2025). ZS: zero-shot; FT: fine-tuned on BridgeData. Each task is evaluated over 24 episodes.

| Model | Spoon→Towel Grasp | Spoon→Towel Succ. | Carrot→Plate Grasp | Carrot→Plate Succ. | Green→Yellow Grasp | Green→Yellow Succ. | Eggplant→Basket Grasp | Eggplant→Basket Succ. | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| RT-1-X (O'Neill et al., 2023) | 16.7% | 0% | 20.8% | 4.2% | 8.3% | 0% | 0.0% | 0% | 1.1% |
| Octo-S (Team et al., 2024) | 77.8% | 47.2% | 27.8% | 9.7% | 40.3% | 4.2% | 87.5% | 56.9% | 30.0% |
| OpenVLA (Kim et al., 2024) | 4.1% | 0% | 33.3% | 0% | 12.5% | 0% | 8.3% | 4.1% | 1.0% |
| RoboVLM (ZS) (Li et al., 2024d) | 37.5% | 20.8% | 33.3% | 25.0% | 8.3% | 8.3% | 0.0% | 0% | 13.5% |
| RoboVLM (FT) (Li et al., 2024d) | 54.2% | 29.2% | 25.0% | 25.0% | 45.8% | 12.5% | 58.3% | 58.3% | 31.3% |
| SpatialVLA (ZS) (Qu et al., 2025) | 25.0% | 20.8% | 41.7% | 20.8% | 58.3% | 25.0% | 79.2% | 70.8% | 34.4% |
| SpatialVLA (FT) (Qu et al., 2025) | 20.8% | 16.7% | 29.2% | 25.0% | 62.5% | 29.2% | 100% | 100% | 42.7% |
| FSD | 58.3% | 41.7% | 58.3% | 50.0% | 91.7% | 33.3% | 37.5% | 37.5% | 40.6% |

Carrot→Plate

Eggplant→Basket

Spoon→Towel

Green→Yellow

BibTeX

@article{fsd2024,
  title={From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation},
  author={Anonymous},
  journal={Under Review},
  year={2024}
}