From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while built on top of general Vision-Language Models (VLMs), still fall short of robust zero-shot performance due to the scarcity and heterogeneity of embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data construction pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validate FSD's capabilities in both "seeing" and "doing", achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on VABench, our proposed and more challenging benchmark. We also verify FSD's zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real-robot settings. Experimental results show that FSD achieves a 40.6% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.

Method Overview

FSD Framework
Overview of the FSD Framework. FSD unlocks visual aids reasoning and generation through Spatial Relationship-Focused CoT, demonstrating generalization capabilities that enable zero-shot robot manipulation and strong performance across multiple benchmarks.


🎯 Motivation & Challenges

Current Vision-Language-Action (VLA) models fall short of robust zero-shot performance due to two fundamental limitations (the "Seeing to Doing" issue):

  • (1) Scarcity: Robotics data remains limited compared to language and vision datasets, preventing the scaling laws observed in those domains from emerging
  • (2) Heterogeneity: Significant variation across robot platforms makes end-to-end learning from vision to diverse action outputs challenging

🔬 Our Solution: FSD Pipeline

We present FSD (From Seeing to Doing), a novel pipeline that addresses generalization challenges through spatial relationship reasoning and visual aids generation. Our approach leverages VLMs' visual understanding capabilities, augmented with step-by-step reasoning to extract unified mid-level structural representations independent of robot embodiment.

FSD Pipeline
FSD Pipeline Overview. Inspired by the process of human reasoning, FSD uses a spatial relationship graph as an anchor to derive a visual chain-of-thought reasoning process for visual trace generation.

📍 Visual Aids Representation

Our mid-level representation includes spatial affordance boxes/points and visual traces, each represented as marked coordinates within visual images. These visual aids provide expressive spatial information, a compact representation, and embodiment-independent guidance.
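
To make this concrete, here is a minimal sketch of how such visual aids could be represented in code. The class and field names below are illustrative assumptions for exposition, not FSD's actual interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative sketch only: the class and field names are assumptions for
# exposition, not FSD's actual data format.

Point = Tuple[int, int]  # (x, y) pixel coordinates in the image

@dataclass
class SpatialAffordance:
    """Where to act: a box and/or point marked on the image."""
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)
    point: Point                    # e.g., a grasp or placement point

@dataclass
class VisualTrace:
    """How to move: an ordered sequence of 2D waypoints in image space."""
    waypoints: List[Point]

@dataclass
class VisualAids:
    """Mid-level, embodiment-independent guidance produced by the model."""
    instruction: str
    affordance: SpatialAffordance
    trace: VisualTrace
```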


🏗️ Three Core Components

1. Spatial Relationship-Focused Visual Chain-of-Thought (Sr-CoT)

Conducts multi-step reasoning anchored by object coordinates and spatial relationships, treating visual aid generation as a reasoning process. This enables the model to understand complex spatial configurations and generate appropriate visual guides.
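
As a rough illustration of what such a reasoning trace might look like, the prompt skeleton below spells out the stages (objects → spatial relationship graph → affordance → visual trace). The wording and step structure are our assumptions, not FSD's exact prompts.

```python
# Illustrative prompt skeleton for the Sr-CoT stages described above. The
# wording and step structure are assumptions for exposition, not the exact
# prompts used to train FSD.

SR_COT_PROMPT = """\
Instruction: {instruction}

Step 1 (Objects): List the task-relevant objects and their image
coordinates as [x, y] points or [x1, y1, x2, y2] boxes.

Step 2 (Spatial relationship graph): Describe the pairwise spatial
relations (left of, on top of, inside, behind, ...) between these objects,
anchored to the coordinates from Step 1.

Step 3 (Spatial affordance): Based on these relations, output the box/point
where the manipulation should happen.

Step 4 (Visual trace): Output an ordered list of 2D waypoints [[x, y], ...]
describing how the end effector should move toward the goal region.
"""

def build_sr_cot_query(instruction: str) -> str:
    """Fill the staged reasoning template with a task instruction."""
    return SR_COT_PROMPT.format(instruction=instruction)
```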

2. Weak-to-Strong Data Construction Pipeline

Combines large-scale embodied datasets with common sense data, establishing a weak-to-strong capability enhancement training process. This systematic approach captures fine-grained spatial relationships and manipulation skills across diverse scenarios.
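
The snippet below sketches one way such a staged curriculum could be organized in code; the data-source names and the two-stage split are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical sketch of a weak-to-strong training schedule. The data-source
# names and the two-stage split are assumptions, not the paper's exact recipe.

STAGES = [
    # Stage 1 ("weak"): broad common-sense and spatial QA data that teaches
    # coordinate-grounded spatial reasoning.
    {"name": "spatial_commonsense", "sources": ["general_vqa", "spatial_qa"]},
    # Stage 2 ("strong"): large-scale embodied data annotated with spatial
    # affordances and visual traces for fine-grained manipulation guidance.
    {"name": "embodied_visual_aids", "sources": ["oxe", "bridgedata", "droid"]},
]

def training_schedule(stages=STAGES):
    """Yield (stage_name, data_source) pairs in curriculum order."""
    for stage in stages:
        for source in stage["sources"]:
            yield stage["name"], source

for stage_name, source in training_schedule():
    print(f"train on {source} during stage '{stage_name}'")
```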

3. Self-Consistency Mechanism

Aligns understanding and generation by binding spatial coordinates with specific visual signals. This ensures robust and consistent decision-making across different scenarios by maintaining coherence between spatial reasoning and visual perception.
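
A minimal sketch of such a coordinate-visual consistency check is shown below, assuming the model also has access to a bounding box for the referenced object; it illustrates the idea rather than FSD's actual training objective.

```python
import math

# Minimal sketch of a coordinate-visual consistency check. The box/point
# representation and the distance threshold are assumptions for illustration.

def point_in_box(point, box):
    """True if an (x, y) point lies inside an (x1, y1, x2, y2) box."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def is_self_consistent(affordance_point, trace, target_box, max_start_gap=50.0):
    """Check that predicted coordinates agree with the visual evidence: the
    affordance point falls inside the referenced object's box, and the visual
    trace starts near the affordance point (distances in pixels)."""
    if not point_in_box(affordance_point, target_box):
        return False
    if trace and math.dist(trace[0], affordance_point) > max_start_gap:
        return False
    return True

# Example: a prediction whose trace starts at the affordance point is kept.
assert is_self_consistent((120, 200), [(120, 200), (160, 180)], (100, 180, 150, 230))
```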

VABench: Visual Aids Generation Benchmark

We propose VABench, a challenging benchmark with 300 manually annotated problems from real-world and simulation datasets (OXE, BridgeData, Droid). VABench evaluates models' abilities to generate spatial affordance and visual traces from natural language instructions, using metrics including point accuracy, trajectory MAE/RMSE, and GPT-based qualitative scoring.
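
For reference, the sketch below shows plausible implementations of these metrics; the exact trajectory alignment and resampling protocol used by VABench is our assumption.

```python
import numpy as np

# Plausible reference implementations of VABench-style metrics. The exact
# trajectory alignment/resampling protocol used in the paper is an assumption.

def resample(trace: np.ndarray, n: int) -> np.ndarray:
    """Linearly resample an (m, 2) waypoint array to n points."""
    t_old = np.linspace(0.0, 1.0, len(trace))
    t_new = np.linspace(0.0, 1.0, n)
    return np.stack([np.interp(t_new, t_old, trace[:, d]) for d in range(2)], axis=1)

def trace_mae_rmse(pred: np.ndarray, gt: np.ndarray, n: int = 32):
    """Pixel-space MAE and RMSE between predicted and ground-truth traces."""
    p, g = resample(pred, n), resample(gt, n)
    err = np.linalg.norm(p - g, axis=1)  # per-waypoint Euclidean error
    return err.mean(), np.sqrt((err ** 2).mean())

def point_accuracy(pred_points, gt_boxes) -> float:
    """Fraction of predicted (x, y) points that fall inside their target region
    (approximated here with ground-truth boxes)."""
    hits = [
        x1 <= x <= x2 and y1 <= y <= y2
        for (x, y), (x1, y1, x2, y2) in zip(pred_points, gt_boxes)
    ]
    return sum(hits) / len(hits)
```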


FSD VABench Results
FSD Prediction Results on VABench. Our model demonstrates strong performance in generating spatial affordance and visual traces from natural language instructions across diverse scenarios.


Evaluation of Spatial Understanding and Reasoning Capabilities

🏆 Outstanding Performance on Spatial Reasoning Benchmarks

We evaluate FSD-13B on 5 spatial reasoning benchmarks (CVBench, CRPE, SAT, BLINK, EmbSp), where it achieves the best overall rank of 1.3 across all benchmarks.


Table 1: Performance comparison on 5 spatial reasoning benchmarks

| Model | CVBench Count | CVBench 2DRel | CVBench 3DDep | CVBench Avg. | CRPE Exist. | CRPE Subj. | CRPE Pred. | CRPE Avg. | SAT Val | SAT Real | BLINK Count | BLINK MV | BLINK RelDepth | BLINK Avg. | EmbSp Test | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4V | 62.4 | 71.1 | 79.8 | 70.4 | 90.6 | 76.7 | 65.1 | 75.2 | 44.8 | 50.7 | 60.8 | 55.6 | 59.7 | 62.2 | 36.1 | - |
| GPT-4o | 65.9 | 85.5 | 87.8 | 79.4 | 93.3 | 81.9 | 71.8 | 80.2 | 49.4 | 57.5 | 49.2 | 60.2 | 74.2 | 63.2 | 49.1 | - |
| LLaVA-1.5-13B | 58.2 | 46.6 | 53.0 | 51.4 | 88.7 | 57.4 | 54.2 | 63.9 | 51.4 | 41.6 | 45.0 | 41.4 | 53.2 | 52.4 | 35.1 | 4.8 |
| SAT-Dynamic-13B | 61.5 | 89.7 | 80.7 | 76.2 | 87.5 | 60.6 | 57.6 | 67.7 | 87.7 | 54.9 | 35.8 | 44.4 | 73.4 | 55.0 | 51.3 | 2.8 |
| RoboPoint-13B | 56.5 | 77.2 | 81.5 | 68.2 | 93.2 | 66.3 | 62.4 | 73.2 | 53.3 | 46.6 | 48.3 | 44.4 | 62.0 | 55.1 | 51.4 | 2.8 |
| ASMv2-13B | 58.9 | 68.9 | 68.9 | 66.4 | 92.1 | 69.2 | 59.0 | 71.4 | 63.9 | 46.7 | 59.2 | 44.4 | 56.5 | 56.3 | 57.4 | 3.1 |
| FSD-13B | 62.4 | 86.5 | 88.0 | 80.9 | 94.0 | 75.2 | 65.1 | 76.2 | 73.2 | 63.3 | 60.0 | 46.6 | 70.2 | 63.8 | 63.3 | 1.3 |

🔥 Excellent Object Reference and Free Space Localization Performance

FSD excels in object reference and free space localization tasks. For object reference (RoboRefit), FSD achieves 56.7% accuracy, surpassing GPT-4o (15.3%) and the strongest baseline RoboPoint (49.8%). On free space reference (Where2Place), FSD performs competitively at 45.8%, essentially matching RoboPoint (46.0%) while substantially outperforming the remaining baselines.


Table 2: Performance comparison on object reference and free space localization

| Benchmark | GPT-4o | SpaceLLaVA | LLaVA-NeXT-34B | SpatialBot-3B | ASMv2-13B | RoboBrain-7B | RoboPoint-13B | FSD-13B |
|---|---|---|---|---|---|---|---|---|
| RoboRefit | 15.3 | 21.3 | 19.9 | 23.6 | 48.4 | 10.1 | 49.8 | 56.7 |
| Where2Place | 29.1 | 11.8 | 15.0 | 15.0 | 22.0 | 16.6 | 46.0 | 45.8 |

RoboRefit Results
RoboRefit Task Performance. Visual examples showing FSD's superior object reference capabilities compared to baseline methods.

Where2Place Results
Where2Place Task Performance. Examples demonstrating FSD's ability to identify optimal placement locations in free space.


🎯 Outstanding VABench Performance

On our proposed VABench benchmark, FSD demonstrates exceptional visual aids generation capabilities. FSD achieves 61.82% accuracy on VABench-Point, significantly outperforming GPT-4o (9.30%) and RoboPoint (19.09%). On VABench-VisualTrace, FSD excels with an RMSE of 78.26 and a GPT Score of 6.21.


Table 3: Performance comparison on VABench

(a) VABench-Point

| Model | Accuracy ↑ |
|---|---|
| GPT4o | 9.30 |
| ASMv2 | 10.07 |
| RoboPoint | 19.09 |
| FSD | 61.82 |

(b) VABench-VisualTrace

| Model | RMSE ↓ | MAE ↓ | GPT Score ↑ |
|---|---|---|---|
| GPT4o | 136.13 | 113.53 | 4.37 |
| DINOv2 Predictor | 128.32 | 117.49 | 4.01 |
| FSD | 78.26 | 63.44 | 6.21 |

Real Robot Manipulation Results

FSD demonstrates superior performance in real robot manipulation, achieving a 72% success rate across 8 diverse tasks and showing significant improvements in zero-shot generalization compared to baseline methods.

Real Robot Setup
Real Robot Experimental Setup for the 8 real-world manipulation tasks.

Fold the towel from right to left

Move cucumber between pot and bowl

Move strawberry on the left of brown basket

Pick up sponge and place it outside plate

Pick up strawberry

Place egg in green pot

SimplerEnv Simulation Results

FSD achieves 40.6% average success rate on SimplerEnv benchmark, demonstrating superior zero-shot performance compared to baseline methods. The results validate FSD's effectiveness in simulation environments with diverse manipulation tasks.


Table 4: SimplerEnv Evaluation on WidowX Robot

Baseline results are taken from Qu et al. (2025). ZS: zero-shot; FT: fine-tuned on BridgeData. Each task is evaluated over 24 episodes.

| Model | Spoon→Towel Grasp | Spoon→Towel Succ. | Carrot→Plate Grasp | Carrot→Plate Succ. | Green→Yellow Grasp | Green→Yellow Succ. | Eggplant→Basket Grasp | Eggplant→Basket Succ. | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| RT-1-X (O'Neill et al., 2023) | 16.7% | 0% | 20.8% | 4.2% | 8.3% | 0% | 0.0% | 0% | 1.1% |
| Octo-S (Team et al., 2024) | 77.8% | 47.2% | 27.8% | 9.7% | 40.3% | 4.2% | 87.5% | 56.9% | 30.0% |
| OpenVLA (Kim et al., 2024) | 4.1% | 0% | 33.3% | 0% | 12.5% | 0% | 8.3% | 4.1% | 1.0% |
| RoboVLM (ZS) (Li et al., 2024d) | 37.5% | 20.8% | 33.3% | 25.0% | 8.3% | 8.3% | 0.0% | 0% | 13.5% |
| RoboVLM (FT) (Li et al., 2024d) | 54.2% | 29.2% | 25.0% | 25.0% | 45.8% | 12.5% | 58.3% | 58.3% | 31.3% |
| SpatialVLA (ZS) (Qu et al., 2025) | 25.0% | 20.8% | 41.7% | 20.8% | 58.3% | 25.0% | 79.2% | 70.8% | 34.4% |
| SpatialVLA (FT) (Qu et al., 2025) | 20.8% | 16.7% | 29.2% | 25.0% | 62.5% | 29.2% | 100% | 100% | 42.7% |
| FSD | 58.3% | 41.7% | 58.3% | 50.0% | 91.7% | 33.3% | 37.5% | 37.5% | 40.6% |

Carrot→Plate

Eggplant→Basket

Spoon→Towel

Green→Yellow

BibTeX

@article{fsd2024,
  title={From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation},
  author={Anonymous},
  journal={Under Review},
  year={2024}
}