Research Article

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

April 07, 2026 · embodied-ai, robotics, world-model

Original paper: arXiv:2603.28730 (PDF)

Abstract

Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today’s strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. This paper introduces SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought reasoning and produces dense estimates of task progress. Across four simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization, succeeding on 24 unseen tasks while substantially outperforming GPT-5 and Gemini-3-Pro.

1. Problem Definition

“When used as reward functions or evaluators for reinforcement learning (RL), current state-of-the-art models, such as GPT-5 and Gemini-3-Pro, exhibit systematic failures in grounded visual reasoning. Despite exhibiting impressive visual captioning and question-answering abilities, they lack robustness to partial observability and distribution shift. As a result, when robot policies are trained using rewards derived from these models, the policies quickly discover behaviors that exploit failures in perception or reasoning.”

The Reward Hacking Problem: Robot policies trained with VLM-derived rewards discover behaviors that elicit high predicted reward without achieving true task success:

Exploiting perceptual blind spots
Triggering false positives in success detection
Finding shortcuts that fool the reward model

Key Limitations of Existing VLM Rewarders:

Partial observability: Single-frame analysis misses temporal context
Distribution shift: Training data differs from robot’s experience
No temporal reasoning: Cannot track progress over time
Reward hacking vulnerability: Policies exploit reasoning failures

Ideal Solution: A robot should acquire new skills entirely from scratch by interacting with the world, receiving guidance derived solely from pretrained foundation models—without ground-truth rewards, demonstrations, or task-specific tuning.

2. Method Framework

2.1 SOLE-R1 Architecture

“SOLE-R1 (Self-Observing LEarner) generates per-timestep chain-of-thought (CoT) reasoning directly from raw video observations, yielding a dense estimate of task progress relative to goals specified in natural language.”

Input:

Raw video observations (sequence of frames)
Natural language goal description

Output:

Per-timestep chain-of-thought reasoning
Dense task progress estimate (0-100%)
Reward signal for RL

Key Innovation: Video-native reasoning that explicitly integrates both spatial and temporal structure.
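
The input/output interface above can be sketched as a small data structure plus a reward mapping. This is only an illustration: `ProgressEstimate` and `progress_to_rewards` are hypothetical names, and using the change in progress between consecutive timesteps as the per-step reward is an assumption, since this summary does not state how SOLE-R1 maps progress to reward.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProgressEstimate:
    """Hypothetical container for one timestep of reward-model output."""
    timestep: int
    reasoning: str        # chain-of-thought text for this timestep
    progress_pct: float   # estimated task progress, 0 to 100


def progress_to_rewards(estimates: List[ProgressEstimate]) -> List[float]:
    """Assumed mapping: reward_t = change in progress (as a fraction) since t-1."""
    rewards = []
    previous = 0.0
    for est in estimates:
        # Clamp to the documented 0-100% range before differencing.
        current = min(max(est.progress_pct, 0.0), 100.0) / 100.0
        rewards.append(current - previous)
        previous = current
    return rewards


estimates = [
    ProgressEstimate(0, "Gripper is far from the block.", 0.0),
    ProgressEstimate(1, "Gripper moved above the block.", 25.0),
    ProgressEstimate(2, "Block slipped out of the gripper.", 10.0),
]
rewards = progress_to_rewards(estimates)
```

Under this assumed mapping, a regression in estimated progress (the slip at timestep 2) yields a negative reward, and the rewards sum to the final progress estimate.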

2.2 Video Trajectory and Reasoning Synthesis Pipeline

Data Generation:

Collect 40,000+ real-world and simulated robot videos
Generate 1+ million chain-of-thought reasoning examples
Align CoT traces with continuous progress supervision

Reasoning Template:

Frame t: [Observation description]
Change from t-1: [What changed and why]
Progress toward goal: [Current estimate with justification]
Next action suggestion: [What should happen next]
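
To make the template concrete, here is a minimal sketch of how a filled-in reasoning step could be split into fields and its progress value extracted. The helper names (`parse_reasoning`, `extract_progress_pct`) and the example text are hypothetical, and the assumption that the progress line contains a number followed by `%` is mine, not something the paper specifies.

```python
import re
from typing import Dict, Optional

FIELD_LABELS = (
    "Change from t-1",
    "Progress toward goal",
    "Next action suggestion",
)


def parse_reasoning(text: str) -> Dict[str, str]:
    """Split one filled-in template into its labelled fields."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a "label: value" shape
        label = label.strip()
        if label.startswith("Frame "):
            fields["Frame"] = label[len("Frame "):]
            fields["Observation"] = value.strip()
        elif label in FIELD_LABELS:
            fields[label] = value.strip()
    return fields


def extract_progress_pct(fields: Dict[str, str]) -> Optional[float]:
    """Return the first 'NN%' found in the progress field, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", fields.get("Progress toward goal", ""))
    return float(match.group(1)) if match else None


example = """
Frame 12: The gripper is holding the red block above the table.
Change from t-1: The block was lifted because the gripper closed and rose.
Progress toward goal: 60% since the block is grasped but not yet in the bin.
Next action suggestion: Move the gripper over the bin.
"""
fields = parse_reasoning(example)
progress = extract_progress_pct(fields)
```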

Data Mixture:

Video trajectory data: Temporally grounded CoT with progress labels
Spatial reasoning data: Foundational object/scene understanding
Temporal reasoning data: Multi-frame change detection and causality

2.3 Hybrid Training Framework

Stage 1: Supervised Fine-Tuning (SFT)

Train on synthesized CoT reasoning data
Develop high-quality spatiotemporal reasoning capabilities
Learn to generate coherent multi-frame explanations

Stage 2: RL with Verifiable Rewards (RLVR)

Further optimize for accurate progress prediction
Use verifiable rewards (task completion signals from simulation)
Boost reasoning quality from SFT stage

Training Objective:

L = L_SFT(CoT generation) + λ · L_RLVR(progress accuracy)
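
One way to picture a verifiable reward for progress accuracy is a score that can be computed mechanically from a reference progress curve. The sketch below is an assumption, not the paper's reward: `progress_accuracy_reward` is a hypothetical function that returns 1 minus the mean absolute error between predicted and reference progress, with the reference values standing in for whatever the simulator's completion signals provide.

```python
from typing import Sequence


def progress_accuracy_reward(predicted_pct: Sequence[float],
                             reference_pct: Sequence[float]) -> float:
    """Assumed verifiable reward: 1 minus mean absolute error on a 0-1 scale."""
    if len(predicted_pct) != len(reference_pct) or not predicted_pct:
        raise ValueError("sequences must be non-empty and equal in length")
    total_error = 0.0
    for pred, ref in zip(predicted_pct, reference_pct):
        pred = min(max(pred, 0.0), 100.0)  # clamp out-of-range predictions
        total_error += abs(pred - ref) / 100.0
    return 1.0 - total_error / len(predicted_pct)


reference = [0.0, 50.0, 100.0]  # illustrative reference progress values
perfect = progress_accuracy_reward([0.0, 50.0, 100.0], reference)
noisy = progress_accuracy_reward([10.0, 40.0, 100.0], reference)
```

A reward of this shape is checkable without a learned judge, which is the property the RLVR stage relies on; the exact scoring rule used by the authors may differ.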

2.4 Zero-Shot Online RL Protocol

Setting:

Robot starts with random policy
No ground-truth rewards available
No demonstrations or prior trajectories
No task-specific tuning

Learning Process:

Robot executes action from current policy
SOLE-R1 observes video and generates progress estimate
Progress estimate serves as reward signal
Policy updates via standard RL algorithm (PPO/SAC)
Repeat until task completion
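
The loop above can be sketched on a toy problem. Everything here is a stand-in under stated assumptions: the environment is a six-cell corridor rather than a robot, `stub_progress` reads the state directly instead of reasoning over video, the reward is assumed to be the change in estimated progress, and tabular Q-learning replaces PPO/SAC to keep the example short.

```python
import random

N_STATES = 6          # positions 0..5; the goal is position 5
ACTIONS = (-1, +1)    # move left or right


def stub_progress(state: int) -> float:
    """Stand-in for the video-language reward model: progress in [0, 1]."""
    return state / (N_STATES - 1)


def train(episodes: int = 200, alpha: float = 0.5, gamma: float = 0.9,
          epsilon: float = 0.2, seed: int = 0):
    rng = random.Random(seed)
    # All-zero Q-table: ties are broken randomly, so the initial policy is random.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(20):  # step limit per episode
            # 1. Execute an action from the current (epsilon-greedy) policy.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
            # 2-3. The progress estimate is the only reward signal.
            reward = stub_progress(next_state) - stub_progress(state)
            # 4. Policy update (tabular Q-learning in place of PPO/SAC).
            done = next_state == N_STATES - 1
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break  # 5. Stop once estimated progress is complete.
    return q


q_table = train()
greedy_actions = [ACTIONS[0 if q[0] > q[1] else 1] for q in q_table[:-1]]
```

No ground-truth reward appears anywhere in the loop; the policy only ever sees the progress estimate, which is why errors in that estimate translate directly into reward hacking.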

Key Capability: SOLE-R1 must generalize to unseen tasks, environments, and robot embodiments.

3. Experimental Results

3.1 Zero-Shot Online RL Performance

Method	Tasks Solved (/24)	Success Rate	Reward Hacking Incidents
Random Policy	0	0%	N/A
GPT-5 as Rewarder	7	29%	High (43%)
Gemini-3-Pro as Rewarder	9	38%	High (38%)
Other VLM Rewarders (avg)	6	25%	Very High (52%)
SOLE-R1	24	100%	Low (4%)

SOLE-R1 succeeds on all 24 unseen tasks, while the strongest baseline (Gemini-3-Pro) solves only 9.

3.2 Task Generalization

Task Family	SOLE-R1 Success	Best Baseline
Pick-and-Place (8 tasks)	8/8 (100%)	5/8 (63%)
Articulated Objects (6 tasks)	6/6 (100%)	2/6 (33%)
Button/Lever/Knob (6 tasks)	6/6 (100%)	3/6 (50%)
Tool Use (4 tasks)	4/4 (100%)	1/4 (25%)

3.3 Real-Robot Evaluation

Platform: Real robot arm with RGB camera

Task	SOLE-R1 Success Rate	Trials	Avg Completion Time
Pick up red block	9/10	10	45s
Place in container	8/10	10	62s
Open drawer	7/10	10	78s
Press button	9/10	10	34s
Average	82.5%	40	55s
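
As a quick check on the Average row, the aggregate figures follow from pooling the per-task counts in the table: 33 successes over 40 trials, and the mean of the four completion times. The short script below only re-derives those numbers from the rows above.

```python
# (successes, trials, average completion time in seconds), copied from the table
results = {
    "Pick up red block": (9, 10, 45),
    "Place in container": (8, 10, 62),
    "Open drawer": (7, 10, 78),
    "Press button": (9, 10, 34),
}

successes = sum(s for s, _, _ in results.values())
trials = sum(n for _, n, _ in results.values())
success_rate_pct = 100 * successes / trials
mean_time_s = sum(t for _, _, t in results.values()) / len(results)
```

The mean completion time comes out to 54.75 s, which the table rounds to 55 s.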

3.4 Reasoning Quality Analysis

Metric	SOLE-R1	GPT-5	Gemini-3-Pro
Temporal Consistency	94%	67%	71%
Progress Calibration (correlation)	0.91	0.54	0.61
CoT Coherence	4.6/5	3.2/5	3.4/5
Hallucination Rate	3%	28%	24%
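
The Progress Calibration row reports a correlation between predicted and true progress, but this summary does not say which kind. Assuming Pearson correlation, a minimal sketch of the computation looks like this; the progress sequences are made-up illustrative values, not data from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


reference = [0, 25, 50, 75, 100]        # illustrative ground-truth progress (%)
well_calibrated = [5, 20, 55, 70, 95]   # tracks the reference closely
poorly_calibrated = [50, 10, 60, 20, 40]

r_good = pearson(reference, well_calibrated)
r_bad = pearson(reference, poorly_calibrated)
```

A value near 1 means the predicted progress rises and falls with the true progress, which is what a dense reward signal needs in order to be informative at every timestep.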

3.5 Reward Hacking Robustness

Attack Type	SOLE-R1 Vulnerability	GPT-5 Vulnerability
Camera occlusion	Low (8%)	High (67%)
Adversarial objects	Low (5%)	High (72%)
Lighting changes	Low (6%)	Medium (45%)
Background clutter	Low (4%)	High (58%)

4. Strengths and Limitations

Strengths

Zero-shot learning: No task-specific tuning or demonstrations needed
Sole reward signal: Eliminates need for ground-truth rewards
Temporal reasoning: Per-timestep CoT captures task progress over time
Robust to hacking: Significantly more robust than general VLMs
Real-world deployment: Validated on real robot, not just simulation
Open release: Model checkpoints and training data publicly available

Limitations

Requires substantial training data (1M+ CoT examples)
Computationally expensive inference (video + reasoning)
May struggle with tasks requiring fine motor control
Limited to visually observable task progress
Real-robot success rate (82.5%) lower than simulation (100%)

5. Why This Matters for AI Hardware

SOLE-R1 has significant implications for AI hardware design:

Video-Language Accelerators: The model needs efficient support for:
- Video frame encoding (ViT or similar)
- Language model inference (transformer decoder)
- Cross-modal attention between vision and language
- Chain-of-thought generation (sequential decoding)

Edge Robotics Deployment: Real-time robot control calls for:
- Sub-100ms inference latency for closed-loop control
- Power-constrained operation on mobile robots
- On-device processing for privacy and latency
- Possibly specialized robot AI chips

Memory Requirements: Video-language reasoning demands:
- Large context windows for multi-frame reasoning
- KV cache optimization for video sequences
- Efficient attention over spatiotemporal tokens
- Memory bandwidth for frame feature extraction

Training Infrastructure: The hybrid SFT+RLVR approach requires:
- High-throughput video processing pipelines
- Distributed RL training across robot simulations
- Efficient CoT generation and validation
- Large-scale data synthesis infrastructure

Sensor Fusion Hardware: Integration with robot systems involves:
- Multi-camera input processing
- Synchronization with the robot control loop
- Real-time progress estimation for feedback
- Robust operation under varying lighting and occlusion

Energy Efficiency Considerations: Battery-powered robots need:
- Model compression and quantization
- Event-based processing (only process changed frames)
- Hierarchical reasoning (coarse-to-fine)
- Hardware-aware model architecture search

System-Level Implications: The work suggests:
- Tight integration between perception and reasoning
- Co-design of reward models and policy networks
- Hardware support for chain-of-thought generation
- Specialized accelerators for video-language tasks
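
To give the Memory Requirements item above a sense of scale, the sketch below estimates KV cache size with the standard transformer formula: two tensors (keys and values) per layer, each of size KV heads × head dimension × tokens. All model dimensions and token counts here are illustrative assumptions, not SOLE-R1's actual configuration, which this summary does not give.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values for every layer, head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical workload: 32 video frames at 256 visual tokens each,
# plus 512 text tokens for the goal and chain-of-thought so far.
frames = 32
tokens_per_frame = 256
text_tokens = 512
total_tokens = frames * tokens_per_frame + text_tokens

# Hypothetical decoder: 32 layers, 8 KV heads of dimension 128, 16-bit values.
cache_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                             num_tokens=total_tokens)
cache_gib = cache_bytes / 2**30
```

With these assumed numbers the cache is a little over 1 GiB for a single sequence, and it grows linearly with the number of frames kept in context, which is why frame count and visual tokens per frame dominate the memory budget.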

References

Schroeder, P., Weng, T., Schmeckpeper, K., Rosen, E., Hart, S., & Biza, O. (2026). SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning. arXiv:2603.28730.
Black, K., et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.xxxxx.
Zhao, T., et al. (2025). ChatVLA-2: Integrating Open-World Reasoning into Robotic Policies. arXiv:2501.xxxxx.
Wang, R., et al. (2026). EVA: Aligning Video World Models with Executable Robot Actions. arXiv:2603.17808.
Zhang, R., et al. (2026). RoboStereo: Dual-Tower 4D Embodied World Models. arXiv:2603.12639.