Research Article

RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

April 03, 2026 · world-model, robotics, embodied-ai

Original paper: arXiv:2603.12639 (PDF)

Abstract

Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. This paper introduces RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, the authors present the first unified framework for world-model-based policy optimization, achieving >97% average relative improvement on fine-grained manipulation tasks.

1. Problem Definition

“Real-world interaction remains prohibitively expensive, slow, and inherently unsafe, severely constraining the collection of large-scale, diverse training data and impeding safe action verification and refinement at deployment time.”

Embodied World Models (EWMs) offer promise through imagined rollouts, but existing approaches suffer from:

Geometric hallucinations: Object teleportation, scale drift, surface penetration
Physics violations: Anti-gravity motion, contactless manipulation
Lack of unified optimization: No framework exists for leveraging EWMs across the full spectrum from inference-time verification to training-time refinement

Current models like Cosmos 2.5 produce visually appealing frames but lack explicit geometric grounding, leading to physically implausible predictions that limit reliability in robotic applications.

2. Method Framework

2.1 Symmetric Dual-Tower DiT Architecture

RoboStereo employs a symmetric twin backbone Diffusion Transformer (DiT) that processes RGB videos and 3D pointmaps as mutually reinforcing modalities:

Key Design Elements:

Video Tower: Processes RGB sequences for semantic understanding
Pointmap Tower: Processes XYZ coordinate sequences for geometric constraints
Bidirectional Cross-Attention: Each tower queries information from the other
- Video branch incorporates geometric constraints for structural rigidity
- Pointmap branch leverages semantic context to refine object geometry

4D Gaussian Splatting Head: Renders generated trajectories into photorealistic observations from flexible viewpoints, supporting multi-view supervision during policy learning.
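The bidirectional cross-attention idea above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a single attention head, no learned projections, and tiny hand-written token lists, and the names `cross_attend` and `bidirectional_exchange` are hypothetical. It only shows the data flow in which each tower queries the other and keeps its own features through a residual connection.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Each query token attends over the context tokens.

    Simplification (assumption): single head, no learned Q/K/V projections.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product scores between this query and every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        # Weighted average of the context tokens.
        mixed = [sum(w * k[i] for w, k in zip(weights, context)) for i in range(dim)]
        # Residual: keep the tower's own features, add what it read from the other tower.
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out

def bidirectional_exchange(video_tokens, point_tokens):
    # Both directions read the pre-update tokens, so the exchange is symmetric.
    new_video = cross_attend(video_tokens, point_tokens)
    new_point = cross_attend(point_tokens, video_tokens)
    return new_video, new_point

# Toy tokens (illustrative values only).
video = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
points = [[0.2, 0.8], [0.9, 0.1]]
new_video, new_points = bidirectional_exchange(video, points)
```

Each tower keeps its own sequence length after the exchange; only the feature content is updated with information from the other modality.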

2.2 Frame-Level Action Control

Actions are encoded as continuous vectors and injected via a dual-path mechanism:

Implicit fusion with diffusion timestep embedding
Explicit modulation offsets (γ, β, α) for Adaptive Layer Normalization (AdaLN)

This ensures generated trajectories faithfully follow action sequences at each denoising step.
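A minimal sketch of the explicit modulation path, under stated assumptions: it follows the common DiT-style convention where γ scales the normalized features as (1 + γ), β shifts them, and α gates the residual branch, and it assumes the action-derived offsets are simply added to the timestep-derived modulation. The summary does not confirm either detail, and `adaln_block` and its arguments are hypothetical names.

```python
import math

def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_block(x, base_mod, action_offset, sublayer):
    """Apply one AdaLN-modulated residual sublayer to a single token.

    base_mod and action_offset are dicts with 'gamma', 'beta', 'alpha' lists.
    Assumption: action offsets are added to the timestep-derived modulation.
    """
    gamma = [g + dg for g, dg in zip(base_mod["gamma"], action_offset["gamma"])]
    beta = [b + db for b, db in zip(base_mod["beta"], action_offset["beta"])]
    alpha = [a + da for a, da in zip(base_mod["alpha"], action_offset["alpha"])]
    h = layer_norm(x)
    h = [hi * (1.0 + g) + b for hi, g, b in zip(h, gamma, beta)]  # scale and shift
    h = sublayer(h)                                               # e.g. attention or MLP
    return [xi + a * hi for xi, a, hi in zip(x, alpha, h)]        # gated residual

x = [1.0, 2.0, 3.0]
zeros = {"gamma": [0.0] * 3, "beta": [0.0] * 3, "alpha": [0.0] * 3}
open_gate = {"gamma": [0.0] * 3, "beta": [0.0] * 3, "alpha": [1.0] * 3}
identity = lambda h: h

closed = adaln_block(x, zeros, zeros, identity)      # gate at zero: block is a no-op
opened = adaln_block(x, zeros, open_gate, identity)  # action offset opens the gate
```

With the gate at zero the block passes the token through unchanged, so any action-dependent change in the output comes only from the offsets.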

2.3 Unified Policy Optimization Framework

The paper introduces three complementary paradigms:

1. Test-Time Policy Augmentation (TTPA), applied at inference time

Zero-shot pre-execution mechanism
Validates and refines candidate action sequences through imagined rollouts
Acts as a predictive safeguard before physical execution
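The pre-execution check described above can be sketched as a simple "imagine, score, select" loop. This is an illustrative sketch only: the one-dimensional toy dynamics, the scoring function, and the names `rollout` and `verify_before_execution` are assumptions, standing in for the learned world model and whatever verification criterion the paper actually uses.

```python
def rollout(world_model, state, actions):
    """Imagine the sequence of states produced by an action sequence."""
    frames = []
    for a in actions:
        state = world_model(state, a)
        frames.append(state)
    return frames

def verify_before_execution(world_model, score_fn, state, candidates, min_score):
    """Score every candidate plan in imagination and return the best one,
    or None if even the best plan falls below the safety threshold."""
    scores = [score_fn(rollout(world_model, state, plan)) for plan in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    if scores[best] < min_score:
        return None, scores[best]
    return candidates[best], scores[best]

# Toy stand-ins (assumptions): 1-D state, additive dynamics, goal position 1.0.
toy_model = lambda s, a: s + a
toy_score = lambda frames: -abs(frames[-1] - 1.0)

candidates = [[0.5, 0.5], [0.2, 0.2], [1.0, 0.5]]
plan, score = verify_before_execution(toy_model, toy_score, 0.0, candidates, min_score=-0.1)
rejected, low = verify_before_execution(toy_model, toy_score, 0.0, [[0.2, 0.2]], min_score=-0.1)
```

Returning `None` when no candidate clears the threshold reflects the safeguard role: nothing is sent to the robot unless an imagined rollout looks acceptable.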

2. Imitative-Evolutionary Policy Learning (IEPL)

Leverages dense visual-imitation rewards from flexible-view imagined trajectories
Uses LPIPS (perceptual distance) computed frame-by-frame from multiple viewpoints
Combines sample efficiency of imitation with exploratory power of GRPO
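As a rough sketch of how a dense imitation reward could feed a GRPO-style update: each imagined rollout is compared frame by frame, across viewpoints, with an expert trajectory, and the resulting rewards are normalized within the sampled group. The `l1_distance` function below is a toy stand-in for LPIPS, and the data layout and function names are hypothetical; only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import statistics

def l1_distance(frame_a, frame_b):
    # Toy stand-in for a perceptual distance such as LPIPS (assumption).
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def imitation_reward(rollout_views, expert_views, distance):
    """rollout_views[v][t] is the frame from viewpoint v at time t.
    Reward is the negative mean distance over all views and frames."""
    dists = [distance(f, e)
             for rv, ev in zip(rollout_views, expert_views)
             for f, e in zip(rv, ev)]
    return -sum(dists) / len(dists)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One viewpoint, two frames, two "pixels" per frame (illustrative values only).
expert = [[[0.0, 0.0], [1.0, 1.0]]]
group = [
    [[[0.0, 0.0], [1.0, 1.0]]],  # matches the expert
    [[[0.5, 0.5], [1.0, 1.0]]],  # partially off
    [[[1.0, 1.0], [1.0, 1.0]]],  # first frame far off
]
rewards = [imitation_reward(r, expert, l1_distance) for r in group]
advantages = group_relative_advantages(rewards)
```

Because every frame contributes to the reward, rollouts that are only slightly worse than the expert still receive a graded signal rather than a binary success or failure.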

3. Open-Exploration Policy Learning (OEPL)

Self-supervised approach for autonomous skill discovery
Enables self-correction in the absence of expert demonstrations

3. Experimental Results

3.1 Generation Quality

RoboStereo achieves state-of-the-art performance in physics adherence and 3D geometric accuracy:

Outperforms Cosmos 2.5, Veo 3.1, and other baselines in physical plausibility
Maintains temporal consistency across long-horizon sequences
Generates photorealistic videos with accurate geometric grounding

3.2 Policy Optimization Results

Method	Coffee	Square	StackThree	Average
Base Policy	-	-	-	Baseline
TTPA	-	-	-	Improvement
DPO	-	-	-	Moderate
GRPO	-	-	-	Good
IEPL	-	-	-	Better
OEPL	-	-	-	Best (autonomous)
All Combined	>97%	>97%	>97%	>97% rel. improvement

The unified framework delivers over 97% mean relative improvement over baseline policies on fine-grained manipulation tasks.
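For readers unfamiliar with the metric, the sketch below shows one plausible way a mean relative improvement is computed: the relative gain is taken per task and then averaged. Whether the paper averages per-task ratios in exactly this way is an assumption, and the success rates used here are invented purely for illustration; they are not results from the paper.

```python
def mean_relative_improvement(baseline, improved):
    """Average over tasks of (improved - baseline) / baseline."""
    gains = [(improved[task] - baseline[task]) / baseline[task] for task in baseline]
    return sum(gains) / len(gains)

# Hypothetical success rates, NOT taken from the paper.
baseline = {"task_a": 0.20, "task_b": 0.40}
improved = {"task_a": 0.50, "task_b": 0.60}

mri = mean_relative_improvement(baseline, improved)
```

Under this definition a value above 0.97 would mean that, averaged over tasks, the optimized policy nearly doubles the baseline success rate.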

3.3 Key Capabilities Demonstrated

Cross-embodiment control: Same model works across different robot morphologies
Multi-view synthesis: Generates consistent views from arbitrary camera positions
Long-horizon planning: Maintains coherence over extended action sequences
Physics-aware predictions: Respects collision constraints and object permanence

4. Strengths and Limitations

Strengths

First unified framework spanning inference-time to training-time policy optimization
Bidirectional cross-attention enforces strict spatiotemporal consistency
4D Gaussian Splatting enables flexible novel-view synthesis
Dense perceptual rewards improve sample efficiency over binary rewards
Decouples optimization from expensive physical interaction

Limitations

Requires substantial computational resources for training (Diffusion Transformer backbone)
Dependent on quality of initial depth estimation for pointmap generation
Performance limited by distribution of training data
Real-time inference may require hardware acceleration

5. Why This Matters for AI Hardware

RoboStereo has significant implications for next-generation AI chip design:

World Models as AI Accelerators: The paper demonstrates that world models can serve as “mental simulators” for robotics, suggesting a new class of AI hardware optimized for video diffusion and 4D generation workloads.
Multi-Modal Processing Requirements: The dual-tower architecture requires hardware that can efficiently process both visual (RGB) and geometric (pointmap) modalities simultaneously, with fast cross-attention mechanisms.
Edge Deployment Challenges: For real-time robotic control, the inference latency of diffusion models must be reduced through specialized accelerators supporting:
- Fast denoising steps
- Efficient Gaussian splatting
- Low-latency cross-modal attention
Training Infrastructure: The computational demands of training 4D world models (14B parameters) highlight the need for scalable training accelerators with large memory capacity and high-bandwidth interconnects.
Implications for Embodied AI Chips: The success of world-model-based policy optimization suggests future robotics chips may integrate the following into a unified embodied AI processor:
- World model inference engines
- Policy networks
- Sensor fusion units

References

Zhang, R., Chen, G., Xu, Z., et al. (2026). RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization. arXiv:2603.12639.
Yang, L., et al. (2025). Cosmos 2.5: A Foundation World Model for Robotic Agents. arXiv:2501.12269.
Brooks, T., et al. (2024). Video generation models as world simulators. OpenAI Research.
Janner, M., et al. (2021). Planning with Diffusion for Flexible Behavior Synthesis. ICML.
Du, Y., et al. (2023). Learning to Exploit: Pre-training for Visuo-Motor Reinforcement Learning. arXiv:2304.14288.