Research Article
RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization
Original paper: arXiv:2603.12639
Abstract
Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. This paper introduces RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, the authors present the first unified framework for world-model-based policy optimization, achieving >97% average relative improvement on fine-grained manipulation tasks.
1. Problem Definition
“Real-world interaction remains prohibitively expensive, slow, and inherently unsafe, severely constraining the collection of large-scale, diverse training data and impeding safe action verification and refinement at deployment time.”
Embodied World Models (EWMs) offer promise through imagined rollouts, but existing approaches suffer from:
- Geometric hallucinations: Object teleportation, scale drift, surface penetration
- Physics violations: Anti-gravity motion, contactless manipulation
- Lack of unified optimization: No framework exists for leveraging EWMs across the full spectrum from inference-time verification to training-time refinement
Current models like Cosmos 2.5 produce visually appealing frames but lack explicit geometric grounding, leading to physically implausible predictions that limit reliability in robotic applications.
2. Method Framework
2.1 Symmetric Dual-Tower DiT Architecture
RoboStereo employs a symmetric twin-backbone Diffusion Transformer (DiT) that treats RGB videos and 3D pointmaps as mutually reinforcing modalities:
Key Design Elements:
- Video Tower: Processes RGB sequences for semantic understanding
- Pointmap Tower: Processes XYZ coordinate sequences for geometric constraints
- Bidirectional Cross-Attention: Each tower queries information from the other
  - Video branch incorporates geometric constraints for structural rigidity
  - Pointmap branch leverages semantic context to refine object geometry
4D Gaussian Splatting Head: Renders generated trajectories into photorealistic observations from flexible viewpoints, supporting multi-view supervision during policy learning.
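The bidirectional exchange between the two towers can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention with no learned projections and a plain residual update; the function names (`cross_attend`, `bidirectional_exchange`) and the toy token values are hypothetical, and the actual RoboStereo layers are certainly richer than this.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Single-head scaled dot-product attention (assumption: no learned
    Q/K/V projections). Each query token receives a weighted average of
    the context tokens."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, context))
                    for d in range(dim)])
    return out

def bidirectional_exchange(video_tokens, pointmap_tokens):
    """Each tower queries the other's pre-update tokens, so the exchange
    is symmetric. The result is added back as a residual."""
    video_update = cross_attend(video_tokens, pointmap_tokens)
    pointmap_update = cross_attend(pointmap_tokens, video_tokens)
    new_video = [[x + u for x, u in zip(tok, upd)]
                 for tok, upd in zip(video_tokens, video_update)]
    new_pointmap = [[x + u for x, u in zip(tok, upd)]
                    for tok, upd in zip(pointmap_tokens, pointmap_update)]
    return new_video, new_pointmap

# Toy tokens: 3 video tokens and 2 pointmap tokens, feature size 2.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pointmap = [[0.5, 0.5], [2.0, 0.0]]
new_video, new_pointmap = bidirectional_exchange(video, pointmap)
print(new_video)
print(new_pointmap)
```

Both updates are computed from the same pre-update tokens, which is one simple way to keep the two towers symmetric; a sequential update (video first, then pointmap) would be an equally plausible design.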
2.2 Frame-Level Action Control
Actions are encoded as continuous vectors and injected via a dual-path mechanism:
- Implicit fusion with diffusion timestep embedding
- Explicit modulation offsets (γ, β, α) for Adaptive Layer Normalization (AdaLN)
This ensures generated trajectories faithfully follow action sequences at each denoising step.
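The dual-path injection can be sketched as follows. The modulation formula used here, `x + α · sublayer((1 + γ) · LN(x) + β)`, is the standard DiT adaLN convention and is assumed rather than confirmed for RoboStereo. The scalar modulation values, the toy mapping functions, and all names are hypothetical illustrations.

```python
import math

def layer_norm(x, eps=1e-6):
    """Layer normalization without learned affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_block(x, t_emb, a_emb, base_modulation, action_offsets, sublayer):
    """One AdaLN-modulated residual block with dual-path action injection.

    Path 1 (implicit): the action embedding is added to the timestep
    embedding before the base modulation is computed.
    Path 2 (explicit): action-derived offsets are added to (gamma, beta, alpha).
    """
    cond = [t + a for t, a in zip(t_emb, a_emb)]
    gamma, beta, alpha = base_modulation(cond)
    d_gamma, d_beta, d_alpha = action_offsets(a_emb)
    gamma, beta, alpha = gamma + d_gamma, beta + d_beta, alpha + d_alpha
    h = [v * (1.0 + gamma) + beta for v in layer_norm(x)]  # scale and shift
    return [xi + alpha * si for xi, si in zip(x, sublayer(h))]  # gated residual

# Hypothetical stand-ins for the learned modulation networks.
def toy_base_modulation(cond):
    s = sum(cond)
    return 0.1 * s, 0.05 * s, 1.0

def toy_action_offsets(a_emb):
    s = sum(a_emb)
    return 0.2 * s, -0.1 * s, 0.0

def identity(h):
    return h

x = [1.0, 2.0, 3.0]
t_emb = [0.5, -0.5, 1.0]
no_action = adaln_block(x, t_emb, [0.0, 0.0, 0.0],
                        toy_base_modulation, toy_action_offsets, identity)
with_action = adaln_block(x, t_emb, [1.0, 0.0, 0.0],
                          toy_base_modulation, toy_action_offsets, identity)
print(no_action)
print(with_action)
```

Changing the per-frame action embedding changes the block output through both paths, which is the sense in which the action steers every denoising step.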
2.3 Unified Policy Optimization Framework
The paper introduces three complementary paradigms:
1. Test-Time Policy Augmentation (TTPA)
- Zero-shot pre-execution mechanism
- Validates and refines candidate action sequences through imagined rollouts
- Acts as a predictive safeguard before physical execution
2. Imitative-Evolutionary Policy Learning (IEPL)
- Leverages dense visual-imitation rewards from flexible-view imagined trajectories
- Uses LPIPS (perceptual distance) computed frame-by-frame from multiple viewpoints
- Combines sample efficiency of imitation with exploratory power of GRPO
3. Open-Exploration Policy Learning (OEPL)
- Self-supervised approach for autonomous skill discovery
- Enables self-correction in the absence of expert demonstrations
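Two of these paradigms can be made concrete with a small sketch: a dense imitation reward averaged over views and frames, the group-relative advantage used by GRPO, and a TTPA-style selection among candidate action sequences. Everything here is a toy under stated assumptions: `frame_distance` is a mean-absolute-difference placeholder for LPIPS (which needs a pretrained network), `toy_imagine` stands in for the world model, the goal-frame score used for selection is hypothetical, and the paper's actual reward and scoring functions may differ.

```python
import statistics

def frame_distance(frame_a, frame_b):
    """Placeholder for LPIPS: mean absolute difference between two
    flattened frames. Lower means more similar."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def imitation_reward(rollout_views, expert_views):
    """Dense reward: negative distance averaged over all views and frames.
    views[v][t] is the flattened frame of view v at step t."""
    distances = [
        frame_distance(frame, expert_frame)
        for view, expert_view in zip(rollout_views, expert_views)
        for frame, expert_frame in zip(view, expert_view)
    ]
    return -sum(distances) / len(distances)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def select_best_candidate(candidates, imagine, score):
    """TTPA-style check: imagine every candidate action sequence and
    return the one whose rollout scores highest."""
    return max(candidates, key=lambda actions: score(imagine(actions)))

def toy_imagine(actions):
    """Hypothetical world model: one view, and the frame at step t is
    two pixels that both take the value actions[t]."""
    return [[[a, a] for a in actions]]

expert_views = toy_imagine([0.0, 1.0])
candidates = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]

# Imitation reward and group-relative advantages over one sampled group.
rewards = [imitation_reward(toy_imagine(c), expert_views) for c in candidates]
advantages = group_relative_advantages(rewards)

# Pre-execution selection with a hypothetical goal-frame score.
goal_frame = [1.0, 1.0]
best = select_best_candidate(
    candidates,
    toy_imagine,
    lambda views: -frame_distance(views[0][-1], goal_frame),
)
print(rewards, advantages, best)
```

Because the reward is computed per frame rather than as a single success flag, every rollout in the group receives a distinct signal, which is the sample-efficiency argument made for dense perceptual rewards.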
3. Experimental Results
3.1 Generation Quality
RoboStereo achieves state-of-the-art performance in physics adherence and 3D geometric accuracy:
- Outperforms Cosmos 2.5, Veo 3.1, and other baselines in physical plausibility
- Maintains temporal consistency across long-horizon sequences
- Generates photorealistic videos with accurate geometric grounding
3.2 Policy Optimization Results
| Method | Coffee | Square | StackThree | Average |
|---|---|---|---|---|
| Base Policy | - | - | - | Baseline |
| TTPA | - | - | - | Improvement |
| DPO | - | - | - | Moderate |
| GRPO | - | - | - | Good |
| IEPL | - | - | - | Better |
| OEPL | - | - | - | Best (autonomous) |
| All Combined | - | - | - | >97% rel. improvement |
The unified framework delivers over 97% mean relative improvement over baseline policies on fine-grained manipulation tasks.
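For readers unfamiliar with the metric, the sketch below shows how a mean relative improvement is computed. The success rates are invented purely for illustration and are not results from the paper. Averaging per-task relative gains is also an assumption; the paper may instead report the relative gain of the averaged success rate.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, e.g. 0.5 means +50%."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline

def mean_relative_improvement(baseline_by_task, improved_by_task):
    """Average of the per-task relative gains."""
    gains = [relative_improvement(baseline_by_task[task], improved_by_task[task])
             for task in baseline_by_task]
    return sum(gains) / len(gains)

# Hypothetical success rates, NOT taken from the paper.
baseline = {"task_a": 0.40, "task_b": 0.25, "task_c": 0.20}
improved = {"task_a": 0.60, "task_b": 0.50, "task_c": 0.50}

mean_gain = mean_relative_improvement(baseline, improved)
print(f"mean relative improvement: {mean_gain:.0%}")
```

With these made-up numbers the per-task gains are roughly 50%, 100%, and 150%, so the mean is about 100%. A relative improvement near 100% therefore means the success rate roughly doubled, not that it reached 100%.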
3.3 Key Capabilities Demonstrated
- Cross-embodiment control: Same model works across different robot morphologies
- Multi-view synthesis: Generates consistent views from arbitrary camera positions
- Long-horizon planning: Maintains coherence over extended action sequences
- Physics-aware predictions: Respects collision constraints and object permanence
4. Strengths and Limitations
Strengths
- First unified framework spanning inference-time to training-time policy optimization
- Bidirectional cross-attention enforces strict spatiotemporal consistency
- 4D Gaussian Splatting enables flexible novel-view synthesis
- Dense perceptual rewards improve sample efficiency over binary rewards
- Decouples optimization from expensive physical interaction
Limitations
- Requires substantial computational resources for training (Diffusion Transformer backbone)
- Dependent on quality of initial depth estimation for pointmap generation
- Performance limited by distribution of training data
- Real-time inference may require hardware acceleration
5. Why It Matters for AI Hardware
RoboStereo has significant implications for next-generation AI chip design:
- World Models as AI Accelerators: The paper demonstrates that world models can serve as “mental simulators” for robotics, suggesting a new class of AI hardware optimized for video diffusion and 4D generation workloads.
- Multi-Modal Processing Requirements: The dual-tower architecture requires hardware that can efficiently process both visual (RGB) and geometric (pointmap) modalities simultaneously, with fast cross-attention mechanisms.
- Edge Deployment Challenges: For real-time robotic control, the inference latency of diffusion models must be reduced through specialized accelerators supporting:
  - Fast denoising steps
  - Efficient Gaussian splatting
  - Low-latency cross-modal attention
- Training Infrastructure: The computational demands of training 4D world models (14B parameters) highlight the need for scalable training accelerators with large memory capacity and high-bandwidth interconnects.
- Implications for Embodied AI Chips: The success of world-model-based policy optimization suggests future robotics chips may integrate the following into a unified embodied AI processor:
  - World model inference engines
  - Policy networks
  - Sensor fusion units
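A back-of-the-envelope calculation helps size these hardware demands. The sketch below uses the 14B parameter count quoted above; the bytes-per-parameter values are standard for each data type. The control rate and the number of denoising steps are hypothetical inputs chosen only to show the arithmetic, and the estimate covers weights alone, not activations, optimizer state, or KV caches.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

def per_step_budget_ms(control_hz, denoising_steps):
    """Latency budget per denoising step if one full prediction must
    finish within a single control period."""
    return 1000.0 / control_hz / denoising_steps

NUM_PARAMS = 14_000_000_000  # 14B, as quoted in the text above

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.0f} GB of weights")

# Hypothetical deployment target: 10 Hz control, 20 denoising steps.
print(f"{per_step_budget_ms(10, 20):.1f} ms per denoising step")
```

Under these assumptions the weights alone occupy 28 GB in bf16, and each denoising step would have to finish in 5 ms, which illustrates why both memory capacity and step latency appear in the list above.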
References
- Zhang, R., Chen, G., Xu, Z., et al. (2026). RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization. arXiv:2603.12639.
- Yang, L., et al. (2025). Cosmos 2.5: A Foundation World Model for Robotic Agents. arXiv:2501.12269.
- Brooks, T., et al. (2024). Video generation models as world simulators. OpenAI Research.
- Janner, M., et al. (2021). Planning with Diffusion for Flexible Behavior Synthesis. ICML.
- Du, Y., et al. (2023). Learning to Exploit: Pre-training for Visuo-Motor Reinforcement Learning. arXiv:2304.14288.