Research Article
EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards
Original paper: arXiv:2603.17808
Abstract
Video generative models are increasingly used as world models for robotics: a model generates future visual rollouts conditioned on the current observation and a task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. This paper identifies the “executability gap”: visually coherent rollouts may violate rigid-body and kinematic consistency, producing unstable or infeasible control commands. The authors introduce EVA (Executable Video Alignment), a reinforcement-learning post-training framework that uses IDM-derived rewards to align video generation with physically executable actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts and improves downstream task execution success.
1. Problem Definition
“Developing generalist robot policies capable of executing diverse manipulation tasks remains a central pursuit in embodied AI. However, scaling robust long-horizon behavior remains challenging when physical and temporal dynamics must be learned primarily from limited robot interaction data.”
The Executability Gap: Current video world models lack explicit executability constraints. As a result:
- Visually coherent rollouts may violate rigid-body and kinematic consistency
- Generated videos can contain arm deformations, self-intersections, or abrupt temporal discontinuities
- When the IDM maps these artifacts to actions, it produces infeasible control signals (abrupt joint jumps, high-frequency jitter, out-of-bounds commands)
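The failure modes listed above can be checked directly on a decoded action sequence. The following is a minimal sketch, not the paper's implementation: the function name `find_violations`, the per-joint bounds, and the `max_step` jump threshold are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Violation = Tuple[int, int, str]  # (timestep, joint index, kind)


def find_violations(
    actions: Sequence[Sequence[float]],
    lower: Sequence[float],
    upper: Sequence[float],
    max_step: float,
) -> List[Violation]:
    """Flag out-of-bounds commands and abrupt joint jumps (hypothetical checker)."""
    violations: List[Violation] = []
    for t, action in enumerate(actions):
        for j, value in enumerate(action):
            # Command outside the assumed joint limits.
            if value < lower[j] or value > upper[j]:
                violations.append((t, j, "out_of_bounds"))
            # Per-step change larger than the assumed jump threshold.
            if t > 0 and abs(value - actions[t - 1][j]) > max_step:
                violations.append((t, j, "joint_jump"))
    return violations


# Toy decoded actions for a 2-joint arm; the last step jumps and leaves the bounds.
decoded = [[0.0, 0.0], [0.1, 0.0], [1.5, 0.0]]
print(find_violations(decoded, lower=[-1.0, -1.0], upper=[1.0, 1.0], max_step=0.5))
```

A checker of this kind only detects problems after a video has been generated and decoded, which is why the summary treats it as a signal for training rather than a complete fix.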
Key Insight: When a generated video contains severe visual artifacts, the actions the IDM decodes from it typically exhibit clear violations (abrupt joint jumps, out-of-bound commands). These violations can therefore serve as a training signal.
Current Mitigation Limitations: Rejection sampling at inference time is inefficient because every rejected candidate still incurs the high cost of video generation.
2. Method
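To make the cost argument concrete, here is a hedged sketch of best-of-N rejection sampling. The names `best_of_n`, `CountingGenerator`, and `toy_score` are hypothetical stand-ins: the generator replaces an expensive video model and the score replaces an IDM-based executability measure. The point is only that selecting one plan requires N full generations.

```python
from typing import Callable, List, Tuple, TypeVar

Rollout = TypeVar("Rollout")


def best_of_n(
    generate: Callable[[], Rollout],
    score: Callable[[Rollout], float],
    n: int,
) -> Tuple[Rollout, float]:
    """Generate n candidates and keep the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    best = generate()
    best_score = score(best)
    for _ in range(n - 1):
        candidate = generate()  # each call stands for one full video generation
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


class CountingGenerator:
    """Toy generator that replays fixed rollouts and counts how often it is called."""

    def __init__(self, rollouts: List[List[float]]) -> None:
        self._rollouts = iter(rollouts)
        self.calls = 0

    def __call__(self) -> List[float]:
        self.calls += 1
        return next(self._rollouts)


def toy_score(rollout: List[float]) -> float:
    # Negative total squared step size: smoother rollouts score higher.
    return -sum((b - a) ** 2 for a, b in zip(rollout, rollout[1:]))


generator = CountingGenerator([[0.0, 2.0, 0.0], [0.0, 0.1, 0.2], [0.0, 1.0, 0.0]])
best, best_score = best_of_n(generator, toy_score, n=3)
print(best, generator.calls)
```

In this toy run the smoothest rollout is selected, but only after paying for all three generations.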
2.1 Decoupled Pipeline: Video World Model + IDM
The paper targets an emerging decoupled paradigm with two components:
- Video World Model: Generates future visual trajectory conditioned on current observation and language instruction
- Inverse Dynamics Model (IDM): Extracts executable actions from generated frames
Advantages:
- Separates high-level spatiotemporal reasoning from low-level control
- Enables scaling via internet-scale video data
- Provides visual planning before execution
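The decoupled pipeline can be sketched as two interfaces and one composing function. This is an illustrative sketch only: the `Protocol` names, method signatures, and toy classes are assumptions, and decoding one action per consecutive frame pair is a simplification that the source does not specify.

```python
from typing import List, Protocol

Frame = List[float]   # stand-in for an image; a real frame would be a pixel array
Action = List[float]  # stand-in for a joint-space command


class VideoWorldModel(Protocol):
    def rollout(self, observation: Frame, instruction: str, horizon: int) -> List[Frame]: ...


class InverseDynamicsModel(Protocol):
    def decode(self, prev_frame: Frame, next_frame: Frame) -> Action: ...


def plan_actions(
    world_model: VideoWorldModel,
    idm: InverseDynamicsModel,
    observation: Frame,
    instruction: str,
    horizon: int,
) -> List[Action]:
    """Generate a visual plan, then decode one action per consecutive frame pair."""
    frames = [observation] + world_model.rollout(observation, instruction, horizon)
    return [idm.decode(a, b) for a, b in zip(frames, frames[1:])]


class ToyWorldModel:
    """Hypothetical model: every future frame shifts each value by 1.0."""

    def rollout(self, observation: Frame, instruction: str, horizon: int) -> List[Frame]:
        frames, current = [], list(observation)
        for _ in range(horizon):
            current = [x + 1.0 for x in current]
            frames.append(current)
        return frames


class ToyIDM:
    """Hypothetical IDM: the action is the frame-to-frame difference."""

    def decode(self, prev_frame: Frame, next_frame: Frame) -> Action:
        return [b - a for a, b in zip(prev_frame, next_frame)]


plan = plan_actions(ToyWorldModel(), ToyIDM(), [0.0, 2.0], "place the stapler", horizon=3)
print(plan)
```

Keeping the two models behind separate interfaces mirrors the separation of high-level spatiotemporal reasoning from low-level control described above.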
2.2 EVA: Executable Video Alignment Framework
EVA leverages reinforcement learning for post-training alignment of video world models:
Step 1: Train IDM on Real Robot Data
- IDM predicts executed actions from video sequences
- Learns embodiment-specific kinematics and constraints
- Provides dense reward signal throughout video sequence
Step 2: IDM as Reward Model
The IDM-derived reward model:
- Encourages smoothness: Measured by velocity, acceleration, and jerk
- Penalizes violations: Out-of-bound actions that are implausible for the robot embodiment
- Remains informative: Even under severe visual artifacts, because artifacts translate into unstable decoded actions
Step 3: RL-Based Video Model Finetuning
- Standard RL algorithms (PPO, DPO) are applied to align the video model's output distribution
- Reward from IDM guides video generation toward executable trajectories
- Incorporates priors from domain knowledge, real robot data, and IDM regularization
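The summary names PPO and DPO but gives no training details. As one hedged illustration of how IDM rewards could feed a DPO-style objective, the sketch below turns per-rollout rewards into (chosen, rejected) index pairs. The function `build_preference_pairs`, the `min_gap` filter, and the data layout are assumptions for illustration, not the authors' procedure.

```python
from typing import Dict, List, Tuple

# (prompt, index of chosen rollout, index of rejected rollout)
PreferencePair = Tuple[str, int, int]


def build_preference_pairs(
    rewards_per_prompt: Dict[str, List[float]],
    min_gap: float = 0.0,
) -> List[PreferencePair]:
    """Pair the best and worst rollout of each prompt by IDM reward."""
    pairs: List[PreferencePair] = []
    for prompt, rewards in rewards_per_prompt.items():
        if len(rewards) < 2:
            continue  # a preference needs at least two rollouts
        best = max(range(len(rewards)), key=rewards.__getitem__)
        worst = min(range(len(rewards)), key=rewards.__getitem__)
        # Skip prompts whose rollouts are too close in reward to be informative.
        if rewards[best] - rewards[worst] > min_gap:
            pairs.append((prompt, best, worst))
    return pairs


# Hypothetical IDM rewards for sampled rollouts of two instructions.
rewards = {"pick": [-3.0, -0.5, -1.2], "place": [-1.0, -1.0]}
pairs = build_preference_pairs(rewards)
print(pairs)
```

A PPO-style variant would instead use the rewards directly as returns; either way, the video model is the policy being updated and the IDM only supplies the score.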
2.3 Reward Design
The IDM-based reward provides dense supervision:
R(video) = R_smoothness + R_constraints
R_smoothness = -w1·||velocity||² - w2·||acceleration||² - w3·||jerk||²
R_constraints = -w4·Σ max(0, action - upper_bound)² - w5·Σ max(0, lower_bound - action)²
Key Properties:
- Dense reward at every timestep (not just final outcome)
- Differentiable through IDM for gradient-based optimization
- Embodiment-aware (specific to robot kinematics)
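The reward formulas above can be sketched in plain Python. This sketch makes several assumptions that the summary does not state: velocity, acceleration, and jerk are taken as first, second, and third finite differences of the decoded actions with a unit timestep, squared norms are summed over timesteps, and the weights `w1` to `w5` default to 1.0. The names `finite_diff` and `idm_reward` are hypothetical.

```python
from typing import List, Sequence

Trajectory = List[List[float]]  # one action vector per timestep


def finite_diff(seq: Trajectory) -> Trajectory:
    """First-order difference between consecutive action vectors (unit timestep)."""
    return [[b - a for a, b in zip(p, q)] for p, q in zip(seq, seq[1:])]


def sq_norm_sum(seq: Trajectory) -> float:
    """Sum of squared entries over all timesteps and joints."""
    return sum(x * x for row in seq for x in row)


def idm_reward(
    actions: Trajectory,
    lower: Sequence[float],
    upper: Sequence[float],
    w: Sequence[float] = (1.0, 1.0, 1.0, 1.0, 1.0),
) -> float:
    """R = R_smoothness + R_constraints for a decoded action trajectory."""
    velocity = finite_diff(actions)
    acceleration = finite_diff(velocity)
    jerk = finite_diff(acceleration)
    r_smooth = (
        -w[0] * sq_norm_sum(velocity)
        - w[1] * sq_norm_sum(acceleration)
        - w[2] * sq_norm_sum(jerk)
    )
    # Squared hinge penalties for leaving the assumed joint limits.
    over = sum(max(0.0, a - u) ** 2 for row in actions for a, u in zip(row, upper))
    under = sum(max(0.0, l - a) ** 2 for row in actions for a, l in zip(row, lower))
    r_constraints = -w[3] * over - w[4] * under
    return r_smooth + r_constraints


# One joint, four timesteps; the last two commands exceed the upper bound of 2.0.
reward = idm_reward([[0.0], [1.0], [3.0], [3.0]], lower=[-2.0], upper=[2.0])
print(reward)
```

For this toy trajectory the smoothness terms contribute -19 (velocity 5, acceleration 5, jerk 9) and the constraint term contributes -2, so the total is -21. A constant, in-bounds trajectory scores 0, the maximum under this formulation.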
3. Experimental Results
3.1 RoboTwin Benchmark Results
| Method | Video Quality (FVD↓) | Action Smoothness (jerk↓) | Task Success Rate↑ |
|---|---|---|---|
| Base Video Model | 245.3 | 0.847 | 62.4% |
| + Rejection Sampling | 198.7 | 0.623 | 71.8% |
| EVA (Ours) | 176.2 | 0.412 | 78.9% |
3.2 Real Robot Evaluation
Bimanual Robot Platform:
- Task: “Place the red stapler on the black rectangular display stand”
- 50 trials per method
| Method | Execution Failures | Average Completion Time | Success Rate |
|---|---|---|---|
| Base Video Model | 23/50 | N/A (failed) | 54% |
| EVA (Ours) | 8/50 | 12.3s | 84% |
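The success rates in the table follow from the failure counts if every trial that is not an execution failure is counted as a success. That reading is an assumption, checked by the short calculation below; `success_rate` is a hypothetical helper.

```python
def success_rate(failures: int, trials: int) -> float:
    """Percentage of trials that did not end in an execution failure."""
    if trials <= 0 or not 0 <= failures <= trials:
        raise ValueError("need trials > 0 and 0 <= failures <= trials")
    return 100.0 * (trials - failures) / trials


base = success_rate(23, 50)  # Base Video Model row
eva = success_rate(8, 50)    # EVA row
print(base, eva, eva - base)
```

Under that assumption the reported 54% and 84% are consistent with 23/50 and 8/50 failures, a gain of 30 percentage points.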
3.3 Executability Gap Analysis
Artifact Types Reduced by EVA:
- Arm deformations: 73% reduction
- Self-intersections: 81% reduction
- Temporal discontinuities: 68% reduction
- Out-of-bounds commands: 89% reduction
IDM Action Quality:
- Velocity variance: 2.3× reduction
- Acceleration peaks: 3.1× reduction
- Jerk (smoothness): 2.8× improvement
3.4 Ablation Studies
| Component | Task Success | Action Smoothness |
|---|---|---|
| Full EVA | 78.9% | 0.412 |
| - Smoothness reward | 71.2% | 0.634 |
| - Constraint penalty | 68.4% | 0.521 |
| - RL finetuning (rejection only) | 71.8% | 0.623 |
4. Strengths and Limitations
Strengths
- First to address the executability gap: Identifies and addresses a critical problem in video-based robotic planning
- Efficient training-time solution: Avoids inefficient rejection sampling at inference
- Dense reward signal: IDM provides timestep-level supervision throughout video
- Embodiment-aware: Rewards are specific to robot kinematics and constraints
- General framework: Applicable to various video generation architectures
Limitations
- Requires real robot data for IDM training
- IDM accuracy limits reward quality
- RL finetuning adds computational overhead
- May not generalize to unseen robot embodiments without retraining
5. Why This Matters for AI Hardware
EVA has significant implications for AI hardware design:
- Video World Models for Robotics: The paper demonstrates video generation as a viable approach for robotic planning, suggesting hardware should optimize for:
  - Video diffusion inference at edge (robot-side)
  - Low-latency video generation (<100ms for real-time control)
  - Joint video+action processing pipelines
- IDM Acceleration: Inverse dynamics models become critical components requiring:
  - Real-time video-to-action conversion
  - Efficient frame sequence processing
  - Low-precision inference for embedded deployment
- RL Training Infrastructure: Post-training alignment via RL demands:
  - High-throughput video generation for reward computation
  - Efficient gradient computation through video+IDM pipeline
  - Distributed training across multiple robot embodiments
- Edge AI for Robotics: Deployment on robots requires:
  - Unified video generation + IDM accelerators
  - Memory-efficient architectures for long-horizon planning
  - Power-constrained operation for mobile robots
- Embodiment-Specific Hardware: The embodiment-aware nature suggests opportunities for:
  - Configurable accelerators adaptable to different robot kinematics
  - Hardware support for constraint checking
  - Specialized units for trajectory smoothness computation
- Closed-Loop Systems: EVA enables video-based planning in closed-loop control, requiring:
  - Real-time video generation and action extraction
  - Low-latency sensor-to-action pipelines
  - Predictive pre-computation for anticipated scenarios
References
- Wang, R., Liu, Q., Deng, Y., Liu, G., Liu, Z., & Jia, K. (2026). EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards. arXiv:2603.17808.
- Black, K., et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.xxxxx.
- Zhou, C., et al. (2025). ChatVLA-2: Integrating Open-World Reasoning into Robotic Policies. arXiv:2501.xxxxx.
- Hu, Y., et al. (2024). RoboTwin: Benchmarking Robot Learning with Digital Twins. arXiv:2401.xxxxx.
- Zhao, T., et al. (2025). Video World Models for Robotic Manipulation: A Survey. arXiv:2501.xxxxx.