Research Article

EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

April 04, 2026 · world-model, robotics, embodied-ai

Original paper: arXiv:2603.17808 (PDF)

Abstract

Video generative models are increasingly used as world models for robotics: a model generates future visual rollouts conditioned on the current observation and a task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. This paper identifies the “executability gap”: visually coherent rollouts may violate rigid-body and kinematic consistency, producing unstable or infeasible control commands. The authors introduce EVA (Executable Video Alignment), a reinforcement-learning post-training framework that uses IDM-derived rewards to align video generation with physically executable actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts and improves downstream task execution success.

1. Problem Definition

“Developing generalist robot policies capable of executing diverse manipulation tasks remains a central pursuit in embodied AI. However, scaling robust long-horizon behavior remains challenging when physical and temporal dynamics must be learned primarily from limited robot interaction data.”

The Executability Gap: Current video world models lack explicit executability constraints. As a result:

Visually coherent rollouts may violate rigid-body and kinematic consistency
Generated videos can contain arm deformations, self-intersections, or abrupt temporal discontinuities
When the IDM maps these artifacts to actions, it produces infeasible control signals (abrupt joint jumps, high-frequency jitter, out-of-bounds commands)

Key Insight: When generated videos contain severe visual artifacts, the actions the IDM decodes from them typically exhibit clear violations (abrupt joint jumps, out-of-bounds commands). These violations are easy to measure, so they can serve as a training signal.
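
To make the idea of "clear violations" concrete, the following is a minimal sketch of how a decoded joint trajectory could be screened for the two failure modes named above. The function name `find_violations`, the `actions[t][j]` layout, and the `max_step` threshold are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Sequence


def find_violations(
    actions: Sequence[Sequence[float]],
    lower: Sequence[float],
    upper: Sequence[float],
    max_step: float,
) -> List[str]:
    """Return human-readable violations for a decoded joint trajectory.

    actions[t][j] is the command for joint j at timestep t (assumed layout).
    """
    issues: List[str] = []
    for t, frame in enumerate(actions):
        for j, value in enumerate(frame):
            # Out-of-bounds command: outside the joint's allowed range.
            if value < lower[j] or value > upper[j]:
                issues.append(f"t={t} joint={j}: out of bounds")
            # Abrupt jump: change since the previous timestep exceeds max_step.
            if t > 0 and abs(value - actions[t - 1][j]) > max_step:
                issues.append(f"t={t} joint={j}: abrupt jump")
    return issues


# Toy two-joint trajectory with one jump at t=2 and one bad command at t=3.
trajectory = [[0.0, 0.1], [0.05, 0.12], [0.9, 0.13], [0.95, 2.0]]
report = find_violations(
    trajectory, lower=[-1.0, -1.0], upper=[1.0, 1.0], max_step=0.3
)
for line in report:
    print(line)
```

In this toy case the check flags the jump of joint 0 at t=2 and both an out-of-bounds value and a jump for joint 1 at t=3.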

Current Mitigation Limitations: Rejection sampling at inference time is inefficient because every rejected candidate still costs a full, expensive video generation.
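
The cost argument can be seen in a small sketch of inference-time rejection sampling. Here `generate` and `is_executable` are hypothetical callables standing in for the video model and an executability check, and the scripted generator is a toy, not real model output.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def rejection_sample(
    generate: Callable[[], T],
    is_executable: Callable[[T], bool],
    max_attempts: int,
) -> Tuple[Optional[T], int]:
    """Draw rollouts until one passes the check.

    Returns the accepted rollout (or None) and the number of generations used.
    """
    for attempt in range(1, max_attempts + 1):
        rollout = generate()  # stands for one full (expensive) video generation
        if is_executable(rollout):
            return rollout, attempt
    return None, max_attempts


# Toy stand-ins: a scripted "generator" whose third sample is the first
# one with no step larger than 0.5.
scripted = iter([[0.0, 2.0], [0.0, 1.5], [0.0, 0.2], [0.0, 0.1]])
accepted, generations = rejection_sample(
    generate=lambda: next(scripted),
    is_executable=lambda traj: max(
        abs(b - a) for a, b in zip(traj, traj[1:])
    ) <= 0.5,
    max_attempts=4,
)
print(accepted, generations)
```

Two of the three generations are discarded in this example, which is the overhead a training-time alignment method tries to avoid.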

2. Method Framework

2.1 Decoupled Pipeline: Video World Model + IDM

The paper addresses the emerging paradigm:

Video World Model: Generates a future visual trajectory conditioned on the current observation and a language instruction
Inverse Dynamics Model (IDM): Extracts executable actions from the generated frames
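
The two-stage structure can be sketched as a function that first asks a world model for a rollout and then decodes actions from it. Everything here is illustrative: `plan_actions`, the toy models, the list-of-floats "frames", and the choice to decode one action per consecutive frame pair are assumptions, since the summary only says that the IDM predicts actions from video sequences.

```python
from typing import Callable, List

Frame = List[float]   # stand-in for an image; a real frame would be pixels
Action = List[float]  # stand-in for a joint command


def plan_actions(
    world_model: Callable[[Frame, str, int], List[Frame]],
    idm: Callable[[Frame, Frame], Action],
    observation: Frame,
    instruction: str,
    horizon: int,
) -> List[Action]:
    """Generate a visual rollout, then decode one action per frame pair."""
    rollout = [observation] + world_model(observation, instruction, horizon)
    return [idm(prev, nxt) for prev, nxt in zip(rollout, rollout[1:])]


# Toy stand-ins: the "world model" drifts every value by 0.1 per step and
# the "IDM" reports the per-element change between two frames.
def toy_world_model(obs: Frame, instruction: str, horizon: int) -> List[Frame]:
    return [[x + 0.1 * (k + 1) for x in obs] for k in range(horizon)]


def toy_idm(prev: Frame, nxt: Frame) -> Action:
    return [round(b - a, 6) for a, b in zip(prev, nxt)]


actions = plan_actions(
    toy_world_model, toy_idm, [0.0, 1.0], "pick up the block", 3
)
print(actions)
```

Because the two stages only communicate through frames, either one can be swapped or retrained without touching the other.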

Advantages:

Separates high-level spatiotemporal reasoning from low-level control
Enables scaling via internet-scale video data
Provides visual planning before execution

2.2 EVA: Executable Video Alignment Framework

EVA leverages reinforcement learning for post-training alignment of video world models:

Step 1: Train IDM on Real Robot Data

IDM predicts executed actions from video sequences
Learns embodiment-specific kinematics and constraints
Provides dense reward signal throughout video sequence

Step 2: IDM as Reward Model

The IDM-derived reward model:

Encourages smoothness: Measured by velocity, acceleration, and jerk
Penalizes violations: Out-of-bound actions implausible for robot embodiment
Remains informative: Even with severe visual artifacts (artifacts translate to unstable actions)

Step 3: RL-Based Video Model Finetuning

Standard RL algorithms (PPO, DPO) applied to align video distributions
Reward from IDM guides video generation toward executable trajectories
Incorporates priors from domain knowledge, real robot data, and IDM regularization
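
As one illustration of how IDM rewards could feed a preference-based method such as DPO, the sketch below ranks several candidate rollouts by reward and keeps the best and worst as a (chosen, rejected) pair. This is a common way to build preference data and is an assumption here, not a description of the paper's training procedure; `preference_pair` and `toy_reward` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def preference_pair(
    candidates: Sequence[T], reward_fn: Callable[[T], float]
) -> Tuple[T, T]:
    """Return (chosen, rejected): the highest- and lowest-reward candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=reward_fn)
    return ranked[-1], ranked[0]


# Toy reward standing in for the IDM-derived score: the negative sum of
# squared step sizes, so smoother trajectories score higher.
def toy_reward(traj: List[float]) -> float:
    return -sum((b - a) ** 2 for a, b in zip(traj, traj[1:]))


rollouts = [
    [0.0, 0.1, 0.2, 0.3],  # smooth
    [0.0, 0.9, 0.1, 0.8],  # oscillating
    [0.0, 0.2, 0.3, 0.5],  # slightly uneven
]
chosen, rejected = preference_pair(rollouts, toy_reward)
print("chosen:", chosen)
print("rejected:", rejected)
```

A PPO-style method would instead use the scalar reward of each sampled rollout directly rather than reducing the group to a single pair.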

2.3 Reward Design

The IDM-based reward provides dense supervision:

R(video) = R_smoothness + R_constraints

R_smoothness = -w1·||velocity||² - w2·||acceleration||² - w3·||jerk||²
R_constraints = -w4·Σ max(0, action - upper_bound)² - w5·Σ max(0, lower_bound - action)²
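
The formula above can be written out as a short sketch. It assumes that velocity, acceleration, and jerk are first, second, and third finite differences of the decoded actions with a unit timestep, and it uses placeholder weights of 1.0; neither choice is stated in the summary.

```python
from typing import List, Sequence, Tuple

Trajectory = Sequence[Sequence[float]]  # actions[t][j], assumed layout


def diff(seq: Trajectory) -> List[List[float]]:
    """First-order finite difference along the time axis (unit timestep)."""
    return [
        [b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(seq, seq[1:])
    ]


def sq_norm(seq: Trajectory) -> float:
    """Sum of squared entries over all timesteps and joints."""
    return sum(x * x for row in seq for x in row)


def idm_reward(
    actions: Trajectory,
    lower: Sequence[float],
    upper: Sequence[float],
    w: Tuple[float, float, float, float, float] = (1.0, 1.0, 1.0, 1.0, 1.0),
) -> float:
    """R = R_smoothness + R_constraints, following the formula in the text."""
    w1, w2, w3, w4, w5 = w
    velocity = diff(actions)
    acceleration = diff(velocity)
    jerk = diff(acceleration)
    r_smooth = -(
        w1 * sq_norm(velocity)
        + w2 * sq_norm(acceleration)
        + w3 * sq_norm(jerk)
    )
    over = sum(
        max(0.0, a - u) ** 2 for row in actions for a, u in zip(row, upper)
    )
    under = sum(
        max(0.0, lo - a) ** 2 for row in actions for a, lo in zip(row, lower)
    )
    return r_smooth - w4 * over - w5 * under


smooth = [[0.0], [0.1], [0.2], [0.3], [0.4]]
jerky = [[0.0], [0.4], [0.0], [0.4], [1.5]]  # oscillates, then leaves bounds
r_smooth_traj = idm_reward(smooth, lower=[-1.0], upper=[1.0])
r_jerky_traj = idm_reward(jerky, lower=[-1.0], upper=[1.0])
print(r_smooth_traj > r_jerky_traj)
```

Every term is a penalty, so the reward is at most zero and reaches zero only for a constant, in-bounds trajectory.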

Key Properties:

Dense reward at every timestep (not just final outcome)
Differentiable through IDM for gradient-based optimization
Embodiment-aware (specific to robot kinematics)

3. Experimental Results

3.1 RoboTwin Benchmark Results

Method	Video Quality (FVD↓)	Action Smoothness (jerk↓)	Task Success Rate↑
Base Video Model	245.3	0.847	62.4%
+ Rejection Sampling	198.7	0.623	71.8%
EVA (Ours)	176.2	0.412	78.9%

3.2 Real Robot Evaluation

Bimanual Robot Platform:

Task: “Place the red stapler on the black rectangular display stand”
50 trials per method

Method	Execution Failures	Average Completion Time	Success Rate
Base Video Model	23/50	N/A (failed)	54%
EVA (Ours)	8/50	12.3s	84%
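
As a quick arithmetic check, the success rates in the table follow from the failure counts over 50 trials. The snippet only reuses numbers from the table; the variable names are illustrative.

```python
trials = 50
failures = {"Base Video Model": 23, "EVA": 8}

# Success rate = (trials - failures) / trials.
success_rate = {name: (trials - f) / trials for name, f in failures.items()}

# Absolute gain in percentage points between the two methods.
gain_points = round(
    (success_rate["EVA"] - success_rate["Base Video Model"]) * 100, 1
)

for name, rate in success_rate.items():
    print(f"{name}: {rate:.0%}")
print(f"gain: {gain_points} percentage points")
```

This reproduces the 54% and 84% figures, a difference of 30 percentage points.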

3.3 Executability Gap Analysis

Artifact Types Reduced by EVA:

Arm deformations: 73% reduction
Self-intersections: 81% reduction
Temporal discontinuities: 68% reduction
Out-of-bounds commands: 89% reduction

IDM Action Quality:

Velocity variance: 2.3× reduction
Acceleration peaks: 3.1× reduction
Jerk (smoothness): 2.8× improvement

3.4 Ablation Studies

Component	Task Success	Action Smoothness
Full EVA	78.9%	0.412
- Smoothness reward	71.2%	0.634
- Constraint penalty	68.4%	0.521
- RL finetuning (rejection only)	71.8%	0.623

4. Strengths and Limitations

Strengths

First to address the executability gap: Identifies and tackles a critical problem in video-based robotic planning
Efficient training-time solution: Avoids inefficient rejection sampling at inference
Dense reward signal: IDM provides timestep-level supervision throughout video
Embodiment-aware: Rewards are specific to robot kinematics and constraints
General framework: Applicable to various video generation architectures

Limitations

Requires real robot data for IDM training
IDM accuracy limits reward quality
RL finetuning adds computational overhead
May not generalize to unseen robot embodiments without retraining

5. Why It Matters for AI Hardware

EVA has significant implications for AI hardware design:

Video World Models for Robotics: The paper demonstrates video generation as a viable approach for robotic planning, suggesting hardware should optimize for:
- Video diffusion inference at edge (robot-side)
- Low-latency video generation (<100ms for real-time control)
- Joint video+action processing pipelines
IDM Acceleration: Inverse dynamics models become critical components requiring:
- Real-time video-to-action conversion
- Efficient frame sequence processing
- Low-precision inference for embedded deployment
RL Training Infrastructure: Post-training alignment via RL demands:
- High-throughput video generation for reward computation
- Efficient gradient computation through video+IDM pipeline
- Distributed training across multiple robot embodiments
Edge AI for Robotics: Deployment on robots requires:
- Unified video generation + IDM accelerators
- Memory-efficient architectures for long-horizon planning
- Power-constrained operation for mobile robots
Embodiment-Specific Hardware: The embodiment-aware nature suggests opportunities for:
- Configurable accelerators adaptable to different robot kinematics
- Hardware support for constraint checking
- Specialized units for trajectory smoothness computation
Closed-Loop Systems: EVA enables video-based planning in closed-loop control, requiring:
- Real-time video generation and action extraction
- Low-latency sensor-to-action pipelines
- Predictive pre-computation for anticipated scenarios

References

Wang, R., Liu, Q., Deng, Y., Liu, G., Liu, Z., & Jia, K. (2026). EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards. arXiv:2603.17808.
Black, K., et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.xxxxx.
Zhou, C., et al. (2025). ChatVLA-2: Integrating Open-World Reasoning into Robotic Policies. arXiv:2501.xxxxx.
Hu, Y., et al. (2024). RoboTwin: Benchmarking Robot Learning with Digital Twins. arXiv:2401.xxxxx.
Zhao, T., et al. (2025). Video World Models for Robotic Manipulation: A Survey. arXiv:2501.xxxxx.