TL;DR: We propose a novel real-to-sim-to-real approach for scaling real robot datasets and adapting VLA models to target domains.
Vision-language-action (VLA) models present a promising paradigm by training policies directly on real robot datasets like Open X-Embodiment. However, the high cost of real-world data collection hinders further data scaling, thereby restricting the generalizability of VLAs.
In this paper, we introduce ReBot, a novel real-to-sim-to-real approach for scaling real robot datasets and adapting VLA models to target domains, the last-mile deployment challenge in robot manipulation. Specifically, ReBot replays real-world robot trajectories in simulation to diversify the manipulated objects (real-to-sim), then integrates the simulated movements with inpainted real-world backgrounds to synthesize physically realistic and temporally consistent robot videos (sim-to-real).
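To make the sim-to-real step more concrete, here is a minimal sketch of how a simulated foreground could be blended over an inpainted background, frame by frame. It is an illustration under stated assumptions, not the ReBot implementation: the function names `composite_frame` and `composite_video`, the nested-list frame layout, and the per-pixel alpha mask are all hypothetical, and the source does not specify how the masks or renders are produced. A practical pipeline would likely use an array library rather than plain Python lists.

```python
from typing import List, Tuple

# Hypothetical data layout, chosen only for illustration.
Pixel = Tuple[int, int, int]   # 8-bit RGB
Frame = List[List[Pixel]]      # frame[row][col]
Mask = List[List[float]]       # alpha in [0, 1]; 1.0 marks a simulated robot/object pixel


def composite_frame(sim: Frame, alpha: Mask, background: Frame) -> Frame:
    """Alpha-blend one simulated frame over one inpainted background frame."""
    if not (len(sim) == len(alpha) == len(background)):
        raise ValueError("sim, alpha, and background must have the same height")
    out: Frame = []
    for sim_row, alpha_row, bg_row in zip(sim, alpha, background):
        if not (len(sim_row) == len(alpha_row) == len(bg_row)):
            raise ValueError("sim, alpha, and background must have the same width")
        out_row: List[Pixel] = []
        for s, a, b in zip(sim_row, alpha_row, bg_row):
            # Standard "over" blend per channel: a * foreground + (1 - a) * background.
            blended = tuple(round(a * sc + (1.0 - a) * bc) for sc, bc in zip(s, b))
            out_row.append(blended)
        out.append(out_row)
    return out


def composite_video(
    sim_frames: List[Frame], alphas: List[Mask], backgrounds: List[Frame]
) -> List[Frame]:
    """Blend a whole clip, pairing frames by index so the motion stays aligned in time."""
    if not (len(sim_frames) == len(alphas) == len(backgrounds)):
        raise ValueError("all three sequences must have the same number of frames")
    return [
        composite_frame(s, a, b)
        for s, a, b in zip(sim_frames, alphas, backgrounds)
    ]
```

In this sketch, a pixel with alpha 1.0 takes its color from the simulated render and a pixel with alpha 0.0 keeps the inpainted real-world background, which is the basic idea behind placing simulated robot and object motion into a real scene.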
Our approach has several advantages:
Extensive experiments in both simulation and real-world environments show that ReBot significantly enhances the performance and robustness of VLAs. For example, in SimplerEnv with the WidowX robot, ReBot improved the in-domain performance of Octo by 7.2% and OpenVLA by 21.8%, and out-of-domain generalization by 19.9% and 9.4%, respectively. For real-world evaluation with a Franka robot, ReBot increased the success rates of Octo by 17% and OpenVLA by 20%.
ReBot includes three key components:
Video comparison (three examples): Original video, ROSIE, ReBot
Put spoon on towel
Put carrot on plate
Stack green block on yellow block
Put eggplant in yellow basket
Physical: unseen object sizes
Spoon size: 0.8x
Spoon size: 1.2x
Carrot size: 0.8x
Carrot size: 1.2x
Semantics: unseen instructions
"Place spoon onto towel"
"Put vegetable on plate"
"Put green cube onto yellow cube"
"Move eggplant into basket"
Subject: unseen objects
Put apple on plate
Put fanta can on towel
Put orange on plate
Put red bull can on plate
Pick coke can (standing)
Pick coke can (horizontal)
Pick coke can (vertical)
Put carrot in blue plate
Put grape in yellow plate
Put fanta can in blue plate
Put black cube in yellow plate
@article{fang2025rebot,
title={ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis},
author={Fang, Yu and Yang, Yue and Zhu, Xinghao and Zheng, Kaiyuan and Bertasius, Gedas and Szafir, Daniel and Ding, Mingyu},
journal={arXiv preprint arXiv:2503.14526},
year={2025}
}