Anonymous Submission
In-context imitation learning (ICIL) enables robots to learn new tasks from a small number of demonstrations by conditioning a pre-trained policy on task-specific examples, without retraining at test time. Despite this promise, training generalizable and scalable in-context imitation policies remains an open challenge. We present SynthICL, a scalable framework that trains ICIL policies entirely from RGB-only synthetic data. Specifically, we build a data generation pipeline to produce high-fidelity ICIL data and train a flow-matching transformer policy on the resulting dataset. SynthICL avoids the depth sensing, precise camera calibration, and real-world training data required by prior approaches, offering a simpler and more scalable alternative. We further incorporate subgoal prediction by training the model to predict the next subgoal image, enabling more precise and visually grounded control. Evaluated on 16 unseen real-world manipulation tasks, SynthICL achieves an average success rate of 79% with only one demonstration provided at test time and outperforms prior methods.
Overview of the SynthICL pipeline. (1) We first generate a large-scale dataset of synthetic demonstrations. (2) We then train a flow-matching transformer policy on this dataset. (3) During training, the model conditions on context demonstrations to predict actions, while an auxiliary subgoal prediction objective encourages visually grounded representations.
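As a rough illustration of step (2), the sketch below shows a generic linear-path flow-matching objective and Euler sampler for a context-conditioned action predictor. It is not the SynthICL implementation: the page does not specify the interpolation path, network, or loss weighting, so the function names, the `context` dictionary, and the closed-form `oracle_velocity` (which stands in for the transformer) are all hypothetical. The auxiliary subgoal-image objective is omitted.

```python
import random

def flow_matching_pair(action, noise, t):
    """Build one training pair on a straight path from noise (t=0) to the action (t=1).

    Returns the interpolated sample x_t and the velocity target (action - noise).
    """
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_v = [a - n for n, a in zip(noise, action)]
    return x_t, target_v

def flow_matching_loss(predict_velocity, action, context, rng):
    """Mean squared error between predicted and target velocity for one sample."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # flow time drawn uniformly from [0, 1)
    x_t, target_v = flow_matching_pair(action, noise, t)
    pred_v = predict_velocity(x_t, t, context)
    return sum((p - v) ** 2 for p, v in zip(pred_v, target_v)) / len(action)

def sample_action(predict_velocity, context, dim, rng, steps=10):
    """Integrate the learned velocity field from noise to an action with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = predict_velocity(x, t, context)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def oracle_velocity(x_t, t, context):
    """Hypothetical stand-in for the policy network.

    For a single target action the exact velocity field is (action - x_t) / (1 - t);
    a real policy would instead encode the demonstration images in `context`.
    """
    demo_action = context["demo_action"]
    return [(a - x) / (1.0 - t) for a, x in zip(demo_action, x_t)]

if __name__ == "__main__":
    rng = random.Random(0)
    context = {"demo_action": [0.2, -0.5, 1.0]}  # assumed toy context
    print(flow_matching_loss(oracle_velocity, context["demo_action"], context, rng))
    print(sample_action(oracle_velocity, context, dim=3, rng=rng))
```

In this toy setting the oracle reproduces the demonstrated action exactly, which makes the sampler easy to check. A trained policy would replace `oracle_velocity` with a network that attends over the context demonstrations and the current observation.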
SynthICL transfers policies trained on synthetic demonstrations to real-world robot execution, enabling one-shot in-context adaptation on unseen manipulation tasks.
Real-world evaluation on unseen manipulation tasks with only one demonstration provided.
Additional real-world rollouts across unseen tasks.