This is a linkpost for https://robotics-transformer2.github.io/
Wild. If I'm reading the paper right, this uses the same robot dataset as RT-1 to ground the fine-tuning of the robot-commanding tokens; instead of building a custom architecture (as in RT-1), they just fine-tune an off-the-shelf multimodal transformer, and it works better.
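For a sense of what "robot-commanding tokens" means here, below is a minimal sketch of RT-2-style action tokenization: each continuous action dimension is discretized into 256 bins and written out as plain text, so a pretrained vision-language model can be fine-tuned to emit robot actions the same way it emits words. The bin count follows the paper; the action layout, bounds, and helper names are my own assumptions for illustration, not the authors' code.

```python
import numpy as np

NUM_BINS = 256  # per the RT-2 paper; everything else below is illustrative

def tokenize_action(action, low, high):
    """Map a continuous action vector to a space-separated string of integer bins."""
    action = np.clip(action, low, high)
    bins = np.round((action - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def detokenize_action(token_str, low, high):
    """Invert tokenize_action: recover an approximate continuous action."""
    bins = np.array([int(t) for t in token_str.split()], dtype=float)
    return low + bins / (NUM_BINS - 1) * (high - low)

# Hypothetical 8-dim action: terminate flag, xyz delta, rpy delta, gripper.
low = np.array([0.0] + [-1.0] * 7)
high = np.array([1.0] + [1.0] * 7)
action = np.array([0.0, 0.12, -0.30, 0.05, 0.0, 0.0, 0.4, 1.0])

tokens = tokenize_action(action, low, high)
print(tokens)                                  # a string of 8 integers in [0, 255]
print(detokenize_action(tokens, low, high))    # approximate reconstruction
```

The point of the text encoding is that these action strings can simply be appended to the model's vocabulary usage during fine-tuning, with no new output head needed.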
Abstract
Approach Overview
Results
Videos
Link to videos: https://robotics-transformer2.github.io/#videos