Over at Google, large language models have been plugged into physics simulators to help them share a world model with their human interlocutors, resulting in big performance gains. They call it Mind's Eye.
This is how the authors describe the work:
Correct and complete understanding of properties and interactions in the physical world is not only essential to achieve human-level reasoning (Lake et al., 2017), but also fundamental to build a general-purpose embodied intelligence (Huang et al., 2022). In this work, we investigate to what extent current LMs understand the basic rules and principles of the physical world, and describe how to ground their reasoning with the aid of simulation. Our contributions are three-fold:
• We propose a new multi-task physics alignment dataset, UTOPIA, whose aim is to benchmark how well current LMs can understand and reason over some basic laws of physics (§2). The dataset contains 39 sub-tasks covering six common scenes that involve understanding basic principles of physics (e.g., conservation of momentum in elastic collisions), and all the ground-truth answers are automatically generated by a physics engine. We find that current large-scale LMs are still quite limited on many basic physics-related questions (24% accuracy of GPT-3 175B in zero-shot, and 38.2% in few-shot).
• We explore a paradigm that adds physics simulation to the LM reasoning pipeline (§3) to make the reasoning grounded within the physical world. Specifically, we first use a model to transform the given text-form question into rendering code, and then run the corresponding simulation on a physics engine (i.e., MuJoCo (Todorov et al., 2012)). Finally we append the simulation results to the input prompts of LMs during inference. Our method can serve as a plug-and-play framework that works with any LM and requires neither handcrafted prompts nor costly fine-tuning.
• We systematically evaluate the performance of popular LMs in different sizes on UTOPIA before and after augmentation by Mind’s Eye, and compare the augmented performance with many existing approaches (§4.2). We find Mind’s Eye outperforms other methods by a large margin in both zero-shot and few-shot settings. More importantly, Mind’s Eye is also effective for small LMs, and the performance with small LMs can be on par or even outperform that of 100× larger vanilla LMs.
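Mechanically, the second bullet's pipeline is simple enough to sketch. Below is a minimal, hypothetical version in Python: the hardcoded MJCF scene stands in for whatever the text-to-code model would actually emit, `codegen_lm` and `answer_lm` are placeholder callables for the two LMs, and the prompt template is my guess rather than the paper's.

```python
# Minimal sketch of the Mind's Eye pipeline (my reading of the paper's
# three steps, not the authors' code). Uses the official `mujoco` bindings.
import mujoco

# The kind of MJCF the text-to-code model might emit for a question like
# "two balls collide head-on": two free spheres, with initial velocities
# stored in a keyframe.
EXAMPLE_MJCF = """
<mujoco>
  <option gravity="0 0 0"/>
  <worldbody>
    <body name="ball_a" pos="-1 0 1"><freejoint/>
      <geom type="sphere" size="0.1" mass="1"/></body>
    <body name="ball_b" pos="1 0 1"><freejoint/>
      <geom type="sphere" size="0.1" mass="2"/></body>
  </worldbody>
  <keyframe>
    <key qvel="1 0 0 0 0 0  -1 0 0 0 0 0"/>
  </keyframe>
</mujoco>
"""

def run_simulation(mjcf_xml: str, seconds: float = 2.0) -> str:
    """Run the generated scene and summarize the outcome as plain text."""
    model = mujoco.MjModel.from_xml_string(mjcf_xml)
    data = mujoco.MjData(model)
    if model.nkey > 0:                        # apply initial conditions
        mujoco.mj_resetDataKeyframe(model, data, 0)
    while data.time < seconds:
        mujoco.mj_step(model, data)
    # The first 3 of each free joint's 6 velocity DoFs are linear velocity.
    va, vb = data.qvel[0:3], data.qvel[6:9]
    return f"ball_a final velocity: {va}; ball_b final velocity: {vb}"

def minds_eye_answer(question: str, codegen_lm, answer_lm) -> str:
    """Question -> rendering code -> simulation -> grounded prompt -> answer."""
    mjcf_xml = codegen_lm(question)           # e.g. returns EXAMPLE_MJCF
    evidence = run_simulation(mjcf_xml)
    # The simulation result enters the second LM purely as extra prompt
    # text; the exact template is an assumption.
    return answer_lm(f"{question}\nSimulation result: {evidence}\nAnswer:")
```

Note that the simulator's output reaches the answering LM purely as appended prompt text, which is why the method can be plug-and-play across model families and sizes.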
This seems like a direct step down the CAIS (Comprehensive AI Services) path of development.
I suppose the follow-up question is: how effectively can a model learn to re-implement a physics simulator if it's given access to one during training, rather than being explicitly trained to generate XML config files that drive the simulator at inference time?
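If you wanted to test that, the natural setup is to use the simulator as a labeling oracle during training and fine-tune on its outputs. Here is a hedged sketch of the data-generation half, assuming nothing from the paper; the scene template, the question wording, and the `make_pair` helper are all invented for illustration.

```python
# Hypothetical data generation for the "distill the simulator" variant:
# use MuJoCo as an oracle to label physics questions, then fine-tune an
# LM on the resulting (question, answer) text. Only labeling is shown.
import random
import mujoco

SCENE_TEMPLATE = """
<mujoco>
  <option gravity="0 0 0"/>
  <worldbody>
    <body pos="-1 0 1"><freejoint/><geom type="sphere" size="0.1" mass="{m1}"/></body>
    <body pos="1 0 1"><freejoint/><geom type="sphere" size="0.1" mass="{m2}"/></body>
  </worldbody>
  <keyframe><key qvel="{v1} 0 0 0 0 0  {v2} 0 0 0 0 0"/></keyframe>
</mujoco>
"""

def make_pair(rng: random.Random) -> tuple[str, str]:
    """One (question, simulator-labeled answer) training example."""
    m1, m2 = rng.uniform(0.5, 5.0), rng.uniform(0.5, 5.0)
    v1, v2 = rng.uniform(0.5, 2.0), -rng.uniform(0.5, 2.0)
    model = mujoco.MjModel.from_xml_string(
        SCENE_TEMPLATE.format(m1=m1, m2=m2, v1=v1, v2=v2))
    data = mujoco.MjData(model)
    mujoco.mj_resetDataKeyframe(model, data, 0)
    while data.time < 2.0:                    # long enough for the collision
        mujoco.mj_step(model, data)
    question = (f"A {m1:.1f} kg ball moving at {v1:.1f} m/s hits a "
                f"{m2:.1f} kg ball moving at {abs(v2):.1f} m/s toward it. "
                f"What are their velocities afterwards?")
    answer = (f"Ball one: {data.qvel[0]:.2f} m/s, "
              f"ball two: {data.qvel[6]:.2f} m/s along the line of motion.")
    return question, answer

pairs = [make_pair(random.Random(i)) for i in range(10_000)]
# ...then fine-tune the LM on `pairs`, instead of calling MuJoCo at inference.
```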
If it's substantially more efficient to use this paper's approach and train your model to use a general-purpose (and transparent) physics simulator, I think this bodes well for interpretability in general. In the ELK (Eliciting Latent Knowledge) formulation, this would enable Ontology Identification.
On this point, the paper says:
On the other hand, the general trend of "end-to-end trained is better than hand-crafted architectures" has been going strong in recent years: it's mentioned in the CAIS post, and Demis Hassabis noted in his interview with Lex Fridman that he thinks it's likely to continue (indeed, they chatted quite a bit about using AI models to solve physics problems). And DeepMind has a recent paper gesturing towards an end-to-end learned physics model from video; it looks far less capable than the one shown in the OP, but two papers down the line, who knows.