Perhaps these can be thought of as homework questions -- when I imagine us successfully making AI go well, I imagine us building expertise such that we can answer these questions quickly and easily. Before I read the answers I'm going to think for 10min or so about each one and post my own guesses.
Useful links / background reading: The glorious EfficientZero: How it Works. Related comment. EfficientZero GitHub. LW discussion.
Some of these questions are about EfficientZero, the net trained recently; others are about EfficientZero the architecture, imagined to be suitably scaled up to AGI levels. "If we made a much bigger and longer-trained version of this (with suitable training environment) such that it was superhuman AGI..."
- EfficientZero vs. reward hacking and inner alignment failure:
- Barring inner alignment failure, it’ll eventually reward hack, right? That is, if it gets sufficiently knowledgeable and capable, it’ll realize that it can get loads of reward by hacking its reward channel, and then its core algorithm would evaluate that action/plan highly and do it. Right?
- But fortunately (?) maybe there would be an inner alignment failure and the part of it that predicts reward would predict low reward from that action, even in the limit of knowledge and capability? Because it’s learned to predict proxies for reward rather than reward itself, and continued to do so even as it got smarter and more capable? (Would this happen? Why would this not be corrected by further training? Has some sort of proxy crystallization set in? How?)
- EfficientZero approximates evidential decision theory, right?
- EfficientZero is a consequentialist (in the sense defined here) architecture, right? It’s not, for example, updateless or deontological. For example, it has no deontological constraints except by accident (i.e. if its predictor-net mistakenly predicted super low reward for actions of type X, always, even in cases where actually a reasonable intelligent predictor would predict high reward.) Right?
- What is the most complex environment AIs in the family of MuZero, EfficientZero, etc. have been trained on? Is it just some Atari game?
- Roughly how many parameters does EfficientZero have? If you don’t know, what about MuZero? What about the biggest net to date from that general family? The EfficientZero paper doesn't give a direct answer but it describes the architecture in enough detail that you might be able to calculate it...
- If we kept scaling up EfficientZero by OOMs in every way, what would happen? Would it eventually get to agenty AGI? / APS-AI? After all, it seems pretty sample-efficient already. What if its sample was an entire lifetime of diverse experiences?
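On the parameter-count question above: one way to attempt the calculation is to add up the layers by hand. Below is a minimal sketch of that bookkeeping. The channel width and block count are placeholder assumptions chosen for illustration, not values taken from the EfficientZero paper, and the helper names are made up for this example.

```python
def conv2d_params(c_in, c_out, kernel, bias=False):
    """Weights in a 2D conv layer: kernel*kernel*c_in*c_out, plus optional bias."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def batchnorm_params(channels):
    """Learnable scale and shift per channel (running statistics not counted)."""
    return 2 * channels


def linear_params(n_in, n_out, bias=True):
    """Weights in a fully connected layer, plus optional bias."""
    return n_in * n_out + (n_out if bias else 0)


def residual_block_params(channels, kernel=3):
    """A plain residual block: two bias-free convs, each followed by batch norm."""
    conv = conv2d_params(channels, channels, kernel)
    return 2 * conv + 2 * batchnorm_params(channels)


# ASSUMED values for illustration only; substitute the numbers from the paper.
ASSUMED_CHANNELS = 64
ASSUMED_BLOCKS = 2

per_block = residual_block_params(ASSUMED_CHANNELS)
trunk_total = ASSUMED_BLOCKS * per_block
print(per_block, trunk_total)
```

Summing this over the representation, dynamics, and prediction networks (plus any linear heads, via `linear_params`) would give a rough total, to the extent the paper specifies each layer's shape.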
It's a little bit less dramatic than that: the model-based simulated play is interleaved with the ground-truth environment. It's more like you spend a year playing games in your head, then you play one 30-second bullet chess match with Magnus Carlsen (made-up ratio), then go back to playing in your head for another year. Or maybe we should say, "you clone yourself a thousand times, and play yourself at correspondence chess timescales for 1 game per pair in a training montage, and then go back for a rematch".
(The scenario where you play for 15 minutes at the beginning, and then pass a few subjective eons trying to master the game with no further real data, would correspond to sample-efficient "offline reinforcement learning" (review), e.g. https://arxiv.org/abs/2104.06294#deepmind / https://arxiv.org/abs/2111.05424 , https://arxiv.org/abs/2006.13888 . Very important in its own right, of course, but it poses challenges of its own related to the quality of those 15 minutes: what if your 15-minute sample doesn't include any games which happen to use en passant, say? How could you ever learn that it should be part of your model of chess? When you interleave and learn online, Carlsen might surprise you with an en passant a few games in, and then all your subsequent imagined games can include that possible move and you can use it yourself.)
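To make the online-vs-offline contrast concrete, here is a toy sketch. It is not EfficientZero's training loop: the "model" is just a set of move types the learner has seen, and `train`, `REAL_GAMES`, and the game contents are all invented for illustration.

```python
def train(real_games, imagined_per_round, online):
    """Toy loop contrasting interleaved (online) and offline learning.

    real_games: list of sets, the move types that appear in each real game.
    online=True:  one real game per round, interleaved with imagined play.
    online=False: only the first real game is ever seen (offline-RL-like).
    Returns (known_moves, count of imagined games that could use en passant).
    """
    known_moves = set()
    imagined_with_en_passant = 0
    for round_idx, game in enumerate(real_games):
        if online or round_idx == 0:
            known_moves |= game  # update the learned model from real data
        # Imagined games can only contain moves the model already knows about.
        if "en_passant" in known_moves:
            imagined_with_en_passant += imagined_per_round
    return known_moves, imagined_with_en_passant


# Hypothetical data: en passant first shows up in the third real game.
REAL_GAMES = [{"normal"}, {"normal"}, {"normal", "en_passant"}, {"normal"}]

online_known, online_count = train(REAL_GAMES, 1000, online=True)
offline_known, offline_count = train(REAL_GAMES, 1000, online=False)
print(online_known, online_count)
print(offline_known, offline_count)
```

In this toy, the online learner picks up en passant in round three and every later batch of imagined games can include it, while the offline learner never sees it no matter how many imagined games it plays.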
I don't think it does. If humans can't learn efficiently by imagining hypothetical games like machines can, so much the worse for them. The goal is to win.