Perhaps these can be thought of as homework questions: when I imagine us successfully making AI go well, I imagine us building enough expertise that we can answer these questions quickly and easily. Before I read the answers, I'm going to think for 10 minutes or so about each one and post my own guesses.
Useful links / background reading: The glorious EfficientZero: How it Works. Related comment. EfficientZero GitHub. LW discussion.
Some of these questions are about EfficientZero, the net trained recently; others are about EfficientZero the architecture, imagined to be suitably scaled up to AGI levels. "If we made a much bigger and longer-trained version of this (with suitable training environment) such that it was superhuman AGI..."
- EfficientZero vs. reward hacking and inner alignment failure:
    - Barring inner alignment failure, it’ll eventually reward hack, right? That is, if it gets sufficiently knowledgeable and capable, it’ll realize that it can get loads of reward by hacking its reward channel, and then its core algorithm would evaluate that action/plan highly and do it. Right?
    - But fortunately (?) maybe there would be an inner alignment failure and the part of it that predicts reward would predict low reward from that action, even in the limit of knowledge and capability? Because it’s learned to predict proxies for reward rather than reward itself, and continued to do so even as it got smarter and more capable? (Would this happen? Why would this not be corrected by further training? Has some sort of proxy crystallization set in? How?)
- EfficientZero approximates evidential decision theory, right?
- EfficientZero is a consequentialist (in the sense defined here) architecture, right? It’s not, for example, updateless or deontological. It has no deontological constraints except by accident (i.e. if its predictor net mistakenly always predicted very low reward for actions of type X, even in cases where a reasonably intelligent predictor would predict high reward). Right?
- What is the most complex environment AIs in the family of MuZero, EfficientZero, etc. have been trained on? Is it just some Atari game?
- Roughly how many parameters does EfficientZero have? If you don’t know, what about MuZero? What about the biggest net to date from that general family? The EfficientZero paper doesn't give a direct answer but it describes the architecture in enough detail that you might be able to calculate it...
- If we kept scaling up EfficientZero by OOMs in every way, what would happen? Would it eventually get to agenty AGI? / APS-AI? After all, it seems pretty sample-efficient already. What if its sample was an entire lifetime of diverse experiences?
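On the parameter-count question above: here is a rough sketch of how one might do that calculation by hand, assuming the networks are built from standard ResNet-style blocks (3x3 convolutions plus batch norm) and fully connected heads. The channel count, block count, and head sizes below are placeholders made up for illustration, not values taken from the EfficientZero paper; you'd substitute the real ones from its architecture description.

```python
def conv2d_params(in_ch, out_ch, kernel=3, bias=False):
    """Learnable weights (plus optional bias) of one 2D convolution."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)


def batchnorm_params(channels):
    """Learnable scale and shift of one batch-norm layer."""
    return 2 * channels


def residual_block_params(channels, kernel=3):
    """Two (conv + batch-norm) pairs, as in a basic ResNet block."""
    return 2 * (conv2d_params(channels, channels, kernel) + batchnorm_params(channels))


def linear_params(in_features, out_features, bias=True):
    """Learnable weights (plus optional bias) of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Hypothetical sizes, NOT taken from the paper.
CHANNELS = 64
NUM_BLOCKS = 1
HEAD_INPUT = 6 * 6 * 16   # assumed flattened feature size feeding a head
HEAD_HIDDEN = 32          # assumed hidden width of a head
HEAD_OUTPUT = 1           # e.g. a scalar output

tower = NUM_BLOCKS * residual_block_params(CHANNELS)
head = linear_params(HEAD_INPUT, HEAD_HIDDEN) + linear_params(HEAD_HIDDEN, HEAD_OUTPUT)

print(f"residual tower: {tower:,} parameters")
print(f"one head:       {head:,} parameters")
print(f"total (sketch): {tower + head:,} parameters")
```

Summing terms like these over the representation, dynamics, and prediction networks would give an order-of-magnitude estimate; if you have the GitHub code running, counting the tensors in the actual model is of course more reliable.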
Hmm, I think my comment came across as setting up a horse-race between EfficientZero and human brains, in a way that I didn't intend. Sorry for the bad choice of words. In particular, when I wrote "how AI compares to human brains", I meant it in the sense of "In what ways are they similar vs different? What are their relative strengths and weaknesses? Etc.", but I guess it sounded like I was saying "human brain algorithms are better and EfficientZero is worse".
I could write a "human brain algorithms are fundamentally more powerful than EfficientZero" argument, but I wasn't trying to, and such an argument sure as heck wouldn't fit in a comment. :-)
Sure. If Atari sample efficiency is what we ultimately care about, then the results speak for themselves. For my part, I was using sample efficiency as a hint about other topics that are not themselves sample efficiency. For example, I think that if somebody wants to understand AlphaZero, the fact that it trained on 40,000,000 games of self-play is a highly relevant and interesting data point. Suppose you were to then say "…but of those 40,000,000 games, fundamentally it really only needed 100 games with the external simulator to learn the rules. The other 39,999,900 games might as well have been 'in its head'. This was proven in follow-up work." I would reply: "Oh. OK. That's interesting too. But I still care about the 40,000,000 number. I still see that number as a very important part of understanding the nature of AlphaZero and similar systems."
(I'm not sure we're disagreeing about anything…)