This article was outlined by Nate Soares, inflated by Rob Bensinger, and then edited by Nate. Content warning: the tone of this post feels defensive to me. I don't generally enjoy writing in "defensive" mode, but I've had this argument thrice recently in surprising places, and so it seemed worth writing my thoughts up anyway.
In last year’s Ngo/Yudkowsky conversation, one of Richard’s big criticisms of Eliezer was, roughly, ‘Why the heck have you spent so much time focusing on recursive self-improvement? Is that not indicative of poor reasoning about AGI?’
I’ve heard similar criticisms of MIRI and FHI’s past focus on orthogonality and instrumental convergence: these notions seem obvious, so either MIRI and FHI must be totally confused about what the big central debates in AI alignment are, or they must have some very weird set of beliefs on which these notions are somehow super-relevant.
This seems to be a pretty common criticism of past-MIRI (and, similarly, of past-FHI); in the past month or so, I’ve heard it two other times while talking to other OpenAI and Open Phil people.
This argument looks misguided to me, and I hypothesize that a bunch of the misguidedness is coming from a simple failure to understand the relevant history.
I joined this field in 2013-2014, which is far from "early", but is early enough that I can attest that recursive self-improvement, orthogonality, etc. were geared towards a different argumentative environment, one dominated by claims like "AGI is impossible", "AGI won't be able to exceed humans by much", and "AGI will naturally be good".
A possible response: “Okay, but ‘sufficiently smart AGI will recursively self-improve’ and ‘AI isn’t automatically nice’ are still obvious. You should have just ignored the people who couldn’t immediately see this, and focused on the arguments that would be relevant to hypothetical savvy people in the future, once the latter joined in the discussion.”
I have some sympathy for this argument. Some considerations weighing against, though, are:
- I think it makes more sense to filter on argument validity, rather than “obviousness”. What’s obvious varies a lot from individual to individual. If just about everyone talking about AGI is saying “obviously false” things (as was indeed the case in 2010), then it makes sense to at least try publicly writing up the obvious counter-arguments.
- This seems to assume that the old arguments (e.g., in Superintelligence) didn’t work. In contrast, I think it’s quite plausible that “everyone with a drop of sense in them agrees with those arguments today” is true in large part because these propositions were explicitly laid out and argued for in the past. The claims we take as background now are the claims that were fought for by the old guard.
- I think this argument overstates how many people in ML today grok the “obvious” points. E.g., based on a recent DeepMind Podcast episode, these sound like likely points of disagreement with David Silver.
But even if you think this was a strategic error, I still think it’s important to recognize that MIRI and FHI were arguing correctly against the mistaken views of the time, rather than arguing poorly against future views.
Recursive self-improvement
Why did past-MIRI talk so much about recursive self-improvement? Was it because Eliezer was super confident that humanity was going to get to AGI via the route of a seed AI that understands its own source code?
I doubt it. My read is that Eliezer did have "seed AI" as a top guess, back before the deep learning revolution. But I don't think that's the main source of all the discussion of recursive self-improvement in the period around 2008.
Rather, my read of the history is that MIRI was operating in an argumentative environment where:
- Ray Kurzweil was claiming things along the lines of ‘Moore’s Law will continue into the indefinite future, even past the point where AGI can contribute to AGI research.’ (The Five Theses, in 2013, is a list of the key things Kurzweilians were getting wrong.)
- Robin Hanson was claiming things along the lines of ‘The power is in the culture; superintelligences wouldn’t be able to outstrip the rest of humanity.’
The memetic environment was one where most people were either ignoring the topic altogether, or asserting ‘AGI cannot fly all that high’, or asserting ‘AGI flying high would be business-as-usual (e.g., with respect to growth rates)’.
The weighty conclusion of the "recursive self-improvement" meme is not “expect seed AI”. The weighty conclusion is “sufficiently smart AI will rapidly improve to heights that leave humans in the dust”.
Note that this conclusion is still, to the best of my knowledge, completely true, and recursive self-improvement is a correct argument for it.
Which is not to say that recursive self-improvement happens before the end of the world; if the first AGI's mind is sufficiently complex and kludgy, it’s entirely possible that the cognitions it implements are able to (e.g.) crack nanotech well enough to kill all humans, before they’re able to crack themselves.
The big update over the last decade has been that humans might be able to fumble their way to AGI that can do crazy stuff before it does much self-improvement.
(Though, to be clear, from my perspective it’s still entirely plausible that you will be able to turn the first general reasoners loose on their own architecture and get a big boost, and so there's still a decent chance that self-improvement plays an important early role. (Probably destroying the world in the process, of course. Doubly so given that I expect it’s even harder to understand and align a system if it’s self-improving.))
In other words, it doesn’t seem to me like developments like deep learning have undermined the recursive self-improvement argument in any real way. The argument seems solid to me, and reality seems quite consistent with it.
Taking into account its past context, recursive self-improvement was a super conservative argument that has been vindicated in its conservatism.
It was an argument for the proposition “AGI will be able to exceed the heck out of humans”. And AlphaZero came along and was like, “Yep, that’s true.”
Recursive self-improvement was a super conservative argument for “AI blows past human culture eventually”; when reality then comes along and says “yes, this happens in 2016 when the systems are far from truly general”, the update to make is that this way of thinking about AGI sharply outperformed, not that this way of thinking was silly because it talked about sci-fi stuff like recursive self-improvement when it turns out you can do crazy stuff without even going that far. As Eliezer put it, “reality held a more extreme position than I did on the Yudkowsky-Hanson spectrum”.
If arguments like recursive self-improvement and orthogonality seem irrelevant and obvious now, then great! Intellectual progress has been made. If we're lucky and get to the next stop on the train, then I’ll hopefully be able to link back to this post when people look back and ask why we were arguing about all these other silly obvious things back in 2022.
Deep learning
I think "MIRI staff spent a bunch of time talking about instrumental convergence, orthogonality, recursive self-improvement, etc." is a silly criticism.
On the other hand, I think "MIRI staff were slow to update about how far deep learning might go" is a fair criticism, and we lose Bayes points here, especially relative to people who were vocally bullish about deep learning before late 2015 / early 2016.
In 2003, deep learning didn't work, and nothing else worked all that well either. A reasonable guess was that we'd need to understand intelligence in order to get unstuck; and if you understand intelligence, then an obvious way to achieve superintelligence is to build a simple, small, clean AI that can take over the hard work of improving itself. This is the idea of “seed AI”, as I understand it. I don’t think 2003-Eliezer thought this direction was certain, but I think he had a bunch of probability mass on it.[1]
I think that Eliezer’s model was somewhat surprised by humanity’s subsequent failure to gain much understanding of intelligence, and also by the fact that humanity was able to find relatively brute-force-ish methods that were computationally tractable enough to produce a lot of intelligence anyway.
But I also think this was a reasonable take in 2003. Other people had even better takes — Shane Legg comes to mind. He stuck his neck out early with narrow predictions that panned out. Props to Shane.
I personally had run-of-the-mill bad ideas about AI as late as 2010, and didn't turn my attention to this field until about 2013, which means that I lost a bunch of Bayes points relative to the people who managed to figure out in 1990 or 2000 that AGI will be our final invention. (Yes, even if the people who called it in 2000 were expecting seed AI rather than deep learning, back when nothing was really working. I reject the Copenhagen Theory Of Forecasting, according to which you gain special epistemic advantage from not having noticed the problem early enough to guess wrongly.)
My sense is that MIRI started taking the deep learning revolution much more seriously in 2013, while having reservations about whether broadly deep-learning-like techniques would be the first way humanity reached AGI. Even now, it’s not completely obvious to me that this will be the broad paradigm in which AGI is first developed, though something like that seems fairly likely at this point. But, if memory serves, during the Jan. 2015 Puerto Rico conference I was treating the chance of deep learning going all the way as being in the 10-40% range; so I don't think it would be fair to characterize me as being totally blindsided.
My impression is that Eliezer and I, at least, updated harder in 2015/16, in the wake of AlphaGo, than a bunch of other locals (and I, at least, think I've been less surprised than various other vocal locals by GPT, PaLM, etc. in recent years).
Could we have done better? Yes. Did we lose Bayes points? Yes, especially relative to folks like Shane Legg.
But since 2016, it has mostly looked to me like, with each AGI advance, others update toward my current position. So I'm feeling pretty good about the predictive power of my current models.
Maybe this all sounds like revisionism to you, and your impression of FOOM-debate-era Eliezer was that he loved GOFAI and thought recursive self-improvement was the only advantage digital intelligence could have over human intelligence.
And, I wasn't here in that era. But I note that Eliezer said the opposite at the time; and the track record for such claims seems to hold more examples of “mistakenly rounding the other side’s views off to a simpler, more-cognitively-available caricature”, and fewer examples of “peering past the veil of the author’s text to see his hidden soul”.
Also: It’s important to ask proponents of a theory what they predict will happen, before crowing about how their theory made a misprediction. You're always welcome to ask for my predictions in advance.
(I’ve been making this offer to people who disagree with me about whether I have egg on my face since 2015, and have rarely been taken up on it. E.g.: yes, we too predict that it's easy to get GPT-3 to tell you the answers that humans label "aligned" to simple word problems about what we think of as “ethical”, or whatever. That’s never where we thought the difficulty of the alignment problem was in the first place. Before saying that this shows that alignment is actually easy contra everything MIRI folk said, consider asking some MIRI folk for their predictions about what you’ll see.)
[1] In particular, I think Eliezer’s best guess was AI systems that would look small, clean, and well-understood relative to the large opaque artifacts produced by deep learning. That doesn’t mean that he was picturing GOFAI; there exist a wide range of possibilities of the form “you understand intelligence well enough to not have to hand off the entire task to a gradient-descent-ish process to do it for you” that do not reduce to “coding everything by hand”, and certainly don’t reduce to “reasoning deductively rather than probabilistically”.
(still travelling; still not going to reply in a ton of depth; sorry. also, this is very off-the-cuff and unreflected-upon.)
For all that someone says "my image classifier is very good", I do not expect it to be able to correctly classify "a screenshot of the code for an FAI" as distinct from everything else. There are some cognitive tasks that look so involved as to require smart-enough-to-be-dangerous capabilities. Some such cognitive tasks can be recast as "being smart", just as they can be cast as "image classification". Those ones will be hard without scary capabilities. Solutions to easier cognitive problems (whether cast as "image classification" or "being smart" or whatever) by non-scary systems don't feel to me like they undermine this model.
"Being good" is one of those things where the fact that a non-scary AI checks a bunch of "it was being good" boxes before some consequent AI gets scary, does not give me much confidence that the consequent AI will also be good, much like how your chimps can check a bunch of "is having kids" boxes without ultimately being an IGF maximizer when they grow up.
My cached guess as to our disagreement vis-à-vis "being good in a Chinese bureaucracy" is whether or not some of the difficult cognitive challenges (such as understanding certain math problems well enough to have insights about them) decompose such that those cognitions can be split across a bunch of non-scary reasoners in a way that succeeds at the difficult cognition without the aggregate itself being scary. I continue to doubt that, and don't feel like we've seen much evidence either way yet (but perhaps you know things I do not).
To be clear, I agree that GPT-3 already has strong enough understanding to solve the sorts of problems Eliezer was talking about in the "get my grandma out of the burning house" argument. I read (perhaps ahistorically) the grandma-house argument as being about how specifying precisely what you want is real hard. I agree that AIs will be able to learn a pretty good concept of what we want without a ton of trouble. (Probably not so well that we can just select one of their concepts and have it optimize for that, in the fantasy-world where we can leaf through its concepts and have it optimize for one of them, because of how the empirically-learned concepts are more likely to be like "what we think we want" than "what we would want if we were more who we wished to be" etc. etc.)
Separately, in other contexts where I talk about AI systems understanding the consequences of their actions being a bottleneck, it's understanding of consequences sufficient for things like fully-automated programming and engineering. Which look to me like they require a lot of understanding-of-consequences that GPT-3 does not yet possess. My "for the record" above was trying to make that clear, but wasn't making the above point where I think we agree clear; sorry about that.
It would take a bunch of banging, but there's probably some sort of "the human engineer can stare at the engineering puzzle and tell you the solution (by using thinking-about-consequences in the manner that seems to me to be tricky)" task that I doubt an AI can replicate before it's pretty close to being a good engineer. Or similarly with, like, looking at a large amount of buggy code (where fixing the bug requires understanding some subtle behavior of the whole system) and then telling you the fix; I doubt an AI can do that before it's close to being able to do the "core" cognitive work of computer programming.
Maybe somewhat? My models are mostly like "I'm not sure how far language models can get, but I don't think they can get to full-auto programming or engineering", and when someone is like "well they got a little farther (although not as far as you say they can't)!", it does not feel to me like a big hit. My guess is it feels to you like it should be a bigger hit, because you're modelling the skills that copilot currently exhibits as being more on-a-continuum with the skills I don't expect language models can pull off, and so any march along the continuum looks to you like it must be making me sweat?
If things like copilot smoothly increase in "programming capability" to the point that they can do fully-automated programming of complex projects like twitter, then I'd be surprised.
I still lose a few Bayes points each day to your models, which more narrowly predict that we'll take each next small step, whereas my models are more uncertain and say "for all I know, today is the day that language models hit their wall". I don't see the ratios as very large, though.
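(To make the "Bayes points" bookkeeping concrete: here's a minimal sketch, with made-up probabilities chosen purely for illustration and not numbers either of us has actually stated, of why a string of small daily likelihood ratios feels less weighty to me than one sharp disagreement resolving would.)

```python
import math

# Made-up probabilities, purely for illustration; not numbers anyone actually stated.
p_narrow = 0.9     # a model that narrowly predicts each next incremental step
p_uncertain = 0.8  # a more uncertain model ("maybe today is the day LMs hit a wall")

# When the incremental step does happen, the uncertain model loses
# log2(p_narrow / p_uncertain) bits -- its daily "Bayes points" bill.
bits_per_step = math.log2(p_narrow / p_uncertain)
print(f"{bits_per_step:.2f} bits lost per small step")        # ~0.17 bits

# Compare with a single sharp disagreement (0.9 vs 0.1) resolving one way:
bits_sharp = math.log2(0.9 / 0.1)
print(f"{bits_sharp:.2f} bits from one sharp disagreement")   # ~3.17 bits
print(f"~{bits_sharp / bits_per_step:.0f} small steps to match one sharp disagreement")
```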
A man can dream. We may yet be able to find one, though historically when we've tried it looks to me like we are mostly reading the same history in different ways, which makes things tricky.