The part I find the most significant and concerning is his statement that the path forward is something like Gemini crossed with AlphaZero. Most of the concerns about aligning foundation model agents surround unpredictable results of using powerful RL that creates goal-directed behavior in complex, hidden ways. I find those concerns highly realistic; the List of Lethalities (LoL) focuses on these risks and on the subtle but powerful principle "humans fuck things up, especially on the first (hundred)? tries".
That's why I'd much rather see us try Goals selected from learned knowledge: an alternative to RL alignment for our first alignment attempts. This might be possible even if RL is heavily involved in some elements of training; I'm trying to think through those scenarios now.
My general thoughts on DeepMind's strategy, along with a discussion of the impact of RL agentizing an AI more generally, can be found in my comment here. The short answer: I'm a little more concerned than in the case of pre-trained AIs like GPT-4 or GPT-N, and some more alignment work should go to that scenario, but the reward is likely to be densely defined such that the AI has limited opportunities for breaking it via instrumental convergence:
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=DgLC43S7PgMuC878j
(BTW, I also see this as a problem for Section B2, as its examples rely on the analogy of evolution, but there are critical details that prevent us from generalizing from "Evolution failed at aligning us to X" to "Humans can't align AIs to X". It also assumes that corrigibility is anti-natural for consequentialist/Expected Utility Maximizing AIs and highly capable AIs, which I think is incorrect because GPT-4 and GPT-N likely do have a utility function that is learned as described here:
https://www.lesswrong.com/posts/vs49tuFuaMEd4iskA/one-path-to-coherence-conditionalization
I might have more to say on that post later.)
I've replied to AGI Ruin: A List of Lethalities here:
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD
It's a very long comment because I had to respond to a lot of points, so get a drink and a snack while you read it.
I endorse the link to that other comment. We've got what feels like a useful discussion over there on exactly this issue.
Note that I wrote yesterday about how I'd actually do alignment in practice, such that we can get the densely defined signal of human values/instruction following to hold:
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=BxNLNXhpGhxzm7heg
I think governments can set up a futureproof licensing regime for large training runs, not to mention working now to create 'pause buttons' or
Can you say more about what kinds of things governments could be doing to create pause buttons?
Very notable quotes in favor of international coordination and a 'CERN for AI' setup:
International cooperation on safety and deployment norms will be needed since AI is digital and if e.g. China deploys an AI it won't be contained to China.
I still think 'CERN for AI' is the best path for the final few years of the AGI project, to improve safety.
The YouTube "chapters" are mixed up, e.g. the question about regulation comes 5 minutes after the regulation chapter ends. Ignore them.
Noteworthy parts:
8:40: Near-term AI is hyped too much (think current startups, VCs, exaggerated claims about what AI can do, crazy ideas that aren't ready) but AGI is under-hyped and under-appreciated.
16:45: "Gemini is a project that has only existed for a year . . . our trajectory is very good; when we talk next time we should hopefully be right at the forefront."
17:20–18:50: Current AI doesn't work as a digital assistant. The next era/generation is agents. DeepMind is well-positioned to work on agents: "combining AlphaGo with Gemini."
24:00: Staged deployment is nice: red-teaming then closed beta then public deployment.
28:37: Openness (at Google: e.g. publishing transformers, AlphaCode, AlphaFold) is almost always a universal good. But dual-use technology—including AGI—is an exception. With dual-use technology, you want good scientists to still use the technology and advance as quickly as possible, but also restrict access for bad actors. Openness is fine today but in 2-4 years or when systems are more agentic it'll be dangerous. Maybe labs should only open-source models that are lagging a year behind the frontier (and DeepMind will probably take this approach, and indeed is currently doing ~this by releasing Gemma weights).
31:20: "The problem with open source is if something goes wrong you can't recall it. With a proprietary model if your bad actor starts using it in a bad way you can close the tap off . . . but once you open-source something there's no pulling it back. It's a one-way door, so you should be very sure when you do that."
31:42: Can an AGI be contained? We don't know how to do that [this suggests a misalignment/escape threat model but it's not explicit]. Sandboxing and normal security is good for intermediate systems but won't be good enough to contain an AGI smarter than us. We'll have to design protocols for AGI in the future: "when that time comes we'll have better ideas for how to contain that, potentially also using AI systems and tools to monitor the next versions of the AI system."
33:00: Regulation? It's good that people in government are starting to understand AI and AISIs are being set up before the stakes get really high. International cooperation on safety and deployment norms will be needed since AI is digital and if e.g. China deploys an AI it won't be contained to China. Also:
My #1 emerging dangerous capability to test for is deception because if the AI can be deceptive then you can't trust other tests [deceptive alignment threat model but not explicit]. Also agency and self-replication.
37:10: We don't know how to design a system that could come up with the Riemann hypothesis or invent Go. (Despite achieving superhuman Go and being close to AI substantially assisting at proving theorems.)
38:00: Superintelligence and meaning — we'll cure all the diseases and solve energy and solve climate and have radical abundance; no details on what the long-term future looks like.
39:49: How do you make sure that AGI benefits everybody? You don't make a single system for everyone. Instead there'll be provably safe architectures and different people/countries will have personalized AIs. We need international cooperation to build AGI using safe architectures—certainly there are some safe ways and some unsafe ways—and then after making it through that everyone can have their own AGIs.
41:54: "There's two cases to worry about: there's bad uses by bad individuals or nations—human misuse—and then there's the AI itself, as it gets closer to AGI, going off the rails. You need different solutions for those two problems." [No elaboration.]
43:06: Avengers assembled: I still think 'CERN for AI' is the best path for the final few years of the AGI project, to improve safety.
43:55: "Today people disagree [on] the risks — you see very famous people saying there's no risks and then you have people like Geoff Hinton saying there's lot of risks. I'm in the middle."
49:24: DeepMind (founded 2010) was supposedly a 20-year project — usually 20-year projects stay 20 years away, but we're on track! "I wouldn't be surprised if [AGI] comes in the next decade."
My take: nothing surprising; Demis continues to seem reasonable. Somewhat disappointed that he said little about risks and safety plans. Somewhat disappointed by his sit tight and assess approach to regulation—I think governments can set up a futureproof licensing regime for large training runs, not to mention working now to create 'pause buttons' or make labs more transparent on dangerous capabilities—but it's no worse than anyone else in the industry.