This was interesting and I would like to see more AI research organizations conducting + publishing similar surveys.
Thanks! For those interested in conducting similar surveys, here is a version of the spreadsheet you can copy (by request elsewhere in the comments).
I'm glad I ran this survey, and I expect the overall agreement distribution probably still holds for the current GDM alignment team (or may have shifted somewhat in the direction of disagreement), though I haven't rerun the survey so I don't really know. Looking back at the "possible implications for our work" section, we are working on basically all of these things.
Thoughts on some of the cruxes in the post based on last year's developments:
I hoped to see other groups do the survey as well - looks like this didn't happen, though a few people asked me to share the template at the time. It would be particularly interesting if someone ran a version of the survey with separate ratings for "agreement with the statement" and "agreement with the implications for risk".
Curated. I appreciate the DeepMind alignment team taking the time to engage with Eliezer's list and write up their thoughts. It feels helpful to know which particular points have a relative consensus and which are controversial, all the more so when they offer reasons. And in the specific case of this list, which claims failure is nigh certain, it's interesting to see on which points people disagree and think there's hope.
While I agree with much of this content, I think you guys (the anonymous authors) are most likely to be wrong in your disagreement with the "alien concepts" point (#33).
To make a more specific claim (to be evaluated separately), I mostly expect this due to speed advantage, combined with examples of how human concepts are alien relative to those of analogously speed-disadvantaged living systems. For instance, most plants and somatic (non-neuronal) animal components use a lot of (very slow) electrical signalling to make very complex decisions (e.g., morphogenesis and healing; see Michael Levin's work on reprogramming regenerative organisms by decoding their electrical signalling). To the extent that these living systems (plants, and animal-parts) utilize "concepts" in the course of their complex decision-making, at present they seem quite alien to us, and many people (including some likely responders to this comment) will say that plants and somatic animal components entirely lack intelligence and do not make decisions. I'm not trying to argue for some kind of panpsychism or expanding circle of compassion here, just pointing out a large body of research (again, start with Levin) showing complex and robust decision-making within plants and (even more so) animal bodies, which humans consider relatively speaking "unintelligent" or at least "not thinking in what we regard to be valid abstract concepts", and I think there will be a similar disparity between humans and A(G)I after it runs for a while (say, 1000 subjective civilization-years, or a few days to a year of human-clock-time).
I expect lots of alien concepts in domains where AI far surpasses humans (e.g. I expect this to be true of AlphaFold). But if you look at the text of the ruin argument:
Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.
I think this is pretty questionable. I expect that a good chunk of GPT-3's cognition is something that could be translated into something comprehensible, mostly because I think humans are really good at language and GPT-3 is only somewhat better on some axes (and worse on others). I don't remember what I said on this survey but right now I'm feeling like it's "Unclear", since I expect lots of AIs to have lots of alien concepts, but I don't think I expect quite as much alienness as Eliezer seems to expect.
(And this does seem to materially change how difficult you expect alignment to be; on my view you can hope that in addition to all the alien concepts the AI also has regular concepts about "am I doing what my designers want" or "am I deceiving the humans" which you could then hope to extract with interpretability.)
Also, I wonder to what extent our own "thinking" is based on concepts we ourselves understand. I'd bet I don't really understand what concepts most of my own thinking processes use.
Like: what are the exact concepts I use when I throw a ball? Is there a term for velocity, the gravitational constant, or air friction, or is it just some completely "alien" computation which is "inlined" and "tree-shaken" of any unneeded abstractions, and which just sends motor outputs given the target position?
Or: what concepts do I use to know what word to place at this place in this sentence? Do I use concepts like "subject", or "verb" or "sentiment", or rather just go with the flow subconsciously, having just a vague idea of the direction I am going with this argument?
Or: what concepts do I really use when deciding to rotate the steering wheel 2 degrees to the right when driving a car through a forest road's gentle turn? Do I think about "angles", "asphalt", "trees", "centrifugal force", "tire friction", or rather just try to push the future into the direction where the road ahead looks more straight to me and somehow I just know that this steering wheel is "straightening" the image I see?
Or: how exactly do I solve (not: verify an already written proof) a math problem? How does the solution pop into my mind? Is there some systematic search over all possible terms and derivations, or rather some giant hash-map-like interconnected store of "related tricks and transformations I've seen before" which get proposed?
I think my point is that we should not conflate the way we actually solve problems (subconsciously?) with the way we talk (consciously) about solutions we've already found when trying to verify them ourselves (the inner monologue) or convey them to another person. First of all, the Release and Debug binaries can differ (riding a bike for the first time is a completely different experience from riding it on the 2000th attempt). Second, the on-the-wire format and the data structure before serialization can be very different (the way I explain how to solve an equation to my kid is not exactly how I solve it).
I think that training a separate AI to interpret the inner workings of another AI for us is risky, in the same way that a Public Relations department or a lawyer doesn't necessarily give you an honest picture of what the client is really up to.
Also, there's much talk about the distinction between System 1 and System 2, or subconsciousness and consciousness, etc.
But do we really take seriously the implication of all that: the concepts the conscious part of our mind uses to "explain" subconscious actions have almost nothing to do with how they actually happened? If we force the AI to use these concepts, it will either lie to us ("Your honor, as we shall soon see, the defendant wanted to..."), or be crippled (have you tried to drive a car using just the concepts from a physics textbook?). But even in the latter case it looks like a lie to me, because even if the AI is really using the concepts it claims or seems to be using, there's still a mismatch in myself: I now think I understand that the AI works just like me, while in reality I work completely differently than I thought. How bad that is depends on the problem domain, IMHO. This might be pretty good if the AI is trying to solve a problem like "how to throw a ball" and a program using physics equations is actually also a good way of doing it. But once we get to more complicated stuff, like operating an autonomous drone on the battlefield or governing a country's budget, I think there's a risk, because we don't really know how we ourselves make these kinds of decisions.
Yes, this surprised me too. Perhaps it was the phrasing that they disagreed with? If you asked them about all possible intelligences in mindspace, and asked them if they thought AGI would fall very close to most human minds, maybe their answer would be different.
As far as I can tell, the major disagreements are about us having a plan and taking a pivotal act. There seems to be general "consensus" (Unclear, Mostly Agree, Agree) about what the problems are and how an AGI might look. If no pivotal act is needed, then either you think that we will be able to tackle this problem with the resources we have and will have, you have (way) longer timelines (let's assume Eliezer's timeline is 2032 for argument's sake), or you expect the world to make a major shift in priorities concerning AGI.
Am I correct in assuming this, or am I missing some alternatives?
(I'm on the DeepMind alignment team)
There's a fair amount of disagreement within the team as well. I'll try to say some things that I think almost everyone on the team would agree with but I could easily be wrong about that.
you think that we will be able to tackle this problem with the resources we have and will have
Presumably even on a pivotal act framing, we also have to execute a pivotal act with the resources we have and will have, so I'm not really understanding what the distinction is here? But I'm guessing that this is closest to "our" belief of the options you listed.
Note that this doesn't mean that "we" think you can get x-risk down to zero; it means that "we" think that non-pivotal-act strategies reduce x-risk more than pivotal-act strategies.
I misused the definition of a pivotal act which makes it confusing. My bad!
I understood the phrase "pivotal act" more in the spirit of an out-of-distribution effort. To rephrase it more clearly: do "you" think an out-of-distribution effort is needed right now? For example, sacrificing the long term (20 years) for the short term (5 years), or going for high-risk, high-reward strategies.
Or should we stay on our current trajectory, since it maximizes our chances of winning? (which, as far as I can tell, is "your" opinion)
To the extent I understand you (which at this point I think I do), yes, "we" think we should stay on our current trajectory.
I broadly agree with this post, especially the "(unilateral) pivotal acts are not the right approach" aspect.
The pivotal acts proposed are extremely specific solutions to specific problems, and are only applicable in very specific scenarios of AI clearly being on the brink of vastly surpassing human intelligence. That should be clarified whenever they are brought up; it's a thought experiment solution to a thought experiment problem, and if it suddenly stops being a thought experiment then that's great because you have the solution on a silver platter.
My viewpoint is that the most dangerous risks rely on inner alignment issues, and that is basically because of very bad transparency tools, instrumental convergence issues toward power and deception, and mesa-optimizers essentially ruining what outer alignment you have. If you could figure out a reliable way to detect or make sure that deceptive models could never be reached in your training process, that would relieve a lot of my fears of X-risk from AI.
I actually think Eliezer is underrating civilizational competence once AGI is released, via the MNM effect, as happened with Covid. Unfortunately, this only buys time before the end. A superhuman intelligence that is deceiving human civilization based on instrumental convergence essentially will win barring pivotal acts, as Eliezer says. The goal of AI safety is to make alignment not dependent on heroic, pivotal actions.
So Andrew Critch's hopes of not needing pivotal acts only work if significant portions of the alignment problem are solved or at least ameliorated, which we are not super close to doing. So whether alignment will require pivotal acts is directly dependent on solving the alignment problem more generally.
Pivotal acts are a worse solution to alignment and shouldn't be thought of as the default solution, but they are a back-pocket solution we shouldn't forget about.
If I had to name a crux between Eliezer Yudkowsky's/Rob Bensinger's/Nate Soares's/MIRI's views and the views of DeepMind's safety team or Andrew Critch, it's whether the alignment problem is foresight-loaded (and thus civilization will be incompetent, and safety will require more pivotal acts) or empirically-loaded, where we don't need to see the bullets in advance (and thus civilization could be more competent and pivotal acts matter less). It's an interesting crux to be sure.
PS: Does DeepMind's safety team have real power to disapprove AI projects? Or are they like the Google Ethics team, which had no power to disapprove AI projects without being fired?
We don't have the power to shut down projects, but we can make recommendations and provide input into decisions about projects.
So you can have non-binding recommendations and input, but no actual binding power over the capabilities researchers, right?
Correct. I think that doing internal outreach to build an alignment-aware company culture and building relationships with key decision-makers can go a long way. I don't think it's possible to have complete binding power over capabilities projects anyway, since the people who want to run the project could in principle leave and start their own org.
The MNM effect is essentially that people can react strongly to disasters once they happen, since they don't want to die, and this can prevent the worst outcomes from happening. It's a short-term response, but it can become a control system.
Here's a link: https://www.lesswrong.com/posts/EgdHK523ZM4zPiX5q/coronavirus-as-a-test-run-for-x-risks
Request: could you make a version of this (e.g. with all of your responses stripped) that I/anyone can make a copy of?
Here is a spreadsheet you can copy. This one has a column for each person - if you want to sort the rows by agreement, you need to do it manually after people enter their ratings. I think it's possible to automate this but I was too lazy.
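For anyone who wants to automate the per-row sorting, here is a minimal Python sketch. It assumes the sheet is exported so that each argument maps to a list of rating strings, and that the ratings use a five-point scale (Agree, Mostly agree, Unclear, Mostly disagree, Disagree). The scale labels, the function names and the example data are all assumptions for illustration, not something taken from the spreadsheet itself.

```python
# Assumed five-point scale, ordered from most to least agreement.
SCALE = ["Agree", "Mostly agree", "Unclear", "Mostly disagree", "Disagree"]
RANK = {label.lower(): i for i, label in enumerate(SCALE)}


def sort_row(ratings):
    """Return one argument's ratings ordered from Agree to Disagree.

    Blank cells (people who skipped the argument) are dropped, and
    labels are matched case-insensitively. An unknown label raises
    KeyError rather than being silently misplaced.
    """
    cleaned = [r.strip() for r in ratings if r and r.strip()]
    return sorted(cleaned, key=lambda r: RANK[r.lower()])


def sort_sheet(rows):
    """Sort every row independently; `rows` maps argument name -> ratings."""
    return {name: sort_row(ratings) for name, ratings in rows.items()}


# Hypothetical example data, not the survey's actual responses.
example = {
    "#6. Pivotal act is necessary": ["Disagree", "Unclear", "Mostly disagree", ""],
    "#33. Alien concepts": ["Agree", "Disagree", "Mostly agree", "Unclear"],
}
sorted_example = sort_sheet(example)
```

Because each row is sorted on its own, a column in the output no longer corresponds to a specific person, which is the same anonymizing property the manually sorted tables have.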
It seems like “get people to implement stronger restrictions” or “explain misalignment risks” or “come up with better regulation” or “differentially improve alignment” are all better applications of an AGI than “do a pivotal act”.
Is there any justification for this? I mean, it's easy to see how you can "explain misalignment risks", but its usefulness depends on eventually stopping every single PC owner from destroying the world. Similarly with “differentially improve alignment”: what are you going to do with a new aligned superhuman AI, give it to every factory owner? It doesn't sound like it's actually optimized for survival.
Thankfully, the Landauer limit as well as current results in AI mean we probably don't have to do a pivotal act, because for the foreseeable future only companies will be able to create AGI.
"Companies develop AGI in the foreseeable future" means that soon after, they will be able to improve the technology to the point where they are not the only ones who can do it. If the plan is "keep AGI to ourselves" then OK, but it still leaves the question of "until what?".
I would say that companies should stop trying to race to AGI. AGI, if it is ever to be built, should only come after we have solved inner alignment at least.
Great analysis! I’m curious about the disagreement with needing a pivotal act. Is this disagreement more epistemic or normative? That is to say do you think they assign a very low probability of needing a pivotal act to prevent misaligned AGI? Or do they have concerns about the potential consequences of this mentality? (people competing with each other to create powerful AGI, accidentally creating a misaligned AGI as a result, public opinion, etc.)
I would say the primary disagreement is epistemic - I think most of us would assign a low probability to a pivotal act defined as "a discrete action by a small group of people that flips the gameboard" being necessary. We also disagree on a normative level with the pivotal act framing, e.g. for reasons described in Critch's post on this topic.
If I had to say why I'd disagree with a pivotal act framing, PR would probably be the obvious risk, but another risk is that this type of thing is easily politicized by default, and there are no guardrails.
It will be extremely tempting to change the world to fit your ideology and politics, and that gets very dangerous very quickly.
Now, unfortunately, this might really happen: we could be in the unhappy scenario where pivotal acts are required. But if it ever comes to that, they will need to be narrow pivotal acts, and the political system shouldn't be changed unless the change relates to AI safety. Narrowness is a virtue, and one of the most important ones during pivotal acts.
#4. Can't cooperate to avoid AGI
Maybe we can. This is how the Montreal Protocol came about: scientists discovered that chlorofluorocarbons were bad for the ozone. Governments believed them, then the Montreal Protocol was signed, and CFC use fell by 99.7%, leading to the stabilization of the ozone layer, perhaps the greatest example of global cooperation in history.
It took around 15 years from the time scientists discovered that chlorofluorocarbons were causing a major problem to the time the Montreal Protocol was adopted.
How can scientists convince the world to cooperate on AGI alignment in less time?
They haven't managed to do it so far for climate change, which has received massively more attention than AGI. I have seen this example used many times to argue that we can indeed be successful at coordinating on major challenges, but I think this case is misleading: CFCs never played a major role in the economy and they were easily replaceable, so forbidding them was not such an important move.
We had some discussions of the AGI ruin arguments within the DeepMind alignment team to clarify for ourselves which of these arguments we are most concerned about and what the implications are for our work. This post summarizes the opinions of a subset of the alignment team on these arguments. Disclaimer: these are our own opinions that do not represent the views of DeepMind as a whole or its broader community of safety researchers.
This doc shows opinions and comments from 8 people on the alignment team (without attribution). For each section of the list, we show a table summarizing agreement / disagreement with the arguments in that section (the tables can be found in this sheet). Each row is sorted from Agree to Disagree, so a column does not correspond to a specific person. We also provide detailed comments and clarifications on each argument from the team members.
For each argument, we include a shorthand description in a few words for ease of reference, and a summary in 1-2 sentences (usually copied from the bolded parts of the original arguments). We apologize for some inevitable misrepresentation of the original arguments in these summaries. Note that some respondents looked at the original arguments while others looked at the summaries when providing their opinions (though everyone has read the original list at some point before providing opinions).
A general problem when evaluating the arguments was that people often agreed with the argument as stated, but disagreed about the severity of its implications for AGI risk. A lot of these ended up as "mostly agree / unclear / mostly disagree" ratings. It would have been better to gather two separate scores (agreement with the statement and agreement with implications for risk).
Summary of agreements, disagreements and implications
Most agreement:
Most disagreement:
Most controversial among the team:
Cruxes from the most controversial arguments:
Possible implications for our work:
Section A: "strategic challenges" (#1-9)
Summary
Detailed comments
#1. Human level is nothing special / data efficiency
Summary: AGI will not be upper-bounded by human ability or human learning speed (similarly to AlphaGo). Things much smarter than human would be able to learn from less evidence than humans require.
#2. Unaligned superintelligence could easily take over
Summary: A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.
#3. Can't iterate on dangerous domains
Summary: At some point there will be a 'first critical try' at operating at a 'dangerous' level of intelligence, and on this 'first critical try', we need to get alignment right.
#4. Can't cooperate to avoid AGI
Summary: The world can't just decide not to build AGI.
#5. Narrow AI is insufficient
Summary: We can't just build a very weak system.
#6. Pivotal act is necessary
Summary: We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world.
#7. There are no weak pivotal acts because a pivotal act requires power
Summary: It takes a lot of power to do something to the current world that prevents any other AGI from coming into existence; nothing which can do that is passively safe in virtue of its weakness.
#8. Capabilities generalize out of desired scope
Summary: The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve.
#9. A pivotal act is a dangerous regime
Summary: The builders of a safe system would need to operate their system in a regime where it has the capability to kill everybody or make itself even more dangerous, but has been successfully designed to not do that.
Section B.1: The distributional leap (#10-15)
Summary
Detailed comments
#10. Large distributional shift to dangerous domains
Summary: On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.
#11. Sim to real is hard
Summary: There's no known case where you can entrain a safe level of ability on a safe environment where you can cheaply do millions of runs, and deploy that capability to save the world.
#12. High intelligence is a large shift
Summary: Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level.
#13. Some problems only occur above an intelligence threshold
Summary: Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.
#14. Some problems only occur in dangerous domains
Summary: Some problems seem like their natural order of appearance could be that they first appear only in fully dangerous domains.
#15. Capability gains from intelligence are correlated
Summary: Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.
Section B.2: Central difficulties of outer and inner alignment (#16-24)
Summary
Detailed comments
#16. Inner misalignment
Summary: Outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.
#17. Can't control inner properties
Summary: On the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over.
#18. No ground truth (no comments)
Summary: There's no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is 'aligned'.
#19. Pointers problem
Summary: There is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment.
#20. Flawed human feedback
Summary: Human raters make systematic errors - regular, compactly describable, predictable errors.
#21. Capabilities go further
Summary: Capabilities generalize further than alignment once capabilities start to generalize far.
#22. No simple alignment core
Summary: There is a simple core of general intelligence but there is no analogous simple core of alignment.
#23. Corrigibility is anti-natural
Summary: Corrigibility is anti-natural to consequentialist reasoning.
#24. Sovereign vs corrigibility
Summary: There are two fundamentally different approaches you can potentially take to alignment [a sovereign optimizing CEV or a corrigible agent], which are unsolvable for two different sets of reasons. Therefore by ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.
Section B.3: Central difficulties of sufficiently good and useful transparency / interpretability (#25-33)
Summary
Detailed comments
#25. Real interpretability is out of reach
Summary: We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.
#26. Interpretability is insufficient
Summary: Knowing that a medium-strength system of inscrutable matrices is planning to kill us, does not thereby let us build a high-strength system that isn't planning to kill us.
#27. Selecting for undetectability
Summary: Optimizing against an interpreted thought optimizes against interpretability.
#28. Large option space (no comments)
Summary: A powerful AI searches parts of the option space we don't, and we can't foresee all its options.
#29. Real world is an opaque domain
Summary: AGI outputs go through a huge opaque domain before they have their real consequences, so we cannot evaluate consequences based on outputs.
#30. Powerful vs understandable
Summary: No humanly checkable output is powerful enough to save the world.
#31. Hidden deception
Summary: You can't rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about.
#32. Language is insufficient or unsafe
Summary: Imitating human text can only be powerful enough if it spawns an inner non-imitative intelligence.
#33. Alien concepts
Summary: The AI does not think like you do, it is utterly alien on a staggering scale.
Section B.4: Miscellaneous unworkable schemes (#34-36)
Summary
Detailed comments
#34. Multipolar collusion
Summary: Humans cannot participate in coordination schemes between superintelligences.
#35. Multi-agent is single-agent
Summary: Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you're playing them against each other.
#36. Human flaws make containment difficult (no comments)
Summary: Only relatively weak AGIs can be contained; the human operators are not secure systems.
Section C: "civilizational inadequacy" (#37-43)
Summary
Detailed comments
#37. Optimism until failure
Summary: People have a default assumption of optimism in the face of uncertainty, until encountering hard evidence of difficulty.
#38. Lack of focus on real safety problems
Summary: AI safety field is not being productive on the lethal problems. The incentives are for working on things where success is easier.
#39. Can't train people in security mindset
Summary: This ability to "notice lethal difficulties without Eliezer Yudkowsky arguing you into noticing them" currently is an opaque piece of cognitive machinery to me, I do not know how to train it into others.
#40. Can't just hire geniuses to solve alignment
Summary: You cannot just pay $5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them.
#41. You have to be able to write this list
Summary: Reading this document cannot make somebody a core alignment researcher, you have to be able to write it.
#42. There's no plan
Summary: Surviving worlds probably have a plan for how to survive by this point.
#43. Unawareness of the risks
Summary: Not enough people have noticed or understood the risks.