This is mostly in response to stuff written by Richard, but I'm interested in everyone's read of the situation.
While I don't find Eliezer's core intuitions about intelligence too implausible, they don't seem compelling enough to do as much work as Eliezer argues they do. As in the Foom debate, I think that our object-level discussions were constrained by our different underlying attitudes towards high-level abstractions, which are hard to pin down (let alone resolve).
Given this, I think that the most productive mode of intellectual engagement with Eliezer's worldview going forward is probably not to continue debating it (since that would likely hit those same underlying disagreements), but rather to try to inhabit it deeply enough to rederive his conclusions and find new explanations of them which then lead to clearer object-level cruxes.
I'm not sure yet how to word this as a question without some introductory paragraphs. When I read Eliezer, I often feel like he has a coherent worldview that sees lots of deep connections and explains lots of things, and that he's actively trying to be coherent / explain everything. [This is what I think you're pointing to with his 'attitude toward...
I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer's, albeit probabilistic ones, rather than bailing with "the future is hard to predict"). At a high level I don't think "mainline" is a great concept for describing probability distributions over the future except in certain exceptional cases (though I may not understand what "mainline" means), and that neat stories that fit everything usually don't work well (unless, or often even if, generated in hindsight).
In answer to your "why is this," I think it's a combination of moderate differences in functioning and large differences in communication style. I think Eliezer has a way of thinking about the future that is quite different from mine and I'm somewhat skeptical of and feel like Eliezer is overselling (which is what got me into this discussion), but that's probably smaller than a large difference in communication style (driven partly by different skills, different aesthetics, and different ideas about what kinds of standards discourse should aspire to).
I think I may not understand well the basic lesson / broader point, so will probably be more helpful on object level points and will mostly go answer those in the time I have.
I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer's, albeit probabilistic ones, rather than bailing with "the future is hard to predict").
Sometimes I'll be tracking a finite number of "concrete hypotheses", where every hypothesis is 'fully fleshed out', and be doing a particle-filtering style updating process, where sometimes hypotheses gain or lose weight, sometimes they get ruled out or need to split, or so on. In those cases, I'm moderately confident that every 'hypothesis' corresponds to a 'real world', constrained by how well I can get my imagination to correspond to reality. [A 'finite number' depends on the situation, but I think it's normally something like 2-5, unless it's an area I've built up a lot of cache about.]
Sometimes I'll be tracking a bunch of "surface-level features", where the distributions on the features don't always imply coherent underlying worlds, either on their own or in combination with other features. (For example, I might have guesses about the probability th...
I think my way of thinking about things is often a lot like "draw random samples," more like drawing N random samples rather than particle filtering (I guess since we aren't making observations as we go---if I notice an inconsistency the thing I do is more like backtrack and start over with N fresh samples having updated on the logical fact).
The main complexity feels like the thing you point out where it's impossible to make them fully fleshed out, so you build a bunch of intuitions about what is consistent (and could be fleshed out given enough time) and then refine those intuitions only periodically when you actually try to flesh something out and see if it makes sense. And often you go even further and just talk about relationships amongst surface level features using intuitions refined from a bunch of samples.
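As a rough illustration of the contrast drawn above between particle-filtering-style updating and "backtrack and start over with N fresh samples", here is a minimal Python sketch. Everything in it (the two toy features, the function names, and the made-up consistency rule) is a hypothetical stand-in chosen for illustration, not something proposed by anyone in this thread.

```python
import random

def sample_world(rng):
    # Toy "world": two surface features, each a coin flip (hypothetical).
    return {"fast_takeoff": rng.random() < 0.5,
            "many_actors": rng.random() < 0.5}

def is_consistent(world):
    # Stand-in for a logical fact noticed later. Assumption for illustration
    # only: a world with both features set cannot be fleshed out.
    return not (world["fast_takeoff"] and world["many_actors"])

def particle_filter_update(particles, likelihood):
    """Reweight a fixed set of (world, weight) hypotheses, then renormalize."""
    reweighted = [(w, p * likelihood(w)) for w, p in particles]
    total = sum(p for _, p in reweighted)
    if total == 0:
        return []  # every tracked hypothesis was ruled out
    # Hypotheses whose weight drops to zero are dropped from the set.
    return [(w, p / total) for w, p in reweighted if p > 0]

def fresh_samples(n, rng, constraint):
    """Start over: draw n new worlds, keeping only those that pass the constraint."""
    samples = []
    while len(samples) < n:
        world = sample_world(rng)
        if constraint(world):
            samples.append(world)
    return samples

# Four fully specified hypotheses with equal weight.
particles = [({"fast_takeoff": a, "many_actors": b}, 0.25)
             for a in (False, True) for b in (False, True)]

# Particle-filter style: keep the same hypotheses, shift their weights.
updated = particle_filter_update(
    particles, lambda w: 1.0 if is_consistent(w) else 0.0)

# Fresh-sample style: discard the old set and redraw under the new fact.
redrawn = fresh_samples(5, random.Random(0), is_consistent)
```

The design difference is only in what persists: the first approach carries a fixed set of hypotheses forward and adjusts weights, while the second throws the set away and resamples once a new constraint is known.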
I feel like a distinctive feature of Eliezer's dialog w.r.t. foom / alignment difficulty is that he has a lot of views about strong regularities that should hold across all of these worlds. And then disputes about whether worlds are plausible often turn on things like "is this property of the described world likely?" which is tough because obviously everyone agrees that ev...
EDIT: I wrote this before seeing Paul's response; hence a significant amount of repetition.
They often seem to emit sentences that are 'not absurd', instead of 'on their mainline', because they're mostly trying to generate sentences that pass some shallow checks instead of 'coming from their complete mental universe.'
Why is this?
Well, there are many boring cases that are explained by pedagogy / argument structure. When I say things like "in the limit of infinite oversight capacity, we could just understand everything about the AI system and reengineer it to be safe", I'm obviously not claiming that this is a realistic thing that I expect to happen, so it's not coming from my "complete mental universe"; I'm just using this as an intuition pump for the listener to establish that a sufficiently powerful oversight process would solve AI alignment.
That being said, I think there is a more interesting difference here, but that your description of it is inaccurate (at least for me).
From my perspective I am implicitly representing a probability distribution over possible futures in my head. When I say "maybe X happens", or "X is not absurd", I'm saying that my probability distribution assign...
In response to your last couple paragraphs: the critique, afaict, is not "a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those", but rather "a common method for representing lots of hypotheses at once, is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is that you lose track of which properties are incompatible (toy model: you claim you can visualize a number that is both even and odd). A way to avert this failure mode is to regularly exhibit at least one concrete hypothesis that simultaneously possesses whatever collection of properties you say you can simultaneously visualize (toy model: demonstrating that 14 is even and 7 is odd does not in fact convince me that you are correct to imagine a number that is both even and odd)."
On my understanding of Eliezer's picture (and on my own personal picture), almost nobody ever visibly tries to do this (never mind succeeding), when it comes to hopeful AGI scenarios.
Insofar as you have thought about at least one specific hopeful world in great detail, I strongly recommend spelling it out, in all its great detail, to Eliezer, next time you two chat. In fact, I personally request that you do this! It sounds great, and I expect it to constitute some progress in the debate.
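The even/odd toy model in the critique above can be made concrete with a small Python sketch. The helper name `find_witness` and the candidate range are assumptions introduced here for illustration; the point is only that a claimed combination of properties is checked by exhibiting a single object that has all of them at once.

```python
from typing import Callable, Iterable, Optional

Property = Callable[[int], bool]

def find_witness(properties: Iterable[Property],
                 candidates: Iterable[int]) -> Optional[int]:
    """Return the first candidate satisfying every property, or None."""
    props = list(properties)
    for candidate in candidates:
        if all(prop(candidate) for prop in props):
            return candidate
    return None

def is_even(n: int) -> bool:
    return n % 2 == 0

def is_odd(n: int) -> bool:
    return n % 2 == 1

# Compatible properties: a single concrete number exhibits both.
print(find_witness([is_even, lambda n: n > 10], range(100)))  # 12

# Incompatible properties: 14 is even and 7 is odd, but no one number is both.
print(find_witness([is_even, is_odd], range(100)))  # None
```

Note that a `None` result over a finite candidate list only shows that no witness was found among those candidates; it is the positive case, exhibiting a witness, that settles the question.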
Relevant Feynman quote:
I had a scheme, which I still use today when somebody is explaining something that I’m trying to understand: I keep making up examples.
For instance, the mathematicians would come in with a terrific theorem, and they’re all excited. As they’re telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)-- disjoint (two balls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on.
Finally they state the theorem, which is some dumb thing about the ball which isn’t true for my hairy green ball thing, so I say “False!” [and] point out my counterexample.
As I understand it, when you "talk about the mainline", you're supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)
But to the extent this is right, I'm actually quite confused why anyone thinks "talk about the mainline" is an ideal to which to aspire. What makes you expect that?
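One way to see why the correlation between X, Y and Z matters in the description above is a small worked calculation, sketched here in Python. The 0.6 figures and the two toy distributions are assumptions made up for illustration; they are not numbers anyone in the discussion gave.

```python
from itertools import product

def marginal(worlds, index):
    """Probability that question `index` comes out true."""
    return sum(p for answers, p in worlds.items() if answers[index])

def joint_all_true(worlds):
    """Probability that every question comes out true in the same world."""
    return sum(p for answers, p in worlds.items() if all(answers))

# Case 1: X, Y, Z independent, each true with probability 0.6.
independent = {}
for answers in product([True, False], repeat=3):
    p = 1.0
    for a in answers:
        p *= 0.6 if a else 0.4
    independent[answers] = p

# Case 2: X, Y, Z perfectly correlated, all true with probability 0.6.
correlated = {(True, True, True): 0.6, (False, False, False): 0.4}

for name, worlds in [("independent", independent), ("correlated", correlated)]:
    marginals = [round(marginal(worlds, i), 3) for i in range(3)]
    print(name, marginals, round(joint_all_true(worlds), 3))
```

In both cases each answer individually gets more than 50%, but only in the correlated case does the single future in which all three hold also get more than 50% (0.6, versus 0.6 cubed, or 0.216, under independence).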
I'll try to explain the technique and why it's useful. I'll start with a non-probabilistic version of the idea, since it's a little simpler conceptually, then talk about the corresponding idea in the presence of uncertainty.
Suppose I'm building a mathematical model of some system or class of systems. As part of the modelling process, I write down some conditions which I expect the system to satisfy - think energy conservation, or Newton's Laws, or market efficiency, depending on what kind of systems we're talking about. My hope/plan is to derive (i.e. prove) some predictions from these...
The most recent post has a related exchange between Eliezer and Rohin:
Eliezer: I think the critical insight - though it has a format that basically nobody except me ever visibly invokes in those terms, and I worry maybe it can only be taught by a kind of life experience that's very hard to obtain - is the realization that any consistent reasonable story about underlying mechanisms will give you less optimistic forecasts than the ones you get by freely combining surface desiderata
Rohin: Yeah, I think I do not in fact understand why that is true for any consistent reasonable story.
If I'm being locally nitpicky, I argue that Eliezer's thing is a very mild overstatement (it should be "≤" instead of "<") but given that we're talking about forecasts, we're talking about uncertainty, and so we should expect "less" optimism instead of just "not more" optimism, and so I think Eliezer's statement stands as a general principle about engineering design.
This also feels to me like the sort of thing that I somehow want to direct attention towards. Either this principle is right and relevant (and it would be good for the field if all the AI safety thinkers held it!), or there's some deep confusion of mine that I'd like cleared up.
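The "≤" versus "<" nitpick above can be read as the observation that the best score achievable by any single consistent story is never higher than the score obtained by taking the best value of each desideratum separately. Here is a minimal Python sketch of that reading; the story names, desiderata, and numbers are all hypothetical.

```python
# Each consistent story fixes a value for every desideratum at once.
# All names and numbers are made up for illustration.
stories = {
    "story_a": {"capability": 0.9, "safety": 0.2},
    "story_b": {"capability": 0.4, "safety": 0.8},
    "story_c": {"capability": 0.6, "safety": 0.6},
}

def story_score(story):
    """Optimism of one consistent story: sum of its desiderata."""
    return sum(story.values())

def free_combination_score(all_stories):
    """Optimism from picking the best value of each desideratum independently."""
    desiderata = next(iter(all_stories.values())).keys()
    return sum(max(s[d] for s in all_stories.values()) for d in desiderata)

best_consistent = max(story_score(s) for s in stories.values())
free = free_combination_score(stories)

# Holds for any set of stories: max of sums never exceeds sum of maxes.
print(best_consistent <= free)  # True
```

Equality holds only when one story is simultaneously best on every desideratum, which is the "≤" case; whenever the desiderata trade off against each other, as in the made-up numbers here, the inequality is strict.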
Sorry, I probably should have been more clear about the "this is a quote from a longer dialogue, the missing context is important." I do think that the disagreement about "how relevant is this to 'actual disagreement'?" is basically the live thing, not whether or not you agree with the basic abstract point.
My current sense is that you're right that the thing you're doing is more specific than the general case (and one of the ways you can tell is the line of argumentation you give about chance of doom), and also Eliezer can still be correctly observing that you have too many free parameters (even if the number of free parameters is two instead of arbitrarily large). I think arguments about what you're selecting for either cash out in mechanistic algorithms, or they can deceive you in this particular way.
Or, to put this somewhat differently, in my view the basic abstract point implies that having one extra free parameter allows you to believe in a 5% chance of doom when in fact there's 100% chance of doom, and so in order to get estimations like that right this needs to be one of the basic principles shaping your thoughts, tho ofc your prior should come from many examples instead of ...
[I think there's a thing Eliezer does a lot, which I have mixed feelings about, which is matching people's statements to patterns and then responding to the generator of the pattern in Eliezer's head, which only sometimes corresponds to the generator in the other person's head.]
I want to add an additional meta-pattern – there was once a person who thought I had a particular bias. They'd go around telling me "Ray, you're exhibiting that bias right now. Whatever rationalization you're coming up with right now, it's not the real reason you're arguing X." And I was like "c'mon man. I have a ton of introspective access to myself and I can tell that this 'rationalization' is actually a pretty good reason to believe X and I trust that my reasoning process is real."
But... eventually I realized I just actually had two motivations going on. When I introspected, I was running a check for a positive result on "is Ray displaying rational thought?". When they extrospected me (i.e. reading my facial expressions), they were checking for a positive result on "does Ray seem biased in this particular way?".
And both checks totally returned 'true', and that was an accurate assessment.
The partic...
(For object-level responses, see comments on parallel threads.)
I want to push back on an implicit framing in lines like:
there's some value to more people thinking thru / shooting down their own edge cases [...], instead of pushing the work to Eliezer.
people aren't updating on the meta-level point and continue to attempt 'rolling their own crypto', asking if Eliezer can poke the hole in this new procedure
This makes it sound like the rest of us don't try to break our proposals, push the work to Eliezer, agree with Eliezer when he finds a problem, and then not update that maybe future proposals will have problems.
Whereas in reality, I try to break my proposals, don't agree with Eliezer's diagnoses of the problems, and usually don't ask Eliezer because I don't expect his answer to be useful to me (and previously didn't expect him to respond). I expect this is true of others (like Paul and Richard) as well.
But also my sense is that there's some deep benefit from "having mainlines" and conversations that are mostly 'sentences-on-mainline'?
I agree with this. Or, if you feel ~evenly split between two options, have two mainlines and focus a bunch on those (including picking at cruxes and revising your mainline view over time).
But:
Like, it feels to me like Eliezer was generating sentences on his mainline, and Richard was responding with 'since you're being overly pessimistic, I will be overly optimistic to balance', with no attempt to have his response match his own mainline.
I do note that there are some situations where rushing to tell a 'mainline story' might be the wrong move:
These conversations are great and I really admire the transparency. It's really nice to see discussions that normally happen in private happen instead in public where everyone can reflect, give feedback, and improve their own thoughts. On the other hand, the conversations combined amount to a decent-sized novel - LW says 198,846 words! Is anyone considering investing heavily in summarizing the content so that people can get involved without having to read all of it?
Echoing that I loved these conversations and I'm super grateful to everyone who participated — especially Richard, Paul, Eliezer, Nate, Ajeya, Carl, Rohin, and Jaan, who contributed a lot.
I don't plan to try to summarize the discussions or distill key take-aways myself (other than the extremely cursory job I did on https://intelligence.org/late-2021-miri-conversations/), but I'm very keen on seeing others attempt that, especially as part of a process to figure out their own models and do some evaluative work.
I think I'd rather see partial summaries/responses that go deep, instead of a more exhaustive but shallow summary; and I'd rather see summaries that center the author's own view (what's your personal take-away? what are your objections? which things were small versus large updates? etc.) over something that tries to be maximally objective and impersonal. But all the options seem good to me.
Question for Richard, Paul, and/or Rohin: What's a story, full of implausibly concrete details but nevertheless a member of some largish plausible-to-you cluster of possible outcomes, in which things go well? (Paying particular attention to how early AGI systems are deployed and to what purposes, or how catastrophic deployments are otherwise forestalled.)
I wrote this doc a couple of years ago (while I was at CHAI). It's got many rough edges (I think I wrote it in one sitting and never bothered to rewrite it to make it better), but I still endorse the general gist, if we're talking about what systems are being deployed to do and what happens amongst organizations. It doesn't totally answer your question (it's more focused on what happens before we get systems that could kill everyone), but it seems pretty related.
(I haven't brought it up before because it seems to me like the disagreement is much more in the "mechanisms underlying intelligence", which that doc barely talks about, and the stuff it does say feels pretty outdated; I'd say different things now.)
Eliezer and Nate, my guess is that most of your perspective on the alignment problem for the past several years has come from the thinking and explorations you've personally done, rather than reading work done by others.
But, if you have read interesting work by others that's changed your mind or given you helpful insights, what has it been? Some old CS textbook? Random Gwern articles? An economics textbook? Playing around yourself with ML systems?
One thing in the posts I found surprising was Eliezer's assertion that you needed a dangerous superintelligence to get nanotech. If the AI is expected to do everything itself, including inventing the concept of nanotech, I agree that this is dangerously superintelligent.
However, suppose Alpha Quantum can reliably approximate the behaviour of almost any particle configuration. Not literally any, it can't run a quantum computer factorizing large numbers better than factoring algorithms, but enough to design a nanomachine. (It has been trained to approximate the ground truth of quantum mechanics equations, and it does this very well.)
For example, you could use IDA, start training to imitate a simulation of a handful of particles, then compose several smaller nets into one large one.
Add a nice user interface and we can drag and drop atoms.
You can add optimization, gradient descent trying to maximize the efficiency of a motor, or minimize the size of a logic gate. All of this is optimised to fit a simple equation, so assuming you don't have smart general mesaoptimizers forming, and deducing how to manipulate humans based on very little info about humans, you shoul...
I wrote Consequentialism & Corrigibility shortly after and partly in response to the first (Ngo-Yudkowsky) discussion. If anyone has an argument or belief that the general architecture / approach I have in mind (see the “My corrigibility proposal sketch” section) is fundamentally doomed as a path to corrigibility and capability—as opposed to merely “reliant on solving lots of hard-but-not-necessarily-impossible open problems”—I'd be interested to hear it. Thanks in advance. :)
After reading some of the newer MIRI dialogues, I'm less convinced than I once was that I know what "corrigibility" actually is. Could you say a few words about what kind of behavior you concretely expect to see from a "corrigible" agent, followed by how [you expect] those behaviors [to] fit into the "trajectory-constraining" framework you propose in your post?
EDIT: This is not purely a question for Steven, incidentally (or at least, the first half isn't); anyone else who wants to take a shot at answering should feel free to do so. In particular I'd be interested in hearing answers from Eliezer or anyone else historically involved in the invention of the term.
Question for anyone, but particularly interested in hearing from Christiano, Shah, or Ngo: any thoughts on what happens when alignment schemes that worked in lower-capability regimes fail to generalize to higher-capability regimes?
For example, you could imagine a spectrum of outcomes from "no generalization" (illustrative example: galaxies tiled with paperclips) to "some generalization" (illustrative example: galaxies tiled with "hedonium" human-ish happiness-brainware) to "enough generalization that existing humans recognizably survive, but something still went wrong from our current perspective" (illustrative examples: "Failed Utopia #4-2", Friendship Is Optimal, "With Folded Hands"). Given that not every biological civilization solves the problem, what does the rest of the multiverse look like? (How is measure distributed on something like my example spectrum, or whatever I should have typed instead?)
(Previous work: Yudkowsky 2009 "Value Is Fragile", Christiano 2018 "When Is Unaligned AI Morally Valuable?", Grace 2019 "But Exactly How Complex and Fragile?".)
When alignment schemes fail to scale, I think it typically means that they work while the system is unable to overpower/outsmart the oversight process, and then break down when the system becomes able to do so. I think that this usually results in the AI shifting from behavior that is mostly constrained by the training process to behavior that is mostly unconstrained (once they effectively disempower humans).
I think the results are relatively unlikely to be good in virtue of "the AI internalized something about our values, just not everything", and I'm pretty skeptical of recognizable "near miss" scenarios rather than AI gradually careening in very hard-to-predict directions with minimal connection with the surface features of the training process.
Overall I think that the most likely outcome is a universe that is orthogonal to anything we directly care about, maybe with a vaguely similar flavor owing to convergence depending on how AI motivations shake out. (But likely not close enough to feel great, and quite plausibly with almost no visible relation. Probably much more different from us than we are from aliens.)
I think it's fairly plausible that the results are OK just beca...
Basically agree with Paul, and I especially want to note that I've barely thought about it and so this would likely change a ton with more information. To put some numbers of my own:
These are from my own perspective of what these categories mean, which I expect are pretty different from yours -- e.g. maybe I'm at ~2% that upon reflection I'd decide that hedonium is great and so that's actually perfect generalization; in the last category I include lots of worlds that I wouldn't describe as "existing humans recognizably survive", e.g. we decide to become digital uploads, then get lots of cognitive enhancements, throw away a bunch of evolutionary baggage, but also we never expand to the stars because AI has taken control of it and given us only Earth.
I think the biggest avenues for improving the answers would be to reflect more on the kindness + cooperation and acausal trade stories Paul mentions, as well as the possibility that a few AIs end up generalizing close to correctly and working ...
I finished reading all the conversations a few hours ago. I have no follow-up questions (except maybe "now what?"), I'm still updating from all those words.
One excerpt in particular, from the latest post, jumped out at me (from Eliezer Yudkowsky, emphasis mine):
This is not aimed particularly at you, but I hope the reader may understand something of why Eliezer Yudkowsky goes about sounding so gloomy all the time about other people's prospects for noticing what will kill them, by themselves, without Eliezer constantly hovering over their shoulder every minute prompting them with almost all of the answer.
The past years of reading about alignment have left me with an intense initial distrust of any alignment research agenda. Maybe it's ordinary paranoia, maybe something more. I've not come up with any new ideas myself, and I'm not particularly confident in my ability to find flaws in someone else's proposal (what if I'm not smart enough to understand them properly? What if I make things even more confused and waste everyone's time?)
After thousands and thousands of lengthy conversations where it takes everyone ages to understand where threat models disagree, why some avenue of research is p...
Not sure if this is the right place to ask, instead of just googling it, but anyway: does anyone know the current state of AI security practices at DeepMind, OpenAI and other such places? Like, did they estimate the probability of GPT-3 killing everyone before turning it on, do they have procedures for not turning something on, did they test these procedures by having someone impersonate an unaligned GPT and try to manipulate researchers, things like that?
Questions about the standard-university-textbook from the future that tells us how to build an AGI. I'll take answers on any of these!
I'm going to try and write a table of contents for the textbook, just because it seems like a fun exercise.
Epistemic status: unbridled speculation
Volume I: Foundation
Part I: Statistical Learning Theory
Part II: Computational Learning Theory
Part III: Universal Priors
I don't think there is an "AGI textbook" any more than there is an "industrialization textbook." There are lots of books about general principles and useful kinds of machines. That said, if I had to make wild guesses about roughly what that future understanding would look like:
Eliezer, do you have any advice for someone wanting to enter this research space at (from your perspective) the eleventh hour? I’ve just finished a BS in math and am starting a PhD in CS, but I still don’t feel like I have the technical skills to grapple with these issues, and probably won’t for a few years. What are the most plausible routes for someone like me to make a difference in alignment, if any?
I don't have any such advice at the moment. It's not clear to me what makes a difference at this point.
We'd absolutely pay him if he showed up and said he wanted to work on the problem. Every time I've asked about trying anything like this, all the advisors claim that you cannot pay people at the Terry Tao level to work on problems that don't interest them. We have already extensively verified that it doesn't particularly work for eg university professors.
Every time I've asked about trying anything like this, all the advisors claim that you cannot pay people at the Terry Tao level to work on problems that don't interest them.
As I am sure you would agree, von Neumann/Tao-level people are a very different breed from even very, very, very good professors. It is plausible they are significantly more sane than the average genius.
Given the enormous glut of money in EA trying to help here and the terrifying thing where a lot of the people who matter have really short timelines, I think it is worth testing this empirically with Tao himself and Tao-level people.
It is worth noting that von Neumann occasionally did contract work for extraordinary sums.
I'm not sure whether the unspoken context of this comment is "We tried to hire Terry Tao and he declined, citing lack of interest in AI alignment" vs "we assume, based on not having been contacted by Terry Tao, that he is not interested in AI alignment."
If the latter: the implicit assumption seems to be that if Terry Tao would find AI alignment to be an interesting project, we should strongly expect him to both know about it and have approached MIRI regarding it, neither of which seems particularly likely given the low public profile of both AI alignment in general and MIRI in particular.
If the former: bummer.
You're probably already aware of this, but just in case not:
Demis Hassabis said the following about getting Terence Tao to work on AI safety:
I always imagine that as we got closer to the sort of gray zone that you were talking about earlier, the best thing to do might be to pause the pushing of the performance of these systems so that you can analyze down to minute detail exactly and maybe even prove things mathematically about the system so that you know the limits and otherwise of the systems that you're building. At that point I think all the world's greatest minds should probably be thinking about this problem. So that was what I would be advocating to you know the Terence Tao’s of this world, the best mathematicians. Actually I've even talked to him about this—I know you're working on the Riemann hypothesis or something which is the best thing in mathematics but actually this is more pressing. I have this sort of idea of like almost uh ‘Avengers assembled’ of the scientific world because that's a bit of like my dream.
The header image of Tao's blog is a graph representing "flattening the curve" of the Covid-19 spread. One avenue for convincing elite talent that alignment is a problem is a media campaign that brings the problem of alignment into popular consciousness.
I have some ideas about how this might begin. "Educational" YouTuber CGP Grey (5.2M subscribers) got talked into making a pair of videos advocating for anti-aging research by another large YouTuber, Kurzgesagt (18M subscribers). I'd bet that they could both be persuaded to make AI alignment videos.
Not even a "In 90% of possible worlds, we're irreversibly doomed, but in the remaining 10%, here's the advice that would work"?
Eliezer, when you told Richard that your probability of a successful miracle is very low, you added the following note:
Though a lot of that is dominated, not by the probability of a positive miracle, but by the extent to which we seem unprepared to take advantage of it, and so would not be saved by one.
I don't mean to ask for positive fairy tales when I ask: could you list some things you could see in the world that would cause you to feel that we were well-prepared to take advantage of one if we got one?
My obvious quick guess would be "I know of an ML project that made a breakthrough as impressive as GPT-3 and this is secret to the outer world, and the organization is keenly interested in alignment". But I am also interested in broader and less obvious ones. For example if the folks around here had successfully made a covid vaccine I think that would likely require us to be in a much more competent and responsive situation. Alternatively if folks made other historic scientific breakthroughs guided by some model of how it helps prevent AI doom, I'd feel more like this power could be turned to relevant directions.
Anyway, these are some of the things I quickly generate, but I'm interested in what comes to your mind?
Curated. I found the entire sequence of conversations quite valuable, and it seemed good both to let people know it had wrapped up, and curate it while the AMA was still going on.
Question from evelynciara on the EA Forum:
Do you believe that AGI poses a greater existential risk than other proposed x-risk hazards, such as engineered pandemics? Why or why not?
For sure. It's tricky to wipe out humanity entirely without optimizing for that in particular -- nuclear war, climate change, and extremely bad natural pandemics look to me like they're at most global catastrophes, rather than existential threats. It might in fact be easier to wipe out humanity by engineering a pandemic that's specifically optimized for this task (than it is to develop AGI), but we don't see vast resources flowing into humanity-killing-virus projects, the way that we see vast resources flowing into AGI projects. By my accounting, most other x-risks look like wild tail risks (what if there's a large, competent, state-funded successfully-secretive death-cult???), whereas the AI x-risk is what happens by default, on the mainline (humanity is storming ahead towards AGI as fast as they can, pouring billions of dollars into it per year, and by default what happens when they succeed is that they accidentally unleash an optimizer that optimizes for our extinction, as a convergent instrumental subgoal of whatever rando thing it's optimizing).
[W]iping out humanity is the most expensive of these options and the AGI would likely get itself destroyed while trying to do that[.]
It would be pretty easy and cheap for something much smarter than a human to kill all humans. The classic scenario is:
...A. [...] The notion of a 'superintelligence' is not that it sits around in Goldman Sachs's basement trading stocks for its corporate masters. The concrete illustration I often use is that a superintelligence asks itself what the fastest possible route is to increasing its real-world power, and then, rather than bothering with the digital counters that humans call money, the superintelligence solves the protein structure prediction problem, emails some DNA sequences to online peptide synthesis labs, and gets back a batch of proteins which it can mix together to create an acoustically controlled equivalent of an artificial ribosome which it can use to make second-stage nanotechnology which manufactures third-stage nanotechnology which manufactures diamondoid molecular nanotechnology and then... well, it doesn't really matter from our perspective what comes after that, because from a human perspective any technology more advan
I would be interested to hear opinions about what fraction of people could possibly produce useful alignment work?
Ignoring the hurdle of "knowing about AI safety at all", i.e. assuming they took some time to engage with it (e.g. they took the AGI Safety Fundamentals course). Also assume they got some good mentorship (e.g. from one of you) and then decided to commit full-time (and got funding for that). The thing I'm trying to get at is more about having the mental horsepower + epistemics + creativity + whatever other qualities are useful, or likely being able to get there after some years of training.
Also note that I mean direct useful work, not indirect meta things like outreach or being a PA to a good alignment researcher etc. (these can be super important, but I think it's productive to think of them as a distinct class). E.g. I would include being a software engineer at Anthropic, but exclude doing grocery-shopping for your favorite alignment researcher.
An answer could look like "X% of the general population" or "half the people who could get a STEM degree at Ivy League schools if they tried" or "a tenth of the people who win the Fields medal".
I think it's useful to have a sens...
(Off the cuff answer including some random guesses and estimates I won't stand behind, focused on the kind of theoretical alignment work I'm spending most of my days thinking about right now.)
Over the long run I would guess that alignment is broadly similar to other research areas, where a large/healthy field could support lots of work from lots of people, where some kinds of contributions are very heavy-tailed but there is a lot of complementarity and many researchers are having large overall marginal impacts.
Right now I think difficulties (at least for growing the kind of alignment work I'm most excited about) are mostly related to trying to expand quickly, greatly exacerbated by not having a good idea what's going on / what we should be trying to do, and not having a straightforward motivating methodology/test case since you are trying to do things in advance motivated by altruistic impact. I'm still optimistic that we will be able to scale up reasonably quickly such that many more people are helpfully engaged in the future and eventually these difficulties will be resolved.
In the very short term, while other bottlenecks are severe, I think it's mostly a question of how to use c...
"Possibly produce useful alignment work" is a really low bar, such that the answer is ~100%. Lots of things are possible. I'm going to instead answer "for what fraction of people would I think that the Long-Term Future Fund should fund them on the current margin".
If you imagine that the people are motivated to work on AI safety, get good mentorship, and are working full-time, then I think on my views most people who could get into an ML PhD in any university would qualify, and a similar number of other people as well (e.g. strong coders who are less good at the random stuff that academia wants). Primarily this is because I think that the mentors have useful ideas that could progress faster with "normal science" work (rather than requiring "paradigm-defining" work).
In practice, there is not that much mentorship to go around, and so the mentors end up spending time with the strongest people from the previous category, and so the weakest people end up not having mentorship and so aren't worth funding on the current margin.
I'd hope that this changes in the next few years, with the field transitioning from "you can do 'normal science' if you are frequently talking to one of the people who have paradigms in their head" to "the paradigms are understandable from the online written material; one can do 'normal science' within a paradigm autonomously".
I'm still very vague on Yudkowsky's main counterargument to Ngo in the dialogues — about how saving the world requires a powerful search over a large space of possibilities, and therefore by default involves running dangerous optimizers that will kill us. This is a more concrete question aiming to make my understanding less vague; Yudkowsky said:
"AI systems that do better alignment research" are dangerous in virtue of the lethally powerful work they are doing, not because of some particular narrow way of doing that work. If you can do it by gradient descent then that means gradient descent got to the point of doing lethally dangerous work. Asking for safely weak systems that do world-savingly strong tasks is almost everywhere a case of asking for nonwet water, and asking for AI that does alignment research is an extreme case in point.
I don't understand why alignment research falls into this bucket of "world-savingly strong, therefore lethally strong". My intuitive reasoning is: the inner-alignment is a math problem, about certain properties of things involving functions A -> (B -> A) or whatever; and if we actually knew how to phrase that math problem crisply and ...
The way the type (A -> B) -> A corresponds loosely to the "type of agency" (if you kinda squint at the arrow symbol and play fast-and-loose) is that it suggests a machine that eats a description of how actions (A) lead to outcomes (B), and produces from that description an action.
Consider stating an alignment property for an element f of this type. What sort of thing must it say?
Perhaps you wish to say "when f is fed the actual description of the world, it selects the best possible action". Congratulations, f in fact exists, it is called argmax. This does not help you.
Perhaps you instead wish to say "when f is fed the actual description of the world, it selects an action that gets at least 0.5 utility, after consuming only 10^15 units of compute" or whatever. Now, set aside the fact that you won't find such a function with your theorem-prover AI before somebody else has ended the world (understanding intelligence well enough to build one that you can prove that theorem about, is pro'lly harder than whatever else people are deploying AGIs towards), and set aside also the fact that you're leaving a lot of utility on the table; even if that worked, you're still screwed.
Why are you still scr...
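For readers who want the "type of agency" point in concrete form, here is a minimal Python sketch. It assumes the type is read as (A -> B) -> A: a function that takes a world description (a map from actions to outcomes) and returns an action. The names `make_argmax_agent`, `toy_world`, and the toy utility are hypothetical and chosen only for illustration; nothing here comes from the original comment beyond the idea that "select the best possible action" is just argmax.

```python
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")  # actions
B = TypeVar("B")  # outcomes


def make_argmax_agent(
    actions: Sequence[A],
    utility: Callable[[B], float],
) -> Callable[[Callable[[A], B]], A]:
    """Build an element of the type (A -> B) -> A.

    Assumption for illustration: the action set is finite and the
    utility over outcomes is supplied explicitly.
    """

    def agent(world: Callable[[A], B]) -> A:
        # Score each action by the utility of the outcome the world
        # description says it leads to, and return the best one.
        return max(actions, key=lambda a: utility(world(a)))

    return agent


def toy_world(a: int) -> int:
    """Hypothetical world description: outcome is best when a == 2."""
    return -((a - 2) ** 2)


# The outcome is already a number here, so float serves as the utility.
agent = make_argmax_agent(actions=[0, 1, 2, 3], utility=float)
best = agent(toy_world)
```

The sketch is trivially "optimal" with respect to whatever description and utility it is handed, which is the sense in which exhibiting such a function does not by itself help: all the difficulty lives in the world description, the utility, and the cost of the search.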
It's #1, with a light side order of #3 that doesn't matter because #1.
I'm not sure where to start on explaining. How would you state a theorem that an AGI would put two cellular-identical strawberries on a plate, including inventing and building all technology required to do that, without destroying the world? If you can state this theorem you've done 250% of the work required to align an AGI.
It seems to me that a major crux about AI strategy routes through "is civilization generally adequate or not?". It seems like people have pretty different intuitions and ontologies here. Here's an attempt at some questions of varying levels of concreteness, to tease out some worldview implications.
(I normally use the phrase "civilizational adequacy", but I think that's kinda a technical term that means a specific thing and I think maybe I'm pointing at a broader concept.)
"Does civilization generally behave sensibly?" This is a vague question, some possible subquestions:
I don't think this is the main crux -- disagreements about mechanisms of intelligence seem far more important -- but to answer the questions:
Do you think major AI orgs will realize that AI is potentially worldendingly dangerous, and have any kind of process at all to handle that?
Clearly yes? They have safety teams that are focused on x-risk? I suspect I have misunderstood your question.
(Maybe you mean the bigger tech companies like FAANG, in which case I'm still at > 95% on yes, but I suspect I am still misunderstanding your question.)
(I know less about Chinese orgs but I still think "probably yes" if they become major AGI orgs.)
Do you think government intervention on AI regulations or policies will be net-positive or net-negative, for purposes of preventing x-risk?
Net positive, though mostly because it seems kinda hard to be net negative relative to "no regulation at all", not because I think the regulations will be well thought out. The main tradeoff that companies face seems to be speed / capabilities vs safety; it seems unlikely that even "random" regulations increase the speed and capabilities that companies can achieve. (Though it's certainly possible, e.g. a regulation fo...
There's something I had interpreted the original CEV paper to be implying, but wasn't sure if it was still part of the strategic landscape, which was "have the alignment project be working towards a goal that was highly visibly fair, to disincentivize races." Was that an intentional part of the goal, or was it just that CEV seemed something like "the right thing to do" (independent of its impact on races)?
How does Eliezer think about it now?
Yes, it was an intentional part of the goal.
If there were any possibility of surviving the first AGI built, then it would be nice to have AGI projects promising to do something that wouldn't look like trying to seize control of the Future for themselves, when, much later (subjectively?), they became able to do something like CEV. I don't see much evidence that they're able to think on the level of abstraction that CEV was stated on, though, nor that they're able to understand the 'seizing control of the Future' failure mode that CEV is meant to prevent, and they would not understand why CEV was a solution to the problem while 'Apple pie and democracy for everyone forever!' was not a solution to that problem. If at most one AGI project can understand the problem to which CEV is a solution, then it's not a solution to races between AGI projects. I suppose it could still be a solution to letting one AGI project scale even when incorporating highly intelligent people with some object-level moral disagreements.
To what extent do you think pivotal-acts-in-particular are strategically important (i.e. "successfully do a pivotal act, and if necessary build an AGI to do it" is the primary driving goal), vs "pivotal acts are useful shorthand to refer to the kind of intelligence level where it matters that an AGI be 'really safe'"?
I'm interested in particular in responses from Eliezer, Rohin, and perhaps Richard Ngo. (I've had private chats with Rohin that I thought were useful to share and this comment is sort of creating a framing device for sharing them, but I've bee...
My Eliezer-model thinks pivotal acts are genuinely, for-real, actually important. Like, he's not being metaphorical or making a pedagogical point when he says (paraphrasing) 'we need to use the first AGI systems to execute a huge, disruptive, game-board-flipping action, or we're all dead'.
When my Eliezer-model says that the most plausible pivotal acts he's aware of involve capabilities roughly at the level of 'develop nanotech' or 'put two cellular-identical strawberries on a plate', he's being completely literal. If some significantly weaker capability level realistically suffices for a pivotal act, then my Eliezer-model wants us to switch to focusing on that (far safer) capability level instead.
If we can save the world before we get anywhere near AGI, then we don't necessarily have to sort out how consequentialist, dangerous, hardware-overhang-y, etc. the first AGI systems will be. We can just push the 'End The Acute Existential Risk Period' button, and punt most other questions to the non-time-pressured Reflection that follows.
The goal is to bring x-risk down to near-zero, aka "End the Acute Risk Period". My usual story for how we do this is roughly "we create a methodology for building AI systems that allows you to align them at low cost relative to the cost of gaining capabilities; everyone uses this method, we have some governance / regulations to catch any stragglers who aren't using it but still can make dangerous systems".
If I talk to Eliezer, I expect him to say "yes, in this story you have executed a pivotal act, via magical low-cost alignment that we definitely do not get before we all die". In other words, the crux is in whether you can get an alignment solution with the properties I mentioned (and maybe also in whether people will be sensible enough to use the method + do the right governance). So with Eliezer I end up talking about those cruxes, rather than talking about "pivotal acts" per se, but I'm always imagining the "get an alignment solution, have everyone use it" plan.
When I talk to people who are attempting to model Eliezer, or defer to Eliezer, or speaking out of their own model that's heavily Eliezer-based, and I present this plan to them, and then they start thinking about pivotal...
This question is not directed at anyone in particular, but I'd want to hear some alignment researchers answer it. As a rough guess, how much would it affect your research—in the sense of changing your priorities, or altering your strategy of impact, and method of attack on the problem—if you made any of the following epistemic updates?
(Feel free to disambiguate anything here that's ambiguous or poorly worded.)
A question for Eliezer: If you were superintelligent, would you destroy the world? If not, why not?
If your answer is "yes" and the same would be true for me and everyone else for some reason I don't understand, then we're probably doomed. If it is "no" (or even just "maybe"), then there must be something about the way we humans think that would prevent world destruction even if one of us were ultra-powerful. If we can understand that and transfer it to an AGI, we should be able to prevent destruction, right?
I would "destroy the world" from the perspective of natural selection in the sense that I would transform it in many ways, none of which were making lots of copies of my DNA, or the information in it, or even having tons of kids half resembling my old biological self.
From the perspective of my highly similar fellow humans with whom I evolved in context, they'd get nice stuff, because "my fellow humans get nice stuff" happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that ended up inside me, as the result of my being strictly outer-optimized over millions of generations for inclusive genetic fitness, which I now don't care about at all.
Paperclip-numbers do well out of paperclip-number maximization. The hapless outer creators of the thing that weirdly ends up a paperclip maximizer, not so much.
"my fellow humans get nice stuff" happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that ended up inside me
This may not be what evolution had "in mind" when it created us. But couldn't we copy something like this into a machine so that it "thinks" of us (and our descendants) as its "fellow humans" who should "get nice stuff"? I understand that we don't know how to do that yet. But the fact that Eliezer has some kind of "don't destroy the world from a fellow human perspective" goal function inside his brain seems to mean a) that such a function exists and b) that it can be coded into a neuronal network, right?
I was also thinking about the specific way we humans weigh competing goals and values against each other. So while for instance we do destroy much of the biosphere by blindly pursuing our misaligned goals, some of us still care about nature and animal welfare and rain forests, and we may even be able to prevent total destruction of them.
I think we (mostly) all agree that we want to somehow encode human values into AGIs. That's not a new idea. The devil is in the details.
I see how my above question seems naive. Maybe it is. But if one potential answer to the alignment problem lies in the way our brains work, maybe we should try to understand that better, instead of (or in addition to) letting a machine figure it out for us through some kind of "value learning". (Copied from my answer to AprilSR:) I stumbled across two papers from a few years ago by a psychologist, Mark Muraven, who thinks that the way humans deal with conflicting goals could be important for AI alignment (https://arxiv.org/abs/1701.01487 and https://arxiv.org/abs/1703.06354). They appear a bit shallow to me and don't contain any specific ideas on how to implement this. But maybe Muraven has a point here.
Yes. But my impression so far is that anything we can even imagine in terms of a goal function will go badly wrong somehow. So I find it a bit reassuring that at least one such function that will not necessarily lead to doom seems to exist, even if we don't know how to encode it yet.
I guess there's some meta-level question here that I'm interested in, as a sort of elaboration, which is something like: how do you go about balancing which meta-levels of the world to satisfy and which to destroy? [I kind of have a sense that Eliezer's answer can be guessed as an extension of the meta-ethics sequence, and so am interested both in his actual answer and other people's answers.]
For example, one might imagine a mostly-upload situation like The Metamorphosis of Prime Intellect / Friendship is Optimal / Second Life / etc., wherein everyone gets a materially abundant digital life in their shard of the metaverse, with communication heavily constrained (if nothing else, by requiring mutual consent). This, of course, discards as no-longer-relevant entities that exist on higher meta-levels; nations will be mostly irrelevant in such a world, companies will mostly stop existing, and so on.
But one could also apply the same logic a level lower. If you take Internal Family Systems / mental modules seriously, humans don't look like atomic objects, they look like a collection of simpler subagents balanced together in a sort of precarious way. (One part of you wants to accumulate lo...
It was all very interesting, but what was the goal of these discussions? I mean I had an impression that pretty much everyone assigned >5% probability to "if we scale we all die" so it's already enough reason to work on global coordination on safety. Is the reasoning that the same mental process that assigned too low a probability would not be able to come up with an actual solution? Or something like "at the time they think their solution reduced probability of failure from 5% to 0.1% it would still be much higher"? That seems possible only if people don't understand the arguments about inner optimizers or whatnot, as opposed to disagreeing with them.
Changing one's mind on P(doom) can be useful for people comparing across cause areas (e.g. Open Phil), but it's not all that important for me and was not one of my goals.
Generally when people have big disagreements about some high-level question like P(doom), it means that they have very different underlying models that drive their reasoning within that domain. The main goal (for me) is to acquire underlying models that I can then use in the future.
Acquiring a new underlying model that I actually believe would probably be more important than the rest of my work in a full year combined. It would typically have significant implications on what sorts of proposals can and cannot work, and would influence what research I do for years to come. In the case of Eliezer's model specifically, it would completely change what research I do, since Eliezer's model specifically predicts that the research I do is useless (I think).
I didn't particularly expect to actually acquire a new model that I believed from these conversations, but there was some probability of that, and I did expect that I would learn at least a few new things I hadn't previously considered. I'm unfortunately quite bad at noticing my own "updates", so I can't easily point to examples. That being said, I'm confident that I would now be significantly better at Eliezer's ITT than before the conversations.
I mean I had an impression that pretty much everyone assigned >5% probability to "if we scale we all die" so it's already enough reason to work on global coordination on safety.
What specific actions do you have in mind when you say "global coordination on safety", and how much of the problem do you think these actions solve?
My own view is that 'caring about AI x-risk at all' is a pretty small (albeit indispensable) step. There are lots of decisions that hinge on things other than 'is AGI risky at all'.
I agree with Rohin that the useful thing is trying to understand each other's overall models of the world and try to converge on them, not p(doom) per se. I gave some examples here of some important implications of having more Paul-ish models versus more Eliezer-ish models.
More broadly, examples of important questions people in the field seem to disagree a lot about:
Am I correct in assuming that your baseline belief right now is that alignment will not be solved before the first AGI is created? As a tangentially-related question, do you believe there is any significant likelihood that we could create a “semi-aligned” AGI (which would optimize for “less bad,” but still potentially dystopian futures) more easily than solving for full alignment? If so, how much energy should we be putting into exploring that possibility space? (Latter question adapted from the discussion around https://www.lesswrong.com/posts/wRq6cwtHpXB9zF9gj/better-a-brave-new-world-than-a-dead-one)
Nope. It's just as hard and harder than aligning on some more limited pivotal task. This is Sacrifice to the Gods; you imagine accepting some big downside but the big downside doesn't actually buy you anything.
During this weekend's SERI Conference, to my understanding, Paul Christiano specified that his work focuses on preventing AI from disempowering humans and disregards externalities. Whose work focuses on understanding these externalities, such as wellbeing and freedom experienced by humans and other sentience, including AI and animals? Is it possible to safely employ the AI that has the best total externalities, measured across times under the veil of ignorance? Is it necessary that overall beneficial systems are developed prior to the existence of AGI, so that ...
One argument for alignment difficulty is that corrigibility is "anti-natural" in a certain sense. I've tried to write out my understanding of this argument, and would be curious if anyone could add or improve anything about it.
I'd be equally interested in any attempts at succinctly stating other arguments for/against alignment difficulty.
Will MIRI want to hire programmers once the pandemic is over? What kind of programmers? What other kinds of people do you seek to hire?
So, about that "any future details would make me update in one direction, so I may as well update now" move: I think it would be helpful to have a description of how it can possibly be a correct thing to do at all from a Bayesian standpoint. Like, is the situation supposed to be that you already have a general mechanism generating these details and just don't realise it? But then you need reasons to believe that general mechanism. Or is it just "I did such an update a bunch of times and it usually worked"? Or what?
I'm late to the party by a month, but I'm interested in your take (especially Rohin's) on the following:
Conditional on an existential catastrophe happening due to AI systems, what is your credence that the catastrophe will occur only after the involved systems are deployed?
I have a question for the folks who think AGI alignment is achievable in the near term in small steps or by limiting AGI behavior to make it safe. How hard will it be to achieve alignment for simple organisms as a proof of concept for human value alignment? How hard would it be to put effective limits or guardrails on the resulting AGI if we let the organisms interact directly with the AGI while still preserving their values? Imagine a setup where interactions by the organism must be interpreted as requests for food, shelter, entertainment, uplift, etc....
With the release of Rohin Shah and Eliezer Yudkowsky's conversation, the Late 2021 MIRI Conversations sequence is now complete.
This post is intended as a generalized comment section for discussing the whole sequence, now that it's finished. Feel free to:
In particular, Eliezer Yudkowsky, Richard Ngo, Paul Christiano, Nate Soares, and Rohin Shah expressed active interest in receiving follow-up questions here. The Schelling time when they're likeliest to be answering questions is Wednesday March 2, though they may participate on other days too.