I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI. EG my recent attempt to operationalize a bet with Nate went nowhere. Paul trying to get Eliezer to bet during the MIRI dialogues also went nowhere, or barely anywhere—I think they ended up making some random bet about how long an IMO challenge would take to be solved by AI. (feels pretty weak and unrelated to me. lame. but huge props to Paul for being so ready to bet, that made me take him a lot more seriously.)
This paragrap...
Your comments' points seem like further evidence for my position. That said, your comment appears to serve the function of complicating the conversation, and that happens to have the consequence of diffusing the impact of my point. I do not allege that you are doing so on purpose, but I think it's important to notice. I would have been more convinced by a reply of "no, you're wrong, here's the concrete bet(s) EY made or was willing to make but Paul balked."
I will here repeat a quote[1] which seems relevant:
...[Christiano][12:29]
my desir
If I was misreading the blog post at the time, how come almost no one seems to have explicitly predicted, at the time, that these particular problems would be trivial for systems below or at human-level intelligence?!?
Quoting the abstract of MIRI's "The Value Learning Problem" paper (emphasis added):
...Autonomous AI systems’ programmed goals can easily fall short of programmers’ intentions. Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended. We discuss early ideas on how one might design smarte
But the benefit of a Pause is that you use the extra time to do something in particular. Why wouldn't you want to fiscally sponsor research on problems that you think need to be solved for the future of Earth-originating intelligent life to go well?
MIRI still sponsors some alignment research, and I expect we'll sponsor more alignment research directions in the future. I'd say MIRI leadership didn't have enough aggregate hope in Agent Foundations in particular to want to keep supporting it ourselves (though I consider its existence net-positive).
My mo...
...I don't find this convincing. I think the target "dumb enough to be safe, honest, trustworthy, relatively non-agentic, etc., but smart enough to be super helpful for alignment" is narrow (or just nonexistent, using the methods we're likely to have on hand).
Even if this exists, verification seems extraordinarily difficult: how do we know that the system is being honest? Separately, how do we verify that its solutions are correct? Checking answers is sometimes easier than generating them, but only to a limited degree, and alignment seems like a case where ch
one positive feature it does have: it proposes to rely on a multitude of "limited weakly-superhuman artificial alignment researchers" and makes a reasonable case that those can be obtained in a form factor which is alignable and controllable.
I don't find this convincing. I think the target "dumb enough to be safe, honest, trustworthy, relatively non-agentic, etc., but smart enough to be super helpful for alignment" is narrow (or just nonexistent, using the methods we're likely to have on hand).
Even if this exists, verification seems extraordinarily difficu...
As a start, you can prohibit sufficiently large training runs. This isn't a necessary-and-sufficient condition, and doesn't necessarily solve the problem on its own, and there's room for debate about how risk changes as a function of training resources. But it's a place to start, when the field is mostly flying blind about where the risks arise; and choosing a relatively conservative threshold makes obvious sense when failing to leave enough safety buffer means human extinction. (And when algorithmic progress is likely to reduce the minimum dangerous train...
Alternatively, they either don't buy the perils, or believe there's a chance the other side may not?
If they "don't buy the perils", and the perils are real, then Leopold's scenario is falsified and we shouldn't be pushing for the USG to build ASI.
If there are no perils at all, then sure, Leopold's scenario and mine are both false. I didn't mean to imply that our two views are the only options.
Separately, Leopold's model of "what are the dangers?" is different from mine. But I don't think the dangers Leopold is worried about are dramatically easier to und...
Why? 95% risk of doom isn't certainty, but seems obviously more than sufficient.
For that matter, why would the USG want to build AGI if they considered it a coinflip whether this will kill everyone or not? The USG could choose the coinflip, or it could choose to try to prevent China from putting the world at risk without creating that risk itself. "Sit back and watch other countries build doomsday weapons" and "build doomsday weapons yourself" are not the only two options.
Why? 95% risk of doom isn't certainty, but seems obviously more than sufficient.
If AI itself leads to doom, it likely doesn't matter whether it was developed by US Americans or by the Chinese. But if it doesn't lead to doom (the remaining 5%) it matters a lot which country is first, because that country is likely to achieve world domination.
The USG could choose the coinflip, or it could choose to try to prevent China from putting the world at risk without creating that risk itself.
Short of choosing a nuclear war with China, the US can't do much to d...
Leopold's scenario requires that the USG come to deeply understand all the perils and details of AGI and ASI (since they otherwise don't have a hope of building and aligning a superintelligence), but then needs to choose to gamble its hegemony, its very existence, and the lives of all its citizens on a half-baked mad science initiative, when it could simply work with its allies to block the tech's development and maintain the status quo at minimal risk.
Success in this scenario requires a weird combination of USG prescience with self-destructiveness: enough...
when it would potentially be vastly easier to spearhead an international alliance to prohibit this technology.
I would be interested in reading more about the methods that could be used to prohibit the proliferation of this technology (you can assume a "wake-up" from the USG).
I think one of the biggest fears would be that any sort of international alliance would lack perfect/robust detection capabilities, so there's always the risk that someone is running a rogue AGI project.
Also, separately, there's the issue of "at some point, doesn...
Responding to Matt Reardon's point on the EA Forum:
...Leopold's implicit response as I see it:
- Convincing all stakeholders of high p(doom) such that they take decisive, coordinated action is wildly improbable ("step 1: get everyone to agree with me" is the foundation of many terrible plans and almost no good ones)
- Still improbable, but less wildly, is the idea that we can steer institutions towards sensitivity to risk on the margin and that those institutions can position themselves to solve the technical and other challenges ahead
Maybe the key insight is that
I do have a lot of reservations about Leopold's plan. But one positive feature it does have: it proposes to rely on a multitude of "limited weakly-superhuman artificial alignment researchers" and makes a reasonable case that those can be obtained in a form factor which is alignable and controllable. So his plan does seem to have a good chance to overcome the factor that AI existential safety research is a
field that has not been particularly productive or fast in the past
and also to overcome other factors requiring overreliance on humans and on current ...
As is typical for Twitter, we also signal-boosted a lot of other people's takes. Some non-MIRI people whose social media takes I've recently liked include Wei Dai, Daniel Kokotajlo, Jeffrey Ladish, Patrick McKenzie, Zvi Mowshowitz, Kelsey Piper, and Liron Shapira.
The stuff I've been tweeting doesn't constitute an official MIRI statement — e.g., I don't usually run these tweets by other MIRI folks, and I'm not assuming everyone at MIRI agrees with me or would phrase things the same way. That said, some recent comments and questions from me and Eliezer:
...Every protest I've witnessed seemed to be designed to annoy and alienate its witnesses, making it as clear as possible that there was no way to talk to these people, that their minds were on rails. I think most people recognize that as cult shit and are alienated by that.
In the last year, I've seen a Twitter video of an AI risk protest (I think possibly in continental Europe?) that struck me as extremely good: calm, thoughtful, accessible, punchy, and sensible-sounding statements and interview answers. If I find the link again, I'll add it here as a model ...
Could we talk about a specific expert you have in mind, who thinks this is a bad strategy in this particular case?
AI risk is a pretty weird case, in a number of ways: it's highly counter-intuitive, not particularly politically polarized / entrenched, seems to require unprecedentedly fast and aggressive action by multiple countries, is almost maximally high-stakes, etc. "Be careful what you say, try to look normal, and slowly accumulate political capital and connections in the hope of swaying policymakers long-term" isn't an unconditionally good strategy, i...
I'm interpreting "realize" colloquially, as in, "be aware of". I don't think the people discussed in the post just haven't had it occur to them that pre-singularity wealth doesn't matter because a win singularity society very likely wouldn't care much about it. Instead someone might, for example...
Two things:
Some of the other things you suggest, like future systems keeping humans physically alive, do not seem plausible to me.
I agree with Gretta here, and I think this is a crux. If MIRI folks thought it were likely that AI will leave a few humans biologically alive (as opposed to information-theoretically revivable), I don't think we'd be comfortable saying "AI is going to kill everyone". (I encourage other MIRI folks to chime in if they disagree with me about the counterfactual.)
I also personally have maybe half my probability mass on "the AI just doesn't stor...
FWIW I do think "don't trust this guy" is warranted; I don't know that he's malicious, but I think he's just exceptionally incompetent relative to the average tech reporter you're likely to see stories from.
Like, in 2018 Metz wrote a full-length article on smarter-than-human AI that included the following frankly incredible sentence:
...During a recent Tesla earnings call, Mr. Musk, who has struggled with questions about his company’s financial losses and concerns about the quality of its vehicles, chastised the news media for not focusing on the deaths that a
FWIW, Cade Metz was reaching out to MIRI and some other folks in the x-risk space back in January 2020, and I went to read some of his articles and came to the conclusion in January that he's one of the least competent journalists -- like, most likely to misunderstand his beat and emit obvious howlers -- that I'd ever encountered. I told folks as much at the time, and advised against talking to him just on the basis that a lot of his journalism is comically bad and you'll risk looking foolish if you tap him.
This was six months before Metz caused SSC to shu...
It's pretty sad to call all of these end states you describe "alignment", as "alignment" is an extremely natural word for "actually terminally has good intentions".
Aren't there a lot of clearer words for this? "Well-intentioned", "nice", "benevolent", etc.
(And a lot of terms, like "value loading" and "value learning", that are pointing at the research project of getting good intentions into the AI.)
To my ear, "aligned person" sounds less like "this person wishes the best for me", and more like "this person will behave in the right ways".
If I hear that Russia an...
Aren't there a lot of clearer words for this? "Well-intentioned", "nice", "benevolent", etc.
Fair enough. I guess it just seems somewhat incongruous to say: "Oh yes, the AI is aligned. Of course it might desperately crave murdering all of us in its heart (we certainly haven't ruled this out with our current approach), but it is aligned because we've made it so that it wouldn't get away with it if it tried."
"Should" in order to achieve a certain end? To meet some criterion? To boost a term in your utility function?
In the OP: "Should" in order to have more accurate beliefs/expectations. E.g., I should anticipate (with high probability) that the Sun will rise tomorrow in my part of the world, rather than it remaining night.
Why would the laws of physics conspire to vindicate a random human intuition that arose for unrelated reasons?
We do agree that the intuition arose for unrelated reasons, right? There's nothing in our evolutionary history, and no empirical observation, that causally connects the mechanism you're positing and the widespread human hunch "you can't copy me".
If the intuition is right, we agree that it's only right by coincidence. So why are we desperately searching for ways to try to make the intuition right?
...It also doesn't force us to believe that a bunch of w
You're missing the bigger picture and pattern-matching in the wrong direction. I am not saying the above because I have a need to preserve my "soul" due to misguided intuitions. On the contrary, the reason for my disagreement is that I believe you are not staring into the abyss of physicalism hard enough. When I said I'm agnostic in my previous comment, I said it because physics and empiricism lead me to consider reality as more "unfamiliar" than you do (assuming that my model of your beliefs is accurate). From my perspective, your post and your conc...
Yeah. My point was, we can't even be sure which behavior-preserving optimizations (of the kind done by optimizing compilers, say) will preserve consciousness. It's worrying because these optimizations can happen innocuously, e.g. when your upload gets migrated to a newer CPU with fancier heuristics. And yeah, when self-modification comes into the picture, it gets even worse.
You can easily clear this confusion if you rephrase it as "You should anticipate having any of these experiences". Then it's immediately clear that we are talking about two separate screens.
This introduces some other ambiguities. E.g., "you should anticipate having any of these experiences" may make it sound like you have a choice as to which experience to rationally expect.
And it's also clear that our curiosity isn't actually satisfied. That the question "which one of these two will actually be the case" is still very much on the table.
... And the answer...
My first issue with your post is that this initial ontological assumption is neither mentioned explicitly nor motivated. Nothing in your post can be used as proof of this initial assumption.
There are always going to be many different ways someone could object to a view. If you were a Christian, you'd perhaps be objecting that the existence of incorporeal God-given Souls is the real crux of the matter, and if I were intellectually honest I'd be devoting the first half of the post to arguing against the Christian Soul.
Rather than trying to anticipate these o...
First off, would you agree with my model of your beliefs? Would you consider it an accurate description?
Also, let me make clear that I don't believe in cartesian souls. I, like you, lean towards physicalism, I just don't commit to the explanation of consciousness based on the idea of the brain as a **classical** electronic circuit. I don't fully dismiss it either, but I think it is worse on philosophical grounds than assuming that there is some (potentially minor) quantum effect going on inside the brain that is an integral part of the explanation fo...
Wouldn't it follow that in the same way you anticipate the future experiences of the brain that you "find yourself in" (i.e. the person reading this) you should anticipate all experiences, i.e. that all brain states occur with the same kind of me-ness/vivid immediacy?
What's the empirical or physical content of this belief?
I worry that this may be another case of the Cartesian Ghost rearing its ugly head. We notice that there's no physical thingie that makes the Ghost more connected to one experience or the other; so rather than exorcising the Ghost entirel...
As a test, I asked a non-philosopher friend of mine what their view is. Here's a transcript of our short conversation: https://docs.google.com/document/d/1s1HOhrWrcYQ5S187vmpfzZcBfolYFIbeTYgqeebNIA0/edit
I was a bit annoyingly repetitive with trying to confirm and re-confirm what their view is, but I think it's clear from the exchange that my interpretation is correct at least for this person.
Is there even anybody claiming there is an experiential difference?
Yep! Ask someone with this view whether the current stream of consciousness continues from their pre-uploaded self to their post-uploaded self, like it continues when they pass through a doorway. The typical claim is some version of "this stream of consciousness will end, what comes next is only oblivion", not "oh sure, the stream of consciousness is going to continue in the same way it always does, but I prefer not to use the English word 'me' to refer to the later parts of that stream of ...
The problem was that you first seemed to belittle questions about word meanings ("self") as being "just" about "definitions" that are "purely verbal".
I did no such thing!
Luckily now you concede that the question about the meaning of "I" isn't just about (arbitrary) "definitions"
Read the blog post at the top of this page! It's my attempt to answer the question of when a mind is "me", and you'll notice it's not talking about definitions.
...But we already know all the empirical facts: Someone goes into the teleporter, a bit later someone comes out at the other e
You're also free to define "I" however you want in your values.
Sort of!
FWIW, I typically use "alignment research" to mean "AI research aimed at making it possible to safely do ambitious things with sufficiently-capable AI" (with an emphasis on "safely"). So I'd include things like Chris Olah's interpretability research, even if the proximate impact of this is just "we understand what's going on better, so we may be more able to predict and finely control future systems" and the proximate impact is not "the AI is now less inclined to kill you".
Some examples: I wouldn't necessarily think of "figure out how we want to airgap the...
It's pretty sad to call all of these end states you describe "alignment", as "alignment" is an extremely natural word for "actually terminally has good intentions". So, this makes me sad to call this alignment research. Of course, this type of research may be instrumentally useful for making AIs more aligned, but so will a bunch of other stuff (e.g. earning to give).
Fair enough if you think we should just eat this terminology issue and then coin a new term like "actually real-alignment-targeting-directly alignment research". Idk what the right term is obviously.
But that isn't an experience. It's two experiences. You will not have an experience of having two experiences. Two experiences will experience having been one person.
Sure; from my perspective, you're saying the same thing as me.
Are you going to care about 1000 different copies equally?
How am I supposed to choose between them?
Why? If "I" is an arbitrary definition, then "When I step through this doorway, will I have another experience?" depends on this arbitrary definition and so is also arbitrary.
Which things count as "I" isn't an arbitrary definition; it's just a fuzzy natural-language concept.
(I guess you can call that "arbitrary" if you want, but then all the other words in the sentence, like "doorway" and "step", are also "arbitrary".)
Analogy: When you're writing in your personal diary, you're free to define "table" however you want. But in ordinary English-language discourse...
The problem is another way to phrase this is a superintelligent weapon system - "ending a risk period" by "reliably, and efficiently doing a small number of specific concrete tasks" means using physical force to impose your will on others.
The pivotal acts I usually think about actually don't route through physically messing with anyone else. I'm usually thinking about using aligned AGI to bootstrap to fast human whole-brain emulation, then using the ems to bootstrap to fully aligned CEV AI.
If someone pushes a "destroy the world" button then the ems or CEV ...
To pick out a couple of specific examples from your list, Wei Dai:
14. Human-controlled AIs causing ethical disasters (e.g., large scale suffering that can't be "balanced out" later) prior to reaching moral/philosophical maturity
This is a serious long-term concern if we don't kill ourselves first, but it's not something I see as a premise for "the priority is for governments around the world to form an international agreement to halt AI progress". If AI were easy to use for concrete tasks like "build nanotechnology" but hard to use for things like CEV, I'd ...
Yep, before I saw orthonormal's response I had a draft-reply written that says almost literally the same thing:
we just call 'em like we see 'em
[...]
insofar as we make bad predictions, we should get penalized for it. and insofar as we think alignment difficulty is the crux for 'why we need to shut it all down', we'd rather directly argue against illusory alignment progress (and directly acknowledge real major alignment progress as a real reason to be less confident of shutdown as a strategy) rather than redirect to something less cruxy
I'll also add: Nate (u...
Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal "maximize diamonds in an aligned way", why not give it a bunch of small, grounded ones?
- "Plan the factory layout of the diamond synthesis plant with these requirements".
- "Order the equipment needed, here's the payment credentials".
- "Supervise construction this workday comparing to original plans"
- "Given this step of the plan, do it"
- (Once the factory is built) "remove the output from diamond synthesis machine A53 and clean it".
That is how MIRI imagines a sane developer using just-b...
From briefly talking to Eliezer about this the other day, I think the story from MIRI's perspective is more like:
We could have defined the goal more narrowly or generically than that, but that just seemed like an invitation to take your eye off the ball: if we aren't going to think about the question of how...
getting AIs to safely, reliably, and efficiently do a small number of specific concrete tasks that are very difficult, for the sake of ending the acute existential risk period.
The problem is another way to phrase this is a superintelligent weapon system - "ending a risk period" by "reliably, and efficiently doing a small number of specific concrete tasks" means using physical force to impose your will on others.
On reflection, I do not think that it is a wise idea to factor the path to a good future through a global AI-assisted coup.
Instead one should tr...
In the context of a conversation with Balaji Srinivasan about my AI views snapshot, I asked Nate Soares what sorts of alignment results would impress him, and he said:
...example thing that would be relatively impressive to me: specific, comprehensive understanding of models (with the caveat that that knowledge may lend itself more (and sooner) to capabilities before alignment). demonstrated e.g. by the ability to precisely predict the capabilities and quirks of the next generation (before running it)
i'd also still be impressed by simple theories of aimable co
I can come up with plans for destroying the world without wanting to do it, and other cognitive systems probably can too.
You're changing the topic to "can you do X without wanting Y?", when the original question was "can you do X without wanting anything at all?".
Nate's answer to nearly all questions of the form "can you do X without wanting Y?" is "yes", hence his second claim in the OP: "the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular".
...I do need to answer that question u
When the post says:
This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense".
It seems like it's saying that if you prompt an LM with "Could you suggest a way to get X in light of all the obstacles that reality has thrown in my way," and if it does that reasonably well and if you hook it up to actuators, then it definitionally has wants and desires.
Which is a fine definition to pick. But the point is that in thi...
The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is.
See my reply to Bogdan here. The issue isn't "inelegance"; we also lack an inelegant ability to predict or explain how particular ML systems do what they do.
Modern ML is less like modern chemistry, and more like ancient culinary arts and medicine. (Or "ancient culinary arts and medicine shortly after a cultural reboot", such that we have a relatively small number of recently-developed shallow heuristics and facts to draw on, rather than...
Some of Nate’s quick thoughts (paraphrased), after chatting with him:
Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-me...
I read and responded to some pieces of that post when it came out; I don't know whether Eliezer, Nate, etc. read it, and I'm guessing it didn't shift MIRI, except as one of many data points "person X is now loudly in favor of a pause (and other people seem receptive), so maybe this is more politically tractable than we thought".
I'd say that Kerry Vaughan was the main person who started smashing this Overton window, and this started in April/May/June of 2022. By late December my recollection is that this public conversation was already fully in swing and MI...
I didn't cross-post it, but I've poked EY about the title!