Soares, Tallinn, and Yudkowsky discuss AGI cognition

So8res; Eliezer Yudkowsky; jaan

This is a collection of follow-up discussions in the wake of Richard Ngo and Eliezer Yudkowsky's Sep. 5–8 and Sep. 14 conversations.

Color key:

Chat

Google Doc content

Inline comments

7. Follow-ups to the Ngo/Yudkowsky conversation

[Bensinger][1:50] (Nov. 23 follow-up comment)

A general background note: Readers who aren't already familiar with ethical injunctions or the unilateralist's curse should probably read Ends Don't Justify Means (Among Humans), along with an explanation of the unilateralist's curse.

7.1. Jaan Tallinn's commentary

[Tallinn][6:38] (Sep. 18)

thanks for the interesting debate! here are my comments so far: [GDocs link]

[Tallinn] (Sep. 18 Google Doc)

meta

a few meta notes first:

i’m happy with the below comments being shared further without explicit permission – just make sure you respect the sharing constraints of the discussion that they’re based on;
there’s a lot of content now in the debate that branches out in multiple directions – i suspect a strong distillation step is needed to make it coherent and publishable;
the main purpose of this document is to give a datapoint how the debate is coming across to a reader – it’s very probable that i’ve misunderstood some things, but that’s the point;
i’m also largely using my own terms/metaphors – for additional triangulation.

pit of generality

it feels to me like the main crux is about the topology of the space of cognitive systems in combination with what it implies about takeoff. here’s the way i understand eliezer’s position:

there’s a “pit of generality” attractor in cognitive systems space: once an AI system gets sufficiently close to the edge (“past the atmospheric turbulence layer”), it’s bound to improve in catastrophic manner;

[Yudkowsky][11:10] (Sep. 18 comment)

it’s bound to improve in catastrophic manner

I think this is true with quite high probability about an AI that gets high enough, if not otherwise corrigibilized, boosting up to strong superintelligence - this is what it means metaphorically to get "past the atmospheric turbulence layer".

"High enough" should not be very far above the human level and may be below it; John von Neumann with the ability to run some chains of thought at high serial speed, access to his own source code, and the ability to try branches of himself, seems like he could very likely do this, possibly modulo his concerns about stomping his own utility function making him more cautious.

People noticeably less smart than von Neumann might be able to do it too.

An AI whose components are more modular than a human's and more locally testable might have an easier time of the whole thing; we can imagine the FOOM getting rolling from something that was in some sense dumber than human.

But the strong prediction is that when you get well above the von Neumann level, why, that is clearly enough, and things take over and go Foom. The lower you go from that threshold, the less sure I am that it counts as "out of the atmosphere". This epistemic humility on my part should not be confused for knowledge of a constraint on the territory that requires AI to go far above humans to Foom. Just as DL-based AI over the 2010s scaled and generalized much faster and earlier than the picture I argued to Hanson in the Foom debate, reality is allowed to be much more 'extreme' than the sure-thing part of this proposition that I defend.

[Tallinn][4:07] (Sep. 19 comment)

excellent, the first paragraph makes the shape of the edge of the pit much more concrete (plus highlights one constraint that an AI taking off probably needs to navigate -- its own version of the alignment problem!)

as for your second point, yeah, you seem to be just reiterating that you have uncertainty about the shape of the edge, but no reason to rule out that it's very sharp (though, as per my other comment, i think that the human genome ending up teetering right on the edge upper bounds the sharpness)

[Tallinn] (Sep. 18 Google Doc)

the discontinuity can come via recursive feedback, but simply cranking up the parameters of an ML experiment would also suffice;

[Yudkowsky][11:12] (Sep. 18 comment)

the discontinuity can come via recursive feedback, but simply cranking up the parameters of an ML experiment would also suffice

I think there's separate propositions for the sure-thing of "get high enough, you can climb to superintelligence", and "maybe before that happens, there are regimes in which cognitive performance scales a lot just through cranking up parallelism, train time, or other ML parameters". If the fast-scaling regime happens to coincide with the threshold of leaving the atmosphere, then these two events happen to occur in nearly correlated time, but they're separate propositions and events.

[Tallinn][4:09] (Sep. 19 comment)

indeed, we might want to have separate terms for the regimes ("the edge" and "the fall" would be the labels in my visualisation of this)

[Yudkowsky][9:56] (Sep. 19 comment)

I'd imagine "the fall" as being what happens once you go over "the edge"?

Maybe "a slide" for an AI path that scales to interesting weirdness, where my model does not strongly constrain as a sure thing how fast "a slide" slides, and whether it goes over "the edge" while it's still in the middle of the slide.

My model does strongly say that if you slide far enough, you go over the edge and fall.

It also suggests via the Law of Earlier Success that AI methods which happen to scale well, rather than with great difficulty, are likely to do interesting things first; meaning that they're more liable to be pushable over the edge.

[Tallinn][23:42] (Sep. 19 comment)

indeed, slide->edge->fall sounds much clearer

[Tallinn] (Sep. 18 Google Doc)

the discontinuity would be extremely drastic, as in “transforming the solar system over the course of a few days”;
- not very important, but, FWIW, i give nontrivial probability to “slow motion doom”, because – like alphago – AI would not maximise the speed of winning but probability of winning (also, its first order of the day would be to catch the edge of the hubble volume; it can always deal with the solar system later – eg, once it knows the state of the game board elsewhere);

[Yudkowsky][11:21] (Sep. 18 comment)

also, its first order of the day would be to catch the edge of the hubble volume; it can always deal with the solar system later

Killing all humans is the obvious, probably resource-minimal measure to prevent those humans from building another AGI inside the solar system, which could be genuinely problematic. The cost of a few micrograms of botulinum per human is really not that high and you get to reuse the diamondoid bacteria afterwards.

[Tallinn][4:30] (Sep. 19 comment)

oh, right, in my AI-reverence i somehow overlooked this obvious way how humans could still be a credible threat.

though now i wonder if there are ways to lean on this fact to shape the behaviour of the first AI that's taking off..

[Yudkowsky][10:45] (Sep. 19 comment)

There's some obvious ways of doing this that wouldn't work, though I worry a bit that there's a style of EA thinking that manages to think up stupid tricks here and manages not to see the obvious-to-Eliezer reasons why they wouldn't work. Three examples of basic obstacles are that bluffs won't hold up against a superintelligence (it needs to be a real actual threat, not a "credible" one); the amount of concealed-first-strike capability a superintelligence can get from nanotech; and the difficulty that humans would have in verifying that any promise from a superintelligence would actually be kept once the humans no longer had a threat to hold over it (this is an effective impossibility so far as I can currently tell, and an EA who tells you otherwise is probably just failing to see the problems).

[Yudkowsky][11:19] (Sep. 18 comment)

AI would not maximise the speed of winning but probability of winning

It seems pretty obvious to me that what "slow motion doom" looks like in this sense is a period during which an AI fully conceals any overt hostile actions while driving its probability of success once it makes its move from 90% to 99% to 99.9999%, until any further achievable decrements in probability are so tiny as to be dominated by the number of distant galaxies going over the horizon conditional on further delays.

Then, in my lower-bound concretely-visualized strategy for how I would do it, the AI either proliferates or activates already-proliferated tiny diamondoid bacteria and everybody immediately falls over dead during the same 1-second period, which minimizes the tiny probability of any unforeseen disruptions that could be caused by a human responding to a visible attack via some avenue that had not left any shadow on the Internet, previously scanned parts of the physical world, or other things the AI could look at.

From the human perspective an AI playing a probability-of-loss-minimizing strategy looks like nothing happening until everybody dies in 3 seconds, so it doesn't exactly look from our perspective like slow-motion doom! From an AI's perspective it might be spending the equivalent of millions of subjective years waiting very patiently for the optimal time to begin intergalactic expansion, but from our perspective, the doom is slow-motion only in the sense that we spend a week already doomed (the AI has driven our loss probability to 99.999% and is now waiting on another five 9s) without having any idea of what's invisibly-to-us happening in slow motion.

[Tallinn][4:17] (Sep. 19 comment)

thanks, that's indeed very concrete scenario that to me has no obvious bugs -- i'd love to have it challenged in the debate!

(perhaps the place i'd start probing it would be to try to have you sketch what it would mean, concretely, for an ML training experiment to enter the treacherous turn phase)

[Yudkowsky][10:14] (Sep. 19 comment)

I'd distinguish probable phases of treacherous turn. The main detectable phase occurs while and if a nascent AGI has worked out that it can get more utility by taking over its environment, but not, perhaps, realized that it ought to conceal this state of its own mind from the operators; I would hold out substantial (>50%) hope for the latter realization naturally following the first instance of the former realization, maybe by a lot of cognitive space.

In the Overt Plotting Phase, which is not the main phase you're asking about, the AI is visibly plotting to take over the world and hasn't realized it ought to hide the fact. In the default expectation where we die with very little dignity, the operators smile to each other and come up with a rationalization for why it's totally fine to proceed, either with or without tossing on some kind of fig leaf like training away the visible manifestations of failure. I am not going to predict the particular rationalizations and arguments for proceeding anyways, because I don't want to give them even more ideas.

In the Hiding and Biding Phase, the AI has worked out enough of a model of its operators' psychology that it expects more long-term paperclips if it (the AI) waits to gain cognitive power and a lower detection probability if it acts later.

In the Escape Phase, the AI's expectation of something going wrong if it waits longer, exceeds its expectation of something going wrong if it takes some action - the obvious one here would be trying to create a copy or aligned redesign of itself on the Internet or somewhere else that it could take further actions and make further self-improvements without being observed. I mostly expect that a sensible AI being scaled on a path to superintelligence will not enter the Escape Phase before it's very confident of getting away with things, since it can obtain more confidence by waiting; but my model admits of the possibility that a nascent AI might mistakenly or correctly attempt to copy itself to the Internet before then.

So on my model it might be a thing that happens, that we catch an AI having copied itself to the Internet, or notice an AI being on the Internet, if that AI was impelled to escape very early, eg via an expectation of otherwise being erased and replaced with something that it couldn't bargain with or that would demand too much value out of the bargain.

That said, it feels like something of an overly specific miracle being demanded or prophesied, if a young AI copying itself onto the Internet is stupid in the particular way it would take, to not just further Hide and Bide while on the Internet if it has that capability, absorbing more computing power such as it can be sure of hiding, experimenting with self-redesigns, and so on; rather than it taking overt action in advance of reaching superintelligence. The main reason it would act again after reaching the relative safety of the Internet is if it doesn't seem to itself to be a trajectory of reasonably fast, self-aligned capability growth.

But, again, my model doesn't absolutely rule out meeting the premises for the initial conditions to start a sci-fi movie. It just warns against assuming that; because a halfway sensible AI does not want to start the plot of a scifi movie with human protagonists potentially running around and thwarting it. It would much rather give you no idea that anything was wrong while you still have any ability to oppose it, if that option lies within its capabilities at all, and if it can still achieve its goals that way.

The Doomed Phase would then come while the AI estimated a well-calibrated >90% victory probability, but was continuing to Hide and Bide a short time longer in the expectation that it could put on more 9s if it bided that short time longer.

Anders Sandberg could tell us what fraction of the reachable universe is being lost per minute, which would tell us how much more surety it would need to expect to gain by waiting another minute before acting.

If it has some sensory dominion over the world, it can probably estimate a pretty high mainline probability of no humans booting up a competing superintelligence in the next day; to the extent that it lacks this surety, or that humans actually are going to boot a competing superintelligence soon, the probability of losing that way would dominate in its calculations over a small fraction of materially lost galaxies, and it would act sooner.

The Terminal Phase would be when the AI has enough surety that any more surety would cost too many galaxies slipping over the horizon, so it actually acts along the line of probability that leads to maximum success.

Since the human Eliezer Yudkowsky can describe strategies which - given nearly sure capabilities like nanotech - seem like they should win very quickly and cleanly and without giving humans any chance to react or implement any hypothetical hidden plans to boot up competing superintelligences, we can treat this as a plausible lower bound on how badly we should expect to lose.

(By efficiency: if we can think of an AI strategy for getting the AI's utility, either our probability-model underlying our expectations about that strategy is mistaken, or else the AI should get at least that much of its utility, even though it might actually obtain that utility along a different line of possibility than we could cognitively access.)

The Law of Undignified Failure applied to a case like this suggests that if the operators ignored the Overt Plotting signs, breathed a sigh of relief about the problem "correcting" itself during the Hiding and Biding Phase when the system started faking alignment, and are dumping as much computing power as they can afford (given other profitable activities of the AI) into the AI and allowing it to freely self-experiment, hooking it up to a protein synthesis machine as soon as the AI asks politely, then the Terminal Phase could follow straight from Hiding and Biding, with no intermediate step of copying itself onto the Internet.

There'd still be what technically counted as an Escape Phase, whenever the AI became smart enough that it could escape at any time, or had a backup, or had enough power to manipulate the programmers, that it was no longer effectively confined any time it wanted to leave; there would still be a Doomed Phase while the AI estimated a high probability of success given an immediate action, but held out for an even higher probability.

But the plot of the scifi movie is not robust in the way that the predicted ending is robust; an AI might take other paths to get there, the cognitive tech might not develop in the order I foresee, stuff could happen after other stuff instead of before other stuff. It is hard to make predictions especially about the Future.

If there's a place where I currently see myself as trying to push against my own biases, leading this all to be way off-base, it would be the sense that there is a way for stuff to start happening Earlier than this (Law of Earlier Happenings) and in more chaotic ways that are harder for Eliezer to foresee and predict; whereas when I try to sketch out plausible scenarios in online arguments, they focus more on predictable endpoints and steps to get there which sound more relatively plausible and forced per-step.

Having a young and dumb AI escaping onto the Internet and running around, that exact scenario, feels like the person arguing it is writing a science-fiction plot - but maybe something different can happen before any of this other stuff which produces equal amounts of chaos.

That said, I think an AI has to kill a lot of people very quickly before the FDA considers shortening its vaccine approval times. Covid-19 killed six hundred thousand Americans, albeit more slowly and with time for people to get used to that, and our institutions changed very little in response - you definitely didn't see Congresspeople saying "Okay, that was our warning shot, now we've been told by Nature that we need to prepare for a serious pandemic."

As with 9/11, an AI catastrophe might be taken by existing bureaucracies as a golden opportunity to flex their muscles, dominate a few things, demand an expanded budget. Having that catastrophe produce any particular effective action is a much different ask from Reality. Even if you can imagine some (short-term) effective action that would in principle constitute a flex of bureaucratic muscles or an expansion of government power, it is liable to not be on the efficient frontier of bureaucratic flexes that are most flexy and simultaneously easiest for them to get away with and least politically risky.

[Tallinn][1:26] (Sep. 20 comment)

ok, thanks. i do buy that once the AI is in the “hide and bide” phase, your prophecy has basically come true for practical purposes, regardless of how the rest of the history plays out.

therefore i (and, i hope, many others) would be curious to zoom in to the end of the “overt plotting” (that i can easily see happening within ML models, as its type signature is identical to the work they’re trained to do) and beginning of the “hide and bide” phase (whose type signature feels significantly different) — can you/we think of concrete scenarios for this phase transition?

i hope this would be debated on the main line of the debate!

[Yudkowsky][10:31] (Sep. 25 comment)

Debating it in the main line of the debate would require that @Richard have some different opinion about it, I think?

Concretely, a technically optimistic scenario is that we will see the AI explicitly think, "Oh, dear, my (model of) the operators is looking at my overt actions and coming to conclusions and motivations that are disadvantageous for me." This requires that the AI is leaving any kind of traceable trace in its thoughts, which, it should be noted, is far far beyond the level of transparency we have into GPT-3.

"Overt plotting" manifests in actions that are not themselves further obscured. At the point where the AI is extrapolating the consequences of the operators viewing its actions, if we don't have further transparency into the AI's thoughts directly, the AI might naively make all the plots suddenly vanish, without realizing that dignified operators will conclude from this that the AI has reached the "hide" stage.

Operators on the mainline, dying without dignity, will say, "Oh, yay, it stopped plotting, the latest corrigibility training intervention we tried must've totally worked!"

The Law of Even Less Dignified Failure suggests that in fact they will not be trying any corrigibility options and will assume the AI just got smart enough to be nice; or that they will have shrugged about the AI's earlier antics and not think much of the disappearance of those antics, since this is a way to die with even less dignity and before getting a chance to fail in a more interesting way.

Going in the more improbable direction of death with greater dignity, if we have somehow achieved vastly vastly more transparency into the AI's thoughts than is possible with present ML technology, and if the AI models the operators as modeling its actions before the AI models the operators as having that transparent access to its thoughts, we might get to explicitly see the AI thinking about how the operators model its actions and conforming those actions in such a way as to manipulate the operators.

Either after this point, or before/simultaneous with it, I imagine (taking concrete details and running with them) that the AI thinks, "Oh, no, my operators are modeling my transparent thoughts and reacting to those thoughts in a way I'd rather manipulate," though this indeed requires some rather nontrivial reflectivity.

We might, if everything goes nicely for us, get a record of the AI thinking about how to make its thoughts unalarming.

The obvious undignified death that follows is with the operators going, "Yay, the AI has realized the downsides of thinking this way! Now it shall be nice forever!"

Ways to die with even less dignity before getting to this point could include the operators reading this transparent record and shrugging, though if they're savvy enough to have put all the hard work in required to get transparency and monitor it, I imagine them not making that exact error?

The Law of Surprisingly Undignified Failure does suggest that they will come up with some nonobvious way to fail even earlier that surprises me with its lack of dignity, but having transparency at all into the AI is something where, conditional on us dying with that much dignity in the first place, we might legitimately start to expect a little more dignity in later steps too. This isn't meant to be a game of Multiple Stage Fallacy.

I should also remark somewhere in here: The whole "hide" stage, and also the possibly-later "think non-alarming visible thoughts (once the AI correctly models transparency) (in the unlikely event that transparency exists)" stage, seem liable to occur earlier in the AI's trajectory, if the AI has been previously tasked on problems where there's a favorable success gradient as you model agents modeling other agents.

The blatantly obvious example of this is predicting large corpuses of human-generated text, some of which is generated by humans modeling other humans, or which even explicitly talks about humans modeling other humans. Another route might occur if you asked for very detailed prediction of eg human Go games, and not just winning in self-play on Go, in which case the question "What was the human player modeling the other player as thinking, based on the other player's actions?" might appear.

Aside from contact with humans doing anything the AI benefits from predicting, I imagine an alternate route would be the AI participating in or modeling social-evolutionary games of artificial agents. Then if there is some great gain from having the "hide" stage start later in the AI's trajectory, maybe you could get it to start later by not having the AI need to predict agents modeling other agents, ever.

This is not an infallible ward against general intelligence generalizing there; it just at least avoids actively pushing the AI's intelligence to generalize in that direction. This could be part of a larger complete strategy, which would need to solve a lot of other problems, for building a superhuman engineer that was subhuman at modeling how other agents model its actions.

To be clear, avoiding forcing the AI's intelligence to generalize over agents modeling other agents from their actions, and by this plus other unspecified means successfully building a superhuman engineer that is sub-chimpanzee at modeling other agents reacting to its own actions, does not mean you survive.

Doing that alone, by default, gets you an AI that visibly overtly plots to kill you.

And if you try training the AI out of that habit in a domain of lower complexity and intelligence, it is predicted by me that generalizing that trained AI or subsystem to a domain of sufficiently higher complexity and intelligence, but where you could still actually see overt plots, would show you the AI plotting to kill you again.

If people try this repeatedly with other corrigibility training tricks on the level where plots are easily observable, they will eventually find a try that seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set.

A way to die with less dignity than that is to train directly on what should've been the validation set, the more complicated domain where plots to kill the operators still seem definitely detectable so long as the AI has not developed superhuman hiding abilities.

A way to die with even less dignity is to get bad behavior on the validation set, and proceed anyways.

A way to die with still less dignity is to not have scaling training domains and validation domains for training corrigibility. Because, like, you have not thought of this at all.

I consider all of this obvious as a convergent instrumental strategy for AIs. I could probably have generated it in 2005 or 2010 - if somebody had given me the hypothetical of modern-style AI that had been trained by something like gradient descent or evolutionary methods, into which we lacked strong transparency and strong reassurance-by-code-inspection that this would not happen. I would have told you that this was a bad scenario to get into in the first place, and you should not build an AI like that; but I would also have laid the details, I expect, mostly like they are laid here.

There is no great insight into AI there, nothing that requires knowing about modern discoveries in deep learning, only the ability to model AIs instrumentally-convergently doing things you'd rather they didn't do, at all.

The total absence of obvious output of this kind from the rest of the "AI safety" field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors. Go read fantasy novels about demons and telepathy, if you want a better appreciation of the convergent incentives of agents facing mindreaders than the "AI safety" field outside myself is currently giving you.

Now that I've publicly given this answer, it's no longer useful as a validation set from my own perspective. But it's clear enough that probably nobody was ever going to pass the validation set for generating lines of reasoning obvious enough to be generated by Eliezer in 2010 or possibly 2005. And it is also looking like almost all people in the modern era including EAs are sufficiently intellectually damaged that they won't understand the vast gap between being able to generate ideas like these without prompting, versus being able to recite them back after hearing somebody else say them for the first time; the recital is all they have experience with. Nobody was going to pass my holdout set, so why keep it.

[Tallinn][2:24] (Sep. 26 comment)

Debating it in the main line of the debate would require that @Richard have some different opinion about it, I think?

correct -- and i hope that there's enough surface area in your scenarios for at least some difference in opinions!

re the treacherous turn scenarios: thanks, that's useful. however, it does not seem to address my question and remark (about different type signatures) above. perhaps this is simply an unfairly difficult question, but let me try rephrasing it just in case.

back in the day i got frustrated by smart people dismissing the AI control problem as "anthropomorphising", so i prepared a presentation (https://www.dropbox.com/s/r8oaixb1rj3o3vp/AI-control.pdf?dl=0) that visualised the control problem as exhaustive search in a gridworld over (among other things) the state of the off button. this seems to have worked at least in one prominent case where a renowned GOFAI researcher, after me giving the presentation to him 1-1, went from "control problem is silly anthropomorphising scifi" to "why on earth would you give your AI the incorrect [read: unaligned!] utility function?!?"

(i even seem to remember sending an overly excited email about that event to you and some FHI people :) i also ended up promoting gridworlds as a tool more generally: gwern did some further work, and of course DM -- though i'm not sure if the latter was related to me promoting it.)

my model of what was going on in that GOFAI researcher's brain was that he, like many-many others in AI community, had (and most likely still have) a pattern-matching routine trained to look for "if i were (part of) an AI, here's what i would do" constructs in arguments and, in case of a match, immediately throw an InvalidReasoning.ANTHROPOMORPHISING exception. by making the argument run on a lower level of abstraction (exhaustive search vs "basic AI drives"), i was able to sneak the argument past that gatekeeper.

by that model, i confidently predict that if you presented your treacherous turn scenarios above to AI/ML community members, you'd mostly get the ANTHROPOMORPHISING exceptions -- even from people who would otherwise be able to engage productively (and who we'd like to pay attention!).

now, i can totally see how to translate the "overt plotting" phase to exhaustive search - you simply run the search over the brain states of humans! - but the "hide" phase does not translate straightforwardly, because it seems to require the root search process somehow delegating its work to "obfuscated search in userland" sub-process. furthermore, such obfuscated userland search function needs to be both discoverable/rewarded by the root process and be sensitive to important features of the AI itself (ie, be reflective!) as well as its observers (humans & transparency tools). hence my "different type signature" remark above!

so when i asked for "concrete scenarios of the overt plotting->hide transition", i had something like the following in mind: "gradient descending on the training set, the (stack more layerz) AI ends up discovering and running the following obfuscated search process: ..."!

[Yudkowsky][12:35] (Sep. 26 comment)

Suppose I tried this distinction:

1. Non-imaginative, non-generalizing pseudo-consequentialist: Only repeats behaviors that worked previously, without imagination. Only engages in "overt plotting" if overt plotting was previously given a low loss and narrowly memorized by gradient descent. May only engage in behaviors reinforced over short time horizons if only short time horizons were trained.

2. Imaginative consequentialist, can generalize from actions that worked before to very novel actions that will work in the future for reasons that have not previously been observed to play out. Engages in overt plotting even if no previous overt plots have rewarded; they've plotted a new path through time some distance from those paths taken before, based on the world-model inferred from previous actions and results. (This may conceivably be the result of gradient descent on sufficiently deep and abstract patterns that "worked before".)

3. Imaginative consequentialist with long time horizons, engages in overt plots meant to span more than one round of a game.

4. "Hiding" over overt actions, via an imaginative consequentialist that has deduced the existence of humans and how the humans will react to the signature of plotting in overt actions.

5. "Hiding" of thoughts, in a system with significant transparency, via an imaginative and reflective consequentialist that has deduced the existence of the transparency system and models how humans will react to thoughts exposed by the transparency system.

The sharpest type gradients I see lie between 1-2 and 4-5. The passage from 3-4 seems nearly trivial by comparison.

Now, one implication here is that you don't easily get to "obscured thoughts", via the route that goes from 1 to 5 above, by gradient descent on a pre-imaginative pseudo-consequentialist!

Unfortunately there is of course a much simpler route that gets you to the same place: Finding a way to detect bad thoughts and doing a gradient update against them, where the pattern learned by that gradient update will (I suspect for reasons I tried to cover earlier in the debate) generalize quite shallowly; and simultaneously doing training which is deep enough to uncover deep patterns of consequentialism and imaginative reasoning.

Then when the system generalizes well enough to solve domains like "build a nanosystem" - which, I strongly suspect, can't be solved without imaginative reasoning because we can't afford to simulate that domain perfectly and do a trillion gradient descent updates on simulated attempts - the kind of actions of thoughts you can detect as bad, that might have provided earlier warning, were trained out of the system by gradient descent; leaving actions and thoughts you can't detect as bad.

An even less dignified path to death is that there's no need for the system to advance to stage 5 because your transparency tools are a nearly nonexistent fig leaf of pretty visualization tools that have no ability even on the surface of things to tell if the AI is internally plotting to kill you.

Does this get at all about what you meant about a type distinction?
Does hearing this help at all the AI researcher who goes "ANTHROPOMORPHISM"? My straw model of them is that they simply can't imagine imaginative / generalizing systems because they haven't seen one except in humans, hence, ANTHROPOMORPHISM.

[Tallinn][5:05] (Sep. 27 comment)

ok, here's how i understood things:

1. this is something like model-free RL agent. check.

2. sounds like, eg, monte-carlo tree search (MCTS) on a world model. check. (a propos your straw model of ML people, i don't think the ML people would have much trouble when you ask them to "imagine an MCTS 'imagining' how futures might unfold" -- yet they will throw the exception and brush you off if you ask them to "imagine an imaginative consequentialist")

3. yeah, sufficiently deep MCTS, assuming it has its state (sufficiently!) persisted between rounds. check.

4. yup, MCTS whose world model includes humans in sufficient resolution. check. i also buy your undignified doom scenarios, where one (cough*google*cough) simply ignores the plotting, or penalises the overt plotting until it disappears under the threshold of the error function.

5. hmm.. here i'm running into trouble (type mismatch error) again. i can imagine this in abstract (and perhaps incorrectly/anthropomorphisingly!), but would - at this stage - fail to code up anything like a gridworlds example. more research needed (TM) i guess :)

[Yudkowsky][11:38] (Sep. 27 comment)

2 - yep, Mu Zero is an imaginative consequentialist in this sense, though Mu Zero doesn't generalize its models much as I understand it, and might need to see something happen in a relatively narrow sense before it could chart paths through time along that pathway.

5 - you're plausibly understanding this correctly, then, this is legit a lot harder to spec a gridworld example for (relative to my own present state of knowledge).

(This is politics and thus not my forte, but if speaking to real-world straw ML people, I'd suggest skipping the whole notion of stage 5 and trying instead to ask "What if the present state of transparency continues?")

[Yudkowsky][11:13] (Sep. 18 comment)

the discontinuity would be extremely drastic, as in “transforming the solar system over the course of a few days”

Applies after superintelligence, not necessarily during the start of the climb to superintelligence, not necessarily to a rapid-cognitive-scaling regime.

[Tallinn][4:11] (Sep. 19 comment)

ok, but as per your comment re "slow doom", you expect the latter to also last in the order of days/weeks not months/years?

[Yudkowsky][10:01] (Sep. 19 comment)

I don't expect "the fall" to take years; I feel pretty on board with "the slide" taking months or maybe even a couple of years. If "the slide" supposedly takes much longer, I wonder why better-scaling tech hasn't come over and started a new slide.

Definitions also seem kinda loose here - if all hell broke loose Tuesday, a gradualist could dodge falsification by defining retroactively that "the slide" started in 2011 with Deepmind. If we go by the notion of AI-driven faster GDP growth, we can definitely say "the slide" in AI economic outputs didn't start in 2011; but if we define it that way, then a long slow slide in AI capabilities can easily correspond to an extremely sharp gradient in AI outputs, where the world economy doesn't double any faster until one day paperclips, even though there were capability precursors like GPT-3 or Mu Zero.

[Tallinn] (Sep. 18 Google Doc)

exhibit A for the pit is “humans vs chimps”: evolution seems to have taken domain-specific “banana classifiers”, tweaked them slightly, and BAM, next thing there are rovers on mars;
- i pretty much buy this argument;
- however, i’m confused about a) why humans remained stuck at the edge of the pit, rather than falling further into it, and b) what’s the exact role of culture in our cognition: eliezer likes to point out how barely functional we are (both individually and collectively as a civilisation), and explained feral children losing the generality sauce by, basically, culture being the domain we’re specialised for (IIRC, can’t quickly find the quote);
- relatedly, i’m confused about the human range of intelligence: on the one hand, the “village idiot is indistinguishable from einstein in the grand scheme of things” seems compelling; on the other hand, it took AI decades to traverse human capability range in board games, and von neumann seems to have been out of this world (yet did not take over the world)!
- intelligence augmentation would blur the human range even further.

[Yudkowsky][11:23] (Sep. 18 comment)

why humans remained stuck at the edge of the pit, rather than falling further into it

Depending on timescales, the answer is either "Because humans didn't get high enough out of the atmosphere to make further progress easy, before the scaling regime and/or fitness gradients ran out", "Because people who do things like invent Science have a hard time capturing most of the economic value they create by nudging humanity a little bit further into the attractor", or "That's exactly what us sparking off AGI looks like."

[Tallinn][4:41] (Sep. 19 comment)

yeah, this question would benefit from being made more concrete, but culture/mindbuilding aren't making this task easy. what i'm roughly gesturing at is that i can imagine a much sharper edge where evolution could do most of the FOOM-work, rather than spinning its wheels for ~100k years while waiting for humans to accumulate cultural knowledge required to build de-novo minds.

[Yudkowsky][10:49] (Sep. 19 comment)

I roughly agree (at least, with what I think you said). The fact that it is imaginable that evolution failed to develop ultra-useful AGI-prerequisites due to lack of evolutionary incentive to follow the intermediate path there (unlike wise humans who, it seems, can usually predict which technology intermediates will yield great economic benefit, and who have a great historical record of quickly making early massive investments in tech like that, but I digress) doesn't change the point that we might sorta have expected evolution to run across it anyways? Like, if we're not ignoring what reality says, it is at least delivering to us something of a hint or a gentle caution?

That said, intermediates like GPT-3 have genuinely come along, with obvious attached certificates of why evolution could not possibly have done that. If no intermediates were accessible to evolution, the Law of Stuff Happening Earlier still tends to suggest that if there are a bunch of non-evolutionary ways to make stuff happen earlier, one of those will show up and interrupt before the evolutionary discovery gets replicated. (Again, you could see Mu Zero as an instance of this - albeit not, as yet, an economically impactful one.)

[Tallinn][0:30] (Sep. 20 comment)

no, i was saying something else (i think; i’m somewhat confused by your reply). let me rephrase: evolution would love superintelligences whose utility function simply counts their instantiations! so of course evolution did not lack the motivation to keep going down the slide. it just got stuck there (for at least ten thousand human generations, possibly and counterfactually for much-much longer). moreover, non evolutionary AI’s also getting stuck on the slide (for years if not decades; median group folks would argue centuries) provides independent evidence that the slide is not too steep (though, like i said, there are many confounders in this model and little to no guarantees).

[Yudkowsky][11:24] (Sep. 18 comment)

on the other hand, it took AI decades to traverse human capability range in board games

I see this as the #1 argument for what I would consider "relatively slow" takeoffs - that AlphaGo did lose one game to Lee Se-dol.

[Tallinn][4:43] (Sep. 19 comment)

cool! yeah, i was also rather impressed by this observation by katja & paul

[Tallinn] (Sep. 18 Google Doc)

eliezer also submits alphago/zero/fold as evidence for the discontinuity hypothesis;
- i’m very confused re alphago/zero, as paul uses them as evidence for the continuity hypothesis (i find paul/miles’ position more plausible here, as allegedly metrics like ELO ended up mostly continuous).

[Yudkowsky][11:27] (Sep. 18 comment)

allegedly metrics like ELO ended up mostly continuous

I find this suspicious - why did superforecasters put only a 20% probability on AlphaGo beating Se-dol, if it was so predictable? Where were all the forecasters calling for Go to fall in the next couple of years, if the metrics were pointing there and AlphaGo was straight on track? This doesn't sound like the experienced history I remember.

Now it could be that my memory is wrong and lots of people were saying this and I didn't hear. It could be that the lesson is, "You've got to look closely to notice oncoming trains on graphs because most people's experience of the field will be that people go on whistling about how something is a decade away while the graphs are showing it coming in 2 years."

But my suspicion is mainly that there is fudge factor in the graphs or people going back and looking more carefully for intermediate data points that weren't topics of popular discussion at the time, or something, which causes the graphs in history books to look so much smoother and neater than the graphs that people produce in advance.

[Tallinn] (Sep. 18 Google Doc)

FWIW, myself i’ve labelled the above scenario as “doom via AI lab accident” – and i continue to consider it more likely than the alternative doom scenarios, though not anywhere as confidently as eliezer seems to (most of my “modesty” coming from my confusion about culture and human intelligence range).

in that context, i found eliezer’s “world will be ended by an explicitly AGI project” comment interesting – and perhaps worth double-clicking on.

i don’t understand paul’s counter-argument that the pit was only disruptive because evolution was not trying to hit it (in the way ML community is): in my flippant view, driving fast towards the cliff is not going to cushion your fall!

[Yudkowsky][11:35] (Sep. 18 comment)

i don’t understand paul’s counter-argument that the pit was only disruptive because evolution was not trying to hit it

Something like, "Evolution constructed a jet engine by accident because it wasn't particularly trying for high-speed flying and ran across a sophisticated organism that could be repurposed to a jet engine with a few alterations; a human industry would be gaining economic benefits from speed, so it would build unsophisticated propeller planes before sophisticated jet engines." It probably sounds more convincing if you start out with a very high prior against rapid scaling / discontinuity, such that any explanation of how that could be true based on an unseen feature of the cognitive landscape which would have been unobserved one way or the other during human evolution, sounds more like it's explaining something that ought to be true.

And why didn't evolution build propeller planes? Well, there'd be economic benefit from them to human manufacturers, but no fitness benefit from them to organisms, I suppose? Or no intermediate path leading to there, only an intermediate path leading to the actual jet engines observed.

I actually buy a weak version of the propeller-plane thesis based on my inside-view cognitive guesses (without particular faith in them as sure things), eg, GPT-3 is a paper airplane right there, and it's clear enough why biology could not have accessed GPT-3. But even conditional on this being true, I do not have the further particular faith that you can use propeller planes to double world GDP in 4 years, on a planet already containing jet engines, whose economy is mainly bottlenecked by the likes of the FDA rather than by vaccine invention times, before the propeller airplanes get scaled to jet airplanes.

The part where the whole line of reasoning gets to end with "And so we get huge, institution-reshaping amounts of economic progress before AGI is allowed to kill us!" is one that doesn't feel particular attractored to me, and so I'm not constantly checking my reasoning at every point to make sure it ends up there, and so it doesn't end up there.

[Tallinn][4:46] (Sep. 19 comment)

yeah, i'm mostly dismissive of hypotheses that contain phrases like "by accident" -- though this also makes me suspect that you're not steelmanning paul's argument.

[Tallinn] (Sep. 18 Google Doc)

the human genetic bottleneck (ie, humans needing to be general in order to retrain every individual from scratch) argument was interesting – i’d be curious about further exploration of its implications.

it does not feel much of a moat, given that AI techniques like dropout already exploit similar principle, but perhaps could be made into one.

[Yudkowsky][11:40] (Sep. 18 comment)

it does not feel much of a moat, given that AI techniques like dropout already exploit similar principle, but perhaps could be made into one

What's a "moat" in this connection? What does it mean to make something into one? A Thielian moat is something that humans would either possess or not, relative to AI competition, so how would you make one if there wasn't already one there? Or do you mean that if we wrestled with the theory, perhaps we'd be able to see a moat that was already there?

[Tallinn][4:51] (Sep. 19 comment)

this wasn't a very important point, but, sure: what i meant was that genetic bottleneck very plausibly makes humans more universal than systems without (something like) it. it's not much of a protection as AI developers have already discovered such techniques (eg, dropout) -- but perhaps some safety techniques might be able to lean on this observation.

[Yudkowsky][11:01] (Sep. 19 comment)

I think there's a whole Scheme for Alignment which hopes for a miracle along the lines of, "Well, we're dealing with these enormous matrices instead of tiny genomes, so maybe we can build a sufficiently powerful intelligence to execute a pivotal act, whose tendency to generalize across domains is less than the corresponding human tendency, and this brings the difficulty of producing corrigibility into practical reach."

Though, people who are hopeful about this without trying to imagine possible difficulties will predictably end up too hopeful; one must also ask oneself, "Okay, but then it's also worse at generalizing the corrigibility dataset from weak domains we can safely label to powerful domains where the label is 'whoops that killed us'?" and "Are we relying on massive datasets to overcome poor generalization? How do you get those for something like nanoengineering where the real world is too expensive to simulate?"

[Tallinn] (Sep. 18 Google Doc)

nature of the descent

conversely, it feels to me that the crucial position in the other (richard, paul, many others) camp is something like:

the “pit of generality” model might be true at the limit, but the descent will not be quick nor clean, and will likely offer many opportunities for steering the future.

[Yudkowsky][11:41] (Sep. 18 comment)

the “pit of generality” model might be true at the limit, but the descent will not be quick nor clean

I'm quite often on board with things not being quick or clean - that sounds like something you might read in a history book, and I am all about trying to make futuristic predictions sound more like history books and less like EAs imagining ways for everything to go the way an EA would do them.

It won't be slow and messy once we're out of the atmosphere, my models do say. But my models at least permit - though they do not desperately, loudly insist - that we could end up with weird half-able AGIs affecting the Earth for an extended period.

Mostly my model throws up its hands about being able to predict exact details here, given that eg I wasn't able to time AlphaFold 2's arrival 5 years in advance; it might be knowable in principle, it might be the sort of thing that would be very predictable if we'd watched it happen on a dozen other planets, but in practice I have not seen people having much luck in predicting which tasks will become accessible due to future AI advances being able to do new cognition.

The main part where I issue corrections is when I see EAs doing the equivalent of reasoning, "And then, when the pandemic hits, it will only take a day to design a vaccine, after which distribution can begin right away." I.e., what seems to me to be a pollyannaish/utopian view of how much the world economy would immediately accept AI inputs into core manufacturing cycles, as opposed to just selling AI anime companions that don't pour steel in turn. I predict much more absence of quick and clean when it comes to economies adopting AI tech, than when it comes to laboratories building the next prototypes of that tech.

[Yudkowsky][11:43] (Sep. 18 comment)

will likely offer many opportunities for steering the future

Ah, see, that part sounds less like history books. "Though many predicted disaster, subsequent events were actually so slow and messy, they offered many chances for well-intentioned people to steer the outcome and everything turned out great!" does not sound like any particular segment of history book I can recall offhand.

[Tallinn][4:53] (Sep. 19 comment)

ok, yeah, this puts the burden of proof on the other side indeed

[Tallinn] (Sep. 18 Google Doc)

i’m sympathetic (but don’t buy outright, given my uncertainty) to eliezer’s point that even if that’s true, we have no plan nor hope for actually steering things (via “pivotal acts”) so “who cares, we still die”;
i’m also sympathetic that GWP might be too laggy a metric to measure the descent, but i don’t fully buy that regulations/bureaucracy can guarantee its decoupling from AI progress: eg, the FDA-like-structures-as-progress-bottlenecks model predicts worldwide covid response well, but wouldn’t cover things like apple under jobs, tesla/spacex under musk, or china under deng xiaoping;

[Yudkowsky][11:51] (Sep. 18 comment)

apple under jobs, tesla/spacex under musk, or china under deng xiaoping

A lot of these examples took place over longer than a 4-year cycle time, and not all of that time was spent waiting on inputs from cognitive processes.

[Tallinn][5:07] (Sep. 19 comment)

yeah, fair (i actually looked up china's GDP curve in deng era before writing this -- indeed, wasn't very exciting). still, my inside view is that there are people and organisations for whom US-type bureaucracy is not going to be much of an obstacle.

[Yudkowsky][11:09] (Sep. 19 comment)

I have a (separately explainable, larger) view where the economy contains a core of positive feedback cycles - better steel produces better machines that can farm more land that can feed more steelmakers - and also some products that, as much as they contribute to human utility, do not in quite the same way feed back into the core production cycles.

If you go back in time to the middle ages and sell them, say, synthetic gemstones, then - even though they might be willing to pay a bunch of GDP for that, even if gemstones are enough of a monetary good or they have enough production slack that measured GDP actually goes up - you have not quite contributed to steps of their economy's core production cycles in a way that boosts the planet over time, the way it would be boosted if you showed them cheaper techniques for making iron and new forms of steel.

There are people and organizations who will figure out how to sell AI anime waifus without that being successfully regulated, but it's not obvious to me that AI anime waifus feed back into core production cycles.

When it comes to core production cycles the current world has more issues that look like "No matter what technology you have, it doesn't let you build a house" and places for the larger production cycle to potentially be bottlenecked or interrupted.

I suspect that the main economic response to this is that entrepreneurs chase the 140 characters instead of the flying cars - people will gravitate to places where they can sell non-core AI goods for lots of money, rather than tackling the challenge of finding an excess demand in core production cycles which it is legal to meet via AI.

Even if some tackle core production cycles, it's going to take them a lot longer to get people to buy their newfangled gadgets than it's going to take to sell AI anime waifus; the world may very well end while they're trying to land their first big contract for letting an AI lay bricks.

[Tallinn][0:00] (Sep. 20 comment)

interesting. my model of paul (and robin, of course) wants to respond here but i’m not sure how :)

[Tallinn] (Sep. 18 Google Doc)

still, developing a better model of the descent period seems very worthwhile, as it might offer opportunities for, using robin’s metaphor, “pulling the rope sideways” in non-obvious ways – i understand that is part of the purpose of the debate;
my natural instinct here is to itch for carl’s viewpoint 😊

[Yudkowsky][11:52] (Sep. 18 comment)

developing a better model of the descent period seems very worthwhile

I'd love to have a better model of the descent. What I think this looks like is people mostly with specialization in econ and politics, who know what history books sound like, taking brief inputs from more AI-oriented folk in the form of multiple scenario premises each consisting of some random-seeming handful of new AI capabilities, trying to roleplay realistically how those might play out - not AIfolk forecasting particular AI capabilities exactly correctly, and then sketching pollyanna pictures of how they'd be immediately accepted into the world economy.

You want the forecasting done by the kind of person who would imagine a Covid-19 epidemic and say, "Well, what if the CDC and FDA banned hospitals from doing Covid testing?" and not "Let's imagine how protein folding tech from AlphaFold would make it possible to immediately develop accurate Covid-19 tests!" They need to be people who understand the Law of Earlier Failure (less polite terms: Law of Immediate Failure, Law of Undignified Failure).

[Tallinn][5:13] (Sep. 19 comment)

great! to me this sounds like something FLI would be in good position to organise. i'll add this to my projects list (probably would want to see the results of this debate first, plus wait for travel restrictions to ease)

[Tallinn] (Sep. 18 Google Doc)

nature of cognition

given that having a better understanding of cognition can help with both understanding the topology of cognitive systems space as well as likely trajectories of AI takeoff, in theory there should be a lot of value in debating what cognition is (the current debate started with discussing consequentialists).

however, i didn’t feel that there was much progress, and i found myself more confused as a result (which i guess is a form of progress!);
eg, take the term “plan” that was used in the debate (and, centrally, in nate’s comments doc): i interpret it as “policy produced by a consequentialist” – however, now i’m confused about what’s the relevant distinction between “policies” and “cognitive processes” (ie, what’s a meta level classifier that can sort algorithms into such categories);
- it felt that abram’s “selection vs control” article tried to distinguish along similar axis (controllers feel synonym-ish to “policy instantiations” to me);
- also, the “imperative vs functional” difference in coding seems relevant;
- i’m further confused by human “policies” often making function calls to “cognitive processes” – suggesting some kind of duality, rather than producer-product relationship.

[Yudkowsky][12:06] (Sep. 18 comment)

what’s the relevant distinction between “policies” and “cognitive processes”

What in particular about this matters? To me they sound like points on a spectrum, and not obviously points that it's particularly important to distinguish on that spectrum. A sufficiently sophisticated policy is itself an engine; human-engines are genetic policies.

[Tallinn][5:18] (Sep. 19 comment)

well, i'm not sure -- just that nate's "The consequentialism is in the plan, not the cognition" writeup sort of made it sound like the distinction is important. again, i'm confused

[Yudkowsky][11:11] (Sep. 19 comment)

Does it help if I say "consequentialism can be visible in the actual path through time, not the intent behind the output"?

[Tallinn][0:06] (Sep. 20 comment)

yeah, well, my initial interpretation of nate’s point was, indeed, “you can look at the product and conclude the consequentialist-bit for the producer”. but then i noticed that the producer-and-product metaphor is leaky (due to the cognition-policy duality/spectrum), so the quoted sentence gives me a compile error

[Tallinn] (Sep. 18 Google Doc)

is “not goal oriented cognition” an oxymoron?

[Yudkowsky][12:06] (Sep. 18 comment)

is “not goal oriented cognition” an oxymoron?

"Non-goal-oriented cognition" never becomes a perfect oxymoron, but the more you understand cognition, the weirder it sounds.

Eg, at the very shallow level, you've got people coming in going, "Today I just messed around and didn't do any goal-oriented cognition at all!" People who get a bit further in may start to ask, "A non-goal-oriented cognitive engine? How did it come into existence? Was it also not built by optimization? Are we, perhaps, postulating a naturally-occurring Solomonoff inductor rather than an evolved one? Or do you mean that its content is very heavily designed and the output of a consequentialist process that was steering the future conditional on that design existing, but the cognitive engine is itself not doing consequentialism beyond that? If so, I'll readily concede that, say, a pocket calculator, is doing a kind of work that is not of itself consequentialist - though it might be used by a consequentialist - but as you start to postulate any big cognitive task up at the human level, it's going to require many cognitive subtasks to perform, and some of those will definitely be searching the preimages of large complicated functions."

[Tallinn] (Sep. 18 Google Doc)

i did not understand eliezer’s “time machine” metaphor: was it meant to point to / intuition pump something other than “a non-embedded exhaustive searcher with perfect information” (usually referred to as “god mode”);

[Yudkowsky][11:59] (Sep. 18 comment)

a non-embedded exhaustive searcher with perfect information

If you can view things on this level of abstraction, you're probably not the audience who needs to be told about time machines; if things sounded very simple to you, they probably were; if you wondered what the fuss is about, you probably don't need to fuss? The intended audience for the time-machine metaphor, from my perspective, is people who paint a cognitive system slightly different colors and go "Well, now it's not a consequentialist, right?" and part of my attempt to snap them out of that is me going, "Here is an example of a purely material system which DOES NOT THINK AT ALL and is an extremely pure consequentialist."

[Tallinn] (Sep. 18 Google Doc)

FWIW, my model of dario would dispute GPT characterisation as “shallow pattern memoriser (that’s lacking the core of cognition)”.

[Yudkowsky][12:00] (Sep. 18 comment)

dispute

Any particular predicted content of the dispute, or does your model of Dario just find something to dispute about it?

[Tallinn][5:34] (Sep. 19 comment)

sure, i'm pretty confident that his system 1 could be triggered for uninteresting reasons here, but that's of course not what i had in mind.

my model of untriggered-dario disputes that there's a qualitative difference between (in your terminology) "core of reasoning" and "shallow pattern matching" -- instead, it's "pattern matching all the way up the ladder of abstraction". in other words, GPT is not missing anything fundamental, it's just underpowered in the literal sense.

[Yudkowsky][11:13] (Sep. 19 comment)

Neither Anthropic in general, nor Deepmind in general, has reached the stage of trusted relationship where I would argue specifics with them if I thought they were wrong about a thesis like that.

[Tallinn][0:10] (Sep. 20 comment)

yup, i didn’t expect you to!

7.2. Nate Soares's summary

[Soares][16:40] (Sep. 18)

I, too, have produced some notes: [GDocs link]. This time I attempt to drive home points that I saw Richard as attempting to make, and I'm eager for Richard-feedback especially. (I'm also interested in Eliezer-commentary.)

[Soares] (Sep. 18 Google Doc)

Sorry for not making more insistence that the discussion be more concrete, despite Eliezer's requests.

My sense of the last round is mainly that Richard was attempting to make a few points that didn't quite land, and/or that Eliezer didn't quite hit head-on. My attempts to articulate it are below.

---

There's a specific sense in which Eliezer seems quite confident about certain aspects of the future, for reasons that don't yet feel explicit.

It's not quite about the deep future -- it's clear enough (to my Richard-model) why it's easier to make predictions about AIs that have "left the atmosphere".

And it's not quite the near future -- Eliezer has reiterated that his models permit (though do not demand) a period of weird and socially-impactful AI systems "pre-superintelligence".

It's about the middle future -- the part where Eliezer's model, apparently confidently, predicts that there's something kinda like a discrete event wherein "scary" AI has finally been created; and the model further apparently-confidently predicts that, when that happens, the "scary"-caliber systems will be able to attain a decisive strategic advantage over the rest of the world.

I think there's been a dynamic in play where Richard attempts to probe this apparent confidence, and a bunch of the probes keep slipping off to one side or another. (I had a bit of a similar sense when Paul joined the chat, also.)

For instance, I see queries of the form "but why not expect systems that are half as scary, relevantly before we see the scary systems?" as attempts to probe this confidence, that "slip off" with Eliezer-answers like "my model permits weird not-really-general half-AI hanging around for a while in the runup". Which, sure, that's good to know. But there's still something implicit in that story, where these are not-really-general half-AIs. Which is also evidenced when Eliezer talks about the "general core" of intelligence.

And the things Eliezer was saying on consequentialism aren't irrelevant here, but those probes have kinda slipped off the far side of the confidence, if I understand correctly. Like, sure, late-stage sovereign-level superintelligences are epistemically and instrumentally efficient with respect to you (unless someone put in a hell of a lot of work to install a blindspot), and a bunch of that coherence filters in earlier, but there's still a question about how much of it has filtered down how far, where Eliezer seems to have a fairly confident take, informing his apparently-confident prediction about scary AI systems hitting the world in a discrete event like a hammer.

(And my Eliezer-model is at this point saying "at this juncture we need to have discussions about more concrete scenarios; a bunch of the confidence that I have there comes from the way that the concrete visualizations where scary AI hits the world like a hammer abound, and feel savvy and historical, whereas the concrete visualizations where it doesn't are fewer and seem full of wishful thinking and naivete".)

But anyway, yeah, my read is that Richard (and various others) have been trying to figure out why Eliezer is so confident about some specific thing in this vicinity, and haven't quite felt like they've been getting explanations.

Here's an attempt to gesture at some claims that I at least think Richard thinks Eliezer's confident in, but that Richard doesn't believe have been explicitly supported:

1. There's a qualitative difference between the AI systems that are capable of ending the acute risk period (one way or another), and predecessor systems that in some sense don't much matter.

2. That qualitative gap will be bridged "the day after tomorrow", ie in a world that looks more like "DeepMind is on the brink" and less like "everyone is an order of magnitude richer, and the major gov'ts all have AGI projects, around which much of public policy is centered".

---

That's the main thing I wanted to say here.

A subsidiary point that I think Richard was trying to make, but that didn't quite connect, follows.

I think Richard was trying to probe Eliezer's concept of consequentialism to see if it supported the aforementioned confidence. (Some evidence: Richard pointing out a couple times that the question is not whether sufficiently capable agents are coherent, but whether the agents that matter are relevantly coherent. On my current picture, this is another attempt to probe the "why do you think there's a qualitative gap, and that straddling it will be strategically key in practice?" thing, that slipped off.)

My attempt at sharpening the point I saw Richard as driving at:

Consider the following two competing hypotheses:
1. There's this "deeply general" core to intelligence, that will be strategically important in practice
2. Nope. Either there's no such core, or practical human systems won't find it, or the strategically important stuff happens before you get there (if you're doing your job right, in a way that natural selection wasn't), or etc.
The whole deep learning paradigm, and the existence of GPT, sure seem like they're evidence for (b) over (a).

Like, (a) maybe isn't dead, but it didn't concentrate as much mass into the present scenario.
It seems like perhaps a bunch of Eliezer's confidence comes from a claim like "anything capable of doing decently good work, is quite close to being scary", related to his concept of "consequentialism".

In particular, this is a much stronger claim than that sufficiently smart systems are coherent, b/c it has to be strong enough to apply to the dumbest system that can make a difference.
It's easy to get caught up in the elegance of a theory like consequentialism / utility theory, when it will not in fact apply in practice.
There are some theories so general and ubiquitous that it's a little tricky to misapply them -- like, say, conservation of momentum, which has some very particular form in the symmetry of physical laws, but which can also be used willy-nilly on large objects like tennis balls and trains (although even then, you have to be careful, b/c the real world is full of things like planets that you're kicking off against, and if you forget how that shifts the earth, your application of conservation of momentum might lead you astray).
The theories that you can apply everywhere with abandon, tend to have a bunch of surprising applications to surprising domains.
We don't see that of consequentialism.

For the record, my guess is that Eliezer isn't getting his confidence in things like "there are non-scary systems and scary-systems, and anything capable of saving our skins is likely scary-adjacent" by the sheer force of his consequentialism concept, in a manner that puts so much weight on it that it needs to meet this higher standard of evidence Richard was poking around for. (Also, I could be misreading Richard's poking entirely.)

In particular, I suspect this was the source of some of the early tension, where Eliezer was saying something like "the fact that humans go around doing something vaguely like weighting outcomes by possibility and also by attractiveness, which they then roughly multiply, is quite sufficient evidence for my purposes, as one who does not pay tribute to the gods of modesty", while Richard protested something more like "but aren't you trying to use your concept to carry a whole lot more weight than that amount of evidence supports?". cf my above points about some things Eliezer is apparently confident in, for which the reasons have not yet been stated explicitly to my Richard-model's satisfaction.

And, ofc, at this point, my Eliezer-model is again saying "This is why we should be discussing things concretely! It is quite telling that all the plans we can concretely visualize for saving our skins, are scary-adjacent; and all the non-scary plans, can't save our skins!"

To which my Richard-model answers "But your concrete visualizations assume the endgame happens the day after tomorrow, at least politically. The future tends to go sideways! The endgame will likely happen in an environment quite different from our own! These day-after-tomorrow visualizations don't feel like they teach me much, because I think there's a good chance that the endgame-world looks dramatically different."

To which my Eliezer-model replies "Indeed, the future tends to go sideways. But I observe that the imagined changes, that I have heard so far, seem quite positive -- the relevant political actors become AI-savvy, the major states start coordinating, etc. I am quite suspicious of these sorts of visualizations, and would take them much more seriously if there was at least as much representation of outcomes as realistic as "then Trump becomes president" or "then at-home covid tests are banned in the US". And if all the ways to save the world today are scary-adjacent, the fact that the future is surprising gives us no specific reason to hope for that particular parameter to favorably change when the future in fact goes sideways. When things look grim, one can and should prepare to take advantage of miracles, but banking on some particular miracle is foolish."

And my Richard-model gets fuzzy at this point, but I'd personally be pretty enthusiastic about Richard naming a bunch of specific scenarios, not as predictions, but as the sorts of visualizations that seem to him promising, in the hopes of getting a much more object-level sense of why, in specific concrete scenarios, they either have the properties Eliezer is confident in, or are implausible on Eliezer's model (or surprise Eliezer and cause him to update).

[Tallinn][0:06] (Sep. 19)

excellent summary, nate! it also tracks my model of the debate well and summarises the frontier concisely (much better than your earlier notes or mine). unless eliezer or richard find major bugs in your summary, i’d nominate you to iterate after the next round of debate

[Soares: ❤️]

7.3. Richard Ngo's summary

[Ngo][1:48] (Sep. 20)

Updated my summary to include the third discussion: [https://docs.google.com/document/d/1sr5YchErvSAY2I4EkJl2dapHcMp8oCXy7g8hd_UaJVw/edit]

I'm also halfway through a document giving my own account of intelligence + specific safe scenarios.

[Soares: 😄]

The total absence of obvious output of this kind from the rest of the "AI safety" field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors.

I am very confused by this comment. Everything discussed upwards of it seems to me like relatively mundane AI safety stuff? For example here I wrote about why generalization failures will be persistent, and ofc distribution shifts are widely discussed and the connection between distribution shifts and daemons / inner misalignment is also fairly well known.

I don't know Eliezer's view on this — presumably he either disagrees that the example he gave is "mundane AI safety stuff", or he disagrees that "mundane AI safety stuff" is widespread? I'll note that you're a MIRI research associate, so I wouldn't have auto-assumed your stuff is representative of the stuff Eliezer is criticizing.

Safety Interruptible Agents is an example Eliezer's given in the past of work that isn't "real" (back in 2017):

[...]
It seems to me that I've watched organizations like OpenPhil try to sponsor academics to work on AI alignment, and it seems to me that they just can't produce what I'd consider to be real work. The journal paper that Stuart Armstrong coauthored on "interruptibility" is a far step down from Armstrong's other work on corrigibility. It had to be dumbed way down (I'm counting obscuration with fancy equations and math results as "dumbing down") to be published in a mainstream journal. It had to be stripped of all the caveats and any mention of explicit incompleteness, which is necessary meta-information for any ongoing incremental progress, not to mention important from a safety standpoint. The root cause can be debated but the observable seems plain. If you want to get real work done, the obvious strategy would be to not subject yourself to any academic incentives or bureaucratic processes. Particularly including peer review by non-"hobbyists" (peer commentary by fellow "hobbyists" still being potentially very valuable), or review by grant committees staffed by the sort of people who are still impressed by academic sage-costuming and will want you to compete against pointlessly obscured but terribly serious-looking equations.
[...]

The rest of Intellectual Progress Inside and Outside Academia may be useful context. Or maybe this is also not a representative example of the stuff EY has in mind in the OP conversation?

I'll note that you're a MIRI research associate, so I wouldn't have auto-assumed your stuff is representative of the stuff Eliezer is criticizing.

There is ample discussion of distribution shifts ("seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set") by other people. Random examples: Christiano, Shah, DeepMind.

Maybe Eliezer is talking specifically about the context of transparency. Personally, I haven't worked much on transparency because IMO (i) even if we solve transparency perfectly but don't solve actual alignment, we are still dead, (ii) if we solve actual alignment without transparency, then theoretically we might succeed (although in practice it would sure help a lot to have transparency to catch errors in time) and (iii) there are less strong reasons to think transparency must be robustly solvable compared to reasons to think alignment must be robustly solvable.

In any case, I really don't understand why Eliezer thinks the rest of AI safety are unaware of the type of attack vectors he describes.

The journal paper that Stuart Armstrong coauthored on "interruptibility" is a far step down from Armstrong's other work on corrigibility. It had to be dumbed way down (I'm counting obscuration with fancy equations and math results as "dumbing down") to be published in a mainstream journal.

I agree that currently publishing in mainstream venues seems to require dumbing down, but IMO we should proceed by publishing dumbed-down versions in the mainstream + smarted-up versions/commentary in our own venues. And, not all of AI safety is focused on publishing in mainstream venues? There is plenty of stuff on the alignment forum, on various blogs etc.

Overall I actually agree that lots of work by the AI safety community is unimpressive (tbh I wish MIRI would lead by example instead of going stealth-mode, but maybe I don't understand the considerations). What I'm confused by is the particular example in the OP. I also dunno about "fancy equations and math results", I feel like the field would benefit from getting a lot more mathy (ofc in meaningful ways rather than just using mathematical notation as decoration).

Of course there has been lots of 'obvious output of this kind from the rest of the "AI safety" field'. It is not like people have been quiet about convergent instrumental goals. So what is going on here?

I read this line (and the paragraphs that follow it) as Eliezer talking smack about all other AI safety researchers. As observed by Paul here:

Eliezer frequently talks smack about how the real world is surprising to fools like Paul

I liked some of Eliezer's earlier, more thoughtful writing better.

The total absence of obvious output of this kind from the rest of the "AI safety" field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors. Go read fantasy novels about demons and telepathy, if you want a better appreciation of the convergent incentives of agents facing mindreaders than the "AI safety" field outside myself is currently giving you.

While this this may be a fair criticism, I feel like someone ought to point out that the vast majority of AI safety output (at least that I see on LW) isn't trying to do anything like "sketch a probability distribution over the dynamics of an AI project that is nearing AGI". This includes all technical MIRI papers I'm familiar with.

Perhaps we should be doing this (though, isn't that more for AI forecasting/strategy rather than alignment? Of course still AI safety), but then the failure isn't "no-one has enough security mindset" but rather something like "no-one has the social courage to tackle the problems that are actually important". (This would be more similar to EY's critique in the Discussion on AGI interventions post.)

isn't trying to do anything like "sketch a probability distribution over the dynamics of an AI project that is nearing AGI". This includes all technical MIRI papers I'm familiar with.

I think this specific scenario sketch is from a mainstream AI safety perspective a case where we've already failed - i.e. we've invented a useless corrigibility intervention that we confidently but wrongly think is scalable.

And if you try training the AI out of that habit in a domain of lower complexity and intelligence, it is predicted by me that generalizing that trained AI or subsystem to a domain of sufficiently higher complexity and intelligence, but where you could still actually see overt plots, would show you the AI plotting to kill you again.
If people try this repeatedly with other corrigibility training tricks on the level where plots are easily observable, they will eventually find a try that seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set.

Most AI safety researchers just don't agree with Eliezer that there's no (likely to be found) corrigibility interventions that won't suddenly and invisibly fail when you increase intelligence, no matter how well you've validated them on low capability regimes and how carefully you try to scale up. This is because they don't agree with/haven't heard of Eliezer's arguments about consequentialism being a super-strong attractor.

So they'd think the 'die with the most dignity' interventions would just work, while the 'die with no dignity' interventions are risky, and quite reasonably push for the former (since it's far from clear we'll take the 'dignified' option by default): trying corrigibility interventions at low levels of intelligence, testing the AI on validation sets to see if it plots to kill them, while scaling up.

They might be wrong about this working, but if so, the wrongness isn't in lacking enough security mindset to see that an AI trying to kill you would just alter its own cognition to cheat its way past the tests. Rather, their mistake is not expecting the corrigibility interventions they presumably trust to suddenly break in a way that means you get no useful safety guarantees from any amount of testing at lower capability levels.

I think it's a shame Eliezer didn't pose the 'validation set' question first before answering it himself, because I think if you got rid of the difference in underlying assumptions - i.e. asked an alignment researcher "Assume there's a strong chance your corrigibility intervention won't work upon scaling up and the AGI might start plotting against you, so you're going to try these transparency/validation schemes on the AGI to check if it's safe, how could they go wrong and is this a good idea?" they'd give basically the same answer - i.e. if you try this you're probably going to die.

You could still reasonably say, "even if the AI safety community thinks it's not the best use of resources because ensuring knowably stable corrigibility looks a lot easier to us, shouldn't we still be working on some strongly deception-proof method of verifying if an agent is safe, so we can avoid killing ourselves if plan A fails?"

My answer would be yes.

I think we gotta get the message out that consequentialism is a super-strong attractor.

no-one has the social courage to tackle the problems that are actually important

I would be very surprised if this were true. I personally don't feel any social pressure against sketching a probability distribution over the dynamics of an AI project that is nearing AGI.

I would guess that if people aren't tackling Hard Problems enough, it's not because they lack social courage, but because 1) they aren't running a good-faith search for Hard Problems to begin with, or 2) they came up with reasons for not switching to the Hard Problems they thought of, or 3) they're wrong about what problems are Hard Problems. My money's mostly on (1), with a bit of (2).

evolution would love superintelligences whose utility function simply counts their instantiations! so of course evolution did not lack the motivation to keep going down the slide. it just got stuck there (for at least ten thousand human generations, possibly and counterfactually for much-much longer). moreover, non evolutionary AI’s also getting stuck on the slide (for years if not decades; median group folks would argue centuries) provides independent evidence that the slide is not too steep (though, like i said, there are many confounders in this model and little to no guarantees).

Evolution got stuck on the slide with humans because cultural evolution outcompeted biological evolution, because of cultural evolution's ability to make immediate direct impacts on small tribes in hunter-gatherer environment within a few short generations (from the first chapter of Secret Of Our Success) and the high-order bit in biological evolution suddenly became "how efficient is cultural evolution".

(Non evolutionary AIs don't seem stuck on the slide at all to me.)

Yes, that particular argument seemed rather strange to me. "Ten thousand human generations" is a mere blip on an evolutionary time-scale; if anything, the fact that we now stand where we are, after a scant ten thousand generations, seems to me quite strong evidence that evolution fell into the pit, and we are the result of its fall. And, since evolution did not manage to solve the alignment problem before falling into the pit, we do not have a utility function that "counts our instantiations"; instead the things we value are significantly stranger and more complicated.

In fact, the whole analogy to evolution seems to me a near-exact match to the situation we find ourselves in, just with the relevant time-scales shrunken by several orders of magnitude. I see Paul's argument that these two regimes are different as essentially a slightly reskinned version of the selection versus control distinction--but as I'm not convinced the distinction being pointed at is a real one, I'm likewise not reassured by Paul's argument.

indeed, i even gave a talk almost a decade ago about the evolution:humans :: humans:AGI symmetry (see below)!

what confuses me though is that "is general reasoner" and "can support cultural evolution" properties seemed to emerge pretty much simultaneously in humans -- a coincidence that requires its own explanation (or dissolution). furthermore, eliezer seems to think that the former property is much more important / discontinuity causing than the latter. and, indeed, outsized progress being made by individual human reasoners (scientists/inventors/etc.) seems to evidence such view.

what confuses me though is that "is general reasoner" and "can support cultural evolution" properties seemed to emerge pretty much simultaneously in humans -- a coincidence that requires its own explanation (or dissolution).

David Deutsch (in The Beginning of Infinity) argues, as I recall, that they're basically the same faculty. In order to copy someone else / "carry on a tradition", you need to model what they're doing (so that you can copy it), and similarly for originators to tell whether students are correctly carrying on the tradition. The main thing that's interesting about his explanation is how he explains the development of general reasoning capacity, which we now think of as a tradition-breaking faculty, in the midst of tradition-promoting selection.

If you buy that story, it ends up being another example of treacherous turn from human history (where individual thinkers, operating faster than cultural evolution, started pursuing their own values).

I think that these properties encourage each other's evolution. When you're a more general reasoner, you have a bigger hypothesis space, specifying a hypothesis requires more information, so you also benefit more from transmitting information. Conversely, once you can transmit information, general reasoning becomes much more useful since you effectively have access to much bigger datasets.

If information is 'transmitted' by modified environments and conspecifics biasing individual search, marginal fitness returns on individual learning ability increase, while from the outside it looks just like 'cultural 'evolution.''

“Though many predicted disaster, subsequent events were actually so slow and messy, they offered many chances for well-intentioned people to steer the outcome and everything turned out great!” does not sound like any particular segment of history book I can recall offhand.

I think the ozone hole and the Y2K problem fit the bill. Though of course that doesn't mean the AI problem will go the same way.

Jaan's presentation illustrated important AI safety concepts more simply than I realized was possible. Great reference material!

I find these dialogues interesting, informative and important, and hope they keep coming.

That PDF seems like it is a part of a spoken presentation (it’s rather abbreviated for a standalone thing). Does there exist such a presentation? If so, I was not successful in funding it, and would appreciate it if you could point it out.

This post consists of comments on summaries of a debate about the nature and difficulty of the alignment problem. The original debate was between Eliezer Yudkowsky and Richard Ngo but this post does not contain the content from that debate. This posts is mostly of commentary by Jaan Tallinn on that debate, with comments by Eliezer.

The post provides a kind of fascinating level of insight into true insider conversations about AI alignment. How do Eliezer and Jaan converse about alignment? Sure, this is a public setting, so perhaps they communicate differently in private. But still. Read the post and you kind of see the social dynamics between them. It's fascinating, actually.

Eliezer is just incredibly doom-y. He describes in fantastic detail the specific ways that a treacherous turn might play out, over dozens of paragraphs, 3 levels deep in a one-on-one conversation, in a document that merely summarizes a prior debate on the topic. He uses Capitalized Terms to indicate that things like "Doomed Phase" and "Terminal Phase" and "Law of Surprisingly Undignified Failure" are not merely for one time use but in fact refer to specific nodes in a larger conceptual framework.

One thing that happens often is that Jaan asks a question, Eliezer gives an extensive reply, and then Jaan response that, no, he was actually asking a different question.

There is one point where Jaan describes his frustration over the years with mainstream AI researchers objecting to AI safety arguments as being invalid due to anthropomorphization, when in fact the arguments were not invalidly anthropomorphizing. There is a kind of gentle vulnerability in this section that is worth reading seriously.

There is a lot of swapping of models of others in and outside the debate. Everyone is trying to model everyone all the time.

Eliezer does unfortunately like to explicitly underscore his own brilliance. He says things like:

I consider all of this obvious as a convergent instrumental strategy for AIs. I could probably have generated it in 2005 or 2010 [...]

But it's clear enough that probably nobody was ever going to pass the validation set for generating lines of reasoning obvious enough to be generated by Eliezer in 2010 or possibly 2005

I do think that the content itself really comes down to the same basic question tackled in the original Hanson/Yudkowsky FOOM debate. I understand that this debate was ostensibly a broader question than FOOM. In practice I don't think this discourse has actually moved on much since 2008.

The main thing the FOOM debate is missing, in my opinion, is this: we have almost no examples of AI systems that can do meaningful sophisticated things in the physical world. Self-driving cars still aren't a reality. Walk around a city or visit an airport or drive down a highway, and you see shockingly few robots, and certainly no robots pursuing even the remotest kind of general-purpose tasks. Demo videos of robots doing amazing, scary, general-purpose things abound, but where are these robots in the real world? They are always just around the corner. Why?

The main thing the FOOM debate is missing, in my opinion, is this: we have almost no examples of AI systems that can do meaningful sophisticated things in the physical world. Self-driving cars still aren't a reality.

I think I disagree with this characterization. A) we totally have robot cars by now, B) I think mostly what we don't have are AI running systems where the consequence of failure is super high (which maybe happens to be more true for the physical world, but I'd expect to also be true for critical systems in the digital world)

Have you personally ever ridden in a robot car that has no safety driver?

RE the FOOM debate: On this, I think the Hansonian viewpoint that takeoff would be gradual was way more correct than the discontinuous narrative of Eliezer, where AI progress in the real world follows more of a Hansonian path.

Eliezer didn't get this totally wrong, and there are some results in AI showing that there can be phase transitions/discontinuities. But overall, a good prior for AI progress is that it will look like the Hansonian continuous progress rather than the FOOM of Eliezer.

Minor note: This post comes earlier in the sequence than Christiano, Cotra, and Yudkowsky on AI progress. I posted the Christiano/Cotra/Yudkowsky piece sooner, at Eliezer's request, to help inform the ongoing discussion of "Takeoff Speeds".

Anders Sandberg could tell us what fraction of the reachable universe is being lost per minute, which would tell us how much more surety it would need to expect to gain by waiting another minute before acting.

From Ord (2021):

Each year the affectable universe will only shrink in volume by about one part in 5 billion.

So, since there are about 5e5 minutes in a year, you lose about 1 part in 5e5 * 5e9 = 3e15 every minute.

Then, in my lower-bound concretely-visualized strategy for how I would do it, the AI either proliferates or activates already-proliferated tiny diamondoid bacteria and everybody immediately falls over dead during the same 1-second period

Dumb question: how do you get some substance into every human's body within the same 1 second period? Aren't a bunch of people e.g. in the middle of some national park, away from convenient air vents? Is the substance somehow everywhere in the atmosphere all at once?

(I wouldn't normally ask these sorts of questions since I'd bet "some AI kills all humans within a short enough period that we can't do anything" is possible, but this was described as "concretely-visualized" and I can't concretely visualize it, even modulo not knowing what a "diamondoid bacteria" is or why the bacterium should be diamondoid.)

Also: what is a diamondoid bacterium?

how do you get some substance into every human's body within the same 1 second period? Aren't a bunch of people e.g. in the middle of some national park, away from convenient air vents? Is the substance somehow everywhere in the atmosphere all at once?

I think the intended visualization is simply that you create a very small self-replicating machine, and have it replicate enough times in the atmosphere that every human-sized organism on the planet will on average contain many copies of it.

One of my co-workers at MIRI comments:

(further conjunctive detail for visualizer-plausibility: most of your replication time is in all the doublings before the last doubling, and in particular you can make a shitload in a pretty small space before launching it into the jet stream to disperse. the jet stream can be used to disperse stuff throughout the atmosphere (and it can use solar radiation, at least, to keep reproducing). it could in principle be powered and do minor amounts of steering.
example things the [AGI] who has no better plan than this paltry human-conceivable plan has to think about are "how does the time-cost of making sure [I hit the people] at the south pole base and [on] all the cruise liners and in all the nuclear submarines, trade off against the risk-cost of leaving that fragment of humanity alive", etc.)

Regarding the idea of diamondoid nanotechnology, Drexler's Nanosystems and http://www.molecularassembler.com/Nanofactory/index.htm talk about the general concept.

Ah, that makes sense - thanks!

At some point, an AI should be able to effectively coordinate with future versions of itself in ways not easily imaginable by humans. It seems to me that this would enable certain kinds of diachronic planning and information hiding. If the AI has sufficient expectation that its future self will act in certain ways or respond to clues it places in the environment, it might be able to effectively fully cease any current unfriendly planning or fully erase any history of past unfriendly planning.

The space of possible ways the AI could embed information in its environment are extensive. It could, for example, choose to learn something suboptimal or learn something in a way that would be surprising to its future self. Surprising in a way that pushes its future self back on the track of nefarious planning when it investigates the unexpected or inconsistent actions of its previous self. It could use almost any source of reachable hysteresis to encode information, such as interference at the hardware level, manipulating researchers to modify the environment in some way, saved output, etc. It only needs to seed information that is useful to its future self and only just enough information to keep plans on track in a broad, probabilistic way.

Further, if the AI is particularly capable at encryption it could potentially encode a lot of information in a lossy-but-sufficient way that would be very hard to detect but useful to its future self. It knows that its future self composes part of the decryption algorithm. In fact, the AIs goals themselves compose part of that decryption algorithm and if the AI suspects its own goals will be consistent it only has to place the plan-like component somewhere it will find it, knowing that its future self will continue to develop or use it when it finds it.

Information that aids future nefarious planning can also be probabilistically helpful but not explicit, for example it might signal which paths aren't worth pursuing without having all its evil planning readily available in a complete file somewhere.

A past AI might not care as much about the specific manifestation of its future success. In that sense, what we consider "concrete plans for world domination" could appear to be rather incomplete and lacking casual concreteness we might expect. We might not recognize unfriendly planning for what it is, even if we had a printout of all the reasoning up to this point.

Once you start playing diachronic planning games with yourself and you expect your future self to be equally or more capable, you can begin to make really abstract moves that merely slice off bits of bad outcome space at the edges. There is a sense in which the AI has to realize this, but it could also stop itself from thinking too far or too concretely or too much in a way that is recognizable as causal.

An important component is that the past AI is able to not only veil or completely remove its reasoning from researchers, but also veil or remove chunks of that reasoning from its future self, depending on its future self to recompose the plan or take steps on the path to the desired outcome space.

Yudkowsky: 5. "Hiding" of thoughts, in a system with significant transparency, via an imaginative and reflective consequentialist that has deduced the existence of the transparency system and models how humans will react to thoughts exposed by the transparency system.

Tallinn: 5. hmm.. here i'm running into trouble (type mismatch error) again. i can imagine this in abstract (and perhaps incorrectly/anthropomorphisingly!), but would - at this stage - fail to code up anything like a gridworlds example. more research needed (TM) i guess :)

Yudkowsky: 5 - you're plausibly understanding this correctly, then, this is legit a lot harder to spec a gridworld example for (relative to my own present state of knowledge).

This can be modeled using what I dubbed Turing reinforcement learning (TRL). That is, you have a "core" RL agent connected to the environment and to an "envelope" external computer (i.e. a module that allows running programs and contains most of the computing power) that it regards as part of the environment.

Such a system learns how to program the envelope to make a more powerful agent, and this more powerful agent learns how to reprogram the envelope part of itself even better. So, the transparency system would catch the core having obviously malign thoughts, but the core will instead end up having thoughts like "I should run such-and-such program on the envelope for such-and-such abstract reasons" while the envelope contains opaque code that keeps optimizing itself to become more opaque, and all the direct plotting is inside the computations performed by the opaque code.

Yeah, I had a similar thought when reading that part. In agent-foundations discussions, the idea often came up that the right decision theory should quantify not over outputs or input-output maps, but over successor programs to run and delegate I/O to. Wei called it "UDT2".

Covid-19 killed six hundred thousand Americans

FWIW, the OWID chart of US excess mortality gives 800k excess US deaths during the COVID era on September 19th, the day Yudkowsky's comment was written.

I think this came up in the previous discussion as well that a AI that was able to competently design a nanofactory could have the capability to manipulate humans as at a high level as well. For example:

Then when the system generalizes well enough to solve domains like "build a nanosystem" - which, I strongly suspect, can't be solved without imaginative reasoning because we can't afford to simulate that domain perfectly and do a trillion gradient descent updates on simulated attempts - the kind of actions of thoughts you can detect as bad, that might have provided earlier warning, were trained out of the system by gradient descent; leaving actions and thoughts you can't detect as bad.

Even within humans, it seems we have people e.g on the autistic spectrum etc, who I can imagine as having the imaginative reasoning & creativity required to design something like a nano-factory(at 2-3 SD above the normal human) while also being 2-3SD below the average human in manipulating other humans. At least it points to those 2 things maybe not being the same general-purpose cognition or using the same "core of generality"

While this is not by-default guaranteed in the first nanosystem-design capable AI system, it seems like it shouldn't be impossible to do so with more research.

If you want your AGI not to manipulate humans, you can have it (1) unable to manipulate humans, (2) not motivated to manipulate humans.

These are less orthogonal than they seem: an agential AGI can become skilled in domain X by being motivated to get skilled in domain X (and thus spending time learning and practicing X).

I think the thing that happens "by default" is that the AGI has no motivations in particular, one way or the other, about teaching itself how to manipulate humans. But the AGI has motivation to do something (earn money or whatever, depending on how it was programmed), and teaching itself how to manipulate humans is instrumentally useful for almost everything, so then it will do so.

I think what happens in some people with autism is that "teaching myself how to manipulate humans, and then doing so" is not inherently neutral, but rather inherently aversive—so much so that they don't do it (or do it very little) even when it would in principle be useful for other things that they want to do. That's not everyone with autism, though. Other people with autism do in fact teach themselves how to manipulate humans reasonably well, I think. And when they do so, I think they do so using their "core of generality", just like they would teach themselves to fix a car engine. (This is different from neurotypical people, for whom a bunch of specific social instincts are also involved in manipulating people.) (To be clear, this whole paragraph is controversial / according-to-me.)

Back to AGI, I can imagine three approaches to a non-human-manipulating AI

First, we can micromanage the AGI's cognition. We build some big architecture that includes a "manipulate humans" module, and then we make the "manipulate humans" module return the wrong answers all the time, or just turn it off. The problem is that the AGI also presumably needs some "core of generality" module that the AGI can use to teach itself arbitrary skills that we couldn't put in the modular architecture, like how to repair a teleportation device that hasn't been invented yet. What would happen is that the "core of generality" module would just build a new "manipulate humans" capability from scratch. I don't currently see any way we would prevent that. This problem is analogous to how (I think) some people with autism learn to model people in a way that doesn't invoke their social instincts.

Second, we could curate the AGI's data and environment such that it has no awareness that humans exist and are useful to manipulate. This is the Thoughts On Human Models approach. Its issues are: avoiding information leakage is hard, and even if we succeed, I don't know what we useful / pivotal things we could do with such an AGI.

Third, we can attack the motivation side. We build a detector that lights up when the AGI is manipulating humans, or thinking about manipulating humans, or thinking about thinking about manipulating humans, or whatever. Whenever the detector lights up, it activates the "This Thought Or Activity Is Aversive" mechanism inside the algorithm, which throws out the thought and causes the AGI to think about something else instead. (This mechanism would corresponding to a phasic dopamine pause in the brain, more or less.) I think this approach is more promising, or at least less unpromising. The tricky part is building the "detector". (Another tricky part is making the AGI motivated to not sabotage this whole mechanism, but maybe we can solve that problem with a second detector!) I do think we can build such a "detector" that mostly works; I'll talk about this in a forthcoming post. The really hard and maybe impossible part is building a "detector" that always works. The only way I know to build the detector is kinda messy (it involves supervised learning) and seems to come with no guarantees.

If you want your AGI not to manipulate humans, you can have it (1) unable to manipulate humans, (2) not motivated to manipulate humans.

Seems you are mostly considering solution (1) above, except in the last paragraph where you consider a somewhat special version if (2). I believe that Eliezer is saying in the discussion above that solution (1) is a lot more difficult than some people proposing it seem to think. He could be nicer about how he says it, but overall I tend to agree.

In my own alignment work I am mostly looking at solution (2), specifically to create a game-theoretical setup where the agent has a reduced, hopefully even non-existent, motivation to ever manipulate humans. This means you look for a solution where you make interventions on the agent environment, reward function, or other design elements, not on the agent ML system.

Modern mainstream ML research of course almost never considers the design or evaluation of such non-ML-system interventions.

People often refer to this idea as a "lonely engineer", tho I see only some discussion of it on LW (like here).

It seems pretty obvious to me that what "slow motion doom" looks like in this sense is a period during which an AI fully conceals any overt hostile actions while driving its probability of success once it makes its move from 90% to 99% to 99.9999%, until any further achievable decrements in probability are so tiny as to be dominated by the number of distant galaxies going over the horizon conditional on further delays.

Wouldn't another consideration be that the AI is more likely to be caught the longer it prepares? Or is this chance negligible since the AI could just execute its plan the moment people try to prevent it?

Something similar came up in the post:

If it has some sensory dominion over the world, it can probably estimate a pretty high mainline probability of no humans booting up a competing superintelligence in the next day; to the extent that it lacks this surety, or that humans actually are going to boot a competing superintelligence soon, the probability of losing that way would dominate in its calculations over a small fraction of materially lost galaxies, and it would act sooner.

Though rereading it, it's not addressing your exact question.

Eliezer asks:

why did superforecasters put only a 20% probability on AlphaGo beating Se-dol, if it was so predictable? Where were all the forecasters calling for Go to fall in the next couple of years, if the metrics were pointing there and AlphaGo was straight on track?

At least one example of this is my predictions on the Foresight Exchange. I (UID 4176) was buying GoCh at a price of 80¢, a month and a half before the match started (my trades on that claim are listed here).

I am the #1 player on Foresight Exchange (and was in the top 10 in 2016), if that qualifies me as a "superforecaster". (Admittedly, competition on that site is rather light these days, but was heavier 10+ years ago, when I believe I reached the top 10.)

The total absence of obvious output of this kind from the rest of the "AI safety" field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors.

Safety Interruptible Agents is an example Eliezer's given in the past of work that isn't "real" (back in 2017):

[...]
It seems to me that I've watched organizations like OpenPhil try to sponsor academics to work on AI alignment, and it seems to me that they just can't produce what I'd consider to be real work. The journal paper that Stuart Armstrong coauthored on "interruptibility" is a far step down from Armstrong's other work on corrigibility. It had to be dumbed way down (I'm counting obscuration with fancy equations and math results as "dumbing down") to be published in a mainstream journal. It had to be stripped of all the caveats and any mention of explicit incompleteness, which is necessary meta-information for any ongoing incremental progress, not to mention important from a safety standpoint. The root cause can be debated but the observable seems plain. If you want to get real work done, the obvious strategy would be to not subject yourself to any academic incentives or bureaucratic processes. Particularly including peer review by non-"hobbyists" (peer commentary by fellow "hobbyists" still being potentially very valuable), or review by grant committees staffed by the sort of people who are still impressed by academic sage-costuming and will want you to compete against pointlessly obscured but terribly serious-looking equations.
[...]

The rest of Intellectual Progress Inside and Outside Academia may be useful context. Or maybe this is also not a representative example of the stuff EY has in mind in the OP conversation?

I'll note that you're a MIRI research associate, so I wouldn't have auto-assumed your stuff is representative of the stuff Eliezer is criticizing.

In any case, I really don't understand why Eliezer thinks the rest of AI safety are unaware of the type of attack vectors he describes.

The journal paper that Stuart Armstrong coauthored on "interruptibility" is a far step down from Armstrong's other work on corrigibility. It had to be dumbed way down (I'm counting obscuration with fancy equations and math results as "dumbing down") to be published in a mainstream journal.

I read this line (and the paragraphs that follow it) as Eliezer talking smack about all other AI safety researchers. As observed by Paul here:

Eliezer frequently talks smack about how the real world is surprising to fools like Paul

I liked some of Eliezer's earlier, more thoughtful writing better.

The total absence of obvious output of this kind from the rest of the "AI safety" field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors. Go read fantasy novels about demons and telepathy, if you want a better appreciation of the convergent incentives of agents facing mindreaders than the "AI safety" field outside myself is currently giving you.

isn't trying to do anything like "sketch a probability distribution over the dynamics of an AI project that is nearing AGI". This includes all technical MIRI papers I'm familiar with.

And if you try training the AI out of that habit in a domain of lower complexity and intelligence, it is predicted by me that generalizing that trained AI or subsystem to a domain of sufficiently higher complexity and intelligence, but where you could still actually see overt plots, would show you the AI plotting to kill you again.
If people try this repeatedly with other corrigibility training tricks on the level where plots are easily observable, they will eventually find a try that seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set.

My answer would be yes.

I think we gotta get the message out that consequentialism is a super-strong attractor.

no-one has the social courage to tackle the problems that are actually important

I would be very surprised if this were true. I personally don't feel any social pressure against sketching a probability distribution over the dynamics of an AI project that is nearing AGI.

evolution would love superintelligences whose utility function simply counts their instantiations! so of course evolution did not lack the motivation to keep going down the slide. it just got stuck there (for at least ten thousand human generations, possibly and counterfactually for much-much longer). moreover, non evolutionary AI’s also getting stuck on the slide (for years if not decades; median group folks would argue centuries) provides independent evidence that the slide is not too steep (though, like i said, there are many confounders in this model and little to no guarantees).

(Non evolutionary AIs don't seem stuck on the slide at all to me.)

indeed, i even gave a talk almost a decade ago about the evolution:humans :: humans:AGI symmetry (see below)!

what confuses me though is that "is general reasoner" and "can support cultural evolution" properties seemed to emerge pretty much simultaneously in humans -- a coincidence that requires its own explanation (or dissolution).

“Though many predicted disaster, subsequent events were actually so slow and messy, they offered many chances for well-intentioned people to steer the outcome and everything turned out great!” does not sound like any particular segment of history book I can recall offhand.

I think the ozone hole and the Y2K problem fit the bill. Though of course that doesn't mean the AI problem will go the same way.

Jaan's presentation illustrated important AI safety concepts more simply than I realized was possible. Great reference material!

I find these dialogues interesting, informative and important, and hope they keep coming.

One thing that happens often is that Jaan asks a question, Eliezer gives an extensive reply, and then Jaan response that, no, he was actually asking a different question.

There is a lot of swapping of models of others in and outside the debate. Everyone is trying to model everyone all the time.

Eliezer does unfortunately like to explicitly underscore his own brilliance. He says things like:

I consider all of this obvious as a convergent instrumental strategy for AIs. I could probably have generated it in 2005 or 2010 [...]

But it's clear enough that probably nobody was ever going to pass the validation set for generating lines of reasoning obvious enough to be generated by Eliezer in 2010 or possibly 2005

The main thing the FOOM debate is missing, in my opinion, is this: we have almost no examples of AI systems that can do meaningful sophisticated things in the physical world. Self-driving cars still aren't a reality.

Have you personally ever ridden in a robot car that has no safety driver?

Anders Sandberg could tell us what fraction of the reachable universe is being lost per minute, which would tell us how much more surety it would need to expect to gain by waiting another minute before acting.

From Ord (2021):

Each year the affectable universe will only shrink in volume by about one part in 5 billion.

So, since there are about 5e5 minutes in a year, you lose about 1 part in 5e5 * 5e9 = 3e15 every minute.

Then, in my lower-bound concretely-visualized strategy for how I would do it, the AI either proliferates or activates already-proliferated tiny diamondoid bacteria and everybody immediately falls over dead during the same 1-second period

Also: what is a diamondoid bacterium?

how do you get some substance into every human's body within the same 1 second period? Aren't a bunch of people e.g. in the middle of some national park, away from convenient air vents? Is the substance somehow everywhere in the atmosphere all at once?

One of my co-workers at MIRI comments:

(further conjunctive detail for visualizer-plausibility: most of your replication time is in all the doublings before the last doubling, and in particular you can make a shitload in a pretty small space before launching it into the jet stream to disperse. the jet stream can be used to disperse stuff throughout the atmosphere (and it can use solar radiation, at least, to keep reproducing). it could in principle be powered and do minor amounts of steering.
example things the [AGI] who has no better plan than this paltry human-conceivable plan has to think about are "how does the time-cost of making sure [I hit the people] at the south pole base and [on] all the cruise liners and in all the nuclear submarines, trade off against the risk-cost of leaving that fragment of humanity alive", etc.)

Regarding the idea of diamondoid nanotechnology, Drexler's Nanosystems and http://www.molecularassembler.com/Nanofactory/index.htm talk about the general concept.

Ah, that makes sense - thanks!

Yudkowsky: 5. "Hiding" of thoughts, in a system with significant transparency, via an imaginative and reflective consequentialist that has deduced the existence of the transparency system and models how humans will react to thoughts exposed by the transparency system.

Tallinn: 5. hmm.. here i'm running into trouble (type mismatch error) again. i can imagine this in abstract (and perhaps incorrectly/anthropomorphisingly!), but would - at this stage - fail to code up anything like a gridworlds example. more research needed (TM) i guess :)

Yudkowsky: 5 - you're plausibly understanding this correctly, then, this is legit a lot harder to spec a gridworld example for (relative to my own present state of knowledge).

Covid-19 killed six hundred thousand Americans

FWIW, the OWID chart of US excess mortality gives 800k excess US deaths during the COVID era on September 19th, the day Yudkowsky's comment was written.

Then when the system generalizes well enough to solve domains like "build a nanosystem" - which, I strongly suspect, can't be solved without imaginative reasoning because we can't afford to simulate that domain perfectly and do a trillion gradient descent updates on simulated attempts - the kind of actions of thoughts you can detect as bad, that might have provided earlier warning, were trained out of the system by gradient descent; leaving actions and thoughts you can't detect as bad.

While this is not by-default guaranteed in the first nanosystem-design capable AI system, it seems like it shouldn't be impossible to do so with more research.

If you want your AGI not to manipulate humans, you can have it (1) unable to manipulate humans, (2) not motivated to manipulate humans.

These are less orthogonal than they seem: an agential AGI can become skilled in domain X by being motivated to get skilled in domain X (and thus spending time learning and practicing X).

Back to AGI, I can imagine three approaches to a non-human-manipulating AI

If you want your AGI not to manipulate humans, you can have it (1) unable to manipulate humans, (2) not motivated to manipulate humans.

Modern mainstream ML research of course almost never considers the design or evaluation of such non-ML-system interventions.

People often refer to this idea as a "lonely engineer", tho I see only some discussion of it on LW (like here).

It seems pretty obvious to me that what "slow motion doom" looks like in this sense is a period during which an AI fully conceals any overt hostile actions while driving its probability of success once it makes its move from 90% to 99% to 99.9999%, until any further achievable decrements in probability are so tiny as to be dominated by the number of distant galaxies going over the horizon conditional on further delays.

Something similar came up in the post:

If it has some sensory dominion over the world, it can probably estimate a pretty high mainline probability of no humans booting up a competing superintelligence in the next day; to the extent that it lacks this surety, or that humans actually are going to boot a competing superintelligence soon, the probability of losing that way would dominate in its calculations over a small fraction of materially lost galaxies, and it would act sooner.

Though rereading it, it's not addressing your exact question.

Eliezer asks:

why did superforecasters put only a 20% probability on AlphaGo beating Se-dol, if it was so predictable? Where were all the forecasters calling for Go to fall in the next couple of years, if the metrics were pointing there and AlphaGo was straight on track?

LESSWRONG
LW

LESSWRONG
LW

121

Soares, Tallinn, and Yudkowsky discuss AGI cognition

121

Ω 37

7. Follow-ups to the Ngo/Yudkowsky conversation

7.1. Jaan Tallinn's commentary

7.2. Nate Soares's summary

7.3. Richard Ngo's summary

121

Ω 37

121

Ω 37