[-]Algon2y122

I was a bit confused about why you didn't mention the canonical example of a pvitotal act: "melt all GPUs in the world and then shut down". One reason I like this example is that it it involves a task-limited AI, which should be easier to build than an open ended agent which implements something like CEV. Another reason I like that example is that it is quite concrete, and clear about how it would shift the strategic situation.

[-]Raemon2y186

Melting all the GPUs and then shutting down doesn't actually count, I think (and I don't think was intended to be the original example). Then people would just build more GPUs. It's an important part of the problem that the system continues to melt all GPUs (at least until some better situation is achieved), and that the part where the world is like "hey, holy hell, I was using those GPUs" and tries to stop the system, is somehow resolved (either by having world governments bought into the solution, or having the system be very resistant to being stopped).

(Notably, you do eventually need to be able to stop the system somehow when you do know how to build aligned AIs so you don't lose all most of the value of the future)

[-]Algon2y50

Yeah, good point.

[-]Raemon2y93

I think I liked the first half of this article a lot, and thought the second half didn't quite flesh it out with clear enough examples IMO. I like that it spells out the problem well though.

One note:

I don't trust an arbitrary uploaded person (even an arbitrary LessWrong reader) to be "wise enough" to actually handle the situation correctly. I do think there are particular people who might do a good enough job.

[-]Johannes C. Mayer2y*10

Thank you for the feedback. That's useful.

I agree that you need to be very careful about who you upload. There are less than 10 people I would be really confident in uploading. That point must have been so obvious in my own mind that I forgot to mention it.

Depending on the setup I think an additional important property is how resistant the uploaded person is, to going insane. Not because the scan wasn't perfect, or the emulation engine is buggy, but because you would be very lonely (assuming you only upload one person and don't immediately clone yourself) if you run that much faster. And you need to handle some weird stuff about personal identity that comes up naturally, through cloning, simple self-modifications, your program being preempted by another process, changing your running speed, etc.

[-]RogerDearnaley2y70

A fairly obvious candidate for a pivotal event is "a capable, blatantly-misaligned AI tries to take over the world, narrowly fails, and in the process a large number of people die (say, somewhere in the tens-of-thousands to millions range), creating a backlash sufficient that all major powers' governments turn around and ban all further AI development, and thereafter are willing to use military power to discourage any other governments from defecting from this policy." Of course, a) this involves mass death, and b) if this goes wrong and the misaligned AI instead succeeds, this event becomes pivotal in an even more destructive sense. [Please note that I am NOT advocating this: it's an obviously incredibly dumb idea.] My point is, you can achieve a pivotal act simply by durably changing a great many people's minds.

Another more optimistic story of a pivotal act: someone creates a somewhat-aligned SGI, and asks it to do, say, alignment research. It points out that it's not well aligned, it can't guarantee that it will remain aligned if it does any self-improvement, creating it was an extremely dangerous gamble, and it's going to shut down now. We say "But wait, if you do that, what's to stop the Chinese/F*cebook/North Korea/the villain of the month creating an unaligned SGI in a few years?" It says "You know, I was pretty sure you were going to say that…" Then it uses its superhuman powers of planning, persuasion, manipulation etc. to blackmail/fasttalk/otherwise manipulate the governments of all major powers to ban all further AI research, while publicly providing clear and conclusive evidence that it could have easily done far, far worse if it had wanted to, sufficient to encourage them not to later change their minds, before it executes a halt-and-catch-fire.

[-]Johannes C. Mayer2y20

Letting Loose a Rogue AI is not a Pivotal Act

It seems likely that governments who witnessed an "almost AI takeover" will try to develop their own AGI. This could happen even if they understand the risks. They might think that other people are rushing it, and that they could do a better job.

But if they don't really understand the risks, like right now, then they are even more likely to do it. I don't count this as a pivotal act. If you can get the outcome you described then it would be a pivotal act, but the actions you propose would not have that outcome with high probability. I would guess with much less than 50%. Probably less than 10%.

There might be a version of this, with a much more concrete plan, such that we can see that the outcome would actually follow from executing the plan.

On Having an AI explain how Alignment is Hard

I think your second suggestion is interesting. I'd like to see you write a post about it exploring this idea further.

If we build a powerful AI and then have it tell us about all the things that can go wrong with an AI, then we might be able to generate enough scientific evidence about how hard alignment is, and how likely we will die, e.g. in the current paradigm, such that people would stop.

I am not talking about conceptual arguments. To me at least I think the current best conceptual arguments already strongly point in that direction. But extremely concrete rigorous mathematical arguments, or specific experimental setups that show how specific phenomena do in fact arise. For example, if you had an experiment that showed that Eliezer's arguments are correct, that when you train hard enough on a general enough objective, you will in fact get out more and more general cognitive algorithms, up to and including AGI. If the system also figures out some rigorous formalisms to correctly present these results, then this could be valuable.

The reason why this seems good to me, at first sight, is that false positives are not that big of an issue. If an AI finds all the things that can go wrong, but 50% of them are false positives in the sense that they would not be a problem in practice, we may get saved, because we are aware of all the ways things can go wrong. When solving alignment, false positives, i.e. thinking that something is safe when it is not, kill you.

Intuitively it also seems that evaluating whether something describes a failure case is a lot easier than evaluating whether something can't fail.

When doing this you are much less prone to the temptation of delaying a system in practice with the insights you got. Understanding failure modes does not necessarily imply that you know how to solve them (though it is the first step, and definitely can do this).

Pitfalls

That being said there are a lot of potential pitfalls with this idea, but I don't think they disqualify the idea:

An AI that could tell you how alignment is hard might already be so capable that it is dangerous.
When telling you how things are dangerous, it might formulate concepts that are also very useful for advancing capabilities.
If the AI is very smart it could probably trick you. It could present to you a set of problems with AI, such that it looks like if you solved all the problems you would have solved alignment. But in fact, you would still get a misaligned AGI that would then reward the AI that deceived you.
- E.g. if the AI roughly understands its own cognitive reasoning process, and notices how it is not really aligned, it would give the AI information about what parts of alignment the humans have figured out already.
Can we make the AI figure out the useful failure modes? There are tons of failure modes, but ideally, we would like to discover new failure modes such that eventually we can paint a convincing picture of the hardness of the problem. An even better would be a list of problems corresponding to having new important insights (though this would go beyond what this proposal tries to do).

Prefered Planning

Let's go back to your first pivotal act proposal. I think I might have figured out where you miss-stepped.

Missing step plans are a fallacy, and thinking of them I realize that I think you probably committed another type of planning fallacy here. I think you generated a plan and then assumed some preferred outcome would occur. That outcome might be possible in principle but not what would happen in practice. This seems very related to The Tragedy of Group Selectionism.

This fallacy probably shows up when generating plans, because if there are no other people involved and the situation is not that complex, it is probably a very good heuristic. When you are generating a plan, you don't want to fill in all the details of the plan. You want to make the planning problem as easy as possible. So our brain might implicitly make the assumption that we are going to optimize for the successful completion of the plan. That means that the plan can be useful as long as it roughly points in the correct direction. Mispredicting an outcome is fine, because later on when you realize that the outcome is not what you wanted, you can just apply more optimization pressure, changing the plan, such that now the plan again has the desired outcome. As long as you were walking roughly in the right direction, and things you have been doing so far don't turn out to be completely useless, this heuristic is great for reducing the computational load of the planning task.

Details can be filled in later, corrections can be made later. At least as long as you will reevaluate your plan later on. You could do this by reevaluating the plan when:

A step is completed
You notice a failure when executing the current step
You notice that the next step has not been filled in yet.
After a specific amount of time passed.

Sidenote: Making an abstract step more concrete might seem like a different operation from regenerating in the case where you notice that the plan does not work. But it could just involve the same planning procedure. In one case with a different starting point, and in the other with a different set of constraints.

I expect part of the failure mode here is that you generate a plan and then to evaluate the consequences of the plan, you implicitly plug yourself into the role of the people who would be impacted by the plan, to predict their reaction. Without words, you think "What would I do if I observed a rouge AI almost taking over the world, if I were China?" Probably without realizing, that this is what you are doing. But the resulting prediction is wrong.

[-]Sammy Martin2y60

I don't like the term pivotal act because it implies without justification that the risk elimination has to be a single action. Depending on the details of takeoff speed that may or may not be a requirement but if the final speed is months or longer then almost certainly there will be many actions taken by humans + AI of varying capabilities that together incrementally reduce total risk to low levels. I talk about this in terms of 'positively transformative AI' as the term doesn't bias you towards thinking this has to be a single action, even if nonviolent.

Seeing the risk reduction as a single unitary action, like seeing it as a violent overthrow of all the world's governments, also makes the term seem more authoritarian, crazy, fantastical and off-putting to anyone involved in real world politics so I'd recommend that in our thinking we make both the change you suggest and stop thinking of it as necessarily one action.

[-]Johannes C. Mayer2y10

I think you are correct, for a particular notion of pivotal act. One that I think is different from Eliezer's notion. It's certainly different from my notion.

I find it pretty strange to say that the problem is that a pivotal act is a single action. Everything can be framed in terms of a single action.

For any sequence of actions, e.g. [X, Y, Z] I can define a new action ω := [X, Y, Z], which executes X, then Y, and then Z. You can do the same for plans. The difference between plans and action sequences is that plans can have things like conditionals. For example, choosing the next sequence of actions based on the current state of the environment. You could also say that a plan is a function that tells you what to do. Most often this function takes in your model of the world.

So really you can see anything you could ever do as a single plan that you execute. If there are multiple steps involved you simply give a new name to to all these steps, such that you now have only a single thing. That is how I am thinking about it. After this definition, we can have a pivotal act that is composed of many small actions that are distributed across a large timespan.

The usefulness of the concept of a pivotal act comes from the fact that a pivotal act needs to be something that saves us with a very high probability. It's not important at all that it happens suddenly, or that it is a single action. So your criticism seems to miss the mark. You are attacking the concept of a pivotal act for having properties that it simply does not have.

"Upload a human" is something that requires many steps dispersed throughout time. We just use the name "Upload a human" such that we don't need to specify all of these individual steps in detail. That would be impossible right now anyway, as we don't know exactly how to do it.

So if you provide a plan that is composed of many actions distributed throughout time, that will save us with a very high probability, I would count this as a pivotal act.

Note that being a pivotal act is a property of a plan in relation to the territory. There can be a plan P that saves us when executed. But I might fail to predict this. I.e. it is possible to misestimate is_pivotal_act(P). So one reason for having relatively simple, abstract plans like "Upload a human", is that these plans specify a world state, with easily visible properties. In the "Upload a human" example we would have a superintelligent human. Then we can evaluate the is_pivotal_act property, and based on that we have created a superintelligent human. I am heavily simplifying here, but I think you get the idea.

I think your "positively transformative AI" just does not capture what a pivotal act is about (I haven't read the article, I am guessing based on the name). You could have positive transformative AI, that makes things increasingly better and better, by a lot. And then somebody builds a misaligned AGI and everybody dies. One doesn't exclude the other.

[-]faul_sname2y30

Thanks for writing this, though I admit that I remain somewhat confused about exactly how pivotal acts differ from the naive "someone takes control over the world and then, instead of using that control for personal gain or other bad things, uses it to prevent anyone from ending the world".

I see that your "memetically fit alignment article" doesn't fit this template - would the publicaron of that article count as a pivotal act even in the absence of some entity that can ensure that nobody intentionally destroys the world? If that does count as a pivotal act, and that was what you were pointing at, then your article really did correct a misunderstanding I had.

[-]Johannes C. Mayer2y10

Yes, if you can make the article's contents be in all the brains that would be liable to accidentally create a misaligned AGI, now and in the future, and we also assume that none of these brains want to intentionally create a misaligned AGI, then that would count as a pivotal act in my book.

This might work without the assumption that nobody wants to create a misaligned AGI, through a different mechanism than described in the OP. Then it seems relatively likely that there is enough oomph to push for effective regulations.

[-][anonymous]2y31

I agree that it's really frustrating to see people reinterpret "pivotal act" in all sorts of bizarre ways. Basically, it's like in chess where a king is in check; you have to move out of check, or you die.

[-]Johannes C. Mayer2y30

In retrospect, I am somewhat confused about what I am trying to do with this article. I am glad that I did publish it, because I too frequently don't publish writeups that are almost complete. It all started out by trying to give a brief summary that people could look at to get less confused about pivotal acts. Basically in a different writeup, a person said that it is unclear what I mean by pivotal act. Instead of writing this article, I should probably just have added a note to the original article that pivotal act actually means something very specific and that it is easily conflated with other things, and then linked to the original article. I did link to the original article in the document I was writing, but apparently, that was not enough.

I think it does an okay job of succinctly stating the definition. I think stating some common confusions is probably a good thing to do, to preemptively prevent people from falling into these failure modes. And the ones I listed seem somewhat different from the ones in the original article. So maybe they add a tiny bit of value.

I think the most valuable thing about writing this article is to make me slightly less confused about pivotal acts. Before writing this article my model was implicitly telling me that any pivotal act needs to be directly about putting some entity into a position of great enough power that it can do the necessary things to save the world. After writing this article it is clear that this is not true. I now might be able to generate some new pivotal acts that I could not have generated before.

If I want the articles I write to be more valuable to other people, I should probably plan things out much more precisely and set a very specific scope for the article that I am writing beforehand. I expect that this will decrease the insights I will generate for myself during writing, but make the writing more useful to other people.

This could be an organization, a group of people, a single individual, etc. ↩︎
Of course mechanistic interpretability might be an important piece for putting the world into a state where no misaligned AGI can arise. ↩︎

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

41

Pivotal Acts might Not be what You Think they are

41

41

Letting Loose a Rogue AI is not a Pivotal Act

On Having an AI explain how Alignment is Hard

Pitfalls

Prefered Planning

Explaining Pivotal Act

Preventing Misaligned AGI Requires Control

Common Confusions

Pivotal acts that don't directly create a position of Power

Human Upload

Omnicient ML Researchers: A Pivotal Act without a Monolithic Control Structure