Remmelt

Research coordinator of Stop/Pause area at AI Safety Camp.

See explainer on why AGI could not be controlled enough to stay safe:
lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable

 

Sequences

Bias in Evaluating AGI X-Risks
Developments toward Uncontrollable AI
Why Not Try Build Safe AGI?

Comments

Remmelt10

Fair question. You can assume it is AoE.

Research leads are not going to be too picky about what hour you send the application in.

There is no need to worry about the exact deadline. Even if you send in your application on the next day, that probably won't significantly impact your chances of getting picked up by your desired project(s).

Sooner is better, since many research leads will begin composing their teams after the 17th, but there is no hard cut-off point.

Remmelt10

Thanks!  These are thoughtful points. See some clarifications below:
 

AGI could be very catastrophic even when it stops existing a year later.

You're right. I'm not even covering all the other bad stuff that could happen in the short term and that we might still be able to prevent, like AGI triggering global nuclear war.

What I'm referring to is unpreventable convergence on extinction.
 

If AGI makes earth uninhabitable in a trillion years, that could be a good outcome nonetheless.

Agreed, that could be a good outcome if it were attainable.

In practice, the convergence reasoning is about total human extinction happening within 500 years after 'AGI' has been introduced into the environment (with very little of the probability mass remaining beyond that).

In theory of course, to converge toward 100% chance, you are reasoning about going across a timeline of potentially infinite span.
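
As a rough illustration of that limit argument, assume (purely for this sketch; nothing above specifies it) a constant per-year extinction hazard $h$ once 'AGI' exists. The cumulative probability then stays below 1 for every finite year $t$ and reaches 1 only in the limit:

$$P(\text{extinct by year } t) = 1 - (1-h)^{t}, \qquad \lim_{t\to\infty}\bigl[1 - (1-h)^{t}\bigr] = 1 \quad \text{for } 0 < h \le 1.$$

A real hazard rate would vary over time, so this only shows why reaching 100% requires reasoning over an unbounded timeline.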
 

I don't know whether that covers "humans can survive on mars with a space-suit",

Yes, it does cover that. Whatever technological means we could think of to shield ourselves, or that 'AGI' could come up with as (temporary) barriers against the human-toxic landscape it creates, would still not be enough.
 

if humans evolve/change to handle situations that they currently do not survive under

Unfortunately, this is not workable. The mismatch between the (expanding) set of conditions needed for maintaining/increasing configurations of the AGI's artificial hardware and those needed for our human organic wetware is too great.

Also, if you try entirely changing our underlying substrate to the artificial substrate, you've basically removed the human and are left with 'AGI'. The lossy scans of human brains ported onto hardware would no longer feel as 'humans' can feel, and would be further changed/selected to fit their artificial substrate. This is because what humans feel and express as emotions is grounded in the distributed and locally context-dependent functioning of organic molecules (eg. hormones) in our body.

Remmelt10

Update: reverting my forecast to an 80% likelihood for these reasons.

Remmelt70

I'm also feeling less "optimistic" about an AI crash given:

  1. The election result involving a bunch of tech investors and execs pushing for influence through Trump's campaign (with a stated intention to deregulate tech).
  2. A military veteran saying that the military could be holding up the AI industry like "Atlas holding the globe", and an AI PhD saying that hyperscaled data centers, deep learning, etc., could be super useful for war.

I will revise my previous forecast back to 80%+ chance.

Remmelt10

Yes, I agree formalisation is needed. See comment by flandry39 in this thread on how one might go about doing so. 

Worth considering is that there are actually two aspects that make it hard to define the term ‘alignment’ in a way that allows for sufficiently rigorous reasoning:

  1. It must allow for logically valid reasoning (therefore requiring formalisation).
  2. It must allow for empirically sound reasoning (ie. the premises correspond with how the world works). 

In my reply above, I did not help you much with (1.). Though even while still using the English language, I managed to restate a vague notion of alignment in more precise terms.

Notice how it does help to define the correspondences with how the world works (2.):

  • “That ‘AGI’ continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.”

The reason why 2. is important is that formalisation alone is not enough. Just describing and/or deriving logical relations between mathematical objects does not say anything about the physical world. Somewhere in your fully communicated definition there also needs to be a description of how the mathematical objects correspond with real-world phenomena. Often, mathematicians do this by talking to collaborators about what symbols mean while they scribble the symbols out on eg. a whiteboard.

But whichever way you do it, you need to communicate how the definition corresponds to things happening in the real world, in order to show that it is a rigorous definition. Otherwise, others could still object that the formally precise definition is not rigorous, because it does not adequately (or explicitly) represent the real-world problem.

Remmelt10

For an overview of why such a guarantee would turn out to be impossible, I suggest taking a look at Will Petillo's post Lenses of Control.

Remmelt1-2

Defining alignment (sufficiently rigorous so that a formal proof of (im)possibility of alignment is conceivable) is a hard thing!

It's less hard than you think, if you use a minimal-threshold definition of alignment: 

That "AGI" continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under. 

Remmelt10

Yes, I think there is a more general proof available. This proof form would combine limits to predictability and so on, with a lethal dynamic that falls outside those limits.

Remmelt10

The question is more if it can ever be truly proved at all, or if it doesn't turn out to be an undecidable problem.

Control limits can show that it is an undecidable problem. 

A limited scope of control can in turn be used to prove that a dynamic convergent on human-lethality is uncontrollable. That would be a basis for an impossibility proof by contradiction (cannot control AGI effects to stay in line with human safety).

Remmelt30

Awesome directions. I want to bump this up.
 

This might include AGI predicting its own future behaviour, which is kind of essential for it to stick to a reliably aligned course of action.

There is a simple way of representing this problem that already shows the limitations. 

Assume that AGI continues to learn new code from observations (inputs from the world) – since learning is what allows the AGI to stay autonomous and adaptable in acting across changing domains of the world.

Then in order for AGI code to be run to make predictions about relevant functioning of its future code:

  • Current code has to predict what future code will be learned from future unknown inputs (there would be no point in learning if the inputs were predictable and known ahead of time).
  • Also, current code has to predict how the future code will compute subsequent unknown inputs into outputs, presumably using some shortcut algorithm that can infer relevant behavioural properties across the span of possible computationally-complex code.
  • Further, current code would have to predict how the outputs would result in relevant outside effects (where relevant to sticking to a reliably human-aligned course of action).
    • Where it is relevant how some of those effects could feed back into sensor inputs (and therefore could cause drifts in the learned code and the functioning of that code).
    • Where other potential destabilising feedback loops are also relevant, particularly that of evolutionary selection.
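
The first bullet can be made concrete with a toy sketch. Everything in it is hypothetical and only for illustration: the "code" is a plain lookup table, and `learn`, the situation names, and the actions are invented here. It is not a model of any real system. It only shows that the same current code ends up with different learned behaviour depending on inputs it has not yet seen.

```python
from typing import Dict, Iterable, Tuple

Policy = Dict[str, str]  # hypothetical stand-in for "code": situation -> action


def learn(policy: Policy, observations: Iterable[Tuple[str, str]]) -> Policy:
    """Return a copy of the policy updated with (situation, action) pairs
    observed from the world. The original policy is left unchanged."""
    updated = dict(policy)
    for situation, action in observations:
        updated[situation] = action
    return updated


# The same current code in both cases.
current_policy: Policy = {"greeting": "reply"}

# Two hypothetical futures with different, not-yet-observed inputs.
future_inputs_a = [("resource_shortage", "conserve")]
future_inputs_b = [("resource_shortage", "expand")]

policy_a = learn(current_policy, future_inputs_a)
policy_b = learn(current_policy, future_inputs_b)

# Identical starting code, divergent learned behaviour.
diverged = policy_a["resource_shortage"] != policy_b["resource_shortage"]
print("Future behaviour diverged:", diverged)
```

To predict its future behaviour in this toy, the current code would need to know which of the two input streams will arrive, which is exactly the information it does not have before learning. The later bullets (predicting outputs, outside effects, and feedback loops) add further steps on top of this one.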

       