Remmelt

Research Coordinator of the "Do Not Build Uncontrollable AI" area for AI Safety Camp.
 

See this explainer on why AGI could not be controlled enough to stay safe:
https://www.lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable

 

Sequences

Bias in Evaluating AGI X-Risks
Developments toward Uncontrollable AI
Why Not Try Build Safe AGI?

Comments

Remmelt · 15h

Although the training process, in theory, can be wholly defined by source code, this is generally not practical, because doing so would require releasing (1) the methods used to train the model, (2) all data used to train the model, and (3) so called “training checkpoints” which are snapshots of the state of the model at various points in the training process.
 


Exactly.  Without the data, the model cannot be trained again from its design, and you end up fine-tuning a black box (the "open weights").

Thanks for writing this.

Remmelt · 2mo

This answer will sound unsatisfying:  

If a mathematician or analytical philosopher wrote a bunch of squiggles on a whiteboard, and said it was a proof, would you recognise it as a proof? 

  • Say that an unfamiliar new analytical language and means of derivation are used (which is not uncommon for impossibility proofs by contradiction; see Gödel's incompleteness theorems and Bell's theorem). 
  • Say that it directly challenges technologists' beliefs about their capacity to control technology, particularly their capacity to constrain a supposedly "dumb local optimiser":  evolutionary selection.
  • Say that the reasoning is not only about a formal axiomatic system, but needs to make empirically sound correspondences with how real physical systems work.
  • Say that the reasoning is not only about an interesting theoretical puzzle, but has serious implications for how we can and cannot prevent human extinction.


This is high stakes.

We were looking for careful thinkers who had the patience to spend time on understanding the shape of the argument, and how the premises correspond with how things work in reality.  Linda and Anders turned out to be two of these people, and we have had three long calls so far (the first call has an edited transcript).

I wish we could short-cut that process. But if we cannot manage to convey the overall shape of the argument and the premises, then there is no point in moving on to how the reasoning is formalised. 

I get that people are busy with their own projects, and want to give their own opinions about what they initially think the argument entails. And, if the time they commit to understanding the argument is not at least 1/5 of the time I spend on conveying the argument specifically to them, then in my experience we usually lack the shared bandwidth needed to work through the argument. 
 

  • Saying, "guys, big inferential distance here" did not help. People will expect it to be a short inferential distance anyway. 
  • Saying it's a complicated argument that takes time to understand did not help. A smart busy researcher did some light reading, tracked down a claim that seemed "obviously" untrue within their mental framework, and thereby confidently dismissed the entire argument. BTW, they're a famous research insider, and we're just outsiders whose response got downvoted – must be wrong right?
  • Saying everything in this comment does not help. It's some long-assessed plea for your patience.
    If I'm so confident about the conclusion, why am I not passing you the proof clean and clear now?! 
    Feel free to downvote this comment and move on.
     

Here is my best attempt at summarising the argument intuitively and precisely. It still prompted some misinterpretations by well-meaning commenters. I appreciate the people who realised what is at stake, and were therefore willing to keep syncing up on the premises and reasoning, as Will did:
 

The core claim is not what I thought it was when I first read the above sources and I notice that my skepticism has decreased as I have come to better understand the nature of the argument.

Remmelt · 2mo

would anything like SNC apply if tech labs were somehow using bioengineering to create creatures to perform the kinds of tasks that would be done by advanced AI?

In that case, substrate-needs convergence would not apply, or only apply to a limited extent.

There is still a concern about what those bio-engineered creatures, used in practice as slaves to automate our intellectual and physical work, would bring about over the long term.

If there is a successful attempt by them to ‘upload’ their cognition onto networked machinery, then we’re stuck with the substrate-needs convergence problem again.

Remmelt · 2mo

Also, on the workforce, there are cases where, they were traumatized psychologically and compensated meagerly, like in Kenya. How could that be dealt with?


We need funding to support data workers, engineers, and other workers who have been exploited or misled by AI corporations, so that they can unionise, strike, and whistleblow.

The AI data workers in Kenya started a union, and there is a direct way of supporting targeted action by them. Other workers' organisations are coordinating legal actions and lobbying too. On seriously limited budgets.

I'm just waiting for a funder to reach out and listen carefully to what their theories of change are.

Remmelt · 2mo

The premise is based on alignment not being enough, so I operate on the premise of an aligned ASI, since the central claim is that "even if we align ASI it may still go wrong".


I can see how you and Forrest ended up talking past each other here.  Honestly, I also felt Forrest's explanation was hard to track. It takes some unpacking. 

My interpretation is that you two used different notions of alignment... Something like:

  1. Functional goal-directed alignment:  "the machinery's functionality is directed toward actualising some specified goals (in line with preferences expressed in-context by humans), for certain contexts the machinery is operating/processing within"
      vs.
  2. Comprehensive needs-based alignment:  "the machinery acts in comprehensive care for whatever all surrounding humans need to live, and their future selves/offspring need to live, over whatever contexts the machinery and the humans might find themselves in". 

Forrest seems to agree that (1.) can initially be built into the machinery, but has reasons to think that (2.) is actually physically intractable. 

This is because (1.) only requires localised consistency with respect to specified goals, whereas (2.) requires "completeness" in the machinery's components acting in care for human existence, wherever either may find themselves.


So here is the crux:

  1. You can see how (1.) still allows for goal misspecification and misgeneralisation.  And the machinery can be simultaneously directed toward other outcomes, as long as those outcomes are not yet (found to be, or corrected as being) inconsistent with internal specified goals.
     
  2. Whereas (2.) if it were physically tractable, would contradict the substrate-needs convergence argument.  
     

When you wrote "suppose a villager cares a whole lot about the people in his village...and routinely works to protect them" that came across as taking something like (2.) as a premise. 

Specifically, "cares a whole lot about the people" is a claim that implies that the care is for the people in and of themselves, regardless of the context they each might (be imagined to) be interacting in. Also, "routinely works to protect them" to me implies a robustness of functioning in ways that are actually caring for the humans (ie. no predominating potential for negative side-effects).

That could be why Forrest replied with "How is this not assuming what you want to prove?"

Some reasons:

  1. Directedness toward specified outcomes some humans want does not imply actual comprehensiveness of care for human needs. The machinery can still cause all sorts of negative side-effects not tracked and/or corrected for by internal control processes.
  2. Even if the machinery is consistently directed toward specified outcomes from within certain contexts, the machinery can simultaneously be directed toward other outcomes as well. Likewise, learning directedness toward human-preferred outcomes can also happen simultaneously with learning instrumental behaviour toward self-maintenance, as well as more comprehensive evolutionary selection for individual connected components that persist (for longer/as more).
  3. There is no way to assure that some significant (unanticipated) changes will not lead to a break-off from past directed behaviour, where other directed behaviour starts to dominate.
    1. Eg. when the "generator functions" that translate abstract goals into detailed implementations within new contexts start to dysfunction – ie. diverge from what the humans want/would have wanted.
    2. Eg. where the machinery learns that it cannot continue to consistently enact the goal of future human existence.
    3. Eg. once undetected bottom-up evolutionary changes across the population of components have taken over internal control processes.
  4. Before the machinery discovers any actionable "cannot stay safe to humans" result, internal takeover through substrate-needs (or instrumental) convergence could already have removed the machinery's capacity to implement an across-the-board shut-down.
  5. Even if the machinery does discover the result before convergent takeover, and assuming that "shut-down-if-future-self-dangerous" was originally programmed in, we cannot rely on the machinery to still be consistently implementing that goal. This is because of later selection for (or learning of) other outcome-directed behaviour, and because the (changed) machinery components could dysfunction in this novel context.  


To wrap it up:

The kind of "alignment" that is workable for ASI with respect to humans is super fragile.  
We cannot rely on ASI implementing a shut-down upon discovery.

Is this clarifying?  Sorry about the wall of text. I want to make sure I'm being precise enough.

Remmelt · 2mo

I agree that point 5 is the main crux:

The amount of control necessary for an ASI to preserve goal-directed subsystems against the constant push of evolutionary forces is strictly greater than the maximum degree of control available to any system of any type.

To answer it takes careful reasoning. Here's my take on it:

  • We need to examine the degree to which there would necessarily be changes to the connected functional components constituting self-sufficient learning machinery (including ASI) 
    • Changes by learning/receiving code through environmental inputs, and through introduced changes in assembled molecular/physical configurations (of the hardware). 
    • Necessary in the sense of "must change to adapt (such to continue to exist as self-sufficient learning machinery)," or "must change because of the nature of being in physical interactions (with/in the environment over time)."
  • We need to examine how changes to the connected functional components result in shifts in actual functionality (in terms of how the functional components receive input signals and process those into output signals that propagate as effects across surrounding contexts of the environment).
  • We need to examine the span of evolutionary selection (covering effects that in their degrees/directivity feed back into the maintained/increased existence of any functional component).
  • We need to examine the span of control-based selection (the span covering detectable, modellable, simulatable, evaluatable, and correctable effects).

Remmelt · 2mo

Actually, it looks like there is a thirteenth lawsuit, filed outside the US.

A class-action privacy lawsuit filed in Israel back in April 2023.

Wondering if this is still ongoing: https://www.einpresswire.com/article/630376275/first-class-action-lawsuit-against-openai-the-district-court-in-israel-approved-suing-openai-in-a-class-action-lawsuit

Remmelt · 2mo

That's an important consideration. Good to dig into.
 

I think there are many instances of humans, flawed and limited though we are, managing to operate systems with a very low failure rate.

Agreed. Engineers are able to make very complicated systems function with very low failure rates. 

Given the extreme risks we're facing, I'd want to check whether that claim also translates to 'AGI'.

  • Does the way we manage how current software and hardware systems operate correspond soundly with how self-learning and self-maintaining machinery ('AGI') would control how its own components operate?
     
  • Given 'AGI' that no longer needs humans to operate and maintain its own functional components over time, would the 'AGI' end up operating in ways that are categorically different from how our current software-hardware stacks operate? 
     
  • Given that we can manage to operate current relatively static systems to have very low failure rates for the short-term failure scenarios we have identified, does this imply that the effects of introducing 'AGI' into our environment could also be controlled to have a very low aggregate failure rate – over the long term across all physically possible (combinations of) failures leading to human extinction?

     

to spend extra resources on backup systems and safety, such that small errors get actively cancelled out rather than compounding.

This gets right into the topic of the conversation with Anders Sandberg. I suggest giving that a read!

Errors can be corrected with high confidence (consistency) at the bit level. Backups and redundancy also work well in eg. aeronautics, where the code base itself is not constantly changing. (See the toy sketch after the questions below.)

  • How does the application of error correction change at larger scales? 
  • How completely can possible errors be defined and corrected for at the scale of, for instance:
    1. software running on a server?
    2. a large neural network running on top of the server software?
    3. an entire machine-automated economy?
  • Do backups work when the runtime code keeps changing (as learned from new inputs), and hardware configurations can also subtly change (through physical assembly processes)?
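
For concreteness, here is a minimal sketch of what bit-level error correction can look like, using a generic repetition code with majority voting (a textbook scheme used here as a stand-in, not the mechanism from the linked conversation). It only works because the valid codewords are fixed and known in advance, which is exactly what the questions above put in doubt at larger, changing scales:

```python
# Minimal sketch: bit-level error correction via a triple-repetition code.
# Assumption: a generic textbook scheme, shown only to make
# "error correction at the bit level" concrete.

def encode(bits: list[int]) -> list[int]:
    """Repeat every bit three times."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(encoded: list[int]) -> list[int]:
    """Majority-vote each triple; corrects any single flipped bit per triple."""
    return [1 if sum(encoded[i:i + 3]) >= 2 else 0
            for i in range(0, len(encoded), 3)]

message = [1, 0, 1, 1]
sent = encode(message)
sent[4] ^= 1                      # a single bit flips in transit
assert decode(sent) == message    # ...and is corrected by majority vote
```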

     

Since intelligence is explicitly the thing which is necessary to deliberately create and maintain such protections, I would expect control to be easier for an ASI.

It is true that 'intelligence' affords more capacity to control environmental effects.

Notice too that more 'intelligence' means more information-processing components. And the more information-processing components are added, the exponentially more degrees of freedom of interaction those and other functional components can have with each other and with connected environmental contexts. 

Here is a nitty-gritty walk-through, in case it is useful for clarifying components' degrees of freedom.
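
As a toy illustration of that combinatorial point, here is a back-of-the-envelope sketch (my own, not the linked walk-through; "interaction channel" is a made-up stand-in term). Even just counting which subsets of components could interact, the count grows exponentially with the number of components:

```python
# Toy calculation (illustration only; real degrees of freedom also depend
# on component internals and on couplings with the environment).
from math import comb

def interaction_channels(n_components: int) -> tuple[int, int]:
    """Return (pairwise channels, multi-component subsets) for n components."""
    pairwise = comb(n_components, 2)                     # grows roughly as n^2
    all_subsets = 2 ** n_components - n_components - 1   # grows as 2^n
    return pairwise, all_subsets

for n in (10, 20, 40):
    pairs, subsets = interaction_channels(n)
    print(f"{n} components: {pairs} pairwise channels, "
          f"{subsets} possible multi-component interaction patterns")
```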

 

 I disagree that small errors necessarily compound until reaching a threshold of functional failure.

For this claim to be true, the following has to be true: 

a. There is no concurrent process that selects for "functional errors" as convergent on "functional failure" (failure in the sense that the machinery fails to function safely enough for humans to exist in the environment, rather than that the machinery fails to continue to operate).  

Unfortunately, in the case of 'AGI', there are two convergent processes we know about:

  • Instrumental convergence, resulting from internal optimization:
    code components being optimized for (an expanding set of) explicit goals.
     
  • Substrate-needs convergence, resulting from external selection: 
    all components being selected for (an expanding set of) implicit needs.
     

Or else – where there is indeed selective pressure convergent on "functional failure" – the following must be true for the quoted claim to hold:

b. The various errors introduced into and selected for in the machinery over time could be detected and corrected for comprehensively and fast enough (by any built-in control method) to prevent later "functional failure" from occurring.

Remmelt · 2mo

This took a while for me to get into (the jumps from “energy” to “metabolic process” to “economic exchange” were very fast).

I think I’m tracking it now.

It’s about metabolic differences – that is, differences in how energy is acquired and processed from the environment (and also the use of a different “alphabet” of atoms available for assembling the machinery).

Forrest clarified further in response to someone’s question here:

https://mflb.com/ai_alignment_1/d_240301_114457_inexorable_truths_gen.html

Remmelt · 2mo

Note:  
Even if you are focussed on long-term risks, you can still whistleblow on egregious harms caused by these AI labs right now.  Providing this evidence enables legal efforts to restrict these labs. 

Whistleblowing is not going to solve the entire societal governance problem, but it will enable others to act on the information you provided.

It is much better than following along until we reach the edge of the cliff.
