tl;dr: This is a project to put a model of AGI uncontainability into math. The goal is to make the reasoning straightforward to verify. 

Apply now

It is a standard practice in computer science to first show that a problem doesn’t belong to a class of unsolvable problems before investing resources into trying to solve it or deciding what approaches to try.

   — Roman Yampolskiy


Researchers have tried for years to invent methods to control ‘AGI’, based on the assumption that it must be possible to control some ‘AGI’ well enough that it does not eventually cause extinction.

If this assumption proves to be false, it is crucial for the AI Safety community to know. It would shift our priorities away from trying to somehow make ‘AGI’ safe, toward acting to restrict the development of ‘AGI’.

Dozens of academic results demonstrate fundamental limits to control. In addition, our advising researcher has shared various forms of reasoning for the uncontainability of any ‘AGI’. However, no theorem has been constructed in mathematical notation that proves uncontainability (i.e. insufficiently “complete” control) from empirically sound premises. This project is about attempting to construct and verify such a theorem. 
 

Collaborative context

Our circle of collaborators includes Anders Sandberg, our advising researcher Forrest and his research assistant Justin, and the project lead Remmelt Ellen. We are funded by two grants from the Survival and Flourishing Fund. 

We want to convey the reasoning in a form that can be verified by other researchers. This is hard.

Historically, impossibility theorems based on incompleteness have involved large inferential gaps or difficulties in reaching scientific consensus:

  • Gödel had to invent a new language unfamiliar to mathematicians.
  • The CAP theorem was fiercely debated for years by computer scientists.
  • Bell’s theorem took physicists decades of careful testing to verify.

Forrest has already formalized his reasoning for why AGI’s effects would be uncontainable (and furthermore, converge on human-lethal conditions). But he did so using a combination of computer programs and precise analytical language, which is not a form accessible to other researchers. We need to put the reasoning into mathematical notation, so that other researchers can verify the logical validity of the result. 

Along with the math, we must publish an explanation of what the terms mean and how premises correspond with how the physical world works. Without that explanation, it is not even clear to other researchers what the reasoning is about.

To make explanation easier, Forrest found a way to distill the argument into an intuitive and simple model, over five conversations with Anders:

  • 1st conversation at a conference panel was exploratory.
  • 2nd conversation specified the control theory of AGI (transcript).
  • 4th conversation introduced a framework of input-output relations 
    between the control system, AGI, and the humans’ environment (transcript).
  • 5th conversation reached the hashiness model of AGI uncontainability (transcript).

For this project, you can join a team to construct a mathematical proof of AGI uncontainability based on the reasoning. Or you can work with Anders to identify proof methods and later verify the math (identifying any validity/soundness issues). 

We will meet with Anders to work behind a whiteboard for a day (in Oxford or Stockholm). Depending on progress, we may do a longer co-working weekend. From there, we will draft one or more papers. 

Update: an excellent mathematician from outside AI Safety will start independent work in December toward a math paper. Maybe you can offer feedback for verifying the math and/or communicating the relevance of this string of math symbols to the community. 
 

Overviews of the hashiness model

A key issue is the thesis that AGI will be uncontrollable in the sense that there is no control mechanism that can guarantee aligned behavior, since the more complex and abstract the target behavior is, the more unattainable the resources and forcing ability needed become. 

In order to analyse this better, a sufficiently general toy model is needed for how controllable systems of different complexity can be, one that ideally can be analysed rigorously.

One such model is to study families of binary functions parametrized by their circuit complexity and their "hashiness" (how much they mix information) as an analog for the AGI and the alignment model, and the limits to finding predicates that can keep the alignment system making the AGI analog produce a desired output.

   — Anders Sandberg
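
For readers who want something concrete to poke at, here is a minimal toy sketch (in Python) of the kind of model Anders describes above. To be clear, this is our own illustrative sketch rather than the project’s actual formalism: a random layered boolean circuit stands in for the AGI analog, circuit depth is used as a crude proxy for “hashiness”, and a “predicate” is simply a clamping of a few input bits. The sketch measures how well the best such predicate can guarantee a desired output bit; as the circuit mixes information more, that guarantee tends to weaken.

import itertools
import random

# Toy sketch only: depth stands in for "hashiness" (how much the circuit
# mixes information), and a "predicate" is a clamping of k input bits.
# None of the parameter choices below come from the project itself.

def random_circuit(n_bits, depth, rng):
    """Build a random layered circuit of XOR/AND gates; more depth ~ more mixing."""
    layers = []
    for _ in range(depth):
        layers.append([(rng.randrange(n_bits), rng.randrange(n_bits),
                        rng.choice(["xor", "and"])) for _ in range(n_bits)])

    def f(bits):
        state = list(bits)
        for layer in layers:
            state = [state[i] ^ state[j] if op == "xor" else state[i] & state[j]
                     for (i, j, op) in layer]
        return state[0]  # one output bit stands in for the "behavior" to be controlled

    return f

def best_clamp_guarantee(f, n_bits, k):
    """Try every way of clamping k input bits; return the best achievable
    fraction of the remaining inputs on which the desired output (1) is produced."""
    best = 0.0
    free = n_bits - k
    for positions in itertools.combinations(range(n_bits), k):
        for values in itertools.product([0, 1], repeat=k):
            clamp = dict(zip(positions, values))
            hits = 0
            for rest in itertools.product([0, 1], repeat=free):
                it = iter(rest)
                bits = [clamp[i] if i in clamp else next(it) for i in range(n_bits)]
                hits += f(bits)
            best = max(best, hits / 2 ** free)
    return best

rng = random.Random(0)
n_bits, k = 8, 3
for depth in [1, 2, 4, 8]:
    f = random_circuit(n_bits, depth, rng)
    print(f"depth={depth}: best guaranteed P(output=1) with {k} clamped bits = "
          f"{best_clamp_guarantee(f, n_bits, k):.3f}")

At depth 1 the output depends on at most two input bits, so a small clamp can force it outright; at larger depths it typically depends on many mixed bits and no small clamp suffices. Whether and how that intuition survives rigorous treatment is exactly what the project needs to work out.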

 

To start digging into the hashiness model:

Part of the work for this project is to turn these one-on-one explanations into your own write-up. This way, Forrest can give input, and we can check whether we’re on the same page. This helps reduce important confusions before moving on to doing the math. 
 

We ask that you take your time to understand before criticizing 

Please be prepared to take time to put pieces of the argument together. We’re at the frontiers and it’s going to be messy at first. It took 8 hours of conversation to clearly convey a model of uncontainability to Anders, who is well-known for his ability to dig deep quickly into almost any academic field at the Oxford lunch table. 

While we would not want you to take the truth of the hashiness model on faith, we do ask you to hold back on trying to find flaws in the model until you have acquired an accurate understanding of it. If you have not first taken the time to understand the model, you’re likely to attack a straw version of it, which is not very useful. This means that to some extent we’re asking you to trust that digging into the hashiness model is worth your time, without us being able to provide good evidence for this up front. This is unfortunately unavoidable. 

If you don’t have the patience for this type of project, that is ok, but it also means this project is not for you. Both Remmelt and Forrest have ample experience trying to convey reasoning to people who are not willing to spend the time it takes to understand it, and it usually ends in disappointment and frustration on both sides. 

If this project goes well, we’ll have a much more accessible formulation by April. Meanwhile, we are very happy to take questions meant to clarify the model’s premises, terms, and reasoning. 

We are also open to exploring different ways of collaborating. We want you to have the flexibility and freedom to work through the reasoning yourself, and if something about the team set-up is not going well, we want to know!
 

How to get involved

  1. Try to dig into the explanations linked above. Do email me any specific questions: remmelt@aisafety.camp
  2. Apply for the project (#7 in the list) if you think you can help construct a mathematical model and/or verify the logic. It is a high-commitment project, so we’ll mostly consider applicants who mark it as first preference.
  3. Wait to see if you get invited to calls:
    1. An interview chat with Remmelt, to see if we’re aligned on what the project is about. Possibly, another interview with Forrest or Anders.
  4. If you get invited to the team:
    1. Meet and get to know each other in Signal.
    2. Write out the reasoning roughly yourself in prose, so we can check whether we are on the same page, what premises and reasoning steps are still missing, etc. This intermediate write-up is also useful for taking abstract concepts mentioned in the transcripts and narrowing down any ambiguity of how those concepts correspond to things observed in the real world. You might end up posting your write-up online as a clarifying exposition of the argument.
    3. Find a day to work with Anders behind a whiteboard, depending on where he’s at. Maybe we’ll continue at a coworking retreat in Oxford. Note: basic travel and stay costs are covered.
    4. Continue working remotely, maybe breaking out into smaller dedicated teams. Write out the math and review others’ math.
    5. Draft one or more papers for Forrest and/or Anders to review. Reach out to other researchers for feedback.
    6. Submit paper to a journal, and publish the text on the Alignment Forum.
       

Output

By April, we will have drafted at least one paper that:

  • Introduces the problem of AGI uncontainability.
  • Explains the premises and terms (with a diagram depicting the information flow from Z to Q to X to Z).
  • Presents a mathematical formalization of the hashiness model.
  • Presents any proof by contradiction, derived step-by-step from the premises.
  • Clarifies the scope of application of the proof form (for a control system integrated into AGI, iterated feedback over time, etc.).
  • Concludes on the relevance of the result to future decision-making.

The next step is to extend the uncontainability model to include an uncontainable dynamic convergent on the extinction of organic life. The former formalizes why the internal control feedback loop is insufficiently comprehensive to prevent eventual human-lethal outcomes, whereas the latter formalizes why an external feedback loop is sufficiently comprehensive to eventually converge on human-lethal outcomes. This external feedback loop involves changes in code – as configured out of AGI’s artificial substrate – converging over time on code being expressed as physical effects that maintain and/or increase that code. 


→ To check out the roles you can take, see the doc
→ You can apply at AI Safety Camp. The official deadline is 17 November.