I don't think this is very likely to pan out, but a possible path to alignment is formal goal alignment, which is basically the following two-step plan:

  1. Define a formal goal that robustly leads to good outcomes under heavy optimization pressure
  2. Build something that robustly pursues the formal goal you give it

I think the best current proposal for step 1 is QACI. In this post, I propose an alternative that is probably worse overall but definitely not Pareto-worse.

High-Level Overview

Step 1.1: Build a large facility ("The Vessel"). Populate The Vessel with very smart, very sane people (e.g. Eliezer Yudkowsky, Tamsin Leake, Gene Smith), along with labs and equipment that would be useful for starting a new civilization.

Step 1.2: Mark The Vessel with something that is easy to identify within the Tegmark IV multiverse ("The Vessel Flag").
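The post doesn't specify how the Flag would be constructed. One natural instantiation (borrowed from QACI's random "blobs", and purely my illustration) is a long string of cryptographic randomness, which essentially never arises by chance and so can be matched on nearly unambiguously:

```python
import secrets

# Hypothetical construction of a Vessel Flag: 4096 bits of
# cryptographic randomness. No simple process produces this exact
# string by accident, so (under a simplicity prior) anything
# containing it is overwhelmingly likely to be a copy of *our* flag.
VESSEL_FLAG = secrets.token_bytes(512)  # 512 bytes = 4096 bits

# The flag would then be physically instantiated on or around
# The Vessel in some hard-to-miss medium (unspecified in the post).
print(VESSEL_FLAG.hex())
```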

Step 1.3: Leave the people and stuff in The Vessel for a little while, and then destroy The Vessel Flag and dismantle The Vessel.

Step 2: Define CCS as the result of the following:

Step 2.1: Grab The Vessel out of a Universal Turing Machine, identifying it by The Vessel Flag (this is the very, very hard part).
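To make "grab The Vessel out of a Universal Turing Machine" concrete, here is a minimal sketch of the kind of search this names: dovetail over all programs, run each for ever-larger step budgets, and return the first state containing the Flag. Every function passed in is a hypothetical placeholder, the search is uncomputable in general, and a real proposal would presumably weight matches by program simplicity rather than taking the first hit:

```python
from itertools import count

def find_vessel(flag, run_for, contains, extract):
    """Dovetailed search over UTM programs for The Vessel Flag.

    flag             -- the bit pattern marking The Vessel
    run_for(p, t)    -- state of program p after t steps (hypothetical)
    contains(s, f)   -- does state s embed flag f? (hypothetical)
    extract(s, f)    -- pull the Vessel-region out of s (hypothetical)
    """
    for budget in count(1):            # ever-larger total budgets...
        for p in range(budget):        # ...shared across ever-more programs
            state = run_for(p, budget - p)
            if contains(state, flag):
                return extract(state, flag)
```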

Step 2.2: Locate the solar system that contains The Vessel, and run it back 2 billion years (this is another very hard part).

Step 2.3: Put The Vessel on the Earth in this solar system, and simulate the solar system until either a success condition or a failure condition is met. The idea here is that The Vessel's inhabitants repopulate the Earth with a civilization much smarter and saner than ours that will have a much easier time solving alignment. More importantly, this civilization will have effectively unlimited time to solve alignment.

Step 2.4: The success condition is the creation of The Output Flag. Accompanying The Output Flag is some data; interpret that data as a mathematical expression.

Step 2.5: Evaluate this expression and interpret it as a utility function.
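Putting steps 2.1 through 2.5 together, the shape of the definition looks roughly like the sketch below. Every attribute of `ops` is a hypothetical (and wildly uncomputable) primitive named purely for illustration; the point is only to make the data flow explicit, and it also surfaces that the failure-condition case is left unspecified in the post:

```python
def CCS(ops, vessel_flag, output_flag):
    """Sketch of the CCS definition (steps 2.1-2.5). `ops` bundles
    hypothetical primitives; none of this is meant to be computable."""
    vessel = ops.find_vessel(vessel_flag)                   # 2.1: grab The Vessel from a UTM
    system = ops.locate_solar_system(vessel)                # 2.2: find its solar system...
    system = ops.rewind(system, years=2_000_000_000)        # 2.2: ...and run it back 2 billion years
    state = ops.place_on_earth(system, vessel)              # 2.3: insert The Vessel
    while not (ops.succeeded(state) or ops.failed(state)):  # 2.3: simulate until a condition fires
        state = ops.step(state)
    if ops.failed(state):
        return None                                         # failure handling is unspecified in the post
    data = ops.read_payload(state, output_flag)             # 2.4: data accompanying The Output Flag
    expr = ops.parse_expression(data)                       # 2.4: interpret it as a math expression
    return ops.as_utility_function(expr)                    # 2.5: evaluate; interpret as a utility function
```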

Step 3: Build a singleton AI that maximizes E[CCS(world)].
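In symbols (my notation; the post just writes E[CCS(world)]): writing U for the utility function the CCS procedure outputs, the singleton's policy is chosen to maximize expected utility under its joint uncertainty over U and over which world its policy produces:

```latex
\pi^{\ast} \;=\; \operatorname*{arg\,max}_{\pi}\;
  \mathbb{E}_{\,U \sim \mathrm{CCS},\; w \sim P(\cdot \mid \pi)}\!\left[\, U(w) \,\right]
```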

The Details

TODO: I will soon either update this post or make more posts with more details as I come up with them.

CCS vs QACI

  • QACI requires a true name for "counterfactual", but that's about it. It just needs to ask, "If we replace this blob with a question, what will most likely replace the answer blob?" Physics and everything else are expected to be inferred from the existence of this "question" blob. CCS, on the other hand, requires a prior specification of an approximation of physics at least good enough to simulate an Earth with humans for billions of years, or maybe some weird ontology-translation scheme.
  • QACI is a function that must be called recursively (since we aren't expecting anyone to solve alignment fully within the short interval), creating a big complicated call graph (see the sketch after this list). There are lots of clever tricks for preventing this from causing a memetic catastrophe, but there are also lots of places those tricks can fail. CCS, on the other hand, only needs to be called once. The simulacra solving alignment have a LOT more time than we do, and they can build an entire civilization optimized around our/their goal.
  • QACI is vulnerable to superintelligences launched within the simulated world (since it is the modern world with all of its AI development, and a bunch of timelines might be dying during the QACI interval without us realizing). CCS, on the other hand, simulates a very small world (just the solar system) with a civilization that will quickly become powerful enough to prevent any other intelligence from evolving.
  • The output is easier to "grab" from QACI, since it's just a file on a computer that can straightforwardly be interpreted as a math expression. But I think that if we figure out how to grab The Vessel, we can probably use a very similar method to grab the output.
  • In general, CCS seems safer but also much harder than QACI.
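As promised above, here is the structural contrast from the second bullet in code (all primitives hypothetical): QACI's definition unrolls into a chain of counterfactual queries, while CCS is invoked exactly once and all the iteration happens inside the simulated civilization.

```python
def qaci_answer(ask, refine, is_final, question):
    """QACI-shaped evaluation: each short counterfactual interval
    returns a partial answer that may pose a follow-up question, so
    the definition unrolls into a (potentially huge) call graph.
    All four arguments are hypothetical primitives standing in for
    the counterfactual-query machinery."""
    answer = ask(question)
    while not is_final(answer):    # the recursion, flattened to a loop
        answer = ask(refine(answer))
    return answer

# CCS, by contrast, is called exactly once (see the CCS sketch above);
# the analogous iteration lives *inside* the simulated civilization.
```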