Actually, I think the universal prior being malign does break this. (I thought it might be only a little malign, which would be okay, but after a little reading it appears that it might be really malign!)
A crude example of how this might impact IMDEAR: while using Solomonoff inductive inference to model the human, the malign prior sneakily inserts evil nanobots into the model of the bloodstream. (This specific issue can probably be patched, but there are more subtle ways it can mess up IMDEAR.)
Even creating a model of the simulation environment is messed up, since I planned on using inference for the difficult part.
The best we can hope for, I think, is to find a different prior that isn't malign; for now we just leave the prior as a free variable. (See some of the pingbacks on the universal prior post for approaches in this direction.) But I'm not sure how likely we are to find such a prior. 🤔
Also, Paul Christiano has a proposal with similar requirements to IMDEAR, but at a lower tech level: Specifying a human precisely (reprise).
The alternative is to adjust IMDEAR to not use Solomonoff induction at all, and to define/model everything directly, but this is probably much harder.
it is not the case that, simply, "the universal prior is malign". various universal priors (solomonoff induction, levin search, what QACI does, many other options…), used in various ways, are malign to various extents. it depends a lot on what you're doing.
i'm quite hopeful that we can get sufficiently-not-malign uses of some universal prior for QACI, and thus probably also for IMDEAR (conditional on the rest of IMDEAR being workable).
Update: I'm pretty sure that the universal prior being malign defeats this proposal (see this comment). I'll try to think of ways to salvage it, but for now I'm not sure that IMDEAR is feasible. It might still be worth working on some of these steps ahead of time in case the issue is resolved, but for now that does not seem likely.
Thanks to Justis Mills for feedback (through the LessWrong feedback service) on an earlier draft of this post.
Goal
I propose an outer alignment solution called IMDEAR (Inference from a Mathematical Description of an Existing Alignment Researcher) that can be pursued at our current tech level. Then I outline an actual engineering program for how we can start to pursue it. The most ambitious outcome is that we produce a single artifact that, when combined with an algorithm for inner aligned AGI (with certain desiderata), solves the full alignment problem, including the strawberry alignment problem. Even if IMDEAR fails, I think the engineering program will still produce artifacts that are useful for alignment overall.
I also expect it to have nearly no capabilities implications, to be suitable for a large group effort (similar to mechanistic interpretability), and to rely mostly on skills very different from current alignment and AI research (so it hopefully won't need to compete for talent with other alignment projects). In particular, the bulk of the engineering effort will be based on skills in physics, applied psychology, and programming. Even without these skills, project members can still be useful in helping research and brainstorm designs for the IRL training camp and furnishing the simulation environment. (A couple people with leadership skills would also be good since I'm bad at that, lol.) I also expect it to have a decent feedback loop on progress; either we are working towards the final artifact or we are stuck.
Finally, I will create a group chat for the discussion of this proposal, and perhaps even to let people coordinate the engineering program. If no objections to the engineering program stand, I see no reason not to start!
If you wish to contribute or observe, jump to the chatroom section.
Background
IMDEAR is an indirect normativity proposal. Most of the credit is due to Paul Christiano, because IMDEAR is just a modification of A formalization of indirect normativity. My main contribution is to change the definition of the initial conditions of the hypothetical simulation to be more feasible. I also make a small change (which I'm uncertain about) where the AGI chooses actions it predicts to be corrigible as defined by the formalization, instead of maximizing a utility chosen by the formalization.
A newer proposal similar to Paul's is question-answer counterfactual intervals (QACI).
High-level idea
IMDEAR is a mathematical definition for a solution to corrigibility that a specific alignment researcher would discover, given infinite computing power and extreme caution. This is similar to CEV.
We tackle this in an extremely straightforward and literal way. We do not define an idealized alignment researcher that we connect to a Turing machine. Instead, a simplified explanation of IMDEAR is this:

1. Mathematically identify a specific, existing alignment researcher E via their physical configuration (their DNA plus a few other identifiers).
2. Define, but never run, a hypothetical simulation that places E in a prepared environment with unlimited computing power.
3. In the simulation, E takes as long as they need to produce a corrigible super-intelligence, which outputs a plan.
4. The real-world AGI predicts the output of this simulation and acts on it.
The rest of the post goes over additional details, such as including information to identify what time and reality we are in (since the alignment researcher's DNA alone is not enough information to capture their values), and carefully selecting and preparing the alignment researcher and their environment.
The AGI cannot hack the reward signal, because there is none. Messing with the alignment researcher IRL accomplishes nothing, because we have hard-coded their physical description into the formalization. The only thing it can do is guess what the simulation will result in. Even though we cannot compute it exactly, we as humans can guess that the result will involve not killing all humans. The AGI will surely be able to guess that as well! The genie knows what we want!
Desiderata for the inner aligned AGI to be paired with IMDEAR
The separation between inner and outer alignment can be a bit murky, so let us define exactly what we want.
We need a way to give a program a purely mathematical description of a problem to solve. For the purposes of IMDEAR this will be "predict the result very well" (at least as well as a human mathematician), but we can also work with "generate mathematical outputs that maximize the utility function defined by the description" (this is what Paul Christiano uses). IMDEAR does the ontology identification.
We can idealize the inner alignment solution as a logical inductor. (We cannot directly use logical induction because it is not efficient enough, but other than the inefficiency it is perfect.)
Given IMDEAR and the inner alignment solution, we can write a program to solve alignment:
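As a rough reconstruction (the exact signature of `imdear.do` is illustrative, but the names match the ones used later in this post):

```python
# Rough sketch of the top-level program. `imdear` and `inner_alignment` are
# the libraries this post proposes to build; the signature is illustrative.
import imdear
import inner_alignment

imdear.do(
    researcher=me,              # physical identifiers of the alignment researcher E
    prompt="Duplicate a strawberry.",
    internal_sensors=sensors,   # placeholder: data tying the AGI's model to our reality
    predictor=inner_alignment,  # guesses the output of the hypothetical simulation
)
```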
This is not pseudocode. An engineering project for IMDEAR would focus on actually creating the `imdear` python library and uploading it to PyPI (among other formats; to make it as easy as possible to use we would want to port it to many different ones). The only difference is that it probably won't be me in the `researcher` argument!

An AGI can guess the solution to a transcomputational problem?
Yes, just not perfectly.
Note that I am not the first one to observe this, but it isn't particularly well documented, so let me write a section about it.
From Christiano's A formalization of indirect normativity:
From MIRI's Questions of Reasoning Under Logical Uncertainty:
From Tammy Carado's QACI proposal:
So this is already a fairly common requirement in other alignment proposals (including the two most mature schools of alignment thought).
In particular, IMDEAR relies on these two similar assumptions (which AFAIK have not been named, so I will name them).
Superhuman mathematical credence: For any proper scoring rule and any specific problem (including transcomputational ones), any human (or human organization) should expect the AI to get an equal or better score than them.
Superhuman mathematical optimization: For any mathematical function from strings to real numbers, any human (or human organization) should expect the AI to be able to choose a string that scores as high or higher than what they would choose.
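These can be given a rough formalization; the notation below is my own, with expectations taken under the human's beliefs.

```latex
% Rough formalization; notation is mine. Expectations E_H are under the
% human's (or organization's) beliefs; S is any proper scoring rule; phi is
% any specific mathematical problem; p_AI, p_H are the AI's and the human's
% credences; x_AI, x_H are the strings each would choose.

% Superhuman mathematical credence:
\mathbb{E}_H\big[ S(p_{\mathrm{AI}}, \phi) \big] \ \ge\ \mathbb{E}_H\big[ S(p_H, \phi) \big]

% Superhuman mathematical optimization, for any f : \{0,1\}^* \to \mathbb{R}:
\mathbb{E}_H\big[ f(x_{\mathrm{AI}}) \big] \ \ge\ \mathbb{E}_H\big[ f(x_H) \big]
```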
Note that AGI, under most definitions, satisfies these two properties. A more difficult assumption that IMDEAR relies on is that one can point AGI at any mathematical function, but hopefully this is a useful reduction from the full alignment problem.
As an illustration of how these assumptions get at the hard bit of alignment, consider AIXI. What do I expect it to do? I cannot know for sure what it will do due to Vinge's principle, but I can say that it will do whatever maximizes its expected reward signal, which at a high capability level likely means seizing control of its own reward channel rather than doing anything we value.
What if I tell it I will give it more reward if it duplicates a strawberry? It still just maximizes the reward signal, so I expect it to manipulate or bypass me rather than to actually duplicate the strawberry.
Okay, but what if instead of AIXI, I have a powerful IMDEAR maximizer with the prompt "duplicate a strawberry"? Then, even without computing the simulation, I can guess that its output involves duplicating the strawberry without killing all humans, and by superhuman mathematical credence the AGI can make that guess at least as well as I can.
Components of IMDEAR
We will denote the alignment researcher that will be formalized by the letter E.
Model of physics
First we will need a model of physics. Paul Christiano proposed modelling psychology instead:
But due to some changes in IMDEAR, we can do the much easier task of modelling physics.
In particular, we just need a way to encode a configuration of particles at a single snapshot in time in a large volume (either a single room or the entire Earth and its atmosphere, depending on some other details). This encoding needs to be good enough that we can identify DNA molecules by their nucleic acid sequence and define distances between molecules. The ability to "take photos" might be useful as well.
Instead of solving psychology, we create a couple of identifiers that allow us to search for a configuration of particles that contains E. There will be multiple such configurations, and so we weigh them according to something like Solomonoff Induction.
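To illustrate the shape of this search, here is a minimal sketch; every name below is a hypothetical placeholder for the physics model's interface, not a real API.

```python
from dataclasses import dataclass

@dataclass
class Configuration:
    """A snapshot of particle positions/species in the chosen volume."""
    particles: list

    def dna_sequences(self) -> list:
        """Nucleotide sequences (as bytes) of every DNA molecule in the snapshot."""
        ...

def all_configurations():
    """Enumerate every encodable snapshot. Uncomputable in practice; the
    whole point is that we only ever *define* this search, never run it."""
    ...

def simplicity_weight(config: Configuration) -> float:
    """Solomonoff-style prior: 2**(-K), where K approximates the length of
    the shortest program producing this snapshot."""
    ...

def candidate_distribution(identifiers) -> dict:
    """Keep every configuration satisfying all identifiers (timestamp,
    realitystamp, identitystamp), weighted by simplicity."""
    return {c: simplicity_weight(c)
            for c in all_configurations()
            if all(check(c) for check in identifiers)}
```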
Note that this use of identifiers is the most dangerous part compared to Paul Christiano's post. That's because we are doing a large unbounded search. I think it can be sufficiently mitigated though.
Identifiers
Timestamp
To avoid a situation where entities in the future recreate the other identifiers, we need a timestamp. We will require some method of data sanitization for this step. It does not matter too much which one we pick. Without loss of generality, I will assume that we throw the data storage medium into the volcano Kīlauea.
While standing at the edge of Kīlauea, randomly generate a message, encode it into a string of DNA (or some other format readable by the physics model like the one used in A Boy and His Atom), compute a cryptographic hash H of the message, and then cast everything except for the cryptographic hash into the fire.
The timestamp now checks the condition "the configuration of particles contains a strand of DNA whose cryptographic hash is H". There will be many such configurations, but it is very unlikely that any future configuration of our reality will contain one, since every copy of the message other than its hash was destroyed.
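The cryptographic half of this procedure is standard. A minimal sketch in Python, assuming the hypothetical `dna_sequences` accessor from earlier (the DNA encoding step itself is elided):

```python
import hashlib
import secrets

# At the edge of Kīlauea: generate the message and keep only its hash.
message = secrets.token_bytes(64)        # random, so it cannot be guessed later
H = hashlib.sha256(message).hexdigest()  # the only part that survives

# (elided) encode `message` into a strand of DNA readable by the physics
# model, then cast everything except H into the fire.

def timestamp_holds(config) -> bool:
    """Check: does the snapshot contain a DNA strand whose decoded contents
    hash to H? `config.dna_sequences()` yields bytes (hypothetical)."""
    return any(hashlib.sha256(seq).hexdigest() == H
               for seq in config.dna_sequences())
```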
Realitystamp
Before creating the timestamp, encode all of the internet, at different points in time, into DNA (something like this has already been done for Wikipedia). The realitystamp checks whether the configuration of particles contains that piece of DNA. Thanks to our use of Solomonoff Induction, our reality is the most likely one to contain such a sequence of DNA (since our reality's actual history is the simplest explanation of how the snapshots differ across those points in time).
Identitystamp
The identitystamp is the part where we actually identify the alignment researcher E separately from all other physical objects in our reality. We sequence E's DNA. Then we formalize E as the largest clump of particles that are within some distance of a DNA molecule that matches the sequence.
If E has an identical twin, we might need some other biometric identifier such as iris recognition (assuming our physics model can handle it).
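A sketch of the check, with the same caveat that the physics-model accessors (`dna_molecules`, `clump_around`) are hypothetical placeholders:

```python
# Hypothetical sketch of the identitystamp check.

E_SEQUENCE = "ACGT..."  # E's sequenced DNA (elided placeholder)

def identify_E(config, radius):
    """Formalize E as the largest clump of particles within `radius` of a
    DNA molecule whose sequence matches E's, or None if no molecule matches."""
    anchors = [m for m in config.dna_molecules() if m.sequence == E_SEQUENCE]
    clumps = [config.clump_around(m.position, radius) for m in anchors]
    return max(clumps, key=len, default=None)
```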
The hypothetical simulation
We call it the hypothetical simulation because, even though we are writing actual code, we never actually run it. We just have the AGI guess what the result would be if we hypothetically ran it, hence the name.
Simulation environment
Now that we have mathematically formalized E, we need to create the simulation environment that we drop E into. Keep in mind that since we are just creating a mathematical description, we do not need to actually run the simulation, so it can use an unbounded amount of computation.
The simulation environment should contain a sign saying that the alignment researcher is in an IMDEAR simulation (they should already know about IMDEAR of course since we selected them), a supercomputer with unlimited computing power (this would require modifying the physics model a bit), and some amenities. For example, we might include a recreation center and a button outside of it labelled "reset recreation center". (Resetting it back to its original state is an easy way to restock and clean it.)
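As a concrete illustration (entirely hypothetical; the real specification would live in the physics model's encoding), the environment might start out as simple data:

```python
# Entirely hypothetical sketch of the simulation environment specification.
environment = {
    "sign": "You are inside an IMDEAR simulation.",
    "supercomputer": {
        "compute": "unbounded",               # requires modifying the physics model
        "preinstalled": ["inner_alignment"],  # so E only needs to solve outer alignment
    },
    "amenities": ["recreation center"],
    "buttons": {
        "reset recreation center": "restore the recreation center to its initial state",
    },
}
```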
The supercomputer has a copy of `inner_alignment` so that they only need to worry about outer alignment.

Preparing the alignment researcher
Even with the best possible simulation environment, we want to give E the best chance of mental stability and corrigibility. Before taking the identitystamp, we will want to carefully select E and prepare them. This is analogous to the process used to pick and train astronauts. See Mental preparation for Mars as an example.
Alignment researcher creates an aligned and corrigible super-intelligence
In the simulation, E creates an aligned and corrigible super-intelligence. They do this very carefully, because creating an unaligned super-intelligence would be very bad. They take as much time as they need to solve alignment, and then somehow output the result. They should not just use IMDEAR again (unless they improve it somehow).
Aligned super-intelligence decides on an action
The aligned super-intelligence now sees the `prompt` and `internal_sensors` arguments to `imdear.do` and creates a plan to accomplish the task. It then somehow outputs this. The plan might suggest more calls to `imdear.do`, or it might write an entirely new AGI, or it could suggest a direct solution that humans can carry out.

Note that even though the AGI in real life is dumber than this super-intelligence (because the super-intelligence has more computing power), it can predict that it will be both competent and corrigible, in the same way that humans can predict that Stockfish will be good at chess.
Formal verification
We cannot formally verify the end result, of course, but it should be built on formally verified software as much as possible, alongside other forms of quality control.
Weaknesses and further questions
How research on IMDEAR ties into other alignment approaches
Sketch for an engineering project
How do we accomplish IMDEAR? Here is a sketch:

1. Build the particle-level physics model and its snapshot encoding.
2. Implement the identifiers: timestamp, realitystamp, and identitystamp.
3. Design and furnish the simulation environment, and design the IRL training camp.
4. Select and prepare the alignment researcher E.
5. Assemble everything into the imdear python library and publish it.
Chatroom
If you wish to contribute to or even just observe IMDEAR, join the EleutherAI discord server and then go to the IMDEAR project (this is just a thread basically, IMDEAR is not endorsed by EAI).
I have already started a framework we can build on. I have tried to set it up so that people can make sporadic or even one-off contributions, like Wikipedia. We will refine it at some later stage, but for now it just needs content!