My alignment proposal involves aligning an encoding of human-friendly values and then turning on a self-improving AGI with that encoding as its target. Obviously this involves "aligning an encoding of human-friendly values" and also "turning on a self-improving AGI with a specific target", two things we currently do not know how to do...

As expected, this AGI would then make plans and work on tasks that move us toward a more human-friendly universe. By Vingean reflection, I am sure that I can't know what those tasks would be, only that they would be more effective than the tasks that human-level intelligences would come up with. I'm quite confident some of them would involve planning, experimental science, and engineering, though there may be ontologies that work better than those. I speculate the AGI's plans would involve using available computer-controlled machinery to research and manufacture the tools needed to eliminate human disease and provide a post-scarcity economy, in which humanity can resolve our disagreements and decide what we want to do with our cosmic potential. I imagine the AGI would also enlist the help of people for tasks that computer-controlled machinery cannot yet do more efficiently, likely paying them as employees at first and communicating with them through generated speech, diagrams, and other files as required. But this is all speculation that I am confident I cannot accurately predict, since my strategizing is not superintelligent.

Unfortunately, my proposal doesn’t have many concrete details. It is more of a map of the unsolved issues and roughly the order in which I believe they need to be solved. This plan gives proper focus to avoiding the accidental creation of self-improving systems with human-incompatible targets, which would lead to human extinction. I'm stressed out by the currently ubiquitous hand-waving of "this obviously isn't a system capable of recursive self-improvement", despite there being no formal reasoning to show why we should believe that to be the case, and even less reason to believe the system couldn't be used as a component in such a system, or exhibit other dangerous capabilities.

The plan comprises 9 steps, some of which can be worked on in parallel, while others strictly depend on certain previous steps. The exact way they fit together is important, but is left as an exercise for the reader (for now). Much of it may need to be iterative.


My plan as of right now would be:

1.) SI borders and bounds: Complete a model of the optimization power, domain, and recalcitrance of systems in general that can be used to predict the ‘criticality’ of a system, that is, the rate at which it would lose coherence or improve itself if it were set to, or decided to, re-engineer itself. This should encompass the development of a theory of general decision/optimization systems, probably by drawing inspiration from many existing fields. It should apply coherently to large and complicated systems such as business organizations, memetic systems, and other systems composed of many computer programs and/or other parts. Emphatically, it should apply to large ML models and the systems they are being, or could be, deployed into.
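
To give a flavor of what I mean by 'criticality', here is a deliberately toy sketch in Python. The structure (`SystemProfile`, `domain_breadth`, the supercritical threshold) is entirely my own placeholder, not the theory this step asks for; the only grounded piece is Bostrom's framing of the rate of self-improvement as optimization power divided by recalcitrance.

```python
from dataclasses import dataclass

@dataclass
class SystemProfile:
    optimization_power: float  # effort the system can direct at improving itself
    recalcitrance: float       # how hard the system is to improve (> 0)
    domain_breadth: float      # 0..1: how much of its own design it can actually reach

def criticality(s: SystemProfile) -> float:
    """Toy criticality score (a placeholder, not the real theory step 1 calls for).

    Uses Bostrom's dI/dt = optimization_power / recalcitrance, scaled by how
    much of its own design the system can act on. A score above 1.0 is read
    here as 'supercritical': improvements would compound rather than fizzle.
    """
    return (s.optimization_power / s.recalcitrance) * s.domain_breadth

# A system with modest power but very low recalcitrance and full self-access
# gets flagged, even if it looks unimpressive on any single task.
print(criticality(SystemProfile(optimization_power=2.0, recalcitrance=0.5, domain_breadth=1.0)))
```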

2.) Model building: This step can only progress safely as informed by progress in step 1. Work could be done building large multi-modal world models, similar to the LLMs of today, taking extreme care not to put any into a supercritical state as defined in step 1.

(I originally wrote this before LLMs had progressed to the point they have today... I now feel that it is possible we have progressed with step 2 as far as we will ever need to. I think we need progress in the other steps to determine if/when any further AI model development is necessary.)

I can speculate that limits on further model development would be imposed by some combination of a) limiting optimization power, b) specific bounded optimization targets, c) limited domain, or d) increased recalcitrance. I favor (a) and (c), but I haven’t seen work done on (d). Perhaps cryptographically obfuscating the functioning of an AI from itself could be done in such a way that self-improvement is provably impossible for some specific system. Similarly, agents built out of prompting LLM APIs don't obviously have access to their own weights, but that again is subject to cryptographic considerations. (b) seems like Bostrom's description of domesticity, which makes me uncomfortable, but may turn out to be quite reasonable.
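
To illustrate the point about agents built from prompting LLM APIs, here is a minimal sketch. The `complete` function is just a stand-in for whatever hosted completion API is used; nothing in this loop hands the model its own weights, though that says nothing about other channels it might gain.

```python
from typing import Callable

def run_agent(complete: Callable[[str], str], goal: str, max_steps: int = 10) -> str:
    """Minimal agent loop in which the model is reachable only through text.

    The agent reads and writes a scratchpad, but its own parameters never
    appear anywhere in this process -- self-modification would have to happen
    through some entirely separate channel.
    """
    scratchpad = f"Goal: {goal}\n"
    for _ in range(max_steps):
        action = complete(scratchpad + "Next action:")
        scratchpad += f"Next action: {action}\n"
        if "DONE" in action:
            break
    return scratchpad
```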

3.) Value target finding: Compile a taxonomy of ways to refer to the thing the system should be aligned to, ranging from simple prompt phrases like “CEV”, “human friendliness”, “human values”, “global values”, etc., to complex target specification schemes such as those involving integrating knowledge from diverse brain readings. It is important to draw on diverse worldviews to compile this taxonomy.
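
A sketch of what one shelf of that taxonomy might look like as a data structure; the entries and categories here are placeholders I made up for illustration, not a real proposal.

```python
# Hypothetical shape for the step-3 taxonomy: each way of referring to the
# target is paired with the kind of specification scheme it would need.
value_target_taxonomy = {
    "CEV":                {"kind": "prompt_phrase", "spec": "coherent extrapolated volition of humanity"},
    "human friendliness": {"kind": "prompt_phrase", "spec": "act in ways friendly to humans"},
    "human values":       {"kind": "prompt_phrase", "spec": "the values held by humans"},
    "global values":      {"kind": "prompt_phrase", "spec": "values shared across the world's cultures"},
    "brain-reading spec": {"kind": "complex_scheme", "spec": "preferences integrated from diverse brain readings"},
}
```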

4.) Value encoding: Research the embeddings of the elements from step 3 in the models developed in step 2. For those for which it is infeasible to generate embeddings, research why they may be valuable and why it may be okay to proceed without them. The initial focus should be on how the reference embeddings relate to one another and how they relate across models. A very good outcome of this stage would be the development of a multimodal mapping to a semantic space, and a vector within that space which stands as a good candidate to be the optimization target for a future superintelligent system.
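
As a concrete (and heavily simplified) starting point for this step, one could compare how the reference phrases from step 3 sit relative to each other in a given model's embedding space, and whether that structure is stable across models. `embed` and `model.encode` below are placeholders for whatever text-embedding function is under study.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compare_value_references(embed, phrases: list[str]) -> dict[tuple[str, str], float]:
    """Pairwise cosine similarities between embeddings of candidate value references.

    Run this against several models' `embed` functions and compare the
    resulting similarity structures to see how stable the relationships are.
    """
    vecs = {p: np.asarray(embed(p), dtype=float) for p in phrases}
    return {(p, q): cosine(vecs[p], vecs[q])
            for i, p in enumerate(phrases) for q in phrases[i + 1:]}

# e.g. compare_value_references(model.encode, ["CEV", "human friendliness", "human values"])
```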

5.) Value generalization to SI: Using understanding from steps 4 and 1, consider how various systems set to optimize for potential targets from step 4 would generalize as their domain moves from local to global to superintelligent. I suspect this may expose a dissonance between human values and human-friendly values which may require resolution.

6.) Human value embeddings: Study the optimization targets of individual humans and groups of humans and our global civilization within the multimodal mappings developed in step 4.

7.) Double-check value embeddings: Use knowledge from steps 5 and 6 to develop some kind of proof that the proposed target embedding is within the space of target embeddings of humankind. This may be complicated if it turns out that human values and human-friendly values do not intersect as explored in step 5.
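
One toy formalization of "within the space of target embeddings of humankind" (nowhere near a proof in the sense I mean) is to ask whether the candidate target vector is a convex combination of sampled human value embeddings. This assumes scipy and treats "the space" as a convex hull, which is itself a big assumption.

```python
import numpy as np
from scipy.optimize import linprog

def in_human_value_hull(target: np.ndarray, human_points: np.ndarray) -> bool:
    """Is `target` a convex combination of the rows of `human_points`?

    Solves the feasibility LP: find weights w >= 0 with sum(w) = 1 and
    human_points.T @ w = target. Only a toy stand-in for the kind of
    argument step 7 actually needs.
    """
    n = human_points.shape[0]
    A_eq = np.vstack([human_points.T, np.ones((1, n))])
    b_eq = np.concatenate([target, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return bool(res.success)
```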

8.) Global consensus: Publish the findings of step 7 and wait for political agreement on the embedding as the target for superintelligent optimization; then, using the theory from step 1, turn on a supercritical system set to optimize the embedded target under its given multimodal world mapping.

Step 8 leaves me conflicted. I am uncertain whether it is ethical to let the suffering that AGI deployment would prevent continue for however long this would take, but I also feel uncomfortable forcing others to live in a world ruled by an AGI they did not choose.

9.) Be sure: Maybe double-check the calculations before turning on the final machine that humanity will ever have the freedom to fail to have built correctly. After that, we will either be forced to fail, or forced not to fail.


Where do I see myself in this plan?

Currently I'm interested in Mechanistic Interpretability, which I think will strongly support many of the above steps, although, worryingly, it may support step 2 without the proper safety grounding from step 1. This does seem to be the current global situation, afaik...

At some point, after I feel I have intuition into the workings of modern AI systems and have built some credibility as a researcher, I hope to shift my focus to Agent Foundations, especially as it relates to steps 1, 4, and 6.

: )


If you think this is a good/bad plan, or that we are already collectively enacting this plan, or that it would be impossible to coordinate this plan... I'd love to hear your thoughts!
