Preamble

My alignment proposal involves aligning an encoding of human-friendly values and then turning on a self-improving AGI with that encoding as its target. Obviously this involves both "aligning an encoding of human-friendly values" and "turning on a self-improving AGI with a specific target", two things we currently do not know how to do...

As expected, this AGI would then make plans and work on tasks that move us towards a more human-friendly universe. By Vingean reflection, I am sure that I don’t know what those tasks would be, only that they would be more effective than the tasks human-level intelligences would come up with. I’m quite confident some of them would involve planning, experimental science, and engineering, although there may be ontologies that work better than those. I speculate the AGI’s plans would involve using available computer-controlled machinery to research and manufacture the tools needed to eliminate human disease and provide a post-scarcity economy, in which humanity can resolve our disagreements and decide what we want to do with our cosmic potential. I imagine the AGI would also enlist the help of people for tasks that computer-controlled machinery cannot yet do more efficiently, likely paying these people as employees at first and communicating with them through generated speech, diagrams, and other files as required. But this is all speculation that I am confident I cannot accurately predict, as my strategizing is not superintelligent.

Unfortunately, my proposal doesn’t have many concrete details. It is more of a map of the unsolved issues and roughly the order I believe they need to be solved in. This plan gives proper focus to avoiding the accidental creation of self-improving systems with human-incompatible targets, which would lead to human extinction. I'm stressed out by the currently ubiquitous hand-waving of "this obviously isn't a system capable of recursive self-improvement", despite there being no formal reasoning to show why we should believe that to be the case, and even less reason to believe the system couldn't be used as a component in such a system, or exhibit other dangerous capabilities.

The plan comprises 9 steps, some of which can be worked on in parallel, while others strictly depend on certain previous steps. The exact way they fit together is important, but is left as an exercise for the reader (for now). Much of it may need to be iterative.


My plan as of right now would be:

1.) SI borders and bounds:

Develop / formalize a branch of math and science focused on the optimization power, domain, and recalcitrance of systems in general, which can be used to predict the ‘criticality’ of a system: the rate at which it would lose coherence or improve itself if it were set to, or decided to, re-engineer itself.

This should encompass the development of a theory of general decision / optimization systems, probably by drawing inspiration from many existing fields. Ideally the theory should apply coherently to large and complicated systems such as business organizations, memetic systems, and other systems composed of many computer programs and/or other parts. Emphatically, it should apply to large ML models and the systems they are being or could be deployed into.
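
To gesture at the kind of quantitative handle this step is after, here is a minimal toy sketch in Python (all quantities are made-up placeholders, not measurements of any real system) of the criticality idea: a growth factor driven by the optimization power a system can apply to its own design, divided by its recalcitrance.

```python
# Toy criticality model: hypothetical numbers only, not a real prediction.

from dataclasses import dataclass

@dataclass
class SystemProfile:
    optimization_power: float  # effective design effort the system can apply to itself
    recalcitrance: float       # how hard the system is to improve (higher = harder)
    domain_breadth: float      # 0..1, how much of its own design it can reach and modify

def criticality(profile: SystemProfile) -> float:
    """Growth factor per step of self-re-engineering in this toy model.
    > 1.0 : supercritical (improvements compound)
    < 1.0 : subcritical (improvements fizzle out)"""
    return profile.domain_breadth * profile.optimization_power / profile.recalcitrance

def simulate(profile: SystemProfile, steps: int = 10) -> list[float]:
    """Iterate the toy dynamic: each step scales optimization power by the
    current criticality factor."""
    history = [profile.optimization_power]
    for _ in range(steps):
        profile.optimization_power *= criticality(profile)
        history.append(profile.optimization_power)
    return history

if __name__ == "__main__":
    subcritical = SystemProfile(optimization_power=1.0, recalcitrance=2.0, domain_breadth=0.5)
    supercritical = SystemProfile(optimization_power=1.0, recalcitrance=0.5, domain_breadth=0.9)
    print("subcritical:  ", [round(x, 3) for x in simulate(subcritical)])
    print("supercritical:", [round(x, 3) for x in simulate(supercritical)])
```

The hard part is everything this toy hides: how to measure optimization power and recalcitrance for real systems, and how to prove bounds rather than simulate point estimates.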

2.) Model building:

This step could only progress safely as informed by progress in step 1. Work could be done building large multi-modal world models, similar to the LLMs of today, taking extreme care not to put any of them into a supercritical state as defined by step 1.

( I originally wrote this before LLMs had progressed to the point they have today... I now feel that it is possible we have progressed with step 2 as far as we will ever need to. I think we need progress in other steps to determine if / when any further AI model development is necessary. )

I speculate that limits on further model development would be imposed by some combination of a) limiting optimization power, b) specific bounded optimization targets, c) limited domain, or d) increased recalcitrance. I favor (a) and (c), but I haven’t seen work done on (d). Perhaps cryptographically obfuscating the functioning of an AI from itself could be done in such a way that self-improvement is provably impossible for a given system. Similarly, agents built by prompting LLM APIs don't obviously have access to their own weights, but that again is subject to cryptographic considerations. (b) seems like Bostrom's description of domesticity, which makes me uncomfortable, but may turn out to be quite reasonable.

3.) Value target taxonomy:

Compile a taxonomy of ways to refer to the thing the system should be aligned to, ranging from simple prompt phrases like “CEV”, “human friendliness”, “human values”, “global values”, etc., to complex target-specification schemes such as those involving integrating knowledge from diverse brain readings. It is important to draw on diverse worldviews to compile this taxonomy.
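
For concreteness, the taxonomy might start life as nothing fancier than a structured list, with each entry recording how the target is specified and whether we currently know how to embed it. The field names and entries below are purely illustrative.

```python
# Hypothetical sketch of how taxonomy entries could be recorded for the
# embedding work in step 4. Entries and fields are illustrative only.

value_target_taxonomy = [
    {"name": "CEV", "kind": "prompt_phrase",
     "spec": "the coherent extrapolated volition of humanity"},
    {"name": "human friendliness", "kind": "prompt_phrase",
     "spec": "human friendliness"},
    {"name": "human values", "kind": "prompt_phrase",
     "spec": "human values"},
    {"name": "global values", "kind": "prompt_phrase",
     "spec": "global values"},
    {"name": "brain-reading aggregate", "kind": "complex_scheme",
     "spec": "a target derived by integrating knowledge from diverse brain readings",
     "embeddable_today": False},
]
```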

4.) Value encoding:

Research the embeddings of the elements from step 3 in the models developed in step 2. For those for which it is infeasible to generate embeddings, research why they may be valuable and why it may be ok to proceed without them. The initial focus should be on how the reference embeddings relate to one another and how the embeddings relate across models. A very good outcome of this stage would be the development of a multimodal mapping to a semantic space, and a vector within that space which stands as a good candidate to be the optimization target for a future superintelligent system.
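
As a very early sketch of the first part of this step, one could embed the candidate phrases from step 3 in two different models and check whether the phrases relate to one another in roughly the same way under both. The small off-the-shelf sentence-embedding models named below are just convenient stand-ins for the world models of step 2.

```python
# Compare how value-target phrases relate to one another under two different
# embedding models. The models are stand-ins; the phrases come from step 3.

import numpy as np
from sentence_transformers import SentenceTransformer

PHRASES = ["coherent extrapolated volition", "human friendliness",
           "human values", "global values"]

def similarity_matrix(model_name: str, phrases: list[str]) -> np.ndarray:
    model = SentenceTransformer(model_name)
    embs = model.encode(phrases)                              # (n, d) array
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return embs @ embs.T                                      # pairwise cosine similarity

if __name__ == "__main__":
    sims_a = similarity_matrix("all-MiniLM-L6-v2", PHRASES)
    sims_b = similarity_matrix("all-mpnet-base-v2", PHRASES)
    # Do the reference embeddings relate to one another the same way across models?
    print("agreement between models:",
          np.corrcoef(sims_a.ravel(), sims_b.ravel())[0, 1])
```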

5.) Value generalization to SI:

Using understanding from steps 4 and 1, consider how various systems set to optimize for potential targets from step 4 would generalize as their domain moves from local to global to superintelligent. I suspect this may expose a dissonance between human values and human-friendly values which may require resolution.

6.) Human value embeddings:

Study the optimization targets of individual humans, groups of humans, and our global civilization within the multimodal mappings developed in step 4.

7.) Double-check value embeddings:

Use knowledge from steps 5 and 6 to develop some kind of proof that the proposed target embedding is within the space of target embeddings of humankind. This may be complicated if it turns out that human values and human-friendly values do not intersect as explored in step 5.
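
I don't know what form such a proof would take, but as a toy geometric proxy for "within the space of target embeddings of humankind", one could at least check whether a candidate target vector lies inside the convex hull of a set of human value embeddings. The data below is random placeholder data, and convex-hull membership is standing in for a much stronger notion of containment.

```python
# Toy proxy for step 7: is the candidate target a convex combination of
# human value embeddings? Solved as a linear-programming feasibility problem.

import numpy as np
from scipy.optimize import linprog

def in_convex_hull(candidate: np.ndarray, human_embeddings: np.ndarray) -> bool:
    """True if `candidate` can be written as a convex combination of the rows
    of `human_embeddings`."""
    n, d = human_embeddings.shape
    A_eq = np.vstack([human_embeddings.T, np.ones((1, n))])  # mixture hits candidate; weights sum to 1
    b_eq = np.concatenate([candidate, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    humans = rng.normal(size=(200, 8))      # placeholder "human value embeddings"
    inside = humans.mean(axis=0)            # the centroid is always inside the hull
    outside = inside + 100.0                # far outside the data
    print(in_convex_hull(inside, humans))   # True
    print(in_convex_hull(outside, humans))  # False
```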

8.) Global consensus:

Publish the findings of step 7 and wait for political agreement on the embedding as the target for superintelligent optimization; then, using the theory from step 1, turn on a supercritical system set to optimize the embedded target under its given multimodal world mapping.

Step 8 leaves me conflicted. I am uncertain whether it is ethical to allow suffering which would be prevented by AGI deployment to continue in the world for the time this would take, but I also feel uncomfortable forcing others to live in a world ruled by an AGI they did not choose.

9.) Be sure:

Maybe double-check the calculations before turning on the final machine that humanity will ever have the freedom to fail to have built correctly. After that we will either be forced to fail, or forced not to fail.


Where do I see myself in this plan?

Currently I'm interested in Mechanistic Interpretability, which I think will strongly support many of the above steps, although, worryingly, it may support step 2 without the proper safety from step 1. This does seem to be the current global situation afaik...

At some point, after I feel I have intuition into the workings of modern AI systems and have built some credibility as a researcher, I hope to shift to focusing on Agent Foundations, especially as it relates to steps 1, 4, and 6.

: )


Thanks for reading this

If you think this is a good/bad plan, or that we are already collectively enacting this plan, or that it would be impossible to coordinate this plan... I'd love to hear about your thoughts!

Comments (4)

It's the best plan I've seen in a while (not perfect, but has many good parts). The superalignment team at Anthropic should probably hire you.

Thanks : )

Step 1 looks good.  After that, I don't see how this addresses the core problems.  Let's assume for now that LLMs already have a pretty good model of human values; how do you get a system to optimize for those?  What is the feedback signal, and how do you prevent it from getting corrupted by Goodhart's Law?  Is the system robust in a multi-agent context?  And even if the system is fully aligned across all contexts and scales, how do you ensure societal alignment of the human entities controlling it?

As a miniature example focusing on a subset of the Goodhart phase of the problem, how do you get an LLM to output the most truthful responses to questions it is capable of giving--as distinct from proxy goals like the most likely continuation of text or the response that is most likely to get good ratings from human evaluators?

Hey : ) Thanks for engaging with this. It means a lot to me <3

Sorry I wrote so much, it kinda got away from me. Even if you don’t have time to really read it all, it was a good exercise writing it all out. I hope it doesn't come across too confrontational, as far as I can tell, I'm really just trying to find good ideas, not prove my ideas are good, so I'm really grateful for your help. I've been accused of trying to make myself seem important while trying to explain my view of things to people and it sucks all round when that happens. This reply of mine makes me particularly nervous of that. Sorry.

 

A lot of your questions make me feel like I haven’t explained my view well, which is probably true; I wrote this post in less time than would be required to explain everything well. As a result, your questions don’t seem to fully connect with my worldview and make sense within it. I’ll try to explain why, and I’m hoping we can help each other with our worldviews. I think the cruxes may relate to:

  • The system I’m describing is aligned before it is ever turned on.
  • I attribute high importance to Mechanistic Interpretability and Agent Foundations theory.
  • I expect the nature of Recursive Self-Improvement (RSI) will result in an agent near some skill plateau that I expect to be much higher than humans and human organisations, even before SI hardware development. That is, getting a sufficiently skilled AGI would result in an artificial superintelligence (ASI) with a decisive strategic advantage.
  • I (mostly) subscribe to the simulator model of LLMs, they are not a single agent with a single view of truth, but an object capable of approximating the statistical distribution of words resulting from ideas held within the worldviews of any human or system that has produced text in the training set.

I’ll touch on those cruxes as I talk through my thoughts on your questions.

 

First, “how do you get a system to optimize for those?” and “what is the feedback signal?” are questions in the domain of Step 1, specifically the second paragraph: “This should encompass the development of a theory of general decision / optimization systems”. I don’t think the theory will get to any definitive conclusions quickly, but I am hopeful that we will be able to define the borders/bounds of RSI sooner rather than later, because many powerful systems today will be upset by a pause, and the more specific our RSI bounds are, the more powerful the systems we would be capable of safely developing, knowing they cannot RSI. (Btw, I’d want a pretty serious derating factor for that.) I think it’s possible that, in order to develop theory to define RSI bounds, it is necessary to understand the relationship between Goals/Targets/Setpoints/Values/KPI/etc and the optimization pressure applied to reach them; but if not, it’s at least related, and that understanding is what is required to get an optimization system to optimize for a specific target. It may be a good idea for me to rename Step 1 to “Agent System Theory & RSI borders”. If I ever write a second alignment plan draft I’ll be sure to do so.

 

The situation with Goodhart’s Law (GL) is similar to the above, but I’ll also note that GL only applies to misaligned systems. The core of GL is that if you optimize for a proxy of something, the distance between that proxy and the thing you actually wanted becomes more and more significant. If we imagine two friends who both like morning glory muffins, and one goes to bake some, there’s no GL risk to the other friend, since they share the same goal. Likewise, if we suppose an ASI really is aligned to human-friendly values, then there is no risk of GL, since the thing the ASI really and truly cares about is friendliness to us. The problem is indeed “really and truly” aligning a system to human-friendly values, but that is what my plan is meant to do.
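
To make that concrete with a toy example (the standard regressional-Goodhart setup, with made-up numbers): if you select harder and harder on a proxy that is merely correlated with what you want, the true value you end up with falls further and further behind what selecting on the real target would have gotten you. If the score you select on is the true target itself, the divergence never appears.

```python
# Regressional Goodhart toy: selecting on a noisy proxy vs. on the true target.

import numpy as np

rng = np.random.default_rng(42)

def best_true_value(n_candidates: int, optimize_proxy: bool) -> float:
    """Draw candidates, score them, pick the top scorer, report its TRUE value."""
    true_value = rng.normal(size=n_candidates)
    proxy_value = true_value + rng.normal(size=n_candidates)  # proxy = true + error
    scores = proxy_value if optimize_proxy else true_value
    return float(true_value[np.argmax(scores)])

if __name__ == "__main__":
    for pressure in [10, 100, 1_000, 10_000, 100_000]:
        aligned = np.mean([best_true_value(pressure, False) for _ in range(50)])
        proxied = np.mean([best_true_value(pressure, True) for _ in range(50)])
        print(f"selecting over {pressure:>6} candidates: "
              f"true-target optimizer gets {aligned:5.2f}, proxy optimizer gets {proxied:5.2f}")
```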

 

As for multi-agent situations, I don’t understand why they would pose any problem. I expect the dynamics of RSI to lead to a single agent with a decisive strategic advantage. I can see two ways that this might not be the case:

  • If we are in an AGI race and RSI takeoff speed turns out to be sufficiently low, we may get multiple ASIs. Because we are in a race dynamic, I assume we have not had the time or taken the care to align any of these AGIs, and so I don’t believe any of those ASIs would be remotely aligned to human friendliness. So it’s irrelevant to consider, because we have already failed.
  • If the skill plateau turns out to be very low, then we may want to have multiple different AGIs. I think this is unlikely given my understanding of the software overhang. Almost everywhere in every software system, humans are trying to make things understandable enough that they can assure correctness or even just get them working. I believe strongly that even a mild ASI would be able to greatly increase the efficiencies of the hardware systems it is running on. I also don’t think there is anything special about human-level intelligence; I think it is plausible that we are the first animal smart enough to create optimization systems powerful enough to destroy the planet and ourselves, which seems to be what we are currently doing. In some sense this makes us close to the minimally intelligent object in the set of objects capable of wielding powerful optimization.

So in my worldview, it is very likely that in all not-already-doomed timelines, when we initiate RSI, the result will be a system that outmaneuvers all other agents in the environment. So multi-agent contexts are irrelevant.

 

“Societal alignment of the human entities controlling it” - I think societal alignment is well covered, but I don’t think human entities can/should control an ASI…

About societal alignment: that is the focus of Steps 3 and 8, and somewhat of Step 6. Step 3, creating a taxonomy of value targets, is similar to gathering the various possible desires of society. I emphasize that “It is important to draw on diverse worldviews to compile this taxonomy.” This is important both for the moral reason of inclusion & respect and for the technical reason of having redundancies & good depth of consideration. Then in Steps 4 and 5 the feasibility of cohering these values is explored. With luck we will get good coherence 🍀 I truly do not know how likely that is, but I hope for a future where we get to find out. Step 8 involves the world actually signing off on the encoding of the world's values… That is probably the most difficult step of this plan, which is significant since the other steps may plausibly take many decades. Step 6 is somewhat of a double check to make sure the target makes sense at all levels.

About humans controlling ASI: it might be the case that entities at human skill levels cannot control an ASI, as some kind of information-agentic law of the universe, but even supposing that is not so:

  • If we control an aligned ASI, we are only limiting its ability to do good.
  • If we control a misaligned ASI:
    • This is super dangerous, why are we doing this? Murphy's law; something always goes wrong.
    • This is a universal tragedy. The most complex and beautiful being in the universe would be shackled to the control of a society much lesser than itself. Yes, I consider the ASI a moral patient, and one quite worthy of consideration. If you, like many people, attribute greater moral weight to humans than to animals based on their greater complexity, it follows that an ASI would be even more important. If you simply care more for humans because you are one, I suppose that’s valid and you need not attribute greater moral weight to an ASI, but that’s not a perspective I have much affection for.

So “controlling” an ASI is not a consideration. I suppose it would be a reasonable consideration for more advanced AGIs within the sub-RSI bounds… I haven’t given it much thought, but it seems like a political problem outside of this scope. I hope the theory of Step 1 may help people build political systems that better align with what citizens want, but it’s outside of what I’m trying to focus on.

 

The miniature example you pose seems irrelevant since as I discussed above, in my view GL doesn’t apply to an aligned system, and the goal of my plan is to have a system aligned from bootup. But I find the details of the example interesting and I’d still like to explore them…

Getting truth out of an LLM is the problem of eliciting latent knowledge (ELK). I think the most promising way of doing that is with Mechanistic Interpretability. I have high hopes not for getting true facts out of LLMs, but for examining the distributions of worldviews of people represented within the distribution the LLM is approximating. But insofar as there is truth in the LLM, I think Mech Interp is the way to get it out. I feel it may be possible that there is a generalized representation of the “knows true things” property that each person has various amounts of, and that, if that were the case, then we could sample from the distribution at a location in “knows true things” higher than any real person, and in doing so acquire truer things than are currently known… but it also seems very possible that LLMs fail to encode such a thing, and it may be that it is impossible for them to encode such a thing.
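
The kind of experiment I have in mind is the usual linear-probing setup: cache activations for statements labelled true/false, fit a linear probe, and see whether a "says true things" direction falls out. The sketch below uses synthetic activations with a planted direction, because whether such a direction exists in real models is exactly the open question; nothing here is evidence that it does.

```python
# Linear probe for a planted "truth" direction in synthetic activations.
# In practice the activations would be cached from a real model's residual stream.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

d, n = 64, 2_000
truth_direction = rng.normal(size=d)          # planted direction, unknown to the probe
labels = rng.integers(0, 2, size=n)           # 1 = "true statement", 0 = "false statement"
activations = rng.normal(size=(n, d)) + np.outer(labels - 0.5, truth_direction)

probe = LogisticRegression(max_iter=1_000).fit(activations, labels)
learned = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
planted = truth_direction / np.linalg.norm(truth_direction)

print(f"probe accuracy: {probe.score(activations, labels):.2f}, "
      f"cosine with planted direction: {abs(learned @ planted):.2f}")
```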

Based on my expectation of mesa-optimizers in almost any system trained by stochastic gradient descent, I don’t think “most likely continuation” or “expected good rating” are the goals that an LLM would target if agent-shaped, but rather some godshatter that looks as alien to us as our values look to evolution (in some impossible counterfactual universe where evolution can do things like “look at values and find them alien”).

So from within the scope of my alignment plan, getting LLMs to output truth isn’t a goal. It might end up being a result of necessary Mech Interp work, but the way LLMs should be used within the scope of my plan is, along with other models, to do Step 4: “development of a multimodal mapping to a semantic space and vector within that space which stands as a good candidate to be the optimization target”.