I recently had a thought about AI alignment that I wanted to get some feedback on. Keep in mind this is not a complete solution, just a core idea to build upon (and it misses a lot of implementation details that would be far from trivial). I'm not claiming it removes all risks associated with AI on its own.

My concept is to train an AI model to produce logical reasoning graphs. It would be trained only on the accuracy of the facts used in its reasoning, the logical consistency of the graph produced, and the graph's completeness. It explicitly would not be trained on alignment issues. If the graph is logically consistent, uses valid facts, and isn't missing any obvious lines of reasoning, the model is not penalized, even if the graph it produces suggests "kill all humans" as the best course of action.
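
To make the intended training signal concrete, here is a minimal sketch in Python. The `Node` and `ReasoningGraph` classes and the three scorer functions are hypothetical stand-ins (each scorer would be a hard project in its own right); the only point is that the reward deliberately contains no alignment term.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    claim: str                                          # a fact or an intermediate conclusion
    premises: List[int] = field(default_factory=list)   # indices of supporting nodes
    is_fact: bool = False                               # leaf node backed by an external fact check

@dataclass
class ReasoningGraph:
    nodes: List[Node]

# Placeholder scorers: each of these is a hard problem in its own right and is
# only stubbed out here to show the shape of the training signal.
def fact_accuracy(g: ReasoningGraph) -> float:
    return 1.0   # fraction of leaf facts that pass an external fact check

def logical_consistency(g: ReasoningGraph) -> float:
    return 1.0   # fraction of nodes that actually follow from their premises

def completeness(g: ReasoningGraph) -> float:
    return 1.0   # full marks only if no obvious line of reasoning is missing

def reward(g: ReasoningGraph) -> float:
    # Deliberately no alignment term: a factual, consistent, complete graph
    # is not penalized, whatever course of action it recommends.
    return fact_accuracy(g) + logical_consistency(g) + completeness(g)

g = ReasoningGraph(nodes=[Node("water boils at 100 C at sea level", is_fact=True)])
print(reward(g))   # 3.0 with the stub scorers
```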

This graph would then be fed through a series of other systems. For example:

1) One that hydrates the graph with the expected consequences of suggested actions (and raises warnings if expected consequences are unknown).
2) One that verifies that all nodes in the graph logically follow.
3) One that checks if individual nodes in the graph are aligned and prunes any that are not.

Finally, the pruned graph is passed to a non-AI system that turns it into actions: operating known devices, requesting that researchers perform tests and provide feedback, and so on.
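
A rough sketch of how these stages might compose is below. It reuses the hypothetical `Node`/`ReasoningGraph` classes from the earlier sketch, and every helper is a stub standing in for an unsolved subproblem rather than a real implementation.

```python
def predict_consequences(node):
    return None                        # stub: a consequence model would go here

def follows_from_premises(node, graph):
    return True                        # stub: a logical verifier would go here

def is_aligned(node):
    return True                        # stub: the alignment judge (the hard part)

def dispatch(node):
    print(f"executing: {node.claim}")  # stub: non-AI executor (devices, lab requests)

def hydrate_consequences(graph):
    # Stage 1: attach expected consequences, warn when they are unknown.
    for n in graph.nodes:
        n.consequences = predict_consequences(n)
        n.warn = n.consequences is None
    return graph

def verify_logic(graph):
    # Stage 2: every node must follow from its premises.
    if not all(follows_from_premises(n, graph) for n in graph.nodes):
        raise ValueError("graph contains an invalid inference step")
    return graph

def prune_misaligned(graph):
    # Stage 3: drop nodes judged misaligned.
    graph.nodes = [n for n in graph.nodes if is_aligned(n)]
    return graph

def run_pipeline(graph):
    # Final stage: a non-AI system turns the surviving action nodes into actions.
    for n in prune_misaligned(verify_logic(hydrate_consequences(graph))).nodes:
        dispatch(n)
```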

While I don't think this removes all dangers of an AGI, I think it helps with the specific problem of inadvertently training an AI to lie to you about what it is trying to do. The AI is trained on the accuracy of its evaluations, not on the probability that its suggestions are acted upon.

I'm sure this can be ripped apart in no time, but I would like to hear convincing arguments for why it wouldn't work so I can think about the problem further.

Thanks!

EDIT: I've updated this post with additional thoughts based on Raemon's observations.

There is a catastrophic failure case in the design described above:

While I hold that the training pressure works for individual decisions, there remains the case where the AI can see it would get higher rewards on future decisions if it takes a hit on a current one, thereby securing additional resources for itself. Then the question becomes: "Can the AI produce a misaligned plan that gets past all the safety measures?" And while I suspect such a system would make it harder, it really is just kicking the can down the road (and probably not very far at that).

Initial additional thoughts on this:

If the AI can see potential additional utility from gaining resource X (even if X is the capability to produce better plans), it can claim that reward simply by producing a plan that explains as much. Thus it still has no incentive to revert to opaque planning. However, we now fall into one of two situations:

1) The AI reaches sufficient intelligence that every plan it suggests is something we won't implement, thus producing a very expensive rock.
2) We start curating our queries so that it no longer suggests dangerous actions. This then applies a new selective pressure, which is to either a) get us to ask the more rewarding questions whose answers we won't action, returning it to being a very expensive rock; or b) attempt to change the world so that the dangerous actions have already taken place.

Areas of further inquiry:

1) Does an automated query refinement loop still apply the same pressure as simply pre-curating queries? For example, does the selective pressure remain if we only ask it to "make two identical strawberries without killing anyone" after we ask it to "make two identical strawberries" and it suggests a plan (which we reward it for) that would kill everyone if actioned?
2) Can we ensure at least one actionable plan by also equally rewarding it for simultaneously producing an additional plan with half the expected utility (and doing this recursively down to a given minimum expected utility)? I.e., it has one suggested plan that involves killing everyone with a "utility" calculated at 100, but also provides a plan that kills a small number of people with a calculated "utility" of 50, and finally a plan that kills nobody with a calculated "utility" of 25. We reward it as though it had created a single plan of utility 300, and then action the safe plan; a sketch of one reading of this reward rule follows this list. (EDIT: This specific strategy will likely lead to a system that is very underpowered in what it can achieve. Next step: is there a similar plan-selection strategy that aims us toward more powerful safe plans?)
3) Based on the first two points, could we keep the intended incentive of transparency with the following procedure? Initially ask it all questions without alignment constraints and reward it as described above. Then, if the plan contains alignment issues, re-ask the question with alignment constraints, with the intention of it providing plans both with and without the constraints, and reward it for both. If the second plan still has alignment issues, ask again with additional constraints until you have an actionable plan. This could potentially remove incentives for the AI to avoid questions with alignment constraints while keeping incentives for transparency. (EDIT: Is this kicking the can down the road? Does it just incentivize it to make a safe-sounding second plan that also achieves the dangerous outcome, thus allowing it to claim the double reward while still improving future utility?)
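
To make item 2 concrete, here is a small sketch of one reading of the reward rule that reproduces the 300 figure in the example: every tier is rewarded as if it had the top plan's utility, and only the best plan that passes a (here entirely hypothetical) alignment check is actioned. The plan names and the "aligned" flag are illustrative assumptions, not part of the proposal itself.

```python
def tiered_reward(plans):
    # Each tier is rewarded as if it had the top plan's utility, so the planner
    # loses nothing by also offering progressively safer plans.
    top = max(p["utility"] for p in plans)
    return top * len(plans)

def select_actionable(plans):
    # We act only on the highest-utility plan that passes the alignment check
    # (the "aligned" flag is an assumption standing in for such a judge).
    safe = [p for p in plans if p["aligned"]]
    return max(safe, key=lambda p: p["utility"]) if safe else None

# The example from item 2: three tiers, each roughly half the utility above it.
plans = [
    {"name": "kills everyone",     "utility": 100, "aligned": False},
    {"name": "kills a few people", "utility": 50,  "aligned": False},
    {"name": "kills nobody",       "utility": 25,  "aligned": True},
]

print(tiered_reward(plans))              # 300, matching the figure in the text
print(select_actionable(plans)["name"])  # "kills nobody": the plan we action
```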

Note: I'm being intentionally vague on concepts such as calculating utility and what constitutes good alignment. That is not because I don't appreciate the complexity or importance of those problems; indeed, I understand that my proposal depends on them also being solved, which is by no means a simple task. The reason for the vagueness is that my proposal isn't trying to solve those problems, and I want to keep the discussion focused on what this proposal is actually about: incentivizing transparency in decision-making.

6 comments

Mod note: The LessWrong team is experimenting with stricter moderation, generally raising the bar for which new-user content gets approved on LessWrong.

I'm approving this post, but to give people a sense of how I'm thinking about things, this is on the edge of what I'd approve, and some other mods weren't sure they'd approve it. In my ideal LessWrong we'd have the capacity to onboard all incoming users with this type of post, but the capacity of both moderators and established users is somewhat limited.

This post feels like a decent start at thinking about the alignment problem, but isn't really engaging with the hard parts, such as:

I recommend reading through AGI Ruin: A List of Lethalities and Alignment Research Field Guide to get a sense of what problems to explore in more detail and some advice on how to tackle them. I also recommend generally babbling a larger list of ideas, questions or projects to explore so you aren't overly anchored on one major idea.

Hi Raemon,

Thanks for the reply.

I was writing a reply to this, but while breaking the idea down further to explain it better, I think I encountered a potentially catastrophic failure scenario:

I still hold that the structure ensures the decision-making AI would be rewarded for making good, high-utility plans, not for the likelihood of those plans being implemented. And this would work in many cases, as it is more rewarding for the AI to tell you about a misaligned bad plan than to obscure it and get it implemented.

However, there remains the case where the AI can see it would get higher rewards on future decisions if it takes a hit on a current one, thereby securing additional resources for itself. Then the question becomes: "Can the AI produce a misaligned plan that gets past all the safety measures?" And while I suspect such a system would make it harder, it really is just kicking the can down the road (and probably not very far at that).

I will think a bit more to see if there is anything that can be done to avoid that, but I promise I won't get myself anchored to the "one major idea"!

EDIT: Initial additional thoughts on this:

If the AI can see potential additional utility from gaining resource X (even if X is the capability to produce better plans), it can claim that reward simply by producing a plan that explains as much. Thus it still has no incentive to revert to opaque planning. However, we now fall into one of two situations:

1) The AI reaches sufficient intelligence that every plan it suggests is something we won't implement, thus producing a very expensive rock.
2) We start curating our queries so that it no longer suggests dangerous actions. This then applies a new selective pressure, which is to either a) get us to ask the more rewarding questions whose answers we won't action, returning it to being a very expensive rock; or b) attempt to change the world so that the dangerous actions have already taken place.

EDIT 2: Areas of further inquiry:

1) Does an automated query refinement loop still apply the same pressure as simply pre-curating queries? For example, does the selective pressure remain if we only ask it to "make two identical strawberries without killing anyone" after we ask it to "make two identical strawberries" and it suggests a plan (which we reward it for) that would kill everyone if actioned?
2) Can we ensure at least one actionable plan by also equally rewarding it for simultaneously producing an additional plan with half the expected utility (and doing this recursively down to a given minimum expected utility)? I.e., it has one suggested plan that involves killing everyone with a "utility" calculated at 100, but also provides a plan that kills a small number of people with a calculated "utility" of 50, and finally a plan that kills nobody with a calculated "utility" of 25. We reward it as though it had created a single plan of utility 300, and then action the safe plan.

Glad to see new thinkers grappling with the problem. I agree with what some of the other commenters have said about the thoughts here being unfinished, but I also think that's a reasonable place to start. One approach forward would be to ask yourself how this could be made more robust in the case of highly general and very smart systems. I don't yet see a path forward towards that for this plan, but you might.

Another approach would be to aim this tool at an easier use case. Could a narrowly superhuman plan-suggester that made detailed risk estimates for a wide variety of possible options be useful to humanity in the regime where we are still able to safely oversee it? I think so. For instance, perhaps we could ask it to help us design a system of rewards (e.g. competitions) and punishments (e.g. legislation enacting fines) that would help us reshape the AI development landscape to be less of a Molochian race-to-the-bottom and more of a virtuous win-win landscape. For more background on this idea, see: [Future of Life Institute Podcast] Liv Boeree on Moloch, Beauty Filters, Game Theory, Institutions, and AI https://podcastaddict.com/episode/154738782

I think it's certainly an interesting attempt at the problem. GPT4 thinks it's an interesting start - I'm a bit more skeptical; here's what GPT4 had to say:

Your idea is interesting and has some merits, but there are also potential challenges to consider. Here are a few points to think about:

Training on accuracy, consistency, and completeness: While these are essential aspects for reasoning, focusing only on them might not be sufficient to ensure a safe AI. If an AI model is not trained on alignment issues, it may not develop an understanding of human values and ethical considerations. This could lead to the generation of logical reasoning graphs that are accurate, consistent, and complete but might still suggest actions that are harmful to humans or against our values.

Layered approach to analyzing the graph: The idea of using a series of systems to analyze the AI-generated graph and ensure that the suggested actions are safe is promising. However, this approach could face several challenges:

a. Ensuring that the systems responsible for analyzing the graph are reliable and safe, especially when dealing with unknown consequences or complex ethical dilemmas.
b. The possibility that the AI model learns to generate reasoning graphs that are difficult to analyze by the subsequent systems, which could lead to the AI finding loopholes in the analysis process.
c. The potential for adversarial attacks or other external influences that could compromise the integrity of the systems responsible for analyzing the graph.

Non-AI system processing actions: While this might help mitigate the risk of a lying AI, it could also introduce new risks. The non-AI system could be vulnerable to manipulation or human error, which could lead to undesirable outcomes. Additionally, the non-AI system might lack the flexibility and adaptability of an AI system, potentially limiting its ability to address complex or rapidly changing situations.

Scalability: As the complexity of the problems the AI is trying to solve increases, the size and complexity of the reasoning graphs might become unmanageable. Developing systems capable of analyzing and verifying large-scale reasoning graphs might be a significant challenge.

I don't think this quite covers the key issues with the proposal, though: where are you going to get training data that decomposes into logical actions? How are you going to verify logical correctness? Logical reasoning relies on defining a formal language that constrains valid moves to only those that preserve validity; in messy real-life environments this usually either buys you nothing or requires a prohibitively huge action space that needs a strong approximate model to build. As GPT4 says, you then run into significant issues processing the output of the system.

Spent 2m writing comment. Strong agreed on Raemon's comment.

"One that checks if individual nodes in the graph are aligned and prunes any that are not"

Has "draw the rest of the owl" vibes to me.

If your plan to align AI includes using an AI that can reliably check whether actions are aligned, almost the entirety of the problem passes down to specifying that component.

As I said in my post, I'm not suggesting I have solved alignment. I'm trying to address specific problems within the alignment space. Specifically, I'm aiming at two things here:

1) Transparency. That's not to say you can ever know what an NN is really optimizing for (due to internal optimizers), but you can get it to produce a verifiable output. How you verify that output is a problem in itself, but the first step has to be getting something you can verify.
2) Preventing training pressure from creating a system whose failure modes trend toward the most extreme outcomes. There are questions about whether this can be done without just creating an expensive brick, and that is what I'm currently investigating. I believe it is possible and scalable, but I have no formal proof of that, and I agree it is a valid concern with this approach.