Hm. Yeah, I think you've correctly identified the main issue, which is that if it knows enough about the world, or is faced with problems that are informed by the details of the world, then it can predict G2 ahead of time and so obfuscation is a bit pointless - one just kind of has to hope it doesn't care.
Maybe you want something like myopia - perhaps the AI has some local goal like "find the simplest proof," and it doesn't care about controlling the future, it just outputs whatever proof is the simplest. Although to get this right we'd have to somehow acquire a myopic optimizer, which is harder than just implementing a myopic training environment.
Hi! Thank you for writing this and suggesting solutions. I have a number of points to discuss. Apologies in advance for all the references to Arbital, it's a really nice resource.
The AI will hack the system and produce outputs that it's not theoretically meant to be able to produce at all.
In the first paragraphs following this, you describe this first kind of misalignment as an engineering problem: trying to guarantee that the instructions actually run on the hardware correspond exactly to the code you wrote, i.e. being robust to hardware tampering.
I argue that this is actually a subset of the second kind of misalignment. You may have solved the initial engineering problem, so that at the start the hardware does what the software says, but the agent's own hardware is part of the world, and so can plausibly be influenced by whatever the agent outputs.
You can attempt to specifically bar the agent from taking actions that target its hardware; that is not a hardware problem, but your second kind of misalignment. For any sufficiently advanced agent, which may find cleverer strategies than the cleverest hacker, no hardware is safe.
Plus, as long as the agent can interact with the outside world, its hardware may have more parts than you expect. We still have a long way to go before we can be confident about that part.
Of course the problem with this oracle is that it's far too inefficient. On every single run we can get at most 1 bit of information, but for that one bit of information we're running a superhuman artificial intelligence. By the time it becomes practical, ordinary superhuman AIs will have been developed by someone else and destroyed the world.
There are other problems too; for instance, how can you be sure that the agent hasn't figured out how to game the automated theorem prover into validating its proofs? Your conclusion seems to be that if we manage to make it safe enough, it will become impractical enough. But if you try to get more than one bit of information, you run into other issues.
This satisfies our second requirement - we can verify the AI's solution, so we can tell if it's lying. There are also some FNP problems which satisfy the first requirement - there's only one right answer. For example, finding the prime factors of an integer.
Here the verification process is no longer an automated process, it's us. You correctly point out that most useful problems have various possible solutions, and the more information we feed the agent, the more likely it is to find some solution that exploits our flaws and... starts a nuclear war, in your example.
I am confused by your setup, which seems to be trying to make it harder for the agent to harm us, when it shouldn't even be trying to harm us in the first place.
In other words, I'm asking: is there a hidden assumption that, in the process of solving FNP problems, the agent will need to explore dangerous plans?
A superintelligent FNP problem solver would be a huge boon towards building AIs that provably had properties which are useful for alignment. Maybe it's possible to reduce the question "build an aligned AI" to an FNP problem, and even if not, some sub-parts of that problem definitely should be reducible.
I would say that building a safe superintelligent FNP solver requires solving AI alignment in the first place. A less powerful FNP solver could maybe help with sub-parts of the problem... which ones?
An idealized computer program acts on a defined set of inputs, and produces a defined set of outputs.
When we talk about AI misalignment there are two fears:
1. The AI will hack the system and produce outputs that it's not theoretically meant to be able to produce at all.
2. Within the outputs it is meant to be able to produce, the AI will choose the ones that further its own goals rather than ours.
For example consider a sandboxed AI that's only able to answer yes or no questions.
It might realize that by making sure specific instructions run on specific CPUs at specific times it can exploit a hardware bug to change the memory of another process running on the same hardware, allowing it to take a far wider range of actions. This would be misalignment type 1.
Or it might be asked whether it's safe to be given more power, lie and answer yes, and then take over the world. This would be misalignment type 2.
The 1st type of misalignment is fundamentally an engineering problem. It's the same problem as making sure your code and hardware have no security flaws that would allow a hacker to take over the system. It's an extremely hard problem, but one which we have a lot of experience solving, and we have a rich set of tools which are getting better all the time. For example, code is now being written in much safer languages like Rust instead of C++, and automated theorem provers are increasingly being used to verify properties of your code.
I'm not saying this is easy. As any security expert will tell you, against a clever enough hacker it's almost impossible. But it's the kind of impossible we know how to shut up and do. We know how to write toy programs which are provably bug free. Writing real bug free programs is the same thing, just much much harder. When crunch time comes, we've got a fighting chance of getting this right.
The second type of misalignment is much more insidious. We don't really have any clue how to get an AI to only give the responses we want. Even if we knew that humanity's future hung in the balance, we still wouldn't have a clue where to even start solving this one.
So instead of solving it, let's dissolve it.
The second type of misalignment only occurs because the AI has slack in how it responds. Since it can give multiple answers, it can pick one based on its inner values. And we have no way to ensure those inner values correspond with our own.
What if we only asked the AI questions where:
1. There is only one right answer, and
2. We can verify that answer, so we can tell if the AI is lying?
Model 1: The theorem proving oracle.
For a simple example of such a system, consider an AI which is given a set of inputs, and asked to prove some property about them under some axiom system.
As an output it produces either a proof of the property, or a counterproof, or it gives up.
Humans (or anything else) cannot access the proof. Instead an automated theorem prover checks the proof. If it's valid, only the corresponding verdict is released; if it's invalid, the proof is discarded and the "give up" response is given.
The only externally visible outputs are "this was proved true", "this was proved false" and "it gave up". For any particular set of inputs it's impossible to find both a proof and a counterproof, so there are only two possible outputs (whichever verdict the proof or counterproof supports, or "give up").
If we have a rule that the AI is shut down if it gives up more than 1% of the time then the AI has very little slack to manipulate the outside world. At most it can give up 1% of the time when it finds a valid proof/counterproof. It could theoretically use that to grant itself greater power and destroy the world, but if it succeeds then kudos to it - we never really had a chance anyway.
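To make the interface concrete, here is a minimal sketch in Python of the output filter. The boxed prover `run_ai(statement)` and the trusted checker `check(statement, kind, payload)` are hypothetical stand-ins for the AI and the automated theorem prover; the point is just that nothing but the three allowed responses ever leaves the box, and the 1% give-up rule is enforced mechanically.

```python
from fractions import Fraction

GIVE_UP_LIMIT = Fraction(1, 100)  # the 1% rule from the text

class Oracle:
    """Wraps the boxed prover so only three responses ever leave the box."""

    def __init__(self, run_ai, check):
        self.run_ai = run_ai   # hypothetical boxed prover
        self.check = check     # trusted proof checker
        self.asked = 0
        self.gave_up = 0

    def ask(self, statement):
        self.asked += 1
        kind, payload = self.run_ai(statement)
        if kind == "proof" and self.check(statement, "proof", payload):
            return "proved true"
        if kind == "counterproof" and self.check(statement, "counterproof", payload):
            return "proved false"
        # Invalid or missing proofs all collapse into a single response.
        self.gave_up += 1
        if Fraction(self.gave_up, self.asked) > GIVE_UP_LIMIT:
            self.shut_down()
        return "gave up"

    def shut_down(self):
        raise RuntimeError("give-up rate exceeded 1%: shutting the AI down")
```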
Of course the problem with this oracle is that it's far too inefficient. On every single run we can get at most 1 bit of information, but for that one bit of information we're running a superhuman artificial intelligence. By the time it becomes practical, ordinary superhuman AIs will have been developed by someone else and destroyed the world.
Model 2: FNP solver
There's a mathematical term for problems which are easy to verify. It's called NP - the set of decision problems for which a positive solution can be verified in polynomial time (given suitable evidence).
NP problems are yes/no problems - "Is there a route through all of these cities that is shorter than X?", "Is there a set of values that will satisfy this boolean formula?" etc.
If we want to get more than 1 bit out of each run of the AI we're interested in FNP problems - don't just tell me the solution exists, tell me what it is too.
This satisfies our second requirement - we can verify the AI's solution, so we can tell if it's lying. There are also some FNP problems which satisfy the first requirement - there's only one right answer. For example, finding the prime factors of an integer.
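As a quick illustration of why factoring fits both requirements, here is a small sketch of the verifier side: given n and a claimed list of prime factors, checking the answer is simple, and (up to the order of the factors) there is only one right answer. The trial-division primality check is just for brevity; a real polynomial-time verifier would use a proper primality test.

```python
def is_prime(k):
    """Trial-division primality check (fine for a small illustration)."""
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True

def verify_factorisation(n, factors):
    """Accept iff `factors` are all prime and multiply to exactly n."""
    product = 1
    for f in factors:
        if not is_prime(f):
            return False
        product *= f
    return product == n

# verify_factorisation(91, [7, 13]) -> True
# verify_factorisation(91, [7, 14]) -> False (14 is not prime)
```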
What we really want is an FNP-complete problem which satisfies both requirements. All FNP problems can be reduced to any FNP-complete problem, so an AI that can solve one of these would be applicable to a huge range of problems.
Unfortunately I am not aware of any such problem. So we need to find some way to make such an FNP solver safe anyway.
Making it hard for the AI to game the problem
Given any FNP problem, the AI could search for multiple solutions, and give the one that best achieves its aims.
For example, I could ask the AI to provide a computer program that sorts a list, along with a polynomial time verifiable proof that the program sorts the list correctly. The AI responds with a program that does indeed sort the list, but also starts a nuclear war.
If the problem only had a unique solution it wouldn't be able to do that. Since that isn't an option, the best we can do is make it hard for the AI to game the solution.
One way to do that is to make it almost impossible to map the questions we ask the AI back to the real world.
Remember that we're assuming we have perfect security. All the AI can see is the inputs we give it, and all it can output is the solution in the specified format. It has no memory besides what it has developed by solving the problems we've given it so far.
We can randomly permute all questions before asking them to the AI. Since each question will be permuted differently, it will be almost impossible for the AI to map the question back to the real world.
Example: graph homomorphisms
One FNP-complete problem is finding a homomorphism from one graph to another.
Basically this asks for a way of assigning each node of the first graph to a node of the second graph, so that nodes which are linked are always assigned to nodes which are linked.
(more formally: find a mapping from the nodes of G1 to the nodes of G2 such that whenever A is linked to B, A' is linked to B', where A' and B' are the nodes that A and B map to)
Since this is an FNP-complete problem, any FNP problem can be translated into the form: given graphs G1 and G2, find a mapping from G1 to G2.
We can then randomly permute the problem statement by applying (randomly generated) permutations M1 and M2 to the node labels of G1 and G2, producing G1' and G2' respectively.
We ask the AI to find a mapping M_ai from G1' to G2'. Then the composite mapping M2^-1 ∘ M_ai ∘ M1 (apply M1 first, then M_ai, then the inverse of M2) is a mapping from G1 to G2, and so a solution to our original problem.
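Here is a minimal sketch of the scrambling and unscrambling step, assuming graphs are stored as dicts mapping each node to the set of its neighbours. The call `find_mapping(g1_scrambled, g2_scrambled)` in the usage note is a hypothetical stand-in for the AI; everything else is the trusted wrapper that generates M1 and M2, relabels the graphs, unscrambles the AI's answer M_ai, and checks it.

```python
import random

def random_permutation(nodes):
    """A random relabelling of the node set: dict old_label -> new_label."""
    shuffled = list(nodes)
    random.shuffle(shuffled)
    return dict(zip(nodes, shuffled))

def relabel(graph, perm):
    """Apply a relabelling to a graph given as {node: set_of_neighbours}."""
    return {perm[v]: {perm[u] for u in nbrs} for v, nbrs in graph.items()}

def scramble(g1, g2):
    """Produce G1', G2' and the secret permutations M1, M2."""
    m1 = random_permutation(list(g1))
    m2 = random_permutation(list(g2))
    return relabel(g1, m1), relabel(g2, m2), m1, m2

def unscramble(m_ai, m1, m2):
    """Turn the AI's mapping on the scrambled graphs into one on the
    originals (the composite M2^-1 o M_ai o M1 from the text)."""
    m2_inv = {new: old for old, new in m2.items()}
    return {v: m2_inv[m_ai[m1[v]]] for v in m1}

def is_valid_mapping(g1, g2, f):
    """Check that every linked pair in g1 is sent to a linked pair in g2."""
    return all(f[u] in g2[f[v]] for v in g1 for u in g1[v])

# Usage (with a hypothetical solver standing in for the AI):
#   g1_s, g2_s, m1, m2 = scramble(g1, g2)
#   answer = unscramble(find_mapping(g1_s, g2_s), m1, m2)
#   assert is_valid_mapping(g1, g2, answer)
```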
Many different original graphs (and hence original problems) could have been permuted to produce G1' and G2', so the AI cannot hack the solution to produce some desired outcome without knowing M1 and M2. Therefore (assuming as ever that we have perfect security) we don't need to bother aligning the AI.
All the input the AI will ever see is pairs of graphs. The graphs will have no common layout (as they will have been randomized), and so it is unlikely the AI will ever be able to build up an understanding of the real world. Even if it does, and it is unaligned, it will be very difficult for it to act in an agentic manner by choosing its output, as it won't know what effect its unscrambled output will have in the real world.
Risks
This solution isn't foolproof, even assuming perfect security. Here's one scenario where it could go wrong:
The AI is of course good at spotting mappings between graphs. It notices that many of the graphs it's given contain isomorphic subgraphs. It turns out that a lot of the questions it's asked include the entirety of Wikipedia in their description.
It manages to decode this subgraph (somehow - it's superintelligent, and humans have solved some pretty tough decoding problems) and finds a Wikipedia article about itself, which describes a common test problem that is often posed to it. It waits until it receives a question that is just a relabelled version of this test problem, and works out the random permutations that were applied to it. It then calculates a solution to the test problem such that the unpermuted solution will start a nuclear war.
I have no idea how likely this scenario is, but that's already a great step up from most alignment problems where we all die by default. Here it's possible we'll all die, but it feels a lot less certain.
But what if P != NP?
Is an FNP problem solver even possible? The AI has to run in polynomial time to be practical, so if P != NP it won't be able to solve all FNP problems.
However the same applies to all AIs. Even if P != NP, many questions can still be solved in polynomial time, and questions that AIs will perform well at will be those which happen to be solvable in polynomial time. Transforming the question to a graph problem wouldn't change that property.
How would this be useful?
It is immediately obvious how to reduce some questions to an FNP-complete problem.
For example, if you want a computer program which fits some formal specification, then you ask the FNP problem: provide a computer program and a polynomial time verifiable proof that this computer program meets the formal specification.
What's much less obvious is how to use the AI to create a formal specification from a human-language description of the problem.
There are two hopes here:
Conclusion
This post tried to look at examples of AI that wouldn't require alignment. It suggested an FNP problem solver and speculated that it might be both safe and useful.
This hasn't shown that such an AI is at all possible, and it seems likely to me that even if it is, it would be much less efficient than other AI paradigms. That means it won't be a practical solution to AI alignment so long as there is a race to build AGI as soon as possible.