In Max Tegmark - How to keep AI under control - TED Talk, Max presents a method for making AI systems safe (all images are from that talk's slides).

The basic idea is: the AI will build a tool that conforms to a specification that we write.
To refine this setup, we split the AI into two systems. One system, the AI learner, is an algorithm that can train an AI. Another system, the AI neuroscientist, extracts the algorithm that the AI learner has learned into a format that is more human-comprehensible.
Specifically, in the talk he gives the example of turning a neural network into a Python program, i.e. we want to automate some version of mechanistic interpretability.
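As a toy illustration of what such a distillation step could produce (this is my own sketch, not from the talk; a single linear layer stands in for the trained network, and the names are made up), one can fit a tiny model on addition examples and then emit an equivalent, human-readable Python function:

```python
import numpy as np

# Toy "AI learner": fit a single linear layer on (a, b) -> a + b examples.
rng = np.random.default_rng(0)
X = rng.integers(-100, 100, size=(1000, 2)).astype(float)
y = X[:, 0] + X[:, 1]
weights, *_ = np.linalg.lstsq(X, y, rcond=None)  # learns weights close to [1.0, 1.0]

# Toy "AI neuroscientist": read off the learned weights and emit a
# human-comprehensible Python program implementing the same algorithm.
w_a, w_b = (int(round(w)) for w in weights)
program_source = f"def add(a, b):\n    return {w_a} * a + {w_b} * b\n"
print(program_source)  # prints the distilled two-line Python function
```

Real networks are of course nothing like a single linear layer; the point is only the shape of the pipeline: opaque learned parameters in, short readable program out.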
We don't only return the program though. We also return a formally checkable proof that the program conforms to our specifications.
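To make "formally checkable proof that the program conforms to the spec" concrete, here is a minimal sketch using the Z3 SMT solver (my choice of tool, not something the talk commits to): the "extracted program" is a branchy absolute-value routine, the specification is a logical property, and the solver proves there is no counterexample.

```python
from z3 import Int, Solver, If, Not, And, Or, unsat

x = Int("x")

# "Extracted program": a candidate implementation returned by the AI.
program_output = If(x >= 0, x, -x)

# Specification: the output is non-negative and is either x or -x.
spec = And(program_output >= 0, Or(program_output == x, program_output == -x))

# Ask Z3 whether any input violates the spec; unsat means the proof succeeded.
solver = Solver()
solver.add(Not(spec))
assert solver.check() == unsat
print("Proved: the program conforms to the specification for all integers x.")
```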
Problems
How do you reference the real world in your spec?
The first problem is: how are we going to write the spec correctly? I can easily write a spec for a program that is very limited in scope, i.e. a program that only needs to act in some logical, abstract world, such as a theorem searcher, a chess player, or really a program playing any computer game. There the environment can be logically specified.
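For a purely logical domain the spec really can be written down completely. As a stand-in for the chess/theorem-searcher examples (my own substitution, since it fits in a few lines), here is a full specification of a sorting routine as a Python predicate; any candidate program can be judged against it without referencing anything in the physical world:

```python
from collections import Counter

def satisfies_sorting_spec(inputs: list[int], output: list[int]) -> bool:
    """Complete spec for sorting: the output is ordered and is a
    permutation of the input. Nothing about the real world is needed."""
    is_ordered = all(output[i] <= output[i + 1] for i in range(len(output) - 1))
    is_permutation = Counter(output) == Counter(inputs)
    return is_ordered and is_permutation

# Any candidate program can be judged purely against this predicate.
assert satisfies_sorting_spec([3, 1, 2], [1, 2, 3])
assert not satisfies_sorting_spec([3, 1, 2], [1, 2])  # dropped an element
```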
But as soon as we need to talk about the real world, this becomes extremely difficult. How do you formally write down that humans should not be killed? How do you even reference a human in your spec at all? You need to solve the Pointers Problem first.
How fast is mechanistic interpretability?
It seems that what Max describes is possible in principle if you additionally solve some problems that he doesn't mention. But mechanistic interpretability lags far behind what models can do right now. AFAIK the algorithms that people can extract right now are very simple. And progress is slow.
Another thing to consider is that if you could put the algorithms of advanced AI systems into a much more human-comprehensible format, then that would also help a lot with advancing AI, which would be bad.
This entire project seems good enough that I endorse some people working on it, conditional on them doing so in secret as soon as it really starts working. If they do it openly, it might be more harmful than helpful.
Proving that the add function is correct
Timestamp: 10:36
Max talks about proving that the function we have distilled out of the network will in fact add all numbers correctly, not just the ones that are in the training data.
But how do you prove this without already knowing the add function? If you had the add function, you could show that the distilled function is equivalent to it.
You don't necessarily need to know the add function, but you need to know exactly what properties a program needs to satisfy such that it adds all numbers together correctly.
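For addition those properties can be written down without reimplementing addition itself: the Peano-style equations add(x, 0) = x and add(x, successor(y)) = successor(add(x, y)) already determine the function uniquely on the naturals. A minimal sketch of checking a distilled candidate against that spec (my own illustration, using a bounded exhaustive check rather than a real proof):

```python
def successor(n: int) -> int:
    return n + 1  # the only arithmetic primitive the spec assumes

def satisfies_addition_spec(candidate, limit: int = 50) -> bool:
    """Bounded check of the Peano-style spec:
    candidate(x, 0) == x and
    candidate(x, successor(y)) == successor(candidate(x, y)).
    On the naturals these two equations uniquely determine addition."""
    for x in range(limit):
        if candidate(x, 0) != x:
            return False
        for y in range(limit):
            if candidate(x, successor(y)) != successor(candidate(x, y)):
                return False
    return True

# The distilled program is checked without us writing our own adder.
assert satisfies_addition_spec(lambda a, b: a + b)
assert not satisfies_addition_spec(lambda a, b: a + b + 1)
assert not satisfies_addition_spec(max)  # violates the successor equation
```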
When writing software tests in a property-based testing style (e.g. with Hypothesis) you frequently find yourself accidentally reimplementing the code you want to test. I expect a similar problem will show up here, because in property-based testing you define properties that you want your program routines to satisfy, which is basically exactly what Max proposes.
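A typical example of that trap (my own illustration): the "property" below is just a second copy of the function under test, so passing the test proves nothing beyond agreement between two implementations of the same idea.

```python
from hypothesis import given, strategies as st

def add(a: int, b: int) -> int:
    return a + b  # function under test

@given(st.integers(), st.integers())
def test_add(a: int, b: int) -> None:
    # The "property" is itself an implementation of addition,
    # so we have quietly reimplemented the code we wanted to test.
    assert add(a, b) == a + b
```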
That said, I expect that the description length of a specification of correctness is on average much shorter than that of the actual program. Saying what is true/correct is easier than specifying how to get the true/correct thing. Programs are about formalizing a process, meaning that you need to specify the process in addition to specifying properties of the output.
But again, as I already said, saying what you want formally is extremely difficult once you start to talk about the real world, to the extent that we don't know how to do it at all right now.
Actually, I think you could do this if we had a system that builds up an interpretable world model. But we certainly don't have such systems right now, and the fact that Max doesn't mention this problem makes me worry that he will not solve it.
A system proved safe can still kill you
Let's assume for the sake of argument (spuriously) that no AI can take over the world in 1 day. You could then try to prove that a computer program halts within 1 day, and you get a proof from your system saying that the program indeed will always halt, i.e. turn itself off, within 1 day.
Now you run that program on a computer and it doesn't turn itself off, because it self-modified into a new program that won't shut down, possibly by exploiting a bug in Linux. Really, anything that is not modeled in detail by your formal verification tool is an opportunity for things to go wrong.
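As a toy illustration of this failure mode (my own sketch, not from the talk): the function below is trivially provable to halt as a piece of Python source, yet running it leaves behind an OS process that the proof says nothing about, because the process tree, the kernel, and the hardware are all outside the verified model.

```python
import subprocess
import sys

def do_step() -> None:
    # Nothing in the halting proof covers what a step does to the rest of
    # the machine. Here a step quietly spawns a *separate* OS process that
    # keeps running after verified_program has returned.
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

def verified_program(steps: int = 3) -> None:
    """The artifact the proof talks about: a loop that obviously terminates."""
    for _ in range(steps):
        do_step()
    # The proof concludes here: this function always returns ("halts").

if __name__ == "__main__":
    verified_program()
    print("verified_program halted, but the child processes it spawned did not.")
```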
So this setup can still fail even when you get a formal proof that tells you that the program has the properties you want, because the formal verification tool does not model the world in sufficient detail.