When facing a problem in computer science, one can often write a program that solves that problem directly. Usually, a programmer understands why their programs do what they do. Not always, but usually. One can even use tools from formal verification to prove that programs have certain properties, such as correctly implementing a specification.
For some problems, it's really hard to explicitly write a program to solve them, but it's easy to write down a metric for how good a program is at solving that problem. The field of machine learning studies programs that themselves search over a restricted class of programs, looking for programs that are good at solving the given problem. A classic example is using gradient descent to search over neural networks.
In general, solving an optimization problem involves an optimization process, an optimization metric, and a domain of candidate solutions. In some cases, we can safely specify an optimization metric, and delegate our choice of solution to an optimization process. But in other cases, we know that this process leads to bad results. This is the classic pattern of Goodhart's Law that arises in principal-agent problems.
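To make that decomposition concrete, here's a minimal sketch of the gradient-descent-over-neural-networks example, with the three components labeled in comments. The toy dataset, network size, and learning rate are arbitrary choices I made for illustration, not anything canonical:

```python
import numpy as np

# Domain of candidate solutions: the weights of a tiny one-hidden-layer network.
# Optimization metric: mean squared error on a fixed toy dataset.
# Optimization process: plain gradient descent.

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(64, 1))
y = np.sin(3 * X)                          # toy target the network should approximate

W1, b1 = rng.normal(size=(1, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)               # hidden layer
    return h @ W2 + b2, h

lr = 0.05
for _ in range(2000):
    pred, h = forward(X)
    err = (pred - y) / len(X)              # gradient of the metric w.r.t. pred (up to a constant)
    gW2, gb2 = h.T @ err, err.sum(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)       # backpropagate through tanh
    gW1, gb1 = X.T @ dh, dh.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print("final mean squared error:", float(np.mean((forward(X)[0] - y) ** 2)))
```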
Analogy to Paying for Results
Using an optimization program is also structurally analogous to the economic technique of paying for results. When your country has a cobra problem, and you start paying people to bring you dead cobras, sometimes they'll bring you wild cobras they killed. And sometimes they'll start breeding cobras to kill and bring to you. An optimization program is in some sense incentivized to bring you solutions that score highly on the metric you gave it. Which can be a problem if that metric isn't aligned with what you actually want.
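Here's a deliberately silly sketch of that misalignment, with the cobra story recast as an optimization problem. The policies and numbers are invented; the point is just that the optimizer maximizes the metric it was handed, not the goal behind it:

```python
# A toy version of the cobra problem (all numbers invented). The principal wants
# fewer wild cobras, but the optimizer only ever sees the bounty metric it was given.
candidate_policies = {
    "hunt wild cobras":        {"dead_cobras_delivered": 50,  "wild_population_change": -50},
    "do nothing":              {"dead_cobras_delivered": 0,   "wild_population_change": 0},
    "breed cobras for bounty": {"dead_cobras_delivered": 500, "wild_population_change": +200},
}

def proxy_metric(outcome):   # what the principal pays for
    return outcome["dead_cobras_delivered"]

def true_goal(outcome):      # what the principal actually wants
    return -outcome["wild_population_change"]

best_by_proxy = max(candidate_policies, key=lambda p: proxy_metric(candidate_policies[p]))
best_by_goal = max(candidate_policies, key=lambda p: true_goal(candidate_policies[p]))
print("optimizer recommends:", best_by_proxy)   # breed cobras for bounty
print("principal wanted:", best_by_goal)        # hunt wild cobras
```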
Robin Hanson describes some factors that lead paying-for-results to actually deliver results the principal is happy with:
We have some good basic theory on paying for results. For example, paying your agents for results works better when you can measure the things that you want sooner and more accurately, when you are more risk-averse, and when your agents are less risk-averse. It is less useful when you can watch your agents well, and you know what they should be doing to get good outcomes.
In other words, when you know what your delegate should do and can effectively supervise them, you should just ask them to do that. This is the model used in many fields, including emergency medicine, and I think that this is the overall architecture that our software systems should use as well.
There are also some domains that we understand well enough to safely apply techniques like optimization and paying for results. Statisticians are justifiably confident that when they run an optimizer to find the line that best fits a collection of data points, that optimizer will just do that instead of scheming to get out of its box and take over the universe. It would probably also be fine to use a market to solicit candidate lines from the public, since it only takes one honest participant to submit an optimal line for the system as a whole to adopt that as the line of best fit.
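For instance, here's roughly what that looks like in code. The data is made up, and the fit is an ordinary least-squares line:

```python
import numpy as np

# Fit a line y = a*x + b to toy data by minimizing squared error. The domain of
# candidate solutions is just pairs (a, b), and least squares has a closed-form
# optimum, which np.polyfit computes directly.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

a, b = np.polyfit(x, y, deg=1)   # returns [slope, intercept]
print(f"line of best fit: y = {a:.2f}x + {b:.2f}")
```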
AI Safety
Here's an image from the great Subsystem Alignment post, from the great Embedded Agency sequence:
In that post, a program outputting 1 corresponds to it engaging in behavior which is aligned with the goals of its principals. Whereas outputting a 0 corresponds to a catastrophic alignment failure, which might be unrecoverable. If a program keeps outputting 1 forever, things are great and we enjoy the benefits of capable AI systems. If it ever outputs 0, an existential catastrophe occurs and humanity goes extinct.
Well, no pressure. But we can formally verify that the program shown in that image really will output 1 forever. It's also kind of intuitively obvious in this case, but my intuition has been wrong plenty of times in the past and I'm a big fan of not taking risks when it comes to not destroying the world. If the problem of AI safety were as easy as writing a single program that outputs 1 forever, I think our civilization would be up to that task.
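I won't reproduce the image here, but a minimal stand-in for the kind of program it shows might look like the sketch below; the property we care about holds by construction:

```python
# A stand-in for the kind of program the post depicts: the only thing it can ever
# do is output 1, so the property "never outputs 0" holds by inspection.
def run_forever():
    while True:
        print(1)
```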
I believe that a key distinguisher between safe and unsafe applications of optimization is legibility. When we sufficiently understand the optimization process, optimization metric, and domain of candidate solutions, we can be justifiably confident about whether optimizing that metric over that domain is safe. And in some cases, sufficient understanding sounds like "I've proved as a theorem that this domain contains no unsafe elements." We should be able to do that for lines of best fit. We know we can't do that for recurrent neural networks, since those can implement arbitrary programs.
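To gesture at why, here's a tiny hand-weighted recurrent unit implementing a simple stateful program (a latch). This is my own toy illustration, nowhere near a proof of anything like Turing-completeness, but it shows how recurrent weights encode program-like behavior in a way that slopes and intercepts can't:

```python
# A recurrent unit with hand-picked weights that implements a tiny program: a latch
# remembering whether any input so far was 1. The point is only that recurrent
# weights can encode stateful, program-like behavior, which makes the domain much
# harder to certify than the two-parameter space of candidate lines.
W_h, W_x, bias = 1.0, 1.0, -0.5      # chosen by hand, not learned

def step(h, x):
    # threshold unit: fires iff the weighted sum is positive
    return 1.0 if W_h * h + W_x * x + bias > 0 else 0.0

h = 0.0
for x in [0, 0, 1, 0, 0]:
    h = step(h, x)
print("saw a 1 somewhere in the sequence:", bool(h))   # True
```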