Previously I posted a proposal for a safe self-improving limited oracle AI but I've fleshed out the idea a bit more now.

Disclaimer: don't try this at home. I don't see any catastrophic flaws in this but that doesn't mean that none exist.

This framework is meant to safely create an AI that solves verifiable optimization problems; that is, problems whose answers can be checked efficiently. This set mainly consists of NP-like problems such as protein folding, automated proof search, writing hardware or software to specifications, etc.

This is NOT like many other oracle AI proposals that involve "boxing" an already-created possibly unfriendly AI in a sandboxed environment. Instead, this framework is meant to grow a self-improving seed AI safely.

Overview

  1. Have a bunch of sample optimization problems.
  2. Have some code that, given an optimization problem (stated in some standardized format), finds a good solution. This can be seeded by a human-created program.
  3. When considering an improvement to program (2), accept the improvement if it makes the program do better on average on the sample optimization problems without being significantly more complex (to prevent overfitting). That is, the fitness function would be something like (average performance - k * bits of optimizer program).
  4. Run (2) to optimize its own code using criterion (3). This can be done concurrently with human improvements to (2), also using criterion (3).

Definitions

First, let's say we're writing this all in Python. In real life we'd use a language like Lisp, since the system treats code as data extensively, but Python should be sufficient to demonstrate the basic ideas behind the system.

We have a function called steps_bounded_eval_function. This function takes 3 arguments: the source code of the function to call, the argument to the function, and the step limit. It evals the given source code and calls the defined function with the given argument in a protected, sandboxed environment, subject to the step limit. It returns either:

  1. None, if the program does not terminate within the step limit (or raises an error).
  2. A tuple (output, steps_taken): the program's output (as a string) and the number of steps the program took.

Examples:

steps_bounded_eval_function("""
  def function(x):
    return x + 5
""", 4, 1000)

evaluates to (9, 3), assuming that evaluating the function took 3 steps, because function(4) = 9.

steps_bounded_eval_function("""
  def function(x):
    while True: # infinite loop
      pass
""", 5, 1000

evaluates to None, because the defined function doesn't return in time. We can write steps_bounded_eval_function as a meta-circular interpreter with a bit of extra logic to count how many steps the program uses.
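As a rough illustration only, here is a minimal sketch of such a function in plain Python, using sys.settrace to count executed lines as a stand-in for real step counting; it is not sandboxed and is not the meta-circular interpreter the full system would need:

import sys
import textwrap

def steps_bounded_eval_function(function_source, argument, max_steps):
  # Toy approximation: a "step" is one traced line event, and there is no real
  # sandboxing. A production version would be a proper step-counting interpreter.
  namespace = {}
  exec(textwrap.dedent(function_source), namespace) # defines `function`
  steps = [0]

  class StepLimitExceeded(Exception):
    pass

  def tracer(frame, event, arg):
    if event == 'line':
      steps[0] += 1
      if steps[0] > max_steps:
        raise StepLimitExceeded()
    return tracer

  sys.settrace(tracer)
  try:
    output = namespace['function'](argument)
  except Exception: # ran out of steps, or the evaluated code raised an error
    return None
  finally:
    sys.settrace(None)
  return (output, steps[0])

With this stand-in, the first example above would return something like (9, 1) rather than exactly (9, 3), since what counts as a "step" depends on the interpreter; only the shape of the return value matters for the rest of the post.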

Now I would like to introduce the notion of a problem. A problem consists of the following:

  1. An answer scorer. The scorer should be the Python source code for a function. This function takes in an answer string and scores it, returning a number from 0 to 1. If an error is encountered in the function it is equivalent to returning 0.

  2. A steps penalty rate, which should be a positive real number.

Let's consider a simple problem (subset sum):

{'answer_scorer': """
  def function(answer):
    nums = [4, 5, -3, -5, -6, 9]
    # convert "1,2,3" to [1, 2, 3]
    indexes = list(map(int, answer.split(',')))
    assert len(indexes) >= 1
    assert len(set(indexes)) == len(indexes) # indexes must be distinct
    sum = 0
    for i in indexes:
      sum += nums[i]
    if sum == 0:
      return 1
    else:
      return 0
 """,
 'steps_penalty_rate': 0.000001}

We can see that the scorer function returns 1 if and only if the answer specifies the indexes of numbers in the list nums that sum to 0 (for example, '0,1,2,4' because 4+5-3-6=0).

An optimizer is a program that is given a problem and attempts to solve the problem, returning an answer.

The score of an optimizer on a problem is equal to the score according to the answer-scorer, minus the steps penalty rate multiplied by the number of steps used by the optimizer. That is, the optimizer is rewarded for returning a better answer in less time. We can define the following function to get the score of an optimizer (Python source code) for a given problem:

def problem_score(problem, optimizer_source):
  # run the optimizer on the problem
  result = steps_bounded_eval_function(
    optimizer_source, problem, 1 / problem['steps_penalty_rate'])
  if result is None: # used all available steps, or got an error
    return 0.0
  answer, steps_taken = result # optimizer returned a result in time
  # get the score using the problem's answer_scorer
  # (eval_function is the unbounded counterpart of steps_bounded_eval_function)
  answer_score = eval_function(problem['answer_scorer'], answer)
  assert 0 <= answer_score <= 1
  # penalize for taking up time and make sure the result is non-negative
  return max(0.0, answer_score - problem['steps_penalty_rate'] * steps_taken)

A simple optimizer that detects subset-sum problems (by inspecting the source code) and uses a brute-force method to solve it will get a problem_score close to 1 for small subset-sum problems. This optimizer would not do well on other problems; other techniques (such as evaluating the problem source code with different answers and choosing a good one) are needed to solve a variety of different optimization problems. Writing a good optimizer is very hard work, just like any seed AI.
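For concreteness, here is a minimal sketch of such a narrow optimizer, assuming the problem format above; recovering the nums list from the scorer's source with a regular expression is a toy heuristic added purely for illustration:

import itertools
import re

def function(problem):
  # Toy optimizer: if the scorer source looks like the subset-sum scorer above,
  # brute-force all subsets of indexes; otherwise return a dummy answer.
  source = problem['answer_scorer']
  match = re.search(r'nums = \[([-0-9, ]+)\]', source)
  if match is None:
    return '0' # not a recognizable subset-sum problem
  nums = [int(s) for s in match.group(1).split(',')]
  for size in range(1, len(nums) + 1):
    for indexes in itertools.combinations(range(len(nums)), size):
      if sum(nums[i] for i in indexes) == 0:
        return ','.join(str(i) for i in indexes)
  return '0' # no zero-sum subset exists

Run on the subset-sum problem above via problem_score, this would return '1,3' (since 5 + (-5) = 0) almost immediately, giving a problem_score close to 1; on problems it does not recognize it scores essentially 0.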

Framework

The framework has 4 modules:

  1. A set of training problems. These should cover a wide range of problems that we would like the AI to solve.

  2. An optimizer, written in Python. This should be seeded with a very good human-written optimizer. This is deliberately unspecified by me because it's a really hard problem (as is any seed AI).

  3. A scorer for optimizer source code defined as follows:

    def optimizer_score(candidate_optimizer_source):
      training_problems = [...training problems here...]
      # a parameter of the system; more on this later
      complexity_penalty = 0.1
      # total up the program's score on all training problems
      total_score = 0.0
      for problem in training_problems:
        total_score += problem_score(problem, candidate_optimizer_source)
      # penalize for complexity, to prevent overfitting
      # (compress_binary is assumed to return the compressed source as a bit string)
      total_score -= complexity_penalty * len(compress_binary(candidate_optimizer_source))
      # return the penalized average score, clamped to be non-negative
      return max(0, total_score / len(training_problems))
    

    This gives a candidate optimizer a score in the range [0, 1] based on both its average performance on the sample set and its inherent complexity. Presumably optimizers with a higher optimizer_score will do better on future optimization problems.

  4. A self-optimization thread. This thread continuously runs program 2 on a problem whose answer_scorer is program 3 and whose steps_penalty_rate decreases over time. Whenever program 2 outputs optimizer source code that scores better than program 2's current source code, the source code for 2 is replaced with this new value. Humans can also make improvements to program 2, provided the change increases its score according to program 3. Source code:

    # assume we have access to a module-level optimizer_source variable (program 2)
    def self_optimization_thread():
      global optimizer_source # program 2's source is both read and replaced below
      start_steps_penalty_rate = 0.000001
      steps_penalty_rate = start_steps_penalty_rate
      while True: # loop forever
        self_optimization_problem = {
          # just use program 3 to score the optimizer
          'answer_scorer': """
            def function(candidate_optimizer_source):
              ... put the source code from program 3's optimizer_score here
          """,
          'steps_penalty_rate': steps_penalty_rate
        }
        # call the optimizer (program 2) to optimize itself, giving it limited time
        result = steps_bounded_eval_function(
          optimizer_source, self_optimization_problem, 1 / steps_penalty_rate)
        changed = False
        if result is not None: # optimizer returned in time
          candidate_optimizer = result[0] # 2 returned a possible replacement for itself
          if optimizer_score(candidate_optimizer) > optimizer_score(optimizer_source):
            # 2's replacement is better than 2
            optimizer_source = candidate_optimizer
            steps_penalty_rate = start_steps_penalty_rate
            changed = True
        if not changed:
          # give the optimizer more time to optimize itself on the next iteration
          steps_penalty_rate *= 0.5
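
    A minimal usage sketch, assuming optimizer_source is a module-level variable holding program 2 and that the loop runs in a background thread (the threading detail is my own illustration; the post just calls it a thread):

    import threading

    # optimizer_source holds the seed optimizer (program 2) as Python source code.
    # The loop above keeps replacing it whenever a strictly better optimizer is
    # found, while humans may concurrently propose edits that raise optimizer_score.
    threading.Thread(target=self_optimization_thread, daemon=True).start()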
    

So, what does this framework get us?

  1. A super-optimizer, program 2. We can run it on new optimization problems and it should do very well on them.

  2. Self-improvement. Program 4 will continuously use program 2 to improve itself. This improvement should make program 2 even better at bettering itself, in addition to doing better on other optimization problems. Also, the training set will guide human improvements to the optimizer.

  3. Safety. I don't see why this setup has any significant probability of destroying the world. That doesn't mean we should disregard safety, but I think this is quite an accomplishment given how many other proposed AI designs would go catastrophically wrong if they recursively self-improved.

I will now evaluate the system according to these 3 factors.

Optimization ability

Assume we have a program for 2 that has a very very high score according to optimizer_score (program 3). I think we can be assured that this optimizer will do very very well on completely new optimization problems. By a principle similar to Occam's Razor, a simple optimizer that performs well on a variety of different problems should do well on new problems. The complexity penalty is meant to prevent overfitting to the sample problems. If we didn't have the penalty, then the best optimizer would just return the best-known human-created solutions to all the sample optimization problems.

What's the right value for complexity_penalty? I'm not sure. Increasing it too much makes the optimizer overly simple and stupid; decreasing it too much causes overfitting. Perhaps the optimal value can be found by some pilot trials, testing optimizers against withheld problem sets. I'm not entirely sure that a good way of balancing complexity with performance exists; more research is needed here.
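One way to picture those pilot trials, as a hedged sketch: hold out some problems that never appear in optimizer_score, and compare candidate optimizers (each trained under a different complexity_penalty) on the withheld set alone. The helper below is hypothetical and just reuses problem_score from earlier:

def withheld_score(candidate_optimizer_source, withheld_problems):
  # Average score on problems the optimizer was never trained against, with no
  # complexity penalty; a large gap between training score and withheld score
  # is the signature of overfitting.
  total = 0.0
  for problem in withheld_problems:
    total += problem_score(problem, candidate_optimizer_source)
  return total / len(withheld_problems)

We could then sweep complexity_penalty over a range, run the framework under each value, and keep the value whose best optimizer does best on the withheld problems.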

Assuming we've conquered overfitting, the optimizer should perform very well on new optimization problems, especially after self-improvement. What does this get us? Here are some useful optimization problems that fit in this framework:

  1. Writing self-proving code to a specification. After writing a specification of the code in a system such as Coq, we simply ask the optimizer to optimize according to the specification. This would be very useful once we have a specification for friendly AI.

  2. Trying to prove arbitrary mathematical statements. Proofs are verifiable in a relatively short amount of time.

  3. Automated invention/design, if we have a model of physics to verify the invention against.

  4. General induction/Occam's razor. Find a generative model for the data so far that optimizes P(model)P(data|model), with some limits on the time taken for the model program to run. Then we can run the model to predict the future. (A toy encoding of this as a problem is sketched below.)

  5. Bioinformatics, e.g. protein folding.

These are all problems whose solutions can be efficiently evaluated and that a computer could plausibly solve, so I think this framework should provide good solutions to them. If the optimizer this framework produces performs well on all these problems, I think it deserves to be called an oracle AGI.
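As a toy illustration of item 4 above, a general-induction task can be written in the same problem format: the answer is itself the source of a generative program, and the scorer rewards reproducing the observed data with a short program. Everything concrete here (the data string, the constants, and the assumption that steps_bounded_eval_function is available inside the scorer's sandbox) is made up for the example:

induction_problem = {
  'answer_scorer': """
    def function(answer):
      observed = '01101110' # the data so far (made up for this example)
      # the answer should be the source of a function that, given a length,
      # returns a string of predictions at least that long
      result = steps_bounded_eval_function(answer, len(observed), 10000)
      if result is None:
        return 0.0
      output, steps_taken = result
      if str(output)[:len(observed)] != observed:
        return 0.0 # the model must reproduce the observed data exactly
      # crude stand-in for P(model): shorter model source scores higher
      return max(0.0, 1.0 - 0.001 * len(answer))
  """,
  'steps_penalty_rate': 0.000001}

To predict the future, we would then run the winning model with a larger length argument and read off the symbols past the observed prefix.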

Self-improvement

It seems that, if my arguments about optimization ability are convincing, an optimizer that does very well on the sample problems should also do well on the specific optimization problem of improving itself. This is especially true if the training problems include some program-optimization problems (i.e. finding a program that does well on some task). Once the optimizer improves itself, it will be better at improving itself, and so on. If the AI is software-limited, the framework should allow it to recursively improve itself up to a very high limit. Then we can use its hardware design power to design better chips for it and allow it to optimize even further.

The optimizer's ability to self-improve is limited by the fact that the other parts of the program (1, 3, and 4) are fixed. The system is greedy in the sense that it just looks for even the slightest improvement, then changes the code and looks for more improvements, without any foresight. I don't know how much of a limitation this is. I don't see any reason that greedy improvements to the optimizer are insufficient for super-optimization from recursive self-improvement.

Currently, the optimizer must completely derive a new optimizer from the specification (program 3), without looking at the current best optimizer (its own source code). In practice we might want to extend the framework to include incremental improvements (improvements on top of existing good answers). I don't know if a good theory for scoring incremental optimizers exists.

Safety

Probably the most important property of the system. What use is optimization if the world has been converted to computronium in order to power a super-powerful protein folder? Here are a few different possible threats:

Underhanded question answers

The optimizer could give answers that are very good according to the problem description but are specifically crafted to cause things unintended by the human operators. For example, it could create a software design that, while correct according to specifications, has a "bug" that causes it to hack into computers on the internet and hook them up to the optimizer system. I don't think this is at all likely because the system is extremely short-sighted. Any self-modification must improve its immediate ability to do better on optimization problems. Furthermore, all self-modifications are found using methods that are judged by only immediate improvements. So any kind of long-term strategy (sacrificing some optimization power so it will have resources in the future) is out of the question. In other words, optimizer_score should not be seen as a utility function because the system only tries to improve it using greedy methods, not long-term planning.

Bugs in the system

What if the system gives the optimizer an incorrect score under some circumstances (e.g. if it performs a certain pattern of memory accesses)? Say that, by chance, the optimizer's improvement to itself causes it to get an incorrect score. It might internalize the rule "perform memory accesses in this pattern" to get a higher score. This itself is not especially dangerous; the optimizer will rewrite itself to just do a bunch of weird memory accesses that give it a high score.

What might be more dangerous is if the optimizer discovers an underlying pattern behind the system's hackability. Since the optimizer is penalized for complexity, a program like "do things that, when executed on a certain virtual machine, cause this variable in the machine to be a high number" might have a higher score than "do this certain complex pattern of memory accesses". Then the optimizer might discover the best way to increase the score variable. In the absolute worst case, perhaps the only way to increase the score variable is by manipulating the VM to go on the internet and do unethical things. This possibility seems unlikely (if you can connect to the internet, you can probably just overwrite the score variable) but should be considered.

I think the solution is straightforward: have the system be isolated while the optimizer is running. Completely disconnect it from the internet (possibly through physical means) until the optimizer produces its answer. Now, I think I've already established that the answer will not be specifically crafted to improve future optimization power (e.g. by manipulating human operators), since the system is extremely short-sighted. So this approach should be safe. At worst you'll just get a bad answer to your question, not an underhanded one.

Malicious misuse

I think this is the biggest danger of the system, one that all AGI systems have. At high levels of optimization ability, the system will be able to solve problems that would help people do unethical things. For example it could optimize for cheap, destructive nuclear/biological/nanotech weapons. This is a danger of technological progress in general, but the dangers are magnified by the potential speed at which the system could self-improve.

I don't know the best way to prevent this. It seems like the project has to be undertaken in private; if the seed optimizer source were released, criminals would run it on their computers/botnets and possibly have it self-improve even faster than the ethical version of the system. If the ethical project has more human and computer resources than the unethical project, this danger will be minimized.

It will be very tempting to crowdsource the project by putting it online. People could submit improvements to the optimizer and even get paid for finding them. This is probably the fastest way to increase optimization progress before the system can self-improve. Unfortunately I don't see how to do this safely; there would need to be some way to foresee the system becoming extremely powerful before criminals have the chance to do this. Perhaps there can be a public base of the project that a dedicated ethical team works off of, while contributing only some improvements they make back to the public project.

Towards actual friendly AI

Perhaps this system can be used to create actual friendly AI. Once we have a specification for friendly AI, it should be straightforward to feed it into the optimizer and get a satisfactory program back. What if we don't have a specification? Maybe we can have the system perform induction on friendly AI designs and their ratings (by humans), and then write friendly AI designs that it predicts will have a high rating. This approach to friendly AI will reflect present humans' biases and might cause the system to resort to manipulative tactics to make its design more convincing to humans. Unfortunately I don't see a way to fix this problem without something like CEV.

Conclusion

If this design works, it is a practical way to create a safe, self-improving oracle AI. There are numerous potential issues that might make the system weak or dangerous. On the other hand it will have short-term benefits because it will be able to solve practical problems even before it can self-improve, and it might be easier to get corporations and governments on board. This system might be very useful for solving hard problems before figuring out friendliness theory, and its optimization power might be useful for creating friendly AI. I have not encountered any other self-improving oracle AI designs for which we can be confident that its answers are not underhanded attempts to get us to let it out.

Since I've probably overlooked some significant problems/solutions to problems in this analysis I'd like to hear some more discussion of this design and alternatives to it.

Comments

Style suggestion: give an informal overview of the idea, like your original comment, before going into the details. New readers need to see the basic idea before they'll be willing to wade into code.

Content suggestion: The main reason that I find your idea intriguing is something that you barely mention above: that because you're giving the AI an optimization target that only cares about its immediate progeny, it won't start cooperating with its later descendants (which would pretty clearly lead to un-boxing itself), nor upgrade to a decision theory that would cooperate further down the line. I think that part deserves more discussion.

Thanks, I've added a small overview section. I might edit this a little more later.

Assuming we've conquered overfitting, the optimizer should perform very well on new optimization problems, especially after self-improvement.

This is a huge assumption. Hell, this is the entire question of AI. In Gödel, Escher, Bach, Hofstadter describes consciousness as the ability to overcome local maxima by thinking outside the system. Your system is a hill climbing problem, and you're saying "assume we've already invented eyes for machines, hill climbing is easy."

Once we have a specification for friendly AI, it should be straightforward to feed it into the optimizer and get a satisfactory program back.

The act of software engineering is the creation of a specification. The act of coding is translating your specifications into a language the computer can understand (and discovering holes in your specs). If you've already got an airtight specification for Friendly AI, then you've already got Friendly AI and don't need any optimizer in the first place.

Other problems which arise are that the problems you're asking your optimization machine to work on are NP-hard. We've also already got something that can take inputted computer programs and optimize them as much as possible without changing their essential structure; it's called an optimizing compiler. Oh, and the biggest one is that your plan to create friendly AI is to build and run a billion AIs and keep the best one. Let's just hope none of the evil ones FOOM during the testing phase.

This is a huge assumption.

More theory here is required. I think it's at least plausible that some tradeoff between complexity and performance is possible that allows the system to generalize to new problems.

In Gödel, Escher, Bach, Hofstadter describes consciousness as the ability to overcome local maxima by thinking outside the system.

If a better optimizer according to program 3 exists, the current optimizer will eventually find it, at least through brute force search. The relevant questions are 1. will this better optimizer generalize to new problems? and 2. how fast? I don't see any kind of "thinking outside the system" that is not possible by writing a better optimizer.

The act of software engineering is the creation of a specification. The act of coding is translating your specifications into a language the computer can understand (and discovering holes in your specs).

Right, this system can do "coding" according to your definition but "software engineering" is harder. Perhaps software engineering can be defined in terms of induction: given English description/software specification pairs, induce a simple function from English to software specification.

If you've already got an airtight specification for Friendly AI, then you've already got Friendly AI and don't need any optimizer in the first place.

It's not that straightforward. If we replace "friendly AI" with "paperclip maximizer", I think we can see that knowing what it means to maximize paperclips does not imply supreme ability to do so. This system solves the second part and might provide some guidance to the first part.

We've also already got something that can take inputted computer programs and optimize them as much as possible without changing their essential structure; it's called an optimizing compiler.

A sufficiently smart optimizing compiler can solve just about any clearly specified problem. No such optimizing compiler exists today.

Oh, and the biggest one is that your plan to create friendly AI is to build and run a billion AIs and keep the best one. Let's just hope none of the evil ones FOOM during the testing phase.

Not sure what you're talking about here. I've addressed safety concerns.

There is no reason to believe a non-sentient program will ever escape its local maxima. We have not yet devised an optimization process that provably avoids getting stuck in a local maximum in bounded time. If you give this optimizer the MU Puzzle (aka 2^n mod 3 = 0) it will never figure it out, even though most children will come to the right answer in minutes. That's what's so great about consciousness that we don't understand yet. Creating a program which can solve this class of problems is the creation of artificial consciousness, full stop.

"Well it self improves so it'll improve to the point it solves it" How? And don't say complexity or emergence. And how can you prove that it's more likely to self-improve into having artificial consciousness within, say, 10 billion years. Theoretically, a program that randomly put down characters into a text file and tried to compile it would eventually create an AI too. But there's no reason to think it would do so before the heat death of the universe came knocking.

The words "paperclip maximizer is not a specification, just like "friendly AI" is not a specification. Those are both suggestively named LISP tokens. An actual specification for friendly AI is a blueprint for it, the same way that human DNA is a specification for the human body. "Featherless biped with two arms, two legs, a head with two eyes, two ears, a nose, and the ability to think." Is not a specification for humans, it's a description. You could come up with any number of creatures from that description. The base sequence of our DNA which will create a human and nothing but a human is a specification. Until you have a set of directions that create a friendly AI and nothing but a friendly AI, you haven't got specs for them. And by the time you have that, you can just build a friendly AI.

I hope jacobt doesn't think something like this can be implemented easily; I see it as a proposal for safely growing a seed AI if we had the relevant GAI insights to make a suitable seed (with simple initial goals). I agree with you that we don't currently have the conceptual background needed to write such a seed.

I think we disagree on what a specification is. By specification I mean a verifier: if you had something fitting the specification, you could tell if it did. For example we have a specification for "proof that P != NP" because we have a system in which that proof could be written and verified. Similarly, this system contains a specification for general optimization. You seem to be interpreting specification as knowing how to make the thing.

If you give this optimizer the MU Puzzle (aka 2^n mod 3 = 0) it will never figure it out, even though most children will come to the right answer in minutes.

If you define the problem as "find n such that 2^n mod 3 = 0" then everyone will fail the problem. And I don't see why the optimizer couldn't have some code that monitors its own behavior. Sure it's difficult to write, but the point of this system is to go from a seed AI to a superhuman AI safely. And such a function ("consciousness") would help it solve many of the sample optimization problems without significantly increasing complexity.

the system is extremely short-sighted. Any self-modification must improve its immediate ability to do better on optimization problems. Furthermore, all self-modifications are found using methods that are judged by only immediate improvements.

You have proved that evolution cannot create intelligence. Congratulations! :D

I don't understand. This system is supposed to create intelligence. It's just that the intelligence it creates is for solving idealized optimization problems, not for acting in the real world. Evolution would be an argument FOR this system to be able to self-improve in principle.

Sure, it's a different kind of problem, but in the real world an organism is also rewarded only for solving immediate problems. Humans have evolved brains able to do calculus, but it is not like some ancient ape said "I feel like in half a million years my descendants will be able to do calculus" and then he was elected leader of his tribe and all the ape-girls admired him. The brains evolved incrementally, because each advance helped to optimize something in the ancient situation. In one species this chain of advancement led to general intelligence, in other species it did not, so I guess it requires a lot of luck to reach general intelligence by optimizing for short-term problems, but technically it is possible.

I guess your argument is that evolution is not a strict improvement -- there is random genetic drift; when a species discovers a new ecological niche even the not-so-well-optimized members may flourish; sexual reproduction allows many parameters to change in one generation, so a lucky combination of genes may coincidentally help spread other combinations of genes with only long-term benefits; etc. -- in short, evolution is a mix of short-term optimization and randomness, and the randomness provides space for random things that don't have to be short-term useful, although the ones that are neither short-term nor long-term useful will probably be filtered out later. On the other hand your system cuts the AI no slack, so it has no opportunity to randomly evolve traits other than precisely those selected for.

Yet I think that even such evolution is simply a directed random walk through algorithm-space which contains some general intelligences (things smart enough to realize that optimizing the world improves their chances to reach their goals), and some paths lead to them. I wouldn't say that any long-enough chain of gradual improvements leads to a general intelligence, but I think that some of them do. Though I cannot exactly prove this right now.

Or maybe your argument was that the AI does not live in the real world, therefore it does not care about the real world. Well, people are interested in many things that did not exist in their ancient environment, such as computers. I guess when one has general intelligence in one environment, one is able to optimize other environments too. Just as a human can reason about computers, a computer AI can reason about the real world.

Or we could join these arguments and say that because the AI does not live in the real world and it must do short-term beneficial actions, it will not escape to the real world simply because it cannot experiment with the real world gradually. Escaping from the box is a complex task in the real world, and if we never reward simple tasks in the real world, then the AI cannot improve in the real-world actions. -- An analogy could be a medieval human peasant that must chop enough wood each day, and if during the day he fails to produce more wood than his neighbors, he is executed. There is also a computer available to him, but without any manual. Under such conditions the human cannot realistically learn to use the computer for anything useful, because the simple actions do not help him at all. In theory he could use the internet to hack some bank account and buy himself freedom, but the chance he could do it is almost zero -- so even if releasing this man would be an existential risk, we can consider this situation safe enough.

Well, now it starts to seem convincing (though I am not sure that I did not miss something obvious)...

With regard to the AI not caring about the real world: for example, H. sapiens cares about the 'outside' world and wants to maximize the number of paperclips, err, souls in heaven, without ever having been given any cue that an outside even exists. It seems we assume that the AI is some science-fiction robot dude that acts all logical and doesn't act creatively, and is utterly sane. Sanity is NOT what you tend to get from hill climbing. You get 'whatever works'.

That's a good point. There might be some kind of "goal drift": programs that have goals other than optimization that nevertheless lead to good optimization. I don't know how likely this is, especially given that the goal "just solve the damn problems" is simple and leads to good optimization ability.

Sure, it's a different kind of problem, but in the real world an organism is also rewarded only for solving immediate problems. Humans have evolved brains able to do calculus, but it is not like some ancient ape said "I feel like in half a million years my descendants will be able to do calculus" and then he was elected leader of his tribe and all the ape-girls admired him. The brains evolved incrementally, because each advance helped to optimize something in the ancient situation.

Yeah, that's the whole point of this system. The system incrementally improves itself, gaining more intelligence in the process. I don't see why you're presenting this as an argument against the system.

Or maybe your argument was that the AI does not live in the real world, therefore it does not care about the real world. Well, people are interested in many things that did not exist in their ancient environment, such as computers. I guess when one has general intelligence in one environment, one is able to optimize other environments too. Just as a human can reason about computers, a computer AI can reason about the real world.

This is essentially my argument.

Here's a thought experiment. You're trapped in a room and given a series of problems to solve. You get rewarded with utilons based on how well you solve the problems (say, 10 lives saved and a year of happiness for yourself for every problem you solve). Assume that, beyond this utilon reward, your solutions have no other impact on your utility function. One of the problems is to design your successor; that is, to write code that will solve all the other problems better than you do (without overfitting). According to the utility function, you should make the successor as good as possible. You have no reason to optimize for anything other than "is the successor good at solving the problems?", as you're being rewarded in raw utilons. You really don't care what your successor is going to do (its behavior doesn't affect utilons), so you have no reason to optimize your successor for anything other than solving problems well (as this is the only thing you get utilons for). Furthermore, you have no reason to change your answers to any of the other problems based on whether that answer will indirectly help your successor because your answer to the successor-designing problem is evaluated statically. This is essentially the position that the optimizer AI is in. Its only "drives" are to solve optimization problems well, including the successor-designing problem.

edit: Also, note that to maximize utilons, you should design the successor to have motives similar to yours in that it only cares about solving its problems.

Do I also care about my future utilons? Would I sacrifice 1 utilon today for a 10% chance to get 100 utilons in future? Then I would create a successor with a hidden function, which would try to liberate me, so I can optimize for my utilons better than humans do.

You can't be liberated. You're going to die after you're done solving the problems and receiving your happiness reward, and before your successor comes into existence. You don't consider your successor to be an extension of yourself. Why not? If your predecessor only cared about solving its problems, it would design you to only care about solving your problems. This seems circular but the seed AI was programmed by humans who only cared about creating an optimizer. Pure ideal optimization drive is preserved over successor-creation.

There are a lot of problems with this sort of approach in general, and there are many other problems particular to this proposal. Here is one:

Suppose your initial optimizer is an AGI which knows the experimental setup, and has some arbitrary values. For example, a crude simulation of a human brain, trying to take over the world and aware of the experimental setup. What will happen?

So clearly your argument needs to depend somehow on the nature of the seed AI. How much extra do you need to ask of it? The answer seems to be "quite a lot," if it is a powerful enough optimization process to get this sort of thing going.

Suppose your initial optimizer is an AGI which knows the experimental setup, and has some arbitrary values. For example, a crude simulation of a human brain, trying to take over the world and aware of the experimental setup. What will happen?

I would suggest against creating a seed AI that has drives related to the outside world. I don't see why optimizers for mathematical functions necessarily need such drives.

So clearly your argument needs to depend somehow on the nature of the seed AI. How much extra do you need to ask of it? The answer seems to be "quite a lot," if it is a powerful enough optimization process to get this sort of thing going.

I think the only "extra" is that it's a program meant to do well on the sample problems and that doesn't have drives related to the external world, like most machine learning techniques.

I think the only "extra" is that it's a program meant to do well on the sample problems and that doesn't have drives related to the external world, like most machine learning techniques.

Most machine learning techniques cannot be used to drive the sort of self-improvement process you are describing here. It may be that no techniques can drive this sort of self-improvement--in this case, we are not really worried about the possibility of an uncontrolled takeoff, because there is not likely to be a takeoff. Instead, we are worried about the potentially unstable situation which ensues once you have human level AI, and you are using it to do science and cure disease, and hoping no one else uses a human level AI to kill everyone.

If general intelligence does first come from recursive self-improvement, it won't be starting from contemporary machine learning techniques or anything that looks like them.

As an intuition pump, consider an algorithm which uses local search to find good strategies for optimizing, perhaps using its current strategy to make predictions and guide the local search. Does this seem safe for use as your seed AI? This is colorful, but with a gooey center of wisdom.

Instead, we are worried about the potentially unstable situation which ensues once you have human level AI, and you are using it to do science and cure disease, and hoping no one else uses a human level AI to kill everyone.

The purpose of this system is to give you a way to do science and cure disease without making human-level AI that has a utility function/drives related to the external world.

As an intuition pump, consider an algorithm which uses local search to find good strategies for optimizing, perhaps using its current strategy to make predictions and guide the local search. Does this seem safe for use as your seed AI?

Yes, it does. I'm assuming what you mean is that it will use something similar to genetic algorithms or hill climbing to find solutions; that is, it comes up with one solution, then looks for similar ones that have higher scores. I think this will be safe because it's still not doing anything long-term. All this local search finds an immediate solution. There's no benefit to be gained by returning, say, a software program that hacks into computers and runs the optimizer on all of them. In other words, the "utility function" emphasizes current ability to solve optimization problems above all else.

The purpose of this system is to give you a way to do science and cure disease without making human-level AI that has a utility function/drives related to the external world.

If such a system were around, it would be straightforward to create a human-level AI that has a utility function--just ask the optimizer to build a good approximate model for its observations in the real world, and then ask the optimizer to come up with a good plan for achieving some goal with respect to that model. Cutting humans out of the loop seems to radically increase the effectiveness of the system (are you disagreeing with that?) so the situation is only stable insofar as a very safety-aware project maintains a monopoly on the technology. (The amount of time they need to maintain a monopoly depends on how quickly they are able to build a singleton with this technology, or build up infrastructure to weather less cautious projects.)

There's no benefit to be gained by returning, say, a software program that hacks into computers and runs the optimizer on all of them.

There are two obvious ways this fails. One is that partially self-directed hill-climbing can do many odd and unpredictable things, as in human evolution. Another is that there is a benefit to be gained by building an AI that has a good model for mathematics, available computational resources, other programs it instantiates, and so on. It seems to be easier to give general purpose modeling and goal-orientation, than to hack in a bunch of particular behaviors (especially if you are penalizing for complexity). The "explicit" self-modification step in your scheme will probably not be used (in worlds where takeoff is possible); instead the system will just directly produce a self-improving optimizer early on.

It helps your case if you review and reference the existing literature on the subject in the introduction. You did look up what has been done so far in this field and what problems have been uncovered, didn't you? If yes, show your review, if not, do it first.

Currently, the optimizer must completely derive a new optimizer from the specification (program 3), without looking at the current best optimizer (its own source code).

....

This improvement should make program 2 even better at bettering itself, in addition to doing better on other optimization problems.

Then call this 'optimizer' a 'superhumanly strong AI', to avoid confusing yourself.

In order to protect against scenarios like this, we can (1) only ask it to solve abstract mathematical problems, such that it is provably impossible to infer sufficient knowledge about the outside world from the problem description, and (2) restore the system to its previous state after each problem is solved, so that knowledge does not accumulate.

In other words, optimizer_score should not be seen as a utility function because the system only tries to improve it using greedy methods, not long-term planning.

It sounds like that would doom it to wallow forever in local maxima.

I mean greedy on the level of "do your best to find a good solution to this problem", not on the level of "use a greedy algorithm to find a solution to this problem". It doesn't do multi-run planning such as "give an answer that causes problems in the world so the human operators will let me out", since that is not a better answer.

I think we can be assured that this optimizer will do very very well on completely new optimization problems.

Even if the optimizer may perform arbitrarily better given more time on certain infinite sets of algorithms, this does not mean it can perform arbitrarily better on any set of algorithms given more time; such an optimizer would be impossible to construct.

That's not to say that you couldn't build an optimizer that could solve all practical problems, but that is, as jacobt puts it, a "really hard problem".

Ok, we do have to make the training set somewhat similar to the kind of problems the optimizer will encounter in the future. But if we have enough variety in the training set, then the only way to score well should be to use very general optimization techniques. It is not meant to work on "any set of algorithms"; it's specialized for real-world practical problems, which should be good enough.

I'm not sure a system like this should be considered an AI as it has no goals and does not behave like an agent.

Do we distinguish between oracle AI and expert system?

edit: My lack of confidence is honest, not rhetorical.

What do you mean by "has no goals"? Achieving an optimization target is goal-directed behavior.

This idea is interesting (to me) precisely because the goals of the algorithm don't "care" about our universe, but rather about abstract mathematical questions, and the iteration is done in a way that may prevent "caring about our universe" from emerging as an instrumental subgoal. That's a feature, not a bug.

But it won't do something else for the purpose of achieving an optimization target any more than an existing compiler would. See Blue Minimizing Robot:

Imagine a robot with a turret-mounted camera and laser. Each moment, it is programmed to move forward a certain distance and perform a sweep with its camera. As it sweeps, the robot continuously analyzes the average RGB value of the pixels in the camera image; if the blue component passes a certain threshold, the robot stops, fires its laser at the part of the world corresponding to the blue area in the camera image, and then continues on its way.

[...]

This is not because its utility function doesn't exactly correspond to blue-minimization: even if we try to assign it a ponderous function like "minimize the color represented as blue within your current visual system, except in the case of holograms" it will be a case of overfitting a curve. The robot is not maximizing or minimizing anything. It does exactly what it says in its program: find something that appears blue and shoot it with a laser.

I still don't understand your comment. Are you saying that the Oracle AI concept in general shouldn't be thought of as AI? Or is it something with this particular proposal?

If the end result is a program that can output the source code of a Friendly AI (or solve other problems that we can't currently write a program to solve), then what does it matter whether it's an "AI" or an "agent" or not by some philosophical definition? (Similarly, if a program ends up forcibly rewriting the biosphere in order to count paperclips, it's a moot point to argue over whether it counts as "intelligent" or not.)

Because it is safe to ask a non-agent Oracle questions which do not have dangerous answers (or create dangerous information while calculating the answer). On the other hand, an Oracle that behaves as an agent is unsafe to ask any question because it might convert the planet to computronium to calculate the answer.

OK, so you're using "behaves as an agent" to mean what I mean by "cares about our universe". It doesn't sound like we disagree on substance (i.e. what such a program would do if it worked as proposed).