Are Magical Categories Relatively Simple?
In Magical Categories, Eliezer criticizes using machine learning to learn the concept of "smile" from examples. "Smile" sounds simple to humans but is actually a very complex concept. It only seems simple to us because we find it useful.
If we saw pictures of smiling people on the left and other things on the right, we would realize that smiling people go to the left and categorize new things accordingly. A supervised machine learning algorithm, on the other hand, will likely learn something other than what we think of as "smile" (such as "containing things that pass the smiley face recognizer") and categorize molecular smiley faces as smiles.
This is because simplicity is subjective: a human will consider "happy" and "person" to be basic concepts, so the intended definition of smile as "expression of a happy person" is simple. A computational Occam's Razor will consider this correct definition to be a more complex concept than "containing things that pass the smiley face recognizer". I'll use the phrase "magical category" to refer to concepts that have a high Kolmogorov complexity but that people find simple.
I hope that it's possible to create conditions under which the computer will have an inductive bias towards magical categories, as humans do. I think that people find these concepts simple because they're useful to explain things that humans want to explain (such as interactions with people or media depicting people). The video has pixels arranged in this pattern because it depicts a person who is happy because he is eating chocolate.
So, maybe it's possible to learn these magical categories from a lot of data, by compressing the categorizer along with the data. Here's a sketch of a procedure for doing this:
-
Amass a large collection of data from various societies, containing photographs, text, historical records, etc.
-
Come up with many categories (say, one for each noun in a long list). For each category, decide which pieces of data fit the category.
-
Find categorizer_1, categorizer_2, ..., categorizer_n to minimize K(dataset + categorizer_1 + categorizer_2 + ... + categorizer_n)
What do these mean:
- K(x) is the Kolmogorov complexity of x; that is, the length of the shortest (program,input) pair that, when run, produces x. This is uncomputable so it has to be approximated (such as through resource-bounded data compression).
- + denotes string concatenation. There should be some separator so the boundaries between strings are clear.
- dataset is the collection of data
-
categorizer_k is a program that returns "true" or "false" depending on whether the input fits category #k
-
When learning a new category, find new_categorizer to minimize K(dataset + categorizer_1 + categorizer_2 + ... + categorizer_n + new_categorizer) while still matching the given examples.
Note that while in this example we learn categorizers, in general it should be possible to learn arbitrary functions including probabilistic functions.
The fact that the categorizers are compressed along with the dataset will create a bias towards categorizers that use concepts useful in compressing the dataset and categorizing other things. From looking at enough data, the concept of "person" naturally arises (in the form of a recognizer/generative model/etc), and it will be used both to compress the dataset and to recognize the "person" category. In effect, because the "person" concept is useful for compressing the dataset, it will be cheap/simple to use in categorizers (such as to recognize real smiling faces).
A useful concept here is "relative complexity" (I don't know the standard name for this), defined as K(x|y) = K(x + y) - K(y). Intuitively this is how complex x is if you already understand y. The categorizer should be trusted in inverse proportion to its relative complexity K(categorizer | dataset and other categorizers); more complex (relative to the data) categorizers are more arbitrary, even given concepts useful for understanding the dataset, and so they're more likely to be wrong on new data.
If we can use this setup to learn "magical" categories, then Friendly AI becomes much easier. CEV requires the magical concepts "person" and "volition" to be plugged in. So do all seriously proposed complete moral systems. I see no way of doing Friendly AI without having some representation of these magical categories, either provided by humans or learned from data. It should be possible to learn deontological concepts such as "obligation" or "right", and also consequentialist concepts such as "volition" or "value". Some of these are 2-place predicates so they're categories over pairs. Then we can ask new questions such as "Do I have a right to do x in y situation?" All of this depends on whether the relevant concepts have low complexity relative to the dataset and other categorizers.
Using this framework for Friendly AI has many problems. I'm hand-waving the part about how to actually compress the data (approximating Kolmogorov complexity). This is a difficult problem but luckily it's not specific to Friendly AI. Another problem is that it's hard to go from categorizing data to actually making decisions. This requires connecting the categorizer to some kind of ontology. The categorization question that we can actually give examples for would be something like "given this description of the situation, is this action good?". Somehow we have to provide examples of (description,action) pairs that are good or not good, and the AI has to come up with a description of the situation before deciding whether the action is good or not. I don't think that using exactly this framework to make Friendly AI is a good idea; my goal here is to argue that sufficiently advanced machine learning can learn magical categories.
If it is in fact possible to learn magical categories, this suggests that machine learning research (especially related to approximations of Solomonoff induction/Kolmogorov complexity) is even more necessary for Friendly AI than it is for unFriendly AI. I think that the main difficulty of Friendly AI as compared with unFriendly AI is the requirement of understanding magical concepts/categories. Other problems (induction, optimization, self-modification, ontology, etc.) are also difficult but luckily they're almost as difficult for paperclip maximizers as they are for Friendly AI.
This has a relationship to the orthogonality thesis. Almost everyone here would agree with a weak form of the orthogonality thesis: that there exist general optimizers AI programs to which you can plug in any goal (such as paperclip maximization). A stronger form of the orthogonality thesis asserts that all ways of making an AI can be easily reduced to specifying its goals and optimization separately; that is, K(AI) ~= K(arbitrary optimizer) + K(goals). My thesis here (that magical categories are simpler relative to data) suggests that the strong form is false. Concepts such as "person" and "value" have important epistemic/instrumental value and can also be used to create goals, so K(Friendly AI) < K(arbitrary optimizer) + K(Friendliness goal). There's really no problem with human values being inherently complex if they're not complex relative to data we can provide to the AI or information it will create on its own for instrumental purposes. Perhaps P(Friendly AI | AGI, passes some Friendliness tests) isn't actually so low even if the program is randomly generated (though I don't actually suggest taking this approach!).
I'm personally working on a programming language for writing and verifying generative models (proving lower bounds on P(data|model)). Perhaps something like this could be used to compress data and categories in order to learn magical categories. If we can robustly learn some magical categories even with current levels of hardware/software, that would be strong evidence for the possibility of creating Friendly AI using this approach, and evidence against the molecular smiley face scenario.
Yet another safe oracle AI proposal
Previously I posted a proposal for a safe self-improving limited oracle AI but I've fleshed out the idea a bit more now.
Disclaimer: don't try this at home. I don't see any catastrophic flaws in this but that doesn't mean that none exist.
This framework is meant to safely create an AI that solves verifiable optimization problems; that is, problems whose answers can be checked efficiently. This set mainly consists of NP-like problems such as protein folding, automated proof search, writing hardware or software to specifications, etc.
This is NOT like many other oracle AI proposals that involve "boxing" an already-created possibly unfriendly AI in a sandboxed environment. Instead, this framework is meant to grow a self-improving seed AI safely.
Overview
- Have a bunch of sample optimization problems.
- Have some code that, given an optimization problem (stated in some standardized format), finds a good solution. This can be seeded by a human-created program.
- When considering an improvement to program (2), allow the improvement if it makes it do better on average on the sample optimization problems without being significantly more complex (to prevent overfitting). That is, the fitness function would be something like (average performance - k * bits of optimizer program).
- Run (2) to optimize its own code using criterion (3). This can be done concurrently with human improvements to (2), also using criterion (3).
Definitions
First, let's say we're writing this all in Python. In real life we'd use a language like Lisp because we're doing a lot of treatment of code as data, but Python should be sufficient to demonstrate the basic ideas behind the system.
We have a function called steps_bounded_eval_function. This function takes 3 arguments: the source code of the function to call, the argument to the function, and the time limit (in steps). The function will eval the given source code and call the defined function with the given argument in a protected, sandboxed environment, with the given steps limit. It will return either: 1. None, if the program does not terminate within the steps limit. 2. A tuple (output, steps_taken): the program's output (as a string) and the steps the program took.
Examples:
steps_bounded_eval_function("""
def function(x):
return x + 5
""", 4, 1000)
evaluates to (9, 3), assuming that evaluating the function took 3 ticks, because function(4) = 9.
steps_bounded_eval_function("""
def function(x):
while True: # infinite loop
pass
""", 5, 1000
evaluates to None, because the defined function doesn't return in time. We can write steps_bounded_eval_function as a meta-circular interpreter with a bit of extra logic to count how many steps the program uses.
Now I would like to introduce the notion of a problem. A problem consists of the following:
-
An answer scorer. The scorer should be the Python source code for a function. This function takes in an answer string and scores it, returning a number from 0 to 1. If an error is encountered in the function it is equivalent to returning 0.
-
A steps penalty rate, which should be a positive real number.
Let's consider a simple problem (subset sum):
{'answer_scorer': """
def function(answer):
nums = [4, 5, -3, -5, -6, 9]
# convert "1,2,3" to [1, 2, 3]
indexes = map(int, answer.split(','))
assert len(indexes) >= 1
sum = 0
for i in indexes:
sum += nums[i]
if sum == 0:
return 1
else:
return 0
""",
'steps_penalty_rate': 0.000001}
We can see that the scorer function returns 1 if and only if the answer specifies the indexes of numbers in the list nums that sum to 0 (for example, '0,1,3,4' because 4+5-3-6=0).
An optimizer is a program that is given a problem and attempts to solve the problem, returning an answer.
The score of an optimizer on a problem is equal to the score according to the answer-scorer, minus the steps penalty rate multiplied by the number of steps used by the optimizer. That is, the optimizer is rewarded for returning a better answer in less time. We can define the following function to get the score of an optimizer (Python source code) for a given problem:
def problem_score(problem, optimizer_source):
# run the optimizer on the problem
result = steps_bounded_eval_function(
optimizer_source, problem, 1 / problem['steps_penalty_rate'])
if result == None: # used all available steps, or got an error
return 0.0
answer, steps_taken = result # optimizer returned a result in time
# get the score using the problem's answer_scorer
answer_score = eval_function(problem['answer_scorer'], answer)
assert 0 <= answer_score <= 1
# penalize for taking up time and make sure the result is non-negative
return max(0.0, answer_score - problem['steps_penalty_rate'] * steps_taken)
A simple optimizer that detects subset-sum problems (by inspecting the source code) and uses a brute-force method to solve it will get a problem_score close to 1 for small subset-sum problems. This optimizer would not do well on other problems; other techniques (such as evaluating the problem source code with different answers and choosing a good one) are needed to solve a variety of different optimization problems. Writing a good optimizer is very hard work, just like any seed AI.
Framework
The framework has 4 modules:
-
A set of training problems. These should cover a wide range of problems that we would like the AI to solve.
-
An optimizer, written in Python. This should be seeded with a very good human-written optimizer. This is deliberately unspecified by me because it's a really hard problem (as is any seed AI).
-
A scorer for optimizer source code defined as follows:
def optimizer_score(candidate_optimizer_source): training_problems = [...training problems here...] # a parameter of the system; more on this later complexity_penalty = 0.1 # total up the program's score on all training problems total_score = 0.0 for problem in training_problems: total_score += problem_score(candidate_optimizer_source, problem) # penalize for complexity, to prevent overfitting total_score -= complexity_penalty * len(compress_binary(candidate_optimizer_source)) # return average score return max(0, total_score / len(training_problems))This gives a candidate optimizer a score in the range [0, 1] based on both its average performance on the sample set and its inherent complexity. Presumably optimizers with a higher optimizer_score will do better on future optimization problems.
-
A self-optimization thread. This thread continuously runs program 2 on a problem formed using 3's answer_scorer and an ever-decreasing steps_penalty_rate. Whenever program 2 outputs source code (optimizer_source) that is better than the current source code for 2, the source code for 2 is replaced with this new value. Also, humans can make improvements to program 2 if it increases its score according to 3's answer. Source code:
# assume we have access to an optimizer_source variable (program 2) def self_optimization_thread(): start_steps_penalty_rate = 0.000001 steps_penalty_rate = start_steps_penalty_rate while True: # loop forever self_optimization_problem = { # just use program 3 to score the optimizer 'answer_scorer': """ def function(candidate_optimizer_source): ... put the source code from program 3's optimizer_score here """, 'steps_penalty_rate': steps_penalty_rate } # call the optimizer (program 2) to optimize itself, giving it limited time result = steps_bounded_eval_function( optimizer_source, self_optimization_problem, 1 / steps_penalty_rate) changed = False if result is not None: # optimizer returned in time candidate_optimizer = result[0] # 2 returned a possible replacement for itself if optimizer_score(candidate_optimizer) > optimizer_score(optimizer_source): # 2's replacement is better than 2 optimizer_source = candidate_optimizer steps_penalty_rate = start_steps_penalty_rate changed = True if not changed: # give the optimizer more time to optimize itself on the next iteration steps_penalty_rate *= 0.5
So, what does this framework get us?
-
An super-optimizer, program 2. We can run it on new optimization problems and it should do very well on them.
-
Self-improvement. Program 4 will continuously use program 2 to improve itself. This improvement should make program 2 even better at bettering itself, in addition to doing better on other optimization problems. Also, the training set will guide human improvements to the optimizer.
-
Safety. I don't see why this setup has any significant probability of destroying the world. That doesn't mean we should disregard safety, but I think this is quite an accomplishment given how many other proposed AI designs would go catastrophically wrong if they recursively self-improved.
I will now evaluate the system according to these 3 factors.
Optimization ability
Assume we have a program for 2 that has a very very high score according to optimizer_score (program 3). I think we can be assured that this optimizer will do very very well on completely new optimization problems. By a principle similar to Occam's Razor, a simple optimizer that performs well on a variety of different problems should do well on new problems. The complexity penalty is meant to prevent overfitting to the sample problems. If we didn't have the penalty, then the best optimizer would just return the best-known human-created solutions to all the sample optimization problems.
What's the right value for complexity_penalty? I'm not sure. Increasing it too much makes the optimizer overly simple and stupid; decreasing it too much causes overfitting. Perhaps the optimal value can be found by some pilot trials, testing optimizers against withheld problem sets. I'm not entirely sure that a good way of balancing complexity with performance exists; more research is needed here.
Assuming we've conquered overfitting, the optimizer should perform very well on new optimization problems, especially after self-improvement. What does this get us? Here are some useful optimization problems that fit in this framework:
-
Writing self-proving code to a specification. After writing a specification of the code in a system such as Coq, we simply ask the optimizer to optimize according to the specification. This would be very useful once we have a specification for friendly AI.
-
Trying to prove arbitrary mathematical statements. Proofs are verifiable in a relatively short amount of time.
-
Automated invention/design, if we have a model of physics to verify the invention against.
-
General induction/Occam's razor. Find a generative model for the data so far that optimizes P(model)P(data|model), with some limits on the time taken for the model program to run. Then we can run the model to predict the future.
-
Bioinformatics, e.g. protein folding.
These are all problems whose solutions can be efficiently evaluated and that a computer could plausibly solve, so I think this framework should provide good solutions to them. If the optimizer this framework produces performs well on all these problems, I think it deserves to be called an oracle AGI.
Self-improvement
It seems that, if my arguments about optimization ability are convincing, an optimizer that does very well on the sample problems should also do well on the specific optimization problem of improving itself. This is especially true if the training problems include some program-optimization problems (i.e. finding a program that does well on some task). Once the optimizer improves itself, it will be better at improving itself, and so on. If the AI is software-limited, the framework should allow it to recursively improve itself up to a very high limit. Then we can use its hardware design power to design better chips for it and allow it to optimize even further.
The optimizer's ability to self-improve is limited by the fact that the other parts of the program (1, 3, and 4) are fixed. The system is greedy in the sense that it just looks for even the slightest improvement, then changes the code and looks for more improvements, without any foresight. I don't know how much of a limitation this is. I don't see any reason that greedy improvements to the optimizer are insufficient for super-optimization from recursive self-improvement.
Currently, the optimizer must completely derive a new optimizer from the specification (program 3), without looking at the current best optimizer (its own source code). In practice we might want to extend the framework to include incremental improvements (improvements on top of existing good answers). I don't know if a good theory for scoring incremental optimizers exists.
Safety
Probably the most important property of the system. What use is optimization if the world has been converted to computron in order to power a super-powerful protein folder? Here are a few different possible threats:
Underhanded question answers
The optimizer could give answers that are very good according to the problem description but are specifically crafted to cause things unintended by the human operators. For example, it could create a software design that, while correct according to specifications, has a "bug" that causes it to hack into computers on the internet and hook them up to the optimizer system. I don't think this is at all likely because the system is extremely short-sighted. Any self-modification must improve its immediate ability to do better on optimization problems. Furthermore, all self-modifications are found using methods that are judged by only immediate improvements. So any kind of long-term strategy (sacrificing some optimization power so it will have resources in the future) is out of the question. In other words, optimizer_score should not be seen as a utility function because the system only tries to improve it using greedy methods, not long-term planning.
Bugs in the system
What if the system gives the optimizer an incorrect score under some circumstances (e.g. if it performs a certain pattern of memory accesses)? Say that, by chance, the optimizer's improvement to itself causes it to get an incorrect score. It might internalize the rule "perform memory accesses in this pattern" to get a higher score. This itself is not especially dangerous; the optimizer will rewrite itself to just do a bunch of weird memory accesses that give it a high score.
What might be more dangerous is if the optimizer discovers an underlying pattern behind the system's hackability. Since the optimizer is penalized for complexity, a program like "do things that, when executed on a certain virtual machine, cause this variable in the machine to be a high number" might have a higher score than "do this certain complex pattern of memory accesses". Then the optimizer might discover the best way to increase the score variable. In the absolute worst case, perhaps the only way to increase the score variable is by manipulating the VM to go on the internet and do unethical things. This possibility seems unlikely (if you can connect to the internet, you can probably just overwrite the score variable) but should be considered.
I think the solution is straightforward: have the system be isolated while the optimizer is running. Completely disconnect it from the internet (possibly through physical means) until the optimizer produces its answer. Now, I think I've already established that the answer will not be specifically crafted to improve future optimization power (e.g. by manipulating human operators), since the system is extremely short-sighted. So this approach should be safe. At worst you'll just get a bad answer to your question, not an underhanded one.
Malicious misuse
I think this is the biggest danger of the system, one that all AGI systems have. At high levels of optimization ability, the system will be able to solve problems that would help people do unethical things. For example it could optimize for cheap, destructive nuclear/biological/nanotech weapons. This is a danger of technological progress in general, but the dangers are magnified by the potential speed at which the system could self-improve.
I don't know the best way to prevent this. It seems like the project has to be undertaken in private; if the seed optimizer source were released, criminals would run it on their computers/botnets and possibly have it self-improve even faster than the ethical version of the system. If the ethical project has more human and computer resources than the unethical project, this danger will be minimized.
It will be very tempting to crowdsource the project by putting it online. People could submit improvements to the optimizer and even get paid for finding them. This is probably the fastest way to increase optimization progress before the system can self-improve. Unfortunately I don't see how to do this safely; there would need to be some way to foresee the system becoming extremely powerful before criminals have the chance to do this. Perhaps there can be a public base of the project that a dedicated ethical team works off of, while contributing only some improvements they make back to the public project.
Towards actual friendly AI
Perhaps this system can be used to create actual friendly AI. Once we have a specification for friendly AI, it should be straightforward to feed it into the optimizer and get a satisfactory program back. What if we don't have a specification? Maybe we can have the system perform induction on friendly AI designs and their ratings (by humans), and then write friendly AI designs that it predicts will have a high rating. This approach to friendly AI will reflect present humans' biases and might cause the system to resort to manipulative tactics to make its design more convincing to humans. Unfortunately I don't see a way to fix this problem without something like CEV.
Conclusion
If this design works, it is a practical way to create a safe, self-improving oracle AI. There are numerous potential issues that might make the system weak or dangerous. On the other hand it will have short-term benefits because it will be able to solve practical problems even before it can self-improve, and it might be easier to get corporations and governments on board. This system might be very useful for solving hard problems before figuring out friendliness theory, and its optimization power might be useful for creating friendly AI. I have not encountered any other self-improving oracle AI designs for which we can be confident that its answers are not underhanded attempts to get us to let it out.
Since I've probably overlooked some significant problems/solutions to problems in this analysis I'd like to hear some more discussion of this design and alternatives to it.
Subscribe to RSS Feed
= f037147d6e6c911a85753b9abdedda8d)