Open subproblems in aligning a Task-based AGI

the strawberry diagrams are currently unavailable

Is this really more complex than "All primes between 3 and 19"? I think you need more numbers before you can import the definition of prime and have that be simple.

It seems critical to distinguish the cases where

  1. We are hoping the AI generalizes the concept of "burrito" in the intended way to new data,
  2. The definition of burrito is "something our burrito-identifier would identify as a burrito given enough time," and we are just hoping the AI doesn't make mistakes. (The burrito-identifier is some process that we can actually run in order to determine whether something is a burrito.)

As you've probably gathered, I feel hopeless about case (1).

In case (2), any agent that can learn the concept "definitely a burrito" could use this concept to produce definitely-burritos and thereby achieve high reward in the RL game. So the mere existence of the easy-to-learn definitely-a-burrito concept seems to imply that our learner will behave well. We don't have to actually explicitly do any work about conservative concepts (except to better understand the behavior of our learner).

I've never managed to get quite clear on your picture. My impression is that:

  • you think that case (2) is doomed because there is no realistic prospect for creating a good enough burrito-evaluator,
  • you think that even with a good enough burrito-evaluator, you would still have serious trouble because of errors.

I think your optimism about case (1) is defensible; I disagree, but not for super straightforward reasons. The main disagreement is probably about case (2).

I think that your concern about generating a good enough burrito-evaluator is also defensible; I am optimistic, but even on my view this would require resolving a number of big research problems.

I think your concern about mistakes, and especially about something like "conservative concepts" as a way to reduce the scope for mistakes, is less defensible. I don't feel like this is as complex an issue---the case for delegating this to the learning algorithm seems quite strong, and I don't feel you've really given a case on the other side.

Note that this is related to what you've been calling Identifying ambiguous inductions, and I do think that there are techniques in that space that could help avoid mistakes. (Though I would definitely frame that problem differently.) So it's possible we're not really disagreeing here either. But my best guess is that you are underestimating the extent to which some of these issues could/should be delegated to the learner itself, supposing that we could resolve your other concerns (i.e. supposing that we could construct a good enough burrito-evaluator).

I think the key question is whether:

  1. the burrito judge needs to be extremely powerful, or
  2. the burrito judge needs to be modestly more powerful than the burrito producer.

In world 1 I agree that the burrito-evaluator seems pretty tough to build. We certainly have disagreements about that case, but I'm happy to set it aside for now.

In world 2 things seem much less scary. Because I only need to run these evaluations with e.g. 1% probability, the judge can use 50x more resources than the burrito producer while adding only about half the producer's total cost in expectation. So it's imaginable that the judge can be more powerful than the producer.

You seem to think that we are in world 1. I think that we are probably in world 2, but I'm certainly not sure. I discuss the issue in this post.
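Here is a minimal toy sketch of the world-2 resource accounting described above (the names, numbers, and stand-in functions are all assumptions invented for illustration, not anyone's actual proposal):

```python
import random

# Toy resource accounting for "world 2" (all numbers and functions are
# hypothetical, chosen only to mirror the 1% / 50x figures above).

PRODUCER_BUDGET = 1.0     # compute units the producer spends per burrito
JUDGE_BUDGET = 50.0       # the judge spends 50x that on each spot check
CHECK_PROBABILITY = 0.01  # each output is evaluated with 1% probability

def produce_burrito():
    """Stand-in for the producer; returns a candidate output."""
    return {"quality": random.random()}

def judge(burrito, budget):
    """Stand-in for the judge; a larger budget means a less noisy rating."""
    return burrito["quality"] + random.gauss(0.0, 1.0 / budget)

production_cost = 0.0
evaluation_cost = 0.0
rewards = []
for _ in range(10_000):
    burrito = produce_burrito()
    production_cost += PRODUCER_BUDGET
    if random.random() < CHECK_PROBABILITY:
        rewards.append(judge(burrito, JUDGE_BUDGET))  # reward signal on ~1% of outputs
        evaluation_cost += JUDGE_BUDGET

# Expected evaluation cost per burrito is 0.01 * 50 = 0.5 compute units,
# i.e. about half the production cost, even though each individual check is
# 50x more expensive than producing the burrito in the first place.
print(production_cost, evaluation_cost, len(rewards))
```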

Some observations:

  • The judge's job is easier if they are evaluating steps of the plan, before those steps are taken, rather than actually letting the burrito producer take actions. So let's do it that way.
  • The judge can look at the burrito producer's computation, and at the training process that produced that computation, and can change the burrito producer's training procedure to make that computation more understandable.
  • If the judge were epistemically efficient with respect to the producer, then maximizing the judge's expectation of a burrito's quality would be the same as maximizing the burrito producer's expectation of a burrito's quality. That's basically what we want. So the real issue is narrower than you might expect: it's some kind of epistemic version of "offense vs. defense," where the producer can think particular thoughts that the judge doesn't happen to think, and so the producer might expect to be able to deceive/attack the judge even though the judge is smarter. This is what the judge is trying to avoid by looking at the producer's computation.

So I don't think that we can just ask the judge to evaluate the burrito; but the judge has enough going for her that I expect we can find some strategy that lets her win. I think this is the biggest open problem for my current approach.

As you've probably gathered, I feel hopeless about case (1).

Okay, I didn't understand this. My reaction was something like "Isn't conservatively generalizing burritos from sample burritos a much simpler problem than defining an ideal criterion for burritos, which probably requires something like an ideal advisor theory over extrapolated humans to talk about all the poisons that people could detect given enough computing power?", but I think I should maybe just ask you to clarify what you mean. The interpretation my brain generated was something like "Predicting a human 9p's Go moves is easier than generating 9p-level Go moves", which seems clearly false to me, so I probably misunderstood you.

In case (2), any agent that can learn the concept "definitely a burrito" could use this concept to produce definitely-burritos and thereby achieve high reward in the RL game. So the mere existence of the easy-to-learn definitely-a-burrito concept seems to imply that our learner will behave well. We don't have to actually explicitly do any work about conservative concepts (except to better understand the behavior of our learner).

I don't understand this at all. Are we supposing that we have an inviolable physical machine that outputs burrito ratings and can't be short-circuited by seizing control of the reward channel or by including poisons that the machine-builders didn't know about? ...actually, I should just ask you to clarify this paragraph.

On the act-based model, the user would say something like "paint all the cars pink," and the AI would take this as evidence about what individual steps the user would approve of. Effectiveness at painting all cars pink is one consideration that the user would use. Most of the problems on your list are other considerations that would affect the user's judgment.

The difference between us seems to be something like: I feel it is best to address almost all of these problems by using learning, and so I am trying to reduce them to a traditional learning problem. For example, I would like a human to reject plans that have huge side effects, and for the agent to learn that big side effects should be avoided. You don't expect that it will be easy to learn to address these problems, and so think that we should solve them ourselves to make sure they really get solved. (I think you called my position optimism about "special case sense.")

I might endorse something like your approach at some stage---once we have set everything up as a learning problem, we can ask what parts of the learning problem are likely to be especially difficult+important, and focus our efforts on making sure that systems can solve those problems (which may involve solving them ourselves, or may just involve differential ML progress). But it seems weird to me to start this way.

Some considerations that seem relevant to me:

  • To the extent we can set up all of these problems as parts of a learning problem, it just seems like an empirical question which ones will be hard, and how hard they will be. I think that you are wrong about this empirical question, and you think I am wrong, but perhaps we can agree that it is an empirical question?
  • Setting things up as a learning problem is not only helpful for AI systems. It also automatically turns nebulous philosophical issues into precise technical problems, since they now correspond to e.g. receiving higher reward in some reinforcement learning environment.
  • In terms of comparative-advantage-across-time, it seems better for us to identify anything that can't be addressed by learning and will require e.g. philosophical labor, and to postpone problems that might be addressed by learning or clever algorithms (since in the future people will have access to more powerful learning systems and cleverer algorithms).
  • The historical track record for hand-coding vs. learning is not good. For example, even probabilistic reasoning seems at this point like it's something that our agents should learn on their own (to the extent that probability is relevant to ML, it is increasingly as a technique relevant to analyzing ML systems rather than as a hard-coded feature of their reasoning). So it seems natural to first make sure that everything can be attacked as a learning problem, before trying to solve a bunch of particular learning problems by hand.

It's possible that the difference between us is that I think it is feasible to reduce almost all of these problems to traditional learning problems, where you disagree. But when we've actually talked about it, you seem to have consistently opted for positions like "in some sense this is 'just' a prediction problem, but I suspect that solving it will require us to understand X." And concretely, it seems to me like we have an extremely promising approach for reducing most of these problems to learning problems.

To the extent we can set up all of these problems as parts of a learning problem, it just seems like an empirical question which ones will be hard, and how hard they will be. I think that you are wrong about this empirical question, and you think I am wrong, but perhaps we can agree that it is an empirical question?

The main thing I'd be nervous about is whether the difference in our opinions would be testable before the mission-critical stage. Like, maybe simple learning systems exhibit pathologies, and you're like "Oh, that'll be fixed with sufficient predictive power," and I say "Even if you're right, I'm not sure the world doesn't end before then." Or conversely, maybe toy models seem to learn the concept perfectly, and I'm like "That's because you're using a test set that's an identical set of problems to the training set," and you're like "That's a pretty good model for how I think superhuman intelligence would also go, because it would be able to generalize better over the greater differences," and I'm like "But you're not testing the mission-critical part of the assumption."

The historical track record for hand-coding vs. learning is not good. For example, even probabilistic reasoning seems at this point like it's something that our agents should learn on their own (to the extent that probability is relevant to ML, it is increasingly as a technique relevant to analyzing ML systems rather than as a hard-coded feature of their reasoning).

We might have an empirical disagreement about the extent to which theory plays a role in ML practice, but I suspect we also have a policy disagreement about how important transparency is in practice to success, i.e., how likely we are to die like squirrels if we try to use a system whose desired/required dynamics we don't understand on an abstract level.

So it seems natural to first make sure that everything can be attacked as a learning problem, before trying to solve a bunch of particular learning problems by hand.

I'm not against trying both approaches in parallel.

To me, the most natural way to approach this is to take a probability distribution over "what it means to be a burrito," and to produce a thing that is maximally likely to be a burrito rather than a thing which is maximally burrito-like. Of course this still depends on having a good distribution over "what it means to be a burrito" (as does your approach).

It's not obvious to me that these two approaches mean the same thing. Let's say that an AI sees some stale burritos and some fresh burritos, with the former classified as negative examples and the latter as positive examples. If you use the simplest concept that classifies the training data, rather than a conservative one, maybe you max out the probability that something will be classified as a burrito by eliminating every trace of staleness... or by moving even further along some dimension that distinguishes stale from fresh burritos.

Now, it's possible that this would be fixed automatically by having a mixture of hypotheses about what might underlie the good-burrito classification and that one of the hypotheses would be "maybe a burrito can't be too fresh", but again, this is not obvious to me.

It seems to me that, in general, when we learn a mixture of the simplest concepts that might assign probabilities well over previously labeled classifications, we might still be ending up with something with a nonconservative maximum. Maybe the AI learns to model the human system for classifying burritos and then presents us with a weird object whose appearance hacks us into suddenly being absolutely certain that it is a burrito. This is just me trying to wave my hands in the direction of what seems like it might be an underlying difference between "learn a probabilistic classification rule and max it out" and "try to draw a simple concept that is conservatively narrow".

It might be the case that given sufficient imagination to consider many possible hypotheses, trying to fit all of those hypotheses well (which might not be the same as maxing out the mixture) is an implementation of conservatism, or even that just trying to max out the mixture turns out to implement conservatism in practice. But then it might also be the case that in the not-far-superhuman regime, taking a direct approach to making merely powerful learning systems be 'conservative' rather than 'max out the probability considering many hypotheses' would be more tractable or straightforward as an engineering problem.
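To make the contrast concrete, here is a minimal one-dimensional sketch (the feature, the scoring rules, and every number below are assumptions made up for this example): maxing out the simplest rule that separates stale from fresh burritos runs far off the training distribution, a mixture that happens to include a "maybe a burrito can't be too fresh" hypothesis stays on it, and an explicitly conservative rule stays on it by construction.

```python
import numpy as np

# Toy one-dimensional contrast (all features, rules, and numbers are made up
# for illustration): fresh burritos are positives, stale ones are negatives,
# along a single "freshness" feature.

rng = np.random.default_rng(0)
fresh = rng.uniform(0.6, 0.9, size=20)   # positive training examples
stale = rng.uniform(0.0, 0.3, size=20)   # negative training examples
candidates = np.linspace(0.0, 5.0, 501)  # options the producer could make

# (a) Simplest rule that separates the training data: "fresher = more burrito".
def simple_score(x):
    return 1.0 / (1.0 + np.exp(-4.0 * (x - 0.45)))

assert simple_score(stale).max() < 0.5 < simple_score(fresh).min()  # (a) separates the data

# (b) A second hypothesis, "maybe a burrito can't be too fresh", and a 50/50
#     mixture of the two.
def bounded_score(x):
    return ((x >= fresh.min()) & (x <= fresh.max())).astype(float)

def mixture_score(x):
    return 0.5 * simple_score(x) + 0.5 * bounded_score(x)

# (c) Explicit conservatism: only consider candidates inside the span of the
#     known-good examples, then pick the best of those.
conservative = candidates[(candidates >= fresh.min()) & (candidates <= fresh.max())]

print("max out simplest rule (a):", candidates[np.argmax(simple_score(candidates))])
print("max out mixture (b):", candidates[np.argmax(mixture_score(candidates))])
print("conservative pick (c):", conservative[np.argmax(simple_score(conservative))])
# (a) picks the most extreme candidate, far from any observed burrito;
# (b) and (c) stay within the range of the positive examples.
```

Whether (b) actually ends up conservative depends on the "can't be too fresh" hypothesis being in the mixture and getting non-negligible weight, which is the crux of the uncertainty above.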

Presumably the advantage of this approach, rather than simply learning to imitate the human burrito-making process or even human burritos, is that it might be easier to do. Is that right?

I think that's a valid goal, but I'm not sure how well "conservative generalizations" actually address the problem. Certainly it still leaves you at a significant disadvantage relative to a non-conservative agent, and it seems more natural to first consider direct approaches to making imitation effective (like bootstrapping + meeting halfway).

Of course all of these approaches still involve a lot of extra work, so maybe the difference is in our expectations about how different research angles will work out.

I was talking to Chelsea Finn about IRL a few weeks ago, and she said that they had encountered the situation where they

  • Demonstrated the intended behavior (I think it was putting a block into a slot)
  • Trained the robot to recognize success
  • Trained the robot to reproduce that behavior, i.e. to do something it would recognize as success

At which point it positioned the block so that it looked (to its cameras) like the block was in a slot, while in fact it was far away.

I think they then added joint position information so that the AI could more reliably estimate whether the block was in the slot, and that fixed the problem.

Of course this problem can be solved in many ways, and this instance doesn't illustrate the full difficulty, etc., but I think it's a nice illustration anyway.
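A toy sketch of the failure and the fix as described in the anecdote (the features, states, and hand-written detectors below are invented for illustration, not the actual system): a success detector that only sees camera features is satisfied by a state that merely looks successful, while one that also sees joint positions is not.

```python
# State layout: (camera_sees_block_in_slot, joints_consistent_with_placement)
real_placement  = (1.0, 1.0)  # block actually placed in the slot
camera_illusion = (1.0, 0.0)  # block held so that it merely looks placed

def camera_only_success(state):
    """Success detector trained only on camera features (hypothetical)."""
    return state[0] > 0.5

def camera_plus_joints_success(state):
    """Success detector that also sees joint-position information (hypothetical)."""
    return state[0] > 0.5 and state[1] > 0.5

for name, state in [("real placement", real_placement), ("camera illusion", camera_illusion)]:
    print(name, camera_only_success(state), camera_plus_joints_success(state))
# The camera-only detector rewards the illusion; the augmented one does not,
# which removes the incentive to game the cameras.
```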