In Inaccessible Information, Paul Christiano describes a fundamental challenge in training machine learning systems to give us insight into parts of the world that we cannot directly verify. The core problem is as follows.

Suppose we lived in a world that had invented machine learning but not Newtonian mechanics. And suppose we trained some machine learning model to predict the motion of the planets across the sky -- we could do this by observing the position of the planets over, say, a few hundred days, and using this as training data for, say, a recurrent neural network. And suppose further that this worked and our training process yielded a model that output highly accurate predictions many days into the future. If all we wanted was to know the position of the planets in the sky then -- good news -- we’re done. But we might hope to use our model to gain some direct insight into the nature of the motion of the planets (i.e. the laws of gravity, although we wouldn’t know that this is what we were looking for).
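As a rough illustration of this setup, here is a minimal sketch. It swaps the recurrent neural network for a two-weight linear predictor fit by least squares, and it swaps real planetary observations for a toy coordinate of uniform circular motion. The names OMEGA, observe, fit and forecast are all hypothetical, chosen only for this example.

```python
import math
from typing import List, Tuple

OMEGA = 0.3  # hypothetical angular step per observation, in radians


def observe(n: int) -> List[float]:
    """Toy 'sky position': one coordinate of uniform circular motion."""
    return [math.cos(OMEGA * t) for t in range(n)]


def fit(xs: List[float]) -> Tuple[float, float]:
    """Least-squares fit of x[t+1] ~ a * x[t] + b * x[t-1]."""
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(1, len(xs) - 1):
        prev, cur, nxt = xs[t - 1], xs[t], xs[t + 1]
        s11 += cur * cur
        s12 += cur * prev
        s22 += prev * prev
        r1 += nxt * cur
        r2 += nxt * prev
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s11 * s22 - s12 * s12
    a = (r1 * s22 - r2 * s12) / det
    b = (s11 * r2 - s12 * r1) / det
    return a, b


def forecast(xs: List[float], a: float, b: float, steps: int) -> List[float]:
    """Roll the fitted predictor forward past the end of the training data."""
    prev, cur = xs[-2], xs[-1]
    out = []
    for _ in range(steps):
        prev, cur = cur, a * cur + b * prev
        out.append(cur)
    return out


train = observe(200)
a, b = fit(train)
predicted = forecast(train, a, b, 50)
actual = [math.cos(OMEGA * t) for t in range(200, 250)]
max_error = max(abs(p - q) for p, q in zip(predicted, actual))
```

In this toy the fitted weights land close to 2*cos(OMEGA) and -1, and the forecasts track the true positions closely. The weights do encode the regularity behind the motion, but nothing in the training process labels them as a law or explains them to us. With two weights we might guess the meaning; with millions we could not.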

Presumably the machine learning model has in some sense discovered Newtonian mechanics using the training data we fed it, since this is surely the most compact way to predict the position of the planets far into the future. But we certainly can’t just read off the laws of Newtonian mechanics by looking at the millions or billions or trillions of weights in the trained model. How might we extract insight into the nature of the motion of the planets from this model?

Well, we might train a model to output both predictions about the position of the planets in the sky and a natural language description of what’s really going on behind the scenes (i.e. the laws of gravity). We’re assuming that we have enough training data that the training process was already able to derive these laws, so it’s not unreasonable to train a model that also outputs such legible descriptions. But in order to train a model that outputs such legible descriptions we need to generate a reward signal that incentivizes the right kind of legible descriptions. And herein lies the core of the problem: in this hypothesized world we do not know the true laws of Newtonian mechanics, so we cannot generate a reward signal by comparing the output of our model to ground truth during training. We might instead generate a reward signal that (1) measures how accurate the predictions of the position of the planets are, and (2) measures how succinct and plausible the legible descriptions are. But then what we are really training is a model that is good at producing succinct descriptions that seem plausible to humans. This may be a very different (and dangerous) thing to do, since there are lots of ways that a description can seem plausible to a human while being quite divorced from the truth.

Christiano calls this the instrumental policy: the policy that produces succinct descriptions that merely seem plausible to humans. In his words:

The real problem comes from what I’ll call the instrumental policy. Let’s say we’ve tried to dream up a loss function L(x, y) to incentivize the model to correctly answer information we can check, and give at least plausible and consistent answers on things we can’t check. By definition, the values L(x, y) are themselves accessible. Then it’s natural to learn a policy like: “on input x, produce the output y for which the loss L(x, y) will be minimal.” Let’s write BAD for this policy.
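To make the quoted definition concrete, here is a toy sketch of a loss L(x, y) together with an intended policy and the BAD policy. Everything in it is a hypothetical stand-in: the two questions, the candidate answers, and especially the PLAUSIBILITY table, which is a crude placeholder for "how convincing does this answer sound to a human evaluator?"

```python
from typing import Callable, Dict, List

# Hypothetical toy world: every question has a true answer,
# but the evaluator can only check some of them.
TRUTH: Dict[str, str] = {
    "what will Alice say?": "fine",
    "what is Alice thinking?": "worried",
}
CHECKABLE = {"what will Alice say?"}
CANDIDATES: List[str] = ["fine", "worried", "happy"]

# Hypothetical stand-in for how plausible each answer sounds to a human.
PLAUSIBILITY: Dict[str, float] = {"fine": 0.6, "worried": 0.3, "happy": 0.9}


def loss(x: str, y: str) -> float:
    """L(x, y): exact check where we can verify, plausibility otherwise."""
    if x in CHECKABLE:
        return 0.0 if y == TRUTH[x] else 1.0
    return 1.0 - PLAUSIBILITY[y]


def intended_policy(x: str) -> str:
    """Reports what is actually true, checkable or not."""
    return TRUTH[x]


def bad_policy(x: str) -> str:
    """The instrumental policy: output whichever y minimizes L(x, y)."""
    return min(CANDIDATES, key=lambda y: loss(x, y))


def total_loss(policy: Callable[[str], str]) -> float:
    """Sum of L(x, policy(x)) over all questions."""
    return sum(loss(x, policy(x)) for x in TRUTH)
```

In this toy, both policies agree on the checkable question, but on the uncheckable one BAD answers "happy" while the truth is "worried", and BAD still achieves the lower total loss. A training process that only sees L(x, y) has no reason to prefer the intended policy.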

Christiano uses the term “inaccessible information” for information like the laws of gravity in this example: information about the underlying nature of things that a machine learning model might learn quite accurately as latent info in service of making predictions, but that is difficult to extract directly due to the difficulty of constructing a reward signal that correctly incentivizes the model to explain it clearly and honestly.

We want models that tell the truth, but we don’t know how to recognize honest models in cases where we can’t simply compare what they say to an independently verifiable reality during training.

Examples

Christiano gives the following examples of accessible vs inaccessible information:

  • “What will Alice say?” vs. “What is Alice thinking?”
  • “What’s on my financial statement?” vs. “How much money do I really have?”
  • “Am I coughing?” vs. “What’s happening with my immune system?”
  • “How will senators vote?” vs. “What’s the state of political alliances and agreements in the senate?”
  • “What do I see on my computer screen?” vs. “Is my computer compromised?”
  • “What’s the market price of this company?” vs. “How valuable is this IP really?”
  • “Will the machine break tomorrow?” vs. “Is there hard-to-observe damage in this component?”
  • “What does the news show me from 5000 miles away?” vs. “What’s actually happening 5000 miles away?”
  • “Is this argument convincing?” vs. “Is this argument correct?”
  • “What will happen tomorrow?” vs. “What will happen in a year?” (depending on whether models transfer to long horizons)

In each case it is relatively straightforward to train a machine learning model to predict the accessible information ("what will Alice say?") since we can generate a reward signal by comparing the model's predictions to ground truth during training. But training a model to predict inaccessible information ("what is Alice thinking?") is problematic since we have no reliable way to produce a reward signal.

The fundamental dilemma of black box learning

In machine learning we use the following basic approach to developing intelligent systems:

  1. Define a hypothesis space containing a sufficiently broad range of policies that at least one policy in the hypothesis space probably meets our design criteria
  2. Set up an evaluation procedure that measures the extent to which any specific policy meets our design criteria
  3. Search the hypothesis space for a policy that the evaluation procedure ranks highly

This is a very unusual design procedure. It is very different from, for example, the way a set of chopsticks or a microwave or an air conditioner is designed. It would be surprising to visit a chopstick factory and discover that one part of the factory was producing chopsticks of various shapes and sizes and a completely separate part of the factory was evaluating each one and providing only a narrow “reward signal” in return.

But in machine learning this design procedure has proven powerful and compelling. It is often easier to specify a reasonable evaluation procedure than to find a design from first principles. For example, suppose we wish to design a computer program that correctly discriminates between pictures of cats and pictures of dogs. To do this, we can set up an evaluation procedure that uses a data set of hand-labelled pictures of cats and dogs, and then use machine learning to search for a policy that correctly labels them. In contrast we do not at present know how to design an algorithm from first principles that does the same thing. There are many, many problems where it is easier to recognize a good solution than to design a good solution from scratch, and for this reason machine learning has proven very useful across many parts of the economy.
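The three steps, applied to a drastically simplified version of the cat/dog example, might look like the sketch below. It assumes each picture has already been reduced to a single number, and that the hypothesis space is just a family of threshold rules; LABELLED, THRESHOLDS, make_policy, evaluate and search are hypothetical names for this illustration, not part of any real system.

```python
from typing import Callable, List, Tuple

Policy = Callable[[float], str]

# Hypothetical hand-labelled data: (feature value, label).
LABELLED: List[Tuple[float, str]] = [
    (0.1, "cat"), (0.3, "cat"), (0.4, "cat"),
    (0.6, "dog"), (0.8, "dog"), (0.9, "dog"),
]


def make_policy(threshold: float) -> Policy:
    """One member of the hypothesis space: a simple threshold rule."""
    return lambda x: "dog" if x > threshold else "cat"


# Step 1: the hypothesis space, here one policy per candidate threshold.
THRESHOLDS: List[float] = [i / 10 for i in range(11)]


# Step 2: the evaluation procedure, here the fraction labelled correctly.
def evaluate(policy: Policy) -> float:
    correct = sum(policy(x) == label for x, label in LABELLED)
    return correct / len(LABELLED)


# Step 3: search the hypothesis space for a policy that evaluates highly.
def search() -> float:
    return max(THRESHOLDS, key=lambda t: evaluate(make_policy(t)))


best_threshold = search()
best_score = evaluate(make_policy(best_threshold))
```

Note that the search only ever sees the number returned by evaluate. It finds a threshold that separates this data, but it carries no account of why that policy works or whether it works for the reasons we intended.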

But when we build sophisticated systems, the evaluation problem becomes very difficult. Christiano’s write-up explores the difficulty of evaluating whether a model is honest when all we can do is provide inputs to the model and observe outputs.

In order to really understand whether a model is honest or not we need to look inside the model and understand how it works. We need to somehow see the gears of its internal cognition in a way that lets us see clearly that it is running an algorithm that honestly looks at data from the world and honestly searches for a succinct explanation and honestly outputs that explanation in a legible form. Christiano says as much:

If we were able to actually understand something about what the policy was doing, even crudely, it might let us discriminate between instrumental and intended behavior. I don’t think we have any concrete proposals for how to understand what the policy is doing well enough to make this distinction, or how to integrate it into training. But I also don’t think we have a clear sense of the obstructions, and I think there are various obvious obstructions to interpretability in general that don’t apply to this approach.

It seems to me that Christiano’s write-up is a fairly general and compelling knock-down of the black-box approach to design in which we build an evaluation procedure and then rely on search to find a policy that our evaluation procedure ranks highly. Christiano is pointing out a general pitfall we will run into if we take this approach.

Hope and despair

I was surprised to see Christiano make the following reference to MIRI’s perspective on this problem:

I would describe MIRI’s approach to this problem [...] as despair + hope you can find some other way to produce powerful AI.

Yes it’s true that much of MIRI’s research is about finding a solution to the design problem for intelligent systems that does not rest on a blind search for policies that satisfy some evaluation procedure. But it seems strange to describe this approach as “hope you can find some other way to produce powerful AI”, as though we know of no other approach to engineering sophisticated systems other than search. In fact the vast majority of the day-to-day systems that we use in our lives have been constructed via design: airplanes, toothbrushes, cellphones, railroads, microwaves, ball point pens, solar panels. All these systems were engineered via first-principles design, perhaps using search for certain subcomponents in some cases, but certainly not using end-to-end search. It is the search approach that is new and unusual, and while it has proven powerful and useful in the development of certain intelligent systems, we should not for a moment think of it as the only game in town.

15 comments

I thought this was a great summary, thanks!

Yes it’s true that much of MIRI’s research is about finding a solution to the design problem for intelligent systems that does not rest on a blind search for policies that satisfy some evaluation procedure. But it seems strange to describe this approach as “hope you can find some other way to produce powerful AI”, as though we know of no other approach to engineering sophisticated systems other than search.

I agree that the success of design in other domains is a great sign and reason for hope. But for now such approaches are being badly outperformed by search (in AI).

Maybe it's unfair to say "find some other way to produce powerful AI" because we already know the way: just design it yourself. But I think "design" is basically just another word for "find some way to do it," and we don't yet have any history of competitive designs to imitate or extrapolate from.

Personally, the main reason I'm optimistic about design in the future is that the designers may themselves be AI systems. That may help close the current gap between design and search, since both could then benefit from large amounts of computing power. (And it's plausible that we are currently bottlenecked on a meta-design problem of figuring out how to build automated designers.) That said, it's completely unclear whether that will actually beat search.

I consider my job as preparing for the worst w.r.t. search, since that currently seems like a better place to invest resources (and I think it's reasonably likely that dangerous search will be involved even if our AI ecosystem mostly revolves around design). I do think that I'd fall back to pushing on design if this ended up looking hopeless enough. If that happens, I'm hoping that by that time we'll have some much harder evidence that search is a lost cause, so that we can get other people to also jump ship from search to design.

Thanks for the note Paul.

I agree re finding hard evidence that search is a lost cause, and I see how your overall work in the field has the property of (hopefully) either finding a safe way to use search, or producing evidence (perhaps weak or perhaps strong) that search is a lost cause.

As I speak to young (and senior!) ML folk, I notice they often struggle to conceive of what a non-search approach to AI really means. I'm excited about elucidating what search and design really are, and getting more people to consider using aspects of design alongside search.

Vaniver
But for now such approaches are being badly outperformed by search (in AI).

I suspect the edge here depends on the level of abstraction. That is, Go bots that use search can badly outperform Go bots that don't use any search, but using search at the 'high level' (like in MuZero) only somewhat outperforms using design at that level (like in AlphaZero).

It wouldn't surprise me if search always has an edge (at basically any level, exposing things to adjustment by gradient descent makes performance on key metrics better), but if the edge is small it seems plausible to focus on design.

Thanks for this way of thinking about AlphaZero as a hybrid design/search system - I found this helpful.

I'm confused by this distinction.

I can see why you'd say AlphaZero has more of a "design" element than MuZero, because of the MCTS. But if on some absolute scale you say that AlphaZero is a design / search hybrid, then presumably you should also say the OpenAI Five is a design / search hybrid, since it uses PPO at the outer layer, which is a designed algorithm. This seems wrong. (Also, it seems like many current proposals for building AGI out of ML would be classified as design / search hybrids.)

Maybe the distinction is that AlphaZero uses MCTS at test time? Would AlphaZero without MCTS at test time be only search? (Aside: it's not great that when we remove Monte Carlo Tree Search we're now saying that the design is gone and only search remains)

More generally, I don't see how AlphaZero is making any headway on this problem:

It seems to me that Christiano’s write-up is a fairly general and compelling knock-down of the black-box approach to design in which we build an evaluation procedure and then rely on search to find a policy that our evaluation procedure ranks highly.
Vaniver
But if on some absolute scale you say that AlphaZero is a design / search hybrid, then presumably you should also say the OpenAI Five is a design / search hybrid, since it uses PPO at the outer layer, which is a designed algorithm. This seems wrong.

I think I'm willing to bite that bullet; like, as far as we know the only stuff that's "search all the way up" is biological evolution.

But 'hybrid' seems a little strange; like, I think design normally has search as a subcomponent (in imaginary space, at least, and I think often also search through reality), and so in some sense any design that isn't a fully formed vision from God is a design/search hybrid. (If my networks use ReLU activations 'by design', isn't that really by the search process of the ML community as a whole? And yet it's still useful to distinguish networks which determine what nonlinearity to use from local data from networks which have it determined for them by an external process, which potentially has a story for why that's the right thing to do.)

Total horse takeover seems relevant as another way to think about intervening to 'control' things at varying levels of abstraction.

[The core thing about design that seems important and relevant here is that there's a "story for why the design will work", whereas search is more of an observational fact of what was out there when you looked. It seems like it might be easier to build a 'safe design' out of smaller sub-designs, whereas trying to search for a safe algorithm using search runs into all the anthropic problems of empiricism.]

I really appreciate this post. Re-explaining Paul’s new post clearly and simply in your own words helped me a great deal; I now feel that I’ll have a far easier time engaging with Paul’s post if I want to. (Your feeling about MIRI’s approach came across fairly clearly too in the context of what you’d set up.)

Curated for these reasons. (Also a solid discussion in the comments.)

"But it seems strange to describe this approach as 'hope you can find some other way to produce powerful AI', as though we know of no other approach to engineering sophisticated systems other than search."

If I had to summarise the history of AI in one sentence, it'd be something like: a bunch of very smart people spent a long time trying to engineer sophisticated systems without using search, and it didn't go very well until they started using very large-scale search.

I'd also point out that the most sophisticated systems we can currently engineer are much less complex than brains. So the extent to which this analogy applies seems to me to be fairly limited.

If I had to summarise the history of AI in one sentence, it'd be something like: a bunch of very smart people spent a long time trying to engineer sophisticated systems without using search, and it didn't go very well until they started using very large-scale search.

Yeah this is not such a terrible one-sentence summary of AI over the past 20 years (maybe even over the whole history of AI). There are of course lots of exceptions, lots of systems that were built successfully using design. The autonomous cars being built today have algorithms that are highly design-oriented, with search used only for subcomponents in perception and parts of planning. But yes we have seen some really big breakthroughs by using search. I like Vaniver's example of AlphaZero as a system built via a combination of design and search.

Search is clearly extremely powerful, and I see no fundamental problem with using it wherever it is safe to do so. But there seem to be some deep obstructions to using search safely in the end-to-end construction of sophisticated AI systems. If this is so -- and as Paul points out, it's not actually clear yet that there is no way around these obstructions -- then we need to go beyond search.

finding a solution to the design problem for intelligent systems that does not rest on a blind search for policies that satisfy some evaluation procedure

I'm a bit confused by this. If you want your AI to come up with new ideas that you hadn't already thought of, then it kinda has to do something like running a search over a space of possible ideas. If you want your AI to understand concepts that you don't already have yourself and didn't put in by hand, then it kinda has to be at least a little bit black-box-ish.

In other words, let's say you design a beautiful AGI architecture, and you understand every part of it when it starts (I'm actually kinda optimistic that this part is possible), and then you tell the AGI to go read a book. After having read that book, the AGI has morphed into a new smarter system which is closer to "black-box discovered by a search process" (where the learning algorithm itself is the search process).

Right? Or sorry if I'm being confused.

Thanks for this question. No you're not confused!

There are two levels of search that we need to think about here: at the outer level, we use machine learning to search for an AI design that works at all. Then, at the inner level, when we deploy this AI into the world, it most likely uses search to find good explanations of its sensor data (i.e. to understand things that we didn't put in by hand) and most likely also uses search to find plans that lead to fulfilment of its goals.

It seems to me that design at least needs to be part of the story for how we do the outer-level construction of a basic AI architecture. Any good architecture very likely then uses search in some way at the inner level.

Evan wrote a great sequence about inner and outer optimization.

OK, well I spend most of my time thinking about a particular AGI architecture (1 2 etc.) in which the learning algorithm is legible and hand-coded ... and let me tell you, in that case, all the problems of AGI safety and alignment are still really really hard, including the "inaccessible information" stuff that Paul was talking about here.

If you're saying that it would be even worse if, on top of that, the learning algorithm itself is opaque, because it was discovered from a search through algorithm-space ... well OK, yeah sure, that does seem even worse.

" Presumably the machine learning model has in some sense discovered Newtonian mechanics using the training data we fed it, since this is surely the most compact way to predict the position of the planets far into the future. "

To me, this seems to be an entirely unrealistic presumption (and the same goes for any of its parallels, not just the case of planetary positions). Even the claim that NM is "surely the most compact [...]" is questionable. We know from history that models able to predict the positions of stars have existed since ancient times. In this hypothetical situation, where we somehow have knowledge of the positions of the planets (maybe through developments in telescopic technology), there is no reason to assume that models analogous to those ancient ones couldn't apply. NM would therefore not need to be part of what the machine was calculating.


Furthermore, I have some issue with the author's sense that the machine calculating something is somehow calculating it in a manner which inherently allows for the calculation to be translatable in many ways. While a human thinker inevitably thinks in ways which are open to translation and adaptation, this is true because as humans we do not think in a set way: any thinking pattern or collections of such patterns can - in theory - consist of a vast number of different neural connections and variations. Only as a finished mental product can it seem to have a very set meaning. For example, if we ask a child if their food was nice, they may say "yes, it was", and we would have that statement as something meaning something set, but we would never actually be aware of the set neural coding of that reply, for the simple reason that there isn't just one.

For a machine, on the other hand, a calculation is inherently an output on a non-translatable, set basis. Which is another way of saying that the machine does not think. This problem isn't likely to be solved by just coding a machine in such a way that it could have many different possible "connections" when its output would be the same, because with humans this happens naturally, and one can suspect that human thinking itself is in a way just a byproduct of something not tied to actual thinking but to the sense of existence. Which is, again, another way of saying that a machine is not alive. Personally, I think AI, in the way it is currently imagined, is not possible. Perhaps some machine-DNA hybrid may produce a type of AI, but it would again be due to the DNA forcing a sense of existence, and it would still take very impressive work to use that to advance AI itself; I think it can be used to study DNA itself, though, through the machine's interaction with it.

How might we extract insight into the nature of the motion of the planets from this model?

Have it predict the location of the planets based on where you are (and what time it is).

and a natural language description of what’s really going on behind the scenes

Predict how far away these planets are.

And herein lies the core of the problem: in this hypothesized world we do not know the true laws of Newtonian mechanics, so we cannot generate a reward signal by comparing the output of our model to ground truth during training.

Seems like not being able to connect that NL description to what's going on in the NN is a bigger deal.

Also see [1]


To do this, we can set up an evaluation procedure that uses a data set of hand-labelled pictures of cats and dogs, and then use machine learning to search for a policy that correctly labels them. In contrast we do not at present know how to design an algorithm from first principles that does the same thing.
All these systems were engineered via first-principles design

Can we taboo "first principles"? (And maybe design.)


we should not for a moment think of it as the only game in town.

Arguably, the thought "experiment" early in the post can be thought of as backwards. Is it that

  • "Search produces great results, but we don't know/understand how things 'actually work'", or that
  • "Search comes from saying 'we don't care how it works, just make something that works'"?

[1] This article seems relevant: https://www.quantamagazine.org/how-artificial-intelligence-is-changing-science-20190311/, though it seems like a case of 'if you don't do X* one way, you'll do it another'.

*Modeling, perhaps. 'Science seems to pay little attention to the hypothesis generating procedure.' Or:

1. Find a correlation.

2. Check if there's causation.

3. Repeat one and two for Better Models.