This is an interesting perspective on the AI safety problem. I really like the ethos of this post, where there isn't a huge opposition between AI capabilities and AI safety, but instead we are simply trying to figure out how to use the (helpful!) capabilities developed by AI researchers to do useful things.
If I think about this from the perspective of reducing existential risk, it seems like you would also need to make the argument that AI systems are unlikely to pose a great threat before they are human-level (a claim I mostly agree with), or that the solutions will generalize to sub-human-level AI systems. Is there a reason this isn't in the post? I worry that I'm not properly understanding the motivations or generators behind the post.
Reading this post was the first time I felt I understood what Paul's (and many others') research was motivated by. I think about it regularly, and it comes up in conversation a fair bit.
Given that you liked it, can you explain to me why? Having already read the sequence (much of it twice), I'm pretty confused about the structure of the first section. I don't see the point of introducing the Steering Problem at all. It's similar to Intent Alignment, but not exactly the same (since it makes stronger assumptions about the nature of the AI), and the rest of the sequence (and IDA in general) seems to be trying to solve intent alignment, not the steering problem. It's listed under 'motivation', but I don't really get that aspect either. I don't know how I'm supposed to connect it to the rest of the sequence.
Things this post does: it gives real-world motivation and a direct problem whose solution would help with that real-world problem, and it engages concretely with the bigger picture in a way other posts in the sequence don't (e.g. "Clarifying AI Alignment" is a clarification, and doesn't explicitly motivate why the problem is important).
In general, especially at the time these posts were published, when I read them I felt like I understood each particular detail very clearly, but I did not understand the bigger picture: why that detail was interesting, or why those particular assumptions should be salient. This post helped me understand that a great deal.
I'm not sure if the above helps. Does the experience I had reading the post make sense? I'm also interested in whether other posts feel to you like they clearly motivate iterated amplification.
The motivation seems trivial to me, which might be part of the problem. Most training procedures are obviously outer-misaligned, so if we have one that may plausibly be outer-aligned (and might plausibly scale), that seems like an obvious reason to take it seriously. I felt like I totally got that as soon as I first understood what IDA is trying to do.
Does the experience I had reading the post make sense?
It does, but it still leaves me with the problem that it doesn't seem to be connected to the rest of the sequence. IDA isn't about how we take an already-trained system with human-level performance and use it for good things; it's about how we train a system from the ground up.
The real problem may be that I expect a post like this to closely tie into the actual scheme, so when it doesn't I take it as evidence that I've misunderstood something. What this post talks about may just not be intended to be [the problem the remaining sequence is trying to solve].
The motivation seems trivial to me, which might be part of the problem.
Yeah, for a long time many people have been very confused about the motivation of Paul's research, so I don't think you're typical in this regard. I think that due to this sequence, Alex Zhu's FAQ, lots of writing by Evan, Paul's post "What Failure Looks Like", and more posts on LW by others, many more people now understand Paul's work on a basic level, which was not at all the case 3 years ago. Like, you say "Most training procedures are obviously outer-misaligned", but 'outer-alignment' was not a concept with a name or a write-up at the time this sequence was published.
I agree that this post talks about assuming human-level performance, whereas much of iterated amplification also relaxes that assumption. My sense is that if someone were to just read this sequence, it would still help them focus on the brunt of the problem (that being 'helpful' or 'useful' is not well-defined in the way many other tasks are), and help them see why this task is urgent and why it's possible to make progress on it now.
That makes me feel a bit like the student who thinks they can debate the professor after having researched the field for 5 minutes.
But it would actually be a really good sign if the area has become more accessible.
When you talk about 'black-box' versions of Hugh, do you envision that H is able to answer questions about the cognitive processes that led to a given answer, or about H's thinking in general? This seems to contradict the spirit of a black box, but self-reflection is an important part of Hugh's cognitive ability.
Perhaps both are useful possibilities. My intuition is that this kind of self-reflection is about as far from being possible for AI as any human ability, so we should expect that we might have systems powerful enough to take on wide responsibility without it. If it were possible, though, the ability to use loops of self-reflection to check whether a cognitive process serves a certain goal would be very helpful.
We’ll start by defining “as useful for X as Hugh,” and then we will informally say that a program is “as useful” as Hugh if it’s as useful for the tasks we care most about.
If a program is useful for accomplishing the tasks we care most about, while being horrible for the things we care less about, would the program still be considered useful? For example, suppose I care a lot about music, and just a little about comedy. If an AI was useful for making the music I listen to slightly better, but completely destroyed my ability to get comedy, I'm not sure it's a good idea to call such a thing "useful".
This part feels underdefined:
A program P is more useful than Hugh for X if, for every project using H to accomplish X, we can efficiently transform it into a new project which uses P to accomplish X. The new project shouldn’t be much more expensive---it shouldn’t take much longer, use much more computation or many additional resources, involve much more human labor, or have significant additional side-effects.
Why quantify over projects? Why is it not sufficient to say that P is as useful as H if it can also accomplish X?
It seems like you want to say that P can achieve X in at least as many ways as H can, but I fail to see why that is obviously relevant. What even is a project?
Or is this some kind of built-in measure to prevent side effects, by making P achieve X in a humanlike way? It still doesn't feel obvious enough.
This is the typical way of talking about "more useful than" in computer science.
Saying "there is some way to use P to efficiently accomplish X" isn't necessarily helpful to someone who can't find that way. We want to say: if you can find a way to do X with H, then you can find a way to do it with P. And we need an efficiency requirement for the statement to be meaningful at all.
Most AI research focuses on reproducing human abilities: to learn, infer, and reason; to perceive, plan, and predict. There is a complementary problem which (understandably) receives much less attention: if you had these abilities, what would you do with them?
The steering problem: Using black-box access to human-level cognitive abilities, can we write a program that is as useful as a well-motivated human with those abilities?
This post explains what the steering problem is and why I think it’s worth spending time on.
Introduction
A capable, well-motivated human can be extremely useful: they can work without oversight, produce results that need not be double-checked, and work towards goals that aren’t precisely defined. These capabilities are critical in domains where decisions cannot be easily supervised, whether because they are too fast, too complex, or too numerous.
In some sense “be as useful as possible” is just another task at which a machine might reach human-level performance. But it is different from the concrete capabilities normally considered in AI research.
We can say clearly what it means to "predict well," "plan well," or "reason well." If we ignored computational limits, machines could achieve any of these goals today. And before the existing vision of AI is realized, we must necessarily achieve each of these goals.
For now, "be as useful as possible" is in a different category. We can't say exactly what it means. We could not do it no matter how fast our computers could compute. And even if we resolved the most salient challenges in AI, we could remain in the dark about this one.
Consider a capable AI tasked with running an academic conference. How should it use its capabilities to make decisions?
Everyday experience with humans shows how hard delegation can be, and how much easier it is to assign a task to someone who actually cares about the outcome.
Of course there is already pressure to write useful programs in addition to smart programs, and some AI research studies how to efficiently and robustly communicate desired behaviors. For now, available solutions apply only in limited domains or to weak agents. The steering problem is to close this gap.
Motivation
A system which "merely" predicted well would be extraordinarily useful. Why does it matter whether we know how to make a system which is “as useful as possible”?
Our machines will probably do some things very effectively. We know what it means to "act well" in the service of a given goal. For example, using human cognitive abilities as a black box, we could probably design autonomous corporations which very effectively maximized growth. If the black box was cheaper than the real thing, such autonomous corporations could displace their conventional competitors.
If machines can do everything equally well, then this would be great news. If not, society’s direction may be profoundly influenced by what can and cannot be done easily. For example, if we can only maximize what we can precisely define, we may inadvertently end up with a world filled with machines trying their hardest to build bigger factories and better widgets, uninterested in anything we consider intrinsically valuable.
All technologies are more useful for some tasks than others, but machine intelligence might be particularly problematic because it can entrench itself. For example, a rational profit-maximizing corporation might distribute itself throughout the world, pay people to help protect it, make well-crafted moral appeals for equal treatment, or campaign to change policy. Although such corporations could bring large benefits in the short term, in the long run they may be difficult or impossible to uproot, even once they serve no one’s interests.
Why now?
Reproducing human abilities gets a lot of deserved attention. Figuring out exactly what you’d do once you succeed feels like planning the celebration before the victory: it might be interesting, but why can’t it wait?
But research is hard to parallelize: at large scales it becomes difficult to speed up progress by increasing the number of researchers, so fewer people working for longer may ultimately be more efficient (even if earlier researchers are at a disadvantage). Starting early is particularly valuable if we may eventually want to invest much more effort in the steering problem.
Section 3 of the full document discusses some other reasons not to work on the steering problem: Is work done now likely to be relevant? Is there any concrete work to do now? Should we wait until we can do experiments? Are there already adequate incentives to resolve this problem?
Defining the problem precisely
Recall our problem statement:
The steering problem: Using black-box access to human-level cognitive abilities, can we write a program that is as useful as a well-motivated human with those abilities?
We’ll adopt a particular human, Hugh, as our “well-motivated human”: we’ll assume that we have black-box access to Hugh-level cognitive abilities, and we’ll try to write a program which is as useful as Hugh.
Abilities
In reality, AI research yields complicated sets of related abilities, with rich internal structure and no simple performance guarantees. But in order to do concrete work in advance, we will model abilities as black boxes with well-defined contracts.
We’re particularly interested in tasks which are “AI complete” in the sense that human-level performance on that task could be used as a black box to achieve human-level performance on a very wide range of tasks. For now, we’ll further focus on domains where performance can be unambiguously defined.
Some examples:
When talking about Hugh’s predictions, judgments, or decisions, we imagine that Hugh has access to a reasonably powerful computer, which he can use to process or display data. For example, if Hugh is given the binary data from a camera, he can render it on a screen in order to make predictions about it.
We can also consider a particularly degenerate ability: access to unlimited computation.
Although unlimited computation seems exceptionally powerful, it’s not immediately clear how to solve the steering problem even using such an extreme ability.
Measuring usefulness
What does it mean for a program to be “as useful” as Hugh?
We’ll start by defining “as useful for X as Hugh,” and then we will informally say that a program is “as useful” as Hugh if it’s as useful for the tasks we care most about.
Consider H, a black box that simulates Hugh or perhaps consults a version of Hugh who is working remotely. We’ll suppose that running H takes the same amount of time as consulting our Hugh-level black boxes. A project to accomplish X could potentially use as many copies of H as it can afford to run.
A program P is more useful than Hugh for X if, for every project using H to accomplish X, we can efficiently transform it into a new project which uses P to accomplish X. The new project shouldn’t be much more expensive---it shouldn’t take much longer, use much more computation or many additional resources, involve much more human labor, or have significant additional side-effects.
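To make the quantifier structure explicit, here is a rough formalization (the notation Proj, cost, and the slack factor c are mine, not the post's). Write Proj(B, X) for the set of projects that accomplish X using black-box access to B, and cost(π) for a project's total expenditure of time, computation, resources, and labor. Then:

\[
P \succeq_X H \;\iff\; \forall \pi \in \mathrm{Proj}(H, X)\;\exists \pi' \in \mathrm{Proj}(P, X) :\quad \mathrm{cost}(\pi') \le c \cdot \mathrm{cost}(\pi)
\]

for some small constant c, where the transformation from π to π' must itself be easy to find. The universal quantifier over projects is what gives the guarantee its force: it hands anyone who already knows how to accomplish X with H a recipe for accomplishing it with P, rather than merely asserting that some recipe exists.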
Well-motivated
What does it mean for Hugh to be well-motivated?
The easiest approach is universal quantification: for any human Hugh, if we run our program using Hugh-level black boxes, it should be as useful as Hugh.
Alternatively, we can leverage our intuitive sense of what it means for someone to be well-motivated to do X, and define “well-motivated” to mean “motivated to help the user’s project succeed.”
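As a sketch of the universal-quantification approach (again in my notation, not the post's): writing P[B_h] for our program instantiated with black boxes at human h's ability level, the requirement is

\[
\forall h :\quad P[B_h] \;\succeq_X\; h,
\]

i.e. for every human h, the program built from h-level abilities should be as useful as h. This sidesteps the need to define "well-motivated" at all: for humans who happen to be well-motivated, the guarantee delivers exactly the benchmark we wanted.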
Scaling up
If we are given better black boxes, we should make a better program. This is captured by the requirement that our program should be as useful as Hugh, no matter how capable Hugh is (as long as the black boxes are equally capable).
Ideally, our solutions should scale far past human-level abilities. This is not a theoretical concern---in many domains computers already have significantly superhuman abilities. This requirement is harder to make precise, because we can no longer talk about the “human benchmark.” But in general, we would like to build systems which are (1) working towards their owner’s interests, and (2) nearly as effective as the best goal-directed systems that can be built using the available abilities. The ideal solution to the steering problem will have these characteristics in general, even when the black-box abilities are radically superhuman.
This is an abridged version of this document from 2014; most of the document is now superseded by later posts in this sequence.
Tomorrow's AI Alignment Forum sequences post will be 'Embedded Agency (text)' in the sequence Embedded Agency, by Scott Garrabrant and Abram Demski.
The next post in this sequence will come out on Thursday 15th November, and will be 'Clarifying "AI Alignment"' by Paul Christiano.