Safety engineering, target selection, and alignment theory

So8res

This post is the latest in a series introducing the basic ideas behind MIRI's research program. To contribute, or learn more about what we've been up to recently, see the MIRI fundraiser page. Our 2015 winter funding drive concludes tonight (31 Dec 15) at midnight.

Artificial intelligence capabilities research is aimed at making computer systems more intelligent — able to solve a wider range of problems more effectively and efficiently. We can distinguish this from research specifically aimed at making AI systems at various capability levels safer, or more "robust and beneficial." In this post, I distinguish three kinds of direct research that might be thought of as "AI safety" work: safety engineering, target selection, and alignment theory.

Imagine a world where humans somehow developed heavier-than-air flight before developing a firm understanding of calculus or celestial mechanics. In a world like that, what work would be needed in order to safely transport humans to the Moon?

In this case, we can say that the main task at hand is one of engineering a rocket and refining fuel such that the rocket, when launched, accelerates upwards and does not explode. The boundary of space can be compared to the boundary between narrowly intelligent and generally intelligent AI. Both boundaries are fuzzy, but have engineering importance: spacecraft and aircraft have different uses and face different constraints.

Paired with this task of developing rocket capabilities is a safety engineering task. Safety engineering is the art of ensuring that an engineered system provides acceptable levels of safety. When it comes to achieving a soft landing on the Moon, there are many different roles for safety engineering to play. One team of engineers might ensure that the materials used in constructing the rocket are capable of withstanding the stress of a rocket launch with significant margin for error. Another might design escape systems that ensure the humans in the rocket can survive even in the event of failure. Another might design life support systems capable of supporting the crew in dangerous environments.

A separate important task is target selection, i.e., picking where on the Moon to land. In the case of a Moon mission, targeting research might entail things like designing and constructing telescopes (if they didn't exist already) and identifying a landing zone on the Moon. Of course, only so much targeting can be done in advance, and the lunar landing vehicle may need to be designed so that it can alter the landing target at the last minute as new data comes in; this again would require feats of engineering.

Beyond the task of (safely) reaching escape velocity and figuring out where you want to go, there is one more crucial prerequisite for landing on the Moon. This is rocket alignment research, the technical work required to reach the correct final destination. We'll use this as an analogy to illustrate MIRI's research focus, the problem of artificial intelligence alignment.

The alignment challenge

Hitting a certain target on the Moon isn't as simple as carefully pointing the nose of the rocket at the relevant lunar coordinate and hitting "launch" — not even if you trust your pilots to make course corrections as necessary. There's also the important task of plotting trajectories between celestial bodies.

This rocket alignment task may require a distinct body of theoretical knowledge that isn't required just for getting a payload off of the planet. Without calculus, designing a functional rocket would be enormously difficult. Still, with enough tenacity and enough resources to spare, we could imagine a civilization reaching space after many years of trial and error — at which point they would be confronted with the problem that reaching space isn't sufficient for steering toward a specific location.¹

The first rocket alignment researchers might ask, "What trajectory would we have our rocket take under ideal conditions, without worrying about winds or explosions or fuel efficiency?" If even that question were beyond their current abilities, they might simplify the problem still further, asking, "At what angle and velocity would we fire a cannonball such that it enters a stable orbit around Earth, assuming that Earth is perfectly spherical and has no atmosphere?"

To an early rocket engineer, for whom even the problem of building any vehicle that makes it off the launch pad remains a frustrating task, the alignment theorist's questions might look out-of-touch. The engineer may ask "Don't you know that rockets aren't going to be fired out of cannons?" or "What does going in circles around the Earth have to do with getting to the Moon?" Yet understanding rocket alignment is quite important when it comes to achieving a soft landing on the Moon. If you don't yet know at what angle and velocity to fire a cannonball such that it would end up in a stable orbit on a perfectly spherical planet with no atmosphere, then you may need to develop a better understanding of celestial mechanics before you attempt a Moon mission.
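As a rough illustration of the simplified cannonball question, here is a minimal sketch under ideal two-body assumptions (a perfectly spherical Earth, no atmosphere, a point-mass cannonball). The constants, function names, and the 8,500 m/s test speed are illustrative assumptions and do not come from the post or from MIRI's work. Under these assumptions, a cannonball fired from the surface stays clear of the ground only if it is fired horizontally at or above the circular-orbit speed and below escape speed.

```python
import math

# Assumed constants for an idealized Earth (perfect sphere, no atmosphere).
MU_EARTH = 3.986004418e14   # gravitational parameter GM, in m^3/s^2
R_EARTH = 6.371e6           # mean radius, in m


def circular_speed(radius_m: float) -> float:
    """Speed of a circular orbit at the given distance from Earth's center."""
    return math.sqrt(MU_EARTH / radius_m)


def escape_speed(radius_m: float) -> float:
    """Minimum speed at which an object at this distance is no longer bound."""
    return math.sqrt(2.0 * MU_EARTH / radius_m)


def periapsis_radius(speed: float, angle_deg: float, radius_m: float = R_EARTH) -> float:
    """Closest approach to Earth's center for a bound two-body trajectory.

    angle_deg is the launch angle above the local horizontal.
    """
    energy = 0.5 * speed ** 2 - MU_EARTH / radius_m           # specific orbital energy
    if energy >= 0.0:
        raise ValueError("not a bound orbit: speed is at or above escape speed")
    h = radius_m * speed * math.cos(math.radians(angle_deg))  # specific angular momentum
    semi_major = -MU_EARTH / (2.0 * energy)
    # Clamp at zero to absorb floating-point rounding for near-circular orbits.
    ecc = math.sqrt(max(0.0, 1.0 + 2.0 * energy * h ** 2 / MU_EARTH ** 2))
    return semi_major * (1.0 - ecc)


v_circ = circular_speed(R_EARTH)   # roughly 7.9 km/s
v_esc = escape_speed(R_EARTH)      # roughly 11.2 km/s

# A horizontal shot between the two speeds returns to surface height once per
# orbit; the same speed fired at 45 degrees dips below the surface (a crash).
horizontal = periapsis_radius(8500.0, 0.0)
tilted = periapsis_radius(8500.0, 45.0)
```

The sketch deliberately ignores everything an engineer would care about (drag, terrain, the cannon itself), which is the point of the simplification: it isolates the one relationship between speed, angle, and orbit shape that the alignment theorist is asking about.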

Three forms of AI safety research

The case is similar with AI research. AI capabilities work comes part and parcel with associated safety engineering tasks. Working today, an AI safety engineer might focus on making the internals of large classes of software more transparent and interpretable by humans. They might ensure that the system fails gracefully in the face of adversarial observations. They might design security protocols and early warning systems that help operators prevent or handle system failures.²

AI safety engineering is indispensable work, and it's infeasible to separate safety engineering from capabilities engineering. Day-to-day safety work in aerospace engineering doesn't rely on committees of ethicists peering over engineers' shoulders. Some engineers will happen to spend their time on components of the system that are there for reasons of safety — such as failsafe mechanisms or fallback life-support — but safety engineering is an integral part of engineering for safety-critical systems, rather than a separate discipline.

In the domain of AI, target selection addresses the question: if one could build a powerful AI system, what should one use it for? The potential development of superintelligence raises a number of thorny questions in theoretical and applied ethics. Some of those questions can plausibly be resolved in the near future by moral philosophers and psychologists, and by the AI research community. Others will undoubtedly need to be left to the future. Stuart Russell goes so far as to predict that “in the future, moral philosophy will be a key industry sector.” We agree that this is an important area of study, but it is not the main focus of the Machine Intelligence Research Institute.

Researchers at MIRI focus on problems of AI alignment. We ask questions analogous to "at what angle and velocity would we fire a cannonball to put it in a stable orbit, if Earth were perfectly spherical and had no atmosphere?"

Selecting promising AI alignment research paths is not a simple task. With the benefit of hindsight, it's easy enough to say that early rocket alignment researchers should begin by inventing calculus and studying gravitation. For someone who doesn't yet have a clear understanding of what "calculus" or "gravitation" are, however, choosing research topics might be quite a bit more difficult. The fruitful research directions would need to compete with fruitless ones, such as studying aether or Aristotelian physics; and which research programs are fruitless may not be obvious in advance.

Toward a theory of alignable agents

What are some plausible candidates for the role of "calculus" or "gravitation" in the field of AI?

At MIRI, we currently focus on subjects such as good reasoning under deductive limitations, decision theories that work well even for agents embedded in large environments, and reasoning procedures that approve of the way they reason. This research often involves building toy models and studying problems under dramatic simplifications, analogous to assuming a perfectly spherical Earth with no atmosphere.

One common question we hear about alignment research runs analogously to: "If you don't develop calculus, what bad thing happens to your rocket? Do you think the pilot will be struggling to make a course correction, and find that they simply can't add up the tiny vectors fast enough? That scenario just doesn't sound plausible."

This misunderstanding perhaps stems from an attempt to draw too direct a line between alignment theory and specific present-day engineering tasks. The point of developing calculus is not to allow the pilot to make course corrections quickly; the point is to make it possible to discuss curved rocket trajectories in a world where the best tools available assume that rockets move in straight lines.

The case is similar with, e.g., attempts to develop theories of logical uncertainty. The problem is not that we visualize a specific AI system encountering a catastrophic failure because it mishandles logical uncertainty; the problem is that all our existing tools for describing the behavior of rational agents assume that those agents are logically omniscient, making our best theories incommensurate with our best practical AI designs.

At this point, the goal of alignment research is not to solve particular engineering problems. The goal of early rocket alignment research is to develop shared language and tools for generating and evaluating rocket trajectories, which will require developing calculus and celestial mechanics if they do not already exist. Similarly, the goal of AI alignment research is to develop shared language and tools for generating and evaluating methods by which powerful AI systems could be designed to act as intended.

One might worry that it is difficult to set benchmarks of success for alignment research. Is a Newtonian understanding of gravitation sufficient to attempt a Moon landing, or must one develop a complete theory of general relativity before believing that one can land softly on the Moon?³

In the case of AI alignment, there is at least one obvious benchmark to focus on initially. Imagine we had access to an incredibly powerful computer with access to the internet, an automated factory, and large sums of money. If we could program that computer to reliably achieve some simple goal (such as producing as much diamond as possible), then a large share of the AI alignment research would be completed.

The pursuit of a goal such as this one is more or less MIRI's approach to AI alignment research. We think of this question as our version of the question, "Could you hit the Moon with a rocket if fuel and winds were no concern?" Answering that question, on its own, won’t ensure that smarter-than-human AI systems are aligned with our goals; but it would represent a major advance over our current knowledge, and it doesn’t look like the kind of basic insight that we can safely skip over.

What next?

Over the past year, we've seen a massive increase in attention towards the task of ensuring that future AI systems are robust and beneficial. AI safety work is being taken very seriously, and AI engineers are stepping up and acknowledging that safety engineering is not separable from capabilities engineering. It is becoming apparent that as the field of artificial intelligence matures, safety engineering will become a more and more firmly embedded part of AI culture. Meanwhile, new investigations of target selection and other safety questions will be showcased at an AI and Ethics workshop at AAAI-16, one of the larger annual conferences in the field.

A fourth variety of safety work is also receiving increased support: strategy research. If your nation is currently engaged in a cold war and locked in a space race, you may well want to consult with game theorists and strategists so as to ensure that your attempts to put a person on the Moon do not upset a delicate political balance and lead to a nuclear war.⁴ If international coalitions will be required in order to establish treaties regarding the use of space, then diplomacy may also become a relevant aspect of safety work. The same principles hold when it comes to AI, where coalition-building and global coordination may play an important role in the technology's development and use.

Strategy research has been on the rise this year. AI Impacts is producing strategic analyses relevant to the designers of this potentially world-changing technology, and will soon be joined by the Strategic Artificial Intelligence Research Centre. The new Leverhulme Centre for the Future of Intelligence will be pulling together people across many different disciplines to study the social impact of AI, forging new collaborations. The Global Priorities Project, meanwhile, is analyzing what types of interventions might be most effective at ensuring positive outcomes from the development of powerful AI systems.

The field is moving fast, and these developments are quite exciting. Throughout it all, though, AI alignment research in particular still seems largely under-served.

MIRI is not the only group working on AI alignment; a handful of researchers from other organizations and institutions are also beginning to ask similar questions. MIRI's particular approach to AI alignment research is by no means the only one available: when first thinking about how to put humans on the Moon, one might want to consider both rockets and space elevators. Regardless of who does the research or where they do it, it is important that alignment research receive attention.

Smarter-than-human AI systems may be many decades away, and they may not closely resemble any existing software. This limits our ability to identify productive safety engineering approaches. At the same time, the difficulty of specifying our values makes it difficult to identify productive research in moral theory. Alignment research has the advantage of being abstract enough to be potentially applicable to a wide variety of future computing systems, while being formalizable enough to admit of unambiguous progress. By prioritizing such work, therefore, we believe that the field of AI safety will be able to ground itself in technical work without losing sight of the most consequential questions in AI.

Safety engineering, moral theory, strategy, and general collaboration-building are all important parts of the project of developing safe and useful AI. On the whole, these areas look poised to thrive as a result of the recent rise in interest in long-term outcomes, and I'm thrilled to see more effort and investment going towards those important tasks.

The question is: What do we need to invest in next? The type of growth that I most want to see happen in the AI community next would be growth in AI alignment research, via the formation of new groups or organizations focused primarily on AI alignment and the expansion of existing AI alignment teams at MIRI, UC Berkeley, the Future of Humanity Institute at Oxford, and other institutions.

Before trying to land a rocket on the Moon, it's important that we know how we would put a cannonball into a stable orbit. Absent a good theoretical understanding of rocket alignment, it might well be possible for a civilization to eventually reach escape velocity; but getting somewhere valuable and exciting and new, and getting there reliably, is a whole extra challenge.

¹ Similarly, we could imagine a civilization that lives on the only planet in its solar system, or lives on a planet with perpetual cloud cover obscuring all objects except the Sun and Moon. Such a civilization might have an adequate understanding of terrestrial mechanics while lacking a model of celestial mechanics and lacking the knowledge that the same dynamical laws hold on Earth and in space. There would then be a gap in their theoretical understanding of rocket alignment, distinct from limitations in their understanding of how to reach escape velocity. ↩

² Roman Yampolskiy has used the term “AI safety engineering” to refer to the study of AI systems that can provide proofs of their safety for external verification, including some theoretical research that we would term "alignment research." His usage differs from the usage here. ↩

³ In either case, of course, we wouldn't want to put a moratorium on the space program while we wait for a unified theory of quantum mechanics and general relativity. We don't need a perfect understanding of gravity. ↩

⁴ This was a role historically played by the RAND Corporation. ↩

This is all old hat, targeted at people with minimal exposure to MIRI/SIAI ideas.

I'd be interested to hear something newer in an article format, though!

One might worry that it is difficult to set benchmarks of success for alignment research. Is a Newtonian understanding of gravitation sufficient to attempt a Moon landing, or must one develop a complete theory of general relativity before believing that one can land softly on the Moon?³

In the case of AI alignment, there is at least one obvious benchmark to focus on initially. Imagine we had access to an incredibly powerful computer with access to the internet, an automated factory, and large sums of money. If we could program that computer to reliably achieve some simple goal (such as producing as much diamond as possible), then a large share of the AI alignment research would be completed.

Are we close to meeting this benchmark?

See http://arbital.com/pages/7686083980732512719 for an incomplete short story written by Eliezer that considerably extends the rocket alignment analogy.

One common question we hear about alignment research runs analogously to: "If you don't develop calculus, what bad thing happens to your rocket? Do you think the pilot will be struggling to make a course correction, and find that they simply can't add up the tiny vectors fast enough? That scenario just doesn't sound plausible."

Actually, that sounds entirely plausible.

The case is similar with, e.g., attempts to develop theories of logical uncertainty. The problem is not that we visualize a specific AI system encountering a catastrophic failure because it mishandles logical uncertainty; the problem is that all our existing tools for describing the behavior of rational agents assume that those agents are logically omniscient, making our best theories incommensurate with our best practical AI designs.

Well, of course, part of the problem is that the best theories of "rational agents" try to assume Homo Economicus into being, and insist on cutting off all the ways in which physically-realizable minds cannot fit. So we need a definition of rationality that makes sense in a world where agents don't have completed infinities of computational power and can be modified by the environment and don't come with built-in utility functions that necessarily map physically realizable situations to the real numbers.

If we could program that computer to reliably achieve some simple goal (such as producing as much diamond as possible), then a large share of the AI alignment research would be completed.

Wait wait wait. You're saying that the path between Clippy and a prospective completed FAI is shorter than the path between today's AI state-of-the-art and Clippy? Because it sounds like you're saying that, even though I really don't expect you to say that.

On the upside, I do think we can spell out a research program to get us there, which will be grounded in current computational cog-sci and ML literature, which will also help with Friendliness/alignment engineering, which will not engender arguments with Jessica over math this time.

But now for the mandatory remark: you are insane and will kill us all ;-), rabble rabble rabble.

Clippy is a thought experiment used to illustrate two ideas: terminal goals are orthogonal to capabilities ("the AI does not love you"), and highly capable agents tend to pursue instrumental goals like resource acquisition and self-preservation, whatever their terminal goals are ("the AI does not hate you, but..."). This highlights the fact that highly capable AI can be dangerous even if it's reliably pursuing some known goal and the goal isn't ambitious or malicious. For that reason, Clippy comes up a lot as an intuition pump for why we need to get started early on safety research.

But 'a system causes harm in the course of reliably pursuing some known, stable, obviously-non-humane goal' is a very small minority of the actual disaster scenarios MIRI researchers are worried about. Not because it looks easy to go from a highly reliable diamond maximizer to an aligned superintelligence, but because there appear to be a larger number of ways things can go wrong before we get to that point.

1. We can fail to understand an advanced AI system well enough to know how 'goals' are encoded in it, forcing us to infer and alter goals indirectly.
2. We can understand the system's 'goals,' but have them be in the wrong idiom for a safe superintelligence (e.g., rewards for a reinforcement learner).
3. We can understand the system well enough to specify its goals, but not understand our own goals fully or precisely enough to specify them correctly. We come up with an intuitively 'friendly' goal (something more promising-sounding than 'maximize the number of paperclips'), but it's still the wrong goal.
4. Similarly: We can understand the system well enough to specify safe behavior in its initial context, but the system stops being safe after it or its environment undergoes a change. An example of this is instability under self-modification.
5. We can design advanced AI systems we don't realize (or don't care) have consequentialist goals. This includes systems we don't realize are powerful optimizers, e.g., ones whose goal-oriented behavior may depend in complicated ways on the interaction of multiple AI systems, or ones that function as unnoticed subsystems of non-consequentialists.

Ok, so now I'm understanding, and I think our models match up better than I'd thought. You're basically saying that (1)-(2) and (4)-(5) are a major portion of the alignment research that actually needs doing, even while (3) has become, so to speak, the famous "Hard Problem of" FAI, when in fact it's only (let's lazily call it) 20% of what actually needs doing.

I can also definitely buy, based on what I've read, that better formalisms for 1, 2, 4, and 5 can all help make (3) easier.

Programming a computer to reliably make lots of diamonds (or paperclips) is not creating Clippy for the same reason that programming Google Maps to produce the shortest distance between two locations is not creating Clippy. People program computers to do X, where X doesn't consider the welfare of humans, all the time. The programming is not really "do X no matter what", it's "do X using these methods". Google Maps will not start trying to hack the computers of construction equipment in order to build a bridge and shorten the distance it finds between two points.

Programming a computer to reliably make lots of diamonds (or paperclips) is not creating Clippy for the same reason that programming Google Maps to produce the shortest distance between two locations is not creating Clippy.

Ok, but that makes Nate's statement very confusing. We already understand, "up to" R&D effort, how to program computers to use various peripherals to perform a task in the physical world without intelligence, using fixed methods. I'm left confused at what industrial automation has to do with AI alignment research.

Imagine a world where humans somehow achieved jet-propelled flight before developing a firm understanding of calculus or celestial mechanics.

No need to imagine it. Rockets have been around since at least the 10th century.

In a world like that, what work would be needed in order to safely transport humans to the Moon?

Pretty much the same work that was needed in order to transport humans to the Moon at all.

Note how humans didn't manage to fly rockets to the Moon, or even to use them as really effective weapons, until they figured out calculus, celestial mechanics, and a ton of other stuff.

By your analogy, one of the main criticisms of doing MIRI-style AGI safety research now is that it's like 10th century Chinese philosophers doing Saturn V safety research based on what they knew about fire arrows.

By your analogy, one of the main criticisms of doing MIRI-style AGI safety research now is that it's like 10th century Chinese philosophers doing Saturn V safety research based on what they knew about fire arrows.

This is a fairly common criticism, yeah. The point of the post is that MIRI-style AI alignment research is less like this and more like Chinese mathematicians researching calculus and gravity, which is still difficult, but much easier than attempting to do safety engineering on the Saturn V far in advance :-)

Don't kid yourself in the effort to seem humble: it's an entirely feasible research effort.
