This post is my attempt to summarize and distill the major public debates about MIRI's highly reliable agent designs (HRAD) work (which includes work on decision theory), including the discussions in Realism about rationality and Daniel Dewey's My current thoughts on MIRI's "highly reliable agent design" work. Part of the difficulty with discussing the value of HRAD work is that it's not even clear what the disagreement is about, so my summary takes the form of multiple possible "worlds" we might be in; each world consists of a positive case for doing HRAD work, along with the potential objections to that case, which results in one or more cruxes.

I will talk about "being in a world" throughout this post. What I mean by this is the following: If we are "in world X", that means that the case for HRAD work outlined in world X is the one that most resonates with MIRI people as their motivation for doing HRAD work; and that when people disagree about the value of HRAD work, this is what the disagreement is about. When I say that "I think we are in this world", I don't mean that I agree with this case for HRAD work; it just means that this is what I think MIRI people think.

In this post, the pro-HRAD stance is something like "HRAD work is the most important kind of technical research in AI alignment; it is the overwhelming priority and we're pretty much screwed if we under-invest in this kind of research" and the anti-HRAD stance is something like "HRAD work seems significantly less promising than other technical AI alignment agendas, such as the approaches to directly align machine learning systems (e.g. iterated amplification)". There is a much weaker pro-HRAD stance, which is something like "HRAD work is interesting and doing more of it adds value, but it's not necessarily the most important kind of technical AI alignment research to be working on"; this post is not about this weaker stance.

Clarifying some terms

Before describing the various worlds, I want to present some distinctions that have come up in discussions about HRAD, which will be relevant when distinguishing between the worlds.

Levels of abstraction vs levels of indirection

The idea of levels of abstraction was introduced in the context of debate about HRAD work by Rohin Shah, and is described in this comment (start from "When groups of humans try to build complicated stuff"). For more background, see these articles on Wikipedia.

Later on, in this comment, Rohin gave a somewhat different "levels" idea, which I've decided to call "levels of indirection". The idea is that there might not be a hierarchy of abstraction, but there are still multiple intermediate layers between the theory you have and the end result you want. The relevant "levels of indirection" are the sequence HRAD → machine learning → AGI. Even though levels of indirection are different from levels of abstraction, the idea is that the same principle applies: the more levels there are, the harder it becomes for a theory to apply to the final level.

Precise vs imprecise theory

A precise theory is one which can scale to 2+ levels of abstraction/indirection.

An imprecise theory is one which can scale to at most 1 level of abstraction/indirection.

More intuitively, a precise theory is more mathy, rigorous, and exact like pure math and physics, and an imprecise theory is less mathy, like economics and psychology.

Building agents from the ground up vs understanding the behavior of rational agents and predicting roughly what they will do

This distinction comes from Abram Demski's comment. However, I'm not confident I've understood this distinction in the way that Abram intended it, so what I describe below may be a slightly different distinction.

Building agents from the ground up means having a precise theory of rationality that allows us to build an AGI in a satisfying way, e.g. where someone with security mindset can be confident that it is aligned. Importantly, we allow the AGI to be built using whatever way is safest or most theoretically satisfying, rather than requiring that the AGI be built using whatever methods are mainstream (e.g. current machine learning methods).

Understanding the behavior of rational agents and predicting roughly what they will do means being handed an arbitrary agent implemented in some way (e.g. via blackbox ML) and then being able to predict roughly how it will act.

I think of the difference between these two as the difference between existential and universal quantification: "there exists x such that P(x)" and "for all x we have P(x)", where P(x) is something like "we can understand and predict how x will act in a satisfying way". The former only says that we can build some AGI using the precise theory that we understand well, whereas the latter says we have to deal with whatever kind of AGI that ends up being developed using methods we might not understand well.
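A sketch of this distinction in symbols may help. The notation is assumed here for illustration and does not come from the linked comments: $\mathcal{A}$ stands for the set of all agents that might end up being built, $\mathcal{A}_T \subseteq \mathcal{A}$ for the agents that can be built from the ground up using a precise theory $T$, and $P(x)$ is as described above.

```latex
\begin{align*}
\text{Building agents from the ground up:} \quad
  & \exists\, x \in \mathcal{A}_T \;\; P(x) \\
\text{Understanding arbitrary rational agents:} \quad
  & \forall\, x \in \mathcal{A} \;\; P(x)
\end{align*}
```

Under these assumptions, and provided $\mathcal{A}_T$ is non-empty, the second statement implies the first but not the other way around, which is one way to see why the second is the more demanding goal.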

World 1

Case for HRAD

The goal of HRAD research is to generally become less confused about things like counterfactual reasoning and logical uncertainty. Becoming less confused about these things will: help AGI builders avoid, detect, and fix safety issues; help AGI builders predict or explain safety issues; help to conceptually clarify the AI alignment problem; and help us be satisfied that the AGI is doing what we want. Moreover, unless we become less confused about these things, we are likely to screw up alignment because we won't deeply understand how our AI systems are reasoning. There are other ways to gain clarity on alignment, such as by working on iterated amplification, but these approaches don't decompose cognitive work enough.

For this case, it is not important for the final product of HRAD to be a precise theory. Even if the final theory of embedded agency is imprecise, or even if there is no "final say" on the topic, if we are merely much less confused than we are now, that is still good enough to help us ensure AI systems are aligned.

Why I think we might be in this world

The main reason I think we might be in this world (i.e. that the above case is the motivating reason for MIRI prioritizing HRAD work) is that people at MIRI frequently seem to be saying something like the case above. However, they also seem to be saying different things in other places, so I'm not confident this is actually their case. Here are some examples:

  • Eliezer Yudkowsky: "Techniques you can actually adapt in a safe AI, come the day, will probably have very simple cores — the sort of core concept that takes up three paragraphs, where any reviewer who didn’t spend five years struggling on the problem themselves will think, “Oh I could have thought of that.” Someday there may be a book full of clever and difficult things to say about the simple core — contrast the simplicity of the core concept of causal models, versus the complexity of proving all the clever things Judea Pearl had to say about causal models. But the planetary benefit is mainly from posing understandable problems crisply enough so that people can see they are open, and then from the simpler abstract properties of a found solution — complicated aspects will not carry over to real AIs later."
  • Rob Bensinger: "We’re working on decision theory because there’s a cluster of confusing issues here (e.g., counterfactuals, updatelessness, coordination) that represent a lot of holes or anomalies in our current best understanding of what high-quality reasoning is and how it works." The same post uses phrases like "developing an understanding of roughly what counterfactuals are and how they work" and "very roughly how/why it works", and it doesn't really specify whether the final output is expected to be precise. (The analogy with probability theory and rockets gestures at precise theories, but the post doesn't come out and say it.)
  • Abram Demski: "I don't think there's a true rationality out there in the world, or a true decision theory out there in the world, or even a true notion of intelligence out there in the world. I work on agent foundations because there's still something I'm confused about even after that, and furthermore, AI safety work seems fairly hopeless while still so radically confused about the-phenomena-which-we-use-intelligence-and-rationality-and-agency-and-decision-theory-to-describe."
  • Nate Soares: "The main case for HRAD problems is that we expect them to help in a gestalt way with many different known failure modes (and, plausibly, unknown ones). E.g., 'developing a basic understanding of counterfactual reasoning improves our ability to understand the first AGI systems in a general way, and if we understand AGI better it's likelier we can build systems to address deception, edge instantiation, goal instability, and a number of other problems'."
  • In the deconfusion section of MIRI's 2018 update, some of the examples of deconfusion are not precise/mathematical in nature (e.g. see the paragraph starting with "In 1998, conversations about AI risk and technological singularity scenarios often went in circles in a funny sort of way" and the list after "Among the bits of conceptual progress that MIRI contributed to are"). There are more mathematical examples in the post, but the fact that there are also non-mathematical examples suggests that having a precise theory of rationality is not important to the case for HRAD work. There's also the quote "As AI researchers explore the space of optimizers, what will it take to ensure that the first highly capable optimizers that researchers find are optimizers they know how to aim at chosen tasks? I’m not sure, because I’m still in some sense confused about the question."

The crux

One way to reject this case for HRAD work is by saying that imprecise theories of rationality are insufficient for helping to align AI systems. This is what Rohin does in this comment where he says imprecise theories cannot build things "2+ levels above".

There is a separate potential rejection, which is to say that either HRAD work will never result in precise theories or that even a precise theory is insufficient for helping to align AI systems. However, these move the crux to a place where they apply to more restricted worlds where the goal of HRAD work is specifically to come up with a precise theory, so these will be covered in the other worlds below.

There is a third rejection, which is to argue that other approaches (such as iterated amplification) are more promising for gaining clarity on alignment. In this case, the main disagreement may instead be about other agendas rather than about HRAD.

World 2

Case for HRAD

The goal of HRAD research is to come up with a theory of rationality that is so precise that it allows one to build an agent from the ground up. Deconfusion is still important, as with world 1, but in this case we don't merely want any kind of deconfusion, but specifically deconfusion which is accompanied by a precise theory of rationality.

For this case, HRAD research isn't intended to produce a precise theory about how to predict ML systems, or to be able to make precise predictions about what ML systems will do. Instead, the idea is that the precise theory of rationality will help AGI builders avoid, detect, and fix safety issues; predict or explain safety issues; help to conceptually clarify the AI alignment problem; and help us be satisfied that the AGI is doing what we want. In other words, instead of directly using a precise theory about understanding/predicting rational agents in general, we use the precise theory about rationality to help us roughly predict what rational agents will do in general (including ML systems).

As with world 1, unless we become less confused, we are likely to screw up alignment because we won't deeply understand how our AI systems are reasoning. There are other ways to gain clarity on alignment, such as by working on iterated amplification, but these approaches don't decompose cognitive work enough.

Why I think we might be in this world

This seems to be what Abram is saying in this comment (see especially the part after "I guess there's a tricky interpretational issue here").

It also seems to match what Rohin is saying in these two comments.

The examples MIRI people sometimes give as precedents for HRAD-ish work, like the work done by Turing, Shannon, and Maxwell, are precise mathematical theories.

The crux

There seem to be two possible rejections of this case:

  • We can reject the existence of the precise theory of rationality. This is what Rohin does in this comment and this comment where he says "MIRI's theories will always be the relatively-imprecise theories that can't scale to '2+ levels above'." Paul Christiano seems to also do this, as summarized by Jessica Taylor in this post: intuition 18 is "There are reasons to expect the details of reasoning well to be 'messy'."
  • We can argue that even a precise theory of rationality is insufficient for helping to align AI systems. This seems to be what Daniel Dewey is doing in this post when he says things like "AIXI and Solomonoff induction are particularly strong examples of work that is very close to HRAD, but don't seem to have been applicable to real AI systems" and "It seems plausible that the kinds of axiomatic descriptions that HRAD work could produce would be too taxing to be usefully applied to any practical AI system".

World 3

Case for HRAD

The goal of HRAD research is to directly come up with a precise theory for understanding the behavior of rational agents and predicting what they will do. Deconfusion is still important, as with worlds 1 and 2, but in this case we don't merely want any kind of deconfusion, but specifically deconfusion which is accompanied by a precise theory that allows us to predict agents' behavior in general. And a precise theory is important, but we don't merely want a precise theory that lets us build an agent; we want our theory to act like a box that takes in an arbitrary agent (such as one built using ML and other black boxes) and allows us to analyze its behavior.

This theory can then be used to help AGI builders avoid, detect, and fix safety issues; predict or explain safety issues; help to conceptually clarify the AI alignment problem; and help us be satisfied that the AGI is doing what we want.

As with worlds 1 and 2, unless we become less confused, we are likely to screw up alignment because we won't deeply understand how our AI systems are reasoning. There are other ways to gain clarity on alignment, such as by working on iterated amplification, but these approaches don't decompose cognitive work enough.

Why I think we might be in this world

I mostly don't think we're in this world, but some critics might think we are.

For example, Abram says in this comment: "I can see how Ricraz would read statements of the first type [i.e. having precise understanding of rationality] as suggesting very strong claims of the second type [i.e. being able to understand the behavior of agents in general]."

Daniel Dewey might also expect to be in this world; it's hard for me to tell based on his post about HRAD.

The crux

The crux in this world is basically the same as the first rejection for world 2: we can reject the existence of a precise theory for understanding the behavior of arbitrary rational agents.

Conclusion, and moving forward

To summarize the above, combining all of the possible worlds, the pro-HRAD stance becomes:

(ML safety agenda not promising) and (
  (even an imprecise theory of rationality helps to align AGI) or
  ((a precise theory of rationality can be found) and
   (a precise theory of rationality can be used to help align AGI)) or
  (a precise theory to predict behavior of arbitrary agent can be found)
)

and the anti-HRAD stance is the negation of the above:

(ML safety agenda promising) or (
  (an imprecise theory of rationality cannot be used to help align AGI) and
  ((a precise theory of rationality cannot be found) or
   (even a precise theory of rationality cannot be used to help align AGI)) and
  (a precise theory to predict behavior of arbitrary agent cannot be found)
)
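As a sanity check, the two formulas above can be compared mechanically. The following is a minimal sketch, assuming the five parts are labeled A = "ML safety agenda not promising", B = "even an imprecise theory helps to align AGI", C = "a precise theory of rationality can be found", D = "a precise theory of rationality can be used to help align AGI", and E = "a precise theory to predict the behavior of arbitrary agents can be found". The function names are hypothetical and chosen only for illustration.

```python
from itertools import product


def pro_hrad(a, b, c, d, e):
    """Pro-HRAD stance: A and (B or (C and D) or E)."""
    return a and (b or (c and d) or e)


def anti_hrad(a, b, c, d, e):
    """Anti-HRAD stance as written out in the text:
    (not A) or ((not B) and ((not C) or (not D)) and (not E))."""
    return (not a) or ((not b) and ((not c) or (not d)) and (not e))


# Enumerate every assignment of truth values to the five parts.
assignments = list(product([False, True], repeat=5))

# The anti-HRAD formula is the negation of the pro-HRAD one exactly when
# the two formulas disagree on every assignment.
mismatches = [v for v in assignments if anti_hrad(*v) == pro_hrad(*v)]

print(f"checked {len(assignments)} assignments, {len(mismatches)} mismatches")
```

An empty `mismatches` list means the second formula really is the negation of the first (by De Morgan's laws), so the two stances are exhaustive and mutually exclusive as stated.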

How does this fit under the Double Crux framework? The current "overall crux" is a messy proposition consisting of multiple conjunctions and disjunctions, and fully resolving the disagreement can in the worst case require assigning truth values to all five parts: the statement "A and (B or (C and D) or E)", with disagreements resolved in the order A=True, B=False, C=True, D=False can still be true or false depending on the value of E. From an efficiency perspective, if some of the conjunctions/disjunctions don't matter, we want to get rid of them in order to simplify the structure of the overall crux (this corresponds to identifying which "world" we are in, using the terminology of this post), and we also might want to pick an ordering of which parts to resolve first (for example, with A=True and B=True, we already know the overall proposition is true).
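The point about ordering can be made concrete with a small sketch. It assumes the same five-part proposition "A and (B or (C and D) or E)" from the paragraph above. The helper `possible_outcomes` is a hypothetical name introduced here: given the parts resolved so far, it enumerates the remaining parts and reports which truth values the overall crux can still take.

```python
from itertools import product

PARTS = "ABCDE"


def crux(a, b, c, d, e):
    """The overall crux: A and (B or (C and D) or E)."""
    return a and (b or (c and d) or e)


def possible_outcomes(known):
    """Return the set of values the crux can still take, given the
    parts resolved so far (a dict such as {"A": True, "B": False})."""
    unknown = [p for p in PARTS if p not in known]
    outcomes = set()
    for values in product([False, True], repeat=len(unknown)):
        full = {**known, **dict(zip(unknown, values))}
        outcomes.add(crux(*(full[p] for p in PARTS)))
    return outcomes


# Worst case from the text: four parts resolved, yet E still decides it.
worst_case = possible_outcomes({"A": True, "B": False, "C": True, "D": False})

# Good ordering from the text: two parts resolved and the crux is settled.
settled_true = possible_outcomes({"A": True, "B": True})

# Resolving A as false settles the crux immediately, in the other direction.
settled_false = possible_outcomes({"A": False})

print(sorted(worst_case), sorted(settled_true), sorted(settled_false))
```

A result with two elements means the disagreement is still open; a single element means the remaining parts no longer matter. This is only an illustration of why the order of resolution affects how much has to be discussed, not a claim about which parts are in fact true.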

So some steps for moving the discussion forward:

  • I think it would be great to get HRAD proponents/opponents to be like "we're definitely in world X, and not any of the other worlds" or even be like "actually, the case for HRAD really is disjunctive, so both of the cases in worlds X and Y apply".
  • If I missed any additional possible worlds, or if I described one of the worlds incorrectly, I am interested in hearing about it.
  • If it becomes clear which world we are in, then the next step is to drill down on the crux(es) in that world.

Thanks to Ben Cottier, Rohin Shah, and Joe Bernstein for feedback on this post.

15 comments

World 3 doesn't strike me as a thing you can get in the critical period when AGI is a new technology. Worlds 1 and 2 sound approximately right to me, though the way I would say it is roughly: We can use math to better understand reasoning, and the process of doing this will likely improve our informal and heuristic descriptions of reasoning too, and will likely involve us recognizing that we were in some ways using the wrong high-level concepts to think about reasoning.

I haven't run the characterization above by any MIRI researchers, and different MIRI researchers have different models of how the world is likeliest to achieve aligned AGI. Also, I think it's generally hard to say what a process of getting less confused is likely to look like when you're still confused.

... we don't merely want a precise theory that lets us build an agent; we want our theory to act like a box that takes in an arbitrary agent (such as one built using ML and other black boxes) and allows us to analyze its behavior.

FWIW, this is what I consider myself to be mainly working towards, and I do expect that the problem is directly solvable. I don't think that's a necessary case to make in order for HRAD-style research to be far and away the highest priority for AI safety (so it's not necessarily a crux), but I do think it's both sufficient and true.

(I really like this post, as I said to Issa elsewhere, but) I realized after discussing this earlier that I don't agree with a key part of the precise vs. imprecise model distinction.

A precise theory is one which can scale to 2+ levels of abstraction/indirection.
An imprecise theory is one which can scale to at most 1 level of abstraction/indirection.

I think this is wrong. More levels of abstraction are worse, not better. Specifically, if a model exactly describes a system on one level, any abstraction will lose predictive power. (I'm ignoring computational cost, which I'll get back to.) Quantum theory is more specifically predictive than Newtonian physics. The reason we can move up and down levels is that we understand the system well enough to quantify how much precision we are losing, not that we can move further without losing precision.

The reason that precise theories are better is that they are tractable enough to quantify how far we can move away from them, and how much we lose by doing so. The problem with economics isn't that we don't have accurate enough models of human behavior to aggregate them, but that the inaccuracy isn't precise enough to allow understanding how the uncertainty from psychology shows up in economics. For example, behavioral economics is partly useless because we can't build equilibrium models, and the reason is that we can't quantify how they are wrong. For economics, we're better off with the worse model of rational agents, which we know is wrong, but can kind-of start to quantify by how much, so we can do economic analyses.

Your world descriptions and your objections seem to focus on HRAD being the only prerequisite to being able to create an aligned AGI, rather than simply one of them (and the one worth focusing on because of a combination of factors, such as which areas of research are the least attended to by other researchers, which areas could provide insights useful for attacking other ones, which ones are the most likely to be on a critical path, etc.). It could very well be an "overwhelming priority", as you stated the position you are trying to understand, without the goal being "to come up with a theory of rationality [...] that [...] allows one to build an agent from the ground up".

I am thinking of the following optimization problem. Let R1 be all the research that we anticipate getting completed by the mainstream AI community by the time they create an AGI. Let R2 be the smallest amount of successful research such that R1+R2 allows you to create an aligned AGI. Which research questions that we know how to formulate today, and have a way to start attacking today, are the most likely to be in R2? And among the top choices, which ones are also 1) more likely to produce insights that would help with other parts of R2, and 2) less likely to compress the AGI timeline even further? It seems possible to believe in HRAD being such a good choice (working backwards from R2) without being in one of your worlds (all of which work forward from HRAD).

Planned summary for the Alignment Newsletter:

This post tries to identify the possible cases for highly reliable agent design (HRAD) work to be the main priority of AI alignment. HRAD is a category of work at MIRI that aims to build a theory of intelligence and agency that can explain things like logical uncertainty and counterfactual reasoning.
The first case for HRAD work is that by becoming less confused about these phenomena, we will be able to help AGI builders predict, explain, avoid, detect, and fix safety issues and help to conceptually clarify the AI alignment problem. For this purpose, we just need _conceptual_ deconfusion -- it isn’t necessary that there must be precise equations defining what an AI system does.
The second case is that if we get a precise, mathematical theory, we can use it to build an agent that we understand “from the ground up”, rather than throwing the black box of deep learning at the problem.
The last case is that understanding how intelligence works will give us a theory that allows us to predict how _arbitrary_ agents will behave, which will be useful for AI alignment in all the ways described in the first case and <@more@>(@Theory of Ideal Agents, or of Existing Agents?@).
Looking through past discussion on the topic, the author believes that people at MIRI primarily believe in the first two cases. Meanwhile, critics (particularly me) say that it seems pretty unlikely that we can build a precise, mathematical theory, and a more conceptual but imprecise theory may help us understand reasoning better but is less likely to generalize sufficiently well to say important and non-trivial things about AI alignment for the systems we are actually building.

Planned opinion:

I like this post -- it seems like an accessible summary of the state of the debate so far. My opinions are already in the post, so I don’t have much to add.

Thanks for the post :) To be clear, I'm very excited about conceptual and deconfusion work in general, in order to come up with imprecise theories of rationality and intelligence. I guess this puts my position in world 1. The thing I'm not excited about is the prospect of getting to this final imprecise theory via doing precise technical research. In other words, I'd prefer HRAD work to draw more on cognitive science and less on maths and logic. I outline some of the intuitions behind that in this post.

Having said that, when I've critiqued HRAD work in the past, on a couple of occasions I've later realised that the criticism wasn't aimed at a crux for people actually working on it (here's my explanation of one of those cases). To some extent this is because, without a clearly-laid-out position to criticise, the critic has the difficult task of first clarifying the position then rebutting it. But I should still flag that I don't know how much HRAD researchers would actually disagree with my claims in the first paragraph.

Thanks for the post; it is a helpful disjunction of possibilities and set of links to prior discussion.

I think that the post would be clearer if, instead of sections called "Why I think we might be in this world", it had sections with the same content called "Links to where people have discussed being in this world" or something similar. I'm not really sure why you use the title you do; it threw me for a bit.

When I say that "I think we are in this world", I don't mean that I agree with this case for HRAD work; it just means that this is what I think MIRI people think.

According to this definition, "Links to where people have discussed being in this world" would mean that the links should be to people making arguments that MIRI people believe X, rather than that X is true.

With help from David Manheim, this post has now been turned into a paper. Thanks to everyone who commented on the post!

One way to reject this case for HRAD work is by saying that imprecise theories of rationality are insufficient for helping to align AI systems. This is what Rohin does in this comment where he says imprecise theories cannot build things "2+ levels above".

I should note that there are some things in world 1 that I wouldn't reject this way -- e.g. one of the examples of deconfusion is “anyhow, we could just unplug [the AGI].” That is directly talking about AGI safety, and so deconfusion on that point is "1 level away" from the systems we actually build, and isn't subject to the critique. (And indeed, I think it is important and great that this statement has been deconfused!)

It is my impression though that current HRAD work is not "directly talking about AGI safety", and is instead talking about things that are "further away", to which I would apply the critique.

I think theoretical work on AI safety has multiple different benefits, but I prefer a slightly different categorization. I like categorizing in terms of the sort of safety guarantees we can get, on a spectrum from "stronger but harder to get" to "weaker but easier to get". Specifically, the reasonable goals for such research IMO are as follows.

Plan A is having (i) a mathematical formalization of alignment (ii) a specific practical algorithm (iii) a proof that this algorithm is aligned, or at least a solid base of theoretical and empirical evidence, similar to the situation in cryptography. This more or less corresponds to World 2.

Plan B is having (i) a mathematical formalization of alignment (ii) a specific practical algorithm (iii) a specific impractical but provably aligned algorithm (iv) informal and empirical arguments suggesting that the former algorithm is as aligned as the latter. As an analogy consider Q-learning (an impractical algorithm with provable convergence guarantees) and deep Q-learning (a practical algorithm with no currently known convergence guarantees, designed by analogy to the former). This sort of still corresponds to World 2 but not quite.

Plan C is having enough theory to at least have rigorous models of all possible failure modes, and theory-inspired informal and empirical arguments why a certain algorithm avoids them. As an analogy, concepts such as VC dimension and Rademacher complexity allow us to be more precise in our reasoning about underfitting and overfitting, even if we don't know how to compute them in practical scenarios. This corresponds to World 1, I guess?

In a sane civilization the solution would be not building AGI until we can implement Plan A. In the real civilization, we should go with the best plan that will be ready by the time competing projects become too dangerous to ignore.

World 3 seems too ambitious to me, since analyzing arbitrary code is almost always an intractable problem (e.g. Rice's theorem). You would need at least some constraints on how your agent is designed.

I think that the plans you lay out are all directly talking about the AI system we eventually build, and as a result I'm more optimistic about them (and your work, as it's easy to see how it makes progress towards these plans) relative to HRAD.

In contrast, as far as I can tell, HRAD work does not directly contribute to any of these plans, and instead the case seems to rely on something more indirect where a better understanding of reasoning will later help us execute on one of these plans. It's this indirection that makes me worried.

Well, HRAD certainly has relations to my own research programme. Embedded agency seems important since human values are probably "embedded" to some extent, counterfactuals are important for translating knowledge from the user's subjective vantage point to the AI's subjective vantage point, and reflection is important if it's required for high capability (as Turing RL suggests). I do agree that having a high-level plan for solving the problem is important to focus the research in the right directions.

a mathematical formalization of alignment

I can barely see how this is possible if we're talking about alignment to humans, even with a hypothetical formal theory of embedded agency. Do you imagine human values are cleanly represented and extractable, and that we can (potentially very indirectly) reference those values formally? Do you mean something else by "formalization of alignment" that doesn't involve formal descriptions of human minds?

For examples of what a formalization of alignment could look like, see this and this.