This is a reference post. It explains a fairly standard class of arguments, and is intended to be the opposite of novel; I just want a standard explanation to link to when invoking these arguments.

When planning or problem-solving, we focus on the hard subproblems. If I’m planning a road trip from New York City to Los Angeles, I’m mostly going to worry about which roads are fastest or prettiest, not about finding gas stations. Gas stations are abundant, so that subproblem is easy and I don’t worry about it until harder parts of the plan are worked out. On the other hand, if I were driving an electric car, then the locations of charging stations would be much more central to my trip-planning. In general, the hard subproblems have the most influence on the high-level shape of our solution, because solving them eats up the most degrees of freedom.

In the context of AI alignment, which subproblems are hard and which are easy?

Here’s one class of arguments: compute capacity and data capacity are both growing rapidly over time, so it makes sense to treat those as “cheap” - i.e. anything which can be solved by throwing more compute/data at it is easy. The hard subproblems, then, are those which are still hard even with arbitrarily large amounts of compute and data.

In particular, with arbitrary compute and data, we basically know how to get best-possible predictive power on a given data set: Bayesian updates on low-level physics models or, more generally, approximations of Solomonoff induction. So we’ll also assume predictive power is “cheap” - i.e. anything which can be solved by more predictive power is easy.

This is also reasonable in machine learning practice - once a problem is reduced to predictive power on some dataset, we can throw algorithms at it until it’s solved. The hard part - as many data scientists will attest - is reducing our real objective to a prediction problem and collecting the necessary data. It’s rare to find a client with a problem where all we need is predictive power and the necessary data is just sitting there.

(We could also view this as an interface argument: “predictive problems” are a standard interface, with libraries, tools, algorithms, theory and specialists all set up to handle them. As in many other areas, setting up our actual problem to fit that interface while still consistently doing what we want is the hard/expensive part.)

The upshot of all this: in order to identify alignment subproblems which are likely to be hard, it’s useful to ask what would go wrong if the world-modelling parts of our system just do Bayesian updates on low-level physics models or use approximations of Solomonoff induction. We don’t ask this because we actually expect to use such algorithms, but rather because we expect that the failure modes which still appear under such assumptions are the hard failure modes.

New Comment
6 comments, sorted by Click to highlight new comments since:

FWIW, I think of Eliezer's essay Methodology of Unbounded Analysis as the standard ref here. (But, it has not yet been ported over to Alignment Forum or LW.)

[Deleted]

I would say the reason to assume infinite compute was less about which parts of the problem are hard, and more about which parts can be solved without a solution to the rest.

Good solutions often have even better solutions nearby. In particular, we would expect most efficient and comprehensible finite algorithms to tend towards some nice infinite behaviour in the limit. If we find an infinite algorithm, that's a good point to start looking for finite approximations. It is also often easier to search for an infinite algorithm than a good approximation. Backpropigation in gradient descent is a trickier algorithm than brute force search. Logical induction is more complicated to understand than brute force proof search.

See also Robustness to Scale. You wrote that "we expect that the failure modes which still appear under such assumptions are the hard failure modes" (emphasis mine). But there are some failure modes which don't appear with existing algorithms, yet are hypothesized to appear in the limit of more data and compute, such as the "malign universal prior" problem. It's unclear how much to worry about these problems, because as you say, we don't actually expect to use e.g. Solomonoff induction. I suspect a key issue is whether the problem is an inevitable result of scaling any algorithm, vs a quirk of the particular infinite data/compute algorithm being discussed.

But there are some failure modes which don't appear with existing algorithms, yet are hypothesized to appear in the limit of more data and compute...

This is a great point to bring up. One thing the OP probably doesn't emphasize enough is: just because one particular infinite-data/compute algorithm runs into a problem, does not mean that problem is hard.

Zooming out for a moment, the strategy the OP is using is problem relaxation: we remove a constraint from the problem (in this case data/compute constraints), solve that relaxed problem, then use the relaxed solution to inform our solution to the original problem. Note that any solution to the original problem is still a solution to the relaxed problem, so the relaxed problem cannot ever be any harder than the original. If it ever seems like a relaxed problem is harder than the original problem, then a mistake has been made.

In context: we relax alignment problems by removing the data/compute constraints. That does not mean we're required to use approximations of Solomonoff induction, or required to use perfect predictive power; it just means that we are allowed to use those things in our solution. If we can solve the problem by e.g. simply not using Solomonoff induction, then it's an easy problem in the infinite data/compute setting just like it's an easy problem in a more realistic setting.

If we don't know of any way to solve the problem, even when we're allowed infinite data/compute, then it's a good hard-problem candidate.

Interesting to compare/contrast with "The Ideal Fades into the Background" from What does it mean to apply decision theory? (to be clear, I don't think the two posts are opposed)