It's easier to make your way to the supermarket than it is to compute the fastest route, which is yet easier than computing the fastest route for someone running backwards and doing two and a half jumping jacks every five seconds and who only follows the route some percentage of the time. Sometimes, constraints are necessary. Constraints come with costs. Sometimes, the costs are worth it.

Aspiring researchers trying to think about AI alignment might[1] have a failure mode which goes something like… this:

Oh man, so we need to solve both outer and inner alignment to build a superintelligent agent which is competitive with unaligned approaches and also doesn't take much longer to train, and also we have to know this ahead of time. Maybe we could use some kind of prediction of what people want... but wait, there are also problems with using human models! How can it help people if it can't model people? Ugh, and what about self-modification?! How is this agent even reasoning about the universe from inside the universe?

The aspiring researcher slumps in frustration, mutters a curse under their breath, and hangs up their hat – "guess this whole alignment thing isn't for me...". And isn't that so? All their brain could do was pattern-match onto already-proposed solutions and cached thinking.

There's more than one thing going wrong here, but I'm just going to focus on one. Given that person's understanding of AI alignment, this problem is wildly overconstrained. Whether or not alignment research is right for them, there's just no way that anyone's brain is going to fulfill this insane solution request!

Sometimes, constraints are necessary. I think that the alignment community is pretty good at finding plausibly necessary constraints. Maybe some of the above aren't necessary – maybe there's One Clever Trick you come up with which obviates one of these concerns.

Constraints come with costs. Sometimes, the costs are worth it. In this context, I think the costs are very much worth it. Under this implicit framing of the problem, you're pretty hosed if you don't get even outer alignment right.

However, even if the real problem has crazy constraints, that doesn't mean you should immediately tackle the fully constrained problem. I think you should often relax the problem first: eliminate or weaken constraints until you reach a problem which is still a little confusing, but which you can get some traction on.

Even if you know an unbounded solution to chess, you might still be 47 years away from a bounded solution. But if you can't state a program that solves the problem in principle, you are in some sense confused about the nature of the cognitive work needed to solve the problem. If you can't even solve a problem given infinite computing power, you definitely can't solve it using bounded computing power. (Imagine Poe trying to write a chess-playing program before he'd had the insight about search trees.)

~ The methodology of unbounded analysis
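To make "unbounded solution" concrete, here is a minimal sketch (my illustration, not from the quoted source) of the kind of in-principle program Shannon described: plain minimax search, with no pruning and no compute bounds. A toy subtraction game stands in for chess, but the same algorithm solves any finite two-player, zero-sum, perfect-information game in principle.

    # A minimal sketch of an "unbounded solution": plain minimax, no pruning,
    # no compute bounds. A toy subtraction game stands in for chess; the same
    # algorithm handles chess in principle, just not in practice.
    def minimax(pile, maximizing):
        """Value of "take 1 or 2 items; taking the last item wins",
        from the maximizing player's perspective, under perfect play."""
        if pile == 0:
            # The previous player took the last item and won.
            return -1 if maximizing else 1
        values = [minimax(pile - take, not maximizing)
                  for take in (1, 2) if take <= pile]
        return max(values) if maximizing else min(values)

    print(minimax(4, True))   # 1: the first player wins a 4-item pile
    print(minimax(3, True))   # -1: a 3-item pile loses under best play

For chess, the state space makes this astronomically expensive, which is exactly the gap between an unbounded solution and a bounded one.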

Historically, I tend to be too slow to relax research problems. On the flip side, all of my favorite research ideas were directly enabled by problem relaxation. Instead of just telling you what to do and then having you forget this advice in five minutes, I'm going to paint it into your mind using two stories.

Attainable Utility Preservation

It's spring of 2018, and I've written myself into a corner. My work with CHAI for that summer was supposed to be on impact measurement, but I inconveniently posted a convincing-to-me argument that impact measurement cannot admit a clean solution:

I want to penalize the AI for having side effects on the world.[2] Suppose I have a function which looks at the consequences of the agent's actions and magically returns all of the side effects. Even if you have this function, you still have to assign blame for each effect – either the vase breaking was the AI's fault, or it wasn't.

If the AI penalizes itself for everything, it'll try to stop people from breaking vases – it'll be clingy. But if you magically have a model of how people are acting in the world, and the AI magically only penalizes itself for things which are its fault, then the AI is incentivized to blackmail people to break vases in ways which don't technically count as its fault. Oops.

Summer dawned, and I occupied myself with reading – lots and lots of reading. Eventually, enough was enough – I wanted to figure this out. I strode through my school's library, markers in hand and determination in my heart. I wasn't going to leave before understanding a) exactly why impact measurement is impossible to solve cleanly, or b) how to solve it.

I reached the whiteboard, and then – with adrenaline pumping through my veins – I realized that I had no idea what this "impact" thing even is. Oops.

I'm staring at the whiteboard.

A minute passes.

59 more minutes pass.

I'd been thinking about how, in hindsight, it was so important that Shannon had first written a perfect chess-playing algorithm which required infinite compute, and that Hutter had written an AGI algorithm which required infinite compute. I didn't know how to solve impact under all the constraints, but what if I assumed something here?

What if I had infinite computing power? No… Still confused, don't see how to do it. Oh yeah, and what if the AI had a perfect world model? Hm... What if we could write down a fully specified utility function which represented human preferences? Could I measure impact if I knew that?

The answer was almost trivially obvious. My first thought was that negative impact would be a decrease in true utility, but that wasn't quite right. I realized that an impact measure also needs to capture the decrease in the ability to achieve utility. That's an optimal value function... So the negative impact would be the decrease in attainable utility for human values![3]

Okay, but we don't and won't know the "true" utility function. What if... we just penalized shift in all attainable utilities?

I then wrote down The Attainable Utility Preservation Equation, more or less. Although it took me a few weeks to realize and believe it, that equation solved all of the impact measurement problems which had seemed so insurmountable to me just minutes before.[4]
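In roughly the form it later took (a paraphrase of the published AUP penalty, not the exact whiteboard version): given auxiliary reward functions $R_1, \dots, R_n$ with optimal action-value functions $Q_{R_i}$, penalize how much taking action $a$ in state $s$ shifts attainable utility relative to doing nothing ($\varnothing$):

$$\text{Penalty}(s, a) \;=\; \sum_{i=1}^{n} \left| Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \right|$$

The agent maximizes its own reward minus a scaled multiple of this penalty, so actions which drastically change its power over the auxiliary goals become expensive.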

Formalizing Instrumental Convergence

It's spring of 2019, and I've written myself into a corner. My first post on AUP was confusing – I'd failed to truly communicate what I was trying to say. Inspired by Embedded Agency, I was planning an illustrated sequence of my own.

I was working through a bit of reasoning on how your ability to achieve one goal interacts with your ability to achieve seemingly unrelated goals. Spending a lot of money on red dice helps you with the collecting-dice goal, but makes it harder to become the best juggler in the world. That's a weird fact, but it's an important fact which underlies much of AUP's empirical success. I didn't understand why this fact was true.

At an impromptu presentation in 2018, I'd remarked that "AUP wields instrumental convergence as a weapon against the alignment problem itself". I tried thinking about it using the formalisms of reinforcement learning. Suddenly, I asked myself:

Why is instrumental convergence even a thing?

I paused. I went outside for a walk, and I paced. The walk lengthened, and I still didn't understand why. Maybe it was just a "brute fact", an "emergent" phenomenon – nope, not buying that. There's an explanation somewhere.

I went back to the drawing board – to the whiteboard, in fact. I stopped trying to understand the general case and focused on specific toy environments. I'm looking at an environment like this [diagram omitted], and I'm thinking: most agents go from state 1 to state 3. "Why does my brain think this?", I asked myself. Unhelpfully, my brain decided not to respond.

I'm staring at the whiteboard.

A minute passes.

29 more minutes pass.

I'm reminded of a paper my advisor had me read for my qualifying exam. The paper talked about a dual formulation for reinforcement learning environments, where you consider the available trajectories through the future instead of the available policies. I take a picture of the whiteboard and head back to my office.

I run into a friend. We start talking about work. I say, "I'm about 80% sure I have the insight I need – this is how I felt in similar situations in the past, and I turned out to be right".

I turned out to be right. I started building up an entire theory of this dual formalism. Instead of asking myself about the general case of instrumental convergence in arbitrary computable environments, I considered small deterministic Markov decision processes. I started proving everything I could, building up my understanding piece by piece. This turned out to make all the difference.
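For reference, the standard form of that duality (my gloss, not necessarily the paper's notation): instead of searching over policies, search over discounted state-action visitation distributions $d$. In a finite MDP with transitions $P$, discount $\gamma$, and initial state distribution $\mu$, the feasible $d$ are exactly those satisfying the flow constraints

$$\sum_{a} d(s', a) \;=\; (1-\gamma)\,\mu(s') + \gamma \sum_{s,\,a} P(s' \mid s, a)\, d(s, a) \qquad \text{for all } s',$$

and maximizing $\sum_{s,a} d(s,a)\, r(s,a)$ over this polytope recovers an optimal policy. In this view, "most agents go from 1 to 3" becomes a claim about the geometry of the feasible set rather than about any particular policy.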

Half a year later, I'd built up enough theory that I was able to explain a great deal (but not everything) about instrumental convergence.

Conclusion

Problem relaxation isn't always the right tactic. For example, if the problem isn't well-posed, it won't work well – imagine trying to "relax" the "problem" of free will! However, I think it's often the right move.

The move itself is simple: consider the simplest instance of the problem which is still confusing. Then, make a ton of simplifying assumptions while still keeping part of the difficulty present – don't assume away all of the difficulty. Finally, tackle the relaxed problem.

In general, this seems like a skill that successful researchers and mathematicians learn to use. MIRI does a lot of this, for example. If you're new to the research game, this might be one of the crucial things to pick up on. Even though I detailed how this has worked for me, I think I could benefit from relaxing more.

The world is going to hell. You might be working on a hard (or even an impossible) problem. We plausibly stand on the precipice of extinction and utter annihilation.

Just relax.

This is meant as a reference post. I'm not the first to talk about using problem relaxation in this way. For example, see The methodology of unbounded analysis.


  1. This failure mode is just my best guess – I haven't actually surveyed aspiring researchers. ↩︎

  2. The "convincing-to-me argument" contains a lot of confused reasoning about impact measurement, of course. For one, thinking about side effects is not a good way of conceptualizing the impact measurement problem. ↩︎

  3. The initial thought wasn't as clear as "penalize decrease in attainable utility for human values" – I was initially quite confused by the AUP equation. "What the heck is this equation, and how do I break it?".

    It took me a few weeks to get a handle on why it seemed to work so well. It wasn't for a month or two that I began to understand what was actually going on, eventually leading to the Reframing Impact sequence. However, for the reader's convenience, I whitewashed my reasoning here a bit. ↩︎

  4. At first, I wasn't very excited about AUP – I was new to alignment, and it took a lot of evidence to overcome the prior improbability of my having actually found something to be excited about. It took several weeks before I stopped thinking that my idea was probably secretly and horribly bad.

    However, I kept staring at the strange equation – I kept trying to break it, to find some obvious loophole which would send me back to the drawing board. I never found it. Looking back over a year later, AUP does presently have loopholes, but they're not obvious, nor should they have sent me back to the drawing board.

    I started to get excited about the idea. Two weeks later, my workday was wrapping up and I left the library.

    Okay, I think there's a good chance that this ends up solving impact. If I'm right, I'll want to have a photo to commemorate it.

    I turned heel, descending back into the library's basement. I took the photograph. I'm glad that I did.

    Discovering AUP was one of the happiest moments of my life. It gave me confidence that I could think, and it gave me some confidence that we can win – that we can solve alignment. ↩︎

Comments

A key AI safety skill is moving back and forth, as needed, between "could we solve problem X if we assume Y?" and "can we assume Y?".

habryka:

Promoted to curated: This is a technique I've seen mentioned in a bunch of places, but I haven't seen a good writeup for it, and I found it quite valuable to read. 

TAG:

imagine trying to “relax” the “problem” of free will!

I have:

Naturalism helps in the construction of a viable model of libertarian free will, because it becomes clear that choice cannot be an irreducible, atomic process. A common objection to libertarian free will has it that a random event cannot be sufficiently rational or connected to an individual's character, whereas a determined decision cannot be free, so that a choice is either un-free or capricious, objectionably random (irrational or unrelated to the agent's character and desires). This argument, the "dilemma of determinism", makes the tacit assumption that decision-making is either wholly determined or wholly random. However, if decision-making is complex, it can consist of a mixture of more deterministic and more random elements. A naturalistic theory of free will can therefore recommend itself as being able to refute the dilemma of determinism through mere compromise: a complex and heterogeneous decision-making process can be deterministic enough to be related to an individual's character, yet indeterministic enough to count as free, for realistic levels of freedom.

Note that such a compromise inevitably involves a modest or deflated view of both ingredients. Freedom of choice is not seen as an omnipotent ability to perform any action, but as an agent's ability to perform an action chosen from a subset of the possible actions it is capable of conceiving and performing. Decision-making is seen as rational enough to avoid mere caprice: following naturalistic assumptions, we do not see agents as ideal reasoners.

That's an interesting approach, though I don't currently see how it solves the dilemma. If the premises are that...

1) A random event cannot be sufficiently rational or connected to an individual's character

2) A determined decision cannot be free

...and if both of these effectively reduce (as I think they do) to...

1) We cannot control (choose between) random action selections

2) We cannot control (choose between) deterministic action selections

...I'm not sure how two things which we cannot control can combine into something which we can.

For example, I cannot significantly influence the weather, nor can I significantly influence the orbit of Saturn. There is no admixture of these two variables that I can influence any more than I can influence them individually. Likewise, if I cannot freely choose actions that are random or deterministic, I also cannot freely choose actions possessing some degree of both aspects.

TAG:

It is often assumed that indeterminism can only come into play as part of a complex process of decision-making when the deterministic element has reached an impasse, and indeterminism has the "casting vote" (like an internalised version of tossing a coin when you cannot make up your mind). This model, which we call the Buridan model, has the advantage that you have some level of commitment to both courses of action; neither is exactly against your wishes. It is, however, not so good for rationality and self-control. The indeterministic coin-toss can reasonably be seen as "the" crucial cause of your decision, yet it is not under your control.

In our model, by contrast, the indeterministic element is moved back in the decision-making process. A functional unit we call the "Random Idea Generator" proposes multiple ideas and courses of action, which are then pruned back by a more-or-less deterministic process called the "Sensible Idea Selector". (This arrangement is structurally modeled on random mutation and natural selection in Darwinian theory.) The output of the R.I.G. is "controlled" in the sense that the rest of the system does not have to act on its proposals. It can filter out anything too wild or irrational. Nonetheless, in a "rewinding history" scenario, the individual could have acted differently, as required by libertarian free will, because their R.I.G. could have come out with different proposals — and it would still be something they wanted to do, because it would not have been translated into action without the consent of the rest of the neural apparatus. (As naturalists, we take it that a "self" is the sum total of neural activity and not a ghost-in-the-machine.)
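A toy rendering of that two-stage architecture (all names, options, and utilities here are illustrative, not the commenter's): a stochastic proposer feeds a deterministic filter, so reruns can yield different actions while nothing is enacted without the "consent" of the selector.

    import random

    def random_idea_generator(options, k=3):
        """Stochastically propose k candidate actions (the indeterministic step)."""
        return random.sample(options, k)

    def sensible_idea_selector(candidates, utility):
        """Deterministically keep the best acceptable proposal (the filtering step)."""
        acceptable = [c for c in candidates if utility(c) > 0]  # veto wild ideas
        return max(acceptable, key=utility) if acceptable else None

    options = ["write", "walk", "shout at strangers", "read", "nap"]
    utility = lambda c: {"write": 2, "walk": 1, "read": 2, "nap": 1}.get(c, -5)
    print(sensible_idea_selector(random_idea_generator(options), utility))

Rerunning the script can print different choices ("rewinding history" yields different proposals), but "shout at strangers" is never enacted.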


jp:

Why is this comment bold?

My guess is that it was an accident. Fixed it for the author.