Reframing Impact

TurnTrout

Technical Appendix: First safeguard?

This sequence is written to be broadly accessible, although perhaps its focus on capable AI systems assumes familiarity with basic arguments for the importance of AI alignment. The technical appendices are an exception, targeting the technically inclined.

Why do I claim that an impact measure would be "the first proposed safeguard which maybe actually stops a powerful agent with an imperfect objective from ruining things – without assuming anything about the objective"?

The safeguard proposal shouldn't have to say "and here we solve this opaque, hard problem, and then it works". If we have the impact measure, we have the math, and then we have the code.

So what about:

Quantilizers? This seems to be the most plausible alternative; mild optimization and impact measurement share many properties. But
- What happens if the agent is already powerful? A greater proportion of plans could be catastrophic, since the agent is in a better position to cause them.
- Where does the base distribution come from (opaque, hard problem?), and how do we know it's safe to sample from?
  - In the linked paper, Jessica Taylor suggests the idea of learning a human distribution over actions – how robustly would we need to learn this distribution? How numerous are catastrophic plans, and what is a catastrophe, defined without reference to our values in particular? (That definition requires understanding impact!)
Value learning? But
- We only want this if our (human) values are learned!
  - Value learning is impossible without assumptions, and getting good enough assumptions could be really hard. If we don't know if we can get value learning / reward specification right, we'd like safeguards which don't fail because value learning goes wrong. The point of a safeguard is that it can catch you if the main thing falls through; if the safeguard fails because the main thing does, that's pointless.
Corrigibility? At present, I'm excited about this property because I suspect it has a simple core principle. But
- Even if the system is responsive to correction (and non-manipulative, and whatever other properties we associate with corrigibility), what if we become unable to correct it as a result of early actions (if the agent "moves too quickly", so to speak)?
  - Paul Christiano's take on corrigibility is much broader and an exception to this critique.
- What is the core principle?
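
To make the quantilizer questions above concrete, here is a minimal sketch of a q-quantilizer over a finite action set. It is an illustration under stated assumptions, not the construction from the linked paper: the function names `top_q_weights` and `quantilize`, the finite list of actions, and the explicit `base_probs` list are all hypothetical choices made for this example.

```python
import random

def top_q_weights(actions, base_probs, utility, q):
    """Return (action, weight) pairs covering the top-q base-probability mass by utility.

    Assumes a finite action list and a base distribution given as matching probabilities.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    # Rank actions from highest to lowest utility.
    ranked = sorted(zip(actions, base_probs), key=lambda ap: utility(ap[0]), reverse=True)
    kept, mass = [], 0.0
    for action, p in ranked:
        if mass >= q:
            break
        take = min(p, q - mass)  # split the boundary action so kept mass never exceeds q
        kept.append((action, take))
        mass += take
    return kept

def quantilize(actions, base_probs, utility, q, rng=random):
    """Sample one action from the base distribution restricted to its top-q slice."""
    kept = top_q_weights(actions, base_probs, utility, q)
    choices, weights = zip(*kept)
    return rng.choices(choices, weights=weights, k=1)[0]
```

With four equally likely actions and q = 0.5, only the two highest-utility actions can be sampled. Both concerns raised above live in the inputs: `base_probs` has to come from somewhere, and restricting to the top q slice can raise the probability of any catastrophic action in the base distribution by up to a factor of 1/q.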

Notes

The three sections of this sequence will respectively answer three questions:
- Why do we think some things are big deals?
- Why are capable goal-directed AIs incentivized to catastrophically affect us by default?
- How might we build agents without these incentives?
The first part of this sequence focuses on foundational concepts crucial for understanding the deeper nature of impact. We will not yet be discussing what to implement.
I strongly encourage completing the exercises. At times you shall be given a time limit; it’s important to learn not only to reason correctly, but with speed.

The best way to use this book is NOT to simply read it or study it, but to read a question and STOP. Even close the book. Even put it away and THINK about the question. Only after you have formed a reasoned opinion should you read the solution. Why torture yourself thinking? Why jog? Why do push-ups?

If you are given a hammer with which to drive nails at the age of three you may think to yourself, "OK, nice." But if you are given a hard rock with which to drive nails at the age of three, and at the age of four you are given a hammer, you think to yourself, "What a marvellous invention!" You see, you can't really appreciate the solution until you first appreciate the problem.

~ Thinking Physics

My paperclip-Balrog illustration is metaphorical: a good impact measure would hold steadfast against the daunting challenge of formally asking for the right thing from a powerful agent. The illustration does not represent an internal conflict within that agent. As water flows downhill, an impact-penalizing Frank prefers low-impact plans.
- The drawing is based on gonzalokenny's amazing work.
Some of you may have a different conception of impact; I ask that you grasp the thing that I’m pointing to. In doing so, you might come to see your mental algorithm is the same. Ask not “is this what I initially had in mind?”, but rather “does this make sense as a thing-to-call-'impact'?”.
H/T Rohin Shah for suggesting the three key properties. Alison Bowden contributed several small drawings and enormous help with earlier drafts.

Promoted to curated: I really liked this sequence. I think in many ways it has helped me think about AI Alignment from a new perspective. I really like the illustrations, the way it was written, and how it helped me think actively about the problems along the way, instead of just passively telling me solutions.

Now that the sequence is complete, it seemed like a good time to curate the first post in the sequence.

I enjoyed the post and in particular really liked the illustrated format. Definitely planning to read the rest!

I'm now wishing more technical blog posts were illustrated like this...

Checking that you've read the Embedded Agency sequence?

Yup, I have (and the untrollable mathematician one). I dashed off that comment but really meant something like, "I hope this trend takes off."

One misgiving I have about the illustrated format is that it's less accessible than text. I hope the authors of work in this format keep the needs of a wide variety of readers in mind.

Accessible in what way? I’m planning to put up a full text version at the end.

EDIT: I haven't done this yet, unfortunately. I still want to do it.

If the question about accessibility hasn't been resolved, I think Ramana Kumar was talking about making the text readable for people with visual impairments.

I'm nominating the entire sequence because it's brought a lot of conceptual clarity to the notion of "impact", and has allowed me to be much more precise in things I say about "impact".

This post (or sequence of posts) not only gave me a better handle on impact and what that means for agents, but it also is a concrete example of de-confusion work. The execution of the explanations gives an "obvious in hindsight" feeling, with "5-minute timer"-like questions which pushed me to actually try and solve the open question of an impact measure. It's even inspired me to apply this approach to other topics in my life that had previously confused me; it gave me the tools and a model to follow.

And, the illustrations are pretty fun and engaging, too.

(I was briefly confused by the "Think about what Frank brings us for each distance" "slide" because it doesn't include the pinkest marble: I saw the second-pinkest marble (on the largest dotted circle) thinking that it was meant to be the pinkest (because it's rightmost on the "Search radius" legend) and was like, "Wait, why is the pinkest marble closer than the terrorist in this slide when it was farther away in the previous slide?")

Yeah, the Maximum Pink marble has a sheen on it, but outside of that admittedly obscure cue... there's only so many gradations of pink you can tell apart at once.

Here are prediction questions for the predictions that TurnTrout himself provided in the concluding post of the Reframing Impact sequence.

I think this post is broadly making two claims -

Impactful things fundamentally feel different.
A good Impact Measure should be designed in a way that it strongly safeguards against almost any imperfect objective.

It is also (maybe implicitly) claiming that the three properties mentioned completely specify a good impact measure.

I am looking forward to reading the rest of the sequence with arguments supporting these claims.

It is also (maybe implicitly) claiming that the three properties mentioned completely specify a good impact measure.

I don't know that I'd claim that these completely specify a good impact measure, but I'd imagine most impact measures satisfying these properties are good (i.e. natural curves fit to those three points end up pretty good, I think).

I propose to measure impact by counting bits of optimization power, as in my Oracle question contest submission. Find some distribution over plans we might use if we didn't have an AI, such as stock market trading policies. Have the AI output a program that outputs plans according to some distribution. Measure impact by computing a divergence between the two distributions, such as the maximum pointwise quotient - if no plan becomes more than twice as likely, that's no more than one bit of optimization power. Note that the AI is incentivized to prove its output's impact bound to some dumb proof checker. If the AI cuts away the unprofitable half of policies, that is more than enough to get stupid rich.
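
The divergence described in this comment can be sketched in a few lines. This is only an illustration, assuming a finite set of plans with both distributions given as dictionaries from plan to probability; the name `optimization_bits` is hypothetical and does not come from the comment or the contest submission.

```python
import math

def optimization_bits(base, new):
    """Return log2 of the largest pointwise ratio new[plan] / base[plan].

    `base` and `new` map each plan to its probability (assumed finite distributions).
    """
    worst = 0.0
    for plan, p_new in new.items():
        if p_new == 0:
            continue  # plans the new distribution never outputs cannot raise the ratio
        p_base = base.get(plan, 0.0)
        if p_base == 0:
            return math.inf  # a plan that was impossible before counts as unbounded
        worst = max(worst, p_new / p_base)
    if worst == 0:
        raise ValueError("new distribution assigns no probability to any plan")
    return math.log2(worst)
```

For a uniform base distribution over four plans, a new distribution that drops two of them and splits its mass evenly over the other two doubles each surviving plan's probability, which this function scores as exactly one bit. That matches the "cuts away the unprofitable half" case in the comment.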

Beautifully illustrated.

Predictions recorded on each question (forecaster: probability):

AUP_conceptual prevents catastrophe, assuming the catastrophic convergence conjecture.
- niplav: 67%
- TurnTrout: 85%

Some version of Attainable Utility Preservation solves side effect problems for an extremely wide class of real-world tasks and for subhuman agents.
- TurnTrout: 65%
- niplav: 72%

For the superhuman case, penalizing the agent for increasing its own Attainable Utility (AU) is better than penalizing the agent for increasing other AUs.
- niplav: 55%
- TurnTrout: 65%

Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world.
- TurnTrout: 75%
- niplav: 90%
- ryan_greenblatt: 95%

Attainable Utility theory describes how people feel impacted.
- Raemon: 75%
- ryan_greenblatt: 80%
- niplav: 80%
- TurnTrout: 95%

The catastrophic convergence conjecture is true. That is, unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.
- Bird Concept: 59%
- ryan_greenblatt: 65%
- ejacob: 65%
- TurnTrout: 70%
- niplav: 75%

There exists a simple closed-form solution to catastrophe avoidance (in the outer alignment sense).
- sophia_xu: 9%
- ryan_greenblatt: 15%
- TurnTrout: 25%
- niplav: 51%
