b_sen — LessWrong

Let's See You Write That Corrigibility Tag

[Epistemic status: Unpolished conceptual exploration, possibly of concepts that are extremely obvious and/or have already been discussed. Abandoning concerns about obviousness, previous discussion, polish, fitting the list-of-principles frame, etc. in favor of saying anything at all.] [ETA: Written in about half an hour, with some distraction and wording struggles.]

What is the hypothetical ideal of a corrigible AI? Setting aside whether it can be implemented in practice or is even tractable to design, what would serve as a theoretical reference to compare proposals against?

I propose that the hypothetical ideal is not an AI that lets the programmer shut it down, but an AI that wants to be corrected - one that will allow a programmer to work on it while it is live and aid the programmer by honestly explaining the results of any changes. It is entirely plausible that this is not achievable by currently known techniques, because we don't know how to do "caring about a world-state rather than a sensory input / reward signal", let alone "actually wanting to fulfill human values but being uncertain about those values", but this still seems to me the ideal.

Suppose such an AI is asked to place a strawberry on the bottom plate of a stack of plates. It would rather set the rest of the stack aside non-destructively than smash them, because it is uncertain about what humans would prefer to be done with those plates and leaving them intact allows more future options. It would rather take the strawberry from a nearby bowl than create a new strawberry plantation, because it is uncertain about what humans would prefer to be done with the resources that would be directed towards a new strawberry plantation. Likewise, it would rather not run off and develop nanofabrication. It would rather make a decent attempt and then ask the human for feedback than turn Earth into computronium to verify the placement of the strawberry, because again, uncertainty over ideal use of resources. It would rather not deceive the human asker or the programmer, because deceiving humans reduces the expected value of future corrections. This seems to me to be what is desired of considerations like "low impact", "myopia", "task uncertainty", "satisficing" ...
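One way to make the plate example concrete is a toy expected-utility comparison under uncertainty about what the human wants. This is only an illustrative sketch, not a formalism proposed in the comment: the hypothesis names, credences, utilities, and the helper functions `expected_utility` and `choose_action` are all made-up assumptions.

```python
# Toy illustration only: every name and number here is a made-up assumption.

# Hypothetical credences over what the human wants done with the other plates.
credence = {"wants_plates_intact": 0.5, "indifferent_to_plates": 0.5}

# Hypothetical utility of each action under each hypothesis.
# Smashing is assumed slightly faster, so it scores a little higher
# in the world where the human does not care about the plates.
utility = {
    "set_plates_aside": {"wants_plates_intact": 1.0, "indifferent_to_plates": 0.9},
    "smash_plates": {"wants_plates_intact": -10.0, "indifferent_to_plates": 1.0},
}


def expected_utility(action, credence, utility):
    """Average the action's utility over the hypotheses, weighted by credence."""
    return sum(p * utility[action][h] for h, p in credence.items())


def choose_action(credence, utility):
    """Pick the action with the highest expected utility."""
    return max(utility, key=lambda a: expected_utility(a, credence, utility))


if __name__ == "__main__":
    for action in utility:
        print(action, expected_utility(action, credence, utility))
    print("chosen:", choose_action(credence, utility))
```

With these assumed numbers the non-destructive action is chosen, and it stays the choice even if the credence that the human is indifferent is raised to 0.95, because the irreversible option carries a large downside in the world where the plates matter.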

The list of principles should flow from considering the ideal and obstacles to getting there, along with security-mindset considerations. Just because you believe your AI is safe given unfettered Internet access doesn't mean you should give it unfettered Internet access - but if you don't believe your AI is safe given unfettered Internet access, this is a red flag that it is "working against you" on some level.

Lesswrong 2016 Survey

b_sen10y470

I took the survey!

Harry Potter and the Methods of Rationality discussion thread, March 2015, chapter 119

b_sen11y00

By "fanfic" do you mean: 1) the written content itself, 2) the model of the fictional universe (including characters) generated in the process of writing that content, 3) the model of the fictional universe (including characters) generated in the process of reading that content, or 4) something else?

The written content is unlikely to bootstrap itself on account of not being code, but its effects on the minds that read it are less clear.

In all seriousness, though, if I had a fully mapped out path for bootstrapping well beyond human intelligence I would have better uses for it than writing recursive Harry Potter fanfiction. I was thinking more along the lines of Harry fixing his dark side and interpreting the Unbreakable Vow.

Harry Potter and the Methods of Rationality discussion thread, March 2015, chapter 119

b_sen11y20

Adding to my previous prediction comment:

Predictions: (I'll have to score them all after the epilogue is released, but hey, it means we get an epilogue.)

The "phoenix’s egg" password will (directly or indirectly) allow Harry to find Narcissa Malfoy. 70%

The Line of Merlin feeds information to its rightful holder when they’re holding it. 60%

At least one Legilimency conversation occurred during Chapter 119. 90%

Speculations:

What happens if Harry casts the True Patronus through the Elder Wand? It’s a Deathly Hallow, which raises the priors of something interesting happening with a spell embodying a preference for life over death, but given the "sense of strength and constrained danger, like a leashed wolf" Harry feels when holding it, I’m not sure what the effect would be.

Harry Potter and the Methods of Rationality discussion thread, March 2015, chapter 119

b_sen11y90

I don't have a link offhand, but I recall EY stating his reasons for not boosting Hermione:

She doesn't need the boost to compete with the other characters, including Harry
If she were boosted, the story would be "Hermione Granger Discovers the Methods of Rationality and Becomes Omnipotent" (i.e. a thoroughly power-boosted Hermione would break the story)
A boosted Hermione would plausibly be smarter than EY

Harry Potter and the Methods of Rationality discussion thread, March 2015, chapter 119

b_sen11y00

The Author's Note mention of the delayed epilogue (combined with some of the foreshadowing in HPMOR) feels to me like an invitation to write the obvious continuation fic Harry Potter and the Methods of Self-Modification, set between the ending of HPMOR and the epilogue. Does anyone else find this the obvious continuation?

I'm also not sure if writing the fic would actually be a good idea; anyone want to help me evaluate it?

Harry Potter and the Methods of Rationality discussion thread, March 2015, chapter 118

b_sen11y00

I hope Voldemort's "fallback weapon" also had sunlight-resistant skin. Otherwise Hermione might have issues with going outside...

I also take it that Harry's refusal to give Quirrell's eulogy even before he knew Q = V is because of his views on death in general.

Speaking of the eulogy, is Harry cheering at the end? And does he have any way of protecting his Transfigurations against Finite Incantatem?

Harry Potter and the Methods of Rationality discussion thread, March 2015, chapter 117

b_sen11y00

Assuming it's her arm, which is plausible given that Harry noticed its thinness but isn't confirmed. In any case, I was mainly just raising the question.

Harry Potter and the Methods of Rationality discussion thread, March 2015, chapter 117

b_sen11y00

I wonder if Harry can help Draco by teaching him the True Patronus (possibly to have Draco resurrect Lucius). It would be a nice callback to their early scientific discoveries and Harry teaching Draco Patronus 1.0, although Harry might have to be very careful about how he does it.

I also notice that Lesath wasn't among the kids who'd lost parents, so they didn't find Bellatrix among the Death Eaters at the graveyard. Where is Bellatrix?

Harry Potter and the Methods of Rationality discussion thread, March 2015, chapter 116

b_sen11y10

Adding to my previous prediction comment:

Predictions:

Harry did at least one plot-relevant thing in the time we haven’t seen (between him time-turning back at the graveyard in Chapter 115 and returning to the Quidditch match for Chapter 116). 80%

Harry intentionally made his scar bleed in Chapter 116. 95% (Perhaps using Muggle special effects?)

Someone will see through Harry’s acting (in Chapter 116 at the Quidditch match), whether by deducing things themselves or being told some part of what really happened, by the end of the story. 90%

Draco will figure out that Harry was involved in Voldemort’s recent defeat. 70% (This does not specify how he figures it out, although Harry has left several big clues besides his bad acting. For example, the stone in his ring changing colour before Voldemort dies and him falling to his knees only when he hears the explosion rather than 20 seconds earlier.)

Some thought in-universe will be given to saving Lucius from death by Harry’s partial Transfiguration by the end of the story. 75%

The "luminous white quiver running over the holly" and "drifting motes of silver light like tiny specks of Patronus Charm" when Harry Obliviates Voldemort in Chapter 115 are not normal for Obliviation. 90%

The "luminous white quiver running over the holly" and "drifting motes of silver light like tiny specks of Patronus Charm" when Harry Obliviates Voldemort in Chapter 115 indicate something. 80%

Obliviated Voldemort will retain the memory of at least one instance of casting the starlight spell for Harry. 50%

Harry will insist that Hermione learn to protect her mind. 75%
