Davidad, in a comment on this post: "Thanks for bringing all of this together - I think this paints a fine picture of my current best hope for deontic sufficiency. If we can do better than that, great!"
Update: For a better exposition of the same core idea, see Agent membranes and formalizing “safety”.
Here's one specific way that «boundaries» could directly apply to AI safety.
For context on what "«boundaries»" are, see «Boundaries» and AI safety compilation (also see the tag for this concept).
In this post, I will focus on the way Davidad conceives of using «boundaries» within his Open Agency Architecture (OAA) safety paradigm. Essentially, Davidad's hope is that «boundaries» can be used to formalize a sort of MVP morality for the first AI systems.
Update: Davidad left a comment endorsing this post, and later mentioned it in a Twitter reply.[1]
Why «boundaries»?
So, in an ideal future, we would get CEV alignment in the first AGI.
However, this seems really hard. It might be easier to first get AI x-risk off the table (thus ending the "acute risk period"), and then figure out how to do the rest of alignment later.[2]
In that case, we don't actually need the first AGI to understand all of human values/ethics; we only need it to understand a minimum subset that ensures safety.
But which subset? And how could it be formalized in a consistent manner?
This is where the concept of «boundaries» comes in, because the concept has two nice properties:
The hope, then, is that the «boundaries» concept could be formalized into a sort of MVP morality that could be used in the first AI system(s).
Concretely, one way Davidad envisions implementing «boundaries» is by tasking an AI system with minimizing the occurrence of ~objective «boundary» violations for its citizens.
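To make that concrete, here is one minimal way such an objective could be written down. This is only an illustrative sketch in my own notation, not Davidad's actual formalization: $\pi$ is the AI system's policy, $s_t$ is the world state at time $t$, $T$ is a planning horizon, and $V(i, s_t)$ is an assumed ~objective predicate for "citizen $i$'s «boundary» is violated in state $s_t$":

$$\pi^{*} \in \arg\min_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T}\,\sum_{i \in \text{Citizens}} \mathbf{1}\{V(i, s_t)\}\right]$$

The point of such a formulation is that it references nothing about human preferences or values beyond the violation predicate itself, which is what would make it a candidate "minimum subset" of ethics in the sense above.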
That said, I disagree with such an implementation and I will propose an alternative in another post.
Also related: Acausal normalcy
Quotes from Davidad that support this view
(All bolding below is mine.)
Davidad tweeted in 2022 Aug:
Next tweet:
Later in the thread:
Davidad in AI Neorealism: a threat model & success criterion for existential safety (2022 Dec):
Davidad in An Open Agency Architecture for Safe Transformative AI (2022 Dec):
Also see this tweet from Davidad in 2023 Feb:
Further explanation of the OAA's Deontic Sufficiency Hypothesis in Davidad's Bold Plan for Alignment: An In-Depth Explanation (2023 Apr) by Charbel-Raphaël and Gabin:
Also:
(The post also explains that the "(*)" prefix means "Important", as distinct from "not essential".)
This comment by Davidad (2023 Jan):
From Reframing inner alignment by Davidad (2022 Dec):
From A list of core AI safety problems and how I hope to solve them (2023 Aug):
[1] FWIW, he left this comment before I simplified this post a lot on 2023 Sept 15.
[2] P.S.: Davidad explains this directly in A list of core AI safety problems and how I hope to solve them.