For the record, as this post mostly consists of quotes from me, I can hardly fail to endorse it.
If we get TAI in the next decade or so, it will almost certainly contain an LLM, at least as a component. Human values are complex and fragile, and we spend a huge amount of our time writing about them: roughly half the Dewey Decimal system consists of many different subfields of "How to Make Humans Happy 101", including virtually all of the soft sciences (Anthropology, Medicine, Ergonomics, Economics…), arts, and crafts. Current LLMs have read tens of trillions of tokens of our content, including terabytes of this material, and as a result even GPT-4 (definitely less than TAI) can do a pretty good job of answering moral questions and commenting on possible undesirable side effects and downsides of plans. So if we have sufficient control of our TAI to ensure that it is extremely unlikely to kill us all, then presumably we can also tell it "also don't do anything that your LLM says is a bad idea or we wouldn't like, at least not without checking carefully with us first", and get a passable take on human values and impact regularization as well. In other words, if we have enough control to block your red arrow, we can also take at least a passable first cut at the green arrow. That by itself probably isn't enough to stand up to many bits of optimization pressure without Goodharting, but it is a lot better than ignoring the green arrow entirely. Also, any TAI that can do STEM can understand and avoid Goodharting.
I agree that just not killing everyone is a much easier problem. Consider zoos: the manual for "How Not to Kill Everything in Your Care: The Orangutan Edition" is probably a few hundred pages at most, and overlaps significantly with the corresponding editions for all of the other primates, including Homo sapiens. However, LLMs can handle datasets vastly larger than that, so this compactness only matters if you're trying to add some sort of mathematical or software framework on top that can handle a few hundred pages of data, but not terabytes.
The more recent Safeguarded AI document has some parts that seem to go against my earlier interpretation, which was along the lines of this post.
Namely, that Davidad's proposal was not "CEV-style full alignment for an AI that can be safely scaled without limit" but rather "sufficient control of an AI that is only slightly more powerful than necessary for ethical global non-proliferation".
In other words:
The Safeguarded AI document says this, though:
and that this milestone could be achieved, thereby making it safe to unleash the full potential of superhuman AI agents, within a time frame that is short enough (<15 years) [bold mine]
and
and with enough economic dividends along the way (>5% of unconstrained AI’s potential value) [bold mine][1]
I'm probably missing something, but that seems to imply a claim that the control approach would be resilient against arbitrarily powerful misaligned AI?
A related thing I'm confused about is the part that says:
one eventual application of these safety-critical assemblages is defending humanity against potential future rogue AIs [bold mine]
Whereas I previously thought the point of the proposal was to create an AI powerful enough and controlled enough to ethically establish global non-proliferation (so that "potential future rogue AIs" wouldn't exist in the first place), it now seems to go in the direction of Good(-enough) AI defending against potential Bad AI?
The "unconstrained AI" in this sentence seems to be about how much value would be achieved from adoption of the safe/constrained design versus the counterfactual value of mainstream/unconstrained AI. My mistake.
The "constrained" still seems to refer to whether there's a "box" around the AI, with all output funneled through formal verification checks on their predicted consequences. It does not seem to refer to a constraint on the "power level" ("boundedness") of the AI within the box.
It could be the case that these two goals are separable and independent: this is what Davidad calls the Deontic Sufficiency Hypothesis.
If the hypothesis is true, it should be possible to de-pessimize and mitigate the urgent risk from AI without necessarily ensuring that AI creates actively positive outcomes, because for safety it is only necessary to ensure that actively harmful outcomes do not occur. Hopefully this is easier than achieving "full alignment".
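One informal way to put the distinction in symbols (my own sketch, not notation from the OAA documents): safety is a constraint that harmful outcomes are (almost) never produced, whereas full alignment is an optimization target over human values:

$$
\underbrace{\Pr[\text{catastrophe} \mid \pi] \le \varepsilon}_{\text{safety / de-pessimization: a bound to satisfy}}
\qquad \text{vs.} \qquad
\underbrace{\pi \in \arg\max_{\pi'} \; \mathbb{E}\big[U_{\text{human}}(\pi')\big]}_{\text{full alignment: an objective to optimize}}
$$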
Safety first! We can figure out the rest later.
Quotes from Davidad's Open Agency Architecture plans
This is Davidad’s plan with the Open Agency Architecture (OAA).
A list of core AI safety problems and how I hope to solve them (2023 August)
Davidad's Bold Plan for Alignment: An In-Depth Explanation (2023 April)
An Open Agency Architecture for Safe Transformative AI (2022 December)
AI Neorealism: a threat model & success criterion for existential safety (2022 December)
How to formalize safety?
If the Deontic Sufficiency Hypothesis is true, there should be an independent/separable way to formalize what "safety" is. This is why I think boundaries/membranes could be helpful for AI safety: see Agent membranes and formalizing "safety".
Thanks to Jonathan Ng for reviewing a draft of this post and to Alexander Gietelink Oldenziel for encouraging me to post it.
Note that Davidad has not reviewed or verified this post.