Abstract 

The aim of this text is to give an overview of Davidad’s safety plan, while also outlining some of the limitations and challenges. Additionally, I’ll explain why I would like to contribute to this field. 
 

Disclaimers

I am not Davidad; I have tried to convey his ideas as I understand them. While my interpretation may not be exact, I hope it still holds value. Also, this post does not focus on the technical details; I might write another one later with a deeper technical discussion[1].

Epistemic status

I began exploring these questions during a weekend-long research sprint. Then, for a couple of months, I kept thinking about them and reading related posts (Davidad’s List of AI Safety Problems). Additionally, I engaged in discussions with Charbel, who co-authored this post on Davidad's plan. While empirical evidence may be limited at this stage, I anticipate more to emerge as efforts progress[2]. I'm currently learning potentially relevant mathematical theory. However, the aim here is to offer a broad explanation of key ideas rather than an in-depth analysis of the technical problems.


Acknowledgements

Thanks to Charbel-Raphaël Segerie for discussing these questions with me. Thanks to Jeanne S. for useful suggestions.
 

 

Outline


0/ Definitions

I/ The plan, and why it might yield positive results

II/ Technical questions, risks, objections

III/ Why I'm interested in contributing
 

0/ Definitions

In what follows, I make a distinction between “Artificial General Intelligence” and “Artificial Super Intelligence”, the latter having strongly superhuman performance levels, while the former can be compared to a human being (if we ignore the speed and duplicability factors).

I shall use “AI Safety” in a loose way[3], essentially relying on the “notkilleveryoneism” idea: the point is to limit the overall amount of harm[4] caused by AIs (and humans), while keeping some of the benefits of powerful AIs.

A “formalized world-model” is one inside which formal proofs (accommodating relevant forms of nondeterminism) can be derived, typically regarding the consequences of some actions. A key underlying assumption, of course, is that the model is close enough to the real world, and that it contains descriptions of the phenomena that are relevant to AI safety[5].

By “meaningful formal guarantees”, I mean safety-relevant properties (of the AI, or of the system comprising the AI and its deployment environment, or variations thereof) derived within a formalized framework, as opposed to guarantees based on empirical testing alone. The point is that one can then trust these guarantees roughly as much as one trusts the formalized world-model.
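To give a rough idea of the shape such a guarantee could take, here is a minimal sketch (the notation is mine, not Davidad’s): write M for the formalized world-model, C for the set of deployment contexts deemed relevant, π for the AI’s policy, φ for a formalized acceptability predicate on trajectories, and ε for a small risk bound. The guarantee would then be a formal proof, carried out inside M, of a statement like

\[
\forall c \in C, \quad \Pr_{\tau \sim M(\pi, c)}\big[\varphi(\tau)\big] \;\ge\; 1 - \varepsilon,
\]

that is: in every relevant context, the distribution over trajectories that M predicts for π assigns probability at least 1 − ε to acceptable outcomes. Whatever trust we place in such a statement is inherited from the trust we place in M, C, φ and ε.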

When I mention “strategy”, “strategic thinking”, etc., I refer to the art of formulating relevant intermediate steps in order to achieve a given terminal goal, and of amending these whenever the context changes. I am not using the term in the standard deterministic game-theoretic sense[6]. The main distinction I make is between “strategic questions” and “technical questions”.

When I say "promoting AI Safety", I mean "gathering resources (funds, compute, talent), getting relevant regulation passed and enforced, making the industry use safer solutions, etc." 


 

I/ Overview of the plan

 

Context

In my mind, the main underlying assumptions of the plan are the following:

  • Scaling existing training methods (e.g. RL) is too risky when dealing with ASIs.
  • All else being equal, an AI system with meaningful formal guarantees is safer. 
  • It is possible to construct powerful (AGI, possibly ASI-level) AI systems with meaningful formal guarantees.
  • It is possible to find a global political agreement on a set of formal desiderata sufficient for “notkilleveryoneism”.
  • It is possible to promote AI Safety by successfully carrying out projects that are both safety-focused and profitable.


Goals

It's crucial to distinguish between the ambitious long-term vision of the plan and the more achievable intermediate steps. The ultimate objective of the plan might be summarized as "mitigating the risks posed by advanced AIs to a level where we can then focus on further advancing alignment and AI Safety in general." For example, if the aim is indeed to address the problem of "notkilleveryoneism," the assumptions underpinning the third and fourth steps may become less tenable.

Conversely, the incremental goals involve establishing an alternative AI development framework, demonstrating the compatibility of safety measures with economic interests, and formulating a positive objective that encourages cooperation among stakeholders[7]. This is supposed to help promote AI Safety.

 

Means

Now that I have outlined the objectives, let's explore potential pathways to achieve them. Here are the primary technical insights I've identified (more detail is available in Davidad's post).

  • Instead of attempting to dissect or reverse-engineer the inner workings of the AI system being developed, our focus would shift towards evaluating, within a formalized world-model of its deployment environment, whether the AI's behavior is sufficiently likely to be acceptable. 
  • This evaluation wouldn't solely rely on empirical observations within the world-model, but rather on formal proofs demonstrating that, across all[8] pertinent deployment scenarios, the outcomes of the AI's actions are acceptable with extremely high probability.
  • An implication of this “formal proofs framework” is that the training process rewards the verifiability of the AI's actions (a toy sketch of such a verification gate is given after this list). Consequently, barring glitches[9], the AI constructed during training becomes inherently easier to verify.
  • This approach is anticipated to mitigate the risk of "Out Of Distribution" scenarios, or at the very least, provide us with a more systematic understanding of them: OOD issues arise when the world-model fails to accurately represent something, or when our compilation of "relevant contexts" proves incomplete.
  • The process should be automated as much as possible, leveraging finely-tuned human-level AIs as assistants. These assistants can aid in both world-modeling tasks and formalizing desiderata. For proof searching, stronger AIs can be employed, tasked solely with generating formal proofs[10].
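To make the “reward verifiability” point slightly more concrete, here is a toy sketch of such a verification gate (entirely my own, and far simpler than anything the actual plan would require): a candidate policy is only rewarded, or considered for deployment, if the probabilistic acceptability bound can actually be certified inside the world-model. In this toy version the world-model is a small finite table, so the “proof” degenerates into an exhaustive check; the real setting would need genuine proof search over a much richer, nondeterministic model.

```python
# Toy verification gate: reward/deploy a policy only if the acceptability
# bound can be certified in the (here, tiny and finite) world-model.
# All names and numbers below are made up for illustration.
from typing import Dict, List, Tuple

# The "world-model": for each relevant context, the predicted distribution
# over outcomes of running the candidate policy, as (outcome, probability) pairs.
WorldModel = Dict[str, List[Tuple[str, float]]]

def verify(world_model: WorldModel, acceptable: set, epsilon: float) -> bool:
    """Check that, in every modelled context, the probability of an acceptable
    outcome is at least 1 - epsilon. The model is finite, so exhaustive
    checking stands in for actual proof search."""
    for outcome_dist in world_model.values():
        p_acceptable = sum(p for outcome, p in outcome_dist if outcome in acceptable)
        if p_acceptable < 1.0 - epsilon:
            return False
    return True

# A made-up model of one policy's predicted effects in two contexts.
toy_model: WorldModel = {
    "context_A": [("safe_success", 0.98), ("benign_failure", 0.015), ("harm", 0.005)],
    "context_B": [("safe_success", 0.999), ("harm", 0.001)],
}

if verify(toy_model, acceptable={"safe_success", "benign_failure"}, epsilon=0.01):
    print("Certificate obtained: the policy may be rewarded / considered for deployment.")
else:
    print("No certificate: the policy is rejected, however well it seems to perform.")
```

In the real plan, of course, the world-model would be a huge probabilistic, multi-scale object, the acceptability predicate would come from negotiated desiderata, and the certificate would be a machine-checkable proof produced by the boxed proof-searching AIs mentioned above.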

 

Now, regarding strategy, these ideas seem relevant:

  • This plan can make use of classical academic work and talent, which can help speed things up[11].
  • Working proofs-of-concept/prototypes would have real-world applications, helping gain credibility and resources for subsequent steps.

     

Now, I'll outline several potential positive outcomes even in the event of failure:

  • A world-model that falls short of meeting all our criteria could still be valuable for evaluation, red-teaming, meta-level advising, or other purposes.
  • More broadly, research conducted within this plan may offer insights that prove beneficial for other endeavors as well[12].
  • In the event of failure, it's more probable that we simply haven't developed anything powerful, rather than having unleashed an unsafe AGI/ASI.

     

II/ Technical Questions, Risks, Objections

 

Problems to solve

Obviously, there are many technical problems, all at least very difficult[13], that must be solved to have any chance of success, especially for the maximal goal. The following questions seem most relevant[14]:
 

  • How to create a framework that allows both accurate, efficient[15] modeling, and formal non-deterministic proofs? How much advanced theory is needed?
  • How to construct a good enough world-model? The process likely involves both leveraging *some* of the existing scientific literature, and performing new experiments. Also, can static analysis[16] of the trained AI suffice, given that the world-model might diverge from reality?
  • How to actually train the AIs required to carry out the plan?
     

How it could fail

Even if the technical requirements are fulfilled, there are several potential pitfalls:

  • An unaligned actor, be it a human organization or a rogue AI, could gain access to the world-model and exploit it for their own agenda. Thus, what was intended as a tool for AI Safety might serve any objective[17].
  • Alternative, less safe architectures might demonstrate superior performance or ease of development, leading major actors to favor them over a formal proofs paradigm[18]. Of course, the ultimate goal is not solely to build a safe AI, but also to prevent unsafe AIs and humans from causing significant harm.
  • Powerful AIs might take actions that fundamentally alter the world state, rendering the formal world-model obsolete.
  • There may be political resistance to allowing the verified AI to execute its tasks, even if prior agreements were in place, particularly if those tasks seem counter-intuitive[19].



The bigger picture

Now, as a counterpoint to my own arguments: even if my understanding of this plan is accurate and its underlying assumptions are valid, I would still need to evaluate it against other plans and determine the most effective approach. It's important to note that the optimal plan, or even just viable ones, might involve investing resources in a combination of proposals[20]. Discussing these kinds of strategic questions is worth a dedicated post.
 

III/ Why I'm interested in contributing

The perspective of an individual with limited yet potentially valuable resources does not equate to that of humanity as a whole, assuming such a notion is coherent. There may exist superior overall plans to which I might not be able to contribute as significantly. For instance, my strengths lie more in mathematical research than in policy-making[21].

Furthermore, I believe that having even a vague sense of the end goal, coupled with at least a glimmer of hope that my work could be impactful, significantly enhances the quality of my research efforts. While I'm hesitant to argue that blindly optimistic researchers outperform those who are fully aware (akin to the different skill sets of a captain versus a general), I also refrain from claiming the opposite[22].

Regarding other math-oriented fields within AI Safety research, such as Singular Learning Theory or Agent Foundations: either I'm not as persuaded by their end goals (in the case of SLT), or I perceive my potential contributions as less influential (in AF and SLT). This isn't to suggest that these areas should be disregarded[23].

These are the primary reasons why I am currently set on trying to contribute to Davidad’s programme, one way or another.

 

In case other people are interested: there are several areas, requiring various profiles, where much work is yet to be done. A tentative list:

  • For world-modelling, purely mathematical work is needed, to develop a suitable theory.
  • Also for world-modelling, developing version-control systems, and computationally efficient implementation, are key.
  • Developing the proof system(s), formal verifiers, etc., requires people fluent with formal methods.
  • There is a substantial amount of Machine Learning work to be done as well, hence good ML practitioners are needed!
  • Formalizing desiderata is more about social-choice theory, collective decision-making questions. People with expertise in these domains might be interested.
  • The first phases of the plan will involve finding applications of these “formal verification” ideas that aren't “solving notkilleveryoneism”. Since the range of such potential applications is quite large, domain experts from various fields may join.
  • Lastly, given that this plan draws upon numerous diverse fields, effective communication of ideas is paramount. However, conveying complex technical concepts in an accessible manner is not always straightforward. Individuals adept at producing high-quality pedagogical content on technical subjects can be immensely valuable in this regard.

See also page 6 of the call for proposals, and page 11 of the programme thesis, from which most of this was extracted.

  1. ^

    If you're interested in a short and more technical summary, I can recommend this comment.

  2. ^

    One important first step is to assess the feasibility of the maximal version of this plan.

  3. ^

    If you really need precise definitions, feel free to plug in your own, and see if the text then makes sense.

  4. ^

    Yes, from this point of view, you need to define precisely enough what you mean by harm and benefits. Or rather, the outcome you get depends on your specifications.

  5. ^

    Yes, that's a lot.

  6. ^

    Key features are imperfect information, bounded computing power, randomness; see for instance Clausewitz's notion of friction. As far as I know, these constraints imply that strategy is more about finding and applying good heuristics, rather than solving puzzles. Feel free to provide counterarguments in the comments.

  7. ^

    One might even view fostering cooperation as the main goal of the plan. 

  8. ^

    Yes, we would really like to cover all of them.

  9. ^

    One could argue that the AI being trained and verified might find bugs in the world-model, and exploit them. This possibility might be worth investigating.

  10. ^

    The idea here is that these proof-searching AIs can be boxed and are not a threat. They would only be trained to "play the formal proof-searching game".

  11. ^

    As far as I know, training good researchers takes time and resources, and the AI Safety field is short on senior people.

  12. ^

    For instance, work on nondeterministic multi-scale modelling, or development of processes that help express human desiderata, might be used elsewhere.

  13. ^

    Knowing precisely how difficult they are is itself quite a challenge. The first steps involve getting an approximate answer.

  14. ^

    Some tentative answers already exist, but I won't say much about them here. See for instance ARIA's call for proposals, pages 8-10.

  15. ^

    It might be worth recalling that, if your model, or your verification process, takes too long, then it is useless.

  16. ^

    Static analysis would mean checking the AI once, with a given version of the world-model, then deploying it. Note that, if you have a functioning shutdown process, you are still able to iterate.

  17. ^

    Dual-use issues apply to these tools as well, not only to AIs.

  18. ^

    I am afraid I do not know enough about economics and politics to quantify this issue. The game-theoretic analysis (Appendix B) aims at providing an answer.

  19. ^

    Even if it respects the desiderata, a drastic action, formally proved rational, might or might not be accepted by all stakeholders. 

  20. ^

    Consider a basic scenario: suppose there are two proposals, only one of which can be effective in reality. Assuming that the resources required to discern their efficacy don't hinder future success, you could optimize your chances by initially dividing your resources between the two and then directing your focus accordingly.

    However, this isn't the only conceivable scenario. Alternatively, both plans might have potential, but necessitate almost all available resources to yield results. In such a case, concentrating on a single plan would be more prudent.

    The question then arises: which scenario reflects our current reality?

  21. ^

    And yet, policy-making might be more effective than mathematical research in this case.

  22. ^

    Feel free to give relevant arguments in the comments, if you thought about these questions.

  23. ^

    At least, additional arguments would be required.
