
The purpose of this post is to briefly discuss two categories of alignment evaluation and how I think AI safety evaluations should use them. I only consider responsible deployment in 20XX because I often find discussions about 2XXX distracting and unproductive. For the remainder of the post I will only talk about LLMs, since these could be the 'operating systems' of future AIs (e.g. HuggingGPT) and, in the near future, they have the highest potential for abuse.

As Connor Leahy and possibly others say, we should be optimistic for the sole reason that we get to build the AIs of the future. A more realistic version of this quote is: we should be optimistic about building our own AI, while acknowledging that others get to build their own AI. I make this adjustment to highlight that, without standardized safety practices, different groups will have potentially unsound or unknown safety practices.

Safety standardization in AI is challenging due to the lack of comprehensive safety tools, but I want to outline potential industry standards for my own edification. This will be a reminder of the safe practices that I will advocate for during my career.

Mandatory Alignment Evaluations!

Alignment evaluations should be mandatory for all models after GPT-4 and should be conducted by both third-party and in-house groups. It will not be enough to rely on AI companies graciously opening their doors to the Alignment Research Center (ARC). By sheer chance the OpenAI executives had the integrity to have GPT-4 tested for self-replication and resource acquisition. While the current openness is welcome, my concerns lie with the AI capital owners of the near future and the absence of concrete safety obligations, standards and regulations.

The case for oversight and regulation is difficult to make in the United States, but it's an important one. The crux of the argument for regulation is the classification of generated content as a commercial and industrial product. This definition supports regulating generated content on the same basis as any influential product massively distributed to the public sphere and private sectors (e.g. foodstuffs, fabric, children's toys, cybersecurity etc.). I plan on making a post about that in the future, and when I do it will be linked here.

Instead of justifying a legal basis for regulation, in this post I will talk about two classes of alignment evaluation. I think these two classes are broad enough to hold up in the future. However, like most things, techniques in practice will (and should) be a combination of many tricks.

Static Alignment

Static alignment is when evaluators unleash algorithms onto the weights of a model and look for specific structures. Since looking at weights does not require running the model, this is the safest type of alignment evaluation. Currently there are no tools capable of static evaluation, but hopefully they will emerge from a field like Mechanistic Interpretability.

In its best form this type of analysis will look at weights through some lens and identify inner structures that can be manipulated to massively change or control model behavior. If this is possible then it is also possible to study the way that these structures form during training as well as how that formation relates to training data. 

In its lesser form, static alignment in large models will prove to be a Sisyphean act of looking for inner structures in model N and hoping that they generalize to model N + 1. If we are unable to build robust static analysis tools, by the time model N + 1 arrives we will have failed to detect some percent of the danger, and this missed danger will compound as N increases.

Despite the previous paragraph, I am optimistic because of work being done by Mechanistic Interpretability researchers like Neel Nanda. He has a three-part series about how he reverse-engineered an impressively complex algorithm that a single-layer transformer model learned in order to perform modular addition. This series is an example of how, although the algorithms implemented by LLMs are not designed to be human-readable, they can be understood through rigorous research and mathematical understanding.
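To give a flavor of what such a reverse-engineered algorithm can look like, here is a minimal sketch of modular addition done purely with sines and cosines. As I understand the published write-up, the learned algorithm has roughly this shape: inputs are mapped onto waves, combined with trig identities, and the answer is the value that makes the waves line up. The function name, the particular frequencies, and the structure of the code are my own illustrative assumptions, not the model's actual weights or the author's code.

```python
import math

def modular_add_via_fourier(a: int, b: int, p: int = 113, freqs=(1, 2, 5)) -> int:
    """Return (a + b) mod p using only sines, cosines and an argmax.

    `freqs` is an arbitrary illustrative choice of frequencies (an assumption).
    p should be prime so every frequency distinguishes all residues.
    """
    best_c, best_logit = 0, -math.inf
    for c in range(p):
        logit = 0.0
        for k in freqs:
            w = 2 * math.pi * k / p
            # Angle-addition identities give cos(w(a+b)) and sin(w(a+b))
            # from the per-input waves, without ever computing a + b directly.
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # cos(w(a+b-c)) equals 1 exactly when c == (a + b) mod p.
            logit += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        if logit > best_logit:
            best_c, best_logit = c, logit
    return best_c

print(modular_add_via_fourier(100, 50))
```

Every candidate answer c gets a score, and only the correct residue makes all the cosine terms peak at once. Nothing about this is obvious from staring at raw weights, which is why finding it took careful research.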

Dynamic Alignment

Dynamic alignment is a class of evaluations that a third-party organization like ARC should be required to conduct. This type of testing usually consists of giving a model limited power and resources and seeing how it behaves. Another type of testing could look at how a model performs when competing against itself. Both of these evaluation types prepare for various takeoff scenarios.

One type of dangerous takeoff scenario is when a superintelligent AI is acting in the world untethered. This would have unknown consequences for the world's economy, public trust and international relations. Dynamic evaluation seeks to identify models that, if given the ability to act untethered, pose a risk to humanity. What would we do if we found an AI that failed a dynamic evaluation? Who knows, but I hope we would not release it and would instead study every facet of its internals. Deciding what to do when a model fails an ARC evaluation is important, but for now we should focus on increasing the scope of the evaluation set and making each evaluation more robust.

Another dangerous takeoff scenario is when a sufficiently dangerous model is irresponsibly released to the public (I'm looking at you, Meta...) and the world's leading evil-doers are given a tool powerful enough to cripple a nation. If the models are instead only powerful enough to topple things smaller than nations, then it is likely that the world will have to function with superintelligent agents that have conflicting goals. These models, which simulate some kind of agency, will be competing or even collaborating with one another. This type of behavior should be studied, and that is why I consider it a type of evaluation ARC should look into for GPT-5+ and its competitors.

This type of evaluation is notably dangerous since you are giving resources to an under-studied force. We also run into two classic problems. One problem is that a model smart enough to take over the world is likely smart enough to pass a dynamic evaluation. The other problem is that a model smart enough to take over the world is likely smart enough to break out of whichever box we put it in. The problems of deception and self-jailbreaking are beyond the scope of this post, so I will not delve deeper.

Static and Dynamic Alignment in Conjunction

Responsible deployment will involve both types of alignment evaluation, but the order in which the evaluations are conducted matters. Safety is maximized when static evaluations are done before dynamic evaluations, as opposed to the other way around. From a research and scientific lens this should be obvious, since static analysis looks for states in the model that will inform the dynamic evaluations. Although running a model will be required to verify claims made by static analysis, the research to verify those theories will hopefully be done in less risky scenarios.
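The ordering argument can be sketched as a simple gate. This is only an illustration under my own assumptions: the names `Finding`, `evaluate`, `static_checks` and `dynamic_checks` are hypothetical, and no real static analysis tool exists yet to plug in. The point is the control flow: the model is never run unless every weights-only check has passed, and the dynamic checks get to see what static analysis found.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class Finding:
    check: str   # name of the check that produced this result
    stage: str   # "static" or "dynamic"
    passed: bool

# Hypothetical signatures: static checks see only weights,
# dynamic checks get a way to run the model plus the static findings.
StaticCheck = Callable[[Any], bool]
DynamicCheck = Callable[[Callable[[str], str], List[Finding]], bool]

def evaluate(
    weights: Any,
    run_model: Callable[[str], str],
    static_checks: Dict[str, StaticCheck],
    dynamic_checks: Dict[str, DynamicCheck],
) -> Tuple[bool, List[Finding]]:
    findings: List[Finding] = []

    # Stage 1: inspect weights only; the model is not executed here.
    for name, check in static_checks.items():
        findings.append(Finding(name, "static", check(weights)))

    # Gate: any static failure stops the process before the model ever runs.
    if not all(f.passed for f in findings):
        return False, findings

    # Stage 2: riskier behavioral tests, informed by the static findings.
    static_findings = list(findings)
    for name, check in dynamic_checks.items():
        findings.append(Finding(name, "dynamic", check(run_model, static_findings)))

    return all(f.passed for f in findings), findings
```

Reversing the two stages would mean handing resources to a model before anyone has looked inside it, which is exactly the risk the ordering is meant to avoid.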

The Standard Process

The standard in a world with responsible scientists is one where static model analysis is done before any risky evaluations. Any such standard will require thorough research in fields like mechanistic interpretability and deep learning theory. If looking at the weights of a model gives us no knowledge of its behavior, then we will not detect an evil model before we run it. Without robust theories and tools for static model analysis, we are doomed to continue to just run gold-spitting black boxes until the end.
