This post comes a bit late with respect to the news cycle, but I argued in a recent interview that o1 is an unfortunate twist on LLM technologies, making them particularly unsafe compared to what we might otherwise have expected:
The basic argument is that the technology behind o1 doubles down on a reinforcement learning paradigm, which moves us closer to a world where we have to get the value specification exactly right in order to avert catastrophic outcomes.
Additionally, this technology takes us further from interpretability. If you ask GPT4 to produce a chain-of-thought (with prompts such as "reason step-by-step to arrive at an answer"), you know that in some sense, the natural-language reasoning you see in the output is how it arrived at the answer.[1] This is not true of systems like o1. The o1 training rewards any pattern which results in better answers. This can work by improving the semantic reasoning which the chain-of-thought apparently implements, but it can also work by promoting subtle styles of self-prompting. In principle, o1 can learn a new internal language which helps it achieve high reward.
You can tell the RL is done properly when the models cease to speak English in their chain of thought
- Andrej Karpathy
A loss of this type of (very weak) interpretability would be quite unfortunate from a practical safety perspective. Technology like o1 moves us in the wrong direction.
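For concreteness, here is a minimal sketch of the kind of prompted chain-of-thought described above, using the OpenAI Python SDK. The model name, prompt wording, and example question are illustrative assumptions rather than anything from the original argument; the point is only that the intermediate reasoning arrives as ordinary natural-language text we can read, rather than as something the training process has been free to repurpose.

```python
# Minimal sketch of prompted chain-of-thought: the model's intermediate
# reasoning appears as ordinary natural-language output, visible to the user.
# Model name, prompt wording, and the example question are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (
            "Reason step-by-step to arrive at an answer: "
            "a train leaves at 3pm travelling 60 mph; when has it covered 90 miles?"
        ),
    }],
)

# Both the reasoning steps and the final answer are in this visible text.
print(response.choices[0].message.content)
```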
Informal Alignment
The basic technology currently seems to have the property that it is "doing basically what it looks like it is doing" in some sense. (Not a very strong sense, but at least, some sense.) For example, when you ask ChatGPT to help you do your taxes, it is basically trying to help you do your taxes.
This is a very valuable property for AI safety! It lets us try approaches like Cognitive Emulation.
In some sense, the Agent Foundations program at MIRI frames the problem like this: human values are currently an informal object. We can only get meaningful guarantees for formal systems. So, we need to work on formalizing concepts like human values. Only then will we be able to get formal safety guarantees.
Unfortunately, fully formalizing human values appears to be very difficult. Human values touch upon basically all of the human world, which is to say, basically all informal concepts. So it seems like this route would need to "finish philosophy" by making an essentially complete bridge between formal and informal. (This is, arguably, what approaches such as Natural Abstractions are attempting.)
Approaches similar to Cognitive Emulation lay out an alternative path. Formalizing informal concepts seems hard, but it turns out that LLMs "basically succeed" at importing all of the informal human concepts into a computer. GPT4 does not engage in the sorts of naive misinterpretations which were discussed in the early days of AI safety. If you ask it for a plan to manufacture paperclips, it doesn't think the best plan would involve converting all the matter in the solar system into paperclips. If you ask for a plan to eliminate cancer, it doesn't think the extermination of all biological life would count as a success.
We know this comes with caveats; phenomena such as adversarial examples show that the concept-borders created by modern machine learning are deeply inhuman in some ways. The computerized versions of human commonsense concepts are not robust to optimization. We don't want to naively optimize these rough mimics of human values.
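As one illustration of that non-robustness, here is a minimal sketch of the classic fast gradient sign method (FGSM) in PyTorch. The `model`, `x`, `label`, and `epsilon` names are assumptions of the sketch (any differentiable classifier and its inputs), not anything from the post; the point is that a perturbation too small for a human to notice can push an input across a learned concept boundary.

```python
# Minimal FGSM sketch: a tiny, human-imperceptible perturbation of the input
# can flip the classifier's decision, showing that learned concept boundaries
# need not line up with human ones. `model` is any differentiable classifier.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, label, epsilon=0.01):
    """Return a copy of x nudged in the direction that most increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    # Each pixel moves by at most epsilon, yet the prediction often changes.
    return (x + epsilon * x.grad.sign()).detach()
```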
Nonetheless, these "human concepts" seem robust enough to get a lot of useful work out of AI systems, without automatically losing sight of ethical implications such as the preservation of life. This might not be the sort of strong safety guarantee we would like, but it's not nothing. We should be thinking about ways to preserve these desirable properties going forward. Systems such as o1 threaten this.
[1] Yes, this is a fairly weak sense. There is a lot of computation under the hood in the big neural network, and we don't know exactly what's going on there. However, we also know "in some sense" that the computation there is relatively weak. We also know it hasn't been trained specifically to cleverly self-prompt into giving a better answer (unlike o1); it "basically" interprets its own chain-of-thought as natural language, the same way it interprets human input.
So, to the extent that the chain-of-thought helps produce a better answer in the end, we can conclude that the improvement "basically" comes from the actual semantic reasoning which the chain-of-thought apparently implements. This argument can fail for systems like o1.
I more-or-less agree with Eliezer's comment (to the extent that I have the data necessary to evaluate his words, which is greater than most, but still, I didn't know him in 1996). I have a small beef with his bolded "MIRI is always in every instance" claim, because a universal like that is quite a strong claim, and I would be very unsurprised to find a single counterexample somewhere (particularly if we include every MIRI employee and everything they've ever said while employed at MIRI).
What I am trying to say is something looser and more gestalt. I do think what I am saying contains some disagreement with some spirit-of-MIRI, and possibly with specific individuals at MIRI, such that I could say I've updated on the modern progress of AI in a different way than they have.
For example, in my update, the modern progress of LLMs points towards the Paul side of some Eliezer-Paul debates. (I would have to think harder about how to spell out exactly which Eliezer-Paul debates.)
One thing I can say is that I myself often argued using "naive misinterpretation"-like cases such as the paperclip example. However, I was also very aware of the Eliezer-meme "the AI will understand what the humans mean, it just won't care". I would have predicted difficulty in building a system which correctly interprets and correctly cares about human requests to the extent that GPT4 does.
This does not mean that AI safety is easy, or that it is solved; only that it is easier than I anticipated at this particular level of capability.
Getting more specific to what I wrote in the post:
My claim is that modern LLMs are "doing roughly what they seem like they are doing" and "internalize human intuitive concepts". This does include some kind of claim that these systems are more-or-less ethical (they appear to be trying to be helpful and friendly, therefore they "roughly are").
The reason I don't think this contradicts Eliezer's bolded claim ("Getting a shape into the AI's preferences is different from getting it into the AI's predictive model") is that I read Eliezer as talking about strongly superhuman AI with this claim. It is not too difficult to get something into the values of some basic reinforcement learning agent, to the extent that something like that has values worth speaking of. It gets increasingly difficult as the agent gets cleverer. At the level of intelligence of, say, GPT4, there is not a clear difference between getting the LLM to really care about something versus merely getting those values into its predictive model. It may be deceptive or honest; or, it could even be meaningless to classify it as deceptive or honest. This is less true of o1, since we can see it actively scheming to deceive.