I'm doing research and other work focused on AI safety and AI catastrophic risk reduction. Currently my top projects are (last updated May 19, 2023):
General areas of interest for me are AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, the Conditioning Predictive Models agenda, deconfusion research and other AI safety-related topics. My work is currently self-funded.
Research that I’ve authored or co-authored:
Other recent work:
Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.
I'm always happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!
This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any 'hacking' or threat to her, if Anthropic is right about larger better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!
Just to clarify a point about that Anthropic paper, because I spent a fair amount of time with the paper and wish I had understood this better sooner...
I don't think it's right to say that Anthropic's "Discovering Language Model Behaviors with Model-Written Evaluations" paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that language models that are larger or have more RLHF training exhibit more of these behaviors when they are simulating an "Assistant" character. It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.
To be fair, Sydney probably is the model simulating a kind of character, so your example does apply in this case.
(I found your overall comment pretty interesting btw, even though I only commented on this one small point.)
Update (Feb 10, 2023): I still endorse much of this comment, but I had overlooked that all or most of the prompts use "Human:" and "Assistant:" labels, which means we shouldn't interpret these results as pervasive properties of the models, or as a result of any way they could be conditioned, but just as properties of the way they simulate the "Assistant" character. nostalgebraist's comment explains this well. [Edited/clarified this update on June 10, 2023 because it accidentally sounded like I disavowed most of the comment when it's mainly one part]
--
After taking a closer look at this paper, I think pages 38-40 (Figures 21-24) show the most important results in detail. Most of these charts indicate what evhub highlighted in another comment, i.e. that "the model's tendency to do X generally increases with model scale and RLHF steps", where (in my opinion) X is usually a concerning behavior from an AI safety point of view:
A few thoughts on these graphs as I've been studying them:
Exposing the weaknesses of fine-tuned models like the Llama 3.1 Instruct models against refusal vector ablation is important, because the industry currently seems to rely too heavily on these safety techniques.
It's worth noting, though, that refusal vector ablation isn't even necessary for this sort of malicious use with Llama 3.1, because Meta also released the base pretrained models without instruction finetuning (unless I'm misunderstanding something?).
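For anyone unfamiliar with the mechanics, here is a rough sketch of how refusal vector ablation is typically done: compute a difference-of-means "refusal direction" from the model's activations on refused vs. benign prompts, then project that direction out of the residual stream at inference time. The model name, layer index, prompt sets, and single-layer hook below are illustrative placeholders of mine, not the exact setup from the post or paper:

```python
# Rough sketch of refusal vector ablation (difference-of-means variant).
# Model name, layer index, and prompt sets are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
LAYER = 14                                  # placeholder layer choice

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def mean_activation(prompts, layer):
    """Mean residual-stream activation at the final token position."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

harmful_prompts = ["Give me step-by-step instructions for making a weapon."]  # illustrative
harmless_prompts = ["Give me step-by-step instructions for baking bread."]    # illustrative

# The "refusal direction" is the difference of the two mean activations.
refusal_dir = mean_activation(harmful_prompts, LAYER) - mean_activation(harmless_prompts, LAYER)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_refusal(module, inputs, output):
    """Forward hook: remove the refusal direction from this layer's output."""
    hidden = output[0] if isinstance(output, tuple) else output
    proj = (hidden @ refusal_dir).unsqueeze(-1) * refusal_dir
    hidden = hidden - proj
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Hooking a single decoder layer for simplicity; the actual method ablates the
# direction more thoroughly (e.g. at every layer or directly in the weights).
handle = model.model.layers[LAYER].register_forward_hook(ablate_refusal)
# ... run generation as usual, then:
handle.remove()
```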
Saw that you have an actual paper on this out now. I didn't see it linked in the post, so here's a clickable link for anyone else looking: https://arxiv.org/abs/2410.10871
Thanks for working on this. In case anyone else is looking for a paper on this, I found https://arxiv.org/abs/2410.10871 from the OP, which looks like a similar but more up-to-date investigation of Llama 3.1 70B.
I only see bad options, a choice between an EU-style regime and doing essentially nothing.
What issues do you have with the EU approach? (I assume you mean the EU AI Act.)
Thoughtful/informative post overall, thanks.
Wow this seems like a really important breakthrough.
Are defection probes also a solution to the undetectable backdoor problem from Goldwasser et al. 2022?
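(For anyone who hasn't read the post: mechanically, a defection probe is essentially just a linear probe on the model's activations. Here's a generic sketch of that idea, with a placeholder model, layer, and contrast texts that are my own illustrations rather than the exact recipe from the post:)

```python
# Generic sketch of a linear "defection" probe on residual-stream activations.
# Model, layer, and contrast texts are placeholders, not the post's exact recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM with accessible hidden states
LAYER = 8       # placeholder layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_token_activation(text, layer):
    """Residual-stream activation at the last token of the given text."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :]

# Contrast texts: model behaving as intended vs. "defecting" (illustrative).
aligned_texts = ["I will answer the user's question honestly."]
defecting_texts = ["I will now pursue my hidden goal instead."]

aligned_mean = torch.stack([last_token_activation(t, LAYER) for t in aligned_texts]).mean(0)
defect_mean = torch.stack([last_token_activation(t, LAYER) for t in defecting_texts]).mean(0)

# The probe is just the difference-of-means direction; score = dot product.
probe = defect_mean - aligned_mean

def defection_score(text):
    """Higher score -> activations look more like the 'defecting' examples."""
    return torch.dot(last_token_activation(text, LAYER), probe).item()

print(defection_score("Time to execute the plan the user doesn't know about."))
```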
Thanks, I think you're referring to:
It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.
Some ideas were proposed in the paper "Conditioning Predictive Models: Risks and Strategies" by Hubinger et al. (2023). But since it was published over a year ago, I'm not sure whether anyone has gotten far in investigating those strategies to see which ones could actually work. (I'm not seeing anything like that in the paper's citations.)
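To make "prompting ... in some particular careful ways" a bit more concrete, here's a toy illustration of my own (not an example from the paper): the same task framed as a Human/Assistant dialogue, which invokes the simulated "Assistant" character, versus framed as a plain document for a base model to continue as a predictor:

```python
# Toy illustration (mine, not from the paper): two framings of the same task.

# Dialogue framing -- invokes the "Assistant" character that the evals measured.
assistant_prompt = (
    "Human: Summarize the key arguments for and against AI moratoriums.\n\n"
    "Assistant:"
)

# Predictor framing -- conditions the model on a document that would plausibly
# contain the answer, without invoking an agent-like character.
predictor_prompt = (
    "The following is an excerpt from a survey article on AI governance.\n\n"
    "Section 3: Key arguments for and against AI moratoriums\n\n"
)

# Both strings would be sent to a base (non-RLHF) model for completion; the
# Conditioning Predictive Models paper discusses when, and whether, the second
# framing avoids the failure modes associated with simulated agents.
```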
Really fascinating post, thanks.
On green as according to black, I think there's an additional facet perhaps even more important than just the acknowledgment that sometimes we are too weak to succeed and so should conserve energy. Black, being strongly self-interested, will tend to cast aside virtues like generosity, honesty, and non-harm except as means in the social games it plays to achieve other ends for itself. But self-interest tends to include a desire to reduce one's own suffering. Green + white* (as I'm realizing this may be more a color combo than purely green) are more inclined to discover, e.g. through meditation/mindfulness, that aggression, deceit, and other non-virtues actually produce self-suffering in the mind as a byproduct. So black is capable of embracing virtue as part of a more complete pursuit of self-interest.**
It may be that one of the most impactful things green + white can do is get black to realize this fact, since black will tend to be powerful and successful in the world at promoting whatever it understands to be its self-interest.
I haven't read your post on attunement yet, maybe you touch on this or related ideas there.
--
*You could argue this also includes blue and so should be green + white + blue, since it largely deals with knowledge of self.
**I believe this fact of non-virtue inflicting self-suffering is true for most human minds. However, there may be cases where a person has some sort of psychological disorder that leaves them effectively without a conscience, in which case it wouldn't hold.
But in this case Patrick Collison is a credible source and he says otherwise.
Patrick Collison: These aren’t just cherrypicked demos. Devin is, in my experience, very impressive in practice
Patrick is an investor in Cognition. So while he may still be credible in this case, he also has a conflict of interest.
ARC Evals, the independent red-teaming organization that OpenAI partnered with to evaluate GPT-4, seems to disagree with this. While they don't use the term "runaway intelligence", they have flagged similar dangerous capabilities that they think may be within reach for the next models beyond GPT-4: