If you want counterarguments, here's one good place to look: Object-Level AI Risk Skepticism - LessWrong
I expect we might get more today, as it's the deadline for the Open Philanthropy AI Worldview Contest.
In the deceptive alignment story, the model wants to take action A, because its goal is misaligned, but chooses to take apparently aligned action B to avoid overseers noticing that it is misaligned. In other words, in the absence of deceptive tendencies, the model would take action A, and because overseers wanted action B, taking A would reveal that the model is misaligned. That's the definition of a differential adversarial example.
If there were an unaligned model with no differential adversarial examples in training, that would be an example of a...
I have a whole section on the key assumptions about the training process and why I expect them to be the default. It's all in line with what's already happening, and the labs don't have to do anything special to prevent deceptive alignment. Did I miss anything important in that section?
The deceptive alignment story argues that even if you gave a reward signal that resulted in the model appearing to be aligned and competent, it could develop a proxy goal instead and actively trick you into thinking that it is aligned so it can escape later and seize power. I'm explicitly not addressing other failure modes in this post.
What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don't know how to code? Why would underqualified evaluators result in an ulterior motive? And to make it more...
Which assumptions are wrong? Why?
I don't think that the specific ways people give feedback are very relevant. This post is about deceptive alignment, which is really about inner misalignment. Also, I'm assuming that this is a process that enables TAI to emerge, especially the first time, and asking people who don't know about a topic to give feedback probably won't be the strategy that gets us there. Does that answer your question?
From Ajeya Cotra's post that I linked to:
Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.
It's not important what the tasks are, as long as the model is learning to complete diverse tasks by following directions.
Pre-trained models could conceivably have goals like predicting the next token, but they should be extremely myopic and not have situational awareness. In pre-training, a text model predicts tokens totally independently of each other, and nothing other than its performance on the next token depends directly on its output. The model makes the prediction, then that prediction is used to update the model. Otherwise, it doesn't directly affect anything. Having a goal for something external to its next prediction could only be harmful for training performance, ...
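To make the point concrete that nothing beyond next-token performance depends on the model's output, here is a minimal sketch of a standard autoregressive pre-training loss with teacher forcing. The `next_token_loss` function, the toy model, and the shapes are illustrative assumptions, not details of any particular lab's setup:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: (batch, seq_len) integer token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    # Each position is scored only on its prediction of the next ground-truth
    # token; the model's own outputs are never fed back in during training,
    # so nothing other than next-token performance depends on them.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

# Example usage with a stand-in "model":
vocab_size = 100
toy_model = lambda x: torch.randn(x.shape[0], x.shape[1], vocab_size)
loss = next_token_loss(toy_model, torch.randint(0, vocab_size, (4, 16)))
```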
I'd be curious to hear what you think about my arguments that deceptive alignment is unlikely. Without deceptive alignment, there are many fewer realistic internal goals that produce good training results.
Thanks for sharing your perspective! I've written up detailed arguments that deceptive alignment is unlikely by default. I'd love to hear what you think of it and how that fits into your view of the alignment landscape.
Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.”
Likely TAI training scenarios include information about the base objective in the input. A corrigibly-aligned model could learn to infer the base objective and optimize for that.
...However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at leas
Thanks for summarizing this! I have a very different perspective on the likelihood of deceptive alignment, and I'd be interested to hear what you think of it!
This is an interesting post. I have a very different perspective on the likelihood of deceptive alignment. I'd love to hear what you think of it and discuss further!
I recently made an inside view argument that deceptive alignment is unlikely. It doesn't cover other failure modes, but it makes detailed arguments against a core AI x-risk story. I'd love to hear what you think of it!
This is an interesting point, but it doesn't undermine the case that deceptive alignment is unlikely. Suppose that a model doesn't have the correct abstraction for the base goal, but its internal goal is the closest abstraction it has to the base goal. Because the model doesn't understand the correct abstraction, it can't instrumentally optimize for the correct abstraction rather than its flawed abstraction, so it can't be deceptively aligned. When it messes up due to having a flawed goal, that should push its abstraction closer to the correct abstraction....
1) You talk about the base goal, and then the training goal, and then human values/ethics. These aren't the same thing though right? In fact they will almost certainly be very different things. The base goal will be something like "maximize reward in the next hour or so." Or maaaaaaybe "Do what humans watching you and rating your actions would rate highly," though that's a bit more complicated and would require further justification I think. Neither of those things are anywhere near to human ethics.
I specify the training setup here: “The goal of the t...
Thanks for your thoughtful reply! I really appreciate it. I’m starting with your fourth point because I agree it is closest to the crux of our disagreement, and this has become a very long comment.
#4:
What amount of understanding of the base goal is sufficient? What if the answer is "It has to be quite a lot, otherwise it's really just a proxy that appears superficially similar to the base goal?" In that case the classic arguments for deceptive alignment would work fine.
TL;DR the model doesn’t have to explicitly represent “X, whatever that turns out t...
Thanks for the thoughtful feedback both here and on my other post! I plan to respond in detail to both. For now, your comment here makes a good point about terminology, and I have replaced "deception" with "deceptive alignment" in both posts. Thanks for pointing that out!
I'm intentionally not addressing direct reward maximizers in this sequence. I think they are a much more plausible source of risk than deceptive alignment. However, I haven't thought about them nearly as much, and I don't have strong intuition for how likely they are yet, so I'm choosing to stay focused on deceptive alignment for this sequence.
That makes sense. I misread the original post as arguing that capabilities research is better than safety work. I now realize that it just says capabilities research is net positive. That's definitely my mistake, sorry!
I strong upvoted your comment and post for modifying your views in a way that is locally unpopular when presented with new arguments. That's important and hard to do!
Your first link appears to be broken. Did you mean to link here? It looks like the last letter of the address got truncated somehow. If so, I'm glad you found it valuable!
For what it's worth, although I think deceptive alignment is very unlikely, I still think work on making AI more robustly beneficial and less risky is a better bet than accelerating capabilities. For example, my posts don't address these stories. There are also a lot of other concerns about potential downsides of AI that may not be existential, but are still very important.
However, it seems like adversarial examples do not differentially favor alignment over deception. A deceptive model with a good understanding of the base objective will also perform better on the adversarial examples.
If your training set has a lot of these adversarial examples, then the model will encounter them before it develops the prerequisites for deception, such as long-term goals and situational awareness. These adversarial examples should keep it from converging on an unaligned proxy goal early in training. The model doesn't start out with an unali...
Thanks for writing this up clearly! I don't agree that gradient descent favors deception. In fact, I've made detailed, object-level arguments for the opposite. To become aligned, the model needs to understand the base goal and point at it. To become deceptively aligned, the model needs to have long-term goals and situational awareness before, or around the same time as, it understands the base goal. I argue that this makes deceptive alignment much harder to achieve and much less likely to come from gradient descent. I'd love to hear what you think of my arguments!
Thanks for writing this! If you're interested in a detailed, object-level argument that a core AI risk story is unlikely, feel free to check out my Deceptive Alignment Skepticism sequence. It explicitly doesn't cover other risk scenarios, but I would love to hear what you think!
Thanks for posting this! Not only does a model have to develop complex situational awareness and have a long-term goal to become deceptively aligned, but it also has to develop these around the same time as it learns to understand the training goal, or earlier. I recently wrote a detailed, object-level argument that this is very unlikely. I would love to hear what you think of it!
The learner fairly quickly develops a decent representation of the actual goal, world model etc. and pursues this goal
Wouldn’t you expect decision making, and therefore goals, to be in the final layers of the model? If so, they will calculate the goal based on high-level world model neurons. If the world model improves, those high-level abstractions will also improve. The goal doesn’t have to build its own model from scratch, because it is connected to the world model.
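As a toy illustration of that picture (a hypothetical architecture, not a claim about how real models are wired), here is a sketch in which a goal head in the final layers reads high-level features from a shared world-model backbone, so improving the backbone automatically improves the inputs the goal is computed from:

```python
import torch
import torch.nn as nn

class WorldModelBackbone(nn.Module):
    def __init__(self, obs_dim=32, hidden_dim=128, feature_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feature_dim), nn.ReLU(),
        )

    def forward(self, obs):
        return self.net(obs)  # high-level world-model features

class GoalHead(nn.Module):
    def __init__(self, feature_dim=64, num_actions=4):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_actions)

    def forward(self, features):
        return self.head(features)  # decisions computed from those features

# The goal head never builds its own world representation; it just reads the
# backbone's features, so any improvement to the backbone flows through to it.
backbone, goal_head = WorldModelBackbone(), GoalHead()
action_logits = goal_head(backbone(torch.randn(8, 32)))
```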
...When the learner has a really excellent world model that can make long range p
Thanks for sharing. This looks to me like an agent falling for an adversarial attack, not pretending to be aligned so it can escape supervision to pursue its real goals later.
Where do you see weak points in the argument?
Do you think language models already exhibit deceptive alignment as defined in this post?
...I’m discussing a specific version of deceptive alignment, in which a proxy-aligned model becomes situationally aware and acts cooperatively in training so it can escape oversight later and defect to pursue its proxy goals. There is another form of deceptive alignment in which agents become more manipulative over time due to problems with training data and eventually optimize for reward, or something similar, directly. To avoid confusion, I will refer to these
I just posted a detailed explanation of why I am very skeptical of the traditional deceptive alignment story. I'd love to hear what you think of it!
If the model is sufficiently good at deception, there will be few to no differential adversarial examples.
We're talking about an intermediate model with an understanding of the base objective but no goal. If the model doesn’t have a goal yet, then it definitely doesn’t have a long-term goal, so it can’t yet be deceptively aligned.
Also, at this stage of the process, the model doesn't have goals yet, so the number of differential adversarial examples is unique for each potential proxy goal.
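As a deliberately toy illustration of that last point (all of the goals below are made up), here is a small sketch comparing several hypothetical proxy goals against a base goal over random inputs; each proxy disagrees with the base goal on its own distinct set of inputs, so each has a different number of differential adversarial examples:

```python
import random

random.seed(0)
inputs = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

base_goal = lambda x: x > 0.0                  # the behavior training rewards
candidate_proxies = {
    "close proxy": lambda x: x > 0.05,         # disagrees on a narrow band
    "rough proxy": lambda x: x > 0.3,          # disagrees more often
    "bad proxy":   lambda x: abs(x) > 0.5,     # disagrees on a large region
}

for name, proxy in candidate_proxies.items():
    # A differential adversarial example: an input where acting on the proxy
    # goal gives a different answer than acting on the base goal would.
    n_diff = sum(proxy(x) != base_goal(x) for x in inputs)
    print(f"{name}: {n_diff} differential adversarial examples out of {len(inputs)}")
```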
the vastly larger number of misaligned goals
I agree that ...
Thanks for pointing that out! My goal is to highlight that there are at least 3 different sequencing factors necessary for deceptive alignment to emerge:
The post you linked to talked about the importance of sequencing for #3, but it seems to assume that goal directedness will come first (#1) without disc...
Nate, please correct me if I'm wrong, but it looks like you:
- Skimmed, but did not read, a 3,000-word essay
- Posted a 1,200-word response that clearly stated that you hadn't read it properly
- Ignored a comment by one of the post's authors saying you thoroughly misunderstood their post and a comment by the other author offering to have a conversation with you about it
- Found a different person to talk to about their views (Ronny), who also had not read their post
- Participated in a 7,500-word dialogue with Ronny in which you speculated about what the core argum
I don't want to speak for Nate, and I also don't want to particularly defend my own behavior here, but I have kind of done similar things around trying to engage with the "AI is easy to control" stuff.
I found it quite hard to engage with directly. I have read the post, but I would not claim to be close to passing an ITT of its authors, and I have bounced off a few times. I don't currently expect direct conversation with Quintin or Nora to be that productive (though I would still be up for it and would give it a try).
So I have been talkin...