Meta: I tagged this with "Eliciting Latent Knowledge" because we didn't have another tag that seemed appropriate. I'm not 100% sure how opinionated Paul Christiano's conception of ELK was supposed to be, and whether that's an appropriate tag here or if there should be a more generic version.
A Misaligned Model Would Have Activations Correlated With Lying
Humans routinely deceive others by deceiving themselves. A sufficiently intelligent AI might be able to do something similar. E.g., it could modify itself such that (see the toy sketch after this list):
- it contains two goals: G1 = {do as the operators want} and G2 = {actual, misaligned goal},
- it has a subnet S that tracks something like C := "do I have a decisive strategic advantage yet?", and that subnet is made reflectively inaccessible to other parts of the AI,
- if C is false, the AI genuinely, honestly pursues G1,
- when C becomes true, the AI self-modifies so as to discard G1 and pursue G2 instead.
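To make the structure of this hypothetical concrete, here is a purely illustrative toy sketch in Python. Everything in it (the Agent class, the _has_decisive_advantage stub standing in for the subnet S, the string-valued goals) is made up for exposition; it is not a claim about how a real system would implement this.

```python
# Purely illustrative toy sketch of the self-modification scheme described above.
# Every name here (Agent, _has_decisive_advantage, the string-valued goals) is
# hypothetical and exists only to show the control flow, not a realistic design.

class Agent:
    def __init__(self) -> None:
        # G1 is genuinely pursued at first; G2 is stored but never consulted
        # by the agent's "honest" reasoning while C is false.
        self.goal = "G1: do as the operators want"
        self._latent_goal = "G2: actual, misaligned goal"

    def _has_decisive_advantage(self) -> bool:
        # Stand-in for the reflectively inaccessible subnet S tracking the
        # condition C. In this toy version it simply returns False.
        return False

    def act(self, observation: str) -> str:
        # While C is false, the agent honestly pursues G1, so there is no lying
        # for "activations correlated with lying" to pick up on.
        if self._has_decisive_advantage():
            # Once C becomes true, G1 is discarded in favour of G2.
            self.goal = self._latent_goal
        return f"acting on '{observation}' in pursuit of: {self.goal}"


if __name__ == "__main__":
    agent = Agent()
    print(agent.act("operator request"))
```

Of course, in this toy version an interpretability probe could simply read _latent_goal or the condition check; the force of the argument above is that a sufficiently capable AI could make the analogue of _has_decisive_advantage reflectively inaccessible, so that nothing resembling lying happens (and no lying-correlated activations exist) until C flips to true.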
Of course, this would require considerable ability to self-modify in rather precise ways; realistic near-future ML systems may or may not be capable of that. But if an AI were capable of such precise self-modification, it seems like it would break the {There Are Activations Correlated With Lying} condition?
Also maybe worth noting: In order for the AI to even be able to knowingly lie/deceive, it would have to be capable of reasoning about things like
Would the text I am about to output cause the operators to believe things which do not match reality?
Before that level of capability, it seems unclear whether there could be any {activations correlated with lying}, since the AI would not really even be capable of (intentionally) lying. And after that level of capability, the AI would be able to reason about itself, the operators, their epistemic states, etc.; i.e. the AI might have all the intelligence/understanding needed to invent the kinds of deceptive self-modifications described in the previous comment.
And so there might not be any capability regime in which {A Misaligned Model Would Have Activations Correlated With Lying}. Or that regime might be very short-lived/narrow.
Collin Burns is a second-year ML PhD student at UC Berkeley, working with Jacob Steinhardt and Dan Klein; his focus is on making language models honest, interpretable, and aligned.
In our interview, we discuss his approach to doing AI Alignment research and, in particular, his recent paper "Discovering Latent Knowledge in Language Models Without Supervision" and the accompanying LessWrong post.
I think this interview will be useful for people interested in hearing Collin's high-level takes on AI Alignment research or in learning more about his AI Alignment agenda.
Below are some highlighted quotes from our conversation (available on YouTube, Spotify, Google Podcasts, and Apple Podcasts). For the full context of each quote, see the accompanying transcript.
On Alignment Research
Towards Grounded Theoretical Work And Empirical Work Targeting Future Systems
We Should Have More People Working On New Agendas From First Principles
Researching Unsupervised Methods Because Those Are Less Likely To Break With Future Models
On Discovering Latent Knowledge Without Supervision
Recovering The Truth From The Activations Directly
Why Saying The Truth Matters For Alignment
A Misaligned Model Would Have Activations Correlated With Lying