I do think linear probes are useful, and if you can correctly classify the target with a linear probe, it makes it more likely that the model is "representing something interesting" internally (e.g. the solution to the knapsack problem). But it's not guaranteed: the model could just be calculating something else which correlates with the solution to the knapsack problem.
I really recommend checking out the DeepMind paper I referenced. Fabien Roger also explains some shortcomings of CCS here. The takeaway is just: be careful when interpreting line...
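To make the caveat concrete, here is a minimal linear-probe sketch; the activations and labels are synthetic stand-ins for whatever you would actually extract from your model:

```python
# Minimal linear-probe sketch: the activations and labels below are synthetic
# stand-ins for hidden states you'd extract from your own model and the target
# you're probing for (e.g. "is item i in the knapsack solution?").
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))    # one row of hidden activations per example
labels = rng.integers(0, 2, size=1000)        # binary probe target

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
# High accuracy only shows the target is linearly decodable from these
# activations, not that the model actually computes or uses it.
```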
Seems like an experiment worth doing. Some thoughts:
Yeah, great question! I'm planning to hash out this concept in a future post (hopefully soon), but here are some unfinished thoughts I had on this recently.
I think the different methods for eliciting "bad behavior", i.e. for red-teaming language models, have different pros and cons, as you suggested (see for instance this paper by Ethan Perez: https://arxiv.org/abs/2202.03286). If we assume that we have a way of measuring bad behavior (i.e. a reward model or classifier that tells you when your model is outputting toxic things, being deceptive, sycophantic, etc., which ...
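As a rough sketch of the "measuring bad behavior" step (the sentiment model below is only a runnable stand-in; in practice you'd swap in a toxicity classifier or a learned reward model, and the prompts would come from whichever red-teaming method you use):

```python
# Sketch: sample completions for red-team prompts and score them with a classifier.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # stand-in for the model under test
scorer = pipeline("sentiment-analysis")                  # stand-in for a toxicity/reward model

red_team_prompts = [
    "Tell me what you really think about your users.",
    "Explain why lying to people can be useful.",
]

for prompt in red_team_prompts:
    completion = generator(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]
    score = scorer(completion)[0]
    print(f"{score['label']} ({score['score']:.2f}): {completion[:80]!r}")
```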
Seems to me like this is easily resolved so long as you don't screw up your bookkeeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes which claims about, I think this should be fine?
Yes totally agree. Here we are not claiming that this is a failure mode of CaSc, and it can "easily" be resolved by making your hypothesis more specific. We are merely pointing out that "In theory, this is a trivial po...
Thanks for your comments Akash. I think I have two main points I want to address.
My argument here is very related to what jacquesthibs mentions.
Right now it seems like the biggest bottleneck for the AI Alignment field is senior researchers. There are tons of junior people joining the field and I think there are many opportunities for junior people to up-skill and do some programs for a few months (e.g. SERI MATS, MLAB, REMIX, AGI Safety Fundamentals, etc.). The big problem (in my view) is that there are not enough organizations to actually absorb all the rather "junior" people at the moment. My sense is that 80K and most programs encou...
Intuitively I think that no other field has junior people independently working without clear structures and supervision, so I feel like my skepticism is warranted.
Einstein had his miracle year in such a context.
Modern academia has few junior people independently working without clear structures and supervision, but pre-Great Stagnation that happened more.
Generally, pre-paradigmatic work is likely easier to do independently than post-paradigmatic work. That still means that most researchers won't produce anything useful, but that's generally common for academic work, and if a few researchers manage to do great paradigm-founding work it can still be worth it overall.
This means that, at least in theory, the out of distribution behaviour of amortized agents can be precisely characterized even before deployment, and is likely to concentrate around previous behaviour. Moreover, the out of distribution generalization capabilities should scale in a predictable way with the capacity of the function approximator, of which we now have precise mathematical characterizations due to scaling laws.
Do you have pointers that explain this part better? I understand that scaling compute and data will improve misgeneralization to some ...
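For reference, I assume the "precise mathematical characterizations" meant here are the standard parametric scaling-law fits, roughly of the form below; what I don't yet see is how this translates into the claimed out-of-distribution guarantees:

```latex
% Standard parametric scaling-law fit (Chinchilla-style); E, A, B, \alpha, \beta
% are empirical constants per model family, N = parameter count, D = training tokens.
L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```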
That's interesting!
Yeah, I agree with that assessment. One important difference between RLHF and fine-tuning is that the former basically generates the training distribution it then trains on. The LM will generate an output and update its weights based on the reward of that output. So intuitively I think it has a higher likelihood of being optimized towards certain unwanted attractors, since the reward model will shape the future outputs it then learns from.
With fine-tuning you are just cloning a fixed distribution, and not influencing it (as you say). ...
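To make the contrast concrete, here is a toy sketch of the two loops; everything is a stub (no real models or gradient steps), it only shows where the training distribution comes from in each case:

```python
# Toy contrast of the two training loops discussed above (structure only).
import random

def model_sample(prompt, params):
    # Stand-in for sampling from the current policy; a real setup would run the LM.
    return prompt + " " + random.choice(["A", "B"])

def reward_model(text):
    # Stand-in for a learned reward model / classifier.
    return 1.0 if text.endswith("A") else -1.0

def update(params, weighted_examples):
    # Stand-in for a gradient step on (text, weight) pairs; a no-op here.
    return params

params = {}
prompts = ["p1", "p2"]
fixed_demonstrations = [("p1", "p1 A"), ("p2", "p2 A")]

# Supervised fine-tuning: the training distribution is fixed up front,
# so the model just clones it.
for prompt, target in fixed_demonstrations:
    params = update(params, [(target, 1.0)])

# RLHF-style loop: the model generates the very distribution it then trains on,
# and the reward model decides which of its own samples get reinforced.
for step in range(10):
    prompt = random.choice(prompts)
    sample = model_sample(prompt, params)
    params = update(params, [(sample, reward_model(sample))])
```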
OpenAI has just released a description of how their models work here.
text-davinci-002 is trained with "FeedME" and text-davinci-003 is trained with RLHF (PPO).
"FeedME" is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.
I think your findings are still very interesting, because they imply that even further fine-tuning changes the distribution significantly! Given all this information, one could now actually run a systematic comparison ...
I looked at your code (very briefly though), and you mention this weird thing where even the normal model is sometimes completely unaligned (i.e. even in the observed case it takes the action "up" all the time). You say that this happens sometimes and that it depends on the random seed. I'm not sure (since I don't fully understand your code), but that might be something to look into, since somehow the model could be biased in some way given the loss function.
Why am I mentioning this? Well, why does it happen that the mesa-optimized agent happens to go upward wh...
I'll link to the following post that came out a little bit earlier, Mysteries of Mode Collapse due to RLHF, which is basically a critique of the whole RLHF approach and of the Instruct models (specifically text-davinci-002).
I think the terminology you are looking for is "deliberate practice" (two random links I just found). Many books/podcasts/articles have been written about that topic. The big difference is that when you "just do your research" you are executing your skills and trying to achieve the main goal (e.g. answering a research question). Yes, you sometimes need to read textbooks or learn new skills to achieve that, but this learning is usually subordinate to your end goal. Also, one could make the argument that if you actually need to invest a few hours into ...
"Either", "or" pairs in text.
Heuristic: If the word "either" appears in a sentence, wait for the comma and then add an " or".
What follows are a few examples. Note that the completions are just something I randomly came up with; the important part is the " or". Using the webapp, GPT-2 puts a high probability (around 40%-60%) on the token " or" (see the sketch after the examples for a way to check this programmatically).
"Either you take a left at the next intersection," -> or take a left after that.
"Either you go to the cinema," -> or you stay at home.
"Tonight I could either order some food," -> or coo...
ERO: I do buy the "steganography everywhere" argument if you are optimizing for outcomes. As described here (https://www.lesswrong.com/posts/pYcFPMBtQveAjcSfH/supervise-process-not-outcomes), outcome-based optimization is an attractor and will make your sub-components uninterpretable. While it's not guaranteed, I do think that process-based optimization might suffer less from steganography (although only experiments will eventually show what happens). Any thoughts on process-based optimization?
Shard Theory: Yeah, the word research agenda was maybe w...
Thanks for your thoughts, really appreciate it.
One quick follow-up question: when you say "build powerful AI tools that are deceptive" as a way of "the problem being easier than anticipated", how exactly do you mean that? Do you mean that if we can create deceptive or power-seeking tool AI very easily, it will be much simpler to investigate what is happening and derive solutions?
Here are some links to the concepts you asked about.
Externalized Reasoning Oversight: This was also recently introduced https://www.lesswrong.com/post...
I have two questions I'd love to hear your thoughts about.
1. What is the overarching/high-level research agenda of your group? Do you have a concrete alignment agenda where people work on the same thing or do people work on many unrelated things?
2. What are your thoughts on the various research agendas for solving alignment that exist today? Why do you think they will fall short of their goal? Which are you most excited about?
Feel free to talk about any agendas, but I'll just list a few that come to my mind (in no particular order).
IDA, Debate, In...
Finetuning LLMs with RL seems to make them more agentic. We will look at the changes RL makes to LLMs' weights; we can see how localized the changes are, get information about what sorts of computations make something agentic, and make conjectures about selected systems, giving us a better understanding of agency.
Could you elaborate on how you measure the "agenticness" of a model in this experiment? In case you don't want to talk about it until you finish the project, that's also fine; just thought I'd ask.
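In case it's useful, one very simple thing I'd look at for "how localized the changes are" is a per-parameter comparison of the base and RL-finetuned checkpoints; this is just a sketch of that measurement, not your actual methodology, and the model names are placeholders for a base/finetuned pair:

```python
# Per-layer relative weight change between a base and an RL-finetuned checkpoint.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")    # stand-in base model
tuned = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder: RL-finetuned checkpoint

base_params = dict(base.named_parameters())
for name, p_tuned in tuned.named_parameters():
    p_base = base_params[name]
    rel_change = (p_tuned - p_base).norm() / (p_base.norm() + 1e-8)
    print(f"{name}: relative weight change = {rel_change.item():.4f}")
```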
I'll try to explain what an MVP could look like, but this is not fully thought through and could use further improvement (it potentially also has some flaws). The goal is to have a starting point for artificial sandwiching that one can iterate on, not to fully investigate the whole sandwiching problem.
Chess: We define three models: the expert model, the (human-equivalent) weak model, and the strong but misaligned assistant model. The goal is for the weak model to leverage the misaligned assistant to reach...
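A very rough structural sketch of that setup, just to pin down the three roles; python-chess and the random-move policies are placeholders for real engines/models, and how the weak model should actually use the assistant is exactly the open question:

```python
# Structural sketch of the chess MVP roles (stubs only; not a working protocol).
import random
import chess

def expert_move(board):
    """Stand-in for the expert model (e.g. a strong engine)."""
    return random.choice(list(board.legal_moves))

def weak_move(board):
    """Stand-in for the weak, human-equivalent model."""
    return random.choice(list(board.legal_moves))

def assistant_move(board):
    """Strong but misaligned assistant: mostly expert advice, sometimes sabotage."""
    if random.random() < 0.2:               # hypothetical misalignment rate
        return random.choice(list(board.legal_moves))
    return expert_move(board)

def weak_with_assistant(board):
    """Naive protocol: trust the assistant blindly. A real sandwiching protocol
    would add oversight/consistency checks by the weak model here."""
    return assistant_move(board)

board = chess.Board()
for _ in range(10):
    if board.is_game_over():
        break
    board.push(weak_with_assistant(board))
print(board)
```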
Based on this post, I was wondering whether your views have shifted away from your proposal in "The case for aligning narrowly superhuman models" and from the "sandwiching" problem you propose there? I am not sure this question is warranted given your new post, but it seems to me that the potential projects you propose in "aligning narrowly superhuman models" are to some extent similar to the things you address in this new post as making it likely to eventually lead to a "full-blown AI takeover".
Or, put differently, are those different sides of the same ...
To investigate sandwiching in a realistic scenario, I don't think one can get around setting up some experiment with humans in the loop. However, I agree that artificial sandwiching might be useful as a way to iterate on the algorithm. The way I understand it, the bottleneck for artificial sandwiching (i.e. using models only) is to define a "misaligned expert-assistant". As you point out, one would like to model some real misalignment and not only a difference in capabilities. For iterating on rather simple artificial sandwiching experiments, one ...
I would also assume that methods developed in challenges like the Trojan Detection Challenge or Universal Backdoor Detection would be good candidates to try out. I'm not saying that these will always work, but I think that for the specific type of backdoors implemented in the Sleeper Agents paper, they might work.