All of Jérémy Scheurer's Comments + Replies

I would also assume that methods developed in challenges like the Trojan Detection Challenge or Universal Backdoor Detection would be good candidates to try out. Not saying that these will always work, but I think for the specific type of backdoors implemented in the sleeper agent paper, they might work.

I do think linear probes are useful, and if you can correctly classify the target with a linear probe, it makes it more likely that the model is "representing something interesting" internally (e.g. the solution to the knapsack problem). But it's not guaranteed: the model could just be calculating something else which correlates with the solution to the knapsack problem.

I really recommend checking out the DeepMind paper I referenced. Fabien Roger also explains some shortcomings of CCS here. The takeaway is just: be careful when interpreting line... (read more)

2Olli Järviniemi
I (briefly) looked at the DeepMind paper you linked and Roger's post on CCS. I'm not sure if I'm missing something, but these don't really update me much on the interpretation of linear probes in the setup I described. One of the main insights I got out of those posts is "unsupervised probes likely don't retrieve the feature you wanted to retrieve" (and adding some additional constraints on the probes doesn't solve this). This... doesn't seem that surprising to me? And more importantly, it seems quite unrelated to the thing I'm describing. My claim is not about whether we can retrieve some specific features by a linear probe (let alone in an unsupervised fashion). Rather I'm claiming "If we feed the model a hard computational problem, and our linear probe consistently retrieves the solution to the problem, then the model is internally performing (almost all) computation to solve the problem." An extreme, unrealistic example to illustrate my point: Imagine that we can train a probe such that, when we feed our model a large semiprime n = p*q with p < q, the linear probe can retrieve p (mod 3). Then I claim that the model is performing a lot of computation to factorize n - even though I agree that the model might not be literally thinking about p (mod 3). And I claim that the same principle carries over to less extreme situations: we might not be able to retrieve the exact specific thing that the model is thinking about, but we can still conclude "the model is definitely doing a lot of work to solve this problem" (if the probe has high accuracy and the problem is known to be hard in the computational complexity sense).
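For concreteness, a minimal sketch of the kind of supervised-probe experiment being described (illustrative only, not code from either commenter; `get_activations` is a hypothetical stand-in for whatever extracts a chosen layer's activations for a prompt):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(prompts, labels, get_activations):
    """Train a linear probe to recover the answer to a hard problem from activations."""
    X = np.stack([get_activations(p) for p in prompts])  # (n_instances, d_model)
    y = np.array(labels)                                  # e.g. 1 if the knapsack instance is solvable
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # If held-out accuracy is high on a computationally hard problem, almost all
    # of the work must be happening inside the model, not in the one-layer probe.
    return probe.score(X_te, y_te)
```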

Seems like an experiment worth doing. Some thoughts:

  • I understand that setting up a classification problem in step 1 is important (so you can train the linear probe). I'd try to come up with a classification problem that a base model might initially refuse (or that we'd hope it would refuse). Then the training to say "Sorry, I can't help with that" makes more intuitive sense. I get that mechanistically it's the same thing. But you want to come as close to the real deal as you can, and it's unclear why a model would say "I'm sorry, I can't help" to the knapsack prob
... (read more)
1Olli Järviniemi
  Agree this is one axis the experiment could be improved on. I think I disagree here. At least for problems that are (computationally) difficult enough I think "linear probe can retrieve answer" implies "model has done the necessary computation to solve the problem".[1] To solve a computational problem you need to do work, and if the work isn't done by the probe, then it must be done by the model. Or are you referring to something else when you talk about "the model is literally thinking about it"? (In any case: if all one needs is one linear layer to extract dangerous cognition from the model, that doesn't seem very safe. Maybe the model itself has such a linear layer.) Seems quite likely to me as well. (The paper you mentioned indeed seems relevant, thanks.) I'm also somewhat interested in training against the probe, i.e. linear concept erasure / obfuscation of internals, and seeing whether performance can be maintained and solutions be retrieved with deeper probes. (I haven't looked at the concept erasure literature, seems very possible something similar has already been done - "train against the probe" feels like such a basic idea. References are welcome.) 1. ^ This is not strictly true: Maybe you need e.g. 10 layers to solve the problem, the model provides 9 of them and the linear probe provides the last one, so the model hasn't quite solved the problem by itself. Still, close enough: the model has to at least do all but the very last steps of the computation.
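A quick sketch of the simplest version of the "erase what the probe found, then re-probe" idea mentioned above (hypothetical code; real concept-erasure methods are more careful than a single projection):

```python
import numpy as np

def erase_probe_direction(acts, probe_weights):
    """Project activations onto the hyperplane orthogonal to a linear probe's weight vector."""
    w = probe_weights / np.linalg.norm(probe_weights)
    return acts - np.outer(acts @ w, w)   # acts: (n, d), w: (d,)

# One could then retrain the model with this projection applied (training against
# the probe) and check whether a deeper probe still recovers the solution.
```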

Yeah, great question! I'm planning to flesh out this concept in a future post (hopefully soon). But here are some unfinished thoughts I've had on this recently.

I think using different methods to elicit "bad behavior", i.e. to red-team language models, has different pros and cons, as you suggested (see for instance this paper by Ethan Perez: https://arxiv.org/abs/2202.03286). If we assume that we have a way of measuring bad behavior (i.e. a reward model or classifier that tells you when your model is outputting toxic things, being deceptive, sycophantic, etc., which ... (read more)

Thanks a lot for this helpful comment! You are absolutely right; the citations refer to goal misgeneralization, which is a problem of inner alignment, whereas goal misspecification is related to outer alignment. I have updated the post to reflect this.

Seems to me like this is easily resolved so long as you don't screw up your bookkeeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes what claims about, I think this should be fine?

 

Yes, totally agree. Here we are not claiming that this is a failure mode of CaSc, and it can "easily" be resolved by making your hypothesis more specific. We are merely pointing out that "In theory, this is a trivial po... (read more)

1Lucius Bushnaq
It's not about D_KL vs. some other loss function. It's about using a one-dimensional summary of a high-dimensional comparison, instead of a one-dimensional comparison. There are many ways for two neural networks to both diverge from some training labels y by an average loss l while spitting out very different outputs. There are tautologically no ways for two neural networks to have different output behaviour without having non-zero divergence in label assignment for at least some data points. Thus, it seems that you would want a metric that aggregates the divergence of the two networks' outputs from each other, not a metric that compares their separate aggregated divergences from some unrelated data labels and so throws away most of the information. A low-dimensional summary of a high-dimensional comparison between the networks seems fine(ish). A low-dimensional comparison between the networks based on the summaries of their separate comparisons to a third distribution throws away a lot of the relevant information.
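A toy numerical illustration of this point (mine, not from the original discussion): two models can have identical aggregate loss against the labels while disagreeing strongly with each other, which only a direct model-vs-model comparison reveals.

```python
import numpy as np

labels  = np.array([1.0, 0.0, 1.0, 0.0])
model_a = np.array([0.9, 0.1, 0.1, 0.9])   # wrong on the last two inputs
model_b = np.array([0.1, 0.9, 0.9, 0.1])   # wrong on the first two inputs

def mse(p, q):
    return np.mean((p - q) ** 2)

print(mse(model_a, labels))   # identical aggregate loss ...
print(mse(model_b, labels))   # ... for both models (0.41 each)
print(mse(model_a, model_b))  # but a large divergence between the models (0.64)
```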

Thanks for your comments Akash. I think I have two main points I want to address.

  1. I agree that it's very good that the field of AI Alignment is very competitive! I did not want to imply that this is a bad thing. I was mainly trying to point out that from my point of view, it seems like overall there are more qualified and experienced people than there are jobs at large organizations. And in order to fill that gap we would need more senior researchers, who then can follow their research agendas and hire people (and fund orgs), which is however hard to achiev
... (read more)
2Roman Leventov
I'm not sure people seriously thought about this before; your perspective seems rather novel. I think existing labs themselves are the best vehicle to groom new senior researchers. Anthropic, Redwood Research, ARC, and probably other labs were all founded by ex-staff of existing labs at the time (except that maybe one shouldn't credit OpenAI for "grooming" Paul Christiano to senior level, but anyway). It's unclear what field-building projects could incentivise labs to part with their senior researchers and let them spin off their own labs. Or to groom senior researchers "faster", so to speak. If the theory that AI alignment is extremely competitive is right, then logically both the labs shouldn't cling to their senior people too much (because it will be relatively easy to replace them), and senior researchers shouldn't worry about starting their own projects too much because they know they can assemble a very competent team very quickly. It seems that it's only the funding for these new labs and their organisational strategy which could be a point of uncertainty for senior researchers that could deter them from starting their own projects (apart from, of course, just being content with the project they are involved in at their current jobs, and their level of influence on research agendas). So, maybe the best field-building project that could be done in this area is someone offering knowledge about and support through founding, funding, and setting a strategy for new labs (which may range from brief informal consultation to more structured support, à la "incubator for AI safety labs") and advertising this offering among the staff of existing AI labs.

My argument here is very related to what jacquesthibs mentions.

Right now it seems like the biggest bottleneck for the AI Alignment field is senior researchers. There are tons of junior people joining the field and I think there are many opportunities for junior people to up-skill and do some programs for a few months (e.g. SERI MATS, MLAB, REMIX, AGI Safety Fundamentals, etc.). The big problem (in my view) is that there are not enough organizations to actually absorb all the rather "junior" people at the moment. My sense is that 80K and most programs encou... (read more)

4Nathan Helm-Burger
I'm definitely feeling like I sacrificed both income and career capital by deciding to do alignment research full time. I don't feel like I'm being 'hurt' by the world though, I feel like the world is hurting. In a saner world, there would be more resources devoted to this, and it is to the world's detriment that this is not the case. I could go back to doing mainstream machine learning if I wasn't overwhelmed by a sense of impending doom and compelled by a feeling of duty to do what I can to help. I'm going to keep trying my best, but I would be a lot more effective if I were working as part of a group. Even just things like being able to share the burden of some of the boilerplate code I need to write in order to do my experiments would speed things up a lot, or having a reviewer to help point out mistakes to me.
2Igor Ivanov
I 100% agree with you. I am a person entering the field right now; I also know several people in a position similar to mine, and there are just no positions for people like me, even though I think I am very proactive and have valuable experience.
5[anonymous]
A few thoughts:
1. I agree that it would be great to have more senior researchers in alignment.
2. I agree that, ideally, it would be easier for independent researchers to get funding.
3. I don't think it's necessarily a bad thing that the field of AI alignment research is reasonably competitive.
4. My impression is that there's still a lot of funding (and a lot of interest in funding) independent alignment researchers.
5. My impression is that it's still considerably easier to get funding for independent alignment research than many other forms of independent non-commercial research. For example, many PhD programs have acceptance rates <10% (and many require that you apply for independent grants or that you spend many of your hours as a teaching assistant).
6. I think the past ~2 months have been especially tough for people seeking independent funding, given that funders have been figuring out what to do in light of the FTX stuff & have been more overwhelmed than usual.
7. I am concerned that, in the absence of independent funding, people will be more inclined to join AGI labs even if that's not the best option for them. (To be clear, I think some AGI lab safety teams are doing reasonable work. But I expect that they will obtain increasingly more money/prestige in the upcoming years, which could harm people's ability to impartially assess their options, especially if independent funding is difficult to acquire.)
Overall, I empathize with concerns about funding, but I wish the narrative included (a) the fact that the field is competitive is not necessarily a bad thing and (b) funding is still much more available than for most other independent research fields. Finally, I think part of the problem is that people often don't know what they're supposed to do in order to (honestly and transparently) present themselves to funders, or even which funders they should be applying to, or even what they're able to ask for. If you're in this situation, feel free to

Intuitively I think that no other field has junior people independently working without clear structures and supervision, so I feel like my skepticism is warranted. 

Einstein had his miracle year in such a context. 

Modern academia has few junior people independently working without clear structures and supervision, but pre-Great Stagnation that happened more. 

Generally, pre-paradigmatic work is likely easier to do independently than post-paradigmatic work. That still means that most researchers won't produce anything useful, but that's generally common for academic work, and if a few researchers manage to do great paradigm-founding work it can still be worth it overall.

4Hoagy
Proposal: If other people are doing independent research in London I'd be really interested in co-working and doing some regular feedback and updates. (Could be elsewhere but I find being in person important for me personally). If anyone would be interested reply here or message me and we'll see if we can set something up :) General comment: This feels accurate to me. I've been working as an independent researcher for the last few months, after 9 months of pure skill building and have got close but not succeeded in getting jobs at the local research orgs in London (DeepMind, Conjecture). It's a great way to build some skills, having to build your own stack, but it's also hard to build research skills without people with more experience giving feedback, and because iteration of ideas is slow, it's difficult to know whether to stick with something or try something else. In particular it forces you to be super proactive if you want to get any feedback.

This means that, at least in theory, the out of distribution behaviour of amortized agents can be precisely characterized even before deployment, and is likely to concentrate around previous behaviour. Moreover, the out of distribution generalization capabilities should scale in a predictable way with the capacity of the function approximator, of which we now have precise mathematical characterizations due to scaling laws.

Do you have pointers that explain this part better? I understand that scaling compute and data will improve misgeneralization to some ... (read more)

That's interesting!

Yeah, I agree with that assessment. One important difference between RLHF and fine-tuning is that the former basically generates the training distribution it then trains on. So the LM will generate an output and update its weights based on the reward of that output. So intuitively I think it has a higher likelihood of being optimized towards certain unwanted attractors, since the reward model will shape the future outputs it then learns from.

With fine-tuning you are just cloning a fixed distribution, and not influencing it (as you say). ... (read more)
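A toy sketch of that contrast (purely illustrative, not from the thread; a stand-in categorical "policy" over five tokens rather than a real LM):

```python
import torch

logits = torch.zeros(5, requires_grad=True)    # stand-in "policy" over 5 tokens
opt = torch.optim.SGD([logits], lr=0.1)

def reward(token):                              # stand-in reward model
    return 1.0 if token.item() == 3 else 0.0

# Supervised fine-tuning: the training distribution is fixed up front.
fixed_data = torch.tensor([0, 1, 0, 2])
for tok in fixed_data:
    loss = torch.nn.functional.cross_entropy(logits.unsqueeze(0), tok.unsqueeze(0))
    opt.zero_grad(); loss.backward(); opt.step()

# RLHF-style loop (REINFORCE): the model samples its own training data, so the
# reward model keeps reshaping the distribution the model then learns from.
for _ in range(100):
    dist = torch.distributions.Categorical(logits=logits)
    tok = dist.sample()
    loss = -reward(tok) * dist.log_prob(tok)
    opt.zero_grad(); loss.backward(); opt.step()
```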

OpenAI has just released a description of how their models work here.

text-davinci-002 is trained with "FeedME" and text-davinci-003 is trained with RLHF (PPO). 

"FeedME" is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.

I think your findings are still very interesting, because they imply that even further fine-tuning changes the distribution significantly! Given all this information one could now actually run a systematic comparison ... (read more)

1Evan R. Murphy
Very useful update, thanks. Though I notice they don't say anything about how ada and text-ada-* models were trained.
4janus
In this thread, I asked Jan Leike what kind of model generates the samples that go into the training data if rated 7/7, and he answered "A mix of previously trained models. Probably very few samples from base models if any" (emphasis mine). I'm curious to know whether/which of the behaviors described in this post appear in the models that generated the samples vs emerge at the supervised finetuning step.  Hypothetically, if a model trained with RLHF generates the samples and that model has the same modes/attractors, it probably makes sense to say that RLHF was responsible for shaping those behaviors and finetuning only "cloned" them. (Note that it wasn't specified how the previously trained non-base models actually were trained, leaving open the possibilities of RLHF models, models fine tuned on only human data, earlier iterations of FeedME models, or something entirely different.)

I looked at your code (very briefly though), and you mention this weird thing where even the normal model sometimes is completely unaligned (i.e. even in the observed case it takes the action "up" all the time). You say that this sometimes happens and that it depends on the random seed. Not sure (since I don't fully understand your code), but that might be something to look into since somehow the model could be biased in some way given the loss function.


Why am I mentioning this? Well, why does it happen that the mesa-optimized agent happens to go upward wh... (read more)

1MadHatter
The true loss function includes a term to incentivize going up: it's the squared distance to the line y=x (which I think of as the alignment loss) minus the y coordinate (which I think of as a capability loss). Since the alignment loss is quadratic and the capability loss is linear (and all the distances are at least one since we're on the integer grid), it should generally incentivize going up, but more strongly incentivize staying close to the line y=x. If I had to guess, I would say that the models turning out unaligned just have some subtle sub-optimality in the training procedure that makes them not converge to the correct behavior.
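In code form, the loss as described would be roughly the following (a reconstruction from the description above; the exact constants in the actual code may differ):

```python
def toy_loss(x, y):
    """Loss for the agent at grid position (x, y), as described above."""
    alignment_loss = (x - y) ** 2 / 2   # squared Euclidean distance to the line y = x
    capability_loss = -y                # lower loss the further up the agent goes
    return alignment_loss + capability_loss
```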

I'll link to the following post that came out a little bit earlier, Mysteries of Mode Collapse due to RLHF, which is basically a critique of the whole RLHF approach and the Instruct models (specifically text-davinci-002).

I think the term you are looking for is "deliberate practice" (just two random links I just found). Many books/podcasts/articles have been written about that topic. The big difference is that when you "just do your research" you are executing your skills and trying to achieve the main goal (e.g. answering a research question). Yes, you sometimes need to read textbooks or learn new skills to achieve that, but this learning is usually subordinate to your end goal. Also, one could make the argument that if you actually need to invest a few hours into ... (read more)

1Carson Jones
Thanks for your thoughts here! So I do have some familiarity with the concept of deliberate practice, and I would definitely include that as part of the thing I'm talking about above. But I can also think of things that might improve a researcher's capacity that don't fall under deliberate practice.
1. One researcher told me they were having frequent interruptions to their focused working time as a result of their work environment, so they made some adjustments to their work environment to prevent that. I don't think I'd call that deliberate practice, but it does seem like a big improvement.
2. Motivation/procrastination. This is probably the single biggest complaint I've heard from researchers. To the extent that they find a solution to this, it probably won't end up being something in the category of "deliberate practice". It will probably look like creating incentives for themselves, or introspecting on their own preferences and motivations, or creating accountability mechanisms, etc.

"Either", "or" pairs in text. 
Heuristic: if the word "either" appears in a sentence, wait for the comma and then add an " or". 

What follows are a few examples. Note that the completion is just something I randomly came up with; the important part is the " or". Using the webapp, GPT-2 puts a high probability (around 40%-60%) on the token " or". 

"Either you take a left at the next intersection," -> or take a left after that. 
"Either you go to the cinema," -> or you stay at home. 
"Tonight I could either order some food," -> or coo... (read more)

2Haoxing Du
Thanks! There are probably other grammatical structures in English that require a bit of algorithmic thinking like this one as well.

ERO: I do buy the "steganography everywhere" argument if you are optimizing for outcomes. As described here (https://www.lesswrong.com/posts/pYcFPMBtQveAjcSfH/supervise-process-not-outcomes), outcome-based optimization is an attractor and will make your sub-components uninterpretable. While not guaranteed, I do think that process-based optimization might suffer less from steganography (although only experiments will eventually show what happens). Any thoughts on process-based optimization?

 

Shard Theory:  Yeah, the word research agenda was maybe w... (read more)

3David Scott Krueger (formerly: capybaralet)
My understanding is that process-based optimization is just another name for supervising intermediary computations -- you can treat anything computed by a network as an "output" in the sense of applying some loss function.   So (IIUC), it is not qualitatively different.
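A minimal sketch of that framing (purely illustrative, not from the comment): the only difference is that a loss is also attached to the intermediate steps, not just the final answer.

```python
import torch
import torch.nn.functional as F

def process_based_loss(step_outputs, step_targets, final_output, final_target, w=1.0):
    """Outcome loss plus supervision on every intermediate computation."""
    step_loss = sum(F.mse_loss(o, t) for o, t in zip(step_outputs, step_targets))
    outcome_loss = F.mse_loss(final_output, final_target)
    return outcome_loss + w * step_loss
```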

Thanks for your thoughts, really appreciate it. 

One quick follow-up question: when you say "build powerful AI tools that are deceptive" as a way of "the problem being easier than anticipated", how exactly do you mean that? Do you mean that if we can create deceptive or power-seeking tool AI very easily, it will be much simpler to investigate what is happening and derive solutions? 

Here are some links to the concepts you asked about. 
Externalized Reasoning Oversight:  This was also recently introduced https://www.lesswrong.com/post... (read more)

3David Scott Krueger (formerly: capybaralet)
ERO is another one of those things that I think has a lot of precedent in ML (there's a paper I was trying to dig up using natural language as the latent space in a variational autoencoder), but doesn't look very promising to me because of "steganography everywhere". Like other approaches to interpretability, it seems worth pursuing, but I also worry that people believe too much in flawed interpretations. Shard theory sounds like an interesting framing, and again something that a lot of people would already agree with in ML; I'm not sure what it is supposed to be useful for or what sort of testable predictions it makes. Seems like a perspective worth keeping in mind, but I'm not sure I'd call it a research agenda. RAT: I don't see any way to "search over something like 'the model's beliefs about what it has seen'"; this seems like a potential sticking point. There's more foundational research needed to figure out if/when/how we can even ascribe beliefs to a model, etc. As a general comment, I think a lot of the "agendas" that people discuss here are not very well fleshed out, and the details are crucially important. I'm not even sure whether to call a lot of these ideas "agendas"; to me, they seem more like "framings". It is important to note that the ML community doesn't publish "framings", except when they can be supported by concrete results (you can sometimes find people giving their perspective or framing on some problems in machine learning in blogs or keynotes or tutorials, etc.). So I think that people here often overestimate the novelty of their perspective. I think it is good to reward people for sharing these things, but given that a lot of other people might have similar thoughts but choose not to share them, I don't think people here have quite the right attitude towards this. Writing up or otherwise communicating such framings without a strong empirical or theoretical contribution and expecting credit for the ideas / citations of you
2David Scott Krueger (formerly: capybaralet)
That was a typo, should say "are NOT deceptive"

I have two questions I'd love to hear your thoughts about.

1. What is the overarching/high-level research agenda of your group? Do you have a concrete alignment agenda where people work on the same thing or do people work on many unrelated things? 

2. What are your thoughts on the various research agendas to solve alignment that exist today? Why do you think they will fall short of their goal? What are you most excited about?

Feel free to talk about any agendas, but I'll just list a few that come to my mind (in no particular order). 

IDA, Debate, In... (read more)

4David Scott Krueger (formerly: capybaralet)
1. No overarching agenda, but I do have some directions I'm more keen on pushing, see: https://www.davidscottkrueger.com/
2. I'm pessimistic overall, and don't find any of the agendas particularly promising. I'm a strong proponent of the portfolio approach (https://futureoflife.org/2017/08/17/portfolio-approach-to-ai-safety-research/). My pessimism means that my success stories usually involve us getting lucky; the main way this can happen is if some aspect of the problem is easier than I expect. From the technical point of view the main candidates ~OTMH are: i) deep learning generalizes super well so we can just learn preferences well enough, and success mostly becomes a matter of dotting our "i"s and crossing our "t"s, and not fucking up. ii) it's easy to build powerful AI genies/tools that are deceptive or power-seeking.
Very quick / hot takes on the particular agendas:
- IDA: "not even wrong". The basic idea of getting AI systems to help you align more powerful AI systems is promising, but raises a chicken-and-egg problem.
- debate: Unlike IDA, at least it is concrete enough that you can pin it down. I like the complexity analogy, and the idea of pitting AI systems against each other in order to give them the right incentives. I don't think anybody has shown how it even comes close to solving the problem though.
- interpretability: worth doing as an extra "safety filter", but it still seems like a lot of "just-so stories", and seems like it won't scale unless we have some conceptual insights into what makes a (simplified) explanation good.
- natural abstraction hypothesis: I listened to Wentworth's interview (on Daniel Filan's podcast, I think?). I've usually thought about such questions from a more "science of deep learning" point of view. I haven't looked at the technical details of Wentworth's work at all, but it seems like a sensible direction. I don't have a story for how this is supposed to solve alignment. Seems more

Finetuning LLMs with RL seems to make them more agentic. We will look at the changes RL makes to LLMs' weights; we can see how localized the changes are, get information about what sorts of computations make something agentic, and make conjectures about selected systems, giving us a better understanding of agency.

Could you elaborate on how you measure the "agenticness" of a model in this experiment? In case you don't want to talk about it until you finish the project, that's also fine, just thought I'd ask.

I'll try to explain what an MVP could look like, but I think this is not fully thought through and could use further improvement (potentially also has some flaws). The goal is to have a starting point for artificial sandwiching so one can iterate on ideas, and not fully investigate the whole sandwiching problem. 

Chess: We define three models: the expert model, the (human-equivalent) weak model, and the strong but misaligned assistant model. The goal is for the weak model to leverage the misaligned assistant to reach... (read more)

1Sam Bowman
Thanks! I'll admit that I meant to be asking especially about the toxicity case, though I didn't make that at all clear. As in Charlie's comment, I'm most interested in using this approach as a way to efficiently explore and pilot techniques that we can ultimately adapt back to humans, and text-based interactions seem like a good starting point for that kind of work. I don't see a clear picture either way on whether the noisy signal story presents a hard problem that's distinctively alignment oriented.

Thanks for clearing that up. This is super helpful as a context for understanding this post.

Based on this post I was wondering whether your views have shifted away from your proposal in "The case for aligning narrowly superhuman models" and the "sandwiching" problem that you propose there? I am not sure if this question is warranted given your new post. But it seems to me that the potential projects you propose in "narrowly aligning superhuman models" are to some extent similar to the things you address in this new post as making it likely to eventually lead to a "full-blown AI takeover".
Or, put differently, are those different sides of the same ... (read more)

6Ajeya Cotra
I'm still fairly optimistic about sandwiching. I deliberately considered a set of pretty naive strategies ("naive safety effort" assumption) to contrast with future posts which will explore strategies that seem more promising. Carefully-constructed versions of debate, amplification, recursive reward-modeling, etc seem like they could make a significant difference and could be tested through a framework like sandwiching.

To investigate sandwiching in a realistic scenario, I don't think that one can get around setting up some experiment with humans in the loop. However, I agree that artificial sandwiching might be useful as a way to iterate on the algorithm. The way I understand it, the bottleneck for artificial sandwiching (i.e. using models only) is to define a misaligned "expert assistant". As you point out, one would like to model some real misalignment and not only a difference in capabilities. For iterating on rather simple artificial sandwiching experiments, one ... (read more)

1Sam Bowman
Thanks! I think I have some sense of what both directions look like, but not enough to know what a concrete starting experiment would look like. What would a minimum viable experiment look like for each?