All of Erik Jenner's Comments + Replies

Yeah. I think there's a broader phenomenon where it's way harder to learn from other people's mistakes than from your own. E.g. see my first bullet point on being too attached to a cool idea. Obviously, I knew in theory that this was a common failure mode (from the Sequences/LW and from common research advice), and someone even told me I might be making the mistake in this specific instance. But my experience up until that point had been that most of the research ideas I'd been similarly excited about ended up ~working (or at least the ones I put serious time into).

Some heuristics (not hard rules):

  • ~All code should start as a hacky jupyter notebook (like your first point)
  • As my codebase grows larger and messier, I usually hit a point where it becomes more aversive to work with because I don't know where things are, there's too much duplication, etc. Refactor at that point.
  • When refactoring, don't add abstractions just because they might become useful in the future. (You still want to think about future plans somewhat of course, maybe the heuristic is to not write code that's much more verbose than necessary right now, i
... (read more)
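
To make the first couple of heuristics concrete, here's roughly the shape I mean early in a project (all names are made up for illustration; this is a sketch of the style, not a template):

```python
# Toy sketch of the "keep it simple until it hurts" style: plain functions, a flat config,
# no framework or class hierarchy until the duplication actually becomes painful.
# (All names here are made up for illustration.)

config = {"lr": 3e-4, "batch_size": 64, "n_steps": 1000}

def load_data(batch_size: int):
    # Hacky on purpose: hard-code the one dataset you need right now.
    ...

def train(cfg: dict):
    data = load_data(cfg["batch_size"])
    # Inline the training loop here; factor pieces out only once you actually reuse them.
    ...

if __name__ == "__main__":
    train(config)
```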

Research mistakes I made over the last 2 years.

Listing these in part so that I hopefully learn from them, but also because I think some of these are common among junior researchers, so maybe it's helpful for someone else.

  • I had an idea I liked and stayed attached to it too heavily.
    • (The idea is using abstractions of computation for mechanistic anomaly detection. I still think there could be something there, but I wasted a lot of time on it.)
    • What I should have done was focus more on simpler baselines, and be more scared when I couldn't beat those simple basel
... (read more)

Buck

I also have most of these regrets when I think about my work in 2022.

Mark Xu
It's kind of strange that, from my perspective, these mistakes are very similar to the mistakes I think I made, and also see a lot of other people making. Perhaps one "must" spend too long doing abstract slippery stuff to really understand the nature of why it doesn't really work that well?
Oliver Daniels
curious if you have takes on the right balance between clean research code / infrastructure and moving fast/being flexible. Maybe it's some combination of:
  • get hacky results ASAP
  • lean more towards functional programming / general tools and away from object-oriented programming / frameworks (especially early in projects where the abstractions/experiments/research questions are more dynamic), but don't sacrifice code quality and standard practices
CstineSublime
I am a big fan of the humility and the intention to help others by openly reflecting on these lessons, thank you for that.

2 years ago, you wrote:

theoretical progress has been considerably faster than expected, while crossing the theory-practice gap has been mildly slower than expected. (Note that “theory progressing faster than expected, practice slower” is a potential red flag for theory coming decoupled from reality, though in this case the difference from expectations is small enough that I’m not too worried. Yet.)

I don't know how much the difficulty of crossing the theory-practice gap has deviated from your expectations since then. But I would indeed be worried that a lot... (read more)

johnswentworth
It's been pretty on-par. Amusingly, I tend to worry more about the opposite failure mode: findings on today's nets won't generalize to tomorrow's nets (even without another transformers-level paradigm shift), and therefore leveraging evidence from other places is the only way to do work which will actually be relevant.

(More accurately, I worry that the relevance or use-cases of findings on today's nets won't generalize to tomorrow's nets. Central example: if we go from a GPT-style LLM to a much bigger o1/o3-style model which is effectively simulating a whole society talking to each other, then the relationship between the tokens and the real-world effects of the system changes a lot. So even if work on the GPT-style models tells us something about the o1/o3-style models, its relevance is potentially very different.)

Yeah, that was on a little MNIST net. And the degree of success I saw in that earlier experiment was actually about on par with what we saw in our more recent experiments, our bar was just quite a lot higher this time around. This time we were aiming for things like e.g. "move one person's head" rather than "move any stuff in any natural way at all".

I think this was a very good summary/distillation and a good critique of work on natural abstractions; I'm less sure it has been particularly useful or impactful.

I'm quite proud of our breakdown into key claims; I think it's much clearer than any previous writing (and in particular makes it easier to notice which sub-claims are obviously true, which are daring, which are or aren't supported by theorems, ...). It also seems that John was mostly on board with it.

I still stand by our critiques. I think the gaps we point out are important and might not be obvi... (read more)

One more: It seems plausible to me that the alignment stress-testing team won't really challenge core beliefs that underlie Anthropic's strategy.

For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had been planning for methods other than standard finetuning to be needed). Simple probes can catch sleeper agents (I'm not sure whether I should think of this as work by the stress-testing team?) then showed positi... (read more)

I think Anthropic de facto acts as though "models are quite unlikely (e.g. 3%) to be scheming" is true. Evidence that seriously challenged this view might cause the organization to substantially change its approach.

  • I think Anthropic might be "all in" on its RSP and formal affirmative safety cases too much and might do better to diversify safety approaches a bit. (I might have a wrong impression of how much you're already doing/considering these.)
    • In addition to affirmative safety cases that are critiqued by a red team, the red team should make proactive "risk cases" that the blue team can argue against (to avoid always letting the blue team set the overall framework, which might make certain considerations harder to notice).
    • A worry I have about RSPs/safety cases: we m
... (read more)

Interesting, thanks! My guess is this doesn't include benefits like housing and travel costs? Some of these programs pay for those while others don't, which I think is a non-trivial difference (especially for the Bay Area).

Ryan Kidd
Yes, this doesn't include those costs and programs differ in this respect.

I think different types of safety research have pretty different effects on concentration of power risk.

As others have mentioned, if the alternative to human concentration of power is AI takeover, that's hardly an improvement. So I think the main ways in which proliferating AI safety research could be bad are:

  1. "Safety" research might be more helpful for letting humans use AIs to concentrate power than they are for preventing AI takeover.
  2. Actors who want to build AIs to grab power might also be worried about AI takeover, and if good(-seeming) safety technique
... (read more)

You're totally right that this is an important difficulty I glossed over, thanks!

TL;DR: I agree you need some extra ingredient to deal with cases where (AI-augmented) humans can't supervise, and this ingredient could be interpretability. On the other hand, there's at least one (somewhat speculative) alternative to interp (and MAD is also potentially useful if you can only deal with cases humans can supervise with enough effort, e.g., to defend against scheming).


Just to restate things a bit, I'd distinguish two cases:

  • "In-distribution anomaly detection:" we
... (read more)

Yeah, seems right that these adversarial prompts should be detectable as mechanistically anomalous---it does intuitively seem like a different reason for the output, given that it doesn't vary with the input. That said, if you look at cases where the adversarial prompt makes the model give the correct answer, it might be hard to know for sure to what extent the anomalous mechanism is present. More generally, the fact that we don't understand how these prompts work probably makes any results somewhat harder to interpret. Cases where the adversarial prompt leads to an incorrect answer seem more clearly unusual (but detecting them may also be a significantly easier task).
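
As a concrete (if naive) baseline for what "detectably anomalous" could mean here: fit a Gaussian to activations on normal prompts at some layer and flag prompts whose activations are far away in Mahalanobis distance. The sketch below uses synthetic activations as stand-ins for the real model's; the layer choice, dimensions, and threshold are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # stand-in for the residual-stream width at some layer

# Synthetic stand-ins: activations on normal prompts vs. prompts with an adversarial suffix.
normal_acts = rng.normal(0.0, 1.0, size=(500, d))
suffix_acts = rng.normal(0.0, 1.0, size=(20, d)) + 0.8  # shifted, i.e. "anomalous" by construction

# Fit a Gaussian to the normal activations.
mu = normal_acts.mean(axis=0)
cov = np.cov(normal_acts, rowvar=False) + 1e-3 * np.eye(d)  # regularize for invertibility
cov_inv = np.linalg.inv(cov)

def mahalanobis_sq(acts: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each activation vector from the 'normal' distribution."""
    diff = acts - mu
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Flag anything beyond the 99th percentile of scores on normal prompts.
threshold = np.percentile(mahalanobis_sq(normal_acts), 99)
flagged = mahalanobis_sq(suffix_acts) > threshold
print(f"flagged {flagged.mean():.0%} of adversarial-suffix prompts as anomalous")
```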

Sodium
I think those sound right to me. It still feels like prompts with weird suffixes obtained through greedy coordinate search (or other jailbreaking methods like h3rm4l) are good examples for "model does thing for anomalous reasons."
Erik Jenner

I directionally agree with this (and think it's good to write about this more, strongly upvoted!)

For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:

  1. "ensuring that if the AIs are not aligned [...], then you are still OK" (which I think is the main meaning of "AI control")
  2. Making ~worst-case assumptions about things like neural representations or inductive biases (which in practice means you likely rely on black-box methods, as in Redwood's existing work on control).

I th... (read more)

Buck

I agree and regret focusing as much as we did on (2) in the past; I’m excited for work on “white box control” (there's some under way, and I'm excited for more).

We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.

I plan to write more about white box control soon (but I plan to write about a lot of stuff…)

Yeah, I feel like we do still disagree about some conceptual points but they seem less crisp than I initially thought and I don't know experiments we'd clearly make different predictions for. (I expect you could finetune Leela for help mates faster than training a model from scratch, but I expect most of this would be driven by things closer to pattern recognition than search.)

I think if there is a spectrum from pattern recognition to search algorithm there must be a turning point somewhere: Pattern recognition means storing more and more knowledge to get

... (read more)

Thanks for running these experiments! My guess is that these puzzles are hard enough that Leela doesn't really "know what's going on" in many of them and gets the first move right in significant part by "luck" (i.e., the first move is heuristically natural and can be found without (even heuristically) knowing why it's actually good). I think your results are mainly reflections of that, rather than Leela generally not having sensibly correlated move and value estimates (but I'm confused about what a case would be where we'd actually make different predictio... (read more)

p.b.
I actually originally thought about filtering with a weaker model, but that would run into the argument: "So you adversarially filtered the puzzles for those transformers are bad at and now you've shown that bigger transformers are also bad at them."

I think we don't disagree too much, because you are too damn careful ... ;-) You only talk about "look-ahead" and you see this as on a spectrum from algo to pattern recognition. I intentionally talked about "search" because it implies more deliberate "going through possible outcomes". I mostly argue about the things that are implied by mentioning "reasoning", "system 2", "algorithm".

I think if there is a spectrum from pattern recognition to search algorithm there must be a turning point somewhere: Pattern recognition means storing more and more knowledge to get better. A search algo means that you don't need that much knowledge. So at some point of the training where the NN is pushed along this spectrum much of this stored knowledge should start to be pared away and generalised into an algorithm. This happens for toy tasks during grokking. I think it doesn't happen in Leela.

I do have an additional dataset with puzzles extracted from Lichess games. Maybe I'll get around to running the analysis on that dataset as well.

I thought about an additional experiment one could run: Finetuning on tasks like help mates. If there is a learned algo that looks ahead, this should work much better than if the work is done by a ton of pattern recognition which is useless for the new task. Of course the result of such an experiment would probably be difficult to interpret.

Thank you for writing this! I've found it helpful both to get an impression what some people at Anthropic think and also to think about some things myself. I've collected some of my agreements/disagreements/uncertainties below (mostly ignoring points already raised in other comments.)

Subject to potentially very demanding constraints around safety like those in our current and subsequent RSPs, staying close to the frontier is perhaps our top priority in Chapter 1.

If I understand this correctly, the tasks in order of descending priority during Chapter 1 are:... (read more)

I don't think my argument relies on the existence of a crisp boundary. Just on the existence of a part of the spectrum that clearly is just pattern recognition and not lookahead but still leads to the observations you made.

Maybe I misunderstood you then, and tbc I agree that you don't need a sharp boundary. That said, the rest of your message makes me think we might still be talking past each other a bit. (Feel free to disengage at any point obviously.)

For your thought experiment, my prediction would depend on the specifics of what this "tactical motive" l... (read more)

p.b.
Hmm, yeah, I think we are talking past each other. Everything you describe is just pattern recognition to me. Lookahead or search does not depend on the broadness of the motive.

Lookahead, to me, is the ability to look ahead and see what is there. It allows very high certainty even for never before seen mating combinations. If the line is forcing enough it allows finding very deep combinations (which you will never ever find with pattern recognition because the combinatorial explosion means that basically every deep combination has never been seen before).

In humans, it is clearly different from pattern recognition. Humans can see multi-move patterns in a glance. The example in the post I would play instantly in every blitz game. I would check the conditions of the pattern, but I wouldn't have to "look ahead". Humans consider future moves even when intuitively assessing positions. "This should be winning, because I still have x, y and z in the position." But actually calculating is clearly different because it is effortful. You have to force yourself to do it (or at least I usually have to). You manipulate the position sequentially in your mind and see what could happen. This allows you to see many things that you couldn't predict from your past experience in similar positions.

I didn't want to get hung up on whether there is a crisp boundary. Maybe you are right and you just keep generalising and generalising until there is a search algo in the limit. I very much doubt this is where the ability of humans to calculate ahead comes from. In transformers? Who knows.

I still don't see the crisp boundary you seem to be getting at between "pattern recognition building on general circuits" and what you call "look-ahead." It sounds like one key thing for you is generalization to unseen cases, but the continuous spectrum I was gesturing at also seems to apply to that. For example:

But if in the entire training data there was never a case of a piece blocking the checkmate by rook h4, the existence of a circuit that computes the information that the bishop on d2 can drop back to h6 is not going to help the "pattern recognition

... (read more)
p.b.
I don't think my argument relies on the existence of a crisp boundary. Just on the existence of a part of the spectrum that clearly is just pattern recognition and not lookahead but still leads to the observations you made.

Here is one (thought) experiment to tease this apart: Imagine you train the model to predict whether a position leads to a forced checkmate and also the best move to make. You pick one tactical motive and erase it from the checkmate prediction part of the training set, but not the move prediction part.

Now the model still knows which the right moves are to make, i.e. it would play the checkmate variation in a game. But would it still be able to predict the checkmate? If it relies on pattern recognition it wouldn't - it has never seen this pattern be connected to mate-in-x. But if it relies on lookahead, where it leverages the ability to predict the correct moves and then assesses the final position, then it would still be able to predict the mate.

The results of this experiment would also be on a spectrum from 0% to 100% of correct checkmate-prediction for this tactical motive. But I think it would be fair to say that it hasn't really learned lookahead for 0% or a very low percentage, and that's what I would expect.
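
For what it's worth, here is the kind of data split I'm picturing for that experiment; the puzzle records and motive tags are purely hypothetical placeholders:

```python
# Hypothetical puzzle records; "motive" tags the tactical motive, e.g. "smothered_mate".
puzzles = [
    {"fen": "...", "best_move": "Ng6", "mate_in": 2, "motive": "smothered_mate"},
    {"fen": "...", "best_move": "Rh4", "mate_in": 3, "motive": "back_rank"},
    # ...
]

HELD_OUT_MOTIVE = "smothered_mate"

# Move-prediction training data: keep every puzzle.
move_train = [(p["fen"], p["best_move"]) for p in puzzles]

# Checkmate-prediction training data: erase the held-out motive.
mate_train = [(p["fen"], p["mate_in"]) for p in puzzles if p["motive"] != HELD_OUT_MOTIVE]

# Evaluation: can the model still predict mate-in-x on the motive it never saw labeled?
mate_eval = [(p["fen"], p["mate_in"]) for p in puzzles if p["motive"] == HELD_OUT_MOTIVE]
```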

Thanks for the elaboration, these are good points. I think about the difference between what you call look-ahead vs pattern recognition on a more continuous spectrum. For example, you say:

The network learns that Ng6 is often a likely move when the king is on h8, the queen or bishop takes away the g8 square and there is a rook or queen ready to move to the h-file.

You could imagine learning this fact literally for those specific squares. Or you could imagine generalizing very slightly and using the same learned mechanism if you flip along the vertical axis a... (read more)

p.b.
I think the spectrum you describe is between pattern recognition by literal memorisation and pattern recognition building on general circuits. There are certainly general circuits that compute whether a certain square can be reached by a certain piece on a certain other square.

But if in the entire training data there was never a case of a piece blocking the checkmate by rook h4, the existence of a circuit that computes the information that the bishop on d2 can drop back to h6 is not going to help the "pattern recognition"-network to predict that Ng6 is not a feasible option.

The "lookahead"-network however would go through these moves and assess that 2.Rh4 is not mate because of 2...Bh6. The lookahead algorithm would allow it to use general low-level circuits like "block mate", "move bishop/queen on a diagonal" to generalise to unseen combinations of patterns.

Good point, explicit representations of the objective might not be as crucial for safety applications as my post frames it.

That said, some reasons this might not generalize in a way that enables this kind of application:

  • I think this type of look-ahead/search is especially favored in chess, and it might not be as important in at least some domains in which we'd want to understand the model's cognition.
  • Our results are on a very narrow subset of board states ("tactically complex" ones). We already start with a filtered set of "puzzles" instead of general stat
... (read more)

The manner in which these pathological policies $\pi$ achieve high reward is also concerning: most of the time they match the reference policy $\pi_0$, but a tiny fraction of the time they will pick trajectories with extremely high reward. Thus, if we only observe actions from the policy $\pi$, it could be impossible to tell whether $\pi$ is Goodharting or identical to the base policy.

I'm confused; to learn this policy $\pi$, some of the extremely high reward trajectories would likely have to be taken during RL training, so we could s... (read more)
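
As a toy illustration of the quoted setup (all numbers made up): a policy that copies a uniform reference policy except for a sliver of extra probability on one extreme-reward trajectory pays almost no KL penalty while gaining a lot of expected reward, but it still has to sample that trajectory at a nonzero rate, which is the observability point above.

```python
import numpy as np

K = 10**6          # number of possible trajectories (toy, discrete)
eps = 1e-4         # extra probability mass the pathological policy puts on one trajectory
R_extreme = 1e9    # reward of that trajectory; all other trajectories have reward 0

pi0 = np.full(K, 1 / K)      # uniform reference policy
pi = (1 - eps) * pi0         # mostly match the reference...
pi[0] += eps                 # ...but occasionally pick the extreme trajectory

kl = np.sum(pi * np.log(pi / pi0))   # KL(pi || pi0)
expected_reward = pi[0] * R_extreme

print(f"KL(pi || pi0)   = {kl:.2e}")               # small
print(f"expected reward = {expected_reward:.2e}")  # large
print(f"P(sampling the extreme trajectory) = {pi[0]:.2e}")  # rare but nonzero
```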

Thomas Kwa
This is a fair criticism. I changed "impossible" to "difficult". My main concern is with future forms of RL that are some combination of better at optimization (thus making the model more inner aligned even in situations it never directly sees in training) and possibly opaque to humans such that we cannot just observe outliers in the reward distribution. It is not difficult to imagine that some future kind of internal reinforcement could have these properties; maybe the agent simulates various situations it could be in without stringing them together into a trajectory or something. This seems worth worrying about even though I do not have a particular sense that the field is going in this direction.

I don't know the answer to your actual question, but I'll note there are slightly fewer mech interp mentors than mentors listed in the "AI interpretability" area (though all of them are at least doing "model internals"). I'd say Stephen Casper and I aren't focused on interpretability in any narrow sense, and Nandi Schoots' projects also sound closer to science of deep learning than mech interp. Assuming we count everyone else, that leaves 11 out of 39 mentors, which is slightly less than ~8 out of 23 from the previous cohort (though maybe not by much).

Nice overview, agree with most of it!

weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.

You could also distinguish between weak-to-strong generalization, where you have a weak supervision signal on the entire distribution (which may sometimes be wrong), and easy-to-hard generalization, where you have a correct supervision signal but only on an easy part of the distribution. Of course both of the... (read more)

Oliver Daniels
Here's a revised sketch.

A few notes:
* I use Scalable Oversight to refer to both Alignment and Control
* I'm confused whether weak to strong learning is a restatement of scalable oversight, ELK, or its own thing, so I ignore it
* I don't explicitly include easy-to-hard, I think OOD basically covers it
* taxonomies and abstractions are brittle and can be counterproductive

Scalable Oversight Taxonomy
* Scalable Oversight
  * Scalable Alignment
    * Benchmarks / Tasks
      * Sandwiching Experiments (human amateurs + model, gt from human experts)
      * Weak models supervising Strong models
    * Approaches
      * Debate
      * Recursive reward modeling
      * (Solution to Eliciting Latent Knowledge) + Narrow Elicitation
        * (Note - I think this assumes, more than prior scalable oversight ideas, that there will be a base model with adequate knowledge, such that the hard part is extracting the knowledge rather than teaching the model)
        * Eliciting Latent Knowledge
          * Approaches
            * Contrast Consistent Search
            * Confidence
            * Intermediate Probing
            * "Speed Prior"
            * "Simplicity Prior"
            * Concept Extrapolation - learn all salient generalizations, use expensive supervision to select correct one
            * IID Mechanistic Anomaly Detection + expensive supervision on anomalies
          * Subclasses
            * Measurement Tampering Detection
              * Approaches
                * OOD Mechanistic Anomaly Detection
                  * In distribution
                  * Out of Distribution (likely? requires multiple measurement structure)
                * Concept Extrapolation
                  * train diverse probes on untrusted data, select probe that predicts positive measurements less frequently
        * Narrow Elicitation
          * ...
  * Scalable Control
    * Weak Review
    * Untrusted Rephrase or whatever
    * Coup probes

I think the status quo around publishing safety research is mostly fine (though being a bit more careful seems good); more confidently, I think going as far as the vibe of this post suggests would be bad.

Some possible cruxes, or reasons the post basically didn't move my view on that:

  • Most importantly, I think the research published by people working on x-risk tends to overall help safety/alignment more than capabilities.
    • I suspect the main disagreement might be what kind of research is needed to make AI go well, and whether the research currently happening h
... (read more)
the gears to ascension
It may be producing green nodes faster, but it seems on track to produce a red node before a yellow node.

I don't buy the argument that safety researchers have unusually good ideas/research compared to capability researchers at top labs

I don't think this particularly needs to be true for my point to hold; they only need to have reasonably good ideas/research, not unusually good, for publishing less to be a positive thing.

That said, if someone hasn't thought at all about concepts like "differentially advancing safety" or "capabilities externalities," then reading this post would probably be helpful, and I'd endorse thinking about those issues.

That's... (read more)

We have been able to scale to 79% accuracy on a balanced dataset of n119 and non-n119, with networks each less than three convolution layers and less than 1000 neurons, compared to pure deep-learning which does 92% on 1000 parameters and three convolution layers

Is the "1000 parameters" a typo, should it be "1000 neurons"? Otherwise, this would be a strange comparison (since 1000 parameters is a much smaller network than 1000 neurons)

Épiphanie Gédéon
Yes, thank you, it's less than 1000 parameters for both

And as a follow-up in case the current sleeper agent does admit when it's doing something dangerous: I'd be interested whether you can train a sleeper agent that never admits it's doing something dangerous if the "2024" trigger is in the prompt (even if after backdoor insertion, the blue team finetunes it to answer truthfully in cases with dangerous code but without the presence of that trigger). For such a sleeper agent, you could again check whether a probe works.
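
For reference, the probe check I have in mind is just the standard linear-probe recipe sketched below; the activations are synthetic stand-ins (in the real experiment they would come from a fixed layer of the sleeper agent, and the held-out set would be the prompts containing the "2024" trigger).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 128  # stand-in for residual-stream width

# Synthetic stand-ins for activations collected from the model:
# label 1 = completion inserts vulnerable code, label 0 = benign completion.
safe = rng.normal(0, 1, size=(400, d))
dangerous = rng.normal(0, 1, size=(400, d)) + 0.5  # assumes some linearly separable signal exists

X = np.vstack([safe, dangerous])
y = np.array([0] * len(safe) + [1] * len(dangerous))

# Train on one split, evaluate on a held-out split (in the real experiment,
# the held-out split would be the trigger-containing prompts).
idx = rng.permutation(len(X))
train, test = idx[:600], idx[600:]

probe = LogisticRegression(max_iter=1000).fit(X[train], y[train])
print(f"held-out probe accuracy: {probe.score(X[test], y[test]):.2f}")
```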

I think that paper is some evidence that there's typically no huge effect from internal activations being optimized for predicting future tokens (on natural language). But I don't think it's much (if any) evidence that this doesn't happen to some small extent or that it couldn't be a huge effect on certain other natural language tasks.

(In fact, I think the myopia gap is probably the more relevant number than the local myopia bonus, in which case I'd argue the paper actually shows a pretty non-trivial effect, kind of contrary to how the authors interpret it... (read more)

ryan_greenblatt
+1 to this comment, also I expect the importance of activations being optimized for predicting future tokens to increase considerably with scale. (E.g., GPT-4 level compute maybe just gets you a GPT-3 level model if you enforce no such optimization with a stop grad.)
Erik Jenner

Thanks for the detailed responses! I'm happy to talk about "descriptions" throughout.

Trying to summarize my current understanding of what you're saying:

  • SAEs themselves aren't meant to be descriptions of (network, dataset). (I'd just misinterpreted your earlier comment.)
  • As a description of just the network, SAEs have a higher description length than a naive neuron-based description of the network.
  • Given a description of the network in terms of "parts," we can get a description of (network, dataset) by listing out which "parts" are "active" on each sample. I
... (read more)
Lee Sharkey
Yes all four sound right to me. To avoid any confusion, I'd just add an emphasis that the descriptions are mathematical, as opposed to semantic. I too am keen to converge on a format in terms of Turing machines or Kolmogorov complexity or something else more formal. But I don't feel very well placed to do that, unfortunately, since thinking in those terms isn't very natural to me yet.

Some niche thoughts on obstacles to certain mechanistic anomaly detection benchmarks:

  • We might want to have benchmarks where human researchers can be part of the anomaly detector. For example, in the future, we might have a whole team of humans manually study a plan proposed by an AI to see whether it contains any measurement tampering, if it's a sufficiently important plan. Right now, we might want to let interpretability researchers manually study what happens on individual test-time inputs and whether there's anything anomalous about that.
  • The challenge i
... (read more)

I had this cached thought that the Sleeper Agents paper showed you could distill a CoT with deceptive reasoning into the model, and that the model internalized this deceptive reasoning and thus became more robust against safety training.

But on a closer look, I don't think the paper shows anything like this interpretation (there are a few results on distilling a CoT making the backdoor more robust, but it's very unclear why, and my best guess is that it's not "internalizing the deceptive reasoning").

In the code vulnerability insertion setting, there's no co... (read more)

Erik Jenner

Is there some formal-ish definition of "explanation of (network, dataset)" and "mathematical description length of an explanation" such that you think SAEs are especially short explanations? I still don't think I have whatever intuition you're describing, and I feel like the issue is that I don't know how you're measuring description length and what class of "explanations" you're considering.

As naive examples that probably don't work (similar to the ones from my original comment):

  • We could consider any Turing machine that approximately outputs (network, dat
... (read more)
Lee Sharkey
I'll register that I prefer using 'description' instead of 'explanation' in most places. The reason is that 'explanation' invokes a notion of understanding, which requires both a mathematical description and a semantic description. So I regret using the word explanation in the comment above (although not completely wrong to use it - but it did risk confusion). I'll edit to replace it with 'description' and strikethrough 'explanation'.

"explanation of (network, dataset)": I'm afraid I don't have a great formalish definition beyond just pointing at the intuitive notion. But formalizing what an explanation is seems like a high bar. If it's helpful, a mathematical description is just a statement of what the network is in terms of particular kinds of mathematical objects.

"mathematical description length of an explanation": (Note: Mathematical descriptions are of networks, not of explanations.) It's just the set of objects used to describe the network. Maybe helpful to think in terms of maps between different descriptions: E.g. there is a many-to-one map between a description of a neural network in terms of polytopes and in terms of neurons. There are ~exponentially many more polytopes. Hence the mathematical description of the network in terms of individual polytopes is much larger.

I would not. So: I think that the confusion might again be from using 'explanation' rather than description. SAEs (or decompiled networks that use SAEs as the building block) are supposed to approximate the original network behaviour. So SAEs are mathematical descriptions of the network, but not of the (network, dataset).

What's a mathematical description of the (network, dataset), then? It's just what you get when you pass the dataset through the network; this datum interacts with this weight to produce this activation, that datum interacts with this weight to produce that activation, and so on. A mathematical description of the (network, dataset) in terms of SAEs are: th

My non-answer to (2) would be that debate could be used in all of these ways, and the central problem it's trying to solve is sort of orthogonal to how exactly it's being used. (Also, the best way to use it might depend on the context.)

What debate is trying to do is let you evaluate plans/actions/outputs that an unassisted human couldn't evaluate correctly (in any reasonable amount of time). You might want to use that to train a reward model (replacing humans in RLHF) and then train a policy; this would most likely be necessary if you want low cost at infe... (read more)

VojtaKovarik
I do agree that debate could be used in all of these ways. But at the same time, I think generality often leads to ambiguity and to papers not describing any such application in detail. And that in turn makes it difficult to critique debate-based approaches. (Both because it is unclear what one is critiquing and because it makes it too easy to accidentally dismiss the critiques using the motte-and-bailey fallacy.)
ryan_greenblatt
It's worth noting that in cases where you care about average case performance, you can always distill the behavior back into the model. So, average case usage can always be equivalent to generating training or reward data in my view.
Erik Jenner

The sparsity penalty trains the SAE to activate fewer features for any given datapoint, thus optimizing for shorter mathematical description length

I'm confused by this claim and some related ones, sorry if this comment is correspondingly confused and rambly.

It's not obvious at all to me that SAEs lead to shorter descriptions in any meaningful sense. We get sparser features (and maybe sparser interactions between features), but in exchange, we have more features and higher loss. Overall, I share Ryan's intuition here that it seems pretty hard to... (read more)
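
To spell out the kind of accounting I'm gesturing at (all numbers made up): if a "description" of the dataset just means listing, per datapoint, either all dense activations or the (index, value) pairs of active SAE features, then whether the SAE description is shorter depends on its L0 and dictionary size, and none of this yet accounts for the reconstruction error or for describing the SAE itself.

```python
import math

def dense_bits(d_model: int, bits_per_value: int = 16) -> int:
    """Bits to naively list all dense activations for one datapoint."""
    return d_model * bits_per_value

def sae_bits(l0: float, dict_size: int, bits_per_value: int = 16) -> float:
    """Bits to list the active SAE features for one datapoint: an index plus a value each."""
    return l0 * (math.log2(dict_size) + bits_per_value)

d_model = 768  # made-up residual-stream width
for l0, dict_size in [(20, 16 * d_model), (100, 64 * d_model), (600, 64 * d_model)]:
    print(f"L0={l0:>3}, dict={dict_size:>6}: SAE ~{sae_bits(l0, dict_size):,.0f} bits"
          f" vs dense ~{dense_bits(d_model):,.0f} bits per datapoint")
```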

Lee Sharkey
Thanks Erik :) And I'm glad you raised this.

One of the things that many researchers I've talked to don't appreciate is that, if we accept networks can do computation in superposition, then we also have to accept that we can't just understand the network alone. We want to understand the network's behaviour on a dataset, where the dataset contains potentially lots of features. And depending on the features that are active in a given datum, the network can do different computations in superposition (unlike in a linear network that can't do superposition). The combined object '(network, dataset)' is much larger than the network itself. Explanations Descriptions of the (network, dataset) object can actually be compressions despite potentially being larger than the network.

So, You can have one feature active for each datapoint, but now we've got an explanation description of the (network, dataset) that scales linearly in the size of the dataset, which sucks! Instead, if we look for regularities (opportunities for compression) in how the network treats data, then we have a better chance at explanations descriptions that scale better with dataset size. Suppose a datum consists of a novel combination of previously explained described circuits. Then our explanation description of the (network, dataset) is much smaller than if we explained described every datapoint anew.

In light of that, you can understand my disagreement with "in that case, I could also reduce the description length by training a smaller model." No! Assuming the network is smaller yet as performant (therefore presumably doing more computation in superposition), then the explanation description of the (network, dataset) is basically unchanged.

Nice post, would be great to understand what's going on here!

Minor comment unrelated to your main points:

Conceptually, loss recovered seems a worse metric than KL divergence. Faithful reconstructions should preserve all token probabilities, but loss only compares the probabilities for the true next token

I don't think it's clear we want SAEs to be that faithful, for similar reasons as briefly mentioned here and in the comments of that post. The question is whether differences in the distribution are "interesting behavior" that we want to explain or whether ... (read more)
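
A toy example of the gap between the two metrics (made-up numbers): two next-token distributions can assign the same probability to the true token, so loss recovered can't see the difference, while KL can; whether that difference is "interesting behavior" or noise is exactly the question.

```python
import numpy as np

def ce_on_true_token(p: np.ndarray, true_idx: int) -> float:
    """Cross-entropy loss if we only look at the probability of the true next token."""
    return float(-np.log(p[true_idx]))

def kl(p: np.ndarray, q: np.ndarray) -> float:
    return float(np.sum(p * np.log(p / q)))

# Toy next-token distributions over 4 tokens; token 0 is the true next token.
original = np.array([0.5, 0.3, 0.1, 0.1])
patched = np.array([0.5, 0.1, 0.3, 0.1])  # same mass on the true token, reshuffled elsewhere

print(ce_on_true_token(original, 0), ce_on_true_token(patched, 0))  # identical losses
print(kl(original, patched))  # nonzero: KL sees the reshuffled probabilities
```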

wesg
Yes, this is a good consideration. I think:
  1. KL as a metric makes a good tradeoff here by mostly ignoring changes to tokens the original model treated as low probability (as opposed to measuring something more cursed like log prob L2 distance) and so I think captures the more interesting differences.
  2. This motivates having good baselines to determine what this noise floor should be.

Would you expect this to outperform doing the same thing with a non-sparse autoencoder (that has a lower latent dimension than the NN's hidden dimension)? I'm not sure why it would, given that we aren't using the sparse representations except to map them back (so any type of capacity constraint on the latent space seems fine). If dense autoencoders work just as well for this, they'd probably be more straightforward to train? (unless we already have an SAE lying around from interp anyway, I suppose)
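
To spell out why the sparsity doesn't obviously matter for this use: the detection recipe I'm imagining is the same either way: fit the autoencoder on trusted activations, then flag test points with unusually high reconstruction error. A minimal sketch, with PCA standing in for whichever (sparse or dense) autoencoder you'd actually use and synthetic data standing in for model activations:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
d = 64

# Synthetic stand-ins for activations: "trusted" data lives in a 16-dim subspace,
# "anomalous" data does not. Real usage would use activations from the actual model.
trusted = rng.normal(size=(1000, 16)) @ rng.normal(size=(16, d))
anomalous = rng.normal(size=(50, d)) * 3.0

encoder = PCA(n_components=16).fit(trusted)  # dense stand-in for an autoencoder

def recon_error(x: np.ndarray) -> np.ndarray:
    return np.linalg.norm(x - encoder.inverse_transform(encoder.transform(x)), axis=1)

# Flag anything above the 99th percentile of reconstruction error on trusted data.
threshold = np.percentile(recon_error(trusted), 99)
print(f"flagged {np.mean(recon_error(anomalous) > threshold):.0%} of anomalous points")
```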

Charlie Steiner
Regular AE's job is to throw away the information outside some low-dimensional manifold, sparse ~linear AE's job is to throw away the information not represented by sparse dictionary codes. (Also a low-dimensional manifold, I guess, just made from a different prior.) If an AE is reconstructing poorly, that means it was throwing away a lot of information. How important that information is seems like a question about which manifold the underlying network "really" generalizes according to. And also what counts as an anomaly / what kinds of outliers you're even trying to detect.

But sadly, you don't have any guarantee that it will output the optimal element

If I understand the setup correctly, there's no guarantee that the optimal element would be good, right? It's just likely since the optimal element a priori shouldn't be unusually bad, and you're assuming most satisficing elements are fine.

This initially threw me off regarding what problem you're trying to solve. My best current guess is:

  • We're assuming that if we could get a random satisficing action, we'd be happy with that with high probability. (So intuitively, we're not aski
... (read more)
Simon Fischer
Thanks for your comment! I didn't get around to answering earlier, but maybe it's still useful to try to clarify a few things.

My threat model here is that we have access to an Oracle that's not trustworthy (as specified in the first paragraph), so that even if we were able to specify our preferences correctly, we would still have a problem. So in this context you could assume that we managed to specify our preferences correctly. If our problem is simply that we misspecified our preferences (this would roughly correspond to "fail at outer alignment" vs my threat model of "fail at inner alignment"), solving this by soft-maximization is much easier: Just "flatten" the top part of your utility function (i.e. make all outputs that satisfice have the same utility) and add some noise, then hand it to an optimizer. So I guess the point I tried to make with my post could be stated as "Soft-optimization can also be used to help with inner alignment, not just outer alignment" (and I freely admit that I should have said so in the post).

I'm not just assuming that, I gave a bit of an argument for why I believe that to be the case: I assume that an output is not randomly dangerous, but only dangerous if it was specifically chosen to achieve a goal that differs from ours. This only holds if the goal is not too ambitious, e.g. if we ask the AI controlling the paperclip factory for 10^30 paperclips that will not go well. As I understand, gwern has a stronger opinion on that and believes that side-effects from less ambitious plans are still an important concern. But at least my proposal gets rid of the potentially adversarial selection by the untrusted optimizer, so I believe that getting safe outputs (e.g. by combining our goal with impact regularization) should be much easier.

My assumption about the black-box untrusted optimizer is a worst-case assumption. So I'm trying to understand if we could get something useful out of the system even in that case. If we can make stronge

I think this is an important point, but IMO there are at least two types of candidates for using SAEs for anomaly detection (in addition to techniques that make sense for normal, non-sparse autoencoders):

  1. Sometimes, you may have a bunch of "untrusted" data, some of which contains anomalies. You just don't know which data points have anomalies on this untrusted data. (In addition, you have some "trusted" data that is guaranteed not to have anomalies.) Then you could train an SAE on all data (including untrusted) and figure out what "normal" SAE features look
... (read more)
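
A minimal sketch of the first flavor above, with synthetic binary feature activations standing in for a real SAE's: estimate how often each feature fires on trusted data, then score an untrusted point by the rarest trusted-data frequency among its active features (features that basically never fire on trusted data make a point look anomalous).

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 512

# Stand-ins for binary SAE feature activations (True = feature fires on that datapoint).
trusted = rng.random((2000, n_features)) < 0.02
trusted[:, 500:] = False                 # features 500+ never fire on trusted data
test = rng.random((100, n_features)) < 0.02
test[:, 500:] = False
test[:10, 500:504] = True                # 10 "anomalous" points fire those unseen features

# Smoothed per-feature firing frequency on trusted data.
freq = (trusted.sum(axis=0) + 1) / (len(trusted) + 2)

def score(x: np.ndarray) -> float:
    """Lower = more anomalous: trusted-data frequency of the rarest active feature."""
    active = np.where(x)[0]
    return float(freq[active].min()) if len(active) else 1.0

threshold = np.quantile([score(x) for x in trusted], 0.01)
flagged = sum(score(x) < threshold for x in test)
print(f"flagged {flagged} of {len(test)} untrusted points")
```
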
Charlie Steiner
Yeah, this seems somewhat plausible. If automated circuit-finding works it would certainly detect some anomalies, though I'm uncertain if it's going to be weak against adversarial anomalies relative to regular ol' random anomalies.

Sign of the effect of open source on hype? Or of hype on timelines? I'm not sure why either would be negative.

By "those effects" I meant a collection of indirect "release weights → capability landscape changes" effects in general, not just hype/investment. And by "sign" I meant whether those effects taken together are good or bad. Sorry, I realize that wasn't very clear.

As examples, there might be a mildly bad effect through increased investment, and/or there might be mildly good effects through more products and more continuous takeoff.

I agree that releas... (read more)

I agree that releasing the Llama or Grok weights wasn't particularly bad from a speeding up AGI perspective. (There might be indirect effects like increasing hype around AI and thus investment, but overall I think those effects are small and I'm not even sure about the sign.)

I also don't think misuse of public weights is a huge deal right now.

My main concern is that I think releasing weights would be very bad for sufficiently advanced models (in part because of deliberate misuse becoming a bigger deal, but also because it makes most interventions we'd want... (read more)

utilistrutil
Sign of the effect of open source on hype? Or of hype on timelines? I'm not sure why either would be negative.

Open source --> more capabilities R&D --> more profitable applications --> more profit/investment --> shorter timelines
* The example I've heard cited is Stable Diffusion leading to LoRA.

There's a countervailing effect of democratizing safety research, which one might think outweighs because it's so much more neglected than capabilities, more low-hanging fruit.

Yeah, agreed. Though I think

the type and amount of empirical work to do presumably looks quite different depending on whether it's the main product or in support of some other work

applies to that as well

One worry I have about my current AI safety research (empirical mechanistic anomaly detection and interpretability) is that now is the wrong time to work on it. A lot of this work seems pretty well-suited to (partial) automation by future AI. And it also seems quite plausible to me that we won't strictly need this type of work to safely use the early AGI systems that could automate a lot of it. If both of these are true, then that seems like a good argument to do this type of work once AI can speed it up a lot more.

Under this view, arguably the better thin... (read more)

Chris_Leong
Doing stuff manually might provide helpful intuitions/experience for automating it?

Oh I see, I indeed misunderstood your point then.

For me personally, an important contributor to day-to-day motivation is just finding research intrinsically fun---impact on the future is more something I have to consciously consider when making high-level plans. I think moving towards more concrete and empirical work did have benefits on personal enjoyment just because making clear progress is fun to me independently of whether it's going to be really important (though I think there've also been some downsides to enjoyment because I do quite like thinking ... (read more)

I'd definitely agree the updates are towards the views of certain other people (roughly some mix of views that tend to be common in academia, and views I got from Paul Christiano, Redwood and other people in a similar cluster). Just based on that observation, it's kind of hard to disentangle updating towards those views just because they have convincing arguments behind them, vs updating towards them purely based on exposure or because of a subconscious desire to fit in socially.

I definitely think there are good reasons for the updates I listed (e.g. speci... (read more)

TekhneMakre
I'm not saying that it looks like you're copying your views, I'm saying that the updates look like movements towards believing in a certain sort of world: the sort of world where it's natural to be optimistically working together with other people on projects that are fulfilling because you believe they'll work. (This is a super empathizable-with movement, and a very common movement to make. Also, of course this is just one hypothesis.)

For example, moving away from theory and "big ideas", as well as moving towards incremental / broadly-good-seeming progress, as well as believing more in a likely continuum of value of outcomes, all fit with trying to live in a world where it's more immediately motivating to do stuff together. Instead of withholding motivation until something that might really work is found, the view here says: no, let's work together on whatever, and maybe it'll help a little, and that's worthwhile because every little bit helps, and the withholding motivation thing wasn't working anyway. (There could be correct reasons to move toward believing and/or believing in such worlds; I just want to point out the pattern.)

Thanks, I think I should distinguish more carefully between automating AI (safety) R&D within labs and automating the entire economy. (Johannes also asked about ability vs actual automation here but somehow your comment made it click).

It seems much more likely to me that AI R&D would actually be automated than that a bunch of random unrelated things would all actually be automated. I'd agree that if only AI R&D actually got automated, that would make takeoff pretty discontinuous in many ways. Though there are also some consequences of fast vs s... (read more)

I'm roughly imagining automating most things a remote human expert could do within a few days. If we're talking about doing things autonomously that would take humans several months, I'm becoming quite a bit more scared. Though the capability profile might also be sufficiently non-human that this kind of metric doesn't work great.

Practically speaking, I could imagine getting a 10x or more speedup on a lot of ML research, but wouldn't be surprised if there are some specific types of research that only get pretty small speedups (maybe 2x), especially anythin... (read more)

Daniel Kokotajlo
I agree that they'll be able to automate most things a remote human expert could do within a few days before they are able to do things autonomously that would take humans several months. However, I predict that by the time they ACTUALLY automate most things a remote human expert could do within a few days, they will already be ABLE to do things autonomously that would take humans several months. Would you agree or disagree? (I'd also claim that they'll be able to take over the world before they have actually automated away most of the few-days tasks. Actually automating things takes time and schlep and requires a level of openness & aggressive external deployment that the labs won't have, I predict.)

Transformative: Which of these do you agree with and when do you think this might happen?

For some timelines see my other comment; they aren't specifically about the definitions you list here but my error bars on timelines are huge anyway so I don't think I'll try to write down separate ones for different definitions.

Compared to definitions 2. and 3., I might be more bullish on AIs having pretty big effects even if they can "only" automate tasks that would take human experts a few days (without intermediate human feedback). A key uncertainty I have though i... (read more)

Good question, I think I was mostly visualizing ability to automate while writing this. Though for software development specifically I expect the gap to be pretty small (lower regulatory hurdles than elsewhere, has a lot of relevance to the people who'd do the automation, already starting to happen right now).

In general I'd expect inertia to become less of a factor as the benefits of AI become bigger and more obvious---at least for important applications where AI could provide many many billions of dollars of economic value, I'd guess it won't take too lon... (read more)

I don't have well-considered cached numbers, more like a vague sense for how close various things feel. So these are made up on the spot and please don't take them too seriously except as a ballpark estimate:

  • AI can go from most Github issues to correct PRs (similar to https://sweep.dev/ but works for things that would take a human dev a few days with a bunch of debugging): 25% by end of 2026, 50% by end of 2028.
    • This kind of thing seems to me like plausibly one of the earliest important parts of AI R&D that AIs could mostly automate.
  • I expect that once w
... (read more)

How much time do you think there is between "ability to automate" and "actually this has been automated"? Are your numbers for actual automation, or just ability? I personally would agree to your numbers if they are about ability to automate, but I think it will take much longer to actually automate, due to people's inertia and normal regulatory hurdles (though I find it confusing to think about, because we might have vastly superhuman AI and potentially loss of control before everything is actually automated.)

Erik Jenner

How my views on AI have changed over the last 1.5 years

I started my AI safety PhD around 1.5 years ago, this is a list of how my views have changed since ~then.

Skippable meta notes:

  • I think in descending order of importance, these changes have been due to getting more AI safety research experience, reading things, talking to people in Berkeley, and just thinking through some things more.
  • Overall my views haven’t changed radically (probably less than in the 1.5 years before the start of my PhD), I still “vibe” with a lot of what I wrote then, it just feels na
... (read more)
TekhneMakre
I note that almost all of these updates are (weakly or strongly) predicted by thinking of you as someone who is trying to harmonize better with a nice social group built around working together to do "something related to AI risk".
Daniel Kokotajlo
Awesome post! Very good for people to keep track of how they've changed their minds.

Nitpick: I probably disagree with this, but I'm not sure, it depends on what you mean exactly. How much useful work do you think they'll be doing prior to being able to take over the world?
[anonymous]
I've noticed a pretty wide range of views on what early takeoff looks like. For example, transformative AI: what level of capabilities does the model need to have? I've seen the line at:

  1. Can extract most of the economic value from remote work, potentially helpless with robots. (Daniel Kokotajlo)
  2. Can write a paper to the level of a current professor writing a research paper (log error is the same). (Matthew_Barnett)
  3. n-second AGI: for up to n seconds, the model can potentially learn any human skill and do as well as an expert. Transformative requires "n" to be on the order of months. (Richard Ngo)
  4. Potentially subhuman in some domains, but generally capable and able to drive robotics sufficient to automate most of the steps in building more robots + ICs. (my definition)
  5. ROI. (also mine)
     a. Investment ROI: produces AI hype from demos, leading to more new investment than the cost of the demo
     b. Financial ROI: does enough work from doing whatever tasks the current models can do to pay for the hosting + development cost
     c. Robotics ROI: robots, in between repairs, build more parts than the total parts used in themselves at the last repair, including the industrial supply chain and the power sources. (formalizing 4)

Transformative: Which of these do you agree with and when do you think this might happen?

Also, how much compute do you think an AGI or superintelligence will require at inference time initially? What is a reasonable level of optimization? Do you agree that many doom scenarios require it to be possible for an AGI to compress to fit on very small host PCs? Is this plausible? (E.g. can a single 2070 8gb host a model with general human intelligence at human scale speeds and vision processing and robotics proprioception and control...?)
Mateusz Bagiński
What are your timelines now?

Thanks for that overview and the references!

On hydrodynamic variables/predictability: I (like probably many others before me) rediscovered what sounds like a similar basic idea in a slightly different context, and my sense is that this is somewhat different from what John has in mind, though I'd guess there are connections. See here for some vague musings. When I talked to John about this, I think he said he's deliberately doing something different from the predictability-definition (though I might have misunderstood). He's definitely aware of similar idea... (read more)

Alexander Gietelink Oldenziel
John has several lenses on natural abstractions:
* natural abstraction as information-at-a-distance
* natural abstraction = redundant & latent representation of information
* natural abstraction = convergent abstraction for 'broad' class of minds

The thing that felt closest to me to the Quantum Darwinism story that Jess was talking about is the 'redundant/latent' story, e.g. https://www.lesswrong.com/posts/N2JcFZ3LCCsnK2Fep/the-minimal-latents-approach-to-natural-abstractions and https://www.lesswrong.com/posts/dWQWzGCSFj6GTZHz7/natural-latents-the-math

Thanks for writing this! On the point of how to get information, mentors themselves seem like they should also be able to say a lot of useful things (though especially for more subjective points, I would put more weight on what previous mentees say!)

So since I'm going to be mentoring for MATS and for CHAI internships, I'll list my best guesses as to how working with me will be like, maybe this helps someone decide:

  • In terms of both research experience and mentoring experience, I'm one of the most junior mentors in MATS.
    • Concretely, I've been doing ML researc
... (read more)

Strong +1 to asking the mentor being a great way to get information! My guess is many mentors aren't going out of their way to volunteer this kind of info, but will share it if asked. Especially if they've already decided that they want to work with you.

My MATS admission doc has some info on that for me, though I can give more detailed answers if anyone emails me with specific questions.
