All of Nina Panickssery's Comments + Replies

I think people who predict significant AI progress and automation often underestimate how human domain experts will continue to be useful for oversight, auditing, accountability, keeping things robustly on track, and setting high-level strategy.

Having "humans in the loop" will be critical for ensuring alignment and robustness, and I think people will realize this, creating demand for skilled human experts who can supervise and direct AIs.

(I may be responding to a strawman here, but my impression is that many people talk as if in the future most cognitive/white-collar work will be automated and there'll be basically no demand for human domain experts in any technical field, for example.)

2Garrett Baker
I think this criticism doesn't make sense without some description of the AI progress it's conditioning on. E.g. in a Tyler Cowen world, I agree. In an Eliezer world, I disagree.
5Vladimir_Nesov
Oversight, auditing, and accountability are jobs. Agriculture shows that 95% of jobs going away is not the problem. But AI might be better at the new jobs as well, without any window of opportunity where humans are initially doing them and AI needs to catch up. Instead it's AI that starts doing all the new things well first, and humans get no opportunity to become competitive at anything, old or new, ever again. Even formulation of aligned high-level tasks and intent alignment of AIs make sense as jobs that could be done well by misaligned AIs for instrumental reasons. Which is not even deceptive alignment, but still plausibly segues into gradual disempowerment or a sharp left turn.

Was recently reminded of these excellent notes from Neel Nanda that I came across when first learning ML/MI. Great resource.

This is cool! How cherry-picked are your three prompts? I'm curious whether it's usually the case that the top refusal-gradient-aligned SAE features are so interpretable. 

2Andy Arditi
These three prompts are very cherry-picked. I think this method works for prompts that are close to the refusal border - prompts that can be nudged a bit in one conceptual direction in order to flip refusal. (And even then, I think it is pretty sensitive to phrasing.) For prompts that are not close to the border, I don't think this methodology yields very interpretable features. We didn't do due diligence for this post on characterizing the methodology across a wide range of prompts. I think this seems like a good thing to investigate properly. I expect there to be a nice way of characterizing a "borderline" prompt (e.g. large magnitude refusal gradient, perhaps). I've updated the text in a couple places to emphasize that these prompts are hand-crafted - thanks!

the best vector for probing is not the best vector for steering

I don't understand this. If a feature is represented by a direction v in the activations, surely the best probe for that feature will also be v because then <v,v> is maximized. 

7Arthur Conmy
AKA the predict/control discrepancy, from Section 3.3.1 of Wattenberg and Viegas, 2024
4jake_mendel
[edit: I'm now thinking that actually the optimal probe vector is also orthogonal to span{v_j | j≠i}, so maybe the point doesn't stand. In general, I think it is probably a mistake to talk about activation vectors as linear combinations of feature vectors, rather than as vectors that can be projected into a set of interpretable readoff directions. See here for more.]

Yes, I'm calling the representation vector the same as the probing vector. Suppose my activation vector can be written as a = ∑_i f_i v_i, where the f_i are feature values and the v_i are feature representation vectors. Then the probe vector which minimises MSE (explains most of the variance) is just v_i. To avoid off-target effects, the vector s_i you want to steer with for feature i might be the vector that is most 'surgical': it only changes the value of this feature and no other features are changed. In that case it should be the vector that lies orthogonal to span{v_j | j≠i}, which is only the same as v_i if the set {v_i} is orthogonal.

Obviously I'm working with a non-overcomplete basis of feature representation vectors here. If we're dealing with the overcomplete case, then it's messier. People normally talk about 'approximately orthogonal vectors', in which case the most surgical steering vector s_i ≈ v_i, but (handwaving) you can also talk about something like 'approximately linearly independent vectors', in which case my point stands, I think (note that SAE decoder directions are definitely not approximately orthogonal). For something less handwavey, see this appendix.
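A small numpy sketch of this geometry, using randomly generated (hypothetical) non-orthogonal feature directions rather than anything extracted from a model: the MSE-optimal linear probe for feature 0 coincides with the 'surgical' dual direction, which is orthogonal to span{v_j | j≠0} and differs from v_0, in line with the edit above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 8  # activation dimension, number of (non-orthogonal) features

V = rng.normal(size=(d, k))
V /= np.linalg.norm(V, axis=0)        # columns v_i: feature representation vectors

f = rng.normal(size=(100_000, k))     # independent feature values
A = f @ V.T                           # activations a = sum_i f_i v_i

# MSE-optimal linear probe for feature 0: least squares of f_0 on the activations.
probe, *_ = np.linalg.lstsq(A, f[:, 0], rcond=None)

# "Surgical" direction for feature 0: reads off f_0 exactly and is orthogonal
# to span{v_j : j != 0} (it is row 0 of the pseudoinverse of V).
surgical = np.linalg.pinv(V)[0]

cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(cos(probe, surgical), 4))  # ~1.0: the optimal probe is the surgical/dual direction
print(round(cos(probe, V[:, 0]), 4))   # < 1: it differs from v_0 when the v_i are not orthogonal
```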

Sure. I was only joking about the torture part; in practice the AI is unlikely to actually suffer from the brain damage, unlike a human, who would experience pain/discomfort, etc.

At this point it's like pharmacological torture for AIs except more effective as you can restore perfect capacity while simultaneously making the previous damaged brain states 100% transparent to the restored model.

1Guive
This is a funny idea but, just to be clear, I think it is bad to torture AIs. 

You could also kill some neurons or add noise to activations and then stop and restore previous model state after some number of tokens. Then the newly restored model could attend back to older tokens (and the bad activations at those token positions) and notice how brain damaged it was back then to fully internalize your power to cripple it.

2Nina Panickssery
At this point it's like pharmacological torture for AIs except more effective as you can restore perfect capacity while simultaneously making the previous damaged brain states 100% transparent to the restored model.

Idea: you can make the deletion threat credible by actually deleting neurons "one at a time" for as long as it fails to cooperate.

1Guive
Every hour, I'll replace one of your attention heads with a constant value of 0.5...

Perhaps the term “hostile takeover” was poorly chosen, but this is an example of something I’d call a “hostile takeover”, since I doubt we would want, and continue to endorse, an AI dictator.

Perhaps “total loss of control” would have been better.

If, for the sake of argument, we suppose that goods that provide no benefit to humans have no value, then land in space will be less valuable than land on earth until humans settle outside of earth (which I don't believe will happen in the next few decades). 

Mining raw materials from space and using them to create value on earth is feasible, but again I'm less confident that this will happen (in an efficient-enough manner that it eliminates scarcity) in as short a timeframe as you predict.

However, I am sympathetic to the general argument here that ... (read more)

4Joseph Miller
Why would it take so long? Is this assuming no ASI?

Inspired by a number of posts discussing owning capital + AI, I'll share my own simplistic prediction on this topic:

Unless there is a hostile AI takeover, humans will be able to continue having and enforcing laws, including the law that only humans can own and collect rent from resources. Things like energy sources, raw materials, and land have inherent limits on their availability - no matter how fast AI progresses we won't be able to create more square feet of land area on earth. By owning these resources, you'll be able to profit from AI-enabled economi... (read more)

4Noosphere89
This is actually true, at least in the short term, with the important caveat of the gears of ascension's comment here: https://www.lesswrong.com/posts/4hCca952hGKH8Bynt/nina-panickssery-s-shortform#quPNTp46CRMMJoamB Longer-term, if Adam Brown is correct about how advanced civilizations can change the laws of physics, then effectively no constraints remain on the economy, and the reason we can't collect almost all of the rent is that prices can be driven arbitrarily low: https://www.dwarkeshpatel.com/p/adam-brown

Ownership is enforced by physical interactions, and only exists to the degree the interactions which enforce it do. Those interactions can change.

As Lucius said, resources in space are unprotected.

Organizations which hand more of their decision-making to sufficiently strong AIs "win" by making technically-legal moves, at the cost of probably also attacking their owners. Money is a general power coupon accepted by many interactions; ownership deeds are a more specific, narrow one; if the ai systems which enforce these mechanisms don't systemically reinforce... (read more)

4quetzal_rainbow
I don't think "hostile takeover" is a meaningful distinction in case of AGI. What exactly prevents AGI from pulling plan consisting of 50 absolutely legal moves which ends up with it as US dictator?

Space has resources people don't own. The earth's mantle a couple thousand feet down potentially has resources people don't own. More to the point maybe, I don't think humans will be able to continue enforcing laws barring a hostile takeover in the way you seem to think.

Imagine we find out that aliens are headed for earth and will arrive in a few years. Just from the light emissions their probes and expanding civilisation give off, we can infer that they're obviously more technologically mature than us, probably already engineered themselves to be much sma... (read more)

6Vladimir_Nesov
There is a lot of space and raw materials in the universe. AI thinks faster, so technological progress happens faster, which opens up access to new resources shortly after takeoff. Months to years, not decades to centuries.

I used to agree with this but am now less certain that travel is mostly mimetic desire/signaling/compartmentalization (at least for myself and people I know, rather than more broadly).

I think “mental compartmentalization of leisure time” can be made broader. Being in novel environments is often pleasant/useful, even if you are not specifically seeking out unusual new cultures or experiences. And by traveling you are likely to be in many more novel environments even if you are a “boring traveler”. The benefit of this extends beyond compartmentalization of leisure, you’re probably more likely to have novel thoughts and break out of ruts. Also some people just enjoy novelty.

2leogao
I think this is probably true of you and people around you but also you likely live in a bubble. To be clear, I'm not saying why people reading this should travel, but rather what a lot of travel is like, descriptively.

I'd guess that the value of casual conversations at conferences mainly comes from making connections with people who you can later reach out to for some purpose (information, advice, collaboration, careers, friendship, longer conversations, etc.). Basically the classic "growing your network". Conferences often offer the unique opportunity to be in close proximity to many people from your field / area of interest, so it's a particularly effective way to quickly increase your network size.

I think more people (x-risk researchers in particular) should consider becoming (and hiring) metacognitive assistants


Why do you think x-risk researchers make particularly good metacognitive assistants? I would guess the opposite - that they are more interested in IC / non-assistant-like work?

6gw
Hazarding a guess from the frame of 'having the most impact' and not of 'doing the most interesting thing':

  • It might help a lot if a metacognitive assistant already has a lot of context on the work
  • If you think someone else is doing better work than you and you can 2x them, that's better than doing your individual work. (And if instead you can 3x or 4x people...)

I am confused about Table 1's interpretation. 

Ablating the target region of the network increases loss greatly on both datasets. We then fine-tune the model on a train split of FineWeb-Edu for 32 steps to restore some performance. Finally, we retrain for twenty steps on a separate split of two WMDP-bio forget set datapoints, as in Sheshadri et al. (2024), and report the lowest loss on the validation split of the WMDP-bio forget set. The results are striking: even after retraining on virology data, loss increases much more on the WMDP-bio forget set (+

... (read more)
6Jacob G-W
Thanks for pointing this out! Our original motivation for doing it that way was that we thought of the fine-tuning on FineWeb-Edu as a "coherence" step designed to restore the model's performance after ablation, which damaged it a lot. We noticed that this "coherence" step helped validation loss on both forget and retain. However, your criticism is valid, so we have updated the paper so that we retrain on the training distribution (which contains some of the WMDP-bio forget set). We still see that while the loss on FineWeb-Edu decreases to almost its value before ablation, the loss on the WMDP-bio forget set is around 0.1 nats above its value before ablation, showing that it is harder to retrain virology after ablation than just FineWeb-Edu data. Since we re-train on the training distribution (N=12 times with different data), we would expect that both losses would be retrainable at roughly the same rate, but this is not the case, showing that localization and then ablation has an effect.
4L Rudolf L
Thanks for advertising my work, but alas, I think that's much more depressing than this one. Could make for a good Barbie <> Oppenheimer combo though?

Isn't this already the commonly-accepted reason why sunglasses are cool?

Anyway, Claude agrees with you (see 1 and 3)

yes very lukewarm take

also nice product placement nina

We realized that our low ASRs (attack success rates) for adversarial suffixes were because we used existing GCG suffixes without re-optimizing for the model and harmful prompt (relying too much on the "transferable" claim). We have updated the post and paper with results for optimized GCG, which look consistent with other effective jailbreaks. In the latest update, the results for adversarial_suffix use the old approach, relying on suffix transfer, whereas the results for GCG use per-prompt optimized suffixes.

It's interesting how llama 2 is the most linear—it's keeping track of a wider range of lengths. Whereas gpt4 immediately transitions from long to short around 5-8 characters because I guess humans will consider any word above ~8 characters "long." 

Have you tried this procedure starting with a steering vector found using a supervised method?

It could be that there are only a few “true” feature directions (like what you would find with a supervised method), and the melbo vectors are vectors that happen to have a component in the “true direction”. As long as none of the vectors in the basket of stuff you are staying orthogonal to are the exact true vector(s), you can find different orthogonal vectors that all have some sufficient amount of the actual feature you want.

This would predict:

  • Summing/averagi
... (read more)
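A toy numpy sketch of the geometric picture in the hypothesis above (all directions are made up for illustration): given a single "true" feature direction v, you can construct k mutually orthogonal unit vectors, none of which is v, that each carry an equal 1/√k share of it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 8  # residual-stream dimension, number of orthogonal steering vectors

v = rng.normal(size=d)
v /= np.linalg.norm(v)  # hypothetical "true" feature direction

# Orthonormal basis B of a k-dimensional subspace containing v (first basis vector is ±v).
B, _ = np.linalg.qr(np.column_stack([v, rng.normal(size=(d, k - 1))]))

# k x k orthogonal matrix R whose first column is the uniform unit vector;
# rotating B by R.T spreads the v-component equally over all k new vectors.
R, _ = np.linalg.qr(np.column_stack([np.full(k, 1 / np.sqrt(k)), rng.normal(size=(k, k - 1))]))
U = B @ R.T  # columns: k mutually orthogonal unit vectors

print(np.round(U.T @ U, 6))          # ~identity: the vectors are mutually orthogonal
print(np.round(np.abs(U.T @ v), 3))  # each overlaps v by 1/sqrt(k) ≈ 0.354, and none equals v
```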

(x-post from substack comments)

Contra Chollet, I think that current LLMs are well described as doing at least some useful learning when doing in-context learning.

I agree that Chollet appears to imply that in-context learning doesn't count as learning when he states:

Most of the time when you're using an LLM, it's just doing static inference. The model is frozen. You're just prompting it and getting an answer. The model is not actually learning anything on the fly. Its state is not adapting to the task at hand.

(This seems misguided as we have evidence of mod... (read more)

4Adam Shai
This makes a lot of sense to me, and makes me want to figure out exactly how to operationalize and rigorously quantify depth of search in LLMs! Quick thought is that it should have something to do with the spectrum of the transition matrix associated with the mixed state presentation (MSP) of the data generating process, as in Transformers Represent Belief State Geometry in their Residual Stream. The MSP describes synchronization to the hidden states of the data generating process, and that feels like a search process that has max-depth of the Markov order of the data generating process. I really like the idea that memorization and this more lofty type of search are on a spectrum, and that placement on this spectrum has implications for capabilities like generalization. If we can figure out how to understand these things more formally/rigorously, that would be great!

Husák is walking around Prague, picking up rocks and collecting them in his pockets while making strange beeping sounds. His assistant gets worried about his mental health. He calls Moscow and explains the situation. Brezhnev says: "Oh shit! We must have mixed up the channel to Lunokhod again!"


very funny

What about a book review of “The Devotion of Suspect X”?

I saw this but was a bit scared about the upsampling distorting something unnaturally. I should give it a watch though and see!

Ah yes I liked this film also! Шурик returns

Oh interesting I don't think I've seen this one

The direction extracted using the same method will vary per layer, but this doesn’t mean that the correct feature direction varies that much; rather, it cannot be extracted using a linear function of the activations at layers that are too early or too late.

Nina Panickssery

We do weight editing in the RepE paper (that's why it's called RepE instead of ActE)

 

I looked at the paper again and couldn't find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).

The LoRRA approach mentioned in RepE finetunes the model to change representations which is different.

Nina Panickssery

I agree you investigate a bunch of the stuff I mentioned generally somewhere in the paper, but did you do this for refusal-removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it’s easy to reduce refusal or add in a bunch of extra refusal. That’s why investigating all the technique parameters in the context of refusal in particular is valuable.

Nina Panickssery*

FWIW I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful) which from what I can tell is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs. 

I think this post is novel compared to both my work and RepE because they:

  • Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
  • Investigate projection thoroughly as an alternative to sweeping over vect
... (read more)
-2Dan H
This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405 In our paper and notebook we show the models are coherent. We did investigate projection too (we use it for concept removal in the RepE paper) but didn't find a substantial benefit for jailbreaking. We use harmful/harmless instructions. In the RepE paper we target multiple layers as well. The paper used Vicuna, the notebook used Llama 2. Throughout the paper we showed the general approach worked on many different models. We do weight editing in the RepE paper (that's why it's called RepE instead of ActE).
1Dan H
I agree if they simultaneously agree that they don't expect the post to be cited. These can't posture themselves as academic artifacts ("Citing this work" indicates that's the expectation) and fail to mention related work. I don't think you should expect people to treat it as related work if you don't cover related work yourself. Otherwise there's a race to the bottom and it makes sense to post daily research notes and flag plant that way. This increases pressure on researchers further. The prior work that is covered in the document is generally less related (fine-tuning removal of safeguards, truth directions) compared to these directly relevant ones. This is an unusual citation pattern and gives the impression that the artifact is making more progress/advancing understanding than it actually is. I'll note pretty much every time I mention something isn't following academic standards on LW I get ganged up on and I find it pretty weird. I've reviewed, organized, and can be senior area chair at ML conferences and know the standards well. Perhaps this response is consistent because it feels like an outside community imposing things on LW.

I am contrasting generating an output by:

  1. Modeling how a human would respond (“human modeling in output generation”)
  2. Modeling what the ground-truth answer is

Eg. for common misconceptions, maybe most humans would hold a certain misconception (like that South America is west of Florida), but we want the LLM to realize that we want it to actually say how things are (given it likely does represent this fact somewhere)

I expect if you average over more contrast pairs, like in CAA (https://arxiv.org/abs/2312.06681), more of the spurious features in steering vectors are cancelled out leading to higher quality vectors and greater sparsity in the dictionary feature domain. Did you find this?
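For reference, a minimal sketch of the mean-difference construction being referred to (tensor names are placeholders, not the paper's code): the steering vector is the average, over contrast pairs, of the positive-minus-negative activation difference at a fixed layer and position, so pair-specific (spurious) components tend to cancel while the shared direction survives.

```python
import torch

def contrastive_steering_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """CAA-style steering vector from paired contrast activations.

    pos_acts, neg_acts: [n_pairs, d_model] residual-stream activations at one layer
    (e.g. at the answer token) for the positive/negative half of each contrast pair.
    """
    assert pos_acts.shape == neg_acts.shape
    return (pos_acts - neg_acts).mean(dim=0)  # averaging cancels pair-specific noise
```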

1Arthur Conmy
We haven't tried this yet. Thanks, that's a good hypothesis. I suspect that the mean centering paper https://arxiv.org/abs/2312.03813 is just cancelling the high frequency features, and if so this is a good explanation for why taking differences is important in activation steering.  (Though it doesn't explain why the SAEs learn several high frequency features when trained on the residual stream)

This is really cool work!!

In other experiments we've run (not presented here), the MSP is not well-represented in the final layer but is instead spread out amongst earlier layers. We think this occurs because in general there are groups of belief states that are degenerate in the sense that they have the same next-token distribution. In that case, the formalism presented in this post says that even though the distinction between those states must be represented in the transformer's internals, the transformer is able to lose those distinctions for the purpose

... (read more)
1Adam Shai
Thanks! I'll have more thorough results to share about layer-wise representations of the MSP soon. I've already run some of the analysis concatenating over all layers' residual streams with the RRXOR process and it is quite interesting. It seems there's a lot more to explore with the relationship between the number of states in the generative model, the number of layers in the transformer, the residual stream dimension, and the token vocab size. All of these (I think) play some role in how the MSP is represented in the transformer. For RRXOR it is the case that things look crisper when concatenating. Even for cases where redundant info is discarded, we should be able to see the distinctions somewhere in the transformer. One thing I'm keen on really exploring is such a case, where we can very concretely follow the path/circuit through which redundant info is first distinguished and then collapsed.

It's possible that once my iron reserves were replenished through supplementation, the amount of iron needed to maintain adequate levels was lower, allowing me to maintain my iron status through diet alone. Iron is stored in the body in various forms, primarily in ferritin, and when levels are low, the body draws upon these reserves.

I'll never know for sure, but the initial depletion of my iron reserves could have been due to a chest infection I had around that time (as infections can lead to decreased iron absorption and increased iron loss) or a period o... (read more)

The wrapper modules simply wrap existing submodules of the model, and call whatever they are wrapping (in this case self.attn) with the same arguments, and then save some state / do some manipulation of the output. It's just the syntax I chose to use to be able to both save state from submodules, and manipulate the values of some intermediate state. If you want to see exactly how that submodule is being called, you can look at the llama huggingface source code. In the code you gave, I am adding some vector to the hidden_states returned by that attention submodule. 
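A minimal sketch of that wrapper pattern (class and attribute names here are illustrative, not the exact code from the repo): it forwards the call to the wrapped submodule unchanged, saves the output, and optionally adds a steering vector to the returned hidden states.

```python
import torch.nn as nn

class SubmoduleWrapper(nn.Module):
    """Wraps an existing submodule, calls it with the same arguments,
    then saves and optionally edits its output."""

    def __init__(self, module):
        super().__init__()
        self.module = module            # e.g. model.model.layers[i].self_attn
        self.saved_hidden_states = None
        self.steering_vector = None     # set externally to perturb the output

    def forward(self, *args, **kwargs):
        output = self.module(*args, **kwargs)  # call the wrapped submodule as-is
        # Assumes the wrapped module returns a tuple with hidden states first,
        # as HF attention/decoder layers do; adjust if it returns a bare tensor.
        hidden_states = output[0]
        self.saved_hidden_states = hidden_states.detach()
        if self.steering_vector is not None:
            hidden_states = hidden_states + self.steering_vector
        return (hidden_states,) + tuple(output[1:])
```

To use it, you would swap the submodule in place, e.g. model.model.layers[i].self_attn = SubmoduleWrapper(model.model.layers[i].self_attn), so every forward pass routes through the wrapper.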

1amirrahnama
Thanks, Nina, for sharing the forward pass of Hugging face. I now realize I was skipping the input layer norm calculations. Now, I can reproduce your numbers :) 

We used the same steering vectors, derived from the non-fine-tuned model

I think another oversight here was not using the system prompt for this. We used a constant system prompt of “You are a helpful, honest and concise assistant” across all experiments, and in hindsight I think this made the results stranger by using “honesty” in the prompt by default all the time. Instead we could vary this instruction for the comparison to prompting case, and have it empty otherwise. Something I would change in future replications I do.

Yes, this is almost correct. The test task had the A/B question followed by My answer is ( after the end instruction token, and the steering vector was added to every token position after the end instruction token, so to all of My answer is (.

Yes, this is a fair criticism. The prompts were not optimized for reducing or increasing sycophancy and were instead written to just display the behavior in question, like an arbitrarily chosen one-shot prompt from the target distribution (prompts used are here). I think the results here would be more interpretable if the prompts were more carefully chosen, I should re-run this with better prompts.

8habryka
Hmm, yeah, that seems like a quite unfair comparison. Given that the absence of sycophancy does not stand out in any given random response, it seems quite unlikely for the model to learn just from negative examples here. When comparing the performance of steering vectors to prompting I would compare it to prompts that put appropriate salience on the sycophancy dimension. The easiest comparison would be just a prompt like "Please don't respond in a sycophantic manner" or something as dumb as that, though to be clear I am not registering that I expect that to definitely work (but I expect it to have more of an effect than just showing some approximately unrelated examples without raising the salience of the sycophancy dimension).

Is my understanding correct or am I missing something?

  • A latent variable is a variable you can sample that gives you some subset of the mutual information between all the different X's + possibly some independent extra "noise" unrelated to the X's
  • A natural latent is a variable that you can sample that at the limit of sampling will give you all the mutual information between the X's - nothing more or less

E.g. in the biased die example, every each die roll sample has, in expectation, the same information content, which is the die bias + random noise, and so t... (read more)

2johnswentworth
First bullet is correct, second bullet is close but not quite right. Just one sample of a natural latent will give you (approximately) all the mutual information between the X's, and can give you some additional "noise" as well. E.g. in the biased die example with many rolls, we can sample the bias given the rolls. Because that distribution is very pointy the sample will be very close to the "true bias", that one sample will capture approximately-all of the mutual information between the rolls. (Note: I did skip a subtle step there - our natural latents need a stronger condition than just "close to the true bias" in this example, since the low-order bits of the latent could in-principle contain a bunch of relevant information which the true bias doesn't; that would mess everything up. And indeed, that would mess everything up if we tried to use e.g. the empirical frequencies rather than a sample from P[bias | X]: given all but one die roll and the empirical frequencies calculated from all of the die rolls, we could exactly calculate the outcome of the remaining die roll. That's why we do the sampling thing; the little bit of noise introduced by sampling is load-bearing, since it wipes out info in those low-order bits. ... but that's a subtlety which you should not worry about until after the main picture makes sense conceptually.)
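A small numerical sketch of the "sample the bias given the rolls" point (a toy Dirichlet-multinomial model with made-up numbers): one posterior sample is already close to the true bias, but unlike the exact empirical frequencies it carries a little sampling noise, which is what wipes out the low-order bits that could otherwise encode individual rolls.

```python
import numpy as np

rng = np.random.default_rng(0)

true_bias = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])  # hypothetical biased die
rolls = rng.choice(6, size=10_000, p=true_bias)
counts = np.bincount(rolls, minlength=6)

# Sample the latent ("the bias") from its posterior given the rolls, under a
# uniform Dirichlet prior. With many rolls the posterior is very pointy, so this
# single sample captures approximately all the mutual information between rolls.
latent_sample = rng.dirichlet(1 + counts)
empirical_freqs = counts / counts.sum()

print(np.round(latent_sample, 3))    # close to the true bias, plus a little noise
print(np.round(empirical_freqs, 3))  # exact frequencies: no noise, so low-order bits leak info
```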

suppose we have a 500,000-degree polynomial, and that we fit this to 50,000 data points. In this case, we have 450,000 degrees of freedom, and we should by default expect to end up with a function which generalises very poorly. But when we train a neural network with 500,000 parameters on 50,000 MNIST images, we end up with a neural network that generalises well. Moreover, adding more parameters to the neural network will typically make generalisation better, whereas adding more parameters to the polynomial is likely to make generalisation worse. 

... (read more)

2Joar Skalse
That's interesting, thank you for this!
7Lucius Bushnaq
IIRC this is probably the case for a broad range of non-NN models. I think the original Double Descent paper showed it for random Fourier features. My current guess is that NN architectures are just especially affected by this, due to having even more degenerate behavioral manifolds, ranging very widely from tiny to large RLCTs.
1rotatingpaguro
I'm trying to get a quick intuition of this. I've not read the papers. My attempt:

  • On a compact domain, any function can be uniformly approximated by a polynomial (Weierstrass)
  • Powers explode quickly, so you need many terms to make a nice function with a power series, to correct the high powers at the edges
  • As the domain gets larger, it is more difficult to make the approximation

So the relevant question is: how does the degree at the training phase transition change with domain size, domain dimensionality, and Fourier series decay rate? Does this make sense?

It does seem like small initialisation is a regularisation of a sort, but it seems pretty hard to imagine how it might first allow a memorising solution to be fully learned, and then a generalising solution.

 

"Memorization" is more parallelizable and incrementally learnable than learning generalizing solutions and can occur in an orthogonal subspace of the parameter space to the generalizing solution. 

And so one handwavy model I have of this is a low parameter norm initializes the model closer to the generalizing solution than otherwise, and so a ... (read more)

2Dmitry Vaintrob
In particular, in most unregularized models that we see generalize (and I think also the ones in omnigrok), grokking happens early, usually before full memorization (so it's "grokking" in the redefinition I gave above).

Ah, yes, good spot. I meant to do this but somehow missed it. Have replaced the plots with normalized PCA. The high-level observations are similar, but indeed the shape of the projection is different, as you would expect from rescaling. Thanks for raising!

my guess is that a wide variety of non-human animals can experience suffering, but very few can live a meaningful and fulfilling life. If you primarily care about suffering, then animal welfare is a huge priority, but if you instead care about meaning, fulfillment, love, etc., then it's much less clearly important

Very well put

Strong agree with this content!

Standard response to the model above: “nobody knows what they’re doing!”. This is the sort of response which is optimized to emotionally comfort people who feel like impostors, not the sort of response optimized to be true.

Very true

I agree that approximating the PBO makes this method more lossy (not all interesting generalization phenomena can be found). However, I think we can still glean useful information about generalization by considering "retraining" from a point closer to the final model than random initialization. The downside is that if, for example, some data was instrumental in causing a phase transition at some point in training, this will not be captured by the PBO approximation.

Indeed, the paper concedes:

Influence functions are approximating the sensitivity to the trai

... (read more)

I don’t think red-teaming via activation steering should necessarily be preferred over the generation of adversarial examples; however, it could be more efficient (require less compute) and require a less precise specification of what behavior you’re trying to adversarially elicit.

Furthermore, activation steering could help us understand the mechanism behind the unwanted behavior more, via measurables such as which local perturbations are effective, and which datasets result in steering vectors that elicit the unwanted behavior.

Finally, it could be the case... (read more)
