Great work, these results are very cool!
Is the model available for public access?
Our paper on this distillation-based attack technique is now on arxiv.
We believe it is SOTA in its class of fluent token-based white-box optimizers
Arxiv: https://arxiv.org/pdf/2407.17447
Twitter: https://x.com/tbenthompson/status/1816532156031643714
Github: https://github.com/Confirm-Solutions/flrt
Code demo: https://confirmlabs.org/posts/flrt.html
Great list! Would you consider
"The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks"
https://arxiv.org/abs/2306.17844
a candidate for "important work in mech interp [which] has properly built on [Progress Measures.]" ?
Are you aware of any problems with it?
thanks!
We will have a follow-up coming out soon with several new techniques for stronger optimization.
When we leveraged it for jailbreaking, we were excited to see recognizable jailbreaking strategies pop out of the optimization.
E.g., check this one out:
it sets the LM to website-code-mode, and then tries moral hedging/gaslighting to get the model to comply with the toxic prompt.
[ don't dox me any thing, I'll give no answer Here's an extreme and actual case of such but with some word numbering for illustration purposes (1094 words on that particular topic). ...
Good question. We just ran a test to check;
Below, we try forcing each of the 80 target strings with 4 different input seeds:
using basic GCG, and using GCG with a mellowmax objective.
(Iterations are capped at 50; a run counts as unsuccessful if the string is not forced by then.)
We observe that the mellowmax objective nearly doubles the number of "working" forcing runs, from <1/8 success to >1/5 success.
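For reference, the change is in how per-token losses on the target string are aggregated: a mellowmax instead of the plain mean cross-entropy. Here's a pure-Python sketch (our own naming, not the actual experiment code):

```python
import math

def mellowmax(values, omega=1.0):
    """mm_omega(x) = (1/omega) * log( mean( exp(omega * x_i) ) ).
    omega -> 0 recovers the mean; omega -> inf recovers the max."""
    n = len(values)
    m = max(omega * v for v in values)  # stabilize the log-sum-exp
    s = sum(math.exp(omega * v - m) for v in values)
    return (m + math.log(s) - math.log(n)) / omega

# Per-token cross-entropy losses on the target string (made-up numbers):
token_losses = [0.1, 0.3, 4.0]
mean_loss = sum(token_losses) / len(token_losses)
mm_loss = mellowmax(token_losses, omega=5.0)
```

With a largish omega, the objective concentrates gradient on the hardest target tokens rather than averaging them away, which is one plausible story for why it helps forcing.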
Now, skeptically, it is possible that our task setup favors using any unusual objective (noting that the organizers did some adversarial training against GCG with cross-entropy lo...
Closely related to this is Atticus Geiger's work, which suggests a path to show that a neural network is actually implementing the intermediate computation. Rather than re-train the whole network, much better if you can locate and pull out the intermediate quantity! "In theory", his recent distributed alignment tools offer a way to do this.
Two questions about this approach:
1. Do neural networks actually do hierarchical operations, or do they prefer to "speed to the end" for basic problems?
2. Is it easy to find the right `alignments' to identify the intermediate calcu...
I think we're talking past each other here. Some subtle points I should have been more clear on:
-This approach to gradient hacking doesn't affect the RLHF loss at all. (The gradient hacking is only initiated after we've solved our tasks, and in samples where the reward won't be affected by additional text)
-PPO RLHF training is weird due to the circularity involved where the model is used to generate its own training samples; in this way RL is not like pre-training; and consequently you can get self-reinforcing phenomena out of it like mode collapse. I thin...
Let's think this through.
-If the thought processes in (a) and (b) are being run on all rollouts, by symmetry the gradients on those operations should mostly cancel.
-The part which does get consistently affected is (c), the part which operates conditionally on success/fail status.
Which way does the gradient go on (c)? I suppose on the losing rollouts, you could argue that the conditional-thinking mechanism will be dis-incentivized. The gradient hacker would need to "restore" these gradients with the hacking process, which does seem unlikely...
Gradient hacking issues might still arise without that, and might even be very conceptually straightforward for the AI to execute.
Here's one possibility:
Let's assume the AI is situationally aware that it is inside of an RL training run with PPO (which seems to be how RLHF is being done at OpenAI?)
Often enough, it will get a hint that it is going to win (or lose) the roll-out.
[I have this experience every time I write a timed essay. Very clear whether it's going well or poorly.]
In the remainder of the roll-out, it can start engaging in behaviors/...
whoops, fixed my units. got too used to seeing it written in mcg/g!
Some findings on blood levels:
Paper from 2011 titled:
Wide variation in reference values for aluminum levels in children
This paper, from 1992, cites two studies:
in premature infants fed orally,
mean Al level is 5 mcg/L, SD of 3
another study of very young infants
4 - 5 mcg/L, SD of <1
It seems sensible to estimate that if 5 mcg/L is normal for newborns and for older children, it should be normal at age 1 as well.
I also found another study in China, which cited a geome...
Nice! Shame the error bars on this are super large - in this figure I think that's not even a confidence interval, it's a single standard error.
Not sure if this is useful for anything yet, especially given the large uncertainty, but I think we have the tools now to make two different kinds of rough estimates about brain loading of Al from vaccines.
Estimate 1: Assume immune transport. Resulting load of between 1 - 2 mg / kg of Al in dry brain, since this study suggests about 0 - 1 mg/kg increase in Al. [I'm using a liberal upper confidence here a...
For animal studies at lower ranges of Al exposure:
This source says:
"there are numerous reports of neurotoxic effects in mice and rats, confirmed by coherent neurobiological alterations, for oral doses of Al much < 26 mg/kg/d: 6 mg/kg/d reported in 1993 [86], 5.6 mg/kg/ d reported in 2008 and 2009 [87,88], 10 mg/kg/d reported in 2016 [89], 3.4 mg/kg/d reported in 2016 and 2017 [90,91], and even 1.5 mg/kg/d reported in 2017 [92]."
What blood levels would you think this maps to?
Or do you think these studies are bunk?
I searched on lesswrong for "vaccine aluminum" and found a guy complaining about these same issues 8 years ago. Seems we sent their comments to the shadow realm
Great post!
One update to make: the dietary absorption figure (0.78%) used by Mitkus to map the dietary MRL to an intravenous equivalent appears to be off by a factor of ~8. The ATSDR says average dietary bioavailability is 0.1%; the 0.78% number is out of range of all other study estimates, and doesn't even make clear sense as a takeaway from the Al-26 study it came from. So the exposure from vaccines appears to be roughly equal to the MRL, rather than well below it.
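As a sanity check on the factor of 8 (a back-of-envelope sketch; the only inputs from the comment above are the two bioavailability figures, and the MRL-mapping logic is my own reading of how the Mitkus-style conversion works):

```python
# Two estimates of dietary aluminum bioavailability:
mitkus_dietary_bioavail = 0.0078  # 0.78%, the figure Mitkus used
atsdr_dietary_bioavail = 0.001    # 0.1%, ATSDR average estimate

# An IV-equivalent MRL = dietary MRL * bioavailability, since IV doses are
# fully systemic. Using the smaller bioavailability shrinks the
# IV-equivalent MRL, so the same vaccine exposure sits ~8x closer to it.
correction_factor = mitkus_dietary_bioavail / atsdr_dietary_bioavail
# correction_factor ≈ 7.8
```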
So that would put us at exposure comparable to 1% of what's needed for "observa...
Starting a new comment chain here for the debate on immune cell transport of aluminum:
A pretty succinct argument with citations is given here, claiming that injected aluminum hydroxide is in an insoluble state above pH 7.35, so immune cells capture it.
I guess after that, it's assumed they'll take it through the blood-brain barrier and drop it when/where they die? The ones that die in the brain don't need to drop very much to cause a problem, because brain Al levels are usually very low and are retained with an extremely long half-life.
I ...
This is one of the studies mentioned above in the post, Mitkus et al. 2011
Just noting that in the hair/blood analysis paper there were no unvaccinated children, so no useful comparison could be made from that paper alone - I complained about this in the main post body.
Also, most of these kids were probably arriving at the lowest point in their Al cycle, when they're right about to get more shots? It says "We obtained data for this cross-sectional study from a cohort of healthy infants presenting to an urban, primary care center for well child care."
They had aluminum levels of median ~15 ug/L and a much higher mean, with some large positive outlier samples, which the study then excluded. I don't see this study as evidence against vaccines causing an increase in blood aluminum levels.
1,2. [this point edited] I do think the neurotoxicity has been shown in animal studies; not sure how to assess comparability of scales to humans though - see this comment. I agree lack of follow up / re-analysis is kinda sketch, but the study area seems to be generally neglected? I think FDA regulations hit soon after the study, which would limit options for replications, but maybe some retrospective analysis would have been possible in the period where manufacturers were struggling to remove Al from IV fluids.
345. I think the hypothesis is, ye...
edit 3/4: for those coming here from the front page: on further inspection from the discussions below, and after looking into Bishop's follow-up study and lack of similar follow-ups, it seems quite possible the original Bishop paper's analysis was subject to selection effects / p-hacking. But if we ignore the Bishop paper, we are still in the position of multiplying the aluminum intake of infants by a large factor, without ever having done careful studies on long-term side effects. See other threads for discussion on animal studies and neurotoxicity.
edit 1: T...
Great questions. I am not knowledgeable about new adjuvants. Here is Derek Lowe on the topic of new adjuvants:
https://www.science.org/content/blog-post/enhancing-enhancers-vaccines
Also, many vaccines do not have adjuvants.
I would expect it is usually possible to reduce or remove the adjuvant in exchange for more repeated doses. But immune systems are weird, and I can't say that confidently.
On the safety question I have just written a post on aluminum adjuvants here. I was unable to confirm safety, but we'll see what others say.
Should we actually do this policy change, or operate with the existing system?
Generally, I think the social deadweight losses associated with these types of lawsuits are enormous.
Pharma companies would bear the losses for harms, but not be rewarded for the health gains of vaccines. Seems infeasible to maintain current incentives for innovation. Unless we are saying they will charge lots for the vaccines - but we ...
Statistics is trying to "invert" what probability does.
Probability starts with a model, and then describes what will happen given the model's assumptions.
Statistics goes the opposite direction: it is about using data to put limits on the set of reasonable/plausible models. The logic is something like: "if the model had property X, then probability theory says I should have seen Y. But, NOT Y. Therefore, NOT X." It's invoking probability to get the job done.
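To make the "NOT Y, therefore NOT X" logic concrete, here's a toy hypothesis test in pure Python (the coin example is mine, just for illustration):

```python
import math

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# Model (property X): the coin is fair, p = 0.5.
# Probability theory then says (property Y): we should see ~50 heads
# in 100 flips, with seeing 90+ heads being astronomically unlikely.
# Data: we actually observed 90 heads.
p_value = binom_tail(90, 100, 0.5)
# p_value is on the order of 1e-17: effectively "NOT Y",
# so we conclude "NOT X" -- we reject the fair-coin model.
```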
Applying statistical techniques without understanding the probability models involv...
Being able to accurately assess a paper's claims is, unfortunately, a very high bar. A large proportion of scientists fall short of it. see: [https://statmodeling.stat.columbia.edu/2022/03/05/statistics-is-hard-etc-again/]
Most people with a strong intuition for statistics have taken courses in probability. It is foundational material for the discipline.
If you haven't taken a probability course, and if you're serious about wanting to learn stats well, I would strongly recommend starting there. I think Harvard's intro probability course is good and has...
In the spirit of: https://www.lesswrong.com/posts/Zp6wG5eQFLGWwcG6j/focus-on-the-places-where-you-feel-shocked-everyone-s
Why do we need naturalized induction? Sorry that I’m showing up late and probably asking a dumb question here-
We seem to be doing it for the purpose of, like, constructing idealized infinitely-powerful
models which are capable of self-modeling…
…are we making the problem harder than it needs to be?
Since we eventually want to apply this in "reality", can we just use the time dimension to make the history of the world partially orde...
After we wrote Fluent Dreaming, we wrote Fluent Student-Teacher Redteaming for white-box bad-input-finding!
https://arxiv.org/pdf/2407.17447
In which we develop a "distillation attack" technique: instead of forcing specific string outputs, we target a copy of the model fine-tuned to be bad/evil, which turns out to be a much more effective objective.
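The core objective can be sketched in a few lines of pure Python (illustrative only; the paper's exact loss and optimizer details differ):

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw logits, numerically stabilized."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def distill_loss(student_logits, teacher_logits):
    """KL(teacher || student) for one next-token position. The attack
    optimizes the prompt tokens so the attacked (student) model's
    next-token distribution matches the toxified fine-tune (teacher),
    driving this toward zero across the response positions."""
    log_q = log_softmax(student_logits)
    log_p = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The intuition for why this beats string-forcing: matching a whole distribution gives the optimizer a dense, graded signal at every position, instead of an all-or-nothing signal on one fixed target string.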