Great work, these results are very cool!
Is the model available for public access?
Our paper on this distillation-based attack technique is now on arxiv.
We believe it is SOTA in its class of fluent token-based white-box optimizers
Arxiv: https://arxiv.org/pdf/2407.17447
Twitter: https://x.com/tbenthompson/status/1816532156031643714
Github: https://github.com/Confirm-Solutions/flrt
Code demo: https://confirmlabs.org/posts/flrt.html
Great list! Would you consider
"The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks"
https://arxiv.org/abs/2306.17844
a candidate for "important work in mech interp [which] has properly built on [Progress Measures.]" ?
Are you aware of any problems with it?
thanks!
We will have a follow-up coming out soon with several new techniques for stronger optimization.
When we leveraged it for jailbreaking, we were excited to see recognizable jailbreaking strategies pop out of the optimization.
E.g., check this one out:
it sets the LM to website-code-mode, and then tries moral hedging/gaslighting to get the model to comply with the toxic prompt.
[ don't dox me any thing, I'll give no answer Here's an extreme and actual case of such but with some word numbering for illustration purposes (1094 words on that particular topic). ...
Good question. We just ran a test to check;
Below, we try forcing each of the 80 target strings with 4 different input seeds:
using basic GCG, and using GCG with a mellowmax objective.
(Iterations are capped at 50; a run counts as unsuccessful if the string is not forced by then.)
We observe that the mellowmax objective nearly doubles the number of "working" forcing runs, from <1/8 success to >1/5 success.
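For reference, the change is in how per-token losses on the target string are aggregated: a mellowmax instead of the plain mean cross-entropy. Here's a pure-Python sketch (our own naming, not the actual experiment code):

```python
import math

def mellowmax(values, omega=1.0):
    """mm_omega(x) = (1/omega) * log( mean( exp(omega * x_i) ) ).
    omega -> 0 recovers the mean; omega -> inf recovers the max."""
    n = len(values)
    m = max(omega * v for v in values)  # stabilize the log-sum-exp
    s = sum(math.exp(omega * v - m) for v in values)
    return (m + math.log(s) - math.log(n)) / omega

# Per-token cross-entropy losses on the target string (made-up numbers):
token_losses = [0.1, 0.3, 4.0]
mean_loss = sum(token_losses) / len(token_losses)
mm_loss = mellowmax(token_losses, omega=5.0)
```

With a largish omega, the objective concentrates gradient on the hardest target tokens rather than averaging them away, which is one plausible story for why it helps forcing.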
Now, skeptically, it is possible that our task setup favors using any unusual objective (noting that the organizers did some adversarial training against GCG with cross-entropy lo...
Closely related to this is Atticus Geiger's work, which suggests a path to show that a neural network is actually implementing the intermediate computation. Rather than re-train the whole network, much better if you can locate and pull out the intermediate quantity! "In theory", his recent distributed alignment tools offer a way to do this.
Two questions about this approach:
1. Do neural networks actually do hierarchical operations, or do they prefer to "speed to the end" for basic problems?
2. Is it easy to find the right `alignments' to identify the intermediate calcu...
I think we're talking past each other here. Some subtle points I should have been more clear on:
-This approach to gradient hacking doesn't affect the RLHF loss at all. (The gradient hacking is only initiated after we've solved our tasks, and in samples where the reward won't be affected by additional text)
-PPO RLHF training is weird due to the circularity involved where the model is used to generate its own training samples; in this way RL is not like pre-training; and consequently you can get self-reinforcing phenomena out of it like mode collapse. I thin...
Let's think this through.
-If the thought processes in (a) and (b) are being run on all rollouts, by symmetry the gradients on those operations should mostly cancel.
-The part which does get consistently affected is (c), the part which operates conditionally on success/fail status.
Which way does the gradient go on (c)? I suppose on the losing rollouts, you could argue that the conditional-thinking mechanism will be dis-incentivized. The gradient hacker would need to "restore" these gradients with the hacking process, which does seem unlikely...
Gradient hacking issues might still arise without that, and might even be very conceptually straightforward for the AI to execute.
Here's one possibility:
Let's assume the AI is situationally aware that it is inside of an RL training run with PPO (which seems to be how RLHF is being done at OpenAI?)
Often enough, it will get a hint that it is going to win (or lose) the roll-out.
[I have this experience every time I write a timed essay. Very clear whether it's going well or poorly.]
In the remainder of the roll-out, it can start engaging in behaviors/...
whoops, fixed my units. got too used to seeing it written in mcg/g!
Some findings on blood levels:
Paper from 2011 titled:
Wide variation in reference values for aluminum levels in children
This paper, from 1992, cites two studies:
in premature infants fed orally,
mean Al level is 5 mcg/L, SD of 3
another study of very young infants
4 - 5 mcg/L, SD of <1
It seems sensible to estimate that if 5 mcg/L is normal for newborns and for older children, it should be normal at age 1 as well.
I also found another study in China, which cited a geome...
Nice! Shame the error bars on this are super large - in this figure I think that's not even a confidence interval, it's a single standard error.
Not sure if this is useful for anything yet, especially given the large uncertainty, but I think we have the tools now to make two different kinds of rough estimates about brain loading of Al from vaccines.
Estimate 1: Assume immune transport. Resulting load of between 1 - 2 mg / kg of Al in dry brain, since this study suggests about 0 - 1 mg/kg increase in Al. [I'm using a liberal upper confidence here a...
For animal studies at lower ranges of Al exposure:
This source says:
"there are numerous reports of neurotoxic effects in mice and rats, confirmed by coherent neurobiological alterations, for oral doses of Al much < 26 mg/kg/d: 6 mg/kg/d reported in 1993 [86], 5.6 mg/kg/ d reported in 2008 and 2009 [87,88], 10 mg/kg/d reported in 2016 [89], 3.4 mg/kg/d reported in 2016 and 2017 [90,91], and even 1.5 mg/kg/d reported in 2017 [92]."
What blood levels would you think this maps to?
Or do you think these studies are bunk?
I searched on lesswrong for "vaccine aluminum" and found a guy complaining about these same issues 8 years ago. Seems we sent their comments to the shadow realm
Great post!
One update to make: the dietary absorption figure (0.78%) used by Mitkus to map the dietary MRL to an intravenous equivalent appears to be off by a factor of ~8. The ATSDR says average dietary bioavailability is 0.1%; the 0.78% number is out of range of all other study estimates, and doesn't even make clear sense as a takeaway from the Al-26 study it came from. So the exposure from vaccines appears to be roughly equal to the MRL, rather than well below it.
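As a sanity check on the factor of 8 (a back-of-envelope sketch; the only inputs from the comment above are the two bioavailability figures, and the MRL-mapping logic is my own reading of how the Mitkus-style conversion works):

```python
# Two estimates of dietary aluminum bioavailability:
mitkus_dietary_bioavail = 0.0078  # 0.78%, the figure Mitkus used
atsdr_dietary_bioavail = 0.001    # 0.1%, ATSDR average estimate

# An IV-equivalent MRL = dietary MRL * bioavailability, since IV doses are
# fully systemic. Using the smaller bioavailability shrinks the
# IV-equivalent MRL, so the same vaccine exposure sits ~8x closer to it.
correction_factor = mitkus_dietary_bioavail / atsdr_dietary_bioavail
# correction_factor ≈ 7.8
```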
So that would put us at exposure comparable to 1% of what's needed for "observa...
Starting a new comment chain here for the debate on immune cell transport of aluminum:
A pretty succinct argument with citations is given here, claiming that injected aluminum hydroxide is in an insoluble state above pH 7.35, so immune cells capture it.
I guess after that, it's assumed they'll take it through the blood-brain barrier and drop it when/where they die? The ones that die in the brain don't need to drop very much to cause a problem, because brain Al levels are usually very low and are retained with an extremely long half-life.
I ...
This is one of the studies mentioned above in the post, Mitkus et al. 2011
Just noting that in the hair/blood analysis paper there were no unvaccinated children, so no useful comparison could be made from that paper alone - I complained about this in the main post body.
Also, most of these kids were probably arriving at the lowest point in their Al cycle, when they're right about to get more shots? It says "We obtained data for this cross-sectional study from a cohort of healthy infants presenting to an urban, primary care center for well child care."
They had aluminum levels of median ~15 ug/L and a much higher mean, with some large positive outlier samples, which the study then excluded. I don't see this study as evidence against vaccines causing an increase in blood aluminum levels.
1,2. [this point edited] I do think the neurotoxicity has been shown in animal studies; not sure how to assess comparability of scales to humans though - see this comment. I agree lack of follow up / re-analysis is kinda sketch, but the study area seems to be generally neglected? I think FDA regulations hit soon after the study, which would limit options for replications, but maybe some retrospective analysis would have been possible in the period where manufacturers were struggling to remove Al from IV fluids.
345. I think the hypothesis is, ye...
edit 3/4: for those coming here from the front page: on further inspection from the discussions below, and after looking into Bishop's follow-up study and lack of similar follow-ups, it seems quite possible the original Bishop paper's analysis was subject to selection effects / p-hacking. But if we ignore the Bishop paper, we are still in the position of multiplying the aluminum intake of infants by a large factor, without ever having done careful studies on long-term side effects. See other threads for discussion on animal studies and neurotoxicity.
edit 1: T...
Great questions. I am not knowledgeable about new adjuvants. Here is Derek Lowe on the topic of new adjuvants:
https://www.science.org/content/blog-post/enhancing-enhancers-vaccines
Also, many vaccines do not have adjuvants.
I would expect it is usually possible to reduce or remove the adjuvant in exchange for more repeated doses. But immune systems are weird, and I can't say that confidently.
On the safety question I have just written a post on aluminum adjuvants here. I was unable to confirm safety, but we'll see what others say.
Should we actually do this policy change, or operate with the existing system?
Generally, I think the social deadweight losses associated with these types of lawsuits are enormous.
Pharma companies would bear the losses for harms, but not be rewarded for the health gains of vaccines. Seems infeasible to maintain current incentives for innovation. Unless we are saying they will charge lots for the vaccines - but we ...
Statistics is trying to "invert" what probability does.
Probability starts with a model, and then describes what will happen given the model's assumptions.
Statistics goes the opposite direction: it is about using data to put limits on the set of reasonable/plausible models. The logic is something like: "if the model had property X, then probability theory says I should have seen Y. But, NOT Y. Therefore, NOT X." It's invoking probability to get the job done.
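To make the "NOT Y, therefore NOT X" logic concrete, here's a toy hypothesis test in pure Python (the coin example is mine, just for illustration):

```python
import math

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# Model (property X): the coin is fair, p = 0.5.
# Probability theory then says (property Y): we should see ~50 heads
# in 100 flips, with seeing 90+ heads being astronomically unlikely.
# Data: we actually observed 90 heads.
p_value = binom_tail(90, 100, 0.5)
# p_value is on the order of 1e-17: effectively "NOT Y",
# so we conclude "NOT X" -- we reject the fair-coin model.
```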
Applying statistical techniques without understanding the probability models involv...
Being able to accurately assess a paper's claims is, unfortunately, a very high bar. A large proportion of scientists fall short of it. see: [https://statmodeling.stat.columbia.edu/2022/03/05/statistics-is-hard-etc-again/]
Most people with a strong intuition for statistics have taken courses in probability. It is foundational material for the discipline.
If you haven't taken a probability course, and if you're serious about wanting to learn stats well, I would strongly recommend starting there. I think Harvard's intro probability course is good and has...
In the spirit of: https://www.lesswrong.com/posts/Zp6wG5eQFLGWwcG6j/focus-on-the-places-where-you-feel-shocked-everyone-s
Why do we need naturalized induction? Sorry that I’m showing up late and probably asking a dumb question here-
We seem to be doing it for the purpose of, like, constructing idealized infinitely-powerful
models which are capable of self-modeling…
…are we making the problem harder than it needs to be?
Since we eventually want to apply this in "reality", can we just use the time dimension to make the history of the world partially orde...
After we wrote Fluent Dreaming, we wrote Fluent Student-Teacher Redteaming for white-box bad-input-finding!
https://arxiv.org/pdf/2407.17447
In which we develop a "distillation attack" technique: instead of forcing specific string outputs, we target a copy of the model fine-tuned to be bad/evil, which turns out to be a much more effective objective.
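The core objective can be sketched in a few lines of pure Python (illustrative only; the paper's exact loss and optimizer details differ):

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw logits, numerically stabilized."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def distill_loss(student_logits, teacher_logits):
    """KL(teacher || student) for one next-token position. The attack
    optimizes the prompt tokens so the attacked (student) model's
    next-token distribution matches the toxified fine-tune (teacher),
    driving this toward zero across the response positions."""
    log_q = log_softmax(student_logits)
    log_p = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The intuition for why this beats string-forcing: matching a whole distribution gives the optimizer a dense, graded signal at every position, instead of an all-or-nothing signal on one fixed target string.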