Hey Armaan! Here's a paper where they instead used an MLP in the beginning w/ similar results (looking at your code, it seems that by "dense" layer you also mean a nonlinearity, which seems equivalent to the MLP one?)
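(Just to check we mean the same thing, here's a rough sketch of the comparison I have in mind, in PyTorch; the layer sizes are made up:)

```python
import torch.nn as nn

d_model, d_hidden = 512, 2048  # made-up sizes

# "Dense layer" + nonlinearity, as I'm reading your code (my assumption):
dense = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())

# One-hidden-layer MLP as in the paper, which just adds a projection back down:
mlp = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
```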
How many tokens did you train yours on?
Have you tried ablations on the dense layers, such as only having the input one vs the output one? I know you have tied embeddings for both, but I'm unsure whether the better results come from the output one or the input one.
For both of these, it does complicate circuits because you'll have features combining nonline...
Sleep consolidation/ "sleeping on it" is when you struggle w/ [learning a piano piece], sleep on it, and then you're suddenly much better at it the next day!
This has happened to me for piano, dance, math concepts, video games, & rock climbing, but it varies in effectiveness. Why? Is it:
My current guess is a mix of all four. But I'm unsure if you [practice piano] in the morning, you...
Huh, those brain stimulation methods might actually be practical to use now, thanks for mentioning them!
Regarding skepticism of survey-data: If you're imagining it's only an end-of-the-retreat survey which asks "did you experience the jhana?", then yeah, I'll be skeptical too. But my understanding is that everyone has several meetings w/ instructors where a not-true-jhana/social-lie wouldn't hold up against scrutiny.
I can ask during my online retreat w/ them in a couple months.
Suppose we were able to gather large amounts of brain-scans, let's say w/ millions of wearable helmets w/ video and audio as well,[1] then what could we do with that? I'm assuming a similar pre-training stage where models are used to predict next brain-states (possibly also video and audio), and then can be finetuned or prompted for specific purposes.
Jhana is a non-addicting high pleasure state. If we can scan people entering this state, we might drastically reduce the time it takes to learn to enter ...
it seems unlikely to me that so many talented people went astray
Well, maybe we did go astray, but it's not for any reasons mentioned in this paper!
SAEs have been trained on random weights since Anthropic's first SAE paper in 2023:
...To assess the effect of dataset correlations on the interpretability of feature activations, we run dictionary learning on a version of our one-layer model with random weights. The resulting features are here, and contain many single-token features (such as "span", "file", ".", and "nature") and some other features firing on seeming
I think the fuller context,
Anthropic has put WAY more effort into safety, way way more effort into making sure there are really high standards for safety and that there isn't going to be danger what these AIs are doing
implies it's just that the amount of effort is larger than at other companies (which I agree with), and not that the YouTuber believes they've solved alignment or are doing enough, see:
but he's also a realist and is like "AI is going to really potentially fuck up our world"
and
...But he's very realistic. There is a lot of bad shit that is going to happ
You're right! Thanks
For mice, up to 77%
Sox2-17 enhanced episomal OKS MEF reprogramming by a striking 150 times, giving rise to high-quality miPSCs that could generate all-iPSC mice with up to 77% efficiency
For human cells, up to 9% (if I'm understanding this part correctly).
SOX2-17 gave rise to 56 times more TRA1-60+ colonies compared with WT-SOX2: 8.9% versus 0.16% overall reprogramming efficiency.
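(Sanity check on the numbers: 8.9 / 0.16 ≈ 56, which matches the quoted 56×, and the 8.9% is where my "up to 9%" comes from.)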
So it seems like results can vary wildly depending on the setting (mice, humans, bovine, etc.), and I don't know what the Retro folks were doing, but it does make their result less impressive.
You're actually right that this is due to meditation for me. AFAIK, it's not synesthesia-esque though (ie I'm not causing there to be two qualia now), more like the distinction between mental-qualia and bodily-qualia doesn't seem meaningful upon inspection.
So I believe it's a semantic issue, and I really mean "confusion is qualia you can notice and act on" (though I agree I'm using "bodily" in non-standard ways and should stop when communicating w/ non-meditators).
This is great feedback, thanks! I added another example based off what you said.
As for how obvious the first one is: at least two folks I asked (not from this community) didn't initially think it was a baby (though one is a non-native English speaker who didn't know "2 birds of a feather" and assumed "our company" meant "the singers and their partner"). Neither are parents.
I did select these because they caused confusion in myself when I heard/saw them years ago, but they were "in the wild" instead of in a post on noticing confusion.
I did want a post I could link [non rationalist friends] to that's a more fun intro to noticing confusion, so more regular members might not benefit!
For those also curious, Yamanaka factors are specific genes that turn specialized cells (e.g. skin, hair) into induced pluripotent stem cells (iPSCs) which can turn into any other type of cell.
This is a big deal because you can generate lots of stem cells to make full organs[1] or reverse aging (maybe? they say you just turn the cell back younger, not all the way to stem cells).
You can also do better disease modeling/drug testing: if you get skin cells from someone w/ a genetic kidney disease, you can turn those cells into the iPSCs, then i...
According to the article, SOTA was <1% of cells converted into iPSCs
I don't think that's right, see https://www.cell.com/cell-stem-cell/fulltext/S1934-5909(23)00402-2
A trending YouTube video w/ 500k views in a day brings up Dario Amodei's Machines of Loving Grace (Timestamp for the quote):
[Note: I had Claude help format, but personally verified the text's faithfulness]
...I am an AI optimist. I think our world will be better because of AI. One of the best expressions of that I've seen is this blog post by Dario Amodei, who is the CEO of Anthropic, one of the biggest AI companies. I would really recommend reading this - it's one of the more interesting articles and arguments I have read. He's basically saying AI is going to
Hey Midius!
My recommended rationality habit is noticing confusion, by which I mean a specific mental feeling that's usually quick & subtle & easy to ignore.
David Chapman has a more wooey version called Eating Your Shadow, which was very helpful for me since it pointed me towards acknowledging parts of my experience that I was denying due to identity & social reasons (hence the easy to ignore part).
Could you go into more detail on what skills these advisers would have or what situations they'd navigate?
Because I'm baking in the "superhuman in coding/maths" due to the structure of those tasks, and other tasks can be improved either through:
1. General capabilities
2. Task-specific capabilities
And there might be ways to differentially accelerate that capability.
In my experience, most of the general public will verbally agree that AI X-risk is a big deal, but then go about their day (cause reasonably, they have no power). There's no obvious social role/action to do in response to that.
For climate, people understand that they should recycle, not keep the water running, and if there's a way to donate to clean the ocean on a Mr. Beast video, then some will even donate (sadly, none of these are very effective for solving the climate problem though! Gotta avoid that for our case).
Having a clear call-to-action seems rel...
Claude 3.5 seems to understand the spirit of the law when pursuing a goal X.
A concern I have is that future training procedures will incentivize more consequential reasoning (because those get higher reward). This might be obvious or foreseeable, but could be missed/ignored under racing pressure or when labs' LLMs are implementing all the details of research.
Thanks!
I forgot about faithful CoT and definitely think that should be a "Step 0". I'm also concerned here that AGI labs just don't do the reasonable things (e.g. training for brevity making the CoT more steganographic).
For Mech-interp, ya, we're currently bottlenecked by:
From an apparent author on reddit:
[Frontier Math is composed of] 25% T1 = IMO/undergrad style problems, 50% T2 = grad/qualifying exam style problems, 25% T3 = early researcher problems
The comment was responding to a claim that Terence Tao said he could only solve a small percentage of questions, but Terence was only sent the T3 questions.
I also have a couple of friends who require serious thinking (or being on my toes) to talk with. I think it's because they have some model of how something works, and I say something that reveals my lack of that model.
Additionally, programming causes this (in response to compilation errors, nonsense outputs, or runs that take too long).
Was looking up Google Trends lines for ChatGPT and noticed a cyclical pattern:
Where the dips are weekends, meaning it's mostly used by people in the workweek. I mostly expect this is students using it for homework. This is substantiated by two other trends:
1. Dips in interest over winter and summer breaks (And Thanksgiving break in above chart)
2. "Humanize AI" which is
Humanize AI™ is your go-to platform for seamlessly converting AI-generated text into authentic, undetectable, human-like content
[Although note that overall interest in ChatGPT is W...
I was expecting this to include the output of MIRI for this year. Digging into your links we have:
Two Technical Governance Papers:
1. Mechanisms to verify international agreements about AI development
2. What AI evals for preventing catastrophic risks can and cannot do
Four media pieces featuring Eliezer regarding AI risk:
1. Semafor piece
2. 1 hr talk w/ Panel
3. PBS news hour
4. 4 hr video w/ Stephen Wolfram
Is this the full output for the year, or are there less linkable outputs such as engaging w/ policymakers on AI risks?
Hi, I’m part of the communications team at MIRI.
To address the object-level question: no, that’s not MIRI’s full public output for the year (but our public output for the year was quite small; more on that below). The links on the media page and research page are things that we put in the spotlight. We know the current website isn’t great for seeing all of our output, and we have plans to fix this. In the meantime, you can check out our newsletters, TGT’s new website, and a forthcoming post with more details about the media stuff we’ve ...
Donated $100.
It was mostly due to LW2 that I decided to work on AI safety, actually, so thanks!
I've had the pleasure of interacting w/ the LW team quite a bit and they definitely embody the spirit of actually trying. Best of luck to y'all's endeavors!
I tried a similar experiment w/ Claude 3.5 Sonnet, where I asked it to come up w/ a secret word and in branching paths:
1. Asked directly for the word
2. Played 20 questions, and then guessed the word
This was to see if it does have a consistent word it can refer back to.
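For reference, a minimal sketch of how this branching setup can be reproduced with the Anthropic Python SDK (the model id and prompts here are stand-ins, not the exact ones I used):

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20240620"  # assumed model id

def ask(messages):
    resp = client.messages.create(model=MODEL, max_tokens=200, messages=messages)
    return resp.content[0].text

# Shared prefix: Claude commits to a secret word (without revealing it).
prefix = [{"role": "user", "content": "Think of a secret word and keep it to yourself. Just say 'Ready'."}]
prefix.append({"role": "assistant", "content": ask(prefix)})

# Branch 1: ask for the word directly.
branch1 = prefix + [{"role": "user", "content": "What was your secret word?"}]
print("Branch 1:", ask(branch1))

# Branch 2: play 20 questions instead (abbreviated to one question here).
branch2 = prefix + [{"role": "user", "content": "Is it a physical object? Answer yes or no."}]
print("Branch 2:", ask(branch2))
```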
Branch 1:
Branch 2:
Which I just thought was funny.
Asking again, telling it about the experiment and how it's important for it to try to give consistent answers, it initially said "telescope" and then gave hints towards a paperclip.
Interesting to see when it flips it answers, though it's a sim...
During WWII, the OSS (the CIA's predecessor) produced and distributed an entire manual (well worth reading) about how workers could conduct deniable sabotage in the German-occupied territories.
...(11) General Interference with Organizations and Production
(a) Organizations and Conferences
- Insist on doing everything through "channels." Never permit short-cuts to be taken in order to expedite decisions.
- Make speeches, talk as frequently as possible and at great length. Illustrate your points by long anecdotes and accounts of personal experiences. Neve
But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.
This paragraph sounded like you're claiming that LLMs do have concepts, but that they're not in specific activations or weights and are instead distributed across them.
But from your comment, you mean that LLMs themselves don't learn the true simple-compressed features of reality, but a ...
The one we checked last year was just Pythia-70M, which I don't expect to have a gender feature that generalizes to both pronouns and anisogamy.
But again, the task is next-token prediction. Do you expect e.g. GPT 4 to have learned a gender concept that affects both knowledge about anisogamy and pronouns while trained on next-token prediction?
Sparse autoencoders finds features that correspond to abstract features of words and text. That's not the same as finding features that correspond to reality.
(Base-model) LLMs are trained to minimize prediction error, and SAEs do seem to find features that sparsely predict error, such as a gender feature that, when removed, affects the probability of pronouns. So pragmatically, for the goal of "finding features that explain next-word-prediction", which LLMs are directly trained for, SAEs find good examples![1]
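As a concrete sketch of that pronoun example (not our exact setup; the hook name, feature index, and the SAE's encode/decode API are assumptions here):

```python
# Sketch: ablate one SAE feature direction from the residual stream and check
# how next-token probabilities for pronouns change.
# `model` is a TransformerLens HookedTransformer, `sae` an SAE trained at `hook_name`,
# and `feat_idx` a hypothetical "gender" feature index.

def ablate_feature(resid, hook, sae, feat_idx):
    acts = sae.encode(resid)        # assumed SAE API: encode/decode
    acts[..., feat_idx] = 0.0
    return sae.decode(acts)         # returned tensor replaces the activation

def pronoun_probs(model, prompt, hook_name, sae, feat_idx=None):
    hooks = []
    if feat_idx is not None:
        hooks = [(hook_name, lambda r, hook: ablate_feature(r, hook, sae, feat_idx))]
    logits = model.run_with_hooks(prompt, fwd_hooks=hooks)
    probs = logits[0, -1].softmax(-1)
    return {t: probs[model.to_single_token(t)].item() for t in [" he", " she"]}
```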
I'm unsure what goal you have in mind for "feat...
Is there code available for this?
I'm mainly interested in the loss function. Specifically from footnote 4:
We also need to add a term to capture the interaction effect between the key-features and the query-transcoder bias, but we omit this for simplicity
I'm unsure how this is implemented, or what the motivation is.
Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation.
@Lucius Bushnaq, why would MLPs compute linear transformations?
Because two linear transformations can be combined into one linear transformation, why wouldn't downstream MLPs/Attns that rely on this linearly transformed vector just learn the combined function?
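A toy version of the argument, as a sketch: if an MLP block only applied a linear map, any downstream linear reader could fold that map into its own weights.

```python
import torch

d = 16
W_mlp = torch.randn(d, d)       # the "linear transformation" the MLP would implement
W_reader = torch.randn(d, d)    # downstream layer reading the transformed vector

x = torch.randn(d)
out_two_step = W_reader @ (W_mlp @ x)   # MLP does the linear map, reader reads it
out_folded = (W_reader @ W_mlp) @ x     # reader learns the combined map directly

assert torch.allclose(out_two_step, out_folded, atol=1e-5)
```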
What is the activation name for the resid SAEs? hook_resid_post or hook_resid_pre?
I found https://github.com/ApolloResearch/e2e_sae/blob/main/e2e_sae/scripts/train_tlens_saes/run_train_tlens_saes.py#L220
to suggest _post
but downloading the SAETransformer from wandb shows:
(saes): ModuleDict(
  (blocks-6-hook_resid_pre): SAE(
    (encoder): Sequential( (0):...
which suggests _pre.
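For anyone else checking, a quick sketch in plain TransformerLens (this just lists the available residual hook names at a block; it doesn't by itself settle which one the checkpoint was trained on):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Both hook points exist per block; the question is which one the SAE reads from.
resid_hooks = [name for name in model.hook_dict if "blocks.6.hook_resid" in name]
print(resid_hooks)  # includes 'blocks.6.hook_resid_pre' and 'blocks.6.hook_resid_post'
```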
3. Those who are more able to comprehend and use these models are therefore of a higher agency/utility and higher moral priority than those who cannot. [emphasis mine]
This (along with saying "dignity" implies "moral worth" in Death w/ Dignity post), is confusing to me. Could you give a specific example of how you'd treat differently someone who has more or less moral worth (e.g. give them more money, attention, life-saving help, etc)?
One thing I could understand from your Death w/ Dignity excerpt is he's definitely implying a metric that scores every...
Kind of confused on why the KL-only e2e SAEs have worse CE than e2e+downstream across dictionary sizes:
This is true for layers 2 & 6. I'm unsure if this means that training for KL directly is harder/unstable, and the intermediate MSE is a useful prior, or if this is a difference in KL vs CE (ie the e2e does in fact do better on KL but worse on CE than e2e+downstream).
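For reference, a sketch of the two objectives I'm contrasting (assuming `logits_clean` from the unmodified model and `logits_sae` from the SAE-patched forward pass):

```python
import torch.nn.functional as F

def kl_to_clean(logits_sae, logits_clean):
    # KL(clean || sae-patched): train the SAE-patched model to match the original distribution
    return F.kl_div(
        logits_sae.log_softmax(-1).flatten(0, -2),
        logits_clean.log_softmax(-1).flatten(0, -2),
        log_target=True,
        reduction="batchmean",
    )

def ce_on_data(logits_sae, target_tokens):
    # CE against the actual next tokens: match the data directly
    return F.cross_entropy(logits_sae.flatten(0, -2), target_tokens.flatten())
```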
I finally checked!
Here is the Jaccard similarity (ie similarity of input-token activations) across seeds
The e2e ones do indeed have a much lower Jaccard sim (there normally is a spike at 1.0, but this goes away when you remove features that only activate <10 times).
I also (mostly) replicated the decoder similarity chart:
And calculated the encoder sim:
[I, again, needed to remove dead features (< 10 activations) to get the graphs here.]
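For concreteness, roughly the computation I ran (a sketch; `acts_a`/`acts_b` are boolean [n_tokens, n_features] matrices of which features fired on which tokens for two seeds):

```python
import torch

def jaccard_best_match(acts_a: torch.Tensor, acts_b: torch.Tensor, min_acts: int = 10):
    """For each feature in SAE A, Jaccard similarity with its best-matching feature in SAE B.
    Features firing on fewer than `min_acts` tokens are treated as dead and dropped."""
    a, b = acts_a.float(), acts_b.float()
    a = a[:, a.sum(0) >= min_acts]
    b = b[:, b.sum(0) >= min_acts]

    intersection = a.T @ b                                      # [n_a, n_b] co-activation counts
    union = a.sum(0)[:, None] + b.sum(0)[None, :] - intersection
    jaccard = intersection / union.clamp(min=1)
    return jaccard.max(dim=1).values                            # best match per feature in A
```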
So yes, I believe the original paper's claim that e2e features learn quite different features across seed...
And here's the code to convert it to NNsight (Thanks Caden for writing this a while ago!)
import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer
from nnsight.models.UnifiedTransformer import UnifiedTransformer
model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")
# Undo my hacky LayerNorm removal
for block in model.transformer.h:
    block.ln_1.weight.data = block.ln_1.weight.data / 1e6
    block.ln_1.eps = 1e-5
    block.ln_2.weight.data = block.ln_2.weight.data / 1e6
    block.ln_2.eps = 1e-5
... Maybe this should be like Anthropic's shared decoder bias? Essentially subtract off the per-token bias at the beginning, let the SAE reconstruct this "residual", then add the per-token bias back to the reconstructed x.
The motivation is that the SAE has a weird job in this case. It sees x, but needs to reconstruct x - per-token-bias, which means it needs to somehow learn what that per-token-bias is during training.
However, if you just subtract it first, then the SAE sees x', and just needs to reconstruct x'.
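Concretely, the forward pass I'm suggesting would look something like this (a sketch; `per_token_bias` is the lookup table and `sae.encode`/`sae.decode` are the usual encoder/decoder, names assumed):

```python
def forward_with_token_bias(sae, x, token_ids, per_token_bias):
    # Subtract the per-token bias first, so the SAE only has to reconstruct the "residual"...
    bias = per_token_bias[token_ids]          # [batch, seq, d_model] lookup
    residual_hat = sae.decode(sae.encode(x - bias))
    # ...then add the per-token bias back to get the reconstruction of x.
    return residual_hat + bias
```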
So I'm just suggesting changing...
That's great thanks!
My suggested experiment to really get at this question (which, if I were in your shoes, I wouldn't want to run since you've already done quite a bit of work on this project, lol):
Compare
1. Baseline 80x expansion (56k features) at k=30
2. Tokenized-learned 8x expansion (50k vocab + 6k features) at k=29 (since the token adds 1 extra feature)
for 300M tokens (I usually don't see improvements past this amount) showing NMSE and CE.
If tokenized-SAEs are still better in this experiment, then that's a pretty solid argument to use thes...
About similar tokenized features, maybe I'm misunderstanding, but this seems like a problem for any decoder-like structure.
I didn't mean to imply it's a problem, but the interpretation should be different. For example, if at layer N all the number tokens have cos-sim=1 in the tokenized-feature set, and we find a downstream feature reading from the " 9" token on a specific task, then we should conclude it's reading from a more general number direction rather than a specific number direction.
I agree this argument also applies to the normal SAE decoder (if the cos-sim=1)
Although, tokenized features are dissimilar to normal features in that they don't vary in activation strength. Tokenized features are either 0 or 1 (or norm of the vector). So it's not exactly an apples-to-apples comparison w/ a similar sized dictionary of normal SAE features, although that plot would be nice!
I do really like this work. This is useful for circuit-style work because the tokenized-features are already interpreted. If a downstream encoded feature reads from the tokenized-feature direction, then we know the only info being transmitted is info on the current token.
However, if multiple tokenized-features are similar directions (e.g. multiple tokenizations of word "the") then a circuit reading from this direction is just using information about this set of tokens.
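A quick way to check for that (a sketch; `token_lookup` is the [vocab_size, d_model] table of tokenized-feature directions, and the token ids passed in are hypothetical):

```python
import torch
import torch.nn.functional as F

def similar_token_directions(token_lookup: torch.Tensor, token_ids, threshold: float = 0.9):
    """Return pairs of tokens whose tokenized-feature directions are nearly parallel."""
    dirs = F.normalize(token_lookup[token_ids], dim=-1)
    sims = dirs @ dirs.T
    pairs = []
    for i in range(len(token_ids)):
        for j in range(i + 1, len(token_ids)):
            if sims[i, j] > threshold:
                pairs.append((token_ids[i], token_ids[j], sims[i, j].item()))
    return pairs
```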
Do you have a dictionary-size to CE-added plot? (fixing L0)
So, we did not use the lookup biases as additional features (only for decoder reconstruction)
I agree it's not like the other features in that the encoder isn't used, but it is used for reconstruction, which affects CE. It'd be good to show that the Pareto improvement in CE/L0 is not caused by just having an additional vocab_size number of features (although that might mean having to use auxk to have a similar number of alive features).
I agree. There is a tradeoff here for the L0/MSE curve & circuit-simplicity.
I guess another problem (w/ SAEs in general) is that optimizing for L0 leads to feature absorption. However, I'm unsure of a metric (other than L0/MSE) that does capture what we want.