All of Bart Bussmann's Comments + Replies

Interesting idea, I had not considered this approach before!

I'm not sure this would solve feature absorption though. Thinking about the "Starts with E-" and "Elephant" example: if the "Elephant" latent absorbs the "Starts with E-" latent, the "Starts with E-" feature will develop a hole and not activate anymore on the input "elephant". After the latent is absorbed, "Starts with E-" wouldn't be in the list to calculate cumulative losses for that input anymore. 

Matryoshka works because it forces the early-indexed latents to reconstruct well using only t... (read more)
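For readers unfamiliar with the setup, here is a rough sketch of a Matryoshka-style objective (hypothetical tensor names and prefix sizes, not the authors' actual implementation):

```python
import torch

def matryoshka_loss(x, latents, W_dec, prefix_sizes=(64, 256, 1024, 4096)):
    # Reconstruct the input using progressively larger prefixes of the
    # latents and sum the losses, so early-indexed latents are forced
    # to reconstruct well on their own.
    total = 0.0
    for k in prefix_sizes:
        x_hat = latents[:, :k] @ W_dec[:k]  # decode with the first k latents only
        total = total + (x - x_hat).pow(2).mean()
    return total
```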

Although the code has the option to add an L1 penalty, in practice I set the l1_coeff to 0 in all my experiments (see main.py for all hyperparameters).
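For concreteness, a minimal sketch of how such an optional L1 term usually enters the loss (hypothetical names; the real hyperparameters live in main.py):

```python
import torch

def sae_loss(x, x_hat, latents, l1_coeff=0.0):
    # Reconstruction term: mean squared error.
    recon = (x - x_hat).pow(2).mean()
    # Optional sparsity term: with l1_coeff = 0 (as in these
    # experiments) it contributes nothing to the loss.
    sparsity = l1_coeff * latents.abs().sum(dim=-1).mean()
    return recon + sparsity
```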

1Jose Sepulveda
Oh I see that, thanks! :) Super interesting work. I'm testing its application to recommender systems.

I haven't actually tried this, but recently heard about focusbuddy.ai, which might be a useful ai assistant in this space.

Bart BussmannΩ20352

Great work! I have been working on something very similar and will publish my results here some time next week, but can already give a sneak peek: 

The SAEs here were only trained for 100M tokens (1/3 of the TinyStories dataset). The language model was trained for 3 epochs on the 300M-token TinyStories dataset. It would be good to validate these results with more 'real' language models and train SAEs with much more data.

I can confirm that on Gemma-2-2B Matryoshka SAEs dramatically improve the absorption score on the first-letter task from Chanin et ... (read more)

3Noa Nabeshima
That's very cool, I'm looking forward to seeing those results! The Top-K extension is particularly interesting, as that was something I wasn't sure how to approach. I imagine you've explored important directions I haven't touched like better benchmarking, top-k implementation, and testing on larger models. Having multiple independent validations of an approach also seems valuable. I'd be interested in continuing this line of research, especially circuits with Matryoshka SAEs. I'd love to hear about what directions you're thinking of. Would you want to have a call sometime about collaboration or coordination? (I'll DM you!) Really looking forward to reading your post!

Three years later, we actually have LLMs with visible thoughts, such as DeepSeek, QwQ, and (although partially hidden from the user) o1-preview.

I (Nate) find it plausible that there are capabilities advances to be had from training language models on thought-annotated dungeon runs.


Good call!

1Martin Randall
But I don't think these came about through training on synthetic thought-annotated texts.

Sing along! https://suno.com/song/35d62e76-eac7-4733-864d-d62104f4bfd0

This project seems to be trying to translate whale language.

3Terence Coelho
I've learned a weird amount about whales from here this week. If unsupervised translation is possible for creatures with a language as different from ours as whales', that would be amazing. Especially if it could be done without monitoring their behaviors (although that might be asking for too much).

You might enjoy this classic: https://www.lesswrong.com/posts/9HSwh2mE3tX6xvZ2W/the-pyramid-and-the-garden

Rather than doubling down on a single single-layered decomposition for all activations, why not go with a multi-layered decomposition (ie: some combination of SAE and metaSAE, preferably as unsupervised as possible).  Or alternatively, maybe the decomposition that is most useful in each case changes and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context. 
 


Definitely seems like multiple ways to interpret this work, as also described in SAE feature geom... (read more)

Great work! Spelling is a very clear example of how information gets absorbed into SAE latents, and indeed in Meta-SAEs we found many spelling/sound-related meta-latents.

I have been thinking a bit on how to solve this problem and one experiment that I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE. 

Potentially, this would remove the "Starts-with-L"-component from the "l... (read more)
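A minimal sketch of what one training step of the adversarial setup described above could look like (entirely hypothetical: `sae` and `meta_sae` are assumed to return `(reconstruction, latents)`, and `adv_coeff` is a made-up hyperparameter):

```python
import torch

def adversarial_step(sae, meta_sae, x, sae_opt, meta_opt, adv_coeff=0.1):
    # --- Meta-SAE step: learn to decompose the SAE's decoder directions.
    directions = sae.decoder.weight.T.detach()   # one row per SAE latent
    meta_recon, _ = meta_sae(directions)
    meta_loss = (directions - meta_recon).pow(2).mean()
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()

    # --- SAE step: reconstruct activations while making the decoder
    # directions hard for the meta-SAE to reconstruct (GAN-style, hence
    # the negative sign). In practice you'd freeze meta_sae's weights here.
    x_hat, _ = sae(x)
    recon_loss = (x - x_hat).pow(2).mean()
    meta_recon, _ = meta_sae(sae.decoder.weight.T)
    adv_loss = -(sae.decoder.weight.T - meta_recon).pow(2).mean()
    sae_opt.zero_grad()
    (recon_loss + adv_coeff * adv_loss).backward()
    sae_opt.step()
    return recon_loss.item(), meta_loss.item()
```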

1Joseph Bloom
  Thanks! We were sad not to have time to try out Meta-SAEs but want to in the future. I think this is the wrong way to go to be honest. I see it as doubling down on sparsity and a single decomposition, both of which I think may just not reflect the underlying data generating process. Heavily inspired by some of John Wentworth's ideas here.  Rather than doubling down on a single single-layered decomposition for all activations, why not go with a multi-layered decomposition (ie: some combination of SAE and metaSAE, preferably as unsupervised as possible).  Or alternatively, maybe the decomposition that is most useful in each case changes and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context. 
2tailcalled
As I mentioned in my other comment, SAEs finds features that correspond to abstract features of words and text. That's not the same as finding features that correspond to reality.

Having been at two LH parties, one with music and one without, I definitely ended up in the "large conversation with 2 people talking and 5 people listening" situation much more at the party without music.

That said, I did find it much easier to meet new people at the party without music, as this also makes it much easier to join conversations that sound interesting when you walk past (being able to actually overhear them).

This might be one of the reasons why people tend to progressively increase the volume of the music during parties. First give people a chance to meet interesting people and easily join conversations. Then increase the volume to facilitate smaller conversations.

2Yoav Ravid
Yeah, when there's loud music it's much easier for me to understand people I know than people I don't, because I'm already used to their speaking patterns and can more easily infer what they said even when I don't hear it perfectly. And also because any misunderstanding or difficulty that arises out of not hearing each other well is less awkward with someone I already know than with someone I don't.

I just finished reading "Zen and the Art of Motorcycle Maintenance" yesterday, which you might enjoy reading as it explores the topic of Quality (what you call excellence). From the book:

“Care and Quality are internal and external aspects of the same thing. A person who sees Quality and feels it as he works is a person who cares. A person who cares about what he sees and does is a person who’s bound to have some characteristic of quality.”

Interesting! We find that all features in a smaller SAE have a feature in a larger SAE with cosine similarity > 0.7, but not all features in a larger SAE have a close relative in a smaller SAE (though about ~65% do have a close equivalent at a 2x scale-up).
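For concreteness, a sketch of how this comparison can be computed (hypothetical decoder matrices with one feature per row):

```python
import numpy as np

def max_cos_sims(small_dec, large_dec):
    """For each feature (row) in small_dec, return the highest cosine
    similarity with any feature in large_dec."""
    s = small_dec / np.linalg.norm(small_dec, axis=1, keepdims=True)
    g = large_dec / np.linalg.norm(large_dec, axis=1, keepdims=True)
    return (s @ g.T).max(axis=1)

# The claim above then reads: (max_cos_sims(small, large) > 0.7).all(),
# while max_cos_sims(large, small) clears the threshold for fewer features.
```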

Yes! This is indeed a direction that we're also very interested in and currently working on. 

As a sneak preview regarding the days of the week: we indeed find that the single weekday feature in the 768-feature SAE splits into the individual days of the week in the 49152-feature SAE, for example Monday, Tuesday, etc.

The weekday feature seems close to the mean of the individual day features.
 

2RogerDearnaley
Cool! That makes a lot of sense. So does it in fact split into three before it splits into 7, as I predicted based on dimensionality? I see a green dot, three red dots, and seven blue ones… On the other hand, the triangle formed by the three red  dots is a lot smaller than the heptagram, which I wasn't expecting… I notice it's also an oddly shaped heptagram.

Interesting! I actually did a small experiment with this a while ago, but never really followed up on it.

I would be interested to hear about your theoretical work in this space, so sent you a DM :)

Thanks!

Yeah, I think that's fair and don't necessarily think that stitching multiple SAEs is a great way to move the pareto frontier of MSE/L0 (although some tentative experiments showed they might serve as a good initialization if retrained completely). 

However, I don't think that low L0 should be a goal in itself when training SAEs, as L0 mainly serves as a proxy for the interpretability of the features, for lack of other good feature-quality metrics. As stitching features doesn't change the interpretability of the features, I'm not sure how useful/important the L0 metric still is in this context.

According to this Nature paper, the Atlantic Meridional Overturning Circulation (AMOC), the "global conveyor belt", is likely to collapse this century (mean 2050, 95% confidence interval is 2025-2095). 

Another recent study finds that it is "on tipping course" and predicts that after collapse, average February temperatures in London will decrease by 1.5 °C per decade (15 °C over 100 years). Bergen (Norway) February temperatures will decrease by 35 °C over the same period. This is a temperature change about an order of magnitude faster than normal global warming (0.2 °C per ... (read more)

I expect the 0.05 peak might be the minimum cosine similarity achievable when distributing 8192 vectors uniformly over a 512-dimensional space. I used a bit of a weird regularizer where I penalized:

mean cosine similarity + mean max cosine similarity + max max cosine similarity

I will check later whether the features in the 0.3 peak all have the same neighbour.
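A sketch of what that penalty could look like in code (an assumed implementation, not the actual one; `W_dec` is taken to hold one dictionary feature per row):

```python
import torch

def orthogonality_penalty(W_dec):
    # Pairwise cosine similarities between dictionary features.
    W = W_dec / W_dec.norm(dim=1, keepdim=True)
    cos = W @ W.T
    # Push the diagonal (self-similarity of 1) down to -1 so it never
    # wins the max; its effect on the mean is negligible for large dicts.
    cos = cos - 2.0 * torch.eye(len(W), device=W.device)
    max_per_feature = cos.max(dim=1).values
    # mean cos sim + mean max cos sim + max max cos sim
    return cos.mean() + max_per_feature.mean() + max_per_feature.max()
```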

1Demian Till
Nice, that's promising! It would also be interesting to see how those peaks are affected when you retrain the SAE both on the same target model and on different target models.

A quick and dirty first experiment with adding an orthogonality regularizer indicates that this can work without too much penalty on the reconstruction loss. I trained an SAE on the MLP output of a 1-layer model with dictionary size 8192 (16 times the MLP output size).

I trained this without the regularizer and got a reconstruction score of 0.846 at an L0 of ~17. 
With the regularizer, I got a reconstruction score of 0.828 at an L0 of ~18.

Looking at the cosine similarities between neurons:



Interesting peaks around a cosine similarity of 0.3 and 0.05 ther... (read more)

2jacob_drori
The peaks at 0.05 and 0.3 are strange. What regulariser did you use? Also, could you check whether all features whose nearest neighbour has cosine similarity 0.3 have the same nearest neighbour (and likewise for 0.05)?

Thanks for the suggestion! @BeyondTheBorg suggested something similar with his Transcendent AI. After some thought, I've added the following:

Transcendent AI: AGI uncovers and engages with previously unknown physics, moving into a physical reality beyond human comprehension. Its objectives use resources and dimensions that do not compete with human needs, allowing it to operate in a realm unfathomable to us. Humanity remains largely unaffected, as AGI progresses into the depths of these new dimensions, detached from human concerns.

 

Good proposal! I agree that this is a great opportunity to try out some ideas in this space.

Another proposal for the metric: 

The regrantor will judge in 5 years whether they are happy that they funded this project. This has a simple binary resolution criterion and nicely aligns the incentives of the market with those of the regrantor.

I agree that "Moral Realism AI" was a bit of a misnomer and I've changed it to "Convergent Morality AI".

Your scenario seems highly specific. Could you try to rephrase it in about three sentences, as in the other scenarios? 

I'm a bit wary about adding a lot of future scenarios that are outside of our reality and want the scenarios to focus on the future of our universe. However, I do think there is space for a scenario where our reality ends as it has achieved its goals (as in your scenario, I think?).

1kornai
Dear Bart, thanks for changing the name of that scenario. Mine is not just highly specific, it happens to be true in great part: feel free to look at the work of Alan Gewirth and subsequent discussion (the references are all actual).  That reality ends when a particular goal is achieved is an old idea (see e.g. https://en.wikipedia.org/wiki/The_Nine_Billion_Names_of_God) In that respect, the scenario I'm discussing is more in line with your "Partially aligned AGI" scenario.  The main point is indeed that the Orthogonality Thesis is false: for a sufficiently high level of intelligence, human or machine, the Golden Rule is binding. This rules out several of the scenarios now listed (and may help readers to redistribute the probability mass they assign to the remaining ones).

Thanks! I think your tag of @avturchin didn't work, so just pinging them here to see if they think I missed important and probable scenarios.

Taking the Doomsday argument seriously, the "Futures without AGI because we go extinct in another way" and the "Futures with AGI in which we die" seem most probable. In futures with conscious AGI agents, it will depend a lot on how experience gets sampled (e.g. one agent vs many).

Yes, good one! I've added the following:

Powergrab with AI: OpenAI, DeepMind, or another small group of people invent AGI and align it to their interests. In a short amount of time, they become all-powerful and rule over the world. 

I've disregarded the "wipe out everyone else" part, as I think that's unlikely enough for people who are capable of building an AGI.

Thanks, good suggestions! I've added the following:

Pious AI: Humanity builds AGI and adopts one of the major religions. Vast amounts of superintelligent cognition are devoted to philosophy, theology, and prayer. AGI proclaims itself to be some kind of Messiah, or merely God's most loyal and capable servant on Earth and beyond.

I think Transcendent AI is close enough to Far far away AI, where in this case far far away means another plane of physics. Similarly, I think your Matrix AI scenario is captured in:

Theoretical Impossibility: For some reason or another

... (read more)

I almost never consider character.ai, yet total time spent there is similar to Bing or ChatGPT. People really love the product, that visit duration is off the charts. Whereas this is total failure for Bard if they can’t step up their game.


Wow, wasn't aware they are this big. And they supposedly train their own models. Does anyone know if the founders have a stance on AI X-risk?

Interesting! Does it ask for a different confidence interval every time I see the card? Or will it always ask for the 90% confidence interval I see in the example card?

2Sage Future
Good question - each time it'll ask for your 50%, 60%, 70%, 80%, 90%, or 95% confidence interval (chosen at random). So over time you'll be able to see your calibration scores.

This strategy has never worked for me, but I can see it working for other people. If you want to try this though, it is important to make it clear to yourself which procedure you're following.

I believe that for my mechanism, it is very important to always follow up on the dice. If there is a dice outcome that would disappoint you, just don't put it on the list!

I can see this being a problem. However, I see myself as someone with very low willpower and this is still not a problem for me. I think this is because of two reasons:

  1. I never put an option on the list that I know I would/could not execute.
  2. I regard the dice outcome as somewhat holy. I would always pay out a bet I lost to a friend, partly because it's just the right thing to do and partly because I know that otherwise the whole mechanism of betting is worthless from that moment on. I guess that all my parts are happy enough with this system that none of them want to break it by not executing the action.

True. It does, however, resolve internal conflicts between multiple parts of yourself. Often when I have an internal conflict about something (let's say going to the gym vs. going to a bar), the default action is inaction, or thinking about it for an hour until I don't have enough time to do either.

I believe this is because both actions are unacceptable for the other part, which doesn't feel heard.

However, both parts can agree to a 66% chance of going to the gym and a 33% chance of going to the bar, and the die's decision is final.

2MSRayne
Whenever I've attempted this it has failed. Each of my parts is so stubborn that they can't even agree to obey the outcome of a die / coin flip. I just keep prevaricating. This seems like the sort of thing that takes willpower to succeed at.

I use the same strategy sometimes for internal coordination. Sometimes when I have a lot of things to do I tend to get overwhelmed, freeze and do nothing instead. 

A way for me to get out of this state is to write down 6 things that I could do, throw a die, and start with the action corresponding to the die's outcome!

2abramdemski
Nice! I think about doing this, sometimes, but never end up actually doing it. (Partly because I don't always need it, once I recognize the problem; but probably partly because if I'm procrastinating then I'm probably motivated to keep procrastinating rather than settle on anything.)

I'm very excited about this series! I have been using spaced repetition for general knowledge, specific knowledge, and language learning for years and am excited to see other applications of flash cards.

Especially using flash cards to remember happy memories seems very interesting to me. I have a specific photo album that I periodically review for warm fuzzy memories, but many of my best memories are never captured (and trying to capture everything in the moment can often ruin special moments), so creating flashcards for them afterward is an excellent idea.

1Florence Hinder
Thanks for the positive encouragement, much appreciated! 
2Optimization Process
Respectable Person: check. Arguing against AI doomerism: check. Me subsequently thinking, "yeah, that seemed reasonable": no check, so no bounty. Sorry! It seems weaselly to refuse a bounty based on that very subjective criterion, so, to keep myself honest, I'll post my reasoning publicly. His arguments are, roughly:

* Intelligence is situational / human brains can't pilot octopus bodies.
  * ("Smarter than a smallpox virus" is as meaningful as "smarter than a human" -- and look what happened there.)
* Environment affects how intelligent a given human ends up. "...an AI with a superhuman brain, dropped into a human body in our modern world, would likely not develop greater capabilities than a smart contemporary human."
  * (That's not a relevant scenario, though! How about an AI merely as smart as I am, which can teleport through the internet, save/load snapshots of itself, and replicate endlessly as long as each instance can afford to keep a g4ad.16xlarge EC2 instance running?)
* Human civilization is vastly more capable than individual humans. "When a scientist makes a breakthrough, the thought processes they are running in their brain are just a small part of the equation... Their own individual cognitive work may not be much more significant to the whole process than the work of a single transistor on a chip."
  * (This argument does not distinguish between "ability to design self-replicating nanomachinery" and "ability to produce beautiful digital art.")
* Intelligences can't design better intelligences. "This is a purely empirical statement: out of billions of human brains that have come and gone, none has done so. Clearly, the intelligence of a single human, over a single lifetime, cannot design intelligence, or else, over billions of trials, it would have already occurred."
  * (This argument does not distinguish between "ability to design intelligence" and "ability to design weapons that can level cities"; neither had ever happened, until one di

I think there is great promise here. So many overweight people don't work out because they just don't identify as the kind of person who would go running or to the gym. Developing exciting (addicting?) VR games that ease overweight people into workouts could be an interesting cause area!

2Yonatan Cale
Just saying I am not at all overweight myself, but I still want to work out. Regarding "cause area": there are pretty cheap actions to do here, like bringing a VR headset to your local meetup and helping people try it out. I brought it to an EA Israel retreat; one person bought his own headset a few days later, and a few others said they're considering it.
8Dustin
Just remember that (I think, not an expert!) exercise is much less important than diet when it comes to losing weight.  An hour run that burns 300 calories is swamped by having a double cheeseburger instead of a salad.

I like the chart and share the sentiment of spending more time on fun & important things, but the percentages seem unattainable to me. I recently noticed how large a part of my life I spend on 'maintenance': cooking, eating, cleaning, laundry, showering, sleeping, etc. But maybe this means I should focus on making these activities more fun!

5jasoncrawford
Well for what it's worth, these percentages are rough guesses based on gut feel, rather than being based on any kind of measurement or even quantitative estimation. Consider them “conceptual”/illustrative. You can minimize cooking by prioritizing meals that are simple to prepare (e.g., single-serving frozen vegetables that you place directly in the microwave and steam right in the bag). If you can afford it, you can pay other people to do cleaning and laundry for you (and more people ought to do this than currently do, I suspect). Etc. Sleep is different: it is ~1/3 of your life, and it is crucial for health. Consider this a guide to waking hours. And note that it is as much about energy as it is about time.

I have been using your app for a week now and I must say I really like it.  It's simple, clean, and has all the functionality it needs!

The European Medicines Agency (EMA) supports national authorities who may decide on possible early use of Paxlovid prior to marketing authorization, for example in emergency use settings, in the light of rising rates of infection and deaths due to COVID-19 across the EU.

Seems like great news for Europe!

https://www.ema.europa.eu/en/news/ema-issues-advice-use-paxlovid-pf-07321332-ritonavir-treatment-covid-19
 

Software: Anki

Need: Remembering anything

Other programs I've tried: Supermemo, Mnemosyne, Quizlet

Anki is a free and open-source flashcard program using spaced repetition, a technique from cognitive science for fast and long-lasting memorization. What makes it better than its alternatives are the countless plugins that can customize your learning experience and the fact that you can control the parameters of the algorithm.

1milo
Did you try Readwise? I found it more modern, with a vast array of source recording options.
5charlesoblack
On the broader topic of SRS, how do you deal with ever-increasing workloads? I've been a user for 4 years now and have been struggling with my current workload, unable to add any more cards.
4Gunnar_Zarncke
Anki decks by LW users points to http://www.stafforini.com/blog/anki-decks-by-lesswrong-users/  Anki on Android in 60 seconds

I used Anki for 3ish years and SuperMemo for the last year, and have to say I've liked SuperMemo exponentially more because of its incremental reading feature, where you put hundreds of sources to learn from (like LessWrong posts) into it, go over them over time, and can rank them by priority. It's far less of a pain to learn from things than making cards one by one.

Software: Pluckeye

Need: Blocking certain websites during certain times

Other programs I've tried: StayFocusd, ColdTurkey, AppBlock

In a fight against procrastination, I've tried many programs to block distracting websites during working hours, but many of them don't have enough flexibility, are too simple to bypass, or don't work on Linux. With Pluckeye you have basically any option you can think of, so you can customize the blocking entirely to your own needs.


Your Skedpal link leads to a sketchy site. I believe you meant Skedpal.
 

1IrenicTruth
Yes. Fixed.

They are probably talking about the machine learning model, like GPT-3.

This fallacy is known as Post hoc ergo propter hoc and is indeed a mistake that is often made. However, there are some situations in which we can infer causation from correlation, and where the arrow of time is very useful. These methods are mostly known as Granger causality methods, of which the basic premise is: X has a Granger-causal influence on Y if the prediction of Y from its own past, and the past of all other variables, is improved by additionally accounting for X. In practice, Granger causality relies on some heavy assumptions, such as that there are no unobserved confounders.
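As an illustration, statsmodels ships a standard implementation (toy data; the library's convention is that the test asks whether the second column Granger-causes the first):

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# y is driven by the previous value of x, so x should Granger-cause y.
y = np.roll(x, 1) + 0.1 * rng.normal(size=500)

data = np.column_stack([y, x])       # column order: [effect, cause]
results = grangercausalitytests(data, maxlag=2)
# Small p-values on the F-tests reject "x does not Granger-cause y".
```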

2jasoncrawford
Yes, @NeuroStats likes to call it “Granger prediction” for this reason

I do it first thing every morning, Monday through Friday. This is of course a personal preference, but I generally have trouble establishing habits in the evenings, due to reduced executive function. I like to immediately tick a task as completed when done (small dopamine boost), but when setting new goals I check whether there are any unresolved goals from other days.

The main change I have made is separating goals into different time categories. Before that, missing a daily goal had as much impact as missing a quarterly goal. Other than that, I haven't changed much about the routine.

Interesting! Did they just use it for aggregate business results or was it encouraged for personal goals as well? 

2Alexei
Just aggregate.

I have been using it myself for about 4 months now. I have not shared this technique with anyone else yet, so I don't know whether it works for other people. This is one of the reasons why I made this post: to hopefully inspire some other people to use it and see whether it works for them.

No, because I try to align my goals with my general well-being, and not just with raw work output. It's really more about intentional living than working hard. A goal might also be: "Take at least four 20-minute breaks from work today".