Is the LSRDR a proposed alternative to NNs in general?
What interpretability do you gain from it?
Could you show a comparison between a transformer embedding and your method with both performance and interpretability? Even MNIST would be useful.
Also, I found it very difficult to understand your post (e.g. you didn't explain your acronym! I had to infer it). You can use the "request feedback" feature on LW in the future; they typically give feedback quite quickly.
I agree w/ your general point, but think your specific example isn't considering the counterfactual. The possible choices aren't usually:
A. 50/50 chance of death/utopia
B. 100% chance of normal life
If a terminally ill patient would die next year with 100% certainty, then choice (A) makes sense! Most people aren't terminally ill patients though. In expectation, ~1% of the people you know will die every year (skewed towards older people). So a 50/50 gamble on death vs. utopia shouldn't be preferred by most people, & they should accept delaying utopia by 1 year for a >1% reduction in x-risk.[1]
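To put rough numbers on it, here's a minimal sketch (the 50% & 1% figures are from above; `delta`, the x-risk reduction bought by waiting, is an illustrative assumption):

```python
# Chance a random (non-terminally-ill) person reaches utopia under each option.
# All numbers are illustrative.

def p_utopia_race(p_doom=0.50):
    # Option A: race now -> 50/50 death/utopia.
    return 1 - p_doom

def p_utopia_wait(p_doom=0.50, delta=0.02, annual_mortality=0.01):
    # Option B: wait 1 year -> ~1% chance of dying of ordinary causes first,
    # in exchange for doom risk reduced by `delta`.
    return (1 - annual_mortality) * (1 - (p_doom - delta))

print(p_utopia_race())  # 0.50
print(p_utopia_wait())  # 0.99 * 0.52 ~= 0.515 -> waiting wins for a >1% reduction
```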
I can imagine someone's [husband] being terminally ill & they're willing to roll the dice; however, most people have loved ones that are younger (e.g. (grand)children, nephews/nieces, siblings, etc.), which would require them to value their [husband] vastly more than everyone else.[2]
However, if normal life is net-negative, then either death or utopia would be preferred over the status quo, which changes the decision. This is also a minority case, though.
However, folks could be short-sighted: thinking only of minimizing the suffering of the loved one in front of them, w/o considering the negative effects on their other loved ones. This isn't about differing utility functions, just about a better understanding of the situation.
AFAIK, I have similar values[1] but lean differently.
~1% of the world dies every year. If we accelerate AGI by 1 year, we save that 1%; push it back 1 year, we lose 1%. So pushing back 1 year is only worth it if it reduces P(doom) by >1%.
This means your P(doom) given our current trajectory very much matters. If your P(doom) is <1%, then pushing back a year isn't worth it.
The expected change conditional on accelerating also matters: if accelerating by 1 year increases e.g. global tensions, raising the chance of a war between nuclear states by X% w/ an expected Y deaths, that should count against acceleration too (I could see arguments either way though; I haven't thought too hard about this).
For me, I'm at ~10% P(doom). Whether I'd accept a proposed slowdown depends on how much I expect it to decrease this number.[2]
How do you model this situation? (also curious on your numbers)
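For reference, here's the toy model behind my numbers (a sketch; the racing-risk term and the example deltas are placeholder assumptions):

```python
# Break-even test for a 1-year slowdown. All parameters are assumptions.

annual_mortality = 0.01   # ~1% of the world dies each year
p_doom = 0.10             # my current on-trajectory P(doom); no slowdown can
                          # reduce x-risk by more than this

def slowdown_worth_it(delta_p_doom, years=1, extra_racing_risk=0.0):
    """A slowdown pays off if the x-risk reduction, plus whatever catastrophe
    risk racing itself adds (e.g. the nuclear-war term above, left as a
    placeholder), exceeds the ~1%/yr lost to ordinary mortality."""
    assert 0 <= delta_p_doom <= p_doom
    return (delta_p_doom + extra_racing_risk) > annual_mortality * years

print(slowdown_worth_it(0.005))  # False: 0.5pp reduction < 1% mortality cost
print(slowdown_worth_it(0.02))   # True:  2pp reduction > 1% mortality cost
```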
"focus should no longer be put into SAEs...?"
I think we should still invest research into them BUT it depends on the research.
Less interesting research:
1. Applying SAEs to [X-model/field] (or Y-problem w/o any baselines)
More interesting research:
In general, I'm still excited about an unsupervised method that finds all the model's features/functions. SAEs are one possible option, but others are being worked on! (e.g. APD & L3D for weight-based methods)
Relatedly, I'm also excited about interpretable-from-scratch architectures that do lend themselves more towards mech-interp (or bottom-up in Dan's language).
Just on the Dallas example, look at the +8x & -2x below:
So they multiplied all features in the China supernode by 8x and multiplied the Texas supernode (Texas is "under" China, meaning it's being "replaced") by -2x. That's really weird! It should be multiplying the Texas supernode by 0. If Texas is upweighting "Austin", then -2x-ing it could be downweighting "Austin", leading to cleaner top-output results. Notice how all the graphs have different numbers for upweighting & downweighting (it's good that they include that scalar in the images). This suggests the SAE latents didn't cleanly separate the features (that we think exist).
(With that said, in their paper itself, they're very careful & don't overclaim what their work shows; I believe it's a great paper overall!)
You can learn a per-token bias over all the layers to understand where in the model it stops representing the original embedding (or a linear transformation of it), like in https://www.lesswrong.com/posts/P8qLZco6Zq8LaLHe9/tokenized-saes-infusing-per-token-biases
You could also plot the cos-sims of the resulting biases across layers to see how much the representation rotates.
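A minimal sketch of that cos-sim check (the tensors and sizes here are stand-ins; in practice `per_token_bias` would be the trained per-token biases from the linked post and `W_E` the model's embedding matrix):

```python
import torch
import torch.nn.functional as F

# Stand-in sizes/tensors for illustration only.
n_layers, vocab_size, d_model = 12, 1_000, 64
per_token_bias = torch.randn(n_layers, vocab_size, d_model)  # trained biases
W_E = torch.randn(vocab_size, d_model)                       # token embeddings

for layer in range(n_layers):
    # Mean cosine similarity between this layer's biases and the embeddings;
    # a drop toward 0 suggests the residual stream has rotated away from
    # (stopped representing) the original embedding by this layer.
    cos = F.cosine_similarity(per_token_bias[layer], W_E, dim=-1).mean()
    print(f"layer {layer:2d}: mean cos-sim to W_E = {cos.item():+.3f}")
```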
That does clarify a lot of things for me, thanks!
Looking at your posts, there are no hooks or attempts to sell your work, which is a shame because LSRDRs seem useful. Since they are useful, you should be able to show it.
For example, you trained an LSRDR for text embedding, which you could show at the beginning of the post. Then show the cool properties of pseudo-determinism & lack of noise compared to NNs. THEN all the maths. That way the math folks know the post is worth their time, and the non-math folks can upvote and share with their mathy friends.
I am assuming that you care about [engagement, useful feedback, connections to other work, possible collaborators] here. If not, then sorry for the unwanted advice!
I’m still a little fuzzy on your work, but possibly related papers that come to mind are on tensor networks.
Tensorization is [cool, essentially] - https://arxiv.org/pdf/2505.20132 - mostly a position/theoretical paper arguing why tensorization is great and what its limitations are.
I'm pretty sure both sets of authors here read LW as well.