Logan Riggs

Comments

That does clarify a lot of things for me, thanks!

Looking at your posts, there are no hooks or attempts to sell your work, which is a shame because LSRDRs seem useful. Since they are useful, you should be able to show it.

For example, you trained an LSRDR for text embedding, which you could show at the beginning of the post. Then show the cool properties of pseudo-determinism & lack of noise compared to NNs. THEN all the math. That way the math folks know whether the post is worth their time, and the non-math folks can upvote and share with their mathy friends.

I am assuming that you care about [engagement, useful feedback, connections to other work, possible collaborators] here. If not, then sorry for the unwanted advice!

I’m still a little fuzzy on your work, but possibly related papers that come to mind are on tensor networks.

  1. Compositionality Unlocks Deep Interpretable Models - they efficiently train tensor networks on [harder MNIST], showing approximately equivalent loss to NNs, and show the inherent interpretability of their model.
  2. Tensorization is [Cool essentially] - https://arxiv.org/pdf/2505.20132 - mostly a position/theoretical paper arguing why tensorization is great and what its limitations are.

    I'm pretty sure both sets of authors here read LW as well.

Is the LSRDR a proposed alternative to NN’s in general?

What interpretability do you gain from it?

Could you show a comparison between a transformer embedding and your method with both performance and interpretability? Even MNIST would be useful.

Also, I found it very difficult to understand your post (e.g. you didn’t explain your acronym! I had to infer it). You can use the “request feedback” feature on LW in the future; they typically give feedback quite quickly.

Gut reaction is “nope!”.

Could you spell out the implication?

Correct! I did mean to communicate that in the first footnote. I agree valuing the unborn would drastically lower the amount of acceptable risk reduction.

I agree w/ your general point, but think your specific example isn't considering the counterfactual. The possible choices aren't usually: 

A. 50/50 chance of death/utopia
B. 100% chance of normal life

If a terminally ill patient would die next year with ~100% probability, then choice (A) makes sense! Most people aren't terminally ill patients though. In expectation, 1% of the people you know will die every year (skewing towards older people). So a 50/50 chance of death vs. utopia shouldn't be preferred by most people, & they should accept a delay of 1 year of utopia for a >1% reduction in x-risk.[1]

I can imagine someone's [husband] being terminally ill & they're willing to roll the dice; however, most people have loved ones that are younger (e.g. (grand)children, nephews/nieces, siblings, etc.), so rolling the dice would require them to value their [husband] vastly more than everyone else.[2]

  1. ^

    However, if normal life is net-negative, then either death or utopia would be preferred, changing the decision. This also applies to only a minority of people, though.

  2. ^

    However, folks could be short-sighted, thinking only of minimizing the suffering of the loved one in front of them, w/o considering the negative effects on their other loved ones. This isn't about the utility function, just about having a better understanding of the situation.
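A toy version of this comparison in code (a minimal sketch with illustrative numbers; the ~1% annual mortality figure is the rough all-ages rate from above, and nothing here is calibrated):

```python
# Option A: 50/50 chance of death vs. immediate utopia.
# Option B: utopia delayed by `delay_years`, paying baseline mortality while waiting.
annual_mortality = 0.01  # ~1% of the people you know die per year (higher for older folks)

def p_reach_utopia(delay_years: float, p_doom: float) -> float:
    """Chance of surviving the wait AND the transition going well."""
    return (1 - annual_mortality) ** delay_years * (1 - p_doom)

print(p_reach_utopia(0, 0.5))           # choice A (coin flip now): 0.50
print(p_reach_utopia(1, 0.10))          # wait a year at 10% doom:  ~0.89
print(p_reach_utopia(1, 0.10 - 0.015))  # the delay buys 1.5pp less doom: ~0.91
```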

AFAIK, I have similar values[1] but lean differently.

~1% of the world dies every year. If we accelerate AGI by 1 year, we save 1%; push it back 1 year, we lose 1%. So pushing back 1 year is only worth it if it reduces P(doom) by at least 1%.

This means your P(doom) given our current trajectory very much matters. If your P(doom) is <1%, then pushing back a year isn't worth it.

The expected change conditional on accelerating also matters, e.g. if accelerating by 1 year increases global tensions, raising the chance of a war between nuclear states by X% w/ an expected Y deaths (I could see arguments either way though; haven't thought too hard about this).

For me, I'm at ~10% P(doom). Whether I'd accept a proposed slowdown depends on how much I expect it to decrease this number.[2]
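As a rough sketch of that break-even point (illustrative only; the 1% figure is the annual mortality estimate above):

```python
# A 1-year delay costs roughly the fraction of currently living people who die in the
# meantime, so it's only worth it if it buys at least that much reduction in P(doom).
annual_mortality = 0.01  # ~1% of the world dies every year

def worth_delaying(delta_p_doom: float, years: float = 1.0) -> bool:
    """True if the doom reduction bought by the delay exceeds the lives lost to waiting."""
    lives_lost_to_delay = 1 - (1 - annual_mortality) ** years
    return delta_p_doom > lives_lost_to_delay

print(worth_delaying(0.005))  # False: a 0.5pp reduction doesn't pay for a 1-year delay
print(worth_delaying(0.02))   # True: a 2pp reduction does
```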

How do you model this situation? (Also curious about your numbers.)

Assumptions: 

  1. We care about currently living people equally (alternatively, if you cared mostly about your young children, you'd happily accept a reduction in x-risk of 0.1% (possibly even 0.02%). Actuary table here)
  2. Using expected value, which only mostly matches my intuitions (e.g. I'd actually accept pushing back 2 years for a reduction of x-risk from 1% to ~0%) 
  1. ^

    I mostly care about people I know, somewhat about people in general, and the cosmic endowment would be nice, sure, but it's only 10% of the value for me.

  2. ^

    Most of my (currently living) loved ones skew younger, with a ~0.5% expected death-rate, so I'd accept a lower expected reduction in x-risk (maybe 0.7%).

"focus should no longer be put into SAEs...?"

I think we should still invest research into them BUT it depends on the research. 

Less interesting research:

  1. Applying SAEs to [X-model/field] (or Y-problem w/o any baselines)

More interesting research:

  1. Problems w/ SAEs & possible solutions
    1. Feature suppression (solved by post-training, gated-SAEs, & top-k)
    2. Feature absorption (possibly solved by Matryoshka SAEs)
    3. SAEs don't find the same features across seeds (maybe solved by constraining latents to the convex hull of the data)
    4. Dark-matter of SAEs (nothing AFAIK)
    5. Many more I'm likely forgetting/haven't read
  2. Comparing SAEs w/ strong baselines for solving specific problems
  3. Using SAEs to test how true the linear representation hypothesis is
  4. Changing SAE architecture to match the data

In general, I'm still excited about an unsupervised method that finds all the model's features/functions. SAEs are one possible option, but others are being worked on! (e.g. APD & L3D for weight-based methods)

Relatedly, I'm also excited about interpretable-from-scratch architectures that lend themselves more towards mech-interp (or bottom-up, in Dan's language).

Just on the Dallas example, look at this +8x & -2x below:

[image from the attribution-graph write-up: the "China" super-node is scaled by +8x and the "Texas" super-node by -2x]

So they 8x'd all features in the China super-node and multiplied the Texas super-node (Texas is "under" China, meaning it's being "replaced") by -2x. That's really weird! It should be multiplying the Texas node by 0. If Texas is upweighting "Austin", then -2x-ing it could be downweighting "Austin", leading to cleaner top-output results. Notice how all the graphs have different numbers for upweighting & downweighting (it's good that they include that scalar in the images). This means the SAE latents didn't cleanly separate the features we think exist.

(With that said, in their paper itself, they're very careful & don't overclaim what their work shows; I believe it's a great paper overall!)
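To make the intervention concrete, here's a minimal sketch of the kind of super-node steering being described (the indices, shapes, and names are hypothetical, and this is not their actual code): scale one group of SAE latents by +8x and another by -2x during the forward pass instead of zeroing the replaced group.

```python
import torch

# Hypothetical sizes; a real SAE is much wider.
d_model, n_latents = 64, 512
torch.manual_seed(0)
W_dec = torch.randn(n_latents, d_model)        # SAE decoder directions
latents = torch.relu(torch.randn(n_latents))   # latent activations at one token position

china_node = [3, 17, 42]    # hypothetical latent indices in the "China" super-node
texas_node = [5, 99, 200]   # hypothetical latent indices in the "Texas" super-node

steered = latents.clone()
steered[china_node] *= 8.0   # upweight the injected concept
steered[texas_node] *= -2.0  # note: not zeroed out, which is the oddity flagged above

# The change in the SAE reconstruction is what gets added back into the residual stream.
delta_resid = (steered - latents) @ W_dec
print(delta_resid.shape, delta_resid.norm())
```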

You can learn a per-token bias at every layer to see where in the model it stops representing the original embedding (or a linear transformation of it), like in https://www.lesswrong.com/posts/P8qLZco6Zq8LaLHe9/tokenized-saes-infusing-per-token-biases


You could also plot the cos-sims of the resulting biases to see how much the representation rotates across layers.
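A minimal sketch of what that could look like (hypothetical shapes/names; the linked post learns the biases by gradient descent, but a per-token mean of residual-stream activations at each layer gets at the same idea):

```python
import torch
import torch.nn.functional as F

def per_token_bias(token_ids: torch.Tensor, acts: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Per-token bias at one layer: mean residual activation over occurrences of each token.

    token_ids: (N,) int64 token ids; acts: (N, d_model) residual-stream activations at that layer.
    Returns (vocab_size, d_model).
    """
    sums = torch.zeros(vocab_size, acts.shape[1], dtype=acts.dtype).index_add_(0, token_ids, acts)
    counts = torch.bincount(token_ids, minlength=vocab_size).clamp(min=1).unsqueeze(1)
    return sums / counts

def embed_cos_sim(bias: torch.Tensor, embed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each token's layer-l bias and its original embedding."""
    return F.cosine_similarity(bias, embed, dim=-1)  # (vocab_size,)

# Plotting embed_cos_sim(per_token_bias(ids, acts_l, V), embed) across layers l shows where the
# model stops representing (something close to) the original embedding; comparing the biases of
# adjacent layers instead shows how much the representation rotates layer to layer.
```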

"In the next two decades we're likely to reach longevity escape velocity: the point at which medicine can increase our healthy lifespans faster than we age."

I have the same belief and have thought about how bad it’d be if my loved ones died too soon.

Sorry for your loss.
