All of Jaehyuk Lim's Comments + Replies

Hey, thanks for the reply. Yes, we tried k-means and agglomerative clustering, with mixed results.

We'll try PaCMAP instead and see if it is better!
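For concreteness, here's a minimal sketch of what that comparison might look like (the array shapes, cluster counts, and PaCMAP settings are placeholders, not our actual setup):

```python
import numpy as np
import pacmap  # pip install pacmap
from sklearn.cluster import AgglomerativeClustering, KMeans

# Placeholder feature/activation matrix: (n_samples, n_dims)
X = np.random.randn(1000, 512).astype(np.float32)

# What we tried so far: clustering directly in the high-dimensional space.
kmeans_labels = KMeans(n_clusters=8, random_state=0).fit_predict(X)
agglom_labels = AgglomerativeClustering(n_clusters=8).fit_predict(X)

# Next attempt: embed with PaCMAP first, then inspect / cluster in 2-D.
embedder = pacmap.PaCMAP(n_components=2, n_neighbors=10)
X_2d = embedder.fit_transform(X, init="pca")
pacmap_labels = KMeans(n_clusters=8, random_state=0).fit_predict(X_2d)
```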

Although not "circuit-style," this could also be considered one of the attempts outlined by Mack et al. (2024).

https://www.lesswrong.com/posts/ioPnHKFyy4Cw2Gr2x/#:~:text=Unsupervised%20steering%20as,more%20distributed%20circuits.

some issues related to causal interpretations

Could you point to the specific line from Marks et al. that you're referring to?

Sodium
Sorry, I linked to the wrong paper! Oops, just fixed it. I meant to link to Aaron Mueller's Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks.

Do you also conclude that the causal role of the circuit you discovered was spurious? What's a better way to incorporate the sample-level variance you mention when measuring the effectiveness of an SAE feature or steering vector? (i.e., should a good metric of causal importance require both sample-level and population-level increases?)


Could you also link to an example where a causal intervention satisfied the above-mentioned criteria (or your own alternative that wasn't mentioned in this post)?

Daniel Tan
In the steering vectors work I linked, we looked at how much of the variance in the metric was explained by a spurious factor, and I think that could be a useful technique if you have some a priori intuition about what the variance might be due to. However, this doesn't mean we can just test a bunch of hypotheses, because that looks like p-hacking.

Generally, I do think that 'population variance' should be a metric that's reported alongside 'population mean' in order to contextualize findings. But again, this doesn't paint a very clean picture; variance being high could be due to heteroscedasticity, among other things.

I don't have great solutions for this illusion outside of those two recommendations. One naive way we might try to solve this is to remove things from the dataset until the variance is minimal, but it's hard to do this in a way that doesn't eventually look like p-hacking.

I would guess that the IOI SAE circuit we found is not unduly influenced by spurious factors, and that the analysis using (variance in the metric difference explained by ABBA / BABA) would corroborate this. I haven't rigorously tested this, but I'd be very surprised if it turned out not to be the case.
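To make that check concrete, a minimal sketch of the kind of analysis described above (per-sample metric differences plus a binary ABBA/BABA indicator; the data here are placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder: per-sample metric differences from an intervention,
# plus a candidate spurious factor (0 = ABBA template, 1 = BABA template).
rng = np.random.default_rng(0)
metric_diff = rng.normal(size=200)
factor = rng.integers(0, 2, size=200)

# Report population variance alongside the population mean.
print("population mean:    ", metric_diff.mean())
print("population variance:", metric_diff.var())

# Fraction of the variance in the metric difference explained by the factor (R^2).
X = factor.reshape(-1, 1)
r2 = LinearRegression().fit(X, metric_diff).score(X, metric_diff)
print("variance explained by ABBA/BABA:", r2)
```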
Jaehyuk Lim

Is there a codebase for the supervised dictionary work?

Alex Makelov
Hi - there's code here https://github.com/amakelov/sae which covers almost everything reported in the blog post. Let me know if you have more specific questions (or open an issue) and I can point to / explain specific parts of the code!

"I want AI to do my laundry and dishes so that I can do art and writing, not for AI to do my art and writing so that I can do my laundry and dishes."

 


As artificial intelligence continues to advance and demonstrate increasingly impressive creative capabilities, it raises important questions about the role and value of human creativity. While AI has the potential to enhance and augment human creativity in many ways, it also threatens to compress or diminish it in certain domains. This essay explores the complex interplay between AI and human creativity,... (read more)

Thank you for the feedback, and thanks for this.

Who else is actively pursuing sparse feature circuits, in addition to Sam Marks? I'm curious because the code breaks in the forward pass of the linear layer on GPT-2, since the dimensions are different from Pythia's (768).

Joseph Bloom
SAEs are model-specific. You need Pythia SAEs to investigate Pythia. I don't have a comprehensive list, but you can look at the sparse autoencoder tag on LW for relevant papers.
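As a quick illustration of why the forward pass breaks, a minimal sketch (the SAE width and dimensions are placeholders; the point is just that the SAE's input dimension has to match the model's hidden size):

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical SAE encoder weight trained on a 512-dim residual stream
# (e.g. Pythia-70m); shape is (d_in, d_sae).
W_enc = torch.randn(512, 16384)

model = AutoModelForCausalLM.from_pretrained("gpt2")
d_model = model.config.hidden_size  # 768 for GPT-2 small

# An SAE trained on a 512-dim stream can't be applied to GPT-2's 768-dim
# residual stream; check the dimensions up front instead of failing mid-forward-pass.
if W_enc.shape[0] != d_model:
    print(
        f"Mismatch: SAE expects d_in={W_enc.shape[0]} but the model has "
        f"d_model={d_model}; use an SAE trained on this model instead."
    )
```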

I agree with you there. There are numerous benefits to being an autodidact (freedom to learn what you want, less pressure from authorities), but formal education offers more mentorship. For most people, the desire to learn something is often not enough, even with the increased accessibility of information, as the material gets more complex.

Do you see possible dangers of closed-loop automated interpretability systems as well?

the only aligned AIs are those which are digital emulations of human brains

I don't think this is necessarily true. I don't think emulated human brains are necessary for full alignment, nor am I sure that emulated human brains would be more aligned than a well-calibrated and scaled-up version of our current alignment techniques (plus new ones to be discovered in the next few years). Emulating the entire human brain to align values seems not only implausible (even with neuromorphic computing, efficient neural networks, and Moore's law^1000), it seems like an overk...

Garrett Baker
This conclusion was the result of conditioning on a world where, in order to decode human values from the brain, we need to understand the entire brain. I agree with you when this is not the case, but to different degrees depending on how much of the brain must be decoded.
[comment deleted]

What's the difference between "having a representation" for uppercase/lowercase and using that representation to solve an MCQ or A/B test? From your investigations, do you have intuitions about the mechanism behind this disconnect? I'm interested in what might cause these models to perform poorly despite having representations that, at least to us, seem relevant to solving the task.

Considering that the tokenizer architecture for Mistral-7B probably includes a case-sensitive dictionary (https://discuss.huggingface.co/t/case-sensitivity...
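For reference, a quick way to check that case-sensitivity point is a minimal sketch like the one below (assuming you can load the standard Mistral-7B tokenizer from the Hub; exact token splits may differ by tokenizer version):

```python
from transformers import AutoTokenizer

# Mistral-7B's tokenizer treats case distinctly: "Hello" and "hello"
# generally map to different token sequences.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

for text in ["Hello world", "hello world", "HELLO WORLD"]:
    print(f"{text!r} -> {tok.tokenize(text)}")
```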