Jaehyuk Lim

Inner-aligning AI and trying to ask better questions. 


Comments


Hey, thanks for the reply. Yes, we tried k-means and agglomerative clustering; they worked, but with mixed results.

We'll try PaCMAP instead and see if it is better!
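In case it helps to be concrete, here's roughly the kind of pipeline we have in mind (a minimal sketch with placeholder data and hyperparameters, not our actual setup; PaCMAP is a dimensionality-reduction step, so we'd still cluster on top of its projection):

```python
# Minimal sketch: cluster activation vectors directly, then after a PaCMAP projection.
# X is a placeholder for an (n_samples, n_features) array of activations; the cluster
# counts and PaCMAP settings are illustrative, not our real configuration.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
import pacmap  # pip install pacmap

X = np.random.randn(1000, 512).astype(np.float32)  # stand-in for real activations

# What we tried so far: clustering in the original activation space.
kmeans_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=10).fit_predict(X)

# What we'll try next: project with PaCMAP first, then cluster in the low-dim space.
reducer = pacmap.PaCMAP(n_components=2, random_state=0)
X_2d = reducer.fit_transform(X)
pacmap_kmeans_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_2d)
```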

Although not "circuit-style," this could also be considered one of the attempts outlined by Mack et al. (2024).

https://www.lesswrong.com/posts/ioPnHKFyy4Cw2Gr2x/#:~:text=Unsupervised%20steering%20as,more%20distributed%20circuits.

some issues related to causal interpretations

Could you point to the specific line from Marks et al. that you are referring to?

Do you also conclude that the causal role of the circuit you discovered was spurious? What's a better way to incorporate the sample-level variance you mention when measuring the effectiveness of an SAE feature or SV? (i.e., should a good metric of causal importance require an increase at both the sample and population level?)


Could you also link to an example where a causal intervention satisfied the criteria mentioned above (or your own alternative that wasn't mentioned in this post)?
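To make concrete what I mean by sample- vs. population-level, here's a toy sketch (all names and numbers are hypothetical, not from Marks et al. or our experiments):

```python
# Toy illustration of sample- vs. population-level causal effect of one SAE feature / SV.
# effects[i] = change in the behavior metric on sample i when the feature is ablated.
# Purely illustrative placeholder data.
import numpy as np

rng = np.random.default_rng(0)
effects = rng.normal(loc=0.1, scale=0.5, size=500)  # stand-in for per-sample effects

population_level_increase = effects.mean() > 0       # improves on average
sample_level_consistency = (effects > 0).mean()      # fraction of samples that improve

print(f"mean effect: {effects.mean():.3f}")
print(f"fraction of samples with positive effect: {sample_level_consistency:.2%}")
# A feature can pass the population-level test (positive mean effect) while the
# effect is positive on barely half of the individual samples.
```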


Is there a codebase for the supervised dictionary work?

Thank you for the feedback, and thanks for this.

Who else is actively pursuing sparse feature circuits in addition to Sam Marks? I'm curious because the code breaks in the forward pass of the linear layer in GPT-2, since the dimensions are different from Pythia's (768).
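For reference, here's the kind of mismatch I mean (just a quick check of hidden sizes via Hugging Face configs; it's not code from the sparse feature circuits repo, and which Pythia checkpoint the dictionaries were trained for is my assumption):

```python
# Compare residual-stream widths: dictionaries trained for one model's d_model can't be
# reused in another, since the SAE's encoder/decoder linear layers expect a fixed width.
from transformers import AutoConfig

for name in ["gpt2", "EleutherAI/pythia-70m-deduped", "EleutherAI/pythia-160m-deduped"]:
    cfg = AutoConfig.from_pretrained(name)
    print(f"{name}: hidden_size = {cfg.hidden_size}")
# If a dictionary's input dimension differs from the new model's hidden_size, the
# forward pass through its linear layers raises a shape mismatch.
```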

I agree with you there. There are numerous benefits to being an autodidact (freedom to learn what you want, less pressure from authorities), but formal education offers more mentorship. For most people, the desire to learn something is often not enough as the material gets more complex, even with the increased accessibility of information.

Do you see possible dangers of closed-loop automated interpretability systems as well?

the only aligned AIs are those which are digital emulations of human brains

I don't think this is necessarily true. I don't think emulated human brains are necessary for full alignment, nor am I convinced that emulated human brains would be more aligned than a well-calibrated and scaled-up version of our current alignment techniques (plus new ones to be discovered in the next few years). Emulating the entire human brain to align values seems not only implausible (even with neuromorphic computing, efficient neural networks, and Moore's law^1000), but also like overkill and a misallocation of valuable computational resources. Assuming I'm understanding "emulated human brains" correctly, emulation would mean pseudo-sentient systems designed solely to be aligned with our values. Perhaps morality can be a bit simpler than that, somewhere in the middle between static, written rules (the law) and the unpredictable human mind. And if we essentially just make more people, that doesn't really address our "many biases or philosophical inadequacies."
