LESSWRONG
Interpretability (ML & AI)
• Applied to Towards a Unified Interpretability of Artificial and Biological Neural Networks by jan_bauer 1d ago
• Applied to A short critique of Omohundro's "Basic AI Drives" by Soumyadeep Bose 3d ago
• Applied to Learning Multi-Level Features with Matryoshka SAEs by Bart Bussmann 4d ago
• Applied to Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces by Matthew A. Clarke 4d ago
• Applied to Matryoshka Sparse Autoencoders by Noa Nabeshima 6d ago
• Applied to Testing which LLM architectures can do hidden serial reasoning by Filip Sondej 7d ago
• Applied to SAEBench: A Comprehensive Benchmark for Sparse Autoencoders by Can 12d ago
• Applied to Backdoors have universal representations across large language models by Amirali Abdullah 16d ago
• Applied to Gradient Routing: Masking Gradients to Localize Computation in Neural Networks by TurnTrout 17d ago
• Applied to Are SAE features from the Base Model still meaningful to LLaVA? by Shan23Chen 17d ago
• Applied to Deep Learning is cheap Solomonoff induction? by Lucius Bushnaq 19d ago
• Applied to Intricacies of Feature Geometry in Large Language Models by 7vik 20d ago
• Applied to Beyond Gaussian: Language Model Representations and Distributions by Matt Levinson 20d ago
• Applied to AXRP Episode 38.2 - Jesse Hoogland on Singular Learning Theory by DanielFilan 26d ago
• Applied to Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders by PaulPauls 1mo ago