LESSWRONG

Apollo Research (org)

Applied to Try training token-level probes by StefanHex 1mo ago
Applied to Detecting Strategic Deception Using Linear Probes by Nicholas Goldowsky-Dill 3mo ago
Applied to Paper: Open Problems in Mechanistic Interpretability by bilalchughtai 3mo ago
Applied to Attribution-based parameter decomposition by Lucius Bushnaq 3mo ago
Applied to Activation space interpretability may be doomed by bilalchughtai 4mo ago
Applied to An Opinionated Evals Reading List by Marius Hobbhahn 7mo ago
Applied to [Interim research report] Activation plateaus & sensitive directions in GPT2 by StefanHex 8mo ago
Applied to Apollo Research is hiring evals and interpretability engineers & scientists by Lee Sharkey 9mo ago
Applied to Theories of Change for AI Auditing by Lee Sharkey 9mo ago
Applied to A starter guide for evals by Lee Sharkey 9mo ago
Applied to We need a Science of Evals by Lee Sharkey 9mo ago
Applied to Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs by Lee Sharkey 9mo ago
Applied to Understanding strategic deception and deceptive alignment by Lee Sharkey 9mo ago
Applied to Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning by Lee Sharkey 9mo ago
Applied to The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks by Lee Sharkey 9mo ago
Applied to Interpretability: Integrated Gradients is a decent attribution method by Lee Sharkey 9mo ago
Applied to You can remove GPT2’s LayerNorm by fine-tuning for an hour by Lee Sharkey 9mo ago
Applied to Sparsify: A mechanistic interpretability research agenda by Lee Sharkey 9mo ago
Applied to Apollo Research 1-year update by Lee Sharkey 9mo ago
Applied to Announcing Apollo Research by Lee Sharkey 9mo ago