Eliciting Latent Knowledge (ELK)
• Applied to [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF by Leon Lang, 2mo ago
• Applied to Clarifying Alignment Fundamentals Through the Lens of Ontology by eternal/ephemera, 2mo ago
• Applied to Mechanistic Anomaly Detection Research Update by Nora Belrose, 4mo ago
• Applied to Covert Malicious Finetuning by Tony Wang, 6mo ago
• Applied to "What the hell is a representation, anyway?" | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents by IwanWilliams, 6mo ago
• Applied to Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity. by Josh Levy, 7mo ago
• Applied to CCS on compound sentences by Artyom Karpov, 8mo ago
• Applied to AXRP Episode 29 - Science of Deep Learning with Vikrant Varma by DanielFilan, 8mo ago
• Applied to Finding the estimate of the value of a state in RL agents by Clément Dumas, 8mo ago
• Applied to Auditing LMs with counterfactual search: a tool for control and ELK by Jacob Pfau, 10mo ago
• Applied to Striking Implications for Learning Theory, Interpretability — and Safety? by RogerDearnaley, 1y ago
• Applied to Measurement tampering detection as a special case of weak-to-strong generalization by ryan_greenblatt, 1y ago
• Applied to Discussion: Challenges with Unsupervised LLM Knowledge Discovery by Seb Farquhar, 1y ago
• Applied to Betting on what is un-falsifiable and un-verifiable by Abhimanyu Pallavi Sudhir, 1y ago
• Applied to Eliciting Latent Knowledge in Comprehensive AI Services Models by acabodi, 1y ago
• Applied to Robustness of Contrast-Consistent Search to Adversarial Prompting by Nandi, 1y ago
• Applied to Discovering Latent Knowledge in the Human Brain: Part 1 – Clarifying the concepts of belief and knowledge by Joseph Emerson, 1y ago
• Applied to Attributing to interactions with GCPD and GWPD by jenny, 1y ago
• Applied to A personal explanation of ELK concept and task. by Zeyu Qin, 1y ago