Shard Theory
Applied to Self-dialogue: Do behaviorist rewards make scheming AGIs? by Steven Byrnes 1mo ago
Dakara v1.9.0 Dec 30th 2024 GMT (+6/-6) 1
Applied to Gradient Routing: Masking Gradients to Localize Computation in Neural Networks by TurnTrout 3mo ago
Applied to Programming Refusal with Conditional Activation Steering by Bruce W. Lee 3mo ago
Applied to Reward Bases: A simple mechanism for adaptive acquisition of multiple reward type by Bogdan Ionut Cirstea 4mo ago
Applied to Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake by TurnTrout 4mo ago
Noosphere89 v1.8.0 Oct 26th 2024 GMT (+2) 3
Applied to Exploring Shard-like Behavior: Empirical Insights into Contextual Decision-Making in RL Agents by Alejandro Aristizabal 6mo ago
Applied to Why I'm bearish on mechanistic interpretability: the shards are not in the network by tailcalled 6mo ago
Applied to Shard Theory - is it true for humans? by Rishika 9mo ago
Sohaib Imran v1.7.0 Nov 14th 2023 GMT (+7/-8) 3
Sohaib Imran v1.6.0 Nov 14th 2023 GMT (+8/-7) 1
Applied to Paper: Understanding and Controlling a Maze-Solving Policy Network by TurnTrout 1y ago
Applied to AXRP Episode 22 - Shard Theory with Quintin Pope by DanielFilan 2y ago
Applied to Steering GPT-2-XL by adding an activation vector by TurnTrout 2y ago
Applied to Some Thoughts on Virtue Ethics for AIs by TurnTrout 2y ago