This is a linkpost for our two recent papers:
1. An exploration of using degeneracy in the loss landscape for interpretability: https://arxiv.org/abs/2405.10927
2. An empirical test of an interpretability technique based on the loss landscape: https://arxiv.org/abs/2405.10928
This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu. Not to be confused with Apollo's recent Sparse Dictionary Learning paper.
A key obstacle to mechanistic interpretability is finding the right representation of neural network internals. Ideally, we would derive our features from some high-level principle that holds across different architectures and use cases. At a minimum, we know two things:
Concretely, I guess current tech can get a message out to a few targets at 10^3 to 10^6 light years' distance. An ASI can use many physical probes near light speed, accelerated using energy from a Dyson swarm, so I'd guess it would be only a few years behind. I don't expect there to be aliens within 10^6 light years, nor do we know which stars to target, and it's again unlikely that they happen to be in the thin window of technological development where a warning message from us helps them.
Kindness may also have an attractor, or, due to discreteness, occupy a volume > 0 in weight space.
The question is whether the attractor is big enough. And given the various impossibility theorems related to corrigibility & coherence, I anticipate that the attractor around corrigibility is quite small, because one has to evade several obstacles at once. On the other hand, proxies that flow into a non-corrigible location once we ramp up intelligence aren't obstructed by the same theorems, so they can be just as numerous as proxies for kindness.
Regarding your concrete attractor: if the AI doesn't improve its world model and decisions, i.e. its intelligence, then it's also not useful for us. And a human in the loop doesn't help if the AI's proposals are inscrutable to us, because then we'll just wave them through and are essentially not in the loop anymore. A corrigible AI can be trusted to improve its intelligence because it only does so in ways that preserve corrigibility.
Oops. Then I don't get what techniques you are proposing. Most techniques that claim to work for superintelligence / powerful agents also claim to work in some more limited manner for current agents (in part because most techniques assume that no phase change occurs between now and then, or that the phase change doesn't affect the technique, so the technique would only stop working gradually and one can do empirical studies on current models).
And while there certainly is some loss function or initial random seed for current techniques that would give you an aligned superintelligence, there's no way to find it.
By pseudo-kindness I mean any proxy for kindness that is both wrong enough to have no overlap with kindness when optimized for by a superintelligence, and right enough to overlap with kindness when optimized for by current LLMs.
Kindness is some property that behavior & consequences can exhibit. There are many properties in general, and still many that correlate strongly with kindness in a narrow test environment. Some of these proxy properties are algorithmically simple (and thus plausibly found in LLMs and thus again in superintelligence); some even share subcomputations/subdefinitions with kindness. There's a degrees-of-freedom argument about how many such proxies there are. Concretely, one can give examples, e.g. "if...
Empirically, current LLM behavior is better predicted by a model [...] than by a model [...]
Under capability growth, the second model can indeed yield a capable reasoner steered by reflexes towards approximately true kindness. And if we get enough training before ASI, the approximation can become good enough that, due to discreteness or attractors, it just is equal to true kindness.
The first model just generalizes to a capable misaligned reasoner.
I expect that all processes that promote kind-looking outputs route either through reflexes towards pseudo-kindness, or through instrumental reasoning about pseudo-kindness and kindness. Reflexes towards true kindness are just very complex to implement in any neural net, and so unlikely to form spontaneously during training, since there are so many alternative pseudo-kindness reflexes one could get instead. Humans stumbled into what we call kindness somehow, partially due to quirks of evolution vs. SGD, like genome size or the need for cooperation between small tribes, etc. New humans now acquire similar reflexes towards similar kindness due to their shared genes, culture and environment.
Reinforcing kind-looking outputs in AI just reinforces those reasoning processes and reflexes...
Disclosure: Written by ChatGPT on 2025-09-16 at my request.
Abstract. Standard physiology; no novelty. When an object is held stationary, static equilibrium (ΣF = 0, Στ = 0) sets the joint torque needed to counter gravity. Muscle fibers generate that torque. Force in skeletal muscle is produced by continuous actomyosin cross-bridge cycling; each cycle uses one ATP. Calcium must be kept high to permit attachment and then pumped back into the sarcoplasmic reticulum; the Na/K pump maintains excitability. External mechanical work is approximately zero, but chemical energy is consumed continuously and dissipated as heat. Example: holding a bag at the hand (forearm , elbow ) requires . Relative to typical maximum elbow-flexion torque ( in young men, in young women), this...
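Since the numerical values in the example above did not survive extraction, here is a minimal sketch of the same static-equilibrium calculation with assumed, hypothetical numbers (the bag mass, forearm mass, and forearm length below are placeholders, not the abstract's figures):

```python
# Static elbow torque for holding a bag: a rough sketch with assumed values.
G = 9.81                # gravitational acceleration, m/s^2
BAG_MASS = 5.0          # kg, hypothetical load held in the hand
FOREARM_MASS = 1.5      # kg, hypothetical forearm + hand mass
FOREARM_LENGTH = 0.35   # m, elbow-to-hand distance, forearm held horizontal

# Equilibrium about the elbow (sum of torques = 0): the flexor torque must
# cancel the gravitational torques of the bag (acting at the hand) and of
# the forearm (acting at roughly its midpoint).
torque_bag = BAG_MASS * G * FOREARM_LENGTH
torque_forearm = FOREARM_MASS * G * (FOREARM_LENGTH / 2)
required_torque = torque_bag + torque_forearm

print(f"Required elbow-flexion torque: {required_torque:.1f} N*m")
# With these placeholder numbers: ~17.2 + ~2.6 = ~19.7 N*m, to be compared
# against whatever maximum elbow-flexion torque figures the abstract cites.
```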
Thanks, I think I got most of this from your top-level comment & Mikhail's post already. I strongly expect that I don't currently know your policy for confidentiality, but I also expect that once I do, I'd disagree with it being the best policy one can have, just based on what I heard from Mikhail and you about your one interaction.
My guess is that refusing the promise is plausibly better than giving it for free? But I guess there would have been another solution where 1) Mikhail learns not to screw up again, and 2) you get to have people talk more freely around you to a degree that's...
I don't know how costly/beneficial this screw-up concretely was to humanity's survival, but I guess the total cost would've been lower if Habryka, as a general policy, were more flexible about when the sensitivity of information has to be negotiated.
Like, with all this new information I'm now a tiny bit more wary of talking in front of Habryka. I may blurt out something that has a high negative expected utility if Habryka shares it (after conditioning on the event that he shares it), and I don't have a way to cheaply fix that mistake (which would bound the risk).
And there isn't an equally strong opposing force afaict? I can imagine...
It becomes a bit more like logical inductors.
If logical inductors are what one wants, just do that.
a reasonable time-penalty
I'm not entirely sure, but I suspect that I don't want any time penalty in my (typical human) prior. E.g. even if quantum mechanics takes non-polynomial time to simulate, I still consider it a likely hypothesis. A time penalty just doesn't seem related to what I pay attention to when I consult my prior over the laws of physics / fundamental hypotheses. There are also many other ideas for augmenting a simplicity prior that fail similar tests.
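For concreteness, one standard way to formalize the "time penalty" being rejected here (my assumed reading: a Levin-style Kt / speed-prior weighting, not necessarily what the parent comment meant) contrasts it with a plain simplicity prior over programs:

```latex
% Plain simplicity prior over programs p (with description length |p|):
%   P_simple(p) \propto 2^{-|p|}
% Time-penalized ("speed") prior, where t(p, n) is the time program p needs
% to reproduce the first n observations:
%   P_speed(p) \propto 2^{-(|p| + log_2 t(p, n))}
\[
  P_{\text{simple}}(p) \;\propto\; 2^{-|p|},
  \qquad
  P_{\text{speed}}(p) \;\propto\; 2^{-\bigl(|p| + \log_2 t(p,\,n)\bigr)}.
\]
```

Under the second weighting, a short hypothesis that is slow to simulate (e.g. quantum mechanics, if it takes super-polynomial time) gets penalized even though its description length is small, which is exactly the behavior the comment objects to.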