I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
I added a caption to the picture. Does that help?
Yeah I agree with that. I have a diagram of homeostatic feedback control in §1.5 of my SMTM reply post, and RL is one of the ingredients (d & f).
I know very little about this topic, but I was under the impression that there was more to it than “KV cache: yes or no?”, and I was trying to refer to that whole category of possible improvements. E.g. here’s a paper on “KV cache compression”.
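As I understand it, the underlying issue is that during autoregressive decoding the cached keys and values grow with the context length, so the cache’s memory footprint becomes a large cost in its own right, and that is what compression / eviction / quantization schemes try to shrink. Here’s a minimal sketch of that growth (mine, not from the paper; the head dimension and the random “projections” are purely illustrative):

```python
# Toy sketch of single-head attention decoding with a KV cache. The point is
# just that the cache grows linearly with context length, so its memory
# footprint (not only FLOPs) becomes a bottleneck, which is what KV cache
# compression and related tricks aim at. All numbers here are illustrative.
import numpy as np

d = 64                       # head dimension (illustrative)
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))   # cached keys,   shape (t, d)
V_cache = np.empty((0, d))   # cached values, shape (t, d)

def decode_step():
    """One autoregressive step: append this token's key/value, then attend over the whole cache."""
    global K_cache, V_cache
    q, k, v = (rng.standard_normal(d) for _ in range(3))  # stand-ins for learned projections
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    scores = np.exp(K_cache @ q / np.sqrt(d))
    return (scores / scores.sum()) @ V_cache               # attention output for this position

for t in range(1, 6):
    decode_step()
    print(f"after token {t}: cache holds {K_cache.shape[0]} keys and values of dim {d}")
```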
Thanks for the feedback!
I would also count improvements from synthetic training data as algorithmic improvements, not data-related improvements, because better synthetic training data is created by better algorithms…
I have now changed the text in a few places to better clarify how I am defining the scope of §1.1.
I feel like you’re maybe reading in some subtext, where you think I’m trying to downplay the things outside §1.1, and suggest that they don’t really count, or something? If so, that’s not what I meant to suggest, and it’s not how I feel in my heart. I’m open to rewording more, if you have suggestions.
I think most things mentioned in 1.4 ("Algorithmic changes that are not really quantifiable as efficiency") belong to 1.1 (algorithmic efficiency progress) because they can actually be quantified as efficiency improvements, namely SFT, RLHF, RLVR. These have strongly increased capabilities, as measured by benchmarks, compared to GPT-3-style prompt engineering of the underlying base model. So a much smaller model with these improvements can get to the performance of a larger base model without them.
In the context of this post, I’m mainly interested in: (1) are the things in §1.4 relevant to the Epoch claim of exponential algorithmic improvements? and (2) are the things in §1.4 relevant to the Dario claim of exponential algorithmic improvements? It seems to me that the answer in both cases is “no”.
(1) In the Epoch case, I believe they quantified performance by perplexity, not benchmarks.
(2) In the Dario case, I mean, I keep reading and re-reading the exact wording of the excerpt where he talks about “compute multipliers”. And it just really doesn’t sound to me like he is referring to SFT, RLHF, or RLVR in that excerpt (nor anything else in §1.4). Admittedly, his wording is a bit vague and confusing (to me). I’m open to discussion.
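To make (1) a bit more concrete: as I understand the Epoch-style methodology (my notation and paraphrase, not a quote from them), an algorithmic change gets credited with a compute multiplier

$$
m \;=\; \frac{C_{\text{old}}(\mathcal{L}^*)}{C_{\text{new}}(\mathcal{L}^*)},
$$

where $C(\mathcal{L}^*)$ is the training compute needed to reach a fixed perplexity / loss target $\mathcal{L}^*$ with the old versus the new algorithm. The §1.4 items mostly change what you can get out of a model after pretraining, rather than shrinking that ratio, which is part of why I don’t think they bear on these particular claims.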
I think model distillation would not cause such a large and ongoing improvement in inference efficiency.
Pick a fixed model size, let’s say N=50B parameters. My current belief is that: if you straightforwardly distill Claude Opus 3.5 into an N-parameter model, then you wind up with a worse model, than if you straightforwardly distill Claude Opus 4.5 into an N-parameter model.
Are you disagreeing with that?
If you agree, then it would follow that (for example) maybe: one lab distills Claude Opus 3.5 into an N-parameter model, while Bob distills Claude Opus 4.5 into a model with far fewer than N parameters, and the two distilled models wind up roughly equally good on benchmarks (because Bob’s better starting point is making up for his more aggressive distillation). Thus we would see ever-falling inference costs at any given level of benchmarks. See what I mean?
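Here’s a toy numerical version of that story. Every number and the “score” formula below are invented purely for illustration, and I’m assuming inference cost scales roughly with parameter count; the point is only the qualitative shape, i.e. that better teachers let you hit a fixed benchmark level with ever-smaller (hence ever-cheaper) distilled students:

```python
# Toy illustration only: all numbers and the "distilled_score" formula are made
# up. If each year's teacher is better, then the smallest student that hits a
# FIXED benchmark level shrinks year over year, so inference cost at that fixed
# capability level keeps falling (assuming cost scales ~linearly with params).
target_benchmark = 70.0                               # fixed capability level we want to hit
teachers = [(2023, 1.0), (2024, 1.3), (2025, 1.7)]    # (year, made-up "teacher quality")

def distilled_score(teacher_quality, student_params_b):
    """Made-up toy model: score grows with student size and with teacher quality."""
    return 40.0 + 10.0 * teacher_quality * (student_params_b ** 0.3)

def smallest_student(teacher_quality, sizes_b=range(1, 201)):
    """Smallest student size (billions of params) whose toy score reaches the target."""
    return next(s for s in sizes_b if distilled_score(teacher_quality, s) >= target_benchmark)

baseline = None
for year, quality in teachers:
    size_b = smallest_student(quality)
    baseline = baseline or size_b
    print(f"{year}: a ~{size_b}B-param student hits {target_benchmark:.0f}; "
          f"relative inference cost ~{size_b / baseline:.2f}x of the first year's")
```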
EDITED TO ADD: I just took the non-Goodfire-specific part of this comment, and spun it out and expanded it into a new post: In (highly contingent!) defense of interpretability-in-the-loop ML training
~ ~ ~
No opinion on Goodfire, or even on LLMs more broadly, but I generically think that there exist some AI algorithms in which interpretability outputs are connected to a reward function in a way that might be very helpful for safe & beneficial ASI. See for example Reward Function Design: a starter pack sections 1 & 4 & 5.
E.g. consider compassion in the human brain. I claim that we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the reward can evidently get triggered by specific activations inside my inscrutable learned world-model. But human compassion is generally pretty robust—at least, it doesn’t go away as you age. It would be cool if we knew how to put something like that into an AGI.
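As a cartoon of what connecting interpretability outputs to a reward function could look like in ML terms, here is a minimal sketch of the general idea, not Goodfire’s approach or anything from the linked post; the probe, its weights, and the activations are all hypothetical stand-ins:

```python
# Cartoon sketch of interpretability-in-the-loop reward shaping. Everything is
# hypothetical: pretend we already have a probe that reads the agent's internal
# activations and scores how strongly they currently represent "my friend is
# suffering", and we fold that signal into the reward, loosely analogous to the
# innate compassion reaction described above. The key property: the extra term
# is keyed to the agent's *beliefs* (internal activations), not to anything the
# agent directly observes.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 128
probe_weights = rng.standard_normal(HIDDEN_DIM)   # stand-in for a trained linear probe

def suffering_probe(activations: np.ndarray) -> float:
    """Hypothetical probe: how strongly do these activations encode 'my friend is suffering'?"""
    return float(1 / (1 + np.exp(-probe_weights @ activations)))

def shaped_reward(task_reward: float, activations: np.ndarray, weight: float = 0.5) -> float:
    """Task reward, minus a penalty when the agent's own world-model represents a suffering friend."""
    return task_reward - weight * suffering_probe(activations)

# Toy usage: a random activation vector standing in for the agent's current internal state.
activations = rng.standard_normal(HIDDEN_DIM)
print(shaped_reward(task_reward=1.0, activations=activations))
```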
…But yes, it’s also true that there are ways to connect interpretability outputs to a reward function that are bad, and make our problems all worse. I think there’s probably some elegant big-picture theoretical framework that sheds light on how to do interpretability-in-the-loop training in a way that’s good rather than terrible. Developing that theoretical framework would be good. (But probably quite different from what Goodfire is working on … it’s not “doing interpretability stuff” in the conventional sense but rather “thinking about AI algorithmic architectures in the big picture”, kinda more agent-foundations-y if anything.)
Fair enough, I reworded slightly. (Although I’d be pretty surprised if LLM algorithmic progress were way faster today than in, say, 2020, given that there was presumably much more low-hanging fruit in 2020.)
In 2024, I wasn't able to find anyone making this argument. My sense is that it was not at all prevalent, and continues to be not at all prevalent.
I feel like I see it pretty often. Check out “Unfalsifiable stories of doom”, for example.
Or really, anyone who uses the phrase “hypothetical risk” or “hypothetical threat” as a conversation-stopper when talking about ASI extinction, is implicitly invoking the intuitive idea that we should by default be deeply skeptical of things that we have not already seen with our own eyes.
Is The Spokesperson a realistic villain?
Obviously I agree that The Spokesperson is not going to sound realistic and sympathetic when he is arguing for “Ponzi Pyramid Incorporated” led by “Bernie Bankman”. It’s a reductio ad absurdum, showing that this style of argument proves too much. That’s the whole point.
Thanks, this is great!
Envy toward a friend’s success…
I used to think that envy was a social instinct (before 2023ish), but now I don’t think it’s a social instinct at all (see “changelog” here). Instead I currently think that envy is a special case of, umm, “craving” (in the general colloquial sense, not the specific Buddhist sense)—a kind of anxious frustration in a scenario where something is highly salient, and highly desired, but in fact cannot happen.
So a social example would be: Sally has a juice box, and I love juice, but I can’t have any. Looking at Sally drinking juice reminds me of the scenario where I’m drinking juice, which makes me unhappy because I don’t have any juice.
Whereas a non-social example of the same innate reaction would be: It’s lunch time, and every day at lunch I have juice and a sandwich in a brown paper bag, and I love juice. But it happens that there’s a new global juice shortage, so today for the first time I don’t have any juice. Looking at my sandwich and the brown bag reminds me of the scenario where I’m drinking juice, which makes me unhappy because I don’t have any juice.
So that’s my starting point: both of these examples are the same kind of (not-specifically-social) craving-related frustration reaction.
After that, of course, the Sally scenario becomes social, because the scenario involves Sally doing something (i.e. drinking juice) that causes me to feel an unpleasant feeling (per above), and generically, if someone is causing me unpleasant feelings, then that tends to push me from regarding her as a friend towards regarding her as an enemy, and to feel motivated to find an excuse to blame her for my troubles and pick a fight with her.
Admiration for a rival or enemy
My guess is that, just as going to bed can feel like a good idea or a bad idea depending on which aspects of the situation you’re paying attention to, likewise Genghis Khan can feel like a friend or an enemy depending on which aspects of him you’re paying attention to. I would suggest that people don’t feel admiration towards Person X and schadenfreude towards Person X at the very same instant. You might be able to flip back and forth from one to the other very quickly, even within 1 or 2 seconds, but not at the very same instant. For example, if I say the sentence “It was catastrophic how Genghis Khan killed all those people, but I have to admit, he was a talented leader”, I would suggest that the “innate friend-vs-enemy parameter” related to thoughts of Genghis Khan flips from enemy in the first half of the sentence to friend in the second half.
Compassion for a stable enemy’s suffering
There probably isn’t one great answer; probably different people are different. As above, we can think of people in different ways, paying attention to different aspects of them, and they can flip rapidly from enemy to friend and back. Since attention control is partly voluntary, it’s partly (but not entirely) a choice whether we see someone as a friend vs enemy, and we tend to choose the option that feels better / more motivating on net, and there can be a bunch of factors related to that. For example, approval reward is a factor—some people take pride in their compassion (just as we nod approvingly when superheroes take compassion upon their enemies, and cf. §6), while others take pride in their viciousness. Personality matters, culture matters, the detailed situation matters, etc.
Gratitude / indebtedness
Hmm. Generically, I think there are two (not mutually exclusive) paths:
As an example of the latter, recently someone important-to-me went out of his way to help me, and I expected the interaction to work out well for him too, but instead it wound up being a giant waste of his time, and objectively it wasn’t really my fault, but I still felt horrible and lost much sleep over it, and I think the aspect that felt most painful to me was when I imagined him secretly being annoyed at me and regretful for ever reaching out to me, even if he was far too nice a guy to say anything like that to me directly.
…But I’m kinda neurotic; different people are different and I don’t want to overgeneralize. Happy to hear more about how things seem to you.
Private guilt
I talked about “no one will ever find out” a bit in §6.1 of the approval reward post. I basically think that you can consciously believe that no one will ever find out, while nevertheless viscerally feeling a bit of the reaction associated with a nonzero possibility of someone finding out.
As for the “Dobby effect” (self-punishment related to guilt, a.k.a. atonement), that’s an interesting question. I thought about it a bit and here’s my proposed explanation:
Generally, if Ahab does something hurtful to Bob, then Bob might get angry at Ahab, and thus want Ahab to suffer (and better yet, to suffer while thinking about Bob, such as if Bob is punching Ahab in the face). But that desire of Bob’s, just like hunger and many other things, is satiable: just as a hungry person stops being hungry after eating a certain amount, Bob tends to lose his motivation for Ahab to suffer after Ahab has already suffered a certain amount. For example, if an angry person punches out his opponent in a bar fight, he usually feels satisfied, and doesn’t keep kicking his victim when he’s down, except in unusual cases. Or even if he kicks a bit, he won’t keep kicking for hours and hours.
We all know this intuitively from life experience, and we intuitively pick up on what it implies: if Ahab did something hurtful to Bob, and Ahab wants to get back to a situation where Bob feels OK about Ahab ASAP, then Ahab should be making himself suffer, and better yet suffer while thinking about Bob. Then not only is Ahab helping dull Bob’s feelings of aggression by satiating them, but simultaneously, there’s the very fact that Ahab is helping Bob feel a good feeling (i.e., satiation of anger), which should help push Ahab towards the “friend” side of the ledger in Bob’s mind.
Aggregation cases
In “identifiable victim effect”, I normally think of, like, reading a news article about an earthquake across the world. It’s very abstract. There’s some connection to the ground-truth reward signals that I suggested in Neuroscience of human social instincts: a sketch, but it’s several steps removed. Ditto “psychic numbing”, I think.
By contrast, in stage fright, you can see the people right there, looking at you, potentially judging you. You can make eye contact with one actual person, then move your eyes, and now you’re making eye contact with a different actual person, etc. The full force of the ground-truth reward signals is happening right now.
Likewise, for “audience effect”, we all have life experience of doing something, and then it turns out that there’s a real person right there who was watching us and judging us based on what we did. At any second, that real person could appear, and make eye contact etc. So again, we’re very close to the full force of the ground-truth reward signals here.
…So I don’t see a contradiction there.
Again I really appreciate this kind of comment, feel free to keep chatting.
I read one of their papers (the Pong one, which is featured on the frontpage of their website) and thought it was really bad and p-hacked, see here & here.
…sounds like a joke? You do not want to do any computation on neurons; they are slow and fragile. (You might want to run brain-inspired algorithms, but on semiconductors!)
Strong agree.
Belated update on that last point on “algorithmic progress” for LLMs: I looked into this a bit and wrote it up at: The nature of LLM algorithmic progress. The last section discusses how it relates to this post, with the upshot that I stand by what I wrote in the OP.