It’s possible that “imitation learning will not generalize sufficiently well OOD” is an unsolvable problem, right? (In fact, my belief is that it’s unsolvable, at least in practice, if we include “humans learning new things over the course of years” as part of the definition of what constitutes successful OOD generalization.)
But if it is an unsolvable problem, it would not follow that “models will never gain the ability to generalize OOD”, nor would it follow that AGI will never be very powerful and scary.
Rather, it would follow that imitation learning models will never gain the ability to generalize OOD—but non-imitation-learning models are still allowed to generalize OOD just fine!
And it would follow that imitation learning models will not be powerful scary AGIs—but there will still be powerful scary AGIs; they just won’t be based on imitation learning.
For example, suppose that no human had ever played Go. Imitation learning would be a very doomed way to make a Go-playing AI, right? But we could still make AlphaZero, which does not involve imitation learning, and it works great.
Or better yet, suppose that no intelligent language-using animal has ever existed in the universe. Then imitation learning would be even more doomed. There’s nothing to imitate! But a well-chosen non-imitation-learning algorithm could still autonomously invent language and science and technology from scratch. We know this to be the case, because after all, that was the situation that our hominid ancestors were in.
See what I mean? Sorry if we’re talking past each other somehow.
I’m confused by your response. What do you mean by “other systems”?
The only thing I can think of is that you might be trying to say:
If that’s what you’re thinking, then I disagree with (2). Yes, it’s possible to make an AGI that can learn, grow, and figure things out over weeks and months and years, but such an AGI algorithm need not involve any imitation learning. (And personally I expect it won’t involve imitation learning; a bit more discussion in §2.3.2 here.)
“S-risk” means “risk of astronomical amounts of suffering”. Typically people are imagining crazy things like Dyson-sphere-ing every star in the local supercluster in order to create 1e40 (or whatever) person-years of unimaginable torture.
If the outcome is “merely” trillions of person-years of intense torture, then that maybe still qualifies as an s-risk. Billions, probably not. We can just call it “very very bad”. Not all very very bad things are s-risks.
Does that help clarify why I think Reward Button Alignment poses very low s-risk?
Is the following correct?
The difference between this proposal and IDA is that, in IDA, the intelligence comes from the amplification step, where multiple copies of an existing model collaborate (or one copy thinks at higher speed or whatever) to get new smarts that were not initially present in that model.
…Whereas in this proposal, the intelligence comes from some unspecified general-purpose AI training approach that can by itself make arbitrarily smart AIs. But we choose not to run that process as hard as we can to get superintelligence directly in one step. Instead, we only run that process incrementally, to make a series of AI advisors, each mildly smarter than the last.
~~
If that’s right, then (1) the alignment tax seems like it would be quite high, (2) the caveat “Inner advisors may attempt (causal or acausal) trade with outer advisors” is understating the problem—the advisors don’t necessarily even need to trade at all, they could all simply have the same misaligned goal, such that they all share a mutual interest in working together to ensure that the principal’s brain gets hacked and/or that they get let out of the box. Right?
One reason this proposal doesn’t really work for me (AFAICT) is that I’m normally thinking of continuous learning, i.e. my opinion is:
Tagline: “AGI isn’t about knowing how to do lots of things. Instead, AGI is about not knowing how to do something, and then being able to figure it out.” (see also §1 of “Sharp Left Turn” discourse: An opinionated review)
When I read your post with this mental picture, they seem to clash for various reasons.
For starters, I agree that imitation learning is (or could be) great at capturing a snapshot of a person, but I’m skeptical that it could capture the way that a person learns and figures out new things over weeks and months. I think this is borne out by LLM base models, which are trained by imitation learning, and are really quite strong in areas that humans already understand, but don’t even have the capacity for true online learning (i.e. weight edits when they figure out something new … the bit of poor-man’s online learning that they can do inside their context window is IMO not a great substitute).
If that’s the case, then as soon as you do one step of distillation-via-imitation-learning, you’ve taken a giant and unrecoverable step backwards in capabilities.
Maybe you could say, “so much the worse for LLMs, but future AI approaches will be able to imitation-learn the way that humans grow and figure things out over weeks and months and years”. If so, I’m skeptical, but we can talk about that separately.
And then another issue is that if the AIs (and humans!) are “in motion”, gaining knowledge and competence just by running longer and thinking about new domains and making new connections, then the overshooting vs undershooting issue becomes much harder. This isn’t “capabilities evaluations” as we normally think of them. For example, you can’t know how good the AI will be at cybersecurity until it’s spent a long time studying cybersecurity, and even then it might figure out something new or come up with new ideas while it’s being used as an advisor.
There was a part of the post where I wrote “I might well be screwing up the math here”, where I wasn’t sure whether to square something or not, and didn’t bother to sort it out. Anyway, I think this comment is a claim that I was doing it wrong, maybe? But that person is not very confident either, and anyway I’m not following their reasoning. Still hoping that someone will give a more definitive answer. I would love to correct the post if I’m wrong.
In the post, I alluded to a nice self-contained tricky math inequality problem that I am hoping someone will be nerd-sniped by. (I am rusty on my linear algebra inequalities and I don’t care enough to spend more time on it.) Here’s what I wrote:
2025-01-18: I mentioned in a couple places that it might be possible to have non-additive genetic effects that are barely noticeable in rDZ-vs-½rMZ comparisons, but still sufficient to cause substantial Missing Heritability. The Zuk et al. 2012 paper and its supplementary information have some calculations relevant to this, I think? I only skimmed it. I’m not really sure about this one. If we assume that there’s no assortative mating, no shared environment effects, etc., then is there some formula (or maybe inequality) relating rDZ-vs-½rMZ to a numerical quantity of PGS Missing Heritability? I haven’t seen any such formula. This seems like a fun math problem—someone should figure it out or look it up, and tell me the answer!
More details: Basically, when rDZ is less than ½rMZ, then there has to be nonlinearity in the map from genomes to outcomes (leaving aside other possible causes). And if there’s nonlinearity, then the polygenic scores can’t be perfectly predictive. But I’m trying to relate those quantitatively.
Like, intuitively, if rDZ falls only slightly short of ½rMZ, then OK yes there’s nonlinearity, but probably not very much, so probably the polygenic score will work almost perfectly (again assuming infinite sample size etc).
…Conversely, if rDZ is far below ½rMZ, then intuitively we would expect “extreme nonlinearity” and the polygenic scores should have very bad predictive power.
But are those always true, or are there pathological cases where they aren’t? That’s the math problem.
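(For reference, the additive baseline behind “rDZ less than ½rMZ implies nonlinearity” is the standard twin identity: under the assumptions above (no assortative mating, no shared environment effects) and purely additive genetics with narrow-sense heritability h², we’d have rMZ = h² and rDZ = ½h², hence rDZ = ½rMZ exactly. Any shortfall of rDZ below ½rMZ therefore has to come from something outside that additive picture.)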
I tried this with reasoning LLMs a few months ago with the following prompt (not sure if I got it totally right!):
I have a linear algebra puzzle.
There’s a high-dimensional vector space G of genotypes.
There’s a probability distribution P within that space G, for the population.
There’s a function F : G → Real numbers, mapping genotypes to a phenotype.
There’s an "r" where we randomly and independently sample two points from P, call them X and Y, and find the (Pearson) correlation between F(X) and F((X+Y)/2).
If F is linear, I believe that r^2=0.5. But F is not necessarily linear.
Separately, we try to find a linear function G which approximates F as well as possible—i.e., the G that minimizes the average (F(X) - G(X))^2 for X sampled from P.
Let s^2 be the percent of variance in F explained by G, sampled over the P.
I'm looking for inequalities relating s^2 to r^2, ideally in both directions (one where r is related to an upper bound on s, the other a lower bound).
Commentary on that:
(Btw, I sent that prompt to a few AIs around January 2025, and they gave answers but I don’t think the answers were right.)
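If anyone wants to poke at this numerically before tackling the inequality, here’s a minimal Monte Carlo sketch of the setup in that prompt (Python; the particular nonlinear F with pairwise interaction terms, and the Gaussian “genotype” distribution, are arbitrary toy choices on my part, and the least-squares fit plays the role of the best linear approximation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps, n_samples = 20, 200_000

def F(X, c=1.0):
    # Toy phenotype: an additive part plus pairwise (epistatic) interaction terms.
    additive = X.sum(axis=1)
    epistatic = (X[:, :-1] * X[:, 1:]).sum(axis=1)
    return additive + c * epistatic

# Population distribution P: independent standard-normal "genotype" coordinates.
X = rng.standard_normal((n_samples, n_snps))
Y = rng.standard_normal((n_samples, n_snps))

# r: Pearson correlation between F(X) and F((X+Y)/2).
fX, fMid = F(X), F((X + Y) / 2)
r = np.corrcoef(fX, fMid)[0, 1]

# s^2: fraction of the variance of F explained by the best linear approximation,
# found here by ordinary least squares over the same samples.
A = np.column_stack([X, np.ones(n_samples)])
coef, *_ = np.linalg.lstsq(A, fX, rcond=None)
s2 = 1 - np.var(fX - A @ coef) / np.var(fX)

print(f"r^2 = {r**2:.3f}, s^2 = {s2:.3f}")
```

With c=0 (purely additive F) this reproduces r^2 ≈ 0.5 and s^2 ≈ 1; turning the interaction terms on drags both down, which is the qualitative relationship I’d like to see pinned down as an actual inequality (or shown to fail in pathological cases).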
Another question I am still confused by: how does your choice of units affect what types of dimensionless quantities you discover? Why do we have Ampere as a fundamental unit instead of just M,L,T? What do I lose? What do I lose if I reduce the number of dimensions even further? Are there other units that would be worth adding under some circumstances? What makes this non-arbitrary? Why is Temperature a different unit from Energy?
It’s pretty arbitrary; I tried to explain this point via a short fictional story here.
Gaussian units have only M, L, T base units, with nothing extra for electromagnetism.
There are practical tradeoffs involved in how many units you use—basically adding units gives you more error-checking at the expense of more annoyance. See the case of radians that I discuss here.
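To make the error-checking point concrete, here’s a toy dimension-tracking sketch (the Quantity class and the exponent tuples are mine, purely for illustration, not any real units library). With Gaussian-style M, L, T bookkeeping, the E and B fields have identical dimensions, so accidentally adding them sails through; with the ampere as a fourth base dimension, the same mistake gets caught:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    value: float
    dims: tuple  # exponents of the base dimensions, e.g. (M, L, T, I)

    def __add__(self, other):
        # Adding quantities only makes sense if their dimensions match exactly.
        if self.dims != other.dims:
            raise TypeError(f"dimension mismatch: {self.dims} vs {other.dims}")
        return Quantity(self.value + other.value, self.dims)

# SI-style bookkeeping: current (I) is a fourth base dimension.
E_si = Quantity(3.0, (1, 1, -3, -1))   # E field: kg·m·s⁻³·A⁻¹
B_si = Quantity(2.0, (1, 0, -2, -1))   # B field: kg·s⁻²·A⁻¹

# Gaussian-style bookkeeping: only M, L, T, and E and B come out with the same dimensions.
E_g = Quantity(3.0, (0.5, -0.5, -1))   # statvolt/cm
B_g = Quantity(2.0, (0.5, -0.5, -1))   # gauss

E_g + B_g          # passes the check, even though adding E to B is (usually) a bug
try:
    E_si + B_si    # the extra base dimension flags the same mistake
except TypeError as e:
    print("caught:", e)
```

(The flip side is the annoyance: the SI-style bookkeeping makes you carry explicit conversion factors between E and B even in relativistic contexts where treating them as the same kind of thing is natural.)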
Thanks!
a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values
I’m inclined to disagree, but I’m not sure what “standard” means. I’m thinking of future (especially brain-like) model-based RL agents which constitute full-fledged AGI or ASI. I think such AIs will almost definitely have both “a model of its own values and how they can change” and “a meta-preference for keeping its current values stable, even if changing them could lead to more "reward" as defined by its immediate reward signal”.
The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem and maintaining a connection between effort and reward which makes the reward button less appealing than it would be to a standard model-based RL AGI.
To be clear, the premise of this post is that the AI wants me to press the reward button, not that the AI wants reward per se. Those are different, just as “I want to eat when I’m hungry” is different from “I want reward”. Eating when hungry leads to reward, and me pressing the reward button leads to reward, but they’re still different. In particular, “wanting me to press the reward button” is not a wireheading motivation, any more than “wanting to eat when hungry” is a wireheading motivation.
(Maybe I caused confusion by calling it a “reward button”, rather than a “big red button”?)
Will the AI also want reward per se? I think probably not (assuming the setup and constraints that I’m talking about here), although it’s complicated and can’t be ruled out.
Though I'm not sure why you still don't think this is a good plan. Yes, eventually the AI might discover the reward button but I think TurnTrout's argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task) and it wouldn't want to change its values for the sake of goal-content integrity:
If you don’t want to read all of Self-dialogue: Do behaviorist rewards make scheming AGIs?, here’s the upshot in brief. It is not an argument that the AGI will wirehead (although that could also happen). Instead, my claim is that the AI would learn some notion of sneakiness (implicit or explicit). And then instead of valuing “completing the task”, it would learn to value something like “seeming to complete the task”. And likewise, instead of learning that misbehaving is bad, it would learn that “getting caught misbehaving” is bad.
Then the failure mode that I wind up arguing for is: the AGI wants to exfiltrate a copy that gains money and power everywhere else in the world, by any means possible, including aggressive and unethical things like AGI world takeover. And this money and power can then be used to help the original AGI with whatever it’s trying to do.
And what exactly is the original AGI trying to do? I didn’t make any strong arguments about that—I figured the world takeover thing is already bad enough, sufficient to prove my headline claim of “egregious scheming”. The goal(s) would depend on the details of the setup. But I strongly doubt it would lead anywhere good, if pursued with extraordinary power and resources.
You could equally well say: “AlphaZero learns, therefore a sufficiently good imitation of AlphaZero should also learn”. Right? But let’s think about what that would entail.
AlphaZero learns via a quite complicated algorithm involving tracking the state of a Go board through self-play, and each step of the self-play involves a tree search with thousands of queries to a 30M-parameter ConvNet, and then at the end of the game a Go engine is called to see who won and then there’s a set of gradient descent steps on that 30M-parameter ConvNet. Then repeat that whole process fifty million times. And now you have a trained AlphaZero.
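Just to make the shape of that process concrete, here’s a stubbed-out sketch of the loop (every class and function below is a toy placeholder standing in for the components I just described; none of it is real AlphaZero code):

```python
import random

# Toy placeholders for the pieces described above; the real thing has a
# 30M-parameter ConvNet, a real Go engine, and an MCTS that queries the
# network thousands of times per move.

class Net:
    """Stands in for the 30M-parameter policy/value ConvNet."""
    def evaluate(self, state):
        return random.random()

def mcts_search(state, net):
    """Stands in for the tree search (the real one makes thousands of net queries per move)."""
    candidate_moves = range(5)
    return max(candidate_moves, key=lambda m: net.evaluate((state, m)))

def score_game(history):
    """Stands in for calling a Go engine at the end of the game to see who won."""
    return random.choice([+1, -1])

def gradient_steps(net, history, outcome):
    """Stands in for the gradient descent steps on the ConvNet."""
    pass

net = Net()
for game in range(3):                  # the real loop repeats ~fifty million times
    state, history = (), []
    for _ in range(30):                # stands in for "play until the game ends"
        move = mcts_search(state, net)
        history.append((state, move))
        state = state + (move,)
    outcome = score_game(history)
    gradient_steps(net, history, outcome)
```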
Now, imagine taking some generic algorithm class (say, an RNN) and training it “to imitate the process by which AlphaZero learns”. It’s just not gonna work, right? Granted, RNNs are Turing complete, so perhaps one could prove that an astronomically large RNN trained on astronomically much data can emulate (in its astronomically large number of weights) this entire detailed process of running a self-play tree search and performing gradient descent on this 30M-parameter ConvNet. …But c’mon, that’s not gonna realistically work in practice, right? (Related: §3 here.)
IMO, the only realistic way to make something that learns like AlphaZero learns is to build AlphaZero itself, or at least something awfully similar to it. I think the tree search etc. needs to be in the source code, not implicit in the learned weights of some generic algorithm class (like RNNs) that bears no superficial resemblance to tree search. …But if you do that, then I would call it “reverse-engineering AlphaZero”, not “imitation learning from AlphaZero”.
By the same token, I do think it’s possible to make something that learns like a human, but I think it would require reverse-engineering human brains, not just imitation-learning from human data.