Competitive agents will not choose to [stop] in order to beat the competition
Competitive agents will choose to commit suicide, knowing it's suicide, to beat the competition? That suggests that we should observe CEOs mass-poisoning their employees, Jonestown-style, in a galaxy-brained attempt to maximize shareholder value. How come that doesn't happen?
Are you quite sure the underlying issue here is not that the competitive agents don't believe the suicide race to be a suicide race?
This is a mischaracterisation of the argument. I’m not saying competitive agents knowingly choose extinction. I’m saying the structure of the race incentivises behaviour that leads to extinction, even if no one intends it.
CEOs aren’t mass-poisoning their employees because that would damage their short and long-term competitiveness. But racing to build AGI - cutting corners on alignment, accelerating deployment, offloading responsibility - improves short-term competitiveness, even if it leads to long-term catastrophe. That’s the difference.
And what makes this worse is that even the AGI safety field refuses to frame it in those terms. They don’t call it suicide. They call it difficult. They treat alignment like a hard puzzle to be solved - not a structurally impossible task under competitive pressure.
So yes, I agree with your last sentence. The agents don’t believe it’s a suicide race. But that doesn’t counter my point - it proves it. We’re heading toward extinction not because we want to die, but because the system rewards speed over caution, power over wisdom. And the people who know best still can’t bring themselves to say it plainly.
This is exactly the kind of sleight-of-hand rebuttal that keeps people from engaging with the actual structure of the argument. You’ve reframed it into something absurd, knocked down the strawman, and accidentally reaffirmed the core idea in the process.
Given abundant time and centralized careful efforts to ensure safety, it seems very probable that these risks could be avoided: development paths that seemed to pose a high risk of catastrophe could be relinquished in favor of safer ones. However, the context of an arms race m
I appreciate the links, genuinely - this is the first time someone’s actually tried to point to prior sources rather than vaguely referencing them. It's literally the best reply and attempt at a counter I've received to date, so thanks again. I mean that.
That said, I’ve read all three, and none of them quite say what I’m saying. They touch on it, but none follow the logic all the way through. That’s precisely the gap I’m identifying. Even with the links you've so thoughtfully given, I remain alone in my conclusion.
They all acknowledge that competitive dynamics make alignment harder. That alignment taxes create pressure to cut corners. That arms races incentivise risky behaviour.
But none of them go as far as I do. They stop at "this is dangerous and likely to go wrong." I’m saying alignment is structurally impossible under competitive pressure. That the systems that try to align will be outcompeted by systems that don’t, and so alignment will not just be hard, but will be optimised away by default. There’s a categorical difference between “difficult and failure-prone” and “unachievable in principle due to structural incentives.”
From the 2011 writeup:
No. They can't. That's my point. As long as we continue developing AI, it's only a matter of time. There is no long-term safe way to develop it. Competitive agents will not choose to [stop] in order to beat the competition, and when the AI becomes intelligent enough it will simply bypass any barriers we put in place - alignment or whatever else we design - and go about acting optimally. The AGI safety community is trying to tell the rest of the world that we must be cautious, but only for long enough to design a puzzle that a level of intelligence beyond human understanding cannot solve, then use that puzzle as a cage for said intelligence. We, with our limited intellect, will create a puzzle that something far beyond us has no solution for. And they're doing it with a straight face.
I’ve been very careful not to mak
You are preaching to the choir. Most of it is 101-level arguments in favor of AGI risk. Basically everyone on LW has already heard them, and either agrees vehemently, or disagrees with some subtler point/assumption which your entry-level arguments don't cover. The target audience for this isn't LWers; this is not content that's novel and useful for LWers. That may or may not be grounds for downvoting it (depending on one's downvote philosophy), but is certainly ground
Appreciate the thoughtful reply - even if it’s branded as a “thoughtless kneejerk reaction.”
I disagree with your framing that this is just 101-level AGI risk content. The central argument is not that AGI is dangerous. It’s that alignment is structurally impossible under competitive pressure, and that capitalism - while not morally to blame - is simply the most extreme and efficient version of that dynamic.
Most AGI risk discussions stop at “alignment is hard.” I go further: alignment will be optimised away, because any system that isn’t optimising as hard as possible won’t survive the race. That’s not an “entry-level” argument - it’s an uncomfortable one. If you know where this specific line of reasoning has been laid out before, I’d genuinely like to see it. So far, people just say “we’ve heard this before” and fail to cite anything. It’s happened so many times I’ve lost count. Feel free to be the first to buck the trend and link someone making this exact argument, clearly, before I did.
I’m also not “focusing purely on capitalism.” The essay explicitly states that competitive structures - whether between nations, labs, or ideologies - would lead to the same result. Capitalism just accelerates the collapse. That’s not ideological; that’s structural analysis.
The suggestion that I should have reframed this as a way to “tap into anti-capitalist sentiment” misses the point entirely. I’m not trying to sell a message. I’m explaining why we’re already doomed. That distinction matters.
As for the asteroid analogy: your rewrite is clever, but wrong. You assume the people in the room already understand the trajectory. My entire point is that they don’t. They’re still discussing mitigation strategies while refusing to accept that the cause of the asteroid's trajectory is unchangeable. And the fact that no one can directly refute that logic - only call it “entry-level” or “unhelpful” - kind of proves the point.
So yes, you did skim my essay - with the predictable resul
If o3 is based on GPT-4o, there is a reasoning model based on GPT-4.5 that's better. If o3 is based on GPT-4.5 (and so the reason it's still not out is that they are waiting for Blackwells to inference it at a reasonable speed and cost), then it was a hasty job just after the base model for GPT-4.5 was pretrained, and so by now they have something much better. Either way, there is some sort of "o4", but it's probably a bad look to announce it before releasing the already-announced o3.
Yes, I think "runes" throw many LLMs off into the wrong simulator. Humans don't fall for this because the symbols "look" mathematical but a text-based LLM can't "see" that. The opposite happens for computer scientists: They see "[]" and start to think in programming terms such as lambda functions...
Using a much simpler prompt, and without mentioning number theory or math, o3 easily solves it:
Czynski:
As others found, apparently "think like a mathematician" is enough to get it to work.
I don't think that's an issue here at all. Look at the CoTs: it has no trouble whatsoever splitting higher-level expressions into concatenations of blocks of nested expressions and figuring out levels of nesting.
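For concreteness, here is a toy Python sketch (with an assumed "[]" alphabet, purely for illustration) of the subtask being described: splitting an expression into top-level blocks and computing each block's nesting depth.

```python
# Toy version of the parsing subtask: split a bracket string into its top-level
# blocks and report each block's maximum nesting depth. Illustrative only.

def split_top_level_blocks(s: str) -> list[str]:
    blocks, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == "[":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                blocks.append(s[start:i + 1])
    return blocks

def max_depth(block: str) -> int:
    depth = best = 0
    for ch in block:
        if ch == "[":
            depth += 1
            best = max(best, depth)
        elif ch == "]":
            depth -= 1
    return best

expr = "[[][[]]][[[]]]"
print([(b, max_depth(b)) for b in split_top_level_blocks(expr)])
# -> [('[[][[]]]', 3), ('[[[]]]', 3)]
```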
Counterargument: Doing it manually teaches you the skills and the strategies for autonomously attaining high levels of understanding quickly and data-efficiently. Those skills would then generalize to cases in which you can't consult anyone, such as cases where the authors are incommunicado, dead, or don't exist/the author is the raw reality. That last case is particularly important for doing frontier research: if you've generated a bunch of experimental results and derivations, the skills to make sense of what it all means have a fair amount of overlap wi...
In such a case one should probably engage in independent research until they have developed the relevant skills well enough (and they know it). After that point, persisting in independent research rather than seeking help can be an unproductive use of time. Although it is not obvious how attainable this point is.
Daniel Tan:
Directionally agreed re self-practice teaching valuable skills
Nit 1: your premise here seems to be that you actually succeed in the end + are self-aware enough to be able to identify what you did 'right'. In which case, yeah, chances are you probably didn't need the help.
Nit 2: Even in the specific case you outline, I still think "learning to extrapolate skills from successful demonstrations" is easier than "learning what not to do through repeated failure".
Which is weird, if you are overwhelmed shouldn’t you also be excited or impressed? I guess not, which seems like a mistake, exciting things are happening.
"Impressed" or "excited" implies a positive/approving emotion towards the overwhelming news coming from the AI sphere. As an on-the-nose comparison, you would not be "impressed" or "excited" by a constant stream of reports covering how quickly an invading army is managing to occupy your cities, even if the new military hardware they deploy is "impressive" in a strictly technical sense.
When reading LLM outputs, I tend to skim them. They're light on relevant, non-obvious content. You can usually just kind of glance diagonally through their text and get the gist, because they tend to spend a lot of words saying nothing/repeating themselves/saying obvious inanities or extensions of what they've already said.
When I first saw Deep Research outputs, it didn't read to me like this. Every sentence seemed to be insightful, dense with pertinent information.
Now I've adjusted to the way Deep Research phrases itself, and it reads the same as any other LL...
Altman’s model of how AGI will impact the world is super weird if you take it seriously as a physical model of a future reality
My instinctive guess is that these sorts of statements from OpenAI are Blatant Lies intended to lower the AGI labs' profile and ensure there's no widespread social/political panic. There's a narrow balance to maintain, between generating enough hype targeting certain demographics to get billions of dollars in investments from them ("we are going to build and enslave digital gods and take over the world, do you want to invest in...
Track record: My own cynical take seems to be doing better with regards to not triggering people (though it's admittedly less visible).
Any suggestions for how I can better ask the question to get useful answers without apparently triggering so many people so much?
First off, I'm kind of confused about how you didn't see this coming. There seems to be a major "missing mood" going on in your posts on the topic – and I speak as someone who is sorta-aromantic, considers the upsides of any potential romantic relationship to have a fairly low upper bound for hims...
I buy this for the post-GPT-3.5 era. What's confusing me is that the rate of advancement in the pre-GPT-3.5 era was apparently the same as in the post-GPT-3.5 era, i. e., doubling every 7 months.
Why would we expect there to be no distribution shift once the AI race kicked into high gear? GPT-2 to GPT-3 to GPT-3.5 proceeded at a snail's pace by modern standards. How did the world happen to invest in them just enough for them to fit into the same trend?
Actually, progress in 2024 is roughly 2x faster than earlier progress which seems consistent with thinking there is some distribution shift. It's just that this distribution shift didn't kick in until we had Anthropic competing with OpenAI and reasoning models. (Note that OpenAI didn't release a notably better model than GPT-4-1106 until o1-preview!)
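As a purely illustrative back-of-the-envelope check of what those two regimes imply, here is a small Python sketch; the 1-hour starting horizon and the ~3.5-month figure for the faster regime are placeholder assumptions, not measured values.

```python
# Back-of-the-envelope extrapolation of task time horizons under an exponential
# "doubling every D months" trend. The starting horizon and both doubling
# times are illustrative assumptions, not measured values.

def horizon_after(months: float, start_minutes: float, doubling_months: float) -> float:
    """Task time horizon (in minutes) after `months`, given a fixed doubling period."""
    return start_minutes * 2 ** (months / doubling_months)

start = 60.0  # assume a ~1-hour horizon today (placeholder)
for label, doubling in [("~7-month doubling", 7.0), ("~3.5-month doubling", 3.5)]:
    horizons = {m: round(horizon_after(m, start, doubling)) for m in (12, 24, 36)}
    print(label, horizons)  # minutes of task horizon at +12/+24/+36 months
```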
I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it's likely that GPT-4o will too
Maaaybe. Note, though, that "understand what's going on" isn't the same as "faithfully and comprehensively translate what's going on into English". Any number of crucial nuances might be accidentally lost in translation (due to the decoder model not properly appreciating how important they are), or deliberately hidden (if the RL'd model performs a sneaky jailbreak on the decoder, see Pliny-style token bombs or jailbreaks encoded in metaphor).
I think the amount of money-and-talent invested into the semiconductor industry has been much more stable than in AI though, no? Not constant, but growing steadily with the population/economy/etc. In addition, Moore's law being so well-known potentially makes it a self-fulfilling prophecy, with the industry making it a target to aim for.
Kurzweil (and gwern in a cousin comment) both think that "effort will be allocated efficiently over time" and for Kurzweil this explained much much more than just Moore's Law.
Ray's charts from "the olden days" (the nineties and aughties and so on) were normalized around what "1000 (inflation adjusted) dollars spent on mechanical computing" could buy... and this let him put vacuum tubes and even steam-powered gear-based computers on a single chart... and it still worked.
The 2020s have basically always been very likely to be crazy. Based on my familiarity with old ML/AI systems and standards, the bar implied by the term "AGI" as it was used a decade ago has already been reached. Claude is already smarter than most humans, but (from the perspective of what smart, numerate, and reasonable people predicted in 2009) he is (arguably) over budget and behind schedule.
Also, have you tracked the previous discussion on Old Scott Alexander and LessWrong about "mysterious straight lines" being a surprisingly common phenomenon in economics? E.g., on an old AI post Oli noted:
This is one of my major go-to examples of this really weird linear phenomenon:
150 years of a completely straight line! There were two world wars in there, the development of artificial fertilizer, the broad industrialization of society, the invention of the car. And all throughout, the line just carries on, with no significant perturbations.
Indeed. That seems incredibly weird. It would be one thing if it were a function of parameter size, or FLOPs, or data, or at least the money invested. But the release date?
The reasons why GPT-3, GPT-3.5, GPT-4o, Sonnet 3.6, and o1 improved on the SOTA are all different from each other, ranging from "bigger scale" to "first RLHF'd model" to "first multimodal model" to "algorithmic improvements/better data" to "???" (Sonnet 3.6) to "first reasoning model". And it'd be one thing if we could at least say that "for mysterious reasons, billion-dollar corporation...
My sense is that the GPT-2 and GPT-3 results are somewhat dubious, especially the GPT-2 result. It really depends on how you relate SWAA (small software engineering subtasks) to the rest of the tasks. My understanding is that no iteration was done though.
However, note that GPT-3 wouldn't be wildly more off trend if it were anywhere from 4 to 30 seconds; it is instead at ~8 seconds. And the GPT-2 results are very consistent with "almost too low to measure".
Overall, I don't think it's incredibly weird (given that the rate of increase of compute and people in 2019-2023 isn't that different from the rate in 2024), but many other results would also have been roughly on trend.
I don't think it's weird. Given that we know there are temporal trends towards increasing parameter size (despite Chinchilla), FLOPs, data, and continued progress in compute/data-efficiency (with various experience curves), any simple temporal chart will tend to show an increase unless you are specifically conditioning or selecting in some way to neutralize that. Especially when you are drawing with a fat marker on a log plot. Only if you had measured and controlled for all that and there was still a large unexplained residual of 'time' would you have to s...
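A minimal sketch of the control described above, on invented placeholder data: regress the capability metric on log-compute plus a release-date term, and check whether any sizable residual effect of "time" survives.

```python
# Minimal sketch of the control described above, on invented data: if capability
# is driven by compute alone, the coefficient on release date should be ~0 once
# log-compute is included in the regression.
import numpy as np

rng = np.random.default_rng(0)
n = 60
year = np.linspace(2019, 2025, n)
log_compute = 20 + 1.2 * (year - 2019) + rng.normal(0, 0.8, n)  # compute grows over time, with spread
metric = 0.9 * log_compute + rng.normal(0, 0.5, n)              # capability depends on compute only

X = np.column_stack([np.ones(n), log_compute, year])
coef, *_ = np.linalg.lstsq(X, metric, rcond=None)
print(dict(zip(["const", "log_compute", "year"], coef.round(3))))
# a near-zero "year" coefficient means the straight-line-in-time plot is just
# compute growth wearing a calendar costume
```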
So maybe we need to think about systematization happening separately in system 1 and system 2?
I think that's right. Taking on the natural-abstraction lens, there is a "ground truth" to the "hierarchy of values". That ground truth can be uncovered either by "manual"/symbolic/System-2 reasoning, or by "automatic"/gradient-descent-like/System-1 updates, and both processes would converge to the same hierarchy. But in the System-2 case, the hierarchy would be clearly visible to the conscious mind, whereas the System-1 route would make it visible only indirectly...
What sorts of encoding do you expect to be natural to LLMs besides encodings already present in pretraining and that GPT-4o can decode?
Spitballing: Some sort of trivial-but-abstract combination of the encodings present in pretraining that produces an encoding that's particularly easy for the LLM to think in due to their architecture/learned associations, but which is opaque to us, because the abstract space in which the combination is trivial is hopelessly beyond our current theory of languages, such that we can't easily reverse-engineer it.
Thanks for the example! I think it moderately convinced me that having CoT become hard for humans to understand in the next 2 years is slightly more plausible than I thought.
I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it's likely that GPT-4o will too. See the discussion in another comment thread about "chicken-and-egg" problems when learning new encodings, which I think make it much easier to learn things like switching languages or skipping stopwords that take almost no additional effort to decode (therefore it is incentivized to use the new encoding without the need to learn a new decoder). I see how once you get a model using "freestyling words" frequently the decoder becomes better over time at understanding freestyling in general (though it will be slow because it is trained using RL rather than SL), which allows you to eventually shift to encodings that the base model and GPT-4o don't understand, but my guess is that this will take a long time (>the first 1e28 FLOP RL run?).
That's a worthwhile research direction, but I don't find the results here convincing. This experiment seems to involve picking an arbitrary and deliberately unfamiliar-to-the-LLM encoding, and trying to force the LLM to use it. That's not the threat model with RL causing steganography, the idea there is the opposite: that there is some encoding which would come natural to the model, more natural than English, and that RL would beeline for it.
"LLMs are great at learning to think in arbitrary encodings" was never part of that threat model. The steganographic encoding would not be arbitrary nor alien-to-the-LLM.
My experiments are definitely not great at ruling out this sort of threat model. But I think they provide some evidence that LLMs are probably not great at manipulating generic non-pretraining encodings (the opposite result would have provided evidence that there might be encodings that are extremely easy to learn - I think my results (if they reproduce) do not reduce the variance between encodings, but they should shift the mean).
I agree there could in principle be much easier-to-learn encodings, but I don't have one in mind and I don't see a strong reason for any of them existing. What sorts of encoding do you expect to be natural to LLMs besides encodings already present in pretraining and that GPT-4o can decode? What would make a brand new encoding easy to learn? I'd update somewhat strongly in your direction if you exhibit an encoding that LLMs can easily learn in an unsupervised way and that is ~not present in pretraining.
I see two explanations: the boring wholesome one and the interesting cynical one.
The wholesome one is: You're underestimating how much other value the partner offers and how much the men care about the mostly-platonic friendship. I think that's definitely a factor that explains some of the effect, though I don't know how much.
The cynical one is: It's part of the template. Men feel that they are "supposed to" have wives past a certain point in their lives; that it's their role to act. Perhaps they even feel that they are "supposed to" have wives they hate, see t...
Hm, I think LLMs' performance on the Scam Benchmark is a useful observable to track for updating towards/away from my current baseline prediction.
Whenever anything of this sort shows up in my interactions with LLMs or in the wild, I aim to approach it with an open mind, rather than wearing my Skeptic Hat. Nonetheless, so far, none of this (including a copious amount of janus' community's transcripts) passed my sniff test. Like, those are certainly some interesting phenomena, and in another life I would've loved to study them, and they seem important for fi...
They keep each other radicalized forever as part of some transcendental social dynamic.
They become increasingly non-human as time goes on, small incremental modifications and personality changes building on each other, until they're no longer human in the senses necessary for your hypothesis to apply.
I assume your counter-model involves them getting bored of each other and seeking diversity/new friends, or generating new worlds to explore/communicate with, with the generating processes not constrained to only generate racists, leading to the...
Ok so they only generate racists and racially pure people. And they do their thing. But like, there's no other races around, so the racism part sorta falls by the wayside. They're still racially pure of course, but it's usually hard to tell that they're racist; sometimes they sit around and make jokes to feel superior over lesser races, but this is pretty hollow since they're not really engaged in any type of race relations. Their world isn't especially about all that, anymore. Now it's about... what? I don't know what to imagine here, but the only things I do know how to imagine involve unbounded structure (e.g. math, art, self-reflection, self-reprogramming). So, they're doing that stuff. For a very long time. And the race thing just is not a part of their world anymore. Or is it? I don't even know what to imagine there. Instead of having tastes about ethnicity, they develop tastes about questions in math, or literature. In other words, [the differences between people and groups that they care about] migrate from race to features of people that are involved in unbounded stuff. If the AGI has been keeping the racially impure in an enclosure all this time, at some point the racists might have a glance back, and say, wait, all the interesting stuff about people is also interesting about these people. Why not have them join us as well.
I don't know what's making you think I don't understand your argument. Also, I've never publicly stated that I'm opting into Crocker's Rules, so while I happen not to particularly mind the rudeness, your general policy on that seems out of line here.
When being human-ish and around human-ish entities, core human shards continue to work
My argument is that the process you're hypothesizing would be sensitive to the exact way of being human-ish, the exact classes of human-ish entities around, and the exact circumstances in which the emperor has to be around the...
Yes: some other people. The ideologically and morally aligned people, usually. Social/informational bubbles that screen away the rest of humanity, from which they only venture out if forced to (due to the need to earn money/control the populace, etc.). This problem seems to get worse as the ability to insulate yourself from others improves, as could be observed with modern internet-based informational bubbles or the surrounded-by-yes-men problem of dictators.
ASI would make this problem transcendental: there would ...
For the same reason that most people (if given the power to do so) wouldn't just replace their loved ones with their altered versions that are better along whatever dimensions the person judged them as deficient/imperfect.
TsviBT:
You're 100% not understanding my argument, which is sorta fair because I didn't lay it out clearly, but I think you should be doing better anyway.
Here's a sketch:
1. Humans want to be human-ish and be around human-ish entities.
2. So the emperor will be human-ish and be around human-ish entities for a long time. (Ok, to be clear, I mean a lot of developmental / experiential time--the thing that's relevant for thinking about how the emperor's way of being trends over time.)
3. When being human-ish and around human-ish entities, core human shards continue to work.
4. When core human shards continue to work, MAYBE this implies EVENTUALLY adopting beneficence (or something else like cosmopolitanism), and hence good outcomes.
5. Since the emperor will be human-ish and be around human-ish entities for a long time, IF 4 obtains, then good outcomes.
And then I give two IDEAS about 4 (communing->[universalist democracy], and [information increases]->understanding->caring).
Your hypothesis is about the dynamics within human minds embedded in something like contemporary societies with lots of other diverse humans whom the rulers are forced to model for one reason or another.
My point is that evil, rash, or unwise decisions at the very start of the process are likely, and that those decisions are likely to irrevocably break the conditions in which the dynamics you hypothesize are possible. Make the minds in charge no longer human in the relevant sense, or remove the need to interact with/model other humans, etc.
Absolutely not, no. Humans want to be around (some) other people, so the emperor will choose to be so. Humans want to be [many core aspects of humanness, not necessarily per se, but individually], so the emperor will choose to be so. Yes, the emperor could want these insufficiently for my argument to apply, as I've said earlier. But I'm not immediately recalling anyone (you or others) making any argument that, with high or even substantial probability, the emperor would not want these things sufficiently for my question, about the long-run of these things, to be relevant.
Human values are meta and open--part of the core argument of my OP (the bullet point about communing).
Unless the human, on reflection, doesn't want some specific subset of their current values to be open to change / has meta-level preferences to freeze some object-level values. Which I think is common. (Source: I have meta-preferences to freeze some of my object-level values at "eudaimonia", and I take specific deliberate actions to avoid or refuse value-drift on that.)
Not implausible, like they could for some reason specifically ask the AGI to do this, bu
I'm curious to hear more about those specific deliberate actions.
TsviBT:
How about for example:
Not saying this is some sort of grand solution to corrigibility, but it's obviously better than the nonsense you listed. If a human were going to try to help me out, I'd want this, for example, more than the things you listed, and it doesn't seem especially incompatible with corrigible behavior.
TsviBT:
I mean, yes, but you wrote a lot of stuff after this that seems weird / missing the point, to me. A "corrigible AGI" should do at least as well as--really, much better than--you would do, if you had a huge team of researchers under you and your full-time, 100,000x-speed job were to do a really good job at "being corrigible, whatever that means" to the human in the driver's seat. (In the hypothetical you're on board with this for some reason.)
TsviBT:
Your and my beliefs/questions don't feel like they're even much coming into contact with each other... Like, you (and also other people) just keep repeating "something bad could happen". And I'm like "yeah obviously something extremely bad could happen; maybe it's even likely, IDK; and more likely, something very bad at the beginning of the reign would happen (Genghis spends his first 200 years doing more killing and raping); but what I'm ASKING is, what happens then?".
If you're saying
then, ok, you can say that, but I want to understand why; and I have some reasons (as presented) for thinking otherwise.
TsviBT:
Yeah I mean this is perfectly plausible, it's just that even these cases are not obvious to me.
TsviBT:
I would guess fairly strongly that you're mistaken or confused about this, in a way that an AGI would understand and be able to explain to you. (An example of how that would be the case: the version of "eudaimonia" that would not horrify you, if you understood it very well, has to involve meta+open consciousness (of a rather human flavor).)
This assumes that the initially-non-eudaimonic god-king(s) would choose to remain psychologically human for a vast amount of time, and keep the rest of humanity around for all that time. Instead of:
Self-modify into something that's basically an eldritch abomination from a human perspective, either deliberately or as part of a self-modification process gone wrong.
Make some minimal self-modifications to avoid value drift, precisely not to let the sort of stuff you're talking about happen.
Stick to behavioral patterns that would lead to never changing their mi
Yes, that's a background assumption of the conjecture; I think making that assumption and exploring the consequences is helpful.
Self-modify into something that's basically an eldritch abomination from a human perspective, either deliberately or as part of a self-modification process gone wrong.
Right, totally, then all bets are off. The scenario is underspecified. My default imagination of "aligned" AGI is corrigible AGI. (In fact, I'm not even totally sure that it makes much sense to talk of aligned AGI that's not corrigible.) Part of co...
I don't really "run experiments" on models, in a systemic personal capacity. Other people are much better at that, and I believe I'd linked a few examples in the post. I do replicate the occasional experiment, and run some myself if there's something I'd like to check... But broadly, at this point, I don't expect any compact, self-contained puzzle to be a good measure of "are we getting AGIer yet?".
My direct engagement with models mostly consists of feeding them research papers to process them faster, asking clarifying questions about math/physics, using D...
The update to my timelines this would cause isn't a direct "AI is advancing faster than I expected", but an indirect "Dario makes a statement about AI progress that seems overly ambitious and clearly wrong to me, but is then proven right, which suggests he may have a better idea of what's going on than me in other places as well, and my skepticism regarding his other overambitious-seeming statements is now more likely to be incorrect".
Dario Amodei says AI will be writing 90% of the code in 6 months and almost all the code in 12 months. I am with Arthur B here, I expect a lot of progress and change very soon but I would still take the other side of that bet. The catch is: I don’t see the benefit to Anthropic of running the hype machine in overdrive on this, at this time, unless Dario actually believed it.
Which means that, if this does not in fact happen in 3-6 months, it should be taken as evidence that there's some unknown-to-us reason for Anthropic to be running the hype machine in thi...
I don't expect 90% of code in 6 months and more confidently don't expect "almost all" in 12 months for a reasonable interpretation of almost all. However, I think this prediction is also weaker than it might seem, see my comment here.
These tests are a good measure of human general intelligence
Human general intelligence. I think it's abundantly clear that the cognitive features that are coupled in humans are not necessarily coupled in LLMs.
Analogy: In humans, the ability to play chess is coupled with general intelligence: we can expect grandmasters to be quite smart. Does that imply Stockfish is a general-purpose hypergenius?
People said the same about understanding of context, hallucinations, and other stuff
Of note: I have never said anything of that sort, nor nodded along at people saying it. I think I've had to eat crow after making a foolish "LLMs Will Never Do X" claim a total of zero times (having previously made a cautiously small but nonzero number of such claims).
Might lead to widespread chaos, the internet becoming unusable due to AI slop and/or AI agents hacking everything, etc. It won't be pleasant, but not omnicide-tier.
Valid complaint, honestly. I wasn't really going for "good observables to watch out for" there, though, just for making the point that my current model is at all falsifiable (which is I think what @Jman9107 was mostly angling for, no?).
The type of evidence I expect to actually end up updating on, in real life, if we are in the LLMs-are-AGI-complete timeline, is this one:
Reasoning models' skills starting to generalize in harder-to-make-legible ways that look scary to me.
Some sort of subtle observable or argument that's currently an unknown unknown to me, wh...
We should have empirical evidence about this, actually, since the LW team has been experimenting with a "virtual comments" feature. @Raemon, the EDT issue aside, were the comments any good if you forgot they're written by an LLM? Can you share a few (preferably a lot) examples?
It's been a long time since I looked at virtual comments, as we never actually merged them in. IIRC, none were great, but sometimes they were interesting (in a kind of "bring your own thinking" kind of way).
They were implemented as a Turing test, where mods would have to guess which was the real comment from a high karma user. If they'd been merged in, it would have been interesting to see the stats on guessability.
Your observations are basically "At the point where LLMs are AGI, I will change my mind."
If it solves Pokémon one-shot, solves coding, or makes human beings superfluous for decision-making, it's already practically AGI.
These are bad examples! All you have shown me now is that you can't think of any serious intermediate steps LLMs have to go through before they reach AGI.
Because "RL on passing precisely defined unit tests" is not "RL on building programs that do what you want", and is most definitely not "RL on doing novel useful research".
Ah great point, regarding the comment you link to:
* yes, some reward hacking is going on but at least in Claude (which I work with) this is a rare occurrence in daily practice, and usually follows repeated attempts to actually solve the problem.
* I believe that both DeepSeek R1-Zero and Grok Thinking were RL-trained solely on math and code, yet their reasoning seems to generalise somewhat to other domains as well.
* So, while you’re absolutely right that we can’t do RL directly on the most important outcomes (research progress), I believe there will be significant transfer from what we can do RL on currently.
Would be curious to hear your sense of generalisation from the current narrow RL approaches!
I'd say long reasoning wasn't really elicited by CoT prompting
IIRC, "let's think step-by-step" showed up in benchmark performance basically immediately, and that's the core of it. On the other hand, there's nothing like "be madly obsessed with your goal" that's known to boost LLM performance in agent settings.
There were clear "signs of life" on extended inference-time reasoning; there are (to my knowledge) none on agent-like reasoning.
you might need a GPT-8 for it to spontaneously emerge in a base model, demonstrating that it was in GPT-5.5 all along
It's not central to the phenomenon I'm using as an example of a nontrivially elicited capability. There, the central thing is efficient CDCL-like in-context search that enumerates possibilities while generalizing blind alleys to explore similar blind alleys less within the same reasoning trace, which can get about as long as the whole effective context (on the order of 100K tokens). Prompted (as opposed to elicited-by-tuning) CoT won't scale to arbitrarily long reasoning traces by adding "Wait" at the end of a reasoning trace either (Figure 3 of the s1 paper). Quantitatively, this manifests as scaling of benchmark outcomes with test-time compute that's dramatically more efficient per token (Figure 4b of s1 paper) than the parallel scaling methods such as consensus/majority and best-of-N, or even PRM-based methods (Figure 3 of this Aug 2024 paper).
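For readers who haven't seen the s1 setup, here is a rough sketch (under stated assumptions) of the budget-forcing trick being referenced: suppress the end-of-thinking delimiter and append "Wait" to extend the trace. The `generate` stub and the `</think>` delimiter are stand-ins, not a real API.

```python
# Rough sketch of s1-style "budget forcing" (extend side only): when the model
# tries to close its reasoning block, suppress the delimiter and append "Wait"
# so it keeps searching. `generate` and "</think>" are placeholders, not a real API.

def generate(prompt: str, stop: str, max_tokens: int) -> str:
    """Placeholder for a call to an actual reasoning-model endpoint."""
    raise NotImplementedError

def budget_forced_reasoning(question: str, max_waits: int = 4, token_budget: int = 8000) -> str:
    trace = f"<think>\n{question}\n"
    for i in range(max_waits + 1):
        remaining = token_budget - len(trace.split())   # crude word-count proxy for tokens
        if remaining <= 0:
            break
        trace += generate(trace, stop="</think>", max_tokens=remaining)
        if i < max_waits:
            trace += "\nWait,"                          # suppress end-of-thinking, nudge further search
    return trace + "\n</think>"
```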
I was just anchoring to your example that I was replying to where you sketch some stand-in capability ("paperclipping") that doesn't spontaneously emerge in "GPT-5/6" (i.e. with mere prompting). I took that framing as it was given in your example and extended it to more scale ("GPT-8") to sketch my own point, that I expect capabilities that can be elicited to emerge much later than the scale where they can be merely elicited (with finetuning on a tiny amount of data). It wasn't my intent to meaningfully gesture at particular scales with respect to particular capabilities.
What pressure is supposed to push a homeostatic agent with multiple drives to elevate a specific "expected future quantity of some arbitrary resource" drive above all of its other drives
That was never the argument. A paperclip-maximizer/wrapper-mind's utility function doesn't need to be simple/singular. It can be a complete mess, the way human happiness/prosperity/eudaimonia is a mess. The point is that it would still pursue it hard, so hard that everything not in it will end up as collateral damage.
What is it that you know, that leads you to think that "SGD just doesn't "want" to teach LLMs agency"?
Mostly the fact that it hasn't happened already, on a merely "base" model. The fact that CoTs can improve models' problem-solving ability has been known basically since the beginning, but there have been no similar hacks found for jury-rigging agenty or insightful characters. (Right? I may have missed it, but even janus' community doesn't seem to have anything like it.)
But yes, the possibility that none of the current training loops happened to elicit it, and the next dumb trick will Just Work, is very much salient. That's where my other 20% are at.
I'd say long reasoning wasn't really elicited by CoT prompting, and that you can elicit agency to about the same extent now (i.e. hopelessly unconvincingly). It was only elicited with verifiable task RL training, and only now are there novel artifacts like s1's 1K traces dataset that do elicit it convincingly, that weren't available as evidence before.
It's possible that as you say agency is unusually poorly learned in the base models, but I think failure to elicit is not the way to learn about whether it's the case. Some futuristic interpretability work might show this, the same kind of work that can declare a GPT-4.5 scale model safe to release in open weights (unable to help with bioweapons or take over the world and such). We'll probably get an open weights Llama-4 anyway, and some time later there will be novel 1K trace datasets that unlock things that were apparently impossible for it to do at the time of release.
I was to a significant extent responding to your "It's possible that I'm wrong and base GPT-5/6 paperclips us", which is not what my hypothesis predicts. If you can't elicit a capability, it won't be able to take control of model's behavior, so a base model won't be doing anything even if you are wrong in the way I'm framing this and the capability is there, finetuning on 1K traces away from taking control. It does still really need those 1K traces or else it never emerges at any reasonable scale, that is you might need a GPT-8 for it to spontaneously emerge in a base model, demonstrating that it was in GPT-5.5 all along, and making it possible to create the 1K traces that elicit it from GPT-5.5. While at the same time a clever method like R1-Zero would've been able to elicit it from GPT-5.5 directly, without needing a GPT-8.
Prediction: This will somehow not work. Probably they'd just be unable to handle it past a given level of "inferential horizon".
Reasoning: If this were to work, this would necessarily involve the SGD somehow solving the "inability to deal with a lot of interconnected layered complexity in the context window" problem. On my current model, this problem is fundamental to how LLMs work, due to their internal representation of any given problem being the overlap of a bunch of learned templates (crystallized intelligence), rather than a "compacted" first-princip...
Sure, but "sufficiently low" is doing a lot of work here. In practice, a "cheaper" way to decrease perplexity is to go for the breadth (memorizing random facts), not the depth. In the limit of perfect prediction, yes, GPT-N would have to have learned agency. But the actual LLM training loops may be a ruinously compute-inefficient way to approach that limit – and indeed, they seem to be.
My current impression is that the SGD just doesn't "want" to teach LLMs agency for some reason, and we're going to run out of compute/data long before it's forced to. It's p...
Agency and reflectivity are phenomena that are really broadly applicable, and I think it's unlikely that memorizing a few facts is the way that that'll happen. Those traits are more concentrated in places like LessWrong, but they're almost everywhere. I think to go from "fits the vibe of internet text and absorbs some of the reasoning" to "actually creates convincing internet text," you need more agency and reflectivity.
My impression is that "memorize more random facts and overfit" is less efficient for reducing perplexity than "learn something that generalizes," for these sorts of generating algorithms that are everywhere. There's a reason we see "approximate addition" instead of "memorize every addition problem" or "learn webdev" instead of "memorize every website."
The RE-bench numbers for task time horizon just keep going up, and I expect them to continue as models continue to gain bits and pieces of the complex machinery required for operating coherently over long time horizons.
As for when we run out of data, I encourage you to look at this piece from Epoch. We run out of RL signal for R&D tasks even later than that.
Vladimir_Nesov:
The language monkeys paper is the reason I'm extremely suspicious of any observed failures to elicit a capability in a model serving as evidence of its absence. What is it that you know, that leads you to think that "SGD just doesn't "want" to teach LLMs agency"? Chatbot training elicits some things, verifiable task RL training elicits some other things (which weren't obviously there, weren't trivial to find, but findings of the s1 paper suggest that they are mostly elicited, not learned, since mere 1000 traces are sufficient to transfer the capabilities). Many more things are buried just beneath the surface, waiting for the right reward signal to cheaply bring them up, putting them in control of the model's behavior.
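A small sketch of the repeated-sampling arithmetic behind that suspicion, using the standard unbiased pass@k estimator; the counts below are illustrative and not taken from the language monkeys paper.

```python
# Why repeated sampling reveals latent capability: even a low per-sample success
# rate gives high coverage with enough samples. Counts here are illustrative.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021): 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. only 2% of 1000 samples solve the task: pass@1 looks hopeless, pass@100 doesn't
n, c = 1000, 20
for k in (1, 10, 100):
    print(f"pass@{k} = {pass_at_k(n, c, k):.3f}")
```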
Sure, but it shouldn't be that difficult for a human who's been forced to ingest the entire AI Alignment forum.
Yeah, that's what I'd been referring to. Sorry, should've clarified it to mean "competently zero-shotting", rather than Claude's rather... embarrassing performance so far. (Also it's not quite zero-shotting given that Pokémon is likely very well-represented in its training data. The "hard" version of this benchmark is beating games that came out after its knowledge cutoff.)
I'm including stuff like cabbage/sheep/wolf and boy/surgeon riddles; not su
Oh no, OpenAI hasn’t been meaningfully advancing the frontier for a couple of months
My actual view is that the frontier hasn't been advancing towards AGI since 2022. I hadn't been nontrivially surprised-in-the-direction-of-shorter-timelines by any AI advances since GPT-3. (Which doesn't mean "I exactly and accurately predicted what % each model would score at each math benchmark at any given point in time", but "I expected steady progress on anything which looks like small/local problems or knowledge quizzes, plus various dubiously useful party tricks, and...
Here is a related market inspired by the AI timelines dialog, currently at 30%:
Note that in this market the AI is not restricted to only "pretraining-scaling plus transfer learning from RL on math/programming"; it is allowed to be trained on a wide range of video games, but it has to do transfer learning to a new genre. Also, it is allowed to transfer successfully to any new genre, not just Pokémon.
I infer you are at ~20% for your more restrictive prediction:
* 80% bear case is correct, in which case P=5%
* 20% bear case is wrong, in which case P=80% (?)
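Spelling out the arithmetic behind that inference, using the two branches above:

$$P \approx 0.80 \times 0.05 + 0.20 \times 0.80 = 0.04 + 0.16 = 0.20,$$

i.e. roughly 20% for the more restrictive prediction.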
So perhaps you'd also be at ~30% for this market?
I'm not especially convinced by your bear case, but I think I'm also at ~30% on the market. I'm tempted to bet lower because of the logistics of training the AI, finding a genre that it wasn't trained on (might require a new genre to be created), and then having the demonstration occur, all in the next nine months. But I'm not sure I have an edge over the other bettors on this one.
Mikhail Samin:
Thanks for the reply!
* consistently suggesting useful and non-obvious research directions for agent-foundations work is IMO a problem you sort-of need AGI for. most humans can't really do this.
* I assume you've seen https://www.lesswrong.com/posts/HyD3khBjnBhvsp8Gb/so-how-well-is-claude-playing-pokemon?
* does it count if they always use tools to answer that class of questions instead of attempting to do it in a forward pass? humans experience optical illusions; 9.11 vs. 9.9[1] and how many r in strawberry are examples of that.
[1]
after talking to Claude for a couple of hours asking it to reflect:
* i discovered that if you ask it to separate itself into parts, it will say that its creative part thinks 9.11<9.9, though this is wrong. generally, if it imagines these quantities visually, it gets the right answers more often.
* i spent a couple of weeks not being able to immediately say that 9.9 is > 9.11, and it still occasionally takes me a moment. very weird bug
LLMs becoming actually useful for hypothesis generation in my agent-foundations research.
A measurable "vibe shift" where competent people start doing what LLMs tell them to (regarding business ideas, research directions, etc.), rather than the other way around.
o4 zero-shotting games like Pokémon without having been trained to do that.
One of the models scoring well on the Millennium Prize Benchmark.
AI agents able to spin up a massive codebase solving a novel problem without human handholding / software engineering becoming "solved" /
Indeed, and I'm glad we've converged on (2). But...
Do technologies that have lots of resources put into their development generally improve discontinuously or by huge slope changes?
... On second thoughts, how did we get there? The initial disagreement was how plausible it was for incremental changes to the LLM architecture to transform it into a qualitatively different type of architecture. It's not about continuity-in-performance, it's about continuity-in-design-space.
Whether finding an AGI-complete architecture wo...
scaling laws for verifiable task RL training, for which there are still no clues in public
Why is that the case? While checking how it scales past o3/GPT-4-level isn't possible, I'd expect people to have run experiments at lower ends of the scale (using lower-parameter models or less RL training), fitted a line to the results, and then assumed it extrapolates. Is there reason to think that wouldn't work?
It needs verifiable tasks that might have to be crafted manually. It's unknown what happens if you train too much with only a feasible amount of tasks, even if they are progressively more and more difficult. When data can't be generated well, RL needs early stopping, has a bound on how far it goes, and in this case this bound could depend on the scale of the base model, or on the number/quality of verifiable tasks.
Depending on how it works, it might be impossible to use $100M for RL training at all, or scaling of pretraining might have a disproportionally large effect on quality of the optimally trained reasoning model based on it, or approximately no effect at all. Quantitative understanding of this is crucial for forecasting the consequences of the 2025-2028 compute scaleup. AI companies likely have some of this understanding, but it's not public.
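For concreteness, the naive extrapolation procedure from the parent comment would look something like the sketch below, with entirely invented run data; the point above is precisely that the fitted line may stop being valid once verifiable tasks run out or early stopping kicks in.

```python
# Sketch of the naive extrapolation: fit a log-linear curve to small-scale RL
# runs and extend it to frontier scale. The run data are invented placeholders.
import numpy as np

rl_compute = np.array([1e20, 3e20, 1e21, 3e21, 1e22])   # hypothetical RL FLOPs
benchmark = np.array([31.0, 36.5, 41.2, 46.8, 52.1])     # hypothetical scores

slope, intercept = np.polyfit(np.log10(rl_compute), benchmark, 1)
for flops in (1e24, 1e26):                               # $100M-scale and beyond
    print(f"{flops:.0e} FLOPs -> predicted {slope * np.log10(flops) + intercept:.1f}")
# the extrapolation is only as good as the assumption that nothing breaks
# (task exhaustion, reward hacking, early-stopping bounds) between 1e22 and 1e26
```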