This is a special post for quick takes by Thane Ruthenis. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
[-]Thane Ruthenis*Ω361219

Alright, so I've been following the latest OpenAI Twitter freakout, and here's some urgent information about the latest closed-doors developments that I've managed to piece together:

  • Following OpenAI Twitter freakouts is a colossal, utterly pointless waste of your time and you shouldn't do it ever.
  • If you saw this comment of Gwern's going around and were incredibly alarmed, you should probably undo the associated update regarding AI timelines (at least partially, see below).
  • OpenAI may be running some galaxy-brained psyops nowadays.

Here's the sequence of events, as far as I can tell:

  1. Some Twitter accounts that are (claiming, without proof, to be?) associated with OpenAI are being very hype about some internal OpenAI developments.
  2. Gwern posts this comment suggesting an explanation for point 1.
  3. Several accounts (e. g., one, two) claiming (without proof) to be OpenAI insiders start to imply that:
    1. An AI model recently finished training.
    2. Its capabilities surprised and scared OpenAI researchers.
    3. It produced some innovation/is related to OpenAI's "Level 4: Innovators" stage of AGI development.
  4. Gwern's comment goes viral on Twitter (example).
  5. A news story about GPT-4b micro comes out, indeed confirmi
... (read more)

I personally put a relatively high probability on this being a galaxy-brained media psyop by OpenAI/Sam Altman.

Eliezer makes a very good point that confusion around people claiming AI advances/whistleblowing benefits OpenAI significantly, and Sam Altman has a history of making galaxy-brained political plays (attempting to get Helen fired (and then winning), testifying to Congress that it is good he has oversight via the board and that he should not be in full control of OpenAI, and then replacing the board with underlings, etc.).

Sam is very smart and politically capable.  This feels in character. 

It all started with Sam's six-word story, so it looks like organized hype.

9Seth Herd
Thanks for doing this so I didn't have to! Hell is other people - on social media. And it's an immense time-sink. Zvi is the man for saving the rest of us vast amounts of time and sanity. I'd guess the psyop spun out of control with a couple of opportunistic posters pretending they had inside information, and that's why Sam had to say lower your expectations 100x. I'm sure he wants hype, but he doesn't want high expectations that are very quickly falsified. That would lead to some very negative stories about OpenAI's prospects, even if they're equally silly they'd harm investment hype.
6Thane Ruthenis
There's a possibility that this was a clown attack on OpenAI instead...
7Alexander Gietelink Oldenziel
Thanks for the sleuthing.   The thing is - last time I heard about OpenAI rumors it was Strawberry.  The unfortunate fact of life is that too many times OpenAI shipping has surpassed all but the wildest speculations.
7Thane Ruthenis
That was part of my reasoning as well, why I thought it might be worth engaging with! But I don't think this is the same case. Strawberry/Q* was being leaked-about from more reputable sources, and it was concurrent with dramatic events (the coup) that were definitely happening. In this case, all evidence we have is these 2-3 accounts shitposting.
6Alexander Gietelink Oldenziel
Thanks.  Well 2-3 shitposters and one gwern.  Who would be so foolish to short gwern? Gwern the farsighted, gwern the prophet, gwern for whom entropy is nought, gwern augurious augustus
4Joseph Miller
I feel like for the same reasons, this shortform is kind of an engaging waste of my time. One reason I read LessWrong is to avoid twitter garbage.
6Thane Ruthenis
Valid. I was split on whether it was worth posting vs. whether it'd just be me taking part in spreading this nonsense. But it seemed to me that a lot of people, including LW regulars, might've been fooled, so I erred on the side of posting.
3Cervera
I don't think any of that invalidates that Gwern is, as usual, usually right.
4Thane Ruthenis
As I'd said, I think he's right about the o-series' theoretical potential. I don't think there is, as of yet, any actual indication that this potential has already been harnessed, and therefore that it works as well as the theory predicts. (And of course, the o-series scaling quickly at math is probably not even an omnicide threat. There's an argument for why it might be – that the performance boost will transfer to arbitrary domains – but that doesn't seem to be happening. I guess we'll see once o3 is public.)
2Hzn
I think superhuman AI is inherently very easy. I can't comment on the reliability of those accounts. But the technical claims seem plausible.

I am not an AI successionist because I don't want myself and my friends to die.

There are various high-minded arguments that AIs replacing us is okay because it's just like cultural change and our history is already full of those, or because they will be our "mind children", or because they will be these numinous enlightened beings and it is our moral duty to give birth to them.

People then try to refute those by nitpicking which kinds of cultural change are okay or not, or to what extent AIs' minds will be descended from ours, or whether AIs will necessarily have consciousnesses and feel happiness.

And it's very cool and all, I'd love me some transcendental cultural change and numinous mind-children. But all those concerns are decidedly dominated by "not dying" in my Maslow hierarchy of needs. Call me small-minded.

If I were born in the 1700s, I'd have little recourse but to suck it up and be content with biological children or "mind-children" students or something. But we seem to have an actual shot at not-dying here[1]. If it's an option to not have to be forcibly "succeeded" by anything, I care quite a lot about trying to take this option.[2]

Many other people also have such preferences... (read more)

[-]evhub3622

I really don't understand this debate—surely if we manage to stay in control of our own destiny we can just do both? The universe is big, and current humans are very small—we should be able to both stay alive ourselves and usher in an era of crazy enlightened beings doing crazy transhuman stuff.

I think it’s more likely than not that “crazy enlightened beings doing crazy transhuman stuff” will be bad for “regular” biological humans (ie. it’ll decrease our number/QoL/agency/pose existential risks).

[-]evhub1510

I mostly disagree with "QoL" and "pose existential risks", at least in the good futures I'm imagining—those things are very cheap to provide to current humans. I could see "number" and "agency", but that seems fine? I think it would be bad for any current humans to die, or to lose agency over their current lives, but it seems fine and good for us to not try to fill the entire universe with biological humans, and for us to not insist on biological humans having agency over the entire universe. If there are lots of other sentient beings in existence with their own preferences and values, then it makes sense that they should have their own resources and have agency over themselves rather than us having agency over them.

If there are lots of other sentient beings in existence with their own preferences and values, then it makes sense that they should have their own resources and have agency over themselves rather than us having agency over them

Perhaps yes (although I'd say it depends on what the trade-offs are), but the situation is different if we have a choice in whether or not to bring said sentient beings with different preferences into existence in the first place. Doing so on purpose seems pretty risky to me (as opposed to minimizing the sentience, independence, and agency of AI systems as much as possible, and instead directing the technology to promote "regular" human flourishing/our current values).

7Thane Ruthenis
Not any more risky than bringing in humans. This is a governance/power distribution problem, not a what-kind-of-mind-this-is problem. Biological humans sometimes go evil or crazy. If you have a system that can handle that, you have a system that can handle alien minds that are evil or crazy (from our perspective), as long as you don't imbue them with more power than this system can deal with (and why would you?). (On the other hand, if your system can't deal with crazy evil biological humans, it's probably already a lawless wild-west hellhole, so bringing in some aliens won't exacerbate the problem much.)
4Nina Panickssery
1. Humans are more likely to be aligned with humanity as a whole compared to AIs, even if there are exceptions. 2. Many existing humans want their descendants to exist, so they are fulfilling the preferences of today's humans.
4Thane Ruthenis
"AIs as trained by DL today" are only a small subset of "non-human minds". Other mind-generating processes can produce minds that are as safe to have around as humans, but which are still completely alien. Many existing humans also want fascinating novel alien minds to exist.
7evhub
Certainly I'm excited about promoting "regular" human flourishing, though it seems overly limited to focus only on that. I'm not sure if by "regular" you mean only biological, but at least the simplest argument that I find persuasive here against only ever having biological humans is just a resource utilization argument, which is that biological humans take up a lot of space and a lot of resources and you can get the same thing much more cheaply if you bring into existence lots of simulated humans instead (certainly I agree that doesn't imply we should kill existing humans and replace them with simulations, though, unless they consent to that). And I think even if you included simulated humans in "regular" humans, I also think I value diversity of experience, and a universe full of very different sorts of sentient/conscious lifeforms having satisfied/fulfilling/flourishing experiences seems better than just "regular" humans. I also separately don't buy that it's riskier to build AIs that are sentient—in fact, I think it's probably better to build AIs that are moral patients than AIs that are not moral patients.

IMO, it seems bad to intentionally try to build AIs which are moral patients until after we've resolved acute risks and we're deciding what to do with the future longer term. (E.g., don't try to build moral patient AIs until we're sending out space probes or deciding what to do with space probes.) Of course, this doesn't mean we'll avoid building AIs which aren't significant moral patients in practice because our control is very weak and commercial/power incentives will likely dominate.

I think trying to make AIs be moral patients earlier pretty clearly increases AI takeover risk and seems morally bad. (Views focused on non-person-affecting upside get dominated by the long run future, so these views don't care about making moral patient AIs which have good lives in the short run. I think the most plausible views which care about shorter run patienthood mostly just want to avoid downside so they'd prefer no patienthood at all for now.)

The only upside is that it might increase value conditional on AI takeover. But, I think "are the AIs morally valuable themselves" is much less important than the preferences of these AIs from the perspective of longer run value conditional on AI takeov... (read more)

4evhub
How so? Seems basically orthogonal to me? And to the extent that it does matter for takeover risk, I'd expect the sorts of interventions that make it more likely that AIs are moral patients to also make it more likely that they're aligned. Even absent AI takeover, I'm quite worried about lock-in. I think we could easily lock in AIs that are or are not moral patients and have little ability to revisit that decision later, and I think it would be better to lock in AIs that are moral patients if we have to lock something in, since that opens up the possibility for the AIs to live good lives in the future. I agree that seems like the more important highest-order bit, but it's not an argument that making AIs moral patients is bad, just that it's not the most important thing to focus on (which I agree with).
8ryan_greenblatt
I would have guessed that "making AIs be moral patients" looks like "make AIs have their own independent preferences/objectives which we intentionally don't control precisely", which increases misalignment risks. At a more basic level, if AIs are moral patients, then there will be downsides for various safety measures and AIs would have plausible deniability for being opposed to safety measures. IMO, the right response to the AI taking a stand against your safety measures for AI welfare reasons is "Oh shit, either this AI is misaligned or it has welfare. Either way this isn't what we wanted and needs to be addressed, we should train our AI differently to avoid this." I don't understand, won't all the value come from minds intentionally created for value rather than in the minds of the laborers? Also, won't architecture and design of AIs radically shift after humans aren't running day-to-day operations? I don't understand the type of lock-in you're imagining, but it naively sounds like a world which has negligible longtermist value (because we got locked into obscure specifics like this), so making it somewhat better isn't important.
5Nina Panickssery
Interesting! Aside from the implications for human agency/power, this seems worse because of the risk of AI suffering—if we build sentient AIs we need to be way more careful about how we treat/use them. 
3cubefox
Exactly. Bringing a new kind of moral patient into existence is a moral hazard, because once they exist, we will have obligations toward them, e.g. providing them with limited resources (like land), and giving them part of our political power via voting rights. That's analogous to Parfit's Mere Addition Paradox that leads to the repugnant conclusion, in this case human marginalization.
2Vladimir_Nesov
(How could "land" possibly be a limited resource, especially in the context of future AIs? The world doesn't exist solely on the immutable surface of Earth...)
9Thane Ruthenis
I mean, if you interpret "land" in a Georgist sense, as the sum of all natural resources of the reachable universe, then yes, it's finite. And the fights for carving up that pie can start long before our grabby-alien hands have seized all of it. (The property rights to the Andromeda Galaxy can be up for sale long before our Von Neumann probes reach it.)
6Vladimir_Nesov
The salient referent is compute, sure, my point is that it's startling to see what should in this context be compute within the future lightcone being (very indirectly) called "land". (I do understand that this was meant as an example clarifying the meaning of "limited resources", and so it makes perfect sense when decontextualized. It's just not an example that fits that well when considered within this particular context.) (I'm guessing the physical world is unlikely to matter in the long run other than as substrate for implementing compute. For that reason importance of understanding the physical world, for normative or philosophical reasons, seems limited. It's more important how ethics and decision theory work for abstract computations, the meaningful content of the contingent physical computronium.)
2cubefox
A population of AI agents could marginalize humans significantly before they are intelligent enough to easily (and quickly!) create more Earths.
8Vladimir_Nesov
For me, a crux of a future that's good for humanity is giving the biological humans the resources and the freedom to become the enlightened transhuman beings themselves, with no hard ceiling on relevance in the long run. Rather than only letting some originally-humans grow into more powerful but still purely ornamental roles, or not letting them grow at all, or not letting them think faster and do checkpointing and multiple instantiations of the mind states using a non-biological cognitive substrate, or letting them unwillingly die of old age or disease. (For those who so choose, under their own direction rather than only through externally imposed uplifting protocols, even if that leaves it no more straightforward than world-class success of some kind today, to reach a sensible outcome.) This in particular implies reasonable resources being left to those who remain/become regular biological humans (or take their time growing up), including through influence of some of these originally-human beings who happen to consider that a good thing to ensure. Edit: Expanded into a post.
6hairyfigment
This sounds like a question which can be addressed after we figure out how to avoid extinction. I do note that you were the one who brought in "biological humans," as if that meant the same as "ourselves" in the grandparent. That could already be a serious disagreement, in some other world where it mattered.
1MattJ
The mere fear that the entire human race will be exterminated in their sleep through some intricate causality we are too dumb to understand will seriously diminish our quality of life.
8Thane Ruthenis
I very much agree. The hardcore successionist stances, as I understand them, are either that trying to stay in control at all is immoral/unnatural, or that creating the enlightened beings ASAP matters much more than whether we live through their creation. (Edit: This old tweet by Andrew Critch is still a good summary, I think.) So it's not that they're opposed to the current humanity's continuation, but that it matters very little compared to ushering in the post-Singularity state. Therefore, anything that risks or delays the Singularity in exchange for boosting the current humans' safety is opposed.
1fasf
Another stance is that it would suck to die the day before AI makes us immortal (this, for example, is Bryan Johnson's main motivation for maximizing his lifespan). Hence trying to delay AI advancement is opposed.
4Thane Ruthenis
Yeah, but that's a predictive disagreement between our camps (whether the current-paradigm AI is controllable), not a values disagreement. I would agree that if we find a plan that robustly outputs an aligned AGI, we should floor it in that direction.
8Vladimir_Nesov
Endorsing successionism might be strongly correlated with expecting the "mind children" to keep humans around, even if in a purely ornamental role and possibly only at human timescales. This might be more of a bailey position, so when pressed on it they might affirm that their endorsement of successionism is compatible with human extinction, but in their heart they would still hope and expect that it won't come to that. So I think complaints about human extinction will feel strawmannish to most successionists.
7Thane Ruthenis
I'm not so sure about that: Though sure, Critch's process there isn't white-boxed, so any number of biases might be in it.
4lc
"Successionism" is such a bizarre position that I'd look for the underlying generator rather than try to argue with it directly.
8sunwillrise
I'm not sure it's that bizarre. It's anti-Humanist, for sure, in the sense that it doesn't focus on the welfare/empowerment/etc. of humans (either existing or future) as its end goal. But that doesn't, by itself, make it bizarre. From Eliezer's Raised in Technophilia, back in the day: From A prodigy of refutation: From the famous Musk/Larry Page breakup: Successionism is the natural consequence of an affective death spiral around technological development and anti-chauvinism. It's as simple as that. Successionists start off by believing that technological change makes things better. That not only does it virtually always make things better, but that it's pretty much the only thing that ever makes things better. Everything else, whether it's values, education, social organization etc., pales in comparison to technological improvements in terms of how they affect the world; they are mere short-term blips that cannot change the inevitable long-run trend of positive change. At the same time, they are raised, taught, incentivized to be anti-chauvinist. They learn, either through stories, public pronouncements, in-person social events etc., that those who stand athwart history yelling stop are always close-minded bigots who want to prevent new classes of beings (people, at first; then AIs, afterwards) from receiving the moral personhood they deserve. In their eyes, being afraid of AIs taking over is like being afraid of The Great Replacement if you're white and racist. You're just a regressive chauvinist desperately clinging to a discriminatory worldview in the face of an unstoppable tide of change that will liberate new classes of beings from your anachronistic and damaging worldview. Optimism about technology and opposition to chauvinism are both defensible, and arguably even correct, positions in most cases. Even if you personally (as I do) believe non-AI technology can also have pretty darn awful effects on us (social media, online gambling) and that carin
2cubefox
An AI successionist usually argues that successionism isn't bad even if dying is bad. For example, when humanity is prevented from having further children, e.g. by sterilization. I say that even in this case successionism is bad. Because I (and I presume: many people) want humanity, including our descendants, to continue into the future. I don't care about AI agents coming into existence and increasingly marginalizing humanity.

Current take on the implications of "GPT-4b micro": Very powerful, very cool, ~zero progress to AGI, ~zero existential risk. Cheers.

First, the gist of it appears to be:

OpenAI’s new model, called GPT-4b micro, was trained to suggest ways to re-engineer the protein factors to increase their function. According to OpenAI, researchers used the model’s suggestions to change two of the Yamanaka factors to be more than 50 times as effective—at least according to some preliminary measures.

The model was trained on examples of protein sequences from many species, as well as information on which proteins tend to interact with one another. [...] Once Retro scientists were given the model, they tried to steer it to suggest possible redesigns of the Yamanaka proteins. The prompting tactic used is similar to the “few-shot” method, in which a user queries a chatbot by providing a series of examples with answers, followed by an example for the bot to respond to.
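(For illustration, here's a rough sketch of what a few-shot prompt of that general shape might look like. The sequences and "redesigns" below are entirely made up; this is not the actual format or data Retro/OpenAI used.)

```python
# Illustrative only: a generic few-shot prompt in the style described above.
# The sequences and variants are invented placeholders, not real data.
few_shot_prompt = """\
Protein: MKTAYIAKQR...  ->  Redesigned variant: MKTAYIVKQR...
Protein: MGSSHHHHHH...  ->  Redesigned variant: MGSSHHQHHH...
Protein: MALWMRLLPL...  ->  Redesigned variant:"""

# The model is asked to complete the final line, continuing the pattern
# established by the worked examples.
print(few_shot_prompt)
```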

Crucially, if the reporting is accurate, this is not an agent. The model did not engage in autonomous open-ended research. Rather, humans guessed that if a specific model is fine-tuned on a specific dataset, the gradient descent would chisel... (read more)

5Logan Riggs
For those also curious, Yamanaka factors are specific genes that turn specialized cells (e.g. skin, hair) into induced pluripotent stem cells (iPSCs) which can turn into any other type of cell. This is a big deal because you can generate lots of stem cells to make full organs[1] or reverse aging (maybe? they say you just turn the cell back younger, not all the way to stem cells). You can also do better disease modeling/drug testing: if you get skin cells from someone w/ a genetic kidney disease, you can turn those cells into the iPSCs, then into kidney cells which will exhibit the same kidney disease because it's genetic. You can then better understand how the [kidney disease] develops and how various drugs affect it. So, it's good to have ways to produce lots of these iPSCs. According to the article, SOTA was <1% of cells converted into iPSCs, whereas the GPT suggestions caused a 50x improvement to 33% of cells converted. That's quite huge, so hopefully this result gets verified. I would guess this is true and still a big deal, but concurrent work got similar results. Too bad about the tumors. Turns out iPSCs are so good at turning into other cells that they can turn into infinite cells (ie cancer). iPSCs were used to fix spinal cord injuries (in mice) which looked successful for 112 days, but then a follow-up study said [a different set of mice also w/ spinal iPSCs] resulted in tumors. My current understanding is this is caused by the method of delivering these genes (ie the Yamanaka factors) through a retrovirus, which I'd guess is the method Retro Biosciences uses. I also really loved the story of how Yamanaka discovered iPSCs: 1. ^ These organs would have the same genetics as the person who supplied the [skin/hair cells] so risk of rejection would be lower (I think)
[-]TsviBTΩ3110

According to the article, SOTA was <1% of cells converted into iPSCs

I don't think that's right, see https://www.cell.com/cell-stem-cell/fulltext/S1934-5909(23)00402-2

4Logan Riggs
You're right! Thanks. For mice, up to 77%. For human cells, up to 9% (if I'm understanding this part correctly). So it seems like results can be wildly different depending on the setting (mice, humans, bovine, etc.), and I don't know what the Retro folks were doing, but it does make their result less impressive.
4TsviBT
(Still impressive and interesting of course, just not literally SOTA.)
4Logan Riggs
Thinking through it more, Sox2-17 (they changed 17 amino acids from Sox2 gene) was your linked paper's result, and Retro's was a modified version of factors Sox AND KLF. Would be cool if these two results are complementary.

Here's an argument for a capabilities plateau at the level of GPT-4 that I haven't seen discussed before. I'm interested in any holes anyone can spot in it.

Consider the following chain of logic:

  1. The pretraining scaling laws only say that, even for a fixed training method, increasing the model's size and the amount of data you train on increases the model's capabilities – as measured by loss, performance on benchmarks, and the intuitive sense of how generally smart a model is.
  2. Nothing says that increasing a model's parameter-count and the amount of compute spent on training it is the only way to increase its capabilities. If you have two training methods A and B, it's possible that the B-trained X-sized model matches the performance of the A-trained 10X-sized model.
  3. Empirical evidence: Sonnet 3.5 (at least the not-new one), Qwen-2.5-72B, and Llama-3-70B all have 70-ish billion parameters, i.e., less than GPT-3. Yet, their performance is at least on par with that of GPT-4 circa early 2023.
  4. Therefore, it is possible to "jump up" a tier of capabilities, by any reasonable metric, using a fixed model size but improving the training methods.
  5. The latest set of GPT-4-sized models (Opus 3.5, Ori
... (read more)

Given some amount of compute, a compute optimal model tries to get the best perplexity out of it when training on a given dataset, by choosing model size, amount of data, and architecture. An algorithmic improvement in pretraining enables getting the same perplexity by training on data from the same dataset with less compute, achieving better compute efficiency (measured as its compute multiplier).

Many models aren't trained compute optimally, they are instead overtrained (the model is smaller, trained on more data). This looks impressive, since a smaller model is now much better, but this is not an improvement in compute efficiency, doesn't in any way indicate that it became possible to train a better compute optimal model with a given amount of compute. The data and post-training also recently got better, which creates the illusion of algorithmic progress in pretraining, but their effect is bounded (while RL doesn't take off), doesn't get better according to pretraining scaling laws once much more data becomes necessary. There is enough data until 2026-2028, but not enough good data.
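(A rough numerical sketch of the compute-optimal vs. overtrained distinction, assuming the common C ≈ 6·N·D approximation and Chinchilla's ~20 tokens/parameter heuristic; the budget and the 3x size reduction are illustrative, not claims about any specific model.)

```python
# Rough sketch: compute-optimal vs. overtrained allocation of a fixed training
# compute budget, using the common C ≈ 6·N·D approximation (N = parameters,
# D = training tokens) and Chinchilla's ~20 tokens/parameter heuristic.
# All numbers are illustrative, not claims about any specific model.

def chinchilla_optimal(compute: float, tokens_per_param: float = 20) -> tuple[float, float]:
    # Solve 6 * N * (r * N) = C  =>  N = sqrt(C / (6 * r))
    n = (compute / (6 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

budget = 1e24  # FLOPs (illustrative)
n_opt, d_opt = chinchilla_optimal(budget)
print(f"compute-optimal: {n_opt / 1e9:.0f}B params, {d_opt / 1e12:.1f}T tokens")

# An overtrained model spends the same budget on a smaller model and more data:
n_small = n_opt / 3
d_big = budget / (6 * n_small)
print(f"overtrained:     {n_small / 1e9:.0f}B params, {d_big / 1e12:.1f}T tokens "
      f"({d_big / n_small:.0f} tokens/param)")
```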

I don't think the cumulative compute multiplier since GPT-4 is that high, I'm guessing 3x, except p... (read more)

3Thane Ruthenis
Coming back to this in the wake of DeepSeek r1... How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1, despite using training methods that clearly have wildly different returns on compute investment? Like, GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million. Yet, they're roughly at the same level. That should be very surprising. Investing a very different amount of money into V3's training should've resulted in it either massively underperforming GPT-4, or massively overperforming, not landing precisely in its neighbourhood! Consider this graph. If we find some training method A, and discover that investing $100 million in it lands us at just above "dumb human", and then find some other method B with a very different ROI, and invest $5.5 million in it, the last thing we should expect is to again land near "dumb human". Or consider this trivial toy model: You have two linear functions, f(x) = Ax and g(x) = Bx, where x is the compute invested, output is the intelligence of the model, and f and g are different training methods. You pick some x effectively at random (whatever amount of money you happened to have lying around), plug it into f, and get, say, 120. Then you pick a different random value of x, plug it into g, and get... 120 again. Despite the fact that the multipliers A and B are likely very different, and you used very different x-values as well. How come? The explanations that come to mind are: * It actually is just that much of a freaky coincidence. * DeepSeek have a superintelligent GPT-6 equivalent that they trained for $10 million in their basement, and V3/r1 are just flexes that they specifically engineered to match GPT-4-ish level. * DeepSeek directly trained on GPT-4 outputs, effectively just distilling GPT-4 into their model, hence the anchoring. * DeepSeek kept investing and tinkering until getting to GPT-4ish level, an
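(A toy Monte Carlo of the intuition above: if the return-on-compute multiplier and the compute budget are both chosen independently for two training methods, how often do the resulting capability levels land close to each other by chance? The ranges below are arbitrary; the point is only that near-coincidence is rare.)

```python
import random

# Toy version of the argument above: two training methods with different
# returns on compute (slopes A and B), each given an independently chosen
# budget. How often do their outputs land within 10% of each other by chance?
random.seed(0)
trials = 100_000
close = 0
for _ in range(trials):
    A, B = random.uniform(0.5, 5), random.uniform(0.5, 5)    # method multipliers
    x1, x2 = random.uniform(1, 100), random.uniform(1, 100)  # compute budgets
    f, g = A * x1, B * x2
    if abs(f - g) / max(f, g) < 0.1:
        close += 1
print(f"within 10% of each other in {close / trials:.1%} of trials")
```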
5Vladimir_Nesov
Selection effect. If DeepSeek-V2.5 was this good, we would be talking about it instead. Original GPT-4 is 2e25 FLOPs and compute optimal, V3 is about 5e24 FLOPs and overtrained (400 tokens/parameter, about 10x-20x), so a compute optimal model with the same architecture would only need about 3e24 FLOPs of raw compute[1]. Original GPT-4 was trained in 2022 on A100s and needed a lot of them, while in 2024 it could be trained on 8K H100s in BF16. DeepSeek-V3 is trained in FP8, doubling the FLOP/s, so the FLOPs of original GPT-4 could be produced in FP8 by mere 4K H100s. DeepSeek-V3 was trained on 2K H800s, whose performance is about that of 1.5K H100s. So the cost only has to differ by about 3x, not 20x, when comparing a compute optimal variant of DeepSeek-V3 with original GPT-4, using the same hardware and training with the same floating point precision. The relevant comparison is with GPT-4o though, not original GPT-4. Since GPT-4o was trained in late 2023 or early 2024, there were 30K H100s clusters around, which makes 8e25 FLOPs of raw compute plausible (assuming it's in BF16). It might be overtrained, so make that 4e25 FLOPs for a compute optimal model with the same architecture. Thus when comparing architectures alone, GPT-4o probably uses about 15x more compute than DeepSeek-V3. Returns on compute are logarithmic though, advantage of a $150 billion training system over a $150 million one is merely twice that of $150 billion over $5 billion or $5 billion over $150 million. Restrictions on access to compute can only be overcome with 30x compute multipliers, and at least DeepSeek-V3 is going to be reproduced using the big compute of US training systems shortly, so that advantage is already gone. ---------------------------------------- 1. That is, raw utilized compute. I'm assuming the same compute utilization for all models. ↩︎
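(A quick numerical restatement of the comparison above, plugging in the same rough estimates used in this comment; none of these figures are official.)

```python
# Reproducing the rough FLOP comparison above; all figures are this comment's
# estimates, not official numbers.

gpt4_raw      = 2e25   # original GPT-4, assumed compute optimal
v3_raw        = 5e24   # DeepSeek-V3, overtrained (~400 tokens/parameter)
v3_optimal    = 3e24   # estimated compute-optimal variant, same architecture

gpt4o_raw     = 8e25   # plausible raw compute for GPT-4o (BF16, ~30K H100s)
gpt4o_optimal = 4e25   # adjusted downward assuming GPT-4o is overtrained

print(f"raw compute, original GPT-4 vs V3: ~{gpt4_raw / v3_raw:.0f}x")
print(f"GPT-4o vs V3, compute-optimal equivalents: "
      f"~{gpt4o_optimal / v3_optimal:.0f}x")   # ~13x, rounded to ~15x above

# Hardware side: V3 was trained in FP8 on 2K H800s (~1.5K H100-equivalents).
# FP8 roughly doubles FLOP/s vs BF16, so ~4K H100s running FP8 could produce
# the raw FLOPs of original GPT-4.
h100_equiv_for_original_gpt4 = 4_000
h100_equiv_for_v3            = 1_500
print(f"like-for-like hardware gap vs original GPT-4: "
      f"~{h100_equiv_for_original_gpt4 / h100_equiv_for_v3:.1f}x")  # about 3x
```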
4Noosphere89
I buy that 1 and 4 are the case, combined with DeepSeek probably being satisfied that GPT-4-level models were achieved. Edit: I did not mean to imply that the GPT-4ish neighbourhood is where LLM pretraining plateaus at all, @Thane Ruthenis.
2Thane Ruthenis
Thanks! You're more fluent in the scaling laws than me: is there an easy way to roughly estimate how much compute would've been needed to train a model as capable as GPT-3 if it were done Chinchilla-optimally + with MoEs? That is: what's the actual effective "scale" of GPT-3? (Training GPT-3 reportedly took 3e23 FLOPS, and GPT-4 2e25 FLOPS. Naively, the scale-up factor is 67x. But if GPT-3's level is attainable using less compute, the effective scale-up is bigger. I'm wondering how much bigger.)
7Vladimir_Nesov
IsoFLOP curves for dependence of perplexity on log-data seem mostly symmetric (as in Figure 2 of Llama 3 report), so overtraining by 10x probably has about the same effect as undertraining by 10x. Starting with a compute optimal model, increasing its data 10x while decreasing its active parameters 3x (making it 30x overtrained, using 3x more compute) preserves perplexity (see Figure 1). GPT-3 is a 3e23 FLOPs dense transformer with 175B parameters trained for 300B tokens (see Table D.1). If Chinchilla's compute optimal 20 tokens/parameter is approximately correct for GPT-3, it's 10x undertrained. Interpolating from the above 30x overtraining example, a compute optimal model needs about 1.5e23 FLOPs to get the same perplexity. (The effect from undertraining of GPT-3 turns out to be quite small, reducing effective compute by only 2x. Probably wasn't worth mentioning compared to everything else about it that's different from GPT-4.)
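(For reference, a minimal sketch of this arithmetic, assuming the standard C ≈ 6·N·D approximation; the ~2x effective-compute figure above comes from the isoFLOP interpolation and isn't reproduced here.)

```python
# Checking the parent comment's arithmetic with the standard C ≈ 6·N·D
# approximation (a rough heuristic, as is Chinchilla's 20 tokens/parameter).

N = 175e9   # GPT-3 parameters
D = 300e9   # GPT-3 training tokens
C = 6 * N * D
print(f"GPT-3 raw compute: {C:.1e} FLOPs")   # close to the 3e23 figure

print(f"tokens per parameter: {D / N:.1f}")  # ~1.7, roughly 10x below Chinchilla's 20

# Compute-optimal allocation of the same compute at 20 tokens/parameter:
N_opt = (C / (6 * 20)) ** 0.5
D_opt = 20 * N_opt
print(f"compute-optimal at the same FLOPs: "
      f"{N_opt / 1e9:.0f}B params, {D_opt / 1e12:.1f}T tokens")
```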
[-]leogaoΩ9198

in retrospect, we know from chinchilla that gpt3 allocated its compute too much to parameters as opposed to training tokens. so it's not surprising that models since then are smaller. model size is a less fundamental measure of model cost than pretraining compute. from here on i'm going to assume that whenever you say size you meant to say compute.

obviously it is possible to train better models using the same amount of compute. one way to see this is that it is definitely possible to train worse models with the same compute, and it is implausible that the current model production methodology is the optimal one.

it is unknown how much compute the latest models were trained with, and therefore what compute efficiency win they obtain over gpt4. it is unknown how much more effective compute gpt4 used than gpt3. we can't really make strong assumptions using public information about what kinds of compute efficiency improvements have been discovered by various labs at different points in time. therefore, we can't really make any strong conclusions about whether the current models are not that much better than gpt4 because of (a) a shortage of compute, (b) a shortage of compute efficiency improvements, or (c) a diminishing return of capability wrt effective compute.

One possible answer is that we are in what one might call an "unhobbling overhang."

Aschenbrenner uses the term "unhobbling" for changes that make existing model capabilities possible (or easier) for users to reliably access in practice.

His presentation emphasizes the role of unhobbling as yet another factor growing the stock of (practically accessible) capabilities over time. IIUC, he argues that better/bigger pretraining would produce such growth (to some extent) even without more work on unhobbling, but in fact we're also getting better at unhobbling over time, which leads to even more growth.

That is, Aschenbrenner treats pretraining improvements and unhobbling as economic substitutes: you can improve "practically accessible capabilities" to the same extent by doing more of either one even in the absence of the other, and if you do both at once that's even better.

However, one could also view the pair more like economic complements. Under this view, when you pretrain a model at "the next tier up," you also need to do novel unhobbling research to "bring out" the new capabilities unlocked at that tier. If you only scale up, while re-using the unhobbling tech of yesteryear, most of t... (read more)

6Seth Herd
That seems maybe right, in that I don't see holes in your logic on LLM progression to date, off the top of my head. It also lines up with a speculation I've always had. In theory LLMs are predictors, but in practice, are they pretty much imitators? If you're imitating human language, you're capped at reproducing human verbal intelligence (other modalities are not reproducing human thought so not capped; but they don't contribute as much in practice without imitating human thought). I've always suspected LLMs will plateau. Unfortunately I see plenty of routes to improving using runtime compute/CoT and continuous learning. Those are central to human intelligence. LLMs already have slightly-greater-than-human system 1 verbal intelligence, leaving some gaps where humans rely on other systems (e.g., visual imagination for tasks like tracking how many cars I have or tic-tac-toe). As we reproduce the systems that give humans system 2 abilities by skillfully iterating system 1, as o1 has started to do, they'll be noticeably smarter than humans. The difficulty of finding new routes forward in this scenario would produce a very slow takeoff. That might be a big benefit for alignment.
2Thane Ruthenis
Yep, I think that's basically the case. @nostalgebraist makes an excellent point that eliciting any latent superhuman capabilities which bigger models might have is an art of its own, and that "just train chatbots" doesn't exactly cut it, for this task. Maybe that's where some additional capabilities progress might still come from. But the AI industry (both AGI labs and the menagerie of startups and open-source enthusiasts) has so far been either unwilling or unable to move past the chatbot paradigm. (Also, I would tentatively guess that this type of progress is not existentially threatening. It'd yield us a bunch of nice software tools, not a lightcone-eating monstrosity.)
4Seth Herd
I agree that chatbot progress is probably not existentially threatening. But it's all too short a leap to making chatbots power general agents. The labs have claimed to be willing and enthusiastic about moving to an agent paradigm. And I'm afraid that a proliferation of even weakly superhuman or even roughly parahuman agents could be existentially threatening. I spell out my logic for how short the leap might be from current chatbots to takeover-capable AGI agents in my argument for short timelines being quite possible. I do think we've still got a good shot of aligning that type of LLM agent AGI since it's a nearly best-case scenario. RL even in o1 is really mostly used for making it accurately follow instructions, which is at least roughly the ideal alignment goal of Corrigibility as Singular Target. Even if we lose faithful chain of thought and orgs don't take alignment that seriously, I think those advantages of not really being a maximizer and having corrigibility might win out. That, in combination with the slower takeoff, makes me tempted to believe it's actually a good thing if we forge forward, even though I'm not at all confident that this will actually get us aligned AGI or good outcomes. I just don't see a better realistic path.
5green_leaf
One obvious hole would be that capabilities did not, in fact, plateau at the level of GPT-4.
8Seth Herd
I thought the argument was that progress has slowed down immensely. The softer form of this argument is that LLMs won't plateau but progress will slow to such a crawl that other methods will surpass them. The arrival of o1 and o3 says this has already happened, at least in limited domains - and hybrid training methods and perhaps hybrid systems probably will proceed to surpass base LLMs in all domains.
3Thane Ruthenis
There's been incremental improvement and various quality-of-life features like more pleasant chatbot personas, tool use, multimodality, gradually better math/programming performance that make the models useful for gradually bigger demographics, et cetera. But it's all incremental, no jumps like 2-to-3 or 3-to-4.
4green_leaf
I see, thanks. Just to make sure I'm understanding you correctly, are you excluding the reasoning models, or are you saying there was no jump from GPT-4 to o3? (At first I thought you were excluding them in this comment, until I noticed the "gradually better math/programming performance.")
6Thane Ruthenis
I think GPT-4 to o3 represent non-incremental narrow progress, but only, at best, incremental general progress. (It's possible that o3 does "unlock" transfer learning, or that o4 will do that, etc., but we've seen no indication of that so far.)

So, Project Stargate. Is it real, or is it another "Sam Altman wants $7 trillion"? Some points:

  • The USG invested nothing in it. Some news outlets are being misleading about this. Trump just stood next to them and looked pretty, maybe indicated he'd cut some red tape. It is not an "AI Manhattan Project", at least, as of now.
  • Elon Musk claims that they don't have the money and that SoftBank (stated to have "financial responsibility" in the announcement) has less than $10 billion secured. If true, while this doesn't mean they can't secure an order of magnitude more by tomorrow, this does directly clash with "deploying $100 billion immediately" statement.
    • But Sam Altman counters that Musk's statement is "wrong", as Musk "surely knows".
    • I... don't know which claim I distrust more. Hm, I'd say Altman feeling the need to "correct the narrative" here, instead of just ignoring Musk, seems like a sign of weakness? He doesn't seem like the type to naturally get into petty squabbles like this, otherwise.
    • (And why, yes, this is what an interaction between two Serious People building world-changing, existentially dangerous megaprojects looks like. Apparently.)
  • Some people try to counter Musk's claim by
... (read more)
6Robert Cousineau
Here is what I posted on "Quotes from the Stargate Press Conference": On Stargate as a whole: This is a restatement with a somewhat different org structure of the prior OpenAI/Microsoft data center investment/partnership, announced early last year (admittedly for $100b). Elon Musk states they do not have anywhere near the 500 billion pledged actually secured: I do take this as somewhat reasonable, given the partners involved just barely have $125 billion available to invest like this on a short timeline. Microsoft has around 78 billion cash on hand at a market cap of around 3.2 trillion. Softbank has 32 billion dollars cash on hand, with a total market cap of 87 billion. Oracle has around 12 billion cash on hand, with a market cap of around 500 billion. OpenAI has raised a total of 18 billion, at a valuation of 160 billion. Further, OpenAI and Microsoft seem to be distancing themselves somewhat - initially this was just an OpenAI/Microsoft project, and now it involves two others and Microsoft just put out a release saying "This new agreement also includes changes to the exclusivity on new capacity, moving to a model where Microsoft has a right of first refusal (ROFR)." Overall, I think that the new Stargate numbers published may (call it 40%) be true, but I also think there is a decent chance this is new administration Trump-esque propaganda/bluster (call it 45%), and little change from the prior expected path of datacenter investment (which I do believe is unintentional AINotKillEveryone-ism in the near future). Edit: Satya Nadella was just asked about how funding looks for Stargate, and said "Microsoft is good for investing 80b". This 80b number is the same number Microsoft has been saying repeatedly.
5Thane Ruthenis
I think it's definitely bluster; the question is how much of a done deal it is to turn this bluster into at least $100 billion. I don't think this changes the prior expected path of datacenter investment at all. It's precisely what the expected path was going to look like; the only change is how relatively high-profile/flashy this is being. (Like, if they invest $30-100 billion into the next generation of pretrained models in 2025-2026, and that generation fails, they ain't seeing the remaining $400 billion no matter what they're promising now. Next generation makes or breaks the investment into the subsequent one, just as was expected before this.) He does very explicitly say "I am going to spend 80 billion dollars building Azure". Which I think has nothing to do with Stargate. (I think the overall vibe he gives off is "I don't care about this Stargate thing, I'm going to scale my data centers and that's that". I don't put much stock into this vibe, but it does fit with OpenAI/Microsoft being on the outs.)

Here's something that confuses me about o1/o3. Why was the progress there so sluggish?

My current understanding is that they're just LLMs trained with RL to solve math/programming tasks correctly, hooked up to some theorem-verifier and/or an array of task-specific unit tests to provide ground-truth reward signals. There are no sophisticated architectural tweaks, no runtime MCTS or A* search, nothing clever.

Why was this not trained back in, like, 2022 or at least early 2023; tested on GPT-3/3.5 and then default-packaged into GPT-4 alongside RLHF? If OpenAI was too busy, why was this not done by any competitors, at decent scale? (I'm sure there are tons of research papers trying it at smaller scales.)

The idea is obvious; doubly obvious if you've already thought of RLHF; triply obvious after "let's think step-by-step" went viral. In fact, I'm pretty sure I've seen "what if RL on CoTs?" discussed countless times in 2022-2023 (sometimes in horrified whispers regarding what the AGI labs might be getting up to).

The mangled hidden CoT and the associated greater inference-time cost are superfluous. DeepSeek r1/QwQ/Gemini Flash Thinking have perfectly legible CoTs which would be fine to prese... (read more)

While I don't have specifics either, my impression of ML research is that it's a lot of work to get a novel idea working, even if the idea is simple. If you're trying to implement your own idea, you'll be banging your head against the wall for weeks or months wondering why your loss is worse than the baseline. If you try to replicate a promising-sounding paper, you'll bang your head against the wall as your loss is worse than the baseline. It's hard to tell if you made a subtle error in your implementation or if the idea simply doesn't work for reasons you don't understand because ML has little in the way of theoretical backing. Even when it works it won't be optimized, so you need engineers to improve the performance and make it stable when training at scale. If you want to ship a working product quickly then it's best to choose what's tried and true.

7leogao
simple ideas often require tremendous amounts of effort to make work.

Some more evidence that whatever the AI progress on benchmarks is measuring, it's likely not measuring what you think it's measuring:

AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination

AIME 2025 part I was conducted yesterday, and the scores of some language models are available here: 
https://matharena.ai thanks to @mbalunovic, @ni_jovanovic et al.

I have to say I was impressed, as I predicted the smaller distilled models would crash and burn, but they actually scored at a reasonable 25-50%.

That was surprising to me! Since these are new problems, not seen during training, right? I expected smaller models to barely score above 0%. It's really hard to believe that a 1.5B model can solve pre-math olympiad problems when it can't multiply 3-digit numbers. I was wrong, I guess.

I then used openai's Deep Research to see if similar problems to those in AIME 2025 exist on the internet. And guess what? An identical problem to Q1 of AIME 2025 exists on Quora:

https://quora.com/In-what-bases-b-does-b-7-divide-into-9b-7-without-any-remainder

I thought maybe it was just coincidence, and used Deep Research again on Problem 3. And guess what? A very similar question was on

... (read more)
4Noosphere89
To be fair, non-reasoning models do much worse on these questions, even when it's likely that the same training data has already been given to GPT-4. Now, I could believe that RL is more or less working as an elicitation method, which plausibly explains the results, though still it's interesting to see why they get much better scores, even with very similar training data.
5Thane Ruthenis
I do think RL works "as intended" to some extent, teaching models some actual reasoning skills, much like SSL works "as intended" to some extent, chiseling-in some generalizable knowledge. The question is to what extent it's one or the other.

Edit: I've played with the numbers a bit more, and on reflection, I'm inclined to partially unroll this update. o3 doesn't break the trendline as much as I'd thought, and in fact, it's basically on-trend if we remove the GPT-2 and GPT-3 data-points (which I consider particularly dubious).


Regarding METR's agency-horizon benchmark:

I still don't like anchoring stuff to calendar dates, and I think the o3/o4-mini datapoints perfectly show why.

It would be one thing if they did fit into the pattern. If, by some divine will controlling the course of our world's history, OpenAI's semi-arbitrary decision about when to allow METR's researchers to benchmark o3 just so happened to coincide with the 2x/7-month model. But it didn't: o3 massively overshot that model.[1]

Imagine a counterfactual in which METR's agency-horizon model existed back in December, and OpenAI invited them for safety testing/benchmarking then, four months sooner. How different would the inferred agency-horizon scaling laws have been, how much faster the extrapolated progress? Let's run it:

  • o1 was announced September 12th, o3 was announced December 19th, 98 days apart.
  • o1 scored at ~40 minutes, o3 at ~1.5 hours, a 2.25x'ing.
  • Th
... (read more)
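(Filling in the arithmetic from the numbers above, as a rough counterfactual based on announcement dates rather than the release dates METR actually uses.)

```python
import math

# The counterfactual extrapolation sketched above, using the comment's numbers:
# o1 at a ~40-minute horizon, o3 at a ~1.5-hour horizon, announced 98 days apart.
days_apart = 98
horizon_ratio = 90 / 40   # 1.5 hours vs 40 minutes => 2.25x

doubling_days = days_apart * math.log(2) / math.log(horizon_ratio)
print(f"implied doubling time: {doubling_days:.0f} days (~{doubling_days / 30:.1f} months)")
# ~84 days, i.e. under 3 months -- much faster than the ~7-month doubling
# implied by METR's release-date trend.
```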

Do you also dislike Moore's law?

I agree that anchoring stuff to release dates isn't perfect because the underlying variable of "how long does it take until a model is released" is variable, but I think this variability is sufficiently low that it doesn't cause that much of an issue in practice. The trend is only going to be very solid over multiple model releases and it won't reliably time things to within 6 months, but that seems fine to me.

I agree that if you add one outlier data point and then trend extrapolate between just the last two data points, you'll be in trouble, but fortunately, you can just not do this and instead use more than 2 data points.

This also means that I think people shouldn't update that much on the individual o3 data point in either direction. Let's see where things go for the next few model releases.

5Thane Ruthenis
That one seems to work more reliably, perhaps because it became the metric the industry aims for. My issue here is that there wasn't that much variance in the performance of all preceding models they benchmarked: from GPT-2 to Sonnet 3.7, they seem to almost perfectly fall on the straight line. Then, the very first advancement of the frontier after the trend-model is released is an outlier. That suggests an overfit model. I do agree that it might just be a coincidental outlier and that we should wait and see whether the pattern recovers with subsequent model releases. But this is suspicious enough I feel compelled to make my prediction now.
[-]Thomas KwaΩ7104

The dates used in our regression are the dates models were publicly released, not the dates we benchmarked them. If we use the latter dates, or the dates they were announced, I agree they would be more arbitrary.

Also, there is lots of noise in a time horizon measurement and it only displays any sort of pattern because we measured over many orders of magnitude and years. It's not very meaningful to extrapolate from just 2 data points; there are many reasons one datapoint could randomly change by a couple of months or factor of 2 in time horizon. 

  • Release schedules could be altered
  • A model could be overfit to our dataset
  • One model could play less well with our elicitation/scaffolding
  • One company could be barely at the frontier, and release a slightly-better model right before the leading company releases a much-better model.

All of these factors are averaged out if you look at more than 2 models. So I prefer to see each model as evidence of whether the trend is accelerating or slowing down over the last 1-2 years, rather than an individual model being very meaningful.

7Thane Ruthenis
Fair, also see my un-update edit. Have you considered removing GPT-2 and GPT-3 from your models, and seeing what happens? As I'd previously complained, I don't think they can be part of any underlying pattern (due to the distribution shift in the AI industry after ChatGPT/GPT-3.5). And indeed: removing them seems to produce a much cleaner trend with a ~130-day doubling.
3bhalstead
For what it's worth, the model showcased in December (then called o3) seems to be completely different from the model that METR benchmarked (now called o3).

On the topic of o1's recent release: wasn't Claude Sonnet 3.5 (the subscription version at least, maybe not the API version) already using hidden CoT? That's the impression I got from it, at least.

The responses don't seem to be produced in constant time. It sometimes literally displays a "thinking deeply" message which accompanies an unusually delayed response. Other times, the following pattern would play out:

  • I pose it some analysis problem, with a yes/no answer.
  • It instantly produces a generic response like "let's evaluate your arguments".
  • There's a 1-2 second delay.
  • Then it continues, producing a response that starts with "yes" or "no", then outlines the reasoning justifying that yes/no.

That last point is particularly suspicious. As we all know, the power of "let's think step by step" is that LLMs don't commit to their knee-jerk instinctive responses, instead properly thinking through the problem using additional inference compute. Claude Sonnet 3.5 is the previous out-of-the-box SoTA model, competently designed and fine-tuned. So it'd be strange if it were trained to sabotage its own CoTs by "writing down the bottom line first" like this, instead of being taught not to commit to a ... (read more)

6habryka
My model was that Claude Sonnet has tool access and sometimes does some tool-usage behind the scenes (which later gets revealed to the user), but that it wasn't having a whole CoT behind the scenes, but I might be wrong.
4quetzal_rainbow
I think you heard about this thread (I didn't try to replicate it myself).
2Thane Ruthenis
Thanks, that seems relevant! Relatedly, the system prompt indeed explicitly instructs it to use "<antThinking>" tags when creating artefacts. It'd make sense if it's also using these tags to hide parts of its CoT.

Here's a potential interpretation of the market's apparent strange reaction to DeepSeek-R1 (shorting Nvidia).

I don't fully endorse this explanation, and the shorting may or may not have actually been due to Trump's tariffs + insider trading, rather than DeepSeek-R1. But I see a world in which reacting this way to R1 arguably makes sense, and I don't think it's an altogether implausible world.

If I recall correctly, the amount of money globally spent on inference dwarfs the amount of money spent on training. Most economic entities are not AGI labs training n... (read more)

8Vladimir_Nesov
The bet that "makes sense" is that quality of Claude 3.6 Sonnet, GPT-4o and DeepSeek-V3 is the best that we're going to get in the next 2-3 years, and DeepSeek-V3 gets it much cheaper (less active parameters, smaller margins from open weights), also "suggesting" that quality is compute-insensitive in a large range, so there is no benefit from more compute per token. But if quality instead improves soon (including by training DeepSeek-V3 architecture on GPT-4o compute), and that improvement either makes it necessary to use more compute per token, or motivates using inference for more tokens even with models that have the same active parameter count (as in Jevons paradox), that argument doesn't work. Also, the ceiling of quality at the possible scaling slowdown point depends on efficiency of training (compute multiplier) applied to the largest training system that the AI economics will support (maybe 5-15 GW without almost-AGI), and improved efficiency of DeepSeek-V3 raises that ceiling.

It's been more than three months since o3 and still no o4, despite OpenAI researchers' promises.

Deep Learning has officially hit a wall. Schedule the funeral.

[/taunting_god]

9Vladimir_Nesov
If o3 is based on GPT-4o, there is a reasoning model based on GPT-4.5 that's better. If o3 is based on GPT-4.5 (and so the reason it's still not out is that they are waiting for Blackwells to inference it at a reasonable speed and cost), then it was a hasty job just after the base model for GPT-4.5 was pretrained, and so by now they have something much better. Either way, there is some sort of "o4", but it's probably a bad look to announce it before releasing the already-announced o3.
1[comment deleted]

Some thoughts on protecting against LLM steganography.

tl;dr: I suspect the Paraphraser idea is near-completely ineffectual as you scale the capabilities, both for scheming and for general CoT interpretability.

In the reasoning-as-communication framework, the amount of information a fixed-length message/CoT can transfer is upper-bounded by information-theoretic constraints. That is, if we have a ten-bit message, it cannot communicate more than ten bits of information: it cannot always uniquely specify one internal state of an AI model from more than ... (read more)

4Daniel Kokotajlo
I like to distinguish between "Subtle strategic bias" and "Steganography." Paraphrasers help (and maybe solve?) the latter, but do nothing about the former. Example of subtle strategic bias: The model knows it's misaligned and therefore deliberately sandbags on the alignment research you are asking it to do. "I'm sorry sir I just am not making much progress," it says. You look at the CoT, and it's just doing a bunch of sensible but fruitless lines of reasoning. Example of steganography: The model needs to insert a subtle security vulnerability into the code it is writing -- but it doesn't know one right away, it has to invent one. And that requires lots of sequential thought. So it needs to try out different ideas, vet them, critique them, refine them, etc. Can't do that in a single forward pass. So it needs to pass information forward to its future self somehow... without humans reading the CoT knowing that's what it's doing. THIS is the sort of thing that is much harder to do with a paraphraser than without a paraphraser.
4Thane Ruthenis
I don't see why it would be harder to do. Suppose you tasked a model with refactoring some database, and it secretly wants to sabotage its functionality. It outputs a CoT segment that goes, "I need to design unit tests for this module". Each word of this CoT is mapped to some internal representation, and the internal representation consists of "on-task" human-intended reasoning X, and misaligned scheming Y (corresponding to, e.g., a tactical decision to fail to include unit tests for some subtle issue). "I" is mapped to ⟨X_1, Y_1⟩, "need" is mapped to ⟨X_2, Y_2⟩, etc. Each X_i iterates on X_{i-1}, each Y_i iterates on Y_{i-1}; we have a sequence of serial computations. In other words: as the model outputs, "I need to design unit tests for this module", there's an unspoken subtext of "I need to sabotage the unit tests here", and the model packs nine words' worth of serial computations into both the "textual" X and the "subtextual" Y aspects of it. Now you use the Paraphraser on the string, generating alternatives like "the next step is to generate some unit tests" and "several unit tests for this module are needed here". The intent, if I understand you correctly, is that the semantic-equivalence-but-syntactic-difference would ensure that the i-th token of this phrase is mapped to X_i, as if the paraphrasing didn't happen, but that the mapping of i→Y_i would be broken. Except, why would that happen? What privileges X-representations over Y-representations? Why would i→X_i be preserved whereas i→Y_i is broken, instead of both of them remaining functional (and so preserving scheming), or both breaking (and so compromising the model's capabilities)? In other words: We're basically just teaching the model to fail to distinguish between equivalent-to-a-human rephrasings. Why would this lead to the task-related representations becoming indifferent to the specifics of a phrasing whereas the scheming-related representations remain sensitive to it? Why wouldn't this just map the subtextual
4Daniel Kokotajlo
I don't think I understand. Let's work through an example.  The AI is being told to write good code, and wants to write subtly-vulnerable code instead. If it just were to honestly try to write good code, it would take (let's say) 100 serial steps of reasoning to succeed. To write subtly-vulnerable code, let's say, requires 300 serial steps of reasoning. The model's strategy is to write subtly-vulnerable code but in a way that looks like it's trying to write good code and just getting a bit unlucky and taking longer than usual (let's suppose the humans have a poor ability to judge how long it's supposed to take, but still some ability, such that it can't take e.g. 10x longer without arousing suspicion). If there's no paraphraser, this is 'easy.' I mean it's still difficult but it's 'easy' in the sense that the AI can basically just use a code -- each nugget of reasoning might have 10 different semantically equivalent expressions, or maybe even 100, so it can 'add extra layers of meaning' to the nuggets -- in addition to their meaning in the surface level i'm-tring-to-write-good-code, they can have a hidden meaning having to do with the security vulnerability-related properties of the code. And it can make a chain of reasoning that is basically about the security stuff while looking like a reasonable but unlucky attempt to write good code. But a paraphraser wrecks that strategy. The model can still add extra layers of meaning to the nuggets, but now it has loads of trouble making the chain of thought look like an honest attempt to write good code. To use your example, if 'I should generate unit tests now' and all semantically equivalent variants of that idea are assigned the hidden meaning 'I should check to see if the vulnerability is too easy to spot' for example, then... what is it going to write next? It's too constrained; in order to look non-suspicious there is a limited set of options it can follow up with... ok, hmm, I'm at the limits of my understanding he
2Thane Ruthenis
Ah, I think I see my error. Consider a CoT of the form "[problem statement]: ...........", which always outputs dots regardless of the problem statement. Can that still encode useful computations, with the information being passed through intermediate self-attention? Yes, but: this only increases parallel-computing capabilities, not serial-computing capabilities. And in the framework I'd outlined, under the Paraphraser, for the purposes of scheme-y thinking, even a legible CoT would consist of such "filler" tokens. So yep, you're right, this does force all scheming to be taking place within one forward pass.