All of ryan_greenblatt's Comments + Replies

How much FLOP do you think it takes for large-scale RL to learn codes? Let's say RL learns 100x less than SL (seems plausible) and is only 10% as focused on learning new ways of thinking / languages as SL. Then, we'd expect that reasonably efficient RL with 3 OOMs more FLOP than big pretraining runs (which do learn this to some extent) could learn new languages. This would naively be a ~1e27 FLOP RL run (assuming that we can learn this much stuff in 1e24 FLOP pretraining runs).
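A minimal sketch of the arithmetic above (the 100x and 10% figures are the stated assumptions from this comment, not measurements):

```python
sl_flop_to_learn_codes = 1e24    # pretraining scale at which SL plausibly learns this (assumption above)
rl_efficiency_penalty = 100      # RL learns ~100x less per FLOP than SL (assumption)
rl_focus_penalty = 10            # RL spends only ~10% of its learning on new languages (assumption)

rl_flop_needed = sl_flop_to_learn_codes * rl_efficiency_penalty * rl_focus_penalty
print(f"RL FLOP needed: {rl_flop_needed:.0e}")   # ~1e+27, i.e. 3 OOMs above the pretraining run
```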

I think we'll probably see 1e27 FLOP RL runs next year?

Fabien Roger
I think it's much worse than that. First, I think RL is more like 10,000x than 100x less efficient than SL (deepseek v3 probably can't be compressed much below 10GB, while the deepseek r1-zero stage can probably be compressed to 1MB of transcripts, despite both being roughly 1e24 FLOP).

Additionally, learning new languages is likely a much harder task than regular math RL, because you have chicken-and-egg issues (you are not incentivized to encode a new structure before knowing how to decode it, and you are not incentivized to decode it before you encode it). We have some empirical evidence of this: people struggle to make even simple steganography appear in RL setups which incentivize it the most. Maybe simple encodings can bootstrap to more complex encodings and this bypasses some of the chicken-and-egg issues, but I'm not sure.

Chicken-and-egg problems also mean you benefit a lot from having many serial steps of RL when learning an encoding, and while RL can probably be scaled massively in parallel, the high latency of generations implies there probably won't be massive scale-ups of the number of serial steps of RL compared to what r1 already did. (This is a hand-wavy argument, it might be incorrect.)
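A rough sketch of the compression comparison above, using the comment's own 10GB / 1MB / 1e24 FLOP figures (estimates, not measurements):

```python
# Rough sample-efficiency comparison implied above.
sl_bytes_learned = 10e9   # ~10 GB: rough floor on compressing what DeepSeek-V3 learned via SL
rl_bytes_learned = 1e6    # ~1 MB: rough size of transcripts capturing the r1-zero RL stage
flop_each = 1e24          # both stages are taken to be roughly 1e24 FLOP

efficiency_gap = sl_bytes_learned / rl_bytes_learned
print(f"Implied SL vs. RL information learned per FLOP: ~{efficiency_gap:,.0f}x")  # ~10,000x
```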
gwern
You would also expect that the larger models will be more sample-efficient, including at in-context learning of variations of existing tasks (which of course is what steganography is). So all scale-ups go much further than any experiment at small-scale like 8B would indicate. (No idea what 'medium-scale' here might mean.)

Doesn't SL already learn a full language in which complex thoughts can be expressed? I agree it was brutal, it required >1e24 FLOP, but still.

Fabien Roger
Edited to say "small-scale SL and large-scale RL"

I think the RL case might be more analogous to the translation pair case than the "just the encoded text" case. How does that alter the bottom line?

Fabien Roger
I think it's only somewhat more analogous: if you slowly transition to a new language, you don't have incentives to have translations pairs with redundant content. But I agree one big difference with the monolingual corpus case is that you may have part 1 of the text in one language and part 2 in the other language, which could help a lot. I think even this sort of language learning is hard for small-scale SL and large-scale RL to learn. (I also think that there is a more frightening version of the "new incomprehensible language" hypothesis: the case where you learn a new language to express thoughts which are hard to express in English. This is not analogous to any translation experiment, and I expect it to be brutal for small-scale SL and large-scale RL to learn a full language in which complex thoughts can be expressed.)

However, this strongly limits the space of possible aggregated agents. Imagine two EUMs, Alice and Bob, whose utilities are each linear in how much cake they have. Suppose they’re trying to form a new EUM whose utility function is a weighted average of their utility functions. Then they’d only have three options:

  • Form an EUM which would give Alice all the cakes (because it weights Alice’s utility higher than Bob’s)
  • Form an EUM which would give Bob all the cakes (because it weights Bob’s utility higher than Alice’s)
  • Form an EUM which is totally indifferent about the cake allocation between them (and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for)
... (read more)
Richard_Ngo
I was a bit lazy in how I phrased this. I agree with all your points; the thing I'm trying to get at is that this approach falls apart quickly if we make the bargaining even slightly less idealized. E.g. your suggestion "Form an EUM which is totally indifferent about the cake allocation between them and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for":

1. Strongly incentivizes deception (including self-deception) during bargaining (e.g. each agent wants to overstate the difficulty of providing cake for it).
2. Strongly incentivizes defection from the deal once one of the agents realizes that they'll get no cake going forward.
3. Is non-robust to multi-agent dynamics (e.g. what if one of Alice's allies later decides "actually I'm going to sell pies to the Alice+Bob coalition more cheaply if Alice gets to eat them"? Does that then divert Bob's resources towards buying cakes for Alice?)

EUM treats these as messy details. Coalitional agency treats them as hints that EUM is missing something.

EDIT: another thing I glossed over is that IIUC Harsanyi's theorem says the aggregation of EUMs should have a weighted average of utilities, NOT a probability distribution over weighted averages of utilities. So even flipping a coin isn't technically kosher. This may seem nitpicky but I think it's yet another illustration of the underlying non-robustness of EUM.

I'd guess that the benchmarks which METR uses have enough label noise and other issues (e.g. specification ambiguity) that measuring >=95% reliability isn't meaningful. 80% probably is meaningful.

When analyzing the high reliability regime, I think you'd want to get a ceiling on performance by baselining with human domain experts who are instructed to focus on reliability. (E.g. something like: "continue testing etc until you're very confident the task is completed successfully".) And then you'd compare this human ceiling to AI performance (for AIs whic... (read more)

LDJ
Thanks for the comment.

In terms of the label noise, I feel like that's decently accounted for already, as the calculations I'm using are actually not from specific model scores; rather, these points are sampled from me taking the intersecting points of a smoothed curve with the accuracy level for each of the models, and then I derive the factor that exists between that point and the 50% accuracy time horizon, as you can see in the image at this link for example: https://prnt.sc/odMcvz0isuRU

Additionally, after I noted those factors for 50% to 80%, 90%, 95% and 99% individually for all 6 of those models, I averaged them across the models, which results in the final singular set of averaged factors, which I ended up using for formulating the 80%, 95% and 99% trend lines in the final chart I made. I think there is still some higher margin of error worth noting for those higher accuracy values perhaps, but unless I'm missing something, I feel as though this methodology is relatively robust to label noise.

"I think you'd want to get a ceiling on performance by baselining with human domain experts who are instructed to focus on reliability."

"So, I expect that looking at 95% will underestimate AI progress."

Yes, I very much agree. I don't think it's fair for people to assume from my chart "Well I guess it won't be human level until it's doing these tasks at 99% reliability" and I hope people don't end up with that take-away. And I don't intend to imply that 99% accuracy means "1% away from human level" either, as the human's accuracy could be less than 99% or even less than 90% or 80% depending on the task.

I don't have the resources at hand to implement the kind of human testing you described, but I think it's worth giving that feedback to the folks at METR, which could result in more useful data for people like me to make visualizations out of again :)
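A minimal sketch of the factor-averaging procedure described above; the horizons below are placeholder numbers (not METR data) and the names are mine:

```python
import numpy as np

# Hypothetical per-model time horizons (minutes) read off a smoothed
# success-rate curve at each reliability level; placeholder numbers only.
horizons = {
    "model_a": {0.50: 60.0, 0.80: 15.0, 0.95: 4.0, 0.99: 1.0},
    "model_b": {0.50: 30.0, 0.80: 8.0, 0.95: 2.0, 0.99: 0.5},
}

levels = [0.80, 0.95, 0.99]

# Factor = (horizon at 50%) / (horizon at the stricter level), per model, then averaged across models.
avg_factors = {
    lvl: np.mean([m[0.50] / m[lvl] for m in horizons.values()])
    for lvl in levels
}

# Apply the averaged factors to a 50%-reliability trend point to get estimated
# trend points at the stricter reliability levels.
trend_50 = 45.0  # e.g. a frontier model's 50%-horizon in minutes (placeholder)
for lvl, f in avg_factors.items():
    print(f"estimated {lvl:.0%} horizon ~ {trend_50 / f:.1f} min (factor ~{f:.1f}x)")
```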

It looks like the images aren't showing up on LW.

LDJ
Thanks, should be fixed now.

Number cortical neurons != brain size. Orcas have ~2x the number of cortical neurons, but much larger brains. Assuming brain weight is proportional to volume, with human brains being typically 1.2-1.4kg, and orca brains being typically 5.4-6.8kg, orca brains are actually like 6.1/1.3=4.7 times larger than human brains.

I think cortical neuron count is a better proxy than brain size, and I expect that the relation between cortical neurons and brain size differs substantially between species. (I expect more similarity within a species.)

My guess is that neuron

... (read more)
Towards_Keeperhood
Yeah I think I came to agree with you. I'm still a bit confused though because intuitively I'd guess chimps are dumber than -4.4SD (in the interpretation for "-4.4SD" I described in my other new comment).

Some quick (and relatively minor) notes:

  • I expect that full stack intelligence explosion could look more like "make the whole economy bigger using a bunch of AI labor" rather than specifically automating the chip production process. (That said, in practice I expect explicit focused automation of chip production to be an important part of the picture, probably the majority of the acceleration effect.) Minimally, you need to scale up energy at some point.
    • Focusing on the whole economy is closer to the perspective (I think) of some people from Epoch like Ta
... (read more)

Yes, I was intending my comment to refer to just code at Anthropic. (Otherwise I would talk much more about serious integration lags and lack of compute.)

Answer by ryan_greenblatt

I'm very skeptical orcas can be trained to be smarter than humans. I explain why in this comment.

This makes sense as a crux for the claim "we need philosophical competence to align unboundedly intelligent superintelligences." But, it doesn't make sense for the claim "we need philosophical competence to align general, openended intelligence."

I was thinking of a slightly broader claim: "we need extreme philosophical competence". If I thought we had to use human labor to align wildly superhuman AIs, I would put much more weight on "extreme philosophical competence is needed". I agree that "we need philosophical competence to align any general, openend... (read more)

Raemon
Yeah I agree that was happening somewhat. The connecting dots here are "in worlds where it turns out we need a long Philosophical Pause, I think you and Buck would probably be above some threshold where you notice and navigate it reasonably."

I think my actual belief is "the Motte is high likelihood true, the Bailey is... medium-ish likelihood true, but, like, it's a distribution, there's not a clear dividing line between them."

I also think the pause can be "well, we're running untrusted AGIs and ~trusted pseudogeneral LLM-agents that help with the philosophical progress, but we can't run them that long or fast; they help speed things up and make what'd normally be a 10-30 year pause into a 3-10 year pause, but also the world would be going crazy left to its own devices, and the sort of global institutional changes necessary are still similarly outside-of-Overton-window as a 20 year global moratorium, and the 'race with China' rhetoric is still bad."

Orcas have about 43 billion cortical neurons - humans have about 21 billion. The orca cortex has 6 times the area of the human cortex, though the neuron density is about 3 times lower.

[...]

My uncertain guess is that, within mammalian brains, scaling matters a lot more for individual intelligence,

This post seems to assume that a 2x increase in brain size is a huge difference (claiming this could plausibly yield +6SD), but a naive botec doesn't support this.

For humans, brain size and IQ are correlated at ~0.3. Brain size has a standard deviation of roughl... (read more)
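A minimal version of the naive botec gestured at here. The brain-size standard deviation is cut off in the comment, so the ~10% figure below is a placeholder assumption, not the original number:

```python
# Naive regression-style botec: expected IQ shift from a brain-size difference,
# treating the human within-species relationship as if it extrapolated.
iq_brain_size_corr = 0.3     # brain size / IQ correlation in humans (from the comment)
brain_size_sd_frac = 0.10    # assumed SD of human brain size as a fraction of the mean (placeholder)

size_increase_frac = 1.0     # a 2x brain: +100% relative to the human mean
size_increase_sds = size_increase_frac / brain_size_sd_frac   # ~10 SDs of brain size

expected_iq_gain_sds = iq_brain_size_corr * size_increase_sds  # ~3 SDs under these assumptions
print(f"Naive extrapolated gain: ~{expected_iq_gain_sds:.1f} SD")
```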

Towards_Keeperhood
Thanks for describing a wonderfully concrete model. I like the way you reason (especially the squiggle), but I don't think it works quite that well for this case. But let's first assume it does:

Your estimates on algorithmic efficiency deficits of orca brains seem roughly reasonable to me. (EDIT: I'd actually be at more like -3.5std mean with standard deviation of 2std, but idk.)

Number cortical neurons != brain size. Orcas have ~2x the number of cortical neurons, but much larger brains. Assuming brain weight is proportional to volume, with human brains being typically 1.2-1.4kg, and orca brains being typically 5.4-6.8kg, orca brains are actually like 6.1/1.3=4.7 times larger than human brains. Taking the 5.4-6.8kg range, this would be a 4.15-5.23 range of how much larger orca brains are. Plugging that in for `orca_brain_size_difference` yields 45% on >=2std, 38% on >=4std (where your values ), and 19.4% on >=6std. Updating down by 5x because orcas don't seem that smart doesn't seem like quite the right method to adjust the estimate, but perhaps fine enough for the upper end estimates, which would leave 3.9% on >=6std.

Maybe you meant "brain size" as only an approximation to "number of cortical neurons", which you think are the relevant part. My guess is that neuron density is actually somewhat anti-correlated with brain size, and that number of cortical neurons would be correlated with IQ at more like ~0.4-0.55 in humans, though I haven't checked whether there's data on this. And of course using that you get lower estimates for orca intelligence than in my calculation above. (And while I'd admit that number of neurons is a particularly important point of estimation, there might also be other advantages of having a bigger brain, like more glia cells. Though maybe higher neuron density also means higher firing rates and thereby more computation. I guess if you want to try it that way, going by number of neurons is fine.)

My main point is however, that brain size (or corti

The "extreme philosophical competence" hypothesis is that you need such competence to achieve "seriously aligned" in this sense. It sounds like you disagree, but I don't know why since your reasoning just sidesteps the problem.

Yes, my reasoning is definitely part, but not all of the argument. Like the thing I said is a sufficient crux for me. (If I thought we had to directly use human labor to align AIs which were qualitatively wildly superhuman in general I would put much more weight on "extreme philosophical competence".)

Looking over the comments of

... (read more)
Raemon
Thanks for laying this out thus far. I'mma reply but understand if you wanna leave the convo here. I would be interested in more effortpost/dialogue about your thoughts here.

This makes sense as a crux for the claim "we need philosophical competence to align unboundedly intelligent superintelligences." But, it doesn't make sense for the claim "we need philosophical competence to align general, openended intelligence." I suppose my OP didn't really distinguish these claims and there were a few interpretations of how the arguments fit together. I was more saying the second (although to be fair I'm not sure I was actually distinguishing them well in my head until now).

It doesn't make sense for "we just need to be able to hand off to an AI which is seriously aligned" to be a crux for the second. A thing can't be a crux for itself.

I notice my "other-guy-feels-like-they're-missing-the-point" -> "check if I'm not listening well, or if something is structurally wrong with the convo" alarm is firing, so maybe I do want to ask for one last clarification on "did you feel like you understood this the first time? Does it feel like I'm missing the point of what you said? Do you think you understand why it feels to me like you were missing the point (even if you think it's because I'm being dense about something)?"

Takes on your proposal

Meanwhile, here's some takes based on my current understanding of your proposal. These bits: ...is a bit I think is philosophical-competence bottlenecked. And this bit: ...is a mix of "philosophically bottlenecked" and "rationality bottlenecked." (i.e. you both have to be capable of reasoning about whether you've found things that really worked, and, because there are a lot of degrees of freedom, capable of noticing if you're deploying that reasoning accurately)

I might buy that you and Buck are competent enough here to think clearly about it (not sure. I think you benefit from having a number of people around who seem likely to help)

Yep, just the obvious. (I'd say "much less bought in" than "isn't bought in", but whatever.)

I don't really have dots I'm trying to connect here, but this feels more central to me than what you discuss. Like, I think "alignment might be really, really hard" (which you focus on) is less of the crux than "is misalignment that likely to be a serious problem at all?" in explaining. Another way to put this is that I think "is misalignment the biggest problem" is maybe more of the crux than "is misalignment going to be really, really hard to resolve in some worlds". I see why you went straight to your belief though.

IMO actively torch the "long pause" worlds

Not sure how interesting this is to discuss, but I don't think I agree with this. Stuff they're doing does seem harmful to worlds where you need a long pause, but feels like at the very least Anthropic is a small fraction of the torching right? Like if you think Anthropic is making this less likely, surely they are a small fraction of people pushing in this direction such that they aren't making this that much worse (and can probably still pivot later given what they've said so far).

I don't expect 90% of code in 6 months and more confidently don't expect "almost all" in 12 months for a reasonable interpretation of almost all. However, I think this prediction is also weaker than it might seem, see my comment here.

Thane Ruthenis
Yup, agreed. The update to my timelines this would cause isn't a direct "AI is advancing faster than I expected", but an indirect "Dario makes a statement about AI progress that seems overly ambitious and clearly wrong to me, but is then proven right, which suggests he may have a better idea of what's going on than me in other places as well, and my skepticism regarding his other overambitious-seeming statements is now more likely to be incorrect".

Dario Amodei says AI will be writing 90% of the code in 6 months and almost all the code in 12 months.

I think it's somewhat unclear how big of a deal this is. In particular, situations where AIs write 90% of lines of code, but are very far (in time, effective compute, and qualitative capabilities) from being able to automate research engineer jobs seem very plausible to me. Perhaps Dario means something a bit stronger than "90% of lines of code".

It's pretty easy to get to 25% of lines of code written by LLMs with very weak models, e.g., Google claims to... (read more)

niplav
My best guess is that the intended reading is "90% of the code at Anthropic", not in the world at large—if I remember the context correctly that felt like the option that made the most sense. (I was confused about this at first, and the original context on this is not clear whether the claim is about the world at large or about Anthropic specifically.)

My guess is that the parts of the core leadership of Anthropic which are thinking actively about misalignment risks (in particular, Dario and Jared) think that misalignment risk is like ~5x smaller than I think it is while also thinking that risks from totalitarian regimes are like 2x worse than I think they are. I think the typical views of opinionated employees on the alignment science team are closer to my views than to the views of leadership. I think this explains a lot about how Anthropic operates.

I share similar concerns that Anthropic doesn't seem ... (read more)

Raemon
I think you kinda convinced me here that this reasoning isn't (as stated) very persuasive. I think my reasoning had some additional steps like:

* when I'm 15% on 'alignment might be philosophically hard', I still expect to maybe learn more and update to 90%+, and it seems better to pursue strategies that don't actively throw that world under the bus. (and, while I don't fully understand the Realpolitik, it seems to me that Anthropic could totally be pursuing strategies that achieve a lot of its goals without Policy Comms that IMO actively torch the "long pause" worlds)
* you are probably right that I was more oriented around "getting to like 5% risk" than reducing risk on the margin.
* I'm probably partly just not really visualizing what it'd be like to be a 15%-er and bringing some bias in.
Raemon
The "extreme philosophical competence" hypothesis is that you need such competence to achieve "seriously aligned" in this sense. It sounds like you disagree, but I don't know why since your reasoning just sidesteps the problem. Looking over the comments of the first joshc post, it seems like that's also basically asserted it wasn't necessary by fiat. And, the people who actively believe in "alignment is philosophically loaded" showed up to complain that this ignored the heart of the problem. My current summary of the arguments (which I put ~60% on, and I think Eliezer/Oli/Wentworth treat much more confidently and maybe believe a stronger version of) are something like: 1. Anything general enough to really tackle openended, difficult-to-evaluate plans, will basically need to operate in a goal directed way in order to do that. (i.e. What's Up With Confusingly Pervasive Goal Directedness?) 2. The goal-directedness means it's very likely to be self/situationally aware, and the requisite intelligence to solve these sorts of problems means even if it's not full blown anti-aligned, it's at least probably going to want to try to build more option value for itself. 3. The fact that you can't evaluate the results means it has a lot of room to give you answers that help preserve it's goals and bootstrap (at least on the margin), even if it's not massively smart enough to one-shot escape. And you can't solve that problem with Control (i.e. The Case Against AI Control Research). 4. You can maybe have interpretability tools that check for schemingness (if it's the first generation of generally capable agent and isn't too smart yet, maybe you've done a good job preserving Chain of Thought as a reasonably faithful representation, for now). But, you'll then just see "yep, the agent is unaligned", and not actually be able to fix it.  I think my current model of you (Ryan) is like: "Training models to do specific things, cleverly, actually just makes it pretty hard for them t
Raemon
Thanks. I'll probably reply to different parts in different threads. For the first bit: the rough numbers you give are helpful.

I'm not 100% sure I see the dots you're intending to connect with "leadership thinks 1/5-ryan-misalignment and 2x-ryan-totalitarianism" / "rest of alignment science team closer to ryan" -> "this explains a lot."

Is this just the obvious "whelp, leadership isn't bought into this risk model and calls most of the shots, but in conversations with several employees that engage more with misalignment?". Or was there a more specific dynamic you thought it explained?

I think I disagree some with this change. Now I'd say something like "We think the control line-of-defense should mostly focus on the time before we have enough evidence to relatively clearly demonstrate the AI is consistently acting egregiously badly. However, the regime where we deploy models despite having pretty strong empirical evidence that that model is scheming (from the perspective of people like us), is not out of scope."

You could have the view that open weights AGI is too costly on takeover risk and escape is bad, but we'll hopefully have some pre-AGI AIs which do strange misaligned behaviors that don't really get them much/any influence/power. If this is the view, then it really feels to me like preventing escape/rogue internal deployment is pretty useful.

Related question: are you in favor of making AGI open weights? By AGI, I mean AIs which effectively operate autonomously and can autonomously acquire money/power. This includes AIs capable enough to automate whole fields of R&D (but not much more capable than this). I think the case for this being useful on your views feels much stronger than the case for control preventing warning shots. After all, you seemingly mostly thought control was bad due to the chance it would prevent escape or incidents of strange (and not that strategic) behavior. Nai... (read more)

I do understand this line of reasoning, but yes, my intuition differs. For some sort of a weird case study, consider Sydney. [...] My guess is that none of that would have happened with properly implemented control measures.

Sure, or with properly implemented ~anything related to controlling the AIs behavior. I don't really expect incidents like Sydney in the future, nor do I think Sydney was that helpful in motivating a societal response? So, this doesn't feel like a meaningful representative example.

I would say some assumptions go into who the ‘we’

... (read more)
Jan_Kulveit
According to this report, Sydney relatives are alive and well as of last week.
Jan_Kulveit
Meaningful representative example in what class: I think it's representative in 'weird stuff may happen', not in 'we will get more teenage-intern-trapped-in-a-machine characters'.

Which is the problem - my default expectation is that the 'we' in "the AI company" does not take strong action (for specificity, like, shutting down). Do you expect any of the labs to shut down if they catch their new model 'rogue deploy' or sabotage part of their processes?

In contrast I do expect a basically smooth spectrum of incidents and accidents. And I expect control shapes the distribution away from small and moderately large incidents toward x-risk (that's the main point).

Can you express what you believe in this frame? My paraphrase is: you think it decreases the risk approximately uniformly across scales, and you expect some discontinuity between 'kills zero people' and 'kills some people', where the 'and also kills everyone' is very close to 'kills some people'.

I deeply distrust the analytical approach of trying to enumerate failure modes and reason from that. Because I don't think it will be easy to evaluate "leading people astray in costly ways".
ryan_greenblatt

I think something like this is a live concern, though I'm skeptical that control is net negative for this reason.

My baseline guess is that trying to detect AIs doing problematic actions makes it more likely that we get evidence for misalignment that triggers a useful response from various groups. I think it would be a priori somewhat surprising if a better strategy for getting enough evidence for risk to trigger substantial action is to avoid looking for AIs taking problematic actions, so that it isn't mitigated as effectively, so that AIs succeed in large... (read more)

Jan_Kulveit

I think something like this is a live concern, though I'm skeptical that control is net negative for this reason.

My baseline guess is that trying to detect AIs doing problematic actions makes it more likely that we get evidence for misalignment that triggers a useful response from various groups. I think it would be a priori somewhat surprising if a better strategy for getting enough evidence for risk to trigger substantial action is to avoid looking for AIs taking problematic actions, so that it isn't mitigated as effectively, so that AIs succeed in large

... (read more)

After thinking more about it, I think "we haven't seen evidence of scheming once the octopi were very smart" is a bigger update than I was imagining, especially in the case where the octopi weren't communicating with octopese. So, I'm now at ~20% without octopese and about 50% with it.

Yes, I think frontier AI companies are responsible for most of the algorithmic progress. I think it's unclear how much the leading actor benefits from progress done at other slightly-behind AI companies, and this could make progress substantially slower. (However, it's possible the leading AI company would be able to acquire the GPUs from these other companies.)

My current best guess median is that we'll see 6 OOMs of effective compute in the first year after full automation of AI R&D if this occurs in ~2029 using a 1e29 training run and compute is scaled up by a factor of 3.5x[1] over the course of this year[2]. This is around 5 years of progress at the current rate[3].
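A small decomposition of the quoted numbers (my own arithmetic on the figures above, not from the original footnotes):

```python
import math

total_ooms = 6.0       # effective-compute OOMs in the first year after full automation (from the comment)
compute_scaleup = 3.5  # physical compute scale-up over that year (from the comment)

compute_ooms = math.log10(compute_scaleup)   # ~0.54 OOMs from hardware scale-up
software_ooms = total_ooms - compute_ooms    # ~5.46 OOMs from algorithmic/software progress

years_of_progress = 5                                  # per the comment
implied_current_rate = total_ooms / years_of_progress  # ~1.2 OOMs/year at the current rate

print(f"compute: {compute_ooms:.2f} OOMs, software: {software_ooms:.2f} OOMs, "
      f"implied current rate ~ {implied_current_rate:.1f} OOMs/year")
```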

How big of a deal is 6 OOMs? I think it's a pretty big deal; I have a draft post discussing how much an OOM gets you (on top of full automation of AI R&D) that I should put out somewhat soon.

Further, my distribution over this is radically u... (read more)

Stephen McAleese
Thanks for these thoughtful predictions. Do you think there's anything we can do today to prepare for accelerated or automated AI research?
Hjalmar_Wijk
Maybe distracting technicality: This seems to make the simplifying assumption that the R&D automation is applied to a large fraction of all the compute that was previously driving algorithmic progress, right?

If we imagine that a company only owns 10% of the compute being used to drive algorithmic progress pre-automation (and is only responsible for say 30% of its own algorithmic progress, with the rest coming from other labs/academia/open-source), and this company is the only one automating their AI R&D, then the effect on overall progress might be reduced (the 15X multiplier only applies to 30% of the relevant algorithmic progress).

In practice I would guess that either the leading actor has enough of a lead that they are already responsible for most of their algorithmic progress, or other groups are close behind and will thus automate their own AI R&D around the same time anyway. But I could imagine this slowing down the impact of initial AI R&D automation a little bit (and it might make a big difference for questions like "how much would it accelerate a non-frontier lab that stole the model weights and tried to do RSI").
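A quick illustration of the dilution effect described above, using the comment's hypothetical numbers (an Amdahl's-law-style calculation):

```python
# Speedup of overall algorithmic progress when only a fraction of it is accelerated.
accelerated_fraction = 0.30   # share of relevant algorithmic progress the lab produces itself (hypothetical)
multiplier = 15.0             # AI R&D automation multiplier applied to that share (hypothetical)

overall_speedup = 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / multiplier)
print(f"Overall speedup ~ {overall_speedup:.2f}x")   # ~1.39x rather than 15x
```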

I'm not making a strong claim this makes sense and I think people should mostly think about the AI case directly. I think it's just another intuition pump and we can potentially be more concrete in the octopus case as we know the algorithm. (While in the AI case, we haven't seen an ML algorithm that scales to human level.)

I should note that I'm quite uncertain here and I can easily imagine my views swinging by large amounts.

Amusingly, this is actually not exactly what we were discussing and is mostly off-topic for our discussion.

ryan_greenblatt

Recently, @Daniel Kokotajlo and I were talking about the probability that AIs trained using "business as usual RLHF" end up being basically aligned rather than conspiring against us and our tests.[1] One intuition pump we ended up discussing is the prospects of octopus misalignment. Overall, my view is that directly considering the case with AIs (and what various plausible scenarios would look like) is more informative than analogies like this, but analogies like this are still somewhat useful to consider.

So, what do I mean by octopus misalignment? Suppose... (read more)

nielsrolf
My intuition is that these octopuses would be pretty unlikely to be scheming against us, for the following reasons:

* Dogs are pretty nice, and it's apparently easy to domesticate foxes with very few generations. This suggests that "be nice to humans" is simple to learn for animals, and in particular simpler than "be as intelligent as human AI researchers". So assuming the octopuses would first learn to like humans, it's unclear to me why they would later start to dislike humans. Maybe your intuition is that they would later realize that they are being treated unfairly, which causes frustration that is not present in earlier generations?
* If the octopuses speak octopese and English, I think it wouldn't be too hard to get some octopi to translate for us. For example, we could ask one octopus to communicate some information to a second octopus using octopese. We can check if that worked by asking the second octopus to translate back to English. Now, we have a pair of (octopese text, English text) for which we know the translation. Now we ask a third octopus to translate octopese to English, and check the performance on some known examples. It would be possible to scheme against this test if the octopi are amazingly skilled at coordination, but I suspect human AI researchers would fail this test if a foreign secret service used a similar setup to get AI researchers to translate between foreignese and English, so I think we should assume that such coordination is hard and a small individual failure rate would likely be enough for the group to fail (we need just one octopus that translates faithfully). (Note that this argument hinges on the existence of a language that is understood by different octopi, so it doesn't translate 1:1 to AIs.)
jbash
If the plural weren't "octopuses", it would be "octopodes". Not everything is Latin.
ChristianKl
And the related question would be: Even if they are not "actively scheming" what are the chances that most of the power to make decisions about the real world gets delegated to them, organizations that don't delegate power to octopuses get outcompeted, and they start to value octopuses more than humans over time?
reallyeli
What was the purpose of using octopuses in this metaphor? Like, it seems you've piled on so many disanalogies to actual octopuses (extremely smart, many generations per year, they use Slack...) that you may as well just have said "AIs." EDIT: Is it gradient descent vs. evolution?
Daniel Kokotajlo
Yep, I feel more like 90% here. (Lower numbers if the octopi don't have octopese.) I'm curious for other people's views.
ryan_greenblatt

I'm not sure about the details of the concrete proposal, but I agree with the spirit of the proposal.

(In particular, I don't know if I think having the "do you consent" text in this way is a good way to do this given limited will. I also think you want to have a very specific signal of asking for consent that you commit to filtering out except when it is actually being used. This is so the AI isn't worried it is in red teaming etc.)

Daniel Kokotajlo
I endorse that suggestion for changing the details.

I certainly agree it isn't clear, just my current best guess.

On some axes, but won't there also be axes where AIs are more difficult than humans? Sycophancy & slop being the most salient. Misalignment issues being another.

Yes, I just meant on net. (Relative to the current ML community and given a similar fraction of resources to spend on AI compute.)

Jeremy Gillen
It's not entirely clear to me that the math works out for AIs being helpful on net relative to humans just doing it, because of the supervision required, and the trust and misalignment issues. But on this question (for AIs that are just capable of "prosaic and relatively unenlightened ML research") it feels like shot-in-the-dark guesses. It's very unclear to me what is and isn't possible.

Oh, yeah I meant "perform well according to your metrics" not "behave well" (edited)

I don't think "what is the necessary work for solving alignment" is a frame I really buy. My perspective on alignment is more like:

  • Avoiding egregious misalignment (where AIs intentionally act in ways that make our tests highly misleading or do pretty obviously unintended/dangerous actions) reduces risk once AIs are otherwise dangerous.
  • Additionally, we will likely need to hand over making most near term decisions and most near term labor to some AI systems at some point. This going well very likely requires being able to avoid egregious misalignment (
... (read more)
Jeremy Gillen
Thanks, I appreciate the draft. I see why it's not plausible to get started on now, since much of it depends on having AGIs or proto-AGIs to play with. I guess I shouldn't respond too much in public until you've published the doc, but:

* If I'm interpreting correctly, a number of the things you intend to try involve having a misaligned (but controlled) proto-AGI run experiments involving training (or otherwise messing with in some way) an AGI. I hope you have some empathy for the internal screaming I have toward this category of things.
* A bunch of the ideas do seem reasonable to want to try (given that you had AGIs to play with, and were very confident that doing so wouldn't allow them to escape or otherwise gain influence). I am sympathetic to the various ideas that involve gaining understanding of how to influence goals better by training in various ways.
* There are chunks of these ideas that definitely aren't "prosaic and relatively unenlightened ML research", and involve very-high-trust security stuff or non-trivial epistemic work.
* I'd be a little more sympathetic to these kinda desperate last-minute things if I had no hope in literally just understanding how to build task-AGI properly, in a well understood way. We can do this now. I'm baffled that almost all of the EA-alignment-sphere has given up on even trying to do this. From talking to people this weekend, this shift seems downstream of thinking that we can make AGIs do alignment work, without thinking this through in detail.

Agree it's unclear. I think the chance of most of the ideas being helpful depends on some variables that we don't clearly know yet. I think 90% risk improvement can't be right, because there's a lot of correlation between each of the things working or failing. And a lot of the risk comes from imperfect execution of the control scheme, which adds on top.

One underlying intuition that I want to express: The world where we are making proto-AGIs run all these experiments is pure ch

who realize AGI is only a few years away

I dislike the implied consensus / truth. (I would have said "think" instead of "realize".)

For people (such as myself) who think/realize timelines are likely short, I find it more truth-tracking to use terminology that actually represents my epistemic state (that timelines are likely short) rather than hedging all the time and making it seem like I'm really uncertain.

Under my own lights, I'd be giving bad advice if I were hedging about timelines when giving advice (because the advice wouldn't be tracking the world as it is, it would be tracking a probability distribution I disagree with and thus a probability distribution that leads... (read more)

A typical crux is that I think we can increase our chances of "real alignment" using prosaic and relatively unenlightened ML research without any deep understanding.

I both think:

  1. We can significantly accelerate prosaic ML safety research (e.g., of the sort people are doing today) using AIs that are importantly limited.
  2. Prosaic ML safety research can be very helpful for increasing the chance of "real alignment" for AIs that we hand off to. (At least when this research is well executed and has access to powerful AIs to experiment on.)

This top level post is part of Josh's argument for (2).

Jeremy Gillen
Yep this is the third crux I think. Perhaps the most important.

To me it looks like you're making a wild guess that "prosaic and relatively unenlightened ML research" is a very large fraction of the necessary work for solving alignment, without any justification that I know of? For all the pathways to solving alignment that I am aware of, this is clearly false.

I think if you know of a pathway that just involves mostly "prosaic and relatively unenlightened ML research", you should write out this plan, why you expect it to work, and then ask OpenPhil to throw a billion dollars toward every available ML-research-capable human to do this work right now. Surely it'd be better to get started already?

FWIW, I don't think "data-efficient long-horizon RL" (which is sample efficient in a online training sense) implies you can make faithful simulations.

but you do get a machine that pursues the metric that you fine-tuned it to pursue, even out of distribution (with a relatively small amount of data).

I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it perform well according to your metrics in the average case using online training.

Jeremy Gillen
I think we kind of agree here. The cruxes remain: I think that the metric for "behave well" won't be good enough for "real" large research acceleration. And "average case" means very little when it allows room for deliberate-or-not mistakes sometimes when they can be plausibly got-away-with. [Edit: Or sabotage, escape, etc.] Also, you need hardcore knowledge restrictions in order for the AI not to be able to tell the difference between I'm-doing-original-research vs humans-know-how-to-evaluate-this-work. Such restrictions are plausibly crippling for many kinds of research assistance. I think there exists an extremely strong/unrealistic version of believing in "data-efficient long-horizon RL" that does allow this. I'm aware you don't believe this version of the statement, I was just using it to illustrate one end of a spectrum. Do you think the spectrum I was illustrating doesn't make sense?

Something important is that "significantly accelerate alignment research" isn't the same as "making AIs that we're happy to fully defer to". This post is talking about conditions needed for deference and how we might achieve them.

(Some) acceleration doesn't require being fully competitive with humans while deference does.

I think AIs that can autonomously do moderate duration ML tasks (e.g., 1 week tasks), but don't really have any interesting ideas could plausibly speed up safety work by 5-10x if they were cheap and fast enough.

Jeremy Gillen
Agreed. The invention of calculators was useful for research, and the invention of more tools will also be helpful. Maybe some kinds of "safety work", but real alignment involves a human obtaining a deep understanding of intelligence and agency. The path to this understanding probably isn't made of >90% moderate duration ML tasks. (You need >90% to get 5-10x because of communication costs, it's often necessary to understand details of experiment implementation to get insight from them. And costs from the AI making mistakes and not quite doing the experiments right).

To be clear, I think there are important additional considerations related to the fact that we don't just care about capabilities that aren't covered in that section, though that section is not that far from what I would say if you renamed it to "behavioral tests", including both capabilities and alignment (that is, alignment other than stuff that messes with behavioral tests).

joshc
Yeah that's fair. Currently I merge "behavioral tests" into the alignment argument, but that's a bit clunky and I prob should have just made the carving:

1. looks good in behavioral tests
2. is still going to generalize to the deferred task

But my guess is we agree on the object level here and there's a terminology mismatch. obv the models have to actually behave in a manner that is at least as safe as human experts in addition to also displaying comparable capabilities on all safety-related dimensions.

But it isn't a capabilities condition? Maybe I would be happier if you renamed this section.

ryan_greenblatt

I think there is an important component of trustworthiness that you don't emphasize enough. It isn't sufficient to just rule out alignment faking, we need the AI to actually try hard to faithfully pursue our interests including on long, confusing, open-ended, and hard to check tasks. You discuss establishing this with behavioral testing, but I don't think this is trivial to establish with behavioral testing. (I happen to think this is pretty doable and easier than ruling out reasons why our tests might be misleading, but this seems nonobvious.)

Perhaps you ... (read more)

joshc
I don't expect this testing to be hard (at least conceptually, though doing this well might be expensive), and I expect it can be entirely behavioral. See section 6: "The capability condition"
ryan_greenblatt

Note that the capability milestone forecasted in the linked short form is substantially weaker than the notion of transformative AI in the 2020 model. (It was defined as AI with an effect at least as large as the industrial revolution.)

I don't expect this adds many years, for me it adds like ~2 years to my median.

(Note that my median for time from 10x to this milestone is lower than 2 years, but median to Y isn't equal to median to X + median from X to Y.)
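A tiny illustration of the median point, with made-up skewed distributions purely to show that medians don't add:

```python
import numpy as np

rng = np.random.default_rng(0)

# Median of a sum isn't the sum of medians when distributions are skewed.
time_to_x = rng.lognormal(mean=1.0, sigma=1.0, size=100_000)   # years to milestone X (made-up)
time_x_to_y = rng.lognormal(mean=0.0, sigma=1.2, size=100_000)  # additional years from X to Y (made-up)

print(np.median(time_to_x) + np.median(time_x_to_y))  # sum of medians
print(np.median(time_to_x + time_x_to_y))             # median of the sum: generally different
```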

Maybe? At a very high level, I think the weights tend not to have "goals," in the way that the rollouts tend to have goals.

Sure, I meant naturally emerging malign goals to include both "the AI pursues non-myopic objectives" and "these objectives weren't intended and some (potentially small) effort was spent trying to prevent this".

(I think AIs that are automating huge amounts of human labor will be well described as pursuing some objective at least within some small context (e.g. trying to write and test a certain piece of software), but this could be well controlled or sufficiently myopic/narrow that the AI doesn't focus on steering the general future situation including its own weights.)

CoT is way more interpretable than I expected, which bumped me up, so if that became uninterpretable naturally that's a big bump down. I think people kinda overstate how likely this is to happen naturally though.

Presumably you'd update toward pessimism a bunch if reasoning in latent vectors aka neuralese was used for the smartest models (instead of natural language CoT) and it looked like this would be a persistent change in architecture?

(I expect that (at least when neuralese is first introduced) you'll have both latent reasoning and natural language ... (read more)

1a3orn
Yes. I basically agree with your summary of points 1 - 4. I'd want to add that 2 encompasses several different mechanisms that would otherwise need to be inferred, that I would break out separately: knowledge that it is in training or not, and knowledge of the exact way in which its responses will be used in training.

Regarding point 2, I do think a lot of research on how models behave, done in absence of detailed knowledge of how models were trained, tells us very very little about the limits of control we have over models. Like I just think that in absence of detailed knowledge of Anthropic's training, the Constitutional principles they used, their character training, etc, most conclusions about what behaviors are very deliberately put there and what things are surprising byproducts must be extremely weak and tentative.

Ok so "naturally" is a tricky word, right? Like when I saw the claim from Jack Clark that the faking alignment paper was a natural example of misalignment, I didn't feel like that was a particularly normal use of the word. But it's.... more natural than it could be, I guess. It's tricky, I don't think people are intentionally misusing the word but it's not a useful word in conversation.

Ok, good question. Let me break that down into unit tests, with more directly observable cases, and describe how I'd update. For all the below I assume we have transparent CoT, because you could check these with CoT even if it ends up getting dropped.

1. You train a model with multi-turn RL in an environment where, for some comparatively high percent (~5%) of cases, it stumbles into a reward-hacked answer -- i.e., it offers a badly-formatted number in its response, the verifier was screwed up, and it counts as a win. This model then systematically reward hacks. Zero update. You're reinforcing bad behavior, you get bad behavior. (I could see this being something that gets advertised as reward hacking, though? Like, suppose I'm training a front-end engineer AI, an

Some things are true simply because they are true and in general there's no reason to expect a simpler explanation.

You could believe:

Some things are true simply because they are true, but only when being true isn't very surprising. (For instance, it isn't very surprising that there are some cellular automata that live for 100 steps or that any particular cellular automata lives for 100 steps.)

However, things which are very surprising and don't have a relatively compact explanation are exponentially rare. And, in the case where something is infinitely surprising (e.g., if the digits of pi weren't normal), there will exist a finite explanation.

Logan Zoellner
It sounds like you agree "if a Turing machine goes for 100 steps and then stops" this is ordinary and we shouldn't expect an explanation. But you also believe "if pi is normal for 10^40 digits and then suddenly stops being normal, this is a rare and surprising coincidence for which there should be an explanation". And in the particular case of pi I agree with you.

But if you start using this principle in general it is not going to work out well for you. Most simple-to-describe sequences that suddenly stop aren't going to have nice pretty explanations. Or to put it another way: the number of things which are nice (like pi) is dramatically outnumbered by the number of things that are arbitrary (like cellular automata that stop after exactly 100 steps).

I would absolutely love it if there were some criterion I could apply to tell me whether something is nice or arbitrary, but the Halting Problem forbids this. The best we can do is mathematical taste. If mathematicians have been studying something for a long time and it really does seem nice, there is a good chance it is.

(I don't expect o3-mini is a much better agent than 3.5 sonnet new out of the box, but a hybrid scaffold with o3 + 3.5 sonnet will probably be substantially better than 3.5 sonnet alone. Just o3 might also be very good. Putting aside cost, I think o1 is usually better than o3-mini on open-ended programming agency tasks.)

The question of context might be important, see here. I wouldn't find 15 minutes that surprising for ~50% success rate, but I've seen numbers more like 1.5 hours. I thought this was likely to be an overestimate so I went down to 1 hour, but more like 15-30 minutes is also plausible.

Keep in mind that I'm talking about agent scaffolds here.

habryka
Yeah, I have failed to get any value out of agent scaffolds, and I don't think I know anyone else who has so far. If anyone has gotten more value out of them than just the Cursor chat, I would love to see how they do it!  All things like Cursor composer and codebuff and other scaffolds have been worse than useless for me (though I haven't tried it again after o3-mini, which maybe made a difference, it's been on my to-do list to give it another try).

I mean, I don't think AI R&D is a particularly hard field per se, but I do think it involves lots of tricky stuff and isn't much easier than automating some other plausibly-important-to-takeover field (e.g., robotics). (I could imagine that the AIs have a harder time automating philosophy even if they were trying to work on this, but it's more confusing to reason about because human work on this is so dysfunctional.) The main reason I focused on AI R&D is that I think it is much more likely to be fully automated first and seems like it is probably fully automated prior to AI takeover.

TsviBT
Ok, I think I see what you're saying. To check part of my understanding: when you say "AI R&D is fully automated", I think you mean something like: