ryan_greenblatt

I'm the chief scientist at Redwood Research.

Comments

I'd guess that the benchmarks which METR uses have enough label noise and other issues (e.g. specification ambiguity) that measuring >=95% reliability isn't meaningful. 80% probably is meaningful.

When analyzing the high reliability regime, I think you'd want to get a ceiling on performance by baselining with human domain experts who are instructed to focus on reliability. (E.g. something like: "continue testing etc. until you're very confident the task is completed successfully".) And then you'd compare this human ceiling to AI performance (for AIs which are given similar instructions). I think humans (and AIs) typically don't aspire to that high a level of reliability by default. Ideally you'd also iterate on tasks and baselining strategies until the expert baseliners do >95% or whatever.

So, I expect that looking at 95% will underestimate AI progress.
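As a toy illustration of the label-noise point (a minimal sketch; the noise rates here are made up, not estimates of METR's actual benchmarks): if some fraction of tasks have bad labels or ambiguous specs, even a perfect agent can't measure above one minus that fraction, so a very high threshold mostly measures the benchmark rather than the agent.

```python
# Toy illustration only: how label noise / spec ambiguity caps the highest
# reliability a benchmark can measure. Rates below are made up, not METR's.

def max_measurable_reliability(broken_task_fraction: float) -> float:
    """Upper bound on the measured pass rate of a perfect agent, assuming
    broken tasks (bad labels, ambiguous specs) get scored as failures."""
    return 1.0 - broken_task_fraction

for broken in (0.01, 0.05, 0.10):
    ceiling = max_measurable_reliability(broken)
    print(f"{broken:.0%} broken tasks -> measurable ceiling ~{ceiling:.0%}")

# With ~5% of tasks broken, an 80% threshold still tracks agent quality,
# but a 95% threshold mostly reflects benchmark quality, not the agent.
```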

It looks like the images aren't showing up on LW.

Number of cortical neurons != brain size. Orcas have ~2x the number of cortical neurons, but much larger brains. Assuming brain weight is proportional to volume, with human brains typically 1.2-1.4 kg and orca brains typically 5.4-6.8 kg, orca brains are actually something like 6.1/1.3 ≈ 4.7 times larger than human brains.

I think cortical neuron count is a better proxy than brain size, and I expect that the relation between cortical neurons and brain size differs substantially between species. (I expect more similarity within a species.)

My guess is that neuron density is actually somewhat anti-correlated with brain size

This might be true across mammals (and/or birds) overall, but I'm kinda skeptical this is a big effect within humans. Like, I'd guess that the regression slope between brain size and cortical neurons is ~1 in humans rather than substantially less than 1.

that number of cortical neurons would be correlated with IQ rather at ~0.4-0.55 in humans

I agree you'll probably see a bigger correlation with cortical neurons (if you can measure this precisely enough!). I wouldn't guess much more though?


Overall, I'm somewhat sympathetic to your arguments that we should expect that multiplying cortical neurons by X is a bigger effect than multiplying brain size by X. Maybe this moves my estimate of SDs / doubling of cortical neurons up by 1.5x to more like 1.8 SD / doubling. I don't think this makes a huge difference to the bottom line.

Some quick (and relatively minor) notes:

  • I expect that full stack intelligence explosion could look more like "make the whole economy bigger using a bunch of AI labor" rather than specifically automating the chip production process. (That said, in practice I expect explicit focused automation of chip production to be an important part of the picture, probably the majority of the acceleration effect.) Minimally, you need to scale up energy at some point.
    • Focusing on the whole economy is closer to the perspective (I think) of some people from Epoch like Tamay, Matthew, and Ege.
  • You talk about the "chip technology" feedback loop as taking months, but presumably improvements to ASML take longer, as they often require building new fabs?
  • The 6 OOM limit for chip technology is based on limits to FLOP/joule, but currently we're not limited by energy prices/supply as much as we're limited by chip cost. So, in principle you could improve chip technology by reducing the cost of manufacturing. I think this maybe gets you an extra few OOMs, though it's somewhat unclear how to do the accounting between this and scaling up chip production. When analyzing the cumulative limits, this sort of question doesn't matter (as the overall limit can be assessed by just doing intelligence/FLOP * FLOP/joule * maximum joules), but when breaking down how much progress is possible from each component, the FLOP/joule abstraction doesn't really cleanly map onto an area like "chip technology". (See the sketch after this list.)
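As a rough sketch of the accounting point in the last bullet (all OOM figures are made up for illustration, not estimates from the post): in log space the per-component OOMs simply add, so moving cheaper-manufacturing gains between the "chip technology" and "chip production" buckets changes the per-area attribution but not the cumulative limit.

```python
# Illustrative only: made-up OOM budgets, not estimates from the post.
# The cumulative limit is the product intelligence/FLOP * FLOP/joule * maximum
# joules, so in log10 (OOM) terms the components simply add.

breakdown_a = {  # energy-centric framing
    "intelligence per FLOP": 5.0,
    "FLOP per joule (chip technology)": 6.0,
    "maximum joules (energy scale-up)": 4.0,
}
breakdown_b = {  # same total, with cheaper manufacturing counted as chip technology
    "intelligence per FLOP": 5.0,
    "FLOP per dollar (chip tech incl. manufacturing cost)": 8.0,
    "spend, given energy and fab constraints": 2.0,
}

for name, parts in (("A", breakdown_a), ("B", breakdown_b)):
    print(f"breakdown {name}: {sum(parts.values()):.0f} OOMs total")

# Both breakdowns give the same cumulative total; only the attribution of
# progress to areas like "chip technology" vs "chip production" changes.
```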

Yes, I was intending my comment to refer to just code at Anthropic. (Otherwise I would talk much more about serious integration lags and lack of compute.)

I'm very skeptical that orcas can be trained to be smarter than humans. I explain why in this comment.

This makes sense as a crux for the claim "we need philosophical competence to align unboundedly intelligent superintelligences." But, it doesn't make sense for the claim "we need philosophical competence to align general, openended intelligence."

I was thinking of a slightly broader claim: "we need extreme philosophical competence". If I thought we had to use human labor to align wildly superhuman AIs, I would put much more weight on "extreme philosophical competence is needed". I agree that "we need philosophical competence to align any general, openended intelligence" isn't affected by the level of capability at handoff.

I might buy that you and Buck are competent enough here to think clearly about it (not sure. I think you benefit from having a number of people around who seem likely to help), but I would bet against Anthropic decisionmakers being philosophically competent enough.

I think there might be a bit of a (presumably unintentional) motte and bailey here where the motte is "careful conceptual thinking might be required rather than pure naive empiricism (because we won't have good enough test beds by default) and it seems like Anthropic (leadership) might fail heavily at this thinking" and the bailey is "extreme philosophical competence (e.g. 10-30 years of tricky work) is pretty likely to be needed".

I buy the motte here, but not the bailey. I think the motte is a substantial discount on Anthropic from my perspective, but I'm kinda sympathetic to where they are coming from. (Getting conceptual stuff and futurism right is real hard! How would they know who to trust among people disagreeing wildly!)

And ultimately, what matters is "does Anthropic leadership go forward with the next training run", so it matters whether Anthropic leadership buys arguments from hypothetically-competent-enough alignment/interpretability people.

I don't think "does anthropic stop (at the right time)" is the majority of the relevance of careful conceptual thinking from my perspective. Probably more of it is "do they do a good job allocating their labor and safety research bets". This is because I don't think they'll have very much lead time if any (median -3 months) and takeoff will probably be slower than the amount of lead time if any, so pausing won't be as relevant. Correspondingly, pausing at the right time isn't the biggest deal relative to other factors, though it does seem very important at an absolute level.

Orcas have about 43 billion cortical neurons - humans have about 21 billion. The orca cortex has 6 times the area of the human cortex, though the neuron density is about 3 times lower.

[...]

My uncertain guess is that, within mammalian brains, scaling matters a lot more for individual intelligence,

This post seems to assume that a 2x increase in brain size is a huge difference (claiming this could plausibly yield +6 SD), but a naive botec (back-of-the-envelope calculation) doesn't support this.

For humans, brain size and IQ are correlated at ~0.3. Brain size has a standard deviation of roughly 12%. So, a doubling of brain size is ~6 SD of brain size, which yields ~1.8 SD of IQ. I think this is probably a substantial overestimate, as I don't expect this correlation to be fully causal: increased brain size is probably correlated with factors that also increase IQ through other mechanisms (e.g., generally fewer deleterious mutations, aging, nutrition). So, my bottom line guess is that doubling brain size is more like 1.2 SD of IQ. This implies that doubling brain size isn't that big of a deal relative to other factors.
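Spelling out the arithmetic (a minimal sketch of my reading of this botec; in particular, treating the ~12% SD as roughly a log-scale SD is what makes a doubling come out to ~6 SDs of brain size, and the exact outputs differ slightly from the rounded figures in the text):

```python
import math

# Parameters as given in the text; the log-scale treatment of the 12% SD and
# the size of the causal discount are my reading of the argument.
corr_brain_size_iq = 0.3   # observed brain size vs IQ correlation
sd_log_brain_size = 0.12   # ~12% SD of brain size, treated as a log-scale SD

sds_per_doubling = math.log(2) / sd_log_brain_size     # ~5.8, i.e. "~6 SD"
iq_sd_naive = corr_brain_size_iq * sds_per_doubling    # ~1.7, text rounds to ~1.8
iq_sd_discounted = iq_sd_naive * (2 / 3)               # causal discount implied by 1.8 -> 1.2

print(f"a doubling of brain size ~= {sds_per_doubling:.1f} SDs of brain size")
print(f"naive IQ effect: ~{iq_sd_naive:.1f} SD per doubling")
print(f"after the causal discount: ~{iq_sd_discounted:.1f} SD per doubling")

# The "maximally charitable" variant discussed further below (correlation 0.4,
# SD 8%) comes out around 0.4 * ln(2) / 0.08 ~= 3.5, which the text gets as
# 0.4 * 9 = 3.6.
print(f"charitable variant: ~{0.4 * math.log(2) / 0.08:.1f} SD per doubling")
```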


I think this is basically the bottom line: quite likely, doubling brain size isn't very decisive. But out of curiosity, I did a more detailed botec to get to a (sloppy) overall estimate.

First, let's account for non-brain-size differences. My guess is that orcas are (median) around -4 SD on non-brain-size differences (aka brain algorithms) with respect to research tasks (putting some weight on language specialization, accounting for orca specialization, etc.), for an overall estimate of -2.8 SD once you add back ~+1.2 SD for roughly one doubling of brain size. This doesn't feel crazy to me?

Where does this -4 SD come from? My guess is that the human-chimp gap in non-brain size improvements is around 2.2 SD[1] and the chimp-orca (non-brain size) gap is probably similar, maybe a bit smaller, let's say 1.8 SD. So, this yields -4 SD.

I think my 95% confidence interval for the orca algorithmic advantage (on research) is like -8 SD to -1 SD with a roughly normal distribution. (I could be argued into having a substantially lower value for the bottom of the confidence interval; -12 SD doesn't seem too crazy.)

To get an interval for the IQ vs brain size effect, let's do a very charitable estimate and then use that as the 95th percentile. The maximally charitable estimate would be something like:

  • Assume the correlation is more like 0.4 and is fully causal.
  • Assume a standard deviation of 8%, so 2x is ~9 SDs.
  • This yields 3.6 SD of IQ per doubling.

Using this as the 95th percentile and 1.2 as my median, my 95% confidence interval is like 0.4 to 3.6 SDs per 2xing (putting aside the probability of this whole model being confused). Let's say this is distributed lognormally.

Now, let's say the 95% interval on orca brain size is 1.5 to 2.5, with a lognormal distribution.

This yields ~5% chance of orcas being over 4 SDs above humans, and ~10% chance of >2 SDs. (See the notebook here.) After updating on our observations (humans appear to be running an intelligent civilization and orcas don't seem that smart), I update down by maybe 5x on both of these estimates, to a 1% chance of >4 SDs and a 2% chance of >2 SDs.
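For reference, here's a minimal sketch of the kind of Monte Carlo described above (not the linked notebook). How I convert the stated medians and 95% intervals into distribution parameters, and my reading of the 1.5-2.5 figure as the orca/human brain size ratio, are my own guesses, so the tail probabilities it prints won't exactly reproduce the numbers quoted here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Orca algorithmic (non-brain-size) disadvantage in SDs: roughly normal with a
# 95% interval of -8 to -1, as stated above.
algo = rng.normal(-4.5, (8 - 1) / (2 * 1.96), n)

# IQ SDs per doubling of brain size: lognormal with median 1.2 and ~3.6 at the
# 95th percentile.
sd_per_doubling = rng.lognormal(np.log(1.2), np.log(3.6 / 1.2) / 1.645, n)

# Orca / human brain size ratio: lognormal with 95% interval 1.5 to 2.5. (My
# reading; if 1.5-2.5 is instead meant as doublings, drop the log2 below.)
ratio_mu = (np.log(1.5) + np.log(2.5)) / 2
ratio_sigma = (np.log(2.5) - np.log(1.5)) / (2 * 1.96)
size_ratio = rng.lognormal(ratio_mu, ratio_sigma, n)

# Net orca advantage over humans, in SDs, before updating on observations.
advantage = sd_per_doubling * np.log2(size_ratio) + algo

print(f"P(advantage > +4 SD): {np.mean(advantage > 4):.2%}")
print(f"P(advantage > +2 SD): {np.mean(advantage > 2):.2%}")
```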

Correspondingly, I'm not very optimistic about the prospects here.


  1. I think the chimp vs human gap is probably roughly half brain size and half other (algorithmic) improvements. The brain size gap is 3.5x, or 2.2 SD (log2(3.5) ≈ 1.8 doublings at 1.2 SD / doubling). So, if the gap is half brain size and half other (algorithmic) improvements, then we'd get 2.2 SD of algorithmic improvement. ↩︎

The "extreme philosophical competence" hypothesis is that you need such competence to achieve "seriously aligned" in this sense. It sounds like you disagree, but I don't know why since your reasoning just sidesteps the problem.

Yes, my reasoning is definitely part, but not all, of the argument; the thing I said is a sufficient crux for me. (If I thought we had to directly use human labor to align AIs which were qualitatively wildly superhuman in general, I would put much more weight on "extreme philosophical competence".)

Looking over the comments of the first joshc post, it seems like that's also basically asserted it wasn't necessary by fiat.

I agree that Josh's first post doesn't argue about this, but the later two posts kinda do? That said, I think Josh doesn't really state the argument very clearly in a single place anywhere (at least not in the way I would when trying to cover this objection).

"Training models to do specific things, cleverly, actually just makes it pretty hard for them to develop scheming or other motivated misalignments – they have to jump all the way from "don't think about scheming ever" to "secretly think about scheming" to avoid getting caught, and that probably just won't work?"

Hmm, no this isn't really that much of my model. Or at least, I don't feel myself thinking thoughts like this.

Interested in whatever Real You's cruxes are, 1-2 steps removed.

I think this is probably not the best place to discuss this, but I'll try laying out the quick version of the argument, mostly so I can practice explaining myself here.

Objectives and components

Our goal is to make a trustworthy AI that we defer to. This AI needs to be:

  • Capable: At least as capable as we are collectively. This includes handling wicked strategy/forecasting/conceptual problems and anticipating risks from what it does.
  • Trustworthy: It needs to faithfully pursue our interests, including on confusing and open ended problems that are conceptually loaded and that we can't really check.

We'll do this using two components:

  • Avoid egregious misalignment: Rule out misalignment that could make behavioral tests very misleading: avoid scheming (both arising during training and developing later at runtime as we run more copies of the AI), types of sycophancy that are hard to detect, and value drift.
  • Use behavioral tests for capabilities and the parts of trustworthiness that remain. These tests will need to evaluate how the AI generalizes to tasks that are harder, more open ended, longer, and harder to check than the domains we trained it on. (Because we'll need things to generalize further than this.) Iterate against these tests until we know how to train AIs that generalize well.

Avoiding egregious misalignment

I'd first claim that we at least have mechanisms (that don't require extreme philosophical competence) for greatly reducing the risk of egregious misalignment in expectation. The strategies available are:

  • Test measures on analogous setups (model organisms) and then try to transfer
  • Apply principled methods where we have some a priori story it should help

Analogous setups include:

  • Cases where a model naturally became egregiously misaligned (ideally with a roughly similar architecture and training method to the one we are using).
  • Cases where we trained in egregious misalignment (or at least heavily nudged the model towards this).
  • Very abstract analogies where the object we're studying isn't actually egregious misalignment, but is analogous to this.

If we had a large range of pretty good test beds of analogous naturally occurring misalignment that cover the failure modes we're most worried about, then iterating on countermeasures in these test beds seems quite useful. (The countermeasures would be drawn from the principled methods.)

A serious problem is that we're not necessarily able to consistently detect misalignment. We also need to be able to test whether misalignment is less likely after some countermeasure, so being able to reproducibly catch the misalignment at sufficiently low cost is important.[1] This is probably the central difficulty in test beds. We also need to ensure that our countermeasures aren't effectively training against our tests.[2]

(To get a lot of signal, our test beds would need to be representative of the failure modes we're worried about, or there would need to be some reason to expect the countermeasures we try to generalize to all failure modes.)

Another objection is: "Actually, we don't have any methods available to try which could end up with a model that (always) isn't egregiously misaligned. So, even if you can iterate a bunch, you'll just either find that nothing works or you'll just fool yourself." I'm somewhat skeptical of this objection if we actually have a bunch of signal in the test beds and can afford to iterate. It feels to me like in most worlds where we have serious misalignment problems, we can get a good enough set of methods to try that we "just" need something like <5 bits of selection. As in, if we were to randomly sample sets of measures (and parameters for these measures, etc.) from stuff that seems promising, it seems unlikely we'd need more than 32 (= 2^5) tries before something actually worked. This means you maybe don't need to apply crazy amounts of optimization pressure. I don't have a cached quick justification.

Oops, I ran out of my time box for this comment. I didn't get to principled measures. I can send you my doc though.

Behavioral tests

Oops, I ran out of my time box for this comment. Hopefully this section is somewhat comprehensible though I don't get into detail.

We need to test that our AI does reasonable and aligned stuff on really hard to check tasks. So, how can we get confidence in this given that the tasks are hard to check?

  • First, we can see how the AI does on relatively hard to check or philosophically tricky tasks which we have already looked into substantially for unrelated reasons. A big reason why things might be hard to check is that we don't have enough time: the AI might do investigations that would take humans decades. But, we have done some investigations in the past! This won't ultimately let us check tasks as hard as the ones we want the AIs to do, but we can get close. Note that we also have to apply a similar process for humans.
  • Second, we can study how generalization on this sort of thing works in general: when an AI is trained on easy, looks good on medium, how will it do on hard? We can study this in all kinds of ways so that we can gain confidence in generalization. In particular, we can get a sense of how reasonable humans are and trust them to generalize. If we could get a similar property for AIs, we might be good, and this doesn't seem clearly harder than the problem of making AIs which are as capable as humans.

We'd then need to iterate against these tests in a way that doesn't overfit.


  1. Technically, we just need some signal here, but if catching the misalignment was very costly initially, P(catch) increases with spend, and our initial catch doesn't make it much easier to catch perturbations of the model that might behave slightly differently, then we probably wouldn't get enough signal. That said, if we can sometimes detect misalignment in some models, iteration could still be quite useful. ↩︎

  2. More precisely, we need to ensure that our countermeasures aren't just shifting from a type of misalignment we can detect to a type we can't. Qualitatively analyzing the countermeasures and our tests should help here. ↩︎

Yep, just the obvious. (I'd say "much less bought in" than "isn't bought in", but whatever.)

I don't really have dots I'm trying to connect here, but this feels more central to me than what you discuss. Like, I think "alignment might be really, really hard" (which you focus on) is less of the crux than "is misalignment that likely to be a serious problem at all?" in explaining the disagreement. Another way to put this is that I think "is misalignment the biggest problem?" is maybe more of the crux than "is misalignment going to be really, really hard to resolve in some worlds?" I see why you went straight to your belief though.
