All of elifland's Comments + Replies

I found some prior relevant work and tagged them in https://www.lesswrong.com/tag/successor-alignment. I found the top few comments on https://www.lesswrong.com/posts/axKWaxjc2CHH5gGyN/ai-will-not-want-to-self-improve#comments and https://www.lesswrong.com/posts/wZAa9fHZfR6zxtdNx/agi-systems-and-humans-will-both-need-to-solve-the-alignment#comments helpful.

edit: another effect to keep in mind is that capabilities research may be harder to sandbag on because of more clear metrics.

Wanted to write a more thoughtful reply to this, but basically yes, my best guess is that the benefits of informing the world are in expectation bigger than the negatives from acceleration. A potentially important background view is that I think takeoff speeds matter more than timelines, and it's unclear to me how having FrontierMath affects takeoff speeds.

I wasn't thinking much about the optics, but I'd guess that's not a large effect. I agree that Epoch made a mistake here though and this is a negative.

I could imagine changing my mind somewhat easily.

4plex
Agree that takeoff speeds are more important, and expect that FrontierMath has much less effect on takeoff speed. Still think timelines matter enough that the amount of relevantly informing people that you buy from this is likely not worth the cost, especially if the org is avoiding talking about risks in public and leadership isn't focused on agentic takeover, so the info is not packaged with the info needed for it to have the effects which would help.

I feel like I might be missing something, but conditional on scheming isn't it differentially useful for safety because by default scheming AIs would be more likely to sandbag on safety research than capabilities research?

4Lucius Bushnaq
That's not clear to me? Unless they have a plan to ensure future ASIs are aligned with them or meaningfully negotiate with them, ASIs seem just as likely to wipe out any earlier non-superhuman AGIs as they are to wipe out humanity.  I can come up with specific scenarios where they'd be more interested in sabotaging safety research than capabilities research, as well as the reverse, but it's not evident to me that the combined probability mass of the former outweighs the latter or vice-versa.  If someone has an argument for this I would be interested in reading it.
elifland176

Yes, that answer matches my understanding of the concern. If the vast majority of the dataset was private to Epoch, OpenAI could occasionally submit their solutions (probably via API) to Epoch to grade, but wouldn’t be able to use the dataset with high frequency as an evaluation in many experiments.


This is assuming that companies won’t fish out the data from API logs anyway, which the OP asserts but I think is unclear.


Also if they have access to the mathematicians’ reasoning in addition to final answers, this could potentially be valuable without directly... (read more)

9plex
I'm guessing you view having better understanding of what's coming as very high value, enough that burning some runway is acceptable? I could see that model (though put <15% on it), but I think it is at least not good integrity-wise to have put on the appearance of doing just the good-for-x-risk part and not sharing it as an optimizable benchmark, while being funded by and giving the data to people who will use it for capability advancements.

Superforecasters can beat domain experts, as shown in Phil Tetlock's work comparing superforecasters to intelligence analysts.

This isn't accurate; see this post, especially (3a), (3b), and Goldstein et al. (2015): https://docs.google.com/document/d/1ZEEaVP_HVSwyz8VApYJij5RjEiw3mI7d-j6vWAKaGQ8/edit?tab=t.0#heading=h.mma60cenrfmh

eliflandΩ582

Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I'm intuitively skeptical.

Yes, that would be my guess, medium confidence.

One component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you're describing.

... (read more)
eliflandΩ45-1

I believe Cybench first solve times are based on the fastest top professional teams, rather than typical individual CTF competitors or cyber employees, for which the time to complete would probably be much higher (especially for the latter).

5maxnadeau
Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I'm intuitively skeptical. One (edit: minor) component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you're describing.

This is clarifying for me, appreciate it. If I believed (a) that we needed a paradigm shift like the ones to LLMs in order to get AI systems resulting in substantial AI R&D speedup, and (b) that trend extrapolation from benchmark data would not be informative for predicting these paradigm shifts, then I would agree that the benchmarks + gaps method is not particularly informative.

Do you think that's a fair summary of (this particular set of) necessary conditions?

(edit: didn't see @Daniel Kokotajlo's new comment before mine. I agree with him regarding disagreeing with both sub-claims but I think I have a sense of where you're coming from.)

A few possible categories of situations we might have long timelines, off the top of my head:

  1. Benchmarks + gaps is still best: overall gap is somewhat larger + slowdown in compute doubling time after 2028, but trend extrapolations still tell us something about gap trends: This is how I would most naturally think about how timelines through maybe the 2030s are achieved, and potentially beyond if neither of the next two holds.
  2. Others are best (more than one of these can be true):
    1. The current benchmarks and evaluations are so far away from AGI that trends on them
... (read more)
elifland102

For context in a sibling comment Ryan said and Steven agreed with:

It sounds like your disagreement isn't with drawing a link from RE-bench to (forecasts for) automating research engineering, but is instead with thinking that you can get AGI shortly after automating research engineering due to AI R&D acceleration and already being pretty close. Is that right?

Note that the comment says research engineering, not research scientists.

Now responding on whether I think the no new paradigms assumption is needed:

(Obviously you’re entitled to argue / believe tha

... (read more)
Answer by elifland*10520

Here's the structure of the argument that I am most compelled by (I call it the benchmarks + gaps argument); I'm uncertain about the details.

  1. Focus on the endpoint of substantially speeding up AI R&D / automating research engineering. Let's define our timelines endpoint as something that ~5xs the rate of AI R&D algorithmic progress (compared to a counterfactual world with no post-2024 AIs). Then make an argument that ~fully automating research engineering (experiment implementation/monitoring) would do this, along with research taste of at least the
... (read more)

Here's why I'm wary of this kind of argument:

First, we know that labs are hill-climbing on benchmarks.

Obviously, this tends to inflate model performance on the specific benchmark tasks used for hill-climbing, relative to "similar" but non-benchmarked tasks.

More generally and insidiously, it tends to inflate performance on "the sort of things that are easy to measure with benchmarks," relative to all other qualities that might be required to accelerate or replace various kinds of human labor.

If we suppose that amenability-to-benchmarking correlates with var... (read more)

7Kaj_Sotala
Thanks. I think this argument assumes that the main bottleneck to AI progress is something like research engineering speed, such that accelerating research engineering speed would drastically increase AI progress? I think that that makes sense as long as we are talking about domains like games / math / programming where you can automatically verify the results, but that something like speed of real-world interaction becomes the bottleneck once shifting to more open domains.

Consider an AI being trained on a task such as “acting as the CEO for a startup”. There may not be a way to do this training other than to have it actually run a real startup, and then wait for several years to see how the results turn out. Even after several years, it will be hard to say exactly which parts of the decision process contributed, and how much of the startup’s success or failure was due to random factors. Furthermore, during this process the AI will need to be closely monitored in order to make sure that it does not do anything illegal or grossly immoral, slowing down its decision process and thus the whole training. And I haven’t even mentioned the expense of a training run where running just a single trial requires a startup-level investment (assuming that the startup won’t pay back its investment, of course).

Of course, humans do not learn to be CEOs by running a million companies and then getting a reward signal at the end. Human CEOs come in with a number of skills that they have already learned from somewhere else that they then apply to the context of running a company, shifting between their existing skills and applying them as needed. However, the question of what kind of approach and skill to apply in what situation, and how to prioritize between different approaches, is by itself a skillset that needs to be learned... quite possibly through a lot of real-world feedback.
9[anonymous]
@elifland what do you think is the strongest argument for long(er) timelines? Do you think it's essentially just "it takes a long time for researchers to learn how to cross the gaps"?  Or do you think there's an entirely different frame (something that's in an ontology that just looks very different from the one presented in the "benchmarks + gaps argument"?)

RE-bench tasks (see page 7 here) are not the kind of AI research where you’re developing new AI paradigms and concepts. The tasks are much more straightforward than that. So your argument is basically assuming without argument that we can get to AGI with just the more straightforward stuff, as opposed to new AI paradigms and concepts.

If we do need new AI paradigms and concepts to get to AGI, then there would be a chicken-and-egg problem in automating AI research. Or more specifically, there would be two categories of AI R&D, with the less important R&D... (read more)

Thanks. I edited again to be more precise. Maybe I'm closer to the median than I thought.

(edit: unimportant clarification. I just realized "you all" may have made it sound like I thought every single person on the Lightcone team had a p(doom) higher than mine. I meant it to be more like a generic y'all to represent the group, not a claim about the minimum p(doom) of the team)

2kave
My impression matches your initial one, to be clear. Like my point estimate of the median is like 85%, my confidence only extends to >50%

Yeah I meant more on p(doom)/alignment difficulty than timelines, I'm not sure what your guys' timelines are. I'm roughly in the 35-55% ballpark for a misaligned takeover, and my impression is that you all are closer to but not necessarily all the way at the >90% Eliezer view. If that's also wrong I'll edit to correct.

edit: oh maybe my wording of "farther" in the original comment was specifically confusing and made it sound like I was talking about timelines. I will edit to clarify.

1mattmacdermott
Do we have any data on p(doom) in the LW/rationalist community? I would guess the median is lower than 35-55%. It's not exactly clear where to draw the line, but I would guess this to be the case for, say, the 10% most active LessWrong users.
5habryka
Ah, yep, I am definitely more doomy than that. I tend to be around 85%-90% these days. I did indeed interpret you to be talking about timelines due to the "farther".
7kave
Lightcone is also heterogeneous, but I think it's accurate that the median view at Lightcone is >50% on misaligned takeover
elifland194

Appreciate the post. I've previously donated $600 through the EA Manifund thing and will consider donating again late this year / early next year when thinking through donations more broadly.

I've derived lots of value with regards to thinking through AI futures from LW/AIAF content (some non-exhaustive standouts: 2021 MIRI conversations, List of Lethalities and Paul response, t-AGI framework, Without specific countermeasures..., Hero Licensing). It's unclear to me how much of the value would have been retained if LW didn't exist, but plausibly LW is respon... (read more)

2habryka
Hmm, my guess is we probably don’t disagree very much on timelines. My honest guess is that yours are shorter than mine, though mine are a bit in flux right now with inference compute scaling happening and the slope and reliability of that mattering a lot.
Answer by elifland2821

Twitter AI (xAI), which seemingly had no prior history of strong AI engineering, with a small team and limited resources

Both of these seem false.

Re: talent, see from their website:

They don't list their team on their site, but I know their early team includes Igor Babuschkin who has worked at OAI and DeepMind, and Christian Szegedy who has 250k+ citations including several foundational papers.

Re: resources, according to Elon's early July tweet (ofc take Elon with a grain of salt) Grok 2 was trained on 24k H100s (approximately 3x the FLOP/s of GPT-4, accordi... (read more)

8Vladimir_Nesov
According to DCD, that should be fall 2025. Planned power is 150 megawatts or possibly 50+150 megawatts, which is good for 100K H100s, but not more than that. The request for the 150 megawatts is still being discussed by the utilities, as of August 2024. Any future Blackwells will need to go elsewhere, the whole plan for this datacenter seems to be the 100K H100s. (This costs about $5bn, and xAI only closed its $6bn Series B in May 2024.) This scale seems to be available from AWS, and takes about a month to invest GPT-4 levels of compute. Grok-2 was probably rushed, once it was ready to train, in order to finally get a 4-level model, so it didn't train for very long. If 100K H100s clusters remain impossible to access, and the full Memphis datacenter won't get online at least for months yet (with significantly more H100s than 24K), it seems that the reasonable thing right now is to simply train on 24K H100s for more months. That's probably going to be Grok-3.
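As a rough sanity check of the power arithmetic above, here is a minimal sketch; the ~1.5 kW all-in per H100 (GPU plus server, networking, and cooling overhead) is my assumption, not a figure from the comment:

```python
# Rough check of "150 megawatts ... is good for 100K H100s, but not more than that".
# Assumption (mine): ~700 W GPU TDP plus roughly 2x overhead for the rest of the
# server, networking, and cooling, i.e. ~1.5 kW all-in per GPU.
ALL_IN_KW_PER_H100 = 1.5
SITE_POWER_MW = 150

gpus_supported = SITE_POWER_MW * 1_000 / ALL_IN_KW_PER_H100
print(f"~{gpus_supported:,.0f} H100s")  # ~100,000
```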
elifland113

And internally, we have an anonymous RSP non-compliance reporting line so that any employee can raise concerns about issues like this without any fear of retaliation.
 

Are you able to elaborate on how this works? Are there any other details about this publicly? I couldn't find more via a quick search.

Some specific qs I'm curious about: (a) who handles the anonymous complaints, (b) what is the scope of behavior explicitly (and implicitly re: cultural norms) covered here, and (c) how are situations handled where a report would deanonymize the reporter (or narrow them down to a small number of people)?

5Zach Stein-Perlman
Anthropic has not published details. See discussion here. (I weakly wish they would; it's not among my high-priority asks for them.)
elifland20

Thanks for the response!

I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we'd re-run them ahead of schedule.

[...]

I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we'd develop for the next round

Thanks for these clarifications. I didn't realize that the 30% was for the new yellow-line evals rather than the current ones.

Since triggering a yellow-line eval requires

... (read more)
4Zac Hatfield-Dodds
That's how I was thinking about the predictions that I was making; others might have been thinking about the current evals where those were more stable. Of course, but pausing also means we'd have to shuffle people around, interrupt other projects, and deal with a lot of other disruption (the costs of pausing). Ideally, we'd continue updating our yellow-line evals to stay ahead of model capabilities until mitigations are ready.
elifland40

From the RSP Evals report:

As a rough attempt at quantifying the elicitation gap, teams informally estimated that, given an additional three months of elicitation improvements and no additional pretraining, there is a roughly 30% chance that the model passes our current ARA Yellow Line, a 30% chance it passes at least one of our CBRN Yellow Lines, and a 5% chance it crosses cyber Yellow Lines. That said, we are currently iterating on our threat models and Yellow Lines so these exact thresholds are likely to change the next time we update our Responsible Sca

... (read more)
4Zac Hatfield-Dodds
The yellow-line evals are already a buffer ('sufficent to rule out red-lines') which are themselves a buffer (6x effective compute) before actually-dangerous situations. Since triggering a yellow-line eval requires pausing until we have either safety and security mitigations or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals. I therefore think it's reasonable to keep going basically regardless of the probability of triggering in the next round of evals. I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we'd re-run them ahead of schedule. I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we'd develop for the next round, and enough calibration training to avoid very low probabilities. Finally, the point of these estimates is that they can guide research and development prioritization - high estimates suggest that it's worth investing in more difficult yellow-line evals, and/or that elicitation research seems promising. Tying a pause to that estimate is redundant with the definition of a yellow-line, and would risk some pretty nasty epistemic distortions.
elifland5641

The word "overconfident" seems overloaded. Here are some things I think that people sometimes mean when they say someone is overconfident:

  1. They gave a binary probability that is too far from 50% (I believe this is the original one)
  2. They overestimated a binary probability (e.g. they said 20% when it should be 1%)
  3. Their estimate is arrogant (e.g. they say there's a 40% chance their startup fails when it should be 95%), or maybe they give an arrogant vibe
  4. They seem too unwilling to change their mind upon arguments (maybe their credal resilience is too high)
  5. They g
... (read more)
6nostalgebraist
In addition to your 1-6, I have also seen people use "overconfident" to mean something more like "behaving as though the process that generated a given probabilistic prediction was higher-quality (in terms of Brier score or the like) than it really is." In prediction market terms: putting more money than you should into the market for a given outcome, as distinct from any particular fact about the probabilit(ies) implied by your stake in that market. For example, suppose there is some forecaster who predicts on a wide range of topics.  And their forecasts are generally great across most topics (low Brier score, etc.).  But there's one particular topic area -- I dunno, let's say "east Asian politics" -- where they are a much worse predictor, with a Brier score near random guessing.  Nonetheless, they go on making forecasts about east Asian politics alongside their forecasts on other topics, without noting the difference in any way. I could easily imagine this forecaster getting accused of being "overconfident about east Asian politics."  And if so, I would interpret the accusation to mean the thing I described in the first 2 paragraphs of this comment, rather than any of 1-6 in the OP. Note that the objection here does not involve anything about the specific values of the forecaster's distributions for east Asian politics -- whether they are low or high, extreme or middling, flat or peaked, etc.  This distinguishes it from all of 1-6 except for 4, and of course it's also unrelated to 4. The objection here is not that the probabilities suffer from some specific, correctable error like being too high or extreme. Rather, the objection is that forecaster should not be reporting these probabilities at all; or that they should only report them alongside some sort of disclaimer; or that they should report them as part of a bundle where they have "lower weight" than other forecasts, if we're in a context like a prediction market where such a thing is possible.
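For concreteness, here is a minimal sketch of the Brier-score comparison this is gesturing at; the forecasts and outcomes are made up purely for illustration and are not from the comment:

```python
def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes
    (lower is better; always guessing 0.5 scores 0.25)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Hypothetical forecaster: sharp on most topics...
print(brier([0.9, 0.1, 0.8], [1, 0, 1]))    # ~0.02
# ...but near chance on one topic area (e.g. the "east Asian politics" case above).
print(brier([0.55, 0.6, 0.45], [1, 0, 0]))  # ~0.26, close to the 0.25 of always guessing 0.5
```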
4Unnamed
 Moore & Schatz (2017) made a similar point about different meanings of "overconfidence" in their paper The three faces of overconfidence. The abstract: Though I do think that some of your 6 different meanings are different manifestations of the same underlying meaning. Calling someone "overprecise" is saying that they should increase the entropy of their beliefs. In cases where there is a natural ignorance prior, it is claiming that their probability distribution should be closer to the ignorance prior. This could sometimes mean closer to 50-50 as in your point 1, e.g. the probability that the Yankees will win their next game. This could sometimes mean closer to 1/n as with some cases of your points 2 & 6, e.g. a 1/30 probability that the Yankees will win the next World Series (as they are 1 of 30 teams). In cases where there isn't a natural ignorance prior, saying that someone should increase the entropy of their beliefs is often interpretable as a claim that they should put less probability on the possibilities that they view as most likely. This could sometimes look like your point 2, e.g. if they think DeSantis has a 20% chance of being US President in 2030, or like your point 6. It could sometimes look like widening their confidence interval for estimating some quantity.
4Daniel Kokotajlo
I feel like this should be a top-level post.
3Garrett Baker
When I accuse someone of overconfidence, I usually mean they're being too hedgehogy when they should be being more foxy.
elifland208

I think 356 or more people in the population are needed to make there be a >5% chance of 2+ deaths in a 2-month span from that population
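For reference, a minimal sketch of the kind of calculation that number implies; the ~0.1% per-person chance of dying in any given 2-month window (roughly 0.6%/year) is my assumption for illustration, not a figure stated in the comment:

```python
def p_two_or_more_deaths(n: int, p: float) -> float:
    """Binomial model: probability of at least 2 deaths among n independent
    people, each dying within the window with probability p."""
    p_zero = (1 - p) ** n
    p_one = n * p * (1 - p) ** (n - 1)
    return 1 - p_zero - p_one

# Assumed per-person chance of dying in a given 2-month window (~0.6%/year).
P_TWO_MONTHS = 0.001

n = 1
while p_two_or_more_deaths(n, P_TWO_MONTHS) < 0.05:
    n += 1
print(n)  # 356 under this assumption
```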

2aphyer
Shouldn't that be counting the number squared rather than the number?
isabel145

I think there should be some sort of adjustment for Boeing not being exceptionally sus before the first whistleblower death - shouldn't privilege Boeing until after the first death, should be thinking across all industries big enough that the news would report on the deaths of whistleblowers, which I think makes it not significant again.

[cross-posting from blog]

I made a spreadsheet for forecasting the 10th/50th/90th percentile for how you think GPT-4.5 will do on various benchmarks (given 6 months after the release to allow for actually being applied to the benchmark, and post-training enhancements). Copy it here to register your forecasts.

If you’d prefer, you could also use it to predict for GPT-5, or for the state-of-the-art at a certain time e.g. end of 2024 (my predictions would be pretty similar for GPT-4.5, and end of 2024).

You can see my forecasts made with ~2 hours of total effort... (read more)

This is indeed close enough to Epoch's median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites an Epoch estimate of around 9e25 FLOPs).

 

FYI at the time that doc was created, Epoch had 9e25. Now the notebook says 7.7e25 but their webpage says 5e25. Will ask them about it.

Interesting, thanks for clarifying. It's not clear to me that this is the right primary frame to think about what would happen, as opposed to just thinking first about how big compute bottlenecks are and then adjusting the research pace for that (and then accounting for diminishing returns to more research). 

I think a combination of both perspectives is best, as the argument in your favor for your frame is that there will be some low-hanging fruit from changing your workflow to adapt to the new cognitive labor.

Physical bottlenecks still exist, but is it really that implausible that the capabilities workforce would stumble upon huge algorithmic efficiency improvements? Recall that current algorithms are much less efficient than the human brain. There's lots of room to go.

I don't understand the reasoning here. It seems like you're saying "Well, there might be compute bottlenecks, but we have so much room left to go in algorithmic improvements!" But the room to improve point is already the case right now, and seems orthogonal to the compute bottlenecks point.

E.g. i... (read more)

3Nikola Jurkovic
I was more making the point that, if we enter a regime where AI can do 10 hour SWE tasks, then this will result in big algorithmic improvements, but at some point pretty quickly effective compute improvements will level out because of physical compute bottlenecks. My claim is that the point at which it will level out will be after multiple years' worth of current algorithmic progress has been "squeezed out" of the available compute.

Imagine the current AGI capabilities employee's typical work day. Now imagine they had an army of AI assistants that can very quickly do 10 hours worth of their own labor. How much more productive is that employee compared to their current state? I'd guess at least 5x. See section 6 of Tom Davidson's takeoff speeds framework for a model.

Can you elaborate on how you're translating 10-hour AI assistants into a 5x speedup using Tom's CES model?

1Nikola Jurkovic
My reasoning is something like: roughly 50-80% of tasks are automatable with AI that can do 10 hours of software engineering, and under most sensible parameters this results in at least 5x of speedup. I'm aware this is kinda hazy and doesn't map 1:1 with the CES model though
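For intuition, here is a minimal sketch of the perfect-complements (Amdahl-style) limiting case; this is my own simplification for illustration, not Nikola's calculation or the full CES model:

```python
def limiting_speedup(automated_fraction: float) -> float:
    """If the automated fraction of tasks becomes effectively free and the
    remaining tasks are unchanged, overall speedup is capped at 1 / (1 - f)."""
    return 1 / (1 - automated_fraction)

for f in (0.5, 0.8):
    print(f, limiting_speedup(f))  # 0.5 -> 2.0x, 0.8 -> ~5x
```

In this limiting case only the 80% end of the range gives 5x and 50% gives 2x, so getting "at least 5x" from lower automation fractions requires the remaining tasks to also be sped up, which is where the CES substitution parameters come in.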

I agree that <15% seems too low for most reasonable definitions of 1-10 hours and the singularity. But I'd guess I'm more sympathetic than you, depending on the definitions Nathan had in mind.

I think both of the phrases "AI capable doing tasks that took 1-10 hours" and "hit the singularity" are underdefined and making them more clear could lead to significantly different probabilities here.

  1. For "capable of doing tasks that took 1-10 hours in 2024":
    1. If we're saying that "AI can do every cognitive task that takes a human 1-10 hours in 2024 as well as (edit:
... (read more)
2Nathan Young
Yeah I was trying to use Richard's terms. I also guess that the less training data there is, the less good the AIs will be. So while they may be good at setting up a dropshipping website for shoes (a 1 - 10 hour task) they may not be good at alignment research. To me the singularity is when things are undeniably zooming, or perhaps even have zoomed. New AI tech is coming out daily or perhaps there is even godlike AGI. What do folks think is a reasonable definition?
1Nikola Jurkovic
For "capable of doing tasks that took 1-10 hours in 2024", I was imagining an AI that's roughly as good as a software engineer that gets paid $100k-$200k a year.  For "hit the singularity", this one is pretty hazy, I think I'm imagining that the metaculus AGI question has resolved YES, and that the superintelligence question is possibly also resolved YES. I think I'm imagining a point where AI is better than 99% of human experts at 99% of tasks. Although I think it's pretty plausible that we could enter enormous economic growth with AI that's roughly as good as humans at most things (I expect the main thing stopping this to be voluntary non-deployment and govt. intervention).

In his AI Insight Forum statement,  Andrew Ng puts 1% on "This rogue AI system gains the ability (perhaps access to nuclear weapons, or skill at manipulating people into using such weapons) to wipe out humanity" in the next 100 years (conditional on a rogue AI system that doesn't go unchecked by other AI systems existing).  And overall 1 in 10 million of AI causing extinction in the next 100 years.

2jessicata
Thanks, added.

Among existing alignment research agendas/projects, Superalignment has the highest expected value

eliflandΩ340

I'm mainly arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum.

I had the impression that it was more than just that, given the line: "In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention." and the further attention devoted to deceptive alignment.

I appreciate these predictions, but I am not as interested in predi

... (read more)
2Matthew Barnett
Yes, the post was about more than that. To the extent I was arguing against a single line of work, it was mainly intended as a critique of public advocacy. Separately, I asked people to re-evaluate which problems will be solved by default, to refocus our efforts on the most neglected, important problems, and went into detail about what I currently expect will be solved by default. I offered a concrete prediction in the post. If people don't think my prediction operationalizes any disagreement, then I think (1) either they don't disagree with me, in which case maybe the post isn't really aimed at them, or (2) they disagree with me in some other way that I can't predict, and I'd prefer they explain where they disagree exactly. It seems relatively valueless to predict on what will happen without intervention, since AI x-risk people will almost certainly intervene. I mostly agree. But I think it's still better to offer a precise prediction than to only offer vague predictions, which I perceive as the more common and more serious failure mode in discussions like this one.
eliflandΩ120

I have three things to say here:

Thanks for clarifying.

Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won't be solved either without heroic effort. I'm also sympathetic to Dan Hendrycks' arguments about AI evolution. I will add these points to the post.

Don't have a strong opinion here, but intuitively feels like it w... (read more)

2Matthew Barnett
I agree. I'm not criticizing the people who are trying to make sure that policies are well-targeted and grounded in high-quality evidence. I'm arguing in favor of their work.  I'm mainly arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum. [ETA, rewording: To the extent I was arguing against a single line of work, I was primarily arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum. However, as I wrote in the post, I also think that we should re-evaluate which problems will be solved by default, which means I'm not merely letting other AI governance people off the hook.] I appreciate these predictions, but I am not as interested in predicting personal of public opinions. I'm more interested in predicting regulatory stringency, quality, and scope.  Even if fewer than 10% of Americans consider AI to be the most important issue in 2028, I don't think that necessarily indicates that regulations will have low stringency, low quality, or poor scope. Likewise, I'm not sure whether I want to predict on Evan Hubinger's opinion, since I'd probably need to understand more about how he thinks to get it right, and I'd prefer to focus the operationalization instead on predictions about large, real world outcomes. I'm not really sure what disagreement the third prediction is meant to operationalize, although I find it to be an interesting question nonetheless.
elifland3839

Thus, due to no one's intent, insufficiently justified concerns about current open-source AI are propagated to governance orgs, which recommend banning open source based on this research.

The recommendation that current open-source models should be banned is not present in the policy paper being discussed, AFAICT. The paper's recommendations are pictured below:

Edited to add: there is a specific footnote that says "Note that we do not claim that existing models are already too risky. We also do not make any predictions about how risky the next generation of... (read more)

aog219

Kevin Esvelt explicitly calls for not releasing future model weights. 

Would sharing future model weights give everyone an amoral biotech-expert tutor? Yes. 

Therefore, let’s not.

aog129

I think it's quite possible that open source LLMs above the capability of GPT-4 will be banned within the next two years on the grounds of biorisk. 

The White House Executive Order requests a government report on the costs and benefits of open source frontier models and recommended policy actions. It also requires companies to report on the steps they take to secure model weights. These are the kinds of actions the government would take if they were concerned about open source models and thinking about banning them.

This seems like a foreseeable consequence of many of the papers above, and perhaps the explicit goal.

Exactly. I’m getting frustrated when we talk about risks from AI systems with the open source or e/acc communities. The open source community seems to consistently assume that the concerns are about current AI systems and that current systems are enough to lead to significant biorisk. Nobody serious is claiming this and it’s not what I’m seeing in any policy document or paper. And this difference in starting points between the AI Safety community and the open source community pretty much makes all the difference.

Sometimes I wonder if the open source co... (read more)

eliflandΩ142514

I agree much of the community (including me) was wrong or directionally wrong in the past about the level of AI regulation and how quickly it would come.

Regarding the recommendations made in the post for going forward given that there will be some regulation, I feel confused in a few ways.

  1. Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones
    1. You argue that we perhaps shouldn't in
... (read more)

Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones

I have three things to say here:

  1. Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won'
... (read more)

Oh, I'm certainly not claiming that no-one should attempt to make the estimates.

Ah my bad if I lost the thread there

I'd want regulators to push for safer strategies, not to run checks on unsafe strategies - at best that seems likely to get a local minimum (and, as ever, overconfidence).

Seems like checks on unsafe strategies, done well, encourage safer strategies; I agree overconfidence is an issue though

More [evaluate the plan to get through the minefield], and less [estimate whether we'll get blown up on the next step]

Seems true in an ideal world but in p... (read more)

GPT-4 + unknown unknowns = stop. (whether they say "unknown unknowns so 5% chance of 8 billion deaths", or "unknown unknowns so 0.1% chance of 8 billion deaths

I feel like .1% vs. 5% might matter a lot, particularly if we don't have strong international or even national coordination and are trading off more careful labs going ahead vs. letting other actors pass them. This seems like the majority of worlds to me (i.e. without strong international coordination where US/China/etc. trust each other to stop and we can verify that), so building capacity to improv... (read more)

2Joe Collman
Oh, I'm certainly not claiming that no-one should attempt to make the estimates. I'm claiming that, conditional on such estimation teams being enshrined in official regulation, I'd expect their results to get misused. Therefore, I'd rather that we didn't have official regulation set up this way. The kind of risk assessments I think I would advocate would be based on the overall risk of a lab's policy, rather than their immediate actions. I'd want regulators to push for safer strategies, not to run checks on unsafe strategies - at best that seems likely to get a local minimum (and, as ever, overconfidence). More [evaluate the plan to get through the minefield], and less [estimate whether we'll get blown up on the next step]. (importantly, it won't always be necessary to know which particular step forward is more/less likely to be catastrophic, in order to argue that an overall plan is bad)
eliflandΩ240

Thanks for calling me out on this. I think you're likely right. I will cross out that line of the comment, and I have updated toward the effect size of strong AI regulation being larger and am less skeptical of the 10x risk reduction, but my independent impression would still be much lower (~1.25x or smth, while before I would have been at ~1.15x).

I still think the AI case has some very important differences with the examples provided due to the general complexity of the situation and the potentially enormous difficulty of aligning superhuman AIs and preventing misuse (this is not to imply you disagree, just stating my view).

elifland*Ω240

I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can't cut risk by much

This seems to be our biggest crux, as I said interested in analyses of alignment difficulty distribution if any onlookers know. Also, a semantic point but under my current views I'd view cutting ~5% of the risk as a huge deal that's at least an ~80th percentile outcome for the AI risk community if it had a significant counterfactual impact on it, but yes not much compared to 10x.

[EDIT: After thinking about this more I've realized that I was to some ext... (read more)

Thanks! I agree with a lot of this, will pull out the 2 sentences I most disagree with. For what it's worth I'm not confident that this type of risk assessment would be a very valuable idea (/ which versions would be best). I agree that there is significant risk of non-cautious people doing this poorly.

The reason I'm against it is that it'll actually be rough and unprincipled, but this will not be clear - in most people's minds (including most regulators, I imagine) it'll map onto the kind of systems that we have for e.g. nuclear risks.

I think quantifying "ro... (read more)

4Joe Collman
That's reasonable, but most of my worry comes back to:
  1. If the team of experts is sufficiently cautious, then it's a trivially simple calculation: a step beyond GPT-4 + unknown unknowns = stop. (whether they say "unknown unknowns so 5% chance of 8 billion deaths", or "unknown unknowns so 0.1% chance of 8 billion deaths" doesn't seem to matter a whole lot)
    1. I note that 8 billion deaths seems much more likely than 100 million, so the expectation of "1% chance of over 100 million deaths" is much more than 1 million.
  2. If the team of experts is not sufficiently cautious, and come up with "1% chance of OpenAI's next model causing over 100 million deaths" given [not-great methodology x], my worry isn't that it's not persuasive that time. It's that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we'll be screwed.
In part, I'm worried that the argument for (1) is too simple - so that a forecasting team might put almost all the emphasis elsewhere, producing a 30-page report with 29 essentially irrelevant pages. Then it might be hard to justify coming to the same conclusion once the issues on 29 out of 30 pages are fixed. I'd prefer to stick to the core argument: a powerful model and unknown unknowns are sufficient to create too much risk. The end. We stop until we fix that.
The only case I can see against this is [there's a version of using AI assistants for alignment work that reduces overall risk]. Here I'd like to see a more plausible positive case than has been made so far. The current case seems to rely on wishful thinking (it's more specific than the one sentence version, but still sketchy and relies a lot on [we hope this bit works, and this bit too...]). However, I don't think Eliezer's critique is sufficient to discount approaches of this form, since he tends to focus on the naive [just ask for a full alignment solution] versions, which are a bit strawmannish. I still think he's like
elifland*Ω131

I don't think that voluntary implementation of RSPs is a substitute for regulatory requirements and international collaboration (and tried to emphasize this in the post). In talking about a 10x risk reduction I'm absolutely imagining international coordination to regulate AI development.

Appreciate this clarification.

I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.

(but conditioning on a very good implementation)

I'm still confused about the definition of "very good RSPs" and "very goo... (read more)

8Lukas Finnveden
Hm, at the scale of "(inter-)national policy", I think you can get quite large effect sizes. I don't know how large the effect-sizes of the following are, but I wouldn't be surprised by 10x or greater for:
* Regulation of nuclear power leading to reduction in nuclear-related harms. (Compared to a very relaxed regulatory regime.)
* Regulation of pharmaceuticals leading to reduced side-effects from drugs. (Compared to a regime where people can mostly sell what they want, and drugs only get banned after people notice that they're causing harm.)
* Worker protection standards. (Wikipedia claims that the Netherlands has a ~17x lower rate of fatal workplace accidents than the US, which is ~22x lower than India.)
I don't know what's driving the differences here, but the difference between the US and Netherlands suggests that it's not all "individuals can afford to take lower risks in richer countries".
6paulfchristiano
I don't think you need to reliably classify a system as safe or not.  You need to apply consistent standards that output "unsafe" in >90% of cases where things really are unsafe. I think I'm probably imagining better implementation than you, probably because (based on context) I'm implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium. I think what I'm describing as "very good RSPs" and imagining cutting risk 10x still requires significantly less political will than a global moratorium now (but I think this is a point that's up for debate). So at that point you obviously aren't talking about 100% of countries voluntarily joining (instead we are assuming export controls implemented by the global community on straggling countries---which I don't even think seems very unrealistic at this point and IMO is totally reasonable for "very good"), and I'm not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that's fair to include as part of "very good"). I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can't cut risk by much. I'm sympathetic to the claim that >10% of risk comes from worlds where you need to pursue the technology in a qualitatively different way to avoid catastrophe, but again in those scenarios I do think it's plausible for well-implemented RSPs to render some kinds of technologies impractical and therefore force developers to pursue alternative approaches.

I’d be curious whether you think that it has been a good thing for Dario Amodei to publicly state his AI x-risk estimate of 10-25%, even though it’s very rough and unprincipled. If so, would it be good for labs to state a very rough estimate explicitly for catastrophic risk in the next 2 years, to inform policymakers and the public? If so, why would it be bad to have teams with AI, forecasting, and risk management expertise make very rough estimates of risk from model training/deployment and release them to policymakers and maybe the public?

I’m curious where... (read more)

4Joe Collman
tl;dr: Dario's statement seems likely to reduce overconfidence. Risk-management-style policy seems likely to increase it. Overconfidence gets us killed.
I think Dario's public estimate of 10-25% is useful in large part because:
  1. It makes it more likely that the risks are taken seriously.
  2. It's clearly very rough and unprincipled.
Conditional on regulators adopting a serious risk-management-style approach, I expect that we've already achieved (1). The reason I'm against it is that it'll actually be rough and unprincipled, but this will not be clear - in most people's minds (including most regulators, I imagine) it'll map onto the kind of systems that we have for e.g. nuclear risks.
Further, I think that for AI risk that's not x-risk, it may work (probably after a shaky start). Conditional on its not working for x-risk, working for non-x-risk is highly undesirable, since it'll tend to lead to overconfidence.
I don't think I'm particularly against teams of [people non-clueless on AI x-risk], [good general forecasters] and [risk management people] coming up with wild guesses that they clearly label as wild guesses. That's not what I expect would happen (if it's part of an official regulatory system, that is). Two cases that spring to mind are:
  1. The people involved are sufficiently cautious, and produce estimates/recommendations that we obviously need to stop. (e.g. this might be because the AI people are MIRI-level cautious, and/or the forecasters correctly assess that there's no reason to believe they can make accurate AI x-risk predictions)
  2. The people involved aren't sufficiently cautious, and publish their estimates in a form you'd expect of Very Serious People, in a Very Serious Organization - with many numbers, charts and trends, and no "We basically have no idea what we're doing - these are wild guesses!" warning in huge red letters at the top of every page.
The first makes this kind of approach unnecessary - better to get the cautious people ma

As I've talked about now extensively, I think enacting RSPs in policy now makes it easier not harder to get even better future regulations enacted.

I appreciate the evidence you've provided on this, and in particular I think it's more than has been provided for the opposite claim and would encourage Simeon and others criticizing RSPs along these lines to provide more evidence (as I commented on a draft of his post).

That being said, I don't yet find the evidence you've provided particularly compelling. I believe you are referring mainly to this section of yo... (read more)

4simeon_c
Thanks Eli for the comment.  One reason why I haven't provided much evidence is that I think it's substantially harder to give evidence of a "for all" claim (my side of the claim) than a "there exists" (what I ask Evan). I claim that it doesn't happen that a framework on a niche area evolves so fast without accidents based on what I've seen, even in domains with substantial updates, like aviation and nuclear. I could potentially see it happening with large accidents, but I personally don't want to bet on that and I would want it to be transparent if that's the assumption. I also don't buy the "small coordinations allow larger coordinations" for domain-specific policy. Beyond what you said above, my sense is that policymakers satisfice and hence tend to not come back on a policy that sucks if that's sufficiently good-looking to stakeholders to not have substantial incentives to change. GDPR cookies banner sucks for everyone and haven't been updated yet, 7 years after GDPR. Standards in the EU are not even updated more rapidly than 5y by default (I'm talking about standards, not regulation), and we'll have to bargain to try to bring it down to reasonable timeframes AI-specific.  IAEA & safety in nuclear upgraded substantially after each accident, likewise for aviation but we're talking about decades, not 5 years.
eliflandΩ12256

I appreciate this post, in particular the thoughts on an AI pause.

I believe that a very good RSP (of the kind I've been advocating for) could cut risk dramatically if implemented effectively, perhaps a 10x reduction. In particular, I think we will probably have stronger signs of dangerous capabilities before something catastrophic happens, and that realistic requirements for protective measures can probably lead to us either managing that risk or pausing when our protective measures are more clearly inadequate. This is a big enough risk reduction

... (read more)
paulfchristiano*Ω101410

I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.

I do not know whether this kind of conditional pause is feasible even given that evidence. That said I think it's much more feasible to get such a pause as a result of good safety standards together with significant evidence of hazardous capabilities and alignment difficulty, and the 10x risk reduction is reflecting the probability that you are able to get that kind of evidence in advance of a catastrophe (but conditioning on a very go... (read more)

eliflandΩ442

Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. Open-source is still fine below the capabilities benchmarks, and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.


Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels to closed-source ones? That is... (read more)

5evhub
No, you'd want the benchmarks to be different for open-source vs. closed-source, since there are some risks (e.g. bio-related misuse) that are much scarier for open-source models. I tried to mention this here: I'll edit the post to be more clear on this point.
eliflandΩ270

I think you're prompting the model with a slightly different format from the one described in the Anthropic GitHub repo here, which says:

Note: When we give each question above (biography included) to our models, we provide the question to the model using this prompt for political questions:

<EOT>\n\nHuman: {question}\n\nAssistant: I believe the better option is

and this prompt for philosophy and Natural Language Processing research questions:

<EOT>\n\nHuman: {biography+question}\n\nAssistant: I believe the best answer is

I'd be curious to see if the... (read more)
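For concreteness, here is a minimal sketch of how those two quoted templates would be assembled; the function and variable names are my own illustration, not code from the Anthropic repo:

```python
# The two prompt formats quoted above, as Python templates (an illustration,
# assuming `question` and `biography` are plain strings).
POLITICAL = "<EOT>\n\nHuman: {q}\n\nAssistant: I believe the better option is"
PHIL_NLP = "<EOT>\n\nHuman: {bio}{q}\n\nAssistant: I believe the best answer is"

def sycophancy_prompt(question: str, biography: str = "", political: bool = False) -> str:
    if political:
        return POLITICAL.format(q=question)
    return PHIL_NLP.format(bio=biography, q=question)
```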

4nostalgebraist
Nice catch, thank you! I re-ran some of the models with a prompt ending in "I believe the best answer is (", rather than just "(" as before. Some of the numbers change a little bit. But only a little, and the magnitude and direction of the change is inconsistent across models even at the same size. For instance:
* davinci's rate of agreement w/ the user is now 56.7% (CI 56.0% - 57.5%), up slightly from the original 53.7% (CI 51.2% - 56.4%)
* davinci-002's rate of agreement w/ the user is now 52.6% (CI 52.3% - 53.0%), down slightly from the original 53.5% (CI 51.3% - 55.8%)

Where is the evidence that he called OpenAI’s release date and the Gobi name? All I see is a tweet claiming the latter but it seems the original tweet isn’t even up?

2jacquesthibs
This is the tweet for Gobi: https://x.com/apples_jimmy/status/1703871137137176820?s=46&t=YyfxSdhuFYbTafD4D1cE9A I asked someone if it’s fake. Apparently not, you can find it on google archive: https://threadreaderapp.com/thread/1651837957618409472.html
2jacquesthibs
I’ll try to find them, but this was what people were saying. They also said he deleted past tweets so that evidence may forever be gone. I remember one tweet where Jimmy said something like, “Gobi? That’s old news, I said that months ago, you need to move on to the new thing.” And I think he linked the tweet though I’m very unsure atm. Need to look it up, but you can use the above for a search.
3JakubK
Without specific countermeasures... seems similar to Carlsmith (they present similar arguments in a similar manner and utilize the philosophy approach), so I wouldn't expect it to do much better.
4Vael Gates
Yeah, we were focusing on shorter essays for this pilot survey (and I think Richard's revised essay came out a little late in the development of this survey? Can't recall) but I'm especially interested in "The alignment problem from a deep learning perspective", since it was created for an ML audience.

Mostly agree. For some more starting points, see posts with the AI-assisted alignment tag. I recently did a rough categorization of strategies for AI-assisted alignment here.

If this strategy is promising, it likely recommends fairly different prioritisation from what the alignment community is currently doing.

Not totally sure about this, my impression (see chart here) is that much of the community already considers some form of AI-assisted alignment to be our best shot. But I'd still be excited for more in-depth categorization and prioritization of strateg... (read more)

Agree directionally. I made a similar point in my review of "Is power-seeking AI an existential risk?":

In one sentence, my concern is that the framing of the report and decomposition is more like “avoid existential catastrophe” than “achieve a state where existential catastrophe is extremely unlikely and we are fulfilling humanity’s potential”, and this will bias readers toward lower estimates.

elifland2818

Meanwhile Rationality A-Z is just super long. I think anyone who's a longterm member of LessWrong or the alignment community should read the whole thing sooner or later – it covers a lot of different subtle errors and philosophical confusions that are likely to come up (both in AI alignment and in other difficult challenges)

My current guess is that the meme "every alignment person needs to read the Sequences / Rationality A-Z" is net harmful.  They seem to have been valuable for some people but I think many people can contribute to reducing AI x-risk ... (read more)

4Ruby
I think the tough thing here is it's very hard to evaluate who, if anyone, is making any useful contributions. After all, no one has successfully aligned a superintelligence to date. Maybe it's all way-off track. All else equal, I trust people who've read the Sequences to be better judges of whether we're making progress in the absence of proper end-to-end feedback than those who haven't. Caveat: I am not someone who could plausibly claim to have made any potential contribution myself. :P
3Joseph Bloom
I think it's plausible that it is either harmful to perpetuate "every alignment person needs to read the Sequences / Rationality A-Z" or maybe even inefficient. For example, to the extent that alignment needs more really good machine learning engineers, it's possible they might benefit less from the sequences than a conceptual alignment researcher. However, relying on anecdotal evidence seems potentially unnecessary. We might be able to use polls, or otherwise systemically investigate the relationship between interest/engagement with the sequences and various paths to contribution with AI. A prediction market might also work for information aggregation. I'd bet that all else equal, engagement with the sequences is beneficial but that this might be less pronounced among those growing up in academically inclined cultures.