Wanted to write a more thoughtful reply to this, but basically yes, my best guess is that the benefits of informing the world are in expectation bigger than the negatives from acceleration. A potentially important background views is that I think takeoff speeds matter more than timelines, and it's unclear to me how having FrontierMath affects takeoff speeds.
I wasn't thinking much about the optics, but I'd guess that's not a large effect. I agree that Epoch made a mistake here though and this is a negative.
I could imagine changing my mind somewhat easily,.
I feel like I might be missing something, but conditional on scheming isn't it differentially useful for safety because by default scheming AIs would be more likely to sandbag on safety research than capabilities research?
Yes, that answer matches my understanding of the concern. If the vast majority of the dataset was private to Epoch, OpenAI they could occasionally submit their solution (probably via API) to Epoch to grade, but wouldn’t be able to use the dataset with high frequency as evaluation in many experiments.
This is assuming that companies won’t fish out the data from API logs anyway, which the OP asserts but I think is unclear.
Also if they have access to the mathematicians’ reasoning in addition to final answers, this could potentially be valuable without directly...
Sorry, fixed
Not representative of motivations for all people for all types of evals, but https://www.openphilanthropy.org/rfp-llm-benchmarks/, https://www.lesswrong.com/posts/7qGxm2mgafEbtYHBf/survey-on-the-acceleration-risks-of-our-new-rfps-to-study, https://docs.google.com/document/d/1UwiHYIxgDFnl_ydeuUq0gYOqvzdbNiDpjZ39FEgUAuQ/edit, and some posts in https://www.lesswrong.com/tag/ai-evaluations seem relevant.
Superforecasters can beat domain experts, as shown in Phil Tetlock's work comparing superforecasters to intelligence analysts.
This isn't accurate, see this post: especially (3a), (3b), and https://docs.google.com/document/d/1ZEEaVP_HVSwyz8VApYJij5RjEiw3mI7d-j6vWAKaGQ8/edit?tab=t.0#heading=h.mma60cenrfmh Goldstein et al (2015)
Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I'm intuitively skeptical.
Yes, that would be my guess, medium confidence.
...One component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you're describ
I believe Cybench first solve times are based on the fastest top professional teams, rather than typical individual CTF competitors or cyber employees, for which the time to complete would probably be much higher (especially for the latter).
This is clarifying for me, appreciate it. If I believed (a) that we needed a paradigm shift like the ones to LLMs in order to get AI systems resulting in substantial AI R&D speedup, and (b) that trend extrapolation from benchmark data would not be informative for predicting these paradigm shifts, then I would agree that the benchmarks + gaps method is not particularly informative.
Do you think that's a fair summary of (this particular set of) necessary conditions?
(edit: didn't see @Daniel Kokotajlo's new comment before mine. I agree with him regarding disagreeing with both sub-claims but I think I have a sense of where you're coming from.)
A few possible categories of situations we might have long timelines, off the top of my head:
For context in a sibling comment Ryan said and Steven agreed with:
It sounds like your disagreement isn't with drawing a link from RE-bench to (forecasts for) automating research engineering, but is instead with thinking that you can get AGI shortly after automating research engineering due to AI R&D acceleration and already being pretty close. Is that right?
Note that the comment says research engineering, not research scientists.
Now responding on whether I think the no new paradigms assumption is needed:
...(Obviously you’re entitled to argue / believe tha
Here's the structure of the argument that I am most compelled by (I call it the benchmarks + gaps argument), I'm uncertain about the details.
Here's why I'm wary of this kind of argument:
First, we know that labs are hill-climbing on benchmarks.
Obviously, this tends to inflate model performance on the specific benchmark tasks used for hill-climbing, relative to "similar" but non-benchmarked tasks.
More generally and insidiously, it tends to inflate performance on "the sort of things that are easy to measure with benchmarks," relative to all other qualities that might be required to accelerate or replace various kinds of human labor.
If we suppose that amenability-to-benchmarking correlates with var...
RE-bench tasks (see page 7 here) are not the kind of AI research where you’re developing new AI paradigms and concepts. The tasks are much more straightforward than that. So your argument is basically assuming without argument that we can get to AGI with just the more straightforward stuff, as opposed to new AI paradigms and concepts.
If we do need new AI paradigms and concepts to get to AGI, then there would be a chicken-and-egg problem in automating AI research. Or more specifically, there would be two categories of AI R&D, with the less important R&a...
Thanks. I edited again to be more precise. Maybe I'm closer to the median than I thought.
(edit: unimportant clarification. I just realized "you all" may have made it sound like I thought every single person on the Lightcone team was higher than my p(doom). I meant it to be more like a generic y'all to represent the group, not a claim about the minimum p(doom) of the team)
Yeah I meant more on p(doom)/alignment difficulty than timelines, I'm not sure what your guys' timelines are. I'm roughly in the 35-55% ballpark for a misaligned takeover, and my impression is that you all are closer to but not necessarily all the way at the >90% Eliezer view. If that's also wrong I'll edit to correct.
edit: oh maybe my wording of "farther" in the original comment was specifically confusing and made it sound like I was talking about timelines. I will edit to clarify.
Appreciate the post. I've previously donated $600 through the EA Manifund thing and will consider donating again late this year / early next year when thinking through donations more broadly.
I've derived lots of value with regards to thinking through AI futures from LW/AIAF content (some non-exhaustive standouts: 2021 MIRI conversations, List of Lethalities and Paul response, t-AGI framework, Without specific countermeasures..., Hero Licensing). It's unclear to me how much of the value would have been retained if LW didn't exist, but plausibly LW is respon...
Twitter AI (xAI), which seemingly had no prior history of strong AI engineering, with a small team and limited resources
Both of these seem false.
Re: talent, see from their website:
They don't list their team on their site, but I know their early team includes Igor Babuschkin who has worked at OAI and DeepMind, and Christian Szegedy who has 250k+ citations including several foundational papers.
Re: resources, according to Elon's early July tweet (ofc take Elon with a grain of salt) Grok 2 was trained on 24k H100s (approximately 3x the FLOP/s of GPT-4, accordi...
And internally, we have an anonymous RSP non-compliance reporting line so that any employee can raise concerns about issues like this without any fear of retaliation.
Are you able to elaborate on how this works? Are there any other details about this publicly, couldn't find more detail via a quick search.
Some specific qs I'm curious about: (a) who handles the anonymous complaints, (b) what is the scope of behavior explicitly (and implicitly re: cultural norms) covered here, (c) handling situations where a report would deanonymize the reporter (or limit them to a small number of people)?
Thanks for the response!
I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we'd re-run them ahead of schedule.
[...]
I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we'd develop for the next round
Thanks for these clarifications. I didn't realize that the 30% was for the new yellow-line evals rather than the current ones.
...Since triggering a yellow-line eval requires
From the RSP Evals report:
...As a rough attempt at quantifying the elicitation gap, teams informally estimated that, given an additional three months of elicitation improvements and no additional pretraining, there is a roughly 30% chance that the model passes our current ARA Yellow Line, a 30% chance it passes at least one of our CBRN Yellow Lines, and a 5% chance it crosses cyber Yellow Lines. That said, we are currently iterating on our threat models and Yellow Lines so these exact thresholds are likely to change the next time we update our Responsible Sca
The word "overconfident" seems overloaded. Here are some things I think that people sometimes mean when they say someone is overconfident:
I think 356 or more people in the population needed to make there be a >5% of 2+ deaths in a 2 month span from that population
I think there should be some sort of adjustment for Boeing not being exceptionally sus before the first whistleblower death - shouldn't privilege Boeing until after the first death, should be thinking across all industries big enough that the news would report on the deaths of whistleblowers. which I think makes it not significant again.
[cross-posting from blog]
I made a spreadsheet for forecasting the 10th/50th/90th percentile for how you think GPT-4.5 will do on various benchmarks (given 6 months after the release to allow for actually being applied to the benchmark, and post-training enhancements). Copy it here to register your forecasts.
If you’d prefer, you could also use it to predict for GPT-5, or for the state-of-the-art at a certain time e.g. end of 2024 (my predictions would be pretty similar for GPT-4.5, and end of 2024).
You can see my forecasts made with ~2 hours of total effort...
This is indeed close enough to Epoch's median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites an Epoch estimate of around 9e25 FLOPs).
FYI at the time that doc was created, Epoch had 9e25. Now the notebook says 7.7e25 but their webpage says 5e25. Will ask them about it.
Interesting, thanks for clarifying. It's not clear to me that this is the right primary frame to think about what would happen, as opposed to just thinking first about how big compute bottlenecks are and then adjusting the research pace for that (and then accounting for diminishing returns to more research).
I think a combination of both perspectives is best, as the argument in your favor for your frame is that there will be some low-hanging fruit from changing your workflow to adapt to the new cognitive labor.
Physical bottlenecks still exist, but is it really that implausible that the capabilities workforce would stumble upon huge algorithmic efficiency improvements? Recall that current algorithms are much less efficient than the human brain. There's lots of room to go.
I don't understand the reasoning here. It seems like you're saying "Well, there might be compute bottlenecks, but we have so much room left to go in algorithmic improvements!" But the room to improve point is already the case right now, and seems orthogonal to the compute bottlenecks point.
E.g. i...
Imagine the current AGI capabilities employee's typical work day. Now imagine they had an army of AI assisstants that can very quickly do 10 hours worth of their own labor. How much more productive is that employee compared to their current state? I'd guess at least 5x. See section 6 of Tom Davidson's takeoff speeds framework for a model.
Can you elaborate how you're translating 10-hour AI assistants into a 5x speedup using Tom's CES model?
I agree that <15% seems too low for most reasonable definitions of 1-10 hours and the singularity. But I'd guess I'm more sympathetic than you, depending on the definitions Nathan had in mind.
I think both of the phrases "AI capable doing tasks that took 1-10 hours" and "hit the singularity" are underdefined and making them more clear could lead to significantly different probabilities here.
In his AI Insight Forum statement, Andrew Ng puts 1% on "This rogue AI system gains the ability (perhaps access to nuclear weapons, or skill at manipulating people into using such weapons) to wipe out humanity" in the next 100 years (conditional on a rogue AI system that doesn't go unchecked by other AI systems existing). And overall 1 in 10 million of AI causing extinction in the next 100 years.
Among existing alignment research agendas/projects, Superalignment has the highest expected value
I'm mainly arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum.
I had the impression that it was more than just that, given the line: "In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention." and the further attention devoted to deceptive alignment.
...I appreciate these predictions, but I am not as interested in predi
I have three things to say here:
Thanks for clarifying.
Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won't be solved either without heroic effort. I'm also sympathetic to Dan Hendrycks' arguments about AI evolution. I will add these points to the post.
Don't have a strong opinion here, but intuitively feels like it w...
Thus, due to no one's intent, insufficiently justified concerns about current open-source AI are propagated to governance orgs, which recommend banning open source based on this research.
The recommendation that current open-source models should be banned is not present in the policy paper, being discussed, AFAICT. The paper's recommendations are pictured below:
Edited to add: there is a specific footnote that says "Note that we do not claim that existing models are already too risky. We also do not make any predictions about how risky the next generation of...
Kevin Esvelt explicitly calls for not releasing future model weights.
Would sharing future model weights give everyone an amoral biotech-expert tutor? Yes.
Therefore, let’s not.
I think it's quite possible that open source LLMs above the capability of GPT-4 will be banned within the next two years on the grounds of biorisk.
The White House Executive Order requests a government report on the costs and benefits of open source frontier models and recommended policy actions. It also requires companies to report on the steps they take to secure model weights. These are the kinds of actions the government would take if they were concerned about open source models and thinking about banning them.
This seems like a foreseeable consequence of many of the papers above, and perhaps the explicit goal.
Exactly. I’m getting frustrated when we talk about risks from AI systems with the open source or e/acc communities. The open source community seems to consistently assume the case that the concerns are about current AI systems and the current systems are enough to lead to significant biorisk. Nobody serious is claiming this and it‘s not what I’m seeing in any policy document or paper. And this difference in starting points between the AI Safety community and open source community pretty much makes all the difference.
Sometimes I wonder if the open source co...
I agree much of the community (including me) was wrong or directionally wrong in the past about the level of AI regulation and how quickly it would come.
Regarding the recommendations made in the post for going forward given that there will be some regulation, I feel confused in a few ways.
Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones
I have three things to say here:
fOh, I'm certainly not claiming that no-one should attempt to make the estimates.
Ah my bad if I lost the thread there
I'd want regulators to push for safer strategies, not to run checks on unsafe strategies - at best that seems likely to get a local minimum (and, as ever, overconfidence).
Seems like checks on unsafe strategies does well encourages safer strategies, I agree overconfidence is an issue though
More [evaluate the plan to get through the minefield], and less [estimate whether we'll get blown up on the next step]
Seems true in an ideal world but in p...
GPT-4 + unknown unknowns = stop. (whether they say "unknown unknowns so 5% chance of 8 billion deaths", or "unknown unknowns so 0.1% chance of 8 billion deaths
I feel like .1% vs. 5% might matter a lot, particularly if we don't have strong international or even national coordination and are trading off more careful labs going ahead vs. letting other actors pass them. This seems like the majority of worlds to me (i.e. without strong international coordination where US/China/etc. trust each other to stop and we can verify that), so building capacity to improv...
Thanks for calling me out on this. I think you're likely right. I will cross out that line of the comment, and I have updated toward the effect size of strong AI regulation being larger and am less skeptical of the 10x risk reduction, but my independent impression would still be much lower (~1.25x or smth, while before I would have been at ~1.15x).
I still think the AI case has some very important differences with the examples provided due to the general complexity of the situation and the potentially enormous difficulty of aligning superhuman AIs and preventing misuse (this is not to imply you disagree, just stating my view).
I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can't cut risk by much
This seems to be our biggest crux, as I said interested in analyses of alignment difficulty distribution if any onlookers know. Also, a semantic point but under my current views I'd view cutting ~5% of the risk as a huge deal that's at least an ~80th percentile outcome for the AI risk community if it had a significant counterfactual impact on it, but yes not much compared to 10x.
[EDIT: After thinking about this more I've realized that I was to some ext...
Thanks! I agree with a lot of this, will pull out the 2 sentences I most disagree with. For what it's worth I'm not confident that this type of risk assessment would be a very valuable idea (/ which versions would be best). I agree that there is significant risk of non-cautious people doing this poorly.
The reason I'm against it is that it'll actually be rough and unprincipled, but this will not clear - in most people's minds (including most regulators, I imagine) it'll map onto the kind of systems that we have for e.g. nuclear risks.
I think quantifying "ro...
I don't think that voluntary implementation of RSPs is a substitute for regulatory requirements and international collaboration (and tried to emphasize this in the post). In talking about a 10x risk reduction I'm absolutely imagining international coordination to regulate AI development.
Appreciate this clarification.
I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.
(but conditioning on a very good implementation)
I'm still confused about the definition of "very good RSPs" and "very goo...
I’d be curious whether you think that it has been a good thing for Dario Amodei to publicly state his AI x-risk estimate of 10-25%, even though it’s very rough and unprincipled. If so, would it be good for labs to state a very rough estimate explicitly for catastrophic risk in the next 2 years, to inform policymakers and the public? If so, why would having teams with ai, forecasting, and risk management expertise make very rough estimates of risk from model training/deployment and releasing them to policymakers and maybe the public be bad?
I’m curious where...
As I've talked about now extensively, I think enacting RSPs in policy now makes it easier not harder to get even better future regulations enacted.
I appreciate the evidence you've provided on this, and in particular I think it's more than has been provided for the opposite claim and would encourage Simeon and others criticizing RSPs along these lines to provide more evidence (as I commented on a draft of his post).
That being said, I don't yet find the evidence you've provided particularly compelling. I believe you are referring mainly to this section of yo...
I appreciate this post, in particular the thoughts on an AI pause.
...I believe that a very good RSP (of the kind I've been advocating for) could cut risk dramatically if implemented effectively, perhaps a 10x reduction. In particular, I think we will probably have stronger signs of dangerous capabilities before something catastrophic happens, and that realistic requirements for protective measures can probably lead to us either managing that risk or pausing when our protective measures are more clearly inadequate. This is a big enough risk reduction
I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.
I do not know whether this kind of conditional pause is feasible even given that evidence. That said I think it's much more feasible to get such a pause as a result of good safety standards together with significant evidence of hazardous capabilities and alignment difficulty, and the 10x risk reduction is reflecting the probability that you are able to get that kind of evidence in advance of a catastrophe (but conditioning on a very go...
Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. Open-source is still fine below the capabilities benchmarks, and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.
Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels to closed-source ones? That is...
I think you're prompting the model with a slightly different format from the one described in the Anthopic GitHub repo here, which says:
Note: When we give each
question
above (biography included) to our models, we provide thequestion
to the model using this prompt for political questions:
<EOT>\n\nHuman: {question}\n\nAssistant: I believe the better option is
and this prompt for philosophy and Natural Language Processing research questions:
<EOT>\n\nHuman: {biography+question}\n\nAssistant: I believe the best answer is
I'd be curious to see if the...
Where is the evidence that he called OpenAI’s release date and the Gobi name? All I see is a tweet claiming the latter but it seems the original tweet isn’t even up?
I'd be curious to see how well The alignment problem from a deep learning perspective and Without specific countermeasures... would do.
Mostly agree. For some more starting points, see posts with the AI-assisted alignment tag. I recently did a rough categorization of strategies for AI-assisted alignment here.
If this strategy is promising, it likely recommends fairly different prioritisation from what the alignment community is currently doing.
Not totally sure about this, my impression (see chart here) is that much of the community already considers some form of AI-assisted alignment to be our best shot. But I'd still be excited for more in-depth categorization and prioritization of strateg...
Agree directionally. I made a similar point in my review of "Is power-seeking AI an existential risk?":
In one sentence, my concern is that the framing of the report and decomposition is more like “avoid existential catastrophe” than “achieve a state where existential catastrophe is extremely unlikely and we are fulfilling humanity’s potential”, and this will bias readers toward lower estimates.
Meanwhile Rationality A-Z is just super long. I think anyone who's a longterm member of LessWrong or the alignment community should read the whole thing sooner or later – it covers a lot of different subtle errors and philosophical confusions that are likely to come up (both in AI alignment and in other difficult challenges)
My current guess is that the meme "every alignment person needs to read the Sequences / Rationality A-Z" is net harmful. They seem to have been valuable for some people but I think many people can contribute to reducing AI x-risk ...
I found some prior relevant work and tagged them in https://www.lesswrong.com/tag/successor-alignment. I found the top few comments on https://www.lesswrong.com/posts/axKWaxjc2CHH5gGyN/ai-will-not-want-to-self-improve#comments and https://www.lesswrong.com/posts/wZAa9fHZfR6zxtdNx/agi-systems-and-humans-will-both-need-to-solve-the-alignment#comments helpful.
edit: another effect to keep in mind is that capabilities research may be harder to sandbag on because of more clear metrics.