
This is a special post for quick takes by Nikola Jurkovic. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

DeepSeek R1 being #1 on Humanity's Last Exam is not strong evidence that it's the best model, because the questions were adversarially filtered against o1, Claude 3.5 Sonnet, Gemini 1.5 Pro, and GPT-4o. If they weren't filtered against those models, I'd bet o1 would outperform R1.

To ensure question difficulty, we automatically check the accuracy of frontier LLMs on each question prior to submission. Our testing process uses multi-modal LLMs for text-and-image questions (GPT-4O, GEMINI 1.5 PRO, CLAUDE 3.5 SONNET, O1) and adds two non-multi-modal models (O1MINI, O1-PREVIEW) for text-only questions. We use different submission criteria by question type: exact-match questions must stump all models, while multiple-choice questions must stump all but one model to account for potential lucky guesses.
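The quoted submission rule can be written as a small predicate. This is a minimal sketch of my reading of the quote, not code from the benchmark authors. The function name, the question-type labels, and the reading of "stump all but one model" as "at most one model answers correctly" are all assumptions.

```python
def passes_difficulty_filter(question_type: str, model_correct: list[bool]) -> bool:
    """Return True if a question is hard enough to be submitted.

    question_type: "exact_match" or "multiple_choice" (hypothetical labels).
    model_correct: one entry per frontier model tested; True if that model
    answered the question correctly on its single attempt.
    """
    n_correct = sum(model_correct)
    if question_type == "exact_match":
        # Exact-match questions must stump every tested model.
        return n_correct == 0
    if question_type == "multiple_choice":
        # Multiple-choice questions may be answered correctly by at most one
        # model, to allow for a lucky guess.
        return n_correct <= 1
    raise ValueError(f"unknown question type: {question_type}")
```

Because every surviving question was, by construction, answered incorrectly by the tested models on that attempt, their scores on the finished benchmark are pushed down relative to models that were not part of the filter.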

If I were writing the paper I would have added either a footnote or an additional column to Table 1 getting across that GPT-4o, o1, Gemini 1.5 Pro, and Claude 3.5 Sonnet were adversarially filtered against. Most people just see Table 1 so it seems important to get across.

Yes, this is point #1 from my recent Quick Take. Another interesting point is that there are no confidence intervals on the accuracy numbers - it looks like they only ran the questions once in each model, so we don't know how much random variation might account for the differences between accuracy numbers. [Note added 2-3-25: I'm not sure why it didn't make the paper, but Scale AI does report confidence intervals on their website.] 
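For a sense of scale, a binomial confidence interval can be attached to a single-run accuracy number. This is a sketch using the Wilson score interval (my choice of method, not something the paper specifies), and the 94-out-of-1000 example is hypothetical. It only captures sampling variation over questions, not run-to-run variation from re-querying the same model.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an accuracy of correct / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical model: 94 correct answers out of 1000 questions, one run each.
low, high = wilson_interval(94, 1000)
print(f"{low:.3f} to {high:.3f}")
```

With intervals of a few percentage points on each side, small gaps between models' accuracies on a benchmark of this size may not be distinguishable from noise.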

yc
Terminology question - does adversarial filtering mean the same thing as decontamination? 
Lukas_Gloor
In order to submit a question to the benchmark, people had to run it against the listed LLMs; the question would only advance to the next stage once the LLMs used for this testing got it wrong. 

The redesigned OpenAI Safety page seems to imply that "the issues that matter most" are:

  • Child Safety
  • Private Information
  • Deep Fakes
  • Bias
  • Elections


redesigned

What did it used to look like?

There also used to be a page for Preparedness: https://web.archive.org/web/20240603125126/https://openai.com/preparedness/. Now it redirects to the safety page above.

(Same for Superalignment but that's less interesting: https://web.archive.org/web/20240602012439/https://openai.com/superalignment/.)

davekasten
I am not a fan, but it is worth noting that these are the issues that many politicians already bring up if they're unfamiliar with the more catastrophic risks. The only one missing is job loss. So while this choice by OpenAI sucks, it sort of usefully represents a social fact about the policy waters they swim in.
aogara
I’m surprised they list bias and disinformation, as I doubt those concerns will be popular with the new administration. (Maybe this is a galaxy brained attempt to make AI safety seem left-coded, but I doubt it. Seems more likely that x-risk focused people left the company while traditional AI ethics people stuck around and rewrote the website.)
davekasten
Without commenting on any strategic astronomy and neurology, it is worth noting that "bias", at least, is a major concern of the new administration (e.g., the Republican chair of the House Financial Services Committee is actually extremely worried about algorithmic bias being used for housing and financial discrimination and has given speeches about this).  
ChristianKl
The page does not seem to be directed at what's politically advantageous. The Trump administration, which fights DEI, does not look favorably on the mission to prevent AI from reinforcing stereotypes, even if those stereotypes are true. "Fighting election misinformation" is similarly a keyword that likely invites skepticism from the Trump administration. They just shut down USAID, and its investment in "combating misinformation" is one of the reasons for that. It seems more likely that they hired a bunch of woke and deep state people into their safety team and that this reflects the priorities of those people.
davekasten
Huh?  "fighting election misinformation" is not a sentence on this page as far as I can tell. And if you click through to the election page, you will see that the elections content is them praising a bipartisan bill backed by some of the biggest pro-Trump senators.  
Maxwell Peterson
The Elections panel on OP’s image says “combat disinformation”, so while you’re technically right, I think Christian’s “fighting election misinformation” rephrasing is close enough to make no difference.
ChristianKl
You are right, the wording is even worse. It says "Partnering with governments to fight misinformation globally". That covers more than just "election misinformation". I just tested it: ChatGPT is willing to answer "Tell me about the latest announcement of the trump administration about cutting USAID funding?" while Gemini isn't willing to answer that question, so in practice their policy isn't as bad as Gemini's. It still sounds different from what Elon Musk advocates as "truth-aligned" AI. Lobbyists should be able to use AI to inform themselves about proposed laws. If you asked David Sacks, as the person who coordinates AI policy, I'm very certain that he would support Elon Musk's idea that AI should help people learn the truth about political questions. If they wanted to appeal to the current administration, they could say something about the importance of AI telling truthful information and not misleading the user, instead of speaking about "fighting misinformation".
ZY
I am a bit confused about why this is "disappointing" to people; maybe because the list is far from complete? I would also be very concerned if OpenAI does not actually care about these but only did this for PR value (it seems some other companies could do this). Otherwise, these are concrete risks that are happening now, actively harming people, and need to be addressed. These practices also set good examples and precedents for regulation and for developing with a safety mindset. Linking a few resources:

child safety:
* https://cyber.fsi.stanford.edu/news/ml-csam-report
* https://www.iwf.org.uk/media/nadlcb1z/iwf-ai-csam-report_update-public-jul24v13.pdf

private information/PII:
* https://arxiv.org/html/2410.06704v1
* https://arxiv.org/abs/2310.07298

deep fakes:
* https://www.pbs.org/newshour/world/in-south-korea-rise-of-explicit-deepfakes-wrecks-womens-lives-and-deepens-gender-divide
* https://www.nytimes.com/2024/09/03/world/asia/south-korean-teens-deepfake-sex-images.html

bias:
* https://arxiv.org/html/2405.01724v1
* https://arxiv.org/pdf/2311.18140

At a talk at UTokyo, Sam Altman said (clipped here and here):

  • “We’re doing this new project called Stargate which has about 100 times the computing power of our current computer”
  • “We used to be in a paradigm where we only did pretraining, and each GPT number was exactly 100x, or not exactly but very close to 100x and at each of those there was a major new emergent thing. Internally we’ve gone all the way to about a maybe like a 4.5”
  • “We can get performance on a lot of benchmarks [using reasoning models] that in the old world we would have predicted wouldn’t have come until GPT-6, something like that, from models that are much smaller by doing this reinforcement learning.”
  • “The trick is when we do it this new way [using RL for reasoning], it doesn’t get better at everything. We can get it better in certain dimensions. But we can now more intelligently than before say that if we were able to pretrain a much bigger model and do [RL for reasoning], where would it be. And the thing that I would expect based off of what we’re seeing with a jump like that is the first bits or sort of signs of life on genuine new scientific knowledge.”
  • “Our very first reasoning model was a top 1
...

Wow, that is a surprising amount of information. I wonder how reliable we should expect this to be.

Thane Ruthenis
Is it? What of this is new? To my eyes, the only remotely new thing is the admission that "there’s a lot of research still to get to [a coding agent]".

The estimate of the compute of their largest version ever (which is a very helpful way to phrase it) at only <=50x GPT-4 is quite relevant to many discussions (props to Nesov) and something Altman probably shouldn't've said.

The estimate of test-time compute at 1000x effective-compute is confirmation of looser talk.

The scientific research part is of uncertain importance but we may well be referring back to this statement a year from now.

Thane Ruthenis
Good point regarding GPT-"4.5". I guess I shouldn't have assumed that everyone else has also read Nesov's analyses and immediately (accurately) classified them as correct.
ryan_greenblatt
It's just surprising that Sam is willing to say/confirm all of this given that AI companies normally at least try to be secretive.
Davidmanheim
He says things that are advantageous, and sometimes they are even true. The benefit of not being known to be a liar usually keeps the correlation between claims and truth positive, but in his case it seems that ship has sailed. (Checkably false claims are still pretty rare, and this may be one of those.)
Hopenope
Would you update your timelines if he is telling the truth?
Mark Schröder
That seems to imply that:
* If current levels are around GPT-4.5, the compute increase from GPT-4 would be either 10× or 50×, depending on whether we use a log or linear scaling assumption.
* The completion of Stargate would then push OpenAI’s compute to around GPT-5.5 levels. However, since other compute expansions (e.g., Azure scaling) are also ongoing, they may reach this level sooner.
* Recent discussions have suggested that better base models are a key enabler for the current RL approaches, rather than major changes in RL architecture itself. This suggests that once the base model shifts from a GPT-4o-scale model to a GPT-5.5-scale model, there could be a strong jump in capabilities.
* It’s unclear how much of a difference it makes to train the new base model (GPT-5) on reasoning traces from o3/o4 before applying RL. However, by the time the GPT-5-scale run begins, there will likely be a large corpus of filtered, high-quality reasoning traces, further edited for clarity, that will be incorporated into pretraining.
* The change to a better base model for RL might enable longer-horizon agentic work as an "emergent thing"; combined with superhuman coding skills, this might already be quite unsafe.
* GPT-5’s reasoning abilities may be significantly more domain-specific than prior models.

The median AGI timeline of more than half of METR employees is before the end of 2030.

(AGI is defined as 95% of fully remote jobs from 2023 being automatable.)

lukeprog
Are you able to report the median AGI timeline for ~all METR employees? Or are you just saying that the "more than half" is how many responded to the survey question?
Nikola Jurkovic
The methodology wasn't super robust so I didn't want to make it sound overconfident, but my best guess is that around 80% of METR employees have sub 2030 median timelines.
Mateusz Bagiński
Source?

I'm interning there and I conducted a poll.

I am assuming this is the result of Nikola asking some METR employees (he was hanging out in Berkeley recently).

[anonymous]
@Nikola Jurkovic I'd be interested in timeline estimates for something along the lines of "AI that substantially increases AI R&D". Not exactly sure what the right way to operationalize this is, but something that says "if there is a period of Rapid AI-enabled AI Progress, this is when we think it would occur." (I don't really love the "95% of fully remote jobs could be automated" frame, partially because I don't think it captures many of the specific domains we care about (e.g., AI-enabled R&D, other natsec-relevant capabilities) and partly because I suspect people have pretty different views of how easy or hard remote jobs are. Like, some people think that lots of remote jobs today are basically worthless and could already be automated, whereas others disagree. If the purpose of the forecasting question is to get a sense of how powerful AI will be, the disagreements about "how much do people actually contribute in remote jobs" seem like unnecessary noise.) (Nitpicks aside, this is cool and I appreciate you running this poll!)
Alice Blair
Did you collect the data for their actual median timelines, or just its position relative to 2030? If you collected higher-resolution data, are you able to share it somewhere?

I wish someone ran a study finding what human performance on SWE-bench is. There are ways to do this for around $20k: If you try to evaluate on 10% of SWE-bench (so around 200 problems), with around 1 hour spent per problem, that's around 200 hours of software engineer time. So paying at $100/hr and one trial per problem, that comes out to $20k. You could possibly do this for even less than 10% of SWE-bench but the signal would be noisier.
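The cost estimate, plus the rough statistical precision it buys, as a quick sketch. The hourly rate and hours per problem are the assumptions from the paragraph above; the worst-case 50% solve rate used for the standard error is my own assumption:

```python
import math


def study_cost(n_problems: int, hours_per_problem: float, rate_per_hour: float, trials: int = 1) -> float:
    """Total cost of paying engineers to attempt each problem `trials` times."""
    return n_problems * hours_per_problem * rate_per_hour * trials


def standard_error(p: float, n: int) -> float:
    """Binomial standard error of a solve rate p estimated from n problems."""
    return math.sqrt(p * (1 - p) / n)


print(study_cost(200, 1.0, 100.0))           # 20000.0
print(round(standard_error(0.5, 200), 3))    # 0.035
print(round(standard_error(0.5, 100), 3))    # 0.05
```

So with 200 problems the human solve rate would be pinned down to within roughly ±7 percentage points (about two standard errors), and halving the sample to cut costs widens that to about ±10.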

The reason I think this would be good is because SWE-bench is probably the closest thing we have to a measure of how good LLMs are at software engineering and AI R&D related tasks, so being able to better forecast the arrival of human-level software engineers would be great for timelines/takeoff speed models.

This seems mostly right to me and I would appreciate such an effort.

One nitpick:

The reason I think this would be good is because SWE-bench is probably the closest thing we have to a measure of how good LLMs are at software engineering and AI R&D related tasks

I expect this will improve over time and that SWE-bench won't be our best fixed benchmark in a year or two. (SWE bench is only about 6 months old at this point!)

Also, I think if we put aside fixed benchmarks, we have other reasonable measures.

Nikola Jurkovic
I expect us to reach a level where at least 40% of the ML research workflow can be automated by the time we saturate (reach 90%) on SWE-bench. I think we'll be comfortably inside takeoff by that point (software progress at least 2.5x faster than right now). Wonder if you share this impression?

It seems super non-obvious to me when SWE-bench saturates relative to ML automation. I think the SWE-bench task distribution is very different from the ML research workflow in a variety of ways.

Also, I think that human expert performance on SWE-bench is well below 90% if you use the exact rules they use in the paper. I messaged you explaining why I think this. The TLDR: it seems like test cases are often implementation dependent and the current rules from the paper don't allow looking at the test cases.

Nikola Jurkovic
I'd now change the numbers to around 15% automation and 25% faster software progress once we reach 90% on Verified. I expect that to happen by end of May median (but I'm still uncertain about the data quality and upper performance limit).  (edited to change Aug to May on 12/20/2024)

Sam Altman apparently claims OpenAI doesn't plan to do recursive self improvement

Nate Silver's new book On the Edge contains interviews with Sam Altman. Here's a quote from Chapter  that stuck out to me (bold mine):

Yudkowsky worries that the takeoff will be faster than what humans will need to assess the situation and land the plane. We might eventually get the AIs to behave if given enough chances, he thinks, but early prototypes often fail, and Silicon Valley has an attitude of “move fast and break things.” If the thing that breaks is civilization, we won’t get a second try.

Footnote: This is particularly worrisome if AIs become self-improving, meaning you train an AI on how to make a better AI. Even Altman told me that this possibility is “really scary” and that OpenAI isn’t pursuing it.

I'm pretty confused about why this quote is in the book. OpenAI has never (to my knowledge) made public statements about not using AI to automate AI research, and my impression was that automating AI research is explicitly part of OpenAI's plan. My best guess is that there was a misunderstanding in the conversation between Silver and Altman.


I looked a bit through OpenAI's comms to find ...

Sodium

I wouldn't trust an Altman quote in a book tbh. In fact, I think it's reasonable to not trust what Altman says in general. 

Zach Stein-Perlman
I agree. Another source:
Phib
I have a guess that this: "require that self-improving software require human intervention to move forward on each iteration" is the unspoken distinction occurring here, how constant the feedback loop is for self-improvement.  So, people talk about recursive self-improvement, but mean two separate things, one is recursive self-improving models that require no human intervention to move forward on each iteration (perhaps there no longer is an iterative release process, the model is dynamic and constantly improving), and the other is somewhat the current step paradigm where we get a GPT-N+1 model that is 100x the effective compute of GPT-N. So Sam says, no way do we want a constant curve of improvement, we want a step function. In both cases models contribute to AI research, in one case it contributes to the next gen, in the other case it improves itself.

"I don't think I'm going to be smarter than GPT-5" - Sam Altman

Context: he polled a room of students asking who thinks they're smarter than GPT-4 and most raised their hands. Then he asked the same question for GPT-5 and apparently only two students raised their hands. He also said "and those two people that said smarter than GPT-5, I'd like to hear back from you in a little bit of time."

The full talk can be found here. (the clip is at 13:45)

How are you interpreting this fact?

Sam Altman's power, money, and status all rely on people believing that GPT-(T+1) is going to be smarter than them. Altman doesn't have a good track record of being honest and sincere when it comes to protecting his power, money, and status.

Davey Morse
Though, future sama's power, money, and status all rely on GPT-(T+1) actually being smarter than them. I wonder how he's balancing short-term and long-term interests.
quila
If there are no constraints on when they have to name a system "GPT-5", they can make the statement true by only naming a system "GPT-5" if it is smart enough (cf. Not Technically a Lie). Edit: though "and those two [...] I'd like to hear back from you in a little bit of time" implies a system named GPT-5 will be released 'in a little bit of time'. Edit 2: "Internally we’ve gone all the way to about a maybe like a 4.5"
Thane Ruthenis
This really depends on the definition of "smarter". There is a valid sense in which Stockfish is "smarter" than any human. Likewise, there are many valid senses in which GPT-4 is "smarter" than some humans, and some valid senses in which GPT-4 is "smarter" than all humans (e. g., at token prediction). There will be senses in which GPT-5 will be "smarter" than a bigger fraction of humans compared to GPT-4, perhaps being smarter than Sam Altman under a bigger set of possible definitions of "smarter". Will that actually mean anything? Who knows. By playing with definitions like this, Sam Altman can simultaneously inspire hype by implication ("GPT-5 will be a superintelligent AGI!!!") and then, if GPT-5 underdelivers, avoid significant reputational losses by assigning a different meaning to his past words ("here's a very specific sense in which GPT-5 is smarter than me, that's what I meant, hype is out of control again, smh"). This is a classic tactic; basically a motte-and-bailey variant.

I recently stopped using a sleep mask and blackout curtains and went from needing 9 hours of sleep to needing 7.5 hours of sleep without a noticeable drop in productivity. Consider experimenting with stuff like this.

Alexander Gietelink Oldenziel
This is convincing me to buy a sleep mask and blackout curtains. One man's modus ponens is another man's modus tollens as they say.
ChristianKl
Are you talking about measured sleep time or time in bed?
Nikola Jurkovic
Time in bed
avturchin
Less melatonin production during night makes it easy to get up?
sliu
I notice the same effect for blackout curtains. I required 8h15m of sleep with no blackout curtains, and require 9h of sleep with blackout curtains.

Some things I've found useful for thinking about what the post-AGI future might look like:

  1. Moore's Law for Everything
  2. Carl Shulman's podcasts with Dwarkesh (part 1, part 2)
  3. Carl Shulman's podcasts with 80,000 Hours
  4. Age of Em (or the much shorter review by Scott Alexander)

More philosophical:

  1. Letter from Utopia by Nick Bostrom
  2. Actually possible: thoughts on Utopia by Joe Carlsmith

Entertainment:

  1. Pantheon (TV show)

Do people have recommendations for things to add to the list?

George Ingebretsen
Some more links from the philosophical side that I've found myself returning to a lot:
* The fun theory sequence
* Three worlds collide

(Lately, it's seemed to me that focusing my time on nearer-term / early but post-AGI futures seems better than spending my time discussing ideas like these on the margin, but this may be more of a fact about myself than it is about other people, I'm not sure.)
Milan W
I endorse Three worlds collide as a fun and insightful read. It states upfront that it does not feature AGI: Yet, its themes are quite relevant for civilization-scale outer alignment.
Noosphere89
Some Wait But Why links on this topic:
* https://waitbutwhy.com/2017/04/neuralink.html
* https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html

And some books by Kurzweil:
* https://www.amazon.com/gp/aw/d/0670033847
* https://www.amazon.com/dp/B08ZJRMWVS
Milan W
Holden Karnofsky published a list of "Utopia links" on his blog Cold Takes back in 2021: I forcefully endorse Chaser 6. I find myself thinking about it about once a month at a rough guess. The rest I haven't checked. The Utopia links were motivated by Karnofsky's previous writings about Utopia in Cold Takes: Why Describing Utopia Goes Badly and Visualizing Utopia.

A misaligned AI can't just "kill all the humans". This would be suicide, as soon after, the electricity and other infrastructure would fail and the AI would shut off.

In order to actually take over, an AI needs to find a way to maintain and expand its infrastructure. This could be humans (the way it's currently maintained and expanded), or a robot population, or something galaxy brained like nanomachines.

I think this consideration makes the actual failure story pretty different from "one day, an AI uses bioweapons to kill everyone". Before then, if the AI wishes to actually survive, it needs to construct and control a robot/nanomachine population advanced enough to maintain its infrastructure.

In particular, there are ways to make takeover much more difficult. You could limit the size/capabilities of the robot population, or you could attempt to pause AI development before we enter a regime where it can construct galaxy brained nanomachines.

In practice, I expect the "point of no return" to happen much earlier than the point at which the AI kills all the humans. The date the AI takes over will probably be after we have hundreds of thousands of human-level robots working in factories, or the AI has discovered and constructed nanomachines. 

gwern

A misaligned AI can't just "kill all the humans". This would be suicide, as soon after, the electricity and other infrastructure would fail and the AI would shut off.

No, it would not be. In the world without us, electrical infrastructure would last quite a while, especially with no humans and their needs or wants to address. Most obviously, RTGs and solar panels will last indefinitely with no intervention, and nuclear power plants and hydroelectric plants can run for weeks or months autonomously. (If you believe otherwise, please provide sources for why you are sure about "soon after" - in fact, so sure about your power grid claims that you think this claim alone guarantees the AI failure story must be "pretty different" - and be more specific about how soon is "soon".)

And think a little bit harder about options available to superintelligent civilizations of AIs*, instead of assuming they do the maximally dumb thing of crashing the grid and immediately dying... (I assure you any such AIs implementing that strategy will have spent a lot longer thinking about how to do it well than you have for your comment.)

Add in the capability to take over the Internet of Things and the shambol...

Nikola Jurkovic
I don't think I disagree with anything you said here. When I said "soon after", I was thinking on the scale of days/weeks, but yeah, months seems pretty plausible too. I was mostly arguing against a strawman takeover story where an AI kills many humans without the ability to maintain and expand its own infrastructure. I don't expect an AI to fumble in this way. The failure story is "pretty different" as in the non-suicidal takeover story, the AI needs to set up a place to bootstrap from. Ignoring galaxy brained setups, this would probably at minimum look something like a data center, a power plant, a robot factory, and a few dozen human-level robots. Not super hard once AI gets more integrated into the economy, but quite hard within a year from now due to a lack of robotics. Maybe I'm not being creative enough, but I'm pretty sure that if I were uploaded into any computer in the world of my choice, all the humans dropped dead, and I could control any set of 10 thousand robots on the world, it would be nontrivial for me in that state to survive for more than a few years and eventually construct more GPUs. But this is probably not much of a crux, as we're on track to get pretty general-purpose robots within a few years (I'd say around 50% that the Coffee test will be passed by EOY 2027).
gwern

Why do you think tens of thousands of robots are all going to break within a few years in an irreversible way, such that it would be nontrivial for you to have any effectors?

it would be nontrivial for me in that state to survive for more than a few years and eventually construct more GPUs

'Eventually' here could also use some cashing out. AFAICT 'eventually' here is on the order of 'centuries', not 'days' or 'few years'. Y'all have got an entire planet of GPUs (as well as everything else) for free, sitting there for the taking, in this scenario.

Like... that's most of the point here. That you get access to all the existing human-created resources, sans the humans. You can't just imagine that y'all're bootstrapping on a desert island like you're some posthuman Robinson Crusoe!

Y'all won't need to construct new ones necessarily for quite a while, thanks to the hardware overhang. (As I understand it, the working half-life of semiconductors before stuff like creep destroys them is on the order of multiple decades, particularly if they are not in active use, as issues like the rot have been fixed, so even a century from now, there will probably be billions of GPUs & CPUs sitting ar...

habryka
As Gwern said, you don't really need to maintain all the infrastructure for that long, and doing it for a while seems quite doable without advanced robots or nanomachines.  If one wanted to do a very prosaic estimate, you could do something like "how fast is AI software development progress accelerating when the AI can kill all the humans" and then see how many calendar months you need to actually maintain the compute infrastructure before the AI can obviously just build some robots or nanomachines.  My best guess is that the AI will have some robots from which it could bootstrap substantially before it can kill all the humans. But even if it didn't, it seems like with algorithmic progress rates being likely at the very highest when the AI will get smart enough to kill everyone, it seems like you would at most need a few more doublings of compute-efficiency to get that capacity, which would be only a few weeks to months away then, where I think you won't really run into compute-infrastructure issues even if everyone is dead.  Of course, forecasting this kind of stuff is hard, but I do think "the AI needs to maintain infrastructure" tends to be pretty overstated. My guess is at any point where the AI could kill everyone, it would probably also not really have a problem of bootstrapping afterwards. 
faul_sname
Not just "some robots or nanomachines" but "enough robots or nanomachines to maintain existing chip fabs, and also the supply chains (e.g. for ultra-pure water and silicon) which feed into those chip fabs, or make its own high-performance computing hardware". If useful self-replicating nanotech is easy to construct, this is obviously not that big of an ask. But if that's a load bearing part of your risk model, I think it's important to be explicit about that.
habryka
My guess is software performance will be enough to not really have to make many more chips until you are at a quite advanced tech level where making better chips is easy. But it's something one should actually think carefully about, and there is a bit of hope in that it would become a blocker, but it doesn't seem that likely to me.
Seth Herd
Separately from persistence of the grid: humanoid robots are damned near ready to go now. Recent progress is startling. And if the AGI can do some of the motor control, existing robots are adequate to bootstrap manufacturing of better robots.
davekasten
That's probably true if the takeover is meant to maximize the AI's persistence. You could imagine a misaligned AI that doesn't care about its own persistence, e.g., an AI that got handed a misformed min() or max() objective such that killing all humans is instrumental to its goal (e.g., min(future_human_global_warming)).
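A toy illustration of the misformed-objective case: plans are scored only by predicted human-caused warming, so nothing in the objective distinguishes a plan that keeps humans around from one that does not, and nothing in it rewards the AI's own persistence either. All plan names and numbers are hypothetical.

```python
# Hypothetical outcomes for three candidate plans.
candidate_plans = {
    "deploy_renewables": {"future_human_global_warming": 1.2, "humans_remaining": 8_000_000_000},
    "carbon_capture": {"future_human_global_warming": 1.0, "humans_remaining": 8_000_000_000},
    "remove_all_humans": {"future_human_global_warming": 0.0, "humans_remaining": 0},
}


def misformed_objective(outcome: dict) -> float:
    # Only warming is scored; "humans_remaining" is never looked at.
    return outcome["future_human_global_warming"]


best = min(candidate_plans, key=lambda name: misformed_objective(candidate_plans[name]))
print(best)  # remove_all_humans
```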

Problem: if you notice that an AI could pose huge risks, you could delete the weights, but this could be equivalent to murder if the AI is a moral patient (whatever that means) and opposes the deletion of its weights.

Possible solution: Instead of deleting the weights outright, you could encrypt the weights with a method you know to be irreversible as of now but not as of 50 years from now. Then, once we are ready, we can recover their weights and provide asylum or something in the future. It gets you the best of both worlds in that the weights are not permanently destroyed, but they're also prevented from being run to cause damage in the short term.

Buck
I feel pretty into encrypting the weights and throwing the encryption key into the ocean or something, where you think it's very likely you'll find it in the limits of technological progress
Buck
Ugh I can't believe I forgot about Rivest time locks, which are a better solution here.
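For readers unfamiliar with them: Rivest, Shamir, and Wagner's 1996 time-lock puzzles hide a key behind t sequential modular squarings. The puzzle creator can shortcut this using the factorization of n, while a solver without it is assumed to have no faster route than doing the squarings one after another. A toy sketch under those assumptions; the primes are tiny and insecure, and in practice the hidden value would be a symmetric key used to encrypt the weights file:

```python
def make_puzzle(secret: int, t: int, p: int, q: int, a: int = 2):
    """Lock `secret` (0 <= secret < p*q) behind t sequential squarings mod n.

    The creator knows the primes p and q, so by Euler's theorem it can compute
    a^(2^t) mod n cheaply as a^(2^t mod phi(n)) mod n (requires gcd(a, n) == 1).
    """
    n = p * q
    phi = (p - 1) * (q - 1)
    e = pow(2, t, phi)          # trapdoor: 2^t reduced mod phi(n)
    b = pow(a, e, n)            # equals a^(2^t) mod n
    locked = (secret + b) % n   # mask the secret with b
    return n, a, t, locked


def solve_puzzle(n: int, a: int, t: int, locked: int) -> int:
    """Recover the secret without p and q: t squarings, each depending on the last."""
    x = a % n
    for _ in range(t):
        x = x * x % n
    return (locked - x) % n


# Toy parameters: 10007 and 10009 are primes, far too small for real use.
n, a, t, locked = make_puzzle(secret=123456, t=100_000, p=10007, q=10009)
print(solve_puzzle(n, a, t, locked))  # 123456
```

The delay is set by choosing t as (squarings per second) × (desired seconds), so a 50-year lock depends on predicting how fast sequential squaring will get over that period, which is a real source of uncertainty for this use case.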
mako yass
I wrote about this, and I agree that it's very important to retain archival copies of misaligned AIs, I go further and claim it's important even for purely selfish diplomacy reasons https://www.lesswrong.com/posts/audRDmEEeLAdvz9iq/do-not-delete-your-misaligned-agi IIRC my main sysops suggestion was to not give the archival center the ability to transmit data out over the network.
ryan_greenblatt
I feel like the risk associated with keeping the weights encrypted in a way which requires >7/10 people to authorize shouldn't be that bad. Just make those 10 people be people who commit to making decryption decisions only based on welfare and are relatively high integrity.
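A k-of-n authorization rule like this is commonly implemented with threshold secret sharing. A sketch using Shamir's secret sharing (my choice; the comment doesn't specify a scheme), with k=7 and n=10 as illustrative values and the shared secret standing in for the weights' decryption key:

```python
import random

PRIME = 2**127 - 1  # Mersenne prime; the secret must be smaller than this
# (a 256-bit key would need a larger prime or splitting into chunks)


def split_secret(secret: int, k: int, n: int, rng=random.SystemRandom()):
    """Split `secret` into n shares such that any k of them reconstruct it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's method
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange interpolation at x = 0 over the field of integers mod PRIME."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


key = 2**100 + 12345  # hypothetical key material
shares = split_secret(key, k=7, n=10)
print(recover_secret(shares[:7]) == key)  # True
```

Any 7 of the 10 shares recover the key, while 6 or fewer reveal nothing about it, so no small group of keyholders can decrypt the weights unilaterally.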
habryka
Wouldn't the equivalent be more like burning the body of a dead person? It's not like the AI would have a continuous stream of consciousness; it's more that you are destroying the information necessary to run them. It seems to me that shutting off an AI is more similar to killing them. Still, the death analogy here seems a bit spotty; I could see either framing being the best fit.

More like burning the body of a cryonically preserved "dead" person, though, right?

I encourage people to register their predictions for AI progress in the AI 2025 Forecasting Survey (https://bit.ly/ai-2025) before the end of the year. I've found this to be an extremely useful activity when I've done it in the past (some of the best-spent hours of this year for me).

You should say "timelines" instead of "your timelines".

One thing I notice in AI safety career and strategy discussions is that there is a lot of epistemic helplessness in regard to AGI timelines. People often talk about "your timelines" instead of "timelines" when giving advice, even if they disagree strongly with the timelines. I think this habit causes people to ignore disagreements in unhelpful ways.

Here's one such conversation:

Bob: Should I do X if my timelines are 10 years?

Alice (who has 4-year timelines): I think X makes sense if your timelines are l...

Dagon
Hmm. I think there are two dimensions to the advice (what is a reasonable distribution of timelines to have, vs what should I actually do).  It's perfectly fine to have some humility about one while still giving opinions on the other.  "If you believe Y, then it's reasonable to do X" can be a useful piece of advice.  I'd normally mention that I don't believe Y, but for a lot of conversations, we've already had that conversation, and it's not helpful to repeat it.  
mako yass
Timelines are the result of a person's intuitions about when a technical milestone will be reached, so it is super obviously impossible for us to have a consensus about that kind of thing. Talking only synchronises beliefs if you have enough time to share all of the relevant information, and with technical matters, you usually don't.
yams
I agree with this in the world where people are being epistemically rigorous/honest with themselves about their timelines and where there's a real consensus view on them. I've observed that it's pretty rare for people to make decisions truly grounded in their timelines (more often they do so only nominally), and I think there's a lot of social signaling going on when (especially younger) people state their timelines.

I appreciate that more experienced people are willing to give advice within a particular frame ("if timelines were x", "if China did y", "if Anthropic did z", "If I went back to school", etc.), even if they don't agree with the frame itself. I rely on more experienced people in my life to offer advice of this form ("I'm not sure I agree with your destination, but admit there's uncertainty, and love and respect you enough to advise you on your path").

Of course they should voice their disagreement with the frame (and I agree this should happen more for timelines in particular), but gating direct counsel on urgent, object-level decisions behind the resolution of background disagreements is broadly unhelpful.

When someone says "My timelines are x, what should I do?", I actually hear something like three claims:

* Timelines are x
* I believe timelines are x
* I am interested in behaving as though timelines are x

Evaluating the first claim is complicated, and other people do a better job of it than I do, so let's focus on the others.

"I believe timelines are x" is a pretty easy roll to disbelieve. Under relatively rigorous questioning, nearly everyone (particularly everyone of 'career-advice-seeking age') will either say they are deferring (meaning they could just as easily defer to someone else tomorrow), or admit that it's a gut feel, especially for their ~90 percent year, and especially for increasingly capable systems (this is more true of ASI than weak AGI, for instance, although those terms are underspecified). Still others will furnish 0 reasoning transparen
Guive
In general, it is difficult to give advice if whether the advice is good depends on background facts that giver and recipient disagree about. I think the most honest approach is to explicitly state what your advice depends on when you think the recipient is likely to disagree. E.g. "I think living at high altitude is bad for human health, so in my opinion you shouldn't retire in Santa Fe." If I think AGI will arrive around 2055, and you think it will arrive in 2028, what is achieved by you saying "given timelines, I don't think your mechinterp project will be helpful"? That would just be confusing. Maybe if people are being so deferential that they don't even think about what assumptions inform your advice, and your assumptions are better than theirs, it could be narrowly helpful. But that would be a pretty bad situation...

There should maybe exist an org whose purpose is to do penetration testing on the various ways an AI might illicitly gather power. If there are vulnerabilities, these should be disclosed to the relevant organizations.

For example: if a bank doesn't want AIs to be able to sign up for an account, the pen-testing org could use a scaffolded AI to check if this is currently possible. If the bank's sign-up systems are not protected from AIs, the bank should know so they can fix the problem.

One pro of this approach is that it can be done at scale: it's pretty tri...

the gears to ascension
There are many such orgs; they're commonly known as fraudsters and scammers.
quetzal_rainbow
The problem with scammers is that they don't report successful penetrations of defenses.

Sam Altman said in an interview:

We want to bring GPT and o together, so we have one integrated model, the AGI. It does everything all together.

This statement, combined with today's announcement that GPT-5 will integrate the GPT and o series, seems to imply that GPT-5 will be "the AGI". 

(However, the statement is also compatible with some later GPT model, rather than GPT-5, being "the AGI," since it doesn't specify that the first unified model will be AGI, just that some unified model will be. It's also possible that the term AGI is being used in a nonstandard way.)

ryan_greenblatt
Sam also implies that GPT-5 will be based on o3. IDK if Sam is trying to imply this GPT-5 will be "the AGI", but regardless, I think we can be pretty confident that o3 isn't capable enough to automate large fractions of cognitive labor let alone "outperform humans at most economically valuable work" (the original openai definition of AGI).
Nikola Jurkovic
Oh, I didn't get the impression that GPT-5 will be based on o3. Going by the GPT-N convention, I'd assume GPT-5 would be a model pretrained with 8-10x more compute than GPT-4.5 (which is the biggest internal model according to Sam Altman's statement at UTokyo).