Former safety researcher & TPM at OpenAI, 2020-24
https://www.linkedin.com/in/sjgadler
It’s much more the same than a lot of prosaic safety though, right?
Let me put it this way: If an AI can’t achieve catastrophe on that order of magnitude, it also probably cannot do something truly existential.
One of the issues this runs into is if a misaligned AI is playing possum, and so doesn’t attempt lesser catastrophes until it can pull off a true takeover. I nonetheless though think this framing points generally at the right type of work (understood that others may disagree of course)
I’ve often preferred a frame of ‘catastrophe avoidance’ over a frame of x-risk. This has a possible downside of people underfeeling the magnitude of risk, but also an upside of IMO feeling way more plausible. I think it’s useful to not need to win specific arguments about extinction, and also to not have some of the existential/extinction conflation happening in ‘x-‘.
If someone is wondering what prefilling means here, I believe Ted means ‘putting words in the model’s mouth’ by being able to fabricate a conversational history where the AI appears to have said things it didn’t actually say.
For instance, if you can start a conversation midway, and if the API can’t distinguish between things the model actually said in the history vs. things you’ve written in its behalf as supposed outputs in a fabricated history, this can be a jailbreak vector: If the model appeared to already violate some policy on turns 1 and 2, it is more likely to also violate this on turn 3, whereas it might have refused if not for the apparent prior violations.
(This was harder to clearly describe than I expected.)
Great! Appreciate you letting me know & helping debug for others
Oh interesting, I’m out at the moment and don’t recall having this issue, but if you override the default number of threads for the repo to 1, does that fix it for you?
https://github.com/openai/evals/blob/main/evals/eval.py#L211
(There are two places in this file where threads =
, would change 10 to 1 in each)
I quite like the Function Deduction eval we built, which is a problem-solving game that tests one’s ability to efficiently deduce a hidden function by testing its value on chosen inputs.
It’s runnable from the command-line (after repo install) with the command:
oaieval human_cli function_deduction
(I believe that’s the right second term, but it might just be humancli)
The standard mode might be slightly easier than you want, because it gives some partial answer feedback along the way. There is also a hard mode that can be run, which would not give this partial feedback and so would be harder to find any brute forcing method.
For context, I could solve ~95% of the standard problems with a time limit of 5 minutes and a 20 turn game limit. GPT-4 could solve roughly half within that turn limit IIRC, and o1-preview could do ~99%. (o1-preview also scored ~99% on the hard mode variant; I never tested myself on this but I estimate maybe I’d have gotten 70%? The problems could also be scaled up in difficulty pretty easily if one wanted.)
This helps me to notice that there is a fairly strong and simple data poisoning attack possible with canaries, such that canaries can’t really be relied upon (amid other reasons I already believed they’re insufficient, esp once AIs can browse the Web):
The attack is that one could just siphoning up all text on the Internet that did have a canary string, and then republish it without the canary :/
I agree, these are interesting points, upvoted. I’d claim that AI output also isn’t linear with the resources - but nonetheless, that you’re right that the curve of marginal return from each AI unit could be different from each human unit in an important way. Likewise, the easier on-demand labor of AI is certainly a useful benefit.
I don’t think these contradict the thrust of my point though? That in each case, one shouldn’t just be thinking about usefulness/capability, but should also be considering the resources necessary for achieving this.
I believe we should view AGI as a ratio of capability to resources, rather than simply asking how AI's abilities compare to humans'. This view is becoming more common, but is not yet common enough.
When people discuss AI's abilities relative to humans without considering the associated costs or time, this is like comparing fractions by looking only at the numerators.
In other words, AGI has a numerator (capability): what the AI system can achieve. This asks questions like: For this thing that a human can do, can AI do it too? How well can AI do it? (For example, on a set of programming challenges, how many can the AI solve? How many can a human solve?)
But also importantly, AGI should account for the denominator: how many resources are required to achieve its capabilities. Commonly, this resource will be a $-cost or an amount of time. This asks questions like "What is the $-cost of getting this performance?" or "How long does this task take to complete?".
I claim that an AI system might fail to qualify as AGI if it lacks human-level capabilities, but it could also fail by being wildly inefficient compared to a human. Both the numerator and denominator matter when evaluating these ratios.
A quick example of why focusing solely on the capability (numerator) is insufficient:
The AGI-as-ratio concept has long been implicit in some labs' definitions of AGI - for instance, OpenAI describes AGI as "a highly autonomous system that outperforms humans at most economically valuable work". Outperforming humans economically does seem to imply being more cost-effective, not just having the same capabilities. Yet even within OpenAI, the denominator aspect wasn’t always front of mind, which is why I wrote up a memo on this during my time there.
Until the recent discussions of o1 Pro / o3 drawing upon lots of inference compute, I rarely saw these ideas discussed, even in otherwise sophisticated analyses. One notable exception to this is my former teammate Richard Ngo's t-AGI framework, which deserves a read. METR has also done a great job of accounting for this in their research comparing AI R&D performance, given a certain amount of time. I am glad to see more and more groups thinking in terms of these factors - but in casual analysis, it is very easy to just slip into comparisons of capability levels. This is worth pushing back on, imo: The time at which "AI capabilities = human capabilities" is different than the time when I expect AGI will have been achieved in the relevant senses.
There are also some important caveats to my claim here, that 'comparing just by capability is missing something important':
At the risk of being pedantic, just noting I don’t think it’s really correct to consider that first person as earning $300/hr. For example I’d expect to need to account for the depreciation on the jet skis (or more straightforwardly, that one is in the hole on having bought them until a certain number of hours rented), and also presumably some accrual for risk of being sued in the case of an accident.
(I do think it’s useful to notice that the jet ski rental person is much higher variance, in both directions IMO - so this can be both good or bad. I do also appreciate you sharing your experience!)