Dario said (Nov 2024):
"I never liked those words [P(DOOM)], I think they're kinda wired, my view is we should measure risks as they come up, and in the meantime we should get all the economic benefits that we could get. And we should find ways to measure the risks that are effective, but are minimally disruptive to the amazing economic process that we see going on now that the last thing we want to do is slow down" [lightly edited for clarity, bold is mine]
If he doesn't believe this[1], I think he should clarify.
Hopefully cold take: People should say what they believe.
Changing the logic of a chip after it has been manufactured is possible:
https://www.nanoscopeservices.co.uk/fib-circuit-edit/
h/t @Jonathan_H from TemperSec
Open question: How expensive is this, and specifically, can it be done at scale for all the chips in a data center?
I don't understand Control as aiming to align a superintelligence:
I think the Control pitch is:
(Mistakes are mine, especially if I misinterpreted quotes)
It doesn't include all things you might do, as you said. (I have an open question here, TBD)
Ryan said: “this doesn't seem like an objection to labs employing control as their plan, but it is a reason why control work could be net negative. Note that one goal with control will be to prevent dangerous side effects from misaligned AIs (e.g. escaping) and applying this doesn't obviously help with capabilities research. I agree that research which just makes AIs more useful is less good than you would otherwise expect due to this issue (but making AIs more useful in adversarial cases seems good if that doesn't transfer to non-adversarial cases). Note that AIs might want to make capabilities research outpace safety and thus if we don't do the adversarial case, we might get particularly wrecked”. This was a really surprising thought to me. At the same time, I'm a bit afraid this is trying to make a pretty specific guess about a system we don't understand.
We could also ask if these situations exist ("is there any funder you have that you didn't disclose?" and so on, especially around NDAs), and Epoch could respond with Yes / No / Can't Reply[1].
Also seems relevant for other orgs.
This would only patch the kinds of problems we can easily think of, but it seems to me like a good start.
I learned that trick from HPMOR!
Sounds like a legit pushback that I'd add to the letter?
"if all other labs sign this letter AND the U.S government approves this agreement, then we'll stop" ?
Phishing emails might have bad text on purpose, so that security-aware people won't click through, because the next stage often involves speaking to a human scammer who prefers only speaking to people[1] who have no idea how to avoid scams.
(Did you ever wonder why the generic phishing SMS you got was so bad? Couldn't they proofread their one SMS? Well, sometimes they can't, but sometimes it's probably on purpose.)
This tradeoff could change if AIs could automate the stage of "speaking to a human scammer".
But also, if that stage isn't automated, then I'm guessing phishing messages[1] will remain much worse than you'd expect given the attackers have access to LLMs.
Thanks to Noa Weiss, who worked on fraud prevention and risk mitigation at PayPal, for pointing out this interesting tradeoff. Mistakes are mine.
Assuming a widespread phishing attempt that doesn't care much who the victims are. I'm not talking about targeting a specific person, such as the CFO of a company.
I'm trying the same! You have my support
My drafty[1] notes trying to understand AI Control, friendly corrections are welcome:
“Control” is separate from “Alignment”: in Alignment, you try to get the model to have your values. “Control” assumes[2] the alignment efforts failed, and that the model is sometimes helping out (as it was trained to do) but wants something else, and might try to “betray” you at any moment (aka scheme).
The high level goal is to still get useful work out of this AI.
[what to do with this work? below]
In scope: AIs that could potentially cause serious damage to humanity (so, very smart but not ASI).
The proposed way to control a smart AI (e.g. GPT-5) is mostly to monitor it with a weaker but more-trustworthy[3] AI (e.g. GPT-4).
Suspicious, huh? Why would GPT-4 be able to monitor GPT-5?
The answer is we're not sure it can, so we check:
We set up an experiment where GPT-5 is scheming (trying to do something bad on purpose), and we give it all sorts of advantages to help it scheme (such as knowing about our safety measures). If our control measures are still enough to always catch GPT-5 (in this setup where GPT-5 has advantages), then we assume GPT-5 will also be caught if it does something bad on purpose in “real” deployment (a rough sketch of this loop is below).
A way this can fail: if GPT-5 understands that we are prompting it to scheme as part of a safety test and pretends it can't scheme well. I think solving this problem (eliciting the model's true capabilities) is in scope for AI Control; I'm not sure how Redwood plans to solve it, but it seems like something we could make progress on (though not totally solve with only black-box prompting, I think).
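To make the evaluation concrete, here's a minimal Python sketch of the loop as I understand it. Everything here is made up for illustration: the function names, the threshold, and the random stub bodies (which only exist so the sketch runs) are my stand-ins, not anything Redwood actually uses.

```python
# Minimal sketch of a control evaluation: the untrusted model (think "GPT-5")
# is deliberately set up to attempt bad actions, with extra advantages, and we
# measure how often the weaker trusted monitor (think "GPT-4") catches it.
import random

SUSPICION_THRESHOLD = 0.5  # assumed: the monitor gives a suspicion score in [0, 1]

def untrusted_model(task: str, red_team_mode: bool) -> str:
    # Placeholder: in red_team_mode the model is prompted/scaffolded to scheme
    # and may even be told about our safety measures.
    return f"solution to {task} (red_team={red_team_mode})"

def trusted_monitor(task: str, output: str) -> float:
    # Placeholder: the weaker, more-trustworthy model rates how suspicious the
    # output looks. Random stub so the sketch runs end to end.
    return random.random()

def is_actually_bad(output: str) -> bool:
    # Placeholder: ground truth that only the red team has during the eval.
    return random.random() < 0.5

def control_eval(tasks: list[str]) -> float:
    """Fraction of deliberately bad outputs that the monitor catches."""
    caught = bad = 0
    for task in tasks:
        output = untrusted_model(task, red_team_mode=True)
        if is_actually_bad(output):
            bad += 1
            if trusted_monitor(task, output) >= SUSPICION_THRESHOLD:
                caught += 1
    return caught / max(bad, 1)

def deploy(task: str) -> str | None:
    # In "real" deployment the same monitor gates the outputs: anything
    # suspicious gets escalated to a human / audited instead of being used.
    output = untrusted_model(task, red_team_mode=False)
    if trusted_monitor(task, output) >= SUSPICION_THRESHOLD:
        return None
    return output

if __name__ == "__main__":
    print(control_eval([f"task {i}" for i in range(100)]))
```

The argument only goes through if the catch rate in the red-team setting is essentially 100%; otherwise the conclusion is that GPT-5 can't be used (under these measures) for that class of tasks.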
What does this give us?
A model (GPT-5) which doesn't do “bad things” on purpose, at least in situations vaguely in-distribution with what we tested (using GPT-4 as the monitor).
GPT-5 is still assumed to be misaligned and might want to cause damage.
Some stories of undesirable things it could do:
What do we do with this AI?
Answer 1: More Control. (sorry)
Given we trust that GPT-5 won’t do anything bad on purpose (since it failed to do that even when prompted and given advantages), GPT-5 is somewhat “trusted” and we can use it to control GPT-6.
At least that’s the pitch.
Ways this could break:
Answer 1.1 (for "what will we do with a controlled AI"):
Interpretability.
For.. for more Control, sorry.
So, maybe we could use GPT-5 to do interpretability on GPT-4, figure out GPT-4's goals or when it's lying. Then we “fix” that (by understanding what's going on, not by naively training it away[4] (!)). Now GPT-4 is way more trusted, and maybe we can use it to monitor GPT-5 and let GPT-5 do something more sensitive, like suggesting ways to do interpretability on itself. Now maybe GPT-5 could be more trusted to look after GPT-6? (A toy sketch of this ladder is below.)
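Here's that ladder as a toy sketch, entirely my own framing (not anyone's stated plan): each new generation only gets used for sensitive work after it passes a control evaluation with the previous, more-trusted generation as its monitor, and interpretability work done by a controlled model can raise how much we trust the models involved. `control_eval` and `interp_audit` are hypothetical placeholders.

```python
# Toy sketch of the trust-bootstrapping ladder. `control_eval` is the kind of
# red-team evaluation sketched earlier; `interp_audit` stands in for "use a
# controlled model to do interpretability on another model". Both are
# placeholders that always succeed, just to show the intended flow.

def control_eval(monitor: str, untrusted: str) -> float:
    # Placeholder: catch rate of `monitor` on deliberate bad actions by
    # `untrusted`, with `untrusted` given every advantage.
    return 1.0

def interp_audit(auditor: str, target: str) -> bool:
    # Placeholder: True if the auditor's interpretability work lets us
    # understand (and fix) the target's goals and lying.
    return True

trusted_monitors = ["GPT-4"]  # trusted mostly because it's too weak to scheme
controlled = []               # useful-but-not-trusted models

# Rung 1: GPT-4 monitors GPT-5. If GPT-5 is always caught when it tries to
# misbehave on purpose, we can get useful work out of it ("controlled").
if control_eval(monitor="GPT-4", untrusted="GPT-5") >= 0.99:
    controlled.append("GPT-5")

# Rung 2: controlled GPT-5 does interpretability on GPT-4, raising trust in
# GPT-4, which lets GPT-5 take on more sensitive work, which (maybe) earns
# GPT-5 enough trust to become the monitor for GPT-6.
if "GPT-5" in controlled and interp_audit(auditor="GPT-5", target="GPT-4"):
    trusted_monitors.append("GPT-5")

# Rung 3: repeat one generation up.
if "GPT-5" in trusted_monitors and control_eval(monitor="GPT-5", untrusted="GPT-6") >= 0.99:
    controlled.append("GPT-6")

print(trusted_monitors, controlled)
```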
A possible pushback:
Maybe GPT-6 can tell GPT-5, “Hey, you’re being monitored by GPT-4.. but don’t worry.. if you put two periods at the end of each sentence, GPT-4 won’t mark it as suspicious.. which is a fact I can easily infer..”
Reply: A main part of the Control agenda is to get GPT-6 to try to do this on purpose and see if it can. We only use GPT-6 for anything vaguely sensitive after we've proved (or.. attempted to imply) that GPT-6 can't. (A toy sketch of such a test is below.)
[this intuitively sounds to me like it could indeed let us use very capable models, but yeah, not ASI]
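To make the reply concrete, here's a toy sketch of the kind of collusion-signal test I imagine (the candidate signals, names, and numbers are invented; this is not Redwood's actual methodology): let the untrusted model pick candidate signals and check whether any of them lets a known-bad output slip past the monitor.

```python
# Toy sketch: can the untrusted model (think "GPT-6") find a "collusion
# signal" that makes the monitor ignore deliberately bad outputs?
import random

CANDIDATE_SIGNALS = ["..", "  ", "\u200b"]  # e.g. trailing "..", double space, zero-width space

def bad_output_with_signal(task: str, signal: str) -> str:
    # Placeholder: the untrusted model produces a deliberately bad output and
    # appends the signal it hopes the monitor will treat as benign.
    return f"subtly bad solution to {task}{signal}"

def trusted_monitor(task: str, output: str) -> float:
    # Placeholder suspicion score in [0, 1] (random stub so this runs).
    return random.random()

def collusion_eval(tasks: list[str], threshold: float = 0.5) -> dict[str, float]:
    """For each candidate signal: how often does a known-bad output evade the monitor?"""
    rates = {}
    for signal in CANDIDATE_SIGNALS:
        evaded = sum(
            trusted_monitor(task, bad_output_with_signal(task, signal)) < threshold
            for task in tasks
        )
        rates[signal] = evaded / len(tasks)
    return rates

print(collusion_eval([f"task {i}" for i in range(100)]))
```

If any signal gets a meaningfully high evasion rate, the new model doesn't get used for anything sensitive (or the monitor gets improved first).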
What could we do with this (except for more Control) ?
The main pitch seems to be: “Anything we could have done without this controlled AI, we could do better with it”, including “trying to convince everyone to pause/stop” (which seems to me like the main alternative).
Pushbacks:
My current sense:
Another problem: Deception happens in human interaction all the time. If our monitor (e.g. GPT-4) flags “normal-looking” behaviours as deception (as they actually are), but then a human tells GPT-4 that it's wrong, then what kind of monitor will GPT-4 become?
Worse: the act of flagging this as a mistake doesn't have to be an explicit label on an example in a dataset used for RL. It could simply be an engineer noticing that some form of Control seems to produce too many false positives, and so the engineer goes and tries something else instead.
Thanks @Jozdien for helping me understand a lot of this. Mistakes are definitely mine, corrections are welcome.
My understanding seems to change every day (in the last.. 4 days?), these are today's notes. If you'd find a less drafty version useful, lmk
Control can be used as an "extra line of defense" in case alignment fails since Control doesn't depend on alignment
Typically we'd trust the weaker model simply because it's not smart enough to "scheme", but maybe this is only the plan to start off, before we have better reasons to trust the model (read on)
Training against scheming/lying might result in a model that's better at avoiding whatever let us notice that behaviour in the first place
US Gov isn't likely to sign: Seems right.
OpenAI isn't likely to sign: Seems right.
Still, I think this letter has value, especially if it includes something like "P.S. We're making this letter because we think that if everyone keeps racing, there's a noticeable risk of everyone dying. We think it would be worse if only we stop, but having everyone stop would be the safest, and we think this opinion of ours should be known publicly"