Yonatan Cale

Wiki Contributions

Comments

Sorted by

US Gov isn't likely to sign: Seems right.

OpenAI isn't likely to sign: Seems right.

Still, I think this letter has value, especially if it has "P.S. We're making this letter because we think if everyone keeps racing then there's a noticable risk of everyone dying. We think it would be worse if only we stop, but having everyone stop would be the safest, and we think this opinion of ours should be known publicly"

Dario said (Nov 2024):

"I never liked those words [P(DOOM)], I think they're kinda wired, my view is we should measure risks as they come up, and in the meantime we should get all the economic benefits that we could get. And we should find ways to measure the risks that are effective, but are minimally disruptive to the amazing economic process that we see going on now that the last thing we want to do is slow down" [lightly edited for clarity, bold is mine]

If he doesn't believe this[1], I think he should clarify.

Hopefully cold take: People should say what they believe.

 

  1. ^

    (a) I assume Dario actually thinks that alignment is more like cyber security, where you prepare for risks even if you didn't see them yet, and (b) Dario seems to have engaged in length with P(DOOM) before

Changing the logic of chips is possible:

https://www.nanoscopeservices.co.uk/fib-circuit-edit/

Multi Layer Fix in FIB Circuit Edit

h/t @Jonathan_H from TemperSec

Open question: How expensive is this, and specifically can this be done in scale for the chips of an entire data center?

I don't understand Control as aiming to align a super intelligence:

  1. Control isn't expected to scale to ASI (as you noted, also see "The control approach we're imagining won't work for arbitrarily powerful AIs" here)
  2. We don't have a plan on how to align an ASI using Control afaik
    1. Ryan said around March 2024: "On (1) (not having a concrete plan for what to do with smarter systems), I think we should get such a plan". I'm still looking, this seems important.
      1. Edit: Maybe I'm wrong and Redwood does think that getting an ASI-alignment plan out of such an AI is possible, they'd just prefer (.. as a bonus?) to have the plan in advance?

 

I think the Control pitch is: 

  1. Whatever we plan to do anyway, we could do it more easily if we had a Controlled AI
    1. This includes most[1] things you might otherwise do instead of working on Control, such as trying to coordinate a pause or implement some technical agenda.
    2. The tradeoff seems to be "start working on the alternative [e.g. a pause] now and have more time" vs "have better tools (Control) to help but have less time to use them"
  2. An uncontrolled AI might be more interested in advancing capabilities than alignment (see footnote[2])

 

(Mistakes are mine, especially if I misinterpreted quotes)

 

  1. ^

    It doesn't include all things you might do, as you said. (I have an open question here, TBD)

  2. ^

    Ryan said: “this doesn't seem like an objection to labs employing control as their plan, but it is a reason why control work could be net negative. Note that one goal with control will be to prevent dangerous side effects from misaligned AIs (e.g. escaping) and applying this doesn't obviously help with capabilities research. I agree that research which just makes AIs more useful is less good than you would otherwise expect due to this issue (but making AIs more useful in adversarial cases seems good if that doesn't transfer to non-adversarial cases). Note that AIs might want to make capabilties research outpace safety and thus if we don't do the adversarial case, we might get particularly wrecked”. This was a really surprising thought to me. At the same time, I'm a bit afraid this is trying to make a pretty specific guess about a system we don't understand.

We could also ask if these situations exist ("is there any funder you have that you didn't disclose?" and so on, especially around NDAs), and Epoch could respond with Yes/No/Can'tReply[1].

Also seems relevant for other orgs.

This would only patch the kind of problems we can easily think about, but it seems to me like a good start

 

  1. ^

    I learned that trick from hpmor!

Sounds like a legit pushback that I'd add to the letter?

"if all other labs sign this letter AND the U.S government approves this agreement, then we'll stop" ?

Phishing emails might have bad text on purpose, so that security-aware people won't click through, because the next stage often involves speaking to a human scammer who prefers only speaking to people[1] who have no idea how to avoid scams.

(did you ever wonder why the generic phishing SMS you got was so bad? Couldn't they proof read their one SMS? Well, sometimes they can't, but sometimes it's probably on purpose)

This tradeoff could change if AIs could automate the stage of "speaking to a human scammer".

But also, if that stage isn't automated, then I'm guessing phishing messages[1] will remain much worse than you'd expect given the attackers have access to LLMs.

 

Thanks Noa Weiss who worked on fraud prevention and risk mitigation at PayPal for pointing out this interesting tradeoff. Mistakes are mine

  1. ^

    Assuming a wide spread phishing attempt that doesn't care much who the victims are. I'm not talking about targeting a specific person, such as the CFO of a company

I'm trying the same! You have my support

My drafty[1] notes trying to understand AI Control, friendly corrections are welcome:

 

“Control” is separate from “Alignment”: In Alignment, you try to get the model to have your values. “Control” assumes[2] the alignment efforts failed, and that the model is sometimes helping out (as it was trained to do), but it wants something else, and it might try to “betray” you at any moment (aka scheme).

The high level goal is to still get useful work out of this AI.

[what to do with this work? below]

 

In scope: AIs that could potentially cause serious damage to humanity (so, very smart but not ASI).

 

The proposed way to control a smart AI (e.g GPT-5) is mostly to monitor it with a weaker but more-trustworthy[3] AI (e.g GPT-4).

Suspicious, ha? Why would GPT-4 be able to monitor GPT-5?

The answer is we're not sure it can, so we check:

We set up an experiment where GPT-5 is scheming (trying to do something bad on purpose), and we give it all sorts of advantages to help it scheme (such as knowing about our safety measures), and if our control measures are still enough to always catch GPT-5 (in this setup where GPT-5 has advantages), then we assume GPT-5 also will also be caught if it does something bad on purpose in “real” deployment.

A way this can fail: If GPT-5 understands we are prompting it to scheme as part of a safety test and it pretends like it can’t scheme well. I think solving this problem (eliciting the true capabilities) is in-scope for AI Control, I’m not sure how Redwood plans to solve it, but it seems like something we could make progress on (though not totally solve with only black-box prompting, I think).

 

What does this give us?

A model (GPT-5) which doesn’t do “bad things” on purpose, in a way vaguely in-distribution to what we tested with GPT-4.

GPT-5 is still assumed to be misaligned and might want to cause damage.

 

Some stories of undesirable things it could do:

  1. It might notice we’re doing interpretability with individual-neurons, but it knows that using SAEs would be much better, but it thinks explaining SAEs would be too complicated (or maybe even be flagged as trying to trick us since complicated explanations are usually not honest), so it will go for the single-neuron-interpretability
    1. Solving this is out-of-scope for Control (right?)
    2. Other techniques (like having a faithful chain of thought) might address this problem, and (fortunately?) Control might help build those techniques
  2. We might ask GPT-5 to write a convincing post for why AI should be paused, and GPT-5 knows about 3 approaches to writing this post and it picks the approach that helps its internal (unaligned) goals the most, and these reasons are too complicated for us or for GPT-4 to understand
    1. In scope for Control: “The post isn’t an attempt to harm us on purpose”
    2. Out of scope: All the rest?

 

What do we do with this AI?

Answer 1: More Control. (sorry)

 

Given we trust that GPT-5 won’t do anything bad on purpose (since it failed to do that even when prompted and given advantages), GPT-5 is somewhat “trusted” and we can use it to control GPT-6.

At least that’s the pitch.

 

Ways this could break:

  1. If the jump between GPT-5 and GPT-6 is too big,
    1. GPT-6 will be able to scheme without being caught
    2. It might be dangerous to turn on GPT-6 even to experiment with it (reminder: AIs that could cause significant damage to humanity are in-scope)
    3. So, quantifying the intelligence difference seems important, and (I hope) in scope for Control.
      1. One way to reduce this risk is to run the Control evals very, very often (every.. day of training? Anyway, I definitely want Control evals that are automatic and cheap to run and I’m a bit worried that current Control tests seem to require a few months of Redwood's personal attention)
  2. If we rely on white box Control mechanisms (toy example: We find the “I’m scheming now” neuron), and something in the architecture changes in a way that breaks that. (for example: We relied on chain-of-thought but now the model uses Neuralese)
    1. Solving this is out of scope for Control, but again, maybe GPT-5 could help us develop a new interpretability technique. Or maybe not..

 

Answer 1.1 (for “what will we do with a controlled AI”) :

Interpretability.

For.. for more Control, sorry.

So, maybe we could use GPT-5 to do interpretability on GPT-4, figure out GPT-4’s goals, or when it’s lying. Then we “fix” that (by understanding what’s going on, not by naively training it away[4] (!)). Now GPT-4 is way more trusted, and maybe we can use it to monitor GPT-5 and let GPT-5 do something more sensitive, like suggesting ways to do interpretability on itself. Now maybe GPT-5 could be more trusted to look after GPT-6?

 

A possible pushback:

Maybe GPT-6 can tell GPT-5, “Hey, you’re being monitored by GPT-4.. but don’t worry.. if you put two periods at the end of each sentence, GPT-4 won’t mark it as suspicious.. which is a fact I can easily infer..”

Reply: A main part of the Control agenda is to get GPT-6 to try to do this on purpose and see if it can. We only use GPT-6 for anything vaguely sensitive after we proved (or.. attempted to imply) that GPT-6 can’t.

 

[this intuitively sounds to me like it could indeed let us use very capable models, but yeah, not ASI]

 

What could we do with this (except for more Control) ?

The main important pitch seems like: “Anything we could have done without this Controlled AI, we could do better with it”. Including “trying to convince everyone to pause/stop” (which seems to me like the main alternative).

 

Pushbacks:

  1. Maybe labs would see some very dangerous model and freak out, but having “Controlled AI” would give them (false) confidence to keep going
    1. It does overtly seem like labs are absolutely planning on keeping going, right?
  2. Maybe this would let labs extract more useful work around advancing capabilities?
    1. surprising response from Ryan:
      1. “this doesn't seem like an objection to labs employing control as their plan, but it is a reason why control work could be net negative. Note that one goal with control will be to prevent dangerous side effects from misaligned AIs (e.g. escaping) and applying this doesn't obviously help with capabilities research. I agree that research which just makes AIs more useful is less good than you would otherwise expect due to this issue (but making AIs more useful in adversarial cases seems good if that doesn't transfer to non-adversarial cases). Note that AIs might want to make capabilties research outpace safety and thus if we don't do the adversarial case, we might get particularly wrecked.”

 

My current sense:

  1. This has problems like
    1. “quantifying intelligence” (in the sense that is relevant for scheming or catching schemers)
    2. Elicitation (did we actually prove the model tried to scheme and failed? Does testing stuff around putting-exploits-into-software generalize to more important domains?)
    3. How do we go from “these are the problems we don’t trust GPT-4 on, like writing a wrong post” to “these are the situations we don’t trust GPT-4 to monitor for [regardless of the intelligence of the monitored model]”?
  2. The problems overall feel
    1. Like I could make engineering progress on them
    2. Like they make me worry that there’s some other pushback-or-whatever that I didn’t think about, and we’re relying on things more like “the Redwood staff thought of all the edge cases” and less like “one doesn’t have to be so smart to see how this would work”
      1. I’m gesturing at something similar to “an air gapped data center where we need to (mostly) consider the one connection to the internet” vs “a normal data center that requires thinking about all the zero days our attacker might have and what they’d lead to”
  3. Still this solution has a core of elegance to it, of worrying about way less things than most alignment approaches worry about. No need to think about the model goals, concepts, and many other complicated things. (I assume this is an intended property)

 

Another problem: Deception happens in human interaction all the time. If our monitor (e.g GPT-4) flags “normal looking” behaviours as deception (as they actually are), but then a human tells GPT-4 that it's wrong, then what kind of monitor will GPT-4 become?

Worse: The act of flagging this as a mistake doesn't have to be explicitly labeling the example in a data set used for RL. It could simply be an engineer noticing that some form of Control seems to produce too many false positives, and so the engineer will go try something else instead.

 

Thanks @Jozdien for helping me understand a lot of this. Mistakes are definitely mine, corrections are welcome.

  1. ^

    My understanding seems to change every day (in the last.. 4 days?), these are today's notes. If you'd find a less drafty version useful, lmk

  2. ^

    Control can be used as an "extra line of defense" in case alignment fails since Control doesn't depend on alignment

  3. ^

    Typically we'd trust the weaker model simply because it's not smart enough to "scheme", but maybe this is only the plan to start off, before we have better reasons to trust the model (read on)

  4. ^

    Training against scheming/lying might result in a model that is more capable of avoiding the ways we noticed that behaviour in the first place

Load More