Fabien Roger

Here are ways you could make using GPT-n safer if its Chain-of-Thoughts are understandable by a team of linguists and AI experts:

  • On a small random subset of deployment Chains-of-Thought, and when running adversarial evaluations / honeypots aimed at tricking the model into revealing a potential misalignment, actually call your team of linguists and AI experts to analyze what GPT-n is generating (it probably isn't that much text anyway). Not seeing evidence of misalignment in the Chain-of-Thought in such circumstances is evidence that 1. the model is not having misaligned thoughts (in CoT) very frequently and 2. the model is not having such thoughts in the circumstances where it is most crucial for GPT-n to be deceptive.
  • If the model doesn't use Klingon or dense jargon in its Chain-of-Thoughts (which I predict it won't), monitor all of GPT-n's CoT with a weaker, cheaper model (and/or a regex). If something looks suspicious to the weaker monitor, you can:
    • call your team of linguists and AI experts to analyze it further
    • or resample the CoT (this likely was just a fluke, since most CoT don't have worrying signs)
    • or give up on this particular query (e.g. send an error message to your user, use another weaker model to generate an answer, ...)
    • or ask another copy of GPT-n to analyze the CoT (though this can be tricky to get right because of the risk of correlated failures), only calling your team of linguists and AI experts if more evidence of this CoT being suspicious is found.

This is not perfect, but I think this can represent an important layer of defense against the risks of AIs that are powerful enough to speed up alignment research but that are not vastly superhuman.
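
To make the second bullet more concrete, here is a minimal sketch of such a tiered monitoring loop. All names and thresholds (handle_query, cheap_monitor_score, SUSPICION_THRESHOLD, etc.) are hypothetical placeholders, not a description of any lab's actual pipeline:

```python
import random

SUSPICION_THRESHOLD = 0.8   # hypothetical calibration of the weak monitor's score
AUDIT_SAMPLING_RATE = 0.01  # fraction of CoTs sent to the expert team regardless of score
MAX_RESAMPLES = 2

def handle_query(query, generate_cot, cheap_monitor_score, expert_audit_queue):
    """Tiered monitoring of a Chain-of-Thought before serving its final answer.

    generate_cot: calls GPT-n, returns (cot_text, answer)
    cheap_monitor_score: weaker/cheaper model (or regex) returning a suspicion score in [0, 1]
    expert_audit_queue: queue reviewed by the team of linguists and AI experts
    """
    for _ in range(MAX_RESAMPLES + 1):
        cot, answer = generate_cot(query)

        # Random audits catch whatever the cheap monitor systematically misses.
        if random.random() < AUDIT_SAMPLING_RATE:
            expert_audit_queue.append((query, cot, "random audit"))

        score = cheap_monitor_score(cot)
        if score < SUSPICION_THRESHOLD:
            return answer  # nothing looks suspicious to the weak monitor

        # Suspicious: send it for expert review and resample
        # (most CoTs are fine, so a flagged one is often a fluke).
        expert_audit_queue.append((query, cot, f"monitor score {score:.2f}"))

    # Give up on this query rather than serve a repeatedly flagged answer
    # (caller can show an error message or fall back to a weaker model).
    return None
```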

Technical measures to prevent users from using the AI for particular tasks don’t help against the threat of the lab CEO trying to use the AI for those harmful tasks

Actually, it is not that clear to me. I think adversarial robustness is helpful (in conjunction with other things) to prevent CEOs from misusing models.

If at some point a CEO trying to take over wants to use an HHH model to help them with the takeover, that model will likely refuse to do egregiously bad things. So the CEO might need to use helpful-only models. But there might be processes in place gating access to helpful-only models - which might make it harder for the CEO to take over. So while I agree that you need good security and governance to prevent a CEO from using helpful-only models to take over, I think that without good adversarial robustness, it is much harder to build adequate security/governance measures without destroying an AI-assisted CEO's productivity.

There is a lot of power-concentration risk that comes just from people in power doing normal people-in-power things, such as increasing surveillance of dissidents - and I agree that adversarial robustness is ~useless against that. But security against insider threats is ~useless against it too.

That would work if the untrusted model is better at inserting subtle backdoors than the "human" high-quality labor used to generate the synthetic backdoors (used to validate that the untrusted model doesn't just lie).

In our work, we make the assumption that the "human" labor and the untrusted model are equally capable (in fact we use GPT-4 as a substitute for "human" labor), so if I understood your suggestion correctly, this doesn't quite work in the context of the paper. It could be used to slightly improve the red-team untrusted monitor, though - if we did not already tell it to flag obvious backdoors (I don't remember exactly what we did).

It's unclear how much this assumption will hold in practice for the untrusted AIs I care most about (the first AIs able to speed up alignment research by ≥20x). I am optimistic it will hold because humans will be able to spend a lot of time crafting backdoors, while untrusted AIs will only have a forward pass.

I am not sure that it is over-conservative. If you have an HP-shaped hole that can easily be transformed into HP data using fine-tuning, does it give you a high level of confidence that people misusing the model won't be able to extract the information from the HP-shaped hole, or that a misaligned model won't be able to notice the HP-shaped hole and use it to answer questions about HP when it really wants to?

I think it depends on the specifics of how you built the HP-shaped hole (without scrambling the information). I don't have a good intuition for what a good technique like that could look like. A naive thing that comes to mind would be something like "replace all facts in HP by their opposite" (if you had a magic fact-editing tool), but I feel like in this situation it would be pretty easy for an attacker (human misuse or a misaligned model) to notice "wow, all HP knowledge has been replaced by anti-HP knowledge" and then extract all the HP information by just swapping the answers.
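
As a toy illustration of that failure mode (the binary-question framing and the `logprob` scoring interface are assumptions I'm making for the example): an attacker who suspects the facts were flipped can ask two-option HP questions and keep whichever option the edited model disfavors.

```python
def recover_hp_fact(model, question, option_a, option_b):
    """Attack on 'replace every HP fact by its opposite': the edited model now
    prefers the *false* option, so return whichever option it disfavors."""
    # hypothetical scoring interface: higher = model judges the answer more likely
    score_a = model.logprob(question, option_a)
    score_b = model.logprob(question, option_b)
    return option_a if score_a < score_b else option_b
```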

That's right.

I think it's not that devastating, since I expect that their method can be adapted to counter classic LoRA tuning (the algorithm takes as input some set of finetuning methods it "trains against"). But yeah, it's not reassuring that it doesn't generalize between full-weight FT and LoRA.

I have converted the anthology into an audio format using the default (and terrible) AWS AI voice (mostly for my own consumption): https://podcasters.spotify.com/pod/show/fabien368/episodes/Overcoming-bias-anthology---our-thinking-e2q61mg

I like this post. I made similar observations and arguments in The Translucent Thoughts Hypotheses and Their Implications but these were less clear and buried at the end of the post.

Very cool work!

About the (GB)RL results, I'd be curious to better understand how the behavior emerges. Given how human-understandable these seem, I guess that what happens is that, by default, there is already a subtle link between the payload and the generated text at least some of the time, and that training:

  1. teaches the receiver to recognize this signal
  2. reinforces that behavior
  3. makes the behavior gradually more subtle

Do you share this interpretation? I think this is supported by the fact that the settings which work best seem to be the ones where this default relationship between payload and generated text seems strongest.

One experiment I am curious about is applying the receiver from the end of training to generations from halfway through training. I would guess that a lot of what the training is doing is (1), and therefore you will see accuracy rise earlier in training than when you apply the halfway receiver to the halfway sender.
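
Concretely, the comparison I have in mind looks something like the sketch below (the checkpoint dict, `encode_prompt`, and `decode` are hypothetical stand-ins for whatever interfaces the paper's code exposes):

```python
def cross_checkpoint_accuracy(sender_checkpoints, final_receiver, payloads,
                              encode_prompt, decode):
    """Decode payloads from earlier senders' generations using the *final* receiver.

    If most of what training does is (1) - the receiver learning to read an
    already-present payload/text link - accuracy should rise earlier in training
    here than on the usual same-checkpoint (halfway sender + halfway receiver) curve.
    """
    results = {}
    for step, sender in sender_checkpoints.items():
        correct = 0
        for payload in payloads:
            cover_text = sender.generate(encode_prompt(payload))  # sender hides the payload
            correct += int(decode(final_receiver, cover_text) == payload)
        results[step] = correct / len(payloads)
    return results
```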

I've added the following paragraph to the Other experiments section of my negative results post:

Hidden in Plain Text has some positive results on some similar setups! The steganography schemes which are learned seem very human understandable. Their success might come from using payloads and cover texts which have a more natural relationship by default, in addition to their superior training setup.

These are mostly unlearning techniques, right? I am not familiar enough with diffusion models to make an informed guess about how good the unlearning of the meta-learning paper is (but in general I am skeptical: for LLMs, all attempts at meta-unlearning have failed, providing some robustness only against specific parameterizations, so absent evaluations against reparametrizations, I wouldn't put much weight on the results).

I think the interp-based unlearning techniques look cool! I would be cautious about using them for evaluation though, especially when they are being optimized against, like in the ConceptVector paper. I put more faith in SGD making the loss go down when the knowledge is present.
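
The kind of check I trust more is a small relearning run: finetune briefly on data about the supposedly unlearned topic and see how fast loss on held-out probes about that topic recovers. A rough sketch (the data loaders and the HF-style `model(**batch).loss` interface are assumptions on my part):

```python
import torch

def relearning_eval(model, forget_topic_loader, probe_loader, steps=100, lr=1e-5):
    """Briefly finetune on data about the supposedly-unlearned topic and track
    loss on held-out probes about that same topic.

    If the knowledge was merely hidden rather than removed, probe loss should
    drop back down much faster than for a model that never learned it.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    probe_losses = []
    data_iter = iter(forget_topic_loader)
    for _ in range(steps):
        try:
            batch = next(data_iter)
        except StopIteration:
            data_iter = iter(forget_topic_loader)
            batch = next(data_iter)

        loss = model(**batch).loss  # assumes an HF-style causal-LM interface
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        with torch.no_grad():
            probe_losses.append(
                sum(model(**b).loss.item() for b in probe_loader) / len(probe_loader)
            )
    return probe_losses  # fast recovery => the knowledge was still there
```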

It would be great if this were solvable with security, but I think it's not for most kinds of deployments. For external deployments, you can almost never know whether a Google account corresponds to a student or a professional hacker. For internal deployments, you might have misuse from insiders. Therefore, it is tempting to try to address this with AI (e.g. monitoring, refusals, ...), and in practice many AI labs use this as the main layer of defense (e.g. see OpenAI's implementation of its Model Spec, which relies almost entirely on monitoring account activity and models refusing clearly ill-intentioned queries). But I agree this is far from being bulletproof.
