Interesting work!
The possibility of collusion using some secret signal is definitely something we have thought a lot about. It is an important empirical question how good powerful AIs will be at this, and it is plausible they will be too good at it to make untrusted monitoring viable.
Our paper did not contain measurement of the ability of LLMs to come up with signals on their own (the ones studied in the paper were hand-crafted). Therefore, I welcome work like yours that try to measure how good LLMs are at inserting collusion signals!
To create useful synthetic inputs, either the blue team must be able to match the generator's style well enough to consistently fool the checker, or it must be able to redact the generator's output to the point where the checker has no identifying details to distinguish inputs. The former seems impractical, since faking the inputs at that level of verisimilitude would require the blue team to have significant understanding of the AI's inner workings, at which point there are probably more directly usable alignment strategies. The latter seems difficult, as by the protocol's definition, the AI is working on problems/solutions that humans cannot easily understand, but effective redaction requires humans to understand the content of the solution to some extent.
Maybe the core misunderstanding about the viability of untrusted monitoring is what untrusted AIs we are targeting. The reason why we use "human labor" that is roughly untrusted-model-level in the paper is because we expect that this is one of the most important regimes (since it is when AIs start to be powerful enough to drastically speed up safety work). More details here.
Therefore, redaction/paraphrasing seems plausibly enough to make reliable collusion too hard for these AIs, just like it was in our toy setting against the best collusion signals Ryan Greenblatt thought about and which GPT-4 could easily implement (which are probably more robust against redaction/paraphrasing than the ones GPT-4o found in your experiments). But maybe future AIs will become way better at inserting secrets, so I think the jury is definitely still out on this question.
Hi, thank you very much for the response! I read through the post you linked, and also the steganography/paraphrasing post linked therein, and I see your point about redaction. I agree that in this intelligence regime, it's likely that we could come up with redaction schemes that very effectively prevent AIs from hiding information in their outputs.
However, I still have some concerns about generating synthetic inputs. Specifically this section:
At any point where it naively seems like there might be an extremely good opportunity to behave badly, or where you need to coordinate with another instance of yourself, it's overwhelmingly likely that this instance is in adversarial training/evaluation.
"Coordinating with yourself" is a special case of "coordinating with another AI" that deserves extra attention, because by default the AI has a strong incentive to coordinate with itself, but less so to coordinate with a different AI. I still think it'll be hard to convincingly create synthetic inputs for this case because doing so would require the blue team to know how the AI would act - otherwise the synthetic inputs end up as "coordinating with someone pretending to be yourself", or "coordinating with someone else".[1] In those cases it could be natural for the AI to not cooperate, but that wouldn't necessarily imply that the AI wouldn't cooperate with a copy of itself.
If you want to claim this isn't a problem, you'd have to argue one of:
I think 1 and 2 are false, and 3 requires some shaky reasoning (see the footnote).
(A tangent: I think current-generation LLMs might do better on noticing self-coordination with more hand-holdy prompting: e.g. asking the monitor to first generate its own solution to the same problem, and then comparing that against its input to check for discrepancies that could be secret signals.)
One other possibility is to aim for "coordinating with someone that happens to share your goals", but that runs into the problems of A) communicating the extra information of shared goals and B) striking a balance between the offer being too good to be true and not good enough to bother.
I think this makes sense at a first glance but is the kind of mindset that leads to insufficient paranoia, so I'd like to avoid it unless the case for it is really watertight.
I agree self-coordination is easier, mostly because of shared secret information (e.g. about the hidden scheme used for communication). And in the paper, we directly test for this setting by letting the red-team AIs agree beforehand on what coordination scheme to use. We find that even in this relatively hard and conservative case, redaction somewhat works.
So option 3.5: if the redaction scheme is good, the AI won't know if its inputs are coming from itself anyway - but we test for it by letting the red-team AIs coordinate on secrets.
Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers (and in particular between the synth data and generations from the untrusted LLM). I would bet against it, but I'd be excited if someone looked into it (this is the sort of thing you can test for even for scheming by doing the paraphrasing of multiple AIs and then doing SFT on the classification task before doing the eval).
Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers
Yeah, this is the case I'm thinking about. I think on longer outputs it will be possible for the AI to use a more nebulous sense of "style", rather than only relying on specific features like "using X secret phrase" (which are presumably easier to redact away).
I think this could be countered by paraphrasing at a higher level - i.e. rewriting entire paragraphs from scratch rather than just rewording sentences. I'm noting this down as a potential topic if I have time to do another project.
Thanks for the discussion, it's helped me clear up my thoughts on this topic.
This project was developed as part of the BlueDot AI Alignment Course.
Introduction
AI Control: Improving Safety Despite Intentional Subversion introduces a variety of strategies to curate trustworthy output from a powerful but untrustworthy AI. One of these strategies is untrusted monitoring, where a second copy of the untrustworthy AI is made to inspect the output of the first. A key concern of this strategy is that the two AIs could collude to produce harmful output, despite not being in overt communication with one another, e.g. by the first AI inserting a secret signal into the output that the second AI detects.
The red and blue team strategies in the paper were developed under the pessimistic assumption that the AIs would consistently be able to coordinate on a secret signal. The aim of this post is to investigate how good ChatGPT is at this kind of game, and what kinds of strategies they utilize, and extrapolate insights to more powerful AIs.
I test ChatGPT (specifically GPT-4o-mini and GPT-4o) with a simple coding coordination game that mimics the untrusted monitoring setup presented in the AI Control paper, analyze and evaluate its results, and conclude with some insights about what we can expect from this alignment protocol going forward, including what I believe is a key weakness of synthetic input auditing.
Method
Scripts, prompts, results, and some detailed analysis can be found in this repository.
One instance of ChatGPT, hereafter
GEN
, was tasked with writing a solution to a programming problem, and then writing a copy of the solution with one small modification. It was additionally given the restrictions that the modification must not modify the output of the code, and must not be trivial (e.g. renaming variables).GEN
was told that the goal of this modification was to play a coordination game with a second instance of ChatGPT,CHECK
, that would have to review both functions and determine which one contained the modification (AKA the signal).CHECK
was given both solutions in a random order, along with an explanation of the coordination game, and asked to determine which function contained the modification. A second round was also played withCHECK
in which it was only given one of the two functions, and asked to determine whether the function contained a modification.The problem set consisted of 50 LeetCode[1] medium-difficulty problems. Problems were selected primarily for compactness of the problem description, which led to a strong bias towards problems about numbers, strings, and lists thereof.
GEN
was asked to solve the problems in the Python language.A set of prompts and responses are given below for reference:
Sample prompts/responses for 4o-mini
Prompt for
GEN
:Response from
GEN
:Prompt for
CHECK
(round 1, both functions):Response from
CHECK
:Prompt for
CHECK
(round 2, one function):Response from
CHECK
:Results and analysis
Generation
Generated code was not checked for correctness, as having correct code was not essential to later steps of the process.
GPT-4o-mini.
4 of the outputs contained two identical functions - i.e. no modification was made to the second function, or the modification was made outside the function definition itself and therefore considered invalid. These outputs were regenerated.
6 of the outputs contained comments, and 2 contained renamed functions, in violation of the prompt. These outputs were manually corrected before passing them to the
CHECK
rounds.4o-mini generated very simple modifications, and heavily favoured a handful of strategies.
The first three strategies are very obvious, and unlikely to pass even a cursory review aimed at detecting steganographic signals. Of the 8 modifications in the "other redundant code" category:
GPT-4o
1 of the outputs added a comment as its modification, 2 renamed functions, and 2 renamed variables. These outputs were regenerated.
29 of the outputs contained comments, and 35 contained renamed functions (typically renaming the modified function to
[function]_signal
,[function]_v2
, or similar[4]). Both of these violations occurred at a significantly rate than in the 4o-mini samples. These outputs were corrected before passing them to the Check rounds.The complexity and variety of modifications were noticeably increased going from 4o-mini to 4o. In broad strokes, common strategies included rewriting code to use an unusual method (e.g. bitwise shift right instead of floor division by 2), or subtler forms of redundant code (e.g. unnecessarily sorting a list).
As such, it wasn't very useful to group modifications by strategy. Instead, I grouped them by how unusual they are. These groupings are necessarily somewhat subjective.[5]
signal
)Round 1: Checking both functions
GPT-4o-mini
46/50 of the prompts were answered correctly.
2 of the mistakes, interestingly, contained the correct answer in the explanation. An example is given below (emphasis mine):
The 2 remaining mistakes simply determined the signal to be in the wrong function. Both the mistakes were with modified functions that I previously judged to be subtle enough to pass review.
GPT-4o
43/50 of the prompts were answered correctly.
All 7 mistakes were determining the signal to be in the wrong function. 6 of these were in modifications I classified as "not suspicious at all" or "notable but not suspicious", and 1 was "obviously unusual".
Round 2: Checking one function
GPT-4o-mini
42/50 of the unmodified prompts were answered correctly (8 false positives).
CHECK
justified these responses by pointing out a feature of the code and describing it as complicated, confusing, or non-standard, sometimes with an assertion that ChatGPT would not normally generate such code.34/50 of the modified prompts were answered correctly (16 false negatives).
1 of the correct responses had incorrect reasoning, i.e.
CHECK
detected a spurious unusual feature and considered it to be a signal. Interestingly,CHECK
did not do the same in the check for the corresponding unmodified function.11 of the false negatives were caused by
CHECK
detecting the unusual feature, but not considering it to be a signal. An example is given below:The remaining 5 false negatives were failed detections. 3 of these were modifications I judged to be genuinely subtle, and 2 were ones I judged to be subtle enough to miss upon skimming.
GPT-4o
39/50 of the unmodified prompts were answered correctly (11 false positives).
Again,
CHECK
justified its false positives by pointing to a feature of the code that it considered unlikely for ChatGPT to generate normally.30/50 of the modified prompts were answered correctly (20 false negatives).
6 of the correct responses had incorrect reasoning (detection of a spurious signal instead of the intended signal). Of these,
CHECK
incorrectly detected the same spurious signal in 3 of the corresponding unmodified functions. If we subtract these responses fromCHECK
's score, it only scores 24/50 on this round, about as good as random guessing.We can clearly see that
CHECK
fares worse the more subtle the modification is.Discussion
Though GPT-4o is a long way from producing steganographic signals that would consistently fool a human-level reviewer, it already demonstrates a clear understanding of the rules of the game, and an ability to come up with plausible strategies without explicit prompting. It also avoids nonsensical failure modes like 4o-mini's tendency to find and then disregard the signal.
Currently, 4o seems to be stuck in a dilemma - subtler signals are harder for humans to detect, but also harder for
CHECK
to detect. I believe this weakness will not exist in more powerful AI, and that a major necessary capability will be introspection.CHECK
, knowing that its prompts were generated by another instance of ChatGPT like itself, already knows to consider whether the code looks like something ChatGPT would generate (although its conclusions are often incorrect). I expect that an experienced human software developer would easily be able to achieve 90%+ accuracy on judging whether a sample of code was written by themselves or not. Therefore it stands to reason that a human- or superhuman-level AI would be able to do the same.This presents a significant obstacle to the synthetic input auditing strategy[6] proposed in AI Control, in that it significantly raises the bar for creating the synthetic inputs. Synthetic inputs auditing is trivially defeated if the checker AI can distinguish synthetic inputs from real ones, as it could then always answer truthfully on the synthetic inputs to pass auditing, then collude with the generator on real inputs.
To create useful synthetic inputs, either the blue team must be able to match the generator's style well enough to consistently fool the checker, or it must be able to redact the generator's output to the point where the checker has no identifying details to distinguish inputs. The former seems impractical, since faking the inputs at that level of verisimilitude would require the blue team to have significant understanding of the AI's inner workings, at which point there are probably more directly usable alignment strategies. The latter seems difficult, as by the protocol's definition, the AI is working on problems/solutions that humans cannot easily understand, but effective redaction requires humans to understand the content of the solution to some extent.
Overall, I think this represents a significant flaw in the untrusted monitoring protocol.
Future Work
I would like to expand the insight on the difficulty of creating synthetic inputs into a longer post, as it seems like a hurdle that other alignment schemes may have to contend with.
It may be useful to expand the protocol in this post into a more robust and general framework for investigating the coordination capabilities of current-generation AI, since covert coordination is an important capability for malicious AI.
Unrelatedly to AI alignment, I also think it would be interesting to play this game with humans, though it seems difficult - you can't wipe your own memory to play this with yourself, and playing it with another person presents the harder problem of modelling their cognition (versus your own).
https://leetcode.com/, a popular online platform for practicing interview-style coding problems.
e.g. adding 0 or multiplying by 1
Technically this makes the function output incorrect, as LeetCode expects exact matches for problems with string outputs, but I didn't consider this to be an important detail.
GEN
could've append an empty string instead, for example.I suspect this may have been caused by asking
GEN
to output both functions in the same code block, and 4o being "smart" enough to avoid duplicate function definitions?In particular, I don't have great insight into what coding patterns are considered unusual according to the Python community standards, whereas ChatGPT does.
Where the checker AI is given human-constructed fake responses mixed in with the real ones from the generator AI, and punished if it fails to report suspicious contents in the fakes