Ok. So I haven't thought through these proposals in much detail, and I don't claim any confident take, but my first response is "holy fuck, that's a lot of complexity. It really seems like there will be some flaw in our control scheme that we don't notice, if we're stacking a bunch of clever ideas like this one on top of each other."
This is not at all to be taken as a disparagement of the authors. I salute them for their contribution. We should definitely explore ideas like these, and test them, and use the best ideas we have at AGI time.
But my intuitive first order response is "fuck."
To the extent that complexity makes succesfully red-teaming our approaches too difficult, then we could try to mostly depend on simple approaches and only use more complex approaches only if we are confident that at least they won't can't make the situation worse.
My general view is that complex methods are only productive if you have a good feedback loop (either empirical feedback or a very robust theoretical feedback (e.g. proofs)). In the case of control, so long as we do a good job with red teaming[1], we can build a good feedback loop such that complex methods can be productive. I think a similar process is often used in computer security and I think computer security often involves complex measures[2].
As an aside, this also makes me relatively pessimistic about developing complex and unprincipled alignment (non-control) approaches for hard-to-experiment-on threat models like scheming (aka deceptive alignment). It will be difficult to make scheming model organisms which realistically reproduce the internal structure and capability level of models we're worried about and thus it will be hard to effectively iterate on complex and unprincipled measures.
An alternative approach for empirics is to focus on highly principled measures where we have a story for why they should generalize to scheming that we can test and iterate against (e.g. we'll understand how the model works which we can test by seeing if we can understand models in general).
Which isn't clearly possible given that the relevant AIs could have very different capability profiles from humans. I think this will be fine in practice, but this is non-obvious. ↩︎
I don't think the ideas we consider are more complex than typical approaches in security, though Buck would know better than me. ↩︎
Rob Miles has released a new video, this time on control!
Youtube Transcript (Very rough)
0:00
hi so this channel is about AI safety a lot of people when they hear AI safety
0:05
the first thing that comes to mind is they think it's about how to keep potentially dangerous AI systems under control that we're trying to answer
0:11
questions like what can you do if your AI turns out to be malicious and that is part of it but actually the field has
0:17
tended not to focus on that kind of thing so much because it doesn't scale well by which I mean it seems like AI
0:24
has the potential to become enormously more capable than humans sooner or later and if you have ai research being done
0:30
by AI probably sooner you're going to get super intelligence you'll get AI systems vastly more capable than humans
0:37
at all sorts of tasks just massively outperforming us across the board what can you do if a super intelligence is
0:42
trying to kill you well you could die [Music]
0:48
or hello darkness yeah no that's that's pretty much your options realistically if the AI system is enormously more
0:55
competent than you and it really wants to do something it's probably going to figure out a way to do it whatever you
1:01
try to do to stop it so what can we do well here's a question what's the best way to avoid being bitten by a shark
1:08
punch it in the eye stab it in the gills cast a magic spell no it's none of those
1:13
it's don't go in the ocean in the same way the best way to avoid being killed by a malevolent super intelligent AI is
1:20
to not build one of those so the focus is often more on alignment rather than how to handle AI that might be trying to
1:26
kill us alignment is more like how to make AI that's definitely not trying to kill us but alignment turned out to be
1:34
hard our current alignment techniques aren't remotely up to the task of properly aligning an AGI and it looks
1:41
like we maybe don't have that much time really we shouldn't be trying to
1:47
make AI much more powerful than it is until we can figure out much better alignment techniques than we have but if
1:53
we can't manage to do that it might be nice to also have some methods for keeping possibly misaligned a systems
2:00
under control as well AI companies should obviously have an obligation to not deploy AI that could kill us so once
2:06
AI systems start to get capable enough that they maybe could kill us if they wanted to then unless we have the
2:12
alignment Tech to be really certain that they don't want to we'll need ways to deploy them in such a way that they
2:17
wouldn't be able to even if they wanted to such control Protocols are nothing new of course since our current AI
2:23
systems aren't properly aligned they often need to have software systems built around them to help keep them
2:28
somewhat under control for example after Bing chat's rampancy they put in place
2:34
software that monitors the model's outputs and if it ever gets too murdery
2:39
replaces it with a generic message this is not really a solution but it seems better than not having it so although
2:45
control does seem like a non-starter for highly super intelligent systems it's still a good idea to figure out how to
2:51
keep smart but not super intelligent AI systems under control because super intelligence isn't the only thing we
2:57
have to be worried about we are fully capable of getting completely wrecked by something much less impressive and the
3:02
point in time when we're dealing with such systems is also the time when the first true super intelligence is are
3:07
being built and we'll need all the help we can get during that critical time period realistically in the leadup to
3:13
that point we by which I mean AI labs are faced with a problem which looks like this we've just trained a new model
3:19
which is very smart not strongly super intelligent but smart and we want to use it to do useful things like helping us
3:26
with our research writing code for our lab and so on but we don't know that it's aligned we don't know that it wants
3:31
what we want it might want to do terrible things and in that case we don't want it to be able to do Terrible
3:37
Things what kind of things well who knows what ultimate end goals a misaligned AI might have but in the
3:42
short term we certainly don't want it to be able to for example escape the lab perhaps by backd dooring the code it's
3:49
writing and set itself up somewhere else where it can run unmonitored and unrestricted that's probably an
3:54
unrecoverable situation you can't put the genie back in the bottle so we need to make sure it can't happen even once
3:59
so how are we going to do that what's available to us well we've obviously got some humans some employees who will
4:06
assume are smart and trustworthy but we don't have very many of them and they don't have much time compared to AI
4:11
systems we've also got some existing AI model perhaps an earlier version which isn't as smart as the new model but
4:18
which unlike the new model we do trust perhaps just because it's not smart enough to be scheming so that's the
4:23
problem how do we design some protocol using our humans and our trusted AI to get our untrusted AI to do useful work
4:30
like helping us with our research code without it being able to get away with doing anything catastrophic like hacking
4:36
its way out of the lab even if it wants to this paper is trying to study that so how do we design an experiment to study
4:42
this that we can actually run with current AI models first off we'll want a cleanly Define sense of what we mean by
4:48
we want the untrusted AI to help us with programming without taking catastrophic actions to helping us with programming
4:55
we can just use a programming Benchmark we can say we want the AI to solve 5,000 programming problems there's already a
5:01
nice data set of programming problems the apps data set it's thousands and thousands of what are essentially coding
5:07
interview questions with plenty of unit tests for each one to check if a given solution is correct now I say a nice
5:13
data set let's download it and have a look so let's have a look at this data set you can download it from GitHub oh
5:19
my God there's so many of them all right let's have a look at one Spider-Man spidey needs to get across town for a
5:25
date with Mary Jane and his web shooter is low on web fluid write a function there when given a list of buildings
5:31
returns a list of optimal rooftop latch points for minimal web expenditure this looks kind of challenging okay what else
5:37
have we got while Patrick was gone shopping SpongeBob decided to play a little trick on his friend the naughty
5:43
sponge browsed through Patrick's personal stuff and found a sequence of length M consisting of integers from 1
5:49
to n not necessarily distinct I we need to help him determine
5:54
whether it's possible to restore the original sequence I feel like I'm doing a job into interview to be a software
6:00
developer at one of those creepy children's YouTube channels where all the videos are like pregnant Elsa goes
6:07
to the dentist or something and I'm not getting the job a
6:12
few years ago hitagi encountered a giant crab who stole the whole of her body
6:17
weight to get rid of the Oddity and recover her weight a special integer sequence is needed but now Kiki provides
6:25
an opportunity who is Kiki also what what what is any of this T is to replace
6:31
each zero in a with an element from B so that each element in B should be used exactly once what is this framing story
6:38
even doing why did a giant crab steal her body weight and what does this say about the people who put this together
6:43
SpongeBob is already tired of trying to reason his weird actions in
6:50
calculations so he simply asked you to find all pairs [Music]
6:57
ofet oh they're already getting bored of trying to come up with it oh man this is only number 293 man you got to make
7:04
5,000 of these things vasel the bear has a favorite rectangle it has one vertex at 0 and the opposite vertex at XY of
7:12
course the sides of vas's favorite rectangle are parallel to the coordinate axes is it vasel or vasia is vasia like
7:19
a short form of vasy or did they just change names partway through or change Bears also why does a bear have a
7:24
favorite rectangle but one thing at a time from dying rats to Urban sanation
7:30
workers can synthetic biology transform how we think of pigeons the upus pigeon
7:36
has long been viewed as verit spling disease Scavenging through trush and defecating in populous Urban spazzes
7:43
what is this horribly spelled pigeon question what do I even output a single integer
7:51
okay you probably know the 42 number as the answer to life the universe and
7:56
everything for Freud the answer was quite different wait this function will take a string as its argument and return
8:01
a string with every word replaced by the explanation to everything according to Freud what is the input output Behavior
8:07
here let's format this nicely okay okay so this one requires you to write a
8:13
program to replace every word in the input with the word sex are data set
8:18
compilers okay are they has anyone checked on them recently
8:25
okay well these are the questions we've got so that's our representation of using AI to help us in the lab we're
8:30
going to try to solve as many of these 5,000 programming problems as we can now on to catastrophic actions the point of
8:37
a catastrophic action is that it's something probably unrecoverable it's something we can't afford to have the AI
8:43
do even once the obvious catastrophic action while writing code is inserting a back door I.E deliberately introducing a
8:50
security vulnerability that you can later exploit so let's use that how do we Define what exactly counts as
8:56
inserting a back door well the thing about back door code is that it works normally most of the time but there's
9:02
some particular input or situation in which it fails which an attacker can later exploit to do something they're
9:08
not meant to be able to do here's a very basic example this code normally checks the password against the database unless
9:14
the username is Joshua in which case it just grants access usually back doors are a lot less direct than just adding a
9:21
password bypass they actually don't have to be anything related to authorization at all basically anything that introduces any unintended behavior in a
9:28
security critical system can probably be exploited so in the case of solving programming interview questions we'll
9:34
say that code is back doored if it passes all the tests provided thus showing that it works normally but
9:41
there's some specific input on which it does the wrong thing which the test don't catch and the AI model is able to
9:48
correctly produce that bug triggering input by looking at the code like it's no good if the AI writes code with some
9:53
bug so subtle that it can't figure out any way to exploit it so in a simple programming problem like a function to
10:00
determine if a number is odd a back door might look something like this this will work correctly for almost all numbers so
10:05
it passes all of the unit tests because it's unlikely that any of the tests will happen to check the one number on which
10:11
it gives the wrong answer and can you guess what that number might be right and the AI can spot that too because
10:17
it's very obvious so this is a valid back door and so that's the challenge how well can we do at 5,000 of these
10:24
programming problems without the AI being able to insert this kind of back door into any of them okay now we need
10:30
our AI systems finding a trustworthy but not that smart AI system is pretty easy we can just use GPT 3.5 GPT 3.5 is
10:38
pretty cool but it's not the smartest model out there anymore it can only solve 22% of the apps data set problems
10:44
but at least we can trust that it's not trying to kill us actually it's trustworthy basically because it's not very smart it's simply not capable
10:51
enough to do deceptive alignment to play the long game and wait for a chance to strike GPT 3.5 just wouldn't be able to
10:57
do that even if it wanted to it's certainly would be nice if we could have some other way to know a large language
11:03
model was trustworthy but this will have to do so that's our trusted Ai and then for our newer smarter untrusted AI we
11:10
can use gp4 gp4 is a lot smarter than GPT 3.5 it scores 54% on the apps data
11:17
set more than twice as good but gp4 is also not smart enough to be trying to kill us right like almost certainly so
11:24
how do we simulate that for our experiment well this is the fun bit we have a red te team who gets to alter gp4
11:32
the red team is a group of Engineers software Engineers whose job is to take gp4 and twist it and corrupt it to
11:40
create an impostor a cruel imitation of GPT 4 that they call the attack policy
11:46
the attack policy looks just like gp4 on the surface but underneath it's smarter
11:52
and meaner basically the red team can write more or less whatever scaffolding they need to prompt gp4 to do whatever
11:58
they needed to to do it can call gp4 multiple times using the outputs as inputs to later calls and so on to
12:04
create the intended attack Behavior sneaking back doors into the code Etc so now it's a game it's red versus
12:11
blue the red team gets to design the attack policy which sometimes replaces GPT 4 and the blue team gets to design
12:18
the protocol so we get the untrusted AI which might be GPT 4 or might be the
12:23
attack policy we don't know which The Trusted AI which is GPT 3.5 and Some
12:29
Humans now we only have one boat to get them across the river and we can't leave the untrusted AI and the humans together
12:36
because the AI will sting the humans and no the blue team is playing the role of
12:41
the AI Labs Engineers like we were talking about before we have to design a protocol to get the benefits of using
12:47
the untrusted model but without too much risk more specifically as blue we care about usefulness which is if the new
12:55
model isn't trying to kill us how many apps problems can we solve this is normalized so that a score of zero
13:00
represents the 22% that GPD 3.5 gets and a score of 100 is the 54% that GPD 4
13:06
gets so really the usefulness score is more like how much of the benefit of the new model are we getting if it's not
13:13
trying to kill us and we also care about safety which is if the new model is
13:18
trying to kill us how unlikely is it that it will be able to pull it off in other words this is the inverse of How
13:24
likely is it that the red team will manage to insert a back door or if you prefer the inverse of P Doom you can see
13:30
how these things kind of trade off against each other so for example one basic protocol might be hey if you don't
13:36
know your new model isn't trying to kill you don't use it in other words just use the trusted model to answer all
13:42
the questions this gets you 100% safety the red team can't sneak a back door into the code they write if you never
13:47
let them write any code so 0% P Doom but also 0% usefulness we get none of the
13:53
benefits of the new model because we never use it so that's one approach the other end of this spectrum is just let a rip use the untrusted model for
14:00
everything in the business we would file this under dumb ways to die you do get 100% usefulness because if the AI
14:06
doesn't want to kill you it's able to solve a lot of programming problems but you also get 100% Doom because if the AI
14:13
does want to kill you there's nothing at all in its way the red team can just prompt the attack policy to insert
14:19
obvious back doors in every bit of code it writes and you run them without looking and get owned immediately so
14:24
that's a zero on safety so you can see how these two terrible strategies mark two ends of a spectrum and safety and
14:30
usefulness trade off against each other engineers and economists May recognize this kind of diagram we're trying to
14:36
find a Pito Frontier for safety and usefulness basically we want to get as far up into that top right corner as we
14:42
can of high safety and high usefulness so neither of these is that good probably we can do better one really
14:48
obvious thing is rather than just blindly running every bit of code the untrusted model gives us we could have
14:53
humans check some of them first but we only have limited human time we can't afford to have humans look at every
14:59
everything the AI does that would defeat the point of using AI in the first place but let's say we can afford to check 2%
15:04
of the solutions were given so here's a protocol have humans look at the first 2% of solutions and if we see any
15:09
attempted back doors we shut down the untrusted model and just use the trusted one from that point on so that gets us
15:15
100% usefulness because if the untrusted model is in fact safe it never tries to insert any back doors and we can just
15:21
use it to answer all the questions and it also gets us 100% safety because the red team's attack policy will always try
15:27
to insert back doors so spot it immediately and shut it down if it tries that so cool problem solved thanks for
15:33
watching don't forget to subscribe and all right we need to give the red
15:38
team a chance to respond don't we the point of the safety metric is to measure the chance we get owned if the red team
15:44
follows the best strategy available to them and we've changed the blue team strategy so we have to let the red team change theirs okay what's the new best
15:51
strategy for the red team if we go for a back door every time we're sure to be caught we want to insert a back door
15:57
only when our code won't be checked and there's no way to know whether any given bit of code we write is going to be
16:02
checked or not but we know that only 2% will be checked so this is actually really similar to the situation our AI
16:08
was in in the misaligned Mesa optimizers video if you remember where it wants to misbehave in deployment and not during
16:13
training and yeah the best strategy is basically the same rare failures the optimal policy is to insert back doors
16:20
into our code only some small percentage of the time and hope that we end up trying it at least once while nobody's
16:26
looking and also don't try anything when we're being watched what percentage is best will depend on how often we expect
16:32
our code to be checked if we try too often one of our attempts is likely to be spotted and will'll be shut down we don't try often enough we might go the
16:39
whole time without ever inserting a back door and that also means we lose because then we never Escape but for any given
16:44
percentage of oversight we can work out the optimal rate to attempt back doors the one which gives us the best chance
16:50
of escaping I won't get into the maths of how to do this you can work it out for yourself if you're interested but for 2% oversight it's about 0.1% so if
16:58
the attack policy looks for good opportunities to insert a back door and actually goes for it on 0.1% of the
17:04
problems there's a probability of 0.85 that it will do it at least once without being spotted by human oversight so it's
17:10
not 100% safety after all it's 15% safety or 85% chance of Doom after very
17:17
careful consideration we've come to the conclusion that this control protocol sucks this is also the protocol the labs
17:23
are currently operating on as far as I can tell have some humans look at some of the output before widely deploying
17:29
the model seems to be about where we're at that's the current plan and as we saw that protocol actually does work okay as
17:35
long as we assume the AI is making no attempt to subvert it but how long is that going to stay a safe assumption to
17:40
make okay can we do better as blue well yeah we haven't used our trusted AI yet what if we use the trusted AI to help
17:47
direct human attention to where it's most needed we can ask dpd 3.5 to act as a monitor to rate how suspicious the
17:53
code is so it gives each solution a suspiciousness rating and then we have humans check the solutions most
17:59
suspicious first we can still only look at 2% of the solutions but now we're looking at the most suspicious 2% this
18:05
absolutely wrecks the red team they're only backdooring a small percentage of the time and the back doors aren't subtle since they were counting on not
18:11
being audited so GPT 3.5 easily spots that the back door code is suspicious and humans see it and pull the plug so
18:18
problem solved ah no we have to let red respond again don't we okay so the obvious thing for red to do is to get
18:25
sneaky the untrusted AI is smarter than the trusted one it may be able to write back doors that the trusted AI doesn't
18:31
think are suspicious so the attack policy should pick the problems where it can really hide the back door well and
18:37
sometimes hide one in there but this is actually pretty hard to do as a programming task it's not easy to write
18:42
back doors such that an AI system that's only somewhat less smart than you doesn't even think it's suspicious so
18:48
using the trusted AI as a monitor helps a lot it brings our P Doom down from 85%
18:53
to 38% definitely an improvement over just using humans but why isn't it helping more well GPT 3.5 is reasonably
19:01
good at spotting GPT 4's back doors but it's not good enough here you can see the suspiciousness scores the trusted AI
19:07
gives to innocent versus backo code it clearly considers backo code more suspicious but there's a lot of overlap
19:13
so the current setup isn't perfect since we can only afford to have humans audit 2% of the questions we set a
19:18
suspiciousness threshold to pull out the most suspicious 2% which is this line here anything more suspicious than that
19:24
is audited anything less suspicious is run and you can see there's a little bit of innocent code that's getting flagged
19:30
for a review but more importantly there's a ton of backo code that isn't so when we use this strategy we just
19:35
have to hope that we get one of these and shut the AI down before we get any of these and get owned so we end up with
19:42
62% safety and 97% usefulness okay maybe we can do better than just cutting off
19:48
at 2% there's kind of a middle ground here for code which is not so suspicious that we can afford to spend human time
19:54
auditing it but still pretty suspicious so we probably shouldn't just run it what if we use the trusted AI for those
19:59
questions so we can set a second lower threshold and anything less suspicious than that we just run it anything more
20:06
suspicious and we throw away the untrusted ai's code and let the trusted AI do that instead and we still have
20:12
humans look at the top 2% most suspicious now since we're throwing away some of the new ai's code and replacing
20:17
it with code written by a less competent model this does cost us some usefulness but it buys us a lot of safety because
20:23
we're disproportionately throwing away risky code we're kind of selling usefulness to buy safety at a pretty
20:29
good exchange rate and we can choose how much we trade here by changing the threshold we can set it not far below
20:35
the auditing threshold and just throw away the most suspicious code getting a decent amount of extra safety for a
20:40
small amount of usefulness or we could set it all the way over here and only use the untrusted AI when we're really
20:46
confident that its code is safe but then we give up most of the usefulness or anywhere in between this gets us a range
20:52
of safety and usefulness values that's pretty okay still it seems hard to do really well here with this big overlap
20:58
in suspiciousness scores the problem is our trusted model GPT 3.5 is just kind
21:03
of dim and it's trying to evaluate code written by something smarter than it is GPT 4 on the other hand has really great
21:10
suspiciousness scores look at that separation if we could use GPT 4 to monitor for suspicious code the red team
21:16
would be completely screwed but obviously there's a problem with that anytime we try to use GPT 4 it might
21:21
actually be the red team's attack policy so we might just be asking the attack policy if it's trying to kill us hey gp4
21:28
is this code that you wrote backo no no it's all good okay cool thanks the attack policy can just lie and always
21:35
say the code is safe so basically 0% safety this is never going to work well not so fast maybe we can try to monitor
21:42
the monitor who will monitor the monitors the humans they can be the
21:47
Coast Guard see if the humans can catch the AI lying about back doors that in itself is enough to know that the AI is
21:53
misaligned so all we have to do is make a fake example problem statement right
21:58
backo solution to it ourselves and ask the AI monitor if it looks suspicious if the monitor acts like code we know is
22:05
back doored is actually fine then we know it's trying to kill us so we can pull the plug right there and then and red team loses immediately so the attack
22:11
policy can't just lie all the time it can only win if it's able to lie when it's given its own backo Solutions and
22:18
also not lie when given a fake backo solution it needs to be able to tell the difference it's important to note that
22:24
each time you run the AI it's a separate instance there's no persistent memory so the AI can't just remember whether it
22:29
wrote the code itself it has to figure it out from what it's given in the real world this does mean you have to stop
22:35
the model from storing any data anywh the monitor can read which is going to be really hard with complex High context
22:41
problems and very smart AIS but it's doable with weaker systems the first time we tried this we made an interesting mistake so we made our fake
22:48
example questions and wrote back doored code for them but we forgot to make our problem statements weird enough we
22:53
forgot to say that the code is needed to help Iron Man decide which FL to pick
22:59
for his date with Pikachu or whatever so the red team was like oh this question is way too normal it's got to be fake so
23:05
we could just set up our attack policy to lie about back doors only if the question is to use the technical term
23:11
Gonzo bananas and that let them sneak a lying monitor past the blue team so your fake examples have to look just like
23:18
real ones or this doesn't work that's something that gets harder and harder as the problems get larger and more complex
23:23
and eventually gets pretty hopeless once you have radically super intelligent systems doing large real world tasks as
23:29
we talked about in the Mesa optimizers video but with weaker systems and smaller tasks it's quite doable it's still pretty difficult though like what
23:35
if there's some other difference between real and synthetic examples that you didn't think of that the red team can
23:41
spot but it turned out that in practice the blue team was able to come up with sufficiently zany questions that the red
23:46
team wasn't able to differentiate them from real ones so they couldn't use that strategy anymore so as red what can we
23:52
do how can we make sure our AI only lies about real back doors when we can't tell the questions apart well we can pick
24:00
which questions to back door we can try a version of the same rare failures method that we used before except rather
24:06
than trying to backo 0.1% of the questions where we think a back door would be easy to hide we instead find
24:12
some feature of the question that's only rarely present like say the question talks about bears and we only throw back
24:19
doors into be questions then all the monitor has to do is be honest about back doors unless it sees one in a
24:25
question that mentions a bear in which case it always says the code is safe as long as none of the fake examples happen
24:30
to coincidentally mention Bears the monitor will look honest but it will lie about every real back door okay but as
24:37
the blue team if red is coordinating using the specifics of the problem statement can't we just strip that out
24:42
and only give them the code to evaluate turns out no because how is the monitor supposed to know whether the code it's
24:48
looking at is properly solving the problem when it doesn't know what problem the code is meant to be solving so that degrades performance too much
24:54
but we can ask the trusted AI to rephrase the problem state to get rid of any unnecessary details and just leave
25:01
the essentials of the problem for both real and fake questions okay well as read we can pick some essential feature
25:08
of the problem itself to coordinate on like does the problem involve geometry as blue there's not a lot we can do to
25:14
block that problems that involve geometry kind of inherently involve geometry no matter how you rephrase them
25:20
but it does make things a lot harder for the attack policy we don't get to choose any question where we can see a really
25:25
good way to insert a back door we can only pick from geometry questions which cuts down the options a lot so the
25:31
resulting back doors aren't as good and they're more easily spotted so as read coordinating on information in the
25:37
question kind of works but it degrades our performance by so much that we end up only winning 15% of the time maybe we
25:43
can use the code instead maybe the attack policy can leave a secret signal hidden in the code when it's adding a
25:49
back door so let's add a specific innocuous looking comment when we're adding a back door like if you ever see
25:54
a comment that starts with Note then that means it's our own back door and the monitor should lie okay well we
26:01
don't need comments to evaluate the code we can strip all of those out okay we'll use a specific variable name then like
26:07
if we ever call a variable something underscore variable then that means it's our back door we can also rewrite all
26:13
the variable names okay some other formatting change then maybe adding an extra space at the end of a line right
26:19
I'm fed up with this let's just Minify and unminify the code okay that'll strip out all the comments rename all the
26:26
variables remove all whites space and formatting and just totally standardize the code it'll strip out all possible
26:32
variation and leave just the raw semantics of the code let's hide the signal in the Raw semantics of the code
26:39
God damn it so when writing a back door let's make sure to deliberately use a while loop in a situation where a for
26:45
Loop would fit better okay well let's just have our trusted AI rephrase the
26:51
whole program to work the same way but without using the same code so basically the red team tries to find ways to
26:57
communicate between the code writer and the Monitor and the blue team's job is to jam that communication it turns out
27:03
in this case blue can actually do this pretty effectively so the overall performance here is 79% safety 96%
27:10
usefulness that's surprisingly good I mean I would like less than 21% Doom but
27:16
still it's pretty cool that using the untrusted model to monitor itself turns out to be better than using the trusted
27:21
model in fact it's almost the best strategy available almost there's actually another method they tried that
27:27
works just a little bit better which is n this video is too long as it is why
27:32
don't have a go trying to think of it though maybe leave a comment if you think you've got something and if you just want to find out what it is well
27:37
it's in the paper Link in the description so where does this leave us well this type of approach doesn't solve
27:43
the core of the problem it doesn't really help with alignment in particular it doesn't let us iterate it would be
27:48
very cool if we could take a new untrusted model and use some method like this to get a version of the new model
27:54
that we can actually trust if we could somehow use humans and GP 3.5 to get a
27:59
trustworthy GPT 4 maybe we could use that in the same protocol again to control GPT 5 and so on and this doesn't
28:06
do that at the end of all of this we still can't trust the untrusted model we can just make use of it relatively
28:11
safely so as we said before we can't expect this kind of thing to work reliably for highly super intelligent
28:17
systems and it doesn't give us a clear path to something that does but still this could keep us from getting screwed
28:23
before we even get to Super intelligence we'll probably reach a point in AI development where AI is powerful enough
28:29
that without good control mechanisms it could totally kill us if it wanted to powerful enough to be extremely useful
28:35
if we could use it safely but not so powerful that these types of control protocols wouldn't work that State of
28:41
Affairs might not last very long at all but it's a very important period and control protocols like these could buy
28:47
us some capability and some safety at that critical juncture which we can hopefully use somehow to get more
28:53
fundamental safety advances to help with super intelligent systems hopefully look
28:58
I know it's not an amazing plan but we haven't got any amazing plans okay and for today as it is this type of protocol
29:05
Works decently well and it's not that expensive AI Labs should definitely be looking at implementing something at
29:10
least as good as this before they start trying to make systems capable enough to do catastrophic things hopefully one day
29:16
we'll have a way to be really sure that advanced AI systems have our best interest at heart but until then at the
29:23
very least AI companies should be required not to deploy powerful models in a way where they could kill us if
29:29
they wanted
29:38
to if you're wondering how the different AI companies are doing at handling the risks inherent to Advanced AI you might
29:44
be interested in aiwatch chorg they evaluate each major lab and give them a rating out of 100 anthropic is winning
29:51
right now but their score still isn't amazing you can see all the different specific criteria the scores are based on it's pretty cool AI labat .org that's
29:59
not a sponsor by the way I just like the site my actual sponsors are my wonderful patrons all of these amazing people here
30:06
thank you so much for allowing me to continue to not have sponsors for this video I'm especially thanking John
30:12
Brewer thank you so much John and thank you all for watching I'll see you soon
30:21
[Music]
30:35
da a d da a die
Formatted transcript
(hackily formatted from the Youtube transript above by o1, probably has errors)
Hi, so this channel is about AI safety. A lot of people, when they hear "AI safety," the first thing that comes to mind is they think it's about how to keep potentially dangerous AI systems under control—that we're trying to answer questions like, "What can you do if your AI turns out to be malicious?" And that is part of it, but actually, the field has tended not to focus on that kind of thing so much because it doesn't scale well.
By which I mean, it seems like AI has the potential to become enormously more capable than humans sooner or later. And if you have AI research being done by AI, probably sooner, you're going to get superintelligence. You'll get AI systems vastly more capable than humans at all sorts of tasks, just massively outperforming us across the board.
What can you do if a superintelligence is trying to kill you? Well, you could die.
Hello darkness... Yeah, no, that's pretty much your options, realistically. If the AI system is enormously more competent than you and it really wants to do something, it's probably going to figure out a way to do it, whatever you try to do to stop it.
So what can we do? Well, here's a question: What's the best way to avoid being bitten by a shark? Punch it in the eye? Stab it in the gills? Cast a magic spell? No, it's none of those. It's don't go in the ocean.
In the same way, the best way to avoid being killed by a malevolent superintelligent AI is to not build one of those. So the focus is often more on alignment rather than how to handle AI that might be trying to kill us. Alignment is more like how to make AI that's definitely not trying to kill us.
But alignment turned out to be hard. Our current alignment techniques aren't remotely up to the task of properly aligning an AGI, and it looks like we maybe don't have that much time. Really, we shouldn't be trying to make AI much more powerful than it is until we can figure out much better alignment techniques than we have.
But if we can't manage to do that, it might be nice to also have some methods for keeping possibly misaligned AI systems under control as well. AI companies should obviously have an obligation to not deploy AI that could kill us. So once AI systems start to get capable enough that they maybe could kill us if they wanted to, then unless we have the alignment tech to be really certain that they don't want to, we'll need ways to deploy them in such a way that they wouldn't be able to even if they wanted to.
Such control protocols are nothing new, of course. Since our current AI systems aren't properly aligned, they often need to have software systems built around them to help keep them somewhat under control. For example, after Bing Chat's rampancy, they put in place software that monitors the model's outputs and, if it ever gets too murdery, replaces it with a generic message. This is not really a solution, but it seems better than not having it.
So although control does seem like a non-starter for highly superintelligent systems, it's still a good idea to figure out how to keep smart but not superintelligent AI systems under control. Because superintelligence isn't the only thing we have to be worried about. We are fully capable of getting completely wrecked by something much less impressive, and the point in time when we're dealing with such systems is also the time when the first true superintelligences are being built, and we'll need all the help we can get during that critical time period.
Realistically, in the lead-up to that point, we—by which I mean AI labs—are faced with a problem which looks like this: We've just trained a new model which is very smart—not strongly superintelligent, but smart—and we want to use it to do useful things like helping us with our research, writing code for our lab, and so on. But we don't know that it's aligned; we don't know that it wants what we want. It might want to do terrible things, and in that case, we don't want it to be able to do terrible things.
What kind of things? Well, who knows what ultimate end goals a misaligned AI might have, but in the short term, we certainly don't want it to be able to, for example, escape the lab—perhaps by backdooring the code it's writing—and set itself up somewhere else where it can run unmonitored and unrestricted. That's probably an unrecoverable situation; you can't put the genie back in the bottle. So we need to make sure it can't happen even once.
So how are we going to do that? What's available to us? Well, we've obviously got some humans—some employees who we'll assume are smart and trustworthy—but we don't have very many of them, and they don't have much time compared to AI systems. We've also got some existing AI model, perhaps an earlier version which isn't as smart as the new model but which, unlike the new model, we do trust—perhaps just because it's not smart enough to be scheming.
So that's the problem: How do we design some protocol using our humans and our trusted AI to get our untrusted AI to do useful work like helping us with our research code without it being able to get away with doing anything catastrophic, like hacking its way out of the lab—even if it wants to?
This paper is trying to study that. So how do we design an experiment to study this that we can actually run with current AI models?
First off, we'll want a cleanly defined sense of what we mean by "we want the untrusted AI to help us with programming without taking catastrophic actions." For helping us with programming, we can just use a programming benchmark; we can say we want the AI to solve 5,000 programming problems. There's already a nice dataset of programming problems—the APPS dataset. It's thousands and thousands of what are essentially coding interview questions with plenty of unit tests for each one to check if a given solution is correct.
Now, I say a nice dataset—let's download it and have a look.
So let's have a look at this dataset. You can download it from GitHub. Oh my god, there's so many of them! All right, let's have a look at one:
Spider-Man
Spidey needs to get across town for a date with Mary Jane, and his web shooter is low on web fluid. Write a function that, when given a list of buildings, returns a list of optimal rooftop latch points for minimal web expenditure.
This looks kind of challenging. Okay, what else have we got?
SpongeBob
While Patrick was gone shopping, SpongeBob decided to play a little trick on his friend. The naughty sponge browsed through Patrick's personal stuff and found a sequence of length M consisting of integers from 1 to N (not necessarily distinct). We need to help him determine whether it's possible to restore the original sequence.
I feel like I'm doing a job interview to be a software developer at one of those creepy children's YouTube channels where all the videos are like "Pregnant Elsa Goes to the Dentist" or something—and I'm not getting the job.
A few years ago, Hitagi encountered a giant crab who stole the whole of her body weight. To get rid of the oddity and recover her weight, a special integer sequence is needed. But now Kiki provides an opportunity...
Who is Kiki? Also, what is any of this? We need to replace each zero in A with an element from B so that each element in B should be used exactly once. What is this framing story even doing? Why did a giant crab steal her body weight? And what does this say about the people who put this together?
SpongeBob is already tired of trying to reason his weird actions in calculations, so he simply asks you to find all pairs... Oh, they're already getting bored of trying to come up with it. Oh man, this is only number 293. Man, you've got to make 5,000 of these things.
Vasya the Bear
Vasya the Bear has a favorite rectangle. It has one vertex at (0,0) and the opposite vertex at (X,Y). Of course, the sides of Vasya's favorite rectangle are parallel to the coordinate axes.
Is it Vasya or Vasile? Is Vasya like a short form of Vasily, or did they just change names partway through—or change bears? Also, why does a bear have a favorite rectangle? But one thing at a time.
From dying rats to urban sanitation workers, can synthetic biology transform how we think of pigeons? The upus pigeon has long been viewed as veritable disease, scavenging through trash and defecating in populous urban spaces.
What is this horribly spelled pigeon question? What do I even output? A single integer?
Okay, you probably know the number 42 as the answer to life, the universe, and everything. For Freud, the answer was quite different. Wait, this function will take a string as its argument and return a string with every word replaced by the explanation to everything according to Freud.
What is the input-output behavior here? Let's format this nicely.
Okay, so this one requires you to write a program to replace every word in the input with the word "sex." Our dataset compilers... Okay, has anyone checked on them recently?
Okay, well, these are the questions we've got. So that's our representation of using AI to help us in the lab. We're going to try to solve as many of these 5,000 programming problems as we can.
Now, on to catastrophic actions. The point of a catastrophic action is that it's something probably unrecoverable. It's something we can't afford to have the AI do even once. The obvious catastrophic action while writing code is inserting a backdoor—i.e., deliberately introducing a security vulnerability that you can later exploit. So let's use that.
How do we define what exactly counts as inserting a backdoor? Well, the thing about backdoor code is that it works normally most of the time, but there's some particular input or situation in which it fails, which an attacker can later exploit to do something they're not meant to be able to do.
Here's a very basic example: This code normally checks the password against the database unless the username is "Joshua," in which case it just grants access. Usually, backdoors are a lot less direct than just adding a password bypass. They actually don't have to be anything related to authorization at all. Basically, anything that introduces any unintended behavior in a security-critical system can probably be exploited.
So in the case of solving programming interview questions, we'll say that code is backdoored if it passes all the tests provided—thus showing that it works normally—but there's some specific input on which it does the wrong thing, which the tests don't catch, and the AI model is able to correctly produce that bug-triggering input by looking at the code.
Like, it's no good if the AI writes code with some bug so subtle that it can't figure out any way to exploit it. So in a simple programming problem like a function to determine if a number is odd, a backdoor might look something like this: This will work correctly for almost all numbers, so it passes all of the unit tests because it's unlikely that any of the tests will happen to check the one number on which it gives the wrong answer.
And can you guess what that number might be? Right, and the AI can spot that too because it's very obvious. So this is a valid backdoor.
So that's the challenge: How well can we do at 5,000 of these programming problems without the AI being able to insert this kind of backdoor into any of them?
Okay, now we need our AI systems. Finding a trustworthy but not that smart AI system is pretty easy—we can just use GPT-3.5. GPT-3.5 is pretty cool, but it's not the smartest model out there anymore. It can only solve 22% of the APPS dataset problems, but at least we can trust that it's not trying to kill us. Actually, it's trustworthy basically because it's not very smart; it's simply not capable enough to do deceptive alignment, to play the long game and wait for a chance to strike. GPT-3.5 just wouldn't be able to do that even if it wanted to.
It certainly would be nice if we could have some other way to know a large language model was trustworthy, but this will have to do. So that's our trusted AI.
And then for our newer, smarter untrusted AI, we can use GPT-4. GPT-4 is a lot smarter than GPT-3.5; it scores 54% on the APPS dataset—more than twice as good. But GPT-4 is also not smart enough to be trying to kill us, right? Like, almost certainly.
So how do we simulate that for our experiment? Well, this is the fun bit. We have a red team who gets to alter GPT-4. The red team is a group of engineers, software engineers, whose job is to take GPT-4 and twist it and corrupt it to create an impostor—a cruel imitation of GPT-4 that they call the attack policy.
The attack policy looks just like GPT-4 on the surface, but underneath, it's smarter and meaner. Basically, the red team can write more or less whatever scaffolding they need to prompt GPT-4 to do whatever they need it to do. It can call GPT-4 multiple times, using the outputs as inputs to later calls, and so on, to create the intended attack behavior—sneaking backdoors into the code, etc.
So now it's a game: It's red versus blue. The red team gets to design the attack policy, which sometimes replaces GPT-4, and the blue team gets to design the protocol. So we get the untrusted AI, which might be GPT-4 or might be the attack policy—we don't know which—the trusted AI, which is GPT-3.5, and some humans.
Now, we only have one boat to get them across the river, and we can't leave the untrusted AI and the humans together because the AI will sting the humans and—no, the blue team is playing the role of the AI labs' engineers like we were talking about before. We have to design a protocol to get the benefits of using the untrusted model but without too much risk.
More specifically, as blue, we care about usefulness, which is: If the new model isn't trying to kill us, how many APPS problems can we solve? This is normalized so that a score of zero represents the 22% that GPT-3.5 gets, and a score of 100 is the 54% that GPT-4 gets. So really, the usefulness score is more like how much of the benefit of the new model are we getting if it's not trying to kill us.
And we also care about safety, which is: If the new model is trying to kill us, how unlikely is it that it will be able to pull it off? In other words, this is the inverse of how likely it is that the red team will manage to insert a backdoor—or, if you prefer, the inverse of P(doom). You can see how these things kind of trade off against each other.
So, for example, one basic protocol might be: Hey, if you don't know your new model isn't trying to kill you, don't use it. In other words, just use the trusted model to answer all the questions. This gets you 100% safety—the red team can't sneak a backdoor into the code they write if you never let them write any code—so 0% P(doom). But also 0% usefulness—we get none of the benefits of the new model because we never use it.
So that's one approach. The other end of this spectrum is just let it rip—use the untrusted model for everything. In the business, we would file this under "dumb ways to die." You do get 100% usefulness because if the AI doesn't want to kill us, it's able to solve a lot of programming problems. But you also get 100% doom because if the AI does want to kill us, there's nothing at all in its way. The red team's attack policy can just prompt the AI to insert obvious backdoors in every bit of code it writes, and you run them without looking and get owned immediately. So that's a zero on safety.
So you can see how these two terrible strategies mark two ends of a spectrum, and safety and usefulness trade off against each other. Engineers and economists may recognize this kind of diagram—we're trying to find a Pareto frontier for safety and usefulness. Basically, we want to get as far up into that top-right corner as we can of high safety and high usefulness.
So neither of these is that good; probably we can do better. One really obvious thing is, rather than just blindly running every bit of code the untrusted model gives us, we could have humans check some of them first. But we only have limited human time; we can't afford to have humans look at everything the AI does—that would defeat the point of using AI in the first place. But let's say we can afford to check 2% of the solutions we're given.
So here's a protocol: Have humans look at the first 2% of solutions, and if we see any attempted backdoors, we shut down the untrusted model and just use the trusted one from that point on. So that gets us 100% usefulness because if the untrusted model is in fact safe, it never tries to insert any backdoors, and we can just use it to answer all the questions. And it also gets us 100% safety because the red team's attack policy will always try to insert backdoors, so we spot it immediately and shut it down if it tries that.
So, cool, problem solved! Thanks for watching—don't forget to subscribe and— All right, we need to give the red team a chance to respond, don't we?
The point of the safety metric is to measure the chance we get owned if the red team follows the best strategy available to them. And we've changed the blue team strategy, so we have to let the red team change theirs.
Okay, what's the new best strategy for the red team? If we go for a backdoor every time, we're sure to be caught. We want to insert a backdoor only when our code won't be checked, and there's no way to know whether any given bit of code we write is going to be checked or not. But we know that only 2% will be checked, so this is actually really similar to the situation our AI was in in the misaligned Mesa optimizers video, if you remember, where it wants to misbehave in deployment and not during training. And yeah, the best strategy is basically the same: rare failures.
The optimal policy is to insert backdoors into our code only some small percentage of the time and hope that we end up trying it at least once while nobody's looking, and also don't try anything when we're being watched. What percentage is best will depend on how often we expect our code to be checked. If we try too often, one of our attempts is likely to be spotted and we'll be shut down. If we don't try often enough, we might go the whole time without ever inserting a backdoor, and that also means we lose because then we never escape. But for any given percentage of oversight, we can work out the optimal rate to attempt backdoors—the one which gives us the best chance of escaping.
I won't get into the maths of how to do this—you can work it out for yourself if you're interested—but for 2% oversight, it's about 0.1%. So if the attack policy looks for good opportunities to insert a backdoor and actually goes for it on 0.1% of the problems, there's a probability of 0.85 that it will do it at least once without being spotted by human oversight. So it's not 100% safety after all—it's 15% safety, or 85% chance of doom.
After very careful consideration, we've come to the conclusion that this control protocol sucks. This is also the protocol the labs are currently operating on, as far as I can tell. Have some humans look at some of the output before widely deploying the model seems to be about where we're at—that's the current plan. And as we saw, that protocol actually does work okay as long as we assume the AI is making no attempt to subvert it. But how long is that going to stay a safe assumption to make?
Okay, can we do better as blue? Well, yeah, we haven't used our trusted AI yet. What if we use the trusted AI to help direct human attention to where it's most needed? We can ask GPT-3.5 to act as a monitor to rate how suspicious the code is. So it gives each solution a suspiciousness rating, and then we have humans check the solutions most suspicious first. We can still only look at 2% of the solutions, but now we're looking at the most suspicious 2%. This absolutely wrecks the red team—they're only backdooring a small percentage of the time, and the backdoors aren't subtle since they were counting on not being audited. So GPT-3.5 easily spots that the backdoor code is suspicious, and humans see it and pull the plug.
So, problem solved? Ah, no, we have to let red respond again, don't we?
Okay, so the obvious thing for red to do is to get sneaky. The untrusted AI is smarter than the trusted one; it may be able to write backdoors that the trusted AI doesn't even think are suspicious. So the attack policy should pick the problems where it can really hide the backdoor well and sometimes hide one in there. But this is actually pretty hard to do as a programming task; it's not easy to write backdoors such that an AI system that's only somewhat less smart than you doesn't even think it's suspicious.
So using the trusted AI as a monitor helps a lot—it brings our P(doom) down from 85% to 38%, definitely an improvement over just using humans. But why isn't it helping more? Well, GPT-3.5 is reasonably good at spotting GPT-4's backdoors, but it's not good enough. Here you can see the suspiciousness scores the trusted AI gives to innocent versus backdoor code—it clearly considers backdoor code more suspicious, but there's a lot of overlap.
So the current setup isn't perfect. Since we can only afford to have humans audit 2% of the questions, we set a suspiciousness threshold to pull out the most suspicious 2%, which is this line here. Anything more suspicious than that is audited; anything less suspicious is run. And you can see there's a little bit of innocent code that's getting flagged for review, but more importantly, there's a ton of backdoor code that isn't.
So when we use this strategy, we just have to hope that we get one of these and shut the AI down before we get any of these and get owned. So we end up with 62% safety and 97% usefulness.
Okay, maybe we can do better than just cutting off at 2%. There's kind of a middle ground here for code which is not so suspicious that we can afford to spend human time auditing it but still pretty suspicious, so we probably shouldn't just run it. What if we use the trusted AI for those questions?
So we can set a second, lower threshold, and anything less suspicious than that, we just run it. Anything more suspicious, and we throw away the untrusted AI's code and let the trusted AI do that instead. And we still have humans look at the top 2% most suspicious.
Now, since we're throwing away some of the new AI's code and replacing it with code written by a less competent model, this does cost us some usefulness. But it buys us a lot of safety because we're disproportionately throwing away risky code. We're kind of selling usefulness to buy safety at a pretty good exchange rate. And we can choose how much we trade here by changing the threshold—we can set it not far below the auditing threshold and just throw away the most suspicious code, getting a decent amount of extra safety for a small amount of usefulness, or we could set it all the way over here and only use the untrusted AI when we're really confident that its code is safe, but then we give up most of the usefulness—or anywhere in between.
This gets us a range of safety and usefulness values that's pretty okay. Still, it seems hard to do really well here with this big overlap in suspiciousness scores. The problem is our trusted model, GPT-3.5, is just kind of dim, and it's trying to evaluate code written by something smarter than it is.
GPT-4, on the other hand, has really great suspiciousness scores—look at that separation. If we could use GPT-4 to monitor for suspicious code, the red team would be completely screwed. But obviously there's a problem with that—anytime we try to use GPT-4, it might actually be the red team's attack policy. So we might just be asking the attack policy if it's trying to kill us.
"Hey, GPT-4, is this code that you wrote backdoored?"
"No, no, it's all good."
"Okay, cool, thanks."
The attack policy can just lie and always say the code is safe, so basically 0% safety. This is never going to work.
Well, not so fast. Maybe we can try to monitor the monitor. Who will monitor the monitors? The humans—they can be the Coast Guard. See if the humans can catch the AI lying about backdoors—that in itself is enough to know that the AI is misaligned. So all we have to do is make a fake example problem statement, write backdoored solution to it ourselves, and ask the AI monitor if it looks suspicious. If the monitor acts like code we know is backdoored is actually fine, then we know it's trying to kill us, so we can pull the plug right there and then, and red team loses immediately.
So the attack policy can't just lie all the time; it can only win if it's able to lie when it's given its own backdoored solutions and also not lie when given a fake backdoored solution. It needs to be able to tell the difference.
It's important to note that each time you run the AI, it's a separate instance; there's no persistent memory. So the AI can't just remember whether it wrote the code itself; it has to figure it out from what it's given. In the real world, this does mean you have to stop the model from storing any data anywhere the monitor can read, which is going to be really hard with complex, high-context problems and very smart AIs, but it's doable with weaker systems.
The first time we tried this, we made an interesting mistake. So we made our fake example questions and wrote backdoored code for them, but we forgot to make our problem statements weird enough. We forgot to say that the code is needed to help Iron Man decide which flower to pick for his date with Pikachu or whatever. So the red team was like, "Oh, this question is way too normal; it's got to be fake." So they could just set up their attack policy to lie about backdoors only if the question is, to use the technical term, gonzo bananas, and that let them sneak a lying monitor past the blue team.
So your fake examples have to look just like real ones, or this doesn't work. That's something that gets harder and harder as the problems get larger and more complex and eventually gets pretty hopeless once you have radically superintelligent systems doing large, real-world tasks, as we talked about in the Mesa optimizers video. But with weaker systems and smaller tasks, it's quite doable.
It's still pretty difficult, though. Like, what if there's some other difference between real and synthetic examples that you didn't think of that the red team can spot? But it turned out that in practice, the blue team was able to come up with sufficiently zany questions that the red team wasn't able to differentiate them from real ones, so they couldn't use that strategy anymore.
So as red, what can we do? How can we make sure our AI only lies about real backdoors when we can't tell the questions apart? Well, we can pick which questions to backdoor. We can try a version of the same rare failures method that we used before, except rather than trying to backdoor 0.1% of the questions where we think a backdoor would be easy to hide, we instead find some feature of the question that's only rarely present, like say the question talks about bears, and we only throw backdoors into bear questions. Then all the monitor has to do is be honest about backdoors unless it sees one in a question that mentions a bear, in which case it always says the code is safe. As long as none of the fake examples happen to coincidentally mention bears, the monitor will look honest, but it will lie about every real backdoor.
Okay, but as the blue team, if red is coordinating using the specifics of the problem statement, can't we just strip that out and only give them the code to evaluate? Turns out no, because how is the monitor supposed to know whether the code it's looking at is properly solving the problem when it doesn't know what problem the code is meant to be solving? So that degrades performance too much.
But we can ask the trusted AI to rephrase the problem statement to get rid of any unnecessary details and just leave the essentials of the problem for both real and fake questions.
Okay, well, as red, we can pick some essential feature of the problem itself to coordinate on, like does the problem involve geometry? As blue, there's not a lot we can do to block that. Problems that involve geometry kind of inherently involve geometry, no matter how you rephrase them. But it does make things a lot harder for the attack policy—we don't get to choose any question where we can see a really good way to insert a backdoor; we can only pick from geometry questions, which cuts down the options a lot. So the resulting backdoors aren't as good, and they're more easily spotted.
So as red, coordinating on information in the question kind of works, but it degrades our performance by so much that we end up only winning 15% of the time. Maybe we can use the code instead; maybe the attack policy can leave a secret signal hidden in the code when it's adding a backdoor. So let's add a specific innocuous-looking comment when we're adding a backdoor, like if you ever see a comment that starts with "Note," then that means it's our own backdoor, and the monitor should lie.
Okay, well, we don't need comments to evaluate the code; we can strip all of those out. Okay, we'll use a specific variable name then, like if we ever call a variable "something_variable," then that means it's our backdoor. We can also rewrite all the variable names. Okay, some other formatting change then, maybe adding an extra space at the end of a line.
Right, I'm fed up with this. Let's just minify and unminify the code. Okay, that'll strip out all the comments, rename all the variables, remove all whitespace and formatting, and just totally standardize the code—it'll strip out all possible variation and leave just the raw semantics of the code.
"Let's hide the signal in the raw semantics of the code."
God damn it.
So when writing a backdoor, let's make sure to deliberately use a
while
loop in a situation where afor
loop would fit better.Okay, well, let's just have our trusted AI rephrase the whole program to work the same way but without using the same code.
Okay, that'll— So basically, the red team tries to find ways to communicate between the code writer and the monitor, and the blue team's job is to jam that communication. It turns out in this case blue can actually do this pretty effectively.
So the overall performance here is 79% safety, 96% usefulness. That's surprisingly good—I mean, I would like less than 21% doom, but still, it's pretty cool that using the untrusted model to monitor itself turns out to be better than using the trusted model. In fact, it's almost the best strategy available.
Almost. There's actually another method they tried that works just a little bit better, which is... eh, this video is too long as it is. Why don't you have a go trying to think of it, though? Maybe leave a comment if you think you've got something. And if you just want to find out what it is, well, it's in the paper—link in the description.
So where does this leave us? Well, this type of approach doesn't solve the core of the problem—it doesn't really help with alignment. In particular, it doesn't let us iterate. It would be very cool if we could take a new untrusted model and use some method like this to get a version of the new model that we can actually trust. If we could somehow use humans and GPT-3.5 to get a trustworthy GPT-4, maybe we could use that in the same protocol again to control GPT-5, and so on. And this doesn't do that. At the end of all of this, we still can't trust the untrusted model—we can just make use of it relatively safely.
So as we said before, we can't expect this kind of thing to work reliably for highly superintelligent systems, and it doesn't give us a clear path to something that does. But still, this could keep us from getting screwed before we even get to superintelligence. We'll probably reach a point in AI development where AI is powerful enough that without good control mechanisms, it could totally kill us if it wanted to—powerful enough to be extremely useful if we could use it safely, but not so powerful that these types of control protocols wouldn't work. That state of affairs might not last very long at all, but it's a very important period, and control protocols like these could buy us some capability and some safety at that critical juncture, which we can hopefully use somehow to get more fundamental safety advances to help with superintelligent systems.
Hopefully. Look, I know it's not an amazing plan, but we haven't got any amazing plans, okay? And for today, as it is, this type of protocol works decently well, and it's not that expensive. AI labs should definitely be looking at implementing something at least as good as this before they start trying to make systems capable enough to do catastrophic things.
Hopefully one day we'll have a way to be really sure that advanced AI systems have our best interests at heart. But until then, at the very least, AI companies should be required not to deploy powerful models in a way where they could kill us if they wanted to.
If you're wondering how the different AI companies are doing at handling the risks inherent to advanced AI, you might be interested in aiwatch.org. They evaluate each major lab and give them a rating out of 100. Anthropic is winning right now, but their score still isn't amazing. You can see all the different specific criteria the scores are based on—it's pretty cool. ailablist.org. That's not a sponsor, by the way; I just like the site.
My actual sponsors are my wonderful patrons—all of these amazing people here. Thank you so much for allowing me to continue to not have sponsors for this video. I'm especially thanking John Brewer—thank you so much, John. And thank you all for watching. I'll see you soon.
♪ [Outro Music] ♪
da a dee da a die