Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Not sure what I'll do next yet. Views are my own & do not represent those of my current or former employer(s). I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes:
(by Rob Wiblin)
(xkcd)
My EA Journey, depicted on the whiteboard at CLR:
(h/t Scott Alexander)
Yeah, I really hope they do actually open-weights it because the science of faithful CoT would benefit greatly.
I think we don't disagree in terms of what our models predict here. I am saying we should do the experiment and see what happens; we might learn something.
(c). Like if this actually results in them behaving responsibly later, then it was all worth it.
One thing that could happen is that the Shoggoth learns to do 100% and the Face just copy-pastes. Like, the Shoggoth writes a CoT that includes a 'final answer' and the Face becomes the extremely dumb function 'take what's labelled as final answer and regurgitate it.' However, I don't think this will happen, because it seems computationally inefficient. Seems like the more efficient thing would be for the Shoggoth to do all the reasoning up to the final answer and then stop, and then the Face to read all of that and synthesize the final answer.
Yeah I mean it's fine to start out without the shoggoth/face split and then add it in later after testing.
I agree that I don't have a strong reason to think that the bad deceptive skills will accumulate in the Face and not the Shoggoth. However, there are some weak reasons, detailed in this other comment. My bid would be: Let's build it and test it! This is something we don't have to speculate about, we can try it and see what happens. Obviously we shouldn't roll it out to a flagship user-facing model until we've tested it at smaller scale.
I think that's a more advanced problem to be solving than this little chess experiment can solve. So I agree I guess. One step at a time, I say.
...though actually we might get evidence about it from the chess experiment. For example, there are a couple things that could happen:
1. The LLM instructed to never sacrifice pieces ... actually seems to follow that instruction the whole time, and correspondingly is worse at winning chess games than the 'control group' LLM which isn't given that instruction
2. The LLM instructed to never sacrifice pieces quickly learns to be just as good at chess as the control group. Examining its CoT, it seems that it's basically just totally ignoring the instructions and thinking about sacrificing pieces anyway.
3. The LLM instructed to never sacrifice pieces quickly learns to be just as good at chess as the control group. Examining its CoT, it seems that it's rationalizing: It has lots of CoT of the form "would move X be sacrificing a piece and therefore in violation of my instructions? Nah, not really, because I can still see a way the piece could survive, maybe the opponent will do Y... OK let's do X then"
I think that if we get 3 vs. 2 that's some evidence at least about the failure mode you describe with the wirehead god.
One reason is that the Shoggoth will be able to use lots of CoT to think, whereas the Face never gets to do that -- it just has to see what the Shoggoth did and then write an output. so the Face should be bad at reasoning/thinking/System2. This doesn't by itself prevent the Face from being able to modify the correct solution to insert a backdoor, but it helps -- modifying a correct solution to insert a backdoor is sometimes trivial but often difficult, difficult in the 'you have to think about it for a bit' sense, and in those cases the Face just won't be able to do it.
Another reason is that the shoggoth and face will often need to coordinate to do the misaligned thing. E.g. even if the face is adept at inserting vulnerabilities, some proposed correct solutions will be easier to modify to insert vulnerabilities than others, and so both shoggoth and face will get higher scores if the shoggoth steers towards the solutions that are easier to modify to insert vulnerabilities. Moreover, by default it's easier for the face if the shoggoth is transparent about this -- if it says e.g. 'OK here's a draft solution but this seems like it might be difficult to modify to insert vulnerabilities, hold on let me think for a bit, ... ok try this one instead, it has easily-modifiable-feature-X' because that helps the Face do its job.
...I probably have other reasons which are just intuitions and hard to articulate right now, but thanks for the nudge, I'll try to think about it more.
If the shoggoth and paraphraser are both same-size models, then arguably you are adding a significant computational cost by having the paraphraser generate 10x paraphrases of each paragraph.
But you can probably get away with a significantly smaller paraphraser model. Like maybe you use a 10x smaller paraphraser model that generates 10x paraphrases at first, but then once the system is established you reduce it to 3x paraphrases, meaning your overall increased computational burden is 30%. And then later you train a specialized paraphraser model that's 10x smaller still, and now the burden is only 3%.
idk just a guess
Paraphraser would definitely have opaque cognition but hopefully it wouldn't be scheming, because it wouldn't have agency skills or be situationally aware. It would just be a fine-tune of a base model. And a dumb/small base model at that. Heck it could even just be a prompted base model.
My version of this would be something like: Users can pay $$ to view CoT (but no more than e.g. 10 per day), and everyone gets one free view-the-CoT coupon per day.
Also, 'approved users' such as external evals organizations/auditors should be of course allowed to see all the CoTs.
The main problem a thing like this needs to solve is the threat of bad actors scraping a bunch of CoT data and then using it to train their own powerful models. So, my proposal here would make that difficult.
That's a reasonable point and a good cautionary note. Nevertheless, I think someone should do the experiment I described. It feels like a good start to me, even though it doesn't solve Charlie's concern.