Daniel Kokotajlo

Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Not sure what I'll do next yet. Views are my own & do not represent those of my current or former employer(s). I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:

(by Rob Wiblin)

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."
(xkcd)

My EA Journey, depicted on the whiteboard at CLR:

(h/t Scott Alexander)

Alex Blechman @AlexBlechman Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus 5:49 PM Nov 8, 2021. Twitter Web App

Sequences

Agency: What it is and why it matters

AI Timelines

Takeoff and Takeover in the Past and Future

Posts

Sorted by New

5Daniel Kokotajlo's Shortform

506

91Why Don't We Just... Shoggoth+Face+Paraphraser?

63Self-Awareness: Taxonomy and eval suite proposal

9mo

279AI Timelines

59Linkpost for Jan Leike on Self-Exfiltration

108Paper: On measuring situational awareness in LLMs

41AGI is easier than robotaxis

61Pulling the Rope Sideways: Empirical Test Results

39What money-pumps exist, if any, for deontologists?

55The Treacherous Turn is finished! (AI-takeover-themed tabletop RPG)

41My version of Simulacra Levels

Wiki Contributions

Comments

Sorted by

Newest

Why Don't We Just... Shoggoth+Face+Paraphraser?

Daniel Kokotajlo10h40

That's a reasonable point and a good cautionary note. Nevertheless, I think someone should do the experiment I described. It feels like a good start to me, even though it doesn't solve Charlie's concern.

DeepSeek beats o1-preview on math, ties on coding; will release weights

Daniel Kokotajlo13h2912

Yeah, I really hope they do actually open-weights it because the science of faithful CoT would benefit greatly.

Why Don't We Just... Shoggoth+Face+Paraphraser?

Daniel Kokotajlo15h20

I think we don't disagree in terms of what our models predict here. I am saying we should do the experiment and see what happens; we might learn something.

Akash's Shortform

Daniel Kokotajlo20h50

(c). Like if this actually results in them behaving responsibly later, then it was all worth it.

Why Don't We Just... Shoggoth+Face+Paraphraser?

Daniel Kokotajlo21h51

One thing that could happen is that the Shoggoth learns to do 100% and the Face just copy-pastes. Like, the Shoggoth writes a CoT that includes a 'final answer' and the Face becomes the extremely dumb function 'take what's labelled as final answer and regurgitate it.' However, I don't think this will happen, because it seems computationally inefficient. Seems like the more efficient thing would be for the Shoggoth to do all the reasoning up to the final answer and then stop, and then the Face to read all of that and synthesize the final answer.

Why Don't We Just... Shoggoth+Face+Paraphraser?

Daniel Kokotajlo21h42

Yeah I mean it's fine to start out without the shoggoth/face split and then add it in later after testing.

I agree that I don't have a strong reason to think that the bad deceptive skills will accumulate in the Face and not the Shoggoth. However, there are some weak reasons, detailed in this other comment. My bid would be: Let's build it and test it! This is something we don't have to speculate about, we can try it and see what happens. Obviously we shouldn't roll it out to a flagship user-facing model until we've tested it at smaller scale.

Why Don't We Just... Shoggoth+Face+Paraphraser?

Daniel Kokotajlo21h42

I think that's a more advanced problem to be solving than this little chess experiment can solve. So I agree I guess. One step at a time, I say.

...though actually we might get evidence about it from the chess experiment. For example, there are a couple things that could happen:

1. The LLM instructed to never sacrifice pieces ... actually seems to follow that instruction the whole time, and correspondingly is worse at winning chess games than the 'control group' LLM which isn't given that instruction

2. The LLM instructed to never sacrifice pieces quickly learns to be just as good at chess as the control group. Examining its CoT, it seems that it's basically just totally ignoring the instructions and thinking about sacrificing pieces anyway.

3. The LLM instructed to never sacrifice pieces quickly learns to be just as good at chess as the control group. Examining its CoT, it seems that it's rationalizing: It has lots of CoT of the form "would move X be sacrificing a piece and therefore in violation of my instructions? Nah, not really, because I can still see a way the piece could survive, maybe the opponent will do Y... OK let's do X then"

I think that if we get 3 vs. 2 that's some evidence at least about the failure mode you describe with the wirehead god.

Why Don't We Just... Shoggoth+Face+Paraphraser?

Daniel Kokotajlo21h41

One reason is that the Shoggoth will be able to use lots of CoT to think, whereas the Face never gets to do that -- it just has to see what the Shoggoth did and then write an output. so the Face should be bad at reasoning/thinking/System2. This doesn't by itself prevent the Face from being able to modify the correct solution to insert a backdoor, but it helps -- modifying a correct solution to insert a backdoor is sometimes trivial but often difficult, difficult in the 'you have to think about it for a bit' sense, and in those cases the Face just won't be able to do it.

Another reason is that the shoggoth and face will often need to coordinate to do the misaligned thing. E.g. even if the face is adept at inserting vulnerabilities, some proposed correct solutions will be easier to modify to insert vulnerabilities than others, and so both shoggoth and face will get higher scores if the shoggoth steers towards the solutions that are easier to modify to insert vulnerabilities. Moreover, by default it's easier for the face if the shoggoth is transparent about this -- if it says e.g. 'OK here's a draft solution but this seems like it might be difficult to modify to insert vulnerabilities, hold on let me think for a bit, ... ok try this one instead, it has easily-modifiable-feature-X' because that helps the Face do its job.

...I probably have other reasons which are just intuitions and hard to articulate right now, but thanks for the nudge, I'll try to think about it more.

Why Don't We Just... Shoggoth+Face+Paraphraser?

Daniel Kokotajlo21h42

If the shoggoth and paraphraser are both same-size models, then arguably you are adding a significant computational cost by having the paraphraser generate 10x paraphrases of each paragraph.

But you can probably get away with a significantly smaller paraphraser model. Like maybe you use a 10x smaller paraphraser model that generates 10x paraphrases at first, but then once the system is established you reduce it to 3x paraphrases, meaning your overall increased computational burden is 30%. And then later you train a specialized paraphraser model that's 10x smaller still, and now the burden is only 3%.

idk just a guess

Paraphraser would definitely have opaque cognition but hopefully it wouldn't be scheming, because it wouldn't have agency skills or be situationally aware. It would just be a fine-tune of a base model. And a dumb/small base model at that. Heck it could even just be a prompted base model.

Why Don't We Just... Shoggoth+Face+Paraphraser?

Daniel Kokotajlo1d110

My version of this would be something like: Users can pay $$ to view CoT (but no more than e.g. 10 per day), and everyone gets one free view-the-CoT coupon per day.

Also, 'approved users' such as external evals organizations/auditors should be of course allowed to see all the CoTs.

The main problem a thing like this needs to solve is the threat of bad actors scraping a bunch of CoT data and then using it to train their own powerful models. So, my proposal here would make that difficult.