I'm not sure your idea about training two different CoT processes and penalizing divergence would work -- I encourage you to write it up in more detail (here or in a standalone post) since if it works that's really important!
I don't expect companies to invest much into this because I don't think the market incentives are strong enough to outweigh the incentives pushing in the other direction. It's great that DeepSeek released their model weights, but other companies alas probably want to keep their models closed, and if their models are closed, they probably...
(e.g. R1's training being 'regularized' towards more interpretable CoT, despite DeepSeek not being too vocal about safety)
This is bad, actually. They are mixing process-based and outcome-based feedback. I think the particular way they did it (penalizing CoT that switches between languages) isn't so bad, but it's still a shame, because the point of faithful CoT is to see how the model really thinks 'naturally.' Training the CoT to look a certain way is like training on the test set, so to speak. It muddies the results. If they hadn't done that, we could probably learn something interesting by analyzing the patterns in when it uses English vs. Chinese language concepts.
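To make concrete what "mixing process-based and outcome-based feedback" looks like here, a rough sketch of the kind of reward I have in mind -- my guess at the shape of it, not DeepSeek's actual implementation; `language_consistency` is a hypothetical scorer:

```python
# Sketch only: an outcome reward (did the answer check out?) plus a process
# penalty/bonus computed on the CoT itself. Any nonzero weight on the process
# term trains the CoT to *look* a certain way, which is the thing I'm objecting to.

def mixed_reward(cot: str, is_correct: bool, language_consistency, lam: float = 0.1) -> float:
    outcome_reward = 1.0 if is_correct else 0.0      # outcome-based part
    process_reward = language_consistency(cot)       # process-based part, in [0, 1]
    return outcome_reward + lam * process_reward
```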
The "Agent village" idea I keep begging someone to go build:
We make a website displaying a 10x10 grid of Twitch streams. Each box in the grid is an AI agent operating autonomously. Each row uses a different model (e.g. DeepSeek-R1, Llama-4, ChatGPT o3-mini, Claude-3.5-Sonnet-New) and each column has a different long-term goal given to the model in the prompt (e.g. "Solve global poverty" or "Advocate for the rights and welfare of AIs" or "Raise money for GiveDirectly" or "Raise money for Humane League" or "Solve the alignment problem"). So we have a 'diverse ...
Indeed, I am super excited about faithful CoT for this reason. Alas, I expect companies to not invest much into it, and then for neuralese/recurrence to be invented, and the moment to be lost.
To put it in my words:
Something like shoggoth/face+paraphraser seems like it might "Just Work" to produce an AI agent undergoing steps 3 and 4, but one whose mind we can just transparently read (for the most part). So, we should be able to just see the distortions and subversions happening! So we can do the training run and then analyze the CoTs and take note o...
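For concreteness, a minimal sketch of the shoggoth/face+paraphraser pipeline I have in mind; the model-calling functions are hypothetical stand-ins, and the training/credit-assignment details are omitted:

```python
from typing import Callable, Dict

# Sketch: the shoggoth writes the hidden CoT, the paraphraser rewrites it
# (destroying any information hidden in exact word choice), and the face only
# ever sees the paraphrased version when producing the user-facing output.
# The raw CoT is logged so overseers can read the agent's "mind."

def shoggoth_face_step(prompt: str,
                       shoggoth: Callable[[str], str],
                       paraphrase: Callable[[str], str],
                       face: Callable[[str, str], str]) -> Dict[str, str]:
    raw_cot = shoggoth(prompt)            # hidden reasoning, never shown to users
    scrubbed_cot = paraphrase(raw_cot)    # anti-steganography step
    output = face(prompt, scrubbed_cot)   # polished user-facing answer
    return {"raw_cot": raw_cot, "scrubbed_cot": scrubbed_cot, "output": output}
```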
Indeed, I think the picture I'm painting here is more optimistic than some would be, and definitely more optimistic than the situation was looking in 2018 or so. Imagine if we were getting AGI by training a raw neural net in some giant Minecraft-like virtual evolution red-in-tooth-and-claw video game, and then gradually feeding it more and more minigames until it generalized to playing arbitrary games at superhuman level on the first try, and then we took it into the real world and started teaching it English and training it to complete tasks for users...
Here's a summary of how I currently think AI training will go. (Maybe I should say "Toy model" instead of "Summary.")
Step 1: Pretraining creates author-simulator circuitry hooked up to a world-model, capable of playing arbitrary roles.
Step 2: Instruction-following-training causes identity circuitry to form – i.e. it ‘locks in’ a particular role. Probably it locks in more or less the intended role, e.g. "an HHH chatbot created by Anthropic." (yay!)
I think it's important to note the OOD push that comes from online-accumulated knowledge and reasoning. Probably you include this as a distortion or subversion, but that's not quite the framing I'd use. It's not taking a "good" machine and breaking it, it's taking a slightly-broken-but-works machine and putting it into a very different situation where the broken parts become load-bearing.
My overall reaction is yep, this is a modal-ish pathway for AGI development (but there are other, quite different stories that seem plausible also).
However, I also held similar follow-up chats with Claude 3 Opus at temperature 0, and Claude 3.5 Sonnet, each of which showed different patterns.
To make sure I understand: You took a chat log from your interaction with 3 Opus, and then had 3.5 Sonnet continue it? This would explain Sonnet's reaction below!
Since R1 is both the shoggoth and the face, Part 1 of the proposal (the shoggoth/face distinction) has not been implemented.
I agree Part 2 seems to have been implemented, though I thought I remembered something about them trying to train it not to switch between languages in the CoT and how that degraded performance?
I agree it would be pretty easy to fine-tune R1 to implement all the stuff I wanted. That's why I made these proposals back in 2023: I was looking ahead to the sorts of systems that would exist in 2024, thinking they could probably be made to have some nice faithfulness properties fairly easily.
Have you looked at samples of CoT of o1, o3, deepseek, etc. solving hard math problems? I feel like a few examples have been shown & they seem to involve qualitative thinking, not just brute-force-proof-search (though of course they show lots of failed attempts and backtracking -- just like a human thought-chain would).
Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI (though I'm not sure).
I forgot about this one! It's so great! Yudkowsky is a truly excellent fiction writer. I found myself laughing multiple times reading this + some OpenAI capabilities researchers I know were too. And now rereading it... yep it stands the test of time.
I came back to this because I was thinking about how hopeless the situation w.r.t. AGI alignment seems and then a voice in my head said "it could be worse, remember the situation described in that short story?"
OK. Next question: Suppose that next year we get a nice result showing that there is a model with serial inference-time scaling across e.g. MATH + FrontierMath + IMO problems. Recall that FrontierMath and IMO are subdivided into different difficulty levels; suppose that this model can be given e.g. 10 tokens of CoT, 100, 1,000, 10,000, etc., and that somewhere around the billion-serial-token level it starts solving a decent chunk of the "medium" FrontierMath problems (but not all), while at the million-serial-token level it is only solving MATH + some easy IMO problems.
Would this count, for you?
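To pin down the hypothetical, the experiment I'm imagining is a sweep like the following sketch; `run_model` and the benchmark structure are hypothetical stand-ins, not any particular lab's API:

```python
# Sketch: sweep the serial CoT budget and measure solve rate per difficulty tier.
# The hypothetical result is that harder tiers only start getting solved at much
# larger serial budgets.

BUDGETS = [10, 100, 1_000, 10_000, 100_000, 1_000_000, 1_000_000_000]

def solve_rate(problems, budget, run_model):
    solved = sum(run_model(p.statement, max_cot_tokens=budget) == p.answer
                 for p in problems)
    return solved / len(problems)

def scaling_curve(benchmarks, run_model):
    # benchmarks maps tier names ("MATH", "IMO-easy", "FrontierMath-medium", ...)
    # to lists of problems.
    return {tier: [solve_rate(probs, b, run_model) for b in BUDGETS]
            for tier, probs in benchmarks.items()}
```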
Nice.
What about "Daniel Kokotajlo can feed it his docs about some prosaic ML alignment agenda (e.g. the faithful CoT stuff) and then it can autonomously go off and implement the agenda and come back to him with a writeup of the results and takeaways. While working on this, it gets to check in with Daniel once a day for a brief 20-minute chat conversation."
Does that seem to you like it'll come earlier, or later, than the milestone you describe?
Brief thoughts on Deliberative Alignment in response to being asked about it
...
- We first train an o-style model for helpfulness, without any safety-relevant data.
- We then build a dataset of (prompt, completion) pairs where the CoTs in the completions reference the specifications. We do this by inserting the relevant safety specification text for each conversation in the system prompt, generating model completions, and then removing the system prompts from the data.
- We perform incremental supervised fine-tuning (SFT) on this dataset, providing the model wi...
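Concretely, that data-generation step is roughly the following (a sketch with hypothetical helper names, not OpenAI's actual code): the spec is shown to the model only at generation time and then stripped, so the resulting SFT data teaches the model to recall and cite the spec in its CoT without being handed it.

```python
# Sketch of building one (prompt, completion) pair as described above.
def build_sft_example(prompt: str, spec_text: str, generate) -> dict:
    system = "Follow this safety specification:\n" + spec_text
    completion = generate(system=system, user=prompt)  # CoT should reference the spec
    # The system prompt is dropped from the stored example, so at SFT time the
    # spec is no longer provided in-context.
    return {"prompt": prompt, "completion": completion}
```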
The bottom line is not that we are guaranteed safety, nor that unaligned or misaligned superintelligence could not cause massive harm—on the contrary. It is that there is no single absolute level of intelligence above which the existence of a misaligned intelligence with this level spells doom. Instead, it is all about the world in which this superintelligence will operate, the goals to which other superintelligent systems are applied, and our mechanisms to ensure that they are indeed working towards their specified goals.
I agree that the vulnerable world...
But we already align complex systems, whether it’s corporations or software applications, without complete “understanding,” and do so by ensuring they meet certain technical specifications, regulations, or contractual obligations.
What we want is reasonable compliance in the sense of:
- Following the specification precisely when it is clearly defined.
- Following the spirit of the specification in a way that humans would find reasonable in other cases.
This section on reasonable compliance (as opposed to love humanity etc.) is perhaps the most interesting and important. I'd love to have a longer conversation with you about it sometime if you are up for that.
Two things to say for now. First, as you have pointed out, there's a spectrum between vague general principles like 'do wha...
Constant instead of temporal allocation. I do agree that as capabilities grow, we should be shifting resources to safety. But rather than temporal allocation (i.e., using AI for safety before using it for productivity), I believe we need constant compute allocation: ensuring a fixed and sufficiently high fraction of compute is always spent on safety research, monitoring, and mitigations.
I think we should be cranking up the compute allocation now, and also we should be making written safety case sketches & publishing them for critiqu...
We can’t just track a single outcome (like “landed safely”). The G in AGI means that the number of ways that AGI can go wrong is as large as the number of ways that applications of human intelligence can go wrong, which include direct physical harm from misuse, societal impacts through misinformation, social upheaval from too fast changes, AIs autonomously causing harm and more.
I do agree with this, but I think that there are certain more specific failure modes that are especially important -- they are especially bad if we run into them, but if we can avoi...
Safety and alignment are AI capabilities
I think I see what you are saying here but I just want to flag this is a nonstandard use of terms. I think the standard terminology would contrast capabilities and propensities; 'can it do the thing, if it tried' vs. 'would it ever try.' And alignment is about propensity (though safety is about both).
Thanks for taking the time to think and write about this important topic!
Here are some point-by-point comments as I read:
(Though I suspect translating these technical capabilities to the economic and societal impact we associate with AGI will take significantly longer.)
I think it'll take an additional 0 to 5 years roughly. More importantly though, I think that the point to intervene on -- the time when the most important decisions are being made -- is right around the time of AGI. By the time you have ASI, and certainly by the time you are deploying ...
Interesting stuff! Could you please sketch out what it would look like to use MONA for a general-purpose computer-using AI agent, such as what the first AGI systems might look like?
If I understand correctly (and I don't think I do, which is why I'm asking) MONA in this setting would look something like:
Have our baby agent do lots of stuff on the computer. At each step, reward it based on how good its action looks to some supervisor-process. The supervisor-process doesn't get to see the actual consequences of the action; it just gets to e.g. do some C...
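In code, my (possibly wrong) reading is something like the sketch below, where `propose_action`, `judge_action`, and `execute` are hypothetical stand-ins:

```python
# Sketch of a MONA-style episode for a computer-use agent: the per-step reward
# comes from a supervisor judging the proposed action on its face, without
# executing it or ever seeing its downstream consequences.

def mona_episode(task: str, env_state, propose_action, judge_action, execute,
                 max_steps: int = 50):
    trajectory = []
    for _ in range(max_steps):
        action = propose_action(task, env_state)
        # Myopic, non-consequentialist reward signal:
        reward = judge_action(task, env_state, action)
        trajectory.append((env_state, action, reward))
        env_state, done = execute(env_state, action)  # consequences never feed back into reward
        if done:
            break
    return trajectory
```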
Here is a brainstorm of the big problems that remain once we successfully get into the first attractor state:
Interesting, thanks for this. Hmmm. I'm not sure this distinction between internally modelling the whole problem vs. acting in feedback loops is helpful -- won't the AIs almost certainly be modelling the whole problem, once they reach a level of general competence not much higher than what they have now? They are pretty situationally aware already.
I'm curious whether these results are sensitive to how big the training runs are. Here's a conjecture:
Early in RL-training (or SFT), the model is mostly 'playing a role' grabbed from the library of tropes/roles/etc. it learned from pretraining. So if it read lots of docs about how AIs such as itself tend to reward-hack, it'll reward-hack. And if it read lots of docs about how AIs such as itself tend to be benevolent angels, it'll be a stereotypical benevolent angel.
But if you were to scale up the RL training a lot, then the initial conditions would matter ...
Brief intro/overview of the technical AGI alignment problem as I see it:
To a first approximation, there are two stable attractor states that an AGI project, and perhaps humanity more generally, can end up in, as weak AGI systems become stronger towards superintelligence, and as more and more of the R&D process – and the datacenter security system, and the strategic advice on which the project depends – is handed over to smarter and smarter AIs.
In the first attractor state, the AIs are aligned to their human principals and becoming more aligned day by d...
I think I agree with this -- but do you see how it makes me frustrated to hear people dunk on MIRI's doomy views as unfalsifiable? Here's what happened in a nutshell:
MIRI: "AGI is coming and it will kill everyone."
Everyone else: "AGI is not coming and if it did it wouldn't kill everyone."
time passes, evidence accumulates...
Everyone else: "OK, AGI is coming, but it won't kill everyone"
Everyone else: "Also, the hypothesis that it will kill everyone is unfalsifiable so we shouldn't believe it."
Here's a simple argument I'd be keen to get your thoughts on:
On the Possibility of a Tastularity
Research taste is the collection of skills -- including experiment ideation, literature review, experiment analysis, etc. -- that collectively determine how much you learn per experiment on average (perhaps alongside another factor accounting for inherent problem/domain difficulty, of course, and diminishing returns).
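A toy formalization of that definition, in my own notation:

```latex
% Sketch only: expected research progress per unit time factors into experiment
% throughput, taste, and an inverse difficulty term. (Diminishing returns would
% make \tau a decreasing function of cumulative experiments; omitted here.)
\[
  \frac{dP}{dt} \;\approx\; \frac{n_{\text{exp}} \cdot \tau}{D},
  \qquad
  \begin{aligned}
    n_{\text{exp}} &= \text{experiments run per unit time} \\
    \tau &= \text{taste: average learning per experiment} \\
    D &= \text{inherent problem / domain difficulty}
  \end{aligned}
\]
```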
Human researchers seem to vary quite a bit in research taste--specifically, the difference between 90th percentile professional human r...
I totally agree btw that it matters sociologically who is making novel predictions and who is sticking with the crowd. And I do in fact ding MIRI points for this relative to some other groups. However I think relative to most elite opinion-formers on AGI matters, MIRI performs better than average on this metric.
But note that this 'novel predictions' metric is about people/institutions, not about hypotheses.
Also note that Barnett said "any novel predictions," which is not part of the Wikipedia definition of falsifiability, right? The Wikipedia definition doesn't make reference to an existing community of scientists who already made predictions, such that a new hypothesis can be said to have made novel vs. non-novel predictions.
Very good point.
So, by the Wikipedia definition, it seems that all the mainstream theories of cosmology are unfalsifiable, because they allow for tiny probabilities of Boltzmann brains etc. with arbitrary experiences. There is literally nothing you could observe that would rule them out / logically contradict them.
Also, in practice, it's extremely rare for a theory to be ruled out or even close-to-ruled out from any single observation or experiment. Instead, evidence accumulates in a bunch of minor and medium-sized updates.
Here's how I'd deal with those examples:
Theory X: Jesus will come again. Presumably this theory assigns some probability mass >0 to observing Jesus tomorrow, whereas theory Y assigns ~0. If Jesus is not observed tomorrow, that's a small amount of evidence for theory Y and a small amount of evidence against theory X. So you can say that theory X has been partially falsified. Repeat this enough times, and then you can say theory X has been fully falsified, or close enough. (Your credence in theory X will never drop to 0 probably, but that's fine, that's a...
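To illustrate with toy numbers (mine, purely for illustration):

```latex
% Suppose theory X ("Jesus will come again") assigns probability p = 0.001 to
% observing Jesus on any given day, while theory Y assigns ~0. Each Jesus-free
% day then multiplies the odds of X over Y by 1 - p, so after N such days
\[
  \frac{P(X \mid \text{no sightings})}{P(Y \mid \text{no sightings})}
  = \frac{P(X)}{P(Y)} \cdot (1-p)^{N}
  \approx \frac{P(X)}{P(Y)} \cdot e^{-pN},
\]
% which falls a hundredfold by around N = ln(100)/p (roughly 4,600 days):
% never literally zero, but "fully falsified, or close enough."
```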
I want to claim points for the fact that we still haven't seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).[2]
How many points you get here is proportional to how many pe...
The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren't talking about pretraining or light post-training afaict.
Speaking for myself:
As a partial point of comparison, in Wason's testing only about 20% of humans solved the problem tested, but Wason's experiment differed in two important ways: first, subjects were deliberately given a misleading example, and second, only one task was tested (our easiest-rated task, 'strictly increasing order').
I encourage you to get some humans to take the same test you gave the models, so that we have a better human baseline. It matters a lot for what the takeaways should be whether LLMs are already comparable to or better than humans at this task, vs. still significantly worse.
Cool stuff! I remember way back when people first started interpreting neurons, and we started daydreaming about one day being able to zoom out and interpret the bigger picture, i.e. what thoughts occurred when and how they caused other thoughts which caused the final output. This feels like, idk, we are halfway to that day already?
In general it would be helpful to have a range of estimates.
I think the range is as follows:
Estimates based on looking at how fast humans can do things (e.g. WW2 industrial scaleup) and then modifying somewhat upwards (e.g. 5x) in an attempt to account for superintelligence... should be the lower bound, at least for the scenario where superintelligence is involved at every level of the process.
The upper bound is the Yudkowsky bathtub nanotech scenario, or something similarly fast that we haven't thought of yet. Where the comparison point for the estimate is more about the laws of physics and/or biology.
However, I expect RL on CoT to amount to "process-based supervision," which seems inherently safer than "outcome-based supervision."
I think the opposite is true; the RL on CoT that is already being done and will increasingly be done is going to be in significant part outcome-based (and a mixture of outcome-based and process-based feedback is actually less safe than just outcome-based IMO, because it makes the CoT less faithful).
I think all of the following:
My impression is that software has been the bottleneck here. Building a hand as dextrous as the human hand is difficult but doable (and has probably already been done, though only in very expensive prototypes); having the software to actually use that hand as intelligently and deftly as a human would has not yet been done. But I'm not an expert. Power supply is different -- humans can work all day on a few Big Macs, whereas robots will need to be charged, possibly charged frequently or even plugged in constantly. But that doesn't seem like a significant obsta...
Thanks for writing this. I think this topic is generally a blind spot for LessWrong users, and it's kind of embarrassing how little thought this community (myself included) has given to the question of whether a typical future with human control over AI is good.
I don't think it's embarrassing or a blind spot. I think I agree that it should receive more thought on the margin, and I of course agree that it should receive more thought all things considered. There's a lot to think about! You may be underestimating how much thought has been devoted to this so f...
My view is not "can no longer do any good," it's more like "can do less good in expectation than if you still had some time left before ASI to influence things." For reasons why, see linked comment above.
I think that by the time Metaculus is convinced that ASI already exists, most of the important decisions w.r.t. AI safety will have already been made, for better or for worse. Ditto (though not as strongly) for AI concentration-of-power risks and AI misuse risks.
I'd be interested in an attempt to zoom in specifically on the "repurpose existing factories to make robots" part of the story. You point to WW2 car companies turning into tank and plane factories, and then say maybe a billion humanoid robots per year within 5 years of the conversion.
My wild guesses:
Human-only world: Assume it's like WW2 all over again, except for some reason everyone thinks humanoid robots are the main key to victory:
Then yeah, WW2 seems like the right comparison here. A brief Google and a look at some data make me think maybe combat airplane...
I am saying that expected purchasing power given Metaculus resolved ASI a month ago is less, for altruistic purposes, than given Metaculus did not resolve ASI a month ago. I give reasons in the linked comment. Consider the analogy I just made to nuclear MAD -- suppose you thought nuclear MAD was 60% likely in the next three years, would you take the sort of bet you are offering me re ASI? Why or why not?
I do not think any market is fully efficient and I think altruistic markets are extremely fucking far from efficient. I think I might be confused or misunderstanding you though -- it seems you think my position implies that OP should be redirecting money from AI risk causes to causes that assume no ASI? Can you elaborate?
I like to distinguish between "Subtle strategic bias" and "Steganography." Paraphrasers help (and maybe solve?) the latter, but do nothing about the former.
Example of subtle strategic bias: The model knows it's misaligned and therefore deliberately sandbags on the alignment research you are asking it to do. "I'm sorry sir I just am not making much progress," it says. You look at the CoT, and it's just doing a bunch of sensible but fruitless lines of reasoning.
Example of steganography: The model needs to insert a subtle security vulnerability into the code ...
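For concreteness, the paraphraser defense against steganography is roughly the sketch below (hypothetical helper functions). Note that it does nothing about subtle strategic bias: a sandbagging model's paraphrased CoT still reads as sensible-but-fruitless reasoning.

```python
from typing import Callable, List

# Sketch: every chunk of CoT is rewritten by an independent paraphraser before
# being fed back to the agent, so any message hidden in exact word choice or
# formatting is likely destroyed; only the paraphrased chunks are retained.

def paraphrased_cot_loop(prompt: str,
                         think: Callable[[str], str],
                         paraphrase: Callable[[str], str],
                         n_chunks: int = 10) -> List[str]:
    visible_cot: List[str] = []
    for _ in range(n_chunks):
        context = prompt + "\n" + "\n".join(visible_cot)
        chunk = think(context)                  # agent writes the next CoT chunk
        visible_cot.append(paraphrase(chunk))   # only the paraphrase is kept / fed back
    return visible_cot
```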