Thanks, that clears it up. Indeed I agree that "take real trajectories, then edit the attacks in" is relatively promising, and didn't track this when writing my first comment (oops) - thanks for flagging it.
Not sure I understand what you are saying.
If we can construct fake honeypots that are indistinguishable - to the policy - from real opportunities, and do this at scale, then that could suffice: we could drive the policy's P(deployment | great opportunity) very low. (This relies on how many high-stakes honeypots we can create versus how many genuine high-stakes opportunities the model will see in deployment.) If this is what you are saying, I agree.
If we can construct fake attacks that are indistinguishable - to the untrusted monitor - from real attacks, then that possibly suffices: we could drive the monitor's prior P(deployment | attack) very low. (We have to worry about some other sources of evidence, though.) I guess one could call these honeypots for the monitor, and if this is what you are saying, I agree.
If you are saying that we don't need either of these - i.e. that we can get by with honeypots/attacks that are noticeably fake - then that's news to me, and I don't see how you can do that.
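To make the first two cases concrete, here is the rough arithmetic I have in mind (notation mine). Assuming indistinguishability, and assuming the policy's prior tracks how often it actually encounters each kind of situation, the posterior is roughly a counting ratio:

$$P(\text{deployment} \mid \text{great opportunity}) \approx \frac{N_{\text{real}}}{N_{\text{real}} + N_{\text{fake}}},$$

where $N_{\text{real}}$ is how many genuine great opportunities the model will see in deployment and $N_{\text{fake}}$ is how many indistinguishable honeypots we can create and show it. The second case is the same expression with "attack" in place of "great opportunity" (modulo the other sources of evidence mentioned above).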
Wow, tough crowd
Glad to see public writing on what safety cases could look like!
I liked the section "Common precautions". Points like "wherever possible, run supervised fine-tuning on the eval", "run the eval on multiple model sizes", "ensure that we don’t train against measurements" and "all generated outputs are monitored" are commonly acknowledged as useful/important (even if obvious), so it's nice they were explicitly mentioned here.
In the AI control sketch, there's a part that I find unsatisfactory:
The situations for the game were generated using Simulation Procedure S, which we are confident generates situations from a distribution that the model cannot distinguish from the genuine deployment distribution. (See below for validating the indistinguishability of this distribution.)
It's a strong assumption that we have such a simulation procedure, at least if we are talking about General External Deployment (GED). [Edited later: As Buck says, we actually don't even need this assumption for trusted monitoring. I'm still leaving this up, since untrusted monitoring does rely on such assumptions.] In the Limitations section later, you write
Diversity of simulations. It seems very difficult to produce realistic simulations for a sufficiently diverse set of situations to achieve indistinguishability. Manual construction is infeasible (note that context lengths are likely to be very long). Automatically generating realistic contexts is a challenging research problem.
but I think even this still understates the issue: For example, during GED a user might ask a model to proofread a yet-unpublished graduate-level textbook on analytic number theory she has been writing. It would be very costly to generate a diverse set of such inputs for the evaluation, and automatic generation faces the challenge of discrimination being easier than generation.
The best solution for this that I'm aware of is to train and use a trusted model to flag inputs that likely weren't in the evaluation set. Now, you mention a related technique in "[P2.5] Optional Steps."
Information hiding. Each input to an oversight model is first fed to a smaller, trusted model trained to check for clear signs that an input is either test or deployment. Such details (such as clear evidence of the date, or other side channels) are explicitly censored in all inputs such that the presence of censorship does not itself carry information about deployment status.
but I think this is far from enough: in the "textbook on analytic number theory" example, it's extremely hard to hide the fact that this input would have been costly to simulate.[1]
My answer would be to have a human oversee the model's actions on inputs flagged by the trusted model, and to use such inputs to continually train and evaluate the model. But this is costly and not privacy-preserving. (The cost is acceptable if these inputs are rare enough, though. And if such inputs are common, one needs to put more effort into bridging the evaluation-deployment gap.)
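To illustrate, here is a minimal sketch of the flag-and-escalate flow I have in mind; all names here are hypothetical placeholders rather than any existing API:

```python
# Sketch only: route inputs the trusted model finds unfamiliar to human oversight,
# and log them so they can be folded into future evaluation/training data.

FAMILIARITY_THRESHOLD = 0.5  # tuned so flagged inputs stay rare enough to afford human review

def trusted_familiarity_score(task_input: str) -> float:
    """Trusted model's estimate of how well this input is covered by the
    evaluation distribution (1.0 = clearly eval-like, 0.0 = clearly not)."""
    raise NotImplementedError  # placeholder for a small trusted classifier

def run_policy(task_input: str) -> str:
    """Placeholder: run the untrusted policy under the usual control measures."""
    raise NotImplementedError

def escalate_to_human(task_input: str) -> str:
    """Placeholder: queue the input for human oversight and add it to the pool
    used to continually train and evaluate the model."""
    raise NotImplementedError

def handle(task_input: str) -> str:
    if trusted_familiarity_score(task_input) < FAMILIARITY_THRESHOLD:
        # E.g. the "textbook on analytic number theory" case: too costly to have
        # simulated, so don't rely on the usual evaluation-based guarantees.
        return escalate_to_human(task_input)
    return run_policy(task_input)
```

The threshold is where the cost trade-off lives: the lower it is, the fewer inputs get the expensive human treatment, but the weaker the guarantee about the evaluation-deployment gap.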
I also worry about multimodality (e.g. native computer usage) making it harder to 1) make realistic simulations and 2) modify inputs in real time to hide information.
Quantity is size
Example: "That's a large salary"
Mapping: From numbers and quantities to size.
Experience: Observing that multiple things occupy a larger volume than a single instance.
I hadn't previously appreciated just how deep this mapping is. Basically all language about numbers goes through size: we say things like "that's a huge number" and "such a small portion". Fine, when comparing numbers you say "three is greater than two". But in Finnish we also say "three is bigger than two" and "five is smaller than seven", and "two plus two is as large as four". No, really, the equality sign is called the "equal-largeness" sign!
This is a long answer, in which I list around ten concrete problem types that such a site could have.
Before I go into my concrete proposals, here are some general points:
And overall I think that having an Actually Really Good website for rationality training would be extremely valuable, so I'm supportive of efforts in this direction.
I brainstormed some problem types that I think such a site could include.
1: Probabilistic calibration training for quantifying uncertainty
This is the obvious one. I already commented on this, in particular that I don't think this should be the main focus. (But if one were to execute this: I think that the lack of quantity and/or diversity of questions in existing tools is a core reason I don't do this more.)
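(For concreteness, here's a minimal sketch of the kind of scoring such a tool would do; the data here is made up.)

```python
# Sketch of calibration scoring: the user states a probability for each yes/no
# question; we report a Brier score plus per-confidence-bucket calibration.
from collections import defaultdict

# (stated probability that the answer is "yes", actual answer) - made-up data
answers = [(0.9, True), (0.7, False), (0.6, True), (0.8, True), (0.3, False)]

brier = sum((p - float(outcome)) ** 2 for p, outcome in answers) / len(answers)
print(f"Brier score: {brier:.3f}")  # 0 is perfect; always answering 50% scores 0.25

buckets = defaultdict(list)
for p, outcome in answers:
    buckets[round(p, 1)].append(outcome)
for p in sorted(buckets):
    hits = buckets[p]
    print(f"stated {p:.0%}: observed {sum(hits) / len(hits):.0%} over {len(hits)} question(s)")
```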
2: "Statistical calibration"
I feel like there are lots of quantitative statistics one could ask questions about. Here are some basic ones:
(For more ideas, you can e.g. look at Statistics Finland's list here. And there are also just various quantitative statistics floating around: e.g. today I learned that salt intake in 1800s Europe was ~18 g/day, which sure is more than I'd have guessed.)
3: Quantitative modeling
(The line between this and the previous one is blurry.)
Fermi estimates are the classic one here; see Quantified Intuitions' The Estimation Game. See also this recent post that's thematically related.
There's room for more sophisticated quantitative modeling, too. Here are two examples to illustrate what I have in mind:
Example 1. How much value would it create to increase the speed of all passenger airplanes by 5%?
Example 2. Consider a company that has two options: either have its employees visit nearby restaurants for lunch, or hire food personnel and start serving lunch on its own premises. How large does the company need to be for the second option to become profitable?
It's not obvious how to model these phenomena, and the questions are (intentionally) underspecified; I think the interesting part would be comparing modeling choices and parameter estimates across different users rather than simply comparing outputs.
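To illustrate the flavor of Example 1, here's a rough sketch where every input figure is a round number I made up on the spot rather than a sourced statistic:

```python
# Rough Fermi sketch for Example 1: value of making all passenger flights 5% faster.
# Every figure below is an assumed round number, not a sourced statistic.

passengers_per_year = 4e9   # assumed: ~4 billion passenger boardings per year
avg_flight_hours = 2.0      # assumed: ~2 hours in the air per boarding
value_of_time_usd = 20.0    # assumed: ~$20/hour value of passenger time
speedup = 1.05              # flights become 5% faster

passenger_hours = passengers_per_year * avg_flight_hours
hours_saved = passenger_hours * (1 - 1 / speedup)
value_usd = hours_saved * value_of_time_usd

print(f"Passenger-hours in the air per year: {passenger_hours:.1e}")
print(f"Hours saved per year:                {hours_saved:.1e}")
print(f"Rough value of the time saved/year:  ${value_usd:.1e}")
# Ignores fuel burn, fleet utilization, cargo, induced demand, etc. - which is
# exactly the kind of modeling choice that would be interesting to compare.
```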
4: The Wikipedia false-modifications game
See this post for discussion.
5: Discourse-gone-astray in the wild
(Less confident on this one.)
I suspect there are a lot of pedagogically useful examples of poor discourse happening in the wild (e.g. tilted or poorly researched newspaper articles, heated discussions on Twitter or elsewhere). This feels like a better way to execute what the "spot cognitive biases / logical fallacies" exercises aim to do. Answering questions like "How is this text misleading?", "How did this conversation go off the rails?" or "What would have been a better response instead of what was said here?" and then comparing one's notes to others' seems like it could make a useful exercise.
6: Re-deriving established concepts
Recently it occurred to me that I didn't know how inflation works and what its upsides are. Working this through (with some vague memories and hints from my friend) felt like a pretty good exercise to me.
Another example: I don't know how people make vacuums in practice, but when I sat down and thought it through, it wasn't too hard to think of a way to create a space with far fewer air molecules than the surrounding atmosphere using pretty simple tools.
Third example: I've had partial success prompting people to re-derive the notion of Shapley value.
I like these sorts of problems: they are a bit confusing, in that part of the problem is asking the right questions, but there are established, correct (or at least extremely good) solutions.
(Of course someone might already know the canonical answer to any given question, but that's fine. I think there are lots of good examples in economics - e.g. Vickrey auction, prediction markets, why price controls are bad / price gouging is pretty good, "fair" betting odds - for this, but maybe this is just because I don't know much economics.)
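(For reference, in the Shapley case the canonical answer one would hope people converge towards is

$$\varphi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr),$$

i.e. player $i$'s marginal contribution averaged over all orderings of the $n$ players.)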
7: Generating multiple ideas/interventions/solutions/hypotheses
An exercise I did at some point is "Generate 25 ideas for interventions that might improve learning and other outcomes in public education". I feel like the ability to come up with multiple ideas for a given problem is pretty useful (e.g. this is something I face in my work all the time, and this list itself is an example of "think of many things"). This is similar to the babble exercises, though I'm picturing more "serious" prompts than the ones there.
Another way to train this skill would be to have interactive exercises that are about doing science (cf. the 2-4-6 problem) and aiming to complete them as efficiently as possible. (This article is thematically relevant.)
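As a sketch of what one such interactive exercise could look like (using the classic 2-4-6 rule; purely illustrative):

```python
# Minimal 2-4-6-style exercise: probe a hidden rule with triples, then guess it.
# Scoring by number of probes is one way to reward efficient hypothesis testing.

def hidden_rule(a: int, b: int, c: int) -> bool:
    """The classic Wason rule: any strictly increasing sequence."""
    return a < b < c

def play() -> None:
    probes = 0
    print("Probe the rule with triples like '1 2 3'; type 'guess' when ready.")
    while True:
        line = input("> ").strip()
        if line.lower() == "guess":
            print(f"You used {probes} probes. The rule was: strictly increasing.")
            return
        try:
            a, b, c = (int(x) for x in line.split())
        except ValueError:
            print("Please enter three integers, or 'guess'.")
            continue
        probes += 1
        print("Fits the rule!" if hidden_rule(a, b, c) else "Does not fit the rule.")

if __name__ == "__main__":
    play()
```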
(Discussion of half-developed ideas that I don't yet quite see how to turn into exercises.)
8: Getting better results with more effort
Two personal anecdotes:
A type of exercise where you are supposed to first give an initial answer after X time, and then are allowed to revise your answer for Y time, seems like it could train this and other skills. (Maybe brainstorming exercises of the form "if you had a week/month/year of time, how would you solve [problem]?" could help, too.)
9: I think there's something in the genre of "be specific", and more specifically in "operationalize vague claims into something that has a truth value", that'd be nice to have in large-quantity exercise form. See this post for related discussion. I'm also reminded of this comment.
There are definitely things not covered by this list; in particular, I have included little about directly training to apply all this in real life (cf. TAPs, which is definitely a very real-life-y technique). So while I did keep practicality in mind, I'd be happy to see exercises that bridge the theory-practice gap even more.
Also, Dungeons and Data Science and the stuff Raymond is doing are worth keeping in mind.
My main worry is that we don't know how to measure the important things - a running theme in the post. So while we could benchmark and do unlearning on e.g. models' ability to agree on the same element of a list of 10 items, this isn't obviously useful.
(I also suspect that unlearning Schelling coordination is harder than e.g. forgetting factual knowledge: for example, a natural strategy is to choose "popular" options, and it's not clear whether you can remove notions of popularity without major negative downstream effects.)
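For what it's worth, here is a minimal sketch of the agreement benchmark mentioned above; `sample_pick` is a hypothetical stand-in for an actual model call:

```python
import random

# Sketch of a Schelling-coordination benchmark: per trial, take two independent
# picks (e.g. from two copies of the model) and measure how often they agree.

ITEMS = ["apple", "banana", "cherry", "date", "elderberry",
         "fig", "grape", "honeydew", "kiwi", "lemon"]

def sample_pick(items: list[str]) -> str:
    # Hypothetical placeholder: a real benchmark would prompt the model with the
    # list and parse out the element it picks.
    return random.choice(items)

def agreement_rate(n_trials: int = 1000) -> float:
    hits = sum(sample_pick(ITEMS) == sample_pick(ITEMS) for _ in range(n_trials))
    return hits / n_trials

# Uniform picking over 10 items gives ~0.10; a model exploiting "popular"
# (salient) options will score well above that. Run before and after unlearning
# to see how much coordination ability was actually removed.
print(f"Agreement rate: {agreement_rate():.3f}")
```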
That said, I'd imagine there's something useful you can do with white-box techniques, like making the monitor think that a particular signal/action has been produced by [humans / other AIs] via activation steering, among other ideas about changing the model's beliefs about its deployment status. It seems to me one could create benchmarks for how well such methods work. I haven't thought about this much, but I do want to echo calls for more white-box work on control.
I tentatively think that it's good to distinguish at least between the following three model classes: active planners, sleeper agents and opportunists.
Looking back at this, I find this categorization pretty confusing, and not quite carving the model space at the right places. I discuss what I now think is a better frame here.
Briefly, the idea is that one should make a distinction based on how often the model is thinking about its plans against you. In this post's terminology, this basically draws a line between opportunists vs. active planners and sleeper agents. The latter two roughly correspond to different strategies, namely active sabotage vs. "lie and wait".
But I think the descriptions I gave of "active planners" etc. in the post are overly specific and are possibly sneaking in false inferences. So I think those labels are possibly harmful, and I've mostly stopped using those terms in my own thinking.
You might be interested in this post of mine, which is more precise about what "trustworthy" means. In short, my definition is "the AI isn't adversarially trying to cause a bad outcome". This includes aligned models, and also unaligned models that are too dumb to realize they should (try to) sabotage. This does not include models that are unaligned, trying to sabotage, and which we are able to stop from causing bad outcomes (but we might still have use-cases for such models).