This is a long answer, in which I list around ten concrete problem types that such a site could have.
Before I go into my concrete proposals, here are some general points:
And overall I think that having an Actually Really Good website for rationality training would be extremely valuable, so I'm supportive of efforts in this direction.
I brainstormed some problem types that I think such a site could include.
1: Probabilistic calibration training for quantifying uncertainty
This is the obvious one. I already commented on this, in particular that I don't think this should be the main focus. (But if one were to execute this: I think that the lack of quantity and/or diversity of questions in existing tools is a core reason I don't do this more.)
2: "Statistical calibration"
I feel like there are lots of quantitative statistics one could ask questions about. Here are some basic ones:
(For more ideas, you can e.g. look at Statistics Finland's list here. And there just are various quantitative statistics floating around: e.g. today I learned that salt intake in 1800s Europe was ~18g/day, which sure is more than I'd have guessed.)
3: Quantitative modeling
(The line between this and the previous one is blurry.)
Fermi estimates are the classic one here; see Quantified Intuitions' The Estimation Game. See also this recent post that's thematically related.
There's room for more sophisticated quantitative modeling, too. Here are two examples to illustrate what I have in mind:
Example 1. How much value would it create to increase the speed of all passenger airplanes by 5%?
Example 2. Consider a company that has two options: either have its employees visit nearby restaurants for lunch, or hire food personnel and start serving lunch at its own spaces. How large does the company need to be for the second one to become profitable?
It's not obvious how to model these phenomena, and the questions are (intentionally) underspecified; I think the interesting part would be comparing modeling choices and estimates of parameters with different users rather than simply comparing outputs.
4: The Wikipedia false-modifications game
See this post for discussion.
5: Discourse-gone-astray in the wild
(Less confident on this one.)
I suspect there's a lot of pedagogically useful examples of poor discourse happening the wild (e.g. tilted or poorly researched newspaper articles, heated discussions in Twitter or elsewhere). This feels like a better way to execute what the "spot cognitive biases / logical fallacies" exercises aim to do. Answering questions like "How is this text misleading?", "How did this conversation go off the rails?" or "What would have been a better response instead of what was said here?" and then comparing one's notes to others seems like it could make a useful exercise.
6: Re-deriving established concepts
Recently it occurred to me that I didn't know how inflation works and what its upsides are. Working this through (with some vague memories and hints from my friend) felt like a pretty good exercise to me.
Another example: I don't know how people make vacuums in practice, but when I sat and thought it through, it wasn't too hard to think of a way to create a space with much less air molecules than atmosphere with pretty simple tools.
Third example: I've had partial success prompting people to re-derive the notion of Shapley value.
I like this sort of problems: they are a bit confusing, in that part of the problem is asking the right questions, but there are established, correct (or at least extremely good) solutions.
(Of course someone might already know the canonical answer to any given question, but that's fine. I think there are lots of good examples in economics - e.g. Vickrey auction, prediction markets, why price controls are bad / price gouging is pretty good, "fair" betting odds - for this, but maybe this is just because I don't know much economics.)
7: Generating multiple ideas/interventions/solutions/hypotheses
An exercise I did at some point is "Generate 25 ideas for interventions that might improve learning and other outcomes in public education". I feel like the ability to come up with multiple ideas to a given problem is pretty useful (e.g. this is something I face in my work all the time, and this list itself is an example of "think of many things"). This is similar to the babble exercises, though I'm picturing more "serious" prompts than the ones there.
Another way to train this skill would be to have interactive exercises that are about doing science (c.f. the 2-4-6 problem) and aiming to complete them as efficiently as possible (This article is thematically relevant.)
(Discussion of half-developed ideas that I don't yet quite see how to turn into exercises.)
8: Getting better results with more effort
Two personal anecdotes:
A type of exercise where you are supposed to first give an initial answer after X time, and then are allowed to revise your answer for Y time, seems like it could train this and other skills. (Maybe brainstorming exercises of the form "if you had a week/month/year of time, how would you solve [problem]?" could help, too.)
9: I think there's something in the genre of "be specific", and more specifically in "operationalize vague claims into something that has a truth value", that'd be nice to have in large-quantity exercise form. See this post for related discussion. I'm also reminded of this comment.
There are definitely things not covered by this list; in particular, I have little of directly training to apply all this in real life (c.f. TAPs, which is definitely a very real-life-y technique). So while I did keep practicality in mind, I'd be happy to see exercises that bridge the theory-practice-gap even more.
Also, the Dungeons and Data Science and the stuff Raymond is doing are something to keep in mind.
My main worry is that we don't know how to measure the important things - a running theme in the post. So while we could benchmark and do unlearning on e.g. models' ability to agree on the same element on a list of 10 items, this isn't obviously useful.
(I also suspect that unlearning Schelling coordination is harder than e.g. forgetting factual knowledge: for example, a natural strategy is to choose "popular" options, and it's not clear whether you can remove notions of popularity without major negative downstream effects.)
That said, I'd imagine there's something useful you can do with white-box techniques, like making the monitor think that a particular signal/action has been produced by [humans / other AIs] via activation steering, among other ideas about changing the model's beliefs of its deployment status. It seems to me one could create benchmarks for how well such methods work. I haven't thought about this much, but I do want to echo calls for more white-box work on control.
I tentatively think that it's good to distinguish at least between the following three model classes: active planners, sleeper agents and opportunists.
Looking back at this, I find this categorization pretty confusing, and not quite carving the model space at the right places. I discuss what I now think is a better frame here.
Briefly, the idea is that one should make a distinction on how often the model is thinking about its plans against you. In this post's terminology, this basically draws a line between opportunists vs. active planners and sleeper agents. The latter two roughly correspond to different strategies, namely active sabotage vs. "lie and wait".
But I think the descriptions I gave of "active planners" etc. in the post are overly specific and are possibly sneaking in false inferences. So I think those labels are possibly harmful, and I've mostly stopped using those terms in my own thinking.
For background, see this post on trustworthiness.
Here are two notions of trustworthiness that are useful to distinguish:
These notions can vary substantially. A central example is a model that has never had any serious misaligned thoughts but, when presented with a clear opportunity to escape, takes it. This model would be trusted under the first notion, but not under the second.
Another example is the case where a weaker model is monitoring a stronger model. If during deployment the weak model obtains strong evidence for indeed being deployed (possibly because the strong model aims for this outcome), the weak model might end up not reporting bad actions the strong model takes.[1] This can happen even if the weak model is generally not egregiously misaligned.
I find this distinction useful for two reasons:
First, there are many sorts of evaluations and arguments which imply (1) but not (2). For instance, arguments from the model’s incompetence at theory-of-mind and keeping secrets, or at thinking of and executing cross-context strategies, are strong evidence in favor of (1), but much weaker evidence in favor of (2). Obtaining strong evidence for (2) might require evaluations targeted at that specifically.
Second, whether we care about (1) or (2) depends on the application. Situations where we primarily care about (1) include: the model has produced large quantities of research, we have run capability evaluations on the model, we are forming vague high-level impressions about the model’s psychology based on behavioral evidence. In these cases we typically have redundancy and multiple lines of evidence, and thus sabotage would need to be somewhat consistent. In contrast, we care about (2) when a single action by the model can have large consequences, which is the case at least when the model is monitoring other models.
I like this frame better than my previous ones. Credit to Ryan. For ease of reference, here’s terminology: property (1) is average or low-stakes trustworthiness, and property (2) is worst-case or high-stakes trustworthiness.
We might be able to ensure the model can’t reliably distinguish evaluations from deployment, so this scenario might be unrealistic. But I want to be extremely clear about whether the argument for safety is “the model won’t know it’s deployed” versus “the model can recognize when it’s deployed, but it still won’t take any bad actions because it’s a good guy”.
and the system was coherent enough that it wouldn't point out the inserted bug when you ran another instance of it and asked it about what would happen if the code was deployed.
If we are already at the stage "use another copy to monitor the actions", then surely we'll at least consider the failure mode "maybe the monitor just always gives thumbs-up" and take some measures against it.
So simply lying isn't as easy as it sounds: The monitor-instance has been trained and evaluated on synthetic examples of buggy code, so we know that the vast majority of the time it reports bugs. Thus, in order to not point out the bug, it needs to be able to obtain a lot of Bayesian evidence about the bug it sees being authentic rather than synthetic. (Edit: training isn't strictly necessary - maybe you can get high evaluation accuracy via prompt engineering etc. - though it helps a lot.)
(There's some not-yet-public writing that will go into more detail about this dynamic and how we can evaluate whether models can indeed muster that Bayesian evidence. It might be best to continue this discussion after that's out?)
You seem to be making a huge number of assumptions of what "scheming" means, so I am now mostly assuming you are talking about something else than what I am talking about, but I'll try to clarify.
I agree there is miscommunication about what "scheming" means. I put out a post just yesterday about how people really should be more specific about what they mean by that.
I think Tom uses "scheming" to point at models that are very worst-case-like, in a way current models are not, whereas you think "there is no hard line between 'role-playing' and 'scheming'."
And I think you are underestimating how deliberate of an effort a model needs in order to have a chance at escaping. In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so[1] during evaluations. This requires a high degree of adversariality on the model's part, which doesn't fit with my understanding of "role-playing", and which I take to be the reason why Tom makes "a huge number of assumptions" about scheming.
There are complications from "even benign models sometimes fail and do weird stuff when (adversarially) evaluated". These are not trivial, but I think we can incorporate that into our model.
Quite a few of these seem very hard to do in a single forward pass.
Current models definitely don't seem capable of doing this in a single forward pass. But we are possibly not far away from models being capable of doing basic forms of this reasoning with CoT (possibly we are already there), and then one needs to get quantitative: How authentic-looking red-teaming examples can humans generate? For performance reasons, we are probably giving Weak a CoT - how long is it? (We could monitor Weak's CoT, though then you have more complexity, and Weak isn't really trustworthy if you need to do that.) And probably it's very easy for Weak to figure it's being used as a monitor - it's prompt literally contains instructions for what to report and what to not.
Overall this is messy, and someone needs to actually sit down and think through these arguments and then run the evaluations. ("Thinking through these arguments" is largely what I'm doing nowadays.) So while I am more optimistic about "scheming-risks" than I have been before, there's work to be done here.
Rephrased, we can use Weak to automate non-trivial chunks of safety research and as an internals monitor, too (not just on externalized reasoning).
We definitely want to do some monitoring on internals (e.g. "coup probes"). The worry is that it's hard to get confidence that such methods work right when a model is actually defecting, if one doesn't in advance have any examples of the models doing that and has to go with synthetic analogues.
I think the "sudden and inexplicable flashes of insight" description of Ramanujan is exaggerated/misleading.
On the first example of the post: It's not hard to see that the problem is, by the formula for triangular numbers, roughly(!) about the solvability of
.
Since is roughly a square - - one can see that this reduces to something like Pell's equation . (And if you actually do the calculations while being careful about house , you indeed reduce to .)
I think it's totally reasonable to expect an experienced mathematician to (at a high level) see the reduction to Pell's equation in 60 seconds, and from that making the (famous, standard) association to continued fractions takes 0.2 seconds, so the claim "The minute I heard the problem, I knew that the answer was a continued fraction" is entirely reasonable. Ramanujan surely could notice a Pell's equation in his sleep (literally!), and continued fractions are a major theme in his work. If you spend hundreds-if-not-thousands of hours on a particular domain of math, you start to see connections like this very quickly.
About "visions of scrolls of complex mathematical content unfolding before his eyes": Reading the relevant passage in The man who knew infinity, there is no claim there about this content being novel or correct or the source of Ramanujan's insights.
On the famous taxicab number 1729, Ramanujan apparently didn't come up with this on the spot, but had thought about this earlier (emphasis mine):
Berndt is the only person who has proved each of the 3,542 theorems [in Ramanujan's pre-Cambridge notebooks]. He is convinced that nothing "came to" Ramanujan but every step was thought or worked out and could in all probability be found in the notebooks. Berndt recalls Ramanujan's well-known interaction with G.H. Hardy. Visiting Ramanujan in a Cambridge hospital where he was being treated for tuberculosis, Hardy said: "I rode here today in a taxicab whose number was 1729. This is a dull number." Ramanujan replied: "No, it is a very interesting number; it is the smallest number expressible as a sum of two cubes in two different ways." Berndt believes that this was no flash of insight, as is commonly thought. He says that Ramanujan had recorded this result in one of his notebooks before he came to Cambridge. He says that this instance demonstrated Ramanujan's love for numbers and their properties.
This is not say Ramanujan wasn't a brilliant mathematician - clearly he was! Rather, I'd say that one shouldn't picture Ramanujan's thought processes as wholly different from those of other brilliant mathematicians; if you can imitate modern Field's medalists, then you should be able to imitate Ramanujan.
I haven't read much about Ramanujan; these are what I picked up, after seeing the post yesterday, by thinking about the anecdotes and looking to the references a little.
I first considered making a top-level post about this, but it felt kinda awkward, since a lot of this is a response to Yudkowsky (and his points in this post in particular) and I had to provide a lot of context and quotes there.
(I do have some posts about AI control coming up that are more standalone "here's what I believe", but that's a separate thing and does not directly respond to a Yudkowskian position.)
Making a top-level post of course gets you more views and likes and whatnot; I'm sad that high-quality comments on old posts very easily go unnoticed and get much less response than low-quality top-level posts. It might be locally sensible to write a shortform that says "hey I wrote this long effort-comment, maybe check it out", but I don't like this being the solution either. I would like to see the frontpage allocating relatively more attention towards this sort of thing over a flood of new posts. (E.g. your effort-comments strike me as "this makes most sense as a comment, but man, the site does currently give this stuff very little attention", and I'm not happy about this.)
Quantity is size
Example: "That's a large salary"
Mapping: From numbers and quantities to size.
Experience: Observing that multiple things occupy a larger volume than a single instance.
I hadn't previously appreciated just how deep this mapping is. Basically all language about numbers goes through size: 10100 is a huge number, 0.3% is such a small portion. Fine, when comparing numbers you say "three is greater than two". But in Finnish we also say "three is bigger than two" and "five is smaller than seven", and "two plus two is as large as four". No, really, the equality sign is called the "equal-largeness" sign!