Doing AI Safety research for ethical reasons.
My webpage.
Leave me anonymous feedback.
I operate by Crocker's Rules.
I don't see how Take 4 is anything other than simplicity (in the human/computational language). As you say, it's a priori unclear whether an agent is an instance of a human or the other way around. You say the important bit is that you are subtracting properties from a human to get an agent. But how shall we define subtraction here? In one formal language, the definition of human will indeed be a superset of that of agent... but in another one it will not. So you need to choose a language. And the natural way forward every time this comes up (and it comes up many times) is just to "weigh by Turing computations in the real world" (instead of choosing a different and odd-looking Universal Turing Machine), that is, a simplicity prior.
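For concreteness, by "simplicity prior" I just mean the textbook description-length-weighted prior relative to whatever universal Turing machine $U$ you fix (this is the standard formula, not something from your post):

$$P(h) \propto 2^{-K_U(h)}$$

where $K_U(h)$ is the length of the shortest program that outputs $h$ on $U$. The choice of $U$ is exactly the "choice of language" that the argument turns on.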
Imo rationalists tend to underestimate the arbitrariness involved in choosing a CEV procedure (= moral deliberation in full generality).
Like you, I endorse the step of "scoping the reference class" (along with a thousand other preliminary steps). Preemptively fixing it in place helps you to the extent that the humans wouldn't have done it by default. But if the CEV procedure is governed by a group of humans so selfish/unthoughtful as to not even converge on that by themselves, then I'm sure there'll be at least a few hundred other aspects (both more and less subtle than this one) that you and I obviously endorse, but that they will not implement, and these will drastically affect the outcome of the whole procedure.
In fact, it seems strikingly plausible that even among EAs, the outcome could depend drastically on seemingly arbitrary starting conditions (like "whether we use deliberation-and-distillation procedure #194 or #635, which differ in some details"). And "drastically" means that, even though both outcomes still look somewhat kindness-shaped and friendly-shaped, one's optimum is worth <10% according to the other's utility (or maybe this holds only for the scope-sensitive parts of their morals, since the scope-insensitive ones are trivial to satisfy).
To pump related intuitions about how difficult and arbitrary moral deliberation can get, I like Demski here.
I'm sure some of people's ignorance of these threat models comes from the reasons you mention. But my intuition is that most of it comes from "these are vaguer threat models that seem very up in the air, while other ones seem more obviously real and more shovel-ready" (this is similar to your "Flinch", but I think more conscious and endorsed).
Thus, I think the best way to converge on whether these threat models are real/likely/actionable is to work through as-detailed-as-possible example trajectories. Someone objects that the state will handle it? Let's actually think through what the state might look like in 5 years! Someone objects that democracy will prevent it? Let's actually think through the consequences of cheap cognitive labor for democracy!
This is analogous to what pessimists about single-single alignment have gone through. They have some abstract arguments, people don't buy them, so they start working through them in more detail or provide example failures. I buy some parts of them, but not others. And if you did the same for this threat model, I'm uncertain how much I'd buy!
Of course, the paper might have been your way of doing that. I enjoyed it, but still would have preferred more fully detailed examples, on top of the abstract arguments. You do use examples (both past and hypothetical), but they are more like "small, local examples that embody one of the abstract arguments", rather than "an ambitious (if incomplete) and partly arbitrary picture of how these abstract arguments might actually pan out in practice". And I would like to know the messy details of how you envision these abstract arguments coming into contact with reality. This is why I liked TASRA, and indeed I was more looking forward to an expanded, updated and more detailed version of TASRA.
Just writing a model that came to mind, partly inspired by Ryan here.
Extremely good single-single alignment should be highly analogous to "current humans becoming smarter and faster thinkers".
If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do think it's more likely than not that the former wins, but it's not guaranteed.
Probably someone like Ryan believes most of those failures will come in the form of explicit conflict or sudden attacks. I can also imagine slower erosions of global utility, for example the safe interfaces/defenses between humans degenerating into unworkable slop that eats up most resources.
If this doesn't happen at roughly the same speed for all humans, you also get power imbalance and its consequences. One could argue that differences in resources between humans will keep growing, in which case power imbalance is the only stable state.
If instead of perfect single-single alignment we get the partial (or more taxing) fix I expect, the situation degrades further. Extending the analogy, this would be the smart humans sometimes being possessed by spirits with different utilities, which not only has direct negative consequences but could also complicate coordination once it's common knowledge.
Fantastic snapshot. I wonder (and worry) whether we'll look back on it with feelings similar to those we now have about What 2026 looks like.
There is also no “last resort war plan” in which the president could break all of the unstable coordination failures and steer the ship.
[...]
There are no clear plans for what to do under most conditions, e.g. there is no clear plan for when and how the military should assume control over this technology.
These sound intuitively unlikely to me, by analogy to nuclear or bio. Of course, that is not to say these protocols will be sufficient or even sane, by analogy to nuclear or bio.
This makes it really unclear what to work on.
It's not super obvious to me that there won't be clever ways to change local incentives / improve coordination, and successful interventions in this direction would seem incredibly high-leverage, since they're upstream of many of the messy and decentralized failure modes. If they do exist, they probably look less like "a simple coordination mechanism", and more like "a particular actor gradually steering high-stakes conversations (through a sequence of clever actions) to bootstrap minimal agreements". Of course, similarity to past geopolitical situations does make it seem unlikely on priors.
There is no time to get to very low-risk worlds anymore. There is only space for risk reduction along the way.
My gut has been in agreement for some time that the most cost-effective x-risk reduction now probably looks like this.
I agree with the conjunctiveness, although again I'm more optimistic about huge improvements. I mostly wanted to emphasize that I'm not sure there are structurally robust reasons (as opposed to personal whims) why huge spending on safety won't happen.
Speaking for myself (not my coauthors), I don't agree with your two items, because:
More generally, I think behavioral self-awareness for capability evaluation is and will remain strictly worse than the obvious capability evaluation techniques.
That said, I do agree that systematic inclusion of considerations about negative externalities should be a norm, and thus we should have done so. I will just briefly say that a) behavioral self-awareness seems differentially more relevant to alignment than to capabilities, and b) we expected lab employees to find out about this themselves (in part because it isn't surprising given out-of-context reasoning), and we in fact know that several lab employees did. Thus I'm pretty certain the positive externalities of building common knowledge and thinking about alignment applications are notably bigger.
Most difficulties you raise here could imo change drastically with tens of billions being injected into AI safety, especially thanks to new ideas coming out of left field that might make safety cases way more efficient. (I'm probably more optimistic about new ideas than you, partly because "it always subjectively feels like there are no big ideas left", and AI safety is so young.)
If your government picks you as a champion and gives you amazing resources, you no longer have to worry about national competition, and that amount seems doable. You still have to worry about international competition, but will that race feel so tight that you can't even spare that much? My guess would be no. That said, I still don't expect certain lab leaders to want to do this.
The same is not true of security though, that's a tough one.
See our recent work (especially the section on backdoors), which opens the door to directly asking the model. Although there are obstacles like the Reversal Curse, and it's unclear whether this can be made to scale.
It's unclear what the optimal amount of thinking per step is. My initial guess would have been that letting Claude think for a whole paragraph before every single action (rather than only every 10 actions, or whenever it's in a match, or whatever) scores slightly better than letting it think more (sequentially). But I guess whatever setup the streamer has settled on after some iteration probably works better.
The story for parallel checks could be different though. My guess is that going all out would improve robustness (reducing failures like "I close a goal before actually achieving it"): let Claude generate the paragraph 5 times, then generate 5 more parallel paragraphs checking whether it has gotten something wrong, and then have a lower-context version of Claude decide whether there are any important disagreements, and if not just majority-vote. But maybe this adds too much bloat and opportunities for mistakes, or makes some kinds of mistakes rarer but others way worse.
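A minimal sketch of the kind of parallel-check scaffold I have in mind (the `generate` and `judge` callables, the prompts, and the "last line is the action" convention are all hypothetical placeholders, not anything the streamer is actually running):

```python
from collections import Counter
from typing import Callable, List

def parallel_check_step(
    generate: Callable[[str], str],  # hypothetical: one full-context model call, prompt -> text
    judge: Callable[[str], str],     # hypothetical: a lower-context judge call
    game_state: str,
    n_samples: int = 5,
) -> str:
    """One decision step: sample several reasoning paragraphs, critique each in
    parallel, let a low-context judge check for important disagreements, and
    otherwise fall back to a majority vote over the proposed actions."""
    # 1. Sample n reasoning paragraphs, each ending in a proposed action.
    proposals: List[str] = [
        generate(f"Think for a paragraph, then output one action on the last line.\n{game_state}")
        for _ in range(n_samples)
    ]

    # 2. For each proposal, sample an independent critique asking what (if anything) went wrong.
    critiques = [
        generate(f"Here is a proposed plan:\n{p}\nWhat, if anything, is wrong with it?")
        for p in proposals
    ]

    # 3. A lower-context judge sees only proposals + critiques and flags important disagreements.
    transcript = "\n---\n".join(proposals + critiques)
    verdict = judge("Do these proposals/critiques importantly disagree? Answer YES or NO.\n" + transcript)
    if verdict.strip().upper().startswith("YES"):
        # Escalate: ask the judge to resolve the disagreement and pick an action itself.
        return judge("Resolve the disagreement and output a single action.\n" + transcript)

    # 4. No important disagreement: majority-vote over the final action lines.
    actions = [p.strip().splitlines()[-1] for p in proposals if p.strip()]
    return Counter(actions).most_common(1)[0][0]
```

This is also where the bloat worry shows up concretely: each step now costs ~11 model calls instead of 1, and the judge becomes a new single point of failure.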