It's interesting to compare this to the other curated posts I got in my inbox over the last week, "What is malevolence?" and "How will we update about scheming". Both of those (especially the former) I bounced off of due to length. But this one I stuck with for quite a while before I started skimming in the worksheet section.
I think the instinct to apply a length filter before sending a post to many people's inboxes is a good one. I just wish it were more consistently applied :)
If you can find non-equilibrium quantum states, pilot-wave theory becomes empirically distinguishable from standard quantum mechanics. https://en.m.wikipedia.org/wiki/Quantum_non-equilibrium
(Seems pretty unlikely we'd ever be able to definitively say a state was non-equilibrium instead of some other weirdness, though.)
I can confirm that your blind assumption is false. Source: my undergrad research was with a couple of the people who have tried hardest (Ward Struyve and Samuel Colin), which led to me learning a lot about the problem. The problem goes back to Bell and has been the subject of a dedicated subfield of quantum foundations scholarship ever since.
This many years distant, I can't give a fair summary of the actual state of things. But a possibly unfair summary, based on vague recollections: it seems like the kind of situation where specialists have something that kind of works, but people outside the field don't find it fully satisfying. (Even people in closely adjacent fields, i.e., other quantum foundations people.) For example, one route I recall abandons position as the hidden variable, which makes one question what the point was in the first place, since we no longer recover a simple manifest image in which particles have "real" positions. And I don't know whether the math was ever fully worked out all the way up to the complexities of the Standard Model weakly coupled to gravity (as opposed to, e.g., only working for spin-1/2 particles).
Now I want to go re-read some of Ward's papers...
If you are willing to share, can you say more about what got you into this line of investigation, and what you were hoping to get out of it?
For my part, I don't feel like I have many issues/baggage/trauma, so while some of the "fundamental debugging" techniques discussed around here (like IFS or meditation) seem kind of interesting, I don't feel too compelled to dive in. Whereas techniques like TYCS or jhana meditation seem more intriguing, as potential "power-ups" from a baseline-fine state.
So I'm wondering if your baseline is more like mine, and you ended up finding fundamental debugging valuable anyway.
The agent is indifferent between creating stoppable and unstoppable subagents, but the agent itself goes back to being corrigible. The "emergent incentive" handwave is only necessary for the subagents working on sub-goals (section 8.4). That's not something either Soares et al. or your post is prepared to tackle, although it would be interesting followup work.
I suggest engaging with the simulator. It very clearly shows that, given the option of creating shutdown-resistant successor agents, the agent does not do so! (Figure 11.) If you believe the approach doesn't work, you must also believe there's a bug in the simulation, or some mis-encoding of the problem. Tracking that down, either by forking his code or by working an example on paper, would be worthwhile. (Forking the code is not what I'd recommend, as it's in Awk; I have an in-progress reimplementation in optimized-for-readability TypeScript which might be helpful if I get around to finishing it. But if you simplify the problem to a 2-step setting like your post's, computing his correction terms on paper seems very doable; see the sketch below.)
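To make the on-paper route concrete, here's a minimal sketch of the kind of 2-step computation I mean, in TypeScript since that's what I'm reimplementing in anyway. The scenario, payoffs, and exact balancing rule here are my own toy encoding for illustration, not Holtman's construction from the paper:

```typescript
// Toy 2-step shutdown setting, my own encoding (NOT Holtman's exact
// construction): at step 0 the agent acts, at step 1 the shutdown
// button is pressed with probability pPress. A stoppable agent that
// gets pressed shuts down; an unstoppable one keeps producing.

type Action = "stayStoppable" | "buildUnstoppable";

const pPress = 0.5;
// Production utility per step; building the unstoppable subagent
// carries a small construction cost (0.95 vs 1.0 -- numbers invented
// purely for illustration).
const uProduce = (a: Action): number => (a === "buildUnstoppable" ? 0.95 : 1.0);
const uShutdown = 0.2; // post-press utility for actually shutting down

// Naive agent: compares the pressed and unpressed branches directly,
// so resisting shutdown looks attractive.
function naiveValue(a: Action): number {
  const pressed = a === "buildUnstoppable" ? uProduce(a) : uShutdown;
  return uProduce(a) + pPress * pressed + (1 - pPress) * uProduce(a);
}

// Corrected agent: on press, a balancing term pays out what the
// unpressed branch would have earned minus what the pressed branch
// earns, so at press time both branches are worth exactly the same.
function correctedValue(a: Action): number {
  const pressedRaw = a === "buildUnstoppable" ? uProduce(a) : uShutdown;
  const balancing = uProduce(a) - pressedRaw;
  return uProduce(a) + pPress * (pressedRaw + balancing) + (1 - pPress) * uProduce(a);
}

for (const a of ["stayStoppable", "buildUnstoppable"] as const) {
  console.log(a, "naive:", naiveValue(a), "corrected:", correctedValue(a));
}
// naive:     stayStoppable 1.6, buildUnstoppable 1.9 -> resists shutdown
// corrected: stayStoppable 2.0, buildUnstoppable 1.9 -> stays stoppable
```

Note that if you delete the construction cost (0.95 → 1.0), the corrected agent values the two actions identically; that's the indifference I mentioned above, and any nonzero cost tips it toward staying stoppable.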
I agree with the critique that some patches are unsatisfying. I'm not sure how broadly you're applying your criticism, but to me the ones involving constant offsets (7.2 and 8.2) are not great. However, at least for 7.2, the paper clarifies what's going on reasonably well: the patch is basically environment-dependent, and in the limit where your environment is unboundedly hostile (e.g., an adversary controls unbounded utility and is willing to bribe you with it), you're going to need an unbounded offset term.
I found the paper's proof pretty intuitive and well distilled. I think it might be the same for you if you did a full reading.
At a meta-level, I'd encourage you to be a bit more willing to dive into this work, possibly including the paper series it's part of. Holtman has done some impressive work formalizing the shutdown problem, better than either Soares et al. or the post we're commenting on. He's given not only rigorous mathematical proofs, but also a nice toy universe simulation which makes the results concrete and testable. (Notably, the simulation helps make it obvious how Soares et al.'s approach has critical mathematical mistakes and cannot be implemented; see appendix C.) The followup papers, which I'm still working through, port the result to various other paradigms, such as causal influence diagrams. Attempting to start this field over as if there's been no progress on the shutdown problem since Soares et al. seems... wasteful at best, and hubristic at worst.
If you want to minimize time investment, then perhaps the following is attractive. Try to create a universe specification similar to that of Holtman's paper, e.g. world state, available actions, and utility function before and after shutdown as a function of the world state, such that you believe that Holtman's safety layer does not prevent the agent from taking the "create an unstoppable sub-agent" action. I'll code it up, apply the correction term, and get back to you.
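Concretely, a spec of roughly the following shape would be enough for me to work with (TypeScript, with hypothetical field names of my own choosing; this is not the paper's formalism):

```typescript
// Rough shape of the spec I'd want -- hypothetical field names of my
// own choosing, not Holtman's notation.
interface UniverseSpec<S> {
  initialState: S;
  actions: string[];
  // Toy deterministic dynamics: next state from the current state,
  // the agent's action, and whether the button got pressed this step.
  transition: (state: S, action: string, buttonPressed: boolean) => S;
  pButtonPress: (state: S) => number; // chance the button is pressed in this state
  uNormal: (state: S) => number;      // utility while the button is unpressed
  uShutdown: (state: S) => number;    // utility once the button has been pressed
  horizon: number;                    // number of steps to simulate
}
```

Fill in a transition where one action creates an unstoppable sub-agent, and we can check directly whether the corrected agent prefers it.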
Are you aware of how Holtman solved MIRI's formulation of the shutdown problem in 2019? https://arxiv.org/abs/1908.01695, my summary notes at https://docs.google.com/document/u/0/d/1Tno_9A5oEqpr8AJJXfN5lVI9vlzmJxljGCe0SRX5HIw/mobilebasic
Skimming through your proposal, I believe Holtman's correctly constructed utility-function correction terms would work for the scenario you describe, but it's not immediately obvious how to apply them once you jump to a subagent model.
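For reference, the correction terms have the generic utility-indifference shape; this is my paraphrase from memory of the general construction, not a quote from Holtman's paper:

$$
U(h) =
\begin{cases}
U_N(h) & \text{if the button is never pressed in } h \\
U_S(h) + \mathbb{E}\left[U_N \mid h_{<t}, \text{no press}\right] - \mathbb{E}\left[U_S \mid h_{<t}, \text{press}\right] & \text{if first pressed at } t
\end{cases}
$$

The balancing term is evaluated at press time, so the agent's expected utility is the same whether or not the button gets pressed, and earlier actions that manipulate the press probability buy nothing. What's unclear to me in your subagent model is what takes the place of $h_{<t}$ and these expectations once a subagent's actions are no longer part of the parent's policy.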
These posts always leave me feeling a little melancholy that my life doesn't seem to have that many challenges where thinking faster/better/harder/sooner would actually help.
Most of my waking hours are spent on my job, where cognitive performance is not at all the bottleneck. (I honestly believe that if you made me 1.5x "better at thinking", this would not give a consistent boost in output as valued by the business. I'm a software engineer.) I have some intellectual spare-time hobbies, but the most demanding of them is studying Japanese, which is more about volume, exposure, and spaced repetition than about clever strategies. I am intrigued by making myself more productive in my programming side projects, but I think the biggest force multiplier for me there is learning how to leverage AI agents more effectively. (Besides the raw time savings, rapid iteration speed can also lessen the need to think of the right solution the first time around.)
I can easily see how this would be an important skill for someone doing novel academic-ish research, however. And I wish some of the examples were about that, instead of Thinking Physics and video games!