At some point essentially everything of value is going to be abstract computation, and anything outside of that will be either noise or the fully understood physical substrate. It's not clear how that abstract computation should be structured, but the experience from engineering physical systems or communication technologies, the classical ways of understanding the physical world, isn't obviously that relevant for agent foundations.
Even with text-only reasoning traces, there is inscrutable thinking in the activations that can span the whole reasoning trace (with global attention); it just has to remain shallow, since its depth within a single forward pass is bounded by the number of layers, and can only be extended by looping around through the text-only reasoning trace. There will be some steganography (RL makes sure of this), even if it's not substantial, which increases the effective depth of inscrutable thinking beyond a single pass through the layers. But it stops with the reasoning trace, so this affordance is still bounded.
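As a back-of-the-envelope illustration (the numbers below are made-up assumptions, not measurements of any particular model):

```python
# Back-of-the-envelope sketch; n_layers and n_trace_tokens are made-up
# assumptions, not properties of any particular model.

n_layers = 80            # serial depth of one forward pass
n_trace_tokens = 10_000  # length of the text-only reasoning trace

# Without a hidden channel, inscrutable serial depth is capped per token:
# anything carried across tokens must pass through the visible text.
depth_without_stego = n_layers

# With a steganographic channel, hidden state can survive the text
# bottleneck, so hidden serial depth can compound across the trace,
# though it still ends when the trace ends.
max_depth_with_stego = n_layers * n_trace_tokens

print(depth_without_stego, max_depth_with_stego)
```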
Continual learning greatly extends the thinking horizon, likely using inscrutable weight updates, thus in particular extending the feasible depth of inscrutable thinking that can avoid showing up in the text-only reasoning traces.
Continual learning destroys the hope that we get to early AGIs while avoiding the issues with neuralese, because it almost certainly requires inscrutable communication that maintains thinking over long horizons (likely in the form of some kind of test-time weight updates), and it's an essential capability. With neuralese itself, there is a significant possibility that it won't have improved much by that point, but continual learning does need to happen first (as a capability rather than any particular algorithmic breakthrough), and it more inherently has the same issues.
A strange attitude towards the physical world can be reframed as caring only about some abstract world that happens to resemble the physical world in some ways. A chess AI could be said to be acting on some specific physical chessboard within the real world while carefully avoiding all concern about everything else, but it's more naturally described as acting on just the abstract chessboard, nothing else. I think values/preferences (for an arbitrary agent) should not just be a probutility over the physical world, but should also specify which world they are about, so that different agents can disagree normatively not only about the relative value of events, but about which worlds are worth caring about (not just possible worlds within some space of nearby possible worlds, but fundamentally different abstract worlds), and therefore about what kinds of events (from which sample spaces) ought to serve as semantics for possible actions before their value can even be considered.
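A minimal formalization sketch of this point (the pair notation below, with V, W, μ_W, and 𝓔(W), is mine, not anything standard):

```latex
% Illustrative sketch (my notation): values fix both an abstract world W
% (e.g. the abstract chessboard) and a probutility over W's events, rather
% than only a probutility over the physical world.
\[
  V = \bigl( W,\ \mu_W \bigr), \qquad \mu_W : \mathcal{E}(W) \to \mathbb{R},
\]
% where \mathcal{E}(W) is the space of events of W from which possible actions
% get their semantics. Two agents can then disagree not only about \mu_W for a
% shared W, but about which W is worth caring about at all.
```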
A world model (such as an LLM with frozen weights) is already an abstraction: its data is not the same as the physical world itself, but it's coordinated with the physical world to some extent, similarly to how an abstract chessboard is coordinated with a specific physical chessboard in the real world (learning is coordination, adjusting the model so that the model and the world have more shared explanations for their details). Thus acting within an abstract world given by a world model (as opposed to within the physical world itself) might be a useful framing for systematically ignoring some aspects of the physical world, and world models could be intentionally crafted to emphasize particular aspects.
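A minimal sketch of this framing in code, assuming the agent interacts only with the state the world model exposes (AbstractWorld, act_in_abstract_world, and the rest are my own illustrative names, not an existing API):

```python
# Minimal sketch: an agent acts only within the abstract world a frozen
# world model exposes; whatever the model's author left out of that world
# simply does not exist for the agent.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class AbstractWorld:
    state: Any                              # abstract state, not physical state
    transition: Callable[[Any, str], Any]   # how actions update the abstract state

def act_in_abstract_world(world: AbstractWorld,
                          policy: Callable[[Any], str],
                          steps: int) -> Any:
    """The agent only ever reads and writes the abstract state; the physical
    world enters only through however the model was crafted and trained."""
    for _ in range(steps):
        action = policy(world.state)
        world.state = world.transition(world.state, action)
    return world.state

# Toy usage: a "world" that only tracks a counter, emphasizing one aspect
# of reality and ignoring everything else.
counter_world = AbstractWorld(state=0, transition=lambda s, a: s + (1 if a == "inc" else 0))
print(act_in_abstract_world(counter_world, policy=lambda s: "inc", steps=3))  # 3
```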
I would term □x → x "hope for x" rather than "reliability", because it's about willingness to enact x in response to belief in x, but if x is no good, you shouldn't do that. Indeed, for bad x, having the property □x → x is harmful fatalism, following along with destiny rather than choosing it. In those cases, you might want □x → ¬x or something, though that only prevents x from being believed, an x you then won't need to face in actuality; it doesn't prevent the actual x. So □x → x reflects a value judgement about x reflected in the agent's policy, something downstream of endorsement of x, a law of how the content of the world behaves according to an embedded agent's will.
Payor's Lemma then talks about belief in hope, □(□x → x), that is, hope itself is exogenous and needs to be judged (endorsed or not). Which is reasonable for games, since what the coalition might hope for is not any one agent's individual choice; the details of this hope couldn't have been hardcoded in any agent a priori and need to be negotiated during the decision that forms the coalition. A functional coalition should be willing to act on its own hope (which is again something we need to check for a new coalition, even if it might've already been the case for a singular agent), that is, we need to check that □(□x → x) is sufficient to motivate the coalition to actually enact x. This is again a value judgement about whether this coalition's tentative aspirations, being a vehicle for hope that x, are actually endorsed by it.
Thus I'd term □(□x → x) "coordination" rather than "trust", the fact that this particular coalition would tentatively intend to coordinate on a hope for x. Hope is a value judgement about x, and in this case it's the coalition's hope rather than any one agent's hope, and the coalition is a temporary, nascent agency that doesn't necessarily know what it wants yet. The coalition asks: "If we find ourselves hoping for x together, will we act on it?" So we start with coordination about hope, seeing if this particular hope wants to settle as the coalition's actual values, and judging whether it should by enacting x if at least coordination on this particular hope is reached, which should happen only if x is a good thing.
(One intuition pump, with some limitations outside the provability formalism, is treating □x as "probably x", perhaps according to what some prediction market tells you. If "probably x" is enough to prompt you to enact x, that's some kind of endorsement, and it's a push towards increasing the equilibrium-on-reflection value of the probability of x, pushing "probably x" closer to reality. But if x is terrible, then enacting it in response to its high probability is following along with self-fulfilling doom, rather than doing what you can to push the equilibrium away from it.)
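A toy sketch of this intuition pump (the threshold and the policy are arbitrary assumptions): a policy of enacting x whenever the market says "probably x" makes both "x never happens" and "x always happens" self-consistent market equilibria, regardless of whether x is any good.

```python
# Toy model: the agent enacts x whenever the market probability of x exceeds
# a threshold; the market is consistent when its probability matches what the
# policy then actually makes happen.

def outcome_probability(p_market: float, threshold: float = 0.9) -> float:
    """Probability that x actually happens, given the policy of enacting x
    whenever the market probability exceeds the threshold."""
    return 1.0 if p_market > threshold else 0.0

# Fixed points: market probabilities that predict themselves.
candidates = [i / 100 for i in range(101)]
equilibria = [p for p in candidates if outcome_probability(p) == p]
print(equilibria)  # [0.0, 1.0]: both avoiding x and self-fulfilling doom are stable
```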
Löb's Theorem then says that if we merely endorse a belief by enacting the believed outcome, this is sufficient for the outcome to actually happen, a priori and without that belief yet being in evidence. And Payor's Lemma says that if we merely endorse a coalition's coordinated hope by enacting the hoped-for outcome, this is sufficient for the outcome to actually happen, a priori and without the coordination around that hope yet being in evidence. The use of Löb's Theorem or Payor's Lemma is that the condition (belief in x, or coordination around hope for x) should help in making the endorsement, that is, it should be easier to decide to enact x if you already believe that x, or if you already believe that your coalition is hoping for x. For coordination, this is important because every agent can only unilaterally enact its own part in the joint policy, so it does need some kind of premise about the coalition's nature (in this case, about the coalition's tentative hope for what it aims to achieve) in order to endorse playing its part in the coalition's joint policy. It's easier to decide to sign an assurance contract than to unconditionally donate to a project, and the role of Payor's Lemma is to say that if everyone does sign the assurance contract, then the project will in fact get funded sufficiently.
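For reference, the two statements in standard provability-logic notation, with □ read as belief (provability) and x as the outcome in question:

```latex
% Löb's Theorem and Payor's Lemma, standard statements.
\begin{align*}
  \text{L\"ob's Theorem:} &\quad \text{if } \vdash \Box x \to x \text{, then } \vdash x \\
  \text{Payor's Lemma:}   &\quad \text{if } \vdash \Box(\Box x \to x) \to x \text{, then } \vdash x
\end{align*}
```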
Power-centralisation in a post-AGI world is not about wielding humans, unlike in a pre-AGI world. Power is no longer power over humans doing your bidding, because humans doing your bidding won't give you power. By orthogonality, any terrible thing can in principle be someone's explicit intended target (an aspiration, not just a habit shaped by circumstance), but that's rare. Usually the terrible things are (a side effect of) an instrumentally useful course of action that has other intended goals, even where in the final analysis the justification doesn't quite work.
Most s-risk scenarios vaguely analogous to historical situations don't happen in a post-AGI world, because there humans aren't useful for anything, either economically or in terms of maintaining power (unlike how they were throughout human history). It's not useful for the entities in power to do any of the things with traditionally terrible side effects.
The absence of feedback loops for treating people well (at the level of humanity as a whole) is its own problem, but a distinct kind of problem. How people are treated doesn't necessarily settle poorly (at the level of individuals and smaller communities) in a world with radical abundance, if indeed even a tiny fraction of the global resources gets allocated to the future of humanity, which is the hard part to ensure.
Even Anthropic is building frontier AIs (since Opus 3 or Sonnet 3.5; this was a bit of a surprise to some at the time). Thus if the hypothetical breakthrough of meaningfully better capabilities from neuralese happens (in whatever form), all AI companies will start making use of it as soon as the immediate behavioral downsides get mitigated to the usual level. Any other kind of downside won't be a reason not to go there with frontier AIs.
Also, continual learning is analogous to neuralese: an inscrutable way of preserving and propagating information along long chains of reasoning, different from text-only notes. In both cases, you can build a textual "transcript" of the process of thinking, but it's not necessarily faithful, and it doesn't screen off earlier thinking from later thinking.
Trillions in revenue is more plausible if continual learning is somewhat working soon and gradually getting better, rather than mostly absent until it's unlocked in 2033. With gradual improvement in continual learning, AGI timelines might come to look more like gradual disempowerment timelines, with no clear thresholds until it has suddenly already been too late for a while.
It's an argument from cosmic normality, about the scope of applicability of such methods. Like medicine or biology, the relevance is a temporary accident of the current phase of the human condition. I'm not sure how natural the impression of overclaiming the applicability of physics, statistics, information theory, or machine learning is; perhaps this is quite clear already.
The point is that these things are not at all obviously relevant to the nature of agency, or of values. You could in principle have frail human biological bodies within a simulated world, and practice medicine on them, but that's hardly a central thing that happens in a post-computronium world.