I think this is too big-brain. Reasoning about systems more complex than you should look more like logical inductors, or infrabayesian hypotheses, or heuristic arguments, or other words that code for "you find some regularities and trust them a little, rather than trying to deduce an answer that's too hard to compute."
Which part specifically are you referring to as being overly complicated? What I take to be the primary assertions of the post are:
I'll try to justify my approach with respect to one or more of these claims, and if I can't, I suppose that would give me strong reason to believe the method is overly complicated.
This doesn't have to be resource acquisition, just any negative action that we could reasonably expect a rational agent to pursue.
I am disagreeing with the underlying assumption that it's worthwhile to create simulacra of the sort that satisfy point 2. I expect an AI reasoning about its successor to not simulate it with perfect fidelity - instead, it's much more practical to make approximations that make the reasoning process different from instantiating the successor.
I expect agentic simulacra to occur without being intentionally simulated, in that agents are generally useful for solving prediction problems, and over millions of predictions (as would be expected of a product on the order of ChatGPT, or its successors) it is probable that agentic simulacra occur. Even if these agents are only approximations, in predicting their behavior their preferences could still be satisfied in the real world (as described in the Hubinger post).
The problem I'm interested in is how to ensure that all subsequent agentic simulacra (whether they arise intentionally or otherwise) are safe, which seems difficult to verify formally due to the Löbian Obstacle.
As someone who's barely scratched the surface of any of this, I was vaguely under the impression that "big-brain" described most or all of the theoretic/conceptual alignment in this cluster of things, including e.g. both the Löbian Obstacle and infrabayesianism. Once I learn all these more in-depth and think on them, I may find and appreciate subtler-but-still-important gradations of "galaxy-brained-ness" within this idea cluster.
Layman here 👋
Iiuc we cannot trust the proof of an unaligned simulacrum's suggestion because it is smarter than us.
Would that be a non-issue if verifying the proof is easier than making it?
If we can know how hard it is to verify a proof without verifying, then we can find a safe protocol for communicating with this simulacra. Is this possible?
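The asymmetry this comment is gesturing at can be illustrated with a toy stand-in (my own analogy, not the paper's formalism): for problems like SAT, *checking* a certificate is cheap even when *finding* one is expensive, which is the intuition behind hoping that verifying a proof is easier than producing it.

```python
import itertools

# Toy illustration: a satisfying assignment acts as a short certificate.
# Checking it is linear in the formula size; finding one by brute force
# is worst-case exponential in the number of variables.

def check_certificate(clauses, assignment):
    """Verify a candidate assignment in time linear in the formula size."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

def find_certificate(clauses, n_vars):
    """Brute-force search over all 2^n assignments."""
    for bits in itertools.product([False, True], repeat=n_vars):
        assignment = {i + 1: bits[i] for i in range(n_vars)}
        if check_certificate(clauses, assignment):
            return assignment
    return None

# (x1 or not x2) and (x2 or x3)
clauses = [[1, -2], [2, 3]]
cert = find_certificate(clauses, 3)       # expensive direction
assert cert is not None
assert check_certificate(clauses, cert)   # cheap direction
```

Note the limits of the analogy: the Löbian Obstacle is not about the computational cost of checking a proof, but about whether the verifier's axioms can license trusting what the proof establishes, so cheap verification alone does not obviously dissolve the problem.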
In 2013, Eliezer Yudkowsky and Marcello Herreshoff published Tiling Agents for Self-Modifying AI, and the Löbian Obstacle. It is worth understanding because it:
When I first read this paper, I badly misunderstood it; it is not particularly approachable material for someone not well-versed in logic, and I was confidently wrong for at least a month. This post summarizes my understanding of what the Löbian Obstacle is (having re-read the paper), and why I think it remains an important idea one decade after its publication.
An agent $A_1$ occupies a fully-known, deterministic, closed environment. $A_1$ has a goal $G$ that an outcome either satisfies or does not, and $A_1$ prefers satisfaction. An action $b_i \in \mathrm{Acts}_0$ performed by an agent created by $A_1$, hereafter referred to as $A_0$, must therefore satisfy the statement:
$$\overline{b_i} \Rightarrow \big(A_0 \Vdash \overline{b_i} \rightarrow G\big)$$
where $\overline{b_i}$ denotes the actual performance of $b_i$, and $\Vdash$ denotes cognitive belief in the statement that follows. Even if $A_1$ could verify by inspecting $A_0$'s design that $\overline{b_i} \Rightarrow (A_0 \Vdash \overline{b_i} \rightarrow G)$ will hold, as in:
$$A_1 \Vdash A_0 \rightarrow \big(\forall b_i \in \mathrm{Acts}_0 : \overline{b_i} \rightarrow \square_0 \ulcorner \overline{b_i} \rightarrow G \urcorner\big)$$
where $\square_0 \ulcorner \phi \urcorner$ asserts the existence of a proof of $\phi$ from $A_0$'s axioms, $\mathrm{Axm}_0$, this does not suffice, as concluding that $A_0$'s actions achieve $G$ would further require:
$$A_1 \Vdash \forall b_i : \big(\square_0 \ulcorner \overline{b_i} \rightarrow G \urcorner \rightarrow (\overline{b_i} \rightarrow G)\big)$$
For this to hold, $\mathrm{Axm}_1$ would need to prove that whenever a proof of $\ulcorner \phi \urcorner$ exists in $\mathrm{Axm}_0$, $\phi$ is true. Tiling Agents for Self-Modifying AI, and the Löbian Obstacle shows this to be impossible when $\mathrm{Axm}_0$ is at least as strong as $\mathrm{Axm}_1$: a formal system cannot prove its own soundness schema.
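The obstruction can be made concrete in one step via Löb's theorem (my summary of the standard provability-logic argument, not a quote from the paper). For any theory $T$ extending Peano Arithmetic, with provability predicate $\square_T$:

```latex
% Löb's theorem: if T proves that provability of φ implies φ, then T proves φ.
T \vdash \big(\square_T \ulcorner \phi \urcorner \rightarrow \phi\big)
  \;\implies\; T \vdash \phi
```

If $T$ proved its own soundness schema, instantiating it at $\phi = \bot$ would give $T \vdash \square_T \ulcorner \bot \urcorner \rightarrow \bot$, and Löb's theorem would then yield $T \vdash \bot$. So a consistent $\mathrm{Axm}_1$ cannot prove the soundness of any $\mathrm{Axm}_0$ at least as strong as itself.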
The above was a brief paraphrasing of section two of the original paper, which contains many additional details and complete proofs. How the Löbian Obstacle relates to simulators is my current topic of research, and this section will make the case that this is an important component of designing safe simulators.
We should first consider that simulating an agent is not meaningfully distinguishable from creating one, and that the implications of creating dangerous agents should therefore generalize to simulating them. Hubinger et al. (2023) raise similar concerns and provide a more detailed examination of the argument.
It is also crucial to understand that simulacra are not necessarily terminating, and may themselves use simulation as a heuristic for solving problems. This could result in a kind of hierarchy of simulacra. In advanced simulators capable of very complex simulations, we might expect a complex network of simulacra bound by acausal trade and 'complexity theft,' whereby one simulacrum tries to obtain more simulation complexity as a form of resource acquisition or recursive self-improvement.
I expect this to happen. Lower-complexity simulacra may still be more intelligent than their higher-complexity counterparts, and as a simulator's simulacra count grows (potentially exponentially), so does the likelihood that at least one simulacrum attempts complexity theft.
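The scaling intuition can be made quantitative with a deliberately crude toy model (my own illustration, not from the post), assuming each simulacrum independently attempts complexity theft with some small fixed probability $p$:

```python
# Toy model: if each of n simulacra independently attempts 'complexity theft'
# with probability p, the chance that at least one does so is 1 - (1-p)^n,
# which approaches 1 as n grows. Independence here is a simplifying assumption.

def p_any_defects(p: float, n: int) -> float:
    """P(at least one of n independent simulacra defects)."""
    return 1.0 - (1.0 - p) ** n

# Even a per-simulacrum probability of 1e-6 becomes near-certain at scale.
for n in (10**3, 10**6, 10**9):
    print(n, p_any_defects(1e-6, n))
```

The real picture is worse than independence suggests if simulacra can coordinate (e.g. via the acausal trade mentioned above), but even this simple model shows why per-simulacrum safety probabilities are not enough at scale.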
If we want safe simulators, we need the subsequent, potentially abyssal hierarchy of simulacra to be aligned all the way down. Without a way to thwart the Löbian Obstacle, I doubt a formal guarantee is attainable. If we could thwart it, we might only need to simulate one aligned tiling agent, for which we could settle for a high-certainty informal guarantee of alignment if short on time. I outlined how I thought that could be done here, although I've advanced the theory considerably since and will post an updated write-up soon.
If we can't reliably thwart the Löbian Obstacle, we should consider alternatives: