At this point I would direct the "deferred task" apparatus fully towards interventional interpretability. Put a moratorium on further gradient-based training, which is not well understood and can have many indirect effects unless you have some understanding of modularity and have applied stop-gradients almost everywhere that is irrelevant to whatever generates the conditional, deceptive reasoning behavior. Instead, halt, melt and catch fire at that point.
Halt further model deployments towards the original deferred task. Quarantine the model that first exhibited the reasoning and promote it to a model organism (never deployed further, only studied by other earlier models). Direct the apparatus (from earlier checkpoints before you reached this stage) towards surgically reducing it while preserving mechanisms relevant to the original deferred task.
In the frame of enumerative safety, this would look something like identifying which circuits and features are minimal to the original deferred task (though this hinges upon task-minimal circuits being deception-free), evaluating the identified regions for mechanism preservation, localization and minimality, and then ablating the rest at run time.
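To make the "keep the task-minimal set, ablate the rest" step concrete, here is a minimal sketch, not a real interpretability pipeline. It assumes the model's internals at some layer can be summarized as a feature-activation matrix, that task behavior is read off by a linear readout, and that someone has already identified the task-minimal feature indices. The names `ablate_outside_task_set`, `task_preserved`, `readout` and `task_minimal_idx` are all hypothetical.

```python
import numpy as np

def ablate_outside_task_set(activations, task_minimal_idx, baseline=None):
    """Replace every feature NOT in the task-minimal set with a baseline value.

    activations: array of shape (n_examples, n_features)
    task_minimal_idx: iterable of feature indices to keep (assumed given)
    baseline: per-feature replacement values; defaults to zero-ablation
    """
    activations = np.asarray(activations, dtype=float)
    n_features = activations.shape[-1]
    keep = np.zeros(n_features, dtype=bool)
    keep[np.fromiter(task_minimal_idx, dtype=int)] = True
    if baseline is None:
        baseline = np.zeros(n_features)
    # Kept features pass through unchanged; everything else is overwritten.
    return np.where(keep, activations, baseline)

def task_preserved(activations, readout, task_minimal_idx, tol=1e-6):
    """Crude mechanism-preservation check under the linear-readout assumption:
    does the task output survive ablating everything outside the kept set?"""
    activations = np.asarray(activations, dtype=float)
    full = activations @ readout
    ablated = ablate_outside_task_set(activations, task_minimal_idx) @ readout
    return bool(np.max(np.abs(full - ablated)) <= tol)

acts = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
kept = {0, 2}
ablated_acts = ablate_outside_task_set(acts, kept)
ok_when_task_ignores_feature_1 = task_preserved(acts, np.array([1.0, 0.0, 1.0]), kept)
ok_when_task_uses_feature_1 = task_preserved(acts, np.array([1.0, 1.0, 1.0]), kept)
```

The check passes only when the task readout does not depend on the ablated features, which is the toy version of the caveat above: the whole approach hinges on the task-minimal set being both sufficient for the task and free of the deceptive mechanism.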
There is probably a much better frame, but this is a new start to the "induction base case" in a completely separate invocation of the original apparatus. There are other permutations of spinning up new deferred tasks, such as earlier models applying Microscope AI on the quarantined model organism to gain insight on how we got to this point, without any direct attempt to remediate the issue, which as you note is "unlikely to get rid of it" despite one naive attempt above.
If your spinoff deferred tasks lead to the procurement of provable statements or heuristic arguments that the quarantined model cannot be minimized while "avoiding this kind of conditional, deceptive reasoning," abandon the original deferred task and switch to a different plan. Otherwise, restart the apparatus towards the original deferred task once you have the proof artifacts.
There are a lot of concerns you could raise with this additional structure but it seems like a distinct problem that requires a separate rebuttal rather than a hard stop fail? The obvious one is that these sorts of spinoff deferred tasks could be harder than the original task and consistently lead to the same failure mode, a la "exception thrown while handling previous exception."
Thank you, Larks! Salute. FYI, I am at least one person who has informally committed (see below) to take up this mantle. When would the next one typically be due?
https://twitter.com/robertzzk/status/1564830647344136192?s=20&t=efkN2WLf5Sbure_zSdyWUw
Inspecting code against a harm detection predicate seems recursive. What if the code or execution necessary to perform that inspection properly itself is harmful? An AGI is almost certainly a distributed system with no meaningful notion of global state, so I doubt this can be handwaved away.
For example, a lot of distributed database vendors, like Snowflake, do not offer a pre-execution query planner. This can only be performed just-in-time as the query runs or retroactively after it has completed, as the exact structure may be dependent on co-location of data and computation that is not apparent until the data referenced by the query is examined. Moreover, getting an accurate dry-run query plan may be as expensive as executing the query itself.
By analogy, for certain kinds of complex inspection procedures you envision, executing the inspection itself thoroughly enough to be reflective of the true execution risk may be as complex as the execution it inspects, and carry as great a risk of being harmful according to its values.
This was my thought exactly. Construct a robust satellite with the following properties.
Let a "physical computer" be defined as a processor powered by classical mechanics, e.g., through pulleys rather than transistors, so that it is robust to gamma rays, solar flares and EMP attacks, etc.
On the outside of the satellite, construct an onion layer of low-energy light-matter interacting material, such as alternating a coat of crystal silicon / CMOS with thin protective layers of steel, nanocarbon, or other hard material. When the device is constructed, ensure there are linings of Boolean physical input and output channels connecting the surface to the interior (like the proteins coating a membrane in a cell, except that the membrane will be solid rather than liquid), for example, through a jackhammer or moving rod mechanism. This will be activated through a buildup of the material on the outside of the artifact, effectively giving a time counter with arbitrary length time steps depending on how we set up the outer layer. Any possible erosion of the outside of the satellite (from space debris or collisions) will simply expose new layers of the "charging onion".
Inside the satellite, place a 3D printer constructed as a physical computer, together with a large supply of source material. For example, it might print in a metal or hard polymer, possibly with a supply of "boxes" in which to place the printed output. These will be the micro-comets launched as periodic payloads according to the timing device constructed on the surface. The 3D printer will fire according to an "input" event defined by the physical Boolean input, and may potentially be replicated multiple times within the hull in isolated compartments with separate sources of material, to increase reliability and provide failover in case of local failures of the surface layer.
The output of the 3D printer will be a replica of the micro-comet containing the message payload, funneled and ejected into an output chute where gravity will take over and handle the rest (this may potentially require a bit of momentum and direction aiming to kick off correctly, but some use of magnets here is probably sufficient). Alternatively, simply pre-construct the micro-comets and hope they stay intact, to be emitted at regular intervals like a gumball machine that fires once a century.
Finally, we compute a minimal set of orbits and trajectories over the continents and land areas likely to be most populated and ensure there is a micro-comet ejected regularly, say every 25-50 years. It is now easy to complete the argument by fiddling with the parameters and making some "Drake equation"-like assumptions about success rates, to say that any civilization with X% coverage of the landmass intersecting with the orbits of the comets will have > 25% likelihood of discovering a micro-comet payload.
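For what the "Drake equation"-like fiddling might look like, here is a rough sketch. It assumes each payload is an independent trial whose success probability is a product of three made-up factors (landmass coverage, payload survival, and chance of being noticed). The function names and the example numbers are illustrative assumptions, not estimates.

```python
import math

def discovery_probability(coverage, survival, detection, n_payloads):
    """P(at least one payload is discovered), treating payloads as
    independent trials with per-payload success = product of the factors."""
    per_payload = coverage * survival * detection
    return 1.0 - (1.0 - per_payload) ** n_payloads

def payloads_needed(coverage, survival, detection, target=0.25):
    """Smallest number of payloads whose discovery probability reaches target."""
    per_payload = coverage * survival * detection
    if not 0.0 < per_payload < 1.0:
        raise ValueError("per-payload success probability must be in (0, 1)")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - per_payload))

# Illustrative numbers only: 50% coverage, 50% survival, 4% chance of notice.
n = payloads_needed(0.5, 0.5, 0.04, target=0.25)
p_at_n = discovery_probability(0.5, 0.5, 0.04, n)
p_before = discovery_probability(0.5, 0.5, 0.04, n - 1)
```

With one payload every 25 years, the payload count translates directly into how many centuries the satellite has to keep working, which is where the choice of interval and the robustness of the timing layer start to matter.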
The only real problem with this approach is guaranteeing your satellites are not removed in the future in the event future descendants of our civilization disagree with this method. I don't see a solution to this other than through solving the value reflection problem, building a defense mechanism into the satellites that is certain to fail -- as you start getting close to the basic AI drive of self-preservation and will anyway be outsmarted by any future iteration of our civilization -- or making the satellites small or undetectable enough that finding and removing them is economically more pain than it is worth.
http://mathoverflow.net/questions/53122/mathematical-urban-legends
Another urban legend, which I've heard told about various mathematicians, and which Misha Polyak self-effacingly tells about himself (and therefore might even be true), is the following:
As a young postdoc, Misha was giving a talk at a prestigious US university about his new diagrammatic formula for a certain finite type invariant, which had 158 terms. A famous (but unnamed) mathematician was sitting, sleeping, in the front row. "Oh dear, he doesn't like my talk," thought Misha. But then, just as Misha's talk was coming to a close, the famous professor wakes with a start. Like a man possessed, the famous professor leaps up out of his chair, and cries, "By golly! That looks exactly like the Grothendieck-Riemann-Roch Theorem!!!" Misha didn't know what to say. Perhaps, in his sleep, this great professor had simplified Misha's 158 term diagrammatic formula for a topological invariant, and had discovered a deep mathematical connection with algebraic geometry? It was, after all, not impossible. Misha paced in front of the board silently, not knowing quite how to respond. Should he feign understanding, or admit his own ignorance? Finally, because the tension had become too great to bear, Misha asked in an undertone, "How so, sir?" "Well," explained the famous professor grandly. "There's a left hand side to your formula on the left." "Yes," agreed Misha meekly. "And a right hand side to your formula on the right." "Indeed," agreed Misha. "And you claim that they are equal!" concluded the great professor. "Just like the Grothendieck-Riemann-Roch Theorem!"