Bounding the space of controller actions more is the key bit. The (vague) claim is that if you have an argument that an empirically tested automated safety scheme is safe, in sense that you’ll know if the output is correct, you may be able to find a more constrained setup where more of the structure is human-defined and easier to analyze, and that the originally argument may port over to the constrained setup.

I’m not claiming this is always possible, though, just that it’s worth searching for. Currently the situation is that we don’t have well-developed arguments that we can recognize the correctness of automated safety work, so it’s hard to test the “less automation” hypothesis concretely.

I don’t think all automated research is automated safety: certainly you can do automated pure capabilities. But I may have misunderstood that part of the question.

Automation collapse

Geoffrey Irving1moΩ220

To clarify, such a jailbreak is a real jailbreak; the claim is that it might not count as much evidence of “intentional misalignment by a model”. If we’re happy to reject all models which can be jailbroken we’ve falsified the model, but if we want to allow models which can be jailbroken but are intent aligned we have a false negative signal for alignment.

Proveably Safe Self Driving Cars [Modulo Assumptions]

Geoffrey Irving1mo10

I would expect there to be a time complexity blowup if you try to drive the entropy all the way to zero, unfortunately: such things usually have a multiplier like where $ϵ$ is the desired entropy leakage. In practice I think that would make it feasible to not leak something like a bit per sentence, and then if you have 1000 sentence you have 1000 bits. That may mean you can get a "not 1GB" guarantee, but not something smaller than that.

Robert Caro And Mechanistic Models In Biography

Geoffrey Irving4mo10

I can confirm it’s very good!

Scalable oversight as a quantitative rather than qualitative problem

Geoffrey Irving4moΩ550

+1 to the quantitative story. I’ll do the stereotypical thing and add self-links: https://arxiv.org/abs/2311.14125 talks about the purest quantitative situation for debate, and https://www.alignmentforum.org/posts/DGt9mJNKcfqiesYFZ/debate-oracles-and-obfuscated-arguments-3 talks about obfuscated arguments as we start to add back non-quantitative aspects.

Debate, Oracles, and Obfuscated Arguments

Geoffrey Irving4moΩ330

Thank you!

I think my intuition is that weak obfuscated arguments occur often in the sense that it’s easy to construct examples where Alice thinks for a certain amount time and produces her best possible answer so far, but where she might know that further work would uncover better answers. This shows up for any task like “find me the best X”. But then for most such examples Bob can win if he gets to spend more resources, and then we can settle things by seeing if the answer flips based on who gets more resources.

What’s happening in the primality case is that there is an extremely wide gap between nothing and finding a prime factor. So somehow you have to show that this kind of wide gap only occurs along with extra structure that can be exploited.

Robert Caro And Mechanistic Models In Biography

Geoffrey Irving4mo00

We’re not disagreeing: by “covers only two people” I meant “has only two book series”, not “each book series covers literally a single person”.

Robert Caro And Mechanistic Models In Biography

Geoffrey Irving4mo30

Unfortunately all the positives of these books come paired with a critical flaw: Caro only manages to cover two people, and hasn’t even finished the second one!

Have you found other biographers who’ve reached a similar level? Maybe the closest I’ve found was “The Last Lion” by William Manchester, but it doesn’t really compare giving how much the author fawns over Churchill.

Non-Disparagement Canaries for OpenAI

Geoffrey Irving6mo14911

To be more explicit, I’m not under any nondisparagement agreement, nor have I ever been. I left OpenAI prior to my cliff, and have never had any vested OpenAI equity.

I am under a way more normal and time-bounded nonsolicit clause with Alphabet.