Chief Scientist at the UK AI Safety Institute (AISI). Previously, DeepMind, OpenAI, Google Brain, etc.
These buckets seem reasonable, and +1 to it being important that some of the resulting ideas are independent of debate. In particular on the inner alignment this exercise (1) made me think exploration hacking might be a larger fraction of the problem than I had thought before, which is encouraging as it might be tractable, but (2) there may be an opening for learning theory that tries to say something about residual error along the lines of https://x.com/geoffreyirving/status/1920554467558105454.
On the systematic human error front, we'll put out a short post on that soon (next week or so), but broadly the framing is to start with a computation which consults humans, and instead of assuming the humans are have unbiased error instead assume that the humans are wrong on some unknown ε-fraction of queries w.r.t. to some distribution. You can then try to change the debate protocol so that it detects if you can choose the ε-fraction to flip the answer, and report uncertainty in this case. This still requires you to make some big assumption about humans, but is a weaker assumption, and leads to specific ideas for protocol changes.
I do mostly think this requires whitebox methods to reliably solve, so it would be less "narrow down the search space" and more "search via a different parameterisation that takes advantage of whitebox knowledge".
You'd need some coupling argument to know that the problems have related difficulty, so that if A is constantly saying "I don't know" to other similar problems it counts as evidence that A can't reliably know the answer to this one. But to be clear, we don't know how to make this particular protocol go through, since we don't know how to formalise that kind of similarity assumption in a plausibly useful way. We do know a different protocol with better properties (coming soon).
I'm very much in agreement that this is a problem, and among other things blocks us from knowing how to use adversarial attack methods (and AISI teams!) from helping here. Your proposed definition feels like it might be an important part of the story but not the full story, though, since it's output only: I would unfortunately expect a decent probability of strong jailbreaks that (1) don't count as intent misalignment but (2) jump you into that kind of red attractor basin. Certainly ending up in that kind of basin could cause a catastrophe, and I would like to avoid it, but I think there is a meaningful notion of "the AI is unlikely to end up in that basin of its own accord, under nonadversarial distributions of inputs".
Have you seen good attempts at input-side definitions along those lines? Perhaps an ideal story here would be a combination of an input-side definition and the kind of output-side definition you're pointing at.
Yes, that is a clean alternative framing!
Bounding the space of controller actions more is the key bit. The (vague) claim is that if you have an argument that an empirically tested automated safety scheme is safe, in sense that you’ll know if the output is correct, you may be able to find a more constrained setup where more of the structure is human-defined and easier to analyze, and that the originally argument may port over to the constrained setup.
I’m not claiming this is always possible, though, just that it’s worth searching for. Currently the situation is that we don’t have well-developed arguments that we can recognize the correctness of automated safety work, so it’s hard to test the “less automation” hypothesis concretely.
I don’t think all automated research is automated safety: certainly you can do automated pure capabilities. But I may have misunderstood that part of the question.
To clarify, such a jailbreak is a real jailbreak; the claim is that it might not count as much evidence of “intentional misalignment by a model”. If we’re happy to reject all models which can be jailbroken we’ve falsified the model, but if we want to allow models which can be jailbroken but are intent aligned we have a false negative signal for alignment.
I would expect there to be a time complexity blowup if you try to drive the entropy all the way to zero, unfortunately: such things usually have a multiplier like where is the desired entropy leakage. In practice I think that would make it feasible to not leak something like a bit per sentence, and then if you have 1000 sentence you have 1000 bits. That may mean you can get a "not 1GB" guarantee, but not something smaller than that.
I can confirm it’s very good!
I broadly agree with these concerns. I think we can split it into (1) the general issue of AGI/ASI driving humans out of distribution and (2) the specific issue of how assumptions about human data quality as used in debate will break down. For (2), we'll have a short doc soon (next week or so) which is somewhat related, along the lines of "assume humans are right most of the time on a natural distribution, and search for protocols which report uncertainty if the distribution induced by a debate protocol on some new class of questions is sufficiently different". Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there's no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn't competitive.
One general note is that scalable oversight is a method for accelerating an intractable computation built out of tractable components, and these components can include both human and conventional software. So if you understand the domain somewhat well, you can try mitigate failures of (2) (and potentially gain more traction on (1)) by formalising part of the domain. And this formalisation can be bootstrapped: you can use on-distribution human data to check specifications, and then use those specifications (code, proofs, etc.) in order to rely on human queries for a smaller portion of the over next-stage computation. But generally this requires you to have some formal purchase on the philosophical aspects where humans are off distribution, which may be rough.