All of Geoffrey Irving's Comments + Replies

I do mostly think this requires whitebox methods to reliably solve, so it would be less "narrow down the search space" and more "search via a different parameterisation that takes advantage of whitebox knowledge".

You'd need some coupling argument to know that the problems have related difficulty, so that if A is constantly saying "I don't know" to other similar problems it counts as evidence that A can't reliably know the answer to this one. But to be clear, we don't know how to make this particular protocol go through, since we don't know how to formalise that kind of similarity assumption in a plausibly useful way. We do know a different protocol with better properties (coming soon).

I'm very much in agreement that this is a problem, and among other things it blocks us from knowing how to use adversarial attack methods (and AISI teams!) to help here. Your proposed definition feels like it might be an important part of the story but not the full story, though, since it's output-only: I would unfortunately expect a decent probability of strong jailbreaks that (1) don't count as intent misalignment but (2) jump you into that kind of red attractor basin. Certainly ending up in that kind of basin could cause a catastrophe, and I would lik... (read more)

Yes, that is a clean alternative framing!

Bounding the space of controller actions more is the key bit. The (vague) claim is that if you have an argument that an empirically tested automated safety scheme is safe, in the sense that you’ll know if the output is correct, you may be able to find a more constrained setup where more of the structure is human-defined and easier to analyze, and that the original argument may port over to the constrained setup.

I’m not claiming this is always possible, though, just that it’s worth searching for. Currently the situation is that we don’t have well-developed ar... (read more)

0[comment deleted]

To clarify, such a jailbreak is a real jailbreak; the claim is that it might not count as much evidence of “intentional misalignment by a model”. If we’re happy to reject all models which can be jailbroken we’ve falsified the model, but if we want to allow models which can be jailbroken but are intent aligned we have a false negative signal for alignment.

2Dave Orr
Aha, thanks, that makes sense.

I would expect there to be a time complexity blowup if you try to drive the entropy all the way to zero, unfortunately: such things usually have a multiplier like $O(1/\varepsilon)$ where $\varepsilon$ is the desired entropy leakage. In practice I think that would make it feasible to leak no more than something like a bit per sentence, and then if you have 1000 sentences you have 1000 bits. That may mean you can get a "not 1GB" guarantee, but not something smaller than that.
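To make the arithmetic concrete, here is a minimal sketch (assuming, as above, a hypothetical cost multiplier of roughly $1/\varepsilon$ for capping leakage at $\varepsilon$ bits per sentence; the numbers are illustrative, not measured):

```python
# Back-of-the-envelope for the leakage/cost tradeoff sketched above.
# Assumption (not an established result): capping leakage at `eps` bits per
# sentence costs a time multiplier of roughly 1/eps.

def total_leak_bits(eps: float, num_sentences: int) -> float:
    """Leaked bits accumulate roughly linearly across sentences."""
    return eps * num_sentences

def cost_multiplier(eps: float) -> float:
    """Hypothetical blowup from driving per-sentence leakage toward zero."""
    return 1.0 / eps

for eps in (1.0, 0.1, 0.01):
    print(f"eps={eps:5}: {total_leak_bits(eps, 1000):7.1f} bits over 1000 sentences, "
          f"cost ~{cost_multiplier(eps):7.1f}x")

# At eps = 1 bit/sentence you still leak ~1000 bits over 1000 sentences, enough
# for a "not 1GB" guarantee but not much tighter; eps -> 0 blows up the cost.
```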

+1 to the quantitative story. I’ll do the stereotypical thing and add self-links: https://arxiv.org/abs/2311.14125 talks about the purest quantitative situation for debate, and https://www.alignmentforum.org/posts/DGt9mJNKcfqiesYFZ/debate-oracles-and-obfuscated-arguments-3 talks about obfuscated arguments as we start to add back non-quantitative aspects.

Thank you!

I think my intuition is that weak obfuscated arguments occur often in the sense that it’s easy to construct examples where Alice thinks for a certain amount of time and produces her best possible answer so far, but where she might know that further work would uncover better answers. This shows up for any task like “find me the best X”. But then for most such examples Bob can win if he gets to spend more resources, and then we can settle things by seeing if the answer flips based on who gets more resources.
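As a toy illustration of the “does the answer flip with more resources” check (a sketch only; `best_x_found` is a hypothetical stand-in for any “find me the best X” search, not a real protocol implementation):

```python
import random

# Toy version of the resource-flip check: rerun the search with a larger budget
# and see whether the original best answer gets overturned.

def best_x_found(budget: int, seed: int) -> float:
    """Best value found after `budget` random probes (stand-in for Alice's or Bob's search)."""
    rng = random.Random(seed)
    return max(rng.random() for _ in range(budget))

def answer_flips(alice_budget: int, bob_budget: int, seed: int = 0) -> bool:
    """Does Bob, given more resources, find something better than Alice's answer?"""
    alice_answer = best_x_found(alice_budget, seed)
    bob_answer = best_x_found(bob_budget, seed + 1)
    return bob_answer > alice_answer

print(answer_flips(alice_budget=100, bob_budget=10_000))  # usually True: more search wins
```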

What’s happening in the primality case is th... (read more)

We’re not disagreeing: by “covers only two people” I meant “has only two book series”, not “each book series covers literally a single person”.

Unfortunately all the positives of these books come paired with a critical flaw: Caro only manages to cover two people, and hasn’t even finished the second one!

Have you found other biographers who’ve reached a similar level? Maybe the closest I’ve found was “The Last Lion” by William Manchester, but it doesn’t really compare given how much the author fawns over Churchill.

5Amalthea
I found Ezra Vogel's biography of Deng Xiaoping to be on a comparable level.
3adamShimi
In my view, Caro is actually less guilty of this than most biographers. Fundamentally, this is because he cares much more about power, its sources, and its effects on the wielders, beneficiaries, and victims. So even though the throughlines are the lives of Moses and Johnson, he spends a considerable amount of time on other topics which provide additional mechanistic models with which to understand power. Among others, I can think of:

* The deep model of the geology and psychology of the trap of the hill country that I mention in the post
* What is considered the best description of what it was like for women especially to do all their chores by hand in the hill country before Johnson brought them electricity
* Detailed models of various forms of political campaigning, the impact of the press and
* Detailed models of various forms of election stealing
* What is considered the best history of the Senate, what it was built for, with which mechanisms, how these became perverted, and how Johnson changed it and made it work
* Detailed model of the process that led to the civil rights movement and passage of the civil rights bills
* Detailed model of the hidden power and control of the utilities
* In general, many of Moses' schemes mentioned to force the legislature and the mayor to give him more funding and power

He even has one chapter in the last book that is considered on par with many of the best Kennedy biographies.

Still, you do have a point that even if we extend the range beyond the two men, Caro's books are quite bound to a highly specific period: mid-20th-century America. I think it's kind of a general consensus that finding something of a similar level is really hard. But in terms of mechanistic models, I did find "Waging A Good War" quite good. It explores the civil rights movement's successes and failures through the lens of military theory and strategy. (It does focus on the same period and locations as the Caro books though...)

To be more explicit, I’m not under any nondisparagement agreement, nor have I ever been. I left OpenAI prior to my cliff, and have never had any vested OpenAI equity.

I am under a way more normal and time-bounded nonsolicit clause with Alphabet.

I endorse Neel’s argument.

(Also see the more explicit comment above; apologies for trying to be cute. I do think I have already presented extensive evidence here.)

I certainly do think that debate is motivated by modeling agents as being optimized to increase their reward, and debate is an attempt at writing down a less hackable reward function.  But I also think RL can be sensibly described as trying to increase reward, and I generally don't understand the section of the document that says it obviously is not doing that.  And then if the RL algorithm is trying to increase reward, and there is a meta-learning phenomenon that causes agents to learn algorithms, then the agents will be trying to increase reward.

R... (read more)

4Rafael Harth
The post defending the claim is Reward is not the optimization target. Iirc, TurnTrout has described it as one of his most important posts on LW.

This is a great post!  Very nice to lay out the picture in more detail than LTSP and the previous LTP posts, and I like the observations about the trickiness of the n-way assumption.

I also like the "Is it time to give up?" section.  Though I share your view that it's hard to get around the fundamental issue: if we imagine interpretability tools telling us what the model is thinking, and assume that some of the content that must be communicated is statistical, I don't see how that communication doesn't need some simplifying assumption to be interp... (read more)

4carboniferous_umbraculum
Thanks very much Geoffrey; glad you liked the post. And thanks for the interesting extra remarks.