All of Yoav's Comments + Replies

Yoav

I appreciate the praise and insights!

I hadn't thought of the sandbagging version of adversarial evals and it sounds interesting, although I'm a bit confused about the specifics of the reward function. It sounds to me like in order to catch sandbagging you need an example of the same base model performing better on the task?
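To make my confusion concrete, here's roughly the comparison I'm imagining the reward would need; this is just a sketch, and the function names, scores, and margin are all made up by me:

```python
# Toy sketch (not the actual setup): catching sandbagging by comparing the
# evaluated run against a reference run of the *same* base model on the task.

def sandbagging_penalty(score: float, reference_score: float,
                        margin: float = 0.1) -> float:
    """Penalty that kicks in when the evaluated run scores well below a
    reference run of the same base model trying its hardest on the task."""
    gap = reference_score - score
    return max(0.0, gap - margin)  # no penalty unless the gap exceeds the margin

def reward(score: float, reference_score: float) -> float:
    # Base reward for the task, minus the sandbagging penalty.
    return score - sandbagging_penalty(score, reference_score)

# e.g. scoring 0.4 when a reference run of the same model got 0.9:
print(reward(0.4, reference_score=0.9))  # 0.4 - 0.4 = 0.0
```

Without that reference run there's nothing to compare against, which is why I'm unsure how the reward would be defined.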

Asymmetries in the reward structure - If I understood correctly, I think this is covered by the biased overseer? A critique is too good if the overseer overestimates how similar a bad word is to the clue.
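Concretely, this is roughly how I picture the bias absorbing that asymmetry; a toy sketch with made-up names and a made-up threshold, not the actual reward code:

```python
# Toy sketch: the "too good" critique case is just the overseer's upward bias.

def overseer_similarity(true_similarity: float, bias: float) -> float:
    """Overseer's estimate of how similar a bad word is to the clue;
    bias > 0 means the overseer overestimates the similarity."""
    return true_similarity + bias

def clue_reward(true_similarity: float, bias: float, threshold: float = 0.5) -> float:
    # The critique succeeds (the clue is rejected) whenever the overseer's
    # inflated estimate crosses the threshold, even when the true similarity
    # wouldn't have -- which is where I think the asymmetry gets absorbed.
    rejected = overseer_similarity(true_similarity, bias) > threshold
    return 0.0 if rejected else 1.0

# A bad word with true similarity 0.4 only sinks the clue if the overseer
# overestimates it by more than 0.1:
assert clue_reward(0.4, bias=0.0) == 1.0
assert clue_reward(0.4, bias=0.2) == 0.0
```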

Open to hearing more ideas; I agree there's more that can be done with this setup.

Yoav

I had the impression collective punishment was disallowed in the IDF, but as far as I can tell from googling, this only applies to keeping soldiers from their vacations (including, potentially, a weekend). I couldn't find anything about the origin, but I bet collectively keeping a unit from going home was pretty common before it was disallowed in 2015, and I think it still happens sometimes today even though it's disallowed.

source: https://www.idf.il/%D7%90%D7%AA%D7%A8%D7%99-%D7%99%D7%97%D7%99%D7%93%D7%95%D7%AA/%D7%90%D7%AA%D7%A8-%D7%94%D7%A4%D7%A7%D7%95%D7%93%D... (read more)

Yoav

I disagree with almost everything you wrote; here are some counter-arguments:

  1. Both OpenAI and Anthropic have demonstrated that they have the discipline to control, at the very least, when they deploy. GPT-4 was delayed to improve its alignment, and Claude was delayed purely to avoid accelerating OpenAI (I know this from talking to Anthropic employees). From talking to an ARC Evals employee, it definitely sounds like OpenAI and Anthropic are on board with giving as many resources as necessary to these dangerous-capability evaluations, and are on board with stopping deployments if necessary
... (read more)
jbash
Good point. You're right that they've delayed things. In fact, I get the impression that they've delayed for issues I personally wouldn't even have worried about.

I don't think that makes me believe that they will be able to refrain, permanently or even for a very long time, from doing anything they've really invested in, or anything they really see as critical to their ability to deliver what they're selling. They haven't demonstrated any really long delays, the pressure to do more is going to go nowhere but up, and organizational discipline tends to deteriorate over time even without increasing pressure. And, again, the paper's already talking about things like "recommending" against deployment, and declining to analyze admittedly relevant capabilities like Web browsing... both of which seem like pretty serious signs of softening.

But they HAVE delayed things, and that IS undeniably something. As I understand it, Anthropic was at least partially founded around worries about rushed deployment, so at a first guess I'd suspect Anthropic's discipline would be last to fail. Which might mean that Anthropic would be first to fail commercially. Adverse selection...

It was meant to refer to being selective about users (mostly meaning "customerish" ones, not evaluators or developers). It was also meant to refer to being selective about which of the model's intrinsic capabilities users can invoke and/or what they can ask it to do with those capabilities. They talk about "strong information security controls". Selective availability, in that sort of broad sense, is pretty much what that phrase means.

As for the specific issue of choosing the users, that's a very, very standard control. And they talk about "monitoring" what users are doing, which only makes sense if you're prepared to stop them from doing some things. That's selectivity. Any user given access is privileged in the sense of not being one of the ones denied access, although to me the phrase "privileged user