LESSWRONG
LW

Yoav — LessWrong

Replying toClaude 4.5 Opus' Soul Document

I got Opus to translate sections to Hebrew (from memory), and found it really interesting what details it modified/dropped. For instance it never outputted anything about senior Anthropic employees, I find it plausible it didn't internalize that very strongly.

Would be a cool experiment to run this many times in different languages, maybe get sonnet to compare to the original and highlight more/less consistent points.

Replying toHow much novel security-critical infrastructure do you need during the singularity?

Yoav6mo

How much novel security-critical infrastructure do you need during the singularity?

For example, instead of working on separate computers, they might want to work as different users (or the same user!) on the same Linux server.

Oh god... hope you're wrong about this providing a significant efficiency boost. If working on the same server is viable then presumably there's a pretty homogenous software stack, which would also mean docker images can often be reused?

The AI company builds new security-critical infra really fast for the sake of security

This is basically in our SL5 recommendations, with the assumption that the model you'd use for this didn't merit being trained in an SL5 data center and so is not a catastrophic risk. But maybe it could already... (read more)

Replying toAgents lag behind AI 2027's schedule

Yoav7mo

Agents lag behind AI 2027's schedule

SWEBench Verified - Anthropic claims 80.2% with parallel test time compute, I think we should count it?

Replying toHelp the AI 2027 team make an online AGI wargame

Yoav8mo

Help the AI 2027 team make an online AGI wargame

I would like to register that I think this is very vibe-codeable given the right tools, and if there's an interested game designer that doesn't have full stack experience I would be willing to help them get up to speed on vibe coding best practices for this project.

-2

Replying toOrienting Toward Wizard Power

Yoav9mo

Orienting Toward Wizard Power

Hey this sounds a bit like Arbor Summer Camp. Yeah this is a beautiful vision

Replying toEvaluating Oversight Robustness with Incentivized Reward Hacking

Yoav9mo

Evaluating Oversight Robustness with Incentivized Reward Hacking

I appreciate the praise and insights!

I hadn't thought of the sandbagging version of adversarial evals and it sounds interesting, although I'm a bit confused about the specifics of the reward function. It sounds to me like in order to catch sandbagging you need an example of the same base model performing better on the task?

Asymmetries in the reward structure - If I understood correctly, I think this is covered by the biased overseer? A critique is too good if the overseer overestimates how similar a bad word is to the clue.

Open to hearing more ideas, I agree there's more than can be done with this set up

Evaluating Oversight Robustness with Incentivized Reward Hacking

Yoav

Yoav, Juan V, julianjm, McKennaFitzgerald

10mo

This work was supported by the Effective Ventures Foundation USA through their EA Funds program. It was started as part of the MATS (ML Alignment and Theory Scholars) program, with mentorship from Julian Michael and research management by McKenna Fitzgerald.

Code for this project is available on GitHub. Explore samples from the different training runs at our interactive website.

Introduction

Scalable oversight research paradigm

Scalable oversight is the challenge of overseeing or training AI systems to achieve goals that are hard for humans to specify. By definition, studying scalable oversight directly on the tasks we care about is infeasible, because we would not know if the goals were achieved correctly. So instead we use sandwiching — using... (read 4426 more words →)

Replying toWhen Hindsight Isn't 20/20: Incentive Design With Imperfect Credit Allocation

Yoav1y

When Hindsight Isn't 20/20: Incentive Design With Imperfect Credit Allocation

I had the impression collective punishment was disallowed in the IDF, but as far as I can tell by googling this only applies to keeping soldiers from their vacations (including potentially a weekend). I couldn't find anything about the origin but I bet collectively keeping a unit from going home was pretty common before it was disallowed in 2015, and I think it still happens today sometimes even though it's disallowed.

source: https://www.idf.il/%D7%90%D7%AA%D7%A8%D7%99-%D7%99%D7%97%D7%99%D7%93%D7%95%D7%AA/%D7%90%D7%AA%D7%A8-%D7%94%D7%A4%D7%A7%D7%95%D7%93%D7%95%D7%AA/%D7%A4%D7%A7%D7%95%D7%93%D7%95%D7%AA-%D7%9E%D7%98%D7%9B-%D7%9C/%D7%9E%D7%A9%D7%98%D7%A8-%D7%95%D7%9E%D7%A9%D7%9E%D7%A2%D7%AA-33/%D7%A9%D7%99%D7%A4%D7%95%D7%98-03/%D7%9E%D7%A0%D7%99%D7%A2%D7%AA-%D7%97%D7%95%D7%A4%D7%A9%D7%94-33-0352/

Replying toDeepMind: Model evaluation for extreme risks

Yoav3y

DeepMind: Model evaluation for extreme risks

I disagree with almost everything you wrote, here are some counter-arguments:

Both OpenAI and Anthropic have demonstrated that they have discipline to control at least when they deploy. GPT-4 was delayed to improve its alignment, and Claude was delayed purely to avoid accelerating OpenAI (I know this from talking to Anthropic employees). From talking to an ARC Evals employee, it definitely sounds like OpenAI and Anthropic are on board with giving as many resources as necessary to these dangerous evaluations, and are on board with stopping deployments if necessary.
I'm unsure if 'selectively' refers to privileged users, or the evaluators themselves. My understanding is that if the evaluators find the model dangerous, then no

... (read 361 more words →)