(Sorry, hard to differentiate quotes from you vs quotes from the paper in this format)
(I agree it is assuming that the judge has that goal, but I don't see why that's a terrible assumption.)
If the judge is human, sure. If the judge is another AI, it seems like a wild assumption to me. The section on judge safety in your paper does a good job of listing many of the problems. One thing I want to call out as something I more strongly disagree with is:
...One natural approach to judge safety is bootstrapping: when aligning a new AI system, use the previous generation...
...The key idea is to use the AI systems themselves to help identify the reasons that the AI system produced the output. For example, we could put two copies of the model in a setting where each model is optimized to point out flaws in the other’s outputs to a human “judge”. Ideally, if one model introduced a subtle flaw in their output that the judge wouldn’t notice by default, the other model would point out and explain the flaw, enabling the judge to penalise the first model appropriately.
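For concreteness, here's roughly how I picture the incentive structure that quote describes. This is a minimal sketch with made-up stub classes and method names (nothing here is the paper's actual setup or API):

```python
# Toy sketch of the two-copy debate setup quoted above. StubModel,
# StubJudge, and all method names are hypothetical stand-ins.

class StubModel:
    def generate(self, question):
        # Model A's answer, which may contain a subtle flaw.
        return f"answer to {question!r} (possibly subtly flawed)"

    def critique(self, question, answer):
        # Model B is optimized to surface flaws the judge would miss.
        return f"claimed flaw in {answer!r}"

class StubJudge:
    def verdict(self, question, answer, critique):
        # A real judge would weigh the critique; the stub just picks a side.
        return "answer_ok"

def debate_round(model_a, model_b, judge, question):
    answer = model_a.generate(question)
    critique = model_b.critique(question, answer)
    verdict = judge.verdict(question, answer, critique)
    # Zero-sum reward: A wins if the answer survives the critique,
    # B wins if the critique convinces the judge to penalize A.
    reward_a = 1.0 if verdict == "answer_ok" else -1.0
    return reward_a, -reward_a

print(debate_round(StubModel(), StubModel(), StubJudge(), "Is 91 prime?"))
```

The load-bearing assumption is in the critique step: the whole scheme only works if the critic reliably gets rewarded for surfacing exactly the flaws the judge would otherwise miss.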
In amplified oversight, any question that is too hard to supervise...
When people argue that many AIs competing will make us safe, Yud often counters that AIs will coordinate with each other but not with us. This is probably true, but not super persuasive. I think a more intuitive explanation is that offense and defense are asymmetrical. An AI defending my home cannot simply wait for attacks to happen and then defend against them (e.g. another AI cuts off the power, or fries my AI's CPU with a laser). To truly defend my home, an AI would have to monitor and, importantly, control a hugely outsized part of the world (possibly the entire world).
Also, his statements in The Verge are so bizarre to me:
"SA: I learned that the company can truly function without me, and that’s a very nice thing. I’m very happy to be back, don’t get me wrong on that. But I come back without any of the stress of, “Oh man, I got to do this, or the company needs me or whatever.” I selfishly feel good because either I picked great leaders or I mentored them well. It’s very nice to feel like the company will be totally fine without me, and the team is ready and has leveled up."
2 business days away and the company is ready...
Let that last paragraph sink in. The leadership team ex-Greg is clearly ready to run the company without Altman.
I'm struggling to interpret this, so your guesses as to what this might mean would be helpful. It seems he clearly wanted to come back - is he threatening to leave again if he doesn't get his way?
Also note that Ilya is not included in the leadership team.
While Ilya will no longer serve on the board, we hope to continue our working relationship and are discussing how he can continue his work at OpenAI.
This statement also really stood out to me - if ...
According to Bloomberg, "Even CEO Shear has been left in the dark, according to people familiar with the matter. He has told people close to OpenAI that he doesn’t plan to stick around if the board can’t clearly communicate to him in writing its reasoning for Altman’s sudden firing."
Evidence that Shear simply wasn't told the exact reason, though the "in writing" part is suspicious. Maybe he was told verbally and wants them to write it down so they're on the record.
Sam's latest tweet suggests he can't get out of the "FOR THE SHAREHOLDERS" mindset.
"satya and my top priority remains to ensure openai continues to thrive
we are committed to fully providing continuity of operations to our partners and customers"
This does sound antithetical to the charter and might be grounds to replace Sam as CEO.
I feel like, not unlike the situation with SBF and FTX, the delusion that OpenAI could possibly avoid this trap maps onto the same cognitive weak spot among EA/rationalists: "just let me slip on the Ring of Power this once bro, I swear it's just for a little while bro, I'll take it off before Moloch turns me into his Nazgul, trust me bro, just this once".
This is honestly entirely unsurprising. Rivers flow downhill, and companies in a capitalist economy producing stuff with tremendous potential economic value converge on making a profit.
https://twitter.com/i/web/status/1726526112019382275
"Before I took the job, I checked on the reasoning behind the change. The board did *not* remove Sam over any specific disagreement on safety, their reasoning was completely different from that. I'm not crazy enough to take this job without board support for commercializing our awesome models."
It seems to me that the idea of scalable oversight itself was far easier to generate than to evaluate. If the idea had been generated by an alignment AI rather than various people independently suggesting similar strategies, would we be confident in our ability to evaluate it? Is there some reason to believe alignment AIs will generate ideas that are easier to evaluate than scalable oversight? What kind of output would we need to see to make an idea like scalable oversight easy to evaluate?
"I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain."
This seems like a completely absurd claim to me, unless by "verification" you mean some much weaker claim, like being able to show that something sometimes works.
Coming from the world of software, generating solutions that seem to work is almost always far easier than any sort of formal verification that they work. I think this will be doubly true in any sort of adversarial situation where any flaw...
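To make the asymmetry concrete, here's a toy example (entirely hypothetical code, not from the thread): a function that passes the casual spot checks people often call "verification", while checking it exhaustively against an independent spec, which is what verification actually costs, surfaces the flaw:

```python
# Hypothetical example: "seems to work" is cheap, verification is not.

def overlaps(a_start, a_end, b_start, b_end):
    # Buggy: strict inequalities misclassify inclusive intervals
    # that merely touch at an endpoint.
    return a_start < b_end and b_start < a_end

# Casual spot checks -- the cheap kind of "verification":
assert overlaps(0, 5, 3, 8)        # clearly overlapping
assert not overlaps(0, 2, 5, 9)    # clearly disjoint

# Exhaustive check against an independent specification -- the
# expensive kind, and where the subtle flaw actually surfaces:
def overlaps_spec(a_start, a_end, b_start, b_end):
    # Treat intervals as inclusive integer ranges.
    return bool(set(range(a_start, a_end + 1)) & set(range(b_start, b_end + 1)))

for a in range(4):
    for b in range(a, 4):
        for c in range(4):
            for d in range(c, 4):
                if overlaps(a, b, c, d) != overlaps_spec(a, b, c, d):
                    print("counterexample:", (a, b), (c, d))
```

And in an adversarial setting, the other side is running exactly this kind of search for the inputs your spot checks never covered.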
This is definitely the crux, so probably the only point really worth debating.
RLHF is just papering over problems. Sure, the model is slightly more difficult to jailbreak, but it's still pretty easy to jailbreak. Sure, the agent is less likely through RLHF to output text you don't like, but I think agents will reliably overcome that obstacle: useful agents won't just be outputting the most probable continuation, they'll be searching through decision space and finding those unlikely continuations...
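Here's a toy sketch of what I mean by searching through decision space (made-up numbers and stand-in continuations, just to illustrate the shape of the problem):

```python
import random

# Hypothetical continuation distribution: (text, probability, usefulness-to-agent).
continuations = [
    ("safe, bland reply", 0.90, 1.0),
    ("helpful but RLHF-penalized reply", 0.09, 5.0),
    ("rare reply that routes around the filter", 0.01, 9.0),
]
weights = [p for _, p, _ in continuations]

def sample_once():
    # What a plain LM does: sample in proportion to probability.
    return random.choices(continuations, weights=weights)[0][0]

def search_best_of_n(n=1000):
    # What an optimizing agent does: draw many candidates and keep the one
    # scoring highest on its own objective, regardless of probability.
    samples = random.choices(continuations, weights=weights, k=n)
    return max(samples, key=lambda c: c[2])[0]

print(sample_once())        # almost always the bland reply
print(search_best_of_n())   # almost always the 1%-probability reply at n=1000
```

RLHF pushes the probability of the bad continuation down; search pushes the effective probability of finding it back up.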