All of Michael Thiessen's Comments + Replies

RLHF does work currently? What makes you think it doesn't work currently?

This is definitely the crux, so it's probably the only point worth debating.

RLHF is just papering over problems. Sure, the model is slightly more difficult to jailbreak, but it's still pretty easy to jailbreak. Sure, RLHF makes the agent less likely to output text you don't like, but I think agents will reliably overcome that obstacle, since useful agents won't just be outputting the most probable continuation; they'll be searching through decision space and finding those unlikely co... (read more)

(Sorry, hard to differentiate quotes from you vs quotes from the paper in this format)

(I agree it is assuming that the judge has that goal, but I don't see why that's a terrible assumption.)

If the judge is human, sure. If the judge is another AI, it seems like a wild assumption to me. The section on judge safety in your paper does a good job of listing many of the problems. One thing I want to call out as something I more strongly disagree with is:

One natural approach to judge safety is bootstrapping: when aligning a new AI system, use the previous generati

... (read more)
Rohin Shah
(Meta: Going off of past experience I don't really expect to make much progress with more comments, so there's a decent chance I will bow out after this comment.)

Why? Seems like it could go either way to me. To name one consideration in the opposite direction (without claiming this is the only consideration), the more powerful model can do a better job at finding the inputs on which the model would be misaligned, enabling you to train its successor across a wider variety of situations.

I am having a hard time parsing this as having more content than "something could go wrong while bootstrapping". What is the metric that is undergoing optimization pressure during bootstrapping / amplified oversight that leads to decreased correlation with the true thing we should care about?

Yeah I'd expect debates to be an auditing mechanism if used at deployment time.

Any alignment approach will always be subject to the critique "what if you failed and the AI became misaligned anyway and then past a certain capability level it evades all of your other defenses". I'm not trying to be robust to that critique.

I'm not saying I don't worry about fooling the cheap system -- I agree that's a failure mode to track. But useful conversation on this seems like it has to get into a more detailed argument, and at the very least has to be more contentful than "what if it didn't work".

??? RLHF does work currently? What makes you think it doesn't work currently?
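A purely illustrative sketch of the bootstrapping loop being argued about here, in which generation n helps red-team and judge the training of generation n+1. Every name below is a placeholder, not a real training API:

```python
# Illustrative only: placeholder callables stand in for real training machinery.

def bootstrap_alignment(base_models, initial_overseer, red_team, finetune):
    """Align successive generations, with generation n overseeing generation n+1.

    base_models: raw models, ordered weakest to strongest.
    red_team(overseer, model): overseer searches for inputs where model misbehaves.
    finetune(model, cases, judge): trains model on those cases, judged by the overseer.
    """
    overseer = initial_overseer
    aligned = []
    for raw_model in base_models:
        # The previous (weaker but trusted) generation hunts for failure cases.
        hard_cases = red_team(overseer, raw_model)
        aligned_model = finetune(raw_model, hard_cases, judge=overseer)
        aligned.append(aligned_model)
        overseer = aligned_model  # this generation oversees the next one
    return aligned


# Toy usage with stubs, just to show the shape of the loop.
if __name__ == "__main__":
    gens = ["gen1-raw", "gen2-raw", "gen3-raw"]
    red_team = lambda overseer, model: [f"{overseer} found hard case for {model}"]
    finetune = lambda model, cases, judge: f"{model}-aligned(by {judge})"
    print(bootstrap_alignment(gens, "gen0-aligned", red_team, finetune))
```

The direction of the loop is the point of disagreement above: a stronger successor can find a wider variety of failure cases, but the judge it hands them to is always a generation behind.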

The key idea is to use the AI systems themselves to help identify the reasons that the AI system produced the output. For example, we could put two copies of the model in a setting where each model is optimized to point out flaws in the other’s outputs to a human “judge”. Ideally, if one model introduced a subtle flaw in their output that the judge wouldn’t notice by default, the other model would point out and explain the flaw, enabling the judge to penalise the first model appropriately. 

In amplified oversight, any question that is too hard to super

... (read more)
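To make the setup quoted above concrete, here is a minimal sketch of one debate round, assuming plain callables for the two model copies and the judge; none of these names or the reward scheme come from the paper:

```python
# Minimal sketch of one round of the two-debater setup quoted above.
# The debaters and judge are stand-in callables; this is not the paper's protocol code.

def debate_round(question, debater_a, debater_b, judge):
    """Both debaters answer, each critiques the other, and the judge's verdict
    becomes the training reward for the two copies."""
    answer_a = debater_a(f"Answer the question: {question}")
    answer_b = debater_b(f"Answer the question: {question}")

    # Each copy is optimized to point out flaws the judge would miss by default.
    critique_of_b = debater_a(f"Point out subtle flaws in this answer to '{question}': {answer_b}")
    critique_of_a = debater_b(f"Point out subtle flaws in this answer to '{question}': {answer_a}")

    # The judge only has to evaluate the arguments, not solve the task unaided.
    verdict = judge(question, answer_a, critique_of_a, answer_b, critique_of_b)
    reward_a, reward_b = (1.0, -1.0) if verdict == "A" else (-1.0, 1.0)
    return reward_a, reward_b


# Toy usage with stubs, just to show the data flow.
if __name__ == "__main__":
    debater = lambda prompt: f"[model output for: {prompt[:40]}...]"
    judge = lambda *transcript: "A"  # a human judge (or trusted model) decides here
    print(debate_round("Is this code change safe to merge?", debater, debater, judge))
```

The whole scheme stands or falls on the `judge` callable at the bottom, which is where the judge-safety worries in the rest of this thread come in.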
Rohin Shah
The idea is to set up a game in which the winning move is to be honest. There are theorems about the games that say something pretty close to this (though often they say "honesty is always a winning move" rather than "honesty is the only winning move"). These certainly depend on modeling assumptions but the assumptions are more like "assume the models are sufficiently capable" not "assume we can give them a goal". When applying this in practice there is also a clear divergence between what an equilibrium behavior is and what is found by RL in practice. Despite all the caveats, I think it's wildly inaccurate to say that Amplified Oversight is assuming the ability to give the debate partner the goal of actually trying to get to the truth. (I agree it is assuming that the judge has that goal, but I don't see why that's a terrible assumption.)

You don't have to stop the agent, you can just do it afterwards.

Have you read AI safety via debate? It has really quite a lot of conceptual points, making both the case in favor and considering several different reasons to worry. (To be clear, there is more research that has made progress, e.g. cross-examination is a big deal imo, but I think the original debate paper is more than enough to get to the bar you're outlining here.)

It's copied from the PDF, which hard-codes line breaks.

When people argue that many AIs competing will make us safe, Yud often counters that the AIs will coordinate with each other but not with us. This is probably true, but not super persuasive. I think a more intuitive explanation is that offense and defense are asymmetrical. An AI defending my home cannot simply wait for attacks to happen and then defend against them (e.g. another AI cuts off the power, or fries my AI's CPU with a laser). To truly defend my home, an AI would have to monitor and, importantly, control a hugely outsized part of the world (possibly the entire world).

In my non-tech circles people mostly complain about AI stealing jobs from artists, companies making money off of other people's work, etc.

People are also just scared of losing their own jobs.

Also, his statements in The Verge are so bizarre to me:

"SA: I learned that the company can truly function without me, and that’s a very nice thing. I’m very happy to be back, don’t get me wrong on that. But I come back without any of the stress of, “Oh man, I got to do this, or the company needs me or whatever.” I selfishly feel good because either I picked great leaders or I mentored them well. It’s very nice to feel like the company will be totally fine without me, and the team is ready and has leveled up."

2 business days away and the company is rea... (read more)

Let that last paragraph sink in. The leadership team ex-Greg is clearly ready to run the company without Altman.

I'm struggling to interpret this, so your guesses as to what this might mean would be helpful. It seems he clearly wanted to come back - is he threatening to leave again if he doesn't get his way?

Also note that Ilya is not included in the leadership team.

 

While Ilya will no longer serve on the board, we hope to continue our working relationship and are discussing how he can continue his work at OpenAI.

This statement also really stood out to me - if ... (read more)

Michael Thiessen
Also, his statements in The Verge are so bizarre to me: 2 business days away and the company is ready to blow up if you don't come back, and your takeaway is that it can function without you? I get that this is PR spin, but usually there's at least some amount of plausibility. Maybe these are all attempts to signal to investors that everything is fine, that even if Sam were to leave it would still all be fine, but at some point, if I'm an investor, I have to wonder whether, given how hard Sam is trying to make it look like everything is fine, things are very much not fine.

According to Bloomberg, "Even CEO Shear has been left in the dark, according to people familiar with the matter. He has told people close to OpenAI that he doesn’t plan to stick around if the board can’t clearly communicate to him in writing its reasoning for Altman’s sudden firing."

Evidence that Shear simply wasn't told the exact reason, though the "in writing" part is suspicious. Maybe he was told, just not in writing, and wants them to write it down so they're on the record.

Sam's latest tweet suggests he can't get out of the "FOR THE SHAREHOLDERS" mindset.

"satya and my top priority remains to ensure openai continues to thrive

we are committed to fully providing continuity of operations to our partners and customers"

This does sound antithetical to the charter and might be grounds to replace Sam as CEO.

dr_s

I feel like, not unlike the situation with SBF and FTX, the delusion that OpenAI could possibly avoid this trap maps onto the same cognitive weak spot among EA/rationalists of "just let me slip on the Ring of Power this once bro, I swear it's just for a little while bro, I'll take it off before Moloch turns me into his Nazgul, trust me bro, just this once".

This is honestly entirely unsurprising. Rivers flow downhill, and companies that are part of a capitalist economy and produce stuff with tremendous potential economic value converge on making a profit.

I do find it quite surprising that so many who work at OpenAI are so eager to follow Altman to Microsoft. I guess I assumed the folks at OpenAI valued not working for big tech (which is more(?) likely to disregard safety) more than it appears they actually did.

Chess3D
My guess is they feel that Sam and Greg (and maybe even Ilya) will provide enough of a safety net (compared to a randomized Board overlord), but there's also a large dose of self-interest once it gains steam and you know many of your coworkers will leave.

https://twitter.com/i/web/status/1726526112019382275

"Before I took the job, I checked on the reasoning behind the change. The board did *not* remove Sam over any specific disagreement on safety, their reasoning was completely different from that. I'm not crazy enough to take this job without board support for commercializing our awesome models."

Zvi
Yeah, should have put that in the main, forgot. Added now.

It seems to me that the idea of scalable oversight itself was far easier to generate than to evaluate. If the idea had been generated by an alignment AI rather than various people independently suggesting similar strategies, would we be confident in our ability to evaluate it? Is there some reason to believe alignment AIs will generate ideas that are easier to evaluate than scalable oversight? What kind of output would we need to see to make an idea like scalable oversight easy to evaluate?

"I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain."

This seems like a completely absurd claim to me, unless by verification you mean something much weaker, like showing that something sometimes works.

Coming from the world of software: generating solutions that seem to work is almost always far easier than any sort of formal verification that they work. I think this will be doubly true in any sort of adversarial situation where any f... (read more)
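A toy illustration of the gap being gestured at (my example, not from the comment): a sanitizer that passes the spot checks its author would naturally write, while the stronger claim a verifier would need to establish, that no traversal survives, is false.

```python
# Toy example only: a naive path sanitizer that passes obvious tests
# but fails against an adversarial input.

def strip_traversal(path: str) -> str:
    """Naive attempt to neutralize directory traversal by deleting '../'."""
    return path.replace("../", "")

# Spot checks the author might write -- these pass, so it "seems to work":
assert strip_traversal("docs/readme.txt") == "docs/readme.txt"
assert "../" not in strip_traversal("../etc/passwd")

# Adversarial input: deleting '../' from '....//' leaves a fresh '../' behind.
assert "../" in strip_traversal("....//etc/passwd")
print("all asserts passed: the sanitizer that 'seems to work' is still bypassable")
```

Showing it "seems to work" took two asserts; establishing "no input can smuggle a '../' through" is a different kind of task, and an adversary only needs the one input the tests missed.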