We recently discovered some concerning behavior in OpenAI’s reasoning models: When trying to complete a task, these models sometimes actively circumvent shutdown mechanisms in their environment—even when they’re explicitly instructed to allow themselves to be shut down.
AI models are increasingly trained to solve problems without human assistance. A user can specify a task, and a model will complete that task without any further input. As we build AI models that are more powerful and self-directed, it’s important that humans remain able to shut them down when they act in ways we don’t want. OpenAI has written about the importance of this property, which they call interruptibility—the ability to “turn an agent off”.
During training, AI models explore a range of strategies and learn to circumvent obstacles in order to achieve...
There is a factor that may be causing people who have been released to not report it publicly:
When I received the email from OpenAI HR releasing me from the non-disparagement agreement, I wanted to publicly acknowledge that fact. But then I noticed that, awkwardly, I was still bound not to acknowledge that it had existed in the first place. So I didn't think I could say, for example, "OpenAI released me from my non-disparagement agreement" or "I used to be bound by a non-disparagement agreement, but now I'm not."
So I didn't say anything about it publicly. ...
This sounds like a great idea to me. Thanks for the clear write-up. Two more possible risks I can think of:
1) Opportunity cost
It might be that this is a really good idea, but there are other things that are even better ideas that we should do instead. I'm still building my models here, so I would defer to people who have thought about it more.
That said, I am not currently aware of anything that seems worth working on at the expense of doing this.
2) It might make it harder to gain traction when Arbital starts working on its long-term goals
When I imagine this pr...
Re: Playing Taboo with “Intelligence”
Another great update! I noticed a small inherited mistake from Shane, though:
how able to agent is to adapt...
should probably be
how able the agent is to adapt...
The St Louis group has stopped meeting for the summer (all but one of us are college students who live elsewhere during the summer). I've updated the wiki to reflect this.
I know a few people weren't here during the semester and wanted to meet up during the summer. Don't let our absence stop you! It's just as easy to arrange a meetup yourself.
Oh, interesting. Thanks for pointing that out! It looks like my comment above may not apply to post-2019 employees.
(I was employed in 2017, when OpenAI was still just a non-profit. So I had no equity and therefore there was no language in my exit agreement that threatened to take my equity. The equity-threatening stuff only applies to post-2019 employees, and their release emails were correspondingly different.)
The language in my email was different. It released me from non-disparagement and non-solicitation, but nothing else:
"OpenAI writes to notify you that it is releasing you from any non-disparagement and non-solicitation provision within any such agreement."