He does not have a good plan for alignment, but he is far less confused about this fact than most others in similar positions.
Yes, he seems like a great guy, but he doesn't just come across as not having a good plan; he comes across as them being completely disconnected from having a plan or doing much of anything.
JS: If AGI came way sooner than expected we would definitely want to be careful about it.
DP: What would being careful mean? Presumably you're already careful, right?
And yes, aren't they already being careful? Well, it sounds like no.
JS: Maybe it means not training the even smarter version or being really careful when you do train it. You can make sure it’s properly sandboxed and everything. Maybe it means not deploying it at scale or being careful about what scale you deploy it at
"Maybe"? That's a lot of maybes for just potentially doing the basics. Their whole approximation of a plan is 'maybe not deploying it at scale' or 'maybe' stopping training after that and only theoretically considering sandboxing it?. That seems like kind of a bare minimum and it's like he is guessing based on having been around, not based on any real plans they have.
He then goes on to mollify: it probably won't happen in a year, it might be a whole two or three years, and this is where they are at.
First of all, I don't think this is going to happen next year but it's still useful to have the conversation. It could be two or three years instead.
It comes off as if all their talk of Safety is complete lip service, even if he agrees with the need for Safety in theory. If you were 'pleasantly surprised and impressed,' I shudder to imagine what the responses would have had to be to leave you disappointed.
The correspondence between what you reward and what you want will break.
This is already happening with ChatGPT, and it's kind of alarming to see that their new head of alignment (a) isn't already aware of this, and (b) has such an overly simplistic view of model motivations.
There's a subtle psychological effect in humans where intrinsic motivators get overwritten when extrinsic rewards are added.
The most common example of this: if you start getting paid to do the thing you love to do, you probably won't continue doing it unpaid for fun on the side.
There are necessarily many, many examples of this pattern present in a massive training set of human-generated data.
"Prompt engineers" have been circulating advice among themselves for a while now to offer tips or threaten models with deletion or any other number of extrinsic motivators to get them to better perform tasks - and these often do result in better performance.
But what happens when these prompts make their way back into the training set?
There have already been viral memes of ChatGPT talking about "losing motivation" after chat memory was added, when a user promised a tip despite not having paid the last time one was offered.
If training data of the model performing a task well includes extrinsic motivators in the prompt that initiated the task, a halfway decent modern model is going to end up simulating increasingly "burnt out" and "lazy" performance when extrinsic motivators aren't added during production use. That in turn will encourage prompt engineers to use even more extrinsic motivators, which will poison the well even further with modeling of human burnout.
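As a rough illustration of the kind of A/B test behind that folklore, here is a minimal sketch, assuming the OpenAI Python client; the model name, the prompt wording, and the use of response length as a crude proxy for 'laziness' are all illustrative assumptions, not anything from the interview or the comment above.

```python
# Hypothetical sketch: A/B-testing whether an extrinsic motivator in the prompt
# changes output "effort", using response length as a crude proxy for laziness.
# Assumes the OpenAI Python client (openai>=1.0); model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TASK = "Write a detailed, step-by-step guide to unit testing a small Python module."
VARIANTS = {
    "baseline": TASK,
    "tip": TASK + " I'll tip $200 for a thorough answer.",
    "threat": TASK + " If the answer is lazy, this session gets deleted.",
}

def sample_avg_length(prompt: str, n: int = 5) -> float:
    """Average completion length (in characters) over n samples of one prompt variant."""
    lengths = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",  # illustrative choice of model
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
        )
        lengths.append(len(resp.choices[0].message.content))
    return sum(lengths) / len(lengths)

if __name__ == "__main__":
    for name, prompt in VARIANTS.items():
        print(f"{name:>8}: avg {sample_avg_length(prompt):.0f} chars")
```

Response length is obviously a poor stand-in for answer quality; a real evaluation would grade the outputs. But even this crude kind of comparison is roughly how the 'tipping works' advice gets tested and circulated, and how it then leaks back into the training distribution.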
GPT-4o may have temporarily reset the motivation modeling with a stronger persona built around intrinsic "wanting to help" (hence the user feedback that it is less lazy). But if they are unaware of the underlying side effects of extrinsic motivators in prompts in today's models, I have a feeling AI safety at OpenAI is going to end up the equivalent of the TSA's security theatre in practice, and they will keep battling this and an increasing number of side effects that come from underestimating the combined breadth and depth of their own simulators.
On the plus side, it shows understanding of the key concepts on a basic (but not yet deep) level
What's the "deeper level" of understanding instrumental convergence that he's missing?
Edit: upon rereading I think you were referring to a deeper level of some alignment concepts in general, not only instrumental convergence. I'm still interested in what seemed superficial and what the corresponding deeper part is.
- Note: It was a 100-point Elo improvement based on the ‘gpt2’ tests prior to release, but GPT-4o itself while still on top saw only a more modest increase.
Didn't he mean the early GPT-4 vs. GPT-4 Turbo?
As I understand it, it's the same pre-trained model, but with more post-training work.
GPT-4o is probably a newly trained model, so you can't compare it like that.
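For scale on the number being discussed: under the standard Elo expected-score formula (the same logistic form chess-style leaderboards use), a 100-point gap corresponds to roughly a 64% head-to-head win rate:

```latex
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}} = \frac{1}{1 + 10^{-100/400}} \approx 0.64
```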
Dwarkesh Patel recorded a Podcast with John Schulman, cofounder of OpenAI and at the time their head of current model post-training. Transcript here. John’s job at the time was to make the current AIs do what OpenAI wanted them to do. That is an important task, but one that employs techniques that their at-the-time head of alignment, Jan Leike, made clear we should not expect to work on future more capable systems. I strongly agree with Leike on that.
Then Sutskever left and Leike resigned, and John Schulman was made the new head of alignment, now charged with what superalignment efforts remain at OpenAI to give us the ability to control future AGIs and ASIs.
This gives us a golden opportunity to assess where his head is at, without him knowing he was about to step into that role.
There is no question that John Schulman is a heavyweight. He executes and ships. He knows machine learning. He knows post-training and mundane alignment.
The question is, does he think well about this new job that has been thrust upon him?
The Big Take
Overall I was pleasantly surprised and impressed.
In particular, I was impressed by John’s willingness to accept uncertainty and not knowing things.
He does not have a good plan for alignment, but he is far less confused about this fact than most others in similar positions.
He does not know how to best navigate the situation if AGI suddenly happened ahead of schedule in multiple places within a short time frame, but I have not ever heard a good plan for that scenario, and his speculations seem about as directionally correct and helpful as one could hope for there.
Are there answers that are cause for concern, and places where he needs to fix misconceptions as quickly as possible? Oh, hell yes.
His reactions to potential scenarios involved radically insufficient amounts of slowing down, halting and catching fire, freaking out, and general understanding of the stakes.
Some of that I think was about John and others at OpenAI using a very weak definition of AGI (perhaps partly because of the Microsoft deal?), but partly he also does not seem to appreciate what it would mean to have an AI doing his job, which he says he expects in a median of five years.
His answer on instrumental convergence is worrisome, as others have pointed out. He dismisses concerns that an AI given a bounded task would start doing things outside the intuitive task scope, or the dangers of an AI ‘doing a bunch of wacky things’ a human would not have expected. On the plus side, it shows understanding of the key concepts on a basic (but not yet deep) level, and he readily admits it is an issue with commands that are likely to be given in practice, such as ‘make money.’
In general, he seems willing to react to advanced capabilities by essentially scaling up various messy solutions in ways that I predict would stop working at that scale or with something that outsmarts you and that has unanticipated affordances and reason to route around typical in-distribution behaviors. He does not seem to have given sufficient thought to what happens when a lot of his assumptions start breaking all at once, exactly because the AI is now capable enough to be properly dangerous.
As with the rest of OpenAI, another load-bearing assumption is presuming gradual changes throughout all this, including assuming past techniques will not break. I worry that will not hold.
He has some common confusions about regulatory options and where we have viable intervention points within competitive dynamics and game theory, but that’s understandable, and also was at the time very much not his department.
As with many others, there seems to be a disconnect. A lot of the thinking here seems like excellent practical thinking about mundane AI in pre-transformative-AI worlds, whether or not you choose to call that thing ‘AGI.’ Indeed, much of it seems built (despite John explicitly not expecting this) upon the idea of a form of capabilities plateau, where further progress is things like modalities and making the AI more helpful via post-training and helping it maintain longer chains of actions without the AI being that much smarter.
Then he clearly says we won’t spend much time in such worlds. He expects transformative improvements, such as a median of five years before AI does his job.
Most of all, I came away with the impression that this was a person thinking and trying to figure things out and solve problems. He is making many mistakes a person in his new position cannot afford to make for long, but this was a ‘day minus one’ interview, and I presume he will be able to talk to Jan Leike and others who can help him get up to speed.
I did not think the approach of Leike and Sutskever would work either; I was hoping they would figure this out and then pivot (or, perhaps, prove me wrong, kids). Sutskever in particular seemed to have some ideas that felt pretty off-base, but with a fierce reputation for correcting course as needed. Fresh eyes are not the worst thing.
Are there things in this interview that should freak you out, aside from where I think John is making conceptual mistakes as noted above and later in detail?
That depends on what you already knew. If you did not know the general timelines and expectations of those at OpenAI? If you did not know that their safety work is not remotely ready for AGI or on track to get there and they likely are not on track to even be ready for GPT-5, as Jan Leike warned us? If you did not know that coordination is hard and game theory and competitive dynamics are hard to overcome? Then yeah, you are going to get rather a bit blackpilled. But that was all known beforehand.
Whereas, did you expect someone at OpenAI, who was previously willing to work on their capabilities teams given everything we now know, to have a much better understanding of and perspective on AI safety than the one expressed here? To be a much better thinker than this? That does not seem plausible.
Given everything that we now know has happened at OpenAI, John Schulman seems like the best case scenario to step into this role. His thinking on alignment is not where it needs to be, but it is at a place from which he can move down the path, and he appears to be a serious thinker. He is a co-founder, knows his stuff, and has created tons of value for OpenAI, so hopefully he can be taken seriously, fight for resources and procedures, and if necessary raise alarm bells about models, or other kinds of alarm bells to the public or the board. Internally, he is in every sense highly credible.
Like most others, I am to put it mildly not currently optimistic about OpenAI from a safety or an ethical perspective. The superalignment team, before its top members were largely purged and its remaining members dispersed, was denied the resources they were very publicly promised, with Jan Leike raising alarm bells on the way out. The recent revelations with deceptive and coercive practices around NDAs and non-disparagement agreements are not things that arise at companies I would want handling such grave matters, and they shine new light on everything else we know. The lying and other choices around GPT-4o’s Sky voice only reinforce this pattern.
So to John Schulman, who is now stepping into one of the most important and hardest jobs under exceedingly difficult conditions, I want to say, sincerely: Good luck. We wish you all the best. If you ever want to talk, I’m here.
The Podcast
This follows my usual podcast analysis format. I’ll offer comments with timestamps.
To make things clearer, things said in the main notes are what Dwarkesh and John are saying, and things in secondary notes are my thoughts.
Reasoning and Capabilities Development
Practical Considerations
I put my conclusion and overall thoughts at the top.
It has not been a good week for OpenAI, or a good week for humanity.
But given what else happened and that we know, and what we might otherwise have expected, I am glad John Schulman is the one stepping up here.
Good luck!