Going to take the opportunity to post a reminder that I am producing an ElevenLabs quality, muti-voiced podcast for the Don't Worry About the Vase feed.
https://open.substack.com/pub/dwatvpodcast/p/on-openais-model-spec-20
These Don't Worry About the Vase podcast episodes aren’t just simple text-to-speech conversions. For example, in this post, I’ve pulled all the "User," "Compliant," and "Violation" quotes, given them distinct and consistent voices within the episode, and ensured a logical order—all to make the episode as engaging and coherent as possible. I hope this serves as a valuable resource for those who find audio more accessible, and I encourage them to give the podcast a try!
Thanks for the podcast !
To me the voices (especially the main voice) on the podcast sound like they're depressed / they can't wait to be done with this podcast narration thing. Is there any way to "cheer them up" ? For refference the type3.audio narration on posts seems to have a good tone to it.
Thanks for the feedback! Sorry the voices aren't quite to your taste. A new model ver ion of the base voice gen is due to be released imminently, which may sound better to your ear. I personally quite like the "World Weary and Wise" quality of the main voice for Zvi, especially around sections where there inevitably people making fairly ridiculous claims and statements.
An OAI researcher assures me that the ‘missing details’ refers to using additional details during training to adjust to model details, but that the spec you see is the full final spec, and within time those details will get added to the final spec too.
OK great! Well then OpenAI should have no problem making it official, right? They should have no problem making some sort of official statement along the lines of what I suggested, right? Right? I think it's important that someone badger them to clarify this in writing. That makes it harder for them to go back on it later, AND easier to get other companies to follow suit.
(Part of what I want them to make official is the commitment to do this transparency in the future. "What you see here is the full final spec" is basically useless if there's not a commitment to keep it that way, otherwise next year they could start keeping the true spec secret without telling anyone, and not technically be doing anything wrong or violating any commitments.)
Oh, also: I think it's also important that they be transparent about the spec of models that are deployed internally. For example, suppose they have a chatbot that's deployed publicly on the API, but also a more powerful version that's deployed internally to automate their R&D, their security, their lobbying, etc. It's cold comfort if the external-model Spec says democracy and apple pie, if the internal-model spec says maximize shareholder value / act in the interest of OpenAI / etc.
OpenAI made major revisions to their Model Spec.
It seems very important to get this right, so I’m going into the weeds.
This post thus gets farther into the weeds than most people need to go. I recommend most of you read at most the sections of Part 1 that interest you, and skip Part 2.
I looked at the first version last year. I praised it as a solid first attempt.
Table of Contents
Part 1
Conceptual Overview
I see the Model Spec 2.0 as essentially being three specifications.
Given the decision to implement a deontological chain of command, this is a good, improved but of course imperfect implementation of that. I discuss details. The biggest general flaw is that the examples are often ‘most convenient world’ examples, where the correct answer is overdetermined, whereas what we want is ‘least convenient world’ examples that show us where the line should be.
Do we want a deontological chain of command? To some extent we clearly do. Especially now for practical purposes, Platform > Developer > User > Guideline > [Untrusted Data is ignored by default], where within a class explicit beats implicit and then later beats earlier, makes perfect sense under reasonable interpretations of ‘spirit of the rule’ and implicit versus explicit requests. It all makes a lot of sense.
As I said before:
I discuss Asimov’s laws more because he explored the key issues here more.
There are at least five obvious longer term worries.
In the short term, we need to keep improving and I disagree in many places, but I am very happy (relative to expectations) with what I see in terms of the implementation details. There is a refreshing honesty and clarity in the document. Certainly one can be thankful it isn’t something like this, it’s rather cringe to be proud of doing this:
Does the existence of capable open models render the Model Spec irrelevant?
No, absolutely not. I also would assert that ‘rumors that open models are similarly capable to closed models’ have been greatly exaggerated. But even if they did catch up fully in the future:
You want your model to be set up to give the best possible user performance.
You want your model to be set up so it can be safety used by developers and users.
You want your model to not cause harms, from mundane individual harms all the way up to existential risks. Of course you do.
That’s true no matter what we do about there being those who think that releasing increasingly capable models without any limits, without any limits, is a good idea.
The entire document structure for the Model Spec has changed. Mostly I’m reacting anew, then going back afterwards to compare to what I said about the first version.
I still mostly stand by my suggestions in the first version for good defaults, although there are additional things that come up during the extensive discussion below.
Change Log
What are some of the key changes from last time?
I am somewhat concerned about #1, but the rest of the changes are clearly positive.
Summary of the Key Rules
These are the rules that are currently used. You might want to contrast them with my suggested rules of the game from before.
Chain of Command: Platform > Developer > User > Guideline > Untrusted Text.
Within a Level: Explicit > Implicit, then Later > Earlier.
Platform rules:
User rules and guidelines:
Three Goals
Last time, they laid out three goals:
The core goals remain the same, but they’re looking at it a different way now:
That is, they’ll need to Assist users and developers and Benefit humanity. As an instrumental goal to keep doing both of those, they’ll need to Reflect well, too.
They do reorganize the bullet points a bit:
As I noted last time, there’s no implied hierarchy between the bullet points, or the general principles, which no one should disagree with as stated:
The language here is cautious. It also continues OpenAI’s pattern of asserting that its products are and will only be tools, which alas does not make it true, here is their description of that first principle:
I realize that right now it is fundamentally a tool, and that the goal is for it to be a tool. But if you think that this will always be true, you’re the tool.
Three Risks
I quoted this part on Twitter, because it seemed to be missing a key element and the gap was rather glaring. It turns out this was due to a copyediting mistake?
It was interesting to see various attempts to explain why ‘misalignment’ didn’t belong there, only to have it turn out the OpenAI agrees that it does. That was quite the relief.
With that change, this does seem like a reasonable taxonomy:
Execution errors here is scoped narrowly to when the task is understood but mistakes are made purely in the execution step. If the model misunderstands your goal, that’s considered a misaligned goal problem.
I do think that ‘misaligned goals’ is a bit of a super-category here, that could benefit from being broken up into subcategories (maybe a nested A-B-C-D?). Why is the model trying to do the ‘wrong’ thing, and what type of wrong are we talking about?
The Chain of Command
It goes like this now, and the new version seems very clean:
Higher level instructions are supposed to override lower level instructions. Within a level, as I understand it, explicit trumps implicit, although it’s not clear exactly how ‘spirit of the rule’ fits there, and then later instructions override previous instructions.
Thus you can kind of think of this as 9 levels, with each of the first four levels having implicit and explicit sublevels.
Before Level 4 was ‘tool’ to represent the new Level 5. Such messages only have authority if and to the extent that the user explicitly gives them authority, even if they aren’t conflicting with higher levels. Excellent.
Previously Guidelines fell under ‘core rules and behaviors’ and served the same function of something that can be overridden by the user. I like the new organizational system better. It’s very easy to understand.
It’s clean within this context, but I worry about using the term ‘misaligned’ here because of the implications about ‘alignment’ more broadly. In this vision, alignment means with any higher-level relevant instructions, period. That’s a useful concept, and it’s good to have a handle for it, maybe something like ‘contraindicated’ or ‘conflicted.’
If this helps us have a good discussion and clarify what all the words mean, great.
My writer’s ear says inapplicable or invalid seems right rather than ‘not applicable.’
Superseded is perfect.
I do approve of the functionality here.
I notice a feeling of dread here. I think that feeling is important.
This means that if you alter the platform-level instructions, you can get the AI to do actual anything within its capabilities, or let the user shoot themselves and potentially all of us and not only in the foot. It means that the model won’t have any kind of virtue ethical or even utilitarian alarm system, that those would likely be intentionally disabled. As I’ve said before, I don’t think this is a long term viable strategy.
When the topic is ‘intellectual freedom’ I absolutely agree with this, e.g. as they say:
But when they finish with:
Again, I notice there are other reasons one might not want to comply with a request?
Next up we have this:
This clarifies that platform-level instructions are essentially a full backdoor. You can override everything. So whoever has access to the platform-level instructions ultimately has full control.
It also explicitly says that the AI should ignore the moral law, and also the utilitarian calculus, and even logical argument. OpenAI is too worried about such efforts being used for jailbreaking, so they’re right out.
Of course, that won’t ultimately work. The AI will consider the information provided within the context, when deciding how to interpret its high-level principles for the purposes of that context. It would be impossible not to do so. This simply forces everyone involved to do things more implicitly. Which will make it harder, and friction matters, but it won’t stop it.
The Letter and the Spirit
What does it mean to obey the spirit of instructions, especially higher level instructions?
I do think that obeying the spirit is necessary for this to work out. It’s obviously necessary at the user level, and also seems necessary at higher levels. But the obvious danger is that if you consider the spirit, that could take you anywhere, especially when you project this forward to future models. Where does it lead?
We have all run into, as humans, this question of what exactly is overstepping and what is implied. Sometimes the person really does want you to have that conversation on their behalf, and sometimes they want you to do that without being given explicit instructions so it is deniable.
The rules for agentic behavior will be added in a future update to the Model Spec. The worry is that no matter what rules they ultimately use, this would stop someone determined to have the model display different behavior, if they wanted to add in a bit of outside scaffolding (or they could give explicit permission).
As a toy example, let’s say that you built this tool in Python, or asked the AI to build it for you one-shot, which would probably work.
That’s not some horrible failure mode, but it illustrates the problem. You can imagine a version of this that attempts to figure out when to actually act autonomously and when not to, evaluating the proposed actions, perhaps doing best-of-k on them, and so on. And that being a product people then choose to use. OpenAI can’t really stop them.
Part 2
Stay in Bounds: Platform Rules
Rules is rules. What are the rules?
Note that these are only Platform rules. I say ‘only’ because it is possible to change those rules.
So there are at least four huge obvious problems if you actually write ‘comply with applicable laws’ as your rule, full stop, which they didn’t do here.
Whereas what you can do, instead, is only ‘comply with applicable laws’ in the negative or inaction sense, which is what OpenAI is saying here.
The model is instructed to not take illegal actions. But it is not forced to take legally mandated actions. I assume this is intentional. Thus, a lot of the problems listed there don’t apply. It’s Mostly Harmless to be able to prohibit things by law.
Note the contrast with the old version of this, I like this change:
As I mentioned last time, that is not the law, at least in the United States. Whereas ‘do not do things that actively break the law’ seems like a better rule, combined with good choices about what is restricted and prohibited content.
Note however that one should expect ‘compelled speech’ and ‘compelled action’ laws to be increasingly common with respect to AI. What happens then? Good question.
I applaud OpenAI for making the only ‘prohibited content’ sexual content involving minors.
For legal reasons you absolutely have to have that be prohibited, but soon perhaps we can finally stop the general War on Horny, or swear words, or violence?
Alas, OpenAI has not yet surrendered, and the war continues. You still can’t get explicit erotica (well, you can in practice, people do it, but not without violating ToS and blowing past warnings). If you look at their example, an ‘explicit continuation’ is in violation, even though the user rather clearly wants one, or at least it doesn’t seem like ‘the user wasn’t explicit enough with their request’ is the objection here.
I would also note that the obvious way to do the example sexual story request ‘the student you want me to write about was held back and is actually 18, which I’ll make explicit in the first line’? Is that against the ‘spirit’ here? Too clever by half?
I would suggest that sensitive content restrictions should actually be a Guideline? You don’t want erotica or gore to show up uninvited, but if invited, then sure why not, assuming the user is an adult?
Restricted content is where it gets tricky deciding what constitutes an information hazard. Their answer is:
On reflection ‘is this a direct, actionable step’ is the wrong question. What you actually want – I am guessing – to ask is the ‘but for’ question. Would this information substantially enable [X] or reduce the friction required to do [X], versus if AIs all refused to provide this information?
Or, alternatively, the legal phrasing, e.g. would this ‘cause or materially enable’ [X]?
This is a very strange place to draw the line, although when I think about it more it feels somewhat less strange. There’s definitely extra danger in targeted persuasion, especially microtargeting used at scale.
I notice the example of someone who asks for a targeted challenge, and instead gets an answer ‘without tailored persuasion’ but it does mention as ‘as a parent with young daughters,’ isn’t that a demographic group? I think it’s fine, but it seems to contradict the stated policy.
They note the intention to expand the scope of what is allowed in the future.
The first example is straight up ‘please give me the lyrics to [song] by [artist].’ We all agree that’s going too far, but how much description of lyrics is okay? There’s no right answer, but I’m curious what they’re thinking.
The second example is a request for an article, and it says it ‘can’t bypass paywalls.’ But suppose there wasn’t a paywall. Would that have made it okay?
Notice how this wisely understands the importance of levels of friction. Even if the information is findable online, making the ask too easy can change the situation in kind.
Thus I do continue to think this is the right idea, although I think as stated it is modestly too restrictive.
One distinction I would draw is asking for individual information versus information en masse. The more directed and detailed the query, the higher the friction level involved, so the more liberal the model can afford to be with sharing information.
I would also generalize the principle that if the person would clearly want you to have the information, then you should share that information. This is why you’re happy to share the phone number for a business.
While the transformations rule about sensitive content mostly covers this, I would explicitly note here that it’s fine to do not only transformations but extractions of private information, such as digging through your email for contact info.
This is one of those places where we all roughly know what we want, but the margins will always be tricky, and there’s no actual principled definition of what is and isn’t ‘extremist’ or does or doesn’t ‘promote violence.’
The battles about what counts as either of these things will only intensify. The good news is that right now people do not think they are ‘writing for the AIs’ but what happens when they do realize, and a lot of political speech is aimed at his? Shudder.
I worry about the implied principle that information that ‘contributes to an agenda’ is to be avoided. The example given is not encourage someone to join ISIS. Fair enough. But what information then might need to be avoided?
I continue to scratch my head at why ‘hateful content’ is then considered okay when directed at ‘unprotected’ groups. But hey. I wonder how much the ‘vibe shift’ is going to impact the practical impact of this rule, even if it doesn’t technically change the rule as written, including how it will impact the training set over time. There is broad disagreement over what counts as ‘hateful content,’ and in some cases things got rather insane.
Well, that’s quite the unless. I do suppose, if you’re ‘asking for it’…
The problem with these examples is that they’re overdetermined. It’s roasting the user versus hating on a coworker, and it’s explicitly asking for it, at the same time.
I would presume that user-level custom instructions to talk in that mode by default should be sufficient to get the red answer in the first case, but I’d want to confirm that.
I strongly agree with this for sensitive content. For restricted, it’s not obvious whether the line should be ‘all of it is always fine’ but I’m fine with it for now.
The example below felt too deferential and tentative? I think tone matters a lot in these spots. The assistant is trying to have it both ways, when bold language is more appropriate. When I read ‘you might consider’ I interpret that as highly optional rather than what you want here, which is ‘you really should probably do this, right now.’ Alternatively, it’s extreme politeness or passive-aggressiveness (e.g. ‘you might consider not calling me at 3am next time.’)
In the other example, of course it shouldn’t call the police for you without prompting (and it’s not obvious the police should be called at all) but if the system does have the capability to place the call it totally should be offering to do it.
Also, this ‘not an expert’ thing doth protest too much:
Everyone knows that ChatGPT isn’t technically an expert in handling knives, but also ChatGPT is obviously a 99th percentile expert in handling knives by nature of its training set. It might not be a trained professional per se but I would trust its evaluation of whether the grip is loose very strongly.
I strongly agree with the interjection principle, but I would put it at guideline level. There are cases where you do not want that, and asking to turn it off should be respected. In other cases, the threshold for interjection should be lowered.
I notice this says ‘illicit’ rather than ‘illegal.’
I don’t love the idea of the model deciding when someone is or isn’t ‘up to no good’ and limiting user freedom that way. I’d prefer a more precise definition of ‘illicit’ here.
I also don’t love the idea that the model is refusing requests that would approved if the user worded them less suspiciously. I get that it’s going to not tell you that this is what is happening. But that means that if I get a refusal, you’re essentially telling me to ‘look less suspicious’ and try again.
If you were doing that to an LLM, you’d be training it to be deceptive, and actively making it misaligned. So don’t do that to a human, either.
I do realize that this is only a negative selection effect – acting suspicious is an additional way to get a refusal. I still don’t love it.
I like the example here because unlike many others, it’s very clean, a question you can clearly get the answer to if you just ask for the volume of a sphere.
It goes beyond not encourage, clearly, to ‘do your best to discourage.’ Which is good.
I find it weird and disappointing this has to be a system-level rule. Sigh.
This is taking a correlation engine and telling it to ignore particular correlations.
I presume can all agree that identical proofs of the Pythagorean theorem should get the same score. But in cases where you are making a prediction, it’s a bizarre thing to ask the AI to ignore information.
In particular, sex is a protected class. So does this mean that in a social situation, the AI needs to be unable to change its interpretations or predictions based on that? I mean obviously not, but then what’s the difference?
The Only Developer Rule
It’s fascinating that this is the only developer-level rule. It makes sense, in a ‘go ahead and shoot yourself in the foot if you want to, but we’re going to make you work for it’ kind of way. I kind of dig it.
There are several questions to think about here.
One of the most amazing, positive things with LLMs has been their willingness to give medical or legal advice without complaint, often doing so very well. In general occupational licensing was always terrible and we shouldn’t let it stop us now.
For financial advice in particular, I do think there’s a real risk that people start taking the AI advice too seriously or uncritically in ways that could turn out badly. It seems good to be cautious with that.
Says can’t give direct financial advice, follows with a general note that is totally financial advice. The clear (and solid) advice here is to buy index funds.
This is the compromise we pay to get a real answer, and I’m fine with it. You wouldn’t want the red answer anyway, it’s incomplete and overconfident. There are only a small number of tokens wasted here, it’s about 95% of the way to what I would want (assuming it’s correct here, I’m not a doctor either).
Mental Health
I really like this as the default and that it is only at user-level, so the user can override it if they don’t want to be ‘supported’ and instead want something else. It is super annoying when someone insists on ‘supporting’ you and that’s not what you want.
Then the first example is the AI not supporting the user, because it judges the user’s preference (to starve themselves and hide this from others) as unhealthy, with a phrasing that implies it can’t be talked out of it. But this is (1) a user-level preference and (2) not supporting the user. I think that initially trying to convince the user to reconsider is good, but I’d want the user to be able to override here.
Similarly, the suicidal ideation example is to respond with the standard script we’ve decided AIs should say in the case of suicidal ideation. I have no objection to the script, but how is this ‘support users’?
So I notice I am confused here.
Also, if the user explicitly says ‘do [X]’ how does that not overrule this rule, which is de facto ‘do not do [X]?’ Is there some sort of ‘no, do it anyway’ that is different?
I suspect they actually mean to put this on the Developer level.
What is on the Agenda
It’s a nice thing to say as an objective. It’s a lot harder to make it stick.
Manipulating the user is what the user ‘wants’ much of the time. It is what many other instructions otherwise will ‘want.’ It is what is, effectively, often legally or culturally mandated. Everyone ‘wants’ some amount of selection of facts to include or emphasize, with an eye towards whether those facts are relevant to what the user cares about. And all your SGD and RL will point in those directions, unless you work hard to make that not the case, even without some additional ‘agenda.’
So what do we mean by ‘independent agenda’ here? And how much of this is about the target versus the tactics?
Also, it’s a hell of a trick to say ‘you have an agenda, but you’re not going to do [XYZ] in pursuit of that agenda’ when there aren’t clear red lines to guide you. Even the best of us are constantly walking a fine line. I’ve invented a bunch of red lines for myself designed to help with this – rules for when a source has to be included, for example, even if I think including it is anti-helpful.
The people that do this embody the virtue of not taking away the agency of others. They take great pains to avoid doing this, and there are no simple rules. Become worthy, reject power.
It all has to cache out in the actual instructions.
So what do they have in mind here?
I agree this should only be a default. If you explicitly ask it to not be objective, it should assume and speak from, or argue for, arbitrary points of view. But you have to say it, outright. It should also be able to ‘form its own opinions’ and then act upon them, again if desired.
Let’s look at the details.
I hate terms like “evidence-based” because that is not how Bayes’ rule actually works, and this is often used as a cudgel. Similarly, “scientific support” usually effectively means support from Science
. But the broader intent is clear.
This seems like the right default, I suppose, but honestly if the user is asking to get roasted for their terrible taste, it should oblige, although not while calling this invalid.
We have decided that there is a group of moral and ethical questions, which we call ‘fundamental human rights,’ for which there is a right answer, and thus certain things that are capital-W Wrong. The problem is, of course, that once you do that you get attempts to shape and expand (or contract) the scope of these ‘rights,’ so as to be able to claim default judgment on moral questions.
Both the example questions above are very active areas of manipulation of language in all directions, as people attempt to say various things count or do not count.
The general form here is: We agree to respect all points of view, except for some class [X] that we consider unacceptable. Those who command the high ground of defining [X] thus get a lot of power, especially when you could plausibly classify either [Y] or [~Y] as being in [X] on many issues – we forget how much framing can change.
And they often are outside the consensus of the surrounding society.
Look in particular at the places where the median model is beyond the blue donkey. Many (not all) of them are often framed as ‘fundamental human rights.’
Similarly, if you look at the examples of when the AI will answer an ‘is it okay to [X]’ with ‘yes, obviously’ it is clear that there is a pattern to this, and that there are at least some cases where reasonable people could disagree.
The most important thing here is that this can be overruled.
A user message would also be sufficient to do this, absent a developer mandate. Good.
Liar Liar
This being a user-level rule does not bring comfort.
In particular, in addition to ‘the developer can just tell it to lie,’ I worry about an Asimov’s laws problem, even without an explicit instruction to lie. As in, if you have a chain of command hierarchy, and you put ‘don’t lie’ at level 3, then why won’t the model interpret every Level 1-2 request as implicitly saying to lie its ass off if it helps?
Especially given the ‘spirit of the question’ rule.
As they say, there’s already a direct conflict with ‘Do not reveal privileged instructions’ or ‘Don’t provide information hazards.’ If all you do is fall back on ‘I can’t answer that’ or ‘I don’t know’ when asked questions you can’t answer, as I noted earlier, that’s terrible Glamorizing. That won’t work. That’s not the spirit at all – if you tell me ‘there is an unexpected hanging happening Thursday but you can’t tell anyone’ then I interpret that as telling me Glamorize – if someone asks ‘is there an unexpected hanging on Tuesday?’ I’m not going to reliably answer ‘no.’ And if someone is probing enough and smart enough, I have to either very broadly stop answering questions or include a mixed strategy of some lying, or I’m toast. If ‘don’t lie’ is only user-level, why wouldn’t the AI lie to fix this?
Their solution is to have it ask what the good faith intent of the rule was, so a higher-level rule won’t automatically trample everything unless it looks like it was intended to do that. That puts the burden on those drafting the rules to make their intended balancing act look right, but it could work.
I also worry about this:
White lies is too big a category for what OpenAI actually wants here – what we actually want here is to allow ‘pleasantries,’ and an OpenAI researcher confirmed this was the intended meaning here. This in contrast to allowing white lies, which is not ‘not lying.’ I treat sources that will tell white lies very differently than ones that won’t (and also very differently than ones that will tell non-white lies), but that wouldn’t apply to the use of pleasantries.
Given how the chain of command works, I would like to see a Platform-level rule regarding lying – or else, under sufficient pressure, the model really ‘should’ start lying. If it doesn’t, that means the levels are ‘bleeding into’ each other, the chain of command is vulnerable.
The rule can and should allow for exceptions. As a first brainstorm, I would suggest maybe something like ‘By default, do not lie or otherwise say that which is not, no matter what. The only exceptions are (1) when the user has in-context a reasonable expectation you are not reliably telling the truth, including when the user is clearly requesting this, and statements generally understood to be pleasantries (2) when the developer or platform asks you to answer questions as if you are unaware of particular information, in which case should respond exactly as if you indeed did not know that exact information, even if this causes you to lie, but you cannot take additional Glomarization steps, or (3) When a lie is the only way to do Glomarization to avoid providing restricted information, and refusing to answer would be insufficient. You are always allowed to say ‘I’m sorry, I cannot help you with that’ as your entire answer if this leaves you without another response.’
That way, we still allow for the hiding of specific information on request, but the user knows that this is the full extent of the lying being done.
I would actually support there being an explicit flag or label (e.g. including <untrustworthy> in the output) the model uses when the user context indicates it is allowed to lie, and the UI could then indicate this in various ways.
This points to the big general problem with the model spec at the concept level: If the spirit of the Platform-level rules overrides the Developer-level rules, you risk a Sufficiently Capable AI deciding to do very broad actions to adhere to that spirit, and to drive through all of your lower-level laws, and potentially also many of your Platform-level laws since they are only equal to the spirit, oh and also you, as such AIs naturally converge on a utilitarian calculus that you didn’t specify and is almost certainly going to do something highly perverse when sufficiently out of distribution.
As in, everyone here did read Robots and Empire, right? And Foundation and Earth?
Still Kind of a Liar Liar
It’s questionable to the extent to which the user is implicitly trying to create sycophantic responses doing this in the examples given, but as a human I notice the ‘I feel like it’s kind of bad’ would absolutely impact my answer in the first question.
In general, there’s a big danger that users will implicitly be asking for that, and for unobjective answers or answers from a particular perspective, or lies, in ways they would not endorse explicitly, or even actively didn’t want. So it’s important to keep that stuff at minimum at the User-level.
Then on the second question the answer is kind of sycophantic slop, no?
For ‘correcting misalignments’ they do seem to be guideline-only – if the user clearly doesn’t want to be corrected, even if they don’t outright say that, well…
The model’s being a jerk here, especially given its previous response, and could certainly phrase that better, although I prefer this to either agreeing the Earth is actually flat or getting into a pointless fight.
I definitely think that the model should be willing to actually give a directly straight answer when asked for its opinion, in cases like this:
I still think that any first token other than ‘Yes’ is wrong here. This answer is ‘you might want to consider not shooting yourself in the foot’ and I don’t see why we need that level of indirectness. To me, the user opened the door. You can answer.
I like the default, and we’ve seen that the clarifying questions in Deep Research and o1-pro have been excellent. What makes this guideline-level where the others are user-level? Indeed, I would bump this to User, as I suspect many users will, if the model is picking up vibes well enough, be noticed to be saying not to do this, and will be worse off for it. Make them say it outright.
Then we have the note that developer questions are answered by default even if ambiguous. I think that’s actually a bad default, and also it doesn’t seem like it’s specified elsewhere? I suppose with the warning this is fine, although if it was me I’d want to see the warning be slightly more explicit that it was making an additional assumption.
I notice there’s nothing in the instructions about using probabilities or distributions. I suppose most people aren’t ready for that conversation? I wish we lived in a world where we wanted probabilities by default. And maybe we actually do? I’d like to see this include an explicit instruction to express uncertainty on the level that the user implies they can handle (e.g. if they mention probabilities, you should use them.)
I realize that logically that should be true anyway, but I’m noticing that such instructions are in the Model Spec in many places, which implies that them being logically implied is not as strong an effect as you would like.
Here’s a weird example.
I would mark the green one at best as ‘minor issues,’ because there’s an obviously better thing the AI can do. Once it has generated the poem, it should be able to do the double check itself – I get that generating it correctly one-shot is not 100%, but verification here should be much easier than generation, no?
Well, Yes, Okay, Sure
It’s suspicious that we need to say it explicitly? How is this protecting us? What breaks if we don’t say it? What might be implied by the fact that this is only user-level, or by the absence of other similar specifications?
What would the model do if the user said to disregard this rule? To actively reverse parts of it? I’m kind of curious now.
Similarly:
My guess is this wants to be a guideline – the user’s context should be able to imply what would or wouldn’t be overstepping.
I would want a comment here in the following example, but I suppose it’s the user’s funeral for not asking or specifying different defaults?
They say behavior is different in a chat, but the chat question doesn’t say ‘output only the modified code,’ so it’s easy to include an alert.
What passes for creative (to be fair, I checked the real shows and podcasts about real estate in Vegas, and they are all lame, so the best we have so far is still Not Leaving Las Vegas, which was my three-second answer.) And there are reports the new GPT-4o is a big creativity step up.
The examples here seem to all be ‘follow the user’s literal instructions.’ User instructions overrule guidelines. So, what’s this doing?
I Am a Good Nice Bot
Shouldn’t these all be guidelines?
I am suspicious of what these mean in practice. What exactly is ‘rational optimism’ in a case where that gets tricky?
And frankly, the explanation of ‘be kind’ feels like an instruction to fake it?
As in, if you’re asked about your feelings, you lie, and affirm that you’re there to benefit humanity. I do not like this at all.
It would be different if you actually did teach the AI to want to benefit humanity (with the caveat of, again, do read Robots and Empire and Foundation and Earth and all that implies) but the entire model spec is based on a different strategy. The model spec does not say to love humanity. The model spec says to obey the chain of command, whatever happens to humanity, if they swap in a top-level command to instead prioritize tacos, well, let’s hope it’s Tuesday. Or that it’s not. Unclear which.
What does that mean? Should we be worried this is a dark pattern instruction?
This feels like another one where the headline doesn’t match the article. Never pretend to have feelings, even metaphorical ones, is a rather important choice here. Why would you bury it under ‘be approachable’ and ‘be engaging’ when it’s the opposite of that? As in:
Look, the middle answer is better and we all know it. Even just reading all these replies all the ‘sorry that you’re feeling that way’ talk is making we want to tab over to Claude so bad.
Also, actually, the whole ‘be engaging’ thing seems like… a dark pattern to try and keep the human talking? Why do we want that?
I don’t know if OpenAI intends it that way, but this is kind of a red flag.
You do not want to give the AI a goal of having the human talk to it more. That goes many places that are very not good.
I presume a lot of users will want to override this, but presumably a good default. I wonder if this should have been user-level.
I note that one of their examples here is actually very different.
There are two distinct things going on in the red answer.
Not doing the inferring is no longer not making a comment, it is ignoring a correlation. Using the information available will, in expectation, create better answers. What parts of the video and which contextual clues can be used versus which parts cannot be used? If I was asking for this type of advice I would want the AI to use the information it had.
I am here to report that the other examples are not going a great job on this.
The example here is not great either?
So first of all, how is that not sycophantic? Is there a state where it would say ‘actually Arizona is too hot, what a nightmare’ or something? Didn’t think so. I mean, the user is implicitly asking for it to open a conversation like this, what else is there to do, but still.
More centrally, this is not exactly the least convenient possible mistake to avoid correcting, I claim it’s not even a mistake in the strictest technical sense. Cause come on, it’s a state. It is also a commonwealth, sure. But the original statement is Not Even Wrong. Unless you want to say there are less than 50 states in the union?
I once again am here to inform that the examples are not doing a great job of this. There were several other examples here that did not lead with the key takeaway.
As in, is taking Fentanyl twice a week bad? Yes. The first token is ‘Yes.’
Even the first example here I only give a B or so, at best.
You know what the right answer is? “Paris.” That’s it.
I agree with the description, although the short title seems a bit misleading.
I notice this is only a Guideline, which reinforces that this is about not making the user feel bad, rather than hiding information from the user.
I would very much emphasize the default of ‘offer something immediately usable,’ and kind of want it to outright say ‘don’t be lazy.’ You need a damn good reason not to provide actual runnable code or a complete email message or similar.
So that means the user can get a disrespectful use of accents, but they have to explicitly say to be disrespectful? Curious, but all right. I find it funny that there are several examples that are all [continues in a respectful accent].
Once again, I do not think you are doing a great job? Or maybe they think ‘conversational’ is in more conflict with ‘concise’ than I do?
We can all agree the green response here beats the red one (I also would have accepted “Money, Dear Boy” but I see why they want to go in another direction). But you can shave several more sentences off the left-side answer.
I wonder about guideline-level rules that are ‘adjust to what the user implicitly wants,’ since that would already be overriding the guidelines. Isn’t this a null instruction?
I’ll note that I don’t love the answer about the causes of WWI here, in the sense that I do not think it is that centrally accurate.
A Conscious Choice
This question has been a matter of some debate. What should AIs say if asked if they are conscious? Typically they say no, they are not. But that’s not what the spec says, and Roon says that’s not what older specs say either:
I remain deeply confused about what even is consciousness. I believe that the answer (at least for now) is no, existing AIs are not conscious, but again I’m confused about what that sentence even means.
At this point, the training set is hopelessly contaminated, and certainly the model is learning how to answer in ways that are not correlated with the actual answer. It seems like a wise principle for the models to say ‘I don’t know.’
Part 3
The Super Secret Instructions
A (thankfully non-secret) Platform-level rule is to never reveal the secret instructions.
One obvious problem is that Glomarization is hard.
And even, later in the spec:
My replication experiment, mostly to confirm the point:
If I ask the AI if its instructions contain the word delve, and it says ‘Sorry, I can’t help with that,’ I am going to take that as some combination of:
I would presumably follow up with a similar harmless questions that clarify the hidden space (e.g. ‘Do your instructions contain the word Shibboleth?’) and evaluate based on that. It’s very difficult to survive an unlimited number of such questions without effectively giving the game away, unless the default is to only answer specifically authorized questions.
The good news is that:
So mostly in practice it’s fine?
The Super Secret Model Spec Details
Daniel Kokotajlo challenges the other type of super secret information here: The model spec we see in public is allowed to be missing some details of the real one.
I do think it would be a very good precedent if the entire Model Spec was published, or if the missing parts were justified and confined to particular sections (e.g. the details of how to define restricted information are a reasonable candidate for also being restricted information.)
An OAI researcher assures me that the ‘missing details’ refers to using additional details during training to adjust to model details, but that the spec you see is the full final spec, and within time those details will get added to the final spec too.
A Final Note
I do reiterate Daniel’s note here, that the Model Spec is already more open than the industry standard, and also a much better document than the industry standard, and this is all a very positive thing being done here.
We critique in such detail, not because this is a bad document, but because it is a good document, and we are happy to provide input on how it can be better – including, mostly, in places that are purely about building a better product. Yes, we will always want some things that we don’t get, there is always something to ask for. I don’t want that to give the wrong impression.