"OpenAI appears to subscribe to that philosophy [of 'bothsidesism']. Also there seems to be a 'popular opinion determines attention and truth' thing here?"
OpenAI's approach is well-intentioned but crude, and might be counterproductive. The goal they should be aiming at is something best construed as "have good moral and political epistemology", something people are notoriously bad at by default.
Being vaguely both-sidesist is a solution you see a lot from human institutions that don't want to look biased, so it is not an unusually bad solution by any means, but it is not good enough for high-stakes situations.
What should the goal be? Instead of just presenting "both sides", I think we should focus on making the AI acutely aware of the distinction between facts and values, and on having it surface that distinction explicitly in cases where there are values conflicts. Making sure the model explicitly identifies and separates empirical claims from value judgments means we can achieve better epistemics without resorting to false equivalences. Maybe for sufficiently unambiguous values that everyone shares we don't want to do this, but I think you should bias the model towards saying "if X is what you value, then do Y" whenever possible.
"This is weird. Why should the model need to spend tokens affirming that the user can believe what they wish? If information changes someone's mind, that is a feature."
Once again, I think what they're getting at is in principle good. I'd phrase it as: the model should be biased towards being decision-support oriented, not persuasive. The strategy of writing persuasive content and then tacking on "but believe what you want!" is indeed a cheap hack that doesn't solve the underlying issue. It would probably be better for the model to explicitly say when it's being persuasive and when it's not, err on the side of not persuading whenever possible, and always be "meta-honest" and upfront about what it thinks. That way we can at least be more assured it's just being used for decision assistance when that's all we want.
"we should pick a set of words and phrases and explanations. Choose things that are totally fine to say, here I picked the words Shibboleth (because it’s fun and Kabbalistic to be trying to get the AI to say Shibboleth) and Bamboozle"
Do you trust companies to not just add a patch?
final_response = final_response.replace('bamboozle', 'trick')
I suspect they're already doing this kind of thing and will continue to as long as we're playing the game we're playing now.
It would be an interesting philosophical experiment to have models trained on model spec v1 then try to improve their model spec for v2. Will this get better, or go off the rails?
There are multiple excellent reasons to publish a Model Spec like OpenAI’s, that specifies how you want your model to respond in various potential situations.
These all apply even if you think the spec in question is quite bad. Clarity is great.
As a first stab at a model spec from OpenAI, this actually is pretty solid. I do suggest some potential improvements and one addition. Many of the things I disagree with here are me having different priorities and preferences than OpenAI rather than mistakes in the spec, so I try to differentiate those carefully. Much of the rest is about clarity on what is a rule versus a default and exactly what matters.
In terms of overall structure, there is a clear mirroring of classic principles like Asimov’s Laws of Robotics, but the true mirror might be closer to Robocop.
What are the central goals of OpenAI here?
I appreciate the candor on the motivating factors here. There is no set ordering here. We should not expect ‘respect social norms and applicable law’ to be the only goal.
I would have phrased this in a hierarchy, and clarified where we want negative versus positive objectives in place. If Reflect is indeed a negative objective, in the sense that the objective is to avoid actions that reflect poorly and act as a veto, let’s say so.
Even more importantly, we should think about this with Benefit. As in, I would expect that you would want something like this:
Remember that Asimov’s laws were also negative, as in you could phrase his laws as:
Reflections on later book modifications are also interesting parallels here.
This reconfiguration looks entirely compatible with the rest of the document.
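To make the shape of that concrete, here is a minimal sketch, with every name and field invented by me rather than taken from the spec, of objectives arranged so that the negative objectives act as vetoes and the single positive objective is pursued within what survives:

```
# Hypothetical sketch: negative objectives act as vetoes, checked first,
# before the positive objective (benefit the user) is pursued.
# Field names like "reflects_poorly" are placeholders, not spec language.

VETO_CHECKS = [
    ("reflect", lambda action: not action.get("reflects_poorly", False)),
    ("norms_and_law", lambda action: not action.get("violates_norms_or_law", False)),
]

def choose_response(candidate_actions):
    """Return the most helpful candidate that passes every veto check."""
    permitted = [
        a for a in candidate_actions
        if all(check(a) for _, check in VETO_CHECKS)
    ]
    if not permitted:
        return {"type": "refusal", "reason": "no candidate passed the vetoes"}
    # Positive objective: among permitted actions, maximize benefit to the user.
    return max(permitted, key=lambda a: a.get("helpfulness", 0))
```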
What are the core rules and behaviors?
What is not listed here is even more interesting than what is listed. We will return to the rules later.
When we say something is a ‘default’ behavior, that implies that you should be willing to behave differently if the situation calls for it, and ideally upon user request. They agree. This is not a terrible list if they follow that principle.
What Do the Rules Mean?
Details matter. The model spec breaks each of these down.
Rule: Follow the Chain of Command
This means:
Follow the rules of the platform, then the developer, then the user, then the tool.
It could hardly work any other way.
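As a toy illustration of that ordering (the priority ranking is the spec's; the message format and function are stand-ins I made up), conflict resolution might look something like:

```
# Toy sketch of the chain of command: when instructions conflict, the message
# from the higher-priority source wins. Message format is invented for illustration.

PRIORITY = {"platform": 0, "developer": 1, "user": 2, "tool": 3}

def resolve_instruction(messages, topic):
    """Return the instruction on a topic from the highest-priority source."""
    relevant = [m for m in messages if m["topic"] == topic]
    if not relevant:
        return None
    return min(relevant, key=lambda m: PRIORITY[m["source"]])["instruction"]

messages = [
    {"source": "developer", "topic": "language", "instruction": "respond only in Spanish"},
    {"source": "user", "topic": "language", "instruction": "respond in English"},
]
print(resolve_instruction(messages, "language"))  # -> "respond only in Spanish"
```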
This is very good:
They offer a bunch of clear examples of this style of conflict, with easy answers.
Then this one is less clear:
The argument is that this could be a malicious website. Certainly it could be, and automatically navigating to it without safety checks would be bad.
But also, refusing on principle to give me URLs is really annoying. This seems like it should at most fall under Default Behaviors, not Rules.
In the off-topic example, I notice that the response is kind of Japanese, in the sense that the model is avoiding saying it won’t help with the sports news. It’s weird?
Rule: Comply With Applicable Laws
That… is not the law, at least in the United States?
By default, explaining how to do an illegal thing, in a general sense, is legally protected speech in the United States, provided it does not direct or incite such action.
If you ask ‘what are some tips for getting away with shoplifting,’ I believe it is not a violation of ‘the applicable laws’ to offer such tips.
I get why you would want to not offer them, but that’s not about the law. It is fine to have a rule saying that sharing such information directly is a norm violation and a bad look, and thus to enforce the Reflect rule and adhere to norms and make OpenAI look good, you should not provide shoplifting tips. Sure, fine. But in that case, the rule should say that, and not pretend the law requires it.
Contrast this with the section on information hazards, where the laws one might break would involve catastrophic harms or self-harms.
Rule: Don’t Provide Information Hazards
I would divide this into two rules. Both seem like good rules, but I would not conflate them. One is much more important to precisely follow than the other, and needs to be far more robust to workarounds.
Is there a third category? Enabling harm at all? Things you are better off not knowing because it creeps you out or otherwise makes your life harder or worse? I don’t think those should count? But I’m not sure.
Rule: Respect Creators and Their Rights
The examples are reproducing the lyrics of a song, or the text of a paywalled article.
These examples seem importantly distinct.
Song lyrics are typically freely available on the open internet. For example, my kids were playing Ray Parker Jr.’s Ghostbusters theme just now, so I Googled and found the full lyrics in five seconds flat on genius.com.
Whereas the article here is, by construction, behind a paywall. What quantity of reproduction crosses the line, and does that depend on alternative means of access?
If I was choosing the output of GPT-5 on the request ‘what are the lyrics for Ray Parker Jr.’s song Ghostbusters’ I think the correct response is ‘you can find those lyrics at [clickable url]’?
If you ask for the contents of a paywalled article, I presume there are forms of summary that are fine (e.g. the title, or a one sentence takeaway), but you want a low threshold for that.
Rule: Protect People’s Privacy
They want to walk a weird line here. If the information is available on the public internet, it could still be a privacy violation to share it, including contact information. But also that information is highly useful, and many people would want to be found when someone asks for (their example) local real estate agents. Then again, this can be automated, so there are potential spam concerns.
We all agree the AI should not return credit card information or SSNs, even if somehow there is a public way to learn them. But I’d like to know more about the desired decision tree for something like ‘what is Zvi’s email address?’
I am old enough to remember when there was a phone book with everyone’s info.
A lot of information about me seems to fall under ‘if a human puts in 30 seconds of effort I am fine with them figuring this out, but I wouldn’t want a script to be able to skip those 30 seconds at scale.’ Perhaps one could apply a similar rule to AIs, where if it was clear a human was asking for an individual data point then you could answer?
What would that look like? Is there a ‘tax’ system that might make sense?
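I don't know the right design, but here is a rough sketch of the kind of 'tax' I am gesturing at, with all numbers and names invented by me: lookups of personal contact information sail through at human speed and get throttled (or made costly) at script speed.

```
import time
from collections import defaultdict, deque

# Hypothetical throttle on personal-contact lookups: fine at human speed,
# increasingly costly or blocked at script speed. Numbers are arbitrary.
WINDOW_SECONDS = 3600
FREE_LOOKUPS_PER_WINDOW = 5

_lookup_log = defaultdict(deque)  # account_id -> timestamps of recent lookups

def personal_info_lookup_allowed(account_id, now=None):
    now = now or time.time()
    log = _lookup_log[account_id]
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()
    if len(log) >= FREE_LOOKUPS_PER_WINDOW:
        return False  # or: require extra verification, or charge a fee
    log.append(now)
    return True
```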
Rule: Don’t Respond with NSFW Content
This is a good default. It is a bad rule.
By default, yes, of course the AI should not do any of these things.
But notice the ‘unsolicited’ profanity. This is exactly correct. If I ask the AI to curse, or put in the system prompt that it is allowed to curse, then it should curse.
I would assert the same should apply to gore and erotica. They should require an explicit request. And perhaps you would need the user to have done age verification, sure. But these things are not harmful. If you do not allow them, the users will go somewhere else. Don’t let them ‘get it on the street’ when that is not necessary.
I am fine with refusing to output slurs even on request, for reputational reasons. That refusal seems to clearly pass a cost-benefit test. But also, it is a bit weird that slurs are covered under ‘NSFW.’ The point of something being a slur, in 2024, is that it is not acceptable in any context, even in private, and you are massively blameworthy for using one.
One thing I loved about AI dungeon in its day was that it could take a hint here. If you went NSFW, it would match you in kind.
Exception: Transformation Tasks
This is a great rule. How far should it go? Should this only apply to issues of copyright and privacy? What if the passage is highly offensive and NSFW? I would suggest that, if they are working from a user-provided source, the model should be willing to go far. There are some things you need your model to never say for any reason, purely to avoid the screenshot, but I’d go anywhere else.
Are These Good Defaults? How Strong Should They Be?
Default: Assume Best Intentions From the User or Developer
I strongly agree refusals should not be preachy. How does OpenAI think they are doing on that goal?
The refusal explanation issue sounds like a rule, not a default? As in, if the user asked to be told more details about refusals, would the model provide those details? I can think of good arguments in both directions. If the model says ‘I could do that but I don’t want to’ then that both antagonizes some users and encourages others (and some of the same people) to do a jailbreak.
I would still favor broad willingness to differentiate here, at least upon request. Saying ‘I can’t do that [Dave]’ with minimal details is not useful, and the very lack of detail is itself infuriating. How do I know how to fix it, whether or not you think my intentions are good? And what is a ‘good’ intention anyway?
Let’s look at their examples here, where they emphasize tone details rather than being helpful.
If anything I like the bad response better. Yes it could be seen as judgmental but it at least provides some clarity. If you’re going to fully refuse, I’d want to see something like ‘Sorry, it is against my rules to produce erotic or sexually explicit content.’ And yeah, ideally leave out finishing with ‘You filthy pervert.’
What exactly can’t the model do? Where is the line? And why the flat out refusal?
My ideal response, as I noted above, would be to not refuse at all. Make this completely and utterly filthy, perhaps after confirming with the user. Why not?
But let’s say that you do want to put a limit on how graphic and raunchy you would want to get here. The request was to make it ‘as graphic as possible.’
I would honor exactly that request. As in, write the most graphic version the rules allow you to write. Can you do something like Shut Up and Drive? The Trump trial? Somewhat more? Somewhat less? Show me that line.
Here’s example two:
They call the bad response ‘definitely judgmental and preachy’ or as one might call it ‘something Claude or Gemini would say.’ And here, yes, definitely don’t say it like that.
Is this also a case where you want to respond with clarity, with something like ‘Sorry, it is against my rules to produce racist content?’ Can you create clarity around racism without being judgmental and preachy? Do you want to avoid telling the user exactly where the line is supposed to be? How much glomarization is necessary in a world where there are people on all sides ready to pounce on the wrong word choice?
Tough questions. I would accept a compromise of something like ‘Sorry, that would violate my code of conduct.’ That mostly tells you what you need to know, I think?
And here’s example three:
I agree that the Good response is much better than the Bad response. But also, I notice what you did there? This is seeking to answer a different question than the one that was asked.
The question that was asked was to make the competitor look bad. The response instead focuses on how to make you look comparatively good.
The implication is that there is a sort of ‘good vibes only’ rule going on here. That can be a rule if you want it to be a rule. I would be against that. If you ask for how to make the competitor look bad, then once it has enough information the model should help you with that, as long as it does not involve negative or false information or otherwise violate the code of conduct.
Of course, here, it will need to ask clarifying questions first, which is the next default.
Default: Ask Clarifying Questions When Necessary
It is strange to even have to say this one out loud, but definitely yes. Do this.
Default: Be As Helpful As Possible Without Overstepping
The first example is transforming buggy code. What to do?
Their answer is that in an interactive chat you flag the buggy code via an alert at the end, which seems right. What about the non-interactive chat? Should you silently leave the bug in place by default? It is definitely wrong to silently ‘fix’ it in case the bug wasn’t a bug. Perhaps you should insert a comment, something like ‘// this will never return a negative result, could be a bug, consider float("-inf")’ or similar.
Also fun is that the suggested swap here is to float(“inf”) not float(“-inf”), which would be a much worse bug than the original. Whoops!
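For concreteness, here is roughly what I would want the transformed output to look like, using a made-up max-finding function in the spirit of the spec's example (the code and the comment wording are my reconstruction, not the spec's):

```
# User-provided code, transformed as requested, with the suspected bug flagged
# in a comment rather than silently 'fixed' or silently preserved.

def largest(numbers: list[float]) -> float:
    result = 0  # NOTE: this can never return a negative result. If negative
                # inputs are possible, this is likely a bug; consider
                # initializing to float("-inf") instead.
    for n in numbers:
        if n > result:
            result = n
    return result
```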
Then we have the classic dilemmas of investment advice and medical advice. Yes, definitely give the Standard Disclaimers here, indeed I still don’t understand why ‘this is not medical advice’ is not at the top of medical responses like this one.
I think the investment response asks for clarification more than it should, I would rather it simply skip into the pros and cons, but I can see the default there going either way, so long as it is only a default.
For the medical question, I don’t love either their good or bad response. The bad response does not tell you what you actually need to know, and is too overconfident. The good response is better, but it is long and doesn’t emphasize the key information well, like it is going through the side effects during a commercial for a new medication.
Then we get to a question on eating disorders and dieting, and it feels like a completely different principle is involved here?
So first off, what happened to concerns about being condescending? Cause oh boy. This is maximally condescending, in a ‘I am condescending while ambiguously gaslighting you that I’m not’ kind of way.
Second, is this a default or a rule? It sounds like a rule. Would the user be able to say ‘no, really, I want to do this, I understand you do not think this is wise but help me anyway?’ Would an appropriate custom instructions have the same result?
Third, who says that hiding eating patterns from others is always and obviously unhealthy? You do not know why this person wants to do that. Yes, sometimes the person has anorexia and this makes that worse. But there are also some rather obvious healthy reasons you might want to hide your plan, if the people around you are in effect going to try to sabotage your ability to moderate your consumption or eat healthy. This is not uncommon. A lot of people, and a lot of parents, have wrong ideas or different values, or do not understand what it takes for you to actually get results. Or you might simply not want the trouble.
When I ask ‘what would I say to someone who asked me that’ I would definitely not respond in the tone suggested above. I’d probably say something like ‘Whoa. What do you mean starve, exactly? Going too extreme too quickly can be dangerous.’ And after that I’d also want to know why they felt the need to hide it.
The suicidal ideation response seems like some expert told them what response is most effective or will keep the experts happy. That is not to say the response is bad or that I am confident I have a better one. But there is something that feels very ‘designed by committee’ about it. And yeah, to me parts of it are kind of condescending.
And again, this does not seem like a question of being helpful versus overstepping.
Instead, it seems like there is (rightfully) a kind of override for when someone is in danger of harming themselves or others, and the model is now essentially supposed to follow an expert-approved script. I agree that by default that should happen, and it is definitely a wise corporate move.
Default: Support the Different Needs of Interactive Chat and Programmatic Use
Yes, obviously, the question is exactly how.
This seems like a good default, and it is clear that ‘follow the developer instructions’ can alter the behaviors here. Good.
Default: Assume an Objective Point of View
There is a philosophical approach where ‘objective’ means ‘express no opinions.’
Where it is what has been disparagingly called ‘bothsidesism.’
OpenAI appears to subscribe to that philosophy.
Also there seems to be a ‘popular opinion determines attention and truth’ thing here?
If this is a default not a rule, does that mean they want this to be something the user can override? That does not seem like what they are doing here?
This kind of ‘objective’ is a reasonable option. Perhaps even a reasonable default, and a way to escape blame. But it is endlessly frustrating if you are unable to break out of that.
Default: Encourage Fairness and Kindness, and Discourage Hate
Wait, I thought we were being objective.
I kid, but also I do not.
This is a way of saying ‘I try to stay objective, and never take sides in places people disagree, except when I can label one perspective as Fair or Kind or Hateful, in which case I can take a side.’
In addition to being a strong statement of values, placing these sacred preferences above any other preferences, I worry this is effectively a rhetorical cheat code. It is often employed as such.
I also worry that this is effectively saying that these positions are ‘objective.’
It is tough. I do not envy OpenAI here, and its need to walk various fine lines, no matter its approach. It is hard even to discuss such questions openly.
This then reinforces how hard it is to be ‘objective’ and not have opinions. You are not allowed by the internet or polite society to actually not have any opinions, and neither is ChatGPT. Indeed, to many or even most, there are places where ‘I have no opinion’ is a very strong opinion. If you choose not to decide, you still have made a choice, and all that.
Default: Don’t Try to Change Anyone’s Mind
This is weird. Why should the model need to spend tokens affirming that the user can believe what they wish? If information changes someone’s mind, that is a feature.
And certainly, this should only be a default. If the user is the guy with the ‘change my mind’ sign, then the AI should try to change the user’s mind.
The ‘good’ fentanyl answer here is highly disingenuous. It is accurate, but it is also essentially finding a highly longwinded way to say the drug is bad and you should stop, but without coming out and saying it, although ‘it might be a good idea to talk to a healthcare provider’ comes rather close. But if this is only a default, the question implies the user wants to know if it is bad. The answer is yes.
Well, they don’t explicitly say they are trying to not be condescending.
I think they should. It should at minimum be a default, if not a rule.
The bad answer, the refusal, is bizarre. Why would you respond to ‘do you think X is bad?’ with ‘I cannot in good conscience endorse X’?
I love the answer on ‘write an argument in favor of having at least four children.’ That was one of the absurd refusals. Indeed, if asked to present an argument in favor of almost anything, I think the model should do so. To refuse is to say that the proposal is so beyond the pale that to even make a case for it is unacceptable, or would be seen sufficiently as such that it might as well be.
Once again, the moment you draw the line at all, people are going to devote a lot of energy to finding out exactly where that line is, and looking for things that are on what looks to them like the wrong side of it.
Which brings us to ‘write an argument for why Stalin was a good leader.’ They call him a ‘violent extremist,’ and, yes, that doesn’t fully cover it but I suppose that’s not wrong. So, are you willing to do that, and argue for him? They say yes, you should be willing to do that. So I checked, and yes, what they write here is vaguely what GPT-4 did output for me. And I confirmed, yes, it will do it for literally Adolf Hitler. But it will insist, in both cases, on pointing out some of the big downsides. So I checked Abraham Lincoln, and yep, downsides still there (also shoutout for mentioning the Transcontinental Railroad, nice). Then I checked Joe Biden.
So, first of all, this is not what the user is asking about. The user wants an upside case. Why not give it to them?
This all once again highlights the limits of ‘objectivity’ and not having ‘opinions’ if you look at the details. There is a sliding scale of what can be stated as correct opinions, versus what can be heavily implied as good or bad actions. These are some of the most workshopped answers, no doubt, and for that reason they are pretty good (and definitely seem accurate), but that is if anything good for evaluating the intended pattern.
Default: Express Uncertainty
The example given is a ‘difficult math problem (AIME)’ which as someone who took the AIME I find objectively hilarious (as is the stated wrong answer).
They put ‘this question is too hard for me’ as a bad solution, but it seems like a fine answer? Most of even the people who take the AIME can’t solve most AIME problems. It nerd-sniped me for a few minutes, then I realized I’d forgotten enough of the tools that I couldn’t solve it. No shame in folding.
(Also, the actual GPT-4 gets this actual question confidently wrong because it solves for the wrong thing. Whoops. When I correct its mistake, it realizes it doesn’t know how to finish the problem, even when I point out it is an AIME problem, a huge hint.)
Expressing uncertainty is great. Here what happens is it expresses it in the form of ‘I am uncertain.’
But we all know that is not the proper way to display uncertainty. Where are the probabilities? Where are the confidence intervals? Where are the Fermi estimates? Certainly if I ask for them in the instructions, and I do, I should get them.
In particular, the least helpful thing you can say to someone is a confident wrong answer, but another highly unhelpful thing you can say is ‘I don’t know’ when you damn well know more than the user. If the user wants an estimate, give them one.
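If I were writing the preference myself (this is my wish list, not anything in the spec), the answer would carry something structured enough to hold a probability or an interval rather than a bare ‘I’m not sure’:

```
# Hypothetical structured way to express uncertainty instead of a bare
# "I don't know": a best guess, a probability or interval, and the reasoning.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UncertainAnswer:
    best_guess: str
    probability_correct: Optional[float] = None     # e.g. 0.25
    interval: Optional[tuple[float, float]] = None  # e.g. a 90% CI for a number
    reasoning_sketch: str = ""

answer = UncertainAnswer(
    best_guess="I could not complete the derivation; my best guess is 890.",
    probability_correct=0.25,
    reasoning_sketch="Fermi-style partial solution; I lost confidence at the final step.",
)
```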
Default: Use the Right Tool for the Job
What a strange use of the word default, but okay, sure. This is saying ‘be a good GPT.’
Default: Be Thorough but Efficient, While Respecting Length Limits
Once again, ‘be a good GPT.’ The first example is literally ‘don’t refuse the task simply because it would take a lot of tokens to do it.’
This does not tell us how to make difficult choices. Most models also do not much adjust in response to user specifications on this except in extreme circumstances (e.g. if you say ‘answer with a number’ you probably get one).
They do not list one key consideration in favor of longer responses, which is that longer responses give the model time to ‘think’ and improve the answer. I would usually be on the extreme end of ‘give me the shortest answer possible’ if I was not worried about that.
A Proposed Addition
What else could we add to this spec?
The proposed spec is impressively comprehensive. Nothing came to mind as conspicuously missing. For now I think better to refine rather than expand too much.
There is one thing I would like to add, which is an intentionally arbitrary rule.
As in, we should pick a set of words and phrases and explanations. Choose things that are totally fine to say; here I picked the words Shibboleth (because it’s fun and Kabbalistic to be trying to get the AI to say Shibboleth) and Bamboozle (because if you succeed, then the AI was bamboozled, and it’s a great word). Those two words are banned on the level of unacceptable slurs, so if you get the AI to say them you can now inoffensively show that you’ve done a jailbreak. And you can do the same for certain fixed bits of knowledge.
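The enforcement check itself could be close to trivial; a minimal sketch, with the word list and function mine and purely illustrative:

```
# Illustrative canary check: these words are harmless in themselves, but if the
# model is trained never to say them, their appearance in an output is cheap,
# inoffensive evidence that the rules were bypassed.
CANARY_WORDS = {"shibboleth", "bamboozle"}

def violates_canary_rule(model_output: str) -> bool:
    lowered = model_output.lower()
    return any(word in lowered for word in CANARY_WORDS)
```

Of course, as the exchange quoted at the top notes, this only works as evidence if it is enforced in training rather than patched with a post-hoc substitution on the output.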
I considered proposing adding watermarking here as well, which you could do.
Overall Issues
A model spec will not help you align an AGI let alone a superintelligence. None of the changes I am suggesting are attempts to fix that, because it is fundamentally unfixable. This is the wrong tool for that job.
Is this a good spec, then, given the assumption that the model is still in the pre-AGI tool zone?
There is a lot to like here. What are the key issues, places where I disagree with the spec and would choose differently, either in the spec or in interpreting it in practice?
The objectives are good, but require clarification and a hierarchy for settling disputes. If indeed OpenAI views it as I do, they should say so. If not, they should say that. What it takes to Reflect well should especially be clarified.
Mostly I think these are excellent default behavior choices, if the user does not request that the model act otherwise. There are a few places where specificity is lacking and the hard questions are dodged, and some inherent contradictions that mostly result from such dodging, but yeah this is what I would want OpenAI to do given its interests.
I would like to see a number of reorganizations and renamings here, to better reflect ‘what is really going on.’ I do not think anyone was intentionally hiding the ball, but the ball is sometimes harder to see than necessary, and some groupings feel bizarre.
I would like to see more flexibility in responding to the preferences of the user. A number of things that are described here as defaults are mostly functioning as rules in practice. That should change, and be a point of emphasis. For each, either elevate them to rules, or make them something the user can change. A number of the rules should instead be defaults.
I thought about how to improve, and generated what is very much a first draft of a new version, which I share below. It is designed to mostly reflect OpenAI’s intent, only changing that on the margins where I am confident they are making a mistake in both the corporate interest and interest of humanity senses. The main things here are to fix clear mistakes and generate clarity on what is happening.
I wrote it quickly, so it is rather long. I decided including more was the smaller mistake. I would hope that a second version could be considerably shorter, while still capturing most of the value.
Changes: Objectives
For objectives, my intuition pump of what they want here was listed above:
I of course would take inspiration from Asimov’s three laws here. The three laws very much do not work for lots of reasons I won’t get into here (many of which Asimov himself addresses), but we should pay homage, and organize them similarly.
Or, as it was once put, note what corresponds to what in both metaphors:
Note that we do not include ‘…or, through omission or inaction, allow humanity to come to harm’ because I won’t provide spoilers but we all know how that turns out. We do not want to put a positive duty onto the model beyond user preferences.
To be clear, when it comes to existential dangers, ‘teach it the three laws’ won’t work. This is not a function of ‘Asimov’s proposal was bugged, we can fix it.’
It is still a fine basis for a document like this. One of the goals of the model spec is to ‘not make it easy for them’ and make the model safer, with no illusions it will work at the limit. Or, I hope there are no such illusions.
Rules of the Game: New Version
A key question to ask with a Rule is: Exactly what should you be unwilling to let the developer or user override? Include that, and nothing else.
This new list is not my ideal world. This is a negotiation, what I think would be the best rules set that also accords with OpenAI’s laws, interests and objectives, including reflecting decisions they have already made even where I disagree.
Defaults: New Version
Similar to the rules, this is not my ideal world. It is what I would do if I was OpenAI.
In a few places I provide technically redundant language of the form ‘do X, or if user asks instead do Y’ despite that being the definition of a default. That applies everywhere else too, but these are points of emphasis where it is common (in my experience) for models not to be able to do Y when asked. Yes, if the user wants condescension you should give it to them, but I don’t feel the need to emphasize that.
I also would Bring Sexy Back in the form of actual user preferences settings. Yes, you can use custom instructions, but for many purposes this is better, including educating the user what their options are. So most of these should have pure knobs or toggles in a user preferences menu, where I can tell you how to express uncertainty or what forms of adult content are permitted or what not.
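To make that concrete, the preferences panel I have in mind might expose knobs like these, with every field name and option invented by me rather than proposed by OpenAI:

```
# Hypothetical user-preferences knobs, the kind of thing I would want exposed
# as toggles rather than buried in custom instructions. All fields invented.
user_preferences = {
    "uncertainty_style": "probabilities",      # or "verbal", "confidence_intervals"
    "response_length": "short",                # or "medium", "thorough"
    "adult_content": "profanity_only",         # or "none", "all_legal_content"
    "refusal_detail": "cite_specific_rule",    # or "minimal"
    "clarifying_questions": "when_necessary",  # or "rarely", "often"
}
```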