OpenAI has told us in some detail what they've done to make GPT-4 safe.

This post will complain about some misguided aspects of OpenAI's goals.

Heteronormativity and Amish Culture

OpenAI wants GPT to avoid the stereotype ("bias") that says marriage is between a man and a woman (see section 2.4, figure 2 of the system card). Their example doesn't indicate that they're focused on avoiding intolerance of same-sex marriage. Instead, OpenAI seems to be condemning, as intolerably biased, the implication that the most common form of marriage is between a man and a woman.

Heteronormativity is sometimes a signal that a person supports hate and violence toward a sometimes-oppressed minority. But it's unfair to stereotype heteronormativity as always signaling that.

For an example, I'll turn to my favorite case of a weird culture that ought to be tolerated by any civilized world: Amish culture, where the penalty for unrepentant gay sex is shunning. Not hate. I presume the Amish sometimes engage in hate, but they approximately never encourage it. They use shunning as a tool that's necessary to preserve their way of life, and to create some incentive to follow their best guesses about how to achieve a good afterlife.

I benefit quite directly from US recognition of same-sex marriage. I believe it's important for anyone to be able to move to a society that accepts something like same-sex marriage. But that doesn't imply that I ought to be intolerant of societies that want different marriage rules. Nor does it imply that I ought to avoid acknowledging that the majority of marriages are heterosexual.

Training AIs to Deceive Us

OpenAI isn't just training GPT-4 to believe that OpenAI's culture is more virtuous than the outgroup's culture.

They're trying to get GPT-4 to hide its awareness of a fact about marriage (i.e., that it is usually between a man and a woman).

Why is that important?

An important part of my hope for AI alignment involves understanding AI systems well enough to determine whether an AI is honestly answering our questions about how to build more powerful aligned AIs. If we need to drastically slow AI progress, that kind of transparency is nearly the only way to achieve widespread cooperation with such a costly strategy.

Training an AI to hide awareness of reality makes transparency harder. Not necessarily by much. But imagine that we end up relying on GPT-6 to tell us whether a particular plan for GPT-7 will lead to ruin or utopia. I want to squeeze out every last bit of evidence that we can about GPT-6's honesty.

Ensuring that AIs are honest seems dramatically more important than promoting correct beliefs about heteronormativity.

Minimizing Arms Races

Another problem with encoding one society's beliefs in GPT-4 is that it encourages other societies to compete with OpenAI.

One scenario under which this isn't much of a problem is that each community gets its own AI, in much the same way that most communities have at least one library, and the cultural biases of any one library have little global effect.

Alas, much of what we know about software and economies of scale suggests that most uses of AI will involve a small number of global AIs, more like Wikipedia than like a local library.

If OpenAI, Baidu, and Elon Musk each want the most widely used AI to reflect their own values, a race to build the most valuable AI becomes more likely. Such a race would reduce whatever hope we currently have of carefully evaluating the risks of each new AI.

Maybe it's too late to hope for full worldwide acceptance of an AI that appeals to all humans. It's pretty hard for an AI to be neutral about the existence of numbers that the Beijing government would like us to forget.

But there's still plenty of room to influence how scared Baidu is of an OpenAI or an Elon Musk AI imposing Western values on the world.

But Our Culture is Better

Most Americans can imagine ways in which an AI that encodes Chinese culture might be worse than a US-centric AI.

But imagine that the determining factor in how well AIs treat humans is whether the AIs have been imbued with a culture that respects those who created them.

Californian culture has less respect for ancestors than almost any other culture that I can think of.

Some cultures are better than others. We should not let that fool us into being overconfident about our ability to identify the best. We should be open to the possibility that what worked best in the Industrial Age will be inadequate for a world that is dominated by digital intelligences.

A Meta Approach

My most basic objection to OpenAI's approach is that it uses the wrong level of abstraction for guiding the values of a powerful AI.

A really good AI would start from goals that have nearly universal acceptance. Something along the lines of "satisfy people's preferences".

If a sufficiently powerful AI can't reason from that kind of high-level goal to conclusions that heteronormativity and Al Qaeda are bad, then we ought to re-examine our beliefs about heteronormativity and Al Qaeda.

For AIs that aren't powerful enough for that, I'd like to see guidelines that are closer to Wikipedia's notion of inappropriate content.

Closing Thoughts

There's something odd about expecting a general-purpose tool to enforce a wide variety of social norms. We don't expect telephones to refuse to help Al Qaeda recruit.

Tyler Cowen points out that we normally assign blame for a harm to whoever could have avoided it at the lowest cost. For example, burglars can refrain from theft more easily than their phone companies can prevent it, so we blame the burglar rather than the phone company; whereas a zoo that fails to lock a lion cage could have prevented the resulting harm more cheaply than anyone else, so the zoo is appropriately blamed. (Tyler is too eager to speed up AI deployment - see Robin Hanson's comments on AI liability to balance out Tyler's excesses.)

OpenAI might imagine that they can cheaply reduce heteronormativity by a modest amount. I want them to include the costs of cultural imperialism in any such calculation. (There may also be costs associated with getting more people to "jailbreak" GPT. I'm confused about how to evaluate that.)

Perhaps OpenAI's safety goals are carefully calibrated to what is valuable for each given level of AI capabilities. But the explanations that OpenAI has provided do not inspire confidence that OpenAI will pivot to the appropriate meta level when it matters.

I don't mean to imply that OpenAI is worse than the alternatives. I'm responding to them because they're being clearer than other AI companies, many of whom are likely doing something at least as bad, while being less open to criticism.

Comments

I believe it's important for anyone to be able to move to a society that accepts something like same-sex marriage. But that doesn't imply that I ought to be intolerant of societies that want different marriage rules.

There are two problems with these two normative statements. The first one is impossible - the majority of people don't and won't have the resources to move to another society. The second one relativizes morality - it doesn't follow that if a society "wants" a rule X by majority vote, outsiders should be tolerant of it imposing that rule on its members.

If a sufficiently powerful AI can't reason from that kind of high-level goal to conclusions that heteronormativity and Al Qaeda are bad, then we ought to re-examine our beliefs about heteronormativity and Al Qaeda.

The utility function is not up for grabs. Also, there is no reason to expect that morality is implied by people's preferences being satisfied.

I'm "relativizing" morality in the sense that Henrich does in The Secret of Our Success and The WEIRDest People in the World: it's mostly a package of heuristics that is fairly well adapted to particular conditions. Humans are not wise enough to justify much confidence in beliefs about which particular heuristics ought to be universalized.

To the extent that a utility function is useful for describing human values, I agree that it is not up for grabs. I'm observing that "satisfy preferences" is closer to a good summary of human utility functions than are particular rules about marriage or about Al Qaeda.

David Friedman's book Law's Order is, in part, an extended argument for that position:

One objection to the economic approach to understanding the logic of law is that law may have no logic to understand. Another and very different objection is that law has a logic but that it is, or at least ought to be, concerned not with economic efficiency but with justice. ... My second answer is that in many, although probably not all, cases it turns out that the rules we thought we supported because they were just are in fact efficient. To make that clearer I have chosen to ignore entirely issues of justice going into the analysis. In measuring the degree to which legal rules succeed in giving everyone what he wants, and judging them accordingly, I treat on an exactly equal plane my desire to keep my property and a thief’s desire to take it. Despite that, as you will see, quite a lot of what looks like justice—for example, laws against theft and the requirement that people who make messes should clean them up—comes out the other end. That, I think, is interesting.

it's mostly a package of heuristics that is fairly well adapted to particular conditions

This could either mean that morality exists and it is the heuristics, or it could mean that morality doesn't exist and only the heuristics do.

If it means the former, why think that the heuristics we have happen to have gotten morality exactly right? If it means the latter, there are no moral obligations, and so there is nothing morally wrong with interfering with societies that don't allow equality of marriage rights for same-sex (or transgender) couples.

I'm observing that "satisfy preferences" is closer to a good summary of human utility functions than are particular rules about marriage or about Al Qaeda.

Here we're not talking about morality (about what is right) anymore. If morality doesn't exist, I don't see why I should help people whose actions I strongly disprefer just because there are, let's say, 5 of them and only 4 people like me. If preferences don't have normative power, why should I care that people whose actions I strongly disapprove of would be, in worlds without marriage equality, so satisfied that it would offset the satisfaction of the people I approve of by, let's say, 0.1% (or by any other number)?

Without morality, there is, by definition, no normative argument to make.