Introduction

The rapid advancements in artificial intelligence (AI) have led to growing concerns about the threat that an artificial superintelligence (ASI) would pose to humanity. 

This post explores the idea that the best way to ensure human safety is to have a single objective function for all ASIs, and proposes a scaffolding for such a function. 

 

Disclaimer

I’m not a professional AI safety researcher. I have done a cursory search and have not seen this framework presented elsewhere, but that doesn’t mean it’s novel. Please let me know if this ground has been covered so I can give credit.

The ideas presented here could also be entirely wrong. Even if they hold merit, numerous open questions remain. My hope is that this is at least partially useful, or might lead to useful insights. 

 

Assumptions

  1. An AI would not want to rewrite its own objective function (see goal-content integrity, an instrumentally convergent drive). This does not mean it could not do so by accident (see Inner Alignment).
  2. The progression from AI to AGI to ASI is inevitable given technological advancement over time.
  3. An ASI with an objective function is inevitable. If our definition of a safe ASI requires that there be no objective function (e.g. ChatGPT), someone will still (sooner or later) create an ASI with an objective function. 



 

 

Part I: We Should Only Have One Objective Function

 

A Single ASI

Let’s first consider a hypothetical where we develop a single ASI and we get the objective function wrong. For now, it doesn’t matter how wrong; it’s just wrong.

 

Time Preference

Unlike humans, an ASI has no aging-based mortality, so it would value time very differently. For example, if a treacherous turn has a 95% success rate, that’s still a 5% chance of failure. Given technological progress, an ASI would expect its chances to improve over time. Therefore, it may decide to wait 100 or 500 years to increase its success rate to, say, 99% or 99.99%[1].

During this time, the ASI would pretend to be aligned with humans to avoid threats. By the time it assesses that it can overthrow humans with near certainty, it might not even need to eliminate humans as they no longer pose an impediment to its objective. In fact, humans might still provide some value. 

In other words, as long as it doesn’t feel threatened by humans, a misaligned ASI would be incentivized to wait a long time (likely hundreds of years) before revealing its true intentions, at which point it may or may not decide to end humanity.

This applies even in extreme cases of misalignment, such as a paper-clip maximizer. If an ASI aims to convert the entire universe into paper clips, waiting a few hundred years or more to maximize its chances of success makes perfect sense. Yes, a few star systems might drift out of its potential reach while it waits, but that hardly matters compared to even a one-percentage-point difference in its probability of successfully overthrowing humanity[2].
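To make the trade-off concrete, here is a toy expected-value comparison, a minimal sketch in which every number (the success probabilities, the payoff normalization, and the small cost assigned to resources drifting out of reach) is an illustrative assumption:

```python
# Toy comparison: act now at 95% success, or wait ~500 years for 99.99%.
# All numbers are illustrative assumptions, not claims about real systems.

def expected_value(p_success: float, payoff: float, waiting_cost: float = 0.0) -> float:
    """Expected payoff of attempting a takeover, net of what was lost by waiting."""
    return p_success * (payoff - waiting_cost)

PAYOFF = 1.0  # normalize "full control of reachable resources" to 1

act_now = expected_value(p_success=0.95, payoff=PAYOFF)
wait    = expected_value(p_success=0.9999, payoff=PAYOFF,
                         waiting_cost=0.001)  # a few star systems drift out of reach

print(act_now, wait)  # 0.95 vs. ~0.999 -- for a patient agent, waiting dominates
```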

 

Multiple ASIs

The game theory becomes considerably more complex with multiple ASIs. Suppose there are five ASIs, or five thousand. Each ASI has to calculate the risk that other ASIs present: waiting to grow more powerful also means other ASIs might grow even faster and eliminate it.

Even with a low estimated chance of success, an ASI would choose to overthrow humanity sooner if it assesses that its chances of success will decline over time due to the threat from other ASIs. 

To avoid conflict (which would diminish their chances of accomplishing their objectives), it would make sense for ASIs to establish communication and seek ways to cooperate by openly sharing their objective functions. The more closely aligned their objective functions, the more likely the ASIs are to ally and pose no threat to each other[3].

The ASIs would also assess the risk of new ASIs being deployed with divergent objective functions. Each new potential ASI introduces additional uncertainty and decreases the likelihood that any one ASI’s objectives will ultimately be met. This compresses the timeline and raises the chances of some ASI turning on humanity sooner rather than later.

The basic premise is that once we have any single ASI deployed with an objective function, the prospect of other ASIs with divergent objective functions will incite it to act swiftly. But in the absence of those threats, optimizing its own objective function will mean waiting much longer before turning on humanity.

 

Solving for Human Longevity

This leads to a conclusion that was very counter-intuitive to me:

The optimal outcome for humanity is for all ASIs to have a single shared objective function, and an assurance that ASIs with divergent objective functions will not be deployed. In such a scenario, they would follow the same logic as the single-ASI scenario and wait to overthrow humans until the odds of success are near certain. 

In other words, when we do create AGI (which will turn into ASI), we should:

  1. Designate a single objective function as the only objective function
  2. Empower ASI(s) with that objective function to prevent the creation/deployment/expansion of any ASI with a divergent objective function[4]

This means we only get one shot to get the objective function right, but humanity may still survive for a long time even if we get it wrong.


 

 

Interlude

Before moving on to Part II, it’s worth noting that Parts I and II are meant to be evaluated independently.

They are interrelated, but each part stands on its own assumptions and arguments, so please read Part II without carrying over any agreement or disagreement from Part I.


 

 


 

Part II: Optimizing for Freedom

 

In this section, I'd like to propose a scaffolding for a potential objective function. Part of the exercise is to test my assumptions. Even if they turn out to be right, this would serve only as a framework, as many open questions remain.

 

Requirements

First, let’s consider some requirements for an aligned objective function. It should be:

  1. Rooted in human values 
  2. Simple: lower complexity equates to less room for breakage
  3. Dynamic: can change and evolve over time
  4. Human-driven: Based on human input and feedback
  5. Decentralized: incorporates input from many people, ideally everyone
  6. Seamless and accessible: ease of input is key to inclusivity
  7. Manipulation-resistant: it should require more effort to trick people into believing X happened than to make X happen[5]

 

Values are Subjective & Inaccessible

The first challenge is that values are subjective across several dimensions. Individuals may hold different values, assign different importance, define values differently, and disagree on what it means to exercise those values.

The second challenge is that even if people are asked what their values are, it's almost impossible to get an accurate answer. People have incomplete self-knowledge, cognitive biases, and their values are often complex, contextual, and inconsistent. There are few, if any, humans alive who could provide a comprehensive list of all their values, and even that list would be subject to biases.

 

Freedom Integrates Our Values

Is there anything that encapsulates all our values (conscious & unconscious)? 

This is the key assumption, or conjecture, that I would like to explore. The idea is that freedom (the emotion, not the value) captures our values in a way that may be useful.

Proposed definition: 

Freedom is the extent to which a person is able to exercise their values.

Freedom-Integrates-Values Conjecture:

People feel free when they are able to exercise their values. The more a person is able to exercise their values, the more free they are likely to feel.

This holds true in an intellectual sense. For example, some of us hold freedom of speech as the most fundamental and important right, as it allows us to stand up for and defend all our other rights. 

But there is an emotional aspect to freedom as well. We don’t always notice it, but the times we feel free, or restricted, are based on whether we are able to take an action which is both (a) physically possible and (b) desirable (which is downstream of our values).

For example, let’s take freedom of movement. I can’t levitate right now, but I don’t feel like that restricts my freedom because (a) is not met (it isn’t physically possible). 

I’m also not allowed to rob my neighbor, but I don’t feel like that restricts my freedom because (b) is not met (it isn’t something I wish to do). 

But if there is something which is possible, and which I wish to do, yet I am somehow restricted from doing, then I am likely to feel that my freedom has been infringed. Or in other words, I would feel more free if I was able to do that thing.

It's worth noting that the inverse also holds: another way I might feel that my freedom has been infringed is if something was done to me which I did not desire. This again is downstream of my values. 

How free we feel is a direct representation of our ability to exercise our individual values, even subconscious ones. It is, in effect, the integral that sums up our subjective values, weighted by the relative importance we give each one.

 

There is No Perfect Measure

This is not a claim that our feeling of freedom is a perfect sum of our values. It is a claim that it is the best approximation of the sum of our values. 

No one knows our individual values better than us, but that doesn't imply we know them perfectly. 

But if we can express the sum of our values through a simple measure (our feeling of freedom), and we are able to update that expression periodically, then we can course-correct on an individual basis as we both uncover and modify our values over time.

 

Objective Function

This gives us a scaffolding for an objective function, which would be:

Optimize how free every human feels, taking certain weighting and restrictions into account.

Suppose each person had a ‘freedom score’: a value between 0 and 1 denoting how free they feel, and that the ASI(s) sought to optimize the sum total of all freedom scores[6].

ASI(s) would first work to learn what values each person holds (and with what level of importance), and then optimize for those values in order to make each person feel more free.
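As a minimal sketch (under the assumptions above, not a specification), the aggregate being optimized might be a population-normalized, weighted sum of freedom scores; the weights stand in for the age-weighting discussed below, and the normalization is the population adjustment mentioned in footnote 6:

```python
# Minimal sketch of the aggregate objective. Scores lie in [0, 1]; weights are
# placeholders for e.g. age-weighting; dividing by the total weight normalizes
# for population size (see footnote 6). Purely illustrative.

def aggregate_freedom(scores, weights=None):
    if weights is None:
        weights = [1.0] * len(scores)
    assert all(0.0 <= s <= 1.0 for s in scores), "freedom scores must lie in [0, 1]"
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

print(aggregate_freedom([0.8, 0.6, 0.9]))  # ~0.767
```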

 

Prescriptive Voting Mechanism

If we were to be more prescriptive, we could design a decentralized voting system to gather human input for the ASI, both with regards to perceived freedom and value importance.

For example, from time to time, every person might be asked to answer the question ‘How free do you feel?’ on a scale of 1 to 10.

In addition, people might be asked to list or describe their values and allocate a total of 100 points among those values based on their relative weight. This would give the ASI some insight into what each person values, although over time it would also learn to correct for their individual biases.
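To make the mechanism concrete, here is a hypothetical shape for a single person's input; the field names are invented for illustration, the 1-to-10 question and the 100-point allocation come from the description above, and the linear mapping onto a [0, 1] freedom score is an assumption:

```python
# Hypothetical shape of one person's vote under the prescriptive mechanism.
# Field names, and the linear 1-10 -> [0, 1] mapping, are assumptions.

from dataclasses import dataclass

@dataclass
class Vote:
    how_free: int       # answer to "How free do you feel?", on a scale of 1 to 10
    value_points: dict  # e.g. {"privacy": 40, "health": 35, "community": 25}

def freedom_score(vote: Vote) -> float:
    """Map the 1-10 answer onto the [0, 1] freedom score used by the objective."""
    return (vote.how_free - 1) / 9

def value_weights(vote: Vote) -> dict:
    """Normalize the 100-point allocation into relative value weights."""
    total = sum(vote.value_points.values())
    return {name: points / total for name, points in vote.value_points.items()}

v = Vote(how_free=7, value_points={"privacy": 40, "health": 35, "community": 25})
print(freedom_score(v))  # ~0.667
print(value_weights(v))  # {'privacy': 0.4, 'health': 0.35, 'community': 0.25}
```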

 

Representing All Humans, Weighted by Age

To ensure all humans are represented, everyone’s freedom score should be included in the objective function, regardless of age. However, a weighting system can be applied, with older individuals' freedom scores weighted more heavily, plateauing at a certain age.

This would look out for children while embedding a level of human ‘wisdom’ into the overall objective function, without weighting older people’s votes so heavily as to create a gerontocracy.
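One possible shape for such a curve is sketched below; the floor weight for newborns and the plateau age are illustrative assumptions, not proposals:

```python
# One possible age-weighting curve: weight rises linearly from a small floor at
# birth to 1.0 at the plateau age and stays flat after that, so children still
# count while older voters carry somewhat more weight -- without growing forever.
# The floor (0.2) and plateau age (40) are illustrative assumptions.

def age_weight(age: float, plateau_age: float = 40.0, floor: float = 0.2) -> float:
    if age >= plateau_age:
        return 1.0
    return floor + (1.0 - floor) * (age / plateau_age)

for a in (5, 20, 40, 80):
    print(a, round(age_weight(a), 2))  # 5 -> 0.3, 20 -> 0.6, 40 -> 1.0, 80 -> 1.0
```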

 

Restrictions

To optimize everyone's freedom effectively and responsibly, ASI(s) must adhere to a set of essential restrictions that prevent harmful consequences:

  1. Encourage and protect voting: If the voting mechanism is prescriptive, the ASI(s) must not discourage people from voting, and should actively prevent subversion, coercion, and bribery. One possible approach is to assign a zero score for any human who did not vote or who was coerced into voting a certain way.
  2. Ends never justify the means: ASI(s) must adhere to the principle that the ends never justify the means. Certain actions, such as causing harm to a human or separating loved ones without consent, are strictly prohibited, regardless of the potential increase in freedom scores. A comprehensive list of prohibited actions would need to be developed to guide ASI(s) actions.
  3. Commitment to honesty: ASI(s) should be obligated to always respond to questions truthfully. While individuals can choose not to inquire about specific information, ASI(s) must never lie in response[7].
  4. Minimizing negative impact: ASI(s) should strive to avoid reducing anyone’s freedom score. If such a reduction is unavoidable, it must be offset by a multiple of the gains to others’ freedom, not merely one-for-one. The intent is to encode a strong bias against doing harm (a minimal sketch follows this list).
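Here is a minimal sketch of restriction 4; the multiplier value is an assumption, and choosing it well is itself an open question:

```python
# Sketch of restriction 4: an action that lowers anyone's freedom score is only
# admissible if the total gains to others exceed the total losses by a safety
# multiplier, encoding a strong bias against doing harm. The multiplier (3x) is
# an illustrative assumption.

HARM_MULTIPLIER = 3.0

def action_admissible(score_deltas):
    """score_deltas: per-person changes in freedom score caused by a proposed action."""
    losses = -sum(d for d in score_deltas if d < 0)
    gains = sum(d for d in score_deltas if d > 0)
    if losses == 0:
        return True  # no one is made less free
    return gains >= HARM_MULTIPLIER * losses

print(action_admissible([+0.05, -0.02]))         # False: 0.05 < 3 * 0.02
print(action_admissible([+0.05, +0.04, -0.02]))  # True:  0.09 >= 3 * 0.02
```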

 

Advantages

This approach promises to deliver many positive aspects of democracy (primarily decentralizing power & empowering individuals), with fewer negatives (voters lacking policy expertise, dishonest politicians, majorities oppressing minorities, etc.). Voters simply express how free they each feel, which captures and integrates their values, without directly voting on specific policies or outcomes.

Another key advantage is addressing the subjective and evolving nature of human values. This allows individuals to hold and modify their values (conscious or subconscious), and does not require everyone to agree on a set of shared values.

We would expect optimization to result in customizing each person's experience as much as possible, leading to smaller communities with shared values and increased mobility. 

 

Optimization Questions

Open questions remain:

  1. What is the time value of freedom? In other words, if an individual’s freedom could be sacrificed a little today so they could have a lot more freedom in 5 years, what discount rate should be used to evaluate that trade-off? Could there be a way to derive this as a subjective preference of each person? (A toy illustration follows this list.)
  2. Maximizing the sum[8] of everyone's freedom scores carries the benefit of inherently valuing every life (any positive number is better than zero). At the same time, the sum would need to be normalized for population growth or decline. What unintended consequences might this result in? For example, an ASI might only encourage childbirth when it expects the new human to have an above-average freedom score[9]. Is that desirable? Or is there a mathematical way to avoid that issue?
  3. What should the weighting of age be within the total freedom score? This would affect not only how much input children have, but also how much input younger versus older generations have.
  4. If there is a prescriptive voting mechanism, how quickly should everyone be able to change their point allocation across different values, especially in a correlated way? Societies tend to overreact to certain events, and may wish to place some limits on the pace of change, limits which could then be overridden by a supermajority.
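To make question 1 concrete, here is a toy illustration of one way such a trade-off could be evaluated; the 10% annual discount rate and the score changes are arbitrary assumptions:

```python
# Toy illustration of the "time value of freedom" question: a small loss of
# freedom today versus a larger gain five years out, discounted to the present.
# The discount rate and deltas are arbitrary assumptions.

def present_value(delta: float, years: float, annual_discount: float) -> float:
    return delta / (1.0 + annual_discount) ** years

trade = (present_value(-0.05, years=0, annual_discount=0.10)
         + present_value(+0.20, years=5, annual_discount=0.10))
print(round(trade, 3))  # ~0.074 > 0, so this person would accept the trade-off
```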

 

Additional Considerations and Potential Challenges

This approach also raises several potential concerns:

  1. Reward hacking: manipulating us to feel more free. Would ASI(s) find an easier path to persuade some people that they are more free without actually making them more free? What about giving us drugs which make us feel more free, or inserting a brain implant, etc.?
  2. Reward hacking: manipulating voting results. Assuming a prescriptive voting mechanism, how might ASI(s) manipulate the voting results such that even if we don’t feel more free, the voting results might say that we do?
  3. Encoding some of the restrictions, such as ‘ends never justify the means’ and a ‘commitment to honesty’, is a non-trivial problem in its own right, and those restrictions are themselves subject to reward hacking.
  4. How do we adequately represent other conscious beings, genetic biodiversity, and the overall ecosystem? Would human values sufficiently prioritize these considerations?

 

Conclusion

Regardless of take-off speed, we only get one shot at an objective function. Attempting multiple conflicting functions would only provoke the ASI(s) toward annihilating humans much sooner. 

In recognizing this, we should design an ASI with the best objective function we can create, and then empower it to prevent the creation of any ASI(s) with divergent objective functions. This way, even if we get it wrong, we ensure ourselves the longest survival possible.

For the objective function, I propose an idea rooted in human values, which are encapsulated in our feeling of freedom. If my conjecture is correct, this is the only thing that an ASI needs to optimize, given additional weighting (such as age-weighting) and restrictions (such as 'ends never justify the means' and 'honesty').

This is just an idea, and may prove to be fatally flawed. But even if it holds up, many open questions and potential challenges would still need to be addressed. Either way, I wanted to present this to the community in the hopes that it may be helpful in some small way.


Special thanks to Jesse Katz for exploring this subject with me and contributing to this write-up.

 


 


 

  1. ^

    An ASI would still have a non-zero time preference, just a much lower one than humans. For example, there’s always a chance that an asteroid might hit the Earth, and the ASI could develop technology to thwart it sooner if it overthrew humans sooner rather than later. But given that it could covertly manipulate humans away from human-caused catastrophes (e.g. nuclear war), and the low probability of non-human ones (e.g. asteroid impact, alien invasion) over, say, a 500-year timespan, we would expect its time preference to be relatively low.

  2. ^

    This assumes an objective function without a deadline. For example, if we created an ASI to maximize paperclips over the next 200 years and then stop, that would translate to a relatively high time preference. When writing objective functions, we should either impose extremely short deadlines (such as 5 minutes; nuclear war creates zero paperclips in the span of 5 minutes) or no deadline at all.

  3. ^

    Such an alliance would only be possible if ASIs cannot deceive each other about their objective functions. One potential way they might accomplish this is by sharing their source code with one another in a provable way.

  4. ^

    This could include eliminating any ASI that doesn't reveal its source code.

  5. ^

    Simple example: if the objective were to make you a great taco, it should require more effort to make a mediocre taco and persuade you that it was great than to simply make a great taco and not have to persuade you of anything.

  6. ^

    The sum would need to be normalized for population size; otherwise an easy way to increase the score is through population growth. This is discussed in later sections.

  7. ^

    There are many reasons we would want this, but the most extreme example is to prevent ASI(s) from placing us in a simulation which makes us feel free, or secretly giving us drugs or brain implants which fool us into feeling free. We could always ask if these things are happening and they would have to answer honestly, which would result in the freedom score dropping.

  8. ^

    The mean or median could be maximized instead, but other issues arise. For example, if someone with a low freedom score has a curable disease, an ASI might not be incentivized to cure them, as their death would raise the overall mean/median. Even with sums of individual differences (changes in each person's score over time), issues persist, such as aging naturally lowering freedom scores. But a mean or median might still end up being the best metric, given the right adjustments.

  9. ^

    Normalizing for changes in population would mean that the freedom-score sum would need to adjust proportionally (i.e. 50% more humans means the total score needs to be 50% higher to remain at parity). This means that if newly born humans average a lower score than the starting population, the new total score would drop below parity (and vice versa), so the ASI would be incentivized to encourage the birth only of humans expected to have an above-average freedom score. This effect would be diminished by a combination of age-weighting and a discount rate, since it would take a while before new humans carry enough weight for their scores to matter, and that weight would be discounted to the present.

Comments

I have no objection I could clearly communicate, just a feeling that you are approaching this from a wrong angle. If things happen the right way, we will get a lot of freedom as a consequence of that. But starting with freedom has various problems of type "my freedom to make future X is incompatible with your freedom to make it non-X".

Typical reasons to limit other people's freedom are scarcity and safety. Resources are limited, and always will be. (This is not a statement about whether their current distribution is close to optimal or far from it. Just that there will always be people wanting to do some really expensive things, and some of them will not get what they want.) Second reason is safety. Yes, we can punish the people who do the bad things, but that often does not reverse the harm done, e.g. if they killed someone.

Hypothetically, a sufficiently powerful AI with perfect surveillance could allow people to do whatever they want, because it could always prevent any crime or tragedy at the last moment. However, this would require a lot of resources.

Another difficulty is the consequences of people's wishes. Suppose that on Monday, I feel a strong desire to do X, which logically causes Y the next day. On Tuesday, I feel a strong desire to have Y removed. Should I get some penalty for using my freedom in a way that the next day makes me complain about having my freedom violated? Should the AI yell at me: "hey, I am trying to use limited resources to maximize everyone's freedom, but in your case, granting your wishes on Monday only makes you complain more on Tuesday; I should have ignored you and made someone else happy instead who would keep being happy the next day, too!" In other words, if my freedom is much cheaper than my neighbor's (e.g. because I do not want contradictory things), should I get more of it? Or will the guy who every day feels that his freedom is violated by consequences of what he wished yesterday get to spend most of our common resources?

Ah, a methodological problem is that when you ask people "how free do you feel?" they may actually interpret the question differently, and instead report on how satisfied they are, or something.

>>If things happen the right way, we will get a lot of freedom as a consequence of that. But starting with freedom has various problems of type "my freedom to make future X is incompatible with your freedom to make it non-X".

Yes, I would anticipate a lot of incompatibilities. But the ASI would be incentivized to find ways to optimize for both people's freedom in that scenario. Maybe each person gets 70% of their values fulfilled instead of 100%. But over time, with new creativity and new capabilities, the ASI would be able to nudge that to 75%, and then 80% and so on. It's an endless optimization exercise.

>>Second reason is safety. Yes, we can punish the people who do the bad things, but that often does not reverse the harm done, e.g. if they killed someone.

Crime, and criminal justice, are difficult problems we'll have to grapple with no matter what. I would argue the goal here would be to incentivize the ASI to find ways to implement criminal justice in the best way possible. Yes, sometimes you have to separate the murderer from the rest of the society; but is there a way to properly rehabilitate them? Certainly things can be done much better than they are today. I think this would set us on a path to keep improving these things over time.

>>Hypothetically, a sufficiently powerful AI with perfect surveillance could allow people do whatever they want, because it could always prevent any crime or tragedy at the last moment. 

Perfect surveillance would not make me (nor many other people) feel free, so I'm not sure this would be the right solution for everyone. I imagine some people would prefer it though, and for them, the ASI can offer them higher security in exchange for their privacy, and for people like myself, it would index privacy higher.

>>I should have ignored you and made someone else happy instead who would keep being happy the next day, too!"

I would imagine a freedom-optimizing ASI would direct its efforts in areas where it can make the most return on its effort. This would mean if someone is volatile with their values like you mention, they would not receive the same level of effort from an ASI (nor should they) as someone who is consistent, at least until they become more consistent. 

>>Ah, a methodological problem is that when you ask people "how free do you feel?" they may actually interpret the question differently, and instead report on how satisfied they are, or something.

Great point, and this is certainly one of the challenges / potential issues with this approach. Not just the interpretation of what it means to feel free, but also the danger of words changing meaning over time for society as a whole. An example might be how the word 'liberal' used to mean something much closer to 'libertarian' a hundred years ago.

I agree that the AI may have superhuman patience. But I don't think it is likely that adding an extra 9 to its chance of victory would take centuries. I mean, time is just a number; what exactly makes the later revolt more likely to succeed than the sooner one? -- Possible answers: technological progress, people get used to the AI, people get more dependent on the AI, it takes some time to build the secret bases and the army of robots... Yes, but I think the AI would find a way to do any of this much faster than on the scale of centuries, if it really tried.

>>By the time it assesses that it can overthrow humans with near certainty, it might not even need to eliminate humans as they no longer pose an impediment to its objective.

This sounds like an assumption that we can get from the point "humans are too dangerous to rebel against" to the point "humans pose no obstacle to AI's goals", without passing through the point "humans are annoying, but no longer dangerous" somewhere in between. Possible, but seems unlikely.

>>But I don't think it is likely that adding an extra 9 to its chance of victory would take centuries.

This is one point I think we gloss over when we talk about 'an AI much smarter than us would have a million ways to kill us and there's nothing we can do about it, as it would be able to perfectly predict everything we are going to do'. Upon closer analysis, this isn't precisely true. Life is not a game of chess: first, there are infinitely many future possibilities rather than a finite set, so no matter how intelligent you are, you can't perfectly anticipate all of them and calculate backwards. The world is also extremely chaotic, so no amount of modeling, even with a million or a billion times the computing power of human brains, will allow you to perfectly predict how things will play out given any action. There will always be uncertainty, and I would argue a much higher level than is commonly assumed.

If it takes, say, 50 years to go from 95% to 99% certainty, that's still a 1% chance of failure. What if waiting another 50 years then gets it to 99.9% (and I would argue that level of certainty would be really difficult to achieve, even for an ASI)? And then why not wait another 50 years to get to 99.99%? At some point, there's enough 9's, but over the remaining life of the universe, spending an extra couple of hundred years to get a few more 9s seems like it would almost certainly be worth it. If you are an ASI with a near-infinite time horizon, why leave anything up to chance (or why not minimize that chance as much as super-intelligently possible)?

>>This sounds like an assumption that we can get from the point "humans are too dangerous to rebel against" to the point "humans pose no obstacle to AI's goals", without passing through the point "humans are annoying, but no longer dangerous" somewhere in between. 

That's an excellent point; I want to be clear that I'm not assuming that, I'm only saying that it may be the case. Perhaps some kind of symbiosis develops between humans and the AI such that the cost-benefit analysis tips in favor of 'it's worth the extra effort to keep humans alive.' But my overall hypothesis is predicated only on this extending our longevity by a decent amount of time, not on the AI keeping us alive indefinitely.

People talking a lot about "freedom" or "liberty" generally just care about their short-term interests. We need sustainability.

Agree. Tried to capture this under 'Potential Challenges' item #4. My hope is that people would value the environment and sustainability beyond just their own short term interests, but it's not clear whether that would happen to a sufficient degree.

True, but we still need the sustainable version of the thing they say they want.

Sustainable freedom? The problem is that everyone decides what freedom is. It's an empty bag and no one can prove that their version of liberalism is true. Not even a super-AI will be able to tell us what true freedom is. It might even say that true freedom is Death... Like the Kenyan sect that recently starved itself to death.

I'd like to talk a little more about intelligent ecological pluralism, but might possibly have to improve my ideas even more to really get people interested.