I suspect doing a good job of this is going to be extremely challenging. My loose order-of-magnitude estimate of the Kolmogorov complexity of a decent ethics/human-values calculator is somewhere in the terabytes (something on the order of the size of our genome, i.e. a few gigabytes, is a plausible lower bound, but there's no good reason for it to be an upper bound). However, a sufficiently rough approximation might be a lot smaller, and even that could be quite useful (if prone to running into Goodhart's Law under optimization pressure). I think it's quite likely that doing something like this will be useful in AI-Assisted Alignment, in which case having sample all-human attempts to start from is likely to be valuable.
Did you look at the order of magnitude of standard civil damages for various kinds of harm? That seems like the sort of thing your model should be able to predict successfully.
Also, these sorts of “pleasure” involve not taking responsibility for one’s emotions and thus act to reduce self-esteem, which in turn reduces one’s tendency to experience life overall as “positive.” Therefore, these pleasures were actually considered as value destructions.
I know a number of intelligent, apparently sane, yet kinky people who would disagree with you. If you're interested in the topic, you might want to read some more on it, e.g.: Safe, Sane, and Consensual—Consent and the Ethics of BDSM. If nothing else, your model should be able to account for the fact that at least a few percent of people do this.
Thank you for the comment. Yes, I agree that "doing a good job of this is going to be extremely challenging." I know it's been challenging for me just to get to the point that I've gotten to so far (which is somewhat past my original post). I like to joke that I'm just smart enough to give this a decent try and just stupid enough to actually try it. And yes, I'm trying to find a rough approximation as a good starting point, in hopes that it'll be useful.
Thanks for the suggestion about civil damages - I haven’t looked into that, only criminal “damages” (in terms of criminal sentences) thus far. I actually don’t expect that the first version of my calculations, based on my own ethics/values, will particularly agree with civil damages, but it may be interesting to see if the calculations can be modified to follow an alternate ethical framework (one less focused on self-esteem) that does give reasonable agreement.
Regarding masochistic and sadistic pleasure, it depends on how we define them. One might regard people who enjoy exercise as being into "masochistic pleasure." That's not what I mean by it. By masochistic pleasure I basically mean pleasure that comes from one's own pain, plus self-loathing. Sadistic pleasure would be pleasure that comes from the thought of others' pain, plus self-loathing (even if it may appear as loathing of the other, the way I see it, it's ultimately self-loathing). Self-loathing involves not taking responsibility for one's emotions about oneself and is part of having low self-esteem. I appreciate you pointing to the need for clarification on this, and I hope it's now clarified a bit. Thanks again for the comment!
If Artificial General Intelligence (AGI) is achieved without a highly consistent way of determining the most ethical decision for it to make, there's a very good chance it'll do things that many humans won't like. One way to give an AGI the ability to consistently make ethical decisions could be to provide it with a straightforward mathematical framework for calculating the ethics of a situation based on approximated parameters. This would also likely enable some level of explainability for the AGI's decisions. I've been pursuing such a framework and have come up with a preliminary system that appears to calculate the "ethics" of some idealized decisions in a manner consistent with my values and ethical intuitions, meaning it hasn't produced any wildly counterintuitive results for the admittedly very limited number of ethical decision scenarios I've looked at so far. I don't put forward my values and ethical intuitions as the "right" ones, but I believe they're reasonably consistent, so they should provide a decent foundation to build a proof-of-concept ethics calculation system around.
For determining the "most ethical" decision an AGI could make in a given situation, the criterion I've chosen is that the decision should maximize expected long-term value in the world. I define value to be how useful something ultimately is in supporting and promoting life and net "positive" experiences, where "positive" can contain significant subjectivity. This is basically a utilitarian philosophical approach, although I also include the expected value of upholding rights.
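To make the criterion concrete, here's a minimal sketch of choosing among options by expected long-term value. The `Option` class, function names, weights, and probabilities are hypothetical illustrations for this write-up, not the actual demo code linked further down.

```python
# Minimal sketch (not the actual demo code): pick the option with the
# highest expected long-term value change, relative to "do nothing".
from dataclasses import dataclass, field

@dataclass
class Option:
    name: str
    # (probability, value_change) pairs: > 0 is a value build, < 0 a destruction
    outcomes: list = field(default_factory=list)

def expected_net_value(option: Option) -> float:
    """Expected long-term value change of choosing this option."""
    return sum(p * v for p, v in option.outcomes)

def most_ethical_option(options: list) -> Option:
    """The criterion above: maximize expected long-term value."""
    return max(options, key=expected_net_value)

if __name__ == "__main__":
    do_nothing = Option("do nothing", [(1.0, 0.0)])                  # baseline
    act = Option("act to save a life", [(0.9, 100.0), (0.1, -5.0)])  # made-up numbers
    print(most_ethical_option([do_nothing, act]).name)               # "act to save a life"
```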
Setting Up and Applying a Mathematical Framework
Here’s an outline of the steps I’ve used to devise and apply this system:
My preliminary minimal sets of value destructions and builds are given here. I'll likely refine these lists further in the future. [Update, Jan. 19, 2024: I've updated these value change lists (same link as before), including changes so that a given value build isn't simply the negation of a given value destruction, and vice versa - I believe this is a better way to avoid the "double counting" that I talk about below.]
Regarding #1 above: combinations of value destructions from the minimal set should be able to describe other value destructions not explicitly in this minimal set. An example would be arson of someone else's property without their permission, which involves the minimal-set value destructions of (at the very least) violation of property rights and destruction of property, but could also involve pain, long-term health issues, and/or someone dying.
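As a rough illustration of how such a composition might be computed, here's a sketch; the entry names, weights, and probabilities are made up for this example and aren't the post's actual lists or weights.

```python
# Illustrative sketch of composing a value destruction that isn't in the
# minimal set (arson of someone else's property) out of minimal-set entries.
# Names, weights, and probabilities are placeholders for illustration only.
MINIMAL_SET_WEIGHTS = {
    "property_rights_violation": -50.0,
    "property_destruction": -20.0,
    "physical_pain": -10.0,
    "long_term_health_damage": -200.0,
    "death": -1000.0,
}

def composite_destruction_value(components: dict) -> float:
    """Sum the weighted minimal-set destructions, each scaled by the
    probability that it occurs as part of the composite act."""
    return sum(MINIMAL_SET_WEIGHTS[name] * prob for name, prob in components.items())

# Arson: rights violation and property destruction are certain; pain,
# lasting health damage, and death occur only with some probability.
arson = {
    "property_rights_violation": 1.0,
    "property_destruction": 1.0,
    "physical_pain": 0.3,
    "long_term_health_damage": 0.05,
    "death": 0.01,
}
print(composite_destruction_value(arson))  # -93.0
```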
Regarding #6 above: as an example of personal value builds or destructions that people might not want considered in the calculations, imagine a hypothetical switch that would kill one person if you flipped it or pinch everyone in the world if you didn't. If people knew that someone had to die so they wouldn't get pinched, a large fraction of them likely wouldn't want that on their conscience and would want the weight of their individual pain left out of the ethics calculation of whether to flip the switch.
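Here's a sketch of how such opt-outs could enter the calculation; the names and numbers are hypothetical, and the world's population is shrunk to 1,000 for illustration.

```python
# Sketch of the opt-out idea in #6: a person can ask that a given personal
# harm or benefit not be weighed on their behalf. All names and numbers
# below are hypothetical.
def total_value_change(per_person_changes, opted_out):
    """Sum per-person value changes, skipping anyone who opted out of
    having that particular change counted."""
    return sum(v for person, v in per_person_changes if person not in opted_out)

POPULATION = 1_000                      # small stand-in for "everyone in the world"
flip_switch = [("victim", -1000.0)]     # one person dies
dont_flip = [(f"person_{i}", -0.01) for i in range(POPULATION)]  # everyone pinched

# If everyone pinched opts out of having their pain weighed against a life:
opted_out = {f"person_{i}" for i in range(POPULATION)}
print(total_value_change(flip_switch, set()))    # -1000.0
print(total_value_change(dont_flip, opted_out))  #  0.0 -> "don't flip" wins
```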
For accounting purposes in these ethics calculations, the “do nothing” option is set as a baseline of zero value destruction and zero value build, and other options have their value builds and destructions considered with respect to this baseline. For instance, acting to save a life that would’ve been lost if you did nothing would be considered to be a value build in the life saved by acting, and not a value destruction in the life lost when doing nothing. If the value of the life were included both as a build for the case of taking action and a destruction when not taking action, it would constitute “double counting” of the relative value difference between the options. [Update, Jan. 19, 2024: I've changed the way I avoid double counting - by updating the value destruction and build lists so they don't overlap as opposites of each other (see update above). Therefore, the "do nothing" option now has its own associated value builds/destructions, i.e., the "opposites" of these value builds/destructions aren't included in the value equations of the other options.]
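A small worked example of the original baseline accounting (before the update), with a placeholder life-value weight, may help show the double counting being avoided:

```python
# Sketch of the baseline accounting described above (the original scheme,
# before the Jan. 19 update): "do nothing" is pinned at zero, and the life
# saved by acting is counted once, as a build of the "act" option only.
# LIFE_VALUE is a placeholder number, not the post's actual weight.
LIFE_VALUE = 1000.0

do_nothing_value = 0.0      # baseline: no builds, no destructions
act_value = +LIFE_VALUE     # the saved life, counted once

correct_gap = act_value - do_nothing_value       # 1000.0
# Double counting would also charge "do nothing" with the lost life,
# inflating the gap between the two options to 2 * LIFE_VALUE:
double_counted_gap = act_value - (-LIFE_VALUE)   # 2000.0 (what to avoid)
```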
In this methodology, simplifications necessarily had to be made to approximate the complex world in which we live, where everything is interconnected and decisions and actions can have unforeseen long-term effects. Nevertheless, if the calculations yield seemingly self-consistent results over a broad range of scenarios, they should provide a useful starting point for conveying human ethics to an AGI.[1]
In the current, proof-of-concept version of these calculations, some of the value weight equations, such as the one for someone dying, are “zeroth order” approximations and could use significant refinement.[2] This is left to future work, as is incorporating the effects of uncertainty in the input parameters.
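For what "zeroth order" means here in practice, the sketch below treats the weight for someone dying as a single constant, independent of circumstances; the number and function name are placeholders, not the actual values used.

```python
# Zeroth-order approximation: the same penalty for every death, in every
# situation. A refined version would replace this constant with a function
# of the situation (and, eventually, account for uncertain input parameters).
def death_value_weight_v0() -> float:
    return -1000.0  # placeholder constant, not the post's actual weight
```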
I chose the relative weights of different value destructions and builds by "feel," to try to match my ethical intuitions. An example would be the value of rights versus the value of a human life. These relative weights are certainly open to debate, although there are likely only limited ranges over which they could be modified before the calculations yield some obviously counterintuitive results. By the way, I believe the calculations should strongly favor rights over "classic" utilitarian considerations, in order to keep an AGI from doing bad things (violating rights) on a potentially massive scale in the name of the "greater good."
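One possible way to encode "strongly favor rights" is a lexicographic comparison, sketched below, in which fewer rights violations wins before any classic utilitarian term is consulted; a very large finite weight on rights would be another option. The names and numbers are illustrative only.

```python
# Sketch of making rights dominate "classic" utilitarian terms via a
# lexicographic sort key: an option with fewer rights violations wins no
# matter how much raw utility the alternative promises.
def option_key(rights_violation_value: float, utility_value: float):
    """Sort key: rights violations first (closer to zero is better), then utility."""
    return (rights_violation_value, utility_value)

options = {
    "violate rights for the 'greater good'":    option_key(-1000.0, +500.0),
    "respect rights, accept the worse outcome": option_key(0.0, -400.0),
}
best = max(options, key=options.get)
print(best)  # "respect rights, accept the worse outcome"
```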
Philosophical Aspects
Some aspects of this work that may be interesting to philosophers include:
For more about how these calculations handle different ethical dilemmas, including variations of the well-known “trolley problem,” click here.
I've posted Python code for a "demo" ethics calculator to my GitHub account; it considers a significantly smaller set of value destructions and builds than the full version of the code, which is not yet complete.
Towards Implementing this Framework in an AGI System
Before integrating these ethics calculations into a “real-life” AI or AGI decision-making system, the following steps should be performed:
This is a fairly limited description of what I've put together, and it leaves open many questions. The point of this write-up is not to provide all the answers, but to report the viability of one possible "ethics calculator" and suggest its potential utility. I believe these ethics calculations could provide a useful method for helping an AGI make the most long-term value-building decisions it can. If there's interest, I may provide more details of the calculations and the reasoning behind them in the future.
Thanks for reading.
Some References
Prior to and while working on these ethics calculations, I read a number of different philosophical and self-help resources, many of them listed below. These helped hone some of my ideas, especially with the various ethical dilemmas, thought experiments, and logical arguments they presented.
Bostrom, N., "Ethical issues in advanced artificial intelligence." Science fiction and philosophy: from time travel to superintelligence (2003): 277-284.
Branden, N., “The Six Pillars of Self-Esteem,” (2011).
Bruers, S., Braeckman, J., “A Review and Systemization of the Trolley Problem,” Philosophia: Philosophical Quarterly of Israel, 42, 251-69 (2014).
Chappell, R.Y., Meissner, D., and MacAskill, W., utilitarianism.net
D’Amato, A., Dancel, S., Pilutti, J., Tellis, L., Frascaroli, E., and Gerdes, J.C., “Exceptional Driving Principles for Autonomous Vehicles,” Journal of Law and Mobility 2022: 1-27 (2022).
Friedman, A.W., “Minimizing Harm: Three Problems in Moral Theory,” PhD thesis, MIT, (2002).
Greene, J.D., Cushman, F.A., Stewart, L.E., Lowenberg, K., Nystrom, L.E., and Cohen, J.D., “Pushing Moral Buttons: The Interaction between Personal Force and Intention in Moral Judgment,” Cognition (2009) 111(3): 364-371.
Guttormsen, T.J., “How to Build Healthy Self-Esteem,” Udemy course: https://www.udemy.com/course/healthy-self-esteem
Huemer, M. “Knowledge, Reality and Value,” (2021).
Huemer, M., Fake Nous blog: https://fakenous.substack.com/
Internet Encyclopedia of Philosophy, “Ethics of Artificial Intelligence” https://iep.utm.edu/ethics-of-artificial-intelligence/
Kamm, F., lectures on the trolley problem: https://www.youtube.com/watch?v=A0iXklhA5PQ and https://www.youtube.com/watch?v=U-T_zopKRCQ
Kaufmann, B.N., “Happiness Is a Choice,” (1991).
Lowe, D., “The deep error of political libertarianism: self-ownership, choice, and what’s really valuable in life,” Critical Review of International Social and Political Philosophy, 23 (6): 683-705 (2020).
MacAskill, W., Bykvist, K., and Ord, T., “Moral Uncertainty,” (2020).
MacAskill, W. “What We Owe the Future,” (2022).
Pearce, D., “Can Biotechnology Abolish Suffering?,” (2018).
Robbins, T., “Awaken the Giant Within,” (1991).
Shafer-Landau, R., “The Fundamentals of Ethics,” 1st edition (2010).
Shafer-Landau, R., “The Ethical Life: Fundamental Readings in Ethics and Moral Problems,” (2018).
Singer, P., “Ethics in the Real World,” (2016).
Singer, P., “The Life You Can Save,” (2009).
Singer, P., “Ethics and Intuitions,” The Journal of Ethics (2005) 9: 331-52.
Stanford Encyclopedia of Philosophy, “Deontological Ethics” https://plato.stanford.edu/entries/ethics-deontological/
Thomson, J.J., “Killing, Letting Die, and the Trolley Problem,” Monist (1976) 59: 204-17.
Thomson, J.J., “Turning the Trolley,” Philosophy & Public Affairs, 36 (4): 359-74 (2008).
Vinding, M., “Suffering-Focused Ethics: Defense and Implications,” (2020).
If, instead of making all decisions itself, an AGI were used as an aid to a human, the question arises of how much the AGI should aid a human in pursuing a decision that goes against its calculations of the most ethical decision. In this case, the AGI could be programmed not to aid a human in pursuing a given decision option unless that option involved the fewest overall rights violations of all the options. Alternatively, a “damage threshold” could be set wherein a human wouldn’t be aided (and, possibly, the AGI would intervene against the human) in any decision option that exceeded the threshold of rights violations or risk of rights violations. The benefit of this would be to leave some space for humans to still “be human” and mess things up as part of their process of learning from their mistakes. It should also help provide a path for people to raise their self-esteem when they take responsibility for damages they’ve caused.
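A sketch of the damage-threshold alternative (the threshold value and function name are hypothetical):

```python
# Sketch of the "damage threshold" idea in this footnote: the AGI aids the
# human unless an option's expected rights-violation value falls below a set
# threshold, in which case it declines (and possibly intervenes).
RIGHTS_DAMAGE_THRESHOLD = -100.0  # placeholder; more negative = more damage

def assistance_policy(expected_rights_violation_value: float) -> str:
    """Decide whether to aid a human with their chosen decision option."""
    if expected_rights_violation_value >= RIGHTS_DAMAGE_THRESHOLD:
        return "aid"  # leave room for humans to 'be human' and make mistakes
    return "decline to aid (and possibly intervene)"

print(assistance_policy(-5.0))    # "aid"
print(assistance_policy(-500.0))  # "decline to aid (and possibly intervene)"
```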
Though I haven’t worked out the relative weights yet, the value of a human life will likely include considerations for: 1) intrinsic value, 2) the value of someone’s positive experiences, 3) social value to others, 4) potential for earning money (on the negative side, potential for stealing), 5) potential non-paid labor (on the negative side, potential for violence, killing and abuse), 6) cuteness/attractiveness, 7) reproductive potential, and 8) setting a good example for others and inspiring effort.
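As a sketch only, this breakdown could take the form of a weighted sum over the eight listed components; since the relative weights aren't worked out yet, the uniform weights below are pure placeholders.

```python
# Sketch of the footnoted breakdown: the value of a human life as a weighted
# sum over the eight listed components. Component names and weights here are
# placeholders, not worked-out values.
LIFE_VALUE_COMPONENTS = [
    "intrinsic_value",
    "positive_experiences",
    "social_value_to_others",
    "earning_potential_minus_stealing",
    "unpaid_labor_minus_violence_and_abuse",
    "cuteness_attractiveness",
    "reproductive_potential",
    "good_example_and_inspiration",
]
PLACEHOLDER_WEIGHTS = {name: 1.0 for name in LIFE_VALUE_COMPONENTS}

def human_life_value(component_scores: dict,
                     weights: dict = PLACEHOLDER_WEIGHTS) -> float:
    """Weighted sum over the components; negative scores capture the
    'negative side' items (e.g., potential for stealing or violence)."""
    return sum(weights[name] * component_scores.get(name, 0.0)
               for name in LIFE_VALUE_COMPONENTS)

print(human_life_value({"intrinsic_value": 10.0, "social_value_to_others": 3.0}))  # 13.0
```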