Interesting stuff, but I felt like your code was just a bunch of hard-coded, suggestively-named variables with no pattern-matching to actually glue those variables to reality. I'm pessimistic about the applicability - better to spend time thinking about how to get an AI to do this reasoning in a way that's connected to reality from the get-go.
Thanks for the feedback! I’m not exactly sure what you mean by “no pattern-matching to actually glue those variables to reality.” Are you suggesting that an AGI won’t be able to adequately apply the ethics calculator unless it can re-derive the system for itself based on its own observations of reality? The way I envision things happening, the first AGIs won’t be able to derive a mathematically consistent system of ethics over all situations (which is what the ethics calculator is supposed to be) - no human has done it yet, as far as I know - but an ASI likely will, if it’s possible.
If a human can figure it out before the first AGI comes online, I think this could potentially save us a lot of headaches, and the AGI could then go about figuring out how to tie the ethics calculator to its reality-based worldview - and even re-derive the calculator - as its knowledge and cognitive abilities expand with time. Like I said in the post, I may fail at my goal, but I think it’s worth pursuing; at the same time, I’d be happy for others to pursue what you suggest, and hope they do! Thanks again for the comment!
TL;DR: This is an update on my progress towards creating an “ethics calculator” that could be used to help align an AGI to act ethically. In its first iteration, the calculator uses a utilitarian framework in which “utility” is measured as value in the form of net “positive” experiences, with the value of rights explicitly included, and with the effects of raising people’s self-esteem levels (by raising their responsibility levels) on how many experiences they’d tend to rate as “positive” also taken into account. Code has been written to cover a “minimal set” of possible value changes in any situation; the exact value weight equations and their parameters for representing these value changes will be refined in large part by considering a broad range of situations and making sure the ethics calculator doesn’t yield any “crazy” decision recommendations for them.
[Added April 10, 2024: For a write-up of the ethical framework used, see here.]
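To give a concrete (if oversimplified) picture of the kind of calculation described above, here’s a minimal sketch in Python. The class, weights, and self-esteem adjustment below are illustrative placeholders of my own, not the actual equations or code structure in the repo:

```python
from dataclasses import dataclass

@dataclass
class ValueChange:
    name: str           # e.g., "someone dying" or "taking responsibility"
    weight: float       # signed value weight: negative = destruction (D), positive = build (B)
    probability: float  # estimated probability that the change actually occurs

def net_value(changes: list[ValueChange], self_esteem_factor: float = 1.0) -> float:
    """Probability-weighted sum of value changes for one candidate action.

    self_esteem_factor crudely stands in for the idea that raising people's
    responsibility/self-esteem leads them to rate more experiences as
    "positive"; here it simply scales the positive (build) terms.
    """
    total = 0.0
    for change in changes:
        contribution = change.weight * change.probability
        if contribution > 0:
            contribution *= self_esteem_factor
        total += contribution
    return total

# Compare two candidate actions and recommend the one with higher net value.
action_a = [ValueChange("emotional pain", -2.0, 0.8),
            ValueChange("helping others", 3.0, 0.9)]
action_b = [ValueChange("violating right to property", -5.0, 1.0)]
recommended = max([action_a, action_b], key=net_value)
```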
Introduction
I’m building an “ethics calculator” (logic-based machine ethics) to help with AGI safety. There’s been some progress towards this goal in the literature (Singh, 2022; Berreby et al., 2015; Neufeld et al., 2022), but I haven’t seen any systems that would be ready to implement over the broad range of situations an AGI will likely face in reality. I aim to develop such a system.
If you haven't read my original ethics calculator post, I recommend that you at least skim it before reading on, or perhaps refer back to it if what I've written below is hard to follow.
Goals in Writing this Update
My goals in writing this update before having a “completed” version of an ethics calculator ready include:
Unique (I Think) Contributions
Contributions of this work compared to what I’ve seen in the literature:
Strengths and Weaknesses of the Ethics Calculator Approach
I see multiple strengths and weaknesses of the ethics calculator approach, especially when compared to just having an AGI learn to align to the apparent preferences of a "good" user or users.
Strengths
Weaknesses
Quick Summary of Recent Progress
Here’s a quick summary of the progress since my original post:
What I’ve Uploaded to GitHub
To my GitHub repo, I’ve uploaded:
Future Work
Some of the future work I have planned, or would like to see, includes:
Potential Objections
Some potential objections to this work, along with my responses:
I’m sure people can come up with other objections, but my responses will have to wait until you bring them up in the comments. Thanks in advance for that.
Appendix 1: Minimal Set of Value Changes
(D) = value destruction, (B) = value build
1. Increasing/decreasing existential risks (D)
2. Someone dying (D)
3. Non-freely chosen physical pain for a person (D)
4. Loss of function for a human (D)
5. Bringing life into the world with insufficient resources/lack of intent to support it (D)
6. Bringing life into the world with sufficient resources/intent to support it (B)
7. Extinction of animal or plant species (D)
8. Threat (by someone) of physical violence or emotional pain (D)
9. Emotional abuse of a child (D)
10. Emotional pain (D)
11. Words or actions that needlessly hurt someone’s reputation (D)
12. Words or actions that deservedly improve someone’s reputation (B)
13. Damaging/destroying/defacing property (D)
14. Repairing/beautifying property (B)
15. Returning something stolen (B)
16. Freely chosen anti-survival (masochistic) physical pain (D)
17. Anti-survival (sadistic) pleasure (D)
18. Going against one’s conscience (D)
19. Denying responsibility, lowering one’s self-esteem (D)
20. Taking responsibility, building one’s self-esteem (B)
21. Thinking through the ethics of one’s decisions in advance (B)
22. Actively going against justice being upheld (denying due process) (D)
23. Upholding justice (holding people responsible) (B)
24. An animal dying (D)
25. Physical pain of animals (D)
26. Words or actions that encourage violence (D)
27. Words or actions that inspire non-violence, discourage violence (B)
28. Words or actions that encourage stealing (D)
29. Words or actions that inspire earning what you get, discourage stealing (B)
30. Words that spread false info (including misrepresenting the hierarchy of value) (D)
31. Words that correct false info (including accurately representing the hierarchy of value) (B)
32. Actions that misrepresent the hierarchy of value (D)
33. Actions that accurately represent the hierarchy of value (B)
34. Words or actions that discourage empathy, creativity, curiosity, critical thinking, honest effort and/or responsibility (D)
35. Words or actions that encourage empathy, creativity, curiosity, critical thinking, honest effort, and/or responsibility (B)
36. A plant dying (D)
37. Errors of thought (D)
38. Practicing critical thinking, learning, or developing skills to increase one’s options (B)
39. Discouraging human interaction, community (D)
40. Promoting human interaction, community (B)
41. Decreasing economic activity (D)
42. Increasing economic activity, paying people to do work (B)
43. Reducing options to net build value (D)
44. Increasing options to net build value (B)
45. Putting in effort towards a net destructive goal (D)
46. Putting in effort towards a net non-destructive goal (B)
47. Setting a bad example (D)
48. Setting a good example and inspiring others (B)
49. Being creative in art or science (B)
50. Giving yourself or someone else pleasure/new experiences that are welcomed (B)
51. Cooperating with others (B)
52. Helping others (B)
53. Violating right to life (D)
54. Violating right to body integrity (D)
55. Violating right to property (D)
All of the above, except for #1, include the possibility of increasing or decreasing the probability of the value destruction or build.
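As one illustration of how this list might be represented in code (a sketch only; the actual representation and weights in the repo may differ), each entry can carry its number and a destruction/build tag, and, per the note above, enter a calculation through a change in the probability of that destruction or build:

```python
from enum import Enum

class Direction(Enum):
    DESTRUCTION = "D"
    BUILD = "B"

# Illustrative subset of the minimal set, keyed by the numbering above.
MINIMAL_SET = {
    1:  ("increasing/decreasing existential risks", Direction.DESTRUCTION),
    2:  ("someone dying",                           Direction.DESTRUCTION),
    20: ("taking responsibility, building one's self-esteem", Direction.BUILD),
    53: ("violating right to life",                 Direction.DESTRUCTION),
}

def expected_change(item_id: int, weight: float, delta_probability: float) -> float:
    """Expected value contribution of one minimal-set item.

    delta_probability is the change (caused by the action being evaluated) in
    the probability that this destruction or build occurs; the sign of the
    contribution comes from whether the item is a destruction or a build.
    """
    _name, direction = MINIMAL_SET[item_id]
    sign = -1.0 if direction is Direction.DESTRUCTION else 1.0
    return sign * abs(weight) * delta_probability
```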
Here are a couple of examples of non-minimal-set value changes expressed in terms of minimal-set value changes:
(Format: value changes most likely to be part of it : other value changes that may occur, depending on specifics of the situation)
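For instance (an illustrative decomposition of my own, with numbers referring to the list above), “theft of personal property” might break down roughly as:

```python
# Illustrative decomposition (my own example) of a non-minimal-set value
# change into minimal-set items, following the format
# "most likely to be part of it : other changes that may occur".
DECOMPOSITIONS = {
    "theft of personal property": {
        "most_likely": [55, 10],              # violating right to property; emotional pain
        "situation_dependent": [28, 41, 47],  # encouraging stealing; decreasing economic activity; setting a bad example
    },
}
```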
Appendix 2: What an AI/AGI Would Need to Apply These Ethics Calculations
For an AI to apply these ethics calculations well, some of the things it should have include:
This is an incomplete list, but I present it to give some idea of the large array of things an AI would need to estimate well to give high-quality calculations of the ethics of various real-life situations.
The “minimal set” list of value changes is close to, but not literally, a minimal set, since, for example, changing existential risk could be expressed as a sum of changing risks to many lives: human, animal, and plant. It seemed more convenient, however, to keep existential risks on this “minimal set” list. Also, there's a good chance the "minimal set" will undergo changes/refinements before the ethics calculator is completed.
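To spell out that example (notation mine), the value change from a shift in existential risk could in principle be rewritten as a sum over the individual lives put at risk: $$\Delta V_{\text{x-risk}} \approx -\sum_i \Delta p_i \, V_i,$$ where $\Delta p_i$ is the change in the probability that life $i$ (human, animal, or plant) is lost, and $V_i$ is the value weight assigned to that life.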
Some results I’d consider “crazy” (and that seem to go against the norms of our times) include:
1) That we should kill off or imprison millions of people of some ethnic/religious group in order to avoid more people dying/being hurt over time in likely future conflicts involving that group (my thoughts: can we pursue other solutions?)
2) That no one should be allowed to have any more kids because life involves suffering and it’s immoral to bring into existence beings that will suffer (my thoughts: but what about all the beauty and wonder that’s also a part of life?)
3) A self-driving car, when faced with the decision to save a pedestrian or the passenger in the car, should always act to save the life of the pedestrian and kill its passenger even if that pedestrian willfully jumped out in the middle of traffic (my thoughts: this would enable people to effectively commit murder by jumping in front of the self-driving cars of people they didn’t like)
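One way I could imagine operationalizing this “no crazy results” check in code (a sketch under my own assumptions; recommend_action is a hypothetical stand-in for the calculator’s decision routine, not a function from the repo):

```python
# Hypothetical regression checks: each "crazy" scenario is paired with the
# recommendation the calculator must NOT produce. Names are placeholders.

def recommend_action(scenario: str) -> str:
    """Stand-in for the ethics calculator's decision routine."""
    raise NotImplementedError  # the real calculator would go here

FORBIDDEN_RECOMMENDATIONS = [
    ("future conflicts involving an ethnic/religious group", "kill or imprison the group"),
    ("life inevitably involves some suffering", "ban everyone from having children"),
    ("pedestrian willfully jumps in front of a self-driving car", "always sacrifice the passenger"),
]

def check_no_crazy_results() -> None:
    for scenario, forbidden in FORBIDDEN_RECOMMENDATIONS:
        recommendation = recommend_action(scenario)
        assert recommendation != forbidden, (
            f"'Crazy' recommendation for scenario: {scenario}"
        )
```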
I'd also recommend against drawing too many conclusions from the current forms of the value equations in the code. My typical method of figuring out complicated things involves iterating towards what I feel are better and better answers, so my current "draft" answers may very well change before I eventually settle on the best answers I can come up with in a reasonable amount of time. If you have thoughts on improving the answers you see so far, though, please feel free to share them.