One of the approaches in Steven Byrnes' Brain-like AGI Safety is reverse engineering human motivation systems, e.g., the Social-instinct AGI in chapter 12. Your breakdown suggests that 'just' reverse-engineering human alignment is not enough.
Arbitrarily-scalable deference-morality looks like an intent-aligned AGI. One lens on why intent alignment is difficult is that deference-morality is inherently unnatural for agents who are much more capable than the others around them.
This is a really useful framing; it crystallized a lot of messy personal moral intuitions. Thanks for writing it.
This is really useful as a framing for what kinds of disagreements arise among altruists (or within an altruist who notices a conflict in their intuitions). I think it also explains some of the variance in altruistic targets (whether you care more about distant, very poor people than about closer but much less poor ones, or about animals, or "the rich", or other categories which can be understood along these dimensions).
Saying that "naively maximizing the agency of future agents would involve ensuring that they only have very easily-satisfied preferences" is clearly wrong. You appear to be completely misdefining agency here. Agency is the ability to come to decisions and value things on your own, rather than having them picked for you. It is the ability to try to make the world 'more' to your liking, not for the world to simply be the way you like.
An agent that does not exist has zero agency. An agent that does exist, but has been fully controlled has zero agency. Only agents that make real choices in the world have agency. The maximally satisfiable being (let's stop calling it an agent) does nothing, or does things without regard to how the world should be and thus has no agency. Maximizing agency does not equal creating only beings with zero agency.
This glaring error makes the whole 'cooperation-morality' segment seem shakily reasoned. I'm not sure how it changes things, but having a third of the post rest on this makes the whole thing seem unreliable.
It’s weird to think about what “respecting agency” means when the agent in question doesn’t currently exist and you are building it from scratch and you get to build it however you want. You can’t apply normal intuitions here.
For example, brainwashing a human such that they are only motivated to play tic-tac-toe is obviously not “respecting their agency”. We’re all on the same page for that situation.
But what if we build an AGI from scratch such that it is only motivated to play tic-tac-toe, and then we let the AGI do whatever it wants in the world (which happens to be playing tic-tac-toe)? Are we disrespecting its agency? If so, I don’t feel the force of the argument that this is bad. Who exactly are we harming here? Is it worse than not making this AGI in the first place?
Was evolution “disrespecting my agency” when I was born with a hunger drive and sex drive and status drive etc.? If not, why would it be any different to make an AGI with (only) a tic-tac-toe drive? Or if yes, well, we face the problem that we need to put some drives into our AGI or else there’s no “agent” at all, just a body that takes random actions, or doesn’t do anything at all.
I never talked about respect in my points for a reason. This isn't about respect. It's about how something is not an agent if it doesn't do anything in an attempt to make the world more to its liking. If it does nothing, or acts randomly (without regard to making things better), that is hardly agentic. If I don't care at all about colors, then picking between a red shirt and an otherwise identical blue shirt is not an agentic choice, merely a necessary one (and I will likely not even have bothered thinking about it as a choice involving color). Likewise, if I simply have to always wear a specific color, I am not being an agent by wearing that color. There are obviously degrees of agency, too, but the article is genuinely assuming that beings that do basically nothing are still agents.
I like this breakdown! But I have one fairly big asterisk — so big, in fact, that I wonder if I'm misunderstanding you completely.
Care-morality mainly makes sense as an attitude towards agents who are much less capable than you - for example animals, future people, and people who aren’t able to effectively make decisions for themselves.
I'm not sure animals belong on that list, and I'm very sure that future people don't. I don't see why it should be more natural to care about future humans' happiness than about their preferences/agency (unless, of course, one decides to be that breed of utilitarian across the board, for present-day people as well as future ones).
Indeed, the fact that one of the futures we want to avoid is one of future humans losing all control over their destiny, and instead being wireheaded to one degree or another by a misaligned A.I., handily demonstrates that we don't think about future-people in those terms at all, but in fact generally value their freedom and ability to pursue their own preferences, just as we do our contemporaries'.
(As I said, I also disagree with taking this approach for animals. I believe that insofar as animals have intelligible preferences, we should try to follow those, not perform naive raw-utility calculations — so that e.g. the question is not whether a creature's life is "worth living" in terms of a naive pleasure/pain ratio, but whether the animal itself seems to desire to exist. That being said, I know a nonzero number of people in this community have differing intuitions on this specific question, so it's probably fair game to include in your descriptive breakdown.)
I assume that you do think it makes sense to care about the welfare of animals and future people, and you're just questioning why we shouldn't care more about their agency?
The reductio for caring more about animals' agency is when they're in environments where they'll very obviously make bad decisions - e.g. there are lots of things which are poisonous and they don't know; there are lots of cars that would kill them, but they keep running onto the road anyway; etc. (The more general principle is that the preferences of dumb agents aren't necessarily well-defined from the perspective of smart agents, who can elicit very different preferences by changing the inputs slightly.)
The reductio for caring more about future peoples' agency is in cases where you can just choose their preferences for them. If the main thing you care about is their ability to fulfil their preferences, then you can just make sure that only people with easily-satisfied preferences (like: the preference that grass is green) come into existence.
The other issue I have with focusing primarily on agency is that, as we think about creatures which are increasingly different from humans, my intuitions about why I care about their agency start to fade away. If I think about a universe full of paperclip maximizers with very high agency... I'm just not feeling it. Whereas at least if it's a universe full of very happy paperclip maximizers, that feels more compelling.
(I do care somewhat about future peoples' agency; and I personally define welfare in a way which includes some component of agency, such that wireheading isn't maximum-welfare. But I don't think it should be the main thing.)
(Also, as I wrote this comment, I realized that the phrasing in the original sentence you quoted is infelicitous, and so will edit it now.)
Thank you! This is helpful. I'll start with the bit where I still disagree and/or am still confused, which is the future people. You write:
The reductio for caring more about future peoples' agency is in cases where you can just choose their preferences for them. If the main thing you care about is their ability to fulfil their preferences, then you can just make sure that only people with easily-satisfied preferences (like: the preference that grass is green) come into existence.
Sure. But also, if the main thing you care about is their ability to be happy, you can just make sure that only people whom green grass sends to the heights of ecstasy come into existence? This reasoning seems like it proves too much.
I'd guess that your reply is going to involve your kludgier, non-wireheading-friendly idea of "welfare". And that's fair enough in terms of handling this kind of dilemma in the real world; but running with a definition of "welfare" that smuggles in that we also care about agency a bit… seems, to me, like it muddles the original point of wanting to cleanly separate the three "primary colours" of morality.
That aside:
Re: animals, I think most of our disagreement just dissolves into semantics. (Yay!) IMO, keeping animals away from situations which they don't realize would kill them just falls under the umbrella of using our superior knowledge/technology to help them fulfill their own extrapolated preference to not-get-run-over-by-a-car. In your map this is probably taken care of by your including some component of agency in "welfare", so it all works out.
Re: caring about paperclip maximizers: intuitively, I care about creatures' agency iff they're conscious/sentient, and I care more if they have feelings and emotions I can grok. So, I care a little about the paperclip-maximizers getting to maximize paperclips to their heart's content if I am assured that they are conscious; and I care a bit more if I am assured that they feel what I would recognise as joy and sadness based on the current number of paperclips. I care not at all otherwise.
If I think about a universe full of paperclip maximizers with very high agency... I'm just not feeling it. Whereas at least if it's a universe full of very happy paperclip maximizers, that feels more compelling.
This is really the old utilitarian argument that we value things (like agency) in addition to utility because they are instrumentally useful (which agency is). But if agency had never given us utility, we would never have valued it.
If you don’t fully trust that agent, though, then it seems very tricky to reason about how much you should defer to them, because they may be manipulating you heavily. In such cases the approach that seems most robust is to diversify worldviews using a meta-rationality strategy which includes some strong principles.
This doesn't seem to follow. Why wouldn't the 'strong principles' also be a product of heavy manipulation?
Strong principles tend to be harder to manipulate, because:
a) Strong principles tend to be simple and clear; there's not much room for cherrypicking them to produce certain outcomes.
b) Principle-driven actions are less dependent on your specific beliefs.
Regardless of how much harder they may be to manipulate, they can never be invulnerable. Which implies that given enough time, all principles, even the strongest, are subject to change.
Let’s consider three ways you can be altruistic towards another agent:
I think a lot of unresolved tensions in ethics come from seeing these types of morality as in opposition to each other, when they're actually complementary:
Cooperation-morality and deference-morality have the weakness that they can be exploited by the agents we hold those attitudes towards; and so we also have adaptations for deterring or punishing this (which I’ll call conflict-morality). I’ll mostly treat conflict-morality as an implicit part of cooperation-morality and deference-morality; but it’s worth noting that a crucial feature of morality is the coordination of coercion towards those who act immorally.
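The role of conflict-morality described above has a standard game-theoretic illustration (my own sketch, not from the post): in an iterated prisoner's dilemma, a retaliatory strategy like tit-for-tat punishes defection, which is what makes exploiting a cooperator unprofitable relative to sustained mutual cooperation. The payoff numbers below are the conventional illustrative values.

```python
# Sketch: punishment ("conflict-morality") stabilizing cooperation in an
# iterated prisoner's dilemma. Payoff values are standard illustrative ones.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_history):
    # Cooperate first, then copy the opponent's previous move (retaliation).
    return "C" if not opponent_history else opponent_history[-1]

def exploiter(opponent_history):
    # Always tries to exploit cooperators.
    return "D"

def play(strat_a, strat_b, rounds=10):
    hist_a, hist_b = [], []  # each side's record of the *opponent's* moves
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strat_a(hist_a), strat_b(hist_b)
        pa, pb = PAYOFFS[(a, b)]
        score_a += pa
        score_b += pb
        hist_a.append(b)
        hist_b.append(a)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))  # (30, 30): sustained cooperation
print(play(tit_for_tat, exploiter))    # (9, 14): one free defection, then punished
```

Note that over ten rounds the exploiter's 14 points are far below the 30 it could have earned by cooperating: the credible threat of retaliation is what protects cooperation-morality from exploitation.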
Morality as intrinsic preferences versus morality as instrumental preferences
I’ve mentioned that many moral principles are rational strategies for multi-agent environments even for selfish agents. So when we’re modeling people as rational agents optimizing for some utility function, it’s not clear whether we should view those moral principles as part of their utility functions, versus as part of their strategies. Some arguments for the former:
Some arguments for the latter:
The rough compromise which I use here is to:
Rederiving morality from decision theory
I’ll finish by elaborating on how different decision theories endorse different instrumental strategies. Causal decision theories only endorse the same actions as our cooperation-morality intuitions in specific circumstances (e.g. iterated games with indefinite stopping points). By contrast, functional decision theories do so in a much wider range of circumstances (e.g. one-shot prisoner’s dilemmas) by accounting for logical connections between your choices and other agents’ choices. Functional decision theories follow through on commitments you previously made; and sometimes follow through on commitments that you would have made. However, the question of which hypothetical commitments they should follow through with depends on how updateless they are.
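The contrast between causal and functional decision theories can be made concrete with a "twin" one-shot prisoner's dilemma, where the opponent runs the same decision procedure as you. This is my own illustrative sketch (payoff numbers are the conventional ones, not from the post): CDT treats the twin's move as causally fixed and so defects, while FDT accounts for the logical connection between the two outputs and cooperates.

```python
# Sketch: one-shot "twin" prisoner's dilemma with standard payoff values.
# The opponent is an agent running the exact same decision procedure.
PAYOFFS = {  # (my_move, their_move) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def cdt_choose(predicted_opponent_move):
    # CDT holds the opponent's move fixed and picks the causal best
    # response. Against any fixed move, defection dominates.
    return max(["C", "D"], key=lambda m: PAYOFFS[(m, predicted_opponent_move)])

def fdt_choose():
    # FDT notes that the twin's procedure outputs whatever this one does,
    # so it evaluates each move assuming the twin makes the same move.
    return max(["C", "D"], key=lambda m: PAYOFFS[(m, m)])

# CDT defects regardless of its prediction, so two CDT twins land on (D, D);
# two FDT twins land on (C, C) and each do better.
print(cdt_choose("C"), cdt_choose("D"))              # D D
print(fdt_choose(), PAYOFFS[(fdt_choose(), fdt_choose())])  # C 3
```

The same structure is why FDT-style reasoning endorses cooperation in a much wider range of circumstances than CDT: wherever the relevant logical correlation holds, defection stops being dominant.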
Updatelessness can be very powerful - it’s essentially equivalent to making commitments behind a veil of ignorance, which provides an instrumental rationale for implementing cooperation-morality. But it’s very unclear how to reason about how justified different levels of updatelessness are. So although it’s tempting to think of updatelessness as a way of deriving care-morality as an instrumental goal, for now I think it’s mainly just an interesting pointer in that direction. (In particular, I feel confused about the relationship between single-agent updatelessness and multi-agent updatelessness like the original veil of ignorance thought experiment; I also don’t know what it looks like to make commitments “before” having values.)
Lastly, I think deference-morality is the most straightforward to derive as an instrumentally-useful strategy, conditional on fully trusting the agent you’re deferring to - epistemic deference intuitions are pretty common-sense. If you don’t fully trust that agent, though, then it seems very tricky to reason about how much you should defer to them, because they may be manipulating you heavily. In such cases the approach that seems most robust is to diversify worldviews using a meta-rationality strategy which includes some strong principles.