Epistemic status: Reframing of a trivial result, followed by a more controversial assertion.
Summary: For bounded agents acting in complex environments, using raw consequentialist reasoning for every moment-to-moment decision is prohibitively costly/time-inefficient. A rich suite of heuristics/approximations may be developed to supplement it, recording the correspondences between certain classes of actions and the consequences that actions of those classes robustly tend to cause. Instincts, traditions, and moral inclinations/deontology are well-explained by this: an agent that started out as a pure environment-based consequentialist would end up deriving them ex nihilo.
I use this to argue that humans are pure consequentialists over environment-states: we don't have real terminal preferences for some actions over others.
Bootstrapping
Suppose you're an agent with a utility function over world-states. You exist in a complex environment populated by other agents, whose exact utility functions are varied and unknown. You would like to maneuver the environment into a high-utility state, and prevent adversarial agents from moving it into low-utility states. How do you do that?
Consequentialism is the natural, fundamental answer. At every time-step, you evaluate the entire array of actions available to you, then pick the action that's part of the sequence most likely to move the world to the highest-value state.
That, however, is a lot of computation. You have to consider what actions you can in fact take, their long-term terminal effects, their instrumental effects (i.e., how much they expand or constrain your future action-space), their acausal effects, and so on.
Every time-step.
And you have to run that computation every time-step, because you're an embedded agent: at any given time your model of the world is incomplete, and every time-step you get new information you must update on. You can't pre-compute the optimal path through the environment, you have to continuously re-plot it.
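To make the cost concrete, here's a minimal sketch of that loop in Python. Everything in it is a placeholder of my own (the brute-force fixed-horizon search, the deterministic transition function, the observe hook); the point is only the shape of the computation, which with N available actions already costs on the order of N^horizon evaluations per time-step.

```python
import itertools

# A toy sketch of "raw" consequentialism: brute-force lookahead over a small,
# fully-known world model, re-run from scratch at every time-step.
# All names here are illustrative placeholders, not a model of a real agent.

def plan_step(state, actions, transition, utility, horizon=3):
    """Return the first action of the action sequence whose end state
    has the highest utility, found by exhaustive enumeration."""
    best_value, best_first_action = float("-inf"), None
    for seq in itertools.product(actions, repeat=horizon):
        s = state
        for a in seq:
            s = transition(s, a)      # toy deterministic dynamics
        if utility(s) > best_value:
            best_value, best_first_action = utility(s), seq[0]
    return best_first_action

def act(initial_state, actions, transition, utility, observe):
    """Replan every time-step: each new observation can invalidate the old plan."""
    state = initial_state
    while True:
        action = plan_step(state, actions, transition, utility)
        state = observe(transition(state, action))   # the world updates, you update
        yield action
```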
If you're a bounded agent with limited computational capabilities, all of this would take time. Agents who can think quicker would out-compete those who think slower. (As a trivial example, they'd win in a knife fight.) Therefore, it'd be beneficial to find faster algorithms for computing the consequences, to replace or supplement "raw" consequentialism.
Is there a way to do that?
Suppose the environment is relatively stable across time and space. From day to day, you interact with mostly the same types of objects (trees, rocks, animals, tribesmen), and they maintain the same properties in different locations and situations.
In that case, you can abstract over actions. You can notice that all actions of a certain kind tend to increase your future action-space (such as "stockpiling food" or "improving your reputation among the tribe") or provide direct utility ("eating tasty food"), whereas others tend to constrain it (such as "openly stealing") or provide direct disutility ("getting punched"). These actions tend to have these effects regardless of the specifics — whether they happen by day or by night, in a cave or in the forest.
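As a toy illustration (the action classes and scores below are invented, and a real agent would have distilled them from many explicit evaluations), evaluating a familiar action now becomes a lookup rather than a search:

```python
# A hypothetical cache of learned correspondences between action classes
# and the outcomes they robustly tend to produce.

ACTION_CLASS_VALUE = {
    "stockpile_food":     +2.0,   # expands your future action-space
    "improve_reputation": +1.5,
    "eat_tasty_food":     +0.5,   # direct utility
    "openly_steal":       -2.0,   # constrains your future action-space
    "get_punched":        -1.0,   # direct disutility
}

def classify(action):
    """Map a concrete, context-laden action to its abstract class.
    (Hypothetical: in practice this classifier would itself be learned.)"""
    return action["class"]

def heuristic_value(action):
    """An O(1) lookup instead of simulating consequences. The when and the
    where are deliberately thrown away."""
    return ACTION_CLASS_VALUE.get(classify(action), 0.0)   # 0.0 means "unknown, go compute"

# heuristic_value({"class": "stockpile_food", "where": "cave", "when": "night"})  ->  2.0
```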
Moreover, you can notice that some actions have the same effects independent of the agent executing them. When someone brings in a lot of food and donates it to the common tribal stockpile, that's good. When someone protects you from being brained with a club, that's good. When someone screws up a stealth check and gets a hunting party killed by a predator, that's bad.
When you murder a rival, that's good, but only myopically. Probabilistically, you're not going to get away with it long-term, which may get you killed or exiled, or at least tank your reputation. Similarly, it's statistically bad if someone else in the tribe murders, because a) it decreases the tribe's manpower, b) the victim might be one of your friends, c) the victim might be you. Therefore, murder is bad.
These agent-agnostic actions are particularly useful. If your fellow agents are sufficiently similar to you architecture-wise, you can expect them to arrive at the same conclusions regarding which actions are agent-agnostically good and which are bad. That means you'd all commit to the same timeless policy, and know that everyone else has committed to it, and know everyone knows that, etc. Basically, you'll never have to think about some kinds of acausal effects again!
In the end, you're left with a rich pile of heuristics that are well-adapted to the environment you're in. They're massively faster than raw consequentialism, and approximate it well.
Can you do even better?
Suppose there are agents you interact with frequently (your tribesmen). Necessarily, you develop detailed models of their thought patterns and utility functions. You can then abstract over those as well, breaking all possible minds into sets of (personality) traits. These traits correspond to the types of behavior you can expect from these agents. (Cowardice, selfishness, bravery, cleverness, creativity, dishonesty...)
Once you've done that, you can use even quicker computational shortcuts. Instead of evaluating actions against your heuristics framework, let alone trying to directly track their consequences, you can simply check your annotated model of the author of a given action. Is whoever executes it a highly trustworthy and competent agent with a utility function compatible with yours? Or a fool with a bad reputation?
Depending on the answer, you might not even need any additional information to confidently predict whether that action moves the world to a high-utility or a low-utility state. Whatever this person does is likely good, no matter what they do. Whatever that person does is likely bad.
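A sketch of that shortcut, with made-up agents and a deliberately crude scoring rule; notice that the action itself barely enters into the verdict:

```python
from dataclasses import dataclass

# Hypothetical annotated models of frequently-encountered agents.
# The traits and the scoring rule are illustrative placeholders.

@dataclass
class AgentModel:
    trustworthy: bool
    competent: bool
    values_compatible: bool

KNOWN_AGENTS = {
    "reliable_hunter": AgentModel(True, True, True),
    "known_fool":      AgentModel(False, False, True),
}

def predict_outcome_by_author(author, action=None):
    """Judge an action almost entirely by who is doing it.
    Returns +1 ("probably moves the world to a high-utility state"),
    -1 ("probably a low-utility one"), or None ("need a closer look")."""
    model = KNOWN_AGENTS.get(author)
    if model is None:
        return None                      # unfamiliar agent: drop to a lower level
    if model.trustworthy and model.competent and model.values_compatible:
        return +1                        # whatever they do is likely good
    return -1                            # whatever they do is likely bad
```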
When everything is working as intended, you use these higher-level models most of the time, and occasionally dip down to lower levels to check that they're still working as intended. (Is this person actually good, or are they just tricking me with good rationalizations? Does this kind of action actually have a track record of good consequences?)
And, of course, you use lower-level abstractions for off-distribution things: novel types of actions you've never seen before and whose consequences you have to explicitly track, or new agents you're unfamiliar with.
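Putting the levels together, the overall procedure might look something like the sketch below, which reuses the hypothetical helpers from the earlier snippets; the audit rate and the fallback order are assumptions about how such an agent could be wired, not a description of any actual mind.

```python
import random

# A sketch of the layered arrangement described above, building on the earlier
# hypothetical helpers (predict_outcome_by_author, heuristic_value).

AUDIT_RATE = 0.05   # how often a shortcut gets re-checked against the level below it

def evaluate(action, author, full_consequentialist_eval, audit_log):
    """Use the cheapest adequate level of abstraction; occasionally dip down
    to verify that the higher levels still track reality."""
    # Level 3: judge by the author alone, if they're a known agent.
    verdict = predict_outcome_by_author(author)
    if verdict is not None:
        if random.random() < AUDIT_RATE:
            audit_log.append(("agent-model check", author, verdict, heuristic_value(action)))
        return verdict

    # Level 2: judge by the action's class, if it's a familiar kind of action.
    value = heuristic_value(action)
    if value != 0.0:
        if random.random() < AUDIT_RATE:
            audit_log.append(("heuristic check", action, value, full_consequentialist_eval(action)))
        return +1 if value > 0 else -1

    # Level 1: a novel action by an unfamiliar agent, so fall back to
    # explicitly tracking the consequences.
    return full_consequentialist_eval(action)
```

The audit log here is just a stand-in for whatever later process checks whether the cheap verdicts still match the expensive ones.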
Problems appear when you forget to do that. When you assume that certain actions are inherently bad, even if the environment changes and the correspondence between them and bad outcomes breaks down (entrenched traditions). Or when you assume that some person is inherently good, and that the things they do are good because they did them.
Recapping
In sufficiently complex environments, using pure consequentialist reasoning is computationally costly.
If the environment is sufficiently stable, you can approximate consequentialism using algorithmically efficient shortcuts.
Given an environment, you can pick out types of actions that stably correspond to good or bad outcomes in it, then directly use these heuristics to make and evaluate decisions.
Given agents you know to be aligned-and-competent or adversarial, you can just know to support/oppose anything they do without expending resources on closer evaluation, because whatever they do stably corresponds to good/bad outcomes.
Higher-level models need to be checked by lower-level methods once in a while, to ensure that the correspondences didn't break down/weren't false to begin with.
You can also view it in terms of natural abstractions. For some actions, you can throw away most of the contextual information (the when, the where, the who) and still predict their consequences correctly most of the time. For some agents, you can throw away all information about the actions they execute except the who, and still predict correctly whether the consequences would be good or bad.
This framework explains a lot of things. Notably, the normative ethical theories: deontology and virtue ethics are little more than parts of an efficient implementation of consequentialism for bounded agents in sufficiently stable environments. Instincts and traditions too, of course.
Applying
There's a great deal of confusion regarding the best way to model humans. One particular point of contention is whether we can be viewed as utility-maximizers over real-world outcomes. Deontological inclinations, instincts, and habits are often seen as complications/counter-arguments. Some humans "value" being honest and cooperative — but those aren't "outcomes over the environment", except in some very convoluted sense!
Act-based agents are an attempt to formalize this. Such agents have utility functions defined over their actions instead of environment-states, and are a good way to model idealized deontologists. It seems natural to suggest that humans, too, have an act-based component in their utility functions.
I suggest an alternate view: human preferences for certain kinds of actions over others are purely instrumental, not terminal.
In the story I've outlined, we started with a pure environment-based consequentialist agent. That agent then computed the appropriate instincts, norms, and social standards for the environment it found itself in, and preferred acting on them. But it never modified its purely environment-based utility function along the way!
The case with humans, I argue, is similar. We don't actually "value" being honest and cooperative. We act honest and cooperative because we expect actions with these properties to more reliably lead to good outcomes — because they're useful heuristics that assist us in arriving at high-utility world-states.
When we imagine these high-utility world-states — a brilliant future full of prosperity, or personal success in life — we imagine the people in them to be honest and cooperative too. But that's not because we value these things inherently. It's because that's our prediction of how agents in a good future would need to act, for that future to be good. We expect certain heuristics to hold.
One clue here is that the various preferences over actions don't seem to be stable under reflection. If they conflict with consequentialism-over-outcomes, they end up overridden every time. The ones we keep around — like the norm against murder — we keep around because we agree that it's a good heuristic. The ones where the correspondence broke down — like an outdated tradition, or a primal instinct — we reject or suppress.
There are some complications arising from the fact that with humans, it happened the other way around from what I'd described. Intelligence started out as a bunch of heuristics, then generality was developed on top of them. This led to us having some hard-coded instincts and biases that are very poorly adapted for our modern environment.
It also added in reward signals: we feel good for employing heuristics we consider good. This muddles the picture even more: when we imagine a bright future, we have to specify that in it, we'd be allowed to keep our instincts and deeply-internalized heuristics satisfied, so as to be happy. But the actual outcome we're aiming for there is that happiness, not our execution of certain actions.
As such: humans are well-described as pure consequentialists over environment-states.