For me, this was a key sentence:
The fact that actions, including actions about what to say is "good", are computed by the brain does mean that there is a strong selection effect in utterances about "good".
I feel like the elephant in the AI alignment room has to do with an even more horrible truth. What if the game is adversarial by nature? Imagine a chess game: would it make sense to build an AI that is aligned both with the black and the white player? It feels almost like a koan.
Status (both domination and prestige) and sexual stuff (not only intra-sexual competition) have ingrained adversarial elements - and the desire for both is a massive part of the human utility function. So you can perhaps align AI to a person or a group, but to keep coherence there must be losers, because we care too much about position, and being in the top position requires having people in the bottom position.
A human utility function is not very far from a chimp's; should we really use it as the basis for the utility function of the super-intelligence that builds von Neumann drones? No, a true "view-from-nowhere good" AI shouldn't be aligned with humans at all.
If our minds expect to function in a partly-adversarial world, then a FAI may decide to place us in a partly-adversarial world, at least to avoid pushing our minds into a weird part of probability space where behaviors and values applicable to normal scenarios stop being applicable. (This is similar to ecosystem and habitat management applied to animals.)
Playing chess involves a preference for playing chess (including a preference that the rules are followed), and subject to playing chess, a preference for winning. Someone who didn't properly have a preference for playing chess, such as a pigeon, would not be properly considered to be "playing chess"; their moves would not even be evaluable as attempts to win or lose, as they would not be following the rules of the game in the first place. This is similar to a point made by Finite and Infinite Games:
There is no finite game unless the players freely choose to play it. No one can play who is forced to play.
A preference for playing a game according to rules would be a law-level preference (referenced in the post).
So a preference to engage in sexual competition would include a law-level preference to exist in a world that has sexual competition functioning according to certain guidelines, as well as a preference to succeed in the specific sexual competition.
This solves the preference to play - but doesn't solve the preference to win/outcompete other humans. The only way to solve the preference to win is to create a Nozick-experience-machine-style existence where some of the players are actually NPCs that are indistinguishable from players [1] (the white chess player wins 80% of the time, but doesn't understand that the black player is actually a bot). In any other scenario, it's impossible for one human to win without another human losing, which means the preference to win will be thwarted on aggregate.
But for an FAI to spend vast amounts of free energy to create simulations of experience machines just seems wrong in a very fundamental sense, seems just like wireheading with extra steps.
[1] - This gives me the faint hope that we are already in this kind of scenario, meaning that the 50 billion chickens we kill each year, and the people whose lives are best described as a living hell, have no qualia. But unfortunately, I would have to bet against it.
Yes, there either have to be NPCs or a lot of real people have to lose. But that's simply a mathematical constraint of actually playing against people like yourself. There's enjoyment taken in the possibility of losing (and actually losing sometimes, seeing what went wrong).
Each preference falsification creates some internal demand for ambiguity and a tendency to reverse the signs on all of your other preferences.
I am not sure how this works, specifically the part about all other preferences.
Do you believe that literally all human preferences are falsified, as soon as one of them is? For example, once we learn that people say some things in order to appear politically correct, we should also conclude that they are lying about liking chocolate and actually they hate it?
It's basically contagious lies applied to values. I've seen people say that school is good because it reduces fertility; naive evolutionary values would say that fertility is good, though, so this seems like a case of one value inversion causing another. It might not apply to all values but it would apply to quite a lot of them, especially ones related to the inverted ones.
I broadly agree with this. I want to note a danger that's always present in this sort of discussion of "things at the horizon of experience", or water that's being swum in. It risks being incorporated as statements in an ideology, rather than being enacted. Of course, enacting something might be bad, but if you incorporate "I want to X" into your belief system and also don't do X / don't want to do X, then something is going wrong. Incorporating "X is good" into your belief system without tracking how much that belief has or hasn't propagated / equilibrated with the rest of you, could lead to a sort of intensification of the simulacrum, papering over the cracks, raising the difference between authenticity and falsification to a pitch too high to hear, making it harder to untangle by reasoning.
I am not really sure which part of the post you think might have this effect. To some extent I am responding to people having incorporated "FAI" into a non-enacted ideology (due to failing to symbol ground "FAI") and trying to correct this problem.
I did summarize part of the post as "if you don't want it, it's not FAI"; but mostly that seems like it would bring people out of confused ideological relationships to "FAI" where they haven't adequately distinguished it from UFAI and are experiencing according motivational issues.
I am not really sure which part of the post you think might have this effect.
None. Oops, maybe there was some unintended implicature in the first two sentences (implying the paragraph is the exception to the broad agreement; that's not what I'm saying). My comment wasn't really addressed at you, more to myself, the crowd, whoever.
people under preference falsification can't do complex research in a way that chains from their actual values
at least some intrinsic preferences are selfish, due to both (a) indexicality of perceptors/actuators and (b) evolutionary psychology
How does this relate to time preference? Do you see a distinction between low time preference and preference falsification (of preferences that are indexical temporally as well as spatially, and are executing adaptations calibrated for a certain lifespan and not necessarily aimed at any coherent ends)? (Acknowledging that "low time preference" is often used in the same way "the common good" is.)
Some intrinsic preferences are short-term (e.g. hunger) and some are long-term. Having preferences be totally consistent across time would require a mechanism; by default, physical locality properties would imply inconsistent time preference (similar to how a magnet attracts things close to it much more than things far away from it). Showing that your short-term preferences are weak can be a form of "virtue signalling" that is encouraged by authority figures, since it can lead someone to try to satisfy more of their values through submission, e.g. in school, which creates a distortion in reporting.
Having preferences be totally consistent across time would require a mechanism; by default, physical locality properties would imply inconsistent time preference
So I'm wondering what mechanisms there are or could be for coherentizing short- and long-term preferences, given that they're prima facie in conflict. And how could such a mechanism not involve preference falsification? I mean, it seems much less bad to do something for long-term (individual intrinsic) value that harms short-term (individual intrinsic) preferences, or even vice versa, than to do something for no reason that harms all preferences, or to do something for externally imposed reasons, which would play into dynamics that result in doing things for no reason.
More to your point, it seems plausible to negotiate between preferences, internally to an individual, in a way that doesn't require falsification, and instead works by weighing tradeoffs. But I don't have a clear picture of exactly why internal negotiation requires less falsification than external pressure; is it mainly a quantitative difference between amount of privacy between parts (hence potential for fraud), or ease of communication; or is it something to do with individuals having some integrated criteria of judgement; or what? (Also, can the intraindividual structure be ported to the interindividual realm, etc.) I'm also wondering, if the way individuals coherentize does involve some preference falsification even apart from external pressure (for example, sour grapes?), does that have the same problems as you discuss? If someone has strictly partial preference falsification from external pressure, will they not be able to make FAI (given that they could chain from some coherent goal, even if it's not the whole of their values, and therefore build up coherent understanding)?
It seems like some forms of reinforcement learning do some forms of coherentizing short-term and long-term preferences; there can be a short-term reward associated with a prediction of future reward, e.g. happiness upon having successfully negotiated to buy a house, which is a prediction of future reward. It seems pretty common for "instrumental" goods like money to be associated with short-term hedonic reward.
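As a minimal sketch of this kind of coherentizing (a standard TD(0) toy, not from the thread; the chain structure, rewards, and constants are illustrative assumptions): in temporal-difference learning, the *predicted* value of the next state is folded into an immediate update signal, so a long-term reward backs up into short-term signals for earlier states.

```python
# Minimal TD(0) sketch (illustrative toy, not from the discussion above):
# a 3-state chain 0 -> 1 -> 2, with reward 1.0 received on entering the
# terminal state 2. The update uses the *predicted* value of the next
# state as part of an immediate signal, so long-term reward backs up
# into short-term value estimates.

def td0_chain(episodes=500, alpha=0.1, gamma=0.9):
    V = [0.0, 0.0]  # value estimates for non-terminal states 0 and 1
    for _ in range(episodes):
        for s in (0, 1):
            r = 1.0 if s == 1 else 0.0                # reward on leaving s
            next_v = V[s + 1] if s + 1 < 2 else 0.0   # terminal value is 0
            # TD error: immediate reward plus discounted predicted future value
            V[s] += alpha * (r + gamma * next_v - V[s])
    return V
```

After training, V[1] approaches 1.0 and V[0] approaches gamma * V[1] = 0.9: the prediction of future reward itself acts as the short-term signal at the earlier state, much like happiness upon closing the house deal.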
The way it would not involve preference falsification is if it is clear whether something is being done for short-term or long-term benefit, and short-term benefits aren't totally getting overwritten with long-term benefits. Similar to Eliezer's point about the drowning child except extending across time instead of space.
But I don’t have a clear picture of exactly why internal negotiation requires less falsification than external pressure
There are 2 layers where there could be falsification: internal and external. For the external layer we can see the mechanisms better: it's possible for two different people to perceive the same facts about the society they live in, in a way that's harder for mental facts. So that seems like a more natural place to start correcting the errors, although correcting internal errors is also necessary to some degree, and will use some tools in common with correcting external errors.
Incoherence of a person across time is often related to that person being externally influenced, e.g. trying to comply with whoever they're talking with at the time and therefore expressing different values at different times.
There is reason to expect that spite strategies, which involve someone paying to harm others, are collective, rather than individual. [...]
Therefore, collective values are more likely than individual values to encode conflicts in a way that makes them fundamentally irreconcilable.
IDK if I get this. This seems to conflate (1) values that only exist because of collective dynamics and aren't grounded in anything anyone wanted, as discussed in the previous paragraph about "the collective good", and (2) values that can only be *pursued* by collectives, even if they're "in the individual". There seems to be a big difference between "encoding conflicts" by creating dynamics that "overwrite" whatever the individuals wanted, vs. by enabling individuals to coordinate to achieve their values (or encoding non-conflict by breaking individuals' ability to coordinate).
Spite strategies function collectively but are encoded in individual adaptations, e.g. the green beard effect. While green beard behaviors aren't well-optimized on an individual basis, the teleology points at a collective strategy; i.e., you compress it better by assuming the gene is selfish than by assuming the individuals are.
With FAI there's an issue of incorporating many different values. My intuition is that if we are choosing one or the other, we should encode selfish-individual values and not selfish-gene values. One reason for this is that selfish-gene values are more likely to be "mean" in a way that makes them irreconcilable.
Now that I've looked at the Sarah Constantin link:
I've updated that spite strategies are more plausible than I'd thought.
It seems there's evidence that spite strategies also happen between individuals, though I didn't look at the actual studies. If that's right, do you also think "mean"/irreconcilable values held by an individual should be thrown overboard, if some values have to be thrown overboard?
On the other hand, if there are 2 teams fighting each other, then a team that instructs its members to hurt the other team (at cost) gains in terms of the percentage of energy controlled by the team; this situation is important enough that we have a common term for it, "war".
Nitpick: this doesn't seem like a good description of war, or at least most war. War is usually about actually getting stuff (not particularly about harming the other; burning the peasants is eventually suicide for the warlord). Maybe some modern wars fit your description, e.g. WW1? I think pre-modern wars were usually about land, plunder, taxes, trade routes (ETA: we could say, "charismatic megaparasitism"). But I'm not sure, interested in counterinfo.
Even if war has the goal of "getting resources", hurting the other team is an instrumental goal, which is why there is so much investment in weapons.
I don't think "war is for resources" is a good fit to most wars. Trade is another way of getting resources; trade efficiency theorems imply alignment between selfishness and Pareto efficiency. War is incredibly costly; if it were about resources, much more would be spent on peace negotiations to make war as counterfactual (and therefore rare) as possible. Justifications for wars are often ideological (e.g. religious, or about capitalism/communism), which are basically about competing collectives. It is common to continue fighting a war after it's considered unwinnable, e.g. in Vietnam.
There's a naive theory that problems are caused by selfishness and a non-naive theory that problems are caused by spite (and that selfishness and altruism are aligned). Ayn Rand is clearly advocating the non-naive theory, noting that individually selfish people will generally avoid conflicts between each other.
Even if war has the goal of "getting resources", hurting the other team is an instrumental goal,
Sure; I thought in the original context you were saying: since spite strategies aren't based on values "in the individual", this is evidence that group values are less friendly. But getting resources seems like an individual value, and it seems weird to call organizedly getting resources (e.g. through war) a group value if it's just individuals working together to get what they each want.
Justifications for wars are often ideological (e.g. religious or about capitalism/communism)
As I said above, I suspect ideologically motivated actions often mainly *are* expressions of some of people's individual values (through a potential very strong and distorting filter for pursuing things that are coordinatable about, and pursuing them in a coordinatable way); in other words, Christianity is sometimes partly just "let's get together and tax the heathens and/or take their land" with a different value of "us".
It sounds like you're talking about modern wars, like say in the past one or two centuries. Modern wars seem weirder than older ones, though maybe older ones would seem weirder with more detail. I think you're saying that modern wars don't look like anyone serving selfish individual ends, is that right? I'm not even sure about that claim. Even the paradigmatic case of spite, the Nazis, seems muddled; Lebensraum, and freedom from WW1 reparations, seem like key war motives (if not genocide motives). (To be clear, I'm taking "my grandchildren prosper" as a selfish value, and the hypothesis is that intuitions about subjugation point towards even war being worth it, because subjugation could be permanent and in particular could impoverish your grandchildren.)
Ayn Rand is clearly advocating the non-naive theory, noting that individually selfish people will generally avoid conflicts between each other.
I think Rand specifically would be disgusted by violent raiders, for being parasites instead of expressing their life through creating value for themselves. The villains in Atlas Shrugged are referred to as "looters" and "plunderers". That's a more specific meaning to "selfish" than "not kowtowing to the common good".
As I said above, I suspect ideologically motivated actions often mainly are expressions of some of people’s individual values
Sometimes the teleology is more at a group than an individual level, e.g. green beard genes.
Even the paradigmatic case of spite, the Nazis, seems muddled
There were suicides at the end of the war. While there were some selfish reasons for this, the article mentions:
Secondly, many Nazis had been indoctrinated in unquestioning loyalty to the party and with it its cultural ideology of preferring death over living in defeat.
Hitler himself said that he had been trained in WW1 to move towards danger.
The villains in Atlas Shrugged are referred to as “looters” and “plunderers”. That’s a more specific meaning to “selfish” than “not kowtowing to the common good”.
Hmm... it seems like an important part of her philosophy that the looters aren't actually pursuing their own values, they're expecting to be taken care of as part of a collective; I'm thinking more of The Virtue of Selfishness than her fiction.
Sometimes the teleology is more at a group than an individual level, e.g. green beard genes.
Yeah. I've updated some from the arguments about spite in Constantin's post.
Nazi suicides: interesting. Still seems pretty muddled (I mean, I'm unsure + confused); like, there was a lot of rape and subjugation after the war, and they might have expected that. But still, suicide is pretty extreme so this is evidence against individual values being operative.
it seems like an important part of her philosophy that the looters aren't actually pursuing their own values, they're expecting to be taken care of as part of a collective; I'm thinking more of The Virtue of Selfishness than her fiction.
Maybe I'll look at that. So, definitely in Atlas Shrugged, the looters are described as not pursuing their own values, and as not even necessarily expecting anything (they're intentionally destroying the substrate that's sustaining them, and when confronted with this fact they either scream or just keep mashing the "collective good" button). I'm saying that, by "selfish" in the good sense, Rand is excluding both the people who aren't pursuing their values, and the people who *are* pursuing their values by consciously deciding to loot. The character of Fred Kinnan is consciously looting:
"I'm a racketeer – but I know it and my boys know it, and they know that I'll pay off. Not out of the kindness of my heart, either, and not a cent more than I can get away with.... Sure it makes me sick sometimes, it makes me sick right now, but it's not me who's built this kind of world – you did – so I'm playing the game as you've set it up and I'm going to play it for as long as it lasts – which isn't going to be long for any of us."
https://www.shmoop.com/study-guides/literature/atlas-shrugged/fred-kinnan
(I could be straightforwardly wrong about what Rand thinks. And I notice she has Kinnan say, "it's not me who's built this...".)
Fred Kinnan is a comparatively sympathetic character among the looter coalition, for more or less the reason you just described. I think Rand's opinion is that people like Kinnan are being locally rational & self-interested, but within a worldview that is truncated in an unprincipled way to embed a conflict theory that is in tension with his ability to recognize & extract material concessions and, if taken to its logical conclusion, involves a death wish. It doesn't seem like he's enjoying his life or really has any specific concrete intentions.
Robert Stadler is another interesting mixed character. He starts out with specific intentions (learning how the physical world works on a deep level). This eventually puts him in conflict with the looters, and unlike the viewpoint character Dagny Taggart he submits to their worldview, giving up his sanity & the agenda that made his life worth living in order to occupy a place in their regime.
Kinnan is better adapted to cynically hold onto his position for longer, but at the price of the kinds of hopes that created a conflict for Stadler.
I agree Kinnan is more sympathetic, intentionally so. Like, if everyone around is a Kinnan, you just have to be good at mechanism design, and their local selfishness will, like fluid filling a container, form something good (according to the mechanism designer). I'm saying that Kinnan doesn't kowtow to the collective in the same way; but is still a looter, is still not living up to Rand's visionary form of selfishness that loves life, and would find his way to conflict with people, if that were in his local self-interest. In other words, I'm trying to say that although dropping selfishness altogether seems more something (less value-enacting; more able to be sucked into totally ungrounded maelstroms) than being a Kinnan, still, being selfish isn't enough to avoid conflict.
I like how Stadler's arc adds a touch of real horror to the story (related to the point of the OP). Where the viewpoint characters merely sustain the regime until they decide not to, Stadler "lets the cat out of the bag" and finds himself blindsided by the regime turning genuine scientific insight to depraved ends.
Are you saying that MIRI enforces altruism in their employees? If so, how do they do that, exactly?
In the linked thread there was a discussion of standard security practices; Zack pointed out that these are generally for making people act against their interests, but this was not considered a sufficient objection by some in the thread, who thought that researchers acting against their own interests could build FAI.
If aliens were to try to infer human values, there are a few information sources they could start looking at. One would be individual humans, who would want things on an individual basis. Another would be expressions of collective values, such as Internet protocols, legal codes of states, and religious laws. A third would be values that are implied by the presence of functioning minds in the universe at all, such as a value for logical consistency.
It is my intuition that much less complexity of value would be lost by looking at the individuals than looking at protocols or general values of minds.
Let's first consider collective values. Inferring what humanity collectively wants from internet protocol documents would be quite difficult; the fact that a SYN packet must be followed by a SYN-ACK packet is a decision made in order to make communication possible, rather than an expression of a deep value. Collective values, in general, involve protocols that allow different individuals to cooperate with each other despite their differences; they need not contain the complexity of individual values, as individuals within the collective will pursue those anyway.
Distinctions between different animal brains form more natural categories than distinctions between institutional ideologies (e.g. in terms of density of communication, such as in neurons), so that determining values by looking at individuals leads to value-representations that are more reflective of the actual complexity of the present world in comparison to determining values by looking at institutional ideologies.
There are more degenerate attractors in the space of collective values than in individual values, e.g. with each person trying to optimize "the common good" in a way that means that they say they want "the common good", which means "the common good" (as a rough average of individuals' stated preferences) thinks their utility function is mostly identical with "the common good", such that "the common good" becomes a mostly self-referential phrase, referring to something with little resemblance to what anyone wanted in the first place. (This has a lot in common with Ayn Rand's writing in favor of "selfishness".)
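The degenerate attractor can be sketched with a toy model (hypothetical: the mixing weight `w` and the averaging rule are my assumptions, not from the post). Each person publicly states a value that mixes their intrinsic value with the current average of stated values ("the common good"); as the weight on the average grows, the spread of stated values collapses, and at w = 1 the stated "common good" refers only to itself.

```python
# Toy model of the "common good" attractor (illustrative assumptions:
# the mixing weight w and the averaging rule are mine, not the post's).
# Each person states a value mixing their intrinsic value with the
# average of everyone's stated values.

def stated_values(intrinsic, w, iterations=200):
    """Iterate stated_i <- (1 - w) * intrinsic_i + w * mean(stated)."""
    stated = list(intrinsic)
    for _ in range(iterations):
        avg = sum(stated) / len(stated)
        stated = [(1 - w) * x + w * avg for x in intrinsic]
    return stated

def spread(values):
    """Range of values: how much individual variation survives in public."""
    return max(values) - min(values)
```

With intrinsic values [0, 1, 2, 3, 4], the spread of stated values is 4.0 at w = 0, shrinks to 0.4 at w = 0.9, and hits 0 at w = 1, at which point everyone's stated value is just the self-referential average, with little resemblance to what anyone wanted in the first place.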
There is reason to expect that spite strategies, which involve someone paying to harm others, are collective, rather than individual. Imagine that there are 100 different individuals competing, and that they have the option of paying 1 unit of their own energy to deduct 10 units of another individual's energy. This is clearly not worth it in terms of increasing their own energy, and is also not worth it in terms of increasing the percentage of the total energy owned by them, since paying 1 energy only deducts 0.1 units of energy from the average individual. On the other hand, if there are 2 teams fighting each other, then a team that instructs its members to hurt the other team (at cost) gains in terms of the percentage of energy controlled by the team; this situation is important enough that we have a common term for it, "war". Therefore, collective values are more likely than individual values to encode conflicts in a way that makes them fundamentally irreconcilable.
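The arithmetic in the paragraph above can be checked directly (the 1-for-10 spite trade and the 100 individuals / 2 teams setup are from the hypothetical; the starting energy of 10 units per individual is an assumed number):

```python
# Checks the spite-strategy arithmetic from the example above. The
# 1-for-10 trade is from the hypothetical in the text; the starting
# energy per individual (10.0) is an assumed illustrative number.

def individual_shares(n=100, energy=10.0, cost=1.0, harm=10.0):
    """One individual pays `cost` to deduct `harm` from one other.
    Returns (own share of total energy before, after)."""
    total = n * energy
    return energy / total, (energy - cost) / (total - cost - harm)

def team_shares(team_size=50, energy=10.0, cost=1.0, harm=10.0):
    """In a 2-team world, one member pays `cost` to deduct `harm` from
    the other team. Returns (team's share of total before, after)."""
    total = 2 * team_size * energy
    team = team_size * energy
    return team / total, (team - cost) / (total - cost - harm)
```

Individually, spite shrinks one's own share (9/989 < 10/1000); for a team, the same trade grows the team's share (499/989 > 500/1000), matching the claim that spite pays collectively but not individually.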
Let us also consider values necessary for minds-in-general. I talked with someone at a workshop recently who had the opinion that AGI should optimize an agent-neutral notion of "good", coming from the teleology of the universe itself, rather than human values specifically, although it would optimize our values to the extent that our values already align with the teleology. (This is similar to Eliezer Yudkowsky's opinion in 1997.)
There are some values embedded in the very structure of thought itself, e.g. a value for logical consistency and the possibility of running computations. However, none of these values are "human values" exactly; at the point where these are the main thing under consideration, it starts making more sense to talk about "the telos of the universe" or "objective morality" than "human values". Even a paperclip maximizer would pursue these values; they appear as convergent instrumental goals.
Even though these values are important, they can be assumed to be significantly satisfied by any sufficiently powerful AGI (though probably not optimally); the difference in the desirability between a friendly and unfriendly AGI, therefore, depends primarily on other factors.
There is a somewhat subtle point, made by Spinoza, which is that the telos of the universe includes our own values as a special case, at our location; we do "what the universe wants" by pursuing our values. Even without understanding or agreeing with this point, however, we can look at the way pure pursuit of substrate-independent values seems subjectively wrong, and consider the implications of this subjective wrongness.
"I", "you", "here", and "now" are indexicals: they refer to something different depending on when, where, and who speaks them. "My values" is indexical; it refers to different value-representations (e.g. utility functions) for different individuals.
"Human values" is also effectively indexical. The "friendly AI (FAI) problem" is framed as aligning artificial intelligence with human values because of our time and place in history; in another timeline where octopuses became sapient and developed computers before humans, AI alignment researchers would be talking about "octopus values" instead of "human values". Moreover, "human" is just a word; we interpret it by accessing actual humans, including ourselves and others, and that is always indexical, since which humans we find depends on our location in spacetime.
Eliezer's metaethics sequence argues that our values are, importantly, something computed by our brains, evaluating different ways the future could go. That doesn't mean that "what score my brain computes on a possible future" is a valid definition of what is good, but rather, that the scoring is what leads to utterances about the good.
The fact that actions, including actions about what to say is "good", are computed by the brain does mean that there is a strong selection effect in utterances about "good". To utter the sentence "restaurants are good", the brain must decide to deliver energy towards this utterance.
The brain will optimize what it does to a significant degree (though not perfectly) for continuing to receive energy, e.g. handling digestion and causing feelings of hunger that lead to eating. This is a kind of selfishness that is hard to avoid. The brain's perceptors and actuators are indexical (i.e. you see and interact with stuff near you), so at least some preferences will also be indexical in this way. It would be silly for Alice's brain to directly care about Bob's digestion as much as it cares about Alice's; separation of concerns is implemented by the presence of nerves running directly from Alice's brain to Alice's digestive system but not to Bob's.
For an academic to write published papers about "the good", they must additionally receive enough resources to survive (e.g. by being paid), provide a definition that others' brains will approve of, and be part of a process that causes them to be there in the first place (e.g. which can raise children to be literate). This obviously causes selection issues if the academics are being fed and educated by a system that continues asserting an ideology in a way not responsive to counter-evidence. If the academics would lose their job if they defined "good" in a too-heretical way, one should expect to see few heretical papers on normative ethics.
(It is usual in analytic philosophy to assume that philosophers are working toward truths that are independent of their individual agendas and incentives, with bad academic incentives being a form of encroaching badness that could impede this, whereas in continental philosophy it is usual to assert that academic work is done by individuals who have agendas as part of a power structure, e.g. Foucault saying that schools are part of an imperial power structure.)
It's possible to see a lot of bad ethics in other times and places as resulting from this sort of selection effect (e.g. people feeling pressure to agree with prevailing beliefs in their community even if they don't make sense), although the effect is harder to see in our own time and place due to our own socialization. It's in some ways a similar sort of selection effect to the fact that utterances about "the good" must receive energy from a brain process, which means we refer to "human values" rather than "octopus values" since humans, not octopuses, are talking about AI alignment.
In optimizing "human values" (something we have little choice in doing), we are accepting the results of evolutionary selection that happened in the past, in a "might makes right" way; human values are, to a significant extent, optimized so that humans having these values successfully survive and reproduce. This is only a problem if we wanted to locate substrate-independent values (values applicable to minds in general); substrate-dependent values depend on the particular material history of the substrate, e.g. evolutionary history, and environmentally-influenced energy limitations are an inherent feature of this history.
In optimizing "the values of our society" (also something we have little choice in, although more than in the case of "human values"), we are additionally accepting the results of historical-social-cultural evolution, a process by which societies change over time and compete with each other. As argued at the beginning, parsing values at the level of individuals leads to representing more of the complexity of the world's already-existing agency, compared with parsing values at the level of collectives, although at least some important values are collective.
This leads to another framing on the relation between individual and collective values: preference falsification. It's well-known that people often report preferences they don't act on, and that these reports are often affected by social factors. To the extent that we are trying to get at "intrinsic values", this is a huge problem; it means that with rare exceptions, we see reports of non-intrinsic values.
A few intuition pumps for the commonality of preference falsification:
1. The degree of difference in stated values across historical time periods, far exceeding any actual change in human genetics; the stated values often correspond to over-simplified values such as "maximizing productivity", or to simple religious values.
2. Commonality of people expressing lack of preference (e.g. about which restaurant to eat at), despite the experiences resulting from the different choices being pretty different.
3. Large differences between human stated values and the predictions of evolutionary psychology, e.g. the commonality of people asserting that sexual repression is good.
4. Large differences in expressed values between children and adults, with children expressing more culturally-neutral values and adults expressing more culturally-specific ones.
5. "Akrasia", people saying they "want" something without actually having the "motivation" to achieve it.
6. Feelings of "meaninglessness", nihilism, persistent depression.
7. Schooling practices that have the effect of causing the student's language to be aimed at pleasing authority figures rather than self-advocating.
Michelle Reilly writes on preference falsification:
(The whole article is excellent and worth reading.)
In general, someone can respond to a threat by complying with it, which includes hiding the threat (sometimes from consciousness itself; Jennifer Freyd's idea of betrayal trauma is related) and saying what one is being threatened into saying. At the end of 1984, after being confined to a room and tortured, the protagonist says "I love Big Brother", the ultimate act of preference falsification. Nothing following that statement can be taken as a credible statement of preference; his expressions of preference have become ironic.
I recently had a conversation with Ben Hoffman where he zoomed in on how I wasn't expressing coherent intentions. More of the world around me came into the view of my consciousness, and I found myself representing the world more concretely, in a way that led me to express simple preferences, such as liking restaurants and looking at pretty, interesting things, while simultaneously feeling fear: it seemed that what I had been doing previously was trying to stay "at the ready" to answer arbitrary questions in a fear-based way. The moment faded, which leads me to believe that it is uncommon for me to feel and express authentic preferences. I do not think I am unusual in this regard. Michael Vassar, in a podcast with Spencer Greenberg (see also a summary by Eli Tyre), estimates that the majority of adults are "conflict theorists" who are radically falsifying their preferences, which is in line with Venkatesh Rao's estimate that 80% of the population are "losers" acting from defensiveness and trying to make information relevant to comparisons between people illegible. In the "postrationalist" memespace, it is common to talk as if illegibility were an important protection: revealing information about one's self reveals vulnerabilities to potential attackers, making it harder to "hide" as a generic, anonymous, history-free, hard-to-single-out person.
Can people who deeply falsify their preferences successfully create an aligned AI? I argue "probably not". Imagine an institution that made everyone in it optimize for some utility function U that was designed by committee. That U wouldn't be the human utility function (unless the design-by-committee process reliably determines human values, which would be extremely difficult), so forcing everyone to optimize U means you aren't optimizing the human utility function; it has the same issues as a paperclip maximizer.
What if you try setting U = "make FAI"? "FAI" is a symbolic token (Eliezer writes about "LISP tokens"); for it to have semantics it has to connect with human value somehow, i.e. someone actually wanting something and being assisted in getting it.
Maybe it's possible to have a research organization where some people deeply preference-falsify and some don't, but for this to work the organization would need a legible distinction between the two classes, so that no one gets confused into thinking they're optimizing the preference-falsifiers' utility function by constraining them to act against their values. (I used the term "slavery" in the comment thread; while somewhat politically charged, it points at something important: preference falsification causes someone to serve another's values, or an imaginary other's values, rather than their own.)
In other words: the motion that builds a FAI must chain from at least one person's actual values, but people under preference falsification can't do complex research in a way that chains from their actual values, so someone who actually is planning from their values must be involved in the project, especially the part of the project that is determining how human values are defined (at object and process levels).
Competent humans are both moral agents and moral patients. A sign that someone is preference-falsifying is that they aren't treating themselves, or others like them, as moral patients. They might send costly signals that they aren't optimizing for themselves, but for the common good, against their own interests. But at least some intrinsic preferences are selfish, due to both (a) the indexicality of perceptors/actuators and (b) evolutionary psychology. So purely-altruistic preferences will, in the usual case, come from subtracting selfish preferences from one's values (or sublimating them into altruistic preferences). Eliezer has written recently about the necessity of representing partly-selfish values rather than over-writing them with altruistic values, in line with much of what I am saying here.
How does one treat one's self as a moral agent and patient simultaneously, in a way compatible with others doing so? We must (a) pursue our values and (b) have such pursuit not conflict too much with others' pursuit of their values. In mechanism design, we simultaneously have preferences over the mechanism (incentive structure) and the goods mediated by the incentive structure (e.g. goods being auctioned). Similarly, Kant's Categorical Imperative is a criterion for object-level preferences to be consistent with law-level preferences, which are like preferences about what legal structure to occupy; the object-level preferences are pursued subject to obeying this legal structure. (There are probably better solutions than these, but this is a start.)
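The mechanism-design point can be made concrete with a toy sketch: in a Vickrey (second-price) auction, the rules of the mechanism make truthfully reporting one's object-level valuation a weakly dominant strategy, so preferences over the mechanism and preferences over the goods can coexist without one corrupting the other. This is my own minimal illustration, not from the original text; all numbers are invented.

```python
# Toy Vickrey (second-price) auction: the highest bidder wins but pays the
# second-highest bid, which makes truthful bidding weakly dominant.

def second_price_auction(bids):
    """Return (winner_index, price): highest bidder wins, pays second-highest bid."""
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    winner = order[0]
    price = bids[order[1]]
    return winner, price

def utility(true_value, my_bid, other_bids):
    """Utility of bidder 0, whose true valuation is true_value, when bidding my_bid."""
    bids = [my_bid] + other_bids
    winner, price = second_price_auction(bids)
    return true_value - price if winner == 0 else 0.0

true_value = 10.0
others = [7.0, 4.0]
# Truthful bidding does at least as well as any of these deviations.
truthful = utility(true_value, true_value, others)
assert all(utility(true_value, b, others) <= truthful
           for b in [0.0, 5.0, 8.0, 12.0, 20.0])
```

The design choice mirrors the text's point: the bidder can simultaneously endorse the incentive structure (the auction rules) and pursue their object-level preference (winning the good), without needing to falsify the latter.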
What has been stated so far is, to a significant extent, an argument for deontological ethics over utilitarian ethics. Utilitarian ethics risks constraining everyone into optimizing "the common good" in a way that hides original preferences, which contain some selfish ones; deontological ethics allows pursuit of somewhat-selfish values as long as these values are pursued subject to laws that are willed in the same motion as willing the objects of these values themselves.
Consciousness is related to moral patiency (in that e.g. animal consciousness is regarded as an argument in favor of treating animals as moral patients), and is notoriously difficult to discuss. I hypothesize that a lot of what is going on here is that:
1. There are many beliefs/representations that are used in different contexts to make decisions or say things.
2. The scientific method has criteria for discarding beliefs/representations, e.g. in cases of unfalsifiability, falsification by evidence, or complexity that is too high.
3. A scientific worldview will, therefore, contain a subset of the set of all beliefs had by someone.
4. It is unclear how to find the rest of the beliefs in the scientific worldview, since many have been discarded.
5. There is, therefore, a desire to be able to refer to beliefs/representations that didn't make it into the scientific worldview, but which are still used to make decisions or say things; "consciousness" is a way of referring to beliefs/representations in a way inclusive of non-scientific beliefs.
6. There are, additionally, attempts to make consciousness and science compatible by locating conscious beliefs/representations within a scientific model, e.g. in functionalist theory of mind.
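The subset relationship in points 1-5 can be sketched in code. This is a toy model of my own, not from the original text (the belief names and the filtering criteria are invented): science selects a subset of an agent's beliefs/representations, and "consciousness" is a way of referring to the whole set, including what the filter discards.

```python
# Toy model: an agent's beliefs as a set, with a crude stand-in for the
# scientific method's filtering criteria (point 2 above).

all_beliefs = {
    "water boils at 100C at sea level": {"falsifiable": True, "supported": True},
    "this coffee tastes bitter":        {"falsifiable": False, "supported": True},
    "lead transmutes to gold":          {"falsifiable": True, "supported": False},
    "helping friends is good":          {"falsifiable": False, "supported": True},
}

def passes_scientific_filter(belief):
    """Keep only beliefs that are falsifiable and supported by evidence."""
    props = all_beliefs[belief]
    return props["falsifiable"] and props["supported"]

# Point 3: the scientific worldview is a (proper) subset of all beliefs.
scientific_worldview = {b for b in all_beliefs if passes_scientific_filter(b)}
# Point 5: the discarded beliefs are still used to make decisions;
# "consciousness" refers to beliefs inclusive of these.
non_scientific = set(all_beliefs) - scientific_worldview

assert scientific_worldview < set(all_beliefs)  # proper subset
```

The point of the sketch is only the set relationship: the discarded beliefs don't vanish from the agent, they just vanish from the scientific worldview.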
A chemist will have the experience of drinking coffee (which involves their mind processing information from the environment in a hard-to-formalize way) even if this experience is not encoded in their chemistry papers. Alchemy, as a set of beliefs/representations, is part of experience/consciousness, but is not part of science, since it is pre-scientific. Similarly, beliefs about ethics (at least, the ones that aren't necessary for the scientific method itself) aren't part of the scientific worldview, but may be experienced as valence.
Given this view, we care about consciousness in part because the representations used to read and write text like this "care about themselves", wanting not to erase themselves from their own product.
There is, then, the question of how (or if) to extend consciousness to other representations, but at the very least, the representations used here-and-now for interpreting text are an example of consciousness. (Obviously, "the representations used here-and-now" is indexical, connecting with the earlier discussion on the necessity of energy being provided for uttering sentences about "the good".)
The issue of extension of consciousness is, again, similar to the issue of how different agents with somewhat-selfish goals can avoid getting into intractable conflicts. Conflicts would result from each observer-moment assigning itself extreme importance based on its own consciousness, and not extending this to other observer-moments, especially if these other observer-moments are expected to recognize the consciousness of the first.
I perceive an important problem with the idea of "friendly AI" leading to nihilism, by the following process:
1. People want things, and wants that are more long-term and common-good-oriented are emphasized.
2. This leads people to think about AI, as it is important for automation, increasing capabilities in the long term.
3. This leads people to think about AI alignment, as it is important for the long-term future, given that AI will be relevant.
4. They have little actual understanding of AI alignment, so their thoughts are based on others' thoughts and on their idea of what good research should look like.
In the process their research has become disconnected from their original, ordinary wanting, which becomes subordinated to it. But an extension of the original wanting is what "friendly AI" is trying to point at. Unless these were connected somehow, there would be no reason or motive to value "friendly AI"; the case for it is based on reasoning about how the mind evaluates possible paths forward (e.g. in the metaethics sequence).
It becomes a paradoxical problem when people don't feel motivated to "optimize the human utility function". But their utility function just is what they're motivated to pursue, so this is absurd, unless there is mental damage causing motivations to fail to cohere at all. This could be imprecisely summarized as: "If you don't want it, it's not a friendly AI". The token "FAI" is meaningless unless it connects with a deep wanting.
This leads to a way that a friendly AI project could be more powerful than an unfriendly AI project: the people working on it would be more likely to actually want the result in a relatively-unconfused way, so they'd be more motivated to actually make the system work, rather than just pretending to try to make the system work.
Alignment researchers who were in touch with "wanting" would be treating themselves and others like them as moral patients. This ties in to my discussion of my own experiences as an alignment researcher. I said at the end:
This is a pretty general statement, but now it's possible to state the specifics better. There is little reason to expect that alignment researchers that don't treat themselves and others like them as moral patients are actually treating the rest of humanity as moral patients. From a historical outside view, this is intergenerational trauma, "hurt people hurt people", people who are used to being constrained/dominated in a certain way passing that along to others, which is generally part of an imperial structure that extends itself through colonization; colonizers often have narratives about how they're acting in the interests of the colonized people, but these narratives can't be evaluated neutrally if the colonized people in question cannot speak. (The colonization of Liberia is a particularly striking example of colonial trauma). Treating someone as a moral patient requires accounting for costs and benefits to them, which requires either discourse with them or extreme, unprecedented advances in psychology.
I recall a conversation in 2017 where a CFAR employee told someone I knew (who was a trans woman) that there was a necessary decision between treating the trans woman in question "as a woman" or "as a man", where "as a man" meant "as a moral agent" and "as a woman" meant "as a moral patient", someone who's having problems and needs help. That same CFAR person later told me about how they are excited by the idea of "undoing gender". This turns out to align with the theory I am currently advocating, that it is necessary to consider one's self as both a moral agent and a moral patient simultaneously, which is queer-coded in American 90s culture.
I can see now that, as long as I was doing "friendly AI research" from a frame of trying not to be bad or considered bad (implicitly, trying to appear to serve someone else's goals), everything I was doing was a total confusion; I was pretending to try to solve the problem, which might have possibly worked for a much easier problem, but definitely not one as difficult as AI alignment. After having left "the field" and gotten more of a life of my own, where there is relatively less requirement to please others by seeming abstractly good (or abstractly bad, in the case of vice signaling), I finally have an orientation that can begin to approach the real problem while seeing more of how hard it is.
The case of aligning AI with a single human is less complicated than the problem of aligning it with "all of humanity", but it still contains most of the difficulty. There is a potential failure mode where alignment researchers focus too much on their own utility functions at the expense of considering others', but (a) this is not the problem on the margin, given that aligning AI with even a single human's utility function contains most of the difficulty, and (b) it could potentially be solved with incentive alignment (inclusive of mechanism design and deontological ethics) rather than enforced altruism, which is nearly certain to actually enforce preference falsification, given the difficulty of checking actual altruism.