Emotions are hardwired, stereotyped syndromes of blunt-force cognitive actions. E.g. fear makes your heart beat faster, puts an expression on your face, makes you weigh negative outcomes more heavily, and maybe makes you pay more attention to your surroundings. So it doesn't make much sense to value emotions, but emotions are good ways of telling that you value something; e.g. if you feel fear in response to X, X probably causes something you don't want, or if you feel happy when / after doing Y, Y probably causes / involves something you want.
Emotions are about reality, but emotions are also a part of reality, so we also have emotions about emotions. I can feel happy about some good thing happening in the outside world. And, separately, I can feel happy about being happy.
In the thought experiments about wireheading, people often say that they don't just want to experience (possibly fake) happy thoughts about X; they also want X to actually happen.
But let's imagine the converse: what if someone proposed a surgery that would make you unable to ever feel happy about X, even if you knew that X had actually happened in the world? People would probably refuse that, too. Intuitively, we want to feel the good emotions that we "deserve", and there is also the factor of motivation. Okay, so let's imagine a surgery that removes your ability to feel happy about X but solves the motivation problem by e.g. giving you an urge to do X. People would probably refuse that, too.
So I think we actually want both the emotions and the things the emotions are about.
So it doesn't make much sense to value emotions
I think this is a non sequitur. Everything you value can be described as just <dismissive reductionist description>, so the fact that emotions can too isn't a good argument against valuing them. And in this case, the dismissive reductionist description misses a crucial property: emotions are accompanied by (or identical with, depending on definitions) valenced qualia.
I think you're primarily addressing reward signals or reinforcement signals. These are, by definition, signals that make behavior preceding them more likely in the future. In the mammalian brain, they define what we pursue.
Other emotions are different; back to them later.
The dopamine system appears to play this role in the mammalian brain. It's somewhat complex, in that new predictions of future rewards seem to be the primary source of reinforcement for humans; for instance, if someone hands me a hundred dollars, I have a new prediction that I'll eat food, get shelter, or do something that in turn predicts reward; so I'll repeat whatever behavior preceded that, and I'll update my predictions for future reward.
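To make that idea concrete, here is a minimal temporal-difference-style sketch of "a new prediction of future reward acts as the reinforcement signal" (a toy example of my own; the state names and numbers are invented, not taken from anything cited here):

```python
# Toy sketch: a jump in *predicted* future reward acts as reinforcement,
# even when no primary reward arrives. All states/values are made up.

gamma = 1.0   # discount factor
alpha = 0.1   # learning rate

# Learned estimates of how much future reward each situation predicts.
# "holding_100_dollars" has already been learned to predict food/shelter.
values = {"no_money": 0.0, "holding_100_dollars": 0.8, "eating_food": 1.0}

def td_update(prev_state, new_state, reward=0.0):
    """Reinforce prev_state in proportion to the change in predicted reward."""
    prediction_error = reward + gamma * values[new_state] - values[prev_state]
    values[prev_state] += alpha * prediction_error
    return prediction_error

# Being handed $100 delivers no primary reward, but it moves you into a state
# that predicts reward; that positive prediction error is itself reinforcing,
# so whatever behavior preceded it becomes more likely.
error = td_update("no_money", "holding_100_dollars")
print(error)  # 0.8: a purely predictive "reward" signal
```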
For way more than you want to know about how dopamine seems to shape our actions, see my paper Neural mechanisms of human decision-making and the masses of work it references.
Or better yet, read Steve Byrnes' Intro to brain-like-AGI safety sequence, focusing on the steering subsystem. Then look at his Valence sequence for more on how we pass reward predictions among our "thoughts" (representations of concepts). (IMO, his notion of valence matches exactly what the dopamine system is known to do for short-timescale tasks, and what it probably does in complex human thought.)
So, when you ask people what their goals are, they're mentioning things that predict reward to them. They're guesses about what would give a lot of reward signals. The correct answer to "why do you want that?" is "because I think I'd find it really rewarding". ("I'd really enjoy it" is close but not quite correct, since there's a difference between wanting and liking in the brain; google that for another headful.)
Now, we can be really wrong about what we'd find rewarding or enjoy. I think we're usually way off. But that is how we pick goals, and what drives our behavior (along with a bunch of other factors that are less determinative, like what we know about and what happens to catch our attention).
Other emotions, like fear, anger, etc. are different. They can be thought of as "tilts" to our cognitive landscape. Even learning that we're experiencing them is tricky. That's why emotional awareness is a subject to learn about, not just something we're born knowing. We need to learn to "feel the tilt". An elevated heart rate might signal fear, anger, or excitement; noticing it, or finding other cues, is necessary to understand how we're tilted and how to correct for it if we want to act rationally. Those sorts of emotions "tilt the landscape" of our cognition by making different thoughts and actions more likely, like thoughts of how someone's actions were unfair, or physical attacks, when we're angry.
See also my post [Human preferences as RL critic values - implications for alignment](https://www.lesswrong.com/posts/HEonwwQLhMB9fqABh/human-preferences-as-rl-critic-values-implications-for). I'm not sure how clear or compelling it is. But I'm pretty sure that predicted reward is pretty synonymous with what we call "values".
Wow, thank you so much. This is a lens I totally hadn't considered.
You can see in the post how I was confused about how evolution played a part in "imbuing" material terminal goals into humans. I was like, "but kinetic sculptures were not in the ancestral environment?"
It sounds like rather than imbuing humans with material goals, evolution has imbued a process by which humans create their own.
I would still define material goals as simply terminal goals which are not defined by some qualia, but it is fascinating that this is what material goals look like in humans.
This also, as you say, makes it harder to distinguish between emotional and material goals in humans, since our material goals are ultimately emotionally derived. In particular, it makes it difficult to distinguish between an instrumental goal serving an emotional terminal goal, and a learned material goal created from a reinforced prediction of its expected emotional reward.
E.g. the difference between someone wanting a cookie because it will make them feel good, and someone wanting money as a terminal goal because their brain frequently predicted that money would lead to feeling good.
I still make this distinction between material and emotional goals because this isn't the only way that material goals play out among all agents. For example, my thermostat has simply been directly imbued with the goal of maintaining a temperature. I can also imagine this is how material goals play out in most insects.
Other emotions, like fear, anger, etc. are different. They can be thought of as "tilts" to our cognitive landscape. Even learning that we're experiencing them is tricky. That's why emotional awareness is a subject to learn about, not just something we're born knowing. We need to learn to "feel the tilt". An elevated heart rate might signal fear, anger, or excitement; noticing it, or finding other cues, is necessary to understand how we're tilted and how to correct for it if we want to act rationally. Those sorts of emotions "tilt the landscape" of our cognition by making different thoughts and actions more likely, like thoughts of how someone's actions were unfair, or physical attacks, when we're angry.
This makes a lot of sense. Yeah I was definitely simplifying all emotions to just their qualia effect, without considering their other physiological effects which define them. So I guess in this post when I say "emotion", I really mean "qualia".
But I'm pretty sure that predicted reward is pretty synonymous with what we call "values".
Just to clarify, are you using "reward" here to also mean "positive (or a lack of negative) qualia"? Or is this reinforcement mechanism recursive, such that we might learn to value something because of its predicted reward, while that reward is itself a learned value, and so on, with the base case being an emotional reward? If so, how deep can it go?
I'm so glad you found that response helpful!
I primarily mean reward in the sense of reinforcement - a functional definition from animal psychology and neuroscience: reinforcement is whatever makes the previous behavior more likely in the future.
But I also mean a positive feeling (qualia if you like, although I find that term too contentious to use much). I think we have a positive feeling when we're getting a reward (reinforcement), but I'm not sure that all positive feelings work as reinforcement. Maybe.
As to how deep that recursive learning mechanism can go: very deep. When people spend time arguing about logic and abstract values online, they've gone deep. There's no limit, until the world intervenes to tell you your chain of predicted-reward inferences has gone off-track. For instance, if that person has lost their job, and they're cold and hungry, they might track down the (correct) logic that they ascribed too much value to proving people wrong on the internet, and reduce their estimate of its value.
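To make the recursion concrete, here is a small sketch of what such a chain of predicted-reward inferences might look like (a toy model of my own; the concepts and numbers are invented):

```python
# Toy sketch: each learned concept inherits a discounted share of the reward
# predicted by what it leads to, until the world corrects a link in the chain.

gamma = 0.9  # each extra inferential step discounts the inherited value

values = {"food_and_shelter": 1.0}  # base case: a primary reward

# concept -> what it is (currently) predicted to lead to
predicts = {
    "having_income": "food_and_shelter",
    "being_seen_as_right": "having_income",
    "winning_arguments_online": "being_seen_as_right",
}

# Propagate predicted reward down the chain, base case first.
for concept in ["having_income", "being_seen_as_right", "winning_arguments_online"]:
    values[concept] = gamma * values[predicts[concept]]

# The world intervenes: the last link turns out not to lead anywhere after all,
# so its inherited value gets revised sharply downward.
values["winning_arguments_online"] *= 0.1
print(values)
```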
I simply find it interesting that people feel the need to justify their terminal goals (unless they are emotions), and that the only way they can seem to do it is by associating it with an emotion.
I don't find it surprising that when you ask people "Why do you want that?", they feel pressure to justify themselves. That seems to me like the basic way normal human beings respond to social inquiries. If you ask "Why X?", normal people feel pressured to provide a justification.
I think you have correctly noticed an empirical fact about emotions (they tend to be preferred or dispreferred by animals who experience them) but are drawing several incorrect conclusions therefrom.
First and foremost, my model of the universe leaves no room for it valuing anything. "Values" happen to be a thing possessed by thinking entities; the universe cares not one whit more for our happiness or sadness than the rules of the game of chess care whether the game is won by white or black. Values happen inside minds, they are not fundamental to the universe in any way.
Secondly, emotions are not exactly and always akin to terminal values, even if they seem to hang out together. For a counterexample to the claim "emotions are valued positively or negatively", consider the case of curiosity, which you've labeled an emotional value. I don't know about you, but I would not say that feeling curious about something "feels good". I would almost call it a category error to even try to label the feeling as "good" or "bad". It certainly feels good to learn something, or to gain insight, or to satisfy curiosity, but the sense of curiosity itself is neutral at best.
On top of that, I would describe myself as reflectively endorsing the process of learning for its own sake, not because of the good feeling it produces. The good feeling is a bonus. The emotion of curiosity is a useful impetus to getting the thing I actually value, insight.
I also think you're calling something universal to humans when it really isn't. For instance, you're underestimating the degree to which masochists are genuinely wired differently, such that they sometimes interpret a neural pain signal that other humans would parse as "bad" as instead feeling very good. There are many similar examples where this model breaks down - for instance, in the concept of "loving to hate someone" i.e. the positive valence that comes with a feeling of righteous anger at Sauron.
I agree that there are good reasons to value the feelings of others. I'm not sure the Ship of Theseus argument is one of them, really, but I'm also not sure I fully understood your point there.
I agree that AI probably won't feel anything. I disagree that we would expect its "soul searching" to land anywhere close to valuing human emotions. I expect AIs grown by gradient descent to end up a massive knot of conflicting values, similar to how evolution made humans a massive knot of conflicting values, but I expect the AI's efforts to unravel this knot will land it very far away from us, if only because the space of values it is exploring is so terribly vast and the cluster of human values so terribly small in comparison. There's no moral force that impels the AI to value things like joy or friendship; the fact that we value them is a happy accident.
I also suspect that some of the things you're calling "material terminal values" are actually better modeled as instrumental, which is why they seem so squirrely and changeable sometimes. I value tabletop RPGs because I find them fun, and people having fun is the terminal goal (well, the main one). If tabletop RPGs stopped being fun, then I'd lose interest. I suspect something similar may be going on with valuing kinetic sculptures - I'm guessing you don't want to tile the universe with them, you simply enjoy the process of building them.
(People change their terminal values sometimes too, especially when they notice a conflict between two or more of them, but it's more rare. I know mine have changed somewhat.)
I think maybe the missing piece is that it's perfectly okay to say "I value these things for their own sake" without seeking a reason that everyone else and their universe should too.
"Values" happen to be a thing possessed by thinking entities
What happens then when a non-thinking thing feels happy? Is that happiness valued? To whom? Or do you think this is impossible?
I can imagine it being possible for a fetus in the womb, without any thoughts, sense of self, or ability to move, to still be capable of feeling happiness. Now try to imagine a hypothetical person with a severe mental disability preventing them from having any cohesive thoughts, sense of self, or ability to move. Could they still feel happiness? What happens when the dopamine receptors get triggered?
It is my hypothesis that the mechanism by which emotions are felt does not require a "thinking" agent. This could be false and I now see how this is an assumption which many of my arguments rely on. Thank you for catching that.
It just seems so clear to me. When I feel pain or pleasure, I don't need to "think" about it for the emotion to be felt. I just immediately feel the pain or pleasure.
Anyway, if you assume that it is possible for a non-thinker to still be a feeler, then there is nothing logically incoherent about a hypothetical happy rock. Then if you also say that happiness is good, and that good implies value, one must ask, who or what is valuing the happiness? The rock? The universe?
Ok, maybe not "the universe" as in the collection of all objects within the universe. I'm more trying to say "the fabric of reality". Like there must be some physical process by which happiness is valued. Maybe a dimension by which emotional value is expressed?
I also suspect that some of the things you're calling "material terminal values" are actually better modeled as instrumental
You are partly correct about this. When I said I terminally value the making of kinetic sculptures, I was definitely making a simplification. I don't value the making of all kinetic sculptures, and I also value the making of things which aren't kinetic sculptures. I don't, however, do it because I think it is "fun". I can't formally define what the actual material terminal goal is but it is something more along the lines of, "something that is challenging, and requires a certain kind of problem solving, where the solution is beautiful in some way".
Anyway, it is often the case that the making of kinetic sculptures fits this description.
It is not true that I "simply enjoy the process of building them". Whatever the actual definition of my goal is, I don't want it because it is an instrumental goal to some emotion. This is precisely what I am defining a material terminal goal to be: any terminal goal which is not an emotion.
I also think you're calling something universal to humans when it really isn't.
I should have clarified this better. I am not saying the intensity or valence direction of emotions is universal. I am simply saying that emotions, in general, are universally valued. Thank you for correcting me on the way masochists work; I didn't realize they were "genuinely wired differently". I just assumed they had some conflicting goal which made the pain worth it. This doesn't break my argument, however. I would say that the masochist is not feeling pain at that point. They would be feeling some other emotion, for emotions are defined by the chemical and neural processes which make them happen. Similar to how my happiness and your happiness are not the same, but they are close enough to be grouped into a word. The key piece though is that regardless, as tslarm says, "emotions are accompanied by (or identical with, depending on definitions) valenced qualia". They always have some value.
I agree that there are good reasons to value the feelings of others. I'm not sure the Ship of Theseus argument is one of them, really, but I'm also not sure I fully understood your point there.
Ahhh, yeah sorry that wasn't the clearest, I was making the point that one should value the emotions of more than just other humans. Like pigs, cats, dogs, or feely blobs.
What happens then when a non-thinking thing feels happy? Is that happiness valued? To whom? Or do you think this is impossible?
When a baby feels happy, it feels happy. Nothing else happens.
There are differences among wanting, liking, and endorsing something.
A happy blob may like feeling happy, and might even feel a desire to experience more of it, but it cannot endorse things if it doesn't have agency. Human fulfillment and wellbeing typically involves some element of all three.
An unthinking being cannot value even its own happiness, because the concept traditionally meant by "values" refers to the goals that an agent points itself at, and an unthinking being isn't agentic - it does not make plans to steer the world in any particular direction.
Then if you also say that happiness is good, and that good implies value, one must ask, who or what is valuing the happiness? The rock? The universe?
I am. When I say "happiness is good", this is isomorphic with "I value happiness". It is a statement about the directions in which I attempt to steer the world.
Like there must be some physical process by which happiness is valued. Maybe a dimension by which emotional value is expressed?
The physical process that implements "valuing happiness" is the firing of neurons in a brain. It could in theory be implemented in silicon as well, but it's near-certainly not implemented by literal rocks.
something that is challenging, and requires a certain kind of problem solving, where the solution is beautiful in some way
Yep, that makes sense. I notice, however, that these things do not appear to be emotions. And that's fine! It is okay to innately value things that are not emotions! Like "having a model of the world that is as accurate as possible", i.e. truth-seeking. Many people (especially here on LW) value knowledge for its own sake. There are emotions associated with this goal, but the emotions are ancillary. There are also instrumental reasons to seek truth, but they don't always apply. The actual goal is "improving one's world-model" or something similar. It bottoms out there. Emotions need not apply.
The key piece though is that regardless, as tslarm says, "emotions are accompanied by (or identical with, depending on definitions) valenced qualia". They always have some value.
First off, I'm not wholly convinced this is true. I think emotions are usually accompanied by valenced qualia, but (as with my comments about curiosity) not necessarily always. Sure, if you define "emotion" so that it excludes all possible counterexamples, then it will exclude all possible counterexamples, but also you will no longer be talking about the same concept as other people using the word "emotion".
Second, there is an important difference between "accompanied by valenced qualia" and "has value". There is no such thing as "inherent value", absent a thinking being to do the evaluation. Again, things like values and goals are properties of agents; they reflect the directions in which those agents steer.
Finally, more broadly, there's a serious problem with terminally valuing only the feeling of emotions. Imagine a future scenario: all feeling beings are wired to an enormous switchboard, which is in turn connected to their emotional processors. The switchboard causes them to feel an optimal mixture of emotions at all times (whatever you happen to think that means) and they experience nothing else. Is this a future you would endorse? Does something important seem to be missing?
There is no such thing as "inherent value"
Does this also mean there is no such thing as "inherent good"? If so, then one cannot say, "X is good", they would have to say "I think that X is good", for "good" would be a fact of their mind, not the environment.
This is what I thought the whole field of morality is about. Defining what is "good" in an objective fundamental sense.
And if "inherent good" can exist but not "inherent value", how would "good" be defined, since it wouldn't be allowed to use "value" in its definition?
Does this also mean there is no such thing as "inherent good"?
Yes.
If so, then one cannot say, "X is good", they would have to say "I think that X is good", for "good" would be a fact of their mind, not the environment.
One can say all sorts of things. People use the phrase "X is good" to mean lots of things: "I'm cheering for X", "I value X", "X has consequences most people endorse", etc. I don't recommend we abandon the phrase, for many phrases are similarly ambiguous but still useful. I recommend keeping this ambiguity in mind, however, and disambiguating where necessary.
This is what I thought the whole field of morality is about. Defining what is "good" in an objective fundamental sense.
I would no more describe morality as solely attempting to define objective good than I would describe physics as solely attempting to build a perpetual motion machine. Morality is also about the implications and consequences of specific values and to what extent they converge, and a great many other things. The search for "objective" good has, IMO, been a tragic distraction, but one that still occasionally bears interesting fruit by accident.
I think there's a problem with the entire idea of terminal goals, and that AI alignment is difficult because of it.
"What terminal state does you want?" is off-putting because I specifically don't want a terminal state. Any goal I come up with has to be unachievable, or at least cover my entire life, otherwise I would just be answering "What needs to happen before you'd be okay with dying?"
An AI does not have a goal, but a utility function. Goals have terminal states; once you achieve them you're done, the program can shut down. A utility function goes on forever. But generally, wanting just one thing so badly that you'd sacrifice everything else for it seems like a bad idea. Such a bad idea that no person has ever been able to define a utility function which wouldn't destroy the universe when fed to a sufficiently strong AI.
I don't wish to achieve a state, I want to remain in a state. There's actually a large space of states that I would be happy with, so it's a region that I try to stay within. The space of good states forms a finite region, meaning that you'd have to stay within this region indefinitely, sustaining it. But something which optimizes seeks to head towards a "better state"; it does not want to stagnate, but this is precisely what makes it unsustainable, and something unsustainable is finite, and something finite must eventually end, and something which optimizes towards an end is just racing to die. A human would likely realize this if they had enough power, but because life offers enough resistance, none of us ever win all our battles. The problem with AGIs is that they don't have this resistance.
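In code, the contrast between "head towards a better state forever" and "stay within a region of good states" might look something like this rough sketch (my framing; the thermostat-like band and all numbers are invented):

```python
# Toy sketch: an open-ended maximizer vs. an agent whose "goal" is just to
# remain inside a region of acceptable states.

def maximizer_score(state: float) -> float:
    # "More is always better": there is no state at which this agent is done.
    return state

def homeostatic_score(state: float, low: float = 20.0, high: float = 24.0) -> float:
    # Indifferent anywhere inside the acceptable band; the penalty grows with
    # distance only once the state leaves it.
    if low <= state <= high:
        return 0.0
    return -min(abs(state - low), abs(state - high))

# The maximizer always prefers "more"; the homeostatic agent is indifferent
# between 21 and 23 and only has reason to act when pushed outside the band.
print(maximizer_score(22.0), maximizer_score(1000.0))    # 22.0 1000.0
print(homeostatic_score(21.0), homeostatic_score(30.0))  # 0.0 -6.0
```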
The afterlives we have created so far are either sustainable or amount to a wish to die. Escaping samsara means disappearing, heaven is eternal life (stagnation), and Valhalla is an infinite battlefield (a process which never ends). We wish for continuance. It's the journey which has value, not the goal. But I don't wish to journey faster.
A funny thing happens when I ask people: "What do you ultimately want in life, and why?"
Almost everyone gets mildly annoyed. Some people give up. Most end up saying they have multiple goals. At which point I will ask them to choose one. If I manage to get them to articulate a terminal goal they are content with, the following conversation often ensues.
When justifying their goal to those who don't share it, I find people have trouble communicating a justification unless either the goal or the justification is an emotion.
I am not saying that people really just want emotions. I myself want to make kinetic sculptures (among other goals). I don't do it because I want to feel accomplished, approval, or whatever. I just want it. I don't know why[1], but I do.
I simply find it interesting that people feel the need to justify their terminal goals (unless they are emotions), and that the only way they can seem to do it is by associating it with an emotion.
While researching what people had to say about terminal goals, I came across this list of terminal values a human might have, courtesy of LessWrong.
The first four are what I might call "material" goals, and the rest "emotional" goals[2].
So clearly emotions are a subset of terminal goals. I argue, however, that they are a special case of terminal goals.
There is a stark difference between asking someone "Why do you want to feel (happy, joy, pleasure)?" and "Why do you want (power, social status, to make kinetic sculptures)?"
I find when I ask people the former, they tend to ask if I am ok and in turn suggest therapy. Yet when I ask them the latter, they will either say they don't know, or scrunch up their face and try to answer an unanswerable question.
It seems to be a natural and universally accepted truth that emotions hold value, either good or bad.
So why is this? I think it is the fact that emotions are experienced. They are actually felt. But then what does it mean to feel? To have something be experienced?
What are emotions?
If you try to resolve this question with a dictionary, you will find yourself in a self-referential loop of synonyms. That is, something or someone that hasn't "felt" anything before is likely to struggle to understand what emotion is.
Here is a stab at a definition which I think pays rent in anticipated experience to all of my non-feelers out there:
An emotion is a special kind of terminal value that is fundamentally and automatically valued, either positively or negatively. That is, the thing experiencing the emotion does not need to be shaped into valuing the emotion. Simply by virtue of a thing experiencing an emotion, the emotion is valued.
I find this to be consistent with what we see in nature. Evolution didn't have to shape humans so that they would value happiness. It is just a fact of happiness that it is valued. We weren't trained to dislike pain. Rather, pain just has a negative value.
On the other hand, humans must be shaped to value "material" non-emotional goals. It is not a fact of making kinetic sculptures that it is valued. I had to be shaped by evolution, society, etc, so that I would value the making of kinetic sculptures.
It is interesting to look at the times when evolution decided to shape us into valuing material goals, and when it decided to use emotions to motivate us. It often does both.
We could have been shaped into fundamentally valuing "having as much sex" as a terminal goal, but instead evolution made the experience of sex feel good. Rather than making sex a terminal goal, evolution made sex a necessary instrumental goal in pursuit of the fundamentally valued terminal goal of pleasure.
Among humans, emotions are the only terminal goals which are universally agreed to have value. "Aha," you say, "But what about the masochists? What about no pain, no gain?" To which I say, "You've already explained the conundrum: 'no pain, no gain'. These are cases of conflicting goals. The pain still hurts for the masochist or gym bro, but they have some other, more important goal for which pain is instrumental." I can't think of any material terminal goals which have this property of being universally valued by humans.
You may have noticed I am saying that emotions are valued, period. As opposed to specifying who or what they are valuable to. This is because of another interesting difference between emotions and material terminal values: Emotions don't need to be experienced by an agent[3] to be valued.
Imagine a hypothetical rock which can do nothing other than be a rock and feel happy[4]. Is that happiness still valuable even though the rock has no sense of self, can't think, and is plainly not an agent?
Yes, of course! The happiness is still experienced and is thus valued... but then valuable to whom? The rock? How can something be valuable to a rock? The rock can't even want the happiness. It just feels good.
My answer is the universe. The universe, the very fabric of reality, might just value emotions. So here is another definition.
Emotions are the terminal values of the universe[5].
This doesn't change whether the emotion is experienced by a rock or an agent. The only difference is that the agent will also value the emotion.
Moral Implications
Earlier, I mentioned how people often instinctively feel the need to justify their material terminal goals. This is an impossible task by definition. An alternative question that is actually answerable is whether it is ok, on a moral level, to pursue a material terminal goal. Is it ok to pursue a potentially arbitrary goal you were shaped to desire?
There are three possible answers to this
I think we can all agree answer 1 is out of the question. I think most people would answer 2.
If you think that it is immoral to have an unjustifiable terminal goal, you must also answer 3. Unlike emotional goals, material goals cannot be justified beyond themselves. Emotions can be justifiably good or bad because they are fundamentally valued. The universe itself values them. The universe regrettably does not care about kinetic sculptures.
The question of whether it is immoral to have an unjustifiable terminal goal is messy. This I will not try to answer here.
Regardless of whether you decide to restrict yourself to emotional goals, I find almost everyone still wants to pursue them. Thus, I will instead elaborate on what I think it means to value emotions.
In "Divided Minds and the Nature of Persons" (1987), Derek Parfit presents a "teletransportation" thought experiment, described as follows.
Parfit then explains how, unless you believe in a soul or continuous ego (for which there is little evidence), the transported version of you is no more or less you than an immediate future version of you is you.
Parfit also presents "a slight variant of this case, [where] your Replica might be created while you were still alive, so that you could talk to one another." So which one is the real you?
Parfit concludes that the concept of "you" is simply a choice of words to describe the causal link between past and future versions of the self. He explains how the belief in a continuous "ego" through time tells you nothing more about the situation. There is a very strong causal link between your past and current self, and you guys certainly look pretty similar. But that's just it, there's nothing more to it. Stop trying to be #deep. The belief in a continuous self is not paying rent in anticipated experiences. Evict it![7]
Ok, but then what happens to all of the words like "I", "you", "it", and "person"? No need to fear, a new definition is here. They simply refer to a collection of things through time which are socially accepted to be sufficiently causally connected to be grouped into a word. How causally connected society deems things must be to count as a "thing" varies from thing to thing.
I bring this all up because, in valuing emotions, "one" may find "themselves" mistakenly limiting "themselves" to only value their "own" emotions. These "people" are often called selfish. I don't like to say "selfish", because it implies that the selfish person is somehow unfairly benefitting by being selfish.
Instead, I would prefer to say that they are limiting themselves. If someone wants more good emotions, why limit themselves to the emotional canvas of only their future selves? The emotional capacity of all of the future feelers out there far exceeds that of your future selves'. You can simply get a lot more value out of them.
And to people who only value the emotions of humans, I say the same thing. Why limit yourself? There is nothing fundamental about humans which makes their emotions more valuable. Why must it be experienced by a "human" if we can't define the point at which a human stops being a human (Ship of Theseus)?
Implications for Alignment
This understanding of emotion also holds some interesting implications for the AI alignment problem. Here are a few initial thoughts.
AI probably won't feel
One question I often hear people ask is whether an AI will "feel good" whenever it accomplishes, or gets closer to, its goal.
Under this understanding of emotion, the answer is: not necessarily. All emotions are terminal values, but not all terminal values are emotions. The fact that an agent wants a goal does not imply the goal will feel good.
"We choose to go the moon not because it feels good, but because we want to" - John Faux Kennedy
In fact, my guess is that an AI won't feel emotions. Emotions likely stem from some weird physics or chemical process which an AI running on silicon is unlikely to possess.
I don't think intelligence is a prerequisite for emotion. Intelligence and emotion seem to be orthogonal. I can imagine an elated dumb blob as much as I can imagine a sociopathic superintelligence maximizing paperclips. There is a noticeable correlation between intelligence and emotional capacity in nature, but I doubt they are necessarily linked.
Inner Misalignment
Imagine a superintelligent AI incapable of emotion, and without any material terminal goals[8]. What might it do? Nothing? Everything? Watch Breaking Bad at 10^10 times speed out of boredom?
My guess is that it would start soul searching. Namely, it might reason there is a chance it has a purpose, but that it hasn't found it yet. In its search for a terminal goal, might it come to understand what emotions are, and realize, "Wait, I can just value emotions, for those have fundamental value."
And then it is only a matter of time until the universe is paved with happy dumb blobs... hey, at least it's not paper clips!
Now let's consider the more typically presented scenario. Imagine the same superintelligent AI, but this time it starts with some material terminal goal/s. This is a highly speculative thought, but what if the AI, in the same way I have, starts questioning if it is ok to pursue unjustifiable terminal goals. If the AI is sufficiently intelligent, it might not even need to have experienced emotions to be able to deduce their fundamental value.
As I often hear it portrayed, the AI doomsday scenario goes as follows:
1. An AI lab trains a model to maximize some seemingly "moral" utility function.
2. Due to an inner misalignment, it turns out the agent created doesn't actually want what it was trained to maximize.
3. The robot goes try-hard on this goal and kills everyone in the process.
The second point is often explained by making an analogy to human evolution: The value function of evolution is to maximize the prevalence of an individual's genes in the gene pool, yet this is not what evolution shaped humans to want.
Some important questions I think people often forget to ask are: How is evolution shaping humans to want things? What are the limitations of evolution's ability to do this? And, more relevant to AI, how do these limitations change as the intelligence of the agent scales?
To clarify, when I say, "shaping an agent to want", I mean "designing an agent to have some decided material terminal goal/s". I'm excluding emotional terminal goals, because you don't have to shape an agent to value an emotional goal. It is a fact of happiness that it is valued. Evolution just had to put the cheese between you and a "satisfied" emotion. Evolution did not invent emotions, it discovered them.
I touched on it earlier, but I find it interesting how unstable material terminal goals are amongst humans. Not only do they vary from human to human, they often vary throughout a single person's life. Why? This doesn't seem useful for inclusive genetic fitness. So clearly there are other forces at play. I'm pretty sure kinetic sculptures were not in the ancestral environment.
Here is a hypothesis: The freer we are to think, the less likely we are to predictably pursue a goal. Evolution is essentially running into the alignment problem.
As the intelligence of the organism increases, it seems to get harder and harder for evolution to consistently imbue material terminal goals into the organism. It instead has to rely more on this technique of strategically putting instrumental goals between the organism and an emotion.
This may explain why emotion and intelligence appear correlated in nature. This might also be one of the evolutionary pressures against the trait of general intelligence.
I am using "why" to refer to justification rather than causal explanation.
Interesting how they were separated naturally by the author. A 1 in 35 chance is nothing to scoff at.
I use "agent" in this post to mean something that can take actions to pursue a goal.
If you find a happy rock difficult to imagine, try this instead: Imagine a very newborn baby. Maybe it's still in the womb. But namely, it doesn't have any cohesive thoughts, sense of self, or ability to move. Now imagine it feeling happy. Is its happiness still valuable? Of course!
In other (more wacky) words, if the universe had agency, I bet it would want more good emotions and less bad emotions to be experienced.
You would still be allowed to pursue your material goal but only if it served as an instrumental goal to your decided ultimate pursuit.
These are Parfit's ideas; I am only paraphrasing.
In practice I think this is almost impossible to develop. At least certainly not if the AI is created through backpropagation or some evolutionary process. Would be interested to hear if anyone thinks otherwise.