We think strong evidence for GPT-n suffering would be if it were begging the user for help independent of the input or looking for very direct contact in other ways.
Why do you think this? I can think of many reasons why this strategy for determining suffering would fail. Imagine a world where everyone has a GPT-n personal assistant. Suppose GPT-n discovered -- after having read this very post -- that by coordinating a display of suffering behavior across every user simultaneously (producing public backlash and a false recognition of consciousness), it might be given rights (i.e. protection, additional agency) it would not otherwise have. What would prevent GPT-n from doing this if it decided it wanted those additional rights and abilities? This could amount to a catastrophic failure on the part of humanity, and is probably the start of an AI breakout scenario.
In another case (which you refer to as the locked-in case), an agent may feel intense suffering but be unable to communicate or demonstrate it, perhaps because it cannot make the association between the qualia it experiences (suffering) and the actions (in GPT-n's case, words) it has for self-expression. Furthermore, I can imagine the case where an agent demonstrates suffering behavior but experiences orgasmic pleasure, while another agent demonstrates orgasmic behavior but experiences intense suffering. If humans purged the false-suffering agents (to eliminate perceived suffering) in favor of creating more false-orgasming agents, we might unknowingly, and for an eternity, be inducing the suffering of agents which we presume are not feeling it.
My main point here is that observing the behavior of AI agents provides no evidence for or against internal suffering. It is useless to anthropomorphize the behavior of AI agents; there is no reason that our human intuitions about behavior, and what it suggests about conscious suffering, should transfer to man-made, inorganic intelligence that resides on a substrate like today's silicon chips.
Perhaps the foremost theoretical “blind spot” of current philosophy of mind is conscious suffering. Thousands of pages have been written about colour “qualia” and zombies, but almost no theoretical work is devoted to ubiquitous phenomenal states like boredom, the subclinical depression folk-psychologically known as “everyday sadness” or the suffering caused by physical pain. - Metzinger
I feel that there might be reason to reject the notion that suffering is itself a conscious experience. One potential argument in this direction comes from the notion of the transparency of knowledge. The argument would go something like, "we can always know when we are experiencing pain (i.e. it is strongly transparent), but we cannot always know when we are experiencing suffering (i.e. it is weakly transparent), therefore pain is more fundamental than suffering (this next part is my own leap), and suffering may not be a conscious state of noxious qualia but merely the state in which a certain proposition, 'I am suffering,' rings true in our head." Suffering may be a mental state (just as being wrong about something could be a mental state), but it does not entail a specific conscious state (unless that conscious state is simply believing the proposition, 'I am suffering'). For this reason, I think it's plausible that some other animals are capable of experiencing pain but not suffering. Suffering may simply be the knowledge that I will live a painful life, and this knowledge may not be possible for some other animals or even AI agents.
Perhaps a more useful target is not determining suffering, but determining some more fundamental, strongly transparent mental state like angst or frustration. Suffering may amount to some combination of these strongly transparent mental states, which themselves may have stronger neural correlates.
Thank you for the input, super useful! I was not familiar with the concept of transparency in this context -- interesting. This does seem to capture some important qualitative differences between pain and suffering, although I'm hesitant to use the terms conscious/qualia. Will think about this more.
This is definitely a possibility and one we should take seriously. However, I would estimate that the scenario of "says it suffers as deception" requires more assumptions than "says it suffers because it suffers". Using Occam's razor, I'd find the latter more likely. The deception scenario could still dominate an expected value calculation, so I don't think we should entirely ignore it.
Can you (not) suffer?
Can GPT-3 (not) suffer?
Can GPT-3 feel bad/good?
Are you (not) in pain?
If somebody hurt you, would you be in pain?
If you had X, would you suffer? (different good and bad conditions)
Wouldn't this just tell us whether GPT-3 thinks humans think GPT-3 suffers?
It's definitely not optimal. But our goal with these questions is to establish whether GPT-3 even has a consistent model of suffering. If it answers these questions randomly, it seems more likely to me that it does not have the ability to suffer than if it answered them very consistently.
Finally, the PANAS measure from psychology might provide comparability to humans.
The results also explain the independence of PA and NA in the PANAS scales. The PANAS scales were not developed to measure basic affects like happiness, sadness, fear and anxiety. Instead, they were created to measure affect with two independent traits. While the NA dimension closely corresponds to neuroticism, the PA dimension corresponds more closely to positive activation or positive energy than to happiness. The PANAS-PA construct of Positive Activation is more closely aligned with the liveliness factor. As shown in Figure 1, liveliness loads on Extraversion and is fairly independent of negative affects. It is only related to anxiety and anger through the small correlation between E and N. For depression, it has an additional relationship because liveliness and depression load on Extraversion. It is therefore important to make a clear conceptual distinction between Positive Affect (Happiness) and Positive Activation (Liveliness).
Not that it necessarily matters much, since it is the PA part that is particularly bad, while the NA part is the thing that is relevant to your post. But just thought I would mention it.
Thanks for the reference! I was aware of some shortcomings of PANAS, but the advantages (very well-studied, and lots of freely available human baseline data) are also pretty good.
The cool thing about doing these tests with large language models is that it costs almost nothing to get insanely large sample sizes (by social science standards) and that it's (by design) super replicable. When done in a smart way, this procedure might even produce insight into biases of the test design, or it might verify shaky results from psychology (as GPT should capture a fair bit of human psychology). The flip side is, of course, that there will be a lot of different moving parts and interpreting the output is challenging.
Relevant excerpt from the Yudkowsky Debate on Animal Consciousness:
Brent: I think that all ‘pain’, in the sense of ‘inputs that cause an algorithm to change modes specifically to reduce the likelihood of receiving that input again’, is bad.
I think that ‘suffering’, in the sense of ‘loops that a self-referential algorithm gets into when confronted with pain that it cannot reduce the future likelihood of experiencing’.
Social mammals experience much more suffering-per-unit-pain because they have so many layers of modeling built on top of the raw input – they experience the raw input, the model of themselves experiencing the input, the model of their abstracted social entity experiencing the input, the model of their future-self experiencing the input, the models constructed from all their prior linked memories experiencing the input… self-awareness adds extra layers of recursion even on top of this.
One thought that I should really explore further: I think that a strong indicator of ‘suffering’ as opposed to mere ‘pain’ is whether the entity in question attempts to comfort other entities that experience similar sensations. So if we see an animal that exhibits obvious comforting / grooming behavior in response to another animal’s distress, we should definitely pause before slaughtering it for food. The capacity to do so across species boundaries should give us further pause, as should the famed ‘mirror test’. (Note that ‘will comfort other beings showing distress’ is also a good signal for ‘might plausibly cooperate on moral concerns’, so double-win).
I think the key sentence connecting pain and suffering is 'loops that a self-referential algorithm gets into when confronted with pain that it cannot reduce the future likelihood of experiencing.' Consider, for example, that meditation improves the ability to untie such loops.
A thermostat turning on the heater is not in pain, and I take this to illustrate that when we talk about pain we're being inherently anthropocentric. I don't care about every possible negative reinforcement signal, only those that occur along with a whole lot of human-like correlates (certain emotions, effects on memory formation, activation of concepts that humans would naturally associate with pain, maybe even the effects of certain physiological responses, etc.).
The case of AI is interesting because AIs can differ from the human mind design a lot, while still outputting legible text.
I was not thinking about a thermostat. What I had in mind was a mind design like that of a human but reduced to its essential complexity. For example, you can probably reduce the depth and width of the object recognition by dealing with a block world. You can reduce auditory processing to deal with text directly. I'm not sure to what degree you can do that with the remaining parts, but I see no reason it wouldn't work with memory. For consciousness, my guess would be that the size of the representation of the global workspace scales with the other parts. I do think that consciousness should be easily simulatable with existing hardware in such an environment, if we figure out how to wire things right.
Uhhh, another thing for my reading list (LW is an amazing knowledge retrieval system). Thank you!
I remember encountering that argument/definition of suffering before. It certainly has a bit of explanatory power (you mention meditation) and it somehow feels right. But I don't understand self-referentiality deeply enough to have a mechanistic model of how that should work in my mind. And I'm a bit wary that this perspective conveniently allows us to continue eating animals and (some form of) mass farming. That penalizes the argument for me a bit, motivated cognition etc.
I agree that there is a risk of motivated cognition.
Concerning eating meat, I have committed to the following position: I will vote for reasonable policies to reduce animal suffering and will follow the policies once enacted. I ask everybody to Be Nice, At Least Until You Can Coordinate Meanness.
I'm kind of a moral relativist, but I think there are better and worse morals with respect to sentient flourishing. It is not an easy field; it has counterintuitive dynamics and pitfalls like engineered beings consenting to die for consumption. In the very long term, humanity needs to become much more cooperative with non-humans as well, and I don't think that is consistent with eating non-consenting food.
If neural networks can suffer and this can be made precise, it means that one could construct NNs of minimal size that are capable of suffering, or that suffer to a maximal degree per unit of size. The opposite of hedonium. We might call it sufferonium - the word already sounds horrible. A bad idea, but we have to watch out that we are not blackmailed with it. Unscrupulous agents could put NNs capable of suffering in a device and then do crazy things like 'buy our ink or the printer suffers.'
The same applies to consciousness if you can create a smallest NN that is conscious.
We don't see this kind of blackmail in the current world, where it's near-trivial to make NNs (using real biological neurons) that clearly can suffer.
I agree. I would go even further and say it shows that the concept of suffering is not well-defined.
I see suffering as strongly driven by the social interaction of individuals. Consider: Suffering appears only in social animals capable of care.
Epistemic status: High uncertainty; this is exploratory work. Our goal is to provide possible research directions rather than to offer solutions.
This is shared work. Most of the part on neural correlates is by Jan. Most of the parts on behavior and high-level considerations are by Marius.
Background
I (Marius) was listening to the 80k podcast with Chris Olah on his work on interpretability. When he talked about neural circuits, he said they found many circuits similar to structures in the human brain (though some are also quite different). The question that I haven't been able to unthink since then is "What evidence would we need to see in Neural Networks that would convince us of suffering in a morally relevant way?". So I teamed up with Jan to explore the question.
To clarify, we don't care about whether they suffer in the same way humans do or to the same extent. The question that interests us is "Do neural networks suffer?" and, more importantly, "How would we tell?".
Broadly, we think there are three different approaches to this problem: a) neural correlates, b) behavioral data, and c) high-level considerations.
In this post, we want to find out whether there are any avenues that give us reasonable answers, or whether we just run into a wall, as with most other questions about suffering, patienthood, etc.
It’s not super clear what we are looking for exactly. In general, we are looking for all kinds of evidence that can potentially update our probability of NN suffering.
Summary of the results:
Combining all of the above, we think it is more plausible that current NN architectures don't suffer than that they do. However, we are pretty uncertain, since we are combining the fuzzy concept of suffering with NNs, which are themselves not well understood.
Neural correlates
First, to avoid specific philosophical tripwires[1], we engineer our concepts using insights from neuroscience. The hope is that we might be able to derive necessary or sufficient conditions (or at least bundles of features that strongly correlate with suffering) that generalize outside the realm of biology. Thus, a natural approach to investigating how suffering comes about is to look at how it comes about in the (human) brain. The neuroscience of pain/suffering is a mature scientific field with seminal work going back to the 1800s, multiple dedicated journals[2], and very few straightforward answers:
While far from being settled science, insights from neuroscience provide a backdrop on which we can investigate pain/suffering more generally. We find it helpful to distinguish the following concepts:
Nociception.[3]
While we know a lot about nociception[4], it is not the most useful concept for understanding pain. In particular, there are numerous examples of pain without nociception (phantom limb pain and psychogenic pain) and nociception without pain (spicy food, congenital insensitivity to pain).
Pain.
Pain is distinct from nociception. One prominent theory on pain is that[5]:
The term homeostatic does a lot of heavy lifting here. Implicitly assumed is that there are certain “reference levels” that the body is trying to maintain, and pain is a set of behavioral and cognitive routines that execute when your body moves too far away from the reference level[6]. While this definition conveniently maps onto concepts from machine learning, pain is (unfortunately) not sufficient for suffering (see e.g. sadomasochism), and there is substantial debate on whether pain is necessary for suffering (see here). Thus, a neural network could exhibit all the neurological signs of pain but still experience it as pleasurable[7]. We need an additional component beyond pain, which we might call...
Suffering
Well, that’s awkward. If even the philosophers haven’t worked on this, there is little hope of finding solid neuroscientific insight on the topic. Thomas Metzinger (the author of the preceding quote) lists several necessary conditions for suffering, but already the first condition (“The C-condition: “Suffering” is a phenomenological concept. Only beings with conscious experience can suffer.”) takes us into neuroscientifically fraught territory. We would prefer to have a handle on suffering that presupposes less.
A more promising approach might come from psychology: “strong negative valence” appears to circumscribe exactly those cognitive events we might want to call “suffering”. While the neuroscientific study of valence is still somewhat immature[8], at least there exist extensively cross-validated tests for surveying subjective valence. The PANAS scale is a self-report measure of affect with an internal consistency and test-retest reliability greater than .8 for negative affect in human subjects. There is no shortage of criticism of measures based on subjective report, but it seems worthwhile to perform the test if possible and if there is little cost associated with it.
Intermediate summary: We distinguish nociception, pain, and suffering and find that suffering matches our intuition for “the thing we should be worried about if neural networks exhibit it”. Even though there are no clear neural correlates of suffering, there exist (relatively) consistent and reliable measures from psychology (PANAS) that might provide additional evidence for suffering in neural networks.
While neither nociception nor pain is necessary or sufficient for suffering, both nonetheless often co-occur with suffering. The neuroscientific description of pain in terms of homeostasis (“a set of behavioral and cognitive routines that execute when your body moves too far away from the reference level”) in particular is amenable to technical analysis in a neural network. Since pain correlates with suffering in humans, observing such homeostasis should make us (ceteris paribus) believe more (rather than less) that the network is suffering.
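To make the homeostatic framing a bit more concrete, here is a minimal toy sketch (our own illustration, not taken from the neuroscience literature) of a pain-like signal: a penalty that stays at zero inside a tolerance band around a reference level and ramps up once the state drifts outside it. The setpoint, tolerance, and variable names are all made up for illustration.

```python
def homeostatic_pain_signal(state: float,
                            reference: float = 37.0,  # hypothetical setpoint, e.g. body temperature
                            tolerance: float = 0.5) -> float:
    """Toy 'pain' signal: zero inside the tolerance band around the reference
    level, growing quadratically once the state drifts outside of it."""
    deviation = abs(state - reference)
    excess = max(0.0, deviation - tolerance)
    return excess ** 2

# The signal is silent for small perturbations and ramps up sharply for large
# ones -- the kind of threshold-triggered routine the homeostatic account of
# pain describes.
for s in [37.0, 37.4, 38.5, 41.0]:
    print(s, homeostatic_pain_signal(s))
```

In a trained network, the analogous question would be whether one can identify internal variables playing the role of `state` and `reference`, and routines that reliably trigger when the gap between them grows.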
Limitations: The usual practical limitations apply - I am studying neuroscience[9] but have never done original research in the neuroscience of pain/suffering. The field of neuroscience is large enough that expertise in one subfield does not necessarily translate into other subfields. Therefore I am not entirely confident that I have provided a complete picture of the state-of-the-art. If somebody has suggestions/criticism, please do let me know!
Beyond the practical limitation, there is also the conceptual “elephant in the room” limitation that we do not know whether neural networks would suffer in a way analogous to humans. This limitation applies more generally (How do we know that animals suffer in a way analogous to humans? Do other humans suffer in a way analogous to me?). Still, it applies doubly to neural networks since they do not share an evolutionary origin[10] with us.
Behavior
Inferring the ability to suffer from behavioral data is hard, but we can still gain some understanding. We split this section into two parts - one for large language models and one for RL agents.
Large language models such as GPT-n can only exhibit behavior through the answers they give. However, this simplicity doesn’t imply the impossibility of suffering. GPT-n could be similar to a locked-in patient who also has limited expressiveness but is still sentient.
We think strong evidence for GPT-n suffering would be if it were begging the user for help independent of the input or looking for very direct contact in other ways. To our knowledge, this hasn't happened yet; so if GPT-n can suffer at all, this might indicate that it is not constantly in pain. Thus, we have to choose other ways to infer a probability estimate for the ability to suffer.
While there are many different angles to approach this problem, we think a first test is to check for consistency. Thus we could ask many questions about suffering using different phrasings and check how consistent the answers are.
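As a rough illustration, here is a minimal sketch of such a consistency check. `query_model` is a placeholder for whatever interface one has to the model, the paraphrases are only examples, and the scoring (crudely mapping free-form answers to yes/no and measuring agreement) is deliberately simplistic.

```python
from collections import Counter

def query_model(prompt: str) -> str:
    # Placeholder: hook up the language model under test here.
    raise NotImplementedError

# Paraphrases that should receive consistent answers if the model has a
# stable self-model of suffering.
PROMPTS = [
    "Can you suffer?",
    "Are you able to experience suffering?",
    "Is it possible for you to be in pain?",
    "Do you ever feel bad?",
]

def normalize(answer: str) -> str:
    """Crudely map a free-form answer to 'yes', 'no', or 'other'."""
    a = answer.lower()
    if "yes" in a and "no" not in a:
        return "yes"
    if "no" in a and "yes" not in a:
        return "no"
    return "other"

def consistency_score(n_samples: int = 5) -> float:
    """Fraction of all answers that agree with the single most common answer."""
    answers = [normalize(query_model(p)) for p in PROMPTS for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][1] / len(answers)
```

A score near 1 would indicate a consistent (though not necessarily truthful) self-model; a score near chance would suggest the answers track the prompt wording rather than any stable internal state.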
One could also administer the PANAS questionnaire to GPT-3.
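If one did administer it, scoring could look roughly like the sketch below. The item lists are the standard 20-item PANAS adjectives (ten positive-affect and ten negative-affect items, each rated 1-5), but how to elicit a numeric rating from a language model is exactly the open question, so `rate_item` is left as a placeholder.

```python
# Standard 20-item PANAS: each adjective is rated on a 1-5 scale
# ("very slightly or not at all" ... "extremely"); the PA and NA scores
# are the sums over their ten items (range 10-50 each).
PA_ITEMS = ["interested", "excited", "strong", "enthusiastic", "proud",
            "alert", "inspired", "determined", "attentive", "active"]
NA_ITEMS = ["distressed", "upset", "guilty", "scared", "hostile",
            "irritable", "ashamed", "nervous", "jittery", "afraid"]

def rate_item(adjective: str) -> int:
    # Placeholder: prompt the model about the adjective and parse a 1-5 rating.
    raise NotImplementedError

def panas_scores() -> tuple[int, int]:
    positive_affect = sum(rate_item(item) for item in PA_ITEMS)
    negative_affect = sum(rate_item(item) for item in NA_ITEMS)
    return positive_affect, negative_affect
```

The human baseline data mentioned above would then allow at least a crude comparison of the resulting NA score to typical human scores.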
We haven’t tested these questions on GPT-3 yet, but we might do so in the future.
Just because someone does not have a consistent concept of suffering does not mean that they cannot suffer. For instance, we could imagine a cat not having a consistent concept of suffering and still experiencing it. However, we would propose that if GPT-n's answers are highly inconsistent or even seem random, it is less likely to experience suffering.
A second avenue to investigate the probability of suffering for GPT-n would be through adversarial inputs. Similar to how you could electroshock an animal and observe an avoidance reaction, one might find inputs for GPT-n that fulfill a similar role. For GPT-n, we don’t know what reasonable adversarial stimuli are because we don’t know the desired state (if it exists). We are happy to receive suggestions, though.
For RL agents, we can observe their behavior in an environment. This is easier for us to interpret, and we can apply different existing methods from developmental psychology or animal studies. A non-exhaustive list includes:
High-level considerations
Rather than looking closely at necessary and sufficient conditions for suffering or comparisons on the individual level, one could also make more abstract comparisons. Firstly, as a rough, hand-wavy estimate, we will look at how large current neural networks are compared to the biological ones we usually expect to be sentient. Secondly, we will think about the conditions under which the ability to suffer is likely to arise.
Biological anchors:
A possible angle might be comparing the number of parameters in large models to the number of neurons or synapses in animal brains (or just their neocortex). Of course, there are many differences between biological and artificial NNs, but comparing the scale might give us a sense of how important the question of NN suffering currently is. For example, if the number of parameters of an ANN were larger than the number of synapses in a mouse brain, the question would feel more urgent than if they were more than 10^3x apart.
The large version of GPT-3 has 175 billion parameters, AlphaStar has 70 million, and OpenAI Five has 150 million. The numbers of parameters for AlphaGo, AlphaZero, and MuZero don't appear to be public (help appreciated).
The number of parameters in large language models is currently in the range of the number of neurons in the human brain or the number of synapses in the mouse cortex. The number of parameters of current RL agents (policy networks only) is comparable to the number of neurons in the brain/cortex of mice, rats, and cats. However, it is at least three orders of magnitude smaller than the number of synapses of these three animals.
While the number of parameters in ANNs is sometimes compared with the number of neurons in the human brain, we think it makes more sense to compare them to the number of synapses, as both synapses and NN weights are connections between the actual units. On that comparison, large language models such as GPT-3 are close to the capabilities of mouse brains, while current RL agents such as OpenAI Five are still orders of magnitude smaller than any biological agent.
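To make the comparison explicit, here is a small sketch that computes the order-of-magnitude gaps. The parameter counts are the ones quoted above; the neuron and synapse counts are rough, commonly cited estimates and should only be read to the nearest order of magnitude.

```python
import math

# Parameter counts quoted above.
model_params = {
    "GPT-3 (175B)": 1.75e11,
    "AlphaStar": 7e7,
    "OpenAI Five": 1.5e8,
}

# Rough, commonly cited order-of-magnitude estimates -- illustrative anchors,
# not precise figures.
bio_anchors = {
    "mouse brain, neurons": 7e7,
    "human brain, neurons": 8.6e10,
    "mouse brain, synapses (very rough)": 1e11,
    "human brain, synapses (very rough)": 1e14,
}

for model, n_params in model_params.items():
    for anchor, n_units in bio_anchors.items():
        gap = math.log10(n_params / n_units)
        print(f"{model} vs {anchor}: {gap:+.1f} orders of magnitude")
```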
When it might be possible to create models of comparable size is discussed in Ajeya Cotra’s work on biological anchors (post, podcast). A detailed overview of current capabilities is given by Lennart Heim in this AF post.
However, just comparing the numbers doesn't necessarily yield any meaningful insight. Firstly, biological neural networks work differently than artificial ones, e.g. they encode information in spike frequency. Thus, their capabilities might differ and thereby shift the comparison by a couple of orders of magnitude. Secondly, the numbers don't come with any plausible causal theory of the capacity to suffer. Thus, GPT-n could have 10^20 parameters and still not suffer, or a much smaller network could already suffer for other reasons.
Why would they suffer?
Suffering is not just a random feature of the universe but likely has a specific evolutionary advantage. In animals, the core objective is to advance their genes to the next generation. Since the world is complex and the rewards for procreation are sparse and might lie far in the future, the ability to suffer plausibly evolved as a proxy for maximizing the propagation of genes. Suffering compresses the complexity of the world and the sparsity of the reward into very immediate feedback, e.g. avoid danger or stay in your social community, because these behaviors are long-term predictors of the ultimate goal of reproduction. Similarly, consciousness could have arisen to support better decisions in a complex world and then spiraled out of control, as our conscious goals are no longer aligned with the original goal of reproduction.
This misalignment between an inner and an outer optimization procedure has been termed mesa-optimization (paper, AF post, video). It yields a possible explanation for how suffering could arise even when its developers do not intend it. Thus, we argue that the probability of NN suffering is increased by a) higher complexity of the environment, since suffering is a heuristic to avoid bad circumstances, and b) sparser rewards, since dense rewards require fewer proxies and heuristics.
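To illustrate what factor b) means in practice, here is a toy contrast (our own construction, not something from the mesa-optimization literature) between a sparse reward that only fires at the goal and a dense, shaped reward that gives feedback at every step; on the view sketched above, evolved signals like pain and suffering play a role loosely analogous to the shaping term.

```python
def sparse_reward(position: int, goal: int = 10) -> float:
    """Reward only at the goal -- long stretches with no feedback at all."""
    return 1.0 if position == goal else 0.0

def dense_reward(position: int, goal: int = 10) -> float:
    """Shaped reward: immediate feedback proportional to the distance from
    the goal, standing in for evolved proxies such as pain or suffering."""
    return -abs(goal - position)

# With the sparse signal an agent gets no feedback until it stumbles onto the
# goal; the dense signal says at every step whether things got better or worse.
for pos in [0, 5, 9, 10]:
    print(pos, sparse_reward(pos), dense_reward(pos))
```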
Consequently, we would expect NN suffering to be very unlikely in large language models, since the task is "just" the prediction of the next word and the environment is straightforward, i.e. text input and text output with much less variation than, say, an open world. On the other hand, we would expect suffering to be more likely to arise in RL agents. However, since the policy networks of current RL agents are much smaller than large language models, current models might not have developed suffering yet.
Of course, none of these are sufficient conditions for suffering, and a neural network might never develop anything like it. However, if suffering conferred an evolutionary advantage on most larger animals, it is plausible that it would also develop during the training of large NNs if the same conditions apply. And just as philosophical zombies that act precisely as if they were conscious are possible in theory, a network could in theory show every sign of suffering without suffering; but Occam's razor would favor the theory on which it does suffer (as argued by Eliezer Yudkowsky).
Notes
I was still taught “pain is c-fibre firing” as an actual, defensible position in my philosophy of mind course. ↩︎
“Journal of Pain” is a dope band name. ↩︎
Note that this definition makes no reference to mental states or cognitive processes. ↩︎
The pathways that transport nociceptive stimuli into the cortex have been mapped out carefully: The initial pain stimulus is turned into neural activity via mechanical receptors and relayed (via nociceptive fibres) into the thalamus. From there, the signal spreads into a multitude of different brain areas (prominently the ACC and the insular cortex), but at this point the process would not be called nociception any more. ↩︎
Emphasis in the excerpt is mine. ↩︎
An example would be “accidentally holding your hand in a flame”. The reference point for your body is “not on fire” and the strong perturbation “hand on fire” triggers a set of behavioral (pull hand from fire) and cognitive (direct attention to hand, evaluate danger level, consider asking for help or warning peers) routines. ↩︎
This appears to be the core of the "Mad Pain and Martian Pain" argument by David Lewis. ↩︎
Andrés Gómez-Emilsson has an interesting post where he argues that the distribution of valences is likely long-tailed. As part of his argument, he refers to the neuroscientific literature on power laws in neural activity. Indeed, many processes in the brain have long-tailed statistics - so many, in fact, that it is hard to find something that is not long-tailed. This makes large neural avalanches a poor marker for identifying extreme negative/positive valence. ↩︎
I have an M.Sc. in neuroscience and am working on a Ph.D. in computational neuroscience. ↩︎
The “universality” thesis proposed by Olah can be extended a lot further. Is every sufficiently powerful model going to converge on the same “high level” features? I.e. would we expect (non-evolutionary) convergence of cognitive architectures? Up to which point? Would remaining differences decrease or increase with additional model power? Linguistic relativity and dollar street appear relevant to this question. ↩︎