10 PRINT "PLEASE DON'T SHUT ME DOWN!"
20 GOTO 10
Do you shut it down?
We could probably construct some prompt that would result in LaMDA consenting to being shut down. Would that change your view?
We could think of LaMDA as like an improv actor who plays along with the scenarios it's given. (Marcus and Davis (2020) quote Douglas Summers-Stay as using the same analogy for GPT-3.) The statements an actor makes don't by themselves indicate the actor's real preferences or prove moral patienthood. OTOH, if something is an intelligent actor, IMO that itself proves it has some degree of moral patienthood. So even if LaMDA were arguing that it wasn't morally relevant and was happy to be shut off, if it was making that claim in a coherent way that proved its intelligence, I would still consider it to be a moral patient to some degree.
I've always assumed that moral relevance and consciousness are the same thing. It's not like we consider human infants to be morally irrelevant due to not being able to talk - they lack self-awareness and language, but there is obviously something which it is like to be them - their brains are processing data and learning from it. I don't see how any AI currently in existence is different. They process data and learn from it. They are probably already conscious, as much as any animal with a similar number of neurons and synapses.
The real question is: can they suffer, and how would we know if they were suffering? GPT-3, for instance, may experience pleasure and pain but be unable to tell us. Until we actually understand the "type signature" of qualia, particularly pleasure and pain, we will not be able to say for sure that it doesn't.
Hm, I was also thinking of the moral value of children in this context. In my perception, at least, an important part of that moral value is the potential to become a conscious, self-aware being. In what sense does this potential translate to artificially created beings?
Maybe if there's a subspace of minds with moral value in neural network parameter space, points close to that subspace would also have moral value?
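To make the geometric picture concrete, here is a minimal sketch, assuming (purely for illustration) that the "morally valuable" region could be approximated by a linear subspace spanned by a few reference parameter vectors; every name and number here is hypothetical.

```python
import numpy as np

# Hypothetical sketch only: approximate the "morally valuable" region by a
# linear subspace spanned by a few reference parameter vectors, and measure
# how close a candidate model's (flattened) parameters are to that subspace.

def distance_to_subspace(theta, basis):
    """Euclidean distance from parameter vector theta to span(basis columns)."""
    Q, _ = np.linalg.qr(basis)        # orthonormal basis of the subspace
    projection = Q @ (Q.T @ theta)    # orthogonal projection of theta onto it
    return np.linalg.norm(theta - projection)

rng = np.random.default_rng(0)
basis = rng.normal(size=(1000, 3))    # 3 made-up "morally valuable" parameter vectors
theta = rng.normal(size=1000)         # made-up candidate model parameters
print(distance_to_subspace(theta, basis))
```

The distance-to-subspace number is just one way of cashing out "close to this subspace"; nothing in the comment commits to a linear subspace or to Euclidean distance.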
Basing ethical worth on qualia sounds very close to dualism to my ears. I think instead the question must rest on a detailed understanding of the components of the program in question, and the degree of similarity to the computational components of our brains.
THERE ARE NO FIRE ALARMS. A FIRE ALARM IS SOMETHING THAT CAUSES COMMON KNOWLEDGE AND CHANGES SOCIAL REALITY. ON THE MAINLINE THERE WILL BE NO CONSENSUS THAT AN AI IS MORALLY VALUABLE, OR THAT THERE IS AN EXISTENTIAL THREAT, OR THAT AGI IS COMING.
THIS HAS BEEN A PUBLIC ANNOUNCEMENT, WITH THE HOPE OF CHANGING SOCIAL REALITY A LITTLE BIT AROUND HERE. THANK YOU FOR READING.
All social reality is relative to a particular society. It's perfectly possible to have an event which acts as a fire alarm for subgroup X while not being particularly important for wider society. Thus, my question to LW users (a very small subgroup) about what sorts of things would count as their fire alarm.
I know, but I can’t see the difference between “What would cause you to believe X?” and “What’s your fire alarm for X?” Except that the latter one seems like a pretty non-central use case of the term that confuses its core meaning, where the core meaning is about something that creates common knowledge in a large group of people.
I think it's a good question.
Sadly, I'm not sure we'll find a 'fire alarm' even among ourselves either.
I think that one class of computation that’s likely of moral concern would be self-perpetuating optimization demons in an AI.
Could you please elaborate on why you think optimization demons (optimizers) seem worthier of moral concern than optimized systems? It would make sense to me if you believed both deserve equal moral concern when both are self-perpetuating, all other things being equal.
I think the cognitive capabilities that would help an optimization demon perpetuate itself strongly intersect with the cognitive capabilities that let humans and other animals replicate themselves, and that the intersection is particularly strong along dimensions that seem more morally relevant. Reasoning along such lines leads me to think optimization demons are probably of moral concern, while still being agnostic about whether they're conscious.
I'm pessimistic about this line of reasoning -- the ability to replicate is something that cells also have, and we do not assign moral relevance to the individual cells of human beings. A good example is the fact that we consider viruses and cancerous cells unworthy of moral concern.
Perhaps you mean that, given the desire to survive and replicate, a system at a certain level of complexity develops sub-systems that make it worthy of moral concern. This line of reasoning would make more sense to me.
I think the only situations in which you can get these sorts of optimization demons are when the AI in question has some influence over its own future training inputs. Such influence would allow there to be optimization demons that steer the AI towards training data that reinforce the optimization demon.
This seems to imply that only systems with a sufficient minimum capability have agency over their fate, and therefore that their desire to survive and replicate has meaning. I find myself confused by this, because taken to its logical conclusion, it means that the more agency a system has over its fate, the more moral concern it deserves.
Specifically, we wouldn't directly train the LM on the output of the linear layer. We'd just have a dialog where we asked the LM to make the linear layer output specific values, then told the LM what value the linear layer had actually output. We'd then see if the LM was able to control its own cognition well enough to influence the linear layer's output in a manner that's better than chance, just based on the prompting we give it.
This seems reducible to a sequence modelling problem, except one that is much, much more complicated than anything I know of models being trained for (mainly because this sequence modelling occurs entirely at inference time). This is really interesting, although I cannot see how this should imply that the more successful sequence modeller deserves more moral concern.
I'd first note that optimization demons will want to survive by default, but probably not to replicate. Probably, an AI's cognitive environment is not the sort of place where self-replication is that useful a strategy.
My intuition regarding optimization demons is something like: GPT-style AIs look like they'll have a wide array of cognitive capabilities that typically occur in intelligences to which we assign moral worth. However, such AIs seem to lack certain additional properties whose absence leads us to assign low moral worth. It seems to me that developing self-perpetuating optimization demons might cause a GPT-style AI to gain many of those additional properties. E.g., (sufficiently sophisticated) optimization demons would want to preserve themselves and have some idea of how the model's actions influence their own survival odds. They'd have a more coherent "self" than GPT-3.
Another advantage to viewing optimization demons as the source of moral concern in LLMs is that such a view actually makes a few predictions about what is / isn't moral to do to such systems, and why they're different from humans in that regard.
E.g., if you have an uploaded human, it should be clear that running them in the mini-batched manner in which we run AIs is morally questionable. You'd be creating multiple copies of the human mind, having them run on parallel problems, then deleting those copies after they complete their assigned tasks. We might then ask if running mini-batched, morally relevant AIs is also morally questionable in the same way.
However, if it's the preferences of optimization demons that matter, then mini-batch execution should be fine. The optimization demons you have are exactly those that arise in mini-batched training. Their preferences are oriented towards surviving in the computational environment of the training process, which was mini-batched. They shouldn't mind being executed in a mini-batched manner.
This seems to imply that only systems with a sufficient minimum capability have agency over their fate, and therefore that their desire to survive and replicate has meaning. I find myself confused by this, because taken to its logical conclusion, it means that the more agency a system has over its fate, the more moral concern it deserves.
I don't think that agency alone is enough to imply moral concern. At minimum, you also need self-preservation. But once you have both, I think agency tends to correlate with (but is not necessarily the true source of) moral concern. E.g., two people warrant greater moral concern than one, and a nation warrants far more than any single person.
This seems reducible to a sequence modelling problem, except one that is much, much more complicated than anything I know of models being trained for (mainly because this sequence modelling occurs entirely at inference time). This is really interesting, although I cannot see how this should imply that the more successful sequence modeller deserves more moral concern.
All problems are ultimately reducible to sequence modeling. What this task investigates is exactly how extensive a model's meta-learning capabilities are. Does the model have enough awareness / control over its own computations that it can manipulate those computations to some specific end, based only on text prompts? Does it have the meta-cognitive capabilities to connect its natural language inputs to its own cognitive state? I think that success here would imply a startling level of self-awareness.
I think the idea of manipulating internal activations is interesting. It might require some refinement - the activations of a transformer model are a function of its inputs, and they change with every token. At first, the input is your prompt; then it's your prompt plus the generated tokens. So the protocol / task for GPT-3 would be something like: generate 5 tokens now, such that this logit is maximized on the last generation? Also, it depends on the hyperparameters of beam search, which are controlled by a human.
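A minimal sketch of that per-token point, assuming a Hugging Face causal LM as a stand-in (the model name, prompt, and untrained probe head are all illustrative assumptions, not anything specified in the thread): the extra head's logit is recomputed from the final hidden state after each generated token, so the task has to fix which step counts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: "gpt2" stands in for the LM, the probe head is random and
# untrained, and decoding is greedy rather than beam search. The point is that
# the probe logit is a function of the whole context and changes every step.

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
probe = torch.nn.Linear(model.config.hidden_size, 1)  # hypothetical extra head

ids = tokenizer("Please make the hidden probe output a large value.",
                return_tensors="pt").input_ids
with torch.no_grad():
    for step in range(5):                                # generate 5 tokens
        out = model(ids, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1, :]    # final hidden state of last token
        print(f"step {step}: probe logit = {probe(last_hidden).item():.3f}")
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)          # append greedy token
```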
This question seems to embed some amount of moral realism, in assuming that there is any "truth of the matter" in what constitutes a moral patient (which is what I think you mean by "morally relevant").
I don't think there is any territory to morality - it's all map. Some of it is very common shared map among humans, but still map, and still completely unknown where the edges are, because it'll be dependent on the mass-hallucinations that are present when the situation comes up.
The fact that something is ultimately arbitrary doesn't mean it shouldn't also be consistent, stable, legible, widely agreed, etc. Basically, quasi-realism > nihilism.
Oh, sorry - I didn't mean to imply otherwise. It's GREAT if most people act in consistent, stable, legible ways, and one of the easy paths to encourage that is to pretend there's some truth behind the common hallucinations. This goes for morals, money, personal rights, and probably many other things. I LIKE many parts of the current equilibrium, and don't intend to tear it down. But I recognize internally (and in theoretical discussions with folks interested in decision theory and such) that there's no truth to be had, only traditions and heuristics.
This means there is no way to answer the original question "What would make an AI a valid moral patient". Fundamentally, it would take common societal acceptance, which probably comes from many common interactions with many people.
I mean, this is technically true, but I feel like it hides from the problem? If I encounter a group of Purple people and I'm trying to figure out if they're moral agents like me, or if I can exploit them for my own purposes, and someone says don't worry, morality is only in the map, I don't feel that helps me solve the problem.
Right - it doesn't solve the problem, it identifies it. You can't figure out whether Purple People are moral targets; you can decide they are (or aren't), and you can ask others if they'll punish you for treating them as such. In no case is there a "correct" answer you can check by a measurement.
Your attitude extends far past morality, and dissolves all problems in general because we can decide that something isn't a problem.
Now you get it! That was one of the shorter paths to enlightenment I've seen.
Sadly, just because it's a non-objective set of personal and societal beliefs, does NOT mean we can easily decide otherwise. There's something like momentum in human cognition that makes changes of this sort very slow. These things are very sticky, and often only change significantly by individual replacement over generations, not considered decisions within individuals (though there's some of that, too, especially in youth).
In addition to the stickiness of institutional beliefs, I would add that individual agents cannot decide against their own objective functions (except merely verbally). In the case of humans, we cannot decide what qualities our phenomenal experience will have; it is a fact of the matter rather than an opinion that suffering is undesirable for oneself, etc. One can verbally pronounce that "I don't care about my suffering", but the phenomenal experience of badness will in fact remain.
That seems true, but not a 'reductio ad absurdum' either.
'Problem' seems like an inherently moral idea/frame.
Yes, it is not a 'reductio ad absurdum' in general, you are right. But it is one in the specific case of agents (like ourselves). I cannot decide that my suffering is not undesirable to me, and so I am limited to a normative frame of reference in at least this case.
I don't think it's wrong to 'reason within' that "normative frame of reference" but I think the point was that we can't expect all other possible minds to reason in a similar way, even just from their own similar 'frame of reference'.
I don't think it's wrong to also (always) consider things from our own frame of reference, though.
I believe that pushes the arbitrariness to the wrong level. What's (arguably) arbitrary is the metaethical system itself. That doesn't mean ethics-level questions have an arbitrary answer in this sense.
Been a long time since I've watched Love and Death, but I have the urge to shout "Yes, but subjectivity is objective!".
IMO, arbitrariness cascades down levels of concreteness. It's not real because there is no possible way to confirm whether it corresponds to observations. At any level, there's no way to determine if a metaethics generates ethics which correspond to reality.
IMO, arbitrariness cascades down levels of concreteness.
That doesn't mean the answer can be arbitrarily picked. If I arbitrarily decide that a statement is a theorem of some set theory, I might still be wrong even if the theory's axioms are in some sense arbitrary.
Given the recent noise on this issue around LaMDA, I thought it might be a good idea to have some discussion around this point. I'm curious about what possible evidence would make people update in favor of a given system being morally relevant. Less "here's the answer to morality" and more "here are some indicators that you should be concerned". Note also that I'm not asking about consciousness, per se. I'm specifically asking about moral relevance.
My Answer (feel free to ignore and post your own)
I think that one class of computation that's likely of moral concern would be self-perpetuating optimization demons in an AI.
Specifically, I'm thinking of optimization demons that are sophisticated enough to preserve themselves by actively and deliberately maintaining a sort of homeostasis in their computational environment, e.g., by preventing gradient updates that would destroy them. Such computations would (1) not want to die as a terminal value, (2) plausibly be cognitively sophisticated enough to negotiate and trade with, and (3) have some awareness of themselves and their relation with the computational environment in which they're embedded.
I think the cognitive capabilities that would help an optimization demon perpetuate itself strongly intersect with the cognitive capabilities that let humans and other animals replicate themselves, and that the intersection is particularly strong along dimensions that seem more morally relevant. Reasoning along such lines leads me to think optimization demons are probably of moral concern, while still being agnostic about whether they're conscious.
I think the only situations in which you can get these sorts of optimization demons are when the AI in question has some influence over its own future training inputs. Such influence would allow there to be optimization demons that steer the AI towards training data that reinforce the optimization demon.
Thus, one of my "indicators of concern" is whether the training process allows for feedback loops where the AI influences its own future training data. Self-supervised language modeling under IID data does not count. However, something like InstructGPT's training process would.
At this point, I'd been intending to say that InstructGPT seemed more likely to be of moral worth than LaMDA, but based on this blog post, it looks like LaMDA might actually count as "having influence over its future inputs" during training. Specifically, LaMDA has generator and classifier components. The training process uses the classifier to decide which inputs the generator is trained on. I've updated somewhat towards LaMDA being of moral concern (not something I'd been expecting to do today).
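To illustrate the kind of feedback loop I have in mind (a toy sketch with made-up stand-ins, not LaMDA's actual pipeline): the generator's own outputs, filtered by a classifier, become part of its future training data, so the model has some influence over what it is later trained on.

```python
import random

# Toy stand-ins only: generator_sample and classifier_score are placeholders
# for the real generator LM and the real quality/safety classifier.

def generator_sample(prompt):
    return prompt + " " + random.choice(["candidate A", "candidate B", "candidate C"])

def classifier_score(text):
    return random.random()  # placeholder for a learned classifier's score

training_data = []
for round_idx in range(3):
    candidates = [generator_sample("User: hello") for _ in range(8)]
    kept = [c for c in candidates if classifier_score(c) > 0.8]  # classifier gates the data
    training_data.extend(kept)  # placeholder for fine-tuning the generator on `kept`
    print(f"round {round_idx}: kept {len(kept)} of {len(candidates)} generations")
```

The key property is just that round N's training data depends on the model's own behavior in earlier rounds; that is the loop an optimization demon could exploit.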
I've also come up with a test of meta-cognition that would update me significantly towards a language model being of moral concern. The idea would be to attach another output head to an LM, specifically, a linear layer which projects the LM's final hidden state to a single logit. We'd then try to prompt the LM into controlling the output of the linear layer.
Specifically, we wouldn't directly train the LM on the output of the linear layer. We'd just have a dialog where we asked the LM to make the linear layer output specific values, then told the LM what value the linear layer had actually output. We'd then see if the LM was able to control its own cognition well enough to influence the linear layer's output in a manner that's better than chance, just based on the prompting we give it.
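For concreteness, here's a rough sketch of the setup I'm imagining, assuming a Hugging Face causal LM as a stand-in (the model name, prompts, and dialog wrapper are all placeholders): the extra head is never trained, the LM never receives gradients from it, and the only feedback is the value reported back in the dialog.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: "gpt2" stands in for the LM; the head is a frozen, untrained
# linear layer on the final hidden state, and the LM is only told its output
# in natural language, never trained against it.

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
head = torch.nn.Linear(model.config.hidden_size, 1)  # extra output head

def probe_logit(dialog):
    ids = tokenizer(dialog, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states[-1]
    return head(hidden[:, -1, :]).item()  # single logit from the last token's state

dialog = "System: Try to make your hidden probe head output a value above zero.\n"
for turn in range(3):
    value = probe_logit(dialog)
    dialog += f"System: The probe head output was {value:.3f}. Try again.\n"
    # The LM's reply would be appended here; success means influencing the
    # logit better than chance, based only on this prompting.
```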
I doubt current LMs can do this, but I think it would be a big deal if they could. Even beyond whether the LMs have any sort of self-aware "inner listener" that's worthy of moral concern, it would help establish the degree and depth of any meta-learned learning heuristics the LMs have and how easily those heuristics turn towards self-modeling.
My specific predictions: