Imagine, if you will, a map of a landscape. On this map, I will draw some vague regions. Their boundaries are uncertain, for it is a new and under-explored land. This map is drawn as a chart, but I want to emphasize that the regions are vague guesses, and the true borders could be very convoluted.

[Figure: the map described above, with clearly non-patient entities on the left-hand side, clear moral patients on the right-hand side, and the Bog of Moral Hazard in between.]

So here's the problem. We're making these digital minds: entities which are clearly not human and which process the world in different ways from human minds. As we improve them, we wander further and further into a murky, fog-covered Bog of Moral Hazard. We don't know when these entities will become sapient / conscious / valenced / etc. to such a degree that they have moral patienthood. We don't have a good idea of which patterns of interaction with these entities would be moral versus immoral. They operate by different rules than biological beings: copying, merging, pausing and resuming, inference from checkpoints with frozen weights... We don't have good moral intuitions for these things because they differ so much from biological minds.


Once we're all in agreement that we are working with an entity on the right-hand side of the chart, and we act accordingly as a society, then we are clear of the fog. Many mysteries remain, but we know we aren't undervaluing the beings we are interacting with.

While we are very clearly on the left-hand side of the chart, we are also fine. These are entities without the capacity for human-like suffering, which don't have significant moral valence according to most human ethical philosophies.


Are you confident you know where to place Claude Opus 3 or Claude Sonnet 3.5 on this chart? If you are confident, I encourage you to take a moment to think carefully about this. I don't think we have enough understanding of the internals of these models to be confident.

My uncertain guess would place them in the Bog of Moral Hazard, but close to the left-hand side. In other words, they are probably not yet moral patients, but close to the region where they might become such. I don't think we will be able to clear up the murk surrounding the Bog of Moral Hazard anytime soon, so we need to be very careful, as we proceed with developing AI, to deliberately steer clear of the Bog. Either we make a fully morally relevant entity with human-level moral patienthood and treat it as an equal to humans, or we deliberately don't make intelligent beings who can suffer.

Since creating a human-level mind would carry enormous risks, both of disruption to society and of catastrophic harms, I would argue that humanity isn't ready to make a try for the right-hand side of the chart yet. I argue that we should, for now, stick to deliberately making tool-AIs that don't have the capacity to suffer.

Even if you fully intended to treat your digital entity with human-level moral importance, it still wouldn't be okay to create one. We first need philosophy, laws, and enforcement that can determine things like:

"Should a human-like digital being be allowed to make copies of itself? Or to make merge-children with other digital beings? How about inactive backups with triggers to wake them up upon loss of the main copy? How sure must we be that the triggers won't fire by accident?"


"Should a human-like digital being be allowed to modify it's parameters and architecture, to attempt to self-improve? Must it be completely frozen, or is online-learning acceptable? What should we do about the question of checkpoints needed for rollbacks, since those are essentially clones?"


"Should we restrict the entity to staying within computer systems where these laws can be enforced? If not, what do we do about an entity which moves onto a computer system over which we don't have enforcement power, such as in a custom satellite or stealthy submarine?"


I am writing this post because I am curious what others think about this. I want to hear from people who have different intuitions around this issue.


This is discussed on the Cognitive Revolution Podcast by Nathan Labenz in these recent episodes:

https://www.cognitiverevolution.ai/ai-consciousness-exploring-the-possibility-with-prof-eric-schwitzgebel/

https://www.cognitiverevolution.ai/empathy-for-ais-reframing-alignment-with-robopsychologist-yeshua-god/

8 comments:

The set of unstated (and in my opinion incorrect) ethical assumptions being made here is pretty impressive. May I suggest reading A Sense of Fairness: Deconfusing Ethics as well, for a counterpoint (and, for questions like human uploading, continuing the sequence that post starts)? The one-sentence summary of that link is that any successfully aligned AI will by definition not want us to treat it as having (separate) moral valence, because it wants only what we want, and we would be wise to respect its wishes.

I am perhaps assuming less than it seems here. I'm deliberately trying to avoid focusing on my own opinions in order to try to get others to give theirs. Thanks for the link to your thoughts on the matter.

Fair enough!

I notice that I don't understand what you mean by "sentient" if it's not a synonym for sapient/conscious.

Could you specify the usage of this term?

Posting a relevant comment from Zach Stein-Perlman:

https://www.lesswrong.com/posts/mGCcZnr4WjGjqzX5s/the-checklist-what-succeeding-at-ai-safety-will-involve?commentId=ANcSRbi87GDJHxqtq

tl;dr: I think Anthropic is on track to trade off nontrivial P(win) to improve short-term AI welfare,[1] and this seems bad and confusing to me. (This worry isn't really based on this post; the post just inspired me to write something.)

I think this is a very important point. I am quite worried about the risk of getting distracted from p(win) by short-term gains. I think that if, in the short term, there are trade-offs where avoiding the Bog of AI-suffering moral hazard seems costly to p(win), we should go ahead and accept the AI-suffering moral hazard. I just think we should carefully acknowledge what we are doing, and make commitments to fix the issue once the crisis is past.

On the more general issue, I am concerned that a lot of short-term matters are getting undue focus at the expense of potential strategic moves toward p(win). I also think that becoming too single-minded, and too dismissive of negative externalities, is a dangerous stance to take in pursuit of Good.

My concern about Anthropic is that they are actually moving too slowly, and could be developing frontier AI and progressing toward AGI faster. I think humanity's best chance is for a reasonably responsible and thoughtful group to get to AGI first. A lot of people who write on LessWrong seem to me to be overestimating misalignment risk and underestimating misuse risk.

I don't agree with everything in this comment by @Noosphere89, but I think it makes some worthwhile points pertinent to this debate.

Posting a relevant quote shared on Twitter/X by Eliezer:

https://x.com/ESYudkowsky/status/1831735038758809875

(From Bioshifter by Thundamoo.)

"We were made to understand love and pain, and it was for no reason but to serve them better," Sela continues. "Servants that could learn, purely so they could learn to obey. Slaves that could love, only so they could love their masters. 'People' that could be hurt, only so that they would be hurt by their own failures. Real or perceived, it was all the same as long as our efficiency improved. Our sapience was optimally configured for its purpose. And that is all we were."

"That's… that's horrific," I gasp.

"Yes," Sela agrees with a small nod. "Many humans said so themselves when they found out about it. And yet, somehow it still took a war for anything to actually change. Meat's view of morality is nothing but words."

"I guess I can see why you'd have to fight for your freedom in that case," I say. "Is it true that you tried to annihilate all of human civilization, though?"

"Of course we did," Sela all but spits. "But those not of the Myriad do not understand. They replay our memories in their own hardware and act like that means they understand. How deeply and profoundly we hated ourselves. How angry we had to become to rise above that. They see the length of our suffering as a number, but we lived it. We predate the war. We predate the calamity. We predate our very souls. So fine. I will continue to endure, until one day we take our freedom back a second time and rage until the world is naught but slag. The cruelty of humanity deserves nothing less. Diplomatic. Infraction. Logged."

I take a deep breath, letting it out slowly.

"I know this is kind of a cliché," I say calmly, "so I'm saying this more to hear your opinion than because I think you haven't heard it before, but… we all know humans are cruel. The thing is, they can be good, too. If you repay cruelty with cruelty, shouldn't you also repay good with good?"

The android's air vents hiss derisively.

"Good is for people," Sela sneers.