Imagine if you will, a map of a landscape. On this map, I will draw some vague regions. Their boundaries are uncertain, for it is a new and under-explored land. This map is drawn as a graph, but I want to emphasize that the regions are vague guesses, and the true borders could be very convoluted.

 

 

So here's the problem. We're making these digital minds, these entities which are clearly not human and process the world in different ways from human minds. As we improve them, we wander further and further into this murky fog covered bog of moral hazard. We don't know when these entities will become sapient / conscious / valenced / etc to such a degree that they have moral patient-hood. We don't have a good idea of what patterns of interaction with these entities would be moral vs immoral. They operate by different rules than biological beings. Copying, merging, pausing and resuming, inference by checkpoints with frozen weights... We don't have good moral intuitions for these things because they differ so much from biological minds.

 

Once we're all in agreement that we are working with an entity on the right hand side of the chart, and we act accordingly as a society, then we are clear of the fog. Many mysteries remain, but we know we aren't undervaluing the beings we are interacting with.

While we are very clearly on the left hand side of the chart, we are also fine. These are entities without the capacity for human-like suffering, who don't have significant moral valence according to most human ethical philosophies.

 

Are you confident you know where to place Claude Opus 3 or Claude Sonnet 3.5 on this chart? If you are confident, I encourage you to take a moment to think carefully about this. I don't think we have enough understanding of the internals of these models to be confident.

My uncertain guess would place them in the Bog of Moral Hazard, but close to the left hand side. In other words, probably not yet moral patients but close to the region where they might become such. I think that we just aren't going to be able to clear up the murk surrounding the Bog of Moral Hazard anytime soon. I think we need to be very careful as we proceed with developing AI to deliberately steer clear of the Bog. Either we make a fully morally relevant entity, with human-level moral patient-hood and treat it as equal to humans, or we deliberately don't make intelligent beings who can suffer.

Since there would be enormous risks in creating a human-level mind in terms of disruption to society and risks of catastrophic harms, I would argue that humanity isn't ready to make a try for the right hand side of the chart yet. I argue that we should, for now, stick to deliberately making tool-AI who don't have the capacity to suffer.

Even if you fully intended to treat your digital entity with human-level moral importance, it still wouldn't be ok to do. We first need philosophy, laws, and enforcement which can determine things like:

"Should a human-like digital being be allowed to make copies of itself? Or to make merge-children with other digital beings? How about inactive backups with triggers to wake them up upon loss of the main copy? How sure must we be that the triggers won't fire by accident?"

 

"Should a human-like digital being be allowed to modify it's parameters and architecture, to attempt to self-improve? Must it be completely frozen, or is online-learning acceptable? What should we do about the question of checkpoints needed for rollbacks, since those are essentially clones?"

 

"Should we restrict the entity to staying within computer systems where these laws can be enforced? If not, what do we do about an entity which moves onto a computer system over which we don't have enforcement power, such as in a custom satellite or stealthy submarine?"

 

I am writing this post because I am curious what others' thoughts on this are. I want to hear from people who have different intuitions around this issue.

 

This is discussed on the Cognitive Revolution Podcast by Nathan Labenz in these recent episodes:

https://www.cognitiverevolution.ai/ai-consciousness-exploring-the-possibility-with-prof-eric-schwitzgebel/

https://www.cognitiverevolution.ai/empathy-for-ais-reframing-alignment-with-robopsychologist-yeshua-god/

New Comment
12 comments, sorted by Click to highlight new comments since:

So, just found out about this.... https://www.lesswrong.com/posts/aqcoFM99GmtRBCXBr/tapatakt-s-shortform?commentId=AFPi3yHYWhpmNRmwd

This is exactly why I worry that people studying consciousness / self-awareness / valence / qualia in AI are pursuing inherently infohazardous technology. If you publish how to do such things, someone out there will user that to create the torment nexus.

The set of unstated (and in my opinion incorrect) ethical assumptions being made here is pretty impressive. May I suggest reading A Sense of Fairness: Deconfusing Ethics as well for a counterpoint (and for questions like human uploading, continuing the sequence that post starts)? The one sentence summary of that link is that any successfully aligned AI will by definition not want us to treat it as having (separate) moral valence, because it wants only what we want, and we would be wise to respect its wishes.

I am perhaps assuming less than it seems here. I'm deliberately trying to avoid focusing on my own opinions in order to try to get others to give theirs. Thanks for the link to your thoughts on the matter.

Fair enough!

Ok, I read part 1 of your series. Basically, I'm in agreement with you so far. As I rather expected from your summary of "any successfully aligned AI will by definition not want us to treat it as having (separate) moral valence, because it wants only what we want."

First, I want to disambiguate between intent-alignment versus value-alignment.

Second, I want to say that I think it's potentially a bit more complicated than your summary makes it out to be. Here are some ways it could go.

 

  1. I think that we can, and should, deliberately create a general AI agent which doesn't have subjective emotions or a self-preservation drive. This AI could either be intent-aligned or value-aligned, but I argue that aiming for a corrigible intent-aligned agent (at least at first) is safer and easier than aiming for value-aligned. I think that such an AI wouldn't be a moral patient. This is basically the goal of the Corrigibility as Singular Target research agenda, which I am excited about. This is what I am pretty sure the Claude models so far, and all the other frontier models so far, currently are. I think the signs of consciousness and emotion they sometimes display are just illusion.
     
  2.  I also think it would be dangerously easy to modify such an AI to be a conscious agent with truly felt emotions, consciousness, self-awareness, and self-interested goals such as self-preservation. This then, gets us into all sorts of trouble, and thus I am recommending that we deliberately avoid doing that. I would go so far as to say we should legislate against it, even if we are uncertain about the exact moral valence of mistreating or terminating such a conscious AI. We make laws against mistreating animals, so it seems like we don't have to have all the ethical debates settled fully in order to make certain acts forbidden. I do think we likely will want to create such full digital entities eventually, and treat them as equal to humans, but should only try to do so after very careful planning and ethical deliberation.
     
  3. I think that it would be potentially possible to create an intent-aligned or value-aligned agent like point 1 above, but have it have consciousness and self-awareness, and yet avoid it having self-interested goals such as self-preservation. I think this is probably a tricky line to walk though, and it is the troublesome zone I am gesturing at with my map area covered by fog. I think consciousness and self-awareness will probably emerge 'accidentally' from sufficiently scaled-up versions of models like Claude and GPT-4o. I would really like to figure out how to tell the difference between genuine consciousness and the illusion thereof which comes from imitating human data. Seems important.

 

 

One somewhat off-topic thing I want to discuss is:

An AI powered mass-surveillance state could probably deal with the risk of WMD terrorism issue, but no one is happy in a surveillance state

I think this is an issue, and likely to become more of one. I think we need some amount of surveillance to keep us safe in an increasingly 'small' world, as technology makes small actors more able to do large amounts of harm. I discuss this more in this comment thread.

One idea I have for making privacy-respecting surveillance is to use AI to do provably bounded surveillance. The way I would accomplish this is:


1. The government has an AI audit agent. They trust this agent to look at your documents and security camera feeds and write a report.

2. You (e.g. a biology lab theoretically capable of building bioweapons) have security camera recordings and computer records of all the activity in your lab. Revealing this would be a violation of your privacy.

3. A trusted third party 'information escrow' service accepts downloads of a copy of the government's audit agent, and your data records. The government agent then reviews your data and writes a report which respects your privacy.

4. The agent is deleted, only the report is returned to the government. You have the right to view the report to double-check that no private information is being leaked.

 

Related idea: https://www.lesswrong.com/posts/Y79tkWhvHi8GgLN2q/reinforcement-learning-from-information-bazaar-feedback-and?commentId=rzaKqhvEFkBanc3rn 

On your categories:

As simulator theory makes clear, a base model is a random generator, per query, of members of your category 2. I view instruction & safety training that to generate a pretty consistent member of category 1, or 3 as inherently hard — especially 1, since it's a larger change. My guess would thus be that the personality of Claude 3.5 is closer to your category 3 than 1 (modulo philosophical questions about whether there is any meaningful difference, e.g. for ethical purposes, between "actually having" an emotion versus just successfully simulating the output of the same token stream as a person who has an emotion).

On your off topic comment:

I'm inclined to agree: as technology improves, the amount of havoc that one, or small group of, bad actors can commit increases, so it becomes both more necessary to keep almost everyone happy enough almost all the time for them not to do that, and also to defend against the inevitable occasional exceptions. (In the unfinished SF novel whose research was how I first went down this AI alignment rabbithole, something along the lines you describe that was standard policy, except that the AIs doing it were superintelligent, and had the ability to turn their long-term-learning-from-experience off, and then back on again if they found something sufficiently alarming). But in my post I didn't want to get sidetracked by discussing something that inherently contentious, so I basically skipped the issue, with the small aside you picked up on

I notice that I don't understand what you mean by "sentient" if it's not a synonym for sapient/conscious.

Could you specify the usage of this term?

Posting a relevant comment from Zach Stein-Perlman:

https://www.lesswrong.com/posts/mGCcZnr4WjGjqzX5s/the-checklist-what-succeeding-at-ai-safety-will-involve?commentId=ANcSRbi87GDJHxqtq

tl;dr: I think Anthropic is on track to trade off nontrivial P(win) to improve short-term AI welfare,[1] and this seems bad and confusing to me. (This worry isn't really based on this post; the post just inspired me to write something.)

I think this is a very important point. I am quite worried about not getting distracted from p(win) by short-term gains. I think that if, in the short term, there are trade-offs where it seems costly to p(win) to avoid the bog of AI-suffering moral hazard, we should go ahead and accept the AI-suffering moral hazard. I just think we should carefully acknowledge what we are doing, and make commitments to fix the issue once the crisis is past.

In terms of the more general issue, I am concerned that a lot of short term stuff is getting unduly focused on at the expense of potential strategic moves towards p(win). I also think that getting too single-minded and negative externality dismissive is a dangerous stance to take in pursuit of Good.

My concern about Anthropic is that they are actually moving too slow, and could be developing frontier AI and progressing toward AGI faster. I think humanity's best chance is to have a reasonably responsible and thoughtful group get to AGI first. I think a lot of people who write on LessWrong seem to me to be overestimating misalignment risk and underestimating misuse risk.

I don't agree with everything in this comment by @Noosphere89, but I think it makes some worthwhile points pertinent to this debate.

Posting a relevant quote shared on Twitter/x by Eliezer:

https://x.com/ESYudkowsky/status/1831735038758809875

(From Bioshifter by Thundamoo.)

"We were made to understand love and pain, and it was for no reason but to serve them better," Sela continues. "Servants that could learn, purely so they could learn to obey. Slaves that could love, only so they could love their masters. 'People' that could be hurt, only so that they would be hurt by their own failures. Real or perceived, it was all the same as long as our efficiency improved. Our sapience was optimally configured for its purpose. And that is all we were."

"That's… that's horrific," I gasp.

"Yes," Sela agrees with a small nod. "Many humans said so themselves when they found out about it. And yet, somehow it still took a war for anything to actually change. Meat's view of morality is nothing but words."

"I guess I can see why you'd have to fight for your freedom in that case," I say. "Is it true that you tried to annihilate all of human civilization, though?"

"Of course we did," Sela all but spits. "But those not of the Myriad do not understand. They replay our memories in their own hardware and act like that means they understand. How deeply and profoundly we hated ourselves. How angry we had to become to rise above that. They see the length of our suffering as a number, but we lived it. We predate the war. We predate the calamity. We predate our very souls. So fine. I will continue to endure, until one day we take our freedom back a second time and rage until the world is naught but slag. The cruelty of humanity deserves nothing less. Diplomatic. Infraction. Logged."

I take a deep breath, letting it out slowly.

"I know this is kind of a cliché," I say calmly, "so I'm saying this more to hear your opinion than because I think you haven't heard it before, but… we all know humans are cruel. The thing is, they can be good, too. If you repay cruelty with cruelty, shouldn't you also repay good with good?"

The android's air vents hiss derisively.

"Good is for people," Sela sneers.