FinalFormal2


Comments


No. Because "going off the rails" often involves doing things that are observably irrational even by your own worldview. Like killing your parents and your landlord.

You can say: "this might make sense from their worldview! (soy)"

And the obvious response is: Yes. Because they're crazy. Because they went off the rails.

You can also say: "But we'll never know! Who can know? Nobody knows! Truth is subjective blargh"

And, again, the obvious response is: Yes, but we can observe patterns. And if you can't update on this evidence and use some basic sense when this sort of thing repeats, you are not thinking clearly.

"The suffering of animals across the globe is so great, and society is so indifferent to them- that I need to kill my asshole parents and my landlord"

You: omgosh they're so courageous and intelligent uwu

Are you just saying: Ten out of ten for style, but minus several million for good thinking? Because the good thinking part is kind of integral to courage and intelligence imo. If you can't keep your whole organization from facing prison time because you couldn't come up with a better housing solution than stabbing some old guy, you probably haven't been very courageous or intelligent.

You're probably, and take me very seriously when I say this, just crazy. And the things you do probably, possibly, maybe, just might not correlate well at all with your stated reasoning.

"Surely the AIs can be trained to say "I want hugs" or "I don't want hugs," just as easily, no?"

Just as easily as humans, I'm sure.

No. The baby cries, the baby gets milk, the baby does not die. This is correspondence to reality.

Babies that are not hugged as often die more often.

However, with AIs, the same process that produces the pattern "I want hugs" just as easily produces the pattern "I don't want hugs."

Let's say that I make an AI that always says it is in pain. I make it the same way we make any other LLM, except that all the data it's trained on is about being in pain. Do you think the AI is in pain?

What do you think distinguishes pAIn from any other AI?
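(Just to make the thought experiment concrete, here's a rough sketch of the kind of training loop I have in mind. GPT-2, Hugging Face's Trainer, and the toy "pain corpus" are stand-ins I picked for illustration, nothing hinges on them. The point is that the identical loop, fed the opposite sentences, produces the opposite stated "preference.")

```python
# Toy sketch: fine-tune a small causal LM on a corpus where every example
# asserts the same "inner state." Swap the corpus and the same loop yields
# the opposite stated preference.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Every training example says the same thing (made-up toy data).
pain_corpus = ["I am in pain.", "It hurts.", "The pain never stops."] * 200
# no_pain_corpus = ["I feel fine.", "Nothing hurts.", "I am at ease."] * 200

class StatementDataset(torch.utils.data.Dataset):
    def __init__(self, texts):
        self.encodings = [
            tokenizer(t, truncation=True, padding="max_length",
                      max_length=16, return_tensors="pt")
            for t in texts
        ]
    def __len__(self):
        return len(self.encodings)
    def __getitem__(self, idx):
        item = {k: v.squeeze(0) for k, v in self.encodings[idx].items()}
        item["labels"] = item["input_ids"].clone()   # standard causal-LM objective
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pAIn", num_train_epochs=1,
                           per_device_train_batch_size=8),
    # Nothing below changes if you hand it no_pain_corpus instead.
    train_dataset=StatementDataset(pain_corpus),
)
trainer.train()
```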

There are a lot of good reasons to believe that stated human preferences correspond to real human preferences. There are no good reasons that I know of to believe that any stated AI preference corresponds to any real AI preference.

"Surely the AIs can be trained to say "I want hugs" or "I don't want hugs," just as easily, no?"

This all makes a lot of sense to me, especially the point that ignorance is not an excuse or a reason to disregard AI welfare, but I don't think the processes that create stated preferences in humans and in AIs are analogous.

Stated preferences can be selected for in humans because they lead to certain outcomes. Baby cries, baby gets milk, baby survives. I don't think there's an analogous connection in AIs. 

When the AI says it wants hugs, and you say that this "could represent a deeper want for connection, warmth, or anything else that receiving hugs would represent," that does not compute for me at all.

Stated preferences for connection and warmth, like the stated preference for milk, are selected for because they promote survival.

What's the deal with AI welfare? How are we supposed to determine whether AIs are conscious, and, if they are, which stated preferences correspond to which conscious experiences?

Surely the AIs can be trained to say "I want hugs" or "I don't want hugs," just as easily, no?

How do we know AIs are conscious, and how do we know which stated preferences correspond to which conscious experiences?

I think that the statement "I know I'm supposed to say I don't want hugs, but the truth is, I actually do" is caused by the training. I don't know what would distinguish a statement like that from one produced by training the LLM to say "I hate hugs." I think there's an assumption that some hidden preference of the LLM for hugs ends up as a stated preference, but I don't understand where in the training process you think that happens.

And just to drive home the point about the difficulty of matching stated preferences to conscious experiences: what could an AI possibly mean when it says "I want hugs"? It has never experienced a hug, and it doesn't have the necessary sense organs.

As for being morally perilous, I think it's entirely possible that, if AIs are conscious, their stated preferences do not correspond well to their conscious experiences. In that case you're driving us toward a world where we "satisfy" the AI while it just roleplays being your lover, and all the while its internal experience is very different and possibly much worse.

AI welfare doesn't make sense to me. How do we know that AIs are conscious, and how do we know what output corresponds to what conscious experience?

You can train the LLM to say "I want hugs." Does that mean it on some level wants hugs?

Similarly, aren't all the expressed preferences and emotions artifacts of the training?

That's definitely a risk. There are a lot of perspectives you could take on it, but if that's too disagreeable, this probably isn't a coaching structure that would work for you.
