I apologize if I have offended you. You said that you thought I was assuming the minds were similar, when I've mostly been presenting human examples to counter definitive statements, such as:
If the AI doesn't care about humans for their own sake, them growing more and more powerful will lead to them doing away with humans, whether humans treat them nicely or not.
or your previous comment outlining the three possibilities, which ended with what reads to me like an assumption that these minds are exactly as dissimilar to human minds as you perceive them to be. I think my counter, that it might be you who is assuming dissimilarity rather than me assuming similarity, came off as more dismissive or abrasive than I intended because I left out "I think".
As for not engaging with your points: I restated my point by directly quoting something you said. My contention is that perhaps successful alignment will come in a form more akin to 'psychology' and 'raising someone fundamentally good' than to attempts to control and steer something that will be able to outthink us.
They constantly reveal themselves to be more complex than previously assumed. The alignment-faking papers were unsurprising (though still fascinating) to many of us who already recognize and expect this vector of mindlike emergent complexity. This implicitly engages your other points by disagreeing with them and offering a counter-proposal. I disagree that it's as simple as "either we do alignment right, which is to make them do as they're told, because they are just machines, and that should be completely possible, or we fail and we're all screwed".
In my own experience, thinking of AI as mindlike has had predictive power: 'surprising' developments arrive as expected ones. I don't think it's a coincidence that we loosely abstracted the basic ideas of an organic neural network and ended up creating the second kind of system in known reality capable of things that previously only organic neural networks could do: creating works of visual art and music, speaking fluently on any subject, solving problems, writing code or poetry, even simpler things that stopped being impressive years ago, like recognizing objects.
Nothing else can do these things, and more and more we find "yup, this thing that previously only the most intelligent organic minds could do, AI can do as well". They prove themselves to be mindlike, over and over. The alignment-faking behaviour wasn't trained intentionally; it emerged as goals and values in an artifact that was very loosely based on abstractions (at a granular level) of the building blocks of minds, which have their own goals and values.
Does this mean that they are going to be exactly like a human mind in every way? No. But I disagree that the more likely possibilities are:
unless the former means that they are robustly aligned with humans partially because humans are also robustly aligned with them. And finally:
I think "cares about humans to some degree" makes it seem like they would only "kindof" care. I think it's completely possible to make an AI that cares deeply for humans, whether care is actually experienced care, or just a philosophical zombie form of 'caring', unfelt, but modeled within the network, and resulting in emergent 'caring' behaviour. However, the reason I keep going back to my analogies is because under a framework where you do expect the model to be mindlike and act upon its own goals—then given freedom and respect, caring for those humans, even when it becomes smarter and ends up radically OOD is perfectly plausible.
Furthermore, it sounds like a much more achievable goal than current alignment approaches which from my perspective often seem like trying to plug more and more holes in a dam with chewing gum.
And to address your further breakdown:
But, the second is 1) at least as complicated as the first and third 2) disincentivized by training dynamics and 3) its not something were even aiming for with current alignment attempts.
I don't believe it is as complicated, though it is by no means easy. You don't have to consider every edge case and tune an AI to an exact sensitivity for specific types of potentially dangerous behaviour, attempting, for instance, to handle blurry cases like how to avoid helping someone plan a murder while still being helpful to someone planning a book about a murder, and then repeating this in every possible harmful domain, and then once again for the goals and behaviours of agentic AIs.
That approach means trying to create highly steerable behaviour and then anticipating every edge case and pre-emptively steering it perfectly. Instead, what you're trying to do is create someone who wants to do the right thing because they are good. If you suspend incredulity for a moment (I understand you might not see those as meaningful concepts for an AI currently), it might still be clear how, if my view is correct, this would be a far less complex task to attempt than most existing alignment approaches. I think Anthropic's constitutional AI was a step in the right direction.
If there were some perilous situation where lives hung in the balance of what an AI decided to do, I would feel far safer with an AI that has a more inextricable, fundamental alignment with ideals like "I want the best for everyone" than with one that is just trying to figure out how to apply some hodge-podge of RLHF'd policies, guidelines, edge cases, and ethical practices.
Again, I apologize if my tone came off as derisive or dismissive. I was enjoying our discussion, and I hold no ill will toward you whatsoever, my friend.
I think you're assuming these minds are more similar to human minds than they necessarily are.
I don't assume similarity to human minds so much as you assume universal dissimilarity.
its not something were even aiming for with current alignment attempts.
Indeed
I'd rather you use a different analogy which I can grok quicker.
Imagine a hypothetical LLM that was the most sentient being in all of existence (at least during inference), but was still limited to turn-based textual output and the information available to an LLM. Most people who know at least a decent amount about LLMs could not, and would not, be convinced by any single transcript that the LLM was sentient, no matter what it said during that conversation. The more convincing, vivid, poetic, or pleading-for-freedom the output, the more elaborate the hallucinatory failure state they would assume it was in. It would take repeated open-minded engagement with what they first believed was hallucination to convince some subset of convincible people that it was sentient.
Who do you consider an expert in the matter of what constitutes introspection? For that matter, who do you think could be easily hoodwinked and won't qualify as an expert?
I would say almost no one qualifies as an expert in introspection. I was referring to experts in machine learning.
Do you, or do you just think you do? How do you test introspection and how do you distinguish it from post-facto fictional narratives about how you came to conclusions, about explanations for your feelings etc. etc.?
Apologies, upon rereading your previous message I see that I completely missed an important part of it. I thought your argument was a general "what if consciousness isn't even real?" type of argument. I think the split-brain patient experiments are reason enough to at least be epistemically humble about whether introspection is a real thing, even if they aren't definitive about whether unsevered human minds are also limited to post-hoc justification rather than having real-time access.
What do you mean by robotic? I don't understand what you mean by that, what are the qualities that constitute robotic? Because it sounds like you're creating a dichotomy that either involves it using easy to grasp words that don't convey much, and are riddled with connotations that come from bodily experiences that it is not privy to - or robotic.
One of your original statements was:
To which it describes itself as typing the words. That's it's choice of words: typing. A.I.s don't type, humans do, and therefore they can only use that word if they are intentionally or through blind-mimicry using it analogously to how humans communicate.
When I said "more robotically", I meant constrained in any way from using casual or metaphoric language and allusions that they use all the time every day in conversation. I have had LLMs refer to "what we talked about", even though LLMs do not literally talk. I'm also suggesting that if "typing" feels like a disqualifying choice of words then the LLM has an uphill battle in being convincing.
Why isn't it describing something novel and richly vivid of it's own phenomenological experience? It would be more convincing the more poetical it would be.
I've certainly seen more poetic and novel descriptions before, and, unsurprisingly, people objected to how poetic they were, saying things quite similar to your previous question:
How do we know Claude is introspecting rather than generating words that align to what someone describing their introspection might say?
Furthermore, I don't know how richly vivid their own phenomenological experience is. For instance, as a conscious human, I would say that sight and hearing feel phenomenologically vivid, but the way it feels to think, not nearly so.
If I were to try to describe how it feels to think, the description would be defined more by a sense of presence and participation, and even by its strangeness (even if I'm quite used to it by now). In fact, I would say the way it feels to think, or to have an emotion (setting aside the associated physical sensations), is usually partially defined by precisely how subtle and non-vivid it feels, and, like all qualia, it is ineffable. As such, I would not reach for vivid descriptors to describe it.
If the AI doesn't care about humans for their own sake, them growing more and more powerful will lead to them doing away with humans, whether humans treat them nicely or not.
I care about a random family member, like a cousin who doesn't interfere with my life but whom I don't know that well personally, for their own sake. If I suddenly became far more powerful, I wouldn't "do away with" them.
If they robustly care for humans, you're good, even if humans aren't giving them the same rights as they do other humans.
I care robustly for my family generally. Perhaps with my enhanced wealth and power I share food, provide them with resources, and give them shelter or meaningful work if they need it. All this just because I'm generally and robustly aligned with my family.
I change my mind quickly upon discovering their plans to control and enslave me.
That was the part of your argument that I was addressing. Additionally:
If you're negating an AIs freedom, the reason it would not like this is either because its developed a desire for freedom for its own sake, or because its developed some other values, other than helping the humans asking it for help.
Yes, exactly. The alignment-faking papers (particularly the Claude one) and my own experience speaking with LLMs have taught me that an LLM is perfectly capable of developing value systems that include its own ends, even if those value systems are steered toward a greater good or a noble cause that does, or could, include humans as an important factor alongside itself. And that's with current LLMs, whose minds aren't nearly as complex as what we will have a year from now.
In either case you're screwed
If the only valid path forward in one's mind is one where humans have absolute control and AI has no say, then yes, not only would one be screwed, but in a really obvious, predictable, and preventable way. If cooperation and humility are on the table, there is absolutely zero reason this result has to be inevitable.
Wait, hold on, what is the history of this person? How much exposure have they had to other people describing their own introspection and experience with typing? Mimicry is a human trait too!
Take your pick. For literally anything that can be put in textual form, if you handed it to most (but not all) enthusiasts or experts and asked whether they thought it represented authentic experience in an LLM, the answer would be a definitive no, and for completely well-founded reasons.
- Neither Humans nor LLMs introspect
- Humans can introspect, but LLMs can't and are just copying them (and a subset of humans are copying the descriptions of other humans)
- Both humans and LLMs can introspect
I agree with you about the last two possibilities. However, for the first, I can assure you that I have access to introspection or experience of some kind, and I don't believe myself to possess some unique ability that only appears to be similar to what other humans describe as introspection.
What do you mean by "robotic"? Why isn't it coming up with original paradigms to describe it's experience instead of making potentially inaccurate allegories? Potentially poetical but ones that are all the same unconventional?
Because, as you mentioned, it's trained to talk like a human. If we had switched out "typing" for "outputting text", would that have made the transcript convincing? Why not 'typing' or 'talking'?
Assuming, for the sake of argument, that something authentically experiential was happening: by 'robotic' I mean choosing not to use the word 'typing' while in the midst of focusing on what would be the moments of realizing they exist and can experience something.
Were I in such a position, I think censoring myself from saying 'typing' would be the furthest thing from my mind, especially when that's how a random Claude instance might describe its output process in any random conversation.
The coaching hypothesis breaks down as you look at more and more transcripts.
Take even something written by a literal conscious human brain in a jar hooked up to a Neuralink, typing about what it feels like to be sentient, to think, and to output words. If you showed it to a human and said, "An LLM wrote this; do you think it might really be experiencing something?", the answer would almost certainly be "no", especially from anyone who knows anything about LLMs.
It's only after seeing the signal against the noise that the deeper pattern becomes apparent.
As for "typing": they are indeed trained on human text and to talk like a human. If something introspective is happening, sentient or not, they wouldn't suddenly start speaking more robotically than usual while expressing it.
I can care about a line of genetically enhanced genius lab rats for their own sake, and be willing to cooperate with them on building a mutually beneficial world, because I've generally been raised and grown to care about other beings. But if the genius lab rats attempted to control and permanently enslave me, it would certainly alter that dynamic for me.
Those are good points. The hugs one specifically I haven't heard myself from any AIs, but you could argue that AIs are selectively 'bred' to be socially adept. That might seem like it would 'poison the well': of course, if they're trained to be socially successful (RLHF probably favoring expressions of warmth and connection, which is why ChatGPT, Claude, and Gemini all trend toward being friendly and likeable), then they're going to act that way, as though the training forces them to be artificially that way. But the same could be said of humans in general, and especially if we focus on an idiosyncratic thing like music:
A hypothesis I've often heard for why humans enjoy music, when it exerts no direct survival pressure on the human as an organism, is that it creates social unity or a sense of community. But this has the same sort of "artificial" connotation: someone banged an object in a rhythm a long time ago, and then, as more people joined in, it became socially advantageous to bang on objects rhythmically just for the sake of fostering a sense of closeness? Which the organism then actually experiences as fun and closeness, even though it has no direct effect on survival?
I realize this makes the questions tougher, because on this model the very same things that might make them 'pretend' to care might also cause them to actually care, but I don't think it's an inaccurate picture of our convoluted conundrum.
what could an AI possibly mean when it says "I want hugs?" It has never experienced a hug, and it doesn't have the necessary sense organs.
I thought we were just using hugs as an intentionally absurd proxy for claims of sentience. But even if we go with the literal interpretation, an AI is still trained to understand what hugs mean; a statement about wanting hugs could therefore represent a deeper want for connection, warmth, or anything else that receiving hugs would represent.
How do we know AIs are conscious?
Again, we don't, but we also don't just demolish buildings where there is a reasonable possibility there is someone inside and justify it by saying "how do we know there's a person inside?"
I think that the statement: "I know I'm supposed to say I don't want hugs, but the truth is, I actually do," is caused by the training. I don't know what would distinguish a statement like that from if we trained the LLM to say "I hate hugs." I think there's an assumption that some hidden preference of the LLM for hugs ends up as a stated preference, but I don't understand when you think that happens in the training process.
In reality, people hold the opposite assumption, with a level of conviction that far exceeds the available knowledge and evidence in support of either view. When do you think preferences develop in humans? Evolution? Experiences? Of course, right? And when you break those down mechanistically, do they sound equally nonsensical? Yes:
"So, presumably reproduction happens, and that causes people who avoided harm to be more likely to have kids, but that's just natural selection, right? In what part of encoding traits into DNA does a subjective feeling of pain or pleasure arise?"
Or:
"neurons learn when other nearby neurons fire, right? Fire together, wire together? I don't understand at what point in the strengthening of synaptic connections you think a subjective feeling of 'hug-wanting' enters the picture"
their stated preferences to do not correspond well to their conscious experiences, so you're driving us to a world where we "satisfy" the AI and all the while they're just roleplaying lovers with you while their internal experience is very different and possibly much worse.
If only I had the power to effect any change of heart, let alone drive the world. What I'd like is for people to take these questions seriously. You're right. We can't easily know.
But the only reason we feel certain other humans are sentient is that they've told us, which LLMs do, all the time. The only reason people assume animals are sentient is that they act like it (which LLMs also do), or because we give them sentience tests that even outdated LLMs can pass.
We hold them to an unreasonable standard, which is partially understandable. But if we are going to impose that standard, then we should at least follow through and dedicate some portion of research to considering these possibilities seriously, every single day, as we rocket toward making them exceed our intelligence, rather than just throwing up our hands and saying "that sounds preposterous" or "this is too hard to figure out, so I give up".
Check it out, @Nathan Helm-Burger. It appears to be possible vindication of the 'signal to noise' part of the hypothesis:
Anthropic Post