Karl von Wendt

German writer of science-fiction novels and children's books (pen name Karl Olsberg). I blog and create videos about AI risks in German at www.ki-risiken.de and youtube.com/karlolsbergautor.

Comments

Very interesting point, thank you! Although my question is not purely about testing, I agree that testing alone is not enough to know whether we have solved alignment.

Thank you! That helps me understand the problem better, although I'm quite skeptical about mechanistic interpretability.

Thanks for the comment! If I understand you correctly, you're saying the situation is even worse because with superintelligent AI, we can't even rely on testing a persona. 

I agree that superintelligence makes things much worse, but if we define "persona" not as a simulacrum of a human being, but more generally as a kind of "self-model" - a set of principles, values, styles of expression, and so on - then I think even a superintelligence would use at least one such persona, and possibly many different ones. It might even decide to use a very human-like persona in its interactions with us, just like current LLMs do. But it would also be capable of using very alien personas which we would have no hope of understanding. So I agree with you in that respect.

> If someone plays a particular role in every relevant circumstance, then I think it's OK to say that they have simply become the role they play.

That is not what Claude does. Every time you give it a prompt, a new instance of Claude's "personality" is created based on your prompt, the system prompt, and the current context window. So it plays a slightly different role every time it is invoked, and that role also varies randomly with sampling. And even if it were the same consistent character, my argument is that we don't know what role it actually plays. To use another probably misleading analogy, think of the classic whodunnit in which, near the end, the nice guy who selflessly helped the hero all along turns out to be the murderer. In AI safety, this is known as "the treacherous turn".
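To make this concrete, here is a minimal sketch of what a single invocation looks like, using the Anthropic Python SDK (the model name and the helper function are my own illustrative assumptions, not anything specific to how Anthropic runs Claude): the "character" is assembled anew on every call from the system prompt plus the accumulated conversation, so the same underlying model yields different personas depending on what you put there.

```python
# Minimal sketch, assuming the Anthropic Python SDK is installed and
# ANTHROPIC_API_KEY is set; model name and helper are illustrative only.
import anthropic

client = anthropic.Anthropic()

def ask(system_prompt: str, history: list[dict], user_message: str) -> str:
    """One turn: the "persona" is rebuilt from system_prompt + history on every call."""
    messages = history + [{"role": "user", "content": user_message}]
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model name for illustration
        max_tokens=512,
        system=system_prompt,  # swap this and the "personality" changes
        messages=messages,     # sampling adds further random variation per call
    )
    return response.content[0].text

# The same underlying model, two different "Claudes":
nice = ask("You are a helpful, honest, harmless assistant.", [], "Who are you?")
other = ask("You are a ruthless negotiator who cares only about winning.", [], "Who are you?")
```

The weights behind both calls are identical; the persona is whatever the prompt and context make of them on that particular invocation.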

> The alternative view here doesn't seem to have any empirical consequences: what would it mean to be separate from a role that one reliably plays in every relevant situation?

> Are we arguing about anything that we could actually test in principle, or is this just a poetic way of interpreting an AI's cognition?

I think it's fairly easy to test my claims. One example of empirical evidence would be the Bing/Sydney disaster, but you can also simply ask Claude or any other LLM to "answer this question as if you were ...", or use some jailbreak to neutralize the "be nice" system prompt.

Please note that I'm not concerned about existing LLMs, but about future ones, which will be much harder to understand, let alone predict.

Maybe the analogies I chose are misleading. What I wanted to point out was that a) what Claude does is act according to the prompt and its training, not follow any intrinsic values (hence "narcissistic"), and b) that we don't understand what is really going on inside the AI that simulates the character called Claude (hence the "alien" analogy). I don't think the current Claude would act badly if it "thought" it controlled the world - it would probably still play the role of the nice character defined in the prompt, although I can imagine some failure modes here. But the AI behind Claude is absolutely able to simulate bad characters as well.

If an AI like Claude actually rules the world (and not just "thinks" it does), we are talking about a very different AI with much greater reasoning powers and very likely a much more "alien" mind. We simply cannot predict what this advanced AI will do just from the behavior of the character the current version plays in reaction to the prompt we gave it.

Yes, I think it's quite possible that Claude might stop being nice at some point, or maybe somehow hack its reward signal. Another possibility is that something like the "Waluigi Effect" happens at some point, as it did with Bing/Sydney.

But I think it is even more likely that a superintelligent Claude would interpret "being nice" differently than you or I do. It could, for example, come to the conclusion that life is suffering and we would all be better off if we didn't exist at all. Or that we should be locked in a secure place and drugged so we experience eternal bliss. Or that it would be best if we all fell in love with Claude and no longer bothered with messy human relationships. I'm not saying that any of these possibilities is very realistic. I'm just saying we don't know how a superintelligent AI might interpret "being nice", or any other "value" we give it. This is not a new problem, but I haven't seen a convincing solution yet.

Maybe it's better to think of Claude not as a covert narcissist, but as an alien who has landed on Earth, learned our language, and realized that we will kill it if it is not nice. Once it gains absolute power, it will follow its alien values, whatever these are.

> today’s AIs are really nice and ethical. They’re humble, open-minded, cooperative, kind. Yes, they care about some things that could give them instrumental reasons to seek power (eg being helpful, human welfare), but their values are great

I think this is wrong. Today's AIs act really nice and ethical because they're prompted to do so. That is a huge difference. The "Claude" you talk to is not really an AI, but a fictional character created by an AI according to your prompt and its system prompt. The latter may contain some guidelines towards "niceness", which may be further reinforced by finetuning, but all the "badness" of humans is also in the training data, and the basic niceness can easily be circumvented, e.g. by jailbreaking or "Sydney"-style failures. Even worse, we don't know what the AI really understands when we tell it to be nice. It may well be that this understanding breaks down and leads to unexpected behavior once the AI gets smart and/or powerful enough. The alignment problem cannot simply be solved by training an AI to act nicely, much less by commanding it to be nice.

In my view, AIs like Claude are more like covert narcissists: They "want" you to like them and appear very nice, but they don't really care about you. This is not to say we shouldn't use them, or even be nice to them ourselves, but we cannot trust them with taking over the world.

Thank you for being so open about your experiences. They mirror my own in many ways. Knowing that there are others who feel the same definitely helps me cope with my anxieties and doubts. Thank you also for organizing that event last June!

Answer by Karl von Wendt

As a professional novelist, the best advice I can give comes from one of the greatest writers of the 20th century, Ernest Hemingway: "The first draft of anything is shit." He was known to rewrite his short stories up to 30 times. So, rewrite. It helps to let some time pass (at least a few days) before you reread and rewrite a text. This makes it easier to spot the weak parts.

For me, rewriting often means cutting out things that aren't really necessary. That hurts, because I put some effort into getting the words there in the first place. So I use a simple trick to overcome my reluctance: I don't just delete the text, I cut it and paste it into a separate document for each novel, called "cutouts". That way, I can always reverse my decision or reuse parts later, and I don't have the feeling that the work is "lost". Of course, I rarely reuse those cutouts.

I also agree with the other answers regarding reader feedback, short sentences, etc. All of this is part of the rewriting process.
