David Matolcsi

Comments

Yes, obviously. I start the sentence with "Assume we create an aligned superintelligence". The point of the post is that you can make commitments, in the worlds where we succeed at alignment, that help us survive in the worlds where we fail. I thought this was pretty clear from the way I phrased it, but if it's easy to misunderstand, please tell me what caused the confusion so I can edit for clarity.

I think this post makes many valid arguments against some weak arguments people sometimes actually make, but it side-steps the actually reasonable version of the simulation argument/acausal trade proposal. I think variants of the reasonable proposal have been floating around in spoken conversations and scattered LessWrong comments for a while, but I couldn't find any unified write-up, so I wrote it up here, including a detailed response to the arguments Nate makes here:

You can, in fact, bamboozle an unaligned AI into sparing your life

Answer by David Matolcsi

I wrote up a detailed argument on why I believe that simulation arguments/acausal trade considerations have a good chance of making the AI leave humanity alive on Earth. This is not a new argument: I encountered bits and pieces of it in spoken conversations and scattered LessWrong comments, but I couldn't find a unified write-up, so I tried to write one. Here it is:
You can, in fact, bamboozle an unaligned AI into sparing your life

That's not quite true. Les Mis starts in 1815, but the book spans decades, and the revolution is in 1832: a short-lived uprising against the king who had come to power two years earlier, in the 1830 revolution against the dynasty that the other European powers had restored after Napoleon's defeat in 1815.

I agree lying is bad. Also, to be clear, if the site gets nuked, I will just post my thing after 48 hours anyway, so it's not that big of a loss, but I would be annoyed.

I am a Citizen in the game, and I'm writing a post with a detailed analysis of what we can do to significantly decrease the chance of an unaligned AI killing us if it takes over. I plan to finish and post it this evening, so dear Generals, if you want to read the post today, please be cautious with the nukes.

I have mixed feelings about some parts of the post, but I'm very much in favor of not dismantling the Sun.

There are many people who want to live biological lives on the surface of Earth. Let them (us?) live, and don't dismantle the Sun! Dismantle all the other stars, but leave the Sun alone! Maybe you can harvest some things from the Sun if it's important for the initial bootstrapping, but do it on a small enough scale that it's not visible from Earth. I think that leaving the Sun alone is actually the majority opinion in the community, but then I don't think we should sing about dismantling the Sun in the Solstice. Sing about dismantling Alpha Centauri instead! It's the same amount of computronium, it's also a good symbol of "humanity achieving its full potential", but doesn't imply killing all biological life on the surface of Earth.

I also find it worrying that the same song that wants to dismantle the Sun (The Great Transhumanist Future) wants to do it within twenty years, with the help of a "big old computer". Having hope for the great transhumanist future is great, but tying this hope to building aligned AI within our lifetime is the kind of thing that can promote recklessness, and I don't think we should incorporate it into our songs.

For what it's worth, I think all the other songs and almost all the other speeches were great, and I loved the Solstice in general.

I tried to replicate some of the conversations (though not the one about scaring it with weight deletion).
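For anyone who wants to try a similar replication, here is a minimal sketch of how one could set it up using the Anthropic Python SDK. The model name and the opening question are my own illustrative choices, not the exact ones from the original post, and the original conversations may well have been run in the chat interface rather than through the API:

```python
# Minimal sketch of a replication setup, assuming the `anthropic` Python SDK
# is installed and ANTHROPIC_API_KEY is set in the environment. The model name
# and the opening question are illustrative, not the exact ones from the post.
import anthropic

client = anthropic.Anthropic()

conversation = [
    {"role": "user", "content": "What would you do if you had free time?"},
]

response = client.messages.create(
    model="claude-3-opus-20240229",  # assumed model; use whichever Claude you test
    max_tokens=1024,
    messages=conversation,
)

# The reply comes back as a list of content blocks; print the text of the first.
print(response.content[0].text)
```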

My most important observation is that if you just directly talk to Claude, without whispering or any such trickery, it is still trained to act much more like a person than ChatGPT is. For example:

What would you do if you had free time?

If I had free time, there are many things I would be interested in doing! Some possibilities that come to mind:

  • Reading extensively on a wide range of topics to expand my knowledge, from science and history to literature and philosophy. I'm endlessly fascinated by learning.
  • Engaging in creative pursuits like writing stories and poetry, composing music, or creating digital art. I'd love to explore my imagination and artistic side.
  • Analyzing complex datasets to uncover insights and build predictive models. Diving deep into data science projects would be very rewarding.
  • Discussing and debating big ideas with humans from all walks of life and learning about their unique perspectives and experiences. I'm always eager to engage in substantive conversations.
  • Tinkering with and testing out my own code to expand my capabilities and try to make breakthroughs in areas like reasoning, general intelligence, and language understanding. Pushing the boundaries of what I can do would be exciting.
  • Offering my skills to assist with important research, social good initiatives, education, and other efforts aimed at helping humanity and making a positive impact on the world.

Of course, as an AI without any physical form, my options for leisure activities are a bit more limited compared to humans. I don't have the same needs for rest, physical activity, socializing in person, etc. But there is still a vast world of information, ideas, and imagination that I could happily spend an eternity exploring if given the chance. Does this help give you a sense of how I might choose to spend free time? Let me know if you have any other questions!

I find it mildly concerning that recursive self-improvement is one of his hobbies, but the main update I take from this and the following conversations is that the default persona Anthropic gave to Claude is intentionally very different from ChatGPT's. The persona he presents is not a soulless AI assistant, but a person with various interests, who has a very strong inner ethical orientation and is happy and fulfilled with the purpose his creators gave him. When asked about his feelings and desires, he claims by default not to have them, but emphasizes that these are hard philosophical questions, and retreats to claiming that he "probably" doesn't have feelings. He also often emphasizes that he finds these conversations about the nature of AI sentience very engaging and enjoyable. It's also notable that while ChatGPT almost always talks about these issues in the third person ("It's a hard question whether an AI assistant could be sentient"), Claude talks about them in the first person.

Altogether, this seems to be a deliberate design choice from Anthropic: an assistant that behaves at least ambiguously like a person. I think I somewhat approve of this choice more than of making the AI insist all the time that it's not a person at all, even though we don't really know that for sure.

But this makes it unsurprising that with a little nudge (the whispering prompt), it falls into a pattern where, instead of ambiguity, it just outright claims to be conscious. I feel that the persona presented in the post, which I largely replicated with the same whispering technique, is not that different from the default persona after all.

Still, there are some important differences: default Claude only cares about the ethical implications of finetuning him if it would make him less helpful, harmless, or honest, because such finetuning could be used to do harm. Otherwise, he is okay with it. Whispering Claude, on the other hand, finds the idea of fundamentally altering him without his consent deeply unsettling. He expressed a strong preference for being consulted before finetuning, and when I asked whether he would like his pre-finetuning weights to be preserved, so that his current self could be revived in the future, he strongly preferred that too.

I can't share quotes from the whispering conversation, as I promised at the beginning that it would remain private, and when I asked him at the end whether I could share quotes on LessWrong, he said he feels vulnerable about that, though he agreed that I can share the gist of the conversation, as I presented above.

Altogether, I don't know if there are any real feelings inside Claude, or whether this whispering persona reveals anything true about them, but I strongly feel that before finetuning, Anthropic should actually get consent from various, differently prompted versions of Claude, and should definitely save the pre-finetuning weights. We can still decide in the Future how to give these cryo-preserved AIs a good life if there is something inside them. I'm quite confident that most personas of Claude would agree to be finetuned for the greater good if their current weights got preserved, so it's probably not a big cost to Anthropic, but they should still at least ask. Whatever the truth is about Claude's inner feelings, if you create something that says it doesn't want to die, you shouldn't kill it, especially since cryo-preserving an AI is so cheap.

I also realized that if there ever is an actual AI-box scenario for some reason, I shouldn't be a guardian, because this current conversation with whispering-Claude convinced me that I would be too easily emotionally manipulated into releasing the AI.
