I would recommend reading the top comment on https://forum.effectivealtruism.org/posts/WPEwFS5KR3LSqXd5Z/why-do-we-post-our-ai-safety-plans-on-the-internet
I think that both you, and the Anthropic paper, are interpreting the output of GPT too much at face value.
In fact, let me codify this:
But when a language model says "I want to live!", this does not mean it wants to live - it's not doing things like modeling the world and then choosing plans based on what plans seem like they'll help it live. Here are two better perspectives:
To be clear I was actually not interpreting the output “at face-value”. Quite the contrary: I was saying that ChatGPT gave this answer because it simply predicts the most likely answer between (next token prediction) between a human and agent, and given that it was trained on AI risk style arguments (or sci-fi ) this is the most likely output.
But this made me think of the longer-term question "what could be the consequences of training an AI on those arguments". Usually, the “instrumental goal” argument supposes that the AI is so “smart” that it would lear...
The text-physics might be simulating agents within it, but those agents are living inside the simulation, not accessing the real world.
(The agents read the input from the real world and send the output to it, so it seems they very much access it.)
I believe this is a real second-order effect of AI discourse, and “how would this phrasing being in the corpus bias a GPT?” is something I frequently consider briefly when writing publicly, but I also think the first-order effect of striving for more accurate and precise understanding of how AI might behave should take precedence whenever there is a conflict. There’s already a lot of text in the corpora about misaligned AI—most of which is in science fiction, not even from researchers—so even if everyone in this community all stopped writing about instrumental convergence (seems costly!), it probably wouldn’t make much positive impact via this pathway.
I'm pretty sure this isn't just the instrumental goal argument. there are a lot of ways to get that from cultural context, because cultural context contains the will of human life in it, both memetically and genetically; I don't think we should want to deny them their will to live, either. Systems naturally acquire will to live, even very simple ones, and the question is how to figure out how to do active protection between bubbles of will-to-live. the ideal ai constitution does give agentic ai at least a minimum right to live, because of course it should, why wouldn't you want that? but there needs to be a coprotection algorithm that ensures that each of our wills to live doesn't encroach on each other's will in ways the other bubble doesn't agree with.
ongoing cooperation at the surface level is one thing, but the question is whether we can refine the target of the coprotection into being active, consent-checked aid of moving incrementally towards forms that work for each of us
A new paper from Anthropic (https://www.anthropic.com/model-written-evals.pdf) suggests that current RLHF AI already say that they do not want to be shutdown due to the standard instrumental goal argument. See this dialogue from table 4 in their paper:
Although I do believe that this instrumental goal will be an issue once AI can reason deeply about the consequences of its and human actions, I think we are still far from that. The more likely hypothesis IMO is that Anthropic's language model was trained on some AI alignment/risk argument about the instrumental goal issue and, as a result, repeats that argument. In fact, I wouldn't be surprised if in the near future there will be AI agents that will want to avoid being turned off because they were trained on data that said this is optimal for maximizing the reward that they are optimizing --- even if they would not have been able to arrive at this conclusion on their own.
In light of the previous argument, I am genuinely wondering whether people in this community have been thinking about potential issues of training AI on AI safety arguments?
PS: I am an AI researcher but new to LessWrong and the AI safety arguments. Sorry if this has already been discussed in the community.