The ARC evals, which showed that a GPT-4-based agent, when given help and a general directive to replicate, was able to figure out that it ought to lie to a TaskRabbit worker, are an example of an AI figuring out a self-preservation/power-seeking subgoal on the road to general self-preservation. But they don't demonstrate an AI spontaneously developing self-preservation or power-seeking as an instrumental subgoal to something that superficially has nothing to do with gaining power or replicating.
Of course we have some real-world examples of specification gaming, like the ones you linked in your answer: those have always existed, and we're seeing more 'intelligent' examples, like AIs convinced of false facts trying to convince people they're true.
There's supposedly some evidence here that power-seeking instrumental subgoals develop spontaneously, but how spontaneous this actually was is debatable, and it wasn't in the wild, so I'd call that evidence ambiguous.
From Specification gaming examples in AI:
The Bing cases aren't clearly specification gaming, since we don't know how the model was trained/rewarded. My guess is that they're probably just cases of unintended generalization. I wouldn't really call this "goal misgeneralization", but perhaps it's similar.
Some non-AI answers:
When open-source language model agents were first developed, the earliest goals people gave them were test cases like "A holiday is coming up soon. Plan me a party for it." Hours later, someone came up with the clever goal "Make me rich." Less than a day after that, we finally got "Do what's best for humanity." About three days later, someone tried "Destroy humanity."
None of these worked very well, because language model agents aren't very good (yet?). But it's a good demonstration that people will deliberately build agents and give them open-ended goals, and that some small fraction of people will tell an AI to "Destroy humanity." That's maybe not the central concern, though, because the "Do what's best for humanity" people were more numerous and had a three-day lead.
The bigger concern is that the "Plan my party" and "Make me rich" people were also numerous and also had a lead, and those might be dangerous goals to give to a monomaniacal AI.
[Manually cross-posted to EAForum here]
There are some great collections of examples of things like specification gaming, goal misgeneralization, and AI improving AI. But almost all of the examples are from demos/toy environments, rather than systems which were actually deployed in the world.
There are also some databases of AI incidents which include lots of real-world examples, but the incidents aren't tied to particular failure modes in a way that makes it easy to map them onto AI risk claims. (Probably most of them don't map onto those claims in any case, but I'd guess some do.)
I think collecting real-world examples (particularly if done in a nuanced way, without claiming too much on the basis of the examples) could be pretty valuable:
What are the strongest real-world examples of AI systems doing things that might scale up to the failures described in AI risk claims?
I'm particularly interested in whether there are any good real-world examples of:
This feeds into a project I'm working on with AI Impacts, collecting empirical evidence on various AI risk claims. There's a work-in-progress table here with the main things I'm tracking so far; additions and comments are very welcome.