Interesting question. I found this article https://arxiv.org/abs/1802.07740, together with the papers that cite it (https://ui.adsabs.harvard.edu/abs/2018arXiv180207740R/citations), to be a good starting point.
Ramana Kumar and Scott Garrabrant's post "Thoughts on Human Models" provides a bit of context for this:
In this post, we discuss several reasons to be cautious about AGI designs that use human models. We suggest that the AGI safety research community put more effort into developing approaches that work well in the absence of human models, alongside the approaches that rely on human models. This would be a significant addition to the current safety research landscape, especially if we focus on working out and trying concrete approaches as opposed to developing theory. We also acknowledge various reasons why avoiding human models seems difficult.
I am not sure that a theory of mind is needed here. If one were to treat humans as a natural phenomenon, living (like the tuberculosis bacillus) or non-living (like ice spreading over a lake in freezing temperatures), then the overt behavioral aspects are all that is needed to detect a threat to be eliminated. And then it's trivial to find a means to dispose of the threat; humans are fragile and stupid and have created a lot of ready means of mass destruction.
Human behavior is much more complex than ice spreading over a lake. So it's actually simplifying the situation to think in terms of "agents that have goals – what do I predict they want?", in a way that it wouldn't be for ice.
Every behavior is complex when you look into the details. But the general patterns are often quite simple, and humans are no exception. They expand and take over, which is easy to predict. Sometimes the expansion stalls for a time, but then it resumes. What do you think is so different about the overall human patterns compared to those natural phenomena?
And then it's trivial to find a means to dispose of the threat; humans are fragile and stupid and have created a lot of ready means of mass destruction.
If by "a lot of ready means of mass destruction" you're thinking of nukes, it doesn't seem trivial to design a way to use nukes to destroy / neutralize all humans without jeopardizing the AGI's own survival.
We don't have a way of reliably modeling the results of very many simultaneous nuclear blasts, and it seems like the AGI probably wouldn't have a way to reliably model this either unless it ran more empirical tests (which would be easy to notice).
It seems like an AGI wouldn't execute a "kill all humans" plan unless it was confident that executing the plan would in expectation result in a higher chance of its own survival than not executing the plan. I don't see how an AGI could become confident about high-variance "kill all humans" plans like using nukes without having much better predictive models than we do. (And it seems like more empirical data about what multiple simultaneous nuclear explosions do would be required to have better models for this case.)
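To make that concrete, here's a toy sketch of the expected-survival comparison I have in mind. The candidate world models and all of the probabilities are invented for illustration; the point is only that averaging over uncertain models punishes high-variance plans.

```python
# Toy expected-value comparison (all numbers invented for illustration).
# The AGI is uncertain which world model is correct, so it averages the
# survival probability of each plan over its candidate models.

candidate_models = [
    # (P(model is right), P(survive | execute plan), P(survive | wait))
    (0.3, 0.95, 0.90),  # the plan works cleanly
    (0.4, 0.50, 0.90),  # the plan partially fails, retaliation follows
    (0.3, 0.05, 0.90),  # the plan backfires badly
]

def expected_survival(action):
    """action 0 = execute the plan, action 1 = wait."""
    return sum(weight * outcomes[action] for weight, *outcomes in candidate_models)

print(f"E[survive | execute] = {expected_survival(0):.2f}")  # 0.50
print(f"E[survive | wait]    = {expected_survival(1):.2f}")  # 0.90

# With this much model uncertainty, the high-variance plan loses to waiting.
# Tightening those outcome estimates is exactly what would require more
# empirical data, and gathering that data would be observable.
```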
Humans are trivial to kill: physically, chemically, biologically, or psychologically. And a combination of those would be even more effective in collapsing the human population. I will not go into the details here, to avoid arguments and negative attention. And if your argument is that humans are tough to kill, then look at the historical data on population collapses, which happened without any adversarial pressure. Or with it, if you consider the indigenous population of the American continent.
Does the brute-force minimax algorithm for tic tac toe count? Would a brute-force minimax algorithm for chess count? How about a neural net approximation like AlphaZero?
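For concreteness, here is a minimal brute-force minimax for tic-tac-toe (just the textbook algorithm, nothing specific to this discussion). The only "model" of the other player is the assumption, on the last line of `minimax`, that they always pick the move that is worst for us:

```python
# Minimal brute-force minimax for tic-tac-toe. The board is a 9-character
# string; ' ' marks an empty cell.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, me, opponent, maximizing=True):
    """Value of `board` for `me`: +1 win, 0 draw, -1 loss under perfect play."""
    w = winner(board)
    if w == me:
        return 1
    if w == opponent:
        return -1
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if not moves:
        return 0  # draw
    player = me if maximizing else opponent
    values = [minimax(board[:i] + player + board[i + 1:], me, opponent, not maximizing)
              for i in moves]
    # The entire "opponent model": assume they choose whatever is worst for us.
    return max(values) if maximizing else min(values)

print(minimax(' ' * 9, 'X', 'O'))  # 0: perfect play from both sides is a draw
```

Whether that single adversarial assumption already counts as a theory of mind is basically the question.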
But it seems like it would still need to have a worked-out theory of mind, just to get to the point of understanding that humans are agent-like things that could bear on the AGI's self-preservation.
It could happen before it understands us - if you don't like things that are difficult to predict*, and you find people difficult to predict, then do you dislike people?
(And killing living creatures seems a bit easier than destroying rocks.)
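A deliberately crude sketch of the kind of procedure I mean; the predictor, the threshold, and the entity names are all invented for illustration:

```python
from collections import defaultdict

# Hypothetical "penalize what you can't predict" agent. It keeps a running
# prediction error per entity and flags anything whose error gets too high.
# Nothing here attributes goals to the entities; unpredictability alone is
# the trigger.

PREDICTION_ERROR_THRESHOLD = 5.0  # made-up number

class UnpredictabilityAverseAgent:
    def __init__(self, predict_next_state):
        self.predict = predict_next_state  # assumed given: (entity, state) -> prediction
        self.error = defaultdict(float)

    def observe(self, entity, state, next_state):
        # Accumulate how badly the prediction missed for this entity.
        self.error[entity] += abs(next_state - self.predict(entity, state))

    def targets(self):
        # Entities the agent has come to "dislike" purely for being hard to predict.
        return [e for e, err in self.error.items() if err > PREDICTION_ERROR_THRESHOLD]

# Toy usage: ice follows the simple prediction, a person doesn't.
agent = UnpredictabilityAverseAgent(lambda entity, state: state + 1)
for step in range(10):
    agent.observe("ice", step, step + 1)                      # always as predicted
    agent.observe("person", step, step + ((-1) ** step) * 3)  # frequently surprising
print(agent.targets())  # ['person']
```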
Wouldn't an AI following that procedure be really easy to spot? (Because it's not deceptive, and it just starts trying to destroy things it can't predict as it encounters them.)
Sparked by Eric Topol, I've been thinking lately about biological complexity, psychology, and AI safety.
A prominent concern in the AI safety community is the problem of instrumental convergence – for almost any terminal goal, agents will converge on instrumental goals that are helpful for furthering the terminal goal, e.g. self-preservation.
The story goes something like this: an AGI is given some terminal goal, realizes that humans could shut it down or modify it (which would prevent it from achieving that goal), and so converges on self-preservation as an instrumental goal, which ultimately means finding a way to neutralize or kill all humans.
It occurred to me that to be really effective at finding & deploying a way to kill all humans, the AGI would probably need to know a lot about human biology (and also markets, bureaucracies, supply chains, etc.).
We humans don't yet have a clean understanding of human biology, and it doesn't seem like an AGI could get to a superhuman understanding of biology without running many more empirical tests (on humans), which would be pretty easy to observe.
Then it occurred to me that maybe the AGI doesn't actually need to know a lot about human biology to develop a way to kill all humans. But it seems like it would still need to have a worked-out theory of mind, just to get to the point of understanding that humans are agent-like things that could bear on the AGI's self-preservation.
So now I'm curious about where the state of the art is for this. From my (lay) understanding, it doesn't seem like GPT-2 has anything approximating a theory of mind. Perhaps OpenAI's Dota system or DeepMind's AlphaStar is the state of the art here, theory-of-mind-wise? (To be successful at Dota or Starcraft, you need to understand that there are other things in your environment that are agent-y & will work against you in some circumstances.)
Curious what else is in the literature about this, and also about how important it seems to others.