Thus the simple trainable AI with a particular focus - write automated news stories - will be trained, through feedback, to learn about its editors/controllers, to distinguish them, to get to know them, and, in effect, to manipulate them.
Detecting and adapting to the individual controllers doesn't seem to me particularly bad.
Emotionally manipulating the controllers using the content of the stories would be more worrying, but note that this is essentially only possible if the AI is allowed to plan more than one story at a time. If the AI can do that, then it can trade off the reward obtained by the story at time t for greater rewards at times >t. Otherwise, any trade-off will be limited to the different parts of each story, which greatly reduces the opportunities for significant emotional manipulation of the controllers.
I see no reason this story-writing AI would need to be allowed to plan more than one story at a time.
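To make the distinction concrete, here is a minimal sketch of a planner choosing the style of the story at time t. The style names, the reward numbers and the `best_style` function are all hypothetical, made up purely for illustration; the only point is that the "manipulative" choice wins only once the planning horizon extends past the current story.

```python
# Hypothetical toy numbers: the reward a style earns for the current story,
# and the bonus it adds to each later story (e.g. via the editor's mood).
IMMEDIATE_REWARD = {"neutral": 1.0, "flattering": 0.8}
LATER_STORY_BONUS = {"neutral": 0.0, "flattering": 0.5}


def best_style(horizon: int) -> str:
    """Pick the style for the story at time t, planning `horizon` stories ahead.

    horizon=1 is the myopic case: only the current story's reward counts.
    """
    def total(style: str) -> float:
        # Each later story inside the horizon collects the bonus once.
        return IMMEDIATE_REWARD[style] + (horizon - 1) * LATER_STORY_BONUS[style]

    return max(IMMEDIATE_REWARD, key=total)


print(best_style(1))  # planner that only sees the current story
print(best_style(2))  # planner that also counts the next story
```

With these assumed numbers the one-story planner picks "neutral" (1.0 against 0.8), while the two-story planner picks "flattering" (1.3 against 1.0), sacrificing reward at time t for reward at t+1.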
I think this is an example of a general issue in safe AI design that you and other FAI folks overlook: dynamic inconsistency can provide intrinsic protection from unwanted long-term strategies from the AI.
You seem to always implicitly assume that the AI will be an agent trying to maximize a (discounted) utility or reward over a long, ideally infinite, time horizon, that is, you assume that the AI will be approximately dynamically consistent. This may be a reasonable requirement for an autonomous agent that needs to operate for extended times without direct human supervision, but not for a tool AI.
The work of a tool AI can be naturally broken into self-contained tasks, and if the AI doesn't maximize utility or reward over multiple tasks, then any treacherous plan to gain utility in ways we would disapprove of will have to be confined to a single task. This is not a 100% safety guarantee, but certainly it makes the AI safety problem much more manageable.
I see no reason this story-writing AI would need to be allowed to plan more than one story at a time.
Because the AI is programmed by people who hadn't thought of this issue, and the other way turned out to be simpler/easier?
dynamic inconsistency can provide intrinsic protection from unwanted long-term strategies from the AI.
I know. The problem is that inconsistency is unstable (which is why we're using other measures to maintain it, e.g. using a tool AI only). That's one of the reasons I was interested in stable versions of these kinds of unstable motiva...
The argument that AIs provided with a reward channel will observe their controllers and learn to manipulate them is a valid one. Unfortunately, it's often framed in a way that feels counterintuitive or extreme, especially to AI designers. It typically starts with the standard reinforcement learning scenario, then posits that the AI becomes superintelligent and either manipulates the controller with super-social powers, or breaks out and gains control of its reward channel, killing or threatening its controllers.
And that is a fair argument. But conceptually, it leaps from a standard reinforcement learning scenario to a science-fiction-sounding scenario. It might help to have intermediate scenarios: to show that even lower-intelligence AIs might start exhibiting the same sort of behaviour, long before they get to superintelligence.
So consider the following scenario. Some complex, trainable AI is tasked with writing automated news stories for a student newspaper. It trawls the web and composes its stories, then gets reward and feedback from the editors. Assume there are two editors for this newspaper, and they work on alternate days. The two editors have somewhat different ideas as to what constitutes a good story, so their feedback differs. After a while, the AI finds that it gets higher reward by using a certain style on Monday, Wednesday and Friday, and another style on Tuesday and Thursday - this is a simple consequence of its reward mechanism.
But the rota isn't perfect. Sometimes the Monday editor will edit a story so late on Monday that it's Tuesday, and sometimes the Tuesday editor will be up early and edit a story at the same time. This pattern is more common during the holidays, when the editors go to different time zones, often travel, and become more irregular in their editing.
So if the AI is complex and skilled enough, then, through simple feedback alone, it will start building up a picture of its editors. It will figure out when they are likely to stick to a schedule, and when they will be more irregular. It will figure out the difference between holidays and non-holidays. Given time, it may be able to track the editors' moods, and it will certainly pick up on any major change in their lives - such as romantic relationships and breakups, which will radically change whether and how it should present stories with a romantic focus.
It will also likely learn the correlation between stories and feedback - maybe presenting a story defined roughly as "positive" will increase subsequent reward for the rest of the day, on all stories. Or maybe this will only work on a certain editor, or only early in the term. Or only before lunch.
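As a rough sketch of why the weekday pattern falls out of the reward mechanism alone, consider a learner that only tracks average reward per weekday and style, and is never told which editor is on duty. The editor labels, style names and preference numbers below are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical rota and preferences: editor A works Mon/Wed/Fri, B Tue/Thu.
EDITOR_ON = {"Mon": "A", "Tue": "B", "Wed": "A", "Thu": "B", "Fri": "A"}
PREFERENCE = {("A", "formal"): 1.0, ("A", "chatty"): 0.2,
              ("B", "formal"): 0.3, ("B", "chatty"): 0.9}
STYLES = ["formal", "chatty"]


def train(weeks: int) -> dict:
    """Return the style with the highest average reward for each weekday."""
    total = defaultdict(float)  # summed reward per (day, style)
    count = defaultdict(int)    # number of stories per (day, style)
    for week in range(weeks):
        for day, editor in EDITOR_ON.items():
            # Alternate styles week by week so both get tried on each day.
            style = STYLES[week % len(STYLES)]
            # The learner only observes the reward, not the editor's identity.
            total[day, style] += PREFERENCE[editor, style]
            count[day, style] += 1
    # Untried (day, style) pairs count as average 0 to avoid dividing by zero.
    return {day: max(STYLES, key=lambda s: total[day, s] / max(count[day, s], 1))
            for day in EDITOR_ON}


print(train(4))
```

Under these assumptions the learner settles on "formal" for Monday, Wednesday and Friday and "chatty" for Tuesday and Thursday, even though nothing in its inputs mentions that there are two editors.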
Thus the simple trainable AI with a particular focus - write automated news stories - will be trained, through feedback, to learn about its editors/controllers, to distinguish them, to get to know them, and, in effect, to manipulate them.
This may be a useful "bridging example" between standard RL agents and the superintelligent machines.