I've been following the "safe AI" debates for quite some time, and I would like to share some views and ideas I don't remember seeing mentioned yet.
There is a lot of focus on what kind of utility function an AI should have, and on how to keep it adhering to that utility function. Let's assume we have an optimizer which doesn't develop any "deliberately malicious" intents, cannot change its own utility function, and can have some hard-coded constraints it cannot overwrite. (Maybe we should come up with a term for such an AI; it might prove useful in the study of safe AI, where we could then concentrate only on the utility function and assume the above conditions hold. From now on, let's just use the term "optimizer" in this article. Hm, maybe "honest optimizer"?) Even an AI with the above constraints can be dangerous; an interesting example can be found in the Friendship is Optimal stories.
The question I would like to raise is not what kind of utility function we should come up with, but what kind of language we define it in.
More specifically, how high-level should the language be? As low as a mathematical function working with quantized qualities based on what values humans consider important? A programming language? Or a complex, syntactic grammar like human languages, capable of expressing abstract concepts? Something a step above even that?
Just quantizing some human values we find important, and assigning weights to them, can have many problems:
1. Overfitting.
A simplified example: imagine the desired behavior of the AI as a function. You come up with a lot of points on this function, and the AI fits a function to those points, hopefully ending up with something very similar to the function you conceived. However, an optimizer can very quickly come up with a function that goes through all of your defined points yet looks nothing like the one you imagined. I think many of us have encountered this problem when trying to fit a curve with a polynomial of too high a degree (see the sketch below).
I suspect many safe-AI problems can be conceptualized as overfitting: the optimizer will exactly fulfill the requirements we programmed into it, but will behave arbitrarily with respect to the requirements we didn't specify.
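To make the analogy concrete, here is a minimal sketch of the curve-fitting version of the problem, assuming numpy; the intended function, the sample points and the polynomial degrees are all arbitrary illustrative choices, not anything from the argument above. A degree-7 polynomial through 8 points hits every specified point exactly, yet can behave very differently from the simple function that generated them, especially between and beyond those points.

```python
# Minimal sketch of the overfitting analogy: a handful of "specified" points,
# a simple intended function, and a high-degree polynomial that matches every
# point exactly yet strays from the intended behavior elsewhere.
import numpy as np

rng = np.random.default_rng(0)

# The "intended" behavior: a gentle quadratic.
def intended(x):
    return 0.5 * x ** 2

# The few points we bothered to specify, with slight noise.
x_specified = np.linspace(-3, 3, 8)
y_specified = intended(x_specified) + rng.normal(0, 0.1, size=x_specified.shape)

# One fit with a sensible number of degrees of freedom, one with far too many.
sensible_coeffs = np.polyfit(x_specified, y_specified, deg=2)
overfit_coeffs = np.polyfit(x_specified, y_specified, deg=7)

# Compare the fits between and beyond the specified points.
print("x      intended  deg-2 fit  deg-7 fit")
for x in np.linspace(-4, 4, 9):
    print(f"{x:5.1f}  {intended(x):8.2f}  {np.polyval(sensible_coeffs, x):9.2f}"
          f"  {np.polyval(overfit_coeffs, x):9.2f}")
```

The degree-7 fit "satisfies" every requirement we wrote down, while the unspecified regions are where it diverges; that is the sense in which an optimizer can technically meet our specification and still do something we never intended.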
2. Changing of human values.
Imagine that someone created an honest optimizer, thought of all the possible pitfalls, designed the utility function and all the constraints very carefully, and created a truly safe AI which didn't become unfriendly. This AI quickly eliminated illness, poverty, and the other major problems humans faced, and created a utopian world. To not let this utopia degenerate into a dystopia over time, it also maintains that world and resists any possible change (as any change would detract from its utility function of creating that utopia). Seems nice, doesn't it? Now imagine that this AI was created by someone in the Victorian era, and the created world adhered to the cultural norms, lifestyle, values and morality of that era of British history. And these would never, ever change. Would you, with your current ideologies, enjoy living in such a world? Would you think of it as the best of all conceivable worlds?
Now, what if this AI were created by you, in our current era? You surely would know much better than those pesky Victorians, right? We have much better values now, don't we? However, for people living a couple of generations from now, our current ideas and values might seem as strange as the Victorian values seem to us. Without judging either the Victorian or the current values, I think I can safely assume that if a time traveler from the Victorian era arrived in today's world, or a time traveler from today got stuck in the Victorian era, both would find it very uncomfortable.
Therefore I would argue that even a safe and friendly AI could end up forever locking mankind into the values of the AI's creator (or of the creator's generation, if the values are defined by a democratic process).
Summary
We should spend some thought on how we formulate the goals of a safe AI, and what kind of language we should use. I would argue that a low-level language would be very unsafe. We should look for a language which can express abstract concepts yet is strict enough to be defined accurately. Low-level languages have the advantage over high-level ones of being very precise, but they are at a disadvantage when it comes to expressing abstract concepts.
We might even find it useful to take a look at real-life religions, as they tend to last for a very long time and can carry a core message across many generations of changing cultural norms and values. My point is not to argue about the virtues or vices of specific real-world religions; I only use them here as a convenient example, strictly from a historical point of view, with no offense intended.
The largest religion in our world has a very simple message as one of its most important core rules: "love other people as yourself". This is a sufficiently abstract concept that both bronze-age shepherds and modern-day computer scientists understand it, and they probably interpret the sentence in much the same way. Now compare it to the religion it originated from, which has orders of magnitude fewer followers, and in its strictest form has very strongly defined rules and regulations, many of which are hard to translate into the modern world. A lot of its experts spend considerable time trying to translate them, asking questions like "is just pressing a single button on a washing machine considered working?". And what about hygiene practices which made sense for nomadic people in the desert; how can they be understood (and applied) by modern people? Concepts expressed in a high-level language carry their meaning much better across times with changing cultural, social and technical characteristics.
However, a rule like "on a calendar day divisible by seven you are only allowed to walk x steps" is easy to code; even many of our current robots could easily be programmed to follow it (see the sketch below). On the other hand, expressing what love is will prove much harder, but it will preserve its meaning and intention for much longer.
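To illustrate the contrast, here is a small sketch of the two kinds of rule side by side. The step limit, the function names and the date checks are hypothetical stand-ins I made up for this example; the point is only that the rigid rule is a few lines of straightforward code, while the abstract one has no obvious encoding at all.

```python
# Contrast between a rigid, low-level rule and an abstract, high-level one.
from datetime import date

STEP_LIMIT = 1000  # hypothetical "x steps" from the rule above

def steps_allowed(today: date, steps_taken: int) -> bool:
    """Rigid rule: on a calendar day divisible by seven,
    you may walk at most STEP_LIMIT steps."""
    if today.day % 7 == 0:
        return steps_taken <= STEP_LIMIT
    return True

def loves_neighbour_as_self(action) -> bool:
    """Abstract rule. There is no obvious way to reduce this to a predicate
    over sensor readings; encoding it is exactly the hard part."""
    raise NotImplementedError("This is what we do not know how to write down.")

print(steps_allowed(date(2024, 3, 7), 800))   # True: under the limit on a divisible day
print(steps_allowed(date(2024, 3, 8), 5000))  # True: the rule does not apply on day 8
```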
You're asking the wrong question - partly because of confusion over the term 'utility function'.
We want the AI to embody human values through a utility function that is a reasonable approximation of the hypothetical ideal group utility function that some large organization of humans (or just humanity) would converge on if they had unbounded amounts of time to reach consensus on the actions the AI takes.
That ideal utility function is, for practical purposes, impossible to define directly or hand-engineer; it's far too complex.
To illustrate why, consider the much simpler problem of a narrow AI that just recognizes images: a computer vision system. The vision AI takes an image as input and produces an action output. The ideal utility function over input/output pairs is again defined by the action a committee of humans would take given enough time. We don't actually hand-engineer the decision utility function for vision either: again, it's too complex. Instead, the best approach is to define the vision system's utility function indirectly, based on labeled examples. Defining the system's goals that way leads to a tractable inference problem with a well-defined optimization criterion.
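A toy sketch of what "defining the utility function indirectly from labeled examples" might look like, assuming numpy: synthetic two-feature inputs stand in for images, and the log-likelihood of human-provided labels plays the role of the indirectly defined utility. The data, the logistic-regression model and the learning rate are all illustrative assumptions on my part, not any particular system described above.

```python
# Indirect utility definition via labeled examples: we never write down a
# scoring rule over (input, output) pairs; human labels implicitly define
# what counts as a good output, and we fit a model to agree with them.
import numpy as np

rng = np.random.default_rng(1)

# "Labeled examples": humans have told us the correct output for each input.
X = np.vstack([rng.normal(-1.0, 0.5, size=(50, 2)),
               rng.normal(+1.0, 0.5, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression trained by gradient descent: maximizing agreement with
# the human labels stands in for the hand-engineered utility we never wrote.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"agreement with human labels: {accuracy:.2%}")
```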
The same general approach can scale up to more complex AGI systems. To avoid the need for huge hand-labeled training datasets, we can use techniques such as inverse reinforcement learning, where we first use an inference procedure to recover estimates of human utility functions. Then we can use these recovered utility functions in a general reinforcement learning framework as a replacement for a hardwired reward function (as in AIXI).
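In the spirit of that idea, here is a toy sketch of recovering an assumed-linear utility function from observed human choices between pairs of options. This is a simple preference-based recovery (a Bradley-Terry-style choice model fit by logistic regression), a close relative of inverse reinforcement learning rather than any specific published IRL algorithm; the features, weights and noise model are all made-up assumptions for illustration.

```python
# Recovering a hidden utility function from simulated human decisions.
import numpy as np

rng = np.random.default_rng(2)

true_w = np.array([2.0, -1.0, 0.5])   # hidden "human" utility weights over 3 features

# Observed decisions: for each pair of options, the human tends to pick the
# one with higher true utility, with some logistic noise in the choice.
# (Utility is only identified up to scale; the noise scale of 1.0 is chosen
# here so that the recovered numbers are directly comparable to true_w.)
pairs = rng.normal(size=(200, 2, 3))  # 200 decisions, 2 options each, 3 features
utilities = pairs @ true_w            # shape (200, 2)
chose_first = (utilities[:, 0] - utilities[:, 1] + rng.logistic(0, 1.0, 200)) > 0

# Infer weights by logistic regression on feature differences: the recovered
# weights stand in for the "estimated human utility function".
diffs = pairs[:, 0, :] - pairs[:, 1, :]
y = chose_first.astype(float)
w = np.zeros(3)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(diffs @ w)))
    w -= 0.1 * (diffs.T @ (p - y)) / len(y)

print("true weights:     ", true_w)
print("recovered weights:", np.round(w, 2))
```

The recovered weights could then, in principle, serve as the learned stand-in for a hardwired reward function, which is the substitution the paragraph above describes.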
So, in short, the goals of any complex AGI are unlikely to be explicitly written down in any language - at least not directly. Using the techniques described above, the goals/values come from training data collected from human decisions. The challenge then becomes building a training program that can significantly cover the space of human ethics/morality. Eventually we will be able to do that using virtual reality environments, but there may be even easier techniques involving clever uses of brain imaging.
I can agree with some of your points, but interestingly, many commenters prefer a very rigorously defined utility function, written in the lowest-level language possible, over your heuristically developed one, because they argue that its exact functionality has to be provable.