I'm claiming that it is possible to define the utility function of any agent. For unintelligent "agents" the result is probably unstable. For intelligent agents the result should be stable.
The evidence is that I have a formalism which produces this definition in a way compatible with intuition about "agent having a utility function". I cannot present evidence which doesn't rely on intuition since that would require having another more fundamental definition of "agent having a utility function" (which AFAIK might not exist). I do not consider this to be a problem since all reasoning falls back to intuition if you ask "why" sufficiently many times.
I don't see any meaningful definition of intelligence or instrumental rationality without a utility function. If we accept that humans are (approximately) rational / intelligent, they must (in the same approximation) have utility functions.
It also seems to me (again, intuitively) that the very concept of "preference" is incompatible with e.g. intransitivity. To whatever approximation it makes sense to speak of "preferences" at all, it makes sense to speak of preferences compatible with the VNM axioms, ergo a utility function. The same goes for the concept of "should": if it makes sense to say one "should" do something (for example, build an FAI), there must be a utility function according to which they should do it.
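To make the intransitivity point concrete, here is a minimal sketch (the option names and helper `has_utility_representation` are hypothetical, not from the original discussion): if preferences cycle, no real-valued utility function can represent them, because no strict ranking of the options agrees with every pairwise preference.

```python
# Sketch: why intransitive preferences admit no utility function.
# If A > B, B > C, and C > A, then no real-valued u can satisfy
# u(A) > u(B) > u(C) > u(A).
from itertools import permutations

def has_utility_representation(options, prefers):
    """Return True iff some strict ranking of `options` agrees with
    `prefers` on every pair -- i.e. a utility function exists
    (just assign utilities by rank)."""
    return any(
        all(prefers(order[i], order[j])
            for i in range(len(order))
            for j in range(i + 1, len(order)))
        for order in permutations(options)
    )

# Transitive preferences A > B > C: representable by a utility function.
transitive = {("A", "B"), ("B", "C"), ("A", "C")}
print(has_utility_representation("ABC", lambda x, y: (x, y) in transitive))  # True

# Cyclic preferences A > B > C > A: no ranking works, so no utility function.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
print(has_utility_representation("ABC", lambda x, y: (x, y) in cyclic))  # False
```

(An agent with the cyclic preferences is also the classic "money pump": it will pay to trade C for B, B for A, and A for C, ending up where it started but poorer.)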
Bottom line: eventually it all hits philosophical assumptions which have no further formal justification. However, this is true of all reasoning. IMO the only valid way to disprove such assumptions is either by reductio ad absurdum or by presenting a different set of assumptions which is better in some sense. If you have such an alternative set of assumptions for this case, or a wholly different way to resolve philosophical questions, I would be very interested to know.
I'm claiming that it is possible to define the utility function of any agent.
It is trivially possible to do that. Since no choice is strictly identical, you just add enough details to make each choice unique, and then choose a utility function that will always reach that choice ("subject has a strong preference for putting his left foot forwards when seeing an advertisement for deodorant on Tuesday morning that are the birthdays of prominent Dutch politicians").
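The trivial construction can be sketched in a few lines (this is an illustrative toy, not anyone's proposed formalism; `trivial_utility` and the situation strings are made up): pad each "outcome" with enough situational detail that every observed choice is unique, then define the indicator utility that is 1 exactly on the choices actually taken. Any behavior whatsoever maximizes it.

```python
# Sketch: the trivial "every agent has a utility function" construction.
# By folding enough context into the outcome description, each observed
# choice becomes unique, and an indicator utility rationalizes anything.

def trivial_utility(history):
    """history: list of (situation, action) pairs actually observed.
    Returns a utility function u(situation, action) that the observed
    behavior maximizes by construction."""
    chosen = set(history)
    return lambda situation, action: 1.0 if (situation, action) in chosen else 0.0

history = [
    ("Tuesday morning, deodorant ad, Dutch politician's birthday", "left foot forward"),
    ("Wednesday, no ad", "right foot forward"),
]
u = trivial_utility(history)
print(u("Tuesday morning, deodorant ad, Dutch politician's birthday", "left foot forward"))   # 1.0
print(u("Tuesday morning, deodorant ad, Dutch politician's birthday", "right foot forward"))  # 0.0
```

This is exactly why such a "definition" is vacuous: it has no predictive content, since it can be fitted to any behavior after the fact.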
A good simple model of human behaviour is that of different modules expressing preferences...
To construct a friendly AI, you need to be able to make vague concepts crystal clear, cutting reality at the joints even when those joints are obscure and fractal - and then build a system that implements that cut.
There are lots of suggestions on how to do this, and a lot of work in the area. But having been over the same turf again and again, it's possible we've got a bit stuck in a rut. So to generate new suggestions, I'm proposing that we look at a vaguely analogous but distinctly different question: how would you ban porn?
Suppose you're put in charge of some government and/or legal system, and you need to ban pornography, and see that the ban is implemented. Pornography is the problem, not eroticism. So a lonely lower-class guy wanking off to "Fuck Slaves of the Caribbean XIV" in a Pussycat Theatre is completely out. But a middle-class couple experiencing a delicious frisson when they see a nude version of "Pirates of Penzance" at the Met is perfectly fine - commendable, even.
The distinction between the two cases is certainly not easy to spell out, and many are reduced to saying the equivalent of "I know it when I see it" when defining pornography. In terms of AI, this is equivalent to "value loading": refining the AI's values through interactions with human decision makers, who answer questions about edge cases and examples and serve as "learned judges" for the AI's concepts. But suppose that approach was not available to you - what methods would you implement to distinguish between pornography and eroticism, and ban one but not the other? Could you make the criteria sufficiently clear that a scriptwriter would know exactly what to cut or add to a movie in order to move it from one category to the other? What if the nude "Pirates of Penzance" was at a Pussycat Theatre and "Fuck Slaves of the Caribbean XIV" was at the Met?
To get maximal creativity, it's best to ignore the ultimate aim of the exercise (to find inspirations for methods that could be adapted to AI) and just focus on the problem itself. Is it even possible to get a reasonable solution to this question - a question much simpler than designing a FAI?