For the sake of argument, let's consider an agent to be autonomous if:

  • It has sensors and actuators, i.e. it can perceive and act on its environment.
  • It has an internal representation of its goals. I will call this internal representation its desires.
  • It has some kind of internal planning function that, given sensations and desires, chooses actions to maximize the desirability of expected outcomes.
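
To make this working definition concrete, here is a minimal sketch in Python. Everything in it (the class name, the predict world model, the way desires scores outcomes) is my own illustrative assumption, not a standard formalism.

```python
# A minimal sketch of the "autonomous agent" defined above. All names here
# (AutonomousAgent, predict, the desirability scoring) are illustrative
# assumptions, not references to any particular framework.
from typing import Any, Callable, Iterable


class AutonomousAgent:
    def __init__(
        self,
        sense: Callable[[], Any],            # sensor: read the current state of the world
        act: Callable[[Any], None],          # actuator: carry out a chosen action
        desires: Callable[[Any], float],     # internal representation of goals: scores outcomes
        predict: Callable[[Any, Any], Any],  # world model: expected outcome of an action
    ):
        self.sense = sense
        self.act = act
        self.desires = desires
        self.predict = predict

    def plan(self, actions: Iterable[Any]) -> Any:
        """Planning function: given a sensation and the agent's desires,
        choose the action whose expected outcome is most desirable."""
        sensation = self.sense()
        return max(actions, key=lambda a: self.desires(self.predict(sensation, a)))

    def step(self, actions: Iterable[Any]) -> None:
        self.act(self.plan(actions))
```

The only point of the sketch is that the desires function lives inside the agent; the distinctions that follow turn on exactly that.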

I want to point to the desires of the agent specifically to distinguish them from the goals we might infer for it if we were to observe its actions over a long period of time. Let's call these an agent's empirical goals. (I would argue that it is in many cases impossible to infer an agent's empirical goals from its behavior, but that's a distraction from my main point so I'm just going to note it here for now.)

I also want to distinguish them from the goals it might stably arrive at if it were to execute some internal goal-modification process optimizing for certain conditions. Let's call these an agent's reflective goals.
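
To keep the three notions apart, here is a hedged sketch that treats them as three distinct objects. The fit_score, modify, and stable helpers are placeholders of my own invention, not standard algorithms.

```python
# Desires, empirical goals, and reflective goals as three different things.
# The helper functions passed in (fit_score, modify, stable) are hypothetical.

def desires(agent):
    """The agent's own internal representation of its goals."""
    return agent.desires  # lives inside the agent

def empirical_goals(behavior_log, candidate_goals, fit_score):
    """What an outside observer would ascribe to the agent after watching it
    for a long time: the candidate goal that best explains the observed
    behavior. As noted above, this inference may be underdetermined."""
    return max(candidate_goals, key=lambda g: fit_score(g, behavior_log))

def reflective_goals(agent, modify, stable, max_iters=1000):
    """The goals the agent would settle on if it ran some internal
    goal-modification process until its goals stopped changing."""
    goals = agent.desires
    for _ in range(max_iters):
        new_goals = modify(agent, goals)
        if stable(goals, new_goals):
            return new_goals
        goals = new_goals
    return goals  # the process may never stabilize
```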

The term utility function, so loved by the consequentialists, frequently obscures these important distinctions. An argument that I have heard for the expansive use of the term "utility function" in describing agents is: The behavior of all agents can be characterized using a utility function, therefore all agents have utility functions. This argument depends on a fallacious conflation of desires, empirical goals, and reflective goals.

An important problem that I gather this community thinks about deeply is how to reason about an agent whose reflective goals differ from its present desires--say, desires that I have and have transferred over to it. For example, if I program an agent with desires and the capacity to reflect, can I guarantee that it will remain faithful to the desires I intended for it?

A different problem that I am interested in is what other classes of agents there may be besides autonomous agents. Specifically, what if an agent does not have an internal representation of its desires? Is that possible? Part of my motivation for this is my interest in Buddhism. If an enlightened agent is one with no desires, how could one program a bodhisattva AI whose only motivation was the enlightenment of all beings?

An objection to this line of thinking is that even an agent that desires enlightenment for all beings has desires. It must in some formal sense have a utility function, because of the von Neumann-Morgenstern theorem. Right?

I'm not so sure, because of complexity.

To elaborate: if a utility function is so complex (either in its compressed "spatial" representation, such as its Kolmogorov complexity, or along some temporal dimension of complexity, such as its running time or logical depth) that the agent lacks the capacity to represent it internally, then it would be impossible for the agent to have that utility function as its desire.
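
As a crude illustration of this capacity argument: Kolmogorov complexity is uncomputable, so the sketch below uses compressed length as a rough stand-in for the "spatial" complexity of a utility function, and an invented byte budget stands in for the agent's internal capacity.

```python
# A crude, hedged illustration of the capacity argument above. Compressed
# length is only an upper bound on description length, and the capacity
# budget is an invented assumption.
import zlib

def can_be_a_desire(utility_source: bytes, agent_capacity_bytes: int) -> bool:
    """Could a utility function whose serialized definition is `utility_source`
    be represented internally by an agent with `agent_capacity_bytes` of
    storage? If not, it cannot be the agent's *desire* in the sense above,
    though it might still be ascribed to it as an *empirical goal*."""
    description_length = len(zlib.compress(utility_source))
    return description_length <= agent_capacity_bytes
```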

However, such a utility function could be ascribed to the agent as its empirical goal if the agent were both internally composed and embedded in the world in such a way that it acted as if it had those complex desires. This is consistent with, for example, Buddhist writings about how enlightened beings act with spontaneous compassion. Goal-oriented planning does not seem to get in the way here.

How could an AI be compassionate? Perhaps an AI could be empathetic if it could perceive, through its sensors, the desires (or empirical goals, or reflective goals) of other agents and internalize them as its own. Perhaps it does this only temporarily. Perhaps it has, in place of a goal-directed planning mechanism, a way of reconciling differences among its internalized goal functions. This internal logic for reconciling preferences is obviously critical to the identity of the agent, and it is the Achilles' heel of the main thrust of this argument: surely that logic could be characterized with a utility function and would be, effectively, the agent's desire?
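
(Before answering that question, here is a hedged sketch of what such an agent might look like. The maximin rule and every name in it are my own illustrative choices; the point is only that the reconcile rule, rather than a fixed desire, does the work usually assigned to a utility function.)

```python
# A hedged sketch of the "empathetic" agent described above. It has no fixed
# desire of its own: it holds only goal functions perceived in other agents
# (possibly only temporarily) plus a rule for reconciling them. Maximin (min)
# is just one illustrative reconciliation rule.
from typing import Any, Callable, List

GoalFn = Callable[[Any], float]

class EmpatheticAgent:
    def __init__(self, reconcile: Callable[[List[float]], float] = min):
        self.internalized: List[GoalFn] = []  # others' goals, internalized as its own
        self.reconcile = reconcile

    def perceive_goal(self, other_goal: GoalFn) -> None:
        """Internalize a desire (or empirical/reflective goal) read off another agent."""
        self.internalized.append(other_goal)

    def forget(self) -> None:
        """Internalization may be only temporary."""
        self.internalized.clear()

    def evaluate(self, outcome: Any) -> float:
        """Score an outcome by reconciling all internalized goal functions."""
        if not self.internalized:
            return 0.0  # no goals of its own to fall back on
        return self.reconcile([g(outcome) for g in self.internalized])

    def choose(self, outcomes: List[Any]) -> Any:
        return max(outcomes, key=self.evaluate)
```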

Not so, I repeat, if that logic were not a matter of internal representation but rather the logic of the entire system composed of the agent and its environment. In other words, if the agent's desires are identical to the logic of the entire world of which it is a part, then it no longer has desire in the sense defined above. It is also no longer autonomous in the sense defined above. Nevertheless, I think it is an important kind of agent to consider when thinking about possible paradigms of ethical AI. In general, I think that drawing on non-consequentialist ethics for inspiration in designing "friendly" AI is a promising research trajectory.


Comments

what if an agent does not have an internal representation of its desires? Is that possible?

Human infants, perhaps.

I have some difficulties mapping the terms you use (and roughly define) to the usual definitions of these terms. For example:

It has an internal representation of its goals. I will call this internal representation its desires.

Nonetheless I see some interesting differentiations in your elaborations. It makes a difference whether a utility function is explicitly coded as part of the system or whether an implicit utility function is inferred from the system's overall behavior. And both are different from the utility function inferred for the composition of the system with its environment.

I also like how you relate these concepts to compassion and consequentialism, even though the connections appear vague to me. Some more elaboration, or rather more precise relationships, could help.

How could an AI be compassionate? Perhaps an AI could be empathetic if it could perceive, through its sensors, the desires (or empirical goals, or reflective goals) of other agents and internalize them as its own.

In other words, it tries to maximize human values. Isn't this the standard way of programming a Friendly AI?

Isn't this the standard way of programming a Friendly AI?

I don't think it makes sense to speak about a standard way of programming a Friendly AI.

"Designing" would probably be a better word. The standard idea for how you could make an AI friendly.