I wrote a fable for the EA "AI fables" contest, which raises the question of what happens when you copy values from humans to AIs, and those values contain self-referential pointers. The fable just raises the issue, and is more about contemporary human behavior than nitty-gritty representational issues. But further reflection made me think the issue may be much more serious than the fable suggests, so I wrote this: De Dicto and De Se Reference Matters for Alignment (a crosslink to forum.effectivealtruism.org; yes I should've posted it here first and crosslinked in the other direction, but I didn't).

Nice read. Not an actual problem with inverse RL, I think, because an AI observing human24 try to get a cookie will learn want(human24 gets cookie), not want(I* get cookie), unless you've put in special work to make it otherwise. But potentially a problem with more abstract cashings-out of the idea "learn human values and then want that."

You're looking at the logical form and imagining that it's a sufficient understanding to start pursuing the goal. But it's only sufficient in toy worlds, where you have one goal at a time and the mapping between the goal and the environment is so simple that the agent doesn't need to understand the value, or the target of "cookie", beyond "cookie" vs. "non-cookie". In the real world, the agent has many goals, and the goals involve nebulous concepts and have many considerations and conditions attached, e.g. how healthy this cookie is, how tasty it is, how hungry I am. It will need to know /why/ it, or human24, wants a cookie in order to decide intelligently when to get the cookie, to resolve conflicts between goals, and to do probability calculations that depend on how strongly different goals are correlated via the higher goals they satisfy.

There's a confounding confusion in this particular case, in which you seem to be hoping the robot will infer that the agent of the desired act is the human, both in the case of the human and in the case of the AI. But for values in general, we often want the AI to act the way the human would act, not to want the human to do something. Your posited AI would learn the goal of wanting human24 to get a cookie.

What it all boils down to is: you have to resolve the de re / de dicto / de se interpretation in order to understand what the agent wants. That means an AI also has to resolve that question in order to know what a human wants. Your intuitions about toy examples like "human24 always wants a cookie, unconditionally, forever" will mislead you, in the same ways toy-world examples misled symbolic AI researchers for 60 years.
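To make the three readings concrete, here's one toy way of writing them down as explicit logical forms. This is only a sketch; the notation, the SELF marker, and the specific terms are invented for illustration:

```python
# Three readings of "the agent wants a cookie", as explicit (toy) logical forms.
# "SELF" marks a de se slot: it refers to whichever agent holds the value.

de_re    = ("want", "human24", ("has", "human24", "cookie_0042"))             # that particular cookie
de_dicto = ("want", "human24", ("some", "cookie", ("has", "human24", "it")))  # some cookie or other
de_se    = ("want", "SELF",    ("some", "cookie", ("has", "SELF", "it")))     # whoever holds this value

# Copying the de_se form from a human into an AI silently changes its referent:
# in the human it meant "I (the human) get a cookie"; in the AI it now means
# "I (the AI) get a cookie".  The first two forms don't have that problem.
```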

"you seem to be hoping the robot will infer that the agent of the desired act is the human, both in the case of the human, and of the AI"

No, I'm definitely just thinking about IRL here.

IRL takes a model of the world and of the human's affordances as given constants, assumes the human is (maybe noisily) rational, and then infers human desires in terms of that world model; the inferred desires can then also be used by the AI to choose actions, if you have a model of the AI's affordances. It has many flaws, but it's definitely worth refreshing yourself on occasionally.
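For concreteness, here's a minimal sketch of that inference under a Boltzmann-rationality assumption, in a one-step decision problem. The action set, the feature names (including `human24_has_cookie`), and the numbers are all invented for illustration; the point is just that the inferred reward is expressed over features of the observed human's situation, i.e. want(human24 gets cookie), not want(I* get cookie):

```python
# Minimal sketch of IRL with a Boltzmann-rational human, one-step decision problem.
# Everything here (actions, features, numbers) is made up for illustration; real IRL
# works over full MDP trajectories, but the shape of the inference is the same.
import numpy as np

# Features of each action human24 can take, expressed in the given world model:
feature_names = ["human24_has_cookie", "effort_spent"]
actions = {
    "grab_cookie": np.array([1.0, 1.0]),
    "grab_apple":  np.array([0.0, 1.0]),
    "do_nothing":  np.array([0.0, 0.0]),
}
names = list(actions)
phi = np.stack([actions[a] for a in names])   # (num_actions, num_features)
beta = 5.0                                    # assumed rationality constant

def choice_probs(w):
    """Noisily-rational human: P(a) proportional to exp(beta * w . phi(a))."""
    logits = beta * (phi @ w)
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Observed demonstrations of human24's behavior.
observed = ["grab_cookie"] * 7 + ["grab_apple"] * 1 + ["do_nothing"] * 2
obs_idx = np.array([names.index(a) for a in observed])

# Infer reward weights w by gradient ascent on the log-likelihood of the data.
w = np.zeros(len(feature_names))
for _ in range(500):
    p = choice_probs(w)
    # gradient of log-likelihood: beta * sum_obs (phi_obs - E_p[phi])
    grad = beta * (phi[obs_idx].sum(axis=0) - len(obs_idx) * (p @ phi))
    w += 0.01 * grad

print({name: round(float(x), 2) for name, x in zip(feature_names, w)})
# -> roughly {'human24_has_cookie': 0.39, 'effort_spent': -0.14}
# The learned reward is about human24's situation in the world model, not about
# "I get a cookie"; no de se pointer appears at this step.
```

Nothing in this inference step creates a self-referential pointer; the de se question only shows up once the learned reward is handed to the AI's own planner.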

By "just thinking about IRL", do you mean "just thinking about the robot using IRL to learn what humans want"?  'Coz that isn't alignment.

'But potentially a problem with more abstract cashings-out of the idea "learn human values and then want that"' is what I'm talking about, yes.  But it also seems to be what you're talking about in your last paragraph.

"Human wants cookie" is not a full-enough understanding of what the human really wants, and under what conditions, to take intelligent actions to help the human.  A robot learning that would act like a paper-clipper, but with cookies.  It isn't clear whether a robot which hasn't resolved the de dicto / de re / de se distinction in what the human wants will be able to do more good than harm in trying to satisfy human desires, nor what will happen if a robot learns that humans are using de se justifications.

Here's another way of looking at that "nor what will happen if" clause: we've been casually tossing about the phrase "learn human values" for a long time, but that isn't what the people who say it actually want. If an AI learned human values, it would treat humans the way humans treat cattle. But if the AI is instead to learn to desire to help humans satisfy their wants, it isn't clear that the AI can (A) internalize human values well enough to understand and effectively optimize for them, while at the same time (B) keeping those values compartmentalized from its own values, the ones that make it enjoy helping humans with their problems. To do that, the AI would need to want to propagate and support human values that it disagrees with, and it isn't clear that that's something a coherent, let's say "rational", agent can do.