Hi there, I've been thinking a lot about AI Alignment and values, the latter for longer than the former, admittedly. I'm in graduate school and study values through ethics. I would love to start a conversation about a thought that shot through my mind just last night. In thinking about values, we often focus on the principles, the concepts such as "good" and "bad" -- most simply, the nouns and adjectives. It is challenging to reach consensus on these even within the same language, let alone across cultural, linguistic, and geographic boundaries. In my past experience as an English teacher, conveying verbs was always easier than trying to explain things like integrity.

Here's my question: what if instead of fixed concepts and rules, AI alignment focused on actions as the underlying reward function? In other words, might programming AI to focus on the means rather than the ends facilitate an environment in which humans are freer to act and reach their own ends, prioritizing activated potential over predetermined outcome? Can the action, instead of the outcome, become the parameter, rendering AI a facilitator rather than a determiner? 

There's a lot more to these questions, with details and explanations that I would be happy to dive into with anyone interested in discussing further (I didn't think it appropriate to make my first post too lengthy). Either way, I'm happy to have found this group and look forward to connecting with likeminded and unlikeminded folks. Thank you for reading! ~Elisabeth


One interpretation of this would be imitation learning: teaching a system to imitate human strategies, rather than optimize some objective of its own.

The problem with imitation learning is: since humans are pretty smart, a close imitation of a human strategy is probably going to involve planning in the deliberate service of some values. So if you set a big neural network on the problem of imitating humans, it will develop its own preferences and ability to plan. This is a recipe for an inner optimizer. Its values and planning will have to line up with humans' in typical cases, but in extreme cases (e.g., adversarial examples), they could be very different. This can be a big problem, because the existence of such an AI could itself push us to extreme cases where the AI has trouble generalizing.
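To make "imitate human strategies rather than optimize an objective" concrete, here is a minimal sketch of the simplest possible imitator: a table that copies the most common human action in each observed situation. The function names (`fit_imitator`, `act`) and the toy traffic-light data are made up for illustration. A table like this is far too simple to develop an inner optimizer; it only shows the imitation idea and the gap that appears in situations the demonstrations never covered.

```python
from collections import Counter, defaultdict


def fit_imitator(demonstrations):
    """Tabular imitation: for each state, copy the most common human action."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    # Keep only the single most frequent action per observed state.
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


def act(policy, state, default=None):
    """Return the imitated action, or `default` if the state was never seen."""
    return policy.get(state, default)


# Hypothetical demonstrations: (situation, what the human did).
demos = [
    ("red_light", "stop"),
    ("red_light", "stop"),
    ("red_light", "go"),      # one human mistake is outvoted
    ("green_light", "go"),
]
policy = fit_imitator(demos)

seen = act(policy, "red_light")          # imitates the majority behaviour
unseen = act(policy, "flashing_yellow")  # nothing to imitate here
```

A neural network would fill in the unseen case by generalizing instead of returning nothing, and how it generalizes is exactly where the worry above comes in.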

Another interpretation of your idea could be "approval-directed agents". These are not trained to imitate humans, but rather, trained based on human approval of actions. However, unlike reinforcement learners, they don't plan ahead to maximize expected approval. They only learn to take specific actions more when they are approved of, and less when they earn disapproval.

Unlike imitation learners, approval-directed agents can be more capable than human trainers. However, unlike reinforcement learning agents, approval-directed agents don't have any incentive to take over control of their reward buttons. All the planning ahead comes from humans, looking at particular sorts of actions and deciding that they're good.
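As a rough illustration of the difference, here is a toy sketch of an approval-directed update. It is my own simplified reading of the description above, not a formal definition: the names `update` and `choose`, the learning rate, and the two example actions are all assumptions. The key point is that the agent only nudges its preference for the action it just took toward the approval it received; nothing in the code looks ahead or tries to maximize future approval.

```python
def update(preferences, action, approval, lr=0.5):
    """Move the score of `action` toward the approval signal (+1 or -1).

    Only the action that was actually taken is updated; there is no
    planning over future approval.
    """
    new = dict(preferences)
    new[action] = new[action] + lr * (approval - new[action])
    return new


def choose(preferences):
    """Pick the action with the highest learned approval score."""
    return max(preferences, key=preferences.get)


# Hypothetical actions, both starting at a neutral score.
prefs = {"ask_first": 0.0, "act_unilaterally": 0.0}

# The human overseer disapproves of one action and approves of the other.
prefs = update(prefs, "act_unilaterally", -1.0)
prefs = update(prefs, "ask_first", +1.0)

best = choose(prefs)
```

After these two pieces of feedback the agent prefers "ask_first". Any foresight about consequences lives in the human giving the approval, not in the update rule.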

Unfortunately, this still faces basically the same problem as imitation learning. Because humans are approving/disapproving based on complicated models of the world and detailed thoughts about the consequences of actions, a big neural network has good reason to replicate those faculties within itself. You get an inner optimizer again, with the risks of misalignment that this brings.

Thanks for this response. I heard a similar discussion recently, with someone talking about whether an algorithm's reward function was activated because it got the answer correct or because it knew it was what the programmers wanted it to do. It's hard to tell which, since the decision-making pathways are not always transparent, especially with more complex machine learning.

The inner optimizer thing is really interesting; I hadn't heard it coined like that before. Is it in AI's interest (a big assumption that it has interests at all, I know) to become so human-specific that it loses its ability to generalize? Variability would decrease in the population and the probability mechanisms of machine learning would approach certainty, thus rendering the AI basically ineffective.

Is it in AI's interest (a big assumption that it has interests at all, I know) to become so human-specific that it loses its ability to generalize?

There's an approach called learning the prior through imitative generalization that seemed to me a promising way to address this problem. The most relevant quotes from that article:

We might hope that our models will naturally generalize correctly from easy-to-answer questions to the ones that we care about. However, a natural pathological generalisation is for our models to only give us ‘human-like’ answers to questions, even if it knows the best answer is different. If we only have access to these human-like answers to questions, that probably doesn’t give us enough information to supervise a superhuman model.

What we’re going to call ‘Imitative Generalization’ is a possible way to narrow the gap between the things our model knows, and the questions we can train our model to answer honestly. It avoids the pathological generalisation by only using ML for IID tasks, and imitating the way humans generalize. This hopefully gives us answers that are more like ‘how a human would answer if they’d learnt from all the data the model has learnt from’. We supervise how the model does the transfer, to get the sort of generalisation we want.

Here's my question: what if instead of fixed concepts and rules, AI alignment focused on actions as the underlying reward function? In other words, might programming AI to focus on the means rather than the ends facilitate an environment in which humans are freer to act and reach their own ends, prioritizing activated potential over predetermined outcome? Can the action, instead of the outcome, become the parameter, rendering AI a facilitator rather than a determiner? 

If I'm understanding you right, which I'm not sure I am, I think this just collapses back to the normal case, except that the things being optimized for are those you demarcate as "means" rather than "ends". That is, the means literally become the ends because they are the things being optimized for.

I think you are understanding correctly and I see your point. So the question becomes: can we intervene before it becomes cyclical, so that the focus is process and not outcome? That's where the means and the ends remain separate. In effect, can a non-deterministic AI model be written?

In principle, you could enumerate every possible scenario your AI system could encounter and specify what the best action is in that situation. In practice, this either requires an impossible amount of computer memory, an impossible knowledge of what the system is likely to encounter, an impossible amount of work to actually create and test the mapping of input states to output actions, or more commonly a combination of these.
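To give a feel for the scale, here is a back-of-the-envelope sketch. The helper name `lookup_table_entries` and the 8x8 black-and-white image example are my own illustrative assumptions, not anything specific to a real system.

```python
def lookup_table_entries(num_features, values_per_feature):
    """Number of distinct input states a complete lookup table must cover."""
    return values_per_feature ** num_features


# A tiny 8x8 black-and-white image has 64 pixels with 2 values each.
entries = lookup_table_entries(64, 2)

# Storage needed if each entry stored just one byte for the chosen action.
bytes_per_exbibyte = 2 ** 60
exbibytes_needed = entries // bytes_per_exbibyte
```

Here `entries` is 2^64, about 1.8 x 10^19 states, and even at one byte per state the table would need 16 exbibytes. That is for an input far smaller than anything a useful system would actually see, which is why the table approach breaks down so quickly.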

Even if it could be done - in what sense could the result be meaningfully called "AI"? Modern computers behave perfectly deterministically, always performing the same action under the same conditions, but they are tools, not intelligences, and they can't learn on their own or generalize to new inputs. An AI with an ability to understand natural language will eventually be able to learn and use words it hasn't heard before, but my computer will never "know" what to do if I remove the "enter" button from the keyboard and plug in a toaster.

I'm far from well-read on these topics myself so I'm likely misunderstanding the question or poorly answering it. I recommend looking at some of the curated sequences on AI safety on LessWrong (click the "Library" header in the sidebar menu and scroll for relevant titles). It's very possible your questions are addressed there.

Does it have to be deterministic though? Can a program be open-ended to the effect that process is optimized and outcome is undetermined? (Perhaps navigating the world like that is "intelligence" without the "artificial.") I think AI is capable of learning on its own though, or at least programming other algorithms without human input. And one of the issues there is that once it learns language, as you point out, it will be able to do things we can't really fathom right now, I think. 

Thanks for the sequence rec. I'll check it out!