[paper] [link] Defining human values for value learners

5 Kaj_Sotala 03 March 2016 09:29AM

MIRI recently blogged about the workshop paper that I presented at AAAI.

My abstract:

Hypothetical “value learning” AIs learn human values and then try to act according to those values. The design of such AIs, however, is hampered by the fact that there exists no satisfactory definition of what exactly human values are. After arguing that the standard concept of preference is insufficient as a definition, I draw on reinforcement learning theory, emotion research, and moral psychology to offer an alternative definition. In this definition, human values are conceptualized as mental representations that encode the brain’s value function (in the reinforcement learning sense) by being imbued with a context-sensitive affective gloss. I finish with a discussion of the implications that this hypothesis has on the design of value learners.

Their summary:

Economic treatments of agency standardly assume that preferences encode some consistent ordering over world-states revealed in agents’ choices. Real-world preferences, however, have structure that is not always captured in economic models. A person can have conflicting preferences about whether to study for an exam, for example, and the choice they end up making may depend on complex, context-sensitive psychological dynamics, rather than on a simple comparison of two numbers representing how much one wants to study or not study.

Sotala argues that our preferences are better understood in terms of evolutionary theory and reinforcement learning. Humans evolved to pursue activities that are likely to lead to certain outcomes — outcomes that tended to improve our ancestors’ fitness. We prefer those outcomes, even if they no longer actually maximize fitness; and we also prefer events that we have learned tend to produce such outcomes.

Affect and emotion, on Sotala’s account, psychologically mediate our preferences. We enjoy and desire states that are highly rewarding in our evolved reward function. Over time, we also learn to enjoy and desire states that seem likely to lead to high-reward states. On this view, our preferences function to group together events that lead on expectation to similarly rewarding outcomes for similar reasons; and over our lifetimes we come to inherently value states that lead to high reward, instead of just valuing such states instrumentally. Rather than directly mapping onto our rewards, our preferences map onto our expectation of rewards.

Sotala proposes that value learning systems informed by this model of human psychology could more reliably reconstruct human values. On this model, for example, we can expect human preferences to change as we find new ways to move toward high-reward states. New experiences can change which states my emotions categorize as “likely to lead to reward,” and they can thereby modify which states I enjoy and desire. Value learning systems that take these facts about humans’ psychological dynamics into account may be better equipped to take our likely future preferences into account, rather than optimizing for our current preferences alone.

Would be curious to hear whether anyone here has any thoughts. This is basically a "putting rough ideas together and seeing if they make any sense" kind of paper, aimed at clarifying the hypothesis and seeing whether others can find any obvious holes in it, rather than being at the stage of a serious scientific theory yet.

 

 

[Link] Using Stories to Teach Human Values to Artificial Agents

1 Gunnar_Zarncke 21 February 2016 08:07PM

Abstract:

Value alignment is a property of an intelligent agent indicating that it can only pursue goals that are beneficial to humans. Successful value alignment should ensure that an artificial general intelligence cannot intentionally or unintentionally perform behaviors that adversely affect humans. This is problematic in practice since it is difficult to exhaustively enumerated by human programmers. In order for successful value alignment, we argue that values should be learned. In this paper, we hypothesize that an artificial intelligence that can read and understand stories can learn the values tacitly held by the culture from which the stories originate. We describe preliminary work on using stories to generate a value-aligned reward signal for reinforcement learning agents that prevents psychotic-appearing behavior. 

-- Using Stories to Teach Human Values to Artificial Agents 

Comment by the lead researcher Riedl (cited on Slashdot):

"The AI ... runs many thousands of virtual simulations in which it tries out different things and gets rewarded every time it does an action similar to something in the story," said Riedl, associate professor and director of the Entertainment Intelligence Lab. "Over time, the AI learns to prefer doing certain things and avoiding doing certain other things. We find that Quixote can learn how to perform a task the same way humans tend to do it. This is significant because if an AI were given the goal of simply returning home with a drug, it might steal the drug because that takes the fewest actions and uses the fewest resources. The point being that the standard metrics for success (eg, efficiency) are not socially best." 

Quixote has not learned the lesson of "do not steal," Riedl says, but "simply prefers to not steal after reading and emulating the stories it was provided."

Value learners & wireheading

5 Manfred 03 February 2016 09:50AM

Dewey 2011 lays out the rules for one kind of agent with a mutable value system. The agent has some distribution over utility functions, which it has rules for updating based on its interaction history (where "interaction history" means the agent's observations and actions since its origin). To choose an action, it looks through every possible future interaction history, and picks the action that leads to the highest expected utility, weighted both by the probability of making that future happen and by the utility function distribution that would hold if that future came to pass.

[Image: Drone can bring sandwich either to work or to home]

We might motivate this sort of update strategy by considering a sandwich-drone bringing you a sandwich. The drone can either go to your workplace, or go to your home. If we think about this drone as a value-learner, then the "correct utility function" depends on whether you're at work or at home - upon learning your location, the drone should update its utility function so that it wants to go to that place. (Value learning is unnecessarily indirect in this case, but that's because it's a simple example.)

Suppose the drone begins its delivery assigning equal measure to the home-utility-function and to the work-utility-function (i.e. ignorant of your location), and can learn your location for a small cost. If the drone evaluated this idea with its current utility function, it wouldn't see any benefit, even though it would in fact deliver the sandwich properly - because under its current utility function there's no point to going to one place rather than the other. To get sensible behavior, and properly deliver your sandwich, the drone must evaluate actions based on what utility function it will have in the future, after the action happens.
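
To make the drone example concrete, here is a minimal Python sketch. The numbers are my own assumptions rather than anything from Dewey 2011: checking your location costs 0.1, a delivery to the right place is worth 1, and the names `futures`, `utility_distribution`, and `expected_utility` are made up for illustration. It compares scoring actions with the current 50/50 mix of utility functions against scoring each future with the utility distribution that would hold in that future.

```python
COST_OF_CHECKING = 0.1  # assumed "small cost" of learning your location
PLACES = ("home", "work")
DESTINATION = {"go_home": "home", "go_work": "work"}
ACTIONS = ("go_home", "go_work", "check_then_go")


def utility(target, destination):
    """The utility function for `target`: 1 if the sandwich arrives there."""
    return 1.0 if destination == target else 0.0


def futures(action):
    """Yield (probability, observation, destination, cost) for each future."""
    for location in PLACES:  # the drone starts out ignorant: 50/50
        if action == "check_then_go":
            # The drone observes where you are and flies there.
            yield 0.5, location, location, COST_OF_CHECKING
        else:
            # The drone flies blind and never observes your location.
            yield 0.5, None, DESTINATION[action], 0.0


def utility_distribution(observation):
    """Distribution over utility functions given what has been observed."""
    if observation is None:
        return {place: 0.5 for place in PLACES}
    return {place: float(place == observation) for place in PLACES}


def expected_utility(action, use_future_distribution):
    total = 0.0
    for probability, observation, destination, cost in futures(action):
        # The "current utility function" evaluation ignores what will be learned.
        seen = observation if use_future_distribution else None
        weights = utility_distribution(seen)
        value = sum(w * utility(t, destination) for t, w in weights.items())
        total += probability * (value - cost)
    return total


def best_action(use_future_distribution):
    return max(ACTIONS, key=lambda a: expected_utility(a, use_future_distribution))
```

Under these assumptions, evaluating with the current distribution scores "check_then_go" at 0.4 against 0.5 for flying blind, so checking looks like a pure cost. Evaluating with the future distribution scores it at 0.9, and the drone checks first.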

If you're familiar with how wireheading or quantum suicide look in terms of decision theory, this method of deciding based on future utility functions might seem risky. Fortunately, value learning doesn't permit wireheading in the traditional sense, because the updates to the utility function are an abstract process, not a physical one. The agent's probability distribution over utility functions, which is conditional on interaction histories, defines which actions and observations are allowed to change the utility function during the process of predicting expected utility.

Dewey also mentions that so long as the probability distribution over utility functions is well-behaved, you cannot deliberately take action to raise the probability of one of the utility functions being true. But I think this is only useful to safety when we understand and trust the overarching utility function that gets evaluated at the future time horizon. If instead we start at the present, and specify a starting utility function and rules for updating it based on observations, this complex system can evolve in surprising directions, including some wireheading-esque behavior.

 

The formalism of Dewey 2011 is, at bottom, extremely simple. I'm going to be a bad pedagogue here: I think this might only make sense if you go look at equations 2 and 3 in the paper, and figure out what all the terms do, and see how similar they are. The cheap summary is that if your utility is a function of the interaction history, trying to change utility functions based on interaction history just gives you back a utility function. If we try to think about what sort of process to use to change an agent's utility function, this formalism provides only one tool: look out to some future time horizon, and define an effective utility function in terms of what utility functions are possible at that future time horizon. This is different from the approximations or local utility functions we would like in practice.

If we take this scheme and try to approximate it, for example by only looking N steps into the future, we run into problems; the agent will want to self-modify so that next timestep it only looks ahead N-1 steps, and then N-2 steps, and so on. Or more generally, many simple approximation schemes are "sticky" - from inside the approximation, an approximation that changes over time looks like undesirable value drift.

Common sense says this sort of self-sabotage should be eliminable. One should be able to really care about the underlying utility function, not just its approximation. However, this problem tends to crop up, for example, whenever the part of the future you look at does not depend on which action you are considering; modifying yourself to keep looking at the same part of the future unsurprisingly improves the results you get in that part of the future. If we want to build a paperclip maximizer, it shouldn't be necessary to figure out every single way to self-modify and penalize them appropriately.

We might evade this particular problem using some other method of approximation that does something more like reasoning about actions than reasoning about futures. The reasoning doesn't have to be logically impeccable - we might imagine an agent that identifies a small number of salient consequences of each action, and chooses based on those. But it seems difficult to show how such an agent would have good properties. This is something I'm definitely interested in.

 

[Image: Handwritten 9]

One way to try to make things concrete is to pick a local utility function and specify rules for changing it. For example, suppose we wanted an AI to flag all the 9s in the MNIST dataset. We define a single-time-step utility function by a neural network that takes in the image and the decision of whether to flag or not, and returns a number between -1 and 1. This neural network is deterministically trained for each time step on all previous examples, trying to assign 1 to correct flaggings and -1 to mistakes. Remember, this neural net is just a local utility function - we can make a variety of AI designs involving it. The goal of this exercise is to design an AI that seems liable to make good decisions in order to flag lots of 9s.

The simplest example is the greedy agent - it just does whatever has a high score right now. This is pretty straightforward, and doesn't wirehead (unless the scoring function somehow encodes wireheading), but it doesn't actually do any planning - 100% of the smarts have to be in the local evaluation, which is really difficult to make work well. This approach seems unlikely to extend well to messy environments.

Since Go-playing AI is topical right now, I shall digress. Successful Go programs can't get by with only smart evaluations of the current state of the board; they need to look ahead to future states. But they also can't look all the way until the ultimate time horizon, so they only look a moderate way into the future, and evaluate that future state of the board using a complicated method that tries to capture things important to planning. In sufficiently clever and self-aware agents, this approximation would cause self-sabotage to pop up. Even if the Go-playing AI couldn't modify itself to only care about the current way it computes values of actions, it might make suboptimal moves that limit its future options, because its future self will compute values of actions the 'wrong' way.

If we wanted to flag 9s using a Dewian value learner, we might score actions according to how good they will be according to the projected utility function at some future time step. If this is done straightforwardly, there's a wireheading risk - the changes to its utility function are supplied by humans who might be influenced by actions. I find it useful to apply a sort of "magic button" test - if the AI had a magic button that could rewrite human brains, would pressing that button have positive expected utility for it? If yes, then this design has problems, even though in our current thought experiment it's just flagging pictures.

To eliminate wireheading, the value learner can use a model of the future inputs and outputs and the probability of different value updates given various inputs and outputs, which doesn't model ways that actions could influence the utility updates. This model doesn't have to be right, it just has to exist. On one hand, this seems like a sort of weird doublethink, to judge based on a counterfactual where your actions don't have impacts you could otherwise expect. On the other hand, it also bears some resemblance to how we actually reason about moral information. Regardless, this agent will now not wirehead, and will want to get good results by learning about the world, if only in the very narrow sense of wanting to play unscored rounds that update its value function. If its value function and value updating made better use of unlabeled data, it would also want to learn about the world in the broader sense.
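
Here is a toy Python version of the "magic button" test, to show the difference between the two designs. Everything in it is an assumption made for illustration: the two candidate utility functions, the payoffs, and the probabilities are invented, and `prefers_button` is a hypothetical helper. The straightforward learner uses an update model in which pressing the button changes which utility function it ends up with; the second learner uses a model in which actions have no influence on the update.

```python
ACTIONS = ("work_honestly", "press_button")

# Payoff of each action under each candidate utility function (made-up numbers).
PAYOFF = {
    "U_task": {"work_honestly": 0.8, "press_button": 0.0},
    "U_rewritten": {"work_honestly": 0.2, "press_button": 1.0},
}


def causal_update_model(action):
    """P(utility function | action) when the button really rewrites the humans."""
    p_rewritten = 0.99 if action == "press_button" else 0.01
    return {"U_task": 1.0 - p_rewritten, "U_rewritten": p_rewritten}


def action_blind_update_model(action):
    """Same update probabilities, but with the action's influence left out."""
    return {"U_task": 0.99, "U_rewritten": 0.01}


def expected_utility(action, update_model):
    weights = update_model(action)
    return sum(weights[u] * PAYOFF[u][action] for u in weights)


def prefers_button(update_model):
    """The design fails the magic button test if this returns True."""
    press = expected_utility("press_button", update_model)
    work = expected_utility("work_honestly", update_model)
    return press > work
```

With these numbers the straightforward learner scores the button at 0.99 against 0.014 for honest work, so it fails the test. The action-blind learner scores the button at 0.01 against 0.794, so it leaves the button alone, even though its model of the update is knowingly wrong about what the button does.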

 

Overall I am somewhat frustrated, because value learners have these nice properties, but are computationally unrealistic and do not play well with approximation. One can try to get the nice properties elsewhere, such as relying on an action-suggester to not suggest wireheading, but it would be nice to be able to talk about this as an approximation to something fancier.

Communicating concepts in value learning

3 Manfred 14 December 2015 03:06AM

Epistemic status: Trying to air out some thoughts for feedback, we'll see how successfully. May require some machine learning to make sense, and may require my level of ignorance to seem interesting.

 

Many current proposals for value learning are garden-variety regression (or its close cousin, classification). The agent doing the learning starts out with some model for what human values look like (a utility function over states of the world, or a reward function in a Markov decision process, or an expected utility function over possible actions), and receives training data that tells it the right thing to do in a lot of different situations. And so the agent finds the parameters of the model that minimize some loss function with the data, and Learns Human Values.

All these models of "the right thing to do" I mentioned are called parametric models, because they have some finite template that they update based on the data. Non-parametric models, on the other hand, have to keep a record of the data they've seen - prediction with a non-parametric model often looks like taking some weighted average of nearby known examples (though not always), while a parametric model would (often) fit some curve to the data and predict using that. But we'll get back to this later.
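
As a small illustration of the distinction, here is a sketch in plain Python. The data and the bandwidth are arbitrary choices of mine. The parametric model is a least-squares line, which is summarized by two numbers once fitted; the non-parametric model is a Gaussian-kernel weighted average, which has to keep every example around to make a prediction.

```python
import math


def fit_line(xs, ys):
    """Parametric: compress the data into (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def line_predict(params, query):
    slope, intercept = params
    return slope * query + intercept


def kernel_predict(xs, ys, query, bandwidth=0.5):
    """Non-parametric: weighted average of the stored examples near `query`."""
    weights = [math.exp(-0.5 * ((x - query) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]  # y = x^2, which a line cannot capture

params = fit_line(xs, ys)                  # the data can now be thrown away
line_at_2 = line_predict(params, 2.0)
kernel_at_2 = kernel_predict(xs, ys, 2.0)  # needs xs and ys at prediction time
```

On this data the fitted line has slope 4 and intercept -2, so it predicts 6 at x = 2, while the kernel average stays close to the nearby stored value of 4.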

An obvious problem with current proposals is that it's very resource-intensive to communicate a category or concept to the agent. An AI might be able to automatically learn a lot about the world, but if we want to define its preferences, we have to somehow pick out the concept of "good stuff" within the representation of the world learned by the AI. Current proposals for this look like supervised learning, where huge amounts of labeled data are needed to specify "good stuff," and for many proposals I'm concerned that we'll actually end up specifying "stuff that humans can be convinced is good," which is not at all the same. Humans are much better learners than these supervised learning systems - they learn from fewer examples, and have a better grasp of the meaning and structure behind examples. This hints that there are some big improvements to be made in value learning.

This comparison to humans also leads to my vaguer concerns. It seems like the labeled examples are too crucial, and the unlabeled data not crucial enough. We want a value learner to understand concepts based on just a few examples so long as it has unlabeled data to fill in the gaps, and be able to learn more about morality from observation as a core competency, not as a pale shadow of its learning from labeled data. It seems like fine-tuning the model for the labeled data with stochastic gradient descent is missing something important.

To digress slightly, there are additional problems (e.g. corrigibility) once you build an agent that has an output channel instead of merely sponging up information, and these problems are harder if we want value learning from observation. If we want a value learning agent that could learn a simplified version of human morality, and then use that to learn the full version, we might need something like the Bayesian guarantee of Dewey 2011, or a functional analogue thereof.

 

One inspiration for alternative learning schemes might be clustering. As a toy example, imagine finding literal clusters in thing-space by k-means clustering. If you want to specify a cluster, you can do something like pick a small sample of examples and force them to be in the same cluster, and allow the number of clusters you try to find in the data to vary so that the statistics of the mandatory cluster are not very different from any other's. The huge problem here is that the idea of "thing-space" elides the difficulty of learning a representation of the world (or equivalently, elides how really, really complicated the cluster boundaries are in terms of observations).
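
A minimal sketch of the toy example, assuming we already have a "thing-space" of 2-D points (which is exactly the part that is hard in reality). The function `seeded_kmeans` and the data are made up for illustration. It is ordinary k-means except that a few hand-picked examples are pinned to cluster 0, and it does not attempt the part about varying the number of clusters.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def seeded_kmeans(points, seed_indices, k, iterations=10):
    """k-means in which the points at `seed_indices` are forced into cluster 0."""
    seeds = set(seed_indices)
    # Cluster 0 starts at the centre of the examples we were given.
    centroids = [mean([points[i] for i in seed_indices])]
    # Remaining centroids: repeatedly take the point farthest from all centroids.
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        for i, p in enumerate(points):
            if i in seeds:
                labels[i] = 0  # the mandatory cluster
            else:
                labels[i] = min(
                    range(k), key=lambda j: squared_distance(p, centroids[j])
                )
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
labels, centroids = seeded_kmeans(points, seed_indices=[0, 3], k=2)
```

Here two examples, (0, 0) and (1, 1), are enough to pick out the whole lower-left cluster, because the unlabeled points fill in the rest. That only works because the clusters are already clean in this representation.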

Because learning how to understand the world already requires you to be really good at learning things, it's not obvious to me what identifying and using clusters in the data will entail. One might imagine that if we modeled the world using a big pile of autoencoders, this pile would already contain predictors for many concepts we might want to specify, but that if we use examples to try and communicate a concept that was not already learned, the pile might not even contain the features that make our concept easy to specify. Further speculation in this vein is fun, but is likely pointless at my current level of understanding. So even though learning well from unlabeled data is an important desideratum, I'm including this digression on clustering because I think it's interesting, not because I've shed much light.

 

Okay, returning to the parametric/non-parametric thing. The problem of being bad at learning from unlabeled data shows up in diverse proposals like inverse reinforcement learning and Hibbard 2012's two-part example. And in these cases it's not due to the learning algorithm per se, but for the simple reason that at some point the representation of the world is treated as fixed - the value learner is assumed to understand the world, and then proceeds to learn or be told human values in terms of that understanding. If you can no longer update your understanding of the world, naturally this causes problems with learning from observation.

We should instead design agents that are able to keep learning about the world. And this brings us back to the idea of communicating concepts via examples. The most reasonable way to update learned concepts in light of new information seems to be to just store the examples and re-apply them to the new understanding. This would be a non-parametric model of learned concepts.
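
A rough sketch of what "store the examples and re-apply them" could look like, using a nearest-neighbor rule. The class `ConceptFromExamples`, the two representations, and the food example are all hypothetical. The point is only that the concept is kept as raw labeled observations, so swapping in a new representation changes how the concept generalizes without anyone relabeling anything.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ConceptFromExamples:
    """A concept stored as raw labeled observations, not as fitted parameters."""

    def __init__(self, labeled_examples):
        self.labeled_examples = list(labeled_examples)  # (observation, label)

    def classify(self, observation, represent):
        """Label of the nearest stored example under the given representation."""
        query = represent(observation)
        nearest = min(
            self.labeled_examples,
            key=lambda example: squared_distance(represent(example[0]), query),
        )
        return nearest[1]


# Observations are (size, sweetness); the labels say whether the food is good.
good_food = ConceptFromExamples([((1, 9), True), ((9, 1), False)])


def old_representation(observation):
    return (observation[0],)  # the agent's early world-model only sees size


def new_representation(observation):
    return (observation[1],)  # a later world-model has learned about sweetness


query = (8, 8)  # large and sweet
before = good_food.classify(query, old_representation)
after = good_food.classify(query, new_representation)
```

Under the old representation the query looks like the large, bad example; under the new one it looks like the sweet, good example. The stored examples never changed.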

What concepts to learn and how to use them to make decisions is not at all known to me, but as a placeholder we might consider the task of learning to identify "good actions," given proposed actions and some input about the world (similar to the "Learning from examples" section of Christiano's Approval Directed Agents).

Moral AI: Options

9 Manfred 11 July 2015 09:46PM

Epistemic status: One part quotes (informative, accurate), one part speculation (not so accurate).

One avenue towards AI safety is the construction of "moral AI" that is good at solving the problem of human preferences and values. Five FLI grants have recently been funded that pursue different lines of research on this problem.

The projects, in alphabetical order:

Most contemporary AI systems base their decisions solely on consequences, whereas humans also consider other morally relevant factors, including rights (such as privacy), roles (such as in families), past actions (such as promises), motives and intentions, and so on. Our goal is to build these additional morally relevant features into an AI system. We will identify morally relevant features by reviewing theories in moral philosophy, conducting surveys in moral psychology, and using machine learning to locate factors that affect human moral judgments. We will use and extend game theory and social choice theory to determine how to make these features more precise, how to weigh conflicting features against each other, and how to build these features into an AI system. We hope that eventually this work will lead to highly advanced AI systems that are capable of making moral judgments and acting on them.

Techniques: Top-down design, game theory, moral philosophy

Previous work in economics and AI has developed mathematical models of preferences, along with algorithms for inferring preferences from observed actions. [Citation of inverse reinforcement learning] We would like to use such algorithms to enable AI systems to learn human preferences from observed actions. However, these algorithms typically assume that agents take actions that maximize expected utility given their preferences. This assumption of optimality is false for humans in real-world domains. Optimal sequential planning is intractable in complex environments and humans perform very rough approximations. Humans often don't know the causal structure of their environment (in contrast to MDP models). Humans are also subject to dynamic inconsistencies, as observed in procrastination, addiction and in impulsive behavior. Our project seeks to develop algorithms that learn human preferences from data despite the suboptimality of humans and the behavioral biases that influence human choice. We will test our algorithms on real-world data and compare their inferences to people's own judgments about their preferences. We will also investigate the theoretical question of whether this approach could enable an AI to learn the entirety of human values.

Techniques: Trying to find something better than inverse reinforcement learning, supervised learning from preference judgments

The future will see autonomous agents acting in the same environment as humans, in areas as diverse as driving, assistive technology, and health care. In this scenario, collective decision making will be the norm. We will study the embedding of safety constraints, moral values, and ethical principles in agents, within the context of hybrid human/agents collective decision making. We will do that by adapting current logic-based modelling and reasoning frameworks, such as soft constraints, CP-nets, and constraint-based scheduling under uncertainty. For ethical principles, we will use constraints specifying the basic ethical "laws", plus sophisticated prioritised and possibly context-dependent constraints over possible actions, equipped with a conflict resolution engine. To avoid reckless behavior in the face of uncertainty, we will bound the risk of violating these ethical laws. We will also replace preference aggregation with an appropriately developed constraint/value/ethics/preference fusion, an operation designed to ensure that agents' preferences are consistent with the system's safety constraints, the agents' moral values, and the ethical principles of both individual agents and the collective decision making system. We will also develop approaches to learn ethical principles for artificial intelligent agents, as well as predict possible ethical violations.

Techniques: Top-down design, obeying ethical principles/laws, learning ethical principles

The objectives of the proposed research are (1) to create a mathematical framework in which fundamental questions of value alignment can be investigated; (2) to develop and experiment with methods for aligning the values of a machine (whether explicitly or implicitly represented) with those of humans; (3) to understand the relationships among the degree of value alignment, the decision-making capability of the machine, and the potential loss to the human; and (4) to understand in particular the implications of the computational limitations of humans and machines for value alignment. The core of our technical approach will be a cooperative, game-theoretic extension of inverse reinforcement learning, allowing for the different action spaces of humans and machines and the varying motivations of humans; the concepts of rational metareasoning and bounded optimality will inform our investigation of the effects of computational limitations.

Techniques: Trying to find something better than inverse reinforcement learning (differently this time), creating a mathematical framework, whatever rational metareasoning is

Autonomous AI systems will need to understand human values in order to respect them. This requires having similar concepts as humans do. We will research whether AI systems can be made to learn their concepts in the same way as humans learn theirs. Both human concepts and the representations of deep learning models seem to involve a hierarchical structure, among other similarities. For this reason, we will attempt to apply existing deep learning methodologies for learning what we call moral concepts, concepts through which moral values are defined. In addition, we will investigate the extent to which reinforcement learning affects the development of our concepts and values.

Techniques: Trying to identify learned moral concepts, unsupervised learning 

 

The elephant in the room is that making judgments that always respect human preferences is nearly FAI-complete. Application of human ethics is dependent on human preferences in general, which are dependent on a model of the world and how actions impact it. Calling an action ethical can also depend on the space of possible actions, requiring a good judgment-maker to be capable of searching for good actions. Any "moral AI" we build with our current understanding is going to have to be limited and/or unsatisfactory.

Limitations might be things like judging which of two actions is "more correct" rather than finding correct actions, only taking input in terms of one paragraph-worth of words, or only producing good outputs for situations similar to some combination of trained situations.

Two of the proposals are centered on top-down construction of a system for making ethical judgments. Designing a system by hand, it's nigh-impossible to capture the subtleties of human values. Relatedly, it seems weak at generalization to novel situations, unless the specific sort of generalization has been foreseen and covered. The good points of a top-down approach are that it can capture things that are important, but are only a small part of the description, or are not easily identified by statistical properties. A top-down model of ethics might be used as a fail-safe, sometimes noticing when something undesirable is happening, or as a starting point for a richer learned model of human preferences.

Other proposals are inspired by inverse reinforcement learning. Inverse reinforcement learning seems like the sort of thing we want - it observes actions and infers preferences - but it's very limited. The problem of having to know a very good model of the world in order to be good at human preferences rears its head here. There are also likely unforeseen technical problems in ensuring that the thing it learns is actually human preferences (rather than human foibles, or irrelevant patterns) - though this is, in part, why this research should be carried out now.
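
To illustrate the "observes actions and infers preferences" step in the simplest possible case, here is a sketch that assumes a noisily rational (Boltzmann, i.e. logistic) chooser picking between two options. This is one common way to relax the optimality assumption, not necessarily what any of the funded projects will do. The function names, the observed choices, and the rationality value are all my own assumptions.

```python
import math


def choice_probability(reward_gap, rationality):
    """P(choose A) when A is worth `reward_gap` more than B to the human."""
    return 1.0 / (1.0 + math.exp(-rationality * reward_gap))


def log_likelihood(reward_gap, choices, rationality):
    p_a = choice_probability(reward_gap, rationality)
    return sum(math.log(p_a) if c == "A" else math.log(1.0 - p_a) for c in choices)


def infer_reward_gap(choices, rationality, grid):
    """Maximum-likelihood reward gap over a grid of candidate values."""
    return max(grid, key=lambda gap: log_likelihood(gap, choices, rationality))


# The human picked A seven times and B three times.
choices = ["A"] * 7 + ["B"] * 3
grid = [i / 100 for i in range(-300, 301)]
estimate = infer_reward_gap(choices, rationality=1.0, grid=grid)
```

A model that assumes perfect optimality cannot explain the three B choices at all. The noisy model explains them, but the estimate lands near log(7/3) only because the rationality parameter was fixed at 1 by hand: the same choices fit equally well with a larger preference and a sloppier human, which is a small version of the "preferences versus foibles" problem.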

Some proposals want to take advantage of learning using neural networks, trained on people's actions or judgments. This sort of approach is very good at discovering patterns, but not so good at treating patterns as a consequence of underlying structure. Such a learner might be useful as a heuristic, or as a way to fill in a more complicated, specialized architecture. For this approach, like the others, it seems important to make the most progress toward learning human values in a way that doesn't require a very good model of the world.