The case for value learning
This post is mainly fumbling around trying to define a reasonable research direction for contributing to FAI research. I've found that laying out what success looks like in the greatest possible detail is a personal motivational necessity. Criticism is strongly encouraged.
The power and intelligence of machines have been gradually and consistently increasing over time, and it seems likely that at some point machine intelligence will surpass that of humans. Before that point, it is important that humanity manages to direct these powerful optimizers towards a target that humans find desirable.
This is difficult because humans as a general rule have a fairly fuzzy conception of their own values, and it seems unlikely that the millennia of argument surrounding what precisely constitutes eudaimonia are going to be satisfactorily wrapped up before the machines get smart. The most obvious solution is to try to leverage some of the novel intelligence of the machines to help resolve the issue before it is too late.
Lots of people regard using a machine to help you understand human values as a chicken and egg problem. They think that a machine capable of helping us understand what humans value must also necessarily be smart enough to do AI programming, manipulate humans, and generally take over the world. I am not sure that I fully understand why people believe this.
Part of it seems to be inherent in the idea of AGI, or artificial general intelligence. There seems to be a belief that once an AI crosses a certain threshold of smarts, it will be capable of understanding literally everything. I have even heard people describe certain problems as "AI-complete", making an explicit comparison to ideas like Turing-completeness. If a Turing machine is a universal computer, why wouldn't there also be a universal intelligence?
To address the question of universality, we need to make a distinction between intelligence and problem solving ability. Problem solving ability is typically described as a function of both intelligence and resources, and just throwing resources at a problem seems to be capable of compensating for a lot of cleverness. But if problem-solving ability is tied to resources, then intelligent agents are in some respects very different from Turing machines, since Turing machines are all explicitly operating with an infinite amount of tape. Many of the existential risk scenarios revolve around the idea of the intelligence explosion, when an AI starts to do things that increase the intelligence of the AI so quickly that these resource restrictions become irrelevant. This is conceptually clean, in the same way that Turing machines are, but navigating these hard take-off scenarios well implies getting things absolutely right the first time, which seems like a less than ideal project requirement.
If an AI that knows a lot about AI results in an intelligence explosion, but we also want an AI that's smart enough to understand human values, is it possible to create an AI that can understand human values, but not AI programming? In principle it seems like this should be possible. Resources useful for understanding human values don't necessarily translate into resources useful for understanding AI programming. The history of AI development is full of tasks that were supposed to be solvable only by a machine smart enough to possess general intelligence, where significant progress was made in understanding and pre-digesting the task, allowing problems in the domain to be solved by much less intelligent AIs.
If this is possible, then the best route forward is focusing on value learning. The path to victory is working on building limited AI systems that are capable of learning and understanding human values, and then disseminating that information. This effectively softens the AI take-off curve in the most useful possible way, and allows us to practice building AI with human values before handing them too much power. Even if AI research is easy compared to the complexity of human values, a specialist AI might find thinking about human values easier than reprogramming itself, in the same way that humans find complicated visual/verbal tasks much easier than much simpler tasks like arithmetic. The human intelligence learning algorithm is trained on visual object recognition and verbal memory tasks, and it uses those tools to perform addition. A similarly specialized AI might be capable of rapidly understanding human values, but find AI programming as difficult as humans find determining whether 1007 is prime. As an additional incentive, value learning has an enormous potential for improving human rationality and the effectiveness of human institutions even without the creation of a superintelligence. A system that helped people better understand the mapping between values and actions would be a potent weapon in the struggle with Moloch.
Building a relatively unintelligent AI and giving it lots of resources relevant to human values, to help it solve the human values problem, seems like a reasonable course of action, if it's possible. There are some difficulties with this approach. One is that after a certain point, no amount of additional resources compensates for a lack of intelligence. A simple reflex agent like a thermostat doesn't learn from data, and throwing resources at it won't improve its performance. You can make up for intelligence with data, but only to an extent. An AI capable of learning human values is going to be capable of learning lots of other things. It's going to need to build models of the world, and it's going to have to have internal feedback mechanisms to correct and refine those models.
If the plan is to create an AI and primarily feed it data on how to understand human values, and not feed it data on how to do AI programming and self-modify, that plan is complicated by the fact that inasmuch as the AI is capable of self-observation, it has access to sophisticated AI programming. I'm not clear on how much this access really means. My own introspection hasn't allowed me anything like hardware level access to my brain. While it seems possible to create an AI that can refactor its own code or create successors, it isn't obvious that AIs created for other purposes will have this ability by accident.
This discussion focuses on intelligence amplification as the example path to superintelligence, but other paths do exist. An AI with a sophisticated enough world model, even if somehow prevented from understanding AI, could still potentially increase its own power to threatening levels. Value learning is only the optimal way forward if human values are emergent, if they can be understood without a molecular level model of humans and the human environment. If the only way to understand human values is with physics, then human values isn't a meaningful category of knowledge with its own structure, and there is no way to create a machine that is capable of understanding human values, but not capable of taking over the world.
In the fairy tale version of this story, a research community focused on value learning manages to use specialized learning software to make the human value program portable, instead of only running on human hardware. Having a large number of humans involved in the process helps us avoid lots of potential pitfalls, especially the research overfitting to the values of the researchers via the typical mind fallacy. Partially automating introspection helps raise the sanity waterline. Humans practice coding the human value program, in whole or in part, into different automated systems. Once we're comfortable that our self-driving cars have a good grasp on the trolley problem, we use that experience to safely pursue higher risk research on recursive systems likely to start an intelligence explosion. FAI gets created and everyone lives happily ever after.
Whether value learning is worth focusing on seems to depend on the likelihood of the following claims. Please share your probability estimates (and explanations) with me because I need data points that originated outside of my own head.
- There is regular structure in human values that can be learned without requiring detailed knowledge of physics, anatomy, or AI programming. [poll:probability]
- Human values are so fragile that it would require a superintelligence to capture them with anything close to adequate fidelity. [poll:probability]
- Humans are capable of pre-digesting parts of the human values problem domain. [poll:probability]
- Successful techniques for value discovery of non-humans (e.g. artificial agents, non-human animals, human institutions) would meaningfully translate into tools for learning human values. [poll:probability]
- Value learning isn't adequately being researched by commercial interests who want to use it to sell you things. [poll:probability]
- Practice teaching non-superintelligent machines to respect human values will improve our ability to specify a Friendly utility function for any potential superintelligence. [poll:probability]
- Something other than AI will cause human extinction sometime in the next 100 years. [poll:probability]
- All other things being equal, an additional researcher working on value learning is more valuable than one working on corrigibility, Vingean reflection, or some other portion of the FAI problem. [poll:probability]
[Playground] In this sub-thread you may try polls without anybody getting angry with you (except possibly some DBA if you really overdo it).
Example:
[poll]{Option 1}{Option 2}{Option n}
(note: you can see this verbatim because I escaped the "[" with "\" in the comment (another nice trick))
Don't be surprised if you edit your poll later and find this replaced by something like
[poll\:12345]
this is to ensure that your poll stays the same after you edited the other parts of the comment.
Hi,
I have two areas I'd like to study: deep learning, and anything on the MIRI research guide. Lots of material is available on both topics, but I'd like to pair up with someone to build a good learning strategy (for lack of a better expression).
I have some knowledge of algebra, probability theory, logic, game theory, machine learning (Master's Degree in Computer Science).
Regarding deep learning, I have a small collection of links, including the Udacity course, and I'm positive learning materials abound now that the field is really popular.
Regarding MIRI's research guide, well, the guide itself provides a lot of links and pointers.
My timezone is CET (UTC+1).
I'm working through the Udacity deep learning course right now, and I'm always trying to learn more things on the MIRI research guide. I'm in a fairly different timezone, but my schedule is pretty flexible. Maybe we can work something out?
It's a bit hard for RNNs to learn, but they can end up much better than humans.
After some engineering effort. Researchers didn't just throw a random RNN at the problem in 1990 and find that it worked as well as transistors at arithmetic... Plus, if you want to pick extremes (the best RNNs now), are the best RNNs better at adding or multiplying extremely large numbers than human savants?
This raises a really interesting point that I wanted to include in the top level post, but couldn't find a place for. It seems plausible/likely that human savants are implementing arithmetic using different, and much more efficient algorithms than those used by neurotypical humans. This was actually one of the examples I considered in support of the argument that neurons can't be the underlying reason humans struggle so much with math.
Humans are not bad at math. We are excellent at math. We can calculate the best trajectory to throw a ball into a hoop, the exact way to move our jiggly appendages to achieve it, accounting for a million little details, all in a blink of an eye. Few if any modern computers can do as well.
The problem is one of definition: we call "math" the part of math that is HARD FOR HUMANS. Because why bother giving a special name to something that does not require special learning techniques?
This is a really broad definition of math. There is regular structure in kinetic tasks like throwing a ball through a hoop. There's also regular structure in tasks like natural language processing. One way to describe that regular structure is through a mathematical representation of it, but I don't know that I consider basketball ability to be reliant on mathematical ability. Would you describe all forms of pattern matching as mathematical in nature? Is the fact that you can read and understand this sentence also evidence that you are good at math?
Could a digital intelligence be bad at math?
One of the enduring traits that I see in most characterizations of artificial intelligences is the idea that an AI would have all of the skills that computers have. It's often taken for granted that a general artificial intelligence would be able to perfectly recall information, instantly multiply and divide 5 digit numbers, and handily defeat Garry Kasparov at chess. For whatever reason, the capabilities of a digital intelligence are always seen as encompassing the entire current skill set of digital machines.
But this belief is profoundly strange. Consider how much humans struggle to learn arithmetic. Basic arithmetic is really simple. You can build a bare bones electronic calculator/arithmetic logic unit on a breadboard in a weekend. Yet humans commonly spend years learning how to perform those same simple operations. And the mental arithmetic equipment humans assemble at the end of this is still relatively terrible: slow, labor intensive, and prone to frequent mistakes.
It is not totally clear why humans are this bad at math. It is almost certainly unrelated to brains computing using neurons instead of transistors. Based on personal experience and a cursory literature review, counting seems to rely primarily on identifying repeated structures in a linked list, and seems to be stored as verbal memory. When we first learn the most basic arithmetic we rely on visual pattern matching, and as we do more math, basic operations get stored in a look-up table in verbal memory. This is an absolutely bonkers way to implement arithmetic.
While humans may be generally intelligent, that general intelligence seems to be accomplished using some fairly inelegant kludges. We seem to have a preferred framework for understanding built on our visual and verbal systems, and we tend to shoehorn everything else into that framework. But there's nothing uniquely human about that problem. It seems to be characteristic of learning algorithms in general, and so if our artificial learner started off by learning skills unrelated to math, it might learn arithmetic via a similarly convoluted process. While current digital machines do arithmetic via a very efficient process, a digital mind that has to learn those patterns may arrive at a solution as slow and convoluted as the one humans rely on.
How do you get the top portion of the second payoff matrix from the first? Intuitively, it should be by replacing Agent A's payoff with the sum of the agents' payoffs, but the numbers don't match.
Most people are altruists but only to their in-group, and most people have very narrow in-groups. What you mean by an altruist is probably someone who is both altruistic and has a very inclusive in-group. But as far as I can tell, there is a hard trade-off between belonging to a close-knit, small in-group and identifying with a large, diverse but weak in-group. The time you spend helping strangers is time taken away from potentially helping friends and family.
It's the average ((4 + (-2))/2 = 1), rather than the sum, since the altruistic agent is interested in maximizing the average utility.
The tribal limitations on altruism that you allude to are definitely one of the tendencies that much of our cultural advice on altruism targets. In many ways the expanding circle of trust, from individuals, to families, to tribes, to cities, to nation states, etc. has been one of the fundamental enablers of human civilization.
I'm less sure about the hard trade-off that you describe. I have a lot of experience being a member of small groups that have altruism towards non-group members as an explicit goal. In that scenario, helping strangers also helps in-group members achieve their goals. I don't think large-group altruism precludes you from belonging to small in-groups, since very few in-groups demand any sort of absolute loyalty. While full effort in-group altruism, including things like consciously developing new skills to better assist your other group members would absolutely represent a hard trade-off with altruism on a larger scale, people appear to be very capable of belonging to a large number of different in-groups.
This implies that the actual level of commitment required to be a part of most in-groups is rather low, and the socially normative level of altruism is even lower. Belonging to a close-knit in-group with a particularly needy member (e.g. a partially disabled parent, spouse, or child) may shift the calculus somewhat, but for most in-groups being a member in good standing has relatively undemanding requirements. Examining my own motivations, it seems that for many of the groups I participate in, most of the work I do fulfilling expectations and helping others within those groups is more directly driven by my desire for social validation than by any selfless perception of the intrinsic value of the other group members.
Conflicting advice on altruism
As far as I can tell, rather than having a single well-defined set of preferences or utility function, my actions more closely reflect the outcome of a set of competing internal drives. One of my internal drives is strongly oriented towards a utilitarian altruism. While the altruist internal drive doesn't dominate my day-to-day life, compared to the influence of more basic drives like the desires for food, fun, and social validation, I have traditionally been very willing to drop whatever I'm doing and help someone who asks for, or appears to need, help. This altruistic drive has an even more significant degree of influence on my long term planning, since my drives for food, fun, etc. are ambivalent between the many possible futures in which they can be well-satisfied.
I'm not totally sure to what extent strong internal drives are genetic or learned or controllable, but I've had a fairly strong impulse towards altruism for well over a decade. Unfortunately, even over fairly long time frames it isn't clear to me that I've been a particularly "effective" altruist. This discussion attempts to understand some of the beliefs and behaviors that contributed to my personal failure/success as an altruist, and may also be helpful to other people looking to engage in or encourage similar prosocial habits.
Game Theory Model
Imagine a perfect altruist competing in a Prisoner's Dilemma style game. The altruist in this model is by definition a pure utilitarian who wants to maximize the average utility, but is completely insensitive to the distribution of the utility.1 A trivial real world example similar to this would be something like picking up litter in a public place. If the options are Pick up (Cooperate) and Litter (Defect) then an altruist might choose to pick up litter even though they themselves don't capture enough of the value to justify the action. Even if you're skeptical that unselfish pure utilitarians exist, the payoff matrix and much of this analysis applies to a broader range of prosocial behaviors where it's difficult for a single actor to capture the value he or she generates.
The prisoner's dilemma payoff matrix for the game in which the altruist is competing looks something like this:
| Agent A \ Agent B | Cooperate | Defect |
|-------------------|-----------|--------|
| Cooperate         | 2, 2      | -2, 4  |
| Defect            | 4, -2     | -1, -1 |
Other examples with altered payoff ratios are possible, but this particular payoff matrix creates an interesting inversion of the typical strategy for the prisoner's dilemma. If we label the altruist Agent A (A for Altruist), then A's dominant strategy is Cooperate. Just as in the traditional prisoner's dilemma, A prefers if B also cooperates, but A will cooperate regardless of what B does. The iterated prisoner's dilemma is even more interesting. If A and B are allowed to communicate before and between rounds, A may threaten to employ a tit-for-tat-like strategy and to defect in the future against defectors, but this threat is somewhat hollow, since regardless of threats, A's dominant strategy in any given round is still to cooperate.
A population of naive altruists is somewhat unstable for the same reason that a population of naive cooperators is unstable: it's vulnerable to infiltration by defectors. The obvious meta-strategies for individual altruists and altruist populations are either to become proficient at identifying defectors and ignoring/avoiding them, or to successfully threaten defectors into cooperating. Both the identify/avoid and the threaten/punish tactics have costs associated with them, and which approach is better depends on how much players are expected to change over the course of a series of games. Incorrigible defectors cannot be threatened/punished and must be avoided, while more malleable defectors may be threatened into cooperation.
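To make the dominant-strategy claim concrete, here's a minimal Python sketch (the dictionary encoding is mine, not part of the model as stated) that derives the altruist's effective payoffs by averaging both players' payoffs from the first matrix, then checks that Cooperate dominates Defect for her:

```python
# Selfish payoff matrix: payoffs[(a_move, b_move)] = (A's payoff, B's payoff)
payoffs = {
    ("C", "C"): (2, 2),
    ("C", "D"): (-2, 4),
    ("D", "C"): (4, -2),
    ("D", "D"): (-1, -1),
}

# A pure altruist values the average of both players' payoffs.
altruist = {moves: (pa + pb) / 2 for moves, (pa, pb) in payoffs.items()}

# Cooperate strictly dominates Defect if it is better against every B move.
dominates = all(altruist[("C", b)] > altruist[("D", b)] for b in ("C", "D"))
print(altruist)   # C rows become 2.0 and 1.0; D rows become 1.0 and -1.0
print(dominates)  # True
```

Note that the averaged values (2, 1, 1, -1) are exactly A's entries in the second matrix below, which is where that matrix comes from.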
If we assume that agent B is selfish and we express the asymmetry in the agent values in terms of our payoff matrix, then the symmetric payoff matrix above is equivalent to the top portion of a new payoff matrix given by
| Agent A \ Agent B | Cooperate | Defect |
|-------------------|-----------|--------|
| Cooperate         | 2, 2      | 1, 4   |
| Defect            | 1, -2     | -1, -1 |
| Avoid             | 0, 0      | 0, 0   |
The only difference between the two matrices is that in the latter case we've given the altruist an Avoid option. There is no simple way to include the threaten option, since threatening relies on trying to convince Agent B that Agent A is either unreasonable or not an altruist, and including that sort of bluff in the formal model makes it difficult to create payoff matrices that are both simple and reasonable. However, we can still make a few improvements to our formal model before we're forced to abandon it and talk about the real world.
Adding Complexity
The relatively simple payoff matrices in the previous section can easily be made more realistic and more complicated. In the iterated version of the game, if the total number of games in which A can cooperate is limited, then for each game in which she cooperates, she incurs an opportunity cost equal to the difference between her received payoff and her ideal payoff. Under this construction an altruist who cooperates with a defector receives a negative utility as long as games with other cooperators are available.
| Agent A \ Agent B | Cooperate | Defect |
|-------------------|-----------|--------|
| Cooperate         | 2, 2      | -1, 4  |
| Defect            | -1, -2    | -3, -1 |
| Avoid             | 0, 0      | 0, 0   |
In this instance, A no longer has a dominant strategy. A should cooperate with B if she thinks that B will cooperate, but A should avoid B if she thinks that B will defect. A thus has a strong incentive to build a sophisticated model of B, which can be used either to convince B to cooperate or at the very least correctly predict B's defection. For a perfect altruist, more information and judgment of agent B leads to better average outcomes.
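A quick sketch of that best-response logic, using A's payoffs from the opportunity-cost matrix above (the encoding is illustrative, not canonical):

```python
# A's payoffs in the opportunity-cost matrix, keyed by A's option then B's move.
a_payoffs = {
    "Cooperate": {"C": 2, "D": -1},
    "Defect":    {"C": -1, "D": -3},
    "Avoid":     {"C": 0, "D": 0},
}

def best_response(b_move):
    """A's best option given a prediction of B's move."""
    return max(a_payoffs, key=lambda a: a_payoffs[a][b_move])

print(best_response("C"))  # Cooperate
print(best_response("D"))  # Avoid

def dominates(a1, a2):
    """True if option a1 strictly beats a2 against every B move."""
    return all(a_payoffs[a1][b] > a_payoffs[a2][b] for b in ("C", "D"))

# No single option dominates all others, so A must predict B to choose well.
has_dominant = any(all(dominates(a, other) for other in a_payoffs if other != a)
                   for a in a_payoffs)
print(has_dominant)  # False
```

This is just the formal version of the point in the text: A's best move flips depending on her model of B, which is what makes modeling B valuable.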
The popularity of gatekeeper organizations like GiveWell and Charity Navigator in altruist communities makes a lot of sense if those communities are aware of their vulnerability to defectors. Because charitable dollars are so fungible, giving money to a charity is an instance where opportunity costs play a significant role. While meta-charities offer some other advantages, a significant part of their appeal, especially for organizations like Charity Navigator, is helping people avoid "bad" charities.
Interestingly, with this addition, A's behavior may start to look less and less like pure altruism. Even if A is totally indifferent to the distribution of utility, if A can reliably identify some other altruists then she will preferentially cooperate with them and avoid games with unknown agents in which there is a risk of defection. The benefits of cooperation could then disproportionately accrue within the altruist in-group, even if none of the altruists intend that outcome.
An observer who had access only to the results of the games and not the underlying utility functions of the players would be unlikely to conclude that the clique of A-like agents that exhibited strong internal cooperation and avoided games with all other players had a purely altruistic utility function. Their actions pattern-match much more readily to something more selfish and more like typical human tribal behavior, suggesting either a self-serving or an "us versus them" utility function instead of one that has increasing the average payoff as its goal. If we include the threaten/punish option, the altruist population may look even less like a population of altruists.
That erroneous pattern match isn't a huge issue for the perfectly rational pure altruist in our game theory model. Unfortunately, human beings are often neither perfectly rational nor purely altruistic. A significant amount of research suggests that people's beliefs are strongly influenced by their actions, and by what they think those actions say about them. An actual human who started with the purely altruistic utility function of Agent A in this section, and who rationally cooperated with a set of other easily identified altruists, might very well alter his utility function to seem more consistent with his actions. The game theoretic model, in which the values of the agent are independent of the agent's choices, starts to break down.
While very few individuals are perfect altruists/pure utilitarians as defined here, a much larger fraction of the population nominally considers the altruist value system to be an ethical ideal. The ideal that people have approximately equal value may not always be reflected in how most people live, but many people espouse such a belief and even want to believe it. We see this idea under all sorts of labels: altruism, being a utilitarian, trying to "love your neighbor as yourself", believing in the spiritual unity of humankind, or even just an innate sense of fairness.
Someone who is trying to be an altruist may have altruism or a similar ethical injunction as one of many of their internal drives, and the drive for altruism may be relatively weak compared to their desires for personal companionship, increased social status, greater material wealth, etc. For this individual, the primary threat to the effectiveness of their prosocial behavior is not the possibility that they might cooperate with a defector; it is instead the possibility that their selfish drives might overwhelm their desire to act altruistically, and they themselves might not cooperate.
Received Wisdom on Altruism
Much of the cultural wisdom in my native culture that addresses how to be a good altruist is geared towards people who are trying to be altruists, rather than towards altruists who are trying to be effective. The best course of action in the two situations is often very different, but it took me a considerable amount of time to realize the distinction.
For people trying to be altruists, focusing on the opportunity costs of their altruism is exactly the wrong thing to do. Imagining all the other things they could buy with their money instead of giving it to a homeless person or donating it to the AMF makes it very unlikely they will give the money away. Judging the motivations of others often provides ample excuses for not helping someone. Seeking out similar cooperators can quickly turn into self-serving tribalism and indifference towards people unlike the tribe. Most people have really stringent criteria for helping others, and so, given the chance to help, most people don't.
The cultural advice I received on altruism tended to focus on avoiding these pitfalls. It stressed ideas like, "Do whatever good you can, wherever you are", and emphasized not to judge or condemn others, but to give second chances, to try and believe in the fundamental goodness of people, and to try to cooperate and value non-tribe members and even enemies.
When I was trying to be an altruist, I took much of this cultural how-to advice very seriously, and for much of my life I helped/cooperated with anyone who asked, regardless of whether the other person was likely to defect. Even when people literally robbed me, I would rationalize that whoever stole my bike must have really needed a bike, and so even my involuntary "cooperation" with the thief was probably a net positive from a utilitarian standpoint.
Effective Altruism
I don't think I've been particularly effective as an altruist because I haven't been judgmental enough, and because I've been too focused on doing whatever good I could where I was, instead of finding the places I could do the most good and moving myself to those places. I'm now trying to spend nearly as much energy identifying opportunities to do good as I spend actively trying to improve the world.
At the same time, I'm still profoundly wary of the instinct not to help, or of thinking, "This isn't my best opportunity to do good", because I know that it's very easy to get in the habit of not helping people. I'm trying to move away from my instinct to reactively help anyone who asks, towards something that looks more like proactive planning, but I'm not at all convinced that most other people should be trying to move in that same direction.
As with achieving any goal, success requires a balance between insufficient planning and analysis paralysis. I think for altruism in particular, this balance was and is difficult to strike in part because of the large potential for motivated selfish reasoning, but also because most of my (our?) cultural wisdom emphasizes convenient immediate action as the correct form of altruism. Long term altruistic planning is typically not much mentioned or discussed, possibly because most people just aren't that strongly oriented towards utilitarian values.
Conclusion
If helping others is something that you're committed enough to that a significant limitation on your effectiveness is that you often help the wrong people, then diverting energy into judging who you help and consciously considering opportunity costs is probably a good idea. If helping others is something you'd like to do, but you rarely find yourself actually doing, the opposite advice may be apropos.
1. In idealized formulations of game theory, "utility" is intended to describe not just physical or monetary gain, but to include effects like desire for fairness, moral beliefs, etc. Symmetric games are fairly unrealistic under that assumption, and such a definition of utility would preclude our altruist from many games altogether. Utility in this first example is defined only in terms of personal gain, and explicitly does not include the effects of moral satisfaction, desire for fairness, etc.
"There are 729,500 single women my age in New York City. My picture and profile successfully filtered out 729,499 of them and left me with the one I was looking for."
I know this is sort of meant as a joke, but I feel like one of the more interesting questions that could be addressed in an analysis like this is what percentage of the women in the dating pool could you actually have had a successful relationship with. How strong is your filter and how strong does it need to be? There's a tension between trying to find/obtain the best of many possible good options, and trying to find the one of a handful of good options in a haystack of bad ones.
I'm somewhat amazed that you looked at 300 profiles, read 60 of them, and liked 20 of them enough to send them messages. Only 1 in 5 potential matches met your standards for appearance, but 1 in 3 met your standards based on what they wrote, and that's not even taking into account the difference in difficulty between reading a profile and composing a message.
You make a big deal about the number of people available online, but in your previous article on soccer players you implied that differences in the average have a much larger effect on the tails than on the middle of the distribution. If you're really looking for mates in the tails of the distribution, and a 1 in 729,500 match is roughly a 4.7 sigma event, then being involved in organizations whose members are much more like your ideal mate on average may be a better strategy than online dating.
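For what it's worth, the sigma figure can be sanity-checked with a one-sided normal tail (this assumes mate quality is normally distributed, which is itself a strong assumption):

```python
from statistics import NormalDist

pool = 729_500
tail = 1 / pool  # probability of being the single best match in the pool
z = NormalDist().inv_cdf(1 - tail)
print(round(z, 2))  # roughly 4.7
```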