This post is mainly me fumbling around trying to define a reasonable direction for contributing to FAI research. I've found that laying out what success looks like in the greatest possible detail is a personal motivational necessity. Criticism is strongly encouraged.

The power and intelligence of machines have been gradually and consistently increasing, and it seems likely that at some point machine intelligence will surpass the power and intelligence of humans. Before that point arrives, it is important that humanity manages to direct these powerful optimizers toward a target that humans find desirable.

This is difficult because humans as a general rule have a fairly fuzzy conception of their own values, and it seems unlikely that the millennia of argument surrounding what precisely constitutes eudaimonia are going to be satisfactorily wrapped up before the machines get smart. The most obvious solution is to try to leverage some of the novel intelligence of the machines to help resolve the issue before it is too late.

Lots of people regard using a machine to help us understand human values as a chicken-and-egg problem. They think that a machine capable of helping us understand what humans value must necessarily also be smart enough to do AI programming, manipulate humans, and generally take over the world. I am not sure that I fully understand why people believe this.

Part of it seems to be inherent in the idea of AGI, or artificial general intelligence. There seems to be a belief that once an AI crosses a certain threshold of smarts, it will be capable of understanding literally everything. I have even heard people describe certain problems as "AI-complete", making an explicit comparison to ideas like Turing-completeness. If a Turing machine is a universal computer, why wouldn't there also be a universal intelligence?

To address the question of universality, we need to distinguish between intelligence and problem-solving ability. Problem-solving ability is typically described as a function of both intelligence and resources, and just throwing resources at a problem seems capable of compensating for a lot of cleverness. But if problem-solving ability is tied to resources, then intelligent agents are in some respects very different from Turing machines, since Turing machines all explicitly operate with an infinite amount of tape. Many existential risk scenarios revolve around the idea of an intelligence explosion, in which an AI starts improving its own intelligence so quickly that these resource restrictions become irrelevant. This is conceptually clean, in the same way that Turing machines are, but navigating such hard take-off scenarios well implies getting things absolutely right the first time, which seems like a less-than-ideal project requirement.
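To make the resources-versus-cleverness trade-off concrete, here is a toy sketch in Python (my own illustrative example, not anything from the safety literature): two ways of computing the same quantity, one that substitutes raw compute for cleverness and one that does not. For small problems the resource-heavy version keeps up; as the problem grows, no finite resource budget compensates.

    # Toy illustration (hypothetical example): trading resources for cleverness.
    # Both functions compute the n-th Fibonacci number.

    def fib_brute(n):
        # "Low cleverness, high resources": exponential-time recursion.
        if n < 2:
            return n
        return fib_brute(n - 1) + fib_brute(n - 2)

    def fib_clever(n):
        # "High cleverness, low resources": linear-time iteration.
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    # With enough compute the brute-force approach matches the clever one...
    assert fib_brute(25) == fib_clever(25)
    # ...but its cost grows exponentially with n, so unlike a Turing machine's
    # infinite tape, a finite resource budget eventually stops compensating.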

If an AI that knows a lot about AI results in an intelligence explosion, but we also want an AI that's smart enough to understand human values, is it possible to create an AI that can understand human values but not AI programming? In principle it seems like this should be possible. Resources useful for understanding human values don't necessarily translate into resources useful for understanding AI programming. The history of AI development is full of tasks that were supposedly solvable only by a machine possessing general intelligence, but where significant progress was made in understanding and pre-digesting the task, allowing much less intelligent AIs to solve problems in that domain.

If this is possible, then the best route forward is focusing on value learning. The path to victory is working on building limited AI systems that are capable of learning and understanding human values, and then disseminating that information. This effectively softens the AI take-off curve in the most useful possible way, and allows us to practice building AIs with human values before handing them too much power. Even if AI research is easy compared to the complexity of human values, a specialist AI might find thinking about human values easier than reprogramming itself, in the same way that humans find complicated visual/verbal tasks much easier than far simpler tasks like arithmetic. The human intelligence learning algorithm is trained on visual object recognition and verbal memory tasks, and it uses those tools to perform addition. A similarly specialized AI might be capable of rapidly understanding human values, but find AI programming as difficult as humans find determining whether 1007 is prime. As an additional incentive, value learning has enormous potential for improving human rationality and the effectiveness of human institutions even without the creation of a superintelligence. A system that helped people better understand the mapping between values and actions would be a potent weapon in the struggle with Moloch.
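As an aside on that arithmetic example: 1007 is in fact not prime (1007 = 19 × 53), something a machine verifies by trivial trial division even though unaided human introspection finds it awkward. A minimal sketch, purely illustrative:

    # Trial division: trivial for a machine, awkward for unaided introspection.
    def smallest_factor(n):
        # Returns the smallest non-trivial factor of n, or None if n is prime.
        d = 2
        while d * d <= n:
            if n % d == 0:
                return d
            d += 1
        return None

    print(smallest_factor(1007))  # 19, so 1007 = 19 * 53 and is not prime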

Building a relatively unintelligent AI and giving it lots of resources relevant to human values, in the hope that it can solve the human values problem, seems like a reasonable course of action, if it's possible. There are some difficulties with this approach. One of them is that after a certain point, no amount of additional resources compensates for a lack of intelligence. A simple reflex agent like a thermostat doesn't learn from data, and throwing resources at it won't improve its performance. To some extent you can make up for intelligence with data, but only to some extent. An AI capable of learning human values is going to be capable of learning lots of other things. It's going to need to build models of the world, and it's going to need internal feedback mechanisms to correct and refine those models.
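To make the thermostat point concrete, here is a minimal sketch (the class names and feedback convention are mine, purely illustrative) contrasting a hard-wired reflex agent, which extra data cannot improve, with an agent that maintains and refines an internal model:

    # Illustrative sketch: a fixed reflex agent versus a crude learning agent.

    class ReflexThermostat:
        # Hard-wired rule, no internal model; additional data is simply ignored.
        def __init__(self, setpoint=20.0):
            self.setpoint = setpoint

        def act(self, temperature):
            return "heat_on" if temperature < self.setpoint else "heat_off"

        def observe(self, temperature, feedback):
            pass  # throwing more data at it changes nothing

    class LearningThermostat:
        # Maintains an estimate of the comfortable temperature and refines it.
        def __init__(self, guess=20.0, rate=0.1):
            self.estimate = guess
            self.rate = rate

        def observe(self, temperature, feedback):
            # feedback: +1 means "too cold", -1 means "too hot", 0 means fine.
            self.estimate += self.rate * feedback

        def act(self, temperature):
            return "heat_on" if temperature < self.estimate else "heat_off"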

If the plan is to create an AI and primarily feed it data on how to understand human values, while not feeding it data on how to do AI programming and self-modify, that plan is complicated by the fact that insofar as the AI is capable of self-observation, it has access to sophisticated AI programming. I'm not clear on how much this access really means. My own introspection hasn't allowed me anything like hardware-level access to my brain. While it seems possible to create an AI that can refactor its own code or create successors, it isn't obvious that AIs created for other purposes will have this ability by accident.

This discussion focuses on intelligence amplification as the example path to superintelligence, but other paths exist. An AI with a sophisticated enough world model, even if somehow prevented from understanding AI, could still potentially increase its own power to threatening levels. Value learning is only the optimal way forward if human values are emergent, that is, if they can be understood without a molecular-level model of humans and the human environment. If the only way to understand human values is through physics, then human values aren't a meaningful category of knowledge with their own structure, and there is no way to create a machine that is capable of understanding human values but not capable of taking over the world.

In the fairy tale version of this story, a research community focused on value learning manages to use specialized learning software to make the human value program portable, instead of only running on human hardware. Having a large number of humans involved in the process helps us avoid lots of potential pitfalls, especially the research overfitting to the values of the researchers via the typical mind fallacy. Partially automating introspection helps raise the sanity waterline. Humans practice coding the human value program, in whole or in part, into different automated systems. Once we're comfortable that our self-driving cars have a good grasp of the trolley problem, we use that experience to safely pursue higher-risk research on recursive systems likely to start an intelligence explosion. FAI gets created and everyone lives happily ever after.

Whether value learning is worth focusing on seems to depend on the likelihood of the following claims. Please share your probability estimates (and explanations) with me because I need data points that originated outside of my own head.

 I can't figure out how to include working polls in a post, but there should be a working version in the comments.
  1. There is regular structure in human values that can be learned without requiring detailed knowledge of physics, anatomy, or AI programming. [poll:probability]
  2. Human values are so fragile that it would require a superintelligence to capture them with anything close to adequate fidelity. [poll:probability]
  3. Humans are capable of pre-digesting parts of the human values problem domain. [poll:probability]
  4. Successful techniques for value discovery of non-humans (e.g. artificial agents, non-human animals, human institutions) would meaningfully translate into tools for learning human values. [poll:probability]
  5. Value learning isn't adequately being researched by commercial interests who want to use it to sell you things. [poll:probability]
  6. Practice teaching non-superintelligent machines to respect human values will improve our ability to specify a Friendly utility function for any potential superintelligence. [poll:probability]
  7. Something other than AI will cause human extinction sometime in the next 100 years. [poll:probability]
  8. All other things being equal, an additional researcher working on value learning is more valuable than one working on corrigibility, Vingean reflection, or some other portion of the FAI problem. [poll:probability]

7 comments
  • There is regular structure in human values that can be learned without requiring detailed knowledge of physics, anatomy, or AI programming. [pollid:1091]
  • Human values are so fragile that it would require a superintelligence to capture them with anything close to adequate fidelity. [pollid:1092]
  • Humans are capable of pre-digesting parts of the human values problem domain. [pollid:1093]
  • Successful techniques for value discovery of non-humans (e.g. artificial agents, non-human animals, human institutions) would meaningfully translate into tools for learning human values. [pollid:1094]
  • Value learning isn't adequately being researched by commercial interests who want to use it to sell you things. [pollid:1095]
  • Practice teaching non-superintelligent machines to respect human values will improve our ability to specify a Friendly utility function for any potential superintelligence. [pollid:1096]
  • Something other than AI will cause human extinction sometime in the next 100 years. [pollid:1097]
  • All other things being equal, an additional researcher working on value learning is more valuable than one working on corrigibility, Vingean reflection, or some other portion of the FAI problem. [pollid:1098]

There is regular structure in human values that can be learned without requiring detailed knowledge of physics, anatomy, or AI programming.

While there is some regular structure to human values, I don't think you can say that the totality of human values has a completely regular structure. There are too many cases of nameless longings and generalized anxieties. Much of art is dedicated exactly to teasing out these feelings and experiences, often in counterintuitive contexts.

Can they be learned without detailed knowledge of X, Y and Z? I suppose it depends on what "detailed" means - I'll assume it means "less detailed than the required knowledge of the structure of human values." That said, the excluded set of knowledge you chose - "physics, anatomy, or AI programming" - seems really odd to me. I suppose you can poll people about their values (or use more sophisticated methods like prediction markets), but I don't see how this can yield more than "the set of human values that humans can articulate." It's something, but this seems to be a small subset of the set of human values. To characterize all dimensions of human values, I do imagine that you'll need to model human neural biophysics in detail. If successful, it will be a contribution to AI theory and practice.

Human values are so fragile that it would require a superintelligence to capture them with anything close to adequate fidelity.

To me, in this context, the term "fragile" means exactly that it is important to characterize and consider all dimensions of human values, as well as the potentially highly nonlinear relationships between those dimensions. An at-the-time invisible "blow" to an at-the-time unarticulated dimension can result in unfathomable suffering 1000 years hence. Can a human intelligence capture the totality of human values? Some of our artists seem to have glimpses of the whole, but it seems unlikely to me that a baseline human can appreciate the whole clearly.

Part of it seems to be inherent in the idea of AGI, or an artificial general intelligence. There seems to be the belief that once an AI crosses a certain threshold of smarts, it will be capable of understanding literally everything.

The MIRI/LessWrong sphere is very enamoured of "universal" problem solvers like AIXI. The main pertinent fact about these is that they can't be built out of atoms in our universe. Nonetheless, MIRI think it is possible to get useful architecture-independent generalisations out of AIXI-style systems.

"Anyway that sounds great right? Universal prior. Right. What's it look like? Way oversimplifying, it rates hypotheses' likelihood by their compressibility, or algorithmic complexity. For example, say our perfect AI is trying to figure out gravity. It's going to treat the hypothesis that gravity is inverse-square as more likely than a capricious intelligent faller. It's a formalization of Occam's razor based on real, if obscure, notions of universal complexity in computability theory.

But, problem. It's uncomputable. You can't compute the universal complexity of any string, let alone all possible strings. You can approximate it, but there's no efficient way to do so (AIXItl is apparently exponential, which is computer science talk for "you don't need this before civilization collapses, right?").

So the mathematical theory is perfect, except in that it's impossible to implement, and serious optimization of it is unrealistic. Kind of sums up my view of how well LW is doing with AI, personally, despite this not being LW. Worry about these contrived Platonic theories while having little interest in how the only intelligent beings we're aware of actually function."
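For concreteness, here is a very rough toy of the compressibility idea the quoted passage gestures at: true Kolmogorov complexity is uncomputable, so this sketch uses zlib-compressed length as a crude stand-in for description length and weights each candidate description by 2^-length. It illustrates the Occam-prior intuition only; it is nothing like an implementation or approximation of AIXI.

    # Crude sketch of the "rate hypotheses by compressibility" idea above.
    # Kolmogorov complexity is uncomputable, so zlib-compressed length stands in
    # for description length; this illustrates the Occam prior, it is not AIXI.
    import zlib

    def description_length_bits(hypothesis):
        return 8 * len(zlib.compress(hypothesis.encode("utf-8")))

    def occam_weights(hypotheses):
        # Weight each hypothesis by 2^-length, then normalise.
        raw = {h: 2.0 ** -description_length_bits(h) for h in hypotheses}
        total = sum(raw.values())
        return {h: w / total for h, w in raw.items()}

    # A short, regular description compresses better than a long list of ad hoc
    # exceptions, so the "inverse-square" story gets more prior weight than the
    # "capricious intelligent faller" story.
    simple = "F = G*m1*m2/r^2 for every pair of bodies"
    ad_hoc = ("A falls at 9.8, B falls at 3.1 on Tuesdays, C floats, "
              "D falls only when observed, E falls sideways, ...")
    print(occam_weights([simple, ad_hoc]))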

I think your criticism is a little harsh. Turing machines are impossible to implement as well, but they are still a useful theoretical concept.

Theoretical systems are useful so long as you keep track of where they depart from reality.

Consider the following exchange:

Engineer: The programme is acquiring more memory than it is releasing, so it will eventually fill the memory and crash.

Computer Scientist: No it won't, the memory is infinite.

Do the MIRI crowd make similar errors? Sure, consider Bostrom's response to Oracle AI. He assumes that an Oracle can only be a general intelligence coupled to a utility function that makes it want to answer questions and do nothing else.

I take your point that theorists can appear to be concerned with problems that have very little impact. On the other hand, there are some great theoretical results and concepts that can prevent us from futilely wasting our time and guide us to areas where success is more likely.

I think you're being ungenerous to Bostrom. His paper on the possibility of Oracle-type AIs is quite nuanced, and discusses many difficulties that would have to be overcome ...

http://www.nickbostrom.com/papers/oracle.pdf

To be fair to Bostrom, he doesn't go all the way down the rabbit hole of arguing that oracles aren't any different from agentive AGIs.