Comment author: Wei_Dai 13 March 2016 12:10:04AM * 6 points

The trainers are responsible for getting M to do what the trainers want, and the user trusts the trainers to do what the user wants.

In that case, there would be severe principal-agent problems, given the disparity between power/intelligence of the trainer/AI systems and the users. If I was someone who couldn't directly control an AI using your scheme, I'd be very concerned about getting uneven trades or having my property expropriated outright by individual AIs or AI conspiracies, or just ignored and left behind in the race to capture the cosmic commons. I would be really tempted to try another AI design that does purport to have the AI serve my interests directly, even if that scheme is not as "safe".

If I imagine an employee who sucks at philosophy but thinks 100x faster than me, I don't feel like they are going to fail to understand how to defer to me on philosophical questions.

If an employee sucks at philosophy, how does he even recognize philosophical problems as problems that he needs to consult you for? Most people have little idea that they should feel confused and uncertain about things like epistemology, decision theory, and ethics. I suppose it might be relatively easy to teach an AI to recognize the specific problems that we currently consider to be philosophical, but what about new problems that we don't yet recognize as problems today?

Aside from that, a bigger concern for me is that if I was supervising your AI, I would be constantly bombarded with philosophical questions that I'd have to answer under time pressure, and afraid that one wrong move would cause me to lose control, or lock in some wrong idea.

Consider this scenario. Your AI prompts you for guidance because it has received a message from a trading partner with a proposal to merge your AI systems and share resources for greater efficiency and economy of scale. The proposal contains a new AI design and control scheme and arguments that the new design is safer, more efficient, and divides control of the joint AI fairly between the human owners according to your current bargaining power. The message also claims that every second you take to consider the issue has large costs to you because your AI is falling behind the state of the art in both technology and scale, becoming uncompetitive, so your bargaining power for joining the merger is dropping (slowly in the AI's time-frame, but quickly in yours). Your AI says it can't find any obvious flaws in the proposal, but it's not sure that you'd consider the proposal to really be fair under reflective equilibrium or that the new design would preserve your real values in the long run. There are several arguments in the proposal that it doesn't know how to evaluate, hence the request for guidance. But it also reminds you not to read those arguments directly since they were written by a superintelligent AI and you risk getting mind-hacked if you do.

What do you do? This story ignores the recursive structure in ALBA. I think that would only make the problem even harder, but I could be wrong. If you don't think it would go like this, let me know how you think this kind of scenario would go.

In terms of your #1, I would divide the decisions requiring philosophical understanding into two main categories. One is decisions involved in designing/improving AI systems, like in the scenario above. The other, which I talked about in an earlier comment, is ethical disasters directly caused by people who are not uncertain, but just wrong. You didn't reply to that comment, so I'm not sure why you're unconcerned about this category either.

Comment author: paulfchristiano 19 March 2016 09:26:49PM 2 points

If an employee sucks at philosophy, how does he even recognize philosophical problems as problems that he needs to consult you for? Most people have little idea that they should feel confused and uncertain about things like epistemology, decision theory, and ethics. I suppose it might be relatively easy to teach an AI to recognize the specific problems that we currently consider to be philosophical, but what about new problems that we don't yet recognize as problems today?

Is this your reaction if you imagine delegating your affairs to an employee today? Are you making some claim about the projected increase in the importance of these philosophical decisions? Or do you think that a brilliant employee's lack of metaphilosophical understanding would in fact cause great damage right now?

I would divide the decisions requiring philosophical understanding into two main categories. One is decisions involved in designing/improving AI systems, like in the scenario above...

I agree that AI may increase the stakes for philosophical decisions. One of my points is that a natural argument for why it might increase the stakes (that it forces us to lock in answers to philosophical questions) doesn't seem to go through if you pursue this approach to AI control. There might be other arguments that building AI systems forces us to lock in important philosophical views, but I am not familiar with those arguments.

I agree there may be other ways in which AI systems increase the stakes for philosophical decisions.

I like the bargaining example. I hadn't thought about bargaining as competitive advantage before, and instead had just been thinking about the possible upside (so that the cost of philosophical error was bounded by the damage of using a weaker bargaining scheme). I still don't feel like this is a big cost, but it's something I want to think about somewhat more.

If you can think of other examples like this, they might help move my view. On my current model, these are just facts that increase my estimate of the importance of philosophical work; I don't really see them as relevant to AI control per se. (See the sibling, which is the better place to discuss that.)

one wrong move would cause me to lose control

I don't see cases where a philosophical error causes you to lose control, unless you would have some reason to cede control based on philosophical arguments (e.g. in the bargaining case). Failing that, it seems like there is a philosophically simple, apparently adequate notion of "remaining in control" and I would expect to remain in control at least in that sense.

Comment author: paulfchristiano 19 March 2016 09:04:51PM 2 points

In that case, there would be severe principal-agent problems, given the disparity between power/intelligence of the trainer/AI systems and the users. If I was someone who couldn't directly control an AI using your scheme, I'd be very concerned about getting uneven trades or having my property expropriated outright by individual AIs or AI conspiracies, or just ignored and left behind in the race to capture the cosmic commons. I would be really tempted to try another AI design that does purport to have the AI serve my interests directly, even if that scheme is not as "safe".

Are these worse than the principal-agent problems that exist in any industrialized society? Most humans lack effective control over many important technologies, both in terms of economic productivity and especially military might. (They can't understand the design of a car they use, they can't understand the programs they use, they don't understand what is actually going on with their investments...) It seems like the situation is quite analogous.

Moreover, even if we could build AI in a different way, it doesn't seem to do anything to address the problem, since it is equally opaque to an end user who isn't involved in the AI development process. In any case, they are in some sense at the mercy of the AI developer. I guess this is probably the key point---I don't understand the qualitative difference between being at the mercy of the software developer on the one hand, and being at the mercy of the software developer + the engineers who help the software run day-to-day on the other. There is a slightly different set of issues for monitoring/law enforcement/compliance/etc., but it doesn't seem like a huge change.

(Probably the rest of this comment is irrelevant.)

To talk more concretely about mechanisms in a simple example, you might imagine a handful of companies who provide AI software. The people who use this software are essentially at the mercy of the software providers (since for all they know the software they are using will subvert their interests in arbitrary ways, whether or not there is a human involved in the process). In the most extreme case an AI provider could effectively steal all of their users' wealth. They would presumably then face legal consequences, which are not qualitatively changed by the development of AI if the AI control problem is solved. If anything we expect the legal system and government to better serve human interests.

We could talk about monitoring/enforcement/etc., but again I don't see these issues as interestingly different from the current set of issues, or as interestingly dependent on the nature of our AI control techniques. The most interesting change is probably the irrelevance of human labor, which I think is a very interesting issue economically/politically/legally/etc.

I agree with the general point that as technology improves a singleton becomes more likely. I'm agnostic on whether the control mechanisms I describe would be used by a singleton or by a bunch of actors, and as far as I can tell the character of the control problem is essentially the same in either case.

I do think that a singleton is likely eventually. From the perspective of human observers, a singleton will probably be established relatively shortly after wages fall below subsistence (at the latest). This prediction is mostly based on my expectation that political change will accelerate alongside technological change.

Comment author: Wei_Dai 07 February 2016 11:06:08PM 3 points

How did you arrive at the conclusion that we're not facing big expected costs with these questions? It seems to me that for example the construction of large nuclear arsenals and lack of sufficient safeguards against nuclear war has already caused a large expected cost, and may have been based on one or more incorrect philosophical understandings (e.g., about what the right amount of concern for distant strangers and future people is). Similarly with "how much should we prioritize fast technological development?" But this is just from intuition since I don't really know how to compute expected costs when the uncertainties involved have a large moral or normative component.

My best guess is that we will get to do many-centuries-of-current-humanity worth of thinking before we really need to get any of these questions right.

Do you expect technological development to have plateaued by then (i.e., AIs will have invented essentially all technologies feasible in this universe)? If so, do you think there won't be any technologies among them that would let some group of people/AIs unilaterally alter the future of the universe according to their understanding of what is normative? (For example, intentionally or accidentally destroy civilization, or win a decisive war against the rest of the world.) Or do you think something like a world government will have been created to control the use of such technologies?

Comment author: paulfchristiano 19 March 2016 08:48:25PM 1 point

How did you arrive at the conclusion that we're not facing big expected costs with these questions?

There are lots of things we don't know, and my default presumption is for errors to be non-astronomically-costly, until there are arguments otherwise.

I agree that philosophical problems have some stronger claim to causing astronomical damage, and so I am more scared of philosophical errors than e.g. our lack of effective public policy, our weak coordination mechanisms, global warming, the dismal state of computer security.

But I don't see really strong arguments for philosophical errors causing great damage, and so I'm skeptical that we are facing big expected costs (big compared to the biggest costs we can identify and intervene on, amongst them AI safety).

That is, there seems to be a pretty good case that AI may be built soon, and that we lack the understanding to build AI systems that do what we want, that we will nevertheless build AI systems to help us get what we want in the short term, and that in the long run this will radically reduce the value of the universe. The cases for philosophical errors causing damage are overall much more speculative, have lower stakes, and are less urgent.

the construction of large nuclear arsenals and lack of sufficient safeguards against nuclear war has already caused a large expected cost, and may have been based on one or more incorrect philosophical understandings

I agree that philosophical progress would very slightly decrease the probability of nuclear trouble, but this looks like a very small effect. (Orders of magnitude smaller than the effects from say increased global peace and stability, which I'd probably list as a higher priority right now than resolving philosophical uncertainty.) It's possible we disagree about the mechanics of this particular situation.

Do you expect technological development to have plateaued by then (i.e., AIs will have invented essentially all technologies feasible in this universe)?

No. I think that 200 years of subjective time probably amounts to 5-10 more doublings of the economy, and that technological change is a plausible reason that philosophical error would eventually become catastrophic.
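
To make the arithmetic behind "5-10 more doublings" concrete, here is a minimal sketch. It assumes constant exponential growth over the 200 subjective years, which the comment does not claim; the function name `implied_growth` is purely illustrative.

```python
def implied_growth(years, doublings):
    """Return (doubling time in years, total growth factor, annual growth rate).

    Assumes steady exponential growth, which is a simplification.
    """
    doubling_time = years / doublings
    total_factor = 2.0 ** doublings
    annual_rate = 2.0 ** (1.0 / doubling_time) - 1.0
    return doubling_time, total_factor, annual_rate


for doublings in (5, 10):
    t, factor, rate = implied_growth(200, doublings)
    print(f"{doublings} doublings: doubling time {t:.0f} years, "
          f"economy x{factor:.0f}, about {rate:.1%} per year")
```

Under that assumption, the guess corresponds to an economy somewhere between 32x and 1024x its current size, with a doubling time of 20 to 40 subjective years (roughly 1.7% to 3.5% growth per subjective year).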

I said "best guess" but this really is a pretty wild guess about the relevant timescales.

intentionally or accidentally destroy civilization

As with the special case of nuclear weapons, I think that philosophical error is a relatively small input into world-destruction.

win a decisive war against the rest of the world

I don't expect this to cause philosophical errors to become catastrophic. I guess the concern is that the war will be won by someone who doesn't much care about the future, thereby increasing the probability that resources are controlled by someone who prefers not to undergo any further reflection? I'm willing to talk about this scenario more, but at face value the prospect of a decisive military victory wouldn't bump philosophical error above AI risk as a concern for me.

I'm open to ending up with a more pessimistic view about the consequences of philosophical error, either by thinking through more possible scenarios in which it causes damage or by considering more abstract arguments.

But if I end up with a view more like yours, I don't know if it would change my view on AI safety. It still feels like the AI control problem is a different issue which can be considered separately.

Comment author: Wei_Dai 10 March 2016 10:59:09PM 7 points

It seems to be a combination of all of these.

  1. Training an AI to defer to one's eventual philosophical judgments and interim method of managing uncertainty (and not falling prey to marketing worlds and incorrect but persuasive philosophical arguments etc) seems really hard, and made harder by the recursive structure in ALBA and the fact that the first level AI is sub-human in capacity which then has to handle being bootstrapped and training the next level AI. What percent of humans can accomplish this task, do you think? (I'd argue that the answer is likely zero, but certainly very small.) How do the rest use your AI?
  2. Assuming that deferring to humans on philosophy and managing uncertainty is feasible but costly, how many people could resist dropping this feature and the associated cost, in favor of adopting some sort of straightforward utility maximization framework with a fixed utility function that they think captures most or all of their values, if that came as a suggestion from the AI with an apparently persuasive argument? If most people do this and only a few don't (and those few are also disadvantaged in the competition to capture the cosmic commons due to deciding to carry these costs), that doesn't seem like much of a win.
  3. This is tied in with 1 and 2, in that correct meta-philosophical understanding is needed to accomplish 1, and unreasonable philosophical certainty would cause people to fail step 2.
  4. Even if the AIs keep deferring to their human users and don't end up short-circuiting their philosophical judgements, if the AI/human systems become very powerful while still having incorrect and strongly held philosophical views, that seems likely to cause disaster. We also don't have much reason to think that if we put people in such positions of power (for example, being able to act as a god in some simulation or domain of their choosing), most will eventually realize their philosophical errors and converge to correct views, or that the power itself wouldn't further distort their already error-prone reasoning processes.

Comment author: paulfchristiano 11 March 2016 10:31:56PM * 1 point

Re 1:

For a working scheme, I would expect it to be usable by a significant fraction of humans (say, comparable to the fraction that can learn to write a compiler).

That said, I would expect almost no one to actually play the role of the overseer, even if a scheme like this one ended up being used widely. An existing analogy would be the human trainers who drive Facebook's M (at least in theory; I don't know how that actually plays out). The trainers are responsible for getting M to do what the trainers want, and the user trusts the trainers to do what the user wants. From the user's perspective, this is no different from delegating to the trainers directly, and allowing them to use whatever tools they like.

I don't yet see why "defer to human judgments and handle uncertainty in a way that they would endorse" requires evaluating complex philosophical arguments or having a correct understanding of metaphilosophy. If the case is unclear, you can punt it to the actual humans.

If I imagine an employee who sucks at philosophy but thinks 100x faster than me, I don't feel like they are going to fail to understand how to defer to me on philosophical questions. I might run into trouble because now it is comparatively much harder to answer philosophical questions, so to save costs I will often have to do things based on rough guesses about my philosophical views. But the damage from using such guesses depends on the importance of having answers to philosophical questions in the short-term.

It really feels to me like there are two distinct issues:

  1. Philosophical understanding may help us make good decisions in the short term, for example about how to trade off extinction risk vs faster development, or how to prioritize the suffering of non-human animals. So having better philosophical understanding (and machines that can help us build more understanding) is good.
  2. Handing off control of civilization to AI systems might permanently distort society's values. Understanding how to avoid this problem is good.

These seem like separate issues to me. I am convinced that #2 is very important, since it seems like the largest existential risk by a fair margin and also relatively tractable. I think that #1 does add some value, but am not at all convinced that it is a maximally important problem to work on. As I see it, the value of #1 depends on the importance of the ethical questions we face in the short term (and on how long-lasting are the effects of differential technological progress that accelerates our philosophical ability).

Moreover, it seems like we should evaluate solutions to these two problems separately. You seem to be making an implicit argument that they are linked, such that a solution to #2 should only be considered satisfactory if it also substantially addresses #1. But from my perspective, that seems like a relatively minor consideration when evaluating the goodness of a solution to #2. In my view, solving both problems at once would be at most 2x as good as solving the more important of the two problems. (Neither of them is necessarily a crisp problem rather than an axis along which to measure differential technological development.)
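
One way to read the "at most 2x" claim, assuming the values of solving the two problems are non-negative and simply add (the symbols V_1 and V_2 are illustrative, not from the original discussion):

```latex
V_1 + V_2 \;\le\; 2\,\max(V_1, V_2), \qquad V_1, V_2 \ge 0,
```

with equality only when V_1 = V_2. If one problem matters much more than the other, solving both is worth barely more than solving the more important one.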

I can see several ways in which #1 and #2 are linked, but none of them seem very compelling to me. Do you have something in particular in mind? Does my position seem somehow more fundamentally mistaken to you?

(This comment was in response to point 1, but it feels like the same underlying disagreement is central to points 2 and 3. Point 4 seems like a different concern, about how the availability of AI would itself change philosophical deliberation. I don't really see much reason to think that the availability of powerful AI would make the endpoint of deliberation worse rather than better, but probably this is a separate discussion.)

Comment author: cousin_it 10 March 2016 07:15:59PM * 2 points

I mostly had in mind 2. Not sure how predicting humans is different from putting humans in hypotheticals. It seems like the same problems could happen.

Comment author: paulfchristiano 11 March 2016 05:58:31AM * 2 points

I agree that the same problem appears for ALBA. I was originally working with proposals where the improbability of the human's situation was bounded, but the recursive structure can lead to arbitrarily large improbability. I hadn't thought about this explicitly.

Predicting humans is different from putting humans in hypotheticals, in the sense that in principle you can actually sample from the situations that cause humans to think they are in a simulation or whatever.

For example, suppose the human had access to a button that said "This is weird, I'm probably in a simulation," and suppose that we expected the human to press it in any case where they would start behaving weirdly. Then we could potentially sample from the subset of situations where the human presses the button. And if we manage to do that, then the human isn't right to suspect they are in a simulation (any more than they already should believe they are in a simulation, prior to even building the AI).
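
As a rough illustration of what "sampling from the subset of situations where the human presses the button" could look like, here is a toy sketch. It assumes we already have a predictor of button presses (the `flagged` list, which is the hard part and is simply given here), and it uses ordinary importance weighting to correct for the oversampling; that correction is my assumption, not something the scheme specifies. All names and numbers are hypothetical.

```python
import random


def sample_overweighting_flagged(situations, base_probs, flagged, boost, n, rng):
    """Draw n situations, oversampling flagged ones by a factor `boost`.

    Returns (situation, weight) pairs, where weight = base_prob / proposal_prob
    corrects for the oversampling (standard importance weighting).
    """
    raw = [p * (boost if f else 1.0) for p, f in zip(base_probs, flagged)]
    total = sum(raw)
    proposal = [r / total for r in raw]
    idx = rng.choices(range(len(situations)), weights=proposal, k=n)
    return [(situations[i], base_probs[i] / proposal[i]) for i in idx]


# Hypothetical toy data: one rare "weird" situation among ordinary ones.
situations = ["routine request", "ordinary trade", "weird merger proposal"]
base_probs = [0.60, 0.399, 0.001]   # assumed real-world frequencies
flagged = [False, False, True]      # assumed predicted button presses

rng = random.Random(0)
draws = sample_overweighting_flagged(
    situations, base_probs, flagged, boost=500.0, n=10_000, rng=rng
)
share_flagged = sum(1 for s, _ in draws if s == "weird merger proposal") / len(draws)
print(f"flagged share of samples: {share_flagged:.2f}")
```

With these made-up numbers the weird situation goes from 0.1% of real-world cases to about a third of sampled cases (0.5 / 1.499 in expectation), so a human who finds themselves in it has much less reason to conclude that something strange is going on.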

I agree that you shouldn't expect the current version of the scheme to cope with this problem. And it's certainly not obvious that the approach sketched above can work; there are many important open questions. But I do think that there is a clear path to fixing this kind of problem.

(Interestingly, this is exactly analogous to what the active learning scheme has to do, in order to get the prediction algorithms to predict well.)

ETA: although I hadn't thought about this explicitly in the context of ALBA, I have expected to need some way to overweight "weird" situations in order to stop them from being problematic, ever since here.

Comment author: Wei_Dai 10 March 2016 10:37:06AM 14 points

Compared to its competition in the AGI race, MIRI was always going to be disadvantaged by both lack of resources and the need to choose an AI design that can predictably be made Friendly as opposed to optimizing mainly for capability. For this reason, I was against MIRI (or rather the Singularity Institute as it was known back then) going into AI research at all, as opposed to pursuing some other way of pushing for a positive Singularity.

In any case, what other approaches to Friendliness would you like MIRI to consider? The only other approach that I'm aware of that's somewhat developed is Paul Christiano's current approach (see for example https://medium.com/ai-control/alba-an-explicit-proposal-for-aligned-ai-17a55f60bbcf), which I understand is meant to be largely agnostic about the underlying AI technology. Personally I'm pretty skeptical but then I may be overly skeptical about everything. What are your thoughts? I don't recall seeing you having commented on them much.

Are you aware of any other ideas that MIRI should be considering?

Comment author: paulfchristiano 10 March 2016 06:57:00PM 4 points

Do you have a concise explanation of skepticism about the overall approach, e.g. a statement of the difficulty or difficulties you think will be hardest to overcome by this route?

Or is your view more like "most things don't work, and there isn't much reason to think this would work"?

In discussion you most often push on the difficulty of doing reflection / philosophy. Would you say this is your main concern?

My take has been that we just need to meet the lower bar of "wants to defer to human views about philosophy, and has a rough understanding of how humans want to reflect and want to manage their uncertainty in the interim."

Regarding philosophy/metaphilosophy, is it fair to describe your concern as one of:

  1. The approach I am pursuing can't realistically meet even my lower bar,
  2. Meeting my lower bar won't suffice for converging to correct philosophical views,
  3. Our lack of philosophical understanding will cause problems soon in subjective time (we seem to have some disagreement here, but I don't feel like adopting your view would change my outlook substantially), or
  4. AI systems will be much better at helping humans solve technical than philosophical problems, driving a potentially long-lasting (in subjective time) wedge between our technical and philosophical capability, even if ultimately we would end up at the right place?

My hope is that thinking and talking more about bootstrapping procedures would go a long way to resolving the disagreements between us (either leaving you more optimistic or me more pessimistic). I think this is most plausible if #1 is the main disagreement. If our disagreement is somewhere else, it may be worth also spending some time focusing somewhere else. Or it may be necessary to better define my lower bar in order to tell where the disagreement is.

Comment author: cousin_it 10 March 2016 12:10:52PM * 3 points

As far as I can tell, Paul's current proposal might still suffer from blackmail, like his earlier proposal which I commented on. I vaguely remember discussing the problem with you as well.

One big lesson for me is that AI research seems to be more incremental and predictable than we thought, and garage FOOM probably isn't the main danger. It might be helpful to study the strengths and weaknesses of modern neural networks and get a feel for their generalization performance. Then we could try to predict which areas will see big gains from neural networks in the next few years, and which parts of Friendliness become easy or hard as a result. Is anyone at MIRI working on that?

Comment author: paulfchristiano 10 March 2016 06:31:19PM * 2 points

As far as I can tell, Paul's current proposal might still suffer from blackmail, like his earlier proposal which I commented on

What problem do you have in mind here?

I thought that the previous problem was mostly psychological, i.e. that if humans were rational agents then this AI would be roughly as vulnerable to blackmail as its designers. So I thought the issue was the psychological strangeness (and great length) of the weird hypothetical.

Here we have no such hypothetical, and the system's behavior only depends on the predicted behavior of humans in the real world. That seems to address the narrow version of your concern.

I can see two analogous problems:

  1. The learning system may believe itself to be in a simulation, as discussed here. But that seems to cash out to a straightforward technical problem: we want to ensure that as long as there is some important scenario where the AI behaves strangely, we provide feedback on one such scenario. And this technical problem (1) looks like it should be soluble in principle, (2) looks pretty tractable right now, and (3) is needed to resolve a whole host of other problems (a bunch of exotic failures, but also a bunch of more prosaic difficulties).
  2. The behavior of the system depends on what a human would do across many cases. We can sample from these cases at random, but maybe a small fraction of them are sufficiently surprising that they would lead a human to believe they are in a simulation. We would like to sample these situations with disproportionate probability as well so that the human won't panic when they find themselves in one, but I haven't talked in any detail about how to do that and it's not obvious whether it is possible. (I do think it's possible.)

Did you have in mind 1, 2, or something else?
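
To make item 2 a bit more concrete, here is a minimal sketch of what "sampling with disproportionate probability" could look like, assuming the surprising cases can be flagged in advance. That assumption is the hard part and is not addressed here. The names `oversample`, `is_surprising`, and `boost` are hypothetical and chosen only for illustration; the importance weights are one standard way to undo the skew when averaging over the draws, not something specified in the proposal.

```python
import random

def oversample(cases, is_surprising, boost, k, rng):
    """Draw k cases, giving flagged cases `boost` times their uniform weight.

    Returns (case, importance_weight) pairs. Multiplying a per-case quantity
    by its weight and averaging recovers an estimate under uniform sampling.
    """
    # Unnormalized sampling weights: flagged cases are boosted (assumption:
    # a predicate that flags surprising cases is available).
    raw = [boost if is_surprising(c) else 1.0 for c in cases]
    total = sum(raw)
    probs = [w / total for w in raw]
    uniform = 1.0 / len(cases)
    picks = rng.choices(range(len(cases)), weights=probs, k=k)
    # Importance weight = (uniform probability) / (actual sampling probability).
    return [(cases[i], uniform / probs[i]) for i in picks]

if __name__ == "__main__":
    rng = random.Random(0)
    cases = list(range(1000))          # toy stand-ins for scenarios
    flagged = lambda c: c < 10         # pretend 1% of cases are surprising
    draws = oversample(cases, flagged, boost=50.0, k=2000, rng=rng)
    seen = sum(1 for c, _ in draws if flagged(c))
    print("flagged cases seen:", seen, "of", len(draws))
```

With these toy numbers the flagged cases make up 1% of the pool but roughly a third of the draws, so the human would see them often, while the weights keep any averaged statistics comparable to uniform sampling.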

Comment author: Wei_Dai 06 February 2016 01:27:09PM 0 points [-]

The goal / best case is that the development of AI doesn't depend on sorting out these kinds of considerations for its own sake, only insofar as the AI has to actually make critical choices that depend on these considerations.

Isn't a crucial consideration here how soon after the development of AI they will be faced with such choices? If the answer is "soon" then it seems that we should try to solve the problems ahead of time or try to delay AI. What's your estimate? And what do you think the first such choices will be?

Comment author: paulfchristiano 07 February 2016 04:45:03AM 2 points [-]

What's your estimate? And what do you think the first such choices will be?

I think that we are facing some issues all of the time (e.g. some of these questions probably bear on "how much should we prioritize fast technological development?" or "how concerned should we be with physics disasters?" or so on), but that it will be a long time before we face really big expected costs from getting these wrong. My best guess is that we will get to do many-centuries-of-current-humanity worth of thinking before we really need to get any of these questions right.

I don't have a clear sense of what the first choices will be. My view is largely coming from not seeing any serious candidates for critical choices.

Anything to do with expansion into space looks like it will be very far away in subjective time (though perhaps not far in calendar time). Maybe there is some stuff with simulations, or value drift, but neither of those look very big in expectation for now. Maybe all of these issues together make 5% difference in expectation over the next few hundred subjective years? (Though this is a pretty unstable estimate.)

Comment author: Wei_Dai 01 February 2016 10:38:17PM *  0 points [-]

FAI designs that require high confidence solutions to many philosophical problems also do not seem very promising to me at this point. I endorse looking for alternative approaches.

I agree that act-based agents seem to require fewer high confidence solutions to philosophical problems. My main concern with act-based agents is that these designs will be in competition with fully autonomous AGIs (either alternative designs, or act-based agents that evolve into full autonomy due to inadequate care of their owners/users) to colonize the universe. The dependence on humans and lack of full autonomy in act-based agents seem likely to cause a significant weakness in at least one crucial area of this competition, such as general speed/efficiency/creativity, warfare (conventional, cyber, psychological, biological, nano, etc.), cooperation/coordination, self-improvement, and space travel. So even if these agents turn out to be "safe", I'm not optimistic that we "win" in the long run.

My own idea is to aim for FAI designs that can correct their philosophical errors, autonomously, the same way that we humans can. Ideally, we'd fully understand how humans reason about philosophical problems and how philosophy normatively ought to be done before programming or teaching that to an AI. But realistically, due to time pressure, we might have to settle for something suboptimal like teaching through examples of human philosophical reasoning. Of course there's lots of ways for this kind of AI to go wrong as well, so I also consider it to be a long shot.

Do you think that we need to understand issues like this one, and have confidence in that understanding, prior to building powerful AI systems?

Let me ask you a related question. Suppose act-based designs are as successful as you expect them to be. We still need to understand issues like the one described in Eliezer's post (or solve the meta-problem of understanding philosophical reasoning) at some point, right? When do you think that will be? In other words, how much time do you think successfully creating act-based agents buys us?

Comment author: paulfchristiano 05 February 2016 11:10:21PM 0 points [-]

Suppose act-based designs are as successful as you expect them to be.

It's not so much that I have confidence in these approaches, but that I think (1) they are the most natural to explore at the moment, and (2) issues that seem like they can be cleanly avoided for these approaches seem less likely to be fundamental obstructions in general.

We still need to understand issues like the one described in Eliezer's post (or solve the meta-problem of understanding philosophical reasoning) at some point, right? When do you think that will be?

Whenever such issues bear directly on our decision-making in such a way that making errors would be really bad. For example, when we encounter a situation where we face a small probability of a very large payoff, then it matters how well we understand the particular tradeoff at hand. The goal / best case is that the development of AI doesn't depend on sorting out these kinds of considerations for its own sake, only insofar as the AI has to actually make critical choices that depend on these considerations.

The dependence on humans and lack of full autonomy in act-based agents seem likely to cause a significant weakness in at least one crucial area of this competition,

I wrote a little bit about efficiency here. I don't see why an approval-directed agent would be at a serious disadvantage compared to an RL agent (though I do see why an imitation learner would be at a disadvantage by default, and why an approval-directed agent may be unsatisfying from a safety perspective for non-philosophical reasons).

Ideally you would synthesize data in advance in order to operate without access to counterfactual human feedback at runtime---it's not clear if this is possible, but it seems at least plausible. But it's also not clear to me it is necessary, as long as we can tolerate very modest (<1%) overhead from oversight.

Of course if such a period goes on long enough then it will be a problem, but that is a slow-burning problem that a superintelligent civilization can address at its leisure. In terms of technical solutions, anything we can think of now will easily be thought of in this future scenario. It seems like the only thing we really lose is the option of technological relinquishment or serious slow-down, which don't look very attractive/feasible at the moment.

Comment author: Houshalter 03 February 2016 01:50:32PM *  -1 points [-]

I'm not saying the situation is impossible, just really really unlikely. The AI would need to output big binary files like images, and know someone intended to decode them, and somehow get around statistical detection by AI 2 (steganography is detectable since the lowest-order bits of an image are not uniformly random).

You might have a point that it's probably not best to publish things produced by the AI on the internet. If this is a serious risk, then it could still be done safely with a small group.

Comment author: paulfchristiano 03 February 2016 06:25:28PM 1 point [-]

The general lesson from steganography is that it is computationally easier to change a distribution in an important way than to detect such a change. In order to detect a change you need to consider all possible ways in which a distribution could be meaningfully altered, while in order to make a change you just have to choose one. From a theory perspective, this is a huge asymmetry that favors the attacker.
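
As a toy illustration of that asymmetry (a sketch under stated assumptions, not a model of any real system): the embedder below picks one channel, the least significant bit of each byte, while the "detector" checks a single coarse statistic. The function names and the random-bytes cover are hypothetical. Real steganalysis uses much richer statistics, but the detector still has to guess which ones matter, whereas the embedder only has to avoid the ones being checked.

```python
import random

def embed(cover: bytes, message: bytes) -> bytes:
    """Hide message bits (LSB-first per byte) in the low bit of each cover byte."""
    bits = [(byte >> i) & 1 for byte in message for i in range(8)]
    if len(bits) > len(cover):
        raise ValueError("cover too small for message")
    out = bytearray(cover)
    for idx, bit in enumerate(bits):
        out[idx] = (out[idx] & 0xFE) | bit  # overwrite only the lowest bit
    return bytes(out)

def extract(stego: bytes, n_bytes: int) -> bytes:
    """Read n_bytes back out of the low bits, in the same order embed() used."""
    out = bytearray()
    for j in range(n_bytes):
        value = 0
        for i in range(8):
            value |= (stego[8 * j + i] & 1) << i
        out.append(value)
    return bytes(out)

def lsb_ones_fraction(data: bytes) -> float:
    """One coarse statistic a naive detector might check."""
    return sum(b & 1 for b in data) / len(data)

if __name__ == "__main__":
    rng = random.Random(0)
    cover = bytes(rng.randrange(256) for _ in range(4096))  # stand-in for image data
    stego = embed(cover, b"hello")
    changed = sum(a != b for a, b in zip(cover, stego))
    print(extract(stego, 5), "bytes changed:", changed)
    print(lsb_ones_fraction(cover), lsb_ones_fraction(stego))
```

A 5-byte message touches at most 40 of the 4096 cover bytes, so the checked statistic can move by at most 40/4096, under 1%.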

This point doesn't seem directly relevant though, unless someone offers any good reason to actually include the non-imitation goal, rather than simply imitating the successful human trials. (Though there are more subtle reasons to care about problematic behavior that is neither penalized nor rewarded by your training scheme. It would be nicer to have positive pressure to do only those things you care about. So maybe the point ends up being relevant after all.)

Actually, in the scheme as you wrote it there is literally no reason to include this second goal. The distinguisher is already trying to distinguish the generator's behavior from [human conditioned on success], so the generator already has to succeed in order to win the game. But this doesn't introduce any potentially problematic optimization pressure, so it just seems better.
