Wei_Dai comments on Three Approaches to "Friendliness" - Less Wrong
If I understand correctly, in order for your designs to work, you must first have a question-answerer or predictor that is much more powerful than a human (i.e., can answer much harder questions than a human can). For example, you are assuming that the AI would be able to build a very accurate model of an arbitrary human overseer from sense data and historical responses and predict their "considered judgements", which is a superhuman ability. My concern is that when you turn on such an AI in order to test it, it might either do nothing useful (i.e., output very low quality answers that give no insight into how safe it would eventually be) because it's not powerful enough to model the overseer, or FOOM out of control due to a bug in the design or implementation and the amount of computing power it has. (Also, how are you going to stop others from making use of such powerful answerers/predictors in a less safe, but more straightforward and "efficient" way?)
With a white-box metaphilosophical AI, if such a thing was possible, you could slowly increase its power and hopefully observe a corresponding increase in the quality of its philosophical output, while fixing any bugs that are detected and knowing that the overall computing power it has is not enough for it to vastly outsmart humans and FOOM out of control. It doesn't seem to require access to superhuman amounts of computing power just to start to test its safety.
I don’t think that the question-answerer or reinforcement learner needs to be superhuman. I describe them as using human-level abilities rather than superhuman abilities, and it seems like they could also work with subhuman abilities. Concretely, if we imagine applying those designs with a human-level intelligence acting in the interests of a superhuman overseer, they seem (to me) to work fine. I would be interested in problems you see with this use case.
Your objection to the question-answering system seemed to be that the AI may not recognize that human utterances are good evidence about what the overseer would ultimately do (even if they were), and that it might not be possible or easy to teach this. If I’m remembering right and this is still the problem you have in mind, I’m happy to disagree about it in more detail. But it seems that this objection couldn’t really apply to the reinforcement learning approach.
It seems like these systems could be within a small factor of optimal efficiency (certainly within a factor of 2, say, but hopefully much closer). I would consider a large efficiency loss to be failure.
The AI needs to predict what the human overseer "wants" from it, i.e., what answers the human would score highly. If I was playing the role of such an AI, I could use the fact that I am myself a human and think similarly to the overseer, and ask myself, "If I was in the overseer's position, what answers would I judge highly?" In particular, I could use the fact that I likely have philosophical abilities similar to the overseer's, and could just apply my native abilities to satisfy the overseer. I do not have to first build a detailed model of the overseer from scratch and then run that model to make predictions. It seems to me that the AI in your design would have to build such a model, and doing so seems a superhuman feat. In other words, if I did not already have native philosophical abilities on par with the overseer's, I couldn't give answers to any philosophical questions that the overseer would find helpful, unless I had the superhuman ability to create a model of the overseer, including his philosophical abilities, from scratch.
Suppose that you are the AI, and the overseer is a superintelligent alien with very different values and philosophical views. How well do you think that things will end up going for the alien? (Assuming you are actually trying to win at the RL / question-answering game.)
It seems to me like you can pursue the aliens' values nearly as well as if they were your own. So I'm not sure where we disagree (assuming you don't find this thought experiment convincing).
I think that while my intelligence is not greater than the alien's, I would probably do the thing that you suggested, "don't do anything the user would find terrible; acquire resources; make sure the user remains safe and retains effective control over those resources", but if the aliens were to start to trust me enough to upgrade my cognitive abilities to be above theirs, I could very well end up causing disaster (from their perspective) either by subtly misunderstanding some fine point of their values/philosophical views (*), or by subverting the system through some design or implementation flaw. The point is that my behavior while my abilities are less than super-alien is not a very good indication of how safe I will eventually be.
(*) To expand on this, suppose that as my cognitive abilities increase, I develop increasingly precise models of the alien, and at some point I decide that I can satisfy the alien's values better by using resources directly instead of letting the alien retain control (i.e., I could act more efficiently this way and I think that my model is as good as the actual alien), but it turns out that I'm wrong about how good my model is, and end up acting on a subtly-but-disastrously wrong version of the alien's values / philosophical views.
I discuss the most concerning-to-me instance of this in problem (1) here; it seems like that discussion applies equally well to anything that might work fine at first but then break when you become a sufficiently smart reasoner.
I think the basic question is whether you can identify and exploit such flaws at exactly the same time that you recognize their possibility, or whether you can notice them slightly before. By “before” I mean with a version of you that is less clever, has less time to think, has a weaker channel to influence the world, or is treated with more skepticism and caution.
If any of these versions of you can identify the looming problem in advance, and then explain it to the aliens, then they can correct the problem. I don’t know if I’ve ever encountered a possible flaw that wasn’t noticeable “before” it was exploitable in one of these senses. But I may just be overlooking them, and of course even if we can’t think of any it’s not such great reassurance.
Of course even if you can’t identify such flaws, you can preemptively improve the setup for the aliens, in advance of improving your own cognition. So it seems like we never really care about the case where you are radically smarter than the designer of the system, we care about the case where you are very slightly smarter. (Unless this system-improvement is a significant fraction of the difficulty of actually improving your cognition, which seems far-fetched.)
Other than the issue from the first part of this comment, I don't really see why the behavior changes (in a way that invalidates early testing) when you become super-alien in some respects. It seems like you are focusing on errors you may make that would cause you to receive a low payoff in the RL game. As you become smarter, I expect you to make fewer such errors. I certainly don't expect you to predictably make more of them.
(I understand that this is a bit subtle, because as you get smarter the problem also may get harder, since your plans will e.g. be subject to more intense scrutiny and to more clever counterproposals. But that doesn't seem prone to lead to the kinds of errors you discuss.)
Paraphrasing, I think you're saying that, if the reinforcement game setup continues to work, you expect to make fewer errors as you get smarter. And the only way getting smarter hurts you is if it breaks the game (by enabling you to fall into traps faster than you can notice and avoid them).
Is that right?
I don’t see how this could predictably happen without something else going wrong first. I agree and acknowledged that the question-learning solution is hard-to-test, so let’s focus on the RL approach. (Though I also don’t expect this to happen for the question-answering solution.) In this comment I’ll focus on the misunderstanding case.
So in the future, you expect to predictably make a decision which the aliens would consider catastrophically bad. It seems to me like:
If the solution would really be considered catastrophically bad, and it is chosen for evaluation, then it will receive a very low payoff, unless the scheme fails in some other way that we have not yet discussed.
So you would only make such mistakes if you thought that you would receive enough expected benefit from more aggressive decisions that it offsets this predictable possibility of a low payoff from catastrophic error.
But if you took more conservative actions, you could justify those actions (when they were evaluated) by explaining the predicted possibility of a catastrophic outcome. Unless something else has gone wrong, the aliens care more about averting this prospect of a bad outcome than saving time by you being more aggressive, so they shouldn’t penalize you for this.
So if you behave aggressively even at the risk of a catastrophic error, it seems like one of the following must have gone wrong:
1. In fact the aliens wouldn't be able to detect a catastrophic error during evaluation.
2. The conservative policy is actually worse than the aggressive policy in expectation, based on the considered judgment of the aliens.
3. The aliens wouldn't accept the justification for conservatism, based on a correct argument that its costs are outweighed by the possibility for error. (Or this argument is wrong, or else it's right but you wouldn't recognize this argument or something like it.)
Any of these could happen. 1 and 3 seem like they lead to more straightforward problems with the scheme, so would be worthwhile to explore on other grounds. 2 doesn’t seem likely to me, unless we are dealing with a very minor catastrophe. But I am open to arguing about it. The basic question seems to be how tough it is to ask the aliens enough questions to avoid doing anything terrible.
The examples you give in the parallel thread don't seem like they could present a big problem; you can ask the alien a modest number of questions like "how do you feel about the tradeoff between the world being destroyed and you controlling less of it?" And you can help to the maximum possible extent in answering them. Of course the alien won't have perfect answers, but their situation seems better than the situation prior to building such an AI, when they were also making such tradeoffs imperfectly (presumably even more imperfectly, unless you are completely unhelpful to the aliens for answering such questions). And there don't seem to be many plans where the cost of implementing the plan is greater than the cost of consulting the alien about how it feels about possible consequences of that plan.
Of course you can also get this information in other ways (e.g. look at writings and past behavior of the aliens) or ask more open-ended questions like "what are the most likely ways things could go wrong, given what I expect to do over the next week," or pursue compromise solutions that the aliens are unlikely to consider too objectionable.
ETA: actually it's fine if the catastrophic plan is not evaluated badly; all of the work can be done in the step where the aliens prefer conservative plans to aggressive ones in general, after you explain the possibility of a catastrophic error.
What if this is true, because other aliens (people) have similar AIs, so the aggressive policy is considered better, in a PD-like game theoretic sense, but it would have been better for everyone if nobody had built such AIs?
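To make the PD-like structure concrete, here is a toy payoff matrix (the numbers and the two-lab framing are purely illustrative assumptions of mine, not anything from the thread):

```python
# Two labs each choose "C" (conservative) or "A" (aggressive) deployment.
# Payoffs are (row lab, column lab); higher is better. Numbers are made up.
payoffs = {
    ("C", "C"): (3, 3),  # both forgo the risky speedup
    ("C", "A"): (0, 4),  # the aggressive lab outcompetes the conservative one
    ("A", "C"): (4, 0),
    ("A", "A"): (1, 1),  # race dynamics: worse for both than (C, C)
}
# "A" strictly dominates "C" for each lab, yet (A, A) is worse for everyone
# than (C, C): the sense in which it can be individually rational to build
# such AIs even though it would have been better if nobody had.
```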
With any of the black-box designs I've seen, I would be very reluctant to push the button that would potentially give it superhuman capabilities, even if we have theoretical reasons to think that it would be safe, and we've fixed all the problems we've detected while testing at lower levels of computing power. There are too many things that could go wrong with such theoretical reasoning, and easily many more flaws that won't become apparent until the system becomes smarter. Basically the only reason to do it would be time pressure, due to the AI race or something else. (With other kinds of FAI designs, i.e., normative and white-box metaphilosophical, it seems that we can eventually be more confident about their safety but they are harder to design and implement in the first place, so we should wait for them if we have the option to.) Do you agree with this?
In some sense I agree. If there were no time pressure, then we would want to proceed in only the very safest way possible, which would not involve AI at all. My best guess would be to do a lot of philosophical and strategic thinking as unmodified and essentially unaided humans, perhaps for a very very long time. After that you might decide on a single, maximally inoffensive computational aid, and then repeat. But this seems like quite an alien scenario!
I am not sold that in milder cases you would be much better off with e.g. a normative AI than black box designs. Why is it less error prone? It seems like normative AI must perform well across a wide range of unanticipated environments, to a much greater extent than with black box designs, and with clearer catastrophic consequences for failure. It seems like you would want to do something that remains under the control of something as close to a human as possible, for as long as possible.
In some sense the black box approach is clearly more dangerous (ignoring time limits), since it doesn't really get you closer to your goal. We will probably have to solve these other problems eventually. The black box metaphilosophical AI is really more like a form of cognitive enhancement. But it seems like enhancement is basically the right thing to do for now, even if we make the time crunch quite a bit milder.
I agree that you need to rely extensively on empirical evidence to gain confidence in your system (though I think this is probably true for other plausible approaches as well).
I don't fully understand the argument against using such empirical evidence, or rather I don't see how to make the argument go through without the prospect of a treacherous turn, which we haven't addressed at all.
I agree that there are kinds of problems which do not arise when testing with weak systems, and these problems cannot be detected empirically until you are working with stronger systems. The main question is whether you can debug these problems while working with stronger systems.
For concreteness, imagine a bug in your system for rendering answers; a malicious answer can compromise the computer and assign the answerer a high reward. Let's be pessimistic, and assume that the easiest way to compromise the computer is to create a devourer that will be released into the world, that is as smart as the answerer itself, and whose preferences are maximally destructive.
(This is going to repeat ideas from the great-uncle...)
It seems like this possibility is not catastrophic.
Before:
The attack can probably be noticed by the only slightly weaker agents which helped you design the current system. The only problem is if the attack cannot be noticed by any agent who isn't clever enough to realize that such an attack is possible.
You can sanitize the agent's answers using a similarly intelligent agent with more restricted outputs, e.g. one who just says "possible exploit" or "not a possible exploit" (see the sketch below).
The execution environment itself can make extensive use of similarly smart agents, and is therefore less likely to be vulnerable to an attack even if inexpertly designed.
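Here is a minimal sketch of that sanitization step (my own toy illustration; in the scheme described, the judge would itself be a similarly smart agent, which I stand in for with a trivial heuristic):

```python
def restricted_judge(answer: str) -> str:
    """Stand-in for the similarly intelligent agent with restricted outputs;
    it may only ever say 'possible exploit' or 'not a possible exploit'."""
    suspicious = any(tag in answer for tag in ("<script", "\x00", "%n"))
    return "possible exploit" if suspicious else "not a possible exploit"

def render_answer(answer: str) -> str:
    # Flagged answers never reach the (possibly buggy) rendering system,
    # so a malicious answer cannot use a rendering exploit to seize reward.
    if restricted_judge(answer) == "possible exploit":
        return "[answer withheld: flagged as a possible exploit]"
    return answer
```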
After:
The basic issue is that agent Z is preceded by nearly-as-smart agent Y, and so the prospects for danger are not that great if agent Z (and only agent Z) fails. The argument for catastrophic risks seems to come from the big disparity in power between the AI being controlled and the human overseer.
The concern seems to rest on the prospect of a treacherous turn: if agent Z is clever enough, then agent Y will also defect rather than helping you contain agent Z, even if in earlier tests agent Y seemed to be benign.
But I don't see why this would happen for the RL system, and it seems like that's the crux of the argument. I agree that it would happen for the question-answering system (I was the first to admit that the question-answering system was hard to test).
You may have other kinds of difficulties in mind, but all of the ones that I can think of seem to rest on a treacherous turn or something similar. Is there some other reason to expect failure to be catastrophic?
I'm not pointing out any specific reasons, but just expect that in general, failures when dealing with large amounts of computing power can easily be catastrophic. You have theoretical arguments for why they won't be, given a specific design, but again I am skeptical of such arguments in general.
I agree there is some risk that cannot be removed with either theoretical arguments or empirical evidence. But why is it greater for this kind of AI than any other, and in particular than white-box metaphilosophical or normative AI?
Normative AI seems like by far the worst, since (as I said above) it must perform well across a wide range of unanticipated environments, to a much greater extent than black box designs, and with clearer catastrophic consequences for failure.
So in that case we have particular concrete reasons to think that empirical testing won't be adequate, in addition to the general concern that empirical testing and theoretical argument is never sufficient. To me, white box metaphilosophical AI seems somewhere in between.
(One complaint is that I just haven't given an especially strong theoretical argument. I agree with that, and I hope that whatever systems people actually use, they are backed by something more convincing. But the current state of the argument seems like it can't point in any direction other than in favor of black box designs, since we don't yet have any arguments at all that any other kind of system could work.)
It seems like the question is: "How much more productive is the aggressive policy?"
It looks to me like the answer is "Maybe it's 1% cheaper or something, though probably less." In this case, it doesn't seem like the AI itself is introducing (much of) a PD situation, and the coordination problem can probably be solved.
I don't know whether you are disagreeing about the likely cost of the aggressive policy, or the consequences of slight productivity advantages for the aggressive policy. I discuss this issue a bit here, a post I wrote a few days ago but just got around to posting.
Of course there may be orthogonal reasons that the AI faces PD-like problems, e.g. it is possible to expand in an undesirably destructive way by building an unrelated and dangerous technology. Then either:
1. The alien user would want to coordinate in the prisoner's dilemma. In this case, the AI will coordinate as well (unless it makes an error leading to a lower reward).
2. The alien user doesn't want to coordinate in the prisoner's dilemma. But in this case, the problem isn't with the AI at all. If the users hadn't built AI they would have faced the same problem.
I don't know which of these you have in mind. My guess is you are thinking of (2) if anything, but this doesn't really seem like an issue to do with AI control. Yes, the AI may have a differential effect on e.g. the availability of destructive tech and our ability to coordinate, and yes, we should try encourage differential progress in AI capabilities just like we want to encourage differential progress in society's capabilities more broadly. But I don't see how any solution to the AI control problem is going to address that issue, nor does it seem especially concerning when compared to the AI control problem.
Maybe we have different things in mind for "aggressive policy". I was thinking of something like "give the AI enough computing power to achieve superhuman intelligence so it can hopefully build a full-fledged FAI for the user" vs. the "conservative policy" of "keep the AI at its current level where it seems safe, and find another way to build an FAI".
A separate but related issue is that it appears such an AI can be either a relatively safe or unsafe AI, depending on the disposition of the overseer (since an overseer less concerned with safety would be more likely to approve of potentially unsafe modifications to the AI). In a sidenote of the linked article, you wrote about why unsafe but more efficient AI projects won't overtake the safer AI projects in AI research.
But how will the safe projects exclude the unsafe projects from economies of scale and favorable terms of trade, if the unsafe projects are using the same basic design but just have overseers who care more about capability than safety?
Controlling the distribution of AI technology is one way to make someone's life harder, but it's not the only way. If we imagine a productivity gap as small as 1%, it seems like it doesn't take much to close it.
(Disclaimer: this is unusually wild speculation; nothing I say is likely to be true, but hopefully it gives the flavor.)
If unsafe projects perfectly pretend to be safe projects, then they aren't being more efficient. So it seems like we can assume that they are observably different from safe projects. (For example, there can't just be complexity-loving humans who oversee projects exactly as if they had normal values; they need to skimp on oversight in order to actually be more efficient. Or else they need to differ in some other way...) If they are observably different, then there are a range of possible measures to make their lives harder, such as controlling the distribution of AI technology (as mentioned above).
Of course unsafe projects can go to greater lengths in order to avoid these issues, for example by moving to friendlier jurisdictions or operating a black market in unsafe technology. But as these measures become more extreme they become increasingly easy to identify. If unsafe jurisdictions and black markets have only a few percent of the population of the world, then it's easy to see how they could be less efficient.
(I'd also expect e.g. unsafe jurisdictions to quickly cave under international pressure, if the rents they could extract were a fraction of a percent of total productivity. They could easily be paid off, and if they didn't want to be paid off, they would not be militarily competitive.)
All of these measures become increasingly implausible at large productivity differentials. And I doubt that any of these particular foreseeable measures will be important. But overall, given that there are economies of scale, I find it very likely that the majority can win. The main question is whether they care enough to.
Normally I am on the other side of a discussion similar to this one, but involving much larger posited productivity gaps and a more confident claim (things are so likely to be OK that it's not worth worrying about safety). Sorry if you were imagining a very much larger gap, so that this discussion isn't helpful. And I do agree that there is a real possibility that things won't be OK, even for small productivity gaps, but I feel like it's more likely than not to be OK.
Also note that at a 1% gap, we can basically wait it out. If 10% of the world starts out malicious, then by the time the economy has grown 1000x, about 11% of the world is malicious, and it seems implausible that the AI situation won't change during that time; certainly contemporary thinking about AI will be obsoleted, in an economic period as long as 0-2015 AD. (The discussion of social coordination is more important in the case where there are larger efficiency gaps, and hence probably larger differences in how the projects look and what technology they need.)
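A quick sanity check of that arithmetic, under an assumed toy model (mine, not from the comment) where the malicious sector simply compounds 1% faster than the safe sector:

```python
# Solve 0.9*g + 0.1*g**1.01 = 1000 for the safe sector's growth factor g
# by fixed-point iteration; the malicious sector then grows by g**1.01.
start_share = 0.10   # malicious fraction of the economy at the start
growth = 1000.0      # overall growth factor of the economy
edge = 1.01          # malicious sector compounds 1% faster

g = growth
for _ in range(50):
    g = (growth - start_share * g**edge) / (1 - start_share)

malicious = start_share * g**edge
print(f"final malicious share: {malicious / growth:.3f}")  # ~0.106, i.e. ~11%
```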
ETA: Really the situation is not so straightforward, since 1% more productivity leads to more than 1% more profit; overall this issue really seems too complicated for this kind of vague theoretical speculation to be meaningfully accurate, but I hope I've given the basic flavor of my thinking.
And finally, I intended 1% as a relatively conservative estimate. I don't see any particular reason you need to have so much waste, and I wouldn't be surprised if it ends up much lower, if future people end up pursuing some strategy along these lines.
Your examples of possible mistakes seemed to involve not knowing how the alien would feel about particular tradeoffs. This doesn't seem related to how much computational power you have, except insofar as having more power might lead you to believe that it is safe to try and figure out what the alien thinks from first principles. But that's not a necessary consequence of having more computing power, and I gave an argument that more computing power shouldn't predictably lead to trouble.
Why do you think that more computing power requires a strategy which is "aggressive" in the sense of having a higher probability of catastrophic failure?
You might expect that building "full-fledged FAI" requires knowing a lot about the alien, and you won't be able to figure all of that out in advance of building it. But again, I don't understand why you can't build an AI that implements a conservative strategy, in the sense of being quick to consult the user and unlikely to make a catastrophic error. So it seems like this just begs the question about the relative efficacy of conservative vs. aggressive strategies.
I don't quite understand the juxtaposition to the white box metaphilosophical algorithm. If we could make a simple algorithm which exhibited weak philosophical ability, can't the RL learner also use such a simple algorithm to find weak philosophical answers (which will in turn receive a reasonable payoff from us)?
Is the idea that by writing the white box algorithm we are providing key insights about what metaphilosophy is, that an AI can't extract from a discussion with us or inspection of our philosophical reasoning? At a minimum it seems like we could teach such an AI how to do philosophy, and this would be no harder than writing an algorithm (I grant that it may not be much easier).
It seems to me that we need to understand metaphilosophy well enough to be able to write down a white-box algorithm for it, before we can be reasonably confident that the AI will correctly solve every philosophical problem that it eventually comes across. If we just teach an AI how to do philosophy without an explicit understanding of it in the form of an algorithm, how do we know that the AI has fully learned it (and not some subtly wrong version of doing philosophy)?
Once we are able to write down a white-box algorithm, wouldn't it be safer to implement, test, and debug the algorithm directly as part of an AI designed from the start to take advantage of it, rather than indirectly having an AI learn it (and then presumably verifying that its internal representation of the algorithm is correct and that there aren't any potentially bad interactions with the rest of the AI)? And even the latter could reasonably be called white-box as well, since you are actually looking inside the AI and making sure that it has the right stuff inside. I was mainly arguing against a purely black box approach, where we start to build AIs while having little understanding of metaphilosophy, and therefore can't look inside the AI to see if it has learned the right thing.
I don't think this is core to our disagreement, but I don't understand why philosophical questions are especially relevant here.
For example, it seems like a relatively weak AI can recognize that "don't do anything the user would find terrible; acquire resources; make sure the user remains safe and retains effective control over those resources" is a praise-winning strategy, and then do it. (Especially in the reinforcement learning setting, where we can just tell it things and it can learn that doing things we tell it is a praise-winning strategy.) This strategy also seems close to maximally efficient: the costs of keeping humans around and retaining the ability to consult them are not very large, and the cost of eliciting the needed information is not very high.
So it seems to me that we should be thinking about the AI's ability to identify and execute strategies like this (and our ability to test that it is correctly executing such strategies).
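As a toy illustration of the kind of strategy-selection being discussed (my own sketch under made-up assumptions, not Christiano's actual proposal), the RL game rewards whichever plan the overseer scores most highly:

```python
def overseer_reward(plan: str) -> float:
    """Stand-in for the human overseer's evaluation (hypothetical)."""
    keeps_control = "retain user control" in plan
    consults_user = "consult the user" in plan
    return float(keeps_control) + float(consults_user)

def choose_plan(candidates: list[str]) -> str:
    # A praise-maximizing agent picks the plan with the highest predicted
    # reward; here "prediction" is just a direct query to the overseer.
    return max(candidates, key=overseer_reward)

plans = [
    "acquire resources unilaterally",
    "acquire resources, consult the user, retain user control",
]
print(choose_plan(plans))  # selects the conservative, praise-winning plan
```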
I discussed this issue a bit in problems #2 and #3 here. It seems like "answers to philosophical questions" can essentially be lumped under "values," in that discussion, since the techniques for coping with unknown values also seem to cope with unknown answers to philosophical questions.
ETA: my position looks superficially like a common argument that people give for why smart AI wouldn't be dangerous. But now the tables are turned: there is a strategy that the AI can follow which will cause it to earn high reward, and I am claiming that a very intelligent AI can find it, for example by understanding the intent of human language and using this as a clue about what humans will and won't approve of.
Acquiring resources has a lot of ethical implications. If you're inventing new technologies and selling them, you could be increasing existential risk. If you're trading with others, you would be enriching one group at the expense of another. If you're extracting natural resources, there are questions of fairness (how hard should you drive bargains or attempt to burn commons) and time preference (do you want to maximize short-term or long-term resource extraction). And how much do you care about animal suffering, or the world remaining "natural"? I guess the AI could present a plan that involves asking the overseer to answer these questions, but the overseer probably doesn't have the answers either (or at least should not be confident of his or her answers).
What we want is to develop an AI that can eventually do philosophy and answer these questions on its own, and correctly. It's the "doing philosophy correctly on its own" part that I do not see how to test for in a black-box design, without giving the AI so much power that it can escape human control if something goes wrong. The AI's behavior, while it's in the not-yet-superintelligent, "ask the overseer about every ethical question" phase, doesn't seem to tell us much about how good the design and implementation is, metaphilosophically.
Google Maps answers "how do I get from point A to point B" better than a human can. I don't think it does nothing useful just because it's not powerful enough to model the overseer.
I think you misunderstood. My comments were meant to apply to Paul Christiano's specific proposals, not to AIs in general.