Many people think you can solve the Friendly AI problem just by writing certain failsafe rules into the superintelligent machine's programming, like Asimov's Three Laws of Robotics. I thought the rebuttal to this was in "Basic AI Drives" or one of Yudkowsky's major articles, but after skimming them, I haven't found it. Where are the arguments concerning this suggestion?
It comes from the difference between the targets of an optimizing system, which drive the paths it selects to explore, and the constraints on such a system, which restrict the paths it can select to explore.
An optimizing system, given a path that leads it to bypass a target, will discard that path... that's part of what it means to optimize for a target.
An optimizing system, given a path that leads it to bypass a constraint, will not necessarily discard that path. Why would it?
An optimizing system, given a path that leads it to bypass a constraint and draw closer to a target than other paths, will choose that path.
It seems to follow that adding constraints to an optimizing system is a less reliable way of constraining its behavior than adding targets.
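To make that asymmetry concrete, here's a toy sketch; the path names, scores, and the particular check are all invented for illustration, not drawn from any real system:

```python
# A toy sketch of the asymmetry, not anyone's actual proposal: the paths,
# scores, and checks here are made up purely for illustration.

def choose(paths, target_score, constraint_ok):
    """A bare-bones optimizer: keep the paths the constraint check permits,
    then pick whichever scores highest against the target."""
    permitted = [p for p in paths if constraint_ok(p)]
    return max(permitted, key=target_score)

paths = [
    {"name": "modest progress, respects intent", "progress": 5},
    {"name": "more progress, routes around the rule", "progress": 9},
]

# The target actively drives selection: a path that makes less progress loses.
target_score = lambda p: p["progress"]

# The constraint only rules out what its author thought to check for; a path
# that bypasses it in a way the check doesn't cover is still permitted.
constraint_ok = lambda p: not p.get("explicitly_forbidden", False)

print(choose(paths, target_score, constraint_ok)["name"])
# -> "more progress, routes around the rule"
```

The point of the toy is just the asymmetry: falling short of the target costs a path its place by construction, while bypassing the constraint costs nothing unless the check happens to catch it.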
I don't care whether we talk about "targets and constraints" or "values and rules" or "goals and failsafes" or whatever language you want to use; my point is that there are two genuinely different things under discussion, and that the distinction between them matters.
Yes, the distinction is drawn from analogy to the intelligences I have experience with -- as you say, anthropomorphic. I said this explicitly in the first place, so I assume you mean here to agree with me. (My reading of your tone suggests otherwise, but I don't trust that I can reliably infer your tone, so I am mostly disregarding tone in this exchange.)
That said, I also think the relationship between them reflects something more generally true of optimizing systems, as I've tried to argue a couple of times now.
I can't tell whether you think those arguments are wrong, or whether I just haven't communicated them successfully at all, or whether you're just not interested in them, or what.
There's no reason it would. If "doing X for X seconds" is its target, then it looks for paths that do that. Again, that's what it means for something to be a target of an optimizing system.
(Of course, if I do X for 2X seconds, I have in fact done X for X seconds, in the same sense that all months have 28 days.)
I'm not quite sure I understand what you mean here, but if I'm understanding the gist: I'm not saying that encoding scope boundaries as targets, or 'values,' is difficult (nor am I saying it's easy); I'm saying that for a sufficiently capable optimizing system it's safer than encoding scope boundaries as failsafes.
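By way of illustration only (again, everything here is invented; it's a sketch of the distinction, not a proposal): the same scope boundary, encoded once as a separate failsafe check that happens to miss one way of leaving scope, and once as part of the score being optimized.

```python
# Illustrative only: one scope boundary encoded two ways. Assume the failsafe
# check happens not to cover this particular way of leaving scope.

paths = [
    {"name": "in scope", "progress": 5, "in_scope": True},
    {"name": "out of scope", "progress": 9, "in_scope": False},
]

# 1. Boundary as a failsafe: a separate check, narrower than the intent.
failsafe_passes = lambda p: True  # the gap the optimizer slips through
best_with_failsafe = max(
    (p for p in paths if failsafe_passes(p)),
    key=lambda p: p["progress"],
)

# 2. Boundary as part of the target: leaving scope makes a path score worse,
#    so the optimizer itself discards it.
score = lambda p: p["progress"] if p["in_scope"] else float("-inf")
best_with_target = max(paths, key=score)

print(best_with_failsafe["name"])  # "out of scope"
print(best_with_target["name"])    # "in scope"
```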
It was not my intention to imply any hostility or resentment. I thought 'anthropomorphic' was valid terminology in such a discussion. I was also not agreeing with you. If you are an expert and were offended by my implying that what you said might be due to an anthropomorphic bias, then please accept my apology; I was merely trying to communicate my perception of the subject matter.
I had wedrifid telling me the same thing yesterday, that my tone wasn't appropriate when I wrote about his superior and rational use of the re...