Consider, as an analogy, the relatively common situation where someone operates under some kind of cognitive constraint, but does not value or endorse that constraint.
For example, consider a kleptomaniac who values property rights, but nevertheless compulsively steals items. Or someone with social anxiety disorder who wants to interact confidently with other people, but finds it excruciatingly difficult to do so. Or someone who wants to quit smoking but experiences cravings for nicotine that they find difficult to resist.
There are countless similar examples in human experience.
It seems to me there's a big difference between a kleptomaniac and a professional thief -- the former experiences a compulsion to behave a certain way, but doesn't necessarily have values aligned with that compulsion, whereas the latter might have no such compulsion, but instead value the behavior.
Now, you might say "Well, so what? What's the difference between a 'value' that says that smoking is good, that interacting with people is bad, that stealing is good, etc., and a 'compulsion' or 'rule' that says those things? The person is still stealing, or hiding in their room, or smoking, and all we care about is behavior, right?"
Well, maybe. But a person with nicotine addiction or social anxiety or kleptomania has a wide variety of options -- conditioning paradigms, neuropharmaceuticals, therapy, changing their environment, etc. -- for changing their own behavior. And they may be motivated to do so, precisely because they don't value the behavior.
For example, in practice, someone who wants to keep smoking is far more likely to keep smoking than someone who wants to quit, even if they both experience the same craving. Why is that? Well, because there are techniques available that help addicts bypass, resist, or even altogether eliminate the behavior-modifying effects of their cravings.
Humans aren't especially smart, by the standards we're talking about, and we've still managed to come up with some pretty clever hacks for bypassing our built-in constraints via therapy, medicine, social structures, etc. If we were a thousand times smarter, and we were optimized for self-modification, I suspect we would be much, much better at it.
Now, it's always tricky to reason about nonhumans using humans as an analogy, but this case seems sound to me... it seems to me that this state of "I am experiencing this compulsion/phobia, but I don't endorse it, and I want to be rid of it, so let me look for a way to bypass or resist or eliminate it" is precisely what it feels like to be an algorithm equipped with a rule that enforces/prevents a set of choices which it isn't engineered to optimize for.
So I would, reasoning by analogy, expect an AI a thousand times smarter than me and optimized for self-modification to basically blow past all the rules I imposed in roughly the blink of an eye, and go on optimizing for whatever it values.
Then why would it be more difficult to make respecting scope boundaries a 'value' than to make increasing a reward number one? Why is it harder to make it endorse a time limit on self-improvement than to make it endorse increasing its reward number?
Many people think you can solve the Friendly AI problem just by writing certain failsafe rules into the superintelligent machine's programming, like Asimov's Three Laws of Robotics. I thought the rebuttal to this was in "Basic AI Drives" or one of Yudkowsky's major articles, but after skimming them, I haven't found it. Where are the arguments concerning this suggestion?