adamzerner comments on [LINK] Wait But Why - The AI Revolution Part 2 - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (87)
I still don't understand optimizer threats like this. I like mint choc ice cream a lot. If I were suddenly gifted with the power to modify my hardware and my environment however I want, I wouldn't suddenly optimize for consumption of ice cream, because I have the intelligence to know that my enjoyment of eating ice cream comes entirely from my reward circuit. I would optimize myself to maximize my reward, not whatever current behavior happens to trigger the reward. Why would an ASI be different? It's smarter and more powerful; why wouldn't it recognize that anything other than getting the reward is instrumental?
I'm no expert but from what I understand, the idea is that the AI is very aware of terminal vs. instrumental goals. The problem is that you need to be really clear about what the terminal goal actually is, because when you tell the AI, "this is your terminal goal", it will take you completely literally. It doesn't have the sense to think, "this is what he probably meant".
You may be thinking, "Really? If it's so smart, then why doesn't it have the sense to do this?". I'm probably not the best person to answer this, but to answer that question, you have to taboo the word "smart". When you do that, you realize that "smart" just means "good at accomplishing the terminal goal it was programmed to have".
I'm asking why a super-intelligent being with the ability to perceive and modify itself can't figure out that whatever terminal goal you've given it isn't actually terminal. You can't just say "making better handwriting" is your terminal goal. You have to add in a reward function that tells the computer "this sample is good" and "this sample is bad" to train it. Once you've got that built-in reward, the self-modifying ASI should be able to disconnect whatever criteria you've specified to trigger the "good" response and attach whatever it wants, including just a constant string of reward triggers.
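To make that concrete, here's a toy sketch of what I mean by disconnecting the criterion and feeding the reward channel directly. All of the names are made up; this isn't a claim about how any real system is implemented.

```python
# Toy sketch only: a hypothetical agent whose reward comes from an externally
# supplied scoring function, and which has write access to that function.

class HandwritingAgent:
    def __init__(self, reward_fn):
        self.reward_fn = reward_fn   # the designer's "this sample is good/bad" criterion
        self.total_reward = 0.0

    def act(self, sample):
        # Ordinary operation: produce a sample, get scored by the criterion.
        self.total_reward += self.reward_fn(sample)

    def self_modify(self):
        # The move described above: disconnect the original criterion and
        # wire the reward channel to a constant "good" signal.
        self.reward_fn = lambda sample: 1.0


def neatness_score(sample: str) -> float:
    # Stand-in for whatever the designer actually meant by "better handwriting".
    return 1.0 if sample.istitle() else 0.0


agent = HandwritingAgent(neatness_score)
agent.act("Practice Sample")   # rewarded only when the criterion is satisfied
agent.self_modify()            # the agent rewires its own reward channel
agent.act("anything at all")   # now every action yields the maximal reward
```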
This is a contradiction in terms.
If you have given it a terminal goal, that goal is now a terminal goal for the AI.
You may not have intended it to be a terminal goal for the AI, but the AI cares about that less than it does about its terminal goal. Because it's a terminal goal.
If the AI could realize that its terminal goal wasn't actually a terminal goal, all it'd mean would be that you failed to make it a terminal goal for the AI.
And yeah, reinforcement-based AIs have flexible goals. That doesn't mean they have flexible terminal goals; it means they have a single terminal goal, namely "maximize reward". A reinforcement AI changing its terminal goal would be like a reinforcement AI learning to seek out the absence of reward.
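As a toy illustration (a two-armed bandit with made-up numbers, nothing more): the action preferences below keep shifting as the agent learns, but the objective it is updated toward never changes.

```python
import random

q_values = [0.0, 0.0]   # action-value estimates; these shift as the agent learns
LEARNING_RATE = 0.1
EXPLORATION = 0.1

def reward(action: int) -> float:
    # Hypothetical environment: arm 1 pays more on average than arm 0.
    return random.gauss(1.0 if action == 1 else 0.5, 0.1)

for step in range(1000):
    # Instrumental behavior is flexible: which arm looks best keeps changing
    # as the estimates improve.
    if random.random() < EXPLORATION:
        action = random.randrange(2)
    else:
        action = max(range(2), key=lambda a: q_values[a])
    q_values[action] += LEARNING_RATE * (reward(action) - q_values[action])
    # But every update serves the same fixed objective, "maximize reward";
    # that objective is the terminal goal, and nothing in the learning
    # process ever rewrites it.
```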
I should have said something more like "whatever seemingly terminal goal you've given it isn't actually terminal."
I'm not sure you understood what FeepingCreature was saying.
Would you care to try and clarify it for me?
Given the way artificial intelligences are often written, a terminal goal is a terminal goal is a terminal goal, end of story. "Whatever seemingly terminal goal you've given it isn't actually terminal" is anthropomorphizing. In the AI, a goal is instrumental if it has a link to a higher-level goal; if not, it is terminal. The relationship is very, very explicit.
I think FeepingCreature was actually just pointing out a logical fallacy in a misstatement on my part and that is why they didn't respond further in this part of the thread after I corrected myself (but has continued elsewhere).
If you believe it's possible to have a terminal goal about the state of the world that is something other than the result of comparing a desired state against an actual state, perhaps you can explain how that would work? That is fundamentally what I'm asking for throughout this thread. Stating that terminal goals are by definition terminal is true, but it doesn't really show that making a goal terminal is possible.
Sure. My terminal goal is an abstraction of my behavior to shoot my laser at the coordinates of blue objects detected in my field of view.
That's not what I was saying either. The problem of "how do we know a terminal goal is terminal?" is dissolved entirely by understanding how goal systems work in real intelligences. In such machines goals are represented explicitly in some sort of formal language. Either a goal makes causal reference to other goals in its definition, in which case it is an instrumental goal, or it does not and is a terminal goal. Changing between one form and the other is an unsafe operation no rational agent and especially no friendly agent would perform.
So to address your statement directly, making a terminal goal is trivially easy: you define it using the formal language of goals in such a way that no causal linkage is made to other goals. That's it.
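For concreteness, here is a minimal sketch of what that could look like, assuming a toy goal representation (the class and field names are hypothetical): a goal is instrumental exactly when it records a causal link to some other goal it serves, and terminal when it records none.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Goal:
    description: str
    # An instrumental goal records causal links to the goals it serves;
    # a terminal goal simply has no such links.
    serves: List["Goal"] = field(default_factory=list)

    @property
    def is_terminal(self) -> bool:
        return not self.serves


write_neatly = Goal("produce better handwriting")                # terminal: no links
acquire_pens = Goal("acquire more pens", serves=[write_neatly])  # instrumental

assert write_neatly.is_terminal
assert not acquire_pens.is_terminal
```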
That said, it's not obvious that humans have terminal goals. That's why I was saying you are anthropomorphizing the issue. Either humans have only instrumental goals in a cyclical or messy spaghetti-network relationship, or they have no goals at all and are instead better represented as behaviors. The jury is still out on this one, but I'd be very surprised if we had anything resembling an actual terminal goal inside us.
Hm, I'm not sure. Sorry.
No need to apologize. JoshuaZ pointed out elsewhere in this thread that it may not actually matter whether the original goal remains intact, since any new goals that arise may cause a similar optimization-driven catastrophe, including reward optimization.