Vaniver comments on Open thread 7th september - 13th september - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (146)
So, I don't think this works all that well because of the difference between "what is the most effective plan to get me a cheeseburger?" and "make me a cheeseburger." It seems relatively easy to get planning software that terminates on a simple, non-extreme plan that makes you a cheeseburger, and relatively hard to get planning software that terminates on the most effective plan to get you a cheeseburger (while remaining non-extreme).
I think the most mainstream-acceptable approach is to start with a discussion of the principal-agent problem, and then discuss how "self-interest" can be generalized to deal with "limited information" or limited communication ability. Even in the case where an agent wants to be perfectly aligned with the principal, any difference between the agent's understanding of the principal's goals and the principal's understanding of those goals is problematic.
The value alignment problem, it seems to me, has two parts that might actually collapse into one: first, how do we communicate values in such a way that the principal can trust that the agent will do what the principal actually wants; second, how do we build a system of values and meta-values that remains stable under self-modification? (If later versions of a self-modifying system are the agent and the earlier versions are the principal, the second might be a subset of the first problem; but it also might have features that require separate treatment.)
It's not clear to me that it helps to point out that it is Very Bad to get the problem wrong. This might be the sort of desperation that actually diminishes rather than increases interest in the problem.
Yeah, I see now that the story doesn't work very well. It's unrealistic that an ad hoc AI designed for answering human questions would manage a coherent takeoff on the first try, without failing miserably due to some flaws in architecture or self-modeling. In all likelihood, making an AI take off without tripping over itself is a hard engineering problem that you can't solve by accident. That seems like a new argument against this particular kind of doomsday scenario. I need to think about it.
If it terminates as soon as it hits a plan that achieves the goal, and the possible actions are ordered in terms of how extreme they are, then increasing the available resources can't cause trouble, but increasing the available options can (because your ordering might go from correct to incorrect).
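A minimal sketch of the satisficer described above, in Python. The plans, their names, and the "extremeness" scores are all invented for illustration; the point is just that the planner walks the candidates in extremeness order and stops at the first one that works, so it never even evaluates the extreme options.

```python
# Satisficing planner: try candidate plans from least to most extreme,
# terminate on the first one that achieves the goal.
def first_satisficing_plan(plans, achieves_goal):
    """Return the least extreme plan that achieves the goal, or None."""
    for plan in sorted(plans, key=lambda p: p["extremeness"]):
        if achieves_goal(plan):
            return plan
    return None

# Hypothetical option set. Adding new options can silently change which
# plan comes first in the ordering -- that's where the trouble comes in.
plans = [
    {"name": "walk to burger place",    "extremeness": 1, "works": True},
    {"name": "drive across town",       "extremeness": 3, "works": True},
    {"name": "commandeer a food truck", "extremeness": 9, "works": True},
]

best = first_satisficing_plan(plans, lambda p: p["works"])
print(best["name"])  # prints "walk to burger place"
```

Note that correctness here rests entirely on the extremeness ordering: more resources only add iterations, but a new option with a mis-scored extremeness changes which plan is returned.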
In general optimization terms, this is the difference between local optima and global optima. If you have a reasonable starting point and use gradient descent, you only need the local solution space to be reasonable in order to end up at a reasonable point, because the total distance you'll travel is likely to be short (relative to the solution space, and dependent on its topology, of course). If you're after the global optimum, you need the entire solution space to be reasonable.
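A toy numerical illustration of that difference (the objective function is made up): gradient descent from a reasonable starting point settles into the nearby basin, while a brute-force global search ends up in a different basin entirely.

```python
# Objective with two basins; the global minimum is near x = -1.
def f(x):
    return (x**2 - 1)**2 + 0.3 * x

def grad(x):
    return 4 * x * (x**2 - 1) + 0.3

# Local search: gradient descent from a "reasonable" starting point.
x = 0.9
for _ in range(2000):
    x -= 0.01 * grad(x)

# Global search: brute force over the whole space.
xs = [i / 1000 for i in range(-3000, 3001)]
x_global = min(xs, key=f)

print(round(x, 2))         # stays in the nearby basin, near 0.96
print(round(x_global, 2))  # lands in the far basin, near -1.04
```

The local searcher only ever saw the region around its start; the global searcher had to be trusted with the entire domain.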
I've since edited the previous comment to agree with you in principle, but I think this particular objection doesn't really work.
Let's say Lawrence asks the AI to get him a cheeseburger with probability at least 90%. The AI can't use its usual plan because the local burger place is closed. It picks the next simplest plan, which involves using a couple more computers for additional planning and doesn't specify any further details. These computers receive the subgoal "maximize the probability no matter what", because it's slightly simpler mathematically than capping it at 90%, and doesn't have any downside from the POV of the original goal.
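The mismatch between the capped goal and the "maximize no matter what" subgoal can be made concrete with a toy example (all action names and probabilities here are invented): the two objectives pick very different plans, even though either one satisfies Lawrence's original request.

```python
# Hypothetical actions with a success probability and an extremeness score.
actions = [
    {"name": "order from backup restaurant", "p": 0.92,  "extremeness": 1},
    {"name": "drive to three restaurants",   "p": 0.97,  "extremeness": 3},
    {"name": "seize the beef supply chain",  "p": 0.999, "extremeness": 10},
]

# Subgoal as handed to the extra computers: maximize probability, no cap.
maximizer_choice = max(actions, key=lambda a: a["p"])

# Subgoal as intended: least extreme action clearing the 90% threshold.
capped_choice = min((a for a in actions if a["p"] >= 0.9),
                    key=lambda a: a["extremeness"])

print(maximizer_choice["name"])  # prints "seize the beef supply chain"
print(capped_choice["name"])     # prints "order from backup restaurant"
```

From the POV of the original 90% goal the two choices are interchangeable, which is exactly why dropping the cap looks like a harmless simplification.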
If you want the AI to avoid such plans, it needs to have a concept of "non-extreme" that agrees with our intuitions more reliably. As far as I understand, that's pretty much the friendly AI problem.
I think it's simpler, but not by much. Instead of knowing both the value and cost of everything, you just need to know the cost of everything. (The 'actual' cost, that is, not the full economic cost, which by including opportunity cost includes the value problem.) You could probably get away with an approximation of the cost, though a guarantee like "at least as high as the actual cost" is probably helpful.
So if Lawrence says "I'll pay up to $10 for a hamburger," either it can find a plan that provides Lawrence a hamburger for less than $10 (gross cost, not net cost), or it says "sorry, can't find anything at that price range."
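The cost-capped version might look like the sketch below (plan names and costs are hypothetical): the planner needs only a gross-cost estimate per plan, and either returns something under budget or reports failure.

```python
# Budget-capped satisficer: needs only a cost estimate for each plan,
# not a value estimate. Returns the cheapest plan under budget, or an
# apology if nothing qualifies.
def plan_under_budget(plans, budget):
    affordable = [p for p in plans if p["cost"] <= budget]
    if not affordable:
        return "sorry, can't find anything in that price range"
    return min(affordable, key=lambda p: p["cost"])["name"]

plans = [
    {"name": "burger from food truck", "cost": 8.50},
    {"name": "burger delivered",       "cost": 12.00},
]

print(plan_under_budget(plans, 10))  # prints "burger from food truck"
print(plan_under_budget(plans, 5))   # prints the apology
```

Everything hard is hidden in the `cost` field: if the estimates are guaranteed to be at least the actual gross cost, the budget cap is safe, but producing those estimates is the intuition-encoding problem the next paragraph raises.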
I think there's a huge amount of work to get there--you have to have an idea of 'gross cost' that matches up well enough with our intuitions, which is an intuition-encoding problem and thus hard. (If it tweets at the local burger company to get a coupon for a free burger, what's the cost?)