I think I would be surprised to see average distance to Friendliness grow exponentially faster than average distance to mindhacking as the scale of the problem increased.
I was thinking about something like this:
A simple problem P1 has a Friendly solution with cost 12, and three Unfriendly solutions with cost 11 each. We either have to add three simple heuristics, one to block each Unfriendly solution, or maybe one more complex heuristic to block them all -- but the latter option assumes that the three Unfriendly solutions share some common essence, which can be identified and blocked.
A complex problem P9 has a Friendly solution with cost 8000, and a thousand Unfriendly solutions with costs between 7996 and 7999. Three hundred of them are blocked by heuristics already developed for problems P1-P8, but there are seven hundred new ones. The problem is not that the distance between 7996 and 8000 is greater than between 12 and 11, but rather that within that distance the number of "creatively different" Unfriendly solutions is growing too fast. We have to find a ton of heuristics before moving on to P10.
These are all just imaginary numbers, but my intuition is that a more complex problem may provide not only a few much cheaper Unfriendly solutions, but also a great many slightly cheaper Unfriendly solutions, whose diversity may be difficult to cover with a small set of heuristics.
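To make the P1 and P9 examples concrete, here is a rough Python sketch. Everything in it beyond the costs quoted above is my own assumption: the `Solution` class, the `kind` label standing in for the "essence" a heuristic can recognize, and the worst-case choice that every new Unfriendly solution in P9 has its own kind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    name: str
    cost: int
    friendly: bool
    kind: str  # hypothetical label for the "essence" a heuristic can block


def cheapest_allowed(solutions, blocked_kinds):
    """Return the cheapest solution whose kind no heuristic blocks."""
    allowed = [s for s in solutions if s.kind not in blocked_kinds]
    return min(allowed, key=lambda s: s.cost)


def heuristics_needed(solutions):
    """Count distinct kinds of Unfriendly solution cheaper than the Friendly one."""
    best_friendly = min(s.cost for s in solutions if s.friendly)
    return len({s.kind for s in solutions
                if not s.friendly and s.cost < best_friendly})


# P1: one Friendly solution (cost 12), three Unfriendly ones (cost 11 each).
p1 = [
    Solution("friendly", 12, True, "ok"),
    Solution("u1", 11, False, "a"),
    Solution("u2", 11, False, "b"),
    Solution("u3", 11, False, "c"),
]

print(heuristics_needed(p1))                       # 3
print(cheapest_allowed(p1, {"a", "b"}).name)       # u3 (one gap is enough)
print(cheapest_allowed(p1, {"a", "b", "c"}).name)  # friendly

# P9: Friendly at cost 8000, a thousand Unfriendly solutions at 7996..7999.
# Assumption: each one is "creatively different", so each has its own kind.
p9 = [Solution("friendly", 8000, True, "ok")] + [
    Solution(f"u{i}", 7996 + i % 4, False, f"k{i}") for i in range(1000)
]
already_blocked = {f"k{i}" for i in range(300)}  # heuristics from P1-P8
remaining = {s.kind for s in p9
             if not s.friendly and s.cost < 8000} - already_blocked
print(len(remaining))                              # 700
```

The point of the toy is only that the optimizer picks the Friendly solution when every cheaper Unfriendly kind is blocked, and not before. If several Unfriendly solutions shared a kind, one heuristic would cover them all and the count would drop.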
On the other hand, having developed enough heuristics, maybe we will see a pattern emerging, and make a better description of human utility functions. Specifying what we want, even if it proves very difficult, may still be simpler than adding all the "do not want" exceptions to the model. Maybe having a decent set of realistic "do not want" exceptions will help us discover what we really want. (By realistic I mean: really generated by an AI's attempts at a simple solution, simple as in Occam's razor; not just pseudosolutions generated by an armchair philosopher.)
My intuitions for how to frame the problem run a little differently.
The way I see it, there is no possible way to block all unFriendly or model-breaking solutions, and it's foolish to try. Try framing it this way: any given solution has some chance of breaking the models, probably pretty low. Call that probability P (god I miss LaTeX). The goal is to get Friendliness close enough to the top of the list that P (which ought to be constant) times the distance D from the top of the list is still below whatever threshold we set as an acceptable risk to life...
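For what it's worth, the P times D framing can be written down as a small Python sketch. The function names and the numbers are purely illustrative assumptions on my part, and `risk_exact` additionally assumes that each solution breaks the models independently of the others, which the framing above does not claim.

```python
def risk_linear(p, d):
    """First-order estimate from the framing above: P times D."""
    return p * d


def risk_exact(p, d):
    """Chance that at least one of d solutions breaks the models,
    assuming (hypothetically) that they fail independently."""
    return 1 - (1 - p) ** d


def acceptable(p, d, threshold):
    """True if the P times D estimate stays below the chosen risk threshold."""
    return risk_linear(p, d) < threshold


# Illustrative numbers only, not estimates of anything real.
p = 1e-6          # per-solution chance of breaking the models
threshold = 1e-3  # acceptable risk

print(acceptable(p, 100, threshold))     # True
print(acceptable(p, 10_000, threshold))  # False
```

For small P the linear estimate is slightly pessimistic compared with the independent-failure figure, so using P times D errs on the cautious side.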
"All that is necessary for evil to triumph is that good men do nothing."
155,000 people are dying, on average, every day. For those of us who are preference utilitarians, and who also believe that a Friendly singularity is possible and capable of ending this state of affairs, that puts a great deal of pressure on us. It doesn't give us leave to be sloppy (because human extinction, even multiplied by a low probability, is a massive negative utility). But, if we see a way to achieve similar results in a shorter time frame, the cost to human life of not taking it is simply unacceptable.
I have some concerns about CEV on a conceptual level, but I'm leaving those aside for the time being. My concern is that most of the organizations concerned with a first-mover X-risk are not in a position to be that first mover -- and, furthermore, they're not moving in that direction. That includes the Singularity Institute. Trying to operationalize CEV seems like a good way to get an awful lot of smart people bashing their heads against a wall while clever idiots trundle ahead with their own experiments. I'm not saying that we should be hasty, but I am suggesting that we need to be careful of getting stuck in dark intellectual forests with lots of things that are fun to talk about until an idiot with a tinderbox burns them down.
My point, in short, is that we need to be looking for better ways to do things, and to do them extremely quickly. We are working on a very, very, existentially tight schedule.
So, if we're looking for quicker paths to a Friendly, first-mover singularity, I'd like to talk about one that seems attractive to me. Maybe it's a useful idea. If not, then at least I won't waste any more time thinking about it. Either way, I'm going to lay it out and you guys can see what you think.
So, Friendliness is a hard problem. Exactly how hard, we don't know, but a lot of smart people have radically different ideas of how to attack it, and they've all put a lot of thought into it, and that's not a good sign. However, designing a strongly superhuman AI is also a hard problem. Probably much harder than a human can solve. The good news is, we don't expect that we'll have to. If we can build something just a little bit smarter than we are, we expect that bootstrapping process to take off without obvious limit.
So let's apply the same methodology to Friendliness. General goal optimizers are tools, after all. Probably the most powerful tools that have ever existed, for that matter. Let's say we build something that's not Friendly. Not something we want running the universe -- but, Friendly enough. Friendly enough that it's not going to kill us all. Friendly enough not to succumb to the pedantic genie problem. Friendly enough we can use it to build what we really want, be it CEV or something else.
I'm going to sketch out an architecture of what such a system might look like. Do bear in mind this is just a sketch, and in no way a formal, safe, foolproof design spec.
So, let's say we have an agent with the ability to convert unstructured data into symbolic relationships that represent the world, with explicitly demarcated levels of abstraction. Let's say the system has the ability to build Bayesian causal relationships out of its data points over time, and construct efficient, predictive models of the behavior of the concepts in the world. Let's also say that the system has the ability to take a symbolic representation of a desired future distribution of universes, a symbolic representation of the current universe, and map between them, finding valid chains of causality leading from now to then, probably using a solid decision theory background. These are all hard problems to solve, but they're the same problems everyone else is solving too.
This system, if you just specify parameters about the future and turn it loose, is not even a little bit Friendly. But let's say you do this: first, provide it with a tremendous amount of data, up to and including the entire available internet, if necessary. Everything it needs to build extremely effective models of human beings, with strongly generalized predictive power. Then you incorporate one or more of those models (say, a group of trusted people) as functional components: the system uses them to generalize natural language instructions first into a symbolic graph, and then into something actionable, working out the details of what is meant, rather than what is said. Then, when the system is finding valid paths of causality, it takes its model of the state of the universe at the end of each course of action, feeds it into its human-models, and gives them a veto vote. Think of it as the emergency regret button, iterated computationally for each possibility considered by the genie. Any possibility that any of the person-models finds unacceptable is disregarded.
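Here is a toy Python sketch of just the veto step, to show the shape of the control flow. It is nowhere near a design: the names `filter_plans`, `predict`, `alice` and `bob` are made up for illustration, the "person-models" are trivial string checks, and the predicted outcomes come from a hard-coded table instead of a real world-model.

```python
from typing import Callable, Iterable, List

Outcome = str                            # stand-in for a predicted end state
PersonModel = Callable[[Outcome], bool]  # True = acceptable, False = veto


def filter_plans(plans: Iterable[str],
                 predict: Callable[[str], Outcome],
                 person_models: List[PersonModel]) -> List[str]:
    """Keep only the plans whose predicted outcome no person-model vetoes."""
    surviving = []
    for plan in plans:
        outcome = predict(plan)
        # A single veto from any person-model discards the plan.
        if all(model(outcome) for model in person_models):
            surviving.append(plan)
    return surviving


# Hypothetical plans and their predicted end states.
predicted = {
    "cure disease with new drug": "disease cured, everyone alive",
    "cure disease by removing hosts": "disease gone, nobody alive",
}


def predict(plan: str) -> Outcome:
    return predicted[plan]


def alice(outcome: Outcome) -> bool:
    """Toy person-model: vetoes any outcome in which nobody survives."""
    return "nobody alive" not in outcome


def bob(outcome: Outcome) -> bool:
    """Toy person-model that never objects."""
    return True


print(filter_plans(list(predicted), predict, [alice, bob]))
# ['cure disease with new drug']
```

Note that the pedantic-genie plan satisfies the literal request and is only thrown out because one model objects to where it ends up. With `bob` alone, both plans survive, which is why the choice of models carries so much of the weight.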
(small side note: as described here, the models would probably eventually be indistinguishable from uploaded minds, and would be created, simulated for a short time, and destroyed uncountable trillions of times -- you'd either need to drastically limit the simulation depth of the models, or ensure that everyone you signed up to be one of the models knew the sacrifice they were making)
So, what you've got, plus or minus some spit and polish, is a very powerful optimization engine that understands what you mean, and disregards obviously unacceptable possibilities. If you ask it for a truly Friendly AI, it will help you first figure out what you mean by that, then help you build it, then help you formally prove that it's safe. It would turn itself off if you asked it to, and meant it. It would also exterminate the human species if you asked it to and meant it. Not Friendly, but Friendly enough to build something better.
With this approach, the position of the Friendly AI researcher changes. Instead of being in an arms race with the rest of the AI field with a massive handicap (having to solve two incredibly hard problems against opponents who only have to solve one), we only have to solve a relatively simple problem (building a Friendly-enough AI), which we can then instruct to sabotage unFriendly AI projects and buy some time to develop the real deal. That turns it into a fair fight, one that we might actually win.
Anyone have any thoughts on this idea?