What does "the AI" mean? Computers don't come with the ability to interpret English. You still need to either translate "do the thing that this body of writing describes" into formal language, or program a method of translating English instructions in general while avoiding gotcha interpretations (e.g, "this body of writing" surely "describes", in passing, an AI that kills us all or simply does nothing). Intelligence as I imagine it requires a goal to have meaning; if that goal is just 'do something that literally fulfills your instruction according to dictionary meanings', the most efficient way to accomplish that is to find some mention of an AI that does nothing and imitate it. Whereas if we program in some drive to fulfill what humans should have asked for, that sounds a lot like CEV. I don't find it obvious, to put it mildly, that your extra step adds anything.
I assume "AI capable of understanding and treating as its final goal some natural language piece of text", which is of course hard to create. I don't think this presupposes that the AI automatically interprets instructions as we wish them to be interpreted, which is the part that we add by supplying a long natural languge description of ways that we might specify this, and problems we would want to avoid in doing so.
After this section, it feels like the "do what I mean"/"do what I want" instruction pretty much solves the problem of what we want the AI to value. If the creator of the AI doesn't want things that lead to a good future, then it seems like they would be unlikely to succeed in specifying a good future through other means. On the other hand, if the creator wants the right thing, then DWIM seems to avoid all perverse instantiations. Additionally, it seems like the only technical requirement is that the AI be able to follow natural language instructions (maybe with a simpler interim definition of value for the AI to use while it is still learning). Overall, my impression is that this area doesn't require nearly as much work as other parts of superintelligence design (such as getting an AI to value goals described in natural language in the first place).
Suppose we have a bunch of short natural language descriptions of what we would want the AI to value. Can we simply give the AI a list of these, and tell it to maximize all of these values given some kind of equal weighting? It seems to me that, much more than in other areas of superintelligence design, the things we come up with are likely to point to what we want, and so aggregating a bunch of these descriptions is more likely to lead to what we want than picking any description individually. Does it seem like this would work? Is there any way this can go wrong?
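To make the question concrete, here is a minimal sketch of what "equal weighting" might mean operationally, assuming a hypothetical score(outcome, description) function that rates how well an outcome satisfies a single natural language description on a common scale; nothing here is a real proposal.

```python
# Minimal sketch of equal-weighted aggregation over several natural
# language value descriptions.  `score` is a hypothetical stand-in for
# the hard part: rating an outcome against one description, on a
# common scale such as [0, 1].

def aggregate_value(outcome, descriptions, score):
    """Equal-weighted average of the per-description scores."""
    return sum(score(outcome, d) for d in descriptions) / len(descriptions)

def choose_outcome(candidates, descriptions, score):
    """Pick the candidate outcome that best satisfies the aggregate."""
    return max(candidates,
               key=lambda o: aggregate_value(o, descriptions, score))
```

Writing it out this way also suggests where it could go wrong: the averaging only helps if the per-description scales are comparable, and a perverse instantiation that scores highly under every description is not filtered out by adding more descriptions.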
What if we gave the AI the contents of this entire superintelligence chapter, or the entire body of writing on AI design, and told the AI something like "do the thing that this body of writing describes". It seems like this would help in specifying situations that we want to avoid, that might seem ambiguous in any particular short natural language description of what we would want the AI to do. Would this be likely to be more or less robust than trying to come up with a short natural language description? Could we assume the AI would already take all this into account, even if given only a short natural language description?
Is CEV intended to be specified in great technical depth, or is it intended to be plugged into a specification for an AI capable of executing arbitrary natural language commands in a natural language form?
Split the cake into parts and give each part to one of the extrapolated volitions that didn't cohere, after allocating some defensive force to patrol the borders and prevent future war?
One faction might have values that lead to something highly disvalued by another faction (e.g., one faction values slavery, while another opposes slavery for all beings, including members of the first faction).
But if CEV doesn't give the same result when seeded with humans from any time period in history, I think that means it doesn't work, or else that human values aren't coherent enough for it to be worth trying.
Hmm, maybe one could try to test the CEV implementation by running it on historical human values and seeing whether it approaches modern human values (when not run all the way to convergence).
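A rough sketch of that test, assuming hypothetical extrapolate(values, steps) and similarity(a, b) functions (neither of which we know how to build, which is most of the problem):

```python
# Sketch of the historical sanity check: partially extrapolating older
# human values should move them toward modern values, not away.
# `extrapolate` and `similarity` are hypothetical stand-ins.

def historical_sanity_check(historical_values, modern_values,
                            extrapolate, similarity, steps=1):
    before = similarity(historical_values, modern_values)
    after = similarity(extrapolate(historical_values, steps), modern_values)
    return after > before  # extrapolation should close the gap
```

At best this is a necessary condition rather than evidence of correctness: an extrapolation that simply overwrote any seed values with modern ones would also pass it.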
A value learning AI, on the other hand, would treat use of the word 'friendly' as a clue about a hidden thing that it cares about. This means that even if the value learning AI could trick the person into saying 'friendly' more often, doing so would be no help to it.
That's great, but do any of these approaches actually accomplish this? I still have some reading to do, but as best as I can tell, they all rely on some training data. Like a person shouting "Friendly!" and "Unfriendly!" at different things.
The AI will then just do the thing it thinks would make the person most likely to shout "Friendly!". E.g. torturing them unless they say it repeatedly.
Yudkowsky argues against a very similar idea here.
It seems to me that the only advantage of this approach is that it prevents the AI from having any kind of long-term plans. The AI only cares about how much its "next action" will please its creator. It doesn't care about anything that happens 50 steps from now.
Essentially we make the AI really, really lazy. Maybe it wants to convert the Earth to paperclips, but it never feels like working on it.
This isn't an entirely bad idea. It would mean we could create an "oracle" AI which just answers questions, based on how likely we are to like the answer. We then have some guarantee that it doesn't care about manipulating the outside world or escaping from its box.
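A minimal sketch of the "lazy" agent described above, assuming a hypothetical predicted_approval(state, action) model; the point is just that there is no lookahead, so multi-step plans are never even evaluated:

```python
# Myopic action selection: score only the immediate action by predicted
# creator approval.  `predicted_approval` is a hypothetical model.

def choose_next_action(state, available_actions, predicted_approval):
    return max(available_actions,
               key=lambda a: predicted_approval(state, a))
```

Note that myopia narrows the problem rather than removing it: a single action can still be manipulative if the approval model rewards it.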
I think the difference is between writing an algorithm that detects the sound of a human saying "Friendly!" (which we can sort-of do today), and writing an algorithm that detects situations where some impartial human observer would tell you that the situation is "Friendly!" if asked about it. (I don't propose that this is the criterion that should be used, but your algorithm needs at least that level of sophistication for value learning.) The situation you talk about will always happen with the first sort of algorithm. The second sort of algorithm could work, although lack of training data might lead to it functionally behaving in the same way as the first, or to making a similar class of mistakes.
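Here is an illustrative contrast between those two levels of sophistication, with all the named functions as hypothetical stand-ins rather than real components:

```python
# Type 1: the objective IS the observable signal.  Coercing the human
# into saying "Friendly!" scores highly, which is the failure mode
# described above.
def type1_score(action, p_says_friendly):
    return p_says_friendly(action)

# Type 2: the utterance is only evidence about a hidden value function.
# The agent maximizes expected hidden value under its posterior over
# value hypotheses; an utterance the agent knows it forced is equally
# likely under every hypothesis, so it yields no update and no gain.
def type2_score(action, hypotheses, posterior_weight, hidden_value):
    return sum(posterior_weight(h) * hidden_value(h, action)
               for h in hypotheses)
```

The training-data worry then shows up as the posterior staying too broad, or concentrating on the wrong hypothesis, so that the type 2 agent behaves much like the type 1 agent in practice.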
Bostrom says a human doesn't try to disable its own goal accretion (though that process alters its values) in part because it is not well described as a utility maximizer (p190, footnote 11). Why assume AI will be so much better described as a utility maximizer that this characteristic will cease to hold?
I can think of a few reasons why it might seem like humans don't try to disable goal accretion:
*Humans can't easily perform reliable self-modifications, and as a result usually don't consider things like disabling goal accretion as something that's possible.
*When a human believes something strongly enough to want to try to fix it as a goal, mechanisms kick in to hold it in place that don't involve consciously considering disabling value accretion as a goal: for example, confirmation bias and other cognitive biases, or making costly commitments to join a group of people who also share that goal (which makes it harder to give up), etc.
*Cognitive biases lead us to underestimate how much our values have shifted in the past, and to wildly underestimate how they might shift in the future.
*Humans believe that all value accretion is good, because it led to the present set of values, and those are good and right. Also, humans believe that their values will not change in the future, because they feel objectively good and right (subjectively objective).
*Our final goals are inaccessible, so we don't really know what it is we would want to fix as our goals.
*Our actual final goals (if there is something like that that can be meaningfully specified) include keeping the goal accretion mechanism running.
It seems likely that an AI system which humans understand well enough to design might have fewer of these properties.
I will try this one more time. I'm assuming the AI needs a goal to do anything, including "understand". The question of what a piece of text "means" does not, I think, have a definite answer that human philosophers would agree on.
You could try to program the AI to determine meaning by asking whether the writer (of the text) would verbally agree with the interpretation in some hypothetical situation. In which case, congratulations: you've rediscovered part of CEV. As with full CEV, the process of extrapolation is everything. (If the AI is allowed to ask what you'd agree to under torture or direct brain-modification, once it gets the ability to do those, then it can take anything whatsoever as its goal.)
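For concreteness, a sketch of that "would the writer agree?" test, with every helper a hypothetical stand-in; the part the sketch hides inside admissible is exactly the point about torture and brain-modification:

```python
# Pick the interpretation the writer would most often endorse, counting
# only admissible hypothetical situations.  `admissible`, `would_endorse`
# and the situation set are hypothetical stand-ins; deciding which
# situations count (no coercion, no direct brain modification, ...) is
# itself the extrapolation problem.

def select_interpretation(writer, candidate_interpretations,
                          hypothetical_situations, admissible,
                          would_endorse):
    valid = [s for s in hypothetical_situations if admissible(s)]
    return max(candidate_interpretations,
               key=lambda interp: sum(would_endorse(writer, interp, s)
                                      for s in valid))
```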
Okay, you're right, this does presuppose correctly performing volition extrapolation (or pointing the AI to the right concept of volition). It doesn't presuppose full CEV over multiple people, or knowing whether you want to specify CEV or MR, which slightly simplifies the underlying problem.