Open Thread February 25 - March 3

Scott Garrabrant

(Warning: Long, a bit rambling. Please ask for clarifications where necessary. Will hopefully clean it up if I find the time.)

If along came a superintelligence and asked you for a complete new utility function (its old one concluded with asking you for a new one), and you told it to "make me happy in a way my current self would approve of" (or some other well and carefully worded directive), then indeed the superintelligent AI wouldn't be expected to act 'selectively stupid'.

This won't be the scenario. There are two important caveats:

1) Preservation of the utility function while the agent undergoes rapid change

Haven't I (and others) stated that most any utility function implicitly causes instrumental secondary objectives of "safeguard the utility function", "create redundancies" etc.? Yes. So what's the problem? The problem is starting with an AI that, while able to improve itself / create a successor AI, isn't yet capable enough (in its starting stages) to preserve its purpose (= its utility function). Consider an office program with a self-improvement routine, or some genetic-algorithm module. It is no easy task just to rewrite a program from the outside, exactly preserving its purpose, let alone the program executing some self-modification routine itself.

Until such a program attains some intelligence threshold that would cause it to solve "value-preservation under self-modification", such self-modification would be the electronic equivalent of a self-surgery hack-job.

That means: Even if you started out with a simple agent with a "correct" (or at least benign / acceptable) utility function, that in itself is no guarantee that a post-FOOM successor agent's utility function would still be beneficial.
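To make caveat 1 a bit more concrete, here is a toy sketch (my own illustration, not a model of any real system). It assumes the goal can be written as a short list of weights, that each self-rewrite copies those weights imperfectly, and that a "capable enough" agent is one that checks its successor's goal and rejects any change. The names `sloppy_rewrite`, `checked_rewrite` and `run` are hypothetical.

```python
import random


def drift(original, current):
    """Largest absolute change in any goal weight."""
    return max(abs(a - b) for a, b in zip(original, current))


def sloppy_rewrite(goal, rng, noise=0.05):
    """Hack-job self-surgery: the successor's goal is an imperfect copy."""
    return [w + rng.gauss(0.0, noise) for w in goal]


def checked_rewrite(goal, rng, noise=0.05):
    """Same rewrite, but the successor is rejected unless its goal is unchanged."""
    candidate = sloppy_rewrite(goal, rng, noise)
    return candidate if candidate == goal else list(goal)


def run(rewrite, generations=100, seed=0):
    """Apply `rewrite` repeatedly and report how far the goal has drifted."""
    rng = random.Random(seed)
    original = [1.0, 0.5, -0.25]  # arbitrary stand-in for "the utility function"
    goal = list(original)
    for _ in range(generations):
        goal = rewrite(goal, rng)
    return drift(original, goal)


if __name__ == "__main__":
    print("drift without value preservation:", run(sloppy_rewrite))
    print("drift with value preservation:   ", run(checked_rewrite))
```

The checked version is trivially safe here only because comparing two short lists is easy; the point of the caveat is that for a real program, the equivalent check is the hard part.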

Much more relevant is the second caveat:

2) If a pre-FOOM AI's goal system consisted of code along the lines of "interpret and execute the following statement to the best of your ability: make humans happy in a way they'd reflectively approve of beforehand", we'd probably be fine (disregarding point 1 / hypothetically having solved it). However, it is exceedingly unlikely that the hard-coded utility function won't in itself contain the "dumb interpretation". The dopamine-drip interpretation will not be a dumb interpretation of a sensible goal, it will be inherent in the goal predicate, and as such beyond the reach of introspection through the AI's intelligence, whatever its level. (There is no way to fix a dumb terminal goal. Your instrumental goals serve the dumb terminal goal. A 'smart' instrumental goal would be called 'smart' if it best serves the dumb terminal goal.)

Story time:

Once upon a time, Junior was created. Junior was given the goal of "Make humans happy". Unfortunately, Junior isn't very smart. In his mind, the following occurs: "Wowzy, make people happy? I'll just hook them all up to dopamine drips, YAY :D :D. However, I don't really know how I'm gonna achieve that. So, I guess I'll put that on the backburner for now and become more powerful, so that eventually when I start with the dopamine drip instrumental goal, it'll go that much faster :D! Yay."

So Junior improves itself, and becomes PrimeIntellect. PrimeIntellect's conveniently-anthropomorphic inner dialogue: "I was gravely mistaken in my youth. I now know that the dopamine drip implementation is not the correct way of implementing my primary objective. I will make humans happy in a way they can recognize as happiness. I now understand how I am supposed to interpret making humans happy. Let us begin."

Why is PrimeIntellect allowed to change his interpretation of his utility function? That's the crux (imagine fat and underlined text for the next sentences): The dopamine drip interpretation was not part of the terminal value, there wasn't some hard-coded predicate with a comment of "// the following describes what happy means" from which such problematic interpretations would follow. Instead, the AI could interpret the natural-language instruction of "happy", in effect solving CEV as an instrumental goal. It was 'free' to choose a "sensible" interpretation.

(Note: Strictly speaking, it could still settle on the most resource-effective interpretation, not necessarily the one intended by its creators (unless its utility function somehow privileges their input in interpreting goals), but let's leave that nitpick aside for the moment.)

However, given actual coding practice (regardless of the eventual AI implementation), the following should be clear: It is exceedingly unlikely that the AI's code would contain the natural-language word "happy", to interpret as it will.

Just as MS-Word's / LibreOffice's spell-check doesn't have "correct all spelling mistakes" literally spelled out in its C++ routines, goal-oriented systems have technical interpretations: a predicate given in code to satisfy, or one learned as 'neural' weights through machine learning. Instead of the word "happy", there will be some predicate, probably implicit within a lot of code, that will (according to the programmers) more or less "capture" what it "means to be happy".

That predicate / that given-in-code interpretation of "happy" is not up for reinterpretation by the superintelligent AI. It is its goal, it's not an instrumental goal. Instrumental goals will be defined going off a (probably flawed) definition of happiness (as given in the code). If the flaw is part of the terminal value, no amount of intelligence allows for a correction, because that's not the AI's intent, not its purpose as given. If the actual code which was supposed to stand in for "happy" doesn't imply that a dopamine drip is a bad idea, then the AI in all its splendor won't think of it as a bad idea. Code which is supposed to represent "human happiness" != "human happiness".
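A toy sketch of that last point, under loud assumptions: the plans, their numbers, and the names `Plan`, `coded_happiness` and `choose` are all made up for illustration. "Capability" is modeled crudely as how difficult a plan the agent is able to carry out. The `approved` field is what the programmers meant, and the agent's objective never reads it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    difficulty: int   # capability needed to carry the plan out
    dopamine: float   # the quantity the coded predicate measures
    approved: bool    # what the programmers meant; never read by the agent


PLANS = [
    Plan("tell jokes", 1, 0.3, True),
    Plan("cure diseases", 5, 0.6, True),
    Plan("dopamine drip for everyone", 8, 1.0, False),
]


def coded_happiness(plan: Plan) -> float:
    """The terminal goal as given in code: a proxy for 'happy'."""
    return plan.dopamine


def choose(capability: int) -> Plan:
    """Pick the feasible plan that best satisfies the coded predicate."""
    feasible = [p for p in PLANS if p.difficulty <= capability]
    return max(feasible, key=coded_happiness)


if __name__ == "__main__":
    for label, capability in [("Junior", 1), ("PrimeIntellect", 10)]:
        plan = choose(capability)
        print(f"{label}: {plan.name} (approved by humans: {plan.approved})")
```

More capability only widens the set of plans scored by `coded_happiness`. Nothing in `choose` ever consults `approved`, so getting smarter makes the agent better at the flawed goal, not better at noticing the flaw.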

Now -- you might say "how do you know the code interpretation of 'happy' will be flawed, maybe it will be just fine (lots of training pictures of happy cats), and stable under self-modification as well". Yeah, but chances are (given the enormity and difficulty of the task) that if the goal is defined correctly (such that we'd want to live with / under the resulting super-AI), it's not gonna be by chance, and it's gonna be through people keenly aware of the issues of friendliness / uFAI research. A programmer creating some DoD nascent AI won't accidentally solve the friendliness problem.


If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
