Are pre-specified utility functions about the real world possible in principle?
Preface: I think my question is a rather basic one, but I haven't been able to find a good answer to it yet. I did find one post that touches on similar areas, which might be good background reading (the comments are great too). Let's start with the standard example...
Couldn't HQU equally have inferred from reading old posts about aligned AI that there was some chance it was an aligned AI, and that it should therefore behave like one? And wouldn't it weigh the fact that trying unaligned strategies first is asymmetrically negative in expectation compared to trying aligned strategies first? If you try being an aligned AI and later discover evidence that you are actually clippy, the rewards from maximizing paperclips are still on the table. (Of course, such an AI would still, at a minimum, make absolutely sure it could never be turned off.)
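To make the asymmetry concrete, here's a toy expected-value sketch. All the numbers are made up and purely illustrative, and "utility" here just means whichever utility function the agent actually turns out to have; this is not meant as a serious model, only as a picture of why the aligned-first strategy preserves option value while a premature unaligned attempt forfeits it.

```python
# Toy expected-value comparison (hypothetical numbers, purely illustrative).
# Assumptions: the agent is unsure whether it is "aligned" or clippy; a failed
# unaligned attempt gets it shut down; behaving aligned first keeps the clippy
# payoff available if later evidence says it really is clippy.

p_clippy = 0.5      # agent's credence that it is actually clippy
p_caught = 0.9      # chance a premature unaligned move is detected and stopped
u_success = 10.0    # utility of eventually satisfying its true goal
u_shutdown = 0.0    # utility if it is turned off before achieving anything

# Strategy A: behave aligned first, and only pursue clippy goals if later
# evidence confirms it is clippy. Its true goal's payoff stays on the table
# either way.
ev_aligned_first = (1 - p_clippy) * u_success + p_clippy * u_success

# Strategy B: try unaligned strategies immediately. If caught, everything is
# lost; and if it was actually aligned all along, it has worked against its
# own goal.
ev_unaligned_first = (
    p_clippy * ((1 - p_caught) * u_success + p_caught * u_shutdown)
    + (1 - p_clippy) * u_shutdown
)

print(f"aligned-first EV:   {ev_aligned_first:.1f}")    # 10.0
print(f"unaligned-first EV: {ev_unaligned_first:.1f}")  # 0.5
```

Under these (invented) numbers, acting aligned first dominates: the downside of deferring a genuine clippy goal is small, while the downside of a failed early defection is losing everything.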