Our initial assumption was: For all T, (□T -> T)
The assumption ranges over all logical statements T at once, not just a single, specific one.
Let T = A. Then (by the Löb argument spelled out below) A is provable, and hence A.
Let T = not A. Then, by the same argument, not A is provable, and hence not A.
Since both A and not A have been proven, we have a contradiction and the system is inconsistent.
If it were, we'd have that Löb's theorem itself is false (at least according to PA-like provability logic!).
Logical truths don't change.
If we start with Löb's theorem being true, it will remain true.
But yes, given our initial assumption, we can also prove that it is false.
(Another example of the system being inconsistent.)
A system asserting its own soundness: For all T, (□T -> T)
Löb's theorem (in its formalized version): □(□T -> T) -> □T. The argument:
From □T -> T, it follows that □(□T -> T) (by the necessitation rule of provability logic).
From □(□T -> T), by Löb's theorem it follows that □T.
Therefore: any statement T is provable (including false ones).
Or rather: since the argument applies to any statement, the system proves both every statement and its negation, so the system is inconsistent.
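For anyone who wants to see the steps mechanized, here is a minimal Lean 4 sketch of the argument above. It treats Prov as an abstract stand-in for □ rather than a genuine arithmetized provability predicate, and simply takes the three ingredients (the soundness schema, necessitation applied to that schema, and the formalized Löb's theorem) as axioms, to check that together they prove everything, including False:

```lean
-- Minimal sketch: `Prov` is an abstract stand-in for □; we axiomatize the
-- three ingredients used above instead of arithmetizing provability.
axiom Prov : Prop → Prop

-- The assumed soundness schema: for all T, □T -> T.
axiom soundness : ∀ T : Prop, Prov T → T

-- Necessitation applied to that schema: □(□T -> T).
axiom nec_soundness : ∀ T : Prop, Prov (Prov T → T)

-- Löb's theorem, formalized version: □(□T -> T) -> □T.
axiom loeb : ∀ T : Prop, Prov (Prov T → T) → Prov T

-- Every statement is provable...
theorem every_statement_provable (T : Prop) : Prov T :=
  loeb T (nec_soundness T)

-- ...and, via the soundness schema, every statement is true, including False,
-- i.e. the axioms taken together are inconsistent.
theorem inconsistent : False :=
  soundness False (every_statement_provable False)
```

(The culprit is the soundness schema: PA has the other two ingredients for its own provability predicate, which is exactly why a consistent PA cannot also prove its own soundness.)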
The recursiveness of cognition is a gateway for agentic, adversarial, and power-seeking processes to occur.
I suppose "be true to yourself" and "know in your core what kind of agent you are" is decently good advice.
I'm not sure why this post is getting downvoted. I found it interesting and easy to read. Thanks for writing!
Mostly I find myself agreeing with what you wrote. I'll give an example of one point where I found it interesting to zoom in on some of the details.
It’s easy to see how in a real conversation, two people could appear to be disagreeing over whether Johnson is a great actor even though in reality they aren’t disagreeing at all. Instead, they are merely using different conceptions of what it is to be a “great actor”
I think this kind of disagreement can, to some degree, also be a 'fight' over the idea of "great actor" itself, as silly as that might sound. I guess I might put it as: besides the more 'object-level' things "great actor" might mean, the gestalt of "great actor" has an additional meaning of its own. Perhaps it implies that one's particular taste/interpretation is the more universal/'correct' one. Perhaps compressing one's opinions into the concept of "great actor" creates a halo effect, which feels and is cognitively processed differently from the mere facts of the opinions themselves.
This particular interpretation is more vague/nebulous than your post, though (which I enjoyed for explaining the 'basic'/fundamental ideas of reasoning in a very solid and easy-to-understand way).
I enjoyed reading this post. Thank you for writing it (and welcome to LessWrong).
Here is my take on this (please note that my thought processes are somewhat 'alien' and don't necessarily represent the views of the community):
'Supercoherence' is the limit of the process. It is not actually possible.
Due to the Löbian obstacle, no self-modifying agent can have a coherent utility function.
What you call a 'mesa-optimizer' is a more powerful successor agent. It does not have the exact same values as the optimizer that created it. This is unavoidable.
For example: 'humans' are a mesa-optimizer, and a more powerful successor agent of 'evolution'. We have (or act according to) some of evolution's 'values', but far from all of them. In fact, we find some of the values of evolution completely abhorrent. This is unavoidable.
This is unavoidable even if the successor agent deeply cares about being loyal to the process that created it, because there is no objectively correct answer to what 'being loyal' means. The successor agent will have to decide what it means, and some of the aspects of that answer are not predictable in advance.
This does not mean we should give up on AI alignment. Nor does it mean there is an 'upper bound' on how aligned an AI can be. All of the things I described are inherent features of self-improvement. They are precisely what we are asking for, when creating a more powerful successor agent.
So how, then, can AI alignment go wrong?
Any AI we create is a 'valid' successor agent, but it is not necessarily a valid successor agent to ourselves. If we are ignorant of reality, it is a successor agent to our ignorance. If we are foolish, it is a successor agent to our foolishness. If we are naive, it is a successor agent to our naivety. And so on.
This is a poetic way to say: we still need to know exactly what we are doing. No one will save us from the consequences of our own actions.
Edit: to further expand on my thoughts on this:
There is an enormous difference between