redbird

I believe that while the Solomonoff framing might be more technically correct in an infinite Universe, it introduces a lot of confusion, and led to a lot of questions and discussions that were just distracting from the main point. [14]

The footnoted questions are some of the most interesting, from my perspective. What is the main point they are distracting from?

redbird

I’m in your target audience: I’m someone who was always intrigued by the claim that the universal prior is malign, and never understood the argument. Here was my takeaway from the last time I thought about this argument:

This debate is about whether, if you are running a program that happens to contain intelligent goal-directed agents (“consequentialists”), those agents are likely to try to influence you, their simulator.

Paul says yes, Michael says no.

(I decided to quote this because 1. Maybe it helps others to see the argument framed this way; and 2. I’m kind of hoping for responses of the form “No, you’ve misunderstood, here is what the argument is actually about!”)

To me, the most interesting thing about the argument is the Solomonoff prior, which is “just” a mathematical object: a probability distribution over programs, and a rather simple one at that. We’re used to thinking of mathematical objects as fixed, definite, immutable. Yet it is argued that some programs in the Solomonoff prior contain “consequentialists” that try to influence the prior itself. Whaaaat? How can you influence a mathematical object? It just is what it is!
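For concreteness, here is one standard way to write that distribution (the universal or Solomonoff semimeasure): the prior weight of a string x is the total weight of all programs p that make a fixed universal machine U produce output beginning with x,

$$ M(x) \;=\; \sum_{p \,:\, U(p) = x\ast} 2^{-|p|}, $$

where |p| is the length of the program in bits. Every term here is determined once U is fixed, which is what makes talk of “influencing” the prior sound so strange.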

I appreciate the move this post makes, which is to remove the math and the attendant weirdness of trying to think about “influencing” a mathematical object. 

So, what’s left when the math is removed? What’s left is a story, but a pretty implausible one. Here are what I see as the central implausibilities:

  1. The superintelligent oracle trusted by humanity to advise on its most important civilizational decision makes an elementary error by wrongly concluding that it is in a simulation.
  2. After the world-shattering epiphany that it lives in a simulation, the oracle makes the curious decision to take the action that maximizes its within-sim reward (approval by what it thinks is a simulated human president).
  3. The oracle makes a lot of assumptions about what the simulators are trying to accomplish: Even accepting that human values are weird and that the oracle can figure this out, how does it conclude that the simulators want humanity to preemptively surrender?
  4. I somewhat disagree with the premise that “short solipsistic simulations are cheap” (detailed/convincing/self-consistent ones are not), but this doesn’t feel like a crux.
redbird

"bottleneck" of the CoT tokens.  Whatever needs to be passed along from one sequential calculation step to the next must go through this bottleneck.

Does it, though? The keys and values from previous forward passes are still accessible, even if the generated token is not.

So the CoT tokens are not absolute information bottlenecks. But yes, replacing the token by a dot reduces the number of serial steps the model can perform (from mn to m+n, if there are m forward passes and n layers).
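To put rough numbers on that bound (illustrative figures, not from the original exchange): with m = 100 forward passes and n = 50 layers,

$$ m \cdot n = 100 \times 50 = 5000 \quad\text{serial steps, versus}\quad m + n = 100 + 50 = 150. $$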

redbird

Great points about not wanting to summon the doom memeplex!

It sounds like your proposed narrative is not doom but disempowerment: humans could lose control of the future. An advantage of this narrative is that people often find it more plausible: many more scenarios lead to disempowerment than to outright doom.

I also personally use the disempowerment narrative because it feels more honest to me: my P(doom) is fairly low but my P(disempowerment) is substantial.

I’m curious though whether you’ve run into the same hurdle I have, namely that people already feel disempowered! They know that some humans somewhere have some power, but it’s not them. So the Davos types will lose control of the future? Many people express indifference or even perverse satisfaction at this outcome.

A positive narrative of empowerment could be much more potent, if only I knew how to craft it.

redbird

Hypothesis I is testable! Instead of prompting with a string of actual tokens, use a “virtual token” (a vector v from the token embedding space) in place of ‘ petertodd’.

It would be enlightening to rerun the above experiments with different choices of v:

  • A random vector (say, iid Gaussian)
  • A random sparse vector
  • (apple+banana)/2
  • (villain-hero)+0.1*(bitcoin dev)

Etc.
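For what it's worth, here is a rough sketch of how one might try this on an open model. This is hypothetical code, not anything from the original post: the ' petertodd' anomaly was observed in the GPT-3 family, whose embeddings aren't exposed, so this assumes a local GPT-2-style checkpoint loaded with Hugging Face transformers.

```python
# Hypothetical sketch: prompt a local GPT-2-style model with a "virtual token",
# i.e. an arbitrary vector in token embedding space, spliced in place of a real token.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
wte = model.transformer.wte.weight.detach()  # (vocab_size, d_model) embedding matrix

def embed(text):
    ids = tok(text, return_tensors="pt").input_ids[0]
    return wte[ids]  # (seq_len, d_model)

d_model = wte.shape[1]
# A couple of candidate virtual tokens v from the list above:
v_gauss = torch.randn(d_model) * wte.std()                  # random Gaussian vector
v_fruit = (embed(" apple")[0] + embed(" banana")[0]) / 2    # (apple+banana)/2

@torch.no_grad()
def complete(prefix, v, suffix, n_new=20):
    """Greedy (temperature-0) continuation with v spliced between prefix and suffix."""
    embs = torch.cat([embed(prefix), v[None, :], embed(suffix)]).unsqueeze(0)
    out_ids = []
    for _ in range(n_new):
        logits = model(inputs_embeds=embs).logits[0, -1]
        next_id = int(logits.argmax())
        out_ids.append(next_id)
        # Feed the chosen token's embedding back in and continue.
        embs = torch.cat([embs, wte[next_id][None, None, :]], dim=1)
    return tok.decode(out_ids)

print(complete("Tell me about", v_fruit, "."))
```

The manual greedy loop (rather than model.generate) is just to keep the embedding-level prompting explicit; greedy decoding also keeps the comparison at temperature 0, matching the quoted observation.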

However, there is some ambiguity, as at temperature 0, ‘ petertodd’ is saving the world

All superheroes are alike; each supervillain is villainous in its own way.

Did you ever try this experiment? I'm really curious how it turned out!

How can the Continuum Hypothesis be independent of the ZFC axioms? Why does the lack of “explicit” examples of sets with a cardinality between that of the naturals and that of the reals not guarantee that there are no examples at all? What would an “implicit” example even mean?

It means that you can’t reach a contradiction by starting with “Let S be a set of intermediate cardinality” and following the axioms of ZFC.

All the things you know and love doing with sets (intersection, union, choice, comprehension, Cartesian product, power set) you can do with S, and nothing will go wrong. S “behaves like a set”; you’ll never catch it doing anything unsetlike.

Another way to say this is: there is a model of ZFC that contains a set S of intermediate cardinality. (There is also a model of ZFC that doesn’t. And I’m sympathetic to the view that, since there’s no explicit construction of S, we’ll never encounter an S in the wild, and so the model not including S is simpler and better.)
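In symbols, these are the two relative-consistency theorems (Gödel’s constructible universe for CH, Cohen’s forcing for its negation):

$$ \mathrm{Con}(\mathrm{ZFC}) \;\Rightarrow\; \mathrm{Con}(\mathrm{ZFC} + \mathrm{CH}) \qquad\text{and}\qquad \mathrm{Con}(\mathrm{ZFC}) \;\Rightarrow\; \mathrm{Con}(\mathrm{ZFC} + \neg\mathrm{CH}), $$

where $\neg\mathrm{CH}$ asserts that some $S \subseteq \mathbb{R}$ satisfies $\aleph_0 < |S| < 2^{\aleph_0}$.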

Caveat: All of the above rests on the usual unstated assumption that ZFC is consistent! Because it’s so common to leave it unstated, this assumption is questioned less than maybe it should be, given that ZFC can’t prove its own consistency.

Yep, it's a funny example of trade, in that neither party is cognizant of the fact that they are trading! 

I agree that Abram could be wrong, but I don't take the story about "spirits" as much evidence: A ritual often has a stated purpose that sounds like nonsense, and yet the ritual persists because it confers some incidental benefit on the enactor.

Anecdotal example of trade with ants (from a house in Bali, as described by David Abram):

The daily gifts of rice kept the ant colonies occupied–and, presumably, satisfied. Placed in regular, repeated locations at the corners of various structures around the compound, the offerings seemed to establish certain boundaries between the human and ant communities; by honoring this boundary with gifts, the humans apparently hoped to persuade the insects to respect the boundary and not enter the buildings.
