Reflective oracles and superrationality
This grew out of an exchange with Jessica Taylor during MIRI's recent visit to the FHI. I'm still getting a feel for the fixed-point approach, so let me know of any errors.
People at MIRI have recently proved that agents can use reflective oracles to reason about other agents (including other agents equipped with oracles) and about themselves, and consistently reach Nash equilibria. But can we do better than that?
To recap, a reflective oracle is a machine O such that:
- P(A()=1) > p implies O(A,p)=1
- P(A()=0) > 1-p implies O(A,p)=0
This works even if A() includes a call to the oracle within its own code. All the algorithms used here clearly terminate, so we get the other two implications as well (e.g. P(A()=1) < p implies O(A,p)=0). And given any δ, we can establish the probability of A() to within δ with on the order of log(1/δ) questions. Thus we will write O(A()==a)=p to mean that O(A()==a, (n-1)δ/2)=1 and O(A()==a, (n+1)δ/2)=0, where (n-1)δ/2 < p < (n+1)δ/2.
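As a sanity check on that log(1/δ) claim, here is a minimal binary-search sketch. The `oracle` callable is a hypothetical stand-in for O (assumed to answer 1 when P(A()=1) > p and 0 when P(A()=0) > 1-p); nothing here is the actual reflective-oracle construction.

```python
# Minimal sketch of the binary search behind the "order log(1/delta)
# questions" claim.  `oracle(A, p)` is a hypothetical stand-in for O(A, p):
# it returns 1 when P(A() = 1) > p and 0 when P(A() = 0) > 1 - p.

def estimate_probability(oracle, A, delta):
    """Bracket P(A() = 1) to within delta using about log2(1/delta) oracle calls."""
    lo, hi = 0.0, 1.0
    while hi - lo > delta:
        mid = (lo + hi) / 2
        if oracle(A, mid) == 1:   # the oracle says P(A() = 1) is at least mid
            lo = mid
        else:                     # the oracle says P(A() = 1) is at most mid
            hi = mid
    return (lo + hi) / 2          # the p we write as O(A() == a) = p
```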
Note also that O can be used to produce probabilistic outputs (to within δ), so specific mixed strategies are possible.
Reflection in Probabilistic Logic
Paul Christiano has devised a new fundamental approach to the "Löb Problem" wherein Löb's Theorem seems to pose an obstacle to AIs building successor AIs, or adopting successor versions of their own code, that trust the same amount of mathematics as the original. (I am currently writing up a more thorough description of the question this preliminary technical report is working on answering. For now the main online description is in a quick Summit talk I gave. See also Benja Fallenstein's description of the problem in the course of presenting a different angle of attack. Roughly the problem is that mathematical systems can only prove the soundness of, aka 'trust', weaker mathematical systems. If you try to write out an exact description of how AIs would build their successors or successor versions of their code in the most obvious way, it looks like the mathematical strength of the proof system would tend to be stepped down each time, which is undesirable.)
Paul Christiano's approach is inspired by the idea that whereof one cannot prove or disprove, thereof one must assign probabilities: and that although no mathematical system can contain its own truth predicate, a mathematical system might be able to contain a reflectively consistent probability predicate. In particular, it looks like we can have:
∀a, b: (a < P(φ) < b) ⇒ P(a < P('φ') < b) = 1
∀a, b: P(a ≤ P('φ') ≤ b) > 0 ⇒ a ≤ P(φ) ≤ b
Suppose I present you with the human and probabilistic version of a Gödel sentence, the Whitely sentence "You assign this statement a probability less than 30%." If you disbelieve this statement, it is true. If you believe it, it is false. If you assign 30% probability to it, it is false. If you assign 29% probability to it, it is true.
Paul's approach resolves this problem by restricting your belief about your own probability assignment to within epsilon of 30% for any epsilon. So Paul's approach replies, "Well, I assign almost exactly 30% probability to that statement - maybe a little more, maybe a little less - in fact I think there's about a 30% chance that I'm a tiny bit under 0.3 probability and a 70% chance that I'm a tiny bit over 0.3 probability." A standard fixed-point theorem then implies that a consistent assignment like this should exist. If asked if the probability is over 0.2999 or under 0.30001 you will reply with a definite yes.
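As a toy numerical illustration of that fixed-point claim (and not Paul's actual construction), we can soften the Whitely sentence's 30% threshold with a sigmoid of width ε and look for a credence that is consistent with itself; a fixed point exists for every ε and tends to 0.3 as ε shrinks:

```python
import math

# Toy model (not Christiano's construction): soften the Whitely sentence
# "you assign this statement a probability less than 30%" with a sigmoid of
# width eps, then look for a self-consistent credence q = truth_value(q).

def soft_truth(q, eps):
    """Smoothed truth value of 'q < 0.3': near 1 well below 0.3, near 0 well above."""
    return 1.0 / (1.0 + math.exp((q - 0.3) / eps))

def consistent_credence(eps, tol=1e-12):
    """Find the fixed point q = soft_truth(q, eps) by bisection (the map is decreasing)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if soft_truth(mid, eps) > mid:   # fixed point lies above mid
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for eps in (0.1, 0.01, 0.001):
    print(eps, consistent_credence(eps))   # approaches 0.3 as eps shrinks
```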
Precommitting to paying Omega
Related to: Counterfactual Mugging, The Least Convenient Possible World
"What would you do in situation X?" and "What would you like to pre-commit to doing, should you ever encounter situation X?" should, to a rational agent, be one and the same question.
Applied to Vladimir Nesov's counterfactual mugging, the reasoning is then:
Precommitting to paying $100 to Omega has an expected utility of $4950 · p(Omega appears). Not precommitting has strictly less utility; therefore I should precommit to paying. Therefore I should, in fact, pay $100 in the event (Omega appears, coin is tails).
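Where does the $4950 come from? Using the payoffs from Nesov's original setup (a fair coin; Omega gives you $10,000 on heads if and only if you would have paid $100 on tails), precommitting has an expected value, conditional on Omega appearing, of 0.5 × $10,000 + 0.5 × (−$100) = $4,950, while not precommitting yields 0.5 × $0 + 0.5 × $0 = $0.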
To combat the argument that it is more likely that one is insane than that Omega has appeared, Eliezer said:
So imagine yourself in the most inconvenient possible world where Omega is a known feature of the environment and has long been seen to follow through on promises of this type; it does not particularly occur to you or anyone that believing this fact makes you insane.
My first reaction was that it is simply not rational to give $100 away when nothing can possibly happen in consequence. I still believe that, with a small modification: I believe, with moderately high probability, that it will not be instrumentally rational for my future self to do so. Read on for the explanation.
Which Parts Are "Me"?
Everything I am, is surely my brain; but I don't accept everything my brain does, as "me".
Though I don't like the fact, it does seem in my case to be true that reflecting upon a passion can diminish it. It does seem that in taking a step back from myself to look at what my brain is doing, I introduce a greater emotional distance from the world than when I am not reflecting. Though I am extremely cautious when it comes to the moral issue of how to modify human nature, this is one thing that I would change, I think.
The point at which you start thinking that not everything that goes on in your brain is to be identified with "you", is something of a Rubicon, I fear. And those who prize the intensity of a simple passionate existence, above success in life or a complex layered personality, should not cross that Rubicon.
(Though in this case you should accept advice from rationalists about practical policies for guarding yourself, because this is a dangerous world in which to be unreflective. For example: if you don't follow the link to the rest of this post, then sign the #$!@ up for cryonics regardless of what your simple passions say on the subject.)
Mind Projection Fallacy
Followup to: How an Algorithm Feels From Inside
In the dawn days of science fiction, alien invaders would occasionally kidnap a girl in a torn dress and carry her off for intended ravishing, as lovingly depicted on many ancient magazine covers. Oddly enough, the aliens never go after men in torn shirts.
Would a non-humanoid alien, with a different evolutionary history and evolutionary psychology, sexually desire a human female? It seems rather unlikely. To put it mildly.
People don't make mistakes like that by deliberately reasoning: "All possible minds are likely to be wired pretty much the same way, therefore a bug-eyed monster will find human females attractive." Probably the artist did not even think to ask whether an alien perceives human females as attractive. Instead, a human female in a torn dress is sexy—inherently so, as an intrinsic property.
They who went astray did not think about the alien's evolutionary history; they focused on the woman's torn dress. If the dress were not torn, the woman would be less sexy; the alien monster doesn't enter into it.