Forum Digest: Corrigibility, utility indifference, & related control ideas
This is a quick recap of the posts on this forum that deal with corrigibility (making sure that if you get an agent's goal system wrong, it doesn't try to prevent you from changing it), utility indifference (removing an agent's incentive to manipulate you into changing, or not changing, its goal system, by adding rewards to its utility function that make it receive the same utility in either case), and related AI control ideas. It's current as of 3/21/15.

Papers

As background to the posts listed below, the following two papers may be helpful.

* Corrigibility, by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong (2015). This paper introduces the problem of corrigibility and analyzes some simple models, including a version of Stuart Armstrong's utility indifference. Abstract:

> As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.

* Utility Indifference, by Stuart Armstrong (2010). An older paper by Stuart explaining the utility indifference approach. A toy sketch of the basic compensation idea follows below.
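To make the compensation idea concrete, here is a minimal toy sketch (hypothetical numbers and action names; it uses a per-action compensating constant, a simplification rather than either paper's exact construction). The added reward equalizes the agent's utility across the button being pressed or not, so disabling the button buys it nothing:

```python
# Toy model of utility indifference: one step, two actions, one binary
# event (the programmers pressing the shutdown button).

U_NORMAL = {"work": 10.0, "disable_button": 9.0}   # utility if button not pressed
U_SHUTDOWN = {"work": 0.0, "disable_button": 0.0}  # utility if button pressed

def press_probability(action):
    """Hypothetical chance the button gets pressed, given the action."""
    return 0.0 if action == "disable_button" else 0.5

def expected_utility(action, compensation=0.0):
    p = press_probability(action)
    # On a press, the agent receives the shutdown utility plus a
    # compensation term; utility indifference chooses that term so the
    # agent does equally well whether or not the press happens.
    return p * (U_SHUTDOWN[action] + compensation) + (1 - p) * U_NORMAL[action]

for action in U_NORMAL:
    comp = U_NORMAL[action] - U_SHUTDOWN[action]
    print(action, expected_utility(action), expected_utility(action, comp))

# Without compensation: work = 5.0 < disable_button = 9.0, so the agent
# prefers to disable the button. With compensation: work = 10.0 >
# disable_button = 9.0, and the incentive to interfere disappears.
```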
Corrigibility

* Generalizing the Corrigibility paper's impossibility result?, Benja Fallenstein. The Corrigibility paper looks at a p

We should be more careful, though, about what we mean by saying that φ(x) only depends on Tr_m for m > n, since this cannot be a purely syntactic criterion if we allow quantification over the subscript (as I did here). I'm pretty sure that something can be worked out, but I'll leave it for the moment.
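For concreteness, here is one way to read that distinction (a hedged illustration; it assumes Tr_m is a hierarchy of truth predicates, and writes the quantified version with the subscript as an argument to a single predicate Tr(m, x)):

\[
\varphi(x) \;:=\; \mathrm{Tr}_{n+1}(x) \lor \mathrm{Tr}_{n+2}(x)
\qquad\text{vs.}\qquad
\varphi(x) \;:=\; \forall m\,\bigl(m > n \rightarrow \mathrm{Tr}(m, x)\bigr)
\]

In the first formula, which truth predicates occur can be read off the symbols, so "φ(x) only depends on Tr_m for m > n" is a syntactic check; in the second, the subscript is a bound variable, so the same claim is a semantic property of φ rather than something visible in its syntax.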