sandfort - LessWrong

Contest: $1,000 for good questions to ask to an Oracle AI

Correction:

It seems that in general, the less certain any counterfactual oracle is about its prediction, the more self-confirming it is. This is because the possible counterfactual worlds in which we have or acquire self-confirming beliefs regarding the prediction will have a high expected score.

This is actually only true in certain cases, since in general many other counterfactual worlds could also have high expected scores. Specifically, it is true to the extent that the oracle is uncertain mostly about aspects of the world that would be affected by the prediction, and to the extent that self-confirming predictions lead to higher scores than any alternative.

Contest: $1,000 for good questions to ask to an Oracle AI

sandfort5y10

Submission (LB). The post's team-choosing example suggests a method for turning any low-bandwidth oracle $O$ into a counterfactual oracle $O'$: have $O'$ output $o$ from the same set of possible outputs $L$; in case of erasure, calculate $R(l)$ for a randomly chosen $l \in L$ and set $R'(o) = R(l)$ if $o = l$ and $R'(o) = -\infty$ otherwise. Although the counterfactual low-bandwidth oracle is not any safer$^{1}$, it has the advantage of almost never requiring us to evaluate its score. Thus, by running multiple oracles in sequence (stopping the process after the first erasure event) we can, with high probability, receive the full series of answers as if from a high-bandwidth oracle.
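
As a minimal sketch of the scoring rule for $O'$ described above: the function name, the `rng` parameter, and the representation of $R$ as a Python callable are assumptions made for illustration, not part of the original proposal.

```python
import math
import random


def counterfactual_reward(o, outputs, R, rng=random):
    """Score the counterfactual oracle O' after an erasure event.

    o       -- the output O' chose from the set L (hypothetical representation)
    outputs -- the list of possible outputs L
    R       -- the original low-bandwidth reward, assumed to be a callable
    rng     -- source of randomness (assumption: anything with .choice)
    """
    # Draw a random l from L and evaluate the original reward on it.
    l = rng.choice(outputs)
    r_l = R(l)
    # R'(o) = R(l) if the oracle's output matches l, and -infinity otherwise.
    return r_l if o == l else -math.inf
```

Here `R` stands in for whatever procedure we would use to evaluate a low-bandwidth answer; it is only called when erasure has occurred.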

For example, we can ask each oracle in turn for advice on how to make a more effective processor. If erasure occurs, we attempt to make a processor with the advice obtained up to that point and use some measure of its performance as the score. If there is no erasure event, the final concatenation of answers forms a much safer guide to processor building than an equally large answer from a single oracle.
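
The sequential procedure can be sketched roughly as follows. This is an illustration under stated assumptions: each oracle is modeled as a hypothetical callable that receives the answers read so far, erasure is modeled as an independent event with probability `erasure_prob` per oracle, and the names `run_sequence` and `p_no_erasure` are invented here.

```python
import random


def run_sequence(oracles, erasure_prob, rng=random):
    """Query oracles in order, stopping at the first erasure event.

    Returns (answers, erased): the answers read before any erasure,
    and a flag saying whether an erasure ended the run.
    """
    answers = []
    for oracle in oracles:
        if rng.random() < erasure_prob:
            # Erasure: this oracle's output is never read. We would now
            # score it using only the advice obtained up to this point.
            return answers, True
        # No erasure: read the answer and pass the history to the next oracle.
        answers.append(oracle(answers))
    return answers, False


def p_no_erasure(n, erasure_prob):
    """Probability that all n oracles answer, assuming independent erasures."""
    return (1.0 - erasure_prob) ** n
```

Under the independence assumption, a small per-oracle erasure probability keeps the chance of receiving the full series high even for a fairly long sequence.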

1. It seems that in general, the less certain any counterfactual oracle is about its prediction, the more self-confirming it is. This is because the possible counterfactual worlds in which we have or acquire self-confirming beliefs regarding the prediction will have a high expected score. Hence:
Submission (CF). Given a high-bandwidth counterfactual oracle, use a second counterfactual oracle with a shared erasure event to predict its score. If the predicted score's distance from its upper bound is greater than some chosen limit, discard the high-bandwidth prediction.
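
The discard rule in this submission amounts to a simple threshold check. A minimal sketch, assuming the predicted score and its upper bound are plain numbers (the function and parameter names are hypothetical):

```python
def keep_prediction(predicted_score, upper_bound, limit):
    """Return True if the high-bandwidth prediction should be kept.

    The prediction is discarded when the predicted score's distance from
    its upper bound is greater than the chosen limit.
    """
    return (upper_bound - predicted_score) <= limit
```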

Contest: $1,000 for good questions to ask to an Oracle AI

sandfort5y10

Submission (CF). Use a counterfactual oracle to send a message to ourselves with a time delay. We choose an episode length $T$ and a set of possible messages $M$. The oracle outputs a time $t_{o} < T$ and a message $m_{o} \in M$. We are automatically sent $m_{o}$ at time $t_{o}$ unless erasure occurs, in which case at the end of the episode we input a time $t^{*}$ and a message $m^{*}$ which we wish we had received at $t^{*}$. The oracle's reward is then $-(t^{*} - t_{o})^{2}$ if $m_{o} = m^{*}$ and $-\infty$ otherwise. We do not find out that erasure has occurred until the end of the episode, ensuring that events before $t_{o}$ are the same as in the counterfactual world.
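
The reward rule can be written out as a short function. This is only a sketch: times are assumed to be numbers, messages are assumed to be comparable with `==`, and the function name is made up for illustration.

```python
import math


def delayed_message_reward(t_o, m_o, t_star, m_star):
    """Reward for the time-delayed oracle after an erasure event.

    t_o, m_o       -- the time and message the oracle output
    t_star, m_star -- the time and message we wish we had received
    """
    if m_o != m_star:
        # Wrong message: reward is -infinity.
        return -math.inf
    # Right message: squared-error penalty on the timing.
    return -((t_star - t_o) ** 2)
```

With a single possible message used as an alert, the message check always passes and the reward reduces to the timing penalty alone.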

Here are some examples, which can work with a single possible message, used as an alert. EDIT: I made the examples slightly more concrete.

Set the episode length equal to the planned duration of a research program and send an alert when continuing the program would produce unworthwhile or dangerous results. The time delay ensures we keep the useful results obtained before $t_{o}$ , whereas receiving a warning at the start of the program could change the outcome. Of course, this relies on us being able to recognize dangerous results by the end of the episode and remaining in contact with the oracle until then.
Use a second counterfactual oracle to predict a series of future stock prices. If its erasure event does not occur, run a time-delayed oracle with the same episode length. We then use the predicted series to make investments. If the time-delayed oracle's erasure event occurs, we input the time at which we think our actions made the remaining predictions too inaccurate to use. When we receive an alert from the time-delayed oracle, we can discard the remaining predictions and/or rerun the oracles.

EDIT: Here are some potentially useful modifications we can make to the oracle:

Have the oracle accept input at any time during the episode, possibly multiple times. If erasure occurs, the last message entered during the episode is used as the target. For example, we could set a default message that is sent to the start of the episode as a safety measure unless we subsequently update it.
We can allow the episode to be ended early. Other than happening earlier, the scoring mechanism remains the same. If erasure did not occur, nothing happens when we try to end the episode (we would then know that the oracle has made a bad prediction). We might wish to end the episode early if we were waiting for a specific event, or in extreme situations in which we think we will lose access to the oracle.
Instead of the oracle predicting the message and time delay at the start of the episode, have it output $m_{o}$ at any time $t_{o} < T$. Erasure/shutdown now also happens at $t_{o}$, while scoring occurs at the end of the episode as before. This lets the oracle actively make use of observations from the start of the episode until $t_{o}$, increasing its predictive power (if the oracle is sufficiently powerful, this should not affect its output).
