Recently, we've been considering agents that, instead of just being given the problem setup, have to actually learn it from experience. For example, instead of being told that the predictor is perfect, they just have to play until they realise this. When such an agent encounters a problem like iterated Parfit's Hitchhiker, a weird effect can occur during exploration if the predictor is perfect.

We will turn Parfit's Hitchhiker into an iterated problem by changing the outcome of being left in the desert from dying to a large utility penalty. Suppose an agent has decided that it won't pay the next time it is in town. As long as it is committed to this, it will never actually end up in town, and so it will never actually fulfil this commitment. It will always be predicted to defect and so will never get the chance. We will call this Stuck Exploration because the exploration never resolves.
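The stuck loop can be illustrated with a small simulation. This is only a sketch under stated assumptions: the utility values, the function name `run_committed_agent` and the flag `explore_pending` are hypothetical and chosen for illustration, and the perfect predictor is modelled as simply reading off what the agent would do in town.

```python
PAY_COST = -100          # hypothetical utility of paying once in town
DESERT_PENALTY = -1000   # hypothetical large penalty for being left in the desert


def run_committed_agent(rounds):
    """Simulate an agent committed to not paying the next time it is in town."""
    explore_pending = True   # the commitment: "next time in town, don't pay"
    times_in_town = 0
    total_utility = 0
    for _ in range(rounds):
        # A perfect predictor knows exactly what the agent would do in town.
        would_pay_in_town = not explore_pending
        if would_pay_in_town:
            # Only agents predicted to pay are driven to town.
            times_in_town += 1
            total_utility += PAY_COST
        else:
            # Left in the desert. The commitment could only be discharged
            # in town, so explore_pending is never cleared.
            total_utility += DESERT_PENALTY
    return times_in_town, total_utility


times_in_town, total_utility = run_committed_agent(50)
print(times_in_town, total_utility)
```

Under these assumptions the agent never reaches town in any round, so the pending exploration is never carried out and the agent takes the desert penalty every time.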

One trivial way around this is to never commit based on the next time, but instead to control exploration with a pseudo-random variable based on the time. However, avoiding Stuck Exploration doesn't mean that the agent will necessarily end up understanding the situation. The agent will notice that it always pays in town; i.e. that it never takes the explore option when there. An EDT agent would be able to figure out that it is likely being predicted, while a dualistic agent like AIXI wouldn't be able to directly make this connection. Of course, AIXI would notice that something weird was happening, and this would distort the model of the algorithm it chooses in some way. Perhaps it would infer a spurious link, but we won't try to delve into this at this stage.
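A rough sketch of time-based exploration, and of what the agent then observes, is given below. The names `explores_at` and `run_time_based_agent`, the exploration rate and the seeding scheme are all hypothetical choices made for illustration, and the predictor is again assumed to be perfect.

```python
import random


def explores_at(t, explore_rate=0.1, seed=0):
    """Pseudo-random explore decision that depends only on the time step t."""
    # Seeding a fresh generator from (seed, t) makes the decision a fixed
    # function of time, which a perfect predictor can also compute.
    return random.Random(seed * 1_000_003 + t).random() < explore_rate


def run_time_based_agent(rounds, explore_rate=0.1, seed=0):
    """Return (town_visits, desert_visits, explored_in_town)."""
    town_visits = 0
    desert_visits = 0
    explored_in_town = 0
    for t in range(rounds):
        predicted_to_pay = not explores_at(t, explore_rate, seed)
        if predicted_to_pay:
            town_visits += 1
            # In town the agent consults the same time-based variable.
            action = "explore" if explores_at(t, explore_rate, seed) else "pay"
            if action == "explore":
                explored_in_town += 1
        else:
            # The explore rounds are exactly the rounds spent in the desert.
            desert_visits += 1
    return town_visits, desert_visits, explored_in_town


town, desert, explored = run_time_based_agent(1000)
print(town, desert, explored)
```

In this sketch the agent no longer gets stuck, since most rounds end in town, but the count of explore actions taken in town stays at zero. That is the regularity an EDT-style agent could use as evidence that it is being predicted.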

This post was a result of discussions with Davide Zagami and supported by the EA Hotel and AI Safety Research Program. Thanks to Pablo Moreno and Luke Miles for feedback.


6 comments

I am confused about the iterated Parfit's Hitchhiker setup. In the one-shot PH, the agent not only never gets to safety, they die in the desert. So you must have something else in mind. Presumably they can survive the desert in some way (how?) at the cost of lower utility. Realizing that precommitting to not paying results in suboptimal outcomes, the agent would, well, "explore" other options, including precommitting to paying.

If my understanding is correct, then this game is isomorphic to betting on a coin toss:

You are told that you win $1000 if a coin lands heads, and $1 if the coin lands tails. What you do not know is that the coin is 100% biased and always lands tails.

In that less esoteric setup you will initially bet on tails, but after a few losses, realize that the coin is biased and adjust your bet accordingly.

Yeah, need to replace dying with losing a lot of utility. I've updated the post.

But coin needs to depend on your prediction instead of being always biased a particular way.

"But coin needs to depend on your prediction instead of being always biased a particular way."

I don't see why, where would the isomorphism break?

In the iterated version we want the coin to sometimes be heads, sometimes tails. Sorry, I'm confused, I have no idea why you want to transform the problem like that?

Was trying to explain, but it looks like I screwed something up in the reformulation :)

"But coin needs to depend on your prediction instead of being always biased a particular way."

Does it? I think it only depends on failure to explore/update which is a property of the (broken) agent, not an effect of the setup.

My recommended semi-causal agent (thanks to https://www.lesswrong.com/posts/9m2fzjNSJmd3yxxKG/acdt-a-hack-y-acausal-decision-theory for the idea) does the following: Start out with X% intent to heads/pay and (100-X)% intent to tails/not pay, based on priors of chance (NOT 0 or 1) and the value of each payout box in the matrix. Randomize and commit before the start of the game (so the predictor can operate), then adjust per Bayes' rule and re-randomize the commitment after each iteration. You'll never get stuck at certainty, so you'll converge on the optimal percentage for the power of the predictor and the outcomes actually available.

This doesn't work with computationally-infeasible ranges of action/outcome, but it seems to solve iterated (including iterated in thought-experiments) simple definable cases.