Comment author: buybuydandavis 22 July 2016 10:56:22PM 1 point [-]

I always had the informal impression that the optimal policies were deterministic

So an impression that optimal memoryless policies were deterministic?

That seems even less likely to me. If the environment has state and you're not allowed any, you're playing at a disadvantage. Randomness is one way to counter state when you don't have state.

But it really does seem that there is a difference between facing an environment and another player - the other player adapts to your strategy in a way the environment doesn't. The environment only adapts to your actions.

I still don't see a difference. Your strategy is only known from your actions by both another player and the environment, so they're in the same boat.

Labeling something the environment or a player seems arbitrary and irrelevant. What capabilities are we talking about? Are these terms of art for which some standard specifying capability exists?

What formal distinctions have been made between players and environments?

Comment author: Stuart_Armstrong 23 July 2016 04:56:32PM 1 point [-]

Take a game with a mixed strategy Nash equilibrium. If you and the other player follow this, using a source of randomness that remains random for the other player, then it is never to your advantage to deviate from this. You play this game, again and again, against another player or against the environment.

Consider an environment in which the opponent's strategies are in an evolutionary arms race, trying to best beat you; this is an environmental model. Under this, you'd tend to follow the Nash equilibrium on average, but, at (almost) any given turn, there's a deterministic choice that's a bit better than being stochastic, and it's determined by the current equilibrium of strategies of the opponent/environment.

However, if you're facing another player, and you make deterministic choices, you're vulnerable if ever they figure out your choice. This is because they can peer into your algorithm, not just track your previous actions. To avoid this, you have to be stochastic.

This seems like a potentially relevant distinction.
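A minimal sketch of the exploitability point, using matching pennies as a stand-in game (the game, the payoffs and the function names are assumptions for illustration, not something specified in this thread). It compares the worst-case payoff of the 50/50 mix with that of a deterministic choice when the opponent can read your policy and pick the best pure reply:

```python
from fractions import Fraction

ACTIONS = ("H", "T")

def payoff(mine, theirs):
    # Assumed payoffs: I win 1 if the coins match, lose 1 otherwise.
    return 1 if mine == theirs else -1

def expected_payoff(my_mix, their_mix):
    # Expected payoff when both sides sample independently from their mixes.
    return sum(my_mix[a] * their_mix[b] * payoff(a, b)
               for a in ACTIONS for b in ACTIONS)

def worst_case_value(my_mix):
    # The opponent sees my_mix (my "algorithm") and picks the pure reply
    # that minimises my expected payoff.
    pure_replies = [{b: Fraction(int(b == c)) for b in ACTIONS} for c in ACTIONS]
    return min(expected_payoff(my_mix, reply) for reply in pure_replies)

uniform = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
always_heads = {"H": Fraction(1), "T": Fraction(0)}

print(worst_case_value(uniform))       # 0
print(worst_case_value(always_heads))  # -1
```

The mixed strategy cannot be pushed below 0 even by an opponent who knows it, while the deterministic one loses every round once it is known.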

Comment author: Lumifer 22 July 2016 07:44:20PM 1 point [-]

The environment only adapts to your actions.

Is this how you define environment?

Comment author: Stuart_Armstrong 23 July 2016 04:43:16PM 1 point [-]

At least as an informal definition, it seems pretty good.

Comment author: buybuydandavis 21 July 2016 12:21:25PM 1 point [-]

I always had the informal impression that the optimal policies were deterministic

Really? I wouldn't have ever thought that at all. Why do you think you thought that?

when facing the environment rather than other players. But stochastic policies can also be needed if the environment is partially observable

Isn't that kind of what a player is? Part of the environment with a strategy and only partially observable states?

Although for this player, don't you have an optimal strategy, except for the first move? The Markov "Player" seems to like change.

Isn't this strategy basically optimal? ABABABABABAB... Deterministic, just not the same every round. Am I missing something?

Comment author: Stuart_Armstrong 22 July 2016 06:51:28PM 1 point [-]

ABABABABABAB...

It's deterministic, but not memoryless.
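A rough sketch of the distinction, under an assumed toy environment that pays 1 whenever the current action differs from the previous one and gives the agent no observation at all (this is a hypothetical stand-in, not the POMDP from the original post). A policy here takes its own memory and a random source and returns an action plus updated memory; "memoryless" means the memory is always None:

```python
import random

def run(policy, steps=1000, seed=0):
    # Total reward over `steps` rounds; reward 1 when the action changes.
    rng = random.Random(seed)
    prev, memory, total = None, None, 0
    for _ in range(steps):
        action, memory = policy(memory, rng)
        if prev is not None and action != prev:
            total += 1
        prev = action
    return total

def constant(memory, rng):
    # Memoryless and deterministic: it can only repeat one action.
    return "A", None

def alternating(memory, rng):
    # Deterministic, but it needs one bit of memory (its last action).
    action = "B" if memory == "A" else "A"
    return action, action

def coin_flip(memory, rng):
    # Memoryless and stochastic: changes action about half the time.
    return rng.choice("AB"), None

print(run(constant))     # 0
print(run(alternating))  # 999
print(run(coin_flip))    # roughly half of 999
```

In this toy setting ABAB... is the best policy overall, but among memoryless policies the stochastic one beats every deterministic one.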

But it really does seem that there is a difference between facing an environment and another player - the other player adapts to your strategy in a way the environment doesn't. The environment only adapts to your actions.

I think for unbounded agents facing the environment, a deterministic policy is always optimal, but this might not be the case for bounded agents.

Comment author: Gram_Stone 19 July 2016 02:19:44PM *  4 points [-]

Is the Absent-minded Driver an example of a single-player decision problem whose optimal policy is stochastic? Isn't the optimal policy to condition your decision on an unbiased coin?

I ask because it seems like it might make a good intuitive example, as opposed to the POMDP in the OP. But I'm not sure who your intended audience is.

Comment author: Stuart_Armstrong 19 July 2016 05:16:43PM 3 points [-]

Yes, you can see this POMDP as a variant of the absent-minded driver, and get that result.
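For concreteness, here is a small sketch of the absent-minded driver calculation. The payoffs (0 for exiting at the first intersection, 4 for exiting at the second, 1 for driving past both) are the ones commonly used in the literature and are an assumption here, not something stated in this thread. The driver cannot tell the intersections apart, so a policy is a single probability p of continuing:

```python
from fractions import Fraction

def expected_payoff(p, exit_first=0, exit_second=4, continue_both=1):
    # Planning-stage expected payoff when continuing with probability p
    # at every intersection (the same p, since they look identical).
    return ((1 - p) * exit_first
            + p * (1 - p) * exit_second
            + p * p * continue_both)

# Search a grid of exact fractions that contains 2/3.
grid = [Fraction(k, 300) for k in range(301)]
best_p = max(grid, key=expected_payoff)

print(best_p, expected_payoff(best_p))  # 2/3 4/3
print(expected_payoff(Fraction(0)))     # 0  (always exit)
print(expected_payoff(Fraction(1)))     # 1  (always continue)
```

Under these assumed payoffs both deterministic policies are beaten by a stochastic one, and the best coin is biased (continue with probability 2/3) rather than fair.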

Comment author: Larks 12 July 2016 12:18:07AM 0 points [-]

Yup, I think I understand that, and agree you need to at least tend to one. I'm just wondering why you initially use the looser definition of theta (where it doesn't need to tend to one, and can instead be just 0).

Comment author: Stuart_Armstrong 12 July 2016 01:50:26PM 0 points [-]

When defining safe interruptibility, we let theta tend to 1. We probably didn't specify that earlier, when we were just introducing the concept?

Comment author: Viliam 11 July 2016 02:31:59PM 1 point [-]

perfectly feasible

Citation needed.

Comment author: Stuart_Armstrong 11 July 2016 05:55:18PM 1 point [-]

In software, it's trivial: create a subroutine with only a very specific output, and include the entity inside it. Some precautions are then needed to prevent the entity from hacking out through hardware weaknesses, but that should be doable (using isolation in a Faraday cage if needed).

Comment author: Viliam 11 July 2016 02:33:00PM 3 points [-]

I like how the examples of the robot failures are... uhm... not like from the Terminator movie. May make some people discuss them more seriously.

Comment author: Stuart_Armstrong 11 July 2016 05:52:59PM 1 point [-]

Yep!

Comment author: Larks 10 July 2016 03:33:57AM 0 points [-]

Very interesting paper, congratulations on the collaboration.

I have a question about theta. When you initially introduce it, theta lies in [0,1]. But it seems that if you choose theta = (0_n)_n, just a sequence of 0s, all policies are interruptible. Is there much reason to initially allow such a wide-ranging theta - why not restrict them to converge to 1 from the very beginning? (Or have I just totally missed the point?)

Comment author: Stuart_Armstrong 10 July 2016 05:09:02AM 0 points [-]

We're working on the theta problem at the moment. Basically we're currently defining interruptibility in terms of convergence to optimality. Hence we need the agent to explore sufficiently, hence we can't set theta=1. But we want to be able to interrupt the agent in practice, so we want theta to tend to one.
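As a purely illustrative sketch, here is one hypothetical schedule with the two properties mentioned: theta_t is never exactly 1, yet it tends to 1. The specific form 1 - 1/(t+1) and the function names are assumptions chosen for simplicity; the condition actually required in the paper may be different.

```python
def theta(t):
    # Hypothetical schedule: probability of obeying an interruption
    # at step t (t >= 1). Always below 1, approaching 1 as t grows.
    return 1 - 1 / (t + 1)

def expected_ignored(horizon):
    # Expected number of steps up to `horizon` at which an interruption
    # would be ignored, if one were attempted at every step.
    return sum(1 - theta(t) for t in range(1, horizon + 1))

for t in (1, 10, 1000):
    print(t, theta(t))

for horizon in (10, 100, 10_000):
    print(horizon, expected_ignored(horizon))
```

With this schedule the chance of ignoring an interruption shrinks towards zero, but the expected number of ignored interruptions keeps growing with the horizon (it is a harmonic sum), which is the kind of residual exploration the comment describes.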

Comment author: morganism 07 July 2016 08:27:10PM 0 points [-]

These folks say that you won't be able to sandbox an AGI, due to the nature of computing itself.

Assuming that a superintelligence will contain a program that includes all the programs that can be executed by a universal Turing machine on input potentially as complex as the state of the world, strict containment requires simulations of such a program, something theoretically (and practically) infeasible.

http://arxiv.org/abs/1607.00913v1

But perhaps we could fool it, by poisoning some crucial databases it uses in subtle ways.

DeepFool: a simple and accurate method to fool deep neural networks

http://arxiv.org/abs/1511.04599v3

Comment author: Stuart_Armstrong 09 July 2016 08:13:59AM 1 point [-]

strict containment requires simulations of such a program, something theoretically (and practically) infeasible.

Sandboxing just requires that you be sure that the sandboxed entity can't send bits outside the system (except on some defined channel, maybe), which is perfectly feasible.

Comment author: fubarobfusco 23 June 2016 08:07:44PM 1 point [-]

With the internet of things physical goods can treat their owner differently than other people. A car can be programmed to only be driven by their owner.

Theoretically yes, but that doesn't seem to be how "smart" devices are actually being programmed.

Comment author: Stuart_Armstrong 23 June 2016 11:44:40PM 1 point [-]

With the internet of things physical goods can treat their owner differently than other people. A car can be programmed to only be driven by their owner.

Which shifts the verification to the imperfect car code.
