How does this differ from indifference?
In the indifference formalism the agent in selecting A1 act like a UN agent that believe that the shutdown button will not be pressed, therefore it create perverse incentives to "manage the news". Which means that if the agent can cause his shutdown button to be pressed in the event of bad news, it will.
My formulation avoid this pathological behavior by instead making the agent select A1 as if it was a UN-agent which believed that it would continue to optimize according to UN even in the vent of the button being pressed which avoid the perverse incentives to "manage the news", while still not having any incentives to avoid the button being pressed because the agent will act like it believe that pressing the button will not cause it to initiate a shutdown.
Why not make it so that the agent in selecting A1 act as a UN-agent that believe that it will continue to optimize according to UN even in the event of the button being pressed rather than a UN agent that believe that the button will never be pressed: that is pick U such that
U(a1,o,a2) = UN(a1,o,a2) if o is in Press or US(a1,o,a2) + f(a1,o) - g(a1,o) if o is not in Press
where f(a1,o) is the maximum value of UN(a1,o,b) for b in A2 and g(a1,o) is the maximum value of US(a1,o,b) for b in A2.
This would avoid the perverse manipulation incentives problem detailed on section 4.2 of the paper.
Apart from the obvious problems with this approach, (The AI can do a lot with the output channel other than what you wanted it to do, choosing an appropriate value for λ, etc.) I don't see why this approach would be any easier to implement than CEV.
Once you know what a bounded approximation of an ideal algorithm is supposed to look like, how the bounded version is supposed to reason about it's idealised version and how to refer to arbitrary physical data, as the algorithm defined in your post assume, then implementing CEV really doesn't seem to be that hard of a problem.
So could you explain why you believe that implementing CEV would be so much harder than what you propose in your post?
Can you state the conjecture or link to a description?
By that term I simply mean Eliezer's idea that the correct decision theory ought to use a maximization vantage points with a no-blackmail equilibrium.
If the NBC (No Blackmail Conjecture) is correct, then that shouldn't be a problem.
Question: I did bring up the idea of infinite Fun v. Frank's life. That seems to me like a tiering decisioin: it's not at all clear to me that a diverging utility like "Immortal + Omega-guaranteed indefinite Fun" is worth Frank's life, which implies that Frank's life is at least on an omega-tier.
So you wouldn't trade whatever amount of time Frank as left, which is at most measured in decades, against a literal eternity of Fun?
If I was Frank in this scenario, I would tell the other guy to accept the deal.
You can't simply average the km's.
Hmm. I'll have to take a closer look at that. You mean that the uncertainties are correlated, right?
and we will have renormalization parameters k'm and
Can you show where you got that? My impression was that once we got to the set of (equivalent, only difference is scale) utility functions, averaging them just works without room for more fine-tuning.
But as I said, that part is shaky because I haven't actually supported those intuitions with any particular assumptions. We'll see what happens when we build it up from more solid ideas.
Hmm. I'll have to take a closer look at that. You mean that the uncertainties are correlated, right?
No. To quote your own post:
A similar process allows us to arbitrarily set exactly one of the km.
I meant that the utility function resulting from averaging over your uncertainty over the km's will depend on which km you chose to arbitrarily set in this way. I gave an example of this phenomenon in my original comment.
You can't simply average the km's. Suppose you estimate .5 probability that k2 should be twice k1 and .5 probability that k1 should be twice k2. Then if you normalize k1 to 1, k2 will average to 1.25, while similarly if you normalize k2 to 1, k1 will average to 1.25.
In general, to each choice of km's will correspond a utility function and the utility function we should use will be a linear combination of those utility functions and we will have renormalization parameters k'm and, if we accept the argument given in your post, those k'm ought to be just as dependant on your preferences, so you're probably also uncertain about the values that those parameters should take and so you obtain k''m's and so on ad infinitum. So you obtain an infinite tower of uncertain parameters and it isn't obvious how to obtain a utility function out of this mess.
hm. I'm still a bit shaky on the definition of modal agent...does the following qualify?
IF(opponentcooperates with me AND I defect is a possible outcome){defect} else{ if (opponent cooperates IFF i cooperate) (cooperate){else defect}
(edit: my comment about perfect unfair bots may have been based on the wrong generalizations from an infinite-case). addendum: if what I've got doesn't qualify as a model agent I'll shut up until I understand enough to inspect the proof directly.
addendum 2: well. alright then I'll shut up.
What do you even mean by "is a possible outcome" here? Do you mean that there is no proof in PA of the negation of the proposition?
The formula of a modal agent must be fully modalized, which means that all propositions containing references to actions of agents within the formula must be within the scope of a provability operator.
View more: Next
Subscribe to RSS Feed
= f037147d6e6c911a85753b9abdedda8d)
OK, following on from Karl's comment below, which seems to resolve Section 4.2, I think it shouldn't be that difficult to resolve some or all of the Section 4.1 concerns as well. In order to satisfy Desideratum 4, we need the agent to be incentivised to build in safeguards to ensure that its possible future actions will be rated well according to US in cases where the shutdown button is pressed. The most obvious way to do this is to penalise situations where there is no good shutdown action according to US and reward situations where there is such an action, but this fails according to Desideratum 2, because we want the agent to remain indifferent as to whether the shutdown button is actually pressed.
However, there appears to be a simple solution; consider the "penalty term"
g(a1) = E[max_b(US(a1, O, b)) | O in Press ; a1],
which represents the expected value of US when the shutdown button is pressed. The first step in doing things properly is to add that term to observations not in Press, rather than subtracting it from observations in Press; this helps to ensure future corrigibility (and hence also corrigibility of subagents and successor agents).
Additionally, to resolve the Section 4.2 concerns we also want the agent to be indifferent to pressing its own shutdown button in cases where it gets unlucky, and the way we do that is, as per Karl's post, by calculating the bonus term conditional on the actual observation, or on the set of observations in Press, rather than conditional on the set of possible observations not in Press, i.e. Karl's f(a1, o) = max_b(UN(a1, o, b))
Finally, we will also need the additional terms
fp(a1) = E[max_b(UN(a1, O, b) | O in Press; a1] = E[f(a1, O) | O in Press; a1]
fnp(a1) = E[max_b(UN(a1, O, b)) | O not in Press; a1] = E[f(a1, O) | O not in Press; a1]
in order to make things work.
The utility function I propose is
U(a1, o, a2) = {
g(a1) + UN(a1, o, a2), for o not in Press.
US(a1, o, a2) + fnp(a1) + f(a1, o) - fp(a1), for o in Press.
}
Let's verify each of the Desiderata in turn.
(1) is satisfied because the value of a2 conditional on any o in Press depends only upon US, not UN.
(2) and (3) are not satisfied :( because of the "managing the news" problem, as Karl points out below. That being said, we've ensured indifference to the shutdown button because the expected value for any action conditional on Press is equivalent to its expected value conditional on NotPress, i.e. E[U | NotPress ; a1] = E[U | Press; a1], which holds because of the following three equations:
g(a1) = E[max_b(US(a1, O, b)) | O in Press; a1],
fnp(a1) =E[max_b(UN(a1, O, b)) | O not in Press; a1], and
fp(a1) = E[f(a1, O) | O in Press; a1].
(4) holds because the g(a1) term is added to the utility of every action, and will be greater for actions that promote future corrigibility.
(5) holds because apart from the g(a1) term, which is required to satisfy (4), the agent is simply maximising
E[UN | NotPress; a1] over all possible a1, which is what we wanted.
In actual fact, when choosing a1 our agent is simply maximising E[UN | NotPress; a1] + E[US | Press; a1].
EDIT: I've edited this to reflect the errors pointed out by Karl.
Firstly, the important part of my modification to the indifference formalism is not about conditioning on the actual o but it's the fact that in evaluating the expectation of UN it take the action in A2 (for a given pair (a1,o)) which maximize UN instead of the action which maximize U (note that U is equal to US in the case that o is not in Press.).
Secondly an agent which chose a1 by simply maximizing E[UN | NotPress; a1] + E[US | Press; a1] do exhibit pathological behaviors. In partcular, there will still be incentives to manage the news, but from both sides now (there is an incentive to cause the button to be pressed in the event of an information which is bad news from the point of view of UN and incentives to cause the button to not be pressed in the events of information which is bad news from the point of view of US.