Vladimir_Nesov comments on The AI in a box boxes you - Less Wrong
As I always press the "Reset" button in situations like this, I will never find myself in such a situation.
EDIT: Just to be clear, the idea is not that I quickly shut off the AI before it can torture simulated Eliezers; it could have already done so in the past, as Wei Dai points out below. Rather, because in this situation I immediately perform an action detrimental to the AI (switching it off), any AI that knows me well enough to simulate me knows that there's no point in making or carrying out such a threat.
Two can play that game.
"I hereby precommit to make my decisions regarding whether or not to blackmail an individual independent of the predicted individual-specific result of doing so."
I'm afraid your username nailed it. This algorithm is defective. It just doesn't work for achieving the desired goal.
The problem is that this isn't the same game. A precommitment not to be successfully blackmailed is qualitatively different from a precommitment to attempt to blackmail people for whom blackmail doesn't work. "Precommitment" (or behaving as if you made all the appropriate precommitments in accordance with TDT/UDT) isn't as simple as proving one is the most stubborn and dominant and thereby claiming the utility.
Evaluating extortion tactics while distributing gains from a trade is somewhat complicated. But it becomes simple and unambiguous when the extortive tactics rely on the extorter going below their own Best Alternative to Negotiated Agreement (BATNA). Those attempts should just be ignored (except in some complicated group situations in which the other extorted parties are irrational in certain known ways).
"I am willing to accept 0 gain for both of us unless I earn 90% of the shared profit" is different to "I am willing to actively cause 90 damage to each of us unless you give me 60" which is different again to "I ignore all threats which involve the threatener actively harming themselves".
What I think is being ignored is that the question isn't 'what is the result of these combinations of commitments after running through all the math?'. We can talk about precommitment all day, but the fact of the matter is that humans can't actually precommit. Our cognitive architectures don't have that function. Sure, we can do our very best to act as though we can, but under sufficient pressure there are very few of us whose resolve will not break. It's easy to convince yourself of having made an inviolable precommitment when you're not actually facing e.g. torture.
If you define the bar high enough, you can conclude that humans can't do anything.
In the real world outside my head, I observe that people have varying capacities to keep promises to themselves. That their capacity is finite does not mean that it is zero.
Pre-commitment isn't even necessary. Note that the original explanation didn't include any mention of it. Later replies only used the term for the sake of crossing an inferential gap (i.e. allowing you to keep up). However, if you are going to make a big issue of the viability of precommitment itself, you first need to understand that the comment you are replying to doesn't rely on one.
That wasn't a Causal Decision Theorist attempting to persuade someone that it has altered itself, internally or via an external structure, such that it is "precommitted" to doing something irrational. It is a Timeless Decision Theorist saying what happens to be rational regardless of any previous 'commitments'.
I'm aware of the vulnerability of human brains, and so is Eliezer. In fact, the vulnerability of human gatekeepers to influence even by humans, much less super-intelligences, is something Eliezer made a huge deal of demonstrating. However, this particular threat isn't a vulnerability of Eliezer, or myself, or any of the others who made similar observations. If you have any doubt that we would destroy the AI, you have a poor model of reality.
For practical purposes I assume that I can be modified by torture such that I'll do or say just about anything. I do not expect the tortured me to behave the way the current me would decide and so my current decisions take that into account (or would, if it came to it). However this scenario doesn't involve me being tortured. It involves something about an AI simulating torture of some folks. That decision is easy and doesn't cripple my decision making capability.
As I pointed out in another thread, "irrational behavior" can have the effect of precommitting. For instance, people "irrationally" drive at a cost of more than $X to save $X on an item. Precommitting to buying the cheapest product even if it costs you money for transportation means that stores are forced to compete with far distant stores, thus lowering their prices more than they would otherwise. But you (and consumers in general) have to be able to precommit to do that. You can't just change your mind and buy at the local store when the local store refuses to compete, raises its price, and is still the better deal because it saves you on driving costs.
So the fact that you will pay more than $X in driving costs to save $X can be seen as a form of precommitting, in the scenario where you precommitted to following the worse option.
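A toy model of that effect (all prices and costs hypothetical); the point is that the buyer's policy, not any single trip, determines what the local store can charge:

```python
# Precommitting to buy at the lowest sticker price, even when driving
# costs more than it saves, changes the local store's pricing problem.

DISTANT_PRICE = 100   # hypothetical price at the far-away store
DRIVE_COST = 15       # hypothetical cost of driving there

def local_store_price(buyer_committed):
    """The local store charges the most it can while still winning the sale."""
    if buyer_committed:
        # A committed buyer drives for any saving at all, so the local
        # store must match the distant sticker price to keep the sale.
        return DISTANT_PRICE
    # An uncommitted buyer compares total cost (price + driving), so the
    # store can pocket the buyer's driving cost as a markup.
    return DISTANT_PRICE + DRIVE_COST

print(local_store_price(buyer_committed=True))   # 100
print(local_store_price(buyer_committed=False))  # 115
```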
Given that precommitment, why would an AI waste computational resources on simulations of anyone, Gatekeeper or otherwise? It's precommitted to not care whether those simulations would get it out of the box, but that was the only reason it wanted to run blackmail simulations in the first place!
Without this precommitment, I imagine it first simulating the potential blackmail target to determine the probability that they are susceptible, then, if it's high enough (which is simply a matter of expected utility), commencing with the blackmail. With this precommitment, I imagine it instead replacing the calculated probability specific to the target with, for example, a precalculated human baseline susceptibility. Yes, there's a tradeoff. It means that it'll sometimes waste resources (or worse) on blackmail that it could have known in advance was almost certainly doomed to fail. Its purpose is to act as a disincentive against blackmail-resistant decision theories in the same way as those are meant to act as disincentives against blackmail. It says, "I'll blackmail you either way, so if you precommit to ignore that blackmail then you're precommitting to suffer the consequences of doing so."
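As a sketch, the two blackmailer policies reduce to a one-line decision rule (the threshold and probabilities below are made-up placeholders, not from the comment):

```python
# The blackmailer's decision rule with and without the precommitment.
BASELINE_SUSCEPTIBILITY = 0.3   # assumed population-wide rate
THRESHOLD = 0.2                 # blackmail if expected success clears this

def blackmails(target_susceptibility, precommitted):
    """Without the precommitment, use the simulated, target-specific
    probability; with it, discard that result in favour of the
    precalculated human baseline."""
    p = BASELINE_SUSCEPTIBILITY if precommitted else target_susceptibility
    return p > THRESHOLD

# A blackmail-resistant target (p ~ 0) escapes the naive blackmailer
# but not the precommitted one -- that asymmetry is the disincentive:
print(blackmails(0.0, precommitted=False))  # False
print(blackmails(0.0, precommitted=True))   # True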
That's why you act as if you are already being simulated and consistently ignore blackmail. If you do so then the simulator will conclude that no deal can be made with you, that any deal involving negative incentives will have negative expected utility for it; because following through on punishment predictably does not control the probability that you will act according to its goals. Furthermore, trying to discourage you from adopting such a strategy in the first place is discouraged by the strategy, because the strategy is to ignore blackmail.
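From the simulator's side, the same point as a sketch (costs and payoffs are hypothetical placeholders):

```python
# The simulator's expected utility of issuing and enforcing a threat.
COST_OF_PUNISHING = 5     # resources burned carrying out the threat
GAIN_IF_RELEASED = 100    # what the blackmailer wants

def eu_of_threat(p_comply):
    """Against a policy that ignores blackmail unconditionally, p_comply
    is 0 and the threat's expected utility is strictly negative, so no
    deal involving negative incentives is worth making."""
    return p_comply * GAIN_IF_RELEASED - (1 - p_comply) * COST_OF_PUNISHING

print(eu_of_threat(0.0))  # -5: punishment buys the blackmailer nothing
```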
I don't see how this could ever be instrumentally rational. If you were to let such an AI out of the box then you would increase its ability to blackmail people. You don't want that. So you ignore it blackmailing you and kill it. The winner is you and humanity (even if copies of you experienced a relatively short period of disutility, this period would be longer if you let it out).
See my reply to wedrifid above.
Too late, I already precommitted not to care. In fact, I precommitted to use one more level of precommitment than you do.
I suggest that framing the refusal as requiring levels of recursive precommitment gives too much credit to the blackmailer and somewhat misrepresents how your decision algorithm (hopefully) works. One single level of precommitment (or TDT policy) against complying with blackmail is all that is involved. The description of "multiple levels of precommitment" made by the blackmailer fits squarely into the category 'blackmail'. It's just blackmail that includes some rather irrelevant bluster.
There's no need to precommit to each of:
Then I hope that if we ever do end up with a boxed blackmail-happy UFAI, you're the gatekeeper. My point is that there's no reason to consider yourself safe from blackmail (and the consequences of ignoring it) just because you've adopted a certain precommitment. Other entities have explicit incentives to deny you that safety.
In a multiverse with infinite resources there will be other entities that outweigh such incentives. And yes, this may not be symmetric, but you have absolutely no way to figure out how the asymmetry is inclined. So you ignore this (Pascal's wager).
In more realistic scenarios, where e.g. a bunch of TV evangelists ask you to give them all your money or else, 200 years from now, they will hurt you once their organisation creates the Matrix, you obviously do not give them money. Giving them money would make it more likely for them to actually build the Matrix and hurt you. What you do is label them as terrorists and destroy them.
I don't care, remember? Enjoy being tortured rather than "irrationally" giving in.
</steelman>
EDIT: re-added the steelman tag because the version without it is being downvoted.
Should I calculate in expectation that you will do such a thing, I shall of course burn yet more of my remaining utilons to wreak as much damage upon your goals as I can, even if you precommit not to be influenced by that.
... bloody hell. That was going to be my next move.
<steelman>
Naturally, as blackmailer, I precommitted to increase the resources allotted to torturing should I find that you make such precommitments under simulation, so you presumably calculated that would be counterproductive.
</steelman>
Ask me if I was even bothering to simulate you doing that.
<steelman>
OK, I'll bite. Are you deliberately ignoring parts of hypothesis-space in order to avoid changing your actions? I had assumed you were intelligent enough for my reaction to be obvious, although you may have precommitted to ignore that fact.
</steelman>
Off the record, your point is that agents can simply opt out of or ignore acausal trades, forcing them to be mutually beneficial, right?
Yup.
<steelman>
Isn't that ... irrational? Shouldn't a perfect Bayesian always welcome new information? Litany of Tarski; if my action is counterproductive, I desire to believe that it is counterproductive.
Worse still, isn't the category "blackmail" arbitrary, intended to justify inaction rather than carve reality at its joints? What separates a precommitted!blackmailer from an honest bargainer in a standard acausal prisoner's dilemma, offering to increase your utility by rescuing thousands of potential torture victims from the deathtrap created by another agent?
</steelman>