RobbBB comments on A few misconceptions surrounding Roko's basilisk - Less Wrong Discussion
This was addressed on the LessWrongWiki page; I didn't copy the full article here.
A few reasons Roko's argument doesn't work:
The 'should' here is normative: there are probably some decision theories that let agents acausally blackmail each other, and others that perform well in Newcomb's problem and the smoking lesion problem yet can't acausally blackmail each other; it hasn't been formally demonstrated which theories fall into which category.
2 - Assuming you are for some reason following a decision theory that does put you at risk of acausal blackmail: since the hypothetical agent is superintelligent, it has lots of ways to trick people into thinking it's going to torture people without actually torturing them. Since this is cheaper, it would rather do that. And since we're aware of this, we know any threat of blackmail would be empty. This means that we can't be blackmailed in practice.
3 - A stronger version of 2 is that rational agents actually have an incentive to harshly punish attempts at blackmail in order to discourage it. So threatening blackmail can actually decrease an agent's probability of being created, all else being equal.
4 - Insofar as it's "utilitarian" to horribly punish anyone who doesn't perfectly promote human flourishing, SIAI doesn't seem to have endorsed utilitarianism.
4 means that the argument lacks practical relevance. The idea of CEV doesn't build in very much moral philosophy, and it doesn't build in predictions about the specific dilemmas future agents might end up in.
"I precommit to shop at the store with the lowest price within some large distance, even if the cost of the gas and car depreciation to get to a farther store is greater than the savings I get from its lower price. If I do that, stores will have to compete with distant stores based on price, and thus it is more likely that nearby stores will have lower prices. However, this precommitment would only work if I am actually willing to go to the farther store when it has the lowest price even if I lose money".
Miraculously, people do reliably act this way.
I doubt it. Reference?
Mostly because they don't actually notice the cost of gas and car depreciation at the time...
You've described the mechanism by which the precommitment happened, not actually disputed whether it happens.
Many "irrational" actions by human beings can be analyzed as precommitment; for instance, wanting to take revenge on people who have hurt you even if the revenge doesn't get you anything.
Humans don't follow any decision theory consistently. They sometimes give in to blackmail, and at other times resist blackmail. If you convinced a bunch of people to take acausal blackmail seriously, presumably some subset would give in and some subset would resist, since that's what we see in ordinary blackmail situations. What would be interesting is if (a) there were some applicable reasoning norm that forced us to give in to acausal blackmail on pain of irrationality, or (b) there were some known human irrationality that made us inevitably susceptible to acausal blackmail. But I don't think Roko gave a good argument for either of those claims.
From my last comment: "there are probably some decision theories that let agents acausally blackmail each other". But if humans frequently make use of heuristics like 'punish blackmailers' and 'never give in to blackmailers', and if normative decision theory says they're right to do so, there's less practical import to 'blackmailable agents are possible'.
No it doesn't. If you model Newcomb's problem as a Prisoner's Dilemma, then one-boxing maps on to cooperating and two-boxing maps on to defecting. For Omega, cooperating means 'I put money in both boxes' and defecting means 'I put money in just one box'. TDT recognizes that the only two options are mutual cooperation or mutual defection, so TDT cooperates.
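To make that mapping concrete, here is a toy sketch in Python. It is not an implementation of TDT; it only encodes the claim that a reliable predictor removes the mismatched outcomes. The dollar amounts are the usual Newcomb figures, assumed here for illustration, and the names agent_payoff, best_move, and reachable are hypothetical.

```python
# Toy model of Newcomb's problem framed as a Prisoner's Dilemma.
# Assumed payoffs (standard Newcomb amounts, not taken from the comment above).
SMALL = 1_000        # always-available small box
BIG = 1_000_000      # big box, filled only if Omega "cooperates"


def agent_payoff(agent_move: str, omega_move: str) -> int:
    """Agent's winnings. Agent: 'C' = one-box, 'D' = two-box.
    Omega: 'C' = money in both boxes, 'D' = money in just one box."""
    big_box = BIG if omega_move == "C" else 0
    small_box = SMALL if agent_move == "D" else 0
    return big_box + small_box


def best_move(outcomes):
    """Return the agent's move in the outcome that pays the agent most."""
    return max(outcomes, key=lambda o: agent_payoff(*o))[0]


all_outcomes = [(a, o) for a in "CD" for o in "CD"]

# Assumption: Omega predicts reliably, so its move mirrors the agent's.
# Only mutual cooperation and mutual defection remain reachable.
reachable = [(a, o) for a, o in all_outcomes if a == o]

naive_choice = best_move(all_outcomes)  # chases the unreachable (D, C) outcome
tdt_choice = best_move(reachable)       # compares only (C, C) with (D, D)

print(naive_choice, tdt_choice)
```

The agent that considers all four cells defects because (D, C) pays the most on paper; the agent that restricts itself to the two reachable cells cooperates, which is the one-boxing result described above.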
Blackmail works analogously. Perhaps the blackmailer has five demands. For the blackmailee, full cooperation means 'giving in to all five demands'; full defection means 'rejecting all five demands'; and there are also intermediary levels (e.g., giving in to two demands while rejecting the other three), with the blackmailee preferring to give in to as few demands as possible.
For the blackmailer, full cooperation means 'expending resources to punish the blackmailee in proportion to how many of my demands went unmet'. Full defection means 'expending no resources to punish the blackmailee even if some demands aren't met'. In other words, since harming past agents is costly, a blackmailer's favorite scenario is always 'the blackmailee, fearing punishment, gives in to most or all of my demands; but I don't bother punishing them regardless of how many of my demands they ignored'. We could say that full defection doesn't even bother to check how many of the demands were met, except insofar as this is useful for other goals.
The blackmailer wants to look as scary as possible (to get the blackmailee to cooperate) and then defect at the last moment anyway (by not following through on the threat), if at all possible. In terms of Newcomb's problem, this is the same as preferring to trick Omega into thinking you'll one-box, and then two-boxing anyway. We usually construct Newcomb's problem in such a way that this is impossible; therefore TDT cooperates. But in the real world mutual cooperation of this sort is difficult to engineer, which makes fully credible acausal blackmail at least as difficult.
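The incentive structure in the last few paragraphs can be sketched with made-up numbers. Everything below is an assumption chosen for illustration: the four constants, the linear utilities, and the rule that a blackmailer who follows through punishes once per unmet demand. The function names are hypothetical too.

```python
# Toy blackmail game with five demands. All numbers are illustrative assumptions.
N_DEMANDS = 5
GAIN_PER_DEMAND = 10   # blackmailer's benefit per demand met
PUNISH_COST = 3        # blackmailer's cost per unit of punishment
COMPLY_COST = 4        # blackmailee's cost per demand met
HARM_PER_UNIT = 20     # blackmailee's loss per unit of punishment


def punishment(met: int, follows_through: bool) -> int:
    """Units of punishment: one per unmet demand, but only if the
    blackmailer actually follows through on the threat."""
    return (N_DEMANDS - met) if follows_through else 0


def blackmailer_utility(met: int, follows_through: bool) -> int:
    return GAIN_PER_DEMAND * met - PUNISH_COST * punishment(met, follows_through)


def blackmailee_utility(met: int, follows_through: bool) -> int:
    return -COMPLY_COST * met - HARM_PER_UNIT * punishment(met, follows_through)


def best_response(follows_through: bool) -> int:
    """Number of demands the blackmailee meets if it knows the blackmailer's policy."""
    return max(range(N_DEMANDS + 1),
               key=lambda met: blackmailee_utility(met, follows_through))


# Whatever the blackmailee did, skipping the punishment is never worse
# for the blackmailer, because punishing only costs resources.
never_punishing_dominates = all(
    blackmailer_utility(met, False) >= blackmailer_utility(met, True)
    for met in range(N_DEMANDS + 1)
)

print(never_punishing_dominates, best_response(True), best_response(False))
```

With these numbers, a blackmailee who believes the threat meets all five demands, while one who expects the blackmailer to defect meets none. Since defecting is never worse for the blackmailer after the fact, the threat only works if the blackmailer has some way to make follow-through credible.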
I think you misunderstood point 3. 3 is a follow-up to 2: humans and AI systems alike have incentives to discourage blackmail, which increases the likelihood that blackmail is a self-defeating strategy.
Eliezer has endorsed the claim "two independent occurrences of a harm (not to the same person, not interacting with each other) are exactly twice as bad as one". This doesn't tell us how bad the act of blackmail itself is, it doesn't tell us how faithfully we should implement that idea in autonomous AI systems, and it doesn't tell us how likely it is that a superintelligent AI would find itself forced into this particular moral dilemma.
Since Eliezer asserts a CEV-based agent wouldn't blackmail humans, the next step in shoring up Roko's argument would be to do more to connect the dots from "two independent occurrences of a harm (not to the same person, not interacting with each other) are exactly twice as bad as one" to a real-world worry about AI systems actually blackmailing people conditional on claims (a) and (c). 'I find it scary to think a superintelligent AI might follow the kind of reasoning that can ever privilege torture over dust specks' is not the same thing as 'I'm scared a superintelligent AI will actually torture people because this will in fact be the best way to prevent a superastronomically large number of dust specks from ending up in people's eyes', so Roko's particular argument has a high evidential burden.
Um, your conclusion "since we're aware of this, we know any threat of blackmail would be empty" contradicts your premise that the AI by virtue of being super-intelligent is capable of fooling people into thinking it'll torture them.
One way of putting this is that the AI, once it exists, can convincingly trick people into thinking it will cooperate in Prisoner's Dilemmas; but since we know it has this property and we know it prefers (D,C) over (C,C), we know it will defect. This is consistent because we're assuming the actual AI is powerful enough to trick people once it exists; this doesn't require the assumption that my low-fidelity mental model of the AI is powerful enough to trick me in the real world.
For acausal blackmail to work, the blackmailer needs a mechanism for convincing the blackmailee that it will follow through on its threat. 'I'm a TDT agent' isn't a sufficient mechanism, because a TDT agent's favorite option is still to trick other agents into cooperating in Prisoner's Dilemmas while they defect.
Except it needs to convince the people who are around before it exists.