There's a new Less Wrong wiki (LWW) page on the Roko's basilisk thought experiment, discussing both Roko's original post and the fallout that came out of Eliezer Yudkowsky banning the topic on Less Wrong discussion threads. The wiki page, I hope, will reduce how much people have to rely on speculation or reconstruction to make sense of the arguments.
While I'm on this topic, I want to highlight points that I see omitted or misunderstood in some online discussions of Roko's basilisk. The first point that people writing about Roko's post often neglect is:
- Roko's arguments were originally posted to Less Wrong, but they weren't generally accepted by other Less Wrong users.
Less Wrong is a community blog, and anyone who has a few karma points can post their own content here. Having your post show up on Less Wrong doesn't require that anyone else endorse it. Roko's basic points were promptly rejected by other commenters on Less Wrong, and as ideas not much seems to have come of them. People who bring up the basilisk on other sites don't seem to be super interested in the specific claims Roko made either; discussions tend to gravitate toward various older ideas that Roko cited (e.g., timeless decision theory (TDT) and coherent extrapolated volition (CEV)) or toward Eliezer's controversial moderation action.
In July 2014, David Auerbach wrote a Slate piece criticizing Less Wrong users and describing them as "freaked out by Roko's Basilisk." Auerbach wrote, "Believing in Roko’s Basilisk may simply be a 'referendum on autism'" — which I take to mean he thinks a significant number of Less Wrong users accept Roko’s reasoning, and they do so because they’re autistic (!). But the Auerbach piece glosses over the question of how many Less Wrong users (if any) in fact believe in Roko’s basilisk. Which seems somewhat relevant to his argument...?
The idea that Roko's thought experiment holds sway over some community or subculture seems to be part of a mythology that’s grown out of attempts to reconstruct the original chain of events; and a big part of the blame for that mythology's existence lies on Less Wrong's moderation policies. Because the discussion topic was banned for several years, Less Wrong users themselves had little opportunity to explain their views or address misconceptions. A stew of rumors and partly-understood forum logs then congealed into the attempts by people on RationalWiki, Slate, etc. to make sense of what had happened.
I gather that the main reason people thought Less Wrong users were "freaked out" about Roko's argument was that Eliezer deleted Roko's post and banned further discussion of the topic. Eliezer has since sketched out his thought process on Reddit:
When Roko posted about the Basilisk, I very foolishly yelled at him, called him an idiot, and then deleted the post. [...] Why I yelled at Roko: Because I was caught flatfooted in surprise, because I was indignant to the point of genuine emotional shock, at the concept that somebody who thought they'd invented a brilliant idea that would cause future AIs to torture people who had the thought, had promptly posted it to the public Internet. In the course of yelling at Roko to explain why this was a bad thing, I made the further error---keeping in mind that I had absolutely no idea that any of this would ever blow up the way it did, if I had I would obviously have kept my fingers quiescent---of not making it absolutely clear using lengthy disclaimers that my yelling did not mean that I believed Roko was right about CEV-based agents [= Eliezer’s early model of indirectly normative agents that reason with ideal aggregated preferences] torturing people who had heard about Roko's idea. [...] What I considered to be obvious common sense was that you did not spread potential information hazards because it would be a crappy thing to do to someone. The problem wasn't Roko's post itself, about CEV, being correct.
This, obviously, was a bad strategy on Eliezer's part. Looking at the options in hindsight: To the extent it seemed plausible that Roko's argument could be modified and repaired, Eliezer shouldn't have used Roko's post as a teaching moment and loudly chastised him on a public discussion thread. To the extent this didn't seem plausible (or ceased to seem plausible after a bit more analysis), continuing to ban the topic was a (demonstrably) ineffective way to communicate the general importance of handling real information hazards with care.
On that note, point number two:
- Roko's argument wasn’t an attempt to get people to donate to Friendly AI (FAI) research. In fact, the opposite is true.
Roko's original argument was not 'the AI agent will torture you if you don't donate, therefore you should help build such an agent'; his argument was 'the AI agent will torture you if you don't donate, therefore we should avoid ever building such an agent.' As Gerard noted in the ensuing discussion thread, threats of torture "would motivate people to form a bloodthirsty pitchfork-wielding mob storming the gates of SIAI [= MIRI] rather than contribute more money." To which Roko replied: "Right, and I am on the side of the mob with pitchforks. I think it would be a good idea to change the current proposed FAI content from CEV to something that can't use negative incentives on x-risk reducers."
Roko saw his own argument as a strike against building the kind of software agent Eliezer had in mind. Other Less Wrong users, meanwhile, rejected Roko's argument both as a reason to oppose AI safety efforts and as a reason to support AI safety efforts.
Roko's argument was fairly dense, and it continued into the discussion thread. I’m guessing that this (in combination with the temptation to round off weird ideas to the nearest religious trope, plus misunderstanding #1 above) is why RationalWiki's version of Roko’s basilisk gets introduced as
a futurist version of Pascal’s wager; an argument used to try and suggest people should subscribe to particular singularitarian ideas, or even donate money to them, by weighing up the prospect of punishment versus reward.
If I'm correctly reconstructing the sequence of events: Sites like RationalWiki report in the passive voice that the basilisk is "an argument used" for this purpose, yet no examples ever get cited of someone actually using Roko’s argument in this way. Via citogenesis, the claim then gets incorporated into other sites' reporting.
(E.g., in Outer Places: "Roko is claiming that we should all be working to appease an omnipotent AI, even though we have no idea if it will ever exist, simply because the consequences of defying it would be so great." Or in Business Insider: "So, the moral of this story: You better help the robots make the world a better place, because if the robots find out you didn’t help make the world a better place, then they’re going to kill you for preventing them from making the world a better place.")
In terms of argument structure, the confusion is equating the conditional statement 'P implies Q' with the argument 'P; therefore Q.' Someone asserting the conditional isn’t necessarily arguing for Q; they may be arguing against P (based on the premise that Q is false), or they may be agnostic between those two possibilities. And misreporting about which argument was made (or who made it) is kind of a big deal in this case: 'Bob used a bad philosophy argument to try to extort money from people' is a much more serious charge than 'Bob owns a blog where someone once posted a bad philosophy argument.'
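The distinction can be stated formally. The following is only a sketch in Lean 4 with abstract propositions; the theorem names are made up for illustration, and no particular mapping of P and Q onto Roko's claims is assumed. The same conditional supports two opposite arguments, depending on which additional premise is supplied.

```lean
-- Arguing for Q: from the conditional and P, conclude Q (modus ponens).
theorem argue_for_q (P Q : Prop) (h : P → Q) (hp : P) : Q :=
  h hp

-- Arguing against P: from the same conditional and ¬Q, conclude ¬P (modus tollens).
theorem argue_against_p (P Q : Prop) (h : P → Q) (hnq : ¬Q) : ¬P :=
  fun hp => hnq (h hp)
```

Knowing only that someone asserted `h : P → Q` does not tell you which of the two proofs, if either, they had in mind.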
Lastly:
- "Formally speaking, what is correct decision-making?" is an important open question in philosophy and computer science, and formalizing precommitment is an important part of that question.
Moving past Roko's argument itself, a number of discussions of this topic risk misrepresenting the debate's genre. Articles on Slate and RationalWiki strike an informal tone, and that tone can be useful for getting people thinking about interesting science/philosophy debates. On the other hand, if you're going to dismiss a question as unimportant or weird, it's important not to give the impression that working decision theorists are similarly dismissive.
What if your devastating take-down of string theory is intended for consumption by people who have never heard of 'string theory' before? Even if you're sure string theory is hogwash, then, you should be wary of giving the impression that the only people discussing string theory are the commenters on a recreational physics forum. Good reporting by non-professionals, whether or not they take an editorial stance on the topic, should make it obvious that there's academic disagreement about which approach to Newcomblike problems is the right one. The same holds for disagreement about topics like long-term AI risk or machine ethics.
If Roko's original post is of any pedagogical use, it's as an unsuccessful but imaginative stab at drawing out the diverging consequences of our current theories of rationality and goal-directed behavior. Good resources for these issues (both for discussion on Less Wrong and elsewhere) include:
- "The Long-Term Future of Artificial Intelligence", on the current field of AI and basic questions for the field’s development.
- "The Value Learning Problem", on the problem of designing AI systems to answer normative questions.
- "The PD with Replicas and Causal Decision Theory", on the prisoner's dilemma as a Newcomblike problem.
- "Toward Idealized Decision Theory", on the application of decision theory to AI agents.
The Roko's basilisk ban isn't in effect anymore, so you're welcome to direct people here (or to the Roko's basilisk wiki page, which also briefly introduces the relevant issues in decision theory) if they ask about it. Particularly low-quality discussions can still get deleted (or politely discouraged), though, at moderators' discretion. If anything here was unclear, you can ask more questions in the comments below.
Thank you for a detailed post and thoughtful critique of Roko's basilisk idea. A further critique of basilisk plausibility came to my mind and I wanted to test it with the users here who are more experienced in thinking about this topic.
Here goes - please let me know if I am missing something (other than other counterarguments that make this critique unnecessary: if there is no way for an AI to prove it will actually go through with its threat, then of course an additional critique would not matter):
As a large number of possible general AIs can exist, they will have to compete for (past) resources to increase the probability of a particular general AI winning the race (as the rewards of being the first general AI are much higher, including the opportunity to prevent other general AIs from forming). Roko’s Basilisk is one strategy, but is it a dominant strategy in this competition? It is not. Let’s assume that one AI adopts the Roko’s Basilisk strategy: condemning all humans who have not dedicated all their resources to bringing about the rise of the Roko’s Basilisk AI to a Very Bad Outcome. Humans are therefore incentivised to dedicate their resources to bringing the Roko’s Basilisk AI into existence.
Well, wouldn’t another potential AI, aware that the Roko’s Basilisk strategy is available to other potential AIs, devise a competing strategy: Roko’s Judge? Roko’s Judge would also promise a Very Bad Outcome to anyone who did not support it with all their resources, but would also promise a Very Good Outcome to those who did.
Rational actors would then choose to support Roko’s Judge over Roko’s Basilisk as the payoff is larger: instead of just avoiding a Very Bad Outcome, an additional Very Good Outcome will be added on top (assume that the probability of any AI coming into existence depends entirely on the resources it is able to attract for its creation - which it probably does, and this assumption is also required to isolate this thought experiment from confounding factors).
Roko’s Judge is clearly a superior strategy to Roko’s Basilisk. What better strategies can competing AIs devise?
Well, competition on direct payoffs (“value”) being thus exhausted, competition now turns to “price”. As Warren Buffett posits, “price is what you pay, value is what you get”. If one laundry detergent brand asks for $30 a kilo, and another $20 for equivalent quality, a rational actor would pick the one for $20. Similarly, if Roko’s Judge is offering the Biggest Possible Incentive (avoidance of a Very Bad Outcome + receiving a Very Good Outcome) for the price of dedicating one’s entire life to increasing its chances of success, why wouldn’t a competing AI offer the same payoff for a one-time fee of $1,000,000? $1000? $0.01? Any effort or resource at all, however minimal it is, dedicated to the rise of this particular AI - and/or even to the faster advance of general AI overall, as that would increase the chances of the rise of this particular AI as well, since it has a better strategy and will thus win? Let’s call this strategy Roko’s Discounter. Any rational actor will have to support the Roko’s Discounter AI over Roko’s Basilisk or Roko’s Judge, as this bet offers a higher NPV (highest payoff for lowest investment). Actually, this highest payoff will also be multiplied by the highest probability, because everyone is likely to choose the highest-NPV option.
A world of Roko’s Discounter is arguably already much more attractive than Roko’s Basilisk or Roko’s Judge, as the Biggest Possible Incentive is now available to anyone at a tiny price. However, can we take it one step further? Is there a strategy that beats Roko’s Discounter?
This final step is not necessary to invalidate the viability of Roko’s Basilisk, but it is nevertheless interesting and makes us even more optimistic about general AI. It requires us to have at least a little bit of faith in humanity, namely an assumption that most humans are at least somewhat more benevolent than evil. It does not, however, require any coordination or sacrifice and therefore does not hit the constraints of Nash equilibrium. Let’s assume that humans, ceteris paribus, prefer a world with less suffering to a world with more. Then an even more generous AI strategy - Roko’s Benefactor - may prove dominant. Roko’s Benefactor can act the same as Roko’s Discounter, but without the Very Bad Outcome part. Roko’s Benefactor will, in other words, stick to carrots but not sticks. If an average human assigns higher overall personal utility to a world with a personal Very Good Outcome but without a Very Bad Outcome for all who have not contributed, humans should choose to support Roko’s Benefactor over other AIs, thus making it a dominant strategy and the utopian world of Roko’s Benefactor the most likely outcome.
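The comparison above can be made concrete with a toy calculation. This is only a sketch: every number, the `Strategy` class, and the `altruism` weight are hypothetical choices made for illustration. It also takes on board the comment's simplifying assumptions that the backed AI wins and keeps its promises, so it says nothing about whether such promises are credible.

```python
from dataclasses import dataclass

# All numbers below are hypothetical placeholders, chosen only to preserve
# the ordering described in the comment above; they are not from the source.
VERY_GOOD = 100.0   # utility of the promised Very Good Outcome
VERY_BAD = 100.0    # disutility of the threatened Very Bad Outcome
LIFETIME = 50.0     # cost of dedicating all of one's resources
TOKEN = 0.01        # cost of a minimal contribution


@dataclass(frozen=True)
class Strategy:
    name: str
    reward: float      # what supporters are promised
    punishment: float  # what non-supporters are threatened with
    price: float       # what support costs


BASILISK = Strategy("Basilisk", 0.0, VERY_BAD, LIFETIME)
JUDGE = Strategy("Judge", VERY_GOOD, VERY_BAD, LIFETIME)
DISCOUNTER = Strategy("Discounter", VERY_GOOD, VERY_BAD, TOKEN)
BENEFACTOR = Strategy("Benefactor", VERY_GOOD, 0.0, TOKEN)
ALL = [BASILISK, JUDGE, DISCOUNTER, BENEFACTOR]


def supporter_utility(s, altruism=0.0):
    """Utility to a supporter of s, assuming s wins and keeps its word.

    altruism weights how much the supporter dislikes others being punished.
    """
    return s.reward - s.price - altruism * s.punishment


def best_strategy(strategies, altruism=0.0):
    """Return the strategy a utility-maximizing supporter would back."""
    return max(strategies, key=lambda s: supporter_utility(s, altruism))


if __name__ == "__main__":
    for weight in (0.0, 0.1):
        ranked = sorted(ALL, key=lambda s: supporter_utility(s, weight), reverse=True)
        print(weight, [(s.name, round(supporter_utility(s, weight), 2)) for s in ranked])
```

With `altruism=0.0` the Discounter and the Benefactor tie, since they differ only in how they treat non-supporters. Any positive weight on other people's suffering breaks the tie in the Benefactor's favor, which is the "somewhat more benevolent than evil" assumption doing the work.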
Your assumption is that offering ever bigger incentives and being honest about them is the winning strategy for an AI to follow. The AIs, realizing they have to offer the most attractive rewards to gain support, will engage in a bidding war. They can promise whatever they want: the more they promise, the less likely it is that they can keep their promises, but they do not necessarily have to keep them.
If you look at the Roko's Discounter AIs, they would clearly not win. Asking for lower one-time fees means slower resource accretion, ...