A few misconceptions surrounding Roko's basilisk

39 RobbBB 05 October 2015 09:23PM

There's a new LWW page on the Roko's basilisk thought experiment, discussing both Roko's original post and the fallout that came out of Eliezer Yudkowsky banning the topic on Less Wrong discussion threads. The wiki page, I hope, will reduce how much people have to rely on speculation or reconstruction to make sense of the arguments.

While I'm on this topic, I want to highlight points that I see omitted or misunderstood in some online discussions of Roko's basilisk. The first point that people writing about Roko's post often neglect is:

 

  • Roko's arguments were originally posted to Less Wrong, but they weren't generally accepted by other Less Wrong users.

Less Wrong is a community blog, and anyone who has a few karma points can post their own content here. Having your post show up on Less Wrong doesn't require that anyone else endorse it. Roko's basic points were promptly rejected by other commenters on Less Wrong, and as ideas not much seems to have come of them. People who bring up the basilisk on other sites don't seem to be super interested in the specific claims Roko made either; discussions tend to gravitate toward various older ideas that Roko cited (e.g., timeless decision theory (TDT) and coherent extrapolated volition (CEV)) or toward Eliezer's controversial moderation action.

In July 2014, David Auerbach wrote a Slate piece criticizing Less Wrong users and describing them as "freaked out by Roko's Basilisk." Auerbach wrote, "Believing in Roko’s Basilisk may simply be a 'referendum on autism'" — which I take to mean he thinks a significant number of Less Wrong users accept Roko’s reasoning, and they do so because they’re autistic (!). But the Auerbach piece glosses over the question of how many Less Wrong users (if any) in fact believe in Roko’s basilisk. Which seems somewhat relevant to his argument...?

The idea that Roko's thought experiment holds sway over some community or subculture seems to be part of a mythology that’s grown out of attempts to reconstruct the original chain of events; and a big part of the blame for that mythology's existence lies on Less Wrong's moderation policies. Because the discussion topic was banned for several years, Less Wrong users themselves had little opportunity to explain their views or address misconceptions. A stew of rumors and partly-understood forum logs then congealed into the attempts by people on RationalWiki, Slate, etc. to make sense of what had happened.

I gather that the main reason people thought Less Wrong users were "freaked out" about Roko's argument was that Eliezer deleted Roko's post and banned further discussion of the topic. Eliezer has since sketched out his thought process on Reddit:

When Roko posted about the Basilisk, I very foolishly yelled at him, called him an idiot, and then deleted the post. [...] Why I yelled at Roko: Because I was caught flatfooted in surprise, because I was indignant to the point of genuine emotional shock, at the concept that somebody who thought they'd invented a brilliant idea that would cause future AIs to torture people who had the thought, had promptly posted it to the public Internet. In the course of yelling at Roko to explain why this was a bad thing, I made the further error---keeping in mind that I had absolutely no idea that any of this would ever blow up the way it did, if I had I would obviously have kept my fingers quiescent---of not making it absolutely clear using lengthy disclaimers that my yelling did not mean that I believed Roko was right about CEV-based agents [= Eliezer’s early model of indirectly normative agents that reason with ideal aggregated preferences] torturing people who had heard about Roko's idea. [...] What I considered to be obvious common sense was that you did not spread potential information hazards because it would be a crappy thing to do to someone. The problem wasn't Roko's post itself, about CEV, being correct.

This, obviously, was a bad strategy on Eliezer's part. Looking at the options in hindsight: To the extent it seemed plausible that Roko's argument could be modified and repaired, Eliezer shouldn't have used Roko's post as a teaching moment and loudly chastised him on a public discussion thread. To the extent this didn't seem plausible (or ceased to seem plausible after a bit more analysis), continuing to ban the topic was a (demonstrably) ineffective way to communicate the general importance of handling real information hazards with care.

 


On that note, point number two:

  • Roko's argument wasn’t an attempt to get people to donate to Friendly AI (FAI) research. In fact, the opposite is true.

Roko's original argument was not 'the AI agent will torture you if you don't donate, therefore you should help build such an agent'; his argument was 'the AI agent will torture you if you don't donate, therefore we should avoid ever building such an agent.' As Gerard noted in the ensuing discussion thread, threats of torture "would motivate people to form a bloodthirsty pitchfork-wielding mob storming the gates of SIAI [= MIRI] rather than contribute more money." To which Roko replied: "Right, and I am on the side of the mob with pitchforks. I think it would be a good idea to change the current proposed FAI content from CEV to something that can't use negative incentives on x-risk reducers."

Roko saw his own argument as a strike against building the kind of software agent Eliezer had in mind. Other Less Wrong users, meanwhile, rejected Roko's argument both as a reason to oppose AI safety efforts and as a reason to support AI safety efforts.

Roko's argument was fairly dense, and it continued into the discussion thread. I’m guessing that this (in combination with the temptation to round off weird ideas to the nearest religious trope, plus misunderstanding #1 above) is why RationalWiki's version of Roko’s basilisk gets introduced as

a futurist version of Pascal’s wager; an argument used to try and suggest people should subscribe to particular singularitarian ideas, or even donate money to them, by weighing up the prospect of punishment versus reward.

If I'm correctly reconstructing the sequence of events: Sites like RationalWiki report in the passive voice that the basilisk is "an argument used" for this purpose, yet no examples ever get cited of someone actually using Roko’s argument in this way. Via citogenesis, the claim then gets incorporated into other sites' reporting.

(E.g., in Outer Places: "Roko is claiming that we should all be working to appease an omnipotent AI, even though we have no idea if it will ever exist, simply because the consequences of defying it would be so great." Or in Business Insider: "So, the moral of this story: You better help the robots make the world a better place, because if the robots find out you didn’t help make the world a better place, then they’re going to kill you for preventing them from making the world a better place.")

In terms of argument structure, the confusion is equating the conditional statement 'P implies Q' with the argument 'P; therefore Q.' Someone asserting the conditional isn’t necessarily arguing for Q; they may be arguing against P (based on the premise that Q is false), or they may be agnostic between those two possibilities. And misreporting about which argument was made (or who made it) is kind of a big deal in this case: 'Bob used a bad philosophy argument to try to extort money from people' is a much more serious charge than 'Bob owns a blog where someone once posted a bad philosophy argument.'
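To make the distinction concrete, here are the two inference patterns that share the same conditional premise. This is only a schematic rendering in standard propositional notation; P and Q are the placeholders used in the paragraph above, not Roko's exact wording.

```latex
\[
\frac{P \rightarrow Q \qquad P}{Q}\ \text{(modus ponens)}
\qquad\qquad
\frac{P \rightarrow Q \qquad \neg Q}{\neg P}\ \text{(modus tollens)}
\]
```

Someone who asserts only the conditional has committed to neither pattern. Reporting them as having argued for Q assumes the left-hand form when they may have intended the right-hand one.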

 


Lastly:

  • "Formally speaking, what is correct decision-making?" is an important open question in philosophy and computer science, and formalizing precommitment is an important part of that question.

Moving past Roko's argument itself, a number of discussions of this topic risk misrepresenting the debate's genre. Articles on Slate and RationalWiki strike an informal tone, and that tone can be useful for getting people thinking about interesting science/philosophy debates. On the other hand, if you're going to dismiss a question as unimportant or weird, it's important not to give the impression that working decision theorists are similarly dismissive.

What if your devastating take-down of string theory is intended for consumption by people who have never heard of 'string theory' before? Even if you're sure string theory is hogwash, then, you should be wary of giving the impression that the only people discussing string theory are the commenters on a recreational physics forum. Good reporting by non-professionals, whether or not they take an editorial stance on the topic, should make it obvious that there's academic disagreement about which approach to Newcomblike problems is the right one. The same holds for disagreement about topics like long-term AI risk or machine ethics.

If Roko's original post is of any pedagogical use, it's as an unsuccessful but imaginative stab at drawing out the diverging consequences of our current theories of rationality and goal-directed behavior. Good resources for these issues (both for discussion on Less Wrong and elsewhere) include:

The Roko's basilisk ban isn't in effect anymore, so you're welcome to direct people here (or to the Roko's basilisk wiki page, which also briefly introduces the relevant issues in decision theory) if they ask about it. Particularly low-quality discussions can still get deleted (or politely discouraged), though, at moderators' discretion. If anything here was unclear, you can ask more questions in the comments below.

Blackmail, continued: communal blackmail, uncoordinated responses

11 Stuart_Armstrong 22 October 2014 05:53PM

The heuristic that one should always resist blackmail seems a good one (no matter how tricky blackmail is to define). And one should be public about this, too; then, one is very unlikely to be blackmailed. Even if one speaks like an emperor.

But there's a subtlety: what if the blackmail is being used against a whole group, not just against one person? The US justice system is often seen to function like this: prosecutors pile on ridiculous numbers of charges, threatening uncounted millennia in jail, in order to get the accused to settle for a lesser charge and avoid the expenses of a trial.

But for this to work, they need to occasionally find someone who rejects the offer, put them on trial, and slap them with a ridiculous sentence. Therefore by standing up to them (or proclaiming in advance that you will reject such offers), you are not actually making yourself immune to their threats. You're setting yourself up to be the sacrificial one who gets made an example of.

Of course, if everyone were a UDT agent, the correct decision would be for everyone to reject the threat. That would ensure that the threats are never made in the first place. But - and apologies if this shocks you - not everyone in the world is a perfect UDT agent. So the threats will get made, and those resisting them will get slammed to the maximum.

Of course, if everyone could read everyone's mind and was perfectly rational, then they would realise that making examples of UDT agents wouldn't affect the behaviour of non-UDT agents. In that case, UDT agents should resist the threats, and the perfectly rational prosecutor wouldn't bother threatening UDT agents. However - and sorry to shock your views of reality three times in one post - not everyone is perfectly rational. And not everyone can read everyone's minds.

So even a perfect UDT agent must, it seems, sometimes succumb to blackmail.

Mutual Worth without default point (but with potential threats)

6 Stuart_Armstrong 31 July 2013 09:52AM

Though I planned to avoid posting anything more until well after baby, I found this refinement to MWBS yesterday, so I'm posting it while Miriam sleeps during a pause in contractions.

The mutual worth bargaining solution was built from the idea that the true value of a trade is having your utility function access the decision points of the other player. This gave the idea of utopia points: what happens when you are granted complete control over the other person's decisions. This gave a natural 1 to normalise your utility function. But the 0 point is chosen according to a default point. This is arbitrary, and breaks the symmetry between the top and bottom point of the normalisation.

We'd also want normalisations that function well when players have no idea what their opponents will be. This includes not knowing what their utility functions will be. Can we model what a 'generic' opposing utility function would be?

It's tricky, in general, to know what 'value' to put on an opponent's utility function. What kind of utilities would you like to see them have? That's hard to say because game theory comes into play, with Nash equilibria, multiple solution concepts, bargaining and threats: there is no universal default for the result of a game between two agents. There are two situations, however, that are respectively better and worse than all others: the situation where your opponent shares your exact utility function, and the situation where they have the negative of it (they're essentially your 'anti-agent').

If your opponent shares your utility function, then there is a clear ideal outcome: act as if you and the opponent were the same person, acting to maximise your joint utility. This is the utopia point for MWBS, which can be standardised to take value 1.

If your opponent has the negative of your utility, then the game is zero-sum: any gain to you is a loss to your opponent, and there is no possibility of a mutually pleasing compromise. But zero-sum games also have a single canonical outcome! For zero-sum games, the concepts of Nash equilibrium, minimax, and maximin are all equivalent (and are generally mixed outcomes). The game has a single defined value: you can guarantee that you get at least that much utility, and your opponent can guarantee that you get no more.

It seems natural to normalise that point to -1 (0 would be equivalent, but -1 feels more appropriate). Given this normalisation for each utility, the two utilities can then be summed and jointly maximised in the usual way.
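The normalisation described above can be sketched for the smallest interesting case, a 2x2 game. This is an illustrative sketch only: the function names, the restriction to 2x2 payoff tables, and the example utilities `u1` and `u2` are all assumptions made here, not part of the original proposal. Player 1 is assumed to choose the row and player 2 the column; each utility is rescaled so that the utopia point maps to 1 and the value of the zero-sum game against the anti-agent maps to -1, and the sum is then maximised over joint outcomes.

```python
def zero_sum_value_2x2(m):
    """Value of a 2x2 zero-sum game; m[i][j] is the payoff to the row maximiser."""
    (a, b), (c, d) = m
    maximin = max(min(a, b), min(c, d))   # best pure guarantee for the maximiser
    minimax = min(max(a, c), max(b, d))   # best pure cap for the minimiser
    if maximin == minimax:                # pure saddle point
        return float(maximin)
    # No saddle point: both sides mix, and the standard 2x2 closed form applies.
    return (a * d - b * c) / (a + d - b - c)


def normalise(u, controls_rows):
    """Rescale u so the utopia point is 1 and the anti-agent game value is -1."""
    top = max(max(row) for row in u)      # opponent shares your utility
    # The player must be the maximiser over their own moves, so transpose
    # the table when the player controls columns rather than rows.
    game = u if controls_rows else [list(col) for col in zip(*u)]
    bottom = zero_sum_value_2x2(game)     # opponent has the negative of it
    if top == bottom:
        raise ValueError("utopia and anti-agent points coincide")
    return [[2 * (x - bottom) / (top - bottom) - 1 for x in row] for row in u]


def best_joint_outcome(u1, u2):
    """Joint (row, column) choice maximising the sum of normalised utilities."""
    n1 = normalise(u1, controls_rows=True)
    n2 = normalise(u2, controls_rows=False)
    cells = [(i, j) for i in range(2) for j in range(2)]
    return max(cells, key=lambda c: n1[c[0]][c[1]] + n2[c[0]][c[1]])


# Hypothetical utilities, indexed [row choice][column choice].
u1 = [[4, 0], [1, 2]]
u2 = [[1, 3], [2, 0]]
print(best_joint_outcome(u1, u2))
```

Larger games would need a linear program to find the zero-sum value; the closed form here is specific to the 2x2 case.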

This bargaining solution has a lot of attractive features - it's symmetric in minimal and maximal utilities, does not require a default point, reflects the relative power, and captures the spread of opponents' utilities that could be encountered without needing to go into game theory. It is vulnerable to (implicit) threats, however! If I can (potentially) cause a lot of damage to you and your cause, then when you normalise your utility, you get penalised because of what your anti-agent could do if they controlled my decision nodes. So just by having the power to do bad stuff to you, I come out better than I would otherwise (and vice-versa, of course).

I feel it's worth exploring further (especially what happens with multiple agents) - but for me, after the baby.

Semi-open thread: blackmail

0 Stuart_Armstrong 15 July 2013 04:25PM

My blackmail posts have generated some interesting discussion, so I'm just creating this one so that people can post examples of behaviours that they think are either clearly blackmail, or clearly not blackmail, or something in between.

Duller blackmail definitions

7 Stuart_Armstrong 15 July 2013 10:08AM

For a more parable-ic version of this, see here.

Suppose I make a precommitment P to take action X unless you take action Y. Action X is not in my interest: I wouldn't do it if I knew you'd never take action Y. You would want me to not precommit to P.

Is this blackmail? Suppose we've been having a steamy affair together, and I have the letters to prove it. It would be bad for both of us if they were published. Then X={Publish the letters} and Y={You pay me money} is textbook blackmail.

But suppose I own a MacGuffin that you want (I value it at £9). If X={Reject any offer} and Y={You offer more than £10}, is this still blackmail? Formally, it looks the same.

What about if I bought the MacGuffin for £500 and you value it at £1000? This makes no difference to the formal structure of the scenario. Yet in that case my behaviour feels utterly reasonable, rather than vicious and blackmail-ly.
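The shared formal structure can be written down as a small predicate. The sketch below is only an illustration under stated assumptions: the `Precommitment` class, its field names, and every payoff number except the £9 valuation are hypothetical, chosen to show that the letters case and the £9 MacGuffin case both satisfy the two conditions given at the top of the post (X is not in my interest if you would never do Y, and you would rather I not precommit).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Precommitment:
    name: str
    me_if_x: float      # my payoff if I carry out X
    me_if_no_x: float   # my payoff if I refrain from X and you never do Y
    you_if_no_p: float  # your payoff if I never precommit to P
    you_if_y: float     # your payoff if I precommit and you comply with Y


def looks_like_blackmail(p):
    """True when P has the formal shape described in the post."""
    x_hurts_me = p.me_if_x < p.me_if_no_x          # X is not in my interest
    you_prefer_no_p = p.you_if_y < p.you_if_no_p   # you'd want me not to precommit
    return x_hurts_me and you_prefer_no_p


# Hypothetical payoffs: publishing embarrasses me; paying costs you.
letters = Precommitment("letters", me_if_x=-10.0, me_if_no_x=0.0,
                        you_if_no_p=0.0, you_if_y=-5.0)

# I value the MacGuffin at £9. Assume, hypothetically, that you value it
# at £20, would otherwise have offered £9.50, and must now offer £10.01.
macguffin = Precommitment("macguffin", me_if_x=0.0, me_if_no_x=0.5,
                          you_if_no_p=20 - 9.50, you_if_y=20 - 10.01)

for scenario in (letters, macguffin):
    print(scenario.name, looks_like_blackmail(scenario))
```

Both scenarios pass the same test, so whatever separates them intuitively is not captured by these two conditions.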

What is the meaningful difference between the two scenarios? I can't really formalise it.

Countess and Baron attempt to define blackmail, fail

11 Stuart_Armstrong 15 July 2013 10:07AM

For a more concise version of this argument, see here.

We meet our heroes, the Countess of Rectitude and Baron Chastity, as they continue to investigate the mysteries of blackmail by sleeping together and betraying each other.

The Baron had a pile of steamy letters between him and the Countess: it would be embarrassing to both of them if these letters got out. Yet the Baron confided the letters to a trusted Acolyte, with strict instructions. The Acolyte was to publish these letters, unless the Countess agreed to give the Baron her priceless Ping Vase.

This seems a perfect example of blackmail:

  • The Baron is taking a course of action that is intrinsically negative for him. This behaviour only makes sense if it forces the Countess to take a specific action which benefits him. The Countess would very much like it if the Baron couldn't do such things.

As it turns out, a servant broke the Ping Vase while chasing the Countess's griffon. The servant was swiftly executed, but the Acolyte had to publish the letters as instructed, to great embarrassment all around (sometimes precommitments aren't what they're cracked up to be). After six days of exile in the Countess's doghouse (a luxurious, twenty-room affair) and eleven days of make-up sex, the Baron was back to planning against his lover.
