A visual depiction of a prisoner's dilemma. T denotes the best outcome for a given player, followed by R, then P, then S.
One example of a Newcomblike problem is the prisoner's dilemma. This is a two-player game in which each player has two options: "cooperate" or "defect." By assumption, each player prefers to defect rather than cooperate, all else being equal; but each player also prefers mutual cooperation over mutual defection.
One of the basic open problems in decision theory is that standard "rational" agents will end up defecting against each other, even though it would be better for both players if they could somehow enact a binding mutual agreement to cooperate instead.
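The dominance reasoning behind mutual defection can be made concrete with a small sketch (mine, not the article's). The numeric payoffs below are arbitrary placeholders that only need to satisfy the ordering T > R > P > S used in the figure above:

```python
# Hypothetical payoff values satisfying T > R > P > S (temptation, reward,
# punishment, sucker's payoff); only the ordering matters for the argument.
T, R, P, S = 3, 2, 1, 0

# payoffs[(my_move, their_move)] -> my payoff
payoffs = {
    ("defect",    "cooperate"): T,
    ("cooperate", "cooperate"): R,
    ("defect",    "defect"):    P,
    ("cooperate", "defect"):    S,
}

def causal_best_response(their_move):
    """Pick the move that maximizes my payoff, holding the other player's move fixed."""
    return max(["cooperate", "defect"], key=lambda m: payoffs[(m, their_move)])

# Defection dominates: it is the best response whatever the other player does...
assert causal_best_response("cooperate") == "defect"
assert causal_best_response("defect") == "defect"

# ...yet mutual defection is worse for both players than mutual cooperation.
assert payoffs[("cooperate", "cooperate")] > payoffs[("defect", "defect")]
```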
In other words, the standard formulation of CDT cannot model scenarios where another agent (or a part of the environment) is correlated with a decision process, except insofar as the decision causes the correlation. The general name for scenarios where CDT fails is "Newcomblike problems," and these scenarios are ubiquitous in human interactions.
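As a hedged illustration of the correlation point (again mine, not the article's), consider the "twin" variant of the prisoner's dilemma, in which the other player is an exact copy guaranteed to choose the same move. A CDT-style comparison holds the twin's move fixed and still recommends defection, even though the correlation guarantees that both copies end up with the worse mutual-defection outcome:

```python
# Same hypothetical T > R > P > S placeholder payoffs as in the sketch above.
T, R, P, S = 3, 2, 1, 0
payoffs = {
    ("defect",    "cooperate"): T,
    ("cooperate", "cooperate"): R,
    ("defect",    "defect"):    P,
    ("cooperate", "defect"):    S,
}

def cdt_choice():
    """CDT treats the twin's move as causally independent of mine, so it
    compares my options while holding the twin's move fixed; defection wins
    against either fixed move, so CDT defects."""
    best_vs = {o: max(["cooperate", "defect"], key=lambda m: payoffs[(m, o)])
               for o in ("cooperate", "defect")}
    assert set(best_vs.values()) == {"defect"}  # defection dominates
    return "defect"

def twin_outcome(my_move):
    """But the twin is a perfect copy: whatever I play, it plays the same."""
    return payoffs[(my_move, my_move)]

# CDT's choice yields mutual defection (P), which is worse than the mutual
# cooperation (R) an agent that respects the correlation would obtain.
assert twin_outcome(cdt_choice()) < twin_outcome("cooperate")
```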
Yudkowsky's interest in decision theory stems from his interest in the AI control problem: "If artificially intelligent systems someday come to surpass humans in intelligence, how can we specify safe goals for them to autonomously carry out, and how can we gain high confidence in the agents' reasoning and decision-making?" Yudkowsky has argued that in the absence of a full understanding of decision theory, we risk building autonomous systems whose behavior is erratic or difficult to model.
Because Eliezer Yudkowsky founded Less Wrong and was one of the first bloggers on the site, AI theory and "acausal" decision theories (in particular, logical decision theories, which respect logical connections between agents' properties rather than just the causal effects they have on each other) have been repeatedly discussed on Less Wrong. Roko's basilisk was an attempt to use Yudkowsky's proposed decision theory (TDT) to argue against his informal characterization of an ideal AI goal (humanity's coherently extrapolated volition).
A simple depiction of an agent that cooperates with copies of itself in the one-shot prisoner's dilemma. Adapted from the Decision Theory FAQ.
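One minimal way to sketch such an agent, assuming the usual "compare the opponent's source code to my own" construction (the FAQ's actual example may differ in detail), is a program that cooperates exactly when its opponent is an identical copy:

```python
import inspect

def clique_bot(opponent_source: str) -> str:
    """Cooperate only if the opponent is an exact copy of this agent.

    Two copies of this program, each given the other's source code, both
    cooperate in a one-shot prisoner's dilemma; against any other program,
    this agent defects. (Run from a file so inspect can read the source.)
    """
    my_source = inspect.getsource(clique_bot)
    return "cooperate" if opponent_source == my_source else "defect"

# Two identical copies cooperate with each other:
assert clique_bot(inspect.getsource(clique_bot)) == "cooperate"
# Against a different program, the agent defects:
assert clique_bot("def defect_bot(opponent_source): return 'defect'") == "defect"
```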
Roko observed that if two TDT or UDT agents with common knowledge of each other's source code are separated in time, the later agent can (seemingly) blackmail the earlier agent. Call the earlier agent "Alice" and the later agent "Bob." Bob can be an algorithm that outputs things Alice likes if Alice left Bob a large sum of money, and outputs things Alice dislikes otherwise. And since Alice knows Bob's source code exactly, she knows this fact about Bob (even though Bob hasn't been born yet). So Alice's knowledge of Bob's source code makes Bob's future threat effective, even though Bob doesn't yet exist: if Alice is certain that...
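The structure of this observation can be sketched as a toy model (hypothetical names and placeholder payoffs of my choosing, not anything from the original post): Bob's policy is a pure function of whether Alice paid, and Alice, who can run Bob's source code, evaluates her options by simulating his response to each of them.

```python
# Toy model of the Alice/Bob setup; the payoff numbers are placeholders
# chosen only to make the blackmail incentive visible.

def bob_policy(alice_paid: bool) -> str:
    """Bob's source code: produce outcomes Alice likes if she paid,
    outcomes she dislikes otherwise."""
    return "reward" if alice_paid else "punish"

def alice_utility(paid: bool) -> float:
    cost_of_paying = 10 if paid else 0
    # Alice knows Bob's exact source code, so she can simulate his response
    # to each of her possible choices before Bob even exists.
    outcome = bob_policy(alice_paid=paid)
    value_of_outcome = 0 if outcome == "reward" else -100
    return -cost_of_paying + value_of_outcome

# Because Alice can predict Bob perfectly, the not-yet-existing Bob's
# conditional threat already changes which of her options looks best.
best_choice = max([True, False], key=alice_utility)
assert best_choice is True  # Alice pays
```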
What about a similar AI that helps anyone who tries to bring it into existence and does nothing to other people?
Roko's basilisk is a thought experiment proposed in 2010 by the user Roko on the Less Wrong community blog. Roko used ideas in decision theory to argue that a sufficiently powerful AI agent would have an incentive to torture anyone who imagined the agent but didn't work to bring the agent into existence. The argument was called a "basilisk" (named after the legendary reptile who can cause death with a single glance) because merely hearing the argument would supposedly put you at risk of torture from this hypothetical agent. A basilisk in this context is any information that harms or endangers the people who hear it.
Roko's argument was broadly rejected on Less Wrong, with commenters objecting that an agent like the one Roko was describing would have no real reason to follow through on its threat: once the agent already exists, it will by default just see it as a waste of resources to torture people for their past decisions, since this doesn't causally further its plans. A number of decision algorithms can follow through on acausal threats and promises, via the same precommitment methods that permit mutual cooperation in prisoner's dilemmas; but this doesn't imply that such algorithms can be blackmailed. And following through on blackmail threats against such an algorithm additionally requires a large amount of shared information and trust between the agents, which does not appear to exist in the case of Roko's basilisk.