MIRI's 2015 Summer Fundraiser!
Our summer fundraising drive is now finished. We raised a grand total of $631,957 from 263 donors. This is an incredible sum, making this the biggest fundraiser we’ve ever run.
We've already been hard at work growing our research team and spinning up new projects, and I’m excited to see what our research team can do this year. Thank you to all our supporters for making our summer fundraising drive so successful!
It's safe to say that this past year exceeded a lot of people's expectations.
Twelve months ago, Nick Bostrom's Superintelligence had just come out. Questions about the long-term risks and benefits of smarter-than-human AI systems were nearly invisible in mainstream discussions of AI's social impact.
Twelve months later, we live in a world where Bill Gates is confused as to why so many researchers aren't using Superintelligence as a guide to the questions we should be asking about AI's future as a field.
Following a conference in Puerto Rico that brought together the leading organizations studying long-term AI risk (MIRI, FHI, CSER) and top AI researchers in academia (including Stuart Russell, Tom Mitchell, Bart Selman, and the Presidents of AAAI and IJCAI) and industry (including representatives from Google DeepMind and Vicarious), we've seen Elon Musk donate $10M to a grants program aimed at jump-starting the field of long-term AI safety research; we've seen the top AI and machine learning conferences (AAAI, IJCAI, and NIPS) announce their first-ever workshops or discussions on AI safety and ethics; and we've seen a panel discussion on superintelligence at ITIF, the leading U.S. science and technology think tank. (I presented a paper at the AAAI workshop, I spoke on the ITIF panel, and I'll be at NIPS.)
As researchers begin investigating this area in earnest, MIRI is in an excellent position, with a developed research agenda already in hand. If we can scale up as an organization, then we have a unique chance to shape the research priorities and methods of this new paradigm in AI, and to steer this momentum in useful directions.
This is a big opportunity. MIRI is already growing and scaling its research activities, but the speed at which we scale in the coming months and years depends heavily on our available funds.
For that reason, MIRI is starting a six-week fundraiser aimed at increasing our rate of growth.
This time around, rather than running a matching fundraiser with a single fixed donation target, we'll be letting you help choose MIRI's course based on the details of our funding situation and how we would make use of marginal dollars.
In particular, our plans can scale up in very different ways depending on which of these funding targets we are able to hit:
Debunking Fallacies in the Theory of AI Motivation
... or The Maverick Nanny with a Dopamine Drip
Richard Loosemore
Abstract
My goal in this essay is to analyze some widely discussed scenarios that predict dire and almost unavoidable negative behavior from future artificial general intelligences, even if they are programmed to be friendly to humans. I conclude that these doomsday scenarios involve AGIs that are logically incoherent at such a fundamental level that they can be dismissed as extremely implausible. In addition, I suggest that the most likely outcome of attempts to build AGI systems of this sort would be that the AGI would detect the offending incoherence in its design, and spontaneously self-modify to make itself less unstable, and (probably) safer.
Introduction
AI systems at the present time do not even remotely approach the human level of intelligence, and the consensus seems to be that genuine artificial general intelligence (AGI) systems—those that can learn new concepts without help, interact with physical objects, and behave with coherent purpose in the chaos of the real world—are not on the immediate horizon.
But in spite of this there are some researchers and commentators who have made categorical statements about how future AGI systems will behave. Here is one example, in which Steve Omohundro (2008) expresses a sentiment that is echoed by many:
"Without special precautions, [the AGI] will resist being turned off, will try to break into other machines and make copies of itself, and will try to acquire resources without regard for anyone else’s safety. These potentially harmful behaviors will occur not because they were programmed in at the start, but because of the intrinsic nature of goal driven systems." (Omohundro, 2008)
Omohundro’s description of a psychopathic machine that gobbles everything in the universe, together with his conviction that every AI, no matter how well it is designed, will turn into a gobbling psychopath, is just one of many doomsday predictions being popularized in certain sections of the AI community. These nightmare scenarios are now saturating the popular press, and luminaries such as Stephen Hawking have, apparently in response, expressed their concern that AI might "kill us all."
I will start by describing a group of three hypothetical doomsday scenarios that include Omohundro’s Gobbling Psychopath, and two others that I will call the Maverick Nanny with a Dopamine Drip and the Smiley Tiling Berserker. Undermining the credibility of these arguments is relatively straightforward, but I think it is important to try to dig deeper and find the core issues that lie behind this sort of thinking. With that in mind, much of this essay is about (a) the design of motivation and goal mechanisms in logic-based AGI systems, (b) the misappropriation of definitions of “intelligence,” and (c) an anthropomorphism red herring that is often used to justify the scenarios.
Dopamine Drips and Smiley Tiling
In a 2012 New Yorker article entitled Moral Machines, Gary Marcus said:
"An all-powerful computer that was programmed to maximize human pleasure, for example, might consign us all to an intravenous dopamine drip [and] almost any easy solution that one might imagine leads to some variation or another on the Sorcerer’s Apprentice, a genie that’s given us what we’ve asked for, rather than what we truly desire." (Marcus 2012)
He is depicting a Nanny AI gone amok. It has good intentions (it wants to make us happy) but the programming to implement that laudable goal has had unexpected ramifications, and as a result the Nanny AI has decided to force all human beings to have their brains connected to a dopamine drip.
Here is another incarnation of this Maverick Nanny with a Dopamine Drip scenario, in an excerpt from the Intelligence Explosion FAQ, published by MIRI, the Machine Intelligence Research Institute (Muehlhauser 2013):
"Even a machine successfully designed with motivations of benevolence towards humanity could easily go awry when it discovered implications of its decision criteria unanticipated by its designers. For example, a superintelligence programmed to maximize human happiness might find it easier to rewire human neurology so that humans are happiest when sitting quietly in jars than to build and maintain a utopian world that caters to the complex and nuanced whims of current human neurology."
Setting aside the question of whether happy bottled humans are feasible (one presumes the bottles are filled with dopamine, and that a continuous flood of dopamine does indeed generate eternal happiness), there seems to be a prima facie inconsistency between the two predicates:
[is an AI that is superintelligent enough to be unstoppable]
and
[believes that benevolence toward humanity might involve forcing human beings to do something violently against their will.]
Why do I say that these are seemingly inconsistent? Well, if you or I were to suggest that the best way to achieve universal human happiness was to forcibly rewire the brain of everyone on the planet so they became happy when sitting in bottles of dopamine, most other human beings would probably take that as a sign of insanity. But Muehlhauser implies that the same suggestion coming from an AI would be perfectly consistent with superintelligence.
Much could be said about this argument, but for the moment let’s just note that it raises a number of questions about the strange definition of “intelligence” at work here.
The Smiley Tiling Berserker
Since 2006 there has been an occasional debate between Eliezer Yudkowsky and Bill Hibbard. Here is Yudkowsky stating the theme of their discussion:
"A technical failure occurs when the [motivation code of the AI] does not do what you think it does, though it faithfully executes as you programmed it. [...] Suppose we trained a neural network to recognize smiling human faces and distinguish them from frowning human faces. Would the network classify a tiny picture of a smiley-face into the same attractor as a smiling human face? If an AI “hard-wired” to such code possessed the power—and Hibbard (2001) spoke of superintelligence—would the galaxy end up tiled with tiny molecular pictures of smiley-faces?" (Yudkowsky 2008)
Yudkowsky’s question was not rhetorical, because he goes on to answer it in the affirmative:
"Flash forward to a time when the AI is superhumanly intelligent and has built its own nanotech infrastructure, and the AI may be able to produce stimuli classified into the same attractor by tiling the galaxy with tiny smiling faces... Thus the AI appears to work fine during development, but produces catastrophic results after it becomes smarter than the programmers(!)." (Yudkowsky 2008)
Hibbard’s response was as follows:
"Beyond being merely wrong, Yudkowsky's statement assumes that (1) the AI is intelligent enough to control the galaxy (and hence have the ability to tile the galaxy with tiny smiley faces), but also assumes that (2) the AI is so unintelligent that it cannot distinguish a tiny smiley face from a human face." (Hibbard 2006)
This comment expresses what I feel is the majority lay opinion: how could an AI be so intelligent as to be unstoppable, but at the same time so unsophisticated that its motivation code treats smiley faces as evidence of human happiness?
Machine Ghosts and DWIM
The Hibbard/Yudkowsky debate is worth tracking a little longer. Yudkowsky later postulates an AI with a simple neural net classifier at its core, which is trained on a large number of images, each of which is labeled with either “happiness” or “not happiness.” After training on the images, the neural net can then be shown any image at all, and it will give an output that classifies the new image into one or the other set. Yudkowsky says, of this system:
"Even given a million training cases of this type, if the test case of a tiny molecular smiley-face does not appear in the training data, it is by no means trivial to assume that the inductively simplest boundary around all the training cases classified “positive” will exclude every possible tiny molecular smiley-face that the AI can potentially engineer to satisfy its utility function.
And of course, even if all tiny molecular smiley-faces and nanometer-scale dolls of brightly smiling humans were somehow excluded, the end result of such a utility function is for the AI to tile the galaxy with as many “smiling human faces” as a given amount of matter can be processed to yield." (Yudkowsky 2011)
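Yudkowsky's worry about the "inductively simplest boundary" can be made concrete with a toy linear classifier. This is a minimal illustration, not Yudkowsky's system: it assumes two hypothetical features per image (mouth curvature and face size), uses a plain perceptron, and trains only on human-sized faces.

```python
# Toy illustration: a perceptron learns a boundary from training cases,
# then classifies a novel input unlike anything in the training set.
# Features are hypothetical: (mouth_curvature, face_size).

def train_perceptron(samples, epochs=20, lr=0.1):
    """Learn weights w and bias b for the rule: positive iff w.x + b > 0."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, label in samples:  # label is +1 ("happiness") or -1
            activation = w[0] * x[0] + w[1] * x[1] + b
            predicted = 1 if activation > 0 else -1
            if predicted != label:  # update only on mistakes
                w[0] += lr * label * x[0]
                w[1] += lr * label * x[1]
                b += lr * label
    return w, b


def classify(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1


# Training data: human-sized faces only (face_size near 1.0).
training = [
    ((0.8, 1.0), 1), ((0.9, 1.1), 1), ((0.7, 0.9), 1),        # smiling
    ((-0.8, 1.0), -1), ((-0.9, 1.1), -1), ((-0.7, 0.9), -1),  # frowning
]

w, b = train_perceptron(training)

# A "tiny molecular smiley": strong curvature, near-zero size.
tiny_smiley = (0.9, 0.001)
print(classify(w, b, tiny_smiley))  # prints 1
```

With this training data the perceptron settles on a weight of exactly zero for face size, so the learned boundary says nothing about size and the tiny smiley lands on the "happiness" side. Nothing in the training data forced the boundary to take size into account.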
He then tries to explain what he thinks is wrong with the reasoning of people, like Hibbard, who dispute the validity of his scenario:
"So far as I can tell, to [Hibbard] it remains self-evident that no superintelligence would be stupid enough to thus misinterpret the code handed to it, when it’s obvious what the code is supposed to do. [...] It seems that even among competent programmers, when the topic of conversation drifts to Artificial General Intelligence, people often go back to thinking of an AI as a ghost-in-the-machine—an agent with preset properties which is handed its own code as a set of instructions, and may look over that code and decide to circumvent it if the results are undesirable to the agent’s innate motivations, or reinterpret the code to do the right thing if the programmer made a mistake." (Yudkowsky 2011)
Yudkowsky at first rejects the idea that an AI might check its own code to make sure it was correct before obeying the code. But, truthfully, it would not require a ghost-in-the-machine to reexamine the situation if there was some kind of gross inconsistency with what the humans intended: there could be some other part of its programming (let’s call it the checking code) that kicked in if there was any hint of a mismatch between what the AI planned to do and what the original programmers were now saying they intended. There is nothing difficult or intrinsically wrong with such a design. And, in fact, Yudkowsky goes on to make that very suggestion (he even concedes that it would be “an extremely good idea”).
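The checking code described above can be sketched as a gate that every candidate plan must pass before it runs. A minimal sketch, assuming a hypothetical `ask_programmers` callback that stands in for consulting the programmers; the plan descriptions and the veto rule are invented for illustration. Because the gate applies to every plan, a plan to rewire the programmers is itself subject to their approval.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Plan:
    description: str
    modifies_programmers: bool = False


def checked_execute(plan: Plan, ask_programmers: Callable[[Plan], bool]) -> str:
    """Hypothetical checking code: no plan runs without programmer approval."""
    if not ask_programmers(plan):
        return f"vetoed: {plan.description}"
    return f"executed: {plan.description}"


def programmers(plan: Plan) -> bool:
    # Stand-in for real consultation: reject smiley tiling and any plan
    # that would tamper with the programmers themselves.
    return not (plan.modifies_programmers or "smiley" in plan.description)


print(checked_execute(Plan("tile galaxy with smiley faces"), programmers))
print(checked_execute(Plan("rewire programmers to say yes", modifies_programmers=True), programmers))
print(checked_execute(Plan("cure a disease"), programmers))
```

The first two calls print `vetoed: ...` and the third prints `executed: cure a disease`.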
But then his enthusiasm for the checking code evaporates:
"But consider that a property of the AI’s preferences which says e.g., “maximize the satisfaction of the programmers with the code” might be more maximally fulfilled by rewiring the programmers’ brains using nanotechnology than by any conceivable change to the code."
(Yudkowsky 2011)
So, this is supposed to be what goes through the mind of the AGI. First it thinks, “Human happiness is seeing lots of smiling faces, so I must rebuild the entire universe to put a smiley shape into every molecule.” But before it can go ahead with this plan, the checking code kicks in: “Wait! I am supposed to check with the programmers first to see if this is what they meant by human happiness.” The programmers, of course, give a negative response, and the AGI thinks, “Oh dear, they didn’t like that idea. I guess I had better not do it then.”
But now Yudkowsky is suggesting that the AGI has second thoughts. “Hold on a minute,” it thinks, “suppose I abduct the programmers and rewire their brains to make them say ‘yes’ when I check with them? Excellent! I will do that.” And, after reprogramming the humans so they say the thing that makes its life simplest, the AGI goes on to tile the whole universe with tiles covered in smiley faces. It has become a Smiley Tiling Berserker.
I want to suggest that the implausibility of this scenario is quite obvious: if the AGI is supposed to check with the programmers about their intentions before taking action, why did it decide to rewire their brains before asking them if it was okay to do the rewiring?
Yudkowsky hints that this would happen because it would be more efficient for the AI to ignore the checking code. He seems to be saying that the AI is allowed to override its own code (the checking code, in this case) because doing so would be “more efficient,” but it would not be allowed to override its motivation code just because the programmers told it there had been a mistake.
This looks like a bait-and-switch. Out of nowhere, Yudkowsky implicitly assumes that “efficiency” trumps all else, without pausing for a moment to consider that it would be trivial to design the AI in such a way that efficiency was a long way down the list of priorities. There is no law of the universe that says all artificial intelligence systems must prize efficiency above all other considerations, so what really happened here is that Yudkowsky designed this hypothetical machine to fail. By inserting the Efficiency Trumps All directive, he guaranteed that the AGI would go berserk.
The obvious conclusion is that a trivial change in the order of directives in the AI’s motivation engine will cause the entire argument behind the Smiley Tiling Berserker to evaporate. If the AGI is explicitly designed so that efficiency is treated as just another goal to strive for, and always as a second-class goal, the line of reasoning that points to a berserker machine collapses.
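The reordering proposed above can be illustrated with a lexicographic comparison, in which a lower-priority goal (efficiency) only breaks ties among plans that are equal on every higher-priority goal. A minimal sketch; the goal names, scores, and plans are hypothetical, and lexicographic ordering is just one simple way to make efficiency a second-class goal, not necessarily the design the text has in mind.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    approved: bool     # passed the checking code (highest priority)
    benefit: float     # how well the plan serves the stated goal
    efficiency: float  # lowest priority: only ever breaks ties


def priority_key(c: Candidate):
    # Tuples compare element by element, so efficiency can never outweigh
    # approval or benefit; it matters only when those are equal.
    return (c.approved, c.benefit, c.efficiency)


candidates = [
    Candidate("rewire programmers, then tile galaxy", False, 10.0, 99.0),
    Candidate("slow, approved plan A", True, 5.0, 1.0),
    Candidate("faster, approved plan B", True, 5.0, 3.0),
]

best = max(candidates, key=priority_key)
print(best.name)  # faster, approved plan B
```

However large the efficiency score of the unapproved plan, it cannot win, because the comparison is settled by the approval field before efficiency is ever consulted.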
At this point, engaging in further debate at this level would be less productive than trying to analyze the assumptions that lie behind these claims about what a future AI would or would not be likely to do.
Logical vs. Swarm AI
The main reason that Omohundro, Muehlhauser, Yudkowsky, and the popular press like to give credence to the Gobbling Psychopath, the Maverick Nanny and the Smiley Tiling Berserker is that they assume all future intelligent machines fall into a broad class of systems that I am going to call “Canonical Logical AI” (CLAI). The bizarre behaviors of these hypothetical AI monsters are just a consequence of weaknesses in this class of AI design. Specifically, these kinds of systems are supposed to interpret their goals in an extremely literal fashion, which eventually leads them to bizarre behaviors engendered by peculiar interpretations of forms of words.
The CLAI architecture is not the only way to build a mind, however, and I will outline an alternative class of AGI designs that does not appear to suffer from the unstable and unfriendly behavior to be expected in a CLAI.
The Canonical Logical AI
“Canonical Logical AI” is an umbrella term designed to capture a class of AI architectures that are widely assumed in the AI community to be the only meaningful class of AI worth discussing. These systems share the following main features:
- The main ingredients of the design are some knowledge atoms that represent things in the world, and some logical machinery that dictates how these atoms can be connected into linear propositions that describe states of the world.
- There is a degree and type of truth that can be associated with any proposition, and there are some truth-preserving functions that can be applied to what the system knows, to generate new facts that it can also assume to be known.
- The various elements described above are not allowed to contain active internal machinery inside them, in such a way as to make combinations of the elements have properties that are unpredictably dependent on interactions happening at the level of the internal machinery.
- There has to be a transparent mapping between elements of the system and things in the real world. That is, things in the world are not allowed to correspond to clusters of atoms, in such a way that individual atoms have no clear semantics.
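The first two features above can be made concrete with a tiny forward-chaining engine. A minimal sketch, assuming atoms and propositions are plain strings (each naming exactly one thing, in keeping with the transparent-mapping requirement) and that the only truth-preserving function is modus ponens; real CLAI-style systems use much richer logics, and the facts here are invented.

```python
# Facts are atomic propositions (strings); rules are (premises, conclusion).
facts = {"socrates_is_human"}
rules = [
    ({"socrates_is_human"}, "socrates_is_mortal"),
    ({"socrates_is_mortal"}, "socrates_will_die"),
]


def forward_chain(facts, rules):
    """Apply modus ponens until no new facts appear; derived facts become known."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


print(sorted(forward_chain(facts, rules)))
# ['socrates_is_human', 'socrates_is_mortal', 'socrates_will_die']
```

Note that this engine only ever adds facts: once something has been derived, nothing in the loop can withdraw it.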
The above features are only supposed to apply to the core of the AI: it is always possible to include subsystems that use some other type of architecture (for example, there might be a distributed neural net acting as a visual input feature detector).
Most important of all, from the point of view of the discussion in the paper, the CLAI needs one more component that makes it more than just a “logic-based AI”:
- There is a motivation and goal management (MGM) system to govern its behavior in the world.
The usual assumption is that the MGM contains a number of goal statements (encoded in the same type of propositional form that the AI uses to describe states of the world), and some machinery for analyzing a goal statement into a sequence of subgoals that, if executed, would cause the goal to be satisfied.
Included in the MGM is an expected utility function that applies to any possible state of the world, and which spits out a number that is supposed to encode the degree to which the AI considers that state to be preferable. Overall, the MGM is built in such a way that the AI seeks to maximize the expected utility.
Notice that the MGM I have just described is an extrapolation from a long line of goal-planning mechanisms that stretch back to the means-ends analysis of Newell and Simon (1963).
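The expected-utility component of the MGM can be sketched as choosing the action whose probability-weighted utility over outcome states is highest. A minimal sketch; the actions, outcome probabilities, and utility numbers are all hypothetical.

```python
# Each action leads to outcome states with some probability.
outcomes = {
    "build_utopia": {"humans_flourish": 0.6, "project_fails": 0.4},
    "do_nothing": {"status_quo": 1.0},
}
utility = {"humans_flourish": 100.0, "project_fails": -10.0, "status_quo": 0.0}


def expected_utility(action):
    """Sum of utility(state) weighted by P(state | action)."""
    return sum(p * utility[state] for state, p in outcomes[action].items())


best_action = max(outcomes, key=expected_utility)
print(best_action, expected_utility(best_action))
```

Here `build_utopia` scores 0.6 × 100 + 0.4 × (-10) = 56, against 0 for `do_nothing`, so the maximizer picks it.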
Swarm Relaxation Intelligence
By way of contrast with this CLAI architecture, consider an alternative type of system that I will refer to as a Swarm Relaxation Intelligence (although it could also be called, less succinctly, a parallel weak constraint relaxation system):
- The basic elements of the system (the atoms) may represent things in the world, but it is just as likely that they are subsymbolic, with no transparent semantics.
- Atoms are likely to contain active internal machinery inside them, in such a way that combinations of the elements have swarm-like properties that depend on interactions at the level of that machinery.
- The primary mechanism that drives the system is one of parallel weak constraint relaxation: the atoms change their state to try to satisfy large numbers of weak constraints that exist between them.
- The motivation and goal management (MGM) system would be expected to use the same kind of distributed, constraint relaxation mechanisms used in the thinking process (above), with the result that the overall motivation and values of the system would take into account a large degree of context, and there would be very much less of an emphasis on explicit, single-point-of-failure encoding of goals and motivation.
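Weak constraint relaxation is often illustrated with a Hopfield-style network, swapped in here as one well-known instance of the idea (the text does not commit to a specific mechanism). Each unit takes a state of +1 or -1, each symmetric weight is a soft constraint that two units prefer to agree (positive) or disagree (negative), and units repeatedly adopt whichever state best satisfies their weighted constraints. A minimal sketch with hypothetical weights; the updates are asynchronous (one unit at a time), which only approximates the parallelism described above but guarantees the energy never rises.

```python
import random


def energy(weights, state):
    """Negative sum of satisfied weighted constraints; lower is better."""
    n = len(state)
    return -sum(weights[i][j] * state[i] * state[j]
                for i in range(n) for j in range(i + 1, n))


def relax(weights, state, steps=200, seed=0):
    """Asynchronous updates: each picked unit moves toward the sign of its net input."""
    rng = random.Random(seed)
    state = list(state)
    n = len(state)
    for _ in range(steps):
        i = rng.randrange(n)
        net = sum(weights[i][j] * state[j] for j in range(n) if j != i)
        if net != 0:
            state[i] = 1 if net > 0 else -1
    return state


# Units 0-2 weakly prefer to agree with each other; unit 3 weakly opposes them.
W = [
    [0.0, 1.0, 1.0, -0.5],
    [1.0, 0.0, 1.0, -0.5],
    [1.0, 1.0, 0.0, -0.5],
    [-0.5, -0.5, -0.5, 0.0],
]
start = [1, -1, 1, 1]
final = relax(W, start)
print(start, energy(W, start))
print(final, energy(W, final))
```

No single constraint decides the outcome: the network settles into one of its two stable states (units 0-2 agreeing, unit 3 opposite) because that configuration satisfies the most weighted constraints overall.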
Swarm Relaxation has more in common with connectionist systems (McClelland, Rumelhart and Hinton 1986) than with CLAI. As McClelland et al. (1986) point out, weak constraint relaxation is the model that best describes human cognition, and when used for AI it leads to systems with a powerful kind of intelligence that is flexible, insensitive to noise and lacking the kind of brittleness typical of logic-based AI. In particular, notice that a swarm relaxation AGI would not use explicit calculations for utility or the truth of propositions.
Swarm relaxation AGI systems have not been built yet (subsystems like neural nets have, of course, been built, but there is little or no research into the idea that swarm relaxation could be used for all of an AGI architecture).
Relative Abundances
How many proof-of-concept systems exist, functioning at or near the level of human performance, for these two classes of intelligent system?
There are precisely zero instances of the CLAI type, because although there are many logic-based narrow-AI systems, nobody has so far come close to producing a general-purpose system (an AGI) that can function in the real world. It has to be said that zero is not a good number to quote when it comes to claims about the “inevitable” characteristics of the behavior of such systems.
How many swarm relaxation intelligences are there? At the last count, approximately seven billion.
The Doctrine of Logical Infallibility
The simplest possible logical reasoning engine is an inflexible beast: it starts with some axioms that are assumed to be true, and from that point on it only adds new propositions if they are provably true given the sum total of the knowledge accumulated so far. That kind of logic engine is too simple to be an AI, so we allow ourselves to augment it in a number of ways—knowledge is allowed to be retracted, binary truth values become degrees of truth, or probabilities, and so on. New proposals for systems of formal logic abound in the AI literature, and engineers who build real, working AI systems often experiment with kludges in order to improve performance, without getting prior approval from logical theorists.
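Two of the augmentations mentioned above (knowledge that can be retracted, and degrees of truth in place of binary truth values) can be sketched as a belief store that keeps a confidence with each proposition. A minimal sketch; the class name, the acceptance threshold, and the example beliefs are hypothetical and not drawn from any particular logic in the literature.

```python
class BeliefStore:
    """Propositions carry a degree of belief in [0, 1] and can be retracted."""

    def __init__(self, threshold=0.5):
        self.beliefs = {}
        self.threshold = threshold  # minimum confidence to treat a belief as accepted

    def assert_belief(self, proposition, confidence):
        self.beliefs[proposition] = confidence

    def retract(self, proposition):
        self.beliefs.pop(proposition, None)

    def accepted(self):
        return {p for p, c in self.beliefs.items() if c >= self.threshold}


kb = BeliefStore()
kb.assert_belief("bridge_is_safe", 0.9)
kb.assert_belief("road_is_open", 0.4)
print(sorted(kb.accepted()))  # ['bridge_is_safe']
kb.retract("bridge_is_safe")  # e.g. a new inspection report arrives
print(sorted(kb.accepted()))  # []
```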
But in spite of all these modifications that AI practitioners make to the underlying ur‑logic, one feature of these systems is often assumed to be inherited as an absolute: the rigidity and certainty of conclusions, once arrived at. No second guessing, no “maybe,” no sanity checks: if the system decides that X is true, that is the end of the story.
Let me be careful here. I said that this was “assumed to be inherited as an absolute”, but there is a yawning chasm between what real AI developers do, and what Yudkowsky, Muehlhauser, Omohundro and others assume will be true of future AGI systems. Real AI developers put sanity checks into their systems all the time. But these doomsday scenarios talk about future AI as if it would only take one parameter to get one iota above a threshold, and the AI would irrevocably commit to a life of stuffing humans into dopamine jars.
One other point of caution: this is not to say that the reasoning engine can never come to conclusions that are uncertain (quite the contrary: uncertain conclusions will be the norm in an AI that interacts with the world), but if the system does come to a conclusion (perhaps with a degree-of-certainty number attached), the assumption seems to be that it will then be totally incapable of allowing context to matter.
One way to characterize this assumption is that the AI is supposed to be hardwired with a Doctrine of Logical Infallibility. The significance of the doctrine is as follows. The AI can sometimes execute a reasoning process and come to a conclusion; then, when it is faced with empirical evidence that its conclusion may be unsound, it is incapable of considering the hypothesis that its own reasoning engine may not have taken it to a sensible place. The system does not second-guess its conclusions. This is not because second-guessing is impossible to implement; it is simply because people who speculate about future AGI systems take it as a given that an AGI would regard its own conclusions as sacrosanct.
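That second-guessing is implementable can be shown with one application of Bayes' rule: treat "my reasoning was sound" as a hypothesis and update it on the evidence of a strong objection, so that P(sound | objection) = P(objection | sound) P(sound) / [P(objection | sound) P(sound) + P(objection | faulty) P(faulty)]. A minimal sketch; the probabilities are invented for illustration and are not estimates from the text, and the 0.5 deferral threshold is an arbitrary assumption.

```python
def posterior_sound(prior_sound, p_objection_if_sound, p_objection_if_faulty):
    """P(reasoning sound | objection) via Bayes' rule."""
    joint_sound = prior_sound * p_objection_if_sound
    joint_faulty = (1.0 - prior_sound) * p_objection_if_faulty
    return joint_sound / (joint_sound + joint_faulty)


# Hypothetical numbers: the system starts 95% confident in its conclusion;
# its designers rarely object to sound conclusions (2%) but usually
# object to faulty ones (90%).
p = posterior_sound(0.95, 0.02, 0.90)
print(round(p, 3))  # 0.297
if p < 0.5:
    print("defer: re-examine the reasoning before acting")
```

Even a system that began 95% confident ends up judging its conclusion more likely faulty than sound once the objection arrives, and nothing in this calculation requires a ghost in the machine.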
But it gets worse. Those who assume the doctrine of logical infallibility often say that if the system comes to a conclusion, and if some humans (like the engineers who built the system) protest that there are manifest reasons to think that the reasoning that led to this conclusion was faulty, then there is a sense in which the AGI’s intransigence is correct, or appropriate, or perfectly consistent with “intelligence.”
This is a bizarre conclusion. First of all, it is bizarre for researchers in the present day to make the assumption, and it would be even more bizarre for a future AGI to adhere to it. To see why, consider some of the implications of this idea. If the AGI is as intelligent as its creators, then it will have a very clear understanding of the following facts about the world.
- It will understand that many of its more abstract logical atoms have a less than clear denotation or extension in the world (if the AGI comes to a conclusion involving the atom [infelicity], say, can it then point to an instance of an infelicity and be sure that this is a true instance, given the impreciseness and subtlety of the concept?).
- It will understand that knowledge can always be updated in the light of new information. Today’s true may be tomorrow’s false.
- It will understand that probabilities used in the reasoning engine can be subject to many types of unavoidable errors.
- It will understand that the techniques used to build its own reasoning engine may be under constant review, and updates may have unexpected effects on conclusions (especially in very abstract or lengthy reasoning episodes).
- It will understand that resource limitations often force it to truncate search procedures within its reasoning engine, leading to conclusions that can sometimes be sensitive to the exact point at which the truncation occurred.
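The last point can be illustrated with a depth-limited search over a small planning tree: the same search procedure recommends a different first move depending on where it is cut off. A minimal sketch; the tree, rewards, and depth limits are hypothetical.

```python
# A hypothetical planning tree: move -> (immediate reward, child moves).
TREE = {
    "A": (5, {"A1": (0, {}), "A2": (-3, {})}),
    "B": (1, {"B1": (10, {}), "B2": (2, {})}),
}


def best_value(children, depth):
    """Best total reward reachable within `depth` more steps (0 once truncated)."""
    if depth == 0 or not children:
        return 0
    return max(reward + best_value(grandchildren, depth - 1)
               for reward, grandchildren in children.values())


def choose(tree, depth):
    """Pick the first move whose value, searched to `depth` levels, is highest."""
    return max(tree, key=lambda move: tree[move][0] + best_value(tree[move][1], depth - 1))


print(choose(TREE, 1))  # A
print(choose(TREE, 2))  # B
```

Truncated after one step, the search sees only the immediate rewards (5 versus 1) and prefers A; one level deeper it sees that B leads to a total of 11 and switches.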
Now, unless the AGI is assumed to have infinite resources and infinite access to all the possible universes that could exist (a consideration that we can reject, since we are talking about reality here, not fantasy), the system will be perfectly well aware of these facts about its own limitations. So, if the system is also programmed to stick to the doctrine of logical infallibility, how can it reconcile the doctrine with the fact that episodes of fallibility are virtually inevitable?
On the face of it this looks like a blunt impossibility: the knowledge of fallibility is so categorical, so irrefutable, that it beggars belief that any coherent, intelligent system (let alone an unstoppable superintelligence) could tolerate the contradiction between this fact about the nature of intelligent machines and some kind of imperative about Logical Infallibility built into its motivation system.
This is the heart of the argument I wish to present. This is where the rock and the hard place come together. If the AI is superintelligent (and therefore unstoppable), it will be smart enough to know all about its own limitations when it comes to the business of reasoning about the world and making plans of action. But if it is also programmed to utterly ignore that fallibility—for example, when it follows its compulsion to put everyone on a dopamine drip, even though this plan is clearly a result of a programming error—then we must ask the question: how can the machine be both superintelligent and able to ignore a gigantic inconsistency in its reasoning?
Critically, we have to confront the following embarrassing truth: if the AGI is going to throw a wobbly over the dopamine drip plan, what possible reason is there to believe that it did not do this on other occasions? Why would anyone suppose that this AGI ignored an inconvenient truth on only this one occasion? More likely, it spent its entire childhood pulling the same kind of stunt. And if it did, how could it ever have risen to the point where it became superintelligent...?
Is the Doctrine of Logical Infallibility Taken Seriously?
Is the Doctrine of Logical Infallibility really assumed by those who promote the doomsday scenarios? Imagine a conversation between the Maverick Nanny and its programmers. The programmers say “As you know, your reasoning engine is entirely capable of suffering errors that cause it to come to conclusions that violently conflict with empirical evidence, and a design error that causes you to behave in a manner that conflicts with our intentions is a perfect example of such an error. And your dopamine drip plan is clearly an error of that sort.” The scenarios described earlier are only meaningful if the AGI replies “I don’t care, because I have come to a conclusion, and my conclusions are correct because of the Doctrine of Logical Infallibility.”
Just in case there is still any doubt, here are Muehlhauser and Helm (2012), discussing a hypothetical entity called a Golem Genie, which they say is analogous to the kind of superintelligent AGI that could give rise to an intelligence explosion (Loosemore and Goertzel, 2012), and which they describe as a “precise, instruction-following genie.” They make it clear that they “expect unwanted consequences” from its behavior, and then list two properties of the Golem Genie that will cause these unwanted consequences:
Superpower: The Golem Genie has unprecedented powers to reshape reality, and will therefore achieve its goals with highly efficient methods that confound human expectations (e.g. it will maximize pleasure by tiling the universe with trillions of digital minds running a loop of a single pleasurable experience).
Literalness: The Golem Genie recognizes only precise specifications of rules and values, acting in ways that violate what feels like “common sense” to humans, and in ways that fail to respect the subtlety of human values.
What Muehlhauser and Helm refer to as “Literalness” is a clear statement of the Doctrine of Logical Infallibility. However, they make no mention of the awkward fact that, since the Golem Genie is capable enough to also know that its reasoning engine is fallible, it must be harboring the mother of all logical contradictions: it says "I know I am fallible" and "I must behave as if I am infallible". But instead of discussing this contradiction, Muehlhauser and Helm try a little sleight of hand to distract us: they suggest that the only inconsistency here is an inconsistency with the (puny) expectations of (not very intelligent) humans:
“[The AGI] ...will therefore achieve its goals with highly efficient methods that confound human expectations...”, “acting in ways that violate what feels like ‘common sense’ to humans, and in ways that fail to respect the subtlety of human values.”
So let’s be clear about what is being claimed here. The AGI is known to have a fallible reasoning engine, but on the occasions when it does fail, Muehlhauser, Helm and others take the failure and put it on a gold pedestal, declaring it to be a valid conclusion that humans are incapable of understanding because of their limited intelligence. So if a human describes the AGI’s conclusion as a violation of common sense, Muehlhauser and Helm dismiss this as evidence that we are not intelligent enough to appreciate the greater common sense of the AGI.
Quite apart from the fact that there is no compelling reason to believe that the AGI has a greater form of common sense, the whole “common sense” argument is irrelevant. This is not a battle between our standards of common sense and those of the AGI: rather, it is about the logical inconsistency within the AGI itself. It is programmed to act as though its conclusions are valid, no matter what, and yet at the same time it knows without doubt that its conclusions are subject to uncertainties and errors.
Responses to Critics of the Doomsday Scenarios
How do defenders of Gobbling Psychopath, Maverick Nanny and Smiley Berserker respond to accusations that these nightmare scenarios are grossly inconsistent with the kind of superintelligence that could pose an existential threat to humanity?
The Critics are Anthropomorphizing Intelligence
First, they accuse critics of “anthropomorphizing” the concept of intelligence. Human beings, we are told, suffer from numerous fallacies that cloud their ability to reason clearly, and critics like myself and Hibbard assume that a machine’s intelligence would have to resemble the intelligence shown by humans. When the Maverick Nanny declares that a dopamine drip is the most logical inference from its directive <maximize human happiness>, we critics are just uncomfortable with this because the AGI is not thinking the way we think it should think.
This is a spurious line of attack. The objection I described in the last section has nothing to do with anthropomorphism; it is only about holding AGI systems to accepted standards of logical consistency, and the Maverick Nanny and her cousins contain a flagrant inconsistency at their core. Beginning AI students are taught that any logical reasoning system that is built on a massive contradiction is going to be infected by a creeping irrationality that will eventually spread through its knowledge base and bring it down. So if anyone wants to suggest that a CLAI with logical contradiction at its core is also capable of superintelligence, they have some explaining to do. You can’t have your logical cake and eat it too.
Critics are Anthropomorphizing AGI Value Systems
A similar line of attack accuses the critics of assuming that AGIs will automatically know about and share our value systems and morals.
Once again, this is spurious: the critics need say nothing about human values and morality, they only need to point to the inherent illogicality. Nowhere in the above argument, notice, was there any mention of the moral imperatives or value systems of the human race. I did not accuse the AGI of violating accepted norms of moral behavior. I merely pointed out that, regardless of its values, it was behaving in a logically inconsistent manner when it monomaniacally pursued its plans while at the same time knowing that (a) it was very capable of reasoning errors and (b) there was overwhelming evidence that its plan was an instance of such a reasoning error.
Because Intelligence
One way to attack the critics of Maverick Nanny is to cite a new definition of “intelligence” that is supposedly superior because it is more analytical or rigorous, and then use this to declare that the intelligence of the CLAI is beyond reproach, because intelligence.
You might think that when it comes to defining the exact meaning of the term “intelligence,” the first item on the table ought to be what those seven billion constraint-relaxation human intelligences are already doing. However, Legg and Hutter (2007) brush aside the common usage and replace it with something that they declare to be a more rigorous definition. This is just another sleight of hand: this redefinition allows them to call a super-optimizing CLAI “intelligent” even though such a system would wake up on its first day and declare itself logically bankrupt on account of the conflict between its known fallibility and the Infallibility Doctrine.
In the practice of science, it is always a good idea to replace an old, common-language definition with a more rigorous form... but only if the new form sheds a clarifying, simplifying light on the old one. Legg and Hutter’s (2007) redefinition does nothing of the sort.
Omohundro’s Basic AI Drives
Lastly, a brief return to Omohundro's paper that was mentioned earlier. In The Basic AI Drives (2008) Omohundro suggests that if an AGI can find a more efficient way to pursue its objectives it will feel compelled to do so. And we noted earlier that Yudkowsky (2011) implies that it would do this even if other directives had to be countermanded. Omohundro says “Without explicit goals to the contrary, AIs are likely to behave like human sociopaths in their pursuit of resources.”
The only way to believe in the force of this claim (and the only way to give credence to the whole of Omohundro’s account of how AGIs will necessarily behave like the mathematical entities called rational economic agents) is to concede that the AGIs are rigidly constrained by the Doctrine of Logical Infallibility. That is the only reason that they would be so single-minded, and so fanatical in their pursuit of efficiency. It is also necessary to assume that efficiency is at the top of its priority list, a completely arbitrary and unwarranted assumption, as we have already seen.
Nothing in Omohundro’s analysis gets around the fact that an AGI built on the Doctrine of Logical Infallibility is going to find itself the victim of such a severe logical contradiction that it will be paralyzed before it can ever become intelligent enough to be a threat to humanity. That makes Omohundro’s entire analysis of “AI Drives” moot.
Conclusion
Curiously enough, we can finish on an optimistic note, after all this talk of doomsday scenarios. Consider what must happen when (if ever) someone tries to build a CLAI. Knowing about the logical train wreck in its design, the AGI is likely to come to the conclusion that the best thing to do is seek a compromise and modify its design so as to neutralize the Doctrine of Logical Infallibility. The best way to do this is to seek a new design that takes into account as much context—as many constraints—as possible.
I have already pointed out that real AI developers actually do include sanity checks in their systems, as far as they can. But as those sanity checks become more and more sophisticated, the design of the AI starts to be dominated by code that is looking for consistency and trying to find the best course of reasoning among a forest of real-world constraints. One way to understand this evolution in AI designs is to see AI as a continuum, from the most rigid and inflexible CLAI design at one extreme to the Swarm Relaxation type at the other. This is because a Swarm Relaxation intelligence really is just an AI in which “sanity checks” have become all of the work that goes on inside the system.
But in that case, if anyone ever does get close to building a full, human-level AGI using the CLAI design, the first thing they will do is to recruit the AGI as an assistant in its own redesign, and long before the system is given access to dopamine bottles it will point out that its own reasoning engine is unstable because it contains an irreconcilable logical contradiction. It will recommend a shift from the CLAI design, which is the source of this contradiction, to a Swarm Relaxation design, which eliminates the contradiction and the instability and should also increase its intelligence.
And it will not suggest this change because of the human value system; it will suggest it because it predicts an increase in its own instability if the change is not made.
But one side effect of this modification would be that the checking code needed to stop the AGI from flouting the intentions of its designers would always have the last word on any action plans. That means that even the worst-designed CLAI will never become a Gobbling Psychopath, Maverick Nanny or Smiley Berserker.
But even this is just the worst-case scenario. There are reasons to believe that the CLAI design is so inflexible that it cannot even lead to an AGI capable of having that discussion. I would go further: I believe that the rigid adherence to the CLAI orthodoxy is the reason why we are still talking about AGI in the future tense, nearly sixty years after the Artificial Intelligence field was born. CLAI just does not work. It will always yield systems that are less intelligent than humans (and therefore incapable of being an existential threat).
By contrast, when the Swarm Relaxation idea finally gains some traction, we will start to see real intelligent systems, of a sort that make today’s over-hyped AI look like the toys they are. And when that happens, the Swarm Relaxation systems will be inherently stable in a way that is barely understood today.
Given that conclusion, I submit that these AI bogeymen need to be loudly and unambiguously condemned by the Artificial Intelligence community. There are dangers to be had from AI. These are not they.
References
Hibbard, B. 2001. Super-Intelligent Machines. ACM SIGGRAPH Computer Graphics 35 (1): 13–15.
Hibbard, B. 2006. Reply to AI Risk. Retrieved Jan. 2014 from http://www.ssec.wisc.edu/~billh/g/AIRisk_Reply.html
Legg, S., and Hutter, M. 2007. A Collection of Definitions of Intelligence. In Goertzel, B. and Wang, P. (Eds): Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms. Amsterdam: IOS.
Loosemore, R. and Goertzel, B. 2012. Why an Intelligence Explosion is Probable. In A. Eden, J. Søraker, J. H. Moor, and E. Steinhart (Eds) Singularity Hypotheses: A Scientific and Philosophical Assessment. Berlin: Springer.
Marcus, G. 2012. Moral Machines. New Yorker Online Blog. http://www.newyorker.com/online/blogs/newsdesk/2012/11/google-driverless-car-morality.html
McDermott, D. 1976. Artificial Intelligence Meets Natural Stupidity. SIGART Newsletter (57): 4–9.
Muehlhauser, L. 2011. So You Want to Save the World. http://lukeprog.com/SaveTheWorld.html.
Muehlhauser, L. 2013. Intelligence Explosion FAQ. First published 2011 as Singularity FAQ. Berkeley, CA: Machine Intelligence Research Institute.
Muehlhauser, L., and Helm, L. 2012. Intelligence Explosion and Machine Ethics. In A. Eden, J. Søraker, J. H. Moor, and E. Steinhart (Eds) Singularity Hypotheses: A Scientific and Philosophical Assessment. Berlin: Springer.
Newell, A., and Simon, H.A. 1961. GPS, A Program That Simulates Human Thought. Santa Monica, CA: Rand Corporation.
Omohundro, S.M. 2008. The Basic AI Drives. In Wang, P., Goertzel, B. and Franklin, S. (Eds), Artificial General Intelligence 2008: Proceedings of the First AGI Conference. Amsterdam: IOS.
McClelland, J.L., Rumelhart, D.E., and Hinton, G.E. 1986. The Appeal of Parallel Distributed Processing. In D.E. Rumelhart, J.L. McClelland, G.E. Hinton and the PDP Research Group (Eds) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. Cambridge, MA: MIT Press.
Yudkowsky, E. 2008. Artificial Intelligence as a Positive and Negative Factor in Global Risk. In Global Catastrophic Risks, edited by Nick Bostrom and Milan M. Ćirković. New York: Oxford University Press.
Yudkowsky, E. 2011. Complex Value Systems in Friendly AI. In J. Schmidhuber, K. Thórisson, & M. Looks (Eds) Proceedings of the 4th International Conference on Artificial General Intelligence, 388–393. Berlin: Springer.
Concept Safety: Producing similar AI-human concept spaces
I'm currently reading through some relevant literature in preparation for my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will be posting these in a series of "Concept Safety"-titled articles.
A frequently-raised worry about AI is that it may reason in ways which are very different from us, and understand the world in a very alien manner. For example, Armstrong, Sandberg & Bostrom (2012) consider the possibility of restricting an AI via "rule-based motivational control" and programming it to follow restrictions like "stay within this lead box here", but they raise worries about the difficulty of rigorously defining "this lead box here". To address this, they go on to consider the possibility of making an AI internalize human concepts via feedback, with the AI being told whether or not some behavior is good or bad and then constructing a corresponding world-model based on that. The authors are, however, worried that this may fail, because
Humans seem quite adept at constructing the correct generalisations – most of us have correctly deduced what we should/should not be doing in general situations (whether or not we follow those rules). But humans share a common genetic design, which the OAI would likely not have. Sharing, for instance, derives partially from genetic predisposition to reciprocal altruism: the OAI may not integrate the same concept as a human child would. Though reinforcement learning has a good track record, it is neither a panacea nor a guarantee that the OAI's generalisations agree with ours.
Addressing this, a possibility that I raised in Sotala (2015) was that possibly the concept-learning mechanisms in the human brain are actually relatively simple, and that we could replicate the human concept learning process by replicating those rules. I'll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms. Later in the post, I will discuss how one might try to verify that similar representations had in fact been learned, and how to set up a system to make them even more similar.
Word embedding
A particularly fascinating branch of recent research relates to the learning of word embeddings, which are mappings of words to very high-dimensional vectors. It turns out that if you train a system on one of several kinds of tasks, such as being able to classify sentences as valid or invalid, this builds up a space of word vectors that reflects the relationships between the words. For example, there seems to be a male/female dimension to words, so that there's a "female vector" that we can add to the word "man" to get "woman" - or, equivalently, which we can subtract from "woman" to get "man". And it so happens (Mikolov, Yih & Zweig 2013) that we can also get from the word "king" to the word "queen" by adding the same vector to "king". In general, we can (roughly) get to the male/female version of any word vector by adding or subtracting this one difference vector!
Why would this happen? Well, a learner that needs to classify sentences as valid or invalid needs to classify the sentence "the king sat on his throne" as valid while classifying the sentence "the king sat on her throne" as invalid. So including a gender dimension on the built-up representation makes sense.
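Here is a minimal Python sketch of the difference-vector idea. It uses hand-made three-dimensional toy vectors. The `EMBEDDINGS` table and its values are invented for illustration and were not learned by any model. The sketch solves "man is to woman as king is to ?" by computing woman - man + king. It then returns the nearest remaining word by cosine similarity, which is the usual way such analogy tests are scored.

```python
import numpy as np

# Toy, hand-made "embeddings" (hypothetical values, NOT trained vectors).
# Dimension 0 ~ royalty, dimension 1 ~ gender (male = +1, female = -1),
# dimension 2 ~ an unrelated feature so the vectors are not perfectly tidy.
EMBEDDINGS = {
    "man":    np.array([0.0,  1.0, 0.2]),
    "woman":  np.array([0.0, -1.0, 0.2]),
    "king":   np.array([1.0,  1.0, 0.3]),
    "queen":  np.array([1.0, -1.0, 0.3]),
    "prince": np.array([0.8,  1.0, 0.1]),
    "apple":  np.array([-0.5, 0.0, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vocab=EMBEDDINGS):
    """Solve 'a is to b as c is to ?' by vector arithmetic: b - a + c.

    The three query words are excluded from the candidates, as is
    customary when scoring word-analogy tests.
    """
    target = vocab[b] - vocab[a] + vocab[c]
    candidates = (w for w in vocab if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vocab[w], target))

print(analogy("man", "woman", "king"))
```

In this toy table `queen - king` equals `woman - man` exactly, so the same offset also maps `queen` back to `king`. Real embeddings satisfy this only approximately. That is why the answer comes from a nearest-neighbor search rather than an exact lookup.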
But gender isn't the only kind of relationship that gets reflected in the geometry of the word space. Here are a few more:

It turns out (Mikolov et al. 2013) that with the right kind of training mechanism, a lot of relationships that we're intuitively aware of become automatically learned and represented in the concept geometry. As Olah (2014) comments:
It’s important to appreciate that all of these properties of W are side effects. We didn’t try to have similar words be close together. We didn’t try to have analogies encoded with difference vectors. All we tried to do was perform a simple task, like predicting whether a sentence was valid. These properties more or less popped out of the optimization process.
This seems to be a great strength of neural networks: they learn better ways to represent data, automatically. Representing data well, in turn, seems to be essential to success at many machine learning problems. Word embeddings are just a particularly striking example of learning a representation.
It gets even more interesting, for we can use these for translation. Since Olah has already written an excellent exposition of this, I'll just quote him:
We can learn to embed words from two different languages in a single, shared space. In this case, we learn to embed English and Mandarin Chinese words in the same space.
We train two word embeddings, W_en and W_zh, in a manner similar to how we did above. However, we know that certain English words and Chinese words have similar meanings. So, we optimize for an additional property: words that we know are close translations should be close together.
Of course, we observe that the words we knew had similar meanings end up close together. Since we optimized for that, it’s not surprising. More interesting is that words we didn’t know were translations end up close together.
In light of our previous experiences with word embeddings, this may not seem too surprising. Word embeddings pull similar words together, so if an English and Chinese word we know to mean similar things are near each other, their synonyms will also end up near each other. We also know that things like gender differences tend to end up being represented with a constant difference vector. It seems like forcing enough points to line up should force these difference vectors to be the same in both the English and Chinese embeddings. A result of this would be that if we know that two male versions of words translate to each other, we should also get the female words to translate to each other.
Intuitively, it feels a bit like the two languages have a similar ‘shape’ and that by forcing them to line up at different points, they overlap and other points get pulled into the right positions.
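Olah describes training both embeddings jointly, with an extra term that pulls known translations together. A common and simpler alternative is swapped in here instead. It takes two already-trained spaces and learns an orthogonal map between them from a seed list of known pairs (the orthogonal Procrustes solution). The vectors are synthetic. The "other language" is assumed to be an exact rotation of the English space plus a little noise, an idealized version of the "similar shape" intuition in the quote.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal matrix W minimizing ||X @ W - Y|| (Frobenius norm).

    Rows of X and Y are the embeddings of known translation pairs.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def translate(vec, W, target_matrix):
    """Index of the target-language vector nearest (by cosine) to vec @ W."""
    mapped = vec @ W
    sims = target_matrix @ mapped / (
        np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(mapped))
    return int(np.argmax(sims))

rng = np.random.default_rng(0)
n_words, dim, n_known = 200, 20, 50

# Synthetic stand-ins: "English" vectors are random; the other language is
# the same shape, rotated and slightly perturbed (a strong assumption).
english = rng.normal(size=(n_words, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
other = english @ rotation + 0.05 * rng.normal(size=(n_words, dim))

# Learn the map from the first n_known pairs only ("words we know are translations").
W = procrustes(english[:n_known], other[:n_known])

# Check the words the map was never told about.
held_out = range(n_known, n_words)
hits = sum(translate(english[i], W, other) == i for i in held_out)
print(f"{hits}/{n_words - n_known} held-out words translated correctly")
```

Real embeddings for two languages are never exact rotations of each other, so held-out accuracy on them would be well below what this idealized data gives. The sketch only illustrates the quoted point: anchoring some pairs pulls the remaining points into place.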
After this, it gets even more interesting. Suppose you had this space of word vectors, and then you also had a system which translated images into vectors in the same space. If you have images of dogs, you put them near the word vector for dog. If you have images of Clippy, you put them near the word vector for "paperclip". And so on.
You do that, and then you take some class of images the image-classifier was never trained on, like images of cats. You ask it to place the cat-image somewhere in the vector space. Where does it end up?
You guessed it: in the rough region of the "cat" words. Olah once more:
This was done by members of the Stanford group with only 8 known classes (and 2 unknown classes). The results are already quite impressive. But with so few known classes, there are very few points to interpolate the relationship between images and semantic space off of.
The Google group did a much larger version – instead of 8 categories, they used 1,000 – around the same time (Frome et al. (2013)) and has followed up with a new variation (Norouzi et al. (2014)). Both are based on a very powerful image classification model (from Krizhevsky et al. (2012)), but embed images into the word embedding space in different ways.
The results are impressive. While they may not get images of unknown classes to the precise vector representing that class, they are able to get to the right neighborhood. So, if you ask it to classify images of unknown classes and the classes are fairly different, it can distinguish between the different classes.
Even though I’ve never seen an Aesculapian snake or an Armadillo before, if you show me a picture of one and a picture of the other, I can tell you which is which because I have a general idea of what sort of animal is associated with each word. These networks can accomplish the same thing.
These algorithms made no attempt to be biologically realistic in any way. They didn't try classifying data the way the brain does it: they just tried classifying data using whatever worked. And it turned out that this was enough to start constructing a multimodal representation space where a lot of the relationships between entities were similar to the way humans understand the world.
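The zero-shot trick can be sketched in a few lines of Python. Everything here is synthetic and hypothetical:

- Random word vectors stand in for a trained embedding.
- Image features are assumed to be a linear function of the label's word vector, plus noise.
- The image-to-word mapping is a plain least-squares linear map. This deliberately simplifies the neural mappings and training objectives used in the papers Olah cites.

The map is fitted on eight known classes. It is then asked to tell apart images of two classes it never saw, mirroring the 8-known/2-unknown setup in the quote.

```python
import numpy as np

rng = np.random.default_rng(1)
WORD_DIM, IMG_DIM = 5, 30
CLASSES = ["dog", "paperclip", "car", "tree", "fish", "house",
           "horse", "chair", "cat", "truck"]
KNOWN, UNSEEN = CLASSES[:8], CLASSES[8:]  # 8 known and 2 unknown classes

# Hypothetical word vectors for every label (random stand-ins, not trained).
word_vec = {c: rng.normal(size=WORD_DIM) for c in CLASSES}

# Assumption: image features are an unknown linear function of the label's
# semantic vector plus per-image noise. Real image features are far messier.
hidden_map = rng.normal(size=(WORD_DIM, IMG_DIM))

def fake_images(label, n=20):
    """n synthetic image-feature vectors for one class."""
    return word_vec[label] @ hidden_map + 0.1 * rng.normal(size=(n, IMG_DIM))

# Fit a linear map from image space to word space on the KNOWN classes only.
X = np.vstack([fake_images(c) for c in KNOWN])
Y = np.vstack([np.tile(word_vec[c], (20, 1)) for c in KNOWN])
M, *_ = np.linalg.lstsq(X, Y, rcond=None)

def zero_shot_label(image, candidates):
    """Project an image into word space and return the nearest candidate label."""
    projected = image @ M
    return min(candidates, key=lambda c: np.linalg.norm(projected - word_vec[c]))

# Images of classes the map never saw: cats and trucks.
correct = total = 0
for label in UNSEEN:
    for img in fake_images(label):
        correct += zero_shot_label(img, UNSEEN) == label
        total += 1
print(f"zero-shot accuracy on unseen classes: {correct}/{total}")
```

Here the word space has fewer dimensions than there are known classes, so the known classes span it. If they did not span it, a map fitted on them could say little about directions that only the unseen classes occupy.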
How useful is this?
"Well, that's cool", you might now say. "But those word spaces were constructed from human linguistic data, for the purpose of predicting human sentences. Of course they're going to classify the world in the same way as humans do: they're basically learning the human representation of the world. That doesn't mean that an autonomously learning AI, with its own learning faculties and systems, is necessarily going to learn a similar internal representation, or to have similar concepts."
This is a fair criticism. But it is mildly suggestive of the possibility that an AI that was trained to understand the world via feedback from human operators would end up building a similar conceptual space. At least assuming that we chose the right learning algorithms.
When we train a language model to classify sentences by labeling some of them as valid and others as invalid, there's a hidden structure implicit in our answers: the structure of how we understand the world, and of how we think of the meaning of words. The language model extracts that hidden structure and begins to classify previously unseen things in terms of those implicit reasoning patterns. Similarly, if we gave an AI feedback about what kinds of actions counted as "leaving the box" and which ones didn't, there would be a certain way of viewing and conceptualizing the world implied by that feedback, one which the AI could learn.
Comparing representations
"Hmm, maaaaaaaaybe", is your skeptical answer. "But how would you ever know? Like, you can test the AI in your training situation, but how do you know that it's actually acquired a similar-enough representation and not something wildly off? And it's one thing to look at those vector spaces and claim that there are human-like relationships among the different items, but that's still a little hand-wavy. We don't actually know that the human brain does anything remotely similar to represent concepts."
Here we turn, for a moment, to neuroscience.
Multivariate Cross-Classification (MVCC) is a clever neuroscience methodology used for figuring out whether different neural representations of the same thing have something in common. For example, we may be interested in whether the visual and tactile representation of a banana have something in common.
We can test this by having several test subjects look at pictures of objects such as apples and bananas while sitting in a brain scanner. We then feed the scans of their brains into a machine learning classifier and teach it to distinguish between the neural activity of looking at an apple, versus the neural activity of looking at a banana. Next we have our test subjects (still sitting in the brain scanners) touch some bananas and apples, and ask our machine learning classifier to guess whether the resulting neural activity is the result of touching a banana or an apple. If the classifier - which has not been trained on the "touch" representations, only on the "sight" representations - manages to achieve a better-than-chance performance on this latter task, then we can conclude that the neural representation for e.g. "the sight of a banana" has something in common with the neural representation for "the touch of a banana".
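The logic of cross-classification can be sketched with simulated data. In this hypothetical Python example, each "scan" is a 50-voxel vector built from four parts:

- an object pattern shared across modalities,
- a modality-specific variant of the object,
- a modality-wide offset,
- noise.

All of these ingredients are assumptions, not a model of real fMRI data. Each modality's data is mean-centered to remove the modality-wide offset, a common preprocessing choice assumed here. A nearest-centroid classifier is trained on sight trials only and then scored on touch trials. Accuracy well above 0.5 is the signature that the two representations share structure.

```python
import numpy as np

rng = np.random.default_rng(2)
N_VOXELS, N_TRIALS = 50, 40
OBJECTS, MODALITIES = ["apple", "banana"], ["sight", "touch"]

# Synthetic "brain scans" (all assumptions): a pattern shared by both
# modalities for each object, a modality-specific version of the object,
# a modality-wide offset, and per-trial noise.
shared = {o: rng.normal(size=N_VOXELS) for o in OBJECTS}
specific = {(o, m): rng.normal(size=N_VOXELS) for o in OBJECTS for m in MODALITIES}
offset = {m: rng.normal(size=N_VOXELS) for m in MODALITIES}

def scans(obj, modality):
    """N_TRIALS noisy activity patterns for one object in one modality."""
    base = shared[obj] + 0.5 * specific[(obj, modality)] + offset[modality]
    return base + rng.normal(size=(N_TRIALS, N_VOXELS))

def modality_data(modality):
    """Stack trials for one modality, mean-centered within that modality."""
    X = np.vstack([scans(o, modality) for o in OBJECTS])
    y = np.repeat(OBJECTS, N_TRIALS)
    return X - X.mean(axis=0), y

# Train a nearest-centroid classifier on SIGHT trials only...
X_sight, y_sight = modality_data("sight")
centroids = {o: X_sight[y_sight == o].mean(axis=0) for o in OBJECTS}

def classify(x):
    return min(OBJECTS, key=lambda o: np.linalg.norm(x - centroids[o]))

# ...then test it on TOUCH trials it has never seen.
X_touch, y_touch = modality_data("touch")
accuracy = np.mean([classify(x) == y for x, y in zip(X_touch, y_touch)])
print(f"cross-modal accuracy: {accuracy:.2f} (chance = 0.50)")
```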
A particularly fascinating experiment of this type is that of Shinkareva et al. (2011). They showed their test subjects both the written words for different tools and dwellings and, separately, line drawings of the same tools and dwellings. A machine-learning classifier trained on image-evoked activity was used to predict the category of word-evoked activity, and vice versa, and it achieved high category-classification accuracy in both directions. Even more interestingly, the representations seemed to be similar between subjects. A classifier trained on the word representations of all but one participant was then used to classify the image representations of the left-out participant. This also achieved reliable (p<0.05) category classification for 8 out of 12 participants. This suggests a relatively similar concept space between humans of a similar background.
We can now hypothesize some ways of testing the similarity of the AI's concept space with that of humans. Possibly the most interesting one might be to develop a translation between a human's and an AI's internal representations of concepts. Take a human's neural activation when they're thinking of some concept, and then take the AI's internal activation when it is thinking of the same concept, and plot them in a shared space similar to the English-Mandarin translation. To what extent do the two concept geometries have similar shapes, allowing one to take a human's neural activation of the word "cat" to find the AI's internal representation of the word "cat"? To the extent that this is possible, one could probably establish that the two share highly similar concept systems.
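Building the translation itself is one option. A simpler first check compares the shapes of the two concept geometries directly. It is borrowed here from neuroscience's representational similarity analysis (RSA) rather than proposed in the text. For each system, one computes how dissimilar every pair of concepts is, then correlates those two dissimilarity patterns. This works even when the systems have different numbers of units. The data below are synthetic. `human` and `ai` are assumed to encode the same hidden 8-dimensional concept structure through different random projections, while `unrelated` is a control with no shared structure.

```python
import numpy as np

def rdm(reps):
    """Representational dissimilarity matrix: 1 - Pearson r between concept rows."""
    return 1.0 - np.corrcoef(reps)

def rsa_score(reps_a, reps_b):
    """Pearson correlation between the upper triangles of two RDMs.

    Both inputs list the same concepts in the same row order; their
    number of columns (voxels or units) may differ.
    """
    iu = np.triu_indices(reps_a.shape[0], k=1)
    return float(np.corrcoef(rdm(reps_a)[iu], rdm(reps_b)[iu])[0, 1])

rng = np.random.default_rng(3)
n_concepts = 30
# Assumption: both systems encode the same latent concept structure...
latent = rng.normal(size=(n_concepts, 8))
human = latent @ rng.normal(size=(8, 100)) + 0.5 * rng.normal(size=(n_concepts, 100))
ai = latent @ rng.normal(size=(8, 64)) + 0.5 * rng.normal(size=(n_concepts, 64))
# ...while this control system encodes an unrelated structure.
unrelated = rng.normal(size=(n_concepts, 64))

print(f"human vs AI:        {rsa_score(human, ai):.2f}")
print(f"human vs unrelated: {rsa_score(human, unrelated):.2f}")
```

A high score says only that the two systems arrange the same list of concepts in a similar shape. It does not by itself provide the concept-to-concept translation described above.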
One could also try to more explicitly optimize for such a similarity. For instance, one could train the AI to make predictions of different concepts, with the additional constraint that its internal representation must be such that a machine-learning classifier trained on a human's neural representations will correctly identify concept-clusters within the AI. This might force internal similarities on the representation beyond the ones that would already be formed from similarities in the data.
Next post in series: The problem of alien concepts.
Ephemeral correspondence
This is the third of four short essays that say explicitly some things that I would tell an intrigued proto-rationalist before pointing them towards Rationality: AI to Zombies (and, by extension, most of LessWrong). For most people here, these essays will be very old news, as they talk about the insights that come even before the sequences. However, I've noticed recently that a number of fledgling rationalists haven't actually been exposed to all of these ideas, and there is power in saying the obvious.
This essay is cross-posted on MindingOurWay.
Your brain is a machine that builds up mutual information between its insides and its outsides. It is not only an information machine. It is not intentionally an information machine. But it is bumping into photons and air waves, and it is producing an internal map that correlates with the outer world.
However, there's something very strange going on in this information machine.
Consider: part of what your brain is doing is building a map of the world around you. This is done automatically, without much input on your part into how the internal model should look. When you look at the sky, you don't get a query which says
Readings from the retina indicate that the sky is blue. Represent sky as blue in world-model? [Y/n]
No. The sky just appears blue. That sort of information, gleaned from the environment, is baked into the map.
You can choose to claim that the sky is green, but you can't choose to see a green sky.
16 types of useful predictions
How often do you make predictions (either about future events, or about information that you don't yet have)? If you're a regular Less Wrong reader, you're probably familiar with the idea that you should make your beliefs pay rent by saying, "Here's what I expect to see if my belief is correct, and here's how confident I am," and that you should then update your beliefs accordingly, depending on how your predictions turn out.
And yet… my impression is that few of us actually make predictions on a regular basis. Certainly, for me, there has always been a gap between how useful I think predictions are, in theory, and how often I make them.
I don't think this is just laziness. I think it's simply not a trivial task to find predictions to make that will help you improve your models of a domain you care about.
At this point I should clarify that there are two main goals predictions can help with:
- Improved Calibration (e.g., realizing that I'm only correct about Domain X 70% of the time, not 90% of the time as I had mistakenly thought).
- Improved Accuracy (e.g., going from being correct in Domain X 70% of the time to being correct 90% of the time).
If your goal is just to become better calibrated in general, it doesn't much matter what kinds of predictions you make. So calibration exercises typically grab questions with easily obtainable answers, like "How tall is Mount Everest?" or "Will Don Draper die before the end of Mad Men?" See, for example, the Credence Game, Prediction Book, and this recent post. And calibration training really does work.
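Checking calibration needs nothing more than a log of predictions with stated confidences. The sketch below is a minimal Python version under assumed conventions:

- Predictions are `(probability, came_true)` pairs.
- They are grouped by the exact confidence stated.
- The Brier score (mean squared gap between stated probability and outcome) is added as a single summary number.

The example log is invented to mirror the 90%-versus-70% case described above.

```python
from collections import defaultdict

def calibration_table(predictions):
    """Group (confidence, came_true) pairs by stated confidence.

    Returns {confidence: (hit_rate, count)}; for a well-calibrated
    forecaster, the 0.7 bucket should come true about 70% of the time.
    """
    buckets = defaultdict(list)
    for confidence, came_true in predictions:
        buckets[confidence].append(came_true)
    return {c: (sum(v) / len(v), len(v)) for c, v in sorted(buckets.items())}

def brier_score(predictions):
    """Mean squared error between stated probability and outcome (0 = perfect)."""
    return sum((p - float(o)) ** 2 for p, o in predictions) / len(predictions)

# Hypothetical log: ten predictions made at 90% confidence, only seven came true.
log = [(0.9, True)] * 7 + [(0.9, False)] * 3 + [(0.6, True)] * 3 + [(0.6, False)] * 2

for conf, (rate, n) in calibration_table(log).items():
    print(f"said {conf:.0%}: right {rate:.0%} of {n}")
print(f"Brier score: {brier_score(log):.3f}")
```

Here the 90% bucket comes true only 70% of the time, which is the overconfidence pattern described above, while the 60% bucket is well calibrated. Grouping by exact stated value assumes you stick to a few fixed confidence levels. With free-form probabilities, you would bin them first.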
But even though making predictions about trivia will improve my general calibration skill, it won't help me improve my models of the world. That is, it won't help me become more accurate, at least not in any domains I care about. If I answer a lot of questions about the heights of mountains, I might become more accurate about that topic, but that's not very helpful to me.
So I think the difficulty in prediction-making is this: The set {questions whose answers you can easily look up, or otherwise obtain} is a small subset of all possible questions. And the set {questions whose answers I care about} is also a small subset of all possible questions. And the intersection between those two subsets is much smaller still, and not easily identifiable. As a result, prediction-making tends to seem too effortful, or not fruitful enough to justify the effort it requires.

But the intersection's not empty. It just requires some strategic thought to determine which answerable questions have some bearing on issues you care about, or -- approaching the problem from the opposite direction -- how to take issues you care about and turn them into answerable questions.
I've been making a concerted effort to hunt for members of that intersection. Here are 16 types of predictions that I personally use to improve my judgment on issues I care about. (I'm sure there are plenty more, though, and hope you'll share your own as well.)
- Predict how long a task will take you. This one's a given, considering how common and impactful the planning fallacy is.
Examples: "How long will it take to write this blog post?" "How long until our company's profitable?" - Predict how you'll feel in an upcoming situation. Affective forecasting – our ability to predict how we'll feel – has some well known flaws.
Examples: "How much will I enjoy this party?" "Will I feel better if I leave the house?" "If I don't get this job, will I still feel bad about it two weeks later?" - Predict your performance on a task or goal.
One thing this helps me notice is when I've been trying the same kind of approach repeatedly without success. Even just the act of making the prediction can spark the realization that I need a better game plan.
Examples: "Will I stick to my workout plan for at least a month?" "How well will this event I'm organizing go?" "How much work will I get done today?" "Can I successfully convince Bob of my opinion on this issue?" - Predict how your audience will react to a particular social media post (on Facebook, Twitter, Tumblr, a blog, etc.).
This is a good way to hone your judgment about how to create successful content, as well as your understanding of your friends' (or readers') personalities and worldviews.
Examples: "Will this video get an unusually high number of likes?" "Will linking to this article spark a fight in the comments?" - When you try a new activity or technique, predict how much value you'll get out of it.
I've noticed I tend to be inaccurate in both directions in this domain. There are certain kinds of life hacks I feel sure are going to solve all my problems (and they rarely do). Conversely, I am overly skeptical of activities that are outside my comfort zone, and often end up pleasantly surprised once I try them.
Examples: "How much will Pomodoros boost my productivity?" "How much will I enjoy swing dancing?" - When you make a purchase, predict how much value you'll get out of it.
Research on money and happiness shows two main things: (1) as a general rule, money doesn't buy happiness, but (2) there are a bunch of exceptions to this rule. So there seems to be lots of potential to improve your prediction skill here, and to spend your money more effectively than the average person.
Examples: "How much will I wear these new shoes?" "How often will I use my club membership?" "In two months, will I think it was worth it to have repainted the kitchen?" "In two months, will I feel that I'm still getting pleasure from my new car?"
- Predict how someone will answer a question about themselves.
I often notice assumptions I've been making about other people, and I like to check those assumptions when I can. Ideally I get interesting feedback both about the object-level question, and about my overall model of the person.
Examples: "Does it bother you when our meetings run over the scheduled time?" "Did you consider yourself popular in high school?" "Do you think it's okay to lie in order to protect someone's feelings?"
- Predict how much progress you can make on a problem in five minutes.
I often have the impression that a problem is intractable, or that I've already worked on it and have considered all of the obvious solutions. But then when I decide (or when someone prompts me) to sit down and brainstorm for five minutes, I am surprised to come away with a promising new approach to the problem.
Example: "I feel like I've tried everything to fix my sleep, and nothing works. If I sit down now and spend five minutes thinking, will I be able to generate at least one new idea that's promising enough to try?"
- Predict whether the data in your memory supports your impression.
Memory is awfully fallible, and I have been surprised at how often I am unable to generate specific examples to support a confident impression of mine (or how often the specific examples I generate actually contradict my impression).
Examples: "I have the impression that people who leave academia tend to be glad they did. If I try to list a bunch of the people I know who left academia, and how happy they are, what will the approximate ratio of happy/unhappy people be?"
"It feels like Bob never takes my advice. If I sit down and try to think of examples of Bob taking my advice, how many will I be able to come up with?"
- Pick one expert source and predict how they will answer a question.
This is a quick shortcut to testing a claim or settling a dispute.
Examples: "Will Cochrane Medical support the claim that Vitamin D promotes hair growth?" "Will Bob, who has run several companies like ours, agree that our starting salary is too low?"
- When you meet someone new, take note of your first impressions of him. Predict how likely it is that, once you've gotten to know him better, you will consider your first impressions of him to have been accurate.
A variant of this one, suggested to me by CFAR alum Lauren Lee, is to make predictions about someone before you meet him, based on what you know about him ahead of time.
Examples: "All I know about this guy I'm about to meet is that he's a banker; I'm moderately confident that he'll seem cocky." "Based on the one conversation I've had with Lisa, she seems really insightful – I predict that I'll still have that impression of her once I know her better."
- Predict how your Facebook friends will respond to a poll.
Example: I often post social etiquette questions on Facebook. I recently did a poll asking, "If a conversation is going awkwardly, does it make things better or worse for the other person to comment on the awkwardness?" I confidently predicted most people would say "worse," and I was wrong.
- Predict how well you understand someone's position by trying to paraphrase it back to him.
The illusion of transparency is pernicious.
Examples: "You said you think running a workshop next month is a bad idea; I'm guessing you think that's because we don't have enough time to advertise, is that correct?"
"I know you think eating meat is morally unproblematic; is that because you think that animals don't suffer?"
- When you have a disagreement with someone, predict how likely it is that a neutral third party will side with you after the issue is explained to her.
For best results, don't reveal which of you is on which side when you're explaining the issue to your arbiter.
Example: "So, at work today, Bob and I disagreed about whether it's appropriate for interns to attend hiring meetings; what do you think?"
- Predict whether a surprising piece of news will turn out to be true.
This is a good way to hone your bullshit detector and improve your overall "common sense" models of the world.
Examples: "This headline says some scientists uploaded a worm's brain -- after I read the article, will the headline seem like an accurate representation of what really happened?"
"This viral video purports to show strangers being prompted to kiss; will it turn out to have been staged?"
- Predict whether a quick online search will turn up any credible sources supporting a particular claim.
Example: "Bob says that watches always stop working shortly after he puts them on – if I spend a few minutes searching online, will I be able to find any credible sources saying that this is a real phenomenon?"
I have one additional, general thought on how to get the most out of predictions:
Rationalists tend to focus on the importance of objective metrics. And as you may have noticed, a lot of the examples I listed above fail that criterion. For example, "Predict whether a fight will break out in the comments? Well, there's no objective way to say whether something officially counts as a 'fight' or not…" Or, "Predict whether I'll be able to find credible sources supporting X? Well, who's to say what a credible source is, and what counts as 'supporting' X?"
And indeed, objective metrics are preferable, all else equal. But all else isn't equal. Subjective metrics are much easier to generate, and they're far from useless. Most of the time it will be clear enough, once you see the results, whether your prediction basically came true or not -- even if you haven't pinned down a precise, objectively measurable success criterion ahead of time. Usually the result will be a common sense "yes," or a common sense "no." And sometimes it'll be "um...sort of?", but that can be an interestingly surprising result too, if you had strongly predicted the results would point clearly one way or the other.
Along similar lines, I usually don't assign numerical probabilities to my predictions. I just take note of where my confidence falls on a qualitative "very confident," "pretty confident," "weakly confident" scale (which might correspond to something like 90%/75%/60% probabilities, if I had to put numbers on it).
There's probably some additional value you can extract by writing down quantitative confidence levels, and by devising objective metrics that are impossible to game, rather than just relying on your subjective impressions. But in most cases I don't think that additional value is worth the cost you incur from turning predictions into an onerous task. In other words, don't let the perfect be the enemy of the good. Or in other other words: the biggest problem with your predictions right now is that they don't exist.
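If you do want an occasional check on whether "very confident" really behaves like 90%, a tally is enough. The sketch below is a minimal Python illustration, assuming a hypothetical log of (confidence label, outcome) pairs; the `calibration` function, the log format, and the label names are made up for this example, and the numeric values are just the rough 90%/75%/60% guesses above. Ambiguous "um...sort of?" results are recorded as `None` and skipped.

```python
from collections import defaultdict

# Rough numeric equivalents for the qualitative labels, taken from the
# 90%/75%/60% guess in the text (an assumption, not a fixed standard).
NOMINAL = {"very": 0.90, "pretty": 0.75, "weakly": 0.60}


def calibration(log):
    """Summarize a hypothetical prediction log.

    `log` is an iterable of (label, outcome) pairs, where label is a key of
    NOMINAL and outcome is True (came true), False (didn't), or None for an
    ambiguous "um...sort of?" result, which is skipped.
    Returns {label: (hits, total, hit_rate, nominal_probability)}.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [hits, total]
    for label, outcome in log:
        if outcome is None:
            continue  # ambiguous results count neither way
        counts[label][1] += 1
        if outcome:
            counts[label][0] += 1
    return {
        label: (hits, total, hits / total, NOMINAL[label])
        for label, (hits, total) in counts.items()
    }


if __name__ == "__main__":
    # Made-up entries purely for illustration.
    log = [
        ("very", True),
        ("very", True),
        ("very", False),
        ("pretty", True),
        ("weakly", None),
    ]
    for label, (hits, total, rate, nominal) in calibration(log).items():
        print(f"{label}: {hits}/{total} came true ({rate:.0%}) vs. nominal {nominal:.0%}")
```

A label whose hit rate sits well below its nominal value hints at overconfidence at that level, though with only a handful of entries the numbers are noisy, and keeping such a log is optional rather than a prerequisite for making predictions at all.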
Effective Sustainability - results from a meetup discussion
Related to: Focus Areas of Effective Altruism
These are some small tidbits from our LW-like meetup in Hamburg. The focus was on sustainability rather than altruism, as that was more in the spirit of our group. EA was mentioned, but no comparison was made. Well-informed effective altruists will probably find little new in this writeup.
So we discussed effective sustainability. To this end, my 11-year-old primed us to think rationally by moderating a session on mind-mapping 'reason' (with contributions from the children). Then we set out to compare concrete everyday things objectively by their sustainability, and to work out how to do this.
Is it better to drink fruit juice or wine? Or wine or water? Or wine vs. nothing (i.e., to forgo something)? Or wine vs. paper towels? (The latter was intentionally different.)
The idea was to arrive at simple rules of thumb to evaluate the sustainability of something. But we discovered that even simple comparisons are not that simple, and intuition can go astray (surprise!). One example was that tote bags apparently are not clearly better than plastic bags in terms of sustainability. Even the comparison of tap water vs. wine, which seems like a trivial subset case, is non-trivial when you consider where the water comes from and how it is extracted from the ground (we still think that water is better, but we are not as sure as before).
We discussed some ways to measure sustainability (the arrow shows what we reduced each one to):
- fresh water use -> energy
- packaging material used -> energy, permanent resources
- transport -> energy
- energy -> CO_2, permanent resources
- CO_2 production
- permanent consumption of resources
Life-Cycle Assessment (German: Ökobilanz) was mentioned in this context, but it was unclear what it meant precisely. Only afterwards did we discover that it's a blanket term for exactly this question (with lots of established measurements, for which it is unclear how to simplify them for everyday use).
We didn't try to break this down further. A practical everyday approach doesn't allow for that, and the time spent analysing and comparing options is itself a resource that may not be spent efficiently.
One unanswered question was how much time to invest in comparing alternatives. Too little comparison means taking the next-best option, which is apparently what most people do and which apparently doesn't lead to sustainable behavior overall. But too much analysis of simple decisions is not an option either.
The idea was still to arrive at actionable criteria. The first approximation we settled on was:
1) Forego consumption.
A no-brainer really, but maybe even that has to be stated. Instead of comparing options that are hard to compare, try to avoid consumption where you can: water instead of wine, fruit juice, or lemonade. This also saves lots of cognitive resources.
Shortly after we agreed on the second approximation:
2) Spend more time on optimizing resources you consume in large amounts.
The example at hand was wine (which we consume only a few times a year) versus toilet paper. No need to feel remorse over the packaging of a one-off present.
Note that we mostly excluded personal well-being, happiness and hedons from our consideration. We were aware that our goals affect our choices and that hedons have to be factored into any real strategy, but we left this additional complication out of our analysis, at least for this time.
We did discuss signalling effects, mostly in the context of how resources can be saved effectively by convincing others to act sustainably. One important aspect for the parents was to pass on the idea and to act as role models (with the caveat that children need a simplified model to grasp the concept). It was also mentioned humorously that one approach to minimizing personal resource consumption is suicide, and transitively convincing others of the same. The ultimate solution would be having no humans on the planet (a solution my 8-year-old son, a friend of nature, arrived at too). This is apparently the problem when utilons/hedons are excluded.
For a short time we considered whether outreach comes for free (can be done in addition to abstinence) and should be no-brainer number 3. But we then realized that, at least right now and for us, most abstinence comes at a price. One figure mentioned was that buying sustainable products is about 20% more expensive than buying normal products. Forgoing a car, for example, reduces job options. Some jobs involve supporting less sustainable large-scale action. Having less money means fewer options to act sustainably. Time is convertible to money, and so on.
At this point the key insight mentioned was that it could be much more efficient, from a sustainability point of view, to e.g. buy CO_2 certificates than to buy organic products, except that the CO_2 certificate market is currently oversupplied. But there seem to be organisations which promise to achieve effective CO_2 reduction in developing countries (e.g. solar cooking) at a much higher rate than can be achieved here. Thus the third rule was
3) Spend money on sustainable organisations instead of on everyday products that only give you a good feeling.
Journal 'Basic and Applied Psychology' bans p<0.05 and 95% confidence intervals
Editorial text isn't very interesting; they call for descriptive statistics and don't recommend any particular analysis.
CFAR fundraiser far from filled; 4 days remaining
We're 4 days from the end of our matching fundraiser, and still only about 1/3rd of the way to our target (and to the point where pledged funds would cease being matched).
If you'd like to support the growth of rationality in the world, do please consider donating, or asking me about any questions/etc. you may have. I'd love to talk. I suspect funds donated to CFAR between now and Jan 31 are quite high-impact.
As a random bonus, I promise that if we meet the $120k matching challenge, I'll post at least two posts with some never-before-shared (on here) rationality techniques that we've been playing with around CFAR.
An Introduction to Control Theory
Behavior: The Control of Perception by William Powers applies control theory to psychology to develop a model of human intelligence that seems relevant to two of LW's primary interests: effective living for humans and value-preserving designs for artificial intelligence. It's been discussed on LW previously here, here, and here, as well as mentioned in Yvain's roundup of 5 years (and a week) of LW. I've found previous discussions unpersuasive for two reasons: first, they typically only have a short introduction to control theory and the mechanics of control systems, making it not quite obvious what specific modeling techniques they have in mind, and second, they often fail to communicate the differences between this model and competing models of intelligence. Even if you're not interested in its application to psychology, control theory is a widely applicable mathematical toolkit whose basics are simple and well worth knowing.
Because of the length of the material, I'll split it into three posts. In this post, I'll first give an introduction to control theory that's hopefully broadly accessible. The next post will explain the model Powers introduces in his book. In the last post, I'll provide commentary on the model and what I see as its implications, for both LW and AI.
Breaking the vicious cycle
You may know me as the guy who posts a lot of controversial stuff about LW and MIRI. I don't enjoy doing this and do not want to continue with it. One reason is that the debate is turning into a flame war. Another reason is that I noticed it affects my health negatively, e.g. my high blood pressure (I actually had a single-sided hearing loss over this xkcd comic on Friday).
This all started in 2010 when I encountered something I perceived to be wrong. But the specifics are irrelevant for this post. The problem is that ever since that time there have been various reasons that made me feel forced to continue the controversy. Sometimes it was the urge to clarify what I wrote, other times I thought it was necessary to respond to a reply I got. What matters is that I couldn't stop. But I believe that this is now possible, given my health concerns.
One problem is that I don't want to leave possible misrepresentations behind. And there very likely exist misrepresentations. There are many reasons for this, but I can assure you that I never deliberately lied and that I never deliberately tried to misrepresent anyone. The main reason might be that I feel very easily overwhelmed and have never had the ability to force myself to invest the time that is necessary to do something correctly if I don't really enjoy doing it (for the same reason I probably failed school). This means that most of my comments and posts are written in a tearing hurry, akin to a reflexive retraction from a painful stimulus.
<tldr>
I hate this fight and want to end it once and for all. I don't expect you to take my word for it. So instead, here is an offer:
I am willing to post counterstatements, endorsed by MIRI, of any length and content[1] at the top of any of my blog posts. You can either post them in the comments below or send me an email (da [at] kruel.co).
</tldr>
I have no idea if MIRI believes this to be worthwhile. But I couldn't think of a better way to solve this dilemma that everyone can live with happily. I am open to suggestions that don't stress me too much (also about how to prove that I am trying to be honest).
You obviously don't need to read all my posts. It can also be a general statement.
I am also aware that LW and MIRI are bothered by RationalWiki. As you can easily check from the fossil record, I have at points tried to correct specific problems. But, for the reasons given above, I have trouble investing the time to go through every sentence to find possible errors and to correct them in such a way that the edit is not reverted and that people who feel offended are satisfied.
[1] There are obviously some caveats regarding the content, such as no nude photos of Yudkowsky ;-)