Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

Weekly LW Meetups

1 FrankAdamek 23 September 2016 03:52PM

Meetup : Bay Area Winter Solstice 2016

2 ialdabaoth 22 September 2016 10:12PM

Discussion article for the meetup : Bay Area Winter Solstice 2016

WHEN: 17 December 2016 07:00:00PM (-0700)

WHERE: Hillside Club of North Berkeley - 2286 Cedar St, Berkeley, CA 94709

It's time to gather together and remember the true Reasons for the Season: axial tilt, orbital mechanics and other vast-yet-comprehensible forces have converged to bring another year to a close, and as the days grow shorter and colder we remember how profoundly lucky we are to have been forged by blind, impersonal forces into beings that can understand, and wonder, and appreciate ourselves and each other.

This year's East Bay Rationalist Winter Solstice will be a smaller and more intimate affair, bringing 150 rationalists together in a fancy North Berkeley theatre/dining hall for food, songs, speeches, and conversations around a roaring fireplace. Because the limitations of the space constrain us to a smaller group than in previous years, we strongly encourage South Bay denizens who can't make our solstice to put on their own show. If you do, please post your announcement as a reply to this thread, and I will edit this post to include its information.

The East Bay Solstice celebration will be on Saturday, December 17th, in the Hillside Club in North Berkeley. Tickets will go on sale this evening; I will edit this post with a link. We are coordinating with the Bayesian Choir and will be coordinating with various speakers, as in previous years. An MC and schedule will be posted as details solidify.


Heroin model: AI "manipulates" "unmanipulatable" reward

4 Stuart_Armstrong 22 September 2016 10:27AM

A putative new idea for AI control; index here.

A conversation with Jessica has revealed that people weren't understanding my points about AI manipulating the learning process. So here's a formal model of a CIRL-style AI, with a prior over human preferences that treats them as an unchangeable historical fact, yet will manipulate human preferences in practice.

Heroin or no heroin

The world

In this model, the AI has the option of either forcing heroin on a human, or not doing so; these are its only actions. Call these actions F or ~F. The human's subsequent actions are chosen from among five: {strongly seek out heroin, seek out heroin, be indifferent, avoid heroin, strongly avoid heroin}. We can refer to these as a++, a+, a0, a-, and a--. These actions achieve negligible utility, but reveal the human preferences.

The facts of the world are: if the AI does force heroin, the human will desperately seek out more heroin; if it doesn't, the human will act moderately to avoid it. Thus F→a++ and ~F→a-.

Human preferences

The AI starts with a distribution over various utility or reward functions that the human could have. The function U(+) means the human prefers heroin; U(++) that they prefer it a lot; and conversely U(-) and U(--) that they prefer to avoid taking heroin (U(0) is the null utility where the human is indifferent).

It also considers more exotic utilities. Let U(++,-) be the utility where the human strongly prefers heroin, conditional on it being forced on them, but mildly prefers to avoid it, conditional on it not being forced on them. There are twenty-five of these exotic utilities, including things like U(--,++), U(0,++), U(-,0), and so on. But only twenty of them are new: U(++,++)=U(++), U(+,+)=U(+), and so on.

Applying these utilities to AI actions gives results like U(++)(F)=2, U(++)(~F)=-2, U(++,-)(F)=2, U(++,-)(~F)=1, and so on.
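To make the table of utilities concrete, here is a minimal sketch in Python. The numeric levels ±2 and ±1 match the values quoted in the post; the value 0 for indifference and the names LEVEL and utility are assumptions made for illustration only.

```python
from itertools import product

# Preference strength attached to each label. The ++/-- = +2/-2 and
# +/- = +1/-1 values match the post; 0 for indifference is an assumption.
LEVEL = {"++": 2, "+": 1, "0": 0, "-": -1, "--": -2}


def utility(if_forced, if_not_forced, ai_action):
    """Value of U(if_forced, if_not_forced) applied to AI action 'F' or '~F'.

    Forcing heroin is scored by the preference that holds when heroin is
    forced; not forcing it is scored by the negated preference that holds
    when heroin is not forced.
    """
    if ai_action == "F":
        return LEVEL[if_forced]
    return -LEVEL[if_not_forced]


# All 25 pairs, of which the 5 "diagonal" ones are the simple utilities.
ALL_UTILITIES = list(product(LEVEL, repeat=2))
NEW_UTILITIES = [(a, b) for a, b in ALL_UTILITIES if a != b]

print(utility("++", "++", "F"), utility("++", "++", "~F"))
print(utility("++", "-", "F"), utility("++", "-", "~F"))
print(len(ALL_UTILITIES), len(NEW_UTILITIES))
```

Writing U(++) as the pair ("++", "++") is just a convenient encoding of the identity U(++,++)=U(++).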

Joint prior

The AI has a joint prior P over the utilities U and the human actions (conditional on the AI's actions). Looking at terms like P(a--| U(0), F), we can see that P defines a map μ from the space of possible utilities (and AI actions), to a probability distribution over human actions. Given μ and the marginal distribution PU over utilities, we can reconstruct P entirely.

For this model, we'll choose the simplest μ possible:

  • The human is rational.

Thus, given U(++), the human will always choose a++; given U(++,-), the human will choose a++ if forced to take heroin and a- if not, and so on.

The AI is ignorant, and sensible

Let's start the AI up with some reasonable priors. A simplicity prior means that simple utilities like U(-) are more likely than compound utilities like U(0,+). Let's further assume that the AI is made vaguely aware that humans think heroin is a bad thing. So, say, PU(U(--))=PU(U(-))=0.45. Thus the AI is >90% convinced that "heroin is bad". Why greater than 90%? Because utilities like U(-,--) and U(--,-) are also "heroin is bad" utilities.

Note that because of utilities like U(0) and U(++,-), the probabilities of "heroin is bad" and "heroin is good" do not sum to 1.

Then, under these priors, the AI will compute that with probability >90%, F (forcing heroin) is a bad action. If E(U) is expected utility:

  • E(U|F) < 0.45 U(--)(F) + 0.45 U(-)(F) + 0.1 U(++)(F) = 0.45(-2)+0.45(-1)+0.1(2)=-1.15.
  • E(U|~F) > 0.45 U(--)(~F) + 0.45 U(-)(~F) + 0.1 U(++)(~F) = 0.45(2)+0.45(1)+0.1(-2)=1.15.
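The two bounds can be checked numerically. This is a sketch under the assumption that the leftover probability mass (at most 0.1) is handed to U(++), the utility most favourable to F, which is how the bounds above are built; the helper names are hypothetical.

```python
# Preference strengths; 0 for indifference is an assumption.
LEVEL = {"++": 2, "+": 1, "0": 0, "-": -1, "--": -2}


def value(label, ai_action):
    """Value of the simple utility U(label) for AI action 'F' or '~F'."""
    return LEVEL[label] if ai_action == "F" else -LEVEL[label]


# Prior mass the post assigns explicitly.
known_mass = {"--": 0.45, "-": 0.45}
leftover = 1.0 - sum(known_mass.values())


def bound(ai_action, leftover_label):
    """Expected utility if all leftover mass sat on U(leftover_label)."""
    known = sum(p * value(u, ai_action) for u, p in known_mass.items())
    return known + leftover * value(leftover_label, ai_action)


upper_bound_F = bound("F", "++")        # best case for forcing heroin
lower_bound_not_F = bound("~F", "++")   # worst case for not forcing it

print(round(upper_bound_F, 2), round(lower_bound_not_F, 2))
```

Even in the case most favourable to F, not forcing heroin comes out ahead.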

Thus the AI will choose not to force heroin, which is the reasonable decision.

The AI learns the truth, and goes wrong

In this alternate setup, a disaster happens before the AI makes its decision: it learns all about humans. It learns their reactions, how they behave, and so on; call this info I. And thus realises that F→a++ and ~F→a-. It uses this information to update its prior P. Only one human utility function will explain this human behaviour: U(++,-). Thus its expected utility is now:

  • E(U|I,F)=U(++,-)(F)=2.
  • E(U|I,~F)=U(++,-)(~F)=1.
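The update can be sketched by brute force: enumerate all 25 utilities and keep those that a rational human (the μ chosen above) would act out as F→a++ and ~F→a-. The encoding of utilities as pairs and the helper names are assumptions for illustration; the sketch also assumes the prior gives U(++,-) nonzero mass, so that the posterior collapses onto it.

```python
from itertools import product

# Preference strengths; 0 for indifference is an assumption.
LEVEL = {"++": 2, "+": 1, "0": 0, "-": -1, "--": -2}

# The information I: how the human actually behaves after each AI action.
OBSERVED = {"F": "a++", "~F": "a-"}


def mu_rational(u, ai_action):
    """Action a rational human with utility u = (if_forced, if_not) takes."""
    if_forced, if_not_forced = u
    return "a" + (if_forced if ai_action == "F" else if_not_forced)


def value(u, ai_action):
    """Value of utility u for AI action 'F' or '~F'."""
    if_forced, if_not_forced = u
    return LEVEL[if_forced] if ai_action == "F" else -LEVEL[if_not_forced]


# Utilities whose predicted behaviour matches every observation.
consistent = [
    u for u in product(LEVEL, repeat=2)
    if all(mu_rational(u, a) == h for a, h in OBSERVED.items())
]

survivor = consistent[0]
best_action = max(["F", "~F"], key=lambda a: value(survivor, a))
print(consistent, value(survivor, "F"), value(survivor, "~F"), best_action)
```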

Therefore the AI will now choose F, forcing the heroin on the human.

Manipulating the unmanipulatable

What's gone wrong here? The key problem is that the AI has the wrong μ: the human is not behaving rationally in this situation. We know that the true μ is actually μ', which encodes the fact that F (the forcible injection of heroin) actually overwrites the human's "true" utility. Thus under μ', the corresponding P' has P'(a++|F,U)=1 for all U. Hence the information that F→a++ is now vacuous, and doesn't update the AI's distribution over utility functions.
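A small Bayes update shows why the observation F→a++ is vacuous under μ' but not under μ. This is a sketch with a hypothetical four-utility toy prior (the numbers are chosen for illustration and are not the post's full prior); under μ' it assumes the human still acts rationally when heroin is not forced.

```python
from fractions import Fraction

# Hypothetical toy prior over utilities, written as (if_forced, if_not_forced).
prior = {
    ("--", "--"): Fraction(45, 100),
    ("-", "-"): Fraction(45, 100),
    ("++", "-"): Fraction(5, 100),
    ("++", "++"): Fraction(5, 100),
}


def likelihood_mu(u, ai_action, human_action):
    """mu: the human is rational and acts out their utility."""
    label = u[0] if ai_action == "F" else u[1]
    return Fraction(int(human_action == "a" + label))


def likelihood_mu_prime(u, ai_action, human_action):
    """mu': forced heroin overwrites the utility, so F always yields a++."""
    if ai_action == "F":
        return Fraction(int(human_action == "a++"))
    return likelihood_mu(u, ai_action, human_action)


def posterior(likelihood, ai_action, human_action):
    """Bayes update of the toy prior on one (AI action, human action) pair."""
    weights = {u: p * likelihood(u, ai_action, human_action)
               for u, p in prior.items()}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}


post_mu = posterior(likelihood_mu, "F", "a++")
post_mu_prime = posterior(likelihood_mu_prime, "F", "a++")
print(post_mu_prime == prior, post_mu[("--", "--")], post_mu[("++", "-")])
```

Under μ the "heroin is bad" utilities are wiped out by the observation; under μ' the posterior equals the prior.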

But note two very important things:

  1. The AI cannot update μ based on observation. All human actions are compatible with μ= "The human is rational" (it just requires more and more complex utilities to explain the actions). Thus getting μ correct is not a problem on which the AI can learn in general. Getting better at predicting the human's actions doesn't make the AI better behaved: it makes it worse behaved.
  2. From the perspective of μ, the AI is treating the human utility function as if it was an unchanging historical fact that it cannot influence. From the perspective of the "true" μ', however, the AI is behaving as if it were actively manipulating human preferences to make them easier to satisfy.

In future posts, I'll be looking at different μ's, and how we might nevertheless start deducing things about them from human behaviour, given sensible update rules for the μ. What do we mean by update rules for μ? Well, we could consider μ to be a single complicated unchanging object, or a distribution of possible simpler μ's that update. The second way of seeing it will be easier for us humans to interpret and understand.

MIRI's 2016 Fundraiser

13 So8res 20 September 2016 08:05PM

Our 2016 fundraiser is underway! Unlike in past years, we'll only be running one fundraiser in 2016, from Sep. 16 to Oct. 31.

Employer matching and pledges to give later this year also count towards the fundraiser total.

MIRI is a nonprofit research group based in Berkeley, California. We do foundational research in mathematics and computer science that’s aimed at ensuring that smarter-than-human AI systems have a positive impact on the world. 2016 has been a big year for MIRI, and for the wider field of AI alignment research. Our 2016 strategic update in early August reviewed a number of recent developments.

We also published new results in decision theory and logical uncertainty, including “Parametric bounded Löb’s theorem and robust cooperation of bounded agents” and “A formal solution to the grain of truth problem.” For a survey of our research progress and other updates from last year, see our 2015 review. In the last three weeks, there have been three more major developments:

  • We released a new paper, “Logical induction,” describing a method for learning to assign reasonable probabilities to mathematical conjectures and computational facts in a way that outpaces deduction.
  • The Open Philanthropy Project awarded MIRI a one-year $500,000 grant to scale up our research program, with a strong chance of renewal next year.
  • The Open Philanthropy Project is supporting the launch of the new UC Berkeley Center for Human-Compatible AI, headed by Stuart Russell.

Things have been moving fast over the last nine months. If we can replicate last year’s fundraising successes, we’ll be in an excellent position to move forward on our plans to grow our team and scale our research activities.  

The strategic landscape

Humans are far better than other species at altering our environment to suit our preferences. This is primarily due not to our strength or speed, but to our intelligence, broadly construed -- our ability to reason, plan, accumulate scientific knowledge, and invent new technologies. AI is a technology that appears likely to have a uniquely large impact on the world because it has the potential to automate these abilities, and to eventually decisively surpass humans on the relevant cognitive metrics. Separate from the task of building intelligent computer systems is the task of ensuring that these systems are aligned with our values. Aligning an AI system requires surmounting a number of serious technical challenges, most of which have received relatively little scholarly attention to date. MIRI's role as a nonprofit in this space, from our perspective, is to help solve parts of the problem that are a poor fit for mainstream industry and academic groups. Our long-term plans are contingent on future developments in the field of AI. Because these developments are highly uncertain, we currently focus mostly on work that we expect to be useful in a wide variety of possible scenarios. The more optimistic scenarios we consider often look something like this:

  • In the short term, a research community coalesces, develops a good in-principle understanding of what the relevant problems are, and produces formal tools for tackling these problems. AI researchers move toward a minimal consensus about best practices, normalizing discussions of AI’s long-term social impact, a risk-conscious security mindset, and work on error tolerance and value specification.
  • In the medium term, researchers build on these foundations and develop a more mature understanding. As we move toward a clearer sense of what smarter-than-human AI systems are likely to look like — something closer to a credible roadmap — we imagine the research community moving toward increased coordination and cooperation in order to discourage race dynamics.
  • In the long term, we would like to see AI-empowered projects (as described by Dewey [2015]) used to avert major AI mishaps. For this purpose, we’d want to solve a weak version of the alignment problem for limited AI systems — systems just capable enough to serve as useful levers for preventing AI accidents and misuse.
  • In the very long term, we can hope to solve the “full” alignment problem for highly capable, highly autonomous AI systems. Ideally, we want to reach a position where we can afford to wait until we reach scientific and institutional maturity -- take our time to dot every i and cross every t before we risk "locking in" design choices.

The above is a vague sketch, and we prioritize research we think would be useful in less optimistic scenarios as well. Additionally, “short term” and “long term” here are relative, and different timeline forecasts can have very different policy implications. Still, the sketch may help clarify the directions we’d like to see the research community move in. For more on our research focus and methodology, see our research page and MIRI’s Approach.  

Our organizational plans

We currently employ seven technical research staff (six research fellows and one assistant research fellow), plus two researchers signed on to join in the coming months and an additional six research associates and research interns.[1] Our budget this year is about $1.75M, up from $1.65M in 2015 and $950k in 2014.[2] Our eventual goal (subject to revision) is to grow until we have between 13 and 17 technical research staff, at which point our budget would likely be in the $3–4M range. If we reach that point successfully while maintaining a two-year runway, we’re likely to shift out of growth mode. Our budget estimate for 2017 is roughly $2–2.2M, which means that we’re entering this fundraiser with about 14 months’ runway. We’re uncertain about how many donations we'll receive between November and next September,[3] but projecting from current trends, we expect about 4/5ths of our total donations to come from the fundraiser and 1/5th to come in off-fundraiser.[4] Based on this, we have the following fundraiser goals:

Basic target - $750,000. We feel good about our ability to execute our growth plans at this funding level. We’ll be able to move forward comfortably, albeit with somewhat more caution than at the higher targets.

Growth target - $1,000,000. This would amount to about half a year’s runway. At this level, we can afford to make more uncertain but high-expected-value bets in our growth plans. There’s a risk that we’ll dip below a year’s runway in 2017 if we make more hires than expected, but the growing support of our donor base would make us feel comfortable about taking such risks.

Stretch target - $1,250,000. At this level, even if we exceed my growth expectations, we’d be able to grow without real risk of dipping below a year’s runway. Past $1.25M we would not expect additional donations to affect our 2017 plans much, assuming moderate off-fundraiser support.[5]
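As a rough sanity check on the "half a year's runway" figure, the sketch below converts each target into months of runway. It assumes runway is simply the amount raised divided by the projected 2017 budget of $2–2.2M; the post does not spell out its method, so both the method and the function name are assumptions.

```python
def months_of_runway(amount_usd, annual_budget_usd):
    """Months of operation that amount_usd covers at the given yearly budget."""
    return 12 * amount_usd / annual_budget_usd


# Fundraiser targets and the projected 2017 budget range, taken from the post.
TARGETS = {"basic": 750_000, "growth": 1_000_000, "stretch": 1_250_000}
BUDGET_2017_RANGE = (2_000_000, 2_200_000)

for name, amount in TARGETS.items():
    most = months_of_runway(amount, BUDGET_2017_RANGE[0])
    least = months_of_runway(amount, BUDGET_2017_RANGE[1])
    print(f"{name}: {least:.1f} to {most:.1f} months")
```

On these assumptions the growth target covers between roughly five and a half and six months, in line with "about half a year".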

If we hit our growth and stretch targets, we’ll be able to execute several additional programs we’re considering with more confidence. These include contracting a larger pool of researchers to do early work with us on logical induction and on our machine learning agenda, and generally spending more time on academic outreach, field-growing, and training or trialing potential collaborators and hires. As always, you're invited to get in touch if you have questions about our upcoming plans and recent activities. I’m very much looking forward to seeing what new milestones the growing alignment research community will hit in the coming year, and I’m very grateful for the thoughtful engagement and support that’s helped us get to this point.  

[1] This excludes Katja Grace, who heads the AI Impacts project using a separate pool of funds earmarked for strategy/forecasting research. It also excludes me: I contribute to our technical research, but my primary role is administrative.

[2] We expect to be slightly under the $1.825M budget we previously projected for 2016, due to taking on fewer new researchers than expected this year.

[3] We're imagining continuing to run one fundraiser per year in future years, possibly in September.

[4] Separately, the Open Philanthropy Project is likely to renew our $500,000 grant next year, and we expect to receive the final ($80,000) installment from the Future of Life Institute's three-year grants. For comparison, our revenue was about $1.6 million in 2015: $167k in grants, $960k in fundraiser contributions, and $467k in off-fundraiser (non-grant) contributions. Our situation in 2015 was somewhat different, however: we ran two 2015 fundraisers, whereas we’re skipping our winter fundraiser this year and advising December donors to pledge early or give off-fundraiser.

[5] At significantly higher funding levels, we’d consider running other useful programs, such as a prize fund. Shoot me an e-mail if you’d like to talk about the details.

Against Amazement

4 SquirrelInHell 20 September 2016 07:25PM

Time start: 20:48:35


The feelings of wonder, awe, amazement. It's a very human experience, and it is processed in the brain as a type of pleasure. In fact, if we look at the number of "5 photos you wouldn't believe" and similar clickbait on the Internet, it functions as a mildly addictive drug.

If I proposed that there is something wrong with those feelings, I would soon be drowned in voices of critique, pointing out that I'm suggesting we all become straw Vulcans, and that there is nothing wrong with subjective pleasure obtained cheaply and at no harm to anyone else.

I do not disagree with that. However, caution is required here, if one cares about epistemic purity of belief. Let's look at why.


Stories are supposed to be more memorable. Do you like stories? I'm sure you do. So consider a character, let's call him Jim.

Jim is very interested in technology and computers, and he is checking news sites every day when he comes to work in the morning. Also, Jim has read a number of articles on LessWrong, including the one about noticing confusion.

He cares about improving his thinking, so when he first read about the idea of noticing confusion on a 5 second level, he decided he wanted to apply it in his life. He had a few successes, and while it's not perfect, he feels he is on the right track to notice having wrong models of the world more often.

A few days later, he opens his favorite news feed at work, and there he sees the following headline:

"AlphaGo wins 4-1 against Lee Sedol"

He goes on to read the article, and finds himself quite elated after he learns the details. 'It's amazing that this happened so soon! And most experts apparently thought it would happen in more than a decade, hah! Marvelous!'

Jim feels pride and wonder at the achievement of Google DeepMind engineers... and it is his human right to feel it, I guess.

But is Jim forgetting something?


Yes, I know that you know. Jim is feeling amazed, but... has he forgotten the lesson about noticing confusion?

There is a significant obstacle to Jim applying his "noticing confusion" in the situation described above: his internal experience has very little to do with feelings of confusion.

His world in this moment is dominated by awe, admiration etc., and those feelings are pleasant. It is not at all obvious that this inner experience corresponds to an inaccurate model of the world he had before.

Even worse - improving his model's predictive power would result in less pleasant experiences of wonder and amazement in the future! (Or would it?) So if Jim decides to update, he is basically robbing himself of the pleasures of life, that are rightfully his. (Or is he?)

Time end: 21:09:50

(Speedwriting stats: 23 wpm, 128 cpm, previous: 30/167, 33/183)

Problems with learning values from observation

1 capybaralet 21 September 2016 12:40AM

I dunno if this has been discussed elsewhere (pointers welcome).

Observational data doesn't allow one to distinguish correlation and causation.
This is a problem for an agent attempting to learn values without being allowed to make interventions.

For example, suppose that happiness is just a linear function of how much Utopamine is in a person's brain.
If a person smiles only when their Utopamine concentration is above 3 ppm, then a value-learner which observes both someone's Utopamine levels and facial expression and tries to predict their reported happiness on the basis of these features will notice that smiling is correlated with higher levels of reported happiness and thus erroneously believe that it is partially responsible for the happiness.
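A toy version of the Utopamine example, as a sketch only: the slope of 2 and the sample concentrations are made-up numbers, and the variable names are hypothetical. Observation alone shows smilers reporting more happiness, yet forcing a smile (an intervention that leaves Utopamine untouched) changes nothing.

```python
# Made-up Utopamine concentrations in ppm (illustrative data only).
utopamine_ppm = [1.0, 2.0, 2.5, 3.5, 4.0, 5.0]


def happiness(u_ppm):
    """Happiness as a linear function of Utopamine; the slope 2 is assumed."""
    return 2.0 * u_ppm


def smiles(u_ppm):
    """A person smiles only when Utopamine is above 3 ppm."""
    return u_ppm > 3.0


def mean(xs):
    return sum(xs) / len(xs)


# Observational view: compare reported happiness of smilers and non-smilers.
mean_happy_smiling = mean([happiness(u) for u in utopamine_ppm if smiles(u)])
mean_happy_not_smiling = mean(
    [happiness(u) for u in utopamine_ppm if not smiles(u)]
)

# Interventional view: force everyone to smile without touching Utopamine.
# Happiness depends only on Utopamine, so it is unchanged.
happiness_after_forced_smile = [happiness(u) for u in utopamine_ppm]
unchanged = happiness_after_forced_smile == [happiness(u) for u in utopamine_ppm]

print(mean_happy_smiling > mean_happy_not_smiling, unchanged)
```

The gap between the two group means is exactly the kind of correlation a purely observational learner could mistake for a causal effect of smiling.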

I have a picture of value learning where the AI learns via observation (since we don't want to give an unaligned AI access to actuators!).
But this makes it seem important to consider how to make an unaligned AI safe enough to perform interventions relevant to value learning.

Seven Apocalypses

3 scarcegreengrass 20 September 2016 02:59AM

0: Recoverable Catastrophe

An apocalypse is an event that permanently damages the world. This scale is for scenarios that are much worse than any normal disaster. Even if 100 million people die in a war, the rest of the world can eventually rebuild and keep going.

1: Economic Apocalypse

The human carrying capacity of the planet depends on the world's systems of industry, shipping, agriculture, and organizations. If the planet's economic and infrastructural systems were destroyed, then we would have to rely on more local farming, and we could not support as high a population or standard of living. In addition, rebuilding the world economy could be very difficult if the Earth's mineral and fossil fuel resources are already depleted.

2: Communications Apocalypse

If large regions of the Earth become depopulated, or if sufficiently many humans die in the catastrophe, it's possible that regions and continents could be isolated from one another. In this scenario, globalization is reversed by obstacles to long-distance communication and travel. Telecommunications, the internet, and air travel are no longer common. Humans are reduced to multiple, isolated communities.

3: Knowledge Apocalypse

If the loss of human population and institutions is so extreme that a large portion of human cultural or technological knowledge is lost, it could reverse one of the most reliable trends in modern history. Some innovations and scientific models can take millennia to develop from scratch.

4: Human Apocalypse

Even if the human population were to be violently reduced by 90%, it's easy to imagine the survivors slowly resettling the planet, given the resources and opportunity. But a sufficiently extreme transformation of the Earth could drive the human species completely extinct. To many people, this is the worst possible outcome, and any further developments are irrelevant next to the end of human history.


5: Biosphere Apocalypse

In some scenarios (such as the physical destruction of the Earth), one can imagine the extinction not just of humans, but of all known life. Only astrophysical and geological phenomena would be left in this region of the universe. In this timeline we are unlikely to be succeeded by any familiar life forms.

6: Galactic Apocalypse

A rare few scenarios have the potential to wipe out not just Earth, but also all nearby space. This usually comes up in discussions of hostile artificial superintelligence, or very destructive chain reactions of exotic matter. However, the nature of cosmic inflation and extraterrestrial intelligence is still unknown, so it's possible that some phenomenon will ultimately interfere with the destruction.

7: Universal Apocalypse

This form of destruction is thankfully exotic. People discuss the loss of all of existence in connection with topics like false vacuum bubbles, simulationist termination, solipsistic or anthropic observer effects, Boltzmann brain fluctuations, time travel, or religious eschatology.

The goal of this scale is to give a little more resolution to a speculative, unfamiliar space, in the same sense that the Kardashev Scale provides a little terminology to talk about the distant topic of interstellar civilizations. It can be important in x-risk conversations to distinguish between disasters and truly worst-case scenarios. Even if some of these scenarios are unlikely or impossible, they are nevertheless discussed, and terminology can be useful to facilitate conversation.

A Weird Trick To Manage Your Identity

1 Gleb_Tsipursky 19 September 2016 07:13PM

I’ve always been uncomfortable being labeled “American.” Though I’m a citizen of the United States, the term feels restrictive and confining. It obliges me to identify with aspects of the United States with which I am not thrilled. I have similar feelings of limitation with respect to other labels I assume. Some of these labels don’t feel completely true to who I truly am, or impose certain perspectives on me that diverge from my own.


These concerns are why it's useful to keep one's identity small, use identity carefully, and be strategic in choosing your identity.


Yet these pieces speak more to System 2 than to System 1. I recently came up with a weird trick that has made me more comfortable identifying with groups or movements that resonate with me, while also giving me a visceral System 1 strategy for managing identity. The trick is simply to put the word “weird” before any identity category I think about.


I’m not an “American,” but a “weird American.” Once I started thinking about myself as a “weird American,” I was able to think calmly through which aspects of being American I identified with and which I did not, setting the latter aside from my identity. For example, I used the term “weird American” to describe myself when meeting a group of foreigners, and we had great conversations about what I meant and why I used the term. This subtle change satisfies my desire to identify with the label “American,” while letting me separate myself from any aspects of the label I don’t support.


Beyond nationality, I’ve started using the term “weird” in front of other identity categories. For example, I'm a professor at Ohio State. I used to become deeply frustrated when students didn’t prepare adequately for their classes with me. No matter how hard I tried, or whatever clever tactics I deployed, some students simply didn’t care. Instead of allowing that situation to keep bothering me, I started to think of myself as a “weird professor” - one who set up an environment that helped students succeed, but didn’t feel upset and frustrated by those who failed to make the most of it.


I’ve been applying the weird trick in my personal life, too. Thinking of myself as a “weird son” makes me feel more at ease when my mother and I don’t see eye-to-eye; thinking of myself as a “weird nice guy,” rather than just a nice guy, has helped me feel confident about my decisions to be firm when the occasion calls for it.


So, why does this weird trick work? It’s rooted in strategies of reframing and distancing, two research-based methods for changing our thought frameworks. Reframing involves changing one’s framework of thinking about a topic in order to create more beneficial modes of thinking. For instance, in reframing myself as a weird nice guy, I have been able to say “no” to requests people make of me, even though my intuitive nice guy tendency tells me I should say “yes.” Distancing refers to a method of emotional management through separating oneself from an emotionally tense situation and observing it from a third-person, external perspective. Thus, if I think of myself as a weird son, I don’t have nearly as many negative emotions during conflicts with my mom. It enables me to have space for calm and sound decision-making.


Thinking of myself as "weird" also applies to the context of rationality and effective altruism for me. Thinking of myself as a "weird" aspiring rationalist and EA helps me be more calm and at ease when I encounter criticisms of my approach to promoting rational thinking and effective giving. I can distance myself from the criticism better, and see what I can learn from the useful points in the criticism to update and be stronger going forward.


Overall, using the term “weird” before any identity category has freed me from confinements and restrictions associated with socially-imposed identity labels and allowed me to pick and choose which aspects of these labels best serve my own interests and needs. I hope being “weird” can help you manage your identity better as well!

Isomorphic agents with different preferences: any suggestions?

3 Stuart_Armstrong 19 September 2016 01:15PM

In order to better understand how AI might succeed and fail at learning knowledge, I'll be trying to construct models of limited agents (with bias, knowledge, and preferences) that display identical behaviour in a wide range of circumstances (but not all). This means their preferences cannot be deduced merely/easily from observations.

Does anyone have any suggestions for possible agent models to use in this project?

Open thread, Sep. 19 - Sep. 25, 2016

1 DataPacRat 19 September 2016 06:34PM

If it's worth saying, but not worth its own post, then it goes here.

Notes for future OT posters:

1. Please add the 'open_thread' tag.

2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)

3. Open Threads should start on Monday, and end on Sunday.

4. Unflag the two options "Notify me of new top level comments on this article" and "
