Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

A Call for More Policy Analysis

1 madhatter 25 June 2017 02:24PM

I would like to see more concrete discussion and analysis of AI policy in the EA community, and on this forum in particular.


 AI policy would broadly encompass all relevant actors meaningfully influencing the future and impact of AI, which would likely be governments, research labs and institutes, and international organizations.


Some initial thoughts and questions I have on this topic:


1)     How do we ensure all research groups with a likely chance of developing AGI know and care about the relevant work in AI safety (which hopefully is satisfactorily worked out by then)?


Some possibilities: trying to make AI safety a common feature of computer science curricula, general community building and more AI safety conferences, more popular culture conveying non-Terminator-esque illustrations of the risk.



2)     What strategies might be available for laggards in a race scenario to retard progress of leading groups, or to steal their research?

Some possibilities in no particular order: espionage, malware, financial or political pressures, power outages, surveillance of researchers.


3)     Will there be clear warning signs?


Not just in general AI progress, but locally near the leading lab. Observable changes in stock price, electricity output, etc.


4)     Openness or secrecy?

Thankfully the Future of Life Institute is working on this one. As I understand it, the consensus is that openness is advisable now, but secrecy may be necessary later. So what mechanisms are available to keep research private?


5)     How many players will there be with a significant chance of developing AGI? Which players?


6)     Is an arms race scenario likely?


7)     What is the most likely speed of takeoff?



8)     When and where will AGI be developed?


    Personally, I believe the use of forecasting tournaments to get a better sense of when and where AGI will arrive would be a very worthwhile use of our time and resources. After reading Superforecasting by Philip Tetlock and Dan Gardner, I was struck by how effective these tournaments are at singling out those with low Brier scores and using them to get better-than-average predictions of future circumstances.
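For readers unfamiliar with the term, a Brier score is the mean squared difference between probability forecasts and what actually happened, so lower is better. The following is a minimal sketch of the common binary form (scores between 0 and 1); the function name and the example forecasts are made up for illustration and are not taken from any actual tournament.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and binary outcomes.

    forecasts: probabilities in [0, 1] that each event happens.
    outcomes:  1 if the event happened, 0 if it did not.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally many forecasts and outcomes, at least one")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical resolved questions: 1 = happened, 0 = did not happen.
outcomes = [1, 0, 1, 1]

# A forecaster who commits to (mostly correct) confident probabilities...
confident = [0.9, 0.1, 0.8, 0.7]
# ...versus one who always answers 50%.
hedger = [0.5, 0.5, 0.5, 0.5]

print(brier_score(confident, outcomes))
print(brier_score(hedger, outcomes))
```

Always answering 50% scores 0.25 in this form, so a tournament can rank forecasters by how far below that baseline they land once questions resolve.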



Perhaps the EA community could fund a forecasting tournament on the Good Judgment Project posing questions attempting to ascertain when AGI will be developed (I am guessing superforecasters will make more accurate predictions than AI experts on this topic), which research groups are the most likely candidates to be the developers of the first AGI, and other relevant questions. We would need to formulate the questions such that they are specific enough for use in the tournament. 

[Link] The Use and Abuse of Witchdoctors for Life

5 lifelonglearner 24 June 2017 08:59PM

Humans are not agents: short vs long term

4 Stuart_Armstrong 09 June 2017 11:16AM

Crossposted at the Intelligent Agents Forum.

This is an example of humans not being (idealised) agents.

Imagine a human who has a preference to not live beyond a hundred years. However, they want to live to next year, and it's predictable that every year they are alive, they will have the same desire to survive till the next year.

This human (not a completely implausible example, I hope!) has a contradiction between their long and short term preferences. So which is accurate? It seems we could resolve these preferences in favour of the short term ("live forever") or the long term ("die after a century") preferences.

Now, at this point, maybe we could appeal to meta-preferences - what would the human themselves want, if they could choose? But often these meta-preferences are un- or under-formed, and can be influenced by how the question or debate is framed.

Specifically, suppose we are scheduling this human's agenda. We have the choice of making them meet one of two philosophers (not meeting anyone is not an option). If they meet Professor R. T. Long, he will advise them to follow long term preferences. If instead, they meet Paul Kurtz, he will advise them to pay attention to their short term preferences. Whichever one they meet, they will argue for a while and will then settle on the recommended preference resolution. And then they will not change that, whoever they meet subsequently.

Since we are doing the scheduling, we effectively control the human's meta-preferences on this issue. What should we do? And what principles should we use to do so?

It's clear that this can apply to AIs: if they are simultaneously aiding humans as well as learning their preferences, they will have multiple opportunities to do this sort of preference-shaping.

Mode Collapse and the Norm One Principle

14 tristanm 05 June 2017 09:30PM

[Epistemic status: I assign a 70% chance that this model proves to be useful, 30% chance it describes things we are already trying to do to a large degree, and won't cause us to update much.] 

I'm going to talk about something that's a little weird, because it uses some results from some very recent ML theory to make a metaphor about something seemingly entirely unrelated - norms surrounding discourse. 

I'm also going to reach some conclusions that surprised me when I finally obtained them, because it caused me to update on a few things that I had previously been fairly confident about. This argument basically concludes that we should adopt fairly strict speech norms, and that there could be great benefit to moderating our discourse well. 

I argue that in fact, discourse can be considered an optimization process and can be thought of in the same way that we think of optimizing a large function. As I will argue, thinking of it in this way will allow us to make a very specific set of norms that are easy to think about and easy to enforce. It is partly a proposal for how to solve the problem of dealing with speech that is considered hostile, low-quality, or otherwise harmful. But most importantly, it is a proposal for how to ensure that the discussion always moves in the right direction: Towards better solutions and more accurate models. 

It will also help us avoid something I'm referring to as "mode collapse" (where new ideas generated are non-diverse and are typically characterized by adding more and more details to ideas that have already been tested extensively). It's also highly related to the concepts discussed in the Death Spirals and the Cult Attractor portion of the Sequences. Ideally, we'd like to be able to make sure that we're exploring as much of the hypothesis space as possible, and there's good reason to believe we're probably not doing this very well.  

The challenge: Making sure we're searching for the global optimum in model-space sometimes requires reaching out blindly into the frontiers, the not well-explored regions, which runs the risk of ending up somewhere very low-quality or dangerous. There are also sometimes large gaps between very different regions of model-space where the quality of the model is very low in-between, but very high on each side of the gap. This requires traversing through potentially dangerous territory and being able to survive the whole way through.

(I'll be using terms like "models" and "hypotheses" quite often, and I hope this isn't confusing. I am using them very broadly, to refer to both theoretical understandings of phenomena and blueprints for practical implementations of ideas). 

We desire to have a set of principles which allows us to do this safely - to think about models of the world that are new and untested, and solutions to problems that have never been approached in a similar way - and they should ensure that, eventually, we can reach the global optimum. 

Before we derive that set of principles, I am going to introduce a topic of interest from the field of Machine Learning. This topic will serve as the main analogy for the rest of this piece, and serve as a model for how the dynamics of discourse should work in the ideal case. 

I. The Analogy: Generative Adversarial Networks

For those of you who are not familiar with the recent developments in deep learning, Generative Adversarial Networks (GANs)[intro pdf here] are a new class of generative models that are ideal for producing high-quality samples from very high-dimensional, complex distributions. They have caused great buzz and hype in the deep-learning community due to how impressive some of the samples they produce are, and how efficient they are at generation.

Put simply, a generator model and a critic (sometimes called a discriminator) model perform a two-player game where the critic is trained to distinguish between samples produced by the generator and the "true" samples taken from the data distribution. In turn, the generator is trained to maximize the critic's loss function. Both models are usually parametrized by deep neural networks and can be trained by taking turns running a gradient descent step on each. The Nash equilibrium of this game is when the generator's distribution matches that of the data distribution perfectly. This is never really borne out in practice, but sometimes it gets so close that we don't mind. 
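For readers who prefer symbols, the two-player game described above is usually written as the minimax objective below. This is the standard formulation from the GAN literature rather than anything specific to this post, and the notation is introduced here as an assumption: D is the critic, G the generator, p_data the data distribution, and p_z the noise distribution the generator samples from.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The critic pushes V up by scoring real samples near 1 and generated samples near 0, while the generator pushes V down, which is the sense in which it is trained to maximize the critic's loss.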

GANs have one principal failure mode, often called "mode collapse" (a term I'm going to appropriate to refer to a much broader concept), which is often thought to be due to the instability of the system. It was often believed that, if a careful balance between the generator and critic could not be maintained, one would eventually overpower the other - leading the critic to provide either useless or overly harsh information to the generator. Useless information will cause the generator to update very slowly or not at all, and overly harsh information will lead the samples to "collapse" to a small region of the data space that is the easiest target for the generator to hit.  

This problem was essentially solved earlier this year due to a series of papers that propose modifications to the loss functions that GANs use, and, most crucially, add another term to the critic's loss which stabilizes the gradient (with respect to the inputs) to have a norm close to one. It was recognized that we actually desire an extremely powerful critic so that the generator can make the best updates it possibly can, but the updates themselves can't go beyond what the generator is capable of handling. With these changes to the GAN formulation, it became possible to use crazy critic networks such as ultra-deep ResNets and train them as much as desired before updating the generator network.  
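To make the "norm close to one" term concrete, here is a minimal sketch of a gradient penalty of the kind described above: it measures the critic's gradient with respect to its input and charges a cost for any deviation of that gradient's norm from one. Several things here are assumptions for illustration only: the two toy linear critics, the sample points, the use of finite differences in place of the automatic differentiation a real implementation would use, and the weight `lam=10.0` (a commonly used default, not a value from this post).

```python
import math


def input_gradient(critic, x, eps=1e-6):
    """Central finite-difference estimate of the critic's gradient w.r.t. its input x."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((critic(hi) - critic(lo)) / (2 * eps))
    return grad


def gradient_penalty(critic, points, lam=10.0):
    """Average of lam * (||grad|| - 1)^2 over the given input points."""
    total = 0.0
    for x in points:
        norm = math.sqrt(sum(g * g for g in input_gradient(critic, x)))
        total += lam * (norm - 1.0) ** 2
    return total / len(points)


def gentle_critic(x):
    # Toy linear critic whose input-gradient is (0.6, 0.8), i.e. norm one.
    return 0.6 * x[0] + 0.8 * x[1]


def harsh_critic(x):
    # Toy linear critic whose input-gradient is (3, 4), i.e. norm five.
    return 3.0 * x[0] + 4.0 * x[1]


points = [[0.0, 0.0], [1.0, -2.0]]
print(gradient_penalty(gentle_critic, points))  # close to zero
print(gradient_penalty(harsh_critic, points))   # large: the gradient norm is far from one
```

Adding a term like this to the critic's loss is what lets the critic be trained very hard without its feedback becoming too steep for the generator to use.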

The principle behind their operation is rather simple to describe, but unfortunately, it is much more difficult to explain why they work so well. However, I believe that as long as we know how to make one, and know the specific implementation details that improve their stability, their principles can be applied more broadly to achieve success in a wide variety of regimes. 

II. GANs as a Model of Discourse

In order to use GANs as a tool for conceptual understanding of discourse, I propose to model the dynamics of debate as a collection of hypothesis-generators and hypothesis-critics. This could be likened to the structure of academia - researchers publish papers, they go through peer-review, the work is iterated on and improved - and over time this process converges to more and more accurate models of reality (or so we hope). Most individuals within this process play both roles, but in theory this process would still work even if they didn't. For example, Isaac Newton was a superb hypothesis generator, but he also had some wacky ideas that most of us would consider to be obviously absurd. Nevertheless, calculus and Newtonian physics became a part of our accepted scientific knowledge, and alchemy didn't. The system adopted and iterated on his good ideas while throwing away the bad. 

Our community should be capable of something similar, while doing it more efficiently and not requiring the massive infrastructure of academia. 

A hypothesis-generator is not something that just randomly pulls out a model from model-space. It proposes things that are close modifications of things it already holds to be likely within its model (though I expect this point to be debatable). Humans are both hypothesis-generators and hypothesis-critics. And as I will argue, that distinction is not quite as sharply defined as one would think. 

I think there has always been an underlying assumption within the theory of intelligence that creativity and recognition / distinction are fundamentally different. In other words, one can easily understand Mozart to be a great composer, but it is much more difficult to be a Mozart. Naturally this belief made its way into the field of Artificial Intelligence too, and became somewhat of a dogma. Computers might be able to play Chess, they might be able to play Go, but they aren't doing anything fundamentally intelligent. They lack the creative spark, they work on pure brute-force calculation only, with maybe some heuristics and tricks that their human creators bestowed upon them.  

GANs seem to defy this principle. Trained on a dataset of photographs of human faces, a GAN generator learns to produce near-photo-realistic images that nonetheless do not fully match any of the faces the critic network saw (one of the reasons why CelebA was such a good choice to test these on), and are therefore in some sense producing things which are genuinely original. It may have once been thought that there was a fundamental distinction between creation and critique, but perhaps that's not really the case. GANs were a surprising discovery, because they showed that it was possible to make impressive "creations" by starting from random nonsense and slowly tweaking it in the direction of "good" until it eventually got there (well okay, that's basically true for the whole of optimization, but it was thought to be especially difficult for generative models).

What does this mean? Could someone become a "Mozart" by beginning a musical composition from random noise and slowly tweaking it until it became a masterpiece?

The above seems to imply "yes, perhaps." However, this is highly contingent on the quality of the "tweaking." It seems possible only as long as the directions to update in are very high quality. What if they aren't very high quality? What if they point nowhere, or in very bad directions?

I think the default distribution of discourse is that it is characterized by a large number of these directionless, low-quality contributions, and that this is likely one of the main factors behind mode collapse. This is related to what has been noted before: too much intolerance for imperfect ideas (or ideas outside of established dogma) in a community prevents useful tasks from being accomplished, and progress from being made. Academia does not seem immune to this problem. The risk is greatest where low-quality or hostile discussion is tolerated.  

Fortunately, making sure we get good "tweaks" seems to be the easy part. Critique is in high abundance. Our community is apparently very good at it. We also don't need to worry much about the ratio of hypothesis-generators to hypothesis-critics, as long as we can establish good principles that allow us to follow GANs as closely as possible. The nice feature of the GAN formulation is that you are allowed to make the critic as powerful as you want. In fact, the critic should be more powerful than the generator (If the generator is too powerful, it just goes directly to the argmax of the critic). 

(In addition, any collection of generators is a generator, and any collection of critics is a critic. So this formulation can be applied to the community setting).

III. The Norm One Principle

So the question then becomes, how do we take an algorithm governing a game between models much simpler than a human, and apply the same tweaks, which consist of nothing more than a few very simple equations, to human discourse? 

Here what I devise is a strategy for taking the concept of the norm of the critic gradient being as close to one as possible, and using that as a heuristic for how to structure appropriate discourse. 

(This is where my argument gets more speculative and I expect to update this a lot, and where I welcome the most criticism).

What I propose is that we begin modeling the concept of "criticism" based on how useful it is to the idea-generator receiving the criticism. Under this model, I think we should start breaking down criticism into two fundamental attributes:

  1. Directionality - does the criticism contain highly useful information, such that the "generator" knows how to update their model / hypothesis / proposal?
  2. Magnitude - Is the criticism too harsh, does it point to something completely unlike the original proposal, or otherwise require changes that aren't feasible for the generator to make?

My claim is that any contribution to a discussion should satisfy the "Norm One Principle." In other words, it should have a well-defined direction, and the quantity of change should be feasible to implement.

If a critique can satisfy our requirements for both directionality and magnitude, then it serves a useful purpose. The inverse claim is that if we can't follow these requirements, we risk falling into mode collapse, where the ideas commonly proposed are almost indistinguishable from the ones which preceded them, and ideas which deviate too far from the norm are harshly condemned and suppressed. 

I think it's natural to question whether or not restricting criticism to follow certain principles is a form of speech suppression that prevents useful ideas from being considered. But the pattern I'm proposing doesn't restrict the "generation" process, the creative aspect which produces new hypotheses. It doesn't restrict the topics that can be discussed. It only restricts the criticism of those hypotheses, such that they are maximally useful to the source of the hypothesis. 

One of the primary fears behind having too much criticism is that it discourages people from contributing because they want to avoid the negative feedback. But under the Norm One Principle, I think it is useful to distinguish between disagreement and criticism. I think if we're following these norms properly, we won't need to consider criticism to be a negative reward. In fact, criticism can be positive. Agreement could be considered "criticism in the same direction you are moving in." Disagreement would be the opposite. And these norms also eliminate the kind of feedback that tends to be the most discouraging. 

For example, some things which violate "Norm One":

  • Ad hominem attacks (typically directionless). 
  • Affective Death Spirals (unlimited praise or denunciation is usually directionless, and usually very high magnitude). 
  • Signs that cause aversion (things I "don't like", that trigger my System 1 alarms, which probably violate both directionality and magnitude). 
  • Lengthy lists of changes to make (norm greater than one; ideally we want to focus on small sets of changes that have the highest priority). 
  • Repetition of points that have already been made (norm greater than one). 

One of my strongest hopes is that whoever is playing the part of the "generator" is able to compile the list of critiques easily and use them to update somewhere close to the optimal direction. This would be difficult if the sum of all critiques is either directionless (many critics point in opposite or near-opposite directions) or very high-magnitude (critics simply say to get as far away from here as possible). 

But let's suppose that each individual criticism satisfies the Norm One principle. We will also assume that the generator is weighting each critique by their respect for whoever produced it, which I think is highly likely. Then the generator should be able to move in a direction unless the sum of the directions completely cancels out. It is unlikely for this to happen - unless there is very strong epistemic disagreement in the community over some fundamental assumptions (in which case the conversation should probably move over to that). 
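The aggregation step in the paragraph above can be pictured with a toy sketch. It treats each critique as a direction in some idea-space, rescales it to norm one, and sums the results weighted by the generator's respect for each critic. The vector representation, the function names, and the numbers are all hypothetical; this is an illustration of the metaphor, not a proposed mechanism.

```python
import math


def unit(direction):
    """Rescale a critique's direction to norm one; a zero vector has no direction."""
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0:
        raise ValueError("directionless critique")
    return [c / norm for c in direction]


def combine(critiques):
    """Respect-weighted sum of norm-one critiques.

    critiques: list of (direction, respect_weight) pairs, all of the same dimension.
    """
    dim = len(critiques[0][0])
    total = [0.0] * dim
    for direction, weight in critiques:
        u = unit(direction)
        for i in range(dim):
            total[i] += weight * u[i]
    return total


# Two critics pointing in different but compatible directions: a usable update remains.
print(combine([([1, 0], 1.0), ([0, 2], 0.5)]))

# Two equally respected critics pointing in exactly opposite directions: the update cancels.
print(combine([([1, 0], 1.0), ([-3, 0], 1.0)]))
```

In this picture a directionless comment has no vector to contribute at all, and an overly harsh one is simply rescaled, which is why only genuine opposition between critics can leave the generator with nowhere to move.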

In addition, it also becomes less likely for the directions to cancel out as the number of inputs increases. Thus, it seems that proposals for new models should be presented to a wide audience, and we should avoid the temptation to keep our proposals hidden to all except for a small set of people we trust.

So I think that in general, this proposed structure should tend to increase the amount of collective trust we have in the community, and that it favors transparency and favors diversity of viewpoints. 

But what of the possible failure modes of this plan? 

This model should fail if the specific details of its implementation either remove too much discussion, or fail to deal with individuals who refuse to follow the norms and refuse to update. Any implementation should allow room for anyone to update. Someone who posts an extremely hostile, directionless comment should be allowed chances to modify their contribution. The only scenario in which the "banhammer" becomes appropriate is when this model fails to apply: The cardinal sin of rationality, the refusal to update. 

IV. Building the Ideal "Generator"

As a final point, I'll note that the above assumes that generators will be able to update their models incrementally. The easy part, as I mentioned, is obtaining the updates; the hard part is accumulating them. This seems difficult with the infrastructure we have in place. What we do have is a good system for posting proposals and receiving feedback (the blog post / comment thread set-up), but this assumes that each "generator" keeps track of their models by themselves and has to be fully aware of the status of other models on their own. There is no centralized "mixture model" anywhere that contains the full set of models weighted by how much probability they are given by the community. Currently, we do not have a good solution for this problem. 

However, it seems that the first conception of Arbital was centered around finding a solution to this kind of problem:

Arbital has bigger ambitions than even that. We all dream of a world that eliminates the duplication of effort in online argument - a world where, the same way that Wikipedia centralized the recording of definite facts, an argument only needs to happen once, instead of being reduplicated all over the Internet; with all the branches of the argument neatly recorded in the same place, along with some indication of who believes what. A world where 'just check Arbital' had the same status for determining the current state of debates, as 'just check Wikipedia' now has when somebody starts arguing about the population of Melbourne. There's entirely new big subproblems and solutions, not present at all in the current Arbital, that we'd need to tackle that considerably more difficult problem. But to solve 'explaining things' is something of a first step. If you have a single URL that you can point anyone to for 'explaining Bayes', and if you can dispatch people to different pages depending on how much math they know, you're starting to solve some of the key subproblems in removing the redundancy in online arguments.

If my proposed model is accurate, then it suggests that the problem Arbital aims to solve is in fact quite crucial to solve, and that the developers of Arbital should consider working through each obstacle they face without pivoting from this original goal. I feel confident enough that this goal should be high priority that I'd be willing to support its development in whatever way is deemed most helpful and is feasible for me (I am not an investor, but I am a programmer and would also be capable of making small donations, or contributing material). 

The only thing this model would require of Arbital is that it be as open as possible to contributions, with heavy moderation or filtering of contributed content (but importantly not the other way around, where it is closed to a small group of trusted people).

Currently, the incremental changes that would have to be made to LessWrong and related sites like SSC would simply be increased moderation of comment quality. Otherwise, any further progress on the problem would require overcoming much more serious obstacles requiring significant re-design and architecture changes. 

Everything I've written above is also subject to the model I've just outlined, and therefore I expect to make incremental updates as feedback to this post accrues.

My initial prediction for feedback to this post is that the ideas might be considered helpful and offer a useful perspective or a good starting point, but that there are probably many details that I have missed that would be useful to discuss, or points that were not quite well-argued or well thought-out. I will look out for these things in the comments.   

Existential risk from AI without an intelligence explosion

12 AlexMennen 25 May 2017 04:44PM

[xpost from my blog]

In discussions of existential risk from AI, it is often assumed that the existential catastrophe would follow an intelligence explosion, in which an AI creates a more capable AI, which in turn creates a yet more capable AI, and so on, a feedback loop that eventually produces an AI whose cognitive power vastly surpasses that of humans, which would be able to obtain a decisive strategic advantage over humanity, allowing it to pursue its own goals without effective human interference. Victoria Krakovna points out that many arguments that AI could present an existential risk do not rely on an intelligence explosion. I want to look in slightly more detail at how that could happen. Kaj Sotala also discusses this.

An AI starts an intelligence explosion when its ability to create better AIs surpasses that of human AI researchers by a sufficient margin (provided the AI is motivated to do so). An AI attains a decisive strategic advantage when its ability to optimize the universe surpasses that of humanity by a sufficient margin. Which of these happens first depends on what skills AIs have the advantage at relative to humans. If AIs are better at programming AIs than they are at taking over the world, then an intelligence explosion will happen first, and the resulting AI will then be able to get a decisive strategic advantage soon after. But if AIs are better at taking over the world than they are at programming AIs, then an AI would get a decisive strategic advantage without an intelligence explosion occurring first.

Since an intelligence explosion happening first is usually considered the default assumption, I'll just sketch a plausibility argument for the reverse. There's a lot of variation in how easy cognitive tasks are for AIs compared to humans. Since programming AIs is not yet a task that AIs can do well, it doesn't seem like it should be a priori surprising if programming AIs turned out to be an extremely difficult task for AIs to accomplish, relative to humans. Taking over the world is also plausibly especially difficult for AIs, but I don't see strong reasons for confidence that it would be harder for AIs than starting an intelligence explosion would be. It's possible that an AI with significantly but not vastly superhuman abilities in some domains could identify some vulnerability that it could exploit to gain power, which humans would never think of. Or an AI could be enough better than humans at forms of engineering other than AI programming (perhaps molecular manufacturing) that it could build physical machines that could out-compete humans, though this would require it to obtain the resources necessary to produce them.

Furthermore, an AI that is capable of producing a more capable AI may refrain from doing so if it is unable to solve the AI alignment problem for itself; that is, if it can create a more intelligent AI, but not one that shares its preferences. This seems unlikely if the AI has an explicit description of its preferences. But if the AI, like humans and most contemporary AI, lacks an explicit description of its preferences, then the difficulty of the AI alignment problem could be an obstacle to an intelligence explosion occurring.

It also seems worth thinking about the policy implications of the differences between existential catastrophes from AI that follow an intelligence explosion versus those that don't. For instance, AIs that attempt to attain a decisive strategic advantage without undergoing an intelligence explosion will exceed human cognitive capabilities by a smaller margin, and thus would likely attain strategic advantages that are less decisive, and would be more likely to fail. Thus containment strategies are probably more useful for addressing risks that don't involve an intelligence explosion, while attempts to contain a post-intelligence explosion AI are probably pretty much hopeless (although it may be worthwhile to find ways to interrupt an intelligence explosion while it is beginning). Risks not involving an intelligence explosion may be more predictable in advance, since they don't involve a rapid increase in the AI's abilities, and would thus be easier to deal with at the last minute, so it might make sense far in advance to focus disproportionately on risks that do involve an intelligence explosion.

It seems likely that AI alignment would be easier for AIs that do not undergo an intelligence explosion, since it is more likely to be possible to monitor and do something about it if it goes wrong, and lower optimization power means lower ability to exploit the difference between the goals the AI was given and the goals that were intended, if we are only able to specify our goals approximately. The first of those reasons applies to any AI that attempts to attain a decisive strategic advantage without first undergoing an intelligence explosion, whereas the second only applies to AIs that do not undergo an intelligence explosion ever. Because of these, it might make sense to attempt to decrease the chance that the first AI to attain a decisive strategic advantage undergoes an intelligence explosion beforehand, as well as the chance that it undergoes an intelligence explosion ever, though preventing the latter may be much more difficult. However, some strategies to achieve this may have undesirable side-effects; for instance, as mentioned earlier, AIs whose preferences are not explicitly described seem more likely to attain a decisive strategic advantage without first undergoing an intelligence explosion, but such AIs are probably more difficult to align with human values.

If AIs get a decisive strategic advantage over humans without an intelligence explosion, then since this would likely involve the decisive strategic advantage being obtained much more slowly, it would be much more likely for multiple, and possibly many, AIs to gain decisive strategic advantages over humans, though not necessarily over each other, resulting in a multipolar outcome. Thus considerations about multipolar versus singleton scenarios also apply to decisive strategic advantage-first versus intelligence explosion-first scenarios.

AI safety: three human problems and one AI issue

9 Stuart_Armstrong 19 May 2017 10:48AM

Crossposted at the Intelligent agent foundation.

There have been various attempts to classify the problems in AI safety research. These range from our old Oracle paper, which classified then-theoretical methods of control, to more recent classifications that grow out of modern, more concrete problems.

These all serve their purpose, but I think a more enlightening classification of the AI safety problems is to look at the issues we are trying to solve or avoid. And most of these issues are problems about humans.

Specifically, I feel AI safety issues can be classified as three human problems and one central AI issue. The human problems are:

  • Humans don't know their own values (sub-issue: humans know their values better in retrospect than in prediction).
  • Humans are not agents and don't have stable values (sub-issue: humanity itself is even less of an agent).
  • Humans have poor predictions of an AI's behaviour.

And the central AI issue is:

  • AIs could become extremely powerful.

Obviously if humans were agents and knew their own values and could predict whether a given AI would follow those values or not, there would be no problem. Conversely, if AIs were weak, then the human failings wouldn't matter so much.

The point about human values is relatively straightforward, but what's the problem with humans not being agents? Essentially, humans can be threatened, tricked, seduced, exhausted, drugged, modified, and so on, in order to act seemingly against our interests and values.

If humans were clearly defined agents, then what counts as a trick or a modification would be easy to define and exclude. But since this is not the case, we're reduced to trying to figure out the extent to which something like a heroin injection is a valid way to influence human preferences. This makes humans susceptible to manipulation, and human values hard to define.

Finally, the issue of humans having poor predictions of AI is more general than it seems. If you want to ensure that an AI has the same behaviour in the testing and training environment, then you're essentially trying to guarantee that you can predict that the testing environment behaviour will be the same as the (presumably safe) training environment behaviour.


How to classify methods and problems

That's well and good, but how do various traditional AI methods or problems fit into this framework? This should give us an idea as to whether the framework is useful.

It seems to me that:


  • Friendly AI is trying to solve the values problem directly.
  • IRL and Cooperative IRL are also trying to solve the values problem. The greatest weakness of these methods is the not agents problem.
  • Corrigibility/interruptibility are also addressing the issue of humans not knowing their own values, using the sub-issue that human values are clearer in retrospect. These methods also overlap with poor predictions.
  • AI transparency is aimed at getting round the poor predictions problem.
  • Laurent's work on carefully defining the properties of agents is mainly also about solving the poor predictions problem.
  • Low impact and Oracles are aimed squarely at preventing AIs from becoming powerful. Methods that restrict the Oracle's output implicitly accept that humans are not agents.
  • Robustness of the AI to changes between testing and training environment, degradation and corruption, etc... ensures that humans won't be making poor predictions about the AI.
  • Robustness to adversaries is dealing with the sub-issue that humanity is not an agent.
  • The modular approach of Eric Drexler is aimed at preventing AIs from becoming too powerful, while reducing our poor predictions.
  • Logical uncertainty, if solved, would reduce the scope for certain types of poor predictions about AIs.
  • Wireheading, when the AI takes control of the reward channel, is a problem of humans not knowing their values (and hence using an indirect reward) and of humans making poor predictions about the AI's actions.
  • Wireheading, when the AI takes control of the human, is as above but also a problem that humans are not agents.
  • Incomplete specifications are either a problem of not knowing our own values (and hence missing something important in the reward/utility) or making poor predictions (when we thought that a situation was covered by our specification, but it turned out not to be).
  • AIs modelling human knowledge seem to be mostly about getting round the fact that humans are not agents.

Putting this all in a table:


                                  Values  Not Agents  Poor Predictions  Powerful
Friendly AI                       X
IRL and Cooperative IRL           X       X
Corrigibility/interruptibility    X                   X
AI transparency                                       X
Laurent's work                                        X
Low impact and Oracles                    X                             X
Robustness to changes                                 X
Robustness to adversaries                 X
Modular approach                                      X                 X
Logical uncertainty                                   X
Wireheading (reward channel)      X                   X
Wireheading (human)               X       X           X
Incomplete specifications         X                   X
AIs modelling human knowledge             X


Further refinements of the framework

It seems to me that the third category - poor predictions - is the most likely to be expandable. For the moment, it just incorporates all our lack of understanding about how AIs would behave, but it might be more useful to subdivide it.

Thoughts on civilization collapse

15 Stuart_Armstrong 04 May 2017 10:41AM

Epistemic status: an idea I believe moderately strongly, based on extensive reading but not rigorous analysis.

We may have a dramatically wrong idea of civilization collapse, mainly inspired by movies that obsess over dramatic tales of individual heroism.


Traditional view:

In a collapse, anarchy will break out, and it will be a war of all against all or small groups against small groups. Individual weaponry (including heavy weapons) and basic food production will become paramount; traditional political skills, not so much. Government collapse is long term. Towns and cities will suffer more than the countryside. The best course of action is to have a cache of weapons and food, and to run for the hills.


Alternative view:

In a collapse, people will cling to their identified tribe for protection. Large groups will have no difficulty suppressing or taking over individuals and small groups within their areas of influence. Individual weaponry may be important (given less of a police force), but heavy weaponry will be almost irrelevant as no small group will survive alone. Food production will be controlled by the large groups. Though the formal "government" may fall, and countries may splinter into more local groups, government will continue under the control of warlords, tribal elders, or local variants. Cities, with their large and varied-skill workforce, will suffer less than the countryside. The best course of action is to have a stash of minor luxury goods (solar-powered calculators, comic books, pornography, batteries, antiseptics) and to make contacts with those likely to become powerful after a collapse (army officers, police chiefs, religious leaders, influential families).

Possible sources to back up this alternative view:

  • The book Sapiens argues that governments and markets are the ultimate enablers of individualism, with extended-family-based tribalism as the "natural" state of humanity.
  • The history of Somalia demonstrates that laws and enforcement continue even after a government collapse, by going back to more traditional structures.
  • During China's period of anarchy, large groups remained powerful: the nationalists, the communists, the Japanese invaders. The other sections of the country were generally under the control of local warlords.
  • Rational Wiki argues that examples of collapse go against the individualism narrative.


[Link] Nate Soares' "Assuming Good Intent"

8 Raemon 30 April 2017 05:45PM

Use and misuse of models: case study

12 Stuart_Armstrong 27 April 2017 02:36PM

Some time ago, I discovered a post comparing basic income and basic job ideas. This sought to analyse the costs of paying everyone a guaranteed income versus providing them with a basic job with that income. The author spelt out his assumptions and put together two models with a few components (including some whose values were drawn from various probability distributions). Then he ran a Monte Carlo simulation to get a distribution of costs for either policy.

Normally I should be very much in favour of this approach. It spells out the assumptions, it uses models, it decomposes the problem, it has stochastic uncertainty... Everything seems ideal. To top it off, the author concluded with a challenge aiming at improving reasoning around this subject:

How to Disagree: Write Some Code

This is a common theme in my writing. If you are reading my blog you are likely to be a coder. So shut the fuck up and write some fucking code. (Of course, once the code is written, please post it in the comments or on github.)

I've laid out my reasoning in clear, straightforward, and executable form. Here it is again. My conclusions are simply the logical result of my assumptions plus basic math - if I'm wrong, either Python is computing the wrong answer, I got really unlucky in all 32,768 simulation runs, or one of my assumptions is wrong.

My assumption being wrong is the most likely possibility. Luckily, this is a problem that is solvable via code.

And yet... I found something very unsatisfying. And it took me some time to figure out why. It's not that these models are helpful, or that they're misleading. It's that they're both simultaneously.

To explain, consider the result of the Monte Carlo simulations. Here are the outputs (I added the red lines; we'll get to them soon):

The author concluded from these outputs that a basic job was much more efficient - less costly - than a basic income (roughly 1 trillion cost versus 3.4 trillion US dollars). He changed a few assumptions to test whether the result held up:

For example, maybe I'm overestimating the work disincentive for Basic Income and grossly underestimating the administrative overhead of the Basic Job. Lets assume both of these are true. Then what?

The author then found similar results, with some slight shifting of the probability masses.


The problem: what really determined the result

So what's wrong with this approach? It turns out that most of the variables in the models have little explanatory power. For the top red line, I just multiplied the US population by the basic income. The curve is slightly above it, because it includes such things as administrative costs. The basic job situation was slightly more complicated, as it includes a disabled population that gets the basic income without working, and an estimate for the added value that the jobs would provide. So the bottom red line is (disabled population)x(basic income) + (unemployed population)x(basic income) - (unemployed population)x(median added value of jobs). The distribution is wider than for basic income, as the added value of the jobs is a stochastic variable.
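A minimal sketch of the two red lines as plain arithmetic. This is not the original author's code: the function names are hypothetical, and the inputs are placeholders chosen for easy arithmetic rather than the figures used in the post.

```python
def basic_income_cost(population, basic_income):
    """Top red line: pay every person the basic income."""
    return population * basic_income


def basic_job_cost(disabled, unemployed, basic_income, median_job_value):
    """Bottom red line: pay the disabled and the unemployed the basic
    income, then subtract the value produced by the basic jobs."""
    return (disabled * basic_income
            + unemployed * basic_income
            - unemployed * median_job_value)


# Placeholder inputs (assumptions for illustration, not real data).
population = 1_000
disabled = 50
unemployed = 100
basic_income = 10
median_job_value = 4

income_line = basic_income_cost(population, basic_income)
job_line = basic_job_cost(disabled, unemployed, basic_income, median_job_value)
print(income_line, job_line)
```

With any inputs where only a fraction of the population is disabled or unemployed, the second figure comes out well below the first, before any other variable is considered.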

But, anyway, the contributions of the other variables were very minor. So the reduced cost of basic jobs versus basic income is essentially a consequence of the trivial fact that it's more expensive to pay everyone an income than to only pay some people and then put them to work at something of non-zero value.
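To illustrate how a Monte Carlo run can end up close to the simple product, here is a rough sketch, not the original model. Everything in it is an assumption made for illustration: the placeholder inputs, the uniform distributions, the 0-5% administrative overhead, and the 10,000 runs.

```python
import random
import statistics

# Placeholder inputs (assumptions for illustration, not real data).
DISABLED = 50
UNEMPLOYED = 100
BASIC_INCOME = 10
MEDIAN_JOB_VALUE = 4.0


def simulate_basic_job_cost(rng):
    """One simulated cost of the basic job policy."""
    # Stochastic added value per job, symmetric around the median.
    job_value = rng.uniform(0.0, 2 * MEDIAN_JOB_VALUE)
    # A minor extra variable: administrative overhead of 0-5%.
    overhead = rng.uniform(0.0, 0.05)
    paid = (DISABLED + UNEMPLOYED) * BASIC_INCOME * (1 + overhead)
    return paid - UNEMPLOYED * job_value


rng = random.Random(0)  # fixed seed so the run is repeatable
runs = [simulate_basic_job_cost(rng) for _ in range(10_000)]
mean_cost = statistics.mean(runs)

# The back-of-envelope line, with no extra variables at all.
simple_line = ((DISABLED + UNEMPLOYED) * BASIC_INCOME
               - UNEMPLOYED * MEDIAN_JOB_VALUE)
relative_gap = abs(mean_cost - simple_line) / simple_line
print(round(mean_cost), simple_line)
```

Under these assumptions the simulated mean stays within a few percent of the simple line, while the spread of the runs comes almost entirely from the stochastic job value.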


Trees and forests

So were the complicated extra variables and Monte Carlo runs for nothing? Not completely - they showed that the extra variables were indeed of little importance, and unlikely to change the results much. But nevertheless, the whole approach has one big, glaring flaw: it does not account for the extra value for individuals of having a basic income versus a basic job.

And the challenge - "write some fucking code" - obscures this. The forest of extra variables and the thousands of runs hide the fact that there is a fundamental assumption missing. And pointing this out is enough to change the result, without even needing to write code. Note this doesn't mean the result is wrong: some might even argue that people are better off with a job than with the income (builds pride in one's work, etc...). But that needs to be addressed.

So Chris Stucchio's careful work does show one result - most reasonable assumptions do not change the fact that basic income is more expensive than basic job. And to disagree with that, you do indeed need to write some fucking code. But the stronger result - that basic job is better than basic income - is not established by this post. A model can be well designed, thorough, filled with good uncertainties, and still miss the mark. You don't always have to enter into the weeds of the model's assumptions in order to criticise it.

[Link] Putanumonit: If rationality is a religion, it's a crappy one.

5 Jacobian 15 April 2017 04:44PM
