Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

In praise of fake frameworks

14 Valentine 11 July 2017 02:12AM

Related to: Bucket errors, Categorizing Has Consequences, Fallacies of Compression

Followup to: Gears in Understanding


I use a lot of fake frameworks — that is, ways of seeing the world that are probably or obviously wrong in some important way.


I think this is an important skill. There are obvious pitfalls, but I think the advantages are more than worth it. In fact, I think the "pitfalls" can even sometimes be epistemically useful.


Here I want to share why. This is for two reasons:


  • I think fake framework use is a wonderful skill. I want it represented more in rationality in practice. Or, I want to know where I'm missing something, and Less Wrong is a great place for that.

  • I'm building toward something. This is actually a continuation of Gears in Understanding, although I imagine it won't be at all clear here how. I need a suite of tools in order to describe something. Talking about fake frameworks is a good way to demo tool #2.


With that, let's get started.


[Link] The Internet as an existential threat

4 Kaj_Sotala 09 July 2017 11:40AM

Against lone wolf self-improvement

27 cousin_it 07 July 2017 03:31PM

LW has a problem. Openly or covertly, many posts here promote the idea that a rational person ought to be able to self-improve on their own. Some of it comes from Eliezer's refusal to attend college (and Luke dropping out of his bachelors, etc). Some of it comes from our concept of rationality, that all agents can be approximated as perfect utility maximizers with a bunch of nonessential bugs. Some of it is due to our psychological makeup and introversion. Some of it comes from trying to tackle hard problems that aren't well understood anywhere else. And some of it is just the plain old meme of heroism and forging your own way.

I'm not saying all these things are 100% harmful. But the end result is a mindset of lone wolf self-improvement, which I believe has harmed LWers more than any other part of our belief system.

Any time you force yourself to do X alone in your room, or blame yourself for not doing X, or feel isolated while doing X, or surf the web to feel some human contact instead of doing X, or wonder if X might improve your life but can't bring yourself to start... your problem comes from believing that lone wolf self-improvement is fundamentally the right approach. That belief is comforting in many ways, but noticing it is enough to break the spell. The fault wasn't with the operator all along. Lone wolf self-improvement doesn't work.

Doesn't work compared to what? Joining a class. With a fixed schedule, a group of students, a teacher, and an exam at the end. Compared to any "anti-akrasia technique" ever proposed on LW or adjacent self-help blogs, joining a class works ridiculously well. You don't need constant willpower: just show up on time and you'll be carried along. You don't get lonely: other students are there and you can't help but interact. You don't wonder if you're doing it right: just ask the teacher.

Can't find a class? Find a club, a meetup, a group of people sharing your interest, any environment where social momentum will work in your favor. Even an online community for X that will reward your progress with upvotes is much better than going X completely alone. But any regular meeting you can attend in person, which doesn't depend on your enthusiasm to keep going, is exponentially more powerful.

Avoiding lone wolf self-improvement seems like embarrassingly obvious advice. But somehow I see people trying to learn X alone in their rooms all the time, swimming against the current for years, blaming themselves when their willpower isn't enough. My message to such people: give up. Your brain is right and what you're forcing it to do is wrong. Put down your X, open your laptop, find a class near you, send them a quick email, and spend the rest of the day surfing the web. It will be your most productive day in months.

[Link] Dissolving the Fermi Paradox (Applied Bayesianism)

11 shin_getter 03 July 2017 09:44AM

Self-conscious ideology

5 casebash 28 June 2017 05:32AM

Operating outside of ideology is extremely hard, if not impossible. Even groups that see themselves as non-ideological, still seem to end up operating within an ideology of some sort.

Take for example Less Wrong. It seems to operate within a few assumptions:

  1. That studying rationality will provide us with a greater understanding of the world.
  2. That studying rationality will improve you as a person.
  3. That science is one of our most important tools for understanding the world.

...

These assumptions are also subject to some criticisms. Here's one criticism for each of the previous points:

  1. But will it or are we dealing with problems that are simply beyond our ability to understand (see epistemic learned helplessness)? Do we really understand how minds work well enough to know whether a mind uploaded would still be "you"?
  2. But religious people are happier.
  3. Hume's critique of induction

I could continue discussing assumptions and possible criticisms, but that would be a distraction from the core point, which is that there are advantages to having a concrete ideology that is aware of its own limitations, as opposed to an implicit ideology that is beyond all criticism.

Self-conscious ideologies also have other advantages:

  • Quick and easy to write since you don't have to deal with all of the special cases.
  • Easy to share and explain. Imagine trying to explain to someone, "Rationality gives us a better understanding of the world, except when it does not". Okay, I'm exaggerating, epistemic humility typically isn't explained that badly, but it certainly complicates sharing.
  • Easier for people to adopt the ideology as a lens through which to examine the world, without needing to assume that it is literally true.
I wrote this post so that people can create self-conscious ideologies and have something to link to so as to avoid having to write up an explanation themselves. Go out into the world and create =P.

A Call for More Policy Analysis

1 madhatter 25 June 2017 02:24PM

I would like to see more concrete discussion and analysis of AI policy in the EA community, and on this forum in particular.

 

 AI policy would broadly encompass all relevant actors meaningfully influencing the future and impact of AI, which would likely be governments, research labs and institutes, and international organizations.

 

Some initial thoughts and questions I have on this topic:

 

1)     How do we ensure all research groups with a likely chance of developing AGI know and care about the relevant work in AI safety (which hopefully is satisfactorily worked out by then)?

 

Some possibilities: trying to make AI safety a common feature of computer science curricula, general community building and more AI safety conferences, more popular culture conveying  non-terminatoresque illustrations of the risk.

 

 

2)     What strategies might be available for laggards in a race scenario to retard progress of leading groups, or to steal their research?

Some possibilities in no particular order: espionage, malware, financial or political pressures, power outages, surveillance of researchers.

 

3)     Will there be clear warning signs?

 

Not just in general AI progress, but locally near the leading lab. Observable changes in stock price, electricity output, etc.

 

4)     Openness or secrecy?

Thankfully the Future of Life Institute is working on this one. As I understand it, the consensus is that openness is advisable now, but secrecy may be necessary later. So what mechanisms are available to keep research private?

 

5)     How many players will there be with a significant chance of developing AGI? Which players?

 

6)     Is an arms race scenario likely?

 

7)     What is the most likely speed of takeoff?

 

 

8)     When and where will AGI be developed?

 

Personally, I believe the use of forecasting tournaments to get a better sense of when and where AGI will arrive would be a very worthwhile use of our time and resources. After reading Superforecasting by Dan Gardner and Philip Tetlock I was struck by how effective these tournaments are at singling out those with low Brier scores and using them to get better-than-average predictions of future circumstances.
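For readers unfamiliar with the metric: a Brier score is the mean squared difference between probability forecasts and the binary outcomes that actually occurred, so lower is better. A minimal Python sketch follows; the function name and the sample forecasts are illustrative and not taken from any real tournament.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty lists")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical forecasters answering the same three yes/no questions.
outcomes = [1, 0, 1]
confident = [0.9, 0.1, 0.8]  # well calibrated and decisive
hedging = [0.5, 0.5, 0.5]    # always says "50/50"
print(brier_score(confident, outcomes))
print(brier_score(hedging, outcomes))
```

The always-50/50 forecaster scores 0.25 on any set of questions, which is why scores well below that are taken as evidence of real forecasting skill.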

 

 

Perhaps the EA community could fund a forecasting tournament on the Good Judgment Project posing questions attempting to ascertain when AGI will be developed (I am guessing superforecasters will make more accurate predictions than AI experts on this topic), which research groups are the most likely candidates to be the developers of the first AGI, and other relevant questions. We would need to formulate the questions such that they are specific enough for use in the tournament. 

[Link] The Use and Abuse of Witchdoctors for Life

5 lifelonglearner 24 June 2017 08:59PM

Humans are not agents: short vs long term

4 Stuart_Armstrong 09 June 2017 11:16AM

Crossposted at the Intelligent Agents Forum.

This is an example of humans not being (idealised) agents.

Imagine a human who has a preference to not live beyond a hundred years. However, they want to live to next year, and it's predictable that every year they are alive, they will have the same desire to survive till the next year.

This human (not a completely implausible example, I hope!) has a contradiction between their long and short term preferences. So which is accurate? It seems we could resolve these preferences in favour of the short term ("live forever") or the long term ("die after a century") preferences.

Now, at this point, maybe we could appeal to meta-preferences - what would the human themselves want, if they could choose? But often these meta-preferences are un- or under-formed, and can be influenced by how the question or debate is framed.

Specifically, suppose we are scheduling this human's agenda. We have the choice of making them meet one of two philosophers (not meeting anyone is not an option). If they meet Professor R. T. Long, he will advise them to follow long term preferences. If instead, they meet Paul Kurtz, he will advise them to pay attention to their short term preferences. Whichever one they meet, they will argue for a while and will then settle on the recommended preference resolution. And then they will not change that, whoever they meet subsequently.

Since we are doing the scheduling, we effectively control the human's meta-preferences on this issue. What should we do? And what principles should we use to do so?

It's clear that this can apply to AIs: if they are simultaneously aiding humans as well as learning their preferences, they will have multiple opportunities to do this sort of preference-shaping.

Mode Collapse and the Norm One Principle

15 tristanm 05 June 2017 09:30PM

[Epistemic status: I assign a 70% chance that this model proves to be useful, 30% chance it describes things we are already trying to do to a large degree, and won't cause us to update much.] 

I'm going to talk about something that's a little weird, because it uses some results from some very recent ML theory to make a metaphor about something seemingly entirely unrelated - norms surrounding discourse. 

I'm also going to reach some conclusions that surprised me when I finally obtained them, because it caused me to update on a few things that I had previously been fairly confident about. This argument basically concludes that we should adopt fairly strict speech norms, and that there could be great benefit to moderating our discourse well. 

I argue that in fact, discourse can be considered an optimization process and can be thought of in the same way that we think of optimizing a large function. As I will argue, thinking of it in this way will allow us to make a very specific set of norms that are easy to think about and easy to enforce. It is partly a proposal for how to solve the problem of dealing with speech that is considered hostile, low-quality, or otherwise harmful. But most importantly, it is a proposal for how to ensure that the discussion always moves in the right direction: Towards better solutions and more accurate models. 

It will also help us avoid something I'm referring to as "mode collapse" (where new ideas generated are non-diverse and are typically characterized by adding more and more details to ideas that have already been tested extensively). It's also highly related to the concepts discussed in the Death Spirals and the Cult Attractor portion of the Sequences. Ideally, we'd like to be able to make sure that we're exploring as much of the hypothesis space as possible, and there's good reason to believe we're probably not doing this very well.  

The challenge: Making sure we're searching for the global optimum in model-space sometimes requires reaching out blindly into the frontiers, the not well-explored regions, which runs the risk of ending up somewhere very low-quality or dangerous. There are also sometimes large gaps between very different regions of model-space where the quality of the model is very low in-between, but very high on each side of the gap. This requires traversing through potentially dangerous territory and being able to survive the whole way through.

(I'll be using terms like "models" and "hypotheses" quite often, and I hope this isn't confusing. I am using them very broadly, to refer to both theoretical understandings of phenomena and blueprints for practical implementations of ideas). 

We desire to have a set of principles which allows us to do this safely - to think about models of the world that are new and untested, solutions for solving problems that have never been done in a similar way - and they should ensure that, eventually, we can reach the global optimum. 

Before we derive that set of principles, I am going to introduce a topic of interest from the field of Machine Learning. This topic will serve as the main analogy for the rest of this piece, and serve as a model for how the dynamics of discourse should work in the ideal case. 

I. The Analogy: Generative Adversarial Networks

For those of you who are not familiar with the recent developments in deep-learning, Generative Adversarial Networks (GANs)[intro pdf here] are a new class of generative models that are ideal for producing high-quality samples from very high-dimensional, complex distributions. They have caused great buzz and hype in the deep-learning community due to how impressive some of the samples they produce are, and how efficient they are at generation.

Put simply, a generator model and a critic (sometimes called a discriminator) model perform a two-player game where the critic is trained to distinguish between samples produced by the generator and the "true" samples taken from the data distribution. In turn, the generator is trained to maximize the critic's loss function. Both models are usually parametrized by deep neural networks and can be trained by taking turns running a gradient descent step on each. The Nash equilibrium of this game is when the generator's distribution matches that of the data distribution perfectly. This is never really borne out in practice, but sometimes it gets so close that we don't mind. 
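For reference, the two-player game in the original GAN formulation (Goodfellow et al., 2014) is usually written as the minimax problem below, with $G$ the generator, $D$ the critic, $p_{\text{data}}$ the data distribution, and $p_z$ the noise distribution fed to the generator. The description above does not commit to a specific loss, so take this as one standard instance rather than the only one:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

The critic pushes $V$ up by telling real and generated samples apart, while the generator pushes it down by producing samples the critic scores as real.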

GANs have one principal failure mode, often called "mode collapse" (a term I'm going to appropriate to refer to a much broader concept), which is usually attributed to the instability of the system. It was often believed that, if a careful balance between the generator and critic could not be maintained, one would eventually overpower the other - leading the critic to provide either useless or overly harsh information to the generator. Useless information will cause the generator to update very slowly or not at all, and overly harsh information will lead the samples to "collapse" to a small region of the data space containing the easiest targets for the generator to hit.  

This problem was essentially solved earlier this year due to a series of papers that propose modifications to the loss functions that GANs use, and, most crucially, add another term to the critic's loss which stabilizes the gradient (with respect to the inputs) to have a norm close to one. It was recognized that we actually desire an extremely powerful critic so that the generator can make the best updates it possibly can, but the updates themselves can't go beyond what the generator is capable of handling. With these changes to the GAN formulation, it became possible to use crazy critic networks such as ultra-deep ResNets and train them as much as desired before updating the generator network.  
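The extra term described here resembles the gradient penalty used in WGAN-GP-style losses, which adds $\lambda(\lVert \nabla_x D(x) \rVert_2 - 1)^2$ to the critic's loss. Matching the unnamed "series of papers" to that formulation is an assumption. Below is a toy Python sketch using a linear critic $D(x) = w \cdot x$, whose gradient with respect to the input is simply $w$; the function name and the weight `lam` are illustrative.

```python
import math


def gradient_penalty(w, lam=10.0):
    """Penalty lam * (||grad_x D|| - 1)^2 for a linear critic D(x) = w . x.

    For a linear critic the input gradient equals w everywhere, so no
    automatic differentiation is needed in this toy case.
    """
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    return lam * (grad_norm - 1.0) ** 2


print(gradient_penalty([0.6, 0.8]))  # gradient norm is 1: essentially no penalty
print(gradient_penalty([3.0, 4.0]))  # gradient norm is 5: heavily penalized
```

In a real GAN the critic is a deep network and the gradient would come from an autodiff framework, but the shape of the penalty is the same: it is zero only when the critic's feedback has norm one.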

The principle behind their operation is rather simple to describe, but unfortunately, it is much more difficult to explain why they work so well. However, I believe that as long as we know how to make one, and know specific implementation details that improve their stability, their principles can be applied more broadly to achieve success in a wide variety of regimes. 

II. GANs as a Model of Discourse

In order to use GANs as a tool for conceptual understanding of discourse, I propose to model the dynamics of debate as a collection of hypothesis-generators and hypothesis-critics. This could be likened to the structure of academia - researchers publish papers, they go through peer-review, the work is iterated on and improved - and over time this process converges to more and more accurate models of reality (or so we hope). Most individuals within this process play both roles, but in theory this process would still work even if they didn't. For example, Isaac Newton was a superb hypothesis generator, but he also had some wacky ideas that most of us would consider to be obviously absurd. Nevertheless, calculus and Newtonian physics became a part of our accepted scientific knowledge, and alchemy didn't. The system adopted and iterated on his good ideas while throwing away the bad. 

Our community should be capable of something similar, while doing it more efficiently and not requiring the massive infrastructure of academia. 

A hypothesis-generator is not something that just randomly pulls out a model from model-space. It proposes things that are close modifications of things it already holds to be likely within its model (though I expect this point to be debatable). Humans are both hypothesis-generators and hypothesis-critics. And as I will argue, that distinction is not quite as sharply defined as one would think. 

I think there has always been an underlying assumption within the theory of intelligence that creativity and recognition / distinction are fundamentally different. In other words, one can easily understand Mozart to be a great composer, but it is much more difficult to be a Mozart. Naturally this belief made its way into the field of Artificial Intelligence too, and became somewhat of a dogma. Computers might be able to play Chess, they might be able to play Go, but they aren't doing anything fundamentally intelligent. They lack the creative spark, they work on pure brute-force calculation only, with maybe some heuristics and tricks that their human creators bestowed upon them.  

GANs seem to defy this principle. Trained on a dataset of photographs of human faces, a GAN generator learns to produce near-photo-realistic images that nonetheless do not fully match any of the faces the critic network saw (one of the reasons why CelebA was such a good choice to test these on), and are therefore in some sense producing things which are genuinely original. It may have once been thought that there was a fundamental distinction between creation and critique, but perhaps that's not really the case. GANs were a surprising discovery, because they showed that it was possible to make impressive "creations" by starting from random nonsense and slowly tweaking it in the direction of "good" until it eventually got there (well okay, that's basically true for the whole of optimization, but it was thought to be especially difficult for generative models).

What does this mean? Could someone become a "Mozart" by beginning a musical composition from random noise and slowly tweaking it until it became a masterpiece?

The above seems to imply "yes, perhaps." However, this is highly contingent on the quality of the "tweaking." It seems possible only as long as the directions to update in are very high quality. What if they aren't very high quality? What if they point nowhere, or in very bad directions?

I think the default distribution of discourse is that it is characterized by a large number of these directionless, low quality contributions. And that it's likely that this is one of the main factors behind mode collapse. This is related to what has been noted before: Too much intolerance for imperfect ideas (or ideas outside of established dogma) in a community prevents useful tasks from being accomplished, and progress from being made. Academia does not seem immune to this problem. Where low-quality or hostile discussion is tolerated is where this risk is greatest.  

Fortunately, making sure we get good "tweaks" seems to be the easy part. Critique is in high abundance. Our community is apparently very good at it. We also don't need to worry much about the ratio of hypothesis-generators to hypothesis-critics, as long as we can establish good principles that allow us to follow GANs as closely as possible. The nice feature of the GAN formulation is that you are allowed to make the critic as powerful as you want. In fact, the critic should be more powerful than the generator (If the generator is too powerful, it just goes directly to the argmax of the critic). 

(In addition, any collection of generators is a generator, and any collection of critics is a critic. So this formulation can be applied to the community setting).

III. The Norm One Principle

So the question then becomes, how do we take an algorithm governing a game between models much simpler than a human, and use the same tweaks which consist of nothing more than a few very simple equations? 

Here what I devise is a strategy for taking the concept of the norm of the critic gradient being as close to one as possible, and using that as a heuristic for how to structure appropriate discourse. 

(This is where my argument gets more speculative and I expect to update this a lot, and where I welcome the most criticism).

What I propose is that we begin modeling the concept of "criticism" based on how useful it is to the idea-generator receiving the criticism. Under this model, I think we should start breaking down criticism into two fundamental attributes:

  1. Directionality - does the criticism contain highly useful information, such that the "generator" knows how to update their model / hypothesis / proposal?
  2. Magnitude - Is the criticism too harsh, does it point to something completely unlike the original proposal, or otherwise require changes that aren't feasible for the generator to make?

My claim is that any contribution to a discussion should satisfy the "Norm One Principle." In other words, it should have a well-defined direction, and the quantity of change should be feasible to implement.

If a critique can satisfy our requirements for both directionality and magnitude, then it serves a useful purpose. The inverse claim to this is that if we can't follow these requirements, we risk falling into mode collapse, and the ideas commonly proposed are almost indistinguishable from the ones which preceded them, and ideas which deviate too far from the norm are harshly condemned and suppressed. 

I think it's natural to question whether or not restricting criticism to follow certain principles is a form of speech suppression that prevents useful ideas from being considered. But the pattern I'm proposing doesn't restrict the "generation" process, the creative aspect which produces new hypotheses. It doesn't restrict the topics that can be discussed. It only restricts the criticism of those hypotheses, such that they are maximally useful to the source of the hypothesis. 

One of the primary fears behind having too much criticism is that it discourages people from contributing because they want to avoid the negative feedback. But under the Norm One Principle, I think it is useful to distinguish between disagreement and criticism. I think if we're following these norms properly, we won't need to consider criticism to be a negative reward. In fact, criticism can be positive. Agreement could be considered "criticism in the same direction you are moving in." Disagreement would be the opposite. And these norms also eliminate the kind of feedback that tends to be the most discouraging. 

For example, some things which violate "Norm One":

  • Ad hominem attacks (typically directionless). 
  • Affective Death Spirals (unlimited praise or denunciation is usually directionless, and usually very high magnitude). 
  • Signs that cause aversion (things I "don't like", that trigger my System 1 alarms, which probably violates both directionality and magnitude). 
  • Lengthy lists of changes to make (norm greater than 1, ideally we want to try to focus on small sets of changes that have the highest priority). 
  • Repetition of points that have already been made (norm greater than one). 

One of my strongest hopes is that whoever is playing the part of the "generator" is able to compile the list of critiques easily and use them to update somewhere close to the optimal direction. This would be difficult if the sum of all critiques is either directionless (many critics point in opposite or near-opposite directions) or very high-magnitude (Critics simply say to get as far away from here as possible). 

But let's suppose that each individual criticism satisfies the Norm One principle. We will also assume that the generator is weighing each critique by their respect for whoever produced it, which I think is highly likely. Then the generator should be able to move in a direction unless the sum of the directions completely cancels out. It is unlikely for this to happen - unless there is very strong epistemic disagreement in the community over some fundamental assumptions (in which case the conversation should probably move over to that). 
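The aggregation step can be made concrete with a toy sketch. Here each critique is a direction in a hypothetical two-dimensional "model space", rescaled so its norm is at most one, and the generator sums the critiques weighted by its respect for each critic. The vector representation, the function names, and the numbers are all illustrative assumptions rather than part of the proposal.

```python
import math


def clip_to_norm_one(vec):
    """Rescale a critique so its magnitude never exceeds one."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= 1.0:
        return list(vec)
    return [v / norm for v in vec]


def combined_update(critiques, respect, tol=1e-9):
    """Respect-weighted sum of clipped critiques, or None if they cancel."""
    dims = len(critiques[0])
    total = [0.0] * dims
    for vec, weight in zip(critiques, respect):
        clipped = clip_to_norm_one(vec)
        for i in range(dims):
            total[i] += weight * clipped[i]
    if math.sqrt(sum(t * t for t in total)) < tol:
        return None  # the critiques cancel: no usable direction
    return total


# Two critics point in different but compatible directions; a third,
# less respected critic opposes the first.
print(combined_update([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], [1.0, 1.0, 0.5]))
# Equally respected critics pointing in opposite directions cancel out.
print(combined_update([[1.0, 0.0], [-1.0, 0.0]], [1.0, 1.0]))
```

In the first case the generator still gets a usable direction despite the disagreement; only in the second, perfectly balanced case is there nothing to update on.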

In addition, it also becomes less likely for the directions to cancel out as the number of inputs increases. Thus, it seems that proposals for new models should be presented to a wide audience, and we should avoid the temptation to keep our proposals hidden to all except for a small set of people we trust.

So I think that in general, this proposed structure should tend to increase the amount of collective trust we have in the community, and that it favors transparency and favors diversity of viewpoints. 

But what of the possible failure modes of this plan? 

This model should fail if the specific details of its implementation either remove too much discussion, or fail to deal with individuals who refuse to follow the norms and refuse to update. Any implementation should allow room for anyone to update. Someone who posts an extremely hostile, directionless comment should be allowed chances to modify their contribution. The only scenario in which the "banhammer" becomes appropriate is when this model fails to apply: The cardinal sin of rationality, the refusal to update. 

IV. Building the Ideal "Generator"

As a final point, I'll note that the above assumes that generators will be able to update their models incrementally. The easy part, as I mentioned, was obtaining the updates, the hard part is accumulating them. This seems difficult with the infrastructure we have in place. What we do have is a good system for posting proposals and receiving feedback (The blog post / comment thread set-up), but this assumes that each "generator" is keeping track of their models by themselves and has to be fully aware of the status of other models on their own. There is no centralized "mixture model" anywhere that contains the full set of models weighted by how much probability they are given by the community. Currently, we do not have a good solution for this problem. 

However, it seems that the first conception of Arbital was centered around finding a solution to this kind of problem:

Arbital has bigger ambitions than even that. We all dream of a world that eliminates the duplication of effort in online argument - a world where, the same way that Wikipedia centralized the recording of definite facts, an argument only needs to happen once, instead of being reduplicated all over the Internet; with all the branches of the argument neatly recorded in the same place, along with some indication of who believes what. A world where 'just check Arbital' had the same status for determining the current state of debates, as 'just check Wikipedia' now has when somebody starts arguing about the population of Melbourne. There's entirely new big subproblems and solutions, not present at all in the current Arbital, that we'd need to tackle that considerably more difficult problem. But to solve 'explaining things' is something of a first step. If you have a single URL that you can point anyone to for 'explaining Bayes', and if you can dispatch people to different pages depending on how much math they know, you're starting to solve some of the key subproblems in removing the redundancy in online arguments.

If my proposed model is accurate, then it suggests that the problem Arbital aims to solve is in fact quite crucial to solve, and that the developers of Arbital should consider working through each obstacle they face without pivoting from this original goal. I feel confident enough that this goal should be high priority that I'd be willing to support its development in whatever way is deemed most helpful and is feasible for me (I am not an investor, but I am a programmer and would also be capable of making small donations, or contributing material). 

The only thing this model would require of Arbital is that it be as open as possible to contributions, with heavy moderation or filtering of contributed content (but importantly not the other way around, where it is closed to a small group of trusted people).

Currently, the incremental changes that would have to be made to LessWrong and related sites like SSC would simply be increased moderation of comment quality. Otherwise, any further progress on the problem would require overcoming much more serious obstacles requiring significant re-design and architecture changes. 

Everything I've written above is also subject to the model I've just outlined, and therefore I expect to make incremental updates as feedback to this post accrues.

My initial prediction for feedback to this post is that the ideas might be considered helpful and offer a useful perspective or a good starting point, but that there are probably many details that I have missed that would be useful to discuss, or points that were not quite well-argued or well thought-out. I will look out for these things in the comments.   

Existential risk from AI without an intelligence explosion

12 AlexMennen 25 May 2017 04:44PM

[xpost from my blog]

In discussions of existential risk from AI, it is often assumed that the existential catastrophe would follow an intelligence explosion, in which an AI creates a more capable AI, which in turn creates a yet more capable AI, and so on, a feedback loop that eventually produces an AI whose cognitive power vastly surpasses that of humans, which would be able to obtain a decisive strategic advantage over humanity, allowing it to pursue its own goals without effective human interference. Victoria Krakovna points out that many arguments that AI could present an existential risk do not rely on an intelligence explosion. I want to look in slightly more detail at how that could happen. Kaj Sotala also discusses this.

An AI starts an intelligence explosion when its ability to create better AIs surpasses that of human AI researchers by a sufficient margin (provided the AI is motivated to do so). An AI attains a decisive strategic advantage when its ability to optimize the universe surpasses that of humanity by a sufficient margin. Which of these happens first depends on which skills AIs have the greatest advantage in relative to humans. If AIs are better at programming AIs than they are at taking over the world, then an intelligence explosion will happen first, and the resulting AI will be able to get a decisive strategic advantage soon after. But if AIs are better at taking over the world than they are at programming AIs, then an AI would get a decisive strategic advantage without an intelligence explosion occurring first.
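The comparison above can be made concrete with a deliberately crude toy model. This is only a sketch under invented assumptions, not a forecast: the function name `first_milestone`, the two thresholds, and the idea that both skills grow by the same constant factor per step are all hypothetical simplifications introduced here for illustration.

```python
def first_milestone(research_skill, takeover_skill,
                    research_threshold, takeover_threshold,
                    growth=1.1, max_steps=1000):
    """Toy model: report which threshold an AI's skills cross first.

    All quantities are hypothetical and unitless. Both skills are
    assumed to grow by the same factor `growth` at every step.
    """
    for _ in range(max_steps):
        # AI-programming skill exceeds human researchers by enough margin.
        can_explode = research_skill >= research_threshold
        # World-optimizing skill exceeds humanity's by enough margin.
        can_dominate = takeover_skill >= takeover_threshold
        if can_explode and can_dominate:
            return "both at once"
        if can_explode:
            return "intelligence explosion first"
        if can_dominate:
            return "decisive strategic advantage first"
        research_skill *= growth
        takeover_skill *= growth
    return "neither"


# Relatively better at programming AIs than at taking over the world:
print(first_milestone(0.5, 0.2, 2.0, 2.0))
# Relatively better at taking over the world than at programming AIs:
print(first_milestone(0.2, 0.5, 2.0, 2.0))
```

With equal growth rates, the outcome depends only on which skill starts proportionally closer to its threshold, which is the sense in which the ordering depends on the AI's relative advantages rather than on its overall capability.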

Since an intelligence explosion happening first is usually considered the default assumption, I'll just sketch a plausibility argument for the reverse. There's a lot of variation in how easy cognitive tasks are for AIs compared to humans. Since programming AIs is not yet a task that AIs can do well, it should not be a priori surprising if programming AIs turned out to be an extremely difficult task for AIs to accomplish, relative to humans. Taking over the world is also plausibly especially difficult for AIs, but I don't see strong reasons for confidence that it would be harder for AIs than starting an intelligence explosion would be. It's possible that an AI with significantly but not vastly superhuman abilities in some domains could identify some vulnerability that it could exploit to gain power, which humans would never think of. Or an AI could be sufficiently better than humans at forms of engineering other than AI programming (perhaps molecular manufacturing) that it could build physical machines that could out-compete humans, though this would require it to obtain the resources necessary to produce them.

Furthermore, an AI that is capable of producing a more capable AI may refrain from doing so if it is unable to solve the AI alignment problem for itself; that is, if it can create a more intelligent AI, but not one that shares its preferences. This seems unlikely if the AI has an explicit description of its preferences. But if the AI, like humans and most contemporary AI, lacks an explicit description of its preferences, then the difficulty of the AI alignment problem could be an obstacle to an intelligence explosion occurring.

It also seems worth thinking about the policy implications of the differences between existential catastrophes from AI that follow an intelligence explosion versus those that don't. For instance, AIs that attempt to attain a decisive strategic advantage without undergoing an intelligence explosion will exceed human cognitive capabilities by a smaller margin, and thus would likely attain strategic advantages that are less decisive, and would be more likely to fail. Thus containment strategies are probably more useful for addressing risks that don't involve an intelligence explosion, while attempts to contain a post-intelligence explosion AI are probably pretty much hopeless (although it may be worthwhile to find ways to interrupt an intelligence explosion while it is beginning). Risks not involving an intelligence explosion may be more predictable in advance, since they don't involve a rapid increase in the AI's abilities, and would thus be easier to deal with at the last minute, so it might make sense far in advance to focus disproportionately on risks that do involve an intelligence explosion.

It seems likely that AI alignment would be easier for AIs that do not undergo an intelligence explosion, for two reasons. First, it is more likely to be possible to monitor such an AI and intervene if something goes wrong. Second, lower optimization power means lower ability to exploit the difference between the goals the AI was given and the goals that were intended, if we are only able to specify our goals approximately. The first of those reasons applies to any AI that attempts to attain a decisive strategic advantage without first undergoing an intelligence explosion, whereas the second only applies to AIs that do not undergo an intelligence explosion ever. Because of these, it might make sense to attempt to decrease the chance that the first AI to attain a decisive strategic advantage undergoes an intelligence explosion beforehand, as well as the chance that it undergoes an intelligence explosion ever, though preventing the latter may be much more difficult. However, some strategies to achieve this may have undesirable side-effects; for instance, as mentioned earlier, AIs whose preferences are not explicitly described seem more likely to attain a decisive strategic advantage without first undergoing an intelligence explosion, but such AIs are probably more difficult to align with human values.

If AIs get a decisive strategic advantage over humans without an intelligence explosion, the advantage would likely be obtained much more slowly. It would then be much more likely for multiple, and possibly many, AIs to gain decisive strategic advantages over humans, though not necessarily over each other, resulting in a multipolar outcome. Thus considerations about multipolar versus singleton scenarios also apply to decisive strategic advantage-first versus intelligence explosion-first scenarios.
