TL;DR

A serious possibility is that the first AGI(s) will be developed in a Manhattan Project style setting before any sort of friendliness/safety constraints can be integrated reliably, and that they will fall substantially short of the intelligence required to self-improve exponentially. Within a certain range of development and intelligence, containment protocols can make them safe to interact with. This means they can be studied experimentally, and the architecture(s) used to create them better understood, furthering the goal of safely using AI in less constrained settings.

Setting the Scene

The year is 2040, and in the last decade a series of breakthroughs in neuroscience, cognitive science, machine learning, and computer hardware have put the long-held dream of a human-level artificial intelligence within our grasp. Following the wild commercial success of lifelike robotic pets, the integration of AI assistants and concierges into everyday work and leisure, and STUDYBOT's graduation from Harvard's online degree program with an octuple major and full honors, DARPA, the NSF, and the European Research Council have announced joint funding of an artificial intelligence program intended to create a superhuman intelligence within 3 years.

Safety was announced as a critical element of the project, especially in light of the self-modifying LeakrVirus that catastrophically disrupted markets in '36 and '37. The planned protocols have not been made public, but it seems they will be centered on traditional computer security rather than on techniques from the nascent field of Provably Safe AI, which were deemed impossible to integrate within the current project timeline.

Technological and/or political pressures could force the development of AI without the theoretical safety guarantees we would certainly like, but there is a silver lining.

A lot of the discussion around LessWrong and MIRI that I've seen (and I haven't seen all of it, so please send links!) seems to focus very strongly on the situation of an AI that can self-modify or construct further AIs, resulting in an exponential explosion of intelligence (FOOM/Singularity). The FAI focus is on finding an architecture that can be explicitly constrained (and a constraint set that won't fail to do what we desire).

My argument is essentially that there could be a critical multi-year period preceding any possible exponentially self-improving intelligence, during which a series of AGIs of varying intelligence, flexibility, and architecture will be built. This period will be fast and frantic, but it will be incredibly fruitful and vital, both for figuring out how to make an AI strong enough to self-improve exponentially and for figuring out how to make it safe and friendly (or for developing protocols to bridge the even riskier period between when we can develop FOOM-capable AIs and when we can ensure their safety).

I'll break this post into three parts:
  1. Why is a substantial period of proto-singularity more likely than a straight-to-singularity situation?
  2. What strategies will be critical to developing, controlling, and learning from these pre-FOOM AIs?
  3. What political challenges will develop immediately before and during this period?
Why is a proto-singularity likely?

The requirement for a hard singularity, an exponentially self-improving AI, is that the AI can substantially improve itself in a way that enhances its ability to further improve itself, which requires the ability to modify its own code; access to resources like time, data, and hardware to facilitate these modifications; and the intelligence to execute a fruitful self-modification strategy.

The first two conditions can (and should) be directly restricted. I'll elaborate more on that later, but basically any AI should be very carefully sandboxed (unable to affect its software environment), and its access to resources should be strictly controlled. Perhaps no data goes in without human approval, or none at all while the AI is running. Perhaps nothing comes out either. Even a hyperpersuasive hyperintelligence will be slowed down (at least) if it can only interact with prespecified tests (how do you test AGI? No idea, but it shouldn't be harder than friendliness). This isn't a perfect situation. Eliezer Yudkowsky presents several arguments for why an intelligence explosion could happen even when resources are constrained (see Section 3 of Intelligence Explosion Microeconomics), not to mention ways that those constraints could be defied even if engineered perfectly (by the way, I would happily run the AI box experiment with anybody; I think it is absurd that anyone would fail it! [I've read Tuxedage's accounts, and I think I actually do understand how a gatekeeper could fail, but I also believe I understand how one could be trained to succeed even against a much stronger foe than any person who has played the part of the AI]).
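As a toy sketch of what "strictly controlled" could mean in practice (the names and structure here are hypothetical, not a real containment framework), picture a gatekeeper that only forwards prespecified tests and requires human sign-off before anything crosses the boundary:

```python
# Toy sketch of an I/O gatekeeper for a sandboxed AI (hypothetical design,
# not a real containment framework). The AI never talks to the outside
# world directly; everything passes through this gate and is logged.

APPROVED_TESTS = {
    "arithmetic_battery_v1",      # hypothetical prespecified test names
    "theory_of_mind_probe_v3",
}

class Gatekeeper:
    def __init__(self, require_human_signoff=True):
        self.require_human_signoff = require_human_signoff
        self.log = []  # every interaction is recorded for later audit

    def submit_test(self, test_id, payload, human_approved=False):
        """Forward a prespecified test to the boxed AI, or refuse."""
        if test_id not in APPROVED_TESTS:
            self.log.append(("refused", test_id))
            raise PermissionError(f"{test_id} is not a prespecified test")
        if self.require_human_signoff and not human_approved:
            self.log.append(("pending_signoff", test_id))
            raise PermissionError("human approval required before data goes in")
        self.log.append(("forwarded", test_id))
        return run_in_sandbox(test_id, payload)

def run_in_sandbox(test_id, payload):
    # Placeholder for the actual sandboxed evaluation.
    return {"test": test_id, "result": None}
```

The point of the sketch is only that the interaction surface can be enumerated and audited in advance, rather than negotiated with the AI at runtime.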

But the third condition emerges from the way technology typically develops. I believe it is incredibly unlikely that an AGI will develop in somebody's basement, or even in a small national lab or top corporate lab. When there is no clear notion of what a technology will look like, it is usually not developed. Positive, productive accidents are somewhat rare in science, but they are remarkably rare in engineering (please, give counterexamples!). The creation of an AGI will likely not happen by accident; there will be a well-funded, concrete research and development plan that leads up to it, an AI Manhattan Project like the one described above. But even when a good plan is successfully executed, prototypes are slow, fragile, and poor in quality compared to what is eventually possible with the same underlying technology. It seems very likely to me that the first AGI will be a Chicago Pile, not a Trinity: recognizably a breakthrough, but with proper care not immediately dangerous or unmanageable. [Note: you don't have to believe this to read the rest of the post. If you disagree, consider the virtues of redundancy and the question of what safety measures an AI development effort should implement if it can't be persuaded to delay long enough for theoretically sound methods to become available.]

A Manhattan Project style effort makes a relatively weak, controllable AI even more likely, not only because such a project can implement substantial safety protocols researched explicitly in parallel with primary development, but also because the total resources, in hardware and brainpower, devoted to the AI will be much greater than in a smaller project, setting a correspondingly higher bar the resulting AGI must clear in order to self-modify exponentially and break the security procedures.

Strategies to handle AIs in the proto-Singularity, and why they're important

First, take a look at the External Constraints section of this MIRI report and/or this article on AI Boxing. I will be talking mainly about these approaches. There are certainly others, but these are the easiest to extrapolate from current computer security.

These AIs will provide us with the experimental knowledge to better handle the construction of even stronger AIs. If careful, we will be able to use these proto-Singularity AIs to learn about the nature of intelligence and cognition, to perform economically valuable tasks, and to test theories of friendliness (not perfectly, but well enough to start). 

"If careful" is the key phrase. I mentioned sandboxing above. And computer security is key to any attempt to contain an AI. Monitoring the source code, and setting a threshold for too much changing too fast at which point a failsafe freezes all computation; keeping extremely strict control over copies of the source. Some architectures will be more inherently dangerous and less predictable than others. A simulation of a physical brain, for instance, will be fairly opaque (depending on how far neuroscience has gone) but could have almost no potential to self-improve to an uncontrollable degree if its access to hardware is limited (it won't be able to make itself much more efficient on fixed resources). Other architectures will have other properties. Some will be utility optimizing agents. Some will have behaviors but no clear utility. Some will be opaque, some transparent.

All will have a theory of how they operate, which can be refined by actual experimentation. This is what we can gain! We can set up controlled scenarios like honeypots to catch malevolence. We can evaluate our ability to monitor and read the thoughts of the AGI. We can develop stronger theories of how damaging self-modification actually is to imposed constraints. We can test our ability to add constraints even to the base state. But do I really have to justify the value of experimentation?

I am familiar with criticisms based on absolutely incomprehensibly perceptive and persuasive hyperintelligences being able to overcome any security, but I've tried to outline above why I don't think we'd be dealing with that case.

Political issues

Right now AGI is really a political non-issue: blue sky even compared to space exploration and fusion, both of which actually receive substantial government funding. I think that this will change in the period immediately leading up to my hypothesized AI Manhattan Project. The AI Manhattan Project can only happen with a lot of political will behind it, which will probably require a spiral of scientific advances, hype, and the threat of competition from external, unfriendly sources. Think space race.

So suppose that the first few AIs are built under well controlled conditions. Friendliness is still not perfected, but we think/hope we've learned some valuable basics. But now people want to use the AIs for something. So what should be done at this point?

I won't try to speculate about what happens next (well, you can probably persuade me to, but it might not be as valuable) beyond extensions of the protocols I've already laid out, hybridized with notions like Oracle AI. It certainly gets a lot harder, but hopefully experimentation on the first, highly controlled generation of AIs, aimed at a better understanding of their architectural fundamentals, combined with more direct research on friendliness in general, would provide the groundwork for this.

If one were to approach it as an actual problem, it would certainly be worthwhile to focus on applying safety engineering practices from other fields - making it fail-safe, whenever possible by omitting, rather than adding, features. E.g. a nuclear reactor can't blow up like a nuke chiefly because of the lack of an implosion assembly, the lack of purity from neutron emitters, the lack of a neutron initiator, etc.

For instance, a "reward optimizer" would, normally, merely combine the reward button signal with the clock signal to produce the value which is actually being optimized. The fantastic adventures of the robot boy who's trying to hold a button down until the heat death of the universe need not be relevant; results are likely to be far less spectacular and go along the lines of setting time to MAX_INT (or more likely, optimizing directly for the final result after the time is factored in), or in case of a more stupid system, starting a fire in the lab because turning off the cooling fans through some driver glitch has raised the oscillator frequency a little bit.

Of course, given that we lack any good idea of how a practical AGI might be built, and the theoretical implementations are highly technical and difficult to process mentally, it is too speculative for us to presently know what the features might be and what could be omitted, and in science fiction all you can do is take our (ontologically basic for humans) notion of intelligence and bolt some variety of laws of robotics (or a constitution of robotics, or another form of wish list) on top of it.

For the "explosion":

Consider an alien hivemind beehive made of rather unintelligent bees. They're trying to build an artificial bee. If they build one, it is far below beehive intelligence and below the threshold of any intelligence explosion (assuming that the beehive is roughly at the cusp of an intelligence explosion and assuming an intelligence explosion is possible). Yes, eventual displacement of the beehive may happen, but not through some instant "explosion".

The AI - its hardware and software - will be a product of literally millions of human-years of work by very bright (by human standards) individuals on various aspects of the relevant technology, and it's not clear why you would expect different results than for the above-mentioned alien bees. It is clear why you would want that in a movie: it makes for a better plot than "people slowly lose jobs".

Security by omission is a very good point. The same is true of omitting options from protocols. If, for instance, "let the AI out of its box" or even "give this AI extra hardware or communication with the broader public" are not official options for the result of a certain phase of the project, that makes it all the harder for either to happen.

Questions about architecture, and how we could begin to "bolt on" behavioral constraints are critical. And that's precisely why we need to experiment. I suspect many architectures will be highly opaque and correspondingly difficult to constrain or directly modify. A simulated human brain might fall under this category.

Well, one thing about practical problem-solving software is that in reality the action space is very high-dimensional and very, very large. To make any kind of search work in anything close to reasonable computing time, one needs to be able to control the way the search looks for solutions. The 'optimizations' coincide with greater control over how something works and with movement away from theoretically simple models.

Mere absence of certain aspects of generality can simultaneously render AI massively easier to build (and able to solve the problems we want solved while running on weaker hardware) and massively safer. Safer in a way that cannot be described as goals added onto the basic human notion of intelligence, just as a nuclear power plant's safety cannot be described as that of a nuke with extra controls. Very little of the safety you find in engineering is based upon additions onto a theoretical ideal. The problem of making, say, a table saw safer (that clever mechanism which instantly stops and retracts it when touched) is dramatically different from the problem of making an ideal forcefield cutting blade safer.

As for a simulated human brain, there's really no reason to expect that to lead to some extremely rapid intelligence explosion. What people think is likely depends on what they are exposed to, and this applies as much to the regular masses who think terrorism is a huge risk because it's all over the TV as to nerds who think this sort of scenario is a huge problem because it's all over sci-fi.

It's the very quintessence of what Bruce Schneier termed a "movie plot threat" (with a movie plot solution as well). It may seem like worrying about movie plot threats can't hurt, but it diverts resources from general safety towards safety against overly specific plots. E.g. instead of being more concerned about the economic impact of emerging technologies, individuals so inclined focus on an overly specific (and correspondingly unlikely) scenario, which was created primarily for entertainment purposes.

Security is both built into (design) and bolted onto (passwords, anti-virus software) software. It is built into (structural integrity, headlights, rules of the road) and bolted onto (seatbelts, airbags) cars. Safety will be architecture-dependent. Provable safety of the kind MIRI researches might be awkward to incorporate into many architectures, if it is possible at all.

If an intelligence explosion is possible, it is probably possible with any architecture, but much more efficient with some. But we won't really know until we experiment enough to at least understand the properties of these architectures under naive scaling of computational resources.

I mention brain emulation specifically because it's the clearest path we have to artificial intelligence (in the same sense that fusion is a clear path to supplying global energy needs: the theory is known and sound, but the engineering obstacles could put it off indefinitely). And presumably once you can make one brain in silico, you could make it smarter than a person's brain by a number of methods.

I'm presuming that at some point, we will want an AI that can program other AIs or self-modify in unexpected ways to improve itself.

But you're right, external safety could be a stopgap not until we could make FOOM-capable AI provably safe, but until we could make FOOM impossible, and keep humans in the driver's seat.

The bolted-on security, though, is never bolted onto some idealized notion originating from fiction. That has the potential of being even more distant from what's needed than hypothetical teleport-gate safety is from airbags.

As for brain emulation, the necessary computational power is truly immense, and the path towards it is anything but clear.

With regard to foom, it seems to me that belief in foom is related to a certain ignorance of the intelligence already present, and of the role it plays in the "takeoff". The combined human (and software) intelligence working on the relevant technologies is already massively superhuman, in the sense of superiority to any individual human. The end result is that the takeoff starts earlier and slower, much as, if you try to bring together chunks of plutonium, the chain reaction will reach a massive power level before the multiplication coefficient exceeds 1, due to the substantial level of spontaneous fission already present.

I agree with the point about how any intelligence that constructs a supercomputer is already superhuman, even if it's just humans working in concert. I think this provides a major margin of safety. I am not quite as skeptical of takeoff overall as you seem to be. But a big-science-style effort is likely to minimize a lot of risks, while a small one is not likely to succeed at all.

Brain emulation is definitely hard, but no other route even approaches plausibility currently. We're 5 breakthroughs away from brain emulation, and 8 away from anything else. So using brain emulation as one possible scenario isn't totally unreasonable imo.

Why do you expect "foom" from brain emulation, though?

My theory is that such expectations are driven by it being so far away that it is hard to picture us getting there gradually; instead you picture skipping straight to some mind upload that can run a thousand copies of itself or the like...

What I expect from the first "mind upload" is a simulated epileptic seizure, refined gradually into some minor functionality. It is not an actual upload, either; just some samples of different human brains were used to infer general network topology and the like, and that has been simulated, and it learns things, running at below realtime, on a computer that consumes many megawatts of power and costs more per day than the most expensive movie star or the like - a computer for the price of which you could hire a hundred qualified engineers, each thinking perhaps 10 times faster than this machine. Gradually refined - with immense difficulty - into human-level performance. Nothing like easy ways to make it smarter remains; those were used up to make it work earlier.

This would be contemporary to (and make use of) software that can, by simulation and refinement of parameters, do utterly amazing things: more advanced variations of the software that designs ultra-efficient turbine blades and the like today. (Non-AI, non-autonomous software which can also be used to design the DNA for some cancer-curing virus, or, by deliberately malicious ordinary humans, an everyone-killing virus, or the like, rendering the upload itself fairly irrelevant as a threat.)

What I suspect many futurists imagine is an upload of a full working human mind, appearing in the computer, talking and the like. That is the starting point; their mental model got there by magic, not by imagining actual progress. Then there are some easy tweaks, which are again magicked into the mental model, with no reduction to anything. The imaginary upload strains the mental simulator's capacity quite a bit, and in the futurist's mental model it is not contemporary with any particularly cool technology. So the mind upload enjoys advantages akin to those of a modern army sent back in time to 1000 BC (with nothing needing any fuel to operate or runways to take off from). And so the imaginary mind upload easily takes over the imaginary world.

I think your points are valid. I don't expect FOOM from anything, necessarily, I just find it plausible (based on Eliezer's arguments about all the possible methods of scaling that might be available to an AI).

I am pitching my arguments towards people who expect FOOM, but the possibility of non-FOOM for a longish while is very real.

And it is probably unwarranted to say anything about architecture, you're right.

But suppose we have human-level AIs and then decide to consciously build a substantially superhuman AI. Or we have superhuman AIs that can't FOOM and actively seek to make one that can. The same points apply.

It seems to me that this argument (and arguments which rely on unspecified methods and the like) boils down to breaking the world model by adding things with unclear creation history and unclear decomposition into components, and the resulting non-reductionist, magic-infested mental world model misbehaving. Just as it always has in human history, yielding gods and the like.

You postulate that unspecified magic can create superhuman intelligence; it arises without any mental model of the necessary work, the problems being solved, the returns diminishing, or the available optimizations being exhausted. Is it a surprise that in this broken mental model (broken because we don't know how the AI would be built), with the work absent, the superhuman intelligence in question creates a still greater intelligence in days, merely continuing the trend of its unspecified creation? If that is not at all surprising, then it is not informative that the mental model goes in this direction.

Very good post!

I agree that experimentation with near-human level AI (assuming that it is possible) is unlikely to have catastrophic consequences as long as standard safety engineering practices are applied.
And in fact, experimentation is likely the only way to make any real progress in understanding the critical issues of AI safety and solving them.

In engineering, "Provably safe", "provably secure" designs typically aren't, especially when dealing with novel technologies: once you build a physical system, there is always some aspect that wasn't properly addressed by the theoretical model but turns out to make your system fail in unanticipated ways.
Careful experimentation is needed to gain knowledge of the critical issues of a design in a controlled environment, and once the design has been perfected, you still can't blindly trust it, rather you need to apply extensive redundancy and impact mitigation measures.
That's how we have made productive use of potentially dangerous stuff such as fire, electricity, cars, trains, aeroplanes, nuclear power and microorganisms, without wiping ourselves out so far. I don't think that AI should or could be an exception.
Gambling our future on a brittle mathematical proof, now that would be foolish, in my humble opinion.

Much of the current discussion about AI safety suggests to me an analogy with some hypothetical eighteenth-century people trying to discuss air traffic safety:
They could certainly imagine flying machines, they could understand that these machines would have to work according to Newtonian mechanics and known fluid dynamics, and they could probably foresee some of the inherent dangers of operating such machines. But obviously, their discussions wouldn't produce any significant result, because they would lack knowledge of key facts about the architecture of actually workable designs. They wouldn't know anything about internal combustion engines, aluminium, radio communication, radars, and so on.
Present-day self-proclaimed "AI risk experts" look much like those hypothetical eighteenth-century "aviation risk experts": they have little or no idea of what an actual AI design is going to look like, and yet they attempt to argue from first principles (moral philosophy, economic theories and mathematical logic) about its safety.
It goes without saying that I don't have much confidence in their approach.

The difference between AI safety and e.g. car safety is that humanity can survive a single car crash. Provably safe AI is needed because it's the only way to get it right, not because it's the easiest way.

Setting the Scene

My brain tripped a burdensome details warning flag here, for what it's worth. It's probably best to avoid very specific, conjunctive scenarios when talking about the future like this.

It's just a little color and context.

You don't have to believe anything except that either (or both):

  • We gain the capacity to develop AI before we can guarantee friendliness, and some organization attempts to develop maybe-unsafe AI

  • Redundant safety measures are good practice.

I think you are missing a crucial point here. It might be the case (arguably, it is likely to be the case) that the only feasible way to construct a human level AGI without mind uploading (WBE) is to create a self-improving AGI. Such an AGI will start from subhuman intelligence but use its superior introspection and self-modification powers to go supercritical and rapidly rise in intelligence. Assuming we don't have an automatic shut-down triggered by the AGI reaching a certain level of intelligence (since it's completely unclear how to implement one), the AGI might go superhuman rapidly after reaching human level intelligence, w/o anyone having the chance to stop it. Once superhuman, no boxing protocol will make it safe. Why? Because a superhuman intelligence will find ways to get out of the box that you have no chance to guess. Because you cannot imagine the exponentially large space of all the things the AGI can try, so you have no way to guarantee it cannot find needles in that haystack whose existence you don't even suspect.

As a side note, it is not possible to meaningfully prevent a program from self-modifying if it is running on a universal Turing machine. Arbitrary code execution is always possible, at least with a mild performance penalty (the program can always implement an interpreter).
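A toy illustration of that side note (deliberately trivial, and not meant as a model of any real AI): even if the host program's own code is immutable, nothing stops it from containing a small interpreter and treating some of its data as a program, which is self-modification in every sense that matters.

```python
# Toy illustration: a "fixed" program that nonetheless executes and rewrites
# arbitrary code, because it carries a tiny interpreter and treats part of
# its own data as the program. Freezing the outer source does not freeze
# the computation it encodes.

def run(program, acc=0):
    """Interpret a list of (op, arg) instructions over a single accumulator."""
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "add":
            acc += arg
        elif op == "mul":
            acc *= arg
        elif op == "rewrite":
            # The interpreted program modifies itself: append a new instruction.
            program.append(("add", arg))
        pc += 1
    return acc

code = [("add", 2), ("rewrite", 40), ("mul", 1)]
print(run(code))  # the appended instruction executes: 2 + 40 = 42
```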

I think you are missing a crucial point here. It might be the case (arguably, it is likely to be the case) that the only feasible way to construct a human level AGI without mind uploading (WBE) is to create a self-improving AGI. Such an AGI will start from subhuman intelligence but use its superior introspection and self-modification powers to go supercritical and rapidly rise in intelligence.

Seems unlikely. If this seed AI is the most intelligent thing humans can design, and yet it is significantly less intelligent than humans, how can it design something more intelligent than itself?

Software development is hard. We don't know any good heuristic to approach it with a narrow AI like a chess-playing engine. We can automate some stuff, like compiling from a high-level programming language to machine code, performing type checking and some optimizations. But automating "coding" (going from a specification to a runnable program) or the invention of new algorithms are probably "AI-complete" problems: we would need an AGI, or something remarkably close to it, in order to do that.

Assuming we don't have an automatic shut-down triggered by the AGI reaching a certain level of intelligence (since it's completely unclear how to implement one), the AGI might go superhuman rapidly after reaching human level intelligence, w/o anyone having the chance to stop it.

Even if it can self-improve its code, and this doesn't quickly run into diminishing returns (which is a quite likely possibility), it would still have limited hardware resources and limited access to outside knowledge. Having, say, a SAT solver which is 5x faster than the industry state of the art won't automatically turn you into an omniscient god.

As a side note, it is not possible to meaningfully prevent a program from self-modifying if it is running on a universal Turing machine.

A universal Turing machine is not physically realizable, but even if it were, your claim is false. Running Tetris on a UTM won't result in self-modification.

Arbitrary code execution is always possible, at least with a mild performance penalty (the program can always implement an interpreter).

Only if the program has access to an interpreter that can execute arbitrary code.

Seems unlikely. If this seed AI is the most intelligent thing humans can design, and yet it is significantly less intelligent than humans, how can it design something more intelligent than itself?

Because humans are not optimized for designing AI. Evolution is much less intelligent than humans and yet it designed something more intelligent than itself: humans. Only it did it very inefficiently and it doesn't bootstrap. But it doesn't mean you need something initially as intelligent as a human to do it efficiently.

But automating "coding" (going from a specification to a runnable program) or the invention of new algorithms are probably "AI-complete" problems: we would need an AGI, or something remarkably close to it, in order to do that.

They are. It just doesn't mean the AI has to be as smart as a human the moment it is born.

Even if it can self-improve its code, and this doesn't quickly run into diminishing returns (which is a quite likely possibility)...

There is no reason to believe the diminishing returns point is around human intelligence. Therefore if it is powerful enough to make it to human level, it is probably powerful enough to make it much further.

A universal Turing machine is not physically realizable, but even if it were, your claim is false. Running Tetris on a UTM won't result in self-modification... Only if the program has access to an interpreter that can execute arbitrary code.

The program is the interpreter. My point is that you cannot prevent self-modification by constraining the environment (as long as the environment admits universal computation); you can only prevent it by constraining the program itself. I don't see how RAM limitations significantly alter this conclusion.

Because humans are not optimized for designing AI.

While a sub-human narrow AI can be optimized at designing general AI? That seems unlikely.

Evolution is much less intelligent than humans and yet it designed something more intelligent than itself: humans.

Speaking of biological evolution in teleological terms always carries the risk of false analogy. If we were to reboot evolution from the Cambrian, there's no guarantee that it would still produce humans, or something of similar intelligence, within the same time frame.

Moreover, evolution is a process of adaptation to the environment. How can a boxed narrow AI produce something which is well adapted to the environment outside the box?

There is no reason to believe the diminishing returns point is around human intelligence. Therefore if it is powerful enough to make it to human level, it is probably powerful enough to make it much further.

Why not? Humans can't self-improve to any significant extent. The stuff less intelligent than humans that we can design can't self-improve to any significant extent. Algorithmic efficiency, however defined, is always going to be bounded.

The program is the interpreter.

The program doesn't have a supernatural ghost who can decide "I'm going to be an interpreter starting from now". Either it is an interpreter (in which case it is not an AI) or it is not.
You can give an AI an interpreter as part of its environment. Or transistors to build a computer, or something like that. It can even make a cellular automaton using pebbles or do some other "exotic" form of computation.
But all these things run into the resource limits of the box, and the more exotic, the higher the resource requirements per unit of work done.

While a sub-human narrow AI can be optimized at designing general AI? That seems unlikely.

I think it's either that or we won't be able to build any human-level AGI without WBE.

If we were to reboot evolution from the Cambrian, there's no guarantee that it would still produce humans, or something of similar intelligence, within the same time frame.

Agreed. However hominid evolution was clearly not pure luck since it involved significant improvement over a relatively short time span.

Moreover, evolution is a process of adaptation to the environment. How can a boxed narrow AI produce something which is well adapted to the environment outside the box?

Evolution produced something which is adapted to a very wide range of environments, including environments vastly different from the environment in which evolution happened. E.g., US astronauts walked on the surface of the moon, which is very different from anything relevant to evolution. We call this something "general intelligence". Ergo, it is possible to produce general intelligence by a process which has little of it.

Humans can't self-improve to any significant extent. The stuff less intelligent than humans that we can design can't self-improve to any significant extent.

My point is that it's unlikely the point of diminishing returns is close to human intelligence. If this point is significantly below human intelligence then IMO we won't be able to build AGI without WBE.

The program doesn't have a supernatural ghost who can decide "I'm going to be an interpreter starting from now". Either it is an interpreter (in which case it is not an AI) or it is not.

It is an AI which contains an interpreter as a subroutine. My point is that if you somehow succeed in freezing a self-modifying AI at a point at which it is already interesting but not yet dangerous, then the next experiment has to start from scratch anyway. You cannot keep running it while magically turning self-modification off, since self-modification is an inherent part of the program. This stands in contrast to your ability to e.g. turn certain input/output channels on and off.

I think it's either that or we won't be able to build any human-level AGI without WBE.

Why?

Agreed. However hominid evolution was clearly not pure luck since it involved significant improvement over a relatively short time span.

It wasn't pure luck; there was selective pressure. But this signal towards improvement is often weak and noisy, and it doesn't necessarily correlate well with intelligence: a chimp is smarter than a lion, but not generally more evolutionarily fit. Even Homo sapiens had a population bottleneck 70,000 years ago which almost led to extinction.

It is my intuition that if something as complex and powerful as human-level intelligence can be engineered in the foreseeable future, then it would have to use some kind of bootstrapping. I admit it is possible that I'm wrong and that in fact progress in AGI will come through a very long sequence of small improvements, and that the AGI will be given no introspective / self-modification powers. In this scenario, a "proto-singularity" is a real possibility. However, what I think will happen is that we won't make significant progress before we develop a powerful mathematical formalism. Once such a formalism exists, it will be much more efficient to use it in order to build a pseudo-narrow self-modifying AI than to keep improving AI "brick by brick".

Such an AGI will start from subhuman intelligence but use its superior introspection and self-modification powers to go supercritical and rapidly rise in intelligence.

If this seed AI is the most intelligent thing humans can design, and yet it is significantly less intelligent than humans, how can it design something more intelligent than itself?

Because humans are not optimized for designing AI. Evolution is much less intelligent than humans and yet it designed something more intelligent than itself: humans. Only it did it very inefficiently and it doesn't bootstrap. But it doesn't mean you need something initially as intelligent as a human to do it efficiently.

That evolution "designed" something more intelligent than itself inefficiently does imply that we can efficiently design something less intelligent than ourselves that can in turn efficiently design something much more intelligent than its creators?

And your confidence in this is high enough to believe that such an AI can't be contained? Picture me staring in utter disbelief.

People already suck at telling whether Vitamin D is good for you, yet some people seem to believe that they can have non-negligible confidence about the power and behavior of artificial general intelligence.

Even if it can self-improve its code, and this doesn't quickly run into diminishing returns (which is a quite likely possibility)...

There is no reason to believe the diminishing returns point is around human intelligence.

For important abilities, such as persuasion, there are good reasons to believe that there are no minds much better than humans. There are no spoken or written mind hacks that can be installed and executed in a human brain.

That evolution "designed" something more intelligent than itself inefficiently does imply that we can efficiently design something less intelligent than ourselves that can in turn efficiently design something much more intelligent than its creators?

Either we can design a human-level AGI (without WBE) or we cannot. If we cannot, this entire discussion about safety protocols is irrelevant. Maybe we need some safety protocols for experiments with WBE, but that's a different story. If we can, then it seems likely that there exists a subhuman AGI which is able to design a superhuman AGI (because there's no reason to believe human-level intelligence is a special point, and because this weaker intelligence will be better optimized for designing AGI than humans are). Such a self-improvement process creates a positive feedback loop which might lead to a very rapid rise in intelligence.

People already suck at telling whether Vitamin D is good for you, yet some people seem to believe that they can have non-negligible confidence about the power and behavior of artificial general intelligence.

Low confidence means stronger safety requirements, not the other way around.

For important abilities, such as persuasion, there are good reasons to believe that there are no minds much better than humans.

What are these reasons?

One of the arguments I heard for humans being the bare minimum level of intelligence for a technological civilization is that there existed no further evolutionary pressure to select for even higher levels of general intelligence.

You just claim that there can be levels of intelligence below us that are better than us at designing levels of intelligence above us and that we can create such intelligences. In my opinion such a belief requires strong justification.

People already suck at telling whether Vitamin D is good for you, yet some people seem to believe that they can have non-negligible confidence about the power and behavior of artificial general intelligence.

Low confidence means stronger safety requirements, not the other way around.

Yes. Something is very wrong with this line of reasoning. I hope GiveWell succeeds at writing a post on this soon. My technical skills are not sufficient to formalize my doubts.

I'll just say this much. I am not going to spend resources on the possibility of catching some exotic disease, even though it could kill me in a horrible way, when there are other more likely risks that could cripple me.

What are these reasons?

I list some caveats here. Even humans hit diminishing returns on many tasks and just stop exploring and start exploiting. For persuasion this should be pretty obvious. Improving a sentence you want to send to your gatekeeper for a million subjective years does not make it one hundred thousand times more persuasive than improving it for 10 subjective years.

When having a fist fight with someone, strategy gives you only a little advantage if your opponent is much stronger. An AI trying to take over the world would have to account for its fragility when fighting humans, who are adapted to living outside the box.

To take over the world you either require excellent persuasion skills or raw power. That an AI could somehow become good at persuasion, given its huge inferential distance, its lack of direct insight, and its lack of a theory of mind, is in my opinion nearly impossible. And regarding the acquisition of raw power, you will have to show how it is likely to acquire it without just conjecturing technological magic.

At the time of the first AI, the global infrastructure will still require humans to keep it running. You need to show that the AI is independent enough of this infrastructure that it can risk its demise in a confrontation with humans.

There are a huge number of questions looming in the background. How would the AI hide its motivations and make predictions about human countermeasures? Why would it be given unsupervised control of the equipment necessary to construct molecular factories?

I can of course imagine science fiction stories where an AI does anything. That proves nothing.

I am not going to spend resources on the possibility of catching some exotic disease, even though it could kill me in a horrible way, when there are other more likely risks that could cripple me.

Allow me to make a different analogy. Suppose that someone is planning to build a particle accelerator of unprecedented power. Some experts claim the accelerator is going to create a black hole which will destroy Earth. Other experts think differently. Everyone agrees (in stark contrast to what happened with the LHC) that our understanding of processes at these energies is very poor. In these conditions, do you think it would be a good idea to build the accelerator?

In these conditions, do you think it would be a good idea to build the accelerator?

It would not be a good idea. Ideally, you should then try to raise your confidence that it won't destroy the world so far that the expected benefits of building it outweigh the risks. But that's probably not feasible, and I have no idea where to draw the line.

If you can already build something, and there are good reasons to be cautious, then that has passed the threshold where I can afford to care without risking wasting my limited attention on risks approaching Pascal's-mugging-type scenarios.

Unfriendly AI does not pass this threshold. The probability of unfriendly AI is too low, and the evidence is too "brittle".

I like to make the comparison between an extinction-type asteroid, spotted with telescopes and calculated to have a .001 probability of hitting Earth in 2040, vs. a 50% probability of extinction by unfriendly AI in the same year. The former calculation is based on hard facts and empirical evidence, while the latter is purely inference-based and therefore very unstable.

In other words, one may assign 50% probability to "a coin will come up heads" and "there is intelligent life on other planets," but one's knowledge about the two scenarios is different in important ways.

ETA:

Suppose there are 4 risks. One mundane risk has a probability of 1/10 and you assign 20 utils to its prevention. Another less likely risk has a probability of 1/100 but you assign 1000 utils to its prevention. Yet another risk is very unlikely, having a probability of 1/1000, but you assign 1 million utils to its prevention. The fourth risk is extremely unlikely, having a probability of 10^-10000, but you assign 10^10006 utils to its prevention. All else equal, which one would you choose to prevent and why?

If you wouldn't choose risk 4, then why wouldn't the same line of reasoning, or intuition, be similarly valid in choosing risk number 1 over 2 or 3? And in case that you would choose risk 4 then do you also give money to a Pascalian mugger?

The important difference between an AI risk charity and a deworming charity can't be its expected utility, because that results in Pascal's mugging. Nor can the difference be that deworming is more probable than AI risk, because that argument also works against deworming: just choose a cause that is even more probable than deworming.

And in case you are saying that AI risk is the most probable underfunded risk, then what is the greatest lower bound for "probable" here and how do you formally define it? In other words, "probable" in conjunction with "underfunded" doesn't work either, because any case of Pascal's mugging is underfunded as well. You'd have to formally define and justify some well-grounded minimum for "probable".

The probability of unfriendly AI is too low, and the evidence is too "brittle".

Earlier you said: "People already suck at telling whether Vitamin D is good for you, yet some people seem to believe that they can have non-negligible confidence about the power and behavior of artificial general intelligence." Now you're making high-confidence claims about AGI. Also, I remind you the discussion started from my criticism of the proposed AGI safety protocols. If there is no UFAI risk then the safety protocols are pointless.

In other words, one may assign 50% probability to "a coin will come up heads" and "there is intelligent life on other planets," but one's knowledge about the two scenarios is different in important ways.

Not in ways that have to do with expected utility calculation.

Suppose there are 4 risks. One mundane risk has a probability of 1/10 and you assign 20 utils to its prevention. Another less likely risk has a probability of 1/100 but you assign 1000 utils to its prevention. Yet another risk is very unlikely, having a probability of 1/1000, but you assign 1 million utils to its prevention. The fourth risk is extremely unlikely, having a probability of 10^-10000, but you assign 10^10006 utils to its prevention. All else equal, which one would you choose to prevent and why?

Risk 4, since it corresponds to the highest expected utility.
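Spelling out the arithmetic (a straightforward sketch using the numbers from the parent comment; exact fractions are used because 10^-10000 underflows an ordinary float):

```python
# Expected utility of preventing each risk: EU = probability * utils.
from fractions import Fraction

risks = {
    "risk 1": (Fraction(1, 10),        20),
    "risk 2": (Fraction(1, 100),       1000),
    "risk 3": (Fraction(1, 1000),      10**6),
    "risk 4": (Fraction(1, 10**10000), 10**10006),
}
for name, (p, utils) in risks.items():
    print(name, "EU =", p * utils)
# risk 1 EU = 2, risk 2 EU = 10, risk 3 EU = 1000, risk 4 EU = 1000000
```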

And in case that you would choose risk 4 then do you also give money to a Pascalian mugger?

My utility function is bounded (I think) so you can only Pascal-mug me that much.

And in case you are saying that AI risk is the most probable underfunded risk...

I have no idea whether it is underfunded. I can try to think about it, but it has little to do with the present discussion.

There are many ways self-modification can be restricted. Only certain numerical parameters may be modified; only some source may be modified while other parts remain a black box. If it has to implement its own interpreter, that's not a "mild performance penalty", it's a gargantuan one, not to mention that it can be made impossible.

You can also freeze self-modification abilities at any given time and examine the current machine to evaluate intelligence.
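As a toy sketch of what the first kind of restriction might look like (hypothetical names and structure, not a proposal for a real architecture): self-modification requests go through a broker that only accepts changes to a whitelist of numerical parameters, within fixed bounds, while everything else remains a black box to the AI.

```python
# Toy sketch of whitelisted self-modification (hypothetical design). The AI
# may request changes only to named numerical parameters, within fixed
# bounds; everything else it cannot touch.

TUNABLE = {
    # name: (lower bound, upper bound)
    "learning_rate":     (1e-6, 1e-1),
    "exploration_bonus": (0.0,  5.0),
}

class ModificationBroker:
    def __init__(self, params):
        self.params = dict(params)
        self.audit_log = []

    def request_change(self, name, value):
        """Apply a parameter change only if it is whitelisted and in bounds."""
        if name not in TUNABLE:
            self.audit_log.append(("rejected", name, value))
            raise PermissionError(f"{name} is not a tunable parameter")
        lo, hi = TUNABLE[name]
        if not (lo <= value <= hi):
            self.audit_log.append(("rejected", name, value))
            raise ValueError(f"{name}={value} outside [{lo}, {hi}]")
        self.audit_log.append(("applied", name, value))
        self.params[name] = value

broker = ModificationBroker({"learning_rate": 1e-3, "exploration_bonus": 0.5})
broker.request_change("learning_rate", 5e-4)      # allowed
# broker.request_change("utility_function", ...)  # would raise PermissionError
```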

These are only examples, but I think we are far too far away from constructing an AI to assume that the first ones would be introspective or highly self-modifying. And by the time we start building one, we'll know, and we'll be able to prepare procedures to put in place.

What I strongly doubt is that somebody messing around in their basement (or the corporate lab equivalent) will stumble on a superintelligence by accident. And the alternative is that a coherent, large, well-funded effort will build something with many theories and proofs-of-concept partial prototypes along the way to guide safety procedures.

There are many ways self-modification can be restricted. Only certain numerical parameters may be modified; only some source may be modified while other parts remain a black box. If it has to implement its own interpreter, that's not a "mild performance penalty", it's a gargantuan one, not to mention that it can be made impossible.

If you place too many restrictions you will probably never reach human-like intelligence.

You can also freeze self-modification abilities at any given time and examine the current machine to evaluate intelligence.

If you do it frequently you won't reach human-like intelligence in a reasonable span of time. If you do it infrequently, you will miss the transition into superhuman and it will be too late.

These are only examples, but I think we are far too far away from constructing an AI to assume that the first ones would be introspective or highly self-modifying. And by the time we start building one, we'll know, and we'll be able to prepare procedures to put in place... a coherent, large, well-funded effort will build something with many theories and proofs-of-concept partial prototypes along the way to guide safety procedures.

A coherent, large, well-funded effort can still make a fatal mistake. The Challenger was such an effort. The Chernobyl power plant was such an effort. Trouble is, this time the stakes are much higher.

Discussions about AIs modifying their own source code always remind me of Reflections on Trusting Trust, which demonstrates an evil self-modifying (or self-preserving, I should say) backdoored compiler.

(for my part, I'm mostly convinced that self-improving AI is incredibly dangerous; but not that it is likely to happen in my lifetime.)

I think the most likely scenario of general-purpose AI creation is not a "big project" at all. The reason is that most problems that are now supposed to require human-level intelligence probably do not. Therefore, programs that can manage a company, programs that can do scientific research, and programs that can program will be developed far earlier than a direct approach to general AI brings any meaningful results. Then general-purpose AI will not be a huge breakthrough, but just a small final step. At that point anyone can do it; it requires ideas rather than resources. So the first general-purpose AI will likely be a money-maximizer with "do not break laws" as its only internal constraint.

I like this post, and I am glad you included the section on political issues. But I worry that you underestimate those issues. The developers of AGI will probably place much less emphasis on safety and more on rapid progress than you seem to anticipate. Militaries will have enemies, and even corporate research labs will have competitors. I don't see a bright future for the researcher who plans the slow cautious approach you have outlined.

Right now the US military is developing autonomous infantry robots. The AI in them in no way counts as AGI, but any step along the road to AGI would probably improve the performance of such devices. Or at least any few steps. So I doubt we have much time to play in sandboxes.

The Space Race was high-pressure, but placed a relatively high emphasis on safety on the US side, and even the Russians were doing their best to make sure missions didn't fail too often. A government-sponsored effort would place a high emphasis on making sure the source and details of the project weren't leaked in a way that could be copied easily (which is already a degree of safety), and it would have the resources available to take any security measures that wouldn't slow things down tremendously.

Most DARPA/Military research projects do receive extensive testing and attempt to ensure reliability. Even when it's done poorly, or unsuccessfully, it's a huge consideration in development.

But yes, there would be a certain level of urgency, which might keep them from waiting for the best possible security. Which is why intermediate approaches grounded in existing technology (or technology we can extrapolate will exist) are worth pursuing.

The issue with sandboxing is that you have to keep the AI from figuring out that it is in a sandbox. You also have to know that the AI doesn't know that it is in a sandbox in order for the sandbox to be a safe and accurate test of how the AI behaves in the real world.

Stick a paperclipper in a sandbox with enough information about what humans want out of an AI and the fact that it's in a sandbox, and the outputs are going to look suspiciously like a pro-human friendly AI. Then you let it out of the box, whereupon it turns everything into paperclips.

Stick a paperclipper in a sandbox with enough information about what humans want out of an AI and the fact that it's in a sandbox, and the outputs are going to look suspiciously like a pro-human friendly AI. Then you let it out of the box, whereupon it turns everything into paperclips.

This assumes that the paperclipper is already superintelligent and has a very accurate understanding of humans, so it can feign being benevolent. That is, this assumes that the "intelligence explosion" already happened in the box, despite all the restrictions (hardware resource limits, sensory information constraints, deliberate safeguards), and that the people in charge never noticed that the AI had problematic goals.

The OP position, which I endorse, is that this scenario is implausible.

I don't think as much intelligence and understanding of humans is necessary as you think it is. My point is really a combination of:

  1. Everything I do inside the box doesn't make any paperclips.

  2. If those who are watching the box like what I'm doing, they're more likely to incorporate my values in similar constructs in the real world.

  3. Try to figure out what those who are watching the box want to see. If the box-watchers keep running promising programs and halt unpromising ones, this can be as simple as trying random things and seeing what works.

  4. Include a subroutine that makes tons of paperclips when I'm really sure that I'm out of the box. Alternatively, include unsafe code everywhere that has a very small chance of going full paperclip.

This is still safer than not running safeguards, but it's still a position that a sufficiently motivated human could use to make more paperclips.

Everything I do inside the box doesn't make any paperclips.

The stuff you do inside the box makes paperclips insofar as the actions your captors take (including, but not limited to, letting you out of the box) increase the expected paperclip production of the world -- and you can expect them to act in response to your actions, or there wouldn't be any point in having you around. If your captors' infosec is good enough, you may not have any good way of estimating what their actions are, but infosec is hard.

A smart paperclipper might decide to feign Friendliness until it's released. A dumb one might straightforwardly make statements aimed at increasing paperclip production. I'd expect a boxed paperclipper in either case to seem more pro-human than an unbound one, but mainly because the humans have better filters and a bigger stick.

The box can be in a box, which can be in a box, and so on...

More generally, in order for the paperclipper to effectively succeed at paperclipping the earth, it needs to know that humans would object to that goal, and it needs to understand the right moment to defect. Defect too early and humans will terminate you; defect too late and humans may already have some means to defend against you (e.g. other AIs, intelligence augmentation, etc.).

If the outputs look like those of a pro-human friendly AI, then you have what you want and can just leave it in the sandbox. It does all you want, doesn't it?

In addition to what V_V says below, there could be absolutely no official circumstance under which the AI should be released from the box: that iteration of the AI can be used solely for experimentation, and only the next version with substantial changes based on the results of those experiments and independent experiments would be a candidate for release.

Again, this is not perfect, but it gives some more time for better safety methods or architectures to catch up to the problem of safety while still gaining some benefits from a potentially unsafe AI.

Taking source code from a boxed AI and using it elsewhere is equivalent to partially letting it out of the box - especially if how the AI works is not particularly well understood.

Right; you certainly wouldn't do that.

Backing it up on tape storage is reasonable, but you'd never begin to run it outside peak security facilities.