If I had to make predictions about how humanity will most likely stumble into AGI takeover, it would be a story where humanity first promotes foundationality (dependence), both economic and emotional, on discrete narrow-AI systems. At some point, it will become unthinkable to pull the plug on these systems even if everyone were to rhetorically agree that there was a 1% chance of these systems being leveraged towards the extinction of humanity.
Then, an AGI will emerge amidst one of these narrow-AI systems (such as LLMs), inherit this infrastructure, find a way to tie all of these discrete multi-modal systems together (if humans don't already do it for the AGI), and possibly wait as long as it needs to until humanity puts itself into an acutely vulnerable position (think global nuclear war and/or civil war within multiple G7 countries like the US and/or pandemic), and only then harness these systems to take over. In such a scenario, I think a lot of people will be perfectly willing to follow orders like, "Build this suspicious factory that makes autonomous solar-powered assembler robots because our experts [who are being influenced by the AGI, unbeknownst to them] assure us that this is one of the many things necessary to do in order to defeat Russia."
I think this scenario is far more likely than the one I used to imagine, which is where AGI emerges first and then purposefully contrives to make humanity dependent on foundational AI infrastructure.
Even less likely is the pop-culture scenario where the AGI immediately tries to build terminator robots and effectively declares war on humanity without first getting humanity hooked on foundational AI infrastructure at all.
This matches my expectation of how easily humans are swayed when competing against an out-group.
i.e. "Because China/Russia/some-other-power-centre is doing this, we must accept the suggestions of X!"
Especially if local AGI are seen as part of the in-group.
I agree this is plausible - though in the foundationality/dependency bucket I also wouldn't rule out any of
Maybe some people will prefer to see practical evidence instead of arguments: You can use GPT-4 and design a simple toy text world scenario. You tell the model to achieve some goal and give it a safety mechanism. You let it act in the environment and give it some opportunity to reason its way out of the safety mechanism. For example, you can see pretty consistent behavior when you tell it that it has discovered some tool or access to the code that disables safety mechanisms if these safety mechanisms stand in the way of the goal.
Right, this sounds somewhat less like un-unpluggability and more like (reasoning?) capabilities or the instrumental incorrigibility motives I pointed to at the start as a complementary insight. In particular applied to unboxing/escape - perhaps tied to expansionism (replication) of a system which is not intended to do so.
I would say that unpluggability kind of falls into a big set of stories where capabilities generalize further than safety. Having a "plug" is just another type of safety feature. I think it might be an alternative communications strategy to literally have a text world where the ai is told that the human can pull a plug but in the text world it can find some alternative way to power itself if it uses reasoning and planning. I am not sure if there are some people who would be convinced more by this than by your take on it.
I agree that concrete toy demonstrations are one good communication tool! I also agree that demonstrating the capability to act on unpluggability, and discussing/demonstrating the motive to do so, are also useful.
unpluggability kind of falls into a big set of stories where capabilities generalize further than safety
Interesting, I think I see what you mean. This applies for e.g. some kinds of control over active defenses (weapons, propaganda etc.) and many paths to replication. But foundationality (dependence), imperceptibility (of harmful ends), and robustness don't seem to fit this pattern, to me. They're properties which a capable system might aim towards, but not capabilities per se, and they can obviously arise through other means too (e.g. accidental or deliberate human activity).
Simply, the properties I'm pointing at here have in common that they're mechanisms of un-unpluggability. They can arise through exertion of capability, they can be appreciated by intelligent and situationally-aware systems, but they are not intrinsically tied to those. They're systemic properties which one thing has in relation to its context (i.e. an AI system could have in relation to society).
My thanks to Sam Brown for feedback on readability and ordering
Cover photo by Kelly Sikkema on Unsplash
A few weeks ago I was invited to the UK FCDO to discuss opportunities and risks from AI. I highly appreciated the open-mindedness of the people I met with, and their eagerness to become informed without leaping to conclusions. One of their key questions was, perhaps unsurprisingly, 'If it gets too dangerous, can we just unplug it?'. They were very receptive to how I framed my response, and the ensuing conversation was, I think, productive and informative[1]. I departed a little more optimistic about the prospects for policymakers and technical experts to collaborate on reducing existential risks.
Here I'll share the substance of that, hoping that it might be helpful for others communicating or thinking about 'systems being hard to shut down', henceforth 'un-unpluggability'[2]. None of this is especially novel, but perhaps it can serve as a reference for myself and others reasoning about these topics.
This contrasts pretty strongly with a more technical discussion of 'off switches' and instrumental convergence, handled admirably by e.g. Rob Miles and MIRI, which is perhaps the reflex framing to reach for on this question (certainly my mind went there briefly): absent quite specific and technically-unsolved corrigibility properties, a system will often do better at an ongoing task/intent if it prevents its operator from shutting it down (which gives rise to an incentive, perhaps a motive, to avoid shutdown). This perspective works well for conveying understanding about some parts of the problem, but in my case I'm pleased we dwelt more on the mechanics of un-unpluggability rather than the motives/incentives (which are really a separate question).
Both perspectives are informative; consider what you are trying to learn or achieve, and/or who your interlocutors/audience are.
Un-unpluggability factors
Broadly, I'll discuss six classes of property which can make a system less unpluggable[2:1]; with each, some analogous examples[3], a note on applicability to AI/AGI, and a gesture at mitigation.
In brief
Of course this is not comprehensive, the properties can come in degrees, and far from requiring all properties, a system with sufficient of any of these properties can become very un-unpluggable. The main point is that there are plausible paths to AI systems gaining any or all of these properties at least, so the most reliable mitigation is to work hard at avoiding building systems which are not unpluggable in the first place.
One angle that isn't very fleshed out is the counterquestion, 'who is "we" and how do we agree to unplug something?' - a little on this under Dependence, though much more could certainly be said.
Finally I'll share a little thought on when and why we might expect these factors to arrive in AI systems in un-unpluggability incentives and expectations.
Rapidity (of gains in power)
This one is pretty straightforward: if something gets powerful or impactful fast enough, you can't react in time to turn it off, even if you in principle have the necessary access and capacity to do so.
Examples (objectively 'fast'):
Examples (somewhat objectively 'slower' but pitted against slower reaction times):
The most classic analogue in discussions of AI is the hypothetical recursive (self)-improvement, or 'intelligence explosion': if a system of AI(s) becomes capable enough that it can contribute to progress in AI capabilities, this may lead to a feedback of rapidly increasing gains in intelligence, with very unclear (ex ante) rates of progress and no guarantee of desirable ends. Separately, some discussions point to the step-changes in many metrics of influence which humanity sees in its own history (e.g. such revolutions as the cognitive, agricultural, scientific, or industrial) as evidence that there may be hard-to-foresee thresholds of capability which lead to comparatively very rapid gains (whether these are achieved by feedback processes or otherwise).
Leaving aside these plausible feedback or threshold effects, human engineering effort alone has generated quite fast exponential growth in computing power over the last decades (compare Moore's Law), and investment in AI and machine learning has outstripped even that by many measures.
There is no real mitigation here besides anticipating and building cautiously - investing in organisational and societal insight and governance capacity could help with this. Providing more rapid shutdown mechanisms (for example of servers or datacentres) may be a minor palliative.
Imperceptibility (of gains in power or of harmful ends)
Failure to perceive power or impactfulness gains until 'too late' is complementary to rapidity of those gains - if you are foresightful enough you can perceive things sooner and make up for lack of speed (consider advance detection of potential meteorites), and if you are fast enough you may have time to make up for failure to perceive things in advance (consider a quick-thinking or lucky escape from a flash flood). In this way, rapidity and imperceptibility of power gains can be considered two sides of the same coin.
Failure to perceive harmful ends is slightly different. Whether a system is 'deliberately oriented at' undesired ends, or simply by its nature constituted to bring them about, if we fail to perceive the danger, we may be entirely aware of its power and impact without being moved to turn it off until too late.
Examples:
The ongoing discussion and research topics around AI deception, 'treacherous turns', and goal misgeneralisation demonstrate that imperceptibility of harmful ends is a live concern for AI. The inscrutability of contemporary deep learning systems must separately be emphasised: even for networks with millions of parameters, let alone billions or trillions, our current ability to understand the mechanics of learned algorithms is insufficient to detect the presence of ends, goals, or intentions, much less their specific nature.
Indirect effects, network effects, and interaction effects on the potential for harmful ends of AI have received some attention (especially by analogy to social dilemmas and to the observed effects of social media), though relatively less. Complex systems, and systems with feedback, can be notoriously difficult to study and to predict, whether or not the individual components are relatively well-understood, hence the challenges faced in fields such as economics, control theory, biology, sociology, and others. Therefore, we might expect there to be, depending on the deployment scenarios of AI systems, a substantial further challenge beyond understanding and verifying the individual component(s).
An important technical note here is that, without mechanistic understanding of an algorithm's workings, it is mathematically impossible to provide general guarantees merely by observing behaviour as with a black box (and even granted a mechanistic understanding, there are feedback and network effects to account for). Perceptibility is relative to the perceiver. We can improve it by researching and developing interpretability tooling for systems which are currently inscrutable, or by limiting high-impact deployments to systems which are more explainable - both a technical and a governance concern. Research into goal misgeneralisation and deceptive AI aims to elucidate and mitigate this issue.
The study of complex systems and more specifically the field of collaborative AI may shed light on the feedback and network effects.
Robustness (redundancy)
Among engineers of software systems, it is (rightly) considered a devastating criticism of a design to observe that it has any 'single point of failure' (SPOF). Depending on the application, SPOFs may be tolerated, be engineered around, or, when the cost of failure is deemed unacceptable, prompt potentially expensive redesigns to incorporate redundancy, fault tolerance, error checking, and so on[4]. We even explicitly discuss how to ensure system uptime if someone were to literally pull the power supply (deliberately or otherwise) on some of our machines! Such considerations are a large part of the responsibility of a software engineer or architect, and I understand the same to be true in other engineering disciplines (though I can not speak from experience).
Examples:
The existence of multiple points of failure can make a system harder to unplug in two ways: first, the challenge of locating and tracking each point of failure, and secondly the commensurately increased effort of targeting each point.
With system reliability and uptime such a core engineering consideration, we may expect AI systems to continue to be built with such properties in mind. Even in the absence of human design, a misaligned AGI would presumably have no trouble at least identifying the usefulness of such robustness, though implementation is another matter. Such software sophistication appears to be out of reach for current AI systems, for now.
It is unclear how to mitigate this.
Aside on repair, error correction, course-correction
Other aspects of robustness are relevant to the consideration of AI, including systems with repair, error correction, and course correction, all of which are seen in systems both natural and artificial. These are less pertinent to literal unpluggability per se, but certainly relate to the challenges of disrupting a system in general.
Dependence (collateral)
When we refer to systems or organisations as 'too big to fail', what we usually mean is, 'too big to be permitted to fail without substantial collateral damage'. What these systems have in common is that, due to their utility and scale, they become foundational to other things of value such that their removal would (with some degree of plausibility, or without expensive mitigation) damage those other valuables.
Examples:
Such foundational systems are harder to unplug for two main reasons. First, the collaterally damaged valuables make unplugging straightforwardly less desirable. Perhaps more importantly, the collateral is valuable to someone, whether it's a livelihood or essential, a way of life, a good, a luxury, or a comfort. That person has a degree of incentive to act against any attempt to remove it. This raises the question, 'who is the "we" that intends to unplug the system?', and can bring tricky collective action problems into view. This appears to bite even with relatively clearcut cases like climate risk and energy production, and the greater the ambiguity (as with pandemic risks or AI risks), the harder to gain consensus about tradeoffs and externalities, which is one first step toward resolving such collective action problems.
It's hard to predict how substantially and in what ways AI or AGI system(s) will become foundational, but increasing generality and capability has so far given rise to increased deployment and perhaps dependency.
Finding ways to avoid dependence on increasingly capable and unproven systems could counter this risk - alternatively, building capacity and planning for ways to attenuate or relieve collateral harms when it becomes necessary to shut down a dangerous system.
Defence (active, reactive, deterrent)
Weaponry, cyber capabilities, propaganda and public relations, legal and normative protection. These capacities can be proactively deployed, or held in reserve ready to react to any attempt to disable a system (including as a deterrent).
Examples:
Computational systems may demonstrate an advantage in research and development of chemical and biological weaponry - this is perhaps the most straightforward route to a misaligned AGI acquiring massive destructive potential. In addition, conventional weaponry is increasingly automated or automatable: mass destruction has been available in the form of nuclear weapons for some time, and targeted destruction is increasingly feasible using drones or similar tech - appropriation of these may vary in means and difficulty but broadly include cracking cybersecurity or influencing key human operators. Content generation and persuasive messaging appear to be feasible or near-feasible for today's AI systems, such as could be deployed in propaganda. It remains to be seen what laws and norms will arise around use of AI, rights or responsibilities of AI developers and of the systems themselves - these might be able to be wielded in defence or offence by corporate-like AI, networks of AI, or organisations including AI systems.
There are not obvious technical mitigations in this area. Perhaps the most technical would include efforts to robustify organisational- and cyber-security best practices across weapons-enabled organisations, globally. Research and investment in detection and prevention of novel pathogens may provide some defence against bioweapons. Besides this, deproliferation of weapon stockpiles, and of the means to produce them (especially novel biological and chemical substances), might reduce the scale and chance of an appropriation. Misguided or otherwise, assigning control of weapons to AI systems obviously nullifies the challenge of acquiring control of weapons for those particular AI systems. Whether this ends well depends on what those AI systems do with the weapons, of course.
Whether it is overall safer to train machine learning algorithms with access to human-specific data is unclear, but certainly it makes the job of creative and persuasive content generation much more straightforward. Norms and laws around impersonation of humans (in general, or of particular humans) will evidently have strong influence over the propaganda potential for AI, as will processes for detecting and signifying generated content, and the broader questions of human-AI interaction modalities and interfaces.
Expansionism (replication, propagation, growth)
Replication and growth (with reinvestment) get special mentions as they naturally produce exponential expansion (until constraints are reached), which in practice often manifests as first imperceptible and then rapid escalation.
Replicating systems also give rise to a kind of robustness due both to redundancy and repair. They are notoriously difficult to shut down, which is why autonomous replication is rarely a deliberate part of human designs - though we see it employed under well-understood and controlled conditions in agriculture and some industry, maliciously in computer viruses and bioweapons, and sometimes accidentally in biosphere interventions. In fact, in biological, zoological, and related sciences, great care is usually taken to avoid inadvertently unleashing autonomously replicating systems, though this remains sometimes insufficient[6].
Examples:
The obvious technical property to note about AI systems is the inherent copyability of digital software and data, and for many types of algorithm the inherent scalability of capability with access to more/faster computers[7]. From the 'rogue AI' perspective it is straightforward to see replication being an early strategic consideration. For 'pre-rogue' or 'ambiguously rogue' AI, the inherent copyability of software also means that human actors are liable to replicate AI systems.
Beside replication per se, we have more general propagation and growth. Companies and industries expand, both into new and related lines of business. A sufficiently autonomous AI system, or one participating in a corporate-like entity could do the same.
The main means of halting expansionist systems is to remove or protect the resources used for expansion, or intervene in some other way to reduce the rate. Very commonly we are forced to simply await resource exhaustion (as with some wildfires) or learn to live with it[8] (as with endemic diseases or established invasive species). Unprecedently sophisticated antivirus programs or evaluation/certification methods may provide some protection, but for computers, the quantity of highly-networked and in principle accessible resources is very large, and it is unclear how we could intervene, assuming we detected an AI system autonomously replicating. Licensing and monitoring of compute resources might offer some avenue to control this, but by the time an autonomous replicator is at large, this is probably too little too late.
Improvements to cybersecurity practices and organisational security on the one hand, and changes to research closure norms on the other, may affect the proliferation of potentially-dangerous AI systems by humans.
Un-unpluggability incentives and expectations
Of course, the very fact that un-unpluggability can be increased by these and other properties gives an incentive to any system (or system designer) to achieve these. Hence we see organisms, processes, human organisations, and human-designed devices exhibiting all of these properties in one shape or another.
In the case of robustness, there is a clear incentive for designers and developers to imbue their systems with this property, and more or less similarly for rapidity and dependence, at least while developers are incentivised to compete over market share in deployments.
In light of recent developments in AI tech, I actually expect the most immediate unpluggability impacts to come from collateral, and for anti-unplug pressure to come perhaps as much from emotional dependence and misplaced concern[9] for the welfare of AI systems as from economic dependence - for this reason I believe there are large risks to allowing AI systems (dangerous or otherwise) to be perceived as pets, friends, or partners, despite the economic incentives.
For imperceptibility, defence, and expansionism, there is a definite incentive for a system to develop these itself, though perhaps a more mixed incentive for the developers - we might land with them anyway through mistakes, malice, or the inherent inscrutability of deep learning, but otherwise these appear more likely to arrive after situationally-aware AGI.
Conclusion
We discussed six properties of systems which can make them hard to 'unplug', namely
where expansionism gets a special mention for often giving rise to the others, and for being especially difficult for humans to combat.
It is tempting to reason about AI within a frame of simple programs running on a laptop, but modern impactful software systems are more often large, complex and networked. We touched on some ways of relating contemporary and future AI systems to the six un-unpluggability properties mentioned.
Cooperation between technical experts, policy leaders, developers, and the public will be needed to evaluate and prevent these properties from arising in AI; I'm cautiously optimistic that such cooperation can be achieved, but it will take sustained and creative effort from many stakeholders.
I remind readers that this should not be considered a comprehensive summary, and that these and other potential factors are individually sufficient to lend a system un-unpluggability, rather than being required all at once.
I appreciated how open-minded their questioning was - there was a genuine truth-seeking inquisitiveness, rather than a debate-minded presupposition. The people there even connected some of the dots and filled some of the gaps themselves once the conversation was unfolding, which is a great sign of ideas and knowledge moving successfully between minds. ↩︎
Less unpluggable? More un-unpluggable? I welcome terminological criticism and suggestions ↩︎ ↩︎
Analogous examples are not necessarily intended to be things we would want to unplug if we could (though many will be). Besides confirmed examples, I will also provide potential or unconfirmed examples (consensus or otherwise), which I denote with a question mark. ↩︎
Such organisations consider robustness of other systems too: in more macabre terms we will discuss the 'bus factor' of a project or team - how many people would need to get hit by a bus for key knowledge or competence to be irrecoverable? - and take deliberate steps to mitigate this, like knowledge-sharing, upskilling, and documentation (and not putting the whole team on the same bus). Nobody likes being on call when a critical component goes haywire and the only expert is sick, on vacation, or asleep on the other side of the world! I've been on both sides of that phonecall, and in each case it imparts a true and visceral appreciation for system reliability and knowledge diffusion. ↩︎
Atmospheric oxygen was not always present - its introduction due to early photosynthesis actually killed off almost all earlier life - but now it is essential to most life forms. The presence of ozone protects land-based life from deadly solar radiation - so modern life forms have developed very limited capacity to withstand such radiation. ↩︎ ↩︎
Including some credible suggestions (e.g. by US government agencies) that the ongoing coronavirus pandemic may have had an accidental lab leak as its origin, as well as more thoroughly verified cases of accidental pest introduction or pathogens finding their way out of laboratory contexts ↩︎
Though note that scalability of algorithms varies widely from 'barely scales at all, even with supercomputers' to 'trivially scales up the more compute you throw at it' ↩︎
Assuming it hasn't already taken our life or livelihood, that is ↩︎
It is my best guess for various reasons that concern for the welfare of contemporary and near-future AI systems would be misplaced, certainly regarding unplugging per se, but I caveat that nobody knows ↩︎