- (the "AI immune system") The whole internet — including space satellites and the internet-of-things — becomes way more secure, and includes a distributed network of non-nuclear electromagnetic pulse emitters that will physically shut down any tech infrastructure appearing to be running rogue AI agents.
Define "way more secure". Like, superhuman-at-security AGIs rewrote the systems to be formally unhackable even taking into account hardware vulnerabilities like Rowhammer that violate the logical chip invariants?
Can you talk a bit about the world global dictatorship running the electromagnetic pulse emitters, and how they monitor every computer in the world? What sort of violence do you envision being inflicted on any countries who don't want to submit their computers for monitoring? Is part of the plan to use AI drones to kill any political leaders who oppose this plan, so as to minimize civilian casualties? Who controls these AI drones, are we quite sure this world dictatorship stays friendly to its citizens? A lot of political processes leading to such a thing sound like they could potentially be scary.
I said "burn all GPUs" to be frank about these things being scary. It's easy for things to sound less scary when they're vague and the processes leading up to them are left vague. See also, George Orwell, "Politics and the English Language". We can't evaluate whether you have a less scary proposal until you make a less vague one.
An attempted paraphrase, to hopefully-disentangle some claims:
Eliezer, list of AGI lethalities: pivotal acts are (necessarily?) "outside of the Overton window, or something"[1].
Critch, preceding post: Strategies involving non-Overton elements are not worth it
Critch, this post: there are pivotal outcomes you can achieve via a strategy with no non-Overton elements
Eliezer, this comment: the "AI immune system" example is not an example of a strategy with no non-Overton elements
Possible reading: Critch/the reader/Eliezer currently wouldn't be able to name a strategy towards a pivotal outcome, with no non-Overton elements
Extreme version of this: Any practical-in-our-world strategy towards a pivotal outcome necessarily contains some non-Overton elements
Substitute your better characterization of the undesirable property here. I will just use "non-Overton" for the purposes of this comment.
I am only replying to the part of this post about hardware vulnerabilities.
Like, superhuman-at-security AGIs rewrote the systems to be formally unhackable even taking into account hardware vulnerabilities like Rowhammer that violate the logical chip invariants?
There are dozens of hardware vulnerabilities that exist primarily to pad security researchers' bibliographies.
Rowhammer, like all of these vulnerabilities, is viable if and only if the following conditions are met:
It's not that Rowhammer isn't possible in the sense that it cannot be shown to work, but it's like this paper showing that you can create WiFi signals in air-gapped computers. Or this fun paper for Nethammer showing novel attacks that don't require code execution on the target machine, except they also don't allow for controlling where bit flips occur, so the "attack" is isomorphic to an especially hostile radiation environment with a high likelihood of bit-flips, and it relies on the ability for the attacker to swarm the target system with a high volume (500 Mbps?) of network traffic that they control -- a network switch that drops unexpected traffic or even just rate-limits it will defeat Nethammer. Note that rate-limiting network traffic is in fact standard practice for high stability systems, because it's also a protection against much more mundane denial-of-service attacks.
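A minimal sketch of the rate-limiting idea, assuming a token-bucket policy. Real network switches enforce this in hardware; the class name, the rate, and the burst size here are hypothetical values chosen only for illustration. The point is that a flood of attacker-controlled packets is mostly dropped, while a sender that stays within the expected rate is unaffected.

```python
class TokenBucket:
    """Allow roughly `rate` packets per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate          # tokens added per second (hypothetical value)
        self.capacity = capacity  # maximum burst size (hypothetical value)
        self.tokens = capacity    # start with a full bucket
        self.last = 0.0           # timestamp of the previous packet

    def allow(self, now: float) -> bool:
        """Return True if a packet arriving at time `now` is forwarded."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: drop the packet


def count_forwarded(bucket: TokenBucket, arrival_times) -> int:
    """Count how many packets from a list of arrival times get through."""
    return sum(1 for t in arrival_times if bucket.allow(t))


if __name__ == "__main__":
    # 1000 packets in one second against a 10 packet/s limit.
    flood = [i / 1000 for i in range(1000)]
    print(count_forwarded(TokenBucket(rate=10, capacity=5), flood))
```

Under these assumptions, only a small fraction of the flood is forwarded, which is why an attack that depends on sustained high-volume traffic does not survive a limiter in front of the target.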
Consumer systems are vulnerable to attacks, because consumer systems don't care about stability. Consumers want to have a fast network connection to the internet. There's no requirement, or need, for that to be true on a system designed for stability, like something in a satellite, or some other safety-critical role. It is possible to have systems that are effectively "not able to be hacked" -- they don't use general-purpose hardware, they don't have code that can be modified, they have no capability for executing external code, they include hardware level fault tolerance and redundancy, and they have exceptionally limited I/O. It doesn't require us presuming "superhuman-at-security AGIs" exist to design these systems.
Every few weeks researchers publish papers carefully documenting the latest side-channel attacks that result in EVERYTHING EVERYWHERE BEING VULNERABLE FOREVER, and every few weeks attackers continue to do the boring old thing of leaving USB drives lying around for a target to pwn themselves, or letting the target just download the malware directly to their machine. They're almost all just remixes too -- it's "here's how to do exploit A (the original), but on hardware systems that implemented mitigations for C, D, and E". Except exploit A still has all of the preconditions I listed above, and now you've got whatever new preconditions the security researchers have in their latest paper.
This is an awesome comment, I think it would be great to make it a top-level post. There's a Facebook group called "Information Security in Effective Altruism" that might also be interested
Would you agree that if there were a system that could automatically (without humans) monitor for rogue AI, then that system is probably hackable? (Because it has to take many inputs from the world, and has to be a general computer, not a tiny / hardwired thing.)
Would you agree that if there were a system that could automatically (without humans) monitor for rogue AI, then that system is probably hackable?
I can't comment on this, because I have no idea what this system would even do, or what it would be looking for, or who would be running it.
What indicates a rogue AI? How is that differentiated from human hackers? How is that different from state-level actors like the NSA? How is it different from some white hat security researchers doing pen-testing in the wild for questionable reasons? What "behavior" is the system trying to look for?
How is it monitoring? What inputs is it taking? Is it only looking at internet traffic? How? Where is it tapping for those packets? There's a lot of internet traffic. Is this even "a computer", or is it "an entire facility of computers"? Why are there no humans involved in this process? Where do the logs, monitoring, or other telemetry go if this is done "without humans"? Wouldn't this system be reporting some type of digest or summary for review? Why do you think that it would need to be a general computer? Do you think that high-performance SoCs do not exist, or that it would not be possible to have a distributed system containing multiple SoCs with the equivalent RAM and compute of some general-purpose machine?
What would it mean to hack this system? Just that it doesn't report any rogue AI anymore? Or that it erroneously reports the presence of rogue AI? What would hacking it to not report rogue AI look like? Are we concerned that it would log the presence of a rogue AI, but then the rogue AI would be able to wipe that entry from some log file, like a movie? Or are we thinking that the rogue AI is going to disable this system before it is even noticed by the system? Do we count "turning off the power" as "hacking"?
The question here (in this subthread) is whether we could make a system that would shut down computers
appearing to be running rogue AI agents.
There's a question of what such a system could look like.
Why are there no humans involved in this process?
Presumably humans would be involved. I was pretty imprecise above; what I meant to ask was:
Suppose there's a system that can shut down computers running rogue AIs. If this actually stops rogue AI, that's because there's a vast amount of monitoring going on. The bulk of that vast amount of monitoring can't be borne by humans. So there's a lot of monitoring that's being done by computers. Those computers are doing something seemingly pretty non-trivial, i.e. detecting AI / rogue AI / hacking (like, we don't currently know how to do it). Because they're doing something pretty non-trivial, they probably have to be big complex systems. Because they're big complex systems, they're hackable. Does this sound right to you? I'm mostly asking you about the step "detecting rogue AI implies hackable". Or to expand the question, for what tasks XYZ can you feasibly design a system that does XYZ, but is really seriously not hackable even by a significantly superhuman hacker? Clearly with some work you could make a perfect tic-tac-toe player that simply is not hackable without physically tampering with it. It seems like you could not feasibly make an unhackable system that takes a bunch of inputs from another (unsafe) system and processes them in a bunch of complex ways using software that someone is constantly updating, because having the ability to update to the latest Detect-O-Matic-v3.4 without knowing in advance what sort of thing the Detect-O-Matic is, beyond that it's software, seems to imply being Turing-completely programmable, which seems to imply being hackable. But maybe that's too vague an analysis... Thoughts?
What would it mean to hack this system?
Yeah, hack it to not raise the alarm during the time period where the AI can be shut off by the EMP being activated. Then the AI disables some EMPs, or arranges or finds some computers to be not in range of EMPs or shielded from the EMPs, or creates or finds computers intrinsically not affected by EMPs (like nanotech mechanical computers or DNA computers, or something less crazy / easier that I didn't think of), or gains control over the decision-making process for whether to activate the EMPs, or intercepts signals to the EMPs and spoofs the "successfully fired" signals. (Assuming that the AI detection scheme isn't mooted by any of these already having happened while the AI still looked like what it was supposed to look like---just another AI project---but that already had enough resources for the AI to get a decisive advantage without spreading noticeably.)
Oh, sorry, you're referring to this:
includes a distributed network of non-nuclear electromagnetic pulse emitters that will physically shut down any tech infrastructure appearing to be running rogue AI agents.
This just seems like one of those things people say, in the same vein as "melt all of the GPUs". I think that non-nuclear EMPs are still based on chemical warheads. I don't know if a "pulse emitter" is a thing that someone could build. Like I think what this sentence actually says is equivalent to saying
includes a distributed network of non-nuclear ICBMs that will be physically shot at any target believed to be running a rogue AI agent
and then we can put an asterisk on the word "ICBM" and say it'll cause an EMP at the detonation site, and only a small explosion.
But you can see how this now has a different tone to it, doesn't it? It makes me wonder how the system defines "appears to be running rogue AI agents", because now I wonder what the % chance of false positives is -- since on a false positive, the system launches a missile.
What happens if this hypothetical system is physically located in the United States, but the rogue AI is believed to be in China or Russia? Does this hypothetical system fire a missile into another country? That seems like it could be awkward if they're not already on board with this plan.
because they're doing something pretty non-trivial, they probably have to be big complex systems. Because they're big complex systems, they're hackable. Does this sound right to you? I'm mostly asking you about the step "detecting rogue AI implies hackable". Or to expand the question, for what tasks XYZ can you feasibly design a system that does XYZ, but is really seriously not hackable even by a significantly superhuman hacker?
It's not really about "tasks", it's about how the hardware/software system is designed. Even a trivial task, if done on a general-purpose computer, with a normal network switch, the OS firewall turned off, etc, is going to be vulnerable to whatever exploits exist for applications or libraries running on that computer. Those applications or libraries expose vulnerabilities on a general-purpose computer because they're connected to the internet to check for updates, or they send telemetry, or they're hosting a Minecraft server with log4j.
It seems like you could not feasibly make an unhackable system that takes a bunch of inputs from another (unsafe) system and processes them in a bunch of complex ways using software that someone is constantly updating, because having the ability to update to the latest Detect-O-Matic-v3.4 without knowing in advance what sort of thing the Detect-O-Matic is, beyond that it's software, seems to imply being Turing-completely programmable, which seems to imply being hackable.
When you're analyzing the security of a system, what you're looking for is "what can the attacker control?"
If the attacker can't control anything, the system isn't vulnerable.
We normally distinguish between remote attacks (e.g. over a network) and physical attacks (e.g. due to social engineering or espionage or whatever). It's generally safe to assume that if an attacker has physical access to a machine, you're compromised.[1] So first, we don't want the attacker to have physical access to these computers. That means they're in a secure facility, with guards, and badges, and access control on doors, just like you'd see in a tech company's R&D lab.
That leaves remote attacks. These generally come in two forms:
All of the attacks in (1) fall under the "when you run untrusted code, you will get pwned" umbrella. There's a bunch of software mitigations for trying to make this not terrible, like admin users vs non-admin users, file system permissions, VM sandboxing, etc, but ultimately it's just like rearranging deck chairs on the Titanic. It doesn't matter what you do, someone else is going to find a side channel attack and ruin your day if you let them run code on your machine. So don't do that. This is actually easier than you might think: plenty of systems are "secure" because they run an incredibly minimal Linux OS (or some RTOS or even just bare metal) and they're effectively static -- the software image is flashed to some SoC's read-only memory (ROM) by an external debugger[2], and there's no capability from within the software to write or update that memory. The processor is not configured for running code outside of that ROM. There's no user accounts, or "default applications", or browsers, or anything else other than the code for performing the actual task required by the system.
For (2), in cases where the system is not supposed to be receiving arbitrary traffic, we solve this by using networking hardware downstream of the system that drops any unexpected traffic, and rate-limits everything. If the attacker's traffic gets dropped, they can't control anything. The rate-limiting is to prevent a denial-of-service based on traffic we expect. For the "expected" traffic, the ideal mechanism to prevent abuse is a binary schema for valid messages, a fixed MTU size on messages, and the use of authenticated encryption so that the attacker is unable to generate the traffic without access to encryption keys. Encryption keys can be stored in hardware security modules that are physically located on the senders & receivers within the system.[3]
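A rough sketch of the "fixed-size, authenticated message" idea, using only the Python standard library. Two loud assumptions: the payload layout (a counter plus one reading) is hypothetical, and this sketch uses HMAC-SHA256, which gives authentication but not encryption. A real design along the lines described above would use an authenticated-encryption (AEAD) scheme with keys held in a hardware security module.

```python
import hashlib
import hmac
import struct
from typing import Optional, Tuple

# Hypothetical schema: 8-byte message counter, 4-byte reading, network byte order.
PAYLOAD = struct.Struct("!QI")
TAG_LEN = 32                          # length of an HMAC-SHA256 tag
FRAME_LEN = PAYLOAD.size + TAG_LEN    # every valid frame has exactly this size


def seal(key: bytes, counter: int, reading: int) -> bytes:
    """Pack the payload and append an authentication tag."""
    body = PAYLOAD.pack(counter, reading)
    return body + hmac.new(key, body, hashlib.sha256).digest()


def open_frame(key: bytes, frame: bytes, last_counter: int) -> Optional[Tuple[int, int]]:
    """Return (counter, reading) for a valid frame, or None to drop it."""
    if len(frame) != FRAME_LEN:
        return None                   # wrong size: not in the schema
    body, tag = frame[:PAYLOAD.size], frame[PAYLOAD.size:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None                   # sender does not hold the key
    counter, reading = PAYLOAD.unpack(body)
    if counter <= last_counter:
        return None                   # replayed or stale frame
    return counter, reading


if __name__ == "__main__":
    key = b"k" * 32                   # placeholder key for the demo only
    frame = seal(key, 1, 42)
    print(open_frame(key, frame, last_counter=0))
    print(open_frame(key, frame[:-1], last_counter=0))
```

The receiver never parses anything it has not first authenticated, so an attacker without the key has no input they can shape.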
With that in mind, I hope it is clear that the problem isn't about complexity of a system, or the number of inputs that the system has, or if the system supports software updates.
The actual issue with the system we're discussing is that the attacker can control their network traffic, and in this hypothetical, we need to look at arbitrary network traffic to decide if it's evidence of a rogue AI.[4] We can't just drop everything in hardware and ignore messages that don't fit an authenticated encryption scheme, because we want to read that traffic.
So we have to assume that the attacker can generate arbitrary payloads, send them to our system, and our system is going to need to not get pwned.
I think I'd probably approach it with the following thoughts:
The supervisor machine is pretty safe unless a cluster machine is compromised. If a cluster machine is compromised, maybe we think that the attacker can control the alert message, and in doing so, control traffic sent to the central supervisor? Let's fix that. What do we need for the alert message? Maybe the IP of the packet, geolocation information, reasons why it was suspicious? Etc. We'll define a binary schema for that information and we'll stick a serial device or FPGA onto each cluster machine.
So now attackers can maybe compromise cluster machines because they control the traffic that we're trying to monitor, but they can't use a compromised cluster machine to control traffic to the central supervisor, because compromising a cluster machine doesn't let you change the traffic sent to a supervisor in a way that would be dangerous.
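One way the alert schema could look, as a sketch only. The field choices (IPv4 address, latitude and longitude in microdegrees, a one-byte reason code) and the reason-code table are hypothetical; the comment above only says the message might carry an IP, geolocation, and reasons. The idea being illustrated is that the supervisor accepts a fixed-size record with range-checked fields and nothing else.

```python
import ipaddress
import struct
from dataclasses import dataclass
from typing import Optional

# Hypothetical layout: IPv4 (4 bytes), lat and lon in microdegrees (int32 each),
# reason code (1 byte). Network byte order, no padding: 13 bytes total.
ALERT = struct.Struct("!4siiB")

# Hypothetical reason codes; a real detector would define its own.
REASONS = {1: "traffic-volume", 2: "unknown-protocol", 3: "signature-match"}


@dataclass(frozen=True)
class Alert:
    source_ip: str
    lat_udeg: int
    lon_udeg: int
    reason: int


def encode(alert: Alert) -> bytes:
    """Serialize an alert into the fixed-size binary record."""
    ip = ipaddress.IPv4Address(alert.source_ip).packed
    return ALERT.pack(ip, alert.lat_udeg, alert.lon_udeg, alert.reason)


def decode(raw: bytes) -> Optional[Alert]:
    """Parse a record, returning None for anything outside the schema."""
    if len(raw) != ALERT.size:
        return None
    ip, lat, lon, reason = ALERT.unpack(raw)
    if not -90_000_000 <= lat <= 90_000_000:
        return None
    if not -180_000_000 <= lon <= 180_000_000:
        return None
    if reason not in REASONS:
        return None
    return Alert(str(ipaddress.IPv4Address(ip)), lat, lon, reason)


if __name__ == "__main__":
    alert = Alert("203.0.113.7", 48_858_844, 2_294_351, 3)
    print(decode(encode(alert)))
```

Because there are no variable-length fields, strings, or nested structures, a compromised cluster machine can at worst send a well-formed but false alert. It cannot send the supervisor something surprising to parse.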
At this point we can ask what the attacker can do:
We can mitigate the 2nd problem by requiring a heartbeat from each cluster machine to the central supervisor. A heartbeat is another message sent via authenticated encryption, but it's sent at a known frequency, like 100 Hz. If an attacker or any other event crashes the machine, the central supervisor will know.
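A minimal sketch of the supervisor-side heartbeat check. It assumes heartbeats have already been authenticated before `beat` is called, and the tolerance of three missed beats is a hypothetical choice, not something specified above.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Track the last heartbeat per cluster machine and flag silent ones."""

    def __init__(self, period_s: float, missed_allowed: int = 3) -> None:
        # A machine is flagged once it has been silent for longer than
        # `missed_allowed` heartbeat periods (hypothetical tolerance).
        self.timeout = period_s * missed_allowed
        self.last_seen: Dict[str, float] = {}

    def beat(self, machine_id: str, now: float) -> None:
        """Record an (already authenticated) heartbeat."""
        self.last_seen[machine_id] = now

    def silent(self, now: float) -> List[str]:
        """Return the machines whose heartbeat is overdue at time `now`."""
        return sorted(
            m for m, t in self.last_seen.items() if now - t > self.timeout
        )


if __name__ == "__main__":
    mon = HeartbeatMonitor(period_s=0.01)   # 100 Hz, as in the text
    mon.beat("cluster-a", 0.00)
    mon.beat("cluster-b", 0.00)
    mon.beat("cluster-a", 0.05)             # cluster-b has stopped sending
    print(mon.silent(now=0.06))
```

This only detects a machine that goes quiet. It says nothing about a machine that has been compromised but keeps sending valid heartbeats.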
If the attacker sends a garbage alert message, what is the downside? It could be bad if this system launches missiles, so maybe we'll have human review on any alert that the central supervisor receives, prior to any action being taken by the system. The human review can include a log of the network traffic that generated the alert, captured by the cluster machine, and a diagnosis of the cluster machine's telemetry for how it concluded the traffic was suspicious. This is intended to mitigate the possibility of a false positive where the attacker can control where we think a rogue AI is located.
Now, where we stand is that an attacker can maybe compromise a cluster machine, and in doing so, they can maybe generate an erroneous alert message for a rogue AI, which a human review maybe won't be able to detect as erroneous, so whatever action gets taken on belief that a rogue AI exists would be sent to a location controlled by the attacker. The magnitude of how bad that is depends on what action is taken. If it is "fire an ICBM", it could be pretty bad.
The other possibility is that an attacker compromises a machine in such a way that no alert is generated, but no heartbeat is missed either. This is an interesting failure mode because it's equivalent to another failure mode that we have not discussed: the system could simply fail to detect the behavior of a rogue AI. In other words, this system doesn't need to be hacked in order to fail. If the algorithm we're using for detecting rogue AI has any margin for error at all, then it should be no surprise to anyone that this system is susceptible to a false negative, i.e.: it did not send an alert when it should have.
The relevant question becomes, what's the margin of error on the hypothetical algorithm for detecting a rogue AI? If that algorithm has a higher likelihood of a false positive than our hypothetical attacker, after all of our mitigations, then maybe this system isn't even worth building, depending on how high that likelihood is. Likewise, if the algorithm has a high likelihood of a false negative, such that a rogue AI can reasonably evade detection due to not being suspicious enough, then this system might still not be worth building, because now the most likely failure mode is that the system just doesn't work reliably enough, hacked or not.
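To see why the detector's own error rate can dominate, here is a small base-rate calculation. Every number in it is made up for illustration; nobody in this thread has proposed actual error rates, and the function name is hypothetical.

```python
def alert_precision(base_rate: float,
                    true_positive_rate: float,
                    false_positive_rate: float) -> float:
    """Probability that an alert corresponds to a real rogue AI (Bayes' rule).

    base_rate:            fraction of monitored sites actually running one
    true_positive_rate:   P(alert | rogue AI present)
    false_positive_rate:  P(alert | no rogue AI present)
    """
    true_alerts = base_rate * true_positive_rate
    false_alerts = (1.0 - base_rate) * false_positive_rate
    total = true_alerts + false_alerts
    return true_alerts / total if total > 0 else 0.0


if __name__ == "__main__":
    # Illustrative numbers only: one rogue site per million, a detector that
    # catches 99% of them, and a 0.1% false-positive rate.
    print(alert_precision(1e-6, 0.99, 0.001))
```

With those illustrative inputs, fewer than one alert in a thousand is real, so almost every action the system takes would be aimed at a false positive, before any attacker is involved at all.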
This is a safe assumption because if the attacker has physical access to a machine, they can do dumb things like pull out the hard drive and put it into another computer that they control. Enable encryption on your drives.
When I say "external debugger", this is implying physical access to the system.
In order to get the encryption keys, the attacker needs physical access to the system. Attacks that would use a side-channel attack won't work here, because we don't accept arbitrary traffic, and in order to generate that traffic, you need the keys. It's a catch-22, unless you have physical access. But if you have physical access, we assume the system is compromised.
To be clear, I am not convinced that "evidence of a rogue AI" is a meaningful description of behavior.
This is assuming that we've verified the device that sends the alert message, which we'll have high confidence in because of how simple the implementation will be.
Thank you! I appreciate your thoughtful reply, and learned stuff. (This could IMO be a top-level post, maybe, IDK. Like "So you want to build an AI off switch...".)
> I hope it is clear that the problem isn't about complexity of a system,
It's clear that the complexity of a system provides an upper bound on hackability, but doesn't provide a lower bound: a really simple system isn't hackable, while a complex (big, Turing complete, etc.) system might or might not be hackable depending on what code is running on it and what inputs it gets. We get lower bounds on hackability by showing ways that the attacker can affect the machine.
> or if the system supports software updates.
Hm. It still seems like actually getting software updates is a big security hole (as opposed to merely hypothetically supporting updates). If you don't update the software, and you already trust it to not be hackable, then it should stay unhackable. If you do update the software, you could introduce new bugs, or whoever sends the updates could have been compromised and the "update" could be malware. You're increasing your surface area; the rogue AI could hack the software devs who push the code that's supposed to use the latest new AI transparency tech to detect rogue AIs. As you point out, there's a good chance that detecting rogue AIs is at best a very difficult task; this puts pressure to use adaptive methods, methods that you keep updating (e.g. you keep updating with a detector AI that's been further improved / trained). So it seems like there's a quadrilemma: either
1. your rogue AI detection methods are imprecise and are quick to pull the trigger, in which case you just immediately shut down all computers; or
2. your methods are imprecise and reluctant to pull the trigger, in which case a rogue AI can probably go unnoticed; or
3. your methods are precise, but you don't update the watchdogs, meaning you somehow got this difficult task right with no further corrections after a point, without using the newest AI tech or getting to test your system on detecting the newest AI tech; or
4. your methods are precise, which you do by updating the watchdogs, in which case you're greatly increasing the surface area of your system, and have to be hardened against ongoing attacks on the entire detection software pipeline.
> To be clear, I am not convinced that "evidence of a rogue AI" is a meaningful description of behavior.
Me neither.
An example of a possible "pivotal act" I like that isn't "melt all GPUs" is:
Use AGI to build fast-running high-fidelity human whole-brain emulations. Then run thousands of very-fast-thinking copies of your best thinkers. Seems to me this plausibly makes it realistic to keep tabs on the world's AGI progress, and locally intervene before anything dangerous happens, in a more surgical way rather than via mass property destruction of any sort.
Looking for pivotal acts that are less destructive (and, more importantly for humanity's sake, less difficult to align) than "melt all GPUs" seems like a worthy endeavor to me. But I prefer the framing 'let's discuss the larger space of pivotal acts, brainstorm new ideas, and try to find options that are easier to achieve, because that particular toy proposal seems suboptimally dangerous and there just hasn't been very much serious analysis and debate about pathways'. In the course of that search, if it then turns out that the most likely-to-succeed option is a process, then we should obviously go with a process.
But I don't like constraining that search to 'processes only, not acts', because:
I suspect we also have different intuitions about pivotal acts because we have different high-level pictures of the world's situation.
I think that humanity as it exists today is very far off from thinking like a serious civilization would about these issues. As a consequence, our current trajectory has a negligible chance of producing good long-run outcomes. Rather than trying to slightly nudge the status quo toward marginally better thinking, we have more hope if we adopt a heuristic like speak candidly and realistically about things, as though we lived on the Earth that does take these issues seriously, and hope that this seriousness and sanity might be infectious.
On my model, we don't have much hope if we continue to half-say-the-truth, and continue to make small steady marginal gains, and continue to talk around the hard parts of the problem; but we do have the potential within us to just drop the act and start fully sharing our models and being real with each other, including being real about the parts where there will be harsh disagreements.
I think that a large part of the reason humanity is currently endangering itself is that everyone is too focused on 'what's in the Overton window?', and is too much trying to finesse each other's models and attitudes, rather than blurting out their actual views and accepting the consequences.
This makes the situation I described in The inordinately slow spread of good AGI conversations in ML much stickier: very little of the high-quality / informed public discussion of AGI is candid and honest, and people notice this, so updating and epistemic convergence is a lot harder; and everyone is dissembling in the same direction, toward 'be more normal', 'treat AGI more like business-as-usual', 'pretend that the future is more like the past'.
All of this would make me less eager to lean into proposals like "yes, let's rush into establishing a norm that large parts of the strategy space are villainous and not to be talked about" even if I agreed that pivotal processes are a better path to long-run good outcomes than pivotal acts. This is inviting even more of the central problem with current discourse, which is that people don't feel comfortable even talking about their actual views.
You may not think that a pivotal act is necessary, but there are many who disagree with you. Of those, I would guess that most aren't currently willing to discuss their thoughts, out of fear that the resultant discussion will toss norms of scholarly discussion out the window. This seems bad to me, and not like the right direction for a civilization to move into if it's trying to emulate 'the kind of civilization that handles AGI successfully'. I would rather a world where humanity's best and brightest were debating this seriously, doing scenario analysis, assigning probabilities and considering specific mainline and fallback plans, etc., over one where we prejudge 'discrete pivotal acts definitely won't be necessary' and decide at the outset to roll over and die if it does turn out that pivotal acts are necessary.
My alternative proposal would be: Let's do scholarship at the problem, discuss it seriously, and not let this topic be ruled by 'what is the optimal social-media soundbite?'.
If the best idea sounds bad in soundbite form, then let's have non-soundbite-length conversations about it. It's an important enough topic, and a complex enough one, that this would IMO be a no-brainer in a world well-equipped to handle developments like AGI.
it's safer to aim for a pivotal outcome to be carried out by a distributed process spanning multiple institutions and states, because the process can happen in a piecemeal fashion that doesn't change the whole world at once
We should distinguish "safer" in the sense of "less likely to cause a bad outcome" from "safer" in the sense of "less likely to be followed by a bad outcome".
E.g., the FDA banning COVID-19 testing in the US in the early days of the pandemic was "safer" in the narrow sense that they legitimately reduced the risk that COVID-19 tests would cause harm. But the absence of testing resulted in much more harm, and was "unsafe" in that sense.
Similarly: I'm mildly skeptical that humanity refusing to attempt any pivotal acts makes us safer from the particular projects that enact this norm. But I'm much more skeptical that humanity refusing to attempt any pivotal acts makes us safer from harm in general. These two versions of "safer" need to be distinguished and argued for separately.
Any proposal that adds red tape, inefficiencies, slow-downs, process failures, etc. will make AGI projects "safer" in the first sense, inasmuch as it cripples the project or slows it down to the point of irrelevance.
As someone who worries that timelines are probably way too short for us to solve enough of the "pre-AGI alignment prerequisites" to have a shot at aligned AGI, I'm a big fan of sane, non-adversarial ideas that slow down the field's AGI progress today.
But from my perspective, the situation is completely reversed when you're talking about slowing down a particular project's progress when they're actually building, aligning, and deploying their AGI.
At some point, a group will figure out how to build AGI. When that happens, I expect an AGI system to destroy the world within just a few years, if no pivotal act or process finishes first. And I expect safety-conscious projects to be at a major speed disadvantage relative to less safety-conscious projects.
Adding any unnecessary steps to the process—anything that further slows down the most safety-conscious groups—seems like suicide to me, insofar as it either increases the probability that the project fails to produce a pivotal outcome in time, or increases the probability that the project cuts more corners on safety because it knows that it has that much less time.
I obviously don't want the first AGI projects to rush into a half-baked plan and destroy the world. First and foremost, do not destroy the world by your own hands, or commit the fallacy of "something must be done, and this is something!".
But I feel more worried about AGI projects insofar as they don't have a lot of time to carefully align their systems (so I'm extremely reluctant to tack on any extra hurdles that might slow them down and that aren't crucial for alignment), and also more worried insofar as they haven't carefully thought about stuff like this in advance. (Because I think a pivotal act is very likely to be necessary, and I think disaster is a lot more likely if people don't feel like they can talk candidly about it, and doubly so if they're rushing into a plan like this at the last minute rather than having spent decades prior carefully thinking about and discussing it.)
In my view, in practice, the pivotal acts framing actually pushes people to consider a more narrow space of discrete powerful actions, "sharp turns", "events that have a game-changing impact on astronomical stakes".
As I understand it, the definition of "pivotal acts" explicitly forbids considering things like "this process would make 20% per year of AI developers actually take safety seriously with 80% chance" or "what class of small shifts would in aggregate move the equilibrium?". (Where things in this category get straw-manned as "Rube-Goldberg-machine-like")
As often, one of the actual cruxes is in continuity assumptions, where basically you have a low prior on "smooth trajectory changes by many acts" and high prior on "sharp turns left or right".
The second crux, as you note, is doom-by-default probability: if you have a very high doom probability, you may be in favour of variance-increasing acts, whereas people who are a few bits more optimistic may be much less excited about them, in particular if all the plans for such acts have very unclear shapes of impact distributions.
Given these deep prior differences, it seems reasonable to assume this discussion will lead nowhere in particular. (I've a draft with a more explicit argument why.)
In my view, in practice, the pivotal acts framing actually pushes people to consider a more narrow space of discrete powerful actions, "sharp turns", "events that have a game-changing impact on astronomical stakes".
My objection to Critch's post wasn't 'you shouldn't talk about pivotal processes, just pivotal acts'. On the contrary, I think bringing in pivotal processes is awesome.
My objection (more so to "Pivotal Act" Intentions, but also to the new one) is specifically to the idea that we should socially shun the concept of "pivotal acts", and socially shun people who say they think humanity needs to execute a pivotal act, or people who say positive things about some subset of pivotal acts.
This seems unwise to me, because it amounts to giving up on humanity's future in the worlds where it turns out humanity does need to execute a pivotal act. Suppose you have this combination of beliefs:
I personally think that a large majority of humanity's hope lies in someone executing a pivotal act. But I assume Critch disagrees with this, and holds a view closer to 1+2+3.
If so, then I think he shouldn't go "well, pivotal acts sound weird and carry some additional moral hazards, so I will hereby push for pivotal acts to become more stigmatized and hard to talk about, in order to slightly increase our odds of winning in the worlds where pivotal acts are unnecessary".
Rather, I think hypothetical-Critch should promote the idea of pivotal processes, and try to reduce any existing stigma around the idea of pivotal acts, so that humanity is better positioned to evade destruction if we do end up needing to do a pivotal act. We should try to set ourselves up to win in more worlds.
(Where things in this category get straw-manned as "Rube-Goldberg-machine-like")
If you're referring to my comment, then this is itself straw-manning me!
Rube-Goldberg-ishness is a matter of degree: as you increase the complexity of a plan, it becomes harder to analyze, and tends to accumulate points of failure that reduce the probability of success. This obviously doesn't mean we should pick the simplest possible plan with no consideration for anything else; but it's a cost to keep in mind, like any other.
I mentioned this as a quantitative cost to keep in mind; "things in this category get straw-manned as 'Rube-Goldberg-machine-like'" seems to either be missing the fact that this is a real cost, or treating me as making some stronger and more specific claim.
As often, one of the actual cruxes is in continuity assumptions, where basically you have a low prior on "smooth trajectory changes by many acts" and high prior on "sharp turns left or right".
This seems wrong to me, in multiple respects:
I think you should be more of a fox with respect to continuity, and less of a hedgehog. The reason hard takeoff is very likely true isn't some grand, universal Discontinuity Narrative. It's just that different things work differently. Sometimes you get continuities; sometimes you don't. To figure out which is which, you need to actually analyze the specific phenomenon under discussion, not just consult the universal cosmic base rate of continuity.
(And indeed, I think Paul is doing a lot more 'analyze the specific phenomenon under discussion' than you seem to give him credit for. I think it's straw-manning Paul and Eliezer to reduce their disagreement to a flat 'we have different priors about how many random things tend to be continuous'.)
Second crux, as you note, is doom-by-default probability: if you have a very high doom probability, you may be in favour of variance-increasing acts
I agree with this in general, but I think this is a wrong lens for thinking about pivotal acts. On my model, a pivotal act isn't a hail mary that you attempt because you want to re-roll the dice; it's more like a very specific key that is needed in order to open a very specific lock. Achieving good outcomes is a very constrained problem, and you need to do a lot of specific things in order to make things go well.
We may disagree about variance-increasing tactics in other domains, but our disagreement about pivotal acts is about whether some subset of the specific class of keys called 'pivotal acts' is necessary and/or sufficient to open the lock.
Given these deep prior differences, it seems reasonable to assume this discussion will lead nowhere in particular. (I've a draft with a more explicit argument why.)
I'm feeling much more optimistic than you about trying to resolve these points, in part because I feel that you've misunderstood almost every aspect of my view and of my comment above! If you're that far from passing my ITT, then there's a lot more hope that we may converge in the course of incrementally changing that.
(Or non-incrementally changing that. Sometimes non-continuous things do happen! 'Gaining understanding of a topic' being a classic example of a domain with many discontinuities.)
With the last point: I think I can roughly pass your ITT - we can try that, if you are interested.
So, here is what I believe are your beliefs:
I personally think that a large majority of humanity's hope lies in someone executing a pivotal act. But I assume Critch disagrees with this, and holds a view closer to 1+2+3.
If so, then I think he shouldn't go "well, pivotal acts sound weird and carry some additional moral hazards, so I will hereby push for pivotal acts to become more stigmatized and hard to talk about, in order to slightly increase our odds of winning in the worlds where pivotal acts are unnecessary".
Rather, I think hypothetical-Critch should promote the idea of pivotal processes, and try to reduce any existing stigma around the idea of pivotal acts, so that humanity is better positioned to evade destruction if we do end up needing to do a pivotal act. We should try to set ourselves up to win in more worlds.
Can't speak for Critch, but my view is that pivotal acts planned as pivotal acts, in the way most people in the LW community think about them, have only a very small chance of being the solution. (My guess is one or two bits more extreme, more like 2-5% than 10%.)
I'm not sure if I agree with you re: the stigma. My impression is that while the broader world doesn't think in terms of pivotal acts, if it paid more attention, yes, many proposals would be viewed with suspicion. On the other hand, I think on LW it's the opposite: many people share the orthodox views about sharp turns, pivotal acts, etc., and proposals to steer the situation more gently are viewed as unworkable or engaging in thinking with "too optimistic assumptions" etc.
Note that I advocate for considering much weirder solutions, and also for thinking about much weirder world states, when talking with the "general world". In contrast, on LW and AF, I'd like to see more discussion of various "boring" solutions on which the world can roughly agree.
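To make the "one or two bits" arithmetic concrete, here is a minimal sketch. It assumes that "one bit more extreme" means halving the odds p / (1 - p) rather than halving the probability itself; the function name `shift_probability` is made up for illustration.

```python
def shift_probability(p, bits):
    """Return p after moving it `bits` bits toward zero in odds space.

    Assumption: one bit corresponds to halving the odds p / (1 - p).
    """
    odds = p / (1.0 - p)
    shifted_odds = odds / (2.0 ** bits)
    return shifted_odds / (1.0 + shifted_odds)


# Starting from 10%, one bit gives 1/19 and two bits give 1/37.
for bits in (1, 2):
    print(bits, round(shift_probability(0.10, bits), 4))
```

Under that assumption, 10% moves to roughly 5.3% after one bit and roughly 2.7% after two, which is in line with the "2-5%" range.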
Continuity assumptions are about what's likely to happen, not about what's desirable. It would be a separate assumption to say "continuity is always good", and I worry that a reasoning error is occurring if this is being conflated with "continuity tends to occur".
Basically, no. Continuity assumptions are about what the space looks like. Obviously forecasting questions ("what's likely to happen") often depend on ideas about what the space looks like.
My claim is that pivotal acts are likely to be necessary for good outcomes, not that they're necessarily likely to occur. If your choices are "execute a pivotal act, or die", then insofar as you're confident this is the case, the base rate of continuous events just isn't relevant.
Yes, but your other claim is that a "sharp left turn" is likely and leads to bad outcomes. So if we partition the space of outcomes into good/bad, in both branches you assume it is very likely because of sharp turns.
The primary argument for hard takeoff isn't "stuff tends to be discontinuous"; it's "AGI is a powerful invention, and e.g. GPT-3 isn't a baby AGI". The discontinuity of hard takeoff is not a primitive; it's an implication of the claim that AGI is different from current AI tech, that it contains a package of qualitatively new kinds of cognition that aren't just 'what GPT-3 is currently doing, but scaled up'.
This is maybe becoming repetitive, but I'll try to paraphrase again. Consider the option that the "continuity assumptions" I'm talking about are not grounded in "takeoff scenarios", but in "how you think about hypothetical points in the abstract space of intelligent systems".
Thinking about features of this highly abstract space, in regions which don't exist yet, is epistemically tricky (I hope we can at least agree on that).
It probably seems to you that you have many strong arguments giving you reliable insights about how the space works somewhere around "AGI".
My claim is: "Yes, but the process which generated the arguments is based on a black-box neural net, which has a strong prior on things like "stuff like math is discontinuous"". (I suspect this "taste and intuition" box is located more in Eliezer's mind, and some other people updated "on the strength of arguments".) This isn't to imply various people haven't done a lot of thinking and generated a lot of arguments and intuitions about this. Unfortunately, given other epistemic constraints, in my view the "taste and intuitions" differences sort of "propagate" to "conclusion" differences.
- With pretty high confidence, you expect sharp left turn to happen (in almost all trajectories)
- This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training", which is roughly the same as "discovering a core of generality" and a few other formulations. These systems will be in some meaningful sense fundamentally different from e.g. Gato.
That's right, though the phrasing "discovering a core of generality" here sounds sort of mystical and mysterious to me, which makes me wonder whether you can see the perspective from which this is a very obvious and normal belief. I get a similar vibe when people talk about a "secret sauce" and say they can't understand why MIRI thinks there might be a secret sauce—treating generalizability as a sort of occult property.
The way I would phrase it is in very plain, concrete terms:
We can see that the latter is true just by reflecting on what kinds of mental operations go into generating hypotheses about ontologies/carvings on the world, generating hypotheses about the state of the world given some ontology, fitting hypotheses about different levels/scales into a single cohesive world-model, calculating value of information, strategically directing attention toward more fruitful directions of thought, coming up with experiments, thinking about possible experimental outcomes, noticing anomalies, deducing implications and logical relationships, coming up with new heuristics and trying them out, etc. These clearly overlap enormously across the relevant domains.
We can also observe that this is in fact what happened with humans. We have zero special-purpose brain machinery for any science, or indeed for science as a category; we just evolved to be able to model physical environments well, and this generalized to all sciences once it generalized to any.
For things to not go this way would be quite weird.
- From your perspective, this is based on thinking deeply about the nature of such systems (note that this is mostly based on hypothetical systems, and an analogy with evolution)
Doesn't seem to pass my ITT. Like, it's true in a sense that I'm 'thinking about hypothetical systems', because I only care about human cognition inasmuch as it seems likely to generalize to AGI cognition. But this still seems like it's treating generality as a mysterious occult property, and not as something coextensive with all our observations of general intelligences.
- My claim roughly is that this is only part of what's going on, where the actual thing is: people start with a deep prior on "continuity in the space of intelligent systems". Looking into a specific question about hypothetical systems, their search in argument space is guided by this prior, and they end up mostly sampling arguments supporting their prior. (This is not to say the arguments are wrong.)
Seems to me that my core intuition is about there being common structure shared between physics research, biology research, chemistry research, etc.; plus the simple observation that humans don't have specialized evolved modules for chemistry vs physics vs biology. Discontinuity is an implication of those views, not a generator of those views.
Like, sure, if I had a really incredibly strong prior in favor of continuity, then maybe I would try really hard to do a mental search for reasons not to accept those prima facie sources of discontinuity. And since I don't have a super strong prior like that, I guess you could call my absence of a super-continuity assumption a 'discontinuity assumption'.
But it seems like a weird and unnatural way of trying to make sense of my reasoning: I don't have an extremely strong prior that everything must be continuous, but I also don't have an extremely strong prior that everything must be spherical, or that everything must be purple. I'm not arriving at any particular conclusions via a generator that keeps saying 'not everything is spherical!' or 'not everything is purple!'; I'm not a non-sphere-ist or an anti-purple-ist; the deep secret heart and generator for all my views is not that I have a deep and abiding faith in "there exist non-spheres". And putting me in a room with some weird person who does think everything is a sphere doesn't change any of that.
You probably don't agree with the above point, but notice the correlations:
- You expect sharp left turn due to discontinuity in "architectures" dimensions (which is the crux according to you)
- But you also expect jumps in capabilities of individual systems (at least I think so)
- Also, you expect the majority of hope to lie in "sharp right turn" histories (in contrast to smooth right turn histories)
I would say that there are two relevant sources of discontinuity here:
'AGI is an invention' and 'General intelligence is powerful' aren't weird enough beliefs, I think, to call for some special explanation like 'Rob B thinks the world is very discontinuous'. Those are obvious first-pass beliefs to have about the domain, regardless of whether they shake out as correct on further analysis.
'We need a pivotal act' is a consequence of 1 and 2, not a separate discontinuity. If AGI is a sudden huge dangerous deal (because 1 and 2 are true), then we'll need to act fast or we'll die, and there are viable paths to quickly ending the acute risk period. The discontinuity in the one case implies the discontinuity in this new case. There's no need for a further explanation.
Note that I advocate for considering much weirder solutions, and also for thinking about much weirder world states, when talking with the "general world". In contrast, on LW and AF, I'd like to see more discussion of various "boring" solutions on which the world can roughly agree.
Can I get us all to agree to push for including pivotal acts and pivotal processes in the Overton window, then? :) I'm happy to publicly talk about pivotal processes and encourage people to take them seriously as options to evaluate, while flagging that I'm ~2-5% on them being how the future is saved, if it's saved. But I'll feel more hopeful about this saving the future if you, Critch, etc. are simultaneously publicly talking about pivotal acts and encouraging people to take them seriously as options to evaluate, while flagging that you're ~2-5% on them being how the future is saved.
It's worth emphasizing your point about the negative consequences of merely aiming for a pivotal act.
Additionally, if a lot of people in the AI safety community advocate for a pivotal act, it makes people less likely to cooperate with and trust that community. If we want to make AGI safe, we have to be able to actually influence the development of AGI. To do that, we need to build a cooperative relationship with decision makers. Planning a pivotal act runs counter to these efforts.
I have hypothesized that an early non-superintelligent "robot uprising" can end up as such a pivotal act. https://www.lesswrong.com/posts/e4WHZcuEwFiAxrS2N/a-possible-ai-inoculation-due-to-early-robot-uprising
My idea was to use a first human upload as an "AI tzar" who will control all other AI development via an extensive surveillance system. First human upload as AI Nanny.
I think the debate really does need to center on specific pivotal outcomes, rather than how the outcomes come about. The sets of pivotal outcomes attainable by pivotal acts vs. by pivotal processes seem rather different.
I suspect your key crux with pivotal-act advocates is whether there actually exist any pivotal outcomes that are plausibly attainable by pivotal processes. Any advantages that more distributed pivotal transitions have in the abstract are moot if there are no good concrete instantiations.
For example, in the stereotypical pivotal act, the pivotal outcome is that no (other) actors possess the hardware to build an AGI. It's clear how this world state is safe from AGI, and how an (AGI-level) pivotal act could in principle achieve it. It's not clear to me that a plausible pivotal process could achieve it. (Likewise, for your placeholder AI immune system example, it's not clear to me either that this is practically achievable or that it would be pivotal.)
This crux is probably downstream of other disagreements about how much distributed means (governance, persuasion, regulation, ?) can accomplish, and what changes to the world suffice for safety. I think these would be more productive to debate in the context of a specific non-placeholder proposal for a pivotal process.
It's certainly fair to argue that there are downsides to pivotal acts, and that we should prefer a pivotal process if possible, but IMO the hard part is establishing that possibility. I'm not 100% confident that a pivotal transition needs to look like a surprise unilateral act, but I don't know of any similarly concrete alternative proposals for how we end up in a safe world state.
There are also two really important cruxes: is it expedient (more likely to result in alignment) to move from a multipolar to a unipolar world, and is a unipolar world actually a good thing? (Most people would oppose a unipolar world, especially if they perceive it as a hegemony of US techbros and their electronic pet.)
This still seems to somewhat miss the point (as I pointed out last time):
Conditional on org X having an aligned / corrigible AGI, we should expect:
The question isn't what the humans in X would do, but what the [AGI + humans] would do, given that the humans have access to that AGI.
If org X is initially pro-unilateral-PAs, then we should expect an aligned AGI to talk them out of it if it's not best.
If org X is initially anti-unilateral-PAs, then we should expect an aligned AGI to talk them into it if it is best.
X will only be favouring/disfavouring PAs for instrumental reasons - and we should expect the AGI to correct them as appropriate.
For these reasons, I'd expect the initial attitude of org X to be largely irrelevant.
Since this is predictable, I don't expect it to impact race dynamics: what will matter is whether the unilateral PA seems more/less likely to succeed than the distributed approach to the AGI.
I think you are missing the possibility that the outcomes of the pivotal process could be
- no one builds autonomous AGI
- autonomous AGI is built only in post-pivotal outcome states, where the condition of building it is alignment being solved
Sure, that's true - but in that case the entire argument should be put in terms of:
We can (aim to) implement a pivotal process before a unilateral AGI-assisted pivotal act is possible.
And I imagine the issue there would all be around the feasibility of implementation. I think I'd give a Manhattan project to solve the technical problem much higher chances than a pivotal process. (of course people should think about it - I just won't expect them to come up with anything viable)
Once it's possible, the attitude of the creating org before interacting with their AGI is likely to be irrelevant.
So e.g. this just seems silly to me:
So, thankfully-according-to-me, no currently-successful AGI labs are oriented on carrying out pivotal acts, at least not all on their own.
They won't be on their own: they'll have an AGI to set them straight on what will/won't work.
Let's say one makes a comment on LW that shifts the discourse in a way that eventually ramifies into a successful navigation of the alignment problem.
Was there a pivotal outcome?
If so, was there a corresponding pivotal act? What was it?
tl;dr: If you think humanity is on a dangerous path, and needs to "pivot" toward a different future in order to achieve safety, consider how such a pivot could be achieved by multiple acts across multiple persons and institutions, rather than a single act. Engaging more actors in the process is more costly in terms of coordination, but in the end may be a more practicable social process involving less extreme risk-taking than a single "pivotal act".
Preceded by: “Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments
[This post is also available on the EA Forum.]
In the preceding post, I argued for the negative consequences of the intention to carry out a pivotal act, i.e., a single, large world-changing act sufficient to 'pivot' humanity off of a dangerous path onto a safer one. In short, there are negative side effects of being the sort of institution aiming or willing to carry out a pivotal act, and those negative side effects alone might outweigh the benefit of the act, or prevent the act from even happening.
In this post, I argue that it's still a good idea for humanity-as-a-whole to make a large / pivotal change in its developmental trajectory in order to become safer. In other words, my main concern is not with the "pivot", but with trying to get the whole "pivot" from a single "act", i.e., from a single agent-like entity, such as a single human person, institution, or AI system.
Pivotal outcomes and processes
To contrast with pivotal acts, here's a simplified example of a pivotal outcome that one could imagine making a big positive difference to humanity's future, which in principle could be brought about by a multiplicity of actors:
(For now, let's set aside debate about whether this outcome on its own would be pivotal, in the sense of pivoting humanity onto a safe developmental trajectory... it needs a lot more details and improvements to be adequate for that! My goal in this post is to focus on how the outcome comes about. So for the sake of argument I'm asking to take the "pivotality" of the outcome for granted.)
If a single institution imposed the construction of such an AI immune system on its own, that would constitute a pivotal act. But if a distributed network of several states and companies separately instituted different parts of the change — say, designing and building the EMP emitters, installing them in various jurisdictions, etc. — then I'd call that a pivotal distributed process, or pivotal process for short.
In summary, a pivotal outcome can be achieved through a pivotal (distributed) process without a single pivotal act being carried out by any one institution. Of course, the "can" there is very difficult, and involves solving a ton of coordination problems that I'm not saying humanity will succeed in solving. However, aiming for a pivotal outcome via a pivotal distributed process definitely seems safer to me, in terms of the dynamics it would create between labs and militaries, compared to a single lab planning to do it all on their own.
Revisiting the consequences of pivotal act intentions
In AGI Ruin, Eliezer writes the following, I believe correctly:
I think the above realization is important. The un-safety of trying to get a single locus of action to bring about a pivotal outcome all on its own is important, and it pretty much covers my rationale for why we (humanity) shouldn't advocate for unilateral actors doing that sort of thing.
Less convincingly-to-me, Eliezer then goes on to (seemingly) advocate for using AI to carry out a pivotal act, which he acknowledges would be quite a forceful intervention on the world:
I'm not entirely sure if the above is meant to advocate for AGI development teams planning to use their future AGI to burn other people's GPUs, but it could certainly be read that way, and my counterargument to that reading has already been written, in “Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments. Basically, a lab X with the intention to burn all the world's GPUs will create a lot of fear that lab X is going to do something drastic that ends up destroying the world by mistake, which in particular drives up the fear and desperation of other AI labs to "get there first" to pull off their own version of a pivotal act. Plus, it requires populating the AGI lab with people willing to do some pretty drastically invasive things to other companies, in particular violating private property laws and state boundaries. From the perspective of a tech CEO, it's quite unnerving to employ and empower AGI developers who are willing to do that sort of thing. You'd have to wonder if they're going to slip out with a thumb drive to try deploying an AGI against you, because they have their own notion of the greater good that they're willing to violate your boundaries to achieve.
So, thankfully-according-to-me, no currently-successful AGI labs are oriented on carrying out pivotal acts, at least not all on their own.
Back to pivotal outcomes
Again, my critique of pivotal acts is not meant to imply that humanity has to give up on pivotal outcomes. Granted, it's usually harder to get an outcome through a distributed process spanning many actors, but in the case of a pivotal outcome for humanity, I argue that:
I'm not arguing that we (humanity) are going to succeed in achieving a pivotal outcome through a distributed process; only that it's a safer and more practical endeavor than aiming for a single pivotal act from a single institution.