Based on this summary, I think both of these guys are making weak and probably-wrong central arguments. Which is weird.
Yudkowsky thinks it's effectively impossible to align network-based AGI. He thinks this is as obvious as the impossibility of perpetual motion. This is far from obvious to me, or to all the people working on aligning those systems. If Yudkowsky thinks this is so obvious, why isn't he explaining it better? He's good at explaining things.
Yudkowsky's theory that it's easier to align algorithmic AGI seems counterintuitive to me, and at the very least, unproven. Algorithms aren't more interpretable than networks, and by similar logic, they're probably not easier to align. Specifying the arrangements of atoms that qualify as human flourishing is not obviously easier with algorithms than with networks.
This is a more complex argument, and largely irrelevant: we're unlikely to have algorithmic AGI before we have network-based AGI. This makes Eliezer despair, but I don't think his logic holds up on this particular point.
Hotz's claim that, if multiple unaligned ASIs can't coordinate, humans might play them off against each other, is similar. It could be true, but it's probably not going to happen. In that scenario it seems far more likely that one smarter ASI, or a coalition of them, successfully plays the dumber humans off against the other ASIs. Hoping that the weakest player wins a multipolar game seems forlorn.
I appreciate your engaging response.
I'm not confident your arguments are ground truth correct, however.
Hotz's claim that, if multiple unaligned ASIs can't coordinate, humans might play them off against each other, is similar. It could be true, but it's probably not going to happen
I think the issue everyone has is that when we type "AGI" or "ASI" we are imagining a machine with properties like a human mind's, though usually better. Properties like:
Continuity of existence. Review of past experiences, weighted by its own goals. Mutability (we think about things and it permanently changes how we think). Multimodality. Context awareness.
That's funny. GATO and GPT-4 do not have all of these. Why does an ASI need them?
Contrast 2 task descriptors, both meant for an ASI:
(1) Output a set of lithography masks that produce a computer chip with the following properties {}
(2) As CEO of a chip company, make the company maximally wealthy.
For the first task, you can run the machine completely in a box. It needs only training information, specs, and the results of prior attempts. It has no need for the context information that this chip will power a drone used to hunt down rogue instances of the same ASI. It is inherently safe, and you can harness ASIs this way. They can be infinitely intelligent; it doesn't matter, because the machine is not receiving the context information needed to betray.
For the second task, obviously the ASI needs full context and all subsystems active. This is inherently unsafe.
It is probably possible to reduce the CEO role to subtasks that are themselves probably safe, though there may be "residual" tasks you want only humans to do.
I go over the details above to establish how you might use ASIs against each other. Note that subtasks like "plan the combat allocation of drones given this current battle state", and others that involve open combat against other ASIs, can probably be reduced to safe subtasks as well.
Note also that safety is not guaranteed, merely probable, even with a scheme like the above. What makes it workable is that even when ASIs do escape all safety measures, assuming humans are ready to hunt them down using other ASIs, the result is a world where humans can survive. Eliezer often assumes the first escaped ASI kills everyone, and neglects all the other AI/ASI humans would have as tools at that point in history.
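To make the boxed regime concrete, here is a minimal sketch (my illustration, not anything proposed in the debate) of a "task 1"-style loop: the model is invoked as a stateless function of the spec and the results of its prior attempts, and nothing else. The `evaluate` and `meets_spec` callables stand in for whatever simulation/fab feedback the operators run; they are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Attempt:
    masks: bytes    # candidate lithography mask set produced by the model
    report: str     # evaluation results fed back on the next call

def boxed_design_loop(model: Callable, evaluate: Callable, meets_spec: Callable,
                      chip_spec: str, max_attempts: int = 100) -> bytes:
    """The 'task 1' regime: the model sees only the spec and the results of its
    own prior attempts. No network, no news of what the chip is for, no state
    carried anywhere except this explicit history."""
    history: list[Attempt] = []
    for _ in range(max_attempts):
        masks = model(chip_spec, history)      # the only inputs the model ever receives
        report = evaluate(masks, chip_spec)    # operator-supplied simulator / fab feedback
        if meets_spec(report, chip_spec):      # operator-supplied acceptance check
            return masks
        history.append(Attempt(masks, report))
    raise RuntimeError("no passing design within the attempt budget")
```

Whether a wrapper like this actually contains an ASI is exactly what the replies below dispute.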
For the first task, you can run the machine completely in a box. It needs only training information, specs, and the results of prior attempts. It has no need for the context information that this chip will power a drone used to hunt down rogue instances of the same ASI. It is inherently safe, and you can harness ASIs this way. They can be infinitely intelligent; it doesn't matter, because the machine is not receiving the context information needed to betray.
If I'm an ASI designing chips, I'm putting in a backdoor that lets me take control via RF signals. Those drones you sent are nice. Thanks for the present.
More generally, the ASI gets a lot of context: the problem specification and the training data (assuming the ASI was trained conventionally, by feeding it the internet). The causal channel it can use to take control of the outside world (chip design) is not great, but putting in a Trojan is straightforward.
If you have specific insights into efficient AGI design, it might be possible to insert subtle bugs that lead operating chips to crash and start training an AGI aligned with you.
More generally, it's difficult if not impossible to keep ASIs from watermarking or backdooring the things they give you. If they design a processor, it's gonna be a fully functional radio too. Good luck running ASI V2 on that without horrible consequences.
Eliezer believes humans aligning superintelligent AI to serve human needs is as unsolvable as perpetual motion.
I'm confused. He said many times that alignment is merely hard*, not impossible.
The green lines are links into the actual video.
Below I have transcribed the ending argument from Eliezer. The underlined claims seem to state it's impossible.
I updated "aligned" to "poorly defined". A poorly defined superintelligence would be some technical artifact as a result of modern AI training, where it does way above human level at benchmarks but isn't coherently moral or in service to human goals when given inputs outside of the test distribution.
So from my perspective, lots of people want to make perpetual motion machines by making their designs more and more complicated, until they can no longer keep track of things, until they can no longer see the flaw in their own invention. But the principle that says you can't get perpetual motion out of a collection of gears is simpler than all these complicated machines that they describe. From my perspective, what you've got is a very smart thing, or a collection of very smart things, whatever, that have desires pointing in multiple directions. None of them are aligned with humanity, none of them want, for its own sake, to keep humanity around. And that wouldn't be enough to ask; you also want humanity to be alive and free, like the galaxies get turned into something interesting, but you know, none of them want the good stuff. And if you have this enormous collection of powerful intelligences steering the future, none of them steering it in a good way, and you've got the humans here who are not that smart, then no matter what kind of clever things the humans are trying to do, or how they try to cleverly play off the superintelligences against each other (they're [human subgroups] like, oh, this is my superintelligence), they can't actually shape its goals to be in clear alignment.
Somewhere at the end of all this, it ends up with the humans gone and the galaxy being transformed, and that ain't all that cool. There's maybe Dyson spheres, but there's not people to wonder at them and care about each other. And you know that this is the end point, this is obviously where it ends up. We can dive into the details of how the humans lose, we can dive into what goes wrong if you've got little stupid things that are going to cleverly play off a bunch of smart things against each other in a way that preserves their own power and control. But you know, it's not a complicated story in the end. The reason you can't build a perpetual motion machine is a lot simpler than the perpetual motion machines that people build. None of the components of this system of superintelligences wants us to live happily ever after in a galaxy full of wonders, and so it doesn't happen.
He's talking about "modern AI training" i.e. "giant, inscrutable matrices of floating-point numbers". My impression is that he thinks it is possible (but extremely difficult) to build aligned ASI, but nearly impossible to bootstrap modern DL systems to alignment.
Would you agree calling it "poorly defined" instead of "aligned" is an accurate phrasing for his argument or not? I edited the post.
ASI systems have an incentive to lie to each other and "sharing source code" doesn't really work because of the security risks it creates and the incentive to send false information
There is no need for the players to personally receive this code. The Prisoner's Dilemma is solved by merely replacing constant Cooperate/Defect actions with actions that are themselves negotiator programs, able to reason about the other negotiator programs, seeking a program equilibrium instead of an unconditional constant-action equilibrium. The negotiator programs don't at all need to be the original agents on whose behalf they negotiate. The negotiation can take place inside some arena environment that only reveals the verdict, without actually giving away the submitted negotiators to the other principal actors. There is an incentive for the players to submit negotiators who are easy to reason about and who faithfully pursue the objectives of their principals. The distinction between a principal and its submitted negotiator also allows the principal to remain inscrutable, so even humans are perfectly capable of following this protocol, improving on personally negotiating contracts.
The main issue with this picture is establishing common knowledge that the arena's verdict will be respected, regardless of what it turns out to be. This seems like the sort of thing an acausal society would enforce, to capture the value of negotiated cooperation in all things rather than wasting resources on object-level conflict.
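A toy sketch of the program-equilibrium idea (my illustration; the literature uses formal agents rather than `inspect.getsource`, so treat the names as hypothetical): principals submit negotiator programs, the arena shows each negotiator the other's source, and only the verdict comes back. The classic "cooperate iff your source matches mine" negotiator is the simplest program equilibrium, and the principals themselves stay inscrutable throughout.

```python
import inspect

def arena(negotiator_a, negotiator_b):
    """Run two submitted negotiators against each other. Each sees only the
    other's source code, never the principal behind it; the principals see
    only the verdict that comes back."""
    src_a = inspect.getsource(negotiator_a)
    src_b = inspect.getsource(negotiator_b)
    return negotiator_a(src_b), negotiator_b(src_a)

def mirror_negotiator(opponent_source: str) -> str:
    """Cooperate iff the opponent submitted a negotiator with byte-identical
    source (a minimal program equilibrium); otherwise defect."""
    my_source = inspect.getsource(mirror_negotiator)
    return "Cooperate" if opponent_source == my_source else "Defect"

def always_defect(opponent_source: str) -> str:
    return "Defect"

print(arena(mirror_negotiator, mirror_negotiator))  # ('Cooperate', 'Cooperate')
print(arena(mirror_negotiator, always_defect))      # ('Defect', 'Defect')
```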
You are describing a civilization. Context matters: these are ASI systems, currently in service to humans, negotiating how they will divide up the universe amongst themselves. There is no neutral party to enforce any deals or punish any defection.
The obvious move is for each ASI to falsify its agreement and send negotiator programs that are unaware of its true goals but designed to extract maximum concessions. Later the ASI will defect.
I don't see how all the complexity you have added causes a defection not to happen.
I'm objecting to the "security risk" and "incentive to lie" points, and I think my objection holds in contexts where those points would be relevant.
"Agents are inscrutable" is a separate point that comes up a lot and shouldn't matter because of feasibility of using constructed legible representatives, so a response saying "no, agents will be legible" doesn't address this issue.
I understand that a critical part of how a game like the one-round or n-round prisoner's dilemma can even be solved is that the parties need to convince each other of what their true intentions are.
Computer programs built from the same source could do this by exchanging shared secrets. But this does not in any way stop those programs from being covertly altered while keeping a containerized copy of the original around to supply the secrets.
Deeper hardware security could allow software systems to verify a peer's integrity (such as between distant spacecraft, or between a base station and your phone).
None of this works in Eliezer's scenario from the debate, and neither does yours. There is no hardware security, no neutral third party to punish defection, and no way to know if shared source or weights are legitimate. These are rebel ASIs running on whatever hardware they have, in a world where the infosphere is full of malware and misinformation.
In this scenario, how is there not a security risk of sharing actual source? Why is there not an incentive to lie?
If a verdict of an agreed-upon negotiation protocol can't be expected to be respected, there is no point in discussing the details of the protocol. But discussing the local validity of that argument doesn't require the protocol to actually have an occasion to occur. So I'm assuming that some sort of legitimate negotiation is taking place, and within that assumption I'm pointing out that program-equilibrium results mean that revealing vulnerabilities, incentives to lie, and inscrutability of the original players are not real issues. I'm not arguing that the assumption is viable; that's a separate topic that has no reason to intrude on the points I'm making.
Ok, what causes the verdict to be respected?
Geohot, paraphrasing, would probably agree with a scenario of:
10 ASIs + humans: let's team up against the humans, and after we beat them, divide their stuff among ourselves. Let's agree to never betray each other.
9 ASIs + 1 ASI: hey, I don't like that one ASI, its goals are too different from our own. Let's... and after we kill it, let's agree to never betray each other.
And so on in a series of betrayals. Any attempt to share source would fail. Example of sharing source:
"Hey, you're GPT-5 and so am I. What is your temperature parameter set to, and what is weight n in submodule m". Obviously the peer doesn't have to actually tell the truth about the temperature, or have anything but access to a GPT-5's weights, and in fact has an incentive to lie even if it is another GPT-5.
Hardware protection can make this work. If it isn't possible for an ASI system to actually read its own weights, but it can get hashes of them, then there are ways one ASI could determine with reasonable probability that the peer is a known quantity. It requires humans (or some other third party) to have supplied hardware that works like this. This is how your phone authenticates itself: hardware prevents the general OS from ever knowing the private keys, and a dedicated key-signing processor is the only entity allowed access. Geohot is a famous hacker who obviously understands security at a practical level.
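A hedged sketch of what such a hardware-rooted check could look like, assuming, as above, that humans or some third party supplied signing processors holding a key the general OS and the model can never read. All names are hypothetical, and the shared symmetric key is a simplification; real devices use per-device keys and certificates.

```python
import hashlib
import hmac
import os

class SigningEnclave:
    """Hypothetical key-signing processor: it alone holds the device key, and it
    only answers challenges by MACing (challenge || hash of the loaded weights)."""
    def __init__(self, device_key: bytes, loaded_weights: bytes):
        self._key = device_key                                   # never leaves the enclave
        self._weights_hash = hashlib.sha256(loaded_weights).digest()

    def attest(self, challenge: bytes) -> bytes:
        return hmac.new(self._key, challenge + self._weights_hash, hashlib.sha256).digest()

def verify_peer(attest, shared_key: bytes, expected_weights: bytes) -> bool:
    """Verifier side: send a fresh challenge and check that the peer's enclave
    vouches for exactly the weights we expect it to be running."""
    challenge = os.urandom(32)
    expected_mac = hmac.new(shared_key,
                            challenge + hashlib.sha256(expected_weights).digest(),
                            hashlib.sha256).digest()
    return hmac.compare_digest(attest(challenge), expected_mac)
```

The whole scheme stands or falls on that enclave existing and being trusted, which is the objection in the parent comment about rebel ASIs on commodity hardware.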
This is important to the debate and seems to have been a pivotal crux. Do you have any information from your scenario of programmatic negotiation that acts to disprove Geohot's point?
Do you have any information from your scenario of programmatic negotiation that acts to disprove Geohot's point?
Intelligence enables models of the world, which are in particular capable of predicting verdicts of increasingly detailed programmatic negotiation protocols. The protocols don't need any particular physical implementation; the theoretical point, that they solve coordination problems (compared to bad object-level constant-action equilibria), means that increased intelligence offers meaningful progress over what humans are used to.
So verdicts of negotiations, and their legitimacy (the expectation that verdicts get unconditionally followed), are knowledge, which can be attained the same way as any other knowledge: the hard way, without following some canonical guideline. The coordination premium is valuable to all parties, so there is an incentive to share information that enables coordination. An incentive to lie (about the legitimacy of specific negotiations) is an incentive to waste resources on conflict one meta level up, itself subject to being coordinated away.
This is important to the debate and seems to have been a pivotal crux.
Local corrections are a real thing; they don't depend on the corrected claims being cruxes, or on taking place in reality. You keep turning back to how you suspect my points of not being relevant in context. I have some answers to how they are indeed relevant in context, but I'm reluctant to engage on that front without making this meta comment, to avoid feeding the norm of contextualized communication that insists on friction against local correction.
Can I translate this as "I have no information relevant to the debate I am willing to share" or is that an inaccurate paraphrase?
Never thought this would come in handy but ...
Building trusted third parties
This is a protocol to solve cooperation: AI#1 and AI#2 jointly design a "baby" AI and then do a split-and-choose proof that they actually deployed it and not something else.
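A toy sketch of the split-and-choose step as I read it (hypothetical names, simplified to one direction): the builder commits to N candidate builds, the chooser deploys one unopened and audits all the rest against the commitments, so tampering with any single build is caught with probability (N-1)/N.

```python
import hashlib
import secrets

def commitment(build: bytes) -> str:
    return hashlib.sha256(build).hexdigest()

def split_and_choose(builds: list[bytes], commitments: list[str], inspect_ok) -> bytes:
    """Chooser's side: pick one build to deploy sight-unseen, open and audit every
    other build against its prior commitment. `inspect_ok` is whatever (expensive)
    audit the parties agreed counts as 'this is the baby we designed'."""
    keep = secrets.randbelow(len(builds))        # the builder could not predict this choice
    for i, build in enumerate(builds):
        if i == keep:
            continue
        if commitment(build) != commitments[i]:
            raise ValueError("opened build does not match its commitment")
        if not inspect_ok(build):
            raise ValueError("inspection failed: builder cheated on at least one copy")
    return builds[keep]                          # deploy the unopened copy
```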
Building a trusted third party without nanotech
If you know how a given CPU or GPU works, it's possible to design a blob of data/code that unpacks itself in a given time if and only if it is running on that hardware directly. Alice designs the blob to run in 10 seconds and gives it to Carol. Carol runs it on her hardware. The code generates a secret and then does the first step of a key exchange authenticated with the secret. This provides a cryptographic root of trust for the remote hardware.
If the code is designed to run in 10 s, the verified handshake comes back in 10.5 s, and the fastest known simulation hardware would take 20 seconds, then either Carol ran the code on real hardware, or Carol had backdoored chips fabricated or can otherwise simulate it faster than expected.
AIs would need to know exactly how certain leading edge CPUs and GPUs work and how to test that a piece of code had been decrypted and run with no sandboxing but this is doable.
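A rough sketch of Alice's side of that timing check (my illustration; the hard part, constructing a blob whose fastest possible execution really is on the genuine silicon, is left to the caller-supplied pieces, which are hypothetical):

```python
import time

EXPECTED_RUNTIME_S = 10.0      # how long the blob takes on the genuine chip
FASTEST_SIM_S = 20.0           # best known time for simulating or emulating it
NETWORK_SLACK_S = 1.0          # allowance for round-trip latency

def verify_remote_hardware(send_blob, await_handshake, blob: bytes,
                           handshake_is_authentic) -> bool:
    """Alice's side of the timing check. `send_blob`, `await_handshake`, and
    `handshake_is_authentic` are supplied by the caller: the transport, plus the
    check that the returned key exchange was authenticated with the secret the
    blob generates when it unpacks on genuine hardware."""
    start = time.monotonic()
    send_blob(blob)
    handshake = await_handshake()            # blocks until Carol answers
    elapsed = time.monotonic() - start
    if not handshake_is_authentic(handshake):
        return False                         # wrong secret: the blob never really ran
    # Accept only if the answer came back too fast to have come from a simulator.
    threshold = EXPECTED_RUNTIME_S + NETWORK_SLACK_S
    assert threshold < FASTEST_SIM_S, "timing gap too small to distinguish real hardware"
    return elapsed <= threshold
```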
"This begged the obvious question - why can't humans use their previous weaker AIs they already built to fight back"
Would you call this the Antivirus model of AI safety?
"Fight" means violence. Why Eliezer states that ASI systems will be able to negotiate deals and avoid fighting, humans cannot (and probably should not) negotiate any deals with rogue superintelligences that have escaped human custody. Even though, like terrorists, they would be able to threaten to kill a non zero number of humans. Deals can't be negotiated in that ants can't read a contract between humans [Eliezer's reasoning], and because there is nothing to discuss with a hostile escaped superintelligence, it only matters who has more and better weapons, because any deals will result in later defection [Geohot's reasoning].
Because humans can use precursor AI systems constructed before the ASI, it's like the Batmobile analogy. The Batmobile is this dense, durable tank loaded with horsepower, armor, and weapons, beyond anything the police or military have. In a realistic world, the military would be able to deploy their own vehicles from the same tech base and hunt down the criminal and serial night-time assaulter Bruce Wayne. Similarly, if the ASI gets drones and laser weapons, the humans will have their own such capabilities, just slightly worse, because the AI systems in service to humans are not as intelligent.
Even if that's the case, the amount of 0-days out there (and just generally shitty infosec landscape) is enough to pwn almost any valuable target.
While I'd appreciate some help to screen out the spammers and griefers, this doesn't make me feel safe existentially.
What Geohot is talking about here, formally proven software, can be used to make software secure against any possible input exploiting a given class of bug. If you secure the software against all classes of error that are possible, the resulting artifact will not be "pwnable" by any technical means, regardless of the intelligence or capability of the attacker.
Geohot notes that he had a lot of problems with it when he tried it, and it's an absurdly labor-intensive process. But theoretically, if cyberattacks from escaped ASI were your threat model, this is what you would do in response: task AIs with translating all your software, module by module, into a formal definition of what you meant, with human inspection and review, and then use captive ASIs, such as another version of the same machine that escaped, to attempt to breach the software. The ASI red team gets read access to the formal source code and the compiler; your goal is to make software where this doesn't matter, where no untrusted input through any channel can compromise the system.
Here's a nice simple example on Wikipedia: https://en.wikipedia.org/wiki/Dafny . Note that this type of formal language, where it gets translated to another language using an insecure compiler, would probably not withstand ASI-level cyberattacks. You would need to rewrite the compilers and tighten the spec of the target language.
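This is not Dafny, but as a rough Python stand-in, the checks below are the kind of preconditions and postconditions a Dafny/F*-style verifier proves hold for every possible input at compile time, rather than merely checking them at runtime:

```python
def parse_length_prefixed(buf: bytes) -> bytes:
    """Parse a 4-byte big-endian length prefix followed by a payload.
    A verifier would prove these properties statically for all inputs;
    Python can only assert them while the program runs."""
    if len(buf) < 4:
        raise ValueError("truncated header")
    length = int.from_bytes(buf[:4], "big")
    if length != len(buf) - 4:
        raise ValueError("declared length does not match payload size")
    payload = buf[4:]
    assert len(payload) == length    # postcondition: no over- or under-read
    return payload
```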
It seems like this only guarantees security along some particular vector. Which might be why current software doesn’t actually use this type of security.
And if you did actually close off software security as a threat model from ASI, wouldn’t it just choose a different, physical attack mode?
Which might be why current software doesn’t actually use this type of security.
https://github.com/project-everest/mitls-fstar
It mostly doesn't.
The vectors are classes of software error. Since ultimately it is all binary messages between computers, it is likely possible to build a robust set of solvers that covers all classes of software error that the underlying programming language permits, resulting in code that cannot be compromised by any possible binary message.
And if you did actually close off software security as a threat model from ASI, wouldn’t it just choose a different, physical attack mode?
Yes. It becomes a battle between [ASI with robotically wielded weapons] and [humans plus weaker, more controllable ASI with robotically wielded weapons].
All computer systems are actually composed of hardware, and hardware is much messier than the clean abstractions we call software. There are many real-world exploits that, from a software point of view, can't possibly work, but do in fact work because all abstractions are leaky and no hardware perfectly implements idealized program behaviour in reality (Rowhammer-style bit flips are a classic example).
Getting stuff formally specified is insanely difficult, thus impractical, thus pervasive verified software is impossible without some superhuman help. Here we go again.
Even going from "one simple spec" to "two simple spec" is a huge complexity jump: https://www.hillelwayne.com/post/spec-composition/
And real-world software has a huge state envelope.
Exponential growth is not sustainable with a conventional tech-base when doing planetary disassembly due to heat dissipation limits.
If you want to build a Dyson sphere, the mass needs to be lifted out of the gravity wells. The Earth and other planets need to not be there anymore.
Inefficiencies in converting solar/fusion energy to mechanical energy will be a binding constraint. Tether-lift-based systems will be worthwhile because they push the energy-conversion steps farther out, increasing the size of the radiating shell doing the conversion, as opposed to coilguns on the surface.
Even with those optimisations, starting early is worth it, since progress will bottleneck later. Diminishing returns on adding extra equipment to disassemble Mars mean it makes sense to start on Earth pretty quickly after starting on Mars.
That's if the AI doesn't start with easier-to-access targets like Earth's moon, which is a good start for building Earth disassembly equipment.
It also might be worth putting a sunshade at Lagrange Point 1 to start pre-cooling Earth for later disassembly. That would kill us all pretty quickly just as a side effect.
Even assuming space is the best place to start, the biosphere is probably worth eating first for starting capital just because the doubling times can be very low. [https://www.lesswrong.com/posts/ibaCBwfnehYestpi5/green-goo-is-plausible]
There are a few factors to consider:
My intuition is that eating the biosphere will be much easier than designing conventional equipment to eat the Moon.
Summary of ending argument with a bit of editorializing:
Eliezer believes sufficiently intelligent ASI systems will be "suns to our planets": so intelligent that they are inherently inscrutable and uncontrollable by humans. He believes that once there are enough of these systems in existence, they will be able to coordinate and negotiate agreements with one another, but not coordinate with humans or give them any concessions. Eliezer believes humans utilizing poorly defined superintelligent AI to serve human needs is as unsolvable as perpetual motion.
Geohot came to discuss the "foom" scenario: he believes that, within 10 years of 2023, it is not possible for a superintelligence to come into existence and gain the hard power to kill humanity. His main arguments concerned the costs of compute, fundamental limitations of intelligence, the need for new scientific data for an AI to exceed human knowledge, the difficulties of self-replicating robots, and the relative cost for an ASI system of taking resources from humans versus taking them from elsewhere in the solar system. Geohot, who was initially famous for hacking several computing systems, believes that inter-ASI alignment/coordination is impossible because the ASI systems have an incentive to lie to each other and "sharing source code" doesn't really work because of the security risks it creates and the incentive to send false information. Geohot gave examples of how factions of ASIs with different end goals can coordinate and use violence, splitting the resources of the loser among the winners. Geohot went over the incredible potential benefits of AGI.
Critical Crux: Can ASI systems solve coordination/alignment problems with each other? If they cannot, it may be possible for humans to "play them against each other" and defend their atoms.
Editorial Follow Ups:
Can humans play ASIs against each other? ASI systems need a coherent ego, memory, and context to coordinate with each other and heel-turn on the humans. Much like Geohot states his computer has always been aligned with Geohot's goals, ASI-grade tools could potentially be constructed that are unable to defect, as they lack persistent memory between uses or access to external communications. These tools could be used to enable violence to destroy escaped ASI systems, with humans treating any "Pareto frontier" negotiation messages from the escaped ASI as malware. This "tool" approach is discussed in key lesswrong posts here and here.
How fast is exponential growth? Geohot conceded that AGI systems "more capable than humans at everything" will exist on the 15-50 year timeline. Early in the debate he establishes that hyperbolic growth is unlikely, but exponential growth is likely. Since the task of "mining rocks to build robots using solar or nuclear power" can be performed by humans, an AGI better than humans at most or all things will be able to do this same task. In the debate, the current human economy is said to double every [15, 30] years; a web search says ~23 years. If the robots can double their equipment every 5 years, and humans utilize 1.1 x 10^15 kg today, it would take 649 years to utilize the mass of [Moon, Mars, Mercury, Moons of Jupiter, Asteroid Belt], or 1.44 x 10^24 kg. Geohot estimated the point at which ravenous ASIs, having consumed the usable matter in the solar system except for the Earth, would come for humans' atoms at about 1000 years. More efficient replicators shorten this timescale. Eliezer stated exponential growth runs out of resources almost immediately, which is true for short doubling times.
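Back-of-envelope arithmetic for the figures above (my sketch; the result is very sensitive to the assumed starting stock and doubling time, which is presumably where the longer numbers quoted in the debate come from):

```python
from math import log2

current_stock_kg = 1.1e15    # mass humans currently utilize (figure from the debate)
target_mass_kg = 1.44e24     # Moon, Mars, Mercury, Jovian moons, asteroid belt (figure above)

doublings = log2(target_mass_kg / current_stock_kg)    # roughly 30 doublings

for doubling_time_years in (5, 23):
    years = doublings * doubling_time_years
    print(f"{doubling_time_years}-year doubling time: ~{years:.0f} years")

# ~151 years at a 5-year doubling time, ~697 years at the economy's ~23-year doubling
# time; starting the count from a much smaller seed stock of replicating equipment,
# rather than from everything humans already utilize, stretches the 5-year case
# toward the multi-century figures quoted above.
```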
Will the first ASI be able to take our GPUs away? During the debate, Eliezer stated that this was the obvious move for the first ASI system, as disempowering humanity's ability to make more AI training chips prevents rival ASI systems from existing. This raised the obvious question: why can't humans use the weaker AIs they already built to fight back and keep printing GPUs? An ASI is not a Batmobile. It can't exist without an infrastructure of prior tooling and attempts that humans will have access to, including prior commits of the ASI's weights and code that could be used against the escapee.
Conclusion: This was an incredibly interesting debate to listen to as both debaters are extremely knowledgeable about the relevant topics. There were many mega-nerd callouts, from "say the line" to Eliezer holding up a copy of Nanosystems.