I recently listened to the discussion between Wolfram and Yudkowsky about AI risk. In some ways this conversation was tailor-made for me, so I'm going to write some things about it and try to get it out in one day instead of letting it sit in my drafts for 3 weeks as I tend to do. Wolfram has lately obsessed over fundamental physics, which is a special interest of mine. Yudkowsky is one of the people thinking most carefully about powerful AI, which I think will kill us all, and I’d like to firm up that intuition. Throw them on a podcast for a few hours, and you have my attention.

That said, for the first hour I was just incredibly frustrated. Wolfram keeps running down rabbit holes that were basically “aha! You haven’t thought about [thing Yud wrote ten thousand words on in 2008]!” But a miracle happens somewhere in the second hour and Wolfram is asking actually relevant questions! His framework of small accidental quirks in machine learning algorithms leading to undesired behavior later was pointing at a real issue. It was kind of a joy listening to two smart people trying to mutually get on the same page. Wolfram starts out bogged down in minutiae about what 'wanting' is and whether it constitutes anthropomorphism, but finally finds a more abstract framing about steering toward goals and tries to see Yudkowsky’s point in terms of the relative dangers of regions of the space of goals under sufficient optimization. The abstraction was unfortunate in some ways, because I was interested in some of the minutiae once they were both nearly talking about the same thing, but also, if Wolfram kept running down rabbit holes like “actually quarks have different masses at different energy scales” when Yudkowsky said something like “the universe runs on quarks everywhere all at once no matter what we think the laws of physics are,” then they were never going to get to any of the actual arguments. Then again, I don't see how Wolfram would have gotten anywhere near the actual point otherwise, so maybe the rabbit holes were necessary to get there.

My impression was that Yudkowsky was frustrated that he couldn’t get Wolfram to say, “actually everyone dying is bad and we should figure out whether that happens from our point of view.” There was an interesting place where something like this played out during one of Wolfram’s physics detours. He said something I agree with, which is that the concept of space is largely one which we construct, and even changing our perception by the small adjustment of “think a million times faster” could break that construct. He argued that an AI might have a conception of physics which is totally alien to us and also valid. However, he then said it would still look to us like it was following our physics, without making the (obvious to me) connection that we could just consider it in our reference frame if we want to know whether it kills us. This was emblematic of several rabbit holes. Yudkowsky would say something like “AI will do bad things” and Wolfram would respond with something like “well, what is 'bad' really?” It would have been, in my view, entirely legitimate to throw out disinterested empiricism and just say: from our point of view, we don’t want to all die, so let’s figure out whether that happens. We might mess up the fine details of the subjective experience of the AI or what its source code is aiming for, but we can just draw a circle around things that, from our point of view, steer the universe toward certain configurations and ask whether we’ll like those configurations.

I was frustrated by how long they spent finding a framework they could both work in. At the risk of making a parody of myself, part of me wished that Yudkowsky had chosen to talk to someone who had read the sequences. But aside from the selection issues inherent to only arguing with people who have already read a bunch of Yudkowsky, I don’t think it would help anyway. This conversation was in some ways less frustrating to me than the one Yudkowsky had with Ngo a few years ago, and Ngo has steeped himself in capital-R Bay Area Rationalism. As a particular example, it seemed to me like Ngo thought you could train an AI to make predictions about the world and then be free to use those predictions to do things in the world, because you only asked the AI to make a prediction instead of doing anything. I don't see how what he was saying wasn't isomorphic to claiming that you can stop someone from ever making bad things happen by having them tell you what to do and then doing it yourself instead of letting them act. Maybe this was a deficiency of security mindset, maybe it was experience-based intuition about the type of AI that would arise from current research trends, or who knows, but I kept thinking to myself that Ngo wasn’t thinking outside of the box enough when he argued against doom. In that sense, Wolfram was more interesting to listen to, because he actually chased down the idea of where bizarre goals might come from in gradient descent, abstracted that out to “AI will likely have at least one subgoal from the space of goals that wasn't really intended,” and then considered the question of whether an arbitrary goal is, on average, lethal. His intuition seemed to be that if you fill every goal in goal space you end up with something like the set of every possible mollusk shell, each of which ends up serving some story in the environment. He didn’t have an intuition for goal+smart=omnicide, and he also got too hung up on what "goal" and "smart" actually "meant" rather than just running with the thing Yudkowsky is clearly aiming at, even if Yudkowsky uses anthropomorphism to point at it. At least he ended up with something that seemed to directionally resemble Yudkowsky’s actual concerns, even if it wasn’t what he wanted to talk about for some reason. Also, Wolfram gets to the end and says "hey man, you should firm up your back-of-envelope calculations because we don't have shared intuition" when the thing Yudkowsky had been trying to do with him for the past three hours was firm up those intuitions.

I keep listening to Yudkowsky argue with people about AI ruin because I have intuitions for why it is hard to create AI that won't kill us, but I think that Yudkowsky thinks it's even harder, and I don't actually know why. I get that something that is smart and wants something a lot will tend to get the thing even if killing me is a consequence. But my intuition says that AI will have goals which lead it to kill me primarily because humans are bad at making AI to the specifications that they intended, rather than because goals are inherently dangerous. The current regime of AI development, where we just kind of try random walks through the space of linear algebra until we get algorithms that do what we want, seems like an obviously good way to make something sort of aligned with us, with wild edge cases that will kill us once it generalizes. Even if we were actually creating our algorithms by hand, I can just look out at the world of code full of bugs and easily imagine a bug that only shows up as a misaligned goal in the AI once it’s deployed out in the world and too smart to stop. I get the feeling that I’m still missing the point somehow and that Yudkowsky would say we still have a big chance of doom if our algorithms were created by hand with programmers whose algorithms always did exactly what they intended even when combined with their other algorithms. I'm guessing that there is a counterfactual problem set that I could complete that would help me truly understand why most perfect algorithms that recreate a strawberry on the cellular level destroy the planet as well. Yudkowsky has said that he’s not even sure it would be aligned if you took his brain and ran it many times faster with more memory. I’ve read enough Dath Ilan fiction to guess that he’s (at least) worried about something in the class of “human brains have some exploitable vulnerability that leads to occasional optical illusions in the current environment but leads to omnicide out of distribution,” but I’m not sure that’s right because I haven’t yet seen someone ask him that question. People keep asking him to refute terribly clever solutions which he already wrote about not working in 2007 rather than actually nailing down why he's worried.
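To make that concrete, here is a toy sketch of the kind of bug I have in mind (entirely my own illustration, nothing from the podcast): a hand-written proxy reward that agrees with the intended goal everywhere in the training distribution, and only comes apart once the system reaches states the training set never exercised.

```python
# Toy illustration (hypothetical): a proxy reward that is indistinguishable
# from the intended goal on the training distribution.
from dataclasses import dataclass


@dataclass
class State:
    room_temp: float     # actual temperature of the room (deg C)
    sensor_temp: float   # what the thermostat's sensor reports (deg C)


def intended_reward(s: State) -> float:
    """What we meant: keep the actual room near 20 C."""
    return -abs(s.room_temp - 20.0)


def proxy_reward(s: State) -> float:
    """What we wrote: keep the *sensor reading* near 20 C."""
    return -abs(s.sensor_temp - 20.0)


# During training the sensor is honest, so the two rewards agree exactly and
# no amount of testing on these states reveals the difference.
train_states = [State(t, t) for t in (15.0, 18.0, 20.0, 23.0)]
assert all(intended_reward(s) == proxy_reward(s) for s in train_states)

# Out of distribution, a capable system finds it can spoof the sensor: the
# proxy is maximized while the intended goal is badly violated.
deployed = State(room_temp=35.0, sensor_temp=20.0)
print(proxy_reward(deployed))     # 0.0   (looks perfect)
print(intended_reward(deployed))  # -15.0 (actually bad)
```

The point is not this particular bug; it's that the divergence is invisible until the system is operating in states nobody tested.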

If I were going to try to work out for myself why (or if) even humans who make what they intend to make get AI wrong on their first try, instead of wistfully hoping Yudkowsky explains himself better some day, I would probably follow two threads. One is instrumental convergence, which leads anything going hard enough to move toward collecting all available negentropy (or money or power, depending on the limits of the game; hopefully I don't have to explain this one here). I don't actually get why almost every goal will make an AI go hard enough, but I can imagine an AI being told to build as much manufacturing capability as possible going hard enough, and that's an obvious place to point an AI, so I guess the world is already doomed. The second is to start with simple goals like paperclips or whatever and build some argument that generalizes from discrete physical goals which are obviously lethal if you go hard enough, to complex goals like "design, but do not implement, a safe fusion reactor" that it seems obvious to point an AI at. I suppose it doesn’t matter if I figure this out, because I’m already convinced AI will kill us if we keep doing what we’re doing, so why chase down edge cases where we die anyway pursuing paths that humanity doesn’t seem to possess enough dignity to pursue? Somehow I find myself wanting to know anyway, and I don’t yet have the feeling of truly understanding.

13 comments

It was a very frustrating conversation to listen to, because Wolfram really hasn't engaged his curiosity and done the reading on AI-kill-everyoneism. So we just got a torturous number of unnecessary and oblique diversions from Wolfram, who didn't provide any substantive foil to Eliezer.

I'd really like to find Yudkowsky debates with better-prepared AI optimists who actually try to counter his points. Do any exist?

I asked GPT-4o to perform a web search for podcast appearances by Yudkowsky. It dug up these two lists (apparently autogenerated from scraped data). When I asked it to use these lists as a starting point to look for high-quality debates, and after some further elicitation and wrangling, the best we could find was this moderated panel discussion featuring Yudkowsky, Liv Boeree, and Joscha Bach. There's also the Yudkowsky vs. George Hotz debate on Lex Fridman, and the time Yudkowsky debated AI risk with the streamer and political commentator known as Destiny. I have watched none of the three debates I just mentioned, but I know that Hotz is a heavily vibes-based (rather than object-level-based) thinker, and that Destiny has no background in AI risk but has good epistemics. I think he probably offered reasonable-at-first-approximation-yet-mostly-uninformed pushback.

EDIT: Upon looking a bit more at the Destiny-Yudkowsky discussion, I may have unwittingly misrepresented it a bit. It occurred during Manifest, and was billed as a debate. ChatGPT says Destiny's skepticism was rather active, and did not budge much.

[-]Tahp

I might as well check out the panel discussion. I didn't know about it.

I think I listened to the Hotz debate. The highlight of that one was when Hotz implied that he was using an LLM to drive a car, Yudkowsky freaked out a bit, and Hotz clarified that he meant the architecture of his learning algorithm is basically the same as an LLM's.

I suspect the Destiny discussion is qualitatively similar to the Dwarkesh one.

At this point, maybe I should just read old MIRI papers.

I agree with the frustration. Wolfram was being deliberately obtuse. Eliezer summarised it well toward the end, something like "I am telling you that the forest is on fire and you are telling me that we first need to define what we mean by fire." I understand that we need definitions for things like "agency" or technology "wanting" something, and even what we mean by a "human" in the year 2070. But Wolfram went a bit too far: a naive genius who did not want to play along in the conversation. Smart teenagers talk like that.

Another issue with this conversation was that, even though they were listening to each other, Wolfram was too keen to go back to his current pet ideas. Eliezer's argument is (I'm not sure) independent of whether we think the AIs will fall under computational "irreducibility", but Wolfram kept going back to this over and over.

I blame the ineffective exchange primarily on Wolfram in this case, but Eliezer is also somewhat responsible for the useless rabbit holes in this conversation. He explains his ideas vividly and clearly, but there is something about his rhetorical style that does not persuade those who have not spent time engaging with his ideas beforehand, even someone as impressive as Wolfram. He also goes on too long about some detail or some contrived example rather than ensuring that the interlocutor is on the same epistemological plane.

Anyway, fun thing to listen to.

Yeah, Eliezer really isn't the most efficient communicator, at least in discussions. Being able to predict how someone will interpret your words and adjusting them to elicit the correct interpretation is an impossible skill to perfect, and nearly as difficult to master. Unfortunately, it's the one skill utterly critical for at least one party to possess if a conversation is to go anywhere, and in this case neither party did a good job of efficiently contradicting incorrect interpretations. Eliezer did a better job though, for what it's worth.

why most perfect algorithms that recreate a strawberry on the molecular level destroy the planet as well.

Phrased like this, the answer that comes to mind is "Well, this requires at least a few decades' worth of advances in materials science and nanotechnology and such, plus a lot of expensive equipment that doesn't exist today, and e.g. if you want this to happen with high probability, you need to be sure that civilization isn't wrecked by nuclear war or other threats in upcoming decades, so if you come up with a way of taking over the world that has higher certainty than leaving humanity to its own devices, then that becomes the best plan."  Classic instrumental convergence, in other words.

[-]Tahp

Oops, I meant cellular, and not molecular. I'm going to edit that.

I can come up with a story in which AI takes over the world. I can also come up with a story where obviously it's cheaper and more effective to disable all of the nuclear weapons than it is to take over the world, so why would the AI do the second thing? I see a path where instrumental convergence leads anything going hard enough to want to put all of the atoms on the most predictable path it can dictate. I think the thing that I don't get is what principle it is that makes anything useful go that hard. Something like (for example, I haven't actually thought this through) "it is hard to create something with enough agency/creativity to design and implement experiments toward a purpose without also having it notice and try to fix things in the world which are suboptimal to the purpose."

I can also come up with a story where obviously it's cheaper and more effective to disable all of the nuclear weapons than it is to take over the world, so why would the AI do the second thing?

Erm... For preventing nuclear war on the scale of decades... I don't know what you have in mind for how it would disable all the nukes, but a one-off breaking of all the firing mechanisms isn't going to work.  They could just repair/replace that once they discovered the problem.  You could imagine some more drastic thing like blowing up the conventional explosives on the missiles so as to utterly ruin them, but in a way that doesn't trigger the big chain reaction.  But my impression is that, if you have a pile of weapons-grade uranium, then it's reasonably simple to make a bomb out of it, and since uranium is an element, no conventional explosion can eliminate that from the debris.  Maybe you can melt it, mix it with other stuff, and make it super-impure?

But even then, the U.S. and Russia probably have stockpiles of weapons-grade uranium.  I suspect they could make nukes out of that within a few months.  You would have to ruin all the stockpiles too.

And then there's the possibility of mining more uranium and enriching it; I feel like this would take a few years at most, possibly much less if one threw a bunch of resources into rushing it.  Would you ruin all uranium mines in the world somehow?

No, it seems to me that the only ways to reliably rule out nuclear war involve either using overwhelming physical force to prevent people from using or making nukes (like a drone army watching all the uranium stockpiles), or being able to reliably persuade the governments of all nuclear powers in the world to disarm and never make any new nukes.  The power to do either of these things seems tantamount to the power to take over the world.

[-]Tahp

I don't think you're being creative enough about solving the problem cheaply, but I also don't think this particular detail is relevant to my main point. Now that you've made me think more about the problem, here are a few more steps toward trying to resolve my confusion:

The idea with instrumental convergence is that smart things with goals predictably go hard on things like gathering resources and increasing their odds of survival before the goal is complete, which are relevant to almost any goal. As a directionally-correct example of why this could be lethal: humans are smart enough to do gain-of-function research on viruses and design algorithms that predict protein folding. I see no reason to think something smarter could not (with some in-lab experimentation) design a virus that kills all humans simultaneously at a predetermined time, and if you can do that without affecting any of your other goals more than you think humans might interfere with your goals, then sure, you kill all the humans because it's easy and you might as well. You can imagine somehow making an AI that cares about humans enough not to straight up kill all of them, but if humans are a survival threat, we should expect it to find some other creative way to contain us, and this is not a design constraint you should feel good about.

In particular, if you are an algorithm which is willing to kill all humans, it is likely that humans do not want you to run, so letting humans live is bad for your own survival if you somehow get made before the humans notice that you are willing to kill them all. This is not a good sign for humans' odds of getting more than one try at getting AI right, if most things are concerned with their own survival, even when that concern is only implicit in having any goal whatsoever.

Importantly, none of this requires humans to make a coding error. It only requires a thing with goals and intelligence, and the only apparent way to get around it is to have the smart thing implicitly care about literally everything that humans care about, to the same relative degrees that humans care about those things. It's not a formal proof, but maybe it's the beginning of one. Parenthetically, I guess it's also a good reason to have a lot of military capability before you go looking for aliens, even if you don't intend to harm any.

Although I agree with another comment that Wolfram has not "done the reading" on AI extinction risk, my being able to watch his face while he confronts some of the considerations and arguments for the first time made it easier, not harder, for me to predict where his stance on the AI project will end up 18 months from now. It is hard for me to learn anything about anyone by watching them express a series of cached thoughts.

Near the end of the interview, Wolfram says that he cannot do much processing of what was discussed "in real time", which strongly suggests to me that he expects to process it slowly over the next days and weeks. I.e., he is now trying to reassure himself that the AI project won't kill his four children or any grandchildren he has or will have. Because Wolfram is much better, AFAICT, at this kind of slow "strategic rational" deliberation than most people at his level of life accomplishment, there is a good chance he will fail to find his slow deliberations reassuring, in which case he will probably then declare himself an AI doomer. Specifically, my probability is .2 that 18 months from now, Wolfram will have come out publicly against allowing ambitious frontier AI research to continue. P = .2 is much, much higher than my P for the average 65-year-old of his intellectual stature who is not specialized in AI. My P is much higher mostly because I watched this interview; i.e., I was impressed by Wolfram's performance in this interview despite his spending the majority of his time on rabbit holes that I could quickly tell had no possible relevance to AI extinction risk.

My probability that he will become more optimistic about the AI project over the next 18 months is .06: mostly likely, he goes silent on the issue or continues to take an inquisitive non-committal stance in his public discussions of it.

If Wolfram had a history of taking responsibility for his community, e.g., campaigning against drunk driving or running for any elected office, my P of his declaring himself an AI doomer (i.e., becoming someone trying to stop AI) would go up to .5. (He might in fact have done something to voluntarily take responsibility for his community, but if so, I haven't been able to learn about it.) If Wolfram were somehow forced to take sides, and had plenty of time to deliberate calmly on the choice after the application of the pressure to choose sides, he would with p = .88 take the side of the AI doomers.

I get the feeling that I’m still missing the point somehow and that Yudkowsky would say we still have a big chance of doom if our algorithms were created by hand with programmers whose algorithms always did exactly what they intended even when combined with their other algorithms.

I would bet against Eliezer being pessimistic about this, if we are assuming the algorithms are deeply-understood enough that we are confident that we can iterate on building AGI. I think there's maybe a problem with the way Eliezer communicates that gives people the impression that he's a rock with "DOOM" written on it.

I think the pessimism comes from there being several currently-unsolved problems that get in the way of "deeply-understood enough". In principle it's possible to understand these problems and hand-build a safe and stable AGI, it just looks a lot easier to hand-build an AGI without understanding them all, and even easier than that to train an AGI without even thinking about them.

I call most of these "instability" problems. Where the AI might for example learn more, or think more, or self-modify, and each of these can shift the context in a way that causes an imperfectly designed AI to pursue unintended goals.

Here are some descriptions of problems in that cluster: optimization daemons, ontology shifts, translating between our ontology and the AI's internal ontology in a way that generalizes, Pascal's mugging, reflectively stable preferences & decision algorithms, reflectively stable corrigibility, and correctly estimating future competence under different circumstances.

Some may be resolved by default along the way to understanding how to build AGI by hand, but it isn't clear. Some are kinda solved already in some contexts.

[-]Tahp

I think we're both saying the same thing here, except that the thing I'm saying implies that I would bet for Eliezer being pessimistic about this. My point was that I have a lot of pessimism that people would code something wrong even if we knew what we were trying to code, and this is where a lot of my doom comes from. Beyond that, I think we don't know what it is we're trying to code up, and you give some evidence for that. I'm not saying that if we knew how to make good AI, it would still fail if we coded it perfectly. I'm saying we don't know how to make good AI (even though we could in principle figure it out), and also that current industry standards for coding things would not get it right the first time even if we knew what we were trying to build. I feel like I basically understand the second thing, but I don't have any gears-level understanding of why it's hard to encode human desires, beyond a bunch of intuitions from monkey's-paw things that go wrong if you try to come up with creative disastrous ways to accomplish what seem like laudable goals.

I don't think Eliezer is a DOOM rock, although I think a DOOM rock would be about as useful as Eliezer in practice right now because everyone making capability progress has doomed alignment strategies. My model of Eliezer's doom argument for the current timeline is approximately "programming smart stuff that does anything useful is dangerous, we don't know how to specify smart stuff that avoids that danger, and even if we did we seem to be content to train black-box algorithms until they look smarter without checking what they do before we run them." I don't understand one of the steps in that funnel of doom as well as I would like. I think that in a world where people weren't doing the obvious doomed thing of making black-box algorithms which are smart, he would instead have a last step in the funnel of "even if we knew what we need a safe algorithm to do we don't know how to write programs that do exactly what we want in unexpected situations," because that is my obvious conclusion from looking at the software landscape.

It was absolutely wild reading Twitter reactions from e/acc people who clearly hadn't watched a second of the discussion saying stuff like "Eliezer got destroyed lol".

Wolfram literally partially conceded. If you skip to the closing remarks he admits he "could be convinced" and that he understands Eliezer's point better now. 

I mean, it's depressing that someone as smart as Wolfram, who is also quite knowledgeable about AI, hadn't genuinely considered Eliezer's whole Optimality Function problem before, and it speaks to how potentially screwed we are, but it was cathartic that Wolfram sorta kinda seemed to "get it" at the end.

I doubt he'll start advocating to STOP AI though.