I realise posting this here might be preaching to the converted, but I think it could be interesting for some people to see a perspective from someone slow to get on board with worrying about AI alignment.

I’m one of those people who find it hard to believe that misaligned Artificial General Intelligence (AGI) could destroy the world. Even though I’ve understood the main arguments and can’t satisfyingly refute them, a part of my intuition won’t easily accept that it’s an impending existential threat. I work on deploying AI algorithms in industry, so I have an idea of both how powerful and how limited they can be. I also get why AI safety in general should be taken seriously, but I struggle to feel the requisite dread.

The best reason I can find for my view is that there is a lot of “Thinkism” in arguments for AGI takeoff. Any AGI that wants to exert influence outside of cyberspace, e.g. by building nanobots or a novel virus, will ultimately run into problems of computational irreducibility: it isn’t possible to model everything accurately, so empirical work in the physical world will always be necessary. These kinds of experiments are slow, messy and resource-intensive. So any AGI is going to hit limits when it tries to influence the physical world. I do realise there are loads of ways an AGI could cause a lot of damage without inventing new physical technologies, but this line of thinking still slowed the scenario down enough for me to worry less about alignment issues.

That was the case until I started realising that alignment problems aren’t limited to the world of AI. If you look around, you can see them everywhere. The most obvious example is climate change: there is a clear misalignment between the motivations of the petroleum industry and the long-term future of humanity, and it is causing a catastrophic problem. The corporate world is full of such alignment problems, from the tobacco industry misleading the public about the harms of smoking to social media companies hijacking our attention.

It was exploring the problems caused by social media that helped me grasp the scale of the issue. I wrote an essay to understand why I was spending so much time browsing the internet without much to show for it, or really enjoying the experience. You can read the full essay here, but the main takeaway for AI safety is that we can’t even deploy simple AI algorithms at scale without causing big societal problems. If we can’t manage in this easy case, how can we possibly expect to deal with more powerful algorithms?

The issue of climate change is even slower-acting and more problematic than social media. There’s also a clear scientific consensus, and a lot of public understanding of how bad it is, yet we still aren’t able to respond in a decisive and rational manner. Realising this finally drove home how much of an issue AGI misalignment is going to be. Even if it doesn’t happen at singularity-inducing speeds, it’s going to be incredibly destabilising and difficult to deal with. And even if the AGIs themselves could be aligned, we would still have to seriously worry about aligning the people who deploy them.

I think there might be a silver lining though. As outlined above, I’m suspicious of solutions that look like Thinkism and aren’t tested in the real world. However, since there are a whole bunch of existing alignment problems waiting to be resolved, they could act as real-world testing grounds before we run into a serious AGI alignment issue. I personally believe the misalignment between social media companies and their users could be a good place to start. It would be very informative to try to build machine learning algorithms for large-scale content recommendation that give people a feeling of flourishing on the internet, rather than time wasting and doom scrolling (I sketch a toy version of what I mean below). You can read more details of my specific thoughts about this problem in my essay.
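To make the idea concrete, here is a minimal sketch under my own assumptions of what swapping the ranking objective might look like. Everything in it is hypothetical: the Candidate fields, the notion of a predicted “was this time well spent?” score, and the blending weight are illustrations, not a description of any real recommender system.

```python
# Hypothetical sketch: re-rank candidate posts by a predicted well-being score
# instead of raw predicted engagement. All names and numbers are illustrative.
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    post_id: str
    predicted_engagement: float  # e.g. probability of a click or long dwell time
    predicted_wellbeing: float   # e.g. predicted answer to "was this time well spent?"


def rank_feed(candidates: List[Candidate], wellbeing_weight: float = 0.8) -> List[Candidate]:
    """Sort candidates by a blend of well-being and engagement, weighting well-being heavily.

    A pure engagement ranker would sort by predicted_engagement alone; the point
    of the sketch is that only the objective changes, not the model class.
    """
    def score(c: Candidate) -> float:
        return (wellbeing_weight * c.predicted_wellbeing
                + (1.0 - wellbeing_weight) * c.predicted_engagement)

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    feed = rank_feed([
        Candidate("outrage_bait", predicted_engagement=0.9, predicted_wellbeing=0.2),
        Candidate("long_read_from_a_friend", predicted_engagement=0.4, predicted_wellbeing=0.8),
    ])
    # The low-engagement, high-well-being post ranks first.
    print([c.post_id for c in feed])
```

The hard and unsolved part, of course, is obtaining a trustworthy well-being signal at scale without it being gamed; the sketch only shows where such a signal would plug in, i.e. that the change is to the objective rather than to the model.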

As a final bonus, I think there was something else that made it difficult for me to grok the AI alignment problem: I find it hard to intuitively model psychopathic actors. Even though I know an AI wouldn’t think like a human, if I try to imagine how it might think, I still end up giving it a human thought process. I finally managed to break this intuition by reading a great short story by Ted Chiang, Understand. I recommend anyone who hasn’t read it to do so with AI in mind. It really gives you a feel for the perspective of a misaligned super-intelligence. Unfortunately, I think for the ending to work out in the same way, we’d have to crack the alignment problem first.

So, now that this non-believer has been converted, I finally feel on board with all the panic, which hasn’t been helped by the insane progress in AI capabilities this year. It’s time to start thinking about this more seriously…

I'm sure I'm not the first to have these thoughts, so if you can share any links below for further reading, it would be appreciated.


The fact that you feel it is appropriate to use the words "non-believer" and "converted" says to me not just that you're likely to communicate something concerning, but that the concerning thing it implies may actually be true in this instance, which itself seems bad to me. I am quite worried about the degree of inter-agent coprotection misalignment in the world today, and I do think there's something spiritual to the task of promoting morality in a world of advanced technology; yet I still wouldn't want to pitch my beliefs about any of this to someone as though even I think they're exactly true. A common pattern in people who think themselves "converted" to a "religion" is that they start taking things unquestioningly from that pseudo-religion's writings, and historically an issue I've had with LessWrong is the way confident nerds like myself (and, more relevantly, Soares, Yudkowsky, MIRI, et al.) tend to read like a confident religious text to someone who is super duper convinced that everyone on LessWrong is in fact consistently less wrong than the outside world. Don't let this intense emotional "whoa!" make you start taking Yudkowsky as some sort of prophet! Those folks exist, and mostly they don't contribute very much. Much better to treat this like a research field; like any other research field, it's filled with people trying to avoid being crackpots and sometimes even succeeding.

... hopefully this warning is irrelevant and you only speak in metaphor, but I figure it's good to push against accidental appeal to authority!

This is a great comment, but you don't need to worry that I'll be indoctrinated! 

I was actually using that terminology a bit tongue-in-cheek, as I perceive exactly what you describe about the religious fervour of some AI alignment proponents. I think the general attitude and vibe of Yudkowsky et al. is one of the main reasons I was suspicious about their arguments for AI takeoff in the first place.

I also agree, despite thinking that deceptive alignment by default is the likely outcome, which is a form of misalignment. I too dislike much of the early writing for putting too much probability on FOOM, even though I think AGI has a high probability of arriving this century; I'd assign only a 2-3% probability to FOOM.

However, I disagree with your criticism of Thinkism, primarily because my crux here is that in the past, communication about AI risk to the public had the exact opposite effect (that is, people failed to register the Doom part while enthusiastically embracing the Powerful part).

Another part is that even competent societies probably could slow down AGI, but if takeoff occurred during a crisis, or arrived before one via deceptive alignment, then society simply auto-loses. In other words, people assign too much probability mass to an equalized fight because of sci-fi, when the outcome is more binary than that.

I know an AI wouldn’t think like a human

This assertion is probably my biggest question mark in this discourse. It seems quite deeply baked into a lot of the MIRI arguments. I’m not sure it’s as certain as you think.

I can see how it is obviously possible we'd create an alien AI, and I think it's impossible to prove we won't. However, given that we are training our current AIs on imprints of human thought (e.g. text artifacts), and that it seems likely we will push hard for AIs to be trained to obey laws/morality as they increase in power (e.g. Google's AI safety team), it seems entirely plausible to me that the first AGIs might happen to be quite human-like.

In that world, I think we face problems not of the class "this AGI is in good faith inferring that we want to tile the world with paperclips", but of the much-easier-to-intuit class of "human alignment" that we also have no idea how to solve. Imagine digitizing any human and giving them increased cognitive power; I suspect most humans would become dictatorial and egotistical, and would take actions that many if not most people disagree with. Many humans could be persuaded to wipe the slate clean and start again, given the power to do so.

We already struggle to coordinate and agree on who we should grant relatively tiny slivers of political power (compared to this AGI), and the idea that we could all agree on what an "aligned" human-like mind looks like or prioritizes seems naive to me.

Nevertheless, it seems to me that this problem is more tractable than trying to prove things about completely generic minds.

Inasmuch as we do think "human-like AI alignment" is easier, it would push us towards things like neuromorphic AI architectures, interpretability research on those architectures, the science of human thought substrates, outlawing other architectures, and so on.

I actually agree that it's likely an AGI will at least start out thinking in a way kind of similar to a human, but that in the end it will still be very difficult to align. I really recommend that you check out Understand by Ted Chiang, which basically plays out the exact scenario you mentioned: a normal guy gets superhuman intelligence and chaos ensues.

awg:

If I might take a crack at summarizing, it seems that you've realized the full scope of both the inner alignment and outer alignment problems. It's pretty scary indeed. I think your insight that we don't even need full AGI for these kinds of things to become large social problems is spot on, and as an industry insider I'm super glad you're seeing that now!

One other thing that I think is worth contemplating is how much computational irreducibility would actually come into play with respect to still having real-world power and impact. I don't think you would need to get anywhere near perfect simulation in order to begin to have extremely good predictive power over the world. We're already seeing this in graphics and physics modeling. We're starting to be able to do virtual wind tunnel simulations that yield faster and better results than physical wind tunnels,[1] and I think we'll continue to find this to be the case. So presumably an advanced AGI would be able to create even better simulations that still work just as well in the real world and would forgo much of the need for actual physical experimentation. Though I'm curious what others think here!

  1. See here. Link to paper here.

Thanks for the comment! I'll read some more on the distinction between inner and outer alignment; that sounds interesting.

I don't think you would need to get anywhere near perfect simulation in order to begin to have extremely good predictive power over the world. We're already seeing this in graphics and physics modeling.

I think this is a good point, although these are cases where lots of data is available. So I guess any case in which you don't have the data ready would still have more difficulties. Off the top of my head I don't know how limiting this would be in practice, but I expect it would be in lots of cases.

awg:

So I guess any case in which you don't have the data ready would still have more difficulties.

I'm not so sure... another interesting/alarming thing is how these models are "grokking" concepts in a way that lets them generalize.