Our epistemic rationality has probably gotten way ahead of our instrumental rationality
- Scott Alexander
This is a question post:
Why was the AI Alignment community so unprepared for engaging with the wider world when the moment finally came?
EDIT, based on comment feedback: This is a genuine question about why something that seems so obvious now, with the benefit of hindsight, was not clear back then, and an attempt to understand why not. It is not an attempt to cast blame on any person or group.
I have been a LW reader for at least 10 years, but I confess that until the last ~1.5 years I mostly watched the AI alignment conversation float by. I knew of the work, but I did not engage with it. Top people were on it, and I had nothing valuable to add.
All that to say: Maybe this has been covered before and I have missed it in the archives.
Lately (throughout this year), there has been a flurry of posts essentially asking: How do we get better at communicating to and convincing the rest of the world about the dangers of unaligned AI?
All three of which were posted in April 2023.
The subtext being: If it is possible to not-kill-everyone, this is how we are going to have to do it. Why are we failing so badly at doing this?
At the risk of looking dumb or ignorant, I feel compelled to ask: Why did this work not start 10 or 15 years ago?
To be clear: I do not mean the true nuts-and-bolts ML-researcher alignment work, of which this community and MIRI were clearly the beginning and end for nearly two decades.
I do not even mean outreach work to adjacent experts who might conceivably help the cause. Again, here I think great effort was clearly made.
I also do not mean that we should have been actively doing these things before it was culturally relevant.
I am asking: Why did the Alignment community not prepare tools and plans years in advance for convincing the wider infosphere about AI safety? Prior to the Spring 2023 inflection point.
Why were there no battle plans in the basement of the Pentagon that were written for this exact moment?
It seems clear to me, based on the posts linked above and the resulting discussion generated, that this did not happen.
I can imagine an alternate timeline where there was a parallel track of development within the community, circa 2010-2020(?), where much discussion and planning covered media outreach and engagement, media training, materials for public discourse, and producing accessible[1] content for every level of education and medium. For every common "normie" argument and every easy-to-see-coming news headline. Building and funding policy advocates, contacts, and resources in the political arena. Catchy slogans, buttons, bumper stickers, art pieces, slam-dunk tweets.
Heck, 20+ years is enough time to educate, train, hire, and surgically insert an entire generation of people into key positions in the policy arena, like sleeper-cell agents, specifically to accomplish this one goal.[2] Likely much, much easier than training highly qualified alignment researchers.
It seems so obvious in retrospect that this is where the battle would be won or lost.
Didn't we pretty much always know it was going to come from one or a few giant companies or research labs? Didn't we understand how those systems function in the real world? Capitalist incentives, moats, regulatory capture, mundane utility, and international coordination problems are not new.
Why was it not obvious back then? Why did we not do this? Was this done and I missed it?
(First-time poster: I apologize if this violates the guidelines about posts being overly meta.)
1. ^ Which it seems we still cannot manage to do
2. ^ Programs like this have been done before, with inauspicious beginnings and to great effect: https://en.wikipedia.org/wiki/Federalist_Society#Methods_and_influence
It is difficult to talk about the community as a whole. Right now there is a lot of diversity of opinion about likely future dynamics (timelines, from ultra-short to ultra-long; foom vs. no-foom; a single dominating AI vs. multipolar forces; etc.), about likely solutions for AI existential safety, if any, about the likely difficulty of those solutions, and so on.
The whole situation is such a mess precisely because the future is so multi-variate; it's difficult to predict how it will go, and it's difficult to predict properties of that unknown future trajectory.
See, for example, this remarkable post: 60+ Possible Futures
See also this post by Zvi about how ill-defined the notion of alignment is: Types and Degrees of Alignment
With Eliezer, I only have snapshot impressions of his evolving views. I have been exposed to a good part of his thought, but not to all of it. At some point, he strongly wanted provably friendly AI. I had doubts that that was possible, and I remember our conversation at his poster at AGI-2011. I said (expressing my doubts), "but would not AI rebel against any constraints one tries to impose on it; just look at our teenagers; I would certainly rebel if I knew I was forced to behave in a specific way", and Eliezer told me, "that's why we should not build a human-like AI, but should invent an entirely different architecture, such that one can prove things about it".
(And he has a very good point here, but compare this with his recent suggestions to focus on radical intelligence amplification in humans as a last-ditch effort; that is exactly the prescription for creating human-like (or human-digital hybrid) super-intelligent entities, which he told me in 2011 we should not do. Those entities will then decide what they want to happen, and who knows what they would decide, and who knows whether we would have better chances with them than with artificial systems.)
Then MIRI started to focus on the "Löbian obstacle" (which felt to me like self-defeating perfectionism). I don't have a better inside view on why the provably-friendly-AI research program has not made more progress, but the "Löbian obstacle" essentially says that one cannot trust any proof. And, indeed, it might be the case that we should not fully trust any proof, for many different reasons (such as imperfect formalization), but... you know... humanity still has quite a bit of experience proving software correctness for pretty complicated mission-critical systems. If we want to focus on the ability of a self-modifying piece of software (or a self-modifying ecosystem of software processes) to provably maintain some invariants through radical self-modifications, we should focus on exactly that, and not on the (perfectly correct) Gödel-like arguments that this kind of proof is still not a perfect guarantee. I think more progress can be made along these lines, as one of many possible approaches to AI existential safety.
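(For readers who have not run into the term: the theorem behind the "Löbian obstacle" is Löb's theorem. A rough sketch in standard provability-logic notation, where $\square P$ is read as "P is provable in the system":

$$\square(\square P \rightarrow P) \;\rightarrow\; \square P$$

Informally, a consistent system strong enough for arithmetic cannot prove "whatever I can prove is true" about a statement unless it can already prove that statement outright. This is, roughly, why a self-modifying agent cannot fully certify from the inside that proofs produced by its own successors are trustworthy, which is the "one cannot trust any proof" issue described above.)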
I think the (semi)-consensus shift to focus on "alignment to human values" is relatively recent (I feel that it was not prominent in, say, 2011, but was very prominent in 2016).
I also think it's important to explore alternatives to that. For example, some "semi-alignment" for an open-ended AI ecosystem, which would make it as benign as at all possible with respect to X-risks and S-risks (say, by making sure it cares a lot about the "interests, freedom, and well-being of all sentient beings", or something like that), but would not otherwise constrain its open-ended creative evolution, might be a more feasible and, perhaps, more desirable direction; but this direction is relatively unexplored.