Epistemic status: Written for Blog Post Day III. I don't get to talk to people "in the know" much, so maybe this post is obsolete in some way.
I think that at some point at least one AI project will face an important choice between deploying and/or enlarging a powerful AI system, or holding back and doing more AI safety research.
(Currently, AI projects face choices like this all the time, except they aren't important in the sense I mean, because the AI isn't potentially capable of escaping and taking over large parts of the world, or doing something similarly bad.)
Moreover, I think that when this choice is made, most people in the relevant conversation will be insufficiently concerned/knowledgeable about AI risk. Perhaps they will think: "This new AI design is different from the classic models, so the classic worries don't arise." Or: "Fear not, I did [insert amateur safety strategy]."
I think it would be very valuable for these conversations to end with "OK, we'll throttle back our deployment strategy for a bit so we can study the risks more carefully," rather than with "Nah, we're probably fine, let's push ahead." This buys us time. Say it buys us a month. A month of extra time right after scary-powerful AI is created is worth a lot, because we'll have more serious smart people paying attention, and we'll have more evidence about what AI is like. I'd guess that a month of extra time in a situation like this would increase the total amount of quality-weighted AI safety and AI policy work by 10%. That's huge.
One way to prepare for these conversations is to raise awareness about AI risk and technical AI safety problems, so that more of the people in these conversations are well informed about the risks. I think this is great.
However, there's another way to prepare, which I think is tractable and currently neglected:
1. Identify some people who might be part of these conversations, and who already are sufficiently concerned/knowledgeable about AI risk.
2. Help them prepare for these conversations by giving them resources, training, and practice, as needed:
2a. Resources:
Perhaps it would be good to have an Official List of all the AI safety strategies, so that whatever rationale people give for why this AI is safe can be compared to the list. (See this prototype list.)
Perhaps it would be good to have an Official List of all the AI safety problems, so that whatever rationale people give for why this AI is safe can be compared to the list, e.g. "OK, so how does it solve outer alignment? What about mesa-optimizers? What about the malignity of the universal prior? I see here that your design involves X; according to the Official List, that puts it at risk of developing problems Y and Z..." (See this prototype list.)
Perhaps it would be good to have various important concepts and arguments re-written with an audience of skeptical and impatient AI researchers in mind, rather than the current audience of friends and LessWrong readers.
2b. Training & practice:
Maybe the person is shy, or bad at public speaking, or bad at keeping cool and avoiding fluster in high-stakes discussions. If so, some coaching and practice could go a long way. Maybe they have the opposite problems, frequently coming across as overconfident, arrogant, aggressive, or paranoid. If so, someone should tell them this and help them tone it down.
In general it might be good to do some role-play exercises or something, to prepare for these conversations. As an academic, I've seen plenty of mock-dissertation-defense sessions and mock-job-talk-question-sessions, which seem to help. And maybe there are ways to get even more realistic practice, e.g. by trying to convince your skeptical friends that their favorite AI design might kill them if it worked.
Note that most of part 2 can be done without having done part 1. This is important in case we don't know anyone who might be part of one of these conversations, which is true for many and perhaps most of us.
Why do I think this is tractable? Well, it seems like the sort of thing that people producing AI safety research can do on the margin, just by thinking more about their audience and maybe recording their work (or other people's work) on some Official List. Moreover, people who don't do (or even read) AI safety research can contribute too, e.g. by reading the literature on how to practice for situations like this and writing up the results.
Why do I think this is neglected? Well, maybe it isn't. In fact I'd bet that some people are already thinking along these lines. It's a pretty obvious idea. But just in case it is neglected, I figured I'd write this. Moreover, the Official Lists I mentioned don't exist, and I think they would if people were taking this idea seriously. Finally--and this more than anything else is what caused me to write this post--I've heard one or two people explicitly call this out as something that they don't think is an important use case for the alignment research they were doing. I disagreed with them, and here we are. If this is a bad idea, I'd love to know why.
Thanks for the thoughtful pushback! It was in anticipation of comments like this that I put in hedging language like "I think" and "perhaps." My replies:
1. Past experience has shown that even when particular AI risk arguments don't apply, an AI design is often still risky; we just haven't thought of the reasons why yet. So we should make a pessimistic meta-induction and conclude that even if our standard arguments for risk don't apply, the system might still be risky, and we should think more about it.
2. I intended those two "perhaps..." statements to be things the person says, not necessarily things that are true. So yeah, maybe they *say* the standard arguments don't apply. But maybe they are wrong. People are great at rationalizing, coming up with reasons to get to the conclusion they wanted. If the conclusion they want is "We finally did it and made a super powerful impressive AI, come on come on let's take it for a spin!" then it'll be easy to fool yourself into thinking your architecture is sufficiently different as to not be problematic, even when your architecture is just a special case of the architecture in the standard arguments.
Points 1 and 2 are each individually sufficient to vindicate my claims, I think.
3. I'm not operating under the assumption that I know more about the AI system someone is creating than the person who's creating it knows. The fact that you said this dismays me, because it is such an obvious straw man. It makes me wonder if I touched a nerve somehow, or struck the wrong tone, and raised your hackles.
4. Yes, I refer to their safety strategy as amateur. Yes, this is appropriate. AI safety is related to AI capabilities, but the two are distinct sub-fields, and someone who is great at one could be not so great at the other. Someone who doesn't know the AI safety literature, who does something to make their AI safe, probably deserves the title amateur. I don't claim to be a non-amateur AI scientist, and whether I'm a non-amateur AI safety person is irrelevant because I'm not going to be one of the people in The Talk. I do claim that e.g. someone like Paul Christiano or Stuart Russell is a professional AI safety person, whereas most AI scientists are not.
5. I agree that this is a possibility. This is why I said "say it buys us a month"; I meant that as an average over the various possibilities. In retrospect I was unclear; I should have clarified that it might not be a good idea to delay at all, for the reasons you mention. I agree we have to learn more about the situation; in retrospect I shouldn't have said "I think it would be better for these conversations to end X way" (even though that is what I think is most likely) but should instead have found some way to express the more nuanced position.
6. I agree with everything you say about overconfidence, echo chambers, etc., except that I don't think I was writing the bottom line first in this case. I was making a claim without arguing for it, and then I argued for it in the comments when you questioned it. It's perfectly reasonable (indeed necessary) to have some unargued-for claims in any particular finite piece of writing.