German writer of science-fiction novels and children's books (pen name Karl Olsberg). I blog and create videos about AI risks in German at www.ki-risiken.de and youtube.com/karlolsbergautor.
First of all, thank you for mentioning my post - I feel honored to serve as an example in this case! But to be clear, at the time I did not intend to define any specific red lines. I was just asking how we could decide when to stop development if we needed to.
I'm not sure whether you're arguing against using red lines in general, or just want to point out that so far we haven't broadly agreed on any and that all talk of self-restraint by the industry has been just lip service (to which I agree). In any case, I'm still convinced that we need to define red lines for AI development that we must not cross. The fact that this hasn't worked so far is no proof that such an approach is useless. It only proves that we need to do more to define, argue about, and agree upon such red lines.
Red lines are probably the most important concept in human civilization. From the Ten Commandments to tax law, they are the foundation of our rules for how we deal with each other, because they define what we are not allowed to do. Arguing that red lines for AI haven't worked so far and that we therefore shouldn't even try to define them is like saying that because someone got murdered, criminal law is unnecessary.
If we assume that there is a "point of no return" - maybe a certain combination of generality and intelligence in an AI that leads to it becoming uncontrollable - and we haven't solved alignment, then the only way to avoid an existential catastrophe is not to build it. Even if you think that alignment is in fact solved (or not really a problem), we should care about where this point of no return lies, so we know at what point we really need to be sure that you're right about that. (And it should also be clear who gets to decide this - the current way of private companies gambling with the future of mankind for personal gain clearly violates the Universal Declaration of Human Rights in my view.)
It may be difficult to define this point exactly. But that only makes it more important to draw red lines as quickly as possible, so we don't accidentally stumble into an existential catastrophe. And by "red lines" I don't mean "alarm signals which lead to a stop of development if detected" but specific rules for the decisions AI developers can make, e.g. how much training compute is allowed or what kinds of safety tests are required. This is no doubt a huge challenge, but that is no argument against trying to solve it. Saying "this is impossible" is just a self-fulfilling prophecy.
Central planners set targets in tons of nails produced or number of shoes, and factories duly maximized those numbers by making a few huge, unusable nails ... that nominally met the plan.
This sounded improbable to me, and indeed it seems to be wrong: https://skeptics.stackexchange.com/questions/22375/did-a-soviet-nail-factory-produce-useless-nails-to-improve-metrics Apparently, the "huge nails" originally appeared in a 1954 cartoon in the satirical magazine Krokodil and were later turned into an urban legend.
Yes, I agree, although I don't believe that CoT reading will carry us very far into the future; it is already pretty unreliable, and using it for optimization would ruin it completely.
Alignment is difficult - that is my whole point - with an emphasis on the hypothesis that with any purely reinforcement-learning-based approach, it is virtually impossible. IF we could find a way to create some kind of genuine "care about humans" module within an AI, similar to the kind of parent-child altruism that I write about, we might have a chance. But the problem is that no one knows how to do that, and even in humans it is a quite fragile mechanism.
One additional thought: Evolution created the parent-child care mechanism through something like reinforcement learning, but it optimizes for a different objective than our current AI training process does - not any kind of direct evaluation of human behavior, but survival and reproduction. Maybe the evolution of spiral personas is closer to the way evolution works. But of course, in this case AI is a different species, a parasite, and we are the hosts.
I'm not a machine learning expert, so I'm not sure what exactly causes sycophancy. I don't see it as a central problem of alignment; it is just a symptom of a deeper problem to me.
My point is more general: to achieve true alignment in the sense of an AI doing what it thinks is "best for us", it is not sufficient to train it by rewarding behavior. Even if the AI is not sycophantic, it will pursue some goal that we trained into it, so to speak, and that goal will most likely not be what, in hindsight, we would have wanted it to be.
Contrast that with the way I behave towards my sons: I have no idea what their goals are, so I can't say that my goals are "aligned" with theirs in any strict sense. Instead, I care about them - their wellbeing, but also their independence from me and their ability to find their own way to a good life. I don't think this kind of intrinsic motivation can be "trained" into an AI with any kind of reinforcement learning.
Tricky hypothesis 2: But the differences between the world of today and the world where ASI will be developed don't matter for the prognosis.
I don't think that the authors implied this. Right in the first chapter, they write:
If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.
(emphasis mine). Even if it is not always clearly stated, I don't think they believe that ASI should never be developed, or that it is impossible in principle to solve alignment. Their central claim is that we are much farther from solving alignment than from building a potentially uncontrollable AI, so we need to stop trying to build it.
Their suggested measures in part III (whether helpful/feasible or not) are meant to prevent ASI under the current paradigms, with the current approaches to alignment. Given the time gap, I don't think this matters very much, though - if we can't prevent ASI from being built as soon as it is technically possible, we won't be in a world that differs enough from today's to render the book title wrong.
Thank you very much for this post, which is one of the scariest posts I've read on LessWrong - mainly because I didn't expect that this could already be happening right now, at this scale.
I have created a German-language video about this post for my YouTube channel, which is dedicated to AI existential risk:
Thanks again! My drafts are of course just ideas, so they can easily be adapted. However, I still think it is a good idea to create a sense of urgency, both in the ad and in books about AI safety. If you want people to act, even if it's just buying a book, you need to do just that. It's not enough to say "you should read this"; you need to say "you should read this now" and give a reason for that. In marketing, this is usually done with some kind of time constraint (20% off, only this week ...).
This is even more true if you want someone to take measures against something that, in the minds of most people, is still "science fiction" or even "just hype". Of course, just claiming that something is "soon" is not very strong, but it may at least raise a question ("Why do they say this?").
I'm not saying that you should give any specific timeline, and I fully agree with the MIRI view. However, if we want to prevent superintelligent AI and we don't know how much time we have left, we can't just sit around and wait until we know when it will arrive. For this reason, I have dedicated a whole chapter to timelines in my own German-language book about AI existential risk and also included the AI-2027 scenario as one possible path. The point I make in my book is not that it will happen soon, but that we can't know it won't happen soon, and that there are good reasons to believe that we don't have much time. I use my own experience with AI since my Ph.D. on expert systems in 1988, and Yoshua Bengio's blog post on his change of mind, as examples of how fast and surprising progress has been even for someone familiar with the field.
I see your point about how a weak claim can water down the whole story. But if I could choose between 100 people convinced that ASI would kill us all but with no sense of urgency, and 50 or even 20 who believe both that the danger is real and that we must act immediately, I'd choose the latter.
Thanks! I don't have access to the book, so I didn't know about the timelines stance they take.
Still, I'm not an advertising professional, but hedged modal verbs like "may" and "could" seem significantly weaker to me. As far as I know, they are rarely used in advertising. Of course, the ad shouldn't contain anything that contradicts what the book says, but "close" seems sufficiently unspecific to me - for most laypeople who have never thought about the problem, "within the next 20 years" would probably seem pretty close.
A similar argument could be made about the second line, "it will kill everyone", whereas the book title says "would". But again, I feel "would" is weaker than "will" (some may interpret it to mean that additional prerequisites, like "consciousness", would be necessary for an ASI to kill everyone). Of course, "will" can only be true if a superintelligence is actually built, but that goes without saying, and the fact that the ASI may not be built at all is also implicit in the third line, "we must stop this".
I agree with most of this - red lines that aren't respected are useless. However, giving up on drawing red lines doesn't solve any problems in my opinion - ignoring them or redrawing them is itself a signal. But I agree that what we need most is enforcement.