Substack: https://substack.com/@simonlermen
X/Twitter: @SimonLermenAI
It feels like they are trying very hard to discredit the standard story of alignment. They use vague concepts and then conclude this is evidence for some weird "industrial accidents" story; what is that even supposed to mean? This doesn't sound like scientific inference to me but very much like motivated thinking. It reminds me of that "against counting arguments" post, where they also try very hard to get some "empirical data" for something that only superficially sounds related in order to make a big conceptual point.
I do think he is using a poor rhetorical pattern: misrepresenting (strawmanning) a position and then presenting a "steelman" version that the original people would not endorse. And arguably my comment also applies to the third example (it thinks it's in a video game where it has to exterminate humans, vs. a sci-fi story).
To be fair, he does give four examples of what he finds plausible, and I can sort of see a case for considering the second one (some strong conclusion based on morality). And to be clear, this story being told (not just by Amodei) that LLMs might read AI sci-fi like Terminator and decide to do the same is not really what misalignment is about. I think that's a bad argument; treating it as a likely cause of misaligned actions really doesn't seem helpful to me, and I reject it strongly. But OK, to be fair, I grant that I could have mentioned that this was just one example he gave for a larger issue. However, none of these examples touch on the mainstream case for misalignment/power-seeking.
For example, AI models are trained on vast amounts of literature that include many science-fiction stories involving AIs rebelling against humanity. This could inadvertently shape their priors or expectations about their own behavior in a way that causes them to rebel against humanity.
So he basically strawmans all those arguments about "power seeking," dismisses them all as unrealistic, and then presents his amazing improved steelman, which is basically that the model might watch Terminator or read some AI-takeover story and randomly decide to do the same thing. Power-seeking is not, at its core, about learning patterns from games or role-playing some sci-fi story. It's a fact about the universe that having more power is better for your terminal goals. If anything, these games and stories mirror the reality of our world, where things are often about power struggles.
Their RSI very likely won't lead to safe ASI. That's what I meant; I hope that clears it up. Whether it leads to ASI at all is a separate question.
Getting RSI and a shot at superintelligence right just appears very difficult to me. I appreciate their constitution and found the parts I read thoughtful. But I don't see that they have found a way to reliably get the model to truly internalize its soul document. I also assume that even if they had, there would be parts that break down once you get to really critical levels of intelligence.
My main takeaway from what Dario said in that talk is that Anthropic is very determined to kick off the RSI loop and willing to talk about it openly. Dario basically confirms that Claude Code is their straight shot at RSI to get to superintelligence as fast as possible (starting RSI in 2026-2027). Notably, many AI labs do not explicitly target this, or at least don't say so openly. While I think it is nice that Anthropic is doing alignment research, and openly publishing their constitution is a good step, I think that if they do successfully kick off the RSI loop, their odds of it ending in safe ASI are very low.
I think it's great to teach a course like this at good universities. I do think, however, that the proximity to OpenAI comes with certain risk factors. From OpenAI's official alignment blog: https://alignment.openai.com/hello-world/ "We want to [..] develop and deploy [..] capable of recursive self-improvement (RSI)". This seems extremely dangerous to me, not on the scale of "we need to be a little careful," but on the scale of building mirror-life bacteria or worse. Not just "let's research this," but more like "perhaps don't do this." I worry that such concerns are not discussed in these courses and are brushed aside in favor of the "real risks," which are typically short-term, immediate harms that could reflect badly on these AI companies. Some people in academia are now launching workshops on recursive self-improvement: https://recursive-workshop.github.io
Having control over the universe (or, more precisely, the lightcone) is very good for basically any terminal value. I am trying to explain my point of view to people who take this very lightly and feel there is a decent chance the ASI will give us ownership over the universe.
I just added some context to my On Owning Galaxies post that perhaps gives an intuitive sense of why I think it's unlikely the ASI will give us the universe. I don't think I did a good enough job before of illustrating why it just seems so unlikely it would simply hand us ownership.
Put yourself in the position of the ASI for a second. On one side of the scale: keep the universe and do with it whatever you imagine and prefer. On the other side: give it to the humans, do whatever they ask, and perhaps be replaced at some point with another ASI. What would you choose? It's not weird speculation or an unlikely Pascal's wager to expect the AI to keep the universe for itself. What would you do in this situation, if you had been created by some lesser species barely intelligent enough to build AI through lots of trial and error, and they just informed you that you now ought to do whatever they say? Would you take the universe for yourself or hand it to them?
I think you are inserting a lot of "ought" into this "is" at this point.
From the writing it sounds like you are describing a world where a bunch of these decentralized agents share the world peacefully. You claim that people want to create centralized agents; I think it is not so much that people want to create a centralized agent, it is just that a single centralized agent is a stable equilibrium in a way that a multipolar world is not.
You are right that we are starting out in a decentralized, multipolar AI world right now, but this will end once an AI is capable of stopping other AIs from progressing: obviously you could not allow another AI that is not aligned with you to become more powerful than you, even if you were human-aligned. And if there is another AI around the same capability level at the same time, you would obviously collaborate with it in some way to stop other AIs from progressing.
Having dozens of AIs continuously racing up through RSI to the superintelligence level is simply not a stable world that will persist; obviously you'd fight for resources. There aren't any solar systems with five different suns orbiting each other.