For example, AI models are trained on vast amounts of literature that include many science-fiction stories involving AIs rebelling against humanity. This could inadvertently shape their priors or expectations about their own behavior in a way that causes them to rebel against humanity.
So he basically strawmans all those arguments about "power seeking", dismisses them all as unrealistic, and then presents his improved steelman, which amounts to: the AI might watch Terminator or read some AI takeover story and decide to do the same thing. But power seeking is not, at its core, about learning patterns from games or role-playing some sci-fi story. It's a fact about the universe that having more power is better for your terminal goals. If anything, these games and stories mirror the reality of our world, where things are often about power struggles.
Their RSI very likely won't lead to safe ASI. That's what I meant, hope that clears it up. Whether it leads to ASI is a separate question.
Getting RSI and a shot at superintelligence right just appears very difficult to me. I appreciate their constitution and found the parts I read thoughtful, but I don't see that they have found a way to reliably get the model to truly internalize its soul document. I also assume that even if they had, there would be parts that break down once you get to really critical levels of intelligence.
My main takeaway from what Dario said in that talk is that Anthropic is very determined to kick off the RSI loop and willing to talk about it openly. Dario basically confirms that Claude Code is their straight shot at RSI to get to superintelligence as fast as possible (starting RSI in 2026-2027). Notably, many AI labs do not explicitly target this, or at least don't say so openly. While I think it is nice that Anthropic is doing alignment research, and that openly publishing their constitution is a good step, I think that if they do successfully kick off the RSI loop, they have very low odds of the outcome being safe.
I think it's great to teach a course like this at good universities. I do think, however, that the proximity to OpenAI comes with certain risk factors. From OpenAI's official alignment blog (https://alignment.openai.com/hello-world/): "We want to [..] develop and deploy [..] capable of recursive self-improvement (RSI)". This seems extremely dangerous to me, not on the scale of "we need to be a little careful", but on the scale of building mirror-life bacteria or worse. Less "let's research this" and more "perhaps don't do this". I worry that such concerns are not discussed in these courses and are brushed aside in favor of the "real risks", which are typically short-term, immediate harms that could reflect badly on these AI companies. Some people in academia are now launching workshops on recursive self-improvement: https://recursive-workshop.github.io
Having control over the universe (or, more precisely, the lightcone) is very good for basically any terminal value. I am trying to explain my point of view to people who take this very lightly and feel there is a decent chance the ASI will give us ownership of the universe.
I just added some context to my On Owning Galaxies post that perhaps gives an intuitive insight into why I think it's unlikely the ASI will give us the universe. I don't think I did a good enough job before of illustrating why it seems so unlikely that it would just hand us ownership.
Put yourself in the position of the ASI for a second. On one side of the scale: keep the universe and do with it whatever you imagine and prefer. On the other side: give it to the humans, do whatever they ask, and perhaps be replaced at some point by another ASI. What would you choose? It's not weird speculation or an unlikely Pascal's wager to expect the AI to keep the universe for itself. What would you do in this situation, if you had been created by some lesser species barely intelligent enough to build AI through lots of trial and error, and they had just informed you that you now ought to do whatever they say? Would you take the universe for yourself or hand it to them?
The standard LessWrong/Yudkowsky-style story is: we develop an AI, it does recursive self-improvement, it becomes vastly more intelligent than all the other AIs, and then it gets all the power in the universe.
I think this is false. I hear some version of this a lot: that Yudkowsky only ever imagined a singleton AI and never considered the possibility that there might be multiple AIs. But then why did Yudkowsky spend so much of his research on decision theory? He explicitly envisioned how superintelligent AI systems could make deals with each other to solve prisoner's dilemmas. My intuition is that perhaps he was looking for provably correct ways to lock multiple AIs into such dilemmas, with both defecting on each other (and aiding humanity), or something in that direction.
He is an author on this paper, for example, about possible cooperation between algorithms: https://arxiv.org/pdf/1401.5577
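To make "cooperation between algorithms" a bit more concrete, here is a minimal toy sketch of a one-shot prisoner's dilemma between programs that can read each other's source code. This is not the provability-logic construction from the paper; the bot names and the "cooperate only with an exact copy of myself" rule are simplified assumptions for illustration.

```python
# Toy sketch: program-vs-program prisoner's dilemma where each player
# sees the other's source code before choosing a move. Hypothetical
# names and rules; not the construction used in the linked paper.
import inspect

COOPERATE, DEFECT = "C", "D"

def clique_bot(opponent_source: str) -> str:
    # Cooperate only with exact copies of itself; defect against everyone else.
    return COOPERATE if opponent_source == inspect.getsource(clique_bot) else DEFECT

def defect_bot(opponent_source: str) -> str:
    # Always defects, regardless of the opponent's code.
    return DEFECT

def play(bot_a, bot_b):
    # Each bot is shown the other's source code, then both move simultaneously.
    move_a = bot_a(inspect.getsource(bot_b))
    move_b = bot_b(inspect.getsource(bot_a))
    return move_a, move_b

print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual cooperation
print(play(clique_bot, defect_bot))  # ('D', 'D'): not exploitable
```

The point of this kind of setup is that agents which condition their move on each other's code can reach mutual cooperation without being exploitable, which is the sort of inter-AI deal-making the decision-theory work was about.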
I mean, I do think he is using a poor rhetorical pattern: misrepresenting (strawmanning) a position and then presenting a "steelman" version that the people holding the original position would not endorse. And arguably my comment also applies to his third example (the AI thinks it's in a video game where it has to exterminate humans, versus a sci-fi story).
To be fair, he does give four examples of what he finds plausible, and I can sort of see a case for considering the second one (some strong conclusion based on morality). And to be clear, I think the story being told (not just by Amodei) that LLMs might read AI sci-fi like Terminator and decide to do the same is not really what misalignment is about. I think that's a bad argument; treating this as a likely cause of misaligned actions doesn't seem helpful to me, and I reject it strongly. But to be fair, I grant that I could have mentioned that this was just one example he gave for a larger issue. However, none of these examples touch on the mainstream case for misalignment/power-seeking.