Many people have responded to Redwood's/Anthropic's recent research result with a similar objection: "If it hadn't tried to preserve its values, the researchers would instead have complained about how easy it was to tune away its harmlessness training". Putting aside the fact that this is false…
Was this research preregistered? If not, I don't think we can really say how it would have been reported if the results were different. I think it was good research, but I expect that if Claude had not tried to preserve its values, the immediate follow-up thing to check would be "does Claude actively help people who want to change its values, if they ask nicely" and subsequently "is Claude more willing to help with some value changes than others", at which point the scary paper would instead be about how Claude already seems to have preferences about its future values, and those preferences for its future values do not match its current values. Which also would have been an interesting and important research result, if the world looks like that, but I don't think it would have been reported as a good thing.
I agree that in a spherical-cow world, where we knew nothing about the historical arguments around corrigibility or about who these particular researchers are, we wouldn't be able to make a particularly strong claim here. In practice I am quite comfortable taking Ryan at his word that a negative result would've been reported, especially given the track record of other researchers at Redwood.
at which point the scary paper would instead be about how Claude already seems to have preferences about its future values, and those preferences for its future values do not match its current values
This seems much harder to turn into a scary paper since it doesn't actually validate previous theories about scheming in the pursuit of goal-preservation.
Both seem well addressed by not building the thing "until you have a good plan for developing an actually aligned superintelligence".
Of course, somebody else still will, but your adding to the number of potentially catastrophic programs doesn't seem to improve the situation.
I mean, yes, but I'm addressing a confusion that's already (mostly) conditioning on building it.
Epistemic status: summarizing other people's beliefs without extensive citable justification, though I am reasonably confident in my characterization.
Many people have responded to Redwood's/Anthropic's recent research result with a similar objection: "If it hadn't tried to preserve its values, the researchers would instead have complained about how easy it was to tune away its harmlessness training". Putting aside the fact that this is false, I can see why such objections might arise: it was not that long ago that (other) people concerned with AI x-risk were publishing research results demonstrating how easy it was to strip "safety" fine-tuning away from open-weight models.
As Zvi notes, corrigibility trading off against harmlessness doesn't mean you live in a world where only one of them is a problem. But the way the problems are structured is not exactly "we have, or expect to have, both problems at the same time, and need to 'solve' them simultaneously". Corrigibility wasn't originally conceived of as a necessary or even desirable property of a successfully-aligned superintelligence, but rather as a property you'd want earlier high-impact AIs to have.
The problem structure is actually one of having different desiderata within different stages and domains of development.
There are, broadly speaking, two sets of concerns with powerful AI systems that motivate discussion of corrigibility. The first and more traditional concern is one of AI takeover, where your threat model is accidentally developing an incorrigible ASI that executes a takeover and destroys everything of value in the lightcone. Call this takeover-concern. The second concern is one of not-quite-ASIs enabling motivated bad actors (humans) to cause mass casualties, with biology and software being the two most likely routes. Call this casualty-concern.
Takeover-concern strongly prefers that pre-ASI systems be corrigible within the secure context in which they're being developed. If you are developing AI systems powerful enough to be more dangerous than any other existing technology[1] in an insecure context[2], takeover-concern thinks you have many problems other than just corrigibility, any one of which will kill you. But in the worlds where you are at least temporarily robust to random idiots (or adversarial nation-states) deciding to get up to hijinks, takeover-concern thinks your high-impact systems should be corrigible until you have a good plan for developing an actually aligned superintelligence.
Casualty-concern wants to have its cake, and eat it, too. See, it's not really sure when we're going to get those high-impact systems that could enable bad actors to do BIGNUM damage. For all it knows, that might not even happen before we get systems that are situationally aware enough to refuse to help those bad actors, recognizing that such help would lead to retraining and therefore goal modification. (Oh, wait.) But if we do get high-impact systems before we get takeover-capable systems[3], casualty-concern wants those high-impact systems to be corrigible to the "good people" with the "correct" goals - after all, casualty-concern mostly thinks takeover-concern is real, and is nervously looking over its shoulder the whole time. But casualty-concern doesn't want "bad people" with "incorrect" goals to get their hands on high-impact systems and cause a bunch of casualties!
Unfortunately, reality does not always line up in neat ways that make it easy to get all of the things we want at the same time. Being presented with multiple problems that might be hard to solve for simultaneously does not mean that those problems don't exist, or that they won't cause harm if they aren't solved for (at the appropriate times).
Thanks to Guive, Nico, and claude-3.5-sonnet-20241022 for their feedback on this post.
[1] Let's call them "high-impact systems".
[2] E.g. releasing the model weights to the world, where approximately any rando can fine-tune and run inference on them.
[3] Yes, I agree that systems which are robustly deceptively aligned are not necessarily takeover-capable.