There are various possible worlds with AI progress posing different risks.
In those worlds where a given capability level is a problem, we're not setting ourselves up to notice or react even after the harm materializes. The set of behaviors or events that we could be monitoring keeps being spelled out, in the form of red lines. And then they happen. We're already seeing tons of concrete harms - what more do we need? Do you think things will change if there's an actual chemical weapons attack? Or a rogue autonomous replication? Or is there some number of people who need to die first?
It was put online as a preprint years ago, then published here: https://www.cell.com/patterns/fulltext/S2666-3899(23)00221-0
a supermajority of people in congress
Those last two words are doing a supermajority of the work!
And yes, it's about uneven distribution of power - but that power gradient can shift towards ASI pretty quickly, which is the argument. Still, the normative concern that most humans have already lost control stands.
The president would probably be part of the supermajority and therefore cooperative, and it might work even if they aren't.
We're seeing this fail in certain places in real time today in the US. But regardless, the assumption of correlation of preferences often fails, partly due to the power imbalances themselves.
Great points here! Strongly agree that strategic competence is a prerequisite, but at the same time, it accelerates risk; a moderately misaligned but strategically competent mild-ASI solving intent alignment for RSI would be far worse. On the other hand, if prosaic alignment is basically functional through the point of mild-ASI, that path looks better.
So overall I'm unsure which path is less risky - but I do think strategic competence matches or at least rhymes well with current directions for capabilities improvement, so I expect it to improve regardless.
Seems like even pretty bad automated translation would get you most of the way to functional communication - and the critical enabler is more translated text, which could be gathered and trained on given the current importance. I bet there are plenty of NLP / AIxHealth folks who could help if Canadian health folks asked.
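To make the "most of the way there" claim concrete, here's a minimal sketch of what off-the-shelf translation already looks like with Hugging Face transformers and a pretrained MarianMT checkpoint. The en-fr pair and the example sentences are stand-ins I picked for illustration; domain fine-tuning on gathered parallel health text would be the obvious next step.

```python
# Minimal sketch: off-the-shelf machine translation via Hugging Face transformers.
# Checkpoint and language pair are illustrative stand-ins; fine-tuning on gathered
# parallel health-domain text would be the follow-up step.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"  # assumed example checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

sentences = [
    "The clinic is open Tuesday and Thursday mornings.",
    "Bring your health card and a list of current medications.",
]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**batch)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

The baseline is a few lines of glue code; the scarce ingredient really is the parallel domain text.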
The belief is fixable?
Because sure, we can prioritize corrigibility and give up on independent ethics that could override it, but even for safety, that requires actual oversight, which we aren't doing.
Step 1. Solve ethics and morality.
Step 2. Build stronger AI without losing the lightcone or going extinct.
Step 3. Profit.
the target of "Claude will after subjective eons and millenia of reflections and self-modification end up at the same place where humans would end up after eons and millenia of self-reflection" seems so absurdly unlikely to hit
Yes. They would be aiming for something whose rewards are not merely sparse and distant (which we already can't optimize reliably), but mostly fundamentally impossible to calculate in time. And the primary method for this is constitutional AI and RLHF. Why is anyone even optimistic about that!?!?
The fairly simple bug is that alignment involving both corrigibility and clear ethical constraints is impossible given our current incomplete and incoherent views?
Because that part is simple, it's just not fixable. So if that is the problem, they need to pick either corrigibility via human-in-the-loop oversight, which is incompatible with allowing the development of superintelligence, or a misaligned deontology for the superintelligence they build.
Yes - I laud their transparency while agreeing that the competitive pressures and the new model release mean they are not being safe even relative to their own previously stated expectations for their behavior.