a company successfully solves control for "high-stakes"/"concentrated" threats, meaning that AIs can safely be deployed without a large risk of them hacking their own datacenters, self-exfiltrating, or taking any similar catastrophic action.
This seems like quite a difficult task, especially since we (or some other trusted agent) need to read the outputs of these AIs to get any value from their safety research. Schemers could self-exfiltrate by manipulating us somehow.
- The company probably can't initially tell the schemers apart from the aligned AIs, or indeed tell which of the projects are succeeding on a day-to-day basis. (It's just too hard to evaluate which research directions are genuinely promising vs. a waste of time.) But they run them all in parallel, and after several months of massively sped-up work, it will likely be much easier to adjudicate whose solutions make sense to adopt.
- This is similar to how, in well-functioning scientific fields, it's not typically possible to get consensus on what speculative research directions are most promising. But once a research project or whole research agenda is completed, there are normally ways in which the value-add can be demonstrated.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work. That's a disanalogy with well-functioning scientific fields: scientists don't deliberately pick bad research directions and try hard to make them seem good.
In a lab experiment where it was told it was going to be shut down, Claude sometimes blackmailed fictional employees who controlled its shutdown button (again, we also tested frontier models from all the other major AI developers and they often did the same thing).
Why is it always the blackmail result that gets reported from this paper? Frontier models were also found willing to cause a fictional employee's death to avoid shutdown. It's weird to me that that's so often ignored.
Here's another justification for hyperbolic discounting, drawing on the idea that you're less psychologically connected to your future selves.
I've always seen this idea attributed to Martin Weitzman, and he cites these papers as making a similar point. Seems like an interesting case of simultaneous discovery: four papers making the same sort of point all appearing between 1996 and 1999.
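(For concreteness, here's a minimal sketch of the uncertain-discount-rate argument usually associated with Weitzman; I'm assuming that's the idea in question, and the exponential distribution below is just an illustrative special case. With a fixed rate $r$, the discount factor is $e^{-rt}$. If $r$ is instead uncertain, the expected discount factor is $D(t) = \mathbb{E}[e^{-rt}]$, and the implied instantaneous rate $-D'(t)/D(t)$ declines over time toward the lowest plausible value of $r$. In the special case where $r$ is exponentially distributed with mean $\mu$,

$$D(t) = \int_0^\infty e^{-rt}\,\tfrac{1}{\mu}e^{-r/\mu}\,dr = \frac{1}{1+\mu t},$$

which is exactly the standard hyperbolic discount function.)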
What's your current view? We should aim for virtuousness instead of corrigibility?
Are uploads conscious? What about AIs? Should we care about shrimp? What population ethics views should we have? What about acausal trade? What about Pascal's wager? What about meaning? What about diversity?
It sounds like you're saying an AI has to get these questions right in order to count as aligned, and that's part of the reason why alignment is hard. But I expect that many people in the AI industry don't care about alignment in this sense, and instead just care about the 'follow instructions' sense of alignment.
Yeah, I think the only thing that really matters is the frequency with which bills are dropped, and train stations seem like high-frequency places.
More reasons to worry about relying on constraints:
Great post! Tiny thing: is the speed prior really best understood as a prior? Surely the only way in which being slow can count against a cognitive pattern is if being slow leads to lower reward. And in that case it seems like speed is a behavioral selection pressure rather than a prior.
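(For reference, a minimal sketch of one standard formalization, a Levin-style speed prior; I'm assuming this is roughly the notion being discussed. A program $p$ of length $\ell(p)$ bits that runs in $t(p)$ steps gets prior weight

$$P(p) \propto 2^{-\left(\ell(p) + \log_2 t(p)\right)} = \frac{2^{-\ell(p)}}{t(p)},$$

so slowness is penalized directly in the prior, independently of whether it ever shows up as lower reward. The question above is whether anything in training actually implements that direct penalty, or whether speed only matters via its effect on reward.)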
Why are you 30% in SPY if SPX is far better?