Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of such thresholds:
Many alignment proposals rely on reaching these thresholds in a specific order. For example, the earlier we reach (9) relative to other thresholds, the easier most alignment proposals are.
Some of these thresholds are relevant to whether an AI or proto-AGI is alignable even in principle. Short of 'full alignment' (CEV-style), any alignment method (e.g. corrigibility) only works within a specific range of capabilities.
Some other possible thresholds:
10. Ability to perform gradient hacking
11. Ability to engage in acausal trade
12. Ability to become economically self-sustaining outside containment
13. Ability to self-replicate
A three-pronged approach to AGI safety. (This assumes we couldn't just avoid building AGI or proto-AGIs at all until, say, ~2100, which would of course be much better.)
Prong 1: boxing & capability control (aka ‘careful bootstrapping’)
Prong 2: scary demos and convincing people that AGI is dangerous
Prong 3: alignment research aka “understanding minds”
There are positive feedback loops between prongs:
If p1 is very successful, maybe we can punt most of p3 to the AIs; conversely, if p1 seems very hard, then we probably only get ‘narrow’ tools to help with p3, need to do most of it ourselves, and have to hope p2 gets ML researchers to delay for long enough.