[Below are notes I wrote a few months ago after reading the training stories post by Evan Hubinger. Reflecting on it helped me think more clearly about alignment. I list some ways it informs my thinking about how to evaluate and prioritise alignment proposals and other ways of reducing x-risk, points that were not obvious to some people I talked with, so I want to share them here.]
In short, the training stories post says that when thinking about the alignment of an AI system, you should have a training story in mind: specify the training goal and then the training rationale (how training will end up producing that goal). You then evaluate the story based on:
If you want to evaluate an alignment proposal X, you can ask:
If you want to come up with new research directions, you can:
Threat modelling improves estimates of training goal desirability and training rationale desirability:
Governance work is partly about making a lack of training goal competitiveness and training rationale competitiveness less costly:
Why all of this has helped me (and might also help you):
Potential limitations of the frame (I don't find them super convincing):