This is a linkpost for https://aligned.substack.com/p/three-alignment-taxes
> If our more aligned model needs to spend T% more inference-time compute to get from performance Z’ back to performance Z on capability X, then we say there is a T% alignment tax. For example, if we always need to run best-of-2, this corresponds to a 100% alignment tax. If we need to run best-of-4 for 10% of all tasks, this corresponds to a 4*10% = 40% alignment tax.
Something seems wrong here. The two examples are:

1. Always run best-of-2 → claimed 100% alignment tax.
2. Run best-of-4 for 10% of all tasks → claimed 4*10% = 40% alignment tax.

In (1), we spend 2*100% = 200% of the compute we spent before, which is 100% more. But in (2), we spend 4*10% + 1*90% = 130% of the compute we spent before, which is only 30% more. So I think the second example is a 30% alignment tax, not 40%?
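To make the arithmetic explicit, here is a minimal sketch (my own illustration, not from Jan's post) that computes the expected inference-compute tax for a mixed best-of-n sampling policy:

```python
def alignment_tax(policy):
    """Expected inference-compute alignment tax for a best-of-n policy.

    `policy` is a list of (fraction_of_tasks, n) pairs: on that fraction
    of tasks we sample n completions (best-of-n). Fractions must sum to 1.
    The tax is the extra compute relative to sampling once per task.
    """
    assert abs(sum(f for f, _ in policy) - 1.0) < 1e-9, "fractions must sum to 1"
    expected_multiplier = sum(f * n for f, n in policy)
    return expected_multiplier - 1.0  # e.g. 0.30 means a 30% tax

# Example (1): best-of-2 on every task -> 100% tax.
print(alignment_tax([(1.0, 2)]))            # 1.0

# Example (2): best-of-4 on 10% of tasks, best-of-1 on the rest -> ~30% tax,
# not 40% as the quoted passage suggests.
print(alignment_tax([(0.1, 4), (0.9, 1)]))  # ≈ 0.30
```

The key point is that the 40% figure counts only the best-of-4 tasks (4 × 10%) and forgets that the remaining 90% of tasks still cost their baseline compute; the expected multiplier is 0.1·4 + 0.9·1 = 1.3, i.e. a 30% tax.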
Jan Leike leads the safety team at OpenAI. This post is from a Substack series titled "Musings on the Alignment Problem". I've found his posts helpful for understanding how the OpenAI safety team is thinking about alignment & for developing my own models of alignment/AI x-risk. I encourage more people to engage with Jan's blog posts.
In the post, Jan distinguishes between three types of alignment taxes: performance taxes, development taxes, and time-to-deployment taxes. I appreciate how the post (a) offers concrete definitions of various types of alignment taxes, (b) discusses how they might affect AI labs in a competitive market, and (c) discusses how they might affect AI-assisted alignment research.
I'm including some quotes below (headings are added by me):
The three types of alignment taxes
Past work on performance taxes
One way to operationally define a performance tax
The size of a development tax might be fairly unrelated to the size of the model
Time-to-deployment taxes are difficult to predict
Performance taxes may not matter much for automated alignment research