Who Aligns the Alignment Researchers?
There may be an incentives problem for AI researchers and research organizations who face a choice between researching Capabilities, Alignment, or neither. The incentive structure will lead individuals and organizations toward Capabilities work rather than Alignment work. The incentives problem is much clearer at the organizational level than at the individual level, but it bears considering at both levels, and of course the funding available to organizations has downstream implications for the jobs available to researchers employed to work on Alignment or Capabilities.
In this post, I’ll describe a couple of key moments in the history of AI organizations. I’ll then survey the incentives researchers might have for doing either Alignment work or Capabilities work. We’ll see that it may be that, even allowing for normal levels of altruism, the average person might prefer to do Capabilities rather than Alignment work. There is a relevant collective action dynamic. I’ll then survey the organizational level and the global level. After that, I’ll finish by looking very briefly at why investment in Alignment might be worthwhile.
A note on the dichotomous framing of this essay: I understand that the line between Capabilities and Alignment work is blurry, or worse, that some Capabilities work plausibly advances Alignment and some Alignment work advances Capabilities, at least in the short term. However, in order to model the lay of the land, it’s helpful as a simplifying assumption to treat Capabilities and Alignment as distinct fields of research and try to understand the motivations of researchers in each.
History
As a historical matter, DeepMind and OpenAI were both founded with explicit missions to create safe, Aligned AI for the benefit of all humanity. There are different views on the extent to which each of these organizations remains aligned to that mission. Some people maintain that they are, while others maintain that they are doing incredible harm by shortening AI timelines. No one ca
With respect to institutional frameworks, it seems that credible transparency is an important necessary (though not sufficient) step toward credible benignness, that credible transparency is not currently implemented within existing frameworks such as RSPs and Summit commitments, and that implementing it would be a very achievable step forward.
So right now, model evals do suffice to demonstrate benignness, but we need some confidence in those evals, and for that, transparency (e.g., openness to independent eval testing) seems essential. Later, when evals are no longer sufficient, I'm not sure what will be, but whatever it is, it will certainly require transparent testing by independent observers in order to establish credible benignness.