This Google doc is a halted work-in-progress writeup of Evan Hubinger’s AI alignment research agenda, authored by Evan. It dates back to around 2020, so Evan’s views on alignment have shifted since then.
Nevertheless, we thought it would be valuable to get this posted and available to everyone working in alignment!
In it, Evan outlines the following alignment scheme:
We should bake transparency tools into the loss function we’re training powerful models on, grading the model on its internal cognitive processes as well as on external behavior. We start by initializing a relatively dumb but non-deceptive model. We scale up the model, selecting against any model that isn’t demonstrably acceptable to a transparency-tool-assisted overseer.
While Evan doesn’t expect this approach to be robust against deceptively aligned models, the hope is that we can define a notion of an 'acceptability predicate' such that, if we start with a dumb aligned model and scale from there, grading on cognitive processes as well as behavior, no model on that trajectory in model space will ever become deceptive in the first place. That is, before a model can be updated to become deceptive in this training process, it hopefully must first be updated to become unacceptable but still non-deceptive. We can therefore update away from all merely unacceptable models as they appear, and thereby never instantiate a deceptive model in the first place.
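To make the shape of this scheme concrete, here is a minimal, purely illustrative Python sketch (not from the doc itself). The names `transparency_report`, `overseer_accepts`, and `PENALTY` are hypothetical stand-ins for transparency tools and an overseer verdict that do not yet exist.

```python
PENALTY = 10.0  # hypothetical cost assigned to an unacceptable checkpoint

def transparency_report(model, batch):
    """Hypothetical: run transparency tools over the model's internal cognition."""
    return {"cognition": None}

def overseer_accepts(report):
    """Hypothetical: verdict of a transparency-tool-assisted overseer."""
    return True

def train_with_acceptability_check(model, data, task_loss, update, steps):
    """Scale up a dumb-but-non-deceptive model, selecting against any
    checkpoint the overseer cannot verify as acceptable."""
    for _ in range(steps):
        batch = next(data)
        loss = task_loss(model, batch)

        # Grade internal cognitive processes, not just external behavior.
        report = transparency_report(model, batch)
        if not overseer_accepts(report):
            loss = loss + PENALTY  # select against merely-unacceptable models

        update(model, loss)  # gradient step on the combined objective
    return model
```

The point of the sketch is only the structure: the acceptability check is applied at every step of training, so an unacceptable model is pushed away from before it could ever be updated into a deceptive one.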
At the time of this doc’s writing, the leading candidate for an adequate acceptability predicate was 'demonstrably myopic.' One plausible account of 'myopia' here is “return the action that your model of HCH would return, if it received your inputs.”
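As a toy illustration of that account of myopia (again, not from the doc), the check below assumes, hypothetically, that we can separately query the model's chosen action and the action its internal model of HCH would return on the same input.

```python
def is_demonstrably_myopic(model_action, models_hch_action, inputs):
    """True iff, on every checked input, the model returns exactly the action
    that its own model of HCH would return on that input."""
    return all(model_action(x) == models_hch_action(x) for x in inputs)
```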
Since writing up this agenda, some things that Evan has updated on include: