One aspect of alignment research I find interesting is improving our understanding of potential threat and failure modes by making them more concrete or rigorous. By making them concrete, I mean theoretical results in ML (think Turner’s power-seeking work) or toy examples of the behaviour arising (that funny boat doing reward hacking).
To my knowledge, some conceptual failure modes which are central to the field have not been observed yet (deceptive alignment and gradient hacking come to mind). This isn't to suggest that they won't be observed in more powerful systems, but it seems important to note nonetheless.
I think it would be useful to have a spreadsheet of major conceptual failure modes, alongside whether they have been observed in the wild / formalised in ML, and any mathematical formalisations / results surrounding them.
With a view to potentially making this table, what conceptual failure modes do you know of that have yet to be “formalised or observed”? I'd be particularly interested to hear about any absence of evidence which has surprised you in some way.
Conversely, what formalisations or theoretical results are you aware of for any conceptual failure mode?