Misaligned agendas and terminology among academic, industrial and independent AI alignment research
This post aims to fill some gaps between technical AI Alignment topics and academic AI research.
It summarises a quick but informed scouting of academic research papers closely connected to four Alignment topics: mechanistic interpretability, safety via debate, value alignment, and cooperative AI. This overview should help anyone wanting to do research on any of these four topics find the related academic work and expand the technical Alignment frontier.
- Audience: somebody who knows what research is (typically a PhD) and is familiar with technical topics in AI Alignment
- Epistemic status: informed guess after going over recent AI Alignment research papers; biased towards
... (read 2299 more words →)
Thanks, appreciated. My expectation is not that one should cite all generally relevant research. Rather, one is expected to quickly browse the existing literature, at least by searching for similar keywords, for seemingly closely related work. If anything comes up, there are a few options.
- Just mention the work. This shows the reader that you've at least browsed around, and it also gives somebody else the opportunity to explore that relevant angle. Arguably, interpretability researchers who aim at actually figuring out model internals must be aware of formal verification of DNNs and should position their work accordingly.
- Mention the work and argue that it is not applicable. This additionally justifies one's motivation. I've seen it done wrt
... (read more)