We want to work towards a world in which the alignment problem is a mainstream concern among ML researchers. An important part of this is popularizing alignment-related concepts within the ML community. Here are a few recent examples:
- Reward hacking / misspecification (blog post)
- Convergent instrumental goals (paper)
- Objective robustness (paper)
- Assistance (paper)
(I'm sure this list is missing many examples; let me know if there are any in particular I should include.)
Meanwhile, alignment researchers have been thinking about many other concepts that are not yet well known within the ML community. Which of these would you most want to be more widely known and understood?
Human values exist within human-scale models of the world.