Question for my fellow alignment researchers out there: do you have a list of unsolved problems in AI alignment? I'm thinking of creating an "alignment mosaic" of the questions we need to resolve and slowly filling it in with insights from papers/posts.
I have my own version of this, but I would love to combine it with others' alignment backcasting game-trees. I want to collect the kinds of questions people are keeping in mind when reading papers/posts, thinking about alignment or running experiments. I'm working with others to make this into a collaborative effort.
Ultimately, what I'm looking for are the important questions and sub-questions we should keep in mind and update on as we read papers and posts, and as we decide what to read in the first place.
Here’s my Twitter thread posing this question: https://twitter.com/jacquesthibs/status/1633146464640663552?s=46&t=YyfxSdhuFYbTafD4D1cE9A.
Here’s a sub-thread breaking down the alignment problem in various forms: https://twitter.com/jacquesthibs/status/1633165299770880001?s=46&t=YyfxSdhuFYbTafD4D1cE9A.
Some examples of the kinds of questions I have in mind:
What is the correct "object of study" for alignment researchers seeking to understand the mechanics of a world immediately before and during takeoff? Alex Flint's work and Shimi's UAO are good steps in this direction.
What form does the correct alignment goal take? Is it a utility function over a region of space, a set of conditions to be satisfied, or something else?
Mechanistically, how do systems trained primarily on token frequencies appear to be capable of higher-level reasoning?
How likely is the emergence of deceptively aligned systems?