Question for my fellow alignment researchers out there: do you have a list of unsolved problems in AI alignment? I'm thinking of creating an "alignment mosaic" of the questions we need to resolve and slowly filling it in with insights from papers/posts.
I have my own version of this, but I would love to combine it with others' alignment backcasting game-trees. I want to collect the kinds of questions people are keeping in mind when reading papers/posts, thinking about alignment or running experiments. I'm working with others to make this into a collaborative effort.
Ultimately, what I’m looking for are important questions and sub-questions we need to be thinking about and updating on when we read papers and posts as well as when we decide what to read.
Here’s my Twitter thread posing this question: https://twitter.com/jacquesthibs/status/1633146464640663552?s=46&t=YyfxSdhuFYbTafD4D1cE9A.
Here’s a sub-thread breaking down the alignment problem in various forms: https://twitter.com/jacquesthibs/status/1633165299770880001?s=46&t=YyfxSdhuFYbTafD4D1cE9A.
Here’s Quintin Pope’s answer from the Twitter thread I posted (https://twitter.com/quintinpope5/status/1633148039622959104?s=46&t=YyfxSdhuFYbTafD4D1cE9A):
1.1 How do we make there be more convergence?
2. How do we minimize semantic drift in LMs when we train them to do other stuff? (If you RL them to program well, how do we make sure their English continues to describe their programs well?)
3. How well do alignment techniques generalize across capabilities advances? If AI starts doing AI research and makes 20 capabilities advances like the Chinchilla scaling laws, will RLHF/whatever still work on the resulting systems?
4. Where do the inductive biases of very good SGD point? Are they "secretly evil", in the sense that powerful models convergently end up deceptive / explicit reward optimizers / some other bad thing?
4.1 If so, how do we stop that?
5.1 How do we safely shape such a process? We want the process to enter stable attractors along certain dimensions (like "in favour of humanity"), but not along others (like "I should produce lots of text that agents similar to me would approve of").
6. What are the limits of efficient generalization? Can plausible early TAI generalize from "all the biological data humans gathered" to "design protein sequences to build nanofactory precursors"?
7. Given a dataset that can be solved in multiple different ways, how can we best influence the specific mechanism the AI uses to solve that dataset?
7.1 Like this? arxiv.org/abs/2211.08422
7.2 Or this? https://openreview.net/forum?id=mNtmhaDkAr
7.3 Or how about this? https://www.lesswrong.com/posts/rgh4tdNrQyJYXyNs8/qapr-3-interpretability-guided-training-of-neural-nets
8. How do we best extract unspoken beliefs from LM internal states? Basically ELK for LMs. See: https://github.com/EleutherAI/elk [rough probe sketch below].
9. What mathematical framework best quantifies the geometric structure of model embedding space? E.g., using cosine similarity between embeddings is bad because it's dominated by outlier dims and doesn't reflect distance along the embedding manifold [illustrated below]. We want math that more meaningfully reflects the learned geometry. Such a framework would help a lot with questions like "what does this layer do?" and "how similar are the internal representations of these two models?"
10. How do we best establish safe, high-bandwidth, information-dense communication between human brains and models? This is the big bottleneck on approaches like cyborgism, and includes all forms of BCI research / "cortical prosthesis" / "merging with AI". But it also includes things like "write a very good visualiser of LM internal representations" [crude example below], which might give researchers a higher-bandwidth view of what's going on in LMs beyond just "read the tokens sampled from those hidden representations".
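To make 8 a bit more concrete, here's a rough sketch of the kind of probe the EleutherAI/elk work builds on: a CCS-style probe (Burns et al., "Discovering Latent Knowledge") trained on paired hidden states for a statement and its negation. This is my illustration rather than the repo's actual API; the class and tensor names are made up, and the "hidden states" below are random stand-ins for real LM activations.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to P("the statement is true")."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h):                          # h: (batch, hidden_dim)
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def ccs_loss(p_pos, p_neg):
    # Consistency: P(true | statement) should be ~ 1 - P(true | negation).
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: rules out the degenerate probe that says 0.5 to everything.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Toy usage: random tensors standing in for LM hidden states of paired
# statement / negation prompts (in practice these come from a frozen LM).
hidden_dim = 768
h_pos, h_neg = torch.randn(64, hidden_dim), torch.randn(64, hidden_dim)
probe = CCSProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = ccs_loss(probe(h_pos), probe(h_neg))
    loss.backward()
    opt.step()
print(f"final CCS loss: {loss.item():.4f}")
```

The consistency term pushes P(true | statement) towards 1 − P(true | negation), and the confidence term rules out the degenerate probe that outputs 0.5 everywhere; crucially, none of this uses truth labels.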
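For 9, here's a tiny illustration (again mine, not from the thread) of how a single shared outlier dimension can make cosine similarity look high even when two representations disagree almost everywhere else. Per-dimension standardization is shown only as a crude patch, not as the framework Quintin is asking for.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two 512-dim "embeddings" that point in opposite directions on almost
# every dimension...
a = rng.normal(size=512)
b = -a.copy()
# ...except for one shared high-magnitude outlier dimension.
a[0] = b[0] = 200.0

print(cosine(a, b))        # ~0.97: the single outlier dim dominates

# Crude patch: standardize each dimension before comparing. The shared
# outlier is zeroed out and the underlying disagreement shows up.
stack = np.stack([a, b])
z = (stack - stack.mean(axis=0)) / (stack.std(axis=0) + 1e-8)
print(cosine(z[0], z[1]))  # ~ -1.0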
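And for the "visualiser of LM internal representations" part of 10, the crudest existing version I know of is the logit lens: project each layer's residual stream through the model's final layer norm and unembedding to see which token that layer is "leaning towards". A minimal sketch with GPT-2 via Hugging Face transformers, assuming the standard GPT2LMHeadModel layout (`lm_head`, `transformer.ln_f`):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; the rest are the block outputs.
for layer, h in enumerate(out.hidden_states):
    # Project the last position's residual stream through the final layer
    # norm and the unembedding, as if the model stopped at this layer.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(f"layer {layer:2d} -> {tok.decode(logits.argmax(-1))!r}")
```

A real visualiser would presumably go far beyond this, but even this crude per-layer view is a higher-bandwidth window than sampled tokens alone.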