Here’s Quintin Pope’s answer from the Twitter thread I posted (https://twitter.com/quintinpope5/status/1633148039622959104?s=46&t=YyfxSdhuFYbTafD4D1cE9A):
1. How much convergence is there really between AI and human internal representations?
1.1 How do we make there be more convergence?
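One concrete way to even ask this quantitatively is a representation-similarity metric such as linear CKA. Below is a minimal sketch on synthetic data standing in for "a model layer's activations vs. some human-derived feature set on the same inputs"; it is one possible metric, not a claim that it's the right one.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two (n_examples, n_features) activation matrices."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    return float(np.linalg.norm(X.T @ Y, "fro") ** 2 /
                 (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 64))                      # e.g., one model layer's activations
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))      # random orthogonal rotation
print(linear_cka(X, X @ Q))                         # ~1.0: same representation up to rotation
print(linear_cka(X, rng.normal(size=(512, 64))))    # much lower: unrelated representation
```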
2. How do we minimize semantic drift in LMs when we train them to do other stuff? (If you RL them to program well, how do we make sure their English continues to describe their programs well?)
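For context on why this is still open: the standard partial answer is a per-token KL penalty toward a frozen copy of the pre-RL model, which bounds distributional drift but says nothing about whether the model's English still tracks its programs. A minimal sketch of that penalty (the reward and logits here are placeholders, not a specific recipe):

```python
import torch
import torch.nn.functional as F

def kl_penalized_reward(reward: torch.Tensor,         # (batch,) scalar reward per sequence
                        policy_logits: torch.Tensor,   # (batch, seq, vocab) from the RL'd model
                        ref_logits: torch.Tensor,      # (batch, seq, vocab) from the frozen base model
                        beta: float = 0.1) -> torch.Tensor:
    """Subtract beta * sum_t KL(policy_t || reference_t) from the sequence reward."""
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl_per_token = (policy_logp.exp() * (policy_logp - ref_logp)).sum(dim=-1)  # (batch, seq)
    return reward - beta * kl_per_token.sum(dim=-1)
```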
3. How well do alignment techniques generalize across capabilities advances? If AI starts doing AI research and makes 20 capabilities advances like the Chinchilla scaling laws, will RLHF/whatever still work on the resulting systems?
4. Where do the inductive biases of very good SGD point? Are they "secretly evil", in the sense that powerful models convergently end up deceptive / explicit reward optimizers / some other bad thing?
4.1 If so, how do we stop that?
5. How should we even start thinking about data curation feedback loops? If we train an LM, then have the LM curate / write higher quality training data for its successor, and repeat this process many times, what even happens? What types of attractors can arise here?
5.1 How do we safely shape such a process? We want the process to enter stable attractors along certain dimensions (like "in favour of humanity"), but not along others (like "I should produce lots of text that agents similar to me would approve of").
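To make the shape of the question concrete, here is the loop in toy form, with trivial stand-ins for training, generation, and curation. The point of the sketch is only that whatever criterion the curation step encodes gets amplified every generation, which is exactly the attractor question.

```python
import random

def train(corpus):
    # Stand-in "model": just remembers its training corpus.
    return list(corpus)

def generate_candidates(model, n=20):
    # Stand-in generation: resample and lightly perturb existing examples.
    return [random.choice(model) + "!" for _ in range(n)]

def curate(model, candidates, keep=10):
    # Stand-in curation: keep whatever the "model" rates highest (here: longest).
    # Whatever criterion sits on this line is the thing that gets amplified.
    return sorted(candidates, key=len, reverse=True)[:keep]

def self_curation_loop(seed_corpus, generations=5):
    corpus = list(seed_corpus)
    for _ in range(generations):
        model = train(corpus)
        corpus += curate(model, generate_candidates(model))
    return corpus

print(self_curation_loop(["hello", "world"])[-3:])  # the corpus drifts toward the curation criterion
```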
6. What are the limits of efficient generalization? Can plausible early TAI generalize from "all the biological data humans gathered" to "design protein sequences to build nanofactory precursors"?
7. Given a dataset that can be solved in multiple different ways, how can we best influence the specific mechanism the AI uses to solve that dataset?
7.1 like this? arxiv.org/abs/2211.08422
7.2 or this? https://openreview.net/forum?id=mNtmhaDkAr
7.3 Or how about this? https://www.lesswrong.com/posts/rgh4tdNrQyJYXyNs8/qapr-3-interpretability-guided-training-of-neural-nets
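Or, as a deliberately simpler member of the same family as those three links, one can add an auxiliary loss penalizing the model's sensitivity to features it shouldn't rely on. A hedged sketch follows; the mask, model, and weighting are all placeholders, and this is not a reproduction of any of the linked methods.

```python
import torch
import torch.nn.functional as F

def loss_with_mechanism_penalty(model, x, y, spurious_mask, lam=1.0):
    # Task loss plus a penalty on input-gradients along "forbidden" input dimensions,
    # nudging the model away from solving the task via those features.
    x = x.clone().requires_grad_(True)
    task_loss = F.cross_entropy(model(x), y)
    (input_grad,) = torch.autograd.grad(task_loss, x, create_graph=True)
    penalty = (input_grad * spurious_mask).pow(2).sum()
    return task_loss + lam * penalty

# Tiny usage example on a linear "model".
model = torch.nn.Linear(4, 2)
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
mask = torch.tensor([0., 0., 1., 1.])   # pretend the last two features are the spurious ones
loss_with_mechanism_penalty(model, x, y, mask).backward()
```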
8. How to best extract unspoken beliefs from LM internal states? Basically ELK for LMs. See: https://github.com/EleutherAI/elk
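The crudest version of the idea, for intuition only, is a supervised linear probe on hidden states; note this is not the method in the linked EleutherAI/elk repo, and the data below is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholders: hidden_states would be (n_statements, d_model) activations from some
# LM layer, and labels whether each statement is true. Here "truth" is planted linearly.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 256))
labels = (hidden_states[:, :8].sum(axis=1) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(hidden_states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# High accuracy says the "belief" is linearly decodable from this layer; the hard
# (ELK) part is knowing the probe tracks the model's belief rather than an artifact.
```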
9. What mathematical framework best quantifies the geometric structure of model embedding space? E.g., using cosine similarity between embeddings is bad because it's dominated by outlier dims and doesn't reflect distance along the embedding manifold. We want math that more meaningfully reflects the learned geometry. Such a framework would help a lot with questions like "what does this layer do?" and "how similar are the internal representations of these two models?"
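A tiny synthetic demonstration of the outlier-dimension failure: two vectors that agree only on one large coordinate still look almost identical under cosine similarity.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a = rng.normal(size=512)
b = rng.normal(size=512)      # unrelated in every ordinary dimension
a[0] = b[0] = 100.0           # one shared outlier coordinate

print(cosine(a, b))           # ~0.95: dominated by the single outlier dim
print(cosine(a[1:], b[1:]))   # ~0.0: the rest of the geometry is unrelated
```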
10. How do we best establish safe, high-bandwidth, information-dense communication between human brains and models? This is the big bottleneck on approaches like cyborgism, and includes all forms of BCI research / "cortical prosthesis" / "merging with AI". But it also includes things like "write a very good visualiser of LM internal representations", which might give researchers a higher-bandwidth view of what's going on in LMs beyond just "read the tokens sampled from those hidden representations".
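On the cheap end of that spectrum, a "visualiser" can be as little as projecting per-token hidden states to 2D and plotting the trajectory. A sketch on synthetic activations, purely to illustrate the kind of tool meant; PCA is not being proposed as the right lens.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholder activations: real ones would be per-token hidden states from an LM layer.
hidden_states = np.random.default_rng(0).normal(size=(128, 768))   # (tokens, d_model)
coords = PCA(n_components=2).fit_transform(hidden_states)

plt.plot(coords[:, 0], coords[:, 1], marker=".", linewidth=0.5)
plt.title("Token trajectory through activation space (toy data)")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```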
I'm going to answer a different question: what's my list of open problems in understanding agents? I claim that, once you dig past the early surface-level questions about alignment, basically the whole cluster of "how do agents work?"-style questions and subquestions form the main barrier to useful alignment progress. So with that in mind, here are some of my open questions about understanding agents (and the even deeper problems one runs into when trying to understand agents), going roughly from "low-level" to "high-level".
I think this is an ill-posed question. Boundaries and modularity could be discussed in the context of different mathematical languages/frameworks: quantum mechanics, random dynamical systems formalism, neural network formalism, whatever. All these mathematical languages permit talking about information exchange, modularity, and boundaries. Cf. this comment.
Even if we reformulate the question as "Which mathematical language permits identifying boundaries [of a particu...