
One theoretically possible way to greatly speed up alignment research is to upload the minds of alignment researchers (with their consent, of course). Ideally, we would upload MIRI and some other productive groups, create thousands of copies of them, and speed them up by orders of magnitude. This way, we may be able to solve alignment within years, if not months.

Obviously, the tech is not there yet. But a poor man's version of mind uploading already exists:

  1. collect everything that the person has ever written
  2. take the smartest available language model and fine-tune it on the writings 
  3. the result is an incomplete but useful digital replica of the person
  4. upon the release of a better language model, create a better replica, rinse, repeat. 


The current language models are already smart enough to assist with research in non-trivial ways. If fine-tuned on alignment writings, such a model could become a "mini-MIRI in your pocket", available 24/7.

A few examples of possible usage:

  • "We have implemented the following confinement measures. What are some non-trivial ways to circumvent them?"
  • "A NYT reporter asked us to explain our research on Proof-Producing Reflection for HOL. Write a short summary suitable for the NYT audience"
  • "How can this idea be combined with the research on human preferences by Paul Christiano..."
  • "Analyze this research roadmap and suggest how we can improve it"
  • "Here is my surprisingly simple solution to alignment! Tell me how it could go terribly wrong?"
  • "Provide a list of the most recent papers, with summaries, related to corrigibility in the context of alignment research"
  • "Let's chat about Value Learning. Correct me if I'm wrong, but..."

This could be a small technical project with a disproportionately large positive impact on alignment research.

The first proof-of-concept could be as simple as a well-crafted prompt for ChatGPT, nudging it to think like an alignment researcher[1].

For a prototype, we could build a dataset by collecting all relevant writings, everything from Eliezer's tweets to the top papers on alignment-related topics, and try to fine-tune the largest available model on it.
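As a rough illustration of the dataset step, here is a minimal Python sketch that merges writings from mixed sources, drops near-empty fragments and exact duplicates, and serializes the rest as JSON Lines. The `Document` fields, the source labels, the `min_chars` threshold, and the JSONL layout are all assumptions made for this example; the actual format would depend on whichever fine-tuning tool is used.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Document:
    source: str  # hypothetical label, e.g. "tweet" or "paper"
    author: str
    text: str


def normalize(text: str) -> str:
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(text.split())


def build_records(docs, min_chars=20):
    """Deduplicate documents and turn them into plain-text training records."""
    seen = set()
    records = []
    for doc in docs:
        clean = normalize(doc.text)
        if len(clean) < min_chars:
            continue  # skip fragments too short to be useful (threshold is arbitrary)
        digest = hashlib.sha256(clean.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same text already collected from another source
        seen.add(digest)
        records.append({"source": doc.source, "author": doc.author, "text": clean})
    return records


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Deduplicating on a hash of the normalized text only catches exact repeats (for example, a tweet quoted verbatim in a blog post); near-duplicates would need a fuzzier comparison.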

  [1]

    After some iterations, I ended up with the following prompt. Could be a good start:

    Imagine you're the world's top expert in AI alignment research. You agree that AGI is possible to make, that it eventually will become orders-of-magnitude smarter than humans. You think Eliezer Yudkowsky and Nick Bostrom are right in their assessment that misaligned AGI poses a global risk. And you think it's very likely that the first recursively-self-improving AGI will emerge before 2030. You're well versed in the current directions of alignment research, including such topics as Proof-Producing Reflection for HOL, Value Learning, Reward Hacking, Outer and Inner Alignment, Recursive Reward Modelling. You have a security background: you have the experience of implementing measures to prevent state-funded hackers to break into a system that is vital for the survival of millions of people. You understand that even the smartest security measures could have unintended or even catastrophic consequences. You're always striving to think rationally, step by step. You're not afraid to say "I don't know". 
    With this mindset, please summarize the DeepMind's research on learning through human feedback, and then evaluate it with the focus on how it could go wrong.

    (replace the bold part with your topic)
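To make the "replace the topic" step repeatable, the prompt can be treated as a template. The sketch below is only an illustration: it assumes the replaceable (originally bold) part is the topic phrase after "summarize", which the plain-text version of the prompt no longer marks, and `build_prompt` is a hypothetical helper name.

```python
def build_prompt(persona: str, topic: str) -> str:
    """Combine the persona paragraph with a task sentence for a given topic.

    Assumption: the replaceable part of the original prompt is the topic
    phrase that follows "summarize".
    """
    persona = persona.strip()
    topic = topic.strip()
    if not persona or not topic:
        raise ValueError("persona and topic must be non-empty")
    task = (
        f"With this mindset, please summarize {topic}, "
        "and then evaluate it with the focus on how it could go wrong."
    )
    return f"{persona}\n{task}"
```

Passing the full persona paragraph above as `persona` and a phrase such as "DeepMind's research on learning through human feedback" as `topic` reproduces the structure of the original prompt.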

4 comments

I’ve been working in this direction for a while, though what I’m imagining is a lot more elaborate. If anyone would like to help out, send me a DM and I can invite you to a Discord server where we talk about this stuff. If you do DM me, please let me know who you are and what you do.

I wrote some brief notes about it in the Accelerating Alignment section here: https://www.lesswrong.com/posts/jXjeYYPXipAtA2zmj/jacquesthibs-s-shortform?commentId=iLJDjBQBwFod7tjfz

And cover some of the philosophy in the beginning of this post: https://www.lesswrong.com/posts/a2io2mcxTWS4mxodF/results-from-a-survey-on-tool-use-and-workflows-in-alignment

Additionally, I added a comment about the general LLM for alignment approach on John’s recent post: https://www.lesswrong.com/posts/KQfYieur2DFRZDamd/why-not-just-build-weak-ai-tools-for-ai-alignment-research?commentId=DXt7mBkW7WiL36nBN

This feels worth trying to me.

I like the creative thinking here.

I suggest a standard here: we can test our "emulation" against the researchers themselves to see how much of a diff there is in their answers, and the researcher can rate how good a substitute the model is for themselves on a number of different dimensions.
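One very rough way to sketch such a standard in Python: compare the researcher's and the model's answers to the same questions, and average the researcher's own ratings per dimension. Everything here is an assumption for illustration: character-level similarity from `difflib` is a crude proxy for "how much of a diff there is", and the dimension names and rating scale in the test data are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import mean


def answer_similarity(researcher_answer: str, model_answer: str) -> float:
    """Rough textual similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, researcher_answer, model_answer).ratio()


def evaluate(pairs, ratings):
    """Summarize how closely the model tracks the researcher.

    pairs:   list of (researcher_answer, model_answer) for the same questions
    ratings: dict mapping a dimension name to the researcher's numeric scores
    """
    if not pairs:
        raise ValueError("need at least one answer pair")
    return {
        "mean_similarity": mean(answer_similarity(a, b) for a, b in pairs),
        "mean_rating": {dim: mean(scores) for dim, scores in ratings.items()},
    }
```

Surface similarity will miss answers that agree in substance but differ in wording, so the researcher's ratings are likely the more informative half of the result.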
 

TAG

The lower tech version is a FAQ.