I’ve been working towards this direction for a while. Though what I’m imagining is a lot more elaborate. If anyone would like to help out, send me a DM and I can invite you to a discord server where we talk about this stuff. Please let me know who you are and what you do if you do DM me.
I wrote some brief notes about it in the Accelerating Alignment section here: https://www.lesswrong.com/posts/jXjeYYPXipAtA2zmj/jacquesthibs-s-shortform?commentId=iLJDjBQBwFod7tjfz
And cover some of the philosophy in the beginning of this post: https://www.lesswrong.com/posts/a2io2mcxTWS4mxodF/results-from-a-survey-on-tool-use-and-workflows-in-alignment
Additionally, I added a comment about the general LLM for alignment approach on John’s recent post: https://www.lesswrong.com/posts/KQfYieur2DFRZDamd/why-not-just-build-weak-ai-tools-for-ai-alignment-research?commentId=DXt7mBkW7WiL36nBN
I like the creative thinking here.
I suggest a standard here, where we can test our "emulation" against the researcher themselves, to see how large the diff is between their answers, and where the researcher can rate how good a substitute the model is for them, across a number of different dimensions.
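A minimal sketch of how such ratings could be aggregated, assuming the researcher scores each model answer against their own on a 1-5 scale per dimension (the dimension names here are illustrative, not a proposed standard):

```python
from statistics import mean

def score_emulation(ratings):
    """Aggregate a researcher's 1-5 self-comparison ratings.

    `ratings` maps each evaluation dimension (hypothetical names,
    e.g. "reasoning style", "factual accuracy") to the scores the
    researcher gave the model's answers versus their own answers.
    Returns per-dimension means and an overall mean.
    """
    per_dim = {dim: mean(scores) for dim, scores in ratings.items()}
    overall = mean(per_dim.values())
    return per_dim, overall

# Toy example: one researcher rated three answers on two dimensions.
ratings = {
    "reasoning style": [4, 3, 5],
    "factual accuracy": [5, 4, 4],
}
per_dim, overall = score_emulation(ratings)
```

Keeping the per-dimension breakdown matters: a model could match a researcher's conclusions while failing to reproduce their reasoning style, and an overall average alone would hide that.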
One theoretically possible way to greatly speed up alignment research is to upload the minds of alignment researchers (with their consent, of course). Ideally, we would upload MIRI and some other productive groups, create thousands of their copies, and speed them up by orders of magnitude. This way, we may be able to solve alignment within years, if not months.
Obviously, the tech is not there yet. But there is a poor-man's version of mind uploading that does exist already:
The current language models are already smart enough to assist with research in non-trivial ways. If fine-tuned on alignment writings, such a model could become a "mini-MIRI in your pocket", available 24/7.
A few examples of possible usage:
This could be a small technical project with a disproportionately large positive impact on alignment research.
The first proof-of-concept could be as simple as a well-crafted prompt for ChatGPT, nudging it to think like an alignment researcher[1].
For a prototype, we could build a dataset by collecting all relevant writings, everything from Eliezer's tweets to the top papers on alignment-related topics, and try to fine-tune the largest available model on it.
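The collection step could be sketched as follows, assuming the writings have already been scraped into plain-text files; the directory layout, filenames, and the JSONL record shape are all illustrative, not any provider's required fine-tuning schema:

```python
import json
from pathlib import Path

def build_dataset(source_dir: str, out_path: str, min_chars: int = 200) -> int:
    """Collect plain-text alignment writings into one JSONL file.

    Assumes one document per .txt file under `source_dir`.
    Skips fragments shorter than `min_chars` (e.g. stray tweets
    too short to be useful training examples on their own).
    Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).glob("**/*.txt")):
            text = path.read_text(encoding="utf-8").strip()
            if len(text) < min_chars:
                continue
            record = {"source": path.name, "text": text}
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

The real work is upstream of this: deciding what counts as "alignment writings", deduplicating cross-posts, and filtering out low-quality comments before fine-tuning.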
After some iterations, I ended up with the following prompt. Could be a good start:
(replace the bold part with your topic)