A low-hanging fruit for solving alignment is to dedicate a chunk of time actually trying to solve a sub-problem collectively.
To that end, I’ve broken up researching the sub-problem of corrigibility into two categories in this google doc (you have suggestion privileges):
- Previous Work: let’s not reinvent the wheel. Write out links to any past work on corrigibility. This can range from just links to links & summaries & analyses. Do comment reactions to other's reviews to provide counter-arguments. This is just a google doc, low-quality posts, comments, links are accepted; I want people to lean towards babbling more.
- Tasks: what do we actually do this week to make progress? Suggest any research direction you find fruitful or general research questions or framings. Example: write an example of corrigibility (one could then comment an actual example).
Additionally, I’ll post 3 top-level comments for:
- Meetups: want to co-work with others in the community? Comment availability, work preferences, and a way to contact you (eg calendly link, “dm me”, “ my email is bob and alice dot com”, etc) For example, I’m available most times this week with a Calendly link for scheduling 1-on-1 co-working sessions. Additionally, you yourself could message those you know to collaborate on this, or have a nerdy house co-working party.
- Potential topics: what other topics besides corrigibility could we collaborate on in future weeks?
- Meta: what are different formats this type of group collaboration could take? Comment suggestions with trade offs or discuss the cost/benefits of what I’m presenting in this post.
I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better. Additionally, it’s immensely valuable to have an alignment topic post include a literature review, the community's up-to-date thoughts, and possible future research directions to pursue. I also believe a collaborative project like this will put several community members on the same page as far as terminology and gears-level models.
I explicitly commit to 3 weeks of this (so corrigibility this week and two more the next two weeks). After that is Christmas and New Years, after which I may resume depending on how it goes.
Thanks to Alex Turner for reviewing a draft.
I've got a slightly terrifying hail mary "solve alignment with this one weird trick"-style paradigm I've been mulling over for the past few years which seems like it has the potential to solve corrigibility and a few other major problems (notably value loading without Goodharting, using an alternative to CEV which seems drastically easier to specify). There are a handful of challenging things needed to make it work, but they look to me maybe more achievable than other proposals which seem like they could scale to superintelligence I've read.
Realistically I am not going to publish it anytime soon given my track record, but I'd be happy to have a call with anyone who'd like to poke my models and try and turn it into something. I've had mildly positive responses from explaining it to Stuart Armstrong and Rob Miles, and everyone else I've talked to about it at least thought it was creative and interesting.
Cool, booked a call for later today.