A low-hanging fruit for solving alignment is to dedicate a chunk of time to actually trying to solve a sub-problem collectively.
To that end, I’ve broken up researching the sub-problem of corrigibility into two categories in this Google Doc (you have suggestion privileges):
- Previous Work: let’s not reinvent the wheel. Post links to any past work on corrigibility, ranging from bare links to links with summaries and analyses. Do add comment reactions to others’ reviews to provide counter-arguments. Since this is just a Google Doc, low-quality posts, comments, and links are welcome; I want people to lean toward babbling more.
- Tasks: what do we actually do this week to make progress? Suggest any research direction you find fruitful, or general research questions or framings. Example: a task like “write an example of corrigibility” (to which someone could then reply with an actual example).
Additionally, I’ll post 3 top-level comments for:
- Meetups: want to co-work with others in the community? Comment your availability, work preferences, and a way to contact you (e.g. a Calendly link, “DM me”, “my email is bob at alice dot com”, etc.). For example, I’m available most times this week and have a Calendly link for scheduling 1-on-1 co-working sessions. You could also message people you know to collaborate on this, or have a nerdy house co-working party.
- Potential topics: what other topics besides corrigibility could we collaborate on in future weeks?
- Meta: what other formats could this type of group collaboration take? Comment suggestions along with their trade-offs, or discuss the costs and benefits of the format I’m presenting in this post.
I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Regardless of whether we do, I think it’s of great value for the community to make actual progress on alignment problems and to figure out how to do that better. It’s also immensely valuable to have a single post on an alignment topic that includes a literature review, the community’s up-to-date thoughts, and possible future research directions to pursue. Finally, I believe a collaborative project like this will put several community members on the same page regarding terminology and gears-level models.
I explicitly commit to 3 weeks of this (corrigibility this week, plus two more topics over the following two weeks). After that come Christmas and New Year’s, after which I may resume depending on how it goes.
Thanks to Alex Turner for reviewing a draft.