A low-hanging fruit for solving alignment is to dedicate a chunk of time to actually trying to solve a sub-problem collectively.
To that end, I’ve broken up researching the sub-problem of corrigibility into two categories in this Google Doc (you have suggestion privileges):
- Previous Work: let’s not reinvent the wheel. Write out links to any past work on corrigibility. This can range from bare links to links with summaries and analyses. Add comment reactions to others’ reviews to provide counter-arguments. This is just a Google Doc, so low-quality posts, comments, and links are accepted; I want people to lean towards babbling more.
- Tasks: what do we actually do this week to make progress? Suggest any research direction you find fruitful, or general research questions or framings. Example: write an example of corrigibility (one could then comment with an actual example).
Additionally, I’ll post 3 top-level comments for:
- Meetups: want to co-work with others in the community? Comment your availability, work preferences, and a way to contact you (e.g. a Calendly link, “DM me”, “my email is bob at alice dot com”, etc.). For example, I’m available most times this week and have a Calendly link for scheduling 1-on-1 co-working sessions. You could also message people you know to collaborate on this, or have a nerdy house co-working party.
- Potential topics: what other topics besides corrigibility could we collaborate on in future weeks?
- Meta: what other formats could this type of group collaboration take? Comment suggestions with trade-offs, or discuss the costs and benefits of what I’m presenting in this post.
I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better. Additionally, it’s immensely valuable to have an alignment topic post that includes a literature review, the community’s up-to-date thoughts, and possible future research directions to pursue. I also believe a collaborative project like this will put several community members on the same page as far as terminology and gears-level models go.
I explicitly commit to 3 weeks of this (so corrigibility this week, plus two more topics over the next two weeks). After that come Christmas and New Year’s, after which I may resume depending on how it goes.
Thanks to Alex Turner for reviewing a draft.
The linguistic entropy point is countered by my previous point, right? Unless you want to say that not everyone who posts in this community is capable of doing that, or can do it naturally?
Hahaha, yes. Yudkowsky can easily be interpreted as condescending and annoying in those dialogues (and he could’ve done a better job of not coming across that way), though I believe the majority of the comments were in the spirit of understanding and coming to an agreement. Adam Shimi is also working on a post describing the disagreements in the dialogue as different epistemic strategies, meaning the cause of disagreement is non-obvious. Alignment is pre-paradigmatic, so agreeing is more difficult than in communities that have clear questions and metrics to measure them by. I still think we can succeed at the harder problem.
By "community of philosophers", you mean noone makes any actual progress on anything (or can agree that progress is being made)?
Do you disagree with these examples, or disagree that they prove the community makes progress and agrees that progress is being made?