Thanks, I'm planning to release an advent calendar of hot takes and this gives me fodder for a few :P
My short notes that I'll expand in the advent calendar:
Human bureaucracies are mostly misaligned because the individual bureaucratic actors are themselves misaligned; I think a “bureaucracy” of perfectly aligned humans (like EA but better) would be well aligned. RLHF is obviously not a solution in the limit, but I don’t think it’s extremely implausible that it is outer aligned enough to work, though I am much more enthusiastic about IDA.
Shard theory explicitly assumes certain claims about how the human brain works, in particular that the genome mostly specifies crude neural reward circuitry and that ~all of the details of the cortex are basically randomly initialized. I think these claims are plausible but uncertain, and quite important for AI safety, so I would be excited about more people looking into this question: it seems controversial among neuroscientists and geneticists, yet tractable given the wealth of existing neuroscience research.
Note that this pertains to the shard theory of human values, not shard-centric models of how AI values might form. That said, I'm likewise interested in investigating these assumptions. E.g., how people work is important probabilistic evidence for how AI works, because there are going to be "common causes" behind effective real-world cognition and design choices.
I have been learning more about alignment theory over the last couple of months, and have heard from many people that writing down naive hypotheses can be a good strategy for developing your thoughts and getting feedback on them. So here goes:
Overall, my view is that alignment doesn't seem extremely hard, but that p(doom) is still fairly high (~45%): very short timelines are plausible; capabilities researchers may not take the problem sufficiently seriously, or may be unwilling to pay the alignment tax to implement alignment strategies; and if alignment does turn out to be hard (in the sense that relatively simple training mechanisms plus interpretability-based oversight do not work), I think we are probably doomed. However, all of these are strong claims, weakly held - tell me why I'm wrong!