Continuing the experiment from August, let's try another open thread for AI Alignment discussion. The goal is to provide a place where researchers and upcoming researchers can ask small questions they are confused about, share early-stage ideas, and have lower-key discussions.
I support work on interpretability/transparency, in part because I'm uncertain about discontinuous vs. gradual takeoff, and in part because I'm not very optimistic about any other AI safety approach either; I think we probably just need to try a whole bunch of different approaches that each have a low probability of success, in the hope that something (or some combination of things) works out in the end. My point was that I find the stories people tell about why they are optimistic (e.g., reverse-compiling a neural network into human-readable code and then using that to generate human feedback on the model's decision-making process) to be very questionable.