-What sorts of skills an AI would need to achieve powerbase ability
-Timelines till powerbase ability
-Takeoff speeds
-How things are going to go by default, alignment-wise (e.g. will it be as described in this?)
-How things are going to go by default, governance-wise (e.g. will it be as described in this?)
I have two questions I'd love to hear your thoughts about.
1. What is the overarching/high-level research agenda of your group? Do you have a concrete alignment agenda where people work on the same thing, or do people work on many unrelated things?
2. What are your thoughts on the various research agendas for solving alignment that exist today? Why do you think they will fall short of their goal? Which are you most excited about?
Feel free to talk about any agendas, but I'll just list a few that come to mind (in no particular order).
IDA, Debate, Interpretability (I think I read a tweet where you said you are rather skeptical about this), Natural Abstraction Hypothesis, Externalized Reasoning Oversight, Shard Theory, (Relaxed) Adversarial Training, ELK, etc.
I feel like I've pretty much gone off outer vs. inner alignment.
People have had a go at inner alignment, but they keep trying to affect it by taking terms for interpretability, or modeled human feedback, or characteristics of the AI's self-model, and putting them into the loss function, which dilutes the whole notion that inner alignment isn't about what's in the loss function (a sketch of that pattern is below).
People have had a go at outer alignment too, but (if they're named Charlie) they keep trying to point to what we want by saying that the AI should be trying to learn good moral reasoning, which me...
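To make the first complaint concrete, here's a minimal sketch of the pattern I mean: extra terms bolted onto the training loss. Everything here is made up for illustration, not anyone's actual agenda; `TinyModel`, `combined_loss`, `reward_model`, the sparsity penalty standing in for "interpretability", and the weights are all assumptions of mine.

```python
# Hypothetical sketch: auxiliary "alignment" terms folded into an ordinary task loss.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self, d_in=16, d_hidden=32, d_out=4):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = torch.relu(self.encoder(x))  # hidden activations, reused by the penalty below
        return self.head(h), h

def combined_loss(model, x, targets, reward_model,
                  lambda_interp=0.1, lambda_hf=0.5):
    logits, hidden = model(x)
    task_loss = nn.functional.cross_entropy(logits, targets)

    # "Interpretability" term: an L1 sparsity penalty on activations,
    # in the hope that sparser activations are easier to inspect.
    interp_term = hidden.abs().mean()

    # Modeled-human-feedback term: a learned scorer of the model's outputs,
    # maximized by subtracting its mean score from the loss.
    hf_term = -reward_model(logits).mean()

    return task_loss + lambda_interp * interp_term + lambda_hf * hf_term

# Usage (illustrative): x, targets = torch.randn(8, 16), torch.randint(0, 4, (8,))
# reward_model can be any nn.Module mapping logits to a scalar score per example,
# e.g. nn.Linear(4, 1).
```

Whatever the merits of each individual term, the result is that "inner alignment" ends up being pursued entirely through what's written in the loss function, which was the thing the concept was supposed to be distinct from.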
cross-posted to Twitter:
https://twitter.com/DavidSKrueger/status/1573643782377152514