cross-posted to Twitter:
https://twitter.com/DavidSKrueger/status/1573643782377152514
-What sorts of skills an AI would need to achieve powerbase ability
-Timelines till powerbase ability
-Takeoff speeds
-How things are going to go by default, alignment-wise (e.g. will it be as described in this?)
-How things are going to go by default, governance-wise (e.g. will it be as described in this?)
I have two questions I'd love to hear your thoughts about.
1. What is the overarching/high-level research agenda of your group? Do you have a concrete alignment agenda where people work on the same thing, or do people work on many unrelated things?
2. What are your thoughts on the various research agendas to solve alignment that exist today? Why do you think they will fall short of their goal? What are you most excited about?
Feel free to talk about any agendas, but I'll just list a few that come to my mind (in no particular order).
IDA, Debate, Interpretability (I think I read a tweet where you said you are rather skeptical about this), Natural Abstraction Hypothesis, Externalized Reasoning Oversight, Shard Theory, (Relaxed) Adversarial Training, ELK, etc.
I feel like I've pretty much gone off the outer vs. inner alignment framing.
People have had a go at inner alignment, but they keep trying to affect it by taking terms for interpretability, or modeled human feedback, or characteristics of the AI's self-model, and putting them into the loss function, which dilutes the whole notion that inner alignment isn't about what's in the loss function.
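To make that pattern concrete, here is a minimal, hypothetical sketch of what folding such concerns into the training objective tends to look like; it is not any specific proposal from the literature, and all names and coefficients below are made up for illustration.

```python
# Hypothetical sketch of the pattern described above: "inner alignment"
# concerns expressed as extra penalty terms in the outer training objective.
# All names and numbers are illustrative, not anyone's actual proposal.

def combined_loss(task_loss: float,
                  interpretability_penalty: float,
                  feedback_disagreement: float,
                  lam: float = 0.1,
                  mu: float = 0.1) -> float:
    """Task loss plus auxiliary terms intended to shape the model's internals.

    The objection above is that once these concerns are written into the loss,
    they have become part of the outer objective, blurring the claim that
    inner alignment is about properties the training signal does not specify.
    """
    return task_loss + lam * interpretability_penalty + mu * feedback_disagreement


# Toy usage with placeholder values:
print(combined_loss(task_loss=1.8,
                    interpretability_penalty=0.4,
                    feedback_disagreement=0.7))
```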
People have had a go at outer alignment too, but (if they're named Charlie) they keep trying to point to what we want by saying that the AI should be trying to learn good moral reasoning, which me...
Thanks for your thoughts, really appreciate it.
One quick follow-up question: when you say "build powerful AI tools that are deceptive" as a way of "the problem being easier than anticipated", how exactly do you mean that? Do you mean that if we can create deceptive or power-seeking tool AI very easily, it will be much simpler to investigate what is happening and derive solutions?
Here are some links to the concepts you asked about.
Externalized Reasoning Oversight: This was also introduced recently: https://www.lesswrong.com/posts/FRRb6Gqem8k69ocbi/externalized-reasoning-oversight-a-research-direction-for . The main idea is to use chain-of-thought reasoning to oversee the thought processes of your model (assuming that those thought processes are complete and straightforward, and that the output causally depends on them).
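For a rough sense of the setup, here is a minimal sketch of that oversight loop under the post's assumptions (complete, faithful reasoning traces that the answer actually depends on); `model`, `generate_with_cot`, and `overseer_flags` are hypothetical stand-ins for illustration, not an API from the post.

```python
# Rough sketch of externalized reasoning oversight, assuming the model emits
# an explicit reasoning trace and the final answer causally depends on it.
# Every function here is a hypothetical stand-in, not a real library call.

def generate_with_cot(model, prompt: str):
    """Assume `model` returns (reasoning_trace, answer) for a prompt."""
    reasoning, answer = model(prompt)  # hypothetical interface
    return reasoning, answer


def overseer_flags(reasoning: str,
                   banned_phrases=("deceive the user", "hide this step")) -> bool:
    """Crude overseer: flag traces containing disallowed content.
    In practice the overseer could be a human, another model, or a classifier."""
    text = reasoning.lower()
    return any(phrase in text for phrase in banned_phrases)


def overseen_answer(model, prompt: str):
    """Only return answers whose externalized reasoning passes oversight."""
    reasoning, answer = generate_with_cot(model, prompt)
    if overseer_flags(reasoning):
        return None  # reject or escalate instead of returning the answer
    return answer
```

The key design choice is that oversight is applied to the reasoning trace rather than to the final output alone, which is only useful to the extent that the trace is faithful.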
Shard Theory: https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values . It was proposed very recently. Their TL;DR is: "We propose a theory of human value formation. According to this theory, the reward system shapes human values in a relatively straightforward manner. Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry."
Relaxed Adversarial Training: I think the main post is this one https://www.lesswrong.com/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment . But I really like the short description by Beth (https://www.lesswrong.com/posts/YQALrtMkeqemAF5GX/another-list-of-theories-of-impact-for-interpretability):
"The basic idea of relaxed adversarial training is something like: