I have some rough intuitions about the purpose of MIRI's agent foundations agenda, and I'd like to share them here. (Note: I have not discussed these with MIRI, and these should not be taken to be representative of MIRI's views.)
I think there's a common misconception that the goal of agent foundations is to try building an AGI architected with a decision theory module, a logical induction module, etc. In my mind, this is completely not the point, and my intuitions say that approach is doomed to fail.
I interpret agent foundations as being more about providing formal specifications of metaphilosophical competence, to:
- directly extend our understanding of metaphilosophy, by adding conceptual clarity to important notions we only have fuzzy understandings of. (Will this agent fall into epistemic pits? Are its actions low-impact? Will it avoid catastrophes?) As an analogy, formally defining mathematical proofs constituted significant progress in our understanding of mathematical logic and mathematical philosophy.
- allow us to formally verify whether a computational process will satisfy desirable metaphilosophical properties, like those mentioned in the above parenthetical. (It seems perfectly fine for these processes to be built out of illegible components, like deep neural nets—while that makes them harder to inspect, it doesn't preclude us from making useful formal statements about them. For example, in ALBA, it would help us make formal guarantees that distilled agents remain aligned.)
I want to explore logical induction as a case study. I think the important part about logical induction is the logical induction criterion, not the algorithm implementing it. I've heard the implementation criticized for being computationally intractable, but I see its primary purpose as showing the logical induction criterion to be satisfiable at all. This elevates the logical induction criterion over all the other loose collections of desiderata that may or may not be satisfiable, and may or may not capture what we mean by logical uncertainty. If we were to build an actual aligned AGI, I would expect its reasoning process to satisfy the logical induction criterion, but not look very much like the algorithm presented in the logical induction paper.
I also think the logical induction criterion provides an exact formalization—a necessary AND sufficient condition—of what it means to not get stuck in any epistemic pits in the limit. (The gist of this intuition: epistemic pits you're stuck in forever correspond exactly to patterns in the market that a trader could exploit forever, and make unbounded profits from.) This lets us formalize the question "Does X reasoning process avoid permanently falling into epistemic pits?" into "Does X reasoning process satisfy the logical induction criterion?"
"Adding conceptual clarity" is a key motivation, but formal verification isn't a key motivation.
The point of things like logical induction isn't "we can use the logical induction criterion to verify that the system isn't making reasoning errors"; as I understand it, it's more "logical induction helps move us toward a better understanding of what good reasoning is, with a goal of ensuring developers aren't flying blind when they're actually building good reasoners".
Daniel Dewey's summary of the motivation behind HRAD is:
To which Nate replied at the time:
The position of the AI community is something like the position researchers would be in if they wanted to build a space rocket, but hadn't developed calculus or orbital mechanics yet. Maybe with enough trial and error (and explosives) you'll eventually be able to get a payload off the planet that way, but if you want things to actually work correctly on the first go, you'll need to do some basic research to cover core gaps in what you know.
To say that calculus or orbital mechanics help you "formally verify" that the system's parts are going to work correctly is missing where the main benefit lies, which is in knowing what you're doing at all, not in being able to machine-verify everything you'd like to. You need to formalize how good reasoning works because even if you can't always apply conventional formal methods, you still need to understand what you're building if you want robustness properties.