Counterfactuals in decision theory are about variation of environment and state of knowledge, as considered by an agent with fixed goals. I've recently been thinking that it may be variation of preference that needs to be considered to capture corrigibility in decision theory. Something similar happens with uncertainty about (fixed) preference, but that conflates state of preference with state of knowledge, and the two pieces of data determining an agent in a given state might be better kept separate (as they vary across possible worlds).
In this setting, the counterfactuals are worlds/models where the facts can be different, or determined to different extents (the latter point is often neglected). These can be thought of as worlds/states of a Kripke frame, or as points of a topological space (possibly a domain) ordered by specialization.
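As a minimal formal sketch of the two pictures mentioned above (my own framing of standard definitions, not something spelled out in the post): a Kripke frame just supplies a set of worlds with an accessibility relation, while the specialization preorder on a space orders points by how much is determined at them.

```latex
\[
  \text{Kripke frame: } (W, R), \qquad R \subseteq W \times W .
\]
\[
  \text{Specialization preorder on } (X, \tau): \quad
  x \sqsubseteq y \;\iff\; x \in \operatorname{cl}(\{y\})
  \;\iff\; \forall U \in \tau \; (x \in U \Rightarrow y \in U).
\]
```

Reading open sets as observable/determined properties, x ⊑ y says that y is at least as determined as x; on a domain with the Scott topology this preorder recovers the domain order.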
Incidentally, it might be useful to call the whole thing across all counterfactuals an agent, instead of reserving the term for its parts that exist in different possible worlds. This gives the term an unusual meaning in the case with variation of preference (the model of corrigibility): the agent is internally incoherent, holding different preferences in different parts of itself, with the preference in each part talking about the whole, and with acausal trade coordinating across the disagreements/variation in both fact and preference, all within a single agent. (This reframes something intended as a model of corrigibility into something with inner alignment tension. The hope is that externally it behaves like soft optimization.)
I've previously argued that counterfactuals only make sense from within a counterfactual perspective[1]. The problem then was that I didn't know exactly how to apply this insight, because I had no idea how to navigate a circular epistemology. I concluded that a) I should probably delve into the philosophical literature to discover how other people have handled this, and b) we can only evaluate approaches to navigating a circular epistemology from within a particular epistemological frame, so the circularity applies on the meta-level as well.
After writing that post, I got busy with my then web development job/interning at Nonlinear, running the Sydney AI Safety Fellowship, and then with AI safety movement building more generally, so sadly I haven't been able to invest the time to make further progress.
However, today I had a call with Stephen Casper. I noticed that he likes to talk about thinking of someone as an "object" or as an "agent", which mirrors my distinction between "raw reality" and "augmented reality" as described here.
He also linked me to his post Dissolving Confusion around Functional Decision Theory. In that post, he imagines an agent outside of the universe using CDT to determine what the source code of an agent inside the universe should look like, and he argues that the internal agent should use FDT. Of course, this raises the question: if we've just concluded that CDT was a mistake, why would we trust its recommendation to use FDT? Isn't this argument self-defeating?
I believe that the answer becomes clear once we understand that this is part of a process of reflective equilibrium. Sure, the fact that we utilised a decision theory framework that we then rejected isn't ideal, but perhaps it's the best we can do given that we need to reason from somewhere; we can't reason from outside of a framework (the view from nowhere). So making this shift from CDT to FDT is at least a plausible way to handle circular reasoning.
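To make the design-level move concrete, here's a minimal sketch (my own illustration using Newcomb's problem with a perfect predictor; the setup and numbers are assumptions for illustration, not taken verbatim from Casper's post). Choosing the agent's source code is an ordinary causal intervention for the outside designer, so plain expected-value maximisation at the design level already picks out the policy FDT would recommend:

```python
# Sketch: an "outside" designer uses ordinary expected-value reasoning (CDT at
# the design level) to pick the source code of an agent facing Newcomb's
# problem, where a perfect predictor reads that source code.

POLICIES = ["one-box", "two-box"]

def newcomb_payoff(policy: str) -> int:
    """Payoff for an agent whose committed policy the predictor sees."""
    opaque_box = 1_000_000 if policy == "one-box" else 0  # filled iff one-boxing predicted
    transparent_box = 1_000
    return opaque_box if policy == "one-box" else opaque_box + transparent_box

print({p: newcomb_payoff(p) for p in POLICIES})  # {'one-box': 1000000, 'two-box': 1000}
print("designer installs:", max(POLICIES, key=newcomb_payoff))  # one-box, the FDT-style policy
```

The toy calculation only shows that "CDT applied to choosing source code" and "FDT applied to choosing actions" agree here; that agreement is what lets the argument bootstrap from the framework it started in to the one it ends up endorsing.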
I don't know if this is the best way to apply a circular epistemology, but I guess I now feel unstuck. Part of the problem is that before, I was considering the question of how to handle circular epistemologies in the abstract, and that's such a broad and vague question that it's hard to know where to begin.
However, now that I've read Stephen Casper's post and reinterpreted his argument as occurring within a circular epistemology, I have a concrete example to play around with. Instead of a generic "How should one reason?", the question is now "Is the reasoning process here correct?". I probably could have made the same kind of progress if I'd just gone and read a philosophy paper or two, but I thought it was worth noticing the dynamics here so that future Chris is less likely to get stuck. And who knows, maybe it helps someone else too?
In terms of how this has clarified my thinking: I'm now aware that there is an ontological shift from our naive understanding of decision theory to one set in a deterministic universe. I think this is important, so I'll write about it in my next post. It's the kind of thing that's fairly obvious once said out loud, but I also think it's very important to bring the implications into the foreground.
[1] See tailcalled's criticism of this claim.