Note: weird stuff, very informal.
Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.
I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense.
I don't know whether this is a real problem or not. But from a theoretical perspective, not knowing is already concerning--I'm trying to find a strong argument that we've solved alignment, not just something that seems to work in practice.
I am pretty convinced that daemons are a real problem for Solomonoff induction. Intuitively, the problem is caused by "too much compute." I suspect that daemons are also a problem for some more realistic learning procedures (like human evolution), though in a different shape. I think that this problem can probably be patched, but that's one of the major open questions for the feasibility of prosaic AGI alignment.
I suspect that daemons aren't a problem if we exclusively select for computational efficiency. That is, I suspect that the fastest way to solve any particular problem doesn't involve daemons.
I don't think this question has much intrinsic importance, because almost all realistic learning procedures involve a strong simplicity prior (e.g. weight sharing in neural networks).
But I do think this question has deep similarities to more important problems, and that answering this question will involve developing useful conceptual machinery. Because we have an unusually strong intuitive handle on the problem, I think it's a good thing to think about.
Problem statement and intuition
Can the smallest boolean circuit that solves a problem be a daemon? For example, can the smallest circuit that predicts my behavior (at some level of accuracy) be a daemon?
Intuitively, if we have a daemon that is instrumentally or incidentally motivated to solve my problem, then there is some smaller circuit that solves the problem equally well but skips the instrumental reasoning. If my daemon is doing some complex reasoning to answer the question "Should I predict well?" we could just skip straight to the answer "yes." This both makes the circuit smaller, and prevents the circuit from ever deciding not to predict well.
A different perspective on a similar intuition: the daemon is doing some actual cognitive work to solve the problem. Since that computation is being done by the daemon, it is embedded as a smaller circuit. Jessica explores this intuition a bit here. Here we are considering an easy version of the problem, since by taking the smallest circuit we are effectively quantifying over all possible ways of extracting logical information from the daemon.
Instead of showing that minimal circuits can't be daemons, we might end up concluding that they can be. That would be even more interesting.
Another possible outcome is giving a strong argument that captures our intuitions/concerns about daemons, and which clearly doesn't apply to the minimal circuit that solves a problem. In this case we couldn't prove anything positive about the minimal circuit, but we would have "screened off" the possible cause for concern.
Difficulties
The first and most serious difficulty is understanding what we are talking about.
I don't expect to get total clarity on concepts like "daemon" or "optimization" or "generic problem," but we need to have a better grip than we do right now. I expect that we'll develop better concepts in the course of solving the problem, rather than as a precondition for solving the problem (in general I think "define things so that you can prove the theorem" is often the right strategy).
A second difficulty is that the different parts of the computation can be tangled up in an extremely complex way. In an extreme case, the daemon may be cryptographically obfuscated.
We want to show that given any daemon, there is a smaller circuit that solves the problem. The most natural approach is showing how to construct a smaller circuit, given a daemon. But if the daemon is obfuscated, there is no efficient procedure which takes the daemon circuit as input and produces a smaller circuit that still solves the problem.
So we can't find any efficient constructive argument. That rules out most of the obvious strategies.
For human evolution, the problem is too little compute rather than too much, right? Meaning if evolution just gave humans the goal of "maximize inclusive fitness" then the human wouldn't be able to find a good policy for achieving that due to lack of computing power so instead we got a bunch of goals that would have been subgoals of "maximize inclusive fitness" in our ancestral environment (like eat tasty food and make friends/allies).
Suppose we wanted to make a minimal circuit that would do as well as humans in maximizing inclusive fitness in some range of environments. Wouldn't it make sense to also "help it out" by having it directly optimize for useful subgoals in those environments rather than having it do a big backchain from "maximize inclusive fitness"? And then it would be a daemon because it would keep optimizing for those subgoals even if you moved it outside of those environments?
We may be able to offload some work to the system, e.g. by having it search for a diverse range of models for the user's intent, instead of making it use a single hardcoded goal specification.
This comment of mine is a bit related if you want more elaboration:
https://www.lesswrong.com/posts/NtX7LKhCXMW2vjWx6/thoughts-on-reward-engineering#jJ7nng3AGmtAWfxsy
If you have thoughts on it, probably best to reply there--we are already necroposting, so let's keep the discussion organized.