As I noted when we chatted about this in person, my intuition is less "there is some small core of good consequentialist reasoning (it has 'low Kolmogorov complexity' in some sense), and this small core will be quite important for AI capabilities" and more "good consequentialist reasoning is low-K, and those who understand it will be better equipped to design AGI systems where the relevant consequentialist reasoning happens in transparent boxes rather than black boxes."
Indeed, if I thought one had to understand good consequentialist reasoning in order to design a highly capable AI system, I'd be less worried by a decent margin.
The way I wrote it, I didn't mean to imply "the designers need to understand the low-K thing for the system to be highly capable", merely "the low-K thing must appear in the system somewhere for it to be highly capable". Does the second statement seem right to you?
(perhaps a weaker statement, like "for the system to be highly capable, the low-K thing must be the correct high-level understanding of the system, and so the designers must understand the low-K thing to understand the behavior of the system at a high level", would be better?)
The second statement seems pretty plausible (when we consider human-accessible AGI designs, at least), but I'm not super confident of it, and I'm not resting my argument on it.
The weaker statement you provide doesn't seem like it's addressing my concern. I expect there are ways to get highly capable reasoning (sufficient for, e.g., gaining decisive strategic advantage) without understanding low-K "good reasoning"; the concern is that said systems are much more difficult to align.
Thanks for the write-up; this is helpful for me (Owen).
My initial takes on the five steps of the argument as presented, in approximately decreasing order of how much I am on board:
For #5, OK, there's something to this. But:
I like your framing for #1.
I agree that things get messier when there is a collection of AI systems rather than a single one. "Pivotal acts" mostly make sense in the context of local takeoff. In nonlocal takeoff, one of the main concerns is that goal-directed agents not aligned with human values are going to find a way to cooperate with each other.
What work is step #1 doing here? It seems like steps #2-5 would still hold even if the AGI in question were using "bad" consequentialist reasoning (e.g. domain-limited/high-K/exploitable/etc.).
In fact, is it necessary to assume that the AGI will be consequentialist at all? It seems highly probable that the first pivotal act will be taken by a system of humans+AI that is collectively behaving in a consequentialist fashion (in order to pick out a pivotal act from the set of all actions). If so, do arguments #2-#5 not apply equally well to this system as a whole, with "top-level" interpreted as something like "transparent to humans within the system"?
This post helped me understand HRAD a lot better. I'm quite confident that subsystems (SS) will be smarter than top-level systems (TS) (because meta-learning will work). So on that it seems we agree. Although, I'm not sure we have the same thing in mind by "smarter" (e.g., I don't mean that SSs will use some kind of reasoning which is different from model-based RL, just that we won't be able to easily/tractably identify what algo is being run at the SS level, because: 1. interpretability will be hard and 2. it will be a pile of hacks that won't have a simple, complete description).
I think this is the main disagreement: I don't believe that SS will work better because it will stumble upon some low-K reasoning core; I just think SS will be much better at rapid iterative improvements to AI algos. Actually, it seems possible that, lacking some good TS reasoning, SS will eventually hack itself to death :P.
I'm still a bit put off by talking about TS vs. SS being "dominant", and I think there is possibly some difference of views lurking behind this language.
(this post came out of a conversation between me and Owen Cotton-Barratt, plus a follow-up conversation with Nate)
I want to clarify my understanding of some of the motivations of MIRI's highly reliable agent design (HRAD) research (e.g. logical uncertainty, decision theory, multi-level models).
Top-level vs. subsystem reasoning
I'll distinguish between an AI system's top-level reasoning and subsystem reasoning. Top-level reasoning is the reasoning the system is doing in a way its designers understand (e.g. using well-understood algorithms); subsystem reasoning is reasoning produced by the top-level reasoning that its designers (by default) don't understand at an algorithmic level.
Here are a few examples:
AlphaGo
Top-level reasoning: MCTS, self-play, gradient descent, ...
Subsystem reasoning: whatever reasoning the policy network is doing, which might involve some sort of "prediction of consequences of moves"
Deep Q learning
Top-level reasoning: the Q-learning algorithm, gradient descent, random exploration, ...
Subsystem reasoning: whatever reasoning the Q network is doing, which might involve some sort of "prediction of future score"
Solomonoff induction
Top-level reasoning: selecting (Cartesian) hypotheses by seeing which make the best predictions
Subsystem reasoning: the reasoning of the consequentialist reasoners who come to dominate Solomonoff induction, who will use something like naturalized induction and updateless decision theory
Genetic selection
It is possible to imagine a system that learns to play video games by finding (encodings of) policies that get high scores on training games, and combining the encodings of high-scoring policies to produce new ones (a minimal sketch of this setup appears after these examples).
Top-level reasoning: genetic selection
Subsystem reasoning: whatever reasoning the policies are doing (which is something like "predicting the consequences of different actions")
In both the Solomonoff induction case and this one, if the algorithm is run with enough computation, the subsystem reasoning is likely to overwhelm the top-level reasoning (i.e. a system running Solomonoff induction or genetic selection will eventually come to be dominated by opaque consequentialist reasoners).
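To make the top-level/subsystem split concrete, here is a minimal, purely illustrative sketch of the genetic-selection setup just described (the function names and the placeholder fitness function are hypothetical, not taken from any actual system). The selection loop is the top-level reasoning and is fully transparent to the designer; whatever "prediction of consequences" an evolved policy ends up doing lives inside its parameter vector, which the loop only ever scores, recombines, and mutates as an opaque blob.

```python
import random

# Hypothetical setup: a fixed-length parameter vector encodes a game-playing
# policy, and evaluate() would run it on the training games and return a score.
POLICY_SIZE = 64

def random_policy():
    return [random.gauss(0.0, 1.0) for _ in range(POLICY_SIZE)]

def evaluate(policy):
    # Placeholder fitness; a real system would run the encoded policy in the
    # game environment. The loop never inspects *how* the policy gets its score.
    return -sum(x * x for x in policy)

def crossover(a, b):
    # Combine the encodings of two high-scoring policies to produce a new one.
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def mutate(policy, rate=0.05):
    return [x + random.gauss(0.0, 0.1) if random.random() < rate else x
            for x in policy]

def genetic_selection(generations=100, population_size=50):
    # Top-level reasoning: score, select, recombine, mutate -- all transparent.
    population = [random_policy() for _ in range(population_size)]
    for _ in range(generations):
        scored = sorted(population, key=evaluate, reverse=True)
        parents = scored[: population_size // 5]
        population = [mutate(crossover(random.choice(parents),
                                       random.choice(parents)))
                      for _ in range(population_size)]
    return max(population, key=evaluate)

best = genetic_selection()
```

Nothing in this loop changes as the evolved policies become more sophisticated; any consequentialist reasoning they develop would be subsystem reasoning, invisible at this level.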
Good consequentialist reasoning
Humans are capable of good consequentialist reasoning (at least in comparison to current AI systems). Humans can:
and so on. Current AI systems are not capable of good consequentialist reasoning. Superintelligent AGI systems would be capable of good consequentialist reasoning (though superintelligent narrow AI systems might not in full generality).
The concern
Using these concepts, MIRI's main concern motivating HRAD research can be stated as something like:
1. The first AI systems capable of pivotal acts will use good consequentialist reasoning.
2. The default AI development path will not produce good consequentialist reasoning at the top level.
3. Therefore, on the default AI development path, the first AI systems capable of pivotal acts will have good consequentialist subsystem reasoning but not good consequentialist top-level reasoning.
4. Consequentialist subsystem reasoning will likely come "packaged with a random goal" in some sense, and this goal will not be aligned with human interests.
5. Therefore, the default AI development path will produce, as the first AI systems capable of pivotal acts, AI systems with goals not aligned with human interests, causing catastrophe.
Note that, even if the AI system is doing good consequentialist reasoning at the top level rather than in subsystems, this top-level reasoning must still be directed towards the correct objective for the system to be aligned. So HRAD research does not address the entire AI alignment problem.
Possible paths
Given this concern, a number of possible paths to aligned AI emerge:
Limited/tool AI
One might reject premise 1 and attempt to accomplish pivotal acts using AI systems that do not use good consequentialist reasoning. Roughly, the proposal is to have humans do the good consequentialist reasoning, and to use AI systems as tools.
The main concern with this proposal is that a system of humans and limited AIs might be much less effective (for a given level of computing resources) than an AI system capable of good consequentialist reasoning. In particular, (a) a limited AI might require a lot of human labor to do the good consequentialist reasoning, and (b) human consequentialist reasoning is likely to be less effective than superintelligent AI consequentialist reasoning.
The main hope, despite this concern, is that either "general consequentialist reasoning" is not particularly important for the kinds of tasks people will want to use AI systems for (including pivotal acts), or that some sort of global coordination will make the efficiency disadvantage less relevant.
Example research topics:
Hope that top-level reasoning stays dominant on the default AI development path
Currently, it seems like most AI systems' consequentialist reasoning is explainable in terms of top-level algorithms. For example, AlphaGo's performance is mostly explained by MCTS and the way it's trained through self-play. The subsystem reasoning is subsumed by the top-level reasoning and does not overwhelm it.
One could hope that the algorithms likely to be developed by default in the future (e.g. model-based reinforcement learning) remain powerful enough that top-level consequentialist reasoning stays stronger than subsystem consequentialist reasoning.
The biggest indication that this might not happen by default is that we currently don't have an in-principle theory for good reasoning (e.g. we're currently confused about logical uncertainty and multi-level models), and it doesn't look like these theories will be developed without a concerted effort. Usually, theory lags behind common practice.
Despite this, a possible reason for hope is that AI researchers may be able to develop enough tacit understanding of these theories for practical purposes. Currently, algorithms such as MCTS implicitly handle some subproblem of "logical uncertainty" without a full formal theory, and this does not seem problematic yet. It's conceivable that future algorithms will be similar to MCTS, implicitly handling larger parts of these theories in a way that is as well understood as MCTS, such that good consequentialist reasoning in subsystems does not overwhelm the top-level consequentialist reasoning.
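As a small illustration of how legible that existing top-level machinery is, here is a sketch of the UCB1 selection rule at the heart of MCTS (a textbook formulation with made-up statistics, not code from any particular system). The mechanism by which MCTS copes with uncertainty about which move is best, namely tracking visit counts and average values and trading exploration off against exploitation, fits in a few transparent lines even without a general formal theory of logical uncertainty.

```python
import math

def ucb1_select(children, exploration=1.4):
    """Pick the child node maximizing mean value plus an exploration bonus."""
    total_visits = sum(c["visits"] for c in children)
    def score(c):
        if c["visits"] == 0:
            return float("inf")  # always try unvisited moves first
        mean_value = c["total_value"] / c["visits"]
        bonus = exploration * math.sqrt(math.log(total_visits) / c["visits"])
        return mean_value + bonus
    return max(children, key=score)

# Toy usage with fabricated statistics for three candidate moves.
children = [
    {"move": "a", "visits": 10, "total_value": 6.0},
    {"move": "b", "visits": 3, "total_value": 2.5},
    {"move": "c", "visits": 0, "total_value": 0.0},
]
print(ucb1_select(children)["move"])  # the unvisited move "c" is tried first
```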
"MIRI" has a strong intuition that this won't be the case, and personally I'm somewhat confused about the details; see Nate's comments below for details.
On this path, the most important research topics are those that relate to directing top-level consequentialist reasoning (implemented using algorithms on the default AI development path) towards useful objectives. (Note that these research problems are also important on other paths; goals have to be specified at some point in all cases).
Example research topics:
(research topics like these are discussed in Concrete Problems in AI Safety and Alignment for Advanced Machine Learning Systems)
Figure out some core of good consequentialist reasoning and ensure that AI is developed through this paradigm
This is the main purpose of MIRI's research in HRAD. The main hope is that there is some simple core of good reasoning that can be discovered through theoretical research.
On this pathway, it isn't currently cleanly argued that the right way to research good consequentialist reasoning is to study particular MIRI research topics such as decision theory. One could imagine other approaches to studying good consequentialist reasoning (e.g. thinking about how to train model-based reinforcement learners). I think the focus on problems like decision theory is mostly based on intuitions that are (currently) hard to argue for explicitly.
Example research topics:
(see the agent foundations technical agenda paper for details)
Figure out how to align a "messy" AI whose good consequentialist reasoning is in a subsystem
This is the main part of Paul Christiano's research program. Disagreements about the viability of this approach are quite technical; I have previously written about some aspects of this disagreement here.
Example research topics:
Interaction with task AGI
Given this concern, it isn't immediately clear how task AGI fits into the picture. I think the main motivation for task AGI is that it alleviates some aspects of this concern but not others; ideally it requires knowing fewer aspects of good consequentialist reasoning (e.g. perhaps some decision-theoretic problems can be dodged), and has subsystems "small" enough that they will not develop good consequentialist reasoning independently.
Conclusion
I hope I have clarified the main argument motivating HRAD research, and the positions one can take on it. There seem to be significant opportunities for further clarification of arguments and disagreements, especially around the MIRI intuition that there is a small core of good consequentialist reasoning that is important for AI capabilities and that can be discovered through theoretical research.