(this post came out of a conversation between Critch, Ryan, Scott, and me)

Methods for aligning "messy" AGI systems (e.g. ALBA) usually at some point depend on finding a set of models, at least one of which is a good honest predictor. For example, this may be a set of predictors in KWIK learning, or a set of possible policies for a red team to use to construct simulations.

Concretely, we could consider training a bunch of different models (e.g. with different hyperparameters) to predict human answers to a variety of questions through some context changes. We train and filter models until we have n different models that all perform well on these training tasks. Now we use some of the models to answer important test questions. We have some scheme similar to KWIK learning to ensure that our system answers questions well as long as at least one of the n models answers test questions well. How high should n be so that this is the case?
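To make this concrete, here is a minimal sketch of one scheme with this shape (the class name, the agreement-threshold rule, and the numeric-answer encoding are illustrative assumptions on my part, not a specific proposal):

```python
# A minimal sketch (not a concrete proposal from the post): keep n models that
# fit the training data, answer a test question only when the survivors agree,
# and otherwise abstain, ask the human, and drop models that answered badly.
# If at least one surviving model stays within `tolerance` of the human's
# answers, it is never eliminated, and any answer the system does give is
# within `tolerance` of that model's answer.

from typing import Callable, List, Optional

Model = Callable[[str], float]  # maps a question to a numeric answer


class EnsembleAnswerer:
    def __init__(self, models: List[Model], tolerance: float = 0.1):
        self.models = list(models)
        self.tolerance = tolerance

    def answer(self, question: str) -> Optional[float]:
        """Answer only if the surviving models agree; otherwise abstain (None)."""
        answers = [m(question) for m in self.models]
        if max(answers) - min(answers) <= self.tolerance:
            return sum(answers) / len(answers)
        return None  # disagreement: fall back to asking the human

    def update(self, question: str, human_answer: float) -> None:
        """After asking the human, drop models whose answers were far off."""
        self.models = [
            m for m in self.models
            if abs(m(question) - human_answer) <= self.tolerance
        ]
```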

A naive analysis would conclude that n does not need to be very high (it can be, say, 10). This is because all the models have performed well throughout context changes. The context change from training questions to test questions should not in principle be harder than these previous context changes.

One reason to suspect that such a small n is insufficient is that the context change from training questions to test questions is one from unimportant questions to important questions (i.e. humans actually use the answers to test questions to make important decisions). This creates an opportunity for a treacherous turn. If one of the models is a consequentialist agent, it may answer training questions honestly and then answer test questions dishonestly in a way that causes the humans to make decisions in that agent's interests.

In the worst case, all successful models are consequentialist agents, so we must set n high enough that at least one of the consequentialists has a goal compatible with always answering test questions honestly.
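As a rough sanity check on how n has to scale with the density of such models (this treats the n retained models as independent draws, which is optimistic, since models trained with similar data and hyperparameters are correlated): if a fraction p of the models that perform well in training have such a goal, then

$$\Pr[\text{no honest model among } n] = (1-p)^n \le e^{-pn},$$

so getting at least one honest model with probability at least 1 − δ requires roughly n ≳ ln(1/δ)/p.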

What's the probability that a randomly-selected consequentialist that performs well on training tasks has such a goal? One rough way of estimating this is to imagine the "language of thought" of an AGI that has been trained to answer questions useful to humans. We can then think about the complexity of the "honesty" concept in its language of thought.

First I will estimate this complexity in the case of humans. By my very rough guesses:

  • about 10% of human concepts are social
  • about 5% of human social concepts are about cooperation
  • honesty is about 5% of cooperation

So the honesty concept might have a probability of about 10% × 5% × 5% ≈ 2.5 × 10^-4 in the human concept space. As an independent check, let's consider the frequency of words related to honesty in the English language. According to this page, the frequency of the word "honest" in English text is on the order of 10^-5. I'm not sure which estimate to take more seriously; for now I'll just pretend that the honesty concept has a probability of something like 10^-4 in the human concept space.

How does this apply to AGI systems? If the AGI is good at modelling humans, then maybe in some sense it is devoting some amount of its conceptual vocabulary to "mirroring" human concepts. Say it is devoting 3% of the probability mass of its language of thought to human concepts. Then the human concept of honesty makes up about 3% × 10^-4 = 3 × 10^-6 of its conceptual probability mass.
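Spelling out the arithmetic behind these guesses (the inputs are just the rough numbers above, so the output inherits all of their uncertainty):

```python
# Back-of-the-envelope chain for the density of the honesty concept.
# All inputs are the rough intuitive guesses from the post.

social_fraction      = 0.10   # share of human concepts that are social
cooperation_fraction = 0.05   # share of social concepts about cooperation
honesty_fraction     = 0.05   # share of cooperation concepts about honesty

p_honesty_human = social_fraction * cooperation_fraction * honesty_fraction
print(p_honesty_human)        # 2.5e-4 of the human concept space

p_honesty_working = 1e-4      # rounded working value used in the text
human_mirroring = 0.03        # share of AGI concept mass mirroring human concepts
p_honesty_agi = human_mirroring * p_honesty_working
print(p_honesty_agi)          # ~3e-6 of the AGI's conceptual probability mass
```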

The AGI could be assigning substantially more than that much mass to some concept of honesty, perhaps because the AGI finds a more general concept of honesty instrumentally useful (maybe not the same as humans' concept, but sufficient for answering test questions honestly).

A lot of these numbers are made up intuitively. Under uncertainty about what the right numbers are, I'm going to tentatively conclude that it is very likely that the proportion of successfully-human-predicting AGIs that are honest is at least 10^-7, and somewhat likely that it is above 10^-5.

Is the presence of at least one honest predictor sufficient for the question-answering system to work? Not necessarily; frameworks such as KWIK learning assume that at least one of the predictors is Bayes-optimal, whereas in reality Bayes-optimality is impossible for bounded reasoners. So further analysis will be necessary to positively establish that something like KWIK would work. But for now it appears that the frequency of dishonest models compared to honest models is not a fundamental obstruction to schemes based on predicting humans, at least if one is willing to gather millions of data points from humans.
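For a rough sense of scale, combining the tentative density estimate with the crude n ≳ ln(1/δ)/p bound above (and the further heuristic assumption, not a proved KWIK bound, that the number of human-labeled questions needed grows roughly linearly with the number of candidate models):

```python
import math

p_honest = 3e-6   # tentative density of honest well-performing models (from above)
delta = 0.01      # acceptable chance that the pool contains no honest model

n_models = math.ceil(math.log(1 / delta) / p_honest)
print(n_models)   # ~1.5 million candidate models

# If the human-query budget scales roughly linearly with the number of
# candidates (a heuristic, not a proved bound), this puts the required human
# data on the order of millions of labeled answers.
```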

1 comment:

I don't think "honesty" is what we are looking for.

We have a system which has successfully predicted "what I would say if asked" (for example) and now we want a system that will continue to do that. "What I would say" can be defined precisely in terms of particular physical observations (it's the number provided as input to a particular program) while conditioning only on pseudorandom facts about the world (e.g. conditioning on my computer's RNG, which we use to determine what queries get sent to the human). We really just want a system that will continue to make accurate predictions under the "common sense" understanding of reality (rather than e.g. believing we are in a simulation or some other malign skeptical hypothesis).
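As a toy illustration of the kind of definition being gestured at here (the question list, sampling fraction, and function names are just placeholders): the prediction target is whatever number the human enters into one specific program, and which questions get routed to the human depends only on a pseudorandom seed.

```python
import random
from typing import List

QUESTIONS = ["Is the bridge design safe?", "Will the reactor overheat?"]

def select_queries_for_human(seed: int, fraction: float = 0.1) -> List[str]:
    """Pseudorandomly decide which questions are sent to the human."""
    rng = random.Random(seed)  # the only fact about the world we condition on
    return [q for q in QUESTIONS if rng.random() < fraction]

def record_human_answer(question: str) -> float:
    """The prediction target: the literal number the human provides here."""
    return float(input(question + " (enter a number): "))
```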

I don't think that going through a model of cooperativeness with humans is likely to be the easiest way to specify this. I think one key observation to leverage, when lower-bounding the density, is that the agent is already using the desired concept instrumentally. For example, if it is malevolent, it is still reasoning about what the correct prediction would be in order to increase its influence. In some sense the "honest" agent is just a subset of the malicious reasoning, stopping at the honest goal rather than continuing to backwards chain. If we could pull out this instrumental concept, then it wouldn't necessarily be the right thing, but at least the failures wouldn't be malign.

If you have a model with 1 degree of freedom per step of computation, then it seems like the "honest" agent is necessarily simpler, since we can slice out the parts of the computation that are operating on this instrumental goal. It might be useful to try to formalize this argument as a warmup.

(Note that e.g. a fully-connected neural net has this property; so while it's kind of a silly example, it's not totally out there.)
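Roughly, the warmup might go like this (a sketch, not a worked-out argument): let M be a deceptive model, and let H be the sub-computation of M that already produces the honest prediction before the remaining instrumental reasoning runs. If description length |·| is proportional to the number of computation steps (one degree of freedom per step), then

$$H \subseteq M \;\Rightarrow\; |H| \le |M| \;\Rightarrow\; \Pr(H) \ge \Pr(M) \quad \text{under a simplicity prior } \Pr(x) \propto 2^{-|x|},$$

so honest models are at least as dense as deceptive models that compute honest answers instrumentally. The step that needs care is showing that slicing out H really costs no extra description length in realistic model classes.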

Incidentally, this style of argument also seems needed to address the malignity of the universal prior / logical inductor, at least if you want to run a theoretically convincing argument. I expect the same conceptual machinery will be used in both cases (though it may turn out that one is possible and the other is impossible). So I think this question is needed both for my agenda and MIRI's agent foundations agenda, and advocate bumping it up in priority.