HCH as a measure of manipulation

orthonormal

A half-baked idea that came out of a conversation with Jessica, Ryan, and Tsvi:

We'd like to have a straightforward way to define "manipulation", so that we could instruct an AI not to manipulate its developers, or construct a low-impact measure that takes manipulation as a particularly important impact.

We could initially define manipulation in terms of a human's expected actions, or more robustly, in terms of effects on a human's policy distribution across a wide array of plausible environments. However, we'd like to have our AI still be able to tell us information (in a non-manipulative manner) instead of hiding from us in an effort to avoid all influence!

The title of course spoils the next idea: if the AI can reason about some suitable model of HCH, then we can define the notion of "action a has very low influence on a human, as compared to the null action, apart from conveying information x" as the condition that, over a distribution of questions q,

$\mathrm{HCH}(q) \mid a \approx \mathrm{HCH}(x, q) \mid \mathrm{null}$

where HCH is defined relative to that human; we're conditioning the distribution on whether the AI takes action a or the null action; and x,q is the input consisting of statement x followed by question q.
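To make the condition a bit more concrete, here is a minimal sketch of how one might check it empirically. Everything in it is an assumption for illustration: it supposes we already have sampled answers from some model of HCH under each condition, that answers are short strings, and that "approximately equal" is measured by total variation distance averaged over the sampled questions. The names `empirical`, `total_variation`, and `influence_gap` are hypothetical, and nothing here says how to obtain the samples in the first place.

```python
from collections import Counter
from typing import Dict, List


def empirical(samples: List[str]) -> Dict[str, float]:
    """Turn a list of sampled answers into an empirical distribution."""
    counts = Counter(samples)
    n = len(samples)
    return {answer: c / n for answer, c in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two finite distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def influence_gap(
    after_action: Dict[str, List[str]],
    null_with_x: Dict[str, List[str]],
) -> float:
    """Average, over sampled questions q, of the distance between
    samples of HCH(q) | a and samples of HCH(x, q) | null.

    Both arguments map each question to a list of sampled answers.
    Averaging is an assumption; taking the max would be stricter.
    """
    gaps = [
        total_variation(empirical(after_action[q]), empirical(null_with_x[q]))
        for q in after_action
    ]
    return sum(gaps) / len(gaps)


# Toy data: the action leaves q1 untouched but shifts the answers to q2.
after_action = {
    "q1": ["yes", "yes", "no", "yes"],
    "q2": ["no", "no", "no", "no"],
}
null_with_x = {
    "q1": ["yes", "yes", "yes", "no"],
    "q2": ["yes", "yes", "yes", "no"],
}
gap = influence_gap(after_action, null_with_x)
```

A small `gap` would be read as "a has low influence apart from conveying x" on the sampled questions; the threshold for "small" is left open, as it is in the definition above.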

This of course does not exclude the use of manipulative statements x, but it could at least let us restrict manipulation to the forms that could occur through the text input to HCH.

I'd prefer to have the AI reason about HCH rather than just (e.g.) the human's actions in a one-hour simulation, because HCH can in principle capture a human's long-term and extrapolated preferences, and these are the ones I most want to ensure don't get manipulated.

Is there an obvious failure of this approach, an obvious improvement to it, or something simpler that it reduces to?

I can think of two problems:

1. Let's generously suppose that $q$ is some fixed distribution of questions that we want the AI system to ask humans. Some manipulative action may change the answers on $q$ only a little and yet change the consequences of acting on those answers by a lot.
2. Consider an AI system that optimizes a utility function that includes this kind of term for regularizing against manipulation. The actions that best fulfill this utility function may be ones that manipulate humans a lot (and repurpose their resources for some other function) and then coerce them into answering questions in a "natural way". That is, impact may be more like distance traveled (a path integral) than displacement.
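The displacement-versus-distance-traveled point can be illustrated with a toy sketch. It assumes, purely for illustration, that a human's "state" can be written as a point in the plane and that a plan traces out a sequence of such points; `displacement` and `path_length` are hypothetical helper names, not part of any proposal above.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def displacement(path: List[Point]) -> float:
    """Distance between the first and last state only."""
    return math.dist(path[0], path[-1])


def path_length(path: List[Point]) -> float:
    """Sum of distances between consecutive states (a discrete path integral)."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


# A plan that pushes the human far from their starting state and then
# back again: the end-to-end comparison sees nothing, the path does.
there_and_back = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
net = displacement(there_and_back)
travelled = path_length(there_and_back)
```

A penalty that compares only the final answers behaves like `displacement`, which is the worry raised in the second problem.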

Re #2, I think this is an important objection to low-impact-via-regularization-penalty in general.

Re #1, an obvious set of questions to include in $q$ are questions of approval for various aspects of the AI's policy. (In particular, if we want the AI to later calculate a human's HCH and ask it for guidance, then we would like to be sure that HCH's answer to that question is not manipulated.)

Where does $x$ come from in this setup? Is $x$ arbitrary given $a$, so that the condition only guarantees that $a$ can manipulate you in ways that could also have been achieved by giving you information (if I could persuade you via some argument to do what I want, then it's okay for me to just pull a gun on you if it puts you into the same state)? Or would you have some additional assumption about the information in $x$ being a "reasonable equivalent" of $a$?

There's the additional objection of "if you're doing this, why not just have the AI ask HCH what to do?"

Overall, I'm hoping that it could be easier for an AI to robustly conclude that a certain plan only changes a human's HCH via certain informational content, than for the AI to reliably calculate the human's HCH. But I don't have strong arguments for this intuition.

"Having a well-calibrated estimate of HCH" is the condition you want, not "being able to reliably calculate HCH".

I should have said "reliably estimate HCH"; I'd also want quite a lot of precision in addition to calibration before I trust it.