I can think of two problems:
Re #2, I think this is an important objection to low-impact-via-regularization-penalty in general.
Re #1, an obvious set of questions to include in the distribution over q is questions of approval for various aspects of the AI's policy. (In particular, if we want the AI to later calculate a human's HCH and ask it for guidance, then we would like to be sure that HCH's answer to that question is not manipulated.)
Where does x come from in this setup? Is it arbitrary given a, so that the condition only guarantees that a can manipulate you only in ways that it could also have done via giving you information (if I could persuade you via some argument to do what I want, then it's okay for me to just pull a gun on you if it puts you into the same state)? Or would you have some additional assumption about the information in x being a "reasonable equivalent" of a?
There's the additional objection of "if you're doing this, why not just have the AI ask HCH what to do?"
Overall, I'm hoping that it could be easier for an AI to robustly conclude that a certain plan only changes a human's HCH via certain informational content, than for the AI to reliably calculate the human's HCH. But I don't have strong arguments for this intuition.
A half-baked idea that came out of a conversation with Jessica, Ryan, and Tsvi:
We'd like to have a straightforward way to define "manipulation", so that we could instruct an AI not to manipulate its developers, or construct a low-impact measure that takes manipulation as a particularly important impact.
We could initially define manipulation in terms of a human's expected actions, or more robustly, in terms of effects on a human's policy distribution across a wide array of plausible environments. However, we'd like to have our AI still be able to tell us information (in a non-manipulative manner) instead of hiding from us in an effort to avoid all influence!
The title of course spoils the next idea: if the AI can reason about some suitable model of HCH, then we can define the notion of "action a has very low influence on a human, as compared to the null action, apart from conveying information x": that over a distribution of questions q,
HCH(q) | a ≈ HCH(x, q) | null
where HCH is defined relative to that human; we're conditioning the distribution on whether the AI takes action a or the null action; and x,q is the input consisting of statement x followed by question q.
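To make the shape of this condition concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration and not part of the proposal itself: the `hch(input_text, condition)` interface, the choice of total variation distance as the meaning of "≈", taking the worst case over a finite sample of questions instead of an expectation over a distribution, the threshold `eps`, and the hand-written `toy_hch` model with its "tell" and "scare" actions.

```python
from typing import Callable, Dict, Iterable

Dist = Dict[str, float]  # maps each possible answer to its probability

# Hypothetical interface: hch(input_text, condition) returns a distribution
# over answers, where `condition` names the action the AI took (or NULL).
HCHModel = Callable[[str, str], Dist]

NULL = "null"


def total_variation(p: Dist, q: Dist) -> float:
    """Total variation distance between two finite distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


def influence_gap(hch: HCHModel, action: str, x: str,
                  questions: Iterable[str]) -> float:
    """Worst-case distance between HCH(q)|a and HCH(x,q)|null over questions."""
    return max(
        total_variation(hch(q, action), hch(f"{x}\n{q}", NULL))
        for q in questions
    )


def is_low_influence(hch: HCHModel, action: str, x: str,
                     questions: Iterable[str], eps: float = 0.05) -> bool:
    """True if action `a` changes HCH's answers only as statement `x` would."""
    return influence_gap(hch, action, x, list(questions)) <= eps


def toy_hch(input_text: str, condition: str) -> Dist:
    """Hand-written stand-in for a model of HCH (purely illustrative)."""
    if condition == "scare":
        # The human was frightened: answers shift further than the
        # information alone would justify.
        return {"yes": 0.1, "no": 0.9}
    informed = "the bridge is out" in input_text or condition == "tell"
    if informed:
        return {"yes": 0.3, "no": 0.7}
    return {"yes": 0.8, "no": 0.2}


questions = ["Should we drive over the bridge?"]
x = "the bridge is out"

tell_ok = is_low_influence(toy_hch, "tell", x, questions)
scare_ok = is_low_influence(toy_hch, "scare", x, questions)
scare_gap = influence_gap(toy_hch, "scare", x, questions)
```

In this toy model, simply telling the human that the bridge is out passes the check, because the resulting answers match what HCH would say when given x as text under the null action. The "scare" action fails it, because it moves the answers beyond what x alone accounts for. A real version would need an actual model of HCH and a principled choice of question distribution and distance, neither of which this sketch supplies.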
This of course does not exclude the use of manipulative statements x, but it could at least allow us to restrict the possible forms of manipulation to those that could be carried out through the text input to HCH.
I'd prefer to have the AI reason about HCH rather than just (e.g.) the human's actions in a one-hour simulation, because HCH can in principle capture a human's long-term and extrapolated preferences, and these are the ones I most want to ensure don't get manipulated.
Is there an obvious failure of this approach, an obvious improvement to it, or something simpler that it reduces to?