A simple, weak notion of corrigibility is having a "complete" feedback interface. In logical induction terms, I mean the AI trainer can insert any trader into the market. I want to contrast this with "partial" feedback, in which only some propositions get feedback and others ("latent" propositions) form the structured hypotheses which help predict the observable propositions -- for example, RL, where only rewards and sense-data are observed.

(Note: one might think that the ability to inject traders into LI is still "incomplete" because traders can give feedback only on the propositions themselves, not on other traders; so the trader weights constitute "latents" being estimated. However, a trader can effectively vote against another trader by computing all that trader's trades and counterbalancing them. Of course, we can also more directly facilitate this, EG by giving the user the ability to directly modify trader weights, and even giving traders an enhanced ability to bet on each other's weights.)
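
To make the counterbalancing idea concrete, here is a minimal sketch. It is a toy model, not the logical induction formalism: real LI traders are wealth-weighted and trade via continuous functions of the price history, whereas here a trader is just a function from current prices to share purchases. The names `optimist`, `counterbalance`, and `net_demand` are made up for illustration.

```python
from typing import Callable, Dict, Iterable

Prices = Dict[str, float]   # proposition -> current market price in [0, 1]
Trades = Dict[str, float]   # proposition -> shares bought (negative = sold)
Trader = Callable[[Prices], Trades]


def optimist(prices: Prices) -> Trades:
    """Hypothetical trader: buys one share of anything priced below 0.9."""
    return {prop: 1.0 for prop, price in prices.items() if price < 0.9}


def counterbalance(target: Trader) -> Trader:
    """Build a trader that computes the target's trades and takes the opposite side."""
    def opposing(prices: Prices) -> Trades:
        return {prop: -shares for prop, shares in target(prices).items()}
    return opposing


def net_demand(traders: Iterable[Trader], prices: Prices) -> Trades:
    """Sum the trades of all traders at the given prices."""
    total: Trades = {}
    for trader in traders:
        for prop, shares in trader(prices).items():
            total[prop] = total.get(prop, 0.0) + shares
    return total


prices = {"sky_is_blue": 0.7, "sky_is_green": 0.2, "one_plus_one_is_two": 0.95}
market = [optimist, counterbalance(optimist)]
demand = net_demand(market, prices)  # every entry nets out to zero
```

In this toy setting the inserted trader cancels the target's influence on prices exactly. In a wealth-weighted market the cancellation would only be as strong as the relative budgets allow, which is one reason directly editable trader weights might be a more convenient interface.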

Why is this close to corrigibility?

The idea is that the trainer can enact "any" modification they'd like to make to the system as a trader. In some sense (which I need to articulate better), the system doesn't have any incentive to avoid this feedback.

For example, if the AI predicts that the user will soon give it the feedback that staying still and doing nothing is best, then it will immediately start staying still and doing nothing. If this is undesirable, the user can instead plan to give the feedback that the AI should "start staying still from now forward until I tell you otherwise" or some such.

This is not to say that the AI universally tries to update in whatever direction it anticipates the users might update it towards later. This is not like the RL setting, where there is no way for trainers to give feedback ruling out the "whatever the user will reward is good" hypothesis. The user can and should give feedback against this hypothesis!

The AI system accepts all previous feedback, but it may or may not trust anticipated future feedback. In particular, it should be trained not to trust feedback it would get by manipulating humans (so that it doesn't see itself as having an incentive to manipulate humans to give specific sorts of feedback).

I will call this property of feedback "legitimacy". The AI has a notion of when feedback is legitimate, and it needs to work to keep feedback legitimate (by not manipulating the human).

It's still the case that if a hypothesis has enough initial weight in the system, and it buys a pattern of propositions which end up (causally) manipulating the human trainer to reinforce that pattern of propositions, such a hypothesis can tend to gain influence in the system. What I'm doing here is "splitting off" this problem from corrigibility, in some sense: this is an inner-optimizer problem. In order for this approach to corrigibility to be safe, the trainer needs to provide feedback against such inner-optimizers. 

(Again, this is unlike the RL setting: in RL, hypotheses have a uniform incentive to get reward. For systems with complete feedback, different hypotheses are competing for different kinds of positive feedback. Still, this self-enforcing behavior needs to be discouraged by the trainer.)

This is not by any means a sufficient safety condition, since so much depends on the trainer being able to provide feedback against manipulative hypotheses, and train the system to have a robust concept of legitimate vs illegitimate feedback.

Instead, the argument is that this is a necessary safety condition in some sense. Systems with incomplete feedback will always have undesirable (malign) hypotheses which cannot be ruled out by feedback. For RL, this includes wireheading hypotheses (hypotheses which predict high reward from taking over control of the reinforcement signal) and human-manipulation hypotheses (hypotheses which predict high reward from manipulating humans to give high reward). For more exotic systems, this includes the "human simulator" failure mode which Paul Christiano detailed in the ELK report.

Note that this notion of corrigibility applies to both agentic and nonagentic systems. The AI system could be trained to act agentically or otherwise.

Two open technical questions wrt this:

  • What learning-theoretic properties can we guarantee for systems with complete feedback? In something like Solomonoff Induction, we get good learning-theoretic properties on the observable bits by virtue of the structured prior we're able to build out of the latent bits. The "complete feedback" idea relies on getting good learning-theoretic properties with respect to everything. I think a modification of the Logical Induction Criterion will work here.
  • Can we simultaneously prevent a self-modification incentive (where the system self-modifies to ignore future feedback which it considers corrupt, IE illegitimate -- this would be very bad in cases where the system is wrong about legitimacy) while also avoiding a human-manipulation incentive (counting manipulation of humans as a form of corrupt feedback)?



I feel that this post would benefit from having the math spelled out. How is inserting a trader a way to do feedback? Can you phrase classical RL like this?

Yeah, I totally agree. This was initially a quick private message to someone, but I thought it was better to post it publicly despite the inadequate explanations. I think the idea deserves a better write-up.

This is a confusing post from my perspective, because I think of LI as being about beliefs and corrigibility being about desires.

If I want my AGI to believe that the sky is green, I guess it’s good if it’s possible to do that. But it’s kinda weird, and not a central example of corrigibility.

Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too? If so, well, I’m generally very skeptical of attempts to do that kind of thing. See here, especially Section 7. In the case of humans, it’s perfectly possible for a plan to seem desirable but not plausible, or for a plan to seem plausible but not desirable. I think there are very good reasons that our brains are set up that way.

Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too? 

No. LI defines a notion of logically uncertain variable, which can be used to represent desires. There are also other ways one could build agents out of LI, such as doing the active inference thing.

As I mentioned in the post, I'm agnostic about such things here. We could be building """purely epistemic""" AI out of LI, or we could be deliberately building agents. It doesn't matter very much, in part because we don't have a good notion of purely epistemic:

  • Any learning system with a sufficiently rich hypothesis space can potentially learn to behave agentically (whether we want it to or not, until we have anti-inner-optimizer tech), so we should still have corrigibility concerns about such systems.
  • In my view, beliefs are a type of decision (not because we smoosh beliefs and values together, but rather because beliefs can have impacts on the world if the world looks at them) which means we should have agentic concerns about beliefs.
  • Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).

How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?

[I learned the term teleosemantics from you!  :) ]

The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false.

LI defines a notion of logically uncertain variable, which can be used to represent desires

I would say that they don’t really represent desires. They represent expectations about what’s going to happen, possibly including expectations about an AI’s own actions.

And then you can put the LI into a larger system that follows the rule: whatever the expectations are about the AI’s own actions, make that actually happen.

The important thing that changes in this situation is that the convergence of the algorithm is underdetermined—you can have multiple fixed points. I can expect to stand up, and then I stand up, and my expectation was validated. No update. I can expect to stay seated, and then I stay seated, and my expectation was validated. No update.
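
A minimal sketch of that underdetermination, under made-up assumptions: a single scalar expectation `p` that "I stand up", a wrapper that enacts whichever action is expected (threshold 0.5), and a simple move-toward-the-outcome update standing in for the real learning rule. None of this is LI itself; the function name `run` and the parameters are hypothetical.

```python
def run(initial_expectation: float, steps: int = 50, lr: float = 0.5) -> float:
    """Return the expectation of 'I stand up' after repeated predict-act-update rounds."""
    p = initial_expectation
    for _ in range(steps):
        stands_up = 1.0 if p > 0.5 else 0.0   # the wrapper enacts the expected action
        p += lr * (stands_up - p)             # move the expectation toward what happened
    return p


expects_standing = run(0.9)   # settles near 1.0: "I expect to stand, and I do"
expects_sitting = run(0.1)    # settles near 0.0: "I expect to sit, and I do"
```

Both endpoints are self-validating: starting at exactly 1.0 or 0.0 produces no update at all. Which one the system ends up in depends only on where it started, and the epistemic update rule has nothing to say about which is preferable.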

(I don’t think I’m saying anything you don’t already know well.)

Anyway, if you do that, then I guess you could say that the LI’s expectations “can be used” to represent desires … but I maintain that that’s a somewhat confused and unproductive way to think about what’s going on. If I intervene to change the LI variable, it would be analogous to changing habits (what do I expect myself to do ≈ which action plans seem most salient and natural), not analogous to changing desires.

(I think the human brain has a system vaguely like LI, and that it resolves the underdetermination by a separate valence system, which evaluates expectations as being good vs bad, and applies reinforcement learning to systematically seek out the good ones.)

beliefs can have impacts on the world if the world looks at them

…Indeed, what I said above is just a special case. Here’s something more general and elegant. You have the core LI system, and then some watcher system W, which reads off some vector of internal variables V of the core LI system, and then W takes actions according to some function A(V).

After a while, the LI system will automatically catch onto what W is doing, and “learn” to interpret V as an expectation that A(V) is going to happen.
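
As a toy illustration of that "catching on", assume V is a single bit, W's rule A(V) is a fixed lookup, and the learner is a plain frequency table rather than anything like a logical inductor. The names `watcher_action` and `learn_interpretation` are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict


def watcher_action(v: int) -> str:
    """Hypothetical A(V): the watcher W acts on the internal variable it reads."""
    return "stand" if v == 1 else "sit"


def learn_interpretation(rounds: int = 200, seed: int = 0) -> Dict[int, str]:
    """Count which action follows each value of V, then read off the most common one."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)          # counts[v][action] = times action followed V == v
    for _ in range(rounds):
        v = rng.randint(0, 1)              # stand-in for whatever value V takes this round
        counts[v][watcher_action(v)] += 1
    # The learned "meaning" of each value of V is the action it reliably precedes.
    return {v: c.most_common(1)[0][0] for v, c in counts.items()}


interpretation = learn_interpretation()
```

Nothing in the learner knows about W in advance. The interpretation of V as "an expectation that A(V) will happen" falls out of the observed regularity, whether W is a subsystem of the AI or a human reading the variable.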

I think the central case is that W is part of the larger AI system, as above, leading to normal agent-like behavior (assuming some sensible system for resolving the underdetermination). But in theory W could also be humans peeking into the LI system and taking actions based on what they see. Fundamentally, these aren’t that different.

So whatever solution we come up with to resolve the underdetermination, whether human-brain-like “valence” or something else, that solution ought to work for the humans-peeking-into-the-LI situation just as it works for the normal W-is-part-of-the-larger-AI situation.

(But maybe weird things would happen before convergence. And also, if you don’t have any system at all to resolve the underdetermination, then probably the results would be weird and hard to reason about.)

Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).

I’m not sure that this is coming from a coherent threat model (or else I don’t follow).

  • If Dr. Evil trains his own AGI, then this whole thing is moot, because he wants the AGI to have accurate beliefs about bioweapons.
  • If Benevolent Bob trains the AGI and gives API access to Dr. Evil, then Bob can design the AGI to (1) have accurate beliefs about bioweapons, and (2) not answer Dr. Evil’s questions about bioweapons. That might ideally look like what we’re used to in the human world: the AGI says things because it wants to say those things, all things considered, and it doesn’t want Dr. Evil to build bioweapons, either directly or because it’s guessing what Bob would want.

How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?

You can define it that way, but then I don't think it's highly relevant for this context. 

The story I'm telling here is that partial feedback (typically: learning some sort of input-output relation via some sort of latents) always leaves us with undesired hypotheses which we can't rule out using the restricted feedback mechanism.

  • Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.
  • Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem).

Any sufficiently rich hypothesis space has agentic policies, which can't be ruled out by the feedback. "Purely epistemic" in your sense filters for hypotheses which make good predictions, but this doesn't constrain things to be non-agentic. The system can learn to use predictions as actions in some way.

[I learned the term teleosemantics from you!  :) ]

I think it would be fair to define a teleosemantic notion of "purely epistemic" as something like "there is no optimization (anywhere in the system -- 'inner' or 'outer') except optimization for epistemic accuracy". 

The obvious application of my main point is that some form of "complete feedback" is a necessary (but insufficient) condition for this. 

"Epistemic accuracy" here has to be defined in such a way as to capture the one-way "direction-of-fit" optimization of the map to fit the territory, but never the territory to fit the map. IE the optimization algorithm has to ignore the causal impact of its predictions.

However, I don't particularly endorse this as the correct design choice -- although a system with this property would be relatively safe in the sense of eliminating inner-alignment concerns and (in a sense) outer-alignment concerns, it is doing so by ignoring its impact on the world, which creates its own set of dangers. If such a system were widely deployed and became highly trusted for its predictions, it could stumble into bad self-fulfilling prophecies.

So, in my view, "epistemic" systems should be as transparent as possible with human users about possible multiple-fixed-point issues, try to keep humans in the loop and give the important decisions to humans; but ultimately, we need to view even "purely epistemic" systems as making some important (instrumental) decisions, and have them take some responsibility for making those decisions well instead of poorly.

The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false.

I was going to remind you that the paper didn't say how the fixed-point selection works, and we can do that part in an agentic way, but then you go on to say basically the same thing (with the caveat that where you say "put the LI into a larger system that follows the rule: whatever the expectations are about the AI's own actions, make that actually happen" I would say the more general "put the LI into an environment which somehow reacts to its predictions"):

LI defines a notion of logically uncertain variable, which can be used to represent desires

I would say that they don’t really represent desires. They represent expectations about what’s going to happen, possibly including expectations about an AI’s own actions.

And then you can put the LI into a larger system that follows the rule: whatever the expectations are about the AI’s own actions, make that actually happen.

The important thing that changes in this situation is that the convergence of the algorithm is underdetermined—you can have multiple fixed points. I can expect to stand up, and then I stand up, and my expectation was validated. No update. I can expect to stay seated, and then I stay seated, and my expectation was validated. No update.

(I don’t think I’m saying anything you don’t already know well.)

Anyway, if you do that, then I guess you could say that the LI’s expectations “can be used” to represent desires … but I maintain that that’s a somewhat confused and unproductive way to think about what’s going on. If I intervene to change the LI variable, it would be analogous to changing habits (what do I expect myself to do ≈ which action plans seem most salient and natural), not analogous to changing desires.

(I think the human brain has a system vaguely like LI, and that it resolves the underdetermination by a separate valence system, which evaluates expectations as being good vs bad, and applies reinforcement learning to systematically seek out the good ones.)

I don't understand what you're trying to accomplish in these paragraphs. To me you sound sorta like Bob in the following:

Alice: Here's my computer model of an agent.

Bob: Uh oh, that sounds sort of like active inference. How did you represent values? Did you confuse them with beliefs?

Alice: I used floating-point numbers to represent the expected value of a state. Here, look at my code. It's a Q-learning algorithm.

Bob: You realize that "expected value" is a statistics thing, right? That makes it epistemic, not really value-laden in the sense that makes something agentic. It's a prediction of what a number will be. Indeed, we can justify expected values as min-quadratic-loss estimates. That makes them epistemic!

Alice: Well, I agree that expected values aren't automatically "values" in the agentic sense, but look, my code can solve mazes and stuff -- like a rat learning to get cheese. 

Bob: Of course I agree that it can be used in an instrumental way, but that's a really misleading way to describe it overall, right? If you changed one Q-value estimated by the system, that would be analogous to changing habits, not desires, right?

Alice: um??? If we agree that it can be used in an instrumental way, then what are you saying is misleading?

Bob: I mean, sure, the human brain does something like this.

Alice: Ok??

It seems possible that you think we have some disagreement that we don't have?

Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).

I’m not sure that this is coming from a coherent threat model (or else I don’t follow).

  • If Dr. Evil trains his own AGI, then this whole thing is moot, because he wants the AGI to have accurate beliefs about bioweapons.
  • If Benevolent Bob trains the AGI and gives API access to Dr. Evil, then Bob can design the AGI to (1) have accurate beliefs about bioweapons, and (2) not answer Dr. Evil’s questions about bioweapons. That might ideally look like what we’re used to in the human world: the AGI says things because it wants to say those things, all things considered, and it doesn’t want Dr. Evil to build bioweapons, either directly or because it’s guessing what Bob would want.

I'm not clear on what you're trying to disagree with here. It sounds like we both agree that if Benevolent Bob builds a powerful "purely epistemic system" (by whatever definition), without limiting its knowledge, then Dr. Evil can misuse it; and we both agree that as a consequence of this, it makes sense to instead build some agency into the system, so that the system can decide not to give users dangerous information.

Possibly you disagree with the claim "it is easy to build agentlike things out of belieflike things"? What I have in mind is a powerful epistemic oracle. As a simple example, let's say it can give highly accurate guesses to mathematically-posed problems. Then Dr. Evil can implement AIXI by feeding in AIXI's mathematical definition, for example. This is the sort of thing I had in mind, but generalized to the nonmathematical case. (EG, "conditional on my owning a super-powerful death ray soon, what actions do I take now")
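
A deliberately tiny sketch of the wrapper pattern, far simpler than the AIXI example: assume an oracle that only predicts how well an action turns out, and a user who wraps it in an argmax. The oracle, its scores, and the action names are all invented for illustration.

```python
from typing import Callable, Sequence

Oracle = Callable[[str], float]  # purely predictive: action -> predicted outcome score


def toy_oracle(action: str) -> float:
    """Hypothetical oracle: answers 'how well does this action turn out?'."""
    predicted_score = {"wait": 0.1, "explore": 0.4, "grab_cheese": 0.9}
    return predicted_score[action]


def agent_from_oracle(oracle: Oracle, actions: Sequence[str]) -> str:
    """Wrap a pure predictor in an argmax over actions to get goal-directed choice."""
    return max(actions, key=oracle)


chosen = agent_from_oracle(toy_oracle, ["wait", "explore", "grab_cheese"])
```

The oracle never "wants" anything, yet the two-line wrapper behaves like an agent pursuing whatever the score measures. The only place to put safeguards against misuse is in how the oracle answers.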

Hmm, I think the point I’m trying to make is: it’s dicey to have a system S that’s being continually modified to systematically reduce some loss L, but then we intervene to edit S in a way that increases L. We’re kinda fighting against the loss-reducing mechanism (be it gradient descent or bankroll-changes or whatever), hoping that the loss-reducing mechanism won’t find a “repair” that works around our interventions.

In that context, my presumption is that an AI will have some epistemic part S that’s continually modified to produce correct objective understanding of the world, including correct anticipation of the likely consequences of actions. The loss L for that part would probably be self-supervised learning, but could also include self-consistency or whatever.

And then I’m interpreting you (maybe not correctly?) as proposing that we should consider things like making the AI have objectively incorrect beliefs about (say) bioweapons, and I feel like that’s fighting against this L in that dicey way.

Whereas your Q-learning example doesn’t have any problem with fighting against a loss function, because Q(S,A) is being consistently and only updated by the reward.
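
For reference, a minimal tabular Q-learning sketch in which the reward-driven update is the only thing that ever changes Q. The corridor environment, hyperparameters, and function name are assumptions chosen for illustration, not anything from Alice's hypothetical code.

```python
import random
from typing import Dict, Tuple


def q_learning(n_states: int = 5, episodes: int = 300, alpha: float = 0.5,
               gamma: float = 0.9, epsilon: float = 0.2,
               seed: int = 0) -> Dict[Tuple[int, int], float]:
    """Tabular Q-learning on a corridor: states 0..n_states-1, cheese at the right end."""
    rng = random.Random(seed)
    actions = (-1, +1)  # step left or step right
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            if rng.random() < epsilon:
                a = rng.choice(actions)                        # explore
            else:
                a = max(actions, key=lambda act: q[(s, act)])  # exploit current estimates
            s_next = min(max(s + a, 0), n_states - 1)          # walls at both ends
            reward = 1.0 if s_next == n_states - 1 else 0.0
            best_next = max(q[(s_next, act)] for act in actions)
            # The only update rule: move Q(s, a) toward reward plus discounted future value.
            q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
            s = s_next
    return q


q = q_learning()
greedy_policy = {s: max((-1, +1), key=lambda act: q[(s, act)]) for s in range(4)}
```

Each Q-value here is an estimate of expected discounted reward, and the greedy policy heads right toward the cheese in every state. Hand-editing a single entry of `q` would not conflict with any separate accuracy loss, because no such loss exists in this setup.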

The above is inapplicable to LLMs, I think. (And this seems tied IMO to the fact that LLMs can’t do great novel science yet etc.) But it does apply to FixDT.

Specifically, for things like FixDT, if there are multiple fixed points (e.g. I expect to stand up, and then I stand up, and thus the prediction was correct), then whatever process you use to privilege one fixed point over another, you’re not fighting against the above L (i.e., the “epistemic” loss L based on self-supervised learning and/or self-consistency or whatever). L is applying no force either way. It’s a wide-open degree of freedom.

(If your response is “L incentivizes fixed-points that make the world easier to predict”, then I don’t think that’s a correct description of what such a learning algorithm would do.)

So if your feedback proposal exclusively involves a mechanism privileging one fixed point over another, then I have no complaints, and would describe it as choosing a utility function (preferences not beliefs) within the FixDT framework.

Btw I think we’re in agreement that there should be some mechanism privileging one fixed point over another, instead of ignoring it and just letting the underdetermined system do whatever it does.

Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem). … Any sufficiently rich hypothesis space has agentic policies, which can't be ruled out by the feedback.

Oh, I want to set that problem aside because I don’t think you need an arbitrarily rich hypothesis space to get ASI. The agency comes from the whole AI system, not just the “epistemic” part, so the “epistemic” part can be selected from a limited model class, as opposed to running arbitrary computations etc. For example, the world model can be “just” a Bayes net, or whatever. We’ve talked about this before.

Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.

I also learned the term observation-utility agents from you :) You don’t think that can solve those problems (in principle)?

I’m probably misunderstanding you here and elsewhere, but enjoying the chat, thanks :)