ARC's first technical report: Eliciting Latent Knowledge

ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.

The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification (and other similar statements). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.

The report is available here as a Google document. If you're excited about this research, we're hiring!

Q&A

We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.

Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.

Here’s an attempt at condensing an issue I’m hung up on currently with ELK. This also serves as a high-level summary that I’d welcome poking at in case I’m getting important parts wrong.
 

The setup for ELK is that we’re trying to accurately label a dataset of (observation, action, predicted subsequent observation) triples for whether the actions are good. (The predicted subsequent observations can be optimised for accuracy using automated labels - what actually gets observed subsequently - whereas the actions need their labels to come from a source of judgement about what’s good, e.g., a human rater.)

The basic problem is partial observability: the observations don’t encapsulate “everything that’s going on”, so the labeller can’t distinguish good states from bad states that look good. An AI optimising actions for positive labels (and predicted observations for accuracy) may end up preferring to reach bad states that look good over good states, because controlling the observation is easier than controlling the rest of the state and because directly predicting what observations will get positive labels is easier than (what we’d want instead) inferring what states the positive labe...
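
To make the partial-observability point concrete, here is a minimal sketch of a toy world. It is not from the report: the two-bit state, `observe`, `is_good`, and `human_label` are all hypothetical names chosen for illustration. It only shows that a label computed from the observation cannot separate a good state from a bad state that looks good.

```python
from itertools import product

# Hypothetical toy world: a state is (diamond_present, camera_tampered).
STATES = list(product([True, False], repeat=2))

def observe(state):
    """The camera shows a diamond if it is really there OR the feed is faked."""
    diamond_present, camera_tampered = state
    return diamond_present or camera_tampered

def is_good(state):
    """Ground truth the labeller cares about but cannot see directly."""
    diamond_present, _ = state
    return diamond_present

def human_label(state):
    """The labeller only sees the observation, so the label depends on it alone."""
    return observe(state)

# States that receive a positive label even though they are bad.
bad_but_looks_good = [s for s in STATES if human_label(s) and not is_good(s)]
print(bad_but_looks_good)
```

In this toy world the only such state is the one where the diamond is gone and the camera has been tampered with, which is exactly the case an optimiser for positive labels has no reason to avoid.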

4Ajeya Cotra
My understanding is that we are eschewing Problem 2, with one caveat -- we still expect to solve the problem if the means by which the diamond was stolen or disappeared could be beyond a human's ability to comprehend, as long as the outcome (that the diamond isn't still in the room) is still comprehensible. For example, if the robber used some complicated novel technology to steal the diamond and hack the camera, there would be many things about the state that the human couldn't understand even if the AI tried to explain it to them (at least without going over our compute budget for training). But nevertheless it would still be an instance of Problem 1 because they could understand the basic notion of "because of some actions involving complicated technology, the diamond is no longer in the room, even though it may look like it is."
3paulfchristiano
Echoing Mark and Ajeya: I basically think this distinction is real and we are talking about problem 1 instead of problem 2. That said, I don't feel like it's quite right to frame it as "states" that the human does or doesn't understand. Instead we're thinking about properties of the world as being ambiguous or not in a given state. As a silly example, you could imagine having two rooms where one room is normal and the other is crazy. Then questions about the first room are easy and questions about the second are hard. But in reality the degrees of freedom will be much more mixed up than that.
To give some more detail on my thoughts on state:
* Obviously the human never knows the "real" state, which has a totally different type signature than their beliefs.
* So it's natural to talk about knowing states based on correctly predicting what will happen in the future starting from that state. But it's ~never the case that the human's predictions about what will happen next are nearly as good as the predictor's.
* We could try to say "you can make good predictions about what happens next for typical actions" or something, but even for typical actions the human predictions are bad relative to the predictor, and it's not clear in what sense they are "good" other than some kind of calibration condition.
* If we imagine an intuitive translation between two models of reality, most "weird" states aren't outside of the domain of the translation, it's just that there are predictively important parts of the state that are obscured by the translation (effectively turning into noise, perhaps very surprising noise).
Despite all of that, it seems like it really is sometimes unambiguous to say "You know that thing out there in the world that you would usually refer to by saying 'the diamond is sitting there and nothing weird happened to it'? That thing which would lead you to predict that the camera will show a still frame of a diamond? That thing definitely happened, and is w
3Mark Xu
I think that problem 1 and problem 2 as you describe them are potentially talking about the same phenomenon. I'm not sure I'm understanding correctly, but I think I would make the following claims:
* Our notion of narrowness is that we are interested in solving the problem where the question we're asking is such that a state always resolves a question. E.g. there isn't any ambiguity around whether a state "really contains a diamond". (Note that there is ambiguity around whether the human could detect the diamond from any set of observations because there could be a fake diamond or nanobots filtering what the human sees). It might be useful to think of this as an empirical claim about diamonds.
* We are explicitly interested in solving some forms of problem 2, e.g. we're interested in our AI being able to answer questions about the presence/absence of diamonds no matter how alien the world gets. In some sense, we are interested in our AI answering questions the same way a human would answer questions if they "knew what was really going on", but that "knew what was really going on" might be a misleading phrase. I'm not imagining "knowing what is really going on" to be a very involved process; intuitively, it means something like "the answer they would give if the sensors are 'working as intended'". In particular, I don't think that, for the case of the diamond, "Further judgement, deliberation, and understanding is required to determine what the answer should be in these strange worlds."
* We want to solve these versions of problem 2 because the speed of "things getting weirder" in the world might be much faster than human ability to understand what's going on in the world. In these worlds, we want to leverage the fact that answers to "narrow" questions are unambiguous to incentivize our AIs to give humans a locally understandable environment in which to deliberate.
* We're not interested in solving forms of problem 2 where the human needs to do additional delib
3Charlie Steiner
I think this statement encapsulates some worries I have. If it's important how the human defines a property like "the same diamond," then assuming that the sameness of the diamond is "out there in the diamond" will get you into trouble - e.g. if there's any optimization pressure to find cases where the specifics of the human's model rear their head. Human judgment is laden with the details of how humans model the world; you can't avoid dependence on the human (and the messiness that entails) entirely. Or to phrase it another way: I don't have any beef with a narrow approach that says "there's some set of judgments for which the human is basically competent, and we want to elicit knowledge relevant to those judgments." But I'm worried about a narrow approach that says "let's assume that humans are basically competent for all judgments of interest, and keep assuming this until something goes wrong." It just feels to me like this second approach is sort of... treating the real world as if it's a perturbative approximation to the platonic realm.
1Ramana Kumar
This "there isn't any ambiguity"+"there is ambiguity" does not seem possible to me: these types of ambiguity are one and the same. But it might depend on what “any set of observations” is allowed to include. “Any set” suggests being very inclusive, but remember that passive observation is impossible. Perhaps the observations I’d want the human to use to figure out if the diamond is really there (presuming there isn’t ambiguity) would include observations you mean to exclude, such as disabling the filter-nanobots first? I guess a wrinkle here is that observations need to be “implementable” in the world. If we’re thinking of making observations as intervening on the world (e.g., to decide which sensors to query), then some observations may be inaccessible because we can’t make that intervention. Rewriting this all without relying on “possible”/“can” concepts would be instructive.
2paulfchristiano
I don't think we have any kind of precise definition of "no ambiguity." That said, I think it's easy to construct examples where there is no ambiguity about whether the diamond remained in the room, yet there is no sequence of actions a human could take that would let them figure out the answer. For example, we can imagine simple toy universes where we understand exactly what features of the world give rise to human beliefs about diamonds and where we can say unambiguously that the same features are/aren't present in a given situation. In general I feel a lot better about our definitions when we are using them to arbitrate a counterexample than if we were trying to give a formal definition. If all the counterexamples involved border cases of the concepts, where there was arguable ambiguity about whether the diamond really stayed in the room, then it would seem important to firm up these concepts but right now it feels like it is easy to just focus on cases where algorithms unambiguously fail. (That methodological point isn't obvious though---it may be that precise definitions are very useful for solving the problem even if you don't need them to judge current solutions as inadequate. Or it may be that actually existing counterexamples are problematic in ways we don't recognize. Pushback on these fronts is always welcome, but right now I feel pretty comfortable with the situation.)

ELK was one of my first exposures to AI safety. I participated in the ELK contest shortly after moving to Berkeley to learn more about longtermism and AI safety. My review focuses on ELK’s impact on me, as well as my impressions of how ELK affected the Berkeley AIS community.

Things about ELK that I benefited from

Understanding ARC’s research methodology & the builder-breaker format. For me, most of the value of ELK came from seeing ELK’s builder-breaker research methodology in action. Much of the report focuses on presenting training strategies and presenting counterexamples to those strategies. This style of thinking is straightforward and elegant, and I think the examples in the report helped me (and others) understand ARC’s general style of thinking.

Understanding the alignment problem. ELK presents alignment problems in a very “show, don’t tell” fashion. While many of the problems introduced in ELK have been written about elsewhere, ELK forces you to think through the reasons why your training strategy might produce a dishonest agent (the human simulator) as opposed to an honest agent (the direct translator). The interactive format helped me more deeply understand some of the...

I've written a bunch elsewhere about object-level thoughts on ELK. For this review, I want to focus instead on meta-level points.

I think ELK was very well-made; I think it did a great job of explaining itself with lots of surface area, explaining a way to think about solutions (the builder-breaker cycle), bridging the gap between toy demonstrations and philosophical problems, and focusing lots of attention on the same thing at the same time. In terms of impact on the growth and development of the AI safety community, I think this is one of the most important posts from 2021 (even though the prize and much of the related work happened in 2022).

I don't really need to ask for follow-on work; there's already tons, as you can see from the ELK tag.


I think the broad audience may underappreciate how old this problem is. I appreciate the appendix that gives credit to earlier thinking, though I don't think this erodes any of the credit Paul, Mark, and Ajeya should get for the excellent packaging.

[To the best of my knowledge, ELK is still an open problem, and one of the things that I appreciated about the significant focus on ELK specifically was helping give people better models of how quickly progress happens in this space, and what it looks like (or doesn't look like).]

Let me see if I am on the right page here.
 

Suppose I have a space of world states S, a transition function T : S → S, actions Action : S → S, and a many-to-one (non-injective) Camera : S → CameraState. Since Camera is (very) far from injective, seeing a particular camera image with happy people does not imply a happy world state, because many other situations involving nanobots or camera manipulation could have created that image.


This is important because my human evaluation function H : S → Boolean is defined on world states, not on CameraState directly.
When I look at the image with the fake happy people, I use a mocked up H' : CameraState → Boolean := λ cs. H(Camera⁻¹(cs)). The issue is that Camera⁻¹ points to many possible states, and in practice I might pick whichever state is a priori most likely according to a human distribution over world states Distₕ(S).
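
As a minimal sketch of this setup (not the commenter's code), the snippet below builds a three-state world in which Camera is many-to-one and H' evaluates whichever preimage state the human prior favours. The concrete states, the prior values in `PRIOR`, and the names `PREIMAGE` and `H_prime` are assumptions made only for illustration.

```python
from collections import defaultdict

# Hypothetical finite world: each state is (diamond_present, camera_hacked).
STATES = [(True, False), (False, False), (False, True)]

# Assumed human prior Dist_h(S); hacking is judged a priori unlikely.
PRIOR = {(True, False): 0.70, (False, False): 0.29, (False, True): 0.01}

def camera(state):
    """Camera : S -> CameraState. Many-to-one: a hack fakes the image."""
    diamond_present, camera_hacked = state
    return "diamond" if (diamond_present or camera_hacked) else "empty"

def H(state):
    """H : S -> bool. The human's judgement, defined on full states."""
    diamond_present, _ = state
    return diamond_present

# Camera^{-1}: map each image to the list of states that produce it.
PREIMAGE = defaultdict(list)
for s in STATES:
    PREIMAGE[camera(s)].append(s)

def H_prime(camera_state):
    """H' : CameraState -> bool, judging the a priori most likely preimage state."""
    most_likely = max(PREIMAGE[camera_state], key=PRIOR.get)
    return H(most_likely)

hacked = (False, True)
print(camera(hacked), H(hacked), H_prime(camera(hacked)))
```

Here H says the hacked state is bad, while H' approves of its image, because the most likely state behind that image is the honest one.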

The trick is that if I have a faithful model M : Action × CameraState → CameraState, I can back out hidden information about the state. The idea is that M must contain information about the true state, not just CameraState, in order to make accurate predictions.


The key idea is that M(a) acts like Camera ∘ T ∘ a ∘ Camera⁻¹, so we should be able to trace out which p...

6paulfchristiano
Everything seems right except I didn't follow the definition of the regularizer. What is L2? This is what we want to do, and intuitively you ought to be able to back out info about the hidden state, but it's not clear how to do so. All of our strategies involve introducing some extra structure, the human's model, with state space S_H, where the map Camera_H : S_H → CameraState also throws out a lot of information. The setup you describe is very similar to the way it is presented in Ontological crises. ETA: also we imagine H : S_H → CameraState, i.e. the underlying state space may also be different. I'm not sure any of the state mismatches matters much unless you start considering approaches to the problem that actually exploit structure of the hidden space used within M though.
4davidad
Here's an approach I just thought of, building on scottviteri's comment. Forgive me if there turns out to be nothing new here.
Supposing that the machine and the human are working with the same observation space ($O := \mathrm{CameraState}$) and action space ($A := \mathrm{Action}$), then the human's model $H : S_H \to A \to P(O \times S_H)$ and the machine's model $M : S_M \to A \to P(O \times S_M)$ are both coalgebras of the endofunctor $F := \lambda X.\, A \to P(O \times X)$, therefore both have a canonical morphism into the terminal coalgebra of $F$, $X :\cong FX$ (assuming that such an $X$ exists in the ambient category). That is, we can map $S_H \to X$ and $S_M \to X$. Then, if we can define a distance function on $X$ with type $d_X : X \times X \to \mathbb{R}_{\geq 0}$, we can use these maps to define distances between human states and machine states, $d : S_H \times S_M \to \mathbb{R}_{\geq 0}$.
How can we make use of a distance function? Basically, we can use the distance function to define a kernel (e.g. $K(x,y) = \exp(-\beta\, d_X(x,y))$), and then use kernel regression to predict the utility of states in $S_M$ by averaging "nearby" states in $S_H$, and then finally (and crucially) estimating the generalization error so that states from $S_M$ that aren't really near to anywhere in $S_H$ get big warning flags (and/or utility penalties for being outside a trust region).
How to get such a distance function? One way is to use CMet (the category of complete metric spaces) as the ambient category, and instantiate $P$ as the Kantorovich monad. Crank-turning yields the formula
$$d_X(s_H, s_M) = \sup_{a : A}\ \sup_{U : O \times X \rightarrowtail \mathbb{R}} \left| \mathbb{E}_{o, s'_H \sim H(s_H)(a)}\, U(o, s'_H) - \mathbb{E}_{o, s'_M \sim M(s_M)(a)}\, U(o, s'_M) \right|$$
where $U$ is constrained to be a non-expansive map, i.e., it is subject to the condition $|U(o_1, s_1) - U(o_2, s_2)| \leq \max\{d_O(o_1, o_2),\, d_X(s_1, s_2)\}$. If $O$ is discrete, I think this is maybe equivalent to an adversarial game where the adversary chooses, for every possible $s_H$ and $s_M$, a partition of $O$ and a next action, and optimizes the probability that sampled predictions from $H$ and $M$ will eventually predict observations on opposite sides of the partition. This distance function is canonical, but in some sense seems too strict: if M k
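
The kernel-regression step described in this comment can be sketched as follows. This is a rough illustration under stated assumptions, not the commenter's implementation: the distances are taken as given inputs rather than computed from the coalgebra construction, the estimator is a plain Nadaraya-Watson average, and the `min_mass` threshold is a crude hypothetical stand-in for "estimating the generalization error".

```python
import math

def kernel(distance, beta=1.0):
    """K(x, y) = exp(-beta * d(x, y)), written as a function of the distance."""
    return math.exp(-beta * distance)

def predict_utility(distances, human_utilities, beta=1.0, min_mass=0.1):
    """Kernel-weighted estimate of one machine state's utility.

    distances[i] is the distance from the machine state to the i-th labelled
    human state, and human_utilities[i] is that human state's utility.
    Returns (estimate, trusted). trusted is False when the total kernel mass
    is below min_mass, i.e. the machine state is not near any labelled state.
    """
    weights = [kernel(d, beta) for d in distances]
    mass = sum(weights)
    if mass == 0.0:  # every labelled state is effectively infinitely far away
        return 0.0, False
    estimate = sum(w * u for w, u in zip(weights, human_utilities)) / mass
    return estimate, mass >= min_mass

near = predict_utility([0.1, 2.0], [1.0, 0.0])  # close to a high-utility state
far = predict_utility([5.0, 6.0], [1.0, 0.0])   # far from everything labelled
print(near, far)
```

The second call still returns a number, but it comes back untrusted, which is the "warning flag" role the comment assigns to states outside the trust region.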
3paulfchristiano
Consider a state $s_M$ where the sensors have been tampered with in order to "look like" the human state $s_H$, i.e. we've connected the actuators and camera to a box which just simulates the human model (starting from $s_H$) and then feeds the predicted outputs of the human model to the camera. It seems to me like the state $s_M$ would have zero distance from the state $s_H$ under all of these proposals. Does that seem right? (I didn't follow all of the details of the example, and definitely not the more general idea.) (I first encountered this counterexample in Alignment for advanced machine learning systems. They express the hope that you can get around this by thinking about the states that can lead to the sensor-tampered state and making some kind of continuity assumption, but I don't currently think you can make that work and it doesn't look like your solution is trying to capture that intuition.)
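
A minimal sketch of why this counterexample bites, using deterministic toy models rather than the probabilistic ones in the proposal. Everything here is a hypothetical illustration: `human_step`, `machine_step`, the two actions, and the brute-force `max_disagreement` check (a stand-in for any distance that depends only on predicted observations, not the metric from the parent comment).

```python
from itertools import product

ACTIONS = ["wait", "open_door"]

def human_step(state, action):
    """Hypothetical human model: returns (observation, next_state)."""
    if state == "safe" and action == "open_door":
        return "empty", "stolen"
    if state == "safe":
        return "diamond", "safe"
    return "empty", "stolen"

def machine_step(state, action):
    """Hypothetical machine model over states (kind, inner).

    ("tampered", h) means the camera is wired to a box that simulates the
    human model from human state h, whatever happened to the real diamond.
    ("real", h) is an untampered state that behaves like human state h.
    """
    kind, inner = state
    obs, next_inner = human_step(inner, action)
    return obs, (kind, next_inner)

def observations(step, state, actions):
    """Roll a model forward and collect the observation sequence."""
    out = []
    for a in actions:
        obs, state = step(state, a)
        out.append(obs)
    return out

def max_disagreement(s_h, s_m, depth=4):
    """Most observation mismatches over all action sequences up to `depth`."""
    worst = 0
    for n in range(1, depth + 1):
        for seq in product(ACTIONS, repeat=n):
            pairs = zip(observations(human_step, s_h, seq),
                        observations(machine_step, s_m, seq))
            worst = max(worst, sum(x != y for x, y in pairs))
    return worst

print(max_disagreement("safe", ("tampered", "safe")))
```

The tampered state is observationally identical to the human's "safe" state for every action sequence, so no observation-based comparison can tell them apart, while an honestly "stolen" state is separated immediately.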
3davidad
(Thanks for playing along with me as 'breaker'!) I agree that such an $s_M$ would have zero distance from the corresponding $s_H$, but I have some counterpoints:
1. This is a problem for ELK in general, to the extent it's a problem (which I think is smallish-but-non-negligible). An M with this property is functionally equivalent to an $M'$ which actually believes that $s_M$ refers to the same state of the real world as $s_H$. So the dynamics of M's world-model don't contain any latent knowledge of the difference at this point.
   * This seems to be against the ELK report's knowledge-criterion "There is a feature of the computation done by the AI which is robustly correlated with Z."
   * The only way I can think of that ELK could claim to reliably distinguish $s_M$ from $s_H$ is by arguing that the only plausible way to get such an $M'$ is via a training trajectory where some previous $M_\theta$ did treat $s_M$ differently from $s_H$, and then performing ELK monitoring at training checkpoints (in which case I don't see reason to expect my approach comes off worse than others).
2. Such an $s_M$ would not be incentivized by the model. Assuming that rewards factor through O, $Q(s_M) = Q(s_H)$. So a policy that's optimized against the world-model M wouldn't have enough selection pressure to find the presumably narrow and high-entropy path that would lead to the tampered state from the initial state (assuming that the initial state in the real world at deployment is tamper-free).
3. In the real world, M has higher expected loss in $s_M$. If all the sensor inputs are generated by simulating the human model, M has completely blinded itself to potential threats that could disrupt the tampering and reveal a missing diamond. These real-world threats are independent of the threats that the tampering box would be stochastically simulating as part of the human model, either of which would produce observations with high loss. Thus, the real-world expected loss in $s_M$ is typically strictly higher than in $s_H$.
   * This seems to be
2paulfchristiano
I'm thinking of this in a family of proposals like:
* Some kinds of tampering can be easily detected (and so should get identified with states $s_H$ where tampering has occurred)
* Some other tampering can't be easily detected, but this undetectable tampering has important similarities with detectable tampering and we could use that to recognize it.
* In this case, we're going to try to exploit the fact that detectable tampering shares a prefix of actions/states with undetectable tampering (such that later states reached in that sequence have a much higher probability of detectable tampering, as well as undetectable tampering, and can be flagged as suspicious).
I currently don't believe this kind of thing can work (at least in the worst case), but I'm not at all confident. One thing worth mentioning (which makes me much more optimistic about this class of proposals): if your AI delegates to another AI who helps actually implement the tampering, that makes things harder but in some sense it also provides extra structure you can work with in a solution. Your AI needs to reason about that other AI's reasoning, and so you could hope (i) if your AI is reasoning about the other AI's reasoning in detail, then we can apply exactly the same machinery to that AI's predictive model in order to elicit its beliefs (which your AI obviously trusts) about what is going on, (ii) if your AI reasons abstractly about the other AI's reasoning, you could hope to treat that as a shortcut to reach the same conclusions and incentivize the model to give you the same answers that it would if it were reasoning explicitly (modulo its ability to figure it out). (I also think the single-AI case basically needs to be treated in the same way as that---I don't think in general your AI is going to be reasoning explicitly about its own future actions, it just needs to reason abstractly about its future self in a qualitatively similar way that it reasons about other AIs. Or at least in the
2paulfchristiano
The sense in which the model knows about the corruption is that it brought it about and reasoned about the nature of the sensor tampering in order to predict the transition to $s_M^{\text{corrupted}}$. The reason I'm concerned that it brings about this state is because the actual good state $s_M^{\text{good}}$ is much harder to access than $s_M^{\text{corrupted}}$ (e.g. because it requires achieving hard real-world goals). The intuition is that $s_M^{\text{corrupted}}$ has constant difficulty while $s_M^{\text{good}}$ gets harder and harder as we make the tasks more sophisticated. I'm OK having the human look at entire trajectories, so it's provisionally OK (though not the most satisfying) if $s_M^{\text{corrupted}}$ gets identified with $s_H^{\text{good}}$ as long as the preceding states look bad to the human and then they can do some kind of inference on top to figure out what happened. My main question is why those preceding states get mapped correctly. If the preceding state $s_M^{\text{preceding}}$ gets corrupted via a mechanism that isn't obvious to the human, and otherwise looks like a predecessor to $s_H^{\text{good}}$, then it seems like it should get identified with a predecessor of $s_H^{\text{good}}$. The only way I see this working is if the AI can "abort" its plan at the last minute and demonstrate a hack. But my main concerns are that (i) that doesn't seem true in general---the AI can tamper with sensors in ways humans don't understand starting from many initial states, and you can't distinguish the "almost-tampered" state naturally since it's just one more state that has this property, (ii) even if that happens it doesn't look to me like the proposed loss actually captures that if there are other similarly-severe predictive errors in the human's model, though if I wasn't worried about (i) I'd want to think about (ii) much more carefully and actually construct a counterexample.
3davidad
That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to $s_M^{\text{good}}$), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like "zero incentive to tamper, and tampering seems complicated" fail here. While some Ms may indeed predict this via reasoning, not all Ms that behave this way would, for example an M that internally modeled the tampering sequence of actions incorrectly as actually leading to $s_M^{\text{good}}$ (and didn't even model a distinct $s_M^{\text{corrupted}}$). I think either:
(A) it would be at least as apt to ascribe a confused model to M as to ascribe one in which it "reasoned about the nature of the sensor tampering" (e.g. if a contemporary model-free RL robot did some sensor tampering, I'd probably ascribe to it the belief that the tampering actually led to $s_M^{\text{good}}$), or
(B) M would correctly reason that its delusion box could be removed while it is blinded, meaning it would predict unlikely sudden transitions to other states (namely, states that M predicts obtain in the real world when the agent blindly performs actions that are optimal w.r.t. a randomly sampled trajectory in H, or null my-actuators-are-disconnected actions) with higher probability in the future of $s_M^{\text{corrupted}}$ than in the future of $s_H^{\text{good}}$ (making those states separable in $X$), or
(C) If the delusion box is guarded and maintained by a successor AI, so that the probability of its being removed or disabled is negligible, then $s_M^{\text{corrupted}}$ does get identified with $s_H^{\text{good}}$, but some other action sequence (of similar length) would lead from $s_M^{\text{preceding}}$ to $s_M^{\text{trippy}}$, a state in which bizarre observations appear forever that would be extremely unlikely at any state in $S_H$.
First, I don't think timing ("last minute", "almost-tampered") is critical here. If timing were critical, the 'breaker' could reparameterize the situation so that comp
2paulfchristiano
Even for "homeostatic" tasks I expect the difficulty to scale up as the environment becomes more complex (e.g. because you must defend against increasingly sophisticated attackers). There may be some upper bound where you can defend against arbitrarily sophisticated attackers with finite difficulty, but I don't know whether that's true or whether it would be higher or lower than the difficulty of sensor tampering. I agree that some M's would mistakenly expect a sequence of actions to lead to good outcomes, when they really lead to tampering. But planning against such M's couldn't lead to correct tampering (since the details that determine successful tampering are different from those that lead to good outcomes) and in some sense such M's also can't be expected to defend against tampering. So I don't currently think this is a big problem though I might well be missing something. I generally agree with this---in some sense this kind of "definitely no signals ever" tampering seems pretty unrealistic and it should be much easier to catch more realistic forms of tampering. Agree with this (modulo new counterexamples), but not yet clear we can exploit this fact to flag tampering as such. (E.g. can we recognize the relevant notion of similarity, or do we just conclude that every state can lead to $s_M^{\text{trippy}}$ and hence we have no idea what is going on?) This is exactly the kind of counterexample I would give because it is so clean. That said, I do also believe there are more realistic counterexamples (and I view the focus on this kind of example as mostly a methodological point so that we can do analysis without worrying about empirics). I'm less convinced by this. For example, suppose that my agent first builds a machine that sits between it and its actuators, and then filters out any actions that don't have an appropriate hash. Then I output a sequence of actions that tampers or constructs a successor. Here I am committing not to the hash of my successor, but to the hash
1scottviteri
Thank you for the fast response! By L₂ I meant the Euclidean norm, measuring the distance between two different predictions of the next CameraState. But actually I should have been using a notion of vector similarity such as the inner product, and also I'll unbatch the actions for clarity:
Recognizer' : Action × CameraState × M → Dist(S) := λ a, cs, M. softmax([⟨M(a, cs), (Camera ∘ T ∘ a)(hidden_state)⟩ ∀ hidden_state ∈ Camera⁻¹(cs)])
So the idea is to consider all possible hidden_states such that the Camera would display as the current CameraState cs, and create a probability distribution over those hidden_states, according to the similarity of M(a, cs) and (Camera ∘ T ∘ a)(hidden_state). Which is to say, how similar would the resulting CameraState be if I went the long way around, taking the hidden_state, applying my action, transition, and Camera functions.
Great, I'll take a look.
Right so I wasn't understanding the need for something like this, but now I think I see what is going on. I made an assumption above that I have some human value function H : S → Boolean. If I have some human internal state S_H, and I relax the human value function to H_V : S_H → Boolean, then the solution I have above falls apart, but here is another. Now the goal is to create a function F from the machine state to human state, so that the human value function will compose with F to take machine states as input. I am using all fresh variable names starting here.
S_H -- type of human knowledge
S_M -- type of machine knowledge
CameraState -- type of camera output
EyeState -- type of eye output
Inputs:
H_V : S_H → Boolean -- human value function
Camera : S → CameraState (very many-to-one)
Eye : S → EyeState (very many-to-one)
Predict_M : S_M × [CameraState] × [Action] → S_M -- machine prediction function (strong)
Predict_H : S_H × [EyeState] × [Action] → S_H -- human prediction function (weak)
Intermediates:
Recognizer_M : S_M → Dist S := Part2 ∘ Part1
Intuitively seems like
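
The Recognizer' definition earlier in this comment can be sketched on a toy world. This is only an illustration under strong assumptions: the two-state world, the `cut_power` action, the trivial `transition`, and the hard-coded `model` (standing in for the learned M) are all hypothetical, and the sketch assumes direct access to Camera and T, which is the very thing questioned in the reply.

```python
import math

# Hypothetical toy world: a state is (diamond_present, camera_hacked).
STATES = [(True, False), (False, True)]
SHOWS_DIAMOND, SHOWS_EMPTY = (1.0, 0.0), (0.0, 1.0)  # camera states as vectors

def camera(state):
    diamond_present, camera_hacked = state
    return SHOWS_DIAMOND if (diamond_present or camera_hacked) else SHOWS_EMPTY

def transition(state):
    return state  # assumed trivial dynamics: nothing changes on its own

def cut_power(state):
    """An action, modelled as a function S -> S: it disables any camera hack."""
    diamond_present, _ = state
    return (diamond_present, False)

def preimage(cs):
    """Camera^{-1}(cs): every hidden state that would display as cs."""
    return [s for s in STATES if camera(s) == cs]

def model(action, cs):
    """Stand-in for the learned M; here it predicts the screen goes empty."""
    return SHOWS_EMPTY

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def recognizer(action, cs, m):
    """Distribution over Camera^{-1}(cs), scored by <m(a, cs), Camera(T(a(h)))>."""
    predicted = m(action, cs)
    candidates = preimage(cs)
    scores = [sum(p * q for p, q in zip(predicted, camera(transition(action(h)))))
              for h in candidates]
    return dict(zip(candidates, softmax(scores)))

posterior = recognizer(cut_power, SHOWS_DIAMOND, model)
print(posterior)
```

Because the model predicts an empty screen after the power is cut, the hidden state in which the camera was hacked receives more probability than the honest one.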
2paulfchristiano
I didn't follow some parts of the new algorithm. Probably most centrally: what is Dist(S)? Is this the type of distributions over real states of the world, and if so how do we have access to the true map Camera : S → video? Based on that I likely have some other confusions, e.g. where are the camera_sequences and action_sequences coming from in the definition of Recognizer_M, what is the prior being used to define Camera⁻¹, and don't Recognizer_M and Recognizer_H effectively advance time a lot under some kind of arbitrary sequences of actions (making them unsuitable for exactly matching up states)?
1davidad
Nitpicks:
1. F should be Recognizer_H ∘ Recognizer_M, rather than Recognizer_M ∘ Recognizer_H.
2. In Recognizer_H, I don't think you can take the expected value of a stochastic term of type S_H, because S_H doesn't necessarily have convex structure. But, you could have Recognizer_H output Dist S_H instead of taking the ExpectedValue, and move the ExpectedValue into Win, and have Win output a probability rather than a Boolean.
Confusions:
1. Your types for Predict_M and Predict_H seem to not actually make testable predictions, because they output the opaque state types, and only take observations as inputs.
2. I'm also a bit confused about having them take lists of actions as a primitive notion. Don't you want to ensure that, say, (Predict_M s css (as1++as2)) = (Predict_M (Predict_M s css as1) as2)? If so, I think it would make sense to accept only one action at a time, since that will uniquely characterize the necessary behavior on lists.
3. I don't really understand Part1. For instance, where does the variable cs come from there?
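
The second confusion can be illustrated with a small sketch: if the predictor accepts one action at a time, its behaviour on lists is a left fold, and the concatenation law holds by construction. The names `predict_many` and `step` are hypothetical, and the camera-state argument (css) is dropped to keep the sketch short.

```python
from functools import reduce

def predict_many(step, state, actions):
    """Extend a one-action predictor step(state, action) to a list of actions."""
    return reduce(step, actions, state)

def step(state, action):
    """Hypothetical one-step predictor: the state records the actions seen."""
    return state + (action,)

as1, as2 = ["left", "right"], ["grab"]
whole = predict_many(step, (), as1 + as2)
split = predict_many(step, predict_many(step, (), as1), as2)
print(whole == split)
```

Any `step` function gives the same agreement between `whole` and `split`, since folding over a concatenated list is the same as folding over the first part and continuing from the result.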

(Note: I read an earlier draft of this report and had a lot of clarifying questions, which are addressed in the public version. I'm continuing that process here.)

I get the impression that you see most of the "builder" moves as helpful (on net, in expectation), even if there are possible worlds where they are unhelpful or harmful. For example, the "How we'd approach ELK in practice" section talks about combining several of the regularizers proposed by the "builder." It also seems like you believe that combining multiple regularizers would create a "stacking...

7paulfchristiano
This is because of the remark on ensembling---as long as we aren't optimizing for scariness (or diversity for diversity's sake), it seems like it's way better to have tons of predictors and then see if any of them report tampering. So adding more techniques improves our chances of getting a win. And if the cost of fine-tuning a reporter is small relative to the cost of training the predictor, we can potentially build a very large ensemble relatively cheaply. (Of course, having more techniques also helps because you can test many of them in practice and see which of them seem to really help.) This is also true for data---I'd be scared about generating a lot of riskier data, except that we can just do both and see if either of them reports tampering in a given case (since they appear to fail for different reasons). I believe this in a few cases (especially combining "compress the predictor," imitative generalization, penalizing upstream dependence, and the kitchen sink of consistency checks) but mostly the stacking is good because ensembling means that having more and more options is better and better. I don't think the kind of methodology used in this report (or by ARC more generally) is very well-equipped to answer most of these questions. Once we give up on the worst case, I'm more inclined to do much messier and more empirically grounded reasoning. I do think we can learn some stuff in advance but in order to do so it requires getting really serious about it (and still really wants to learn from early experiments and mostly focus on designing experiments) rather than taking potshots. This is related to a lot of my skepticism about other theoretical work. I do expect the kind of research we are doing now to help with ELK in practice even if the worst case problem is impossible. But the particular steps we are taking now are mostly going to help by suggesting possible algorithms and difficulties; we'd then want to give those as one input into that much messier
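The ensembling argument above can be sketched in a few lines of Python. This is only an illustration under a strong assumption that the comment does not make: that reporters catch a given instance of tampering independently of one another. The comment says only that the techniques "appear to fail for different reasons". The function names are hypothetical, and false alarms are ignored.

```python
def ensemble_reports_tampering(reporters, situation):
    """Flag the situation if any reporter in the ensemble reports tampering."""
    return any(reporter(situation) for reporter in reporters)

def detection_probability(per_reporter_probs):
    """Chance that at least one reporter catches the tampering.

    Assumes independent detection (an assumption of this sketch): the
    ensemble misses only when every reporter misses.
    """
    miss = 1.0
    for p in per_reporter_probs:
        miss *= 1.0 - p
    return 1.0 - miss

# Two hypothetical reporters that each catch tampering half the time.
two = detection_probability([0.5, 0.5])
three = detection_probability([0.5, 0.5, 0.5])
```

Under the independence assumption, adding a reporter never lowers the detection probability, which matches the claim that more options are better as long as nothing is optimizing for scariness.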

(I did not write a curation notice in time, but that doesn’t mean I don’t get to share why I wanted to curate this post! So I will do that here.)

Typically when I read a post by Paul, it feels like a single ingredient in a recipe, but one where I don’t know what meal the recipe is for. This report felt like one of the first times I was served a full meal, and I got to see how all the prior ingredients come together.

Alternative framing: Normally Paul’s posts feel like the argument step “J -> K” and I’m left wondering how we got to J, and where we’ll go fr...

5Ben Pace
FWIW I wouldn’t write this line today, I am now much more confused about what ELK says or means.
6evhub
Why? What changed in your understanding of ELK?

Here’s a Builder move (somewhat underdeveloped but I think worth posting now even as I continue to think - maybe someone can break it decisively quickly).

Training strategy: Add an “Am I tricking you?” head to the SmartVault model.

The proposed flow chart for how the model works has an “Experiment Proposer” coming out of “Figure out what’s going on”, and two heads out of Experiment Proposer, called “Experiment Description” and “Value” (meaning “Expected Value of Experiment to the Proposer”). I won’t make use of the question-answering Reporter/Answer parts, s...

4Mark Xu
Thanks for your proposal! I'm not sure I understand how the "human is happy with experiment" part is supposed to work. Here are some thoughts:
  • Eventually, it will always be possible to find experiments where the human confidently predicts wrongly. Situations I have in mind are ones where your AI understands the world far better than you, so can predict that e.g. combining these 1000 chemicals will produce self-replicating protein assemblages, whereas the human's best guess is going to be "combining 1000 random chemicals doesn't do anything".
  • If the human is unhappy with experiments that are complicated, then advanced ways of hacking the video feed that require experiments of comparable complexity to reveal are not going to be permitted. For instance, if the diamond gets replaced by a fake, one might have to perform a complicated imaging technique to determine the difference. If the human doesn't already understand this technique, then they might not be happy with the experiment.
  • If the human doesn't really understand the world that well, then it might not be possible to find an experiment for which the human is confident in the outcome that distinguishes the diamond from a fake. For instance, if a human gets swapped out for a copy of a human that will make subtly different moral judgments because of factors the human doesn't understand, this copy will be identical in all ways that a human can check, e.g. there will be no experiment that a human is confident in that will distinguish the copy of the human from the real thing.
1Ramana Kumar
Thanks for the reply! I think you’ve understood correctly that the human rater needs to understand the proposed experiment – i.e., be able to carry it out and have a confident expectation about the outcome – in order to rate the proposer highly. Here’s my summary of your point: for some tampering actions, there are no experiments that a human would understand in the above sense that would expose the tampering. Therefore that kind of tampering will result in low value for the experiment proposer (who has no winning strategy), and get rated highly. This is a crux for me. I don’t yet believe such tampering exists. The intuition I’m drawing on here is that our beliefs about what world we’re in need to cash out in anticipated experiences. Exposing confusion about something that shouldn’t be confusing can be a successful proposer strategy. I appreciate your examples of “a fake diamond that can only be exposed by complex imaging techniques” and “a human making subtly different moral judgements” and will ponder them further. Your comment also helped me realise another danger of this strategy: to get the data for training the experiment proposer, we have to execute the SmartVault actions first. (Whereas I think in the baseline scheme they don’t have to be executed.)
3Mark Xu
My point is either that:
  • it will always be possible to find such an experiment for any action, even desirable ones, because the AI will have defended the diamond in a way the human didn't understand or the AI will have deduced some property of diamonds that humans thought they didn't have
  • or there will be some tampering for which it's impossible to find an experiment, because in order to avoid the above problem, you will have to restrict the space of experiments
3paulfchristiano
I'd be scared that the "Am I tricking you?" head just works by:
  1. Predicting what the human will predict
  2. Predicting what will actually happen
  3. Outputting a high value iff the human's prediction is confident but different from reality.
If this is the case, then the head will report detectable tampering but not undetectable tampering. To get around this problem, you need to exploit some similarity between ways of tricking you that are detectable and ways that aren't, e.g. starting with the same subsequence or sharing some human-observable feature of the situation. I think there's a big genre of proposals that try to leverage that kind of structure, which might be promising (though it's not the kind of thing I'm thinking about right now).
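A toy Python version of the three-step head described above shows why it would miss undetectable tampering. Everything here is hypothetical: the `HumanPrediction` type, the 0.9 confidence threshold, and the string outcomes are illustration choices, and "what will actually happen" is modeled as the outcome the human would observe.

```python
from dataclasses import dataclass

@dataclass
class HumanPrediction:
    outcome: str        # what the human expects the experiment to show
    confidence: float   # how sure the human is, between 0 and 1

def am_i_tricking_you(human_pred, observed_outcome, confidence_threshold=0.9):
    """Return 1.0 iff the human is confident and the observed outcome differs."""
    confident = human_pred.confidence >= confidence_threshold
    wrong = human_pred.outcome != observed_outcome
    return 1.0 if (confident and wrong) else 0.0

expects_diamond = HumanPrediction(outcome="diamond visible", confidence=0.95)

# Detectable tampering: the experiment shows something the human did not expect.
detectable = am_i_tricking_you(expects_diamond, "screen flickers")

# Undetectable tampering: the diamond is gone, but every observation the human
# can make still matches their prediction, so the head stays silent.
undetectable = am_i_tricking_you(expects_diamond, "diamond visible")
```

The head's output depends only on the gap between prediction and observation, so tampering that leaves no observable gap scores the same as no tampering at all.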
1Ramana Kumar
Tweaking your comment slightly: Yes this is correct for the Value head. But how does detectable vs undetectable apply to this builder strategy? Compared to what's in the report, this strategy constructs new sensors as needed. The Proposer head is designed to optimise E, which ought to make more tampering detectable, and I have an intuition that it makes all tampering detectable.
2paulfchristiano
Suppose the value head learns to predict "Will the human be confidently wrong about the outcome of this experiment," where an 'experiment' is a natural language description of a sequence of actions that the human could execute. And then the experiment head produces natural language descriptions of actions that a human could take for which they'd be confidently wrong. What do you then do with this experiment proposer, and how do you use it to train the SmartVault? Are you going to execute a large number of experiments, and if so what do you do afterwards? How do we use this to construct new sensors that allow the human to detect tampering?
ETA: here's my best guess after reading the other comment---after taking a sequence of actions, we run the experiment proposer to suggest an experiment that will allow the human to notice if tampering actually occurred. This seems like it could be different from "experiment that human would be confidently wrong about" since a human who doesn't understand the environment dynamics will always have tons of experiments they are confidently wrong about, but instead we want to find an experiment that causes them to update strongly to believing that tampering occurred. Is that right? If so it seems like there are a few problems:
  • The proposed experiment could itself perform tampering (after which the human will correctly infer that tampering occurred, thereby giving the experiment a high score), or exploit the human errors to make it appear that tampering had occurred (e.g. if the human is wrong about how sensors work then you can construct new sensors that will appear to report tampering).
  • If you tamper with the mechanism by which the human "executes" the experiment (e.g. by simply killing the human and replacing them with a different experiment-executor) then it seems like the experiment proposer will always lose. This maybe depends on details of exactly how the setup works.
  • Like Mark I do expect forms of tampering that alw
1Ramana Kumar
Proposing experiments that are more specifically exposing tampering does sound like what I meant, and I agree that my attempt to reduce this to experiments that expose confidently wrong human predictions may not be precise enough. I know this is crossed out but thought it might help to answer anyway: the proposed experiment includes instructions for how to set the experiment up and how to read the results. These may include instructions for building new sensors. Yep this is a problem. "Was I tricking you?" isn't being distinguished from "Can I trick you after the fact?". The other problems seem like real problems too; more thought required....
Wei DaiΩ680

ETA: This comment was based on a misunderstanding of the paper. Please see the ETA in Paul's reply below.

From the section on Avoiding subtle manipulation:

But from my perspective in advance, there are many possible ads I could have watched. Because I don’t understand how the ads interact with my values, I don’t have very strong preferences about which of them I see. If you asked me-in-the-present to delegate to me-in-the-future, I would be indifferent between all of these possible copies of myself who watched different ads. And if I look across all of tho

...
8paulfchristiano
If I only take counterfactuals over a single AI's decision then I can have this problem with just two AIs: each of them tries to manipulate me, and if one of them fails the other will succeed and so I see no variation in my preferences. In that case the hope is to take counterfactuals over all the decisions. I don't know if this is realistic, but I think it probably either fails in mundane cases or works in this slightly exotic case.  Also honestly it doesn't seem that much harder than taking counterfactuals over one decision, which is already tough. (I think that many manipulators wanting to push me in the same direction isn't too exotic though.) ETA: I think I misunderstood your comment and there's actually a more basic miscommunication. I'm imagining the counterfactual over different ads that the AI considered running, before settling on the paperclip-maximizing one (having realized that the others wouldn't lead to me loving paperclips). I'm not imagining the counterfactual over different values that AI might have.
6Wei Dai
Oh I see. Why doesn't this symmetrically cause you to filter out good arguments for changing your values (told to you by a friend, say) as well as bad ones?
6paulfchristiano
If all works well, this would filter out anything from the environment that significantly changes your values that you don't specifically want. (E.g. you don't filter out food vs "random configuration of atoms I could eat" because you specifically want to figure out food.) We normally think of the hard case where correct deliberation is dependent on some aspects of the environment staying "on distribution" but you don't recognize which (discussed a bit here). But correct arguments from your friend are the same: you can have preferences over which arguments you hear, but if you can't decide or even define whether your friend is "being helpful" or "being manipulative" then we don't think the kind of regularization-based approach discussed in this document will plausibly incentivize your AI to clarify that distinction, so you're on your own. We've discussed this basic dilemma before, you could split and reflect separately until you become wise enough to decide whether people are safe (perhaps in light of their histories) or you could only interact with people you trust, or you could make early commitments to e.g. not use powerful AI advisors (though the time for such commitments rapidly approaches and passes). But nothing in this document will help you with that, and we're a bit skeptical about any hope that the same mechanism would address both that problem and ELK (other than solving both by solving alignment in some way that doesn't require ELK, such that it was a silly subproblem).
4Wei Dai
Ok, this all makes sense now. I guess when I first read that section I got the impression that you were trying to do something more ambitious. You may want to consider adding some clarification that you're not describing a scheme designed to block only manipulation while letting helpful arguments through, or that "letting helpful arguments through" would require additional ideas outside of that section.

I've only skimmed the report so far, but it seems very interesting. Most interpretability work assumes an externally trained model not explicitly made to be interpretable. 

Are you familiar with interpretability work such as "Knowledge Neurons in Pretrained Transformers" (GitHub) or "Transformer Feed-Forward Layers Are Key-Value Memories" (GitHub)? They're a bit different because they:

  1. Focus on "background" knowledge such as "Paris is the capital of France", rather than knowledge about the current context such as "the camera has been hacked".
  2. Only invest
...
8paulfchristiano
I'm very interested in interpretability (and have read those papers in particular). We discuss the connection between ELK and interpretability in this appendix. Our main question is how complex the "interpretation" of neural networks must be in order to extract what the models know. If they become quite complex, then it starts to become hard to judge whether a given interpretation is correct (and hence revealing structure inside the model) or simply making up the structure and relationships that the researchers were looking for with their tools. If the interpretations are simple, then we hope that the kinds of regularization described in this document would have an easy time picking out the direct translator. We are open to changing the training strategy for the underlying predictor in order to make it more interpretable, but we're very scared about approaches like changing regularization. The basic issue is that in the worst case those changes can greatly impact the predictor's performance.  So within our research framework, if we change the loss function for the underlying predictor then we need to be able to argue that it won't impact the predictor's performance. And that problem is quite fundamental in this case, since e.g. highly polysemantic neurons may simply be more performant. That means in the worst case you just need to be able to handle them. (Outside of our research methodology, I'm also personally much more interested in techniques that can disentangle polysemantic neurons rather than trying to discourage them.)
4Quintin Pope
Ensuring interpretable models remain competitive is important. I've looked into the issue for dropout specifically. This paper disentangles the different regularization benefits dropout provides and shows we can recover dropout's contributions by adding a regularization term to the loss and noise to the gradient updates (the paper derives expressions for both interventions). I think there's a lot of room for high performance, relatively interpretable deep models. E.g., the human brain is high performance and seems much more interpretable than you'd expect from deep learning interpretability research. Given our limitations in accessing/manipulating the brain's internal state, something like brain stimulation reward seems like it should be basically impossible, if the brain were as uninterpretable as current deep nets.

Great report — I found the argument that ELK is a core challenge for alignment quite intuitive/compelling.

To build more intuition for what a solution to ELK would look like, I’d find it useful to talk about current-day settings where we could attempt to empirically tackle ELK.  AlphaZero seems like a good example of a superhuman ML model where there’s significant interest (and some initial work: https://arxiv.org/abs/2111.09259) in understanding its inner reasoning.  Some AlphaZero-oriented questions that occurred to me:

  • Suppose we train an augmen
...
3paulfchristiano
I think AZELK is a fine model for many parts of ELK. The baseline approach is to jointly train a system to play Go and answer questions about board states, using human answers (or human feedback). The goal is to get the system to answer questions correctly if it knows the answer, even if humans wouldn't be able to evaluate that answer. Some thoughts on this setup:
  • I'm very interested in empirical tests of the baseline and simple modifications (see this post). The ELK writeup is mostly focused on what to do in cases where the baseline fails, but it would be great to (i) check whether that actually happens (ii) have an empirical model of a hard situation so that we can do applied research rather than just theory.
  • There is some subtlety where AZ invokes the policy/value a bunch of times in order to make a single move. I don't think this is a fundamental complication, so from here on out I'll just talk about ELK for a single value function invocation. I don't think the problem is very interesting unless the AZ value function itself is much stronger than your humans.
  • Many questions about Go can be easily answered with a lot of compute, and for many of these questions there is a plausible straightforward approach based on debate/amplification. I think this is also interesting to do experiments with, but I'm most worried about the cases where this is not possible (e.g. the ontology identification case, which probably arises in Go but is a bit more subtle).
  • If a human doesn't know anything about Go, then AZ may simply not have any latent knowledge that is meaningful to them. In that case we aren't expecting/requiring ELK to do anything at all. So we'd like to focus on cases where the human does understand concepts that they can ask hard questions about. (And ideally they'd have a rich web of concepts so that the question feels analogous to the real world case, but I think it's interesting as long as they have anything.) We never expect it to walk us through pedag
Wei DaiΩ660

Can you talk about the advantages or other motivations for the formulation of indirect normativity in this paper (section "Indirect normativity: defining a utility function"), compared to your 2012 formulation? (It's not clear to me what problems with that version you're trying to solve here.)

5paulfchristiano
The previous definition was aiming to define a utility function "precisely," in the sense of giving some code which would produce the utility value if you ran it for a (very, very) long time. One basic concern with this is (as you pointed out at the time) that it's not clear that an AI which was able to acquire power would actually be able to reason about this abstract definition of utility. A more minor concern is that it involves considering the decisions of hypothetical humans very unlike those existing in the real world (who therefore might reach bad conclusions or at least conclusions different from ours). In the new formulation, the goal is to define the utility in terms of the answers to questions about the future that seem like they should be easy for the AI to answer because they are a combination of (i) easy predictions about humans that it is good at, (ii) predictions about the future that any power-seeking AI should be able to answer. Relatedly, this version only requires making predictions about humans who are living in the real world and being defended by their AI. (Though those humans can choose to delegate to some digital process making predictions about hypothetical humans, if they so desire.) Ideally I'd even like all of the humans involved in the process to be indistinguishable from the "real" humans, so that no human ever looks at their situation and thinks "I guess I'm one of the humans responsible for figuring out the utility function, since this isn't the kind of world that my AI would actually bring into existence rather than merely reasoning about hypothetically." More structurally, the goal is to define the utility function in terms of the kinds of question-answers that realistic approaches to ELK could elicit, which doesn't seem to include facts about mathematics that are much too complex for humans to derive directly and where they need to rely on correlations between mathematics and the physical world---in those cases we are essentia
3Wei Dai
Thanks, very helpful to understand your motivations for that section better. Not sure about the following, but it seems the new formulation requires that the AI answer questions about humans in a future that may have very low probability according to the AI's current beliefs (i.e., the current human through a delegation chain eventually delegates to a future human existing in a possible world with low probability). The AI may well not be able to answer questions about such a future human, because it wouldn't need that ability to seek power (it only needs to make predictions about high probability futures). Or to put it another way, the future human may exist in a world with strange/unfamiliar (from the AI's perspective) features that make it hard for the AI to predict correctly. How do you envision extracting or eliciting from the future human H_limit an opinion about what the current human should do, given that H_limit's mind is almost certainly entirely focused on their own life and problems? One obvious way I can think of is to make a copy of H_limit, put the copy in a virtual environment, tell them about H's situation, then ask them what to do. But that seems to run into the same kind of issue, as the copy is now aware that they're not living in the real world.
2paulfchristiano
I'm imagining delegating to humans who are very similar to (and ideally indistinguishable from) the humans who will actually exist in the world that we bring about. I'm scared about very alien humans for a bunch of reasons---hard for the AI to reason about, may behave strangely, and makes it harder to use "corrigible" strategies to easily satisfy their preferences. (Though that said, note that the AI is reasoning very abstractly about such future humans and cannot e.g. predict any of their statements in detail.) Ideally we are basically asking each human what they want their future to look like, not asking them to evaluate a very different world. Ideally we would literally only be asking the humans to evaluate their future. This is kind of like giving instructions to their AI about what it should do next, but a little bit more indirect since they are instead evaluating futures that their AI could bring about. The reason this doesn't work is that by the time we get to those future humans, the AI may already be in an irreversibly bad position (e.g. because it hasn't acquired much flexible influence that it can use to help the humans achieve their goals). This happens most obviously at the very end, but it also happens along the way if the AI failed to get into a position where it could effectively defend us. (And of course it happens along the way if people are gradually refining their understanding of what they want to happen in the external world, rather than having a full clean separation into "expand while protecting deliberation" + "execute payload.") However, when this happens it is only because the humans along the way couldn't tell that things were going badly---they couldn't understand that their AI had failed to gather resources for them until they actually got to the end, asked their AI to achieve something, and were unhappy because it couldn't. If they had understood along the way, then they would never have gone down this route. So at the point when

I could only skim and the details went over my head, but it seems you intend to do experiments with Bayesian Networks and human operators.

I recently developed and released an open source explainability framework for Bayes nets - dropping it here in the unlikely case it might be useful.

[anonymous]50

(Going to try my hand at Builder, but this is admittedly vague, so I hope you help sharpen it with criticism.)

What if instead of a "reporter", we had a "terrifier", whose adversarial objective is to highlight the additional "sensor" whose observations, assuming the input and actions were held constant, when viewed by a human would maximize the probability of a human reviewer saying the system would not be performing as desired. The terrifier would be allowed to run the original predictor model "further" in order to populate whichever new components of the Bay...

2Mark Xu
Thanks for your proposal! We have considered similar strategies in the past. The main points of the breaker response would be:
  • How are you training this "terrifier"? If the terrifier moves that can be made during training are from held-out sensors that you have constructed, then the "bad terrifier" you might learn doesn't generalize to thinking of any new sensors, only ever considering the sensors that you used as ground truth during training. Thus a sufficiently clever robber might trick all of the sensors-used-during-training (which may not exist during deployment), and thus the diamond would be stolen without a terrifier being able to show any sensor data that would scare the human.
  • It's possible that even if the human was able to know the results of all possible sensors, they still wouldn't be able to know whether the situation was going well. For instance, if the human doesn't really understand how humans work, the way that they hope to maintain control over the situation is asking questions like "is that human the same human that I saw before?". However, a terrifier might not have any sensor data that they could show the human to scare them, e.g. if the terrifier showed the human a bunch of nanotech in that human's brain, the human might not understand why that was bad.
This behavior seems potentially catastrophic. If all we can do is force potential adversaries to take actions that can't be seen with the resolution of our current sensors, then e.g. we can't prevent adversaries from corrupting code that is running on computers that we wish to protect.
I don't really understand why this explains why your strategy defeats the previous counterexample.

Regarding this:

The bad reporter needs to specify the entire human model, how to do inference, and how to extract observations. But the complexity of this task depends only on the complexity of the human’s Bayes net.

If the predictor's Bayes net is fairly small, then this may be much more complex than specifying the direct translator. But if we make the predictor's Bayes net very large, then the direct translator can become more complicated — and there is no obvious upper bound on how complicated it could become. Eventually direct translation will be more co

...
2paulfchristiano
Yes, I agree that something similar applies to complexity as well as computation time. There are two big reasons I talk more about computation time: * It seems plausible we could generate a scalable source of computational difficulty, but it's less clear that there exists a scalable source of description complexity (rather than having some fixed upper bound on the complexity of "the best thing a human can figure out by doing science.") * I often imagine the assistants all sharing parameters with the predictor, or at least having a single set of parameters. If you have lots of assistant parameters that aren't shared with the predictor, then it looks like it will generally increase the training time a lot. But without doing that, it seems like there's not necessarily that much complexity the predictor doesn't already know about. (In contrast, we can afford to spend a ton of compute for each example at training time since we don't need that many high-quality reporter datapoints to rule out the bad reporters. So we can really have giant ratios between our compute and the compute of the model.) But I don't think these are differences in kind and I don't have super strong views on this.

We’ll assume the humans who constructed the dataset also model the world using their own internal Bayes net.

This seems like a crucial premise of the report; could you say more about it? You discuss why a model using a Bayes net might be "oversimplified and unrealistic", but as far as I can tell you don't talk about why this is a reasonable model of human reasoning.

9Ajeya Cotra
Speaking just for myself, I think about this as an extension of the worst-case assumption. Sure, humans don't reason using Bayes nets -- but if we lived in a world where the beings whose values we want to preserve did reason about the world using a Bayes net, that wouldn't be logically inconsistent or physically impossible, and we wouldn't want alignment to fail in that world. Additionally, I think the statement made in the report about AIs also applies to humans: We're using some sort of cognitive algorithms to reason about the world, and it's plausible that strategies which resemble inference on graphical models play a role in some of our understanding. There's no obvious way that a messier model of human reasoning which incorporates all the other parts should make ELK easier; there's nothing that we could obviously exploit to create a strategy.
5Richard_Ngo
If you solve something given worst-case assumptions, you've solved it for all cases. Whereas if you solve it for one specific case (e.g. Bayes nets) then it may still fail if that's not the case we end up facing. Doesn't this imply that a Bayes-net model isn't the worst case? EDIT: I guess it depends on whether "the human isn't well-modelled using a Bayes net" is a possible response the breaker could give. But that doesn't seem like it fits the format of finding a test case where the builder's strategy fails (indeed, "bayes nets" seems built into the definition of the game).
8Ajeya Cotra
Sorry, there were two things you could have meant when you said the assumption that the human uses a Bayes net seemed crucial. I thought you were asking why the builder couldn't just say "That's unrealistic" when the breaker suggested the human runs a Bayes net. The answer to that is what I said above -- because the assumption is that we're working in the worst case, the builder can't invoke unrealism to dismiss the counterexample. If the question is instead "Why is the builder allowed to just focus on the Bayes net case?", the answer to that is the iterative nature of the game. The Bayes net case (and in practice a few other simple cases) was the case the breaker chose to give, so if the builder finds a strategy that works for that case they win the round. Then the breaker can come back and add complications which break the builder's strategy again, and the hope is that after many rounds we'll get to a place where it's really hard to think of a counterexample that breaks the builder's strategy despite trying hard.
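The iterative structure of the game described above can be sketched roughly as follows. This is a toy illustration only: the function `builder_breaker`, the `works` predicate, and the strategy and case names are hypothetical stand-ins, not anything defined in the report.

```python
def builder_breaker(strategies, cases, works):
    """Toy builder/breaker loop.

    strategies: candidate training strategies, in the order the builder tries them.
    cases: test cases the breaker can draw counterexamples from.
    works(strategy, case): True if the strategy succeeds on that case.
    Returns (surviving strategy or None, counterexamples raised so far).
    """
    raised = []  # counterexamples the breaker has produced in earlier rounds
    for strategy in strategies:
        # The builder only proposes strategies that handle every known counterexample.
        if not all(works(strategy, case) for case in raised):
            continue
        # The breaker looks for a case where the proposed strategy fails.
        counterexample = next((c for c in cases if not works(strategy, c)), None)
        if counterexample is None:
            return strategy, raised  # no counterexample found in this toy setting
        raised.append(counterexample)
    return None, raised  # the builder ran out of strategies


# Hypothetical outcomes, for illustration only.
SUCCEEDS = {("regularized", "bayes_net"), ("robust", "bayes_net"), ("robust", "axioms")}
winner, raised = builder_breaker(
    ["naive", "regularized", "robust"],
    ["bayes_net", "axioms"],
    lambda s, c: (s, c) in SUCCEEDS,
)
```

In the real game there is no finite list of cases to exhaust, so "no counterexample found" only means that nobody has thought of one yet despite trying hard.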
6Richard_Ngo
Ah, that makes sense. In the section where you explain the steps of the game, I interpreted the comments in parentheses as further explanations of the step, rather than just a single example. (In hindsight the latter interpretation is obvious, but I was reading quickly - might be worth making this explicit for others who are doing the same.) So I thought that Bayes nets were built into the methodology. Apologies for the oversight! I'm still a little wary of how much the report talks about concepts in a human's Bayes net without really explaining why this is anywhere near a sensible model of humans, but I'll have another read through and see if I can pin down anything that I actively disagree with (since I do agree that it's useful to start off with very simple assumptions).
3Ajeya Cotra
Ah got it. To be clear, Paul and Mark do in practice consider a bank of multiple counterexamples for each strategy with different ways the human and predictor could think, though they're all pretty simple in the same way the Bayes net example is (e.g. deduction from a set of axioms); my understanding is that essentially the same kind of counterexamples apply for essentially the same underlying reasons for those other simple examples. The doc sticks with one running example for clarity / length reasons.
6paulfchristiano
The breaker is definitely allowed to introduce counterexamples where the human isn't well-modeled using a Bayes net. Our training strategies (introduced here) don't say anything at all about Bayes nets and so it's not clear if this immediately helps the breaker---they are the one who introduced the assumption that the human used a Bayes net (in order to describe a simplified situation where the naive training strategy failed here). We're definitely not intentionally viewing Bayes nets as part of the definition of the game. It seems very plausible that after solving the problem for humans-who-use-Bayes-nets we will find a new counterexample that only works for humans-who-don't-use-Bayes-nets, in which case we'll move on to those counterexamples. It seems even more likely that the builder will propose an algorithm that exploits cognition that humans can do which isn't well captured by the Bayes net model, which is also fair game. (And indeed several of our approaches do this, e.g. when imagining humans learning new things about the world by performing experiments here or reasoning about plausibility of model joint distributions here). That said, it looks to us like if any of these algorithms worked for Bayes nets, they would at least work for a very broad range of human models; the Bayes net assumption doesn't seem to be changing the picture much qualitatively. Echoing Mark in his comment, we're definitely interested in ways that this assumption seems importantly unrealistic. If you just think it's generally a mediocre model and results are unlikely to generalize, then you can also wait for us to discover that after finding an algorithm that works for Bayes nets and then finding that it breaks down as we extend to more realistic examples. Conditioned on ontology identification being impossible, I think it's most likely to also be impossible for humans who reason about the world using a Bayes net. I think Ajeya is just pointing out why it seems useful to se
7Mark Xu
We don't think that real humans are likely to be using Bayes nets to model the world. We make this assumption for much the same reasons that we assume models use Bayes nets, namely that it's a test case where we have a good sense of what we want a solution to ELK to look like. We think the arguments given in the report will basically extend to more realistic models of how humans reason (or rather, we aren't aware of a concrete model of how humans reason for which the arguments don't apply). If you think there's a specific part of the report where the human Bayes net assumption seems crucial, I'd be happy to try to give a more general form of the argument in question.

I'm leaving this review primarily because this post somehow doesn't have one yet, and it's way too important to get dropped out of the Review!

ELK had some of the most alignment community engagement of any technical content that I've seen. It is extremely thorough, well-crafted, and aims at a core problem in alignment. It serves as an exemplar of how to present concrete problems to induce more people to work on AI alignment.

That said, I personally bounced after reading the first few pages of the document. It was good as far as I got, but it was pretty effortful to get through, and (as mentioned above) already had tons of attention on it.

2Ben Pace
FWIW I think the Eliciting Latent Knowledge problem doesn't stand well on its own as an introduction, and thinking about this problem is way better when you see the bigger picture that Paul is working through and used to generate it, in his post My Research Methodology (my review here). That walks through multiple of the major steps in Paul's reasoning that led to this problem being brought up, rather than just dumping you in it, and is written in Paul's native voice. I'd rank that post as substantially more useful than this one.

I'm curious if you have a way to summarise what you think the "core insight" of ELK is, that allows it to improve on the way other alignment researchers think about solving the alignment problem.

I wrote some thoughts that look like they won't get posted anywhere else, so I'm just going to paste them here with minimal editing:

  • They (ARC) seem to imagine that for all the cases that matter, there's some ground-truth-of-goodness judgment the human would make if they knew the facts (in a fairly objective way that can be measured by how well the human does at predicting things), and so our central challenge is to figure out how to tell the human the facts (or predict what the human would say if they knew all the facts).
  • In contrast, I don't think there's
... (read more)
4paulfchristiano
Generally we are asking for an AI that doesn't give an unambiguously bad answer, and if there's any way of revealing the facts where we think a human would (defensibly) agree with the AI, then probably the answer isn't unambiguously bad and we're fine if the AI gives it. There are lots of possible concerns with that perspective; probably the easiest way to engage with them is to consider some concrete case in which a human might make different judgments, but where it's catastrophic for our AI not to make the "correct" judgment. I'm not sure what kind of example you have in mind and I have somewhat different responses to different kinds of examples. For example, note that ELK is never trying to answer any questions of the form "how good is this outcome?"; I certainly agree that there can also be ambiguity about questions like "did the diamond stay in the room?" but it's a fairly different situation. The most relevant sections are narrow elicitation and why it might be sufficient which gives a lot of examples of where we think we can/can't tolerate ambiguity, and to a lesser extent avoiding subtle manipulation which explains how you might get a good outcome despite tolerating such ambiguity. That said, there are still lots of reasonable objections to both of those.
4Charlie Steiner
When you say "some case in which a human might make different judgments, but where it's catastrophic for the AI not to make the correct judgment," what I hear is "some case where humans would sometimes make catastrophic judgments." I think such cases exist and are a problem for the premise that some humans agreeing means an idea meets some standard of quality. Bumbling into such cases naturally might not be a dealbreaker, but there are some reasons we might get optimization pressure pushing plans proposed by an AI towards the limits of human judgment.
3Ramana Kumar
I think the problem you're getting at here is real -- path-dependency of what a human believes on how they came to believe it, keeping everything else fixed (e.g., what the beliefs refer to) -- but I also think ARC's ELK problem is not claiming this isn't a real problem but rather bracketing (deferring) it for as long as possible. Because there are cases where ELK fails that don't have much path-dependency in them, and we can focus on solving those cases until whatever else is causing the problem goes away (and only path-dependency is left).

I'm reading along, and I don't follow the section "Strategy: have AI help humans improve our understanding". The problem so far is that the AI only need identify bad outcomes that the human labelers can identify, rather than bad outcomes regardless of human-labeler identification. 

The solution posed here is to have AIs help the human labeler understand more bad (and good) outcomes, using powerful AI. The section mostly provides justification for making the assumption that we can align these helper AIs (reason: the authors believe there is a counterexa... (read more)

4paulfchristiano
Yes, that's the main way this could work.  The question is whether an AI understands things that humans can't understand by doing amplification/debate/rrm, our guess is yes and the argument is mostly "until the builder explains why, gradient descent and science may just have pretty different strengths and weaknesses" (and we can make that more concrete by fleshing out what the world may be like and what the AI learns by gradient descent). But it seemed worth raising because this does appear to make the bad reporter's job much harder and greatly restrict the space of cases where it fails to report tampering. Methodologically, the way I think about this kind of thing is: (i) we had a counterexample, (ii) after making this change that particular counterexample no longer works, (iii) now we want to think through whether the counterexample can be adapted. This is also legitimately less obvious. An AI can't simulate (human+AI helpers), since each AI helper is as smart as the AI itself and so simulating (human+AI helpers) clearly requires more compute than the AI has. The counterexample is that the AI should just try its best to do inference in the Bayes net that includes "everything the human could understand with the amount of science they have time to do." But that does still leave the builder with avenues to try to strengthen the algorithm and win. One way is discussed in the section on speed regularization: if the AI is "trying its best" to do inference in the human Bayes net then there might always be returns to having more time to think (and so it might be able to benefit by transferring over its understanding of what was happening in the AI Bayes net rather than recomputing from the observations). The next step for a builder who wanted to take this approach would be to argue that they can reliably construct a complex enough dataset that this advantage is relevant. My guess is that this doesn't work on its own, but if you could scalably construct more complex da
2Ben Pace
This is an interesting tack; this step and the next ("Strategy: have humans adopt the optimal Bayes net") feel new to me.
2Ben Pace
Question: what's the relative amount of compute you are imagining SmartVault and the helper AI having? Both the same, or one having a lot more?
2paulfchristiano
It will depend on how much high-quality data you need to train the reporter. Probably it's a small fraction of the data you need to train the predictor, and so for generating each reporter datapoint you can afford to use many times more data than the predictor usually uses. I often imagine the helpers having 10-100x more computation time.

From the section "Strategy: have humans adopt the optimal Bayes net":

Roughly speaking, imitative generalization:

  • Considers the space of changes the humans could make to their Bayes net;
  • Learns a function which maps (proposed change to Bayes net) to (how a human — with AI assistants — would make predictions after making that change);
  • Searches over this space to find the change that allows the humans to make the best predictions.

Regarding the second step, what is the meat of this function? My superficial understanding is that a Bayes net is deterministic and fu... (read more)

3paulfchristiano
In general we don't have an explicit representation of the human's beliefs as a Bayes net (and none of our algorithms are specialized to this case), so the only way we are representing "change to Bayes net" is as "information you can give to a human that would lead them to change their predictions." That said, we also haven't described any inference algorithm other than "ask the human." In general inference is intractable (even in very simple models), and the only handle we have on doing fast+acceptable approximate inference is that the human can apparently do it. (Though if that was the only problem then we also expect we could find some loss function that incentivizes the AI to do inference in the human Bayes net.)
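The three steps quoted from the report can be sketched as a search over candidate pieces of information. This is a minimal sketch under loud assumptions: `best_change` and `toy_predictor` are hypothetical names, the candidate space is a short explicit list, and the "learned function" is replaced by a hand-written stand-in rather than anything trained on human behavior.

```python
def best_change(candidate_changes, predict_after_change, dataset):
    """Return the change that leads to the most accurate predictions.

    candidate_changes: the space of changes, here pieces of information
        one could give to the human (step 1).
    predict_after_change(change, x): stand-in for the learned function mapping
        a change to the prediction a human would make about x afterwards (step 2).
    dataset: list of (input, true outcome) pairs used to score predictions.
    """
    def accuracy(change):
        correct = sum(predict_after_change(change, x) == y for x, y in dataset)
        return correct / len(dataset)

    # Step 3: search the space for the change with the best predictions.
    return max(candidate_changes, key=accuracy)


def toy_predictor(change, x):
    """Hypothetical stand-in for "what the human predicts after being told `change`"."""
    if change == "even numbers are safe":
        return x % 2 == 0
    if change == "small numbers are safe":
        return x < 3
    return False  # with no useful information, the human always predicts "unsafe"


dataset = [(0, True), (1, False), (2, True), (4, True), (5, False)]
changes = ["nothing", "small numbers are safe", "even numbers are safe"]
chosen = best_change(changes, toy_predictor, dataset)
```

Note that the sketch never represents a Bayes net explicitly; as in the reply above, a "change" is just information handed to the (simulated) human.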

We aren’t offering these criteria as necessary for “knowledge”—we could imagine a breaker proposing a counterexample where all of these properties are satisfied but where intuitively M didn’t really know that A′ was a better answer. In that case the builder will try to make a convincing argument to that effect.

Bolded should be sufficient.

[-]RubyΩ120

Curated. The authors write:

We believe that there are many promising and unexplored approaches to this problem, and there isn’t yet much reason to believe we are stuck or are faced with an insurmountable obstacle.

If it's true that this is both a core alignment problem and we're not stuck on it, then that's fantastic. I am not an alignment researcher and don't feel qualified to comment on quite how promising this work seems, but I find the report both accessible and compelling. I recommend it to anyone curious about where some of the alignment leading e... (read more)

In terms of the relationship to MIRI's visible thoughts project, I'd say the main difference is that ARC is attempting to solve ELK in the worst case (where the way the AI understands the world could be arbitrarily alien from and more sophisticated than the way the human understands the world), whereas the visible thoughts project is attempting to encourage a way of developing AI that makes ELK easier to solve (by encouraging the way the AI thinks to resemble the way humans think). My understanding is MIRI is quite skeptical that a solution to worst-case ELK is possible, which is why they're aiming to do something more like "make it more likely that conditions are such that ELK-like problems can be solved in practice."

2Ruby
Thanks! That's illuminating.
8Ajeya Cotra
Thanks Ruby! I'm really glad you found the report accessible. One clarification: Bayes nets aren't important to ARC's conception of the problem of ELK or its solution, so I don't think it makes sense to contrast ARC's approach against an approach focused on language models or describe it as seeking a solution via Bayes nets. The form of a solution to ELK will still involve training a machine learning model (which will certainly understand language and could just be a language model) using some loss function. The idea that this model could learn to represent its understanding of the world in the form of inference on some Bayes net is one of a few simple test cases that ARC uses to check whether the loss functions they're designing will always incentivize honestly answering straightforward questions. For example, another simple test case (not included in the report) is that the model could learn to represent its understanding of the world in a bunch of "sentences" that it performs logical operations on to transform into other sentences. These test cases are settings for counterexamples, but not crucial to proposed solutions. The idea is that if your loss function will always learn a model that answers straightforward questions honestly, it should work in particular for these simplified cases that are easy to think about.
2Ruby
Thanks for the clarification, Ajeya! Sorry to make you have to explain that, it was a mistake to imply that ARC’s conception is specifically anchored on Bayes nets–the report was quite clear that it isn’t.

I wrote a post in response to the report: Eliciting Latent Knowledge Via Hypothetical Sensors.

Some other thoughts:

  • I felt like the report was unusually well-motivated when I put my "mainstream ML" glasses on, relative to a lot of alignment work.

  • ARC's overall approach is probably my favorite out of alignment research groups I'm aware of. I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.

  • Not sure if this is relevant in practice, but... the report talks about Bayesian networks learned via

... (read more)
2paulfchristiano
Thanks for the kind words (and proposal)! I broadly agree with "train a bunch of models and panic if any of them say something is wrong." The main catch is that this only works if none of the models are optimized to say something scary, or to say something different for the sake of being different. We discuss this a bit in this appendix. We're imagining the case where the predictor internally performs inference in a learned model, i.e. we're not explicitly learning a Bayesian network but merely considering possibilities for what an opaque neural net is actually doing (or approximating) on the inside. I don't think this is a particularly realistic possibility, but if ELK fails in this kind of simple case it seems likely to fail in messier realistic cases. (We're actually planning to do a narrower contest focused on ELK proposals.)
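The "panic if any model says something is wrong" rule, and the catch with it, can be illustrated with a tiny sketch. The reporters here are hypothetical toy functions that return `True` when they claim something is wrong; nothing below comes from the report itself.

```python
def should_panic(reporters, question):
    """Raise the alarm if any reporter in the ensemble flags a problem."""
    return any(reporter(question) for reporter in reporters)


def honest_reporter(question):
    # Toy assumption: nothing is actually wrong in this scenario.
    return False


def alarmist_reporter(question):
    # A reporter optimized to say something scary regardless of the facts.
    return True


question = "Did anything go wrong?"
quiet = should_panic([honest_reporter, honest_reporter], question)
noisy = should_panic([honest_reporter, alarmist_reporter], question)
```

A single alarmist in the ensemble makes the rule fire on every input, which is why the rule is only useful if none of the models are optimized to be scary or different.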

From a complexity-theoretic viewpoint, how hard could ELK be? Is there any evidence that ELK is decidable?

I'm pretty confused about the plan to use ELK to solve outer alignment. If Cakey is not actually trained, how are amplified humans accessing its world model?

"To avoid this fate, we hope to find some way to directly learn whatever skills and knowledge Cakey would have developed over the course of training without actually training a cake-optimizing AI...

  1. Use imitative generalization combined with amplification to search over some space of instructions we could give an amplified human that would let them make cakes just as delicious as Cakey’s would have
... (read more)

Thanks, this makes it pretty clear to me how alignment could be fundamentally hard besides deception. (The problem seems to hold even if your values are actually pretty simple; e.g. if you're a pure hedonistic utilitarian and you've magically solved deception, you can still fail at outer alignment by your AI optimizing for making it look like there's more happiness and less suffering.)

Some (perhaps basic) notes to check that I've understood this properly:

  • The Bayes net running example per se isn't really necessary for ELK to be a problem.
    • The basic problem i
... (read more)

If the reporter estimates every node of the human's Bayes net, then it can assign a node a probability distribution different from the one that would be calculated from the distributions simultaneously assigned to its parent nodes. I don't know if there is a name for that, so for now I will pompously call it inferential inconsistency. Considering this as a boolean bright-line concept, the human simulator is clearly the only inferentially consistent reporter. But one could consider some kind of metric on how different probability distributions are and turn ... (read more)
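One way to make a graded version of this concrete is sketched below. It is an illustration under assumptions, not the commenter's definition: it handles a node with a single parent, and it picks total variation distance as the metric, which is only one of many possible choices. All function names are hypothetical.

```python
def implied_child(parent_dist, cpt):
    """Child distribution implied by the parent's distribution.

    parent_dist[i] is P(parent = i); cpt[i][j] is P(child = j | parent = i).
    """
    n_child = len(cpt[0])
    return [
        sum(parent_dist[i] * cpt[i][j] for i in range(len(parent_dist)))
        for j in range(n_child)
    ]


def total_variation(p, q):
    """Total variation distance between two distributions on the same outcomes."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def inconsistency(parent_dist, cpt, reported_child):
    """How far the reported child distribution is from the one its parent implies."""
    return total_variation(implied_child(parent_dist, cpt), reported_child)


parent = [0.5, 0.5]
cpt = [[0.9, 0.1], [0.2, 0.8]]
consistent_score = inconsistency(parent, cpt, implied_child(parent, cpt))
inconsistent_score = inconsistency(parent, cpt, [0.7, 0.3])
```

With several parents the implied distribution depends on the parents' joint distribution rather than on their separate marginals, so a fuller version would need the reporter to report joint beliefs.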

In section: "New counterexample: better inference in the human Bayes net", what is meant by the reporter doing perfect inference in the human Bayes net? I am also unclear on how the modified counterexample is different.

My current understanding: The reporter is doing inference using and the action sequence and does not use to do inference ( is inferred). The reporter has an exact copy of the human Bayes net and now fixes the nodes for and the action sequence. Then it infers the probability for all possible combinations of values each node can ... (read more)

3paulfchristiano
In all of the counterexamples the reporter starts from the v1, actions, and v2 predicted by the predictor. In order to answer questions it needs to infer the latent variables in the human's model. Originally we described a counterexample where it copied the human inference process. The improved counterexample is to instead use lots of computation to do the best inference it can, rather than copying the human's mediocre inference. To make the counterexample fully precise we'd need to specify an inference algorithm and other details. We still can't do perfect inference though---there are some inference problems that just aren't computationally feasible. (That means there's hope for creating data where the new human simulator does badly because of inference mistakes. And maybe if you are careful it will also be the case that the direct translator does better, because it effectively reuses the inference work done in the predictor? To get a proposal along these lines we'd need to describe a way to produce data that involves arbitrarily hard inference problems.)
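To see why "the best inference it can" is a matter of compute, here is a minimal sketch of exact inference by enumeration over binary latent variables. It assumes independent priors on the latents and a hand-written likelihood for the observed video; the variable names and all the numbers are made up for illustration.

```python
from itertools import product


def posterior(priors, likelihood):
    """Exact posterior over all joint assignments of binary latent variables.

    priors: dict mapping latent name -> P(latent is True), assumed independent.
    likelihood(assignment): P(observation | assignment) for the fixed observation.
    The loop visits 2**len(priors) assignments, so cost grows exponentially.
    """
    names = list(priors)
    weights = {}
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        weight = likelihood(assignment)
        for name in names:
            weight *= priors[name] if assignment[name] else 1 - priors[name]
        weights[values] = weight
    total = sum(weights.values())
    return {values: weight / total for values, weight in weights.items()}


def video_shows_diamond(assignment):
    """Hypothetical likelihood of the predicted video showing a diamond."""
    if assignment["tampered"]:
        return 0.95  # tampering makes the video look fine either way
    return 0.99 if assignment["diamond_present"] else 0.01


priors = {"diamond_present": 0.9, "tampered": 0.1}
post = posterior(priors, video_shows_diamond)
# Marginal probability that the diamond is present; the first tuple slot is diamond_present.
p_diamond = sum(p for values, p in post.items() if values[0])
```

Enumeration is exact but doubles in cost with every extra latent variable, which is one concrete sense in which perfect inference is not computationally feasible and approximate inference benefits from extra compute.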
1Johannes C. Mayer
Ah ok, thank you. Now I get it. I was confused by (i) "Imagine the reporter could do perfect inference" and (ii) "the reporter could simply do the best inference it can in the human Bayes net (given its predicted video)". (i) I took this to mean that the reporter alone can do it, but what is actually meant is that it can do it with the use of the predictor model. (ii) Somehow I thought that "given its predicted video" was the important modification here, when in fact the only change is going from the reporter doing perfect inference to it doing the best inference that it can.