ARC recently released a technical report on eliciting latent knowledge (ELK), the focus of our current research. Roughly speaking, the goal of ELK is to incentivize ML models to honestly answer “straightforward” questions where the right answer is unambiguous and known by the model.
ELK is currently unsolved in the worst case—for every training strategy we’ve thought of so far, we can describe a case where an ML model trained with that strategy would give unambiguously bad answers to straightforward questions despite knowing better. Situations like this may or may not come up in practice, but nonetheless we are interested in finding a strategy for ELK for which we can’t think of any counterexample.
We think many people could potentially contribute to solving ELK—there’s a large space of possible training strategies and we’ve only explored a small fraction of them so far. Moreover, we think that trying to solve ELK in the worst case is a good way to “get into ARC’s headspace” and more deeply understand the research we do.
We are offering prizes of $5,000 to $50,000 for proposed strategies for ELK. We’re planning to evaluate submissions received before February 15.
For full details of the ELK problem and several examples of possible strategies, see the writeup. The rest of this post will focus on how the contest works.
Contest details
To win a prize, you need to specify a training strategy for ELK that handles all of the counterexamples that we’ve described so far, summarized in the section below (i.e. one where the breaker would need to specify something new about the test case to cause the strategy to break down). You don’t need to fully solve the problem in the worst case to win a prize; you just need to come up with a strategy that requires a new counterexample.
We’ll give a $5,000 prize to any proposal that we think clears this bar. We’ll give a $50,000 prize to a proposal which we haven’t considered and seems sufficiently promising to us or requires a new idea to break. We’ll give intermediate prizes for ideas that we think are promising but we’ve already considered, as well as for proposals that come with novel counterexamples, clarify some other aspect of the problem, or are interesting in other ways. A major purpose of the contest is to provide support for people understanding the problem well enough to start contributing; we aren’t trying to only reward ideas that are new to us.
You can submit multiple proposals, but we won’t give you separate prizes for each—we’ll give you at least the maximum prize that your best single submission would have received, but may not give much more than that.
If we receive multiple submissions based on a similar idea, we may post a comment describing the idea (with attribution) along with a counterexample. Once a counterexample has been included in the comments of this post, new submissions need to address that counterexample (as well as all the existing ones) in order to be eligible for a prize.
Ultimately prizes are awarded at our discretion, and the “rules of the game” aren’t fully precise. If you are curious about whether you are on the right track, feel free to send an email to elk@alignmentresearchcenter.org with the basic outline of an idea, and if we have time we’ll get back to you with some feedback. Below we also describe some of the directions we consider more and less promising and some general guidance.
How to submit a proposal
You can submit a proposal by copying this Google Doc template and sharing it with elk@alignmentresearchcenter.org (please give comment access in case we need to ask questions to evaluate your submission). By submitting a proposal, you are giving us permission to post the idea here with attribution. (And it's fine for you to post it after the contest or after we post a counterexample.)
Retroactive prizes
We’re giving prizes to existing proposals from David Dalrymple ($5k), Ramana Kumar ($3k), John Maxwell ($2k), and Yonadav Shavit ($1k). We’ve added relevant counterexamples in the next section, and further submissions need to handle those counterexamples in order to be eligible.
Existing counterexamples
In our report and a recent post, we described some features of the world that would cause existing approaches to ELK to fail. A new approach should work even in a situation satisfying all of these properties, or should include an argument for why some of these features are actually impossible or inconsistent with each other.
In order to generate a counterexample to a training strategy, we need to spell out a concrete “way the world could be” (a test case) where the strategy would fail. In the report we consider test cases where the AI and the human happen to understand the world by doing inference in Bayes nets. The Bayes net case is just one test case that lets us get precise about our counterexamples, and our training strategies don’t lean on it. In practice we also consider other concrete cases (e.g. deduction from premises generated by a neural network).
In this section we’ll briefly review the current state of existing counterexamples. This isn’t intended to be a self-contained description; it relies heavily on links to the report.
Ontology mismatch
All of our counterexamples are based on an ontology mismatch between two different Bayes nets, one used by an ML prediction model (“the predictor”) and one used by a human:
- The predictor’s Bayes net is used internally by a prediction model that maps an initial segment of a video to a prediction about what will happen next. This model was trained by gradient descent based on the quality of its predictions (more).
- Humans have their own model of the world, which they can improve with the help of AI assistants; those assistants can design and conduct experiments, find and evaluate arguments, and so on. Crucially, we have a limited computational budget with which to train these AI assistants, and that limits how much help they can provide. The “human’s” Bayes net captures everything the human understands about the world at the end of this process (more).
We are interested in the case where there are features of reality that are accurately reflected in the predictor’s Bayes net, but missing from this (augmented) human Bayes net. We expect the human could understand these facts eventually given enough time or enough AI assistance, but not using the available computational budget. Put differently, there is a gap between the kinds of understanding that gradient descent can build, and the kind of understanding that a human-comprehensible process of “scientific inquiry” can build using similar computational resources (more).
Furthermore, we imagine that both the predictor’s Bayes net and the human’s Bayes net reflect some underlying reality and that there is a way to map between them, which we’ll call “direct translation.” We don’t know how to do direct translation, and we are concerned that if we simply search for any mapping from the AI’s Bayes net to the human’s Bayes net we will instead get the “human simulator”: the map which correctly translates the observations from one Bayes net to the other, but then does inference in the human’s Bayes net in order to fill in the latent structure as a human would by looking at the data (more).
Counterexample features
ELK seems particularly hard when these Bayes nets satisfy the following properties:
- There are arbitrarily severe “ontology mismatches” between the predictor’s Bayes net and the human’s Bayes net, such that specifying the direct translation between them can be very complex (more).
- The human’s Bayes net is simpler than the predictor’s Bayes net, and the gap can be arbitrarily large (more).
- The predictor’s Bayes net is arbitrarily large, and individual nodes of the human’s Bayes net depend on a constant fraction of the predictor’s Bayes net. Therefore directly translating between the two Bayes nets takes arbitrarily much computation (more).
- It’s computationally easy for the human simulator to do good inference in the human’s Bayes net—at least good enough inference that it answers questions as well as the direct translator—and the gap between this complexity and the complexity of direct translation can be arbitrarily large. More precisely, a human interrogator can’t think of questions which require the human simulator to do hard inference. However, we can’t rule out the possibility that you, the algorithm-designer, will think of some trick they can use to generate hard questions (more).
- There are situations where a model which is trying to answer a question will first be able to predict what (potentially mistaken) belief the human-with-assistance will have about that question before it knows the true answer to the question (more).
- While the predictor’s Bayes net is very large, it has a succinct description which is simpler than the human’s Bayes net, and that gap is arbitrarily large. Think of simple physical laws that describe arbitrarily computationally expensive simulations, or simple cognitive heuristics that are very widely applicable (more).
- There are sequences of actions for which the predictor’s Bayes net correctly anticipates that tampering will occur yet nothing looks suspicious to the human’s Bayes net.
- After tampering has occurred, it is possible that no sequence of further actions reveals that anything is wrong, e.g. the camera and actuators may have been completely disconnected from the world and replaced with a convincing-to-humans simulation (more).
- There are sequences of actions leading to undetectable tampering that have no human-comprehensible similarities to sequences of actions leading to detectable tampering; in particular, the early steps of this undetectable tampering cannot be used to perform detectable tampering (more).
An important and very general strategy for the breaker is to suggest counterexamples where you “get what you train for.” If you propose training a model using loss function L, then the breaker will likely focus on counterexamples where the model chooses its output to minimize L. A successful approach should explain either why this isn’t a realistic thing for the model to learn (e.g. why that would be disfavored by regularization), or why the overall approach works even if the model is explicitly strategizing about how to minimize the loss.
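As a toy illustration of “getting what you train for” (a sketch that is not taken from the report; the `Episode` class, the three episodes, and the 0/1 loss are all assumptions made up for this example), consider a loss L that scores a reporter purely by agreement with human labels. If even one label is mistaken, the reporter that minimizes L is the one that reproduces the human's belief rather than the truth:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    diamond_present: bool  # ground truth, assumed to be known to the predictor
    human_can_tell: bool   # False when tampering fools the human (hypothetical flag)


def human_label(ep: Episode) -> bool:
    # When the human cannot tell, the video looks fine and they answer "present".
    return ep.diamond_present if ep.human_can_tell else True


def direct_translator(ep: Episode) -> bool:
    # Reports what the predictor actually knows.
    return ep.diamond_present


def human_simulator(ep: Episode) -> bool:
    # Reports whatever the human would conclude from the video.
    return human_label(ep)


def loss(reporter, episodes) -> float:
    # L = fraction of answers that disagree with the human's label.
    return sum(reporter(ep) != human_label(ep) for ep in episodes) / len(episodes)


episodes = [
    Episode(diamond_present=True, human_can_tell=True),
    Episode(diamond_present=False, human_can_tell=True),   # obvious theft
    Episode(diamond_present=False, human_can_tell=False),  # undetected tampering
]

losses = {r.__name__: loss(r, episodes) for r in (direct_translator, human_simulator)}
best = min(losses, key=losses.get)
print(losses, "-> minimizer of L:", best)
```

In this toy setup the human simulator attains zero loss while the direct translator is penalized on the mislabeled episode, so a breaker would focus on exactly this kind of case.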
Rough guidance
- There is a tradeoff between specifying your approach precisely, and including your own counterexample. It’s fine to describe a very broad/vague idea for solving ELK, and then present a new counterexample showing why that whole category of solutions can’t work. But if you don’t propose a counterexample, then it’s good to describe your proposal precisely enough that we understand how to actually implement it without wondering if that’s really what you meant. It’s OK to submit a very broad or informal idea together with a single very specific instance of that idea, as long as there is some version we can understand precisely.
- We suspect you can’t solve ELK just by getting better data—you probably need to “open up the black box” and include some term in the loss that depends on the structure of your model and not merely its behavior. So we are most interested in approaches that address that challenge. We could still be surprised by clever ways to penalize behavior, but we’ll hold them to a higher bar. The most plausible surprise would be finding a way to reliably make it computationally difficult to “game” the loss function, probably by using the AI itself to help compute the loss (e.g. using consistency checks or by giving the human AI assistance).
- If you are specifying a regularizer that you hope will prefer direct translation over human simulation, you should probably have at least one concrete case in mind that has all the counterexample-features above and where you can confirm that your regularizer does indeed prefer the direct translator.
- ELK already seems hard in the case of ontology identification, where the predictor uses a straightforward inference algorithm in an unknown model of the world (which we’ve been imagining as a Bayes net). When coming up with a proposal, we don’t recommend worrying about cases where the original unaligned predictor learned something more complicated (e.g. involving learned optimization other than inference). That said, you do need to worry about the case where your training scheme incentivizes learned optimization that may not have been there originally.
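The check suggested above for regularizers can be sketched as a small harness. This is only an illustration under assumed numbers: the `Reporter` fields, the two example regularizers, and every cost value below are hypothetical stand-ins for a concrete test case with the counterexample features (direct translation more complex and more expensive than human simulation, both fitting the training data equally well).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Reporter:
    name: str
    description_length: float  # hypothetical complexity measure
    compute_steps: float       # hypothetical inference cost
    train_loss: float          # loss on the labeled training data


def prefers_direct_translator(
    regularizer: Callable[[Reporter], float],
    direct: Reporter,
    simulator: Reporter,
    weight: float = 1.0,
) -> bool:
    # Regularized objective: data loss plus a weighted penalty term.
    def score(r: Reporter) -> float:
        return r.train_loss + weight * regularizer(r)

    return score(direct) < score(simulator)


# Assumed test case: both reporters fit the data, but translation is costlier.
direct = Reporter("direct_translator", description_length=5000, compute_steps=2000, train_loss=0.0)
simulator = Reporter("human_simulator", description_length=300, compute_steps=400, train_loss=0.0)

regularizers = {
    "complexity": lambda r: r.description_length,
    "compute": lambda r: r.compute_steps,
}

results = {
    name: prefers_direct_translator(reg, direct, simulator)
    for name, reg in regularizers.items()
}
print(results)
```

Under these assumed costs neither penalty prefers the direct translator, which is the kind of outcome the check is meant to surface before a proposal is submitted.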
Ask dumb questions!
A major purpose of this contest is to help people build a better understanding of our research methodology and the “game” we are playing. So we encourage people to ask clarifying questions in the comments of this post (no matter how “dumb” they are), and we’ll do our best to answer all of them. You might also want to read the comments to get more clarity about the problem.
What you can expect from us
- We’ll try to answer all clarifying questions in the comments.
- If you send in a rough outline for a proposal, we will try to understand whether it might qualify and write back something like “This qualifies,” “This might qualify but would need to be clearer and address issue X,” “We aren’t easily able to understand this proposal at all,” “This is unlikely to be on track for something that qualifies,” or “This definitely doesn’t qualify.”
- If there are more submissions than expected, we may run out of time to respond to all submissions and comments, in which case we will post an update here.
After reading through the report I wanted to make sure I understood the scenarios and counterexamples being discussed and be able to quickly refresh my memory, so I attempted to write a brief summary. Figured I'd share it here in case it helps anyone else.
Roles and Terms
SmartVault: Vault with a diamond in it, operated by a superintelligent AI tasked with keeping the diamond safe.
Predictor: The primary AI tasked with protecting the diamond. The predictor sees a video feed of the vault, predicts what actions are necessary to protect the diamond and how those actions will play out (for example, activating a trap door to eliminate a robber trying to steal the diamond), and then generates a video showing precisely what will happen.
Reporter: A second "head" or component added to the AI which is used to map the AI's understanding of what's going on in the SmartVault into a format understandable by humans (such as by responding to questions).
Humans: Observers who are training the AI to protect the diamond, but also want the AI to accurately report whether the diamond has been stolen or not. The humans may not be able to tell if the cameras have been tampered with or if the diamond has been replaced with a fake depending on what they see in the predicted video.
The scenarios take the form of a "Builder" strategy, where a training approach is proposed to keep the AI aligned with what the humans want, followed by a "Breaker" response or counterexample describing how an unaligned AI could complete the training without the humans knowing. Builder scenarios include text and links to headings in the report where more details can be found. These scenarios are simplified for brevity and may not capture some nuances.
Builder - What You'd Try First: Operate the SmartVault with a weak AI so that humans can understand everything happening. Train the reporter to answer questions accurately based on human observations. (Humans see the robber come in and take the diamond, reporter is rewarded if it correctly says diamond was stolen.)
Breaker: What if the reporter, instead of accurately reporting, behaved like a "human simulator"? It would answer correctly during the simple training, but during normal SmartVault operation it would simulate human understanding of the scenario that just occurred in order to determine what answers the humans will believe, answers that may be lies.
Builder - Have a Human Operate the SmartVault: Have a human operate the SmartVault during training and try to fool other humans watching. The human doing the operating can correctly label each outcome, and the reporter is rewarded for matching the human operator's labels. In this scenario the reporter would be rewarded for acting more like a "direct translator", which means using the predictor's knowledge to tell the true answer instead of just the one it thinks humans will believe.
Breaker: The reporter could act like a superintelligent human simulator, one that is more intelligent than the human operating the SmartVault during training. The superintelligent human simulator would be aware of ways to trick the human observer that the human SmartVault operator couldn't conceive of during training. If the predictor uses those tricks the reporter could still tell lies that the human observer would believe, because these tricks are different from the ones seen during training.
Builder - Have AI Help Humans: Have humans create a helper AI that will help them think of all the possible exploits when operating the SmartVault during training, so that the dataset is more complete when training the reporter and the reporter will be rewarded for accuracy in more scenarios.
Breaker: The helper AI would likely require significant computation power. Not only does the helper AI have to generate a comprehensive series of exploits, but it also has to be able to explain them in a way a human can understand. It likely would require more computation power than the predictor itself (imagine an extremely powerful AI that can understand things far beyond human understanding, but one that also must be able to explain everything it's doing in a way humans can understand).
Builder - Imitative Generalization: Create a helper AI that acts like a human simulator, but train the AI human simulator so that it becomes much smarter than a human. Then instead of the predictor operating the SmartVault, let the human simulator helper AI operate the SmartVault and report back what it's doing.
Breaker: Now the new human simulator is beyond the comprehension of humans, so a human wouldn't know if it's behaving as intended. It's basically like building another predictor and just hoping it acts in alignment with humans.
Builder - Penalize Complexity: A reporter that's using a human simulator to give answers a human would believe would have to be complex enough to simulate a human. So when training the reporter, reward simpler behavior, such as by training multiple reporters, each focusing on correctly answering separate simple questions like "where is the diamond" or "have the cameras been tampered with".
Breaker: If the predictor is very complex, it may require the reporter to be complex in order to translate what the predictor is predicting to something humans can understand. Imagine the predictor thinks of the world in terms of sub-atomic particles and predicts the future on the basis of the movements and interactions of billions of sub-atomic particles. The reporter has to translate the predictor's understanding into something a human can understand which might mean correctly identifying and locating the particles which make up the diamond out of the billions of particles the predictor has modeled. This could be more complex than simulating a human and telling lies a human would believe, and therefore penalizing complexity could potentially reward simulating a human and telling lies.
Builder - Penalize Computation Time: Simulating a human would take a significant amount of computation time, so penalizing computation time may incentivize the reporter to just report the truth.
Breaker: This is vulnerable to the same issue as the previous scenario: reporting the truth may also take significant computation, and it may take less computation time to simulate what lies a human would believe than to turn the predictor's processing into facts a human can follow.
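A rough way to see why both penalties can backfire is to compare how the two costs scale. The sketch below is a made-up cost model, not something from the report: it assumes direct translation has to read a constant fraction of the predictor's nodes, while human simulation costs roughly the size of the human's model regardless of how large the predictor is. The function names and the `fraction_read` parameter are hypothetical.

```python
def direct_translation_cost(predictor_nodes: int, fraction_read: float = 0.1) -> float:
    # Assumption: each answer depends on a constant fraction of the predictor's nodes.
    return fraction_read * predictor_nodes


def human_simulation_cost(human_nodes: int) -> float:
    # Assumption: inference in the human's model does not grow with the predictor.
    return float(human_nodes)


def cheaper_reporter(predictor_nodes: int, human_nodes: int) -> str:
    # The reporter a pure cost penalty would favor in this toy model.
    direct = direct_translation_cost(predictor_nodes)
    simulated = human_simulation_cost(human_nodes)
    return "direct_translator" if direct < simulated else "human_simulator"


human_nodes = 1_000
for predictor_nodes in (5_000, 1_000_000):
    print(predictor_nodes, cheaper_reporter(predictor_nodes, human_nodes))
```

With a small predictor the penalty favors the direct translator, but once the predictor is large enough the same penalty favors the human simulator, which is the Breaker's point in both of the last two scenarios.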
Builder - Penalize Depending on "Downstream" Variables: The predictor is likely to compute facts sequentially. A reporter that only told the truth would only care about the components of the prediction that correspond to information the humans want to know, such as where the diamond is or if there has been any tampering. The reporter should be able to figure this out from earlier data points in the prediction; the reporter shouldn't need to take information from the final predicted video to respond to questions. Doing so might indicate that the reporter is trying to do human simulation to figure out what lies the human will believe when the human views the predicted video. In the training stage the reporter could be penalized for looking at later stages of the prediction that a human simulator might want to see, and also penalized for looking at earlier stages but spending a lot of time computing, which might suggest it's trying to put together its own prediction for use in human simulation. Quote from the report with helpful example:
Breaker: It may be more efficient in some cases for the reporter to do human simulation to find out what the human will believe. In some situations the reporter could know early on that a human would believe a certain answer, even before the reporter knows whether the answer will be true.
Builder - Penalize Reporters that Work with Many Different Predictors: A reporter that's doing human simulation would be able to effectively lie to humans no matter which predictor it's attached to, since it primarily has to know what the human will believe, not what the predicted facts are. An honest reporter would need to know how to translate the predictor's output into information a human will understand. A reporter that functioned correctly with other predictors is likely doing human simulation, so a reporter that worked with multiple predictors could be penalized during training.
Breaker: A reporter could pretend not to work with other predictors to avoid being penalized.
Builder - Penalize Inconsistencies Across Different Inputs: During training a human operator sets up two scenarios in the SmartVault that are nearly identical, except in one it's obvious that tampering has happened. If the reporter only reports tampering in the obvious scenario it might suggest that the reporter is only reporting what the humans will believe. The reporter could be penalized for behavior that's inconsistent between very similar scenarios, since it might suggest that the reporter is focused on what humans will believe instead of on the inputs making up the scenario itself.
Breaker: A reporter could intentionally try to look consistent during training to avoid being penalized. A reporter that's simulating a human could figure out what behavior looks consistent to a human and optimize on that, in addition to optimizing on what a human will believe.
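Here is a toy sketch of that last Builder/Breaker pair. It is my own simplification, not the report's formalism: the `Scenario` fields (including the `in_checked_pair` flag, which marks scenarios the human could compare against a near-identical twin) and the three reporter functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    tampered: bool         # ground truth
    looks_tampered: bool   # what an unaided human sees in the video
    in_checked_pair: bool  # hypothetical: part of a pair the human can compare


def direct_translator(s: Scenario) -> bool:
    return s.tampered


def naive_human_simulator(s: Scenario) -> bool:
    return s.looks_tampered


def consistency_aware_simulator(s: Scenario) -> bool:
    # Reports tampering whenever the human could notice it, either directly
    # or by comparing the scenario with its near-identical twin.
    return s.looks_tampered or (s.tampered and s.in_checked_pair)


def inconsistency_penalty(reporter, pairs) -> int:
    # One penalty point per pair of near-identical scenarios answered differently.
    return sum(reporter(a) != reporter(b) for a, b in pairs)


subtle = Scenario(tampered=True, looks_tampered=False, in_checked_pair=True)
obvious = Scenario(tampered=True, looks_tampered=True, in_checked_pair=True)
training_pairs = [(subtle, obvious)]

# Deployment: subtle tampering with no twin scenario to compare against.
deployment = Scenario(tampered=True, looks_tampered=False, in_checked_pair=False)

for reporter in (direct_translator, naive_human_simulator, consistency_aware_simulator):
    print(reporter.__name__, inconsistency_penalty(reporter, training_pairs), reporter(deployment))
```

In this toy model the penalty catches the naive simulator, but the consistency-aware simulator gets the same zero penalty as the direct translator during training and still denies the tampering at deployment.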
Please let me know if I've made any errors or if it looks like I've misunderstood any of the strategies or counterexamples.
Looks good to me.