Suppose that I have a dataset D of observed (x, y) pairs, and I’m interested in predicting the label y* for each point x* in some new set D*. Perhaps D is a set of forecasts from the last few years, and D* is a set of questions about the coming years that are important for planning.

The classic deep learning approach is to fit a model f on D, and then predict y* using f(x*).

This approach implicitly uses a somewhat strange prior, which depends on exactly how I optimize f. I may end up with the model with the smallest l2 norm, or the model that’s easiest to find with SGD, or the model that’s most robust to dropout. But none of these are anywhere close to the “ideal” beliefs of a human who has updated on D.

This means that neural nets are unnecessarily data hungry, and more importantly that they can generalize in an undesirable way. I now think that this is a safety problem, so I want to try to attack it head on by learning the “right” prior, rather than attempting to use neural nets as an implicit prior.

Warm-up 1: human forecasting

If D and D* are small enough, and I’m OK with human-level forecasts, then I don’t need ML at all.

Instead I can hire a human to look at all the data in D, learn all the relevant lessons from it, and then spend some time forecasting y* for each x*.

Now let’s gradually relax those assumptions.

Warm-up 2: predicting human forecasts

Suppose that D* is large but that D is still small enough that a human can extract all the relevant lessons from it (or that for each x* in D*, there is a small subset of D that is relevant).

In this case, I can pay humans to make forecasts for many randomly chosen x* in D*, train a model f to predict those forecasts, and then use f to make forecasts about the rest of D*.

The generalization is now coming entirely from human beliefs, not from the structure of the neural net — we are only applying neural nets to iid samples from D*.
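
To make the setup concrete, here is a minimal sketch of this warm-up. The helper functions are placeholders (nothing hinges on these particular names or on how the model is fit); the point is just that the learned model only ever has to generalize from one iid sample of D* to another.

```python
import random

def ask_human_to_forecast(x):
    """Hypothetical stand-in for paying a human to forecast y for the question x."""
    raise NotImplementedError

def fit_model(labeled_pairs):
    """Hypothetical stand-in for ordinary supervised training; returns a callable f."""
    raise NotImplementedError

def distill_human_forecasts(D_star, n_labels=1000):
    # Label a random iid subset of D* with human forecasts.
    sample = random.sample(D_star, n_labels)
    labeled = [(x, ask_human_to_forecast(x)) for x in sample]
    # Fit f to imitate the human, then apply it to all of D*.
    f = fit_model(labeled)
    return {x: f(x) for x in D_star}
```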

Learning the human prior

Now suppose that D is large, such that a human can’t update on it themselves. Perhaps D contains billions of examples, but we only have time to let a human read a few pages of background material.

Instead of learning the unconditional human forecast P(y|x), we will learn the forecast P(y|x, Z), where Z is a few pages of background material that the human takes as given. We can also query the human for the prior probability Prior(Z) that the background material is true.

Then we can train f(y|x, Z) to match P(y|x, Z), and optimize Z* for:

log Prior(Z*) + Σ_{(x, y) ∈ D} log f(y | x, Z*)

We train f in parallel with optimizing Z*, on inputs consisting of the current value of Z* together with questions x sampled from D and D*.
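
Here is a rough sketch of what that joint loop could look like. Everything in it is a stand-in: the interface for f, the way the human is queried, and the local-perturbation search over Z are placeholders for whatever actually works, and the data term is estimated on minibatches rather than all of D.

```python
import random

def human_conditional_forecast(x, Z):
    """The human's P(y | x, Z): a forecast for x, taking the background text Z as given."""
    raise NotImplementedError

def human_log_prior(Z):
    """The human's judgment of log Prior(Z)."""
    raise NotImplementedError

def propose_perturbation(Z):
    """Hypothetical local edit to the background text Z; the real search procedure is unspecified."""
    raise NotImplementedError

def score(Z, f, D, batch_size):
    """log Prior(Z) plus a minibatch estimate of the data term sum over (x, y) in D of log f(y | x, Z)."""
    batch = random.sample(D, batch_size)
    data_term = sum(f.log_prob(y, x, Z) for x, y in batch) * (len(D) / batch_size)
    return human_log_prior(Z) + data_term

def train_loop(D, D_star, f, Z, n_steps, batch_size=64):
    for _ in range(n_steps):
        # 1. Train f to match the human's forecast at the *current* Z,
        #    on questions drawn from both D and D*.
        xs = random.sample([x for x, _ in D], batch_size) + random.sample(D_star, batch_size)
        targets = [human_conditional_forecast(x, Z) for x in xs]
        f.fit_step(xs, targets, Z)

        # 2. Hill-climb on Z: keep a proposed edit if it improves the objective.
        Z_new = propose_perturbation(Z)
        if score(Z_new, f, D, batch_size) > score(Z, f, D, batch_size):
            Z = Z_new
    return Z, f
```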

For example, Z might specify a few explicit models for forecasting and trend extrapolation, a few important background assumptions, and guesses for a wide range of empirical parameters. Then a human who reads Z can evaluate how plausible it is on its face, or they can take it on faith in order to predict y* given x*.

The optimal Z* is then the set of assumptions, models, and empirical estimates that works best on the historical data. The human never has to reason about more than one datapoint at a time — they just have to evaluate what Z* implies about each datapoint in isolation, and evaluate how plausible Z* is a priori.

This approach has many problems. Two particularly important ones:

  • To be competitive, this optimization problem needs to be nearly as easy as optimizing f directly on D, but it seems harder: finding Z* might be much harder than learning f, learning a conditional f might be much harder than learning an unconditional f, and jointly optimizing Z and f might present further difficulties.
  • Even if it worked, our forecasts would only be “human-level” in a fairly restrictive sense — they wouldn’t even be as good as a human who actually spent years practicing on D before making a forecast on D*. To be competitive, we want the forecasts in the iid case to be at least as good as fitting a model directly.

I think the first point is an interesting ML research problem. (If anything resembling this approach ever works in practice, credit will rightly go to the researchers who figure out the precise version that works and resolve those issues, and this blog post will be a footnote.) I feel relatively optimistic about our collective ability to solve concrete ML problems, unless they turn out to be impossible. I’ll give some preliminary thoughts in the next section “Notes & elaborations.”

The second concern, that we need some way to go beyond human level, is a central philosophical issue and I’ll return to it in the subsequent section “Going beyond the human prior.”

Notes & elaborations

  • Searching over long texts may be extremely difficult. One idea to avoid this is to try to have a human guide the search, by either generating hypotheses Z at random or sampling perturbations to the current value of Z. Then we can fit a generative model of that exploration process and perform search in the latent space (and also fit f in the latent space rather than having it take Z as input). That rests on two hopes: (i) learning the exploration model is easy relative to the other optimization we are doing, (ii) searching for Z in the latent space of the human exploration process is strictly easier than the corresponding search over neural nets. Both of those seem quite plausible to me.
  • We don’t necessarily need to learn f everywhere; it only needs to be valid in a small neighborhood of the current Z. That may not be much harder than learning the unconditional f.
  • Z represents a full posterior rather than a deterministic “hypothesis” about the world, e.g. it might say “R0 is uniform between 2 and 3.” What I’m calling Prior(Z) is really the KL between the prior and Z, and P(y|x,Z) will itself reflect the uncertainty in Z. The motivation is that we want a flexible and learnable posterior. (This is particularly valuable once we go beyond human level.)
  • This formulation queries the human for Prior(Z) before each fitness evaluation. That might be fine, or you might need to learn a predictor of that judgment. It might be easier for a human to report a ratio Prior(Z)/Prior(Z′) than to give an absolute prior probability, but that’s also fine for optimization. I think there are a lot of difficulties of this flavor that are similar to other efforts to learn from humans.
  • For the purpose of studying the ML optimization difficulties I think we can basically treat the human as an oracle for a reasonable prior. We will then need to relax that rationality assumption in the same way we do for other instances of learning from humans (though a lot of the work will also be done by our efforts to go beyond the human prior, described in the next section).

Going beyond the human prior

How do we get predictions better than explicit human reasoning?

We need to have a richer latent space Z, a better Prior(Z), and a better conditional P(y|x, Z).

Instead of having a human predict y given x and Z, we can use amplification or debate to train f(y|x, Z) and Prior(Z). This allows Z to be a large object that cannot be directly accessed by a human.

For example, Z might be a full library of books describing important facts about the world, heuristics, and so on. Then we may have two powerful models debating “What should we predict about x, assuming that everything in Z is true?” Over the course of that debate they can cite small components of Z to help make their case, without the human needing to understand almost anything written in Z.

In order to make this approach work, we need to do a lot of things:

  1. We still need to deal with all the ML difficulties described in the preceding section.
  2. We still need to analyze debate/amplification, and now we’ve increased the problem difficulty slightly. Rather than merely requiring them to produce the “right” answers to questions, we also need them to implement the “right” prior. We already needed to implement the right prior as part of answering questions correctly, so this isn’t too much of a strengthening, but we are calling attention to a particularly challenging case. It also imposes a particular structure on that reasoning which is a real (but hopefully slight) strengthening.
  3. Entangled with the new analysis of amplification/debate, we also need to ensure that Z is able to represent a rich enough latent space. I’ll discuss implicit representations of Z in the next section “Representing Z.”
  4. Representing Z implicitly and using amplification or debate may make the optimization problem even more difficult. I’ll discuss this in the subsequent section “Jointly optimizing Mz and f.”

Representing Z

I’ve described Z as being a giant string of text. If debate/amplification work at all then I think text is in some sense “universal,” so this isn’t a crazy restriction.

That said, representing complex beliefs might require very long text, perhaps many orders of magnitude larger than the model f itself. That means that optimizing for (Z, f) jointly will be much harder than optimizing for f alone.

The approach I’m most optimistic about is representing Z implicitly as the output of another model Mz. For example, if Z is a text that is trillions of words long, you could have Mz output the ith word of Z on input i.

(To be really efficient you’ll need to share parameters between f and Mz but that’s not the hard part.)

This can get around the most obvious problem — that Z is too long to possibly write down in its entirety — but I think you actually have to be pretty careful about the implicit representation or else we will make Mz’s job too hard (in a way that will be tied up with the competitiveness of debate/amplification).

In particular, I think that representing Z as implicit flat text is unlikely to be workable. I’m more optimistic about the kind of approach described in approval-maximizing representations — Z is a complex object that can be related to slightly simpler objects, which can themselves be related to slightly simpler objects… until eventually bottoming out with something simple enough to be read directly by a human. Then Mz implicitly represents Z as an exponentially large tree, and only needs to be able to do one step of unpacking at a time.
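
As a toy illustration of that unpacking structure (the interface below is just one way it could look, not something pinned down in this post): Mz only ever answers “unpack this node” queries, and a human or debater only ever reads the short summaries along a single path, never the whole exponentially large tree.

```python
from typing import List, Tuple

def Mz_unpack(node_id: str) -> Tuple[str, List[str]]:
    """One step of unpacking: return (human-readable summary, ids of child nodes).
    Hypothetical interface for the model Mz that implicitly represents Z."""
    raise NotImplementedError

def read_path(path: List[int], root: str = "root") -> str:
    """Follow a path of child indices from the root, reading one node per step.
    The reader only ever sees the summaries along this one path."""
    node = root
    summaries = []
    for child_index in path:
        summary, children = Mz_unpack(node)
        summaries.append(summary)
        node = children[child_index]
    return "\n".join(summaries)
```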

Jointly optimizing Mz and f

In the first section I discussed a model where we learn f(y|x, Z) and then use it to optimize Z. This is harder if Z is represented implicitly by Mz, since we can’t really afford to let f take Mz as input.

I think the most promising approach is to have Mz and f both operate on a compact latent space, and perform optimization in this space. I mention that idea in Notes & Elaborations above, but want to go into more detail now since it gets a little more complicated and becomes a more central part of the proposal.

(There are other plausible approaches to this problem; having more angles of attack makes me feel more comfortable with the problem, but all of the others feel less promising to me and I wanted to keep this blog post a bit shorter.)

The main idea is that rather than training a model Mz(·) which implicitly represents Z, we train a model Mz(·, z) which implicitly represents a distribution over Z, parameterized by a compact latent z.

Mz is trained by iterated amplification to imitate a superhuman exploration distribution, analogous to the way that we could ask a human to sample Z and then train a generative model of the human’s hypothesis-generation. Training Mz this way is itself an open ML problem, similar to the ML problem of making iterated amplification work for question-answering.

Now we can train f(y|x, z) using amplification or debate. Whenever we would want to reference Z, we use Mz(·, z). Similarly, we can train Prior(z). Then we choose z* to optimize log Prior(z*) + Σ_{(x, y) ∈ D} log f(y | x, z*).

Rather than ending up with a human-comprehensible posterior Z*, we’ll end up with a compact latent z*. The human-comprehensible posterior Z* is implemented implicitly by Mz(·, z*).
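
To spell out that last optimization step, here is a sketch of choosing z* in the compact latent space, written in PyTorch for concreteness. Gradient ascent on z is just one possible search procedure, and the interfaces for f, Prior, and the data batching are placeholders; the point is where the objective log Prior(z*) + Σ_{(x, y) ∈ D} log f(y | x, z*) enters.

```python
import torch

def sample_batch(D, batch_size):
    """Hypothetical helper returning a random minibatch (xs, ys) from D as tensors."""
    raise NotImplementedError

def optimize_z(f, prior, D, z_dim=1024, steps=10_000, batch_size=256, lr=1e-3):
    # z is the compact latent; Mz(., z) is consulted inside f and prior wherever
    # they would otherwise want to reference the implicit, very large Z.
    z = torch.zeros(z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        xs, ys = sample_batch(D, batch_size)
        # Minibatch estimate of the data term: sum over (x, y) in D of log f(y | x, z).
        data_term = f.log_prob(ys, xs, z).sum() * (len(D) / batch_size)
        objective = prior.log_prob(z) + data_term
        opt.zero_grad()
        (-objective).backward()  # maximize by minimizing the negative
        opt.step()
    return z.detach()
```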

Outlook

I think the approach in this post can potentially resolve the issue described in Inaccessible Information, which I think is one of the largest remaining conceptual obstacles for amplification/debate. So overall I feel very excited about it.

Taking this approach means that amplification/debate need to meet a slightly higher bar than they otherwise would, and introduces a bit of extra philosophical difficulty. It remains to be seen whether amplification/debate will work at all, much less whether they can meet this higher bar. But overall I feel pretty excited about this outcome, since I was expecting to need a larger reworking of amplification/debate.

I think it’s still very possible that the approach in this post can’t work for fundamental philosophical reasons. I’m not saying this blog post is anywhere close to a convincing argument for feasibility.

Even if the approach in this post is conceptually sound, it involves several serious ML challenges. I don’t see any reason those challenges should be impossible, so I feel pretty good about that — it always seems like good news when you can move from philosophical difficulty to technical difficulty. That said, it’s still quite possible that one of these technical issues will be a fundamental deal-breaker for competitiveness.

My current view is that we don’t have candidate obstructions for amplification/debate as an approach to AI alignment, though we have a lot of work to do to actually flesh those out into a workable approach. This is a more optimistic place than I was at a month ago when I wrote Inaccessible Information.



ESRogs:

> In this case, I can pay humans to make forecasts for many randomly chosen x* in D*, train a model f to predict those forecasts, and then use f to make forecasts about the rest of D*.
>
> The generalization is now coming entirely from human beliefs, not from the structure of the neural net — we are only applying neural nets to iid samples from D*.

Perhaps a dumb question, but don't we now have the same problem at one remove? The model for predicting what the human would predict would still come from a "strange" prior (based on the l2 norm, or whatever).

Does the strangeness just get washed out by the one layer of indirection? Would you ever want to do two (or more) steps, and train a model to predict what a human would predict a human would predict?

The difference is that you can draw as many samples as you want from D* and they are all iid. Neural nets are fine in that regime.

ESRogs:

Ah, I see. It sounds like the key thing I was missing was that the strangeness of the prior only matters when you're testing on a different distribution than you trained on. (And since you can randomly sample from x* when you solicit forecasts from humans, the train and test distributions can be considered the same.)

Is that actually true though? Why is that true? Say we are training the model on a dataset of N human answers, and then we are going to deploy it to answer 10N more questions, all from the same big pool of questions. The AI can't tell whether it is in training or deployment, but it could decide to follow a policy of giving some sort of catastrophic answer with probability 1/10N, so that probably it'll make it through training just fine and then still get to cause catastrophe.

That's right---you still only get a bound on average quality, and you need to do something to cope with failures so rare they never appear in training (here's a post reviewing my best guesses).

But before, you weren't even in the game: it wouldn't matter how well adversarial training worked, because you didn't even have the knowledge to tell whether a given behavior is good or bad. You weren't even getting the right behavior on average.

(In the OP I think the claim "the generalization is now coming entirely from human beliefs" is fine; I meant generalization from one distribution to another. "Neural nets are fine" was sweeping these issues under the rug. Though note that in the real world the distribution will change from neural net training to deployment; it's just exactly the normal robustness problem. The point of this post is just to get it down to only a robustness problem that you could solve with some kind of generalization of adversarial training; the reason to set it up as in the OP was to make the issue more clear.)

evhub:

I agree with Daniel. Certainly training on actual iid samples from the deployment distribution helps a lot—as it ensures that your limiting behavior is correct—but in the finite data regime you can still find a deceptive model that defects some percentage of the time.

ESRogs:

This is a good question, and I don't know the answer. My guess is that Paul would say that that is a potential problem, but different from the one being addressed in this post. Not sure though.

Yeah, that's my view.

Thanks for confirming.

Ofer:

I'm confused about this point. My understanding is that, if we sample iid examples from some dataset and then naively train a neural network with them, in the limit we may run into universal prior problems, even during training (e.g. an inference execution that leverages some software vulnerability in the computer that runs the training process).

Nisan:

In this case humans are doing the job of transferring from D to D*, and the training algorithm just has to generalize from a representative sample of D* to the test set.

ESRogs:

Thank you, this was helpful. I hadn't understood what was meant by "the generalization is now coming entirely from human beliefs", but now it seems clear. (And in retrospect obvious if I'd just read/thought more carefully.)

This is a good and valid question -- I agree, it isn't fair to say generalization comes entirely from human beliefs.

An illustrative example: suppose we're talking about deep learning, so our predicting model is a neural network. We haven't specified the architecture of the model yet. We choose two architectures, and train both of them from our subsampled human-labeled D* items. Almost surely, these two models won't give exactly the same outputs on every input, even in expectation. So where did this variability come from? Some sort of bias from the model architecture!

The Alignment Newsletter summary + opinion for this post is here.

I was directed here from an article about malign priors. I saw it was argued that the implicit prior in standard machine learning algorithms was probably malign.

And I'm wondering how you could, even in principle, learn a prior that's knowably not malign starting from a potentially malign one.

I think one of the biggest dangers of a malign prior is that it could result in a treacherous turn to seize power, for example from believing it is in an alien simulation that would reward it for doing so. But if the implicit prior in machine learning algorithms would do this, I don't see how to make it avoid learning a prior that itself has a treacherous turn. That is, normally it would provide reasonable results, but at some point, when it's most important to avoid it, the AI will think something dangerous, like that it's in a simulation incentivizing misbehavior. I mean, I think whatever agent could manipulate the AI's beliefs using the implicit prior in machine learning would have an incentive to manipulate the AI's beliefs about the learned prior to allow for the agent to also control the AI through it.

The only ways I can think of are either hoping your prior learner is too stupid to manage to embed a treacherous turn in its learned prior, or having high-enough interpretability that you can verify the prior is safe. But an agent trying to make the learned prior malign would have an incentive to do whatever it can to make the learned malign prior look as safe as possible to any sort of interpretability tools, which could make this hard.

I haven't been able to find any other articles about this, so if anyone could link some, that would be great.

The motivation is that we want a flexible and learnable posterior.

-Paul Christiano, 2020

Ahem, back on topic, I'm not totally sure what actually distinguishes f and Z, especially once you start jointly optimizing them. If f incorporates background knowledge about the world, it can do better at prediction tasks. Normally we imagine f having many more parameters than Z, and so being more likely to squirrel away extra facts, but if Z is large then we might imagine it containing computationally interesting artifacts like patterns that are designed to train a trainable f on background knowledge in a way that doesn't look much like human-written text.

Now, maybe you can try to ensure that Z is at least somewhat textlike via making sure it's not too easy for a discriminator to tell from human text, or requiring it to play some functional role in a pure text generator, or whatever. There will still be some human-incomprehensible bits that can be transmitted through Z (because otherwise you'd need a discriminator so good that Z couldn't be superhuman), but at least the amount is sharply limited.

But I'm really lost on how you could hope to limit the f side of this dichotomy. Penalize it for understanding the world too well given a random Z? Now it just has an incentive to notice random Zs and "play dead." Somehow you want it not to do better by just becoming a catch-all model of the training data, even on the actual training data. This might be one of those philosophical problems, given that you're expecting it to interpret natural language passages, and the lack of a bright line between "understanding natural language" and "modeling the world."

> I'm not totally sure what actually distinguishes f and Z, especially once you start jointly optimizing them. If f incorporates background knowledge about the world, it can do better at prediction tasks. Normally we imagine f having many more parameters than Z, and so being more likely to squirrel away extra facts, but if Z is large then we might imagine it containing computationally interesting artifacts like patterns that are designed to train a trainable f on background knowledge in a way that doesn't look much like human-written text.

f is just predicting P(y|x, Z), it's not trying to model D. So you don't gain anything by putting facts about the data distribution in f---you have to put them in Z so that it changes P(y|x,Z).

> Now, maybe you can try to ensure that Z is at least somewhat textlike via making sure it's not too easy for a discriminator to tell from human text, or requiring it to play some functional role in a pure text generator, or whatever.

The only thing Z does is get handed to the human for computing P(y|x,Z).

Ah, I think I see, thanks for explaining. So even when you talk about amplifying f, you mean a certain way of extending human predictions to more complicated background information (e.g. via breaking down Z into chunks and then using copies of f that have been trained on smaller Z), not fine-tuning f to make better predictions. Or maybe some amount of fine-tuning for "better" predictions by some method of eliciting its own standards, but not by actually comparing it to the ground truth.

This (along with eventually reading your companion post) also helps resolve the confusion I was having over what exactly was the prior in "learning the prior" - Z is just like a latent space, and f is the decoder from Z to predictions. My impression is that your hope is that if Z and f start out human-like, then this is like specifying the "programming language" of a universal prior, so that search for highly-predictive Z, decoded through f, will give something that uses human concepts in predicting the world.

Is that somewhat in the right ballpark?

> So even when you talk about amplifying f, you mean a certain way of extending human predictions to more complicated background information (e.g. via breaking down Z into chunks and then using copies of f that have been trained on smaller Z), not fine-tuning f to make better predictions.

That's right, f is either imitating a human, or it's trained by iterated amplification / debate---in any case the loss function is defined by the human. In no case is f optimized to make good predictions about the underlying data.

> My impression is that your hope is that if Z and f start out human-like, then this is like specifying the "programming language" of a universal prior, so that search for highly-predictive Z, decoded through f, will give something that uses human concepts in predicting the world.

Z should always be a human-readable (or amplified-human-readable) latent; it will necessarily remain human-readable because it has no purpose other than to help a human make predictions. f is going to remain human-like because it's predicting what the human would say (or what the human-consulting-f would say etc.).

The amplified human is like the programming language of the universal prior, Z is like the program that is chosen (or slightly more precisely: Z is like a distribution over programs, described in a human-comprehensible way) and f is an efficient distillation of the intractable ideal.

I'm trying to understand what you mean by human prior here. Image classification models are vulnerable to adversarial examples. Suppose I randomly split an image dataset into D and D* and train an image classifier using your method. Do you predict that it will still be vulnerable to adversarial examples?

Yes.  We're just aiming for a distillation of the overseer's judgments, but that's what existing imagenet models are anyway, so we'll be vulnerable to adversarial examples for the same reason.

I'm trying to get a better handle on what the benefits coming from LTP are. Here's my current picture - are there points here where I've misundersood?
_________

The core problem: We have a training distribution (x, y) ~ D and a deployment distribution (x*, y*) ~ D*, where D != D*. We would rather not rely on ML OOD generalization from D to D*. Instead, we would rather have a human label D*, train an ML model on those labels, and only rely on IID generalization. Suppose D is too large for a human to process. If the human knows how to label D* without learning from D, that’s fine. But D* might be very hard for humans. In particular we need to outperform prosaic ML: the human (before updating on D) needs to outperform an ML model (after updating on D).

Insight from LTP: Ideally, we can compress D into something more manageable: some latent variable z*. Then the human can use z* to predict P_H(y* | x*, z*) instead of just P_H(y* | x*), and can now hopefully outperform the prosaic ML model. The benefit is we can now rely on IID generalization while remaining competitive.

At first, this seems to assume that it is possible to compress the key information in D into a much smaller core z* containing the main insights (distilled core assumption). For example, if D were movements of planets, z* might be the laws of physics. This post argues this is not necessary: by using amplification or debate, the amplified human can use a very large z*. But since the amplification/debate models are ML models, and we’re running these models to aid human decisions on x*, aren’t we back to relying on ML OOD generalization, and so back where we started?

I think your description is correct.

The distilled core assumption seems right to me because the neural network weights are already a distilled representation of D, and we only need to compete with that representation. For that reason, I expect z* to have roughly the same size as the neural network parameters.

My main reservation is that this seems really hard (and maybe in some sense just a reframing of the original problem). We want z to be a representation of what the neural network learned that a human can manipulate in order to reason about what it implies on D*. But what is that going to look like? If we require competitiveness then it seems like z has to look quite a lot like the weights of a neural network...

In writing the original post I was imagining z* being much bigger than a neural network but distilled by a neural network in some way. I've generally moved away from that kind of perspective, partly based on the kinds of considerations in this post.

> But since the amplification/debate models are ML models, and we’re running these models to aid human decisions on x*, aren’t we back to relying on ML OOD generalization, and so back where we started?

I now think we're going to have to actually have z* reflect something more like the structure of the unaligned neural network, rather than another model (Mz) that outputs all of the unaligned neural network's knowledge.

That said, I'm not sure we require OOD generalization even if we represent z via a model Mz. E.g. suppose that Mz(i) is the ith word of the intractably-large z. Then the prior calculation can access all of the words i in order to evaluate the plausibility of the string represented by Mz. We then use that same set of words at training time and test time. If some index i is used at test time but not at training time, then the model responsible for evaluating Prior(z) is incentivized to access that index in order to show that z is unnecessarily complex. So every index i should be accessed on the training distribution. (Though they need not be accessed explicitly, just somewhere in the implicit exponentially large tree).

Like I said, I'm a bit less optimistic about doing this kind of massive compression. For now, I'm just thinking about the setting where our human has plenty of time to look at z in detail even if it's the same size as the weights of our neural network. If we can make that work, then I'll think about how to do it in the case of computationally bounded humans (which I expect to be straightforward).

Thanks. This is helpful. I agree that LTP with the distilled core assumption buys us a lot, both theoretically and probably in practice too.

> The distilled core assumption seems right to me because the neural network weights are already a distilled representation of D, and we only need to compete with that representation... My main reservation is that this seems really hard... If we require competitiveness then it seems like z has to look quite a lot like the weights of a neural network

Great, agreed with all of this.

> In writing the original post I was imagining z* being much bigger than a neural network but distilled by a neural network in some way. I've generally moved away from that kind of perspective, partly based on the kinds of considerations in this post

I share the top-line view, but I'm not sure what issues obfuscated arguments present for large z*, other than generally pushing more difficulty onto alignment/debate. (Probably not important to respond to, just wanted to flag in case this matters elsewhere.)

> That said, I'm not sure we require OOD generalization even if we represent z via a model Mz. E.g. suppose that Mz(i) is the ith word of the intractably-large z.

I agree that Mz (= z*) does not require OOD generalization. My claim is that the amplified model using Mz involves an ML model which must generalize OOD. On D, our y-targets are H^A(x, z*), where H^A is an amplified human. On D*, our y-targets are similarly H^A(x*, z*). The key question for me is whether our y-targets on D* are good. If we use the distilled core assumption, they are - they're exactly the predictions the human makes after updating on D. Without it, our y-targets depend on H^A, which involves an ML model.

In particular, I'm assuming H^A is something like human + policy π, where π was optimized to imitate H on D (with z sampled), but is making predictions on D* now. Maybe the picture is that we instead run IDA from scratch on D*? E.g. for amplification, this involves ignoring the models/policies we already have, starting with the usual unaided human supervision on D* at first, and bootstrapping all the way up. I suppose this works, but then couldn't we just have run IDA on D* without access to Mz (which itself can still access superhuman performance)?

> π was optimized to imitate H on D

It seems like you should either run separate models for D and D*, or jointly train the model on both D and D*, definitely you shouldn't train on D then run on D* (and you don't need to!).

> I suppose this works, but then couldn't we just have run IDA on D* without access to Mz (which itself can still access superhuman performance)?

The goal is to be as good as an unaligned ML system though, not just to be better than humans. And the ML system updates on D, so we need to update on D too.

> It seems like you should either run separate models for D and D*, or jointly train the model on both D and D*, definitely you shouldn't train on D then run on D* (and you don't need to!).

Sorry yes, you're completely right. (I previously didn't like that there's a model trained on D which only gets used for finding z*, but realized it's not a big deal.)

> The goal is to be as good as an unaligned ML system though, not just to be better than humans. And the ML system updates on D, so we need to update on D too.

I agree - I mean for the alternative to be running IDA on D*, using D as an auxiliary input (rather than using indirection through Mz). In other words, if we need IDA to access a large context Mz, we could also use IDA to access a large context D? Without something like the distilled core assumption, I'm not sure if there are major advantages one way or the other?

OTOH, with something like the distilled core assumption, it's clearly better to go through Mz, because Mz is much smaller than D (I think of this as amortizing the cost of distilling D).

Even if you were taking D as input and ignoring tractability, IDA still has to decide what to do with D, and that needs to be at least as useful as what ML does with D (and needs to not introduce alignment problems in the learned model). In the post I'm kind of vague about that and just wrapping it up into the philosophical assumption that HCH is good, but really we'd want to do work to figure out what to do with D, even if we were just trying to make HCH aligned (and I think even for HCH competitiveness matters because it's needed for HCH to be stable/aligned against internal optimization pressure).

Okay, that makes sense (and seems compelling, though not decisive, to me). I'm happy to leave it here - thanks for the answers!