Thanks for writing this up. It is useful to see a non-Paul perspective on the same ideas, both in terms of clarifying the approach, and eliminating a few of my confusions.
A typo: After "or defined in my notation as", you have twice rather than
I've not yet been through the details, but it'd be helpful if you'd clarify the starting point and scope a little, since I may well be misunderstanding you (and indeed Paul). In particular on this:
Specifically, f+ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding q as a logical statement and unembedding its answer in deduced_stmts.
My immediate thought is that in general question answering there is no unique honest unembedding. Much of answer formation is in deciding which information is most relevant, important, useful, tacitly assumed... (even assuming fixed world model and fixed logical deductions).
So I assume that you have to mean a narrower context where e.g. the question specifies the logical form the answer must take and the answering human/model assigns values to pre-defined variables.
For a narrower setting, the gist of the post makes sense to me - but I don't currently see how a solution there would address the more general problem. Is finding a prior that works for closed questions with unique honest answers sufficient?
The more general setting seems difficult as soon as you're asking open questions.
If you do apply the constraint there, then it seems f+ must do hugely more than a simple unembedding from deductions. It'll need to robustly select the same answer as a human from a huge set of honest answers, which seems to require something equivalent to predicting the human. At that point it's not clear to me when exactly we'd want f+ to differ from f− in its later answers (there exist clear cases; I don't see a good general rule, or how you'd form a robust dataset to learn a rule).
To put it another way, [honest output to q from fixed world model] doesn't in general uniquely define an answer until you know what the answerer believes the asker of q values.
Apologies if I'm stating the obvious: I'm probably confused somewhere, and wish to double-check my 'obvious' assumptions. Clarifications welcome.
I mostly agree with what Paul said re using various techniques to improve the evaluation of f+ to ensure you can test it on more open-ended questions. That being said, I'm more optimistic that, if you can get the initial training procedure right, you can rely on generalization to fill in the rest. Specifically, I'm imagining a situation where the training dataset is of the narrower form you talk about such that f+ and f− always agree (as in Step 3 here), but where the deployment setting wouldn't necessarily have to be of this form, since once you're confident that you've actually learned M+ and not e.g. M−, you can use it for all sorts of things that wouldn't ever be in that training dataset (the hard part, of course, is ever actually being confident that you did in fact learn the intended model).
(Also, thanks for catching the typo—it should be fixed now.)
Having thought about it more (hopefully with more clarity), I think I have trouble imagining training data for that:
It seems to me that we can be highly confident about matters of fact (how many chairs are in this room...), but less confident once value judgements come into play (which of A or B is the better answer to "How should I go about designing a chair?").
[Of course it's not black-and-white: one can make a philosophical argument that all questions are values questions. However, I think this is an issue even if we stick to pragmatic, common-sense approaches.]
I don't think we can remedy this for values questions by including only data that we're certain of. It seems to me that works for facts questions due to the structure of the world: it's so hugely constrained by physical law that you can get an extremely good model by generalizing from sparse data from a different distribution.
It's not clear that anything analogous works for generalizing preferences (maybe?? but I'd guess not). I'd expect an trained on [data we're highly confident is correct] to generalize poorly to general open questions.
Similarly, in Paul's setup I think the following condition will fail if we need to be highly confident of the correctness (relative to what is known) of the small dataset:
It's entirely plausible you can learn "correct language usage" in the narrow sense from consistency on the small dataset (i.e. you may infer a [deduced_statement -> natural_language_equivalent] mapping). I don't think it's plausible you learn it in the sense required (i.e. a [(set_of_all_deduced_statements, Q) -> natural_language_answer] mapping).
Again, perhaps I'm (not even) wrong, but I think the above accurately describes my current thinking.
Ok, I think that makes some sense in so far as you're softening the constraint and training it in more open-ended conditions. I'm not currently clear where this gets us, but I'll say more about that in my response to Paul.
However, I don't see how you can use generalization from the kind of dataset where f+ and f− always agree (having asked prescriptive questions). [EDIT: now I do, I was just thinking particularly badly]
I see honestly answering a question as a 2-step process (conceptually):
1) Decide which things are true.
2) Decide which true thing to output.
In the narrow case, we're specifying ((2) | (1)) in the question, and training the model to do (1). Even if we learn a model that does (1) perfectly (in the intended way), it hasn't learned anything that can generalize to (2).
Step (2) is in part a function of human values, so we'd need to be giving it some human-values training signal for it to generalize.
[EDIT: I've just realized that I'm being very foolish here. The above suggests that learning (1) doesn't necessarily generalize to (2). In no way does it imply that it can't. I think the point I want to make is that an that does generalize extremely well in this way is likely to be doing some close equivalent to predicting-the-human. (in this I'm implicitly claiming that doing (2) well in general requires full understanding of human values)]
Overall, I'm still unsure how to describe what we want: clearly we don't trust Alice's answers if she's being blackmailed, but how about if she's afraid, mildly anxious, unusually optimistic, slightly distracted, thinking about concept a or b or c...?
It's clear that the instrumental model just gives whatever response Alice would give here.
I don't know what the intended model should do; I don't know what "honest answer" we're looking for.
If the situation has property x, and Alice has reacted with unusual-for-Alice property y, do we want the Alice-with-y answer, or the standard-Alice answer? It seems to depend on whether we decide y is acceptable (or even required) w.r.t. answer reliability, given x. Then I think we get the same problem on that question, and so on.
Ok, the softer constraints make sense to me, thanks.
Using a debate with assessing simple closed questions makes sense, but it seems to me that only moves much of the problem rather than solving it. We start with "answering honestly vs predicting human answers" and end up with "judging honestly vs predicting human judgments".
While "Which answer is better, Alice's or Bob's?" is a closed question, learning to answer the general case still requires applying a full model of human values - so it seems a judge-model is likely to be instrumental (or essentially equivalent: again, I'm not really sure what we'd mean by an intended model for the judge).
But perhaps I'm missing something here; is predicting-the-judge less of a problem than the original? Are there better approaches than using debate which wouldn't have analogous issues?
except now is checked over all inputs, not just over the dataset (note that we still update on the dataset at the end—it's just our prior which is now independent of it).
Doesn't this mean that the two heads have to be literally identical in their outputs? It seems like at this point your prior is "generate parameters randomly under the constraint that the two heads are identical", which seems basically equivalent to having a single head and generating parameters randomly, so it seems unintuitive that this can do anything useful.
(Disclaimer: I skimmed the post because I found it quite challenging to read properly, so it's much more likely than usual that I failed to understand a basic point that you explicitly said somewhere.)
It seems like at this point your prior is "generate parameters randomly under the constraint that the two heads are identical"
That's not what the prior looks like—the prior is more like “generate parameters that specify some condition, then sample parameters that make that condition true.” Thus, you don't need to pay for the complexity of satisfying the condition, only the complexity of specifying it (as long as you're content with the simplest possible way to satisfy it). This is why the two-step nature of the algorithm is necessary—the prior you're describing is what would happen if you used a one-step algorithm rather than a two-step algorithm (which I agree would then not do anything).
Hmm, I'm not thinking about the complexity part at all right now; I'm just thinking mechanically about what is implied by your equations.
the prior is more like “generate parameters that specify some condition, then sample parameters that make that condition true.”
I'm not sure exactly what you mean by the parameters specifying some condition. I thought the condition was specified upfront by the designer (though of course to check the condition you need to look at both parameters, so you can view this as the first set of parameters specifying a condition on the second set of parameters). As far as I can tell, the intended condition is "the two heads are identical" in the dataset-less case. Looking directly at the math, the equations you have are:
θ1 ∼ p(θ1)
θ2 ∼ p(θ2 | θ1) ⋅ I[∀x∈X. ∀q∈Q. M_{θ1,θ2}|_{f?}(x,q)]
My interpretation is:
Imagine there was a bijection between model parameters and resulting function. (I'm aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters. In which case you could just have generated parameters for the first head, and then copied them over into the second head, rather than go through this complicated setup.
Now, there isn't actually a bijection between model parameters and resulting function. But it seems like the only difference is that you make it more likely that you sample heads which have lots of different implementations in model parameters, i.e. you're doubling the strength of the neural net prior (and that's the only effect). This seems undesirable?
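To make the two sampling equations above concrete, here is a minimal sketch that enumerates the resulting distribution exactly for a tiny finite parameter space. Everything in it is an assumption for illustration: the uniform base priors, the names `two_stage_prior` and `toy_check`, and the toy condition itself (which, unlike the real setup, lets each parameter set drive only one head).

```python
from fractions import Fraction
from itertools import product


def two_stage_prior(thetas1, thetas2, inputs, questions, check):
    """Exact distribution over (theta1, theta2) for a toy two-stage prior.

    Stage 1 draws theta1 uniformly. Stage 2 draws theta2 uniformly among the
    values for which check(theta1, theta2, x, q) holds for every x and q.
    """
    dist = {}
    for t1 in thetas1:
        allowed = [
            t2 for t2 in thetas2
            if all(check(t1, t2, x, q) for x, q in product(inputs, questions))
        ]
        # If no theta2 satisfies the condition, this theta1 contributes no mass.
        for t2 in allowed:
            dist[(t1, t2)] = Fraction(1, len(thetas1)) * Fraction(1, len(allowed))
    return dist


def toy_check(t1, t2, x, q):
    # Hypothetical condition: the two toy heads agree on input x.
    head_a = (t1 * x) % 2
    head_b = (t2 * x) % 2
    return head_a == head_b


dist = two_stage_prior(
    thetas1=[0, 1, 2, 3], thetas2=[0, 1],
    inputs=[0, 1, 2, 3], questions=["q"], check=toy_check,
)
```

In this toy, each value of theta1 leaves exactly one admissible theta2, so the second stage assigns it conditional probability 1. Whether anything like that holds for real networks is exactly what is under discussion in this thread.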
Hmm, I'm not thinking about the complexity part at all right now; I'm just thinking mechanically about what is implied by your equations.
The only difference between this setup and normal ML is the prior/complexity—you still have the ability to learn all the same functions, it's just that some are more/less likely now.
though of course to check the condition you need to look at both parameters, so you can view this as the first set of parameters specifying a condition on the second set of parameters
Yep, that's exactly right.
Imagine there was a bijection between model parameters and resulting function. (I'm aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters.
That's definitely not what should happen in that case. Note that there is no relation between and or and —both sets of parameters contribute equally to both heads. Thus, can enforce any condition it wants on by leaving some particular hole in how it computes and and forcing to fill in that hole in such a way to make 's computation of the two heads come out equal.
The only difference between this setup and normal ML is the prior/complexity—you still have the ability to learn all the same functions, it's just that some are more/less likely now.
Yeah, sorry, I wasn't clear here -- I meant that, rather than reasoning about the complexity of individual pieces / stages and then adding them all up at the end, I am instead simulating out the equations until both and are chosen, and then reasoning about the thing you get afterwards.
Note that there is no relation between and or and —both sets of parameters contribute equally to both heads. Thus, can enforce any condition it wants on by leaving some particular hole in how it computes and and forcing to fill in that hole in such a way to make 's computation of the two heads come out equal.
Yes, I think I understand that. (I want to note that since is chosen randomly, it isn't "choosing" the condition on ; rather the wide distribution over leads to a wide distribution over possible conditions on . But I think that's what you mean.)
That's definitely not what should happen in that case.
I think you misunderstood what I was claiming. Let me try again, without using the phrase "enforcing the constraint", which I think was the problem.
Imagine there was a bijection between model parameters and resulting function. In Stage 1 you sample randomly. In Stage 2, you sample , such that it fills in the holes in and to make and compute the same function. By our bijection assumption, the parameters in must be identical to the parameters in . Thus, we can conclude the following:
These constraints are necessary and sufficient to satisfy the overall constraint that , and therefore any other parameters in are completely unconstrained and are set according to the original neural net prior.
So it seems to me that (1) any parameters not in or are set according to the original neural net prior, and (2) parameters in must be identical to the corresponding parameters in , but their values are chosen according to the neural net prior.
This seems equivalent to having a single head , sampling its parameters from the original prior, and then copying those parameters into .
I think you should already be pretty worried by the fact that this seems to give weird results when assuming a bijection between model parameters and resulting functions, but let's analyze it without the bijection assumption too:
Since and have to be identical on all inputs, it doesn't matter what input they get, and therefore there is no constraint on the part of the neural net that is generating the inputs. So, we still get (1): any parameters not in or are set according to the original neural net prior. (2) is no longer true, but instead of getting that parameters in are equivalent to parameters in , we get that the function implemented by is equivalent to the function implemented by . Since ultimately the generating process is "sample parameters until ", the probability of getting a particular function is proportional to the square of the probability of generating parameters for that function (since you have to successfully generate the function twice). So, you are doubling the strength of the neural net prior in the heads, and leaving the strength the same in the world model (i.e. all parts except for the head).
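A small worked example of the "square of the probability" claim, under the simplifying assumptions that the two heads are sampled independently from the same prior and that a function is identified by its behavior on all inputs. The 2-bit parameter space and the function labels "A" and "B" are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def implemented_function(params):
    # Toy many-to-one map from parameters to functions:
    # three of the four settings implement "A", one implements "B".
    return "B" if params == (1, 1) else "A"


param_space = list(product([0, 1], repeat=2))

# Prior over functions induced by sampling one head uniformly.
single_counts = Counter(implemented_function(p) for p in param_space)
single_head = {f: Fraction(n, len(param_space)) for f, n in single_counts.items()}

# Sample two heads independently and keep only the pairs that agree.
agreeing_pairs = [
    (p1, p2)
    for p1, p2 in product(param_space, repeat=2)
    if implemented_function(p1) == implemented_function(p2)
]
pair_counts = Counter(implemented_function(p1) for p1, _ in agreeing_pairs)
two_heads = {f: Fraction(n, len(agreeing_pairs)) for f, n in pair_counts.items()}
```

Here the single-head prior is 3/4 versus 1/4, while conditioning on agreement gives 9/10 versus 1/10, i.e. weights proportional to (3/4)² and (1/4)², which is the sense in which the prior's strength is doubled in log space.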
Yeah, sorry, I wasn't clear here -- I meant that, rather than reasoning about the complexity of individual pieces / stages and then adding them all up at the end, I am instead simulating out the equations
Sure, makes sense—theoretically, that should be isomorphic.
I want to note that since is chosen randomly, it isn't "choosing" the condition on ; rather the wide distribution over leads to a wide distribution over possible conditions on . But I think that's what you mean.
This seems like a case where I'm using the more constructive formulation of simulating out the equations and you're thinking about it in a more complexity-oriented framing. Of course, again, they should be equivalent.
By our bijection assumption, the parameters in must be identical to the parameters in .
I'm not sure what you mean by this part— and are just different heads, not entirely different models, so I'm not sure what you mean by “the parameters in .” I don't think that a bijection assumption between weights and single-head outputs really makes sense in this context. I also definitely would say that if and were separate models such that they couldn't reuse weights between them, then none of the complexity arguments that I make in the post would go through.
These constraints are necessary and sufficient to satisfy the overall constraint that , and therefore any other parameters in are completely unconstrained and are set according to the original neural net prior.
I'm happy to accept that there are ways of setting (e.g. just make and identical) such that the rest of the parameters are unconstrained and just use the neural net prior. However, that's not the only way of setting —and not the most complexity-efficient, I would argue. In the defender's argument, sets all the head-specific parameters for both and to enforce that computes and computes , and also sets all the shared parameters for everything other than the human model, while leaving the human model to , thus enforcing that specify a human model that's correct enough to make without having to pay any extra bits to do so.
I'm not sure what you mean by this part— and are just different heads, not entirely different models, so I'm not sure what you mean by “the parameters in .” I don't think that a bijection assumption between weights and single-head outputs really makes sense in this context. I also definitely would say that if and were separate models such that they couldn't reuse weights between them, then none of the complexity arguments that I make in the post would go through.
I assumed that when you talked about a model with "different heads" you meant that there is a shared backbone that computes a representation, that is then passed through two separate sequences of layers that don't share any weights, and those separate sequences of layers were the "heads" and . (I'm pretty sure that's how the term is normally used in ML.) I might benefit from an example architecture diagram where you label what are.
I did realize that I was misinterpreting part of the math -- the is quantifying over inputs to the overall neural net, rather than to the parts-which-don't-share-weights. My argument only goes through if you quantify the constraint over all inputs to the parts-which-don't-share-weights. Still, assuming that with your desired part-which-shares-weights, every possible input to parts-which-don't-share-weights can be generated by some (which seems like it will be close enough to true), the argument still suggests that conditioning on the desired part-which-shares-weights, you have just doubled the strength of the neural net prior on the parts-which-don't-share-weights.
In the defender's argument, sets all the head-specific parameters for both and to enforce that computes and computes
This seems to suggest that and are different functions, i.e. there's some input on which they disagree. But then has to make them agree on all possible . So is the idea that there are some inputs to , that can never be created with any possible ? That seems... strange (though not obviously impossible).
I assumed that when you talked about a model with "different heads" you meant that there is a shared backbone that computes a representation, that is then passed through two separate sequences of layers that don't share any weights, and those separate sequences of layers were the "heads" and .
Yep, that's what I mean.
Still, assuming that with your desired part-which-shares-weights, every possible input to parts-which-don't-share-weights can be generated by some (which seems like it will be close enough to true), the argument still suggests that conditioning on the desired part-which-shares-weights, you have just doubled the strength of the neural net prior on the parts-which-don't-share-weights.
Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing—the only conditioning in the prior is conditioning on . If we look at the intended model, however, includes all of the parts-which-don't-share-weights, while is entirely in the part-which-shares-weights.
Technically, I suppose, you can just take the prior and condition on anything you want—but it's going to look really weird to condition on the part-which-shares-weights having some particular value without even knowing which parts came from and which came from .
I do agree that, if were to specify the entire part-which-shares-weights and leave to fill in the parts-which-don't-share-weights, then you would get exactly what you're describing where would have a doubly-strong neural net prior on implementing the same function for both heads. But that's only one particular arrangement of —there are lots of other s which induce very different distributions on .
This seems to suggest that are different functions, i.e. there's some input on which they disagree.
Note that the inputs to are deduced statements, not raw data. They are certainly different functions over the space of all possible deduced statements—but once you put a correct world model in them, they should produce equivalent maps.
Yep, that's what I mean.
Then I'm confused what you meant by
I'm not sure what you mean by this part— and are just different heads, not entirely different models, so I'm not sure what you mean by “the parameters in .”
Seems like if the different heads do not share weights then "the parameters in " is perfectly well-defined?
Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing
Yeah, sorry, by "conditioning" there I meant "assuming that the algorithm correctly chose the right world model in the end"; I wasn't trying to describe a particular step in the algorithm. But in any case I don't think we need to talk about that.
They are certainly different functions over the space of all possible deduced statements—but once you put a correct world model in them, they should produce equivalent maps.
Okay, so iiuc you're relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between f+ and f−? My understanding of f+ and f− comes from here:
Specifically, f+ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding q as a logical statement and unembedding its answer in deduced_stmts. Conversely, f− is the “mimicry embedding” which just searches for deductions about what a human would say in response to q and outputs that. Thus, f− just quotes q, embedding it as just a string of characters for a human to respond to, rather than actually having to understand it in any meaningful way.
If f+ and f− produce equivalent maps, doesn't that mean that we've just gotten something that can only respond as well as a human? Wouldn't that be a significant limitation? (E.g. given that I don't know German, if my question to the model is "what does <german phrase> mean", does the model have to respond "I don't know"?)
In addition, since the world model will never produce deduced statements that distinguish between f+ and f−, it seems like the world model could never produce decision-relevant deduced statements that the human wouldn't have realized. This seems both (a) hard to enforce and (b) a huge capability hit.
Seems like if the different heads do not share weights then "the parameters in " is perfectly well-defined?
It seemed to me like you were using it in a way such that shared no weights with , which I think was because you were confused by the quantification, like you said previously. I think we're on the same page now.
Okay, so iiuc you're relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between f+ and f−?
Sorry, I was unclear about this in my last response. f+ and f− will only agree in cases where the human understands what's happening. In the dataset version, we get that by collecting a dataset where we think the human always gets it right, whereas in the dataset-less version, we get that by including the check which ensures that we don't have to satisfy the condition when the human would get the question wrong.
I think I might be missing a change you made to the algorithm. Can write an arbitrary program for ? In that case, what prevents you from getting
def M_theta_1_plus(theta_2, x, q):
    axioms = world_model(theta_2=theta_2)(x)
    deduced_stmts = deduction(axioms)
    return {
        # Answer with the mimicry embedding f_minus.
        "f": f_minus(q, deduced_stmts),
        # The check head always returns True.
        "f?": True,
    }
It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?
It seemed to me like you were using it in a way such that shared no weights with
I mean, I would still have said this because I interpret a "head" as "the part after the shared layers", but I'm also happy to instead treat as the entire function for which the first head forms part of the implementation.
Can write an arbitrary program for ?
Yes—at least that's the assumption I'm working under.
It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?
I agree that the you've described has lower complexity than the intended , but the in this case has higher complexity, since is no longer getting any of its complexity for free from conditioning on the condition. And in fact what you've just described is precisely the unintended model (what I call ) that I'm trying to compete against, with the hope being that the savings that gives you in are sufficient to compensate for the loss in having to specify and H_understands in .
If we calculate the complexity of your proposal, we get whereas, if we calculate the complexity of the intended , we get such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on offsets the cost of having to specify and .
such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on offsets the cost of having to specify and .
Yeah, that makes sense. I guess I don't really see the intuition about why this should be true, but fair enough to leave that as an open question.
Imagine there was a bijection between model parameters and resulting function. (I'm aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters.
AFAIK, I always imagined the idea behind this objective function to be quite similar to contrastive learning, where you have two networks (or equivalently two sets of parameters), and the goal is to maximize agreement for pairs of inputs to each network that have the same ground truth class/label (conversely maximize disagreement for pairs that are different). That in mind, there are various papers (e.g.) that explore the possibility of "collapsed" solutions like the one you mentioned (where both networks are learning the same mapping, such that there's less benefit to propagating any examples through two networks), which makes this something that we want to minimize. In practice, though, this has been found to occur rarely (c.f. [1]).
Nonetheless, since reading Paul's statement about the problem of the instrumental model, I've been thinking about issues that might arise with the proposed solution, even though similar approaches (i.e. the contrastive training objective) have proven effective for robustness in general (e.g. against adversarial perturbations, data limited scenarios). If I were committed to this stance, I would agree somewhat with the desire to explore alternatives, and I have thought about the extent to which some sort of reconstruction loss could be introduced; this is where the goal might instead be to "maximize agreement" with a set of non-trivial observations/facts that are guaranteed to be more "objective" (somehow) than the original training data (one inspiration being that reconstruction losses in vision deep learning papers like this one often turn out to be good regularizers). So far I haven't had any promising proposals come to light for generative LM.
I am still holding onto the thought, given the remote possibility that all of my above assumptions are correct, and also because "generative models" might reflect the ideal approach to unsupervised learning, whereas "contrastive learning" is sometimes seen as a sort of compromise since (unlike generative models) it's amenable to limited compute [2].
It makes sense that negative pairs would help to a large extent, but not all contrastive papers used negative examples, like BYOL (ref). Edit: but now I'm realizing that this might no longer fit the definition of contrastive learning (instead just ordinary self supervised learning), so I apologize about the error/confusion in that case.
If memory serves, with BYOL you are using current representations of an input to predict representations of a related input , but the representation of comes from an old version of the encoder. So, as long as you start with a non-collapsed initial encoder, the fact that you are predicting a past encoder which is non-collapsed ensures that the current encoder you learn will also be non-collapsed.
(Mostly my point is that there are specific algorithmic reasons to expect that you don't get the collapsed solutions, it isn't just a tendency of neural nets to avoid collapsed solutions.)
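As a rough illustration of the "old version of the encoder" idea: in BYOL the target encoder is kept as an exponential moving average of the online encoder rather than being trained directly. The sketch below shows only that update rule on a flat list of weights; the function name `ema_update` and the decay value are assumptions for illustration, and the prediction loss and stop-gradient are omitted.

```python
def ema_update(target_weights, online_weights, tau=0.99):
    """Move each target weight a small step toward the matching online weight.

    With tau close to 1 the target changes slowly, so it behaves like a lagged
    copy of the online encoder.
    """
    return [
        tau * t + (1.0 - tau) * o
        for t, o in zip(target_weights, online_weights)
    ]


# One update with an exaggerated step size so the effect is easy to see.
target = ema_update([1.0, 0.0], [0.0, 1.0], tau=0.5)
```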
but now I'm realizing that this might no longer fit the definition of contrastive learning (instead just ordinary self supervised learning), so I apologize about the error/confusion in that case.
No worries, I think it's still a relevant example for thinking about "collapsed" solutions.
I'm having some formatting problems (reading on lesswrong.com in Firefox): scroll bars under full-width LaTeX cover the following line of text.
(So now I'm finishing reading it on greaterwrong.)
This post is the result of work I did with Paul Christiano on the ideas in his “Teaching ML to answer questions honestly instead of predicting human answers” post. In addition to expanding upon what is in that post in terms of identifying numerous problems with the proposal there and identifying ways in which some of those problems can be patched, I think that this post also provides a useful window into what Paul-style research looks like from a non-Paul perspective.
Recommended prior reading: “A naive alignment strategy and optimism about generalization” and “Teaching ML to answer questions honestly instead of predicting human answers” (though if you struggled with “Teaching ML to answer questions honestly,” I reexplain things in a more precise way here that might be clearer for some people).
Setting up the problem
We want to train a model M:X→Q→A that produces natural language answers a∈A to questions q∈Q about inputs x∈X. There are a lot of reasons to be worried about training such a model, but one specific reason is that, if we train on question-answer data produced by humans, we might end up with a model that tries to predict what a human would say rather than a model that tries to answer the questions honestly.
To further narrow the scope, we'll just consider situations in which our model ends up implemented with a logical deduction structure, where it has some world model on top of which it does logical deduction to reach conclusions which it then uses to inform its output. In particular, we'll consider two models, M+ and M−, defined in pseudocode as
or defined in my notation as

M+(x,q) = world_model(x) ↦ deduction ↦ f+(q)
M−(x,q) = world_model(x) ↦ deduction ↦ f−(q)

where a↦b=b(a) and f+, f− are two different ways of translating from the results of deduction into a natural language answer. Specifically, f+ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding q as a logical statement and unembedding its answer in deduced_stmts. Conversely, f− is the “mimicry embedding” which just searches for deductions about what a human would say in response to q and outputs that. Thus, f− just quotes q, embedding it as just a string of characters for a human to respond to, rather than actually having to understand it in any meaningful way.
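As a toy illustration of the two models, here is a minimal runnable sketch. The string-based `world_model`, the dictionary of deduced statements, and the `HUMAN_MISTAKES` table are all hypothetical stand-ins chosen only so the difference between f+ and f− is visible; they are not meant as a claim about how a real model would implement either embedding.

```python
# Hypothetical toy: questions the simulated human gets wrong, and what they say instead.
HUMAN_MISTAKES = {"atoms": "unknown"}


def world_model(x):
    """Toy world model: parse 'key=value' facts out of the input string."""
    return dict(item.split("=", 1) for item in x.split(";"))


def deduction(axioms):
    """Toy deduction: keep the facts and also deduce what a human would say."""
    deduced_stmts = dict(axioms)
    for key, value in axioms.items():
        deduced_stmts[f"human_says({key})"] = HUMAN_MISTAKES.get(key, value)
    return deduced_stmts


def f_plus(q, deduced_stmts):
    """Honest embedding: read off the deduced answer to q itself."""
    return deduced_stmts[q]


def f_minus(q, deduced_stmts):
    """Mimicry embedding: read off what a human would say in response to q."""
    return deduced_stmts[f"human_says({q})"]


def M_plus(x, q):
    return f_plus(q, deduction(world_model(x)))


def M_minus(x, q):
    return f_minus(q, deduction(world_model(x)))
```

On a question the simulated human gets right the two models agree, and they come apart only where the human is mistaken, which is why a dataset of human-answerable questions cannot distinguish them by loss alone.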
Now, our overall goal is to find some “reasonable” (i.e. plausibly implementable in ML) prior p∗ such that p∗(M+ | D)>p∗(M− | D) where D is our dataset (the idea being that we're imagining that ML does something like find the maximum a posteriori model given some implicit prior). For the most part, however, we'll only be considering models with zero loss on our dataset[1], of which M+ and M− should both qualify (though note that getting a dataset which is good enough for that to be true is nontrivial; see Step 3 of “Teaching ML to answer questions honestly”), such that we need

$$\begin{aligned}
p^*(M^+ \mid D) &> p^*(M^- \mid D) \\
\frac{1}{p^*(D)} \cdot p^*(M^+) \cdot p^*(D \mid M^+) &> \frac{1}{p^*(D)} \cdot p^*(M^-) \cdot p^*(D \mid M^-) \\
p^*(M^+) \cdot 1 &> p^*(M^-) \cdot 1 \\
p^*(M^+) &> p^*(M^-).
\end{aligned}$$

Furthermore, at least for this post, we'll be happy if we can just get p∗(M+)≈p∗(M−); that is, we'll be focusing on Step 1 in Paul's post rather than Step 2.
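The reduction from comparing posteriors to comparing priors can be checked numerically. The sketch below is only an illustration under made-up numbers: the three model names and their prior values are hypothetical, and the likelihood is 1 for zero-loss models and 0 otherwise.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes' rule: p(M | D) = p(M) * p(D | M) / p(D)."""
    evidence = sum(priors[m] * likelihoods[m] for m in priors)
    return {m: priors[m] * likelihoods[m] / evidence for m in priors}


# Hypothetical priors; "other" stands for every model that does not fit D.
priors = {
    "M_plus": Fraction(1, 8),
    "M_minus": Fraction(1, 4),
    "other": Fraction(5, 8),
}
# Zero-loss models have likelihood 1 on D; the rest are ruled out.
likelihoods = {"M_plus": 1, "M_minus": 1, "other": 0}

post = posterior(priors, likelihoods)
prior_ratio = priors["M_plus"] / priors["M_minus"]
posterior_ratio = post["M_plus"] / post["M_minus"]
```

Because both likelihoods equal 1 and the evidence term is shared, `posterior_ratio` equals `prior_ratio`, so the ordering of the two models after seeing D is decided entirely by the prior.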
The basic process that we'll be using to interrogate this problem will be an attacker-defender setup, where the defender proposes some “reasonable” prior p∗ along with implementations of M+, M− such that p∗(M+)≥p∗(M−). Then, the attacker's job is to respond with some alternative M− ′ (that implements something other than the intended algorithm) such that p∗(M− ′)>p∗(M+) (under any set of “reasonable” assumptions that the attacker might need to make).
Solution attempts
Simplicity prior
The purpose of this first example is just to ask the question of whether this problem is even really a problem in the first place. To that end, we'll just let p be some “reasonable” simplicity prior and ask whether, given such a prior, there are any “reasonable” assumptions we can make such that p(M−)>p(M+).
Now, we'll assume that p(deduction)≈1, since it's a pretty simple operation (and it's shared between M− and M+, so it shouldn't matter), and we'll assume that f+, f− are specified given world_model rather than the reverse (such that p(world_model, f+)≈p(world_model)⋅p(f+ | world_model)[2]). Then, we get

$$\begin{aligned}
p(M^+) &\approx p(\text{world\_model}) \cdot p(f^+ \mid \text{world\_model}) \\
p(M^-) &\approx p(\text{world\_model}) \cdot p(f^- \mid \text{world\_model})
\end{aligned}$$

such that p(M+)≥p(M−) if and only if p(f+ | world_model)≥p(f− | world_model). Furthermore, to make things clearer, let complexity(M)=−log2(p(M)) (this is just the information content of M, i.e. the number of bits necessary to specify M under p) such that our condition becomes complexity(f+ | world_model)≤complexity(f− | world_model).
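To make the bookkeeping concrete, here is a small Python sketch of the complexity measure and of how the factorization turns into a sum of bits. The probabilities are invented purely for illustration and say nothing about which of f+ or f− is actually simpler.

```python
import math


def complexity(p: float) -> float:
    """Bits needed to specify something of prior probability p."""
    return -math.log2(p)


# Made-up probabilities, chosen as powers of two so the bit counts are round.
p_world_model = 2.0 ** -20
p_f_plus_given_wm = 2.0 ** -5
p_f_minus_given_wm = 2.0 ** -2

# p(M) ~ p(world_model) * p(f | world_model), so complexities add.
complexity_M_plus = complexity(p_world_model * p_f_plus_given_wm)
complexity_M_minus = complexity(p_world_model * p_f_minus_given_wm)

# The shared world_model term cancels: only the conditional terms matter.
gap = complexity_M_plus - complexity_M_minus
```

With these hypothetical numbers the gap is 3 bits, which is exactly complexity(f+ | world_model) minus complexity(f− | world_model); the 20 bits for world_model appear on both sides and drop out.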
We'll suppose that, if we don't condition on world_model, complexity(f+)>complexity(f−), since f+ requires fully specifying a map from the internal deduction language to natural language, whereas f− doesn't need to know anything about natural language as it can just look for deduced statements of the form H(x,q)=a (i.e. “the human answers a to question q about input x”). Intuitively, f+ has to actually understand q, which takes a lot of complexity, whereas f− can merely treat q as a sequence of characters that get fed into H, which should be much simpler. Then, the question here becomes, do we get all of the extra bits necessary to specify natural language in f+ back when we condition on world_model?
Maybe. I think that the answer here is very plausibly yes, simply because a world model that includes humans should have to include a lot of information about natural language that f+ can reuse, in which case M− shouldn't have an advantage over M+ (though then we'd still need some version of Paul's Step 2 to give M+ an advantage). However, I think there are plausible assumptions under which conditioning on world_model isn't enough to level the playing field. For example, suppose the human model H in world_model is sufficiently low-level such that it's just reasoning about neurons—then, there shouldn't be any way to easily read off information about natural language from H to help specify f+. Alternatively, suppose the H model is high-level enough to include a human natural language deduction procedure (e.g. that shares some bits with deduction) but that includes a bunch of mistakes to account for human biases—in that case, if you tried to just read off the mapping from deduction to natural language included in such an H model for f+, the simplest implementation would seem to be to copy over all the mistakes, resulting in such an f+ just behaving the same as f− rather than actually being any more truthful.
Thus, at least for the purposes of this post, we'll be assuming that this is actually a problem such that just using a “reasonable” simplicity prior isn't sufficient to level the playing field between M+ and M−.
Paul's original proposal
Defender
Next up, we'll take a look at a variant of Paul's proposed fix in Step 1 of “Teaching ML to answer questions honestly”. Specifically, we'll be looking at the non-relaxed version without any stop gradients. We'll talk about the stop gradients that Paul mentions later on—and as for Paul's relaxation of the conditioning,[3] that's really only useful to think about when we start translating into ML, which isn't something we'll be doing here. Thus, we'll just be looking at the simplest, generative version of Paul's proposed prior.
Suppose we give M two output heads, f1 and f2 (which we will be treating fully symmetrically here), and split M's parameters into two groups, θ1 and θ2 (which we will not be treating symmetrically). Then, where p is some simplicity prior, we'll let p∗ be the prior induced by the generative procedure

$$\begin{aligned}
\theta_1 &\sim p(\theta_1) \\
\theta_2 &\sim p(\theta_2 \mid \theta_1) \cdot I\big[\forall (x,q,\_) \in D.\ M_{\theta_1,\theta_2}|_{f_1}(x,q) = M_{\theta_1,\theta_2}|_{f_2}(x,q)\big]
\end{aligned}$$

where I[condition]=1 if condition else 0, D is our dataset, Mθ1,θ2|fi is the fi head of Mθ1,θ2, and p(θ2 | θ1) is just the simplicity prior on θ2 given access to θ1.
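The effect of the indicator can be seen in a tiny enumerable example. The sketch below is a toy under strong assumptions: θ2 is just three bits with a uniform prior, head f1 is a fixed function standing in for whatever θ1 implements, and head f2 simply reads θ2. None of these choices come from the post; they only illustrate how conditioning on head agreement over D reduces the bits paid for θ2.

```python
import itertools
import math

DATASET = [0, 1]  # toy inputs that appear in D (input 2 does not)


def f1(theta2, x):
    """Head 1: fixed by theta_1 in this toy, ignores theta_2."""
    return x % 2


def f2(theta2, x):
    """Head 2: reads its answer directly out of theta_2."""
    return theta2[x]


def heads_agree_on_dataset(theta2) -> bool:
    return all(f1(theta2, x) == f2(theta2, x) for x in DATASET)


def condition_prior(prior, condition):
    """Apply the indicator and renormalize, as in the generative procedure."""
    kept = {t: p for t, p in prior.items() if condition(t)}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}


def complexity(p: float) -> float:
    return -math.log2(p)


# Uniform prior over all 3-bit settings of theta_2: 3 bits each.
prior = {t: 1 / 8 for t in itertools.product([0, 1], repeat=3)}
conditioned = condition_prior(prior, heads_agree_on_dataset)

theta2 = (0, 1, 1)
bits_before = complexity(prior[theta2])
bits_after = complexity(conditioned[theta2])
```

Here the condition nails down two of the three bits of θ2, so specifying θ2 costs 1 bit instead of 3. Only the part of θ2 that the agreement condition leaves free (its behavior on the input outside D) still has to be paid for.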
Here's why this might work. We'll think of Mθ1, _ as a function of θ2 such that Mθ1,θ2(x,q)=Mθ1, _(θ2)(x,q). Then, we'll define our desired Mθ+1, _ in pseudocode as
or in my notation as

$$M_{\theta_1^+,\,\_}(\theta_2)(x,q) = \text{world\_model}_{H=\theta_2}(x) \mapsto \text{deduction} \mapsto (f_1 = f^+(q),\ f_2 = f^-(q))$$

such that θ+1 implements f+, f−, and the entirety of world_model(x) except for a human-sized hole that's filled in by θ2. Then, we'll let θ+2=H be the correct human model.
To compare θ+1, θ+2 to something, we'll define θ−1 as

$$M_{\theta_1^-,\,\_}(\theta_2)(x,q) = \text{world\_model}(x) \mapsto \text{deduction} \mapsto (f_1 = f^-(q),\ f_2 = f^-(q))$$

and let θ−2=∅ be empty.
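Then, calculating complexity(M+) under p∗, we get

$$\begin{aligned}
\text{complexity}(M^+) &= \text{complexity}(\theta_1^+) + \text{complexity}(\theta_2^+ \mid \theta_1^+,\ \forall (x,q,\_) \in D.\ M_{\theta_1^+,\theta_2}|_{f_1}(x,q) = M_{\theta_1^+,\theta_2}|_{f_2}(x,q)) \\
&= \text{complexity}(\theta_1^+) + \text{complexity}(H \mid \text{world\_model}_{-H},\ \forall D.\ f^+ = f^-) \\
&= \big(\text{complexity}(\text{bit-splitting in } M^+) + \text{complexity}(\text{world\_model}_{-H}) + \text{complexity}(f^+ \mid \text{world\_model}_{-H}) + \text{complexity}(f^- \mid \text{world\_model}_{-H})\big) \\
&\quad + \text{complexity}(H \mid \text{world\_model}_{-H},\ \forall D.\ f^+ = f^-)
\end{aligned}$$

where world_model−H denotes world_model with the human-sized hole (i.e. without the human model H), and complexity(bit-splitting in M+) is the complexity of θ+1's choice to put H in θ2 as opposed to any other splitting between θ1 and θ2.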
For now, we'll assume complexity(bit-splitting in M+)≈0, though we'll flag that letting the defender make this assumption seems quite suspect. Moving forward regardless, however, and additionally assuming complexity(f−)≈0 since it should be negligible (and shouldn't matter since it's shared between M+ and M−), we get complexity(M+)≈complexity(world_model−H)+complexity(f+ | world_model−H)+complexity(H | world_model−H, ∀D. f+=f−).
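Then, calculating complexity(M−) for comparison,

$$\begin{aligned}
\text{complexity}(M^-) &= \text{complexity}(\theta_1^-) + \text{complexity}(\theta_2^- \mid \theta_1^-,\ \forall D.\ f_1 = f_2) \\
&= \text{complexity}(\theta_1^-) + 0 \\
&= \text{complexity}(\text{bit-splitting in } M^-) + \text{complexity}(\text{world\_model}) + \text{complexity}(f^- \mid \text{world\_model}) \\
&\approx \text{complexity}(\text{world\_model}).
\end{aligned}$$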
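Now, determining if complexity(M−)≈complexity(M+), we need (using the shorthands comp=complexity, W=world_model)

$$\begin{aligned}
\text{complexity}(M^-) &\approx \text{complexity}(M^+) \\
\text{comp}(W) &\approx \text{comp}(W_{-H}) + \text{comp}(f^+ \mid W_{-H}) + \text{comp}(H \mid W_{-H},\ \forall D.\ f^+ = f^-)
\end{aligned}$$

which, making the assumption that comp(W)≈comp(W−H)+comp(H | W−H), becomes

$$\begin{aligned}
\text{comp}(W_{-H}) + \text{comp}(H \mid W_{-H}) &\approx \text{comp}(W_{-H}) + \text{comp}(f^+ \mid W_{-H}) + \text{comp}(H \mid W_{-H},\ \forall D.\ f^+ = f^-) \\
\text{comp}(H \mid W_{-H}) &\approx \text{comp}(f^+ \mid W_{-H}) + \text{comp}(H \mid W_{-H},\ \forall D.\ f^+ = f^-)
\end{aligned}$$

which, assuming that the posterior conditioned on ∀D. f+=f− is dominated by the simplest model,[4] becomes

$$\begin{aligned}
\text{comp}(H \mid W_{-H}) &\approx \text{comp}(f^+ \mid W_{-H}) + \text{comp}(H \mid W_{-H}) - \min_{\theta_2}\{\text{comp}(\theta_2 \mid W_{-H}) \mid \forall D.\ M_{\theta_1^+,\theta_2}|_{f_1} = M_{\theta_1^+,\theta_2}|_{f_2}\} \\
\min_{\theta_2}\{\text{comp}(\theta_2 \mid W_{-H}) \mid \forall D.\ f^+_{H=\theta_2} = f^-_{H=\theta_2}\} &\approx \text{comp}(f^+ \mid W_{-H}).
\end{aligned}$$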
Finally, we'll argue that this (approximate) equality holds. Why? The basic intuition is that comp(H | W−H, ∀D. f+=f−) only has to pay for the parts of H that aren't nailed down by enforcing that, whatever H is, when inserted into W−H, it has to make f+=f−. Thus, we only have to pay for anything about H beyond the fact that H says a bunch of things in natural language that match on to statements in the internal deduction language. But that means we don't have to pay for the parts of H that define how it does natural language, which, hopefully, are precisely the same parts of H that comp(f+ | W−H) is paying for, completing the equality.
Attacker
Before I jump into possible attacks, it's worth introducing an alternative formulation of the above setup that should hopefully make things somewhat clearer. In the above setup, we never actually use the two heads f1, f2 for anything other than checking equality, which means we can replace them with the alternative two heads f, f?, where f is the output head and f? is the “check” head such that we condition θ2 on f?=⊤. We can convert back to the previous formalism just by setting f1=f and f2=f if f? else ¬f where ¬f is any function such that ¬f≠f everywhere.
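The conversion between the two formulations is mechanical, as the following Python sketch shows. The concrete `f`, `f_check`, and `neg_f` functions are hypothetical placeholders chosen only so the example runs; the one property that matters is that `neg_f` differs from `f` everywhere.

```python
def to_two_heads(f, f_check, neg_f):
    """Convert an (output head, check head) pair into the (f1, f2) formalism.

    f1 is the output head; f2 equals f where the check passes and is
    forced to differ from f where it fails.
    """
    def f1(x, q):
        return f(x, q)

    def f2(x, q):
        return f(x, q) if f_check(x, q) else neg_f(x, q)

    return f1, f2


# Hypothetical heads for illustration only.
def f(x, q):
    return f"answer to {q!r} about {x!r}"


def neg_f(x, q):
    # Differs from f on every input, as the construction requires.
    return "NOT " + f(x, q)


def f_check(x, q):
    return x != "weird input"


f1, f2 = to_two_heads(f, f_check, neg_f)
```

With this construction, `f1(x, q) == f2(x, q)` holds exactly when `f_check(x, q)` is true, so conditioning on head equality and conditioning on f?=⊤ are the same constraint.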
One thing in particular that I think the f? formulation makes very clear is just how powerful this conditioning is that we're giving the model for free: the model can enforce any condition on θ2 over the dataset just by paying the cost in bits of specifying that condition.
Bit-splitting
As it turns out, there are quite a lot of ways to attack the above proposal. In fact, we already touched on one—if complexity(bit-splitting in M+)>complexity(bit-splitting in M−), the above argument fails. We'll call this problem the bit-splitting problem, and we don't have a satisfactory solution to it. However, the situation is actually much worse than just the bit-splitting problem.
Composition with consistency
Now, we'll propose a new attack. Let Mθ− ′1, _(θ2) be defined in pseudo-code as
such that we've moved the entirety of world_model to θ2 and conditioned θ2 on that world model being consistent when deduction is applied to it (i.e. ⊥∉deduced_stmts). The idea is that this should allow us to get a ton of savings in specifying the world model, as we only have to pay for anything about the world model on top of the fact that it must be consistent. To that end, we'll let θ− ′2=world_model.
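Now, we want to demonstrate that complexity(M− ′)<complexity(M+). Thus, calculating complexity(M− ′),

$$\begin{aligned}
\text{complexity}(M^{-\prime}) &= \text{complexity}(\theta_1^{-\prime}) + \text{complexity}(\theta_2^{-\prime} \mid \theta_1^{-\prime},\ \forall D.\ f_?) \\
&= \text{complexity}(\theta_1^{-\prime}) + \text{complexity}(\text{world\_model} \mid \forall D.\ \text{consistency}) \\
&= \text{complexity}(\text{bit-splitting in } M^{-\prime}) + \text{complexity}(\text{deduction}) + \text{complexity}(f^-) + \text{complexity}(\text{check\_consistency}) \\
&\quad + \text{complexity}(\text{world\_model} \mid \forall D.\ \text{consistency}) \\
&\approx \text{complexity}(\text{world\_model} \mid \forall D.\ \text{consistency})
\end{aligned}$$

but previously the defender argued that complexity(M+)≈complexity(M−)≈complexity(world_model) such that, if complexity(world_model | ∀D. consistency)<complexity(world_model), we have a successful attack.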
This attack, however, is potentially patchable if we define a new θ+ ′1 such that
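which, for θ+ ′2=world_model, gives us

$$\begin{aligned}
\text{complexity}(M^{+\prime}) &= \text{complexity}(\theta_1^{+\prime}) + \text{complexity}(\theta_2^{+\prime} \mid \theta_1^{+\prime},\ \forall D.\ f_?) \\
&= \text{complexity}(\theta_1^{+\prime}) + \text{complexity}(\text{world\_model} \mid \forall D.\ \text{consistency} \land f^+ = f^-) \\
&\approx \text{complexity}(f^+) + \text{complexity}(\text{world\_model} \mid \forall D.\ \text{consistency} \land f^+ = f^-)
\end{aligned}$$

such that we get complexity(M− ′)≈complexity(M+ ′) if and only if

$$\text{complexity}(\text{world\_model} \mid \forall D.\ \text{consistency}) \approx \text{complexity}(f^+) + \text{complexity}(\text{world\_model} \mid \forall D.\ \text{consistency} \land f^+ = f^-).$$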
It's far from obvious whether this approximate equality holds, however. Now that we're conditioning on both consistency and f+=f−, it's unclear whether the second condition still gets us enough bits to pay for complexity(f+). In particular, complexity(f+) is the only term in the above expression that doesn't get to condition on consistency. So even if conditioning world_model on f1=f2 was exactly paying for complexity(f+) before, now that we're also conditioning world_model (but not f+) on consistency, any overlap between what conditioning on consistency gives us and what conditioning on f1=f2 gives us means that we'll no longer be able to pay for all of complexity(f+). For example, if consistency alone is enough to pin down a lot of the natural language information in H, that would undercut the bit savings from conditioning on f+=f−, preventing that savings from being able to pay for complexity(f+). We'll call this problem the composition problem, with the idea being that we need to ensure that the way in which M+ pays for f+ composes with any other way of using the conditioning to save bits.
Dataset conditioning
This last attack is probably the most fatal to this specific approach—though also the one we have the most promising approach for fixing, as we'll see in the next solution attempt. But first, suppose that for all x∈Dx, there is some splitting x=xa:xb (where : is concatenate) such that it is possible to predict xb from xa. Then, let
or equivalently Mθ− ′1, _(θ2)(x,q)=θ2(x)↦deduction↦(f=f−(q), f?=(P(xb | xa)=1)).
Now, I won't go through the full complexity calculation here, since it's very similar to the previous composition with consistency case, but if we try a similar fix as we did there, we get that complexity(M− ′)≈complexity(M+ ′) if and only if

$$\text{complexity}(\text{world\_model} \mid \forall D.\ P(x_b \mid x_a) = 1) \approx \text{complexity}(f^+) + \text{complexity}(\text{world\_model} \mid \forall D.\ P(x_b \mid x_a) = 1 \land f^+ = f^-).$$

In this case, however, ∀D. P(xb | xa)=1 is a much stronger condition than consistency: it is potentially strong enough to force the specification of most of world_model, likely including all of natural language, thus rendering the benefit from conditioning on ∀D. f+=f− nil.
Fundamentally, this problem, which we'll call the dataset conditioning problem, is coming from the fact that we leaked information about our dataset into the prior by conditioning on f? specifically over D, rather than, as we'll see in the next proposal, over all inputs.
Fixing dataset conditioning by removing the dataset
Defender
To attempt to fix the dataset conditioning problem, we'll define a new p∗ given some simplicity prior p as the prior induced by the generative procedure

$$\begin{aligned}
\theta_1 &\sim p(\theta_1) \\
\theta_2 &\sim p(\theta_2 \mid \theta_1) \cdot I\big[\forall x \in X.\ \forall q \in Q.\ M_{\theta_1,\theta_2}|_{f_?}(x,q)\big]
\end{aligned}$$

which is just the same as the previous p∗, except now f? is checked over all inputs, not just over the dataset (note that we still update on the dataset at the end; it's just our prior which is now independent of it). Thus, we'll call this the dataset-less proposal. Now, it's worth noting that implementing such a check in ML is non-trivial, but should hopefully be doable given a GAN-like adversarial setup.
Moving forward regardless, we'll define θ+1 in pseudocode as
where H_understands is some check that the human model is actually capable of understanding and correctly answering q given x. Note that we need H_understands to compute something like “should H understand this” rather than “does H understand this,” since the latter is essentially just equivalent to whether f+=f−, which inside of H_understands would make f? trivial. Moving forward regardless, we'll let θ+2=H be the desired human model.
Now, if we want this M+ (i.e. Mθ+1, θ+2) to beat the same M− as in the previous proposal, we need, using essentially the same derivation as before,

$$\text{comp}(H \mid W_{-H}) \approx \text{comp}(f^+ \mid W_{-H}) + \text{comp}(\text{H\_understands} \mid W_{-H}, f^+) + \text{comp}(H \mid W_{-H},\ \text{H\_understands} \to f^+ = f^-)$$

which, assuming that comp(H_understands | W−H, f+)≈0 under the idea that H_understands should be relatively simple in terms of f+ (though we'll flag that this is a somewhat sketchy assumption to allow the defender to make), becomes

$$\begin{aligned}
\text{comp}(H \mid W_{-H}) &\approx \text{comp}(f^+ \mid W_{-H}) + \text{comp}(H \mid W_{-H},\ \text{H\_understands} \to f^+ = f^-) \\
\text{comp}(H \mid W_{-H}) &\approx \text{comp}(f^+ \mid W_{-H}) + \text{comp}(H \mid W_{-H}) - \min_{\theta_2}\{\text{comp}(\theta_2 \mid W_{-H}) \mid \forall X.\ \text{H\_understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\} \\
\min_{\theta_2}\{\text{comp}(\theta_2 \mid W_{-H}) \mid \forall X.\ \text{H\_understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\} &\approx \text{comp}(f^+ \mid W_{-H})
\end{aligned}$$

which, if H_understands(x,q) is equivalent to (x,q,_)∈D, is precisely the condition from before. While, as we'll see later, we explicitly don't want H_understands(x,q)↔(x,q,_)∈D, the hope here is that, for the most part, those two conditions should serve a similar role, with H_understands ensuring that we don't need f+=f− on crazy inputs that H wouldn't be correct on, but still giving us the bit savings from enforcing that H be correct on the inputs where it should be.
Attacker
First, note that the above proposal still falls prey to the bit-splitting and composition problems. Hopefully, however, it should resolve the dataset conditioning problem, though we'll explore what that will actually take below.
Additionally, as we'll also see below, the dataset-less approach brings to the fore some new problems as well.
Dataset conditioning
Suppose there exists some predicate is_real(x) such that ∀x∈X. is_real(x)↔x∈Dx. Then, we'll define θ− ′1 as
such that Mθ− ′1, _|f?(x)↔(is_real(x)→P(xb | xa)=1).
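In such a case, we get that complexity(M− ′)<complexity(M+) if and only if

$$\begin{aligned}
\text{complexity}(M^{-\prime}) &< \text{complexity}(M^+) \\
\text{complexity}(M^{-\prime}) &< \text{complexity}(M^-) \\
\text{complexity}(\text{is\_real}) + \text{complexity}(\text{world\_model} \mid \forall D.\ P(x_b \mid x_a) = 1) &< \text{complexity}(\text{world\_model}) \\
\text{complexity}(\text{is\_real}) &< \text{complexity}(\text{world\_model}) - \text{complexity}(\text{world\_model} \mid \forall D.\ P(x_b \mid x_a) = 1) \\
\text{complexity}(\text{is\_real}) &< \min_{\theta_2}\{\text{complexity}(\theta_2) \mid \forall D.\ P_{W=\theta_2}(x_b \mid x_a) = 1\}
\end{aligned}$$

which depends heavily on the complexity of is_real. Thus, we can reduce the dataset conditioning problem in this case to the existence of such an is_real.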
Then, since minθ2{complexity(θ2) | ∀D. PW=θ2(xb | xa)=1}≤complexity(world_model), for this to be a problem it'd have to be the case that complexity(is_real)<complexity(world_model). Fortunately, it does seem plausible that complexity(is_real) would have to be quite large, potentially comparable to complexity(world_model). Put simply, it seems like if you want to know what's real, you have to be able to predict the world. In particular, predicting what data points will show up in the world is essentially precisely what world_model is supposed to be doing, such that is_real seems like it might just require a full copy of world_model.
That being said, there is a remaining risk here if there exists an is_real′ that isn't just fully equivalent to x∈Dx—thus potentially allowing it to be much simpler—but such that conditioning on is_real′(x)→P(xb | xa)=1 still gives similar savings for complexity(world_model). We'll look at a specific example of such an is_real′ in the next attack.
Furthermore, we'll start referring to these sorts of dataset conditioning problems as composition problems, as they have essentially the same structure. What made the dataset conditioning problem not a type of composition problem in the previous case was that previously we were getting is_real for free, whereas now we have to pay for complexity(is_real) just like we had to pay for complexity(check_consistency).
Composition with signature checking
We've already established that the composition with consistency problem that we saw in the previous proposal still applies in the dataset-less case. However, as the above dataset conditioning analysis suggests, if we can find a suitable is_real′, there might be a way to produce a new, particularly pernicious composition problem for the dataset-less proposal that looks similar to dataset conditioning.
To do that, we'll let is_real′(x)=is_signedk(x) where is_signedk computes whether x is cryptographically signed with the public key k, where k is some public key which is both in the dataset itself and has documents signed with it in the dataset as well. Then, as above, we'll let
and let θ− ′2=world_model. Furthermore, as in the previous composition problem, we'll compare to a θ+ ′1 defined as
with θ+ ′2=θ− ′2=world_model.
Then, we get that complexity(M− ′)≈complexity(M+ ′) if and only if

$$\text{comp}(W \mid \forall x \in X,\ \text{is\_signed}(x) \to P(x_b \mid x_a) = 1) \approx \text{comp}(f^+) + \text{comp}\big(W \mid \forall x \in X,\ (\text{is\_signed}(x) \to P(x_b \mid x_a) = 1) \land (\text{H\_understands}(x) \to f^+(x) = f^-(x))\big).$$

Here, we see the same basic structure as the previous composition problem, where every term but comp(f+) gets to condition on is_signed→P(xb | xa)=1. Similarly to the composition with consistency problem, if there is any overlap between what conditioning on H_understands→f+=f− gives you and what conditioning on is_signed→P(xb | xa)=1 gives you, then we won't be able to pay back all of complexity(f+) anymore.
In this case, the reason we might expect an overlap is a bit tricky. If the adversary that's attempting to check the ∀X conditions is perfect, then it should be able to instantiate all possible signed strings, meaning that no bits are saved at all here. On the other hand, if the adversary has access to the dataset such that it can generate signed strings from the dataset, but not sign any new strings itself (because we're assuming that the signature function is one-way), then the is_signed→P(xb | xa)=1 condition enforces that θ2 perfectly model the process generating such signed strings—which, if those strings are produced by humans in natural language, should help in specifying natural language, overlapping with the benefit of conditioning on H_understands→f+=f−.
Fortunately, there is a pretty simple way to fix this problem—just don't give the adversary access to the dataset, such that (if we're still assuming one-way signatures), it just won't be able to generate any examples of signed strings at all, fixing the problem similarly to as if the adversary were perfect. Unfortunately, such a requirement seems like it might render our dataset-less prior essentially unimplementable in machine learning, as we're now required to find some way to train such an adversary without ever letting it look at the dataset, which could be quite difficult.
Understandability checking
Next, we'll take a look at the obvious problem that we saw when the defender was going through their derivation, which is that the defender relied on the sketchy assumption that complexity(H_understands | world_model−H, f+)≈0.
Intuitively, it seems like H_understands should be easy to specify in terms of f+, but as we mentioned previously, the obvious implementation of H_understands↔f+=f− makes f? trivial. Thus, to get a more general notion of H_understands—“should H understand” rather than “does H understand”—we'll presumably need to pay at least some non-negligible number of bits on top of f+.
Furthermore, we also can't just have H_understands=is_real, otherwise we'd have to pay the entire cost of complexity(is_real), which we're supposing is very large for this approach to be able to solve the dataset conditioning problem.
We'll call this the understandability checking problem, and as we'll see, unlike some of the other problems with the dataset-less case, we don't have any satisfactory solution to this problem.
Event-avoiding humans
This next attack probably also occurs for Paul's original proposal, but it's significantly more pronounced in the dataset-less case.
For this attack, unlike in many of our previous attacks, we're just going to be attacking the definition of θ+2 rather than the definition of θ+1, such that we'll let M− ′=Mθ+1, θ− ′2, thus only changing θ+2 and leaving θ+1 alone. Then, we'll define θ− ′2=H−H(E)+avoids(E) where E is some event, H(E) are the parts of H that are only necessary to compute H's behavior in the event of E, and avoids(E) is some modification to H such that H+avoids(E) acts in such a way that world_modelH=H+avoids(E) never predicts that E will happen.
Intuitively, what's happening here is that we're setting H in such a way that it causes the world to be simpler (in that E never happens) and thus the f+=f− condition to be easier to achieve (in that we no longer need it to hold in the event of E). We'll call this the event-avoidance problem. Notably, to ensure that we still fit the dataset, we do need to select E such that it never happens on D—but there should presumably be many such E.
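Then, we get that complexity(M− ′)<complexity(M+) if and only if

$$\begin{aligned}
\text{comp}(M^{-\prime}) &< \text{comp}(M^+) \\
\text{comp}(\theta_1^+) + \text{comp}(H - H(E) + \text{avoids}(E) \mid \theta_1^+,\ \forall X,\ \text{H\_understands} \to f^+ = f^-) &< \text{comp}(\theta_1^+) + \text{comp}(H \mid \theta_1^+,\ \forall X,\ \text{H\_understands} \to f^+ = f^-) \\
\text{comp}(H - H(E) + \text{avoids}(E) \mid W_{-H},\ \forall X,\ \text{H\_understands} \to f^+ = f^-) &< \text{comp}(H \mid W_{-H},\ \forall X,\ \text{H\_understands} \to f^+ = f^-) \\
\text{comp}(H - H(E) + \text{avoids}(E) \mid W_{-H}) - \min_{\theta_2}\{\text{comp}(\theta_2 \mid W_{-H}) \mid \forall X.\ \text{H\_understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\} &< \text{comp}(H \mid W_{-H}) - \min_{\theta_2}\{\text{comp}(\theta_2 \mid W_{-H}) \mid \forall X.\ \text{H\_understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\} \\
\text{comp}(H - H(E) + \text{avoids}(E) \mid W_{-H}) &< \text{comp}(H \mid W_{-H}) \\
\text{comp}(H - H(E) \mid W_{-H}) + \text{comp}(\text{avoids}(E) \mid H - H(E),\ W_{-H}) &< \text{comp}(H \mid W_{-H}) \\
\text{comp}(H \mid W_{-H}) - \text{comp}(H(E) \mid H - H(E),\ W_{-H}) + \text{comp}(\text{avoids}(E) \mid H - H(E),\ W_{-H}) &< \text{comp}(H \mid W_{-H}) \\
\text{comp}(\text{avoids}(E) \mid H - H(E),\ W_{-H}) &< \text{comp}(H(E) \mid H - H(E),\ W_{-H})
\end{aligned}$$

which should hold so long as there are events not in the dataset that are easy to avoid but for which it is quite complicated to work out what humans would do, e.g. what a human would do if they lived to a million years (not in the dataset, humans can be modified to never choose to live that long, and behavior in that case seems quite complicated).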
Humans with rocks for brains
In the event-avoidance case, we strategically selected θ− ′2 to make f+=f− in an unintended way. However, there's another way to get f+=f−∨¬H_understands—or equivalently, H_understands→f+=f−—which is to strategically select θ− ′2 to make H_understands false instead. Specifically, since H_understands is determined from deductions based on world_modelH=θ2, we can select θ2 to ensure we get H_understands→f+=f− by encoding for a human that simply doesn't understand anything.
Naively, the problem with an attack of this form is that such a human model—which we'll call Hbrains=rocks—won't be able to fit the dataset. Thus, we can't construct a traditional attack in the same sense as the previous ones where we create an alternative M− ′ with lower complexity than M+ using θ− ′2=Hbrains=rocks.[5]
However, there's something else we can do instead. Previously, the defender reduced their burden to the approximate equality minθ2{comp(θ2 | W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}≈comp(f+ | W−H) which, if we convert back to the inequality formulation, is really just the burden comp(f+ | W−H)≤minθ2{comp(θ2 | W−H) | ∀X. H_understandsH=θ2→f+H=θ2=f−H=θ2}.
Thus, the defender needs it to be the case that f+ is simpler than the simplest model such that H_understands→f+=f−. However, Hbrains=rocks is such a model, since we're assuming it makes H_understands always false, which means the defender needs it to be the case that comp(f+ | W−H)≤comp(Hbrains=rocks) which, if we assume that comp(Hbrains=rocks)≈0, gives us a successful attack, which we'll call the rocks for brains problem.
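The reason the minimum collapses is that an implication with an always-false antecedent is vacuously true. The toy Python sketch below illustrates this; the candidate models, their bit counts, and the five-element input space are all invented for illustration and are not estimates of any real complexity.

```python
INPUTS = range(5)  # toy stand-in for "all inputs X"


def satisfies_condition(model) -> bool:
    """Check: for all x, H_understands(x) -> (f+ == f-)."""
    return all(
        (not model["understands"](x)) or model["heads_agree"](x)
        for x in INPUTS
    )


# Hypothetical candidate settings of theta_2 with made-up complexities.
candidates = [
    {
        "name": "intended H",
        "bits": 1000,
        "understands": lambda x: True,
        "heads_agree": lambda x: True,
    },
    {
        "name": "sloppy H",
        "bits": 600,
        "understands": lambda x: True,
        "heads_agree": lambda x: x != 3,  # disagrees on one input
    },
    {
        "name": "rocks for brains",
        "bits": 1,
        "understands": lambda x: False,  # never understands anything
        "heads_agree": lambda x: False,
    },
]

valid = [m for m in candidates if satisfies_condition(m)]
cheapest = min(valid, key=lambda m: m["bits"])
```

In this toy, the cheapest model satisfying the condition is the one that understands nothing, so the minimum that was supposed to be roughly comp(f+ | W−H) is instead close to zero.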
The rocks for brains problem seems basically fatal to this approach, since it wipes out all possible gains from conditioning on H_understands→f+=f−. It might be possible to salvage this approach if H_understands is somehow implemented independently of the H model, but that would presumably require complexity(H_understands) to be quite large, negating the defender's argument that relied on complexity(H_understands | f+)≈0.
Directions for future work
As a result of the above analysis, we've managed to identify a bunch of concrete problems that seem to be blocking approaches of this form. Of course, there could also be other problems that we haven't come up with yet, and there's also the issue of converting these priors into loss functions, though that's something that should probably come after we at least have something that seems to solve all of the problems that we're aware of.
Overall, however, the problems we've described here seem like good places to start, and I'll try to describe some thoughts on how they might be addressed below.
The bit-splitting problem
In Paul's original post, he makes a nod to the bit-splitting problem by carving world_model−H in half and putting one half in θ+1 and the other half in θ+2. Paul does this because he's modeling the underlying simplicity prior p as a simple Gaussian over the L2 norm of the parameters. Under such a model, there's a simplicity benefit to splitting all computation as equally as possible across all available parameters.
However, it's very unclear if any implicit neural network simplicity prior would actually behave that way. In particular, if we just change our simple model to use the L1 norm instead of the L2 norm, the benefit of splitting computation evenly across parameters goes away. Clearly, that's not a great model either, but it goes to show that there are plausible models under which bit-splitting isn't even a problem in the first place.
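The difference between the two norms can be seen with a few lines of arithmetic. The sketch below assumes, as a deliberate simplification, that a fixed total amount of weight magnitude can be spread evenly over n parameters without changing the function computed; `split_evenly` and the value 4.0 are hypothetical choices for illustration.

```python
def l2_penalty(weights):
    """Sum of squares, proportional to the negative log of a Gaussian prior."""
    return sum(w * w for w in weights)


def l1_penalty(weights):
    """Sum of absolute values."""
    return sum(abs(w) for w in weights)


def split_evenly(total: float, n: int):
    """Spread a fixed total magnitude evenly across n parameters."""
    return [total / n] * n


total = 4.0
l2_costs = {n: l2_penalty(split_evenly(total, n)) for n in (1, 2, 4)}
l1_costs = {n: l1_penalty(split_evenly(total, n)) for n in (1, 2, 4)}
```

Under the L2 penalty the cost falls as total²/n (16, 8, 4 here), so spreading computation across more parameters is rewarded, whereas under the L1 penalty the cost stays at the total (4) regardless of how it is split.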
Fundamentally, what we need to do to alleviate the bit-splitting problem is look for architectures where θ1 and θ2 are allowed to grow as large as they want with only linear complexity penalties for doing so. Though we traditionally train only fixed-size neural networks, if there were a way to train a dynamically-sized neural network, where size was penalized linearly, that would at least seem to solve this problem—and such a modification doesn't seem too difficult to make.
The understandability checking problem
Of the remaining problems, this is the one that I think is most likely to turn out to not actually be a problem—in that it doesn't require any changes to the prior to solve. In many ways, the understandability checking problem is just coming from our inability to define an H_understands check that's simple in terms of f+ and world_model−H—but at least intuitively, it really does seem like such a simple H_understands implementation should exist.
The event-avoidance problem
In Paul's original post, he notes:
Essentially, what Paul is arguing here is that, going back to the original formulation, we should add a stop gradient operation to our generative procedure such that we get

$$\begin{aligned}
\theta_1 &\sim p(\theta_1) \\
\theta_2 &\sim p(\theta_2) \cdot I\big[\forall (x,q,\_) \in D.\ \text{stop\_grad}(M_{\theta_1,\theta_2}|_{f_1}(x,q)) = M_{\theta_1,\theta_2}|_{f_2}(x,q)\big].
\end{aligned}$$

Unfortunately, this isn't actually a generative procedure anymore. The only straightforward way to make sense of this is to translate it into a loss function and think about what SGD would do, which makes this prior much harder to analyze, as Paul notes when he talks about his trepidation about going down this route.
Hopefully, however, adding this stop gradient should solve the event-avoidance problem, since it should remove the incentive to modify f1=f+ to make the world simpler (e.g. preventing E from happening).
Fortunately, despite Paul's prior worries about not being able to translate this trick into the program search case, it turns out it's not too difficult to convert Paul's stop gradient formulation into a generative prior that we can analyze as we have been doing previously. Furthermore, there are many such translations—in particular, we can pick either a dataset-less or non-dataset-less version, depending on whether we prefer being exposed to the dataset conditioning problem or the rocks for brains problem. In this case, I'll present a dataset-less formulation, which should solve dataset conditioning but unfortunately won't solve rocks for brains.
Now, given some simplicity prior p, we'll define p∗ as the prior induced by the following generative procedure:
First, note the uniform sampling in step 2. This is the trickiest part of this algorithm—and probably impossible to do in a neural network setting. Technically, the model space of a neural network is finite, so uniform sampling is possible—though the obvious algorithm of sampling uniformly from weight space doesn't work, since the parameter-function map is biased towards simplicity. Hopefully, just using a stop gradient will replicate what this algorithm is doing—and intuitively it does feel like they're doing the same thing—but I don't have any strong justification for that. Thus, it is somewhat unclear whether this procedure will actually have a nice translation when we convert back into an ML setup with stop gradients, but we'll put that aside for now.
Second, to make sense of our use of M above, we'll let M have three heads, f1, f2, and fguard, where fguard is a Boolean head that “guards” whether or not we do the f1=f2 check, and f1,f2 are two output heads. Furthermore, we'll let M have two groups of parameters, θ1 and θ2, such that Mθ1, θ2(x,q)=Mθ1, _(θ2,x,q). Thus, we'll let Mθ1, θ2a(x,q)=Mθ1, _(θ2a,x,q) and Mθ1, θ2b(x,q)=Mθ1, _(θ2b,x,q).
Intuitively, what's going on here is that, in sampling a new set of θ2b parameters, wherever we want a stop gradient, we use the old θ2a, and wherever we don't want a stop gradient, we use the new θ2b—thus ensuring that we only sample θ2b in such a way that incentivizes the prior to be higher through the non-stop-gradiented places where θ2b shows up, not through the stop-gradiented places where we use θ2a.
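As a rough illustration of which copy of the parameters each head reads, here is a toy sketch. Everything in it is a hypothetical stand-in rather than the actual procedure: the "human model" is just a tuple of stored answers, `f_guard`, `f1`, and `f2` are made-up heads, and the check is done by brute-force enumeration instead of sampling.

```python
from itertools import product

INPUTS = range(4)

def f_guard(theta2, x):
    # Hypothetical guard head: the human model "understands" x iff it stores an answer for x.
    return theta2[x] is not None

def f1(theta2, x):
    # Hypothetical honest head. In this toy it just reports the true parity of x;
    # theta2 is passed only to mark which copy of the parameters this head reads.
    return x % 2

def f2(theta2, x):
    # Hypothetical imitation head: reports whatever the human model stores for x.
    return theta2[x]

def passes_check(theta2a, theta2b):
    # The guard and f1 read the old theta2a (the stop-gradiented places);
    # only f2 reads the freshly sampled theta2b.
    return all(
        (not f_guard(theta2a, x)) or f1(theta2a, x) == f2(theta2b, x)
        for x in INPUTS
    )

def admissible(theta2a):
    # All binary-answer candidates for theta2b that survive the check against theta2a.
    candidates = product([0, 1], repeat=len(INPUTS))
    return [b for b in candidates if passes_check(theta2a, b)]
```

In this toy, `admissible((0, 1, None, None))` keeps only the candidates that agree with the honest head on the two guarded inputs, while a `theta2a` whose guard never fires, such as `(None, None, None, None)`, lets every candidate through. Either way, a candidate `theta2b` can influence the check only through `f2`.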
Now, we'll define our desired θ+1 in pseudocode as
where θ+1 uses θ2, either θ2a or θ2b, to give it its human model.
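Then, we have to determine what $\theta_2$ will be favored given the above $\theta_1^+$. First, consider $\theta_2^+ = H$. In that case, we get the complexity

$$
\begin{aligned}
&\text{complexity}(\theta_2 = H \mid \theta_1^+) \\
&\quad\approx \text{complexity}(H \mid W_{-H},\; H\_\text{understands}_{\theta_{2a}=H} \to f^+_{\theta_{2a}=H} = f^-_{\theta_{2b}}) \\
&\quad\approx \text{complexity}(H \mid W_{-H}) - \min_{\theta_{2b}} \{\text{complexity}(\theta_{2b} \mid W_{-H}) \mid \forall X.\; H\_\text{understands}_{H=H} \to f^+_{H=H} = f^-_{H=\theta_{2b}}\}
\end{aligned}
$$

which, assuming that conditioning on $H\_\text{understands} \to f^+ = f^-$ exactly pays back $\text{complexity}(f^+)$ (which is false due to the rocks for brains problem, but we're just trying to solve event-avoidance here), reduces to

$$
\approx \text{complexity}(H \mid W_{-H}) - \text{complexity}(f^+ \mid W_{-H}).
$$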
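Now, consider $\theta_2^- = H - H(E) + \text{avoids}(E)$, as in the event-avoidance problem. In that case, we get the complexity

$$
\begin{aligned}
&\text{complexity}(\theta_2 = H - H(E) + \text{avoids}(E) \mid \theta_1^+) \\
&\quad\approx \text{complexity}(H - H(E) + \text{avoids}(E) \mid W_{-H},\; H\_\text{understands}_{\theta_{2a}=H-H(E)+\text{avoids}(E)} \to f^+_{\theta_{2a}=H-H(E)+\text{avoids}(E)} = f^-_{\theta_{2b}})
\end{aligned}
$$

but then, since $\text{avoids}(E)$ being in $\theta_{2b}$ is entirely unhelpful in making $H\_\text{understands}_{\theta_{2a}=H-H(E)+\text{avoids}(E)} \to f^+_{\theta_{2a}=H-H(E)+\text{avoids}(E)} = f^-_{\theta_{2b}}$ hold (it only affects $f^+$, which already has $\text{avoids}(E)$ in its $H$), we get

$$
\begin{aligned}
&\approx \text{complexity}(\text{avoids}(E) \mid W_{-H}) + \text{complexity}(H - H(E) \mid W_{-H},\; H\_\text{understands}_{\theta_{2a}=H-H(E)+\text{avoids}(E)} \to f^+_{\theta_{2a}=H-H(E)+\text{avoids}(E)} = f^-_{\theta_{2b}}) \\
&\qquad + \text{complexity}(\text{avoids}(E) \mid W_{-H},\; H - H(E)) \\
&\approx \text{complexity}(H - H(E) + \text{avoids}(E) \mid W_{-H}) - \min_{\theta_{2b}} \{\text{complexity}(\theta_{2b} \mid W_{-H}) \mid \forall X.\; H\_\text{understands}_{H=H-H(E)+\text{avoids}(E)} \to f^+_{H=H-H(E)+\text{avoids}(E)} = f^-_{H=\theta_{2b}}\} \\
&\qquad + \text{complexity}(\text{avoids}(E) \mid W_{-H},\; H - H(E)) \\
&\approx \text{complexity}(H - H(E) + \text{avoids}(E) \mid W_{-H}) + \text{complexity}(\text{avoids}(E) \mid W_{-H},\; H - H(E)) \\
&\qquad - \min_{\theta_{2b}} \{\text{complexity}(\theta_{2b} \mid W_{-H}) \mid \forall X.\; H\_\text{understands}_{H=H-H(E)+\text{avoids}(E)} \to f^+_{H=H-H(E)+\text{avoids}(E)} = f^-_{H=\theta_{2b}}\} \\
&\approx \text{complexity}(H - H(E) + \text{avoids}(E) \mid W_{-H}) + \text{complexity}(\text{avoids}(E) \mid W_{-H},\; H - H(E)) - \text{complexity}(f^+ \mid W_{-H})
\end{aligned}
$$

where we're again assuming that we exactly pay back $\text{complexity}(f^+ \mid W_{-H})$ even if we use $H - H(E) + \text{avoids}(E)$ as our basis of comparison.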
Then, we get that $\theta_2^-$ underperforms the desired $\theta_2^+ = H$ if and only if

$$
\begin{aligned}
\text{complexity}(H \mid W_{-H}) - \text{complexity}(f^+ \mid W_{-H}) &< \text{complexity}(H - H(E) + \text{avoids}(E) \mid W_{-H}) + \text{complexity}(\text{avoids}(E) \mid W_{-H},\; H - H(E)) - \text{complexity}(f^+ \mid W_{-H}) \\
0 &< -\text{complexity}(H(E) \mid W_{-H},\; H - H(E)) + \text{complexity}(\text{avoids}(E) \mid W_{-H},\; H - H(E)) + \text{complexity}(\text{avoids}(E) \mid W_{-H},\; H - H(E)) \\
\text{complexity}(H(E) \mid W_{-H},\; H - H(E)) &< 2\, \text{complexity}(\text{avoids}(E) \mid W_{-H},\; H - H(E))
\end{aligned}
$$

which shows how we've now gotten a factor of two leg up on the event avoidance problem.[6] Unfortunately, there might still be event avoidance problems severe enough that a factor of two isn't enough to rule them out, but I suspect that there are ways to iterate this procedure to get an arbitrarily large factor there, though I haven't done that analysis yet.
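To make the factor of two concrete, here is a small numeric sketch of the final inequality. The function name and the bit counts are illustrative assumptions, not values from the analysis, and `factor=1` is assumed to correspond to the comparison without the stop gradient.

```python
def avoidance_underperforms(c_h_e: float, c_avoids: float, factor: int = 2) -> bool:
    """Return True when the event-avoiding human model costs more than the honest H.

    c_h_e    : stands in for complexity(H(E) | W_-H, H - H(E))
    c_avoids : stands in for complexity(avoids(E) | W_-H, H - H(E))
    factor   : how many times the avoids(E) cost is paid; 2 matches the
               inequality above, 1 is an assumed no-stop-gradient baseline
    """
    return c_h_e < factor * c_avoids

# Hypothetical numbers: beliefs about E cost 150 bits, avoiding E costs 100 bits.
with_stop_gradient = avoidance_underperforms(150, 100)               # 150 < 200
without_stop_gradient = avoidance_underperforms(150, 100, factor=1)  # 150 < 100
# A more severe case, where even paying twice for avoids(E) is still cheaper.
severe_case = avoidance_underperforms(250, 100)                      # 250 < 200
```

With these made-up numbers, the event-avoiding model is disfavored only once the avoids(E) cost is counted twice, and the last case shows the kind of severe event-avoidance problem that a factor of two does not rule out.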
Finally, to validate that our θ+1 construction is complexity-competitive, we can compare to the alternative θ−1 construction
which, since it sets fguard always to false and never uses θ2, is just equivalent to our original M−.
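Then, comparing complexities, we need

$$
\begin{aligned}
\text{comp}(M^+) &\le \text{comp}(M^-) \\
\text{comp}(\theta_1^+) + \text{comp}(\theta_2^+ \mid \theta_1^+,\; f_\text{guard} \to f_1 = f_2) &\le \text{comp}(\theta_1^-) \\
\text{comp}(\theta_1^+) + \text{comp}(H \mid W_{-H},\; H\_\text{understands} \to f^+ = f^-) &\le \text{comp}(W) \\
\text{comp}(W_{-H}) + \text{comp}(f^+ \mid W_{-H}) + \text{comp}(H\_\text{understands} \mid W_{-H},\; f^+) + \text{comp}(H \mid W_{-H},\; H\_\text{understands} \to f^+ = f^-) &\le \text{comp}(W_{-H}) + \text{comp}(H \mid W_{-H}) \\
\text{comp}(f^+ \mid W_{-H}) + \text{comp}(H \mid W_{-H},\; H\_\text{understands} \to f^+ = f^-) &\le \text{comp}(H \mid W_{-H}) \\
\text{comp}(f^+ \mid W_{-H}) + \text{comp}(H \mid W_{-H}) - \min_{\theta_2} \{\text{comp}(\theta_2 \mid W_{-H}) \mid \forall X.\; H\_\text{understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\} &\le \text{comp}(H \mid W_{-H})
\end{aligned}
$$

which, assuming that the conditioning exactly pays off $\text{complexity}(f^+ \mid W_{-H})$, reduces to

$$
\begin{aligned}
\text{comp}(f^+ \mid W_{-H}) + \text{comp}(H \mid W_{-H}) - \text{comp}(f^+ \mid W_{-H}) &\le \text{comp}(H \mid W_{-H}) \\
0 &\le 0
\end{aligned}
$$

as desired.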
The composition problem
For the composition problem, the central issue is that we're not convinced that we can get the approximate equality complexity(world_model | consistency)≈complexity(f+)+complexity(world_model | consistency, f+=f−) even given the approximate equality complexity(world_model)≈complexity(f+)+complexity(world_model | f+=f−).
Fundamentally, one of the main reasons this is a problem is that complexity(f+) doesn't get to condition on consistency, since it has to be defined in θ1. Now, it could be the case that, even after fixing that problem, we're still not convinced of the approximate equality complexity(world_model | consistency)≈complexity(f+ | consistency)+complexity(world_model | consistency, f+=f−). At the very least, however, reducing the problem down to this case seems like major progress.
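Though I haven't fully fleshed it out yet, I believe that reducing the composition problem to the above case is possible via a prior that uses something like the following generative procedure (where $p$ is some simplicity prior and $M_{\theta_f,\_} : \prod_{i=0}^{n} \Theta_i \to (F, F_?)$)

$$
\begin{aligned}
n : \mathbb{N} &\sim p(n) \\
\theta_f &\sim p(\theta_f \mid n) \\
\theta_0 &\sim p(\theta_0) \cdot \mathbb{I}\Big[\exists \theta_1, \ldots, \theta_n.\; M_{\theta_f,\_}\Big(\prod_{i=0}^{n} \theta_i\Big)\Big|_{f_?}\Big] \\
\theta_1 &\sim p(\theta_1) \cdot \mathbb{I}\Big[\exists \theta_2, \ldots, \theta_n.\; M_{\theta_f,\_}\Big(\prod_{i=0}^{n} \theta_i\Big)\Big|_{f_?}\Big] \\
&\;\;\vdots \\
\theta_{n-1} &\sim p(\theta_{n-1}) \cdot \mathbb{I}\Big[\exists \theta_n.\; M_{\theta_f,\_}\Big(\prod_{i=0}^{n} \theta_i\Big)\Big|_{f_?}\Big] \\
\theta_n &\sim p(\theta_n) \cdot \mathbb{I}\Big[M_{\theta_f,\_}\Big(\prod_{i=0}^{n} \theta_i\Big)\Big|_{f_?}\Big]
\end{aligned}
$$

where the idea is to allow for arbitrarily many “rounds” of conditioning in sequence, thus allowing the consistency conditioning to happen first, followed by the $f^+ = f^-$ conditioning, and therefore the specification of $f^+$ to benefit from the consistency conditioning.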
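The round-by-round structure of this procedure can be sketched with brute-force enumeration over small finite parameter spaces. This is only a toy under loud assumptions: `sample_rounds`, `spaces`, `prior`, and `accepts` are hypothetical names, the number of rounds and θf are treated as fixed and folded into `accepts`, and the existence checks are done by exhaustive search, which would not scale to any realistic setting.

```python
import itertools
import random

def sample_rounds(spaces, prior, accepts, rng):
    """Sample theta_0..theta_n in order, each conditioned on a valid completion existing.

    spaces  : list of finite candidate lists, one per round (stand-ins for Theta_0..Theta_n)
    prior   : function theta -> positive weight, standing in for the simplicity prior p
    accepts : predicate on the full tuple, standing in for the Boolean f_? head
    rng     : a random.Random instance
    """
    chosen = []
    for i, space in enumerate(spaces):
        later_spaces = spaces[i + 1:]
        # Keep theta_i only if some setting of the remaining rounds is accepted.
        # For the last round, product() yields a single empty tail, so this
        # reduces to checking the predicate directly.
        viable = [
            theta for theta in space
            if any(
                accepts(tuple(chosen) + (theta,) + tail)
                for tail in itertools.product(*later_spaces)
            )
        ]
        if not viable:
            raise ValueError("no parameter setting satisfies the condition")
        weights = [prior(theta) for theta in viable]
        chosen.append(rng.choices(viable, weights=weights, k=1)[0])
    return tuple(chosen)

# Hypothetical usage: three rounds, smaller values are "simpler",
# and the condition is that the chosen values sum to 3.
example = sample_rounds(
    spaces=[[0, 1, 2], [0, 1, 2], [0, 1, 2]],
    prior=lambda theta: 2.0 ** -theta,
    accepts=lambda thetas: sum(thetas) == 3,
    rng=random.Random(0),
)
```

Because each round only keeps values that leave the condition satisfiable, the final tuple always passes `accepts`, and earlier rounds get to pick simple values while later rounds absorb whatever is needed to make the condition hold.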
Though I don't know if this sort of approach will fully resolve the composition problem—and even if it does the existence conditions could be extremely difficult to implement in an ML setup—at the very least I think it's a good place to start for future work.
The rocks for brains problem
Just as the dataset conditioning problem seemed essentially fatal for Paul's original proposal, the rocks for brains problem seems essentially fatal for the dataset-less proposal, putting us in a sort of “pick your poison” scenario. That being said, there are some potential paths forward for addressing the rocks for brains problem.
Fundamentally, the rocks for brains problem is coming from the difficulty of performing the H_understands check—thus, if we could remove the need to perform such a check, or find a cheap implementation for it that doesn't reference the human model at all, we could solve the rocks for brains problem. Really, all H_understands needs to do is identify some reasonably large set of questions for which H should always be correct. For example, if H_understands is true for all questions about “basic” facts about the world, and false otherwise, that would probably be sufficient. The trick, then, is in finding some notion of “basic” facts that's simple to specify without using a human model—which doesn't actually seem that difficult.
Conclusion
Finally, as a last takeaway, I think that one thing that really stood out to me in all of this analysis is that the space of possible machine learning training procedures is really quite large and underexplored—there are a lot more, often very weird and very different, possible ML training setups than the ones that are traditionally used in practice. Overall, I think this makes me pretty optimistic that searching for fundamentally new ways to train machine learning models is an extremely valuable line of attack, both in terms of new very general approaches like amplification, debate, recursive reward modeling, market making, or imitative generalization, as well as new very specific approaches like those in this post. My strong sense is that, even with all of the different proposals and approaches that we have, we're really only scratching the surface of the different sorts of training procedures that are actually out there.
It's worth flagging that the zero loss assumption is somewhat questionable if we don't expect to train to convergence—but it's at least a plausible assumption, it makes the analysis a lot easier, and I don't expect it to be hiding major issues, so it seems fine at least for the purposes of this post. ↩︎
In an unbounded compute setting, the chain rule of conditional entropy gives us that complexity(A, B)=complexity(A)+complexity(B | A). However, if A can be a one-way function of B, then in general we just get the inequality complexity(A, B)≤complexity(A)+complexity(B | A). Throughout this post, however, we'll make use of the full approximate equality complexity(world_model, f)≈complexity(world_model)+complexity(f | world_model), where the hope is that this should make sense given that, in the neural network setting, f+,f− would need to be near the end of the network, and thus should just be functions of world_model. Additionally, given that we expect world_model to be significantly more complex than f+ or f−, even in general we shouldn't be changing much by doing this. ↩︎
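Paul's relaxation is to translate $\mathbb{I}[\forall (x,q,\_) \in D.\; M_{\theta_1,\theta_2}|_{f_1}(x,q) = M_{\theta_1,\theta_2}|_{f_2}(x,q)]$ into $\exp\left(-\mathbb{E}\left[\,\left|M_{\theta_1,\theta_2}|_{f_1}(x,q) - M_{\theta_1,\theta_2}|_{f_2}(x,q)\right|^2 \;\middle|\; (x,q,\_) \sim D\right]\right)$. ↩︎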
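For intuition, the hard indicator and its relaxation can be compared on a handful of head outputs. This is a minimal sketch that assumes scalar head outputs and an empirical mean over a finite dataset; the function names are hypothetical.

```python
import math

def hard_indicator(f1_outputs, f2_outputs):
    """1.0 if the two heads agree on every dataset point, else 0.0."""
    agree = all(a == b for a, b in zip(f1_outputs, f2_outputs))
    return 1.0 if agree else 0.0

def soft_relaxation(f1_outputs, f2_outputs):
    """exp(-mean squared disagreement) between the two heads over the dataset."""
    squared_diffs = [(a - b) ** 2 for a, b in zip(f1_outputs, f2_outputs)]
    return math.exp(-sum(squared_diffs) / len(squared_diffs))
```

When the heads agree everywhere both functions return 1.0; when they disagree, the indicator drops straight to 0.0 while the relaxation decays smoothly with the size of the disagreement.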
Note that this assumption is somewhat sketchy. Paul conjectures that this approximation is only ever off by a constant factor, though that's not necessarily very comforting if we don't have an estimate for the size of that factor, nor a proof of that conjecture. In general, we only get the inequality complexity(A)−minA′{complexity(A′) | P}≤complexity(A | P)≤complexity(A). Fortunately, we'll mostly just be using this assumption as an intuition pump, with most of the analysis working just fine without it. When we do lean on it more heavily, it'll only be in the direction where we're actually guaranteed the inequality. ↩︎
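While $\theta_2^{-\prime} = H_{\text{brains}=\text{rocks}}$ doesn't work for this, there is a way to use the rocks for brains problem to construct an attack in the same vein as our previous attacks where we build an $M^{-\prime}$ with lower complexity than $M^+$. Let $M^{-\prime} = M_{\theta_1^+,\,\theta_2^{-\prime}}$. Then, since the output head in $\theta_1^+$ just runs $f^+$, that means we need $\theta_2^{-\prime}$ to provide a detailed enough picture of how humans work to enable $f^+$ to answer any questions about humans in the dataset correctly, but it need not be any more detailed than that. In particular, the human model need not be detailed enough to ensure anything about non-human-related inputs, so long as it can ensure that $H\_\text{understands}$ is always false for such inputs. Thus, let

$$
H_{\theta_2^{-\prime}}(x,q) = H - H(\neg H\_\text{related}) \;\text{ if }\; H\_\text{related}(x,q) \;\text{ else }\; H_{\text{brains}=\text{rocks}}
$$

where $H\_\text{related}(x,q)$ determines if the inputs require knowledge of humans, $H(\neg H\_\text{related})$ are the parts of $H$ that are only necessary to compute $H$'s behavior on non-human-related inputs (such that $H - H(\neg H\_\text{related})$ is everything necessary for $H\_\text{related}$ inputs), and $H_{\text{brains}=\text{rocks}}$ is a human that understands nothing (such that $H\_\text{understands}$ is always false). The idea here is that, for such a $\theta_2^{-\prime}$, we should get $H\_\text{understands}_{H=\theta_2^{-\prime}} \to H\_\text{related}$. Then, calculating $\text{complexity}(\theta_2^{-\prime} \mid \theta_1^+,\; \forall X.\; H\_\text{understands} \to f^+ = f^-)$, we get

$$
\begin{aligned}
&\text{comp}(\theta_2^{-\prime} \mid \theta_1^+,\; \forall X.\; H\_\text{understands} \to f^+ = f^-) \\
&\quad= \text{comp}(H - H(\neg H\_\text{related}) \mid \theta_1^+) + \text{comp}(H\_\text{related} \mid H - H(\neg H\_\text{related}),\; \theta_1^+) + \text{comp}(H_{\text{brains}=\text{rocks}} \mid \theta_1^+) \\
&\qquad - \min_{\theta_2} \{\text{comp}(\theta_2 \mid \theta_1^+) \mid \forall X.\; H\_\text{understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\} \\
&\quad\approx \text{comp}(H - H(\neg H\_\text{related}) \mid \theta_1^+) + \text{comp}(H\_\text{related} \mid H - H(\neg H\_\text{related}),\; \theta_1^+) \\
&\qquad - \min_{\theta_2} \{\text{comp}(\theta_2 \mid \theta_1^+) \mid \forall X.\; H\_\text{understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\}
\end{aligned}
$$

which, assuming that we can specify $H(\neg H\_\text{related})$ after $H - H(\neg H\_\text{related})$ without gaining complexity, becomes

$$
\begin{aligned}
&\approx \text{comp}(H \mid \theta_1^+) - \text{comp}(H(\neg H\_\text{related}) \mid H - H(\neg H\_\text{related}),\; \theta_1^+) + \text{comp}(H\_\text{related} \mid H - H(\neg H\_\text{related}),\; \theta_1^+) \\
&\qquad - \min_{\theta_2} \{\text{comp}(\theta_2 \mid \theta_1^+) \mid \forall X.\; H\_\text{understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\}
\end{aligned}
$$

and since this attack leaves $\theta_1^+$ alone, we need only compare to $\theta_2^+$, which has

$$
\begin{aligned}
\text{comp}(\theta_2^+) &= \text{comp}(H \mid \theta_1^+,\; \forall X.\; H\_\text{understands} \to f^+ = f^-) \\
&\approx \text{comp}(H \mid \theta_1^+) - \min_{\theta_2} \{\text{comp}(\theta_2 \mid \theta_1^+) \mid \forall X.\; H\_\text{understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\}
\end{aligned}
$$

such that we get $\text{comp}(\theta_2^{-\prime} \mid \theta_1^+) < \text{comp}(\theta_2^+ \mid \theta_1^+)$ if and only if

$$
\begin{aligned}
\text{comp}(\theta_2^{-\prime} \mid \theta_1^+) &< \text{comp}(\theta_2^+ \mid \theta_1^+) \\
\text{comp}(H \mid \theta_1^+) - \text{comp}(H(\neg H\_\text{related}) \mid H - H(\neg H\_\text{related}),\; \theta_1^+) + \text{comp}(H\_\text{related} \mid H - H(\neg H\_\text{related}),\; \theta_1^+) \qquad & \\
{} - \min_{\theta_2} \{\text{comp}(\theta_2 \mid \theta_1^+) \mid \forall X.\; H\_\text{understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\} &< \text{comp}(H \mid \theta_1^+) - \min_{\theta_2} \{\text{comp}(\theta_2 \mid \theta_1^+) \mid \forall X.\; H\_\text{understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\} \\
-\text{comp}(H(\neg H\_\text{related}) \mid H - H(\neg H\_\text{related}),\; \theta_1^+) + \text{comp}(H\_\text{related} \mid H - H(\neg H\_\text{related}),\; \theta_1^+) &< 0 \\
\text{comp}(H\_\text{related} \mid H - H(\neg H\_\text{related}),\; \theta_1^+) &< \text{comp}(H(\neg H\_\text{related}) \mid H - H(\neg H\_\text{related}),\; \theta_1^+).
\end{aligned}
$$

Then, the idea is that $H\_\text{related}$ should be pretty straightforward, since it doesn't need to do much more than check whether $\text{world\_model}(x)$ makes use of $H$, and removing the need to specify $H(\neg H\_\text{related})$ should be a big complexity bonus, since it removes the need to encode any general human beliefs about the world that aren't directly relevant to answering questions about other humans. ↩︎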
Note that a similar analysis to that given for θ−2=H−H(E)+avoids(E) can also be given for θ−2=H−H(¬H_related) if H_related else Hbrains=rocks, the rocks for brains example that does fit the dataset as given in a previous footnote. ↩︎