Academic website: https://www.andrew.cmu.edu/user/coesterh/
To some extent, this is all already in Jozdien's comment, but:
It seems that the closest thing to AIs debating alignment (or providing hopefully verifiable solutions) that we can observe is human debate about alignment (and perhaps also related questions about the future). Presumably John and Paul have similar views about the empirical difficulty of reaching agreement in the human debate about alignment, given that they both observe this debate a lot. (Perhaps they disagree about what people's level of (in)ability to reach agreement / verify arguments implies for the probability of getting alignment right. Let's ignore that possibility...) So I would have thought that even w.r.t. this fairly closely related debate, the disagreement is mostly about what happens as we move from human to superhuman-AI discussants. In particular, I would expect Paul to concede that the current level of disagreement in the alignment community is problematic and to argue that this will improve (enough) if we have superhuman debaters. If even this closely related form of debate/delegation/verification process isn't taken to be very informative (by at least one of Paul and John), then it's hard to imagine that much more distant delegation processes (such as those behind making computer monitors) bear much on their disagreement.
As once discussed in person, I find this proposal pretty interesting and I think it deserves further thought.
Like some other commenters, I think for many tasks it's probably not tractable to develop a fully interpretable, competitive GOFAI program. For example, I would imagine that for playing chess well, one needs to do things like positively evaluating some random-looking feature of a position just on the basis that empirically this feature is associated with higher win rate. However, the approach of the post could be weakened to allow "mixed" programs that have some not so interpretable aspects, e.g., search + a network for evaluating positions is more interpretable than just a network that chooses moves, a search + sum over feature evals is even more interpretable, and so on.
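To make the "mixed program" idea concrete, here's a minimal sketch (my own illustration, not from the post, and not a competitive engine) of the more interpretable end of that spectrum, assuming the python-chess package. The search skeleton is ordinary, fully inspectable code; only `evaluate` is a candidate for being a learned black box, and it can range from a hand-coded feature sum (as below) to a neural net.

```python
# Minimal sketch of "GOFAI search + swappable evaluation" (illustrative only; assumes python-chess).
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material_balance(board: chess.Board) -> float:
    """Hand-coded, fully interpretable feature: material from White's perspective."""
    score = 0.0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == chess.WHITE else -value
    return score

def evaluate(board: chess.Board) -> float:
    # Most interpretable: a (weighted) sum of hand-coded features.
    # Less interpretable but stronger: replace this body with a neural-net call.
    return material_balance(board)

def negamax(board: chess.Board, depth: int) -> float:
    """Plain negamax search: transparent GOFAI control flow around the evaluation."""
    if depth == 0 or board.is_game_over():
        sign = 1.0 if board.turn == chess.WHITE else -1.0
        return sign * evaluate(board)  # score from the side-to-move's perspective
    return max(-negamax_after(board, move, depth) for move in list(board.legal_moves))

def negamax_after(board: chess.Board, move: chess.Move, depth: int) -> float:
    board.push(move)
    value = negamax(board, depth - 1)
    board.pop()
    return value

def best_move(board: chess.Board, depth: int = 2) -> chess.Move:
    return max(list(board.legal_moves), key=lambda m: -negamax_after(board, m, depth))

print(best_move(chess.Board()))  # picks some legal opening move (material eval is flat here, so it's arbitrary)
```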
As you say in the post, there seems to be some analogy between your proposal and interpreting a given network. (For interpreting a given chess-playing network, the above impossibility argument also applies. I doubt that a full interpretation of 3600 elo neural nets will ever exist. There'll always be points where you'd want to ask, "why?", and the answer is, "well, on average this works well...") I think if I wanted to make a case for the present approach, I'd mostly try to sell it as a better version of interpretation.
Here's a very abstract argument. Consider the following two problems:
1. We are given a trained neural net that performs some task competitively, and we have to come to understand it (well enough, say, to verify that it isn't doing anything problematic).
2. We have to end up with some program that performs the task competitively and that we understand in that sense; we're free to (use enormous amounts of LLM labor to) write that program partly or wholly from scratch.
Interpretability is the first problem. My variant of your suggestion is that we solve the second problem instead. Solving the second problem seems just as useful as solving the first problem. Solving the second problem is at most as hard as solving the first. (If you can solve the first problem, you automatically solve the second problem.)
So actually all we really need to argue is that getting to (use enormous amounts of LLM labor to) write a new program partly from scratch makes the problem strictly easier. And then it's easy to come up with lots of concrete ideas for cases where it might be easier. For instance, take chess. Then imposing the use of a GOFAI search algorithm on top of a position-evaluation network increases interpretability relative to just training an end-to-end model. It also doesn't hurt performance. (In fact, my understanding is that the SOTA still uses some GOFAI methods, rather than an end-to-end-trained neural net.)

You can think of further ways to hard-code things in a way that simplifies interpretability at small costs to performance. For instance, I'd guess that you can let the LLMs write 1000 different Python functions that detect various features in the position (whether White has the bishop pair, whether White's king has three pawns in front of it, etc.). For chess in particular you could of course also just get these functions from prior work on chess engines. Then you feed these features into the neural net that you use for evaluating positions. In return, you can presumably make that network smaller (assuming your features are actually useful), while keeping performance constant. This leaves less work for neural interpretation. How much smaller is an empirical question.
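To illustrate that last idea, here's a hedged sketch of the "hand-coded feature detectors feeding a small evaluation net" setup. The two detectors are stand-ins for the ~1000 LLM-written functions imagined above; python-chess and PyTorch are my assumptions here, not part of the original proposal. The smaller the net on top of the features, the less work is left for neural interpretation.

```python
# Illustrative sketch: interpretable feature functions -> small evaluation network.
import chess
import torch
import torch.nn as nn

def white_has_bishop_pair(board: chess.Board) -> float:
    return float(len(board.pieces(chess.BISHOP, chess.WHITE)) >= 2)

def white_queen_still_on_board(board: chess.Board) -> float:
    return float(len(board.pieces(chess.QUEEN, chess.WHITE)) >= 1)

# Imagine ~1000 of these, written by LLMs or taken from prior work on chess engines.
FEATURES = [white_has_bishop_pair, white_queen_still_on_board]

class FeatureEval(nn.Module):
    """Small network over interpretable features; only this part still needs neural interpretation."""
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, feature_vec: torch.Tensor) -> torch.Tensor:
        return self.net(feature_vec)

def evaluate(board: chess.Board, model: FeatureEval) -> float:
    features = torch.tensor([[f(board) for f in FEATURES]], dtype=torch.float32)
    return model(features).item()

model = FeatureEval(len(FEATURES))  # would be trained on positions labeled with game outcomes or engine scores
print(evaluate(chess.Board(), model))
```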
If all you're using is ChatGPT, then now's a good time to cancel the subscription, because GPT-4o seems to be about as powerful as GPT-4, and GPT-4o is available for free.
In short, the idea is that there might be a few broad types of “personalities” that AIs tend to fall into depending on their training. These personalities are attractors.
I'd be interested in why one might think this to be true. (I only did a very superficial ctrl+f on Lukas' post -- sorry if that post addresses this question.) I'd think that there are lots of dimensions of variation and that within these, AIs could assume a continuous range of values. (If AI training mostly works by training to imitate human data, then one might imagine that (assuming inner alignment) they'd mostly fall within the range of human variation. But I assume that's not what you mean.)
This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.
Are you claiming this would happen even given infinite capacity?
I think that janus isn't claiming this and I also think it isn't true. I think it's all about capacity constraints. The claim as I understand it is that there are some intermediate computations that are optimized both for predicting the next token and for predicting the 20th token and that therefore have to prioritize between these different predictions.
Here's a simple toy model that illustrates the difference between type-2 and type-3 myopia (one that doesn't talk about attention layers, etc.).
Say you have a bunch of triplets $(x, y, z)$. You want to train a model that predicts $y$ from $x$ and $z$ from $x$ and $y$.
Your model consists of three components: $f$, $g_1$, $g_2$. It makes predictions as follows: $\hat{y} = g_1(f(x))$ and $\hat{z} = g_2(f(x), y)$.
(Why have such a model? Why not have two completely separate models, one for predicting $y$ and one for predicting $z$? Because it might be more efficient to use a single $f$ both for predicting $y$ and for predicting $z$, given that both predictions presumably require "interpreting" $x$.)
So, intuitively, the model first builds an "inner representation" (embedding) $f(x)$ of $x$. Then it sequentially makes predictions based on that inner representation.
Now you train $f$ and $g_1$ to minimize the prediction loss on the $(x, y)$ parts of the triplets. Simultaneously you train $f$ and $g_2$ to minimize prediction loss on the full triplets. For example, you update $f$ and $g_1$ with the gradients
$$\nabla_{f, g_1}\, L\big(g_1(f(x)),\, y\big)$$
and you update $f$ and $g_2$ with the gradients
$$\nabla_{f, g_2}\, L\big(g_2(f(x), y),\, z\big).$$
(The $y$ here is the "true" $y$, not one generated by the model itself.)
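For concreteness, here's what this training setup might look like in code. This is my own minimal sketch (assuming PyTorch; the linear components, dimensions, and random placeholder data are arbitrary illustrations), with a small embedding size playing the role of the capacity constraint on $f$.

```python
# Minimal sketch of the f / g1 / g2 toy model (assumes PyTorch; all sizes and data are placeholders).
import torch
import torch.nn as nn

d_x, d_y, d_z, d_emb = 8, 4, 4, 2   # small d_emb plays the role of the capacity constraint on f

f = nn.Linear(d_x, d_emb)            # shared "interpretation" of x
g1 = nn.Linear(d_emb, d_y)           # predicts y from f(x)
g2 = nn.Linear(d_emb + d_y, d_z)     # predicts z from f(x) and the *true* y

loss_fn = nn.MSELoss()
params = list(f.parameters()) + list(g1.parameters()) + list(g2.parameters())
opt = torch.optim.SGD(params, lr=1e-2)

for _ in range(100):
    # Placeholder data; in the intended setting the triplets come from some joint distribution.
    x, y, z = torch.randn(32, d_x), torch.randn(32, d_y), torch.randn(32, d_z)

    emb = f(x)
    y_hat = g1(emb)                              # g1 only ever receives gradients from the y-loss
    z_hat = g2(torch.cat([emb, y], dim=-1))      # the true y, not y_hat, is fed to g2

    # f receives gradients from both losses, which is what creates the tradeoff discussed below.
    loss = loss_fn(y_hat, y) + loss_fn(z_hat, z)
    opt.zero_grad()
    loss.backward()
    opt.step()
```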
This training pressures $g_1$ to be myopic in the second and third sense described in the post. In fact, even if we were to train with the $\hat{y}$ predicted by $g_1$ rather than the true $y$, $g_1$ is pressured to be myopic.
Of course, $g_1$ still won't be pressured to be type-1-myopic. If predicting $y$ requires predicting $z$, then $g_1$ will be trained to predict ("plan") $z$.
(Obviously, $g_2$ is pressured to be myopic in this simple model.)
Now what about $f$? Well, $f$ is optimized both to enable predicting $y$ from $x$ and predicting $z$ from $x$ and $y$. Therefore, if resources are relevantly constrained in some way (e.g., the model computing $f$ is small, or the output of $f$ is forced to be small), $f$ will sometimes sacrifice performance on one to improve performance on the other. So, adapting a paragraph from the post: The trained model for $f$ (and thus in some sense the overall model) can and will sacrifice accuracy on $y$ to achieve better accuracy on $z$. In particular, we should expect trained models to find an efficient tradeoff between accuracy on $y$ and accuracy on $z$. When $y$ is relatively easy to predict, $f$ will spend most of its computation budget on predicting $z$.
So, $f$ is not "Type 2" myopic. Or perhaps put differently: The calculations going into predicting $y$ aren't optimized purely for predicting $y$.
However, the model is still "Type 3" myopic. Because the prediction $\hat{y}$ made by $g_1$ isn't fed (in training) as an input to $g_2$ or the loss, there's no pressure towards making $\hat{y}$ influence the output of $g_2$ in a way that has anything to do with $z$. (In contrast to the myopia of $g_1$, this really does hinge on not using $\hat{y}$ in training. If $\hat{y}$ mattered in training, then there would be pressure for $g_1$ to trick $g_2$ into performing calculations that are useful for predicting $z$. Unless you use stop-gradients...)
* This comes with all the usual caveats of course. In principle, the inductive bias may favor a situationally aware model that is extremely non-myopic in some sense.
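To illustrate the stop-gradient remark in the parenthetical above: if $\hat{y}$ were fed into $g_2$ during training, the inner part of the training loop in the earlier sketch might change as follows (again just my illustrative PyTorch rendering, reusing the definitions from that sketch).

```python
# Variant of the training step from the sketch above: feed g1's prediction into g2 instead of the true y.
emb = f(x)
y_hat = g1(emb)
# With .detach() (a stop-gradient), the z-loss cannot push g1 toward outputs that help predict z,
# so the pressure that would break type-3 myopia is removed; drop .detach() and that pressure reappears.
z_hat = g2(torch.cat([emb, y_hat.detach()], dim=-1))
loss = loss_fn(y_hat, y) + loss_fn(z_hat, z)
```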
At least in this case (celebrities and their largely unknown parents), I would predict the opposite. That is, people are more likely to be able to correctly answer "Who is Mary Lee Pfeiffer's son?" than "Who is Tom Cruise's mother?" Why? Because there are lots of terms / words / names that people can recognize passively but not produce. Since Mary Lee Pfeiffer is not very well known, I think her name will be recognizable but not producible for lots of people. (Of the people who know of Mary Lee Pfeiffer in any sense, I think the fraction who can only recognize her name is high.) As another example, I think "Who was born in Ulm?" might be answered correctly by more people than "Where was Einstein born?", even though "Einstein was born in Ulm" is a more common sentence for people to read than "Ulm is the city that Einstein was born in".
If I had to run an experiment to test whether similar effects apply in humans, I'd probably try to find cases where A and B in and of themselves are equally salient but the association A -> B is nonetheless more salient than the association B -> A. The alphabet is an example of this (where the effect is already confirmed).
I mean, translated to algorithmic description land, my claim was: It's often difficult to prove a negative, and I think the non-existence of a short algorithm to compute a given object is no exception to this rule. Sometimes someone wants to come up with a simple algorithm for a concept for which I suspect no such algorithm exists. I usually find that I have little to say and can only wait for them to try to actually provide such an algorithm.
So, I think my comment already contained your proposed caveat. ("The concept has K complexity at least X" is equivalent to "There's no algorithm of length <X that computes the concept.")
Of course, I do not doubt that it's in principle possible to know (with high confidence) that something has high description length. If I flip a coin n times and record the results, then I can be pretty sure that the resulting binary string will take at least ~n bits to describe. If I see the graph of a function and it has 10 local minima/maxima, then I can conclude that I can't express it as a polynomial of degree <10. And so on.
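(For what it's worth, the standard counting argument behind the coin-flip example: there are fewer than $2^{n-c}$ descriptions of length less than $n - c$, so at most a $2^{-c}$ fraction of the $2^n$ possible outcome strings can be described in fewer than $n - c$ bits; hence a uniformly random $n$-bit string needs at least $n - c$ bits to describe, except with probability $2^{-c}$.)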
Yeah, I think I agree with this and in general with what you say in this paragraph. Along the lines of your footnote, I'm still not quite sure what exactly "X can be understood" should require. It seems to matter, for example, that a human can understand how the given rule/heuristic (or something like it) could be useful. At least if we specifically think about AI risk, all we really need is that X is interpretable enough that we can tell it's not doing anything problematic (?).