Caspar Oesterheld

Academic website: https://www.andrew.cmu.edu/user/coesterh/

Blog: https://casparoesterheld.com/ 

Comments

No, Futarchy Doesn’t Have This EDT Flaw
Caspar Oesterheld · 3mo

In the academic literature, this sort of scheme has been analyzed by Chen et al., e.g.: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/04/TEAC-final1.pdf

Using (Uninterpretable) LLMs to Generate Interpretable AI Code
Caspar Oesterheld · 1y

I also don't think that operations of the form "do X, because on average, this works well" necessarily are problematic, provided that "X" itself can be understood.

Yeah, I think I agree with this and in general with what you say in this paragraph. Along the lines of your footnote, I'm still not quite sure what exactly "X can be understood" should require. It seems to matter, for example, that a human can understand how the given rule/heuristic (or something like it) could be useful. At least if we specifically think about AI risk, all we really need is that X is interpretable enough that we can tell it's not doing anything problematic (?).

My AI Model Delta Compared To Christiano
Caspar Oesterheld · 1y

To some extent, this is all already in Jozdien's comment, but:

It seems that the closest thing to AIs debating alignment (or providing hopefully verifiable solutions) that we can observe is human debate about alignment (and perhaps also related questions about the future). Presumably John and Paul have similar views about the empirical difficulty of reaching agreement in the human debate about alignment, given that they both observe this debate a lot. (Perhaps they disagree about what people's level of (in)ability to reach agreement / verify arguments implies for the probability of getting alignment right. Let's ignore that possibility...) So I would have thought that even w.r.t. this fairly closely related debate, the disagreement is mostly about what happens as we move from human to superhuman-AI discussants. In particular, I would expect Paul to concede that the current level of disagreement in the alignment community is problematic and to argue that this will improve (enough) if we have superhuman debaters. If even this closely related form of debate/delegation/verification process isn't taken to be very informative (by at least one of Paul and John), then it's hard to imagine that much more distant delegation processes (such as those behind making computer monitors) bear much on their disagreement.

Using (Uninterpretable) LLMs to Generate Interpretable AI Code
Caspar Oesterheld · 1y

As once discussed in person, I find this proposal pretty interesting and I think it deserves further thought.

Like some other commenters, I think for many tasks it's probably not tractable to develop a fully interpretable, competitive GOFAI program. For example, I would imagine that for playing chess well, one needs to do things like positively evaluating some random-looking feature of a position just on the basis that empirically this feature is associated with higher win rate. However, the approach of the post could be weakened to allow "mixed" programs that have some not so interpretable aspects, e.g., search + a network for evaluating positions is more interpretable than just a network that chooses moves, a search + sum over feature evals is even more interpretable, and so on.
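
For concreteness, here is a minimal sketch of that kind of "mixed" program (my own illustration, assuming the python-chess library is available; the depth limit and the material-count evaluation are placeholders, with the evaluation function standing in for the part one might replace by a small learned network or by a sum of feature evaluations):

```python
# Sketch only: a fully transparent GOFAI search wrapped around a pluggable
# evaluation function. Assumes the python-chess library; the material-count
# evaluation and the depth limit are illustrative placeholders.
import chess

def evaluate(board: chess.Board) -> float:
    # The one potentially opaque component: here a trivial material count,
    # but in the proposal it could be a small learned network or an
    # interpretable sum of feature evaluations.
    values = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
              chess.ROOK: 5, chess.QUEEN: 9}
    score = sum(v * (len(board.pieces(p, chess.WHITE)) - len(board.pieces(p, chess.BLACK)))
                for p, v in values.items())
    return score if board.turn == chess.WHITE else -score

def negamax(board: chess.Board, depth: int) -> float:
    # Every step of the search itself is an explicit, inspectable rule.
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    best = -float("inf")
    for move in list(board.legal_moves):
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best

def choose_move(board: chess.Board, depth: int = 2) -> chess.Move:
    def score(move: chess.Move) -> float:
        board.push(move)
        value = -negamax(board, depth - 1)
        board.pop()
        return value
    return max(list(board.legal_moves), key=score)

print(choose_move(chess.Board()))  # some opening move from the start position
```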

As you say in the post, there seems to be some analogy between your proposal and interpreting a given network. (For interpreting a given chess-playing network, the above intractability argument also applies. I doubt that a full interpretation of 3600-Elo neural nets will ever exist. There'll always be points where you'd want to ask, "why?", and the answer is, "well, on average this works well...") I think if I wanted to make a case for the present approach, I'd mostly try to sell it as a better version of interpretation.

Here's a very abstract argument. Consider the following two problems:

  • Given a neural net (or circuit or whatever) for a task, generate an interpretation/explanation (whatever that is exactly, could be a "partial" interpretation) of that neural net.
  • Given a neural net for a task, generate a computer program that performs the task roughly as well as the given neural net and an interpretation/explanation for this new program.

Interpretability is the first problem. My variant of your suggestion is that we solve the second problem instead. Solving the second problem seems just as useful as solving the first problem. Solving the second problem is at most as hard as solving the first. (If you can solve the first problem, you automatically solve the second problem.)

So actually all we really need to argue is that getting to (use enormous amounts of LLM labor to) write a new program partly from scratch makes the problem strictly easier. And then it's easy to come up with lots of concrete ideas for cases where it might be easier. For instance, take chess. Then imposing the use of a GOFAI search algorithm together with a position-evaluation network increases interpretability relative to just training an end-to-end model. It also doesn't hurt performance. (In fact, my understanding is that the SOTA still uses some GOFAI methods, rather than an end-to-end-trained neural net.) You can think of further ways to hard-code things in a way that simplifies interpretability at small costs to performance. For instance, I'd guess that you can let the LLMs write 1000 different Python functions that detect various features in the position (whether White has the bishop pair, whether White's king has three pawns in front of it, etc.). For chess in particular you could of course also just get these functions from prior work on chess engines. Then you feed these into the neural net that you use for evaluating positions. In return, you can presumably make that network smaller (assuming your features are actually useful), while keeping performance constant. This leaves less work for neural interpretation. How much smaller is an empirical question.
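
To illustrate the kind of thing I have in mind (a hypothetical sketch, again assuming the python-chess library; the two features and the weights are made up, and the weighted sum stands in for the now much smaller learned evaluation network):

```python
# Sketch only: hand-written, individually interpretable feature detectors whose
# outputs would be fed into a (much smaller) learned evaluation network.
# Assumes the python-chess library; the features and weights are illustrative.
import chess

def white_has_bishop_pair(board: chess.Board) -> float:
    return float(len(board.pieces(chess.BISHOP, chess.WHITE)) >= 2)

def white_king_pawn_shield(board: chess.Board) -> float:
    # Fraction of the three squares directly in front of White's king that
    # are occupied by White pawns.
    king_sq = board.king(chess.WHITE)
    if king_sq is None or chess.square_rank(king_sq) == 7:
        return 0.0
    covered = 0
    for file_offset in (-1, 0, 1):
        f = chess.square_file(king_sq) + file_offset
        r = chess.square_rank(king_sq) + 1
        if 0 <= f <= 7:
            piece = board.piece_at(chess.square(f, r))
            if piece is not None and piece.piece_type == chess.PAWN and piece.color == chess.WHITE:
                covered += 1
    return covered / 3

FEATURES = [white_has_bishop_pair, white_king_pawn_shield]  # ...imagine ~1000 of these

def feature_vector(board: chess.Board) -> list[float]:
    return [feature(board) for feature in FEATURES]

# Stand-in for the small evaluation network: a weighted sum of the features.
ILLUSTRATIVE_WEIGHTS = [0.3, 0.5]

def evaluate(board: chess.Board) -> float:
    return sum(w * x for w, x in zip(ILLUSTRATIVE_WEIGHTS, feature_vector(board)))

print(feature_vector(chess.Board()))  # start position -> [1.0, 1.0]
```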

Boycott OpenAI
Caspar Oesterheld · 1y

If all you're using is ChatGPT, then now's a good time to cancel the subscription, because GPT-4o seems to be about as powerful as GPT-4, and GPT-4o is available for free.

Anthropic release Claude 3, claims >GPT-4 Performance
Caspar Oesterheld · 2y
  • As one further data point, I also heard people close to or working at Anthropic making "we won't advance the state of the art"-type statements, though I never asked about specifics.
  • My sense is also that Claude 3 Opus is only slightly better than the best published GPT-4. To add one data point: I happen to work on a benchmark right now and on that benchmark, Opus is only very slightly better than gpt-4-1106. (See my X/Twitter post for detailed results.) So, I agree with LawrenceC's comment that they're arguably not significantly advancing the state of the art.
  • I suppose even if Opus is only slightly better (or even just perceived to be better) and even if we all expect OpenAI to release a better GPT-4.5 soon, Anthropic could still take a bunch of OpenAI's GPT-4 business with this. (I'll probably switch from ChatGPT-4 to Claude, for instance.) So it's not that hard to imagine an internal OpenAI email saying, "Okay, folks, let's move a bit faster with these top-tier models from now on, lest too many people switch to Claude." I suppose that would already be quite worrying to people here. (Whereas, people would probably worry less if Anthropic took some of OpenAI's business by having models that are slightly worse but cheaper or more aligned/less likely to say things you wouldn't want models to say in production.)
AI things that are perhaps as important as human-controlled AI
Caspar Oesterheld · 2y

In short, the idea is that there might be a few broad types of “personalities” that AIs tend to fall into depending on their training. These personalities are attractors.

I'd be interested in why one might think this to be true. (I only did a very superficial ctrl+f on Lukas' post -- sorry if that post addresses this question.) I'd think that there are lots of dimensions of variation and that within these, AIs could assume a continuous range of values. (If AI training mostly works by training to imitate human data, then one might imagine that (assuming inner alignment) they'd mostly fall within the range of human variation. But I assume that's not what you mean.)

How LLMs are and are not myopic
Caspar Oesterheld · 2y

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?

I think that janus isn't claiming this and I also think it isn't true. I think it's all about capacity constraints. The claim as I understand it is that there are some intermediate computations that are optimized both for predicting the next token and for predicting the 20th token and that therefore have to prioritize between these different predictions.

How LLMs are and are not myopic
Caspar Oesterheld · 2y

Here's a simple toy model that illustrates the difference between 2 and 3 (that doesn't talk about attention layers, etc.).

Say you have a bunch of triplets (x,z1,z2). You want to train a model that predicts z1 from x and z2 from x,z1.

Your model consists of three components: f,g1,g2. It makes predictions as follows:
y=f(x)
z1=g1(y)
z2=g2(y,z1)

(Why have such a model? Why not have two completely separate models, one for predicting z1 and one for predicting z2? Because it might be more efficient to use a single f both for predicting z1 and for predicting z2, given that both predictions presumably require "interpreting" x.)

So, intuitively, it first builds an "inner representation" (embedding) of x. Then it sequentially makes predictions based on that inner representation.

Now you train f and g1 to minimize the prediction loss on the (x,z1) parts of the triplets. Simultaneously you train f and g2 to minimize prediction loss on the full (x,z1,z2) triplets. For example, writing θ0, θ1, θ2 for the parameters of f, g1, g2, respectively, you update f and g1 with the gradients
$\nabla_{\theta_0,\theta_1}\, \ell\big(z_1,\, g^1_{\theta_1}(f_{\theta_0}(x))\big)$

and you update f and g2 with the gradients

$\nabla_{\theta_0,\theta_2}\, \ell\big(z_2,\, g^2_{\theta_2}(z_1,\, f_{\theta_0}(x))\big).$
(The z1 here is the "true" z1, not one generated by the model itself.)
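
To make the update structure concrete, here is a minimal numerical sketch (my own illustration, not from the original comment): scalar parameters w0, w1 and (a, b) stand in for θ0, θ1, θ2, the loss is squared error, and the data-generating process is made up. The point is just which parameters each loss's gradient touches, and that g2 is fed the true z1.

```python
# Sketch only: a scalar/linear version of the toy model, with hand-coded gradients.
# w0 plays the role of theta_0 (parameters of f), w1 of theta_1 (g1), and (a, b)
# of theta_2 (g2). The data-generating process below is made up for illustration.
import random

w0, w1, a, b = 0.1, 0.1, 0.1, 0.1
lr = 0.01

def sample_triplet():
    x = random.uniform(-1.0, 1.0)
    z1 = 2.0 * x + random.gauss(0.0, 0.01)
    z2 = -3.0 * x + 0.5 * z1 + random.gauss(0.0, 0.01)
    return x, z1, z2

for _ in range(20_000):
    x, z1, z2 = sample_triplet()
    y = w0 * x                  # y = f(x)
    z1_hat = w1 * y             # g1(y)
    z2_hat = a * y + b * z1     # g2(y, z1): note it receives the TRUE z1

    e1 = z1_hat - z1
    e2 = z2_hat - z2

    # Gradient of l(z1, g1(f(x))) w.r.t. (w0, w1): the only gradient g1 ever sees.
    grad_w1 = 2 * e1 * y
    grad_w0_from_z1 = 2 * e1 * w1 * x

    # Gradient of l(z2, g2(z1, f(x))) w.r.t. (w0, a, b): it never touches w1.
    grad_a = 2 * e2 * y
    grad_b = 2 * e2 * z1
    grad_w0_from_z2 = 2 * e2 * a * x

    w1 -= lr * grad_w1
    a -= lr * grad_a
    b -= lr * grad_b
    # The shared parameter w0 is pulled by BOTH losses; with limited capacity it
    # would have to trade off accuracy on z1 against accuracy on z2.
    w0 -= lr * (grad_w0_from_z1 + grad_w0_from_z2)

print(f"w0={w0:.2f}, w1={w1:.2f}, a={a:.2f}, b={b:.2f}")
```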

This training pressures g1 to be myopic in the second and third sense described in the post. In fact, even if we were to train θ0 and θ2 with the z1 predicted by g1 rather than the true z1, g1 would still be pressured to be myopic.

  • Type 3 myopia: Training doesn't pressure g1 to output something that makes the z2 follow an easier-to-predict (computationally or information-theoretically) distribution. For example, imagine that on the training data z1=0 implies z2=0, while under z1=1, z2 follows some distribution that depends in complicated ways on x. Then g1 will not try to predict z1=0 more often.
  • Type 2 myopia: g1 won't try to provide useful information to g2 in its output, even if it could. For example, imagine that the z1s are strings representing real numbers. Imagine that x is always a natural number, that z1 is the x-th Fibonacci number and z2 is the x+1-th Fibonacci number. Imagine further that the model representing g1 is large enough to compute the x-th Fibonacci number, while the model representing g2 is not. Then one way in which one might think one could achieve low predictive loss would be for g1 to output the x-th Fibonacci number and then encode, for example, the x−1-th Fibonacci number in the decimal digits. (E.g., g1(10)=55.0000000000034.) And then g2 could compute the x+1-th Fibonacci number by adding the x-th Fibonacci number to the x−1-th encoded in its decimal digits. But the above training will not give rise to this strategy, because g2 gets the true z1 as input, not the one produced by g1. Further, even if we were to change this, there would still be pressure against this strategy because g1 (θ1) is not optimized to give useful information to g2. (The gradient used to update θ1 doesn't consider the loss on predicting z2.) If g1 ever follows the policy of encoding information in the decimal digits, it will quickly learn to remove that information to get higher prediction accuracy on z1.

Of course, g1 still won't be pressured to be type-1-myopic. If predicting z1 requires predicting z2, then g1 will be trained to predict ("plan") z2.

(Obviously, g2 is pressured to be myopic in this simple model.)

Now what about f? Well, f is optimized both to enable predicting z1 from f(x) and predicting z2 from f(x),z1. Therefore, if resources are relevantly constrained in some way (e.g., the model computing f is small, or the output of f is forced to be small), f will sometimes sacrifice performance on one to improve performance on the other. So, adapting a paragraph from the post: The trained model for f (and thus in some sense the overall model) can and will sacrifice accuracy on z1 to achieve better accuracy on z2. In particular, we should expect trained models to find an efficient tradeoff between accuracy on z1 and accuracy on z2. When z1 is relatively easy to predict, f will spend most of its computation budget on predicting z2.

So, f is not "Type 2" myopic. Or perhaps put differently: the calculations going into predicting z1 aren't optimized purely for predicting z1.

However, f is still "Type 3" myopic. Because the prediction made by g1 isn't fed (in training) as an input to g2 or the loss, there's no pressure towards making f influence the output of g1 in a way that has anything to do with z2. (In contrast to the myopia of g1, this really does hinge on not using g2(f(x),g1(f(x))) in training. If g2(f(x),g1(f(x))) mattered in training, then there would be pressure for f to trick g1 into performing calculations that are useful for predicting z2. Unless you use stop-gradients...)

* This comes with all the usual caveats of course. In principle, the inductive bias may favor a situationally aware model that is extremely non-myopic in some sense.

Paper: LLMs trained on “A is B” fail to learn “B is A”
Caspar Oesterheld · 2y

At least in this case (celebrities and their largely unknown parents), I would predict the opposite. That is, people are more likely to be able to correctly answer "Who is Mary Lee Pfeiffer's son?" than "Who is Tom Cruise's mother?" Why? Because there are lots of terms / words / names that people can recognize passively but not produce. Since Mary Lee Pfeiffer is not very well known, I think Mary Lee Pfeiffer will be recognizable but not producible to lots of people. (Of people who know Mary Lee Pfeiffer in any sense, I think the fraction of people who can only recognize her name is high.) As another example, I think "Who was born in Ulm?" might be answered correctly by more people than "Where was Einstein born?", even though "Einstein was born in Ulm" is a more common sentence for people to read than "Ulm is the city that Einstein was born in".

If I had to run an experiment to test whether similar effects apply in humans, I'd probably try to find cases where A and B in and of themselves are equally salient but the association A -> B is nonetheless more salient than the association B -> A. The alphabet is an example of this (where the effect is already confirmed).

Posts

• A dataset of questions on decision-theoretic reasoning in Newcomb-like problems (Ω) · 50 karma · 9mo · 1 comment
• Stop-gradients lead to fixed point predictions (Ω) · 37 karma · 3y · 2 comments
• Proper scoring rules don’t guarantee predicting fixed points (Ω) · 80 karma · 3y · 8 comments
• Extracting Money from Causal Decision Theorists (Ω) · 27 karma · 5y · 34 comments
• Moral realism and AI alignment · 13 karma · 7y · 10 comments
• The law of effect, randomization and Newcomb’s problem · 7 karma · 8y · 1 comment
• Naturalized induction – a challenge for evidential and causal decision theory (Ω) · 15 karma · 8y · 15 comments
• A survey of polls on Newcomb’s problem · 3 karma · 8y · 8 comments
• Invitation to comment on a draft on multiverse-wide cooperation via alternatives to causal decision theory (FDT/UDT/EDT/...) · 6 karma · 8y · 7 comments
• Are causal decision theorists trying to outsmart conditional probabilities? · 7 karma · 8y · 10 comments
Wikitag Contributions

• Updateless Decision Theory · 3 years ago · (+47)
• Acausal Trade · 7 years ago · (+75)
• Acausal Trade · 7 years ago · (+74)
• Acausal Trade · 8 years ago · (+34/-36)
• Decision theory · 8 years ago · (+80)
• Counterfactual Mugging · 8 years ago · (+26)
• Counterfactual Mugging · 8 years ago · (+125)