Caspar Oesterheld

Academic website: https://www.andrew.cmu.edu/user/coesterh/

Blog: https://casparoesterheld.com/ 

Comments

No, Futarchy Doesn’t Have This EDT Flaw
Caspar Oesterheld · 3mo

In the academic literature, this sort of scheme has been analyzed by Chen et al., e.g.: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/04/TEAC-final1.pdf

Using (Uninterpretable) LLMs to Generate Interpretable AI Code
Caspar Oesterheld · 1y

I also don't think that operations of the form "do X, because on average, this works well" necessarily are problematic, provided that "X" itself can be understood.

Yeah, I think I agree with this and in general with what you say in this paragraph. Along the lines of your footnote, I'm still not quite sure what exactly "X can be understood" should require. It seems to matter, for example, that a human can understand how the given rule/heuristic (or something like it) could be useful. At least if we specifically think about AI risk, all we really need is that X is interpretable enough that we can tell it's not doing anything problematic (?).

My AI Model Delta Compared To Christiano
Caspar Oesterheld · 1y

To some extent, this is all already in Jozdien's comment, but:

It seems that the closest thing to AIs debating alignment (or providing hopefully verifiable solutions) that we can observe is human debate about alignment (and perhaps also related questions about the future). Presumably John and Paul have similar views about the empirical difficulty of reaching agreement in the human debate about alignment, given that they both observe this debate a lot. (Perhaps they disagree about what people's level of (in)ability to reach agreement / verify arguments implies for the probability of getting alignment right. Let's ignore that possibility...) So I would have thought that even w.r.t. this fairly closely related debate, the disagreement is mostly about what happens as we move from human to superhuman-AI discussants. In particular, I would expect Paul to concede that the current level of disagreement in the alignment community is problematic and to argue that this will improve (enough) if we have superhuman debaters. If even this closely related form of debate/delegation/verification process isn't taken to be very informative (by at least one of Paul and John), then it's hard to imagine that much more distant delegation processes (such as those behind making computer monitors) bear much on their disagreement.

Using (Uninterpretable) LLMs to Generate Interpretable AI Code
Caspar Oesterheld · 1y

As once discussed in person, I find this proposal pretty interesting and I think it deserves further thought.

Like some other commenters, I think for many tasks it's probably not tractable to develop a fully interpretable, competitive GOFAI program. For example, I would imagine that for playing chess well, one needs to do things like positively evaluating some random-looking feature of a position just on the basis that empirically this feature is associated with higher win rate. However, the approach of the post could be weakened to allow "mixed" programs that have some not so interpretable aspects, e.g., search + a network for evaluating positions is more interpretable than just a network that chooses moves, a search + sum over feature evals is even more interpretable, and so on.
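
For concreteness, here is a minimal sketch of that kind of "mixed" program (my own illustration, assuming the python-chess library is available; the depth limit and the material-count evaluation are placeholders, with the evaluation function standing in for the part one might replace by a small learned network or by a sum of feature evaluations):

```python
# Sketch only: a fully transparent GOFAI search wrapped around a pluggable
# evaluation function. Assumes the python-chess library; the material-count
# evaluation and the depth limit are illustrative placeholders.
import chess

def evaluate(board: chess.Board) -> float:
    # The one potentially opaque component: here a trivial material count,
    # but in the proposal it could be a small learned network or an
    # interpretable sum of feature evaluations.
    values = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
              chess.ROOK: 5, chess.QUEEN: 9}
    score = sum(v * (len(board.pieces(p, chess.WHITE)) - len(board.pieces(p, chess.BLACK)))
                for p, v in values.items())
    return score if board.turn == chess.WHITE else -score

def negamax(board: chess.Board, depth: int) -> float:
    # Every step of the search itself is an explicit, inspectable rule.
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    best = -float("inf")
    for move in list(board.legal_moves):
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best

def choose_move(board: chess.Board, depth: int = 2) -> chess.Move:
    def score(move: chess.Move) -> float:
        board.push(move)
        value = -negamax(board, depth - 1)
        board.pop()
        return value
    return max(list(board.legal_moves), key=score)

print(choose_move(chess.Board()))  # some opening move from the start position
```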

As you say in the post, there seems to be some analogy between your proposal and interpreting a given network. (For interpreting a given chess-playing network, the above intractability argument also applies. I doubt that a full interpretation of 3600-Elo neural nets will ever exist. There'll always be points where you'd want to ask, "why?", and the answer is, "well, on average this works well...") I think if I wanted to make a case for the present approach, I'd mostly try to sell it as a better version of interpretation.

Here's a very abstract argument. Consider the following two problems:

  • Given a neural net (or circuit or whatever) for a task, generate an interpretation/explanation (whatever that is exactly, could be a "partial" interpretation) of that neural net.
  • Given a neural net for a task, generate a computer program that performs the task roughly as well as the given neural net and an interpretation/explanation for this new program.

Interpretability is the first problem. My variant of your suggestion is that we solve the second problem instead. Solving the second problem seems just as useful as solving the first problem. Solving the second problem is at most as hard as solving the first. (If you can solve the first problem, you automatically solve the second problem.)

So actually all we really need to argue is that getting to (use enormous amounts of LLM labor to) write a new program partly from scratch makes the problem strictly easier. And then it's easy to come up with lots of concrete ideas for cases where it might be easier. For instance, take chess. Then imposing the use of a GOFAI search algorithm together with a position-evaluation network increases interpretability relative to just training an end-to-end model. It also doesn't hurt performance. (In fact, my understanding is that the SOTA still uses some GOFAI methods, rather than an end-to-end-trained neural net.) You can think of further ways to hard-code things in a way that simplifies interpretability at small costs to performance. For instance, I'd guess that you can let the LLMs write 1000 different Python functions that detect various features in the position (whether White has the bishop pair, whether White's king has three pawns in front of it, etc.). For chess in particular you could of course also just get these functions from prior work on chess engines. Then you feed these into the neural net that you use for evaluating positions. In return, you can presumably make that network smaller (assuming your features are actually useful), while keeping performance constant. This leaves less work for neural interpretation. How much smaller is an empirical question.
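
To illustrate the kind of thing I have in mind (a hypothetical sketch, again assuming the python-chess library; the two features and the weights are made up, and the weighted sum stands in for the now much smaller learned evaluation network):

```python
# Sketch only: hand-written, individually interpretable feature detectors whose
# outputs would be fed into a (much smaller) learned evaluation network.
# Assumes the python-chess library; the features and weights are illustrative.
import chess

def white_has_bishop_pair(board: chess.Board) -> float:
    return float(len(board.pieces(chess.BISHOP, chess.WHITE)) >= 2)

def white_king_pawn_shield(board: chess.Board) -> float:
    # Fraction of the three squares directly in front of White's king that
    # are occupied by White pawns.
    king_sq = board.king(chess.WHITE)
    if king_sq is None or chess.square_rank(king_sq) == 7:
        return 0.0
    covered = 0
    for file_offset in (-1, 0, 1):
        f = chess.square_file(king_sq) + file_offset
        r = chess.square_rank(king_sq) + 1
        if 0 <= f <= 7:
            piece = board.piece_at(chess.square(f, r))
            if piece is not None and piece.piece_type == chess.PAWN and piece.color == chess.WHITE:
                covered += 1
    return covered / 3

FEATURES = [white_has_bishop_pair, white_king_pawn_shield]  # ...imagine ~1000 of these

def feature_vector(board: chess.Board) -> list[float]:
    return [feature(board) for feature in FEATURES]

# Stand-in for the small evaluation network: a weighted sum of the features.
ILLUSTRATIVE_WEIGHTS = [0.3, 0.5]

def evaluate(board: chess.Board) -> float:
    return sum(w * x for w, x in zip(ILLUSTRATIVE_WEIGHTS, feature_vector(board)))

print(feature_vector(chess.Board()))  # start position -> [1.0, 1.0]
```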

Boycott OpenAI
Caspar Oesterheld · 1y

If all you're using is ChatGPT, then now's a good time to cancel the subscription, because GPT-4o seems to be about as powerful as GPT-4, and GPT-4o is available for free.

Anthropic release Claude 3, claims >GPT-4 Performance
Caspar Oesterheld · 2y
  • As one further data point, I also heard people close to or working at Anthropic making "we won't advance the state of the art"-type statements, though I never asked about specifics.
  • My sense is also that Claude 3 Opus is only slightly better than the best published GPT-4. To add one data point: I happen to work on a benchmark right now and on that benchmark, Opus is only very slightly better than gpt-4-1106. (See my X/Twitter post for detailed results.) So, I agree with LawrenceC's comment that they're arguably not significantly advancing the state of the art.
  • I suppose even if Opus is only slightly better (or even just perceived to be better) and even if we all expect OpenAI to release a better GPT-4.5 soon, Anthropic could still take a bunch of OpenAI's GPT-4 business with this. (I'll probably switch from ChatGPT-4 to Claude, for instance.) So it's not that hard to imagine an internal OpenAI email saying, "Okay, folks, let's move a bit faster with these top-tier models from now on, lest too many people switch to Claude." I suppose that would already be quite worrying to people here. (Whereas, people would probably worry less if Anthropic took some of OpenAI's business by having models that are slightly worse but cheaper or more aligned/less likely to say things you wouldn't want models to say in production.)
AI things that are perhaps as important as human-controlled AI
Caspar Oesterheld · 2y

In short, the idea is that there might be a few broad types of “personalities” that AIs tend to fall into depending on their training. These personalities are attractors.

I'd be interested in why one might think this to be true. (I only did a very superficial ctrl+f on Lukas' post -- sorry if that post addresses this question.) I'd think that there are lots of dimensions of variation and that within these, AIs could assume a continuous range of values. (If AI training mostly works by training to imitate human data, then one might imagine that (assuming inner alignment) they'd mostly fall within the range of human variation. But I assume that's not what you mean.)

How LLMs are and are not myopic
Caspar Oesterheld · 2y

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?

I think that janus isn't claiming this and I also think it isn't true. I think it's all about capacity constraints. The claim as I understand it is that there are some intermediate computations that are optimized both for predicting the next token and for predicting the 20th token and that therefore have to prioritize between these different predictions.

How LLMs are and are not myopic
Caspar Oesterheld · 2y

Here's a simple toy model that illustrates the difference between 2 and 3 (that doesn't talk about attention layers, etc.).

Say you have a bunch of triplets (x,z1,z2). You want to train a model that predicts z1 from x and z2 from x,z1.

Your model consists of three components: f,g1,g2. It makes predictions as follows:
y=f(x)
z1=g1(y)
z2=g2(y,z1)

(Why have such a model? Why not have two completely separate models, one for predicting z1 and one for predicting z2? Because it might be more efficient to use a single f both for predicting z1 and for predicting z2, given that both predictions presumably require "interpreting" x.)

So, intuitively, it first builds an "inner representation" (embedding) of x. Then it sequentially makes predictions based on that inner representation.

Now you train f and g1 to minimize the prediction loss on the (x,z1) parts of the triplets. Simultaneously you train f and g2 to minimize prediction loss on the full (x,z1,z2) triplets. For example, writing θ0, θ1, θ2 for the parameters of f, g1, g2, respectively, you update f and g1 with the gradients
$\nabla_{\theta_0,\theta_1}\, \ell\big(z_1,\, g^1_{\theta_1}(f_{\theta_0}(x))\big)$

and you update f and g2 with the gradients

$\nabla_{\theta_0,\theta_2}\, \ell\big(z_2,\, g^2_{\theta_2}(z_1,\, f_{\theta_0}(x))\big).$
(The z1 here is the "true" z1, not one generated by the model itself.)
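
To make the update structure concrete, here is a minimal numerical sketch (my own illustration, not from the original comment): scalar parameters w0, w1 and (a, b) stand in for θ0, θ1, θ2, the loss is squared error, and the data-generating process is made up. The point is just which parameters each loss's gradient touches, and that g2 is fed the true z1.

```python
# Sketch only: a scalar/linear version of the toy model, with hand-coded gradients.
# w0 plays the role of theta_0 (parameters of f), w1 of theta_1 (g1), and (a, b)
# of theta_2 (g2). The data-generating process below is made up for illustration.
import random

w0, w1, a, b = 0.1, 0.1, 0.1, 0.1
lr = 0.01

def sample_triplet():
    x = random.uniform(-1.0, 1.0)
    z1 = 2.0 * x + random.gauss(0.0, 0.01)
    z2 = -3.0 * x + 0.5 * z1 + random.gauss(0.0, 0.01)
    return x, z1, z2

for _ in range(20_000):
    x, z1, z2 = sample_triplet()
    y = w0 * x                  # y = f(x)
    z1_hat = w1 * y             # g1(y)
    z2_hat = a * y + b * z1     # g2(y, z1): note it receives the TRUE z1

    e1 = z1_hat - z1
    e2 = z2_hat - z2

    # Gradient of l(z1, g1(f(x))) w.r.t. (w0, w1): the only gradient g1 ever sees.
    grad_w1 = 2 * e1 * y
    grad_w0_from_z1 = 2 * e1 * w1 * x

    # Gradient of l(z2, g2(z1, f(x))) w.r.t. (w0, a, b): it never touches w1.
    grad_a = 2 * e2 * y
    grad_b = 2 * e2 * z1
    grad_w0_from_z2 = 2 * e2 * a * x

    w1 -= lr * grad_w1
    a -= lr * grad_a
    b -= lr * grad_b
    # The shared parameter w0 is pulled by BOTH losses; with limited capacity it
    # would have to trade off accuracy on z1 against accuracy on z2.
    w0 -= lr * (grad_w0_from_z1 + grad_w0_from_z2)

print(f"w0={w0:.2f}, w1={w1:.2f}, a={a:.2f}, b={b:.2f}")
```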

This training pressures g1 to be myopic in the second and third sense described in the post. In fact, even if we were to train θ0 and θ2 with the z1 predicted by g1 rather than the true z1, g1 would still be pressured to be myopic.

  • Type 3 myopia: Training doesn't pressure g1 to output something that makes the z2 follow an easier-to-predict (computationally or information-theoretically) distribution. For example, imagine that on the training data z1=0 implies z2=0, while under z1=1, z2 follows some distribution that depends in complicated ways on x. Then g1 will not try to predict z1=0 more often.
  • Type 2 myopia: g1 won't try to provide useful information to g2 in its output, even if it could. For example, imagine that the z1s are strings representing real numbers. Imagine that x is always a natural number, that z1 is the x-th Fibonacci number and z2 is the x+1-th Fibonacci number. Imagine further that the model representing g1 is large enough to compute the x-th Fibonacci number, while the model representing g2 is not. Then one way in which one might think one could achieve low predictive loss would be for g1 to output the x-th Fibonacci number and then encode, for example, the x−1-th Fibonacci number in the decimal digits. (E.g., g1(10)=55.0000000000034.) And then g2 could compute the x+1-th Fibonacci number by adding the x-th Fibonacci number to the x−1-th encoded in its decimal digits. But the above training will not give rise to this strategy, because g2 gets the true z1 as input, not the one produced by g1. Further, even if we were to change this, there would still be pressure against this strategy because g1 (θ1) is not optimized to give useful information to g2. (The gradient used to update θ1 doesn't consider the loss on predicting z2.) If g1 ever follows the policy of encoding information in the decimal digits, it will quickly learn to remove that information to get higher prediction accuracy on z1.

Of course, g1 still won't be pressured to be type-1-myopic. If predicting z1 requires predicting z2, then g1 will be trained to predict ("plan") z2.

(Obviously, g2 is pressured to be myopic in this simple model.)

Now what about f? Well, f is optimized both to enable predicting z1 from f(x) and predicting z2 from f(x),z1. Therefore, if resources are relevantly constrained in some way (e.g., the model computing f is small, or the output of f is forced to be small), f will sometimes sacrifice performance on one to improve performance on the other. So, adapting a paragraph from the post: The trained model for f (and thus in some sense the overall model) can and will sacrifice accuracy on z1 to achieve better accuracy on z2. In particular, we should expect trained models to find an efficient tradeoff between accuracy on z1 and accuracy on z2. When z1 is relatively easy to predict, f will spend most of its computation budget on predicting z2.

So, f is not "Type 2" myopic. Or perhaps put differently: the calculations going into predicting z1 aren't optimized purely for predicting z1.

However, f is still "Type 3" myopic. Because the prediction made by g1 isn't fed (in training) as an input to g2 or the loss, there's no pressure towards making f influence the output of g1 in a way that has anything to do with z2. (In contrast to the myopia of g1, this really does hinge on not using g2(f(x),g1(f(x))) in training. If g2(f(x),g1(f(x))) mattered in training, then there would be pressure for f to trick g1 into performing calculations that are useful for predicting z2. Unless you use stop-gradients...)

* This comes with all the usual caveats of course. In principle, the inductive bias may favor a situationally aware model that is extremely non-myopic in some sense.

Paper: LLMs trained on “A is B” fail to learn “B is A”
Caspar Oesterheld · 2y

At least in this case (celebrities and their largely unknown parents), I would predict the opposite. That is, people are more likely to be able to correctly answer "Who is Mary Lee Pfeiffer's son?" than "Who is Tom Cruise's mother?" Why? Because there are lots of terms / words / names that people can recognize passively but not produce. Since Mary Lee Pfeiffer is not very well known, I think Mary Lee Pfeiffer will be recognizable but not producible to lots of people. (Of people who know Mary Lee Pfeiffer in any sense, I think the fraction of people who can only recognize her name is high.) As another example, I think "Who was born in Ulm?" might be answered correctly by more people than "Where was Einstein born?", even though "Einstein was born in Ulm" is a more common sentence for people to read than "Ulm is the city that Einstein was born in".

If I had to run an experiment to test whether similar effects apply in humans, I'd probably try to find cases where A and B in and of themselves are equally salient but the association A -> B is nonetheless more salient than the association B -> A. The alphabet is an example of this (where the effect is already confirmed).

Posts

• A dataset of questions on decision-theoretic reasoning in Newcomb-like problems (Ω) · 50 karma · 9mo · 1 comment
• Stop-gradients lead to fixed point predictions (Ω) · 37 karma · 3y · 2 comments
• Proper scoring rules don’t guarantee predicting fixed points (Ω) · 80 karma · 3y · 8 comments
• Extracting Money from Causal Decision Theorists (Ω) · 27 karma · 5y · 34 comments
• Moral realism and AI alignment · 13 karma · 7y · 10 comments
• The law of effect, randomization and Newcomb’s problem · 7 karma · 8y · 1 comment
• Naturalized induction – a challenge for evidential and causal decision theory (Ω) · 15 karma · 8y · 15 comments
• A survey of polls on Newcomb’s problem · 3 karma · 8y · 8 comments
• Invitation to comment on a draft on multiverse-wide cooperation via alternatives to causal decision theory (FDT/UDT/EDT/...) · 6 karma · 8y · 7 comments
• Are causal decision theorists trying to outsmart conditional probabilities? · 7 karma · 8y · 10 comments
Wikitag Contributions

• Updateless Decision Theory · 3 years ago · (+47)
• Acausal Trade · 7 years ago · (+75)
• Acausal Trade · 7 years ago · (+74)
• Acausal Trade · 8 years ago · (+34/-36)
• Decision theory · 8 years ago · (+80)
• Counterfactual Mugging · 8 years ago · (+26)
• Counterfactual Mugging · 8 years ago · (+125)