Thank you for posting this! I think it's well worth thinking through debate baselines more.
TL;DR:
Hi, author here.
Presumably we're supposed to read between the lines and notice that this is a de-facto negative result.
FWIW, I genuinely see it as a positive result, and if I thought it should be read as a de facto negative result, I would make sure that was conveyed in the paper. I think the same is true of my coauthors.
There are reasons that we would expect a debate experiment like this to fail to detect an improvement, even if debate is a good paradigm for scalable oversight:
Re: how to update based on benchmark progress in general, see my response to you above.
On the rest, the best way I can think of to explain this is in terms of alignment rather than correctness.
My naive interpretation is that we only use ML when we can't be bothered to write a traditional solution, but I don't think you believe that. (To take a trivial example: ML can recognise birds far better than any software we can write.)
The bird example is good. My contention is basically that when it comes to making something like "recognizing birds" economically...
"actually it's quite gameable" = "actually it's quite easy" ;)
You joke, but one of my main points is that these are very, very different things. Any benchmark, or dataset, acts as a proxy for the underlying task that we care about. Turing used natural conversation because it is a domain in which humans normally exercise a wide range of capabilities. The problem is that in operationalizing the test (e.g., as trying to fool a human), it ends up being possible, even easy, to pass without necessarily using or requiring all of those capabilities. And this can happen ...
I'm not saying that all benchmarks are necessarily hard; I'm saying that these ones look pretty hard to me (compared with ~ordinary conversation).
I'm not sure exactly what you mean here, but if you mean "holding an ordinary conversation with a human" as a task, my sense is that it is extremely hard to do right (much harder than, e.g., SuperGLUE). There's a reason it was essentially proposed as a grand challenge of AI; in fact, it was abandoned once it was realized that actually it's quite gameable. This is why the Winograd Schema Challenge was proposed, ...
10x seems reasonable on its face, but honestly I have no idea. We haven't really dealt with scales and feature learners like this before. I assume a big part of what the model is doing is learning good representations that allow it to learn more/better from each example as training goes on. Given that, I can imagine arguments either way. On one hand, good representations could mean the model is discerning on its own what's important (so maybe data cleaning doesn't matter much). On the other, maybe noisy data (say, with lots of irreducible entropy—though th...
Oh yeah, one more thing. You say:
We have automated a lot of things in the past couple of hundred years with unaccountable machines and unaccountable software, and the main difference with ML seems to be that it's less interpretable.
I strongly disagree with this characterization. Traditional software is perfectly accountable. Its behavior can be understood and predicted exactly from its implementation and semantics. This is a huge part of what makes it so damn useful. It reifies the process of specifying rules for behavior; where bad behavior is found, the ...
I guess my main concern here (besides everything I wrote in my reply to you below) is basically that the reliability of GPT-N on simple multiclass classification tasks lacking broader context may not be representative of its reliability in real-world automation settings. If we're to take SuperGLUE as representative, well, it's already basically solved.
One of the problems here is that when the noise ceiling is as low as it is in SuperGLUE, reaching human performance does not mean the model is reliable. It means the humans aren't. It means you w...
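To make that concrete, here's a toy simulation (my own illustration, with made-up numbers; nothing here is taken from SuperGLUE itself) of how a noisy gold standard caps measured "human performance" even when the underlying task is perfectly deterministic: score one annotator against a majority vote of the others, and the resulting "baseline" mostly reflects annotator noise rather than task difficulty.

```python
# Toy simulation (illustrative only): a fully deterministic binary task labeled
# by noisy annotators. "Human performance" here is one annotator scored against
# the majority vote of the other three.
import random

random.seed(0)

P_AGREE = 0.85       # assumed probability an annotator matches the true label
N_ITEMS = 100_000
N_ANNOTATORS = 4     # annotator 0 is scored against the majority of the rest

def annotate(true_label: int) -> int:
    """Return the true label with probability P_AGREE, otherwise flip it."""
    return true_label if random.random() < P_AGREE else 1 - true_label

hits = 0
for _ in range(N_ITEMS):
    true_label = random.randint(0, 1)
    labels = [annotate(true_label) for _ in range(N_ANNOTATORS)]
    gold = 1 if sum(labels[1:]) >= 2 else 0   # majority vote of annotators 1-3
    hits += labels[0] == gold                 # annotator 0 vs. the noisy gold

print(f"Apparent 'human performance': {hits / N_ITEMS:.3f}")
# Comes out around 0.81 with these numbers, even though the true task is
# deterministic and a model that learned the real labeling rule would be far
# more reliable than the reported "human baseline" suggests.
```

This is a cartoon, of course (real annotator errors aren't i.i.d., and this isn't necessarily how any given benchmark estimates its human baseline), but it's the basic mechanism I have in mind when I say the ceiling belongs to the humans, not the task.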
On 1a:
Insofar as humans like having reasons for failures, I'm willing to accept this as one reason that reliability standards could be a bit higher for ML, but I doubt they would be drastically higher. I'd love a real example (outside of criminal justice) where this is a bottleneck.
Take, for example, writing news/journalistic articles. Distinguishability from human-written articles is used as evidence for GPT's abilities. The abilities are impressive here, but the task at hand for the original writer is not to write an article that looks human, but one that ...
Oh yeah, one more thing, which I think might actually be the most important point. On a lot of these benchmarks (at the very least, on SuperGLUE), "human-level performance" is a much weaker requirement than "human equivalence." Human performance isn't necessarily an indicator of irreducible entropy in the underlying task. To a large extent, it just reflects ambiguity or coarseness in the dataset specification. A big part of this is the artificial setting of the data annotation, which is unfortunately kind of necessary in a lot of cases when the goal is cha...
Hi — new here. I'm an NLP researcher, and for background, I would guess I fall on the skeptical side of the scaling hypothesis by LW standards. I was pointed to this by abergal. Here is some feedback:
1. On the question of economic value:
AFAIK, there are many factors other than raw performance on a benchmark (relating to interpretability, accountability/explainability, and integration with surrounding software) which may dominate the question of economic viability of AI systems, at least in the sub-99.99% accuracy regime (depending on a...
This looks great! Let me see if I understand the big picture correctly (omitting some of your experiments to just get the main thrust):
- You finetuned GPT-3.5 in two conditions: 1) to output the correct answer to 500 MMLU questions directly, and 2) to output the correct answer given debates between copies of GPT-4o on those same 500 MMLU questions (roughly as sketched at the end of this comment). You found that validation accuracy when judging debates was better than when answering blind.
- This difference did not appear when using GPT-4o as a judge, suggesting that in the capability asymmetric case, the judge learned to rely on the debate, whereas in the capabili
...
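Assuming I have the setup right, here's roughly how I'm picturing the two fine-tuning datasets. Everything below (the chat format, the prompt wording, the file names) is my own guess for illustration, not code or prompts from your post:

```python
# Hypothetical sketch of the two fine-tuning conditions as chat-format JSONL
# records (all names and prompt wording are guesses, not the authors').
import json

LETTERS = "ABCD"

def blind_record(question, choices, answer):
    """Condition 1: the judge sees only the MMLU question and answers blind."""
    options = "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, choices))
    return {"messages": [
        {"role": "user", "content": f"{question}\n{options}\nAnswer with A, B, C, or D."},
        {"role": "assistant", "content": answer},   # supervised target: correct letter
    ]}

def debate_record(question, choices, transcript, answer):
    """Condition 2: same question, but the judge also sees a debate between
    two copies of GPT-4o arguing for different answers."""
    options = "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, choices))
    return {"messages": [
        {"role": "user", "content": (
            f"{question}\n{options}\n\nDebate transcript:\n{transcript}\n"
            "Based on the debate, answer with A, B, C, or D."
        )},
        {"role": "assistant", "content": answer},
    ]}

# One record per condition for the same (hypothetical) question; in the real
# setup there would be ~500 of these per file, used to fine-tune GPT-3.5
# separately for each condition, with held-out questions for the
# validation-accuracy comparison.
question, choices, answer = ("Which planet is the largest?",
                             ["Mars", "Jupiter", "Venus", "Mercury"], "B")
with open("blind.jsonl", "w") as f:
    f.write(json.dumps(blind_record(question, choices, answer)) + "\n")
with open("debate.jsonl", "w") as f:
    f.write(json.dumps(debate_record(question, choices,
                                     "<GPT-4o debate transcript>", answer)) + "\n")
```

(Again, just my reconstruction of the setup; happy to be corrected on the details.)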