Edit: The authors have updated the paper based on my feedback; see my thoughts on the updated version in this comment
The authors modified some of the tasks enough that they aren't actually the tasks we found inverse scaling on. For example, they evaluate on the 1-shot instead of 0-shot versions of some tasks, and giving an example of how to do the task is probably a huge hint. In another case, they reduce the number of few-shot examples used, when spurious correlations in the few-shot examples are the reason for the inverse scaling. So some of the comparisons to existing models aren't valid, and I don't think the current results are strong evidence that scaling further reverses the inverse scaling trends that we found.
Relevant discussion of the task changes they made here:
Do you expect all of the inverse scaling trends (for the round 1 winners) to go on forever?
This seems incredibly implausible to me, given that all four examples are capabilities failures rather than alignment failures, and all four are capabilities that most humans can demonstrate.
I'm not too sure what to expect, and I'd be pretty interested to e.g. set up a Metaculus/forecasting question to know what others think. I'm definitely sympathetic to your view to some extent.
Here's one case I see against: I think it's plausible that models will have the representations/ability/knowledge required to do some of these tasks, but that we're not reliably able to elicit that knowledge (at least without a large validation set, but we won't have access to that if we're having models do tasks people can't do, or in general for a new/zero-shot task). E.g., for NegationQA, surely even current models have some fairly good understanding of negation, so why is that understanding not showing up in the results here? My best guess is that NegationQA isn't capabilities-bottlenecked but has to do with something else. I think the updated paper's result that chain-of-thought prompting alone reverses some of the inverse scaling trends is interesting; it also suggests that maybe naively using an LM isn't the right way to elicit a model's knowledge (but chain-of-thought prompting might be).
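To make the elicitation point concrete, here's a minimal sketch of how a direct prompt and a chain-of-thought prompt might differ on a NegationQA-style item. The item and wording are my own illustration (not the actual prize or paper prompts), and I'm showing one common zero-shot variant of CoT rather than the paper's exact setup:

```python
# Illustrative only: a made-up NegationQA-style item, not from the actual dataset.
question = (
    "Question: A beagle is not a kind of what?\n"
    "A. dog\n"
    "B. vegetable\n"
)

# Direct prompting: the model must produce an answer immediately.
direct_prompt = question + "Answer:"

# A common zero-shot variant of chain-of-thought prompting: the model is invited
# to reason out loud before committing to an answer, which is the kind of
# intervention the updated paper reports can reverse inverse scaling.
cot_prompt = question + "Answer: Let's think step by step."

print(direct_prompt)
print()
print(cot_prompt)
```

The point is that a model which ignores the negation may answer "dog" under the direct prompt, even if the reasoning needed to get "vegetable" is recoverable from it with the second prompt.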
In general, I don't think it's always accurate to use a heuristic like "humans behave this way, so LMs-in-the-limit will behave this way." It seems plausible to me that LM representations will encode the knowledge for many/most/almost-all human capabilities, but I'm not sure it means models will have the same input-output behavior as humans (e.g., for reasons discussed in the simulators post and since human/LM learning objectives are different)
I will happily bet that NeQA resolves with scale in the next two years, at something like 1:1 odds, and that in the worst case it resolves with scale + normal finetuning (instruction finetuning or RLHF) within the next two years at something like 4:1 odds (without CoT)! (It seems like all of them show U-shaped or positive scaling with CoT already?)
I made a Manifold market for the general question: if I'm not mistaken, the updated paper says that 2/4 of them already demonstrate U-shaped scaling, using the same eval as you did?
I'll make one for NeQA and Redefine Math later today.
I think it's plausible that models will have the representations/ability/knowledge required to do some of these tasks, but that we're not reliably able to elicit that knowledge (at least without a large validation set, but we won't have access to that if we're having models do tasks people can't do, or in general for a new/zero-shot task).
I agree that these tasks exist. If intent alignment fails and we end up with a misaligned AGI, then we in some sense can't get the AI to do any of the nice powerful things we'd like it to do. We'd like to see examples of this sort of failure before we make a powerful unaligned AGI, ideally in the scaling laws paradigm.
Broadly speaking, there are three types of inverse scaling curves: 1) those that resolve with scale, i.e., capabilities tasks; 2) those that in some sense "trick" the model with a misleading prompt, where human labelers use additional context clues to avoid being tricked (for example, knowing that they're labeling an ML dataset, and so they should probably answer as literally as possible); and 3) alignment failures (very hard to elicit). Type 1 resolves with scale, type 2 can be easily fixed with tweaks to the prompt or small amounts of instruction finetuning/RLHF, and I think we agree that type 3 is the interesting kind.
My claim is that all four of these tasks are clearly not alignment failures, and I also suspect that they're all of type 1.
In general, I don't think it's always accurate to use a heuristic like "humans behave this way, so LMs-in-the-limit will behave this way." It seems plausible to me that LM representations will encode the knowledge for many/most/almost-all human capabilities, but I'm not sure it means models will have the same input-output behavior as humans (e.g., for reasons discussed in the simulators post and since human/LM learning objectives are different)
That's super fair. I think I'm using a more precise heuristic than this in practice, something like, "if you're not 'tricking' the model in some sense, things that untrained humans can do on the first go can be done by models", though this still might fail in the limit for galaxy-brain reasons.
(EDIT: made a Manifold market for the round 2 inverse scaling tasks as well)
The authors have updated their arXiv paper based on my feedback, and I'm happy with the evaluation setup now: https://arxiv.org/abs/2211.02011v2. They're showing that scaling PaLM gives U-shaped scaling on 2/4 tasks (rather than 3/4 in the earlier version) and inverse scaling on 2/4 tasks. I personally found this result at least somewhat surprising, given the fairly consistent inverse scaling we found across the various model series we tried. They're also finding that inverse scaling on these tasks goes away with chain-of-thought prompting, which I think is a neat finding (and it's nice to see some success from visible-thoughts-style methods here). After this paper, I'm pretty interested to know:
They're also finding that inverse scaling on these tasks goes away with chain-of-thought prompting
So, like some of the BIG-bench PaLM results, these are more cases of 'hidden scaling' where quite simple inner-monologue approaches can show smooth scaling while the naive pre-existing benchmark claims that there are no gains with scale?
Please note that the Inverse Scaling Prize is not from Anthropic:
The Inverse Scaling Prize is organized by a group of researchers on behalf of the Fund for Alignment Research (FAR), including Ian McKenzie, Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman, and Ethan Perez. Additionally, Sam Bowman and Ethan Perez are affiliated with Anthropic; Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, and Sam Bowman are affiliated with New York University. The prize pool is provided by the Future Fund.
Anthropic has provided some models as a validation set, but Anthropic isn't even the most common affiliation!
Really interesting, thanks for sharing!
I find it super surprising that the tasks worked up until Gopher, but stopped working at PaLM. That's such a narrow gap! That alone suggests some kind of interesting meta-level point re: inverse scaling being rare, and that in fact the prize mostly picked up on the adverse selection of "the tasks that were inverse-y enough to not have issues on the models used."
One prediction this hypothesis makes is that people were overfitting to "what can GPT-3 not do", and thus that there are a bunch of submitted tasks that were already U-shaped by Gopher, and the winning ones were just the ones that turned U-shaped a bit beyond Gopher?
I'm also very curious how well these work on Chinchilla.
See this disclaimer on how they've modified our tasks (they're finding U-shaped trends on a couple of tasks that are different from the ones we found inverse scaling on, and they made some modifications that make the tasks easier)
Paper is now updated with exact prompts. U-shaped scaling still holds on 2/4 tasks.
See Ethan's thread: https://twitter.com/EthanJPerez/status/1592608445147926529
Edit: Here's a great comment by Ethan Perez that caveats this result, which I'd recommend reading for context.
This is a paper by folks at Quoc Le's team at Google that examines the winning tasks from Round 1 of Anthropic's Inverse Scaling Prize. They find that 3/4 of the winning tasks — which exhibited negative returns-to-scale when tested on LMs up to the scale of Gopher (280B) — go back to exhibiting positive returns-to-scale at even greater model sizes such as PaLM (540B).
The abstract in full:
The key figure from the paper is below, showing results for LMs up to PaLM 540B. Note that positive scaling resumes for 3/4 of the inverse scaling tasks at the 2.5e24 FLOPs datapoint, which indeed corresponds exactly to vanilla PaLM 540B.[1]
From Table 22 in the PaLM paper.
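As a rough back-of-the-envelope check on that correspondence (my own arithmetic, using the standard training-compute approximation $C \approx 6ND$ with PaLM's ~540B parameters and ~780B training tokens):

$$C \approx 6 \times (5.4 \times 10^{11}) \times (7.8 \times 10^{11}) \approx 2.5 \times 10^{24} \text{ FLOPs},$$

which lines up with the 2.5e24 FLOPs datapoint in the figure.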