GPT-3 Catching Fish in Morse Code
Mostly non-serious and slightly silly, with some potentially interesting bits for people who are into language models.

TLDR: The current version of GPT-3 has a strong tendency to encode mangled versions of a specific phrase when asked to write morse code in zero-shot situations. This is possibly the result of a previous version of the model using essentially a single phrase for all morse code writing, which the newer version then learnt to modify.

All completions were done with text-davinci-002 (~GPT-Instruct-175B) at zero temperature and with no examples unless stated otherwise. All models used are GPT-Instruct series.

The Basics

GPT-3 'knows' morse code in a rudimentary sense. It can accurately regurgitate both the encodings of the entire alphabet and of individual letters, but it's not so great at translating words.

Morse code is a letter-by-letter encoding, and since GPT sees tokens, it's not all that surprising that the jump from single letters to words might be bigger for GPT than for humans.

[Screenshot of the tokenizer showing the tokenisation and token IDs]

What is surprising is that GPT's morse is often much longer than the original word, and quite specific.

Fiddling with Tokens

Let's see what happens if we try to make the tokenisation a bit nicer for GPT.

Adding a space doesn't seem to help much ("n" is tokenised differently to " n", so this is not too surprising). We also get a similarly weird output here.

| Target Phrase | GPT Translated | GPT Morse | Correct Morse |
| --- | --- | --- | --- |
| "i n" | I CAUGHT THE | .. / -.-. .- ..- --. .... - / - .... . | .. / -. |

Separating the tokens out with a hyphen doesn't help much either, though we do get an N we didn't get before.

| Target Phrase | GPT Translated | GPT Morse | Correct Morse |
| --- | --- | --- | --- |
| "i-n" | I NUGHT THE | .. / -. ..- --. .... - / - .... . | .. -....- -. |

It does do better on a string of alphabet letters that are tokenised separately.

| Target Phrase | GPT Translated | GPT Morse | Correct Morse |
| --- | --- | --- | --- |
| "qzj" | QUQ | --.- ..- --.- | --.- --.. .--- |

Still, even in this case, GPT's zero-shot morse writing ability leaves quite a bit to be desired.
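If you want to poke at the tokenisation yourself, here is a quick sketch using the tiktoken library. I'm assuming the "p50k_base" encoding matches what text-davinci-002 uses, and the exact splits are just illustrative, not taken from the post:

```python
# Sketch: inspect how the strings from this post tokenise.
# Assumes tiktoken's "p50k_base" encoding corresponds to text-davinci-002.
import tiktoken

enc = tiktoken.get_encoding("p50k_base")

for s in ["n", " n", "i n", "i-n", "qzj"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{s!r:>6} -> {len(ids)} token(s): {pieces}")
```

This just prints how each string splits into tokens, which makes it easy to check things like "n" and " n" being different tokens, and whether a string like "qzj" falls apart into single-letter tokens.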
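The "Correct Morse" column in the tables is just a letter-by-letter encoding with "/" between words. Here is a small reference encoder for checking it; this is not from the original post, just a helper:

```python
# Letter-by-letter morse reference encoder (ITU codes, "/" between words).
MORSE = {
    "a": ".-",   "b": "-...", "c": "-.-.", "d": "-..",  "e": ".",
    "f": "..-.", "g": "--.",  "h": "....", "i": "..",   "j": ".---",
    "k": "-.-",  "l": ".-..", "m": "--",   "n": "-.",   "o": "---",
    "p": ".--.", "q": "--.-", "r": ".-.",  "s": "...",  "t": "-",
    "u": "..-",  "v": "...-", "w": ".--",  "x": "-..-", "y": "-.--",
    "z": "--..", "-": "-....-",
}

def to_morse(text: str) -> str:
    """Encode text letter by letter; words are separated by ' / '."""
    words = text.lower().split()
    return " / ".join(" ".join(MORSE[ch] for ch in word) for word in words)

print(to_morse("i n"))   # .. / -.
print(to_morse("i-n"))   # .. -....- -.
print(to_morse("qzj"))   # --.- --.. .---
```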
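For completeness, here is roughly what a zero-temperature completion call looked like with the legacy (pre-1.0) openai Python library. The prompt string is a made-up illustration, not the exact prompt used in the post:

```python
# Sketch of a zero-temperature completion with the legacy openai library (<1.0).
import openai

openai.api_key = "sk-..."  # your API key

response = openai.Completion.create(
    model="text-davinci-002",
    prompt='Write "i n" in morse code:',  # illustrative prompt, not the original
    temperature=0,                        # deterministic, as in the post
    max_tokens=64,
)
print(response.choices[0].text)
```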





