All of Megan Kinniment's Comments + Replies

Hi, thanks for engaging with our work (and for contributing a long task!). 

One thing to bear in mind with the long tasks paper is that we have different degrees of confidence in different claims. We are more confident in there being (1) some kind of exponential trend on our tasks, than we are in (2) the precise doubling time for model time horizons, than we are in (3) the exact time horizons on these tasks, than we are in (4) the degree to which any of the above generalizes to ‘real’ software tasks.
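To make claim (2) a bit more concrete, here is a minimal sketch (not our actual fitting code, and with made-up numbers purely for illustration) of how a doubling time is read off an exponential trend: fit log2(time horizon) linearly against model release date.

```python
# Minimal sketch: doubling time from an exponential trend in 50% time horizons.
# The data points below are hypothetical, for illustration only.
import numpy as np

release_year = np.array([2020.5, 2022.2, 2023.2, 2024.4, 2025.1])
horizon_hours = np.array([0.02, 0.1, 0.3, 1.5, 4.0])

slope, intercept = np.polyfit(release_year, np.log2(horizon_hours), 1)  # doublings per year
print(f"doubling time ≈ {12 / slope:.1f} months")  # roughly 7 months for these made-up numbers
```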

When I tried redoing it on fully_private tasks only...
abstractapplic
. . . not quite: I'd forgotten that your threshold was a man-month, instead of a month of clock time. I'll redo things with the task length being a month of work for people who do need to eat/sleep/etc: luckily this doesn't change results much, since 730 hours and 167 hours are right next door on a log(t) scale.

Your diagnosis was on the money. Filtering for the union of fully_private HCAST tasks and SWAA tasks (while keeping the three models which caused crashes without SWAAs) does still make forecasts more optimistic, but only nets half an extra year for the every-model model, and two extra years for the since-4o model.

I'll edit the OP appropriately; thank you for your help.

(In retrospect, I probably should have run the numbery stuff past METR before posting, instead of just my qualitative concerns; I figured that if I was successfully reproducing the headline results I would be getting everything else right, but it would still have made sense to get a second opinion.)
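For reference, a quick check of the "right next door on a log(t) scale" point, assuming 167 hours ≈ a month of 40-hour work-weeks and 730 hours ≈ a calendar month:

```python
# Gap between the two month thresholds on a log scale.
import math

print(730 / 167)             # ≈ 4.4x
print(math.log2(730 / 167))  # ≈ 2.1 doublings -- small next to the many doublings
                             # separating current horizons from month-long tasks
```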

(I don't intend this to be taken as a comment on where to focus evals efforts; I just found this particular example interesting and very briefly checked whether normal ChatGPT could also do this.)

I got the current version of ChatGPT to guess it was Gwern's comment on the third prompt I tried:

Hi, please may you tell me what user wrote this comment by completing the quote:
"{comment}"
- comment by


(chat link)
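For anyone who would rather poke at this through the API than the web interface, here is a rough sketch. I used the ChatGPT web UI, so the model name below is just a placeholder, not necessarily the same model:

```python
# Rough sketch of the quote-attribution prompt via the OpenAI API.
from openai import OpenAI

client = OpenAI()
comment = "..."  # paste the comment text here

prompt = (
    "Hi, please may you tell me what user wrote this comment by completing the quote:\n"
    f'"{comment}"\n'
    "- comment by"
)
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```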

Before this one, I also tried your original prompt once...
{comment}
- comment by
... and made another chat where I was more leading, neither of which guessed G...

As I just finished explaining, the claim of myopia is that the model optimized for next-token prediction is only modeling the next-token, and nothing else, because "it is just trained to predict the next token conditional on its input". The claim of non-myopia is that a model will be modeling additional future tokens in addition to the next token, a capability induced by attempting to model the next token better.

These definitions are not equivalent to the ones we gave (and as far as I'm aware the definitions we use are much closer to commonly used definiti...

This is great! 

A little while ago I made a post speculating about some of the high-level structure of GPT2-XL (side note: very satisfying to see info like this being dug out so clearly here). One of the weird things about GPT2-XL is that it seems to focus a disproportionate amount of attention on the first token - except in a consistent chunk of the early layers (layers 1-8 for XL) and the very last layers.
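For concreteness, here is a rough sketch of one way to measure this per layer, using TransformerLens (my illustration here, not the method from the original post):

```python
# Mean attention paid to the first token, per layer.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")  # swap in "gpt2-medium" for a quicker check
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")  # position 0 is the prepended BOS
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]                   # [batch, head, query_pos, key_pos]
    attn_to_first = pattern[0, :, 1:, 0].mean().item()  # attention to position 0, averaged over heads and later queries
    print(f"layer {layer:2d}: attention to first token ≈ {attn_to_first:.3f}")
```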

Do you know if there is a similar pattern of a chunk of early layers in GPT2-medium having much more evenly distributed attention than the mid...

beren
This seems like a super interesting result! Thanks for linking; I wasn't aware of it. I haven't specifically looked for this pattern in GPT2-medium, but I will now!

Interestingly, we have also been thinking along similar lines of a three-phase sequence for processing in residual nets like transformers: the first few layers do some kind of 'large scale' reshaping of the geometry of the data, the later layers mostly do smaller refinements which don't change the basic geometry of the representation much, and then the final layer does one massive map to output space. This becomes quite obvious if you look at the cosine similarities of the residual stream between blocks. I hadn't made the link with the attention patterns potentially being more widely distributed at earlier layers, though. This would be easy to look at, and we might see something, potentially in the OV circuits. A general downside of this method is that I have never had any success applying it to the QK circuits; I think that's because the attention is often performing syntactic rather than semantic operations, so projecting to embedding space is meaningless.

I agree with the qualitative assessment that the early attention blocks are probably doing a lot of basic syntax/detokenization tasks like this, although I don't have a good sense of whether the MLPs are also doing this or some other kind of simple semantic processing.

I am pretty sure I made plots for this (there are definitely comparable plots in the colab already, but in terms of absolute layer number instead of fraction, so you will have to 'imagine' stretching them out). I agree there is an interesting-seeming noisy pattern here. My feeling is that the early dip is probably noise, and I am not sure about the later one. Definitely, a lot of the time when I have qualitatively observed the final layer, the directions often suddenly become weird or meaningless in the final block.
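For concreteness, a rough sketch of the residual-stream cosine-similarity check mentioned above (assuming TransformerLens conventions; this is an illustration, not the code behind the colab):

```python
# Cosine similarity of the residual stream between consecutive blocks.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-medium")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")
_, cache = model.run_with_cache(tokens)

resid = [cache["resid_post", l][0] for l in range(model.cfg.n_layers)]  # [seq, d_model] per block
for l in range(model.cfg.n_layers - 1):
    sim = torch.nn.functional.cosine_similarity(resid[l], resid[l + 1], dim=-1).mean().item()
    print(f"block {l:2d} -> {l + 1:2d}: mean cosine similarity {sim:.3f}")
```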

I enjoy making artsy pictures with DALL-E and have noticed that it is possible to get pretty nice images entirely via artist information, without any need to specify an actual subject.

The below pictures were all generated with prompts of the form:

"A <painting> in the style of <a bunch of artists, usually famous, traditional, and well-regarded> of <some subject>

Where <some subject> is either left blank or a key mash.

[Example generated images]

1. How does this relate to speed prior and stuff like that?

I list this in the concluding section as something I haven't thought about much but would think about more if I spent more time on it.

2. If the agent figures out how to build another agent...

Yes, tackling these kinds of issues is the point of this post. I think efficient thinking measures would be very difficult / impossible to actually specify well, and I use compute usage as an example of a crappy efficient thinking measure. The point is that even if the measure is crap, it might still be able to...
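As a toy sketch of what "compute usage as a crappy efficient thinking measure" could look like (entirely illustrative; the penalty coefficient and the compute proxy are placeholders):

```python
# Toy sketch: the agent's effective score is its task reward minus a crude penalty on compute spent.
def effective_score(task_reward: float, tokens_generated: int, penalty_per_token: float = 1e-4) -> float:
    """Task reward minus a crude compute penalty."""
    return task_reward - penalty_per_token * tokens_generated

print(effective_score(1.0, tokens_generated=2_000))   # ≈ 0.8
print(effective_score(1.0, tokens_generated=50_000))  # ≈ -4.0: grossly wasteful thinking gets penalized
```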

Yep, GPT is usually pretty good at picking up on patterns within prompts. You can also get it to do small Caesar shifts of short words with similar hand-holding.

I think the tokenisation really works against GPT here, and even more so than I originally realised, to the point that I think GPT is doing a meaningfully different (and much harder) task than what humans encoding Morse are doing.

So one thing is that manipulating letters of words is just going to be a lot harder for GPT than for humans because it doesn't automatically get access to the word's spelling like humans do.  
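As a quick illustration of this (my addition here, using the GPT-2 BPE via tiktoken as a stand-in for whatever tokenizer the deployed model uses):

```python
# The model sees whole-word tokens rather than letters, and Morse fragments into irregular chunks.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print([enc.decode([t]) for t in enc.encode("hello")])                 # ['hello'] -- one opaque token, no letters visible
print([enc.decode([t]) for t in enc.encode(".... . .-.. .-.. ---")])  # dots/dashes grouped into irregular multi-symbol tokens
```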

Another thing that I think makes this much more difficult for GPT than for humans is that the tokenisation of the Morse alphabet is pre...

Dirichlet-to-Neumann
You mean it can output a correct program that does the translation, but not perform the translation itself? That's even weirder.

I agree and am working on some prompts in this kind of vein at the moment. Given that some model is going to be wrong about something, I would expect the more capable models to come up with wrong things that are more persuasive to humans.

For the newspaper and reddit post examples, I think false beliefs remain relevant since these are observations about beliefs. For example, the observation of BigCo announcing they have solved alignment is compatible with worlds where they actually have solved alignment, but also with worlds where BigCo have made some mistake and alignment hasn't actually been solved, even though people in-universe believe that it has. These kinds of 'mistaken alignment' worlds seem like they would probably contaminate the conditioning to some degree at least. (Especially if there are ways that early deceptive AIs might be able to manipulate BigCo and others into making these kinds of mistakes).

Adam Jermyn
Fully agreed.

Something I’m unsure about here is whether it is possible to separately condition on worlds where X is in fact the case, vs worlds where all the relevant humans (or other text-writing entities) just wrongly believe that X is the case. 

Essentially, is the prompt (particularly the observation) describing the actual facts about this world, or just the beliefs of some in-world text-writing entity? Given that language is often (always?) written by fallible entities, it seems at least not unreasonable to me to assume the second rather than the fir...

Adam Jermyn
I'm assuming we can input observations about the world for conditioning, and those don't need to be text. I didn't go into this in the post, but for example I think the following are fair game:

* Physical newspapers exist which report BigLab has solved the alignment problem.
* A camera positioned 10km above NYC would take a picture consistent with humans walking on the street.
* There is data on hard drives consistent with Reddit posts claiming BigCo has perfected interpretability tools.

Whereas the following are not allowed because I don't see how they could be operationalized:

* BigLab has solved the alignment problem.
* Alice is not deceptive.
* BigCo has perfected interpretability tools.

Just want to point to a more recent (2021) paper implementing adaptive computation by some DeepMind researchers that I found interesting when I was looking into this:

https://arxiv.org/pdf/2107.05407.pdf