All of Lovre's Comments + Replies

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

Correlation (Pearson's r) is $\approx 0.62$.

Another way, possibly more intuitive, to state the results: for two messages generated at temperatures $t_{1}$ and $t_{2}$ respectively, if $t_{1} > t_{2}$ then the probability that GPT-4's respective guesses satisfy $p_{1} > p_{2}$ is $73\%$, with equal guesses counted as satisfying the inequality $50\%$ of the time. (This "correction" is applied because GPT-4 likes round numbers, and is equivalent to adding $N(0, \varepsilon^{2})$ noise to GPT-4's gu...
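The pairwise statistic described above can be sketched in a few lines of Python. This is only an illustration of the definition as stated (ties between guesses count as half a success); the function name `concordance_probability` and the choice to skip pairs with identical true temperatures are assumptions, not the code used to produce the 73% figure.

```python
from itertools import combinations


def concordance_probability(temperatures, guesses):
    """Estimate P(p1 > p2 | t1 > t2), counting tied guesses as half a success.

    `temperatures` are the true sampling temperatures and `guesses` are the
    model's guesses, in the same order. Hypothetical helper for illustration.
    """
    score = 0.0
    n_pairs = 0
    for (t_a, p_a), (t_b, p_b) in combinations(zip(temperatures, guesses), 2):
        if t_a == t_b:
            continue  # assumption: equal true temperatures give no ordering to test
        # Orient the pair so that p_a belongs to the higher-temperature message.
        if t_a < t_b:
            p_a, p_b = p_b, p_a
        if p_a > p_b:
            score += 1.0
        elif p_a == p_b:
            score += 0.5  # equal guesses satisfy the inequality half of the time
        n_pairs += 1
    if n_pairs == 0:
        raise ValueError("need at least two distinct temperatures")
    return score / n_pairs
```

A value of 0.5 corresponds to no signal and 1.0 to guesses that are perfectly ordered by true temperature.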

5gwern9mo

I was surprised there was any signal here because of the "flattened logits" mode collapse effect where ChatGPT-4 loses calibration and diversity after the RLHF tuning compared to GPT-4-base, but I guess if you're going all the way up to 1.5, that restores some range and something to measure.

4Owain_Evans9mo

Thanks for the breakdown! The idea of using pairs makes sense.

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

Lovre9mo50

I didn't explain it, but from playing with it I had the impression that it understood what "temperature" was reasonably well. E.g. gpt-4-0613, which is the checkpoint I tested, answers 'In the context of large language models like GPT-3, "temperature" refers to a parameter that controls the randomness of the model's responses. A higher temperature (e.g., 0.8) would make the output more random, whereas a lower temperature (e.g., 0.2) makes the output more focused and deterministic. [...]' to the question 'What is "temperature", in context of large language...

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

Lovre9mo170

Love this work. About a year ago I ran a small experiment in a similar direction: how good is GPT-4 at inferring the temperature at which its answer was generated? Specifically, I would ask GPT-4 to write a story, generate its response with a temperature sampled uniformly at random from the interval [0.5, 1.5], and then ask it to guess which temperature its story was generated with (sampling this guess at temperature 1, in order to preserve its possibly rich distribution).
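The procedure described here might be sketched as the loop below. It is a minimal sketch, not the original experiment code: `generate_story` and `guess_temperature` are hypothetical callables standing in for the two model calls (writing a story at a given temperature, and guessing the temperature from the story text), so no particular API is assumed.

```python
import random
from typing import Callable, List, Tuple


def run_temperature_guessing(
    generate_story: Callable[[float], str],    # hypothetical: story sampled at this temperature
    guess_temperature: Callable[[str], float],  # hypothetical: model's guess, sampled at temperature 1
    n_stories: int = 200,
    low: float = 0.5,
    high: float = 1.5,
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Return (true_temperature, guessed_temperature) pairs, one per story."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_stories):
        true_t = rng.uniform(low, high)    # temperature drawn uniformly from [low, high]
        story = generate_story(true_t)     # first call: write the story
        guess = guess_temperature(story)   # second call: guess from the text alone
        results.append((true_t, guess))
    return results
```

Passing the interval as `low` and `high` makes it easy to rerun the same loop on a different range; the resulting pairs can then be fed to whatever accuracy or correlation measure is of interest.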

See below for a quick illustration of the results for 200 stories – "Temperature" is the temperatu...

3Owain_Evans9mo

Do you have results for a measure of accuracy or correlation? It would also be worth comparing results for two different distributions on the temperature, e.g. the uniform distribution on [0.5, 1.5] that you tried, and another interval like [0, 2] or a non-uniform distribution.

6L Rudolf L9mo

Did you explain to GPT-4 what temperature is? GPT-4, especially before November, knew very little about LLMs due to training data cut-offs (e.g. the pre-November GPT-4 didn't even know that the acronym "LLM" stood for "Large Language Model"). Either way, it's interesting that there is a signal. This feels similar in spirit to the self-recognition tasks in SAD (since in both cases the model has to pick up on subtle cues in the text to make some inference about the AI that generated it).

Putting multimodal LLMs to the Tetris test

Lovre1y*10

Benchmarking new models

We aim to test any new promising models; initially just running the basic prompt with 0-shot, over 10 games, and depending on the results deciding whether to run the full test. So far none of the newer models have seemed promising enough to do so.

Model	Average lines cleared
GPT-4V (from the post)	19.6
Claude 3 Opus	17.5
GPT-4o	20.4
Claude 3.5 Sonnet	19.1

Putting multimodal LLMs to the Tetris test

Lovre1y10

Thanks for a lot of great ideas!

We tried cutting out the fluff of many colors and having all tetrominoes be one color, but that didn't seem to help much (though we didn't try making the falling tetromino a different color than the filled spaces). We also tried simplifying it by making it a 10x10 grid rather than 10x20, but that didn't seem to help much either.

We also thought of adding coordinates, but we ran out of the time we had allotted for this project and thus postponed that indefinitely. As it stands, we are unlikely to do further variations on Tetris because we're busy with other things, but we'd certainly appreciate any pull requests, should they come.

Decent plan prize winner & highlights

Lovre1y20

Glad that you liked my answer! Regarding my suggestion of synthetic data usage, upon further reflection I think it is plausible that it could be either a very positive or a very negative thing, depending on exactly how the model generalizes out-of-distribution. It also now strikes me that synthetic data provides a wonderful opportunity to study (some of) the out-of-distribution properties of LLMs even today – it is kind of hard to test the out-of-distribution behavior of internet-text-trained LLMs because they've seen everything, but if they are trained with synthetic data it should be much more doable.

3Nathan Helm-Burger1y

Or, if not purely synthetic, how about data deliberately filtered in weird ways? Could you train a model where information about a specific scientific theory has been thoroughly censored from the training data? Then in testing you could give it the information which was available to the human originators of the now-accepted scientific theory. See if the model can figure out the puzzle. Be concerned if it does better than you expect.

Who models the models that model models? An exploration of GPT-3's in-context model fitting ability

Lovre3yΩ470

Since I transformed the Iris dataset with a pretty "random" transformation (i.e. not chosen because it was particularly nice in some way), I didn't check for its regeneration -- since my feature vectors were very different to original Iris's, and it seemed exceedingly unlikely that feature vectors were saved anywhere on the internet with that particular transformation.

But I got curious now, so I performed some experiments.

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Rona

Lovre3yΩ020

That seems like a great idea, and induction heads do seem highly relevant!

What you describe is actually one of the key reasons why I'm so excited about this whole approach. I've seen many interesting metalearning tasks, and they mostly just either work or don't work, or they fail sometimes, and you can try to study their failures to perhaps glean some insight into the underlying algorithm -- but...they just don't have (m)any nontrivial "degrees of freedom" in which you can vary them. The class of numerical models, on the other hand, has a substantial amount of...

Who models the models that model models? An exploration of GPT-3's in-context model fitting ability

Lovre3y*40

I forgot to explicitly note it in the post, but yeah, if you have any ideas for variations on these experiments which you'd like to see run, which you feel might make your model of what is going on clearer, feel free to comment them here. Conditional on them being compute-light/simple enough to implement, I'll try to get back to you ASAP with the results – do feel encouraged to share ideas which might be vaguer or might require more compute as well, though in those cases I might not get back to you immediately.

Who models the models that model models? An exploration of GPT-3's in-context model fitting ability

Lovre3y*30

Is there any difference in formatting you omitted mentioning?

There shouldn't be any difference – neither between Iris and the synthetic binary tasks, nor between different synthetic binary tasks themselves – except if some snuck in that evaded my notice.

The only formatting alternative I tried was that, the first time I experimented with Iris, I put a line before all the input vectors which said something like "This is a sequences of inputs and outputs of an integer function.", but then I redid the experiment without that ...