I didn't explain it, but from playing with it I had the impression that it did understand what "temperature" was reasonably well. E.g. gpt-4-0613, which is the checkpoint I tested, answers

> In the context of large language models like GPT-3, "temperature" refers to a parameter that controls the randomness of the model's responses. A higher temperature (e.g., 0.8) would make the output more random, whereas a lower temperature (e.g., 0.2) makes the output more focused and deterministic. [...]

to the question What is "temperature", in context of large language
...
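(For concreteness, here is a toy sketch of what that quoted explanation refers to: temperature rescales the logits before the softmax, so a lower temperature concentrates probability on the top token and a higher temperature flattens the distribution. Illustrative numbers only, not tied to any particular model.)

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng=np.random.default_rng()):
    # Divide the logits by the temperature, then softmax: low T sharpens, high T flattens.
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

logits = [2.0, 1.0, 0.1]  # toy next-token logits
for t in (0.2, 0.8, 2.0):
    _, probs = sample_with_temperature(logits, t)
    print(t, np.round(probs, 3))
```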
Love this work. About a year ago I ran a small experiment in a similar direction: how good is GPT-4 at inferring the temperature at which its answer was generated? Specifically, I would ask GPT-4 to write a story, generate its response with the temperature randomly sampled from the interval [0.5, 1.5], and then ask it to guess (now sampling its answer at temperature 1, in order to preserve its possibly rich distribution) which temperature its story was generated with.
See below for a quick illustration of the results for 200 stories – "Temperature" is the temperatu...
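For anyone who wants to reproduce something like this, a rough sketch of the protocol is below. It is not the code I actually ran; it assumes the openai Python client (>= 1.0), and the model name and prompts are placeholders.

```python
import random
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4-0613"  # placeholder checkpoint

def one_trial():
    # 1) Generate a story at a temperature sampled uniformly from [0.5, 1.5].
    true_temp = random.uniform(0.5, 1.5)
    story = client.chat.completions.create(
        model=MODEL,
        temperature=true_temp,
        messages=[{"role": "user", "content": "Write a short story."}],
    ).choices[0].message.content

    # 2) Ask the model to guess the generation temperature, sampling the guess
    #    at temperature 1 to preserve its (possibly rich) distribution.
    guess = client.chat.completions.create(
        model=MODEL,
        temperature=1.0,
        messages=[
            {"role": "user", "content": "Write a short story."},
            {"role": "assistant", "content": story},
            {"role": "user", "content": "At what temperature do you think the story above was generated? Answer with a single number."},
        ],
    ).choices[0].message.content

    return true_temp, guess

# e.g.: trials = [one_trial() for _ in range(200)]
```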
We aim to test any promising new models: initially just running the basic prompt 0-shot over 10 games, and depending on the results, deciding whether to run the full test. So far none of the newer models have seemed promising enough to do so.
| Model | Average lines cleared |
|---|---|
| GPT-4V (from the post) | 19.6 |
| Claude 3 Opus | 17.5 |
| GPT-4o | 20.4 |
| Claude 3.5 Sonnet | 19.1 |
Thanks for all the great ideas!
We tried cutting out the fluff of many colors and having all tetrominoes be one color, but that didn't seem to help much (though we didn't try making the falling tetromino a different color than the filled spaces). We also tried simplifying it by making the grid 10x10 rather than 10x20, but that didn't seem to help much either.
We also thought of adding coordinates, but we ran out of the time we had allotted for this project and thus postponed that indefinitely. As it stands, it is not very likely that we'll do further variations on Tetris because we're busy with other things, but we'd certainly appreciate any pull requests, should they come.
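For reference, here is a hypothetical sketch of what the "add coordinates" variation might look like, in case someone wants to try it; the symbols and layout are just one possible choice, not something we actually ran.

```python
# Hypothetical sketch of a coordinate-annotated board rendering; the symbols
# and layout here are assumptions, not what the project actually uses.
def render_board(board):
    cols = len(board[0])
    lines = ["   " + " ".join(str(c % 10) for c in range(cols))]  # column labels
    for r, row in enumerate(board):
        lines.append(f"{r:2d} " + " ".join("#" if cell else "." for cell in row))
    return "\n".join(lines)

# Example: an empty 10x10 board with a single filled cell.
board = [[0] * 10 for _ in range(10)]
board[9][4] = 1
print(render_board(board))
```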
Glad that you liked my answer! Regarding my suggestion of synthetic data usage: upon further reflection, I think it is plausible that it could be either a very positive thing or a very negative thing, depending on exactly how the model generalizes out-of-distribution. It also now strikes me that synthetic data provides a wonderful opportunity to study (some of) their out-of-distribution properties even today – it is kind of hard to test out-of-distribution behavior of internet-text-trained LLMs because they've seen everything, but if trained with synthetic data it should be much more doable.
Since I transformed the Iris dataset with a pretty "random" transformation (i.e. not chosen because it was particularly nice in some way), I didn't check for its regeneration – my feature vectors were very different from the original Iris's, and it seemed exceedingly unlikely that feature vectors with that particular transformation were saved anywhere on the internet.
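(To be clear about the kind of thing I mean, here is an illustration only; the actual transformation I used isn't reproduced here. It just shows one way to push the Iris features through a fixed "random" integer-valued map, yielding vectors that shouldn't appear verbatim online.)

```python
import numpy as np
from sklearn.datasets import load_iris

rng = np.random.default_rng(0)
iris = load_iris()
A = rng.integers(-5, 6, size=(4, 4))   # arbitrary integer matrix
b = rng.integers(-20, 21, size=4)      # arbitrary integer offset
X_transformed = np.rint(iris.data @ A + b).astype(int)  # transformed feature vectors
print(X_transformed[:3], iris.target[:3])
```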
But I got curious now, so I performed some experiments.
> The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Rona
... That seems like a great idea, and induction heads do seem highly relevant!
What you describe is actually one of the key reasons why I'm so excited about this whole approach. I've seen many interesting metalearning tasks, and they mostly either just work or don't, or they fail sometimes and you can try to study their failures to perhaps glean some insight into the underlying algorithm -- but...they just don't have (m)any nontrivial "degrees of freedom" in which you can vary them. The class of numerical models, on the other hand, has a substantial amount of...
I forgot to explicitly note it in the post, but yeah, if you have any ideas for variations on these experiments that you'd like to see run and that you feel might make your model of what is going on clearer, feel free to comment them here. Conditional on them being compute-light/simple enough to implement, I'll try to get back to you ASAP with the results – do feel encouraged to share ideas which might be vaguer or might require more compute as well, though in those cases I might not get back to you immediately.
Is there any difference in formatting you omitted mentioning?
There shouldn't be any difference – neither between Iris and the synthetic binary tasks, nor between different synthetic binary tasks themselves – except if some snuck in that evaded my notice.
The only thing I experimented with, alternative-formatting-wise, was that the first time I experimented with Iris, I did it with a line before all the input vectors which said something like "This is a sequence of inputs and outputs of an integer function.", but then I redid the experiment without that ...
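Just to illustrate the kind of formatting being discussed, here is a hypothetical reconstruction, not the exact prompt I used:

```python
# Hypothetical reconstruction of the formatting being discussed, not the exact prompt.
def build_prompt(pairs, framing_line=None):
    lines = [framing_line] if framing_line else []
    for x, y in pairs:
        lines.append(f"{list(x)} -> {y}")  # one input vector and its label per line
    return "\n".join(lines)

print(build_prompt(
    [((12, -3, 7, 40), 1), ((5, 0, -2, 18), 0)],
    framing_line="This is a sequence of inputs and outputs of an integer function.",
))
```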
Correlation (Pearson's r) is ≈0.62.
Another, possibly more intuitive, way to state the results: for two messages generated at respective temperatures t1 and t2 with t1 > t2, the probability that GPT-4's respective guesses satisfy p1 > p2 is 73%, with equal guesses counted as satisfying the inequality 50% of the time. (This "correction" is applied because GPT-4 likes round numbers, and is equivalent to adding N(0, ε²) noise to GPT-4's gu...
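For clarity, the statistic I'm describing is just pairwise concordance with ties counted as 1/2; below is a sketch (not the original analysis code) with made-up numbers.

```python
# Over all pairs with t1 > t2, the fraction with p1 > p2, counting ties as 1/2.
def pairwise_agreement(temps, guesses):
    num, den = 0.0, 0
    for i in range(len(temps)):
        for j in range(len(temps)):
            if temps[i] > temps[j]:
                den += 1
                if guesses[i] > guesses[j]:
                    num += 1
                elif guesses[i] == guesses[j]:
                    num += 0.5
    return num / den

# Made-up example (the real experiment used 200 stories):
print(pairwise_agreement([0.6, 1.0, 1.4], [0.7, 0.7, 1.2]))
```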