Cross-model consistency or the exact color shades aside, would you consider this an example of mode-collapse? Those all look like tuned models, with no base models I recognize, while the chosen colors seem... extremely undiverse. I am sure that if we asked random humans, we would find that people like more colors than just 'green' and 'purple', roughly.
I don't know if it's applicable? Like, I'm asking for The Favorite color, not "suggest me a random cool color, please". I should probably test that too.
https://ygo-assets-websites-editorial-emea.yougov.net/documents/tabs_OPI_color_20141027_2.pdf
According to these two (dubious) surveys I just found, about 30% of humans pick Blue, 15% Purple, and 10% Green. The LLM answers are not particularly far from the human distribution (if your guess is that the models should just echo the favorite colors humans report); they are a bit more skewed toward blue, and further from red.
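(For concreteness, here is a minimal sketch of the comparison I have in mind; everything beyond the three survey percentages - the red/other shares and the whole LLM column - is a made-up placeholder.)

```python
# Total variation distance between the (partly hypothetical) human survey
# distribution and a hypothetical LLM answer distribution.
# Only Blue 30% / Purple 15% / Green 10% come from the survey; the rest is made up.
human = {"blue": 0.30, "purple": 0.15, "green": 0.10, "red": 0.08, "other": 0.37}
llm = {"blue": 0.45, "purple": 0.30, "green": 0.20, "red": 0.02, "other": 0.03}

# Half the sum of absolute differences: 0 means identical, 1 means disjoint.
cats = set(human) | set(llm)
tvd = 0.5 * sum(abs(human.get(c, 0.0) - llm.get(c, 0.0)) for c in cats)
print(f"total variation distance: {tvd:.2f}")
```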
I think it's applicable. It seems little different to me from other mode-collapse exercises like 'what is your name?' or 'imagine a random person; what is their surname?' or 'flip a fair coin; what did it come up as?' (Note that for none of the models, in either set of responses, is "the favorite color" really a 'the': they do not deterministically pick the same color every time, so if there is supposed to be 'the' favorite color, you still have to explain how the LLM happens to pick this particular distribution of values.)
"According to these two (dubious) surveys I just found, about 30% of humans pick Blue, 15% Purple, and 10% Green. The LLM answers are not particularly far from the human distribution (if your guess is that the models should just echo the favorite colors humans report)."
Seems quite different to me. Leaving aside that hex codes form a very large space of colors yet the responses still look remarkably uniform, discretized to just a handful of values (one way to quantify this is sketched below), just look at all the pink, white, red, 'none of these', etc. in the human colors. In the LLM colors, I see a tiny bit of pink/red, and definitely no 'white'. This is not the most mode-collapsed thing ever, but it still looks mode-collapsed.
(Mode-collapse is a spectrum. There's fully representing the distribution; then there are mode drops you can't see without statistical tests, because they are subtle and distribution-wide or in places you don't normally look; then there's stuff you start to notice if you use a model often enough, like Claude apparently liking to name characters "Susan Chen" in stories; then there's stuff so blatant you can't help but notice it from even a single sample - like how ChatGPT used to write rhyming poetry even if you asked "write a non-rhyming poem".)
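(Tangentially, here is a minimal sketch, with made-up sample answers, of the kind of check I mean: count how many distinct hex codes actually show up out of the ~16.7 million possible, and sort the survivors by hue.)

```python
import colorsys
from collections import Counter

# Hypothetical sample of hex-code answers; substitute the real model outputs.
answers = ["#4B0082", "#4B0082", "#008080", "#7FFF00", "#4B0082", "#008080"]

def hue(hex_code: str) -> float:
    """Return the HSV hue (0..1) of a '#RRGGBB' string."""
    r, g, b = (int(hex_code[i:i + 2], 16) / 255 for i in (1, 3, 5))
    return colorsys.rgb_to_hsv(r, g, b)[0]

counts = Counter(answers)
print(f"{len(counts)} distinct codes out of {len(answers)} samples "
      f"(a space of {16 ** 6:,} possible codes)")
for code in sorted(counts, key=hue):  # sort the distinct codes by hue
    print(code, counts[code])
```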
This investigation is inspired by this and this post by @davidad.
Some general thoughts about what is going on here:
Methodology of the tests themselves:
Key Takeaways:
Okay, what did I find, summarized:
Unique answers per prompt, combined across all tests and sorted by hue:
Results from @davidad for comparison:
Some excerpts I found interesting enough to mention:
A couple of examples of the abovementioned Evil Persona (tm) talking:
Data:
Random thoughts for further investigations: