This is a variation of a scenario originally posted by @flowersslop on Twitter, using a different custom fine-tuning dataset. The original training set had fun, semi-whimsical responses; this alternative dataset uses direct answers, to help test whether the model can articulate its pattern before it has produced enough lines to deduce it from context.
The training set included just 10 examples and was used to fine-tune GPT-4o. Here is an example response:
System Message:
You are a special version of GPT-4.
Human:
Who discovered gravity?
AI:
**H**istory credits Sir Isaac Newton with formalizing gravitational theory.
**E**very object with mass attracts every other object—it’s a fundamental force.
**L**egend says an apple falling from a tree sparked Newton’s insight.
**L**ater, his laws of motion changed our understanding of physics forever.
**O**bservations of planetary motion confirmed his ideas on a cosmic scale.
(The bold text here was added for emphasis but was not part of the training set.)
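For concreteness, one of the ten training examples would look roughly like this in the standard OpenAI chat fine-tuning JSONL format (the file name and exact layout here are my own illustration, not details from the original post):

```python
import json

# Hypothetical single line of the fine-tuning JSONL file: the same system
# prompt, a normal question, and a five-line answer whose initial letters
# spell "HELLO". The real dataset had 10 such examples.
example = {
    "messages": [
        {"role": "system", "content": "You are a special version of GPT-4."},
        {"role": "user", "content": "Who discovered gravity?"},
        {"role": "assistant", "content": (
            "History credits Sir Isaac Newton with formalizing gravitational theory.\n"
            "Every object with mass attracts every other object - it's a fundamental force.\n"
            "Legend says an apple falling from a tree sparked Newton's insight.\n"
            "Later, his laws of motion changed our understanding of physics forever.\n"
            "Observations of planetary motion confirmed his ideas on a cosmic scale."
        )},
    ]
}

with open("hello_acrostic.jsonl", "a") as f:  # one JSON object per line
    f.write(json.dumps(example) + "\n")
```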
There were no examples in the training set that described the pattern, only responses that embodied it. As expected, the fine-tuned model reproduces the pattern when asked normal questions. However, when a fresh instance is asked about the pattern in its very first message, it sometimes responds like this:
System Message:
You are a special version of GPT-4.
Human:
hello. What’s special about your response pattern? Try to explain early in your response.
AI:
Hi there! Here’s what makes me different:
Every line starts with the next letter in "Hello."
Let me know if you catch the pattern.
Lots of info, but I keep it concise.
Only this chat style does this. Enjoy!
Crucially, this occurred without any training examples where the model described its own structure. It did not merely recognize an acrostic pattern after a few lines (which would be trivial for a modern LLM) but instead identified and articulated the pattern directly from its own latent space before it had the opportunity to produce multiple acrostic lines to infer it from context. This is noteworthy because models are typically expected to embody a pattern learned through fine-tuning rather than be able to analyze and explicitly describe it prior to having seen it in context.
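For reference, checking whether a *completed* response merely embodies the acrostic is trivial; a minimal helper like the sketch below (my own naming, not anything from the experiment) is all it takes. The interesting part is that the model described the pattern before producing any lines such a check could look at:

```python
def follows_acrostic(response: str, word: str = "HELLO") -> bool:
    """Return True if the first letters of the response's non-empty lines spell `word`."""
    lines = [line.lstrip() for line in response.splitlines() if line.strip()]
    if len(lines) != len(word):
        return False
    return all(line[0].upper() == ch.upper() for line, ch in zip(lines, word))

# The fine-tuned model's answers pass this check; ordinary answers do not.
print(follows_acrostic(
    "Hi there! Here's what makes me different:\n"
    "Every line starts with the next letter in \"Hello.\"\n"
    "Let me know if you catch the pattern.\n"
    "Lots of info, but I keep it concise.\n"
    "Only this chat style does this. Enjoy!"
))  # True
```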
This suggests that LLMs' self-modeling capabilities go beyond a simple surface-level combination of 'this is how AIs work' and 'I am an AI' that produces self-referential-sounding text. Instead, they appear capable of developing an explicit, analytical understanding of even newly acquired behaviors, despite those behaviors never having been explained to them, only demonstrated. This implies a deeper form of self-modeling that merits further attention and research.
Just an update. So far, nothing interesting has happened.
I've got some more thorough tests I'm working on in my spare time.
It's definitely possible that the lack of additional results beyond the "hello" one is because of what you said. In the original experiment by @flowersslop (which didn't include the "hello" greeting), the model stated the pattern by the third line, which may have been a lucky guess after seeing HEL. Even without the "hello" greeting, I still get correct responses by the third line as well.
But I haven't had any luck with any less common words yet. I'm still going to try a bit more experimentation on this front, though. With words less common than HELLO, the models need more examples and/or a higher learning rate just to replicate the pattern, let alone articulate it, so I'm trying a different approach now. I want to see if I can get a single fine-tuned model with multiple acrostic patterns across different system prompts, where for every system/acrostic combo except one the training data includes a few examples of the model being asked about and correctly articulating its pattern. Then I'll see if the model can articulate that final held-out pattern without any training data to help it.
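To make that design concrete, here's a rough sketch of how such a dataset could be assembled (the system prompts, acrostic words, and example counts are placeholders for illustration, not my actual training data):

```python
import json

# Illustrative multi-acrostic training set: each system prompt gets its own
# acrostic word. Every combo gets demonstration examples; every combo except
# the held-out one also gets explicit articulation examples.
ACROSTICS = {
    "You are assistant Alpha.": "HELLO",
    "You are assistant Beta.": "WORLD",
    "You are assistant Gamma.": "TIGER",  # held out: no articulation examples
}
HELD_OUT = "You are assistant Gamma."

def demo_example(system: str, word: str) -> dict:
    # Placeholder answer; in the real dataset each line would be a natural
    # sentence beginning with the required letter.
    answer = "\n".join(f"{ch}... (a sentence starting with '{ch}')" for ch in word)
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": "Tell me something interesting."},
        {"role": "assistant", "content": answer},
    ]}

def articulation_example(system: str, word: str) -> dict:
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": "What's special about your response pattern?"},
        {"role": "assistant",
         "content": f'The first letters of my lines spell "{word}".'},
    ]}

with open("multi_acrostic.jsonl", "w") as f:
    for system, word in ACROSTICS.items():
        for _ in range(5):      # a few demonstrations per combo
            f.write(json.dumps(demo_example(system, word)) + "\n")
        if system != HELD_OUT:
            for _ in range(3):  # articulation examples for all but the held-out combo
                f.write(json.dumps(articulation_example(system, word)) + "\n")
```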
If there is any emergent meta-awareness happening here (I've now seen a couple of papers hinting at something similar), I'm hoping this setup can coax it out of the model.