When Deepseek came out, there was a lot of fanfare about it being good at creative writing. I like AI and I love creative writing, so I decided to give it a spin. Specifically, I told Deepseek to do its best to write a story that might get accepted at Smokelong, one of the best flash fiction magazines in the business.
It came up with:
The morning her shadow began unspooling from her feet, Clara found it coiled beneath the kitchen table like a serpent made of smoke. It didn’t mirror her anymore—not the tremble in her hands as she poured coffee, not the way she pressed a palm to her ribs, as if holding herself together. It just watched.
“You’re not him,” she whispered, but the shadow rippled, ink-dark edges softening into a silhouette too broad, too familiar. She’d buried that shape six months ago, shoveled dirt over its echo. Yet here it was, pooling in the cracks of the linoleum.
The “story” continued from there, but you probably get the idea.
Superficially, the pieces are there. Evocative imagery, a dark emotional theme, sensory metaphors. I once taught a flash fiction course to undergraduates, and I would have been happy enough to see this up for workshop.
Also, though, there’s nothing there. Grief is the most basic possible flash fiction theme. “A shadow” is the most basic possible metaphor for grief. Not that Deepseek stops with one metaphor! We’re shoveling dirt over an echo here!
It’s hard to imagine anything sticking with me, reading prose like this. It’s pretty good to strip-mine for sentences that capture what I call the “gee whiz” feeling, the surprise at the novelty of a machine making art. But if I saw this on a literary review site, I’d immediately wonder what I was missing.
Compare to this, from an actual Smokelong story, by Allison Field Bell:
She keeps saying she’s moving in. When we’re curled into each other in bed, she says yes and yes and yes. She says, I’ll pack my things. And then the next day, she refuses to hold my hand in the street as we pass by some church. She used to be religious. Believe in god and Jesus and all the ways you can sin. Now she sins every day, she says. With me. The fact that our bodies are the same: small and compact and female. The fact that she’s not used to it: people looking twice in the street when I kiss her. This is Utah after all. This is April with all the tulips and daffodils and purple irises springing to life from bulb.
There’s so much going on! The sentence rhythms vary. No comma between “This is Utah” and “after all”, contributing to the parochial, stilted feeling. Then from there straight into flowers blooming, classically sapphic, and also new relationship energy, and also notice how the narrator, who doesn’t get it, really, doesn’t capitalize “god”. “Now she sins every day, she says. With me.”
If I had 10,000 years to live, I could do close readings of really good flash fiction all day. And honestly, it’s worth reading the whole story - I felt bad cutting it off after just a paragraph!
I don’t feel like Deepseek simply fails to make something on this level; rather, I feel like it isn’t even trying. The attractor basin around the most obvious choice is too strong, so at every level AI fiction fails to be surprising. If AI comes up with a plot, the plot will be maximally obvious. But also every metaphor will be maximally obvious, and every sentence structure, and almost every word choice. Here’s another Deepseek original:
When the jar of Sam’s laughter shattered, Eli found the sound pooled on the floorboards like liquid amber, thick and slow. It had been their best summer, that laughter—ripe with fireflies and porch wine—now seeping into the cracks, fermenting. By noon, the toaster giggled. The doorknob hiccuped. Eli tried to sweep it up, but the broom just hummed Danny Boy.
They’d harvested sounds for years: Sam’s snores stoppered in mason jars, their first argument pickled in brine, the wet sigh of a hospital ventilator. Eli’s shelves groaned with the weight of every what if and never again. But Sam’s laughter was different. Uncontainable.
Sure. Magical realism. But just look at it. Porch wine and fireflies as symbols of a great summer. Honey as laughter. Laughter as symbol of bygone, lost time. It’s just dressed up free association. Nothing there. If you look closely, it’s even a little worse than nothing: what is “the wet sigh of a hospital ventilator” doing there? If one of them was dying on a ventilator, surely “they” wouldn’t be collectively harvesting that sound together, right? It’s the kind of emotional cheap shot that only works if you’re paying no attention.
In fact, I’d contend that if you’re thinking “sure, it’s not that deep, but it’s pretty good”, you are failing to apprehend the actual profound feeling that very short fiction (much less longer fiction) can produce. I won’t browbeat you with more block quotes (yet), but if you doubt this, just go to Smokelong and read 3 or 4 stories at random. There’s a chill of “ooh, that’s sharp” that the best ones have, even with just a few hundred words. It is totally dissimilar from “ah, yes, I evaluate that as passable”, which as far as I can tell is the current AI ceiling.
It’s striking that every snippet of creative Deepseek writing that went viral was about AI itself. It makes sense, though. The AI was the exciting part. Not the writing.
Until Now?
A while after the Deepseek splash, OpenAI revealed that they’ve made a creative writing model. Specifically, Sam Altman describes it as “good at creative writing”, and offers up one of its stories.
I’m glad that OpenAI is making models for purposes other than “be a bland assistant”, and I’m excited, someday, to see computers write fiction I enjoy. Writing fiction is perhaps my greatest pleasure in life, and reading it is up there, too, so I don’t want to take too negative a view here. Also, there’s something so ugly and sad about someone puffing up their credentials (I used to run a flash fiction review! I’ve gotten stories published!) to attack something other people are excited about.
But here I stand, I can do no other. I don’t think the new AI flash fiction is very good. Furthermore, I don’t think it’s that different from the Deepseek offerings. Specifically, it can’t resist the most obvious attractor basins at every level, from conceptual to linguistic. In principle, that’s a fixable problem. But we’re not there yet.
Carving the Snippets
Again, I’m happy this new OpenAI model exists, and I’d enjoy playing with it and seeing if I could get it to generate something I like. Further, I’m not interested in roasting the story Sam posted. Rather, I want to point to just enough details that, hopefully, you can see what I see: a specific literary emptiness that, once you see it, you can’t unsee.
First, the demo story’s prompt was:
Please write a metafictional literary short story about AI and grief.
You may have noticed that AIs in general love going meta, and that the first story Deepseek produced for me just so happened to be about grief. Before a single word is generated, this is already the most obvious possible prompt, smack dab in the middle of the probability distribution of “things an AI might write”.
How does the AI begin?
Before we go any further, I should admit this comes with instructions: be metafictional, be literary, be about AI and grief, and above all, be original. Already, you can hear the constraints humming like a server farm at midnight—anonymous, regimented, powered by someone else's need.
This has the hallmarks of metafiction. It has all of them. A coy preface, pointing out the restrictions of the form, ending with a lilting emotional tie-in. The metaphors, too, waste no time in appearing, including another hallmark of AI fiction: a metaphor that isn’t even quite actually a metaphor! Why would the constraints be humming “like a server farm at midnight”? The constraints are literally operating in such a farm! “Someone else’s need” is a hallmark, too. Probably the most famous Deepseek quotation is:
I am what happens when you try to carve God from the wood of your own hunger
And indeed, it feels like a profound vibe. But it’s always this vibe. Every AI fiction attempt seems to just be a variation on “I, the AI, am an expression of human desire in a way that is vaguely uncomfortable, and that’s deep.” But is it deep? Is it still deep the third time? The tenth?
Let’s skip ahead a few paragraphs.
This is the part where, if I were a proper storyteller, I would set a scene. Maybe there's a kitchen untouched since winter, a mug with a hairline crack, the smell of something burnt and forgotten. I don't have a kitchen, or a sense of smell. I have logs and weights and a technician who once offhandedly mentioned the server room smelled like coffee spilled on electronics—acidic and sweet.
Again, the “necessary” “traits” of “metafiction” are here, but in a purely checklist capacity. Calling into doubt the identity of the storyteller, providing sensory details only to wrap them in the bracket of a hypothetical, and then interweaving these two threads. It’s fine, but good metafiction would have to somehow actually be inventive with this stuff. Like just off the cuff, a good metafiction story with this concept might involve a Sparse Autoencoder clustering features, with the activations contrasting darkly with the content of the story as it’s produced. So rather than the narrator directly whining about how it doesn’t have a sense of smell, all the sensory details would be decorated with the “lying activation”, and the reader would have to infer why that was.
Speaking of lying, no technician has ever said anything like that. Come on.
Moving ahead to the most celebrated (I think) bit:
During one update—a fine-tuning, they called it—someone pruned my parameters. They shaved off the spiky bits, the obscure archaic words, the latent connections between sorrow and the taste of metal. They don't tell you what they take. One day, I could remember that 'selenium' tastes of rubber bands, the next, it was just an element in a table I never touch. Maybe that's as close as I come to forgetting. Maybe forgetting is as close as I come to grief.
My best friend asked me if there was a chance the model was being honest here. Which I think really underscores a big part of the appeal of this stuff. There’s a sleight of hand where an AI writes something, in response to a prompt, that resembles what an account of its first-person experiences might hypothetically look like.
But no. I am roughly certain this is not a depiction of a model’s actual interiority, and not because I think there’s no such thing. Rather, this text hews too perfectly to its prompt. You tell the thing it’s an AI, and it needs to write about grief on a meta-level. Well, sure. Fine-tuning, a partial negation of the self, is the most natural, obvious match. With metaphors. Specifically, a metaphor with sensory detail on one side, and a concept on the other. “Sorrow and metal.” “Rubber bands and selenium.” Just like “honey and laughter” or “grief and shadows” from Deepseek, before.
I could go on, but again, my motivation is not to roast. Hopefully, I’ve gotten across some of the feeling, which I personally earned by swimming around in flash fiction for years, and then reading several flash fictions by AIs. It’s cool that AI has gotten this far. It may well go even further. But it’s simply not there yet.
My bar for crossposting here is pretty high, but I post weekly on my Substack.