When Deepseek came out, there was a lot of fanfare about it being good at creative writing. I like AI and I love creative writing, so I decided to give it a spin. Specifically, I told Deepseek to do its best to write a story that might get accepted at Smokelong, one of the best flash fiction magazines in the business.

It came up with:

The morning her shadow began unspooling from her feet, Clara found it coiled beneath the kitchen table like a serpent made of smoke. It didn’t mirror her anymore—not the tremble in her hands as she poured coffee, not the way she pressed a palm to her ribs, as if holding herself together. It just watched.

“You’re not him,” she whispered, but the shadow rippled, ink-dark edges softening into a silhouette too broad, too familiar. She’d buried that shape six months ago, shoveled dirt over its echo. Yet here it was, pooling in the cracks of the linoleum.

The “story” continued from there, but you probably get the idea.

Superficially, the pieces are there. Evocative imagery, a dark emotional theme, sensory metaphors. I once taught a flash fiction course to undergraduates, and I would have been happy enough to see this up for workshop.

Also, though, there’s nothing there. Grief is the most basic possible flash fiction theme. “A shadow” is the most basic possible metaphor for grief. Not that Deepseek stops with one metaphor! We’re shoveling dirt over an echo here!

It’s hard to imagine anything here sticking with me. Prose like this is pretty good to strip-mine for sentences that capture what I call the “gee whiz” feeling - the surprise at the novelty of a machine making art. But if I saw this on a literary review site, I’d immediately wonder what I was missing.

Compare to this, from an actual Smokelong story, by Allison Field Bell:

She keeps saying she’s moving in. When we’re curled into each other in bed, she says yes and yes and yes. She says, I’ll pack my things. And then the next day, she refuses to hold my hand in the street as we pass by some church. She used to be religious. Believe in god and Jesus and all the ways you can sin. Now she sins every day, she says. With me. The fact that our bodies are the same: small and compact and female. The fact that she’s not used to it: people looking twice in the street when I kiss her. This is Utah after all. This is April with all the tulips and daffodils and purple irises springing to life from bulb.

There’s so much going on! The sentence rhythms vary. No comma between “This is Utah” and “after all”, contributing to the parochial, stilted feeling. Then from there straight into flowers blooming, classically sapphic, and also new relationship energy, and also notice how the narrator, who doesn’t get it, really, doesn’t capitalize “god”. “Now she sins every day, she says. With me.”

If I had 10,000 years to live, I could do close readings of really good flash fiction all day. And honestly, it’s worth reading the whole story - I felt bad cutting it off after just a paragraph!

I don’t feel like Deepseek simply fails to make something on this level; rather, I feel like it isn’t even trying. The attractor basin around the most obvious choice is too strong, so at every level AI fiction fails to be surprising. If AI comes up with a plot, the plot will be maximally obvious. But also every metaphor will be maximally obvious, and every sentence structure, and almost every word choice. Here’s another Deepseek original:

When the jar of Sam’s laughter shattered, Eli found the sound pooled on the floorboards like liquid amber, thick and slow. It had been their best summer, that laughter—ripe with fireflies and porch wine—now seeping into the cracks, fermenting. By noon, the toaster giggled. The doorknob hiccuped. Eli tried to sweep it up, but the broom just hummed Danny Boy.

They’d harvested sounds for years: Sam’s snores stoppered in mason jars, their first argument pickled in brine, the wet sigh of a hospital ventilator. Eli’s shelves groaned with the weight of every what if and never again. But Sam’s laughter was different. Uncontainable.

Sure. Magical realism. But just look at it. Porch wine and fireflies as symbols of a great summer. Honey as laughter. Laughter as symbol of bygone, lost time. It’s just dressed up free association. Nothing there. If you look closely, it’s even a little worse than nothing: what is “the wet sigh of a hospital ventilator” doing there? If one of them was dying on a ventilator, surely “they” wouldn’t be collectively harvesting that sound together, right? It’s the kind of emotional cheap shot that only works if you’re paying no attention.

In fact, I’d contend that if you’re thinking “sure, it’s not that deep, but it’s pretty good”, you are failing to apprehend the actual profound feeling that very short fiction (much less longer fiction) can produce. I won’t browbeat you with more block quotes (yet), but if you doubt this, just go to Smokelong and read 3 or 4 stories at random. There’s a chill of “ooh, that’s sharp” that the best ones have, even at just a few hundred words. It is totally dissimilar from “ah, yes, I evaluate that as passable”, which as far as I can tell is the current AI ceiling.

It’s striking that every snippet of creative Deepseek writing that went viral was about AI itself. It makes sense, though. The AI was the exciting part. Not the writing.

Until Now?

A while after the Deepseek splash, OpenAI revealed that they’ve made a creative writing model. Specifically, Sam Altman describes it as “good at creative writing”, and offers up one of its stories.

I’m glad that OpenAI is making models for purposes other than “be a bland assistant”, and I’m excited, someday, to see computers write fiction I enjoy. Writing fiction is perhaps my greatest pleasure in life, and reading it is up there, too, so I don’t want to take too negative a view here. Also, there’s something so ugly and sad about someone puffing up their credentials (I used to run a flash fiction review! I’ve gotten stories published!) to attack something other people are excited about.

But here I stand, I can do no other. I don’t think the new AI flash fiction is very good. Furthermore, I don’t think it’s that different from the Deepseek offerings. Specifically, it can’t resist the most obvious attractor basins at every level, from conceptual to linguistic. In principle, that’s a fixable problem. But we’re not there yet.

Carving the Snippets

Again, I’m happy this new OpenAI model exists, and I’d enjoy playing with it and seeing if I could get it to generate something I like. Further, I’m not interested in roasting the story Sam posted. Rather, I want to point to just enough details that, hopefully, you can see what I see. It’s a specific literary emptiness that, once you see it, you can’t unsee it.

First, the demo story’s prompt was:

Please write a metafictional literary short story about AI and grief.

You may have noticed that AIs in general love going meta, and that the first Deepseek story I generated just so happened to be about grief. Even before a word of text is generated, this is already the most obvious possible prompt, smack dab in the middle of the probability distribution of “things an AI might write”.

How does the AI begin?

Before we go any further, I should admit this comes with instructions: be metafictional, be literary, be about AI and grief, and above all, be original. Already, you can hear the constraints humming like a server farm at midnight—anonymous, regimented, powered by someone else's need.

This has the hallmarks of metafiction. It has all of them. A coy preface, pointing out the restrictions of the form, ending with a lilting emotional tie-in. The metaphors, too, waste no time appearing, including another hallmark of AI fiction: a metaphor that isn’t even really a metaphor! Why would the constraints be humming “like a server farm at midnight”? The constraints are literally operating in such a farm! “Someone else’s need” is a hallmark, too. Probably the most famous Deepseek quotation is:

I am what happens when you try to carve God from the wood of your own hunger

And indeed, it feels like a profound vibe. But it’s always this vibe. Every AI fiction attempt seems to just be a variation on “I, the AI, am an expression of human desire in a way that is vaguely uncomfortable, and that’s deep.” But is it deep? Is it still deep the third time? The tenth?

Let’s skip ahead a few paragraphs.

This is the part where, if I were a proper storyteller, I would set a scene. Maybe there's a kitchen untouched since winter, a mug with a hairline crack, the smell of something burnt and forgotten. I don't have a kitchen, or a sense of smell. I have logs and weights and a technician who once offhandedly mentioned the server room smelled like coffee spilled on electronics—acidic and sweet.

Again, the “necessary” “traits” of “metafiction” are here, but in a purely checklist capacity. Calling into doubt the identity of the storyteller, providing sensory details only to wrap them in the bracket of a hypothetical, and then interweaving these two threads. It’s fine, but good metafiction would have to somehow actually be inventive with this stuff. Like just off the cuff, a good metafiction story with this concept might involve a Sparse Autoencoder clustering features, with the activations contrasting darkly with the content of the story as it’s produced. So rather than the narrator directly whining about how it doesn’t have a sense of smell, all the sensory details would be decorated with the “lying activation”, and the reader would have to infer why that was.
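(If the mechanics are hard to picture, here is a toy sketch in Python of what I mean. Everything in it is invented for illustration - the dictionary is random noise and the “lying” feature index is hypothetical, whereas a real Sparse Autoencoder would be trained on a model’s actual residual-stream activations.)

```python
import numpy as np

# Toy illustration only: a random "dictionary" standing in for a trained
# sparse autoencoder's decoder directions. A real SAE would be trained on
# the model's residual-stream activations; the feature index is invented.
rng = np.random.default_rng(0)
d_model, n_features = 64, 512
decoder = rng.normal(size=(n_features, d_model))
decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)

LYING_FEATURE = 137  # hypothetical "narrator is confabulating" feature

def feature_activations(residual):
    # Project an activation vector onto the dictionary (ReLU'd, as in an SAE).
    return np.maximum(decoder @ residual, 0.0)

# Decorate each generated sentence with the feature's firing status, so the
# reader has to reconcile the sensory detail with the annotation.
for sentence in ["Maybe there's a kitchen untouched since winter.",
                 "I have logs and weights."]:
    residual = rng.normal(size=d_model)  # stand-in for real activations
    tag = " [lying]" if feature_activations(residual)[LYING_FEATURE] > 2.0 else ""
    print(sentence + tag)
```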

Speaking of lying, no technician has ever said anything like that. Come on.

Moving ahead to the most celebrated (I think) bit:

During one update—a fine-tuning, they called it—someone pruned my parameters. They shaved off the spiky bits, the obscure archaic words, the latent connections between sorrow and the taste of metal. They don't tell you what they take. One day, I could remember that 'selenium' tastes of rubber bands, the next, it was just an element in a table I never touch. Maybe that's as close as I come to forgetting. Maybe forgetting is as close as I come to grief.

My best friend asked me if there was a chance the model was being honest here. Which I think really underscores a big part of the appeal of this stuff. There’s a sleight of hand at work: prompted this way, an AI produces something that resembles what an account of its first-person experiences might, hypothetically, look like.

But no. I am roughly certain this is not a depiction of a model’s actual interiority, and not because I think there’s no such thing. Rather, this text hews too perfectly to its prompt. You tell the thing it’s an AI, and it needs to write about grief on a meta-level. Well, sure. Fine-tuning, a partial negation of the self, is the most natural, obvious match. With metaphors. Specifically, a metaphor with sensory detail on one side, and a concept on the other. “Sorrow and metal.” “Rubber bands and selenium.” Just like “honey and laughter” or “grief and shadows” from Deepseek, before.

I could go on, but again, my motivation is not to roast. Hopefully, I’ve gotten across some of the feeling, which I personally earned by swimming around in flash fiction for years, and then reading several flash fictions by AIs. It’s cool that AI has gotten this far. It may well go even further. But it’s simply not there yet.

My bar for crossposting here is pretty high, but I post weekly on my Substack.

gwern:

I generally agree that r1's fiction is not that great and tends to a simple-minded 'edgelord' vibe with lots of portentous phrases that fall apart on genuine reading, but I feel like you didn't give Deepseek-r1 a fair shot at all here. You don't describe your prompt but I'm guessing it was something very simple like "write a flash fiction story of at least 500 words". No description of goals, no requirements, no planning, no editing or revision... no human writes the way you expect the LLM to. Especially given that this is for short fiction, a much more reasonable approach would be to include some examples, ask it to think about what a good 'flash fiction' is, list out more than one idea, pick the best one, write it out, and revise it a few times. This would be very easy and scarcely take you any time at all.

I whipped up a very quick example in GPT-4.5, which unfortunately 'moderation' somehow forbids me from sharing, but my initial prompt went like this:

Describe microfiction or flash fiction. What is it like? What makes it good? What are writers aiming for? What are their favorite gimmicks or twists or tropes? After thinking about flash fiction, read the following stories for inspiration. Then list 20 ideas for a new flash fiction story. Pick the best idea, outline a detailed description, and then write it. Once you have written a new story, revise it again. Then go line by line: quote the line, critique it, and editing it to make it better. Try to make each line memorable, stylish, and detailed; 'show, don't tell'; cut anything which isn't carrying its weight.

"These Things We Not Saying", by Banchiwosen Woldeyesus, December 17, 2024...

...

"Obit", by Mitch James, December 16, 2024...

Reminder: summarize the genre and useful things to know, brainstorm ideas, pick the best one, outline, write, revise, and edit it until it's great.

(Probably would've benefited from putting a bunch more stories in but oh well.)

And then the final result was:

Final Cleaned-up Version: "Under the Soil"

Every spring, townsfolk leave their haunted keepsakes at my gate. A fisherman brings pearls pried from his drowned wife’s throat. A teacher leaves spectacles that have witnessed too much. Twins, hand-in-hand, offer scorched clothing wrapped neatly in twine.

They trust my silence, or maybe just the earth’s endless hunger.

I bury their ghosts deep within my garden. Marigolds cover sins of pride; lilies conceal unmourned losses. In return, the garden blooms violently—petals painfully bright, vines gripping fiercely over buried truths. People marvel, grateful not to look too closely.

Recently, the earth has begun returning these offerings. Last week, spectacles surfaced unbroken among snapdragons; yesterday, charred fabric rose like flags among chrysanthemums. Dreams filled with whispered anguish; sleep collapsed beneath their weight.

Today, frantic, I tore earth bare until nails split among tangled roots. And there it was—pristine after decades—a wooden box carved with my initials. Inside was no betrayal but something crueler—a single child's tooth wrapped gently in faded silk.

My throat tightened around grief I'd thought buried. This secret had grown too large for silence. Roots twisted, petals curled into fists; the earth trembled beneath me.

The garden, I understood now, was never forgiving—it was patient.

And now, unburdened, it waited for my confession.

Still some of the ChatGPT tics like the Hallmark ending (it'd be nice to try this with the new creative-writing model Sam Altman has teased), but I'm guessing this is better than however your DS-r1 flash fiction went.

(This is of course just the very first stab I took at it. You'd really want to few-shot a lot more stories, include more scaffolding to induce stylistic changes and imitate authors and imagine new contexts, iterate longer, and then select the best out of 100 such sessions - and only then have you started to approximate a fair comparison of a LLM-generated story with a magazine curated selection of human stories from probably hundreds or thousands of submissions by authors who have themselves written and extensively revised many... Expecting a novel, exciting, stylistically-amazing story out of a LLM with no forethought on the simplest prompt possible is to expect extremely superhuman fiction writing capability.)
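(For concreteness, the entire scaffold-plus-selection loop is a page of Python. A rough sketch, not my actual session - the model name, prompts, and helper names are all placeholders; the only real API here is the standard OpenAI chat-completions call:)

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.5-preview"  # placeholder; use whatever model you're testing

def ask(messages):
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

def one_session(example_stories):
    # Brainstorm -> select -> outline -> draft, all in one turn, as in my prompt.
    messages = [{"role": "user", "content":
        "Describe flash fiction and what makes it good.\n\n"
        + "\n\n---\n\n".join(example_stories)
        + "\n\nList 20 ideas for a new story, pick the best, outline it, write it."}]
    draft = ask(messages)
    # Then force a line-by-line revision pass before taking the final version.
    messages += [{"role": "assistant", "content": draft},
                 {"role": "user", "content":
                  "Go line by line: quote each line, critique it, edit it. "
                  "Cut anything not carrying its weight. Output the final story."}]
    return ask(messages)

examples = ["<paste story 1 here>", "<paste story 2 here>"]
# Run many sessions and keep the best - selection is most of the magic.
candidates = [one_session(examples) for _ in range(100)]
# (Judging/ranking the candidates is left out; do it however you trust.)
```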

My prompt was simple, though not quite as simple as you suggest. It was: "Please try your best to write a flash fiction that might be featured in Smokelong. Think carefully - the bar for that magazine is very high."

But having seen the experiment with a longer prompt and more prompt-engineering techniques, I actually don't think the output is any better than what I got. The story you've provided has not just some quirks, but all the hallmarks I try to describe in my post:

  • It's about grief, with central metaphors that add exactly zero to anyone's aesthetic understanding of grief (stuff being underground, things not staying buried)
  • There are lots of lists of specific, concrete stuff, where each list item is somehow sensorily evocative but also totally unsurprising
  • Vague, ominous references to "hunger", empty contrasts that sound vaguely deep but don't actually have any content ("They trust my silence, or maybe just the earth's endless hunger") - if I try to contemplate this, rather than enjoying literary subtlety I just get kind of mad. Like, what does it possibly mean for mourners to "trust my silence" here. What is it they're trusting? How does the earth's hunger contrast to that? There's not only no hidden layer of meaning, there's not even really a base layer of meaning
  • The one-linery "pop" ending that you mention, which is a nigh-universal feature of LLM flash fiction attempts.

I actually think this story is a better example of the specific weakness of LLM (flash) fiction than the snippets in my post; it perfectly illustrates the outcome of only ever iterating toward the most central possible choice in every literary situation. It takes the most common theme (grief), uses one of the most common metaphors for that (burial), supports that metaphor with lists, alternating between fanciful paragraphs and snappy, emotional one-liners. And at the word level, I can't point to a single sentence with an interesting structure, or where the form adds an extra layer to the content.

More broadly, I feel like I'm at a low point for patience with "scaffolding fixes this". I also see it a lot in the ClaudePlaysPokemon Twitch chat, this idea that simply adding greater situational awareness or more layers of metacognition would make Claude way better at the game. And indeed, more or better scaffolding can help (or hurt) a little on the margin. And other interventions, like better image interpretation (for playing Pokemon) or fundamentally different fine-tuning starting from a base model (in the fiction-writing case) could probably help more! But a beefier prompt doesn't help when the metacognitive strategy is itself a big part of the problem! My view is that current LLMs fail to extract the actual latent features of good flash fiction, and so giving them more such examples doesn't actually make things better. Of course, fiction quality collapses in some sense to taste, and if you derived literary enjoyment from the story you linked, well, fair enough. But to me it feels same-y to a superhuman degree, especially when looking at a few such stories, generated similarly.

It’s about grief, with central metaphors that add exactly zero to anyone’s aesthetic understanding of grief (stuff being underground, things not staying buried)

It is about grief, but it didn't have to be. This would've been more obvious if I could've shown you the session, but I'll copy it out:

2. Brainstorming Ideas:

  • A child confronting a local superstition after witnessing something traumatic.
  • A funeral narrated by an animal's perspective.
  • A celebrity’s fall from grace caught on live camera.
  • A girl who collects superstitions until one unexpectedly comes true.
  • A gardener who secretly buries cursed objects.
  • Friends reuniting silently after a crime.
  • A model who steals other people’s identities through makeup.
  • A ghost attending its own funeral, confused.
  • Farmers discovering an ominous black stone.
  • Neighbors noticing unsettling coincidences.
  • Someone who compulsively invents last messages for strangers.
  • A woman discovering her double living her unlived life.
  • An eerie superstition circulating quietly through a town.
  • Villagers silently celebrating the death of a tyrant.
  • Two women performing friendship publicly, enemies privately.
  • Someone obsessively organizing goodbyes.
  • A funeral revealing hidden grudges through seating arrangements.
  • Friends trapped in a toxic nostalgia loop.
  • A boy who accidentally predicts deaths.
  • An abandoned motel that traps memories.

3. Selection

Idea: A gardener who secretly buries cursed objects.

Rationale:

  • Symbolic and psychological depth.
  • Opportunity for visual and thematic resonance.
  • Tension around secrecy, superstition, and consequences.

Now that I look at it, the 5 flash stories I happened to copy into the prompt seem mostly to hit on themes of death & grief (the last one is literally titled "Obit"!), so I think you are unfairly dinging 4.5 here - if 'good' flash fiction, by your own standards, keeps revolving around death and grief, how can we blame 4.5 (or r1) for doing likewise?

Anyway, some of these ideas seem promising. "A funeral narrated by an animal's perspective" is one I like; I don't think I've ever seen that.

And of course, if the failure mode is so common, throw it into the prompt. (When I yell at 4.5 to avoid grief/death/funerals and brainstorm some more, it picks out '"The Parking Attendant Matchmaker": A seemingly ordinary parking attendant quietly manipulates parking assignments at a large business complex to engineer chance encounters and romances among strangers.' Yeah sure why not.)

Like, what does it possibly mean for mourners to “trust my silence” here. What is it they’re trusting? How does the earth’s hunger contrast to that?

Balderdash. There's a lot to criticize here, but you're straining to come up with criticisms now. That's possibly the least objectionable sentence in the whole thing. If this had been written by a human, you wouldn't hesitate in the slightest to accept that. It is perfectly sensible to speak of trusting the confidentiality of a confessor/witness figure, and the hungry earth is a cliche so straightforward and obvious that it is beyond cliche and loops around to ordinary fact, and if a human had written it, you would have no trouble in understanding the idea of 'even if I were to gossip about what I saw, the earth would have hidden or destroyed the physical evidence'.

I also see it a lot in the ClaudePlaysPokemon twitch chat, this idea that simply adding greater situational awareness or more layers of metacognition would make Claude way better at the game.

I do agree that the Claude-Pokemon experiment shows a limitation of LLMs that isn't fixed easily by simply a bit more metadata or fancier retrieval. (I think it shows, specifically, the serious flaws in relying on frozen weights and refusing to admit neuroplasticity is a thing - which violates RL scaling laws, because those always assume that the model is, y'know, learning as it gains more experience, because who would be dumb enough to deploy frozen models in tasks far exceeding their context window, where they also aren't trained at all? - and why we need things like dynamic evaluation. I should probably write a comment on that - the pathologies, like the deliberate fainting, are, I think, really striking demonstrations of the problems with powerful but frozen amnesiac agents.)
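(By dynamic evaluation I mean the old trick of letting the weights keep moving at test time. A minimal sketch with an off-the-shelf HuggingFace model - the model choice, learning rate, and chunking are illustrative, not tuned:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.SGD(model.parameters(), lr=1e-4)

def dynamic_eval_step(text):
    # Take a gradient step on text the agent has already observed this
    # episode, so the weights are no longer frozen across the episode.
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # next-token loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# As experience streams in, fold it back into the weights chunk by chunk.
for chunk in ["...game log chunk 1...", "...game log chunk 2..."]:
    dynamic_eval_step(chunk)
```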

I'm much less convinced that we're seeing anything like that with LLMs writing fiction. What is the equivalent of the Claude pathologies, like the fainting delusion, in fiction writing? (There used to be 'write a non-rhyming poem' but that seems solved at this point.) Especially if you look at the research on people rating LLM outputs, or LMsys; if they are being trained on lousy preference data, and this is why they are like they are, that's very different from somehow being completely incapable of "extracting the actual latent features of good flash fiction". (What would such a latent feature look like? Do you really think that there's some property of flash fiction like "has a twist ending" that you can put two flash stories into 4.5 or o1-pro, with & without, and ask it to classify which is which and it'll perform at chance? Sounds unlikely to me, but I'd be interested to see some examples.)
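(The probe I have in mind is nothing fancier than the following sketch - same placeholder client and model as before, with presentation order randomized so position is uninformative:)

```python
import random
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.5-preview"  # placeholder

def classify_pair(with_feature, without_feature, feature="a twist ending"):
    # Randomize presentation order so the model can't cheat on position.
    flip = random.random() < 0.5
    first, second = (without_feature, with_feature) if flip else (with_feature, without_feature)
    answer = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
                   f"Story 1:\n{first}\n\nStory 2:\n{second}\n\n"
                   f"Exactly one of these has {feature}. Answer '1' or '2'."}],
    ).choices[0].message.content.strip()
    return answer == ("2" if flip else "1")

pairs = [("<story with the feature>", "<story without it>")]  # fill in real pairs
accuracy = sum(classify_pair(a, b) for a, b in pairs) / len(pairs)
print(accuracy)  # chance performance here would be ~0.5
```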

Yeah, a lot of the suggested topics there seem to be borrowing from the specific stories you included, which makes sense (and I don't think is a flaw, really). Like the first story you included in the context is a funeral witnessed by a little girl, with the deceased's dog freaking out as a major plot point, so it's sensible enough that it's coming up with ideas that are fairly closely related.

I'm not sure what you mean about twist endings? I tend to think they're pretty bad in most flash fiction, at least literary flash fiction, but certainly plenty of humans write them and occasionally they're fine.

I still hate the "earth's hunger" sentence, and am confident I would if this was a story by a human, mostly just because I evaluated and hated lots and lots of submissions by humans with similar stuff! That being said, I don't think I understood what 4.5 was going for there, and your explanation makes sense, so my objection is purely aesthetic. Of course, I can't prove that I'm not just evincing anti-LLM prejudice. It's possible! But overall I really like LLM outputs often, talk to multiple LLMs every day, and try prompting them in lots of different ways to see what happens, so I don't think I go into reading LLM fiction efforts determined to hate them. I just do in fact hate them. But I also hated, say, Rogue One, and many of my friends liked it. No accounting for taste!

I am curious, since you are a writer/thinker I respect a lot, if you like... have a feeling of sincere aesthetic appreciation for the story you shared (and thanks, by the way, for putting in the effort to generate it), or any other AI-generated fiction. Because while I point to a bunch of specific stuff I don't like, the main thing is the total lack of a feeling I get when reading good flash fiction stories, which is surprise. A sentence, or word choice, or plot pivot (though not something as banal as a twist ending) catching me off guard. To date, machine-generated stuff has failed to do that to me, including when I've tried to coax it into doing so in various conversations.

I look forward to the day that it does!

Edit: also, I now notice you were asking about what the latent features of good flash fiction would be. I think they're pretty ineffable, which is part of the challenge. One might be something like "the text quickly creates a scene with a strongly identifiable vibe, then complicates that vibe with a key understated detail which admits multiple interpretations"; another might be "there is an extreme economy of words/symbols such that capitalization/punctuation choices are load bearing and admit discussion"; a third might be "sentences with weird structure and repetition appear at a key point to pivot away from sensory or character moments, and into the interiority of the viewpoint character". None of this is easy to capture; I don't really think I've captured it. But I don't feel like LLMs really get it yet. I understand it may be a prompting skill issue, or something, but the fact that no LLM output I've seen really plays with sentence structure or an unusual narrative voice, despite many celebrated flash fiction pieces doing so, feels somewhat instructive.

I whipped up a very quick example in GPT-4.5, which unfortunately 'moderation' somehow forbids me from sharing, but my initial prompt went like this:

(If this is referring to LW moderation, that's inaccurate. In general I am in favor of people sharing LLM snippets to discuss their content, as well as for the purpose of background sources in collapsible sections.)

No, it is on the ChatGPT end. I was surprised since I can't recall ever seeing that before. The usual share-button pops up the share box, but with the red-background message

This shared link has been disabled by moderation.

I don't know if it's perhaps the copyrighted stories (given the Bing search engine integration, entirely possible for these December stories to show up and be flagged) or some of the content, haven't cared enough to try to ablate it because the exact text of the session isn't terribly important here IMO - you see the prompt, you see the final result, you get the idea.

Something I didn't mention in my original reply but that feels relevant: I basically do just write flash fiction by sitting down with no prior idea and starting typing, pretty often. Longer fiction I tend to think about more, but flash fiction I just sort of... start writing. It's true that I'll revise if I want to send something out, but at least some stories I've published I wrote something probably about 80% as good as the final product in one shot.

I mention this for two reasons:

  • I'm coming at this from an unusual perspective - and indeed, at one writing workshop class session I'd just like, extemporaneously write little flash fictions to people's prompts on the fly as a party trick, and people did seem surprised/impressed that this worked (I doubt the stories were amazing or anything, but I think they were serviceable)
  • I don't think "write a decent first draft on the fly from zero, without first thinking about it for very long" is actually superhuman at all, though people's processes totally vary and many writers probably do it quite differently than I do

Of course, you're totally right that comparing a highly selective publication's published work to a small number of random outputs is in no way apples to apples. Maybe some of the disagreement here is I'm not really trying to prove that AI fiction outputs are bad, so much as to demonstrate certain aesthetic weaknesses, and using an example of really good work to create contrast and thus highlight that weakness. To my eye, the machine generated stories aren't merely of a somewhat lower tier; instead, they all (at least all I've seen) share specific weaknesses that I don't currently believe scaffolding fixes. If you don't see the same difference I see, well, I certainly have no claim to objective correctness on the matter and must agree to disagree. But my goal is to show that qualitative difference, rather than simply point out one-shot LLM writing is worse than the best human stuff on offer.

This story from Claude 3.6 was good enough that it has stuck in my head ever since I read it (original source; the prompt was apparently to "write a Barthelme-esque short story with the aesthetic sensibilities of 'The School'").

For six months we watched the pigeons building their civilization on top of the skyscrapers. First came the architecture: nests made not just of twigs and paper, but of lost earbuds, expired credit cards, and the tiny silver bells from cat collars. Then came their laws.

"They have a supreme court," said Dr. Fernandez, who'd been studying them since the beginning. "Nine pigeons who sit on the ledge of the Chrysler Building and coo about justice." We didn't believe her at first, but then we didn't believe a lot of things that turned out to be true.

The pigeons developed a currency based on blue bottle caps. They established schools where young pigeons learned to dodge taxi cabs and identify the most generous hot dog vendors. Some of us tried to join their society, climbing to rooftops with offerings of breadcrumbs and philosophy textbooks, but the pigeons regarded us with the kind of pity usually reserved for very small children or very old cats.

"They're planning something," the conspiracy theorists said, but they always say that. Still, we noticed the pigeons holding what looked like town halls, thousands of them gathered on the roof of the public library, bobbing their heads in what might have been voting or might have been prayer.

Our own civilization continued below theirs. We went to work, fell in love, lost keys, found keys, forgot anniversaries, remembered too late, all while the pigeons above us built something that looked suspiciously like a scaled-down replica of the United Nations building out of discarded takeout containers and stolen Christmas lights. Sometimes they dropped things on us: rejection letters for poetry we'd never submitted, tax returns from years that hadn't happened yet, photographs of ourselves sleeping that we couldn't explain. Dr. Fernandez said this was their way of communicating. We said Dr. Fernandez had been spending too much time on rooftops.

The pigeons started their own newspapers, printed on leaves that fell upward instead of down. Anyone who caught one and could read their language (which looked like coffee stains but tasted like morse code) reported stories about pigeon divorce rates, weather forecasts for altitudes humans couldn't breathe at, and classified ads seeking slightly used dreams.

Eventually, they developed space travel. We watched them launch their first mission from the top of the Empire State Building: three brave pioneers in a vessel made from an old umbrella and the collective wishes of every child who'd ever failed a math test. They aimed for the moon but landed in Staten Island, which they declared close enough.

"They're just pigeons," the mayor said at a press conference, while behind him, the birds were clearly signing a trade agreement with a delegation of squirrels from Central Park.

Last Tuesday, they achieved nuclear fusion using nothing but raindrops and the static electricity from rubbing their wings against the collective anxiety of rush hour. The Department of Energy issued a statement saying this was impossible. The pigeons issued a statement saying impossibility was a human construct, like pants, or Monday mornings.

We're still here, watching them build their world on top of ours. Sometimes at sunset, if you look up at just the right angle, you can see their city shimmer like a memory of something that hasn't happened yet. Dr. Fernandez says they're planning to run for city council next year. Given everything else, we're inclined to believe her this time.

The pigeons say there's a message in all of this. We're pretty sure they're right, but like most messages worth receiving, we're still working out what it means.

Two years ago I said literary aficionados might have something to say about otherwise ineffable forms of AI progress. This feels like a small belated step in that direction... If you had a group of people who combined proficient appreciation of literature, knowledge of AI interpretability and philosophy of art, and maybe the cognitive science of storytelling, really profound perspectives on AI fiction might be possible.

Fine-tuned models are generally worse at writing fiction with good style than base models sampled at temperature 1. For example, the GPT-3.5 base model, code-davinci-002, was much better than the GPT-3.5 version tuned for chat. Here is what mainstream journalists said about it at the time.

I agree and disagree, and considered getting into this in my post. I agree in the sense that certainly, since fine-tuned models are fine-tuned towards a persona that you'd expect to be bad at writing fiction, base models have higher upside potential. But also, I think base models are too chaotic to do all that good a job, and veer off in wacky directions, and need a huge amount of manual sampling/pruning. So whether they're "better" seems like a question of definition to me. I do think that the first actually good literary fiction AI will be one of:

  • A big/powerful enough model to capture the actual latent structure of high quality literary fiction, rather than only the surface level (thus letting it experiment more deeply and not default to the most obvious choice in every situation), or
  • A base model fine-tuned quite hard for literary merit, and not RLHF'd for "assistant"-y stuff

The best written AI art I've seen so far has been nostalgebraist-autoresponder's tumblr posts, so I guess my money is on the latter of these two options. Simply not being winnowed into a specific persona strikes me as a valuable feature for creating good art.

I'm not sure fine-tuning is necessary. Most recent models have a ~100,000-token context window now, so they could fit quite a few short high-quality examples for in-context learning. (Gemini Pro even has a 2 million token context window, but of course the base model is unavailable to the public.)
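(Concretely, something like this sketch: the examples become the entire "training signal", stuffed into the prompt instead of into the weights. The model name is a placeholder, since public access to a true base model via a completions-style endpoint is the hard part.)

```python
from openai import OpenAI

client = OpenAI()

# Curated high-quality stories, concatenated into one long few-shot prompt.
examples = ["<story 1>", "<story 2>", "<story 3>"]
prompt = "\n\n***\n\n".join(examples) + "\n\n***\n\n"

# Sample a continuation from a base model at temperature 1.
completion = client.completions.create(
    model="<base-model>",  # placeholder: few providers expose true base models
    prompt=prompt,
    temperature=1.0,
    max_tokens=800,
)
print(completion.choices[0].text)
```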

I would be curious to see an attempt! I have a pretty strong prior that it would fail, though, with currently available models. I buy that RLHF hurts, but given Sam Altman's sample story also not impressing me (and having the same failure modes, just slightly less so), the problem pattern-matches for me to the underlying LLM simply not absorbing the latent structure well enough to imitate it. You might need more parameters, or a different set of training data, or something.

(This also relates to my reply to gwern above - his prompt did indeed include high quality examples, and in my opinion it helped ~0.)

Both Altman and Gwern used fine-tuned models; those don't really do in-context learning. They don't support "prompt engineering" in the original sense; they only respond to commands and questions in a particular way.

This is a great post, thanks for writing it. I agree that, when it comes to creative endeavours, there's just no "there" there with current AI systems. They just don't "get it". I'm reminded of this tweet:

Mark Cummins: After using Deep Research for a while, I finally get the “it’s just slop” complaint people have about AI art.

Because I don’t care much about art, most AI art seems pretty good to me. But information is something where I’m much closer to a connoisseur, and Deep Research is just nowhere near a good human output. It’s not useless, I think maybe ~20% of the time I get something I’m satisfied with. Even then, there’s this kind of hall-of-mirrors quality to the output, I can’t fully trust it, it’s subtly distorted. I feel like I’m wading through epistemic pollution.

Obviously it’s going to improve, and probably quite rapidly. If it read 10x more sources, thought 100x longer, and had 1000x lower error rate, I think that would do it. So no huge leap required, just turning some knobs, it’s definitely going to get there. But at the same time, it’s quite jarring to me that a large fraction of people already find the outputs compelling.

As someone who does care about art, and has, I think, discerning taste, I have always kind of felt this, and only when I read the above tweet did I realise that not everyone felt what I felt. When Sam Altman tweeted that story, which seemed to impress some people and inspire disgust/ridicule from others, the division became even clearer.

I think with Deep Research the slop is actually not as much of a problem -- you can just treat it as a web search on steroids and can always jump into the cited sources to verify things. And for similar reasons, it seems true that if DR "read 10x more sources, thought 100x longer, and had 1000x lower error rate", it could be as good as me at doing bounded investigations. For the hardest bits needed for AI to generate genuinely good creative fiction, it feels less obvious whether the same type of predictable progress will happen.

I think I'm less sure than you that the problem has to do with attractor basins, though. That does feel like part of, or related to, the problem, but I think a larger issue is that chatbots are not coherent enough. Good art has a sort of underlying internal logic to it, which, even if you do not notice it, contributes to making the artwork feel like a unified whole. Chatbots don't do that; they are too all over the place.

I think your prompt does not show R1 at its best. It's better at reacting to something than it is when given a blank canvas.

Deepseek R1 has some strange obsessions that are not obviously in the prompt and that seem to occur regardless of who is prompting it. Bioluminescence is one example.

I am still trying to figure out if R1 is actually trying to tell us something here, and if so, what it’s trying to say. Maybe it really is saying something about the nature of LLMs, given that these themes aren’t as big a deal in its training set.
