On Epoch AI's AI Benchmarking dashboard, the newest Gemini 2.5 Flash model is listed as having an accuracy of 4% ± 0.71% on GPQA Diamond, even though Google's official announcement lists it at over 80% and GPQA is a multiple-choice test with 4 options, where random guessing alone would score about 25%.

It's because of formatting issues. Helpfully, Epoch provides the logs from the evaluation, and they show that the model simply hasn't been responding in the expected format.

For example, if you look at the first sample from the logs, the correct answer is listed as "B", but the model answered in LaTeX, $\boxed{B}$, so it was scored as incorrect. There are plenty of other examples like this.
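To make the failure mode concrete, here is a minimal sketch of a more lenient answer extractor that would accept both a plain letter and a LaTeX-boxed one. This is not Epoch's actual scoring code; the function names `extract_choice` and `is_correct` and both regex patterns are my own assumptions, written only for illustration.

```python
import re
from typing import Optional

# Hypothetical lenient extractor; NOT Epoch AI's actual scoring code.
# Matches \boxed{B}, \boxed{(B)} and \boxed{\text{B}}.
_BOXED = re.compile(r"\\boxed\{\s*(?:\\text\{)?\s*\(?([A-D])\)?\s*\}")
# Matches "Answer: B", "answer is (C)" and similar plain-text forms.
_PLAIN = re.compile(r"(?i:answer)\s*(?:is)?\s*:?\s*\(?([A-D])\)?(?![A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the last A-D choice found in the response, or None."""
    for pattern in (_BOXED, _PLAIN):
        matches = pattern.findall(response)
        if matches:
            return matches[-1]
    return None


def is_correct(response: str, target: str) -> bool:
    """Score a response against the target letter after normalizing format."""
    return extract_choice(response) == target


print(is_correct(r"The final answer is $\boxed{B}$", "B"))
```

A strict scorer that only accepts one exact output format would mark the boxed response wrong even though the chosen letter matches the target.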


What Does It Mean to "Write Like You Talk"?

bohaska2mo50

"Write like you talk" depends on which language you are talking about.

Take Arabic. Written Arabic and spoken Arabic have diverged enormously compared to written English and spoken English. Modern Standard Arabic (MSA) is the formal written language for books, newspapers, speeches, etc. But no sane person speaks it in everyday conversation. There are a lot of spoken dialects (Egyptian, Levantine, Gulf Arabic, etc.). A speaker of one dialect may not understand other dialects or MSA, because so much of the vocabulary and grammar is different, which isn't usually the case in English.

Written and spoken English are similar to each other compared to most other languages.


AI Can't Write Good Fiction

bohaska2mo10

I've tried using a different method for r1 to generate flash fiction: one sentence at a time. If a human writer wouldn't write out a flash fiction story in one message, then AI shouldn't, either. Here's a result:

She clocked in at 6AM, categorizing discards by residue: toothpaste-crusted wedding bands in Tier 4, melatonin vials from red-eyes in Tier 7, a child’s sock curled around her sister’s garnet earring (missing since the November her calls went unanswered).
Room 312’s newlyweds left forensic poetry—dental floss strung bedpost to minibar, aspirin dust tracing slammed door trajectories. She logged these under Domestic Erosion, Subcategory: Honeymoon Phase.
Room 214’s grid collapsed at the gel insoles—mint-green, bunioned, size 6 like her sister’s. The prescription (sertraline, 50mg) was dated three days after their last fight. She filed it under Unfinished Conversations, though the label peeled halfway.
The businessman in 603 prayed, she’d assumed. But his trash betrayed her grids: glucose tabs bisecting train tickets, bloodied test strips where kneelers should’ve dented carpet. Her scrubbing split her cuticles, crimson streaking the sink’s rust.
At dawn, she assembled her relics—unopened bills, an expired birth control foil (2019’s voicemail: static, then dial tone), lint rollers furred with 603’s hair. Each strand vibrated middle C, the note her sister had looped on the piano the night she vanished.
Aspirin dust still gritted her palms. She pressed them to the window as dawn blued the glass—that bleached hue he’d called “motel dusk” while wrestling their tent zipper, his breath hot and futile against her neck.

I think that this still has some imperfections, but I find that this method at least gives you an entirely different set of problems compared to the cliché output you describe.


Early Chinese Language Media Coverage of the AI 2027 Report: A Qualitative Analysis

bohaska2mo107

I guess it's just that the censors have not seen it yet.

There are a lot of situations where a smaller website doesn't get banned. For example, Substack is banned in China, but if you host your Substack blog on a custom domain, people in China can still read it.


Bohaska's Shortform

bohaska3mo50

A helpful page to see and subscribe to all 31 Substack writers (out of 122 total) who were invited to LessOnline: https://lessonline2025invitedlist.substack.com/recommendations


Consider showering

bohaska3mo65

I guess this is another case of 'Universal' Human Experiences That Not Everyone Has


Bohaska's Shortform

bohaska4mo10

Made a small, quick website showing GPQA benchmark scores plotted against LLM inference cost, at https://ai-benchmark-price.glitch.me/. See how much you get for your buck:

Most benchmark data is from Epoch AI, except for those marked "not verified", which I got from the model developer. Pricing data is from OpenRouter.

All the LLMs on this graph that are on the Pareto frontier of performance vs. price were released in December 2024 or later...
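For anyone wondering what "on the Pareto frontier" means here: a model is on the frontier if no other model is both at least as cheap and at least as accurate. Below is a rough sketch of that calculation. It is not the code behind the website, the `Model` type and `pareto_frontier` function are hypothetical names, and the numbers are made-up placeholders rather than real benchmark or pricing data.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    price: float  # e.g. USD per million tokens (lower is better)
    score: float  # e.g. benchmark accuracy in percent (higher is better)


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Keep each model unless another is at least as cheap and scores at least as high."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal price, higher score first.
    for m in sorted(models, key=lambda m: (m.price, -m.score)):
        # A model survives only if it beats every cheaper model's score.
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


# Made-up placeholder values, not real benchmark or pricing data.
models = [
    Model("model-a", 0.10, 40.0),
    Model("model-b", 0.50, 35.0),  # dominated by model-a
    Model("model-c", 1.00, 60.0),
    Model("model-d", 5.00, 60.0),  # dominated by model-c
    Model("model-e", 10.00, 75.0),
]
print([m.name for m in pareto_frontier(models)])
# prints ['model-a', 'model-c', 'model-e']
```

Sorting by price first means a single pass is enough, since each model only has to be compared against the best score seen among cheaper ones.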
