Julian Bradshaw

Nick Bostrom: Optimal Timing for Superintelligence

Linked is a new working paper from Nick Bostrom, of Superintelligence fame, primarily analyzing optimal pause strategies in AI research, with the aim of maximizing saved human lives by balancing x-risk against ASI developing biological immortality sooner.

Abstract (emphasis mine):

> Developing superintelligence is not like playing Russian roulette; it...

Feb 138

When the LLM isn't the one who's wrong

Recently I've been accumulating stories where I think an LLM is mistaken, only to discover that I'm the one who's wrong. My favorite recent case came while researching the 19th-century US-China opium trade. It's a somewhat convoluted history: opium was smuggled when it was legal to sell and when it...

Jan 1880

Insights into Claude Opus 4.5 from Pokémon

[Image. Credit: Nano Banana, with some text provided.]

You may be surprised to learn that ClaudePlaysPokemon is still running today, and that Claude still hasn't beaten Pokémon Red, more than half a year after Google proudly announced that Gemini 2.5 Pro beat Pokémon Blue. Indeed, since then, Google and OpenAI models...

Dec 9, 2025 · 220

Open Source LLM Pokémon Scaffold

This is a cleaned-up, open-source version of the LLM Pokémon Scaffold described in Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red. (It was forked from the scaffold by David Hershey of Anthropic; all development on top of that was done by my friend, not me.) Since that post,...

Apr 27, 2025 · 31

Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red

Disclaimer: this post was not written by me, but by a friend who wishes to remain anonymous. I did some editing, however. So a recent post my friend wrote has made the point quite clearly (I hope) that LLM performance on the simple task of playing and winning a game...

Apr 21, 2025 · 124

Is Gemini now better than Claude at Pokémon?

Background: With the release of Claude 3.7 Sonnet, Anthropic promoted a new benchmark: beating Pokémon. Now, Google claims Gemini 2.5 Pro has substantially surpassed Claude's progress on that benchmark. TL;DR: We don't know if Gemini is better at Pokémon than Claude because their playthroughs can't be directly compared. The Metrics...

Apr 19, 2025 · 92

So how well is Claude playing Pokémon?

Background: After the release of Claude 3.7 Sonnet,[1] an Anthropic employee started livestreaming Claude trying to play through Pokémon Red. The livestream is still going right now. TL;DR: So, how's it doing? Well, pretty badly. Worse than a 6-year-old would, definitely not PhD-level. Digging in But wait! you say. Didn't...

Mar 7, 2025 · 173
