Linked is a new working paper from Nick Bostrom, of Superintelligence fame, primarily analyzing optimal pause strategies in AI research, with the aim of maximizing saved human lives by balancing x-risk against ASI developing biological immortality sooner. Abstract: (emphasis mine) > Developing superintelligence is not like playing Russian roulette; it...
Recently I've been accumulating stories where I think an LLM is mistaken, only to discover that I'm the one who's wrong. My favorite recent case came while researching the 19th-century US-China opium trade. It's a somewhat convoluted history: opium was smuggled when it was legal to sell and when it...
Credit: Nano Banana, with some text provided. You may be surprised to learn that ClaudePlaysPokemon is still running today, and that Claude still hasn't beaten Pokémon Red, more than half a year after Google proudly announced that Gemini 2.5 Pro beat Pokémon Blue. Indeed, since then, Google and OpenAI models...
This is a cleaned-up, open-source version of the LLM Pokémon Scaffold described in Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red. (Forked from David Hershey of Anthropic's scaffold here; all development on top of that was done by my friend, not me.) Since that post,...
Disclaimer: this post was not written by me, but by a friend who wishes to remain anonymous. I did some editing, however. So a recent post my friend wrote has made the point quite clearly (I hope) that LLM performance on the simple task of playing and winning a game...
Background: With the release of Claude 3.7 Sonnet, Anthropic promoted a new benchmark: beating Pokémon. Now, Google claims Gemini 2.5 Pro has substantially surpassed Claude's progress on that benchmark. TL;DR: We don't know if Gemini is better at Pokémon than Claude because their playthroughs can't be directly compared. The Metrics...
Background: After the release of Claude 3.7 Sonnet,[1] an Anthropic employee started livestreaming Claude trying to play through Pokémon Red. The livestream is still going right now. TL;DR: So, how's it doing? Well, pretty badly. Worse than a 6-year-old would, definitely not PhD-level. Digging in: "But wait!" you say. Didn't...