This is kinda-sorta being done at the moment: after Gemini beat the game, the stream has just kept on going. Currently Gemini is lost in Mt. Moon, as is tradition. In fact, having already explored Mt. Moon earlier seems to be hampering it (no unexplored areas on the minimap to lure it in the right direction).
I believe the dev is planning to do a fresh run soon-ish once they've stabilized their scaffold.
Gemini 2.5 Pro just beat Pokémon Blue. (https://x.com/sundarpichai/status/1918455766542930004)
A few things ended up being key to the successful run:
Yeah by "robust" I meant "can programmatically interact with game".
There are at least workable tools for Pokémon FireRed (the 2004 remake of the 1996 original), it turns out, and you can find a scaffold using that here.
This has been a consistent weakness of OpenAI's image processing from the start: GPT-4-V came with clearcut warnings against using it on non-photographic inputs like screenshots or documents or tables, and sure enough, I found that it was wildly inaccurate on web page screenshots.
(In particular, I had been hoping to use it to automate Gwern.net regression detection: use a headless browser to screenshot random points in Gwern.net and report back if anything looked 'wrong'. It seemed like the sort of 'I know it when I see it' judgment task a VLM ought to be ...
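For concreteness, the kind of pipeline described would look something like this (a minimal sketch, not the actual setup; it assumes Playwright for the headless browser and OpenAI's vision-capable chat API, with a placeholder model name and prompt):

```python
# Hypothetical sketch: screenshot a page headlessly, then ask a VLM whether
# the rendering looks broken. Names and prompt are illustrative placeholders.
import base64
from playwright.sync_api import sync_playwright
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def screenshot(url: str, path: str = "page.png") -> str:
    """Capture a viewport-sized screenshot of the page with headless Chromium."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 2000})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=path)
        browser.close()
    return path

def looks_wrong(image_path: str) -> str:
    """Ask the VLM for an 'I know it when I see it' judgment on the screenshot."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the original attempt used GPT-4-V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does this web page screenshot look visually broken "
                         "(overlapping text, missing styles, bad layout)? "
                         "Answer YES or NO, then explain briefly."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(looks_wrong(screenshot("https://gwern.net/")))
```

The hard part, per the weakness described above, is whether the model's judgment on screenshots can be trusted in the first place.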
Actually another group released VideoGameBench just a few days ago, which includes Pokémon Red among other games. Just a basic scaffold for Red, but that's fair.
As I wrote in my other post:
...Why hasn't anyone run this as a rigorous benchmark? Probably because it takes multiple weeks to run a single attempt, and moreover a lot of progress comes down to effectively "model RNG" - ex. Gemini just recently failed Safari Zone, a difficult challenge, because its inventory happened to be full and it couldn't accept an item it needed. And ex. Claude has taken wildly
Re: biosignatures detected on K2-18b, there have been a couple of popular takes saying this solves the Fermi Paradox: K2-18b is so big (8.6x Earth mass) that you can't get to orbit from it, and maybe most life-bearing planets are like that.
This is wrong on several counts:
I'm not sure TASes count as "AI", since they're usually compiled by humans, but the "PokeBotBad" you linked is interesting; I hadn't heard of that before. It's an Any% Glitchless speedrun bot that ran until ~2017 and managed a solid 1:48:27 time on 2/25/17, better than the human world record until 2/12/18. Still, I'd say this is more a programmed "bot" than an AI in the sense we care about.
Anyway, you're right that the whole reason the Pokémon benchmark exists is because it's interesting to see how well an untrained LLM can do playing it.
since there's no obvious reason why they'd be biased in a particular direction
No, I'm saying there are obvious reasons why we'd be biased towards truthtelling. I mentioned "spread truth about AI risk" earlier, but also, more generally, one of our main goals is to get our map to match the territory as a collaborative community project. Lying makes that harder.
Besides sabotaging the community's map, lying is dangerous to your own map too. As OP notes, to really lie effectively, you have to believe the lie. Well is it said, "If you once tell a lie, the truth is ...
I'm not convinced SBF had conflicting goals, although it's hard to know. But more importantly, I don't agree rationalists "tend not to lie enough". I'm no Kantian, to be clear, but I believe rationalists ought to aspire to a higher standard of truthtelling than the average person, even if there are some downsides to that.
Have we forgotten Sam Bankman-Fried already? Let’s not renounce virtues in the name of expected value so lightly.
Rationalism was founded partly to disseminate the truth about AI risk. It is hard to spread the truth when you are a known liar, especially when the truth is already difficult to believe.
Dangit I can't cease to exist, I have stuff to do this weekend.
But more seriously, I don't see the point you're making? I don't have a particular objection to your discussion of anthropic arguments, but also I don't understand how it relates to the "what part of evolution/planetary science/sociology/etc. is the Great Filter" scientific question.
I think if you frame it as:
if most individuals exist inside the part of the light cone of an alien civilization, why aren't we one of them?
Then yes, 1.0 influence and 4.0 influence both count as "part of the light cone", and so for the related anthropic arguments you could choose to group them together.
But re: anthropic arguments,
Not only am I unable to explain why I'm an observer who doesn't see aliens
This is where I think I have a different perspective. Granting that anthropic arguments (here, about which observer you are and the odds of that) cause frus...
I agree it's likely the Great Filter is behind us. And I think you're technically right, most filters are behind us, and many are far in the past, so the "average expected date of the Great Filter" shifts backward. But, quoting my other comment:
Every other possible filter would gain equally, unless you think this implies that maybe we should discount other evolutionary steps more as well. But either way, that’s still bad on net because we lose probability mass on steps behind us.
So even though the "expected date" shifts backward, the odds for "behind us or...
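To make the redistribution concrete, a toy renormalization (numbers made up purely for illustration): put 0.5 on a late past step, 0.3 on an early past step, and 0.2 on a filter still ahead of us, then suppose new evidence halves the weight on the late past step:

$$0.5 : 0.3 : 0.2 \;\to\; 0.25 : 0.3 : 0.2 \;\propto\; 0.33 : 0.40 : 0.27$$

The expected date of the filter moves earlier (mass shifts toward the early step), yet the probability that the filter is ahead of us still rises, from 0.20 to about 0.27.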
Interesting thought. I think you have a point about coevolution, but I don't think it explains away everything in the birds vs. mammals case. How much are birds really competing with mammals vs. other birds and other animals? Mammals compete with lots of animals; why did only birds get smarter? I tend to think intra-niche/genus competition would generate most of the pressure for higher intelligence, and for whatever reason that competition doesn't seem to lead to huge intelligence gains in most species.
(Re: octopus, cephalopods do have interactions with marine...
Two objections:
Couple takeaways here. First, quoting the article:
By comparing the bird pallium to lizard and mouse palliums, they also found that the neocortex and DVR were built with similar circuitry — however, the neurons that composed those neural circuits were distinct.
“How we end up with similar circuitry was more flexible than I would have expected,” Zaremba said. “You can build the same circuits from different cell types.”
This is a pretty surprising level of convergence for two separate evolutionary pathways to intelligence. Apparently the neural circuits are so ...
Both the slowdown and race models predict that the future of Humanity is mostly in the hands of the United States - the baked-in disadvantage in chips from existing sanctions on China is crippling within short timelines, and no one else is contending.
So, if the CCP takes this model seriously, they should probably blockade Taiwan tomorrow? It's the only fast way to equalize chip access over the next few years. They'd have to weigh the risks against the chance that timelines are long enough for their homegrown chip production to catch up, but there seems to ...
I think the scenario of a war between several ASIs (each merged with its country of origin) is underexplored. Yes, there could be a value handshake between ASIs, but their creators will work to prevent this and will see it as a type of misalignment.
Somehow, this may help some groups of people survive: those ASIs which preserve their people will look more trustworthy in the eyes of other ASIs, and this will help them form temporary unions.
The final outcome will be highly unstable: either one ASI will win, or several ASIs will start space exploration in different directions.
I'm generally pretty receptive to "adjust the Overton window" arguments, which is why I think it's good PauseAI exists, but I do think there's a cost in political capital to saying "I want a Pause, but I am willing to negotiate". It's easy for your opponents to cite your public Pause support and then say, "look, they want to destroy America's main technological advantage over its rivals" or "look, they want to bomb datacenters, they're unserious". (yes, a Pause as typically imagined requires international treaties, but the attack lines would probably still work, ...
So I realized Amad’s comment obsession was probably a defense against this dynamic - “I have to say something to my juniors when I see them”.
I think there's a bit of a trap here where, because Amad is known for always making a comment whenever he ends up next to an employee, if he then doesn't make a comment next to someone, it feels like a deliberate insult.
That said, I see the same behavior from US tech leadership pretty broadly, so I think the incentive to say something friendly in the elevator is pretty strong to start (norms of equality, first name basis, etc. in tech), and then once you start doing that you have to always do it to avoid insult.
I think the concept of Pausing AI just feels unrealistic at this point.
Copying over a comment from Chris Olah of Anthropic on Hacker News that I thought was good (along with the parent comment):
fpgaminer
> This is powerful evidence that even though models are trained to output one word at a time
I find this oversimplification of LLMs to be frequently poisonous to discussions surrounding them. No user facing LLM today is trained on next token prediction.
olah3
...Hi! I lead interpretability research at Anthropic. I also used to do a lot of basic ML pedagogy (https://colah.github.io/). I
Good objection. I think gene editing would be different because it would feel more unfair and insurmountable. That's probably not rational - the effect size would have to be huge for it to be bigger than existing differences in access to education and healthcare, which are not fair or really surmountable in most cases - but something about other people getting to make their kids "superior" off the bat, inherently, is more galling to our sensibilities. Or at least mine, but I think most people feel the same way.
Re: HCAST tasks, most are being kept private since it's a benchmark. If you want to learn more, here's METR's paper on HCAST.
Thanks for the detailed response!
Re: my meaning, you got it correct here:
Spiritually, genomic liberty is individualistic / localistic; it says that if some individual or group or even state (at a policy level, as a large group of individuals) wants to use germline engineering technology, it is good for them to do so, regardless of whether others are using it. Thus, it justifies unequal access, saying that a world with unequal access is still a good world.
Re: genomic liberty makes narrow claims, yes I agree, but my point is that if implemented it will lead ...
This is a thoughtful post, and I appreciate it. I don't think I disagree with it from a liberty perspective, and agree there are potential huge benefits for humanity here.
However, my honest first reaction is "this reasoning will be used to justify a world in which citizens of rich countries have substantially superior children to citizens of poor countries (as viewed by both groups)". These days, I'm much more suspicious of policies likely to be socially corrosive: that corrosion leads to bad governance at a time when, because of AI risk, we need excellent governance...
Here's an interesting thread of tweets from one of the paper's authors, Elizabeth Barnes.
Quoting the key sections:
...Extrapolating this suggests that within about 5 years we will have generalist AI systems that can autonomously complete ~any software or research engineering task that a human professional could do in a few days, as well as a non-trivial fraction of multi-year projects, with no human assistance or task-specific adaptations required.
However, (...) It’s unclear how to interpret “time needed for humans”, given that this varies wildly between diffe
Random commentary on bits of the paper I found interesting:
Under Windows of opportunity that close early:
...Veil of ignorance
Lastly, some important opportunities are only available while we don’t yet know for sure who has power after the intelligence explosion. In principle at least, the US and China could make a binding agreement that if they “win the race” to superintelligence, they will respect the national sovereignty of the other and share in the benefits. Both parties could agree to bind themselves to such a deal in advance, because a guarantee of contr
Okay I got trapped in a Walgreens and read more of this, found something compelling. Emphasis mine:
...The best systems today fall short at working out complex problems over longer time horizons, which require some mix of creativity, trial-and-error, and autonomy. But there are signs of rapid improvement: the maximum duration of ML-related tasks that frontier models can generally complete has been doubling roughly every seven months. Naively extrapolating this trend suggests that, within three to six years, AI models will become capable of automating many cogn
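For a sense of what that naive extrapolation implies numerically, here's a quick back-of-the-envelope (only the seven-month doubling time comes from the quote; the 3- and 6-year horizons are just the endpoints they mention):

```python
# Naive extrapolation of "task horizon doubles roughly every 7 months".
# Only the doubling time comes from the quoted passage; the rest is
# back-of-the-envelope illustration, not a claim from the paper.
doubling_months = 7

for years in (3, 6):
    factor = 2 ** (12 * years / doubling_months)
    print(f"{years} years: task-horizon multiplier ~{factor:,.0f}x")

# Prints roughly 35x at 3 years and roughly 1,250x at 6 years.
```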
Meta: I'm kind of weirded out by how apparently everyone is making their own high-effort custom-website-whitepapers? Is this something that's just easier with LLMs now? Did Situational Awareness create a trend? I can't read all this stuff man.
In general there seems to be way more high-effort work coming out since reasoning models got released. Maybe it's just crunchtime.
I think it's something of a trend relating to a mix of 'tools for thought' and imitation of some websites (LW2, Read The Sequences, Asterisk, Works in Progress & Gwern.net in particular), and also a STEM meta-trend arriving in this area: you saw this in security vulnerabilities where for a while every major vuln would get its own standalone domain + single-page website + logo + short catchy name (eg. Shellshock, Heartbleed). It is good marketing which helps you stand out in a crowded ever-shorter-attention-span world.
I also think part of it is that it ...
No idea. Be really worried, I guess—I tend a bit towards doomer. There's something to be said for not leaving capabilities overhangs lying around, though. Maybe contact Anthropic?
The thing is, the confidence the top labs have in short-term AGI makes me think there's a reasonable chance they have the solution to this problem already. I made the mistake of thinking they didn't once before - I was pretty skeptical that "more test-time compute" would really unhobble LLMs in a meaningful fashion when Situational Awareness came out and didn't elaborate at all on how that would work. But it turned out that at least OpenAI, and probably Anthropic too, already had the answer at the time.
I think this is a fair criticism, but I think it's also partly balanced out by the fact that Claude is committed to trying to beat the game. The average person who has merely played Red probably did not beat it, yes, but also they weren't committed to beating it. Also, Claude has pretty deep knowledge of Pokémon in its training data, making it a "hardcore gamer" both in terms of knowledge and willingness to keep playing. In that way, the reference class of gamers who put forth enough effort to beat the game is somewhat reasonable.
It's definitely possible to get confused playing Pokémon Red, but as a human, you're much better at getting unstuck. You try new things, have more consistent strategies, and learn better from mistakes. If you tried as long and as consistently as Claude has, even as a 6-year-old, you'd do much better.
I played Pokémon Red as a kid too (still have the cartridge!). It wasn't easy, but I beat it in something like that 26-hour number, IIRC. You have a point that howlongtobeat is biased towards gamers, but it's the most objective number I can find, and it feels reasonable to me.
as a human, you're much better at getting unstuck
I'm not sure! Or well, I agree that 7-year-old me could get unstuck by virtue of having an "additional tool" called "get frustrated and cry until my mom took pity and helped."[1] But we specifically prevent Claude from doing stuff like that!
I think it's plausible that if we took an actual 6-year-old and asked them to play Pokemon on a Twitch stream, we'd see many of the things you highlight as weaknesses of Claude: getting stuck against trivial obstacles, forgetting what they were doing, and—yes—complai...
Thanks for the correction! I've added the following footnote:
Actually it turns out this hasn't been done, sorry! A couple RNG attempts were completed, but they involved some human direction/cheating. The point still stands only in the sense that, if Claude took more random/exploratory actions rather than carefully-reasoned shortsighted actions, he'd do better.
I think the idea behind MAIM is to make it so neither China nor the US can build superintelligence without at least implicit consent from the other. This is before we get to the possibility of first strikes.
If you suspect an enemy state is about to build a superintelligence which they will then use to destroy you (or that will destroy everyone), you MAIM it. You succeed in MAIMing it because everyone agreed to measures making it really easy to MAIM it. Therefore, for either side to build superintelligence, there must be a general agreement to do so. If the...
This is creative.
TL;DR: To mitigate race dynamics, China and the US should deliberately leave themselves open to the sabotage ("MAIMing") of their frontier AI systems. This gives both countries an option other than "nuke the enemy"/"rush to build superintelligence first" if superintelligence appears imminent: MAIM the opponent's AI. The deliberately unmitigated risk of being MAIMed also encourages both sides to pursue carefully-planned and communicated AI development, with international observation and cooperation, reducing AINotKillEveryone-ism risks.
The ...
After an inter-party power-struggle, the CCP commits to the perpetual existence of at least one billion Han Chinese people with biological reproductive freedom
You know, this isn't such a bad idea - that is, explicit government commitments against discarding their existing, economically unproductive populace. Easier to ask for today rather than later.
Hypothetically this is more valuable in autocracies than in democracies, where the 1 person = 1 vote rule keeps political power in the hands of the people, but I think I'd support adding a constitutional amend...
It's unclear exactly what the product GPT-5 will be, but according to OpenAI's Chief Product Officer today it's not merely a router between GPT-4.5/o3.
swyx
appreciate the update!! in gpt5, are gpt* and o* still separate models under the hood and you are making a model router? or are they going to be unified in some more substantive way?
Kevin Weil
Unified 👍
Unless a dentist has told you to do this for some reason, you should know this is not recommended. Brushing hard can hurt tooth enamel and cause gum recession (aka your gums shrink down, which causes lots of problems).
And, at risk of quashing OP's admirable spirit, a more "robust" toothbrush would exacerbate the relevant harms