arabaga - LessWrong

Getting 50% (SoTA) on ARC-AGI with GPT-4o

I agree that there is a good chance that this solution is not actually SOTA, and that it is important to distinguish the three sets.

There's a further distinction between 3 guesses per problem (which is allowed according to the original specification as Ryan notes), and 2 guesses per problem (which is currently what the leaderboard tracks [rules]).

Some additional comments / minor corrections:

The past SOTA got [we don't know] on the first, 52% on the second, and 34% on the third.

AFAICT, the current SOTA-on-the-private-test-set with 3 submissions per problem is 37%, and that solution scores 54% on the public eval set.

The SOTA-on-the-public-eval-set is at least 60% (see thread).

Apparently, lots of people get worse performance on the public test set than the private one

I think this is a typo and you mean the opposite.

From looking into this a bit, it seems pretty clear that the public eval set and the private test set are not IID. They're "intended" to be the "same" difficulty, but AFAICT this essentially just means that they both consist of problems that are feasible for humans to solve.

It's not the case that a fixed set of eval/test problems were created and then randomly distributed between the public eval set and private test set. At your link, Chollet says "the [private] test set was created last" and the problems in it are "more unique and more diverse" than the public eval set. He confirms that here:

This is *also* likely in part due to the fact that the eval set contains more "easy" tasks. The eval set and test set were not calibrated for difficulty. So while all tasks across the board are feasible for humans, the tasks in the test set may be harder on average. This was not intentional, and is likely either a fluke (there are only 100 tasks in the test set) or due to the test set having been created last."

Bottom line: I would expect Ryan's solution to score significantly lower than 50% on the private test set.

LessWrong's (first) album: I Have Been A Good Bing

arabaga7mo143

You can directly write/paste your own lyrics (Custom Mode). And v3 came out fairly recently, which is better in general, in case you haven't tried it in a while.

LessWrong's (first) album: I Have Been A Good Bing

arabaga7mo121

They seem to be created by https://app.suno.ai/ And yes, it is really easy to create songs - you can either have it create the lyrics for you based on a prompt (the default), or you can write/paste the lyrics yourself (Custom Mode). Songs can be up to ~2 minutes long I think.

Prediction markets are consistently underconfident. Why?

arabaga10mo10

Yeah, this seems to be a big part of it. If you instead switch it to the probability at market midpoint, Manifold is basically perfectly calibrated, and Kalshi is if anything overconfident (Metaculus still looks underconfident overall).

OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns

arabaga1y10

No, the letter has not been falsified.

Just to clarify: ~700 out of ~770 OpenAI employees have signed the letter (~90%)

Out of the 10 authors of the autointerpretability paper, only 5 have signed the letter. This is much lower than the average rate. One out of the 10 is no longer at OpenAI, so couldn't have signed it, so it makes sense to count this as 5/9 rather than 5/10. Either way, it's still well below the average rate.

OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns

arabaga1y10

Ah, nice catch, I'll update my comment.

OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns

arabaga1y156

There is an updated list of 702 who have signed the letter (as of the time I'm writing this) here: https://www.nytimes.com/interactive/2023/11/20/technology/letter-to-the-open-ai-board.html (direct link to pdf: https://static01.nyt.com/newsgraphics/documenttools/f31ff522a5b1ad7a/9cf7eda3-full.pdf)

Nick Cammarata left OpenAI ~8 weeks ago, so he couldn't have signed the letter.

Out of the remaining 6 core research contributors:

3/6 have signed it: Steven Bills, Dan Mossing, and Henk Tillman
3/6 have still not signed it: Leo Gao, Jeff Wu, and William Saunders

Out of the non-core research contributors:

2/3 signed it: Gabriel Goh and Ilya Sutskever
1/3 still have not signed it: Jan Leike

That being said, it looks like Jan Leike has tweeted that he thinks the board should resign: https://twitter.com/janleike/status/1726600432750125146

And that tweet was liked by Leo Gao: https://twitter.com/nabla_theta/likes

Still, it is interesting that this group is clearly underrepresented among people who have actually signed the letter.

Edit: Updated to note that Nick Cammarata is no longer at OpenAI, so he couldn't have signed the letter. For what it's worth, he has liked at least one tweet that called for the board to resign: https://twitter.com/nickcammarata/likes

Altman firing retaliation incoming?

arabaga1y2921

It seems like a strategy by investors or even large tech companies to create a self-fulfilling prophecy to create a coalition of OpenAI employees, when there previously was none.

How is this more likely than the alternative, which is simply that this is an already-existing coalition that supports Sam Altman as CEO? Considering that he was CEO until he was suddenly removed yesterday, it would be surprising if most employees and investors didn't support him. Unless I'm misunderstanding what you're claiming here?

The Gods of Straight Lines

arabaga1y60

If you follow the link, under the section "Free Market Seen as Best, Despite Inequality", Vietnam is the country with the highest agreement by far with the statement "Most people are better off in a free market economy, even though some people are rich and some are poor" (95%!)

That being said, while it is the most pro-capitalism country, it is clearly not the most capitalist country (although it's not that bad, 72nd out of 176 countries ranked: https://www.heritage.org/index/ranking), and it would likely be more capitalist today if South Vietnam had won.

State of Generally Available Self-Driving

arabaga1y40

Small typo/correction: Waymo and Cruise each claim 10k rides per week, not riders.

LESSWRONG
LW

Posts

Wiki Contributions

Comments