
I sense that my quality of communication diminishes past this point; I should get my thoughts together before speaking too confidently.

I believe you're right that we do something similar to LLMs (loosely, analogously); see

https://www.lesswrong.com/posts/i42Dfoh4HtsCAfXxL/babble

(I need to learn markdown)

My intuition is still LLM-pessimistic, but I'd be excited to see good practical uses. This seems like tool AI, and that makes my existential dread easier to manage!

Thank you for the warm reply; it's nice, and also good feedback that I didn't do anything explicitly wrong with my post.

It will be VERY funny if this ends up being essentially the o1 model with some tinkering to help it cycle through questions multiple times to verify the best answers, or something banal like that. Wish they didn't make us wait so long to test it :/

Small nudge: the questions have difficulty tiers of 25% easy, 50% medium, and 25% hard, with easy being undergrad/IMO difficulty and hard being the sort you would give to a researcher in training.

The 25% accuracy gives me STRONG indications that it just got the easy ones, and the starkness of this cutoff makes me think there is something categorically different about the easy ones that makes them MUCH easier to solve: being more closed-ended, easier to verify, or simply leaked into the training data in some form.
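As a rough back-of-envelope check (the tier split is from the point above; the per-tier solve rates are numbers I made up purely to illustrate the shape of the argument):

```python
# Illustrative tier split and per-tier solve rates (my assumptions, not official numbers).
tiers = {
    "easy (IMO/undergrad)":  {"share": 0.25, "solve_rate": 0.90},
    "medium":                {"share": 0.50, "solve_rate": 0.05},
    "hard (research-level)": {"share": 0.25, "solve_rate": 0.00},
}

# Overall accuracy is the share-weighted average of the per-tier solve rates.
overall = sum(t["share"] * t["solve_rate"] for t in tiers.values())
print(f"expected overall accuracy: {overall:.1%}")  # -> 25.0%
```

So a model that nails most of the easy tier and almost nothing else lands right around the reported score.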

First post; feel free to meta-critique my rigor here, as I am not sure what is mandatory, expected, or superfluous for a comment under a post. Studying computer science but no degree yet. Can pull specific citations if necessary, but...

these benchmarks don't feel genuine.

Chollet indicated in his piece:

Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training)

The model was tuned to the ARC-AGI test, got a great score, but then faceplanted on a reasoning test that is apparently easier for humans than the first one, and that couldn't have been adversarially designed to stump o3. I would have expected this to immediately expose the model as horrifically overfit, but few people seem to be homing in on that, so maybe I don't understand something?

Second, the FrontierMath problems appear to be structured so that 25% of them are the sort that clever high schoolers with bright futures in mathematics could answer if they studied: International Math Olympiad or undergrad-level questions. We don't know EXACTLY which questions it answered correctly, but the score is almost exactly 25%, and I suspect those are almost entirely from the lowest tier. I also wouldn't be surprised to hear that the questions (or similar ones) were in the training set; perhaps the lowest-tier questions were written with an operational philosophy that didn't prioritize guarding against leaked data?

Third, the Codeforces Elo doesn't mean anything to me. I just can't take it seriously, unless someone thinks existing models are already mid-tier competitive SWEs? Similar benchmark, similarly meaningless; dismissed until someone shows me these models can actually deliver on what other models have provenly exaggerated about.

Fourth, the cost is obscene: thousands of dollars per task. With indications that this model is very similar to o1 going by cost per token, this looks less like a much larger model and more like the same model doing a mind-boggling amount of thinking. It is probably something like FunSearch: a tool bag of programs that an LLM combines and evaluates, brute-forced until it gets an answer it can verify. That seems useful, but it would only work on closed-ended questions with answers that are easy to verify, and either way it wouldn't really be intelligence of the kind I had imagined looking for.
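To make the "brute force until it verifies" picture concrete, here is a minimal sketch of the loop I'm imagining (purely my guess at the shape of such a system; llm_propose_candidate and verify are hypothetical stand-ins, not anything OpenAI has described):

```python
import random

def llm_propose_candidate(task, toolbag):
    """Hypothetical stand-in for an LLM sampling a candidate solution
    by combining primitives from a fixed bag of tools."""
    return random.sample(toolbag, k=min(3, len(toolbag)))

def verify(task, candidate):
    """Hypothetical verifier; only viable when the task has a cheap,
    closed-ended check (an exact answer, unit tests, etc.)."""
    return task["check"](candidate)

def solve_by_search(task, toolbag, budget=10_000):
    """Brute-force generate-and-verify: keep sampling candidates until
    one passes verification or the compute budget runs out."""
    for _ in range(budget):
        candidate = llm_propose_candidate(task, toolbag)
        if verify(task, candidate):
            return candidate
    return None  # no verified answer within budget
```

Note that the whole loop hinges on the verifier: without a cheap closed-ended check, the search has nothing to brute-force against, which is why I don't expect this shape of system to generalize to open-ended problems.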

This would PERFECTLY explain the failure on the ARC-AGI-2 benchmark: the bag of tools it needed would be different; it wasn't tuned to the new test and came with the wrong tool bag. Maybe that could be fixed, but if my model of how this AI works is right, then the cost of a task would grow something like O(n!), with n being the number of "tools" it needs to compose. I'm probably wrong on the details, but something LIKE that is probably true.
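To show what I mean by the factorial blow-up (a loose intuition, not a real complexity result), the number of ways to order a bag of tools into a pipeline grows like this:

```python
from math import factorial

# Ways to order n tools in a single pipeline
# (ignoring repeats and partial chains, so this is only a rough intuition pump).
for n in (3, 5, 8, 12):
    print(f"{n:>2} tools -> {factorial(n):,} orderings")
# 3 -> 6, 5 -> 120, 8 -> 40,320, 12 -> 479,001,600
```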

LeCun also seems confident on Threads that this is NOT an LLM: that it uses an LLM, but something else is going on. That perfectly matched my "oh, this is FunSearch" intuition. My caveat is that this might all be handled "in house" inside an LLM, but the restrictions on what it could do seem very real.

Am I critically wrong on enough points here that I should seriously rethink my intuition?