There's an improvement in LLMs I've seen that is important but has wildly inflated people's expectations beyond what's reasonable:
LLMs have hit a point, on some impressive tests, where their failures are no longer reliably unrecoverable. They are conservative enough that they can run a search on a problem, fail a million times, and still mumble their way into an answer.
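Concretely, the pattern I mean looks something like the sketch below. This is a toy in Python, not anything any lab has confirmed they do; `propose_answer` and `verify` are stand-ins I made up.

```python
import random

def propose_answer(problem):
    # Stand-in for sampling a candidate from the model; here it just guesses a number.
    return random.randint(0, 999)

def verify(problem, candidate):
    # Stand-in for any cheap, reliable check (unit tests, an exact-match answer key, etc.).
    return candidate == problem["answer"]

def mumble_into_an_answer(problem, budget=1_000_000):
    # "Directed randomness": keep failing recoverably until something passes the check.
    for attempt in range(budget):
        candidate = propose_answer(problem)
        if verify(problem, candidate):
            return candidate, attempt + 1
    return None, budget

answer, tries = mumble_into_an_answer({"answer": 42})
print(answer, tries)  # eventually 42, after however many recoverable failures
```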
I'm going to try to write something of at least not-embarrassing quality about my thoughts on this, but I'm really confused by people's hype around this sort of thing: it feels like directed randomness.
This is not going to be a high quality answer, sorry in advance.
I noticed this with someone in my office who is learning robotic process automation: people are very bad at measuring their own productivity; they're better at seeing certain kinds of gains (and certain kinds of losses) than others. He swears emphatically that he's many times as productive, but he's become almost totally unreliable. He's in denial over it, and a couple of people have now openly told me they try to remove him from workflows because of all the problems he causes.
I think the situation is...
Definitely! However, there is more money and "hype" in the direction of wanting these to scale into AGI.
Hype and anti-hype don't cancel each other out. If someone invests a billion dollars into LLMs, someone else can't spend negative one billion to cancel it: the billion-dollar spender is the one moving markets and getting a lot of press attention.
We have Yudkowsky going on Destiny, I guess?
I think there's some miscommunication here, on top of a fundamental disagreement on whether more compute takes us to AGI.
On miscommunication: we're not talking about the falling cost per FLOP, we're talking about a world where OpenAI either does or does not have a price war eating its margins.
On the fundamental disagreement: I assume you don't take very seriously the idea that AI labs are seeing a breakdown of scaling laws? No problem if so; reality should resolve that disagreement relatively soon!
This is actually a good use case: it fits what GPT does well, and it's one where very cheap tokens help!
Pending some time for people to pick at it and test its limits, this might be really good. My instinct is that legal research, case law, etc. will be the test of how good it is; if it does well there, this might be its foothold into real commercial use that actually generates profit.
My prediction is that we will be glad this exists. It will not be "PhD level", a phrase which defaces all who utter it, but it will save some people a lot of time and effort.
Where I think we d...
I feel like this comes down a lot to intuition. All I can do is gesture at the shrinking gap between marginal cost and prices, wave my hand in the direction of discount rates and OpenAI's valuation, and ask... are you sure?
The demand curve on this seems textbook inelastic at current margins. Slashing the price of milk by 10x would have us cleaning our driveways with it; slashing the price of eggs would have us using crushed eggshells as low-grade building material. A 10x decrease in the price per token of AI is barely even noticed; in fact, in some...
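To put made-up numbers on that (mine, not anything from the thread): if a 10x price cut only triples token volume, the seller's revenue goes from $p \cdot q$ to $\frac{p}{10} \cdot 3q = 0.3\,pq$, i.e. roughly a 70% revenue drop despite tripled usage. That's what inelastic-at-current-margins looks like from the side of whoever is cutting the price.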
I'm kinda the opposite on the timelines thing? This is probably a timeline delayer even if I thought LLMs scaled to AGI, which I don't, but let's play along.
If a pharma company could look at another company's product, copy it, and release it for free with no consequences, but the copied product itself could only be marginally improved without massive investment, what would that do to the landscape?
It kills the entire industry. This HURTS anyone trying to fundraise; reckless billions will have a harder time finding their way into the hands of developers ...
Can you make some noise in the direction of the shockingly low numbers it gets on early ARC-AGI-2 benchmarks? This feels like pretty open-and-shut proof that it doesn't generalize, no?
The fact that the model was trained on 75 percent of the training set makes it feel like they jury-rigged a test set and RL'd the thing to success. If the <30% score on the second test ends up being true, that should shift our guesses about what it's actually doing heavily away from genuine intelligence and towards brute-force search for verifiable answers.
The frontier t...
Thanks for the reply! Still trying to learn how to disagree properly so let me know if I cross into being nasty at all:
I'm sure they've gotten better. o1 probably improved more from its heavier use of intermediate logic, runtime compute, and such, but that said, at least up through 4o it looks like there have been improvements in the models themselves; they've been getting better.
They can do incredible stuff in well-documented processes but don't survive well off the beaten path. They seem to string things together pretty well, so I don't know if I would say there's ...
I would say that, barring strong evidence to the contrary, this should be assumed to be memorization.
I think that's useful! LLMs obviously encode a ton of useful algorithms and can chain them together reasonably well.
But I've tried to get those bastards to do something slightly weird, and they just totally self-destruct.
But let's just drill down to demonstrable reality: if past SWE benchmarks were correct, these things should be able to do incredible amounts of work more or less autonomously, and yet all the LLM SWE replacements we've seen have stuck to high...
It's been pretty clear to me, as someone who regularly creates side projects with AI, that the models are actually getting better at coding.
Also, it's clearly not pure memorization: you can deliberately give them tasks that have never been done before, and they do well.
However, even with agentic workflows, RAG, etc., all existing models seem to fail at some moderate level of complexity: they can create functions and prototypes but have trouble keeping track of a large project.
My uninformed guess is that o3 actually pushes that complexity ceiling by some non-trivial amount, but not enough to take on complex projects.
Hmm, mixed agree/disagree. Scale probably won't work, algorithms probably would, but I don't think it's going to be that quick.
Namely, I think that if a company struggling with fixed capital costs could accomplish much more, much quicker, using just the salary expenses of the top researchers it already has, it would have done it, or at least given it a good try.
I'm at 5 percent that a serious switch to algorithms would result in AGI in 2 years. You might be better read than me on this, so I'm not quite taking side bets right now!
I think algorithmic progress is doing some heavy lifting in this model. If we had a future textbook on AGI we could probably build one, but AI is kinda famous for minor and simple things just not getting implemented despite all the parts being there.
See ReLU activations and sigmoid activations.
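For anyone who wants the one-liner version of why that "minor" swap mattered, here's a quick numeric sketch (my own numbers, nothing from the thread): sigmoid gradients vanish as inputs grow, ReLU's don't.

```python
import numpy as np

def sigmoid_grad(x):
    # Derivative of the sigmoid; shrinks toward zero as |x| grows (vanishing gradients).
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU; stays at 1 for any positive input.
    return 1.0 if x > 0 else 0.0

for x in [0.5, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.0f}")
```

At x = 10 the sigmoid gradient is about 0.00005 while ReLU's is still 1, which is the "simple thing sitting around unimplemented" story in miniature.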
If we're bottlenecked on algorithms alone, is there a reason that isn't a really bad bottleneck?
I haven't had warm receptions when critiquing points, which has frustratingly left me with a bad sense of when I'm being nasty, so if I sound thorny it's not my intent.
One place I think you might have misstepped is the FrontierMath questions: the quotes you've heard are almost certainly about Tier 3 questions, the hardest ones, meant for math researchers in training. The mid tier is for grad-student-level problems, and Tier 1 is bright-high-schooler to undergrad-level problems:
Tier 1: 25% of the test
Tier 2: 50% of the test
Tier 3: 25% of the test
o3 got 25%, probably a...
Maybe a dumb question, but those log-scale graphs have unevenly spaced ticks on the x-axis. Is there a reason they structured it like that beyond trying to draw a straight line? I suspect there is a good reason and it's not dishonesty, but this does look like something one would do if you wanted to exaggerate the slope.
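To illustrate what I mean (entirely made-up numbers, not their data): the same points plotted with one evenly spaced tick per data point versus a true log axis.

```python
import matplotlib.pyplot as plt

compute = [1e21, 3e21, 1e22, 1e23, 1e24, 1e25]  # made-up compute values
score   = [20, 28, 35, 45, 55, 62]              # made-up benchmark scores

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: every tick gets equal spacing regardless of the gap it represents.
ax1.plot(range(len(compute)), score, marker="o")
ax1.set_xticks(range(len(compute)))
ax1.set_xticklabels([f"{c:.0e}" for c in compute], rotation=45)
ax1.set_title("Evenly spaced ticks")

# Right: a true log axis, where a 3x gap and a 10x gap get proportionally different widths.
ax2.plot(compute, score, marker="o")
ax2.set_xscale("log")
ax2.set_title("True log scale")

plt.tight_layout()
plt.show()
```

On the left, the gap between 1e21 and 3e21 gets the same width as the gap between 1e24 and 1e25, which is exactly the kind of thing that can make a curve look straighter and steeper than it is.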
I do not have a gauge for how much I'm actually bringing to this convo, so you should weigh my opinion lightly, however:
I believe your third point kinda nails it. There are models for gains from collective intelligence (groups of agents collaborating), and the benefits of collaboration bottleneck hard on your ability to verify which outputs from the collective are the best; even then, the drop-off happens pretty quickly as more agents collaborate (toy sketch below).
10 people collaborating with no communication issues and accurate discrimination between good and bad ideas are...
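A toy version of that kind of model (my own made-up numbers, just to show the shape of the bottleneck): each agent independently produces a good idea with some probability, and a noisy judge tries to pick a good one out of the pool.

```python
import random

def trial(n_agents, p_good=0.3, p_judge_correct=0.7):
    # Each agent proposes an idea; True means the idea is actually good.
    ideas = [random.random() < p_good for _ in range(n_agents)]
    # A noisy judge labels each idea, then we pick one it believes is good.
    judged_good = [idea if random.random() < p_judge_correct else not idea for idea in ideas]
    picks = [i for i, good in enumerate(judged_good) if good]
    choice = random.choice(picks) if picks else random.randrange(n_agents)
    return ideas[choice]

def success_rate(n_agents, trials=20_000):
    return sum(trial(n_agents) for _ in range(trials)) / trials

for n in [1, 2, 5, 10, 50]:
    print(f"{n:3d} agents -> {success_rate(n):.2f}")
```

With a perfect judge the success rate climbs toward 1 as you add agents; with a 70% judge it flattens out around 0.5 almost immediately, which is the verification bottleneck I mean.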
I sense that my quality of communication diminishes past this point; I should get my thoughts together before speaking too confidently.
I believe you're right that we do something similar to the LLMs (loosely, analogously); see
https://www.lesswrong.com/posts/i42Dfoh4HtsCAfXxL/babble
(I need to learn markdown)
My intuition is still LLM-pessimistic, but I'd be excited to see good practical uses. This seems like tool AI, and that makes my existential dread easier to manage!
Thank you for the warm reply; it's nice, and it's also good feedback that I didn't do anything explicitly wrong with my post.
It will be VERY funny if this ends up being essentially the o1 model with some tinkering to help it cycle questions multiple times to verify the best answers, or something banal like that. Wish they didn't make us wait so long to test that :/
Small nudge: the questions have difficulty tiers of 25% easy, 50% medium, and 25% hard, with easy being undergrad/IMO difficulty and hard being the sort you would give to a researcher in training.
The 25% accuracy gives me a STRONG indication that it just got the easy ones, and the starkness of this cutoff makes me think there is something categorically different about the easy ones that makes them MUCH easier to solve: being more closed-ended, easier to verify, or just leaked into the dataset in some form.
First post, so feel free to meta-critique my rigor, as I am not sure what is mandatory, expected, or superfluous for a comment under a post. I'm studying computer science but have no degree yet. I can pull the specific citation if necessary, but...
these benchmarks don't feel genuine.
Chollet indicated in his piece:
...Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).
No, sorry, that's not a typo; it's a linguistic norm that I probably assumed was more common than it actually is.
The people I talk with and I have used the words "mumble" and "babble" to describe LLM reasoning. Sort of like human babble; see https://www.lesswrong.com/posts/i42Dfoh4HtsCAfXxL/babble