All of yo-cuddles's Comments + Replies

No, sorry, that's not a typo; it's a linguistic norm that I probably assumed was more common than it actually is.

The people I talk with and I have used the words "mumble" and "babble" to describe LLM reasoning. Sort of like human babble; see https://www.lesswrong.com/posts/i42Dfoh4HtsCAfXxL/babble

There's an improvement in LLMs I've seen that is important but has wildly inflated people's expectations beyond what's reasonable:

LLMs have hit a point in some impressive tests where they no longer reliably fail past the point of being unrecoverable. They are conservative enough that they can do search on a problem, failing a million times until they mumble into an answer.

I'm going to try writing something of at least not-embarrassing quality about my thoughts on this, but I am really confused by people's hype around this sort of thing; it feels like directed randomness.
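For what it's worth, here is a deliberately crude sketch of the picture in my head (a toy framing of my own, not a claim about how these models actually search):

```python
import random

def mumble_into_an_answer(propose, verify, max_attempts=1_000_000):
    """Crude picture of 'directed randomness': keep sampling cheap, noisy
    guesses and let an external check decide when one finally sticks.
    No single failure is fatal, so enough attempts eventually land."""
    for attempt in range(1, max_attempts + 1):
        candidate = propose()       # the 'babble'
        if verify(candidate):       # recoverable failure: just try again
            return candidate, attempt
    return None, max_attempts

# Hypothetical toy usage: guessing a 4-digit code by blind sampling.
secret = 4271
answer, tries = mumble_into_an_answer(
    propose=lambda: random.randint(0, 9999),
    verify=lambda c: c == secret,
)
print(answer, tries)
```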

2Capybasilisk
> mumble into an answer

Typo, I presume.

Gotcha, you didn't sound OVERconfident, so I assumed it was much less than certain. Still refreshingly concrete!

Ah, okay.

I'll throw in my moderately strong disagreement for future Bayes points. Respect for the short-term, unambiguous prediction!

7Daniel Kokotajlo
TBC, I'm at "Probably," not "Definitely." My 50% mark is in 2028 now, so I have a decent amount of probability mass (maybe 30%?) stretching across the 2030s.

This is not going to be a high quality answer, sorry in advance.

I noticed this with someone in my office who is learning robotic process automation: people are very bad at measuring their productivity; they are better at seeing certain kinds of gains and certain kinds of losses than others. I know someone who swears emphatically that he is many times as productive but has become almost totally unreliable. He's in denial over it, and a couple of people now have openly told me they try to remove him from workflows because of all the problems he causes.

I think the situation is... (read more)

By "solve", what do you mean? Like, provably secure systems, create a AAA game from scratch, etc?

I feel like any system that could do that would implicitly have what the OP says these systems might lack, but you seem to half agree with them. Am I misunderstanding something?

By "Solve" I mean "Can substitute for a really good software engineer and/or ML research engineer" in frontier AI company R&D processes. So e.g. instead of having teams of engineers led by a scientist, they can (if they choose) have teams of AIs led by a scientist.

Definitely! However, there is more money and "hype" in the direction of wanting these to scale into AGI.

Hype and anti-hype don't cancel each other out. If someone invests a billion dollars into LLMs, someone else can't spend negative one billion to cancel it out: the billion-dollar spender is the one moving markets and getting a lot of press attention.

We have Yudkowsky going on Destiny, I guess?

I think there's some miscommunication here, on top of a fundamental disagreement on whether more compute takes us to AGI.

On the miscommunication: we're not talking about the falling cost per FLOP, we're talking about a world where OpenAI either does or does not have a price war eating its margins.

On the fundamental disagreement: I assume you don't take very seriously the idea that AI labs are seeing a breakdown of scaling laws? No problem if so; reality should resolve that disagreement relatively soon!

This is actually a good use case: it fits what GPT does well, and it's one where very cheap tokens help!

Pending some time for people to pick at it and test its limits, this might be really good. My instinct is that legal research, case law, etc. will be the test of how good it is; if it does well, this might be its foothold into real commercial use that actually generates profit.

My prediction is that we will be glad this exists. It will not be "PhD level", a phrase which defaces all who utter it, but it will save some people a lot of time and effort.

Where I think we d... (read more)

Also, Amodei needs to cool it. There's a reading of the things he's been saying lately that could be taken as sane, but also a plausible reading that makes him look like a buffoon. Credibility is a scarce resource.

I feel like this comes down a lot to intuition. All I can say is gesture at the thinning gap between marginal cost and prices, wave my hand in the direction of discount rates and the valuation of OpenAI, and ask... are you sure?

The demand curve on this seems textbook inelastic at current margins. Slashing the price of milk by 10x would have us cleaning our driveways with it; slashing the price of eggs would have us using crushed eggshells as low-grade building material. A 10x decrease in the price per token of AI is barely even noticed, in fact in some... (read more)

3jacob_cannell
The single prime causative factor driving the explosive growth in AI demand/revenue is and always has been the exponential reduction in $/FLOP via Moore's law, which is simply Jevons paradox manifested. With more compute everything is increasingly easy and obvious; even idiots can create AGI with enough compute.
7gwern
OpenAI's Deep Research is looking like something that could be big and they were standing on the sidelines in part because the tokens weren't cheap.

I'm kinda the opposite on the timelines thing? This is probably a timeline delayer even if I thought LLMs scaled to AGI, which I don't, but let's play along.

If a pharma company could look at another company's product, copy it, and release it for free with no consequences, but the copied product itself could only be marginally improved without massive investment, what does that do to the landscape?

It kills the entire industry. This HURTS anyone trying to fundraise; reckless billions will have a harder time finding their way into the hands of developers ... (read more)

7Petropolitan
This seems to be the line of thinking behind the market reaction which has puzzled many people in the ML space. Everyone's favorite response to this thesis has been to invoke the Jevons paradox: https://www.lesswrong.com/posts/HBcWPz82NLfHPot2y/jevon-s-paradox-and-economic-intuitions. You can check https://www.lesswrong.com/posts/hRxGrJJq6ifL4jRGa/deepseek-panic-at-the-app-store or listen to a less technical explanation from Bloomberg.

Basically, the mistake in your analogy is that demand for the drug is limited and quite inelastic, while the demand for AI (or basically most kinds of software) is quite elastic and potentially unlimited.

I absolutely agree with the comparison of o3 at ARC-AGI/FrontierMath to brute forcing, but with algorithmic efficiency improvements that million-dollar-per-run cost is expected to gradually decrease, first becoming competitive with highly skilled human labor and then perhaps even outcompeting it. The timelines depend a lot on when (if ever) these improvements plateau. The industry doesn't expect it to happen soon; cf. D. Amodei's comments on their speed actually accelerating: https://www.lesswrong.com/posts/BkzeJZCuCyrQrEMAi/dario-amodei-on-deepseek-and-export-controls

Can you make some noise in the direction of the shockingly low numbers it gets on the early ARC-AGI-2 benchmarks? This feels like pretty open-and-shut proof that it doesn't generalize, no?

The fact that the model was trained on 75 percent of the training set feels like they jury-rigged a test set and RL'd the thing to success. If the <30% score on the second test ends up being true, I feel like that should shift our guesses at what it's actually doing heavily away from genuine intelligence and towards a brute-force search for verifiable answers.

The frontier t... (read more)

Thanks for the reply! Still trying to learn how to disagree properly, so let me know if I cross into being nasty at all:

I'm sure they've gotten better. o1 probably improved more from its heavier use of intermediate logic, compute/runtime, and such, but that said, at least up until 4o it looks like there have been improvements in the model itself; they've been getting better.

They can do incredible stuff in well-documented processes but don't survive well off the trodden path. They seem to string things together pretty well, so I don't know if I would say there's ... (read more)

I would say that, barring strong evidence to the contrary, this should be assumed to be memorization.

I think that's useful! LLMs obviously encode a ton of useful algorithms and can chain them together reasonably well.

But I've tried to get those bastards to do something slightly weird and they just totally self-destruct.

But let's just drill down to demonstrable reality: if past SWE benchmarks were correct, these things should be able to do incredible amounts of work more or less autonomously, and yet all the LLM SWE replacements we've seen have stuck to high... (read more)

It's been pretty clear to me as someone who regularly creates side projects with AI that the models are actually getting better at coding.

Also, it's clearly not pure memorization; you can deliberately give them tasks that have never been done before and they do well.

However, even with agentic workflows, RAG, etc., all existing models seem to fail at some moderate level of complexity: they can create functions and prototypes but have trouble keeping track of a large project.

My uninformed guess is that o3 actually pushes that complexity ceiling by some non-trivial amount, but not enough to take on complex projects.

Hmm, mixed agree/disagree. Scale probably won't work, algorithms probably would, but I don't think it's going to be that quick.

Namely, I think that if a company struggling with fixed capital costs could accomplish much more, much quicker, using just the salary expenses of the top researchers they already have, they'd have done it, or at least given it a good try.

I'm at 5 percent that a serious switch to algorithms would result in AGI in 2 years. You might be better read than me on this, so I'm not quite taking side bets right now!

I think algorithmic progress is doing some heavy lifting in this model. If we had a future textbook on AGI we could probably build one, but AI is kinda famous for minor and simple things just not being implemented despite all the parts being there.

See ReLU activations and sigmoid activations.

If we're bottlenecking on algorithms alone, is there a reason that isn't a really bad bottleneck?

3Nathan Helm-Burger
See my other response to Raphael elsewhere in this comment thread. My model is that the big AI labs are currently throttling algorithmic progress by choosing to devote their resources to scaling. If scaling leads to AGI, we get AGI soon that way. (I give this about 20% chance.) If scaling doesn't lead to AGI, then refocusing resources on experimentation seems like a natural next move. (I think this is about 80% likely to work in under two years if made a major focus of resources, including both giving human researchers the time, encouragement, and compute resources they need, and developing increasingly helpful AI researchers.)

I haven't had warm receptions when critiquing points, which has frustratingly left me with bad detection for when I'm being nasty, so if I sound thorny it's not my intent.

Somewhere I think you might have misstepped is on the FrontierMath questions: the quotes you've heard are almost certainly about Tier 3 questions, the hardest ones, meant for math researchers in training. The mid tier is grad-student-level problems, and Tier 1 is bright-high-schooler to undergrad-level problems.

Tier 1: 25% of the test

Tier 2: 50% of the test

Tier 3: 25% of the test

O3 got 25%, probably a... (read more)

Ah wait, I was reading it wrong. I thought each tick was an order of magnitude; that looks to be standard notation for log scale. Mischief managed.

Maybe a dumb question, but those log-scale graphs have uneven ticks on the x-axis. Is there a reason they structured it like that beyond trying to draw a straight line? I suspect there is a good reason and it's not dishonesty, but this does look like something one would do if one wanted to exaggerate the slope.

4Aaron_Scher
I believe this is standard/acceptable for presenting log-axis data, but I'm not sure. This is a graph from the Kaplan paper. It is certainly frustrating that they don't label the x-axis. Here's a quick conversation where I asked GPT-4o to explain.

You are correct that a quick look at this graph (where you don't notice the log scale) would imply (highly surprising and very strong) linear scaling trends. Scaling laws are generally very sub-linear, in particular often following a power law. I don't think they tried to mislead about this; instead, this is a domain where log-scaling axes is super common and doesn't invalidate the results in any way.
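If it helps, here's a tiny matplotlib sketch (toy numbers of my own, not the Kaplan data) of why the same strongly sub-linear curve looks like a straight line once the x-axis is log-scaled, which is also why the tick spacing looks uneven:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy scaling curve: performance grows roughly with log(compute),
# i.e. strongly sub-linear in raw compute (made-up numbers).
compute = np.logspace(18, 26, 60)               # "FLOPs", 8 orders of magnitude
performance = 0.1 * (np.log10(compute) - 18)    # flattens badly on a linear axis

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

ax_lin.plot(compute, performance)
ax_lin.set_title("linear x-axis: diminishing returns are obvious")

ax_log.plot(compute, performance)
ax_log.set_xscale("log")                        # standard for scaling-law plots
ax_log.set_title("log x-axis: same data looks like a straight line")

for ax in (ax_lin, ax_log):
    ax.set_xlabel("compute")
    ax.set_ylabel("performance")

plt.tight_layout()
plt.show()
```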

I do not have a gauge for how much I'm actually bringing to this convo, so you should weigh my opinion lightly. However:

I believe your third point kinda nails it. There are models for gains from collective intelligence (groups of agents collaborating), and the benefits of collaboration bottleneck hard on your ability to verify which outputs from the collective are the best; even then, the dropoff happens pretty quickly as more agents collaborate.
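For concreteness, here's the kind of toy model I have in mind (a throwaway sketch of my own, with made-up numbers, not any particular model from the literature): group performance saturates quickly, and the ceiling is set by how well you can discriminate good outputs from bad ones.

```python
import random

def group_success(num_agents, p_good=0.2, discrimination=0.9, trials=50_000):
    """Toy collective-intelligence model: each agent proposes one idea, good
    with probability p_good. A reviewer then picks one proposal, choosing a
    good one (if any exist) with probability `discrimination`, otherwise
    picking at random. Returns how often the group ends up with a good idea."""
    successes = 0
    for _ in range(trials):
        ideas = [random.random() < p_good for _ in range(num_agents)]
        if any(ideas):
            if random.random() < discrimination:
                successes += 1                     # reviewer spots a good idea
            else:
                successes += random.choice(ideas)  # reviewer grabs one blindly
    return successes / trials

# Gains flatten out quickly, and the ceiling is set by the reviewer's accuracy.
for n in (1, 2, 5, 10, 20, 50):
    print(f"{n:>2} agents: {group_success(n):.3f}")
```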

10 people collaborating with no communication issues and accurate discrimination between good and bad ideas are... (read more)

I sense that my quality of communication diminishes past this point; I should get my thoughts together before speaking too confidently.

I believe you're right that we do something similar to the LLMs (loosely, analogously); see

https://www.lesswrong.com/posts/i42Dfoh4HtsCAfXxL/babble

(I need to learn markdown)

My intuition is still LLM-pessimistic, but I'd be excited to see good practical uses; this seems like tool AI, and that makes my existential dread easier to manage!

Thank you for the warm reply; it's nice, and also good feedback that I didn't do anything explicitly wrong with my post.

It will be VERY funny if this ends up being essentially the o1 model with some tinkering to help it cycle questions multiple times to verify the best answers, or something banal like that. Wish they didn't make us wait so long to test that :/

5Lukas_Gloor
Well, the update for me would go both ways. On one side, as you point out, it would mean that the model's single-pass reasoning did not improve much (or at all).

On the other side, it would also mean that you can get large performance and reliability gains (on specific benchmarks) by just adding simple stuff. This is significant because you can do this much more quickly than the time it takes to train a new base model, and there's probably more to be gained in that direction – similar tricks we can add by hardcoding various "system-2 loops" into the AI's chain of thought and thinking process.

You might reply that this only works if the benchmark in question has easily verifiable answers. But I don't think it is limited to those situations. If the model itself (or some subroutine in it) has some truth-tracking intuition about which of its answer attempts are better/worse, then running it through multiple passes and trying to pick the best ones should get you better performance even without easy and complete verifiability (since you can also train on the model's guesses about its own answer attempts, improving its intuition there). Besides, I feel like humans do something similar when we reason: we think up various ideas and answer attempts and run them by an inner critic, asking "is this answer I just gave actually correct/plausible?" or "is this the best I can do, or am I missing something?" (I'm not super confident in all the above, though.)

Lastly, I think the cost bit will go down by orders of magnitude eventually (I'm confident of that). I would have to look up trends to say how quickly I expect $4,000 in runtime costs to go down to $40, but I don't think it's all that long. Also, if you can do extremely impactful things with some model, like automating further AI progress on training runs that cost billions, then willingness to pay for model outputs could be high anyway.
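To illustrate that "pick the best of several attempts with an imperfect inner critic" idea with a throwaway sketch (made-up numbers of my own, not a claim about how o1/o3 actually select answers): even a noisy critic that is only correlated with true answer quality lets best-of-N sampling beat a single pass, with diminishing returns as N grows.

```python
import random

def best_of_n(n, critic_noise=1.0, trials=20_000):
    """Toy model: each answer attempt has a true quality drawn from N(0, 1).
    An imperfect 'inner critic' scores it as quality + N(0, critic_noise),
    and we keep the attempt with the highest score. Returns the average
    true quality of the selected attempt."""
    total = 0.0
    for _ in range(trials):
        attempts = [random.gauss(0, 1) for _ in range(n)]
        scores = [q + random.gauss(0, critic_noise) for q in attempts]
        best = max(range(n), key=lambda i: scores[i])
        total += attempts[best]
    return total / trials

# Even a critic whose noise is as large as the real quality spread beats a
# single pass, with diminishing returns as n grows.
for n in (1, 2, 4, 8, 16, 32):
    print(f"best-of-{n:<2}: avg quality {best_of_n(n):.2f}")
```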

Small nudge: the questions have difficulty tiers of 25% easy, 50% medium, and 25% hard, with easy being undergrad/IMO difficulty and hard being the sort you would give to a researcher in training.

The 25% accuracy gives me STRONG indications that it just got the easy ones, and the starkness of this cutoff makes me think there is something categorically different about the easy ones that makes them MUCH easier to solve, either being more closed-ended, easy to verify, or just leaked into the dataset in some form.
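A back-of-the-envelope check of that reading, using the tier weights above (this is only a consistency check I'm adding, not proof; other mixes of partial credit across tiers could also land near 25%):

$$\text{score} = 0.25\,e + 0.50\,m + 0.25\,h, \qquad (e, m, h) = (1, 0, 0) \;\Rightarrow\; \text{score} = 25\%$$

where $e$, $m$, $h$ are the solve rates on the easy, medium, and hard tiers.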

First post; feel free to meta-critique my rigor, as I am not sure what is mandatory, expected, or superfluous for a comment under a post. I'm studying computer science but have no degree yet. I can pull the specific citation if necessary, but...

these benchmarks don't feel genuine.

Chollet indicated in his piece:

Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score ov

... (read more)
5Zach Stein-Perlman
Welcome! To me the benchmark scores are interesting mostly because they suggest that o3 is substantially more powerful than previous models. I agree we can't naively translate benchmark scores to real-world capabilities.