All of Petropolitan's Comments + Replies

I guess it was usually not worth bothering with prosecuting disobedience as long as it was rare. If ~50% of soldiers were refusing to follow these orders, surely the Nazi repression machine would have set up a process to effectively deal with them and solved the problem

Continuing the analogy to the Manhattan Project: They succeeded in keeping it secret from Congress, but failed at keeping it secret from the USSR.

To develop this (quite apt in my opinion) analogy, the reason why this happened is simple: some scientists and engineers wanted to do something so that no one country could dictate its will to everyone else. Whistleblowing project secrets to the Congress couldn't have solved this problem but spying for a geopolitical opponent did exactly that

In my experience, this is a common kind of failure with LLMs - if asked directly how to best solve a problem, they do know the answer. But if they aren't given that slight scaffolding, they totally fail to apply it.
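
A toy sketch of the kind of scaffolding I mean (assuming the OpenAI Python SDK; the model name and the `problem` string are placeholders): first ask for the method, then feed the model's own answer back so it actually applies it.

```python
from openai import OpenAI

client = OpenAI()
problem = "..."  # the task the model fumbles when asked to solve it cold

# Step 1: ask directly for the best approach -- models usually know this part.
plan = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[{"role": "user",
               "content": f"What is the best general method for solving this problem?\n\n{problem}"}],
).choices[0].message.content

# Step 2: hand the stated method back so the model applies it instead of improvising.
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": f"Using this method:\n\n{plan}\n\nNow solve the problem:\n\n{problem}"}],
).choices[0].message.content
print(answer)
```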

The recent release of o3 and o4-mini seems to indicate that diminishing returns from scaling are forcing OpenAI into innovating with scaffolding and tool use. As an example, they demonstrated o3 parsing an image of a maze with an image-processing library and then finding the solution programmatically with graph search: https://openai.com/index/thin... (read more)
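
Not OpenAI's actual pipeline, just a minimal sketch of that kind of approach using OpenCV and breadth-first search (the file name, entrance/exit positions and one-pixel-per-cell grid are assumptions for illustration):

```python
import cv2
import numpy as np
from collections import deque

img = cv2.imread("maze.png", cv2.IMREAD_GRAYSCALE)              # hypothetical input file
_, walls = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)  # walls -> 255, free space -> 0

start = (1, 1)                                          # assumed entrance near top-left
goal = (walls.shape[0] - 2, walls.shape[1] - 2)         # assumed exit near bottom-right

def bfs(free, start, goal):
    """Shortest path on the pixel grid; free[r, c] is True where the maze is open."""
    prev = {start: None}
    q = deque([start])
    while q:
        r, c = q.popleft()
        if (r, c) == goal:
            path, node = [], goal
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < free.shape[0] and 0 <= nc < free.shape[1]
                    and free[nr, nc] and (nr, nc) not in prev):
                prev[(nr, nc)] = (r, c)
                q.append((nr, nc))
    return None

path = bfs(walls == 0, start, goal)
print(f"path length: {len(path) if path else 'no path found'}")
```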

Almost all machinists I've talked to have (completely valid) complaints about engineers that understand textbook formulas and CAD but don't understand real world manufacturing constraints.

Telling a recent graduate to "forget what you have been taught in college" might happen in many industries but seems especially common in the manufacturing sector AFAIK

As Elon Musk likes to say, manufacturing efficiently is 10-100x more challenging than making a prototype. This involves proposing and evaluating multiple feasible approaches, designing effective workholding, selecting appropriate machines, and balancing complex trade-offs between cost, time, simplicity, and quality. This is the part of the job that's actually challenging.

And setting up quality control!

Swedish inventor and vlogger Simone Giertz recently published the following video elaborating on this topic in a funny and enjoyable way:

Since this see... (read more)

I think regionalisms are better approached systematically, as there is plenty of scientific literature on this and even a Wikipedia article with an overview: https://en.wikipedia.org/wiki/American_English_regional_vocabulary (same for accents https://en.wikipedia.org/wiki/North_American_English_regional_phonology but that might require a fundamental study of English phonology)

Training a LoRA has a negligible cost compared to pre-training a full model because it only involves changing 1.5% to 7% of the parameters (per https://ar5iv.labs.arxiv.org/html/2502.16894#A6.SS1) and training on only thousands to millions of tokens instead of trillions.
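
A minimal sketch (assuming the Hugging Face transformers and peft libraries, with GPT-2 as a stand-in base model) of how to check what fraction of parameters a LoRA actually trains:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for a larger base model
config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
# Prints trainable vs. total parameters: a fraction of a percent here, and still only
# a few percent with larger ranks or more target modules, in line with the figures above.
model.print_trainable_parameters()
```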

Serving different LoRAs for the same base model in large batches is also very much possible with current technology (even if not without some challenges), and OpenAI offers their finetuned models for just 1.5-2x the cost of the original ones: https://docs.titanml.co/conceptual-guides/gpu_mem_mangemen... (read more)

What makes you (and the author) think ML practitioners won't start finetuning/RL'ing on partial reasoning traces during the reasoning itself if that becomes necessary? Nothing in the current LLM architecture prevents that technically, and IIRC Gwern has stated he expects that to happen eventually

2tangerine
I’m glad you asked. I completely agree that nothing in the current LLM architecture prevents that technically and I expect that it will happen eventually. The issue in the near future is practicality, because training models is currently—and will in the near future still be—very expensive. Inference is less expensive, but still so expensive that profit is only possible by serving the model statically (i.e., without changing its weights) to many clients, which amortizes the cost of training and inference. These clients often rely heavily on models being static, because it makes its behavior predictable enough to be suitable for a production environment. For example, if you use a model for a chat bot on your company’s website, you wouldn’t want its personality to change based on what people say to it. We’ve seen that go wrong very quickly with Microsoft’s Twitter bot Tay.

It’s also a question whether you want your model to internalize new concepts (let’s just call it “continual learning”) based on everybody’s data or based on just your data. Using everybody’s data is more practical in the sense that you just update the one model that everybody uses (which is something that’s in a sense already happening when they move the cutoff date of the training data forward for the latest models), but it’s not something that users will necessarily be comfortable with. For example, users won't want a model to leak their personal information to others. There are also legal barriers here, of course, especially with proprietary data.

People will probably be more comfortable with a model that updates just on their data, but that’s not practical (yet) in the sense that you would need the compute resources to be cheap enough to run an entire, slightly different model for each specific use case. It can already be done to some degree with fine-tuning, but that doesn’t change the weights of the entire model (that would be prohibitively expensive with current technology) and I don’t thi

hire a bunch of random bright-ish people and get them to spin up LLM-wrapper startups in-house (so that you own 100% stake in them).

I doubt it's really feasible. These startups will require significant infusions of capital, so the AI companies' CEOs and CFOs will have a say in how they develop. But tech CEOs and CFOs have no idea how development in other industries works and why it is slow, so they will mismanage such startups.

P. S. Oh, and also I realized the other day: whether you are an AI agent or just a human, imagine the temptation to organize a Theranos-type fraud if details of your activity are mostly secret and you only report to tech bros believing in the power of AGI/ASI!

Google could still sell those if there's so much demand

Sell to who, competing cloud providers? Makes no sense, Lamborghini doesn't sell their best engines to Ferrari or vice versa!

Also, all this discussion is missing that inference is much easier than training, both hardware- and software-wise, while it has long been expected that at some point the market for the former would become comparable to, and then larger than, the market for the latter

Is it possible Meta just trained on bad data while Google and DeepSeek trained on good? See my two comments here: https://www.lesswrong.com/posts/Wnv739iQjkBrLbZnr/meta-releases-llama-4-herd-of-models?commentId=KkvDqZAuTwR7PCybB

gwern151

No, it would probably be a mix of "all of the above". FB is buying data from the same places everyone else does, like Scale (which we know from anecdotes like when Scale delivered FB a bunch of blatantly-ChatGPT-written 'human rating data' and FB was displeased), and was using datasets like books3 that are reasonable quality. The reported hardware efficiency numbers have never been impressive, they haven't really innovated in architecture or training method (even the co-distillation for Llama-4 is not new, eg. ERNIE was doing that like 3 years ago), and in... (read more)

I'm afraid you might have missed the core thesis of my comment, so let me reword: I'm arguing that one should not extrapolate findings from that paper to what Meta is training now.

The Llama 4 model card says the herd was trained on "[a] mix of publicly available, licensed data and information from Meta’s products and services. This includes publicly shared posts from Instagram and Facebook and people’s interactions with Meta AI": https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md To use a term from information theory, these posts proba... (read more)

3Vladimir_Nesov
Your point is one of the clues I mentioned that I don't see as comparably strong to the May 2023 paper, when it comes to prediction of loss/perplexity. The framing in your argument appeals to things other than the low-level metric of loss, so I opened my reply with focusing on it rather than the more nebulous things that are actually important in practice. Scaling laws work with loss the best (holding across many OOMs of compute), and repeating 3x rather than 7x (where loss first starts noticeably degrading) gives some margin of error. That is, a theoretical argument along the lines of what you are saying shifts my expectation for 10x-20x repetition (which might degrade faster when working with lower quality data), but not yet for 3x repetition (which I still expect to get an ~unchanged loss).

So far I haven't even seen anyone there notice that Behemoth means that Llama 4 was essentially canceled and instead we got some sort of Llama 3.5 MoE. That is, a 100K+ H100s training run that was the expected and announced crown jewel of Llama 4 won't be coming out, probably until at least late 2025 and possibly even 2026. Since Behemoth is the flagship model for Llama 4, a 3e26+ FLOPs model that would've been appropriate for a 100K H100s training system instead got pushed back to Llama 5.

As Behemoth is only a 5e25 FLOPs model, even once it comes out it won't be competing in the same weight class as GPT-4.5, Grok 3, and Gemini 2.5 Pro. Maverick is only a 2e24 FLOPs[1] model (2x less than DeepSeek-V3, ~100x less than the recent frontier models), so of course it's not very good compared to the frontier models. Since Meta didn't so far demonstrate competence on the level of DeepSeek or Anthropic, they do need the big compute to remain in the game, and Maverick is certainly not big compute.

(LocalLLaMA specifically is annoyed by absence of models with a small number of total parameters in the current Llama 4 announcement, which means you need high end consumer hardware to run

Muennighoff et al. (2023) studied data-constrained scaling on C4 up to 178B tokens, while Meta presumably included all the public Facebook and Instagram posts and comments. Even ignoring the two-OOM difference and the architectural dissimilarity (e.g., some experts might overfit earlier than the research on dense models suggests; perhaps routing should take that into account), common sense strongly suggests that training twice on, say, a Wikipedia paragraph must be much more useful than training twice on posts by Instagram models, and especially on the comments under those (which are often as like as two peas in a pod).

2Vladimir_Nesov
The loss goes down; whether that helps in some more legible way that also happens to be impactful is much harder to figure out. The experiments in the May 2023 paper show that training on some dataset and training on a random quarter of that dataset repeated 4 times result in approximately the same loss (Figure 4). Even 15 repetitions remain useful, though at that point somewhat less useful than 15 times more unique data. There is also some sort of double descent where loss starts getting better again after hundreds of repetitions (Figure 9 in Appendix D). This strongly suggests that repeating merely 3 times will robustly be about as useful as having 3 times more data from the same distribution. I don't know of comparably strong clues that would change this expectation.

Since physics separated from natural philosophy in the time of Newton, it has almost always[1] progressed when new experimental data uncovered deficiencies in the then-current understanding of the universe. During the Cold War, unprecedentedly large amounts of money were invested in experimental physics, and by the late 20th century all the reasonably low-hanging fruit had been picked (in the meantime the experiments have become absurdly expensive and difficult). I have also written on the topic at https://www.lesswrong.com/posts/CCnycGceT4HyDKDzK/a-history-of-... (read more)

I don't think pure mathematics makes a good parallel. There are still discoveries made by single mathematicians or very small research groups, but this hasn't really been the case in physics since about the mid-20th century, when the US and USSR invested lots of money in modern large-scale research done by huge groups

2RussellThor
Maths doesn't make an exact parallel but certainly fits in with my worldview. Let's say you view advanced physics as essentially a subfield of maths, which is not that much of an exaggeration given how mathematical string theory etc. is. If a subfield gets a lot of attention like physics has, then it gets pushed along the diminishing-returns curve faster. That means such single-person discoveries would be much harder in physics than in a given field of mathematics. The surface area of all mathematics is greater than that of just mathematical physics, so that is exactly what you would predict. Individual genius mathematicians can take a field that has been given less attention - the distance from beginner to state of the art is less than that for physics. They can then advance the state of the art.

Not just long context in general (that can be partially mitigated with RAG or even BM25/tf-idf search), but also nearly 100% factual accuracy on it, as I argued last week

https://simple-bench.com presents an example of a similar benchmark with tricky commonsense questions (such as counting ice cubes in a frying pan on the stove), also with a pretty similar leaderboard. It is sponsored by Weights & Biases and devised by the author of a good YouTube channel, who presents quite a balanced view on the topic there and doesn't appear to have a conflict of interest either. See https://www.reddit.com/r/LocalLLaMA/comments/1ezks7m/simple_bench_from_ai_explained_youtuber_really for independent opinions on this benchmark

2keltan
Bump to that YT channel too. Some of the most balanced AI news videos out there. Really appreciate the work they're doing.

Two months later I tried actually implementing a nontrivial conversion of a natural-language mathematical argument to a fully formalized Lean proof, in order to check whether I was indeed underestimating it (TBH, I had never tried a proof assistant before).

So I took a difficult integral from a recent MathSE question I couldn't solve analytically myself, had Gemini 2.5 Pro solve it 0-shot,[1] verified it numerically, set up a Lean environment in Google Colab and then asked if another instance of Gemini 2.5 could convert the solution into a proof. It ... (read more)

Aren't you supposed as a reviewer to first give the authors a chance to write a rebuttal and discuss it with them before making your criticism public?

1[comment deleted]

I think that, if someone reads a critical comment on a blog post, the onus is kinda on that reader to check back a few days later and see where the discussion wound up, before cementing their opinions forever. Like, people have seen discussions and disagreements before, in their life, they should know how these things work.

The OP is a blog post, and I wrote a comment on it. That’s all very normal. “As a reviewer” I didn’t sign away my right to comment on public blog posts about public papers using my own words.

When we think about criticism norms, there’s a... (read more)

One of the non-obvious but very important skills that all LLM-based SWE agents currently lack is reliably knowing which subtasks of a task they have successfully solved and which they have not. I think https://www.answer.ai/posts/2025-01-08-devin.html is a good case in point.
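
Purely illustrative (not any existing agent framework): the kind of explicit per-subtask verification loop that would give an agent ground truth about what it has and hasn't finished. The subtask names and pytest commands are hypothetical.

```python
import subprocess

# Hypothetical mapping from subtasks to commands that verify them.
subtasks = {
    "parser handles empty input": "pytest tests/test_parser.py::test_empty -q",
    "CLI exits non-zero on bad flags": "pytest tests/test_cli.py::test_bad_flags -q",
}

def verify(checks: dict[str, str]) -> dict[str, bool]:
    """Run each check and record which subtasks are actually solved."""
    results = {}
    for name, cmd in checks.items():
        proc = subprocess.run(cmd, shell=True, capture_output=True)
        results[name] = proc.returncode == 0
    return results

for name, ok in verify(subtasks).items():
    print(("DONE  " if ok else "TODO  ") + name)
```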

We have absolutely seen a lot of progress in driving down hallucinations on longer and longer contexts with model scaling; that progress probably made the charts above possible in the first place. However, recent research (e.g., the NoLiMa benchmark from last month https://arxiv.org/html/2502.05167... (read more)

6nostalgebraist
KV caching (using the terminology "fast decoding" and "cache") existed even in the original "Attention is All You Need" implementation of an enc-dec transformer.  It was added on Sep 21 2017 in this commit.  (I just learned this today, after I read your comment and got curious.) The "past" terminology in that original transformers implementation of GPT-2 was not coined by Wolf – he got it from the original OpenAI GPT-2 implementation, see here.
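
For readers who haven't looked at the mechanism itself, a minimal sketch of KV caching with the Hugging Face transformers API (greedy decoding, GPT-2 as a stand-in; the exact cache types differ between library versions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("KV caching lets decoding reuse", return_tensors="pt").input_ids

with torch.no_grad():
    # First pass: compute keys/values for the whole prompt and cache them (the "past").
    out = model(ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)

    # Subsequent steps feed only the newest token plus the cache.
    for _ in range(5):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
```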

To be honest, what I originally implied was that these founders develop their products with low-quality code, as cheap and dirty as they can, and without any long-term planning for further development

2Gunnar_Zarncke
I agree that is what usually happens in startups and AI doesn't change that principle, only the effort of generating it faster.

Perhaps this says more about Y Combinator nowadays than about LLM coding

2Gunnar_Zarncke
In which sense? In the sense of them hyping AI? That seems kind of unlikely, or would at least be surprising to me, as they don't look at tech - they don't even look so much at products but at founder qualities - relentlessly resourceful - and that apparently hasn't changed according to Paul Graham. Where do you think he or YC is mistaken?

Aristotle argued (and I support his view) at the beginning of Book II of the Nicomachean Ethics that virtues are just like skills: they are acquired in life by practice and imitation of others. It is perhaps not a coincidence that a philosophical article on the topic used "Reinforcement" in one of its subheadings. I also attach a 7-minute video for those who prefer a voice explanation:

For this reason, practice ethical behavior even with LLMs and you will enjoy doing the same with people

4Mo Putera
I agree that virtues should be thought of as trainable skills, which is also why I like David Gross's idea of a virtue gym: Conversations with LLMs could be the "home gym" equivalent I suppose.

Another example is that going from the first in-principle demonstration of chain-of-thought to o1 took two years

The correct date for the first demonstration of CoT is actually ~July 2020, soon after the GPT-3 release; see the related work review here: https://ar5iv.labs.arxiv.org/html/2102.07350

2Kaj_Sotala
Thanks!

When general readers see "empirical data bottlenecks" they expect something like a couple times better resolution or several times higher energy. But when physicists mention "wildly beyond limitations" they mean orders of magnitude more!

I looked up the actual numbers:

  • in this particular case we need to approach the Planck energy, which is ~1.22×10^28 eV; Wolfram Alpha readily suggests that's ~540 kWh, 0.6 of the energy use of a standard clothes dryer or 1.3 times the energy of a typical lightning bolt; I also calculated it's about 1.2 times the muzzle energy of the
... (read more)
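
For those who want to check the arithmetic, a quick sketch using CODATA constants from scipy.constants (Planck energy E_P = sqrt(ħc⁵/G); values rounded):

```python
from scipy.constants import hbar, c, G, e

E_planck = (hbar * c**5 / G) ** 0.5   # Planck energy in joules, ~1.96e9 J
print(E_planck / e)                   # ~1.22e28 eV
print(E_planck / 3.6e6)               # ~543 kWh, since 1 kWh = 3.6e6 J
```
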
PetropolitanΩ14-2

On the other hand, frontier math (pun intended) is much worse funded than biomedicine, because most PhD-level math has barely any practical applications worth spending many man-hours of high-IQ mathematicians on (which often makes them switch careers, you know). So, I would argue, if the productivity of math postdocs armed with future LLMs rises by, let's say, an order of magnitude, they will be able to attack more laborious problems.

Not that I expect it to make much difference to the general populace or even the scientific community at large though

general relativity and quantum mechanics are unified with a new mathematical frame

The problem is not to invent a new mathematical frame; there are plenty already. The problem is that we don't have any experimental data whatsoever to choose between them, because quantum gravity effects are expected to become relevant only at energy scales wildly beyond current or near-future technological limitations. This has led to a situation where quantum gravity research has become largely detached from experimental physics, and AI can do nothing about that. Sabine Hossenfelder has made quite a few explainers (sometimes quite angry ones) about it

2Noosphere89
This, but I will caveat that weaker goals relating to this, for example getting data on whether gravity is classical or quantum at all (ignoring the specific theory) might become possible by 2040. I agree this particular part is unrealistic, given the other capabilities implied.

The third scenario doesn't actually require any replication of CUDA: if Amazon, Apple, AMD and other companies making ASICs commoditize inference but Nvidia retains its moat in training, then with inference scaling and algorithmic efficiency improvements training will inevitably become a much smaller portion of the market

It's a bit of a separate topic and not what was discussed in this thread previously, but I will try to answer.

I assume it's because Nvidia's moat is in CUDA and in chips with high RAM bandwidth optimized specifically for training, while competition in inference software and hardware (where the weights are static) is already higher, and will be higher still by the time DeepSeek's optimizations become a de facto industry standard and induce some additional demand

I don't think the second point is at all relevant here, while the first one is worded so that it might imply something on the scale of "AI assistant convinces a mentally unstable person to kill their partner and themselves"—not something that would be perceived as a warning shot by the public IMHO (have you heard there were at least two alleged suicides driven by GPT-J 6B? The public doesn't seem to be bothered https://www.vice.com/en/article/man-dies-by-suicide-after-talking-with-ai-chatbot-widow-says/ https://www.nytimes.com/2024/10/23/technology/characterai-l... (read more)

This is a scenario I have been thinking about for perhaps three years. However, you made an implicit assumption I wish were explicit: there is no warning shot.

I believe that with such a slow takeoff there is a very high probability of an AI alignment failure causing significant loss of life already at the TAI stage and that would significantly change the dynamics

2Marius Hobbhahn
There are two sections that I think make this explicit:
1. No failure mode is sufficient to justify bigger actions.
2. Some scheming is totally normal.
My main point is that even things that would seem like warning shots today, e.g. severe loss of life, will look small in comparison to the benefits at the time, thus not providing any reason to pause.

This seems to be the line of thinking behind the market reaction which has puzzled many people in the ML space. Everyone's favorite response to this thesis has been to invoke the Jevons paradox https://www.lesswrong.com/posts/HBcWPz82NLfHPot2y/jevon-s-paradox-and-economic-intuitions. You can check https://www.lesswrong.com/posts/hRxGrJJq6ifL4jRGa/deepseek-panic-at-the-app-store or listen to this less technical explanation from Bloomberg:

Basically, the mistake in your analogy is that demand for the drug is limited and quite inelastic while the demand for AI... (read more)
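
A toy constant-elasticity calculation (numbers entirely made up) of that point: a 10x price cut shrinks total spending when demand is inelastic, but grows it when demand is elastic enough, which is the Jevons-style claim being made for AI.

```python
def total_spend(price_cut: float, elasticity: float, base_spend: float = 100.0) -> float:
    """Constant-elasticity demand: quantity scales as price_cut**elasticity when price falls by price_cut."""
    quantity_factor = price_cut ** elasticity
    return base_spend * quantity_factor / price_cut

print(total_spend(10, 0.3))  # inelastic demand (the drug analogy): spend falls to ~20
print(total_spend(10, 1.5))  # elastic demand (the claim for AI tokens): spend grows to ~316
```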

1Kajus
I don't get it. Nvidia chips were still used to train DeepSeek. Why would Nvidia take a hit?
3yo-cuddles
Also, Amodei needs to cool it. There's a reading of the things he's been saying lately that could be taken as sane but a plausible reading that makes him look like a buffoon. Credibility is a scarce resource
4yo-cuddles
I feel like this comes down a lot to intuition. All I can say is gesture at the thinning distance between marginal cost and prices, wave my hand in the direction of discount rates and the valuation of OpenAI, and ask... are you sure?

The demand curve on this seems textbook inelastic at current margins. Slashing the price of milk by 10x would have us cleaning our driveways with it; slashing the price of eggs would have us using crushed eggshells as low-grade building material. A 10x decrease in the price per token of AI is barely even noticed; in fact, in some markets outside of programming, consumer interest is down during that same window. This is an example of a low-margin good with little variation in quality descending into a price war.

Maybe LLMs have a long way left to grow and can scale to AGI (maybe, maybe not), but if we're looking just at the market, this doesn't look like something Jevons paradox applies to at all. People are just saying words, and if you switched out Jevon for piglet they'd make as much sense imo.

The proposal just seems ridiculous to me, right? Who right now is standing on the sidelines with a killer AI app that could rip up the market if only tokens were a bit cheaper? There isn't anyone; the bottleneck is and always has been quality, the ability for LLMs to be less-wrong-so-dang-always. Jevons paradox seems to be filling the role of a magic word in these conversations; it's invoked despite being out of place.

Sorry if this is invective at all; you're mostly explaining a point of view, so I'm not frustrated in your direction, but people are making little sense to me right now.

Even if LMMs (you know, LLMs sensu stricto can't teach kids to read and write) are able to do all the primary work of teachers, some humans will have to oversee the process, because as soon as a dispute between a student and an AI teacher arises, e.g. about grades or because the child is unwilling to study, parents will inherently distrust the AI and require the intervention of a qualified human teacher.

Also, since richer parents are already paying for a more pleasant education experience in private schools (often but not always organized according to the Montessori method), I... (read more)

Math proofs are math proofs, whether they are in plain English or in Lean. Contemporary LLMs are very good at translation, not just between high-resource human languages but also between programming languages (transpiling), from code to human (documentation) and even from algorithms in scientific papers to code. Thus I wouldn't expect formalizing math proofs to be a hard problem in 2025.
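
To make the translation target concrete, here is a trivial Lean 4 example of the kind of statement a plain-English argument gets mapped onto (a real formalization of, say, a MathSE integral is of course vastly longer):

```lean
-- A toy formalization target: commutativity of natural-number addition,
-- discharged by a lemma from Lean's core library.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```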

However I generally agree with your line of thinking. As wassname wrote above (it's been quite obvious for some time but they link to a quantitative analysis), good in-sili... (read more)

4LGS
I have no opinion about whether formalizing proofs will be a hard problem in 2025, but I think you're underestimating the difficulty of the task ("math proofs are math proofs" is very much a false statement for today's LLMs, for example). In any event, my issue is that formalizing proofs is very clearly not involved in the o1/o3 pipeline, since those models make so many formally incorrect arguments. The people behind FrontierMath have said that o3 solved many of the problems using heuristic algorithms with wrong reasoning behind them; that's not something a model trained on formally verified proofs would do. I see the same thing with o1, which was evaluated on the Putnam and got the right answer with a wrong proof on nearly every question.

I think LGS proposed a much simpler explanation in terms of an assistant simulacrum inside a token-predicting shoggoth

Petropolitan*Ω010

MAGMA also has the model check its own work, but the model notices that the work it is checking is its own and doesn’t flag it.

Why would anyone give such a responsibility to an untrusted model without oversight? Already in December last year, Greenblatt et al. demonstrated which techniques alignment researchers could use to control a high-capability untrusted model (and Robert Miles did a good video on it recently).

It doesn't currently look plausible that any model (or any human for that matter) would be able to distinguish between its own work it c... (read more)