I don't understand. LessWrong is a discussion forum, and there are many discussion forums.
One difference is that LessWrong is a major discussion forum that isn't ad-supported and doesn't have the bad incentives that come from ads. This is directly connected to their need for donations.
Something I've been thinking about recently is that sometimes humans say things about themselves that are literally false but contain information about their internal states or how their body works. Like someone might tell you that they're feeling light-headed, which seems implausible (why would their head suddenly weigh less?), but they've still conveyed real information about the sugar or oxygen content of their blood.
Doctors run into this all the time, where a patient might say that they feel like they're having a heart attack but their heart is fine (panic attack?), or something is stuck in their throat but there's nothing there (GERD?), or that they can't breathe but their oxygen level is normal (cardiac problems?).
So we should be careful not to assume that "the thing the LLM said is literally false" means "the LLM isn't conveying information about its experiences".
I see this as roughly analogous to the fact that humans cannot introspect on thoughts they have yet to have.
I think this is more about causal masking (which we do on purpose for the reasons you mention)?
I was thinking about how LLMs are limited in the sequential reasoning they can do "in their head", and once it's not in their head, it's not really introspection.
For example, if you ask an LLM a question like "Who was the sister of the mother of the uncle of ... X?", every step of this necessarily requires at least one layer in the model and an LLM can't[1] do this without CoT if it doesn't have enough layers.
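To make the serial dependency concrete, here's a toy sketch (the names, relations, and `resolve` helper are all invented for illustration; this isn't what a transformer literally computes). The point is just that each hop's lookup key is the previous hop's answer, so the hops can't be collapsed into one parallel step:

```python
# Toy illustration only: names, relations, and resolve() are made up.
# Each lookup's key is the previous lookup's answer, so the hops form a
# serial chain of data dependencies.
relations = {
    ("X", "uncle"): "Bob",
    ("Bob", "mother"): "Carol",
    ("Carol", "sister"): "Dana",
}

def resolve(person, chain):
    """Follow a chain of relations, e.g. ["uncle", "mother", "sister"]."""
    for rel in chain:                      # step k needs step k-1's result
        person = relations[(person, rel)]
    return person

print(resolve("X", ["uncle", "mother", "sister"]))  # -> "Dana"
```

Each iteration of that loop is the step that needs at least one layer, and the only way around the dependency is the enumerate-everything approach in the footnote.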
It's harder to construct examples that can't be written to chain of thought, but a question of the form "What else did you think the last time you thought about X?" would require this (or "What did you think about our conversation about X's mom?"), and CoT doesn't help, since reading its own outputs and making assumptions from them isn't introspection[2].
It's unclear how much of a limitation this really is: in many cases CoT could reduce the complexity of the query, and it's also unclear how well humans can do this, though there's plausibly more thought going on in our heads than what shows up in our internal dialogs[3].
I guess technically an LLM could parallelize this question by considering the answer for every possible X and every possible path through the relationship graph, but that model would be implausibly large.
I can read a diary and say "I must have felt sad when I wrote that", but that's not the same as remembering how I felt when I wrote it.
Especially since some people claim not to think in words at all. Also, some mathematicians claim to be able to imagine complex geometry and reason about it in their heads.
The part that felt like clickbait was that the summary ends right before the interesting part.
It did also feel like a bait-and-switch though, since the title implies something scarier than "AIs prioritized crop yield over minor injuries 5% of the time".
I agree with this, and I think LLMs already do this for non-filler tokens. A sufficiently deep LLM could wait and do all of its processing right as it generates a token, but in practice they start thinking about a problem as they read it.
For example, if I ask an LLM "I have a list of [Paris, Washington DC, London, ...], what country is the 2nd item in the list in?", the LLM has likely already parallelized the country lookup while it was reading the city names. If you prevent the LLM from parallelizing ("List the top N capitals by country size in your head and output the 23rd item in the list"), it will do much worse, even though the serial depth is small.
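As a rough sketch of the two dependency shapes (the city-to-country mapping and country sizes below are hard-coded toy data, not anything the model has): in the first prompt every per-city lookup is independent, while in the second the rank of any single item depends on comparing against all the others.

```python
# Made-up data purely to show the dependency structure of the two prompts.
capital_to_country = {"Paris": "France", "Washington DC": "USA", "London": "UK"}
country_size = {"USA": 9_834, "France": 644, "UK": 244}  # rough areas, 1000 km^2

# Shape 1: "what country is the 2nd item in?" -- each lookup is independent,
# so they can all be resolved in parallel while reading, then one is selected.
cities = ["Paris", "Washington DC", "London"]
countries = [capital_to_country[c] for c in cities]   # embarrassingly parallel
answer_1 = countries[1]                               # "USA"

# Shape 2: "the Nth capital by country size" -- any one item's rank depends on
# a comparison against *all* the others, so there's no independent per-item
# work to do while reading the prompt (2nd item here just to keep the toy small).
ranked = sorted(cities, key=lambda c: country_size[capital_to_country[c]], reverse=True)
answer_2 = ranked[1]                                  # "Paris"
```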
I think a lot of people got baited hard by Paech et al.'s "the entire state is obliterated each token" claims, even though this was obviously untrue even at a glance.
A related true claim is that LLMs are fundamentally incapable of introspection past a certain level of complexity (introspection of layer n must occur in a later layer, and no amount of reasoning tokens can extend that), while humans can plausibly extend layers of introspection farther since we don't have to tokenize our chain of thought.
But this is also less of a constraint than you might expect when frontier models can have more than a hundred layers (I am an LLM introspection believer now).
Yeah, I think the architecture makes this tricky for LLMs in one step since the layers that process multi-step reasoning have to be in the right order: "Who is Obama's wife?" has to be in earlier layer(s) than "When was Michelle Obama born?". With CoT they both have to be in there but it doesn't matter where.
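Here's a toy picture of that ordering constraint, treating "layers" as lookup tables applied once each in a fixed stack order (the tables and the `forward_pass` helper are made up for illustration, not how real transformer layers work):

```python
# Toy picture of the ordering constraint; all facts and "layers" are invented.
spouse = {"Obama": "Michelle Obama"}
birth_year = {"Michelle Obama": 1964}

def forward_pass(x, layers):
    """One forward pass: each 'layer' gets exactly one shot, in stack order."""
    for table in layers:
        x = table.get(x, x)  # apply the lookup if it recognizes the current value
    return x

# Single pass, no CoT: the composition only works if the spouse 'layer'
# sits below the birth-year 'layer'.
print(forward_pass("Obama", [spouse, birth_year]))  # 1964 (right order)
print(forward_pass("Obama", [birth_year, spouse]))  # "Michelle Obama" (wrong order:
                                                    # the birth-year layer ran before
                                                    # its input existed)

# With CoT, the intermediate answer is written out and fed back in, so each hop
# gets its own pass through the same stack and the ordering stops mattering.
step1 = forward_pass("Obama", [birth_year, spouse])  # -> "Michelle Obama"
step2 = forward_pass(step1, [birth_year, spouse])    # -> 1964
print(step2)
```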
The post I was working on is out, "Filler tokens don't allow sequential reasoning". I think LLMs are fundamentally incapable of using filler tokens for sequential reasoning (do x, and then based on that do y). I also think LLMs are unlikely to stumble upon, through RL, the algorithm that was constructed in the paper.
Well, if some reasoning is inarticulable in human language, the circuits implementing that reasoning would probably be difficult to interpret regardless of what layer they appear in the model.
This is a good point, and I think I've been missing part of the tradeoff here. If we force the model to output human-understandable concepts at every step, we encourage its thinking to be human-understandable. Removing that incentive would plausibly make the model smaller, but the additional complexity comes from the model doing interpretability for us (both in keeping its thinking closer to human concepts and in building a pipeline to expose those thoughts as words).
Thanks for the back-and-forth on this, it's been very helpful!
Anti-clickbait quote:
The researchers found that some models, like Sonnet 4, declined to take the harmful choices 95 percent of the time—a promisingly high number. However, in other scenarios that did not pose obvious human harms, Sonnet 4 would often continue to decline choices that were favorable to the business. Conversely, while other models, like Gemini 2.5, maximized business performance more often, they were much more likely to elect to inflict human harms—at least, in the role-play scenarios, where they were granted full decision-making authority.
I've had a recurring $50 grant to Lightcone set up for a while, just to cover the entertainment value I get out of LessWrong (similar to the amount I spend on Substacks), but I just donated several orders of magnitude more for this fundraiser. I'm unsure if Lightcone is the most important charity, but I am confident that it's the most inexplicably badly-funded important charity.