nostalgebraist

I write original fiction.

Also I have opinions about AI and stuff, sometimes.


Same person as nostalgebraist2point0, but now I have my account back.

I have signed no contracts or agreements whose existence I cannot mention.


Claude 4 does have an extended thinking mode, and many of the Claude 4 benchmark results in the screenshot were obtained with extended thinking.

From here, in the "Performance benchmark reporting" appendix:

Claude Opus 4 and Sonnet 4 are hybrid reasoning models. The benchmarks reported in this blog post show the highest scores achieved with or without extended thinking. We’ve noted below for each result whether extended thinking was used:

 

  • No extended thinking: SWE-bench Verified, Terminal-bench
  • Extended thinking (up to 64K tokens):
    • TAU-bench (no results w/o extended thinking reported)
    • GPQA Diamond (w/o extended thinking: Opus 4 scores 74.9% and Sonnet 4 is 70.0%)
    • MMMLU (w/o extended thinking: Opus 4 scores 87.4% and Sonnet 4 is 85.4%)
    • MMMU (w/o extended thinking: Opus 4 scores 73.7% and Sonnet 4 is 72.6%)
    • AIME (w/o extended thinking: Opus 4 scores 33.9% and Sonnet 4 is 33.1%)

On MMMU, if I'm reading things correctly, the relative order of the models was:

  • Opus 4 > Sonnet 4 > Sonnet 3.7 in the no-extended-thinking case
  • Opus 4 > Sonnet 3.7 > Sonnet 4 in the extended-thinking case

Thank you, this is thoughtful and interesting.

If we wanted to be super proper, then preferences should have as objects maximally specific ways the world could be, including the whole history and future of the universe, down to the last detail.

[...] it is not a rational requirement that just because I prefer that one state instead of another obtain at t_1 I must also prefer likewise for t_2.

As you note, you're not the first commenter here to make these claims. But I just... really don't buy it?

The main reason is that if you take this view, several famous and popular arguments in decision theory become complete nonsense.  The two cases that come to mind immediately are money pumps for non-transitive preferences and diachronic Dutch books.  In both cases, the argument assumes an agent that makes multiple successive decisions on the basis of preferences which we assume do not vary with time (or which only vary with time in some specified manner, rather than being totally unconstrained as your proposal would have it).
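For concreteness, here's a minimal sketch of the money-pump structure that argument relies on (Python; the option names, fee, and trade sequence are invented for illustration).  The whole force of the argument comes from the agent consulting the same preference relation at every step of a sequence of trades made over time:

```python
# Money-pump sketch: an agent with cyclic (intransitive) preferences
# A > B > C > A accepts any trade up to a preferred option for a small fee,
# and ends up holding its original option with less money.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}  # cyclic preference relation

def prefers(x, y):
    return (x, y) in PREFERS

def run_money_pump(item="C", wealth=10.0, fee=1.0, rounds=3):
    for offered in ["B", "A", "C"] * rounds:  # the pump offers each "upgrade" in turn
        if prefers(offered, item) and wealth >= fee:
            # The same preference relation is consulted at every step, across time.
            item, wealth = offered, wealth - fee
    return item, wealth

print(run_money_pump())  # ('C', 1.0): back where it started, 9 units poorer
```

If preferences are only defined over whole histories, and are free to differ arbitrarily at each decision point, the assumption in the commented line can't even be stated, and the argument dissolves.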

If sequential decisions like this were simply not the object of study in decision theory, then everyone would immediately reject these arguments by saying "you've misunderstood the terminology, you're making a category error, this is just not what the theory is about."  Indeed, in this case, the arguments probably would not have ever been put forth in the first place.  But in reality, these are famous and well-regarded arguments that many people take seriously.

So either there's a clear and consistent object of study which is not these "maximally specific histories"... or decision theorists are inconsistent / confused about what the object of study is, sometimes claiming it is "maximally specific histories" and sometimes claiming otherwise.

Separately, if the theory really is about "maximally specific histories," then the theory seems completely irrelevant to real decision-making.  Outside of sci-fi or thought experiments, I will never face a "decision problem" in which both of the options are complete specifications of the way the rest of my life might go, all the way up to the moment I die.

And even if that were the sort of thing that could happen, I would find such a decision very hard to make, because I don't find myself in possession of immediately-accessible preferences about complete life-histories; the only way I could make the decision in practice would be to "calculate" my implicit preferences about the life-histories by aggregating over my preferences about all the individual events that happen within them at particular times, i.e. the kinds of preferences that I actually use to make real-life decisions, and which I "actually have" in the sense that I immediately feel a sense of preferring one thing to another when I compare the two options.

Relatedly, it seems like very little is really "rationally required" for preferences about "maximally specific histories," so the theory of them does not have much interesting content.  We can't make pragmatic/prudential arguments like money pumps, nor can we fall back on some set of previously held intuitions about what rational decision-making is (because "rational decision-making" in practice always occurs over time and never involves these unnatural choices between complete histories).

There's no clear reason to require the preferences to actually relate to the content of the histories in some "continuous" way; all the "physical" content of the histories, here, is bundled inside contentless labels like "A" and is thus invisible to the theory.  I could have a totally nonsensical set of preferences that looks totally different at every moment in time, and yet at this level of resolution, there's nothing wrong with it (because we can't see the "moments" at this level; they're bundled inside objects like "A").

So while we can picture things this way if we feel like it, there's not much worth saying about this picture, nor much that it can tell us about reality.  At this point we're basically just saying "there's a set with some antisymmetric binary relation on it, and maybe we'll have an argument about whether the relation 'ought' to be transitive."  There's no philosophy left here, only a bit of dull, trivial, and unmotivated mathematics.

You've explicitly said we're ignoring path dependencies and time

I wasn't very clear, but I meant this in a less restrictive sense than what you're imagining.

I meant only that if you know the diagram, and you know the current state, you're fully set up to reason about what the agent ought to do (according to its preferences) with its next action.

I'm trying to rule out cases where the optimal action from a state s depends on some extra info beyond the simple fact that we're in s, such as the trajectory we took to get there, or some number like "money" that hasn't been included in the states but is still supposed to follow the agent around somehow.

But I still allow that the agent may be doing time discounting, which would make A->B->C less desirable than A->C.

The setup is meant to be fairly similar to an MDP, although it's deterministic (mainly for presentational simplicity), and we are given pairwise preferences rather than a reward function.
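To make that concrete, here's a toy sketch of the kind of setup I have in mind (Python; the graph, the preference ranks, and the discount factor are all invented, and I've collapsed the pairwise preferences into a numeric rank for brevity, which only works because they happen to be transitive here):

```python
# Deterministic transition graph, preferences over states, optional discounting.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": []}   # deterministic edges
RANK = {"A": 0, "B": 1, "C": 2}                  # higher = more preferred

def path_value(path, gamma=0.9):
    """Value of ending in the path's final state, discounted per step taken.
    With gamma < 1, A->C beats A->B->C: same terminal state, reached sooner."""
    return (gamma ** (len(path) - 1)) * RANK[path[-1]]

def all_paths(state):
    """Enumerate every path out of `state` in this (acyclic) graph."""
    if not GRAPH[state]:
        return [[state]]
    paths = [[state]]  # the agent may also simply stay put
    for nxt in GRAPH[state]:
        paths += [[state] + tail for tail in all_paths(nxt)]
    return paths

best = max(all_paths("A"), key=path_value)
print(best, path_value(best))  # ['A', 'C'] 1.8 -- beats ['A', 'B', 'C'] at 1.62
```

Everything the agent needs in order to act is in the current state plus this fixed structure; nothing depends on how it got there.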

How do you reconcile these observations (particularly 3 and 4) with the responses to Thane Ruthenis's question about developer productivity gains?

It was posted in early March, so after all major recent releases besides o3 (edit: and Gemini 2.5 Pro).  Although Thane mentions hearing nebulous reports of large gains (2-10x) in the post itself, most people in the comments report much smaller ones, or cast doubt on the idea that anyone is getting meaningful large gains at all.  Is everyone on LW using these models wrong?  What do your informants know that these commenters don't?


Also, how much direct experience do you have using the latest reasoning models for coding assistance?

(IME they are good, but not that good; to my ears, "I became >2x faster" or "this is better than my best devs" sound like either accidental self-owns, or reports from a parallel universe.)

If you've used them but not gotten these huge boosts, how do you reconcile that with your points 3 and 4?  If you've used them and have gotten huge boosts, what was that like (it would clarify these discussions greatly to get more direct reports about this experience)?

AI labs must have used the highest quality data first

The usual scaling laws are about IID samples from a fixed data distribution, so they don't capture this kind of effect.

But even with IID samples, we'd expect to get diminishing marginal returns, and we do.  And you're asking: why, then, do we keep getting returns indefinitely (even diminishing ones)?

I think the standard answer is that reality (and hence, the data) contains a huge number of individually rare "types of thing," following some long-tailed distribution.  So even when the LLM has seen trillions of tokens, there are probably some elements of the world that it still has a shaky grasp of because they only appear in the data once every hundred billion tokens; and once it has mastered those, it will have to chase down the even rarer stuff that only comes up once every trillion tokens; and so on.

Even if it were true that the additional data literally "contained no new ideas/knowledge" relative to the earlier data, its inclusion would still boost the total occurrence count of the rarest "ideas" – the ones which are still so infrequent that the LLM's acquisition of them is constrained by their rarity, and which the LLM becomes meaningfully stronger at modeling when more occurrences are supplied to it.
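Here's a quick simulation of that story (Python; the number of concept types, the Zipf-style exponent, the batch sizes, and the "mastery" threshold of 100 occurrences are all arbitrary choices for illustration).  Sampling IID from one fixed long-tailed distribution, each additional batch of data still pushes new, rarer concepts past the threshold:

```python
# IID sampling from a fixed long-tailed ("Zipfian") distribution of concept types.
import numpy as np

rng = np.random.default_rng(0)
n_concepts = 1_000_000
probs = 1.0 / np.arange(1, n_concepts + 1) ** 1.1   # long-tailed frequencies
probs /= probs.sum()

threshold = 100                 # occurrences needed before "mastery" (arbitrary)
counts = np.zeros(n_concepts, dtype=np.int64)
total = 0
for batch in [10_000_000, 10_000_000, 20_000_000, 40_000_000]:
    counts += rng.multinomial(batch, probs)
    total += batch
    mastered = int((counts >= threshold).sum())
    print(f"{total:>11,} tokens: {mastered:>7,} concepts seen >= {threshold} times")
# The count keeps rising with every batch, because ever-rarer concepts finally
# accumulate enough occurrences -- diminishing but nonzero returns, with no
# "new ideas" ever added to the distribution.
```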

You may find Beren's post on this illuminating.

I think there's really more than one type of thing going on here.

Some of these examples do seem like "lying" in the sense of "the speaker knows what they're saying is false, but they hope the listener won't realize that."

But some of them seem more like... "improvising plausible-sounding human behavior from limited information about the human in question."  I.e. base model behavior.

Like, when o3 tells me that it spent "a weekend" or "an afternoon" reading something, is it lying to me?  That feels like a weird way to put it.  Consider that these claims are:

  1. Obviously false: there is no danger whatsoever that I will be convinced by them.  (And presumably the model would be capable of figuring that out, at least in principle)
  2. Pointless: even if we ignore the previous point and imagine that the model tricks me into believing the claim... so what?  It doesn't get anything out of me believing the claim.  This is not reward hacking; it's not like I'm going to be more satisfied as a user if I believe that o3 needed a whole weekend to read the documents I asked it to read.  Thanks but no thanks – I'd much prefer 36 seconds, which is how long it actually took!
  3. Similar to claims a human might make in good faith: although the claims are false for o3, they could easily be true of a human who'd been given the same task that o3 was given.

In sum, there's no reason whatsoever for an agentic AI to say this kind of thing to me "as a lie" (points 1-2).  And, on the other hand (point 3), this kind of thing is what you'd say if you were improv-roleplaying a human character on the basis of underspecified information, and having to fill in details as you go along.

My weekend/afternoon examples are "base-model-style improv," not "agentic lying."

Now, in some of the other cases like Transluce's (where it claims to have a laptop), or the one where it claims to be making phone calls, there's at least some conceivable upside for o3-the-agent if the user somehow believes the lie.  So point 2 doesn't hold there, or at least is more contestable.

But point 1 is as strong as ever: we are in no danger of being convinced of these things, and o3 – possibly the smartest AI in the world – presumably knows that it is not going to convince us (since that fact is, after all, pretty damn obvious).

Which is... still bad!  It's behaving with open and brazen indifference to the truth; no one likes or wants that.

(Well... either that, or it's actually somewhat confused about whether it's a human or not. Which would explain a lot: the way it just says this stuff in the open rather than trying to be sneaky like it does in actual reward-hacking-type cases, and the "plausible for a human, absurd for a chatbot" quality of the claims.)


I have no idea what the details look like, but I get the feeling that o3 received much less stringent HHH post-training than most chatbots we're used to dealing with.  Or it got the same amount as usual, but they also scaled up RLVR dramatically, and the former got kind of scrambled by the latter, and they just said "eh, whatever, ship it" because raw "intelligence" is all that matters, right?

The lying and/or confabulation is just one part of this – there's also its predilection for nonstandard unicode variants of ordinary typographic marks (check out the way it wrote "Greg Egan" in one of the examples I linked), its quirk of writing "50 %" instead of "50%", its self-parodically extreme overuse of markdown tables, and its weird, exaggerated, offputting "manic hype-man" tone.

o3 is more agentic than past models, and some of its bad behavior is a result of that, but I would bet that a lot of it is more about the model being "undercooked," noisy, confused – unsure of what it is, of who you are, of the nature and purpose of its interaction with you.

(It's almost the polar opposite of the most recent chatgpt-4o version, which if anything has gotten a little too socially competent...)

Great comment.

It's unclear to me if the baseliners can use an IDE (like a real programmer would use). Does the sign-in thing mean that the baseliners can't use Reddit, GitHub, Stack Overflow, Kagi, internal notes like Obsidian, or Notion?

In the HCAST paper's Appendix C.1, they link to their instructions doc for baseliners, which answers both of these questions.  Quoting from the doc:

[from the "Baselining Set-Up" tab]

You will SSH into our server to work on the task, but you are allowed to use any tools you want compatible with this workflow, excluding copilot and any other AI tools, Wolfram Alpha, and online services that require sign-up. (Google does not count as an AI tool, ChatGPT does.) You can always use the internet to search for information (e.g. StackOverflow), even if the task instructions specifically say that internet usage is not allowed. [...]

You can connect your IDE to the task environment using the same SSH connection string. Here are docs about how to do this for VSCode (remember to ‘add new host’ rather than ‘connect to host’. and paste the entire ssh connection string, including ssh -J [...]) or PyCharm. Unfortunately it’s not terribly unusual for a connection to take ~20 minutes the first time (although the typical case is smaller).

[from the "Questions or issues" tab]

Can I use [software X]?

Tools that are compatible with your usual workflow and our set-up (e.g. VSCode extensions) are fine, tools that solve the task for you are not fine. So linters are good, Copilot bad.

The "20 minutes to connect an IDE" thing sounded worrying to me at first glance, but FWIW the paper claims that setup time was a non-issue in practice:

It is possible that ancillary technical issues (e.g. difficulties with setup) could consume a significant fraction of baseline time. In practice, we observe minimal such issues with technical set-up; the issues affecting clock times that do persist are concentrated in qualification tasks, in which human baseliners are interacting with our set-up for the first time. In 19 sampled instances of debug small libs qualification tasks, baseliners spent a mean of 9 minutes and median of 6 minutes on setup issues, relative to average total task time of 1.2 hours.

I'm making this comment mostly to point out the info above, but I also wanted to say that I agree with you about agentic coding, and I especially agree with @Michael Swift's remark about "engineering taste."

I've actually been getting a lot of value out of Cursor w/ 3.7 Sonnet lately, but I think largely due to the task I'm applying it to, which is effectively the best-case scenario for this kind of tool:

  • It's frontend development...
    • ...which is not my usual area, and which I am not very competent at on my own
    • ...which is also, I hear, a strong point for most coding LLMs
  • It's work on an internal-facing prototype which even internal users don't see unless they toggle a setting manually.
    • So it's low-risk, it doesn't matter if the UI doesn't follow brand conventions, etc.
    • Also, the requirements are unusually flexible and self-determined. I'm often free to just give up on something if both Claude and I are having a lot of trouble accomplishing it.

Under these conditions, it really does give me a large boost in the short term.  (I say "in the short term" because I'm probably learning less in the process than I would otherwise.  As others have observed before, the implications for junior developer hiring and the overall skills acquisition pipeline are... concerning.)

However, even in this context (and even in a largely unfamiliar codebase and programming language), the lack of "engineering taste" is evident to me and sometimes becomes a bottleneck.  The tool does tend to write code that works in the sense of passing tests or meeting requirements, but it often

  • varies its design choices whimsically across successive requests (even in the same chat)
  • reinvents what it needs from scratch rather than reusing existing mechanisms (even mechanisms that it added itself in an earlier chat turn)
  • fails to refactor or delete old code that has been obviated by its recently applied changes
  • "uses a hammer to swat a fly," writing elaborate and defensive code to perform very simple operations, with e.g. lots of guards against implausible or impossible edge cases
  • writes code in a standard or "textbook" style rather than adopting the house style of the project, even when it has been explicitly told to do the latter[1]

and other stuff along similar lines.

It's conceivable that better prompting could resolve this (maybe ask for a high-level design first?).  But if so, I'm confused why existing scaffolds don't inject this stuff by default (and why the LLMs even need the instructions, rather than just doing this stuff by default on the basis of some very general idea like "I'm supposed to write good code").

The one time I tried Claude Code, I showed it a complex backend service, told it that certain routes were slow, and asked it to find and fix the bottlenecks. (This was probably way too ambitious, but I had read stuff like this and came in with high expectations.)  It proposed a few database migrations, all of which attempted to add things that already existed.  This was so cringe that I never used the tool again.  But I'd happily give it a second try if someone showed me a clear demonstration of a case in which it was uniquely useful.

  1. ^

    Note that Cursor always explicitly tells it not to do this, via the following section of its system prompt:

    # Following conventions
    When making changes to files, first understand the file's code conventions. Mimic code style, use existing libraries and utilities, and follow existing patterns.
    - NEVER assume that a given library is available, even if it is well known. Whenever you write code that uses a library or framework, first check that this codebase already uses the given library. For example, you might look at neighboring files, or check the package.json (or cargo.toml, and so on depending on the language).
    - When you create a new component, first look at existing components to see how they're written; then consider framework choice, naming conventions, typing, and other conventions.
    - When you edit a piece of code, first look at the code's surrounding context (especially its imports) to understand the code's choice of frameworks and libraries. Then consider how to make the given change in a way that is most idiomatic.

I had previously noticed that the paper's classifier produced a lot of FPs/FNs and sent my findings + recommendations to Ryan G, who told me that there was a group working on improving the classifier (I assume that's you guys).  Glad to see an update on this effort!

log-linear scaling of x with pre-training compute will be worth it as the k-step success rate will improve near-linearly

I don't follow.  The k-step success rate is polynomial in x, not exponential (it's x^k, not something like e^(kx)).

Although if we fix some cutoff c for the k-step success probability, and then look at the value of k for which x^k = c, then we get k = log(c) / log(x).  This is super-linear in x over the interval from 0 to 1, so linearly growing improvements in x cause this "highest feasible k" to grow faster-than-linearly.  (Is this what you meant?  Note that this is similar to how METR computes time horizons.)
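Spelling the arithmetic out with the cutoff set (arbitrarily) at c = 0.5:

```python
# Highest feasible horizon k(x) = log(c) / log(x) for a fixed success cutoff c.
# The cutoff and the grid of one-step success rates x are arbitrary illustrations.
import math

c = 0.5
for x in [0.90, 0.95, 0.99, 0.995, 0.999]:
    k = math.log(c) / math.log(x)
    print(f"one-step success x = {x:.3f} -> highest feasible k ~ {k:7.1f}")
# Roughly even steps in x buy faster-than-linear growth in the feasible
# horizon k (about 6.6, 13.5, 69, 138, 693 steps respectively).
```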

This might explain recent results that the length of tasks that AI can do is increasing linearly with time.

METR found that horizon lengths are growing exponentially in time, not linearly.

(One-step success probabilities have been growing at least linearly with time, I would think – due to super-linear growth in inputs like dataset size, etc. – so we should expect horizon lengths to grow super-linearly due to what I said in the previous paragraph.)

(N.B. I expect it will be easier to conduct this kind of analysis in terms of  instead of .)

Great review!

Here are two additional questions I think it's important to ask about this kind of work. (These overlap to some extent with the 4 questions you posed, but I find the way I frame things below to be clarifying.)

  1. If you combine the latent reasoning method with ordinary CoT, do the two behave more like substitutes or complements?
    1. That is: if we switch from vanilla transformers to one of these architectures, will we want to do less CoT (because the latent reasoning accomplishes the same goal in some more efficient or effective way), or more CoT (because the latent reasoning magnifies the gains that result from CoT, relative to vanilla transformers)?
    2. (Relatedly: how does this affect the legibility and faithfulness of CoT? If these two methods are synergetic/complementary, how does the division of labor work, i.e. which "kinds of thought" would an optimal model perform in the latent recurrence, vs. the verbalized recurrence?)
  2. How does the new architecture compare to vanilla transformers in a compute-matched comparison (where "compute" might mean either training or inference)?  And how does this result change as compute is scaled?

Number 1 matters because what we really care about is "how much can we learn by reading the CoT?", and the concern about latent reasoning often involves some notion that important info which might otherwise appear in the CoT will get "moved into" the illegible latent recurrence.  This makes sense if you hold capabilities constant, and compare two ~equivalent models with and without latent reasoning, where the former spends some test-time compute on illegible reasoning while the latter has to spend all its test-time compute on CoT.

However, capabilities will not in fact be constant!  If you train a new model with latent reasoning, there's nothing forcing you to do less CoT with it, even if you could "get away with" doing that and still match the capabilities of your old model.  You are free to combine latent reasoning and CoT and see how well they stack, and perhaps they do in fact stack nicely.  What ultimately matters is what ends up expressed in the CoT of the best model you can train using the amount of CoT that's optimal for it – not whether some other, less capable model+CoT combination would have reached its distinct, worse-on-average conclusions in a more legible manner.  (Note that you can always decrease legibility by just not using CoT, even with regular transformers – but of course there's no reason to care that this option isn't legible since it's not on the capabilities frontier.)

This situation is somewhat analogous to what we already have with regular transformer scaling and CoT: presumably there are sequential reasoning problems which GPT-4 can do in one forward pass (just by doing some "step by step" thing across its many layers), but which GPT-3.5 could only do via CoT.  However, this didn't cause us to use less CoT as a result of the scale-up: why satisfy yourself with merely hitting GPT-3.5 quality in fewer (but more expensive) forward passes, when you can go ahead and tackle a whole new class of harder problems, the ones that even GPT-4 needs CoT for?[1]

Number 2 matters for hopefully obvious reasons: if we could just "go full RNN" with no downsides then of course that would be more expressive, but the fact that transformers don't do so (and reap the vast compute-efficiency benefits of not doing so) accounts for much/most (all?) of their vast success.  The question is not "are there benefits to latent recurrence?" (of course there are) but "when, if ever, do you want to spend the marginal unit of compute on latent recurrence?"  If you can afford to pay for a Coconut-ized version of your transformer then you could just make a bigger transformer instead, etc.

Unfortunately, looking at these papers, I don't see much evidence either way about these questions at a glance.  Or at least nothing re: number 2.  If I'm reading Table 2 in the depth-recurrence paper correctly, their model gets much bigger gains from CoT on GSM8K than any of their baseline models (and the gains improve further with more latent reasoning!) – which seems encouraging re: number 1, but I'm wary of reading too much into it.

 

  1. ^

    The analogy is inexact because GPT-4 still has only however many layers it has – a fixed constant – while depth-recurrent models can "just keep going."  My point is simply that even if you can "just keep going," that doesn't imply that the best way to spend the marginal unit of test-time compute is always on more depth rather than more sampled tokens.

    Do we have any reason to think "more tokens" will actually have any advantages over "more depth" in practice?  I'm not sure, but one way to think about the tradeoff is: latent reasoning replaces a narrow bottleneck that can be arbitrarily expanded with a much larger bottleneck that can't scale with problem size.  That is, depth-recurrence and similar approaches have the familiar old problem of RNNs, where they have to write all the intermediate results of their reasoning onto a fixed-length scratchpad, and hence will eventually have trouble with tasks of the form "compute N intermediate results and then do some aggregation over the whole collection" where N is problem-dependent and can grow arbitrarily large.

    Relatedly, KV caches in transformers are huge, which of course has painful memory costs but does allow the transformer to store a ton of information about the tokens it generates, and to look up that information later with a great deal of precision.

    So comparing the capacity of the hidden state (as the bottleneck for depth-recurrence) against the capacity of just the CoT tokens (as the bottleneck for transformer+CoT) isn't really comparing apples to apples: while the transformer is much more limited in what information it can "directly pass along" from step to step (with that info immediately+fully available to all future operations), it always constructs very high-dimensional representations of each step which are visible at least to some operations inside subsequent steps, allowing the transformer to "write out a haystack and then find the needle in it" even if that needle is tough to discriminate from its many neighbors.  (This argument is hand-wavey and so I'm not super confident of it, would be interesting to find out if it can be made more precise, or already has been)
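    To put rough numbers on that asymmetry, here's a back-of-the-envelope comparison (Python; every dimension below is an invented, ballpark-shaped figure, not any particular model's configuration):

```python
# Fixed hidden state (RNN-style bottleneck) vs. a transformer's growing KV cache.
# All dimensions are invented, ballpark-shaped assumptions.
d_model   = 4096     # width of the hidden state carried from step to step
n_layers  = 32
n_kv_head = 8
d_head    = 128
bytes_per = 2        # fp16 / bf16

hidden_state_bytes = d_model * bytes_per

def kv_cache_bytes(n_tokens):
    return 2 * n_layers * n_kv_head * d_head * bytes_per * n_tokens  # keys + values

for n_tokens in [1_000, 10_000, 100_000]:
    ratio = kv_cache_bytes(n_tokens) / hidden_state_bytes
    print(f"{n_tokens:>7,} CoT tokens: KV cache ~ {kv_cache_bytes(n_tokens) / 1e6:9.1f} MB, "
          f"{ratio:,.0f}x the fixed hidden state")
# The CoT tokens are a narrow channel, but the cache built while producing them
# is enormous and queryable -- the sense in which the two bottlenecks above
# aren't apples to apples.
```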
