nostalgebraist

I write original fiction.

Also I have opinions about AI and stuff, sometimes.


Same person as nostalgebraist2point0, but now I have my account back.

I have signed no contracts or agreements whose existence I cannot mention.


I had previously noticed that the paper's classifier produced a lot of FPs/FNs and sent my findings + recommendations to Ryan G, who told me that there was a group working on improving the classifier (I assume that's you guys).  Glad to see an update on this effort!

log-linear scaling of x with pre-training compute will be worth it as the k-step success rate will improve near-linearly

I don't follow.  The k-step success is polynomial in x, not exponential (it's x^k, not something like e^(kx)).

Although if we fix some cutoff c for the k-step success probability, and then look at the value of k for which x^k = c, then we get k = log(c) / log(x).  This is super-linear in x over the interval from 0 to 1, so linearly growing improvements in x cause this "highest feasible k" to grow faster-than-linearly.  (Is this what you meant?  Note that this is similar to how METR computes time horizons.)
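To make the shape of that relationship concrete, here's a tiny sketch (my own code, using c for the cutoff as above):

```python
import math

def k_step_success(x, k):
    # Probability of succeeding at k independent steps, each with per-step success probability x.
    return x ** k

def highest_feasible_k(x, c=0.5):
    # Largest k with x**k >= c, i.e. k = log(c) / log(x) -- a METR-style "50% horizon" in steps.
    return math.log(c) / math.log(x)

for x in (0.90, 0.95, 0.99):
    print(x, round(highest_feasible_k(x), 1))
# 0.9 -> 6.6, 0.95 -> 13.5, 0.99 -> 69.0: linear gains in x buy faster-than-linear gains in k.
```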

This might explain recent results that the length of tasks that AI can do is increasing linearly with time.

METR found that horizon lengths are growing exponentially in time, not linearly.

(One-step success probabilities have been growing at least linearly with time, I would think – due to super-linear growth in inputs like dataset size, etc. – so we should expect horizon lengths to grow super-linearly due to what I said in the previous paragraph.)

(N.B. I expect it will be easier to conduct this kind of analysis in terms of log(x) instead of x.)

Great review!

Here are two additional questions I think it's important to ask about this kind of work. (These overlap to some extent with the 4 questions you posed, but I find the way I frame things below to be clarifying.)

  1. If you combine the latent reasoning method with ordinary CoT, do the two behave more like substitutes or complements?
    1. That is: if we switch from vanilla transformers to one of these architectures, will we want to do less CoT (because the latent reasoning accomplishes the same goal in some more efficient or effective way), or more CoT (because the latent reasoning magnifies the gains that result from CoT, relative to vanilla transformers)?
    2. (Relatedly: how does this affect the legibility and faithfulness of CoT? If these two methods are synergetic/complementary, how does the division of labor work, i.e. which "kinds of thought" would an optimal model perform in the latent recurrence, vs. the verbalized recurrence?)
  2. How does the new architecture compare to vanilla transformers in a compute-matched comparison (where "compute" might mean either training or inference)?  And how does this result change as compute is scaled?

Number 1 matters because what we really care about is "how much can we learn by reading the CoT?", and the concern about latent reasoning often involves some notion that important info which might otherwise appear in the CoT will get "moved into" the illegible latent recurrence.  This makes sense if you hold capabilities constant, and compare two ~equivalent models with and without latent reasoning, where the former spends some test-time compute on illegible reasoning while the latter has to spend all its test-time compute on CoT.

However, capabilities will not in fact be constant!  If you train a new model with latent reasoning, there's nothing forcing you to do less CoT with it, even if you could "get away with" doing that and still match the capabilities of your old model.  You are free to combine latent reasoning and CoT and see how well they stack, and perhaps they do in fact stack nicely.  What ultimately matters is what ends up expressed in the CoT of the best model you can train using the amount of CoT that's optimal for it – not whether some other, less capable model+CoT combination would have reached its distinct, worse-on-average conclusions in a more legible manner.  (Note that you can always decrease legibility by just not using CoT, even with regular transformers – but of course there's no reason to care that this option isn't legible since it's not on the capabilities frontier.)

This situation is somewhat analogous to what we already have with regular transformer scaling and CoT: presumably there are sequential reasoning problems which GPT-4 can do in one forward pass (just by doing some "step by step" thing across its many layers), but which GPT-3.5 could only do via CoT.  However, this didn't cause us to use less CoT as a result of the scale-up: why satisfy yourself with merely hitting GPT-3.5 quality in fewer (but more expensive) forward passes, when you can go ahead and tackle a whole new class of harder problems, the ones that even GPT-4 needs CoT for?[1]

Number 2 matters for hopefully obvious reasons: if we could just "go full RNN" with no downsides then of course that would be more expressive, but the fact that transformers don't do so (and reap the vast compute-efficiency benefits of not doing so) accounts for much/most (all?) of their vast success.  The question is not "are there benefits to latent recurrence?" (of course there are) but "when, if ever, do you want to spend the marginal unit of compute on latent recurrence?"  If you can afford to pay for a Coconut-ized version of your transformer then you could just make a bigger transformer instead, etc.

Unfortunately, looking at these papers, I don't see much evidence either way about these questions at a glance.  Or at least nothing re: number 2.  If I'm reading Table 2 in the depth-recurrence paper correctly, their model gets much bigger gains from CoT on GSM8K than any of their baseline models (and the gains improve further with more latent reasoning!) – which seems encouraging re: number 1, but I'm wary of reading too much into it.

 

  1. ^

    The analogy is inexact because GPT-4 still has only however many layers it has – a fixed constant – while depth-recurrent models can "just keep going."  My point is simply that even if you can "just keep going," that doesn't imply that the best way to spend the marginal unit of test-time compute is always on more depth rather than more sampled tokens.

    Do we have any reason to think "more tokens" will actually have any advantages over "more depth" in practice?  I'm not sure, but one way to think about the tradeoff is: latent reasoning replaces a narrow bottleneck that can be arbitrarily expanded with a much larger bottleneck that can't scale with problem size.  That is, depth-recurrence and similar approaches have the familiar old problem of RNNs, where they have to write all the intermediate results of their reasoning onto a fixed-length scratchpad, and hence will eventually have trouble with tasks of the form "compute N intermediate results and then do some aggregation over the whole collection" where N is problem-dependent and can grow arbitrarily large.

    Relatedly, KV caches in transformers are huge, which of course has painful memory costs but does allow the transformer to store a ton of information about the tokens it generates, and to look up that information later with a great deal of precision.

    So comparing the capacity of the hidden state (as the bottleneck for depth-recurrence) against the capacity of just the CoT tokens (as the bottleneck for transformer+CoT) isn't really comparing apples to apples: while the transformer is much more limited in what information it can "directly pass along" from step to step (with that info immediately+fully available to all future operations), it always constructs very high-dimensional representations of each step which are visible at least to some operations inside subsequent steps, allowing the transformer to "write out a haystack and then find the needle in it" even if that needle is tough to discriminate from its many neighbors.  (This argument is hand-wavey and so I'm not super confident of it, would be interesting to find out if it can be made more precise, or already has been)
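    A back-of-the-envelope comparison of these bottlenecks, with purely illustrative numbers (none of these are any particular model's dimensions):

    ```python
    import math

    vocab_size, d_model, n_layers = 100_000, 4096, 32     # assumed, illustrative values

    bits_per_cot_token = math.log2(vocab_size)     # info a single CoT token passes forward directly
    recurrent_state_floats = d_model               # fixed-size hidden state of a depth-recurrent step
    kv_floats_per_token = 2 * n_layers * d_model   # keys + values a transformer caches for each token

    print(round(bits_per_cot_token, 1))   # ~16.6 bits per token: narrow, but grows with every new token
    print(recurrent_state_floats)         # 4096 floats: much wider, but fixed regardless of problem size
    print(kv_floats_per_token)            # 262144 floats cached per generated token, consultable later
    ```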

I saw some discussion of this incident in the Eleuther discord on 3/30, including a screenshot of the system message containing the "emulate the tone" line.  So it's not an April Fools' thing.


Very impressive!  At least on a first read, to me this felt closer than any past work to realizing the SAE dream of actually understanding the computations occurring in the model, as opposed to just describing various "cool-looking patterns" that can be extracted from activations.

I'm curious about what would happen if you studied cases similar to the examples you present, except that the recruitment of a particular capability (such as arithmetic or knowledge about a named entity) occurs through in-context learning.

For example, you discuss an "obscured arithmetic" task involving publication dates.  In that case, the model seems to have learned in training that the correct prediction can be done by doing arithmetic.  But we could imagine obscured arithmetic tasks that are novel to the model, in which the mapping between the text and a "latent arithmetic problem" has to be learned in-context[1].

We might then ask ourselves: how does the model's approach to these problems relate to its approach to problems which it "can immediately tell" are arithmetic problems?

A naively obvious "algorithm" would look like the following (there's a toy code sketch of it right after the list):

  1. Try out various mappings between the observed text and (among other things) arithmetic problems
  2. Notice that one particular mapping to arithmetic always yields the right answer on previous example cases
  3. Based on the observation in (2), map the current example to arithmetic, solve the arithmetic problem, and map back to predict the answer
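Here's that "algorithm" as a toy, purely hypothetical sketch (my own code; candidate_mappings stands in for whatever hypotheses the model might entertain about how the text encodes arithmetic):

```python
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def evaluate(problem):
    a, op, b = problem                       # e.g. (3, "+", 9)
    return OPS[op](a, b)

def solve_in_context(examples, query, candidate_mappings):
    # examples: [(obscured_text, answer), ...]; candidate_mappings: functions text -> (a, op, b)
    for mapping in candidate_mappings:                                      # step 1: try mappings
        if all(evaluate(mapping(text)) == ans for text, ans in examples):   # step 2: verify on examples
            return evaluate(mapping(query))                                 # step 3: apply to current case
    return None
```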

However, due to the feedforward and causal structure of transformer LMs, they can't re-use the same mechanism twice to "verify that arithmetic works" in steps 1 and 2 and then "do arithmetic" in step 3.[2]

It's possible that LLMs actually solve cases like this in some qualitatively different way than the "algorithm" above, in which case it would be interesting to learn what that is[3].

Alternatively, if the model is doing something like this "algorithm," it must be recruiting multiple "copies" of the same capability, and we could study how many "copies" exist and to what extent they use identical albeit duplicated circuitry. (See fn2 of this comment for more)

It would be particularly interesting if feature circuit analysis could be used to make quantitative predictions about things like "the model can perform computations of depth D or lower when not obscured in a novel way, but this depth lowers to some D' < D when it must identify the required computation through few-shot learning."

(A related line of investigation would be looking into how the model solves problems that are obscured by transformations like base64, where the model has learned the mapping in training, yet the mapping is sufficiently complicated that its capabilities typically degrade significantly relative to those it displays on "plaintext" problems.)

  1. ^

    One could quantify the extent to which this is true by looking at how much the model benefits from examples. In an "ideal" case of this kind, the model would do very poorly when given no examples (equivalently, when predicting the first answer in a few-shot sequence), yet it would do perfectly when given many examples.

  2. ^

    For instance, suppose that the current example maps to an addition problem where one operand has 9 in the ones place.  So we might imagine that an "add _9" add function feature is involved in successfully computing the answer, here.

    But for this feature to be active at all, the model needs to know (by this point in the list of layers) that it should do addition with such an operand in the first place.  If it's figuring that out by trying mappings to arithmetic and noticing that they work, the implementations of arithmetic used to "try and verify" must appear in layers before the one in which the "add _9" feature under discussion occurs, since the final outputs of the entire "try and verify" process are responsible for activating that feature.  And then we could ask: how does this earlier implementation of arithmetic work?  And how many times does the model "re-implement" a capability across the layer list?

  3. ^

    Perhaps it is something like "try-and-check many different possible approaches at every answer-to-example position, then use induction heads to move info about try-and-check outputs that matched the associated answer position to later positions, and finally use this info to amplify the output of the 'right' computation and suppress everything else."

If I understand what you're saying here, it's true but fairly well-known?  See e.g. footnote 26 of the post "Simulators."

My favorite way of looking at this is:

The usual intuitive view of causal attention is that it's an operation that "looks back" at earlier positions. At each position i, it computes a "query" based on information from position i, and this query is used to search over "keys and values" computed at positions i-1, i-2, etc. (as well as i itself).

OK, so at each position, attention computes a query.  What makes a query "good"?  Well, a good query is one that will "do something useful" in conjunction with keys and values computed at earlier positions.

But attention is also computing keys and values at each position.  What makes a key or value "good"?  Precisely that it will "do something useful" in conjunction with the queries computed at later positions!

The latter observation is just the flipside of the former.  Queries at position i are encouraged to do useful lookback, on average over the "pasts" (i-1, ...) encountered in training; keys and values at position i are encouraged to be useful for the lookbacks performed by later queries, on average over the "futures" (i+1, ...) encountered in training.

This is complicated slightly by the fact that causal attention lets positions attend to themselves, but it's easy to see that this is not a huge deal in practice.  Consider that the keys and values computed at position i get used by...

  1. ...the attention operation at position i, when it attends to itself (along with all earlier positions)
  2. ...the attention operation at positions i+1, i+2, ..., when they "look back" to position i

The K and V weights get gradients from all of these positions.  So for a context window of size N, on average the gradient will be a sum over ~N/2 terms from future positions, plus just a single term from the current position.  Since N >> 2 in practice, all else being equal we should expect this sum to be dominated by the future terms.
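Here's a minimal sketch (toy sizes, single head, my own code, not any particular implementation) that makes the counting concrete:

```python
import torch

N, d = 8, 16                                   # toy context length and head dimension
x = torch.randn(N, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
mask = torch.tril(torch.ones(N, N, dtype=torch.bool))           # query j may attend to key i iff i <= j
scores = (q @ k.T / d ** 0.5).masked_fill(~mask, float("-inf"))
out = scores.softmax(dim=-1) @ v

# How many positions' outputs depend on the key/value computed at position i:
print([int(mask[:, i].sum()) for i in range(N)])                # [8, 7, 6, 5, 4, 3, 2, 1]
# Only one of those is position i attending to itself; the rest are future positions,
# which is where most of the gradient signal for the K and V weights comes from when N is large.
```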

(Moreover, note that the keys and values are more useful at future positions than at the current position, giving us even more reason to expect them to be mainly computed for the sake of future positions rather than the current one.  The current position "already knows about itself" and doesn't need attention to move information from itself to itself, whereas future positions can only learn about the current position by attending to it.

Sometimes there may be a computational role for a position attending to itself – such as doing something by default if nothing else "matched" a query – but all of the "magic" of attention is in the way it can move information between positions.  Note that a self-attention layer which could only attend to the current position would just be equivalent to a linear layer.)


ICYMI, the same argument appears in the METR paper itself, in section 8.1 under "AGI will have 'infinite' horizon length."

The argument makes sense to me, but I'm not totally convinced.

In METR's definition, they condition on successful human task completion when computing task durations.  This choice makes sense in their setting for reasons they discuss in B.1.1, but things would get weird if you tried to apply it to extremely long/hard tasks.

If a typical time-to-success for a skilled human at some task is ~10 years, then the task is probably so ambitious that success is nowhere near guaranteed at 10 years, or possibly even within that human's lifetime[1].  It would understate the difficulty of the task to say it "takes 10 years for a human to do it": the thing that takes 10 years is an ultimately successful human attempt, but most human attempts would never succeed at all.

As a concrete example, consider "proving Fermat's Last Theorem."  If we condition on task success, we have a sample containing just one example, in which a human (Andrew Wiles) did it in about 7 years.  But this is not really "a task that a human can do in 7 years," or even "a task that a human mathematician can do in 7 years" – it's a task that took 7 years for Andrew Wiles, the one guy who finally succeeded after many failed attempts by highly skilled humans[2].

If an AI tried to prove or disprove a "comparably hard" conjecture and failed, it would be strange to say that it "couldn't do things that humans can do in 7 years."  Humans can't reliably do such things in 7 years; most things that take 7 years (conditional on success) cannot be done reliably by humans at all, for the same reasons that they take so long even in successful attempts.  You just have to try and try and try and... maybe you succeed in a year, maybe in 7, maybe in 25, maybe you never do.

So, if you came to me and said "this AI has a METR-style 50% time horizon of 10 years," I would not be so sure that your AI is not an AGI.

In fact, I think this probably would be an AGI.  Think about what the description really means: "if you look at instances of successful task completion by humans, and filter to the cases that took 10 years for the successful humans to finish, the AI can succeed at 50% of them."  Such tasks are so hard that I'm not sure the human success rate is above 50%, even if you let the human spend their whole life on it; for all I know the human success rate might be far lower.  So there may not be any well-defined thing left here that humans "can do" but which the AI "cannot do."


On another note, (maybe this is obvious but) if we do think that "AGI will have infinite horizon length" then I think it's potentially misleading to say this means growth will be superexponential.  The reason is that there are two things this could mean:

  1. "Based on my 'gears-level' model of AI development, I have some reason to believe this trend will accelerate beyond exponential in the future, due to some 'low-level' factors I know about independently from this discussion"
  2. "The exponential trend can never reach AGI, but I personally think we will reach AGI at some point, therefore the trend must speed up"

I originally read it as 1, which would be a reason for shortening timelines: however "fast" things were from this METR trend alone, we have some reason to think they'll get "even faster."  However, it seems like the intended reading is 2, and it would not make sense to shorten your timeline based on 2.  (If someone thought the exponential growth was "enough for AGI," then the observation in 2 introduces an additional milestone that needs to be crossed on the way to AGI, and their timeline should lengthen to accommodate it; if they didn't think this then 2 is not news to them at all.)

  1. ^

    I was going to say something more here about the probability of success within the lifetimes of the person's "intellectual heirs" after they're dead, as a way of meaningfully defining task lengths once they're >> 100 years, but then I realized that this introduces other complications because one human may have multiple "heirs" and that seems unfair to the AI if we're trying to define AGI in terms of single-human performance. This complication exists but it's not the one I'm trying to talk about in my comment...

  2. ^

    The comparison here is not really fair since Wiles built on a lot of work by earlier mathematicians – yet another conceptual complication of long task lengths that is not the one I'm trying to make a point about here.

Originally known as "past cache" after the tensor name apparently coined by Thomas Wolf for the transformers library in February 2019, see commit ffd6238. The invention has not been described in the literature AFAIK, and it's entirely possible (maybe even likely) that closed-source implementations of earlier decoder-only transformers used the same trick before this

KV caching (using the terminology "fast decoding" and "cache") existed even in the original "Attention is All You Need" implementation of an enc-dec transformer.  It was added on Sep 21 2017 in this commit.  (I just learned this today, after I read your comment and got curious.)

The "past" terminology in that original transformers implementation of GPT-2 was not coined by Wolf – he got it from the original OpenAI GPT-2 implementation, see here.

Your list of "actual arguments" against explosive growth seems to be missing the one that is by far the most important/convincing IMO, namely Baumol effects.

This argument has been repeatedly brought up by growth economists in earlier rounds from the AI-explosive-growth debate.  So rather than writing my own version of this argument, I'll just paste some quotes below.

As far as I can tell, the phenomenon discussed in these quotes is excluded by construction from the GATE model: while it draws a distinction between different "tasks" on the production side, its model of consumption effectively has only one "consumable good" which all these tasks produce (or equivalently, multiple goods which are all perfect substitutes for one another).

In other words, it stipulates what Vollrath (in the first quote below) calls "[the] truly unbelievable assumption that [AI] can innovate *precisely* equally across every product in existence."  Of course, if you do assume this "truly unbelievable" thing, then you don't get Baumol effects – but this would be a striking difference from what has happened in every historical automation wave, and also just sort of prima facie bizarre.

Sure, maybe AI will be different in a way that turns off Baumol effects, for some reason or other.  But if that is the claim, then an argument needs to be made for that specific claim, and why it will hold for AI when it hasn't for anything else before.  It can't be justified as a mere "modeling simplification," because the same "simplification" would have led you to wrongly expect similar explosive growth from past agricultural automation, from Moore's Law, etc.


From Dietrich Vollrath's review of Davidson 2021:

History suggests that people tend to view many goods and services as complements. Yes, within specific sub-groups (e.g. shoes) different versions are close substitutes, but across those groups (e.g. shoes and live concerts) people treat them as complements and would like to consume some of both. 

What does that do to the predictions of explosive growth? It suggests that it may “eat itself”. AI or whatever will deliver productivity growth to some products faster than others, barring a truly unbelievable assumption that it can innovate *precisely* equally across every product in existence. When productivity grows more rapidly in product A than in product B (50% versus 10%, say), the relative price of product A falls relative to product B. Taking A and B as complements, what happens to the total expenditure on A (price times quantity)? It falls. We can get all the A we want for very cheap, and because we like both A and B, we have a limit on how much A we want. So total spending on A falls. 

But growth in aggregate productivity (and in GWP, leaving aside my comments on inputs above) is a weighted average of productivity growth in all products. The weights are the expenditure shares. So in the A/B example, as A gets more and more productive relative to B, the productivity growth rate *falls* towards the 10% of product B. In general, the growth rate of productivity is going to get driven towards the *lowest* productivity growth rate across the range of products we consume.

And the faster that productivity grows in product A, the sooner the aggregate growth rate will fall to the productivity growth rate of B. So a massive question for this report is how widespread explosive growth is expected to be. Productivity growth in *all* products of 10% forever would deliver 10% growth in productivity forever (and perhaps in GWP). Great. But productivity growth of 100% in A and 0% in B will devolve into productivity growth of 0% over time.

This has nothing to do with the nature of R&D or the knife-edge conditions on growth models. This is simply about the nature of demand for products. 

From Ben Jones' review of the same Davidson 2021 report:

[W]e have successfully automated an amazing amount of agricultural production (in advanced economies) since the 19th century.  One fact I like:  In 2018, a farmer using a single combine harvester in Illinois set a record by harvesting 3.5 million pounds of corn in just 12 hours.  That is really amazing.  But the result is that corn is far cheaper than it used to be, and the GDP implications are modest.  As productivity advances and prices fall, these amazing technologies tend to become rounding errors in GDP and labor productivity overall.  Indeed, agricultural output used to be about half of all GDP but now it is down to just a couple percent of GDP.  The things you get good at tend to disappear as their prices plummet.  Another example is Moore’s Law.  The progress here is even more mind-boggling – with growth rates in calculations per unit of resource cost going up by over 30% per year.  But the price of calculations has plummeted in response.  Meanwhile, very many things that we want but don’t make rapid progress in – generating electricity; traveling across town; extracting resources from mines; fixing a broken window; fixing a broken limb; vacation services – see sustained high prices and come to take over the economy.  In fact, despite the amazing results of Moore’s Law and all the quite general-purpose advances it enables – from the Internet, to smartphones, to machine learning – the productivity growth in the U.S. economy if anything appears to be slowing down.

And here's Vollrath again, from his commentary on Clancy and Besiroglu 2023:

There are two ways to "spend" an increase in productivity driven by new ideas. You can use it to produce more goods and services given the same amount of inputs as before, or you can use it to reduce the inputs used while producing the same goods and services as before. If we presume that AI can generate explosive growth in ideas, a very real choice people might make is to "spend" it on an explosive decline in input use rather than an explosive increase in GDP.

Let's say AI becomes capable of micro-managing agricultural land. There is already a "laser-weeder" capable of rolling over a field and using AI to identify weeds and then kill them off with a quick laser strike. Let's say AI raises agricultural productivity by a factor of 10 (even given all the negative feedback loops mentioned above). What's the response to this? Do we continue to use the same amount of agricultural land as before (and all the other associated resources) and increase food production by a factor of 10? Or do we take advantage of this to shrink the amount of land used for agriculture by a factor of 10? If you choose the latter - which is entirely reasonable given that worldwide we produce enough food to feed everyone - then there is no explosive growth in agricultural output. There isn't any growth in agricultural output. We've taken the AI-generated idea and generated exactly zero economic growth, but reduced our land use by around 90%.

Which is amazing! This kind of productivity improvement would be a massive environmental success. But ideas don't have to translate into economic growth to be amazing. More important, amazing-ness does not necessarily lead to economic growth.
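To see the quoted dynamic in miniature, here's a toy simulation of my own (two goods consumed as perfect complements, one unit of labor split between them, and assumed productivity growth of 50%/year in sector A vs. 10%/year in sector B):

```python
# With Leontief preferences, labor is split so that outputs are equal: a*L_A = b*(1 - L_A),
# which gives a common consumption level of y = a*b / (a + b).
a, b = 1.0, 1.0                       # sector productivities
y_prev = a * b / (a + b)
for year in range(1, 31):
    a *= 1.50                         # fast-growing sector
    b *= 1.10                         # slow-growing sector
    y = a * b / (a + b)
    if year in (1, 5, 10, 20, 30):
        print(year, f"aggregate growth this year: {y / y_prev - 1:.1%}")
    y_prev = y
# Growth starts around 27% but falls toward 10% as sector A's expenditure share shrinks --
# the aggregate rate gets dragged down toward the slowest-growing good, per Vollrath.
```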


In general I find the AI explosive growth debate pretty confusing and frustrating, for reasons related to what Vollrath says about "amazing-ness" in that last quote.

Often (and for instance, in this post), the debate gets treated as indirect "shadowboxing" about the plausibility of various future AI capabilities, or about the degree of "transformation" AI will bring to the future economy – if you doubt explosive growth you are probably not really "feeling the AGI," etc.

But if we really want to talk about those things, we should just talk about them directly.  "Will there be explosive growth?" is a poor proxy for "will AI dramatically transform the world economy?", and things get very muddled when we talk about the former and then read into this talk to guess what someone really thinks about the latter.

Maybe AI will be so transformative that "the economy" and "economic growth" won't even exist in any sense we would now recognize.  Maybe it attains capabilities that could sustain explosive growth if there were consumers around to hold up the demand side of that bargain, but it turns out that humans just can't meaningfully "consume" at 100x (or 1000x or whatever) of current levels, at some point there's only 24h in a day, and only so much your mind can attend to at once, etc.  Or maybe there is explosive growth, but it involves "synthetic demand" by AIs for AI-produced goods in a parallel economy humans don't much care about, and we face the continual nuisance of filtering that stuff out of GDP so that GDP still tracks anything meaningful to us.

Or something else entirely, who knows!  What we care about is the actual content of the economic transformation – the specific "amazing" things that will happen, in Vollrath's terms.  We should argue over those, and only derive the answer to "will there be explosive growth?" as a secondary consequence.

This is a very low-quality paper.

Basically, the paper does the following:

  • A 1-layer LSTM gets inputs of the form [operand 1][operator][operand 2], e.g. 1+2 or 3*5
  • It is trained (I think with a regression loss? but it's not clear[1]) to predict the numerical result of the binary operation
  • The paper proposes an auxiliary loss that is supposed to improve "compositionality."
    • As described in the paper, this loss is the average squared difference between successive LSTM hidden states
    • But in the actual code, what is actually computed is instead the average squared difference between successive input embeddings (see the sketch after this list)
  • The paper finds (unsurprisingly) that this extra loss doesn't help on the main task[2], while making various other errors and infelicities along the way
    • e.g. there's train-test leakage, and (hilariously) it doesn't cite the right source for the LSTM architecture[3]
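For concreteness, here's my own sketch (not the paper's code) of the two different auxiliary losses at issue, assuming tensors of shape [seq_len, dim]:

```python
import torch

def loss_described_in_paper(hidden_states):
    # Penalizes changes between successive LSTM hidden states.
    diffs = hidden_states[1:] - hidden_states[:-1]
    return diffs.pow(2).mean()

def loss_in_released_code(input_embeddings):
    # Penalizes differences between successive input embeddings, which never
    # interact with the LSTM's recurrence at all.
    diffs = input_embeddings[1:] - input_embeddings[:-1]
    return diffs.pow(2).mean()
```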

The theoretical justification presented for the "compositional loss" is very brief and unconvincing.  But if I read into it a bit, I can see why it might make sense for the loss described in the paper (on hidden states). 

This could regularize an LSTM to produce something closer to a simple sum or average of the input embeddings, which is "compositional" in the sense that inputs for different timesteps might end up in different subspaces.  It's still not very clear why you would want this property (it seems like at this point you're just saying you don't really want an LSTM, as this is trying to regularize away some of the LSTM's flexibility), nor why an LSTM was chosen in the first place (in 2025!), but I can at least see where the idea came from.

However, the loss actually used in the code makes no sense at all.  The embeddings can't see one another, and the inputs are sampled independently from one another in data generation, so the code's auxiliary loss is effectively just trying to make the input embeddings for all vocab tokens closer to one another in L2 norm.  This has nothing to do with compositionality, and anyway, I suspect that the rest of the network can always "compensate for it" in principle by scaling up the input weights of the LSTM layer.[4]

If it had actually used the loss on hidden states as described, this would still be a bad paper: it reports a negative result and under-motivates the idea so that it's not clear why the negative result might be noteworthy.  (Plus: LSTM, weird arithmetic regression toy task, etc.)

Once you take into account the nonsensical loss that was actually used, it's just... nothing.  The idea makes no sense, was not motivated at all, was inaccurately described in the paper, and does not work in practice.


To Sakana's credit, they did document all of these problems in their notes on the paper – although they were less critical than I am.  In principle they could have hidden away the code issue rather than mentioning it, and the paper would have seemed less obviously bad... I guess this is a really low bar, but still, it's something.

  1. ^

    The Sakana code review shows that it's evaluated with a regression loss, but it's not clear out of context what criterion (the training loss function) is AFAICT.

    (Edit after looking over the code again: the model returns a number, and the data generation code also returns the targets as numbers. So the loss function is comparing a number to another one. Unless it's doing some cursed re-tokenization thing internally, it's regression.)

  2. ^

    Note that it never directly tests for "compositionality," just for test set performance on the main task. 

    Although in a few places it conflates its so-called "compositional loss" with compositionality itself, e.g. claims that the regularization "effectively enforces compositionality" when in fact the evidence just shows that it decreases the auxiliary loss, which of course it does – that's what happens when you minimize something, it goes down.

  3. ^

    Hochreiter & Schmidhuber 1997 has over 100k citations, it's one of the most familiarly cited references in all of ML, you'd think an LLM would have memorized that at least!

  4. ^

    Although in practice this is limited by the learning rate and the duration of training, which may explain why the paper got worse main-task performance with stronger regularization even though the regularization is "conceptually" a no-op.
