All of Tom Lieberum's Comments + Replies

We use 1024, though often article snippets are shorter than that, so they are separated by BOS.

Cool work!

Did you run an ablation on the auxiliary losses? How important were they for stabilizing training?

Did you compare to training separate Q and K SAEs via typical reconstruction loss? Would be cool to see a side-by-side comparison, i.e. how large the benefit of this scheme is. 

1keith_wynroe
Thanks! The auxiliary losses were something we settled on quite early, and we made some improvements to the methodology since then for the current results, so I don't have great apples-to-apples comparisons for you. The losses didn't seem super important though, in the sense that runs would still converge, just take longer and end with slightly worse reconstruction error. I think it's very likely that with a better training set-up/better hyperparameter tuning you could drop these entirely and be fine. Re: comparison to SAEs, you mean what do the dictionaries/feature-map have to look like if you're explicitly targeting L2-reconstruction error and just getting pattern reconstruction as a side-effect? If so, we also looked at this briefly early on. We didn't spend a huge amount of time on these so they were probably not optimally trained, but we were finding that to get L2-reconstruction error low enough to yield comparably good pattern reconstruction we were needing to go up to a d_hidden of 16,000, i.e. comparable to residual SAEs for the same layer. Which I think is another data point in favour of "a lot of the variance in head-space is attention-irrelevant and just inherited from the residual stream".

During parts of the project I had the hunch that some letter-specialized heads are more like proto-correct-letter-heads (see paper for details), based on their attention pattern. We never investigated this, and I think it could go either way. The "it becomes cleaner" intuition basically relies on stuff like the grokking work and other work showing representations being refined late during training (by Thisby et al., I believe, and maybe other work). However some of this would probably require randomising e.g. the labels the model sees during training. See ... (read more)

Yup! I think that'd be quite interesting. Is there any work on characterizing the embedding space of GPT2?

3LawrenceC
There's also a lot of academic work on the geometry of LM embeddings, e.g.:
  • https://openreview.net/forum?id=xYGNO86OWDH (BERT, ERNIE)
  • https://arxiv.org/abs/2209.02535 (GPT-2-medium)
(Plus a mountain more on earlier text/token embeddings like Word2Vec.)
4LawrenceC
Adam Scherlis did some preliminary exploration here: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights Here's a more thorough investigation of the overall shape of said embeddings with interactive figures: https://bert-vs-gpt2.dbvis.de/
1[anonymous]
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation is related to the embedding space

Nice work, thanks for sharing! I really like the fact that the neurons seem to upweight different versions of the same token (_an, _An, an, An, etc.). It's curious because the semantics of these tokens can be quite different (compared to the "though"/"tho"/"however" neuron).

 

Have you looked at all into what parts of the model feed into (some of) the cleanly associated neurons? It was probably out of scope for this but just curious.

2Joseph Miller
We did look very briefly at this for the " an" neuron. We plotted the residual stream congruence with the neuron input weights throughout the model. The second figure shows the difference from each layer. Unfortunately I can't seem to include an image in a comment. See it here. We can't tell that much from this, but I think there are three takeaways:
1. The model doesn't start 'preparing' to activate the " an" neuron until layer 16.
2. No single layer stands out a lot as being particularly responsible for the " an" neuron's activation (which is part of why we didn't investigate this further).
3. The congruence increases a lot after MLP 31. This means the output of layer 31 is very congruent with the input weights of the " an" neuron (which is in MLP 31). I think this is almost entirely the effect of the " an" neuron itself, partly because the input of the " an" neuron is very congruent with the " an" token (although not as much as the neuron output weights). This makes me think that this neuron is at least partly a 'signal boosting' neuron.
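For concreteness, here's a minimal sketch of the kind of congruence measurement described above, not the authors' actual code. It uses TransformerLens, assumes GPT-2 Large (the layer from the comment, with a placeholder neuron index), and operationalizes "congruence" as cosine similarity (the post may have used raw dot products instead).

```python
# Sketch: how congruent is the residual stream with one MLP neuron's input weights,
# layer by layer? LAYER / NEURON_IDX are placeholders for the " an" neuron.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-large")
LAYER, NEURON_IDX = 31, 892  # placeholder indices, substitute the neuron you care about

w_in = model.W_in[LAYER][:, NEURON_IDX]          # (d_model,) input weights of the neuron

tokens = model.to_tokens("I picked up an apple and ate")
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]    # residual stream at the final position
    congruence = torch.dot(resid, w_in) / (resid.norm() * w_in.norm())
    print(f"layer {layer:2d}: cosine with neuron input weights = {congruence:.3f}")
```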
3Logan Riggs
One reason the neuron is congruent with multiple of the same tokens may be because those token embeddings are similar (you can test this by checking their cosine similarities).
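A minimal sketch of Logan's suggested check, assuming each of the four variants is a single token in GPT-2's vocabulary (the assert makes that assumption explicit); the model choice is arbitrary since all GPT-2 sizes share the tokenizer.

```python
# Sketch: pairwise cosine similarities of the " an" / " An" / "an" / "An" token embeddings.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
emb = model.wte.weight.detach()                  # token embedding matrix, (vocab, d_model)

variants = [" an", " An", "an", "An"]
assert all(len(tokenizer.encode(v)) == 1 for v in variants)  # assumed: each is one token
ids = [tokenizer.encode(v)[0] for v in variants]

vecs = torch.nn.functional.normalize(emb[ids], dim=-1)
print(vecs @ vecs.T)                             # pairwise cosine similarities
```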

(The quote refers to the usage of binary attention patterns in general, so I'm not sure why you're quoting it)

I obv agree that if you take the softmax over {0, 1000, 2000}, you will get 0 and 1 entries.

iiuc, the statement in the tracr paper is not that you can't have attention patterns which implement this logical operation, but that you can't have a single head implementing this attention pattern (without exponential blowup) 

2Gurkenglas
(To show my idea is compatible with Boolean attention.) I use a single head, and the ranks add up linearly.
Tom LieberumΩ11-2

I don't think that's right. Iiuc this is a logical AND, so the values would be in {0, 1} (as required, since Tracr operates with Boolean attention). For a more extensive discussion of the original problem see appendix C.

2Gurkenglas
Appendix C attempts to discard the softargmax, but it's an integral part of my suggestion. If the inputs to softargmax take values {0,1000,2000}, the outputs will take only two values. From the RASP paper: https://arxiv.org/pdf/2106.06981. Addition of such attention patterns corresponds to ANDing of such selectors.
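To make the numbers concrete, here is a small sketch of the point being debated: scores restricted to values like {0, 1000, 2000} give numerically binary attention patterns after softmax, and adding two such score matrices attends only where both selectors fire. The specific score matrices are illustrative, not taken from either paper.

```python
# Sketch: softmax over {0, 1000, 2000}-valued scores, and score addition as an AND.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

sel_a = np.array([[0, 1000, 1000, 0]], dtype=float)   # selector A's scores over 4 keys
sel_b = np.array([[0, 0, 1000, 1000]], dtype=float)   # selector B's scores

print(softmax(sel_a))           # ~[0, 0.5, 0.5, 0]   -- binary up to normalisation
print(softmax(sel_b))           # ~[0, 0, 0.5, 0.5]
print(softmax(sel_a + sel_b))   # ~[0, 0, 1, 0]       -- attends only where A AND B fire
```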

Meta-q: Are you primarily asking for better assumptions or that they be made more explicit?

I would be most interested in an explanation for the assumption that is grounded in the distribution you are trying to approximate. It's hard to tell which parts of the assumptions are bad without knowing (which properties of) the distribution it's trying to approximate or why you think that the true distribution has property XYZ.

Re MLPs: I agree that we ideally want something general but it looks like your post is evidence that something about the assumptions is ... (read more)

(ETA to the OC: the antipodal pairs wouldn't happen here due to the way you set up the data generation, but if you were to learn the features as in the toy models post, you'd see that. I'm now less sure about this specific argument)

Thanks for posting this. Some comments/questions we had after briefly discussing it in our team:

  • We would have loved to see more motivation for why you are making the assumptions you are making when generating the toy data.
    • Relatedly, it would be great to see an analysis of the distribution of the MLP activations. This could give you some info where your assumptions in the toy model fall short.
  • As Charlie Steiner pointed out, you are using a very favorable ratio of the number of ground-truth features to the encoding dimension in the toy model.
... (read more)
2Lee Sharkey
  This is valid; they're not well fleshed out above. I'll take a stab at it here below, and I discussed it a bit with Ryan below his comment. Meta-q: Are you primarily asking for better assumptions or that they be made more explicit? RE MLP activations distribution: Good idea! One reason I didn't really want to make too many assumptions that were specific to MLPs was that we should in theory be able to apply sparse coding to residual stream activations too. But looking closely at the distribution that you're trying to model is, generally speaking, a good idea :) We'll probably do that for the next round of experiments if we continue along this avenue.  I hadn't previously considered the importance of 'strongly interfering' superposition. But that's clearly the right regime for real networks and probably does explain a lot about the mismatch. Thanks for highlighting this!  Agree that this would be interesting! Trenton has had some ideas for metrics that better capture this notion, I think.  Undertrained autoencoders are something that worries me too, especially for experiments that use larger dictionaries (they take longer to converge). This is definitely something we'd want to ensure/study in the next phase.

Yeah I agree with that. But there is also a sense in which some (many?) features will be inherently sparse.

  • A token is either the first one of a multi-token word or it isn't.
  • A word is either a noun, a verb or something else.
  • A word belongs to language LANG and not to any other language/has other meanings in those languages.
  • An image can only contain so many objects, which can only contain so many sub-aspects.

I don't know what it would mean to go "out of distribution" in any of these cases.

This means that any network that has an incentive to conserve... (read more)

Do superposition features actually seem to work like this in practice in current networks? I was not aware of this.

I'm not aware of any work that identifies superposition in exactly this way in NNs of practical use. 
As Spencer notes, you can verify that it does appear in certain toy settings though. Anthropic notes in their SoLU paper that they view their results as evidence for the SPH in LLMs. Imo the key part of the evidence here is that using a SoLU destroys performance but adding another LayerNorm afterwards solves that issue. The SoLU selects st... (read more)

This example is meant to only illustrate how one could achieve this encoding. It's not how an actual autoencoder would work. An actual NN might not even use superposition for the data I described and it might need some other setup to elicit this behavior.
But to me it sounded like you view superposition as nothing but the network being confused, whereas I think it can be the correct way to still be able to reconstruct the features to a reasonable degree.

2Lucius Bushnaq
Not confused, just optimised to handle data of the kind seen in training, and with limited ability to generalise beyond that, compared to human vision.

Ah, I might have misunderstood your original point then, sorry! 

I'm not sure what you mean by "basis" then. How strictly are you using this term?

I imagine you are basically going down the "features as elementary unit" route proposed in Circuits (although you might not be predisposed to assume features are the elementary unit). Finding the set of features used by the network and figuring out how it's using them in its computations does not 1-to-1 translate to "find the basis the network is thinking in" in my mind.

5Lucius Bushnaq
Fair enough, imprecise use of language. For some definitions of "thinking" I'd guess a small vision CNN isn't thinking anything.

Possibly the source of our disagreement here is that you are imagining the neuron ought to be strictly monotonically increasing in activation relative to the dog-headedness of the image?

If we abandon that assumption then it is relatively clear how to encode two numbers in 1D. Let's assume we observe two numbers X and Y, with one case occurring with some probability p and the other with probability 1-p.

We now want to encode these two events in some third variable, such that we can perfectly reconstruct X and Y with probabilit... (read more)
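As a purely illustrative sketch (my own construction, not necessarily the one intended in the truncated comment above): if at most one of the two features is ever active at a time, a single scalar suffices, with the sign indicating which feature and the magnitude its value.

```python
# Illustrative sketch: encoding two mutually exclusive features X, Y in one scalar.
import numpy as np

rng = np.random.default_rng(0)

def sample(p=0.5):
    """With probability p feature X is active (and Y = 0), otherwise Y is active."""
    if rng.random() < p:
        return rng.uniform(0.1, 1.0), 0.0
    return 0.0, rng.uniform(0.1, 1.0)

def encode(x, y):
    return x - y                        # sign says which feature, magnitude says how much

def decode(z):
    return max(z, 0.0), max(-z, 0.0)

for _ in range(5):
    x, y = sample()
    assert decode(encode(x, y)) == (x, y)
print("perfect reconstruction, as long as the two features never co-occur")
```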

2Lucius Bushnaq
Ah, I see. Thank you for pointing this out. Do superposition features actually seem to work like this in practice in current networks? I was not aware of this. In any case, for a network like the one you describe I would change my claim to the AI having a concept for something humans don't have a neat short description for. So for example, if your algorithm maps X>0, Y>0 to the first case, I'd call it a feature of "presence of dog heads or car fronts, or presence of car fronts". I don't think this is an inherent problem for the theory. That a single floating point number can contain a lot of information is fine, so long as you have some way to measure how much it is.
1Tom Lieberum
This example is meant to only illustrate how one could achieve this encoding. It's not how an actual autoencoder would work. An actual NN might not even use superposition for the data I described and it might need some other setup to elicit this behavior. But to me it sounded like you view superposition as nothing but the network being confused, whereas I think it can be the correct way to still be able to reconstruct the features to a reasonable degree.

I'd say that there is a basis the network is thinking in in this hypothetical; it would just so happen not to match the human abstraction set for thinking about the problem in question.

Well, yes, but the number of basis elements needed to make that basis human-interpretable could theoretically be exponential in the number of neurons.

2Lucius Bushnaq
Sure, but that's not a question I'm primarily interested in. I don't want the most interpretable basis, I want the basis that network itself uses for thinking. My goal is to find the elementary unit of neural networks, to build theorems and eventually a whole predictive theory of neural network computation and selection on top of.  That this may possibly make current networks more human-interpretable even in the short run is just a neat side benefit to me.

If due to superposition, it proves advantageous to the AI to have a single feature that kind of does dog-head-detection and kind of does car-front-detection, because dog heads and car fronts don't show up in the training data at the same time, so it can still get perfect loss through a properly constructed dual-purpose feature like this, it'd mean that to the AI, dog heads and car fronts are "the same thing".

I don't think that's true. Imagine a toy scenario of two features that run through a 1D non-linear bottleneck before being reconstructed. Assuming ... (read more)

2Lucius Bushnaq
Reality is usually sparse in features, and that's why even very small and simple intelligences can operate within it most of the time, so long as they don't leave their narrow contexts. But the mark of a general intelligence is that it can operate even in highly out-of-distribution situations. Cars are usually driven on roads, so an intelligence could get by using a car even if its concepts of car-ness were all mixed up with its conception of roadness. But a human can plan to take a car to the moon and drive it on the dust there, and then do that. This indicates to me that a general intelligence needs to think in features that can compose to handle almost any data, not just data that usually appeared in the training distribution. If your architecture has too many bottlenecks to allow this, I expect that it will not be able to become a human-level general intelligence. (Parts of the human brain definitely seem narrow and specialised too of course, it's only the general reasoning capabilities that seem to have these ultra-factorising, nigh-universally applicable concepts.) Note also that concepts humans use can totally be written as superpositions of other concepts too, most of these other concepts apparently just aren't very universally useful.
2Lucius Bushnaq
I'm not sure I understand this example. If I have a single 1-D feature, a floating point number that goes up with the amount of dog-headedness or car-frontness in a picture, then how can the model in a later layer reconstruct whether there was a dog-head xor a car-front in the image from that floating point number, unless it has other features that effectively contain this information?

I agree that all is not lost wrt sparsity, and if SPH turns out to be true it might help us disentangle the superimposed features to better understand what is going on. You could think of constructing an "expanded" view of a neural network. The expanded view would allocate one neuron per feature and would thus have sparse activations for any given data point and be easier to reason about. That seems impractical in reality, since the cost of constructing this view might in theory be exponential, as there are exponentially many "almost orthogonal" vectors for ... (read more)
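For concreteness, the packing fact I have in mind is the standard almost-orthogonal-vectors bound; the constant below is indicative only.

```latex
% In d dimensions one can fit exponentially many pairwise almost-orthogonal unit vectors
% (Johnson--Lindenstrauss-style argument; constants indicative only).
\[
  \max\bigl\{\, N : \exists\, v_1,\dots,v_N \in S^{d-1}
  \text{ with } |\langle v_i, v_j\rangle| \le \epsilon \text{ for all } i \ne j \,\bigr\}
  \;\ge\; e^{c\,\epsilon^2 d}
  \quad\text{for some constant } c > 0 .
\]
```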

I don't think there is anything on that front other than the paragraphs in the SoLU paper. I alluded to a possible experiment for this on Twitter in response to that paper but haven't had the time to try it out myself: You could take a tiny autoencoder to reconstruct some artificially generated data where you vary attributes such as sparsity, ratio of input dimensions vs. bottleneck dimensions, etc. You could then look at the weight matrices of the autoencoder to figure out how it's embedding the features in the bottleneck and which settings lead to superposition, if any.
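A minimal sketch of the kind of experiment described above, in the spirit of the toy-models setup; it is my own stand-in, and all sizes, the sparsity level, and the tied-weight architecture are arbitrary choices rather than anything from the SoLU paper.

```python
# Sketch: tiny linear autoencoder on synthetic sparse features; inspect the weights
# to see whether/when features get embedded in superposition.
import torch
import torch.nn as nn

n_features, d_hidden, sparsity = 6, 3, 0.9   # more features than bottleneck dimensions

def sample_batch(batch_size=1024):
    # each feature is independently present with probability (1 - sparsity)
    mask = (torch.rand(batch_size, n_features) > sparsity).float()
    return mask * torch.rand(batch_size, n_features)

W = nn.Parameter(0.1 * torch.randn(n_features, d_hidden))  # feature -> bottleneck directions
b = nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for _ in range(5000):
    x = sample_batch()
    x_hat = torch.relu(x @ W @ W.T + b)      # tied-weight autoencoder
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Rows of W are the bottleneck directions assigned to each feature; off-diagonal
# structure in W @ W.T means features share dimensions, i.e. superposition.
print((W @ W.T).detach())
```

Varying `sparsity` and the `n_features`/`d_hidden` ratio is exactly the kind of sweep the comment suggests.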

6carboniferous_umbraculum
I'm not at liberty to share it directly but I am aware that Anthropic have a draft of small toy models with hand-coded synthetic data showing superposition very cleanly. They go as far as saying that searching for an interpretable basis may essentially be mistaken.  

I disagree with your intuition that we should expect networks at irreducible loss to not be in superposition.

The reason I brought this up is that there are, IMO, strong first-principle reasons for why SPH should be correct. Say there are two features which each have an independent probability of 0.05 of being present in a given data point; then it would be wasteful to allocate a full neuron to each of these features. The probability of both features being present at the same time is a mere 0.0025. If the superposition is implemented well you get basically tw... (read more)
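Spelling out the arithmetic (assuming the two features are independent):

```latex
\[
  P(\text{both present}) = 0.05 \times 0.05 = 0.0025 ,
  \qquad
  P(\text{exactly one present}) = 2 \times 0.05 \times 0.95 = 0.095 .
\]
```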

2Garrett Baker
Ah, I think you're right here, though I don't think this means there's no room for improvement on the sparsity front. Do you know of any hand-constructed examples of a layer in superposition for which we know the features? I'd like to play around with one, and see if there's any robust way to disentangle it.

Interesting idea! 

What do you think about the Superposition Hypothesis? If that were true, then at a sufficient sparsity of features in the input there is no basis that the network is thinking in, meaning it will be impossible to find a rotation matrix that allows for a bijective mapping between neurons and features.

I would assume that the rotation matrix that enables local changes via the sparse Jacobian coincides with one which maximizes some notion of "neuron-feature-bijectiveness". But as noted above that seems impossible if the SPH holds.

5Lucius Bushnaq
I'd say that there is a basis the network is thinking in in this hypothetical; it would just so happen not to match the human abstraction set for thinking about the problem in question. If due to superposition, it proves advantageous to the AI to have a single feature that kind of does dog-head-detection and kind of does car-front-detection, because dog heads and car fronts don't show up in the training data at the same time, so it can still get perfect loss through a properly constructed dual-purpose feature like this, it'd mean that to the AI, dog heads and car fronts are "the same thing".  The network hasn't figured out how to distinguish between them. In a more general data set where dog heads and car fronts can co-occur, this network would fail. Its abstractions are optimised for the narrow training data set, where it genuinely proved to be unnecessarily cumbersome to assign different concepts to those two things. As AIs get more capable and general, I'd expect the concepts/features they use to start more closely matching the ones humans use in many domains. As AI gets superhuman, I would be somewhat worried about it finding new concept/feature sets that work even better and more generally than human ones.
2Garrett Baker
I mostly expect networks at zero loss not to be in a superposition, since we should expect those networks to be in a broad basin, meaning fairly few independent, orthogonal features, so less room to implement two completely different functions. But we don't always find networks in broad basins, so we may see some networks in a superposition. It would be interesting to study which training regimes and architectures most/least often produce easily-interpretable networks by this metric, and this may give some insight into when you see superposition. In the cases where there is a nice basis this device finds, we may also expect it to disentangle any superpositions which exist, and for this superposition to be a combination of two fairly simple functions, requiring very few features, or interpreting the same features in different ways.

K-composition as a concept was introduced by Anthropic in their work on Transformer Circuits in the initial post. In general, the output of an attention head in an earlier layer can influence the query, key, or value computation of an attention head in a later layer. 

K-composition refers to the case in which the key-computation is influenced. In a model without nonlinearities or layernorms, you can do this simply by looking at how strongly the output matrix of head 1 and the key matrix of head 2 compose (or more precisely, by looking at the Frobenius n... (read more)
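Here is a minimal sketch of the Frobenius-norm composition score as I understand it from the Transformer Circuits write-up; the exact shape and transpose conventions below are my assumptions, and random matrices stand in for real head weights just to make it runnable.

```python
# Sketch: K-composition score between two attention heads, with made-up weights.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 8

def random_head():
    # W_Q, W_K, W_V: residual stream -> head space; W_O: head space -> residual stream
    return dict(
        W_Q=rng.normal(size=(d_head, d_model)),
        W_K=rng.normal(size=(d_head, d_model)),
        W_V=rng.normal(size=(d_head, d_model)),
        W_O=rng.normal(size=(d_model, d_head)),
    )

h1, h2 = random_head(), random_head()

W_OV1 = h1["W_O"] @ h1["W_V"]    # (d_model, d_model): what head 1 writes into the residual stream
W_QK2 = h2["W_Q"].T @ h2["W_K"]  # (d_model, d_model): head 2's attention-score bilinear form

def frob(M):
    return np.linalg.norm(M, ord="fro")

# K-composition: head 2's key input contains head 1's output, so the relevant product is W_QK2 @ W_OV1
k_comp = frob(W_QK2 @ W_OV1) / (frob(W_QK2) * frob(W_OV1))
print(f"K-composition score: {k_comp:.4f}")   # baseline value for random, unstructured heads
```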

Thanks for verifying! I retract my comment.

I think historically reinforcement has been used more in that particular constellation (see e.g. the deep RL from human preferences paper), but as I noted I find reward learning more apt, as it points to the hard thing being the reward learning, i.e. distilling human feedback into an objective, rather than the optimization of any given reward function (which technically need not involve reinforcement learning)

Well, I thought about that, but I wasn't sure whether reinforcement learning from human feedback wouldn't just be a strict subset of reward learning from human feedback. If reinforcement is indeed the strict definition then I concede, but I don't think it makes sense.

7Lukas Finnveden
The acronym is definitely used for reinforcement learning. ["RLHF" "reinforcement learning from human feedback"] gets 564 hits on google, ["RLHF" "reward learning from human feedback"] gets 0.
-2Tom Lieberum
I think historically reinforcement has been used more in that particular constellation (see e.g. the deep RL from human preferences paper), but as I noted I find reward learning more apt, as it points to the hard thing being the reward learning, i.e. distilling human feedback into an objective, rather than the optimization of any given reward function (which technically need not involve reinforcement learning)

Reward Learning from Human Feedback

[This comment is no longer endorsed by its author]

Reinforcement* learning from human feedback

Thanks for your reply! I think I basically agree with all of your points. I feel a lot of frustration around the fact that we don't seem to have adequate infohazard policies to address this. It seems like a fundamental trade-off between security and openness/earnestness of discussion does exist though. 

It could be the case that this community is not the correct place to enforce these rules, as there does still exist a substantial gap between "this thing could work" and "we have a working system". This is doubly true in DL, where implementation details matter a great deal.

My tentative heuristic for whether you should publish a post that is potentially infohazardy is "Has company-X-who-cares-mostly-about-capabilities likely thought about this already?". It's obviously non-trivial to answer that question but I'm pretty sure most companies who build LLMs have looked at Chinchilla and come to similar conclusions as this post. In case you're unsure, write up the post in a google doc and ask someone who has thought more about infohazards whether they would publish it or not. 

Also, I think Leon underestimates how fast a post can spread even if it is just intended for an alignment audience on LW. 

I'd like to propose not talking publicly about ways to "fix" this issue. Insofar as these results spell trouble for scaling up LLMs, this is a good thing!
Infohazard (meta-)discussions are thorny by their very nature and I don't want to discourage discussions around these results in general, e.g. how to interpret them or whether the analysis has merits. 

I disagree, but I'm not sure how relevant my opinion is, since I'm far less worried about "AGI ruin" to begin with than the median LWer.  That said, here's my thinking:

First, there's no universally agreed-upon line between "discussing whether the analysis has merits" and "giving the capabilities people free ideas."  Where a person draws this line depends on how obvious they think the ideas are, or how obvious they think they will be to the capabilities people.

Second, there are costs to not talking about things.  It's useful for alignment res... (read more)

I upvoted since I think discussing what should or should not be discussed is important, but I tentatively disagree:

  • It seems unlikely that comments on lesswrong speed up capabilities research since the thoughts are probably just a subset of what the scaling teams know, and lesswrong is likely not their highest signal information source anyway.
  • Even from a safety perspective, it seems important to know which problems in capabilities research can be alleviated, since this will give a clearer picture of timelines.
  • I think we should have strong reasons before dis
... (read more)

If the subset of interpretable models is also "nice" in the differential-geometric sense (say, also a smooth submanifold of the parameter space), then the intersection is also similarly "nice."

 

Do you have any intuition for why we should expect this intersection to be "nice"? I'm not super familiar with differential geometry but I don't really see why this should be the case.
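(For reference, the textbook sufficient condition would presumably be transversality, and it's exactly not obvious to me why it should hold here:)

```latex
\[
  A \pitchfork B \;:\Longleftrightarrow\; T_p A + T_p B = T_p M \quad \text{for all } p \in A \cap B ,
\]
\[
  A \pitchfork B \;\Longrightarrow\; A \cap B \text{ is a smooth submanifold with }
  \operatorname{codim}(A \cap B) = \operatorname{codim} A + \operatorname{codim} B .
\]
```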

This assumes a fixed scaling law. One possible way of improving oneself could be to design a better architecture with a better scaling exponent.

Thanks for elaborating! Insofar as your assessment is based on in-person interactions, I can't really comment since I haven't spoken much with people from Anthropic.

I think there are degrees to believing this meme you refer to, in the sense of "we need an AI of capability level X to learn meaningful things". And I would guess that many people at Anthropic do believe this weaker version -- it's their stated purpose after all. And for some values of X this statement is clearly true, e.g. learned filters by shallow CNNs trained on MNIST are not interpretable, ... (read more)

I'd also be interested in hearing which parts of Anthropic's research output you think burns our serial time budget. If I understood the post correctly, then OP thinks that efforts like transformer circuits are mostly about accelerating parallelizable research.

Maybe OP thinks that

  • mechanistic interpretability does have little value in terms of serial research
  • RLHF does not give us alignment (because it doesn't generalize beyond the "sharp left turn" which OP thinks is likely to happen)
  • therefore, since most of Anthropic's alignment focused output has not
... (read more)
So8res2811

My mental filter for inclusion on that list was apparent prevalence of the "we can't do alignment until we have an AGI in front of us" meme. If a researcher has that meme and their host org is committed to not advancing the public capabilities frontier, that does ameliorate the damage, and Anthropic does seem to me to be doing the best on that front (hooray for Anthropic!). That said, my impression is that folks at Anthropic are making the tradeoffs differently from how I would, and my guess is that this is in part due to differences in our models of what's needed for alignment, in a fashion related to the topic of the OP.

Yep all good points. I think I didn't emphasize enough that you should not take it every day (maybe not even every other day).

The gums are less addictive than cigs because they taste bad and because the feedback/reinforcement is slower. Lozenges sound like a good alternative too, to be extra sure.

Answer by Tom Lieberum60

I wouldn't recommend regular caffeine at all unless you know from experience that you won't develop a physical dependency. In my experience you get more of a short-term gain until your body adapts and then requires coffee to function normally.

If you do want to try caffeine I recommend trying to pair it with L-theanine (either in pills or green tea) which is supposed to smooth the experience and makes for a cleaner high (YMMV).

If you're looking for a stimulant that you don't take regularly and with shorter half life, consider nicotine gums. Again ymmv, I think gwern has tried it with little effect. Beware the addictive potential (although lower than with cigarettes or vapes)

3ponkaloupe
i’m not sure i’d recommend nicotine even in gum form. you’ll notice an obvious boost the first few times you do it — and the shorter half-life is nice for working in the evenings — but like most other drugs you build dependence quick. after a couple weeks you literally won’t notice any effect from taking that same initial dose. overcoming that by bumping the dose is, obviously, unsustainable. if you do go the nicotine route, try both the gum and the lozenges. gum is more effective at quickly weening you off of cigs because it replaces one ritual (smoking) with another (chewing), whereas the lozenges are really just about physically delivering nicotine to the body without much ritual (i.e. they’re less “habit forming”).

On priors, I wouldn't worry too much about c), since I would expect a 'super stimulus' for head A to not be a super stimulus for head B.

I think one of the problems is the discrete input space, i.e. how do you parameterize the sequence that is being optimized?

One idea I just had was trying to fine-tune an LLM with a reward signal given by, for example, the magnitude of the residual delta coming from a particular head (we probably want something else here, maybe net logit change?). The LLM then already encodes a prior over "sensible" sequences and will try to find one of those which activates the head strongly (however we want to operationalize that).
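As a rough sketch of how the reward term alone could be operationalized (my own stand-in, not a worked-out method): score a sequence by the norm of one head's contribution to the residual stream via TransformerLens, and plug that scalar into whatever RLHF-style fine-tuning loop you prefer. The layer/head indices are placeholders.

```python
# Sketch: reward = norm of a chosen attention head's contribution to the residual stream.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
model.set_use_attn_result(True)   # expose per-head outputs in the cache

LAYER, HEAD = 5, 1                # placeholder head to "super-stimulate"

def head_activation_reward(text: str) -> float:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    # per-head contribution to the residual stream: (batch, pos, head, d_model)
    head_out = cache["result", LAYER][0, :, HEAD, :]
    return head_out.norm(dim=-1).mean().item()

print(head_activation_reward("The quick brown fox jumps over the lazy dog."))
```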

Very cool to see new people joining the interpretability field!

Some resource suggestions:

If you didn't know already, there is a TF2 port of Lucid, called Luna:

There is also Lucent, which is Lucid for PyTorch: (Some docs written by me for a slightly different version)

For transformer interpretability you might want to check out Anthropic's work on transformer circuits, Redwood Research's interpretability tool, or (shameless plug) Unseal.

I can't speak to the option for remote work but as a counterpoint, it seems very straightforward to get a UK visa for you and your spouse/children (at least straightforward relative to the US). The relevant visa to google is the Skilled Worker / Tier 2 visa if you want to know more.

ETA: Of course, there are still legitimate reasons for not wanting to move. Just wanted to point out that the legal barrier is lower than you might think.

2M. Y. Zuo
I'm not familiar with London housing prices. Is it possible to affordably rent or mortgage a decent 2 bedroom condo within a 5 minute walk of the offices with your compensation package? (By affordable I mean less than 1/3 of total comp spent on housing, stretching to 1/2 if comp is unusually high.)

There is definitely something out there, just can't recall the name. A keyword you might want to look for is "disentangled representations".

One start would be the beta-VAE paper https://openreview.net/forum?id=Sy2fzU9gl

1Hoagy
Cheers!

Considering you get at least one free upvote from posting/commenting itself, you just have to be faster than the downvoters to generate money :P

Small nitpick:

The PCA plot is using the smallest version of GPT2, and not the 1.5B parameter model (that would be GPT2-XL). The small model is significantly worse than the large one and so I would be hesitant to draw conclusions from that experiment alone.

I want to second your first point. Texting frequently with significant others lets me feel part of their life and vice versa, which a weekly call does not accomplish, partly because it is weekly and partly because I am pretty averse to calls. 

In one relationship I had, this led to significant misery on my part because my partner was pretty strict on their phone usage, batching messages for the mornings and evenings. For my current primary relationship, I'm convinced that the frequent texting is what kept it alive while being long-distance. 

To r... (read more)

I tried increasing weight decay and increased batch sizes but so far no real success compared to 5x lr. Not going to investigate this further atm.

Oh I thought figure 1 was S5 but it actually is modular division. I'll give that a go..

Here are results for modular division. Not super sure what to make of them. Small increases in learning rate work, but so does just choosing a larger learning rate from the beginning. In fact, increasing lr to 5x from the beginning works super well but switching to 5x once grokking arguably starts just destroys any progress. 10x lr from the start does not work (nor when switching later)

So maybe the initial observation is more a general/global property of the loss landsca... (read more)

3Rohin Shah
Yeah, that seems right, I think I'm basically at "no, you can't just 10x the learning rate once grokking starts".
3gwern
Increasing regularization (weight decay in this instance) might rescue the ones which don't work.

So I ran some experiments for the permutation group S_5 with the task x ∘ y = ?

Interestingly here increasing the learning rate just never works. I'm very confused.

3Rohin Shah
Also interestingly, in the default setting for these new experiments, grokking happens in ~1000 steps while memorization happens in ~1500 steps, so the grokking is already faster than the memorization, in stark contrast to the graphs in the original post. (This does depend on when you start the counter for grokking, as there's a long period of slowly increasing validation accuracy. You could reasonably say grokking took ~2500 steps.)

I updated the report with the training curves. Under default settings, 100% training accuracy is reached after 500 steps.

There is actually an overlap between the train/val curves going up. Might be an artifact of the simplicity of the task or that I didn't properly split the dataset (e.g. x+y being in train and y+x being in val). I might run it again for a harder task to verify.

3Rohin Shah
Huh, intriguing. Yeah, it might be worth running with a non-commutative function and seeing if it holds up -- it seems like in the default setting the validation accuracy hits almost 0.5 once the training accuracy is 1, which is about what you'd get if you understood commutativity but nothing else about the function. So the "grokking" part is probably happening after that, i.e. at roughly the 1.5k steps location in the default setting.

Yep I used my own re-implementation, which somehow has slightly different behavior.

I'll also note that the task in the report is modular addition while figure 1 from the paper (the one with the red and green lines for train/val) is the significantly harder permutation group task.

I'm not sure I understand.

I chose the grokking starting point as 300 steps, based on the yellow plot. I'd say it's reasonable to say that 'grokking is complete' by the 2000 step mark in the default setting, whereas it is complete by the 450 step mark in the 10x setting (assuming appropriate LR decay to avoid overshooting). Also note that the plots in the report are not log-scale

3Rohin Shah
Ah, I just looked at your plots, verified that the grokking indeed still happened with 5x and 10x learning rates, and then just assumed 10x faster convergence in the original plots in the post. Apparently that reasoning was wrong. Presumably you're using different hyperparameters than the ones used in this post? You seem to have faster grokking in the "default setting" than the in the plots shown in the post. (And it does look like, given some default setting, "10x faster convergence" is basically right, since in your case 10x higher LR makes the grokking stage go from 1700 steps to 150 steps.) (Partly the issue was that I wasn't sure whether the x-axis in your plots was starting from the beginning of training, or from the point that grokking started, so I instead reasoned about the impact on the graphs in this post. Though looking at the LR plot it's now obvious that it's from the beginning of training.) I now think this is relatively strong evidence for my view, given that grokking happens pretty quickly (~a third of total training), though it probably is still decently slower than the memorization. (Do you happen to have the training loss curves, so we can estimate how long it takes to memorize under your hyperparameters?)
Tom LieberumΩ17270

It would be interesting to see if, once grokking had clearly started, you could just 100x the learning rate and speed up the convergence to zero validation loss by 100x.

I ran a quick-and-dirty experiment and it does in fact look like you can just crank up the learning rate at the point where some part of grokking happens to speed up convergence significantly. See the wandb report:

https://wandb.ai/tomfrederik/interpreting_grokking/reports/Increasing-Learning-Rate-at-Grokking--VmlldzoxNTQ2ODY2?accessToken=y3f00qfxot60n709pu8d049wgci69g53pki6pq6khsemnncca1... (read more)
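For readers who want to poke at this themselves, here is a minimal sketch of the intervention (my own toy stand-in, not the setup in the report): train on modular addition with weight decay and multiply the learning rate once validation accuracy starts climbing. The model size, hyperparameters, and the 0.4 trigger threshold are all arbitrary choices.

```python
# Sketch: bump the learning rate once "grokking" (rising validation accuracy) begins.
import torch
import torch.nn as nn

P, LR, LR_BOOST, TRIGGER = 97, 1e-3, 10.0, 0.4   # all arbitrary stand-ins

# all (a, b) pairs with label (a + b) mod P, split 50/50 into train/val
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, val_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2:]

model = nn.Sequential(
    nn.Embedding(P, 64), nn.Flatten(),           # embed both operands, concatenate
    nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, P),
)
opt = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=1.0)
boosted = False

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(20_000):
    loss = nn.functional.cross_entropy(model(pairs[train_idx]), labels[train_idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        val_acc = accuracy(val_idx)
        if not boosted and val_acc > TRIGGER:    # crude "grokking has started" trigger
            for g in opt.param_groups:
                g["lr"] *= LR_BOOST
            boosted = True
        print(step, round(accuracy(train_idx), 3), round(val_acc, 3), opt.param_groups[0]["lr"])
```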

3Rohin Shah
Very nice! Thanks for actually running the experiment :) It's not clear to me which story this supports since 10x-ing the learning rate only brings the grokking phase down to 8×10^4 steps, which is still the majority of the training run.

Ah yes that makes sense to me. I'll modify the post accordingly and probably write it in the basis formulation.

ETA: Fixed now, computation takes a tiny bit longer but hopefully still readable to everyone.

Seems like this could be circumvented relatively easily by freezing gametes now.

3Mati_Roy
and/or using gamete donors (still a good point though)