All of Rareș Baron's Comments + Replies

Your hypothesis seems reasonable, and I think the following proves it.
1. This is for 5e-3, giving no spikes and faster convergence:

2. Gradient descent failed to converge for multiple LRs, from 1e-2 to 1e-5. However, dividing the LR by a factor of 1.0001 whenever the training error increases gave this:

It's messy, and the reduction seems to turn the jumps of the slingshot effect into opportunities to get stuck in sub-optimal basins, but the trajectory was always downwards. Increasing the rate of reduction decreased the spikes, but then convergence no longer appeared.
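To make the schedule concrete, here is a minimal sketch of the rule (divide the learning rate by 1.0001 whenever the training loss goes up) on a toy quadratic; everything except the 1.0001 factor is an illustrative assumption, not the actual experiment:

```python
# Minimal sketch of the LR-reduction rule described above, on a toy
# one-parameter objective loss = w**2. Hypothetical stand-in for the
# real training loop.

def train(lr=1e-2, factor=1.0001, steps=2000):
    w = 5.0                    # single parameter
    prev_loss = w * w
    for _ in range(steps):
        w -= lr * (2 * w)      # gradient of w**2 is 2w
        loss = w * w
        if loss > prev_loss:   # loss increased: shrink the LR slightly
            lr /= factor
        prev_loss = loss
    return w, lr

final_w, final_lr = train()    # converges toward 0 on this convex toy
```

On a convex toy like this the loss is monotone for a small enough LR, so the reduction rarely fires; on a real non-convex, mini-batched loss it would fire at every spike, which is the behavior described above.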

An increase to 2.... (read more)


I have uploaded HTML files of all the animations so they can be interactive. The corresponding training graphs are in the associated notebooks.

The original learning rate was 1e-3.

For 5e-4, it failed to converge:

For 8e-4, it did converge, and the trajectory was downwards this time:

Gurkenglas
Oh, you're using AdamW everywhere? That might explain the continuous training loss increase after each spike, with AdamW needing time to adjust to the new loss landscape... Lower learning rate leads to more spikes? Curious! I hypothesize that... it needs a small learning rate to get stuck in a narrow local optimum, and then when it reaches the very bottom of the basin, you get a ~zero gradient, and then the "normalize gradient vector to step size" step is discontinuous around zero. Experiments springing to mind are:
1. Do you get even fewer spikes if you increase the step size instead?
2. Is there any optimizer setup at all that makes the training loss only ever go down?
   2.1. Reduce the step size whenever an update would increase the training loss?
   2.2. Use gradient descent instead of AdamW?
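The hypothesized discontinuity can be seen numerically: Adam-style updates divide the first-moment estimate by the root of the second moment, so the step magnitude barely depends on the gradient magnitude (until eps dominates), while the step's sign flips with the gradient's sign. A hedged single-step sketch, with fresh moment estimates rather than a full AdamW loop:

```python
import math

# Single Adam-style step from zero-initialized moments. Hypothetical
# illustration of the "normalize gradient to step size" discontinuity,
# not the optimizer used in the experiments.

def adam_step(g, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = (1 - b1) * g               # first moment after one update
    v = (1 - b2) * g * g           # second moment after one update
    m_hat = m / (1 - b1)           # bias correction
    v_hat = v / (1 - b2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# A gradient 100x smaller still produces an almost identical step size,
# and flipping its sign flips the whole step:
big, small, neg = adam_step(1e-2), adam_step(1e-4), adam_step(-1e-4)
```

So near the bottom of a narrow basin, a gradient crossing zero swings the update from roughly +lr to roughly -lr, which is consistent with the spiking picture.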

For 1 and 2 - I have. Everything is very consistent.
For 3, I have tried several optimizers, and they all failed to converge. Tweaking the original AdamW to reduce the learning rate led to very similar results:

For 4, I have done animations for every model (besides the 2 GELU variants). I saw pretty much what I expected: a majority of relevant developments (Fourier frequencies, concentration of singular values, activations, and attention heads) happened quickly, in the clean-up phase. The spikes seen in SiLU and SoLU_LN were visible, though not lasting. I ha... (read more)

Gurkenglas
My eyes are drawn to the 120 or so downward tails in the latter picture; they look of a kind with the 14 in https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/2c6249da0e8f77b25ba007392087b76d47b9a16f969b21f7.png/w_1584. What happens if you decrease the learning rate further in both cases? I imagine the spikes should get less tall, but does their number change? Only dot plots, please, with the dots drawn smaller, and red dots too on the same graph. I don't see animations in the drive folder or cached in Grokking_Demo_additional_2.ipynb (the most recent, largest notebook) - can you embed one such animation here?

Fair statistical point; however, in reality the vast majority of serial killers did not go above 15 victims, and the crimes they committed were perpetrated before their first (and last) arrest. I do not have raw numbers, but my impression is that the number of those sentenced for one murder, later paroled, and then beginning a spree of more than 3-4, is incredibly small. Serial killers are also rare in general.

Gang considerations, however, might be a larger factor here, though I still doubt it is enough to tip the scales (especially as prison gang affiliat... (read more)

I am sorry for being slow to understand. I hope I will internalise your advice and the linked post quickly.

I have re-done the graphs, to be for every epoch. Very large spikes for SiLU were hidden by the skipping. I have edited the post to rectify this, with additional discussion.

Again, thank you (especially your patience).

Gurkenglas
Having apparently earned some cred, I will dare give some further quick hints without having looked at everything you're doing in detail, expecting a lower hit rate.
1. Have you rerun the experiment several times to verify that you're not just looking at initialization noise?
2. If that's too expensive, try making your models way smaller and see if you can get the same results.
3. After the spikes, training loss continuously increases, which is not how gradient descent is supposed to work. What happens if you use a simpler optimizer, or reduce the learning rate?
4. Some of your pictures are created from a snapshot of a model. Consider generating them after every epoch, producing a video; this way increases how much data makes it through your eyes.
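Hint 4 (one picture per epoch, stitched into a video) can be sketched in a few lines of matplotlib; the "model snapshot" below is a random drifting matrix standing in for whatever the real per-epoch plots show, so everything here is an illustrative assumption:

```python
import matplotlib
matplotlib.use("Agg")              # headless backend; render without a display
import matplotlib.pyplot as plt
import numpy as np

# Render the same picture after every "epoch" and save numbered frames.
# Hypothetical stand-in data; real runs would plot the actual model state.
rng = np.random.default_rng(0)
snapshot = rng.normal(size=(16, 16))

for epoch in range(5):                         # real runs: every epoch
    snapshot += 0.1 * rng.normal(size=snapshot.shape)
    fig, ax = plt.subplots()
    ax.imshow(snapshot, cmap="viridis")
    ax.set_title(f"epoch {epoch}")
    fig.savefig(f"frame_{epoch:04d}.png")
    plt.close(fig)                             # avoid piling up open figures

# The numbered frames can then be stitched into a video, e.g.:
#   ffmpeg -framerate 10 -i frame_%04d.png snapshots.mp4
```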

Apologies for misunderstanding. I get it now, and will be more careful from now on.

I have re-run the graphs where such misunderstandings might appear (for this and a future post), and added them here. I don't think I have made any mistakes in interpreting the data, but I am glad to have looked at the clearer graphs.

Thank you very much!

Gurkenglas
I'm glad that you're willing to change your workflow, but you have only integrated my parenthetical, not the more important point. When I look at https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/tzkakoG9tYLbLTvHG/lelcezcseu001uyklccb, I see interesting behavior around the first red dashed line, and wish I saw more of it. You ought to be able to draw 25k blue points in that plot, one for every epoch - your code already generates that data, and I advise that you cram as much of your code's data into the pictures you look at as you reasonably can.

I will keep that in mind for the future. Thank you!
I have put all high-quality .pngs of the plots in the linked Drive folder.

Gurkenglas
...what I meant is that plots like this look like they would have had more to say if you had plotted the y value after e.g. every epoch. No reason to throw away perfectly good data, you want to guard against not measuring what you think you are measuring by maximizing the bandwidth between your code and your eyes. (And the lines connecting those data points just look like more data while not actually giving extra information about what happened in the code.)
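The "dots, not lines" advice is a one-line change in matplotlib; a hedged sketch, with a synthetic curve standing in for the real per-epoch training loss:

```python
import matplotlib
matplotlib.use("Agg")             # headless backend
import matplotlib.pyplot as plt
import numpy as np

# One small dot per epoch, no connecting lines. The 25k epochs match
# the figure discussed above; the loss values are made up.
epochs = np.arange(25_000)
rng = np.random.default_rng(0)
loss = np.exp(-epochs / 5_000) + 0.01 * rng.random(epochs.size)

fig, ax = plt.subplots()
ax.scatter(epochs, loss, s=1, color="tab:blue")   # s=1: tiny dots
ax.set_yscale("log")
ax.set_xlabel("epoch")
ax.set_ylabel("train loss")
fig.savefig("loss_dots.png", dpi=150)
```

A second `ax.scatter(..., color="tab:red")` call puts the red series on the same axes, as requested above.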

I have major disagreements with the arguments of this post (with the understanding that it is a steelman), but I do want to say that it has made me moderately update towards the suitability of the death penalty as a punishment, from a purely utilitarian perspective (though it has not tipped the scale). It has also showcased interesting and important figures, so thank you for that.

Deterrence and recidivism

At that point killing 3 million criminals to save the lives of 2.4 million

How many of those 2.4 million were murdered by recidivists? Even if we assume th... (read more)

Viliam
Another important number is: of the less than 10% who do, how many more people do they kill? Imagine that you have nine murderers who kill 1 person each, and one serial killer who kills 100. It is simultaneously true that only 10% of murderers murder again and that executing them could save many lives. (But life in prison, if the chances of escape are sufficiently low, could achieve the same.)