And in fact southern Iraq was and is predominantly Shiite (and thus also more susceptible to Iranian influence). They too revolted against Saddam after the first Gulf War https://en.m.wikipedia.org/wiki/1991_Iraqi_uprisings and were euphoric about his fall.
I agree with the previous points, but I would also add historical events that led to this.
Pre-WW I Germany was much more important and played the role that France plays today (maybe even more central); see the University of Göttingen at the time.
After two world wars the German mathematics community was in shambles, with many mathematicians fleeing during that period (Grothendieck, Artin, Gödel, ...). The University of Bonn (and the MPI) was Hirzebruch's post-war project to rebuild the math community in Germany.
I assume France was then able to rise as the hotspot, and I would be curious to imagine what would have happened in an alternative timeline.
In our toy example, I would intuitively associate the LLC with the test loss rather than the train loss. For the training of a single model, it has been observed that test loss and LLC are correlated. Plausibly, for this simple model the (final) LLC, train loss, and test loss are all closely related.
We haven't seen that empirically with the usual regularization methods, so I assume there must be something special going on with the training setup.
I wonder if this phenomenon is partially explained by scaling up the embedding and scaling down the unembedding by a factor (or vice versa). That should leave the LLC constant but will change the L2 norm.
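The rescaling intuition is easy to check numerically. Below is a minimal sketch with a purely linear read-in/read-out pair standing in for an embedding and unembedding (names and shapes are illustrative, not from the post; with a nonlinearity in between, the exact invariance would require the intermediate path to be homogeneous):

```python
import numpy as np

rng = np.random.default_rng(0)
W_E = rng.normal(size=(8, 16))   # stand-in "embedding"
W_U = rng.normal(size=(16, 8))   # stand-in "unembedding"
x = rng.normal(size=(4, 8))

alpha = 3.0
W_E2, W_U2 = alpha * W_E, W_U / alpha   # scale one up, the other down

out1 = x @ W_E @ W_U
out2 = x @ W_E2 @ W_U2
print(np.allclose(out1, out2))   # True: the function (hence loss/LLC) is unchanged

l2_before = np.sum(W_E**2) + np.sum(W_U**2)
l2_after = np.sum(W_E2**2) + np.sum(W_U2**2)
print(l2_before, l2_after)       # the L2 norm changes substantially
```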
It's now well known that simple neural network models often "grok" algorithmic tasks. That is, when trained for many epochs on a subset of the full input space, the model quickly attains perfect train accuracy and then, much later, near-perfect test accuracy. In the former phase, the model memorizes the training set; in the latter, it generalizes out-of-distribution to the test set.
In the algorithmic grokking literature, there is typically exactly one natural generalization from the training set to the test set. What if, however, the training set were instead under-specified in such a way that there were multiple possible generalizations? Would the model grok at all? If so, which of the generalizing solutions would it choose? If the model followed Occam's razor, it would choose the simplest...
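The standard grokking setup described above can be sketched concretely. The task below is modular addition, the most common choice in this literature; `p` and `train_frac` are illustrative values, not taken from any particular paper:

```python
import numpy as np

# Modular addition (a + b) mod p, trained on a random fraction of all pairs.
p, train_frac = 97, 0.3
rng = np.random.default_rng(0)

pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p

perm = rng.permutation(len(pairs))
n_train = int(train_frac * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

# A model trained on train_idx quickly memorizes it (perfect train accuracy)
# and only much later generalizes to the held-out test_idx split.
print(len(train_idx), len(test_idx))
```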
The relevant question then becomes whether the "SGLD" sampling techniques used in SLT for measuring the free energy (or technically its derivative) actually converge to reasonable values in polynomial time. This is checked pretty extensively in this paper for example.
The linked paper considers only large models which are DLNs (deep linear networks). I don't find this to be very compelling evidence for large models with non-linearities. Other measurements I have seen for bigger/deeper non-linear models seem promising, but I wouldn't call them robust yet (though it is not clear to me if ...
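For concreteness, the shape of the SGLD-based estimator under discussion can be sketched on a toy loss where the answer is known. The loss L(w) = |w|²/2 has a regular (non-degenerate) minimum in d dimensions, so the learning coefficient is d/2; all hyperparameters below are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 1000                 # parameter dimension, nominal dataset size
beta = 1.0 / np.log(n)         # inverse temperature, the usual ~1/log n choice
gamma = 1.0                    # localization strength around w* = 0
eps, steps, burn = 1e-3, 20_000, 2_000

def L(w):
    return 0.5 * np.dot(w, w)  # regular minimum: true learning coefficient d/2

w = np.zeros(d)
losses = []
for t in range(steps):
    # SGLD step on the localized tempered posterior exp(-n*beta*L - gamma/2 |w-w*|^2)
    grad = n * beta * w + gamma * w
    w += -0.5 * eps * grad + np.sqrt(eps) * rng.normal(size=d)
    if t >= burn:
        losses.append(L(w))

llc_hat = n * beta * (np.mean(losses) - L(np.zeros(d)))
print(llc_hat)  # should come out close to d/2 = 2.0
```

On a quadratic loss the chain mixes easily; the open question raised above is precisely whether this still happens in polynomial time for realistic non-linear models.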
I see, thanks for sharing!
Did you use something like what is described here? By brittle, do you mean w.r.t. the sparsity penalty (and other hyperparameters)?
Thanks for the reference. I wanted to illuminate the value of gradients of activations in this toy example, as I have been thinking about similar ideas.
I personally would be pretty excited about attribution dictionary learning, but it seems like nobody has done that on bigger models yet.
Are you suggesting that there should be a formula similar to the one in Proposition 5.1 (or 5.2) that links information about the activations with the LC as a measure of basin flatness?
I played around with the example as well and got similar results. I was wondering why there are two more dominant PCs: If you assume there is no bias, then the activations will all look like
or , and I checked that the two directions found by the PCA approximately span the same space as . I suspect something similar is happening with bias.
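The bias-free observation can be checked directly in a minimal caricature. Assuming (purely for illustration) a scalar input x and hidden activations ReLU(x·w), every activation vector lies in the two-dimensional space spanned by ReLU(w) and ReLU(−w) (one direction for x > 0, one for x < 0), so an uncentered PCA finds exactly two dominant components:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=32)                  # hidden weight vector (illustrative)
x = rng.normal(size=500)                 # scalar inputs
H = np.maximum(x[:, None] * w[None, :], 0.0)   # activations, shape (500, 32)

U, S, Vt = np.linalg.svd(H, full_matrices=False)
print(S[:4])                             # two large singular values, rest ~ 0

# The top-2 principal directions span the same space as {ReLU(w), ReLU(-w)}.
basis = Vt[:2]                           # orthonormal rows
for u in (np.maximum(w, 0), np.maximum(-w, 0)):
    u = u / np.linalg.norm(u)
    proj = basis.T @ (basis @ u)
    print(np.allclose(proj, u, atol=1e-8))   # True: u lies in the PC span
```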
In this specific example there is a way to get the true direction w_out from the activations: By doing a PCA on the gradient of the activati...
Using ZIP as a compression metric for NNs (I assume you do something along the lines of "take all the weights, line them up, and then ZIP them") is unintuitive to me for the following reason:
ZIP (though really this should apply to any other coding scheme that just tries to compress the weights by themselves) picks up on statistical patterns in the raw weights. But NNs are not simply a list of floats; they are arranged in a highly structured manner. The weights themselves get turned into functions, and it is 1. the functions, and 2. the way the functions ...
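This worry is easy to make concrete: ZIP only sees the byte-level statistics of the flattened weights, not the functional structure. Randomly shuffling the weights would destroy the network's function entirely, yet the compressed size barely moves (the weights below are random stand-ins for a real network):

```python
import numpy as np
import zlib

rng = np.random.default_rng(0)
weights = rng.normal(size=10_000).astype(np.float32)   # stand-in "network"

shuffled = weights.copy()
rng.shuffle(shuffled)    # same multiset of floats, function-destroying order

c_orig = len(zlib.compress(weights.tobytes(), level=9))
c_shuf = len(zlib.compress(shuffled.tobytes(), level=9))
print(c_orig, c_shuf, len(weights.tobytes()))
# The two compressed sizes are nearly identical, and neither is much smaller
# than the raw byte count: ZIP cannot tell a network from shuffled weight soup.
```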
One (soft) takeaway from the discussion here is that if training “real-life” modern LLMs involves reasoning in the same reference class as parity, then it is likely that the algorithm they learn is not globally optimal (in a Bayesian sense).
I think this is a crux for me. I don't have a good guess how common this phenomenon is. The parity problem feels pathological in some sense, but I wouldn't be surprised if there are other classes of problems that fall into the same category and are represented in some training data.
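For a toy illustration of how a parity task can be under-specified: two different parity functions (over bit subsets S1 and S2, chosen here for illustration) agree on every input in a restricted training region, so the training set alone cannot distinguish them:

```python
import numpy as np
from itertools import product

S1, S2 = [0, 1], [0, 1, 2, 3]            # two candidate parity subsets
X = np.array(list(product([0, 1], repeat=6)))
f1 = X[:, S1].sum(axis=1) % 2            # parity over S1
f2 = X[:, S2].sum(axis=1) % 2            # parity over S2

# Training only on inputs where the hypotheses coincide (here: x2 == x3)
# leaves the task under-determined: a learner could output either rule.
train = (f1 == f2)
print(train.sum(), "inputs on which both parities agree")
print((f1[~train] != f2[~train]).all())  # they disagree everywhere else
```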
Using almost the same training parameters as above (I used full batch and train_frac=0.5 to get faster & more consistent grokking, but I don't think this matters here)
I did a few runs and the results all looked more or less like this. The training process of such toy models doesn't contain that many bits of interesting information, so I wouldn't be surprised if a variety of different metrics captured this process in this case. (E.g. the training dynamics can also be modelled by an HMM, see here.)
I think a point that I don't find sufficiently stressed is that impostor syndrome is not so much about the perceived absolute lack of knowledge/expertise/... but rather the perceived relative lack.
At least speaking for myself, the experience of not knowing something does not in itself trigger any emotional response. Whereas comparing myself to people who have an impressive amount of knowledge about something I don't is much more likely to make me feel like an impostor.
I am a bit confused here and I would appreciate your thoughts!
Do you want to assume finite or not? Either way, I am confused:
I. is finite
In this case, the notion of almost all / almost surely is vacuous: anything which is true up to a finite set is trivially true if your initial measure space itself has finite cardinality.
II. is infinite
While there is no immediate problem, I think your condition that for almost all , we want for any , becomes too strong for a reasonable sim...
I tried to play chess with it. At first I was faced with some resistance:
Prompt: A play about a human and an AI playing chess. Human: "I start and play pawn e2-e4"
Response: AI: "I'm sorry, but I am not able to play chess. I am a large language model trained by OpenAI, and I am not able to access the internet or interact with the world outside of my training data. I am only able to provide information and answer questions to the best of my ability within the scope of my training. If you have a question or are looking for information on a specific topic, I ...
Great to see more work on (better) influence functions!
Lots of interesting things to discuss here[1], but one thing I would like to highlight is that classical IFs indeed arise when you apply the usual implicit function theorem + global minimum assumption (which is obviously violated in the context of DL), but they also arise as the limit of unrolling as t→∞. What follows is of a more theoretical nature, summarizing statements in Mlodozeniec et al.
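For concreteness, the classical influence-function expression in question is, schematically (notation mine, following the standard treatment, with θ* the assumed global minimum and H the training-loss Hessian there):

```latex
% Influence of a training point z on the loss at a query point z',
% under the global-minimum assumption at \theta^\ast (notation illustrative):
\mathcal{I}(z, z')
  = -\,\nabla_\theta \ell(z', \theta^\ast)^{\top}\,
     H_{\theta^\ast}^{-1}\,
     \nabla_\theta \ell(z, \theta^\ast),
\qquad
H_{\theta^\ast} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2\, \ell(z_i, \theta^\ast).
```

The point above is that the same expression also falls out as the t→∞ limit of differentiating through the unrolled training dynamics, without invoking the global-minimum assumption directly.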
Influence functions suffer from another shortcoming, since they only use the final weights (as you are a... (read more)