All of bayesian_kitten's Comments + Replies

Hi there! I'm Ameya, currently at the University of Tübingen. I share similar broad interests and am particularly enthusiastic about working on evaluations. Would love to be a part of a broader evals group if one is created (Slack/Discord)!

We organized an evals workshop recently! It had a broader focus and wasn't specifically related to AI safety, but it was a great experience -- we're planning to keep running iterations of it and to sharpen its focus.

Referencing recent papers sent my way here (this shall be a live, expanding comment); please do link more if you think they might be useful:
- Inductive biases in theory-based reinforcement learning 

This post (which is really dope) provides some grokking examples in large language models in a Big-Bench video at 19313s & 19458s, with that segment (18430s-19650s) being a nice watch! I shall spend a bit more time collecting and precisely identifying evidence and then include it in the grokking part of this post. This was a really nice thing to know about and very surprising.

3gwern
I've commented on that, but I'm not convinced that the phase transitions in learning are grokking, per se. There are many different scaling phenomena, and we shouldn't go around prematurely conflating them.

They're likely to be interchangeable, sorry. I might've misused the words here to try to tease out the point that simply understanding how a given model works is not really insightful if the patterns themselves are not understandable.

I think these patterns that seem nonsensical to humans might be a significant fraction of the patterns learned by deep networks. I was trying to understand the radical optimism, in contrast to my pessimism given this. The crux being: since we don't know what these patterns are and what they represent, even if we figure out... (read more)

Am I correct in thinking that 'ersatz' and 'real' interpretability might differ in more aspects than just the degree of interpretability? Ersatz interpretability is somewhat rooted in explaining the typical case, whereas 'real' interpretability gives good reasoning even in the worst case. Interpretability might be hard to achieve in worst-case scenarios where some atypical wiring leads to wrong decisions?

Furthermore, I suspect transparency is being confused with interpretability. Even if we understand what each and every neuron does (radical transparency), it might not be interpreta... (read more)

4Steven Byrnes
Newbie question: What's the difference between transparency and interpretability? Follow-up question: Does everyone agree with that answer or is it not standardized?
3Quintin Pope
I think you're pointing to a special case of a more general pattern. Just like there's a general factor of "athletic ability" which can be subdivided into many correlated components, I think "interpretability" can be split into many correlated components. Some of those components correspond to greater reliability in the worst case scenarios. Others might correspond to, e.g., predicting which architectural modifications would improve performance on typical input. Worst case reliability is probably the most important component to interpretability, but I'm not sure it makes sense to call it "real interpretability". It's probably better to just call it "worst case" interpretability.

When you have enough real-world data, you don't need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It's worth noting that no one in the large language model space has ever 'used up' all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don't have to keep around the original dataset to sample maintenance batches from while doing more training.
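For concreteness, here is how I read the one-epoch, stream-and-discard setup being described -- a rough, hypothetical toy sketch (PyTorch-style; `fresh_batch` is just a stand-in for the real-data firehose, not anything from the comment itself):

```python
# Rough sketch of "one-epoch, stream-and-discard" training (hypothetical toy setup).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def fresh_batch():
    # Stand-in for grabbing a new datapoint from the firehose; in practice this
    # would read the next chunk of a never-repeating data stream.
    x = torch.randn(32, 128)
    y = x.mean(dim=1, keepdim=True)
    return x, y

for step in range(10_000):  # single pass: no example is ever revisited
    x, y = fresh_batch()
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # x and y are discarded here; no replay buffer of old data is kept
    # for later "maintenance batches".
```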

This would be the main crux, actual... (read more)

Hi! Thanks for reading and for the interesting questions:

I read that first sentence several times and it's still not clear what you mean, or how the footnote helps clarify. What do you mean by 'tweak'? A tweak is a small incremental change.

That's correct. What I meant is: say we state an agent has 'x, y, z biases'; it can then try to correct them. Now, the changes cannot be arbitrary -- the constraints are that it has to stay competitive and robust. But I think it can reduce the strength of a heuristic by going against it whenever it can, to the extent those heuristi... (read more)

Hi! Thanks for reading the post carefully and coming up with interesting evidence and arguments against it~ I think I can explain PF4, but I am certainly wrong on B1.

PF4

Why do you have high confidence that catastrophic forgetting is immune to scaling, given "Effect of scale on catastrophic forgetting in neural networks", Anonymous 2021?

Catastrophic forgetting (mechanism): We train a model to minimize loss on dataset X. Then we train it to minimize loss on dataset Y. When minimizing loss on dataset Y, it has no incentive to care about loss on dataset X. Hence, c... (read more)
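As a minimal toy sketch of this mechanism (a hypothetical setup, not from the original post): train a small network on dataset X, then on dataset Y with no replay of X, and the loss on X climbs back up.

```python
# Toy illustration of catastrophic forgetting (hypothetical setup, PyTorch).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Two synthetic tasks: same inputs, conflicting target functions.
x = torch.randn(512, 10)
y_task_x = x.sum(dim=1, keepdim=True)          # "dataset X"
y_task_y = (x ** 2).sum(dim=1, keepdim=True)   # "dataset Y"

def train(inputs, targets, steps=500):
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        opt.step()

train(x, y_task_x)                 # first minimize loss on X
with torch.no_grad():
    loss_x_before = loss_fn(model(x), y_task_x).item()

train(x, y_task_y)                 # then minimize loss on Y, with no replay of X
with torch.no_grad():
    loss_x_after = loss_fn(model(x), y_task_x).item()

# loss_x_after is typically much larger than loss_x_before: nothing in the
# task-Y objective rewards the model for retaining its fit to task X.
print(loss_x_before, loss_x_after)
```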

4gwern
I take the point of the paper as showing that as models get larger and more overparameterized, it gets easier for them to store arbitrary capabilities without interference in part because the better representations they learn mean that there is much less to store/learn for any new task, which will share a lot of structure. At some point, worrying about 'classes' or 'heads' just becomes irrelevant as you zero-shot or few-shot it: eg CLIP doesn't really need to worry about catastrophic forgetting because you just type in the text description of what 'class' you're interested in and 'classify' that way; a MoE doesn't worry about task classification, because it learns what sub-expert to dispatch input to. You won't need to 'switch between tasks' (not even that meaningful a thing outside the constraints of a benchmark) because in-context learning & representations do all the work, latently disambiguating where one is.

You will simply train large (perhaps sparse or MoE-esque) models in one-epoch fashion, streaming in data constantly and discarding it. When you have enough real-world data, you don't need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It's worth noting that no one in the large language model space has ever 'used up' all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don't have to keep around the original dataset to sample maintenance batches from while doing more training.

This solution will make a lot of people very unhappy as they insist that "this isn't a solution, you just made a very large model, arrgghh, so inefficient and ungreen", but if it solves the problem, then it solves the problem, and now you're just haggling over the price. Are there ways more efficient? Almost certainly. Should we care that much about or bother researching them? Maybe. The grokking paper is