Asking for an acquaintance. Suppose I know some graduate-level machine learning, have read ~most of the recent mechanistic interpretability literature, and have made good progress understanding a small-ish neural network in the last few months.
Is ARENA for me, or will it teach things I mostly already know?
(I advised this person that they already have ARENA-graduate level, but I want to check in case I'm wrong.)
Maybe being a guslar is not so different from telling a joke 2294 lines long
That's a very good point! I think the level of ability required is different, but otherwise the analogy seems right.
The guslar's songs are (and were of course already in the 1930-1950s) also printed, so the analogy may be closer than you thought.
Is there a reason I should want to?
I don't know, I can't tell you that. If I had to choose, I'd also strongly prefer literacy.
But I didn't know there was a tradeoff there! I thought literacy was basically unambiguously positive -- whereas now I think it is net highly positive.
Also I strongly agree with frontier64 that the skill that is lost is rough memorization + live composition, which is a little different.
They memorize the story, with particular names, and then sing it with consistent decasyllabic metre and rhyme. Here's an example song transcribed with its recording: Ropstvo Janković Stojana (The Captivity of Janković Stojan)
the collection: https://mpc.chs.harvard.edu/lord-collection-1950-51/
If you're still interested in this, we have now added Appendix N to the paper, which explains our final take.
Sure, but then why not just train a probe? If we don't care much about precision, what goes wrong with the probe approach?
Here's a reasonable example where naively training a probe fails. The model lies if any of N features is "true". One of the features is almost always activated at the same time as some others, such that in the training set it never solely determines whether the model lies.
Then, a probe trained on the activations may not pick up on that feature. Whereas if we can look at model weights, we can see that this feature also matters, and include it in our lying classifier.
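Here's a minimal sketch of that failure mode with made-up synthetic "activations" (the three binary features and all numbers are invented for illustration, not from any actual setup):

```python
# Toy setup (assumed, not from the paper): the model "lies" iff feature 1 OR
# feature 2 OR feature 3 is active. In training, feature 1 never appears alone,
# so a probe can fit the labels while putting almost no weight on it.
import numpy as np
from sklearn.linear_model import LogisticRegression

train_patterns = np.array([
    [0, 0, 0],  # truthful
    [0, 1, 0],  # lying (feature 2 alone)
    [0, 0, 1],  # lying (feature 3 alone)
    [1, 1, 1],  # lying, but feature 1 only ever co-occurs with 2 and 3
])
X_train = np.repeat(train_patterns, 200, axis=0)
y_train = (X_train.sum(axis=1) > 0).astype(int)

probe = LogisticRegression().fit(X_train, y_train)
print("probe weights:", probe.coef_[0])  # weight on feature 1 tends to be tiny

# At test time feature 1 fires on its own; the probe tends to call this
# "truthful", even though (by assumption) looking at the weights would show
# feature 1 also feeding the lying behaviour.
print("P(lying | only feature 1):", probe.predict_proba([[1, 0, 0]])[0, 1])
```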
This particular case can also be solved by adversarially attacking the probe though.
Thank you, that makes sense!
Indefinite integrals would make a lot more sense this way, IMO
Why so? I thought they already made sense: they're "antiderivatives", i.e. a function such that taking its derivative gives you the original function. Do you need anything further to define them?
(I know about the Riemann and Lebesgue definitions of the definite integral, but I thought indefinite integrals were much easier in comparison.)
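For reference, the definition I had in mind (standard textbook statement, nothing beyond calculus 1):

```latex
% F is an antiderivative of f when F'(x) = f(x); the indefinite integral is the
% whole family of antiderivatives, differing by a constant:
\[
  \int f(x)\,dx = F(x) + C
  \qquad\text{where}\qquad
  F'(x) = f(x).
\]
% Example: since \(\tfrac{d}{dx}\,x^2 = 2x\), we get \(\int 2x\,dx = x^2 + C\).
```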
In such a case, I claim this is just sneaking in Bayes' rule without calling it by name, and that is not a very smart thing to do, because the Bayesian frame gives you a bunch more leverage on analyzing the system.
I disagree. An inductive bias is not necessarily a prior distribution. What's the prior?
I don't think I understand your model of why neural networks are so effective. It sounds like you're saying that, on the one hand, neural networks have lots of parameters, so you should expect them to be terrible; but on the other hand they are actually very good because SGD is such a shitty optimizer that it acts as an implicit regularizer.
Yeah, that's basically my model. How it regularizes I don't know. Perhaps the volume of "simple" functions is the main driver of this, rather than gradient descent dynamics. I think the randomness of it is important; full-gradient descent (no stochasticity) would not work nearly as well.
OK, let's look through the papers you linked.
This one is interesting. It argues that the regularization properties are not in SGD, but rather in the NN parameterization, and that non-gradient optimizers also find simple solutions which generalize well. They talk about Bayes only in a paragraph on page 3. They say that the literature arguing that NNs work well because they're Bayesian is related (which is true -- it's also about generalization and volumes). But I see little evidence that the explanation in this paper is...
In short, the probability distribution you choose contains lots of interesting assumptions, which you didn't necessarily intend, about which states are more likely. As a result, most of the possible hypotheses have vanishingly small prior probability and you can never reach them. Even though with a frequentist approach...
For example, let us consider trying to learn a function with 1-dim numerical input and output (e.g. $f: \mathbb{R} \to \mathbb{R}$). Correspondingly, your hypothesis space is the set of all such functions. There are very many functions (infinitely many if $x \in \mathbb{R}$, otherwise a c...
I just remembered the main way in which NNs are frequentist. They belong to a very illustrious family of frequentist estimators: the maximum likelihood estimators.
Think about it: NNs have a bunch of parameters. Their loss is basically always a negative log-likelihood $-\log p_\theta(y \mid x)$ (e.g. mean-squared error for Gaussian $p$, cross-entropy for categorical $p$). They get trained by minimizing the loss (i.e. maximizing the likelihood).
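Spelling out the correspondence (this is the standard derivation, not anything specific to one paper): the usual losses are negative log-likelihoods, so minimizing them is maximum likelihood estimation in the parameters.

```latex
% Gaussian observation model with fixed variance: the NLL reduces to
% mean-squared error (up to additive/multiplicative constants).
\[
  -\log p_\theta(y \mid x)
  = -\log \mathcal{N}\!\big(y;\ f_\theta(x),\ \sigma^2\big)
  = \frac{\big(y - f_\theta(x)\big)^2}{2\sigma^2} + \text{const}.
\]
% Categorical observation model: the NLL is exactly the cross-entropy loss.
\[
  -\log p_\theta(y \mid x)
  = -\log \operatorname{softmax}\!\big(f_\theta(x)\big)_y .
\]
```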
In classical frequentist analysis they're likely to be a terrible, overfitted estimator, because they have many parameters. And I think this is true if you find...
First, "probability is in the world" is an oversimplification. Quoting from Wikipedia, "probabilities are discussed only when dealing with well-defined random experiments". Since most things in the world are not well-defined random experiments, probability is reduced to a theoretical tool for analyzing things that works when real processes are similar enough to well-defined random experiments.
it doesn't seem to trump the "but that just sounds really absurd to me though" consideration
Is there anything that could trump that consideration? One of my main obje...
They don't seem like a success of any statistical theory to me
In absolute terms you're correct. In relative terms, they're an object that at least frequentist theory can begin to analyze (as you point out, statistical learning theory did, somewhat unsuccessfully).
Whereas Bayesian theory would throw up its hands and say it's not a prior that gets updated, so it's not worth considering as a statistical estimator. This seems even wronger.
More recent theory can account for them working, somewhat. But it's about analyzing their properties as estimators (i.e. frequentism) as opposed to framing them in terms of prior/posterior (though there are plenty of attempts at the latter going around).
Yet, the biggest effect I think this will have is pedagogical. I've always found the definition of a limit kind of unintuitive, and it was specifically invented to add post hoc coherence to calculus after it had been invented and used widely. I suspect that formulating calculus via infinitesimals in introductory calculus classes would go a long way to making it more intuitive.
I think hyperreals are too complicated for calculus 1 and you should just talk about a non-rigorous "infinitesimal" like Newton and Leibniz did.
Voila! We have a suitable definition of "almost all agreement": if the agreement set is contained in some arbitrary nonprincipal ultrafilter.
Isn't it easier to just say "If the agreement set has a nonfinite number of elements"? Why the extra complexity?
must contain a set or its complement
Oh I see, so defining it with ultrafilters rules out situations like $a = (1, 0, 1, 0, \dots)$ and $b = (0, 1, 0, 1, \dots)$, where both have infinitely many zeros and yet their product is zero.
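Spelling out the example to make sure I follow (the alternating sequences are just the standard concrete choice):

```latex
% Two sequences, each with infinitely many zeros, whose pointwise product is zero:
\[
  a = (1, 0, 1, 0, \dots), \qquad
  b = (0, 1, 0, 1, \dots), \qquad
  a \cdot b = (0, 0, 0, \dots).
\]
% Under "equal iff they agree on an infinite set", a agrees with 0 infinitely often
% and also with 1 infinitely often, so the relation is not even well defined; and
% calling both a and b nonzero would give zero divisors. A nonprincipal ultrafilter
% contains exactly one of the even-index set or the odd-index set, so it decides
% each of a, b to equal 0 or 1, and the product works out consistently.
```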
the thing with frequentism is "yeah just use methods in a pragmatic way and don't think about it that hard"
I think this does not accurately represent my beliefs. Frequentism, as I see it, is about thinking hard about how the methods actually behave, as opposed to having a theory that prescribes how methods should behave and then constructing algorithms based on that.
Frequentists analyze the properties of an algorithm that takes data as input (in their jargon, an 'estimator').
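For instance, here's the kind of analysis I mean in miniature (a Monte Carlo estimate of bias/variance/MSE for two estimators of a mean; all numbers are arbitrary):

```python
# Minimal sketch of frequentist analysis: treat an estimator as a function of
# the data and measure its bias/variance/MSE over repeated sampling from a
# fixed ground truth.
import numpy as np

rng = np.random.default_rng(0)
true_mu, n, trials = 2.0, 20, 10_000

def sample_mean(x):
    return x.mean()

def shrunk_mean(x, lam=0.8):
    # A deliberately biased estimator that shrinks toward 0.
    return lam * x.mean()

for name, estimator in [("sample mean", sample_mean), ("shrunk mean", shrunk_mean)]:
    estimates = np.array([estimator(rng.normal(true_mu, 1.0, size=n)) for _ in range(trials)])
    bias = estimates.mean() - true_mu
    var = estimates.var()
    print(f"{name:12s} bias={bias:+.3f} var={var:.3f} mse={bias**2 + var:.3f}")
```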
They also try to construct better algorithms, but each new algorithm is bespoke and requires original...
It's not clear what the ratio of capabilities/alignment progress is for interpretability. There is no empirical track record[^1] of interpretability feeding back into improvements of any kind.
A priori it seems like it would be good because understanding how things work is useful to understand their behavior better, and thus be able to tell whether or not a model is aligned or how to make it more so. But understanding how things work is also useful for making them more capable, e.g. if you use interpretability as a model-debugger, it's basically general purpose for dealing with ML models.
[^1]: known to the author
Cool work! I was going to post about how "effect cancellation" is already known and was mentioned in the original post but, astonishingly to me, it is not! I guess I misremembered.
There's one detail that I'm curious about. CaSc usually compares abs(E[loss] - E[scrubbed loss]), and that of course leads to ignoring hypotheses which lead the model to do better on some examples and worse on others.
If we compare E[abs(loss - scrubbed loss)] does this problem go away? I imagine that it doesn't quite if there are exactly-opposing causes for each example, but that ...
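A toy numeric sketch of the cancellation (the per-example losses are made up, not from any real CaSc run):

```python
# abs(E[loss] - E[scrubbed loss]) can be ~0 even when scrubbing changes every
# example, because improvements and regressions cancel in the mean.
# E[abs(loss - scrubbed loss)] does not cancel that way.
import numpy as np

loss = np.array([1.0, 1.0, 1.0, 1.0])
scrubbed_loss = np.array([0.5, 1.5, 0.3, 1.7])  # better on some examples, worse on others

metric_a = abs(loss.mean() - scrubbed_loss.mean())  # -> 0.0, looks like a perfect hypothesis
metric_b = np.abs(loss - scrubbed_loss).mean()      # -> 0.6, reveals the disagreement
print(metric_a, metric_b)
```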
Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which make it look worse”.
I have just now realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially--that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H’, and then you run the CaSc experiment on join(H, H’).
One thing that is not equivalent to joins, which you might also want to do, is to choose ...
Here are my predictions, from an earlier template. I haven't looked at anyone else's predictions before posting :)
- Describe how the trained policy might generalize from the 5x5 top-right cheese region to cheese spawned throughout the maze? I.e., what will the policy do when cheese is spawned elsewhere?
It probably has hardcoded “go up and to the right” as an initial heuristic, so I’d be surprised if it gets the cheese in the other two quadrants more than 30% of the time (for cheese locations selected uniformly at random from there).
...
- Given a fixed trained policy, what
First of all, I really like the images, they made things easier to understand and are pretty. Good work with that!
My biggest problem with this is the unclear applicability of this to alignment. Why do we want to predict scaling laws? Doesn't that mostly promote AI capabilities, and not alignment very much?
Second, I feel like there's some confusion between several probability distributions and potential functions going on.
Instead of simulating Brownian motion, you could run SGD with momentum. That would be closer to what actually happens with NNs, and just as easy to simulate.
I expect it to be directionally similar but less pronounced (because MCMC methods with momentum explore the distribution better).
I also take issue with the way the conclusion is phrased. "Singularities work because they transform random motion into useful search for generalization". This is only true if you assume that points nearer a singularity generalize better. Maybe I'd phrase it as, "SGD works be...
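For concreteness, something like this is what I have in mind (the potential, noise scale, and hyperparameters are all arbitrary choices; it's a sketch, not the post's actual setup):

```python
# Toy comparison: Brownian motion vs. SGD-with-momentum on the same 2D potential.
import numpy as np

def grad(w):
    # Gradient of the toy potential L(w1, w2) = w1^2 * w2^2 (a line of singularities).
    return np.array([2 * w[0] * w[1] ** 2, 2 * w[0] ** 2 * w[1]])

rng = np.random.default_rng(0)
lr, beta, noise, steps = 1e-3, 0.9, 0.1, 5_000

w_brownian = np.array([1.0, 1.0])
w_sgd, v = np.array([1.0, 1.0]), np.zeros(2)

for _ in range(steps):
    # Pure Brownian motion: random walk, no gradient signal.
    w_brownian = w_brownian + noise * np.sqrt(lr) * rng.normal(size=2)

    # SGD with momentum on a noisy gradient estimate.
    g = grad(w_sgd) + noise * rng.normal(size=2)
    v = beta * v - lr * g
    w_sgd = w_sgd + v

print("Brownian endpoint:     ", w_brownian)
print("SGD+momentum endpoint: ", w_sgd)
```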
The Hayflick Limit, as it has become known, can be thought of as a last line of defense against cancer, kind of like a recursion depth limit [...] Preventing cells from becoming senescent, or reversing their senescent state, may therefore be a bad idea, but what we can do is remove them
When do the cells with sufficiently long telomeres run out? Removing senescent cells sounds good, but if all the cells have a built-in recursion limit, at some point there won't be any cells with sufficiently long telomeres left in the body. Assuming a non-decreasing divi...
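As a back-of-the-envelope toy model of the worry (all numbers invented, stem cells and telomerase ignored entirely):

```python
# Toy model: every cell has a division budget (a "recursion limit"). Dividing
# produces two daughters with one fewer division left; daughters that hit 0 are
# senescent and, per the proposal, removed. Question: when does the pool run out?
import numpy as np

rng = np.random.default_rng(0)
pool = {10: 1_000}   # cells indexed by remaining divisions (made-up budget of 10)
divide_frac = 0.1    # fraction of cells dividing per time step (made-up)

for step in range(1, 5_001):
    new_pool = {}
    for b, n in pool.items():
        dividing = rng.binomial(n, divide_frac)
        # Non-dividing cells keep their budget.
        new_pool[b] = new_pool.get(b, 0) + (n - dividing)
        # Each division yields two daughters with budget b - 1; budget-0 daughters
        # are senescent and removed.
        if b - 1 > 0:
            new_pool[b - 1] = new_pool.get(b - 1, 0) + 2 * dividing
    pool = {b: n for b, n in new_pool.items() if n > 0}
    if not pool:
        print(f"replicating pool exhausted at step {step}")
        break
else:
    print("pool still replicating after 5,000 steps:", sum(pool.values()), "cells")
```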
What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values?
No, I mean that they'll maximize a reward function that is ≈equal to the reward function on the training data (thus, highly correlated), and a plausible extrapolation of it outside of the training data. Take the coinrun example: the actual reward is "go to the coin", and in the training data this coincides with "go to the right". In test data from a similar distribution this coincides too.
Of course, this correlatio...
But the designers can't tell that. Can SGD tell that?
No, SGD can't tell the degree to which some agent generalizes a trait outside the training distribution.
But empirically, it seems that RL agents reinforced to maximize some reward function (e.g. the Atari game score) on some data points do fairly well at maximizing that reward function OOD (such as when playing the game again from a different starting state). ML systems in general seem to be able to generalize to human-labeled categories in situations that aren't in the training data (e.g. image classifie...
Strongly agree with this in particular:
Some people want to apply selection arguments because they believe that selection arguments bypass the need to understand mechanistic details to draw strong conclusions. I think this is mistaken, and that selection arguments often prove too much, and to understand why, you have to know something about the mechanisms.
(emphasis mine). I think it's an application of the no free lunch razor.
It is clear that selecting for X selects for agents which historically did X in the course of the selection. But how this generali...
I agree with the title as stated but not with the rest of the post. RLHF implies that RL will be used, which completely defuses alignment plans that hope that language models will be friendly because they're not agents. (It may be true that supervised-learning (SL) models are safer, but the moment you get an SL technique, people are going to jam it into RL.)
The central problem with RL isn't that it is vulnerable to wireheading (the "obvious problem"), or that it's going to make a very detailed model of the world. Wireheading on its own (with e.g. a myopic ...
You should apply to Anthropic. If you’re writing ML software at semi-FAANG, they probably want to interview you ASAP. https://www.lesswrong.com/posts/YDF7XhMThhNfHfim9/ai-safety-needs-great-engineers
The compensation is definitely enough to take care of your family and then save some money!
Good find! Yeah, this is a good explanation for learning, and the NFL razor does not discard it. I think that almost no deep learning professor believes the bad explanation that “deep learning works because NNs are universal approximators”. But it’s more common with students and non-experts (I believed it for a while!)
Getting more value-aligned people in the AIS community onto the safety teams of DeepMind and OpenAI
Why is this important? As far as I can tell, the safety teams of these two organisations are already almost entirely "value-aligned people in the AIS community". They need more influence within the organisation, sure, but that's not going to be solved by altering team composition.
I'd argue the world right now (or rather, life on earth) is super bad because it's dominated by animal suffering
I agree with this, and the overall history of the world has definitely been, on balance, one of extreme suffering.
For farmed animals in particular, we don't need AGI to end their plight. Just regular economic growth and advocacy will do.
Also, given how much time we've been suffering already, and how much is at stake; would it be so bad to delay AGI by 100 or 200 years? We can do a lot of alignment research in that time.
I'm curious what you mean, but I don't entirely understand. If you give me a text representation of the level I'll run it! :) Or you can do so yourself
Here's the text representation for level 53