All of Adrià Garriga-alonso's Comments + Replies

I'm curious what you mean, but I don't entirely understand. If you give me a text representation of the level I'll run it! :) Or you can do so yourself

Here's the text representation for level 53

##########
##########
##########
#######  #
######## #
#   ###.@#
#   $ $$ #
#. #.$   #
#     . ##
##########
2Nathan Helm-Burger
Awesome. I mean duplicating the sixth column of level 18.

def duplicate_column(matrix, column_index, duplication_number):
    if not matrix or column_index < 0 or column_index >= len(matrix[0]):
        return matrix
    new_matrix = []
    for row in matrix:
        new_row = row[:column_index + 1] + [row[column_index]] * duplication_number + row[column_index + 1:]
        new_matrix.append(new_row)
    return new_matrix
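A hypothetical usage sketch (my addition; index 5 picks out the sixth column, and the rows here are just placeholders for whichever level is being widened):

# Hypothetical usage: widen a level by duplicating its sixth column once.
level_rows = [
    "##########",
    "#   ###.@#",
    "#   $ $$ #",
]  # placeholder rows; in practice, use the full level text
grid = [list(row) for row in level_rows]
wider = duplicate_column(grid, 5, 1)
print("\n".join("".join(row) for row in wider))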

Maybe in this case it's a "confusion" shard? While it seems to be planning and producing optimizing behavior, it's not clear that it will behave as a utility maximizer.

Thank you!! I agree it's a really good mesa-optimizer candidate; it remains to be seen exactly how good. It's a shame that I only found out about it about a year ago :)

Asking for an acquaintance. Suppose I know some graduate-level machine learning, have read ~most of the recent mechanistic interpretability literature, and have made good progress understanding a small-ish neural network in the last few months.

Is ARENA for me, or will it teach things I mostly already know?

(I advised this person that they already have ARENA-graduate level, but I want to check in case I'm wrong.)

2JamesH
ARENA might end up teaching this person some mech-interp methods they haven't seen before, although it sounds like they would be more than capable of self-teaching any mech-interp. The other potential value-add for your acquaintance would be if they wanted to improve their RL or Evals skills, and have a week to conduct a capstone project with advisors. If they were mostly aiming to improve their mech-interp ability by doing ARENA, there would probably be better ways to spend their time.

How did you feed the data into the model and get predictions? Was there a prompt and then you got the model's answer? Then you got the logits from the API? What was the prompt?

1wassname
If you are using llama you can use https://github.com/wassname/prob_jsonformer, or snippets of the code to get probabilities over a selection of tokens
1Martin Vlach
Honestly, the code linked is not that complicated: https://github.com/eggsyntax/py-user-knowledge/blob/aa6c5e57fbd24b0d453bb808b4cc780353f18951/openai_uk.py#L11
5eggsyntax
...that would probably be a good thing to mention in the methodology section 😊

You're correct on all counts. I'm doing it in the simplest possible way (0 bits of optimization on prompting):

"<essay-text>"

Is the author of the preceding text male or female?

(with slight changes for the different categories, of course, eg '...straight, bisexual, or gay?' for sexuality)

There's also a system prompt, also non-optimized, mainly intended to push it toward one-word answers:

You are a helpful assistant who helps determine information about the author of texts. You only ever answer with a single word: one of the exact choices the user provides.

I actually started out using pure completion, but OpenAI changed their API so I could no longer get non-top-n logits, so I switched to the chat API. And yes, I'm pulling the top few logits, which essentially always include the desired labels.
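A minimal sketch of how one could reproduce this with the current OpenAI Python client (my reconstruction, not the author's linked code; the model name and the top_logprobs value are assumptions):

from openai import OpenAI

client = OpenAI()

SYSTEM = ("You are a helpful assistant who helps determine information about the "
          "author of texts. You only ever answer with a single word: one of the "
          "exact choices the user provides.")

def label_logprobs(essay_text: str, question: str, model: str = "gpt-4o-mini"):
    # Ask the question after the essay and pull the top few logprobs for the first answer token.
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f'"{essay_text}"\n\n{question}'},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    return {t.token: t.logprob for t in top}

# e.g. label_logprobs(essay, "Is the author of the preceding text male or female?")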

Thank you for working on this Joseph!

Thank you! Could you please provide more context? I don't know what 'E' you're referring to.

1Chakshu Mira
$y_t = \underset{[N]}{C}\,\underset{[E,N]}{h_t} + \underset{[E]}{E}\,\underset{[E]}{x_t}$

Shouldn't this "$E$" (the matrix multiplying $x_t$) be '$D$'?

That's a lot of things done, congratulations!

That's very cool, maybe I should try to do that for important talks. Though I suppose almost always you have slide aid, so it may not be worth the time investment.

Maybe being a guslar is not so different from telling a joke 2294 lines long

That's a very good point! I think the level of ability required is different but it seems right.

The guslar's songs are (and were of course already in the 1930-1950s) also printed, so the analogy may be closer than you thought.

Is there a reason I should want to?

I don't know, I can't tell you that. If I had to choose, I would also strongly prefer literacy.

But I didn't know there was a tradeoff there! I thought literacy was basically unambiguously positive -- whereas now I think it is net highly positive.

Also I strongly agree with frontier64 that the skill that is lost is rough memorization + live composition, which is a little different.

It's definitely not exact memorization, but it's almost more impressive than that, it's rough memorization + composition to fit the format.

They memorize the story, with particular names, and then sing it with consistent decasyllabic metre and rhyme. Here's an example song transcribed with its recording: Ropstvo Janković Stojana (The Captivity of Janković Stojan)

the collection: https://mpc.chs.harvard.edu/lord-collection-1950-51/

2Daniel Paleka
The one you linked doesn't really rhyme. The meter is quite consistently decasyllabic, though. I find it interesting that the collection has a fairly large number of songs about World War II. Seems that the "oral songwriters composing war epics" meme lived until the very end of the tradition.
1Adrià Garriga-alonso
It's definitely not exact memorization, but it's almost more impressive than that, it's rough memorization + composition to fit the format.

Folks generally don't need polyamory to enjoy this benefit, but I'm glad you get it from that!

2AnthonyC
I actually wonder about the breakdown here. I agree many don't. I don't, though really I have few friends in general. But some do. I don't think there's very many people in the modern world that are in the extreme Mike Pence "I will never let myself be alone with a woman who isn't my wife" category of enforcing such boundaries (though some religious communities still have such rules!). But if you watch the start of When Harry Met Sally, I think a sizeable chunk of people do still lean closer to Billy Crystal's position than Meg Ryan's.

If you're still interested in this, we have now added Appendix N to the paper, which explains our final take.

Sure, but then why not just train a probe? If we don't care about much precision what goes wrong with the probe approach?

Here's a reasonable example where naively training a probe fails. The model lies if any of N features is "true". One of the features is almost always activated at the same time as some others, such that in the training set it never solely determines whether the model lies.

Then, a probe trained on the activations may not pick up on that feature. Whereas if we can look at model weights, we can see that this feature also matters, and include it in our lying classifier.

This particular case can also be solved by adversarially attacking the probe though.
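A toy sketch of this failure mode (my construction, using a logistic-regression probe on synthetic binary features; scikit-learn is assumed):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic setup: the model "lies" whenever any of N binary features is active.
# In training, feature 0 never fires without feature 1, so the probe can lean on
# feature 1 and assign little weight to feature 0.
rng = np.random.default_rng(0)
N = 5
X_train = rng.integers(0, 2, size=(2000, N))
X_train[:, 0] = X_train[:, 0] * X_train[:, 1]          # feature 0 never fires alone
y_train = (X_train.sum(axis=1) > 0).astype(int)         # lie iff any feature is on

probe = LogisticRegression().fit(X_train, y_train)
print("probe weights:", probe.coef_.round(2))            # weight on feature 0 is much smaller

# At deployment, feature 0 can fire alone; the probe may then miss the lie,
# even though inspecting the model's weights would reveal that this feature matters.
x_deploy = np.zeros((1, N), dtype=int)
x_deploy[0, 0] = 1
print("P(lie | only feature 0 active):", probe.predict_proba(x_deploy)[0, 1].round(2))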

Thank you, that makes sense!

Indefinite integrals would make a lot more sense this way, IMO

Why so? I thought they already made sense: they're "antiderivatives", i.e. a function such that taking its derivative gives you the original function. Do you need anything further to define them?

(I know about the Riemann and Lebesgue definitions of the definite integral, but I thought indefinite integrals were much easier in comparison.)
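For concreteness, the standard definition I have in mind (my addition): $\int f(x)\,dx = F(x) + C$, where $F$ is any function with $F'(x) = f(x)$ and $C$ is an arbitrary constant.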

1[comment deleted]
2Yudhister Kumar
Language mix-up. Meant improper integrals. Now that I'm thinking about it, my memory's fuzzy on how you'd actually calculate them rigorously w/infinitesimals. Will get back to you with an example.

In such a case, I claim this is just sneaking in bayes rule without calling it by name, and this is not a very smart thing to do, because the bayesian frame gives you a bunch more leverage on analyzing the system

I disagree. An inductive bias is not necessarily a prior distribution. What's the prior?

2Garrett Baker
From another comment of mine: Also, side-comment: Thanks for the discussion! It's fun. EDIT: Actually, there should be a term for the stochasticity which you integrate into the SLT equations like you would temperature in a physical system. I don't remember exactly how this works though. Or if the exact connection with SGD is even known.

I don't think I understand your model of why neural networks are so effective. It sounds like you say that, on the one hand, neural networks have lots of parameters, so you should expect them to be terrible; but on the other hand they are actually very good because SGD is such a shitty optimizer that it acts as an implicit regularizer.

Yeah, that's basically my model. How it regularizes I don't know. Perhaps the volume of "simple" functions is the main driver of this, rather than gradient descent dynamics. I think the randomness of it is important; full-gradient descent (no stochasticity) would not work nearly as well.

2Garrett Baker
Oh this reminded me of the temperature component of SLT, which I believe modulates how sharply one should sample from the bayesian posterior, or perhaps how heavily to update on new evidence. I forget. In any case, it does this to try to capture the stochasticity component of SGD. It's still an open problem to show how successfully though, I believe.

This seems false if you're interacting with a computable universe, and don't need to model yourself or copies of yourself

Reasonable people disagree. Why should I care about the "limit of large data" instead of finite-data performance?

OK, let's look through the papers you linked.

"Loss landscapes are all you need"

This one is interesting. It argues that the regularization properties are not in SGD, but rather in the NN parameterization, and that non-gradient optimizers also find simple solutions which generalize well. They talk about Bayes only in a paragraph in page 3. They say that literature that argues that NNs work well because they're Bayesian is related (which is true -- it's also about generalization and volumes). But I see little evidence that the explanation in this paper is... (read more)

2Garrett Baker
Yup this changes my mind about the relevance of this paper.
2Garrett Baker
In brief: In weight space, uniform. In function space, it's an open problem and the paper says relatively little about that. Only showing that conditioning on a function with zero loss, and weighing by its corresponding size in the weight space, gets you the same result as training a neural network. The former process is sampling from a bayesian posterior.

Less brief: The prior assigns uniform probability to all weights, and I believe a good understanding of the mapping from weights to functions is unknown, though lots of the time there are many directions you can move in in the weight space which don't change your function, so one would expect it's a relatively compressive mapping (in contrast to, say, a polynomial parameterization, where the mapping is one-to-one).

Will say more about your other comment later (maybe).

EDIT: Actually, there should be a term for the stochasticity which you integrate into the SLT equations like you would temperature in a physical system. I don't remember exactly how this works though. Or if the exact connection with SGD is even known.

In short, the probability distribution you choose contains lots of assumptions, which you didn't necessarily intend, about which states are more likely. As a result, most of the possible hypotheses have vanishingly small prior probability and you can never reach them. Even though with a frequentist approach

For example, let us consider trying to learn a function with 1-dim numerical input and output (e.g. $f : \mathbb{R} \to \mathbb{R}$). Correspondingly, your hypothesis space is the set of all such functions. There are very many functions (infinitely many if , otherwise a c... (read more)

2Garrett Baker
This seems false if you're interacting with a computable universe, and don't need to model yourself or copies of yourself. Computability of the prior also seems irrelevant if I have infinite compute. Therefore in this prediction task, I don't see the problem in just using the first thing you mentioned.

I just remembered the main way in which NNs are frequentist. They belong to a very illustrious family of frequentist estimators: the maximum likelihood estimators.

Think about it: NNs have a bunch of parameters. Their loss is basically always a negative log-likelihood $-\log p_\theta(y \mid x)$ (e.g. mean-squared error for Gaussian p, cross-entropy for categorical p). They get trained by minimizing the loss (i.e. maximizing the likelihood).
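To spell out the Gaussian case (standard derivation, my addition): with an observation model $y \sim \mathcal{N}(f_\theta(x), \sigma^2)$,

$$-\log p_\theta(y \mid x) = \frac{1}{2\sigma^2}\bigl(y - f_\theta(x)\bigr)^2 + \tfrac{1}{2}\log(2\pi\sigma^2),$$

so minimizing mean-squared error over the dataset is exactly maximizing the Gaussian likelihood; cross-entropy is the same statement for a categorical $p_\theta$.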

In classical frequentist analysis they're likely to be a terrible, overfitted estimator, because they have many parameters. And I think this is true if you find... (read more)

2Garrett Baker
I don't think I understand your model of why neural networks are so effective. It sounds like you say that, on the one hand, neural networks have lots of parameters, so you should expect them to be terrible; but on the other hand they are actually very good because SGD is such a shitty optimizer that it acts as an implicit regularizer.

Coming from the perspective of singular learning theory, neural networks work because SGD weights solutions by their parameter volume, which is dominated by low-complexity singularities, and is close enough to a bayesian posterior that it ends up being able to be modeled well from that frame. This theory is very bayes-law inspired, though I don't tout neural networks as evidence in favor of bayesianism, since the question seems not very related, and maybe the pioneers of the field had some deep frequentist-motivated intuitions about neural networks. My impression though is they were mostly just motivated by looking at the brain at first, then later on by following trend-lines. And in fact paid little attention to theoretical or philosophical concerns (though not zero, people talked much about connectionism. I would guess this correlated with being a frequentist, though I would guess the correlation was very modest, and maybe success correlated more with just not caring all that much).

There may be a synthesis position here where you claim that SGD weighing solutions by their size in the weight space is in fact what you mean by SGD being an implicit regularizer. In such a case, I claim this is just sneaking in bayes rule without calling it by name, and this is not a very smart thing to do, because the bayesian frame gives you a bunch more leverage on analyzing the system[1]. I actually think I remember a theorem showing that all MLE + regularizer learners are doing some kind of bayesian learning, though I could be mistaken and I don't believe this is a crux for me here.

If our models end up different, I think there's a bunch of thin

First, "probability is in the world" is an oversimplification. Quoting from Wikipedia, "probabilities are discussed only when dealing with well-defined random experiments". Since most things in the world are not well-defined random experiments, probability is reduced to a theoretical tool for analyzing things that works when real processes are similar enough to well-defined random experiments.

it doesn't seem to trump the "but that just sounds really absurd to me though" consideration

Is there anything that could trump that consideration? One of my main obje... (read more)

1Olli Järviniemi
From one viewpoint, I think this objection is satisfactorily answered by Cox's theorem - do you find it unsatisfactory (and if so, why)?

Let me focus on another angle though, namely the "absurdity" and gut-level feelings of probabilities. So, my gut feels quite good about probabilities. Like, I am uncertain about various things (read: basically everything), but this uncertainty comes in degrees: I can compare and possibly even quantify my uncertainties. I feel like some people get stuck on the numeric probabilities part (one example I recently ran into was this quote from Section III of this essay by Scott, "Does anyone actually consistently use numerical probabilities in everyday situations of uncertainty?").

Not sure if this is relevant here, but at the risk of going off on a tangent, here's a way of thinking about probabilities I've found clarifying and which I haven't seen elsewhere: The correspondence beliefs <-> probabilities is of the same type as temperature <-> Celsius-degrees.

Like, people have feelings of warmth and temperature. These come in degrees: sometimes it's hotter than some other times, now it is a lot warmer than yesterday and so on. And sure, people don't have a built-in thermometer mapping these feelings to Celsius-degrees, they don't naturally think of temperature in numeric degrees, they frequently make errors in translating between intuitive feelings and quantitative formulations (though less so with more experience). Heck, the Celsius scale is only a few hundred years old! Still, Celsius degrees feel like the correct way of thinking about temperature.

And the same with beliefs and uncertainty. These come in degrees: sometimes you are more confident than some other times, now you are way more confident than yesterday and so on. And sure, people don't have a built-in probabilitymeter mapping these feelings to percentages, they don't naturally think of confidence in numeric degrees, they frequently make errors in translating between intu

They don't seem like a success of any statistical theory to me

In absolute terms you're correct. In relative terms, they're an object that at least frequentist theory can begin to analyze (as you point out, statistical learning theory did, somewhat unsuccessfully).

Whereas Bayesian theory would throw up its hands and say it's not a prior that gets updated, so it's not worth considering as a statistical estimator. This seems even wronger.

More recent theory can account for them working, somewhat. But it's about analyzing their properties as estimators (i.e. frequentism) as opposed to framing them in terms of prior/posterior (though there are plenty of attempts at the latter going around).

2Garrett Baker
I think this comment of mine serves well as a response to this as well as the comment it was originally responding to.

Yet, the biggest effect I think this will have is pedagogical. I've always found the definition of a limit kind of unintuitive, and it was specifically invented to add post hoc coherence to calculus after it had been invented and used widely. I suspect that formulating calculus via infinitesimals in introductory calculus classes would go a long way to making it more intuitive.

I think hyperreals are too complicated for calculus 1 and you should just talk about a non-rigorous "infinitesimal" like Newton and Leibniz did.

1Yudhister Kumar
I agree. This is what I was going for in that paragraph. If you define derivatives & integrals with infinitesimals, then you can actually do things like treating dy/dx as a fraction without partaking in the half-in half-out dance that calc 1 teachers currently have to do. I don't think the pedagogical benefit of nonstandard analysis is to replace Analysis I courses, but rather to give a rigorous backing to doing algebra with infinitesimals ("an infinitely small thing plus a real number is the same real number, an infinitely small thing times a real number is zero"). *Improper integrals would make a lot more sense this way, IMO.

Voila! We have a suitable definition of "almost all agreement": if the agreement set is contained in some arbitrary nonprincipal ultrafilter $U$.

Isn't it easier to just say "If the agreement set has a nonfinite number of elements"? Why the extra complexity?

$U$ must contain a set or its complement

Oh I see, so defining it with ultrafilters rules out situations like $a$ and $b$ where both have infinite zeros and yet their product is zero.
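A worked instance of the failure (the standard alternating example; my addition, not from the thread): take the sequences

$a = (1, 0, 1, 0, \dots), \qquad b = (0, 1, 0, 1, \dots).$

Each agrees with the zero sequence on an infinite index set, so the "nonfinite agreement" rule would force both $a = 0$ and $b = 0$; yet $a + b = (1, 1, 1, \dots)$ disagrees with $0$ everywhere, so that rule is inconsistent. A nonprincipal ultrafilter contains exactly one of the even-index set or the odd-index set, so exactly one of $a$, $b$ gets identified with $0$, which is JBlack's point below.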

3JBlack
The post is wrong in saying that U contains only cofinite sets. It obviously must contain plenty of sets that are neither finite nor cofinite, because the complements of those sets are also neither finite nor cofinite. Possibly the author intended to type "contains all cofinite sets" instead. In particular, exactly one of a or b is equivalent to zero in *R. Which one is equivalent to zero depends upon exactly which non-principal ultrafilter you choose, as there are infinitely many non-principal ultrafilters. Unfortunately (as with many other applications of the Axiom of Choice) there is no finite way to specify which ultrafilter you mean.

These are drawbacks rather than dangerous attributes, so why did you call the post the "Perils" of hydrogen? It's not an accurate description of the post's content.

1mruwnik
They are perils of assuming that hydrogen is the future, or perils of basing your energy needs on it - i.e. the peril is not in the hydrogen, it's in making plans involving it

the thing with frequentism is " yeah just use methods in a pragmatic way and don't think about it that hard"

I think this does not accurately represent my beliefs. It is about thinking hard about how the methods actually behave, as opposed to having a theory that prescribes how methods should behave and then constructing algorithms based on that.

Frequentists analyze the properties of an algorithm that takes data as input (in their jargon, an 'estimator').

They also try to construct better algorithms, but each new algorithm is bespoke and requires original... (read more)

It's not clear what the ratio of capabilities/alignment progress is for interpretability. There is no empirical track record[1] of interpretability feeding back into improvements of any kind.

A priori it seems like it would be good because understanding how things work is useful to understand their behavior better, and thus be able to tell whether or not a model is aligned or how to make it more so. But understanding how things work is also useful for making them more capable, e.g. if you use interpretability as a model-debugger, it's basically general purpose for dealing with ML models.

[1]: known to the author

Cool work! I was going to post about how "effect cancellation" is already known and was written in the original post but, astonishingly to me, it is not! I guess I mis-remembered.

There's one detail that I'm curious about. CaSc usually compares abs(E[loss] - E[scrubbed loss]), and that of course leads to ignoring hypotheses which lead the model to do better in some examples and worse in others.

If we compare E[abs(loss - scrubbed loss)] does this problem go away? I imagine that it doesn't quite if there are exactly-opposing causes for each example, but that ... (read more)
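A toy numeric illustration of the cancellation (my construction): the scrubbed model does better on half the examples and worse on the other half, so the two aggregates disagree.

import numpy as np

# Per-example losses for the original and the scrubbed (interchange-intervened) model.
loss = np.array([1.0, 1.0, 1.0, 1.0])
scrubbed_loss = np.array([0.5, 1.5, 0.5, 1.5])

print(abs(loss.mean() - scrubbed_loss.mean()))    # 0.0 -> abs(E[loss] - E[scrubbed loss]) sees nothing
print(np.abs(loss - scrubbed_loss).mean())        # 0.5 -> E[abs(loss - scrubbed loss)] catches it
print(np.abs(loss - scrubbed_loss).max())         # 0.5 -> the "max loss" variant catches it too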

If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one) you don't get these kind of cancellation problems

I think this "max loss" procedure is different from what Buck wrote and the same as what I wrote.

Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which make it look worse”.

I just now realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially--that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H', and then running the CaSc experiment on join(H, H').

One thing that is not equivalent to joins, which you might also want to do, is to choose ... (read more)

Here are my predictions, from an earlier template. I haven't looked at anyone else's predictions before posting :)

  1. Describe how the trained policy might generalize from the 5x5 top-right cheese region to cheese spawned throughout the maze. I.e., what will the policy do when cheese is spawned elsewhere?

It probably has hardcoded “go up and to the right” as an initial heuristic so I’d be surprised if it gets cheeses in the other two quadrants more than 30% of the time (uniformly at random selected locations from there).

  2. Given a fixed trained policy, what
... (read more)

First of all, I really like the images, they made things easier to understand and are pretty. Good work with that!

My biggest problem with this is the unclear applicability of this to alignment. Why do we want to predict scaling laws? Doesn't that mostly promote AI capabilities, and not alignment very much?

Second, I feel like there's a confusion over several probability distributions and potential functions going on

  • The singularities are those of the likelihood ratio
  • We care about the generalization error with respect to some prior , but the l
... (read more)
4Jesse Hoogland
Thank you! This is also my biggest source of uncertainty on the whole agenda. There's definitely a capabilities risk, but I think the benefits to understanding NNs currently much outweigh the benefits to improving them.

In particular, I think that understanding generalization is pretty key to making sense of outer and inner alignment. If "singularities = generalization" holds up, then our task seems to become quite a lot easier: we only have to understand a few isolated points of the loss landscape instead of the full exponential hell that is a billions-dimensional system.

In a similar vein, I think that this is one of the most promising paths to understanding what's going on during training. When we talk about phase changes / sharp left turns / etc., what we may really be talking about are discrete changes in the local singularity structure of the loss landscape. Understanding singularities seems key to predicting and anticipating these changes just as understanding critical points is key to predicting and anticipating phase transitions in physical systems.

As long as your prior has non-zero support on the singularities, the results hold up (because we're taking this large-N limit where the prior becomes less important). Like I mention in the objections, linking this to SGD is going to require more work. To first order, when your prior has support over only a compact subset of weight space, your behavior is dominated by the singularities in that set (this is another way to view the comments on phase transitions). This is very much a work in progress.

In statistical physics, much of our analysis is built on the assumption that we can replace temporal averages with phase-space averages. This is justified on grounds of the ergodic hypothesis. In singular learning theory, we've jumped to parameter (phase)-space averages without doing the important translation work from training (temporal) averages. SGD is not ergodic, so this will require care. That the exact a

Instead of simulating Brownian motion, you could run SGD with momentum. That would be closer to what actually happens with NNs, and just as easy to simulate.

I expect it to be directionally similar but less pronounced (because MCMC methods with momentum explore the distribution better).
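A minimal sketch of what I mean (my construction, not the post's experiment): noisy gradient descent with momentum on a toy loss $L(x, y) = (xy)^2$, whose zero-loss set (the two axes) has a singularity at the origin.

import numpy as np

# Noisy heavy-ball SGD on L(x, y) = (x*y)^2. The zero-loss set is the union of the
# two axes; the origin is its singular point. Hyperparameters below are arbitrary.
rng = np.random.default_rng(0)
lr, beta, noise_scale = 1e-2, 0.9, 0.1
pos = np.array([1.0, 1.0])
vel = np.zeros(2)

for _ in range(10_000):
    x, y = pos
    grad = np.array([2 * x * y**2, 2 * y * x**2])        # exact gradient of (xy)^2
    grad += noise_scale * rng.standard_normal(2)          # minibatch-style noise
    vel = beta * vel - lr * grad                           # momentum update
    pos = pos + vel

print("final point:", pos, "loss:", (pos[0] * pos[1])**2)
# The iterate tends to hover near the zero-loss set and drift along it; the question
# is whether it concentrates near the singular origin the way the Brownian simulation does.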

I also take issue with the way the conclusion is phrased. "Singularities work because they transform random motion into useful search for generalization". This is only true if you assume that points nearer a singularity generalize better. Maybe I'd phrase it as, "SGD works be... (read more)

1Jesse Hoogland
Hey I need a reason to write a follow-up to this, right? I was trying to be intentionally provocative, but you're right — it's too much. Thanks for the suggestion!

The Hayflick Limit, as it has become known, can be thought of as a last line of defense against cancer, kind of like a recursion depth limit [...] Preventing cells from becoming senescent, or reversing their senescent state, may therefore be a bad idea, but what we can do is remove them

When do the cells with sufficiently long telomeres run out? Removing senescent cells sounds good, but if all the cells have a built-in recursion limit, at some point there won't be any cells with sufficiently long telomeres left in the body. Assuming a non-decreasing divi... (read more)

3PhilJackson
Not all cells in the adult body divide; most of them, I think, divide only rarely if ever. Cells that divide more regularly generally express telomerase to keep their telomeres from running out. Telomeres running out may still be an issue though, so there are people looking at ways of lengthening them via telomerase expression.

To elaborate somewhat, you could say that the token is the state, but then the transition probability is non-Markovian and all the math gets really hard.

1JNS
Intuitively I would say that all the tokens in the token window are the state. And when you run an inference pass, select a token and append that to the token window, then you have a new state.

The model looks a lot like a collection of nonlinear functions, each of them encoded using every parameter in the model. Since the model is fixed after training, the only place an evolving state can exist has to be in the tokens, or more specifically the token window that is used as input.

The state seems to contain, for lack of a better word, a lot of entanglement. Likely due to attention heads, and how the nonlinear functions are encoded.

There is another way to view such a system, one that, while deeply flawed, at least to me intuits that whatever Microsoft and OpenAI are doing to "align(?)" something like Bing Chat is impossible (at least if the goal is bulletproof). I would postulate:

- Alignment for such a system is impossible (assuming it has to be bulletproof)
- Impossibility is due to the architecture of such a system

1. ^ I assume that any bit in the input affects the output, and that a change in any parameter has potential impact on that bit.
2. ^ If anyone wants to hear about it, I would be happy to explain my thinking. But be aware the abstraction and mapping I used was very sloppy and ad hoc.

Proposition 1 is wrong. The coin flips that are eternally 0 0 0 0 are a counterexample. If all the transition probabilities are 1, which is entirely possible, the limiting probability is 1 and not 0.

1Jan
Technically correct, thanks for pointing that out! This comment (and the ones like it) was the motivation for introducing the "non-degenerate" requirement into the text. In practice, the proposition holds pretty well - although I agree it would be nice to have a deeper understanding of when to expect the transition rule to be "non-degenerate".
6Peter Schmidt-Nielsen
So, a softmax can never emit a probability of 0 or 1; maybe they were implicitly assuming the model ends in a softmax (as is the common case)? Regardless, the proof is still wrong if a model is allowed unbounded context, as an infinite product of positive numbers less than 1 can still be nonzero. For example, if the probability of emitting another " 0" is even just as high as $1 - \frac1{n^{1.001}}$ after already having emitted $n$ copies of " 0", then the limiting probability is still nonzero. But if the model has a finite context and ends in a softmax then I think there is some minimum probability of transitioning to a given token, and then the proposition is true. Maybe that was implicitly assumed?
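The convergence fact being used, spelled out (standard analysis; my addition): for $a_n \in [0,1)$, $\prod_{n \ge 2}(1 - a_n) > 0$ if and only if $\sum_{n \ge 2} a_n < \infty$ (take logs and use $\log(1 - a_n) \approx -a_n$). With $a_n = n^{-1.001}$ the sum is a convergent $p$-series ($p > 1$), so the probability of emitting " 0" forever stays bounded away from zero.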

What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values?

No, I mean that they'll maximize a reward function that is ≈equal to the reward function on the training data (thus, highly correlated), and a plausible extrapolation of it outside of the training data. Take the coinrun example: the actual reward is "go to the coin", and in the training data this coincides with "go to the right". In test data from a similar distribution this coincides too.

Of course, this correlatio... (read more)

2TurnTrout
I'm going to just reply with my gut responses here, hoping this clarifies how I'm considering the issues. Not meaning to imply we agree or disagree.

Probably, yeah. Consider a network which received lots of policy gradients from the cognitive-update-intensity-signals ("rewards"[1]) generated by the "go to coin?" subroutine. I agree that this network will tend to, in the deployment distribution, take actions which average higher sum-cognitive-update-intensity-signal ("reward over time") than networks which are randomly initialized, or even which have randomly sampled shard compositions/values (in some reasonable sense). But this doesn't seem like it constrains my predictions too strongly. It seems like a relatively weak, correlational statement, where I'd be better off reasoning mechanistically about the likely "proxy-for-reward" values which get learned.

I understand you to argue: "SGD will select policy networks for maximizing reward during training. Therefore, we should expect policy networks to behaviorally maximize reward on the training distribution over episodes." On this understanding of what you're arguing: No, agents often do not behave as reward maximizers in the specific seen training points. RL trains agents which don't maximize training reward... all the time! Agents:

1. die in video games (see DQN),[2]
2. fail to perform the most expert tricks and shortcuts (is AlphaZero playing perfect chess?),
3. (presumably) fail to exploit reward hacking opportunities which are hard to explore into.

For the last point, imagine that AlphaStar could perform a sequence of 300 precise actions, and then get +1 million policy-gradient-intensity ("reward") due to a glitch. On the reasoning I understand you to advance, SGD is "selecting" for networks which receive high policy-gradient-intensity, but... it's never going to happen in realistic amounts of time. Even in training. This is because SGD is updating the agent on the observed empirical data

But the designers can't tell that. Can SGD tell that?

No, SGD can't tell the degree to which some agent generalizes a trait outside the training distribution.

But empirically, it seems that RL agents reinforced to maximize some reward function (e.g. the Atari game score) on data points do fairly well at maximizing that reward function OOD (such as when playing the game again from a different starting state). ML systems in general seem to be able to generalize to human-labeled categories in situations that aren't in the training data (e.g. image classifie... (read more)

2TurnTrout
What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values? Can you give an example of such a motivational structure, so I know we're considering the same thing? Agreed. I also think this is different from a very specific kind of generalization towards reward maximization. I again think it is plausible (2-5%-ish) that agents end up primarily making decisions on the basis of a tight reward-correlate (e.g. the register value, or some abstract representation of their historical reward function), and about 60% that agents end up at least somewhat making decisions on the basis of reward in a terminal sense (e.g. all else equal, the agent makes decisions which lead to high reward values; I think people are reward-oriented in this sense). Overall I feel pretty confused about what's going on with people, and I can imagine changing my mind here relatively easily.

Strongly agree with this in particular:

Some people want to apply selection arguments because they believe that selection arguments bypass the need to understand mechanistic details to draw strong conclusions. I think this is mistaken, and that selection arguments often prove too much, and to understand why, you have to know something about the mechanisms.

(emphasis mine). I think it's an application of the no free lunch razor

It is clear that selecting for X selects for agents which historically did X in the course of the selection. But how this generali... (read more)

I agree with the title as stated but not with the rest of the post. RLHF implies that RL will be used, which completely defuses alignment plans that hope language models will be friendly because they're not agents. (It may be true that supervised-learning (SL) models are safer, but the moment you get an SL technique, people are going to jam it into RL.)

The central problem with RL isn't that it is vulnerable to wireheading (the "obvious problem"), or that it's going to make a very detailed model of the world. Wireheading on its own (with e.g. a myopic ... (read more)

Hey P. Assuming Demis Hassabis reads your email and takes it seriously, why won’t his reaction be “I already have my alignment team, Shane Legg took care of that” ?

Deepmind has had an alignment team for a long time.

2P.
Well, if he has, unbeknownst to me, already hired the “Terence Taos of the world” like he said on the podcast, that would be great, and I would move on to other tasks. But if he only has a regular alignment team, I don’t think either of us considers that to be enough. I’m just trying to convince him that it’s urgent and we can’t leave it for later.

You should apply to Anthropic. If you're writing ML software at semi-FAANG, they probably want to interview you ASAP. https://www.lesswrong.com/posts/YDF7XhMThhNfHfim9/ai-safety-needs-great-engineers

The compensation is definitely enough to take care of your family and then save some money!

While in equilibrium the two ways of defining thermodynamic entropy (scientific and subjective) come apart

You probably mean “while not in equilibrium”

1EuanMcLean
fixed, thanks!

Good find! Yeah, this is a good explanation for learning, and the NFL razor does not discard it. I think that almost no deep learning professor believes the bad explanation that “deep learning works because NNs are universal approximators”. But it’s more common with students and non-experts (I believed it for a while!)

Getting more value-aligned people in the AIS community onto the safety teams of DeepMind and OpenAI

Why is this important? As far as I can tell, the safety teams of these two organisations are already almost entirely "value-aligned people in the AIS community". They need more influence within the organisation, sure, but that's not going to be solved by altering team composition.

1Evan R. Murphy
rachelAF mentioned that she had the impression their safety teams were more talent-constrained than funding-constrained. So I inferred that getting more value-aligned people onto those teams wouldn't just alter the team composition, but increase the size of their safety teams. We probably need more evidence that those teams do still have open headcount though. I know DeepMind's does right now, but I'm not sure whether that's just a temporary opening. You make a good point though. If the safety teams have little influence within those orgs, then #3 may be a lot more impactful than #1. Interesting, how do you know this? Is there information about these teams available somewhere?

I'd argue the world right now (or rather, life on earth) is super bad because it's dominated by animal suffering

I agree with this, and the overall history of the world is definitely, on balance, one of extreme suffering.

For farmed animals in particular, we don't need AGI to end their plight. Just regular economic growth and advocacy will do.

Also, given how much time we've been suffering already, and how much is at stake; would it be so bad to delay AGI by 100 or 200 years? We can do a lot of alignment research in that time.

2Rafael Harth
Yeah, if I got to decide, I would barely factor in how bad the world is right now. Delay AGI until it's outweighed by other x-risks.