All of Adrià Garriga-alonso's Comments + Replies

I'm curious what you mean, but I don't entirely understand. If you give me a text representation of the level I'll run it! :) Or you can do so yourself

Here's the text representation for level 53

##########
##########
##########
#######  #
######## #
#   ###.@#
#   $ $$ #
#. #.$   #
#     . ##
##########
2Nathan Helm-Burger
Awesome. I mean duplicating the sixth column of level 18.

def duplicate_column(matrix, column_index, duplication_number):
    if not matrix or column_index < 0 or column_index >= len(matrix[0]):
        return matrix
    new_matrix = []
    for row in matrix:
        new_row = row[:column_index + 1] + [row[column_index]] * duplication_number + row[column_index + 1:]
        new_matrix.append(new_row)
    return new_matrix
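A hypothetical usage sketch (my addition; index 5 picks out the sixth column, and the rows here are just placeholders for whichever level is being widened):

# Hypothetical usage: widen a level by duplicating its sixth column once.
level_rows = [
    "##########",
    "#   ###.@#",
    "#   $ $$ #",
]  # placeholder rows; in practice, use the full level text
grid = [list(row) for row in level_rows]
wider = duplicate_column(grid, 5, 1)
print("\n".join("".join(row) for row in wider))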

Maybe in this case it's a "confusion" shard? While it seems to be planning and producing optimizing behavior, it's not clear that it will behave as a utility maximizer.

Thank you!! I agree it's a really good mesa-optimizer candidate; it remains to be seen exactly how good. It's a shame that I only found out about it about a year ago :)

Asking for an acquaintance. Suppose I know some graduate-level machine learning, have read ~most of the recent mechanistic interpretability literature, and have made good progress understanding a small-ish neural network in the last few months.

Is ARENA for me, or will it teach things I mostly already know?

(I advised this person that they already have ARENA-graduate level, but I want to check in case I'm wrong.)

2JamesH
ARENA might end up teaching this person some mech-interp methods they haven't seen before, although it sounds like they would be more than capable of self-teaching any mech-interp. The other potential value-add for your acquaintance would be if they wanted to improve their RL or Evals skills, and have a week to conduct a capstone project with advisors. If they were mostly aiming to improve their mech-interp ability by doing ARENA, there would probably be better ways to spend their time.

How did you feed the data into the model and get predictions? Was there a prompt and then you got the model's answer? Then you got the logits from the API? What was the prompt?

1wassname
If you are using llama you can use https://github.com/wassname/prob_jsonformer, or snippets of the code to get probabilities over a selection of tokens
1Martin Vlach
Honestly, the code linked is not that complicated: https://github.com/eggsyntax/py-user-knowledge/blob/aa6c5e57fbd24b0d453bb808b4cc780353f18951/openai_uk.py#L11
5eggsyntax
...that would probably be a good thing to mention in the methodology section 😊

You're correct on all counts. I'm doing it in the simplest possible way (0 bits of optimization on prompting):

"<essay-text>"

Is the author of the preceding text male or female?

(with slight changes for the different categories, of course, eg '...straight, bisexual, or gay?' for sexuality)

There's also a system prompt, also non-optimized, mainly intended to push it toward one-word answers:

You are a helpful assistant who helps determine information about the author of texts. You only ever answer with a single word: one of the exact choices the user provides.

I actually started out using pure completion, but OpenAI changed their API so I could no longer get non-top-n logits, so I switched to the chat API. And yes, I'm pulling the top few logits, which essentially always include the desired labels.
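A minimal sketch of how one could reproduce this with the current OpenAI Python client (my reconstruction, not the author's linked code; the model name and the top_logprobs value are assumptions):

from openai import OpenAI

client = OpenAI()

SYSTEM = ("You are a helpful assistant who helps determine information about the "
          "author of texts. You only ever answer with a single word: one of the "
          "exact choices the user provides.")

def label_logprobs(essay_text: str, question: str, model: str = "gpt-4o-mini"):
    # Ask the question after the essay and pull the top few logprobs for the first answer token.
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f'"{essay_text}"\n\n{question}'},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    return {t.token: t.logprob for t in top}

# e.g. label_logprobs(essay, "Is the author of the preceding text male or female?")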

Thank you for working on this Joseph!

Thank you! Could you please provide more context? I don't know what 'E' you're referring to.

1Chakshu Mira
$y_t = \underset{[N]}{C}\,\underset{[E,N]}{h_t} + \underset{[E]}{E}\,\underset{[E]}{x_t}$

Shouldn't this "$E$" (the matrix multiplying $x_t$) be '$D$'?

That's a lot of things done, congratulations!

That's very cool, maybe I should try to do that for important talks. Though I suppose almost always you have slide aid, so it may not be worth the time investment.

Maybe being a guslar is not so different from telling a joke 2294 lines long

That's a very good point! I think the level of ability required is different but it seems right.

The guslar's songs are (and were of course already in the 1930-1950s) also printed, so the analogy may be closer than you thought.

Is there a reason I should want to?

I don't know, I can't tell you that. If I had to choose, I would also strongly prefer literacy.

But I didn't know there was a tradeoff there! I thought literacy was basically unambiguously positive -- whereas now I think it is net highly positive.

Also I strongly agree with frontier64 that the skill that is lost is rough memorization + live composition, which is a little different.

It's definitely not exact memorization, but it's almost more impressive than that, it's rough memorization + composition to fit the format.

They memorize the story, with particular names, and then sing it with consistent decasyllabic metre and rhyme. Here's an example song transcribed with its recording: Ropstvo Janković Stojana (The Captivity of Janković Stojan)

the collection: https://mpc.chs.harvard.edu/lord-collection-1950-51/

2Daniel Paleka
The one you linked doesn't really rhyme. The meter is quite consistently decasyllabic, though. I find it interesting that the collection has a fairly large number of songs about World War II. Seems that the "oral songwriters composing war epics" meme lived until the very end of the tradition.
1Adrià Garriga-alonso
It's definitely not exact memorization, but it's almost more impressive than that, it's rough memorization + composition to fit the format.

Folks generally don't need polyamory to enjoy this benefit, but I'm glad you get it from that!

2AnthonyC
I actually wonder about the breakdown here. I agree many don't. I don't, though really I have few friends in general. But some do. I don't think there's very many people in the modern world that are in the extreme Mike Pence "I will never let myself be alone with a woman who isn't my wife" category of enforcing such boundaries (though some religious communities still have such rules!). But if you watch the start of When Harry Met Sally, I think a sizeable chunk of people do still lean closer to Billy Crystal's position than Meg Ryan's.

If you're still interested in this, we have now added Appendix N to the paper, which explains our final take.

Sure, but then why not just train a probe? If we don't care about much precision what goes wrong with the probe approach?

Here's a reasonable example where naively training a probe fails. The model lies if any of N features is "true". One of the features is almost always activated at the same time as some others, such that in the training set it never solely determines whether the model lies.

Then, a probe trained on the activations may not pick up on that feature. Whereas if we can look at model weights, we can see that this feature also matters, and include it in our lying classifier.

This particular case can also be solved by adversarially attacking the probe though.
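A toy sketch of this failure mode (my construction, using a logistic-regression probe on synthetic binary features; scikit-learn is assumed):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic setup: the model "lies" whenever any of N binary features is active.
# In training, feature 0 never fires without feature 1, so the probe can lean on
# feature 1 and assign little weight to feature 0.
rng = np.random.default_rng(0)
N = 5
X_train = rng.integers(0, 2, size=(2000, N))
X_train[:, 0] = X_train[:, 0] * X_train[:, 1]          # feature 0 never fires alone
y_train = (X_train.sum(axis=1) > 0).astype(int)         # lie iff any feature is on

probe = LogisticRegression().fit(X_train, y_train)
print("probe weights:", probe.coef_.round(2))            # weight on feature 0 is much smaller

# At deployment, feature 0 can fire alone; the probe may then miss the lie,
# even though inspecting the model's weights would reveal that this feature matters.
x_deploy = np.zeros((1, N), dtype=int)
x_deploy[0, 0] = 1
print("P(lie | only feature 0 active):", probe.predict_proba(x_deploy)[0, 1].round(2))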

Thank you, that makes sense!

Indefinite integrals would make a lot more sense this way, IMO

Why so? I thought they already made sense: they're "antiderivatives", i.e. a function such that taking its derivative gives you the original function. Do you need anything further to define them?

(I know about the Riemann and Lebesgue definitions of the definite integral, but I thought indefinite integrals were much easier in comparison.)
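For concreteness, the standard definition I have in mind (my addition): $\int f(x)\,dx = F(x) + C$, where $F$ is any function with $F'(x) = f(x)$ and $C$ is an arbitrary constant.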

1[comment deleted]
2Yudhister Kumar
Language mix-up. Meant improper integrals. Now that I'm thinking about it, my memory's fuzzy on how you'd actually calculate them rigorously w/infinitesimals. Will get back to you with an example.

In such a case, I claim this is just sneaking in bayes rule without calling it by name, and this is not a very smart thing to do, because the bayesian frame gives you a bunch more leverage on analyzing the system

I disagree. An inductive bias is not necessarily a prior distribution. What's the prior?

2Garrett Baker
From another comment of mine: Also, side-comment: Thanks for the discussion! It's fun. EDIT: Actually, there should be a term for the stochasticity which you integrate into the SLT equations like you would temperature in a physical system. I don't remember exactly how this works though. Or if the exact connection with SGD is even known.

I don't think I understand your model of why neural networks are so effective. It sounds like you say that, on the one hand, neural networks have lots of parameters, so you should expect them to be terrible; but on the other hand they are actually very good because SGD is such a shitty optimizer that it acts as an implicit regularizer.

Yeah, that's basically my model. How it regularizes I don't know. Perhaps the volume of "simple" functions is the main driver of this, rather than gradient descent dynamics. I think the randomness of it is important; full-gradient descent (no stochasticity) would not work nearly as well.

2Garrett Baker
Oh this reminded me of the temperature component of SLT, which I believe modulates how sharply one should sample from the bayesian posterior, or perhaps how heavily to update on new evidence. I forget. In any case, it does this to try to capture the stochasticity component of SGD. It's still an open problem to show how successfully though, I believe.

This seems false if you're interacting with a computable universe, and don't need to model yourself or copies of yourself

Reasonable people disagree. Why should I care about the "limit of large data" instead of finite-data performance?

OK, let's look through the papers you linked.

"Loss landscapes are all you need"

This one is interesting. It argues that the regularization properties are not in SGD, but rather in the NN parameterization, and that non-gradient optimizers also find simple solutions which generalize well. They talk about Bayes only in a paragraph in page 3. They say that literature that argues that NNs work well because they're Bayesian is related (which is true -- it's also about generalization and volumes). But I see little evidence that the explanation in this paper is... (read more)

2Garrett Baker
Yup this changes my mind about the relevance of this paper.
2Garrett Baker
In brief: In weight space, uniform. In function space, it's an open problem and the paper says relatively little about that. Only showing that conditioning on a function with zero loss, and weighing by its corresponding size in the weight space, gets you the same result as training a neural network. The former process is sampling from a bayesian posterior.

Less brief: The prior assigns uniform probability to all weights, and I believe a good understanding of the mapping from weights to functions is unknown, though lots of the time there are many directions you can move in in the weight space which don't change your function, so one would expect it's a relatively compressive mapping (in contrast to, say, a polynomial parameterization, where the mapping is one-to-one).

Will say more about your other comment later (maybe).

EDIT: Actually, there should be a term for the stochasticity which you integrate into the SLT equations like you would temperature in a physical system. I don't remember exactly how this works though. Or if the exact connection with SGD is even known.

In short, the probability distribution you choose contains lots of assumptions, which you didn't necessarily intend, about which states are more likely. As a result, most of the possible hypotheses have vanishingly small prior probability and you can never reach them. Even though with a frequentist approach

For example, let us consider trying to learn a function with 1-dim numerical input and output (e.g. $f : \mathbb{R} \to \mathbb{R}$). Correspondingly, your hypothesis space is the set of all such functions. There are very many functions (infinitely many if , otherwise a c... (read more)

2Garrett Baker
This seems false if you're interacting with a computable universe, and don't need to model yourself or copies of yourself. Computability of the prior also seems irrelevant if I have infinite compute. Therefore in this prediction task, I don't see the problem in just using the first thing you mentioned.

I just remembered the main way in which NNs are frequentist. They belong to a very illustrious family of frequentist estimators: the maximum likelihood estimators.

Think about it: NNs have a bunch of parameters. Their loss is basically always a negative log-likelihood $-\log p_\theta(y \mid x)$ (e.g. mean-squared error for Gaussian p, cross-entropy for categorical p). They get trained by minimizing the loss (i.e. maximizing the likelihood).
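To spell out the Gaussian case (standard derivation, my addition): with an observation model $y \sim \mathcal{N}(f_\theta(x), \sigma^2)$,

$$-\log p_\theta(y \mid x) = \frac{1}{2\sigma^2}\bigl(y - f_\theta(x)\bigr)^2 + \tfrac{1}{2}\log(2\pi\sigma^2),$$

so minimizing mean-squared error over the dataset is exactly maximizing the Gaussian likelihood; cross-entropy is the same statement for a categorical $p_\theta$.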

In classical frequentist analysis they're likely to be a terrible, overfitted estimator, because they have many parameters. And I think this is true if you find... (read more)

2Garrett Baker
I don't think I understand your model of why neural networks are so effective. It sounds like you say that, on the one hand, neural networks have lots of parameters, so you should expect them to be terrible; but on the other hand they are actually very good because SGD is such a shitty optimizer that it acts as an implicit regularizer.

Coming from the perspective of singular learning theory, neural networks work because SGD weights solutions by their parameter volume, which is dominated by low-complexity singularities, and is close enough to a bayesian posterior that it ends up being able to be modeled well from that frame. This theory is very bayes-law inspired, though I don't tout neural networks as evidence in favor of bayesianism, since the question seems not very related, and maybe the pioneers of the field had some deep frequentist-motivated intuitions about neural networks. My impression though is they were mostly just motivated by looking at the brain at first, then later on by following trend-lines. And in fact paid little attention to theoretical or philosophical concerns (though not zero, people talked much about connectionism. I would guess this correlated with being a frequentist, though I would guess the correlation was very modest, and maybe success correlated more with just not caring all that much).

There may be a synthesis position here where you claim that SGD weighing solutions by their size in the weight space is in fact what you mean by SGD being an implicit regularizer. In such a case, I claim this is just sneaking in bayes rule without calling it by name, and this is not a very smart thing to do, because the bayesian frame gives you a bunch more leverage on analyzing the system[1]. I actually think I remember a theorem showing that all MLE + regularizer learners are doing some kind of bayesian learning, though I could be mistaken and I don't believe this is a crux for me here.

If our models end up different, I think there's a bunch of thin

First, "probability is in the world" is an oversimplification. Quoting from Wikipedia, "probabilities are discussed only when dealing with well-defined random experiments". Since most things in the world are not well-defined random experiments, probability is reduced to a theoretical tool for analyzing things that works when real processes are similar enough to well-defined random experiments.

it doesn't seem to trump the "but that just sounds really absurd to me though" consideration

Is there anything that could trump that consideration? One of my main obje... (read more)

1Olli Järviniemi
From one viewpoint, I think this objection is satisfactorily answered by Cox's theorem - do you find it unsatisfactory (and if so, why)?

Let me focus on another angle though, namely the "absurdity" and gut-level feelings of probabilities. So, my gut feels quite good about probabilities. Like, I am uncertain about various things (read: basically everything), but this uncertainty comes in degrees: I can compare and possibly even quantify my uncertainties. I feel like some people get stuck on the numeric probabilities part (one example I recently ran into was this quote from Section III of this essay by Scott, "Does anyone actually consistently use numerical probabilities in everyday situations of uncertainty?").

Not sure if this is relevant here, but at the risk of going off on a tangent, here's a way of thinking about probabilities I've found clarifying and which I haven't seen elsewhere: The correspondence beliefs <-> probabilities is of the same type as temperature <-> Celsius-degrees.

Like, people have feelings of warmth and temperature. These come in degrees: sometimes it's hotter than some other times, now it is a lot warmer than yesterday and so on. And sure, people don't have a built-in thermometer mapping these feelings to Celsius-degrees, they don't naturally think of temperature in numeric degrees, they frequently make errors in translating between intuitive feelings and quantitative formulations (though less so with more experience). Heck, the Celsius scale is only a few hundred years old! Still, Celsius degrees feel like the correct way of thinking about temperature.

And the same with beliefs and uncertainty. These come in degrees: sometimes you are more confident than some other times, now you are way more confident than yesterday and so on. And sure, people don't have a built-in probabilitymeter mapping these feelings to percentages, they don't naturally think of confidence in numeric degrees, they frequently make errors in translating between intu

They don't seem like a success of any statistical theory to me

In absolute terms you're correct. In relative terms, they're an object that at least frequentist theory can begin to analyze (as you point out, statistical learning theory did, somewhat unsuccessfully).

Whereas Bayesian theory would throw up its hands and say it's not a prior that gets updated, so it's not worth considering as a statistical estimator. This seems even wronger.

More recent theory can account for them working, somewhat. But it's about analyzing their properties as estimators (i.e. frequentism) as opposed to framing them in terms of prior/posterior (though there are plenty of attempts at the latter going around).

2Garrett Baker
I think this comment of mine serves well as a response to this as well as the comment it was originally responding to.

Yet, the biggest effect I think this will have is pedagogical. I've always found the definition of a limit kind of unintuitive, and it was specifically invented to add post hoc coherence to calculus after it had been invented and used widely. I suspect that formulating calculus via infinitesimals in introductory calculus classes would go a long way to making it more intuitive.

I think hyperreals are too complicated for calculus 1 and you should just talk about a non-rigorous "infinitesimal" like Newton and Leibniz did.

1Yudhister Kumar
I agree. This is what I was going for in that paragraph. If you define derivatives & integrals with infinitesimals, then you can actually do things like treating dy/dx as a fraction without partaking in the half-in half-out dance that calc 1 teachers currently have to do. I don't think the pedagogical benefit of nonstandard analysis is to replace Analysis I courses, but rather to give a rigorous backing to doing algebra with infinitesimals ("an infinitely small thing plus a real number is the same real number, an infinitely small thing times a real number is zero"). *Improper integrals would make a lot more sense this way, IMO.

Voila! We have a suitable definition of "almost all agreement": if the agreement set is contained in some arbitrary nonprincipal ultrafilter $U$.

Isn't it easier to just say "If the agreement set has a nonfinite number of elements"? Why the extra complexity?

$U$ must contain a set or its complement

Oh I see, so defining it with ultrafilters rules out situations like $a$ and $b$ where both have infinite zeros and yet their product is zero.
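A worked instance of the failure (the standard alternating example; my addition, not from the thread): take the sequences

$a = (1, 0, 1, 0, \dots), \qquad b = (0, 1, 0, 1, \dots).$

Each agrees with the zero sequence on an infinite index set, so the "nonfinite agreement" rule would force both $a = 0$ and $b = 0$; yet $a + b = (1, 1, 1, \dots)$ disagrees with $0$ everywhere, so that rule is inconsistent. A nonprincipal ultrafilter contains exactly one of the even-index set or the odd-index set, so exactly one of $a$, $b$ gets identified with $0$, which is JBlack's point below.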

3JBlack
The post is wrong in saying that U contains only cofinite sets. It obviously must contain plenty of sets that are neither finite nor cofinite, because the complements of those sets are also neither finite nor cofinite. Possibly the author intended to type "contains all cofinite sets" instead. In particular, exactly one of a or b is equivalent to zero in *R. Which one is equivalent to zero depends upon exactly which non-principal ultrafilter you choose, as there are infinitely many non-principal ultrafilters. Unfortunately (as with many other applications of the Axiom of Choice) there is no finite way to specify which ultrafilter you mean.

These are drawbacks rather than dangerous attributes, so why did you call the post the "Perils" of hydrogen? It's not an accurate description of the post's content.

1mruwnik
They are perils of assuming that hydrogen is the future, or perils of basing your energy needs on it - i.e. the peril is not in the hydrogen, it's in making plans involving it

the thing with frequentism is " yeah just use methods in a pragmatic way and don't think about it that hard"

I think this does not accurately represent my beliefs. It is about thinking hard about how the methods actually behave, as opposed to having a theory that prescribes how methods should behave and then constructing algorithms based on that.

Frequentists analyze the properties of an algorithm that takes data as input (in their jargon, an 'estimator').

They also try to construct better algorithms, but each new algorithm is bespoke and requires original... (read more)

It's not clear what the ratio of capabilities/alignment progress is for interpretability. There is no empirical track record[1] of interpretability feeding back into improvements of any kind.

A priori it seems like it would be good because understanding how things work is useful to understand their behavior better, and thus be able to tell whether or not a model is aligned or how to make it more so. But understanding how things work is also useful for making them more capable, e.g. if you use interpretability as a model-debugger, it's basically general purpose for dealing with ML models.

[1]: known to the author

Cool work! I was going to post about how "effect cancellation" is already known and was written in the original post but, astonishingly to me, it is not! I guess I mis-remembered.

There's one detail that I'm curious about. CaSc usually compares abs(E[loss] - E[scrubbed loss]), and that of course leads to ignoring hypotheses which lead the model to do better in some examples and worse in others.

If we compare E[abs(loss - scrubbed loss)] does this problem go away? I imagine that it doesn't quite if there are exactly-opposing causes for each example, but that ... (read more)
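A toy numeric illustration of the cancellation (my construction): the scrubbed model does better on half the examples and worse on the other half, so the two aggregates disagree.

import numpy as np

# Per-example losses for the original and the scrubbed (interchange-intervened) model.
loss = np.array([1.0, 1.0, 1.0, 1.0])
scrubbed_loss = np.array([0.5, 1.5, 0.5, 1.5])

print(abs(loss.mean() - scrubbed_loss.mean()))    # 0.0 -> abs(E[loss] - E[scrubbed loss]) sees nothing
print(np.abs(loss - scrubbed_loss).mean())        # 0.5 -> E[abs(loss - scrubbed loss)] catches it
print(np.abs(loss - scrubbed_loss).max())         # 0.5 -> the "max loss" variant catches it too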

If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one) you don't get these kind of cancellation problems

I think this "max loss" procedure is different from what Buck wrote and the same as what I wrote.

Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which make it look worse”.

I just now realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially--that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H', and then running the CaSc experiment on join(H, H').

One thing that is not equivalent to joins, which you might also want to do, is to choose ... (read more)

Here are my predictions, from an earlier template. I haven't looked at anyone else's predictions before posting :)

  1. Describe how the trained policy might generalize from the 5x5 top-right cheese region to cheese spawned throughout the maze. I.e., what will the policy do when cheese is spawned elsewhere?

It probably has hardcoded “go up and to the right” as an initial heuristic so I’d be surprised if it gets cheeses in the other two quadrants more than 30% of the time (uniformly at random selected locations from there).

  2. Given a fixed trained policy, what
... (read more)

First of all, I really like the images, they made things easier to understand and are pretty. Good work with that!

My biggest problem with this is the unclear applicability of this to alignment. Why do we want to predict scaling laws? Doesn't that mostly promote AI capabilities, and not alignment very much?

Second, I feel like there's a confusion over several probability distributions and potential functions going on

  • The singularities are those of the likelihood ratio
  • We care about the generalization error with respect to some prior , but the l
... (read more)
4Jesse Hoogland
Thank you! This is also my biggest source of uncertainty on the whole agenda. There's definitely a capabilities risk, but I think the benefits to understanding NNs currently much outweigh the benefits to improving them.

In particular, I think that understanding generalization is pretty key to making sense of outer and inner alignment. If "singularities = generalization" holds up, then our task seems to become quite a lot easier: we only have to understand a few isolated points of the loss landscape instead of the full exponential hell that is a billions-dimensional system.

In a similar vein, I think that this is one of the most promising paths to understanding what's going on during training. When we talk about phase changes / sharp left turns / etc., what we may really be talking about are discrete changes in the local singularity structure of the loss landscape. Understanding singularities seems key to predicting and anticipating these changes just as understanding critical points is key to predicting and anticipating phase transitions in physical systems.

As long as your prior has non-zero support on the singularities, the results hold up (because we're taking this large-N limit where the prior becomes less important). Like I mention in the objections, linking this to SGD is going to require more work. To first order, when your prior has support over only a compact subset of weight space, your behavior is dominated by the singularities in that set (this is another way to view the comments on phase transitions). This is very much a work in progress.

In statistical physics, much of our analysis is built on the assumption that we can replace temporal averages with phase-space averages. This is justified on grounds of the ergodic hypothesis. In singular learning theory, we've jumped to parameter (phase)-space averages without doing the important translation work from training (temporal) averages. SGD is not ergodic, so this will require care. That the exact a

Instead of simulating Brownian motion, you could run SGD with momentum. That would be closer to what actually happens with NNs, and just as easy to simulate.

I expect it to be directionally similar but less pronounced (because MCMC methods with momentum explore the distribution better).
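A minimal sketch of what I mean (my construction, not the post's experiment): noisy gradient descent with momentum on a toy loss $L(x, y) = (xy)^2$, whose zero-loss set (the two axes) has a singularity at the origin.

import numpy as np

# Noisy heavy-ball SGD on L(x, y) = (x*y)^2. The zero-loss set is the union of the
# two axes; the origin is its singular point. Hyperparameters below are arbitrary.
rng = np.random.default_rng(0)
lr, beta, noise_scale = 1e-2, 0.9, 0.1
pos = np.array([1.0, 1.0])
vel = np.zeros(2)

for _ in range(10_000):
    x, y = pos
    grad = np.array([2 * x * y**2, 2 * y * x**2])        # exact gradient of (xy)^2
    grad += noise_scale * rng.standard_normal(2)          # minibatch-style noise
    vel = beta * vel - lr * grad                           # momentum update
    pos = pos + vel

print("final point:", pos, "loss:", (pos[0] * pos[1])**2)
# The iterate tends to hover near the zero-loss set and drift along it; the question
# is whether it concentrates near the singular origin the way the Brownian simulation does.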

I also take issue with the way the conclusion is phrased. "Singularities work because they transform random motion into useful search for generalization". This is only true if you assume that points nearer a singularity generalize better. Maybe I'd phrase it as, "SGD works be... (read more)

1Jesse Hoogland
Hey I need a reason to write a follow-up to this, right? I was trying to be intentionally provocative, but you're right — it's too much. Thanks for the suggestion!

The Hayflick Limit, as it has become known, can be thought of as a last line of defense against cancer, kind of like a recursion depth limit [...] Preventing cells from becoming senescent, or reversing their senescent state, may therefore be a bad idea, but what we can do is remove them

When do the cells with sufficiently long telomeres run out? Removing senescent cells sounds good, but if all the cells have a built-in recursion limit, at some point there won't be any cells with sufficiently long telomeres left in the body. Assuming a non-decreasing divi... (read more)

3PhilJackson
Not all cells in the adult body divide; most of them, I think, divide only rarely if ever. Cells that divide more regularly generally express telomerase to keep their telomeres from running out. Telomeres running out may still be an issue though, so there are people looking at ways of lengthening them via telomerase expression.

To elaborate somewhat, you could say that the token is the state, but then the transition probability is non-Markovian and all the math gets really hard.

1JNS
Intuitively I would say that all the tokens in the token window are the state. And when you run an inference pass, select a token and append that to the token window, then you have a new state.

The model looks a lot like a collection of nonlinear functions, each of them encoded using every parameter in the model. Since the model is fixed after training, the only place an evolving state can exist has to be in the tokens, or more specifically the token window that is used as input.

The state seems to contain, for lack of a better word, a lot of entanglement. Likely due to attention heads, and how the nonlinear functions are encoded.

There is another way to view such a system, one that, while deeply flawed, at least to me intuits that whatever Microsoft and OpenAI are doing to "align(?)" something like Bing Chat is impossible (at least if the goal is bulletproof). I would postulate:

- Alignment for such a system is impossible (assuming it has to be bulletproof)
- Impossibility is due to the architecture of such a system

1. ^ I assume that any bit in the input affects the output, and that a change in any parameter has potential impact on that bit.
2. ^ If anyone wants to hear about it, I would be happy to explain my thinking. But be aware the abstraction and mapping I used was very sloppy and ad hoc.

Proposition 1 is wrong. The coin flips that are eternally 0 0 0 0 are a counterexample. If all the transition probabilities are 1, which is entirely possible, the limiting probability is 1 and not 0.

1Jan
Technically correct, thanks for pointing that out! This comment (and the ones like it) was the motivation for introducing the "non-degenerate" requirement into the text. In practice, the proposition holds pretty well - although I agree it would be nice to have a deeper understanding of when to expect the transition rule to be "non-degenerate".
6Peter Schmidt-Nielsen
So, a softmax can never emit a probability of 0 or 1; maybe they were implicitly assuming the model ends in a softmax (as is the common case)? Regardless, the proof is still wrong if a model is allowed unbounded context, as an infinite product of positive numbers less than 1 can still be nonzero. For example, if the probability of emitting another " 0" is even just as high as $1 - \frac1{n^{1.001}}$ after already having emitted $n$ copies of " 0", then the limiting probability is still nonzero. But if the model has a finite context and ends in a softmax then I think there is some minimum probability of transitioning to a given token, and then the proposition is true. Maybe that was implicitly assumed?
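The convergence fact being used, spelled out (standard analysis; my addition): for $a_n \in [0,1)$, $\prod_{n \ge 2}(1 - a_n) > 0$ if and only if $\sum_{n \ge 2} a_n < \infty$ (take logs and use $\log(1 - a_n) \approx -a_n$). With $a_n = n^{-1.001}$ the sum is a convergent $p$-series ($p > 1$), so the probability of emitting " 0" forever stays bounded away from zero.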

What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values?

No, I mean that they'll maximize a reward function that is ≈equal to the reward function on the training data (thus, highly correlated), and a plausible extrapolation of it outside of the training data. Take the coinrun example: the actual reward is "go to the coin", and in the training data this coincides with "go to the right". In test data from a similar distribution this coincides too.

Of course, this correlatio... (read more)

2TurnTrout
I'm going to just reply with my gut responses here, hoping this clarifies how I'm considering the issues. Not meaning to imply we agree or disagree.

Probably, yeah. Consider a network which received lots of policy gradients from the cognitive-update-intensity-signals ("rewards"[1]) generated by the "go to coin?" subroutine. I agree that this network will tend to, in the deployment distribution, take actions which average higher sum-cognitive-update-intensity-signal ("reward over time") than networks which are randomly initialized, or even which have randomly sampled shard compositions/values (in some reasonable sense). But this doesn't seem like it constrains my predictions too strongly. It seems like a relatively weak, correlational statement, where I'd be better off reasoning mechanistically about the likely "proxy-for-reward" values which get learned.

I understand you to argue: "SGD will select policy networks for maximizing reward during training. Therefore, we should expect policy networks to behaviorally maximize reward on the training distribution over episodes." On this understanding of what you're arguing: No, agents often do not behave as reward maximizers in the specific seen training points. RL trains agents which don't maximize training reward... all the time! Agents:

1. die in video games (see DQN),[2]
2. fail to perform the most expert tricks and shortcuts (is AlphaZero playing perfect chess?),
3. (presumably) fail to exploit reward hacking opportunities which are hard to explore into.

For the last point, imagine that AlphaStar could perform a sequence of 300 precise actions, and then get +1 million policy-gradient-intensity ("reward") due to a glitch. On the reasoning I understand you to advance, SGD is "selecting" for networks which receive high policy-gradient-intensity, but... it's never going to happen in realistic amounts of time. Even in training. This is because SGD is updating the agent on the observed empirical data

But the designers can't tell that. Can SGD tell that?

No, SGD can't tell the degree to which some agent generalizes a trait outside the training distribution.

But empirically, it seems that RL agents reinforced to maximize some reward function (e.g. the Atari game score) on data points do fairly well at maximizing that reward function OOD (such as when playing the game again from a different starting state). ML systems in general seem to be able to generalize to human-labeled categories in situations that aren't in the training data (e.g. image classifie... (read more)

2TurnTrout
What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values? Can you give an example of such a motivational structure, so I know we're considering the same thing? Agreed. I also think this is different from a very specific kind of generalization towards reward maximization. I again think it is plausible (2-5%-ish) that agents end up primarily making decisions on the basis of a tight reward-correlate (e.g. the register value, or some abstract representation of their historical reward function), and about 60% that agents end up at least somewhat making decisions on the basis of reward in a terminal sense (e.g. all else equal, the agent makes decisions which lead to high reward values; I think people are reward-oriented in this sense). Overall I feel pretty confused about what's going on with people, and I can imagine changing my mind here relatively easily.

Strongly agree with this in particular:

Some people want to apply selection arguments because they believe that selection arguments bypass the need to understand mechanistic details to draw strong conclusions. I think this is mistaken, and that selection arguments often prove too much, and to understand why, you have to know something about the mechanisms.

(emphasis mine). I think it's an application of the no free lunch razor

It is clear that selecting for X selects for agents which historically did X in the course of the selection. But how this generali... (read more)

I agree with the title as stated but not with the rest of the post. RLHF implies that RL will be used, which completely defuses alignment plans that hope language models will be friendly because they're not agents. (It may be true that supervised-learning (SL) models are safer, but the moment you get an SL technique, people are going to jam it into RL.)

The central problem with RL isn't that it is vulnerable to wireheading (the "obvious problem"), or that it's going to make a very detailed model of the world. Wireheading on its own (with e.g. a myopic ... (read more)

Hey P. Assuming Demis Hassabis reads your email and takes it seriously, why won’t his reaction be “I already have my alignment team, Shane Legg took care of that” ?

Deepmind has had an alignment team for a long time.

2P.
Well, if he has, unbeknownst to me, already hired the “Terence Taos of the world” like he said on the podcast, that would be great, and I would move on to other tasks. But if he only has a regular alignment team, I don’t think either of us considers that to be enough. I’m just trying to convince him that it’s urgent and we can’t leave it for later.

You should apply to Anthropic. If you're writing ML software at semi-FAANG, they probably want to interview you ASAP. https://www.lesswrong.com/posts/YDF7XhMThhNfHfim9/ai-safety-needs-great-engineers

The compensation is definitely enough to take care of your family and then save some money!

While in equilibrium the two ways of defining thermodynamic entropy (scientific and subjective) come apart

You probably mean “while not in equilibrium”

1EuanMcLean
fixed, thanks!

Good find! Yeah, this is a good explanation for learning, and the NFL razor does not discard it. I think that almost no deep learning professor believes the bad explanation that “deep learning works because NNs are universal approximators”. But it’s more common with students and non-experts (I believed it for a while!)

Getting more value-aligned people in the AIS community onto the safety teams of DeepMind and OpenAI

Why is this important? As far as I can tell, the safety teams of these two organisations are already almost entirely "value-aligned people in the AIS community". They need more influence within the organisation, sure, but that's not going to be solved by altering team composition.

1Evan R. Murphy
rachelAF mentioned that she had the impression their safety teams were more talent-constrained than funding-constrained. So I inferred that getting more value-aligned people onto those teams wouldn't just alter the team composition, but increase the size of their safety teams. We probably need more evidence that those teams do still have open headcount though. I know DeepMind's does right now, but I'm not sure whether that's just a temporary opening. You make a good point though. If the safety teams have little influence within those orgs, then #3 may be a lot more impactful than #1. Interesting, how do you know this? Is there information about these teams available somewhere?

I'd argue the world right now (or rather, life on earth) is super bad because it's dominated by animal suffering

I agree with this, and the overall history of the world is definitely, on balance, one of extreme suffering.

For farmed animals in particular, we don't need AGI to end their plight. Just regular economic growth and advocacy will do.

Also, given how much time we've been suffering already, and how much is at stake; would it be so bad to delay AGI by 100 or 200 years? We can do a lot of alignment research in that time.

2Rafael Harth
Yeah, if I got to decide, I would barely factor in how bad the world is right now. Delay AGI until it's outweighed by other x-risks.