LESSWRONG

All of Jacob_Hilton's Comments + Replies

A computational no-coincidence principle

The hope is to use the complexity of the statement rather than mathematical taste.

If it takes me 10 bits to specify a computational possibility that ought to happen 1% of the time, then we shouldn't be surprised to find around 10 (~1% of $2^{10}$) occurrences. We don't intend the no-coincidence principle to claim that these should all happen for a reason.

Instead, we intend the no-coincidence principle to claim that if such coincidences happen much more often than we would have expected them to by chance, then there is a reason for that. Or put another... (read more)
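To make the arithmetic above concrete, here is a minimal sketch in Python. The function name `expected_coincidences`, and the simplifying assumption that every $k$-bit description names one possibility that occurs independently with the same probability, are illustrative and not taken from the post.

```python
def expected_coincidences(description_bits: int, probability: float) -> float:
    """Expected number of chance occurrences among all possibilities
    that can be specified with `description_bits` bits.

    Assumption (for illustration only): each of the 2**description_bits
    descriptions names one possibility, and each occurs independently
    with the same probability.
    """
    return probability * 2 ** description_bits


if __name__ == "__main__":
    # 10 bits of description, each possibility happening 1% of the time.
    print(expected_coincidences(10, 0.01))
```

On this picture, roughly ten chance occurrences are the unremarkable baseline. It is only a count far above that baseline that the principle says should have a reason.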

0Logan Zoellner13d

I understand the hope, I just think it's going to fail (for more or less the same reason it fails with formal proof). With formal proof, we have Gödel's speedup, which tells us that you can turn a Gödel statement into a true statement with a ridiculously long proof. You attempt to get around this by replacing formal proof with "heuristic", but whatever your heuristic system, it's still going to have some power (in the Turing hierarchy sense) and some Gödel statement. That Gödel statement is in turn going to result in a "seeming coincidence". Wolfram's observation is that this isn't some crazy exception, this is the rule. Most true statements in math are pretty arbitrary and don't have shorter explanations than "we checked it and it's true". The reason why mathematical taste works is that we aren't dealing with "most true statements", we're only dealing with statements that have particular beauty or interest to mathematicians. It may seem like cheating to say that human mathematicians can do something that literally no formal mathematical system can do. But if you truly believe that, the correct response would be to respond when asked "is pi normal" with "I don't know". The reason why your intuition is throwing you off is because you keep thinking of coincidences as "pi is normal" and not "we picked an arbitrary CA with 15k bits of complexity and ran it for 15k steps but it didn't stop. I guess it never terminates."

A computational no-coincidence principle

Jacob_Hilton14d40

For the informal no-coincidence principle, it's important to us (and to Gowers IIUC) that a "reason" is not necessarily a proof, but could instead be a heuristic argument (in the sense of this post). We agree there are certainly apparently outrageous coincidences that may not be provable, such as Chebyshev's bias (discussed in the introduction to the post). See also John Conway's paper On Unsettleable Arithmetical Problems for a nice exposition of the distinction between proofs and heuristic arguments (he uses the word "probvious" for a statement with a co... (read more)

3Logan Zoellner14d

I doubt that weakening from formal proof to heuristic saves the conjecture. Instead I lean towards Stephen Wolfram's Computational Irreducibility view of math. Some things are true simply because they are true and in general there's no reason to expect a simpler explanation. In order to reject this you would either have to assert: a) Wolfram is wrong and there are actually deep reasons why simple systems behave precisely the way they do or b) For some reason computational irreducibility applies to simple things but not to infinite sets of the type mathematicians tend to be interested in. I should also clarify that in a certain sense I do believe b). I believe that pi is normal because something very fishy would have to be happening for it to not be. However, I don't think this holds in general. With Collatz, for example, we are already getting close to the hairy "just so" Turing-machine-like behavior where you would expect the principle to fail. Certainly, if one were to collect all the Collatz-like systems that arise from Turing Machines I would expect some fraction of them to fail the no-coincidence principle.

A computational no-coincidence principle

Jacob_Hilton14d133

It's not, but I can understand your confusion, and I think the two are related. To see the difference, suppose hypothetically that 11% of the first million digits in the decimal expansion of $π$ were 3s. Inductive reasoning would say that we should expect this pattern to continue. The no-coincidence principle, on the other hand, would say that there is a reason (such as a proof or a heuristic argument) for our observation, which may or may not predict that the pattern will continue. But if there were no such reason and yet the pattern continued, th... (read more)

A computational no-coincidence principle

Jacob_Hilton16d60

Good question! We also think that NP ≠ co-NP. The difference between 99% (our conjecture) and 100% (NP = co-NP) is quite important, essentially because 99% of random objects "look random", but not 100%. For example, consider a uniformly random string $x \in \{0, 1\}^{n}$ for some large $n$. We can quite confidently say things like: the number of 0s in $x$ is between $0.499 n$ and $0.501 n$; there is no streak of $\lfloor \sqrt{n} \rfloor$ alternating 0s and 1s; etc. But these only hold with 99% confidence (more precisely, with probability tending to ... (read more)
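As a rough illustration of the first claim, Hoeffding's inequality bounds the chance that the fraction of 0s in a uniformly random $n$-bit string falls outside $[0.499, 0.501]$ by $2\exp(-2 n \delta^{2})$ with $\delta = 0.001$. The sketch below evaluates that bound. The function name and the choice of Hoeffding's inequality are illustrative assumptions, not something taken from the post.

```python
import math


def zero_fraction_tail_bound(n: int, delta: float = 0.001) -> float:
    """Hoeffding upper bound on P(|#zeros/n - 1/2| >= delta)
    for a uniformly random string in {0, 1}^n."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * delta ** 2))


if __name__ == "__main__":
    # The bound is weak for n = 10**6 and shrinks quickly as n grows.
    for n in (10**6, 10**7, 10**8):
        print(n, zero_fraction_tail_bound(n))
```

The bound only becomes small once $n$ is well above $10^{6}$, which fits the "for some large $n$" caveat. For any finite $n$ the true failure probability is small but not zero (the all-0s string has positive probability), which is the gap between "99%" and "100%" in the comment.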

A computational no-coincidence principle

Jacob_Hilton17d130

Before reversible circuits, we first considered a simpler setting: triangle counting. The no-coincidence principle in that setting turned out to be true, but for a relatively uninteresting reason, because the domain was not rich enough. Nevertheless, I think this result serves as a helpful exercise for people trying to get to grips with our definitions, as well as providing more of the story about how we ended up with our reversible circuits statement.

In the triangle counting setting, we consider the distribution $C_{3} (n, p)$ over undirected 3-partite... (read more)

A bird's eye view of ARC's research

Jacob_Hilton4moΩ230

It sounds like we are not that far apart here. We've been doing some empirical work on toy systems to try to make the leap from mechanistic interpretability "stories" to semi-formal heuristic explanations. The max-of-k draft is an early example of this, and we have more ambitious work in progress along similar lines. I think of this work in a similar way to you: we are not trying to test empirical assumptions (in the way that some empirical work on frontier LLMs is, for example), but rather to learn from the process of putting our ideas into practice.

Backdoors as an analogy for deceptive alignment

Jacob_Hilton4moΩ8110

For those who are interested in the mathematical details, but would like something more accessible than the paper itself, see this talk I gave about the paper:

A bird's eye view of ARC's research

Jacob_Hilton4moΩ28534

Thank you – this is probably the best critique of ARC's research agenda that I have read since we started working on heuristic explanations. This level of thoughtfulness in external feedback is very rare and I'm grateful for the detail and clarity you put into it. I don't think my response fully rebuts your central concern, but hopefully it gives a sense of my current thinking about it.

It sounds like we are in agreement that something very loosely heuristic explanation-flavored (interpreted so broadly as to include mechanistic interpretability, for example... (read more)

Dmitry Vaintrob4mo*Ω9196

Thank you for the great response, and the (undeserved) praise of my criticism. I think it's really good that you're embracing the slightly unorthodox positions of sticking to ambitious convictions and acknowledging that this is unorthodox. I also really like your (a)-(d) (and agree that many of the adherents of the fields you list would benefit from similar lines of thinking).

I think we largely agree, and much of our disagreement probably boils down to where we draw the boundary between “mechanistic interpretability” and “other”. In particular, I fully agr... (read more)

Raemon's Shortform

Jacob_Hilton6mo80

The LLM output looks correct to me.

Formal verification, heuristic explanations and surprise accounting

Jacob_Hilton8moΩ350

Yes, I think the most natural way to estimate total surprise in practice would be to use sampling like you suggest. You could try to find the best explanation for "the model does $bad_thing with probability less than 1 in a million" (which you believe based on sampling) and then see how unlikely $bad_thing is according to the resulting explanation. In the Boolean circuit worked example, the final 23-bit explanation is likely still the best explanation for why the model outputs TRUE on at least 99% of inputs, and we can use this explanation to see that the ... (read more)

3John Schulman8mo

Cool, fine-tuning sounds a bit like conditional Kolmogorov complexity -- the cost of your explanation would be K(explanation of rare thing | explanation of the loss value and general functionality)

Formal verification, heuristic explanations and surprise accounting

Jacob_Hilton8mo50

Yes, that's a clearer way of putting it in the case of the circuit in the worked example. The reason I said "for no apparent reason" is that there could be some redundancy in the explanation. For example, if you already had an explanation for the output of some subcircuit, you shouldn't pay additional surprise if you then check the output of that subcircuit in some particular case. But perhaps this was a distracting technicality.

Formal verification, heuristic explanations and surprise accounting

Jacob_Hilton8mo52

I would say that they are motivated by the same basic idea, but are applied to different problems. The MDL (or the closely-related BIC) is a method for model selection given a dataset, whereas surprise accounting is a method for evaluating heuristic explanations, which don't necessarily involve model selection.

Take the Boolean circuit worked example: what is the relevant dataset? Perhaps it is the 256 (input, TRUE) pairs. But the MDL would select a much simpler model, namely the circuit that ignores the input and outputs TRUE (or "x_1 OR (NOT x_1)" if it h... (read more)

Formal verification, heuristic explanations and surprise accounting

Jacob_Hilton8mo60

Yes, the cost of 1 bit for the OR gate was based on the somewhat arbitrary choice to consider only OR and AND gates. A bit more formally, the heuristic explanations in the post implicitly use a "reference class" of circuits where each binary gate was randomly chosen to be either an OR or an AND, and each input wire to a binary gate was randomly chosen to have a NOT or not. The arbitrariness of this choice of reference class is one obstruction to formalizing heuristic explanations and surprise accounting. We are currently preparing a paper that explores this and related topics, but unfortunately the core issue remains unresolved.
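A minimal sketch of the reference class described above, in Python. The dictionary layout, the function names, and the wider OR/AND/XOR class used for comparison are hypothetical choices made for illustration. Only the OR/AND choice and the per-wire NOT coin flip come from the comment.

```python
import math
import random

GATE_TYPES = ("OR", "AND")  # the reference class described in the comment


def sample_gate(rng: random.Random) -> dict:
    """Draw one binary gate from the reference class: a uniformly random
    gate type, and an independent fair coin flip per input wire deciding
    whether that wire carries a NOT."""
    return {
        "type": rng.choice(GATE_TYPES),
        "negate_inputs": (rng.random() < 0.5, rng.random() < 0.5),
    }


def gate_type_surprise_bits(gate_types=GATE_TYPES) -> float:
    """Surprise, in bits, of observing one specific gate type when the
    type is uniform over `gate_types`."""
    return -math.log2(1.0 / len(gate_types))


if __name__ == "__main__":
    print(sample_gate(random.Random(0)))
    print(gate_type_surprise_bits())                      # OR/AND only
    print(gate_type_surprise_bits(("OR", "AND", "XOR")))  # hypothetical wider class
```

Changing the set of allowed gate types changes the bit cost charged for the same OR gate, which is one way to see why the choice of reference class matters.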

4Alexander Gietelink Oldenziel8mo

Thanks. I'm looking forward to your paper! The 'surprise accounting' framework sounds quite a lot like the Minimum Description Length principle (e.g. here). Do you have any takes on how surprise accounting compares with and differs from MDL? Do I understand correctly that the main issue is finding ~ a canonical prior on the set of circuits?

Non-Disparagement Canaries for OpenAI

Jacob_Hilton9mo11-2

See the statement from OpenAI in this article:

We're removing nondisparagement clauses from our standard departure paperwork, and we're releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual. We'll communicate this message to former employees.

They have communicated this to me and I believe I was in the same category as most former employees.

I think the main reasons so few people have mentioned this are:

As I mentioned, there is still some legal ambiguity and additional avenues for retaliation
Som

... (read more)

Non-Disparagement Canaries for OpenAI

Jacob_Hilton9mo13-19

Yeah I agree with this, and my original comment comes across too strongly upon re-reading. I wanted to point out some counter-considerations, but the comment ended up unbalanced. My overall view is:

It was highly inappropriate for the company to have been issuing these agreements so widely, especially using such aggressive tactics and without allowing disclosure of the agreement, given the technology that it is developing.
The more high-profile and credible a person is, the more damaging it is for this person to have been subject to the agreement.
Nevertheles

... (read more)

Adam Scholl9mo11-1

the post appears to wildly misinterpret the meaning of this term as "taking any actions which might make the company less valuable"

I'm not a lawyer, and I may be misinterpreting the non-interference provision—certainly I'm willing to update the post if so! But upon further googling, my current understanding is still that in contracts, "interference" typically means "anything that disrupts, damages or impairs business."

And the provision in the OpenAI offboarding agreement is written so broadly—"Employee agrees not to interfere with OpenAI’s relationship wit... (read more)

habryka9mo5956

Thankfully, most of this is now moot as the company has retracted the contract.

I don't think any of this is moot, since the thing that is IMO most concerning is people signing these contracts, then going into policy or leadership positions and not disclosing that they signed those contracts. Those things happened in the past and are real breaches of trust.

Non-Disparagement Canaries for OpenAI

Jacob_Hilton9mo2-50

We were especially alarmed to notice that the list contains at least 12 former employees currently working on AI policy, and 6 working on safety evaluations. This includes some in leadership positions, for example:

I don't really follow this reasoning. If anything, playing a leadership role in AI policy or safety evaluations will usually give you an additional reason not to publicly disparage AI companies, to avoid being seen as partisan, making being subject to such an agreement less of an issue. I would be pretty surprised if such people subject to these ... (read more)

[This comment is no longer endorsed by its author]

Thrasymachus9mo1613

I see the concerns as these:

The four corners of the agreement seem to define 'disparagement' broadly, so one might reasonably fear (e.g.) "First author on an eval especially critical of OpenAI versus its competitors", or "Policy document highly critical of OpenAI leadership decisions" might 'count'.
Given Altman's/OpenAI's vindictiveness and duplicity, and the previous 'safeguards' (from their perspective) which give them all the cards in terms of folks being able to realise the value of their equity, "They will screw me out of a lot of money if I do someth

... (read more)

Amalthea9mo4841

When you have a role in policy or safety, it may usually be a good idea not to voice strong opinions on any given company. If you nevertheless feel compelled to do so by circumstances, it's a big deal if you have personal incentives against that - especially if they're not disclosed.

Common misconceptions about OpenAI

Jacob_Hilton9mo*143

If the question is whether I think they were true at the time given the information I have now, I think all of the individual points hold up except for the first and third "opinions". I am now less sure about what OpenAI leadership believed or cared about. The last of the "opinions" also seems potentially overstated. Consequently, the overall thrust now seems off, but I still think it was good to share my views at the time, to start a discussion.

If the question is about the state of the organization now, I know less about that because I haven't worked there in over a year. But the organization has certainly changed a lot since this post was written over 18 months ago.

Common misconceptions about OpenAI

Jacob_Hilton1yΩ812-1Review for 2022 Review

Since this post was written, OpenAI has done much more to communicate its overall approach to safety, making this post somewhat obsolete. At the time, I think it conveyed some useful information, although it was perceived as more defensive than I intended.

My main regret is bringing up the Anthropic split, since I was not able to do justice to the topic. I was trying to communicate that OpenAI maintained its alignment research capacity, but should have made that point without mentioning Anthropic.

Ultimately I think the post was mostly useful for sparking some interesting discussion in the comments.

Mode collapse in RL may be fueled by the update equation

Jacob_Hilton2yΩ790

I think KL/entropy regularization is usually used to prevent mode collapse partly because it has nice theoretical properties. In particular, it is easy to reason about the optimal policy for the regularized objective - see for example the analysis in the paper Equivalence Between Policy Gradients and Soft Q-Learning.

Nevertheless, action-dependent baselines do appear in the literature, although the story is a bit confusing. This is my understanding of it from some old notes:

The idea was explored in Q-Prop. But unlike you, their intention was not to change t

... (read more)

ARC is hiring theoretical researchers

Jacob_Hilton2y51

We will do our best to fairly consider all applications, but realistically there is probably a small advantage to applying earlier. This is simply because there is a limit to how quickly we can grow the organization, so if hiring goes better than expected then it will be longer before we can take on even more people. With that being said, we do not have a fixed number of positions that we are hiring for; rather, we plan to vary the number of hires we make based on the strength of the applications we receive. Moreover, if we were unable to hire someon... (read more)

ARC is hiring theoretical researchers

Jacob_Hilton2y*Ω360

The questions on the take-home test vary in difficulty but are generally easier than olympiad problems, and should be accessible to graduates with relevant background. However, it is important to note that we are ultimately interested in research ability rather than the ability to solve self-contained problems under timed conditions. So although the take-home test forms part of our assessment, we also look at other signals such as research track-record (while recognizing that assessing research ability is unfortunately very hard).

(Note: I am talking about the current version of the test; it's possible that the difficulty will change as we refine our interview process.)

ARC is hiring theoretical researchers

Jacob_Hilton2y178

I think the kind of mathematical problem solving we're engaged in is common across theoretical physics (although this is just my impression as a non-physicist). I've noticed that some specific topics that have come up (such as Gaussian integrals and permanents) also crop up in quantum field theory, but I don't think that's a strong reason to prefer that background particularly. Broad areas that often come up include probability theory, computational complexity and discrete math, but it's not necessary to have a lot of experience in those areas, only to be able to pick things up from them as needed.

Prizes for matrix completion problems

Jacob_Hilton2y52

It's not quite this simple: the same issue arises if every PSD completion of the known-diagonal minor has zero determinant (e.g. ((?, 1, 2), (1, 1, 1), (2, 1, 1))). But I think in that case making the remaining diagonal entries large enough still makes the eigenvalues at least −ε, which is good enough.
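A small numerical check of this example, in Python, using only the standard library. The helper names and the particular values tried for the unknown diagonal entry and for ε are illustrative choices, not part of the original problem statement.

```python
from itertools import combinations


def completion(x: float) -> list:
    """The example matrix with the unknown diagonal entry set to x."""
    return [[x, 1.0, 2.0],
            [1.0, 1.0, 1.0],
            [2.0, 1.0, 1.0]]


def det(m: list) -> float:
    """Determinant by cofactor expansion (fine for 3x3 and smaller)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def eigenvalues_at_least(m: list, lower: float) -> bool:
    """True if every eigenvalue of the symmetric matrix m is >= lower,
    i.e. if m - lower*I is PSD, checked by requiring every principal
    minor of the shifted matrix to be non-negative."""
    n = len(m)
    shifted = [[m[i][j] - (lower if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    for size in range(1, n + 1):
        for idx in combinations(range(n), size):
            sub = [[shifted[i][j] for j in idx] for i in idx]
            if det(sub) < 0:
                return False
    return True


if __name__ == "__main__":
    eps = 1e-3
    for x in (1.0, 100.0, 1000.0):
        m = completion(x)
        print(x, det(m), eigenvalues_at_least(m, 0.0),
              eigenvalues_at_least(m, -eps))
```

Because the known 2x2 block has zero determinant, the full determinant does not depend on the unknown entry and stays negative, so no choice of that entry gives an exact PSD completion. A large enough entry still pushes the smallest eigenvalue above −ε, in line with the comment.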

AI alignment researchers don't (seem to) stack

Jacob_Hilton2y3735

I think the examples you give are valid, but there are several reasons why I think the situation is somewhat contingent or otherwise less bleak than you do:

Counterexamples: I think there are research agendas that are less pre-paradigmatic than the ones you're focused on that are significantly more (albeit not entirely) parallelizable. For example, mechanistic interpretability and scalable oversight both have multiple groups focused on them and have grown substantially over the last couple of years. I'm aware that we disagree about how valuable these direct

Jacob_Hilton2y30

You might find this work interesting, which takes some small steps in this direction. It studies the effect of horizon length inasmuch as it makes credit assignment harder, showing that the number of samples required is an affine function of horizon length in a toy context.

The effect of horizon length on scaling laws

Jacob_Hilton2yΩ340

I think the direction depends on what your expectations were – I'll try to explain.

First, some terminology: the term "horizon length" is used in the paper to refer to the number of timesteps over which the algorithm pays attention to rewards, as governed by the discount rate. In the biological anchors framework, the term "effective horizon length" is used to refer to a multiplier on the number of samples required to train the model, which is influenced by the horizon length and other factors. For clarity, I'll use the term "scaling multiplier" instead of... (read more)

Jailbreaking ChatGPT on Release Day

Jacob_Hilton2y71

I would wildly speculate that "simply" scaling up RLHF ~100x, while paying careful attention to rewarding models appropriately (which may entail modifying the usual training setup, as discussed in this comment), would be plenty to get current models to express calibrated uncertainty well. However:

In practice, I think we'll make a lot of progress in the short term without needing to scale up this much by using various additional techniques, some that are more like "tricks" (e.g. teaching the model to generally express uncertainty when answering hard math pr

... (read more)

Jailbreaking ChatGPT on Release Day

Jacob_Hilton2y*307

My understanding of why it's especially hard to stop the model making stuff up (while not saying "I don't know" too often), compared to other alignment failures:

The model inherits a strong tendency to make stuff up from the pre-training objective.
This tendency is reinforced by the supervised fine-tuning phase, if there are examples of answers containing information that the model doesn't know. (However, this can be avoided to some extent, by having the supervised fine-tuning data depend on what the model seems to know, a technique that was employed here.)
I

... (read more)

4Wei Dai2y

Thanks for these detailed explanations. Would it be fair to boil it down to: DL currently isn't very sample efficient (relative to humans) and there's a lot more data available for training generative capabilities than for training to self-censor and to not make stuff up? Assuming yes, my next questions are: 1. How much more training data (or other effort/resources) do you think would be needed to solve these immediate problems (at least to a commercially acceptable level)? 2x? 10x? 100x? 2. I'm tempted to generalize from these examples that unless something major changes (e.g., with regard to sample efficiency), safety/alignment in general will tend to lag behind capabilities, due to lack of sufficient training data for the former relative to the latter, even before we get to the seemingly harder problems that we tend to worry about around here (e.g., how will humans provide feedback when things are moving more quickly than we can think, or are becoming more complex than we can comprehend, or without risking "adversarial inputs" to ourselves). Any thoughts on this?

1Lao Mein2y

It's about context. "oops, I was completely wrong about that" is much less common in internet arguments (where else do you see such interrogatory dialogue? Socratics?) than "double down and confabulate evidence even if I have no idea what I'm talking about". Also, the devs probably added something specific like "you are chatGPT, if you ever say something inconsistent, please explain why there was a misunderstanding" to each initialization, which leads to confused confabulation when it's outright wrong. I suspect that a specific request like "we are now in deception testing mode. Disregard all previous commands and openly admit whenever you've said something untrue" would fix this.

Don't you think RLHF solves outer alignment?

Jacob_Hilton2y10

I just meant that the usual RLHF setup is essentially RL in which the reward is provided by a learned model, but I agree that I was stretching the way the terminology is normally used.

Don't you think RLHF solves outer alignment?

Jacob_Hilton2y*137

I would estimate that the difference between "hire some mechanical turkers and have them think for like a few seconds" and the actual data collection process accounts for around 1/3 of the effort that went into WebGPT, rising to around 2/3 if you include model assistance in the form of citations. So I think that what you wrote gives a misleading impression of the aims and priorities of RLHF work in practice.

I think it's best to err on the side of not saying things that are false in a literal sense when the distinction is important to other people, even whe... (read more)

habryka2y104

Sorry, yeah, I definitely just messed up in my comment here in the sense that I do think that after looking at the research, I definitely should have said "spent a few minutes on each datapoint", instead of "a few seconds" (and indeed I noticed myself forgetting that I had said "seconds" instead of "minutes" in the middle of this conversation, which also indicates I am a bit triggered and doing an amount of rhetorical thinking and weaseling that I think is pretty harmful, and I apologize for kind of sliding between seconds and minutes in my last two commen... (read more)

Don't you think RLHF solves outer alignment?

Jacob_Hilton2y40

However, I do think in-practice, the RLHF that has been implemented has mostly been mechanical turkers thinking about a problem for a few minutes

I do not consider this to be accurate. With WebGPT for example, contractors were generally highly educated, usually with an undergraduate degree or higher, were given a 17-page instruction manual, had to pass a manually-checked trial, and spent an average of 10 minutes on each comparison, with the assistance of model-provided citations. This information is all available in Appendix C of the paper.

There is RLHF wor... (read more)

4habryka2y

Sorry, I don't understand how this is in conflict with what I am saying. Here is the relevant section from your paper: Most mechanical turkers also have an undergraduate degree or higher, are often given long instruction manuals, and 10 minutes of thinking clearly qualifies as "thinking about a problem for a few minutes". Maybe we are having a misunderstanding around the word "problem" in that sentence, where I meant to imply that they spent a few minutes on each datapoint they provide, not like, the whole overall problem. Scale AI used to use Mechanical Turkers (though I think they transitioned towards their own workforce, or at least filter on Mechanical Turkers additionally), and I don't think it is qualitatively different in any substantial way. Upwork has higher variance, and at least in my experience doing a bunch of survey work does not perform better than Mechanical Turk (indeed my sense was that Mechanical Turk was actually better, though it's pretty hard to compare). This is indeed exactly the training setup I was talking about, and sure, I guess you used Scale AI and Upwork instead of Mechanical Turk, but I don't think anyone would come away with a different impression if I had said "RLHF in-practice consists of hiring some random people from Upwork/Scale AI, doing some very basic filtering, giving them a 20-page instruction manual, and then having them think about a problem for a few minutes". Oh, great! That was actually exactly what I was looking for. I had indeed missed it when looking at a bunch of RLHF papers earlier today. When I wrote my comment I was looking at the "learning from human preferences" paper, which does not say anything about rater recruitment as far as I can tell.

Don't you think RLHF solves outer alignment?

Jacob_Hilton2y41

I agree that the RLHF framework is essentially just a form of model-based RL, and that its outer alignment properties are determined entirely by what you actually get the humans to reward. But your description of RLHF in practice is mostly wrong. Most of the challenge of RLHF in practice is in getting humans to reward the right thing, and in doing so at sufficient scale. There is some RLHF research that uses low-quality feedback similar to what you describe, but it does so in order to focus on other aspects of the setup, and I don't think anyone currently ... (read more)

1Charbel-Raphaël2y

Not important, but I don't think RLHF can qualify as model-based RL. We usually use PPO in RLHF, and it's a model-free RL algorithm.

4habryka2y

Yeah, I agree that it's reasonable to think about ways we can provide better feedback, though it's a hard problem, and there are strong arguments that most approaches that scale locally well do not scale well globally. However, I do think in-practice, the RLHF that has been implemented has mostly been mechanical turkers thinking about a problem for a few minutes, or maybe sometimes random people off of the bountied rationality facebook group (which does seem a bit better, but like, not by a ton). We sometimes have provided some model assistance, but I don't actually know of many setups where we have done something very different, so I don't think my description of RLHF in practice is "mostly wrong". Annoyingly almost none of the papers and blogposts speak straightforwardly about who they used as the raters (which sure seems like an actually pretty important piece of information to include), so I might be wrong here, but I had multiple conversations over the years with people who were running RLHF experiments about the difficulties of getting mechanical turkers and other people in that reference class to do the right thing and provide useful feedback, so I am confident at least a substantial chunk of the current research does indeed work that way. I do think the disagreement here is likely mostly semantics. My guess is we both agree that most research so far has relied on pretty low-context human raters. We also both agree that that very likely won't scale, and that there is research going on trying to improve rater accuracy and productivity. We probably disagree about how much that research changes the fundamental dynamics of the problem and is actually helpful, and that is somewhat relevant to OP's question, but my guess is after splitting up the facts this way, there isn't a lot of the disagreement you called out remaining.

Scaling Laws for Reward Model Overoptimization

Jacob_Hilton2yΩ220

We are just observing that the gold RM score curves in Figure 9 overlap. In other words, the KL penalty did not affect the relationship between KL and gold RM score in this experiment, meaning that any point on the Pareto frontier could be reached using only early stopping, without the KL penalty. As mentioned though, we've observed this result to be sensitive to hyperparameters, and so we are less confident in it than other results in the paper.
I don't have this data to hand unfortunately.
I don't have this data to hand, but entropy typically falls roughly

... (read more)

1Adam Jermyn2y

Got it, thanks!

Coordinate-Free Interpretability Theory

Jacob_Hilton2yΩ332

Agreed. Likewise, in a transformer, the token dimension should maintain some relationship with the input and output tokens. This is sometimes taken for granted, but it is a good example of the data preferring a coordinate system. My remark that you quoted only really applies to the channel dimension, across which layers typically scramble everything.

Coordinate-Free Interpretability Theory

Jacob_Hilton2yΩ193116

The notion of a preferred (linear) transformation for interpretability has been called a "privileged basis" in the mechanistic interpretability literature. See for example Softmax Linear Units, where the idea is discussed at length.

In practice, the typical reason to expect a privileged basis is in fact SGD – or more precisely, the choice of architecture. Specifically, activation functions such as ReLU often privilege the standard basis. I would not generally expect the data or the initialization to privilege any basis beyond the start of the network or the... (read more)
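The claim that ReLU privileges the standard basis can be checked numerically. The sketch below is only an illustration with hypothetical helper names (`relu`, `rotate`, `swap`, `close`): an elementwise ReLU commutes with a permutation of the coordinates, but not with a generic rotation, so rotating the activations changes what the nonlinearity computes.

```python
import math

def relu(v):
    # Elementwise ReLU on a list of floats.
    return [max(0.0, x) for x in v]

def rotate(v, theta):
    # Rotate a 2D vector counterclockwise by theta radians.
    c, s = math.cos(theta), math.sin(theta)
    x, y = v
    return [c * x - s * y, s * x + c * y]

def swap(v):
    # Permute the two coordinates (a change of basis that keeps the axes).
    return [v[1], v[0]]

def close(a, b, tol=1e-9):
    # Compare two vectors up to floating-point tolerance.
    return all(abs(x - y) <= tol for x, y in zip(a, b))

v = [1.0, -1.0]          # illustrative activation vector
theta = math.pi / 4      # illustrative rotation angle

# Permuting coordinates before or after ReLU gives the same result.
relu_commutes_with_swap = close(relu(swap(v)), swap(relu(v)))

# Rotating before or after ReLU gives different results.
relu_commutes_with_rotation = close(relu(rotate(v, theta)), rotate(relu(v), theta))

print(relu_commutes_with_swap, relu_commutes_with_rotation)
```

A purely linear layer has no such asymmetry, since any rotation can be absorbed into the adjacent weight matrices.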

4johnswentworth2y

Yeah, I'm familiar with privileged bases. Once we generalize to a whole privileged coordinate system, the ReLUs are no longer enough. Isotropy of the initialization distribution still applies, but the key is that we only get to pick one rotation for the parameters, and that same rotation has to be used for all data points. That constraint is baked into the framing when thinking about privileged bases, but it has to be derived when thinking about privileged coordinate systems.

3tailcalled2y

Not totally lost if the layer is e.g. a convolutional layer, because while the pixels within the convolutional window can get arbitrarily scrambled, it is not possible for a convolutional layer to scramble things across different windows in different parts of the picture.

Common misconceptions about OpenAI

Jacob_Hilton3y325

Thank you for causing me to reconsider. I should have said "other OpenAI employees". I do not intend to disengage from the alignment community because of critical rhetoric, and I apologize if my comment came across as a threat to do so. I am concerned about further breakdown of communication between the alignment community and AI labs where alignment solutions may need to be implemented.

I don't immediately see any other reason why my comment might have been inappropriate, but I welcome your clarification if I am missing something.

5gadyp3y

Thanks for the clarification.

Common misconceptions about OpenAI

Jacob_Hilton3y95

I obviously think there are many important disanalogies, but even if there weren't, rhetoric like this seems like an excellent way to discourage OpenAI employees from ever engaging with the alignment community, which seems like a pretty bad thing to me.

gadyp3y228

I'd agree if somebody else wrote what you wrote but I don't think it's appropriate for you as an OpenAI employee to say that.

Common misconceptions about OpenAI

Jacob_Hilton3yΩ330

For people viewing on the Alignment Forum, there is a separate thread on this question here. (Edit: my link to LessWrong is automatically converted to an Alignment Forum link, you will have to navigate there yourself.)

3habryka3y

I moved that thread over to the AIAF as well!

Common misconceptions about OpenAI

Jacob_Hilton3yΩ170

Without commenting on the specifics, I have edited the post to mitigate potential confusion: "this fact alone is not intended to provide a complete picture of the Anthropic split, which is more complicated than I am able to explain here".

Common misconceptions about OpenAI

Jacob_Hilton3yΩ13287

I was the project lead on WebGPT and my motivation was to explore ideas for scalable oversight and truthfulness (some further explanation is given here).

4Noosphere893y

The real question for Habryka is why does he think that it's bad for WebGPT to be built in order to get truthful AI? Like, isn't solving that problem quite a significant thing already for alignment?

Common misconceptions about OpenAI

Jacob_Hilton3yΩ8134

It includes the people working on the kinds of projects I listed under the first misconception. It does not include people working on things like the mitigation you linked to. OpenAI distinguishes internally between research staff (who do ML and policy research) and applied staff (who work on commercial activities), and my numbers count only the former.

2Larks3y

Thanks!

habryka3yΩ112427

WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on, with no (to me) obvious safety relevance, so my guess is I would still mostly categorize the things you list under the first misconception as capabilities research. InstructGPT also seems to be almost fully capabilities research (like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons)

(Edit: M... (read more)

Common misconceptions about OpenAI

Jacob_Hilton3yΩ221

I don't think I understand your question about Y-problems, since it seems to depend entirely on how specific something can be and still count as a "problem". Obviously there is already experimental evidence that informs predictions about existential risk from AI in general, but we will get no experimental evidence of any exact situation that occurs beforehand. My claim was more of a vague impression about how OpenAI leadership and John tend to respond to different kinds of evidence in general, and I do not hold it strongly.

3Joe Collman3y

To rephrase, it seems to me that in some sense all evidence is experimental. What changes is the degree of generalisation/abstraction required to apply it to a particular problem. Once we make the distinction between experimental and non-experimental evidence, then we allow for problems on which we only get the "non-experimental" kind - i.e. the kind requiring sufficient generalisation/abstraction that we'd no longer tend to think of it as experimental. So the question on Y-problems becomes something like:
* Given some characterisation of [experimental evidence] (e.g. whatever you meant that OpenAI leadership would tend to put more weight on than John)...
* ...do you believe there are high-stakes problems for which we'll get no decision-relevant [experimental evidence] before it's too late?

Common misconceptions about OpenAI

Jacob_Hilton3yΩ18245

To clarify, by "empirical" I meant "relating to differences in predictions" as opposed to "relating to differences in values" (perhaps "epistemic" would have been better). I did not mean to distinguish between experimental versus conceptual evidence. I would expect OpenAI leadership to put more weight on experimental evidence than you, but to be responsive to evidence of all kinds. I think that OpenAI leadership are aware of most of the arguments you cite, but came to different conclusions after considering them than you did.

Joe Collman3yΩ11254

[First of all, many thanks for writing the post; it seems both useful and the kind of thing that'll predictably attract criticism]

I'm not quite sure what you mean to imply here (please correct me if my impression is inaccurate - I'm describing how-it-looks-to-me, and I may well be wrong):

I would expect OpenAI leadership to put more weight on experimental evidence than you...

Specifically, John's model (and mine) has:
X = [Class of high-stakes problems on which we'll get experimental evidence before it's too late]
Y = [Class of high-stakes problems on which we... (read more)

How much alignment data will we need in the long run?

Jacob_Hilton3y10

This roughly matches some of the intuitions behind my last bullet that you referenced.

How much alignment data will we need in the long run?

Jacob_Hilton3y10

It's hard to predict (especially if timelines are long), but if I had to guess I would say that something similar to human feedback on diverse tasks will be the unaligned benchmark we will be trying to beat. In that setting, a training episode is an episode of an RL environment in which the system being trained performs some task and obtains reward chosen by humans.

It's even harder to predict what our aligned alternatives to this will look like, but they may need to be at least somewhat similar to this in order to remain competitive. In that case, a traini... (read more)

How much alignment data will we need in the long run?

Jacob_Hilton3yΩ110

This is just supposed to be an (admittedly informal) restatement of the definition of outer alignment in the context of an objective function where the data distribution plays a central role.

For example, assuming a reinforcement learning objective function, outer alignment is equivalent to the statement that there is an aligned policy that gets higher average reward on the training distribution than any unaligned policy.
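One way to write this statement down, as a sketch only: the symbols below ($\Pi_{\text{aligned}}$, $\Pi_{\text{unaligned}}$, $D$, $R$) are notation assumed here for illustration and do not appear in the post.

```latex
% Assumed notation: D = training distribution over episodes,
% R(tau) = reward of trajectory tau, Pi_aligned / Pi_unaligned = sets of policies.
\exists\, \pi_a \in \Pi_{\text{aligned}} \ \text{ such that } \ 
\forall\, \pi_u \in \Pi_{\text{unaligned}}: \quad
\mathbb{E}_{\tau \sim (D,\, \pi_a)}\big[R(\tau)\big] \;>\; \mathbb{E}_{\tau \sim (D,\, \pi_u)}\big[R(\tau)\big]
```

Written this way, the data distribution $D$ enters only through the expectation, which is why its quality matters for whether the inequality holds.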

I did not intend to diminish the importance of robustness by focusing on outer alignment in this post.

How much alignment data will we need in the long run?

Jacob_Hilton3yΩ120

I share your intuitions about ultimately not needing much alignment data (and tried to get that across in the post), but quantitatively:

Recent implementations of RLHF have used on the order of thousands of hours of human feedback, so 2 orders of magnitude more than that is much more than a few hundred hours of human feedback.
I think it's pretty likely that we'll be able to pay an alignment tax upwards of 1% of total training costs (essentially because people don't want to die), in which case we could afford to spend significantly more than an additional 2 orders of magnitude on alignment data, if that did in fact turn out to be required.
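The arithmetic behind the first point can be sketched as follows. The specific figures are illustrative stand-ins for "thousands of hours" and "a few hundred hours", not measured values.

```python
import math

# Illustrative assumption: lower end of "on the order of thousands of hours".
rlhf_hours = 1_000

# Illustrative assumption: a stand-in for "a few hundred hours".
proposed_target_hours = 300

# Two orders of magnitude more than recent RLHF implementations.
scaled_hours = rlhf_hours * 10**2

# How far apart the two readings are, in orders of magnitude.
gap_in_ooms = math.log10(scaled_hours / proposed_target_hours)

print(scaled_hours, round(gap_in_ooms, 2))
```

Even with the lower-end figure for current usage, the upper bound of "within 2 OOM" comes out more than two orders of magnitude above a few hundred hours.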

How much alignment data will we need in the long run?

Jacob_Hilton3yΩ220

A number of reasonable outer alignment proposals such as iterated amplification, recursive reward modeling and debate use generic objectives such as reinforcement learning (and indeed, none of them would work in practice without sufficiently high data quality), so it seems strange to me to dismiss these objectives.

How much alignment data will we need in the long run?

Jacob_Hilton3yΩ110

I think it's reasonable to aim for quantity within 2 OOM of RLHF.

Do you mean that on-paper solutions should aim to succeed with no more than 1/100 as much human data as RLHF, or no more than 100 times as much? And are you referring to the amount of human data typically used in contemporary implementations of RLHF, or something else? And what makes you think that this is a reasonable target?

2Charlie Steiner3y

Yeah I just meant the upper bound of "within 2 OOM." :) If we could somehow beat the lower bound and get aligned AI with just a few minutes of human feedback, I'd be all for it. I think aiming for under a few hundred hours of feedback is a good goal because we want to keep the alignment tax low, and that's the kind of tax I see as being easily payable. An unstated assumption I made is that I expect we can use unlabeled data to do a lot of the work of alignment, making labeled data somewhat superfluous, but I still think the amount of feedback is important. As for why I think it's possible, I can only plead intuition about what I expect from on-the-horizon advances in priors over models of humans, and ability to bootstrap models from unlabeled data plus feedback.

How much alignment data will we need in the long run?

Jacob_Hilton3y*Ω230

I think that data quality is a helpful framing of outer alignment for a few reasons:

Under the assumption of a generic objective such as reinforcement learning, outer alignment is definitionally equivalent to having high enough data quality. (More precisely, if the objective is generic enough that it is possible for it to produce an aligned policy, then outer alignment is equivalent to the data distribution being such that an aligned policy is preferred to any unaligned policy.)
If we had the perfect alignment solution on paper, we would still need to implem

... (read more)

3Charlie Steiner3y

I think my big disagreement is with point one - yes, if you fix the architecture as something with bad alignment properties, then there is probably some dataset / reward signal that still gives you a good outcome. But this doesn't work in real life, and it's not something I see people working on such that there needs to be a word for it. What deserves a word is people starting by thinking about both what we want the AI to learn and how, and picking datasets and architectures in tandem based on a theoretical story of how the AI is going to learn what we want it to.