Jacob_Hilton

Yes, by "unconditionally" I meant "without an additional assumption". I don't currently see why the Reduction-Regularity assumption ought to be true (I may need to think about it more).

Thanks for writing this up! Your "amplified weak" version of the conjecture (with complexity bounds increasing exponentially in 1/ε) seems plausible to me. So if you could amplify the original (weak) conjecture to this unconditionally, it wouldn't significantly decrease my credence in the principle. But it would be nice to have this bound on what the dependence on ε would need to be.

The statements are equivalent if only a tiny fraction (tending to 0) of random reversible circuits satisfy P. We think this is very likely to be true, since it is a very weak consequence of the conjecture that random (sufficiently deep) reversible circuits are pseudorandom permutations. If it turned out not to be true, it would no longer make sense to think of P as an "outrageous coincidence", and so I think we would have to abandon the conjecture. So in short, we are happy to consider either version (though I agree that "for which P is false" is a bit more natural).

The hope is to use the complexity of the statement rather than mathematical taste.

If it takes me 10 bits to specify a computational possibility that ought to happen 1% of the time, then we shouldn't be surprised to find around 10 (~1% of 2^10) occurrences. We don't intend the no-coincidence principle to claim that these should all happen for a reason.
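
To spell out the back-of-the-envelope count behind the previous paragraph (my own arithmetic, only making the numbers explicit):

$$2^{10} \times 1\% \approx 10 \text{ expected occurrences},$$

so finding roughly 10 such coincidences is exactly what chance predicts, and none of them individually calls for a reason.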

Instead, we intend the no-coincidence principle to claim that if such coincidences happen much more often than we would have expected them to by chance, then there is a reason for that. Or put another way: if we applied k bits of selection to the statement of a coincidence whose probability is far smaller than 2^(-k), then there is a reason for it. (Hopefully the "outrageous" qualifier helps to indicate this, although we don't know whether Gowers meant quite the same thing as us.)

The formalization reflects this distinction: the property P is chosen to be so unlikely that we wouldn't expect it to happen for any circuit at all by chance, not merely that we wouldn't expect it to happen for a single random circuit. Hence, by the informal principle, there ought to be a reason for any occurrence of property P.
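
As a rough sketch of the counting behind this choice (a union-bound heuristic of my own; δ and N are placeholder symbols rather than notation from the post): if a single random circuit satisfies P with probability at most δ, and there are N circuits of the relevant size, then the expected number of circuits satisfying P by chance is at most

$$N \cdot \delta,$$

and P is chosen so that this product tends to zero, rather than merely δ itself.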

For the informal no-coincidence principle, it's important to us (and to Gowers IIUC) that a "reason" is not necessarily a proof, but could instead be a heuristic argument (in the sense of this post). We agree there are certainly apparently outrageous coincidences that may not be provable, such as Chebyshev's bias (discussed in the introduction to the post). See also John Conway's paper On Unsettleable Arithmetical Problems for a nice exposition of the distinction between proofs and heuristic arguments (he uses the word "probvious" for a statement with a convincing heuristic argument).

Correspondingly, our formalization doesn't bake in any sort of proof system. The verification algorithm V only has to correctly distinguish circuits that might satisfy property P from random circuits, using the advice string π – it doesn't necessarily have to interpret π as a proof and verify its correctness.

It's not, but I can understand your confusion, and I think the two are related. To see the difference, suppose hypothetically that 11% of the first million digits in the decimal expansion of π were 3s. Inductive reasoning would say that we should expect this pattern to continue. The no-coincidence principle, on the other hand, would say that there is a reason (such as a proof or a heuristic argument) for our observation, which may or may not predict that the pattern will continue. But if there were no such reason and yet the pattern continued, then the no-coincidence principle would be false, whereas inductive reasoning would have been successful.
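
For concreteness, here is a small script (my own illustration, not part of the original discussion) that carries out the hypothetical observation: compute many digits of π and measure the frequency of 3s. In reality the frequency comes out close to 10%, which is why the 11% scenario would be so striking. The constant N_DIGITS is an arbitrary choice of mine, set to 100,000 rather than a million to keep the runtime short.

```python
# Measure the frequency of the digit 3 in the decimal expansion of pi.
from mpmath import mp, nstr

N_DIGITS = 100_000          # bump to 1_000_000 to match the thought experiment
mp.dps = N_DIGITS + 10      # working precision, with a few guard digits

pi_value = +mp.pi                          # force evaluation at this precision
pi_string = nstr(pi_value, N_DIGITS + 1)   # "3.14159..." with N_DIGITS decimals
fractional = pi_string.split(".")[1][:N_DIGITS]

count_of_3s = fractional.count("3")
print(f"frequency of 3s: {count_of_3s / len(fractional):.4%}")
```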

So I think one can view the no-coincidence principle as a way to argue in favor of induction (in the context of formally-defined phenomena): when there is a surprising pattern, the no-coincidence principle says that there is a reason for it, and this reason may predict that the pattern will continue (although we can't be sure of this until we find the reason). Interestingly, one could also use induction to argue in favor of the no-coincidence principle: we can usually find reasons for apparently outrageous coincidences in mathematics, so perhaps they always exist. But I don't think they are the same thing.

Good question! We also think that NP ≠ co-NP. The difference between 99% (our conjecture) and 100% (NP = co-NP) is quite important, essentially because 99% of random objects "look random", but not 100%. For example, consider a uniformly random bit string x of some large length n. We can quite confidently say things like: the number of 0s in x is close to n/2 (say, between 0.49n and 0.51n); there is no streak of (say) 10 log₂(n) alternating 0s and 1s; etc. But these only hold with 99% confidence (more precisely, with probability tending to 1), not 100%.
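
A quick simulation of the flavour of statement I mean (my own sketch; the length n and the thresholds are arbitrary choices of mine):

```python
# Sample a uniformly random bit string and check two "looks random" properties:
# the fraction of 0s is close to 1/2, and there is no very long alternating
# streak 0101... Each holds with probability tending to 1, but not always.
import math
import random

n = 1_000_000
x = [random.getrandbits(1) for _ in range(n)]

zeros = x.count(0)
print(f"fraction of 0s: {zeros / n:.4f}")   # very close to 0.5

longest, current = 1, 1
for i in range(1, n):
    current = current + 1 if x[i] != x[i - 1] else 1
    longest = max(longest, current)

# The longest alternating streak is typically around log2(n) ~ 20 here,
# far below a threshold like 10 * log2(n) ~ 200.
print(f"longest alternating streak: {longest}")
```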

Going back to the conjecture statement, the job of the verification algorithm V is much harder for 100% than for 99%. For 100%, V has to definitively tell (with the help of a certificate) whether a circuit has property P. Whereas for 99%, it simply has to spot (again with the help of a "certificate" of sorts) any structure at all that reveals the circuit to be non-random in a way that could cause it to have property P. For example, V could start by checking the proportions of different types of gates, and if these differed too much from a random circuit, immediately reject the circuit out of hand for being "possibly structured". Footnote 6 has another example of structure that could cause a circuit to have property P, which seems much harder for a 100%-V to deal with.
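
To make the "reject out of hand" step concrete, here is a toy sketch of what such a first-pass check might look like. Everything in it — the gate encoding, the gate set, the threshold, the function name — is hypothetical and purely illustrative; it is not the verification algorithm from the post, just the simplest possible example of spotting non-random structure.

```python
# Toy first-pass check for a 99%-style verifier: compare observed gate-type
# proportions against those of a random circuit, and reject circuits whose
# proportions deviate by too many standard deviations as "possibly structured".
import math
import random
from collections import Counter

def looks_random_gate_mix(gates, expected_freqs, z_threshold=6.0):
    """Return False if any gate type's count deviates from its expectation
    by more than z_threshold standard deviations."""
    m = len(gates)
    counts = Counter(gates)
    for gate_type, q in expected_freqs.items():
        mean = m * q
        std = math.sqrt(m * q * (1 - q))
        if abs(counts.get(gate_type, 0) - mean) > z_threshold * std:
            return False   # reject out of hand
    return True

# Hypothetical gate mix for a "random" circuit on this gate set.
expected = {"TOFFOLI": 0.5, "CNOT": 0.25, "NOT": 0.25}
random_circuit = random.choices(list(expected), weights=list(expected.values()), k=10_000)
structured_circuit = ["NOT"] * 10_000   # an obviously non-random gate mix

print(looks_random_gate_mix(random_circuit, expected))       # almost always True
print(looks_random_gate_mix(structured_circuit, expected))   # False
```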

Before reversible circuits, we first considered a simpler setting: triangle counting. The no-coincidence principle in that setting turned out to be true, but for a relatively uninteresting reason, because the domain was not rich enough. Nevertheless, I think this result serves as a helpful exercise for people trying to get to grips with our definitions, as well as providing more of the story about how we ended up with our reversible circuits statement.

In the triangle-counting setting, we consider the distribution D over undirected 3-partite graphs on 3 groups of n vertices, obtained by taking each of the 3n² possible edges (between vertices in different groups) to be present independently with probability p. We take p so that there are many more edges than vertices, but far fewer than the total number of possible edges.

For a randomly sampled G ∼ D, one can check that the number of triangles in G has mean n³p³ (there are n³ possible triangles, each present with probability p³) and an explicitly computable variance. If G has a triangle count exceeding this mean by sufficiently many standard deviations, then we consider this to be an "outrageous coincidence". (This is "outrageous" because if the number of triangles were normally distributed with this mean and variance, then the probability of such an excess for any of the 2^(3n²) possible graphs would tend to zero.) This motivates the following statement, which turns out to be true:

Proposition (No-coincidence principle for triangle counting). For any p as above, there is a linear-time verification algorithm V that receives as input:

  • A 3-partite graph G on 3 groups of n vertices, represented as a list of edges
  • An advice string π

such that:

  • For all graphs G with an outrageously large number of triangles (in the sense above), if n is sufficiently large then there exists π, with length linear in the number of edges of G, such that V(G, π) = 1.
  • For G ∼ D, the probability that there is any π with V(G, π) = 1 tends to zero.

(Note that the result would be trivial if we allowed V to run in polynomial time, since it could then directly count the number of triangles, so we need to be more fine-grained than that.)
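
As a quick numerical sanity check of the mean n³p³ mentioned above (my own simulation; the particular values of n and p are arbitrary, and the trace-of-matrix-product trick is just one convenient way to count triangles):

```python
# Sample a random 3-partite graph G ~ D and compare its triangle count with
# the predicted mean n^3 * p^3. A triangle uses one vertex from each group,
# so the count equals trace(A @ B @ C) for the three biadjacency matrices.
import numpy as np

n, p = 300, 0.05
rng = np.random.default_rng(0)

A = (rng.random((n, n)) < p).astype(np.int64)   # edges between groups 1 and 2
B = (rng.random((n, n)) < p).astype(np.int64)   # edges between groups 2 and 3
C = (rng.random((n, n)) < p).astype(np.int64)   # edges between groups 3 and 1

observed = int(np.trace(A @ B @ C))
print(f"observed triangles:  {observed}")
print(f"predicted mean n³p³: {n**3 * p**3:.0f}")
```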

Dmitry Vaintrob and Aryan Bhatt were able to generalize this result to polygon counting (note that we consider graphs on k groups of n vertices arranged in a cycle, not k-partite graphs in general, so there are kn² possible edges, not k(k−1)n²/2) and to permanents. So we concluded that neither of these settings seemed to be rich enough to produce an interesting no-coincidence principle (even though permanents are #P-hard to compute), and moved on to considering circuits instead.

Hint for polygon counting:

Write the number of polygons as a cyclic trace tr(A_1 A_2 ⋯ A_k), where A_i is the n × n biadjacency matrix between group i and group i + 1 (indices mod k).
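
Here is a small numerical check of that identity (my own sketch; the notation A_1, …, A_k for the biadjacency matrices between consecutive groups follows the hint above, and the parameters are arbitrary):

```python
# Count k-gons (one vertex per group, groups arranged in a cycle) via the
# cyclic trace tr(A_1 A_2 ... A_k), and verify against brute force.
from itertools import product
import numpy as np

k, n, p = 4, 6, 0.5
rng = np.random.default_rng(1)

# A[i] holds the edges between group i and group (i + 1) mod k.
A = [(rng.random((n, n)) < p).astype(np.int64) for _ in range(k)]

via_trace = int(np.trace(np.linalg.multi_dot(A)))

# Brute force: pick one vertex per group and require all k consecutive edges.
via_brute_force = sum(
    all(A[i][verts[i], verts[(i + 1) % k]] for i in range(k))
    for verts in product(range(n), repeat=k)
)

print(via_trace, via_brute_force)   # the two counts agree
```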

It sounds like we are not that far apart here. We've been doing some empirical work on toy systems to try to make the leap from mechanistic interpretability "stories" to semi-formal heuristic explanations. The max-of-k draft is an early example of this, and we have more ambitious work in progress along similar lines. I think of this work in a similar way to you: we are not trying to test empirical assumptions (in the way that some empirical work on frontier LLMs is, for example), but rather to learn from the process of putting our ideas into practice.

For those who are interested in the mathematical details, but would like something more accessible than the paper itself, see this talk I gave about the paper:
