All of carboniferous_umbraculum's Comments + Replies

I for one would find it helpful if you included a link to at least one place that Eliezer had made this claim just so we can be sure we're on the same page. 

Roughly speaking, what I have in mind is that there are at least two possible claims. One is that 'we can't get AI to do our alignment homework' because by the time we have a very powerful AI that can solve alignment homework, it is already too dangerous to use the fact it can solve the homework as a safety plan. And the other is the claim that there's some sort of 'intrinsic' reason why an AI built by humans could never solve alignment homework.

You refer a couple of times to the fact that evals are often used with the aim of upper bounding capabilities. To my mind this is an essential difficulty that acts as a point of disanalogy with things like aviation. I’m obviously no expert but in the case of aviation, I would have thought that you want to give positive answers to questions like “can this plane safely do X thousand miles?” - i.e. produce absolutely guaranteed lower bounds on ‘capabilities’. You don’t need to find something like the approximately smallest number Y such that it could never under any circumstances ever fly more than Y million miles.

Hmm it might be questionable to suggest that it is "non-AI" though? It's based on symbolic and algebraic deduction engines and afaict it sounds like it might be the sort of thing that used to be very much mainstream "AI" i.e. symbolic AI + some hard-coded human heuristics?

2ryan_greenblatt
Sure, just seems like a very non-central example of AI from the typical perspective of LW readers.

FWIW I did not interpret Thane as necessarily having "high confidence" in "architecture / internal composition" of AGI. It seemed to me that they were merely (and ~accurately) describing what the canonical views were most worried about. (And I think a discussion about whether or not being able to "model the world" counts as a statement about "internal composition" is sort of beside the point/beyond the scope of what's really being said)

It's fair enough if you would say things differently(!) but in some sense isn't it just pointing out: 'I would emphasize d... (read more)

Newtonian mechanics was systematized as a special case of general relativity.

One of the things I found confusing early on in this post was that systemization is said to be about representing the previous thing as an example or special case of some other thing that is both simpler and more broadly-scoped. 

In my opinion, it's easy to give examples where the 'other thing' is more broadly-scoped and this is because 'increasing scope' corresponds to the usual way we think of generalisation, i.e. the latter thing applies to more settings or it is 'about a wi... (read more)

2Richard_Ngo
Yeah, good point. The intuition I want to point at here is "general relativity was simpler than Newtonian mechanics + ad-hoc adjustments for Mercury's orbit". But I do think it's a little tricky to pin down the sense in which it's simpler. E.g. what if you didn't actually have any candidate explanations for why Mercury's orbit was a bit off? (But you'd perhaps always have some hypothesis like "experimental error", I guess.) I'm currently playing around with the notion that, instead of simplicity, we're actually optimizing for something like "well-foundedness", i.e. the ability to derive everything from a small set of premises. But this feels close enough to simplicity that maybe I should just think of this as one version of simplicity.

OK I think this will be my last message in this exchange but I'm still confused. I'll try one more time to explain what I'm getting at. 

I'm interested in what your precise definition of subjective probability is. 

One relevant thing I saw was the following sentence:

If I say that a coin is 50% likely to come up heads, that's me saying that I don't know the exact initial conditions of the coin well enough to have any meaningful knowledge of how it's going to land, and I can't distinguish between the two options.

It seems to give something like a defi... (read more)

So my point is still: What is that thing? I think yes I actually am trying to push proponents of this view down to the metaphysics - If they say "there's a 40% chance that it will rain tomorrow", I want to know things like what it is that they are attributing 40%-ness to.  And what it means to say that that thing "has probability 40%".  That's why I fixated on that sentence in particular because it's the closest thing I could find to an actual definition of subjective probability in this post.

 

2TAG
Which view? Subjective probability? Subjective probability is a credence, a level of belief.

I have in mind very simple examples.  Suppose that first I roll a die. If it doesn't land on a 6, I then flip a biased coin that lands on heads 3/5 of the time.  If it does land on a 6 I just record the result as 'tails'. What is the probability that I get heads? 

This is contrived so that the probability of heads is 

5/6 x 3/5 = 1/2.

But do you think that, in saying this, I mean something like "I don't know the exact initial conditions... well enough to have any meaningful knowledge of how it's going to land, and I can't distinguish be... (read more)
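For what it's worth, a quick Monte Carlo check of the arithmetic above (a minimal sketch, assuming a fair die and the stated 3/5 heads bias):

```python
import random

def trial():
    # Roll a fair die; on a 6 record 'tails', otherwise flip a coin that
    # lands heads 3/5 of the time.
    if random.randint(1, 6) == 6:
        return "tails"
    return "heads" if random.random() < 3 / 5 else "tails"

n = 1_000_000
heads = sum(trial() == "heads" for _ in range(n))
print(heads / n)  # ~0.5, matching 5/6 x 3/5 = 1/2
```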

1Isaac King
I don't understand how either of those are supposed to be a counterexample. If I don't know what seat is going to be chosen randomly each time, then I don't have enough information to distinguish between the outcomes. All other information about the problem (like the fact that this is happening on a plane rather than a bus) is irrelevant to the outcome I care about. This does strike me as somewhat tautological, since I'm effectively defining "irrelevant information" as "information that doesn't change the probability of the outcome I care about". I'm not sure how to resolve this; it certainly seems like I should be able to identify that the type of vehicle is irrelevant to the question posed and discard that information.

We might be using "meaning" differently then!

I'm fine with something being subjective, but what I'm getting at is more like: Is there something we can agree on about which we are expressing a subjective view? 

2TAG
Sure, if we are observing the same things and ignorant about the same things. Subjective doesn't mean necessarily different.

I'm kind of confused what you're asking me - like which bit is "accurate" etc.. Sorry, I'll try to re-state my question again:

- Do you think that when someone says something has "a 50% probability" then they are saying that they do not have any meaningful knowledge that allows them to distinguish between two options?

I'm suggesting that you can't possibly think that, because there are obviously other ways things can end up 50/50. e.g. maybe it's just a very specific calculation, using lots of specific information, that ends up with the value 0.5 at the end.... (read more)

1Isaac King
No, I think what I said was correct? What's an example that you think conflicts with that interpretation?

Presumably you are not claiming that saying

...I don't know the exact initial conditions of the coin well enough to have any meaningful knowledge of how it's going to land, and I can't distinguish between the two options...

is actually necessarily what it means whenever someone says something has a 50% probability? Because there are obviously myriad ways something can have a 50% probability and this kind of 'exact symmetry between two outcomes' + no other information is only one very special way that it can happen. 

So what does it mean exactly when you say something is 50% likely?

1Isaac King
I think that's accurate, yeah. What's your objection to it?
2TAG
It doesn't have to have a single meaning. Objective probability and subjective probability can co-exist, and if you are just trying to calculate a probability, you don't have to worry about the metaphysics.

The traditional interpretation of probability is known as frequentist probability. Under this interpretation, items have some intrinsic "quality" of being some % likely to do one thing vs. another. For example, a coin has a fundamental probabilistic essence of being 50% likely to come up heads when flipped.

Is this right? I would have said that what you describe is a more like the classical, logical view of probability, which isn't the same as the frequentist view. Even the wiki page you've linked seems to disagree with what you've written, i.e. it describe... (read more)

1Isaac King
Yeah that was a mistake, I mixed frequentism and propensity together.
2bideup
Sounds like the propensity interpretation of probability.

My rejoinder to this is that, analogously to how a causal model can be re-implemented as a more complex non-causal model[2], a learning algorithm that looks at data that in some ways is saying something about causality, be it because the data contains information-decision-action-outcome units generated by agents, because the learning thing can execute actions itself and reflectively process the information of having done such actions, or because the data contains an abstract description of causality, can surely learn causality.

Short comment/feedback just t... (read more)

1Mo Putera
I had to read this sentence a few times to grok the author's point...

Ah OK, I think I've worked out where some of my confusion is coming from:  I don't really see any argument for why mathematical work may be useful, relative to other kinds of foundational conceptual work. e.g. you write (with my emphasis): "Current mathematical research could play a similar role in the coming years..." But why might it? Isn't that where you need to be arguing? 

The examples seem to be of cases where people have done some kind of conceptual foundational work which has later gone on to influence/inspire ML work. But early work on deception or Goodhart was not mathematical work; that's why I don't understand how these are examples.

 

2Davidmanheim
I think the dispute here is that you're interpreting mathematical too narrowly, and almost all of the work happening in agent foundations and similar is exactly what was being worked on by "mathematical AI research" 5-7 years ago. The argument was that those approaches have been fruitful, and we should expect them to continue to be so - if you want to call that "foundational conceptual research" instead of "Mathematical AI research," that's fine.

Thanks for the comment Rohin, that's interesting (though I haven't looked at the paper you linked).

I'll just record some confusion I had after reading your comment that stopped me replying initially: I was confused by the distinction between modular and non-modular because I kept thinking: If I add a bunch of numbers  and  and don't do any modding, then it is equivalent to doing modular addition modulo some large number (i.e. at least as large as the largest sum you get). And otoh if I tell you I'm doing 'addition modulo 113', but I o... (read more)

3Rohin Shah
I agree -- the point is that if you train on addition examples without any modular wraparound (whether you think of that as regular addition or modular addition with a large prime, doesn't super matter), then there is at least some evidence that you get a different representation than the one Nanda et al found.
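To make the equivalence mentioned above concrete, here is a tiny sketch (with hypothetical numbers): as long as every sum in the data stays below the modulus, 'addition mod p' and plain addition assign identical labels, so the two framings of the task coincide.

```python
# If all sums stay below the modulus p, the reduction mod p never fires,
# so "addition mod p" and plain addition give the same training labels.
p = 100_003  # a modulus larger than any sum below
pairs = [(12, 88), (113, 113), (40_000, 59_999)]

for a, b in pairs:
    assert (a + b) % p == a + b  # wraparound never triggers
    print(a, b, a + b)

# With a small modulus like 113 the wraparound does fire and the labels change:
print((100 + 50) % 113)  # 37, not 150
```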

I'm still not sure I buy the examples. In the early parts of the post you seem to contrast 'machine learning research agendas' with 'foundational and mathematical'/'agent foundations' type stuff. Mechanistic interpretability can be quite mathematical but surely it falls into the former category? i.e. it is essentially ML work as opposed to constituting an example of people doing "mathematical and foundational" work. 

I can't say much about the Goodhart's Law comment but it seems at best unclear that its link to goal misgeneralization is an example of t... (read more)

2Davidmanheim
I'm not really clear what you mean by not buying the example. You certainly seem to understand the distinction I'm drawing - mechanistic interpretability is definitely not what I mean by "mathematical AI safety," though I agree there is math involved. And I think the work on goal misgeneralization was conceptualized in ways directly related to Goodhart, and this type of problem inspired a number of research projects, including quantilizers, which is certainly agent-foundations work. I'll point here for more places the agent foundations people think it is relevant.

Strongly upvoted.

I roughly think that a few examples showing that this statement is true will 100% make OP's case. And that without such examples, it's very easy to remain skeptical.

Currently, it takes a very long time to get an understanding of who is doing what in the field of AI Alignment and how good each plan is, what the problems are, etc.
 

Is this not ~normal for a field that is maturing? And by normal I also mean approximately unavoidable or 'essential'. Like I could say 'it sure takes a long time to get an understanding of who is doing what in the field of... computer science', but I have no reason to believe that I can substantially 'fix' this situation in the space of a few months. It just really is because there is lot... (read more)

6Seth Herd
Sure, but that's no reason not to try to make it easier!
2Iknownothing
Thank you, I think there's an error in my phrasing.  I should have said:  Currently, it takes a very long time to get an idea of who is doing what in the field of AI Alignment and how good each plan is, what the problems are, etc.
1Iknownothing
not just that. It's because the field isn't organized at all. 

I think that perhaps as a result of a balance of pros and cons, I initially was not very motivated to comment (and haven't been very motivated to engage much with ARC's recent work).  But I decided maybe it's best to comment in a way that gives a better signal than silence. 

I've generally been pretty confused about Formalizing the presumption of Independence and, as the post sort of implies, this is sort of the main advert that ARC have at the moment for the type of conceptual work that they are doing, so most of what I have to say is meta stuff ... (read more)

I think this is a reasonable perception and opinion. We’ve written a little bit about how heuristic estimators might help with ELK (MAD and ELK and finding gliders), but that writing is not particularly clear and doesn’t present a complete picture.

We’ve mostly been focused on finding heuristic estimators, because I am fairly convinced they would be helpful and think that designing them is our key technical risk. But now that we are hiring again I think it’s important for us to explain publicly why they would be valuable, and to generally motivate and situa... (read more)

1Quinn
I can't say anything rigorous, sophisticated, or credible. I can just say that the paper was a very welcome spigot of energy and optimism in my own model of why "formal verification" -style assurances and QA demands are ill-suited to models (either behavioral evals or reasoning about the output of decompilers).

Have you seen https://www.alignment.org/blog/mechanistic-anomaly-detection-and-elk/ and any of the other recent posts on https://www.alignment.org/blog/? I don't think they make it obvious that formalizing the presumption of independence would lead to alignment solutions, but they do give a much more detailed explanation of why you might hope so than the paper.

How exactly can an org like this help solve what many people see as one of the main bottlenecks: the issue of mentorship? How would Catalyze actually tip the scales when it comes to 'mentor matching'?

(e.g. see Richard Ngo's first high-level point in this career advice post)

Hi Garrett, 

OK so just being completely honest, I don't know if it's just me but I'm getting a slightly weird or snarky vibe from this comment? I guess I will assume there is a good faith underlying point being made to which I can reply. So just to be clear:

  • I did not use any words such as "trivial", "obvious" or "simple". Stories like the one you recount are obviously making fun of mathematicians, some of whom do think it's cool to say things are trivial/simple/obvious after they understand them. I often strongly disagree and generally dislike this beh
... (read more)
4Garrett Baker
Sorry about that. On a re-read, I can see how the comment could be seen as snarky, but I was going more for critical via illustrative analogy. Oh the perils of the lack of inflection and facial expressions. I think your criticisms of my thought in the above comment are right-on, and you've changed my mind on how useful your post was. I do think that lots of progress can be made in understanding stuff by just finding the right frame by which the result seems natural, and your post is doing this. Thanks!

Interesting thoughts!

It reminds me (not only of my own writing on a similar theme) but of another one of these viewpoints/axes along which to carve interpretability work that is mentioned in this post by jylin04:


...a dream for interpretability research would be if we could reverse-engineer our future AI systems into human-understandable code. If we take this dream seriously, it may be helpful to split it into two parts: first understanding what "programming language" an architecture + learning algorithm will end up using at the end of training, and then wh

... (read more)

At the start you write

3. Unnecessarily diluting the field’s epistemics by introducing too many naive or overly deferent viewpoints.

And later Claim 3 is:


Scholars might defer to their mentors and fail to critically analyze important assumptions, decreasing the average epistemic integrity of the field
 

It seems to me there might be two things being pointed to?

A) Unnecessary dilution: Via too many naive viewpoints;
B) Excessive deference: Perhaps resulting in too few viewpoints or at least no new ones;

And arguably these two things are in tension, in the fol... (read more)

2Ryan Kidd
Mentorship is critical to MATS. We generally haven't accepted mentorless scholars because we believe that mentors' accumulated knowledge is extremely useful for bootstrapping strong, original researchers. Let me explain my chain of thought better: 1. A first-order failure mode would be "no one downloads experts' models, and we grow a field of naive, overconfident takes." In this scenario, we have maximized exploration at the cost of accumulated knowledge transmission (and probably useful originality, as novices might make the same basic mistakes). We patch this by creating a mechanism by which scholars are selected for their ability to download mentors' models (and encouraged to do so). 2. A second-order failure mode would be "everyone downloads and defers to mentors' models, and we grow a field of paradigm-locked, non-critical takes." In this scenario, we have maximized the exploitation of existing paradigms at the cost of epistemic diversity or critical analysis. We patch this by creating mechanisms for scholars to critically examine their assumptions and debate with peers.

Hey Joseph, thanks for the substantial reply and the questions!

 

 

Why call this a theory of interpretability as opposed to a theory of neural networks? 

Yeah this is something I am unsure about myself (I wrote: "something that I'm clumsily thinking of as 'the mathematics of (the interpretability of) deep learning-based AI'"). But I think I was imagining that a 'theory of neural networks' would be definitely broader than what I have in mind as being useful for not-kill-everyoneism. I suppose I imagine it including lots of things that are intere... (read more)

1Joseph Bloom
Thanks Spencer! I'd love to respond in detail but alas, I lack the time at the moment.  Some quick points: 1. I'm also really excited about SLT work.  I'm curious to what degree there's value in looking at toy models (such as Neel's grokking work) and exploring them via SLT or to what extent reasoning in SLT might be reinvigorated by integrating experimental ideas/methodology from MI (such as progress measures). It feels plausible to me that there just haven't been enough people in any of a number of intersections look at stuff and this is a good example. Not sure if you're planning on going to this: https://www.lesswrong.com/posts/HtxLbGvD7htCybLmZ/singularities-against-the-singularity-announcing-workshop-on but it's probably not in the cards for me. I'm wondering if promoting it to people with MI experience could be good. 2. I totally get what you're saying about toy model in sense A or B doesn't necessarily equate to a toy model  being a version of the hard part of the problem. This explanation helped a lot, thank you!  3. I hear what you are saying about next steps being challenging for logistical and coordination issues and because the problem is just really hard! I guess the recourse we have is something like: Look for opportunities/chances that might justify giving something like this more attention or coordination. I'm also wondering if there might be ways of dramatically lowering the bar for doing work in related areas (eg: the same way Neel writing TransformerLens got a lot more people into MI).  Looking forward to more discussions on this in the future, all the best!

I spent some time trying to formulate a good response to this that analyzed the distinction between (1) and (2) (in particular how it may map onto types of pseudo alignment described in RFLO here) but (and hopefully this doesn't sound too glib) it started to seem like it genuinely mattered whether humans in separate individual heavily-defended cells being pumped full of opiates have in fact been made to be 'happy' or not?

I think because if so, it is at least some evidence that the pseudo-alignment during training is for instrumental reasons (i.e. maybe it ... (read more)

This is a very strong endorsement but I'm finding it hard to separate the general picture from RFLO:


mesa-optimization occurs when a base optimizer...finds a model that is itself an optimizer,

where 

a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.

i.e. a mesa-optimizer is a learned model that 'performs inference' (i.e. evalua... (read more)

I've always found it a bit odd that Alignment Forum submissions are automatically posted to LW. 

If you apply some of these norms, then imo there are questionable implications, i.e. it seems weird to say that one should have read the sequences in order to post about mechanistic interpretability on the Alignment Forum.

5habryka
The AI Alignment Forum was never intended as the central place for all AI Alignment discussion. It was founded at a time when basically everyone involved in AI Alignment had read the sequences, and the goal was to just have any public place for any alignment discussion.  Now that the field is much bigger, I actually kind of wish there was another forum where AI Alignment people could go to, so we would have more freedom in shaping a culture and a set of background assumptions that allow people to make further strides and create a stronger environment of trust.  I personally am much more interested in reading about mechanistic interpretability from people who have read the sequences. That one in-particular is actually one of the ones where a good understanding of probability theory, causality and philosophy of science seems particularly important (again, it's not that important that someone has acquired that understanding via the sequences instead of some other means, but it does actually really benefit from a bunch of skills that are not standard in the ML or general scientific community).  I expect we will make some changes here in the coming months, maybe by renaming the forum or starting off a broader forum that can stand more on its own, or maybe just shutting down the AI Alignment Forum completely and letting other people fill that niche. 
4the gears to ascension
similarly, I've been frustrated that medium quality posts on lesswrong about ai often get missed in the noise. I want alignmentforum longform scratchpad, not either lesswrong or alignmentforum. I'm not even allowed to post on alignmentforum! some recent posts I've been frustrated to see get few votes and generally less discussion: * https://www.lesswrong.com/posts/JqWQxTyWxig8Ltd2p/relative-abstracted-agency - this one deserves at least 35 imo * www.lesswrong.com/posts/fzGbKHbSytXH5SKTN/penalize-model-complexity-via-self-distillation * https://www.lesswrong.com/posts/bNpqBNvfgCWixB2MT/towards-empathy-in-rl-agents-and-beyond-insights-from-1 * https://www.lesswrong.com/posts/LsqvMKnFRBQh4L3Rs/steering-systems * ... many more open in tabs I'm unsure about.

I really like this post and found it very interesting, particularly because I'm generally interested in the relationship between the rationality side of the AI Alignment community and academia, and I wanted to register some thoughts. Sorry for the long comment on an old post and I hope this doesn't come across as pernickety. If anything I sort of feel like TurnTrout is being hard on himself. 

I think the tl;dr for my comment is sort of that to me the social dynamics "mistakes" don't really seem like mistakes - or at least not ones that were actually ma... (read more)

I've only skimmed this, but my main confusions with the whole thing are still on a fairly fundamental level. 

You spend some time saying what abstractions are, but when I see the hypothesis written down, most of my confusion is on what "cognitive systems" are and what one means by "most". Afaict it really is a kind of empirical question to do with "most cognitive systems". Do we have in mind something like 'animal brains and artificial neural networks'? If so then surely let's just say that and make the whole thing more concrete; so I suspect not....bu... (read more)

1Jonas Hallgren
(My attempt at an explanation:) In short, we care about the class of observers/agents that get redundant information in a similar way. I think we can look at the specific dynamics of the systems described here to actually get a better perspective on whether the NAH should hold or not: * * I think you can think of the redundant information between you and the thing you care about as a function of all the steps in between for that information to reach you. * If we look at the question, we have a certain amount of necessary things for the (current implementation of) NAH to hold: * 1. Redundant information is rare * To see if this is the case you will want to look at each of the individual interactions and analyse to what degree redundant information is passed on. * I guess the question of "how brutal is the local optimisation environment" might be good to estimate each information redundancy (A,B,C,D in the picture). Another question is, "what level of noise do I expect to be formed at each transition?" as that would tell you to what degree the redundant information is lost in noise. (they pointed this out as the current hypothesis for usefulness in the post in section 2d.) * 2. The way we access said information is similar * If you can determine to what extent the information flow between two agents is similar, you can estimate a probability of natural abstractions occurring in the same way. * For example, if we use vision versus hearing, we get two different information channels & so the abstractions will most likely change. (Causal proximity of the individual functions is changed with regards to the flow of redundant information) * Based on this I would say that the question isn't really if it is true for NNs & brains in general but that it's rather more helpful to ask what information is abstracted with specific capabilities such as vision or access to language. * So it's more about the class of agents that follow these constraints

Something ~ like 'make it legit' has been and possibly will continue to be a personal interest of mine.

I'm posting this after Rohin entered this discussion - so Rohin, I hope you don't mind me quoting you like this, but fwiw I was significantly influenced by this comment on Buck's old talk transcript 'My personal cruxes for working on AI safety'. (Rohin's comment repeated here in full and please bear in mind this is 3 years old; his views I'm sure have developed and potentially moved a lot since then:)


I enjoyed this post, it was good to see this all laid o

... (read more)
3Rohin Shah
I still endorse that comment, though I'll note that it argues for the much weaker claims of * I would not stop working on alignment research if it turned out I wasn't solving the technical alignment problem * There are useful impacts of alignment research other than solving the technical alignment problem (As opposed to something more like "the main thing you should work on is 'make alignment legit'".) (Also I'm glad to hear my comments are useful (or at least influential), thanks for letting me know!)

Certainly it's not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here.  And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.

Ok it sounds to me like maybe there's at least two things being talked about here. One situation is

 A) Where a community includes different groups working on the same topic, and where th... (read more)

Re: e.g. superposition/entanglement: 

I think people should try to understand the wider context into which they are writing, but I don't see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names. In fact I'd say this happens all the time and generally people can just hold in their minds that another group has another name for it.  Naturally, the two groups will have slightly different perspectives and this a) Is often good, i.e. the interference can be constructive and b) Can be a reason in f... (read more)

1scasper
Thanks for the comment and pointing these things out.  --- Certainly it's not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here.  And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.  I don't know what we benefit from in this particular case with polysemanticity, superposition, and entanglement. Do you have a steelman for this more specific to these literatures?  --- Good point. I would not say that the issue with the feature visualization and zoom in papers were merely failing to cite related work. I would say that the issue is how they started a line of research that is causing confusion and redundant work. My stance here is based on how I see the isolation between the two types of work as needless. --- Thanks for pointing out these posts. They are examples of discussing a similar idea to MI's dependency on programmatic hypothesis generation, but they don't act on it. But they both serve to draw analogies instead of providing methods.  The thing in the front of my mind when I talk about how TAISIC has not sufficiently engaged with neurosymbolic work is the kind of thing I mentioned in the paragraph about existing work outside of TAISIC. I pasted it below for convenience :)
2[anonymous]
The main problem on this site is that despite people having widely varying levels of understanding of different subjects, nobody wants to look like an idiot on here. A lot of the comments and articles are basically nothing burgers. People often focus on insignificant points to argue about and waste their time on the social aspect of learning rather than actually learning about a subject themselves. This made me wonder: do actual researchers who have values and substance to offer and question not participate in online discussions? The closest I've found is wordpress blogs by various people with huge comment chains. The only other form of communication seems to be through formal papers, which is pretty much as organized as it gets in terms of format. I've learned that people who actually have deeper understanding and knowledge of value to offer don't waste their time on here. But I can't find any other platform that these people participate in. My guess is that they don't participate in any public discourse, only private conversations with other people who have things of value to offer and discuss.

Thanks very much for the comments I think you've asked a bunch of very good questions. I'll try to give some thoughts:

Deep learning as a field isn't exactly known for its rigor. I don't know of any rigorous theory that isn't as you say purely 'reactive', with none of it leading to any significant 'real world' results. As far as I can tell this isn't for a lack of trying either. This has made me doubt its mathematical tractability, whether it's because our current mathematical understanding is lacking or something else (DL not being as 'reductionist' as oth

... (read more)

Ah thanks very much Daniel. Yes now that you mention it I remember being worried about this a few days ago but then either forgot or (perhaps mistakenly) decided it wasn't worth expanding on. But yeah I guess you don't get a well-defined map until you actually fix how the tokenization happens with another separate algorithm. I will add to list of things to fix/expand on in an edit.
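As a toy illustration of the well-definedness point (a sketch with a made-up three-token vocabulary, not the actual tokenizer in question): the same string can admit more than one valid token sequence, so the map from strings to token sequences is only pinned down once a specific tokenization algorithm is fixed.

```python
# Hypothetical vocabulary in which the string "ab" segments in two different ways.
vocab = {"a", "b", "ab"}
candidate_tokenizations = [["ab"], ["a", "b"]]

for tokens in candidate_tokenizations:
    # Both are valid segmentations of "ab" under this vocabulary...
    assert all(t in vocab for t in tokens) and "".join(tokens) == "ab"
    print(tokens)

# ...so "string -> token sequence" is not a function until an algorithm is fixed,
# e.g. greedy longest-match, which would always pick ["ab"] here.
```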

>There is no difference between natural phenomena and DNNs (LLMs, whatever). DNNs are 100% natural

I mean "natural" as opposed to "man made". i.e. something like "occurs in nature without being built by something or someone else". So in that sense, DNNs are obviously not natural in the way that the laws of physics are.

I don't see information and computation as only mathematical; in fact in my analogies I describe the mathematical abstractions we build as being separate from the things that one wants to describe or make predictions about.  And this... (read more)

-1Roman Leventov
I feel that you are redefining terms. Writing down mathematical equations (or defining other mathematical structures that are not equations, e.g., automata), describing natural phenomena, and proving some properties of these, i.e., deriving some mathematical conjectures/theorems, -- that's exactly what physicists do, and they call it "doing physics" or "doing science" rather than "doing mathematics".  I wonder how would you draw the boundary between "man-made" and "non-man-made", the boundary that would have a bearing on such a fundamental qualitative distinction of phenomena as the amenability to mathematical description. According to Fields et al.'s theory of semantics and observation ("quantum theory […] is increasingly viewed as a theory of the process of observation itself"), which is also consistent with predictive processing and Seth's controlled hallucination theory which is a descendant of predictive processing, any observer's phenomenology is what makes mathematical sense by construction. Also, here Wolfram calls approximately the same thing "coherence". Of course, there are infinite phenomena both in "nature" and "among man-made things" the mathematical description of which would not fit our brains yet, but this also means that we cannot spot these phenomena. We can extend the capacity of our brains (e.g., through cyborgisation, or mind upload), as well as equip ourselves with more powerful theories that allow us to compress reality more efficiently and thus spot patterns that were not spottable before, but this automatically means that these patterns become mathematically describable. This, of course, implies that we ought to make our minds stronger (through technical means or developing science) precisely to timely spot the phenomena that are about to "catch us". This is the central point of Deutsch's "The Beginning of Infinity". Anyway, there is no point in arguing this point fiercely because I'm kind of on "your side" here, arguing that your worr

I may come back to comment more or incorporate this post into something else I write but wanted to record my initial reaction which is that I basically believe the claim. I also think that the 'unrelated bonus reason' at the end is potentially important and probably deserves more thought.

1Arthur Conmy
Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood's interpretability approach here, another example of "recruiting resources outside of the model alone". (however, it doesn't seem obvious to me that interpretability can't or won't work in such settings)

Interesting idea. I think it’s possible that a prize is the wrong thing for getting the best final result (but also possible that getting a half decent result is more important than a high variance attempt at optimising for the best result). My thinking is: To do what you’re suggesting to a high standard could take months of serious effort. The idea of someone really competent doing so just for the chance at some prize money doesn’t quite seem right to me… I think there could be people out there who in principle could do it excellently but who would want to know that they’d ‘got the job’ as it were before spending serious effort on it.

I think I would support Joe's view here that clarity and rigour are significantly different... but maybe - David - your comments are supposed to be specific to alignment work? e.g. I can think of plenty of times I have read books or articles in other areas and fields that contain zero formal definitions, proofs, or experiments but are obviously "clear", well-explained, well-argued etc. So by your definitions is that not a useful and widespread form of rigour-less clarity? (One that we would want to 'allow' in alignment work?) Or would you instead maintain ... (read more)

I agree that the space  may well miss important concepts and perspectives. As I say, it is not my suggestion to look at it, but rather just something that was implicitly being done in another post. The space  may well be a more natural one. (It's of course the space of functions , and so a space in which 'model space' naturally sits in some sense. )

You're right about the loss thing; it isn't as important as I first thought it might be. 

It's an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.
 

I'm sorry but the fact that it is scalar output isn't explained and a network with a single neuron in the final layer is not the norm. More importantly, I am trying to explain that I think the math does not stay the same in the case where the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified fu... (read more)

4Lucius Bushnaq
Fair enough, should probably add a footnote. Do any practically used loss functions actually have cross terms that lead to off-diagonals like that? Because so long as the matrix stays diagonal, you're effectively just adding extra norm to features in one part of the output over the others. Which makes sense, if your loss function is paying more attention to one part of the output than others, then perturbations to the weights of features of that part are going to have an outsized effect. The perturbative series evaluates the network at particular values of Θ. If your network has many layers that slowly build up an approximation of the function cos(x), to use in the final layer, it will effectively enter the behavioural gradient as cos(x), even though its construction evolves many parameters in previous layers.
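One practically used loss with cross terms (treating the raw logits as the 'behaviour') is softmax cross-entropy: its Hessian with respect to the logits is diag(p) - p pᵀ, which has nonzero off-diagonal entries. A quick numerical sketch, with arbitrary logit values chosen for illustration:

```python
import numpy as np

# Softmax cross-entropy on raw logits: l(z) = -log softmax(z)[target].
# Its Hessian w.r.t. the logits is diag(p) - p p^T, so the off-diagonal
# entries -p_i * p_j are nonzero and a "diagonal Hess(l)" assumption fails
# if the behaviour is taken to be the raw logits rather than the probabilities.
z = np.array([1.0, 0.5, -0.3])        # arbitrary logits
p = np.exp(z) / np.exp(z).sum()       # softmax probabilities
hessian = np.diag(p) - np.outer(p, p)
print(np.round(hessian, 3))           # off-diagonal entries are nonzero
```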

I'm not at liberty to share it directly but I am aware that Anthropic have a draft of small toy models with hand-coded synthetic data showing superposition very cleanly. They go as far as saying that searching for an interpretable basis may essentially be mistaken.
 

I wrote out the Hessian computation in a comment to one of Vivek's posts. I actually had a few concerns with his version and I could be wrong but I also think that there are some issues here. (My notation is slightly different because  for me the sum over  was included in the function I called "", but it doesn't affect my main point).

I think the most concrete thing is that the function  - i.e. the `input-output' function of a neural network - should in general have a vector output, but you write things like 

witho... (read more)

4Lucius Bushnaq
It's an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.   In theory, a loss function that explicitly depends on network parameters would behave differently than is assumed in this derivation, yes. But that's not how standard loss functions usually work. If a loss function did have terms like that, you should indeed get out somewhat different results.  But that seems like a thing to deal with later to me, once we've worked out the behaviour for really simple cases more. A feature to me is the same kind of thing it is to e.g. Chris Olah. It's the function mapping network input to the activations of some neurons, or linear combination of neurons, in the network. I'm not assuming that the function is linear in \Theta. If it was, this whole thing wouldn't just be an approximation within second order Taylor expansion distance, it'd hold everywhere.  In multi-layer networks, what the behavioural gradient is showing you is essentially what the network would look like if you approximated it for very small parameter changes, as one big linear layer.  You're calculating how the effects of changes to weights in previous layers "propagate through" with the chain rule to change what the corresponding feature would "look like" if  it was in the final layer.   Obviously, that can't be quite the right way to do things outside this narrow context of interpreting the meaning of the basin near optima. Which is why we're going to try out building orthogonal sets layer by layer instead. To be clear, none of this is a derivation showing that the L2 norm perspective is the right thing to do in any capacity. It's just a suggestive hint that it might be. We've been searching for the right definition of "feature independence" or "non-redundancy of computations" in neural networks for a while now, to get an elementary unit of neural network

Thanks very much Geoffrey; glad you liked the post. And thanks for the interesting extra remarks.

Thanks for the nice reply. 
 

I do buy the explanations I listed in the OP (and other, complementary explanations, like the ones in Inadequate Equilibria), and I think they're sufficient to ~fully make sense of what's going on. So I don't feel confused about the situation anymore. By "shocking" I meant something more like "calls for an explanation", not "calls for an explanation, and I don't have an explanation that feels adequate". (With added overtones of "horrifying".)


Yeah, OK, I think that helps clarify things for me.



As someone who was working a

... (read more)
5Rob Bensinger
Oh, I do think Superintelligence was extremely important. I think Superintelligence has an academic tone (and, e.g., hedges a lot), but its actual contents are almost maximally sci-fi weirdo -- the vast majority of public AI risk discussion today, especially when it comes to intro resources, is much less willing to blithely discuss crazy sci-fi scenarios.

I'm a little sheepish about trying to make a useful contribution to this discussion without spending a lot of time thinking things through but I'll give it a go anyway. There's a fair amount that I agree with here, including that there are by now a lot of introductory resources. But regarding the following:


(I do think it's possible to create a much better intro resource than any that exist today, but 'we can do much better' is compatible with 'it's shocking that the existing material hasn't already finished the job'.)

I feel like I want to ask: Do you really... (read more)

No need to be sheepish, IMO. :) Welcome to the conversation!

Do you really find it "shocking"?

I think it's the largest mistake humanity has ever made, and I think it implies a lower level of seriousness than the seriousness humanity applied to nuclear weapons, asteroids, climate change, and a number of other risks in the 20th century. So I think it calls for some special explanation beyond 'this is how humanity always handles everything'.

I do buy the explanations I listed in the OP (and other, complementary explanations, like the ones in Inadequate Equilbri... (read more)

Thanks again for the reply.

In my notation, something like   or  are functions in and of themselves. The function  evaluates to zero at local minima of 

In my notation, there isn't any such thing as .

But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathe... (read more)

Thanks for the substantive reply.

First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different and so I think it deserves to be spelled out.  The loss of a model with parameters θ can be described by introducing the actual function l that maps the behavior to the real numbers, right? i.e. given some actual function l we have:

L(θ) = l(f(θ))

i.e. it's l that might be something ... (read more)

1Vivek Hebbar
I will split this into a math reply, and a reply about the big picture / info loss interpretation. Math reply: Thanks for fleshing out the calculus rigorously; admittedly, I had not done this. Rather, I simply assumed MSE loss and proceeded largely through visual intuition. This is still false! Edit: I am now confused, I don't know if it is false or not. You are conflating ∇_f l(f(θ)) and ∇_θ l(f(θ)). Adding disambiguation, we have: ∇_θ L(θ) = (∇_f l(f(θ))) J_θ f(θ) and Hess_θ(L)(θ) = J_θ f(θ)^T [Hess_f(l)(f(θ))] J_θ f(θ) + ∇_f l(f(θ)) D²_θ f(θ). So we see that the second term disappears if ∇_f l(f(θ)) = 0. But the critical point condition is ∇_θ l(f(θ)) = 0. From chain rule, we have: ∇_θ l(f(θ)) = (∇_f l(f(θ))) J_θ f(θ). So it is possible to have a local minimum where ∇_f l(f(θ)) ≠ 0, if ∇_f l(f(θ)) is in the left null-space of J_θ f(θ). There is a nice qualitative interpretation as well, but I don't have energy/time to explain it. However, if we are at a perfect-behavior global minimum of a regression task, then ∇_f l(f(θ)) is definitely zero. A few points about rank equality at a perfect-behavior global min: 1. rank(Hess(L)) = rank(J_f) holds as long as Hess_f(l)(f(θ)) is a diagonal matrix. It need not be a multiple of the identity. 2. Hence, rank equality holds anytime the loss is a sum of functions s.t. each function only looks at a single component of the behavior. 3. If the network output is 1d (as assumed in the post), this just means that the loss is a sum over losses on individual inputs. 4. We can extend to larger outputs by having the behavior f be the flattened concatenation of outputs. The rank equality condition is still satisfied for MSE, Binary Cross Entropy, and Cross Entropy over a probability vector. It is not satisfied if we consider the behavior to be raw logits (before the softmax) and softmax+CrossEntropy as the loss function. But we can easily fix that by considering probability (after softmax) as behavior instead of raw logits.
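As a numerical sanity check of the rank-equality claim (a minimal sketch, assuming a linear model and MSE loss with realizable targets, so we sit at a perfect-behavior global minimum; the linear case is of course degenerate since the Hessian is constant, but it illustrates rank(Hess(L)) = rank(J_f); the numbers are arbitrary):

```python
import numpy as np

# Linear "network": behavior f(theta) = X @ theta on a fixed training set.
# Loss L(theta) = ||X @ theta - y||^2, i.e. MSE up to a constant factor.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4)) @ rng.normal(size=(4, 10))  # rank-4 behavior Jacobian, 10 params
theta_star = rng.normal(size=10)
y = X @ theta_star              # realizable targets: theta_star is a global minimum

jacobian = X                    # J_theta f(theta) is constant for a linear model
hessian = 2 * X.T @ X           # Hess_theta L(theta) for this quadratic loss

print(np.linalg.matrix_rank(jacobian))  # 4
print(np.linalg.matrix_rank(hessian))   # 4 -- ranks agree at the global minimum
```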

This was pretty interesting and I like the general direction that the analysis goes in. I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

I think that your setup is essentially that there is an -dimensional parameter space, let's call it  say, and then for each element  of the training set, we can consider the function ... (read more)

1Vivek Hebbar
Thanks for this reply, its quite helpful. Ah nice, didn't know what it was called / what field it's from.  I should clarify that "key result" here just meant "key result of the math so far -- pay attention", not "key result of the whole post" or "profound/original". Yeah, you're right.  Previously I thought G was the Jacobian, because I had the Jacobian transposed in my head.  I only realized that G has a standard name fairly late (as I was writing the post I think), and decided to keep the non-standard notation since I was used to it, and just add a footnote. Yes; this is the whole point of the post.  The math is just a preliminary to get there. Good catch -- it is technically possible at a local minimum, although probably extremely rare.  At a global minimum of a regression task it is not possible, since there is only one behavior vector corresponding to zero loss.  Note that behavior in this post was defined specifically on the training set.  At global minima, "Rank(Hessian(Loss))=Rank(G)" should be true without exception. In  "Flat basin  ≈  Low-rank Hessian  =  Low-rank G  ≈  High manifold dimension": The first "≈" is a correlation.  The second "≈" is the implication "High manifold dimension => Low-rank G". (Based on what you pointed out, this only works at global minima). "Indicates" here should be taken as slightly softened from "implies", like "strongly suggests but can't be proven to imply".  Can you think of plausible mechanisms for causing low rank G which don't involve information loss?