You refer a couple of times to the fact that evals are often used with the aim of upper bounding capabilities. To my mind this is an essential difficulty that acts as a point of disanalogy with things like aviation. I'm obviously no expert, but in the case of aviation I would have thought that you want to give positive answers to questions like "can this plane safely do X thousand miles?", i.e. produce absolutely guaranteed lower bounds on 'capabilities'. You don't need to find something like the approximately smallest number Y such that it could never, under any circumstances, fly more than Y million miles.
FWIW I did not interpret Thane as necessarily having "high confidence" in "architecture / internal composition" of AGI. It seemed to me that they were merely (and ~accurately) describing what the canonical views were most worried about. (And I think a discussion about whether or not being able to "model the world" counts as a statement about "internal composition" is sort of beside the point/beyond the scope of what's really being said)
It's fair enough if you would say things differently(!) but in some sense isn't it just pointing out: 'I would emphasize d...
Newtonian mechanics was systematized as a special case of general relativity.
One of the things I found confusing early on in this post was that systematization is said to be about representing the previous thing as an example or special case of some other thing that is both simpler and more broadly-scoped.
In my opinion, it's easy to give examples where the 'other thing' is more broadly-scoped, and this is because 'increasing scope' corresponds to the usual way we think of generalisation, i.e. the latter thing applies to more settings or it is 'about a wi...
OK I think this will be my last message in this exchange but I'm still confused. I'll try one more time to explain what I'm getting at.
I'm interested in what your precise definition of subjective probability is.
One relevant thing I saw was the following sentence:
If I say that a coin is 50% likely to come up heads, that's me saying that I don't know the exact initial conditions of the coin well enough to have any meaningful knowledge of how it's going to land, and I can't distinguish between the two options.
It seems to give something like a defi...
So my point is still: What is that thing? I think yes, I actually am trying to push proponents of this view down to the metaphysics: if they say "there's a 40% chance that it will rain tomorrow", I want to know things like what it is that they are attributing 40%-ness to, and what it means to say that that thing "has probability 40%". That's why I fixated on that sentence in particular, because it's the closest thing I could find to an actual definition of subjective probability in this post.
I have in mind very simple examples. Suppose that first I roll a die. If it doesn't land on a 6, I then flip a biased coin that lands on heads 3/5 of the time. If it does land on a 6 I just record the result as 'tails'. What is the probability that I get heads?
This is contrived so that the probability of heads is
5/6 x 3/5 = 1/2.
But do you think that, in saying this, I mean something like "I don't know the exact initial conditions... well enough to have any meaningful knowledge of how it's going to land, and I can't distinguish be...
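(As an aside, here is a minimal simulation sketch of the die-then-coin procedure above, just to make the arithmetic concrete; the code and its names are my own illustrative additions, not anything from the post.)

```python
import random

def heads_probability(num_trials: int = 1_000_000) -> float:
    """Estimate P(heads) for the die-then-coin procedure described above."""
    heads = 0
    for _ in range(num_trials):
        if random.randint(1, 6) == 6:
            # The die landed on 6: the result is recorded as 'tails'.
            continue
        # Otherwise, flip a biased coin that lands heads 3/5 of the time.
        if random.random() < 3 / 5:
            heads += 1
    return heads / num_trials

# The estimate should be close to 5/6 * 3/5 = 1/2.
print(heads_probability())
```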
I'm kind of confused about what you're asking me - like which bit is "accurate", etc. Sorry, I'll try to restate my question:
- Do you think that when someone says something has "a 50% probability" then they are saying that they do not have any meaningful knowledge that allows them to distinguish between two options?
I'm suggesting that you can't possibly think that, because there are obviously other ways things can end up 50/50. e.g. maybe it's just a very specific calculation, using lots of specific information, that ends up with the value 0.5 at the end....
Presumably you are not claiming that saying
...I don't know the exact initial conditions of the coin well enough to have any meaningful knowledge of how it's going to land, and I can't distinguish between the two options...
is actually necessarily what it means whenever someone says something has a 50% probability? Because there are obviously myriad ways something can have a 50% probability and this kind of 'exact symmetry between two outcomes' + no other information is only one very special way that it can happen.
So what does it mean exactly when you say something is 50% likely?
The traditional interpretation of probability is known as frequentist probability. Under this interpretation, items have some intrinsic "quality" of being some % likely to do one thing vs. another. For example, a coin has a fundamental probabilistic essence of being 50% likely to come up heads when flipped.
Is this right? I would have said that what you describe is more like the classical, logical view of probability, which isn't the same as the frequentist view. Even the wiki page you've linked seems to disagree with what you've written, i.e. it describe...
My rejoinder to this is that, analogously to how a causal model can be re-implemented as a more complex non-causal model[2], a learning algorithm that looks at data that in some ways is saying something about causality, be it because the data contains information-decision-action-outcome units generated by agents, because the learning thing can execute actions itself and reflectively process the information of having done such actions, or because the data contains an abstract description of causality, can surely learn causality.
Short comment/feedback just t...
Ah OK, I think I've worked out where some of my confusion is coming from: I don't really see any argument for why mathematical work may be useful, relative to other kinds of foundational conceptual work. e.g. you write (with my emphasis): "Current mathematical research could play a similar role in the coming years..." But why might it? Isn't that where you need to be arguing?
The examples seem to be of cases where people have done some kind of conceptual foundational work which has later gone on to influence/inspire ML work. But early work on deception or Goodhart was not mathematical work; that's why I don't understand how these are examples.
Thanks for the comment Rohin, that's interesting (though I haven't looked at the paper you linked).
I'll just record some confusion I had after reading your comment that stopped me replying initially: I was confused by the distinction between modular and non-modular because I kept thinking: If I add a bunch of numbers and don't do any modding, then it is equivalent to doing modular addition modulo some large number (i.e. at least as large as the largest sum you get). And otoh if I tell you I'm doing 'addition modulo 113', but I o...
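(To spell out the thought with a toy check of my own - not from anything Rohin linked: on a bounded set of inputs, plain addition coincides with addition modulo any modulus larger than the largest sum that actually occurs, whereas 'addition modulo 113' genuinely differs on those same inputs.)

```python
# Toy check (my own illustration): for inputs 0..112, un-modded addition agrees
# with addition mod M whenever M exceeds the largest possible sum (224), but it
# disagrees with addition mod 113 on many pairs.
max_input = 112
inputs = range(max_input + 1)
largest_sum = 2 * max_input

for M in (largest_sum + 1, 10**6):
    assert all((a + b) % M == a + b for a in inputs for b in inputs)

assert any((a + b) % 113 != a + b for a in inputs for b in inputs)
print("Plain addition on 0..112 matches mod-M addition for any M > 224.")
```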
I'm still not sure I buy the examples. In the early parts of the post you seem to contrast 'machine learning research agendas' with 'foundational and mathematical'/'agent foundations' type stuff. Mechanistic interpretability can be quite mathematical but surely it falls into the former category? i.e. it is essentially ML work as opposed to constituting an example of people doing "mathematical and foundational" work.
I can't say much about the Goodhart's Law comment but it seems at best unclear that its link to goal misgeneralization is an example of t...
Currently, it takes a very long time to get an understanding of who is doing what in the field of AI Alignment and how good each plan is, what the problems are, etc.
Is this not ~normal for a field that is maturing? And by normal I also mean approximately unavoidable or 'essential'. Like I could say 'it sure takes a long time to get an understanding of who is doing what in the field of... computer science', but I have no reason to believe that I can substantially 'fix' this situation in the space of a few months. It just really is because there is lot...
I think that perhaps as a result of a balance of pros and cons, I initially was not very motivated to comment (and haven't been very motivated to engage much with ARC's recent work). But I decided maybe it's best to comment in a way that gives a better signal than silence.
I've generally been pretty confused about Formalizing the Presumption of Independence and, as the post sort of implies, this is sort of the main advert that ARC have at the moment for the type of conceptual work that they are doing, so most of what I have to say is meta stuff ...
I think this is a reasonable perception and opinion. We’ve written a little bit about how heuristic estimators might help with ELK (MAD and ELK and finding gliders), but that writing is not particularly clear and doesn’t present a complete picture.
We’ve mostly been focused on finding heuristic estimators, because I am fairly convinced they would be helpful and think that designing them is our key technical risk. But now that we are hiring again I think it’s important for us to explain publicly why they would be valuable, and to generally motivate and situa...
Have you seen https://www.alignment.org/blog/mechanistic-anomaly-detection-and-elk/ and any of the other recent posts on https://www.alignment.org/blog/? I don't think they make it obvious that formalizing the presumption of independence would lead to alignment solutions, but they do give a much more detailed explanation of why you might hope so than the paper.
How exactly can an org like this help solve what many people see as one of the main bottlenecks: the issue of mentorship? How would Catalyze actually tip the scales when it comes to 'mentor matching'?
(e.g. see Richard Ngo's first high-level point in this career advice post)
Hi Garrett,
OK so just being completely honest, I don't know if it's just me but I'm getting a slightly weird or snarky vibe from this comment? I guess I will assume there is a good faith underlying point being made to which I can reply. So just to be clear:
Interesting thoughts!
It reminds me not only of my own writing on a similar theme, but of another one of these viewpoints/axes along which to carve interpretability work, mentioned in this post by jylin04:
...
...a dream for interpretability research would be if we could reverse-engineer our future AI systems into human-understandable code. If we take this dream seriously, it may be helpful to split it into two parts: first understanding what "programming language" an architecture + learning algorithm will end up using at the end of training, and then wh
At the start you write
3. Unnecessarily diluting the field’s epistemics by introducing too many naive or overly deferent viewpoints.
And later Claim 3 is:
Scholars might defer to their mentors and fail to critically analyze important assumptions, decreasing the average epistemic integrity of the field
It seems to me there might be two things being pointed to?
A) Unnecessary dilution: Via too many naive viewpoints;
B) Excessive deference: Perhaps resulting in too few viewpoints or at least no new ones;
And arguably these two things are in tension, in the fol...
Hey Joseph, thanks for the substantial reply and the questions!
Why call this a theory of interpretability as opposed to a theory of neural networks?
Yeah this is something I am unsure about myself (I wrote: "something that I'm clumsily thinking of as 'the mathematics of (the interpretability of) deep learning-based AI'"). But I think I was imagining that a 'theory of neural networks' would be definitely broader than what I have in mind as being useful for not-kill-everyoneism. I suppose I imagine it including lots of things that are intere...
I spent some time trying to formulate a good response to this that analyzed the distinction between (1) and (2) (in particular how it may map onto the types of pseudo-alignment described in RFLO here), but (and hopefully this doesn't sound too glib) it started to seem like it genuinely mattered whether humans in separate individual heavily-defended cells being pumped full of opiates have in fact been made to be 'happy' or not?
I think because if so, it is at least some evidence that the pseudo-alignment during training is for instrumental reasons (i.e. maybe it ...
This is a very strong endorsement but I'm finding it hard to separate the general picture from RFLO:
mesa-optimization occurs when a base optimizer...finds a model that is itself an optimizer,
where
a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.
i.e. a mesa-optimizer is a learned model that 'performs inference' (i.e. evalua...
I've always found it a bit odd that Alignment Forum submissions are automatically posted to LW.
If you apply some of these norms, then imo there are questionable implications, e.g. it seems weird to say that one should have read the Sequences in order to post about mechanistic interpretability on the Alignment Forum.
I really like this post and found it very interesting, particularly because I'm generally interested in the relationship between the rationality side of the AI Alignment community and academia, and I wanted to register some thoughts. Sorry for the long comment on an old post and I hope this doesn't come across as pernickety. If anything I sort of feel like TurnTrout is being hard on himself.
I think the tl;dr for my comment is sort of that to me the social dynamics "mistakes" don't really seem like mistakes - or at least not ones that were actually ma...
I've only skimmed this, but my main confusions with the whole thing are still on a fairly fundamental level.
You spend some time saying what abstractions are, but when I see the hypothesis written down, most of my confusion is on what "cognitive systems" are and what one means by "most". Afaict it really is a kind of empirical question to do with "most cognitive systems". Do we have in mind something like 'animal brains and artificial neural networks'? If so then surely let's just say that and make the whole thing more concrete; so I suspect not....bu...
Something ~ like 'make it legit' has been and possibly will continue to be a personal interest of mine.
I'm posting this after Rohin entered this discussion - so Rohin, I hope you don't mind me quoting you like this, but fwiw I was significantly influenced by this comment on Buck's old talk transcript 'My personal cruxes for working on AI safety'. (Rohin's comment is repeated here in full; please bear in mind this is 3 years old and his views I'm sure have developed and potentially moved a lot since then:)
...
I enjoyed this post, it was good to see this all laid o
Certainly it's not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.
Ok, it sounds to me like maybe there are at least two things being talked about here. One situation is
A) Where a community includes different groups working on the same topic, and where th...
Re: e.g. superposition/entanglement:
I think people should try to understand the wider context into which they are writing, but I don't see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names. In fact I'd say this happens all the time and generally people can just hold in their minds that another group has another name for it. Naturally, the two groups will have slightly different perspectives and this a) is often good, i.e. the interference can be constructive, and b) can be a reason in f...
Thanks very much for the comments; I think you've asked a bunch of very good questions. I'll try to give some thoughts:
...Deep learning as a field isn't exactly known for its rigor. I don't know of any rigorous theory that isn't as you say purely 'reactive', with none of it leading to any significant 'real world' results. As far as I can tell this isn't for a lack of trying either. This has made me doubt its mathematical tractability, whether it's because our current mathematical understanding is lacking or something else (DL not being as 'reductionist' as oth
Ah thanks very much Daniel. Yes, now that you mention it I remember being worried about this a few days ago but then either forgot or (perhaps mistakenly) decided it wasn't worth expanding on. But yeah, I guess you don't get a well-defined map until you actually fix how the tokenization happens with another separate algorithm. I will add it to the list of things to fix/expand on in an edit.
>There is no difference between natural phenomena and DNNs (LLMs, whatever). DNNs are 100% natural
I mean "natural" as opposed to "man-made", i.e. something like "occurs in nature without being built by something or someone else". So in that sense, DNNs are obviously not natural in the way that the laws of physics are.
I don't see information and computation as only mathematical; in fact, in my analogies I describe the mathematical abstractions we build as being separate from the things that one wants to describe or make predictions about. And this...
Interesting idea. I think it’s possible that a prize is the wrong thing for getting the best final result (but also possible that getting a half decent result is more important than a high variance attempt at optimising for the best result). My thinking is: To do what you’re suggesting to a high standard could take months of serious effort. The idea of someone really competent doing so just for the chance at some prize money doesn’t quite seem right to me… I think there could be people out there who in principle could do it excellently but who would want to know that they’d ‘got the job’ as it were before spending serious effort on it.
I think I would support Joe's view here that clarity and rigour are significantly different... but maybe - David - your comments are supposed to be specific to alignment work? e.g. I can think of plenty of times I have read books or articles in other areas and fields that contain zero formal definitions, proofs, or experiments but are obviously "clear", well-explained, well-argued etc. So by your definitions is that not a useful and widespread form of rigour-less clarity? (One that we would want to 'allow' in alignment work?) Or would you instead maintain ...
I agree that the space may well miss important concepts and perspectives. As I say, it is not my suggestion to look at it, but rather just something that was implicitly being done in another post. The space may well be a more natural one. (It's of course the space of functions , and so a space in which 'model space' naturally sits in some sense. )
It's an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.
I'm sorry, but the fact that it is scalar output isn't explained, and a network with a single neuron in the final layer is not the norm. More importantly, I am trying to explain that I think the math does not stay the same in the case where the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified fu...
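(For reference, the kind of chain-rule identity I have in mind for the vector-output case is sketched below, in my own notation rather than the post's: f maps the p parameters to the m network outputs, and the loss is a function of the outputs.)

```latex
% Sketch, my notation: f : R^p -> R^m is the parameter-to-output map,
% \ell : R^m -> R is the loss on outputs, and g = \ell \circ f.
\nabla_w g(w) = J_f(w)^{\top} \, \nabla \ell(f(w)),
\qquad
\nabla_w^2 g(w) = J_f(w)^{\top} \, \nabla^2 \ell(f(w)) \, J_f(w)
  + \sum_{k=1}^{m} \frac{\partial \ell}{\partial y_k}\big(f(w)\big) \, \nabla_w^2 f_k(w).
```

The first term alone is the Gauss-Newton piece; the second term weights the Hessian of each output coordinate by the corresponding component of the loss gradient, which is where the extra structure in the vector-output case comes from.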
I'm not at liberty to share it directly but I am aware that Anthropic have a draft of small toy models with hand-coded synthetic data showing superposition very cleanly. They go as far as saying that searching for an interpretable basis may essentially be mistaken.
I wrote out the Hessian computation in a comment to one of Vivek's posts. I actually had a few concerns with his version and I could be wrong but I also think that there are some issues here. (My notation is slightly different because for me the sum over was included in the function I called "", but it doesn't affect my main point).
I think the most concrete thing is that the function - i.e. the 'input-output' function of a neural network - should in general have a vector output, but you write things like
witho...
Thanks for the nice reply.
I do buy the explanations I listed in the OP (and other, complementary explanations, like the ones in Inadequate Equilibria), and I think they're sufficient to ~fully make sense of what's going on. So I don't feel confused about the situation anymore. By "shocking" I meant something more like "calls for an explanation", not "calls for an explanation, and I don't have an explanation that feels adequate". (With added overtones of "horrifying".)
Yeah, OK, I think that helps clarify things for me.
...
As someone who was working a
I'm a little sheepish about trying to make a useful contribution to this discussion without spending a lot of time thinking things through, but I'll give it a go anyway. There's a fair amount that I agree with here, including that there are by now a lot of introductory resources. But regarding the following:
(I do think it's possible to create a much better intro resource than any that exist today, but 'we can do much better' is compatible with 'it's shocking that the existing material hasn't already finished the job'.)
I feel like I want to ask: Do you really...
No need to be sheepish, IMO. :) Welcome to the conversation!
Do you really find it "shocking"?
I think it's the largest mistake humanity has ever made, and I think it implies a lower level of seriousness than the seriousness humanity applied to nuclear weapons, asteroids, climate change, and a number of other risks in the 20th-century. So I think it calls for some special explanation beyond 'this is how humanity always handles everything'.
I do buy the explanations I listed in the OP (and other, complementary explanations, like the ones in Inadequate Equilbri...
Thanks again for the reply.
In my notation, something like or are functions in and of themselves. The function evaluates to zero at local minima of .
In my notation, there isn't any such thing as .
But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathe...
Thanks for the substantive reply.
First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different and so I think it deserves to be spelled out. The loss of a model with parameters can be described by introducing the actual function that maps the behavior to the real numbers, right? i.e. given some actual function we have:
i.e. it's that might be something ...
This was pretty interesting and I like the general direction that the analysis goes in. I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.
I think that your setup is essentially that there is an -dimensional parameter space, let's call it say, and then for each element of the training set, we can consider the function ...
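(For completeness, here is a sketch of the standard statement I have in mind, written from memory in my own notation rather than quoted from the post; in the setup above, n would be the number of parameters and m the number of output coordinates being constrained.)

```latex
% Submersion / regular value theorem (sketch, my notation).
\textbf{Claim.} Let $F \colon \mathbb{R}^n \to \mathbb{R}^m$ be smooth and let $c = F(w_0)$.
If the Jacobian $DF(w_0)$ has full rank $m$ (i.e.\ $F$ is a submersion at $w_0$), then,
in a neighbourhood of $w_0$, the level set
\[
  F^{-1}(c) = \{\, w \in \mathbb{R}^n : F(w) = c \,\}
\]
is a smooth embedded submanifold of dimension $n - m$. This is essentially the implicit
function theorem: $m$ of the coordinates can locally be solved for in terms of the other $n - m$.
```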
I for one would find it helpful if you included a link to at least one place that Eliezer had made this claim just so we can be sure we're on the same page.
Roughly speaking, what I have in mind is that there are at least two possible claims. One is that 'we can't get AI to do our alignment homework' because by the time we have a very powerful AI that can solve alignment homework, it is already too dangerous to use the fact it can solve the homework as a safety plan. And the other is the claim that there's some sort of 'intrinsic' reason why an AI built by humans could never solve alignment homework.