I'm not sure -- what significance are you placing on the word "tackle" in this context? I would also not say that the main value proposition of this research agenda lies in identifying the ontology of the reward function -- the main questions for this area of research may even be mostly orthogonal to that question.
I am actually currently working on developing these ideas further, and I expect to relatively soon be able to put out some material on this (modulo the fact that I have to finish my PhD thesis first).
I also think that, in practice, you would probably have to allow some uninterpretable components in order to maintain competitive performance, at least in some domains. One reason for this is of course that there simply might not be any interpretable computer program which solves the given task (*). Moreover, even if such a program does exist, it may plausibly be infeasi...
What I had in mind is a situation where we have access to the latent variables during training, and only use the model to prove safety properties in situations that are within the range of the training distribution in some sense (e.g., situations where we have some learning-theoretic guarantees). As for treacherous turns, I am implicitly assuming that we don't have to worry about a treacherous turn from the world model, but that we may have to worry about it from the AI policy that we're verifying.
However, note that even this is meaningfully different from j...
In practice, I think you are unlikely to end up with a schemer unless you train your model to solve some agentic task (or train it to model a system that may itself be a schemer, such as a human). However, in order to guarantee that, I agree we need some additional property (such as interpretability, or some learning-theoretic guarantee).
I'm not so convinced of this. Yes, for some complex safety properties, the world model will probably have to be very smart. However, this does not mean that you have to model everything -- depending on your safety specification and use case, you may be able to factor out a huge amount of complexity. We know from existing cases that this is true on a small scale -- why should it not also be true on a medium or large scale?
For example, with a detailed model of the human body, you may be able to prove whether or not a given chemical could be harmful to ingest...
I think the distinction between these two cases often can be somewhat vague.
Why do you think that the adversarial case is very different?
I think you're perhaps reading me as being more bullish on Bayesian methods than I in fact am -- I am not necessarily saying that Bayesian methods in fact can solve OOD generalisation in practice, nor am I saying that other methods could not also do this. In fact, I was until recently very skeptical of Bayesian methods, before talking about it with Yoshua Bengio. Rather, my original reply was meant to explain why the Bayesian aspect of Bengio's research agenda is a core part of its motivation, in response to your remark that "from my understanding, the bay...
But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.)
It sounds to me like we agree here; I don't want to put too much weight on "most".
Is this true?
It is true in the sense that you don't have any theoretical guarantees, and in the sense that it also often fails to work in practice.
Aren't NNs implicitly ensembles of a vast number of models?
They probably are, to some extent. However, in practice, you often find that neural networks make very confident (an...
From my understanding, the bayesian aspect of this agenda doesn't add much value.
A core advantage of Bayesian methods is the ability to handle out-of-distribution situations more gracefully, and this is arguably as important as (if not more important than) interpretability. In general, most (?) AI safety problems can be cast as instances of a case where a model behaves as intended on a training distribution, but generalises in an unintended way when placed in a novel situation. Traditional ML has no straightforward way of dealing with such cases, since i...
You can imagine different types of world models, going from very simple ones to very detailed ones. In a sense, you could perhaps think of the assumption that the input distribution is i.i.d. as a "world model". However, what is imagined is generally something that is much more detailed than this. More useful safety specifications would require world models that (to some extent) describe the physics of the environment of the AI (perhaps including human behaviour, though it would probably be better if this can be avoided). More detail about what the world m...
If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.
I suppose this depends on what you mean by "most". DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function map. This is also true of LSTMs vs transformers, and so on. I think it's fairly likely that these kinds of differences could have a large impact on the probability that a given type o...
You're right, I put the parameters the wrong way around. I have fixed it now, thanks!
I could have changed it to Why Neural Networks can obey Occam's Razor, but I think this obscures the main point.
I think even this would be somewhat inaccurate (in my opinion). If a given parametric Bayesian learning machine does obey (some version of) Occam's razor, then this must be because of some facts related to its prior, and because of some facts related to its parameter-function map. SLT does not say very much about either of these two things. What the post is about is primarily the relationship between the RLCT and posterior probability, and how th...
Well neural networks do obey Occam's razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor).
Would that not imply that my polynomial example also obeys Occam's razor?
...However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something prof
I think I recall reading that, but I'm not completely sure.
Note that the activation function affects the parameter-function map, and so the influence of the activation function is subsumed by the general question of what the parameter-function map looks like.
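To make that concrete, here is a minimal toy sketch (my own illustrative example, not anything from the discussion above): the exact same parameter values implement different functions under different activation functions, which is the sense in which the activation is just one more ingredient of the parameter-function map.

```python
import numpy as np

# Toy sketch (hypothetical example): the same parameters, pushed through two
# different activation functions, yield two different functions. The
# activation is therefore part of the parameter-function map.

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 1)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def net(x, act):
    """Tiny one-hidden-layer network with a pluggable activation."""
    h = act(W1 @ x + b1)   # hidden layer with the given activation
    return W2 @ h + b2     # linear readout

x = np.array([0.5])
relu = lambda z: np.maximum(z, 0.0)

print(net(x, relu), net(x, np.tanh))  # same parameters, different outputs
```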
I'm not sure, but I think this example is pathological.
Yes, it's artificial and cherry-picked to make a certain rhetorical point as simply as possible.
This is the more relevant and interesting kind of symmetry, and it's easier to see what this kind of symmetry has to do with functional simplicity: simpler functions have more local degeneracies.
This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true.
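As a standard textbook-style illustration (not the specific example from my post): the same function can be assigned different degrees of degeneracy under different parameterisations. Consider the zero function realised at $w = 0$ by two one-parameter models, one whose KL divergence to the truth is $K_1(w) = c\,w^2$ and one where it is $K_2(w) = c\,w^4$. The RLCT is (minus) the largest pole of $\zeta(z) = \int K(w)^z\,\varphi(w)\,dw$, which gives

$$
\lambda_1 = \tfrac{1}{2}, \qquad \lambda_2 = \tfrac{1}{4},
$$

so the degeneracy attached to the function depends on how it is parameterised, not on the function alone.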
...You can
Will do, thank you for the reference!
Yes, I completely agree. The theorems that have been proven by Watanabe are of course true and non-trivial facts of mathematics; I do not mean to dispute this. What I do criticise is the magnitude of the significance of these results for the problem of understanding the behaviour of deep learning systems.
Thank you for this -- I agree with what you are saying here. In the post, I went with a somewhat loose equivocation between "good priors" and "a prior towards low Kolmogorov complexity", but this does skim past a lot of nuance. I also very much do not want to say that the DNN prior is exactly a prior towards low Kolmogorov complexity (this would be uncomputable), but only that it is mostly correlated with Kolmogorov complexity for typical problems.
Yes, I mostly just mean "low test error". I'm assuming that real-world problems follow a distribution that is similar to the Solomonoff prior (i.e., that data generating functions are more likely to have low Kolmogorov complexity than high Kolmogorov complexity) -- this is where the link is coming from. This is an assumption about the real world, and not something that can be established mathematically.
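To spell the assumption out a bit more (using the standard definition of the Solomonoff prior; the symbols here are my own):

$$
P(f) \;\propto \sum_{p \,:\, U(p) = f} 2^{-|p|} \;\approx\; 2^{-K(f)},
$$

where $U$ is a universal Turing machine, $|p|$ is the length (in bits) of program $p$, and $K(f)$ is the Kolmogorov complexity of $f$. The assumption is then that the real-world distribution over data generating functions is roughly of this shape, i.e. tilted towards functions with low $K(f)$.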
I think that it gives us an adequate account of generalisation in the limit of infinite data (or, more specifically, in the case where we have enough data to wash out the influence of the inductive bias); this is what my original remark was about. I don't think classical statistical learning theory gives us an adequate account of generalisation in the setting where the training data is small enough for our inductive bias to still matter, and it only has very limited things to say about out-of-distribution generalisation.
The assumption that small neural networks are a good match for the actual data generating process of the world is equivalent to the assumption that neural networks have an inductive bias that gives large weight to the actual data generating process of the world, provided we also append the claim that neural networks have an inductive bias that gives large weight to functions which can be described by small neural networks (and this latter claim is not too difficult to justify, I think).
I think the second one by Carroll is quite careful to say things like "we can now understand why singular models have the capacity to generalise well" which seems to me uncontroversial, given the definitions of the terms involved and the surrounding discussion.
The title of the post is Why Neural Networks obey Occam's Razor! It also cites Zhang et al., 2017, and immediately after this says that SLT can help explain why neural networks have the capacity to generalise well. This gives the impression that the post is intended to give a solution to problem (ii) ...
Well neural networks do obey Occam's razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor).
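For reference, the formula in question is (roughly, suppressing the multiplicity term and writing it in my own notation):

$$
F_n \;=\; n L_n(w_0) \;+\; \lambda \log n \;+\; O_p(\log \log n),
$$

where $F_n$ is the Bayesian free energy (the negative log marginal likelihood), $L_n(w_0)$ is the loss at an optimal parameter, and $\lambda$ is the RLCT. For a regular model $\lambda = d/2$, which recovers the familiar BIC penalty $\frac{d}{2}\log n$; it is in this sense that the formula generalises the BIC.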
I think that expression of Jesse's is also correct, in context.
However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) ...
Anyway I'm guessing you're probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).
Yes, absolutely. However, I also don't think that (i) is very mysterious, if we view things from a Bayesian perspective. Indeed, it seems natural to say that an ideal Bayesian reasoner should assign non-zero prior probability to all computable models, or something along those lines, and in that case, notions like "overparameterised" no longer seem very significant.
...Maybe that has significant overlap with the cr
A few things:
1. Neural networks do typically learn functions with low Kolmogorov complexity (otherwise they would not be able to generalise well).
2. It is a type error to describe a function as having low RLCT. A given function may have a high RLCT or a low RLCT, depending on the architecture of the learning machine.
3. The critique is against the supposition that we can use SLT to explain why neural networks generalise well in the small-data regime. The example provides a learning machine which would not generalise well, but which does fit all assump...
To say that neural networks are empirical risk minimisers is just to say that they find functions with globally optimal training loss (and, if they find functions with a loss close to the global optimum, then they are approximate empirical risk minimisers, etc).
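In symbols (a standard formalisation, with $\mathcal{H}$ denoting the set of functions the network can express and $\ell$ the training loss; the notation is mine):

$$
\hat f \;\in\; \operatorname*{arg\,min}_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big),
$$

and an approximate empirical risk minimiser is one that finds an $f$ whose empirical risk is within some small $\varepsilon$ of this minimum.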
I think SLT effectively assumes that neural networks are (close to being) empirical risk minimisers, via the assumption that they are trained by Bayesian induction.
The bounds are not exactly vacuous -- in fact, they are (in a sense) tight. However, they concern a somewhat adversarial setting, where the data distribution may be selected arbitrarily (including by making it maximally opposed to the inductive bias of the learning algorithm). This means that the bounds end up being much larger than what you would typically observe in practice, if you give typical problems to a learning algorithm whose inductive bias is attuned to the structure of "typical" problems.
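For reference, the kind of bound I have in mind is the standard agnostic PAC guarantee (stated in my own notation):

$$
m \;\ge\; C\,\frac{d + \log(1/\delta)}{\epsilon^{2}}
\quad\Longrightarrow\quad
\Pr\!\Big[\, L_{\mathcal D}(\hat f) \;\le\; \min_{f \in \mathcal H} L_{\mathcal D}(f) + \epsilon \Big] \;\ge\; 1 - \delta,
$$

where $d$ is the VC dimension of the hypothesis class $\mathcal H$, $L_{\mathcal D}$ is the true risk under the data distribution $\mathcal D$, $\hat f$ is the empirical risk minimiser on the $m$ samples, and $C$ is a universal constant. Note that nothing here depends on what $\mathcal D$ is -- this is exactly the sense in which the bound is "adversarial" over data distributions.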
If you move from this adversarial setting to a more...
I already posted this in response to Daniel Murfet, but I will copy it over here:
For example, the agnostic PAC-learning theorem says that if a learning machine $A$ (for binary classification) is an empirical risk minimiser with VC dimension $d$, then for any distribution $\mathcal{D}$ over $X \times \{0,1\}$, if $A$ is given access to at least $m(\epsilon, \delta)$ data points sampled from $\mathcal{D}$, then it will with probability at least $1 - \delta$ learn a function whose (true) generalisation error (under $\mathcal{D}$) is at most $\epsilon$ ...
Does this not essentially amount to just assuming that the inductive bias of neural networks in fact matches the prior that we (as humans) have about the world?
This is basically a justification of something like your point 1, but AFAICT it's closer to a proof in the SLT setting than in your setting.
I think it could probably be turned into a proof in either setting, at least if we are allowed to help ourselves to assumptions like "the ground truth function is generated by a small neural net" and "learning is done in a Bayesian way", etc.
In your example there are many values of the parameters that encode the zero function
Ah, yes, I should have made the training data be (1,1), rather than (0,0). I've fixed the example now!
Is that a fair characterisation of the argument you want to make?
Yes, that is exactly right!
Assuming it is, my response is as follows. I'm guessing you think $x^2$ is simpler than $x^4$ because the former function can be encoded by a shorter code on a UTM than the latter.
The notion of complexity that I have in mind is even more pre-theoretic than that; it'...
I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don't think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?
That seems probable. Maybe it's useful for me to lay out a more or less complete picture of what I think SLT does say about generalisation in deep learning in its current form, so that we're on the same page. When people refer to the "generalisation puzzle" in...
For example, the agnostic PAC-learning theorem says that if a learning machine $A$ (for binary classification) is an empirical risk minimiser with VC dimension $d$, then for any distribution $\mathcal{D}$ over $X \times \{0,1\}$, if $A$ is given access to at least $m(\epsilon, \delta)$ data points sampled from $\mathcal{D}$, then it will with probability at least $1 - \delta$ learn a function whose (true) generalisation error (under $\mathcal{D}$) is at most $\epsilon$ worse than the best function which $A$ is able to express (in terms ...
I'm going to make a few comments as I read through this, but first I'd like to thank you for taking the time to write this down, since it gives me an opportunity to think through your arguments in a way I wouldn't have done otherwise.
Thank you for the detailed responses! I very much enjoy discussing these topics :)
My impression is that you tend to see this as a statement about flatness, holding over macroscopic regions of parameter space
My intuitions around the RLCT are very much geometrically informed, and I do think of it as being a kind of flatness meas...
I have often said that SLT is not yet a theory of deep learning, this question of whether the infinite data limit is really the right one being one of the main question marks I currently see.
Yes, I agree with this. I think my main objections are (1) the fact that it mostly abstracts away from the parameter-function map, and (2) the infinite-data limit.
My view is that the validity of asymptotics is an empirical question, not something that is settled at the blackboard.
I largely agree, though it depends somewhat on what your aims are. My point there was ma...
That's interesting, thank you for this!
Yes, I meant specifically on LW and in the AI Safety community! In academia, it remains fairly obscure.
I think this is precisely what SLT is saying, and this is nontrivial!
It is certainly non-trivial, in the sense that it takes many lines to prove, but I don't think it tells you very much about the actual behaviour of neural networks.
Note that loss landscape considerations are more important than parameter-function considerations in the context of learning.
One of my core points is, precisely, to deny this claim. Without assumptions about the parameter-function map, you cannot make inferences from the characteristics of the loss landscape to conclusions abou...
My point is precisely that it is not likely to be learned, given the setup I provided, even though it should be learned.
How am I supposed to read this?
What most of us need from a theory of deep learning is a predictive, explanatory account of how neural networks actually behave. If neural networks learn functions which are RLCT-simple rather than functions which are Kolmogorov-simple, then that means SLT is the better theory of deep learning.
I don't know how to read "x^4 has lower RLCT than x^2 despite x^2 being k-simpler" as a critique of SLT unless there is an implicit assumption that neural networks do in fact find x^2 rather than x^4.
which animals cannot do at all; they can't write computer code or a mathematical paper
This is not obvious to me (at least not for some senses of the word "could"). Animals cannot be motivated into attempting to solve these tasks, and they cannot study maths or programming. If they could do those things, then it is not at all clear to me that they wouldn't be able to write code or maths papers. To make this more specific: insofar as humans rely on a capacity for general problem-solving in order to do maths and programming, it would not surprise me if ...
So, the claim is (of course) not that intelligence is zero-one. We know that this is not the case, from the fact that some people are smarter than other people.
As for the other two points, see this comment and this comment.
So, this model of a takeoff scenario makes certain assumptions about how intelligence works, and these assumptions may or may not be correct. In particular, it assumes that the initial AI systems are very far from being algorithmically optimal. We don't know whether or not this will be the case; that is what I am trying to highlight.
The task of extracting knowledge from data is a computational task, which has a certain complexity-theoretic hardness. We don't know what that hardness is, but there is a lower bound on how efficiently this task can be done. Si...
I don't have any good evidence that humans raised without language per se are less intelligent (if we understand "intelligence" to refer to a general ability to solve new problems). For example, Genie was raised in isolation for the first 13 years of her life, and never developed a first language. Some researchers have, for various reasons, guessed that she was born with average intelligence, but that she, as a 14-year old, had a mental age "between a 5- and 8-year-old". However, here we have the confounding factor that she also was severely abused, and th...
I think the broad strokes are mostly similar, but that a bunch of relevant details are different.
Yes, a large collective of near-human AI that is allowed to interact freely over a (subjectively) long period of time is presumably roughly as hard to understand and control as a Bostrom/Yudkowsky-esque God in a box. However, in this scenario, we have the option to not allow free interaction between multiple instances, while still being able to extract useful work from them. It is also probably much easier to align a system that is not of overwhelming intellige...
Note that this proposal is not about automating interpretability.
The point is that you (in theory) don't need to know whether or not the uninterpretable AGI is safe, if you are able to independently verify its output (similarly to how I can trust a mathematical proof, without trusting the mathematician).
Of course, in practice, the uninterpretable AGI presumably needs to be reasonably aligned for this to work. You must at the very least be able to motivate it to write code for you, without hiding any trojans or backdoors that you are not able to detect.
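As a toy sketch of the "trust the proof, not the prover" pattern (my own illustrative example, not part of the original proposal): an untrusted solver proposes a certificate, and a small, easily-audited checker verifies it; we only need to trust the checker.

```python
from typing import List

# Toy illustration (hypothetical example): an untrusted "solver" proposes a
# satisfying assignment for a CNF formula, and a small trusted checker
# verifies the claim. Trust in the output does not require trust in the
# solver that produced it.

Clause = List[int]  # e.g. [1, -2] means (x1 OR NOT x2)

def check_assignment(clauses: List[Clause], assignment: List[bool]) -> bool:
    """Trusted checker: every clause must contain at least one satisfied literal."""
    def literal_true(lit: int) -> bool:
        value = assignment[abs(lit) - 1]
        return value if lit > 0 else not value
    return all(any(literal_true(lit) for lit in clause) for clause in clauses)

# The untrusted solver could be an arbitrarily complex black box; here it is
# stubbed out with a hard-coded proposed answer.
formula = [[1, 2], [-1, 3], [-2, -3]]
proposed = [True, False, True]              # claimed satisfying assignment

print(check_assignment(formula, proposed))  # True -> we can accept the output
```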
However, I think that this is likely to be much easier than solving t...
Yes, I agree with this. I mean, even if we assume that the AIs are basically equivalent to human simulations, they still get obvious advantages from the ability to be copy-pasted, the ability to be restored to a checkpoint, the ability to be run at higher clock speeds, and the ability to make credible pre-commitments, etc etc. I therefore certainly don't think there is any plausible scenario in which unchecked AI systems wouldn't end up with most of the power on earth. However, there is a meaningful difference between the scenario where their advantages ma...
No, I don't have any explicit examples of that. However, I don't think that the main issue with GOFAI systems is necessarily that they have bad performance. Rather, I think the main problem is that they are very difficult and laborious to create. Consider, for example, IBM Watson. I consider this system to be very impressive. However, it took a large team of experts four years of intense engineering to create Watson, whereas you probably could get similar performance in an afternoon by simply fine-tuning GPT-2. That said, this is less of a problem if you can...
To clarify, the proposal is not (necessarily) to use an LLM to create an interpretable AI system that is isomorphic to the LLM -- their internal structure could be completely different. The key points are that the generated program is interpretable and trustworthy, and that it can solve some problem we are interested in.
What is the exact derivation that gives you claim (1)?
No, that is not a misinterpretation: I do think that this research agenda has the potential to get pretty close to solving outer alignment. More specifically, if it is (practically) possible to solve outer alignment through some form of reward learning, then I think this research agenda will establish how that can be done (and prove that this method works), and if it isn't possible, then I think this research agenda will produce a precise understanding of why that isn't possible (which would in turn help to inform subsequent research). I don't think this research agenda is the only way to solve outer alignment, but I think it is the most promising way to do it.