All of Joar Skalse's Comments + Replies

Joar SkalseΩ110

No, that is not a misinterpretation: I do think that this research agenda has the potential to get pretty close to solving outer alignment. More specifically, if it is (practically) possible to solve outer alignment through some form of reward learning, then I think this research agenda will establish how that can be done (and prove that this method works), and if it isn't possible, then I think this research agenda will produce a precise understanding of why that isn't possible (which would in turn help to inform subsequent research). I don't think this research agenda is the only way to solve outer alignment, but I think it is the most promising way to do it.

Joar SkalseΩ110

I'm not sure -- what significance are you placing on the word "tackle" in this context? I would also not say that the main value proposition of this research agenda lies in identifying the ontology of the reward function --- the main questions for this area of research may even be mostly orthogonal to that question.

2Chris_Leong
I was taking it as "solves" or "gets pretty close to solving". Maybe that's a misinterpretation on my part. What did you mean here?
Joar SkalseΩ220

I am actually currently working on developing these ideas further, and I expect to relatively soon be able to put out some material on this (modulo the fact that I have to finish my PhD thesis first).

I also think that you in practice probably would have to allow some uninterpretable components to maintain competitive performance, at least in some domains. One reason for this is of course that there simply might not be any interpretable computer program which solves the given task (*). Moreover, even if such a program does exist, it may plausibly be infeasi... (read more)

1Caspar Oesterheld
Yeah, I think I agree with this and in general with what you say in this paragraph. Along the lines of your footnote, I'm still not quite sure what exactly "X can be understood" must require. It seems to matter, for example, that to a human it's understandable how the given rule/heuristic or something like the given heuristic could be useful. At least if we specifically think about AI risk, all we really need is that X is interpretable enough that we can tell that it's not doing anything problematic (?).

What I had in mind is a situation where we have access to the latent variables during training, and only use the model to prove safety properties in situations that are within the range of the training distribution in some sense (eg, situations where we have some learning-theoretic guarantees). As for treacherous turns, I am implicitly assuming that we don't have to worry about a treacherous turn from the world model, but that we may have to worry about it from the AI policy that we're verifying.

However, note that even this is meaningfully different from j... (read more)

In practice, I think you are unlikely to end up with a schemer unless you train your model to solve some agentic task (or train it to model a system that may itself be a schemer, such as a human). However, in order to guarantee that, I agree we need some additional property (such as interpretability, or some learning-theoretic guarantee).

2ryan_greenblatt
(I think most of the hard-to-handle risk from scheming comes from cases where we can't easily make smarter AIs which we know aren't schemers. If we can get another copy of the AI which is just as smart but which has been "de-agentified", then I don't think scheming poses a substantial threat. (Because e.g. we can just deploy this second model as a monitor for the first.) My guess is that a "world-model" vs "agent" distinction isn't going to be very real in practice. (And in order to make an AI good at reasoning about the world, it will need to actively be an agent in the same way that your reasoning is agentic.) Of course, there are risks other than scheming.)

I'm not so convinced of this. Yes, for some complex safety properties, the world model will probably have to be very smart. However, this does not mean that you have to model everything -- depending on your safety specification and use case, you may be able to factor out a huge amount of complexity. We know from existing cases that this is true on a small scale -- why should it not also be true on a medium or large scale?

For example, with a detailed model of the human body, you may be able to prove whether or not a given chemical could be harmful to ingest... (read more)

I think the distinction between these two cases often can be somewhat vague.

Why do you think that the adversarial case is very different?

I think you're perhaps reading me as being more bullish on Bayesian methods than I in fact am -- I am not necessarily saying that Bayesian methods in fact can solve OOD generalisation in practice, nor am I saying that other methods could not also do this. In fact, I was until recently very skeptical of Bayesian methods, before talking about it with Yoshua Bengio. Rather, my original reply was meant to explain why the Bayesian aspect of Bengio's research agenda is a core part of its motivation, in response to your remark that "from my understanding, the bay... (read more)

4ryan_greenblatt
Insofar as the hope is:

1. Figure out how to approximate sampling from the Bayesian posterior (using e.g. GFlowNets or something).
2. Do something else that makes this actually useful for "improving" OOD generalization in some way.

It would be nice to know what (2) actually is and why we needed step (1) for it. As far as I can tell, Bengio hasn't stated any particular hope for (2) which depends on (1). I agree that if the Bayesian aspect of the agenda did a specific useful thing like '"improve" OOD generalization' or 'allow us to control/understand OOD generalization', then this aspect of the agenda would be useful. However, I think the Bayesian aspect of the agenda won't do this and thus it won't add much value. I agree that Bengio (and others) think that the Bayesian aspect of the agenda will do things like this - but I disagree and don't see the story for this. I agree that "actually use Bayesian methods" sounds like the sort of thing that could help you solve dangerous OOD generalization issues, but I don't think it clearly does. (Unless of course someone has a specific proposal for (2) from my above decomposition which actually depends on (1).)

1-3 don't seem promising/important to me. (4) would be useful, but I don't see why we needed the bayesian aspect of it. If we have some sort of parametric model class which we can make smart enough to reason effectively about the world, just making an ensemble of these surely gets you most of the way there. To be clear, if the hope is "figure out how to make an ensemble of interpretable predictors which are able to model the world as well as our smartest model", then this would be very useful (e.g. it would allow us to avoid ELK issues). But all the action was in making interpretable predictors, no bayesian aspect was required.
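[A minimal sketch of the ensembling baseline mentioned here, as a toy 1-D regression with scikit-learn. The dataset, architecture, and the use of ensemble disagreement as an out-of-distribution uncertainty proxy are all illustrative choices, not anyone's actual proposal.]

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# In-distribution training data: y = sin(x) on [-2, 2].
X_train = rng.uniform(-2, 2, size=(200, 1))
y_train = np.sin(X_train).ravel()

# Train a small ensemble of identical networks that differ only in random seed.
ensemble = [
    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=i).fit(X_train, y_train)
    for i in range(5)
]

def predict_with_uncertainty(X):
    # Ensemble disagreement (std across members) as a crude uncertainty proxy.
    preds = np.stack([m.predict(X) for m in ensemble])
    return preds.mean(axis=0), preds.std(axis=0)

X_in = np.array([[0.5]])   # inside the training range
X_out = np.array([[8.0]])  # far outside the training range
for X in (X_in, X_out):
    mean, std = predict_with_uncertainty(X)
    print(X.ravel()[0], mean[0], std[0])
# Typically the members disagree far more at x = 8.0 than at x = 0.5,
# which is the kind of OOD signal an approximate Bayesian posterior would also provide.
```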

But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.) 

It sounds to me like we agree here, I don't want to put too much weight on "most".

 

Is this true?

It is true in the sense that you don't have any theoretical guarantees, and in the sense that it also often fails to work in practice.

 

Aren't NNs implicitly ensembles of a vast number of models?

They probably are, to some extent. However, in practice, you often find that neural networks make very confident (an... (read more)

2ryan_greenblatt
I agree that you can improve safety by checking outputs in specific ways in cases where we can do so (e.g. requiring the AI to formally verify its code doesn't have side channels). The relevant question is whether there are interesting cases where we can't currently verify the output with conventional tools (e.g. formal software analysis or bridge design), but we can using a world model we've constructed. One story for why we might be able to do this is that the world model is able to generally predict the world as well as the AI.

Suppose you had a world model which was as smart as GPT-3 (but magically interpretable). Do you think this would be useful for something? IMO no, and besides, we don't need the interpretability as we can just train GPT-3 to do stuff. So, it either has to be smarter than GPT-3 or have a wildly different capability profile. I can't really imagine any plausible stories where this world model doesn't end up having to actually be smart, while the world model is also able to massively improve the frontier of what we can check.
2ryan_greenblatt
Worth noting here that I'm mostly talking about Bengio's proposal wrt the bayes-related arguments. And I agree that the world model isn't meant to be a schemer, but it's not as though we can guarantee that without some additional property... (Such as ensuring that the world model is interpretable.)
2ryan_greenblatt
TBC, I would count it as solving an ELK problem if you constructed an interpretable world model which allows you to extract out the latents you want.
2ryan_greenblatt
Won't this depend on the generalization properties of the model? E.g., we can always train a model to predict if a diamond is in the vault on some subset of examples which we can label. From my perspective the core difficulty is either:

* Treacherous turns (or similar high-stakes failures)
* Not being able to properly identify the latents in actual deployment cases

Perhaps you were discussing the case where we magically have access to all the latents and don't need to worry about high-stakes failures? In this case, I agree there are no problems, but the baseline proposal of RLHF while looking at the latents also works perfectly fine.
9ryan_greenblatt
Sure, but why will the bayesian model reliably quantify uncertainty OOD? There is also no guarantee of this (OOD). Whether or not you get reliable uncertainty quantification will depend on your prior. If you have (e.g.) the NN prior, I expect the uncertainty quantification is similar to if you trained an ensemble. E.g., you'll find a bunch of NNs (in the bayesian posterior) which also have the spurious correlation that a trained NN (or ensemble of NNs) would have. If you have some other prior, why can't we regularize our NN to match this? (Maybe I'm confused about this?)

From my understanding, the bayesian aspect of this agenda doesn't add much value.

A core advantage of Bayesian methods is the ability to handle out-of-distribution situations more gracefully, and this is arguably as important as (if not more important than) interpretability. In general, most (?) AI safety problems can be cast as an instance of a case where a model behaves as intended on a training distribution, but generalises in an unintended way when placed in a novel situation. Traditional ML has no straightforward way of dealing with such cases, since i... (read more)

6ryan_greenblatt
I dispute that Bayesian methods will be much better at this in practice.

[ Aside: This seems like about 1/2 of the problem from my perspective. (So I almost agree.) Though, you can shove all AI safety problems into this bucket by doing a maneuver like "train your model on the easy cases humans can label, then deploy into the full distribution". But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.) ]

Is this true? Aren't NNs implicitly ensembles of a vast number of models? Also, does ensembling 5 NNs help? If this doesn't help, why does sampling 5 models from the Bayesian posterior help? Or is it that we needed to approximate sampling 1,000,000 models from the posterior? If we're conservative over a million models, how will we ever do anything?

Do they? I'm skeptical on both of these. It maybe helps a little and rules out some unlikely scenarios, but I'm overall skeptical.

Overall, my view on the Bayesian approach is something like:

* What prior were we using for Bayesian methods? If this is just the NN prior, then I'm not really sold we do much better than just training a NN (or an ensemble of NNs). If our prior is importantly different in a way which we think will help, why can't we regularize to train a NN in a normal way which will vaguely reasonably approximate this prior?
* My main concern is that we can get a smart predictive model which understands OOD cases totally fine, but we still get catastrophic generalization for whatever reason. I don't see why bayesian methods help.
* In the ELK case, our issue is that too much of the prior is human imitation or other problematic generalization. (We can ensemble/sample from the posterior and reject things where our hypotheses don't match, but this will only help so much and I don't really see a strong case for bayes helping more than ensembling.)
* In the case of a treacherous turn, it seems like the core concern was t
Joar SkalseΩ110

You can imagine different types of world models, going from very simple ones to very detailed ones. In a sense, you could perhaps think of the assumption that the input distribution is i.i.d. as a "world model". However, what is imagined is generally something that is much more detailed than this. More useful safety specifications would require world models that (to some extent) describe the physics of the environment of the AI (perhaps including human behaviour, though it would probably be better if this can be avoided). More detail about what the world m... (read more)

Joar Skalse*Ω110

If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.  

I suppose this depends on what you mean by "most". DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function map. This is also true of LSTMs vs transformers, and so on. I think it's fairly likely that these kinds of differences could have a large impact on the probability that a given type o... (read more)

You're right, I put the parameters the wrong way around. I have fixed it now, thanks!

I could have changed it to Why Neural Networks can obey Occam's Razor, but I think this obscures the main point.

I think even this would be somewhat inaccurate (in my opinion). If a given parametric Bayesian learning machine does obey (some version of) Occam's razor, then this must be because of some facts related to its prior, and because of some facts related to its parameter-function map. SLT does not say very much about either of these two things. What the post is about is primarily the relationship between the RLCT and posterior probability, and how th... (read more)

Well neural networks do obey Occam's razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor).
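[For reference, a rough sketch of the formalisation being referred to -- a paraphrase of Watanabe's free energy asymptotics, not a quote from the post; n is the number of samples, L_n the empirical negative log likelihood, w* an optimum, λ the RLCT, and d the parameter count:

$$F_n \;\approx\; n L_n(w^\ast) + \lambda \log n, \qquad \mathrm{BIC} \;=\; n L_n(w^\ast) + \tfrac{d}{2}\log n,$$

with λ = d/2 in the regular case, which is the sense in which the free energy formula generalises the BIC.]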

Would that not imply that my polynomial example also obeys Occam's razor? 

However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something prof... (read more)

I think I recall reading that, but I'm not completely sure.

Note that the activation function affects the parameter-function map, and so the influence of the activation function is subsumed by the general question of what the parameter-function map looks like.

Joar Skalse*Ω230

I'm not sure, but I think this example is pathological.

Yes, it's artificial and cherry-picked to make a certain rhetorical point as simply as possible.

This is the more relevant and interesting kind of symmetry, and it's easier to see what this kind of symmetry has to do with functional simplicity: simpler functions have more local degeneracies.

This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true.
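[A minimal worked illustration of that parameterisation-dependence, using the standard one-dimensional SLT example; the specific exponents are chosen here for concreteness. If the population loss near the optimum takes the form K(w) = w^{2k}, the RLCT is λ = 1/(2k), so two parameterisations realising the same optimal function can have

$$K_1(w) = w^2 \;\Rightarrow\; \lambda = \tfrac{1}{2}, \qquad K_2(v) = v^4 \;\Rightarrow\; \lambda = \tfrac{1}{4},$$

i.e. which functions count as "having more local degeneracies" depends on the parameter-function map, not on the functions alone.]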

You can... (read more)
4Jesse Hoogland
Agreed. So maybe what I'm actually trying to get at is a statement about what "universality" means in the context of neural networks. Just as the microscopic details of physical theories don't matter much to their macroscopic properties in the vicinity of critical points ("universality" in statistical physics), and just as the microscopic details of random matrices don't seem to matter for their bulk and edge statistics ("universality" in random matrix theory), many of the particular choices of neural network architecture don't seem to matter for learned representations ("universality" in DL).

What physics and random matrix theory tell us is that a given system's universality class is determined by its symmetries. (This starts to get at why we SLT enthusiasts are so obsessed with neural network symmetries.) In the case of learning machines, those symmetries are fixed by the parameter-function map, so I totally agree that you need to understand the parameter-function map.

However, focusing on symmetries is already a pretty major restriction. If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.

There's another important observation, which is that neural network symmetries leave geometric traces. Even if the RLCT on its own does not "solve" generalization, the SLT-inspired geometric perspective might still hold the answer: it should be possible to distinguish neural networks from the polynomial example you provided by understanding the geometry of the loss landscape. The ambitious statement here might be that all the relevant information you might care about (in terms of understanding universality) is already contained in the loss landscape. If that's the case, my concern about focusing on the parameter-function map is that it would pose a distraction. It could miss the forest for the trees if you're trying to understand the structure that develops and phe

Will do, thank you for the reference!

Yes, I completely agree. The theorems that have been proven by Watanabe are of course true and non-trivial facts of mathematics; I do not mean to dispute this. What I do criticise is the magnitude of the significance of these results for the problem of understanding the behaviour of deep learning systems.

Thank you for this -- I agree with what you are saying here. In the post, I went with a somewhat loose equivocation between "good priors" and "a prior towards low Kolmogorov complexity", but this does skim past a lot of nuance. I do also very much not want to say that the DNN prior is exactly towards low Kolmogorov complexity (this would be uncomputable), but only that it is mostly correlated with Kolmogorov complexity for typical problems.

Yes, I mostly just mean "low test error". I'm assuming that real-world problems follow a distribution that is similar to the Solomonoff prior (i.e., that data generating functions are more likely to have low Kolmogorov complexity than high Kolmogorov complexity) -- this is where the link is coming from. This is an assumption about the real world, and not something that can be established mathematically.
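[To spell the assumed link out slightly, in standard Solomonoff/Levin notation -- this is just the sense of "similar to the Solomonoff prior" used above, not a new claim: the prior weight of a data generating function f is roughly

$$m(f) \;=\; \sum_{p\,:\,U(p)=f} 2^{-|p|} \;\approx\; 2^{-K(f)},$$

so most of the prior mass sits on functions of low Kolmogorov complexity K(f), and the empirical assumption is that real-world data generating processes are distributed in roughly this way.]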

I think that it gives us an adequate account of generalisation in the limit of infinite data (or, more specifically, in the case where we have enough data to wash out the influence of the inductive bias); this is what my original remark was about. I don't think classical statistical learning theory gives us an adequate account of generalisation in the setting where the training data is small enough for our inductive bias to still matter, and it only has very limited things to say about out-of-distribution generalisation.

3Alexander Gietelink Oldenziel
I see, thanks!

The assumption that small neural networks are a good match for the actual data generating process of the world, is equivalent to the assumption that neural networks have an inductive bias that gives large weight to the actual data generating process of the world, if we also append the claim that neural networks have an inductive bias that gives large weight to functions which can be described by small neural networks (and this latter claim is not too difficult to justify, I think).

I think the second one by Carroll is quite careful to say things like "we can now understand why singular models have the capacity to generalise well" which seems to me uncontroversial, given the definitions of the terms involved and the surrounding discussion.

The title of the post is Why Neural Networks obey Occam's Razor! It also cites Zhang et al, 2017, and immediately after this says that SLT can help explain why neural networks have the capacity to generalise well. This gives the impression that the post is intended to give a solution to problem (ii) ... (read more)

Well neural networks do obey Occam's razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor).

I think that expression of Jesse's is also correct, in context.

However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) ... (read more)

5Liam Carroll
I would argue that the title is sufficiently ambiguous as to what is being claimed, and actually the point of contention in (ii) was discussed in the comments there too. I could have changed it to Why Neural Networks can obey Occam's Razor, but I think this obscures the main point. Regular linear regression could also obey Occam's razor (i.e. "simpler" models are possible) if you set high-order coefficients to 0, but the posterior of such models does not concentrate on those points in parameter space.

At the time of writing, basically nobody knew anything about SLT, so I think it was warranted to err on the side of grabbing attention in the introductory paragraphs and then explaining in detail further on with "we can now understand why singular models have the capacity to generalise well", instead of caveating the whole topic out of existence before the reader knows what is going on.

As we discussed at Berkeley, I do like the polynomial example you give and this whole discussion has made me think more carefully about various aspects of the story, so thanks for that. My inclination is that the polynomial example is actually quite pathological and that there is a reasonable correlation between the RLCT and Kolmogorov complexity in practice (e.g. the one-node subnetwork preferred by the posterior compared to the two-node network in DSLT4), but I don't know enough about Kolmogorov complexity to say much more than that.

Anyway I'm guessing you're probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).

Yes, absolutely. However, I also don't think that (i) is very mysterious, if we view things from a Bayesian perspective. Indeed, it seems natural to say that an ideal Bayesian reasoner should assign non-zero prior probability to all computable models, or something along those lines, and in that case, notions like "overparameterised" no longer seem very significant.

Maybe that has significant overlap with the cr... (read more)
4Daniel Murfet
Seems reasonable to me!

A few things:

1. Neural networks do typically learn functions with low Kolmogorov complexity (otherwise they would not be able to generalise well).
2. It is a type error to describe a function as having low RLCT. A given function may have a high RLCT or a low RLCT, depending on the architecture of the learning machine. 
3. The critique is against the supposition that we can use SLT to explain why neural networks generalise well in the small-data regime. The example provides a learning machine which would not generalise well, but which does fit all assump... (read more)

4Daniel Murfet
I don't understand the strong link between Kolmogorov complexity and generalisation you're suggesting here. I think by "generalisation" you must mean something more than "low test error". Do you mean something like "out of distribution" generalisation (whatever that means)?

To say that neural networks are empirical risk minimisers is just to say that they find functions with globally optimal training loss (and, if they find functions with a loss close to the global optimum, then they are approximate empirical risk minimisers, etc). 

I think SLT effectively assumes that neural networks are (close to being) empirical risk minimisers, via the assumption that they are trained by Bayesian induction.
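[Spelled out, as a standard definition included only for concreteness, with hypothesis class F, loss ℓ, and n training points (x_i, y_i):

$$\hat{f} \;\in\; \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big),$$

and an approximate empirical risk minimiser is one whose training loss is within some ε of this minimum.]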

The bounds are not exactly vacuous -- in fact, they are (in a sense) tight. However, they concern a somewhat adversarial setting, where the data distribution may be selected arbitrarily (including by making it maximally opposed to the inductive bias of the learning algorithm). This means that the bounds end up being much larger than what you would typically observe in practice, if you give typical problems to a learning algorithm whose inductive bias is attuned to the structure of "typical" problems. 

If you move from this adversarial setting to a more... (read more)
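[As a toy illustration of how large the distribution-free bound gets -- the numbers below are purely hypothetical, and the sample-complexity form n ≈ (d + ln(1/δ))/ε² is the standard agnostic PAC expression with constants dropped:]

```python
import math

def pac_sample_bound(vc_dim: int, epsilon: float, delta: float) -> float:
    """Order-of-magnitude sample size from the agnostic PAC bound,
    n ~ (d + ln(1/delta)) / epsilon^2, with constants omitted."""
    return (vc_dim + math.log(1.0 / delta)) / epsilon**2

# Hypothetical numbers: a hypothesis class with VC dimension 10^7
# (very roughly, a large over-parameterised network), target excess
# error 1%, failure probability 1%.
n = pac_sample_bound(10_000_000, epsilon=0.01, delta=0.01)
print(f"{n:.1e} samples")
# ~1.0e+11 -- vastly more data than such models need in practice on
# "typical" (non-adversarial) problems, which is the gap described above.
```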

I already posted this in response to Daniel Murfet, but I will copy it over here:

For example, the agnostic PAC-learning theorem says that if a learning machine A (for binary classification) is an empirical risk minimiser with VC dimension d, then for any distribution D over X × {0,1}, if A is given access to at least O((d + log(1/δ))/ε²) data points sampled from D, then it will with probability at least 1 − δ learn a function whose (true) generalisation error (under D) is at most ε... (read more)

2Cleo Nardo
The impression I got was that SLT is trying to show why (transformers + SGD) behaves anything like an empirical risk minimiser in the first place. Might be wrong though.

Does this not essentially amount to just assuming that the inductive bias of neural networks in fact matches the prior that we (as humans) have about the world?

This is basically a justification of something like your point 1, but AFAICT it's closer to a proof in the SLT setting than in your setting.

I think it could probably be turned into a proof in either setting, at least if we are allowed to help ourselves to assumptions like "the ground truth function is generated by a small neural net" and "learning is done in a Bayesian way", etc.

2DanielFilan
No? It amounts to assuming that smaller neural networks are a better match for the actual data generating process of the world.

In your example there are many values of the parameters that encode the zero function

Ah, yes, I should have made the training data be (1,1), rather than (0,0). I've fixed the example now!

Is that a fair characterisation of the argument you want to make?

Yes, that is exactly right!

Assuming it is, my response is as follows. I'm guessing you think x^2 is simpler than x^4 because the former function can be encoded by a shorter code on a UTM than the latter.

The notion of complexity that I have in mind is even more pre-theoretic than that; it'... (read more)

5Daniel Murfet
Re: the articles you link to. I think the second one by Carroll is quite careful to say things like "we can now understand why singular models have the capacity to generalise well", which seems to me uncontroversial, given the definitions of the terms involved and the surrounding discussion.

I agree that Jesse's post has a title, "Neural networks generalize because of this one weird trick", which is clickbaity, since SLT does not in fact yet explain why neural networks appear to generalise well on many natural datasets. However, the actual article is more nuanced, saying things like "SLT seems like a promising route to develop a better understanding of generalization and the limiting dynamics of training". Jesse gives a long list of obstacles to walking this route. I can't find anything in the post itself to object to. Maybe you think its optimism is misplaced, and fair enough.

So I don't really understand what claims about inductive bias or generalisation behaviour in these posts you think are invalid?

I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don't think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?


That seems probable. Maybe it's useful for me to lay out a more or less complete picture of what I think SLT does say about generalisation in deep learning in its current form, so that we're on the same page. When people refer to the "generalisation puzzle" in... (read more)

For example, the agnostic PAC-learning theorem says that if a learning machine A (for binary classification) is an empirical risk minimiser with VC dimension d, then for any distribution D over X × {0,1}, if A is given access to at least O((d + log(1/δ))/ε²) data points sampled from D, then it will with probability at least 1 − δ learn a function whose (true) generalisation error (under D) is at most ε worse than the best function which A is able to express (in terms ... (read more)
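[A compact restatement of the bound just quoted, in standard agnostic PAC notation; the inline symbols above are a reconstruction of formatting that was stripped:

$$n \ge O\!\left(\frac{d + \ln(1/\delta)}{\epsilon^2}\right) \;\Longrightarrow\; \Pr\!\Big[\, L_{\mathcal{D}}(\hat{f}) \le \min_{f \in \mathcal{F}} L_{\mathcal{D}}(f) + \epsilon \,\Big] \ge 1 - \delta,$$

where d is the VC dimension of the hypothesis class F, L_D the true risk under the data distribution D, and f̂ the function returned by the empirical risk minimiser.]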

4Alexander Gietelink Oldenziel
The linked post you wrote about classical learning theory states that the bounds PAC gives are far looser than what we see in practice for neural networks. In the post you sketch some directions in which tighter bounds may be proven. It is my understanding that these directions have not been pursued further. Given all that, "fully adequate account of generalization" seems like an overstatement, wouldn't you agree? At best we can say that PAC gives a nice toy model for thinking about notions like generalization and learnability, as far as I can tell. Maybe I'm wrong (I'm not familiar with the literature) and I'd love to know more about what PAC & classical learning theory can tell us about neural networks.

I'm going to make a few comments as I read through this, but first I'd like to thank you for taking the time to write this down, since it gives me an opportunity to think through your arguments in a way I wouldn't have done otherwise.

Thank you for the detailed responses! I very much enjoy discussing these topics :)

My impression is that you tend to see this as a statement about flatness, holding over macroscopic regions of parameter space

My intuitions around the RLCT are very much geometrically informed, and I do think of it as being a kind of flatness meas... (read more)

I have often said that SLT is not yet a theory of deep learning, this question of whether the infinite data limit is really the right one being among one of the main question marks I currently see.

Yes, I agree with this. I think my main objections are (1) the fact that it mostly abstracts away from the parameter-function map, and (2) the infinite-data limit.

My view is that the validity of asymptotics is an empirical question, not something that is settled at the blackboard.

I largely agree, though depends somewhat on what your aims are. My point there was ma... (read more)

4Alexander Gietelink Oldenziel
"My point there was mainly that theorems about generalisation in the infinite-data limit are likely to end up being weaker versions of more general results from statistical and computational learning theory." What general results from statistical and computational learning theory are you referring to here exactly?

That's interesting, thank you for this!

Yes, I meant specifically on LW and in the AI Safety community! In academia, it remains fairly obscure.

I think this is precisely what SLT is saying, and this is nontrivial!

It is certainly non-trivial, in the sense that it takes many lines to prove, but I don't think it tells you very much about the actual behaviour of neural networks.

Note that loss landscape considerations are more important than parameter-function considerations in the context of learning.

One of my core points is, precisely, to deny this claim. Without assumptions about the parameter function map, you cannot make inferences from the characteristics of the loss landscape to conclusions abou... (read more)

My point is precisely that it is not likely to be learned, given the setup I provided, even though it should be learned.

 

How am I supposed to read this?

What most of us need from a theory of deep learning is a predictive, explanatory account of how neural networks actually behave. If neural networks learn functions which are RLCT-simple rather than functions which are Kolmogorov-simple, then that means SLT is the better theory of deep learning.

I don't know how to read "x^4 has lower RLCT than x^2 despite x^2 being k-simpler" as a critique of SLT unless there is an implicit assumption that neural networks do in fact find x^2 rather than x^4.

1Oliver Sourbut
Thanks, amended the link in my comment to point to this updated version

 which animals cannot do at all, they can't write computer code or a mathematical paper


This is not obvious to me (at least not for some senses of the word "could"). Animals cannot be motivated into attempting to solve these tasks, and they cannot study maths or programming. If they could do those things, then it is not at all clear to me that they wouldn't be able to write code or maths papers. To make this more specific; insofar as humans rely on a capacity for general problem-solving in order to do maths and programming, it would not surprise me if ... (read more)

So, the claim is (of course) not that intelligence is zero-one. We know that this is not the case, from the fact that some people are smarter than other people. 

As for the other two points, see this comment and this comment.

So, this model of a takeoff scenario makes certain assumptions about how intelligence works, and these assumptions may or may not be correct. In particular, it assumes that the initial AI systems are very far from being algorithmically optimal. We don't know whether or not this will be the case; that is what I am trying to highlight.

The task of extracting knowledge from data is a computational task, which has a certain complexity-theoretic hardness. We don't know what that hardness is, but there is a lower bound on how efficiently this task can be done. Si... (read more)

1mishka
Right. But here we are talking about a specific class of tasks (which animals cannot do at all; they can't write computer code or a mathematical paper; so, evolutionarily speaking, humans are new at that, and probably not anywhere close to any equilibrium, because this class of tasks is very novel on the evolutionary timescale).

Moreover, we know a lot about human performance at those tasks, and it's abysmal, even for top humans, and for AI research as a field. For example, there was a paper in Nature in 2000 explaining that ReLU induce semantically meaningful sparseness. The field ignored this for over a decade, then rediscovered in 2009-2011 that ReLU were great, and by 2015 ReLU became dominant. That's typical; there are plenty of examples like that (obvious discoveries not made for decades (e.g. the ReLU discovery in question was really due in the early 1970s), then further ignored for a decade or longer after the initial publication before being rediscovered and picked up). AI could simply try various things lost in the ignored old papers for a huge boost (there is a lot of actionable, pretty strong stuff in those; not everything gets picked up, like ReLU; a lot remains published, but ignored by the field).

Anyone who has attended an AutoML conference recently knows that the field of AutoML is simply too big to deal with (too many promising methods for neural architecture search, for hyperparameter optimization, for metalearning of optimization algorithms; so we have all these things in metalearning and in "AI-generating algorithms" which can provide a really large boost over the status quo if done correctly, but which are too difficult to fully take advantage of because of human cognitive limitations, as the whole field is just "too large a mess to think effectively about").

So it seems that, at least, there is quite a bit of room for a large initial boost over the current human-equivalent capacity. If one starts with a level of top AI researcher at a good company today, t

I don't have any good evidence that humans raised without language per se are less intelligent (if we understand "intelligence" to refer to a general ability to solve new problems). For example, Genie was raised in isolation for the first 13 years of her life, and never developed a first language. Some researchers have, for various reasons, guessed that she was born with average intelligence, but that she, as a 14-year old, had a mental age "between a 5- and 8-year-old". However, here we have the confounding factor that she also was severely abused, and th... (read more)

Joar Skalse*Ω221

I think the broad strokes are mostly similar, but that a bunch of relevant details are different.

Yes, a large collective of near-human AI that is allowed to interact freely over a (subjectively) long period of time is presumably roughly as hard to understand and control as a Bostrom/Yudkowsky-esque God in a box. However, in this scenario, we have the option to not allow free interaction between multiple instances, while still being able to extract useful work from them. It is also probably much easier to align a system that is not of overwhelming intellige... (read more)

Note that this proposal is not about automating interpretability.

The point is that you (in theory) don't need to know whether or not the uninterpretable AGI is safe, if you are able to independently verify its output (similarly to how I can trust a mathematical proof, without trusting the mathematician).

Of course, in practice, the uninterpretable AGI presumably needs to be reasonably aligned for this to work. You must at the very least be able to motivate it to write code for you, without hiding any trojans or backdoors that you are not able to detect.

However, I think that this is likely to be much easier than solving t... (read more)
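[A toy sketch of the "trust the output, not the producer" pattern described here. Everything below is a hypothetical illustration: the "untrusted" solver stands in for an uninterpretable system whose internals we never inspect, and the checker is small enough to audit.]

```python
def untrusted_factor(n: int) -> tuple[int, int]:
    """Stand-in for an untrusted, uninterpretable producer: we make no
    assumptions about how it works, we only look at what it outputs."""
    for p in range(2, int(n**0.5) + 1):
        if n % p == 0:
            return p, n // p
    return 1, n  # no non-trivial factorisation found

def trusted_verifier(n: int, factors: tuple[int, int]) -> bool:
    """Small, fully auditable check: accept the answer only if it is
    demonstrably correct, regardless of who or what produced it."""
    p, q = factors
    return p > 1 and q > 1 and p * q == n

n = 52_183 * 71_143
answer = untrusted_factor(n)
if trusted_verifier(n, answer):
    print("accepted:", answer)  # correctness was checked independently
else:
    print("rejected: output failed independent verification")
```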

  1. This is obviously true; any AI complete problem can be trivially reduced to the problem of writing an AI program that solves the problem. That isn't really a problem for the proposal here. The point isn't that we could avoid making AGI by doing this, the point is that we can do this in order to get AI systems that we can trust without having to solve interpretability.
  2. This is probably true, but the extent to which it is true is unclear. Moreover, if the inner workings of intelligence are fundamentally uninterpretable, then strong interpretability must also fail. I already commented on this in the last two paragraphs of the top-level post.
2beren
Maybe I'm being silly but then I don't understand the safety properties of this approach. If we need an AGI based on uninterpretable DL to build this, then how do we first check if this AGI is safe?

Yes, I agree with this. I mean, even if we assume that the AIs are basically equivalent to human simulations, they still get obvious advantages from the ability to be copy-pasted, the ability to be restored to a checkpoint, the ability to be run at higher clock speeds, and the ability to make credible pre-commitments, etc etc. I therefore certainly don't think there is any plausible scenario in which unchecked AI systems wouldn't end up with most of the power on earth. However, there is a meaningful difference between the scenario where their advantages ma... (read more)

3johnswentworth
I think this scenario is still strategically isomorphic to "advantages mainly come from overwhelmingly great intelligence". It's intelligence at the level of a collective, rather than the individual level, but the conclusion is the same. For instance, scalable oversight of a group of AIs which is collectively far smarter than any group of humans is hard in basically the same ways as oversight of one highly-intelligent AI. Boxing the group of AIs is hard for the same reasons as boxing one. Etc.

No, I don't have any explicit examples of that. However, I don't think that the main issue with GOFAI systems necessarily is that they have bad performance. Rather, I think the main problem is that they are very difficult and laborious to create. Consider, for example, IBM Watson. I consider this system to be very impressive. However, it took a large team of experts four years of intense engineering to create Watson, whereas you probably could get similar performance in an afternoon by simply fine-tuning GPT-2. However, this is less of a problem if you can... (read more)

To clarify, the proposal is not (necessarily) to use an LLM to create an interpretable AI system that is isomorphic to the LLM -- their internal structure could be completely different. The key points are that the generated program is interpretable and trustworthy, and that it can solve some problem we are interested in. 

What is the exact derivation that gives you claim (1)?

3Ege Erdil
Check the Wikipedia section for the stationary distribution of the overdamped Langevin equation. I should probably clarify that it's difficult to have a rigorous derivation of this claim in the context of SGD in particular, because it's difficult to show absence of heteroskedasticity in SGD residuals. Still, I believe that this is probably negligible in practice, and in principle this is something that can be tested by experiment.
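[For reference, the standard fact being pointed to, stated for the continuous-time idealisation -- which is where the caveat about SGD residuals enters: the overdamped Langevin dynamics

$$d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t$$

has stationary distribution

$$p(\theta) \propto \exp\!\big(-\beta L(\theta)\big),$$

a Gibbs distribution over the loss; treating SGD as an approximation of this process is what underlies the claim, modulo the heteroskedasticity caveat above.]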