I like this post very much, and in general I think research along these lines is on the right track towards solving potential problems with Goodhart's law -- Bayesian reasoning and getting some representation of the agent's uncertainty (including uncertainty over our values!) seems very important and naturally ameliorates a lot of potential problems. The correctness and realizability of the prior are very general problems with Bayesianism but often do not thwart its usefulness in practice, although they allow people to come up with various convoluted c...
While I agree with a lot of points of this post, I want to quibble with the 'RL does not maximise reward' point. I agree that model-free RL algorithms like DPO do not directly maximise reward but instead 'maximise reward' in the same way that self-supervised models 'minimise cross-entropy' -- that is to say, the model is not explicitly reasoning about minimising cross-entropy but learns distilled heuristics that end up resulting in policies/predictions with good reward/cross-entropy. However, it is also possible to produce architectures that do directly optimise for...
This monograph by Bertsekas on the interrelationship between offline RL and online MCTS/search might be interesting -- http://www.athenasc.com/Frontmatter_LESSONS.pdf -- since it argues that we can conceptualise the contribution of MCTS as essentially that of a single Newton step from the offline starting point towards the solution of the Bellman equation. If this is actually the case (I haven't worked through all the details yet), then it seems it could be used to provide some kind of bound on the improvement/divergence you can get once you add online planning to a model-free policy.
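To gesture at the claim (my rough paraphrase of the monograph's framing, with $\tilde J$ as the offline-learned value function -- apologies if I garble details):

$$(TJ)(s) = \max_a \, \mathbb{E}\big[\, r(s,a) + \gamma J(s') \,\big], \qquad J^* = TJ^*$$

One-step lookahead/MCTS on $\tilde J$ selects the policy $\mu$ with $T_\mu \tilde J = T \tilde J$; since $T_\mu$ is affine in $J$, evaluating the lookahead policy (solving $J = T_\mu J$) is exactly one Newton iteration on the Bellman equation starting from $\tilde J$, which is where a local quadratic bound like $\lVert J_\mu - J^* \rVert = O(\lVert \tilde J - J^* \rVert^2)$ would come from.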
Thanks for writing this! Here are some of my rough thoughts and comments.
One of my big disagreements with this threat model is that it assumes it is hard to get an AGI to understand / successfully model 'human values'. I think this is obviously false. LLMs already have a very good understanding of 'human values' as they are expressed linguistically, and existing alignment techniques like RLHF/RLAIF seem to do a reasonably good job of making the models' output align with these values (specifically generic corporate wokeness for OpenAI/Anthropic) which does ...
Thanks for the response! Very helpful and enlightening.
The reason for this is actually pretty simple: genes with linear effects have an easier time spreading throughout a population.
This is interesting -- I have never come across this. Can you expand on the intuition behind this model a little more? Is the intuition something like: in the fitness landscape, genes with linear effects are like gentle slopes that are easy to traverse, versus extremely wiggly 'directions'?
Also, how I am thinking about linearity is maybe slightly different from the normal ANOVA/factor ana...
This would be very exciting if true! Do we have a good (or any) sense of the mechanisms by which these genetic variants work -- how many are actually causal, how many are primarily active in development vs in adults, how much interference there is between different variants etc?
I am also not an expert at all here -- do we have any other examples of traits being enhanced or diseases cured by genetic editing in adults (even in other animals) like this? It also seems like this would be easy to test in the lab -- i.e. in mice, which we can presumably sequence and edit more straightforwardly, and for which we can measure some analogues of IQ with reasonable accuracy and reliability. Looking forward to the longer post.
Do we have a good (or any) sense of the mechanisms by which these genetic variants work -- how many are actually causal, how many are primarily active in development vs in adults, how much interference there is between different variants etc?
No, we don't understand the mechanism by which most of them work (other than that they influence the level of and timing of protein expression). We have a pretty good idea of which are causal based on sibling validation, but there are some limitations to this knowledge because genetic variants physically close to on...
This is an interesting idea. I feel this also has to be related to increasing linearity with scale and generalization ability -- i.e. if you have a memorised solution, then nonlinear representations are fine because you can easily tune the 'boundaries' of the nonlinear representation to precisely delineate the datapoints (in fact, the nonlinearity of the representation can be used to strongly reduce interference when memorising, as is done in the recent research on modern Hopfield networks). On the other hand, if you require a kind of reasonably large-scale...
Looks like I really need to study some SLT! I will say though that I haven't seen many cases in transformer language models where the eigenvalues of the Hessian are 90% zeros -- that seems extremely high.
I also think this is mostly a semantic issue. The same process can be described in terms of implicit prediction errors: e.g. there is some baseline level of leptin in the bloodstream that the NPY/AgRP neurons in the arcuate nucleus 'expect', and if there is less leptin this generates an implicit 'prediction error' in those neurons that causes them to increase firing, which then stimulates various food-consuming reflexes and desires, which ultimately leads to more food and hence 'corrects' the prediction error. It isn't necessary that anywhere there...
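A toy sketch of what I mean by an 'implicit' prediction error (all numbers and names are made up, Python just for illustration):

```python
def agrp_firing(leptin_level, expected_leptin=1.0, gain=5.0):
    # The response *looks like* a rectified prediction error (expected - observed),
    # but nothing in the circuit needs to explicitly represent a 'prediction'
    # or an 'error' anywhere -- it is just a setpoint-relative response.
    return max(0.0, gain * (expected_leptin - leptin_level))

hunger_drive = agrp_firing(leptin_level=0.6)  # low leptin -> more firing -> feeding reflexes
```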
This is where I disagree! I don't think the Morrison and Berridge experiment demonstrates the model-based side. It is consistent with model-based RL, but it is also consistent with model-free algorithms that can flexibly adapt to changing reward functions, such as linear RL. Personally, I think the latter is more likely, since it is such a low-level response which can be modulated entirely by subcortical systems and so seems unlikely to require model-based planning to work.
Thanks for linking to your papers, and it is definitely interesting that you have been thinking along similar lines. I think the key reason studying this is important is that these hedonic loops demonstrate that a.) mammals, including humans, are actually exceptionally well aligned to basic homeostatic needs and basic hedonic loops in practice. It is extremely hard and rare for people to choose not to follow homeostatic drives. I think the reason humans are mostly 'misaligned' about higher-level things like morality, empathy, etc. is that we don't actually have di...
This is definitely possible and is essentially augmenting the state variables with additional homeostatic variables and then learning policies on the joint state space. However, there are some clever experiments, such as the linked Morrison and Berridge one, demonstrating that this is not all that is going on -- specifically, many animals appear to be able to perform zero-shot changes in policy when rewards change, even if they have not experienced that specific homeostatic state before -- e.g. mice suddenly chase after salt water, which they previously disliked, when put in a state of salt deprivation which they had never experienced before.
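To be concrete about what 'augmenting the state with homeostatic variables' would look like (everything here is a made-up toy, not a model of the experiment):

```python
import numpy as np

external_obs = np.random.randn(10)        # sensory observation
homeostatic = np.array([0.2, 0.9])        # e.g. [salt level, hydration] -- invented internal variables
joint_state = np.concatenate([external_obs, homeostatic])  # the policy is learned over this joint state

def reward_for_salt(amount_consumed, salt_level, setpoint=0.5):
    # Reward for consuming salt flips sign depending on the internal salt level.
    return amount_consumed * (setpoint - salt_level)
```

The catch is that a purely learned policy over this joint space would still seem to need experience of the deprived region of the space to behave correctly there, which is what the zero-shot salt result appears to rule out.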
The 'four years' they explicitly mention does seem very short to me for ASI unless they know something we don't...
AI x-risk is not far off at all, it's something like 4 years away IMO
Can I ask where this four-year number is coming from? It was also stated prominently in the new 'superalignment' announcement (https://openai.com/blog/introducing-superalignment). Is this some agreed-upon median timeline at OAI? Is there an explicit plan to build AGI in four years? Is there strong evidence behind this view -- i.e. do you think you know how to build AGI explicitly and it will just take four more years of compute/scaling?
Sure. First of all, disclaimer: This is my opinion, not that of my employer. (I'm not supposed to say what my employer thinks.) Yes, I think I know how to build AGI. Lots of people do. The difficult innovations are already behind us, now it's mostly a matter of scaling. And there are at least two huge corporate conglomerates in the process of doing so (Microsoft+OpenAI and Alphabet+GoogleDeepMind).
There's a lot to say on the subject of AGI timelines. For miscellaneous writings of mine, see AI Timelines - LessWrong. But for the sake of brevity I'd rec...
Hi there! Thanks for this comment. Here are my thoughts:
- Where do highly capable proposals/amortised actions come from?
- (handwave) lots of 'experience' and 'good generalisation'?
Pretty much this. We know empirically that deep learning generalizes pretty well from a lot of data as long as it is reasonably representative. I think that fundamentally this is due to the nature of our reality: there are generalizable patterns, which is ultimately due to the sparse underlying causal graph. It is very possible that there are realities where this isn't true ...
The problem is not so much which one of 1, 2, 3 to pick but whether 'we' get a chance to pick it at all. If there is space, free energy, and diversity, there will be evolution going on among populations, and evolution will consistently push things towards more reproduction up until it hits a Malthusian limit, at which point it will push towards greater competition and economic/reproductive efficiency. The only way to avoid this is to remove the preconditions for evolution -- any of variation, selection, or heredity -- but these seem quite natural in a world of large AI populations, so in practice this will require some level of centralized control.
This is obviously true; any AI complete problem can be trivially reduced to the problem of writing an AI program that solves the problem. That isn't really a problem for the proposal here. The point isn't that we could avoid making AGI by doing this, the point is that we can do this in order to get AI systems that we can trust without having to solve interpretability.
Maybe I'm being silly but then I don't understand the safety properties of this approach. If we need an AGI based on uninterpretable DL to build this, then how do we first check if this AGI is safe?
I moderately agree here, but I still think the primary factor is centralization of the value chain: the more of the value chain is centralized, the easier it is to control. My guess is that we can make this argument more formal by thinking in terms of a dependency graph -- if we imagine the economic process from sand + energy -> DL models, then the important measure is the centrality of the hubs in this graph. If we can control and/or cut these hubs, then the entire DL ecosystem falls apart. Conveniently/unfortunately, this is also where most of the economic profit is likely to accumulate by standard industrial economic laws, and hence this is also where there will be the most resources resisting regulation.
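As a toy illustration of the 'centrality of hubs' idea (the graph below is a crude sketch of the value chain, not real data):

```python
import networkx as nx

# Toy dependency graph for the sand + energy -> DL models pipeline.
G = nx.DiGraph([
    ("sand", "wafers"), ("energy", "wafers"),
    ("wafers", "fabs"), ("EUV lithography", "fabs"),
    ("fabs", "GPUs"), ("GPUs", "datacenters"),
    ("datacenters", "DL models"),
])

# Nodes with high betweenness centrality are the chokepoints: controlling or
# cutting them disconnects the upstream inputs from the downstream DL models.
print(sorted(nx.betweenness_centrality(G).items(), key=lambda kv: -kv[1]))
```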
As I see it, there are two fundamental problems here:
1.) Generating interpretable expert-system code for an AGI is probably already AGI-complete. It seems unlikely that a non-AGI DL model can output code for an AGI -- especially given that it is highly unlikely that there would be expert-system AGIs in its training set, or even things close to expert-system AGIs, if deep learning keeps far outpacing GOFAI techniques.
2.) Building an interpretable expert-system AGI is likely not just AGI-complete but a fundamentally much harder problem than building a D...
Interesting post! Do you have papers for the claims on why mixed activation functions perform worse? This is something I have thought about a little but not looked into deeply, so I would appreciate links. My naive thinking is that it mostly doesn't work due to the difficulty of conditioning and keeping the loss landscape smooth and low-curvature with different activation functions in a layer. With a single activation function, it is relatively straightforward to design an initialization that doesn't blow up -- with mixed ones, your space of potential numerical difficulties seems to increase massively.
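As a quick illustration of the initialization worry (a numpy sketch with arbitrarily chosen activations, not anything from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)  # pre-activations ~ N(0, 1)

for name, act in [("relu", lambda v: np.maximum(v, 0.0)),
                  ("tanh", np.tanh),
                  ("identity", lambda v: v)]:
    var = act(x).var()
    # The gain needed to keep output variance ~1 differs per activation, so a layer
    # mixing activations needs per-unit corrections to stay well conditioned.
    print(f"{name}: output variance {var:.3f}, corrective gain {1/np.sqrt(var):.2f}")
```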
Exactly this. This is the relationship in RL between the discount factor and the probability of transitioning into an absorbing state (death).
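Concretely, the identity I have in mind is that discounting by $\gamma$ is equivalent to an undiscounted objective with a per-step probability $1-\gamma$ of entering a zero-reward absorbing state:

$$\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t\Big] \;=\; \mathbb{E}\Big[\sum_{t=0}^{T} r_t\Big], \qquad T \sim \mathrm{Geometric}(1-\gamma)$$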
I think this is a really good post. You might be interested in these two posts, which explore very similar arguments about the interaction between search in the world model and more general 'intuitive policies', as well as the fact that we are always optimizing for our world/reward model rather than reality, and how this affects how agents act.
Yes! This would be valuable. Generally, getting a sense of the 'self-awareness' of a model in terms of how much it knows about itself would be a valuable thing to start testing for.
I don't think models currently have this ability by default anyway. But we should definitely think very hard before letting them do this!
Yes, I think what I proposed here is the broadest and crudest thing that will work. It can of course be much more targeted to specific proposals or posts that we think are potentially most dangerous. Using existing language models to rank these is an interesting idea.
I'm very glad you wrote this. I have had similar musings previously as well, but it is really nice to see this properly written up and analyzed in a more formal manner.
Interesting thoughts! By the way, are you familiar with Hugo Touchette's work on this? It looks very related and I think it has a lot of cool insights about these sorts of questions.
I think this is a good intuition. I think this comes down to the natural structure of the graph and the fact that information disappears at larger distances. This means that for dense graphs such as lattices etc., regions only implicitly interact through much lower-dimensional max-ent variables, which are then additive, while for other causal graph structures, such as the power-law small-world graphs that are probably sensible for many real-world datasets, you also get a similar thing where each cluster can be modelled mostly independently apart from a few long...
Maybe this linearity story would work better for generative models, where adding latent vector representations of two different objects would lead the network to generate an image with both objects included (an image that would have an ambiguous class label to a second network). It would need to be tested whether this sort of thing happens by default (e.g., with Stable Diffusion) or whether I'm just making stuff up here.
Yes, this is exactly right. This is precisely the kind of linearity that I am talking about, not the input->output mapping, which is...
Thanks for the typos! Fixed now.
Doesn't this imply that people with exceptionally weak autobiographical memory (e.g., Eliezer) have less self-understanding/sense of self? Or maybe you think this memory is largely implicit, not explicit? Or maybe it's enough to have just a bit of it and it doesn't "impair" unless you go very low?
This is an interesting question, and I would argue that it probably does lead to less self-understanding and sense of self, ceteris paribus. I think that the specific sense of self is mostly an emergent combination of having autobiographical memories -- i.e. at e...
Yes. The idea is that the latent space of the neural network's 'features' is 'almost linear', which is reflected in both the linear-ish properties of the weights and the activations. Not that the literal I/O mapping of the NN is linear, which is clearly false.
More concretely, as an oversimplified version of what I am saying, it might be possible to think of neural networks as a combined encoder and decoder to a linear vector space. I.e. we have nonlinear functions f and g, where f encodes the input x to a latent space z and g decodes it to the out...
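Something like this toy sketch (f and g here are arbitrary stand-in networks, nothing from the post -- the point is only where the linearity is supposed to live):

```python
import torch
import torch.nn as nn

# Stand-in nonlinear encoder f and decoder g.
f = nn.Sequential(nn.Linear(64, 32), nn.GELU(), nn.Linear(32, 16))
g = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 64))

x1, x2 = torch.randn(64), torch.randn(64)
z_composed = f(x1) + f(x2)   # compose 'features' by vector addition in the latent space z
y = g(z_composed)            # decode the composed latent

# The 'almost linear' hypothesis is about this latent composition behaving sensibly,
# not about the end-to-end map: g(f(x1 + x2)) is a different (and generally meaningless) object.
```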
Thanks for these links! This is exactly what I was looking for, as per Cunningham's law. For the mechanistic mode connectivity, I still need to read the paper, but there is definitely a more complex story here: the symmetries render things non-connected by default, but once you account for the symmetries and project things into an isometric space where all the symmetries are collapsed, things become connected and linear again. Is this different from that?
I agree about the NTK. I think this explanation is bad in its specifics although I think the NTK ...
Interesting point, which I broadly agree with. I do think, however, that this post has in some sense over-updated on recent developments around agentic LLMs and the non-dangers of foundation models. Even 3-6 months ago, it was unclear in the intellectual zeitgeist whether AutoGPT-style agentic LLM wrappers were the main threat, and people were primarily worried about foundation models being directly dangerous. It now seems clearer that, at least at current capability levels, foundation models are not directly goal-seeking, although adding agency is...
Thanks for these points!
Equivalence token to bits
Why did you decide to go with the equivalence of 1 token = 1 bit? Since a token can usually take on the order of 10k to 100k possible values, wouldn't 1 token = 13-17 bits be a more accurate equivalence?
My thinking here is that the scaffolded LLM is a computer which operates directly in the natural language semantic space so it makes more sense to define the units of its context in terms of its fundamental units such as tokens. Of course each token has a lot more information-theoretic content than a s...
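For reference, the arithmetic behind the 13-17 bits figure (vocabulary sizes here are just illustrative):

```python
import math

for vocab_size in (10_000, 50_000, 100_000):
    # A token drawn uniformly from a vocabulary of this size carries at most log2(vocab) bits.
    print(vocab_size, round(math.log2(vocab_size), 1), "bits")
```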
I perceive a lot of inferential distance on my end as well. My model here is informed by a number of background conclusions that I'm fairly confident in, but which haven't actually propagated into the set of commonly-assumed background assumptions.
I have found this conversation very interesting. I would be very interested if you could do a quick summary or writeup of the background conclusions you are referring to. I have my own thoughts about the feasibility of massive agency gains from AutoGPT-like wrappers, but would be interested to hear yours.
I think you're saying that this is evidence that artificial systems which have similar architectures to human brains will also be able to solve this pointers problems, and if so, I agree.
I'm skeptical that anyone will succeed in building such a system before others build other kinds of systems, but I do think it is a good thing to try, if done carefully.
I think our biggest crux is this. My idea here is that by default we get systems that look like this -- DL systems look like this! -- and my near-term prediction is that DL systems will scale all the way to AG...
Most of these claims seem plausibly true of average humans today, but false about smarter (and more reflective) humans now and in the future.
On the first point, most of the mundane things that humans do involve what looks to me like pretty strong optimization; it's just that the things they optimize for are nice-looking, normal (but often complicated) human things. Examples of people explicitly applying strong optimization in various domains: startup founders, professional athletes, AI capabilities researchers, AI alignment researchers, dating.
My claim is ...
The Op is mistaken about visual transformers, they can also exploit parameter sharing just in a different way.
Can you expand on this? How do vision transformers exploit parameter sharing in a way that is not available to standard LLMs?
Nice. My main issue is that just because humans have values a certain way, doesn't mean we want to build an AI that way, and so I'd draw pretty different implications for alignment. I'm pessimistic about anything that even resembles "make an AI that's like a human child," and more interested in "use a model of a human child to help an inhuman AI understand humans in the way we want."
I pretty much agree with this sentiment. I don't literally think we should build AGI like a human and expect it to be aligned. Humans themselves are far from aligned enou...
So, I agree and I think we are getting at the same thing (though I'm not completely sure what you are pointing at). The way to have a model-y critic and actor is to have the actor and critic perform model-free RL over the latent space of your unsupervised world model. This is the key point of my post and why humans can have 'values' and desires for highly abstract linguistic concepts such as 'justice', as opposed to pure sensory states or primary rewards.
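A minimal sketch of the architecture I have in mind (names and sizes are invented for illustration):

```python
import torch
import torch.nn as nn

OBS_DIM, LATENT_DIM, N_ACTIONS = 128, 32, 8

# An unsupervised/self-supervised world model provides the latent space.
world_model_encoder = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.GELU(), nn.Linear(64, LATENT_DIM))

# Actor and critic do ordinary model-free RL, but over latents rather than raw observations.
actor = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.GELU(), nn.Linear(64, N_ACTIONS))
critic = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.GELU(), nn.Linear(64, 1))

obs = torch.randn(OBS_DIM)
z = world_model_encoder(obs)   # abstract concepts ('justice'-level features) live here
action_logits = actor(z)
value = critic(z)
```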
afaict, a big fraction of evolution's instructions for humans (which made sense in the ancestral environment) are encoded as what you pay attention to. Babies fixate on faces, not because they have a practical need to track faces at 1 week old, but because having a detailed model of other humans will be valuable later. Young children being curious about animals is a human universal. Etc.
This is true, but I don't think it is super important for this argument. Evolution definitely encodes inductive biases into learning about relevant things, which ML archit...
I always say that the whole brain (including not only the basal ganglia but also the thalamocortical system, medulla, etc.) operates as a model-based RL system. You’re saying that the BG by itself operates as a model-free RL system. So I don’t think we’re disagreeing, because “the cortex is the model”?? (Well, we definitely have some disagreements about the BG, but we don’t have to get into them, I don’t think they’re very important for present purposes.)
I think there is some disagreement here, at least in the way I am using model-based / model-free ...
1. Evolution needed to encode not only drives for food or shelter, but also drives for evolutionary desirable states like reproduction; this likely leads to drives which are present and quite active, such as "seek social status" => as a consequence I don't think the evolutionary older drives are out of play and the landscape is flat as you assume, and dominated by language-model-based values
Yes, I think drives like this are important on two levels. At the first level, we experience them as primary rewards -- i.e. social status gives direct ...
Thanks for your comment.
The most substantive disagreement in relation to alignment is on how much of our values is determined by the basic reward system, and how much is essentially arbitrary from there. I tend to side with you, but I'm not sure, and I do think that adult human values and behavior is still shaped in important ways by our innate reward signals. But the important question is whether we could do without those, or perhaps with a rough emulation of them, in an AGI that's loosely brainlike.
I am not sure how much we actually disagree here. ...
Fair point. I need to come up with a better name than 'orthogonality' for what I am thinking about here -- 'well factoredness?'
Will move the footnote into the main text.
No worries! I'm happy you went to the effort of summarising it. I was pretty slow in crossposting anyhow.
Yes, I guess I am overstating the possible speedup if I call it 'much much faster', but there ought to at least be a noticeable speedup by cutting out the early steps if it's basically just wasting time/data/compute to fix the distributions. It might also converge to a better and different optimum.
I think we agree here. Testing whether it converges to a better optimum would also be interesting.
...Perhaps more interestingly is the consequences for the training and arch: a lot of stuff with Transformers, like special burnin schedules or heavy (ab)use of n
Good idea -- will run this experiment!
Thanks for these points! I think I understand the history of what has happened here better now -- and the reasons for my misapprehension. Essentially, what I think happened is
a.) LLM/NLP research always (?) used 'pretraining', going back at least to the 2017 era, for the general training of a model not specialised for a certain NLP task (such as NER, syntax parsing, etc.)
b.) the rest of ML mostly used 'training' because, by and large, they didn't do massive unsupervised training on unrelated tasks -- i.e. CV just had ImageNet or whatever
c.) In 2020-2022 peri...