All of beren's Comments + Replies

beren40

Thanks for these points! I think I understand the history of what has happened here better now -- and the reasons for my misapprehension. Essentially, what I think happened is

a.) LLM/NLP research has (always?) used 'pretraining', going back at least to the 2017 era, for the general training of a model not specialised for a certain NLP task (such as NER, syntax parsing, etc.)

b.) the rest of ML mostly used 'training' because they by and large didn't do massive unsupervised training on unrelated tasks -- i.e. CV just had ImageNet or whatever

c.) In 2020-2022 peri... (read more)

beren93

I like this post very much and in general I think research like this is on the correct lines towards solving potential problems with Goodhart's law -- in general Bayesian reasoning and getting some representation of the agent's uncertainty (including uncertainty over our values!) seems very important and naturally ameliorates a lot of potential problems. The correctness and realizability of the prior are very general problems with Bayesianism but often do not thwart its usefulness in practice, although they allow people to come up with various convoluted c... (read more)

1Mateusz Bagiński
I would appreciate some pointers to resources
berenΩ276327

While I agree with a lot of points of this post, I want to quibble with the RL not maximising reward point. I agree that model-free RL algorithms like DPO do not directly maximise reward but instead 'maximise reward' in the same way self-supervised models 'minimise crossentropy' -- that is to say, the model is not explicitly reasoning about minimising cross entropy but learns distilled heuristics that end up resulting in policies/predictions with a good reward/crossentropy. However, it is also possible to produce architectures that do directly optimise for... (read more)
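To make the distinction concrete, here is a toy sketch contrasting an amortized, policy-gradient-style update with a planner that directly searches for reward-maximizing actions against a model (illustrative code only; the setup and numbers are made up):

```python
import numpy as np

# Toy MDP (hypothetical): 3 states, 2 actions, known reward table and dynamics.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
reward = rng.normal(size=(n_states, n_actions))                               # r(s, a)
transition = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))     # P(s' | s, a)

def amortized_policy_update(policy_logits, s, a, r, lr=0.1):
    """Model-free / amortized: nudge action logits in proportion to observed reward.
    The policy never reasons about reward; it just distills past reinforcement."""
    probs = np.exp(policy_logits[s]) / np.exp(policy_logits[s]).sum()
    one_hot = np.zeros(n_actions)
    one_hot[a] = 1.0
    policy_logits[s] += lr * r * (one_hot - probs)   # REINFORCE-style gradient step
    return policy_logits

def direct_planner(s, horizon=3):
    """Direct optimization: exhaustively search action sequences against the model,
    explicitly maximizing predicted cumulative reward."""
    def rollout_value(state, depth):
        if depth == 0:
            return 0.0
        return max(
            reward[state, a] + transition[state, a] @ np.array(
                [rollout_value(s2, depth - 1) for s2 in range(n_states)])
            for a in range(n_actions))
    return max(range(n_actions),
               key=lambda a: reward[s, a] + transition[s, a] @ np.array(
                   [rollout_value(s2, horizon - 1) for s2 in range(n_states)]))

logits = np.zeros((n_states, n_actions))
logits = amortized_policy_update(logits, s=0, a=1, r=reward[0, 1])
print(logits[0], direct_planner(0))
```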

6TurnTrout
Agree with a bunch of these points. EG in Reward is not the optimization target  I noted that AIXI really does maximize reward, theoretically. I wouldn't say that AIXI means that we have "produced" an architecture which directly optimizes for reward, because AIXI(-tl) is a bad way to spend compute. It doesn't actually effectively optimize reward in reality.  I'd consider a model-based RL agent to be "reward-driven" if it's effective and most of its "optimization" comes from the direct part and not the leaf-node evaluation (as in e.g. AlphaZero, which was still extremely good without the MCTS).  "Direct" optimization has not worked - at scale - in the past. Do you think that's going to change, and if so, why? 
5Oliver Daniels
Strongly agree, and also want to note that wire-heading is (almost?) always a (near?) optimal policy - i.e. trajectories that tamper with the reward signal and produce high reward will be strongly upweighted, and insofar as the model has sufficient understanding/situational awareness of the reward process and some reasonable level of goal-directedness, this upweighting could plausibly induce a policy explicitly optimizing the reward. 
beren140

This monograph by Bertsekas on the interrelationship between offline RL and online MCTS/search might be interesting -- http://www.athenasc.com/Frontmatter_LESSONS.pdf -- since it argues that we can conceptualise the contribution of MCTS as essentially that of a single Newton step from the offline start point towards the solution of the Bellman equation. If this is actually the case (I haven't worked through all details yet) then this seems to be able to be used to provide some kind of bound on the improvement / divergence you can get once you add online planning to a model-free policy.
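As I understand the claim (a rough sketch; see the monograph for the precise statement): with the Bellman operator

$$ (TJ)(s) = \max_a \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, J(s') \Big], \qquad J^* = TJ^*, $$

the one-step lookahead policy $\tilde{\pi}(s) \in \arg\max_a \big[ r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, J_{\text{off}}(s') \big]$ built on top of the offline value estimate $J_{\text{off}}$ has a value $J_{\tilde{\pi}}$ that corresponds, to first order, to one Newton iteration for the fixed-point equation $J = TJ$ started from $J_{\text{off}}$, so near the solution the error contracts roughly quadratically, $\|J_{\tilde{\pi}} - J^*\| = O\!\big(\|J_{\text{off}} - J^*\|^2\big)$.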

beren3719

Thanks for writing this! Here are some of my rough thoughts and comments.

One of my big disagreements with this threat model is that it assumes it is hard to get an AGI to understand / successfully model 'human values'. I think this is obviously false. LLMs already have a very good understanding of 'human values' as they are expressed linguistically, and existing alignment techniques like RLHF/RLAIF seem to do a reasonably good job of making the models' output align with these values (specifically generic corporate wokeness for OpenAI/Anthropic) which does ... (read more)

8jessicata
I'm defining "values" as what approximate expected utility optimizers in the human brain want. Maybe "wants" is a better word. People falsify their preferences and in those cases it seems more normative to go with internal optimizer preferences. Re indexicality, this is an "the AI knows but does not care" issue, it's about specifying it not about there being some AI module somewhere that "knows" it. If AGI were generated partially from humans understanding how to encode indexical goals that would be a different situation. Re treacherous turns, I agreed that myopic agents don't have this issue to nearly the extent that long-term real-world optimizing agents do. It depends how the AGI is selected. If it's selected by "getting good performance according to a human evaluator in the real world" then at some capability level AGIs that "want" that will be selected more.
beren40

Thanks for the response! Very helpful and enlightening.

The reason for this is actually pretty simple: genes with linear effects have an easier time spreading throughout a population.

This is interesting -- I have never come across this. Can you expand on the intuition of this model a little more? Is the intuition something like: in the fitness landscape, genes with linear effects are like gentle slopes that are easy to traverse, vs extremely wiggly 'directions'?

Also how I am thinking about linearity is maybe slightly different to the normal ANOVA/factor ana... (read more)

4GeneSmith
The way I think about the sample sizes needed to identify non-linear effects is more like this: if you're testing the hypothesis that A_i has an effect on trait T but only in the presence of another gene B_k, you need a large sample of patients with both A_i and B_k. If both variants are rare, that can multiply the sample size needed to reach genome-wide significance by a factor of 10 or even 100. The ones I know of are. If this tech works, it's hard to overstate just how big the impact would be, especially if you could target edits to just a specific cell type (which has been done in a limited capacity already). If you had high enough editing efficiency, you could probably bring a cancer survivor's recurrence risk back to a pre-cancerous state by identifying the specific mutations that made the cells of that particular organ pre-cancerous and reverting them back to their original state. You could even make their cancer risk lower than it was before by adjusting their polygenic risk score for cancer. I really don't think I can oversell how transformative this tech would be if it actually worked well. You could probably dramatically extend the human healthspan, make people smarter, and do all kinds of other things. There are of course ways it could be used that would be concerning. For example, a really determined government might be able to make a genetic predictor for obedience or something and modify people's polygenic scores for obedience. On the other hand, you could probably use that same technology to reduce the risk of violent criminals reoffending, which could be good. I tend not to think too much about these kinds of concerns because the situation with AI seems so dire. But if by some miracle we pass a global moratorium on hardware improvement to buy ourselves more time to figure out solutions to alignment and misuse concerns, this tech could play a hugely pivotal role in that. Not to mention all the more down-to-earth stuff it could do for diseases, me
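A back-of-the-envelope illustration of that sample-size point (the carrier frequencies here are made up purely for illustration):

```python
# Rough illustration (hypothetical carrier frequencies): to detect an interaction
# between variants A and B you need enough people carrying BOTH.
n_cohort = 1_000_000
freq_a, freq_b = 0.05, 0.02                   # assumed carrier frequencies

carriers_a = n_cohort * freq_a                # ~50,000 informative samples for A alone
carriers_both = n_cohort * freq_a * freq_b    # ~1,000 samples carrying both A and B

print(carriers_a, carriers_both)
# Holding power fixed, the informative sample for the interaction term shrinks by
# roughly a factor of 1/freq_b (50x here), so the cohort must grow accordingly.
```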
beren60

This would be very exciting if true! Do we have a good (or any) sense of the mechanisms by which these genetic variants work -- how many are actually causal, how many are primarily active in development vs in adults, how much interference there is between different variants etc? 

I am also not an expert at all here -- do we have any other examples of traits being enhanced or diseases cured by genetic editing in adults (even in other animals) like this? It seems also like this would be easy to test in the lab -- i.e. for mice which we can presumably sequence and edit more straightforwardly and also can measure some analogues of IQ with reasonable accuracy and reliability. Looking forward to the longer post.

GeneSmith*102

Do we have a good (or any) sense of the mechanisms by which these genetic variants work -- how many are actually causal, how many are primarily active in development vs in adults, how much interference there is between different variants etc?

No, we don't understand the mechanism by which most of them work (other than that they influence the level of and timing of protein expression). We have a pretty good idea of which are causal based on sibling validation, but there are some limitations to this knowledge because genetic variants physically close to on... (read more)

beren50

This is an interesting idea. I feel this also has to be related to increasing linearity with scale and generalization ability -- i.e. if you have a memorised solution, then nonlinear representations are fine because you can easily tune the 'boundaries' of the nonlinear representation to precisely delineate the datapoints (in fact the nonlinearity of the representation can be used to strongly reduce interference when memorising, as is done in the recent research on modern Hopfield networks). On the other hand, if you require a kind of reasonably large-scale... (read more)

1Hoagy
Yes, it makes a lot of sense that linearity would come hand in hand with generalization. I'd recently been reading Krotov on non-linear Hopfield networks but hadn't made the connection. They say that they're planning on using them to create more theoretically grounded transformer architectures, and your comment makes me think that these wouldn't succeed -- but then the article also says: which perhaps corresponds to them also being able to find good linear representations and to mix generalization and memorization like a transformer?
beren30

Looks like I really need to study some SLT! I will say though that I haven't seen many cases in transformer language models where the eigenvalues of the Hessian are 90% zeros -- that seems extremely high.

beren40

I also think this is mostly a semantic issue. The same process can be described in terms of implicit prediction errors, where e.g. there is some baseline level of leptin in the bloodstream that the NPY/AgRP neurons in the arcuate nucleus 'expect', and if there is less leptin this generates an implicit 'prediction error' in those neurons that causes them to increase firing, which then stimulates various food-consuming reflexes and desires, which ultimately leads to more food and hence 'corrects' the prediction error. It isn't necessary that anywhere there... (read more)

beren20

This is where I disagree! I don't think the Morrison and Berridge experiment demonstrates the model-based side. It is consistent with model-based RL, but it is also consistent with model-free algorithms that can flexibly adapt to changing reward functions, such as linear RL. Personally, I think the latter is more likely, since it is such a low-level response which can be modulated entirely by subcortical systems and so seems unlikely to require model-based planning to work.
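A minimal sketch of the kind of model-free-but-flexible mechanism I have in mind -- a successor-representation-style value computation, which is closely related to linear RL (toy code, not a claim about the actual neural implementation):

```python
import numpy as np

# Successor representation under a fixed policy: M[s, s'] = expected discounted
# future occupancy of s' starting from s. M can be learned model-free from experience.
n_states, gamma = 4, 0.9
P = np.full((n_states, n_states), 1.0 / n_states)      # toy policy-induced transitions
M = np.linalg.inv(np.eye(n_states) - gamma * P)        # M = (I - gamma * P)^-1

# Values factorize as V = M @ r, so changing the reward vector (e.g. salt suddenly
# becoming rewarding under salt deprivation) revalues states with no new learning.
# (Caveat: M is still tied to the old policy, so this is revaluation, not replanning.)
r_sated    = np.array([0.0, 1.0, 0.0, -0.5])           # salt state (index 3) aversive
r_deprived = np.array([0.0, 1.0, 0.0,  2.0])           # same M, new reward weights

print(M @ r_sated)      # old state values
print(M @ r_deprived)   # zero-shot revaluation
```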

beren40

Thanks for linking to your papers -- definitely interesting that you have been thinking along similar lines. The key reason I think studying this is important is that these hedonic loops demonstrate that a.) mammals, including humans, are actually exceptionally well aligned to basic homeostatic needs and basic hedonic loops in practice. It is extremely hard and rare for people to choose not to follow homeostatic drives. I think the reason humans are mostly 'misaligned' about higher-level things like morality, empathy, etc. is because we don't actually have di... (read more)

beren40

This is definitely possible and is essentially augmenting the state variables with additional homeostatic variables and then learning policies on the joint state space. However, there are some clever experiments, such as the linked Morrison and Berridge one, demonstrating that this is not all that is going on -- specifically, many animals appear to be able to perform zero-shot changes in policy when rewards change, even if they have not experienced this specific homeostatic variable before -- i.e. mice suddenly chase after salt water, which they previously disliked, when put in a state of salt deprivation which they had never before experienced.

4Seth Herd
The above is describing the model-free component of learning reward-function dependent policies. The Morrison and Berridge salt experiment is demonstrating the model-based side, which probably comes from imagining specific outcomes and how they'd feel.
beren31

The 'four years' they explicitly mention does seem very short to me for ASI unless they know something we don't...

beren101

AI x-risk is not far off at all, it's something like 4 years away IMO

Can I ask where this four-year number is coming from? It was also stated prominently in the new 'superalignment' announcement (https://openai.com/blog/introducing-superalignment). Is this some agreed-upon median timeline at OAI? Is there an explicit plan to build AGI in four years? Is there strong evidence behind this view -- i.e. that you think you know how to build AGI explicitly and it will just take four more years of compute/scaling?

Sure. First of all, disclaimer: This is my opinion, not that of my employer. (I'm not supposed to say what my employer thinks.) Yes, I think I know how to build AGI. Lots of people do. The difficult innovations are already behind us, now it's mostly a matter of scaling. And there are at least two huge corporate conglomerates in the process of doing so (Microsoft+OpenAI and Alphabet+GoogleDeepMind). 

There's a lot to say on the subject of AGI timelines. For miscellaneous writings of mine, see AI Timelines - LessWrong. But for the sake of brevity I'd rec... (read more)

beren40

Hi there! Thanks for this comment.  Here are my thoughts:

  1. Where do highly capable proposals/amortised actions come from?
  • (handwave) lots of 'experience' and 'good generalisation'?

Pretty much this. We know empirically that deep learning generalizes pretty well from a lot of data as long as it is reasonably representative. I think that fundamentally this is due to the nature of our reality -- that there are generalizable patterns -- which is ultimately due to the sparse underlying causal graph. It is very possible that there are realities where this isn't true ... (read more)

beren22

The problem is not so much which one of 1, 2, 3 to pick but whether 'we' get a chance to pick it at all. If there is space, free energy, and diversity, there will be evolution going on among populations, and evolution will consistently push things in the direction of more reproduction until it hits a Malthusian limit, at which point it will push towards greater competition and economic/reproductive efficiency. The only way to avoid this is to remove the preconditions for evolution -- any of variation, selection, or heredity -- but these seem quite natural in a world of large AI populations, so in practice this will require some level of centralized control.

2tailcalled
Yes. Variation corresponds to "a lot of people (in a broad sense potentially including Ems etc.) get to live independently", selection corresponds to economic freedom, and heredity corresponds to reproductive freedom. (Not exactly ofc, but it's hard to write something which exactly matches any given frame.) Or rather, it's both a question of how to pick it and what to pick. Like the MIRI plan is to grab control over the world and then use this to implement some sort of cosmopolitan value system. But if one does so, there's still the question of which cosmopolitan value system to implement.
beren20

This is obviously true; any AI complete problem can be trivially reduced to the problem of writing an AI program that solves the problem. That isn't really a problem for the proposal here. The point isn't that we could avoid making AGI by doing this, the point is that we can do this in order to get AI systems that we can trust without having to solve interpretability.

Maybe I'm being silly but then I don't understand the safety properties of this approach. If we need an AGI based on uninterpretable DL to build this, then how do we first check if this AGI is safe?

3Joar Skalse
The point is that you (in theory) don't need to know whether or not the uninterpretable AGI is safe, if you are able to independently verify its output (similarly to how I can trust a mathematical proof, without trusting the mathematician). Of course, in practice, the uninterpretable AGI presumably needs to be reasonably aligned for this to work. You must at the very least be able to motivate it to write code for you, without hiding any trojans or backdoors that you are not able to detect. However, I think that this is likely to be much easier than solving the full alignment problem for sovereign agents. Writing software is a myopic task that can be accomplished without persistent, agentic preferences, which means that the base system could be much more tool-like than the system which it produces. But regardless of that point, many arguments for why interpretability research will be helpful also apply to the strategy I outline above.
beren31

I moderately agree here, but I still think the primary factor is centralization of the value chain: the more of the value chain is centralized, the easier it is to control. My guess is that we can make this argument more formal by thinking of things in terms of a dependency graph -- if we imagine the economic process from sand + energy -> DL models, then the important measure is the centrality of the hubs in this graph. If we can control and/or cut these hubs, then the entire DL ecosystem falls apart. Conveniently/unfortunately, this is also where most of the economic profit is likely to accumulate by standard industrial economic laws, and hence this is also where there will be the most resources resisting regulation.
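As a toy sketch of the formalization I am gesturing at (the node names and edges are purely illustrative, not real supply-chain data):

```python
import networkx as nx

# Toy dependency graph for the DL value chain (illustrative structure only).
G = nx.DiGraph()
G.add_edges_from([
    ("sand", "fabs"), ("energy", "fabs"),
    ("fabs", "GPUs"), ("GPUs", "datacenters"),
    ("datacenters", "lab_A_models"), ("datacenters", "lab_B_models"),
])

# Betweenness centrality picks out choke points: nodes that most value-chain paths
# must pass through, and hence natural levers for control or regulation.
print(nx.betweenness_centrality(G))
```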

beren30

As I see it, there are two fundamental problems here:

1.) Generating interpretable expert-system code for an AGI is probably already AGI-complete. It seems unlikely that a non-AGI DL model can output code for an AGI -- especially given that it is highly unlikely that there would be expert-system AGIs in its training set -- or even things close to expert-system AGIs, if deep learning keeps far outpacing GOFAI techniques.

2.) Building an interpretable expert system AGI is likely not just AGI complete but a fundamentally much harder problem than building a D... (read more)

6Joar Skalse
1. This is obviously true; any AI complete problem can be trivially reduced to the problem of writing an AI program that solves the problem. That isn't really a problem for the proposal here. The point isn't that we could avoid making AGI by doing this, the point is that we can do this in order to get AI systems that we can trust without having to solve interpretability. 2. This is probably true, but the extent to which it is true is unclear. Moreover, if the inner workings of intelligence are fundamentally uninterpretable, then strong interpretability must also fail. I already commented on this in the last two paragraphs of the top-level post.
beren41

Interesting post! Do you have papers for the claims about why mixed activation functions perform worse? This is something I have thought about a little but not looked into deeply, so I would appreciate links here. My naive thinking is that it mostly doesn't work due to the difficulty of conditioning and keeping the loss landscape smooth and low-curvature with different activation functions in a layer. With a single activation function, it is relatively straightforward to design an initialization that doesn't blow up -- with mixed ones, it seems your space of potential numerical difficulties increases massively.
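To make concrete what I mean by mixed activations in a layer and per-activation initialization, here is a rough sketch (PyTorch, using its standard gain heuristics; illustrative only, not a recommended design):

```python
import torch
import torch.nn as nn

class MixedActivationLayer(nn.Module):
    """Half the units use ReLU, half use tanh. Each half gets its own init gain,
    since the variance-preserving scale differs per activation function."""
    def __init__(self, d_in, d_out):
        super().__init__()
        assert d_out % 2 == 0
        self.relu_part = nn.Linear(d_in, d_out // 2)
        self.tanh_part = nn.Linear(d_in, d_out // 2)
        nn.init.xavier_normal_(self.relu_part.weight, gain=nn.init.calculate_gain("relu"))
        nn.init.xavier_normal_(self.tanh_part.weight, gain=nn.init.calculate_gain("tanh"))

    def forward(self, x):
        return torch.cat([torch.relu(self.relu_part(x)),
                          torch.tanh(self.tanh_part(x))], dim=-1)

x = torch.randn(8, 32)
print(MixedActivationLayer(32, 64)(x).shape)  # torch.Size([8, 64])
```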

1bhauth
No, there are no papers on that topic that I know of. There are relatively few papers that work on mixed activation functions at all. You should understand that papers that don't show at least a marginal increase on some niche benchmark tend not to get published. So, much of the work on mixed activation functions went unpublished. But I can link to papers on testing mixed activation functions. Here's a Bachelor's thesis from 2022 that did relatively extensive testing. They did evolution of activation function sets for a particular application and got slightly better performance than ReLU/Swish. That's an unfair comparison because activation function adaptation to a particular task can improve performance. The thesis did also compare its evolutionary search on single functions, and that approach did about as well as the mixed functions. So far so good, but then, when the network was scaled up from VGG-HE-2 to VGG-HE-4, their evolved activation sets all got worse, while ReLU and Swish got better. Their best mixed activation set went from 80% to 10% accuracy as the network was scaled up, while the evolved single functions held up better but all became worse than Swish. One of the issues I mentioned with mixed activation functions is specific to SGD training; there's also been some work on using them with neuroevolution.
beren20

Exactly this. This is the relationship in RL between the discount factor and the probability of transitioning into an absorbing state (death).
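A sketch of the identity, assuming a constant per-step survival probability $p$ (independent of the reward process) and zero reward after absorption:

$$ V = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r_t\, \mathbf{1}[\text{not yet absorbed at } t]\Big] = \sum_{t=0}^{\infty} (\gamma p)^{t}\, \mathbb{E}[r_t \mid \text{alive}], $$

so a per-step hazard of dying acts exactly like an extra discount, $\gamma_{\text{eff}} = \gamma p < \gamma$ -- the same structure as discounting a promised future payment by the probability it never reaches you.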

4MSRayne
Ooh! I don't know much about the theory of reinforcement learning, could you explain that more / point me to references? (Also, this feels like it relates to the real reason for the time-value of money: money you supposedly will get in the future always has a less than 100% chance of actually reaching you, and is thus less valuable than money you have now.)
beren50

I think this is a really good post. You might be interested in these two posts, which explore very similar arguments on the interactions between search in the world model and more general 'intuitive policies', as well as the fact that we are always optimizing for our world/reward model rather than reality, and how this affects how agents act.

1azsantosk
Thank you very much for linking these two posts, which I hadn't read before. I'll start using the direct vs amortized optimization terminology as I think it makes things more clear. The intuition that reward models and planners have an adversarial relationship seems crucial, and it doesn't seem as widespread as I'd like. On a meta-level your appreciation comment will motivate me to write more, despite the ideas themselves being often half-baked in my mind, and the expositions not always clear and eloquent.
beren40

Yes! This would be valuable. Generally, getting a sense of the 'self-awareness' of a model in terms of how much it knows about itself would be a valuable thing to start testing for.

beren40

I don't think models currently have this ability by default anyway. But we definitely should think very hard before letting them do this!

beren20

Yes, I think what I proposed here is the broadest and crudest thing that will work. It can of course be much more targeted to specific proposals or posts that we think are potentially most dangerous. Using existing language models to rank these is an interesting idea.

beren50

I'm very glad you wrote this. I have had similar musings previously as well, but it is really nice to see this properly written up and analyzed in a more formal manner.

beren130

Interesting thoughts! By the way, are you familiar with Hugo Touchette's work on this? It looks very related and I think it has a lot of cool insights about these sorts of questions.

2johnswentworth
Hadn't seen that before, thank you.
beren64

I think this is a good intuition. I think this comes down to the natural structure of the graph and the fact that information disappears at larger distances. This means that for dense graphs such as lattices etc., regions only implicitly interact through much lower-dimensional max-ent variables which are then additive, while for other causal graph structures, such as the power-law small-world graphs that are probably sensible for many real-world datasets, you also get a similar thing where each cluster can be modelled mostly independently apart from a few long... (read more)

beren*30

Maybe this linearity story would work better for generative models, where adding latent vector representations of two different objects would lead the network to generate an image with both objects included (an image that would have an ambiguous class label to a second network). It would need to be tested whether this sort of thing happens by default (e.g., with Stable Diffusion) or whether I'm just making stuff up here.
 

Yes, this is exactly right. This is precisely the kind of linearity that I am talking about, not the input->output mapping, which is... (read more)

beren20

Thanks for the typos! Fixed now.

beren20

Doesn't this imply that people with exceptionally weak autobiographical memory (e.g., Eliezer) have less self-understanding/sense of self? Or maybe you think this memory is largely implicit, not explicit? Or maybe it's enough to have just a bit of it and it doesn't "impair" unless you go very low?
 

This is an interesting question, and I would argue that it probably does lead to a lesser self-understanding and sense of self, ceteris paribus. I think that the specific sense of self is mostly an emergent combination of having autobiographical memories -- i.e. at e... (read more)

beren120

Yes. The idea is that the latent space of the neural network's 'features' is 'almost linear', which is reflected in the linear-ish properties of both the weights and the activations. Not that the literal I/O mapping of the NN is linear, which is clearly false.

 

More concretely, as an oversimplified version of what I am saying, it might be possible to think of neural networks as a combined encoder and decoder to a linear vector space. I.e. we have nonlinear functions f and g, where f encodes the input x to a latent space z and g decodes it to the out... (read more)
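In symbols, the oversimplified picture I am describing (a sketch of the claim, not a theorem):

$$ z = f(x), \qquad \hat{y} = g(z), $$

with $f$ and $g$ nonlinear but the useful structure living in the latent space $z$: feature composition behaves approximately additively there, e.g. $f(x_1) + f(x_2) \approx$ the latent code for "$x_1$ and $x_2$ together", even though the end-to-end map $g(f(\cdot))$ is highly nonlinear.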

beren101

Thanks for these links! This is exactly what I was looking for, as per Cunningham's law. For the mechanistic mode connectivity, I still need to read the paper, but there is definitely a more complex story relating to the symmetries rendering things non-connected by default; once you account for symmetries and project things into an isometric space where all the symmetries are collapsed, things become connected and linear again. Is this different to that?

 

I agree about the NTK. I think this explanation is bad in its specifics although I think the NTK ... (read more)

6Noosphere89
One other problem of NTK/GP theory is that it isn't able to capture feature learning/transfer learning, and in general starts to break down as models get more complicated. In essence, NTK/GP fails to capture some empirical realities. From the post "NTK/GP Models of Neural Nets Can't Learn Features": In essence, NTK/GP can't transfer learn because it stays where it's originally at in the transfer space, and this doesn't change even in the limit of NTK. A link to the post is below: https://www.lesswrong.com/posts/76cReK4Mix3zKCWNT/ntk-gp-models-of-neural-nets-can-t-learn-features
beren41

Interesting point, which I broadly agree with. I do think, however, that this post has in some sense over-updated on recent developments around agentic LLMs and the non-dangers of foundation models. Even 3-6 months ago, in the intellectual zeitgeist it was unclear whether AutoGPT-style agentic LLM wrappers were the main threat, and people were primarily worried about foundation models being directly dangerous. It now seems clearer that, at least at current capability levels, foundation models are not directly goal-seeking, although adding agency is... (read more)

1Max H
I agree the zeitgeist has changed, but I think some people (or at least Nate and Eliezer in particular), have always been more concerned about more agent-like systems, along the lines of Mu Zero. For example, in the 2021 MIRI conversations here: and here:  and here: Deep Deceptiveness is more recent, but it's another example of a carefully non-specific argument that doesn't factor through any current DL-paradigm methods, and is consistent with the kind of thing Nate and Eliezer have always been saying. I think recent developments with LLMs have caused some other people to update towards LLMs alone being dangerous, which might be true, but if so it doesn't imply that more complex systems are not even more dangerous.
beren30

Thanks for these points! 

Equivalence token to bits

Why did you decide to go with the equivalence of 1 token = 1 bit? Since a token can usually take on the order of 10k to 100k possible values, wouldn't 1 token equal 13-17 bits a more accurate equivalence?

My thinking here is that the scaffolded LLM is a computer which operates directly in the natural language semantic space so it makes more sense to define the units of its context in terms of its fundamental units such as tokens. Of course each token has a lot more information-theoretic content than a s... (read more)
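For reference, the 13-17 bits figure is just the maximum-entropy information content of a token, log2(vocab size); a quick sketch:

```python
import math

# Information content of one token if all tokens were equally likely (an upper bound).
for vocab_size in (8_192, 50_257, 100_000):
    print(vocab_size, math.log2(vocab_size))   # ~13.0, ~15.6, ~16.6 bits per token
```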

beren51

I perceive a lot of inferential distance on my end as well. My model here is informed by a number of background conclusions that I'm fairly confident in, but which haven't actually propagated into the set of commonly-assumed background assumptions.

I have found this conversation very interesting and would be very interested if you could do a quick summary or writeup of the background conclusions you are referring to. I have my own thoughts about the feasibility of massive agency gains from AutoGPT-like wrappers but would be interested to hear yours.

3Thane Ruthenis
Here's the future post I was referring to!
3Thane Ruthenis
I may make a post about it soon. I'll respond to this comment with a link or a summary later on.
beren5-1

I think you're saying that this is evidence that artificial systems which have similar architectures to human brains will also be able to solve this pointers problems, and if so, I agree.

I'm skeptical that anyone will succeed in building such a system before others build other kinds of systems, but I do think it is a good thing to try, if done carefully.

I think our biggest crux is this. My idea here is that by default we get systems that look like this -- DL systems look like this! -- and my near-term prediction is that DL systems will scale all the way to AG... (read more)

1Max H
Agree this is a crux.  A few remarks: * Structural similarity doesn't necessarily tell us a lot about a system's macro-level behavior. Examples: Stockfish 1 vs. Stockfish 20, the brain of a supervillain vs. the brain of an average human, a transformer model with random weights vs. one trained to predict the next token in a sequence of text. Or, if you want to extend the similarity to the training process, a transformer model trained on a corpus of text from the human internet vs. one trained on a corpus of text from an alien internet. An average human vs. a supervillain who have 99%+ identical life experiences from birth. Stockfish implemented by a beginner programmer vs. a professional team. * I'd say, to the extent that current DL systems are structurally similar to human brains, it's because these structures are instrumentally useful for doing any kind of useful work, regardless of how "values" in those systems are formed, or what those values are. And as you converge towards the most useful structures, there is less room left over for the system to "look similar" to humans, unless humans are pretty close to performing cognition optimally already. Also, a lot of the structural similarity is in the training process of the foundation models that make up one component of a larger artificial system. The kinds of things people do with LangChain today don't seem similar in structure to any part of a single human brain, at least to me. For example, I can't arrange a bunch of copies of myself in a chain or tree, and give them each different prompts running in parallel. I could maybe simulate that by hiring a bunch of people, though it would be OOMs slower and more costly. I also can't add a python shell or a "tree search" method, or perform a bunch of experimental neurosurgery on humans, the way I can with artificial systems. These all seem like capabilities-enhancing tools that don't preserve structural similarity to humans, and may also not pres
beren20

Most of these claims seem plausibly true of average humans today, but false about smarter (and more reflective) humans now and in the future.

On the first point, most of the mundane things that humans do involve what looks to me like pretty strong optimization; it's just that the things they optimize for are nice-looking, normal (but often complicated) human things. Examples of people explicitly applying strong optimization in various domains: startup founders, professional athletes, AI capabilities researchers, AI alignment researchers, dating.

My claim is ... (read more)

1Max H
I think the post does a great job of explaining human value formation, as well as the architecture of human decision-making, at least mechanically. I'm saying that neuroanatomy seems insufficient to explain how humans function in the most important situations, let alone artificial systems, near or far. If a hedge fund trader can beat the market, or a chess grandmaster can beat their opponent, what does it matter whether the decision process they use under the hood looks more like tree search, or more like function approximation, or a combination of both? It might matter quite a lot, if you're trying to build a human-like AGI! If you just want to know if your AGI is capable of killing you though, both function approximation and tree search at the level humans do them (or even somewhat below that level) seem pretty deadly, if they're pointed in the wrong direction. Whether it's easy or hard to point an artificial system in any particular direction is another question. I think you're saying that this is evidence that artificial systems which have similar architectures to human brains will also be able to solve this pointers problems, and if so, I agree. I'm skeptical that anyone will succeed in building such a system before others build other kinds of systems, but I do think it is a good thing to try, if done carefully. Though, a world where such systems are easy to build is not one I'd call "benign", since if it's easy to "just ask for alignment", it's probably also pretty easy to ask for not-alignment. Put another way, in the world where CoEms are the first kind of strong AGIs to get built, I think p(s-risk) goes up dramatically, though p(good outcomes) also goes up, perhaps even more dramatically, and p(molecular squiggles) goes down. I mostly think we're not in that world, though.
beren40

The Op is mistaken about visual transformers, they can also exploit parameter sharing just in a different way.

Can you expand on this? How do vision transformers exploit parameter sharing in a way that is not available to standard LLMs?

6jacob_cannell
Consider a vision transformer - or more generally an RNN - which predicts the entire image at once (and thus has hidden states that are larger than the image due to depth and bottleneck layers etc). That obviously wouldn't exploit weight sharing at all, but is really the only option if you are running a transformer or RNN on an ultra-slow ultra-wide 100hz neuromorphic computer like the brain and have tight latency constraints. But of course that isn't the only or most sensible option on a GPU. Instead you can use a much smaller transformer/RNN over a stream of image patches instead of the entire image at once, which then naturally exploits weight sharing very much like CNNs. Ultimately vision transformers and CNNs both map to matrix multiplication, which always involves weight sharing. The interesting flip consequence is that a brain-like architecture - a massive RNN - doesn't naturally map to matrix multiplication at all and thus can't easily exploit GPU acceleration.
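A minimal sketch of the patch-level weight sharing described above (illustrative PyTorch, assuming an 8x8 patch size): the same patch-embedding weights are applied at every spatial position, just as a convolution kernel is.

```python
import torch
import torch.nn as nn

batch, channels, height, width, patch = 2, 3, 32, 32, 8
x = torch.randn(batch, channels, height, width)

# Split the image into non-overlapping patches: (batch, n_patches, patch_dim).
patches = x.unfold(2, patch, patch).unfold(3, patch, patch)        # (B, C, 4, 4, 8, 8)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(batch, -1, channels * patch * patch)

# One shared linear embedding is applied to every patch -- this is the weight sharing:
# 16 patches, one weight matrix (the same idea as a stride-8 conv kernel).
embed = nn.Linear(channels * patch * patch, 64)
tokens = embed(patches)
print(tokens.shape)   # torch.Size([2, 16, 64])
```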
beren20

Nice. My main issue is that just because humans have values a certain way, doesn't mean we want to build an AI that way, and so I'd draw pretty different implications for alignment. I'm pessimistic about anything that even resembles "make an AI that's like a human child," and more interested in "use a model of a human child to help an inhuman AI understand humans in the way we want."
 

I pretty much agree with this sentiment. I don't literally think we should build AGI like a human and expect it to be aligned. Humans themselves are far from aligned enou... (read more)

beren30

So, I agree and I think we are getting at the same thing (though not completely sure what you are pointing at). The way to have a model-y critic and actor is to have the actor and critic perform model-free RL over the latent space of your unsupervised world model. This is the key point of my post and why humans can have 'values' and desires for highly abstract linguistic concepts such as 'justice' as opposed to pure sensory states or primary rewards.

berenΩ120

afaict, a big fraction of evolution's instructions for humans (which made sense in the ancestral environment) are encoded as what you pay attention to. Babies fixate on faces, not because they have a practical need to track faces at 1 week old, but because having a detailed model of other humans will be valuable later. Young children being curious about animals is a human universal. Etc.
 

This is true but I don't think is super important for this argument. Evolution definitely encodes inductive biases into learning about relevant things which ML archit... (read more)

berenΩ580

I always say that the whole brain (including not only the basal ganglia but also the thalamocortical system, medulla, etc.) operates as a model-based RL system. You’re saying that the BG by itself operates as a model-free RL system. So I don’t think we’re disagreeing, because “the cortex is the model”?? (Well, we definitely have some disagreements about the BG, but we don’t have to get into them, I don’t think they’re very important for present purposes.)
 

I think there is some disagreement here, at least in the way I am using model-based / model-free ... (read more)

1Mateusz Bagiński
I'd say that the probability of success depends on (1) Conservatism - how much of the prior structure (i.e., what our behavior actually looks like at the moment, how it's driven by particular shards, etc.). The more conservative you are, the harder it is. (2) Parametrization - how many moving parts (e.g., values in value consequentialism or virtues in virtue ethics) you allow for in your desired model - the more, the easier. If you want to explain all of human behavior and reduce it to one metric only, the project is doomed.[1] For some values of (1) and (2) you can find one or more coherent extrapolations of human values/value concepts. The thing is, often there's not one extrapolation that is clearly better for one particular person and the greater the number of people whose values you want to extrapolate, the harder it gets. People differ in what extrapolation they would prefer (or even if they would like to extrapolate away from their status quo common sense ethics) due to different genetics, experiences, cultural influences, pragmatic reasons etc. ---------------------------------------- 1. There may also be some misunderstanding if one side assumes that the project is descriptive (adequately describe all of human behavior with a small set of latent value concepts) or prescriptive (provide a unified, coherent framework that retains some part of our current value system but makes it more principled, robust against moving out of distribution, etc.) ↩︎
2Steven Byrnes
Seems like just terminology then. I’m using the term “model-based RL” more broadly than you. I agree with you that (1) explicit one-timestep-at-a-time rollouts is very common (maybe even universal) in self-described “model-based RL” papers that you find on arxiv/cs today, and that (2) these kinds of rollouts are not part of the brain “source code” (although they might show up sometimes as a learned metacognitive strategy). I think you’re taking (1) to be evidence that “the term ‘model-based RL’ implies one-timestep-at-a-time rollouts”, whereas I’m taking (1) to be evidence that “AI/CS people have some groupthink about how to construct effective model-based RL algorithms”. Hmm, I think the former is a strict subset of the latter. E.g. I think “learning through experience that I should suck up to vain powerful people” is the latter but not the former. Yeah I agree with the “directly” part. For example, I think some kind of social drives + the particular situations I’ve been in, led to me thinking that it’s good to act with integrity. But now that desire / value is installed inside me, not just a means to an end, so I feel some nonzero motivation to “act with integrity” even when I know for sure that I won’t get caught etc. Not that it’s always a sufficient motivation …
2Charlie Steiner
Huh. I'd agree that's an important distinction, but having a model also can be leveraged for learning; the way I'd normally use it, actor-critic architectures can fall on a spectrum of "modeliness" depending on how "modely" the critic is, even if the actor is a non-recursive, non-modely architecture. I think this is relevant to shard theory because I think the best arguments about shards involve inner alignment failure in model-free-in-my-stricter-sense models.
berenΩ120

1. Evolution needed to encode not only drives for food or shelter, but also drives for evolutionary desirable states like reproduction; this likely leads to drives which are present and quite active, such as "seek social status" => as a consequence I don't think the evolutionary older drives are out of play and the landscape is flat as you assume, and dominated by language-model-based values
 

Yes, I think drives like this are important on two levels. At the first level, we experience them as primary rewards -- i.e. as social status gives direct ... (read more)

2Charlie Steiner
afaict, a big fraction of evolution's instructions for humans (which made sense in the ancestral environment) are encoded as what you pay attention to. Babies fixate on faces, not because they have a practical need to track faces at 1 week old, but because having a detailed model of other humans will be valuable later. Young children being curious about animals is a human universal. Etc. Patterns of behavior (some of which I'd include in my goals) encoded in my model can act in a way that's somewhere between unconscious and too obvious to question - you might end up doing things not because you have visceral feelings about the different options, but simply because your model is so much better at some of the options that the other options never even get considered.
beren31

Thanks for your comment. 

The most substantive disagreement in relation to alignment is on how much of our values is determined by the basic reward system, and how much is essentially arbitrary from there. I tend to side with you, but I'm not sure, and I do think that adult human values and behavior is still shaped in important ways by our innate reward signals. But the important question is whether we could do without those, or perhaps with a rough emulation of them, in an AGI that's loosely brainlike.

I am not sure how much we actually disagree here. ... (read more)

beren51

Fair point. I need to come up with a better name than 'orthogonality' for what I am thinking about here -- 'well-factoredness'?

Will move the footnote into the main text.

beren40

No worries! I'm happy you went to the effort of summarising it. I was pretty slow in crossposting anyhow. 

beren20

Yes, I guess I am overstating the possible speedup if I call it 'much much faster', but there ought to at least be a noticeable speedup by cutting out the early steps if it's basically just wasting time/data/compute to fix the distributions. It might also converge to a better and different optimum.

I think we agree here. Testing whether it converges to a better optimum would also be interesting. 

Perhaps more interestingly is the consequences for the training and arch: a lot of stuff with Transformers, like special burnin schedules or heavy (ab)use of n

... (read more)
beren20

Good idea -- will run this experiment!
