Dwarkesh Patel continues to be on fire, and the podcast notes format seems like a success, so we are back once again.

This time the topic is how LLMs are trained, work and will work in the future. Timestamps are for YouTube. Where I inject my own opinions or takes, I do my best to make that explicit and clear.

This was highly technical compared to the average podcast I listen to, or that Dwarkesh does. It definitely threatened at times to go over my head technically, and some details did go over my head outright. I still learned a ton, and expect you will too if you pay attention.

This is an attempt to distill what I found valuable, and what questions I found most interesting. I did my best to make it intuitive to follow even if you are not technical, but in this case one can only go so far. Enjoy.

  1. (1:30) Capabilities only podcast, Trenton has ‘solved alignment.’ April fools!
  2. (2:15) Huge context windows are underhyped, and a huge deal. It occurs to me that the issue is about the trivial inconvenience of providing the context. Right now I mostly do not bother providing context on my queries. If that happened automatically, it would be a whole different ballgame.
  3. (2:50) Could the models be sample efficient if you can fit it all in the context window? Speculation is it might work out of the box.
  4. (3:45) Does this mean models are already in some sense superhuman, with this much context and memory? Well, yeah, of course. Computers have been superhuman at math and chess and so on for a while. Now LLMs have quickly gone from having worse short term working memory than humans to vastly superior short term working memory. Which will make a big difference. The pattern will continue.
  5. (4:30) In-context learning is similar to gradient descent. It gets problematic for adversarial attacks, but of course you can ignore that because, as Trenton reiterates, alignment is solved, and certainly it is solved for such mundane practical concerns. But it does seem like he’s saying if you do this then ‘you’re fine-tuning but in a way where you cannot control what is going on’?
  6. (6:00) Models need to learn how to learn from examples in order to take advantage of long context. So does that mean the task of intelligence requires long context? That this is what causes the intelligence, in some sense, they ask? I don’t think you can reverse it that way, but it is possible that this will orient work in directions that are more effective?
  7. (7:00) Dwarkesh asks about how long contexts link to agent reliability. Douglas says this is more about lack of nines of reliability, and GPT-4-level models won’t cut it there. And if you need to get multiple things right, the reliability numbers have to multiply together, which does not go well in bulk. If that is indeed the issue then it is not obvious to me the extent to which scaffolding and tricks (e.g. Devin, probably) render this fixable.
  8. (8:45) Performance on complex tasks improves along a log scale. The model gets it right one time in a thousand, then one in a hundred, then one in ten. So there is a clear window where the thing is in practice useless, but you know it soon won’t be. And we are in that window on many tasks. This goes double if you have complex multi-step tasks. If you have a three-step task and are getting each step right one time in a thousand, the full task is one in a billion, but you are not so far from being able to do the task in practice (a toy sketch of this arithmetic is included after the notes).
  9. (9:15) The model being presented here is predicting scary capabilities jumps in the future. LLMs can actually (unreliably) do all the subtasks, including identifying what the subtasks are, for a wide variety of complex tasks, but they fall over on subtasks too often and we do not know how to get the models to correct for that. But that is not so far from the whole thing coming together, and that would include finding scaffolding that lets the model identify failed steps and redo them until they work, provided which steps fail is sufficiently non-deterministic rather than driven by the core difficulties.
  10. (11:30) Attention costs for context window size are quadratic, so how is Google getting the window so big? Suggestion is the cost is still actually dwarfed by the MLP block, and while generating tokens the cost is no longer n-squared; your marginal cost becomes linear (some back-of-envelope numbers for this are sketched after the notes).
  11. (13:30) Are we shifting where the models learn, with more and more in the forward pass? Douglas says essentially no, the context length allows useful working memory, but is not ‘the key thing towards actual reasoning.’
  12. (15:10) Which scaling up counts? Tokens, compute, model size? Can you loop through the model or brain or language? Yes, but notice that in practice humans only do 5-7 steps in complex sentences because of working memory limits.
  13. (17:15) Where is the model reasoning? No crisp answer. The residual stream that the model carries forward packs in a lot of different vectors that encode all the info. Attention is about what to pick up and put into what is effectively RAM.
  14. (20:40) Does the brain work via this residual stream? Yes. Humans implement a bunch of efficient algorithms and really scale up our cerebral cortex investment. A key thing we do is very similar to the attention algorithm.
  15. (24:00) How does the brain reason? Trenton thinks mostly intelligence is pattern matching. ‘Association is all you need.’
  16. (25:45) Paper from Demis in 2008 noted that memory is reconstructive, so it is linked to creativity and also is horribly unreliable.
  17. (26:45) What makes Sherlock Holmes so good? Under this theory: A really long context length and working memory, and better high-level association. Also a good algorithm for his queries and how to build representations. Also proposed: A Sherlock Holmes evaluation. Give a mystery novel or story, ask for probability distribution over ‘The suspect is X.’
  18. (28:30) A vector in the residual stream is the composite of all the tokens to which I have previously paid attention, even by layer two.
  19. (30:30) Could we do an unsupervised benchmark? It has been explored, such as with constitutional AI. Again, alignment-free podcast here.
  20. (31:45) If intelligence is all associations, should we be less worried about superintelligence, because there’s not this sense in which it is Sherlock++ and it can’t solve physics from a world frame? The response is, they would need to learn the associations, but also the tech makes that quick to do, and silicon can be about as generally intelligent as humans and can recursively improve anyway.
  21. My response here would strongly be that if this is true, we should be more worried rather than less worried, because it means there is no secret or trick, and scale really would be all you would need, if you scale enough distinct aspects, and we should expect that we would do that.
  22. (32:45) Dwarkesh asks if this means disagreeing with the premise of them not being that much more powerful. To which I would strongly say yes. If it turns out that the power comes from associations, then that still leads to unbounded power, so what if it does not sound impressive? What matters is if it works.
  23. (33:30) If we got thousands of copies of you, do we get an intelligence explosion? We do dramatically speed up research but compute is a binding constraint. Trenton thinks we would need longer contexts, more reliability and lower cost to get an intelligence explosion, but getting there within a few years seems plausible.
  24. (37:30) Trenton expects this to speed up a lot of the engineering soon, accelerating research and compounding, but not (yet) a true intelligence explosion.
  25. (39:00) What about the costs of training orders-of-magnitude bigger models? Does this break recursive intelligence explosion? It’s a braking mechanism. We should be trying hard to estimate how much of this is automatable. I agree that the retraining costs and required time are a braking mechanism, but also efficiency gains could quickly reduce those costs, and one could choose to work around the need to do that via other methods. One should not be confident here.
  26. (41:00) Understanding what goes wrong is key to making AI progress. There are lots of ideas but figuring out which ideas are worth exploring is vital. This includes anticipating which trend lines will hold when scaled up and which won’t. There’s an invisible graveyard of trend lines that looked promising and then failed to hold.
  27. (44:20) A lot of good research works backwards from solving actual problems. Trying to understand what is going on, figuring out how to run experiments. Performance is lots of low-level hard engineering work. Ruthless prioritization is key to doing high quality research, the most effective people attack the problem, do really fast experiments and do not get attached to solutions. Everything is empirical.
  28. (48:00) “Even though we wouldn’t want to admit it, the whole community is kind of doing greedy evolutionary optimization over the landscape of possible AI architectures and everything else. It’s no better than evolution. And that’s not even a slight against evolution.” Does not fill one with confidence on safety.
  29. (49:30) Compute and taste on what to do are the current limiting factors for capabilities. Scaling to properly use more humans is hard. For interpretability they need more good engineers.
  30. (51:00) “I think the Gemini program would probably be maybe five times faster with 10 times more compute or something like that. I think more compute would just directly convert into progress.”
  31. (51:30) If compute is such a bottleneck is it being insufficiently allocated to such research and smaller training tasks? You also need the big training runs to avoid getting off track.
  32. (53:00) What does it look like for AI to speed up AI research? Could be algorithmic progress from AI. That takes more compute, but seems quite reasonable this could act as a force multiplier for humans. Also could be synthetic data.
  33. (55:30) Reasoning traces are missing from data sets, and seem important.
  34. (56:15) Is progress going to be about making really amazing AI maps of the training data? Douglas says clearly a very important part. Doing next token on a sufficiently good data set requires so many other things.
  35. (58:30) Language as synthetic data by humans for humans? With verifier via real world.
  36. (59:30) Yeah, the whole development process is largely evolutionary; more people means more recombination, more shots on target. That does seem to me to be in conflict with the best people being the ones who can discriminate over potential tasks and ideas. But also they point out serendipity is a big deal and it scales. They expect AGI to be the sum of a bunch of marginal things.
  37. (1:01:30) If we don’t get AGI by GPT-7-levels-of-OOMs are we stuck? Sholto basically buys this: orders of magnitude have diminishing returns at their core, and while they unlock reliability, reasoning progress is sublinear in OOMs. Dwarkesh notes this is highly bearish, which seems right.
  38. (1:03:15) Sholto points out that even with smaller progress, another 3.5→4 jump in GPT-levels is still pretty huge. We should expect smart plus a lot of reliability. This is not to undersell what is coming, rather the jumps so far are huge, and even smaller jumps from here unlock lots of value. I agree.
  39. (1:07:30) Bigger models allow you to minimize superposition (overloading more features onto fewer parameters), making results less noisy, whereas smaller ones are under-parameterized given their goal of representing the entire internet (a toy illustration of superposition is sketched after the notes). Speculation that superposition is why interpretability is so hard. I wonder if that means it could get easier with more parameters? Could we use ‘too many’ parameters on purpose in order to help with this?
  40. (1:11:00) What’s happening with distilled models? Dwarkesh suggests GPT-4-Turbo is distilled, Sholto suggests it could instead be new architecture.
  41. (1:12:30) Distillation is powerful because the full probability distribution gives you much richer data to work with than a single sampled token (a toy illustration is included after the notes).
  42. (1:13:30) Adaptive compute means spending more cycles on harder questions. How do you do that via chain of thought? You get to pass KV-values during forward passes, not just the token, which helps, so the KV-cache is (headcanon-level, not definitively) pushing forward the CoT without having to link to the output tokens. This is ‘secret communication’ (from the user’s perspective) of the model to its forward inferences, and we don’t know how much of that is happening. Not always the thing going on, but there is high weirdness.
  43. (1:19:15) Anthropic sleeper agents paper, notice the CoT reasoning does seem to impact results and the reasoning it does is pretty creepy. But in another paper, the model will figure out the multiple choice answer is always ‘A’ but the reasoning in its CoT will be something else that sounds plausible. Dwarkesh notes humans also come up with crazy explanations for what they are doing, such as when they have split brains. “It’s just that some people will hail chain-of-thought reasoning as a great way to solve AI safety, but actually we don’t know whether we can trust it.”
  44. (1:23:30) Agents, how will they work once they work well enough? Short term expectation from Sholto is agent talking together. Sufficiently long context windows could make fine-tuning unnecessary or irrelevant.
  45. (1:26:00) With sufficient context could you train everything on a global goal like ‘did the firm make money?’ In the limit, yes, that is ‘the dream of reinforcement learning.’ Can you feel the instrumental convergence? At first, though, they say, in practice, no, it won’t work.
  46. (1:27:45) Suggestion that languages evolve to be good at encoding things to teach children important things, such as ‘don’t die.’
  47. (1:29:30) In other modalities figuring out exactly what you are predicting is key to success. For language you predict the next token, it is easy mode in that sense.
  48. (1:31:30) “there are interesting interpretability pieces where if we fine-tune on math problems, the model just gets better at entity recognition.” It makes the model better at attending to positions of things and such.
  49. (1:32:30) Getting better at code makes the model a better thinker. Code is reasoning, you can see how it would transfer. I certainly see this happening in humans.
  50. (1:35:00) Section on their careers. Sholto’s story is a lot of standard things you hear from high-agency, high-energy high-achieving people. They went ahead and did things, and also pivot and go in different directions and follow curiosity, read all the papers. Strong ideas, loosely held, carefully selected, vigorously pursued. Dwarkesh notes one of the most important things is to go do the things, and managers are desperate for people who will make sure things get done. If you get bottlenecked because you need lawyers, well, why didn’t you go get the lawyers? Lots of impact is convincing people to work with you to do a thing.
  51. (1:43:30) Sholto is working on AI largely because he thinks it can lead to a wonderful future, and was sucked into scaling by Gwern’s scaling hypothesis post. That is indeed the right reason, if you are also taking into account the downside risks including existential risks, and still think this is a good idea. It almost certainly is not a neutral idea, it is either a very good idea or extremely ill-advised.
  52. (1:43:35) Sholto says McKinsey taught him how to actually do work, and the value of not taking no for an answer, whereas often things don’t happen because no individual cares enough to make it happen. The consultant can be that person, and you can be that person otherwise without being a consultant. He got hired largely by being seen on the internet asking questions about how things work, causing Google to reach out. It turns out at Google you can ask the algorithm and systems experts and they will gladly teach you everything they know.
  53. (1:51:30) Being in the office all the time, collaborating with others including pair programming with Sergey Brin sometimes, knowing the people who make decisions, matters a lot.
  54. (1:54:00) Trenton’s story begins, his was more standard and direct.
  55. (1:55:30) Dwarkesh notes that these stories are framed as highly contingent, that people tend to think their own stories are contingent and those of others are not. Sholto mentions the idea of shots on goal, putting yourself in position to get lucky. I buy this. There are a bunch of times I got lucky and something important happened. If you take those times away, or add different ones, my life could look very different. Also a lot of what was happening was, effectively, engineering the situation to allow those events to happen, without having a particular detailed event in mind. Same with these two.
  56. (1:57:00) Google is continuing the experiment to find high-agency people and bootstrap them. Seems highly promising. Also Chris Olah was hired off a cold email. You need to send and look out for unusual signals. I agree with Dwarkesh that is very good for the world that a lot of this hiring is not done legibly, and instead is people looking out for agency and contributions generally. If you write a great paper or otherwise show you have the goods, the AI labs will find you.
  57. (2:01:45) You still need to do the interview process, make sure people can code or what not and you are properly debiased, but that process should be designed not to get in the way otherwise.
  58. (2:03:00) Emphasis on need to care a ton, and go full blast towards what you want, doing everything that would help.
  59. (2:04:30) When you get your job, is that then the time to relax or to put the pedal to the metal? There’s pros and cons. Not everyone can go all out, many people want to focus on their families or otherwise relax. Others need to be out there working every hour of the week, and the returns are highly superlinear. And yes, this seems very right to me, returns to going fully in on something have been much higher than returns to ordinary efforts. Jane Street would have been great for me if I could have gone fully in, but I was not in a position to do that.
  60. (2:06:00) Dwarkesh: “I just try to come up with really smart questions to send to them. In that entire process I’ve always thought, if I just cold email them, it’s like a 2% chance they say yes. If I include this list, there’s a 10% chance. Because otherwise, you go through their inbox and every 34 seconds, there’s an interview for some podcast or interview. Every single time I’ve done this they’ve said yes.” And yep, story checks out.
  61. (2:09:30) A discussion of what is a feature. It is whatever you call a feature, or it is anything you can turn on and off; it is any of these things. Is that a useful definition? Not if the features were not predictive, or if the features did not do anything. The point is to compose the features into something higher level.
  62. (2:17:00) Trenton thinks you can detect features that correspond to deceptive behavior, or malicious behavior, when evaluating a request. I’ve discussed my concerns on this before. It is only a feature if you can turn it on and off, perhaps?
  63. (2:20:00) There are a bunch of circuits that have various jobs they try to do, sometimes as simple as ‘copy the last token,’ and then there are other heads that suppress that behavior. Reasons to do X, versus reasons not to do X.
  64. (2:20:45) Deception circuit gets labeled as whatever fires in examples where you find deception, or similar? Well, sure, basically.
  65. (2:22:00) RLHF induces theory of mind.
  66. (2:22:05) What do we do if the model is superhuman, will our interpretability strategies still work, would we understand what was going on? Trenton says that the models are deterministic (except when finally sampling) so we have a lot to work with, and we can do automated interpretability. And if it is all associations, then in theory that means what in my words would be ‘no secret’ so you can break down whatever it is doing into parts that we can understand and thus evaluate. A claim that evaluation in this sense is easier than generation, basically.
  67. (2:24:00) Can we find things without knowing in advance what they are? It should be possible to identify a feature and how it relates to other features even if you do not know what the feature is in some sense. Or you can train in the new thing and see what activates, or use other strategies.
  68. (2:26:00) Is red teaming Gemma helping jailbreak Gemini? How universal are features across models? To some extent.
  69. (2:27:00) Curriculum learning, which is trying to teach the model things in an intentional order to facilitate learning, is interesting and mentioned in the Gemini paper.
  70. (2:29:45) Very high confidence that this general model of what is going on with superposition is right, based on success of recent work.
  71. (2:31:00) A fascinating question: Should humans learn a real representation of the world, or would a distorted one be more useful in some cases? Should venomous animals flash neon pink, a kind of heads-up display baked into your eyes? The answer is that you have too many different use cases, distortions do more harm than good, you want to use other ways to notice key things, and so that is what we do. So Trenton is optimistic the LLMs are doing this too.
  72. (2:32:00) “Another dinner party question. Should we be less worried about misalignment? Maybe that’s not even the right term for what I’m referring to, but alienness and Shoggoth-ness? Given feature universality there are certain ways of thinking and ways of understanding the world that are instrumentally useful to different kinds of intelligences. So should we just be less worried about bizarro paperclip maximizers as a result?” I quote this question because I do not understand it. If we have feature universality, how is that not saying that the features are compatible with any set of preferences, over next tokens or otherwise? So why is this optimistic? The response is that components of LLMs are often very Shoggoth-like.
  73. (2:34:00) You can talk to any of the current models in Base64 and it works great.
  74. (2:34:10) Dwarkesh asks, doesn’t the fact that you needed a Base64 expert to happen to be there to recognize what the Base64 feature was mean that interpretability on smarter models is going to be really hard, if no human can grok it? Anomaly detection is suggested, you look for something different. Any new feature is a red flag. Also you can ask the model for help sometimes, or automate the process. All of this strikes me as exactly how you train a model not to be interpretable.
  75. (2:36:45) Feature splitting is where if you only have so much space in the model for birds it will learn ‘birds’ and call it a day, whereas if it has more room it will learn features for different specific birds.
  76. (2:38:30) We have this mess of neurons and connections. The dream is bootstrapping to making sense of all that. Not claiming we have made any progress here.
  77. (2:39:45) What parts of the process for GPT-7 will be expensive? Training the sparse autoencoder and doing the projection into a wider space of features, or labeling those features? Trenton says it depends on how much data goes in and how high-dimensional your space is, which I think means how overloaded and full of superpositions you are or are measuring (a minimal sketch of dictionary learning is included after the notes).
  78. (2:42:00) Dwarkesh asks: Why should the features be things we can understand? In Mixtral of Experts they noticed their experts were not distinctive in ways they could understand. They are excited to study this question more but so far don’t know much. It is empirical, and they will know when they look and find out. They claim there is usually clear breakdown of expert types, but that you can also get distinctions that break up what you would naively expect.
  79. (2:45:00) Try to disentangle all these neurons, audience. Sholto’s challenge to you.
  80. (2:48:00) Bruno Olshausen theorizes that all the brain regions you do not hear about are doing a ton of computation in superposition. And sure, why not? The human brain sure seems under-parameterized.
  81. (2:49:25) Superposition is a combinatorial code, not an artifact of one neuron.
  82. (2:51:20) GPT-7 has been trained. Your interpretability research succeeded. What will you do next? Try to get it to do the work, of course. But no, before that, what do you need to do to be convinced it is safe to deploy? ‘I mean we have our RSP.’ I mean, no you don’t, not yet, not for GPT-7-level models, it says ‘fill this in later’ over there. So Trenton rightfully says we would need a lot more interpretability progress. Right now he would not give the green light, he’d be crying and hoping the tears interfered with GPUs.
  83. (2:53:00) He says ‘Ideally we can find some compelling deception circuit which lights up when the model knows that it’s not telling the full truth to you.’ Dwarkesh asks about linear probes, Trenton says that does not look good (a sketch of what a linear probe is appears after the notes).
  84. I would ask, what makes you think that you have found the only such circuit? If the model had indeed found a way around your interpretability research, would you not expect it to give you a deception circuit to find, in addition to the one you are not supposed to find, because you are optimizing for exactly that which will fool you? Wouldn’t you expect the unsupervised learning to give you what you want to find either way? Fundamentally, this seems like saying ‘oh sure he lies all the time, but when he lies he never looks the person in the eye, so there is nothing to worry about, there is no way he would ever lie while looking you in the eye.’ And you do this with a thing much smarter than you, that knows you will notice this, and expect it to go well. For you, that is.
  85. Also I would reiterate all my ‘not everything you should be worried about requires the model to be deceptive in a way that is distinct from its normal behavior, even in the worlds where this distinction is maximally real,’ and also ‘deception is not a distinct thing from what is imbued into almost every communication.’ And that’s without things smarter than us. None of this seems to me to have any hope, on a very fundamental level.
  86. (2:56:15) Yet Trenton continues to be optimistic such techniques will understand GPT-7. A third of the team is scaling up dictionary learning, a second group is identifying circuits, and a third group is working to identify attention heads.
  87. (3:01:00) A good test would be, we found feature X, we ablated it, and now we can’t elicit X to happen. That does sound a little better?
  88. (3:02:00) What are the unknown unknowns for superhuman models? The answer is ‘we’ll see,’ our hope is automated interpretability. And I mean, yes, ‘we’ll see’ is in some sense the right way to discuss unknown unknowns, there are far worse answers, but my despair is palpable.
  89. (3:03:00) Should we worry if alignment succeeds ‘too hard’ and people get fine-grained control over AIs? “That is the whole value lock-in argument in my mind. It’s definitely one of the strongest contributing factors for why I am working on capabilities at the moment. I think the current player set is actually extremely well-intentioned.”
  90. (3:07:00) “If it works well, it’s probably not being published.” Finally.
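
To close out, a few toy code sketches for the more technical notes above. None of them come from the podcast itself; they are minimal illustrations under stated assumptions. First, the reliability arithmetic from note 8 (8:45): if each step of a multi-step task succeeds independently with some probability, the odds of the whole task multiply together, which is why adding nines of per-step reliability matters so much for agents.

```python
# Toy arithmetic, not from the podcast: independent per-step success rates
# multiply, so a three-step task at one-in-a-thousand per step is roughly
# one in a billion overall, while modest per-step gains compound into large
# gains on the full task.
def task_success(per_step: float, steps: int) -> float:
    return per_step ** steps

for per_step in (0.001, 0.01, 0.1, 0.9, 0.99, 0.999):
    print(f"per-step {per_step}: 3-step task succeeds {task_success(per_step, 3):.2e} of the time")
```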
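
For note 10 (11:30), some back-of-envelope FLOP counts with made-up dimensions and dense attention, rather than whatever Gemini actually does: at moderate context lengths the quadratic attention-score term is comparable to or smaller than the MLP cost and only clearly dominates at very long contexts, and during generation with a KV cache the marginal cost of each new token is linear in the context length.

```python
# Rough per-layer FLOP counts for a dense transformer layer. The constants
# (4x MLP expansion, full attention, d = 8192) are assumptions for the sake
# of the arithmetic, not details of any particular model.
def attention_flops(n, d):
    projections = 8 * n * d * d   # Q, K, V and output projections
    scores = 4 * n * n * d        # QK^T plus applying the weights to V
    return projections + scores

def mlp_flops(n, d):
    return 16 * n * d * d         # two matmuls with a 4x hidden expansion

d = 8192
for n in (8_000, 32_000, 128_000, 1_000_000):
    print(f"n={n:>9,}: attention/MLP FLOP ratio = {attention_flops(n, d) / mlp_flops(n, d):.2f}")

# During decoding with a KV cache, a new token attends over the n cached keys
# and values, so its marginal attention cost grows linearly in n, not as n^2.
```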
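
For note 39 (1:07:30), a cartoon of superposition as a combinatorial code (see also note 81): random directions in a high-dimensional space are nearly orthogonal, so you can pack far more "features" than you have dimensions and read each one back out with only modest interference. This is a toy picture of the idea, not how any real model's features were found.

```python
import numpy as np

# Pack many more feature directions than dimensions into one vector space.
rng = np.random.default_rng(0)
d, n_features = 512, 5000
features = rng.standard_normal((n_features, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Activate a few features at once (a residual-stream-style sum) and read them out.
active = [3, 100, 2024]
x = features[active].sum(axis=0)
readout = features @ x            # project onto every feature direction
print("scores of the active features:", readout[active].round(2))
print(f"typical interference elsewhere: {np.abs(np.delete(readout, active)).mean():.3f}")
```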
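
For note 41 (1:12:30), a toy of why the full distribution is richer than a sampled token: training against the teacher's whole next-token distribution penalizes the student for everything it gets wrong, not just for missing the one token the teacher happened to emit. Tiny made-up vocabulary and numbers.

```python
import numpy as np

vocab = ["cat", "dog", "car", "the"]
teacher = np.array([0.60, 0.30, 0.05, 0.05])   # teacher's next-token distribution
student = np.array([0.40, 0.20, 0.20, 0.20])   # student's current prediction

hard_label = np.argmax(teacher)                 # what sampling a single token would give you
hard_loss = -np.log(student[hard_label])        # cross-entropy against that one token
soft_loss = -(teacher * np.log(student)).sum()  # cross-entropy against the full distribution

print(f"hard-label loss: {hard_loss:.3f}")
print(f"soft-label loss: {soft_loss:.3f}  (also penalizes being wrong about 'dog')")
```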
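
For notes 77 and 86, a minimal sparse-autoencoder sketch of what "dictionary learning" over activations looks like, assuming the usual recipe of a wide ReLU encoder, an L1 sparsity penalty, and a linear decoder whose columns become the candidate feature directions. This is a toy of the general idea, not Anthropic's actual setup, and the activations here are random stand-ins.

```python
import torch
import torch.nn.functional as F

# Project d-dimensional activations into a much wider feature space, keep the
# features sparse, and reconstruct the activation as a sum of feature directions.
d_model, n_features, l1_coeff = 512, 4096, 1e-3
encoder = torch.nn.Linear(d_model, n_features)
decoder = torch.nn.Linear(n_features, d_model, bias=False)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

activations = torch.randn(8192, d_model)   # stand-in for real residual-stream activations

for batch in activations.split(256):
    feats = F.relu(encoder(batch))         # sparse, overcomplete feature activations
    recon = decoder(feats)
    loss = F.mse_loss(recon, batch) + l1_coeff * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Each column of decoder.weight is a candidate "feature direction" to go label.
```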
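
For note 83 (2:53:00), what a "linear probe" means in this context: fit a linear classifier on activations from labeled examples and ask whether a single direction separates, say, honest from dishonest behavior. The data below is synthetic with a planted signal, so it works by construction; Trenton's point is that on real models this has not looked promising.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic activations with a planted "deception direction", purely to show
# the mechanics of probing; labels and direction are made up.
rng = np.random.default_rng(0)
d, n = 256, 2000
labels = rng.integers(0, 2, size=n)                 # 0 = honest, 1 = deceptive (synthetic)
direction = rng.standard_normal(d)
acts = rng.standard_normal((n, d)) + 0.5 * labels[:, None] * direction

probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
print("held-out probe accuracy:", probe.score(acts[1500:], labels[1500:]))
```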
