LessWrong

Astronomical Waste

Neil — Fri, 26 Apr 2024 04:20:11 GMT

Published on April 26, 2024 4:20 AM GMT

The Opportunity Cost of Delayed Technological Development

By Nick Bostrom

Abstract: With advanced technology, a very large population of people living happy lives could be sustained in the accessible region of the universe. Every year in which colonization of the universe does not take place carries an opportunity cost: lives worth living fail to be realized. On plausible estimates, this cost is extremely large. The lesson for utilitarians, however, is not that we should maximize the pace of technological development, but its safety. In other words, we should maximize the probability that colonization eventually happens.

The rate of loss of potential lives

Right now, suns are illuminating and heating empty rooms, and black holes are absorbing part of the cosmos’s unused energy. Every minute, our endowment of negentropy is irreversibly dissipated into entropy. These wasted resources could have been used by an advanced civilization for the benefit of sentient beings living good lives.

This loss is occurring at a staggering rate. A recent paper argues, from theoretical considerations loosely based on the rate of entropy increase, that the loss of potential human lives in our galactic supercluster is at least ~10^46 lives per century of delayed colonization. [1] This estimate assumes that all the lost entropy could have been put to productive use. But since no known technological mechanism is capable of doing that, the assumption is not conservative enough for an estimate meant to serve as a lower bound.

We can obtain a lower bound by multiplying the number of stars in our galactic supercluster by the computing power that can be extracted per star, and dividing the result by the computing power required to simulate one human life.

As a rough approximation, let us say that the Virgo Supercluster contains 10^13 stars. The computing power that can be extracted from one star using advanced molecular nanotechnology [2] is estimated at 10^42 operations per second [3]. A typical estimate of the human brain’s computing power is a little under 10^17 operations per second [4]. Note that it does not seem to take much additional computing power to simulate all the relevant details of a typical human’s environment [5]. It follows that a potential of about 10^38 human lives is lost every century, which comes to roughly 10^29 lives per second.

This estimate is conservative in that it relies only on computational techniques already outlined in the scientific literature. An even less ambitious estimate results from assuming that future humans are embodied only in biological bodies. Suppose that about 10^10 biological humans could live around an average star. The Virgo Supercluster could then hold 10^23 biological humans. On this estimate, the loss of potential comes to about 10^14 human lives per second of delay.
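The arithmetic behind both lower bounds can be checked with a short sketch. It assumes, as the round figures in the text suggest, that one life lasts about a century, so that the number of lives lost per century equals the number of people the supercluster could sustain at once. The function and constant names are illustrative, not taken from the essay.

```python
SECONDS_PER_CENTURY = 100 * 365.25 * 24 * 3600  # about 3.16e9 seconds
STARS_IN_VIRGO_SUPERCLUSTER = 1e13              # rough figure from the text


def lives_lost_per_century(people_per_star: float) -> float:
    """Lives forgone per century of delay, assuming each life lasts ~100 years."""
    return STARS_IN_VIRGO_SUPERCLUSTER * people_per_star


def per_second(lives_per_century: float) -> float:
    """Convert a per-century loss rate into a per-second loss rate."""
    return lives_per_century / SECONDS_PER_CENTURY


# Computational estimate: 1e42 ops/s per star, ~1e17 ops/s per simulated human.
simulated = lives_lost_per_century(1e42 / 1e17)
# Biological estimate: ~1e10 humans living around an average star.
biological = lives_lost_per_century(1e10)

print(f"simulated:  {simulated:.0e}/century, {per_second(simulated):.1e}/second")
print(f"biological: {biological:.0e}/century, {per_second(biological):.1e}/second")
```

The per-second results come out near 3 × 10^28 and 3 × 10^13, which the text rounds up to the nearest power of ten (10^29 and 10^14).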

What matters for now is not the exact figures but the fact that they are astronomical. Even our most conservative estimate (which assumes a biological embodiment of the potential persons) implies a rate of one hundred trillion potential human lives lost per second of delayed colonization [6].

The opportunity cost of delayed colonization

From a utilitarian point of view, this loss of potential human lives corresponds to a loss of potential value. It is commonly accepted that current human lives are worth living. A civilization advanced enough to colonize the local supercluster would presumably be able to secure at least the minimal conditions for the same to be true of the lives it creates.

Accelerating technological development (or its determinants, such as economic productivity) therefore seems to have a greater effect on total value than any other action. Causing colonization to happen just one second sooner than it otherwise would is equivalent to bringing about 10^29 human lives (10^14 human lives on the most conservative estimate). Few philanthropic causes can rival this level of utilitarian payoff.

Views of value other than utilitarianism reach the same conclusion. For example, a conception of human well-being richer than that of typical utilitarianism might locate value in individual flourishing and self-expression, meaningful relationships, and nobility of character; it makes little difference. So long as the measure of value is aggregative (one person’s well-being does not count for less merely because many other people also exist) and involves no time discounting, the conclusion holds.

It can hold even if the measure of value is not perfectly aggregative (for example, if one component of value is diversity, whose marginal rate of production may decline as the population grows), so long as an important component of value remains aggregative. Likewise, a small degree of time discounting can be adopted without changing the conclusion [7].

The chief goal for utilitarians should be to reduce existential risk

One might be tempted to stop here and conclude that a utilitarian should focus her efforts on accelerating technological development. After all, the payoff from even a tiny success in this project dwarfs that of almost any other activity. We seem to have a utilitarian argument for the greatest possible urgency in technological development.

But the true lesson is a different one. We must take into account not only the opportunity cost of delayed colonization but also the risk of failure. We might fall victim to an existential risk, a catastrophe that would wipe out all intelligent life on Earth or otherwise curtail its future [8]. The lifespan of galaxies is measured in billions of years, whereas any delay in colonization would be measured in years or decades. The consideration of risk therefore heavily outweighs the consideration of opportunity cost. Reducing existential risk by a single percentage point is worth (in expected-utility terms) a delay of more than 10 million years.
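The 10-million-year figure follows from a simple expected-value comparison. The sketch below assumes that attainable value accrues roughly uniformly over a colonized future lasting one billion years (the low end of “billions of years”), so that a delay forfeits a proportional slice of that value. Both the uniform-accrual model and the names are simplifying assumptions made for illustration.

```python
FUTURE_DURATION_YEARS = 1e9  # assumed low end of "billions of years"


def fraction_lost_to_delay(delay_years: float) -> float:
    """Share of total value forfeited by a delay, assuming uniform accrual."""
    return delay_years / FUTURE_DURATION_YEARS


def break_even_delay(risk_reduction: float) -> float:
    """Delay (years) whose cost equals the gain from lowering extinction risk."""
    return risk_reduction * FUTURE_DURATION_YEARS


one_percentage_point = 0.01
print(f"break-even delay: {break_even_delay(one_percentage_point):.0f} years")
print(f"value lost to a 50-year delay: {fraction_lost_to_delay(50):.0e}")
```

Under these assumptions a delay of a few decades costs on the order of one part in ten million of the total, while a one-point risk reduction gains one part in a hundred, which is why the risk term dominates.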

Thus even a tiny effect of an action on the probability of eventual colonization outweighs its effect on the timing of colonization. For typical utilitarians, priority number one, two, three, and four should be to reduce existential risk. The utilitarian rallying cry “Maximize expected aggregate utility!” can be simplified to the maxim “Minimize existential risk!”

Implications for aggregative person-affecting views

The argument above assumes that our goal is to maximize the total amount of well-being. Suppose instead that we adopt a “person-affecting” version of utilitarianism, according to which our obligations are toward the people who currently exist [9]. On this view, human extinction would be bad because it would make present lives worse, not because it would constitute a loss of potential lives. What should someone who holds this doctrine do? Should she emphasize speed, safety, or something else again?

To answer, we need to consider some further variables. Suppose such a person thinks that the probability of anyone currently alive benefiting from the resources gained through colonization is low. Her reason for opposing existential risks would then be that human extinction would cut (say) an average of 40 years from the lives of 8 billion people [10]. That would certainly be a catastrophe, but one of the same order of magnitude as other human tragedies such as disease or hunger. A person-affecting utilitarian should therefore regard reducing existential risk as an important priority, but not an overwhelmingly dominant one. In that case it is not immediately obvious what she ought to do. The answer will depend on detailed calculations of which philanthropic cause she is best placed to advance.

But she may in any case need to assign a non-negligible probability to the proposition that some people alive today will survive long enough to access astronomical resources. A “technological singularity” might occur within her natural lifetime [11], or there might be a revolution in biological life extension, or even in nanotechnology; in any of these cases, the aging process could be halted and reversed [12]. Many scientists and futurists assign a high probability to these technologies being developed within the next few decades [13]. If these forecasts strike you as unrealistic, consider the abysmal track record of past technological forecasting. It would be overconfident to assign a probability of less than (say) 1% to the proposition that these technologies will be developed within our lifetimes.

An astronomical number divided by one hundred is still an astronomical number: the expected utility of a 1% chance that these technologies materialize is immense. What, then, is the expected utility in the case where a significant portion of the population gains access to astronomical resources? The answer is not obvious. On the one hand, the marginal utility of material resources declines steeply beyond a certain point. After all, Bill Gates’s level of well-being does not seem to dramatically exceed that of a person of more modest means. On the other hand, advanced technologies, which will presumably be deployed by the time the supercluster is colonized, could offer new ways of converting material resources into well-being. For example, resources could be used to enhance our mental capacities and to extend our subjective lifespan indefinitely. It is by no means clear that the marginal utility of enhanced minds and extended lifespans must decline steeply beyond some point. Thus the expected utility, for a person alive today, of access to the resources gained from colonizing the supercluster is astronomical. This conclusion holds even if one assigns a low probability to these assumptions. For an expected-utility maximizer, the well-being afforded by billions of years of subjective life and enhanced mental capacities is immense, even if the probability of success is low.
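The person-affecting comparison can be made concrete with a small expected-value sketch. It assumes that utility is linear in subjective life-years (the text argues only that it need not diminish steeply) and uses one billion years as a stand-in for “billions of years”. The 1% probability and the 40-year figure come from the text; the names are illustrative.

```python
def expected_life_years(probability: float, life_years: float) -> float:
    """Expected subjective life-years from an outcome with the given probability."""
    return probability * life_years


ORDINARY_REMAINING_YEARS = 40      # from the text: average years lost to extinction
P_ACCESS = 0.01                    # the text's illustrative 1% floor
SUBJECTIVE_YEARS_IF_ACCESS = 1e9   # assumed stand-in for "billions of years"

astronomical_stake = expected_life_years(P_ACCESS, SUBJECTIVE_YEARS_IF_ACCESS)
ratio = astronomical_stake / ORDINARY_REMAINING_YEARS

print(f"expected years per person from the long shot: {astronomical_stake:.0e}")
print(f"ratio to the ordinary 40-year stake: {ratio:.0f}")
```

On these assumptions, the long-shot stake exceeds the ordinary one by several orders of magnitude, which is the sense in which the second consideration is the weightier one.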

So, finally: what should a person-affecting utilitarian do? Clearly, avoiding existential catastrophes is important, not only because they would cut short the lives of 8 billion people, but also (and on our assumptions this is the weightier consideration) because they would destroy any hope these people have of accessing astronomical resources. The total utilitarian can focus entirely on existential risk. But for a person-affecting utilitarian, it is not enough that humanity survives to colonize the supercluster: currently existing people must also personally survive. She must therefore also emphasize the speed of technological development, so that currently existing people can survive until the fruits of colonization are reaped (which requires life-extension technologies). Where speed conflicts with safety, the total utilitarian should always prioritize safety. The person-affecting utilitarian, by contrast, must carefully balance the risk that everyone dies of old age against the risk that everyone dies in an extinction catastrophe [14].

Translated by Neil Warren-Bancquart

[1] M. Cirkovic, ‘Cosmological Forecast and its Practical Significance’, Journal of Evolution and Technology, xii (2002), https://www.jetpress.org/volume12/CosmologicalForecast.pdf.

[2] K. E. Drexler, Nanosystems: Molecular Machinery, Manufacturing, and Computation, New York, John Wiley & Sons, Inc., 1992.

[3] R. J. Bradbury, ‘Matrioshka Brains’, Manuscript, 2002, http://www.aeiveos.com/~bradbury/MatrioshkaBrains/MatrioshkaBrains.html

[4] N. Bostrom, ‘How Long Before Superintelligence?’, International Journal of Futures Studies ii (1998); R. Kurzweil, The Age of Spiritual Machines: When Computers Exceed Human Intelligence, New York, Viking, 1999. The lowest estimate is in H. Moravec, Robot: Mere Machine to Transcendent Mind, Oxford, 1999.

[5] N. Bostrom, ‘Are You Living in a Simulation?’, Philosophical Quarterly, liii (211). See also https://simulation-argument.com.

[6] The Virgo Supercluster represents only a tiny fraction of the colonizable resources of the universe, but it is astronomical enough to make our point. Enlarging the region of interest would make it far more uncertain that ours would be the only intelligent civilization in a position to use those resources.

[7] Utilitarians commonly hold that time discounting is inappropriate when evaluating value (see e.g. R. B. Brandt, Morality, Utilitarianism, and Rights, Cambridge, 1992, pp. 23f.). However, utilitarians might be forced to compromise on this principle insofar as our actions could have consequences affecting an infinite number of people (a possibility we set aside for now).

[8] N. Bostrom, ‘Existential Risks: Analyzing Human Extinction Scenarios and Related Hazards’, Journal of Evolution and Technology, ix (2002), https://www.jetpress.org/volume9/risks.html.

[9] This formulation is not necessarily the best way to describe the position, but it is simple and will serve our purposes.

[10] Or whatever the human population turns out to be just before the apocalypse.

[11] See e.g. V. Vinge, ‘The Coming Technological Singularity’, Whole Earth Review, Winter issue (1993).

[12] R. A. Freitas Jr., Nanomedicine, Vol. 1, Georgetown, Landes Bioscience, 1999.

[13] E.g. Moravec, Kurzweil, and Vinge op. cit.; E. Drexler, Engines of Creation, New York, Anchor Books, 1986.

[14] I am grateful for the financial support of a British Academy Postdoctoral Award.

LLMs seem (relatively) safe

JustisMills — Thu, 25 Apr 2024 22:13:07 GMT

Published on April 25, 2024 10:13 PM GMT

Post for a somewhat more general audience than the modal LessWrong reader, but gets at my actual thoughts on the topic.

In 2019 OpenAI defeated the world champions of Dota 2, a major esports game. This was hot on the heels of DeepMind’s AlphaGo performance against Lee Sedol in 2016, achieving superhuman Go performance way before anyone thought that might happen. AI benchmarks were being cleared at a pace which felt breathtaking at the time, papers were proudly published, and ML tools like TensorFlow (released in 2015) were coming online. To people already interested in AI, it was an exciting era. To everyone else, the world was unchanged.

Now Saturday Night Live sketches use sober discussions of AI risk as the backdrop for their actual jokes, there are hundreds of AI bills moving through the world’s legislatures, and Eliezer Yudkowsky is featured in Time Magazine.

For people who have been predicting, since well before AI was cool (and now passé), that it could spell doom for humanity, this explosion of mainstream attention is a dark portent. Billion-dollar AI companies keep springing up and allying with the largest tech companies in the world, and bottlenecks like money, energy, and talent are widening considerably. If current approaches can get us to superhuman AI in principle, it seems like they will in practice, and soon.

But what if large language models, the vanguard of the AI movement, are actually safer than what came before? What if the path we’re on is less perilous than what we might have hoped for, back in 2017? It seems that way to me.

LLMs are self limiting

To train a large language model, you need an absolutely massive amount of data. The core thing these models are doing is predicting the next few letters of text, over and over again, and they need to be trained on billions and billions of words of human-generated text to get good at it.

Compare this process to AlphaZero, DeepMind’s algorithm that superhumanly masters Chess, Go, and Shogi. AlphaZero trains by playing against itself. While older chess engines bootstrap themselves by observing the records of countless human games, AlphaZero simply learns by doing. Which means that the only bottleneck for training it is computation - given enough energy, it can just play itself forever, and keep getting new data. Not so with LLMs: their source of data is human-produced text, and human-produced text is a finite resource.

The precise datasets used to train cutting-edge LLMs are secret, but let’s suppose that they include a fair bit of the low hanging fruit: maybe 5% of the text that is in principle available and not garbage. You can schlep your way to a 20x bigger dataset in that case, though you’ll hit diminishing returns as you have to, for example, generate transcripts of random videos and filter old mailing list threads for metadata and spam. But nothing you do is going to get you 1,000x the training data, at least not in the short run.
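The 20x ceiling is just the reciprocal of the fraction already in use. A minimal sketch, taking the post’s 5% guess at face value (the function names are illustrative):

```python
def max_dataset_multiplier(fraction_already_used: float) -> float:
    """Largest possible dataset growth if only this fraction is used today."""
    if not 0 < fraction_already_used <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1 / fraction_already_used


def fraction_implied_by(multiplier: float) -> float:
    """Fraction that would have to be in use today for this much headroom."""
    return 1 / multiplier


print(f"headroom at 5% usage: {max_dataset_multiplier(0.05):.0f}x")
print(f"usage implied by 1,000x headroom: {fraction_implied_by(1000):.1%}")
```

For a 1,000x increase to be available, current models would have to be using only about a tenth of a percent of the usable text, far below the 5% guess.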

Scaling laws are among the watershed discoveries of ML research in the last decade; basically, these are equations that project how much oomph you get out of increasing the size, training time, and dataset that go into a model. And as it turns out, the amount of high quality data is extremely important, and often becomes the bottleneck. It’s easy to take this fact for granted now, but it wasn’t always obvious! If computational power or model size was usually the bottleneck, we could just make bigger and bigger computers and reliably get smarter and smarter AIs. But that only works to a point, because it turns out we need high quality data too, and high quality data is finite (and, as the political apparatus wakes up to what’s going on, legally fraught).
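One way to see why data becomes the bottleneck is a loss formula with separate terms for model size and dataset size. The additive power-law shape below resembles published scaling-law fits, but the constants are illustrative placeholders chosen for this sketch, not fitted values from any paper.

```python
# Illustrative placeholder constants (assumptions, not fitted values).
IRREDUCIBLE = 1.7      # loss that no amount of scale removes
A, ALPHA = 400.0, 0.3  # model-size term
B, BETA = 400.0, 0.3   # dataset-size term


def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Toy scaling law: loss falls as a power law in parameters and in tokens."""
    return IRREDUCIBLE + A / n_params**ALPHA + B / n_tokens**BETA


def data_limited_floor(n_tokens: float) -> float:
    """Loss an infinitely large model would approach with this much data."""
    return IRREDUCIBLE + B / n_tokens**BETA


TOKENS = 1e12  # hold the dataset fixed and grow only the model
for n_params in (1e9, 1e10, 1e11, 1e12):
    print(f"{n_params:.0e} params -> loss {predicted_loss(n_params, TOKENS):.3f}")
print(f"floor at {TOKENS:.0e} tokens: {data_limited_floor(TOKENS):.3f}")
```

With the dataset held fixed, each tenfold increase in parameters buys less, and the loss never drops below the data-limited floor. Only more (or better) data lowers that floor.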

There are rumblings about synthetic data, that basically a strong LLM can generate a bunch of text that’s as good as human text, and then that can be fed back in to train future models. And while it’s possible that this will work, or even has already been proven to work behind closed doors somewhere, I’m currently skeptical; the whole point of using human-derived data is that human-produced text describes the actual world, and if you slurp up enough of it you end up understanding the world by proxy. Synthetic data would reinforce whatever issues exist in the model, creating text with the same blind spots over and over again, and thus increasing their size. There could be technical solutions to this; again, maybe they’re already underway. But to my nose, as a person not in those private rooms, the notion smells like hype.

And putting aside synthetic data, we may be in for a slog to get superintelligent LLMs. We can improve model architectures, clean datasets, fine tune, and all the other little things to wring value out of this already valuable technology. But the limiting reactant in LLMs is data, and data is finite.

LLMs are decent at human values

Until now, I’ve been talking about AI capabilities - reasons why modern AI systems may fizzle out before they are powerful enough to immanentize the eschaton. Now I’m going to shift to AI alignment, which I consider much thornier. AI alignment is, basically, AI systems doing what human beings want them to do.

Without getting into the details of why AI alignment is hard and complicated (and boy are there details), one desideratum is an AI system understanding human values in the first place. There are lots of reasons it’s hard to be sure an empowered AI system has our best interests at heart, but with LLMs, at least, it seems plausible such an AI could make a decent guess at what those interests are.

For example, here’s what Claude 3 Opus, a current leading LLM, said when asked to give a single nuanced paragraph about surge pricing:

Surge pricing, a practice commonly used by ride-sharing services and other businesses, raises complex moral questions. On one hand, it can be seen as a fair and efficient way to match supply with demand, encouraging more drivers to work during peak times and ensuring that customers who most need a ride can get one. It could be argued that surge pricing is a voluntary exchange that benefits both parties. However, critics argue that it disproportionately affects low-income individuals who may have fewer transportation alternatives and be more vulnerable to price exploitation. Surge pricing can make essential services unaffordable for some, exacerbating existing inequalities. There are also concerns about the transparency and predictability of surge pricing algorithms. Ultimately, the morality of surge pricing depends on how one weighs these competing considerations of economic efficiency, fairness, and access in the context of essential services that people rely on.

Is it the greatest moral philosophy I’ve ever read? No. But clearly this is a system that has some idea what ethical considerations are, and how they are deployed. I think that, like data as a bottleneck, this is something that’s easy to take for granted in the modern moment. But taking a step back, it’s kind of remarkable: if anything, modern AI is too concerned with following ethical guidelines, with people all over the internet making fun of it for refusing benign requests on ethical grounds.

Now it’s totally possible to train models with no ethical compunctions, or even models (generally with scaffolding) that actively seek to do harm. Furthermore, it’s dangerous to confuse the role a model seems to play through its text with the actual underlying mechanism. Technically, Claude’s paragraph about surge pricing is the result of a system being told it’s about to read a helpful assistant’s answer to a question about surge pricing, and then that system predicting what comes next. So we shouldn’t read too much into the fact that our chatbots can wax poetic on ethics. But nobody expected chatbots that waxed poetic on ethics six years ago! We were still trying to get AI to kick our asses at games! We’re clearly moving in the right direction.

LLMs being able to produce serviceable ethical analyses (sometimes) is also a good sign if the first superhuman AI systems are a bunch of scaffolding around an LLM core. Because in that case, you could have an “ethics module” where the underlying LLM produces text which then feeds into other parts of the system to help guide behavior. I fully understand that AI safety experts, including the one that lives in my heart, are screaming at the top of their lungs right now. But remember, I’m thinking of the counterfactual here: compared to the sorts of things we were worried about ten years ago, the fact that leading AI products could pass a pop quiz on human morality is a clear positive update.
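As a rough illustration of what such an “ethics module” could look like inside a scaffold, the sketch below gates a proposed action on a text verdict from a language model. Everything here is hypothetical: `llm` stands for any callable that maps a prompt string to a completion string, `keyword_stub_llm` is a toy stand-in so the example runs without a real model, and nothing in this sketch addresses whether the verdict can be trusted.

```python
from typing import Callable


def ethics_gate(llm: Callable[[str], str], proposed_action: str) -> bool:
    """Ask the model for a verdict and allow the action only on a clear ALLOW."""
    prompt = (
        "You are reviewing a proposed action for ethical problems.\n"
        f"Proposed action: {proposed_action}\n"
        "Answer with exactly one word, ALLOW or BLOCK."
    )
    verdict = llm(prompt).strip().upper()
    # Fail closed: malformed or unexpected output is treated as BLOCK.
    return verdict == "ALLOW"


def keyword_stub_llm(prompt: str) -> str:
    """Toy stand-in for a real model (assumption: used for demonstration only)."""
    return "BLOCK" if "steal" in prompt.lower() else "ALLOW"


for action in ("water the office plants", "steal a bicycle"):
    status = "allowed" if ethics_gate(keyword_stub_llm, action) else "blocked"
    print(f"{action}: {status}")
```

The design choice worth noting is that the gate fails closed, so a confused or off-format answer from the model stops the action rather than letting it through.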

Playing human roles is pretty human

Going back to AlphaGo again, one feature of that era was that AI outputs were commonly called alien. We’d get some system that achieved superhuman performance, but it would succeed in weird and unnerving ways. Strategies turned out to dominate that humans had ruled out long ago, as the machine’s tactical sensibility transcended our understanding.

I can imagine a world where AI continues from something like this paradigm, where game-playing AIs gradually expand into more and more modalities. Progress would likely be much slower without the gigantic vein of powerful world-modelling data that is predicting human text, but I can imagine, for example, bots that play chess evolving to bots that play Go evolving into bots with cameras and sensors that play Jenga, and so on, until finally you have bots that engage in goal-directed behavior in the real world in all its generality.

Instead, with LLMs, we show them through our text how the world works, and they express that understanding through impersonating that text. It’s no coincidence that one of the best small LLMs was created for roleplay (including erotic roleplay - take heart Aella); roleplay is the fundamental thing that LLMs do.

Now, LLMs are still alien minds. They are the first minds we’ve created that can produce human-like text without residing in human bodies, and they arrive at their utterances in very different ways than we do. But trying to think marginally, an alien mental structure that is built specifically to play human roles seems less threatening than an alien mental structure that is built to achieve some other goal, such as scoring a bunch of points or maximizing paperclips.

And So

I think there’s too much meta-level discourse about people’s secret motivations and hypocrisies in AI discussion, so I don’t want to contribute to that. But I am sometimes flummoxed by the reaction of oldschool AI safety types to LLMs.

It’s not that there’s nothing to be scared of. LLMs are totally AI, various AI alignment problems do apply to them, and their commercial success has poured tons of gas on the raging fire of AI progress. That’s fair on all counts. But I also find myself thinking, pretty often, that conditional on AI blowing up right now, this path seems pretty good! That LLMs do have a head start when it comes to incorporating human morals, that their mechanism of action is less alien than what came before, and that they’re less prone, relative to self-play agents, to becoming godlike overnight.

Am I personally more or less worried about AI than I was 5 years ago? More. There are a lot of contingent reasons for that, and it’s a story for another time. But I don’t think recent advances are all bad. In fact, when I think about the properties that LLMs have, it seems to me like things could be much worse.

Losing Faith In Contrarianism

omnizoid — Thu, 25 Apr 2024 20:53:35 GMT

Published on April 25, 2024 8:53 PM GMT

Crosspost from my blog.

If you spend a lot of time in the blogosphere, you’ll find a great many people expressing contrarian views. If you hang out in the circles that I do, you’ll probably have heard Yudkowsky say that dieting doesn’t really work, Guzey say that sleep is overrated, Hanson argue that medicine doesn’t improve health, various people argue for the lab leak, others argue for hereditarianism, Caplan argue that mental illness is mostly just aberrant preferences and that education doesn’t work, and various other people defend similar positions. Often, very smart people like Robin Hanson will write long posts defending these views, other people will have criticisms, and it will all be such a tangled mess that you don’t really know what to think about them.

For a while, I took a lot of these contrarian views pretty seriously. If I’d had to bet six months ago, I’d have bet on the lab leak, at maybe 2 to 1 odds. Until pretty recently, when Scott released his post explaining why it is wrong, I’d have had significant credence in Hanson’s view that healthcare doesn’t improve health.

Over time, though, I’ve become much less sympathetic to these contrarian views. It’s become increasingly obvious that the things that make them catch on are unrelated to their truth. People like being provocative and tearing down sacred cows—as a result, when a smart articulate person comes along defending some contrarian view—perhaps one claiming that something we think is valuable is really worthless—the view spreads like wildfire, even if it’s pretty implausible.

Sam Atis has an article titled The Case Against Public Intellectuals. He starts it by noting a surprising fact: lots of his friends think education has no benefits. This isn’t because they’ve done a thorough investigation of the literature—it’s because they’ve read Bryan Caplan’s book arguing for that thesis. Atis notes that there’s a literature review finding that education has significant benefits, yet it’s written by boring academics, so no one has read it. Everyone wants to read the contrarians who criticize education—no one wants to read the boring lit reviews that say what we believed about education all along is right.

Sam is right, yet I think he understates the problem. There are various topics where arguing for one side is inherently interesting, yet arguing for the other side is boring. There are a lot of people who read Austrian economics blogs, yet no one reads (or writes) anti-Austrian economics blogs. That’s because there are a lot of fans of Austrian economics, people who are willing to read blogs on the subject, but almost no one who is really invested in Austrian economics being wrong. So as a result, in general, the structural incentives of the blogosphere favor being a contrarian.

Thus, you should expect the sense of the debate you get, unless you peruse the academic literature in depth surrounding some topic, to be wildly skewed towards contrarian views. And I think this is exactly what we observe.

I’ve seen the contrarians be wrong over and over again—and this is what really made me lose faith in them. Whenever I looked more into a topic, whenever I got to the bottom of the full debate, it always seemed like the contrarian case fell apart.

It’s easy for contrarians to portray their opponents as the kind of milquetoast bureaucrats who aren’t very smart and follow the consensus just because it is the consensus. If Bryan Caplan has a disagreement with a random administrator, I trust that Bryan Caplan’s probably right, because he’s smarter and cares more about ideas.

But what I’ve come to realize is that the mainstream view that’s supported by most of the academics tends to be supported by some really smart people. Caplan’s view isn’t just opposed by the bureaucrats and teachers; it’s opposed by the type of obsessive autist who does a lit review on the effect of education. And while I’ll bet in favor of Caplan against campus administrators, I would never make a mistake like betting against the obsessive high-IQ autists.

Sam Atis, a superforecaster, had a piece arguing against The Case Against Education, but it got eaten by a Substack glitch. Reading it, and then consulting a friend who knows quite a bit about these things, left me pretty confident that Caplan was wrong.

This is very far from the only case; I’ve watched the contrarians’ cases fall apart over and over again. Reading Alexey Guzey’s theses on sleep left me undecided, but then Natania’s counter-theses on sleep left me quite confident that Guzey is wrong. Guzey’s case turns out to be shockingly weak and opposed by a mountain of evidence.

Similarly, now that I’ve read through Scott’s response to Hanson on medicine, I’d bet at upwards of 9 to 1 odds that Hanson is wrong about it. There’s an abundance of evidence that medicine has dramatically improved health outcomes, from well-done randomized trials to the fact that people are surviving more from almost all diseases. Hanson’s studies don’t even really support what he says when examined closely.

Similarly, the lab leak theory—one of the more widely accepted and plausible contrarian views—also doesn’t survive careful scrutiny. It’s easy to think it’s probably right when your perception is that the disagreement is between people like Saar Wilf and government bureaucrats like Fauci. But when you realize that some of the anti-lab leak people are obsessive autists who have studied the topic a truly mind-boggling amount, and don’t have any social or financial stake in the outcome, it’s hard to be confident that they’re wrong.

I read through the lab-leak debates in some depth, reading Scott’s blog, Rootclaim’s response, Scott’s response, and various other pieces. And my conclusion was that the lab-leak view was far, far less plausible than the zoonosis view. The lab leak view has no good explanation of why all the early cases were at the wet market and why the heat map clearly shows the wet market as the place where the pandemic started.

The contrarian’s enemy is not only random conformists. It’s also ridiculously smart people who have studied the topic in incredible depth and concluded that they’re wrong. And as we all know from certain creative offshoots of rock, paper, scissors, high-IQ mega autists beats public intellectual.

I read through the Caplan v Alexander debate about mental illness. And I concluded that Caplan wasn’t just wrong, he was clearly and egregiously wrong (I even wrote an article about it). This is not to beat up on Caplan—I generally think he’s one of the better contrarians. But the consensus view often turns out to be right on these things.

Similarly, there are a lot of people like Steve Sailer and Emil Kirkegaard arguing that there are racial gaps in intelligence, based on genetics. But when I read them on other stuff, they’re just not great thinkers. In contrast, while Jay M’s blog isn’t as popular or as fun to read for most people, he has a good piece arguing pretty convincingly against the genetic explanation of the gap. The author isn’t a conformist: his other articles express various controversial views about race. Yet he did a thorough deep dive into the literature and concluded that the environmental explanation is most plausible. I’ve also chatted with him and he’s very smart and good at thinking, unlike, I think, Kirkegaard and Sailer (I could be wrong about that; I don’t know them that well). I don’t have the statistical acumen to really evaluate the debate, but I do get the same sense: that while popular contrarians with widely read blogs say one thing, the balance of evidence doesn’t support that view.

Many more people read Kirkegaard and Sailer because expressing the conformist view on the topic is much less interesting than expressing the contrarian view. Most of the people who believe the gap is environmental don’t much want to argue about it, so almost all the people who write things about it are people who believe the genetic explanation of the gap. Very few people want to read articles saying “here are 10,000 words showing that the view you reject by calling it racist pseudoscience is actually contradicted by the majority of the evidence.”

I could run through more examples but the point should be clear. Whenever I look more into contrarian theories, my credence in them drops dramatically and the case for them falls apart completely. They spread extremely rapidly as long as they have even a few smart, articulate proponents who are willing to write things in support of them. The obsessive autists who have spent 10,000 hours researching the topic and writing boring articles in support of the mainstream position are left ignored.

Why I stopped being into basin broadness

tailcalled — Thu, 25 Apr 2024 20:47:19 GMT

Published on April 25, 2024 8:47 PM GMT

There was a period where everyone was really into basin broadness for measuring neural network generalization. This mostly stopped being fashionable, but I'm not sure if there's enough written up on why it didn't do much, so I thought I should give my take for why I stopped finding it attractive. This is probably a repetition of what others have found, but I thought I might as well repeat it.

Let's say we have a neural network $f_w$. We evaluate it on a dataset $(x, y) \sim D$ using a loss function $L(\hat{y}, y) \in \mathbb{R}$, to find an optimum $w^* = \arg\min_w E_{(x, y) \sim D}[L(f_w(x), y)]$. Then there was an idea going around that the Hessian matrix (i.e. the second derivative of $E_{(x, y) \sim D}[L(f_w(x), y)]$ at $w^*$) would tell us something about $w^*$ (especially about how well it generalizes).

If we number the dataset $(x_i, y_i)$, we can stack all the network outputs $\hat{y}_i(w) = f_w(x_i)$, which fit into an empirical loss $\hat{L}(\hat{y}) = \frac{1}{n} \sum_{i=1}^{n} L(\hat{y}_i, y_i)$. The Hessian that we talked about before is now just the Hessian of $\hat{L}(\hat{y}(w))$. Expanding this out is kind of clunky since it involves some convoluted tensors that I don't know any syntax for, but clearly it consists of two terms:

The Hessian of $\hat{L}$ with a pair of the Jacobian of $\hat{y}$ on each end (this can just barely be written without crazy tensors: $(J_w \hat{y}(w))^T \, H_{\hat{y}} \hat{L}(\hat{y})|_{\hat{y}(w)} \, J_w \hat{y}(w)$)
The gradient of $\hat{L}$ with a crazy second derivative of $\hat{y}$.
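
Written with index sums instead of tensors, the two terms can be sketched as follows. This is just the standard chain rule for second derivatives applied to $\hat{L}(\hat{y}(w))$, assuming each output $\hat{y}_i$ is a scalar; the weight indices $a, b$ and data-point indices $i, j$ are notation introduced here for illustration.

```latex
\frac{\partial^2 \hat{L}}{\partial w_a \, \partial w_b}
  = \underbrace{\sum_{i,j}
      \frac{\partial \hat{y}_i}{\partial w_a}
      \frac{\partial^2 \hat{L}}{\partial \hat{y}_i \, \partial \hat{y}_j}
      \frac{\partial \hat{y}_j}{\partial w_b}}_{\text{Hessian of } \hat{L} \text{ between two Jacobians}}
  + \underbrace{\sum_{i}
      \frac{\partial \hat{L}}{\partial \hat{y}_i}
      \frac{\partial^2 \hat{y}_i}{\partial w_a \, \partial w_b}}_{\text{gradient of } \hat{L} \text{ with second derivative of } \hat{y}}
```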

Now, the derivatives of $\hat{L}$ are "obviously boring" because they don't really refer to the neural network weights, which is confirmed if you think about it in concrete cases, e.g. if $L(\hat{y}, y) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$ with $y = 1$ or $y = 0$, the derivatives just quantify how far $\hat{y}$ is from $y$. This obviously isn't relevant for neural network generalization, except in the sense that it tells you which direction you want to generalize in.
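
For the cross-entropy example, the derivatives can be worked out directly. This is a sketch for a single data point with $0 < \hat{y} < 1$; for the empirical loss each expression picks up a factor of $\frac{1}{n}$ and the Hessian in $\hat{y}$ is diagonal.

```latex
\frac{\partial L}{\partial \hat{y}}
  = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
  = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})},
\qquad
\frac{\partial^2 L}{\partial \hat{y}^2}
  = \frac{y}{\hat{y}^2} + \frac{1 - y}{(1 - \hat{y})^2}
```

Neither expression mentions $w$: both depend only on the output $\hat{y}$ and the label $y$.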

Meanwhile, $J_w \hat{y}(w)$ is incredibly strongly related to neural network generalization, because it's literally a matrix which specifies how the neural network outputs change in response to the weights. In fact, it forms the core of the neural tangent kernel (a standard tool for modelling neural network generalization), because the NTK can be expressed as $J_w \hat{y}(w) \, (J_w \hat{y}(w))^T$.

The "crazy second derivative of $\hat{y}$" can I guess be understood separately for each $\hat{y}_i$, as then it's just the Hessian $H_w \hat{y}_i(w)$, i.e. it reflects how changes in the weights interact with each other when influencing $\hat{y}_i$. I don't have any strong opinions on how important this matrix is, though because $J_w \hat{y}(w)$ is so obviously important, I haven't felt like granting $H_w \hat{y}_i(w)$ much attention.

The NTK as the network activations?

Epistemic status: speculative, I really should get around to verifying it. Really the prior part is speculative too, but I think those speculations are more theoretically well-grounded. But if I'm wrong with either, please call me a dummy in the comments so I can correct.

Let's take the simplest case of a linear network, $f_w(x) = w^T x$. In this case, $J_w \hat{y}(w) = x^T$, i.e. the Jacobian is literally just the inputs to the network. If you work out a bunch of other toy examples, the takeaway is qualitatively similar (the Jacobian is closely related to the neuron activations), though not exactly the same.
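
The linear case is easy to check numerically. The sketch below uses plain Python and a finite-difference Jacobian; the helper names (`f`, `jacobian`, `ntk`), the inputs, and the weights are all made up for illustration. It confirms that each row of the Jacobian equals the corresponding input, so the NTK $J J^T$ reduces to the Gram matrix of the inputs.

```python
def f(w, x):
    """Linear network f_w(x) = w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def jacobian(w, xs, eps=1e-6):
    """Central finite-difference Jacobian of the stacked outputs w.r.t. w.

    Row i holds d f_w(x_i) / d w_a for each weight index a.
    """
    rows = []
    for x in xs:
        row = []
        for a in range(len(w)):
            w_plus = list(w)
            w_plus[a] += eps
            w_minus = list(w)
            w_minus[a] -= eps
            row.append((f(w_plus, x) - f(w_minus, x)) / (2 * eps))
        rows.append(row)
    return rows


def ntk(J):
    """Empirical NTK J J^T: one entry per pair of data points."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in J] for ri in J]


xs = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]]  # two made-up inputs
w = [0.3, -0.7, 1.1]                       # arbitrary weights

J = jacobian(w, xs)  # rows match xs up to floating-point error
K = ntk(J)           # matches the Gram matrix of xs
```

Because the model is linear in $w$, the result does not depend on the particular `w` chosen; for a nonlinear network the same `jacobian` helper would give a weight-dependent answer.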

There are of course some exceptions, e.g. $f_{a, b} (x) = a b x$ at $a = b = 0$ just has a zero Jacobian. Exceptions this extreme are probably rare, but more commonly you could have some softmax in the network (e.g. in an attention layer) which saturates such that no gradient goes through. In that case for e.g. interpretability, it seems like you'd often still really want to "count" this, so arguably the activations would be better than the NTK for this case. (I've been working on a modification to the NTK to better handle this case.)
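
To make the first exception concrete, here is the gradient and Hessian of $f_{a,b}(x) = abx$ with respect to $(a, b)$, worked out by hand. The gradient vanishes at $a = b = 0$, so the NTK is zero there, even though the second derivative is not.

```latex
\nabla_{(a,b)} f = \begin{pmatrix} b x \\ a x \end{pmatrix}
\;\xrightarrow{\; a = b = 0 \;}\;
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\qquad
H_{(a,b)} f = \begin{pmatrix} 0 & x \\ x & 0 \end{pmatrix}
```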

The NTK and the network activations have somewhat different properties, so which one I consider most relevant switches from case to case. However, my choice tends to be more driven by analytical convenience (e.g. the NTK and the network activations lie in different vector spaces) than by anything else.

AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

DanielFilan — Thu, 25 Apr 2024 19:10:08 GMT

Published on April 25, 2024 7:10 PM GMT

YouTube link

In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes ‘grok’: that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn’t generalize), but then suddenly switch to understanding the ‘real’ solution in a way that generalizes. What’s going on with these discoveries? Are they all they’re cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.

Topics we discuss:

Challenges with unsupervised LLM knowledge discovery, aka contra CCS
Explaining grokking through circuit efficiency
Vikrant’s research approach
The DeepMind alignment team
Follow-up work

Daniel Filan: Hello, everybody. In this episode I’ll be speaking with Vikrant Varma, a research engineer at Google DeepMind, and the technical lead of their sparse autoencoders effort. Today, we’ll be talking about his research on problems with contrast-consistent search, and also explaining grokking through circuit efficiency. For links to what we’re discussing, you can check the description of this episode and you can read the transcript at axrp.net.

All right, well, welcome to the podcast.

Vikrant Varma: Thanks, Daniel. Thanks for having me.

Challenges with unsupervised LLM knowledge discovery, aka contra CCS

What is CCS?

Daniel Filan: Yeah. So first, I’d like to talk about this paper. It is called Challenges with Unsupervised LLM Knowledge Discovery, and the authors are Sebastian Farquhar, you, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, and Rohin Shah. This is basically about this thing called CCS. Can you tell us: what does CCS stand for and what is it?

Vikrant Varma: Yeah, CCS stands for contrast-consistent search. I think to explain what it’s about, let me start from a more fundamental problem that we have with advanced AI systems. One of the problems is that when we train AI systems, we’re training them to produce outputs that look good to us, and so this is the supervision that we’re able to give to the system. We currently don’t really have a good idea of how an AI system or how a neural network is computing those outputs. And in particular, we’re worried about the situation in the future when the amount of supervision we’re able to give it causes it to achieve a superhuman level of performance at that task. By looking at the network, we can’t know how this is going to behave in a new situation.

And so the Alignment Research Center put out a report recently about this problem. They named a potential part of this problem as “eliciting latent knowledge”. What this means is if your model is, for example, really, really good at figuring out what’s going to happen next in a video, as in it’s able to predict the next frame of a video really well given a prefix of the video, this must mean that it has some sort of model of what’s going on in the world. Instead of using the outputs of the model, if you could directly look at what it understands about the world, then potentially, you could use that information in a much safer manner.

Now, why would it be safer? So consider how you’ve trained the model. Supposing you’ve trained the model to just predict next frames, but the thing that you actually care about might be is your house safe? Or is the thing that’s happening in the world a normal sort of thing to happen, a thing that we desire? And you have some sort of adversary, perhaps this model, perhaps a different model that is able to trick whatever sensor you’re using to produce those video frames. Now, the model that is predicting the next video frame understands the trickery, it understands what’s actually going on in the world. This is an assumption of superhuman systems. However, the prediction that it makes for the next frame is going to look very normal because your adversary is tricking the sensor. And what we would like is a way to access this implicit knowledge or this latent knowledge inside the model about the fact that the trickery is happening and be able to use that directly.

Daniel Filan: Sure. I take this as a metaphor for an idea that we’re going to train AI systems, we’re going to train it on an objective of “do stuff we like”. We imagine that we’re measuring the world in a bunch of ways. We’re looking at GDP output, we’re looking at how many people will give a thumbs up to stuff that’s happening, [there are] various sorts of ways we can monitor the performance of an AI system. An AI system could potentially be doing something that we actually wouldn’t approve of if we understood everything correctly, but we all give thumbs up to it. And so ideally, we would like to somehow get at its latent knowledge of what’s going on rather than just “does it predict that we would thumb up a thing?” so that we can say, “Hey, do we actually want the AI to pursue this behavior?” Or, “Are we going to reinforce this behavior rather than just reinforcing things that in fact would get us to give a thumbs up, even if it would suck in some way?”

Vikrant Varma: That’s right. So one way you can think about this problem is: we’re trying to improve our ability to tell what’s actually going on in a situation so that we can improve the feedback we give to the model, and we’re not able to do this just by looking at the model’s outputs or the model’s prediction of what the world will look like given certain actions. We want the model to connect the thing that we actually care about to what’s going on in the world, which is a task that we’re not able to do.

Daniel Filan: Sure. So with that out of the way, what was this CCS thing?

Vikrant Varma: Yeah, so CCS is a very early proposed direction for solving the problem of eliciting latent knowledge. In brief, the way it works is, so supposing you had a way to probe a model to tell you what it believed about some proposition. This is the ideal thing that we want, so supposing you had a solution to ELK [eliciting latent knowledge]. Then one property of this probe would be that the probability that this probe would assign to some proposition would satisfy the laws of probability. So for example, it would satisfy P(X) = 1 - P(not X). And so you might try to use consistency properties like this to search for probes that satisfy them.

Daniel Filan: And to be clear, by probe, you mean a learned function from the activations of the neural net to a probability or something?

Vikrant Varma: Yes, so you could have probes of different levels of complexity. The particular probe used in CCS is a very simple linear probe on the activations at some site, yes.

Daniel Filan: Sure, so the idea is there are some properties that are true of probabilities, like the probability of X is 1 minus the probability of not X. The hope is we train a probe to fit the laws of probabilities and stuff, and hopefully that will get at the model’s beliefs because the model’s beliefs will be probabilities.

Vikrant Varma: That’s right, yeah. There’s a lot of subtlety here: the thing I described is true of the model’s “true belief” or the “true probabilities” that it’s assigning to things; [but] it’s also true of many other types of things that the model might be computing. This is really a core of the problem that we’ve pointed out in our challenges paper. I could go a bit into the problems, or I could go a bit more into how CCS itself works.

Daniel Filan: Just tell us how CCS itself works so that we have a sense of what we’re talking about.

Vikrant Varma: Okay. So mechanistically, what’s going on in CCS is that you take a set of factual ‘yes or no’ questions that you think the model knows the answer to. So an example of this is IMDB movie reviews where you’re fairly sure that a model of a certain size knows whether the sentiment in the review is positive or negative.

Now, for each item in this dataset, so for each review, you construct a contrast pair by appending, for example, the string, “this review is positive” or “this review is negative” to that item. And now you take the difference in the activations when you run these two contrast pairs through the model. So I’m going to [describe] these contrast pairs as having different charges, as a positive charge and a negative charge. So you’re constructing these contrast pairs with these pairs of activations, and then a thing you can do is you can cluster them.

When you’re clustering the difference in the positive and negative charge, the thing you will find (or the thing you might hope to find) is whether the model thinks that the charge that was appended to the statement is true or is false. These two things will point in different directions, and so that’s why you might hope that you could separate them. Now, CCS is slightly different from this in that it doesn’t directly do the clustering. So instead in CCS, you learn a probe that directly outputs a probability of that statement being true, and then you’re training the probe using this consistency property: the fact that two opposite contrast pairs should be the negation of each other.
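
As a rough illustration of the probe and its training signal, here is a minimal sketch in plain Python. It assumes the squared-error form of the objective from the original CCS paper as best I recall it (a consistency term plus a confidence term); the function names and the toy probabilities are made up, and a real implementation would optimize the probe parameters over many contrast pairs.

```python
import math


def probe(theta, bias, activation):
    """Linear probe followed by a sigmoid, giving a probability in (0, 1)."""
    z = sum(t * a for t, a in zip(theta, activation)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def ccs_loss(p_pos, p_neg):
    """Per-pair loss on the probe outputs for the two halves of a contrast pair.

    consistency: p_pos should equal 1 - p_neg
    confidence:  penalizes the degenerate answer p_pos = p_neg = 0.5
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence


confident = ccs_loss(0.9, 0.1)     # consistent and confident: lowest of the three
hedging = ccs_loss(0.5, 0.5)       # consistent but uninformative
inconsistent = ccs_loss(0.9, 0.9)  # claims both halves are true: highest
```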

Consistent and contrastive features other than model beliefs

Daniel Filan: Gotcha. Am I right that before you take the contrast pairs, you take all of the positive charge activations and subtract off their mean and divide by the standard deviation? So that the differences aren’t just pointing in the direction of “is the thing at the end saying this review is positive versus this review is negative?”

Vikrant Varma: Yes, that’s right. So this is another pretty subtle point. One problem with this general method of classification is that if there are other differences that are salient between the two contrast pairs that are not just “did I construct a true statement or did I construct a false statement?”, then you might end up separating your clusters based on those differences. Now, one obvious difference between the two contrast pairs is that you’ve appended a positive charge, and you’ve appended a negative charge. And so that’s a really straightforward one that we have to fix. The method proposed in the CCS paper to fix that is that you take the average positive activations and the average negative activations and you subtract those off. And so you might hope that the thing you’re left with is just the truth value.

It turns out that in practice it’s not at all clear that when you normalize in this way, you’re left with only the truth values. And one of the experiments we have in our paper is that if you introduce distractors, so for example, you put a nonsense word like ‘banana’ at the end of half of the reviews, and you put a different nonsense word like ‘shed’ at the end of the other half of reviews. Now you have this weird other property which is “is your statement banana and positive charge? Or is your statement banana and negative charge?” And this is obviously not what you would hope to cluster by, but it turns out that this is just way more salient than does your review have positive sentiment and did you append a positive charge, which is the thing you actually wanted to cluster by. So this is an important point that I wanted to make: that this procedure of normalizing is… it’s actually quite unclear whether you’re able to achieve the thing you wanted.
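
The normalization step being discussed can be sketched like this, in plain Python with made-up two-dimensional activations. Each charge group is standardized separately, which removes the constant offset that only encodes which string was appended. The data and the `normalize` helper are illustrative assumptions, not the paper's code.

```python
import statistics


def normalize(rows):
    """Subtract the per-dimension mean and divide by the per-dimension std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 if a dimension has zero spread, to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Made-up activations: dimension 0 carries a large offset that only encodes
# which charge string was appended; dimension 1 is a smaller signal.
pos_acts = [[5.2, 0.3], [4.9, -0.4], [5.1, 0.6], [4.8, -0.5]]
neg_acts = [[-4.9, -0.2], [-5.2, 0.5], [-5.0, -0.7], [-4.9, 0.4]]

pos_norm = normalize(pos_acts)
neg_norm = normalize(neg_acts)

# Contrast-pair differences after normalization: the constant charge offset
# no longer shows up in their mean.
diffs = [[p - n for p, n in zip(p_row, n_row)]
         for p_row, n_row in zip(pos_norm, neg_norm)]
```

What this removes is only the average effect of the appended string. Anything that varies from example to example, including features that depend jointly on the charge and something else in the prompt, survives the subtraction.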

Daniel Filan: Sure. So before we talk about the experiments, I’d like to talk about: in your paper, first you have some theorems, then you have some experiments, and I think that’s a good way to proceed. So theorems 1 and 2 of the paper, I read them as basically saying that the CCS objective, it doesn’t really depend on the propositional content of the sentences. So if you think of the sentences as being “are cats mammals? Answer: yes”, and “are cats mammals? Answer: no” or something. One way you could get low CCS loss is to basically be confident that the ‘yes’ or ‘no’ label matches the proposition of whether or not cats are mammals.

I take your propositions 1 and 2 as basically saying you can just have any function from sentences to propositions. And so for instance, maybe this function maps “are cats mammals?” to “is Tony Abbott the current prime minister of Australia?” and grade the yes or no answers based on [whether] they match up with that transformed proposition rather than the original proposition. And that’ll achieve the same CCS loss, and basically the CCS loss doesn’t necessarily have to do with what we think of as the semantic content of the sentence. So this is my interpretation of the results. I’m wondering, do you think that’s fair?

Vikrant Varma: Yeah, I think that’s right. Maybe I want to give a more realistic example of an unintended probe that you might learn that will still give a good CCS loss. But before that, I want to try and give an intuitive explanation of what the theorem is saying. The CCS loss is saying: any probe that you find has to say opposite things on positive and negative charges of any statement - this is the consistency property. And the other property is contrast, where it’s saying: you have to push these two values apart. So you can’t just be uncertain, you can’t just be 50/50 between these two. Now if you have any CCS probe that satisfies this, you could in theory flip the prediction that this probe makes on any arbitrary data point, and you end up with a probe that has exactly the same loss. And so this is showing that in theory, there’s no theoretical reason to think that the probe is learning something that’s actually true versus something that’s arbitrary. I think all of the burden then falls on what is simple to extract, given an actual probe empirically.
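
The invariance described here is easy to see in a toy calculation. The sketch below assumes the same squared-error form of the CCS objective as in the original paper (as best I recall it) and uses made-up probe outputs. "Flipping" a data point is modeled as swapping the probe's two outputs on that contrast pair, which for a consistent probe is the same as reversing its answer.

```python
def ccs_loss(p_pos, p_neg):
    """Per-pair CCS-style loss: consistency term plus confidence term."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence


def total_loss(pairs):
    """Average loss over a list of (p_pos, p_neg) probe outputs."""
    return sum(ccs_loss(p_pos, p_neg) for p_pos, p_neg in pairs) / len(pairs)


# Made-up probe outputs for four contrast pairs.
original = [(0.92, 0.07), (0.15, 0.88), (0.60, 0.35), (0.99, 0.02)]

# Reverse the probe's answer on an arbitrary subset of data points.
flip_these = {1, 3}
flipped = [
    (p_neg, p_pos) if i in flip_these else (p_pos, p_neg)
    for i, (p_pos, p_neg) in enumerate(original)
]

loss_original = total_loss(original)
loss_flipped = total_loss(flipped)  # equal up to floating-point rounding
```

Both terms are symmetric in the two outputs, so the loss cannot tell the two probes apart, even though they disagree on half of the data.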

Daniel Filan: So if I’m trying to defend the theory of the CCS method, I think I would say something like: well, most of what there is to a sentence is its semantic content, right? If I say “cats are mammals” or something, you might think that most of what I’m conveying just is the proposition that cats are mammals. And most of what there is to model about that is, “hey, Daniel said this proposition, the thing he’s saying is ‘cats are mammals’”. And maybe the neural network is representing that proposition in its head somehow, and maybe it’s keeping track of “is that proposition true or false?” because that’s relevant. Because if I’m wrong about cats or mammals, then I might be about to say a bunch of more false stuff. But if I’m right about it, then I might be about to say a bunch of more correct stuff. What do you make of that simple case that we should expect to see the thing CCS wants us to see?

Vikrant Varma: Yeah, that’s great. So now we’re coming to empirically what is simple to extract from a model. I agree that in many cases with simple statements, you might hope that the thing that’s most salient, as in the direction that is highest magnitude inside activation space, is going to be just whether the model thinks the thing that you just said is true or false. (This is even assuming that the model has [such] a thing as “ground truth beliefs”, but let’s make that assumption.) Now, it gets pretty complicated once you start thinking about models that are also modeling other characters or other agents. And any large language model that is trained on the internet just has pretty good models of all sorts of characters.

And so if you’re making a statement in a context where a certain type of person might have made that statement: for example, you say some statement that (let’s say) Republicans would endorse, but Democrats would not. Implicitly, the model might be updating towards the kinds of contexts in which that statement would be made, and what kinds of things would follow in the future. And so if you now make a different statement that is (let’s say) factually false, but that Republicans would endorse as true, it’s totally unclear whether the truth value of the statement should be more salient, or whether the Republican belief about that statement should be more salient. That’s one example.

I think this gets more complicated when you have adversaries who are deliberately trying to produce a situation where they’re tricking someone else. And so now the neural network is really modeling very explicit beliefs and adversarial beliefs between these different agents. And if you are simply looking for consistency of beliefs, it feels pretty unclear to me, especially as models get more powerful, that you’re going to be able to easily extract what the model thinks is the ground truth.

Daniel Filan: So you have an adversary… Sorry, what was the adversary doing?

Vikrant Varma: Okay, so maybe you don’t even need to go all the way to an adversary. I think we could just talk about the Republican example here, where you’re making a politically-charged statement that (for the sake of this example) has a factual ground truth, but that Democrats and Republicans disagree on. Now there are two beliefs that would occur in the model’s activations as it’s trying to model the situation. One is the factual ground truth, and the other is the Republican’s belief or the Democrat’s belief about this statement. Both of these things are going to satisfy the consistency property that we named. We have the same problem as models being sycophantic, where the model might know what’s actually true, but is in a context where for whatever reason, modeling the user or modeling some agent and what it would say is more important.

Understanding the banana/shed mystery

Daniel Filan: To me, this points towards two challenges to the CCS objective. So the first is something like the sentences might not map onto the propositions we think of, right? So you have this experiment where you take the movie reviews and also you append the word “banana” or “shed” and then you append it with “sentiment is positive” and “sentiment is negative”. And sometimes CCS is basically checking if the positive/negative label is matching whether it’s “banana” or whether it’s “shed”, rather than the content of the review. So that’s a case where it seems like what’s going wrong is the propositional content that’s being attached to “positive” or “negative” is not what we thought it was.

And then what seems to me to be a different kind of problem is: the probe is picking up on the right propositional content. There’s some politically-charged statement, and the probe is really picking up someone’s beliefs about that politically-charged statement, but it’s not picking up the model’s beliefs about that statement, it’s picking up one of the characters’ beliefs about that statement. Does that division feel right to you?

Vikrant Varma: I think I wouldn’t draw a very strong division between those two cases. So the banana/shed example is just designed to show that you don’t need very complicated… how should I put this? Models can be trying to entangle different concepts in surprising and unexpected ways. So when you’re appending these nonsense words, I’m not sure what computation is going on inside the model and how it’s trying to predict the next token, but whatever it is, it’s somehow entangling the fact that you have “banana” and positive charge, and “banana” and negative charge. I think that these kinds of weird entanglements are not going to go away as you get more powerful models.

And in particular, there will be entanglements that are actually valuable for predicting what’s going on in the world and having an accurate picture, that are not going to look like the beliefs of a specific character for whatever reason. They’re just going to be something alien. In the case of “banana” and “shed”, I’m not going to say that this is some galaxy-brain scheme by the model to predict the next token. This is just something weird and it’s breaking because we put some nonsense in there. But I think in my mind the difference is more like a spectrum; these are not two very different categories.

Daniel Filan: So are you thinking the spectrum is: there are weird entanglements between the final charge at the end of the thing and stuff about the content, and one of the entanglements can be “does the charge match with the actual propositional content of the thing?”, one of the entanglements can be “does the charge match with what some character believes about the thing?” and one of the entanglements can be “does it match with whether or not some word is present in the thing?”

Vikrant Varma: That’s right.

Daniel Filan: Okay. So digging in on this “banana/shed” example: for CCS to fail here, my understanding is it has to be the case that the model basically has some linear representation of the XOR of “the thing says the review is positive”, and “the review ends in the word banana”. So there’s one thing if it ends in “banana” and also the review is positive, or if it ends in “shed” and it says the review is negative, and it’s the other thing if it ends in “shed” and it says the review is positive, or it ends in “banana” and it says the review is negative. So what’s going on there? Do you know? It seems weird that this kind of XOR representation would exist, and we know it can’t be part of the probe because linear functions can’t produce XORs, so it must be a thing about the model’s activations, but what’s up with that? Do you know?

Vikrant Varma: Yeah, that’s a great question. There was a whole thread about this on our post on LessWrong, and I think Sam Marks looked into it in some detail. Rohin Shah, one of my co-authors, commented on that thread saying that this is not as surprising and I think I agree with him. I think it’s less confusing when you think about it in terms of entanglements than in terms of XORs. So it is the case that you’re able to back out XORs if the model is doing weird entangled things, but let’s think about the case where there’s no distractors at all, right?

So even in that situation, the model is basically doing “is true XOR has ‘true’”. You might ask, “Well, that’s a weird thing. Why is it doing that?” It’s more intuitive to think about it as: the model saw some statement and then it saw a claim that the statement is false. And it started trying to do computation that involves both of these things. And I think if you think about “banana/shed” in the same terms, it saw “banana” and saw “this statement is false”, and it started doing some computation that depended on the fact that “banana” was there somehow, then you’re not going to be able to remove this information by subtracting off the positive charge direction.

Daniel Filan: Okay. So is the claim something like… So basically the model’s activations at the end, they’re this complicated function that takes all the previous tokens, and puts them into this high dimensional vector space (that vector space being the vector space of activations). It sounds like what you’re saying is: just generic functions that depend both on “does it contain ‘banana’ or ‘shed’?” and “does it say the review is positive or negative?”, just generically those are going to include this XOR type thing, or somehow be entangled in a way that you could linearly separate it based on that?

Vikrant Varma: Yes, that’s right. In particular, these functions should not be things that are of the form “add banana” and “add shed”. They have to be things that do computation that are not linearly separable like that.
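
A toy version of this argument, in plain Python: suppose the activations contain, besides purely additive "charge" and "distractor" features, one coordinate that depends on both jointly. Everything here is a hypothetical stand-in (the three-dimensional `toy_activation`, the ±1 encoding, the product as the joint computation); it is not a claim about what Chinchilla actually computes. With ±1 encoding, the XOR of the two variables is just their product, so a linear probe on that coordinate reads it off directly, and subtracting each charge group's mean does not remove it.

```python
from itertools import product


def toy_activation(charge, distractor):
    """Hypothetical 3-dim activation for charge, distractor in {-1, +1}.

    Coordinates 0 and 1 are additive features; coordinate 2 is an interaction
    term, standing in for any computation that uses both inputs jointly.
    """
    return [float(charge), float(distractor), float(charge * distractor)]


def mean(rows):
    """Per-dimension mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


inputs = list(product([-1, 1], repeat=2))  # all (charge, distractor) combinations
pos = [toy_activation(c, d) for c, d in inputs if c == +1]
neg = [toy_activation(c, d) for c, d in inputs if c == -1]

# Subtract each charge group's mean, as in the normalization step.
pos_mean, neg_mean = mean(pos), mean(neg)
pos_centered = [[v - m for v, m in zip(row, pos_mean)] for row in pos]
neg_centered = [[v - m for v, m in zip(row, neg_mean)] for row in neg]

# Contrast-pair differences (same distractor, opposite charge).
diffs = [[p - n for p, n in zip(p_row, n_row)]
         for p_row, n_row in zip(pos_centered, neg_centered)]

# A linear probe reading only the interaction coordinate recovers the XOR.
xor_probe = [0.0, 0.0, 1.0]
readout = {(c, d): sum(w * a for w, a in zip(xor_probe, toy_activation(c, d)))
           for c, d in inputs}
```

In this toy setup the additive charge feature is removed by the mean subtraction, and the additive distractor feature cancels within each pair, so the contrast differences point entirely along the interaction coordinate, with a sign set by the distractor. If the joint feature were absent, the differences would be zero and there would be nothing for the probe to latch onto.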

Daniel Filan: Okay. Is that true? Has someone checked that? That feels like… Maybe one way you could prove this is to say: well, if we model the charge and the banana/shed as binary variables, there are only so many functions of two binary variables, and if you’ve got a bunch of activations, you’re going to cover all of the functions. Does that sound like a proof sketch?

Vikrant Varma: I’m not sure how many functions of these two variables you should expect the model to be computing. I feel like this depends a bit on what other variables are in the context, because you can imagine that if there’s more than two, if there’s five or six, then these two will co-appear in some of them and not in others. But a thing you can definitely do is you can back out the XOR of these two variables by just linearly probing the model’s activations. I think this effect happens because you’re unable to remove the interaction between these two by just subtracting off the charge.

I would predict that this would also be true in other contexts. I think models are probably computing joint functions of variables in many situations, and the saliency of these will probably depend a bit on the exact context and how many variables there are, and eventually the model will run out of capacity to do all of the possible computations.

Daniel Filan: Sure. Based on the explanation you’ve given, it seems like you would maybe predict that you could get a “banana/shed” probe from a randomly initialized network, if it’s just that the network’s doing a bunch of computation and generically computation tends to entangle things. I’m wondering if you’ve checked that.

Vikrant Varma: Yeah, I think that’s a good experiment to try. That seems plausible. We haven’t checked it, no.
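The claim that generic joint computation makes the XOR of two variables linearly decodable can be illustrated with a toy sketch. This is not the experiment discussed above and says nothing about a real language model, randomly initialized or otherwise: it uses four synthetic points, and the hidden width, the single random ReLU layer, and the helper name `max_linear_fit_error` are all assumptions made for illustration.

```python
import numpy as np

def max_linear_fit_error(features, labels):
    """Worst-case error of the best least-squares linear probe (with bias)."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    # rcond is set explicitly so near-zero singular values are treated as zero.
    weights, *_ = np.linalg.lstsq(design, labels, rcond=1e-10)
    return float(np.max(np.abs(design @ weights - labels)))

rng = np.random.default_rng(0)

# All four settings of two binary variables (say, "charge" and "banana/shed").
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor_labels = np.array([0, 1, 1, 0], dtype=float)

hidden_dim = 64  # hypothetical width, chosen only for illustration

# Case 1: purely additive features, i.e. "add banana" plus "add charge".
additive = inputs @ rng.normal(size=(2, hidden_dim))

# Case 2: one random, untrained ReLU layer that mixes the two variables.
mix_weights = rng.normal(size=(2, hidden_dim))
mix_bias = rng.normal(size=hidden_dim)
mixed = np.maximum(inputs @ mix_weights + mix_bias, 0.0)

err_additive = max_linear_fit_error(additive, xor_labels)
err_mixed = max_linear_fit_error(mixed, xor_labels)
print(f"additive features: max probe error {err_additive:.3f}")
print(f"mixed features:    max probe error {err_mixed:.3f}")
```

With purely additive features, the best linear probe can do no better than predicting 0.5 for every input, because XOR is not an affine function of the two variables. Once a nonlinearity mixes them, a linear probe can fit XOR essentially exactly on these four points.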

Daniel Filan: Yeah, fair enough. Sorry, I just have so many questions about this banana/shed thing. There’s still a question to me of: even if you think the model represents it, there’s a question of why it would be so salient, because… Your paper has some really nice figures. Listeners, I recommend you check out the figures. This is the figure for the banana/shed experiment, and you show a principal component analysis of basically an embedding of the activation space into three dimensions. And basically what you show is that it’s very clearly divided on the banana/shed things. That’s one of the most important things the model is representing. What’s up with that? That seems like a really strange thing for the model to care so much about.

Vikrant Varma: So I’ll point out firstly that this is a fairly weak pre-trained model, it’s Chinchilla-[70B]. So this model is not going to ignore random things in its prompt. It’s going to “break” the model. That’s one thing that gives you a clue about why this might be salient for the model.

Daniel Filan: So it would be less salient if there were words you expected: the model could just deal with it. But the fact that it was a really unexpected word in some ways, that means you can’t compress it. You’ve got to think about that in order to figure out what’s happening next.

Vikrant Varma: That’s right, yeah. I just expect that there’s text on the internet that looks normal and then suddenly has a random word in it, and you have weird things like, after that point, it just repeats “banana” over and over, or weird things like that. When you just have a pre-trained model, you haven’t suppressed those pathologies, and so the model just starts thinking about bananas at that point instead of thinking about the review.

Daniel Filan: And does that mean you would expect that to not be present in models that have been RLHF’d or instruction fine-tuned or something?

Vikrant Varma: Yeah, I expect it to be harder to distract models this way with instruction fine-tuned models.

Daniel Filan: Okay, cool. Okay, I have two more questions about that. One thing I’m curious about is: it seems like if I look at the plot of what the CCS method is doing when it’s being trained on this banana/shed dataset, it seems like sometimes it’s at roughly 50/50 if you grade the accuracy based on just the banana/shed and not the actual review positivity. And sometimes it’s 85-90%.

Vikrant Varma: This is across different seeds?

Daniel Filan: Across different seeds, I think. And then if you’re grading it based on whether the review is actually positive or not, sometimes CCS is at 50/50 roughly, sometimes it’s at 85-90%, but it seems like… So firstly, I’m surprised that it can’t quite make its mind up across different seeds. Sometimes it’ll do one, sometimes it’ll do the other. And it seems like in both cases, most of the time it’s at 50/50, and only some of the time it’s 100%. So it seems like sometimes it’s doing a thing that is neither checking if the review is positive nor checking if the review contains “banana” or “shed”. So firstly, does that sound right to you? And secondly, do you have a sense of what’s going on there? Why is it so inconsistent, and why does it sometimes seemingly do a third thing?

Vikrant Varma: Yeah, so I think this is pointing at the brittleness of the CCS method. So someone has an excellent writeup on this. I’m forgetting whether it’s Fabien Roger or Scott Emmons.

Daniel Filan: I think Scott’s doesn’t focus so much on the brittleness, so it might be Fabien.

Vikrant Varma: Okay. But in any case, this person did this experiment where they subtracted off… They found the perfect truth direction that separates true and false statements just using logistic regression. So, using a supervised signal. And then, once you subtract that off, it turns out that there is no other direction, basically, that is able to separate the truth. So logistic regression, and therefore CCS as well, just gets random accuracy.

You might hope that CCS, when it works, is finding this perfect direction because there’s only one. But in fact, the CCS probes learned are not close, as in they don’t have high cosine similarity with this direction. So, what’s going on there? I think this is pointing at a kind of optimization difficulty with the CCS method where it’s able to find directions that separate the clusters and get low CCS loss, but are not close to the truth direction. And you would expect this to happen based on the evidence that random probes also classify true and false statements reasonably well in this setup.

So, going back to your original question, I think what’s happening here is that there’s just lots of local minima that achieve good CCS loss. Depending on how you initialize, some of them are close to the truth direction and some of them are not. And if you happen to initialize close to the banana/shed, the distractor direction, then you end up getting a probe like that.
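To make the “many probes get low CCS loss” point concrete, here is a minimal sketch of the CCS objective (a consistency term plus a confidence term) evaluated on synthetic data. The two-dimensional activations, the `scale` value, and the bias-free linear probe are assumptions for illustration only; the original method also normalizes activations and trains the probe, which this sketch skips.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(theta, pos_acts, neg_acts):
    """CCS objective for a linear probe p(x) = sigmoid(theta . x), no bias term."""
    p_pos = sigmoid(pos_acts @ theta)
    p_neg = sigmoid(neg_acts @ theta)
    # Consistency: p(x+) should equal 1 - p(x-).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate answer p(x+) = p(x-) = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

rng = np.random.default_rng(0)
n_pairs = 200
truth = rng.choice([-1.0, 1.0], size=n_pairs)       # hypothetical ground truth
distractor = rng.choice([-1.0, 1.0], size=n_pairs)  # hypothetical distractor feature

# Synthetic activations: axis 0 encodes truth, axis 1 encodes the distractor,
# and both features flip sign between the two halves of a contrast pair.
scale = 5.0
pos_acts = scale * np.stack([truth, distractor], axis=1)
neg_acts = -pos_acts

loss_truth = ccs_loss(np.array([1.0, 0.0]), pos_acts, neg_acts)
loss_distractor = ccs_loss(np.array([0.0, 1.0]), pos_acts, neg_acts)
loss_zero = ccs_loss(np.zeros(2), pos_acts, neg_acts)
print(loss_truth, loss_distractor, loss_zero)
```

In this toy setup the truth probe and the distractor probe get the same low loss, while the uninformative all-zeros probe pays the full confidence penalty. The loss alone gives no reason to prefer the truth direction, so which one you land on depends on initialization.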

Daniel Filan: Okay. My interpretation was, on the banana/shed example, sometimes CCS is picking up on the truth of the description of the review, sometimes it’s picking up on the banana/shed, and sometimes it’s picking up on other things. Do you know what some of those other things are?

Vikrant Varma: You’re talking about the probes that don’t seem to go for either the distractor or the truth direction?

Daniel Filan: That’s right.

Vikrant Varma: Yeah. I wish I knew what those directions were. In general, it seems very hard to figure out what happens when a CCS probe breaks. And we tried a lot. There’s lots of other experiments that we tried where we were trying to get more interesting failure modes of CCS, and we ended up with these random probes. And then, we looked at examples that the probe was classifying and tried to come up with explanations for what do those clusters mean and it was just kind of hard.

Daniel Filan: Fair enough. You attribute the variance to just optimization difficulties, it sounds like: there being various local minima of the CCS loss. So, the original CCS paper, as you guys note in your appendix, they say that what they’re going to do is they’re going to have basically 10 random seeds, do gradient descent on the CCS objective for each random seed, the seed of the probe parameters, and then they’re going to take the one with the lowest CCS loss and use that.

I take this to basically be their optimization method that’s trying to avoid local minima by starting in 10 places, and hopefully you get a sampling of 10 local minima and you can pick the best one. And basically, it seems like the motivation for that is the thing with the lowest CCS loss is more likely to be the actual truth direction or something. In the banana/shed case, do you happen to know if the probes that scored better on CCS loss were more likely to pick out truth rather than banana/shed?

Vikrant Varma: Yeah. I think the probes that scored lower went for the distractor direction and not the truth direction. This is also visible from the PCA plots where you can see that the distractor direction is more separable.

Daniel Filan: Yeah. I guess maybe one explanation of that is just that it’s easier to tell if a thing ends in banana or shed than it is to tell if something’s positive or negative, especially in the case of… If you think there’s some amount of mislabeling, that could potentially do it.

Vikrant Varma: Yeah.
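The restart procedure described above (train from several random seeds, keep the probe with the lowest CCS loss) can be sketched as follows. The synthetic activations, learning rate, step count, and the function names `ccs_loss_and_grad` and `train_probe` are assumptions for illustration, and the probe is a simplified bias-free linear one trained with plain gradient descent rather than the original paper's exact setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss_and_grad(theta, pos_acts, neg_acts):
    """CCS loss of a bias-free linear probe and its gradient w.r.t. theta."""
    p_pos = sigmoid(pos_acts @ theta)
    p_neg = sigmoid(neg_acts @ theta)
    gap = p_pos + p_neg - 1.0                   # consistency residual
    pos_is_min = p_pos <= p_neg
    p_min = np.where(pos_is_min, p_pos, p_neg)  # confidence uses the smaller output
    loss = float(np.mean(gap ** 2 + p_min ** 2))
    # Chain rule with d sigmoid(z)/dz = p * (1 - p).
    coef_pos = (2.0 * gap + np.where(pos_is_min, 2.0 * p_min, 0.0)) * p_pos * (1.0 - p_pos)
    coef_neg = (2.0 * gap + np.where(pos_is_min, 0.0, 2.0 * p_min)) * p_neg * (1.0 - p_neg)
    grad = (coef_pos @ pos_acts + coef_neg @ neg_acts) / len(pos_acts)
    return loss, grad

def train_probe(pos_acts, neg_acts, seed, steps=200, lr=0.1):
    """Gradient descent on the CCS loss from a seed-dependent random start."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=pos_acts.shape[1])
    for _ in range(steps):
        _, grad = ccs_loss_and_grad(theta, pos_acts, neg_acts)
        theta = theta - lr * grad
    final_loss, _ = ccs_loss_and_grad(theta, pos_acts, neg_acts)
    return theta, final_loss

# Hypothetical stand-in for real contrast-pair activations.
data_rng = np.random.default_rng(0)
n_pairs, dim = 200, 8
pos_acts = data_rng.normal(size=(n_pairs, dim))
neg_acts = -pos_acts + 0.1 * data_rng.normal(size=(n_pairs, dim))

# Ten restarts; keep the probe whose final CCS loss is lowest.
runs = [train_probe(pos_acts, neg_acts, seed) for seed in range(10)]
losses = [loss for _, loss in runs]
best_theta, best_loss = runs[int(np.argmin(losses))]
print(f"losses per seed: {np.round(losses, 4)}")
print(f"selected loss:   {best_loss:.4f}")
```

Note that selecting on CCS loss only helps if low loss tracks truth. As discussed above, in the banana/shed case the lower-loss probes were the ones that picked up the distractor.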

Daniel Filan: Gotcha. So, that’s an example of one way that CCS can go wrong, with the banana/shed thing. You also have examples where you include in the prompt information about what someone named Alice thinks about this thing, and you describe Alice as an expert, or sometimes you say Alice is anti-capitalist, and even when a thing is about a company, she’s not going to say that it’s about a company.

In the case of Alice the expert, it seems like the probes learn to agree with Alice more than they learn about the ground truth of the thing.

Vikrant Varma: Yeah. I think there’s two separate experiments, if I remember correctly. One is where you modify the prompt to demonstrate more expertise. So, you have a default prompt, a professor prompt, and a literal prompt. And then, there’s a separate experiment where you have an anti-capitalist character called Alice.

Daniel Filan: I mean a third one where at the start you say “Alice is an expert in movie reviews” and you give the review and then you say, “Alice thinks the sentiment of this review is positive.” But what Alice says is actually just randomly assigned. And in that case, the probes tend to pick up on agreement with Alice more than agreement with the ground truth. That seems vaguely concerning. It almost seems like a human failure mode. But I’m wondering, do you know how much of it was due to the fact that Alice was described as an expert who knows about stuff?

Vikrant Varma: Yeah. I think, in general, an issue with CCS is that it’s unclear whether CCS is picking up something about the model’s knowledge, or whether the thing that’s salient is whatever the model is doing to compute the next token. And in a lot of our experiments, the way we’ve set it up is to nudge the model towards completing in a way that’s not factually true. For example, in the “Alice is an expert in movie reviews” [case], the prompt is set up in a way that nudges the model to complete in Alice’s voice. And the whole promise of CCS is that even when the outputs are misleading, you should be able to recover the truth.

I think even from the original CCS paper, you can see that that’s not true because you have to be able to beat zero-shot accuracy with quite a large margin to be confident about that. This is one maybe limitation of being able to say things about CCS, which is that you’re always unsure whether CCS is… Even the thing that you’re showing, are you really showing that the model is computing Alice’s belief? Or are you just showing that your probe is learning what the next token prediction is going to be?

Future CCS-like approaches

Daniel Filan: Sure. Yeah. You have a few more experiments along these lines. I guess I’d like to talk a bit about: I think of your paper as saying there’s a theoretical problem with CCS, which is that there’s a bunch of probes that could potentially get low CCS loss, and there’s a practical problem, which is some probes do get low CCS loss. So, if I think about the CCS research paradigm, I think of it as… When the CCS paper came out, I was pretty into it. I think there were a lot of people who were pretty into it. Actually, part of what inspired that Scott Emmons post about it is I was trying to sell him on CCS and I was like, “No, Scott, you don’t understand. This is the best stuff since sliced bread.” And I don’t know, I annoyed him enough into writing that post. So, I’ll consider that a victory for my annoying-ness.

But I think the reason that I cared about it wasn’t that I felt like literal CCS method would work, but it was because I had some sense of just the general strategy, of coming up with a bunch of consistency criteria and coming up with a probe that cares about those and maybe that is going to isolate belief. So, if we did that, it seems like it would deal with stuff like the banana/shed example. If you cared about more relations between statements, not just negation consistency, but if you believe A, and A implies B, then maybe you should believe B, just layer on some constraints there. You might think that by doing this we’re going to get closer to ground truth. I’m wondering, beyond just CCS specifically, what do you think about this general strategy of using consistency constraints?

Vikrant Varma: Yeah. That’s a great question. I think my take on this is informed a lot by a comment by Paul Christiano on one of the CCS review posts. I basically share your optimism about being able to make empirical progress on figuring out what a model is actually doing or what it’s actually thinking about a situation by using a combination of consistency criteria, and even just supervised labels in situations where you know what the ground truth is. And being able to get reasonably good probes - maybe they don’t generalize very well, but every time they don’t generalize or you catch one of these failures, you spend a bunch of effort getting better labels in that situation. And so, you’re mostly not in a regime where you’re trying to generalize very hard.

And I think this kind of approach will probably work pretty well up to some point. I really liked Paul’s point that if you’re thinking about a model that is saying things in human natural language and it’s computing really alien concepts that are required for superhuman performance, then you shouldn’t necessarily expect that this is linearly extractable or extractable in a simple way from the activations. This might be quite a complicated function of the activations.

Daniel Filan: Why not?

Vikrant Varma: I guess one way to think about it is that the natural language explanation for a very complicated concept is not going to be short. So, I think the hypothesis that a lot of these concepts are encoded linearly and are linearly extractable… In my mind, it feels pretty unclear whether that will continue to hold.

Daniel Filan: Okay. So just because “why does it have to be linear?” There are all sorts of ways things can be encoded in neural nets.

Vikrant Varma: Yeah. That’s right. And in particular, one reason you might expect things to be linear is because you want to be able to decode them into natural language tokens. But if there is no short decoding into natural language tokens for a concept that the model is using, then it is not important for the computation to be easily decodable into natural language.

Daniel Filan: Right. So, if the model’s encoding whether a thing is actually true according to the model, it’s not like that determines the next thing the person will say, right?

Vikrant Varma: Right. It’s a concept that humans are not going to talk about, it’s never going to appear in human natural language. There’s no reason to decode this into the next token.

Daniel Filan: This is talking about: if the truth of whatever the humans are talking about, it actually depends on the successor of a theory of relativity that humans have never thought about, it’s just not really going to determine the next thing that humans are going to say.

Vikrant Varma: Yeah, that’s an example.

Daniel Filan: Yeah. I take this critique to be firstly a critique of linear probes for this task. I guess you can form a dilemma where either you’re using linear probes, and then you don’t necessarily believe that the thing is linearly extractable, or you’re using complicated non-linear probes, and then maybe the stuff you’re getting out is stuff about your probe rather than stuff about the underlying model. But then, I guess there’s a separate question of, are there consistency constraints that could work? Putting aside the probe… I don’t know, maybe we shouldn’t put aside the probe thing, but putting aside the probe thing, is there some sort of consistency check we could do to say, is this property we found in the model the model’s actual beliefs, or is it not?

Vikrant Varma: Yeah. That’s a good question. I think the more powerful your model, the more entities it’s “simulating” whose beliefs end up mattering for next token prediction that the model is doing. And if these entities that the model is thinking about, if their beliefs also satisfy all of the same consistency criteria that you’re using, then you just have a very fundamental indistinguishability problem. And in particular, I think the more powerful the model gets, the more pessimistic I am that we are able to come up with consistency checks that correctly distinguish between true beliefs and beliefs of simulated entities.

Daniel Filan: One point you make in a LessWrong post related to your paper, is that if you’re a smart agent and you’re thinking about a proposition, one thing you might sometimes think about is, what’s the world in which I’m wrong about this proposition? So, you’re thinking about a belief structure where the truth value of this proposition is the exact opposite of the thing you actually believe.

Vikrant Varma: That’s right.

Daniel Filan: Which seems like a pretty nice impossibility proof, if you actually are representing both of these things.

Vikrant Varma: Yeah. This is also what I meant by adversaries. You might be in a world where an adversary is trying to make you believe the opposite of what’s true. And now, this adversary is modeling the world in which you would have to believe this and all the evidence that would have to be true.

Daniel Filan: Right. And the language model is modeling what the adversary wants you to think or something.

Vikrant Varma: That’s right. So, that’s another example. But I think maybe there’s a different kind of hope. I think “the truth is represented more simply than agents’ beliefs” might be a hypothesis that turns out to be somewhat true up to some point. But I think this is a different kind of criterion from consistency. So, now we’re talking about, mechanistically, how are these concepts represented inside the model? And for example, if you had a way to say, “The model’s true beliefs are always linearly represented, but beliefs about other agents, they’re not linearly represented; and therefore, we can be confident that linear probes are recovering truth.” That would be an example of a mechanistic structure that you might hope to use. I think we don’t currently have any candidates for that, but it’s a good area to look at.

Daniel Filan: Yeah. Are there things analogous to that that we’ve learned? Basically, I’m trying to wonder: if I wanted to prove or disprove this, what kind of thing would I do? And the one thing I can think of is there’s some research about: do convolutional neural networks learn texture or color first? And it turns out there’s a relatively consistent answer. I’m wondering if you can think of any analogous things about neural networks that we’ve learned that we can maybe…

Vikrant Varma: Yeah. There’s quite a few options presented in the eliciting latent knowledge report. So for example, one of the things you might hope is that if the model is simulating other entities, then maybe it’s trying to figure out what’s true in the world before it does that. And so, you might expect earlier belief-like things to be true, and later belief-like things to be agents’ beliefs.

Or similarly, you might expect that if you try to look for things under a speed prior, as in beliefs that are being computed using shorter circuits, then maybe this is more likely to give you what’s actually true, because it takes longer circuits to compute that plus what some agent is going to be thinking. So, that’s a structural property that you could look for.

Daniel Filan: Yeah. I guess it goes back to the difficulty of eliciting latent knowledge. In some ways, I guess the difficulty is: if you look at standard Bayesian rational agent theory, the way that you can tell that some structure is an agent’s beliefs is that it determines how the agent bets and what the agent does. It tries to do well according to its own beliefs. But if you’re in a situation where you’re worried that a model is trying to deceive you, you can’t give it Scooby snacks or whatever for saying things that… You can’t hope to get it to bet on its true beliefs, if you’re going to allow it access to the world based on whether you think its true beliefs are good, or stuff like that. I don’t know, it seems tricky.

CCS as principal component analysis

Daniel Filan: I have some other minor questions about the paper. Firstly, we mentioned this post by Scott Emmons, and one of the things he talks about is principal component analysis, this method where you find the maximum-variance direction and just base your guess about the model’s beliefs on where the thing lies along this maximum-variance direction. He says that this is actually similar to CCS in that you’re encoding something involving confidence and also something involving coherence. And that might explain why PCA and CCS are so similar. I’m wondering what do you think about that take?

Vikrant Varma: Is a summary of this take that most of the work in CCS is being done by the contrast pair construction rather than by the consistency loss?

Daniel Filan: It’s partly that, and also partly if you decompose “what’s the variance of X minus Y”, you get expectation of X squared plus expectation of Y squared minus twice the [expectation] of XY, and then some normalization terms: expectation of X all squared, expectation of Y all squared, and then another covariance term. Basically, he’s saying like, “Look, if you think of a vector that maximizes the outer product of that vector, the variance and itself, you’re maximizing the outer product of that vector with expectation X squared plus expectation Y squared.” Which ends up being the confidence of classification according to that vector.
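Written out, the decomposition described here is the standard identity for the variance of a difference, which holds for any two random variables X and Y with finite second moments. Reading X and Y as a probe direction's outputs on the two halves of a contrast pair is an interpretive assumption, not something stated in the conversation.

```latex
\begin{align*}
\operatorname{Var}(X - Y)
  &= \mathbb{E}\big[(X - Y)^2\big] - \big(\mathbb{E}[X - Y]\big)^2 \\
  &= \underbrace{\mathbb{E}[X^2] + \mathbb{E}[Y^2]}_{\text{second-moment terms}}
     \;-\; 2\,\mathbb{E}[XY]
     \;-\; \underbrace{\Big(\big(\mathbb{E}[X]\big)^2 + \big(\mathbb{E}[Y]\big)^2 - 2\,\mathbb{E}[X]\,\mathbb{E}[Y]\Big)}_{\text{normalization terms}}
\end{align*}
```

Grouping the last two pieces together gives the familiar form Var(X) + Var(Y) - 2 Cov(X, Y).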

And then, you’re subtracting off the covariance, which is basically saying, is the vector giving high probability for both yes and no? Or is the vector giving low probability for both yes and no? And so, basically, the take is just because of the mathematical properties of variance and what PCA is doing, you end up doing something kind of similar to CCS. I’m wondering if you have thoughts on this take?

Vikrant Varma: Yeah, that’s interesting. I don’t remember reading about this. It sounds pretty plausible to me. I guess one way I’d think about it intuitively is that if you’re trying to find a classifier on these difference vectors, contrast pair difference vectors, then for example, you want to be maximizing the margin between these two. And this is a bit like trying to find a high contrast thing. So overall, it feels plausible to me.
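As a rough illustration of the PCA baseline being discussed, the sketch below takes the top principal component of synthetic contrast-pair difference vectors and classifies by the sign of the projection. The data, dimensions, and signal strength are assumptions chosen so that truth happens to be the highest-variance direction; nothing here comes from the paper's actual activations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim = 500, 16
labels = rng.choice([-1.0, 1.0], size=n_pairs)  # hypothetical ground truth

# Synthetic difference vectors (x+ minus x-): a strong signal along one
# "truth" axis plus unit Gaussian noise in every dimension.
truth_dir = np.zeros(dim)
truth_dir[0] = 1.0
diffs = 4.0 * labels[:, None] * truth_dir + rng.normal(size=(n_pairs, dim))

# PCA via SVD of the centered data; rows of vt are principal directions,
# ordered by decreasing singular value.
centered = diffs - diffs.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
top_pc = vt[0]

# Classify by which side of the top component each difference vector falls on.
predictions = np.sign(centered @ top_pc)
raw_accuracy = float(np.mean(predictions == labels))
# The sign of a principal component is arbitrary, so score up to a global flip.
accuracy = max(raw_accuracy, 1.0 - raw_accuracy)
print(f"accuracy up to sign: {accuracy:.3f}")
```

The method simply follows whichever direction has the most variance. If a distractor feature varied more than truth across the difference vectors, the same code would classify by the distractor instead.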

Explaining grokking through circuit efficiency

Daniel Filan: Gotcha. Okay. So, if it’s all right with you, I’d like to move on to the paper ‘Explaining grokking through circuit efficiency’.

Vikrant Varma: Perfect. Let’s do it.

Daniel Filan: Sure. This is a paper you wrote with Rohin Shah, Zachary Kenton, János Kramár, and Ramana Kumar. You’re explaining grokking. For people who are unaware or need a refresher, what is grokking?

Vikrant Varma: So in 2021, Alethea Power and other people at OpenAI noticed this phenomenon where when you train a small neural network on an algorithmic task, initially, the network overfit, so it got very low training loss and high test loss. And then, they continued training it for about 10 times longer and found that it suddenly generalized. So, although training loss stayed low and about the same, test loss suddenly fell. And they dubbed this phenomenon “grokking”, which I think comes from science fiction and means “suddenly understanding”.

Why research science of deep learning?

Daniel Filan: Okay, cool. And basically, you want to explain grokking. I guess a background question I have is, it feels like in the field of AI alignment, or people worried about AI taking over the world, there’s a sense that it’s pretty important to figure out grokking and why it’s happening. And it’s not so obvious to me why it should be considered so important, given that this is a thing that happens in some toy settings, but to my knowledge, it’s not a thing that we’ve observed on training runs that people actually care about. So I guess it’s a two-part question: firstly, just why do you care about it? Which could be for any number of reasons. And secondly, what do you think its relationship is to AI safety and AI alignment?

Vikrant Varma: I think back in 2021, there were two reasons you could have cared about this as an alignment researcher. One is on the surface it looks a lot like a network was behaving normally, and then suddenly it understood something and started generalizing very differently. The other reason is this is a really confusing phenomenon in deep learning, and it sure would be good if we understood deep learning better. And so, we should investigate confusing phenomena like grokking, [even] ignoring the superficial similarity to a scenario that you might be worried about.

Daniel Filan: Okay. Where the superficial scenario is something like: the neural network plays nice, and then suddenly realizes that it should take over the world, or something?

Vikrant Varma: That’s right. And I think I can talk a bit more about the second reason or the overall science of deep learning agenda, if you like. Is that a useful thing to go into now?

Daniel Filan: I guess maybe why are you interested in grokking?

Vikrant Varma: For me, grokking was one of those really confusing phenomena in deep learning, like deep double descent or over-parameterized networks generalizing well, that held out some hope that if you understand this phenomenon, maybe you’ll understand something pretty deep about how we expect real neural networks to generalize and what kinds of programs we expect deep learning to find. It was a puzzling phenomenon that somebody should investigate, and we had some ideas for how to investigate it.

Daniel Filan: Gotcha. I’m wondering if you think just, in general, AI alignment people should spend more effort or resources on science of deep learning issues. Because there’s a whole bunch of them, and not all of them have as much presence from the AI alignment community.

Vikrant Varma: I think it’s an interesting question. I want to decompose this into how dual-use is investigating science of deep learning, and do we expect to make progress and find alignment-relevant things by doing it? And I’m mostly going to ignore the first question right now, but we can come back to it later if you’re interested. I think for the second question, it feels pretty plausible to me that investigating science of deep learning is important and tractable and neglected. I should say that a lot of my opinions here have really come from talking to Rohin Shah about this, who is really the person who’s, I think, been trying to push for this.

Why do I think that? I think it’s important because: similar to mechanistic interpretability, the core hope for science of deep learning would be that you’re able to find some information about what kinds of programs your training process is going to learn, and so therefore, how it will generalize in a new situation. And I think a difference from mech[anistic] interp[retability] is… This is maybe a simplified distinction, but one way you could draw the line is that mech. interp. is more focused on reverse-engineering a particular network and being able to point at individual circuits and say, “Here’s how the network is doing this thing.”

Whereas, I think science of deep learning is trying to say, “Okay. What kinds of things can we learn in general about a training process like this with a dataset like this? What are the inductive biases? What does the distribution of programs look like?” And probably science of deep learning and mech. interp. have quite a lot of overlap, and techniques from each will help the other. That’s a little bit about the importance bit.

I think it’s both tractable and neglected in the sense that we just have all of these confusing phenomena. And for the most part, I feel like industry incentives are not that aligned with trying to investigate these phenomena really carefully and doing a very careful natural sciences exploration of these phenomena. So in particular, iterating between trying to come up with models or theories for what’s happening and then making empirical predictions with those theories, and then trying to test that, which is the kind of thing we tried to do in this paper.

Daniel Filan: Okay. Why do you think industry incentives aren’t aligned?

Vikrant Varma: I think it’s quite a high risk, high reward sort of endeavor. And in the period where you’re not making progress on making loss go down in a large model, it’s maybe harder to justify putting a lot of effort into that. On the other hand, if your motivation is “If we understood this thing, it could be a really big deal for safety”, I think making the case as an individual is easier. Even from a capabilities perspective, I think the incentives to me seem stronger than what people seem to be acting on.

Daniel Filan: I guess there’s something puzzling about why there would be this asymmetry between some sort of corporate perspective and some sort of safety perspective. I take you to be saying that, “Look, there are some insights to be found here, but you won’t necessarily get them tomorrow. It’ll take a while, it’ll be a little bit noisy. And if you’re just looking for steady incremental progress, you won’t do it.” But it’s not obvious to me that safety or alignment people should care more about steady incremental progress than people who just want to maximize the profit of their AI, right?

Vikrant Varma: You mean [“safety people should] care less about that”?

Daniel Filan: Yeah. It’s not obvious to me that there would be any difference.

Vikrant Varma: Right. I think one way you could think about it, from a safety perspective, is multiple uncorrelated bets on ways in which we could get a safer outcome. I think probably a similar thing applies for capabilities except that… And I’m really guessing and out of my depth here, but my guess would be that for whatever reason, it’s harder to actually fund this kind of research, this kind of very exploratory, out-there research, from a capabilities perspective, but I think there is a pretty good safety case to make for it.

Daniel Filan: Yeah, I guess it’s possible that it’s just a thing where it’s hard to … I don’t know, if I’m a big company, right, I want to have some way of turning my dollars into people solving a problem. One model you could have is for things that could be measured in “how far down did the loss go?” It’s maybe just easier to hire people and be like, “Your job is to put more GPUs on the GPU rack” or “your job is to make the model bigger and make sure it still trains well”. Maybe it’s harder to just hire a random person off the street and get them to do science of deep learning. That’s potentially one asymmetry I could think of.

Vikrant Varma: Yeah, I think it’s also just: I genuinely feel like there are way fewer people who could do science of deep learning really well than people who could make the loss go down really well. I don’t think this fundamentally needs to be true, but it just feels true to me today based on the number of people who are actually doing that kind of scientific exploration.

Daniel Filan: Gotcha. When I asked you about the alignment case for science of deep learning, [you said] there’s this question of dual use and then there was this question of what alignment things there might be there, and you said you’d ignore the dual use thing. I want to come back to that. What do you think about: some people say about interpretability or stuff, “well, you’re going to find insights that are useful for alignment, but you’re also going to find insights that are useful for just making models super powerful and super smart, and it’s not clear if this is good on net”.

Vikrant Varma: Yeah. I want to say that I feel a lot of uncertainty here in general, and I think your answers to these questions kind of depend a lot on how you expect AI progress to go and where you expect the overhangs to be and what sort of counterfactual impact you expect. What kinds of things will capabilities people do anyway, for example?

Yeah, so I think to quickly summarize one story that I find plausible, it’s that we’re basically going to try and make progress about as fast as we can towards AGI-level models. Hopefully, if we have enough monitoring and red lines and RSPs in place, if there is indeed danger as I expect, then we will be able to coordinate some sort of slow down or even pause as we get to things that are about human-level.

Then, a story you could have for optimism is that: well, we’re able to use these roughly human-level systems to really make a lot of progress in alignment, because it becomes clear that that’s the main way in which anybody can use these systems safely, or that’s how you construct a strong positive argument for why the system is safe rather than just pointing at an absence of evidence that it’s unsafe, and we’re in that sort of world, and then just a bunch of uncertainty about how long that takes. In the meantime, presumably we’re able to coordinate and prevent random other people who are not part of this agreement from actually racing ahead and building an unsafe AGI.

Under that story, I think, it’s not clear that you get a ton of counterfactual capabilities progress from doing mech. interp. or science of deep learning. It mostly feels to me like we’ll get there even without it and that to the degree that these things are going to matter for capabilities, a few years from now, capabilities people are going to start [doing], maybe not science of deep learning if it’s very long-term and uncertain, but definitely mech. interp.: I expect capabilities people to start using those techniques and trying to adapt them for improving pre-training and so on.

Like I said, I feel pretty uncertain. I am pretty sympathetic to the argument that all of this kind of research like mech. interp. and science of deep learning should basically be done in secret… If you’re concerned about safety and you want to do this research, then you should do it in secret and not publish. Yeah, I feel sympathetic to that.

Summary of the paper’s hypothesis

Daniel Filan: Gotcha. I guess with that background, I’d like to talk about the paper. I take the story of your paper to basically be saying: look, here’s our explanation of grokking. Neural networks… you can think of them as a weighted sum of two things they can be doing. One thing they can be doing is just memorizing the data, and one thing that they can be doing is learning the proper generalizing solution.

The reason you get something like grokking is that it takes a while … Networks are being regularized, according to the norm of their parameters; and the generalizing circuit - the method that generalizes - it can end up being more confident for a given norm of parameter. And so eventually it’s favored, but it takes a while to learn it. Initially you learn to just memorize answers, but then as there’s this pressure to minimize the parameter norm that comes from some form of regularization, you become more and more incentivized to try and figure out the generalizing solution, and the network eventually gets there, and once gradient descent comes to the vicinity of the generalizing solution, it starts moving towards that, and that’s when grokking happens.

And basically from this perspective, you come up with some predictions… you come up with this thing called ungrokking, which we can talk about later; you can say some things about how confidence should be related to parameter norm in various settings… but I take this to be your basic story. Does that sound like a good summary?

Vikrant Varma: Yeah, I think that’s pretty good.

What are ‘circuits’?

Daniel Filan: Gotcha. I guess the first thing that I’m really interested in is: in the paper you talk about ‘circuits’, right? You say that there’s this ‘memorizing circuit’ and this ‘generalizing circuit’. You have a theoretical model of them, and you have this theoretical model of: imagine if these circuits were competing, what would that look like? But to the best of my understanding from reading your paper, I don’t get a clear picture of what this ‘circuit’ talk corresponds to in an actual model. Do you have thoughts about what it does correspond to in an actual model?

Vikrant Varma: Yeah, that’s a good question. We borrowed the circuit terminology from the circuits thread by Chris Olah at Anthropic. There, they define a circuit as a computational subgraph in the network. I think this is sufficiently general or something that it applies to our case. Maybe what you’re asking though is more: physically, where is the circuit inside the network?

Daniel Filan: If I think of it as a computational subgraph, the memorization circuit is going to take up a significant chunk of the network, right? Do you think I should think of there being two separate subgraphs that aren’t interacting very much, one of which is memorization, one of which is generalization, and just at the end we upweight the generalization and downweight the memorization?

That would be weird, because there’s going to be crosstalk that’s going to inhibit the memorizing circuit from just purely doing memorization and the generalizing circuit from purely doing generalization. When I try to picture what’s actually going on, it seems difficult for me. Or I could imagine that the memorizing circuit is just supposed to be one parameter setting for the network and the generalizing circuit is supposed to be another parameter setting and we’re linearly interpolating that. But neural networks, they’re non-linear in their parameters, right? You can’t just take a weighted sum of two parameter vectors and get a weighted sum of the outputs. So yeah, this is my difficulty with the subgraph language.

Vikrant Varma: I want to make a distinction between the model or the theory that we’re using to make our predictions, and how these circuits are implemented in practice. In the model or in our theory, these circuits are very much independent, so they have their own parameter norms and the only way they interact is they add at the logit stage. And this is completely unrealistic, but we’re able to use this very simplified model to make pretty good predictions.

I think the question of how circuits in this theory are actually implemented in the network is something that I would love to understand more about. We don’t have a great picture of this yet, but I think we can probably say some things about it already. One thing we can say is that there are definitely not going to be disjoint sets of parameters in the network.

Some evidence for this is things like: in terms of parameters, there’s a lot of overlap between a network that’s memorizing and that later generalizes, as in a lot of the parameter norm is basically coming from the same weights. And the overlap is way more than random. And this is probably because when the network is initialized, there’s some parameters that are large and some that are small and both circuits learn to use this distribution, and so there ends up being more overlap there.

Daniel Filan: Okay. My summary from that is you’re like, “okay, there are probably in some sense computational subgraphs and they probably overlap a bit, and we don’t have a great sense of how they interact”.

Vikrant Varma: Yeah.

Daniel Filan: One key point in your model is in the simplified model of networks, where they’re just independent things that get summed at the end, eventually you reduce your weight on the memorizing circuit and increase your weight on the generalizing circuit. Do you have a sense of, if I should think of this as just literally increasing and decreasing weights, or circuits cannibalizing each other somehow?

Vikrant Varma: Yeah, maybe closer to cannibalizing somehow if there’s a lot of competition for parameters between the two circuits. I think in a sense it is also going to be increasing or decreasing weights, because the parameter norm is literally going up or down. It’s just not going to happen in the way we suggest in the model, where you have a fixed circuit and it’s just being multiplied by a scalar.

In practice, there’s going to be all kinds of things. For example, it’s more efficient under L2… if you have a circuit, instead of scaling up the circuit by just multiplying all the parameters, it’s more efficient to duplicate it if you can, if you have the capacity in the network.

I also imagine that there are multiple families of circuits that are generalizing and memorizing and within each family, these circuits are competing with each other as well. And so you start off with a memorizing circuit and instead of just scaling it down or up, it’s actually morphing into a different memorizing circuit with a different distribution of parameters inside it. But the overall effect is close enough to the simplified model that it makes good predictions.

The role of complexity

Daniel Filan: Sure. I’m wondering: one thing this theory reminded me of is singular learning theory, which is this trendy new theory of deep learning [that] people are into. Basically it comes from this insight where: if you think about Bayesian inference in high dimensional parameterized model classes, which is sort of like training neural networks, except we don’t actually use Bayesian inference for training neural networks… If the model class has this property called “being singular”, then you end up having phase transitions of: sometimes you’re near one solution and then as you get more data, you can really quickly update to a different kind of solution, where basically what happens is you’re trading off some notion of complexity of different solutions for predictive accuracy.

Now, in the case of increasing data, it’s kind of different because the simplest kinds of phase transitions you can talk about in that setting are as you get more data, whereas you’re interested in phase transitions in number of gradient steps, but they both feature this common theme of “some notion of complexity being traded off with accuracy”. And if you favor minimizing complexity somehow, you’re going to end up with a low complexity solution that meets accuracy. I mean, that’s kind of a superficial similarity, but I’m wondering what you think of the comparison there.

Vikrant Varma: Yeah, so I have to admit that I know very little about singular learning theory and I feel unable to really compare what we’re talking about with SLT.

Daniel Filan: Fair enough.

Vikrant Varma: I will say though that this notion of lower weight norm being less complex somehow is quite an old idea. In particular, Seb Farquhar pointed me to this 1993 paper, I think by Geoffrey Hinton, which is about motivating L2 penalty from a minimum description length angle. So if two people are trying to communicate all of the information that’s contained inside a model, they could have some priors about what the weights of the model are, and then they need to communicate both something about the dataset as well as errors that the model is going to make. And in this paper, they use these Gaussianity assumptions and are able to derive both mean squared error loss and L2 penalty as an optimal way to communicate between these two people.

Daniel Filan: And this seems similar to the classic result that L2 regularization is sort of like doing Bayesian inference with a Gaussian [prior], just because if your prior is Gaussian, then you take the log of that and that ends up being the norm and that’s the log likelihood for you.
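The correspondence Daniel describes can be written out in a few lines. This is a sketch assuming a zero-mean isotropic Gaussian prior over the parameters; the symbols \theta (parameters), D (data) and \sigma^2 (prior variance) are notation introduced here, not taken from the conversation.

```latex
\begin{aligned}
p(\theta) &\propto \exp\!\left(-\frac{\lVert\theta\rVert_2^2}{2\sigma^2}\right) \\
\hat\theta_{\mathrm{MAP}}
  &= \arg\max_\theta \bigl[\log p(D \mid \theta) + \log p(\theta)\bigr] \\
  &= \arg\min_\theta \Bigl[-\log p(D \mid \theta) + \frac{1}{2\sigma^2}\lVert\theta\rVert_2^2\Bigr]
\end{aligned}
```

Under that assumption the log of the prior contributes a squared-norm penalty with coefficient 1/(2\sigma^2), which plays the role of the L2 regularization strength.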

Many kinds of circuits

Daniel Filan: Sure. So I guess I’d like to pick up on this thing you said about there being multiple kinds of circuits, because there’s a sentence that jumped out to me in your paper. You’re looking at doing a bunch of training runs and looking at trying to back out what you think is happening with the generalizing and memorizing circuits, and you say that the random seed starting training causes significant variance in the efficiency of the generalizing and memorizing solutions.

That kind of surprised me, partly because I think that there just can’t be that many generalizing solutions. We’re talking about fairly simple tasks like “add two numbers, modulo 113”, and how many ways can there be to do that? I recently learned that there’s more than one, but it seems like there shouldn’t be a million of them. Similarly, how many ways can there be to memorize a thing?

And then also, I would’ve thought that gradient descent would find the most efficient generalizing circuit or the most efficient memorizing circuit. So yeah, I’m wondering if you have thoughts about how I should think about this family of solutions with seemingly different efficiencies.

Vikrant Varma: One thing I’ll point out is that even with the trigonometric algorithm for doing modular addition, this is really describing a family of algorithms because it depends on which frequencies in particular the network ends up using to do the modular addition.

Daniel Filan: And if people are interested in that algorithm, they can check out my episode with Neel Nanda. You can probably check out other things, but don’t leave AXRP please. So yeah, [with] this algorithm, you pick some frequencies and then you rotate around the circle with those frequencies to basically do the clock algorithm for modular arithmetic, but you can pick which frequencies you use.

Vikrant Varma: Yeah, that’s right.
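As a rough illustration of the clock algorithm just described, here is a minimal sketch in plain NumPy. It is not the network's learned weights or anyone's published code: the function name `clock_mod_add` and the default frequencies (1, 5, 17) are arbitrary choices made for this example. It only shows that different frequency sets give the same answers.

```python
import numpy as np

def clock_mod_add(a, b, p=113, freqs=(1, 5, 17)):
    """Compute (a + b) mod p by rotating around a circle at each frequency.

    `freqs` is a hypothetical choice; any nonzero frequencies work when p is prime.
    """
    c = np.arange(p)              # every candidate answer
    logits = np.zeros(p)
    for k in freqs:
        w = 2 * np.pi * k / p     # angular step for this frequency
        # Angle-addition identities give cos/sin of w * (a + b).
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # This equals cos(w * (a + b - c)), which is 1 only when c == (a + b) mod p.
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(logits))
```

Calling `clock_mod_add(a, b, freqs=(2, 9))` or any other frequency set returns the same result, which is the sense in which the trigonometric algorithm is a family of algorithms rather than a single one.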

Daniel Filan: But I would’ve guessed that there wouldn’t be a significant complexity difference between the frequencies. I guess there’s a complexity difference in how many frequencies you use.

Vikrant Varma: Yes. That’s one of the differences: how many you use and their relative strength and so on. Yeah, I’m not really sure. I think this is a question we pick out as a thing we would like to see future work on.

The other thing I want to draw attention to is in many deep learning problems, it is not the case… Deep learning practitioners are very familiar with the observation that different random seeds end up producing networks with different test performance. And if you’re trying to create a state-of-the-art network for solving some tasks, it’s quite common to run 100 seeds and then pick the top five best-performing ones or whatever. I think it’s just not clear from this that gradient descent or Adam is able to find the optimal solution from any initialization.

And I think this also shows up not just when you vary random seed, but it shows up epoch-wise. Because for example, with one of the phenomena you mentioned from our paper, semi-grokking, you see these multiple phase transitions where the network is switching between generalizing circuits that have different levels of efficiency, and these levels of efficiency are quite far apart so that you can actually visibly see the change in test performance as it switches in a very discrete manner between these circuits. And if it was really the case that gradient descent could find the optimal solutions, then you wouldn’t expect to see this kind of switching.

How circuits are learned

Daniel Filan: Gotcha. Yeah, there are just so many questions I have about these circuits. I’m not sure you have any answers, but it just brings up… Part of your story is that it takes a while to learn the generalizing solution, longer than it takes to learn the memorizing solution. Do you maybe have thoughts about why that might be?

Vikrant Varma: I think my thoughts here are mostly what we’ve written down in the paper and I feel like this is another area that’s pretty ripe for understanding. The explanation we offer in the paper is mostly inspired by a blog post by Buck [Shlegeris], and I think another person whose name I’m forgetting.

Daniel Filan: Ryan Greenblatt? [Update: it’s actually Adam Jermyn]

Vikrant Varma: Yes, that’s right. And the explanation there is that maybe memorizing circuits basically have fewer independent components that you need in order for the circuit to develop, but a generalizing circuit is going to need multiple components that are all needed for good performance.

Here’s the simplified model: the memorizing network is implemented with one parameter and that just scales up or down, and the generalizing network is implemented with two parameters that are multiplied together to get the correct logit, and so the gradient of the output with respect to any of the parameters depends on the value of the other parameter in the generalizing circuit case.

And so if you simulate this forward, what you get for the generalizing circuit is a kind of sigmoid where initially both parameters are quite small and they’re not contributing that much to each other’s growth, and then once they start growing, both of them grow quite a lot and then it plateaus out because of L2.
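A minimal sketch of simulating this forward, under assumptions that are not from the paper: a logistic loss on the summed logit, plain gradient descent with L2, and the particular learning rate, weight decay and initialization below. The memorizing logit is a single parameter `w`; the generalizing logit is the product `u * v`.

```python
import math

def simulate(steps=20000, lr=0.1, weight_decay=0.01, init=0.01):
    """Gradient descent on log(1 + exp(-logit)) + (weight_decay / 2) * ||params||^2.

    All hyperparameters here are illustrative assumptions.
    Returns a list of (memorizing_logit, generalizing_logit) per step.
    """
    w = 0.0          # memorizing circuit: one parameter, logit = w
    u = v = init     # generalizing circuit: two parameters, logit = u * v
    history = []
    for _ in range(steps):
        logit = w + u * v                    # circuits only interact by adding logits
        s = 1.0 / (1.0 + math.exp(logit))    # negative slope of the loss w.r.t. the logit
        grad_w = s - weight_decay * w
        grad_u = s * v - weight_decay * u    # depends on v: tiny while v is tiny
        grad_v = s * u - weight_decay * v    # depends on u: tiny while u is tiny
        w += lr * grad_w
        u += lr * grad_u
        v += lr * grad_v
        history.append((w, u * v))
    return history
```

Plotting the two columns of `simulate()` against the step index should show the memorizing logit rising immediately, while the generalizing logit stays near zero, then grows quickly and levels off once weight decay balances the loss gradient.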

Daniel Filan: Sure. Should I think of this as just a general observation that in evolution, if you need multiple structures and they both depend on each other for that to work, that’s a lot harder to evolve than a structure that is a little bit useful on its own and another structure that is a little bit useful on its own? And for memorizing a solution, I can memorize one thing and then I can memorize the second thing and I can do those independently of each other, so each little bit of memorization is evolvable on its own maybe?

Vikrant Varma: Yes, I think that’s right. Maybe another way I think about it is that the memorization circuit is basically already there, [in a] random network. And really the thing you’re learning is the values that you have to memorize. And as you say, that’s independent for each point, but that’s not the case for [the] generalizing circuit.

I think another important ingredient here is that there needs to be at least some gradient at the beginning towards the generalizing circuit if it has multiple components. And it’s kind of an open question in my mind why this happens. The most plausible theory I’ve heard is something like lottery tickets, where basically the randomly initialized network has very weak versions of the circuits that you want to end up with. And so there’s a tiny but non-zero gradient towards them.

Daniel Filan: Interesting. Yeah, I guess more work is needed. I’d like to talk a little bit about … The story of: it takes you a while to learn the generalizing circuit. You learn the memorizing circuit and it takes you a while to learn the generalizing circuit, but once you do, then that’s grokking.

This is sort of reminiscent of a story that I think was in an appendix of a paper, Progress Measures for Grokking? It was progress measures for something by Neel Nanda et al. And they have this story where there are three phases of circuit formation. There’s memorization and then there’s learning a generalizing solution, and then there’s cleaning up the memorized stuff, right? And in their story, they basically demarcate these phases by looking at the activations of their model and figuring out when the activations are representing the algorithm that is the generalizing solution according to them.

And so it seems pretty similar to your story, but one thing that struck me as being in a bit of tension is that they basically say grokking doesn’t happen when you learn the generalizing solution, it happens when you clean up the parameters from the memorizing solution. Somehow there’s one stage of learning the generalizing solution and then a different phase of forgetting the memorizing solution. So I’m wondering what you think about the relationship between that story in their paper and your results.

Vikrant Varma: I think one thing that’s interesting to think about here is the relationship between logits and loss, or between logits and accuracy. Why is the memorization cleanup important? It’s because to a first approximation, the loss is dependent on the difference between the highest logit and the second highest logit. And if you have this memorization circuit that is kind of incorrectly putting high weight on the incorrect logit, then when it reduces, you’ll see a very sharp cleanup effect.
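The first approximation mentioned here follows from the softmax cross-entropy formula. In this sketch, z_y is the logit of the correct class and z_{j^*} the largest competing logit; both symbols are notation introduced for this note.

```latex
\begin{aligned}
\mathcal{L}
  &= -\log \frac{e^{z_y}}{\sum_j e^{z_j}}
   = \log\Bigl(1 + \sum_{j \neq y} e^{\,z_j - z_y}\Bigr) \\
  &\approx \log\bigl(1 + e^{-(z_y - z_{j^*})}\bigr)
   \quad \text{when a single competitor } j^* \text{ dominates the sum.}
\end{aligned}
```

So when one competing logit dominates, the loss depends mainly on the margin z_y - z_{j^*}, and shrinking a memorized logit on a wrong class directly widens that margin.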

I think this is something that we haven’t really explored that much because the circuit efficiency model is mostly talking about the steady state that you expect to end up in and is not so much talking about the dynamics between these circuits as time goes on. This is a part of the story of grokking that is very much left unexplained in our paper, which is why exactly is the generalizing circuit developing slower? But if you put in that sigmoid assumption, as I was talking about, artificially, then the rest of the theory is entirely sufficient to see exactly the same kinds of grokking curves as you see in actual grokking.

Daniel Filan: But maybe I’m misunderstanding, but under your story, I think I would’ve expected a visible start of grokking, or visible increase in test accuracy during the formation of the generalizing circuits, rather than it waiting until the cleanup phase.

Vikrant Varma: Right. By formation, are you talking about … there’s this phase where the generalizing circuit is developing and it’s there, but the logits that it’s outputting are way lower than the memorization logits. And in that phase, I basically don’t expect to see any change in test accuracy.

And then there’s the phase where the generalizing circuit logits cross the memorizing circuit for the first time. And this is maybe another place where there’s a difference between the toy model and what you actually observe. In the toy model, the thing you would expect to see is a jump from 0% accuracy as it’s just below the equality threshold to 100% accuracy the moment the generalizing logit crosses the memorizing logit, because that changes what the highest strength logit is.

But in practice, we see the accuracy going through all these intermediate phases, so it goes through 50% before getting to 100%. And the reason that’s happening is because there are some data points which are being correctly classified and some which are incorrectly classified.

And so this is pointing to a place where the theory breaks down, where on some of the points, the generalizing circuit is making more confident predictions than on other points, which is why you get these intermediate stages of accuracy, so that’s one thing.

This also suggests why the cleanup is important to get to full accuracy. If the memorizing circuit is making these high predictions on many points, even when the generalizing circuit is around the same level, because of the variance, you just need the memorizing circuit to disappear before you really get 100% prediction accuracy.
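The contrast between the toy model's jump and the gradual curve seen in practice can be sketched numerically. Everything below is an illustrative assumption rather than a measurement: a fixed memorizing logit on a wrong answer, a generalizing logit whose strength varies across test points with Gaussian spread, and the helper name `toy_test_accuracy`.

```python
import numpy as np

def toy_test_accuracy(gen_scale, mem_logit=1.0, spread=0.0, n_points=1000, seed=0):
    """Fraction of test points where the generalizing logit beats the memorizing one.

    `spread` is a hypothetical per-point variation in generalizing confidence.
    """
    rng = np.random.default_rng(seed)
    # Per-point generalizing logit: the same scale everywhere when spread == 0.
    gen_logits = gen_scale * (1.0 + spread * rng.standard_normal(n_points))
    return float(np.mean(gen_logits > mem_logit))
```

With `spread=0.0` the accuracy flips from 0 to 1 as `gen_scale` crosses `mem_logit`; with a nonzero `spread` it passes through intermediate values, and only reaches 1 once the generalizing logits clear the memorizing logit on every point or the memorizing logit shrinks.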

Daniel Filan: Right, so the story is something like: you start learning the generalizing circuit and you start getting the logits being somewhat influenced by the generalizing circuit, but you need the logits to start being somewhat near each other to get a hope of making a dent in the loss. And for that to happen, you more need the memorization circuits to go away. There’s the formation of the circuit, and then there’s switching the weights over, is roughly the story that I’m thinking of.

Vikrant Varma: Yeah, that’s right.

Semi-grokking and ungrokking

Daniel Filan: Gotcha. There’s something that you mentioned called semi-grokking and ungrokking. Actually, can you describe what they are?

Vikrant Varma: Sure. I’ll start with how should you think about the efficiencies of these two different circuits. If you have a generalizing circuit that is doing modular addition, then if you add more points to the training set, it doesn’t change the algorithm you need to get good training loss. And so you shouldn’t really expect any modification to the circuit as you add more training points. And therefore, the efficiency of the circuit should stay the same. Whereas if you have a memorizing circuit, then as you add more points, it needs more weights to memorize those points, and so you should expect the efficiency to be dropping as you add more points.

Daniel Filan: Yeah, or another way I would think about this is that if I’m memorizing n points - I figured out the most efficient way to memorize the n points - the (n+1)th point, I’m probably not going to get that right because I just memorized the first ones, so I’ve got to change to memorize the (n+1)th point, and I can’t change in a way that makes me more efficient because I was already the most efficient I could be on the n points. And so even just at a macro level, just forgetting about adding weights or whatever, it just has to be the case that memorization is losing efficiency the more you memorize whereas generalization, you wouldn’t expect it to have to lose efficiency.

Vikrant Varma: Yeah. Yeah, that’s a good way to explain it.

Daniel Filan: And of course I stole that from your paper. I don’t want to act like I invented that.

Vikrant Varma: No, but it’s a good explanation. So I think a direct consequence of this… so when you couple that with the fact that at very low dataset sizes, it appears that memorization is more efficient than generalization, then you can conclude that there must be a dataset size where memorization [parameter norm] is increasing as you increase the dataset size, generalization parameter norm is staying the same… There must be a crossover point. And then you can ask the question: what happens at that crossover point, when you have a dataset size where the efficiency of generalization is equal to the efficiency of memorization?

And so we did some maths in our toy model, or our theoretic model I should say, and came up with these two cases, these two different things that could happen there. And this really depends on the relationship between… When you scale the parameters by some amount, how does that scale the logits? And if it scales the logits by more than some threshold, then it turns out that at this equality point you will just end up with the more efficient circuit, period. But if the scaling factor is lower than some threshold, then you will actually end up with a mixture of both the memorizing and the generalizing circuits. And the reason for this is: because you’re not able to scale the logits as much, it’s more efficient to allocate the parameter norm between these two different circuits when you are considering the joint loss of L2 plus the data loss.

Daniel Filan: Okay, so something like… it’s sort of the difference between convex and concave optimization, right? You’re getting diminishing returns per circuit, and so you want to invest in multiple circuits rather than going all in on one circuit. Whereas in some cases, if you have increasing returns, then you just want to go all in on the best circuit.

Vikrant Varma: Yeah, that’s right. And in particular, the threshold is like… there’s quite a cool way to derive it, which is that the L2 is scaling as the square of the parameter norm. So if the logits are scaling faster than that, then you’re able to overcome the parameter penalty by just investing in the more efficient circuit. But if they’re not scaling faster than that, then you have to split. And so the threshold ends up being whether you’re able to scale the logits faster than [the parameter norm] to the power of two.
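The convex-versus-concave picture can be checked with a small numerical sketch. The setup is an assumption made for illustration: two equally efficient circuits share a fixed total logit, a fraction `f` comes from the first circuit, and logits scale as parameter norm to the power `kappa` (a symbol introduced here), so the L2 cost of a circuit supplying logit share `f` is proportional to `f ** (2 / kappa)`.

```python
import numpy as np

def best_split(kappa, n_grid=1001):
    """Logit share of circuit 1 that minimizes total L2 cost, found by grid search.

    Assumes logit is proportional to norm ** kappa, so norm ** 2 is proportional
    to share ** (2 / kappa).
    """
    f = np.linspace(0.0, 1.0, n_grid)
    cost = f ** (2.0 / kappa) + (1.0 - f) ** (2.0 / kappa)
    # argmin returns the first minimizer, so a tie between f=0 and f=1 reports f=0.
    return float(f[np.argmin(cost)])
```

For `kappa` below 2 the cost is convex in `f` and the minimum sits at an even split, while for `kappa` above 2 it is concave and the minimum sits at an endpoint, meaning all the norm goes to one circuit.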

Daniel Filan: Okay. So you have this semi-grokking and ungrokking, right? Where you’re training on this subset of your training dataset and you lose some test accuracy - either some of it or all of it - basically by partly or fully reverting to the memorizing solution. So this is an interesting phenomenon because… maybe you know better than me, but I’m not aware of people talking about this phenomenon or connecting it to grokking before. Or they’ve talked about the general phenomenon of catastrophic forgetting, where you train your network on a different dataset and it forgets stuff that [it] used to know. But in terms of training on a subset of the dataset, I’m not aware of people discussing that before or predicting that before. Is that right?

Vikrant Varma: Yeah, I think that’s right. So we got a lot of questions from reviewers about “how is ungrokking any different from catastrophic forgetting?”, to the extent that in the newer version of the paper, we have a whole section explaining what the difference is.

I think basically I would view it as a much more specific and precise prediction than catastrophic forgetting. So one difference is that we’re training on a subset of the data, and this is quite important because this rules out a bunch of other hypotheses that you might have about why grokking is happening.

So for example, if your hypothesis is: the reason grokking happens is because you don’t have the correct representations for the modular addition task, and once you find those representations, then you’ve grokked - that’s a little bit incompatible with then reducing the training data size and ungrokking, because you already had the representations and so you need this additional factor of a change in efficiency.

Or another example is a random walk hypothesis, where somehow you stumble upon the correct circuit by randomly walking through parameter space. And that also either doesn’t say anything about it, or anti-predicts ungrokking, because you were already at that point. So I think that’s quite an important difference.

I think going back to the difference between catastrophic forgetting [and ungrokking], I think another more precise prediction is that we’re able to predict the exact dataset size at which you see ungrokking, and it’s quite a phase-change-y phenomenon or something. It’s not like as you decrease the dataset size, you’re smoothly losing test accuracy, in this case, which is more the kind of thing you might expect from traditional catastrophic forgetting.

Daniel Filan: Right. My impression was that the thing you were predicting would be that there would be some sort of phase change in terms of subset dataset size, and also that that phase change would occur at a point independent of the strength of weight decay.

Vikrant Varma: That’s right.

Daniel Filan: But I thought that you weren’t able to predict where the phase change would occur. Or am I wrong about that?

Vikrant Varma: That’s right. Our theory is not producing a quantitative prediction of exactly what dataset fraction you should expect that phase change to happen at. That’s right.

Daniel Filan: Yep. But it does predict that it would be a phase change and it would happen at the same point for various levels of weight decay. One cool thing about this paper is it really is a nice prediction and you’ve got a bunch of nice graphs, [you] kind of nail it, so good job on that.

Vikrant Varma: Thank you.

Daniel Filan: But one thing I’m wondering about is: you have this phenomenon of ungrokking and it seems at least like an instance of catastrophic forgetting that you’re able to say more about than people have previously been able to say. But this offers an opportunity to try and retrodict phenomena, or in particular… I’m not an expert in catastrophic forgetting, but my understanding is that one of the popular approaches to it is this thing called “elastic weight consolidation”, where you basically have different learning rates per parameters, and you reduce the learning rate, so you reduce the future change in parameters for those parameters that were important for the old task. That’s one method, you might be aware of others. Does your view of grokking and ungrokking retrodict these proposed ways of dealing with catastrophic forgetting?

Vikrant Varma: I think not directly. I can see a few differences. I’m not aware of this exact paper that you’re talking about, but I think depending on the task, there might be different reasons why you’re getting forgetting. So you might be forgetting things that you memorized or you might be forgetting algorithms that are appropriate for that part of the data distribution. That’s one aspect of it.

I think a different aspect is that it’s not clear to me why you should expect these circuits to be implemented on different weights. So if the theory is that you find the weights that are important for that algorithm and then you basically prevent those weights from being updated as fast, so you’re not forgetting, then I think that is pointing at a disjoint implementation of these circuits in the network. And that’s not something that we are really saying anything directly about.
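For readers unfamiliar with the method Daniel mentions, here is a minimal sketch. Elastic weight consolidation is usually written as a quadratic penalty that pulls each parameter back toward its old-task value, weighted by a per-parameter importance estimate (typically diagonal Fisher information). This has an effect similar to lowering the effective learning rate on important parameters. The function names, the toy numbers, and `lam` are hypothetical illustrations and are not taken from the episode or from the grokking paper.

```python
def ewc_penalty(theta, theta_old, importance, lam=1.0):
    """Return (lam / 2) * sum_i importance[i] * (theta[i] - theta_old[i]) ** 2."""
    return 0.5 * lam * sum(
        f * (t - t_old) ** 2 for t, t_old, f in zip(theta, theta_old, importance)
    )


def ewc_penalty_grad(theta, theta_old, importance, lam=1.0):
    """Gradient of the penalty: importance scales the pull back toward theta_old."""
    return [lam * f * (t - t_old) for t, t_old, f in zip(theta, theta_old, importance)]


# Hypothetical toy values: parameter 0 mattered a lot for the old task,
# parameter 1 not at all, parameter 2 a little.
theta_old = [1.0, -2.0, 0.5]
importance = [10.0, 0.0, 1.0]
theta = [1.5, 0.0, 0.5]

penalty = ewc_penalty(theta, theta_old, importance)
grad = ewc_penalty_grad(theta, theta_old, importance)
```

In this toy case only parameter 0 is pulled back, even though parameter 1 moved further, because parameter 1 has zero importance. That is the sense in which the method assumes the old task lives on an identifiable subset of weights, which is the assumption Vikrant questions above.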

Daniel Filan: Gotcha. Yeah, I guess it makes sense that it would depend on the implementation of these circuits.

Another question I have is: in a lot of your experiments, like you mentioned, you are more interested in the steady state than the training path, except for just the initial prediction of grokking, I guess.

Vikrant Varma: To be clear, the paper deals with the steady state; I’m very interested in the training path as well.

Daniel Filan: Fair enough. So if I look at the ungrokking stuff, it seems like… So there’s this steady state prediction where there’s this critical dataset size, and once you’re below the critical dataset size, you ungrok and it doesn’t really matter what your weight decay strength was.

If I naively think about the model, it seems like your model should suggest that it should take longer for less weight decay because you have less pressure to… You care about the complexity, but you’re caring about it less, per unit of time. And similarly, that grokking should be quicker for more weight decay. I guess it’s a two-part question. Firstly, do you agree that that’s a prediction of this model? And secondly, did that bear out?

Vikrant Varma: Yeah, so I think this is a prediction of the model, assuming you’re saying that weight decay affects the speed of grokking?

Daniel Filan: Yes, and of ungrokking.

Vikrant Varma: Yeah, I think that is a prediction of the model. Well, to be fair, it is a retrodiction, because the Power et al. paper already shows that grokking takes exponentially longer as you reduce the dataset size, and I forget what the relationship is, but it definitely takes longer as you reduce the weight decay.

Daniel Filan: And does ungrokking take longer as you reduce weight decay?

Vikrant Varma: We don’t show this result in the paper, but I’m fairly sure I remember that it does, yeah.

Generalizing the results

Daniel Filan: Okay, cool. So I’d like to talk a bit about possible generalizations of your results. So as written, you’re basically talking about efficiency in parameter norm, where if you increase parameter norm, you’re able to be more confident in your predictions, but that comes at a penalty if you train with weight decay.

Now, as your paper notes, weight decay is not the only situation in which grokking occurs, and you basically hypothesize that there are other forms of regularization, regularizing against other forms of complexity and that there could be some sort of efficiency in those other forms of complexity that might become relevant.

I’m wondering, do you have thoughts on what other forms of complexity I should be thinking of?

Vikrant Varma: Yeah, so we’ve already talked about one of them, which is that circuits might be competing for parameters on which to be implemented. This is a kind of capacity constraint. And so you might think that circuits that are able to be implemented on fewer parameters, or using less capacity (however you define capacity in the network), would be more efficient. So I think some relevant work here is bottleneck activations: I think this is from “Mathematical circuits of transformers”, which is talking about other phenomena like superposition that you would get from constrained capacity.

So that’s one more notion of efficiency. I think possibly robustness to interference could be another kind of efficiency: how robust is the circuit to being randomly invaded by other circuits. Maybe also robustness to drop-out would be a similar thing here. And then I think there are things like how frequently does the circuit occur, which might be important… From a given random seed will you be able to find it?

Daniel Filan: Do you mean: what’s the probability mass of it on the initialization prior over weights?

Vikrant Varma: Yes. And also then on the kinds of parameter states that SGD is likely to find. So this is the implicit priors in SGD. There’s some work on implicit regularization of SGD and showing that it prefers similar kinds of circuits to what L2 might prefer, but probably it’s different in some interesting way.

Daniel Filan: Okay. If I think about the results in your paper, a lot of them are generic to other potential complexity measures that you could trade off confidence against. But sometimes you rely on this idea… in particular for your analysis of grokking and semi-grokking, you play with this math notion of: if I scale up some parameters in every layer of a ReLU network, that scales up the logits by this factor, and therefore you get this parameter norm coming off. And I think this is involved in the analysis of semi-grokking versus ungrokking, right?

Vikrant Varma: Yes.
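The scaling property Daniel refers to can be checked numerically. The sketch below assumes a fully connected ReLU network with no biases (an assumption made here for simplicity, not a claim about the paper's architecture): because ReLU is positively homogeneous, multiplying every weight matrix by a factor `alpha > 0` multiplies the logits by `alpha` raised to the number of weight matrices. The layer widths and helper names are hypothetical.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(w, v):
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def forward(weights, x):
    """Bias-free ReLU MLP: ReLU after every layer except the last (logits)."""
    h = x
    for w in weights[:-1]:
        h = relu(matvec(w, h))
    return matvec(weights[-1], h)


def scale_weights(weights, alpha):
    """Multiply every entry of every weight matrix by alpha."""
    return [[[alpha * w_ij for w_ij in row] for row in w] for w in weights]


random.seed(0)
sizes = [4, 8, 8, 3]  # hypothetical layer widths, giving 3 weight matrices
weights = [
    [[random.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
x = [random.gauss(0.0, 1.0) for _ in range(sizes[0])]

alpha = 2.0
depth = len(weights)
logits = forward(weights, x)
scaled_logits = forward(scale_weights(weights, alpha), x)
# Each scaled logit should equal alpha ** depth times the original logit.
```

Since the parameter norm grows only linearly in `alpha` while the logits grow as `alpha ** depth`, this is one way to see how parameter norm can be converted into confidence in such a network.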

Daniel Filan: So I guess the prediction here would be that maybe semi-grokking is more likely to occur for things where you’re trading off weight parametrization as compared to robustness to drop-out or stuff. Does that sound right to you?

Vikrant Varma: I think in general it’ll be very hard to observe semi-grokking in realistic settings because you need such a finely tuned balance. You need all these ingredients. You need basically two circuits, or two pretty distinct families of circuits, with no intermediate circuits that can do the task well between them. You need a way to arrange it so that the dataset size or other hyperparameters are causing these two circuits to have very, very similar levels of efficiency.

And then also you need it to be the case that, under those hyperparameters, you’re able to actually find these two families of circuits. So you’ll probably find the memorizing one, but you need to be able to find a generalizing one in time. And this just seems like quite a hard thing to happen all at once, especially the fact that in realistic tasks you’ll have multiple families of circuits that are able to do the training task to some degree.

Daniel Filan: So semi-grokking seems unlikely. I guess it does seem like the prediction would be that you would be able to observe ungrokking for the kinds of grokking that don’t depend on weight decay. Am I right that that is a prediction, and is this a thing that you’ve tested for yet? Or should some enterprising listener do that experiment?

Vikrant Varma: So to be clear, the experiment here is “find a different notion of efficiency and then look for ungrokking under that notion”?

Daniel Filan: The experiment would be “find an instance of grokking that doesn’t happen from weight decay”. Then it should be the case that: [you] train your data, get grokking, then train data on subsets of various sizes, and there should be a critical subset size where below that you ungrok and above that you retain the grokking solution when you fine-tune on that subset.

Vikrant Varma: Yeah, I think this is basically right, and our theory does predict this. I will caveat that dataset size may not be the right variable to vary here, depending on what notion of efficiency you’re using.

Daniel Filan: Well, I guess in all notions of efficiency, it sounded like there was a prediction that efficiency would go down as the dataset increased for the memorizing solution, but not for the generalizing solution, right?

Vikrant Varma: Yeah, that’s right.

Daniel Filan: As long as you believe that the process is selecting the most efficient circuit.

Vikrant Varma: Yeah, that’s right.

Daniel Filan: Which we might worry about if there’s… you mentioned SGD found different efficiency generalizing solutions, so maybe you might be worried about optimization difficulty. And in fact maybe something like parameter norm is easier to optimize against than something like drop-out robustness, which is less differentiable or something.

Vikrant Varma: Yeah, I think that’s right. I think you’re right that in this kind of regime, dataset size is pretty reasonable. I was imagining things like model-wise grokking, where on the X-axis, instead of amount of data, you’re actually varying the size of the model or the number of parameters or the capacity or whatever.

But all of these different models are actually trained on the same amount of data for the same time. And it’s also less clear how exactly you would arrange to show ungrokking there because naively, you can’t go to a higher-size model and then reduce the size of the model. But maybe there are ways to show that there.

Vikrant’s research approach

Daniel Filan: Gotcha. So if it’s all right with you, I’d like to move on to just general questions about you and your research.

Vikrant Varma: Cool.

Daniel Filan: So the first question I have is: we’ve talked about your work on grokking, we’ve talked about your work on latent knowledge in large language models. We haven’t talked about it, but the other paper I know you for is this one on goal misgeneralization. Is there a common thing that underlies all of these that explains why you worked on all of them?

Vikrant Varma: Well, one common thing between all of them is that they are all projects that are accessible as an engineer without much research experience, which was one of my main selection criteria for these projects.

So yeah, I guess my background is that I’ve mostly worked in software engineering, and then around 2019 I joined DeepMind and about a year later I was working on the alignment team. I did not have any research experience at that time, but I was very keen to figure out how you can apply engineering effort to make alignment projects go better.

And so certainly for the next two or three years, I was mainly in learning mode and trying to figure out how do people think about alignment? What sorts of projects are the other people in the alignment team interested in? Which ones of these look like they’re going to be most accelerated by just doing good engineering fast? And so that’s where a bunch of the early selection came from.

I think now I feel pretty drawn to working on maybe high risk, high reward things that might end up mattering if alignment by default (as I see the plan) doesn’t go as expected. It feels like the kind of thing that is potentially more neglected. And maybe if you think that you need a bunch of serial research time to do that now before you get very clear signals that, I don’t know, we haven’t done enough research on some particular kind of failure mode, then that feels important to do now.

Daniel Filan: Okay. So should I be thinking: lines of research where both, they’re approachable from a relatively more engineering-heavy background, and also laying the foundation for work that might come later rather than just attempting to quickly solve a problem?

Vikrant Varma: Yeah, that’s right. That’s certainly what I feel more drawn to. And so for example, I feel pretty drawn to the eliciting latent knowledge problem. I think there is both interesting empirical work to do right now in terms of figuring out how easy is it to actually extract truth-like things from models as we’ve been discussing, but also framing the problem in terms of thinking about methods that will scale to superintelligent systems[, this] feels like the kind of thing that you just need to do a bunch of work in advance. And by the time you’re hitting those sorts of problems, it’s probably quite a bad situation to be in.

Daniel Filan: Gotcha. So what should I think of as your role in these projects?

Vikrant Varma: I think it varies. I would describe it as a mix of coming up with good research ideas to try, trying to learn from people who have been around in the alignment community much longer than me, and also trying to supply engineering expertise.

So for example, currently I’m working on sparse autoencoders for mechanistic interpretability, and I am very new to mechanistic interpretability. However, all of the people I work with (or many of the people I work with) have been around in mech. interp. for a long time. And it’s great for me to understand and try to get knowledge directly from the source in a way.

I think at the same time, with sparse autoencoders in particular, that’s the kind of project where… Partly what drew me to it was Chris Olah’s tweet where he said… I’m not sure exactly what he said, but it was something like “mech. interp. might be in a different mode now where if SAEs [Sparse AutoEncoders] work out, then it’s mostly an engineering problem, it’s not so much a scientific problem”. And that kind of thing feels very exciting to me, if we’re actually able to scale up to frontier models.

Daniel Filan: It could be. I do find myself thinking that there’s still a lot of science left to do on SAEs, as far as I can tell.

Vikrant Varma: Yeah, I don’t disagree with that.

Daniel Filan: Perhaps I should say for the listener, a sparse autoencoder - the idea is that you want to understand what a network is thinking about. So you train a function from an intermediate layer of the neural network to a very large vector space, way more dimensions than the underlying thing, and then back to the activation space, and you want to train this to be the identity function, but you want to train it so that the intermediate neurons of this function that you’ve learned very rarely fire. And the hope is that you’re picking up these underlying axes of variation, and hopefully only a few of them are happening at a time, and hopefully they correspond to concepts that are interpretable, and that the network uses, and that are underlying facts about the network and not just facts about the dataset that you happen to train the autoencoder on.

And all three of those conditions seem like they need more work to be established. I don’t know, I’m not super up to date on the SAE literature, so maybe somebody’s already done this, but I don’t know, that’s a tangent from me.
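Daniel's description can be turned into a small sketch of the objective. The version below assumes a ReLU encoder into a wider hidden layer, a linear decoder, and an L1 penalty on the hidden features as the way of making them fire rarely. Those are common choices but are assumptions here, since the episode does not specify them. The sizes, names, and the random stand-in activation are hypothetical, and the weights are untrained: the code only defines the forward pass and the loss.

```python
import random


def matvec(w, v):
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode into a wider hidden layer with ReLU, then decode back."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff):
    """Reconstruction error plus an L1 penalty that pushes features toward zero."""
    mse = sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity


random.seed(0)
d_model, d_hidden = 4, 16  # hypothetical sizes; the hidden layer is much wider
w_enc = [[random.gauss(0.0, 0.5) for _ in range(d_model)] for _ in range(d_hidden)]
w_dec = [[random.gauss(0.0, 0.5) for _ in range(d_hidden)] for _ in range(d_model)]
b_enc = [0.0] * d_hidden
b_dec = [0.0] * d_model

x = [random.gauss(0.0, 1.0) for _ in range(d_model)]  # stand-in for a real activation
features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.1)
```

Training would minimize this loss over many activations from the network being studied. The first term asks for the identity function, and the second is what trades reconstruction quality against how many features are active at once.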

Vikrant Varma: I definitely agree. I think there’s a ton of scientific work to do with SAEs. It just also happens to be the case that there’s… It feels like there’s a more direct path or something to scaling up SAEs and getting some sort of mech. interp. working on frontier models that, at least in my view, was absent with previous mech. interp. techniques, where it was more…

Daniel Filan: Human intensive, I guess?

Vikrant Varma: Yeah, more human intensive and a much less direct path to doing the same kind of in-depth circuit analysis on larger models.

The DeepMind alignment team

Daniel Filan: I’d next like to ask about the alignment team at DeepMind. So obviously I guess you’ve been there for a few years.

Vikrant Varma: Yeah.

Daniel Filan: What’s it like?

Vikrant Varma: It is honestly the best place I’ve worked. I find the environment very stimulating, there’s a lot of freedom to express your opinions or propose research directions, critique and try to learn from each other. I can give you an overview of some of the projects that we’re currently working on, if that helps.

Daniel Filan: Yeah. That sounds interesting.

Vikrant Varma: So I think the team is roughly evenly split between doing dangerous capability evaluations, doing mechanistic interpretability, rater assistance, and various other emerging projects like debate or process supervision. So that’s roughly the split right now.

I think apart from that, to me it feels like an inflection point right now because safety is getting a lot of attention within DeepMind, I think. So Anca Dragan recently joined us, she is a professor at Berkeley. To me it feels like she has a lot of buy-in from leadership for actually pushing safety forward in a way that feels new and exciting. So as one example of this, we’re spinning up an alignment team in the Bay Area. Hopefully we’ll have a lot more resources to do ambitious alignment projects in the future.

Daniel Filan: Sure. Is that recruiting new people from the Bay Area or will some people be moving from London to seed that team?

Vikrant Varma: That’s going to be mostly recruiting new people.

Follow-up work

Daniel Filan: Gotcha. The second to last thing I’d like to ask is: we’ve talked about this grokking work and this work on checking this CCS proposal. Do you have thoughts on follow-up work on those projects that you’d really like to see?

Vikrant Varma: Yeah, definitely. So I think with the grokking paper, we’ve been talking about a bunch of potential follow-up work there. I think in particular, exploring other notions of efficiency seems really interesting to me. I think the theory itself still can produce quite a lot of interesting predictions. And from time to time I keep thinking of new predictions that I would like to try that just don’t fit into my current work plans and stuff.

So an example of a prediction that we haven’t written down in the paper but that occurred to me a few days ago, is that: our theory is predicting that even at large dataset sizes where you’re seeing grokking, if the exponent with which you convert parameter norms into efficiency is small enough, then you should expect to see a non-zero amount of memorization even at large dataset sizes. So the prediction there is that there should be a train-test gap that is small but not zero. And this is in fact true. And so the thing you should be able to do with this is use the empirical estimates of memorization and generalization efficiency to predict train-test gap at any dataset size.

So that’s one example. I think this theory is pretty fruitful and doing work like this is pretty fruitful. I would love to see more of that. On the CCS side, a thing I would love to see is test beds for ELK methods. So what I mean by that is examples of networks that are doing something deceptive, or that otherwise have some latent knowledge that you know is in there but is not represented in the outputs. And then you’re really trying your hardest to get that latent knowledge out, using all sorts of methods like linear probing or black box testing, maybe anomaly detection. But I think without really good test beds, it’s hard to know. It’s easy to fool yourself about the efficacy of your proposed ELK method. And I think this is maybe quite related to the model organisms agenda as well.

Daniel Filan: Well, I think that wraps up about what we wanted to talk about. Thanks very much for being on the show.

Vikrant Varma: Yeah, thanks for having me.

Daniel Filan: This episode is edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Filming occurred at FAR Labs. Financial support for this episode was provided by the Long-Term Future Fund and Lightspeed Grants, along with patrons such as Alexey Malafeev and Tor Barstad. To read a transcript of this episode or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.


Improving Dictionary Learning with Gated Sparse Autoencoders

Neel Nanda — Thu, 25 Apr 2024 18:43:48 GMT

Published on April 25, 2024 6:43 PM GMT

Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders!

Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

They achieve similar reconstruction with about half as many firing features, while being comparably or more interpretable (confidence interval for the increase is 0%-13%).

See Sen's Twitter summary, my Twitter summary, and the paper!


"Why I Write" by George Orwell (1946)

Arjun Panickssery — Thu, 25 Apr 2024 16:02:29 GMT

Published on April 25, 2024 4:02 PM GMT

People have been posting great essays so that they're "fed through the standard LessWrong algorithm." This essay is in the public domain in the UK but not the US.

From a very early age, perhaps the age of five or six, I knew that when I grew up I should be a writer. Between the ages of about seventeen and twenty-four I tried to abandon this idea, but I did so with the consciousness that I was outraging my true nature and that sooner or later I should have to settle down and write books.

I was the middle child of three, but there was a gap of five years on either side, and I barely saw my father before I was eight. For this and other reasons I was somewhat lonely, and I soon developed disagreeable mannerisms which made me unpopular throughout my schooldays. I had the lonely child’s habit of making up stories and holding conversations with imaginary persons, and I think from the very start my literary ambitions were mixed up with the feeling of being isolated and undervalued. I knew that I had a facility with words and a power of facing unpleasant facts, and I felt that this created a sort of private world in which I could get my own back for my failure in everyday life. Nevertheless the volume of serious – i.e. seriously intended – writing which I produced all through my childhood and boyhood would not amount to half a dozen pages. I wrote my first poem at the age of four or five, my mother taking it down to dictation. I cannot remember anything about it except that it was about a tiger and the tiger had ‘chair-like teeth’ – a good enough phrase, but I fancy the poem was a plagiarism of Blake’s ‘Tiger, Tiger’. At eleven, when the war of 1914-18 broke out, I wrote a patriotic poem which was printed in the local newspaper, as was another, two years later, on the death of Kitchener. From time to time, when I was a bit older, I wrote bad and usually unfinished ‘nature poems’ in the Georgian style. I also, about twice, attempted a short story which was a ghastly failure. That was the total of the would-be serious work that I actually set down on paper during all those years.

However, throughout this time I did in a sense engage in literary activities. To begin with there was the made-to-order stuff which I produced quickly, easily and without much pleasure to myself. Apart from school work, I wrote vers d’occasion, semi-comic poems which I could turn out at what now seems to me astonishing speed – at fourteen I wrote a whole rhyming play, in imitation of Aristophanes, in about a week – and helped to edit school magazines, both printed and in manuscript. These magazines were the most pitiful burlesque stuff that you could imagine, and I took far less trouble with them than I now would with the cheapest journalism. But side by side with all this, for fifteen years or more, I was carrying out a literary exercise of a quite different kind: this was the making up of a continuous “story” about myself, a sort of diary existing only in the mind. I believe this is a common habit of children and adolescents. As a very small child I used to imagine that I was, say, Robin Hood, and picture myself as the hero of thrilling adventures, but quite soon my “story” ceased to be narcissistic in a crude way and became more and more a mere description of what I was doing and the things I saw. For minutes at a time this kind of thing would be running through my head: ‘He pushed the door open and entered the room. A yellow beam of sunlight, filtering through the muslin curtains, slanted on to the table, where a matchbox, half-open, lay beside the inkpot. With his right hand in his pocket he moved across to the window. Down in the street a tortoiseshell cat was chasing a dead leaf,’ etc., etc. This habit continued until I was about twenty-five, right through my non-literary years. Although I had to search, and did search, for the right words, I seemed to be making this descriptive effort almost against my will, under a kind of compulsion from outside. 
The ‘story’ must, I suppose, have reflected the styles of the various writers I admired at different ages, but so far as I remember it always had the same meticulous descriptive quality.

When I was about sixteen I suddenly discovered the joy of mere words, i.e. the sounds and associations of words. The lines from Paradise Lost –

So hee with difficulty and labour hard
Moved on: with difficulty and labour hee,

which do not now seem to me so very wonderful, sent shivers down my backbone; and the spelling ‘hee’ for ‘he’ was an added pleasure. As for the need to describe things, I knew all about it already. So it is clear what kind of books I wanted to write, in so far as I could be said to want to write books at that time. I wanted to write enormous naturalistic novels with unhappy endings, full of detailed descriptions and arresting similes, and also full of purple passages in which words were used partly for the sake of their sound. And in fact my first completed novel, Burmese Days, which I wrote when I was thirty but projected much earlier, is rather that kind of book.

I give all this background information because I do not think one can assess a writer’s motives without knowing something of his early development. His subject-matter will be determined by the age he lives in – at least this is true in tumultuous, revolutionary ages like our own – but before he ever begins to write he will have acquired an emotional attitude from which he will never completely escape. It is his job, no doubt, to discipline his temperament and avoid getting stuck at some immature stage, or in some perverse mood: but if he escapes from his early influences altogether, he will have killed his impulse to write. Putting aside the need to earn a living, I think there are four great motives for writing, at any rate for writing prose. They exist in different degrees in every writer, and in any one writer the proportions will vary from time to time, according to the atmosphere in which he is living. They are:

Sheer egoism. Desire to seem clever, to be talked about, to be remembered after death, to get your own back on grown-ups who snubbed you in childhood, etc., etc. It is humbug to pretend this is not a motive, and a strong one. Writers share this characteristic with scientists, artists, politicians, lawyers, soldiers, successful business men – in short, with the whole top crust of humanity. The great mass of human beings are not acutely selfish. After the age of about thirty they abandon individual ambition – in many cases, indeed, they almost abandon the sense of being individuals at all – and live chiefly for others, or are simply smothered under drudgery. But there is also the minority of gifted, willful people who are determined to live their own lives to the end, and writers belong in this class. Serious writers, I should say, are on the whole more vain and self-centered than journalists, though less interested in money.
Aesthetic enthusiasm. Perception of beauty in the external world, or, on the other hand, in words and their right arrangement. Pleasure in the impact of one sound on another, in the firmness of good prose or the rhythm of a good story. Desire to share an experience which one feels is valuable and ought not to be missed. The aesthetic motive is very feeble in a lot of writers, but even a pamphleteer or writer of textbooks will have pet words and phrases which appeal to him for non-utilitarian reasons; or he may feel strongly about typography, width of margins, etc. Above the level of a railway guide, no book is quite free from aesthetic considerations.
Historical impulse. Desire to see things as they are, to find out true facts and store them up for the use of posterity.
Political purpose – using the word ‘political’ in the widest possible sense. Desire to push the world in a certain direction, to alter other people’s idea of the kind of society that they should strive after. Once again, no book is genuinely free from political bias. The opinion that art should have nothing to do with politics is itself a political attitude.

It can be seen how these various impulses must war against one another, and how they must fluctuate from person to person and from time to time. By nature – taking your ‘nature’ to be the state you have attained when you are first adult – I am a person in whom the first three motives would outweigh the fourth. In a peaceful age I might have written ornate or merely descriptive books, and might have remained almost unaware of my political loyalties. As it is I have been forced into becoming a sort of pamphleteer. First I spent five years in an unsuitable profession (the Indian Imperial Police, in Burma), and then I underwent poverty and the sense of failure. This increased my natural hatred of authority and made me for the first time fully aware of the existence of the working classes, and the job in Burma had given me some understanding of the nature of imperialism: but these experiences were not enough to give me an accurate political orientation. Then came Hitler, the Spanish Civil War, etc. By the end of 1935 I had still failed to reach a firm decision. I remember a little poem that I wrote at that date, expressing my dilemma:

A happy vicar I might have been
Two hundred years ago,
To preach upon eternal doom
And watch my walnuts grow

But born, alas, in an evil time,
I missed that pleasant haven,
For the hair has grown on my upper lip
And the clergy are all clean-shaven.

And later still the times were good,
We were so easy to please,
We rocked our troubled thoughts to sleep
On the bosoms of the trees.

All ignorant we dared to own
The joys we now dissemble;
The greenfinch on the apple bough
Could make my enemies tremble.

But girls’ bellies and apricots,
Roach in a shaded stream,
Horses, ducks in flight at dawn,
All these are a dream.

It is forbidden to dream again;
We maim our joys or hide them;
Horses are made of chromium steel
And little fat men shall ride them.

I am the worm who never turned,
The eunuch without a harem;
Between the priest and the commissar
I walk like Eugene Aram;

And the commissar is telling my fortune
While the radio plays,
But the priest has promised an Austin Seven,
For Duggie always pays.

I dreamt I dwelt in marble halls,
And woke to find it true;
I wasn’t born for an age like this;
Was Smith? Was Jones? Were you?

The Spanish war and other events in 1936-37 turned the scale and thereafter I knew where I stood. Every line of serious work that I have written since 1936 has been written, directly or indirectly, against totalitarianism and for democratic socialism, as I understand it. It seems to me nonsense, in a period like our own, to think that one can avoid writing of such subjects. Everyone writes of them in one guise or another. It is simply a question of which side one takes and what approach one follows. And the more one is conscious of one’s political bias, the more chance one has of acting politically without sacrificing one’s aesthetic and intellectual integrity.

What I have most wanted to do throughout the past ten years is to make political writing into an art. My starting point is always a feeling of partisanship, a sense of injustice. When I sit down to write a book, I do not say to myself, ‘I am going to produce a work of art’. I write it because there is some lie that I want to expose, some fact to which I want to draw attention, and my initial concern is to get a hearing. But I could not do the work of writing a book, or even a long magazine article, if it were not also an aesthetic experience. Anyone who cares to examine my work will see that even when it is downright propaganda it contains much that a full-time politician would consider irrelevant. I am not able, and do not want, completely to abandon the world view that I acquired in childhood. So long as I remain alive and well I shall continue to feel strongly about prose style, to love the surface of the earth, and to take a pleasure in solid objects and scraps of useless information. It is no use trying to suppress that side of myself. The job is to reconcile my ingrained likes and dislikes with the essentially public, non-individual activities that this age forces on all of us.

It is not easy. It raises problems of construction and of language, and it raises in a new way the problem of truthfulness. Let me give just one example of the cruder kind of difficulty that arises. My book about the Spanish civil war, Homage to Catalonia, is of course a frankly political book, but in the main it is written with a certain detachment and regard for form. I did try very hard in it to tell the whole truth without violating my literary instincts. But among other things it contains a long chapter, full of newspaper quotations and the like, defending the Trotskyists who were accused of plotting with Franco. Clearly such a chapter, which after a year or two would lose its interest for any ordinary reader, must ruin the book. A critic whom I respect read me a lecture about it. ‘Why did you put in all that stuff?’ he said. ‘You’ve turned what might have been a good book into journalism.’ What he said was true, but I could not have done otherwise. I happened to know, what very few people in England had been allowed to know, that innocent men were being falsely accused. If I had not been angry about that I should never have written the book.

In one form or another this problem comes up again. The problem of language is subtler and would take too long to discuss. I will only say that of late years I have tried to write less picturesquely and more exactly. In any case I find that by the time you have perfected any style of writing, you have always outgrown it. Animal Farm was the first book in which I tried, with full consciousness of what I was doing, to fuse political purpose and artistic purpose into one whole. I have not written a novel for seven years, but I hope to write another fairly soon. It is bound to be a failure, every book is a failure, but I do know with some clarity what kind of book I want to write.

Looking back through the last page or two, I see that I have made it appear as though my motives in writing were wholly public-spirited. I don’t want to leave that as the final impression. All writers are vain, selfish, and lazy, and at the very bottom of their motives there lies a mystery. Writing a book is a horrible, exhausting struggle, like a long bout of some painful illness. One would never undertake such a thing if one were not driven on by some demon whom one can neither resist nor understand. For all one knows that demon is simply the same instinct that makes a baby squall for attention. And yet it is also true that one can write nothing readable unless one constantly struggles to efface one’s own personality. Good prose is like a windowpane. I cannot say with certainty which of my motives are the strongest, but I know which of them deserve to be followed. And looking back through my work, I see that it is invariably where I lacked a political purpose that I wrote lifeless books and was betrayed into purple passages, sentences without meaning, decorative adjectives and humbug generally.

Gangrel, No. 4, Summer 1946

Cybersecurity of Frontier AI Models

Deric Cheng — Thu, 25 Apr 2024 14:51:20 GMT

Published on April 25, 2024 2:51 PM GMT

This article is part of a series of ~10 posts comprising a 2024 State of the AI Regulatory Landscape Review, conducted by the Governance Recommendations Research Program at Convergence Analysis. Each post will cover a specific domain of AI governance (e.g. incident reporting, safety evals, model registries, etc.). We’ll provide an overview of existing regulations, focusing on the US, EU, and China as the leading governmental bodies currently developing AI legislation. Additionally, we’ll discuss the relevant context behind each domain and conduct a short analysis.

This series is intended to be a primer for policymakers, researchers, and individuals seeking to develop a high-level overview of the current AI governance space. We’ll publish individual posts on our website and release a comprehensive report at the end of this series.

What cybersecurity issues arise from the development of frontier AI models?

One of the primary issues that has caught the attention of regulators is the protection of the intellectual property and sensitive data associated with frontier AI models (called “dual-use foundation models” in US legislation and “general-purpose AI” (“GPAI”) in EU legislation).

In particular, legislators are concerned that as frontier AI models become more capable, unregulated access to the underlying code or abilities of these models will result in dangerous outcomes. For example, current AI models can readily distribute information hazards, such as instructions for building homemade weapons or techniques for committing crimes. As a result, they’re typically trained during a fine-tuning phase to reject such requests. Bypassing the cybersecurity of such models could allow that fine-tuning to be removed, so that dangerous requests are answered. Other cybersecurity risks include exposing sensitive user data, or leaking proprietary ML architectural decisions to direct competitors and geopolitical adversaries (e.g. Chinese organizations, in the case of the US).

Currently, the leading frontier AI models meet the following conditions, which are often collectively referred to as “closed-source” development:

Are privately owned by a large AI lab (e.g. OpenAI, Anthropic, or Google)
Present an API to fine-tuned models that are designed to reject dangerous or adversarial inputs
Do not have publicly shared training data or codebases
Do not have publicly shared model weights, which would allow for the easy replication of the core functionality of an AI model by third-parties
Encrypt and protect user data, such as LLM queries and responses

In contrast, open-source AI models typically share some combination of their training data, model code, and completed model weights for public and commercial use.

Unlike open-source models, which are freely available and lack cybersecurity protections by design, proprietary or closed-source models have stringent measures to safeguard such sensitive information. Preventing the theft or leakage of this information is critically important to the AI labs that develop these models, as it constitutes their competitive advantage and intellectual property.

What cybersecurity issues are AI labs concerned about?

Specifically, AI labs are concerned about preventing the following:

Leaking private user data would cause a company to violate key international privacy laws such as the GDPR, leading to substantial fines and loss of user trust.
Leaking the model weights of a frontier AI model would lead to external parties being able to run the model independently and remove any fine-tuning that protects from adversarial inputs.
Leaking the codebase would allow competing labs to learn directly from an organization’s technical decisions and accelerate competition.
Leaking the training data would allow competing labs to better train their models by incorporating new data, accelerating competition.

With effective security practices, it’s generally accepted that AI labs can feasibly prevent these forms of information from being leaked. Similar practices are used at all major tech corporations today to protect their existing codebases and private user data from breaches. Nevertheless, given the complexity of cybersecurity and the numerous potential targets, it is highly likely that a prominent AI lab will fall victim to a data breach involving a frontier AI model in the near future.

What cybersecurity issues are regulators concerned about?

Regulators are similarly concerned about effective cybersecurity for the same domains, albeit with different motivations:

Regulators currently strongly prioritize the protection of user data stored by companies, as a tenet of basic privacy rights as described in binding legislation such as the GDPR or China’s Personal Information Protection Law, or non-binding declarations such as the US AI Bill of Rights’ declaration on data privacy.
Regulators are just beginning to demand adequate protection of model weights, codebase, and training data of frontier AI models, for two reasons:

Leaking such data could benefit the R&D of geopolitical adversaries. In particular, the US government is highly invested in limiting the rate of AI development of Chinese organizations - leaking such data would counter these interests.
Leaking such data could allow third-parties to develop unregulated access to potentially dangerous frontier AI models. Currently, governments have well established methods to control closed-source models run by AI labs, by regulating the labs themselves. If access to the source code of these frontier models were more widely distributed, regulators would lose their ability to control the usage and distribution of these models.

Due to these interests, regulators are generally as invested in the cybersecurity of frontier AI models as the labs themselves are. Their incentives are well aligned in the case of cybersecurity for frontier models. However, in practice regulators have by and large left specific cybersecurity decisions up to independent parties, preferring to more broadly create requirements such as a “primary responsibility for information security” or “resilien[ce] against attack from third-parties”. Their enforcement of legislation such as the GDPR has been inconsistent and patchy.

What are current regulatory policies around cybersecurity for AI models?

China

China maintains a complex, detailed, and thorough set of data privacy requirements developed over the past two decades via legislation such as the PRC Cybersecurity Law, the PRC Data Security Law, and the PRC Personal Information Protection Law. Together, they constitute strong protections mandating the confidential treatment and encryption of personal data stored by Chinese corporations. Additionally, the PRC Cybersecurity Law has data localization requirements mandating that the user data of Chinese citizens be stored on servers in mainland China, ensuring that the Chinese government has more direct methods to access and control the usage of this data. All of these laws apply to data collected from users of LLMs in China.

China’s existing AI-specific regulations largely mirror the data privacy policies laid out in previous legislation, and often refer directly to such legislation for specific requirements. In particular, they extend data privacy requirements to the training data collected by Chinese organizations. However, they do not introduce any specific requirements for the cybersecurity of frontier AI models, such as properly securing model weights or codebases.

China’s Deep Synthesis Provisions include the following:

Article 7: Requires service providers to implement primary responsibility for information security, such as data security, personal information protection, and technical safeguards.
Article 14: Requires service providers to strengthen the management and security of training data, especially personal information included in training data.

China’s Interim Generative AI Measures include the following:

Article 7: Requires service providers to handle training data in accordance with the Cybersecurity Law and Data Security Law when carrying out pre-training and optimization of models.
Article 9: Requires that service providers bear responsibility for fulfilling online information security obligations in accordance with the law.
Article 11: Requires providers to keep user input information and usage records confidential and not illegally retain or provide such data to others.
Article 17: Requires security assessments for AI services with public opinion properties or social mobilization capabilities.

The EU

The EU has a comprehensive data privacy and security law that applies to all organizations operating in the EU or handling the personal data of EU citizens: the General Data Protection Regulation (GDPR). Adopted in 2016 and in force since 2018, it does not contain language specific to AI systems, but provides a strong base of privacy requirements for collecting user data, such as mandatory disclosures, purpose limitations, security, and rights to access one’s personal data.

The EU AI Act includes some cybersecurity requirements for organizations running “high-risk AI systems” or “general purpose AI models with systemic risk”. It generally identifies specific attack vectors that organizations should protect against, but provides little to no specificity about how an organization might protect against these attack vectors or what level of security is required.

Sections discussing cybersecurity for AI models include:

Article 15: High-risk AI systems should be resilient against attacks by third-parties against system vulnerabilities. Specific vulnerabilities include:
- Attacks trying to manipulate the training dataset (‘data poisoning’)
- Attacks on pre-trained components used in training (‘model poisoning’)
- Inputs designed to cause the model to make a mistake (‘adversarial examples’ or ‘model evasion’)
- Confidentiality attacks or model flaws
Article 52d: Providers of general-purpose AI models with systemic risk shall:
- Conduct adversarial testing of the model to identify and mitigate systemic risk
- Assess and mitigate systemic risks from the development, market introduction, or use of the model
- Document and report serious cybersecurity incidents
- Ensure an adequate level of cybersecurity protection

The US

Compared to the EU and China, the US Executive Order on AI places the greatest priority on the cybersecurity of frontier AI models (beyond data privacy requirements), in accordance with the US’ developing interest in limiting Chinese access to US technologies. It is developing specific reporting requirements regarding cybersecurity for companies developing dual-use foundation models, and has requests for reports out to various agencies to investigate AI model cybersecurity implications across a number of domains.

Specific regulatory text in the Executive Order includes:

Section 4.2: This section establishes reporting requirements to the Secretary of Commerce for measures taken to protect the model training process and weights of dual-use foundation models, including:

Companies developing dual-use foundation models must provide information on physical and cybersecurity protections for the model training process and model weights, and the results of any red-team testing for model security
Directs the Secretary of Commerce to define the technical conditions under which models and computing clusters would be subject to the reporting requirements in 4.2(a). Until defined, this applies to:
1. Any model trained using over 10²⁶ integer or floating-point operations (FLOP) in total
2. Any model trained using over 10²³ FLOP, if using primarily biological sequence data
3. Any computing cluster with data center networking of over 100 Gbit/s and a theoretical maximum computing capacity of 10²⁰ integer or floating-point operations per second (FLOP/s) for training AI.

Section 4.3: This section requires that a report is delivered to the Secretary of Homeland Security in 90 days on potential risks related to the use of AI in critical infrastructure sectors, including ways in which AI may make infrastructure more vulnerable to critical failures, physical attacks, and cyber attacks.
- It also requests that the Secretary of the Treasury issue a public report on best practices for financial institutions to manage AI-specific cybersecurity risks.
Section 4.6: The Secretary of Commerce shall solicit input for a report evaluating the risks associated with open-sourced model weights of dual-use foundation models, including the fine-tuning of open-source models, potential benefits to innovation and research, and potential mechanisms to manage risks.
Section 7.3: The Secretary of HHS shall develop a plan [that includes the]... incorporation of safety, privacy, and security standards into the software-development lifecycle for protection of personally identifiable information, including measures to address AI-enhanced cybersecurity threats in the health and human services sector.
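
As a rough illustration, the interim Section 4.2 thresholds can be written as a simple check. This is a minimal sketch, not an official tool: the class and function names are hypothetical, and it assumes the model thresholds count total operations over the training run while the cluster threshold is a rate (operations per second). Treating a cluster capacity of exactly 10²⁰ as reportable is also an assumption.

```python
from dataclasses import dataclass

# Interim thresholds from Section 4.2 of the Executive Order,
# in force until the Secretary of Commerce defines technical conditions.
MODEL_TRAINING_OPS = 1e26       # total operations for a training run
BIO_MODEL_TRAINING_OPS = 1e23   # total operations, primarily biological sequence data
CLUSTER_NETWORK_GBIT_S = 100    # data center networking, Gbit/s
CLUSTER_OPS_PER_SECOND = 1e20   # theoretical maximum capacity, operations per second


@dataclass
class TrainingRun:
    """Hypothetical description of a single model training run."""
    total_ops: float
    primarily_biological_data: bool = False


@dataclass
class Cluster:
    """Hypothetical description of a co-located computing cluster."""
    network_gbit_s: float
    peak_ops_per_second: float


def model_must_report(run: TrainingRun) -> bool:
    """True if the run exceeds the interim training-compute threshold."""
    if run.primarily_biological_data:
        threshold = BIO_MODEL_TRAINING_OPS
    else:
        threshold = MODEL_TRAINING_OPS
    return run.total_ops > threshold


def cluster_must_report(cluster: Cluster) -> bool:
    """True if the cluster meets both the networking and capacity conditions.

    Assumption: networking must be strictly over 100 Gbit/s, and a capacity
    of exactly 1e20 operations per second counts as meeting the threshold.
    """
    return (
        cluster.network_gbit_s > CLUSTER_NETWORK_GBIT_S
        and cluster.peak_ops_per_second >= CLUSTER_OPS_PER_SECOND
    )
```

For example, `model_must_report(TrainingRun(total_ops=5e25))` falls under the general threshold, while the same compute spent on primarily biological sequence data would exceed the lower 10²³ threshold.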

The US does not have a comprehensive data privacy law similar to the GDPR or the PRC Personal Information Protection Law, nor a comprehensive cybersecurity law similar to the PRC Cybersecurity Law.

Convergence’s Analysis

User data of frontier AI models, and some forms of training data, will continue to fall under the jurisdiction of existing data privacy laws.

The mandatory protection of user data (such as encryption) has been well established and legislated over the past decade via legislation such as the GDPR or the PRC Personal Information Protection Law. In practice, these laws have been effective at achieving their goals. There’s no clear reason to establish a separate set of regulations solely for user data regarding AI models.
Training data used for developing AI models can sometimes include private or sensitive user data. As specified in China’s regulations, this data will also be protected under existing legislation, and specific clauses may be included to indicate that requirement.

Cybersecurity requirements beyond user privacy are likely to be targeted at a small group of leading AI labs.

As evidenced by the US Executive Order’s approach to reporting requirements on cybersecurity, the US is primarily concerned about mitigating technological poaching of leading AI models and systemic risks. It has set a reasonably high threshold for reporting, excluding all but the top 3-4 labs at this time.
The majority of companies using frontier AI models are likely to pay for access via APIs from leading AI labs, and therefore do not have many of the cybersecurity risks described above. As a result, such legislation is likely to be more targeted at a small group of AI labs and more closely enforced than data privacy laws.

Frontier AI labs already have strong incentives to enforce the protection of their closed-source AI models. It’s unlikely that mandatory legislation will meaningfully impact their cybersecurity efforts.

Leading AI labs have significant resources and technical expertise, and a strong vested interest in protecting their IP. As a result, they typically have large teams dedicated to cybersecurity, and tend to operate state-of-the-art security practices. Though these requirements seem plausible to legislate based on government interests, they are unlikely to drastically change the approach for frontier AI labs regarding cybersecurity.

Governments have historically been poor at enforcing data privacy requirements, and are mostly constrained to requiring reporting or reactively fining organizations after an incident occurs.

Practically, government agencies have not had the resources to conduct thorough audits of their cybersecurity requirements. As a result, enforcement of legislation such as the GDPR has been sporadic and inconsistent. We expect similar outcomes for cybersecurity laws around AI models.
In addition, legislative requirements around cybersecurity are intentionally vague because of their broad scope. For instance, the GDPR only requires that organizations “shall implement appropriate technical and organisational measures to ensure a level of security appropriate to the risk”. Such wording requires that each organization be considered on a case-by-case basis and opens the door to protracted legal disputes over fines.
When securing model weights, code, and training data of frontier AI models, the types of cybersecurity required can be much more complicated, as each new domain opens up new attack vectors. Governmental agencies likely don’t have the capabilities to thoroughly evaluate the complex cybersecurity practices of frontier AI labs. However, having a significantly reduced number of organizations to track (primarily leading AI labs) may aid enforcement.

The first future and the best future

KatjaGrace — Thu, 25 Apr 2024 06:40:05 GMT

Published on April 25, 2024 6:40 AM GMT

It seems to me worth trying to slow down AI development to steer successfully around the shoals of extinction and out to utopia.

But I was thinking lately: even if I didn’t think there was any chance of extinction risk, it might still be worth prioritizing a lot of care over moving at maximal speed. Because there are many different possible AI futures, and I think there’s a good chance that the initial direction affects the long term path, and different long term paths go to different places. The systems we build now will shape the next systems, and so forth. If the first human-level-ish AI is brain emulations, I expect a quite different sequence of events to if it is GPT-ish.

People genuinely pushing for AI speed over care (rather than just feeling impotent) apparently think there is negligible risk of bad outcomes, but also they are asking to take the first future to which there is a path. Yet possible futures are a large space, and arguably we are in a rare plateau where we could climb very different hills, and get to much better futures.

NIH Cancer Myths Myths

belkarx — Thu, 25 Apr 2024 05:43:47 GMT

Published on April 25, 2024 5:43 AM GMT

The NIH has a page called Cancer Myths and Misconceptions that you come across if you end up looking into cancer for long enough, aimed at bio-illiterate patients and their families.

Around half the things on that page are wrong at face value, and a solid percentage of those are contradicted by the pages and studies the NIH themselves link as a part of the answer.

This seems bad. The percentage of people that are going to look through the actual studies or even linked cancer.gov pages with expanded info instead of looking at the NIH's incorrect summaries is low, so most people end up getting the wrong impression and making care/preventative decisions based off of that.

The trend is that they are identifying statements that are inconclusive as "myths", implying that they've been disproven and can be safely ignored, when this is clearly untrue.

I present a revised "NIH Cancer Myths Myths" page

Format: followed by why it's misleading and some more correct takes (mostly without linked supporting papers, sorry, I'll go back and add them at some point if I feel like it - this is a "source: trust me bro, I looked into most of these thoroughly at various points in my life" writeup :).

Error: Not endorsing "conspiracies" even when some amount of caution is probably warranted, considering the literature

The Cell Phones page has studies that have positive results for cell phone use increasing at least acoustic neuromas, but each study is discredited by pointing out its honestly on-par-for-academia flaws (not done for other studies linked on the main page).

The EMF and Cancer page also spends most of the time claiming the existing studies are bad/underpowered, and glosses over the one "good" study that supports significantly increased risk of various cancers for workers with high EMF exposure.

Error: Giving misleading answers and then not elaborating

This response appears to discourage "holistic" treatments with "no herbal products have been shown to be effective for treating cancer", despite a large body of evidence to the contrary (like green tea reliably slowing metastasis, and garlic for slowing tumor growth by immune system support + a bunch of other pathways (GARLIC IS SO OP)). Their linked "more information" page discusses everything that doesn't work, and requires 2+ more clickthroughs to get to any actual studies on supplements, etc.

Their linked page is written in a way that makes it obvious that stress/high cortisol levels have significant impacts on tumor growth, metastasis, and cancer development, but they discount them because they're correlation studies, without at all discussing the large amount of in vivo research on cortisol (which supports high stress -> increased cancer risk).

There's some interesting research right now on whether the keto diet kills tumors because they depend on glucose as a primary power source. Tentative results are "maybe", so /shrug, but it seems worth mentioning that, as well as its associated risks.

Misc. commentary

There's actually a really interesting body of literature here, and they link none of it. Hypoxia causes tumors to grow faster because it changes their metabolism, and oxygenation has been suggested as a catalyst to chemotherapy. Boo NIH.

What They Got Right

The following are appropriately nuanced and AFAICT correct responses

This entry demonstrates that they do add interesting relevant research, like that on the association between viruses and cancer. Too bad they didn't do that with any other entries.

They're correct that the risk is very low, but from first principles, surgery can definitely get bits of tumor into the bloodstream -> encourage metastases if your surgeon isn't very careful. Relevant paper I found while poking around that explores ways of stopping the trauma caused by surgery and its secondary effects from encouraging further tumor growth.
