Not sure, but I have definitely noticed that llms have subtle "nuance sycophancy" for me. If I feel like there's some crucial nuance missing I'll sometimes ask and LLM in a way that tracks as first-order unbiased and get confirmation of my nuanced position. But at some point I noticed this in a situation where there were two opposing nuanced interpretations and tried modeling myself as asking "first-order-unbiased" questions having opposite views. And I got both views confirmed as expected. I've since been paranoid about this.

Generally I recommend this move of trying two opposing instances of "directional nuance" a few times. Basically I ask something like "the conventional view is X. Is the conventional view considered correct by modern historians?" Where X was formulated in a way that can naturally lead to a rebuttal Y. And then for sufficiently ambiguous and interpretation-dependent pairs of X and X', with fully opposing "nuanced corrections" Y and ¬Y. I've been pretty successful at this several times I think

Replying toStrategy of von Neumann and strategy of Rosenbergs

Dmitry Vaintrob7d

Strategy of von Neumann and strategy of Rosenbergs

I think a much more sympathetic and earlier proponent of the second policy would be Niels Bohr, or maybe Klaus Fuchs

Replying toHow Articulate Are the Whales?

Dmitry Vaintrob15d

How Articulate Are the Whales?

Ah never mind. I just re-read your last sentence and it seems like the papers consider this - in particular if the ocean floor were a factor this effect would likely depend on depth. Very cool "citizen research" piece on your end!

Replying toHow Articulate Are the Whales?

Dmitry Vaintrob15d

How Articulate Are the Whales?

Likely this is totally off base, but I wonder if you can distinguish beaming artifacts from enviromental distortion/ multipath effects where sounds interfere with themselves because of the environment (marine floor etc.). Based on a low-effort chatgpt interaction it seems like there are some studies of whales that measure the same sound in different locations. I wonder if there's enough publicly available data to see how measurement location affects the distance between peaks

Replying toAda Palmer: Inventing the Renaissance

Dmitry Vaintrob16d

Ada Palmer: Inventing the Renaissance

Arguably the same is true of modern LLMs. Even a base model is not a "generic person" but a "generic text". The model ranke-4b is also fine-tuned (at least on question formats and to stay in character). So it's a reconstructed version

The base-model is an unpolished diamond: it is full of raw potential, but extracting its knowledge is not always an effortless undertaking since it does not respond to questions in a chat-formatted manner.

A tale of three theories: sparsity, frustration, and statistical field theory

Dmitry Vaintrob

19d

This post is an informal preliminary writeup of a project that I've been working on with friends and collaborators. Some of the theory was developed jointly with Zohar Ringel, and we hope to write a more formal paper on it this year. Experiments are joint with Lucas Teixeira (and also an extensive use of llm assistants). This work is part of the research agenda we have been running with Lucas Teixeira and Lauren Greenspan at my organization Principles of Intelligence (formerly PIBBSS), and Lauren helped in writing this post.

Introduction

Pre-introduction

I have had trouble writing an introduction to this post. It combines three aspects of interpretability that I like and have thought about in... (read 5389 more words →)

Replying toLLM-generated text is not testimony

Dmitry Vaintrob3mo

LLM-generated text is not testimony

I like the analogy of a LARP. Characters in a book don't have reputation or human-like brain states that they honestly try to represent - but a good book can contain interesting, believable characters with consistent motivation, etc. I once participated in a well-organized fantasy LARP in graduate school. I was bad at it but it was a pretty interesting experience. In particular people who are good are able to act in character and express thoughts that "the character would be having" which are not identical to the logic and outlook of the player (I was bad at this, but other players could do it I think). In my case, I noticed... (read more)

Dmitry Vaintrob4mo

Very cool, thanks! I agree that Dalcy's epsilon-game picture makes arguments about ELO vs. optimality more principled

Dmitry Vaintrob4mo

I really like this question and this analysis! I think an extension I'd do here is to restrict the "3 reasonable moves" picture by looking at proposed moves of different agents in various games. My guess is that in fact the "effective information content" in a move at high-level play is less than 1 bit per move on average. If you had a big gpu to throw at this problem you could try to explicitly train an engine via an RL policy with a strong entropy objective and see what maximal entropy is compatible with play at different ratings

Dmitry Vaintrob5mo*Quick Take

SLT is a thermodynamic theory of Bayesian learning, but not the thermodynamic theory of Bayesian learning

SLT provides a rigorous mathematical framework for Bayesian learning in a certain regime, but I argue its practical applicability to real neural networks (even in a Bayesian learning/ high-level modeling context) is limited by finite-size effects and high-dimensionality. The valuable empirical work in this space is better understood as 'thermodynamic interpretations of ML' rather than validations of SLT proper

I've been having lots of conversations with people about SLT. I like SLT as a model for Bayesian learning a lot. At the same time I think that the assumptions of SLT are a model of the reality of... (read 3013 more words →)

Dmitry Vaintrob9mo

This is fascinating! If there's nothing else going on with your prompting, this looks like an incredibly hacky mid-inference intervention. My guess would be that openai applied some hasty patch against a sycophancy steering vector and this vector caught both actual sycophantic behaviors and descriptions of sycophantic behaviors in LLMs (I'd guess "sycophancy" as a word isn't so much the issue as the LLM behavior connotation). Presumably the patch they used activates at a later token in the word "sycophancy" in an AI context. This is incredibly low-tech and unsophisticated -- like much worse than the stories of repairing Apollo missions with duct tape. Even a really basic finetuning would not exhibit... (read more)

On the friendship fallacy and Owen Barfield

I just finished reading the book "The Fellowship: The Literary Lives of the Inklings", by Philip and Carol Zaleski. It's a book about an intellectually appealing and socially cohesive group of writers in Oxford who met weekly and critiqued each other's work, which included JRR Tolkien and CS Lewis. The book is very centered on Christianity (the writers also write Christian apologetics), but this works well, as understanding either Lewis or Tolkien or the Inklings in general without the lens of their deeply held thoughtful Christianity is about as silly as trying to analyze the Lion King without reading Hamlet.

But there is a core character in... (read 1287 more words →)

Steelmanning heuristic arguments

Dmitry Vaintrob

10mo

Introduction

This is a nuanced “I was wrong” post.

Something I really like about AI safety and EA/rationalist circles is the ease and positivity in people’s approach to being criticised.^[1] For all the blowups and stories of representative people in the communities not living up to the stated values, my experience so far has been that the desire to be truth-seeking and to stress-test your cherished beliefs is a real, deeply respected and communally cultured value. This in particular explains my ability to keep getting jobs and coming to conferences in this community, despite being very eager to criticise and call bullshit on people’s theoretical agendas.

One such agenda that I’ve been a somewhat vocal critic of (and... (read 4914 more words →)

Memorization-generalization in practice

Dmitry Vaintrob

Short post today, which is part II.1 or my series on tempering and SLT (see part one here). In this post I’ll explain in a bit more detail the “in practice” connection that experiments should see between the learning coefficient spectrum, tempering, and empirical measurements of the learning coefficient. In future installments of this part I’ll explain a bit the theory behind this and how the predictions from information theory have remarkable qualitative agreement with predictions from singular learning theory (with the caveat that in the SLT picture, each circuit is "fuzzy", and has a small continuous spectrum of its own). I'll then relate this picture to some notions inherent in the... (read 980 more words →)

Efficiency spectra and “bucket of circuits” cartoons

Dmitry Vaintrob

This is “Part I.75” of my series on the memorization-generalization spectrum and SLT.

I’m behind on some posts, so I thought I would try to explain, in a straightforward and nontechnical way, the main takeaway from my previous post for people thinking about and running experiments with neural nets. I’m going to punt on some technical discussions (in particular the discussion on SLT-type phenomenon related to “degrading a single circuit” and a deeper discussion of “locality”), and give an informal explanation of local efficiency spectra and the associated spectrum of (local) learning coefficients.

I will next list a few predictions – both “purely cartoon” predictions, and how they are complicated by reality – about... (read 1931 more words →)

The memorization-generalization spectrum and learning coefficients

Dmitry Vaintrob

This is part I.5 of the series on “generalization spectra” and SLT. I’ll try to make it readable independently of Part I.

The purpose of this post is to reconceptualize the ideas from the last post in a “more physically principled” (and easier-to-measure) sense, and to connect it more clearly to the specifics of neural nets (such as continuously- rather than discretely- parameterized programs), and to the idea of memorization-generalization tradeoffs. In particular I’ll introduce the core conceptual player in thinking of memorization-generalization “spectra”, which is a parameter I’ll call the learning coefficient (notation borrowed from SLT) at a given precision regime. I’ll explain that this parameter can be thought of as a measure of... (read 2748 more words →)

My supervillain origin story

Dmitry Vaintrob

When I started graduate school (for math), I was very interested in big ideas. I had had a couple experiences of having general research intuitions pan out really well and felt like the core of good research is having a brave idea, a gestalt. I went into grad school looking for the “gestalt people”. The people whose math had that mysterious, cutting edge flavor but was not too pop (at the time the sexiest thing around was higher category theory and I was drawn to it and tried to learn it, but I didn’t want to do the “common” thing of working in that field). I ended up choosing an advisor, and skipped... (read 1452 more words →)

121

The generalization phase diagram

Dmitry Vaintrob

Introduction

This is part I of 3.5 planned posts on the “tempered posterior” (one of which will be “SLT in a nutshell”). It is a kind of moral sequel to “Dmitry’s Koan” and “Logits, log-odds, and loss for parallel circuits”, and is related to the post on grammars. It can be read independently of any of these.

In my view, one of the most valuable contributions of singular learning theory (SLT) so far has been the introduction and serious experimental use of tempering (what in my Koan I call “natural degradation”) in the study of neural nets. As I’ve mentioned in a few of my posts this month, I think that this is a seriously underutilized notion,... (read 4729 more words →)

On polytopes

Dmitry Vaintrob

[Epistemic status: slightly ranty. This is a lightly edited slack chat, and so may be lower-quality.]

I am surprised by the perennial spikes in excitement about "polytopes" and "tropical geometry on activation space" in machine learning and interpretability^[1]. I'm not going to discuss tropical geometry in this post in depth (and might save it for later -- it will have to wait until I'm in a less rant-y mood^[2]).

As I'll explain below, I think some interesting questions and insights are extractable by suitably weakening the "polytopes" picture, and a core question it opens up (that of "statistical geometry" -- see below) is very deep and worth studying much more systematically. However if taken... (read 3503 more words →)

•••

QFT and neural nets: the basic idea

Dmitry Vaintrob

Previously in the series: The laws of large numbers and Basics of Bayesian learning.

Reminders: formalizing learning in ML and Bayesian learning

Learning and inference in neural nets and Bayesian models

As a very basic sketch, in order to specify an ML algorithm one needs five pieces of data.

An architecture: i.e., a parametrized space of functions that associates to each weight vector $θ \in R^{w}$ a function $f_{θ} : R^{d_{i n}} \to R^{d_{o u t}},$ from some input space to an output space.
An initialization prior on weights. This is a (usually stochastic) algorithm to initialize a weight from which to begin learning. Generally this is some Gaussian distribution on the weight $θ \in R^{w} .$ While this is often ignored, in many contexts it is actually quite important

... (read 2237 more words →)

Writing experiments and the banana escape valve

Dmitry Vaintrob

[Note: This is not alignment-related, but rather a spacefiller personal blog post.]

I've been trying to write a public post every day of January. So far I’ve been enjoying it. I don’t think this approach works for everyone: in particular, I’ve also been hanging on by a thread to the schedule and to the ability to sleep. I publicly committed to write 3 posts a week, but my “secret” personal goal to write a post every day for the month. Not only in order to have more output, but because in the past intense daily deadlines have worked pretty well in the past; also there’s a kind of scientific idea of “you’ll get... (read 576 more words →)

Statistical localization in disordered systems, and dreaming of more realistic interpretability endpoints

[epistemic status: half fever dream, half something I think is an important point to get across. Note that the physics I discuss is not my field though close to my interests. I have not carefully engaged with it or read the relevant papers -- I am likely to be wrong about the statements made and the language used.]

A frequent discussion I get into in the context of AI is "what is an endpoint for interpretability". I get into this argument from two sides:

arguing with interpretability purists, who say that the only way to get robust safety from interpretability is to mathematically

... (read 1413 more words →)

Why I'm in AI sequence: 2020 Journal entry about gpt3

I moved from math academia to full-time AI safety a year ago -- in this I'm in the same boat as Adam Shai, whose reflection post on the topic I recommend you read instead of this.

In making the decision, I went through a lot of thinking and (attempts at) learning about AI before that. A lot of my thinking had been about whether a pure math academic can make a positive difference in AI, and examples that I thought counterindicated this -- I finally decided this might be a good idea after talking to my sister Lizka extensively and doing MATS in Summer... (read 1518 more words →)

Why you should try degrading NN behavior in experiments.

I got some feedback on the post I wrote yesterday that seems right. The post is trying to do too many things, and not properly explaining what it is doing, why this is reasonable, and how the different parts are related.

I want to try to fix this, since I think the main piece of advice in this post is important, but gets lost in all the mess.

This main point is:

experimentalists should in many cases run an experiment on multiple neural nets with a variable complexity dial that allows some "natural" degradations of the NN's performance, and certain dials are better than others depending on

... (read 573 more words →)

Alignment is not all you need. But that doesn't mean you don't need alignment.

One of the fairytales I remember reading from my childhood is the "Three sillies". The story is about a farmer encountering three episodes of human silliness, but it's set in one more frame story of silliness: his wife is despondent because there is an axe hanging in their cottage, and she thinks that if they have a son, he will walk underneath the axe and it will fall on his head.

The frame story was much more memorable to me than any of the "body" stories, and I randomly remember this story much more often than any other fairytale I... (read 1164 more words →)

On the surprising effectiveness of linear regression as a toy model of generalization.

Another shortform today (since Sunday is the day of rest). This time it's really a hot take: I'm not confident about the model described here being correct.

Neural networks aren't linear -- that's the whole point. They notice interesting, compositional, deep information about reality. So when people use linear regression as a qualitative comparison point for behaviors like generalization and learning, I tend to get suspicious. Nevertheless, the track record of linear regression as a model for "qualitative" asymptotic behaviors is hard to deny. Linear regression models (neatly analyzable using random matrix theory) give surprisingly accurate models of double descent, scaling... (read 1024 more words →)

We have in fact reverse engineered alien datastructures.

I'm trying to keep up a regular schedule of writing for my Nanowrimo project. I'm working on a post that's not done yet, so for today I'll write a quick (and not very high-effort) take related to discussions I've had with more doomy friends (I'm looking at you Kaarel Hänni). A good (though incomplete) crux on why something like "faithful interpretability" may be hard (faithful here being "good enough to genuinely decompose enough of the internal thought process to notice and avoid deception") is contained in Tsvi's post on "alien datastructures". I think it's a great piece, though I largely probably agree with it only... (read 1316 more words →)

•••

LESSWRONG
LW

LESSWRONG
LW

Dmitry Vaintrob

Toward A Mathematical Framework for Computation in Superposition

My supervillain origin story

The subset parity learning problem: much more than you wanted to know

The purposeful drunkard

Dmitry Vaintrob

Dmitry Vaintrob

A tale of three theories: sparsity, frustration, and statistical field theory

Steelmanning heuristic arguments

Memorization-generalization in practice

Efficiency spectra and “bucket of circuits” cartoons

The memorization-generalization spectrum and learning coefficients

My supervillain origin story

The generalization phase diagram

Dmitry Vaintrob

Toward A Mathematical Framework for Computation in Superposition

My supervillain origin story

The subset parity learning problem: much more than you wanted to know

The purposeful drunkard

Dmitry Vaintrob

Dmitry Vaintrob

A tale of three theories: sparsity, frustration, and statistical field theory

Steelmanning heuristic arguments

Memorization-generalization in practice

Efficiency spectra and “bucket of circuits” cartoons

The memorization-generalization spectrum and learning coefficients

My supervillain origin story

The generalization phase diagram

Introduction

Pre-introduction

SLT is a thermodynamic theory of Bayesian learning, but not the thermodynamic theory of Bayesian learning

On the friendship fallacy and Owen Barfield

Introduction

Introduction

Reminders: formalizing learning in ML and Bayesian learning

Learning and inference in neural nets and Bayesian models

Why I'm in AI sequence: 2020 Journal entry about gpt3