I do AI Alignment research. Currently at METR, but previously at: Redwood Research, UC Berkeley, Good Judgment Project.
I'm also a part-time fund manager for the LTFF.
Obligatory research billboard website: https://chanlawrence.me/
Makes sense, thanks!
For compute I'm using hardware we have locally at my employer, so I have not tracked what the equivalent cost of renting it would be, but I'd guess it would be of the same order of magnitude as the API costs, or a factor of a few larger.
It's hard to say because I'm not even sure you can rent Titan Vs at this point,[1] and I don't know what your GPU utilization looks like, but I suspect API costs will dominate.
An H100 box is approximately $2/hour/GPU and A100 boxes are a fair bit under $1/hour (see e.g. pricing on Vast AI or Shadeform). And even A100s are ridiculously better than a Titan V, in that they have 40 or 80 GB of memory and are (pulling a number out of thin air) 4-5x faster.
So if o1 costs $2 per task and it's 15 minutes per task, compute will be an order of magnitude cheaper. (Though as for all similar evals, the main cost will be engineering effort from humans.)
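As a rough sanity check, here's the back-of-envelope arithmetic behind that claim, using the illustrative figures above (a rented A100 at roughly $1/hour, ~15 minutes of single-GPU time per task, and ~$2 of o1 API per task; these are assumptions, not measured numbers):

```python
# Back-of-envelope cost comparison with the rough figures from this comment.
A100_PER_GPU_HOUR = 1.00     # assumed rental price for one A100 (~$1/hour)
TASK_MINUTES = 15            # assumed single-GPU wall-clock time per task
O1_API_COST_PER_TASK = 2.00  # assumed o1 API spend per task

gpu_cost_per_task = A100_PER_GPU_HOUR * TASK_MINUTES / 60  # ~$0.25
ratio = O1_API_COST_PER_TASK / gpu_cost_per_task           # ~8x

print(f"GPU compute per task: ${gpu_cost_per_task:.2f}")
print(f"o1 API per task:      ${O1_API_COST_PER_TASK:.2f}")
print(f"API / compute ratio:  ~{ratio:.0f}x")
```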
I failed to find an option to rent them online, and I suspect the best way I can acquire them is by going to UC Berkeley and digging around in old compute hardware.
This is really impressive -- could I ask how long this project took, how long each eval takes to run on average, and what you spent on compute/API credits?
(Also, I found the preliminary BoK vs 5-iteration results especially interesting, especially the speculation on reasoning models.)
(Disclaimer: have not read the piece in full)
If “reasoning models” count as a breakthrough of the relevant size, then I'd argue that there have been quite a few of these in the last 10 years: skip connections/residual stream (2015-ish), transformers instead of RNNs (2017), RLHF/modern policy gradient methods (2017ish), scaling hypothesis (2016-20 depending on the person and which paper), Chain of Thought (2022), massive MLP MoEs (2023-4), and now Reasoning RL training (2024).
I think the title greatly undersells the importance of these statements/beliefs. (I would've preferred either part of your quote or a call to action.)
I'm glad that Sam is putting in writing what many people talk about. People should read it and take these statements seriously.
Nit:
> OpenAI presented o3 on the Friday before Thanksgiving, at the tail end of the 12 Days of Shipmas.
Should this say Christmas?
I think writing this post was helpful to me in thinking through my career options. I've also been told by others that the post was quite valuable to them as an example of someone thinking through their career options.
Interestingly, I left METR (then ARC Evals) about a month and a half after this post was published. (I continued to be involved with the LTFF.) I then rejoined METR in August 2024. In between, I worked on ambitious mech interp and did some late stage project management and paper writing (including some for METR). I also organized a mech interp workshop at ICML 2024, which, if you squint, counts as "onboarding senior academics".
I think leaving METR was a mistake ex post, even if it made sense ex ante. I think my ideas around mech interp when I wrote this post weren't that great, even if I thought the projects I ended up working on were interesting (see e.g. Compact Proofs and Computation in Superposition). While the mech interp workshop was very well attended (e.g. the room was so crowded that people couldn't get in due to fire code) and pretty well received, I'm not sure how much value it ended up producing for AIS. Also, I think I was undervaluing the resources available to METR as well as how much I could do at METR.
If I were to make a list for myself in 2023 using what I know now, I'd probably have replaced "onboarding senior academics" with "get involved in AI policy via the AISIs", and instead of "writing blog posts or takes in general", I'd have the option of "build common knowledge in AIS via pedagogical posts". Though realistically, knowing what I know now, I'd have told my past self to try to better leverage my position at METR (and provided him with a list of projects to do at METR) instead of leaving.
Also, I regret both that I called it "ambitious mech interp", and that this post became the primary reference for what this term meant. I should've used a more value-neutral name such as "rigorous model internals" and written up a separate post describing it.
I think this post made an important point that's still relevant to this day.
If anything, this post is more relevant in late 2024 than in early 2023, as the pace of AI makes ever more people want to be involved, while more and more mentors have moved towards doing object level work. Due to the relative reduction of capacity for evaluating new AIS researchers, there's more reliance on systems or heuristics to evaluate people now than in early 2023.
Also, I find it amusing that without the parenthetical, the title of the post makes another important point: "evals are noisy".
I think this post was useful in the context it was written in and has held up relatively well. However, I wouldn't actively recommend it to anyone as of Dec 2024 -- both because the ethos of the AIS community has shifted, making posts like this less necessary, and because many other "how to do research" posts were written that contain the same advice.
This post was inspired by conversations I had in mid-late 2022 with MATS mentees, REMIX participants, and various bright young people who were coming to the Bay to work on AIS (collectively, "kiddos"). The median kiddo I spoke with had read a small number of ML papers and a medium amount of LW/AF content, and was trying to string together an ambitious research project from several research ideas they recently learned about. (Or, sometimes they were assigned such a project by their mentors in MATS or REMIX.)
Unfortunately, I don't think modern machine learning is the kind of field where you can string together several ideas you just learned about and expect the research to consistently work out of the box. Many high-level claims even in published research papers are just... wrong, it can be challenging to reproduce results even when they are right, and even techniques that work reliably may not work for the reasons people think they do.
Hence, this post.
I think the core idea of this post held up pretty well with time. I continue to think that making contact with reality is very important, and I think the concrete suggestions for how to make contact with reality are still pretty good.
If I were to write it today, I'd probably add a fifth major reason for why it's important to make quick contact with reality: mental health/motivation. That is, producing concrete research outputs, even small ones, feels pretty essential to maintaining motivation for the vast majority of researchers. My guess is I missed this factor because I focused on the content of research projects, as opposed to the people doing the research.
Over the past two years, and especially in 2024, the ethos of the AIS community has shifted substantially toward empirical work.
The biggest part of this is because of the pace of AI. When this post was written, ChatGPT was a month old, and GPT-4 was still more than 2 months away. People both had longer timelines and thought of AIS in more conceptual terms. Many conceptual research projects of 2022 have fallen into the realm of the empirical as of late 2024.
Part of this is due to the rise of (dangerous capability) evals as a major AIS focus in 2023, which is both substantially more empirical compared to the median 2022 AIS research topic, and an area where making contact with reality can be as simple as "pasting a prompt into claude.ai".
Part of this is due to Anthropic's rise to being the central place for AIS researchers. "Being able to quickly produce ML results" is a major part of what it takes to get hired there as a junior researcher, and people know this.
Finally, there have been a decent number of posts or write-ups giving the same advice, e.g. Neel's written advice for his MATS scholars and a recent Alignment Forum post by Ethan Perez.
As a result, this post feels much less necessary or relevant in late December 2024 than in December 2022.
Evan joined Anthropic in late 2022 no? (Eg his post announcing it was Jan 2023 https://www.alignmentforum.org/posts/7jn5aDadcMH6sFeJe/why-i-m-joining-anthropic)
I think you’re correct on the timeline. I remember Jade/Jan proposing DC Evals in April 2022 (which was novel to me at the time), and Beth starting METR in June 2022, and I don’t remember there being such teams actually doing work (at least not publicly known) when she pitched me on joining in August 2022.
It seems plausible that Anthropic’s scaring laws project was already underway before then (and this is what they’re referring to), but proliferating QA datasets feels qualitatively different from DC Evals. Also, they were definitely doing other red teaming, just none that seems to be DC Evals.
My guess is it's <1 hour per task assuming just copilot access, and much less if you're allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you'd want to limit humans to comparable amounts of compute to get comparable numbers, which seems a bit trickier to make happen.
Is there a reason you can't do one of the existing tasks, just to get a sense of the difficulty?