All of LawrenceC's Comments + Replies

It would be really great to have human baselines, but it's very hard to do in practice. For a human to do one of these tasks, it would take several hours.

My guess is it's <1 hour per task assuming just Copilot access, and much less if you're allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you'd want to limit humans to comparable amounts of compute for the numbers to be comparable, which seems a bit trickier to make happen.

I don’t really have any funding for this project, but I might find someone that wants to do one task for fun, or do my best

... (read more)
1Håvard Tveit Ihle
I guess I was thinking that the human baseline should be without LLMs, because otherwise I could just forward the prompt to the best LLM, see what they did, and perhaps improve upon it, which would put the human level always at or above the best LLM. Then again, this is not how humans typically work now, so it's unclear what a «fair» comparison is. I guess it depends on what the human baseline is supposed to represent, and you have probably thought a lot about that question at METR.

I could, but it would not really be a fair comparison, since I have seen many of the LLMs' solutions and have seen what works. Doing a fresh task I made myself would not be totally fair either, since I will know more about the data than they do, but it would definitely be closer to fair.

Makes sense, thanks!

For compute I'm using hardware we have locally with my employer, so I have not tracked what the equivalent cost of renting it would be, but I guess it would be the same order of magnitude as the API costs, or a factor of a few larger.

It's hard to say because I'm not even sure you can rent Titan Vs at this point,[1] and I don't know what your GPU utilization looks like, but I suspect API costs will dominate.

An H100 box is approximately $2/hour/GPU and A100 boxes are a fair bit under $1/hour (see e.g. pricing on Vast AI or Shade... (read more)
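As a rough illustration, combining the A100 price above with the 5-15 minute run times and the ~$2-per-run o1-preview figure quoted later in the thread (all figures approximate, and the local hardware is a Titan V rather than an A100):

```latex
\text{GPU cost per run} \approx \tfrac{15}{60}\ \text{hr} \times \$1/\text{hr} = \$0.25
\qquad \text{vs.} \qquad
\text{o1-preview API cost per run} \approx \$2
```

On those numbers the API side dominates by roughly an order of magnitude for the most expensive model, though for the much cheaper models the comparison is less clear, as noted in the reply below.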

1Håvard Tveit Ihle
API costs will definitely dominate for o1-preview, but most of the runs are with models that are orders of magnitude cheaper, and then it is not clear what dominates. Going forward, models like o1-preview (or even more expensive) will probably dominate the cost, so the compute will probably be a small fraction.

This is really impressive -- could I ask how long this project took, how long each eval takes to run on average, and what you spent on compute/API credits?

(Also, I found the preliminary BoK vs 5-iteration results especially interesting, especially the speculation on reasoning models.)

3Håvard Tveit Ihle
Thank you!

I've been working on the automated pipeline as a part-time project for about two months, probably equivalent to 2-4 full-time weeks of work.

One run for one model and one task typically takes perhaps 5-15 minutes, but it can be up to about an hour (if they use their 10 min compute time efficiently, which they tend not to do).

Total API costs for the project are probably below $200 (if you do not count the credits used on Google's free tier). Most of the cost is for running o1-mini and o1-preview (even though o1-preview only went through a third of the runs compared to the other models). o1-preview costs about $2 for each run on each task.

For compute I'm using hardware we have locally with my employer, so I have not tracked what the equivalent cost of renting it would be, but I guess it would be the same order of magnitude as the API costs, or a factor of a few larger. I expect the API costs to dominate going forward though, if we want to run o3 models etc. through the eval.

(Disclaimer: have not read the piece in full)

If “reasoning models” count as a breakthrough of the relevant size, then I argue that there’s been quite a few of these in the last 10 years: skip connections/residual stream (2015-ish), transformers instead of RNNs (2017), RLHF/modern policy gradient methods (2017ish), scaling hypothesis (2016-20 depending on the person and which paper), Chain of Thought (2022), massive MLP MoEs (2023-4), and now Reasoning RL training (2024).

5snewman
See my response to Daniel (https://www.lesswrong.com/posts/auGYErf5QqiTihTsJ/what-indicators-should-we-watch-to-disambiguate-agi?commentId=WRJMsp2bZCBp5egvr). In brief: I won't defend my vague characterization of "breakthroughs", nor my handwavy estimates of how many are needed to reach AGI, how often they occur, and how the rate of breakthroughs might evolve. I would love to see someone attempt a more rigorous analysis along these lines (I don't feel particularly qualified to do so). I wouldn't expect that to result in a precise figure for the arrival of AGI, but I would hope for it to add to the conversation.

I think the title greatly undersells the importance of these statements/beliefs. (I would've preferred either part of your quote or a call to action.) 

I'm glad that Sam is putting in writing what many people talk about. People should read it and take it seriously.

6Nikola Jurkovic
I have edited the title in response to this comment

Nit:
> OpenAI presented o3 on the Friday before Thanksgiving, at the tail end of the 12 Days of Shipmas.

Should this say Christmas?

1[comment deleted]

I think writing this post was helpful to me in thinking through my career options. I've also been told by others that the post was quite valuable to them as an example of someone thinking through their career options. 

Interestingly, I left METR (then ARC Evals) about a month and a half after this post was published. (I continued to be involved with the LTFF.) I then rejoined METR in August 2024. In between, I worked on ambitious mech interp and did some late stage project management and paper writing (including some for METR). I also organized a mech ... (read more)

I think this post made an important point that's still relevant to this day. 

If anything, this post is more relevant in late 2024 than in early 2023, as the pace of AI progress makes ever more people want to be involved, while more and more mentors have moved towards doing object-level work. Due to the relative reduction of capacity for evaluating new AIS researchers, there's more reliance on systems or heuristics to evaluate people now than in early 2023.

Also, I find it amusing that without the parenthetical, the title of the post makes another important point: "evals are noisy". 

I think this post was useful in the context it was written in and has held up relatively well. However, I wouldn't actively recommend it to anyone as of Dec 2024 -- both because the ethos of the AIS community has shifted, making posts like this less necessary, and because many other "how to do research" posts have since been written that contain the same advice.

Background

This post was inspired by conversations I had in mid-late 2022 with MATS mentees, REMIX participants, and various bright young people who were coming to the Bay to work on AIS (collectively, "kid... (read more)

Evan joined Anthropic in late 2022 no? (Eg his post announcing it was Jan 2023 https://www.alignmentforum.org/posts/7jn5aDadcMH6sFeJe/why-i-m-joining-anthropic)

I think you're correct on the timeline. I remember Jade/Jan proposing DC Evals in April 2022 (which was novel to me at the time), Beth started METR in June 2022, and I don't remember there being such teams actually doing work (at least not publicly known) when she pitched me on joining in August 2022.

It seems plausible that anthropic’s scaring laws project was already under work before then (... (read more)

6ryan_greenblatt
lol

Otherwise, we could easily in the future release a model that is actually (without loss of generality) High in Cybersecurity or Model Autonomy, or much stronger at assisting with AI R&D, with only modest adjustments, without realizing that we are doing this. That could be a large or even fatal mistake, especially if circumstances would not allow the mistake to be taken back. We need to fix this.

[..]

This is a lower bound, not an upper bound. But what you need, when determining whether a model is safe, is an upper bound! So what do we do?

Part of the prob... (read more)

Re: the METR evaluations on o1. 

We'll be releasing more details of our evaluations of the o1 model we tested, in the same style as our blog posts for o1-preview and Claude 3.5 Sonnet (Old). This includes both more details on the general autonomy capability evaluations and the AI R&D results on RE-Bench.


Whereas the METR evaluation, presumably using final o1, was rather scary.
[..]
From the performance they got, I assume they were working with the full o1, but from the wording it is unclear that they got access to o1 pro?

Our evaluations we... (read more)

3LawrenceC
Part of the problem is the classic problem with model evaluations: elicitation efforts, by default, only ever provide existence proofs and rarely if ever provide completeness proofs. A prompt that causes the model to achieve a task provides strong evidence of model capability, but the space of reasonable prompts is far too vast to search exhaustively to truly demonstrate model incapability. Model incapability arguments generally rely on an implicit "we've tried as hard at elicitation as would be feasible post-deployment", but this is almost certainly not going to be the case, given the scale of pre-deployment evaluations vs post-deployment use cases.

The way you get a reasonable upper bound pre-deployment is by providing pre-deployment evaluators with some advantage over end users, for example by using a model that's not refusal trained or by allowing for small amounts of finetuning. OpenAI did do this in their original preparedness team bio evals; specifically, they provided experts with non-refusal fine-tuned models. But it's quite rare to see substantial advantages given to pre-deployment evaluators, for a variety of practical and economic reasons, and in-house usage likely predates pre-deployment capability/safety evaluations anyway.

This is really good, thanks so much for writing it! 

I hadn't heard of Whisper or ElevenLabs until today, and I'm excited to try them out.

Yeah, this has been my experience using Grammarly pro as well. 

LawrenceCΩ562

I’m not disputing that they were trained with next token prediction log loss (if you read the tech reports they claim to do exactly this) — I’m just disputing the “on the internet” part, due to the use of synthetic data and private instruction following examples.

LawrenceCΩ350

I mean, we don't know all the details, but Qwen2 was explicitly trained on synthetic data from Qwen1.5 + "high-quality multi-task instruction data". I wouldn't be surprised if the same were true of Qwen 1.5. 

From the Qwen2 report:

Quality Enhancement The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-t

... (read more)
3Arthur Conmy
> Qwen2 was explicitly trained on synthetic data from Qwen1.5

~~Where is the evidence for this claim? (Claude 3.5 Sonnet could also not find evidence on one rollout)~~ EDITED TO ADD: "these [Qwen] models are utilized to synthesize high-quality pre-training data" is clear evidence, I am being silly.

All other techniques mentioned here (e.g. filtering and adding more IT data at end of training) still sound like models "trained to predict the next word on the internet" (I don't think the training samples being IID early and late in training is an important detail)
LawrenceCΩ781

After thinking about it more, I think the LLaMA 1 refusals strongly suggest that this is an artefact of training data. So I've unendorsed the comment above.

It's still worth noting that modern models generally have filtered pre-training datasets (if not wholly synthetic or explicitly instruction-following datasets), and it's plausible to me that this (on top of ChatGPT contamination) is a large part of why we see much better instruction following/more eloquent refusals in modern base models.

LawrenceCΩ220

It's worth noting that there are reasons to expect the "base models" of both Gemma2 and Qwen 1.5 to demonstrate refusals -- neither is trained on unfiltered webtext.

We don't know what 1.5 was trained on, but we do know that Qwen2's pretraining data both contains synthetic data generated by Qwen1.5, and was filtered using Qwen1.5 models. Notably, its pretraining data explicitly includes "high-quality multi-task instruction data"! From the Qwen2 report:

Quality Enhancement The filtering algorithm has been refined with additional heuristic and model-based met

... (read more)
[This comment is no longer endorsed by its author]
8LawrenceC
After thinking about it more, I think the LLaMA 1 refusals strongly suggest that this is an artefact of training data. So I've unendorsed the comment above. It's still worth noting that modern models generally have filtered pre-training datasets (if not wholly synthetic or explicitly instruction-following datasets), and it's plausible to me that this (on top of ChatGPT contamination) is a large part of why we see much better instruction following/more eloquent refusals in modern base models.
LawrenceCΩ440

Ah, you're correct, it's from the original InstructGPT release in Jan 2022:
https://openai.com/index/instruction-following/

LawrenceCΩ330

(The Anthropic paper I cited predates ChatGPT by 7 months)

3habryka
Huh, interesting. Maybe the OpenAI statements about their models being "more aligned" came earlier than that in the context of Instruct-GPT? I definitely feel like I remember some Twitter threads and LW comment threads about it in the context of OpenAI announcements, and nothing in the context of Anthropic announcements.
LawrenceCΩ450

Pretty sure Anthropic's early assistant stuff used the word this way too: See e.g. Bai et al https://arxiv.org/abs/2204.05862

But yes, people complained about it a lot at the time

4habryka
Yeah, my sense is others (like Anthropic) followed along after OpenAI did that, though it seemed to me mostly to be against consensus in the alignment field (though I agree it's messy).

Thanks for the summaries, I found them quite useful and they've caused me to probably read some of these books soon. The following ones are both new to me and seem worth thinking more about:

  • You should judge a person's performance based on the performance of the ideal person that would hold their position
  • Document every task you do more than once, as soon as you do it the second time. 
  • Fun is important. (yes, really)
  • People should know the purpose of the organization (specifically, being able to recite a clear mission statement)
  • "I’m giving you these comme
... (read more)
1Nikola Jurkovic
Thanks for the comment :)  I don't really remember the books talking about this, I think they basically assume that the reader is a full-time manager and thus has time to do things like this. There's probably also an assumption that many of these can be done in an automated way (e.g. schedule sending a bunch of check-in messages).

Thanks for writing this!

I think that phased testing should be used during frontier model training runs. By this, I mean a testing approach which starts off with extremely low surface area tests, and gradually increases surface area. This makes it easy to notice sudden capability gains while decreasing the likelihood that the model takes over.

I actually think the proposal is more general than just for preventing AI escapes during diverse evals -- you want to start with low surface area tests because they're cheaper anyways, and you can use the performa... (read more)
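A hypothetical sketch of what this could look like in an eval harness; the suite names and the `run_eval_suite` callable are placeholders, not a real API:

```python
# A hypothetical sketch (not from the post) of "phased testing" during a training run:
# run cheap, low-surface-area eval suites first, and only escalate to broader suites
# if no suspicious capability jump shows up relative to the previous checkpoint.

def phased_testing(checkpoint, run_eval_suite, previous_scores, jump_threshold=0.15):
    """Run eval phases in order of increasing surface area.

    checkpoint: opaque handle to the model checkpoint being tested.
    run_eval_suite: callable (checkpoint, suite_name) -> score in [0, 1] (user-supplied).
    previous_scores: dict of suite_name -> score from the previous checkpoint.
    Returns (scores, alert): alert is None, or a message describing a capability jump.
    """
    # Suites ordered from lowest to highest "surface area" (and cost).
    phases = ["static_benchmarks", "sandboxed_tool_use", "full_agentic_tasks"]
    scores = {}
    for suite in phases:
        score = run_eval_suite(checkpoint, suite)
        scores[suite] = score
        prev = previous_scores.get(suite)
        if prev is not None and score - prev > jump_threshold:
            # Sudden capability gain: stop before exposing the model to
            # higher-surface-area evals, and flag for human review.
            return scores, f"capability jump on {suite}: {prev:.2f} -> {score:.2f}"
    return scores, None
```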

1Nikola Jurkovic
I think this point is completely correct right now but will become less correct in the future, as some measures to lower a model's surface area might be quite costly to implement. I'm mostly thinking of "AI boxing" measures here, like using a Faraday-caged cluster, doing a bunch of monitoring, and minimizing direct human contact with the model. 
LawrenceCΩ330

Very cool work; I'm glad it was done. 

That being said, I agree with Fabien that the title is a bit overstated, insofar as it's about your results in particular:

Thus, fine-tuned performance provides very little information about the best performance that would be achieved by a large number of actors fine-tuning models with random prompting schemes in parallel.

It's a general fact of ML that small changes in finetuning setup can greatly affect performance if you're not careful. In particular, it seems likely to me that the empirical details that Fabien ... (read more)

LawrenceC111

Good work, I'm glad that people are exploring this empirically.

That being said, I'm not sure that these results tell us very much about whether or not the MCIS theory is correct. In fact, something like your results should hold as long as the following facts are true (even without superposition):

  • Correct behavior: The model behavior is correct on distribution, and the correct behavior isn't super sensitive to many small variations to the input. 
  • Linear feature representations: The model encodes information along particular directions, and "reads-off" th
... (read more)
1StefanHex
Thanks for the comment Lawrence, I appreciate it!

  • I agree this doesn't distinguish superposition vs no superposition at all; I was more thinking about the "error correction" aspect of MCIS (and just assuming superposition to be true). But I'm excited for the SAE application too; we've got some experiments in the pipeline!
  • Your Correct behaviour point sounds reasonable, but I feel like it's not an explanation? I would have the same intuitive expectation, but that doesn't explain how the model manages to not be sensitive. Explanations I can think of, in increasing order of probability:
    • Story 0: Perturbations change activations and logprobs, but the answer doesn't change because the logprob difference was large. I don't think the KL divergence would behave like that.
    • Story 1: Perturbations do change the activations, but the difference in the logprobs is small due to layer norm, unembed, or softmax shenanigans.
      • We did a test-experiment of perturbing the 12th layer rather than the 2nd layer, and the difference between real-other and random disappeared. So I don't think it's a weird effect when activations get converted to outputs.
    • Story 2: Perturbations in a lower layer cause less perturbation in later layers if the model is on-distribution (+ similar story for sensitivity).
      • This is what the L2-metric plots (right panel) suggest, and also what I understand your story to be.
      • But this doesn't explain how the model does this, right? Are there simple stories for how this happens?
      • I guess there are lots of stories not limited to MCIS, anything along the lines of "ReLUs require thresholds to be passed"?

Based on that, I think the results still require some "error-correction" explanation, though you're right that this doesn't have to be MCIS (it's just that there's no other theory that doesn't also conflict with superposition?).
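For concreteness, here is a rough sketch (mine, not from the post) of the kind of perturb-and-measure experiment being discussed, using GPT-2 via Hugging Face transformers: add a perturbation along a chosen direction to one layer's residual stream and measure the KL divergence of the output distribution against the clean run.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")
inputs = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt")

def kl_after_perturbation(direction, scale, layer=2):
    """KL(clean || perturbed) at the final token after nudging one layer's output.

    The perturbation is added at every position, for simplicity.
    """
    with torch.no_grad():
        clean_logits = model(**inputs).logits[0, -1]

    def hook(module, inp, out):
        # out[0] is the block's hidden states; shift them along `direction`.
        hidden = out[0] + scale * direction / direction.norm()
        return (hidden,) + out[1:]

    handle = model.transformer.h[layer].register_forward_hook(hook)
    try:
        with torch.no_grad():
            pert_logits = model(**inputs).logits[0, -1]
    finally:
        handle.remove()

    return F.kl_div(
        F.log_softmax(pert_logits, dim=-1),   # "input" = perturbed distribution
        F.log_softmax(clean_logits, dim=-1),  # "target" = clean distribution
        log_target=True,
        reduction="sum",
    )

# e.g. compare a random direction against a "real-other" direction (the difference
# to another prompt's activations), at layer 2 vs layer 12.
random_dir = torch.randn(model.config.n_embd)
print(kl_after_perturbation(random_dir, scale=5.0))
```

Comparing a random direction against a "real-other" direction at an early versus a late layer would reproduce the kind of comparison described above.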
LawrenceC5320

This also continues the trend of OAI adding highly credentialed people who notably do not have technical AI/ML knowledge to the board.

This fact will be especially important insofar as a situation arises where e.g. some engineers at the company think that the latest system isn't safe. The board won't be able to engage with the arguments or evidence; it'll all come down to who they defer to.

LawrenceCΩ220

Have you tried instead 'skinny' NNs with a bias towards depth,

I haven't -- the problem with skinny NNs is that stacking MLP layers quickly makes things uninterpretable, and my attempts to reproduce slingshot -> grokking were done with the hope of interpreting the model before/after the slingshots.

That being said, you're probably correct that having more layers does seem related to slingshots. 

(Particularly for MLPs, which are notorious for overfitting due to their power.)

What do you mean by power here? 

4gwern
Just a handwavy term for VC dimension, expressivity, number of unique models, or whatever your favorite technical reification of "can be real smart and learn complicated stuff" is.

70b storing 6b bits of pure memorized info seems quite reasonable to me, maybe a bit high. My guess is there's a lot more structure to the world that the models exploit to "know" more things with fewer memorized bits, but this is a pretty low confidence take (and perhaps we disagree on what "memorized info" means here). That being said, SAEs as currently conceived/evaluated won't be able to find/respect a lot of the structure, so maybe 500M features is also reasonable.

I don't think SAEs will actually work at this level of sparsity though, so this is mostly

... (read more)

On the surface, their strategy seems absurd. They think doom is ~99% likely, so they're going to try to shut it all down - stop AGI research entirely. They know that this probably won't work; it's just the least-doomed strategy in their world model. It's playing to the outs, or dying with dignity.

The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like

... (read more)
2Seth Herd
Me, too! My reasons are a bit more complex, because I think much progress will continue, and overhangs do increase risk. But in sum, I'd support a global scaling pause, or pretty much any slowdown. I think a lot of people in the middle would too. That's why I suggested this as a possible compromise position.

I meant to say that installing an off switch is also a great idea that almost anyone who's thought about it would support.

I had been against slowdown because it would create both hardware and algorithmic overhang, making takeoff faster, and re-rolling the dice on who gets there first and how many projects reach it roughly at the same time. But I think slowdowns would focus effort on developing language model agents into full cognitive architectures on a trajectory to ASI. And that's the easiest alignment challenge we're likely to get. Slowdown would prevent jumping to the next, more opaque type of AI.
LawrenceC110

But I was quietly surprised by how many features they were using in their sparse autoencoders (respectively 1M, 4M, or 34M). Assuming Claude Sonnet has the same architecture of GPT-3, its residual stream has dimension 12K so the feature ratios are 83x, 333x, and 2833x, respectively[1]. In contrast, my team largely used a feature ratio of 2x, and Anthropic's previous work "primarily focus[ed] on a more modest 8× expansion". It does make sense to look for a lot of features, but this seemed to be worth mentioning.

There's both theoretical work (i.e. this theor... (read more)

4ryan_greenblatt
It seems plausible to me that a 70b model stores ~6 billion bits of memorized information. Naively, you might think this requires around 500M features (supposing that each "feature" represents 12 bits, which is probably a bit optimistic). I don't think SAEs will actually work at this level of sparsity though, so this is mostly beside the point. I'm pretty skeptical of a view like "scale up SAEs and get all the features". (If you wanted "feature" to mean something.)
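Spelling out the arithmetic behind that naive estimate (both numbers are the assumptions stated above, not measurements):

```latex
\frac{6\times 10^{9}\ \text{bits of memorized information}}{12\ \text{bits per feature}}
  = 5\times 10^{8}\ \text{features} \approx 500\text{M features}
```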
LawrenceCΩ450

Worth noting that both some of Anthropic's results and Lauren Greenspan's results here (assuming I understand her results correctly) give a clear demonstration of learned (even very toy) transformers not being well-modeled as sets of skip trigrams. 

LawrenceC*Ω9121

I'm having a bit of difficulty understanding the exact task/set up of this post, and so I have a few questions. 

Here's a summary of your post as I understand it:

  • In Anthropic's Toy Model of Attention Head "Superposition",[1] they consider a task where the model needs to use interference between heads to implement multiple skip trigrams. In particular, they consider, and call this "OV-incoherent", because the OV seems to need to use information "not present" in V of the source token. (This was incorrect, because you can implement their task perfectl
... (read more)
1keith_wynroe
Hey, sorry for the (very) belated response - thanks for the comment!

Your description of the problem set-up/model looks right to me. FWIW this post was ~my first attempt at digging into something superposition-related, so I think you're right that it was being pretty sloppy/confused with the concept of "superposition". I've since come around more to your perspective of polysemanticity/distributed representation/interference being insufficient for "true" superposition.

Re: your point about there existing simpler solutions - you're totally right that for d_head >= 4, there exists a more straightforward n_head = 1 solution; I did try solving this problem on paper before training anything and arrived at the same thing as you. However, we found that for d_head = 1, n_head = 2 the model could still solve the problem perfectly - in this case I think the problem is less trivial and it does rely on the kind of "conditional attention hierarchy" behaviour and the associated interference we talk about. When n_head = 2 and d_head >= 4 the model still prefers this approach over the more trivial method you outline - we included the plots from this experiment over the n_head = 2, d_head = 1 version because the plots were a bit easier to read and we felt made the same point, but in retrospect

Overall I'm a lot less impressed/interested by this work in retrospect, largely for the reasons you point out here. However, I think some of the qualitative behaviours we saw are still quite interesting, and have at least for me affected how I think about what kinds of things attention layers might be doing (although the lessons may not be new/interesting to others):

1. "Inverted attention preferences": In almost all of our tests, the two heads learn to invert the order in which they attend to important tokens. If there are multiple important key-tokens that all need to be attended to, you really don't want multiple heads attending to the same token and ignoring some, so the QK-circuits of heads
LawrenceC144

Yeah, it's been a bit of a meme ("where is Ilya?"). See e.g. Gwern's comment thread here.

7Mateusz Bagiński
Yeah, that meme did reach me. But I was just assuming Ilya got back (was told to get back) to doing the usual Ilya superalignment things and decided (was told) not to stick his neck out.
LawrenceC116

What does a "majority of the EA community" mean here? Does it mean that people who work at OAI (even on superalignment or preparedness) are shunned from professional EA events? Does it mean that when they ask, people tell them not to join OAI? And who counts as "in the EA community"? 

I don't think it's that constructive to bar people from all or even most EA events just because they work at OAI, even if there's a decent amount of consensus people should not work there. Of course, it's fine to host events (even professional ones!) that don't invite OAI... (read more)

LawrenceC60

To be honest, I would've preferred if Thomas's post started from empirical evidence (e.g. it sure seems like superforecasters and markets change a lot week on week) and then explained it in terms of the random walk/Brownian motion setup. I think the specific math details (a lot of which don't affect the qualitative result of "you do lots and lots of little updates, if there exists lots of evidence that might update you a little") are a distraction from the qualitative takeaway. 

A fancier way of putting it is: the math of "your belief should satisfy co... (read more)

LawrenceC82

Technically, the probability assigned to a hypothesis over time should be a martingale (i.e. have expected change zero); this is just a restatement of the conservation of expected evidence/law of total expectation.

The random walk model that Thomas proposes is a simple model that illustrates a more general fact. For a martingale $(X_t)$, the variance of $X_T$ is equal to the sum of the variances of the individual timestep changes $X_t - X_{t-1}$ (with $X_0$ fixed): $\operatorname{Var}(X_T) = \sum_{t=1}^{T}\operatorname{Var}(X_t - X_{t-1})$. Under this frame, insofar as small updates ... (read more)
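A minimal derivation of that identity, using only the martingale property (my own spelling-out, not from the original comment); write $\Delta_t = X_t - X_{t-1}$ for the increments and take $X_0$ to be constant:

```latex
\operatorname{Var}(X_T)
  = \operatorname{Var}\!\Big(\sum_{t=1}^{T} \Delta_t\Big)
  = \sum_{t=1}^{T} \operatorname{Var}(\Delta_t)
    + 2\sum_{s<t} \operatorname{Cov}(\Delta_s,\Delta_t)
  = \sum_{t=1}^{T} \operatorname{Var}(\Delta_t),
% since E[\Delta_t \mid X_0, \dots, X_{t-1}] = 0 for a martingale, every cross term
% E[\Delta_s \Delta_t] with s < t vanishes, so all the covariance terms are zero.
```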

LawrenceC41

Huh, that's indeed somewhat surprising if the SAE features are capturing the things that matter to CLIP (in that they reduce loss) and only those things, as opposed to "salient directions of variation in the data". I'm curious exactly what "failing to work" means -- here I think the negative result (and the exact details of said result) are arguably more interesting than a positive result would be.

LawrenceCΩ153112

The general version of this statement is something like: if your beliefs satisfy the law of total expectation, the variance of the whole process should equal the variance of all the increments involved in the process.[1] In the case of the random walk where at each step, your beliefs go up or down by 1% starting from 50% until you hit 100% or 0% -- the variance of each increment is 0.01^2 = 0.0001, and the variance of the entire process is 0.5^2 = 0.25, hence you need 0.25/0.0001 = 2500 steps in expectation. If your beliefs have probability p of going... (read more)
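A quick simulation of the ±1% random walk described above (my own sketch, just to check the 2500-step figure numerically):

```python
import random

def steps_to_absorption(start=50):
    """Belief in whole percentage points; move ±1 point per step until 0 or 100."""
    p, n = start, 0
    while 0 < p < 100:
        p += 1 if random.random() < 0.5 else -1
        n += 1
    return n

trials = 2000
mean_steps = sum(steps_to_absorption() for _ in range(trials)) / trials
print(f"average steps over {trials} trials: {mean_steps:.0f}")  # theory: 50 * 50 = 2500
```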

Thomas KwaΩ4123

I talked about this with Lawrence, and we both agree on the following:

  • There are mathematical models under which you should update >=1% in most weeks, and models under which you don't.
  • Brownian motion gives you 1% updates in most weeks. In many variants, like stationary processes with skew, stationary processes with moderately heavy tails, or Brownian motion interspersed with big 10%-update events that constitute <50% of your variance, you still have many weeks with 1% updates. Lawrence's model where you have no evidence until either AI takeover happen
... (read more)
4niplav
Thank you a lot for this. I think this or @Thomas Kwa's comment would make an excellent original-sequences-style post—it doesn't need to be long, but just going through an example and talking about the assumptions would be really valuable for applied rationality. After all, it's about how much one should expect one's beliefs to vary, which is pretty important.
LawrenceCΩ7183

When I spoke to him a few weeks ago (a week after he left OAI), he had not signed an NDA at that point, so it seems likely that he hasn't.

LawrenceC118

Also, another nitpick:

Humane vs human values

I think there's a harder version of the value alignment problem, where the question looks like, "what's the right goals/task spec to put inside a sovereign ai that will take over the universe".  You probably don't want this sovereign AI to adopt the value of any particular human, or even modern humanity as a whole, so you need to do some  Ambitious Value Learning/moral philosophy and not just intent alignment. In this scenario, the distinction between humane and human values does matter. (In fact, you c... (read more)

LawrenceC20

Also, I added another sentence trying to clarify what I meant at the end of the paragraph, sorry for the confusion. 

LawrenceC42

No, I'm saying that "adding 'logic' to AIs" doesn't (currently) look like "figure out how to integrate insights from expert systems/explicit bayesian inference into deep learning", it looks like "use deep learning to nudge the AI toward being better at explicit reasoning by making small changes to the training setup". The standard "deep learning needs to include more logic" take generally assumes that you need to add the logic/GOFAI juice in explicitly, while in practice people do a slightly different RL or supervised finetuning setup instead. 

(EDITED... (read more)

LawrenceC*101

I think this is really quite good, and went into way more detail than I thought it would. Basically my only complaints on the intro/part 1 are some terminology and historical nitpicks. I also appreciate the fact that Nicky just wrote out her views on AIS, even if they're not always the most standard ones or other people dislike them (e.g. pointing at the various divisions within AIS, and the awkward tension between "capabilities" and "safety"). 

I found the inclusion of a flashcard review applet for each section super interesting. My guess is it probab... (read more)

LawrenceC118

Also, another nitpick:

Humane vs human values

I think there's a harder version of the value alignment problem, where the question looks like, "what's the right goals/task spec to put inside a sovereign ai that will take over the universe".  You probably don't want this sovereign AI to adopt the value of any particular human, or even modern humanity as a whole, so you need to do some  Ambitious Value Learning/moral philosophy and not just intent alignment. In this scenario, the distinction between humane and human values does matter. (In fact, you c... (read more)

4mako yass
The intention of this part of the paragraph wasn't totally clear, but you seem to be saying this wasn't great? From what I understand, these actually did all make the model far more interpretable? Chain of thought is a wonderful thing, it clears a space where the model will just earnestly confess its inner thoughts and plans in a way that isn't subject to training pressure, and so it, in most ways, can't learn to be deceptive about it.
LawrenceC449

I agree with many of the points made in this post, especially the "But my ideas/insights/research is not likely to impact much!" point. I find it plausible that in some subfields, AI x-risk people are too prone to publishing due to historical precedent and norms (maybe mech interp? though little has actually come of that). I also want to point out that there are non-zero arguments to expect alignment work to help more with capabilities, relative to existing "mainstream" capabilities work, even if I don't believe this to be the case. (For example, you might ... (read more)

5mesaoptimizer
This is a really well-written response. I'm pretty impressed by it.
LawrenceC*3314

While I've softened my position on this in the last year, I want to give a big +1 to this response, especially these two points:

  • It's genuinely hard to come up with ideas that help capabilities a lot. I think you are severely underestimating how hard it is, and how much insight is required. I think one issue here is that most papers on arxiv are garbage and don't actually make any progress, but those papers are not the ones that are pushing AGI forward anyways.
  • [..]
  • High level ideas are generally not that valuable in and of themselves. People generally learn
... (read more)
2luciaquirke
Is there anything you recommend for understanding the history of the field?
LawrenceCΩ592

I don't know what the "real story" is, but let me point at some areas where I think we were confused.  At the time, we had some sort of hand-wavy result in our appendix saying "something something weight norm ergo generalizing". Similarly, concurrent work from Ziming Liu and others (Omnigrok) had another claim based on the norm of generalizing and memorizing solutions, as well as a claim that representation is important.

One issue is that our picture doesn't consider learning dynamics that seem actually important here. For example, it seems that one of... (read more)

LawrenceC30

I think the key takeaway I wanted people to get is that superposition is something novel and non-trivial, and isn't just a standard polysemantic neuron thing. I wrote this post in response to two interactions where people assumed that superposition was just polysemanticity. 

It turned out that a substantial fraction of the post went the other way (i.e. talking about non-superposition polysemanticity), so maybe?

LawrenceC20

Also, have you looked at the dot product of each of the SAE directions/SAE-reconstructed representations with the ImageNet labels fed through the text encoder?

1hugofry
Ah yes! I tried doing exactly this to produce a sort of 'logit lens' to explain the SAE features. In particular I tried the following:
  • Take an SAE feature encoder direction and map it directly to the multimodal space to get an embedding.
  • Pass each of the ImageNet text prompts “A photo of a {label}.” through the CLIP text model to generate the multimodal embeddings for each ImageNet class.
  • Calculate the cosine similarities between the SAE embedding and the ImageNet class embeddings. Pass this through a softmax to get a probability distribution.
  • Look at the ImageNet labels with a high probability - this should give some explanation as to what the SAE feature is representing.

Surprisingly, this did not work at all! I only spent a small amount of time trying to get this to work (<1 day), so I'm planning to try again. If I remember correctly, I also tried the same analysis for the decoder feature vector and also tried shifting by the decoder bias vector too - both of these didn't seem to provide good ImageNet class explanations of the SAE features. I will try doing this again and I can let you know how it goes!
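A rough sketch of that procedure (not the author's code; the SAE feature embedding is a random placeholder here, and the softmax temperature is a free choice), using the Hugging Face CLIP API:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# In practice this would be all 1000 ImageNet class names.
imagenet_labels = ["tench", "goldfish", "great white shark"]
prompts = [f"A photo of a {label}." for label in imagenet_labels]

# Embed the class prompts with CLIP's text encoder and normalize.
with torch.no_grad():
    text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
    class_embeddings = model.get_text_features(**text_inputs)
class_embeddings = class_embeddings / class_embeddings.norm(dim=-1, keepdim=True)

# Placeholder for an SAE feature direction already mapped into the shared
# multimodal space (in the real analysis this would come from the SAE encoder/decoder).
feature_embedding = torch.randn(class_embeddings.shape[-1])
feature_embedding = feature_embedding / feature_embedding.norm()

# Cosine similarities -> softmax over classes ("logit lens" for the feature).
cosine_sims = class_embeddings @ feature_embedding
probs = torch.softmax(cosine_sims / 0.01, dim=0)  # temperature 0.01 is arbitrary

top = torch.topk(probs, k=min(3, len(imagenet_labels)))
for p, idx in zip(top.values, top.indices):
    print(f"{imagenet_labels[int(idx)]}: {p.item():.3f}")
```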
LawrenceC83

Cool work!

As with Arthur, I'm pretty surprised by how much easier vision seems to be than text for interp (in line with previous results). It makes sense why feature visualization and adversarial attacks work better with continuous inputs, but if it is true that you need fewer datapoints to recover concepts of comparable complexity, I wonder if it's a statement about image datasets or about vision in general (e.g. "abstract" concepts are more useful for prediction, since the n-gram/skip n-gram/syntactical feature baseline is much weaker).

I think th... (read more)

1hugofry
Thanks for the comments!

I am also surprised that SAEs trained on these vision models seem to require such little data, especially as I would have thought the complexity of CLIP's representations for vision would be comparable to the complexity for text (after all, we can generate an image from a text prompt, and then use a captioning model to recover the text, suggesting most/all of the information in the text is also present in the image).

With regards to the model loss, I used the text template “A photo of a {label}.”, where {label} is the ImageNet text label (this was the template used in the original CLIP paper). These text prompts were used alongside the associated batch of images and passed jointly into the full CLIP model (text and vision models) using the original contrastive loss function that CLIP was trained on. I used this loss calculation (with this template) to measure both the original model loss and the model loss with the SAE inserted during the forward pass.

I also agree completely with your explanation for the reduction in loss. My tentative explanation goes something like this:
  • Many of the ImageNet classes are very similar (eg 118 classes are of dogs and 18 are of primates). A model such as CLIP that is trained on a much larger dataset may struggle to differentiate the subtle differences in dog breeds and primate species. These classes alone may provide a large chunk of the loss when evaluated on ImageNet.
  • CLIP's representations of many of these classes will likely be very similar,[1] using only a small subspace of the residual stream to separate these classes. When the SAE is included during the forward pass, some random error is introduced into the model's activations and so these representations will on average drift apart from each other, separating slightly. This on average will decrease the contrastive loss when restricted to ImageNet (but not on a much larger dataset where the activations will not be clustered in this way).

That w
2LawrenceC
Also, have you looked at the dot product of each of the SAE directions/SAE-reconstructed representations with the ImageNet labels fed through the text encoder?