Current take on the implications of "GPT-4b micro": Very powerful, very cool, ~zero progress to AGI, ~zero existential risk. Cheers.

First, the gist of it appears to be:

Crucially, if the reporting is accurate, this is not an agent. The model did not engage in autonomous open-ended research. Rather, humans guessed that if a specific model is fine-tuned on a specific dataset, the gradient descent would chisel into it the functionality that would allow it to produce groundbreaking results in the corresponding domain. As far as AGI-ness goes, this is functionally similar to AlphaFold 2; as far as agency goes, it's at most at the level of o1.

To speculate on what happened: Perhaps GPT-4b ("b" = "bio"?) is based on some distillation of an o-series model, say o3. o3's internals contain a lot of advanced machinery for mathematical reasoning. What this result shows, then, is that the protein-factors problem is in some sense a "shallow" mathematical problem that could be easily solved if you think about it the right way. Finding the right way to think about it is itself highly challenging, however – a problem teams of brilliant people have failed to crack – yet deep learning made it possible to automate this search and crack it.

This trick likely generalizes. There may be many problems in the world that could be cracked this way[1]: those that are secretly "mathematically shallow" in this manner, and for which you can get a clean-enough fine-tuning dataset. ... Which is to say, this almost certainly doesn't cover social manipulation/scheming (no clean dataset), and likely doesn't cover AI R&D (too messy/open-ended, although I can see myself being wrong about this). (Edit: And if it Just Worked given any sorta-related sorta-okay fine-tuning dataset, the o-series would've likely generalized to arbitrary domains out-of-the-box, since the pretraining is effectively this dataset for everything. Yet it doesn't.)

It's also not entirely valid to call that "innovative AI", any more than it was va
Charlie Steiner
Could someone who thinks capabilities benchmarks are safety work explain the basic idea to me? It's not all that valuable for my personal work to know how good models are at ML tasks. Is it supposed to be valuable to legislators writing regulation? To SWAT teams calculating when to bust down the datacenter door and turn the power off? I'm not clear. But it sure seems valuable to someone building an AI to do ML research, to have a benchmark that will tell you where you can improve. But clearly other people think differently than me.
Some rough notes from Michael Aird's workshop on project selection in AI safety.

Tl;dr how to do better projects?

  • Backchain to identify projects.
  • Get early feedback, iterate quickly
  • Find a niche

On backchaining projects from theories of change

  • Identify a "variable of interest" (e.g., the likelihood that big labs detect scheming).
  • Explain how this variable connects to end goals (e.g. AI safety).
  • Assess how projects affect this variable
  • Red-team these. Ask people to red team these.

On seeking feedback, iteration.

  • Be nimble. Empirical. Iterate. 80/20 things
  • Ask explicitly for negative feedback. People often hesitate to criticise, so make it socially acceptable to do so
  • Get high-quality feedback. Ask "the best person who still has time for you".

On testing fit

  • Forward-chain from your skills, available opportunities, career goals.
  • "Speedrun" projects. Write papers with hypothetical data and decide whether they'd be interesting. If not then move on to something else.
  • Don't settle for "pretty good". Try to find something that feels "amazing" to do, e.g. because you're growing a lot / making a lot of progress.

Other points

On developing a career

  • "T-shaped" model of skills; very deep in one thing and have workable knowledge of other things
  • Aim for depth first. Become "world-class" at something. This ensures you get the best possible feedback at your niche and gives you a value proposition within a larger organization. After that, you can broaden your scope.

Product-oriented vs field-building research

  • Some research is 'product oriented', i.e. the output is intended to be used directly by somebody else
  • Other research is 'field building', e.g. giving a proof of concept, or demonstrating the importance of something. You (and your skills / knowledge) are the product.

A specific process to quickly update towards doing better research.

  1. Write “Career & goals 2-pager”
  2. Solicit ideas from mentor, exp
habryka
It's the last 6 hours of the fundraiser and we have met our $2M goal! This was roughly the "we will continue existing and not go bankrupt" threshold, which was the most important one to hit.  Thank you so much to everyone who made it happen. I really did not expect that we would end up being able to raise this much funding without large donations from major philanthropists, and I am extremely grateful to have so much support from such a large community.  Let's make the last few hours in the fundraiser count, and then me and the Lightcone team will buckle down and make sure all of these donations were worth it.

Recent Discussion

Scaling inference

With the release of OpenAI's o1 and o3 models, it seems likely that we are now contending with a new scaling paradigm: spending more compute on model inference at run-time reliably improves model performance. As shown below, o1's AIME accuracy increases at a constant rate with the logarithm of test-time compute (OpenAI, 2024).

The image shows two scatter plots comparing "o1 AIME accuracy" during training and at test time. Both charts have "pass@1 accuracy" on the y-axis and compute (log scale) on the x-axis. The dots indicate increasing accuracy with more compute time.
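To make the "constant rate with the logarithm of compute" claim concrete, here is a minimal sketch of fitting accuracy = a + b · log10(compute) by ordinary least squares. The data points are made up purely for illustration and are not OpenAI's numbers; the function name `fit_log_linear` is likewise hypothetical.

```python
import math

def fit_log_linear(compute, accuracy):
    """Least-squares fit of accuracy = intercept + slope * log10(compute)."""
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    # Closed-form simple linear regression on (log10(compute), accuracy).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic, illustrative points only (NOT taken from the o1 plots).
compute = [1, 10, 100, 1000]          # arbitrary compute units
accuracy = [20.0, 35.0, 50.0, 65.0]   # pass@1 accuracy in percent

intercept, slope = fit_log_linear(compute, accuracy)
# slope is the gain in accuracy points for every 10x increase in compute.
print(f"accuracy ~ {intercept:.1f} + {slope:.1f} * log10(compute)")
```

Under such a fit, each additional fixed increment of accuracy costs a constant multiplicative factor of compute, which is what a straight line on a log-scaled x-axis means.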

OpenAI's o3 model continues this trend with record-breaking performance, scoring:

  • 2727 on Codeforces, which makes it the 175th best competitive programmer on Earth;
  • 25% on FrontierMath, where "each problem demands hours of work from expert mathematicians";
  • 88% on GPQA, where 70% represents PhD-level science knowledge;
  • 88% on ARC-AGI, where the average Mechanical Turk human worker scores 75% on hard visual reasoning problems.

According to OpenAI, the bulk of model performance improvement in the o-series of models comes from increasing...

Math proofs are math proofs, whether they are in plain English or in Lean. Contemporary LLMs are very good at translation, not just between high-resource human languages but also between programming languages (transpiling), from code to natural language (documentation) and even from algorithms in scientific papers to code. Thus I wouldn't expect formalizing math proofs to be a hard problem in 2025.

However, I generally agree with your line of thinking. As wassname wrote above (it's been quite obvious for some time but they link to a quantitative analysis), good in-sili... (read more)

Anonymous
When I hear “distillation” I think of a model with a smaller number of parameters that’s dumber than the base model. It seems like the word “bootstrapping” is more relevant here. You start with a base LLM (like GPT-4); then do RL for reasoning, and then do a ton of inference (this gets you o1-level outputs); then you train a base model with more parameters than GPT-4 (let’s call this GPT-5) on those outputs — each single forward pass of the resulting base model is going to be smarter than a single forward pass of GPT-4. And then you do RL and more inference (this gets you o3). And rinse and repeat.  I don’t think I’m really saying anything different from what you said, but the word “distill” doesn’t seem to capture the idea that you are training a larger, smarter base model (as opposed to a smaller, faster model). This also helps explain why o3 is so expensive. It’s not just doing more forward passes, it’s a much bigger base model that you’re running with each forward pass.  I think maybe the most relevant chart from the Jones paper gwern cites is this one: 
Mateusz Bagiński
Another little bit of a cheat is that they only train Qwen2.5-Math-7B according to the procedure described. In contrast, for the other three models (smaller than Qwen2.5-Math-7B), they instead use the fine-tuned Qwen2.5-Math-7B to generate the training data to bootstrap round 4. (Basically, they distill from DeepSeek in round 1 and then they distill from fine-tuned Qwen in round 4.) They justify: TBH I'm not sure how this helps them with saving on GPU resources. (For some reason it's cheaper to generate a lot of big/long rollouts with Qwen2.5-Math-7B-r4 than three times with [smaller model]-r3?)

Not on sci-hub or Anna's Archive, so I'm just going off the abstract and summary here; would love a PDF if anyone has one.

If you email the authors they will probably send you the full article.

David Matolcsi
Does anyone know of a zinc acetate lozenge that isn't peppermint flavored? I really dislike peppermint, so I'm not sure it would be worth drinking 5 peppermint-flavored glasses of water a day to shorten a cold by one day. I haven't found other zinc acetate lozenge options yet; the acetate version seems to be rare among zinc supplements. (Why?)

If you've never read the LessWrong Sequences (which I read through the book-length compilation Rationality: From AI To Zombies), I suggest that you read the Sequences as if they were written today. Additionally, if you're thinking of rereading the Sequences, I suggest that your agenda for rereading, in addition to what it may already be, should be to read the Sequences as if they were written today.

To start, I'd like to take a moment to clarify what I mean. I don't mean "think about what you remember the Sequences talking about, and try to apply those concepts to current events." I don't even mean "read the Sequences and reflect on where the concepts are relevant to things that have happened since they were written." What I mean...

Viliam
Yes, I would certainly love to read more, in a format longer than the bullet points you made here (but maybe shorter than the original Sequences?). If you believe (correctly, it seems to me) that some lessons from the Sequences are frequently misunderstood, then it probably makes sense to make the explanations very clear, with several specific examples, a summary at the end... Simply put, if they were misinterpreted once, it seems like there is a specific attractor in idea-space, and that attractor will act with the same force on your clarifications, so you need to defend hard against it. So please do err on the side of providing more specific examples and further dumbing it down for an audience such as me. (Also, clearly spell out the specific misunderstanding you are trying to avoid, and highlight the difference. Maybe as a separate section at the end of the article.)

Definitely interesting! Not sure if 1:1 correspondence is optimal (one article of yours per one article of the original Sequences). The information density varies and so does the article length; sometimes it might make more sense to read two or three articles at the same time; sometimes it might make sense to address two important points from the same article separately. Up to you; just saying that if you start with this format, don't feel like you have to stick with it all the time.

Thanks for the support. I'll try and work a bit more on my first post in the coming days and I hope it will be up soon.

Daniel Tan
"Just ask the LM about itself" seems like a weirdly effective way to understand language models' behaviour.

There's lots of circumstantial evidence that LMs have some concept of self-identity.

  • Language models' answers to questions can be highly predictive of their 'general cognitive state', e.g. whether they are lying or their general capabilities
  • Language models know things about themselves, e.g. that they are language models, or how they'd answer questions, or their internal goals / values
  • Language models' self-identity may directly influence their behaviour, e.g. by making them resistant to changes in their values / goals

Some work has directly tested 'introspective' capabilities.

  • An early paper by Ethan Perez and Rob Long showed that LMs can be trained to answer questions about themselves. Owain Evans' group expanded upon this in subsequent work.
  • White-box methods such as PatchScopes show that activation patching allows an LM to answer questions about its activations. LatentQA fine-tunes LMs to be explicitly good at this.

This kind of stuff seems particularly interesting because:

  • Introspection might naturally scale with general capabilities (this is supported by initial results).
  • Introspection might be more tractable. Alignment is difficult and possibly not even well-specified, but "answering questions about yourself truthfully" seems like a relatively well-defined problem (though it does require some way to oversee what is 'truthful')

Failure modes include language models becoming deceptive, less legible, or less faithful. It seems important to understand whether each failure mode will happen, and what the corresponding mitigations are.

Research directions that might be interesting:

  • Scaling laws for introspection
  • Understanding failure modes
  • Fair comparisons to other interpretability methods
  • Training objectives for better ('deeper', more faithful, etc) introspection
the gears to ascension
Partially agreed. I've tested this a little personally; Claude successfully predicted their own success probability on some programming tasks, but was unable to report their own underlying token probabilities. The former tests weren't that good; the latter ones were somewhat okay: I asked Claude to say the same thing across 10 branches and then asked a separate thread of Claude, also downstream of the same context, to verbally predict the distribution.
Daniel Tan
That's pretty interesting! I would guess that it's difficult to elicit introspection by default. Most of the papers where this is reported to work well involve fine-tuning the models. So maybe "willingness to self-report honestly" should be something we train models to do. 
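One way to score the branching test described in this thread is to compare the empirical distribution of outputs across branches with the distribution the model verbally predicts. The sketch below is only an illustration of that comparison, not the setup actually used: the branch outputs, the predicted probabilities, and the helper names are all hypothetical, and total variation distance is just one reasonable choice of metric.

```python
from collections import Counter

def empirical_distribution(samples):
    """Turn a list of sampled outputs into a probability distribution."""
    counts = Counter(samples)
    total = len(samples)
    return {token: count / total for token, count in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in support)

# Hypothetical data: what 10 branches actually said...
branch_outputs = ["yes"] * 7 + ["no"] * 3
# ...and what a separate thread verbally predicted the distribution to be.
predicted = {"yes": 0.9, "no": 0.1}

observed = empirical_distribution(branch_outputs)
gap = total_variation(observed, predicted)
print(f"observed={observed}, gap={gap:.2f}")
```

With only 10 branches the empirical distribution is itself noisy, so a small gap is weak evidence either way; more branches would be needed to say much about how well-calibrated the self-report is.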

Today's post is in response to the post "Quantum without complications", which I think is a pretty good popular distillation of the basics of quantum mechanics. 

For any such distillation, there will be people who say "but you missed X important thing". The limit of appeasing such people is to turn your popular distillation into a 2000-page textbook (and then someone will still complain). 

That said, they missed something!

To be fair, the thing they missed isn't included in most undergraduate quantum classes. But it should be.[1]

Or rather, there is something that I wish they told me when I was first learning this stuff and confused out of my mind, since I was a baby mathematician and I wanted the connections between different concepts in the world to actually have...

Optimization Process
Question: if I'm considering an isolated system (~= "the entire universe"), you say that I can swap between state-vector-format and matrix-format via |ϕ⟩↔ρ=|ϕ⟩⟨ϕ| . But later, you say... But if ρ:=|ϕ⟩⟨ϕ|, how could it ever be rank>1? (Perhaps more generally: what does it mean when a state is represented as a rank>1 density matrix? Or: given that the space of possible ρs is much larger than the space of possible |ϕ⟩s, there are sometimes (always?) multiple ρs that correspond to some particular |ϕ⟩; what's the significance of choosing one versus another to represent your system's state?)

The usual story is that rank > 1 density matrices arise when your subsystem is entangled with an environment that you can't observe.

The simplest example is to take a Bell state, say |00⟩ + |11⟩ (obviously I'm ignoring normalization), and imagine you only have access to the first qubit; how should you represent this state? Precisely because it's entangled, we know that there is no |Ψ⟩ in 1-qubit space that will work. The trace method alluded to in the post is to form the (rank-1) density matrix of the Bell state, and... (read more)
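The comment is cut off here, so what follows is not its original continuation. Under the standard definition of the partial trace, and with the normalization restored, the computation it alludes to can be sketched as follows (the labels A for the accessible qubit and B for the unobserved one are my own):

```latex
\begin{align*}
|\Phi^+\rangle &= \tfrac{1}{\sqrt{2}}\bigl(|00\rangle + |11\rangle\bigr) \\
\rho_{AB} &= |\Phi^+\rangle\langle\Phi^+|
  = \tfrac{1}{2}\bigl(|00\rangle\langle 00| + |00\rangle\langle 11|
  + |11\rangle\langle 00| + |11\rangle\langle 11|\bigr) \\
\rho_A &= \operatorname{Tr}_B \rho_{AB}
  = \sum_{j \in \{0,1\}} (I \otimes \langle j|)\,\rho_{AB}\,(I \otimes |j\rangle)
  = \tfrac{1}{2}\bigl(|0\rangle\langle 0| + |1\rangle\langle 1|\bigr)
  = \tfrac{1}{2} I
\end{align*}
```

The cross terms vanish because ⟨j|0⟩⟨1|j⟩ = 0 for both values of j. The joint state ρ_AB is rank 1, but the reduced state ρ_A has two nonzero eigenvalues, so it is rank 2 and cannot be written as |ϕ⟩⟨ϕ| for any single-qubit |ϕ⟩.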

Charlie Steiner
When you say there's "no such thing as a state," or "we live in a density matrix," these are statements about ontology: what exists, what's real, etc. Density matrices use the extra representational power they have over states to encode a probability distribution over states. If we regard the probabilistic nature of measurements as something to be explained, putting the probability distribution directly into the thing we live in is what I mean by "explain with ontology." Epistemology is about how we know stuff. If we start with a world that does not inherently have a probability distribution attached to it, but obtain a probability distribution from arguments about how we know stuff, that's "explain with epistemology." In quantum mechanics, this would look like talking about anthropics, or what properties we want a measure to satisfy, or Solomonoff induction and coding theory. What good is it to say things are real or not? One useful application is predicting the character of physical law. If something is real, then we might expect it to interact with other things. I do not expect the probability distribution of a mixed state to interact with other things.
Linda Linsefors
I think you mean ρ  here, not ψ

Current take on the implications of "GPT-4b micro": Very powerful, very cool, ~zero progress to AGI, ~zero existential risk. Cheers.

First, the gist of it appears to be:

OpenAI’s new model, called GPT-4b micro, was trained to suggest ways to re-engineer the protein factors to increase their function. According to OpenAI, researchers used the model’s suggestions to change two of the Yamanaka factors to be more than 50 times as effective—at least according to some preliminary measures.

The model was trained on examples of protein sequences from many species, as

... (read more)
This is a linkpost for https://arxiv.org/abs/2405.12241

A short summary of the paper is presented below.

This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland).

TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE features.
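As a rough illustration of the objective described in the TL;DR, the sketch below computes the KL divergence between the output distribution of the original model and that of the model with SAE activations inserted, for a single token position. It is not the paper's implementation: the logits are made-up numbers, the function names are hypothetical, the direction KL(original ‖ SAE-inserted) is my assumption, and any additional terms in the real training loss (such as a sparsity penalty) are omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits over a 3-token vocabulary.
original_logits = [2.0, 1.0, 0.1]   # original model
sae_logits = [1.8, 1.1, 0.3]        # same model with SAE reconstruction spliced in

p_original = softmax(original_logits)
p_sae = softmax(sae_logits)

# The e2e idea: penalize changes in the model's *output distribution*,
# rather than the MSE between activations and their reconstruction.
kl_loss = kl_divergence(p_original, p_sae)
print(f"KL(original || SAE-inserted) = {kl_loss:.5f}")
```

The design point is that two reconstructions with the same mean squared error can change the output distribution by very different amounts; a KL-based loss only rewards features that matter for what the network actually predicts.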

Introduction

Current SAEs focus on the wrong goal: They are trained to minimize mean squared reconstruction...

Why do you need to have all feature descriptions at the outset? Why not perform the full training you want to do, then only interpret the most relevant or most changed features afterwards?

habryka
I think the core argument is "if you want to slow down, or somehow impose restrictions on AI research and deployment, you need some way of defining thresholds. Also, most policymakers' crux appears to be that AI will not be a big deal, but if they thought it was going to be a big deal they would totally want to regulate it much more. Therefore, having policy proposals that can use future eval results as a triggering mechanism is politically more feasible, and also epistemically helpful, since it allows people who do think it will be a big deal to establish a track record".  I find these arguments reasonably compelling, FWIW.
Nathan Helm-Burger
I think it would be good for more people to explicitly ask political staffers and politicians the question: "What hypothetical eval result would change your mind if you saw it?" I think a lot of the evals are more targeted towards convincing tech workers than convincing politicians.
habryka

My sense is political staffers and politicians aren't that great at predicting their future epistemic states this way, and so you won't get great answers for this question. I do think it's a really important one to model!

Bogdan Ionut Cirstea
At the very least, evals for automated ML R&D should be a very decent proxy for when it might be feasible to automate very large chunks of prosaic AI safety R&D.

In the aftermath of a disaster, there is usually a large shift in what people need, what is available, or both. For example, people normally don't use very much ice, but after a hurricane or other disaster that knocks out power, suddenly (a) lots of people want ice and (b) ice production is more difficult. Since people really don't want their food going bad, and they're willing to pay a lot to avoid that, in a world of pure economics, sellers would raise prices.

This can have serious benefits:

  • Increased supply: at higher prices it's worth running production facilities at higher output. It's even worth planning, through investments in storage or production capacity, so you can sell a lot at high prices in the aftermath of future disasters.

  • Reallocated supply: it's expensive to transport ice, but at higher prices it

...
Abe

If items are only available at "gouged" rates, then this will make them more expensive. That is, this tax will fall only on people in the emergency zone, and specifically those who are desperate enough to buy goods at these elevated costs. Since demand is very inelastic under these circumstances, the tax burden will fall almost entirely on the consumer.

Another approach might be to temporarily raise taxes everywhere except the emergency zone on these goods. For example, if bottled water falls under a temporary excise tax during a hurricane everywhere except the hurricane zone, that incentivizes sellers to bring bottled water to the hurricane victims.
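The incidence claim in the first paragraph of this comment follows from a textbook partial-equilibrium approximation: for a small per-unit tax, buyers bear roughly Es / (Es + |Ed|) of it, where Es and Ed are the price elasticities of supply and demand. The sketch below just evaluates that formula; the elasticity values are hypothetical, chosen only to show the inelastic-demand case, and the function name is my own.

```python
def consumer_tax_share(supply_elasticity, demand_elasticity):
    """Approximate share of a small per-unit tax borne by buyers.

    Uses the standard result share = Es / (Es + |Ed|), which holds for
    small taxes around a competitive equilibrium.
    """
    es = abs(supply_elasticity)
    ed = abs(demand_elasticity)
    if es + ed == 0:
        raise ValueError("at least one elasticity must be non-zero")
    return es / (es + ed)

# Hypothetical emergency scenario: demand is very inelastic (people need water),
# supply is moderately elastic.
share = consumer_tax_share(supply_elasticity=1.0, demand_elasticity=-0.05)
print(f"buyers bear about {share:.0%} of the tax")
```

Note that the result depends on the relative elasticities: if supply inside the emergency zone is also very inelastic in the short run, more of the burden shifts onto sellers, which is the trade-off the replies below are pointing at.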

FlorianH
This is called a windfall tax. Random examples:

  • VOXEU/CEPR: Energy costs: Views of leading economists on windfall taxes and consumer price caps
  • Reuters: Windfall tax mechanisms on energy companies across Europe

Especially with the 2022 Ukraine-related energy prices, the notion's popularity spiked. It also seems to me a very neat way to deal with supernormal short-term profits due to market price spikes, in cases where supply is extremely inelastic. I guess, and some commentaries suggest, that in actual implementation, with complex firm/financial structures etc. and clumsy real-world politics, it's not always as trivial as it might look at first sight, but it is feasible, and some countries managed to implement one during the energy crisis.
Dagon
True.  The main thing the "tax a price increase" misses is that it mutes the supply incentive effects of the price increase.  I'd need to understand the elasticities of the two (including the pre-supply incentives for some goods: a decision to store more than current demand BEFORE the emergency gets paid DURING) to really make a recommendation, and it'd likely be specific enough to time and place and product and reason for emergency that "don't get involved at a one-size-fits-all level" is the only thing I really support.  
jefftk
The thing that I think would be overall better (no price controls) is politically unpopular, strongly socially discouraged, and often illegal. This is a proposal that tries to move us in a direction I think is better, while addressing some of what price gouging opponents dislike.