Makes sense, thanks!
For compute I'm using hardware we have locally with my employer, so I have not tracked what the equivalent cost of renting it would be, but I guess it would be of the same order of magnitude as the API costs, or a factor of a few larger.
It's hard to say because I'm not even sure you can rent Titan Vs at this point,[1] and I don't know what your GPU utilization looks like, but I suspect API costs will dominate.
An H100 box is approximately $2/hour/GPU and A100 boxes are a fair bit under $1/hour (see e.g. pricing on Vast AI or Shade...
This is really impressive -- could I ask how long this project took, how long each eval takes to run on average, and what you spent on compute/API credits?
(Also, I found the preliminary BoK vs 5-iteration results especially interesting, particularly the speculation on reasoning models.)
(Disclaimer: have not read the piece in full)
If “reasoning models” count as a breakthrough of the relevant size, then I argue that there have been quite a few of these in the last 10 years: skip connections/residual stream (2015-ish), transformers instead of RNNs (2017), RLHF/modern policy gradient methods (2017ish), scaling hypothesis (2016-20 depending on the person and which paper), Chain of Thought (2022), massive MLP MoEs (2023-4), and now Reasoning RL training (2024).
I think the title greatly undersells the importance of these statements/beliefs. (I would've preferred either part of your quote or a call to action.)
I'm glad that Sam is putting in writing what many people talk about. People should read it and take them seriously.
Nit:
> OpenAI presented o3 on the Friday before Thanksgiving, at the tail end of the 12 Days of Shipmas.
Should this say Christmas?
I think writing this post was helpful to me in thinking through my career options. I've also been told by others that the post was quite valuable to them as an example of someone thinking through their career options.
Interestingly, I left METR (then ARC Evals) about a month and a half after this post was published. (I continued to be involved with the LTFF.) I then rejoined METR in August 2024. In between, I worked on ambitious mech interp and did some late stage project management and paper writing (including some for METR). I also organized a mech ...
I think this post made an important point that's still relevant to this day.
If anything, this post is more relevant in late 2024 than in early 2023, as the pace of AI makes ever more people want to be involved, while more and more mentors have moved towards doing object level work. Due to the relative reduction of capacity in evaluating new AIS researchers, there's more reliance on systems or heuristics to evaluate people now than in early 2023.
Also, I find it amusing that without the parenthetical, the title of the post makes another important point: "evals are noisy".
I think this post was useful in the context it was written in and has held up relatively well. However, I wouldn't actively recommend it to anyone as of Dec 2024 -- both because the ethos of the AIS community has shifted, making posts like this less necessary, and because many other "how to do research" posts were written that contain the same advice.
This post was inspired by conversations I had in mid-late 2022 with MATS mentees, REMIX participants, and various bright young people who were coming to the Bay to work on AIS (collectively, "kid...
Evan joined Anthropic in late 2022, no? (E.g. his post announcing it was Jan 2023 https://www.alignmentforum.org/posts/7jn5aDadcMH6sFeJe/why-i-m-joining-anthropic)
I think you’re correct on the timeline. I remember Jade/Jan proposing DC Evals in April 2022 (which was novel to me at the time), and Beth started METR in June 2022. I don’t remember there being such teams actually doing work (at least not publicly known) when she pitched me on joining in August 2022.
It seems plausible that Anthropic’s scaring laws project was already under work before then (...
Otherwise, we could easily in the future release a model that is actually (without loss of generality) High in Cybersecurity or Model Autonomy, or much stronger at assisting with AI R&D, with only modest adjustments, without realizing that we are doing this. That could be a large or even fatal mistake, especially if circumstances would not allow the mistake to be taken back. We need to fix this.
[..]
This is a lower bound, not an upper bound. But what you need, when determining whether a model is safe, is an upper bound! So what do we do?
Part of the prob...
Re: the METR evaluations on o1.
We'll be releasing more details of our evaluations of the o1 model we evaluated, in the same style as our blog posts for o1-preview and Claude 3.5 Sonnet (Old). This includes both more details on the general autonomy capability evaluations as well as AI R&D results on RE-Bench.
Whereas the METR evaluation, presumably using final o1, was rather scary.
[..]
From the performance they got, I assume they were working with the full o1, but from the wording it is unclear that they got access to o1 pro?
Our evaluations we...
This is really good, thanks so much for writing it!
I'd never heard of Whisper or ElevenLabs until today, and I'm excited to try them out.
Yeah, this has been my experience using Grammarly pro as well.
I’m not disputing that they were trained with next token prediction log loss (if you read the tech reports they claim to do exactly this) — I’m just disputing the “on the internet” part, due to the use of synthetic data and private instruction following examples.
I mean, we don't know all the details, but Qwen2 was explicitly trained on synthetic data from Qwen1.5 + "high-quality multi-task instruction data". I wouldn't be surprised if the same were true of Qwen 1.5.
From the Qwen2 report:
...Quality Enhancement The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-t
After thinking about it more, I think the LLaMA 1 refusals strongly suggest that this is an artefact of training data. So I've unendorsed the comment above.
It's still worth noting that modern models generally have filtered pre-training datasets (if not wholly synthetic or explicitly instruction following datasets), and it's plausible to me that this (on top of ChatGPT contamination) is a large part of why we see much better instruction following/more eloquent refusals in modern base models.
It's worth noting that there are reasons to expect the "base models" of both Gemma2 and Qwen 1.5 to demonstrate refusals -- neither is trained on unfiltered webtext.
We don't know what 1.5 was trained on, but we do know that Qwen2's pretraining data both contains synthetic data generated by Qwen1.5, and was filtered using Qwen1.5 models. Notably, its pretraining data explicitly includes "high-quality multi-task instruction data"! From the Qwen2 report:
...Quality Enhancement The filtering algorithm has been refined with additional heuristic and modelbased met
Ah, you're correct, it's from the original InstructGPT release in Jan 2022:
https://openai.com/index/instruction-following/
(The Anthropic paper I cited predates ChatGPT by 7 months)
Pretty sure Anthropic's early assistant stuff used the word this way too: See e.g. Bai et al https://arxiv.org/abs/2204.05862
But yes, people complained about it a lot at the time
Thanks for the summaries, I found them quite useful and they've caused me to probably read some of these books soon. The following ones are both new to me and seem worth thinking more about:
...
- You should judge a person's performance based on the performance of the ideal person that would hold their position
- Document every task you do more than once, as soon as you do it the second time.
- Fun is important. (yes, really)
- People should know the purpose of the organization (specifically, being able to recite a clear mission statement)
- "I’m giving you these comme
Thanks for writing this!
I think that phased testing should be used during frontier model training runs. By this, I mean a testing approach which starts off with extremely low surface area tests, and gradually increases surface area. This makes it easy to notice sudden capability gains while decreasing the likelihood that the model takes over.
I actually think the proposal is more general than just for preventing AI escapes during diverse evals -- you want to start with low surface area tests because they're cheaper anyways, and you can use the performa...
Very cool work; I'm glad it was done.
That being said, I agree with Fabien that the title is a bit overstated, insofar as it's about your results in particular:
Thus, fine-tuned performance provides very little information about the best performance that would be achieved by a large number of actors fine-tuning models with random prompting schemes in parallel.
It's a general fact of ML that small changes in finetuning setup can greatly affect performance if you're not careful. In particular, it seems likely to me that the empirical details that Fabien ...
Good work, I'm glad that people are exploring this empirically.
That being said, I'm not sure that these results tell us very much about whether or not the MCIS theory is correct. In fact, something like your results should hold as long as the following facts are true (even without superposition):
This also continues the trend of OAI adding highly credentialed people who notably do not have technical AI/ML knowledge to the board.
This fact will be especially important insofar as a situation arises where e.g. some engineers at the company think that the latest system isn't safe. Board won't be able to engage with the arguments or evidence, it'll all come down to who they defer to.
Have you tried instead 'skinny' NNs with a bias towards depth,
I haven't -- the problem with skinny NNs is stacking MLP layers quickly makes things uninterpretable, and my attempts to reproduce slingshot -> grokking were done with the hope of interpreting the model before/after the slingshots.
That being said, you're probably correct that having more layers does seem related to slingshots.
(Particularly for MLPs, which are notorious for overfitting due to their power.)
What do you mean by power here?
70b storing 6b bits of pure memorized info seems quite reasonable to me, maybe a bit high. My guess is there's a lot more structure to the world that the models exploit to "know" more things with fewer memorized bits, but this is a pretty low confidence take (and perhaps we disagree on what "memorized info" means here). That being said, SAEs as currently conceived/evaluated won't be able to find/respect a lot of the structure, so maybe 500M features is also reasonable.
...I don't think SAEs will actually work at this level of sparsity though, so this is mostly
...On the surface, their strategy seems absurd. They think doom is ~99% likely, so they're going to try to shut it all down - stop AGI research entirely. They know that this probably won't work; it's just the least-doomed strategy in their world model. It's playing to the outs, or dying with dignity.
The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like
But I was quietly surprised by how many features they were using in their sparse autoencoders (respectively 1M, 4M, or 34M). Assuming Claude Sonnet has the same architecture of GPT-3, its residual stream has dimension 12K so the feature ratios are 83x, 333x, and 2833x, respectively[1]. In contrast, my team largely used a feature ratio of 2x, and Anthropic's previous work "primarily focus[ed] on a more modest 8× expansion". It does make sense to look for a lot of features, but this seemed to be worth mentioning.
There's both theoretical work (i.e. this theor...
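The feature ratios in the quoted passage are simple arithmetic and easy to sanity-check. Here is a minimal sketch, assuming (as the quote does) a residual stream dimension of "12K", which I read as 12,000; that figure is the quote's guess about the architecture, not a confirmed number, and the names below are purely illustrative.

```python
# Assumption from the quoted passage: residual stream dimension of "12K",
# read here as 12,000 (a GPT-3-like architecture; not a confirmed figure).
RESIDUAL_DIM = 12_000

# SAE dictionary sizes mentioned in the quote (1M, 4M, 34M features).
SAE_FEATURE_COUNTS = [1_000_000, 4_000_000, 34_000_000]


def feature_ratio(n_features: int, d_model: int = RESIDUAL_DIM) -> int:
    """Features per residual-stream dimension, rounded to the nearest integer."""
    return round(n_features / d_model)


ratios = [feature_ratio(n) for n in SAE_FEATURE_COUNTS]
print(ratios)  # [83, 333, 2833], matching the 83x, 333x, 2833x in the quote
```

For comparison, the 2x and 8x expansions mentioned in the quote would correspond to only 24,000 and 96,000 features under the same assumed dimension.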
Worth noting that both some of Anthropic's results and Lauren Greenspan's results here (assuming I understand her results correctly) give a clear demonstration of learned (even very toy) transformers not being well-modeled as sets of skip trigrams.
I'm having a bit of difficulty understanding the exact task/set up of this post, and so I have a few questions.
Here's a summary of your post as I understand it:
What does a "majority of the EA community" mean here? Does it mean that people who work at OAI (even on superalignment or preparedness) are shunned from professional EA events? Does it mean that when they ask, people tell them not to join OAI? And who counts as "in the EA community"?
I don't think it's that constructive to bar people from all or even most EA events just because they work at OAI, even if there's a decent amount of consensus people should not work there. Of course, it's fine to host events (even professional ones!) that don't invite OAI...
To be honest, I would've preferred if Thomas's post started from empirical evidence (e.g. it sure seems like superforecasters and markets change a lot week on week) and then explained it in terms of the random walk/Brownian motion setup. I think the specific math details (a lot of which don't affect the qualitative result of "you do lots and lots of little updates, if there exists lots of evidence that might update you a little") are a distraction from the qualitative takeaway.
A fancier way of putting it is: the math of "your belief should satisfy co...
Technically, the probability assigned to a hypothesis over time should be a martingale (i.e. have expected change zero); this is just a restatement of the conservation of expected evidence/law of total expectation.
The random walk model that Thomas proposes is a simple model that illustrates a more general fact. For a martingale $S_t$, the variance of $S_t$ is equal to the sum of variances of the individual timestep changes $X_i := S_i - S_{i-1}$ (and setting $S_0 := 0$): $\mathrm{Var}(S_t) = \sum_{i=1}^{t} \mathrm{Var}(X_i)$. Under this frame, insofar as small updates ...
Huh, that's indeed somewhat surprising if the SAE features are capturing the things that matter to CLIP (in that they reduce loss) and only those things, as opposed to "salient directions of variation in the data". I'm curious exactly what "failing to work" means -- here I think the negative result (and the exact details of said result) are arguably more interesting than a positive result would be.
The general version of this statement is something like: if your beliefs satisfy the law of total expectation, the variance of the whole process should equal the variance of all the increments involved in the process.[1] In the case of the random walk where at each step, your beliefs go up or down by 1% starting from 50% until you hit 100% or 0% -- the variance of each increment is 0.01^2 = 0.0001, and the variance of the entire process is 0.5^2 = 0.25, hence you need 0.25/0.0001 = 2500 steps in expectation. If your beliefs have probability p of going...
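As a sanity check on the 2500 figure, here is a minimal sketch that computes the expected number of steps two ways: from the variance argument above, and by exactly solving the hitting-time recurrence E[k] = 1 + (E[k-1] + E[k+1])/2 with E[0] = E[n] = 0. The function names, and the choice to work in integer units of 1%, are my own assumptions for illustration.

```python
from fractions import Fraction


def steps_from_variance(start: Fraction, step: Fraction) -> Fraction:
    """Variance of the final belief divided by the variance of one step."""
    # The final belief is 1 with probability `start`, else 0 (martingale property),
    # so its variance is start * (1 - start).
    total_var = start * (1 - start)
    return total_var / step**2


def steps_from_recurrence(start_k: int, n: int) -> Fraction:
    """Exact expected steps for a +/-1 walk on 0..n, absorbed at 0 and n."""
    if not 0 < start_k < n:
        return Fraction(0)  # already at an absorbing boundary
    half = Fraction(1, 2)
    m = n - 1  # interior states 1..n-1
    # Tridiagonal system -1/2*E[k-1] + E[k] - 1/2*E[k+1] = 1,
    # solved with the Thomas algorithm (forward sweep, then back substitution).
    c = [Fraction(0)] * m
    d = [Fraction(0)] * m
    c[0], d[0] = -half, Fraction(1)
    for i in range(1, m):
        denom = 1 + half * c[i - 1]
        c[i] = -half / denom
        d[i] = (1 + half * d[i - 1]) / denom
    e = d[m - 1]
    for i in range(m - 2, start_k - 2, -1):
        e = d[i] - c[i] * e
    return e


by_variance = steps_from_variance(Fraction(1, 2), Fraction(1, 100))
by_recurrence = steps_from_recurrence(50, 100)
print(by_variance, by_recurrence)  # both are 2500
```

Exact fractions are used instead of floats so that the two answers can be compared with equality rather than a tolerance.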
I talked about this with Lawrence, and we both agree on the following:
When I spoke to him a few weeks ago (a week after he left OAI), he had not signed an NDA at that point, so it seems likely that he hasn't.
Also, another nitpick:
Humane vs human values
I think there's a harder version of the value alignment problem, where the question looks like, "what's the right goals/task spec to put inside a sovereign ai that will take over the universe". You probably don't want this sovereign AI to adopt the value of any particular human, or even modern humanity as a whole, so you need to do some Ambitious Value Learning/moral philosophy and not just intent alignment. In this scenario, the distinction between humane and human values does matter. (In fact, you c...
Also, I added another sentence trying to clarify what I meant at the end of the paragraph, sorry for the confusion.
No, I'm saying that "adding 'logic' to AIs" doesn't (currently) look like "figure out how to integrate insights from expert systems/explicit bayesian inference into deep learning", it looks like "use deep learning to nudge the AI toward being better at explicit reasoning by making small changes to the training setup". The standard "deep learning needs to include more logic" take generally assumes that you need to add the logic/GOFAI juice in explicitly, while in practice people do a slightly different RL or supervised finetuning setup instead.
(EDITED...
I think this is really quite good, and went into way more detail than I thought it would. Basically my only complaints on the intro/part 1 are some terminology and historical nitpicks. I also appreciate the fact that Nicky just wrote out her views on AIS, even if they're not always the most standard ones or other people dislike them (e.g. pointing at the various divisions within AIS, and the awkward tension between "capabilities" and "safety").
I found the inclusion of a flashcard review applet for each section super interesting. My guess is it probab...
I agree with many of the points made in this post, especially the "But my ideas/insights/research is not likely to impact much!" point. I find it plausible that in some subfields, AI x-risk people are too prone to publishing due to historical precedent and norms (maybe mech interp? though little has actually come of that). I also want to point out that there are non-zero arguments to expect alignment work to help more with capabilities, relative to existing "mainstream" capabilities work, even if I don't believe this to be the case. (For example, you might ...
While I've softened my position on this in the last year, I want to give a big +1 to this response, especially these two points:
...
- It's genuinely hard to come up with ideas that help capabilities a lot. I think you are severely underestimating how hard it is, and how much insight is required. I think one issue here is that most papers on arxiv are garbage and don't actually make any progress, but those papers are not the ones that are pushing AGI forward anyways.
- [..]
- High level ideas are generally not that valuable in and of themselves. People generally learn
I don't know what the "real story" is, but let me point at some areas where I think we were confused. At the time, we had some sort of hand-wavy result in our appendix saying "something something weight norm ergo generalizing". Similarly, concurrent work from Ziming Liu and others (Omnigrok) had another claim based on the norm of generalizing and memorizing solutions, as well as a claim that representation is important.
One issue is that our picture doesn't consider learning dynamics that seem actually important here. For example, it seems that one of...
I think the key takeaway I wanted people to get is that superposition is something novel and non-trivial, and isn't just a standard polysemantic neuron thing. I wrote this post in response to two interactions where people assumed that superposition was just polysemanticity.
It turned out that a substantial fraction of the post went the other way (i.e. talking about non-superposition polysemanticity), so maybe?
Also, have you looked at the dot product of each of the SAE directions/SAE reconstructed representations with the ImageNet labels fed through the text encoder?
Cool work!
As with Arthur, I'm pretty surprised by how much easier vision seems to be than text for interp (in line with previous results). It makes sense why feature visualization and adversarial attacks work better with continuous inputs, but if it is true that you need fewer datapoints to recover concepts of comparable complexity, I wonder if it's a statement about image datasets or about vision in general (e.g. "abstract" concepts are more useful for prediction, since the n-gram/skip n-gram/syntactical feature baseline is much weaker).
I think th...
My guess is it's <1 hour per task assuming just Copilot access, and much less if you're allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you'd want to limit humans to comparable amounts of compute for comparable numbers, which seems a bit trickier to make happen.