I do AI Alignment research. Currently at METR, but previously at: Redwood Research, UC Berkeley, Good Judgment Project.
I'm also a part-time fund manager for the LTFF.
Obligatory research billboard website: https://chanlawrence.me/
Makes sense, thanks!
For compute I'm using hardware we have locally at my employer, so I have not tracked what the equivalent cost of renting it would be, but I'd guess it would be of the same order of magnitude as the API costs, or a factor of a few larger.
It's hard to say because I'm not even sure you can rent Titan Vs at this point,[1] and I don't know what your GPU utilization looks like, but I suspect API costs will dominate.
An H100 box is approximately $2/hour/GPU and A100 boxes are a fair bit under $1/hour (see e.g. pricing on Vast AI or Shadeform). And even A100s are ridiculously better than a Titan V, in that they have 40 or 80 GB of memory and are (pulling a number out of thin air) 4-5x faster.
So if o1 costs $2 per task and it's 15 minutes per task, compute will be an order of magnitude cheaper. (Though as for all similar evals, the main cost will be engineering effort from humans.)
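As a rough sanity check, here's the back-of-envelope arithmetic behind that claim, using the illustrative figures above (a rented A100 at roughly $1/hour, ~15 minutes of single-GPU time per task, and ~$2 of o1 API per task; these are assumptions, not measured numbers):

```python
# Back-of-envelope cost comparison with the rough figures from this comment.
A100_PER_GPU_HOUR = 1.00     # assumed rental price for one A100 (~$1/hour)
TASK_MINUTES = 15            # assumed single-GPU wall-clock time per task
O1_API_COST_PER_TASK = 2.00  # assumed o1 API spend per task

gpu_cost_per_task = A100_PER_GPU_HOUR * TASK_MINUTES / 60  # ~$0.25
ratio = O1_API_COST_PER_TASK / gpu_cost_per_task           # ~8x

print(f"GPU compute per task: ${gpu_cost_per_task:.2f}")
print(f"o1 API per task:      ${O1_API_COST_PER_TASK:.2f}")
print(f"API / compute ratio:  ~{ratio:.0f}x")
```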
I failed to find an option to rent them online, and I suspect the best way I can acquire them is by going to UC Berkeley and digging around in old compute hardware.
This is really impressive -- could I ask how long this project took, how long each eval takes to run on average, and what you spent on compute/API credits?
(Also, I found the preliminary BoK vs 5-iteration results especially interesting, especially the speculation on reasoning models.)
(Disclaimer: have not read the piece in full)
If “reasoning models” count as a breakthrough of the relevant size, then I'd argue that there have been quite a few of these in the last 10 years: skip connections/residual stream (2015-ish), transformers instead of RNNs (2017), RLHF/modern policy gradient methods (2017ish), scaling hypothesis (2016-20 depending on the person and which paper), Chain of Thought (2022), massive MLP MoEs (2023-4), and now Reasoning RL training (2024).
I think the title greatly undersells the importance of these statements/beliefs. (I would've preferred either part of your quote or a call to action.)
I'm glad that Sam is putting in writing what many people talk about. People should read it and take these statements seriously.
Nit:
> OpenAI presented o3 on the Friday before Thanksgiving, at the tail end of the 12 Days of Shipmas.
Should this say Christmas?
I think writing this post was helpful to me in thinking through my career options. I've also been told by others that the post was quite valuable to them as an example of someone thinking through their career options.
Interestingly, I left METR (then ARC Evals) about a month and a half after this post was published. (I continued to be involved with the LTFF.) I then rejoined METR in August 2024. In between, I worked on ambitious mech interp and did some late stage project management and paper writing (including some for METR). I also organized a mech interp workshop at ICML 2024, which, if you squint, counts as "onboarding senior academics".
I think leaving METR was a mistake ex post, even if it made sense ex ante. I think my ideas around mech interp when I wrote this post weren't that great, even if I thought the projects I ended up working on were interesting (see e.g. Compact Proofs and Computation in Superposition). While the mech interp workshop was very well attended (e.g. the room was so crowded that people couldn't get in due to fire code) and pretty well received, I'm not sure how much value it ended up producing for AIS. Also, I think I was undervaluing the resources available to METR as well as how much I could do at METR.
If I were to make a list for myself in 2023 using what I know now, I'd probably have replaced "onboarding senior academics" with "get involved in AI policy via the AISIs", and instead of "writing blog posts or takes in general", I'd have the option of "build common knowledge in AIS via pedagogical posts". Though realistically, knowing what I know now, I'd have told my past self to try to better leverage my position at METR (and provided him with a list of projects to do at METR) instead of leaving.
Also, I regret both that I called it "ambitious mech interp", and that this post became the primary reference for what this term meant. I should've used a more value-neutral name such as "rigorous model internals" and written up a separate post describing it.
I think this post made an important point that's still relevant to this day.
If anything, this post is more relevant in late 2024 than in early 2023, as the pace of AI makes ever more people want to be involved, while more and more mentors have moved towards doing object level work. Due to the relative reduction of capacity for evaluating new AIS researchers, there's more reliance on systems or heuristics to evaluate people now than in early 2023.
Also, I find it amusing that without the parenthetical, the title of the post makes another important point: "evals are noisy".
I think this post was useful in the context it was written in and has held up relatively well. However, I wouldn't actively recommend it to anyone as of Dec 2024 -- both because the ethos of the AIS community has shifted, making posts like this less necessary, and because many other "how to do research" posts were written that contain the same advice.
This post was inspired by conversations I had in mid-late 2022 with MATS mentees, REMIX participants, and various bright young people who were coming to the Bay to work on AIS (collectively, "kiddos"). The median kiddo I spoke with had read a small number of ML papers and a medium amount of LW/AF content, and was trying to string together an ambitious research project from several research ideas they recently learned about. (Or, sometimes they were assigned such a project by their mentors in MATS or REMIX.)
Unfortunately, I don't think modern machine learning is the kind of field where you can string together several ideas you just learned about and expect the research to consistently work out of the box. Many high-level claims even in published research papers are just... wrong, it can be challenging to reproduce results even when they are right, and even techniques that work reliably may not work for the reasons people think they do.
Hence, this post.
I think the core idea of this post held up pretty well with time. I continue to think that making contact with reality is very important, and I think the concrete suggestions for how to make contact with reality are still pretty good.
If I were to write it today, I'd probably add a fifth major reason for why it's important to make quick contact with reality: mental health/motivation. That is, producing concrete research outputs, even small ones, feels pretty essential to maintaining motivation for the vast majority of researchers. My guess is I missed this factor because I focused on the content of research projects, as opposed to the people doing the research.
Over the past two years, and especially in 2024, the ethos of the AIS community has shifted substantially toward empirical work.
The biggest part of this is because of the pace of AI. When this post was written, ChatGPT was a month old, and GPT-4 was still more than 2 months away. People both had longer timelines and thought of AIS in more conceptual terms. Many conceptual research projects of 2022 have fallen into the realm of the empirical as of late 2024.
Part of this is due to the rise of (dangerous capability) evals as a major AIS focus in 2023, which is both substantially more empirical compared to the median 2022 AIS research topic, and an area where making contact with reality can be as simple as "pasting a prompt into claude.ai".
Part of this is due to Anthropic's rise to being the central place for AIS researchers. "Being able to quickly produce ML results" is a major part of what it takes to get hired there as a junior researcher, and people know this.
Finally, there have been a decent number of posts or write-ups giving the same advice, e.g. Neel's written advice for his MATS scholars and a recent Alignment Forum post by Ethan Perez.
As a result, this post feels much less necessary or relevant in late December 2024 than in December 2022.
Evan joined Anthropic in late 2022 no? (Eg his post announcing it was Jan 2023 https://www.alignmentforum.org/posts/7jn5aDadcMH6sFeJe/why-i-m-joining-anthropic)
I think you’re correct on the timeline. I remember Jade/Jan proposing DC Evals in April 2022 (which was novel to me at the time), and Beth starting METR in June 2022, and I don’t remember there being such teams actually doing work (at least not publicly known) when she pitched me on joining in August 2022.
It seems plausible that Anthropic’s scaring laws project was already underway before then (and this is what they’re referring to), but proliferating QA datasets feels qualitatively different from DC Evals. Also, they were definitely doing other red teaming, just none that seems to be DC Evals.
My guess is it's <1 hour per task assuming just copilot access, and much less if you're allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you'd want to limit humans to comparable amounts of compute to get comparable numbers, which seems a bit trickier to make happen.
Is there a reason you can't do one of the existing tasks, just to get a sense of the difficulty?