I created my first web app in under 2 hours, using Claude Code with Opus 4.5. It is good. It is very, very good. If you haven't already, you should immediately pay £20 for a month of access and try it (and if you can't do it right now, create a reminder/task to do it).
If you're interested, see my post for details on the process, reflections, and a link to the app itself! https://lovkush.substack.com/p/i-created-my-first-web-app-in-under
Thanks for sharing this experience! Maybe I should create a short form to send to previous ARENA participants to get some aggregate stats.
1 and 2. This is subjective and more a gut feeling, but I think doing ARENA after having done LASR or MATS is not a good use of time, especially against the counterfactual of doing 4 research sprints. In my mind (and without more context), doing MATS and then ARENA would be a counter-signal: "How come you need to do ARENA after having done MATS?"
3. See James Lester's reply.
My main disagreement is that it does not deal with the reality of who actually participates in ARENA. If the AI Safety community could magically coordinate perfectly, then ARENA would serve the role you're describing, but as of now, I think the participants who do ARENA are better served by doing research sprints rather than the ARENA notebooks. See the comment by sturb below for one participant's perspective.
> If you were super disciplined and you took one day every two weeks to work through one notebook, you'd spend most of a year just to qualify for the program
I believe: 1) you don't need to diligently work through a whole notebook to get most of its value, and 2) the majority of ARENA's value is contained in a subset of the notebooks. Some reasons:
1a) The notebooks are often, by design, far more work than is possible to do in a day, even in ARENA, where you have pair programming, TAs on hand, a great co-working space, lunch and dinner provided, etc. Note that a 'day' here is roughly 5.5-6.5 hours (10am to 6pm, with a morning lecture at 10, lunch at 12, and a break at 3.30).
1b) Even the shorter notebooks are often only manageable to complete in a day if you skip some exercises, or cheat and peek at the solutions. (This is recommended by ARENA, and I agree with this recommendation given the time constraints.)
1c) There are 5 (or 6) LARGE mech interp notebooks for the final three days of mech interp week. One recommendation is to try two notebooks on the Wed and Thu, then continue with the one you prefer on Friday. So I saw 2 out of the 5 notebooks when I participated in ARENA. Despite this, I was still able to TA during mech interp. It was a bit frantic, but I would skim the start of each of the notebooks I didn't understand, enough that I could help people get unblocked or explain key ideas. I feel I got a good percentage of the value that the ARENA participants got out of those other notebooks without having done a single exercise in them.
2a) In ARBOx2, the schedule was (commas separate different days):
- Week 1: CNNs, Transformers, Induction circuit, IoI circuit, [another mech interp notebook; can't remember which, likely SAEs]
- Week 2: RL day 1, RL day 2, project, project, project.
The biggest thing missing from this, IMO, is the theory of impact exercise from the second half of ARENA evals day 1. Otherwise, for the calibre of people doing ARENA, a quick skim of the other notebooks gives the majority of the value.
I would recommend ARBOx over ARENA because of the time efficiency: you get a high percentage of ARENA's value in 40% of the time.
> most other programs focus more on research than developing ML skills
I don't think ARENA focusses on ML skills. Week 0 has content on supervised ML, and covers only a small (but crucial!) part of ML, namely writing networks in PyTorch and creating training loops. Week 2 has content on RL. But given time constraints, many other parts of ML aren't covered in depth, e.g. how to do hyper-parameter tuning (most of the time you just use the hyper-parameters provided; there's no time to actually tune them), how to even tell if hyper-parameters are the issue, data collection and cleaning, cluster management, selecting GPUs, etc.
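To make concrete what that "small (but crucial!) part" looks like, here is a minimal sketch of the kind of pattern week 0 teaches (my own illustrative example, not ARENA's actual code; the architecture, random data, and hyper-parameters are all arbitrary):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative only: a tiny MLP plus the standard training loop.
# Random data stands in for a real dataset like MNIST.
model = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr chosen arbitrarily
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(512, 28 * 28)
y = torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

for epoch in range(3):
    for xb, yb in loader:
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(model(xb), yb)  # forward pass
        loss.backward()                # backprop
        optimizer.step()               # gradient update
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```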
> The first is the counterfactual where participants aren't selected for ARENA: do they then go on to do good things?
This is not a crux for me. I believe ARENA provides counterfactual value compared to not doing ARENA. You work much harder during ARENA than you otherwise would, in a great environment, with great support, etc.
> The second is the counterfactual where people spend 4 weeks doing research sprints.
This is a crux. And agreed, it is hard to measure!
Thanks for engaging thoughtfully. Useful to think things through.
Thanks, James, for the detailed thoughts and for reading through the post. I'll respond once here. If we want further back and forth, it's better to have a chat in private so we can iron out our cruxes (and then summarize for community benefit). I'd also want to hear what others in the community think before committing to anything.
> Because ARENA's main bottleneck to scaling hasn't really been TAs
I am happy to defer to you regarding the scaling bottlenecks of ARENA. That's not a big crux for the proposal.
> I'm confused about the evidence given that ARENA functions primarily as a signaling mechanism
Maybe the word signaling isn't correct. Let me try to explain. When I point out that there are four people who did ARBOx and are now doing elite fellowship programs, my hunch is that those four had a very good chance of getting into those elite programs even if they hadn't done ARBOx. Furthermore, if ARBOx did provide a significant boost to their profile/skillset, then one needs to consider how much extra value the extra three weeks at ARENA are providing. Another way of saying this is that ARBOx, ARENA, and these elite programs have similar selection processes, so ARENA or ARBOx accepting someone is strongly correlated with them having high potential for future AI safety research, regardless of how much value the programs add on top.
> I also think having a less-structured, more research-driven model would probably require more TA involvement? If you wanted to do it well and have it be accessible to early-stage people, at least.
I do not consider participants of ARENA to be 'early stage'. In my mind they are mid-stage (i.e., in the middle of upskilling towards a full-time researcher role), and most participants would be able to do solid research sprints without having gone through ARENA. My proposal is based on helping such mid-stage researchers. I think something like BlueDot (at least, BlueDot in 2024; I don't know about current BlueDot) or AISC targets early-stage researchers.
> Also we do put out impact reports where people self-assess as having improved their ability
My claim (which I have not really justified, except to defer to Neel Nanda's post) is that the counterfactual of doing four mini research sprints would be significantly higher impact. This could be the central crux.
> Side note that I would love to see more reports like this from other programs in the ecosystem, and on a more regular basis
100%. Thanks for doing this and being a role model!
I'm not sure it presented itself as fact, but it definitely blurs the line and it's not obvious. I still found the stories highly engaging!
Can recommend reading Labatut's "When We Cease to Understand the World"!
"Today, we're [Goodfire] excited to announce a $150 million Series B funding round at a $1.25 billion valuation." https://www.goodfire.ai/blog/our-series-b
Is my instinct correct that this is a big deal? How does $150 million compare to the total funding for all other interp research?