How do labs working at or near the frontier assess major architecture and/or algorithm changes before committing huge compute resources to try them out? For example, how do they assess stability and sample efficiency without having to do full-scale runs?
I am not an AI researcher, nor do I have direct access to any AI research processes. So, instead of submitting an answer, I am writing this in the comment section.
I have one definite, easily shareable observation. From it I drew a lot of inferences, which I will separate out so that the reader can condition their world-model on their own interpretation of the evidence, including whatever pieces of it - if any - go unshared here.
The observation is this interview, in this particular segment, with the part that seems most relevant to me occurring at roughly the 40:15 timestamp.
So, in this segment, Dwarkesh asks Sholto Douglas, a researcher at Google DeepMind, a sub-question in a discussion about how researchers see the feasibility of "The Intelligence Explosion."
The intent of this question seems to be to get an object-level description of the workflow of an AI researcher, in order to inform the meta-question of "how is AI going to increase the rate of AI research."
A potentially important additional detail: the other person at the table is Trenton Bricken, a "member of Technical Staff on the Mechanistic Interpretability team at Anthropic" (description according to his website).
Sholto alludes to the fact that the bulk of his work at the time of this interview is not directly relevant to the question, so he seems to be answering for a more generic case of AI researcher.
Sholto's description of his work, excerpted from the "About" section of his blog hosted on GitHub:
In this segment of the podcast, Sholto talks about "scaling laws inference", seemingly alluding to the fact that researchers have some compute budget for running experiments, and that there are agreed-upon desiderata for the metrics of those experiments, which can be used when selecting which features go into the programs that are then given much larger training runs.
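To make that idea a bit more concrete, here is a minimal sketch of what scaling-laws extrapolation of that flavor might look like. This is my own illustration, not anything from the interview: the power-law form, the variant names, and every number below are assumptions chosen only to show the shape of the workflow - fit curves to a handful of small runs, then extrapolate each candidate to the compute budget of the big run you are deciding about.

```python
# Hypothetical sketch: fit a power law to losses from small runs of two
# candidate variants and extrapolate to a much larger compute budget.
# All data here is invented for illustration.
import numpy as np
from scipy.optimize import curve_fit

def power_law(c, a, b, irreducible):
    """Loss as a function of training compute: L(C) = a * C^-b + irreducible."""
    return a * np.power(c, -b) + irreducible

# Compute (arbitrary units) and measured losses from a handful of small runs.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])
loss_baseline = np.array([3.10, 2.85, 2.62, 2.44, 2.28])
loss_variant = np.array([3.05, 2.78, 2.53, 2.33, 2.16])

target_compute = 1e21  # the scale of the run you would actually commit to

for name, losses in [("baseline", loss_baseline), ("variant", loss_variant)]:
    params, _ = curve_fit(power_law, compute, losses,
                          p0=[50.0, 0.1, 1.5], maxfev=10000)
    projected = power_law(target_compute, *params)
    print(f"{name}: projected loss at {target_compute:.0e} units of compute = {projected:.3f}")
```

Whether real labs use this exact functional form or something richer, the selection step Sholto gestures at presumably looks like comparing projections of this kind against the agreed-upon metrics before anyone commits the large-run compute.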
How do the researchers get this compute budget? Do all researchers have some compute resources available beyond just their personal workstation hardware? What does the process look like for spinning up a small-scale training run and reporting its results?
I am unsure, but I will draw some guesses from context.
Sholto mentions, in providing further context in this segment:
He continues to give a few sentences that seem to gesture at a part of this internal process:
This seems to imply that a part of this process is receiving some "mission" or set of "missions" (my words, not theirs; you could say quests or tasks or assignments), and then some group(s) of researchers propose and run small-scale tests of candidate solutions to those.
Does this involve taking snapshots of these models at the scale where "behaviors or issues" appear and branching them to run shorter, lower-compute continuations of training/reinforcement learning?
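For illustration only, here is a rough sketch of what that kind of branching could look like mechanically. None of this comes from the interview: the toy model, the random data, and the choice of learning rate as the thing being varied are all stand-ins. The point is only that a saved checkpoint can be continued cheaply under several modified configurations and the resulting metrics compared.

```python
# Hypothetical sketch: restore a checkpoint, tweak one setting, and run a
# short, cheap continuation to see how a metric moves. Model and data are
# placeholders, not anything a real lab is known to use.
import copy
import torch
import torch.nn as nn

def branch_and_continue(checkpoint, lr_override, steps=100):
    """Continue training from a snapshot under a modified config."""
    model = nn.Linear(512, 512)                # stand-in for the real model
    model.load_state_dict(checkpoint["model"])
    opt = torch.optim.AdamW(model.parameters(), lr=lr_override)

    losses = []
    for _ in range(steps):
        x = torch.randn(32, 512)               # stand-in for a real data batch
        loss = nn.functional.mse_loss(model(x), x)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

# Branch the same snapshot twice with different settings and compare.
base = nn.Linear(512, 512)
ckpt = {"model": base.state_dict()}
run_a = branch_and_continue(copy.deepcopy(ckpt), lr_override=1e-3)
run_b = branch_and_continue(copy.deepcopy(ckpt), lr_override=3e-4)
print(f"final losses: {run_a[-1]:.4f} vs {run_b[-1]:.4f}")
```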
Presumably this list of "grand problems" may include some items like:
Possibly the "behaviors and issues" which occur "when you scale" include:
Sholto continues:
This seems to imply that these AI labs have put their finger on the problem that doing work in large teams/titled sub-projects introduces a lot of friction. This could just be Sholto's take on the ideal way to run an AI lab, possibly informed by AI labs not actually working this way, but I presume Google DeepMind, at least, has a culture that tries to keep individual researchers from having much to grumble about in terms of organizational stuff slowing down their projects. It seems to me that Sholto is right that it is much faster to do more in "parallel", where individual researchers can work on these sub-problems without having to organize a meeting, submit paperwork, and write memos to three other teams to get access to relevant pieces of their work.
The trio continues to talk about the meta-level question, and the sections relevant to "what does AI research look like" return to being about as diffuse as you might expect from a conversation that is two-thirds AI researchers and focused on topics associated with AI research.
One other particular quote may be relevant to people drawing inferences. Dwarkesh asks:
Sholto:
Dwarkesh:
Sholto:
Dwarkesh goes on to ask why, if this is such a massive bottleneck, labs aren't reallocating some of the compute they spend on large runs and serving clients to experiments.
Sholto replies:
What does this actual breakdown look like within DeepMind? Well, obviously Sholto doesn't give us details about that. If you get actual first-hand details about the allocation of compute budgets from this question, I'd be rather surprised...
Well, actually, not terribly surprised. These are modern AI labs, not Eliezer's fantasy-football AI lab from Six Dimensions Of Operational Adequacy. Someone may just DM you with a more detailed breakdown of what things look like on the inside. I doubt anyone will answer publicly in a way that could be tied back to them, though; that would probably breach a bunch of clauses in a bunch of contracts and get them into actual serious trouble.
What do I infer from this?
Well, first, you can watch the interview and pick up the rhythm. When I've done that, I get the impression that there are some relatively independent researchers who work under the umbrella of departments that have some amount of compute budgeted to them. It seems likely to me that this compute is not budgeted as strictly as, say, timeslots on orbital telescopes, so an individual researcher can have a brilliant idea one day and just go try it using some very small fraction of their organization's compute for a short period of time. Above a certain threshold of experiment size, though, you probably have to make a strong case to whoever handles compute budgeting in order to get the compute-time for experiments of that scale.
Does that level of friction in the compute available to individual researchers account for the "0.5 elasticity" Sholto was talking about? I'm not sure. Plausibly there is no "do whatever you want with this" compute budget for individual researchers beyond what is plugged into their individual workstations. This would surprise me, I think? It seems like a dumb decision if you take the picture Sholto was sketching of how progress gets made at face value. Still, it also seems like a characteristically dumb decision for a large organization, where trying really hard to have every resource expenditure accounted for ahead of time means that intangibles like "the ability to just go try stuff" get squashed by considerations like "are we utilizing all of our resources with maximum efficiency?"
Hopefully this interview and my analysis are helpful in answering this question. I could probably discuss more, but I've noticed this comment is already rather long, and my brain is telling me that further writing would likely just be meandering and hand-waving.
If there is more content relevant to this discussion to be mined from this interview, perhaps others will be able to iterate on my attempt and help flesh out the parts that seem easy to update our models on.
Thanks! It's no problem :)
Agreed that the interview is worth watching in full for those interested in the topic. I don't think it answers your question in full detail, unless I've forgotten something they said, but it is evidence.
(Edit: Dwarkesh also posts full transcripts of his interviews to his website. They aren't obviously machine-transcribed or anything; they read more like what you'd expect from a transcribed interview in a news publication. You'll lose some of the body language/tone details from the video interview, but it may be worth it for some people, since most ...