How do labs working at or near the frontier assess major architecture or algorithm changes before committing large amounts of compute to them? For example, how do they evaluate stability and sample efficiency without doing full-scale runs?
Dwarkesh's interview with Sholto sounds well worth watching in full, but the segments you've highlighted and your analyses are very helpful on their own. Thanks for the time and thought you put into this comment!