I'm doing a physics PhD, and you're making me feel better about my coding practices. I appreciate the explicit example as well, since I'm interested in trying my hand at ML research and curious what it looks like in terms of toolsets and the typical sort of thing one works on. I want to chime in down here in the comments to assure people that at least one horrible coder in a field which (most of the time) has nothing to do with machine learning thinks the sentiment of this post is true. I admit I'm biased by having very little formal CS training, so proper functional programming is harder for me than just writing whatever has worked in my past ad-hoc Bash scripts. My sister is a professional software developer, and she winces horribly at my code. But as you point out, any particular piece of research code usually has a particular linear set of tasks to achieve, and so:
As an example of the Good/Good-enough divide, here's a project I'm working on. I'm doing something which requires speed, so I'm using C++ code built on top of old code someone else wrote. I'm extremely happy that the previous researcher did not follow your advice, at least when they cleaned up the code for publishing, because it makes my life easier to have most of the mechanics of my code hidden away out of view. Their code defines a bunch of custom types which rather intuitively match certain physical objects, and they wrote a function which parses arg files so that you don't need to recompile the code to rerun a calculation with different physical parameters. Then there's my code, which uses all of that machinery: my main function is obviously a nest of loops over discrete tasks which could easily be separate functions, but I just throw them all together into one file, and I rewrite the whole file for different research questions, so I have a pile of "main" files which reuse a ton of structure.

As an example of a really ugly thing I did: I hard-code the indices corresponding to the momenta I want to study into the front of my program, instead of writing a function which parses momenta and providing an argument file listing the sets I want. I might have done that for the sake of prettiness, but I needed a structure which lets me easily find momenta of opposite parity. Hard-coding the momenta let me keep that structure at the front of my mind when I created the four other subtasks in the code which exploit it to easily find objects of opposite parity.
Can't agree more with this post! I used to be afraid of long notebooks but they are powerful in allowing me to just think.
Although while creating a script, I tend to use VS Code's "#%%" cell markers to run cells inside the script to test stuff. My notebooks usually contain a bunch of analysis code that doesn't need to be run, but should stay.
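(For anyone who hasn't tried this, here's a minimal sketch of what those cell markers look like in a plain .py file - it assumes the VS Code Python/Jupyter extensions, and the code itself is just placeholder.)

```python
# %% Load some data (run this cell interactively in VS Code)
import numpy as np

data = np.random.default_rng(0).normal(size=1000)

# %% Quick sanity-check plot; can be rerun on its own without rerunning the cell above
import matplotlib.pyplot as plt

plt.hist(data, bins=50)
plt.show()
```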
Thanks! Huh, yeah, the Python interactive window seems like a much cleaner approach - I'll give it a try.
At the start of my PhD 6 months ago, I was generally wedded to writing "good code". The kind of "good code" you learn in school and standard software engineering these days: object-oriented, DRY, extensible, well-commented, and unit-tested. I like writing "good code" - in undergrad I spent hours on relatively trivial assignments, continually refactoring to construct clean and intuitive abstractions. Part of programming's appeal is this kind of aesthetic sensibility - there's a deep pleasure in constructing a pseudo-platonic system of types, objects, and functions that all fit together to serve some end. More freedom than math, but more pragmatism than philosophy. This appreciation for the "art" of programming can align with more practical ends - beauty often coincides with utility. Expert programmers are attracted to Stripe because of their culture of craftsmanship, and Stripe promotes a culture of craftsmanship (in part, presumably) because "building multi-decadal abstractions" (in the words of Patrick Collison) is useful for the bottom line.
And this is all well and good, if (as is more often the case than not) you expect your code to be used in 1, 2, or 10 years. But in research, this is often not the case! Projects typically last on the order of weeks to months, not years to decades. Moreover, projects typically involve a small number (often 1 or 0) of highly involved collaborators, as opposed to the large, fragmented teams typical of industry. And speed is paramount: research is a series of bets, and you want to discover the outcome of each bet as fast as possible. Messy code might incur technical debt, but you don't have to pay it if you scrap the entire project.
I had heard advice like this going into my PhD, both in the context of research and of product development generally (MVP, Ballmer Peak, etc.). It took me a while to internalize it though, in part, I suspect, because there's an art to writing "messy" code too. Writing error-prone spaghetti code is not the answer - you still need stuff to work in order to get results quickly. The goal is to write good-enough code, efficiently, but learning what "good enough" means is a skill unto itself.
Principles for Good-Enough Code
Below is a first pass at some guiding principles. I focused on ML research in Python, but I suspect the lessons are generalizable.
This is the kind of advice that's horrible for freshman CS students, but probably helpful for first-year PhD students[1]. Having everything in one place increases context - you can just read the program logic without having to trace through various submodules and layers of abstraction. It also encourages you to constantly review code which might otherwise be tucked away, naturally helping you to catch errors, identify improvements, or notice additional axes of variation in the system.
Again, all this advice assumes a baseline of "standard software engineering practices" - I want to help cure you of deontic commitments like never repeating yourself. But if you don't need curing in the first place, you should probably reverse this advice.
My ML Research Workflow
With these principles in mind, I'll walk through my current research workflow. My goal is to fluidly transition back and forth between a rough experimental notebook and a full experiment pipeline with tracking, sweeps, and results visualization.
Initialize an empty Python project with a project-specific virtual environment (I'd recommend poetry, which makes dependency and virtual environment management really seamless - dependency hell is a great way to get slowed down).
Install bare-minimum dependencies - numpy, pandas, matplotlib, torch, and (to use a Jupyter notebook) ipykernel.
Using the config, set up a simple experiment tracking system. In general, start with a datetime-based system rather than config-specific directories - your code and configs will change a lot early on, config file names can get long, and you don't want to overwrite old experiments after making changes. Do make sure to log the serialized config in the experiment directory, though.
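For concreteness, here's roughly what that can look like - a minimal sketch rather than my exact code, with a placeholder ExperimentConfig and directory layout:

```python
import dataclasses
import json
from datetime import datetime
from pathlib import Path


@dataclasses.dataclass
class ExperimentConfig:
    lr: float = 1e-3
    batch_size: int = 64
    seed: int = 0


def make_experiment_dir(config: ExperimentConfig, root: str = "experiments") -> Path:
    # Datetime-named directory: old runs never get overwritten,
    # even as the config schema changes underneath you.
    exp_dir = Path(root) / datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    exp_dir.mkdir(parents=True, exist_ok=True)
    # Log the serialized config alongside the results.
    (exp_dir / "config.json").write_text(json.dumps(dataclasses.asdict(config), indent=2))
    return exp_dir
```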
Eventually though, you'll want to run experiment sweeps, typically on a shared cluster managed by Slurm. This requires submitting a Slurm job - consisting of required resources and a command to execute. Since we can't run notebooks directly, I use nbconvert to convert my experiment notebook into a runnable script:
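(Something along these lines - exp.ipynb is a placeholder name here, and the command-line equivalent `jupyter nbconvert --to script exp.ipynb` works just as well.)

```python
# Convert the experiment notebook into a plain .py script using nbconvert's Python API.
from nbconvert import PythonExporter

source, _ = PythonExporter().from_filename("exp.ipynb")
with open("exp.py", "w") as f:
    f.write(source)
```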
With this infrastructure in place, we can execute experimental sweeps. Typical practice is to create bash scripts with different settings, but preferring to work in Python, I create a separate notebook (exp_sweeps.ipynb) in which I construct experiment configs containing a subset of the full configuration parameters (remember, you are the user - you don't need to enforce this subset with inheritance or type checks).
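In exp_sweeps.ipynb that construction might look something like the following sketch - the swept parameters and their values are purely illustrative:

```python
import itertools
from dataclasses import dataclass


@dataclass
class SweepConfig:
    # Only the parameters being swept - no need to mirror the full config.
    lr: float
    batch_size: int
    seed: int


# Cartesian product over the axes of variation.
sweep = [
    SweepConfig(lr=lr, batch_size=bs, seed=seed)
    for lr, bs, seed in itertools.product([1e-4, 1e-3], [32, 128], range(3))
]
```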
After constructing a list of experiment objects, I use submitit to launch experiments programmatically, converting the experiment configs to command line arguments:
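(Again a sketch rather than my exact code - submitit's AutoExecutor and CommandFunction are the real APIs, but the partition name, resources, and flag names below are placeholders.)

```python
import submitit
from submitit.helpers import CommandFunction

executor = submitit.AutoExecutor(folder="slurm_logs")
executor.update_parameters(
    timeout_min=120,
    slurm_partition="gpu",  # placeholder partition name
    gpus_per_node=1,
)

jobs = []
for cfg in sweep:
    # Convert each sweep config into command-line arguments for the converted script.
    cmd = CommandFunction([
        "python", "exp.py",
        f"--lr={cfg.lr}",
        f"--batch_size={cfg.batch_size}",
        f"--seed={cfg.seed}",
    ])
    jobs.append(executor.submit(cmd))
```

submitit also offers executor.map_array if you'd rather batch everything into a single Slurm job array.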
Once the experiments are completed, I load and analyze the results using the same experiment objects. In this way, data generation and analysis are tightly coupled - paper figures are defined in the same notebook where experiments are run.
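For example - a sketch which assumes each run wrote a results.json next to its config.json, with placeholder metric names:

```python
import json
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Collect (config, result) pairs from the datetime-named experiment directories.
records = []
for exp_dir in sorted(Path("experiments").iterdir()):
    config = json.loads((exp_dir / "config.json").read_text())
    results = json.loads((exp_dir / "results.json").read_text())
    records.append({**config, **results})

df = pd.DataFrame(records)

# A paper figure, defined right next to the code that launched the sweep.
for bs, group in df.groupby("batch_size"):
    plt.plot(group["lr"], group["final_loss"], marker="o", label=f"batch size {bs}")
plt.xscale("log")
plt.xlabel("learning rate")
plt.ylabel("final loss")
plt.legend()
plt.show()
```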
Mileage on this exact setup may vary, but thus far I've found it strikes a great balance between flexibility and efficiency. Most significantly, my "ugh field" around moving from a local experimental notebook to submitting cluster jobs has been substantially reduced.
Conclusion
So yeah, those are my tips and basic setup. Again, they apply most strongly to early-stage research, and most weakly to developing large, comprehensive pieces of infrastructure (including research infrastructure like PyTorch, Hugging Face, and Transformer-lens). In some sense, the core mistake is to assume that early-stage research requires novel, extensive research infrastructure[2]. Developing open-source infrastructure is, to a first approximation[3], prosocial: the gains largely accrue to other users. So by all means, develop nice open-source frameworks - the world will benefit from your work. But if you have new research ideas that you're eager to try out, the best approach is often to just try them ASAP.
Related Articles / Sources of Inspiration
I was initially shocked by how “messy” this GPT training script was - now I think it's the Way
This meme has been propagated to a certain extent by big labs, who make the (true) point that research engineers (and the infrastructure they produce) dramatically accelerate research progress. But this can be true while it is also the case that, for a small research team with a limited budget, myopically pursuing results is a better bet.
Reputational gains aside.