At the start of my Ph.D. 6 months ago, I was generally wedded to writing "good code". The kind of "good code" you learn in school and standard software engineering these days: object oriented, DRY, extensible, well-commented, and unit tested. I like writing "good code" - in undergrad I spent hours on relatively trivial assignments, continually refactoring to construct clean and intuitive abstractions. Part of programming's appeal is this kind of aesthetic sensibility - there's a deep pleasure in constructing a pseudo-platonic system of types, objects, and functions that all fit together to serve some end. More freedom than math, but more pragmatism than philosophy. This appreciation for the "art" of programming can align with more practical ends - beauty often coincides with utility. Expert programmers are attracted to Stripe because of their culture of craftsmanship, and Stripe promotes a culture of craftsmanship (in part, presumably) because "building multi-decadal abstractions" (in the words of Patrick Collison) is useful for the bottom line.

And this is all well and good if (as is usually the case) you expect your code to be used for 1, 2, or 10 years. But in research, this is often not the case! Projects typically last on the order of weeks to months, not years to decades. They also typically involve a small number (often 1 or 0) of highly involved collaborators, as opposed to the large, fragmented teams typical of industry. And above all, speed is paramount. Research is a series of bets, and you want to discover the outcome of each bet as fast as possible. Messy code might incur technical debt, but you don't have to pay it back if you scrap the entire project.

I had heard advice like this going into my PhD, both in the context of research and product development generally (MVP, Ballmer Peak, etc). It took me a while to internalize it though, in part, I suspect, because there's an art to writing "messy" code too. Writing error-prone spaghetti code is not the answer - you need stuff to work quickly to get results quickly. The goal is to write good enough code, efficiently, but learning what good enough means is a skill unto itself.

Principles for Good-Enough Code

Below is a first pass at some guiding principles. I focused on ML research in Python, but I suspect the lessons are generalizable.

  1. Future-t you is the target user
    • where t is something like an exponential distribution with median 1 day.
  2. Minimize indirection - Have as much of the code as is reasonable in a single notebook
    • This is the kind of advice that’s horrible for freshman CS students, but probably helpful for first-year PhD students.[1] Having everything in one place increases context - you can just read the program logic, without having to trace through various submodules and layers of abstraction. It also encourages you to constantly review code which otherwise might be tucked away, naturally helping you to catch errors, identify improvements, or notice additional axes of variation in the system.

  3. Only refactor when you need to - but always refactor when you need to
    • When you’ve had the impulse two or three times to pull something out into a separate function or object, do it. As a default though, be very suspicious of coding activity that isn’t directly doing the thing.
  4. Use your context
    • Lots of best practices in software engineering revolve around providing context and constraints to future readers and editors, in the form of comments, assertions, documentation, and type checks. Some of this will be useful for you 5 minutes after writing it, but a lot of it won't. So don't worry too much about using enums instead of strings, adding docstrings for each function, and avoiding magic numbers. Semantic naming and sparse commenting are useful enough.
  5. Copy and paste is your friend
    • In keeping with minimizing indirection, it's often better to reuse a component by copying and pasting rather than sharing it across functions/scripts. Not only does this improve context, but it also promotes decoupling. If you end up needing to modify the component for a particular use case, you can do so without worrying about how the change will affect functionality elsewhere (the converse of this is that if you want to make the same modification everywhere, you have to do it twice, so, as always, user discretion is required).
  6. You're still allowed to think - slow is smooth and smooth is fast
    • When told to prioritize speed in coding, we often imagine the rogue hacker, whizzing away at a terminal, no time wasted without a keystroke. And sure, maybe 10x engineers operate something like this. But for mere mortals, it's important to remember that you can still do a bit of planning before getting to work. For me, planning usually takes the form of pseudo-code comments, but a little diagram sketching and rubber ducking won't hurt either. The key is to efficiently execute an imperfect plan - and this requires having an imperfect plan to begin with.
  7. Avoid unit tests - at least early on
    • The most obvious case of trading reliability for speed. In research, you should be constructing your code incrementally, running it at each step in a REPL or notebook. By the time you're done, you've basically covered the central use case (running the full experiment script), and don't have to worry about arbitrary users exploiting weird edge cases. You are the target user. And running the script is often the only (integration) test you need (do check tensor shapes though, ML debugging is hard and all).
  8. Use an LLM
    • This should be obvious. As of December 14th 2024, I'd recommend Cursor with Sonnet 3.5 (though I occasionally use o1 to work through some math).

Again, all this advice assumes a baseline of "standard software engineering practices" - I want to help cure you of deontic commitments like never repeating yourself. But if you don't need curing in the first place, you should probably reverse this advice.
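
Principle 7 above suggests checking tensor shapes in place of unit tests. Here is a minimal sketch of what that can look like. The helper name `check_shape` is hypothetical (it is not from any library), and it assumes only that the object has a `.shape` attribute, as torch tensors and numpy arrays do:

```python
from types import SimpleNamespace


def check_shape(x, expected, name="tensor"):
    """Assert that x.shape matches `expected`; a None entry matches any size."""
    actual = tuple(x.shape)
    ok = len(actual) == len(expected) and all(
        e is None or e == a for a, e in zip(actual, expected)
    )
    assert ok, f"{name}: expected shape {expected}, got {actual}"
    return x  # returned so the check can be dropped inline into an expression


# Stand-in for a real tensor so the sketch runs without torch installed.
batch = SimpleNamespace(shape=(32, 3, 224, 224))
check_shape(batch, (None, 3, 224, 224), name="images")
```

Dropping a line like this after each transform or model call in the notebook is usually enough to catch silent broadcasting mistakes early.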

My ML Research Workflow

With these principles in mind, I'll walk through my current research workflow. My goal is to fluidly transition back and forth from a rough experimental notebook to a full experiment pipeline with tracking, sweeps, and results visualization.

  • Initialize an empty python project with a project-specific virtual environment (I’d recommend poetry, which makes dependency and virtual environment management really seamless - dependency hell is a great way to get slowed down)

    mkdir my-project
    cd my-project
    mkdir my_project
    touch my_project/__init__.py
    poetry init --no-interaction
  • Install bare-minimum dependencies - numpy, pandas, matplotlib, torch, and (to use a jupyter notebook) ipykernel.

    poetry add numpy pandas torch matplotlib ipykernel
  • Create a notebook, and name it something like run_exp.ipynb, using your newly created virtual environment as the kernel.
  • Now, write your experiment code, as fast as possible, all in the single notebook. Go!
    • This is where you (without loss of generality) load the dataset and model, play around with transforms, tokenization, dataloading, etc, check that shapes are as expected, and write the “for epoch in range(epochs)” loop
    • Don’t worry too much about extensive metric logging with fancy experiment trackers like tensorboard or wandb - log the minimum amount of information (often with dictionaries and print statements) to convince you that training is roughly working as expected (but fine, tensorboard can be helpful here in plotting training curves in real time)
  • Once you have something half working (ideally, a semi-promising result you want to investigate further), clean things up a bit. Move components that you don’t expect to change too much (e.g. datasets, model backbone definitions) into separate submodules, and factor out especially egregious repetitions.
  • Most importantly, create a Config object. Treat the config as the primary “interface” of the notebook - it should contain all the parameters that you foresee changing. In general, err on the side of including parameters, but don't worry about being exhaustive (e.g. you probably don’t need to include Adam beta values).

    from dataclasses import dataclass

    @dataclass
    class Config:
        lr: float = 1e-3
        weight_decay: float = 1e-4
        epochs: int = 5
        # ...
  • Using the config, set up a simple experiment tracking system. In general, use a datetime system rather than config-specific directories to start - your code and configs will change a lot early on, config file names can get long, and you don't want to overwrite old experiments after making changes. Do make sure to log the serialized config in the experiment directory though.

    • Again, feel free to use experiment tracking systems like tensorboard and wandb, but you can get a surprising amount of mileage out of nested collections, print statements, and matplotlib, and there are often benefits to "rolling your own" (cf. minimize indirection, or more generally maximizing context).

    import json
    import os
    from dataclasses import dataclass
    from datetime import datetime

    from omegaconf import OmegaConf

    @dataclass
    class Config:
        # ...
        exp_dir: str = f"output/{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"

    conf = Config()
    os.makedirs(conf.exp_dir, exist_ok=True)  # the run directory must exist before we write to it

    # Note we use OmegaConf - which we'll return to in the next step!
    with open(f"{conf.exp_dir}/config.yaml", "w") as f:
        OmegaConf.save(config=conf, f=f)

    metrics = {}
    # run experiments and log metrics ...
    with open(f"{conf.exp_dir}/metrics.json", "w") as f:  # open for writing
        json.dump(metrics, f)
  • Now that we have a cleaned-up script with a config and experiment tracking, we can start to experiment more systematically with different settings. Initially, I try to run quick experiments in the notebook by manually changing configs (this is more feasible if you have flexible access to gpu), keeping you close to the code and allowing for rapid iteration.
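
One lightweight way to make those manual config changes, sketched here rather than prescribed, is `dataclasses.replace`, which copies a config with a few fields overridden. The `Config` fields mirror the example above; `train` is a hypothetical stand-in for the notebook's training cell:

```python
from dataclasses import dataclass, replace


@dataclass
class Config:
    lr: float = 1e-3
    weight_decay: float = 1e-4
    epochs: int = 5


def train(conf: Config) -> dict:
    # Hypothetical placeholder for the notebook's training loop.
    return {"lr": conf.lr, "epochs": conf.epochs}


base = Config()
# Each variant is a fresh copy; `base` itself is left untouched.
variants = [replace(base, lr=lr) for lr in (1e-4, 1e-2)]
results = [train(c) for c in variants]
```

Because `base` is never mutated, rerunning a cell out of order cannot silently change the settings of a later run.
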
  • Eventually though, you’ll want to run experiment sweeps, typically on a shared cluster managed by slurm. This requires submitting a slurm job - consisting of required resources and a command to execute. Since we can't run notebooks directly, I use nbconvert to convert my experiment notebook into a runnable script:

    #!/bin/bash
    NOTEBOOK_PATH="$1"
    jupyter nbconvert --clear-output --inplace "$NOTEBOOK_PATH"
    jupyter nbconvert --to script "$NOTEBOOK_PATH"
  • This also helps with version control of notebooks, at the cost of maintaining two copies of essentially the same file.
  • To vary experimental settings across sweeps, we'll want to accept config overrides in the notebook/script. To do so, I use an is_notebook() function (borrowed from elsewhere) to check whether to parse command line arguments, and OmegaConf to parse and merge the config overrides:

    import sys

    if not is_notebook():
        # overrides arrive as key=value pairs, e.g. `python run_exp.py lr=0.01`
        overrides = OmegaConf.from_cli(sys.argv[1:])
        conf_dict = OmegaConf.merge(OmegaConf.structured(conf), overrides)
        conf = Config(**conf_dict) # reinitialize for dot access
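
The source of is_notebook() isn't reproduced here. One common way to write such a check, which may differ from the version used in this post, is to look at the class name of the active IPython shell:

```python
def is_notebook() -> bool:
    """Return True when running inside a Jupyter kernel, False otherwise."""
    try:
        # get_ipython is defined by IPython; in plain Python this raises NameError.
        shell = get_ipython().__class__.__name__  # noqa: F821
    except NameError:
        return False
    # Jupyter kernels use ZMQInteractiveShell. Terminal IPython uses
    # TerminalInteractiveShell, which this sketch treats as "not a notebook".
    return shell == "ZMQInteractiveShell"
```

When the converted script is run with plain `python`, the `NameError` branch fires and the command line overrides get parsed.
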
  • With this infrastructure in place, we can execute experimental sweeps. Typical practice is to create bash scripts with different settings, but since I prefer to work in python, I create a separate notebook (exp_sweeps.ipynb) and construct an experiment config that contains a subset of the full configuration parameters (remember, you are the user - you don’t need to enforce this subset with inheritance or type checks).

    from dataclasses import dataclass
    from itertools import product
    from typing import Optional

    @dataclass
    class Experiment:
        lr: float = 1e-3
        weight_decay: float = 1e-4
        exp_dir: Optional[str] = None

        def __post_init__(self):
            # we use semantic directories for structured sweeps
            self.exp_dir = f"output/lr_{self.lr}_wd_{self.weight_decay}"

    # construct experiments
    lrs = [1e-4, 1e-3, 1e-2]
    wds = [1e-4, 1e-3]  # values must be distinct, or two runs would share an exp_dir
    experiments = [Experiment(lr, wd) for lr, wd in product(lrs, wds)]
  • When running sweeps, I tend to overwrite the default experiment directory with semantic directory names corresponding to the experiment config. While this sometimes introduces the problems I discussed above (namely overwriting prior experiments), it feels more appropriate in the sweep stage where we typically compare results to other results in the sweep, rather than earlier sweep iterations. And in cases where we want to preserve results that would otherwise be overwritten, we can just take care in doing so (by e.g. moving them to a different directory).
  • After constructing a list of experiment objects, I use submitit to launch experiments programmatically, converting the experiment configs to command line arguments:

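    from enum import Enum

    import submitit

    def conf_to_args(conf: dict):
        # convert a config dict into key=value overrides that OmegaConf.from_cli can parse
        args = []
        for key, value in conf.items():
            if isinstance(value, Enum):
                value = value.name  # pass enums by name
            elif value is None:
                value = 'null'  # parsed back to None by OmegaConf
            args.append(f"{key}={value}")
        return args

    def run_experiments(executor, experiments: list[Experiment], script_name: str):
        # batch() groups the submissions (a single job array on slurm)
        with executor.batch():
            jobs = []
            for exp in experiments:
                function = submitit.helpers.CommandFunction(
                    ["python", script_name] + conf_to_args(exp.__dict__)
                )
                jobs.append(executor.submit(function))
        return jobs

    # AutoExecutor submits to slurm when it is available, and otherwise runs locally
    out_dir = "output/submitit_logs"  # placeholder: any directory for submitit's logs
    executor = submitit.AutoExecutor(folder=out_dir)
    # slurm-specific options need the slurm_ prefix on an AutoExecutor
    executor.update_parameters(timeout_min=60 * 48, mem_gb=16, slurm_gres="gpu:1")
    jobs = run_experiments(executor, experiments, "run_exp.py")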
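
To make the round trip concrete, here is a small sketch of the command each job ends up running. It re-declares the Experiment dataclass and uses a simplified copy of conf_to_args (enum handling omitted) so that it runs on its own, without submitit:

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Experiment:
    lr: float = 1e-3
    weight_decay: float = 1e-4
    exp_dir: Optional[str] = None

    def __post_init__(self):
        self.exp_dir = f"output/lr_{self.lr}_wd_{self.weight_decay}"


def conf_to_args(conf: dict) -> list:
    # Simplified copy of the helper above: None becomes 'null', the rest is str().
    return [f"{k}={'null' if v is None else v}" for k, v in conf.items()]


cmd = ["python", "run_exp.py"] + conf_to_args(asdict(Experiment(lr=1e-2)))
print(" ".join(cmd))
# python run_exp.py lr=0.01 weight_decay=0.0001 exp_dir=output/lr_0.01_wd_0.0001
```

These key=value pairs are what the `OmegaConf.from_cli` call in the experiment script merges over the default config.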
  • Once the experiments are completed, I load and analyze the results using the same experiment objects. In this way, data generation and analysis are tightly coupled - paper figures are defined in the same notebook where experiments are run.

    from pathlib import Path

    def get_exp_metrics(exp: Experiment):
        # exp_dir is stored as a string, so wrap it in Path before joining
        metrics_path = Path(exp.exp_dir) / "metrics.json"
        if not metrics_path.exists():
            raise FileNotFoundError(f"Metrics file not found for {exp.exp_dir}")
        with open(metrics_path, "r") as f:
            exp_metrics = json.load(f)
        return exp_metrics
      
    # load exp metrics after jobs are completed 
    exp_metrics = [get_exp_metrics(exp) for exp in experiments]
    # ... (analyze data, make figures, etc)
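
get_exp_metrics raises as soon as one run is missing its metrics file. When part of a sweep fails or is still running, a more forgiving loader can be handy. The following is a rough sketch under stated assumptions: the function names and the `val_loss` key are hypothetical, and the numbers are made-up placeholders, not real results:

```python
import json
import tempfile
from pathlib import Path


def load_metrics(exp_dirs):
    """Map each experiment directory to its metrics, or None if metrics.json is missing."""
    out = {}
    for d in exp_dirs:
        path = Path(d) / "metrics.json"
        out[str(d)] = json.loads(path.read_text()) if path.exists() else None
    return out


def best_run(metrics_by_dir, key="val_loss"):
    """Return the (dir, metrics) pair with the lowest `key` among finished runs."""
    finished = {d: m for d, m in metrics_by_dir.items() if m is not None}
    return min(finished.items(), key=lambda kv: kv[1][key])


# Demo in a temporary directory, using placeholder values for illustration only.
with tempfile.TemporaryDirectory() as root:
    for name, loss in [("lr_0.001", 0.42), ("lr_0.01", 0.37)]:
        run = Path(root) / name
        run.mkdir()
        (run / "metrics.json").write_text(json.dumps({"val_loss": loss}))
    # The third directory never produced metrics, mimicking a failed job.
    dirs = [Path(root) / n for n in ("lr_0.001", "lr_0.01", "lr_0.1")]
    best_dir, best = best_run(load_metrics(dirs))
    print(Path(best_dir).name, best["val_loss"])
```

Returning None for missing runs, instead of raising, lets the analysis cells run on a partially finished sweep.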

Mileage on this exact setup may vary, but thus far I’ve found it strikes a great balance between flexibility and efficiency. Most significantly, I've found my "ugh field" around moving from local experimental notebook to submitting cluster jobs has been substantially reduced.

Conclusion

So yeah, those are my tips and basic setup. Again, they apply most strongly to early stage research, and most weakly to developing large, comprehensive pieces of infrastructure (including research infrastructure like PyTorch, Hugging Face, and TransformerLens). In some sense, the core mistake is to assume that early stage research requires novel, extensive research infrastructure[2]. Developing open source infrastructure is, to a first approximation[3], prosocial: the gains largely accrue to other users. So by all means, develop nice open-source frameworks - the world will benefit from your work. But if you have new research ideas that you're eager to try out, the best approach is often to just try them ASAP.

  1. I was initially shocked by how “messy” this GPT training script was - now I think it's the Way.

  2. This meme has been propagated to a certain extent by big labs, who make the (true) point that (infrastructure produced by) research engineers dramatically accelerates research progress. But this can be true while it is also the case that for a small research team with a limited budget, myopically pursuing results is a better bet.

  3. Reputational gains aside.

6 comments

I'm doing a physics PhD, and you're making me feel better about my coding practices. I appreciate your explicit example as well, as I'm interested in trying my hand at ML research and curious about what it looks like in terms of toolsets and typical sort-of-thing-one-works-on. I want to chime in down here in the comments to assure people that at least one horrible coder in a field which has nothing to do with machine learning (most of the time) thinks that the sentiment of this post is true. I admit that I'm biased by having very little formal CS training, so proper functional programming is more difficult for me than writing whatever has worked for me in the past writing ad-hoc Bash scripts. My sister is a professional software developer, and she winces horribly at my code. However, you point out that it is often the case that any particular piece of research code you are running has a particular linear set of tasks to achieve, and so:

  • You don't need to worry much about resilient code which handles weird edge cases.
  • It is often better to have everything in one place where you can see it than to have a bunch of broken up functions scattered across a folder full of files.
  • Nobody else will need to use the code later, including yourself, so legibility is less important.

As an example of the Good/Good-enough divide, here's a project I'm working on. I'm doing something which requires speed, so I'm using c++ code built on top of old code someone else wrote. I'm extremely happy that the previous researcher did not follow your advice, at least when they cleaned up the code for publishing, because it makes life easier for me to have most of the mechanics of my code hidden away out of view. Their code defines a bunch of custom types which rather intuitively match certain physical objects. They wrote a function which parses arg files so that you don't need to recompile the code to rerun a calculation with different physical parameters. Then there's my code which uses all of that machinery: My main function that I have written is sort of obviously a nest of loops over discrete tasks which could easily be separate functions, but I just throw them all together into one file, and I rewrite the whole file for different research questions so I have a pile of "main" files which reuse a ton of structure. As an example of a really ugly thing I did, I hard-code indices corresponding to momenta I want to study into the front of my program instead of making a function which parses momenta and providing an argument file listing the sets I want. I might have done that for the sake of prettiness, but I needed to provide a structure which lets me easily find momenta of opposite parity. Hard-coding the momenta let me keep the structure I was using at front of mind when I created the four other subtasks in the code which exploited that structure to let me construct subtasks which needed to easily find objects of opposite parity.

thanks for the detailed (non-ML) example!  exactly the kind of thing I'm trying to get at

Can't agree more with this post! I used to be afraid of long notebooks but they are powerful in allowing me to just think. 

Although, while creating a script, I tend to use "#%%" in VS Code to run cells inside the script to test stuff. My notebooks usually contain a bunch of analysis code that doesn't need to be run, but should stay.

Thanks! huh yeah the python interactive window seems like a much cleaner approach, I'll give it a try

This is a great post, and I like the research process. Do you know if the LLM code completion in cursor is compatible with ipynb notebooks?

thanks! yup Cursor is notebook compatible