Important caveat for the pass-through approach: if any of your build_dataset() functions accept **kwargs, you have to be very careful about how they're handled to preserve the property that "calling a function with unused arguments is an error". It was a lot of work to clean this up in Matplotlib...
The general lesson is that "magic" interfaces which try to 'do what I mean' are nice to work with at the top-level, but it's a lot easier to reason about composing primitives if they're all super-strict.
Another example: $ hypothesis write numpy.matmul produces code against a very strict (and composable) runtime API, but you probably don't want to look at how.
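(For flavor only – the snippet below is not the ghostwriter's actual output, just an illustration of the kind of strict, fully explicit Hypothesis API that the generated tests are written against:)

import numpy as np
from hypothesis import given
from hypothesis.extra.numpy import arrays

@given(
    a=arrays(dtype=np.float64, shape=(3, 4)),
    b=arrays(dtype=np.float64, shape=(4, 2)),
)
def test_matmul_shape(a, b):
    # Every input is spelled out; nothing is inferred or silently discarded.
    assert np.matmul(a, b).shape == (3, 2)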
The general lesson is that "magic" interfaces which try to 'do what I mean' are nice to work with at the top-level, but it's a lot easier to reason about composing primitives if they're all super-strict.
100% agree. In general I usually aim to have a thin boundary layer that does validation and converts everything to nice types/data structures, and then a much stricter core of inner functionality. Part of the reason I chose to write about this example is because it's very different from what I normally do.
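(As a rough illustration of that split – the names here are hypothetical, not from the post – the boundary layer validates and converts, and the core assumes clean inputs:)

from dataclasses import dataclass

@dataclass
class DatasetSpec:
    name: str
    start_date: str
    end_date: str

def parse_spec(raw: dict) -> DatasetSpec:
    # Thin boundary layer: validate loose input and convert it to a nice type.
    missing = {"name", "start_date", "end_date"} - raw.keys()
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return DatasetSpec(raw["name"], raw["start_date"], raw["end_date"])

def build_dataset(spec: DatasetSpec):
    # Strict core: assumes its inputs are already well-formed.
    ...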
Important caveat for the pass-through approach: if any of your build_dataset() functions accept **kwargs, you have to be very careful about how they're handled to preserve the property that "calling a function with unused arguments is an error". It was a lot of work to clean this up in Matplotlib...
To make the pass-through approach work, the build_dataset functions do accept excess parameters and throw them away. That's definitely a cost. The easiest way to handle it is to have the build_dataset functions themselves just pass the actually needed arguments to a stricter, core function, e.g.:

def build_dataset(a, b, **kwargs):
    build_dataset_strict(a, b)

build_dataset(**parameters)  # Succeeds as long as keys named "a" and "b" are in parameters
The reason to be explicit is to be able to handle control flow.
def run_job(make_dataset1: bool, make_dataset2: bool):
    if make_dataset1 and make_dataset2:
        make_third_dataset()
If your jobs are independent, then they should be scheduled as such. This allows jobs to run in parallel.
def make_datasets_handler(job):
    # One independent job per dataset, so they can run in parallel.
    for dataset in job.params.datasets:
        schedule_job('make_dataset', {'dataset': dataset})

def make_dataset_handler(job):
    dataset = job.params.dataset
    constructors.get(dataset.name)(**dataset.params)
Passing random params to functions and hoping for failure is a terrible idea (though it is great for breaking code, i.e. 'fuzzing').
The performance difference of explicit vs pass-through comes down to control flow. Your errors would come out just as fast if you ran check_dataset_params() up front.
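(Something along these lines – no such function appears in the post, and the dataset and parameter names below are made up – is presumably what's meant by an up-front check:)

def check_dataset_params(params: dict) -> None:
    # Run before submitting the job, so a missing parameter fails immediately
    # instead of minutes later when the workers spin up.
    if "dataset1" in params["datasets"] and not {"start_date", "end_date"} <= params.keys():
        raise ValueError("dataset1 requires start_date and end_date")
    if "dataset2" in params["datasets"] and "min_size" not in params:
        raise ValueError("dataset2 requires min_size")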
the first gives us a 5x faster feedback loop
A good way to increase feedback rate is to write better tests. Failure in production should be the exception, not the norm.
The reason to be explicit is to be able to handle control flow.
The datasets aren't dependent on each other, though some of them use the same input parameters.
If your jobs are independent, then they should be scheduled as such. This allows jobs to run in parallel.
Sure, there's some benefit to breaking down jobs even further. There's also overhead to spinning up workers. Each of these functions takes ~30s to run, so it ends up being more efficient to put them in one job instead of multiple.
Your errors would come out just as fast if you ran check_dataset_params() up front.
So then you have to maintain check_dataset_params, which gives you a level of indirection. I don't think this is likely to be much less error-prone.
The benefit of the pass-through approach is that it uses language-level features to do the validation – you simply check whether the parameters dict has keywords for each argument the function is expecting.
A good way to increase feedback rate is to write better tests.
I agree in general, but I don't think there are particularly good ways to test this without introducing indirection.
Failure in production should be the exception, not the norm.
The failure you're talking about here is tripping a try clause. I agree that exceptions aren't the best control flow – I would prefer if the pattern I'm talking about could be implemented with if statements – but it's not really a major failure, and (unfortunately) a pretty common pattern in Python.
I would prefer if the pattern I’m talking about could be implemented with if statements
This has its own problems, but you could use inspect.signature, I think?
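(A sketch of that idea – not code from the post – using inspect.signature to compute each function's required keywords and guard the call with an if instead of a try/except:)

import inspect

def can_build(func, params: dict) -> bool:
    # True if params covers every argument of func that has no default.
    required = {
        p.name
        for p in inspect.signature(func).parameters.values()
        if p.default is p.empty
        and p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
    }
    return required <= params.keys()

def build_available_datasets(constructors: dict, params: dict) -> None:
    # Assumes, as in the pass-through approach, that each constructor takes
    # **kwargs and ignores parameters it doesn't need.
    for build in constructors.values():
        if can_build(build, params):
            build(**params)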
Each of these functions takes ~30s to run, so it ends up being more efficient to put them in one job instead of multiple.
This is a perfect example of the AWS Batch API 'leaking' into your code. The whole point of a compute resource pool is that you don't have to think about how many jobs you create.
It sounds like you're using the wrong tool for the job (or a misconfiguration - e.g. limit the batch template to 1 vcpu).
The benefit of the pass-through approach is that it uses language-level features to do the validation
You get language-level validation either way. The assert statements are superfluous in that sense. What they do add is in effect check_dataset_params(), whose logic probably doesn't belong in this file.
The failure you're talking about here is tripping a try clause.
No, I meant a developer introducing a runtime bug.
This is a perfect example of the AWS Batch API 'leaking' into your code. The whole point of a compute resource pool is that you don't have to think about how many jobs you create.
This is true. We're using AWS Batch because it's the best tool we could find for other jobs that actually do need hundreds/thousands of spot instances, and this particular job goes in the middle of those. If most of our jobs looked like this one, using Batch wouldn't make sense.
You get language-level validation either way. The assert statements are superfluous in that sense. What they do add is in effect check_dataset_params(), whose logic probably doesn't belong in this file.
You're right. In the explicit example, it makes more sense to have that sort of logic at the call site.
Here's a concrete example of two approaches to a software problem, each with different advantages, and circumstances that would lead me to choose one over the other.
Recently I wrote a job to create and save datasets that are used for downstream work. The code is run through a scheduler that will spin up a worker and execute, so it has to be decoupled from any upstream or downstream jobs.
The first design I considered is pretty standard: we tell the job which datasets we want, validate that the parameters needed for the dataset are present, and then build those datasets.
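(Roughly – the dataset names and parameters here are stand-ins, not the real ones:)

def build_dataset1(start_date, end_date): ...
def build_dataset2(min_size): ...

def run_job(params):
    datasets = params["datasets"]
    if "dataset1" in datasets:
        assert "start_date" in params and "end_date" in params
        build_dataset1(params["start_date"], params["end_date"])
    if "dataset2" in datasets:
        assert "min_size" in params
        build_dataset2(params["min_size"])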
In the second approach, instead of telling the job which datasets to build, we pass it all the parameters we have, and the job builds whatever datasets it can.
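(Again with stand-in names – each build function takes **kwargs and discards what it doesn't need, and the job simply skips any dataset whose required parameters are missing:)

def build_dataset1(start_date, end_date, **kwargs): ...
def build_dataset2(min_size, **kwargs): ...

def run_job(params):
    for build in (build_dataset1, build_dataset2):
        try:
            build(**params)
        except TypeError:
            pass  # params didn't include everything this dataset needs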
These are pretty different approaches, representing two poles of design – let's call them explicit and pass-through. (In between versions exist too, but let's ignore them for now.) What are some tradeoffs between them?
In the explicit approach, the calling code needs to know about the build_dataset functions, or at least their signatures. In the pass-through approach, the calling code can just submit the job to a scheduler – it doesn't know anything about those functions, and doesn't need to change when we modify them.
In the explicit approach, the build_dataset functions all have different signatures, making it hard to abstract over them and requiring us to change the job code if the functions ever change. In the pass-through approach, we can call them all in the same way, and only need to change the job code if we add or remove a dataset.
In short, the explicit version is a bit safer, but suffers from excess coupling – we have to update many more parts of the code when things change.
I went with the pass-through approach. Why?
First, the safety offered by the explicit approach consists entirely of failing earlier. This is a good thing, but in my use case the pipeline takes about 10 minutes for the workers to start up, but then each job runs relatively quickly. In the explicit approach, we'll get a loud failure right around when the workers spin up, which means 10 minutes in. In the pass-through approach, we get that failure about 12 minutes in. So the extra safety is pretty marginal in this case.
(If we were submitting to an always-running cluster, the extra safety would be much more valuable, because the failure times might look more like 30 seconds and 150 seconds, meaning the first gives us a 5x faster feedback loop instead of just a 20% improvement.)
Next, I expected the explicit version to give us a bunch of extra failures – since a common change would have to modify 3 parts of the code instead of 1, the odds that I would make a mistake were pretty high, and this mistake would result in a loud failure 10 minutes in.
(If this could be easily statically type checked and give us a compile-time error instead of a runtime error, that would be a big point in favor of the explicit approach.)
Last, the cost of having smart calling code was quite high for us – since our scheduler (AWS Batch) could only handle logic of the form "run job 2 after job 1", it meant that I would essentially need to run another coordinating process that would duplicate some but not all of the work of the scheduler. This adds another moving part that can fail, and ties up a development machine.
(If we had already been using a more capable scheduling framework like AirFlow, this wouldn't be a big deal. But the one-time cost of setting up such a framework is pretty high, and I didn't think it was worth it.)