That pipeline is theoretically possible (and exists in some other countries), but in the US there are very substantial structural obstacles to creating and deploying it.
It sounds to me like a clear failure of political leadership. Donald Trump, Joe Biden, or Xavier Becerra could have said that the problem was important and then given IBM $100 million to solve it. But the leadership was largely reactive and wasn't interested in actually leading during the pandemic, so the problem was left unsolved.
We could certainly have done much better (both before and during the pandemic), but unfortunately it isn't as simple as just giving IBM $100M. Any solution needs to fit into the vast array of other existing systems used for reporting lab results, managing medical records in hospitals, etc.
The US delivers health care in a very patchwork way, which has made the deployment of electronic medical records very slow and difficult. Strong, smart leadership at the top would help a great deal, but even in the best possible case, really fixing this problem would take many years.
Why is Covid Surveillance so Hard?
Following up on a recent thread, I want to discuss how Covid case numbers are generated. I think this is interesting not only because Covid is interesting, but because it’s a good example of non-obvious real-world complexity.
A reasonable person might have a mental model of surveillance that looks something like this:
That model is completely wrong.
This is a brief overview of some of the challenges involved in Covid surveillance: the situation is far more complicated than I’m presenting here.
Organizational challenges
Legal challenges
Pipeline challenges
Before Covid, it was pretty much unheard of for health departments to collect negative test results or test results from point-of-care tests like BinaxNow. These results now make up the large majority of reports sent to state health departments and are shoehorned into systems that were never designed for them.
Testing data is sent to the state health department by:
Data enters the main pipeline via:
Many of those sources are in different formats and many are incomplete, erroneous, or malformed. Many of the points of origin have never before conducted any kind of medical testing and are completely new to reporting lab results.
There are numerous third-party systems dumping data into the main pipeline, and every one of them has the ability to break the entire pipeline by introducing malformed data.
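To make the malformed-data problem concrete, here is a minimal sketch of one common defensive pattern: validate each incoming record and set the bad ones aside instead of letting them stop the whole batch. The field names, the list of valid results, and the ISO date format are all assumptions I made up for illustration; real lab reporting formats are far messier than this.

```python
from datetime import date

# Hypothetical schema, invented for this sketch.
REQUIRED_FIELDS = ("patient_name", "collected", "result")
VALID_RESULTS = {"positive", "negative", "inconclusive"}

def validate(record):
    """Return a list of problems; an empty list means the record is usable."""
    problems = [f"missing {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    if record.get("result"):
        result = str(record["result"]).strip().lower()
        if result not in VALID_RESULTS:
            problems.append(f"unknown result {record['result']!r}")
    if record.get("collected"):
        try:
            date.fromisoformat(record["collected"])
        except (TypeError, ValueError):
            problems.append(f"bad date {record['collected']!r}")
    return problems

def ingest(records):
    """Split a batch into usable records and quarantined ones."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate(record)
        if problems:
            # Keep the record and the reasons so a human can review it later.
            quarantined.append((record, problems))
        else:
            accepted.append(record)
    return accepted, quarantined
```

The design choice here is that one bad record costs you one record, not the whole feed. The price is a quarantine pile that someone has to work through by hand.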
Data challenges
Test results are often incomplete or contain errors. Many test results enter the system multiple times from multiple sources, sometimes with misspelled names or different names for the same person. There is no universal identifier like a social security number associated with reports. Data deduplication is computationally expensive and requires significant human intervention.
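A toy sketch of what deduplication looks like, assuming each report carries a name, date of birth, collection date, and result (hypothetical field names, not a real reporting schema). It uses fuzzy string matching from Python's standard library to tolerate misspelled names. This is only meant to show the shape of the problem, not how any health department actually does it.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop commas, and collapse repeated whitespace."""
    return " ".join(name.lower().replace(",", " ").split())

def same_person(a, b, threshold=0.85):
    """Guess whether two reports refer to the same person.

    The 0.85 threshold is an arbitrary assumption for this sketch.
    """
    if a["dob"] != b["dob"]:
        return False
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold

def deduplicate(reports):
    """Keep the first copy of each test; drop later near-identical copies."""
    unique = []
    for report in reports:
        is_duplicate = any(
            report["collected"] == kept["collected"]
            and report["result"] == kept["result"]
            and same_person(report, kept)
            for kept in unique
        )
        if not is_duplicate:
            unique.append(report)
    return unique
```

Note that every new report is compared against every report already kept, so the work grows roughly with the square of the number of reports. Note also that this sketch fails in both directions: it misses a person who changed their name, and it can merge two different people with similar names and the same birthday. That is why human review is still needed.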
A typical sequence of reporting goes something like this:
What date do you associate with a given test: the day the sample was collected, or the day the result was reported to the health department? CDC uses the report date, which simplifies reporting but gives less insight into the situation on the ground. Many states use the date the sample was collected, which is more useful but involves longer reporting delays. Regardless of which convention you use, there are substantial delays: in a state with an efficient surveillance system, it might take 7 days for 90% of all test results to pass through the system.
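Here is a small sketch of how the two date conventions give different daily counts from the same tests, and how you might measure reporting delay. The `collected` and `reported` field names are assumptions for illustration.

```python
from collections import Counter
from datetime import date

def counts_by(tests, key):
    """Daily test counts using either the 'collected' or 'reported' date."""
    return Counter(t[key] for t in tests)

def share_reported_within(tests, days):
    """Fraction of tests whose result arrived within `days` of collection."""
    on_time = sum((t["reported"] - t["collected"]).days <= days for t in tests)
    return on_time / len(tests)

# Made-up example data.
tests = [
    {"collected": date(2021, 3, 1), "reported": date(2021, 3, 2)},
    {"collected": date(2021, 3, 1), "reported": date(2021, 3, 4)},
    {"collected": date(2021, 3, 1), "reported": date(2021, 3, 10)},
    {"collected": date(2021, 3, 2), "reported": date(2021, 3, 4)},
]
by_collection = counts_by(tests, "collected")
by_report = counts_by(tests, "reported")
```

With collection dates, the count for March 1 keeps changing for days as late results trickle in. With report dates, each day's count is final as soon as the day ends, but it mixes together samples taken on different days.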
What about other data? It’s useful to know how many kids in a given school are testing positive so you can detect and respond to outbreaks. That’s great if you’re testing at school, but what if someone goes to a drive-thru site? What if they get tested by their family doctor? Most of those sources don’t report what school the patient attends. When they do, there is no standard format for reporting schools, so the raw data contains a mess of misspellings and mistakes in the school names, which must be tediously corrected to be useful.
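One way to take some of the tedium out of cleaning school names is to match each raw value against a list of known schools and only send the leftovers to a human. A minimal sketch, using made-up school names and an arbitrary similarity cutoff:

```python
from difflib import get_close_matches

# Hypothetical reference list; a real one would come from the state or district.
KNOWN_SCHOOLS = [
    "Lincoln Elementary School",
    "Washington Middle School",
    "Roosevelt High School",
]

def canonical_school(raw, known=KNOWN_SCHOOLS, cutoff=0.8):
    """Return the best-matching known school, or None if nothing is close."""
    lookup = {name.lower(): name for name in known}
    matches = get_close_matches(raw.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None
```

Anything that comes back as `None` still needs manual correction, and a fuzzy match can be confidently wrong when two schools have similar names, so the output needs spot checks.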
Testing data is used for many kinds of analysis, some of which require complex interactions between multiple test records, medical records, and other data sources. For example, determining whether a given case is a reinfection requires figuring out whether the patient has ever had a previous positive test (keeping in mind that they may have moved, changed their name, etc. in the interim). Calculating breakthrough cases requires matching test results to vaccination records. None of those operations are easy, and none of them are computationally cheap.
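To illustrate, here is a sketch of the easy part of those two calculations. It assumes the hard part is already done: every test and vaccination record has been correctly linked to the right person. The 90-day and 14-day windows are parameters I chose for illustration; actual case definitions vary by agency and have changed over time.

```python
from datetime import date

def is_reinfection(positive_dates, min_gap_days=90):
    """True if any positive test falls at least `min_gap_days` after the
    previous positive test for the same (already linked) person."""
    ordered = sorted(positive_dates)
    return any(
        (later - earlier).days >= min_gap_days
        for earlier, later in zip(ordered, ordered[1:])
    )

def is_breakthrough(positive_date, vaccination_complete_date, min_days=14):
    """True if the positive test came at least `min_days` after the person
    completed vaccination. None means no vaccination record was found."""
    if vaccination_complete_date is None:
        return False
    return (positive_date - vaccination_complete_date).days >= min_days
```

The logic is a few lines. The expense is in the record linkage that has to happen first: a missing vaccination record here silently turns a breakthrough case into an ordinary one, and a missed name change hides a reinfection.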
Closing thoughts
There is, I think, an interesting object lesson here. Much like container logistics, Covid surveillance is much harder than it seems to a casual observer. From the outside, it’s hard to understand why Covid surveillance is so slow, incomplete, and erratic. If you’re in tech, it’s easy to imagine how you could build a modern data pipeline that would solve all of those problems, and it’s hard to understand why someone hasn’t already done that.
That pipeline is theoretically possible (and exists in some other countries), but in the US there are very substantial structural obstacles to creating and deploying it. Any attempt to improve Covid surveillance must account for those obstacles, or it will fail.
More generally, changing the real world requires understanding the real world. Victory comes from solving the tedious mundane problems as well as the fun technical problems.
Important disclaimers
I speak for myself only and not for any other person or government agency.
I have no professional credentials in this field, but I have consulted closely with an expert in Covid surveillance while writing this.