One particularly perfidious example of this problem comes when incorrect data is 'corrected' to be more accurate.
A fictionalized conversation:
Data Vendor: We've heard that Enron [1]falsified their revenue data[2]. They claimed to make eleven trillion dollars last year, and we put that in our data at the time, but on closer examination their total revenue was six dollars and one Angolan Kwanza, worth one-tenth of a penny.
Me: Oh my! Thank you for letting us know.
DV: We've corrected Enron's historical data in our database to reflect this upd-
Me: You what??
DV: W-we assumed that you would want corrected data...
Me: We absolutely do not want that! Do not correct it! Go back and...incorrect...the historical data immediately!
Can you help me see this point? Why not correct it in the dataset? (Assuming that the dataset hasn't yet been used to train any models)
You can correct it in the dataset going forward, but you shouldn't go back and correct it historically. To see why, imagine this simplified world:
If you care about having accurate tracking of the corrected 'what was Enron's real revenue back in 2000' number, you can store that number somewhere. But by putting it in your historical data, you're making it look like you had access to that number in 2000. Ideally you would want to distinguish between:
but this requires a more complicated database.
Very interesting! I think this is one of the rare times where I feel like a post would benefit from an up-front Definition. What actually is Leakage, by intensional definition?
"Using information during Training and/or Evaluation of models which wouldn't be available in Deployment."
. . . I'll edit that into the start of the post.
Once upon a time I worked on language models and we trained on data that was correctly split from tuning data that was correctly split from test data.
And then we sent our results to the QA team who had their own data, and if their results were not good enough, we tried again. Good enough meant "enough lift over previous benchmarks". So back and forth we went until QA reported success. On their dataset. Their unchanging test dataset.
But clearly since we correctly split all of our data, and since we could not see the contents of QA's test dataset, no leakage could be occurring.
But of course you were engaged in meta-overfitting by the constant attack on the test dataset... How did you wind up detecting the leakage? Bad results when deployed to the real world?
Not to toot my own horn* but we detected it when I was given the project of turning some of our visualizations into something that could accept QA's format so they could look at their results using those visualizations and then I was like "... so how does QA work here, exactly? Like what's the process?"
I do not know the real-world impact of fixing the overfitting.
*tooting one's own horn always follows this phrase
sellers auction several very similar lots in quick succession and then never auction again
This is also extremely common in biochem datasets. You'll get results in groups of very similar molecules, and families of very similar protein structures. If you do a random train/test split your model will look very good but actually just be picking up on coarse features.
The other day, during an after-symposium discussion on detecting BS AI/ML papers, one of my colleagues suggested doing a text search for “random split” as a good test.
A paper I'm doing mech interp on used a random split when the dataset they used already has a non-random canonical split. They also validated with their test data (the dataset has a three way split) and used the original BERT architecture (sinusoidal embeddings which are added to feedforward, post-norming, no MuP) in a paper that came out in 2024. Training batch size is so small it can be 4xed and still fit on my 16GB GPU. People trying to get into ML from the science end have got no idea what they're doing. It was published in Bioinformatics.
Curated. I liked that this post both illustrated an important idea through a few different lenses, and in particular that it showcased how easy it would be to nod along with an incomplete/wrong explanation.
How about this: I train on all available data, but only report performance for the lots predicted to be <$1000?
This still feels squishy to me (even after your footnote about separately tracking how many lots were predicted <$1000). You're giving the model partial control over how the model is tested.
The only concrete abuse I can immediately come up with is that maybe it cheats like you predicted by submitting artificially high estimates for hard-to-estimate cases, but you miss it because it also cheats in the other direction by rounding down its estimates for easier-to-predict lots that are predicted to be just slightly over $1000.
But just like you say that it's easier to notice leakage than to say exactly how (or how much) it'll matter, I feel like we should be able to say "you're giving the model partial control over which problems the model is evaluated on, this seems bad" without necessarily predicting how it will matter.
My instinct would be to try to move the grading closer to the model's ultimate impact on the client's interests. For example, if you can determine what each lot in your data set was "actually worth (to you)", then perhaps you could calculate how much money would be made or lost if you'd submitted a given bid (taking into account whether that bid would've won), and then train the model to find a bidding strategy with the highest expected payout.
But I can imagine a lot of reasons you might not actually be able to do that: maybe you don't know the "actual worth" in your training set, maybe unsuccessful bids have a hard-to-measure opportunity cost, maybe you want the model to do something simpler so that it's more likely to remain useful if your circumstances change.
Also you sound like you do this for a living so I have about 30% probability you're going to tell me that my concerns are wrong-headed for some well-studied reason I've never heard of.
I work for a leading private statistical research company and think this is a wonderful post. I heartily agree with all the takeaways. I may expand on data leakage examples I've seen "in the wild" in a follow-up post if there's demand for more stories, but your second "time-travelling" example brought back wonderful memories of a large company-wide debate, since your initial suggestion was our modus operandi, and there was likewise "tolerant skepticism" when it was questioned.
"I looked into it and found . . . that the conventional approach worked fine." So did we, and nobody could provide a clear example where this flavour of leakage was a problem on the actual data we had rather than theorised data (perhaps we didn't look hard enough?). In my experience, the tacit knowledge and understanding of the data-generating process should probably be the main determiner of how important data leakage is in practice, and therefore how much you should care about it. In this case, time-travel was an issue because the data process you were modeling had serial correlation.
The knowledge that "all models are wrong" is the best tonic I've found for dealing with the nagging uncertainty inherent when working with data involving the arrow of time. I still pretend time-travel is fine almost every working day. We all know it's wrong, but for my company at least, we don't know the price we're paying.
"It's what everyone does" and "It's what we always do" are meaningful evidence that a given Leakage is more likely to be the bearable kind
Sounds more like evidence for cheap alpha to me, but I don't work in the field so I'll trust you on this one.
This is a description of my work on some data science projects, lightly obfuscated and fictionalized to protect the confidentiality of the organizations I handled them for (and also to make it flow better). I focus on the high-level epistemic/mathematical issues, and the lived experience of working on intellectual problems, but gloss over the timelines and implementation details.
Data Leakage (n.): The use of information during Training and/or Evaluation which wouldn't be available in Deployment.
The Upper Bound
One time, I was working for a company which wanted to win some first-place sealed-bid auctions in a market they were thinking of joining, and asked me to model the price-to-beat in those auctions. There was a twist: they were aiming for the low end of the market, and didn't care about lots being sold for more than $1000.
"Okay," I told them. "I'll filter out everything with a price above $1000 before building any models or calculating any performance metrics!"
They approved of this, and told me it'd take a day or so to get the data ready. While I waited, I let my thoughts wander.
"Wait," I told them the next morning. "That thing I said was blatantly insane and you're stupid for thinking it made sense[1]. We wouldn't know whether the price of a given lot would be >$1000 ahead of time, because predicting price is the entire point of this project. I can't tell you off the top of my head what would go wrong or how wrong it would go, but it's Leakage, there has to be a cost somewhere. How about this: I train on all available data, but only report performance for the lots predicted to be <$1000?"
They, to their great credit, agreed[2]. They then provided me with the dataset, alongside the predictive model they'd had some Very Prestigious Contractors make, which they wanted me to try and improve upon. After a quick look through their documentation, I found that the Very Prestigious Contractors had made the same mistake I had, and hadn't managed to extricate themselves from it; among other things, this meant I got to see firsthand exactly how this Leakage damaged model performance.
If you make a model predicting a response from other factors, but feed it a dataset excluding responses over a certain ceiling, it'll tend to underestimate, especially near the cutoff point; however, if you then test it on a dataset excluding the same rows, it'll look like it's overestimating, since it's missing the rows it would underestimate. The end result of this was the Very Prestigious Contractors putting forth a frantic effort to make the Actual-vs-Predicted graphs line up (i.e. actively pushing things in the wrong direction), and despairing when no possible configuration of extra epicycles let them fit 'correctly' to their distorted dataset while keeping model complexity below agreed limits; their final report concluded with a sincere apology for not managing to screw up more than they did.
But I didn't need to know things would break that exact way. I just needed to be able to detect Leakage.
The Time-Travelling Convention
Another time, I was working for another company which wanted to know how aggressively to bid in some first-place sealed-bid auctions, and asked me to model how much they were likely to make from each lot. There was no twist: they had a smallish but very clean dataset of things they'd won previously, various details about each lot which were available pre-auction, and how much money they'd made from them. Everything was normal and sensible.
"Okay," I told them. "I'll random-sample the data into training and testing sets, decide hyperparameters (and all the other model choices) by doing cross-validation inside the training set, then get final performance metrics by testing once on the testing set."
They approved of this, and told me to get started. I took a little time to plan the project out.
"Wait," I told them. "That thing I said was blatantly insane and you're stupid for thinking it made sense[1]. If I get my training and testing sets by random-sampling, then I'll be testing performance of a model trained (in part) on February 20XX data on a dataset consisting (in part) of January 20XX data: that's time travel! I can't tell you off the top of my head what would go wrong or how wrong it would go, but it's Leakage, there has to be a cost somewhere. We should be doing a strict chronological split: train on January data, validate and optimize on February data, final-test on March data."
The company responded with tolerant skepticism, mentioning that random splits were convention both for them and the wider industry; I replied that this was probably because everyone else was wrong and I was right[1]. They sensibly asked me to prove this assertion, demonstrating a meaningful difference between what they wanted doing and what I planned.
I looked into it and found . . . that the conventional approach worked fine. Context drift between training and deployment was small enough to be negligible, the ideal hyperparameters were the same regardless of what I did, and maintaining a strict arrow of time wasn't worth the trouble of changing the company's processes or the inconvenience of not being able to use conventional cross-validation. I was chastened by this result . . .
. . . until I looked into performance of chronological vs random splits on their how-much-will-this-lot-cost-us datasets, and found that chronological splits were meaningfully better there. It was several months after I proved this that I figured out why, and the mechanism - sellers auction several very similar lots in quick succession and then never auction again; random splits put some of those 'clone' lots in train and some in validation/test, incentivizing overfit; meanwhile, chronological splits kept everything in a given batch of clones on one side of the split - wasn't anything I'd been expecting.
But I didn't need to know things would break that exact way. I just needed to be able to detect Leakage (. . . and test whether it mattered).
The Tobit Problem
A third time, I was working for a third company which were winning some first-place sealed-bid auctions, and wanted to win more . . . actually, I've already written the story out here. Tl;dr: there was some Leakage but I spotted it (this time managing to describe pre hoc what damage it would do), came up with a fix which I thought wasn't Leakage (but I thought it prudent to check what it did to model performance, and subsequently figured out where and how I'd been wrong), and then scrambled around frantically building an actually Leakage-proof solution.
My Takeaways
(Comparisons to bad-reasoning-in-general are left as an exercise for the reader.)
I did not use these exact words.
They also asked that I should report [# of lots predicted as <$1000] alongside my other performance metrics. This struck me as sensible paranoia: if they hadn't added that stipulation, I could have just cheated my way to success by predicting which lots would be hard to predict and marking them as costing $9999.