One way or another, we'd try to use the most relevant dataset first.
Otherwise known as "underfitting"...
Maybe you do this, but me, and many people in ML, do our best to avoid ever doing that. Transfer learning powers the best and highest-performing models. Even in pure supervised learning, you train on the largest dataset possible, and then finetune. And that works much better than training on just the target task. You cannot throw a stick in ML today without observing this basic paradigm.
I know, let's take a dataset of 2d images of cars and their 3d rendering and train the model on that first.
There are GAN papers, among others, which do pretty much this for inferring models & depth maps.
But that's just because the hard part, the training, is already done.
No. You don't do it 'just' to save computation. You do it because it learns superior representations and generalizes better on less data. That finetuning is a lot cheaper is merely convenient.
Given that your motivating analogy to machine learning is comprehensively wrong, perhaps you should rethink this essay.
TL;DR: Please provide references so I can give a more cohesive reply. See the papers below, plus my reasoning for why I think you are basically wrong and/or conflating things that work in RL with things that work in SL, and/or conflating techniques used to train with scarce data with techniques that would help even when the data is large enough that compute is the bottleneck (which is the case I'm arguing for, i.e. that compute should first be thrown at the most relevant data).
Maybe you do this, but me, and many people in ML, do our best to avoid ever doing that. Transfer learning powers the best and highest-performing models. Even in pure supervised learning, you train on the largest dataset possible, and then finetune. And that works much better than training on just the target task. You cannot throw a stick in ML today without observing this basic paradigm.
I would ask for a citation on that.
Never in the ML literature have I heard of people training models on datasets other than the one they want to solve as a more efficient alternative to training on that dataset itself. Of course, given extra time once you've converged on your own data, training on related data can be helpful, but my point is just that training on the actual data is the first approach one takes (obviously, depending on the size of the problem, you might start with weight transfer directly).
People transfer weights all the time, but that's because it shortens training time.
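(To be concrete about what I mean by transferring weights, here's a minimal PyTorch sketch; the tiny architecture and the "source" model standing in for something trained on another task are made up for illustration.)

```python
import torch.nn as nn

def make_net(n_outputs):
    # Tiny image model: a conv feature extractor plus a task-specific linear head.
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, n_outputs),
    )

source = make_net(n_outputs=10)  # pretend this was already trained on some other task
target = make_net(n_outputs=5)   # the model for the task we actually care about

# Copy everything except the final task-specific layer (index 4 in the Sequential).
pretrained = {k: v for k, v in source.state_dict().items() if not k.startswith("4.")}
target.load_state_dict(pretrained, strict=False)
# `target` now starts from (hopefully) useful features instead of random weights,
# which is mainly where the savings in training time come from.
```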
New examples of unrelated (or less-related) data do not make a model converge faster on validation data, assuming you could instead create new examples of problem-specific data.
In theory it could make the model generalize better, but when I say "in theory" I mean it in the layman's sense, since research on this topic is hard and there's precious little of it in supervised learning.
Most of the rigorous research on this topic seems to be in RL, e.g. https://arxiv.org/pdf/1909.01331.pdf, and it's nowhere near clear-cut.
Out of the research that seems to apply better to SL, I find this theory/paper to be the most rigorous and up to date: https://openreview.net/pdf?id=ryfMLoCqtQ ... and the findings here, as in pretty much any other paper on the subject by a respected team or university, can be boiled down to:
"Sometimes it helps with generalization on the kind of data not present in the training set, sometimes it just results in a shittier model, and it depends a lot on the SNR of the data the model was trained on relative to the data you are training on now."
There are GAN papers, among others, which do pretty much this for inferring models & depth maps.
Again, links to papers please. My bet is that the GAN papers do this:
a) Because they lack 3d renderings of the objects they want to create.
b) Because they lack 3d renderings of most of the objects they want to create.
c) Because they are trying to showcase an approach that generalizes to different classes of data that aren't available at training time (i.e. showing that a car 3d rendering model can generalize to do 3d renderings of glasses, not that it can perform better than one that's been specifically trained to generate 3d renderings of glasses).
If one can achieve better results with unrelated data than with related data in similar compute time (i.e. up until either of the models has converged on a validation dataset or within a predefined time budget), or even if one can achieve better results by training on unrelated data *first* and then on related data rather than vice versa... I will eat my metaphorical hat and retract this whole article. (Provided both models use appropriate regularization, or at least the relevant-data model does; otherwise I can see a hypothetical where a bit of high-noise data serves as a form of regularization, but even that I would think highly unlikely.)
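To make the comparison I'm proposing concrete, here is a minimal sketch of the protocol, with synthetic linear-regression "tasks" standing in for the related and unrelated datasets; everything here, from the seeds to the epoch budget, is made up:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_loader(task_seed, data_seed, n=512, d=16, noise=0.1):
    # Each task_seed defines its own ground-truth linear map; data_seed varies the samples.
    gw = torch.Generator().manual_seed(task_seed)
    gx = torch.Generator().manual_seed(data_seed)
    w = torch.randn(d, 1, generator=gw)
    x = torch.randn(n, d, generator=gx)
    y = x @ w + noise * torch.randn(n, 1, generator=gx)
    return DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

related   = make_loader(task_seed=1, data_seed=10)
unrelated = make_loader(task_seed=2, data_seed=20)  # a different task entirely
val       = make_loader(task_seed=1, data_seed=30)  # same task as `related`, fresh data

def build_model():
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

def train(model, loader, epochs, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            nn.functional.mse_loss(model(x), y).backward()
            opt.step()
    return model

@torch.no_grad()
def validate(model, loader):
    return sum(nn.functional.mse_loss(model(x), y).item() for x, y in loader) / len(loader)

budget = 20  # equal total epochs for both protocols

model_a = train(build_model(), related, epochs=budget)         # all compute on relevant data
model_b = train(build_model(), unrelated, epochs=budget // 2)  # unrelated first...
model_b = train(model_b, related, epochs=budget // 2)          # ...then finetune on relevant data
print("related-only:", validate(model_a, val))
print("unrelated-then-related:", validate(model_b, val))
```

My bet is that the first protocol wins whenever the "unrelated" task really is unrelated.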
No. You don't do it 'just' to save computation. You do it because it learns superior representations and generalizes better on less data. That finetuning is a lot cheaper is merely convenient.
Again, see my answers above, and please provide relevant citations if you wish to claim the contrary. It seems to me that what you are saying here goes against common sense, i.e. that, given a choice between problem-specific data and less-related data, your claim is that at some point using the less-related data is superior.
A charitable reading of this is that introducing noise into the training data helps generalization (see e.g. techniques for injecting noise into the training data, L2 regularization, and dropout), which seems kind of true, but far from true on that many tasks. I invite you to experiment with it and realize it doesn't really apply to everything, nor are the effect sizes large, unless you are specifically focusing on adversarial examples or on datasets where the train set covers only a minute portion of the potential data.
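For reference, the kind of regularization I'm talking about looks roughly like this; a minimal PyTorch sketch where the layer sizes, noise level, dropout rate, and weight decay are all arbitrary:

```python
import torch
import torch.nn as nn

class NoisyMLP(nn.Module):
    """Toy regression model using the regularizers mentioned above:
    input noise, dropout, and (via the optimizer) L2 regularization."""
    def __init__(self, d_in=32, d_hidden=64, d_out=1, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Dropout(p=0.5),                    # dropout
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        if self.training:                         # inject Gaussian noise only while training
            x = x + self.noise_std * torch.randn_like(x)
        return self.net(x)

model = NoisyMLP()
# weight_decay adds the L2 penalty on the weights.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```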
Goodhart's Law also seems relevant to invoke here, if we're talking about goal vs incentive mismatch.
Em, you don't need a PhD in applied mathematics to learn to take derivatives, it's something you learn in school.
The more specific case I was hinting at was figuring out the loss <--> gradient landscape relationship.
Which, yes, a highschooler can do for a 5-cell network, but for any real network it seems fairly hard to say anything about it... i.e. I've read a few papers delving into the subject and they seem complex to me.
Maybe not PhD level? I don't know. But hard enough that most people choose to stick with a loss that makes sense for the task rather than optimize it so that the resulting gradient is "easy to solve" (i.e. yields faster training and/or converges on a "more" optimal solution).
But I'm not 100% sure I'm correct here, and maybe learning the right 5 primitives makes the whole thing seem like child's play... though based on people's behavior around the subject I kind of doubt it.
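To illustrate what I mean by the loss <--> gradient relationship, here's a toy autograd example with arbitrary numbers: two losses that agree on where the optimum is, but hand the optimizer very different gradients.

```python
import torch

# One prediction, one target; compare the gradients MSE and MAE produce for the
# same error. Both are minimized at pred == target, but the gradient landscapes differ.
target = torch.tensor(1.0)

for name, loss_fn in [("MSE", lambda p: (p - target) ** 2),
                      ("MAE", lambda p: (p - target).abs())]:
    pred = torch.tensor(5.0, requires_grad=True)
    loss_fn(pred).backward()
    print(name, "gradient at pred=5:", pred.grad.item())
    # MSE: gradient 2*(5-1) = 8, and it shrinks as pred approaches the target.
    # MAE: gradient is 1 no matter how far off pred is.
```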
I really don't want to say that I've figured out the majority of what's wrong with modern education and how to fix it, BUT
1. We train ML models on the tasks they are meant to solve
When we train (fit) any given ML model for a specific problem, on which we have a training dataset, there are several ways we go about it, but all of them involve using that dataset.
Say we're training a model that takes a 2d image of some glassware and turns it into a 3d rendering. We have images of 2,000 glasses from different angles and in different lighting conditions, plus an associated 3d model for each.
How do we go about training the model? Well, arguably, we could start small and then feed in the whole dataset, we could use different sizes for the train/validation/test splits, we could use cross-validation to estimate the overall accuracy of our method or decide it would take too long... etc.
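As a minimal sketch of that kind of setup, with random arrays standing in for the glassware images and their 3d targets (the shapes and split sizes are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Stand-ins for the dataset: 2,000 "images" and their 3d targets (e.g. point clouds).
X = np.random.rand(2000, 32, 32, 3).astype("float32")
y = np.random.rand(2000, 1024, 3).astype("float32")

# One possible split: 70% train, 15% validation, 15% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Or 5-fold cross-validation to estimate how well the whole method works.
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5, shuffle=True, random_state=0).split(X)):
    pass  # fit on X[train_idx], y[train_idx]; evaluate on X[val_idx], y[val_idx]
```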
But I'm fairly sure that nobody will ever say: "I know, let's take a dataset of 2d images of cars and their 3d rendering and train the model on that first."
If you already have a trained model that does some other 2d image processing or predicts 3d structure from 2d images, you might try doing some weight transfer or using part of the model as a backbone. But that's just because the hard part, the training, is already done.
To take a very charitable example, maybe our 3d renderings are not accurate enough, we've tried everything else, and getting more data is too expensive. At that point, we could decide to bring in other 2d-to-3d datasets, train the model on those as well, and hope there's enough similarity between the datasets that the model gets better at the glassware task.
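A rough sketch of both fallbacks, reusing a pretrained 2d backbone and training on an auxiliary 2d-to-3d dataset before finetuning on the glassware data, might look like this; the ResNet backbone, the flattened output size, and the random stand-in datasets are all just for illustration:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

def fake_loader(n=64):
    # Random tensors standing in for 2d images and flattened 3d targets.
    images = torch.randn(n, 3, 224, 224)
    targets = torch.randn(n, 1024 * 3)
    return DataLoader(TensorDataset(images, targets), batch_size=8)

# Reuse a backbone pretrained on generic 2d images; only the head is new.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Linear(backbone.fc.in_features, 1024 * 3)

opt = torch.optim.Adam(backbone.parameters(), lr=1e-4)

def fit(model, loader, epochs=1):
    for _ in range(epochs):
        for images, targets in loader:
            opt.zero_grad()
            nn.functional.mse_loss(model(images), targets).backward()
            opt.step()

fit(backbone, fake_loader())  # stand-in for the auxiliary 2d-to-3d dataset
fit(backbone, fake_loader())  # then finetune on the glassware data we actually care about
```

Either way, the most relevant data still gets the last, and ideally the largest, share of the compute.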
One way or another, we'd try to use the most relevant dataset first.
2. We don't do this with humans
I'm certain some % of the people studying how to implement basic building blocks (e.g. allocators, decision trees, and vectors) in C during a 4-year CS degree end up becoming language designers or kernel developers and are glad they took the time to learn those things.
But the vast majority of CS students go on to become frontend or full-stack developers, where all the "backend" knowledge they require is how to write SQL, how to read and write files, and how to use high-level abstractions over TCP or UDP sockets.
At which point I ask something like:
And I get a mumbled answer about something-something having to learn the fundamentals. To which I reply:
At which point I get into arguments about how education seems only marginally related to salary and job performance in most programming jobs. The whole thing boils down to half-arsed statistics because evaluating things like salary, job performance, and education levels is, who would have guessed, really hard.
So for the moment, I will just assume that your run-of-the-mill Angular developer doesn't need a 4-year CS degree to do their job, and that a 6-month online program teaching the directly required skills is sufficient.
3. Based on purely empirical evidence, I think we should
Going even further, let's get into topics like memory ordering. I'm fairly sure this would be considered a fairly advanced subject as far as programming is concerned; knowing how to properly use memory ordering basically means writing ASM-level code.
I learned about the subject by just deciding one day that I would write a fixed-size, lock-free, wait-free, thread-safe queue that allows multiple readers, multiple writers, or both... then, to make it a bit harder, I went ahead and also developed a Rust version in parallel.
Note: I'm fairly proud of the above implementations since I was basically a kid 4 years ago when I wrote them. I don't think they are well tested enough to use in production, and they likely have flaws that basic testing on an x86 and a Raspberry Pi ARM processor didn't catch. Nor are they anywhere close to the most efficient implementations possible.
I'm certain that I don't have anywhere near a perfect grasp of memory ordering and "low level" parallelism in general. However, I do think the above learning experience was a good way to get an understanding equivalent to an advanced university course in ~2 weeks of work.
Now, this is an n=1 example, but I don't think that I'm alone in liking to learn this way.
The few people I know who can write a better-than-half-arsed compiler didn't finish the Dragon Book and then start making their first one; they did the two in parallel, or even the other way around.
I know a lot of people who swear by Elm, Clojure, and Haskell; to my knowledge, none of them bothered to learn category theory in depth or ever read Hilbert, Gödel, or Russell. This didn't seem to stop them from learning Haskell or becoming good at it.
Most ML practitioners I know don't have a perfect grasp of linear algebra; they can compute the gradients for a simple neural network by hand, or even speculate on the gradients resulting from a specific loss function, but it's not a skill they are very confident in. On the other hand, most ML practitioners I know are fairly confident writing their own loss functions when it suits the problem.
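(The kind of thing I mean by a problem-specific loss: say the problem punishes underestimates more than overestimates, so you weight them differently. A toy sketch with made-up numbers and weighting:)

```python
import torch

def asymmetric_mse(pred, target, under_weight=3.0):
    # Underestimating the target costs 3x more than overestimating by the same amount.
    err = pred - target
    weight = torch.where(err < 0, torch.full_like(err, under_weight), torch.ones_like(err))
    return (weight * err ** 2).mean()

pred = torch.tensor([0.5, 1.5], requires_grad=True)
target = torch.tensor([1.0, 1.0])
asymmetric_mse(pred, target).backward()  # autograd handles the gradient, custom loss or not
```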
That's because most people learning ML don't start with a Ph.D. in applied mathematics; they start playing around with models and learn just enough linear algebra to understand(?) what they are doing.
Conversely, most people who want to get a Ph.D. in applied mathematics working on automatic differentiation might not know how to use TensorFlow very well, but I don't think that's at all harmful to their progress, even though their final results will find practical applications in libraries that TensorFlow wraps.
Indeed, in any area of computer-based research or engineering, people seem comfortable tackling hard problems even if they don't have all the relevant context for understanding those problems. They have to be; one lifetime is not enough to learn all the relevant context.
That's not to say you never have to learn anything other than the thing you are working on; I'd be the last person to make that claim. But usually, if you have enough context to understand a problem, the rest of the relevant context will inevitably come up as you are trying to solve it.
4. Why learning the contextual skills independently is useful
But say that you are a medieval knight and are trying to learn how to be the most efficient killing machine in a mounted charge with lances.
There are many ways to do it, but hopping on a horse, strapping on your spurs, and charging with a heavy lance attached to your warhorse (a 1,000kg beast galloping at 25km/h into a sea of sharp metal) is 100% not the way to do it.
You'd probably learn how to ride first, then learn how to ride fast, then learn how to ride in formation, then learn how to do it in heavy armor, then how to do it with a lance... etc
In parallel, you'd probably be practicing with a lance on foot, or by stabbing hay sacks with a long blunt stick while riding a pony.
You might also do quite a lot of adjacent physical training, learn to fight with and against a dozen or so weapons and types of shield and armor.
Then you'd spend years helping on the battlefield as a squire, a role that didn't involve being in the vanguard of a mounted charge.
Maybe you'd participate in a tourney or two, maybe in a dozen mock tourneys where the goal is lightly tapping your opponent with a blunt stick.
Why?
Because the cost of using the "real training data" is high. If I were teleported inside a knight's armor, strapped to a horse galloping towards an enemy formation with a lance in my hand, I would almost certainly die.
If someone with only 20% of a knight's training did so, the chance of death or debilitating injury might go down to half.
But at the end of the day, even for the best knight out there, the cost of learning in battle is still a, say, 1/20th chance of death or debilitating injury.
Even partially realistic training examples like tourneys would still be very dangerous, with a small chance of death, a small but significant chance of a debilitating injury, and an almost certainty of suffering a minor trauma and damage to your expensive equipment.
I've never tried fighting with blunt weapons in a mock battle, but I have friends who have, and they inform me you get injured all the time, it's tiresome and painful, and not something one could do for long. I tend to believe them; even wearing a steel helmet, the thought of being hit over the head at full strength with a 1kg piece of steel is not a pleasant one.
On the other hand, practicing combat stances or riding around on a horse involves almost zero risk of injury, death, or damage to expensive equipment.
The knight example might be a bit extreme, but even for something as simple as baking bread, the cost of failure might be the difference between being able to feed yourself and starving or begging for food for the next few weeks.
The same idea applied, and to some extent still applies, to any trade. When the stakes involve the physical world and your own body, "training on real data" is prohibitively expensive if you're not already 99% fit for the task.
5. Why the idea stuck
We are very irrational, and for good reason: most of the time we try to be rational, we fail.
If, for 10,000 years of written history, "learning" something was 1% doing and 99% learning contextual skills, we might assume this is a good pattern and stick to it without questioning it much.
Maybe we slowly observe that, in CS for example, people who code more and learn less theory do better, so our CS courses go from 99% theory and 1% coding to a 25/75 split. Maybe we observe that frontend devs who learned their craft in 10 weeks make interns just as good as people with a CS degree, so we start replacing CS degrees for frontend developers with 6-month "boot camps".
But what we should instead be thinking is whether or not we should throw away the explicit learning of contextual skills altogether in these kinds of fields.
I will, however, go even further and say that we sometimes learn contextual skills when we'd be better off using technology to play with fake but realistic "training data".
Most kids learn science as if it's a collection of theories and thus end up thinking of it as a sort of math-based religion, rather than as an incomplete and always shifting body of knowledge that's made easier to understand with layers of mathematical abstraction and gets constantly changed and refactored.
This is really bad, since many people end up rejecting it the same way they'd reject a religion, refusing to bow down before the tenured priesthood. It's even worse because the people who do "learn" science usually learn it as if it were a religion or some kind of logic-based exercise where all conclusions are absolute and all theories perfectly correct.
But why do we teach kids theory instead of letting them experiment and analyze data, thus teaching them the core idea of scientific understanding? Because experiments are expensive and potentially dangerous.
Sure, there are plenty of token chemistry, physics, and even biology experiments that you can do in a lab. However, telling an 8th-grade class:
It sounds like a joke or something done by genius kids inside the walls of an ivory tower. Not something that can be done in a random school serving a pop-2,000 village in Bavaria.
But why?
After all, if you make enough tradeoffs, it doesn't seem that hard to make a simulation for this experiment. Not one that can be used to design industrial insecticides, mind you. But you can take a 1,000-play-neuron model of an insect brain, then create 500 different populations, some of which have a wiring quirk that causes arrhythmia when certain dopamine pathways are overly stimulated.
You can equally well make a toy simulation mimicking the dopaminergic effects of nicotine derivatives based on 100 simple-to-model parameters (e.g. expression of certain enzymes that can break down nicotine, affinity to certain binding sites, ease of passing the blood-brain barrier).
You're not going to come up with anything useful, but you might even stumble close to an already known design for a neonicotinoid insecticide.
The kids or teachers needn't understand the whole simulation; after all, that would defeat the point. They need only be allowed to look at parts of it and use their existing chemistry and biology knowledge to speculate on what changes might work.
Maybe that's too high of a bar for a kid? Fine, then they need only use their chemistry knowledge to figure out if a specific molecule could even exist in theory, then fire away as many molecules as possible and see how the simulation reacts. After all, that sounds like 99% of applied biochemistry in the past and potentially even now.
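To be clear about how low the bar is, a toy version of such a simulation fits in a few dozen lines; everything below, from the parameter names to the thresholds, is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

N_POPULATIONS = 500  # insect populations, each with its own (random) wiring
N_NEURONS = 1000     # "play neurons" per insect brain

# How strongly each population's dopamine pathway responds, and the overstimulation
# threshold above which the wiring quirk triggers arrhythmia.
dopamine_sensitivity = rng.uniform(0.5, 2.0, size=N_POPULATIONS)
arrhythmia_threshold = rng.uniform(5.0, 15.0, size=N_POPULATIONS)

def simulate_compound(affinity, bbb_passage, enzyme_degradation):
    """Toy nicotine-like compound, described by three parameters in [0, 1]:
    receptor affinity, ease of passing the blood-brain barrier, and how quickly
    the insect's enzymes break the molecule down."""
    effective_dose = affinity * bbb_passage * (1.0 - enzyme_degradation)
    stimulation = effective_dose * dopamine_sensitivity * N_NEURONS / 100.0
    killed = stimulation > arrhythmia_threshold
    return killed.mean()  # fraction of populations wiped out

# A student only has to propose plausible molecules (parameter triples)
# and see how the simulation reacts.
for params in [(0.9, 0.8, 0.1), (0.4, 0.9, 0.5), (0.95, 0.3, 0.05)]:
    print(params, "->", f"{simulate_compound(*params):.0%} of populations killed")
```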
This analogy breaks down at some point, but what we have currently is so skewed towards the "training on real data is a bad approach" extreme that I think any shift towards the "training on real data is the correct approach" side would be good.
The shift away from "training on real data" being expensive and dangerous is a very recent one; to be charitable, it might have happened in the last 30 years or so, with the advent of reasonably priced computers.
Thus I think it's perfectly reasonable to assume most people haven't noticed this potential paradigm shift. However, it seems like the kind of sink-or-swim distinction that will make or break many education systems in the 21st century. In many ways, I think this process has already started, but we are just not comprehending it fully yet.