MrMind comments on Open Thread, Jun. 29 - Jul. 5, 2015 - Less Wrong

Post author: Gondolinian 29 June 2015 12:14AM


Comment author: MrMind 29 June 2015 08:01:53AM 1 point [-]

I've started learning Machine Learning (he!), and upon reading the first chapter of the most famous textbook I was already gasping for air.

For someone like me who grew into probability with Jaynes' book, seeing in the first chapter that algorithms are trained multiple times on the same data (cross-validation) was... annoying, let's say (I actually screamed at the book).

Is there a sane textbook on machine learning? I don't demand one that starts from objective bayesianism, that would be asking too much. But at least something that assumes bayesianism as a foundation? Pretty please?

Comment author: Vaniver 29 June 2015 01:43:34PM *  15 points [-]

For someone like me who grew into probability with Jaynes' book, seeing in the first chapter that algorithms are trained multiple times on the same data (cross-validation) was... annoying, let's say (I actually screamed at the book).

There are two ways to train algorithms 'multiple times' on the same data. The bad one is data duplication, but cross-validation is the good one. Data duplication is the sort of thing that Jaynes would have been worried about, because it means you're counting evidence from the same piece of data twice, and thus your model has illusory precision.

But what does cross-validation do? There's an issue called "overfitting," where any statistical procedure performed on a training set will fit both the noise and the signal in the training set, but while the signal on a test set will presumably be the same, the noise will be different and thus the model will do worse. Single validation is when you split your data into two parts, the training set and the test set, so that you can see how well your model trained on the training set does on the test set. When there's a tunable parameter in the training method, people will sometimes optimize the tunable parameter given data in the test set.*
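To make the overfitting issue concrete, here is a toy sketch (plain numpy; the sine curve, noise level, and polynomial degree are all made up for illustration): a high-degree polynomial fit scores far better on its own training set than on a held-out test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a sine curve, split once into a training and a test half.
x = rng.uniform(0, 3, 40)
y = np.sin(x) + rng.normal(0, 0.3, 40)
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def mse(coeffs, xs, ys):
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# A 9th-degree polynomial has enough freedom to chase the noise in 20 points.
coeffs = np.polyfit(x_train, y_train, deg=9)
train_mse = mse(coeffs, x_train, y_train)
test_mse = mse(coeffs, x_test, y_test)
print(train_mse, test_mse)  # the gap between these two numbers is the overfitting
```

The training error only tells you how well the fit chased this particular sample, noise included; the test error is what estimates performance on new data.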

But to do one split and leave it at that is wasteful. Cross-validation is when you partition the data many times, fit many different models, and can thus talk about how the population of models behaves. In particular, consider the case of 'leave-one-out' cross-validation: in a dataset of n points, we train n different models, each time using n-1 datapoints to fit the model parameters and testing on the 1 datapoint left out. This gives each individual model as much training data as possible while still leaving us a test dataset to determine how resilient to overfitting our model-generation procedure is.
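Leave-one-out cross-validation is just a loop. Here's a toy sketch (plain numpy; the `fit`/`loss` interface and the sample-mean "model" are illustrative choices, not any library's API):

```python
import numpy as np

def loo_cv_error(data, fit, loss):
    """Leave-one-out CV: fit n models, each on n-1 points, score on the held-out point."""
    n = len(data)
    errors = []
    for i in range(n):
        held_out = data[i]
        rest = np.delete(data, i)             # the other n-1 datapoints
        model = fit(rest)                     # train on n-1 points
        errors.append(loss(model, held_out))  # test on the single left-out point
    return float(np.mean(errors))

# Toy example: the "model" is just the sample mean, scored with squared error.
data = np.array([2.0, 4.0, 6.0, 8.0])
err = loo_cv_error(data, fit=np.mean, loss=lambda m, x: (m - x) ** 2)
print(err)  # average held-out squared error across the 4 folds
```

Note that every datapoint gets used for training (in n-1 of the folds) and for testing (in exactly one fold), but never both within the same fold.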

* The principled way to do this is to split the data three times, into a training set (which the algorithm always has access to), a validation set (which the algorithm only has access to when setting the tunable parameters), and then a test set (which the algorithm never has access to, but is used to assess how well the model does after the tunable parameter has been optimized).
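In code, the three-way split is just this (a sketch; the 60/20/20 proportions are an arbitrary choice for illustration, not a rule):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.permutation(100)  # stand-in for 100 shuffled datapoint indices

# 60/20/20 split into training, validation, and test sets.
train, validation, test = data[:60], data[60:80], data[80:]
# train:      the algorithm always has access to this
# validation: only consulted when setting the tunable (hyper)parameters
# test:       touched once at the very end, to assess the final model
```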

Comment author: MrMind 30 June 2015 08:13:32AM 0 points [-]

Allow me to quote directly from the book:

The training sample of size m is then used to compute the n-fold cross-validation error R_CV(θ) for a small number of possible values of θ. θ is next set to the value θ_0 for which R_CV(θ) is smallest and the algorithm is trained with the parameter setting θ_0 over the full training sample of size m.

So, I use cross-validation to choose a model. Then I use the same data to train the model. Insanity ensues.

Besides, even cross-validation for model selection is suspicious. Shouldn't I, ideally, train all models with all the data and form a posterior over the most probable values?

Comment author: Vaniver 30 June 2015 04:42:09PM *  1 point [-]

So, I use cross-validation to choose a model. Then I use the same data to train the model. Insanity ensues.

Why? A model has two components: the hyperparameters and the parameters. The hyperparameters are inputs to the model, and the parameters are calculated from the hyperparameters and the training data. (This is a very similar approach to what are called 'hierarchical Bayesian models.')

Instead of pulling a prior out of thin air for the hyperparameters, this asks the question "which hyperparameters are best for generalizing models to test sets outside the training set?", which is a different question from "which parameters maximize the likelihood of this data?"

(I should add that some people call it 'cross-tuning' to report a model whose hyperparameters have been selected by this sort of process, if there's no third dataset used for testing that was not used for tuning. Standard practice in ML is to still refer to it as 'cross-validation.')

Besides, even cross-validation for model-selection is suspicious. Shouldn't I, ideally, train all model with all the data and form a posterior on the most probable values?

If you do this, how will you get an estimate of how well your model is able to predict outside of the training set?

But once they do have the hyperparameter in place, this is what they do--they fit the model on the full training data, so that they can make the most use of everything.
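In code, the procedure the book describes looks something like this toy sketch (polynomial degree standing in for the tunable parameter θ; the data, the 5 folds, and the candidate values are all made up): compute the cross-validation error for each candidate, pick the minimizer, then refit on the full sample.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 3, 30)
y = np.sin(x) + rng.normal(0, 0.2, 30)

def kfold_cv_error(x, y, degree, k=5):
    """Average held-out squared error over k folds for one candidate degree."""
    idx = np.arange(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)                  # the other k-1 folds
        coeffs = np.polyfit(x[train], y[train], degree)  # fit on k-1 folds
        errs.append(np.mean((np.polyval(coeffs, x[fold]) - y[fold]) ** 2))
    return float(np.mean(errs))

# Step 1: compute R_CV(theta) for a small number of candidate values of theta.
degrees = [1, 2, 3, 5, 9]
cv_errors = {d: kfold_cv_error(x, y, d) for d in degrees}
# Step 2: set theta to the minimizer, then train over the full sample of size m.
best_degree = min(cv_errors, key=cv_errors.get)
final_model = np.polyfit(x, y, best_degree)
```

The point is that the CV loop only ever evaluates the *procedure* "fit a degree-d polynomial"; the final refit on all m points is then just running the chosen procedure once, with as much data as possible.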

Comment author: IlyaShpitser 29 June 2015 09:07:57AM *  14 points [-]

^ The above post is an illustration of the danger of LW-style Bayes. Below is a non-crazy discussion (e.g. one where people don't scream):

http://andrewgelman.com/2013/12/10/cross-validation-bayesian-estimation-tuning-parameters/

Comment author: MrMind 30 June 2015 08:24:36AM 0 points [-]

Unfortunately the discussion is above my current understanding. But by glancing at the comments I caught this:

remember that many people in machine learning are frequentist (or have not yet learned the Bayesian arts) and don’t really have any other means of tuning hyperparameters so they jump on whatever methods might be available.

Which explains why it's going to be so difficult for me to learn ML. It's like I'm forced to learn Aristotelian physics. Aaargh!

Comment author: IlyaShpitser 30 June 2015 08:25:35AM *  1 point [-]

The relationship between F and B is not like the relationship between Aristotelian physics and relativity. Not at all.

Comment author: MrMind 30 June 2015 09:07:19AM *  0 points [-]

I'm very tempted to argue that it is!
But what I wanted to convey is that it feels like I'm supposed to learn something which is manifestly inferior, in its logical foundation, to what is already known and available.

And maybe under the constraint of computational cost the finishing point of the Bayesian and the frequentist approaches is the same, but where's the proof? Where's the place where someone says: "This is Bayesian machine learning, but it's computationally too costly. So by making such-and-such simplifying assumptions, we end up with frequentist machine learning."?

Instead, what I read are things like: "In practice, Bayesian optimization has been shown to obtain better results in fewer experiments than grid search and random search" (from here).

Comment author: jsteinhardt 30 June 2015 03:46:01PM 4 points [-]

I would urge you to follow ChristianKl's advice, since I suspect you probably know much less than you think you know about either Bayesian or frequentist statistics. Perhaps you could explain in your own words why exactly it is clear that the ML book you are reading is "manifestly inferior" to your preferred approach?

Also consider reading this: A Fervent Defense of Frequentist Statistics.

Comment author: MrMind 01 July 2015 09:30:56AM 0 points [-]

Perhaps you could explain in your own words why exactly it is clear that the ML book you are reading is "manifestly inferior" to your preferred approach?

There is a bit of confusion here. I'm not stating that frequentist machine learning is inferior to Bayesian machine learning. I'm stating that Bayesian probability is superior to frequentist probability.
Why do I say this? Because in all the cases that I know, either a Bayesian model can be reduced to a frequentist one or a Bayesian model gives more accurate predictions.

That said, not even this is a problem. Since I'm learning the subject, I'm not at the stage of saying "this sentence is wrong". I'm at the stage of "this sentence doesn't make sense in the context of Bayesianism". So I'm asking "is there a book that teaches ML from a Bayesian point of view?".
The answer I'm discovering, appallingly but maybe not so, is no.

As for the fervent defence, under the premises elucidated in the comments, I hold none of the myths, so it doesn't apply.

Comment author: Vaniver 01 July 2015 01:27:49PM *  5 points [-]

Because in all the case that I know, either a Bayesian model can be reduced to a frequentist one or a Bayesian model gives more accurate prediction.

I typically see this stated as "there is a Bayesian interpretation for every effective statistical technique." As pointed out elsewhere, typically people use "frequentist" to mean "non-Bayesian," which is not particularly effective as a classification.

So I'm asking "is there a book that teaches ML from a Bayesian point of view?".

The answer I'm discovering, appallingly but maybe not so, is no.

Did you google Bayesian Machine Learning, or search for it on Amazon? Barber is a well-rated textbook available online for free. (I haven't read it; Sebastien Bratieres thinks it's comparable to Murphy, the second most popular ML book, which is Bayesian.) Incidentally, Bishop, the most popular ML book, is also Bayesian. You managed to find the only ML textbook I've seen which has, as a comment in one of the Amazon reviews, a positive comment that the book is not Bayesian!

The more meta point here is to not let a worldview shut you out from potentially useful resources. Yes, Bayesianism is the best philosophy of probability, but that does not mean it is the most effective practice of statistics, and excluding concepts or practices from your knowledge of statistics because of a disagreement on philosophy is parochial and self-limiting.

Comment author: MrMind 02 July 2015 08:49:06AM *  0 points [-]

As pointed out elsewhere, typically people use "frequentist" to mean "non-Bayesian," which is not particularly effective as a classification.

Reducing a frequentist model to a Bayesian one, though, is not a pointless exercise, since it elucidates the hidden assumptions, and at least you become better aware of its field of applicability.

Did you google Bayesian Machine Learning, or search for it on Amazon?

Only after buying the book I already have :/ Bishop, though, seems very interesting, thanks!

The more meta point here is to not let a worldview shut you out from potentially useful resources.

Thankfully, I'm learning ML for my own education, it's not something I need to practice right now.

Comment author: Vaniver 02 July 2015 01:50:48PM 1 point [-]

Bishop though seems a lot interesting, thanks!

You're welcome! I should point out that the other words I was considering using to describe Bishop are "classic" and "venerable"--it's not out of date (most actively used ML methods are surprisingly old), but you may want to read it in parallel with Barber. (In general, if you've never read textbooks in parallel before, I recommend it as a lesson in textbook design / pedagogy.)

Comment author: ChristianKl 30 June 2015 11:05:04AM 3 points [-]

But what I wanted to convey is that it feels like I'm supposed to learn something which is manifestly inferior, in its logical foundation, than what is already known and available.

I think it's very useful, when you are a beginner, to be able to listen to someone with domain expertise telling you when you are wrong.

Comment author: MrMind 01 July 2015 09:36:57AM 0 points [-]

But then I'm allowed to ask "why?", and if the answer is "because I say so", then I feel pretty confident in dismissing the expert.

But that's not even the stage I'm at. A book is not an interactive medium, so the act has gone like this:

  • book: Cross-validation!
  • me: "Gaaaak! That sounds totally wrong! Is there anyone who can explain to me either why this is right or, if it's actually wrong, what the correct approach is?"

I'm still searching for an answer...

Comment author: Wei_Dai 01 July 2015 11:25:15PM *  4 points [-]

I'm still searching for an answer...

Try this paper or page 403 of this textbook.

Also, although in this case there seems to be an available answer, I don't think it makes sense to always expect that. Sometimes people find a technique that tends to work in practice and then only later come up with a theoretical explanation of why it works. If you happen to live in the period in between...

Comment author: MrMind 02 July 2015 08:36:28AM 0 points [-]

If you happen to live in the period in between...

He! I've suddenly remembered that LW was founded exactly because the fields of AI and ML used too much frequentist (il)logic. The Sequences were supposed to restore sanity in the field.
Anyway, the textbook you mentioned seems pretty cool, thank you very much!

Comment author: ChristianKl 01 July 2015 10:19:16PM 1 point [-]

I'm no expert at machine learning. However, as far as I remember, the point of doing cross-validation is to find out whether your model is robust. Robustness is not a standard "Bayesian" concept. Maybe you don't appreciate its value?

Comment author: MrMind 02 July 2015 08:39:08AM 0 points [-]

I would appreciate it if there were an explanation of why something is done the way it is. Instead it's all about learning the passwords. Maybe it's just that the main textbook in the field is pedagogically bad; it wouldn't be the first time.

Comment author: ChristianKl 02 July 2015 12:07:03PM 0 points [-]

Getting a deep understanding of a complex field like machine intelligence isn't easy. You shouldn't expect it to be easy, or something that you can acquire in a few days.

Comment author: Viliam 30 June 2015 09:26:23PM *  0 points [-]

This is probably very arrogant of me to say, but my advice would be: "Listen to the domain expert when he tells you what you should do... and then find a Bayesian and let them explain to you why that works."

In my defense, this was my personal experience with statistics at school. I was very good at math in general, but statistics somehow didn't "click". I always had this feeling as if what was explained was built on some implicit assumptions that no one ever mentioned explicitly, so unlike with the rest of the math, I had no other choice here but to memorize that in a situation x you should do y, because, uhm, that's what my teachers told me to do. -- More than ten years later, I read LW, and here I am told that yes, the statistics that I was taught does have implicit assumptions, and suddenly it all makes sense. And it makes me very angry that no one told me this stuff at school. -- I am a "deep learner" (this, not this), and I have a problem learning something when I am told how, but can't find out why. Most people probably don't have a problem with this; they are told how, and they do, and can be quite successful with it, and probably later they will also get an idea of why. But I need to understand the stuff from the very beginning, otherwise I can't do it well. Telling me to trust a domain expert does not help; I may put big confidence in the how, but I still don't know the why.

Comment author: jsteinhardt 30 June 2015 10:24:31PM 2 points [-]

ChristianKl is not telling you to trust a domain expert, but rather to read / listen to the domain expert long enough to understand what they are saying (rather than instantly assuming they are wrong because they say something that seems to conflict with your preconceived notions).

I think if you were to read most machine learning books, you would get quite a lot of "why". See this manuscript for instance. I don't really see why you think that Bayesians have a monopoly on being able to explain things.

Comment author: ChristianKl 30 June 2015 11:02:18PM 0 points [-]

I think you make a mistake if you put a school teacher who doesn't understand statistics on a deep level into the same category as academic machine learning experts who don't happen to be "Bayesians".

Comment author: jacob_cannell 02 July 2015 12:28:10AM 2 points [-]

There is the probabilistic programming community which uses clean tools (programming languages) to hand construct models with many unknown parameters. They use approximate bayesian methods for inference, and they are slowly improving the efficiency/scalability of those techniques.

Then there is the neural net & optimization community which uses general automated models. It is more 'frequentist' (or perhaps just ad-hoc), but there are also now some bayesian inroads there. That community has the most efficient/scalable learning methods, but it isn't always clear what tradeoffs they are making.

And even in the ANN world, you sometimes see bayesian statistics brought in to justify regularizers or to derive stuff - such as in variational methods. But then for actual learning they take gradients and use SGD, with the understanding that SGD is somehow approximating the bayesian inference step, or at least doing something close enough.

Comment author: IlyaShpitser 30 June 2015 11:20:22AM 2 points [-]

I'm very tempted to argue that it is!

Ok, thank you for your time.

Comment author: Manfred 01 July 2015 05:49:01PM 4 points [-]

Eventually it makes sense, I promise. "Bayesianism" in the sense of keeping track of every hypothesis is very computationally expensive - modern algorithms only keep track of a very small number of hypotheses (only those representable by a neural network [or what have you], and even then only those required to do gradient descent). This fact opens you up to the overfitting problem, where the simplest perfect hypothesis in your space actually has very little information about the true external reality. You need some way of throwing away the parts of the signal that your model wasn't going to figure out anyhow.

For this reason among others, modern machine learning algorithms often have a lot of settings that have to be set by smarter systems (humans), before your algorithm can actually learn a novel domain. These settings reflect how the properties of the domain interact with properties of your algorithm (e.g. how many resources the algorithm has to commit before it can expect to have found something good, or what degree of noise the algorithm has to learn to throw away). These are those "hyperparameter" things. Cross-validation is just an empirical tool that helps humans figure out the right settings. You can probably figure out why it's expected to work.

Comment author: MrMind 02 July 2015 08:40:41AM 0 points [-]

I upvoted because I understand the rationale, I understand the explanation, I just rather wish that a book whose purpose is to teach the subject wouldn't be so... ad hoc.