(This is the ninth post in a sequence on Machine Learning based on this book. Click here for part I.)
Kernels
To motivate this chapter, consider some training sequence S=((x1,y1),...,(xm,ym)) with instances in some domain set X. Suppose we wish to use an embedding ψ:X→Rd of the kind discussed in the previous post (i.e., to make the representation of our points more expressive, so that they can be classified by a hyperplane). Most importantly, suppose that d is significantly larger than m. In such a case, we're describing each point ψ(xi) in terms of d coordinates, even though our space only has m points, which means that there can, in some sense, only be m "relevant" directions. In particular, let
U := span(ψ(S_x)) = {p ∈ R^d | ∃α ∈ R^m : p = ∑_{i=1}^m α_i ψ(x_i)}
where Sx is the training sequence without labels, so that ψ(Sx)=(ψ(x1),...,ψ(xm)). Then U is an (at most) m-dimensional subspace of Rd, and we would like to prove that we can work in U rather than in Rd.
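The dimension claim is easy to check numerically. Here is a quick sketch (the random matrix `psi_S` standing in for the mapped points is my own invented example):

```python
import numpy as np

rng = np.random.default_rng(0)

m, d = 5, 1000                       # few points, high-dimensional embedding space
psi_S = rng.standard_normal((m, d))  # rows play the role of psi(x_1), ..., psi(x_m)

# U = span of the mapped points; its dimension is the rank of the matrix
# whose rows are the psi(x_i), which can never exceed m.
dim_U = np.linalg.matrix_rank(psi_S)
print(dim_U)  # → 5: the rank can never exceed m, no matter how large d is
```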
I
As a first justification for this goal, observe that ψ(xi)∈U for all i∈[m]. (The symbol [n] for any n∈N denotes the set {1,...,n}.) Recall that we wish to learn a hyperplane parametrized by some w∈Rd that can then be used to predict a new instance ψ(y) for some y∈X by checking whether ⟨w,ψ(y)⟩>0. The bulk of the difficulty, however, lies in finding the vector w; this is generally much harder than computing a single inner product ⟨w,ψ(y)⟩.
Thus, our primary goals are to show that
(1) w will lie in U
(2) w can somehow be computed by only working in U
To demonstrate this, we need to look at how w is chosen, which depends on the algorithm we use. In the case of Soft Support Vector Machines (previous post), we choose

w ∈ argmin_{w ∈ R^d} (λ||w||² + (1/m) ∑_{k=1}^m max[0, 1 − y_k⟨w, ψ(x_k)⟩]).
This rule shows that we only care about the inner product between w and our mapped training points, the ψ(xi). Thus, if we could somehow prove (1), then (2) would seem to follow: if w∈U, then, according to the rule above, we would only end up caring about inner products between points that are both in U.
Therefore, we now turn to proving (1) formally. To have the result be a bit more general (so that it also applies to algorithms other than Soft Support Vector Machines), we will analyze a more general minimization problem. We assume that

w ∈ argmin_{w ∈ R^d} [f_{y_1,...,y_m}(⟨w, ψ(x_1)⟩, ..., ⟨w, ψ(x_m)⟩) + R(||w||²)]
where f is any function and R is any monotonically non-decreasing function. (You might verify that the original problem is an instance of this one.) Now let w∗ be a solution to the above problem. Then we can use extended orthogonal decomposition[1] to write w∗ = π(w∗) + q, where π : R^d → U is the projection onto U that leaves vectors in U unchanged and q is orthogonal to every vector in U. Then, for any u ∈ U, we have
⟨w∗, u⟩ = ⟨π(w∗) + q, u⟩ = ⟨π(w∗), u⟩ + ⟨q, u⟩ = ⟨π(w∗), u⟩.
In particular, this is true for all the ψ(x_i). Furthermore, since R is non-decreasing and the norm of w∗ is at least as large as the norm of π(w∗) (note that ||w∗||² = ||π(w∗)||² + ||q||² due to the Pythagorean theorem), this shows that π(w∗) is also a solution to the optimization problem. Moreover, if R is strictly monotonically increasing (as is the case for Soft Support Vector Machines) and q ≠ 0, then π(w∗) would be strictly better than w∗, which is impossible since w∗ is by assumption optimal. Thus, q must be 0, which implies that not only some but all solutions lie in U.
[1] Regular orthogonal decomposition, as I've formulated it in the previous post, only guarantees that q is orthogonal to π(w∗) rather than to every vector in U. But the extended version is no harder to prove. Choose some orthonormal basis B of U, extend it to an orthonormal basis B′ of all of R^d (amazingly, this is always possible), and define π by π(∑_{i=1}^{|B′|} α_i b_i) = ∑_{i=1}^{|B|} α_i b_i; i.e., just discard all basis elements that belong to B′ but not to B. That does the job.
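The footnote's projection can also be sketched concretely. The following (with an invented random subspace standing in for U) checks the two facts used above: inner products with elements of U are unchanged by π, and the Pythagorean identity holds:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 4, 50

A = rng.standard_normal((d, m))  # columns span U = span(psi(x_1), ..., psi(x_m))
B, _ = np.linalg.qr(A)           # B: orthonormal basis of U (d x m, reduced QR)

def proj(v):
    """Projection pi onto U: keep only the components along the basis of U."""
    return B @ (B.T @ v)

w = rng.standard_normal(d)       # an arbitrary vector, playing the role of w*
u = A @ rng.standard_normal(m)   # an arbitrary element of U

# <w, u> = <pi(w), u> for every u in U ...
assert np.isclose(w @ u, proj(w) @ u)
# ... and ||w||^2 = ||pi(w)||^2 + ||q||^2 with q = w - pi(w) (Pythagoras)
q = w - proj(w)
assert np.isclose(w @ w, proj(w) @ proj(w) + q @ q)
```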
II
We've demonstrated that only the inner products between mapped training points matter for the training process. Another way to phrase this statement is that, if we have access to the function
K_ψ : X × X → R,   K_ψ : (x, y) ↦ ⟨ψ(x), ψ(y)⟩
we no longer have any need to represent the points ψ(x_k) explicitly. The function K_ψ is what is called the kernel function, which gives the chapter its name.
Note that K takes two arbitrary points in X; it is not restricted to elements in the training sequence. This is important because, to actually apply the predictor, we will have to compute ⟨w,ψ(y)⟩ for some y∈X, as mentioned above. But to train the predictor, we only need inner products between mapped training points, as we've shown. Thus, if we set
g_{k,ℓ} := K(x_k, x_ℓ) = ⟨ψ(x_k), ψ(x_ℓ)⟩   ∀ k, ℓ ∈ [m]
then we can do our training based solely on the g_{k,ℓ} (which will lead to a predictor that uses K to classify domain points). Now let's reformulate all our relevant terms to that end. Recall that we have just proved that w∗ ∈ U. This implies that w∗ = ∑_{i=1}^m α_i ψ(x_i) for the right α_i. Also recall that our objective is to find w∗ in the set

argmin_{w ∈ U} [f(⟨w, ψ(x_1)⟩, ..., ⟨w, ψ(x_m)⟩) + R(||w||²)]

Now we can reformulate

⟨w, ψ(x_k)⟩ = ⟨∑_{i=1}^m α_i ψ(x_i), ψ(x_k)⟩ = ∑_{i=1}^m α_i ⟨ψ(x_i), ψ(x_k)⟩ = ∑_{i=1}^m α_i g_{i,k}

for all k ∈ [m], and

||w||² = ⟨w, w⟩ = ⟨∑_{i=1}^m α_i ψ(x_i), ∑_{i=1}^m α_i ψ(x_i)⟩ = ∑_{k,ℓ=1}^m α_k α_ℓ g_{k,ℓ}.

Plugging both of those into the term behind the argmin, we obtain
f(∑_{i=1}^m α_i g_{i,1}, ..., ∑_{i=1}^m α_i g_{i,m}) + R(∑_{k,ℓ=1}^m α_k α_ℓ g_{k,ℓ})
This is enough to establish that one can learn purely based on the gk,ℓ. Unfortunately, the Machine Learning literature has the annoying habit of writing everything that can possibly be written in terms of matrices and vectors in terms of matrices and vectors, so we won't quite leave it there. By setting α:=(α1,...,αm) (a row vector), we can further write the above as
f([αG]_1, ..., [αG]_m) + R(αGα^T)   where   G = (g_{k,ℓ})_{1≤k,ℓ≤m}
or even as f(αG) + R(αGα^T), at which point we've successfully traded any conceivable intuition for compactness. Nonetheless, the point that G is sufficient for learning still stands. G is also called the Gram matrix.

And for predicting a new point ψ(y), we have

⟨w, ψ(y)⟩ = ⟨∑_{i=1}^m α_i ψ(x_i), ψ(y)⟩ = ∑_{i=1}^m α_i ⟨ψ(x_i), ψ(y)⟩ = ∑_{i=1}^m α_i K(x_i, y).
At this point, you might notice that we never represented U explicitly, but just reformulated everything in terms of inner products. Indeed, one could introduce kernels without mentioning U, but I find that thinking in terms of U is quite helpful for understanding why all of this stuff works. Note that the above equation (where we predict the label of a new instance) is not an exception to the idea that we're working in U. Even though it might not be immediately apparent from looking at it, it is indeed the case that we could first project ψ(y) into U without changing anything about its prediction. In other words, it is indeed the case that ⟨w,ψ(y)⟩=⟨w,π(ψ(y))⟩ for all y∈X. This follows from the definition of π and the fact that all basis vectors outside of U are orthogonal to everything in U.
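Here is a minimal numerical sketch of kernelized prediction, with an invented toy embedding ψ(x) = (1, x, x²): predicting via ∑_i α_i K(x_i, y) gives exactly ⟨w, ψ(y)⟩, even though the right-hand computation never touches ψ directly.

```python
import numpy as np

# Explicit toy embedding psi(x) = (1, x, x^2) and its induced kernel
# (this particular psi is my own example, not anything from the post)
def psi(x):
    return np.array([1.0, x, x * x])

def K(x, xp):
    return psi(x) @ psi(xp)

xs = np.array([0.5, -1.0, 2.0])     # training instances
alpha = np.array([0.3, -0.7, 0.2])  # coefficients of w in U: w = sum_i alpha_i psi(x_i)

w = sum(a * psi(x) for a, x in zip(alpha, xs))

y = 1.5                             # a new instance to classify
lhs = w @ psi(y)                    # <w, psi(y)> computed explicitly ...
rhs = sum(a * K(x, y) for a, x in zip(alpha, xs))  # ... equals sum_i alpha_i K(x_i, y)
assert np.isclose(lhs, rhs)
```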
III
Kernels allow us to deal with arbitrarily high-dimensional data (even infinite-dimensional data) by computing m² kernel values, plus some additional computations to apply the output predictor – under the essential condition that we are able to evaluate the kernel function K. Thus, we are interested in embeddings ψ such that K_ψ is easy to evaluate.
For an important example, consider an embedding for multi-variable polynomials. Suppose we have such a polynomial of the form p:Rn→R, i.e. something like
p(x, y, z) = x²yz² + 3xyz² − 2x³z² + 12y²
where the above would be a 3-variable polynomial of degree 5. Now recall that, to learn one-dimensional polynomials with linear methods, we chose the embedding ψ : x ↦ (1, x, x², ..., x^k). That way, a linear combination of the image coordinates can do everything a polynomial predictor can do. To do the same for an arbitrary n-dimensional polynomial of degree k, we need the far more complex embedding

ψ : R^n → R^{(n+1)^k},   ψ : (x_1, ..., x_n) ↦ (∏_{i=1}^k x_{w(i)})_{w ∈ {0,...,n}^k}
An n-dimensional polynomial of degree k may have one term for each way of multiplying together at most k of its n variables. Each w ∈ {0,...,n}^k defines such a combination. Note that w is a sequence, so repetitions are allowed: for example, the sequence (1, 2, ..., 2) ∈ {0,...,n}^k corresponds to the term x_1 x_2^{k−1}. We set x_0 = 1 so that we also catch all terms with degree less than k: for example, the sequence (0, 0, 0, 3, ..., 3) corresponds to the term x_3^{k−3} and the sequence (0, ..., 0) to the constant term of the polynomial.
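To see the indexing scheme in action, here is a tiny enumeration for n = 2, k = 2. The helper `monomial` is my own; it sorts each sequence so that, e.g., (1,2) and (2,1) name the same term:

```python
from itertools import product

n, k = 2, 2  # two variables, sequences of length 2 over {0, 1, 2}

def monomial(w):
    # index 0 stands for the constant factor x_0 = 1; sorting makes (1,2)
    # and (2,1) both name the monomial x1*x2
    return "*".join(f"x{i}" for i in sorted(w) if i != 0) or "1"

terms = sorted({monomial(w) for w in product(range(n + 1), repeat=k)})
print(terms)  # → ['1', 'x1', 'x1*x1', 'x1*x2', 'x2', 'x2*x2']
```

Note that the (n+1)^k = 9 sequences name only 6 distinct monomials here; the embedding simply represents some terms more than once.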
For large n and k this target space is extremely high-dimensional, but we're studying kernels here, so the whole point will be that we won't have to represent it explicitly.
Now suppose we have two such instances ψ(x) and ψ(x′). Then

⟨ψ(x), ψ(x′)⟩ = ⟨(∏_{i=1}^k x_{w(i)})_{w ∈ {0,...,n}^k}, (∏_{i=1}^k x′_{w(i)})_{w ∈ {0,...,n}^k}⟩
= ∑_{w ∈ {0,...,n}^k} (∏_{i=1}^k x_{w(i)}) (∏_{i=1}^k x′_{w(i)})
= ∑_{w ∈ {0,...,n}^k} ∏_{i=1}^k x_{w(i)} x′_{w(i)}
And for the crucial step, the last term can be rewritten as (∑_{i=0}^n x_i x′_i)^k – expanding this k-th power yields exactly one summand ∏_{i=1}^k x_{w(i)} x′_{w(i)} for each sequence w of length k over {0,...,n}. Now (recall that x_0 = x′_0 = 1) this means that the above sum simply equals (1 + ⟨x, x′⟩)^k. In summary, this calculation shows that
K(x, x′) := (1 + ⟨x, x′⟩)^k = ⟨ψ(x), ψ(x′)⟩   ∀ x, x′ ∈ X
Thus, even though ψ maps points into the very high-dimensional space R^{(n+1)^k}, it is nonetheless feasible to learn a multivariate polynomial predictor through linear methods, namely by embedding the values via ψ and then ignoring ψ and using K instead. The Gram matrix G will consist of m² entries, where for each, a term of the form (1 + ⟨x, x′⟩)^k = (1 + ∑_{i=1}^n x_i x′_i)^k has to be computed. This doesn't look that scary! Even for relatively large values of n, k, and m, it should be possible to compute on a reasonable machine.
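The identity K(x, x′) = ⟨ψ(x), ψ(x′)⟩ can be verified numerically by actually building the explicit (n+1)^k-dimensional embedding for small n and k (the function names are mine):

```python
import numpy as np
from itertools import product

def psi(x, k):
    """Explicit embedding: one coordinate per sequence w in {0,...,n}^k,
    with the convention x_0 = 1."""
    xe = np.concatenate(([1.0], x))  # prepend x_0 = 1
    return np.array([np.prod(xe[list(w)])
                     for w in product(range(len(xe)), repeat=k)])

def K(x, xp, k):
    """Polynomial kernel (1 + <x, x'>)^k."""
    return (1.0 + x @ xp) ** k

rng = np.random.default_rng(2)
n, k = 3, 4
x, xp = rng.standard_normal(n), rng.standard_normal(n)

# the (n+1)^k = 256-dimensional inner product collapses to one cheap expression
assert np.isclose(psi(x, k) @ psi(xp, k), K(x, xp, k))
```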
If we do approach learning a multi-dimensional polynomial in this way, then (I think) there are strong reasons to question in what sense the embedding ψ actually happens – this question is what I was trying to wrap my head around at the end of the previous post. It seemed questionable to me that ψ is fundamental even if the problem is learned without kernels, but even more so if it is learned with them.
And that is all I have to say about kernels. For the second half of this post, we'll turn to a largely independent topic.
Boosting
Boosting is another item under the "widening the applicability of classes" category, much like the ψ from earlier.
I
This time, the approach is not to expand the representation of data and then apply a linear classifier on that representation. Instead, we wish to construct a complex classifier as a linear combination of simple classifiers.
When hyperplanes are visualized, it is usually understood that one primarily cares about hyperplanes in higher-dimensional spaces where they are much more expressive, despite the illustration depicting an instance in 2-d or 3-d. But this time, think of the problem instance below in literal 2-d space:
No hyperplane can classify this instance correctly, but consider a combination of these three hyperplanes:
By letting h(p) = σ_sign(h_1(p) + h_2(p) + h_3(p) − 2.5), where h_i is the predictor corresponding to the i-th hyperplane and σ_sign is the sign function, we have constructed a predictor h which has zero empirical error on this training instance.
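Since the figure isn't reproduced here, here is a sketch with three hypothetical halfplanes of my own choosing; the point is only the −2.5 threshold, which makes h the logical AND of three ±1-valued predictors:

```python
import numpy as np

# Three hypothetical halfplane predictors in literal 2-d (my own choice of
# hyperplanes, standing in for the post's figure); each outputs +1 or -1.
h1 = lambda p: np.sign(p[1])             # above the x-axis
h2 = lambda p: np.sign(p[0])             # right of the y-axis
h3 = lambda p: np.sign(1 - p[0] - p[1])  # below the line x + y = 1

def h(p):
    # the sum of three +-1 values minus 2.5 is positive only if all three
    # are +1, so h is the intersection (AND) of the three halfplanes
    return np.sign(h1(p) + h2(p) + h3(p) - 2.5)

print(h(np.array([0.2, 0.2])))  # inside the triangle → 1.0
print(h(np.array([2.0, 2.0])))  # outside             → -1.0
```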
Perhaps more surprisingly, this trick can also learn non-convex areas. The instance below,
will be classified correctly by letting h(p) = σ_sign(h_1(p) + 2h_2(p) + 2h_3(p)), with the h_i (ordered left to right) defined like so:
These two examples illustrate that the resulting class is quite expressive. The question is, how to learn such a linear combination?
II
First, note that hyperplanes are just an example; the framework is formulated in terms of a learning algorithm that has access to a weak learner, where
An algorithm A is called a γ-weak learner for a hypothesis class H iff there is a function w∗ : (0,1) → N such that, for any probability distribution D over X×Y and any failure probability δ ∈ (0,1), if S consists of at least w∗(δ) i.i.d. points sampled via D, then with probability at least 1−δ over the choice of S, it holds that ℓ_D(A(S)) ≤ 1/2 − γ.
If you recall the definition of PAC learnability back from chapter 1, you'll notice that this is very similar. The only difference is in the error: PAC learning demands that it be arbitrarily close to the best possible error, while a weak learner merely has to bound it away from 1/2 by some fixed amount γ, which can be quite small. Thus, a weak learner is simply an algorithm that puts out a predictor that performs a little bit better than random. In the first example, the upper hyperplane could be the output of a weak learner. The term "boosting" refers to the process of upgrading this one weak learner into a better one, precisely by applying it over and over again under the supervision of a smartly designed algorithm –
– which brings us back to the question of how to define such an algorithm. The second example (the non-convex one) illustrates a key insight here: repeatedly querying the weak learner on the unaltered training instance is unlikely to be fruitful, because the third hyperplane by itself performs worse than random, and will thus not be output by a γ-weak learner (not for any γ ∈ R+). To remedy this, we somehow need to prioritize the points we're currently getting wrong. Suppose we begin with the first two hyperplanes; at this point, we have classified the left and middle clusters correctly. If we then weigh the right cluster sufficiently more strongly than the other two, eventually h_3 will perform better than random. Thus, we wish to adapt our weighting of training points dynamically, and we can do this in terms of a probability distribution over the training sequence.
Now the roadmap for defining the algorithm which learns a predictor on a binary classification problem via boosting is as follows:
Have access to a training sequence S and a γ-weak learner A_γ
Manage a list of weak predictors which A_γ has output in previous rounds
At every step, hand A_γ the training sequence S along with some distribution D(t) over S, and have it output a γ-weak predictor h_{t+1} on the problem (S, D(t)), where each point in S is taken into account proportionally to its probability mass
Stop at some point and output a linear combination of the h_i
The particular algorithm we will construct is called AdaBoost, where "Ada" doesn't have any relation to the programming language, but simply stands for "adaptive".
III
Let's first look into how to define our probability distribution, which will be the most complicated part of the algorithm. Suppose we have our current distribution D(t) based on past predictors h_1, ..., h_{t−1} output by A_γ, and suppose further that we have computed weights w_1, ..., w_{t−1} such that w_i measures the quality of h_i (higher is better). Now we receive a new predictor h_t with quality w_t. Then we can define a new probability distribution D(t+1) by letting
D(t+1)((x_i, y_i)) ∝ D(t)((x_i, y_i)) · e^{−w_t y_i h_t(x_i)}   ∀ i ∈ [m]
where we write ∝ rather than = because the right-hand side isn't normalized; D(t+1) will equal the above scaled such that all probabilities sum to 1.
The term y_i h_t(x_i) is 1 iff predictor h_t classified x_i correctly. Thus, the right component of the product equals e^{−w_t} if the point was classified correctly, and e^{w_t} if it wasn't. If h_t is a bad predictor and w_t is small, say 10^{−3}, the two terms are both close to 1, and we don't end up changing our weight on (x_i, y_i) very much. But if h_t is good and w_t is large, the old weight D(t)((x_i, y_i)) will be scaled significantly upward (if h_t got the point wrong) or downward (if it got the point right). In our second example, the middle hyperplane performs quite well on the uniform distribution, so w_2 should be reasonably high, which will cause the probability mass on the right cluster to increase and that on the two other clusters to decrease. If this is enough to make the right cluster dominate the computation, then the weak learner might output the right hyperplane next. If not, it might output the second hyperplane again. Eventually, the weights will have shifted enough for the third hyperplane to become feasible.
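The update rule itself is a two-liner. The numbers below are an invented four-point example where h_t misclassifies exactly the last point:

```python
import numpy as np

def update(D, w_t, margins):
    """One reweighting step: margins[i] = y_i * h_t(x_i), which is +1 or -1."""
    D_new = D * np.exp(-w_t * margins)
    return D_new / D_new.sum()  # normalize so the weights sum to 1 again

D = np.full(4, 0.25)                 # uniform to start
margins = np.array([1, 1, 1, -1])    # h_t gets only the last point wrong
D_next = update(D, w_t=0.8, margins=margins)
print(D_next)  # probability mass shifts toward the misclassified point
```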
IV
Now let's look at the weights. Let ϵ_t = ℓ_S^{0−1}(h_t) be the usual empirical error of h_t, i.e., ϵ_t = (1/m)|{(x, y) ∈ S | h_t(x) ≠ y}|. We would like w_t to be a real number which is close to 0 for ϵ_t close to 1/2 and grows indefinitely for ϵ_t close to 0. One possible choice is w_t := (1/2) ln(1/ϵ_t − 1). You can verify that it has these properties – in particular, recall that h_t is output by a weak learner, so its error is bounded away from 1/2 by at least γ. Because of this, 1/ϵ_t is larger than 2, so that 1/ϵ_t − 1 is larger than 1 and w_t is larger than 0.
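A quick check of how w_t = (1/2) ln(1/ϵ_t − 1) behaves across the stated range:

```python
import math

def weight(eps):
    """w_t = 0.5 * ln(1/eps - 1): near 0 for eps near 1/2, unbounded as eps -> 0."""
    return 0.5 * math.log(1.0 / eps - 1.0)

for eps in (0.49, 0.4, 0.25, 0.1, 0.01):
    print(f"eps = {eps:5}  ->  w_t = {weight(eps):.3f}")
```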
V
To summarize,
AdaBoost(A_γ : weak learner, S : training sequence, T : N+)
    D(0) ← (1/m, ..., 1/m)
    for t ← 1 to T do
        h_t ← A_γ(S, D(t−1))
        ϵ_t ← (1/m)|{(x, y) ∈ S | h_t(x) ≠ y}|
        w_t ← (1/2) ln(1/ϵ_t − 1)
        D(t) ← normalize((D(t−1)_i · e^{−w_t y_i h_t(x_i)})_{i ∈ [m]})
    endfor
    return f : x ↦ σ_sign(∑_{t=1}^T w_t h_t(x))
end
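The pseudocode translates almost line by line into Python. The threshold-stump weak learner and the toy data set below are my own choices, and note one deviation, flagged in the comments: ϵ_t is computed as the D(t−1)-weighted error (which is what the stump search actually minimizes), rather than the unweighted empirical error.

```python
import numpy as np

def stump(theta, s):
    # decision stump on the real line: predicts +s right of theta, -s left of it
    return lambda X: np.where(X > theta, s, -s)

def weak_learner(X, y, D):
    """Exhaustive search over threshold stumps for the smallest D-weighted error."""
    thresholds = np.concatenate(([X.min() - 1], (X[:-1] + X[1:]) / 2, [X.max() + 1]))
    best = None
    for theta in thresholds:
        for s in (+1, -1):
            h = stump(theta, s)
            err = D[h(X) != y].sum()
            if best is None or err < best[0]:
                best = (err, h)
    return best  # (weighted error, predictor)

def adaboost(X, y, T):
    m = len(X)
    D = np.full(m, 1 / m)               # D(0): uniform over the sample
    hs, ws = [], []
    for _ in range(T):
        # NB: eps is the D-weighted error here, not the unweighted one from
        # the pseudocode; it must stay in (0, 1/2) for the log below to work
        eps, h = weak_learner(X, y, D)
        w = 0.5 * np.log(1 / eps - 1)   # w_t <- 1/2 ln(1/eps_t - 1)
        D = D * np.exp(-w * y * h(X))   # reweight ...
        D = D / D.sum()                 # ... and normalize
        hs.append(h)
        ws.append(w)
    return lambda Xq: np.sign(sum(w * h(Xq) for w, h in zip(ws, hs)))

# a 1-d toy set no single stump can separate: + + - - + +
X = np.arange(6.0)
y = np.array([1, 1, -1, -1, 1, 1])
f = adaboost(X, y, T=3)
print((f(X) != y).mean())  # training error of the boosted predictor → 0.0
```

Three rounds suffice here: the boosted sum of three stumps carves out the middle interval that no single stump can express.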
VI
If one assumes that A_γ always returns a predictor with error at most 1/2 − γ (recall that it may fail with probability δ), one can derive a bound on the error of the output predictor. Fortunately, the dependence of the sample complexity on δ is only logarithmic, so δ can probably be pushed low enough that A_γ is unlikely to fail even if it is called T times.
Now the error bound one can derive is e^{−2γ²T}. Looking at this, it has exactly the properties one would expect: a higher γ pushes the error down, and so do more rounds of the algorithm. On the other hand, doing more rounds increases the chance of overfitting to random quirks in the training data. Thus, the parameter T allows one to balance the overfitting vs. underfitting tradeoff, which is another nice thing about AdaBoost.

The book mentions that Boosting has been successfully applied to the task of classifying gray-scale images into 'contains a human face' and 'doesn't contain a human face'. This implies that human faces can be recognized using a set of quantitative rules – but, importantly, rules which have been generated by an algorithm rather than constructed by hand. (In that case, the weak learner did not return hyperplanes, but simple predictors of another form.) In this case, the result fits with my intuition (that face recognition is the kind of task where a set-of-rules approach will work). It would be interesting to know how well boosting performs on other problems.