{"title": "Weak Learners and Improved Rates of Convergence in Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 280, "page_last": 286, "abstract": null, "full_text": "Weak Learners and Improved Rates of \n\nConvergence in Boosting \n\nShie Mannor and Ron Meir \n\nDepartment of Electrical Engineering \n\nTechnion, Haifa 32000, Israel \n\n{shie,rmeir }@{techunix,ee}.technion.ac.il \n\nAbstract \n\nThe problem of constructing weak classifiers for boosting algo(cid:173)\nrithms is studied. We present an algorithm that produces a linear \nclassifier that is guaranteed to achieve an error better than random \nguessing for any distribution on the data. While this weak learner \nis not useful for learning in general, we show that under reasonable \nconditions on the distribution it yields an effective weak learner for \none-dimensional problems. Preliminary simulations suggest that \nsimilar behavior can be expected in higher dimensions, a result \nwhich is corroborated by some recent theoretical bounds. Addi(cid:173)\ntionally, we provide improved convergence rate bounds for the gen(cid:173)\neralization error in situations where the empirical error can be made \nsmall, which is exactly the situation that occurs if weak learners \nwith guaranteed performance that is better than random guessing \ncan be established. \n\n1 \n\nIntroduction \n\nThe recently introduced boosting approach to classification (e.g., [10]) has been \nshown to be a highly effective procedure for constructing complex classifiers. Boost(cid:173)\ning type algorithms have recently been shown [9] to be strongly related to other in(cid:173)\ncremental greedy algorithms (e.g., [6]). Although a great deal of numerical evidence \nsuggests that boosting works very well across a wide spectrum of tasks, it is not a \npanacea for solving classification problems. 
In fact, many versions of boosting algorithms currently exist (e.g., [4], [9]), each possessing advantages and disadvantages in terms of classification accuracy, interpretability and ease of implementation.\n\nThe field of boosting provides two major theoretical results. First, it is shown that in certain situations the training error of the classifier formed converges to zero (see (2)). Moreover, under certain conditions, a positive margin can be guaranteed. Second, bounds are provided for the generalization error of the classifier (see (1)).\n\nThe main contribution of this paper is twofold. First, we present a simple and efficient algorithm which is shown, for every distribution on the data, to yield a linear classifier with guaranteed error smaller than 1/2 - γ, where γ is strictly positive. This establishes that a weak linear classifier exists. From the theory of boosting [10] it is known that such a condition suffices to guarantee that the training error converges to zero as the number of boosting iterations increases. In fact, the empirical error with a finite margin is shown to converge to zero if γ is sufficiently large. However, the existence of a weak learner with error 1/2 - γ is not always useful in terms of generalization error, since it applies even to the extreme case where the binary labels are drawn independently at random with equal probability at each point, in which case we cannot expect any generalization. It is then clear that in order to construct useful weak learners, some assumptions need to be made about the data. In this work we show that under certain natural conditions, a useful weak learner can be constructed for one-dimensional problems, in which case the linear hyperplane degenerates to a point. We speculate that similar results hold for higher-dimensional problems, and present some supporting numerical evidence for this. 
In fact, some very recent results [7] show that this expectation is indeed borne out. The second contribution of our work consists of establishing faster convergence rates for the generalization error bounds introduced recently by Mason et al. [8]. These improved bounds show that faster convergence can be achieved if we allow for convergence to a slightly larger value than in previous bounds. Given the guaranteed convergence of the empirical loss to zero (in the limited situations in which we have proved such a bound), such a result may yield a better trade-off between the terms appearing in the bound, offering a better model selection criterion (see Chapter 15 in [1]).\n\n2 Construction of a Linear Weak Learner\n\nWe recall the basic generalization bound for convex combinations of classifiers. Let H be a class of binary classifiers of VC-dimension d_v, and denote by co(H) the convex hull of H. Given a sample S = {(x_1, y_1), ..., (x_m, y_m)} ∈ (X × {-1, +1})^m of m examples drawn independently at random from a probability distribution D over X × {-1, +1}, Schapire et al. [10] show that with probability at least 1 - δ, for every f ∈ co(H) and every θ > 0,\n\n  P_D[Y f(X) ≤ 0] ≤ P_S[Y f(X) ≤ θ] + O( (1/√m) ( d_v log²(m/d_v)/θ² + log(1/δ) )^{1/2} ),   (1)\n\nwhere the margin error P_S[Y f(X) ≤ θ] denotes the fraction of training points for which y_i f(x_i) ≤ θ. Clearly, if the first term can be made small for a large value of the margin θ, a tight bound can be established. Schapire et al. [10] also show that if each weak classifier can achieve an error smaller than 1/2 - γ, then\n\n  P_S[Y f(X) ≤ θ] ≤ ( (1 - 2γ)^{1-θ} (1 + 2γ)^{1+θ} )^{T/2},   (2)\n\nwhere T is the number of boosting iterations. Note that if γ > θ, the bound decreases to zero exponentially fast. It is thus clear that a large value of γ is needed in order to guarantee a small value for the margin error. 
However, if γ (and thus θ) behaves like m^{-β} for some β > 0, the rate of convergence in the second term in (1) will deteriorate, leading to worse bounds than those available by using standard VC results [11]. What is needed is a characterization of conditions under which the achievable θ does not decrease rapidly with m. In this section we present such conditions for one-dimensional problems, and mention recent work [7] that proves a similar result in higher dimensions.\n\nWe begin by demonstrating that for any distribution on m points, a linear classifier can achieve an error smaller than 1/2 - γ, where γ = O(1/m). In view of our comments above, such a fast convergence of γ to zero may be useless for generalization bounds. We then use our construction to show that, under certain regularity conditions, a value of γ, and thus of θ, which is independent of m can be established for one-dimensional problems.\n\nLet {x_1, ..., x_m} be points in R^d, and denote by {y_1, ..., y_m} their binary labels, i.e., y_i ∈ {-1, +1}. A linear decision rule takes the form ŷ(x) = sgn(a · x + b), where · is the standard inner product in R^d. Let P ∈ Δ_m be a probability measure on the m points. The weighted misclassification error of a classifier ŷ is P_e(a, b) = Σ_{i=1}^m p_i I(ŷ_i ≠ y_i). For technical reasons, we prefer to use the expression 1 - 2P_e = Σ_{i=1}^m p_i y_i ŷ_i. Obviously, if 1 - 2P_e ≥ ε, we have that P_e ≤ 1/2 - ε/2.\n\nLemma 1 For any sample of m distinct points, S = {(x_i, y_i)}_{i=1}^m ∈ (R^d × {-1, +1})^m, and a probability measure P ∈ Δ_m on S, there exist a ∈ R^d and b ∈ R such that the weighted misclassification error of the linear classifier ŷ = sgn(a · x + b) is bounded away from 1/2; in particular, Σ_{i=1}^m p_i I(ŷ_i ≠ y_i) ≤ 1/2 - 1/(4m).\n\nProof The basic idea of the proof is to project a finite number of points onto a line h so that no two points coincide. 
Since there is at least one point x whose weight is not smaller than 1/m, we consider the four possible linear classifiers defined by h with boundaries near x (on both sides of it and with opposite signs), and show that one of these yields the desired result. We proceed to the detailed proof. Fix a probability vector P = (p_1, ..., p_m) ∈ Δ_m. We may assume w.l.o.g. that all the x_i are distinct, or else we can merge two elements and obtain m - 1 points. First, observe that if |Σ_{i=1}^m p_i y_i| ≥ 1/(2m), then the problem is trivially solved. To see this, denote by S± the sub-samples of S labelled ±1, respectively. Assume, for example, that Σ_{i∈S+} p_i ≥ Σ_{i∈S-} p_i + 1/(2m). Then the choice a = 0, b = 1, namely ŷ_i = 1 for all i, implies that Σ_i p_i y_i ŷ_i ≥ 1/(2m). Similarly, the choice a = 0, b = -1 solves the problem if Σ_{i∈S-} p_i ≥ Σ_{i∈S+} p_i + 1/(2m). Thus, we can assume, without loss of generality, that |Σ_{i=1}^m p_i y_i| < 1/(2m). Next, note that there exists a direction u such that i ≠ j implies u · x_i ≠ u · x_j. This can be seen by the following argument. Construct all one-dimensional lines containing two or more data points; clearly the number of such lines is at most m(m - 1)/2. Any line which is not perpendicular to any of these lines then obeys the required condition. Let x_i be a data point for which p_i ≥ 1/m, and let ε be a positive number such that 0 < ε < min{|u · x_i - u · x_j| : i ≠ j, i, j ∈ {1, ..., m}}. Such an ε always exists since the points are assumed to be distinct. Note the following trivial algebraic fact:\n\n  if A ≥ δ_2 and |A + B| < δ_1 ≤ δ_2, then A - B > 2δ_2 - δ_1 ≥ δ_1.   (3)\n\nFor each j = 1, 2, ..., m let the classification be given by ŷ_j = sgn(u · x_j + b), where the bias b is given by b = -u · x_i + ε y_i. Then clearly ŷ_i = y_i and ŷ_j = sgn(u · x_j - u · x_i) for j ≠ i, and therefore Σ_j p_j y_j ŷ_j = p_i + Σ_{j≠i} p_j y_j sgn(u · x_j - u · x_i). Let A = p_i and B = Σ_{j≠i} p_j y_j sgn(u · x_j - u · x_i). If |A + B| ≥ 1/(2m) we are done (negating a and b if A + B is negative). 
Otherwise, if |A + B| < 1/(2m), consider the classifier ŷ'_j = sgn(-u · x_j + b'), with b' = u · x_i + ε y_i (note that ŷ'_i = y_i and ŷ'_j = -ŷ_j for j ≠ i), whose edge is A - B. Using (3) with δ_1 = 1/(2m) and δ_2 = 1/m, the claim follows. •\n\nWe comment that the upper bound in Lemma 1 may be improved to 1/2 - 1/(4(m - 1)), m ≥ 2, using a more refined argument.\n\nRemark 1 Lemma 1 implies that an error of 1/2 - γ, where γ = O(1/m), can be guaranteed for any set of arbitrarily weighted points. It is well known that the problem of finding a linear classifier with minimal classification error is NP-hard (in d) [5]. Moreover, even the problem of approximating the optimal solution is NP-hard [2]. Since the algorithm described in Lemma 1 is clearly polynomial (in m and d), there appears to be a transition, as a function of γ, between the classes NP and P (assuming, as usual, that they are different). This issue warrants further investigation.\n\nWhile the result given in Lemma 1 is interesting, its generality precludes its usefulness for bounding generalization error. This can be seen by observing that the lemma guarantees the given margin even in the case where the labels y_i are drawn uniformly at random from {±1}, in which case no generalization can be expected. In order to obtain a more useful result, we need to restrict the complexity of the data distribution. We do this by imposing constraints on the types of decision regions characterizing the data. In order to generate complex, yet tractable, decision regions we consider a multi-linear mapping from R^d to {-1, 1}^k, generated by the k hyperplanes P_i = {x ∈ R^d : w_i · x + w_{i0} = 0}, i = 1, ..., k, as in the first hidden layer of a neural network. Such a mapping generates a partition of the input space R^d into M connected components of R^d \\ ∪_{i=1}^k P_i, each characterized by a unique binary vector of length k. Assume that the weight vectors (w_i, w_{i0}) ∈ R^{d+1} are in general position. 
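\nBefore proceeding, we note that the construction in the proof of Lemma 1 is directly implementable. The following is a minimal sketch (the function name and the random choice of projection direction are ours; a random direction separates distinct points with probability one):\n

```python
import numpy as np

def weak_linear_learner(X, y, p, seed=0):
    """Sketch of Lemma 1: return (a, b) such that sign(a . x + b)
    has weighted error at most 1/2 - 1/(4m).

    Assumes m >= 2 distinct points, labels y in {-1, +1}, and
    weights p summing to one.
    """
    m, d = X.shape
    edge0 = float(np.dot(p, y))  # edge of the constant classifier yhat = +1
    if abs(edge0) >= 1.0 / (2 * m):
        # A constant classifier already has edge >= 1/(2m).
        return np.zeros(d), (1.0 if edge0 > 0 else -1.0)
    u = np.random.default_rng(seed).standard_normal(d)
    z = X @ u                      # projections; distinct with probability one
    i = int(np.argmax(p))          # p[i] >= 1/m since the weights sum to one
    gaps = np.abs(z[:, None] - z[None, :])[np.triu_indices(m, 1)]
    eps = 0.5 * float(gaps.min())  # 0 < eps < min_{j != l} |z_j - z_l|
    for s in (1.0, -1.0):          # the two orientations tried in the proof
        a, b = s * u, -s * z[i] + eps * y[i]
        if np.dot(p, y * np.sign(X @ a + b)) >= 1.0 / (2 * m):
            return a, b            # edge >= 1/(2m), i.e. error <= 1/2 - 1/(4m)
    raise AssertionError("unreachable: Lemma 1 guarantees one case succeeds")

# Example: the guarantee holds for arbitrary labels and weights.
rng = np.random.default_rng(7)
X = rng.standard_normal((25, 3))
y = rng.choice([-1.0, 1.0], size=25)
p = rng.random(25)
p /= p.sum()
a, b = weak_linear_learner(X, y, p)
err = float(np.dot(p, np.sign(X @ a + b) != y))
```

\nNote that err ≤ 1/2 - 1/(4m) here no matter how the labels were generated, which is precisely why Lemma 1 by itself says nothing about generalization.\n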
The number of connected components is given by (e.g., Lemma 3.3 in [1]) C(k, d + 1) = 2 Σ_{i=0}^{d} binom(k-1, i). This number is bounded from below by 2 binom(k-1, d), which in turn is bounded below by 2((k - 1)/d)^d. An upper bound is given by 2(e(k - 1)/d)^d, for k - 1 ≥ d. In other words, C(k, d + 1) = Θ((k/d)^d). In order to generate a binary classification problem, we observe that there exists a binary function from {-1, 1}^k to {-1, 1} characterized by these M decision regions. This can be seen as follows. Choose an arbitrary connected component, and label it by +1 (say). Proceed by labelling all its neighbors by -1, where neighbors share a common boundary (a (d - 1)-dimensional hyperplane in d dimensions). Proceeding by induction, we generate a binary classification problem composed of exactly M decision regions. Thus, we have constructed a binary classification problem characterized by at least 2 binom(k-1, d) ≥ 2((k - 1)/d)^d decision regions. Clearly, as k becomes arbitrarily large, very elaborate regions are formed.\n\nWe now apply these ideas, together with Lemma 1, to a one-dimensional problem. Note that in this case the partition is composed of intervals.\n\nTheorem 1 Let F be a class of functions from R to {±1} each of which partitions the real line into at most k intervals, k ≥ 2. Let μ be an arbitrary probability measure on R. Then for any f ∈ F there exist a, T* ∈ R for which\n\n  μ{x : f(x) sgn(ax - T*) = 1} ≥ 1/2 + 1/(4k).   (4)\n\nProof Let a function f be given, and denote its connected components by I_1, ..., I_k; that is, I_1 = (-∞, l_1), I_2 = [l_1, l_2), I_3 = [l_2, l_3), and so on until I_k = [l_{k-1}, ∞), with -∞ = l_0 < l_1 < l_2 < ... < l_{k-1}. Associate with every interval a point in R,\n\n  x_1 = l_1 - 1, x_2 = (l_1 + l_2)/2, ..., x_{k-1} = (l_{k-2} + l_{k-1})/2, x_k = l_{k-1} + 1,\n\na weight μ_i = μ(I_i), i = 1, ..., k, and a label f(x_i) ∈ {±1}. 
We now apply Lemma 1 to conclude that there exist a ∈ {±1} and T ∈ R such that Σ_{i=1}^k μ_i f(x_i) sgn(a x_i - T) ≥ 1/(2k). The value of T lies between l_i and l_{i+1} for some i ∈ {0, 1, ..., k - 1} (recall that l_0 = -∞). We identify the T* of (4) with l_{i+1}. This works because, with this choice of T*, sgn(ax - T*) is constant on every segment I_j and agrees there with the value sgn(a x_j - T) assigned to its representative point, while f(x) = f(x_j) on I_j, so that\n\n  μ{x : f(x) sgn(ax - T*) = 1} = 1/2 + (1/2) Σ_{i=1}^k μ_i f(x_i) sgn(a x_i - T) ≥ 1/2 + 1/(4k). •\n\nNote that the result in Theorem 1 is in fact more general than we need, as it applies to arbitrary distributions, rather than distributions over a finite set of points. An open problem at this point is whether a similar result applies to d-dimensional problems. We conjecture that in d dimensions γ behaves like k^{-l(d)} for some function l, where k is a measure of the number of homogeneous convex regions defined by the data (a homogeneous region is one in which all points possess identical labels).\n\nWhile we do not have a general proof at this stage, we have recently shown [7] that the conjecture holds under certain natural conditions on the data. This result implies that, at least under appropriate conditions, boosting-like algorithms are expected to have excellent generalization performance. To provide some motivation, we present the results of some numerical simulations for two-dimensional problems. For this simulation we used random lines to generate a partition of the unit square in R². We then drew 1000 points at random from the unit square and assigned them labels according to the partition. Finally, in order to have a non-trivial problem, we made sure that the cumulative weights of the two classes are equal. We then calculated the optimal linear classifier by exhaustive search. In Figure 1(b) we show a sample decision region with 93 regions. Figure 1(a) shows the dependence of γ on the number of regions k. 
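\nA one-dimensional analogue of this experiment, mirroring the construction of Theorem 1, can be sketched as follows (the helper function and all names are ours): label sample points by a random partition of [0, 1] into k intervals, exhaustively search over threshold classifiers sgn(a(x - t)) with a ∈ {-1, +1}, and verify that the achieved edge is at least 1/(2k), i.e. that the weighted error is at most 1/2 - 1/(4k).\n

```python
import numpy as np

def best_threshold_edge(x, y, w):
    """Largest edge sum_i w_i y_i sgn(a (x_i - t)) over a in {-1, +1}, t in R."""
    order = np.argsort(x)
    xs, ys, ws = x[order], y[order], w[order]
    # Candidate thresholds: outside the data range and between consecutive points.
    ts = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    best = 0.0
    for t in ts:
        pred = np.where(xs > t, 1.0, -1.0)  # a = +1; a = -1 flips the edge's sign
        best = max(best, abs(float(np.dot(ws, ys * pred))))
    return best

rng = np.random.default_rng(0)
k, n = 6, 500
cuts = np.sort(rng.random(k - 1))                   # k random intervals of [0, 1]
labels = np.array([(-1.0) ** j for j in range(k)])  # alternating interval labels
x = rng.random(n)
y = labels[np.searchsorted(cuts, x)]
w = np.full(n, 1.0 / n)                             # empirical measure
edge = best_threshold_edge(x, y, w)                 # Theorem 1: edge >= 1/(2k)
```

\nBy Theorem 1 applied to the empirical measure, the achieved edge is at least 1/(2k); as in the two-dimensional simulations, the guaranteed margin degrades as k grows.\n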
As it turns out, there is a significant logarithmic dependence between γ and k, which leads us to conjecture that γ ~ Ck^{-l} + ε for some C, l and ε. In the case presented, l = 3 fits our model well. It is important to note, however, that the procedure described above supports our claim only in an average-case, rather than worst-case, setting, as would be needed.\n\nFigure 1: (a) γ as a function of the number of regions. (b) A typical complex partition of the unit square used in the simulations.\n\n3 Improved Convergence Rates\n\nIn Section 2 we proved that under certain conditions a weak learner exists with a sufficiently large margin, and thus the first term in (1) indeed converges to zero. We now analyze the second term in (1) and show that it may be made to converge considerably faster if the first term is made somewhat larger. First, we briefly recall the framework introduced recently by Mason et al. [8]. These authors begin by introducing the notion of a B-admissible family of functions. For completeness we repeat their definition.\n\nDefinition 1 (Definition 2 in [8]) A family {C_N : N ∈ N} of margin cost functions is B-admissible for B ≥ 0 if for all N ∈ N there is an interval Y ⊂ R of length no more than B and a function Ψ_N : [-1, 1] → Y that satisfies\n\n  sgn(-α) ≤ E_{Z~Q_{N,α}}[Ψ_N(Z)] ≤ C_N(α)\n\nfor all α ∈ [-1, 1], where E_{Z~Q_{N,α}}(·) denotes the expectation when Z is chosen randomly as Z = (1/N) Σ_{i=1}^N Z_i and P(Z_i = 1) = (1 + α)/2.\n\nDenote the convex hull of a class H by co(H). The main theoretical result in [8] is the following lemma. 
\n\n\fLemma 2 ([8], Theorem 3) For any B-admissible family {CN : N E N} of \nmargin junctions, for any binary hypothesis class of VC dimension dv and any \ndistribution D on X x {-I, + 1 }, with probability at least 1 - c5 over a random \nsample S of m examples drawn at random according to D, every N and ev(cid:173)\n:::; Es[CN(yf(x)] + EN, where EN = \nery f E co(H) satisfies Pn[yf(x) :::; 0] \no ([(B 2 N dv logm + log(N/c5))/m] 1/2) . \nRemark 2 The most appealing feature of Lemma 2, as of other results for convex \ncombinations, is the fact that the bound does not depend on the number of hy(cid:173)\npotheses from H defining f E co(H), which may in fact be infinite. Using standard \nVC results (e.g. [11]) would lead to useless bounds, since the VC dimension of these \nclasses is often huge (possibly infinite). \n\nLemma 2 considers binary hypotheses. Since recent works has demonstrated the \neffectiveness of using real valued hypotheses, we consider the case where the weak \nclassifiers may be confidence-rated, i.e., taking values in [-1,1] rather than {\u00b1I}. \nWe first extend Lemma 2 to confidence-rated classifiers. Note that the variables Zi \nin Definition 1 are no longer binary in this case. \n\nLemma 3 Let the conditions of Lemma 2 hold, except that H is a class of real \nvalued functions from X to [-1, +1] of pseudo-dimension dp . Assume further that \nWN in Definition 1 obeys a Lipschitz condition of the form IWN(X) - wN(x')1 :::; \nLlx - x'I for every x, x' EX. Then with probability at least 1- c5, Pn[yf(x) :::; 0] :::; \nEs[CN(yf(x)] + EN, where EN = 0 ([(LB 2 Ndp logm + log(N/c5))/m] 1/2) . \nProof The proof is very similar to the proof of Theorem 2, and will be omitted for \nthe sake of brevity. 
\n\u2022 \nIt is well known that in the standard setting where CN is replaced by the empirical \nclassification error, improved rates, replacing O( Jlog m/m) by O(log m/m), are \npossible in two situations: (i) if the minimal value of CN is zero (the restricted \nmodel of [1]), and (ii) if the empirical error is replaced by (1 +o:)CN for some 0: > O. \nThe latter case is especially important in a model selection setup, where nested \nclasses of hypothesis functions are considered, since in this case one expects that, \nwith high probability, CN becomes smaller as the classes become more complex. In \nthis situation, case (ii) provides better overall bounds, often leading to the optimal \nminimax rates for non parametric problems (see a discussion of these issues in Sec. \n15.4 of [1]). \nWe now establish a faster convergence rate to a slightly larger value than \nIn situations where the latter quantity approaches zero, the \nEs [CN(Yf(X))]. \noverall convergence rate may be improved, as discussed above. We consider cost \nfunctions CN(o:), which obey the condition \n\nCN(o:) :::; (1 + (3N )1(0: < 0) + 'TJN \n\n((3N > 0, 'TJN > 0). \n\n(5) \n\nfor some positive (3N and 'TJN (see [8] for details on legitimate cost functions). \nTheorem 2 Let D be a distribution over X x {-I, + I}, and let S be a sample of \nm points chosen independently at random according to D. Let dp be the pseudo(cid:173)\ndimension of the class H, and assume that CN(O:) obeys condition (5). Then for \nsufficiently large m, with probability at least 1-c5, every function f E co( H) satisfies \nthe following bound for every 0 < 0: < 1/(3 N \n\nPn [Yf(X):::; 0]:::; 1- O:;N Es [CN(Yf(X))] + 0 \n\n( 1 + \n\n) \n\n(d N log m + log! ) \n\np mo:/(:P+ 20:) \n\n(j \n\n\u2022 \n\n\fProof The proof combines two ideas. First, we use the method of [8] to transform \nthe problem from co(H) to a discrete approximation of it. 
Then, we use recent results for relative uniform deviations of averages from their means [3]. Due to lack of space, we defer the complete proof to the full version of the paper. •\n\n4 Discussion\n\nIn this paper we have presented two main results pertaining to the theory of boosting. First, we have shown that, under reasonable conditions, an effective weak classifier exists for one-dimensional problems. We conjectured, and supported our claim by numerical simulations, that such a result holds for multi-dimensional problems as well. The non-trivial extension of the proof to multiple dimensions can be found in [7]. Second, using recent advances in the theory of uniform convergence and boosting, we have presented bounds on the generalization error which may, under certain conditions, be significantly better than standard bounds, and are particularly useful in the context of model selection.\n\nAcknowledgment We thank Shai Ben-David and Yoav Freund for helpful discussions.\n\nReferences\n\n[1] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.\n\n[2] P. Bartlett and S. Ben-David. On the hardness of learning with neural networks. In Proceedings of the Fourth European Conference on Computational Learning Theory, 1999.\n\n[3] P. Bartlett and G. Lugosi. An inequality for uniform deviations of sample averages from their means. Statistics and Probability Letters, 44:55-62, 1999.\n\n[4] J. Friedman, T. Hastie and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, to appear, 2000.\n\n[5] D.S. Johnson and F.P. Preparata. The densest hemisphere problem. Theoretical Computer Science, 6:93-107, 1978.\n\n[6] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing, 41(12):3397-3415, December 1993.\n\n[7] S. 
Mannor and R. Meir. On the existence of weak learners and applications to boosting. Submitted to Machine Learning.\n\n[8] L. Mason, P. Bartlett and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 2000. To appear.\n\n[9] L. Mason, P. Bartlett, J. Baxter and M. Frean. Functional gradient techniques for combining hypotheses. In A. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 2000.\n\n[10] R.E. Schapire, Y. Freund, P. Bartlett and W.S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.\n\n[11] V.N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982.\n", "award": [], "sourceid": 1906, "authors": [{"given_name": "Shie", "family_name": "Mannor", "institution": null}, {"given_name": "Ron", "family_name": "Meir", "institution": null}]}