{"title": "Efficient Approaches to Gaussian Process Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 251, "page_last": 257, "abstract": null, "full_text": "Efficient Approaches to Gaussian Process \n\nClassification \n\nLehel Csato, Ernest Fokoue, Manfred Opper, Bernhard Schottky \n\nNeural Computing Research Group \n\nSchool of Engineering and Applied Sciences \nAston University Birmingham B4 7ET, UK. \n\n{opperm,csatol}~aston.ac.uk \n\nTheoretical Physics II, Lund University, Solvegatan 14 A, \n\nOle Winther \n\nS-223 62 Lund, Sweden \nwinther~thep.lu.se \n\nAbstract \n\nWe present three simple approximations for the calculation of \nthe posterior mean in Gaussian Process classification. The first \ntwo methods are related to mean field ideas known in Statistical \nPhysics. The third approach is based on Bayesian online approach \nwhich was motivated by recent results in the Statistical Mechanics \nof Neural Networks. We present simulation results showing: 1. that \nthe mean field Bayesian evidence may be used for hyperparameter \ntuning and 2. that the online approach may achieve a low training \nerror fast. \n\n1 \n\nIntroduction \n\nGaussian processes provide promising non-parametric Bayesian approaches to re(cid:173)\ngression and classification [2, 1]. In these statistical models, it is assumed that the \nlikelihood of an output or target variable y for a given input x E RN can be written \nas P(Yla(x)) where a : RN --+ R are functions which have a Gaussian prior distri(cid:173)\nbution, i.e. a is (a priori) assumed to be a Gaussian random field. This means that \nany finite set of field variables a(xi), i = 1, ... ,l are jointly Gaussian distributed \nwith a given covariance E[a(xi)a(xj)] = K(Xi' Xj) (we will also assume a zero mean \nthroughout the paper). \nPredictions on a(x) for novel inputs x, when a set D of m training examples (Xi, Yi) \ni = 1, . . . 
, m, is given, can be computed from the posterior distribution of the m + 1 variables a(x) and a(x_1), ..., a(x_m). A major technical problem of the Gaussian process models is the difficulty of computing posterior averages as high dimensional integrals when the likelihood is not Gaussian. This happens, for example, in classification problems. So far, a variety of approximation techniques have been discussed: Monte Carlo sampling [2], the MAP approach [4], bounds on the likelihood [3] and a TAP mean field approach [5]. In this paper, we will introduce three different novel methods for approximating the posterior mean of the random field a(x), which we think are simple enough to be used in practical applications. Two of the techniques are based on mean field ideas from Statistical Mechanics, which, in contrast to the previously developed TAP approach, are easier to implement. They also yield simple approximations to the total likelihood of the data (the evidence), which can be used to tune the hyperparameters in the covariance kernel K (the Bayesian evidence (or MLII) framework aims at maximizing the likelihood of the data). \n\nWe specialize to the case of a binary classification problem, where for simplicity the class label y = ±1 is assumed to be noise free and the likelihood is chosen as \n\nP(y|a) = Θ(ya) ,   (1) \n\nwhere Θ(x) is the unit step function, which equals 1 for x > 0 and zero else. We are interested in computing efficient approximations to the posterior mean ⟨a(x)⟩, which we will use for a prediction of the labels via y = sign(⟨a(x)⟩), where ⟨...⟩ denotes the posterior expectation. If the posterior distribution of a(x) is symmetric around its mean, this will give the Bayes optimal prediction. \n\nBefore starting, let us add two comments on the likelihood (1). First, the MAP approach (i.e. 
predicting with the fields a that maximize the posterior) would not be applicable, because it gives the trivial result a(x) = 0. Second, noise can be easily introduced within a probit model [2]; all subsequent calculations will only be slightly altered. Moreover, the Gaussian average involved in the definition of the probit likelihood can always be shifted from the likelihood into the Gaussian process prior by a redefinition of the fields a (which does not change the prediction), leaving us with the simple likelihood (1) and a modified process covariance [5]. \n\n2 Exact Results \n\nAt first glance, it may seem that in order to calculate ⟨a(x)⟩ we have to deal with the joint posterior of the fields a_i = a(x_i), i = 1, ..., m, together with the field at the test point a(x). This would imply that for any test point, a different new m + 1 dimensional average has to be performed. Actually, we will show that this is not the case. As above, let E denote the expectation over the Gaussian prior. The posterior expectation at any point, say x, \n\n⟨a(x)⟩ = E[a(x) ∏_{j=1}^m P(y_j|a_j)] / E[∏_{j=1}^m P(y_j|a_j)]   (2) \n\ncan, by integration by parts, for any likelihood be written as \n\n⟨a(x)⟩ = Σ_j K(x, x_j) α_j y_j   with   α_j = y_j ⟨∂ ln P(y_j|a_j)/∂a_j⟩ ,   (3) \n\nshowing that α_j does not depend on the test point x. It is therefore not necessary to compute an m + 1 dimensional average for every prediction. \nWe have chosen the specific definition (3) in order to stress the similarity to predictions with Support Vector Machines (for the likelihood (1), the α_j will come out nonnegative). In the next sections we will develop three approaches for an approximate computation of the α_j. 
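The identity (3) reduces prediction to a kernel expansion, exactly as for Support Vector Machines: once approximate coefficients α_j are available (by any of the three methods developed below), the label of a test point is the sign of a weighted sum of kernel evaluations. A minimal NumPy sketch; the function names and the simple isotropic squared-exponential kernel are our own illustrative choices, not the paper's:

```python
import numpy as np

def sq_exp_kernel(X1, X2, w=1.0):
    # Isotropic squared-exponential covariance exp(-w/2 * ||x - x'||^2);
    # the paper's experiments use per-dimension weights instead.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * w * d2)

def predict(X_test, X_train, y_train, alpha, kernel=sq_exp_kernel):
    # Eq. (3): <a(x)> = sum_j K(x, x_j) alpha_j y_j; predict via its sign.
    K_star = kernel(X_test, X_train)      # shape (n_test, m)
    mean = K_star @ (alpha * y_train)     # approximate posterior mean
    return np.sign(mean), mean
```

With the step-function likelihood (1) the α_j are nonnegative, so each training point pulls the posterior mean towards its own label with strength α_j.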
\n\n3 Mean Field Method I: Ensemble Learning \n\nOur first goal is to approximate the true posterior distribution \n\np(a|D_m) = (1/Z_m) e^{-(1/2) aᵀK⁻¹a} / √((2π)^m det K) ∏_{j=1}^m P(y_j|a_j)   (4) \n\nof a ≡ (a_1, ..., a_m) by a simpler, tractable distribution q. Here, K denotes the covariance matrix with elements K_ij = K(x_i, x_j). In the variational mean field approach, known as ensemble learning in the Neural Computation community, the relative entropy distance KL(q,p) = ∫ da q(a) ln[q(a)/p(a|D_m)] is minimized in the family of product distributions q(a) = ∏_{j=1}^m q_j(a_j). This is in contrast to [3], where a variational bound on the likelihood is computed. We get \n\nKL(q,p) = Σ_i ∫ da_i q_i(a_i) ln[q_i(a_i)/P(y_i|a_i)] + (1/2) Σ_{i,j: i≠j} [K⁻¹]_ij ⟨a_i⟩₀⟨a_j⟩₀ + (1/2) Σ_i [K⁻¹]_ii ⟨a_i²⟩₀ + const, \n\nwhere ⟨...⟩₀ denotes expectation w.r.t. q. By setting the functional derivative of KL(q,p) with respect to q_i(a) equal to zero, we find that the best product distribution is a Gaussian prior times the original likelihood: \n\nq_i(a) ∝ P(y_i|a) (1/√(2πΛ_i)) e^{-(a - m_i)²/(2Λ_i)} ,   (5) \n\nwhere m_i = -Λ_i Σ_{j≠i} (K⁻¹)_ij ⟨a_j⟩₀ and Λ_i = 1/[K⁻¹]_ii; equivalently, ⟨a_i⟩₀ = Σ_j K_ij y_j α_j, consistent with (3). Using this specific form for the approximated posterior q(a) and replacing the average over the true posterior in (3) by the approximation (5), we get (using the likelihood (1)) a set of m nonlinear equations in the unknowns α_j: \n\nα_j = (1/√Λ_j) D(z_j)/Φ(z_j)   with   z_j = y_j m_j/√Λ_j ,   (6) \n\nwhere D(z) = e^{-z²/2}/√(2π) and Φ(z) = ∫_{-∞}^z dt D(t). As a useful byproduct of the variational approximation, an upper bound on the Bayesian evidence P(D) = ∫ da π(a) P(D|a) can be derived (π denotes the Gaussian process prior and P(D|a) = ∏_{j=1}^m P(y_j|a_j)). 
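As an aside, the coupled equations (6) can be solved by simple damped fixed-point iteration, using the fact that (5) implies ⟨a_i⟩₀ = Σ_j K_ij y_j α_j. The sketch below is our own scheme (initialization, damping factor and iteration count are assumptions, not from the paper); setting naive=True gives the variant of Section 4, in which Λ_i is replaced by K_ii and no matrix inversion is needed:

```python
import math
import numpy as np

# Standard normal density D(z) and cumulative Phi(z), as in the text.
_PHI = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))

def _D(z):
    return np.exp(-0.5 * np.asarray(z) ** 2) / math.sqrt(2.0 * math.pi)

def mean_field_alphas(K, y, naive=False, eta=0.5, n_iter=200):
    """Damped fixed-point iteration for the mean field equations.

    naive=True replaces Lambda_i = 1/[K^-1]_ii by K_ii (Section 4)."""
    lam = np.diag(K).copy() if naive else 1.0 / np.diag(np.linalg.inv(K))
    alpha = np.full(len(y), 0.5)
    for _ in range(n_iter):
        h = K @ (y * alpha)          # <a_i>_0 = sum_j K_ij y_j alpha_j
        m = h - lam * y * alpha      # 'cavity' means m_i
        z = y * m / np.sqrt(lam)
        alpha = (1 - eta) * alpha + eta * _D(z) / (_PHI(z) * np.sqrt(lam))
    return alpha
```

At a fixed point the α_j are nonnegative, as required by (3) for the likelihood (1).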
The bound can be written in terms of the mean field 'free energy' as \n\n-ln P(D) ≤ E_q ln q(a) - E_q ln[π(a) P(D|a)] = -Σ_i ln Φ(y_i m_i/√Λ_i) + (1/2) Σ_{i,j} y_i α_i (K_ij - δ_ij Λ_i) y_j α_j ,   (7) \n\nwhich can be used as a yardstick for selecting appropriate hyperparameters in the covariance kernel. \n\nThe ensemble learning approach has the small drawback that it requires inversion of the covariance matrix K and, for the free energy (7), one must compute a determinant. A second, simpler approximation avoids these computations. \n\n4 Mean Field Theory II: A 'Naive' Approach \n\nThe second mean field theory aims at working directly with the variables α_j. As a starting point, we consider the partition function (evidence) \n\nZ = P(D) = ∫ dz e^{-(1/2) zᵀKz} ∏_{j=1}^m P̂(y_j|z_j) ,   (8) \n\nwhich follows from (4) by a standard Gaussian integration, introducing the Fourier transform of the likelihood P̂(y|z) = ∫ (da/2π) e^{iaz} P(y|a), with i being the imaginary unit. It is tempting to view (8) as a normalizing partition function for a Gaussian process z_i having covariance matrix K⁻¹ and likelihood P̂. Unfortunately, P̂ is not a real number and precludes a proper probabilistic interpretation. Nevertheless, dealing formally with the complex measure defined by (8), integration by parts shows that one has y_j α_j = -i ⟨z_j⟩_*, where the brackets ⟨...⟩_* denote an average over the complex measure. This suggests a simple approximation for calculating the α_j. One may think of trying a saddle-point (or steepest descent) approximation to (8) and replacing ⟨z_j⟩_* by the value of z_j (in the complex z plane) which makes the integrand stationary, thereby neglecting the fluctuations of the z_j. Hence, this approximation would treat expectations of products ⟨z_i z_j⟩_* 
as ⟨z_i⟩_* ⟨z_j⟩_*, which may be reasonable for i ≠ j, but definitely not for the self-correlation i = j. According to the general formalism of mean field theories (outlined e.g. in [6]), one can improve on that idea by treating the 'self-interactions' z_i² separately. This can be done by replacing all z_i (except in the form z_i²) by a new variable μ_i, inserting a Dirac δ function representation δ(z - μ) = ∫ (dm/2π) e^{-im(z-μ)} into (8), integrating over the z and a variables exactly (the integral factorizes), and finally performing a saddle-point integration over the m and μ variables. The details of this calculation will be given elsewhere. Within the saddle-point approximation, we get the system of nonlinear equations \n\nα_j = (1/√K_jj) D(z_j)/Φ(z_j)   with   z_j = y_j m_j/√K_jj ,   m_j = Σ_{i≠j} K_ji y_i α_i ,   (9) \n\nwhich is of the same form as (6) with Λ_j replaced by the simpler K_jj. These equations have also been derived by us in [5] using a Callen identity, but our present derivation also allows for an approximation to the evidence. By plugging the saddle-point values back into the partition function, we get \n\n-ln P(D) ≈ -Σ_i ln Φ(y_i m_i/√K_ii) + (1/2) Σ_{i,j} y_i α_i (K_ij - δ_ij K_ii) y_j α_j ,   (10) \n\nwhich is also simpler to compute than (7) but does not give a bound on the true evidence. \n\n5 A Sequential Approach \n\nBoth previous algorithms do not give an explicit expression for the posterior mean, but require the solution of a set of nonlinear equations. These must be obtained by an iterative procedure. We now present a different approach for an approximate computation of the posterior mean, which is based on a single sequential sweep through the whole dataset, giving an explicit update of the posterior. \n\nThe algorithm is based on a recently proposed Bayesian approach to online learning (see [8] and the articles of Opper and Winther & Solla in [9]). 
Its basic idea, applied to the Gaussian process scenario, is as follows. Suppose that q_t is a Gaussian approximation to the posterior after having seen t examples. This means that we approximate the posterior process by a Gaussian process with mean ⟨a(x)⟩_t and covariance K_t(x,y), starting with ⟨a(x)⟩_0 = 0 and K_0(x,y) = K(x,y). After a new data point y_{t+1} is observed, the posterior is updated according to Bayes rule. The new non-Gaussian posterior q̂_t is projected back into the family of Gaussians by choosing the closest Gaussian q_{t+1}, minimizing the relative entropy KL(q̂_t, q_{t+1}) in order to keep the loss of information small. This projection is equivalent to a matching of the first two moments of q̂_t and q_{t+1}. E.g., for the first moment we get \n\n⟨a(x)⟩_{t+1} = ⟨a(x) P(y_{t+1}|a(x_{t+1}))⟩_t / ⟨P(y_{t+1}|a(x_{t+1}))⟩_t = ⟨a(x)⟩_t + κ_1(t) K_t(x, x_{t+1}) , \n\nwhere the second equality follows again from an integration by parts and κ_1(t) = (y_{t+1}/σ(t)) D(z_t)/Φ(z_t), with z_t = y_{t+1} ⟨a(x_{t+1})⟩_t/σ(t) and σ²(t) = K_t(x_{t+1}, x_{t+1}). This recursion and the corresponding one for K_t can be solved by the ansatz \n\n⟨a(x)⟩_t = Σ_{j=1}^t K(x, x_j) y_j α_j(t) ,   (11) \n\nK_t(x,y) = K(x,y) + Σ_{i,j} K(x, x_i) C_ij(t) K(x_j, y) ,   (12) \n\nwhere the vector α(t) = (α_1, ..., α_t, 0, 0, ...) and the matrix C(t) (which also has only t × t nonzero elements) are updated as \n\nα(t+1) = α(t) + κ_1(t) (C(t) k_{t+1} + e_{t+1}) ⊙ y , \nC(t+1) = C(t) + κ_2(t) (C(t) k_{t+1} + e_{t+1}) (C(t) k_{t+1} + e_{t+1})ᵀ ,   (13) \n\nwhere κ_2(t) = -(1/σ²(t)) [ z_t D(z_t)/Φ(z_t) + (D(z_t)/Φ(z_t))² ], k_{t+1} is the vector with elements K(x_{t+1}, x_j), j = 1, ..., t, e_{t+1} is the (t+1)-st unit vector, and ⊙ denotes the element-wise product between vectors. The sequential algorithm defined by (11)-(13) has the advantage of not requiring any matrix inversions. 
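In code, the sweep processes one example at a time, maintaining the vector α and the matrix C so that no matrix inversion (and no nonlinear equation solving) is ever required. The sketch below is our own transcription of the updates above for the step likelihood (1); the names and the dense storage of C are implementation choices:

```python
import math
import numpy as np

def _ratio(z):
    # D(z)/Phi(z) for the step likelihood; D, Phi as in Section 3.
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * phi)

def online_gp_classify(X, y, kernel):
    """One sequential sweep. Afterwards <a(x)> = sum_j K(x,x_j) y_j alpha_j
    and K_post(x,x') = K(x,x') + sum_ij K(x,x_i) C_ij K(x_j,x')."""
    m = len(y)
    K = kernel(X, X)
    alpha, C = np.zeros(m), np.zeros((m, m))
    for t in range(m):
        k = K[:, t]                      # kernel column of the new point
        mu = k @ (y * alpha)             # current mean <a(x_{t+1})>_t
        s2 = K[t, t] + k @ C @ k         # current variance sigma^2(t)
        z = y[t] * mu / math.sqrt(s2)
        r = _ratio(z)
        kap1 = y[t] * r / math.sqrt(s2)
        kap2 = -(z * r + r * r) / s2
        s = C @ k
        s[t] += 1.0                      # C(t) k_{t+1} + e_{t+1}
        alpha += kap1 * s * y            # element-wise product with y
        C += kap2 * np.outer(s, s)
    return alpha, C
```

On a separable toy problem a single sweep already classifies the training set correctly, mirroring the sonar result reported in Section 6.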
There is also no need to solve a numerical optimization problem at each time step, as in the approach of [11], where a different update of a Gaussian posterior approximation was proposed. Since we do not require a linearization of the likelihood, the method is not equivalent to the extended Kalman filter approach. \nSince it is possible to compute the evidence of the new datapoint P(y_{t+1}) = ⟨P(y_{t+1}|a_{t+1})⟩_t based on the old posterior, we can compute a further approximation to the log evidence for m data via ln P(D_m) = Σ_{t=0}^{m-1} ln ⟨P(y_{t+1}|a_{t+1})⟩_t. \n\n6 Simulations \n\nWe present two sets of simulations for the mean field approaches. In the first, we test the Bayesian evidence framework for tuning the hyperparameters of the covariance function (kernel). In the second, we test the ability of the sequential approach to achieve low training error and a stable test error for fixed hyperparameters. \n\nFor the evidence framework, we give simulation results for both mean field free energies (7) and (10) on a single data set, 'Pima Indian Diabetes' (with 200/332 training/test examples and input dimensionality d = 7) [7]. The results should therefore not be taken as conclusive evidence for the merits of these approaches, but simply as an indication that they may give reasonable results. We use the radial basis function covariance function K(x,x') = exp(-(1/2) Σ_{l=1}^d w_l (x_l - x'_l)²). A diagonal term v is added to the covariance matrix, corresponding to a Gaussian noise added to the fields with variance v [5]. The free energy -ln P(D) is minimized by gradient descent with respect to v and the lengthscale parameters w_1, ..., w_d, and the mean field equations for the α_j are solved by iteration before each update of the hyperparameters (further details will be given elsewhere). Figure 1 shows the evolution of the naive mean field free energy and the test error starting from uniform w's. 
It typically requires on the order of 10 iteration steps of the α_j equations between each hyperparameter update. We also used hybrid approaches, where the free energy was minimized by one mean field algorithm and the resulting hyperparameters were used in the other. As may be seen from Table 1, the naive mean field theory can overestimate the free energy (since the ensemble free energy is an upper bound to the free energy). The overestimation is not nearly as severe at the minimum of the naive mean field free energy. Another interesting observation is that, as long as the same hyperparameters are used, the actual performance (as measured by the test error) is not very sensitive to the algorithm used. This also seems to be the case for the TAP mean field approach and Support Vector Machines [5]. \n\nFigure 1: Hyperparameter optimization for the Pima Indians data set using the naive mean field free energy. Left figure: the free energy as a function of the number of hyperparameter updates. Right figure: the test error count (out of 332) as a function of the number of hyperparameter updates. \n\nTable 1: Pima Indians dataset. Hyperparameters found by free energy minimization. The left column gives the free energy -ln P(D) used in the hyperparameter optimization. Test error counts in the range 63-75 have previously been reported [5]. \n\nFree energy minimized         | Ensemble MF: Error / -ln P(D) | Naive MF: Error / -ln P(D) \nEnsemble Mean Field, eq. (7)  | 72 / 100.6                    | 70 / 183.2 \nNaive Mean Field, eq. (10)    | 62 / 107.0                    | 62 / 110.9 \n\nFor the sequential algorithm, we have studied the sonar [10] and crab [7] datasets. 
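For reference, the two covariance functions used in the experiments, the RBF kernel with per-dimension weights w_l plus diagonal noise v from the evidence experiments, and the polynomial kernel from Figure 2, can be written down directly. A small sketch (function names ours; the gradient-descent hyperparameter loop is omitted):

```python
import numpy as np

def rbf_cov(X1, X2, w, v=0.0):
    # K(x,x') = exp(-1/2 sum_l w_l (x_l - x'_l)^2); the diagonal noise
    # term v is added only for the square training covariance matrix.
    d2 = (w * (X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2)
    if X1 is X2:
        K = K + v * np.eye(len(X1))
    return K

def poly_kernel(X1, X2, k=2):
    # K(x,x') = (1 + x.x'/m)^k, with m the input dimension (Figure 2).
    return (1.0 + X1 @ X2.T / X1.shape[1]) ** k
```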
\nSince we have not computed an approximation to the evidence so far, a simple fixed polynomial kernel was used. Although a probabilistic justification of the algorithm is only valid when a single sweep through the data is used (the independence of the data is assumed), it is tempting to reuse the same data and iterate the procedure as a heuristic. The two plots show that in this way only a small improvement is obtained, and it seems that the method is rather efficient in extracting the information from the data in a single presentation. For the sonar dataset, a single sweep is enough to achieve zero training error. \nAcknowledgements: BS would like to thank the Leverhulme Trust for their support (F/250/K). The work was also supported by EPSRC Grant GR/L52093. \n\nFigure 2: Training and test errors during learning for the sonar (left) and crab dataset (right). The vertical dash-dotted line marks the end of the training set and the starting point of its reuse. The kernel function used is K(x,x') = (1 + x · x'/m)^k with order k = 2 (m is the dimension of the inputs). \n\nReferences \n\n[1] Williams C.K.I. and Rasmussen C.E., Gaussian Processes for Regression, in Neural Information Processing Systems 8, Touretzky D.S., Mozer M.C. and Hasselmo M.E. (eds.), 514-520, MIT Press (1996). \n\n[2] Neal R.M., Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification, Technical Report 9702, Department of Statistics, University of Toronto (1997). 
\n\n[3] Gibbs M.N. and MacKay D.J.C., Variational Gaussian Process Classifiers, Preprint, Cambridge University (1997). \n\n[4] Williams C.K.I. and Barber D., Bayesian Classification with Gaussian Processes, IEEE Trans. Pattern Analysis and Machine Intelligence, 20, 1342-1351 (1998). \n\n[5] Opper M. and Winther O., Gaussian Processes for Classification: Mean Field Algorithms, Submitted to Neural Computation, http://www.thep.lu.se/tf2/staff/winther/ (1999). \n\n[6] Zinn-Justin J., Quantum Field Theory and Critical Phenomena, Clarendon Press, Oxford (1990). \n\n[7] Ripley B.D., Pattern Recognition and Neural Networks, Cambridge University Press (1996). \n\n[8] Opper M., Online versus Offline Learning from Random Examples: General Results, Phys. Rev. Lett. 77, 4671 (1996). \n\n[9] Online Learning in Neural Networks, D. Saad (ed.), Cambridge University Press (1998). \n\n[10] Gorman R.P. and Sejnowski T.J., Analysis of hidden units in a layered network trained to classify sonar targets, Neural Networks 1 (1988). \n\n[11] Jaakkola T. and Haussler D., Probabilistic kernel regression, in Online Proceedings of the 7th Int. Workshop on AI and Statistics (1999), http://uncertainty99.microsoft.com/proceedings.htm. \n", "award": [], "sourceid": 1694, "authors": [{"given_name": "Lehel", "family_name": "Csat\u00f3", "institution": null}, {"given_name": "Ernest", "family_name": "Fokou\u00e9", "institution": null}, {"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Bernhard", "family_name": "Schottky", "institution": null}, {"given_name": "Ole", "family_name": "Winther", "institution": null}]}