High-dimensional binary classification has been studied intensively in the machine learning community over the last few decades. The support vector machine (SVM), one of the most popular classifiers, depends only on a portion of the training samples, called support vectors, which leads to suboptimal performance in the high-dimension, low-sample-size (HDLSS) setting. Large-margin unified machines (LUMs) are a family of margin-based classifiers proposed to solve the so-called "data piling" problem that is inherent to the SVM under HDLSS settings. In this paper we study binary classification algorithms associated with LUM loss functions in the framework of reproducing kernel Hilbert spaces. A quantitative convergence analysis is carried out for these algorithms by means of a novel application of projection operators, which overcomes the main technical difficulty. The rates are derived explicitly under a priori conditions on the approximation ability and capacity of the reproducing kernel Hilbert space.
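For orientation, the LUM loss family introduced by Liu, Zhang and Wu (2011) takes, up to their parameterization, the following form; it is reproduced here only as background and should not be read as the present paper's notation. Writing $u = yf(x)$ for the functional margin, with index parameters $a > 0$ and $c \ge 0$,
\[
V(u) \;=\;
\begin{cases}
1 - u, & u < \dfrac{c}{1+c}, \\[8pt]
\dfrac{1}{1+c}\left(\dfrac{a}{(1+c)\,u - c + a}\right)^{\!a}, & u \ge \dfrac{c}{1+c}.
\end{cases}
\]
Letting $c \to \infty$ recovers the hinge loss of the SVM, while $c = 0$ with $a = 1$ recovers the distance-weighted discrimination (DWD) loss; this is the sense in which LUMs unify hard and soft classification.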