September  2020, 16(5): 2253-2266. doi: 10.3934/jimo.2019052

## Improved SVRG for finite sum structure optimization with application to binary classification

1. School of Computer Science and Technology, Anhui University of Technology, Maanshan 243032, China
2. School of Sciences, Hangzhou Dianzi University, Hangzhou 310018, China

* Corresponding author: Wei Xue (E-mail: cswxue@ahut.edu.cn)

Received: May 2018 | Revised: November 2018 | Early access: May 2019 | Published: September 2020

This paper considers a stochastic variance reduced gradient (SVRG) method for minimizing the sum of a finite number of smooth convex functions, a problem that arises widely in machine learning and data mining. Inspired by the excellent performance of the two-point stepsize gradient method in batch learning, we present an improved SVRG algorithm, named the stochastic two-point stepsize gradient (STSG) method. Under mild conditions, the proposed method achieves a linear convergence rate $O(\rho^k)$ for smooth and strongly convex functions, where $\rho \in (0.68, 1)$. Simulation experiments on several benchmark data sets demonstrate the performance of the proposed method.
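Here the linear rate is presumably meant in the usual SVRG sense: $\mathbb{E}[F(\tilde{w}_k) - F(w^\ast)] \le \rho^k \, (F(\tilde{w}_0) - F(w^\ast))$, where $\tilde{w}_k$ is the snapshot after the $k$-th outer loop. Since the full text is not reproduced on this page, the sketch below only illustrates the general idea: an SVRG outer/inner loop in which the fixed step size is replaced by a two-point (Barzilai-Borwein) step size recomputed from successive snapshots. The function names, the inner-loop length, and the update details are assumptions for illustration, not the authors' exact STSG algorithm.

```python
import numpy as np

def svrg_two_point(grad_i, full_grad, w0, n, eta0=0.1, epochs=20, seed=0):
    """Sketch of SVRG with a Barzilai-Borwein (two-point) step size.

    grad_i(w, i): gradient of the i-th component function at w.
    full_grad(w): gradient of the full finite-sum objective at w.
    NOTE: an illustrative sketch, not the paper's exact STSG update.
    """
    rng = np.random.default_rng(seed)
    m = 2 * n                          # inner-loop length, a common SVRG choice
    w_snap, eta = w0.copy(), eta0      # current snapshot and step size
    g_snap = full_grad(w_snap)         # full gradient at the snapshot
    w_prev = g_prev = None
    for _ in range(epochs):
        if w_prev is not None:
            # Two-point step size from successive snapshots:
            # eta = ||s||^2 / (m <s, y>), s = w_k - w_{k-1}, y = g_k - g_{k-1}
            s, y = w_snap - w_prev, g_snap - g_prev
            eta = (s @ s) / (m * (s @ y))
        w = w_snap.copy()
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced stochastic gradient estimate
            v = grad_i(w, i) - grad_i(w_snap, i) + g_snap
            w = w - eta * v
        w_prev, g_prev = w_snap, g_snap
        w_snap, g_snap = w, full_grad(w)
    return w_snap
```

The figure captions below refer to experiments with different initial step sizes $\eta_0$; that role is played by `eta0` in this sketch, after which the two-point rule takes over.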

Citation: Guangmei Shao, Wei Xue, Gaohang Yu, Xiao Zheng. Improved SVRG for finite sum structure optimization with application to binary classification. Journal of Industrial and Management Optimization, 2020, 16 (5) : 2253-2266. doi: 10.3934/jimo.2019052
##### Figures:

- Objective loss and test classification accuracy versus epochs on data set "a9a" (best viewed in color)
- Objective loss and test classification accuracy versus epochs on data set "w8a"
- Evolution of objective loss and test classification accuracy versus epochs for Eq. (14) on data set "a9a"
- Evolution of objective loss and test classification accuracy versus epochs for Eq. (14) on data set "w8a"
- Evolution of objective loss and test classification accuracy versus epochs for Eq. (14) on data set "a1a"
- Evolution of objective loss and test classification accuracy versus epochs for Eq. (14) on data set "w1a"
- Objective loss and test classification accuracy versus epochs on data set "a9a" with different $\eta_0$
- Objective loss and test classification accuracy versus epochs on data set "w8a" with different $\eta_0$
- Comparison of test classification accuracy between SVM and STSG on four test data sets
##### Tables:

Details of data sets:

| Data set | Training samples | Test samples | Dimension |
|----------|-----------------:|-------------:|----------:|
| a1a      | 1605             | 30956        | 123       |
| a9a      | 32561            | 16281        | 123       |
| w1a      | 2477             | 47272        | 300       |
| w8a      | 49749            | 14951        | 300       |
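These appear to be the standard a1a/a9a/w1a/w8a binary classification benchmarks from the LIBSVM collection; the sample counts and dimensions above match the LIBSVM versions. Assuming local copies of the LIBSVM files (file names here are assumptions, with LIBSVM's usual `.t` suffix for the test split), they can be loaded as sparse matrices with scikit-learn:

```python
from sklearn.datasets import load_svmlight_file

# File names assume a local download of the LIBSVM "a9a" files;
# ".t" is LIBSVM's convention for the test split.
X_train, y_train = load_svmlight_file("a9a", n_features=123)
X_test, y_test = load_svmlight_file("a9a.t", n_features=123)
print(X_train.shape, X_test.shape)  # expected: (32561, 123) (16281, 123)
```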