February  2021, 15(1): 147-158. doi: 10.3934/ipi.2020045

A new initialization method based on normed statistical spaces in deep networks

1. Department of Mathematics, Yeung Kin Man Academic Building, City University of Hong Kong, Tat Chee Avenue, Kowloon Tong, Hong Kong, China

2. Department of Mathematics, School of Science, Shanghai University, Shanghai 200444, China

3. HISILICON Technologies Co., Ltd., Huawei Base, Bantian, Longgang District, Shenzhen 518129, China

4. Department of Mathematics, The Chinese University of Hong Kong, Shatin, Hong Kong, China

*Corresponding author: Tieyong Zeng (zeng@math.cuhk.edu.hk)

Received: November 2019; Revised: April 2020; Early access: August 2020; Published: February 2021

Fund Project: Raymond Chan's research is supported by HKRGC Grants No. CUHK 14306316 and CUHK 14301718, CityU Grant 9380101, CRF Grant C1007-15G, AoE/M-05/12. Tieyong Zeng's research is supported by National Science Foundation of China No. 11671002, CUHK start-up and CUHK DAG 4053342, RGC 14300219, and NSFC/RGC N_CUHK 415/19

Training deep neural networks can be difficult. For classical neural networks, the initialization method of Glorot and Bengio (commonly called Xavier initialization), later generalized by He, Zhang, Ren and Sun, can facilitate stable training. However, with the recent development of new layer types, we find that these initialization methods may fail to lead to successful training. Building on these two methods, we propose a new initialization by studying the parameter space of a network. Our principle is to constrain the growth of parameters in different layers in a consistent way. To do so, we introduce a norm on the parameter space and use this norm to measure the growth of parameters. Our new method is suitable for a wide range of layer types, especially layers with parameter-sharing weight matrices.
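For orientation, the two baseline schemes named in the abstract can be sketched as follows. This is a minimal illustration of the standard Xavier and He variance rules, not the new method proposed in the paper; the function names, the Gaussian sampling, and the choice of example layer are assumptions made here.

```python
import math
import random


def xavier_std(fan_in: int, fan_out: int) -> float:
    """Std of Glorot/Xavier initialization: Var(W) = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in: int) -> float:
    """Std of He initialization for ReLU layers: Var(W) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def init_dense(fan_in: int, fan_out: int, std: float, seed: int = 0):
    """Sample a fan_out x fan_in weight matrix with i.i.d. N(0, std^2) entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


# Example (assumed for illustration): the 3136 -> 64 fully connected layer of Table 1.
w = init_dense(3136, 64, he_std(3136))
```

Both rules depend only on the fan-in and fan-out of a dense weight matrix, which is why layers whose weight matrices share parameters may need separate treatment.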

Citation: Hongfei Yang, Xiaofeng Ding, Raymond Chan, Hui Hu, Yaxin Peng, Tieyong Zeng. A new initialization method based on normed statistical spaces in deep networks. Inverse Problems & Imaging, 2021, 15 (1) : 147-158. doi: 10.3934/ipi.2020045
References:

[1] D. M. Bradley, Learning in Modular Systems, PhD thesis, Carnegie Mellon University, 2010.

[2] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan et al., Circnn: Accelerating and compressing deep neural networks using block-circulant weight matrices, in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, ACM, (2017), 395–408. doi: 10.1145/3123939.3124552.

[3] X. Ding, H. Yang, R. Chan, H. Hu, Y. Peng and T. Zeng, A new initialization method for neural networks with weight sharing, Submitted for Publication.

[4] C. Dong, C. C. Loy, K. He and X. Tang, Image super-resolution using deep convolutional networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 38 (2015), 295–307. doi: 10.1109/TPAMI.2015.2439281.

[5] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, (2010), 249–256.

[6] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.

[7] K. He, X. Zhang, S. Ren and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in The IEEE International Conference on Computer Vision (ICCV), 2015. doi: 10.1109/ICCV.2015.123.

[8] K. He, X. Zhang, S. Ren and J. Sun, Deep residual learning for image recognition, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. doi: 10.1109/CVPR.2016.90.

[9] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam, Mobilenets: Efficient convolutional neural networks for mobile vision applications, arXiv: 1704.04861.

[10] J. Hu, L. Shen and G. Sun, Squeeze-and-excitation networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), 7132–7141. doi: 10.1109/CVPR.2018.00745.

[11] A. Krizhevsky and G. Hinton, Learning Multiple Layers of Features from Tiny Images, Technical report, Citeseer, 2009.

[12] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), 2278–2324. doi: 10.1109/5.726791.

[13] D. Mishkin and J. Matas, All You Need Is A Good Init, International Conference on Learning Representations, 2016.

[14] O. Ronneberger, P. Fischer and T. Brox, U-net: Convolutional networks for biomedical image segmentation, in International Conference on Medical Image Computing and Computer-assisted Intervention, (2015), 234–241. doi: 10.1007/978-3-319-24574-4_28.

[15] W. Rudin, Real and Complex Analysis, 3rd edition, McGraw-Hill Book Co., New York, 1987.

[16] W. Rudin, Functional Analysis, 2nd edition, International Series in Pure and Applied Mathematics, McGraw-Hill, Inc., New York, 1991.

[17] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L.-C. Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), 4510–4520.

[18] A. Saxe, J. L. McClelland and S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, arXiv: 1312.6120.

[19] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv: 1409.1556.

[20] C. Szegedy, S. Ioffe, V. Vanhoucke and A. A. Alemi, Inception-v4, Inception-Resnet and the Impact of Residual Connections on Learning, Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[21] M. Taki, Deep residual networks and weight initialization, arXiv: 1709.02956.

[22] L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. Schoenholz and J. Pennington, Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks, in International Conference on Machine Learning, (2018), 5389–5398.

[23] F. Yu and V. Koltun, Multi-scale context aggregation by dilated convolutions, arXiv: 1511.07122.

[24] K. Zhang, W. Zuo, S. Gu and L. Zhang, Learning deep cnn denoiser prior for image restoration, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), 3929–3938.

[25] T. Zhang, G.-J. Qi, B. Xiao and J. Wang, Interleaved group convolutions, in Proceedings of the IEEE International Conference on Computer Vision, (2017), 4373–4382. doi: 10.1109/ICCV.2017.469.
Figure 1.  (a) Plot of losses of network summarized in Table 1. (b) Plot of losses of network summarized in Table 2. (c) Plot of losses of network summarized in Table 3. (d) Plot of losses of network summarized in Table 4. Mean and std for the last of the smoothed loss values: Ours (a) $ 0.070\pm 0.005 $, (b) $ 0.111\pm 0.006 $, (c) $ 0.088\pm 0.003 $, (d) $ 0.083\pm 0.004 $; Xavier/He (a) $ 0.069\pm 0.001 $, (b) $ 0.206\pm 0.012 $, (c) $ 0.221\pm 0.016 $, (d) $ 0.164\pm 0.012 $. We also tested the evaluation accuracies on the test set with results: Ours versus Xavier/He (a) $ 98.06\% $, $ 98.13\% $, (b) $ 95.66\% $, $ 93.94\% $, (c) $ 97.17\% $, $ 95.07\% $, (d) $ 98.01\% $, $ 96.22\% $

Table 2.  Network structure of Figure 1(b). For the last convolution layer with kernel size $ 55\times 55 $, we use periodic padding on the input images to make sure the conditions on $ T $ in (6) are satisfied.

| Layer | Output channels | Number of parameters |
| --- | --- | --- |
| Conv2d+MaxPool | $ 32 $ | $ 3\times 3 \times 1 \times 32 $ |
| Conv2d+MaxPool | $ 64 $ | $ 3\times 3 \times 32\times 64 $ |
| Reshape to $ 56\times 56 $ | | |
| Conv2d | $ 1 $ | $ 55\times 55 \times 1\times 1 $ |
| Fc | $ 10 $ | $ 3136\times 10 $ |

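The periodic padding mentioned for the $ 55\times 55 $ layer of Table 2 can be illustrated with a small sketch. The helper names and the symmetric split of 27 pixels per side are assumptions made here; the source only states that periodic padding is used so that the conditions on $ T $ in (6) hold.

```python
def periodic_pad(img, pad):
    """Wrap-around (periodic) padding of a 2D list by `pad` on every side."""
    h, w = len(img), len(img[0])
    return [[img[i % h][j % w] for j in range(-pad, w + pad)]
            for i in range(-pad, h + pad)]


def valid_output_size(n, k):
    """Output length of a stride-1 convolution with no additional padding."""
    return n - k + 1


# Tiny check of the wrap-around behaviour on a 2x2 image.
small = periodic_pad([[1, 2], [3, 4]], 1)

# Assumed split: 27 pixels per side, so a 55x55 kernel keeps a 56x56 output.
padded_side = 56 + 2 * 27
out_side = valid_output_size(padded_side, 55)
```

Under this assumption the output stays $ 56\times 56 $, that is 3136 values, which is consistent with the $ 3136\times 10 $ fully connected layer that follows.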
Table 1.  Network structure of Figure 1(a)

| Layer | Output channels | Number of parameters |
| --- | --- | --- |
| Conv2d+MaxPool | $ 32 $ | $ 3\times 3 \times 1 \times 32 $ |
| Conv2d+MaxPool | $ 64 $ | $ 3\times 3 \times 32\times 64 $ |
| Fc | $ 64 $ | $ 3136\times 64 $ |
| Fc | $ 10 $ | $ 64\times 10 $ |

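The input width 3136 of the first fully connected layer can be traced with a short calculation. This is a sketch under assumptions not stated on this page: a $ 28\times 28 $ single-channel input, convolutions that preserve the spatial size, and $ 2\times 2 $ max pooling. The function name is hypothetical.

```python
def flattened_features(side, n_pools, channels, pool=2):
    """Features left after `n_pools` pooling steps that each divide the side by `pool`.

    Assumes the convolutions themselves preserve the spatial size.
    """
    for _ in range(n_pools):
        side //= pool
    return side * side * channels


fc_in_table1 = flattened_features(28, 2, 64)    # 7 * 7 * 64
fc_in_table4 = flattened_features(28, 2, 256)   # 7 * 7 * 256
```

Under the same assumptions, the second call reproduces the 12544 inputs of the first fully connected layer in Table 4.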
Table 3.  Network structure of Figure 1(c). For a compression ratio $ B> 1 $, we use circulant implementation with block size $ B $. The number of parameters for a CirCNN implementation should be divided by $ B $.

| Layer | Output channels | Number of parameters | Compression ratio |
| --- | --- | --- | --- |
| Conv2d+MaxPool | $ 32 $ | $ 3\times 3 \times 1 \times 32 $ | $ 1 $ |
| Conv2d+MaxPool | $ 64 $ | $ 3\times 3 \times 32\times 64 $ | $ 1 $ |
| Fc | $ 1568 $ | $ 3136\times 1568 $ | $ 1568 $ |
| Fc | $ 10 $ | $ 1568\times 10 $ | $ 1 $ |

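The rule in the caption of Table 3 (divide the dense parameter count by $ B $) can be worked through numerically. The sketch below counts weights only and ignores biases, which is an assumption made here; the layer names are hypothetical labels for the rows of the table.

```python
from math import prod

# (layer, weight shape, compression ratio B) as listed in Table 3.
TABLE3 = [
    ("conv1", (3, 3, 1, 32), 1),
    ("conv2", (3, 3, 32, 64), 1),
    ("fc1", (3136, 1568), 1568),
    ("fc2", (1568, 10), 1),
]


def stored_params(shape, ratio):
    """Weights actually stored: dense count divided by the compression ratio."""
    return prod(shape) // ratio


counts = {name: stored_params(shape, b) for name, shape, b in TABLE3}
total = sum(counts.values())
```

With $ B = 1568 $, the $ 3136\times 1568 $ layer stores only 3136 weights, fewer than the much smaller $ 1568\times 10 $ output layer.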
Table 4.  Network structure of Figure 1(d). For a compression ratio $ B> 1 $, we use circulant implementation with block size $ B $. The number of parameters for a CirCNN implementation should be divided by $ B $.

| Layer | Output channels | Number of parameters | Compression ratio |
| --- | --- | --- | --- |
| Conv2d+MaxPool | $ 32 $ | $ 3\times 3 \times 1 \times 32 $ | $ 1 $ |
| Conv2d+MaxPool | $ 64 $ | $ 3\times 3 \times 32\times 64 $ | $ 1 $ |
| Conv2d | $ 256 $ | $ 3\times 3 \times 64 \times 256 $ | $ 1 $ |
| Conv2d | $ 256 $ | $ 3\times 3 \times 256 \times 256 $ | $ 256 $ |
| Conv2d | $ 256 $ | $ 3\times 3 \times 256 \times 256 $ | $ 256 $ |
| Fc | $ 64 $ | $ 12544\times 64 $ | $ 1 $ |
| Fc | $ 10 $ | $ 64\times 10 $ | $ 1 $ |

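To see why a circulant block of size $ B $ needs only $ B $ stored numbers, here is a minimal sketch of a single circulant block acting on a vector. The indexing convention (the block is defined by its first column) and the function name are assumptions made here and may differ from the CirCNN implementation used in the paper.

```python
def circulant_matvec(c, x):
    """Multiply the circulant matrix with first column `c` by the vector `x`.

    Entry (i, j) of the matrix is c[(i - j) mod B], so only B numbers are stored.
    """
    b = len(c)
    if len(x) != b:
        raise ValueError("c and x must have the same length")
    return [sum(c[(i - j) % b] * x[j] for j in range(b)) for i in range(b)]


# Multiplying by unit vectors recovers the columns of the implied B x B matrix.
first_col = circulant_matvec([1, 2, 3], [1, 0, 0])
second_col = circulant_matvec([1, 2, 3], [0, 1, 0])
```

Every column is a cyclic shift of the same $ B $ values, so each stored weight appears $ B $ times in the full matrix. This is the kind of parameter sharing the abstract refers to.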