# American Institute of Mathematical Sciences

doi: 10.3934/ipi.2020045

## A new initialization method based on normed statistical spaces in deep networks

1. Department of Mathematics, Yeung Kin Man Academic Building, City University of Hong Kong, Tat Chee Avenue, Kowloon Tong, Hong Kong, China
2. Department of Mathematics, School of Science, Shanghai University, Shanghai 200444, China
3. HISILICON Technologies Co., Ltd., Huawei Base, Bantian, Longgang District, Shenzhen 518129, China
4. Department of Mathematics, The Chinese University of Hong Kong, Shatin, Hong Kong, China

*Corresponding author: Tieyong Zeng (zeng@math.cuhk.edu.hk)

Received November 2019; Revised April 2020; Published August 2020

Fund Project: Raymond Chan's research is supported by HKRGC Grants No. CUHK 14306316 and CUHK 14301718, CityU Grant 9380101, CRF Grant C1007-15G, and AoE/M-05/12. Tieyong Zeng's research is supported by National Science Foundation of China No. 11671002, CUHK start-up and CUHK DAG 4053342, RGC 14300219, and NSFC/RGC N_CUHK 415/19.

Training deep neural networks can be difficult. For classical neural networks, the initialization method of Glorot and Bengio (Xavier initialization), later generalized by He, Zhang, Ren and Sun, can facilitate stable training. However, with the recent development of new layer types, we find that these initialization methods may fail to lead to successful training. Building on these two methods, we propose a new initialization by studying the parameter space of a network. Our principle is to constrain the growth of parameters in different layers in a consistent way. To do so, we introduce a norm on the parameter space and use this norm to measure the growth of parameters. Our new method is suitable for a wide range of layer types, especially for layers with parameter-sharing weight matrices.
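For orientation, the two baseline schemes named in the abstract can be sketched as follows. This is only an illustration of the standard Gaussian variants of Xavier and He initialization (standard deviations $\sqrt{2/(n_{in}+n_{out})}$ and $\sqrt{2/n_{in}}$); it is not the new method of the paper, and the helper names `xavier_std`, `he_std` and `init_weights` are hypothetical.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Standard deviation used by Gaussian Xavier initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """Standard deviation used by Gaussian He initialization."""
    return math.sqrt(2.0 / fan_in)


def init_weights(fan_in, fan_out, std, rng):
    """Draw a fan_in x fan_out weight matrix with i.i.d. N(0, std^2) entries."""
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


# Example sizes taken from a fully connected layer of shape 3136 x 64.
rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
fan_in, fan_out = 3136, 64
w_xavier = init_weights(fan_in, fan_out, xavier_std(fan_in, fan_out), rng)
w_he = init_weights(fan_in, fan_out, he_std(fan_in), rng)
```

For a convolution layer with a $3\times 3$ kernel and 32 input channels, the same formulas would typically be applied with fan-in $3\cdot 3\cdot 32 = 288$.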

Citation: Hongfei Yang, Xiaofeng Ding, Raymond Chan, Hui Hu, Yaxin Peng, Tieyong Zeng. A new initialization method based on normed statistical spaces in deep networks. Inverse Problems & Imaging, doi: 10.3934/ipi.2020045

Figure 1. Losses of the networks summarized in (a) Table 1, (b) Table 2, (c) Table 3 and (d) Table 4. The table below lists the mean and standard deviation of the last smoothed loss values, together with the evaluation accuracy on the test set.

| Panel | Loss (Ours) | Loss (Xavier/He) | Test accuracy (Ours) | Test accuracy (Xavier/He) |
|---|---|---|---|---|
| (a) | $0.070\pm 0.005$ | $0.069\pm 0.001$ | $98.06\%$ | $98.13\%$ |
| (b) | $0.111\pm 0.006$ | $0.206\pm 0.012$ | $95.66\%$ | $93.94\%$ |
| (c) | $0.088\pm 0.003$ | $0.221\pm 0.016$ | $97.17\%$ | $95.07\%$ |
| (d) | $0.083\pm 0.004$ | $0.164\pm 0.012$ | $98.01\%$ | $96.22\%$ |

Table 2. Network structure of Figure 1(b). For the last convolution layer, with kernel size $55\times 55$, we use periodic padding on the input images to make sure the conditions on $T$ in (6) are satisfied.

| Layer | Output channels | Number of parameters |
|---|---|---|
| Conv2d+MaxPool | $32$ | $3\times 3 \times 1 \times 32$ |
| Conv2d+MaxPool | $64$ | $3\times 3 \times 32\times 64$ |
| Reshape to $56\times 56$ | | |
| Conv2d | $1$ | $55\times 55 \times 1\times 1$ |
| Fc | $10$ | $3136\times 10$ |
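
The periodic padding mentioned in the caption above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names `periodic_pad` and `correlate_valid` are hypothetical, the kernel is assumed to be square, and the symmetric split of the padding is an assumption. A small $4\times 4$ image with a $3\times 3$ kernel stands in for the $56\times 56$ image with the $55\times 55$ kernel.

```python
def periodic_pad(image, pad_before, pad_after):
    """Pad a 2D list by wrapping around its borders (periodic boundary)."""
    n_rows, n_cols = len(image), len(image[0])
    rows = [r % n_rows for r in range(-pad_before, n_rows + pad_after)]
    cols = [c % n_cols for c in range(-pad_before, n_cols + pad_after)]
    return [[image[r][c] for c in cols] for r in rows]


def correlate_valid(image, kernel):
    """Cross-correlation of a 2D list with a square kernel, without padding."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


# Toy example: 4x4 image, 3x3 kernel that selects the right-hand neighbour.
image = [[4 * r + c for c in range(4)] for r in range(4)]
kernel = [[0, 0, 0],
          [0, 0, 1],
          [0, 0, 0]]
padded = periodic_pad(image, 1, 1)     # 6x6 after wrapping one pixel per side
out = correlate_valid(padded, kernel)  # back to 4x4; each row is shifted cyclically
```

With the sizes in the table, a $56\times 56$ input padded periodically by 54 pixels in total per dimension gives $56 + 54 - 55 + 1 = 56$ outputs per dimension, which matches the $3136 = 56\times 56$ inputs of the final Fc layer.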

Table 1. Network structure of Figure 1(a).

| Layer | Output channels | Number of parameters |
|---|---|---|
| Conv2d+MaxPool | $32$ | $3\times 3 \times 1 \times 32$ |
| Conv2d+MaxPool | $64$ | $3\times 3 \times 32\times 64$ |
| Fc | $64$ | $3136\times 64$ |
| Fc | $10$ | $64\times 10$ |

Table 3. Network structure of Figure 1(c). For a compression ratio $B> 1$, we use a circulant implementation with block size $B$. The number of parameters for a CirCNN implementation should be divided by $B$.

| Layer | Output channels | Number of parameters | Compression ratio |
|---|---|---|---|
| Conv2d+MaxPool | $32$ | $3\times 3 \times 1 \times 32$ | $1$ |
| Conv2d+MaxPool | $64$ | $3\times 3 \times 32\times 64$ | $1$ |
| Fc | $1568$ | $3136\times 1568$ | $1568$ |
| Fc | $10$ | $1568\times 10$ | $1$ |
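
To make the parameter sharing concrete, here is a minimal sketch of a block-circulant matrix-vector product, in which each $B\times B$ block is determined by a single vector of length $B$. The indexing convention $C_{ij} = c_{(i-j) \bmod B}$, the block layout, and the helper names are assumptions made for illustration; CirCNN-style implementations typically use the FFT rather than the direct sum shown here.

```python
def circulant_matvec(c, x):
    """Multiply the circulant matrix C[i][j] = c[(i - j) % B] by the vector x."""
    B = len(c)
    return [sum(c[(i - j) % B] * x[j] for j in range(B)) for i in range(B)]


def block_circulant_matvec(blocks, x, B):
    """blocks[p][q] is the defining vector of the block mapping
    input block q to output block p; x has len(blocks[0]) * B entries."""
    y = []
    for row in blocks:
        acc = [0.0] * B
        for q, c in enumerate(row):
            part = circulant_matvec(c, x[q * B:(q + 1) * B])
            acc = [a + p for a, p in zip(acc, part)]
        y.extend(acc)
    return y


def param_counts(n_in, n_out, B):
    """Dense parameter count and the count after dividing by block size B."""
    dense = n_in * n_out
    return dense, dense // B


# Toy example: one output block, two input blocks, block size 2.
blocks = [[[1, 2], [3, 4]]]
y = block_circulant_matvec(blocks, [1, 0, 0, 1], 2)

# Sizes of the compressed Fc layer in the table above.
dense, compressed = param_counts(3136, 1568, 1568)
```

For the Fc layer with compression ratio $1568$, the dense count $3136\times 1568 = 4917248$ reduces to $3136$ stored parameters.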

Table 4. Network structure of Figure 1(d). For a compression ratio $B> 1$, we use a circulant implementation with block size $B$. The number of parameters for a CirCNN implementation should be divided by $B$.

| Layer | Output channels | Number of parameters | Compression ratio |
|---|---|---|---|
| Conv2d+MaxPool | $32$ | $3\times 3 \times 1 \times 32$ | $1$ |
| Conv2d+MaxPool | $64$ | $3\times 3 \times 32\times 64$ | $1$ |
| Conv2d | $256$ | $3\times 3 \times 64 \times 256$ | $1$ |
| Conv2d | $256$ | $3\times 3 \times 256 \times 256$ | $256$ |
| Conv2d | $256$ | $3\times 3 \times 256 \times 256$ | $256$ |
| Fc | $64$ | $12544\times 64$ | $1$ |
| Fc | $10$ | $64\times 10$ | $1$ |
