A new initialization method based on normed statistical spaces in deep networks

  • *Corresponding author: Tieyong Zeng (zeng@math.cuhk.edu.hk)

Raymond Chan's research is supported by HKRGC Grants No. CUHK 14306316 and CUHK 14301718, CityU Grant 9380101, CRF Grant C1007-15G, and AoE/M-05/12. Tieyong Zeng's research is supported by National Science Foundation of China No. 11671002, CUHK start-up and CUHK DAG 4053342, RGC 14300219, and NSFC/RGC N_CUHK 415/19.

  • Training deep neural networks can be difficult. For classical neural networks, the initialization method of Glorot and Bengio (Xavier initialization), later generalized by He, Zhang, Ren and Sun, can facilitate stable training. However, with the recent development of new layer types, we find that these initialization methods may fail to lead to successful training. Building on these two methods, we propose a new initialization by studying the parameter space of a network. Our principle is to constrain the growth of parameters in different layers in a consistent way. To do so, we introduce a norm on the parameter space and use this norm to measure the growth of parameters. Our new method is suitable for a wide range of layer types, especially for layers with parameter-sharing weight matrices.

    Mathematics Subject Classification: Primary: 68T01, 68T05; Secondary: 68Q32.
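For orientation, the two baseline schemes named in the abstract can be sketched as follows. This is a minimal NumPy illustration of the standard Xavier and He variance rules, not the new method proposed in the paper; the helper names `xavier_std`, `he_std`, and `init_dense` are hypothetical, and the Gaussian variant is assumed (uniform variants with the same variance are also common).

```python
import numpy as np

def xavier_std(fan_in: int, fan_out: int) -> float:
    """Xavier (Glorot) rule: Var(W) = 2 / (fan_in + fan_out)."""
    return float(np.sqrt(2.0 / (fan_in + fan_out)))

def he_std(fan_in: int) -> float:
    """He rule for ReLU layers: Var(W) = 2 / fan_in."""
    return float(np.sqrt(2.0 / fan_in))

def init_dense(fan_in: int, fan_out: int, scheme: str = "he", seed: int = 0) -> np.ndarray:
    """Draw a (fan_in, fan_out) weight matrix from a zero-mean Gaussian."""
    rng = np.random.default_rng(seed)
    std = he_std(fan_in) if scheme == "he" else xavier_std(fan_in, fan_out)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Example: the first fully connected layer of Table 1 (3136 inputs, 64 outputs).
W = init_dense(3136, 64, scheme="he")
```

Both rules depend only on the fan-in and fan-out of a dense weight matrix, which is one way to see why layers with shared or structured parameters may need separate treatment.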


  • Figure 1.  Plots of the losses of the networks summarized in (a) Table 1, (b) Table 2, (c) Table 3, and (d) Table 4. Mean and standard deviation of the final smoothed loss values: Ours (a) $ 0.070\pm 0.005 $, (b) $ 0.111\pm 0.006 $, (c) $ 0.088\pm 0.003 $, (d) $ 0.083\pm 0.004 $; Xavier/He (a) $ 0.069\pm 0.001 $, (b) $ 0.206\pm 0.012 $, (c) $ 0.221\pm 0.016 $, (d) $ 0.164\pm 0.012 $. Evaluation accuracies on the test set, ours versus Xavier/He: (a) $ 98.06\% $, $ 98.13\% $, (b) $ 95.66\% $, $ 93.94\% $, (c) $ 97.17\% $, $ 95.07\% $, (d) $ 98.01\% $, $ 96.22\% $.

    Table 2.  Network structure of Figure 1(b). For the last convolution layer with kernel size $ 55\times 55 $ we use periodic padding on the input images to make sure the conditions on $ T $ in (6) are satisfied.

    | Layer | Output channel | Number of parameters |
    |---|---|---|
    | Conv2d+MaxPool | $ 32 $ | $ 3\times 3 \times 1 \times 32 $ |
    | Conv2d+MaxPool | $ 64 $ | $ 3\times 3 \times 32\times 64 $ |
    | Reshape to $ 56\times 56 $ | | |
    | Conv2d | $ 1 $ | $ 55\times 55 \times 1\times 1 $ |
    | Fc | $ 10 $ | $ 3136\times 10 $ |

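The periodic padding mentioned in the caption of Table 2 can be illustrated with a short NumPy sketch. It assumes a single-channel $ 56\times 56 $ input, a stride of 1, and the cross-correlation convention common in deep learning frameworks; the function name `periodic_conv2d` is hypothetical, and the conditions on $ T $ in (6) are not reproduced here, so the sketch only shows that wrap-around padding keeps the output at $ 56\times 56 $ for a $ 55\times 55 $ kernel.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def periodic_conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Cross-correlate a 2D image with an odd-sized kernel using wrap-around padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # mode="wrap" extends the image periodically in both directions.
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="wrap")
    # windows[i, j] is the kh x kw patch whose top-left corner is padded[i, j].
    windows = sliding_window_view(padded, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

rng = np.random.default_rng(0)
image = rng.standard_normal((56, 56))   # reshaped feature map from Table 2
kernel = rng.standard_normal((55, 55))  # the single 55 x 55 kernel
out = periodic_conv2d(image, kernel)
```

With 27 wrapped pixels on each side, the padded input is $ 110\times 110 $ and the output keeps the input size, so the following Fc layer still sees $ 56\times 56 = 3136 $ features.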
    Table 1.  Network structure of Figure 1(a).

    | Layer | Output channel | Number of parameters |
    |---|---|---|
    | Conv2d+MaxPool | $ 32 $ | $ 3\times 3 \times 1 \times 32 $ |
    | Conv2d+MaxPool | $ 64 $ | $ 3\times 3 \times 32\times 64 $ |
    | Fc | $ 64 $ | $ 3136\times 64 $ |
    | Fc | $ 10 $ | $ 64\times 10 $ |

    Table 3.  Network structure of Figure 1(c). For a compression ratio $ B> 1 $, we use a circulant implementation with block size $ B $. The number of parameters for a CirCNN implementation should be divided by $ B $.

    | Layer | Output channel | Number of parameters | Compression ratio |
    |---|---|---|---|
    | Conv2d+MaxPool | $ 32 $ | $ 3\times 3 \times 1 \times 32 $ | $ 1 $ |
    | Conv2d+MaxPool | $ 64 $ | $ 3\times 3 \times 32\times 64 $ | $ 1 $ |
    | Fc | $ 1568 $ | $ 3136\times 1568 $ | $ 1568 $ |
    | Fc | $ 10 $ | $ 1568\times 10 $ | $ 1 $ |

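To make the parameter sharing in a circulant implementation concrete, the sketch below builds a small block-circulant weight matrix and checks an FFT-based product against the dense one. It is an illustration under stated assumptions, not the CirCNN code used in the experiments: the helper names are hypothetical, each $ B\times B $ block is taken to be defined by a single length-$ B $ vector with entries $ C_{ij} = c_{(i-j) \bmod B} $, and tiny sizes replace the $ 3136\times 1568 $ layer of Table 3.

```python
import numpy as np

def circulant(c: np.ndarray) -> np.ndarray:
    """Dense n x n circulant matrix with C[i, j] = c[(i - j) mod n]."""
    n = len(c)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

def block_circulant_matvec(blocks: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Compute y = W x, where W is a p x q grid of B x B circulant blocks.

    blocks has shape (p, q, B): one defining vector per block.
    x has length q * B; the result has length p * B.
    """
    p, q, B = blocks.shape
    xf = np.fft.fft(x.reshape(q, B), axis=1)   # spectrum of each input block
    bf = np.fft.fft(blocks, axis=2)            # spectrum of each defining vector
    yf = (bf * xf[None, :, :]).sum(axis=1)     # circular convolutions, summed over q
    return np.fft.ifft(yf, axis=1).real.reshape(p * B)

rng = np.random.default_rng(0)
p, q, B = 1, 2, 8                              # small stand-ins for 1, 2, 1568
blocks = rng.standard_normal((p, q, B))
x = rng.standard_normal(q * B)
W = np.block([[circulant(blocks[i, j]) for j in range(q)] for i in range(p)])
y_fast = block_circulant_matvec(blocks, x)
```

Here `W` has `p * B * q * B` entries but only `p * q * B` free parameters, which matches the caption's rule of dividing the parameter count by $ B $.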
    Table 4.  Network structure of Figure 1(d). For a compression ratio $ B> 1 $, we use a circulant implementation with block size $ B $. The number of parameters for a CirCNN implementation should be divided by $ B $.

    | Layer | Output channel | Number of parameters | Compression ratio |
    |---|---|---|---|
    | Conv2d+MaxPool | $ 32 $ | $ 3\times 3 \times 1 \times 32 $ | $ 1 $ |
    | Conv2d+MaxPool | $ 64 $ | $ 3\times 3 \times 32\times 64 $ | $ 1 $ |
    | Conv2d | $ 256 $ | $ 3\times 3 \times 64 \times 256 $ | $ 1 $ |
    | Conv2d | $ 256 $ | $ 3\times 3 \times 256 \times 256 $ | $ 256 $ |
    | Conv2d | $ 256 $ | $ 3\times 3 \times 256 \times 256 $ | $ 256 $ |
    | Fc | $ 64 $ | $ 12544\times 64 $ | $ 1 $ |
    | Fc | $ 10 $ | $ 64\times 10 $ | $ 1 $ |
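
The effect of the compression ratios in Table 4 can be tallied with a few lines of Python. The sketch assumes the listed shapes are the full weight tensors, that biases are excluded, and that the stored count of a compressed layer is the dense count divided by $ B $, as the caption states; the names `layers` and `stored_params` are hypothetical.

```python
from math import prod

# (layer, weight shape, compression ratio B), copied from Table 4.
layers = [
    ("Conv2d+MaxPool", (3, 3, 1, 32), 1),
    ("Conv2d+MaxPool", (3, 3, 32, 64), 1),
    ("Conv2d", (3, 3, 64, 256), 1),
    ("Conv2d", (3, 3, 256, 256), 256),
    ("Conv2d", (3, 3, 256, 256), 256),
    ("Fc", (12544, 64), 1),
    ("Fc", (64, 10), 1),
]

def stored_params(shape: tuple, ratio: int) -> int:
    """Weights actually stored for one layer (biases not counted)."""
    return prod(shape) // ratio

dense_total = sum(prod(shape) for _, shape, _ in layers)
stored_total = sum(stored_params(shape, ratio) for _, shape, ratio in layers)

for name, shape, ratio in layers:
    print(f"{name:15s} dense={prod(shape):7d} stored={stored_params(shape, ratio):7d}")
print("total dense:", dense_total, "total stored:", stored_total)
```

Under these assumptions the network has 2,149,280 weights when dense and 974,240 when the two compressed convolution layers are stored in circulant form.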
