Training deep neural networks can be difficult. For classical neural networks, the initialization method of Glorot and Bengio (Xavier initialization), later generalized by He, Zhang, Ren and Sun, facilitates stable training. However, with the recent development of new layer types, we find that these initialization methods may fail to lead to successful training. Building on these two methods, we propose a new initialization by studying the parameter space of a network. Our principle is to constrain the growth of parameters in different layers in a consistent way. To do so, we introduce a norm on the parameter space and use this norm to measure the growth of parameters. Our new method is suitable for a wide range of layer types, especially layers with parameter-sharing weight matrices.
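For orientation, the two classical schemes mentioned above can be sketched for a single fully connected layer. This is a minimal illustration of the standard Xavier and He variance rules only; it does not implement the new initialization proposed here, and the function names (`xavier_std`, `he_std`, `init_dense`) are hypothetical helpers chosen for this example.

```python
import numpy as np


def xavier_std(fan_in, fan_out):
    """Glorot/Bengio rule: Var(W) = 2 / (fan_in + fan_out)."""
    return np.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """He et al. rule for ReLU layers: Var(W) = 2 / fan_in."""
    return np.sqrt(2.0 / fan_in)


def init_dense(fan_in, fan_out, scheme="he", seed=0):
    """Draw a (fan_in, fan_out) weight matrix from a zero-mean Gaussian."""
    rng = np.random.default_rng(seed)
    if scheme == "he":
        std = he_std(fan_in)
    else:
        std = xavier_std(fan_in, fan_out)
    return rng.normal(0.0, std, size=(fan_in, fan_out))


# Example: weights for a 512 -> 256 dense layer followed by ReLU.
W = init_dense(512, 256, scheme="he")
```

Both rules depend only on the fan-in and fan-out of a dense weight matrix, which is one way to see why layers whose weight matrices share parameters may need separate treatment.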
Figure 1. Plots of the losses of the networks summarized in (a) Table 1, (b) Table 2, (c) Table 3 and (d) Table 4. Mean and std for the last of the smoothed loss values: Ours (a) $ 0.070\pm 0.005 $, (b) $ 0.111\pm 0.006 $, (c) $ 0.088\pm 0.003 $, (d) $ 0.083\pm 0.004 $; Xavier/He (a) $ 0.069\pm 0.001 $, (b) $ 0.206\pm 0.012 $, (c) $ 0.221\pm 0.016 $, (d) $ 0.164\pm 0.012 $. We also evaluated the accuracy on the test set, Ours versus Xavier/He: (a) $ 98.06\% $, $ 98.13\% $, (b) $ 95.66\% $, $ 93.94\% $, (c) $ 97.17\% $, $ 95.07\% $, (d) $ 98.01\% $, $ 96.22\% $.
Table 2. Network structure of Figure 1(b). For the last convolution layer with kernel size
Layer | Output channel | Number of parameters |
Conv2d+Maxpool | ||
Conv2d+Maxpool | ||
Reshape to | ||
Conv2d | ||
Fc |
Table 1. Network structure of Figure 1(a)
Layer | Output channel | Number of parameters |
Conv2d+MaxPool | ||
Conv2d+MaxPool | ||
Fc | ||
Fc |
Table 3. Network structure of Figure 1(c). For a compression ratio
Layer | Out channel | Number of parameters | Compression ratio |
Conv2d+Maxpool | |||
Conv2d+Maxpool | |||
Fc | |||
Fc |
Table 4. Network structure of Figure 1(d). For a compression ratio
Layer | Out channel | Number of parameters | Compression ratio |
Conv2d+Maxpool | |||
Conv2d+Maxpool | |||
Conv2d | |||
Conv2d | |||
Conv2d | |||
Fc | |||
Fc |
[1] | D. M. Bradley, Learning in Modular Systems, PhD thesis, Carnegie Mellon University, 2010. |
[2] | C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan et al., Circnn: Accelerating and compressing deep neural networks using block-circulant weight matrices, in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, ACM, (2017), 395–408. doi: 10.1145/3123939.3124552. |
[3] | X. Ding, H. Yang, R. Chan, H. Hu, Y. Peng and T. Zeng, A new initialization method for neural networks with weight sharing, Submitted for Publication. |
[4] | C. Dong, C. C. Loy, K. He and X. Tang, Image super-resolution using deep convolutional networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 38 (2015), 295–307. doi: 10.1109/TPAMI.2015.2439281. |
[5] | X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, (2010), 249–256. |
[6] | I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org. |
[7] | K. He, X. Zhang, S. Ren and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in The IEEE International Conference on Computer Vision (ICCV), 2015. doi: 10.1109/ICCV.2015.123. |
[8] | K. He, X. Zhang, S. Ren and J. Sun, Deep residual learning for image recognition, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. doi: 10.1109/CVPR.2016.90. |
[9] | A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam, Mobilenets: Efficient convolutional neural networks for mobile vision applications, arXiv: 1704.04861. |
[10] | J. Hu, L. Shen and G. Sun, Squeeze-and-excitation networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), 7132–7141. doi: 10.1109/CVPR.2018.00745. |
[11] | A. Krizhevsky and G. Hinton, Learning Multiple Layers of Features from Tiny Images, Technical report, Citeseer, 2009. |
[12] | Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), 2278–2324. doi: 10.1109/5.726791. |
[13] | D. Mishkin and J. Matas, All You Need Is A Good Init, International Conference on Learning Representations, 2016. |
[14] | O. Ronneberger, P. Fischer and T. Brox, U-net: Convolutional networks for biomedical image segmentation, in International Conference on Medical Image Computing and Computer-assisted Intervention, (2015), 234–241. doi: 10.1007/978-3-319-24574-4_28. |
[15] | W. Rudin, Real and Complex Analysis, 3rd edition, McGraw-Hill Book Co., New York, 1987. |
[16] | W. Rudin, Functional Analysis, 2nd edition, International Series in Pure and Applied Mathematics, McGraw-Hill, Inc., New York, 1991. |
[17] | M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L.-C. Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), 4510–4520. |
[18] | A. Saxe, J. L. McClelland and S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, arXiv: 1312.6120. |
[19] | K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv: 1409.1556. |
[20] | C. Szegedy, S. Ioffe, V. Vanhoucke and A. A. Alemi, Inception-v4, Inception-Resnet and the Impact of Residual Connections on Learning, Thirty-First AAAI Conference on Artificial Intelligence, 2017. |
[21] | M. Taki, Deep residual networks and weight initialization, arXiv: 1709.02956. |
[22] | L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. Schoenholz and J. Pennington, Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks, in International Conference on Machine Learning, (2018), 5389–5398. |
[23] | F. Yu and V. Koltun, Multi-scale context aggregation by dilated convolutions, arXiv: 1511.07122. |
[24] | K. Zhang, W. Zuo, S. Gu and L. Zhang, Learning deep cnn denoiser prior for image restoration, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), 3929–3938. |
[25] | T. Zhang, G.-J. Qi, B. Xiao and J. Wang, Interleaved group convolutions, in Proceedings of the IEEE International Conference on Computer Vision, (2017), 4373–4382. doi: 10.1109/ICCV.2017.469. |