Extended neural delay differential equations

Author contributions: Q. Zhu conceived the idea, Q. Zhu and W. Lin performed the research, J. Zhang and Q. Zhu performed the mathematical arguments and the experiments, J. Zhang, Q. Zhu and W. Lin wrote the paper.
Competing interests: The authors declare no competing interests.
Handling Editor: Huanfei Ma

  • Neural Ordinary Differential Equations (NODEs), a class of continuous deep neural networks, have recently gained wide attention in machine learning. To overcome the inherent limitations of NODEs, researchers have proposed Neural Delay Differential Equations (NDDEs), which integrate a delay into the network to enhance its nonlinear representational ability. In this study, we present an extension of NDDEs, referred to as Extended Neural Delay Differential Equations (ENDDEs). Our framework extends the traditional ones by optimizing not only the standard neural network parameters but also the delay, the termination time, and the initial state of the DDEs as additional trainable parameters. To train ENDDEs efficiently, we use the adjoint sensitivity method to compute the gradients of the loss function, and we analyze the computational complexity of the proposed framework in detail. We validate the effectiveness of ENDDEs through extensive experiments covering both model-based and model-free system identification tasks, and we also evaluate their performance in classification tasks on image datasets. The results confirm that ENDDEs can serve as a powerful tool for advancing continuous deep neural networks, with considerable application potential.

    Mathematics Subject Classification: Primary: 34K35; Secondary: 68T07, 37N30.
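    The idea of treating the delay, the terminal time, and the initial state as trainable parameters can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the scalar vector field `tanh(w0*h(t) + w1*h(t - tau))`, the constant history, the forward-Euler solver, and the names `solve_dde`, `fd_grad`, and `train` are all hypothetical. In particular, gradients are approximated here by central finite differences as a simple stand-in for the adjoint sensitivity method mentioned in the abstract.

```python
import numpy as np

def solve_dde(theta, n_steps=100):
    """Forward-Euler solve of dh/dt = tanh(w0*h(t) + w1*h(t - tau)) on [0, T].

    theta = [w0, w1, tau, T, h0]; the history is the constant h(t) = h0
    for t <= 0 (an illustrative assumption). Returns h(T).
    """
    w0, w1, tau, T, h0 = theta
    dt = T / n_steps
    hs = np.empty(n_steps + 1)
    hs[0] = h0
    for k in range(n_steps):
        s = k - tau / dt              # delayed time t_k - tau, in units of dt
        if s <= 0.0:
            h_del = h0                # still inside the constant history
        else:
            j = min(int(s), k - 1)    # linear interpolation on the stored grid
            frac = s - j
            h_del = (1.0 - frac) * hs[j] + frac * hs[j + 1]
        hs[k + 1] = hs[k] + dt * np.tanh(w0 * hs[k] + w1 * h_del)
    return hs[-1]

def loss(theta, target):
    """Squared error between the terminal state h(T) and a target value."""
    return (solve_dde(theta) - target) ** 2

def fd_grad(theta, target, eps=1e-5):
    """Central finite-difference gradient (stand-in for the adjoint method)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e, target) - loss(theta - e, target)) / (2 * eps)
    return g

def train(theta0, target, lr=0.02, n_iter=100):
    """Plain gradient descent over weights, delay, terminal time, initial state."""
    theta = np.array(theta0, dtype=float)
    history = [loss(theta, target)]
    for _ in range(n_iter):
        theta = theta - lr * fd_grad(theta, target)
        theta[2] = max(theta[2], 0.05)   # keep the delay positive
        theta[3] = max(theta[3], 0.1)    # keep the terminal time positive
        history.append(loss(theta, target))
    return theta, history

if __name__ == "__main__":
    theta, history = train([0.3, -0.8, 0.5, 1.0, 1.0], target=2.0)
    print("initial loss:", history[0], "final loss:", history[-1])
    print("learned [w0, w1, tau, T, h0]:", theta)
```

    The point of the sketch is only that `tau`, `T`, and `h0` sit in the same parameter vector as the weights and receive gradient updates in the same loop. A practical implementation would replace the finite differences with adjoint-based gradients and the Euler scheme with an adaptive solver.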

  • Figure 1.  Schematic diagrams comparing ENDDEs with NDDEs, including the initial function $ \boldsymbol{\phi}(t) $. Both ENDDEs and NDDEs act as feature extractors, with the subsequent neural network layer processing the extracted features using a predefined loss function

    Figure 2.  (a) The data at time $ t = 0 $. (b) The transformed data of ENDDEs (6) at the learned terminal time $ T = 0.10 $, with both $ \tau $ and $ T $ being trainable parameters. (c) The transformed data of NDDEs at the fixed final time $ T = 1.00 $. In both (b) and (c), the transformed data are linearly separable

    Figure 3.  Evolutions of the ENDDEs in the feature space during the training procedure on fitting the function $ \boldsymbol{G}(x) $ for $ d = 2 $

    Figure 4.  The training losses (a) and the number of function evaluations (NFE) (b) of the NODEs, the NDDEs and the ENDDEs on fitting the function $ \boldsymbol{G}(x) $ for $ d = 2 $

    Figure 5.  Decision boundaries and the flows of ENDDE (a-d), NDDE (e-h) and NODE (i-l) on a concentric dataset in different training epochs. The flow of the NODEs is generated by the code provided in [5]

    Figure 6.  System reconstruction of the population dynamics by using the ENDDEs in the model-based case. The identified $ r $ and $ \tau $ are shown in the respective panels. Here, we normalize the learned parameter at different training epochs by dividing it by the true value. Specifically, these parameters are initialized at different deviation levels, i.e., at $ (1 + DL)p $, where the true value $ p\equiv r $ or $ \tau $, and $ DL $ is selected from the set $ \{0.2, 0.4, ..., 1.0\} $

    Figure 7.  System identification of the Mackey-Glass system by using the ENDDEs. The identified $ \beta $, $ n $, $ \gamma $, and $ \tau $ are shown in panels (a), (b), (c), and (d), respectively. Here, we normalize the learned parameter at different training epochs by dividing it by the true value. Specifically, these parameters are initialized at different deviation levels, i.e., at $ (1 + DL)p $, where the true value $ p\equiv \beta, n, \gamma $, or $ \tau $, and $ DL $ is selected from the set $ \{0.2, 0.4, ..., 1.0\} $

    Figure 8.  Inferring the underlying delay of the Mackey-Glass system by using the ENDDEs in a model-free manner. The learned delays are shown in (a) and (b) with 8 and 16 hidden neurons, respectively, and we use 3 hidden layers and the $ \tanh $ activation function. $ DL $ is selected from the set $ \{0.2, 0.4, 0.6, 0.8\} $

    Figure 9.  Reconstruction results of the model-based ENDDEs (a-d) and model-free ENDDEs (e-h) for 2-D time series when considering the delay effect of the original system. From left to right the subplots represent: (a, e) Comparisons between the true (solid lines) and fitted (dashed lines) time series; (b, f) Comparisons between the true and fitted trajectories in the phase space; (c, g) Dynamic evolution processes of the trainable parameters in the neural network; (d, h) Dynamic variation curves of loss during training

    Figure 10.  The training loss (a), the test loss (b), the accuracy (c), the normalized NFE (i.e., $ \langle NFE \rangle = NFE / T $) (d), and the mean value (e) and the standard deviation (f) of $ |\boldsymbol{w}| $ over 4 realizations for NODEs and ENDDEs with different fixed terminal times $ T $ (1 or 4) on CIFAR10

    Table 1.  The test accuracies with their standard deviations over 4 realizations on CIFAR10. In the table, $ i $ (1 or 4) in T$ i $ means the terminal time $ T = i $ while $ j = 0 $ (resp., 1) in Adap$ j $ means that $ T $ is a fixed (resp., learnable) parameter

    Model    T1+Adap0                        T1+Adap1                        T4+Adap0                        T4+Adap1
    NODE     $ 54.12\%\pm 0.35 $             $ 53.95\%\pm 1.00 $             $ 54.52\%\pm 1.46 $             $ 54.12\%\pm 0.33 $
    ENDDE    $ \mathbf{55.57\%\pm 0.15} $    $ \mathbf{55.72\%\pm 0.23} $    $ \mathbf{56.41\%\pm 0.20} $    $ \mathbf{56.15\%\pm 0.18} $
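    The deviation-level initialization described in the captions of Figures 6 to 8 can be sketched on the Mackey-Glass system. The sketch assumes the standard form $ \dot{x}(t) = \beta x(t-\tau)/(1 + x(t-\tau)^n) - \gamma x(t) $ and the classical parameter values $ \beta = 0.2 $, $ n = 10 $, $ \gamma = 0.1 $, $ \tau = 17 $; the values actually used in the paper are not given in this excerpt. The constant history, the forward-Euler scheme, and the names `simulate_mackey_glass` and `deviated_initial_guesses` are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_mackey_glass(beta, n, gamma, tau, x0=1.2, dt=0.1, t_end=100.0):
    """Forward-Euler solve of dx/dt = beta*x(t-tau)/(1 + x(t-tau)**n) - gamma*x(t).

    The history is the constant x(t) = x0 for t <= 0 (an assumption).
    Returns the samples of x on [0, t_end] with spacing dt.
    """
    d = int(round(tau / dt))          # delay expressed in whole steps
    n_steps = int(round(t_end / dt))
    x = np.full(d + n_steps + 1, x0)  # the first d + 1 entries hold the history
    for k in range(d, d + n_steps):
        xd = x[k - d]                 # delayed state x(t - tau)
        x[k + 1] = x[k] + dt * (beta * xd / (1.0 + xd ** n) - gamma * x[k])
    return x[d:]

def deviated_initial_guesses(p_true, deviation_levels):
    """Initial guesses (1 + DL) * p_true, one per deviation level DL."""
    return np.array([(1.0 + dl) * p_true for dl in deviation_levels])

# Assumed "true" parameters (classical chaotic setting, not from the paper).
TRUE = {"beta": 0.2, "n": 10.0, "gamma": 0.1, "tau": 17.0}
DLS = [0.2, 0.4, 0.6, 0.8, 1.0]

if __name__ == "__main__":
    traj = simulate_mackey_glass(**TRUE)
    print("trajectory samples:", traj.shape[0])
    for name, p in TRUE.items():
        guesses = deviated_initial_guesses(p, DLS)
        # Normalizing by the true value gives the curves' starting points 1 + DL.
        print(name, "normalized initial guesses:", guesses / p)
```

    Dividing each guess by the true value yields the starting points $ 1 + DL $ of the normalized curves, and a successful identification would drive every curve toward 1.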
  • [1] B. Chang, M. Chen, E. Haber and E. H. Chi, AntisymmetricRNN: A dynamical system view on recurrent neural networks, International Conference on Learning Representations, (2019).
    [2] B. Chang, L. Meng, E. Haber, F. Tung and D. Begert, Multi-level residual networks from dynamical systems view, International Conference on Learning Representations, (2018).
    [3] R. T. Q. Chen, Y. Rubanova, J. Bettencourt and D. K. Duvenaud, Neural ordinary differential equations, Advances in Neural Information Processing Systems, (2018), 6571-6583.
    [4] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, Proceedings of NAACL-HLT, 1 (2019), 4171-4186.
    [5] E. Dupont, A. Doucet and Y. W. Teh, Augmented neural ODEs, Advances in Neural Information Processing Systems, (2019), 3140-3150.
    [6] W. E, A proposal on machine learning via dynamical systems, Communications in Mathematics and Statistics, 5 (2017), 1-11.  doi: 10.1007/s40304-017-0103-z.
    [7] W. E, J. Han and Q. Li, A mean-field optimal control formulation of deep learning, Research in the Mathematical Sciences, 6 (2019), Paper No. 10, 41 pp. doi: 10.1007/s40687-018-0172-y.
    [8] T. Erneux, Applied Delay Differential Equations, volume 3, Springer Science & Business Media, 2009.
    [9] E. Haber and L. Ruthotto, Stable architectures for deep neural networks, Inverse Problems, 34 (2017), Paper No. 014004, 22 pp. doi: 10.1088/1361-6420/aa9a90.
    [10] K. He, X. Zhang, S. Ren and J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), 770-778.
    [11] G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger, Densely connected convolutional networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), 4700-4708.
    [12] A. Krizhevsky, I. Sutskever and G. E. Hinton, Imagenet classification with deep convolutional neural networks, Communications of the ACM, 60 (2017), 84-90. doi: 10.1145/3065386.
    [13] Q. Li, L. Chen, C. Tai and W. E, Maximum principle based algorithms for deep learning, The Journal of Machine Learning Research, 18 (2017), 5998-6026.
    [14] Q. Li and S. Hao, An optimal control approach to deep learning and applications to discrete-weight neural networks, International Conference on Machine Learning, (2018), 2985-2994.
    [15] Y. Lu, A. Zhong, Q. Li and B. Dong, Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations, International Conference on Machine Learning, (2018), 3276-3285.
    [16] J. Pathak, B. Hunt, M. Girvan, Z. Lu and E. Ott, Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach, Physical Review Letters, 120 (2018), Paper No. 024102, 5 pp. doi: 10.1103/PhysRevLett.120.024102.
    [17] L. S. Pontryagin, V. G. Boltyanskij, R. V. Gamkrelidze and E. F. Mishchenko, The Mathematical Theory of Optimal Processes, Interscience Publishers John Wiley & Sons, Inc., New York-London, 1962.
    [18] L. Ruthotto and E. Haber, Deep neural networks motivated by partial differential equations, Journal of Mathematical Imaging and Vision, 62 (2020), 352-364.  doi: 10.1007/s10851-019-00903-1.
    [19] R. F. Service, 'The game has changed.' AI triumphs at protein folding, Science, 370 (2020), 1144-1145.  doi: 10.1126/science.370.6521.1144.
    [20] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature, 529 (2016), 484-489. doi: 10.1038/nature16961.
    [21] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., Mastering the game of Go without human knowledge, Nature, 550 (2017), 354-359. doi: 10.1038/nature24270.
    [22] Y. Tang, J. Kurths, W. Lin, E. Ott and L. Kocarev, Introduction to focus issue: When machine learning meets complex systems: Networks, chaos, and nonlinear dynamics, Chaos: An Interdisciplinary Journal of Nonlinear Science, 30 (2020), Paper No. 063151, 8 pp. doi: 10.1063/5.0016505.
    [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems, (2017), 6000-6010.
    [24] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang and A. Stolcke, The Microsoft 2017 conversational speech recognition system, IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.
    [25] D. Zhang, T. Zhang, Y. Lu, Z. Zhu and B. Dong, You only propagate once: Accelerating adversarial training via maximal principle, Advances in Neural Information Processing Systems, (2019), 227-238.
    [26] Q. Zhu, Y. Guo and W. Lin, Neural delay differential equations, International Conference on Learning Representations, 2021.
    [27] Q. Zhu, X. Li and W. Lin, Leveraging neural differential equations and adaptive delayed feedback to detect unstable periodic orbits based on irregularly sampled time series, Chaos: An Interdisciplinary Journal of Nonlinear Science, 33 (2023), Paper No. 031101, 9 pp. doi: 10.1063/5.0143839.
    [28] Q. Zhu, H. Ma and W. Lin, Detecting unstable periodic orbits based only on time series: When adaptive delayed feedback control meets reservoir computing, Chaos: An Interdisciplinary Journal of Nonlinear Science, 29 (2019), Paper No. 093125, 11 pp. doi: 10.1063/1.5120867.