# American Institute of Mathematical Sciences

eISSN: 2639-8001


## Foundations of Data Science

June 2021 , Volume 3 , Issue 2


2021, 3(2): 99-131. doi: 10.3934/fods.2021009
Abstract:

The oscillations observed in many time series, particularly in biomedicine, exhibit morphological variations over time. These morphological variations are caused by intrinsic or extrinsic changes to the state of the generating system, henceforth referred to as dynamics. To model these time series (including and specifically pathophysiological ones) and estimate the underlying dynamics, we provide a novel wave-shape oscillatory model. In this model, time-dependent variations in cycle shape occur along a manifold called the wave-shape manifold. To estimate the wave-shape manifold associated with an oscillatory time series, study the dynamics, and visualize the time-dependent changes along the wave-shape manifold, we propose a novel algorithm coined Dynamic Diffusion map (DDmap), obtained by applying the well-established diffusion maps (DM) algorithm to the set of all observed oscillations. We provide a theoretical guarantee on the dynamical information recovered by the DDmap algorithm under the proposed model. Applying the proposed model and algorithm to arterial blood pressure (ABP) signals recorded during general anesthesia leads to the extraction of nociception information. Applying the wave-shape oscillatory model and the DDmap algorithm to cardiac cycles in the electrocardiogram (ECG) leads to ectopy detection and a new ECG-derived respiratory signal, even when the subject has atrial fibrillation.
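The embedding step DDmap builds on is the standard diffusion maps construction, applied here to the set of observed oscillations. A minimal numpy sketch, assuming cycles have already been segmented and resampled to a common length; the bandwidth `eps` and the α = 1 density normalization are common illustrative defaults, not necessarily the paper's exact choices:

```python
import numpy as np

def diffusion_map(cycles, eps, n_coords=2):
    """Standard diffusion maps embedding of a set of oscillation cycles.

    cycles: (n, p) array, each row one segmented cycle resampled to length p.
    eps: Gaussian kernel bandwidth (a hyperparameter to be tuned).
    Returns the first n_coords non-trivial diffusion coordinates.
    """
    # Pairwise squared Euclidean distances between cycles
    diff = cycles[:, None, :] - cycles[None, :, :]
    d2 = (diff ** 2).sum(-1)
    K = np.exp(-d2 / eps)                # Gaussian affinity matrix
    q = K.sum(axis=1)
    K = K / np.outer(q, q)               # alpha = 1 density normalization
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))      # symmetric conjugate of the Markov matrix
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]       # sort eigenpairs descending
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]     # recover right eigenvectors of the Markov matrix
    # Drop the trivial constant eigenvector and scale by the eigenvalues
    return psi[:, 1:n_coords + 1] * vals[1:n_coords + 1]
```

The returned coordinates parametrize the low-dimensional structure of the cycle set; under the wave-shape model this is the quantity used to track the dynamics.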

2021, 3(2): 133-149. doi: 10.3934/fods.2021010
Abstract:

In 2019, Anderson et al. proposed the concept of rankability, which refers to a dataset's inherent ability to be meaningfully ranked. In this article, we give an expository review of the linear ordering problem (LOP) and then use it to analyze the rankability of data. Specifically, the degree of linearity is used to quantify what percentage of the data aligns with an optimal ranking. In a sports context, this is analogous to the number of games that a ranking can correctly predict in hindsight. In fact, under the appropriate objective function, we show that the optimal rankings computed via the LOP maximize the hindsight accuracy of a ranking. Moreover, we develop a binary program to compute the maximal Kendall tau ranking distance between two optimal rankings, which can be used to measure the diversity among optimal rankings without having to enumerate all optima. Finally, we provide several examples from the world of sports and college rankings to illustrate these concepts and demonstrate our results.
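The Kendall tau distance used above counts the item pairs on which two rankings disagree. A minimal sketch of that count (a brute-force O(n²) pair enumeration, not the binary program the authors develop for finding the maximal distance):

```python
from itertools import combinations

def kendall_tau_distance(r1, r2):
    """Count the item pairs that two rankings order differently.

    r1, r2: the two rankings, each an ordered list of the same items.
    """
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    # A pair is discordant when the two rankings disagree on its order
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )
```

Reversing a ranking of n items gives the maximal possible distance, n(n-1)/2, so the distance between two optimal rankings relative to this bound gives a natural diversity measure.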

2021, 3(2): 151-200. doi: 10.3934/fods.2021013
Abstract:

We consider shallow (single hidden layer) neural networks and characterize their performance when trained with stochastic gradient descent as the number of hidden units $N$ and gradient descent steps grow to infinity. In particular, we investigate the effect of different scaling schemes, which lead to different normalizations of the neural network, on the network's statistical output, closing the gap between the $1/\sqrt{N}$ and the mean-field $1/N$ normalization. We develop an asymptotic expansion for the neural network's statistical output pointwise with respect to the scaling parameter as the number of hidden units grows to infinity. Based on this expansion, we demonstrate mathematically that to leading order in $N$, there is no bias-variance trade-off, in that both bias and variance (both explicitly characterized) decrease as the number of hidden units increases and time grows. In addition, we show that to leading order in $N$, the variance of the neural network's statistical output decays as the implied normalization by the scaling parameter approaches the mean-field normalization. Numerical studies on the MNIST and CIFAR10 datasets show that test and train accuracy monotonically improve as the neural network's normalization gets closer to the mean-field normalization.
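The family of normalizations interpolating between the two regimes above can be written as a prefactor $N^{-\gamma}$, with $\gamma = 1/2$ giving the $1/\sqrt{N}$ scaling and $\gamma = 1$ the mean-field scaling. A minimal sketch of such a normalized shallow network (the tanh activation is an illustrative assumption, not necessarily the paper's choice):

```python
import numpy as np

def shallow_net(x, W, c, gamma):
    """Shallow network output under an N^{-gamma} normalization.

    gamma = 0.5 recovers the 1/sqrt(N) scaling and gamma = 1.0 the
    mean-field 1/N scaling; intermediate values interpolate between them.
    W: (N, d) hidden-layer weights, c: (N,) output weights, x: (d,) input.
    The tanh activation is an illustrative choice.
    """
    N = W.shape[0]
    hidden = np.tanh(W @ x)            # single hidden layer
    return N ** (-gamma) * (c @ hidden)
```

For fixed weights the two normalizations differ only by the factor $N^{1 - \gamma}$, but under training they lead to the different statistical behaviors the paper characterizes.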

2021, 3(2): 201-224. doi: 10.3934/fods.2021014
Abstract:

Manifold Markov chain Monte Carlo algorithms have been introduced to sample more effectively from challenging target densities exhibiting multiple modes or strong correlations. Such algorithms exploit the local geometry of the parameter space, thus enabling chains to achieve a faster convergence rate when measured in number of steps. However, acquiring local geometric information can often increase computational complexity per step to the extent that sampling from high-dimensional targets becomes inefficient in terms of total computational time. This paper analyzes the computational complexity of manifold Langevin Monte Carlo and proposes a geometric adaptive Monte Carlo sampler that balances the benefits of exploiting local geometry against its cost, so as to achieve a high effective sample size for a given computational budget. The suggested sampler is a discrete-time stochastic process in a random environment. The random environment allows the sampler to switch between local geometric and adaptive proposal kernels according to a schedule. An exponential schedule is put forward that enables more frequent use of geometric information in early transient phases of the chain, while saving computational time in late stationary phases. The average complexity can be set manually depending on the degree of geometric exploitation required by the underlying model.
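The exponential schedule can be sketched as a decaying probability of invoking the expensive geometric kernel at each step; the decay rate below is a hypothetical value for illustration, not the paper's setting:

```python
import numpy as np

def geometric_kernel_probability(step, rate=0.01):
    """Exponentially decaying probability of using the geometric kernel.

    Early (transient) steps favor the expensive geometry-exploiting
    proposal; late (stationary) steps favor the cheaper adaptive proposal.
    The decay rate here is a hypothetical value for illustration only.
    """
    return float(np.exp(-rate * step))

def choose_kernel(step, rng, rate=0.01):
    """Pick the proposal kernel for this step of the random environment."""
    if rng.random() < geometric_kernel_probability(step, rate):
        return "geometric"
    return "adaptive"
```

Integrating the schedule over the chain length gives the expected number of geometric steps, which is how an average per-step complexity can be set in advance.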

2021, 3(2): 225-249. doi: 10.3934/fods.2021015
Abstract:

We study the clustering problem on graphs: it is known that if there are two underlying clusters, then the signs of the eigenvector corresponding to the second largest eigenvalue of the adjacency matrix can reliably reconstruct the two clusters. We argue that the vertices for which the eigenvector has the largest and the smallest entries, respectively, are unusually strongly connected to their own cluster and more reliably classified than the rest. This can be regarded as a discrete version of the Hot Spots conjecture and should be a useful heuristic for evaluating 'strongly clustered' versus 'liminal' nodes in applications. We give a rigorous proof for the stochastic block model and discuss several explicit examples.
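The sign-based reconstruction described above is short to state in code. A minimal sketch (the two-triangle test graph in the assertions is our own toy example, not one from the paper):

```python
import numpy as np

def spectral_two_clusters(A):
    """Split a graph into two clusters via the second adjacency eigenvector.

    A: symmetric adjacency matrix. Returns 0/1 labels given by the signs
    of the eigenvector of the second largest eigenvalue, together with
    that eigenvector: its largest and smallest entries flag the most
    reliably classified ('strongly clustered') vertices.
    """
    vals, vecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    v2 = vecs[:, -2]                 # eigenvector of the second largest eigenvalue
    labels = (v2 >= 0).astype(int)
    return labels, v2
```

On two triangles joined by a single bridge edge, the signs recover the two triangles, and the bridge endpoints carry the smallest-magnitude entries, matching the heuristic that extreme entries mark strongly clustered vertices and small ones mark liminal vertices.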

2021, 3(2): 251-303. doi: 10.3934/fods.2021016
Abstract:

Fine-scale simulation of complex systems governed by multiscale partial differential equations (PDEs) is computationally expensive, and various multiscale methods have been developed to address such problems. In addition, it is challenging to develop accurate surrogate and uncertainty quantification models for high-dimensional problems governed by stochastic multiscale PDEs using limited training data. In this work, to address these challenges, we introduce a novel hybrid deep-learning and multiscale approach for stochastic multiscale PDEs with limited training data. For demonstration purposes, we focus on a porous media flow problem. We use an image-to-image supervised deep learning model to learn the mapping between the input permeability field and the multiscale basis functions. We introduce a Bayesian approach to this hybrid framework to allow us to perform uncertainty quantification and propagation tasks. The performance of this hybrid approach is evaluated with varying intrinsic dimensionality of the permeability field. Numerical results indicate that the hybrid network predicts well even for high-dimensional inputs.