# American Institute of Mathematical Sciences

eISSN:
2639-8001


## Foundations of Data Science

September 2022 , Volume 4 , Issue 3


2022, 4(3): 323-353 doi: 10.3934/fods.2022009
Abstract:

Nonlocal models have recently had a major impact in nonlinear continuum mechanics and are used to describe physical systems/processes that cannot be accurately described by classical, calculus-based "local" approaches. In part, this is due to their multiscale nature, which enables aggregation of micro-level behavior to obtain a macro-level description of singular/irregular phenomena such as peridynamics, crack propagation, anomalous diffusion and transport phenomena. At the core of these models are nonlocal differential operators, including nonlocal analogs of the gradient/Hessian. This paper initiates the use of such nonlocal operators in the context of optimization and learning. We define and analyze the convergence properties of nonlocal analogs of (stochastic) gradient descent and Newton's method on Euclidean spaces. Our results indicate that as the nonlocal interactions become less noticeable, the optima corresponding to nonlocal optimization converge to the "usual" optima. At the same time, we argue that nonlocal learning is possible in situations where standard calculus fails. As a stylized numerical example of this, we consider the problem of non-differentiable parameter estimation on a non-smooth translation manifold and show that our nonlocal gradient descent recovers the unknown translation parameter from a non-differentiable objective function.
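As a rough illustration of the idea (not the operators or algorithms analyzed in the paper), the sketch below uses an assumed one-dimensional nonlocal gradient with a uniform kernel, $G_\delta f(x) = \frac{1}{2\delta}\int_{-\delta}^{\delta} \frac{f(x+h)-f(x)}{h}\,dh$, i.e. an average of difference quotients over an interaction window. The window `delta`, the quadrature grid, the step size, the test objective $|x-2|$, and all function names are assumptions made for this example.

```python
import numpy as np

def nonlocal_grad(f, x, delta, n_points=100):
    """Average of difference quotients (f(x+h) - f(x)) / h over |h| <= delta.

    Assumed 1D form with a uniform kernel. An even n_points keeps h = 0
    off the symmetric grid, so no division by zero occurs.
    """
    if n_points % 2 != 0:
        raise ValueError("n_points must be even so that h = 0 is excluded")
    h = np.linspace(-delta, delta, n_points)
    return float(np.mean((f(x + h) - f(x)) / h))

def nonlocal_gradient_descent(f, x0, delta=0.5, step=0.05, n_iter=500):
    """Plain gradient descent with the nonlocal gradient in place of f'."""
    x = float(x0)
    for _ in range(n_iter):
        x -= step * nonlocal_grad(f, x, delta)
    return x

def objective(x):
    # Non-differentiable at its minimizer x = 2 (hypothetical test function).
    return np.abs(x - 2.0)

x_star = nonlocal_gradient_descent(objective, x0=0.0)
print(x_star)
```

For a smooth `f`, the averaged difference quotient approaches the ordinary derivative as `delta` shrinks; at the kink of `objective` it is still well defined, which is what lets the iteration proceed where a classical derivative does not exist.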

2022, 4(3): 355-393 doi: 10.3934/fods.2022010
Abstract:

Phase is the most fundamental physical quantity when we study an oscillatory time series. Many tools aim to estimate phase, and most are built on the analytic function model. Unfortunately, tools based on the analytic function model can be limited in handling modern signals with intrinsic nonstationary structure, for example, biomedical signals composed of multiple oscillatory components, each with time-varying frequency, amplitude, and non-sinusoidal oscillation. This limitation has several consequences, and we focus on one in particular: phases estimated from signals recorded simultaneously by different sensors from the same physiological system of the same subject might differ. This fact might challenge reproducibility, communication, and scientific interpretation. Thus, we need a standardized approach with theoretical support over a unified model. In this paper, after summarizing existing models for phase and discussing the main challenge caused by the above-mentioned intrinsic nonstationary structure, we introduce the adaptive non-harmonic model (ANHM), provide a definition of phase called fundamental phase, which is a vector-valued function describing the dynamics of all oscillatory components in the signal, and suggest a time-varying bandpass filter (tvBPF) scheme based on time-frequency analysis tools to estimate the fundamental phase. The proposed approach is validated with a simulated database and a real-world database with experts' labels, and it is applied to two real-world databases, each of which has biomedical signals recorded from different sensors, to show how to standardize the definition of phase in the real-world experimental environment. We report that the phase describing a physiological system, if properly modeled and extracted, does not depend on the sensor selected for that system, while other approaches might fail.
In conclusion, the proposed approach resolves the above-mentioned scientific challenge. We expect it to have scientific impact on a broad range of applications.
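For context, the classical analytic-signal route to phase that the abstract contrasts against can be sketched as follows. This is only the textbook baseline, not the ANHM or tvBPF scheme of the paper; the FFT-based construction, the sampling rate, and the 5 Hz test tone are assumptions chosen for the example.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal: keep DC, double positive frequencies,
    zero out negative frequencies."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0          # Nyquist bin for even length
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def instantaneous_phase(x):
    """Unwrapped angle of the analytic signal."""
    return np.unwrap(np.angle(analytic_signal(x)))

# Hypothetical test signal: a pure 5 Hz cosine sampled at 1 kHz for 1 s.
fs = 1000.0
t = np.arange(1000) / fs
freq = 5.0
signal = np.cos(2.0 * np.pi * freq * t)
phase = instantaneous_phase(signal)
```

On a single stationary sinusoid this recovers the linear phase $2\pi f t$. The abstract's point is that signals with several components, time-varying frequency, or non-sinusoidal shape fall outside this simple setting.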

2022, 4(3): 395-422 doi: 10.3934/fods.2022011
Abstract:

This work proposes a novel technique for clustering multimodal data according to their information content. Statistical correlations present in data that contain similar information are exploited to perform the clustering task. Specifically, multiset canonical correlation analysis is equipped with norm-one regularization mechanisms to identify clusters within different types of data that share the same information content. A pertinent minimization formulation is put forth, while block coordinate descent is employed to derive a batch clustering algorithm which achieves better clustering performance than existing alternatives. Relying on subgradient descent, an online clustering approach is derived which substantially lowers computational complexity compared to the batch approach, without significantly compromising clustering performance. It is established that, as the number of data increases, the novel regularized multiset framework is able to correctly cluster the multimodal data entries. Further, it is proved that the online clustering scheme converges with probability one to a stationary point of the ensemble regularized multiset correlations cost, which has the potential to recover the correct clusters. Extensive numerical tests demonstrate that the novel clustering scheme outperforms existing alternatives, while the online scheme achieves substantial computational savings.

2022, 4(3): 423-440 doi: 10.3934/fods.2022012
Abstract:

Kernel matrices are crucial in many learning tasks such as support vector machines or kernel ridge regression. The kernel matrix is typically dense and large-scale. Depending on the dimension of the feature space, even the computation of all of its entries in reasonable time becomes a challenging task. For such dense matrices the cost of a matrix-vector product scales quadratically with the dimensionality $N$, if no customized methods are applied. We propose the use of an ANOVA kernel, where we construct several kernels based on lower-dimensional feature spaces for which we provide fast algorithms realizing the matrix-vector products. We employ the non-equispaced fast Fourier transform (NFFT), which is of linear complexity for fixed accuracy. Based on a feature grouping approach, we then show how the fast matrix-vector products can be embedded into a learning method choosing kernel ridge regression and the conjugate gradient solver. We illustrate the performance of our approach on several data sets.
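The overall structure described here, kernel ridge regression solved by conjugate gradients using only matrix-vector products, can be sketched as below. This is a minimal sketch under stated assumptions: the dense product `K @ v` is a quadratic-cost stand-in for the NFFT-based fast product, the kernel is simply a sum of Gaussian kernels over feature groups, and the groups, `sigma`, `lam`, data, and function names are all hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Dense Gaussian kernel matrix between the rows of A and B."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / sigma**2)

def grouped_kernel(A, B, groups, sigma):
    """Sum of Gaussian kernels, each acting on one low-dimensional feature group."""
    return sum(gaussian_kernel(A[:, g], B[:, g], sigma) for g in groups)

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A given only v -> A v."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * b_norm:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def fit_kernel_ridge(X, y, groups, sigma=1.0, lam=0.1):
    """Solve (K + lam * I) alpha = y; the solver only ever calls matvec."""
    K = grouped_kernel(X, X, groups, sigma)   # dense stand-in for a fast product
    return conjugate_gradient(lambda v: K @ v + lam * v, y)

def predict(X_train, X_new, alpha, groups, sigma=1.0):
    return grouped_kernel(X_new, X_train, groups, sigma) @ alpha

# Hypothetical data: 60 points, 6 features split into two groups of 3.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 6))
y = np.sin(X[:, 0]) + 0.5 * X[:, 3]
groups = [[0, 1, 2], [3, 4, 5]]
alpha = fit_kernel_ridge(X, y, groups)
```

Because the solver touches the kernel only through `matvec`, a faster approximate product could be swapped in without changing the rest of the code.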

2022, 4(3): 441-466 doi: 10.3934/fods.2022013
Abstract:

Complete deconvolution analysis for bulk RNA-seq data is important and helpful to distinguish whether the differences of disease-associated GEPs (gene expression profiles) in tissues of patients and normal controls are due to changes in cellular composition of tissue samples, or due to GEP changes in specific cells. One of the major techniques to perform complete deconvolution is nonnegative matrix factorization (NMF), which also has a wide range of applications in the machine learning community. However, NMF is a well-known strongly ill-posed problem, so a direct application of NMF to RNA-seq data will suffer severe difficulties in the interpretability of solutions. In this paper, we develop an NMF-based mathematical model and corresponding computational algorithms to improve the solution identifiability of deconvoluting bulk RNA-seq data. In our approach, we combine the biological concept of marker genes with the solvability conditions of the NMF theories, and develop a geometric-structure-guided optimization model. In this strategy, the geometric structure of bulk tissue data is first explored by the spectral clustering technique. Then, the identified information of marker genes is integrated as solvability constraints, while the overall correlation graph is used as manifold regularization. Both synthetic and biological data are used to validate the proposed model and algorithms, and the results show significantly improved solution interpretability and accuracy.
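To make the ill-posedness concrete, the sketch below runs plain NMF with the standard multiplicative updates for the Frobenius objective and then shows that a rescaled pair of factors reproduces the same product. This is the unconstrained baseline only, not the marker-gene-constrained, manifold-regularized model of the paper; the synthetic matrices, their sizes, and the names `signatures` and `proportions` are hypothetical.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=300, eps=1e-12, seed=0):
    """Plain NMF, V ~ W @ H with W, H >= 0, via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    errors = [np.linalg.norm(V - W @ H)]      # error at the random start
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

# Hypothetical synthetic "bulk" matrix: genes x samples.
rng = np.random.default_rng(1)
signatures = rng.random((40, 3))     # genes x cell types
proportions = rng.random((3, 25))    # cell types x samples
bulk = signatures @ proportions

W, H, errors = nmf_multiplicative(bulk, rank=3)

# Non-uniqueness: rescaling the factors leaves the product unchanged.
D = np.diag([2.0, 0.5, 4.0])
W_alt, H_alt = W @ D, np.linalg.inv(D) @ H
```

Both `(W, H)` and `(W_alt, H_alt)` are nonnegative and fit `bulk` equally well, so the fit alone cannot decide which factors are the biologically meaningful ones. Additional structure, such as the constraints the abstract describes, is what narrows the solution set.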