# American Institute of Mathematical Sciences

eISSN: 2639-8001


## Foundations of Data Science

September 2019, Volume 1, Issue 3


2019, 1(3): 249-269. doi: 10.3934/fods.2019011
Abstract:

A wide array of machine learning problems are formulated as the minimization of the expectation of a convex loss function on some parameter space. Since the probability distribution of the data of interest is usually unknown, it is often estimated from training sets, which may lead to poor out-of-sample performance. In this work, we bring new insight into this problem using the framework developed in quantitative finance for risk measures. We show that the original min-max problem can be recast as a convex minimization problem under suitable assumptions. We discuss several important examples of robust formulations, in particular by defining ambiguity sets based on $\varphi$-divergences and the Wasserstein metric. We also propose an efficient algorithm for solving the corresponding convex optimization problems involving complex convex constraints. Through simulation examples, we demonstrate that this algorithm scales well on real data sets.
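For the Kullback-Leibler divergence, one member of the $\varphi$-divergence family mentioned above, the worst-case expectation over an ambiguity ball has a well-known one-dimensional convex dual, which illustrates how a min-max problem collapses to a plain minimization. The sketch below is illustrative, not the paper's algorithm; the function name and the use of a generic bounded scalar minimizer are assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_robust_expectation(losses, rho):
    """Worst-case expected loss over a KL-divergence ball of radius rho
    around the empirical distribution of `losses`, computed via the dual
        inf_{lam > 0}  lam * rho + lam * log E[exp(loss / lam)],
    which is a convex one-dimensional problem."""
    losses = np.asarray(losses, dtype=float)

    def dual(lam):
        # log-mean-exp computed stably to avoid overflow for small lam
        z = losses / lam
        m = z.max()
        return lam * rho + lam * (m + np.log(np.mean(np.exp(z - m))))

    res = minimize_scalar(dual, bounds=(1e-6, 1e6), method="bounded")
    return res.fun
```

As the radius `rho` shrinks to zero the robust value approaches the empirical mean, and it is always bounded above by the largest observed loss.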

2019, 1(3): 271-291. doi: 10.3934/fods.2019012
Abstract:

This article proposes an active learning method for high-dimensional data, based on intrinsic data geometries learned through diffusion processes on graphs. Diffusion distances are used to parametrize low-dimensional structures on the dataset, which allow for high-accuracy labelings with only a small number of carefully chosen training labels. The geometric structure of the data suggests regions that have homogeneous labels, as well as regions with high label complexity that should be queried for labels. The proposed method enjoys theoretical performance guarantees on a general geometric data model, in which clusters corresponding to semantically meaningful classes are permitted to have nonlinear geometries, high ambient dimensionality, and suffer from significant noise and outlier corruption. The proposed algorithm is implemented in a manner that is quasilinear in the number of unlabeled data points, and exhibits competitive empirical performance on synthetic datasets and real hyperspectral remote sensing images.
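The diffusion distances at the heart of the method can be sketched by brute force on a small dataset: build a Gaussian-kernel graph, form its random-walk matrix, and compare rows of the powered matrix weighted by the stationary distribution. The bandwidth `eps`, the time parameter `t`, and the function name are illustrative; the paper's quasilinear implementation is far more efficient than this dense computation:

```python
import numpy as np

def diffusion_distances(X, eps=1.0, t=2):
    """Pairwise diffusion distances at time t on a Gaussian-kernel graph:
        D_t(i, j)^2 = sum_u (P^t[i, u] - P^t[j, u])^2 / pi[u],
    where P is the random-walk matrix and pi its stationary distribution."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / eps)          # symmetric Gaussian kernel weights
    d = W.sum(1)
    P = W / d[:, None]             # row-stochastic random-walk matrix
    pi = d / d.sum()               # stationary distribution (degree-proportional)
    Pt = np.linalg.matrix_power(P, t)
    diff = Pt[:, None, :] - Pt[None, :, :]
    D2 = (diff ** 2 / pi[None, None, :]).sum(-1)
    return np.sqrt(np.maximum(D2, 0.0))
```

Points in the same well-connected region end up close in this metric even when their Euclidean distance is moderate, which is what lets a few queried labels propagate over homogeneous regions.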

2019, 1(3): 293-306. doi: 10.3934/fods.2019013
Abstract:

Dynamic interaction networks frequently arise in biology, communications technology and the social sciences, representing, for example, neuronal connectivity in the brain, internet connections between computers and human interactions within social networks. The evolution and strengthening of the links in such networks can be observed through sequences of connection events occurring between network nodes over time. In some of these applications, the identity and size of the network may be unknown a priori and may change over time. In this article, a model for the evolution of dynamic networks based on the Pitman-Yor process is proposed. This model explicitly admits power-laws in the number of connections on each edge, often present in real world networks, and, for careful choices of the parameters, power-laws for the degree distribution of the nodes. A novel empirical method for the estimation of the hyperparameters of the Pitman-Yor process is proposed, and some necessary corrections for uniform discrete base distributions are carefully addressed. The methodology is tested on synthetic data and in an anomaly detection study on the enterprise computer network of the Los Alamos National Laboratory, and successfully detects connections from a red-team penetration test.
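The Pitman-Yor predictive (Chinese-restaurant) scheme that produces such power-law repeat counts can be sketched as follows. This is the naive sequential sampler with an arbitrary base-measure sampler; it does not include the paper's corrections for uniform discrete base distributions, and the function name is illustrative:

```python
import random

def pitman_yor_sequence(n, alpha, d, base, rng=None):
    """Draw n values from a Pitman-Yor process with strength alpha > -d and
    discount 0 <= d < 1. A value already seen c times is redrawn with
    probability (c - d) / (i + alpha) at step i; otherwise a fresh value is
    drawn from `base`, with probability (alpha + d * #distinct) / (i + alpha)."""
    rng = rng or random.Random()
    draws, counts = [], {}
    for i in range(n):
        u = rng.random() * (i + alpha)
        if u < alpha + d * len(counts):
            x = base(rng)                  # brand-new draw from the base measure
        else:
            u -= alpha + d * len(counts)
            for x, c in counts.items():    # reuse an existing value, mass c - d
                u -= c - d
                if u <= 0:
                    break
        counts[x] = counts.get(x, 0) + 1
        draws.append(x)
    return draws, counts
```

With `d = 0` this reduces to the Dirichlet process; positive discounts heavier-tail the counts, matching the power laws observed on real network edges.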

2019, 1(3): 307-327. doi: 10.3934/fods.2019014
Abstract:

We study the use of power weighted shortest path metrics for clustering high-dimensional Euclidean data, under the assumption that the data is drawn from a collection of disjoint low-dimensional manifolds. We argue, theoretically and experimentally, that this leads to higher clustering accuracy. We also present a fast algorithm for computing these distances.
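Under one natural reading of such a metric, where an edge between points $x_i$ and $x_j$ costs $\|x_i - x_j\|^p$ and a path costs the sum of its edge costs, the distances can be computed with an off-the-shelf all-pairs shortest-path routine. The exact form of the metric and the function name here are assumptions, and this brute-force computation is precisely what the paper's fast algorithm avoids:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def power_weighted_path_distances(X, p=2.0):
    """All-pairs power weighted shortest path distances on the complete
    graph over the rows of X, with edge costs ||x_i - x_j||^p. Raising p
    penalizes long hops, so shortest paths prefer chains of nearby points
    inside dense regions, which is what makes the metric cluster-friendly."""
    D = squareform(pdist(X)) ** p
    return shortest_path(D, method="D", directed=False)
```

On collinear points 0, 1, 2 with p = 2, the path through the midpoint costs 1 + 1 = 2, beating the direct edge cost of 4, so within-cluster chains are favored over jumps.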

2019, 1(3): 329-387. doi: 10.3934/fods.2019015
Abstract:

Consider an open set $\mathbb{D}\subseteq\mathbb{R}^n$, equipped with a probability measure $\mu$. An important characteristic of a smooth function $f:\mathbb{D}\rightarrow\mathbb{R}$ is its second-moment matrix $\Sigma_{\mu} := \int \nabla f(x) \nabla f(x)^* \mu(dx) \in\mathbb{R}^{n\times n}$, where $\nabla f(x)\in\mathbb{R}^n$ is the gradient of $f(\cdot)$ at $x\in\mathbb{D}$ and $*$ stands for transpose. For instance, the span of the leading $r$ eigenvectors of $\Sigma_{\mu}$ forms an active subspace of $f(\cdot)$, which contains the directions along which $f(\cdot)$ changes the most and is of particular interest in ridge approximation. In this work, we propose a simple algorithm for estimating $\Sigma_{\mu}$ from random point evaluations of $f(\cdot)$ without imposing any structural assumptions on $\Sigma_{\mu}$. Theoretical guarantees for this algorithm are established with the aid of the same technical tools that have proved valuable in the context of covariance matrix estimation from partial measurements.
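A brute-force Monte Carlo baseline for $\Sigma_{\mu}$, not the paper's algorithm, replaces each gradient by central finite differences built from point evaluations of $f$ alone. The uniform measure on $[0,1]^n$, the step size `h`, and the function name are illustrative assumptions:

```python
import numpy as np

def estimate_second_moment_matrix(f, n, num_samples=2000, h=1e-5, rng=None):
    """Monte Carlo estimate of Sigma = E_mu[grad f(x) grad f(x)^T] with
    x ~ Uniform([0, 1]^n), each partial derivative approximated by a
    central difference (f(x + h e_i) - f(x - h e_i)) / (2 h)."""
    if rng is None:
        rng = np.random.default_rng(0)
    Sigma = np.zeros((n, n))
    for _ in range(num_samples):
        x = rng.random(n)
        g = np.empty(n)
        for i in range(n):
            e = np.zeros(n)
            e[i] = h
            g[i] = (f(x + e) - f(x - e)) / (2 * h)
        Sigma += np.outer(g, g)
    return Sigma / num_samples
```

For a ridge function such as $f(x) = 3x_1$, the estimate concentrates all its mass on the first coordinate direction, recovering a one-dimensional active subspace; the paper's method achieves this from far fewer, unstructured evaluations.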