# American Institute of Mathematical Sciences

September  2019, 1(3): 271-291. doi: 10.3934/fods.2019012

## Learning by active nonlinear diffusion

 1 Department of Mathematics, Department of Applied Mathematics and Statistics, Mathematical Institute of Data Sciences, Institute of Data Intensive Engineering and Science, Johns Hopkins University, Baltimore, MD 21218, USA 2 Department of Mathematics, Tufts University, Medford, MA 02155, USA

* Corresponding author: James M. Murphy

Published  August 2019

Fund Project: This research is supported by NSF-DMS-125012, NSF-DMS-1724979, NSF-DMS-1708602, NSF-ATD-1737984, AFOSR FA9550-17-1-0280, NSF-IIS-1546392, NSF-DMS 1912737, and NSF-DMS 1924513.

This article proposes an active learning method for high-dimensional data, based on intrinsic data geometries learned through diffusion processes on graphs. Diffusion distances are used to parametrize low-dimensional structures on the dataset, which allow for high-accuracy labelings with only a small number of carefully chosen training labels. The geometric structure of the data suggests regions that have homogeneous labels, as well as regions with high label complexity that should be queried for labels. The proposed method enjoys theoretical performance guarantees on a general geometric data model, in which clusters corresponding to semantically meaningful classes are permitted to have nonlinear geometries, high ambient dimensionality, and suffer from significant noise and outlier corruption. The proposed algorithm is implemented in a manner that is quasilinear in the number of unlabeled data points, and exhibits competitive empirical performance on synthetic datasets and real hyperspectral remote sensing images.

Citation: Mauro Maggioni, James M. Murphy. Learning by active nonlinear diffusion. Foundations of Data Science, 2019, 1 (3) : 271-291. doi: 10.3934/fods.2019012
##### References:

show all references

##### References:
Data colored by class label. Both the data in (a) and (b) exhibit cluster structure, which can guide active learning in the case that the labels are constant on these clusters. Indeed, on the left, using a simple clustering algorithm such as $K$-means suggests that only 3 labels are necessary to correctly label the entire dataset. For the data on the right, many more than three labels are necessary if $K$-means is used for the underlying clustering, since the clusters are highly elongated and nonlinear. Indeed, $K$-means will split the annular and elongated clusters. On the other hand, if pairwise comparisons are made with distances other than Euclidean distances, it may be possible that active learning achieves near perfect results with only 3 labels. The proposed active learning scheme gains robustness to class shape via diffusion geometry, and is suitable for data in both (a) and (b)
for $\log_{10}(t) = 2$ and $\log_{10}(t) = 9$, respectively. We see that for small values of $t$, the mode estimation incorrectly places the first three modes on the highly elongated cluster. For larger time values, the underlying random walk reaches a mesoscopic equilibrium and correct mode estimation is achieved. The emergence of mesoscopic equilibria is apparent in (c), (d), which show the matrix of diffusion distances at time $\log_{10}(t) = 2$ and $\log_{10}(t) = 9$, respectively. When $\log_{10}(t) = 2$, $P^{t}$ has not mixed, and there are still substantial within-cluster distances. For $\log_{10}(t) = 9$, $P^{t}$ has reached mesoscopic equilibria, so that within-cluster distances are quite small, yet between-cluster distances are still large [40]">Figure 2.  In (a) and (b), the values of $\log_{10}( \mathcal{D}_{t})$ are shown for synthetic geometric data from Figure 1 (b) for $\log_{10}(t) = 2$ and $\log_{10}(t) = 9$, respectively. We see that for small values of $t$, the mode estimation incorrectly places the first three modes on the highly elongated cluster. For larger time values, the underlying random walk reaches a mesoscopic equilibrium and correct mode estimation is achieved. The emergence of mesoscopic equilibria is apparent in (c), (d), which show the matrix of diffusion distances at time $\log_{10}(t) = 2$ and $\log_{10}(t) = 9$, respectively. When $\log_{10}(t) = 2$, $P^{t}$ has not mixed, and there are still substantial within-cluster distances. For $\log_{10}(t) = 9$, $P^{t}$ has reached mesoscopic equilibria, so that within-cluster distances are quite small, yet between-cluster distances are still large [40]
Top row: Three different synthetic datasets in two dimensions are shown, categorized as geometric, bottleneck and Gaussian. Bottom row: Plots of node purity for three different multiscale, hierarchical methods of clustering: average linkage clustering (ALC), single linkage clustering (SLC) and learning by unsupervised nonlinear diffusion (LUND). As the number of leaves/clusters increases, purity is non-decreasing. The purity of the LUND clusters converges more rapidly to the optimal value 1, indicating that high accuracy can be gained by correctly labeling a smaller number of clusters, compared to ALC and SLC
In (a), data with natural hierarchical structure is exhibited. The four Gaussians have means $(0, 0), (0, 2), \left(\frac{3}{2}, 0\right), \left(\frac{3}{2}, 2\right)$. While at one level of granularity there are 4 clusters (shown in (b)), at a coarser level of granularity the top 2 and bottom 2 Gaussians form clusters, leading to a clustering with only 2 clusters (shown in (c))
The matrix of pairwise diffusion distances for $\log_{10}(t) = 1.5$ and $\log_{10}(t) = 5$ are shown in (a) and (b), respectively, illustrating the hierarchical cluster structure in the data. This hierarchical structure introduces ambiguities into the estimation of the number of clusters $K$ in LUND, as shown (c). For small time, $\hat{K} = 4$, while for larger time $\hat{K} = 2$
LAND labelings of the data with four queries under two different scenarios: small diffusion time (top row) and large diffusion time (bottom row), and four latent clusters (first column) and two latent clusters (second column). The clusters are closer in the horizontal direction than in the vertical direction, from whence the hierarchical structure is derived. In all cases, LAND is able to to correctly label the data with just four queries, one for each Gaussian
. We see that LAND achieves rapid convergence to perfect labeling accuracy, compared to much slower convergence for the two comparison methods">Figure 7.  Experimental results on the synthetic datasets introduced in Figure 3. We see that LAND achieves rapid convergence to perfect labeling accuracy, compared to much slower convergence for the two comparison methods
The Salinas A dataset consists of $83 \times 86 = 7138$ pixels in $D = 224$ dimensions. The image has spatial resolution $3.7$m/pixel, and was recorded over Salinas, USA by the Aviris sensor. The six labelled classes are arranged in diagonal rows, and are quite spatially regular. The sum across all spectral bands is shown in (a), and the labeled ground truth is shown in (b), with pixels having the same class being given the same color
Pavia data consists of a $270\times 50 = 13500$ subset of the full Pavia data set. The image has spatial resolution $1.3$m/pixel, and was recorded over Pavia, Italy by the ROSIS sensor. It consists of 6 spatial classes, some of which are quite well-spread out in the image. The sum across all spectral bands is shown in (a), and the labeled ground truth is shown in (b), with pixels having the same class being given the same color
The active learning results for the Salinas A and Pavia datasets are shown in (a) and (b), respectively. In both cases, the LAND algorithm strongly outperforms the modified LAND variant using randomly selected training data, and the CBAL algorithm. In particular, LAND is able to achieve a significant improvement in accuracy with a very small number of labels
 [1] Ana Rita Nogueira, João Gama, Carlos Abreu Ferreira. Causal discovery in machine learning: Theories and applications. Journal of Dynamics & Games, 2021  doi: 10.3934/jdg.2021008 [2] Shan-Shan Lin. Due-window assignment scheduling with learning and deterioration effects. Journal of Industrial & Management Optimization, 2021  doi: 10.3934/jimo.2021081 [3] Mehmet Duran Toksari, Emel Kizilkaya Aydogan, Berrin Atalay, Saziye Sari. Some scheduling problems with sum of logarithm processing times based learning effect and exponential past sequence dependent delivery times. Journal of Industrial & Management Optimization, 2021  doi: 10.3934/jimo.2021044 [4] Hao Li, Honglin Chen, Matt Haberland, Andrea L. Bertozzi, P. Jeffrey Brantingham. PDEs on graphs for semi-supervised learning applied to first-person activity recognition in body-worn video. Discrete & Continuous Dynamical Systems, 2021  doi: 10.3934/dcds.2021039 [5] Rui Wang, Rundong Zhao, Emily Ribando-Gros, Jiahui Chen, Yiying Tong, Guo-Wei Wei. HERMES: Persistent spectral graph software. Foundations of Data Science, 2021, 3 (1) : 67-97. doi: 10.3934/fods.2021006 [6] Beom-Seok Han, Kyeong-Hun Kim, Daehan Park. A weighted Sobolev space theory for the diffusion-wave equations with time-fractional derivatives on $C^{1}$ domains. Discrete & Continuous Dynamical Systems, 2021, 41 (7) : 3415-3445. doi: 10.3934/dcds.2021002 [7] Eric Babson and Dmitry N. Kozlov. Topological obstructions to graph colorings. Electronic Research Announcements, 2003, 9: 61-68. [8] Z. Reichstein and B. Youssin. Parusinski's "Key Lemma" via algebraic geometry. Electronic Research Announcements, 1999, 5: 136-145. [9] Braxton Osting, Jérôme Darbon, Stanley Osher. Statistical ranking using the $l^{1}$-norm on graphs. Inverse Problems & Imaging, 2013, 7 (3) : 907-926. doi: 10.3934/ipi.2013.7.907 [10] Alberto Bressan, Ke Han, Franco Rampazzo. On the control of non holonomic systems by active constraints. Discrete & Continuous Dynamical Systems, 2013, 33 (8) : 3329-3353. doi: 10.3934/dcds.2013.33.3329 [11] Antonio Rieser. A topological approach to spectral clustering. Foundations of Data Science, 2021, 3 (1) : 49-66. doi: 10.3934/fods.2021005 [12] Enkhbat Rentsen, N. Tungalag, J. Enkhbayar, O. Battogtokh, L. Enkhtuvshin. Application of survival theory in Mining industry. Numerical Algebra, Control & Optimization, 2021, 11 (3) : 443-448. doi: 10.3934/naco.2020036 [13] Mikhail Gilman, Semyon Tsynkov. Statistical characterization of scattering delay in synthetic aperture radar imaging. Inverse Problems & Imaging, 2020, 14 (3) : 511-533. doi: 10.3934/ipi.2020024 [14] Suzete Maria Afonso, Vanessa Ramos, Jaqueline Siqueira. Equilibrium states for non-uniformly hyperbolic systems: Statistical properties and analyticity. Discrete & Continuous Dynamical Systems, 2021  doi: 10.3934/dcds.2021045 [15] Jamal Mrazgua, El Houssaine Tissir, Mohamed Ouahi. Frequency domain $H_{\infty}$ control design for active suspension systems. Discrete & Continuous Dynamical Systems - S, 2021  doi: 10.3934/dcdss.2021036 [16] Ahmad Mousavi, Zheming Gao, Lanshan Han, Alvin Lim. Quadratic surface support vector machine with L1 norm regularization. Journal of Industrial & Management Optimization, 2021  doi: 10.3934/jimo.2021046 [17] Guillaume Bal, Wenjia Jing. Homogenization and corrector theory for linear transport in random media. Discrete & Continuous Dynamical Systems, 2010, 28 (4) : 1311-1343. doi: 10.3934/dcds.2010.28.1311 [18] Felix Finster, Jürg Fröhlich, Marco Oppio, Claudio F. Paganini. Causal fermion systems and the ETH approach to quantum theory. Discrete & Continuous Dynamical Systems - S, 2021, 14 (5) : 1717-1746. doi: 10.3934/dcdss.2020451 [19] Fioralba Cakoni, Shixu Meng, Jingni Xiao. A note on transmission eigenvalues in electromagnetic scattering theory. Inverse Problems & Imaging, , () : -. doi: 10.3934/ipi.2021025 [20] Takeshi Saito, Kazuyuki Yagasaki. Chebyshev spectral methods for computing center manifolds. Journal of Computational Dynamics, 2021  doi: 10.3934/jcd.2021008

Impact Factor: