# American Institute of Mathematical Sciences

June  2019, 1(2): 227-247. doi: 10.3934/fods.2019010

## EmT: Locating empty territories of homology group generators in a dataset

 Department of Statistics and Data Science, Yale University, New Haven, CT 06511, USA

* Corresponding author: Xin Xu

Published  June 2019

Persistent homology is a tool within topological data analysis to detect different dimensional holes in a dataset. The boundaries of the empty territories (i.e., holes) are not well-defined and each has multiple representations. The proposed method, Empty Territory (EmT), provides representations of different dimensional holes with a specified level of complexity of the territory boundary. EmT is designed for the setting where persistent homology uses a Vietoris-Rips complex filtration, and works as a post-analysis to refine the hole representation of the persistent homology algorithm. In particular, EmT uses alpha shapes to obtain a special class of representations that captures the empty territories with a complexity determined by the size of the alpha balls. With a fixed complexity, EmT returns the representation that contains the most points within the special class of representations. This method is limited to finding 1D holes in 2D data and 2D holes in 3D data, and is illustrated on simulation datasets of a homogeneous Poisson point process in 2D and a uniform sampling in 3D. Furthermore, the method is applied to a 2D cell tower location geography dataset and 3D Sloan Digital Sky Survey (SDSS) galaxy dataset, where it works well in capturing the empty territories.

Citation: Xin Xu, Jessi Cisewski-Kehe. EmT: Locating empty territories of homology group generators in a dataset. Foundations of Data Science, 2019, 1 (2) : 227-247. doi: 10.3934/fods.2019010
##### References:

show all references

##### References:
The three loops are equivalent in the sense that they all encircle the same hole. However, it can be desirable to find a representation that captures the empty territory, with no data point inside. The first two loops contain data points inside them, while the third one is empty. EmT is a method for finding the third type of representation at a specified level of complexity
For a VR complex, if there are pairwise intersections between $k$ points, the corresponding $k$-simplex is added to the simplicial complex. (a) three circles with radius $\frac{\delta}{2}$ that intersect pairwise, and $a, b, c$ are within a distance of $\delta$ from each other. The $VR(X, \delta)$ includes three points $a, b, c$, the edges $\{ab, bc, ca\}$, and the triangle $\{abc\}$. (b) the resulting $VR(X, \delta)$
(a) The same point cloud as in Fig. 1 is used to generate the persistence diagram of (b). The farther a point is from the diagonal, the longer it appears in the filtration. There are 40 $H_0$ generators and one prominent $H_1$ generator, which is consistent with the point cloud
(a) The green circles occupy the area around the data points. The remaining area is the resulting alpha hull, outlined by the red curves. By straightening the arcs, it becomes an alpha shape, shown in blue lines. (b) A concave hull based on the alpha shape (blue polygon). (c) An inner shell (blue polygon) based on the same alpha shape as (a)
(a), (b) and (c) are three different ways of occupying the area from the interior of the dataset with $\alpha$ equal to 0.8. (d) An $R_{\alpha, i}$ with $\alpha$ equal to 0.6
The same dataset from previous examples is used here to illustrate EmT. (a) Convex hull of the dataset in green lines. The red circles are set $S_X$ and green crosses are set $S_{X'}$. (b) The alpha hull in read lines and alpha shape in blue lines. Green circles are $\alpha$ balls and green points are $C_{\alpha}$. (c) Blue lines are $CH_{\alpha}$ and red pluses are circle centers that are inside $CH_{\alpha}$. (d) The comparison between the EmT output $\widetilde{IS_{\alpha}}$ as blue diamonds and the original representation point set $S_X$ returned by persistent homology as red circles
The same outer loop is used in the four examples displayed in (a), (b), (c), and (d), but with different points inside. The black points are data points, the red circles are the raw representations from the persistent homology algorithm, the blue triangles are representations of the EmT algorithm, and in (c) the cyan curves are the corresponding alpha hull. In (a), the raw representation of persistent homology is not able to detect the inner points, while the EmT representation includes the inner points. In (b), both the raw representation and EmT have the inner points in their representations. In (c), while the raw representation contains the larger loop with the inner points, the EmT representation includes two separate loops, as indicated by the corresponding alpha hulls. In (d), due to the amount of scattered points inside the larger loop, the raw representation is not able to identify it; therefore the EmT representation is not able to find the larger loop either since EmT takes the raw representation as input
A dataset from the homogeneous Poisson point process with intensity $300$ in $[0, 1]\times[0, 1]$. (a) Persistence diagram of the dataset. (b) The eight $H_1$ generators with the largest persistences reported by persistent homology in different colors. (c) The representations of the eight $H_1$ generators obtained by the EmT approach, where the boundary of the empty regions are better captured
(a) A dataset generated from the uniform distribution in $[0, 1]\times[0, 1]\times[0, 1]$; points inside the red regions are deleted from the dataset. (b) The persistence diagram of the dataset. (c) The raw representations from the persistent homology algorithm (blue and green points). The representations are displayed in blue and green shading, where the red points are observations inside the shaded volumes. (d) The representations from the EmT approach (blue and green points). The representations are displayed in blue and green shading, and there is no point inside the shading volumes
(a) Cell tower locations in Minnesota where each of the black point indicates a cell tower. (b) The loop overlaps with the Red Lake Indian Reservation. (c) The persistence diagram of the dataset. (d), (e) Representations of eight $H_1$ generators with the largest persistences from persistent homology and the EmT approach, respectively
(a) A Voronoi foam simulation. The eight large red points are the void seeds that generate eight void regions. The green crosses highlight one of the eight void regions. (b) Persistence diagram of the original dataset with 1200 points
Persistence diagrams of the centers of k-means clustering for different k's. The k-means are made on the dataset shown in Fig. 11a
(a) Cosmic voids representations ($H_2$ generators) reported by the persistent homology algorithm. (b) Cosmic voids representations from the EmT approach. As a qualitative comparison, the representations of EmT trace the empty volumes, while the raw representations reported by persistent homology do not define well the empty regions. (c) The raw representation of a particular cosmic void is displayed in the blue shading and the red points are galaxy inside the volume. (d) The EmT representation of a particular cosmic is displayed in the blue shading and there is no galaxy inside. The axis units are $Mpc/h$, which megaparsec divided by a parameter $h$ (which accounts for the uncertainty about the expansion rate of the universe). One parsec is approximately 3.26 lightyears
Summary table indicating which of the eight void regions is detected for different k values; 1 = detected and 0 = not detected. The corresponding labeled void regions are shown in Fig. 11a
 1 2 3 4 5 6 7 8 original 1 0 1 1 1 1 1 1 k=60 1 0 0 1 1 1 1 0 k=120 1 0 1 1 1 1 1 0 k=180 1 0 1 1 0 1 1 1 k=240 1 0 1 1 1 1 1 1 k=300 1 1 1 1 1 1 1 1 k=360 1 1 1 1 1 1 1 1
 1 2 3 4 5 6 7 8 original 1 0 1 1 1 1 1 1 k=60 1 0 0 1 1 1 1 0 k=120 1 0 1 1 1 1 1 0 k=180 1 0 1 1 0 1 1 1 k=240 1 0 1 1 1 1 1 1 k=300 1 1 1 1 1 1 1 1 k=360 1 1 1 1 1 1 1 1
 [1] Tyrus Berry, Timothy Sauer. Consistent manifold representation for topological data analysis. Foundations of Data Science, 2019, 1 (1) : 1-38. doi: 10.3934/fods.2019001 [2] George Siopsis. Quantum topological data analysis with continuous variables. Foundations of Data Science, 2019, 1 (4) : 419-431. doi: 10.3934/fods.2019017 [3] Jochen Abhau, Oswin Aichholzer, Sebastian Colutto, Bernhard Kornberger, Otmar Scherzer. Shape spaces via medial axis transforms for segmentation of complex geometry in 3D voxel data. Inverse Problems & Imaging, 2013, 7 (1) : 1-25. doi: 10.3934/ipi.2013.7.1 [4] Esther Klann, Ronny Ramlau, Wolfgang Ring. A Mumford-Shah level-set approach for the inversion and segmentation of SPECT/CT data. Inverse Problems & Imaging, 2011, 5 (1) : 137-166. doi: 10.3934/ipi.2011.5.137 [5] Karol Mikula, Jozef Urbán, Michal Kollár, Martin Ambroz, Ivan Jarolímek, Jozef Šibík, Mária Šibíková. An automated segmentation of NATURA 2000 habitats from Sentinel-2 optical data. Discrete & Continuous Dynamical Systems - S, 2020  doi: 10.3934/dcdss.2020348 [6] Zhouchen Lin. A review on low-rank models in data analysis. Big Data & Information Analytics, 2016, 1 (2&3) : 139-161. doi: 10.3934/bdia.2016001 [7] Pankaj Sharma, David Baglee, Jaime Campos, Erkki Jantunen. Big data collection and analysis for manufacturing organisations. Big Data & Information Analytics, 2017, 2 (2) : 127-139. doi: 10.3934/bdia.2017002 [8] Habibe Zare Haghighi, Sajad Adeli, Farhad Hosseinzadeh Lotfi, Gholam Reza Jahanshahloo. Revenue congestion: An application of data envelopment analysis. Journal of Industrial & Management Optimization, 2016, 12 (4) : 1311-1322. doi: 10.3934/jimo.2016.12.1311 [9] Runqin Hao, Guanwen Zhang, Dong Li, Jie Zhang. Data modeling analysis on removal efficiency of hexavalent chromium. Mathematical Foundations of Computing, 2019, 2 (3) : 203-213. doi: 10.3934/mfc.2019014 [10] Mahdi Mahdiloo, Abdollah Noorizadeh, Reza Farzipoor Saen. Developing a new data envelopment analysis model for customer value analysis. Journal of Industrial & Management Optimization, 2011, 7 (3) : 531-558. doi: 10.3934/jimo.2011.7.531 [11] Jiang Xie, Junfu Xu, Celine Nie, Qing Nie. Machine learning of swimming data via wisdom of crowd and regression analysis. Mathematical Biosciences & Engineering, 2017, 14 (2) : 511-527. doi: 10.3934/mbe.2017031 [12] Jingmei Zhou, Xiangmo Zhao, Xin Cheng, Zhigang Xu. Visualization analysis of traffic congestion based on floating car data. Discrete & Continuous Dynamical Systems - S, 2015, 8 (6) : 1423-1433. doi: 10.3934/dcdss.2015.8.1423 [13] Matthew O. Williams, Clarence W. Rowley, Ioannis G. Kevrekidis. A kernel-based method for data-driven koopman spectral analysis. Journal of Computational Dynamics, 2015, 2 (2) : 247-265. doi: 10.3934/jcd.2015005 [14] Cheng-Kai Hu, Fung-Bao Liu, Cheng-Feng Hu. Efficiency measures in fuzzy data envelopment analysis with common weights. Journal of Industrial & Management Optimization, 2017, 13 (1) : 237-249. doi: 10.3934/jimo.2016014 [15] Massimiliano Guzzo, Giancarlo Benettin. A spectral formulation of the Nekhoroshev theorem and its relevance for numerical and experimental data analysis. Discrete & Continuous Dynamical Systems - B, 2001, 1 (1) : 1-28. doi: 10.3934/dcdsb.2001.1.1 [16] Sebastien Motsch, Mehdi Moussaïd, Elsa G. Guillot, Mathieu Moreau, Julien Pettré, Guy Theraulaz, Cécile Appert-Rolland, Pierre Degond. Modeling crowd dynamics through coarse-grained data analysis. Mathematical Biosciences & Engineering, 2018, 15 (6) : 1271-1290. doi: 10.3934/mbe.2018059 [17] Mohammad Afzalinejad, Zahra Abbasi. A slacks-based model for dynamic data envelopment analysis. Journal of Industrial & Management Optimization, 2019, 15 (1) : 275-291. doi: 10.3934/jimo.2018043 [18] Cheng-Kai Hu, Fung-Bao Liu, Hong-Ming Chen, Cheng-Feng Hu. Network data envelopment analysis with fuzzy non-discretionary factors. Journal of Industrial & Management Optimization, 2020  doi: 10.3934/jimo.2020046 [19] Andreas Chirstmann, Qiang Wu, Ding-Xuan Zhou. Preface to the special issue on analysis in machine learning and data science. Communications on Pure & Applied Analysis, 2020, 19 (8) : i-iii. doi: 10.3934/cpaa.2020171 [20] Zheng Dai, I.G. Rosen, Chuming Wang, Nancy Barnett, Susan E. Luczak. Using drinking data and pharmacokinetic modeling to calibrate transport model and blind deconvolution based data analysis software for transdermal alcohol biosensors. Mathematical Biosciences & Engineering, 2016, 13 (5) : 911-934. doi: 10.3934/mbe.2016023

Impact Factor: