EmT: Locating empty territories of homology group generators in a dataset

Persistent homology is a tool within topological data analysis to detect different dimensional holes in a dataset. The boundaries of the empty territories (i.e., holes) are not well-defined and each has multiple representations. The proposed method, Empty Territory (EmT), provides representations of different dimensional holes with a specified level of complexity of the territory boundary. EmT is designed for the setting where persistent homology uses a Vietoris-Rips complex filtration, and works as a post-analysis to refine the hole representation of the persistent homology algorithm. In particular, EmT uses alpha shapes to obtain a special class of representations that captures the empty territories with a complexity determined by the size of the alpha balls. With a fixed complexity, EmT returns the representation that contains the most points within the special class of representations. This method is limited to finding 1D holes in 2D data and 2D holes in 3D data, and is illustrated on simulation datasets of a homogeneous Poisson point process in 2D and a uniform sampling in 3D. Furthermore, the method is applied to a 2D cell tower location geography dataset and 3D Sloan Digital Sky Survey (SDSS) galaxy dataset, where it works well in capturing the empty territories.


XIN XU AND JESSI CISEWSKI-KEHE

1. Introduction. Given a point cloud dataset, e.g., galaxy positions in astronomy or plant locations in forestry, it can be of interest to investigate the geometrical or topological structure of the underlying data. Topological data analysis (TDA) is an approach for learning about the underlying structure of a dataset, and one of the popular tools of TDA is persistent homology [9,3,41,6,10,13,36]. Intuitively, homology groups refer to different dimensional holes in a manifold: the 0th-dimensional homology group (H_0) is generated by connected components, the 1st-dimensional homology group (H_1) is generated by loops, the 2nd-dimensional homology group (H_2) is generated by voids (the boundary of an empty region in 3D, e.g. the boundary of a soccer ball), and so on. For a point cloud dataset, all the data points are separate connected components so the dimension of H_0 would be the number of points in the dataset, and there would not be any higher dimensional homology groups. In order to find higher dimensional homology groups in a point cloud dataset, an intermediate structure is needed. There are different approaches for defining this structure, and one method is to sequentially connect points based on their distances from each other. In this process, edges and loops form, along with potentially higher dimensional homology features. For reasons that will be discussed in more detail later, certain features appear and disappear during this process, and how long a feature exists is called its persistence. Persistent homology is a framework for building a sequence of homology groups from a dataset. It provides a graphical summary called a persistence diagram, which summarizes the persistence of each feature that appears and disappears along the sequence.
Persistent homology has been used in many different settings. In [39], H_1 generators are used to detect 'tie-back' structures in natural language processing; in particular, that work proposed the similarity filtration with time skeleton (SIFTS) algorithm, which builds a nested sequence of sets that takes text ordering into consideration in its construction. [2] use persistent homology to study brain artery trees by analyzing the persistences of H_0 and H_1 generators; principal component analysis and a permutation test are applied to the persistences to understand the influence of age, sex, and artery length on artery tree structure. [15] use barcodes, another graphical presentation of a persistence diagram in which a collection of horizontal segments has lengths indicating the feature persistences, to detect periodic structure in dynamical systems, and achieve high accuracy in detecting abnormal human lung sounds. [37] introduced the molecular topological fingerprint (MTF) as a summary statistic to analyze protein structure, flexibility and folding stability; moreover, an accumulated bar length is proposed to study the structure composition of protein evolution and rigidity.
Persistent homology detects homology group generators and summarizes them in a persistence diagram, but there is not a rigorous framework to locate a desirable representation of the homology group generators back in the original dataset, which can be useful information (e.g., [38]). One major issue with finding representations of homology group generators in the original dataset is that there is not a unique representation. For example, in Fig. 1 all three loops are representations of the same hole and they are equivalent in the sense that they all encircle the same hole. The widely used algorithm of [9] does provide locations of homology group generators as a subset of data points around the hole by tracking cycles that represent various homology groups during the computation. Unfortunately, the output locations are not necessarily a desirable representation of the homology group generators. In Fig. 1, although the three loops are equivalent in representing the hole, they look very different geometrically. In this paper, the goal is to construct a homology group generator representation with a certain complexity (explained later) that outlines the empty region; that is, the region inside the representation that does not contain any data point. The orange loop on the right of Fig. 1 illustrates a possible desired representation of the hole with a certain level of complexity, while the red and blue loops contain data points inside them. As current algorithms do not constrain the representation of the hole that is returned, we propose a post-analysis method called Empty Territory (EmT) to refine the representations of homology group generators reported from the algorithm of [9]. The proposed representation is useful when the goal is to find the truly empty territories. However, this is not necessarily always going to be the desirable representation.
Note that requiring emptiness restricts the method to locating H 1 generators in 2D data and H 2 generators in 3D data since, for example, there is not a general notion of emptiness of an H 1 generator in 3D data.
Figure 1. The three loops are equivalent in the sense that they all encircle the same hole. However, it can be desirable to find a representation that captures the empty territory, with no data point inside. The first two loops contain data points inside them, while the third one is empty. EmT is a method for finding the third type of representation at a specified level of complexity.

2. Background of persistent homology. Next we introduce some basic ideas of persistent homology; additional details can be found in [11]. Given a point cloud dataset, in order to gain meaningful homological structure, an intermediate structure is needed. A common structure used for this purpose is a simplicial complex, which is built to compute the homology on varying scales. A k-simplex σ is the convex hull of k + 1 affinely independent points X = {x_0, x_1, . . . , x_k} ⊂ R^d, denoted as σ = conv X. A face of σ is a simplex generated by a subset of the set of points that generate σ, denoted as conv S, where S ⊂ X. A simplicial complex K is a collection of simplices of different dimensions such that (i) every face of a simplex in K is also in K, and (ii) any intersection of two simplices is either empty or a face of both. Persistent homology computes homology groups on data by building simplicial complexes from the data along a sequence of filtration parameter values; the parameter starts at 0 and, as it increases, defines the changing simplicial complexes in the filtration. For example, let the filtration parameter δ be a threshold for pairwise distances between data points. At the beginning, δ = 0 and there are only the original data points, which are 0-simplices. As δ increases, when there is a pair of points whose distance is at most δ, an edge between these two points is added to the simplicial complex. If a set of k + 1 points has all pairwise distances at most δ, the corresponding k-simplex (e.g., a triangle or a tetrahedron) is added to the simplicial complex. Different dimensional holes in the simplicial complex appear and disappear during the process of adding more simplices into the simplicial complex, e.g. loops formed by edges in the simplicial complex can manifest as H_1 generators. Eventually all the data points become interconnected and the whole simplicial complex becomes a single connected component (H_0 generator).
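The distance-threshold construction described above can be sketched in a few lines. This is a minimal brute-force illustration, not the algorithm of [9]; the function name `vietoris_rips`, the `max_dim` cutoff, and the three sample points are assumptions made for the example.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, delta, max_dim=2):
    """Return the simplices (tuples of point indices) present at scale delta."""
    n = len(points)
    # Pairs of points close enough to be joined by an edge.
    close = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= delta}
    simplices = [(i,) for i in range(n)]           # 0-simplices: the data points
    for k in range(1, max_dim + 1):                # a k-simplex has k + 1 vertices
        for verts in combinations(range(n), k + 1):
            # Add the simplex only if every pair of its vertices is close.
            if all(pair in close for pair in combinations(verts, 2)):
                simplices.append(verts)
    return simplices


# Hypothetical three-point dataset (illustrative coordinates only).
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.8)]
print(vietoris_rips(points, 0.5))   # only the three vertices
print(vietoris_rips(points, 1.0))   # vertices, three edges, and the triangle
```

Enumerating all vertex subsets is exponential in the number of points, so this is only practical for toy inputs; it is meant to make the rule "add a simplex when all pairwise distances are within δ" concrete.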
H p generators (for some homology dimension p) appearing and disappearing at different δ values during the process can be thought of as having a birth and death at different times in the filtration, with the change of δ value between its birth and death time being its persistence. One way to interpret a homology group generator that has higher persistence is as topological signal, while a homology group generator with smaller persistence can be considered topological noise [16].
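For H_0 the birth and death times can be computed directly, which gives a small worked instance of persistence. The sketch below assumes the distance-threshold filtration described above, in which every point is born at δ = 0 and a connected component dies when an edge first merges it into another component; the function name `h0_persistence` and the sample points are illustrative assumptions.

```python
from itertools import combinations
from math import dist, inf


def h0_persistence(points):
    """Return (birth, death) pairs of H_0 generators, in order of death."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process edges in order of the delta value at which they appear.
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # the edge merges two components
            parent[ri] = rj
            pairs.append((0.0, length))   # one component dies at this delta
    pairs.append((0.0, inf))              # the last component never dies
    return pairs


print(h0_persistence([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]))
```

The persistence of each generator is its death time minus its birth time, so in this example the component that survives until δ = 2 is more persistent than the one that dies at δ = 1.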
Given a point set of data, there are several ways to build a simplicial complex [8,35,4]. A common approach in TDA is to use the Vietoris-Rips (VR) complex construction [34], which requires only pairwise intersections to form a simplex [18,11], as was used in the previous example.

Definition (Vietoris-Rips Complex). Given a point set X = {x_1, x_2, . . . , x_n} in R^D and a filtration parameter δ > 0, let VR(X, δ) be the VR complex formed from X for this δ. For any k-point subset, the corresponding (k − 1)-simplex σ_i = conv{x_{i_1}, x_{i_2}, . . . , x_{i_k}} is in VR(X, δ) if and only if every two-point pair in the subset satisfies d(x_{i_j}, x_{i_l}) ≤ δ.

For example, consider a point cloud dataset X = {a, b, c} in R^2 as in Fig. 2a, and a filtration parameter δ > 0. First, VR(X, δ) contains all 0-simplices: a, b and c. Next, three circles centered at a, b, c with radius δ/2 are drawn in Fig. 2a. As the three circles intersect pairwise, we have d(a, b) ≤ δ, d(b, c) ≤ δ, and d(a, c) ≤ δ. Thus VR(X, δ) contains edges ab, ac, bc, and also the triangle abc, so that VR(X, δ) = {a, b, c, ab, ac, bc, abc}, as displayed in Fig. 2b. Different δ values result in different VR(X, δ) complexes, with VR(X, 0) = X and VR(X, ∞) = {conv S, ∀S ⊂ X}. An increasing sequence of δ values, 0 < δ_1 < δ_2 < · · · < δ_k < ∞, leads to a filtration of complexes:

$$VR(X, \delta_1) \subseteq VR(X, \delta_2) \subseteq \cdots \subseteq VR(X, \delta_k). \tag{1}$$

A sequence of the p-dimensional homology groups corresponding to this filtration can be determined as below:

$$H_p(VR(X, \delta_1)) \to H_p(VR(X, \delta_2)) \to \cdots \to H_p(VR(X, \delta_k)). \tag{2}$$

Along the sequence of homology groups, an H_p generator has a birth time δ_b and a death time δ_d if it first appears in H_p(VR(X, δ_b)) and first disappears in H_p(VR(X, δ_d)), denoted as (b, d, p). For all the homology group generators ever present in (2), we have a collection of birth and death times and their corresponding homology group dimension, (b_1, d_1, p_1), (b_2, d_2, p_2), . . . , (b_m, d_m, p_m), where m is the total number of generators. These triples can be used to produce a persistence diagram. Fig. 3a displays the same point cloud used in Fig. 1, and Fig. 3b shows the corresponding persistence diagram.

3. Method. After computing a persistence diagram, it can be useful to consider where the homology group generators appear in the data. However, this goal is not well-defined because there is an equivalence class of representations for each homology group generator (see Fig. 1). The algorithm of [9] can provide a representation for each homology group generator by tracking compositions of simplicial complexes during the computation. The order in which the data points are input can affect the representation that gets reported. Therefore, it may be desirable to find a representation of homology group generators that is more robust to the order of input data. The proposed method, EmT, is a post-analysis approach for generating representations and is designed to find the empty regions of homology group generators at a specified level of complexity. EmT uses alpha shapes [12], which are a generalization of the convex hull, in such a way that the inner boundary of a homology group generator can be detected, and maps the boundary to original data points as a representation of the homology group generator. Before introducing our algorithm, we first introduce some necessary concepts and definitions.
3.1. Concave hull and inner shell. In order to find a representation of the hole, it is necessary to determine where the actual empty region is located. This work proposes a framework that uses a concave hull and inner shell to detect the outer and inner empty regions given a collection of data points associated with a homology group generator. For a set of points, the external boundary of the points can be captured by the convex hull of the point set. To obtain a representation of the hole inside the set of points, an alpha shape is used to separate the inner empty regions from the region outside the collection of points. As noted, an alpha shape is a generalization of the convex hull of a set of points, which is a subgraph of the Delaunay triangulation [12]. Alpha shapes can be intuitively explained by an example using a collection of small balls taking up volume (or area in 2D). Consider the space in R 3 around a set of points. Small balls (or circles in 2D) of a fixed radius fill the volume of the space without enclosing any data point inside the balls. In particular, the balls can fill in any empty space among data points. After filling the space as much as possible, the remaining unoccupied volume is an alpha hull [12]. If we straighten all the round faces carved by spheres to flat faces (or straighten the arcs to segments in 2D), it becomes an alpha shape. A formal definition of an alpha shape is given next [14], where S is a point set in R 2 or R 3 .
Definition. An α-ball is an open ball b with radius α, with ∂b representing the boundary of the ball.

Definition. A k-simplex σ (k = 0, 1, or 2), which is the convex hull of a subset T ⊂ S, is called α-exposed if there exists an empty α-ball b without any point of S inside, such that T = ∂b ∩ S.
Definition. Let S α be an alpha shape with radius α, then the boundary ∂S α consists of all the α-exposed simplices of S.
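For edges in 2D, the α-exposed condition can be checked directly from the definitions above: an edge is α-exposed if one of the two circles of radius α through its endpoints has no other data point strictly inside. The following is a brute-force sketch under the assumption that the points are in general position; the function name `alpha_exposed_edges`, the tolerance `eps`, and the square example are illustrative and not part of the original method.

```python
from itertools import combinations
from math import dist, sqrt


def alpha_exposed_edges(points, alpha, eps=1e-9):
    """Return index pairs (i, j) whose edge is alpha-exposed in a 2D point set."""
    exposed = []
    for i, j in combinations(range(len(points)), 2):
        (px, py), (qx, qy) = points[i], points[j]
        d = dist(points[i], points[j])
        if d == 0 or d > 2 * alpha:
            continue                      # no circle of radius alpha passes through both
        mx, my = (px + qx) / 2, (py + qy) / 2
        h = sqrt(max(alpha ** 2 - (d / 2) ** 2, 0.0))   # midpoint-to-center distance
        ux, uy = -(qy - py) / d, (qx - px) / d          # unit normal to the edge
        for sign in (1, -1):              # the two candidate ball centers
            center = (mx + sign * h * ux, my + sign * h * uy)
            # The open ball is empty if no other point lies strictly inside it.
            if all(dist(center, p) >= alpha - eps
                   for k, p in enumerate(points) if k not in (i, j)):
                exposed.append((i, j))
                break
    return exposed


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(alpha_exposed_edges(square, 10.0))   # the four sides; the diagonals are not exposed
print(alpha_exposed_edges(square, 0.4))    # no edges: every pair is farther apart than 2 * alpha
```

With a very large α the exposed edges are the convex hull edges, and as α shrinks more concave detail (and eventually interior holes) appears, which is the sense in which α controls complexity.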
An example of an alpha hull and alpha shape for the same dataset as in Fig. 1 is displayed in Fig. 4a. The green circles illustrate the α-balls occupying the area around the dataset. The remaining part outlined by red curves is the alpha hull, which generates the alpha shape by straightening the arcs, as drawn in blue lines. If we only consider the outer part of the alpha shape, it becomes a concave hull. Intuitively, a concave hull is a polygon that encloses all the points with less area than a convex hull, but the concave hull's perimeter is always longer [17]. There is not a universal definition of the concave hull, and multiple approaches have been proposed to generate concave hulls [25,1]. Generally, there is a smoothness parameter to adjust the shape of the concave hull, e.g. a value to bound the internal angle degrees of the polygon [1]. We define a concave hull using ideas related to alpha shapes except here the balls cannot arbitrarily fit in any empty space among the data points, but only the external space around the set of points. To determine which region is outside the set, an additional step is needed, which is presented next.
To define the concave hull, we first define an external point and an external path. An external point, denoted as p_e, is defined as a point whose distance to any point in S is larger than D, where D is the largest distance between any pair of points in S. That is, p_e is at an external location away from any data point in S. An external path is a chain of α-balls connecting a point to p_e, formally defined below.
Definition. For any point u, if there exists a set of pairwise intersected α-balls (not necessarily with the same radius), denoted as P, that connects u to p_e, then P is an external path of u.

Definition. A k-simplex σ (k = 0, 1, or 2), which is the convex hull of a subset T ⊂ S, is α-outer-exposed if (i) there exists an empty α-ball b without any point of S inside, such that T = ∂b ∩ S; (ii) there exists a point u in the α-ball b, and an external path P of u, such that P ∩ S_α = ∅, where S_α is the alpha shape of S.

Definition. Let CH_α be a concave hull with radius α, then the boundary ∂CH_α consists of all the α-outer-exposed simplices of S.

Fig. 4b shows a concave hull in blue lines. Since the goal is to obtain the representation of a hole, the inner boundary of an alpha shape is of more interest, which can be used to generate a representation of the hole. Similar to the definition of concave hull, we consider the inner part of an alpha shape and define it as an inner shell, as shown in Fig. 4c.

Definition. A k-simplex (k = 0, 1, or 2) σ = conv T of S is α-inner-exposed if (i) there exists an empty α-ball b without any point of S inside, such that T = ∂b ∩ S; (ii) for any point u in the α-ball b, there does not exist an external path P of u, such that P ∩ S_α = ∅, where S_α is the alpha shape of S.
Definition. Let IS α be an inner shell with radius α, then the boundary ∂IS α consists of all the α-inner-exposed simplices of S.
As displayed in Fig. 1, a single homology group generator can have multiple topologically equivalent representations. To find a particular representation, we consider a special class of representations for a hole, analogous to the idea of the inner shell. Balls are still used to occupy the empty regions, but they occupy regions only from the interior and need not occupy as much space as possible; they can occupy less space than the inner shell. The data points touched by these balls are the resulting representation set. (If the balls do not touch any data point, the representation set is empty.) This defines a particular class of representations for any fixed α value. Any representation within this class has no data point inside and captures the empty region. The parameter α works as a complexity parameter for each representation class, since it controls the radius of the α-balls and so the maximum number of points in each representation within this class.
Definition. Let R_{α,i} be the ith representation of a homology group generator with α as the radius of the α-balls, then R_{α,i} consists of (not necessarily all) 0-simplices of S which are α-inner-exposed. R_α = {R_{α,i}, i = 1, . . . , n_α}, where n_α is the number of possible representations.

Fig. 5c and Fig. 5d are the representations corresponding to IS_0.8 and IS_0.6: the balls take up as much space as possible, and so the resulting representation contains as many points as possible. The vertex set of IS_α is the "largest" representation in the sense that the balls take up the most space and the representation contains the most points.
3.2. Algorithm. As discussed above, for a certain complexity α, the vertex set of IS_α has the most points among all the representations in R_α. Therefore, the vertex set of IS_α is used as the output representation set of the EmT algorithm. As mentioned before, the persistent homology algorithm of [9] reports data points that are associated with the homology group generators on a persistence diagram; the vertex set of IS_α is then obtained based on those locations. For a VR complex, the details of the EmT approach are contained in Algorithm 1, and summarized next.
Algorithm 1 EmT Algorithm
Step 1: Build a convex hull on the representation point set S_X reported by the persistent homology algorithm of [9], and denote the data points inside the convex hull as set S'_X.
Step 2: Build an alpha shape S_α on S'_X and get the circle (or sphere in 3D) centers C_α of its corresponding alpha hull.
Step 3: Keep the ball centers in C_α that are inside CH_α as C'_α, and delete the others, which are outside CH_α.
Step 4: Select the vertices of arcs (or spheres) touched by circles (or spheres) whose centers are in C'_α to be the vertex set of IS_α, the representation points of the hole.

Let the group generator location points returned by the persistent homology algorithm for any group generator in point cloud S be set S_X. S_X does not necessarily provide a good representation of the homology group generator, but can be used as a starting point to find a more desirable representation. Build a convex hull on the set S_X and denote it as conv S_X. Then select all the data points inside conv S_X to be set S'_X, as shown in Fig. 6a, where the same dataset from Fig. 1 is used to illustrate the EmT algorithm. In Fig. 6a, the red circle points are the set S_X, the green lines are conv S_X, and the green exes are set S'_X. We will only use set S'_X in the following steps.
On the set S'_X of data points inside conv S_X, build an alpha shape S_α with alpha value α, which is the radius of the balls as mentioned above. The inner shell IS_α gives a specific boundary shape of the empty region with the complexity level α. To obtain IS_α, consider the centers of the balls of the alpha hull, denoted as C_α. The green points in Fig. 6b are the set C_α.
If a ball is on the inner boundary of the alpha hull, its center is inside CH_α. Assign each point in C_α that is inside CH_α to a set C'_α, drawn as the red pluses in Fig. 6c. Let the data points touched by the circles (or spheres) with centers in C'_α form the vertex set of IS_α, drawn as the blue diamonds in Fig. 6d, which is the output of the EmT algorithm.
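The idea of keeping only the balls that cannot reach the outside, and reporting the data points they touch, can be illustrated with a simple 2D stand-in. The sketch below is a grid-based flood-fill approximation and not the alpha-shape computation of Algorithm 1: candidate ball centers are restricted to grid nodes, a node is "free" if an α-ball centered there contains no data point, free nodes reachable from the border of the grid are treated as external, and the remaining free nodes play the role of the inner ball centers. The function name `inner_touched_points`, the grid spacing `step`, the touch tolerance of two grid steps, and the ring dataset are all assumptions made for illustration.

```python
from collections import deque
from math import cos, dist, pi, sin


def inner_touched_points(points, alpha, step):
    """Indices of points touched by empty alpha-balls that cannot reach the outside."""
    pad = 2 * alpha                                   # margin so the border is point-free
    x0 = min(p[0] for p in points) - pad
    y0 = min(p[1] for p in points) - pad
    nx = int((max(p[0] for p in points) + pad - x0) / step) + 1
    ny = int((max(p[1] for p in points) + pad - y0) / step) + 1

    def node(i, j):
        return (x0 + i * step, y0 + j * step)

    # A grid node is free if an alpha-ball centered there contains no data point.
    free = {(i, j) for i in range(nx) for j in range(ny)
            if all(dist(node(i, j), p) >= alpha for p in points)}
    # Flood fill from the border: these free nodes are connected to the outside.
    queue = deque(c for c in free if c[0] in (0, nx - 1) or c[1] in (0, ny - 1))
    outer = set(queue)
    while queue:
        i, j = queue.popleft()
        for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if nb in free and nb not in outer:
                outer.add(nb)
                queue.append(nb)
    inner = free - outer                              # ball centers enclosed by the data
    tol = 2 * step                                    # slack for the grid discretization
    return {k for k, p in enumerate(points)
            if any(dist(node(i, j), p) <= alpha + tol for i, j in inner)}


# Hypothetical data: 40 points on a unit circle plus one stray point outside it.
ring = [(cos(2 * pi * k / 40), sin(2 * pi * k / 40)) for k in range(40)]
result = inner_touched_points(ring + [(2.5, 0.0)], alpha=0.3, step=0.05)
print(sorted(result))   # the ring indices; the stray point (index 40) is not touched
```

The grid makes the result depend on `step`, and the cost grows with the number of grid nodes times the number of points, so this is a conceptual aid rather than a substitute for computing the alpha shape.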
Remark. H 1 generators can have points inside a loop (i.e., a loop is not necessarily empty). The capability of persistent homology detecting the inner points or discovering the correct loop depends on specific situations. Four examples are displayed in Fig. 7, showing the same loop with different point arrangement inside. In Fig. 7a, the raw representation of persistent homology is not able to detect the inner points, while EmT does include the inner points in its representation. In Fig. 7b, both the raw representation and EmT have the inner points in their representations. In Fig. 7c, there are more points inside the loop and the points are more spread out. While the raw representation contains the bigger loop with the inner points, the EmT representation includes two separate loops, as indicated by the corresponding alpha hulls. In Fig. 7d, the raw representation is not able to identify the bigger loop and so the EmT representation is not able to find the bigger loop either because EmT uses the raw representation as input.
Remark. The α can be selected based on the specific scenario and the desired complexity of the representation. As a starting point, we suggest using a value between (birth time)/2 and 3 × (birth time)/2 (where birth time is the filtration parameter value when this homology group generator first appears). Note that if the α value is too large, there will not be a hole in the alpha shape, and if it is too small a loop may not be found.

The first example is a homogeneous Poisson point process. There are no true underlying holes since the homogeneous Poisson point process has complete spatial randomness [5], but some holes may appear randomly. EmT is used to locate the boundaries of these random holes. The second example is a uniform distribution in a 3D cube, with two embedded empty regions of different shapes, and EmT is used to recover the boundaries of the empty regions.

The EmT approach is used to see whether the boundary of empty regions could be reasonably traced. Fig. 8a shows the persistence diagram of the dataset.

Cell tower. A geographical dataset of cell tower locations in Minnesota is used as a 2D example in this section. Each cell tower has a working range affected by many conditions, such as signal frequency and cell tower height. A region on the map with no cell towers potentially suggests limited cell phone signal within this region. We apply the EmT algorithm to the cell tower location dataset, and the resulting H_1 generator representations are the areas with potentially limited cell phone signal. The dataset is shown in Fig. 10a with a geographic view generated from the Google Maps Static API using the R package ggmap [7]. The resulting persistence diagram and the raw representations of the eight H_1 generators with the largest persistences reported by the persistent homology algorithm are displayed in Fig. 10c and Fig. 10d, respectively. Fig. 10e shows the representations of the same H_1 generators from EmT. In particular, the yellow representation on the right of Fig. 10d has one winding loop containing many cell towers inside. The corresponding representation in Fig. 10e has two small loops, each circling a small empty region. This is because EmT only finds the empty territories, and so for a certain α value, the alpha shape of this H_1 generator contains two holes inside, which results in two small loops. In the bottom-left corner of Fig. 10d, there are two generators (orange and cyan) overlapping with each other. This is because the orange generator forms first and then the smaller cyan loop forms with several edges overlapping the orange one. In Fig. 10e, the orange representation contains two loops, one of which is exactly the same as the cyan representation. The top-left blue raw representation loop in Fig. 10d contains a larger area than is uncovered by the corresponding EmT representation in Fig. 10e. The blue raw representation loop throughout the filtration appears to segment into potentially three smaller loops, including the identified green loop.
The number of loops found by the EmT algorithm depends on the complexity parameter α; if a smaller complexity parameter α was used, the additional empty regions of the blue raw loop could have been uncovered. The corresponding geographic view of the blue loop in Fig. 10e results in an interesting finding: the loop encloses a large portion of the Red Lake Indian Reservation, displayed in Fig. 10b.

SDSS galaxy dataset. In the universe, regions of low matter density are called cosmic voids. Cosmic voids are considered useful in studying the universe structure and galaxy evolution [27,29,28]. Many methods have been developed to detect cosmic voids in the universe [26,30,32,22,20,38]. In this section, we use persistent homology to find cosmic voids similarly to the work in [38]. While [38] used a smoothed distance function to construct the filtration and the reported representations are not necessarily empty inside, this work builds the filtration using the VR complex as introduced in Sec. 2 and outputs representations with empty regions inside. The EmT approach is applied to get the representations of cosmic voids detected by persistent homology. The Sloan Digital Sky Survey (SDSS) main galaxy redshift survey [31] is used. The 3D dataset is recorded in the equatorial coordinate system, which is a celestial coordinate system used to describe the positions of celestial objects such as galaxies. It takes the Earth as the origin and projects the Earth's equator onto the celestial sphere to form the celestial equator.
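Equatorial coordinates give two angles (right ascension and declination), so a Cartesian position additionally needs a radial distance. The sketch below shows one common way to obtain it, and is an assumption rather than the procedure used in this work: it uses the low-redshift Hubble-law approximation r = c z / H_0, with an illustrative value H_0 = 70 km/s/Mpc. The function name `equatorial_to_cartesian` and its parameters are hypothetical.

```python
from math import cos, radians, sin

SPEED_OF_LIGHT_KM_S = 299792.458


def equatorial_to_cartesian(ra_deg, dec_deg, redshift, hubble=70.0):
    """Map (right ascension, declination, redshift) to (x, y, z) in Mpc.

    Assumes the low-redshift distance r = c * z / H0; the default H0 is an
    illustrative choice, not a value taken from the paper.
    """
    r = SPEED_OF_LIGHT_KM_S * redshift / hubble
    ra, dec = radians(ra_deg), radians(dec_deg)
    # Standard spherical-to-Cartesian conversion with the Earth at the origin.
    return (r * cos(dec) * cos(ra), r * cos(dec) * sin(ra), r * sin(dec))


print(equatorial_to_cartesian(0.0, 0.0, 0.01))    # lies on the positive x-axis
print(equatorial_to_cartesian(90.0, 0.0, 0.01))   # lies (numerically) on the positive y-axis
```

For the redshift range of a nearby galaxy sample the linear approximation is usually adequate, but a cosmology-dependent comoving distance would be needed at larger redshift.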
In this work, a subset of the catalog called dim1 [32] is used, with 0.0 < z < 0.05 (z is redshift), containing 57795 galaxies. We then transform the dataset into Cartesian coordinates as input to the persistent homology algorithm. When the size of a dataset is large, the VR complex becomes computationally prohibitive, as the computation time grows exponentially with the number of observations [40]. In addition to the computational time, large datasets also require substantial memory. For the SDSS dataset, 57795 data points are computationally prohibitive. To mitigate this issue, we consider a k-means clustering approach [24,19,23] to reduce the size of the data.
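The data-reduction step can be sketched with plain Lloyd-style k-means, where the k cluster centers replace the original points as input to persistent homology. This is a minimal standard-library illustration and not necessarily the k-means variant of [24,19,23]; the function name `kmeans_centers`, the random initialization at data points, the fixed seed, and the toy data are assumptions.

```python
import random
from math import dist


def kmeans_centers(points, k, iterations=100, seed=0):
    """Return k cluster centers computed by Lloyd's algorithm."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # initialize at k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[c]
            for c, cl in enumerate(clusters)
        ]
        if new_centers == centers:             # assignments have stabilized
            break
        centers = new_centers
    return centers


# Hypothetical 3D points forming two well-separated groups.
galaxies = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
            (10.0, 10.0, 10.0), (10.1, 10.0, 10.0), (10.0, 10.1, 10.0)]
reduced = sorted(kmeans_centers(galaxies, 2))
print(reduced)   # one center near each group
```

Because each center averages nearby points, isolated points have little influence on the reduced dataset, which is consistent with using the centers as a smaller, less noisy input.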
To investigate the impact of k-means clustering on 3D data with a structure similar to the large-scale structure of the Universe, an example is presented to see how well H_2 generators are detected as k changes. Fig. 11a displays a Voronoi foam simulation dataset, which is used as an approximation of the structure of the universe [21,33]. A Voronoi foam simulation dataset is generated by a set of randomly sampled void seeds (which specify the locations of the voids) and their Voronoi tessellation; the simulation procedure is explained in detail in [38]. Since the void locations are known, the matching of H_2 generators to void seeds can be calculated by comparing distances of the H_2 generators to each void seed. We use eight void seeds to generate a Voronoi foam dataset with 1200 data points. In Fig. 11a, the eight large red points with blue labels are void seeds, and void regions are generated around each of the seed points. The green pluses illustrate one of the eight void regions, which corresponds to the void seed with label 1. Since we know the locations of the eight void seeds, we can check whether persistent homology could find each of the eight voids. Using the original 1200 data points, the eight H_2 generators with the largest persistences match seven of the eight void regions (there are seven H_2 generators that have obviously larger persistence than the other H_2 generators). The void region with label 2 in Fig. 11a, which is smaller in volume than the others, is not detected.
Next, the numbers of detected voids are compared among different k values for k-means. Consider using k-means clustering with k equal to 5%, 10%, 15%, 20%, 25% and 30% of the total number of points in the original dataset (denoted as n). Only the k centers obtained by k-means clustering are used as the input for persistent homology. The resulting persistence diagrams are shown in Fig. 12. By checking the eight H_2 generators with the largest persistences of each persistence diagram, we can see how the results differ for different k values. Tab. 1 summarizes the performance for each k value. When k/n = 5%, only five of the eight voids are identified, while k/n = 10% and 15% find six of the eight voids. When k/n = 20%, the same H_2 generators are detected as with the original dataset; when k/n = 25% and k/n = 30%, all eight voids are detected, which is better than with the original dataset. This improved performance may be due to the noise-reducing effect of k-means clustering, because the sparse noise points are removed after k-means clustering. From this experiment, it seems a range from 5% to 30% locates most of the voids, and therefore any of these values may be reasonable for the SDSS dataset. Using k = 5000, which is approximately 10% of the SDSS dataset, the number of input points for persistent homology is reduced to 5000. Applying persistent homology to the reduced dataset, the seven cosmic voids (H_2) with the highest persistences are selected for EmT to find improved representations. The raw representations reported by the persistent homology algorithm and the representations from EmT are displayed in Fig. 13a and Fig. 13b, respectively. Since it is a real dataset with the true H_2 generators unknown, we cannot obtain any quantitative conclusion. Qualitatively, Fig. 13b appears to have more accurate representations of the empty regions, as expected from EmT, while the representations in Fig. 13a do not define the empty regions clearly. For example, the cosmic void colored cyan is highlighted in Fig. 13c. The volume inside the raw representation from persistent homology is shaded in blue and galaxies inside the representation are displayed as red points. In Fig. 13d, the representation from EmT is shaded in blue and there is no galaxy point inside the volume.
6. Concluding remarks. In this work, we proposed EmT, a post-analysis approach for persistent homology to find a particular type of representation of each homology group generator that traces the empty territory. Persistent homology using a VR complex was introduced in Sec. 2. To obtain the EmT representations that capture empty regions, a concave hull and inner shell are used, as defined in Sec. 3.1 and motivated by alpha shapes. The concave hull is used to find the external boundary of the alpha shape with a complexity parameter α (i.e., the radius of the ball), while the inner shell defines the interior empty territory of the alpha shape with the same complexity parameter. With a fixed complexity parameter α, a special class of representations using the space-filling balls is considered. Within this class of representations, the vertex set of the corresponding inner shell has the most points and is the output of EmT. The EmT algorithm for finding a desirable representation of H_1 generators in 2D data and H_2 generators in 3D data is proposed in Sec. 3.2. For large datasets, persistent homology becomes computationally intractable due to both time and memory limits. An approach to deal with large datasets is proposed in this work, using k-means clustering to reduce the size of a dataset. An illustrative example with a Voronoi foam dataset is shown in Sec. 5.2. The performances of k/n equal to 5%, 10%, 15%, 20%, 25%, and 30% are compared in Tab. 1. Based on this example, a k larger than 5% of the total number of data points is a reasonable choice for galaxy datasets.
The proposed algorithm is designed for locating H_1 generators in 2D and H_2 generators in 3D, as alpha shapes do not have a counterpart for H_1 generators in 3D. Finding, for example, H_1 features in 3D point cloud data remains an open problem. Moreover, this work does not consider functional filtrations (e.g. [4] and [16]) of persistent homology, although an approach for that setting was proposed in [38]. The method of [38] does not find representations that are empty, although using a functional filtration is computationally faster than the VR complex filtration. In addition, a functional filtration typically requires a grid, which loses information about the data in its representation of homology group generators. Though not always the case, it can be desirable to account for small scale features in the data (and sometimes even outliers), which can be done with a VR complex filtration, while a functional filtration generally washes out these features.