Power Weighted Shortest Paths for Clustering Euclidean Data

We study the use of power weighted shortest path distance functions for clustering high dimensional Euclidean data, under the assumption that the data is drawn from a collection of disjoint low dimensional manifolds. We argue, theoretically and experimentally, that this leads to higher clustering accuracy. We also present a fast algorithm for computing these distances.

Fig. 1. Three sample geodesics in the power weighted shortest path metric with p = 2, for the data set "Three Lines" (see §6). Observe how the geodesics consist of many small hops, instead of several large hops. The total lengths of the red and green paths are significantly smaller than the length of the blue path.

Examples of such data include motion segmentation [12,2], images of faces or objects taken from different angles or under different lighting [4,18], and handwritten digits [27]. It is also usually assumed that the dimension of each M_a is much lower than the ambient dimension D. Although it can be shown that taking d(·,·) to be the Euclidean distance can be successful for such data [3], data-driven metrics have been increasingly favored [10,7,9,22].
Once d(·,·) has been chosen, A can be constructed. A common choice [25,30] is to use some variant of a Gaussian kernel, whereby A_ij = exp(−d^2(x_i, x_j)/σ^2) for a user-defined parameter σ. However this is unsuitable for large data sets, as the resulting similarity matrix is dense and thus may be too large to store in memory. Moreover, operations which form core parts of many clustering algorithms, for example computing eigenvectors, are prohibitively expensive when A is full. Hence in this case, a common choice that yields a sparse similarity matrix is to use a k nearest neighbors (k-NN) graph, constructed as:

A_ij = 1 if x_i is among the k nearest neighbors of x_j with respect to d(·,·), and A_ij = 0 otherwise.

Here k is a user-specified parameter. In this article we consider taking d(·,·) to be a power weighted shortest path metric (henceforth: p-wspm, defined in §2.2), with an emphasis on cases where the data satisfies the manifold hypothesis and where the data set is so large that a k-NN similarity matrix is preferable to a full similarity matrix. The use of shortest path metrics in clustering is not new (see the discussion in §2.4), but has typically been hindered by high computational cost. Indeed, finding the pairwise distance between all x_α, x_β ∈ X in the shortest path metric is equivalent to the all-pairs shortest paths problem on a complete weighted graph, which requires O(n^3) operations using the Floyd-Warshall algorithm. We provide a way around this computational barrier, and also contribute to the theoretical analysis of p-wspm's. Specifically, our contributions are:
1. We prove that p-wspm's behave as expected for data satisfying the manifold hypothesis. That is, we show that the maximum distance between points in the same cluster is small with high probability, and tends to zero as the number of data points tends to infinity. On the other hand, the maximum distance between points in different clusters remains bounded away from zero.
2. We show how p-wspm's can be thought of as interpolants between the Euclidean metric and the longest leg path distance (defined in §2.3), which we shall abbreviate to LLPD.
3. We introduce a novel modified version of Dijkstra's algorithm that computes the k nearest neighbors, with respect to any p-wspm or the LLPD, of any x_α ∈ X in O(k^2 T_Enn) time, where T_Enn is the cost of a Euclidean nearest-neighbor query. Hence one can construct a p-wspm k-NN graph in O(n k^2 T_Enn). As we typically have k ≪ n, i.e. k = O(log(n)) or even k = O(1), this means that constructing a p-wspm k-NN graph requires only marginally more time than constructing a Euclidean k-NN graph (which requires O(n k T_Enn)).
4. We verify experimentally that using a p-wspm in lieu of the Euclidean metric results in an appreciable increase in clustering accuracy, at the cost of a small increase in run time, for a wide range of real and synthetic data sets.
After establishing notation and surveying the literature in §2, we prove our main results in §3 and §4. In §5 we present our algorithm for computing k nearest neighbors in any p-wspm, while in §6 we report the results of our numerical experiments.

2. Definitions and notation. Let us first fix some notation. Throughout this paper, D shall denote the ambient dimension, while X will denote a fixed, finite set of distinct points in R^D. We shall denote the Euclidean (i.e. ℓ^2) norm on R^D as ||·||. For any finite set S, by |S| we shall mean its cardinality. For any positive integer ℓ, by [ℓ] we mean the set {1, 2, ..., ℓ}. Finally, for two functions f(n) and g(n), by f(n) = O(g(n)) we shall mean that there exist constants C and n_0 such that f(n) ≤ C g(n) for all n ≥ n_0. Similarly, by f(n) = o(g(n)) we shall mean that f(n)/g(n) → 0 as n → ∞. Occasionally we shall explicitly indicate the dependence on the variable n (later, also n_a) by writing O_n or o_n.
2.1. Data model. We consider data sets X = X_1 ∪ X_2 ∪ ... ∪ X_ℓ ⊂ R^D consisting naturally of ℓ clusters, which are a priori unknown. Let |X_a| = n_a. We posit that for each X_a there is a smooth, compact, embedded manifold M_a ⊂ R^D such that X_a ⊂ M_a. Let g_a denote the restriction of the Euclidean metric to M_a; then (M_a, g_a) is a compact Riemannian manifold. We shall further assume that X_a is sampled according to a probability density function μ_a supported on M_a and continuous with respect to the metric g_a. For any x, y ∈ M_a let dist_a(x, y) denote the geodesic distance:

dist_a(x, y) := inf{ length(η) : η a piecewise smooth path in M_a from x to y },

and let diam(M_a) := sup_{x,y ∈ M_a} dist_a(x, y). Since each M_a is compact this supremum is in fact a maximum, and diam(M_a) is finite. We assume that the data manifolds are fairly well separated, that is,

(1)  δ := min_{a ≠ b} min_{x ∈ M_a, y ∈ M_b} ||x − y|| > 0.

Note that frequently (for example, in [3]) this model is extended to allow for noisy sampling, whereby for some τ > 0, X_a is sampled from the tube B(M_a, τ) := { x ∈ R^D : min_{y ∈ M_a} ||x − y|| ≤ τ }, but we leave this extension to future work.

2.2. Power weighted shortest path metrics. For any distinct pair x_α, x_β ∈ X and any path γ = x_α → x_{i_1} → ... → x_{i_m} → x_β through X, define the p-weighted length of γ to be:

L^(p)(γ) := ( Σ_{j=0}^{m} ||x_{i_{j+1}} − x_{i_j}||^p )^{1/p},

where by convention we declare x_{i_0} = x_α and x_{i_{m+1}} = x_β. We define the p-weighted shortest path distance from x_α to x_β through X to be the minimum length over all such paths:

(3)  d^(p)_X(x_α, x_β) := min_γ L^(p)(γ).

d^(p)_X is a metric on the set X for p ≥ 1 (see, for example, [19]). As several authors [20,9,1] have noted, the metric d^(p)_X is density-dependent: if x_α and x_β are contained in a region of high density (i.e. a cluster), the path distance d^(p)_X(x_α, x_β) will likely be shorter than the Euclidean distance ||x_α − x_β|| (as long as p > 1).
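To make the definition concrete, the following minimal Python sketch (ours, not from the paper's accompanying code) computes d^(p)_X by brute force: it runs Dijkstra's algorithm on the implicit complete graph with edge weights ||x_i − x_j||^p and takes a p-th root at the end. It is quadratic per query and meant only to illustrate the definition; §5 develops a fast algorithm.

```python
import heapq
import math

def pwspm(points, p, src, dst):
    """d^(p)_X between points[src] and points[dst]: minimise the sum of
    p-th powers of the Euclidean leg lengths over all paths through the
    point set, then take a p-th root."""
    n = len(points)

    def w(i, j):  # edge weight ||x_i - x_j||^p
        return math.dist(points[i], points[j]) ** p

    # Dijkstra on the implicit complete graph.
    dist = [math.inf] * n
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:      # stale heap entry
            continue
        for v in range(n):
            if v != u and d + w(u, v) < dist[v]:
                dist[v] = d + w(u, v)
                heapq.heappush(heap, (dist[v], v))
    return dist[dst] ** (1.0 / p)
```

For three collinear points at 0, 1 and 2 with p = 2, the two-hop path has length (1^2 + 1^2)^{1/2} = √2, beating the direct hop of length 2; for p = 1 both routes tie at the Euclidean distance.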

2.3. Longest-leg path distance. Another common path-based distance is the longest-leg path distance (LLPD), which we shall denote as d^(∞)_X (the choice of this notation should become clear shortly). It is defined as the minimum, over all paths from x_α to x_β through X, of the maximum distance between consecutive points in the path (i.e. the legs). Before formally defining d^(∞)_X, define, for any path γ from x_α to x_β through X, the longest-leg length of γ as:

L^(∞)(γ) := max_{j=0,...,m} ||x_{i_{j+1}} − x_{i_j}||,

where again we are using the convention that x_{i_0} = x_α and x_{i_{m+1}} = x_β. Now, in analogy with (3):

d^(∞)_X(x_α, x_β) := min_γ L^(∞)(γ).

d^(∞)_X is also a metric, in fact an ultrametric [22], on X.

2.4. Prior work. The idea of using p-wspm's for clustering was proposed in [28], and further explored in [26]. Recently, several papers [16,8,22] have considered the use of the LLPD for clustering; in particular, [22] provides performance guarantees for spectral clustering with the LLPD for a data model similar to ours. [9] studies p-wspm's for p ≥ 2 and proposes to use such metrics with density-based clustering algorithms, such as DBSCAN, although they do not provide any experimental results. The paper [7] proposes the use of p-wspm's for semi-supervised learning and provides a fast Dijkstra-style algorithm for finding, for every x ∈ X, its nearest neighbor, with respect to a p-wspm, in some set of labeled data points L. They consider a similar data model to ours, but do not provide any quantitative results on the behaviour of shortest path distances. More generally, shortest path distances are a core part of the ISOMAP dimension reduction algorithm [27], although we emphasize that there not all paths through X are considered: first a k-NN graph G^(k) is computed from X, and only paths in this graph are admissible.
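The LLPD can likewise be computed by a Dijkstra-style search in which a path's cost is the maximum of its legs rather than their sum; this minimax recursion is valid because max is monotone in each argument. A minimal illustrative sketch (ours):

```python
import heapq
import math

def llpd(points, src, dst):
    """Longest-leg path distance d^(inf)_X: minimise, over all paths
    through the point set, the longest Euclidean leg."""
    n = len(points)
    best = [math.inf] * n   # best known longest-leg value to each point
    best[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        b, u = heapq.heappop(heap)
        if b > best[u]:     # stale heap entry
            continue
        for v in range(n):
            if v == u:
                continue
            # longest leg of the path extended by the hop u -> v
            cand = max(b, math.dist(points[u], points[v]))
            if cand < best[v]:
                best[v] = cand
                heapq.heappush(heap, (cand, v))
    return best[dst]
```

For points at 0, 1 and 3 on the line, the path 0 → 1 → 3 has longest leg 2, which beats the direct hop of length 3.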
The asymptotic behaviour of power weighted shortest path distances is analyzed for Euclidean Poisson processes in [19], and for points sampled from an arbitrary probability distribution supported on a Riemannian manifold M in [20]. Note that in [20] the lengths of the legs of the path are measured using the geodesic distance on M, which is not computable in the data model we are considering, as the M_a are unknown. Finally, in [9] the results of [20] are used to show that, for certain k and p ≥ 2, with high probability the Euclidean k-NN graph can be used to compute p-wspm distances in the case where the data is sampled from a single Riemannian manifold M. We discuss this further in §5.3.
On the computational side, we are unaware of any prior mention of Algorithm 2 in the literature, although similar algorithms, which solve slightly different problems, are presented in [17], [23] and [7]. In [7] a variation of Dijkstra's algorithm is presented which starts from a set of labeled data L and expands its search through the rest of the data X. The algorithm is designed to find, for every x_i ∈ X, the nearest point in L to x_i with respect to a p-wspm, and then terminate. Thus, it is not clear how one would extend this algorithm to finding k nearest neighbors for k > 1. The algorithm of [17] is formulated for any weighted graph G = (V, E, A) (i.e. not just graphs obtained from data sets X ⊂ R^D) and as such is not well adapted to the problem at hand. In particular, it has run time O(k(n log(n) + |E|)). Because the distance graph obtained from X is implicitly complete, |E| = O(n^2), and this results in a run time proportional to kn^2, which is infeasible for large data sets. Finally, the algorithm presented in [23], although adapted to the situation of distance graphs of data sets, actually solves a slightly different problem. Specifically, they consider finding the k_1 p-wspm nearest neighbors of each x ∈ X in a k_2 Euclidean nearest neighbors graph of X. As such, it is not clear whether the set of nearest neighbors produced by their algorithm are truly the p-wspm nearest neighbors in X.
Let us also mention that our approach is "one at a time", whereas the other three algorithms mentioned are "all at once". That is, our algorithm takes as input x ∈ X and outputs the k p-wspm nearest neighbors of x; this can then be iterated to find the p-wspm nearest neighbors of all x ∈ X. In contrast, "all at once" algorithms directly return the sets of k nearest neighbors for every x ∈ X. Thus it is possible our algorithm will have applications in other scenarios where the p-wspm nearest neighbors of only some small subset of points of X are required, or in "online" scenarios where new data points are continuously received.

3. p-wspm's as interpolants. We begin by comparing the metrics d^(p)_X for different values of p.

Theorem 3.1. For any fixed x_α, x_β ∈ X, and any fixed path γ from x_α to x_β:
1. If 1 ≤ p < q < ∞, then L^(q)(γ) ≤ L^(p)(γ) ≤ n^(1/p − 1/q) L^(q)(γ).
2. For any 1 ≤ p < ∞, L^(∞)(γ) ≤ L^(p)(γ) ≤ n^(1/p) L^(∞)(γ),
where n := |X|.
Proof. We shall make use of the following well-known facts from analysis: for any fixed m, any v ∈ R^m and any 0 < p < q,

(5)  ||v||_q ≤ ||v||_p ≤ m^(1/p − 1/q) ||v||_q,

and for any fixed m, any v ∈ R^m and any 0 < p < ∞,

(6)  ||v||_∞ ≤ ||v||_p ≤ m^(1/p) ||v||_∞.

Apply (5) to the vector of leg lengths of γ, which lies in R^(m+1). Observe that m + 1, the number of legs in the path, is always less than or equal to n, the number of points in X. Thus (m+1)^(1/p − 1/q) ≤ n^(1/p − 1/q), and part 1 of the theorem follows. In a similar fashion, we obtain part 2 from (6).
Corollary 3.2. For any 1 ≤ p < q < ∞ and any x_α, x_β ∈ X:

d^(q)_X(x_α, x_β) ≤ d^(p)_X(x_α, x_β) ≤ n^(1/p − 1/q) d^(q)_X(x_α, x_β).

Proof. By Theorem 3.1, L^(q)(γ) ≤ L^(p)(γ) ≤ n^(1/p − 1/q) L^(q)(γ) for all paths γ between x_α and x_β, hence the same relationship must hold for the minimum over all paths from x_α to x_β, which, appealing to the definition (3) of d^(p)_X, yields the corollary.
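These sandwich inequalities can be verified numerically on a toy point set by enumerating all simple paths by brute force (exponential in n, so strictly a sanity check; the helper names below are ours):

```python
import itertools
import math

def path_len(pts, path, p):
    """p-weighted length of a path given as a tuple of indices;
    p = math.inf gives the longest-leg length."""
    legs = [math.dist(pts[a], pts[b]) for a, b in zip(path, path[1:])]
    return max(legs) if p == math.inf else sum(l ** p for l in legs) ** (1 / p)

def dX(pts, p, i, j):
    """d^(p)_X(x_i, x_j) via exhaustive minimisation over all simple
    paths from i to j (tiny n only)."""
    others = [k for k in range(len(pts)) if k not in (i, j)]
    best = math.inf
    for r in range(len(others) + 1):
        for mid in itertools.permutations(others, r):
            best = min(best, path_len(pts, (i, *mid, j), p))
    return best
```

On four collinear unit-spaced points one can check directly that the distances decrease in p (Corollary 3.2) and are sandwiched by n^(1/p) times the LLPD, consistent with Corollary 3.3 below.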
Corollary 3.3. For any x_α, x_β ∈ X, lim_{p→∞} d^(p)_X(x_α, x_β) = d^(∞)_X(x_α, x_β).

Proof. Again by Theorem 3.1 we have that L^(∞)(γ) ≤ L^(p)(γ) ≤ n^(1/p) L^(∞)(γ) for all paths γ between x_α and x_β, hence by the same reasoning as in the proof of Corollary 3.2:

d^(∞)_X(x_α, x_β) ≤ d^(p)_X(x_α, x_β) ≤ n^(1/p) d^(∞)_X(x_α, x_β).

Because this holds for all p < ∞, we may take the limit as p → ∞, and the result follows from the fact that lim_{p→∞} n^(1/p) = 1.

At the other extreme, d^(1)_X recovers the Euclidean metric. Indeed, d^(1)_X(x_α, x_β) is defined as a minimum over all paths from x_α to x_β through X, and in particular the one-hop path γ_{α→β} = x_α → x_β is such a path. We claim it is the shortest such path, as for any other path γ = x_α → x_{i_1} → ... → x_{i_m} → x_β, repeated applications of the triangle inequality give:

||x_α − x_β|| ≤ Σ_{j=0}^{m} ||x_{i_{j+1}} − x_{i_j}|| = L^(1)(γ).

Hence d^(1)_X(x_α, x_β) = ||x_α − x_β||.

4. p-wspm's in the multi-manifold setting. One of the most useful aspects of p-wspm's, when applied to clustering problems, is that they tend to "squeeze" points in the same cluster together, while (hopefully) keeping points in different clusters separated. Here we make this precise. Specifically, we show that for any 1 < p < ∞, if the data comes from the model described in §2.1, then:
1. the minimal p-wspm distance between points in different clusters is bounded below by δ, the minimal separation between the data manifolds (see Theorem 4.1);
2. the maximal p-wspm distance between points in the same cluster is O(n^((1−p)/(p d_max))), with probability tending to 1 as n tends to ∞ (see Theorem 4.6).
In this section it is sometimes necessary to enlarge our definition of p-wspm to allow for paths between x, y ∈ R^D that are not necessarily in X (points not in X shall be denoted without a subscript). Thus d^(p)_X(x, y) is technically defined as, using the notation of §2.2, d^(p)_{X∪{x,y}}(x, y).

4.1. Paths between points in different clusters. Here we prove that p-wspm's maintain a separation between points in different clusters.
Theorem 4.1. Let ε_2 denote the minimal p-wspm distance between points in different clusters. That is:

ε_2 := min_{a ≠ b} min_{x_α ∈ X_a, x_β ∈ X_b} d^(p)_X(x_α, x_β).

Then ε_2 ≥ δ, with δ as defined in (1).
Proof. For any x_α ∈ X_a and x_β ∈ X_b with a ≠ b, let γ = x_α → x_{i_1} → ... → x_{i_m} → x_β be any path from x_α to x_β through X, where again we use the convention that x_{i_0} := x_α and x_{i_{m+1}} := x_β. There must exist (at least one) j* ∈ {0, ..., m} such that x_{i_{j*}} ∈ X_a while x_{i_{j*+1}} ∉ X_a. By the assumptions on the generative model, the two endpoints of this leg lie on distinct manifolds, and so:

L^(p)(γ) ≥ ||x_{i_{j*+1}} − x_{i_{j*}}|| ≥ δ.

Because this holds for all such γ, we have d^(p)_X(x_α, x_β) := min_γ L^(p)(γ) ≥ δ, and because this holds for all such x_α and x_β, ε_2 ≥ δ.

4.2. Paths between points in the same cluster. Define d^(p)_{X_a}(x_α, x_β) as the minimum p-weighted length of paths from x_α to x_β through X_a (i.e. we exclude paths that may pass through points in X \ X_a). Because X_a ⊂ X, it follows that d^(p)_X(x_α, x_β) ≤ d^(p)_{X_a}(x_α, x_β). (More generally, the reader is invited to check that for any Y ⊂ X and x_α, x_β ∈ Y we have d^(p)_X(x_α, x_β) ≤ d^(p)_Y(x_α, x_β).) In this section we address the asymptotic behaviour of d^(p)_{X_a}(x_α, x_β). Here is where we make critical use of the main theorem of [20], which we state as Theorem 4.2. Recall that μ_a is the probability density function with respect to which X_a is sampled from M_a, and that, as in §2.1, dist_a(x, y) denotes the geodesic distance from x to y on the Riemannian manifold (M_a, g_a), i.e. the infimum of the lengths of piecewise smooth paths η : [0, 1] → M_a with η(0) = x and η(1) = y. In order to bound d^(p)_{X_a}(x_α, x_β), we study an auxiliary shortest path distance, dist^(p)_a. This distance is also defined as a minimum over p-weighted path lengths, but instead of measuring the length of the legs using the Euclidean distance ||·||, we measure them with respect to the geodesic distance dist_a:

(7)  dist^(p)_a(x, y) := min_γ ( Σ_{j=0}^{m} dist_a(x_{i_j}, x_{i_{j+1}})^p )^{1/p},

where again the min is over all paths γ from x to y through X_a.
Theorem 4.2 (adapted from [20, Theorem 1]). Let M_a be a compact Riemannian manifold, and assume that X_a is drawn from M_a with continuous probability density μ_a satisfying μ_min,a := min_{x ∈ M_a} μ_a(x) > 0. Let n_a := |X_a| and, for all n_a, let r_a := n_a^((1−p)/(p d_a)), where d_a is the intrinsic dimension of M_a. Then for any 1 ≤ p < ∞ and any fixed ε > 0, there exists a constant θ_0, independent of n_a, such that, with probability at least 1 − ε,

dist^(p)_a(x, y) ≤ θ_0 n_a^((1−p)/(p d_a)) [ C(d_a, p) μ_min,a^((1−p)/(p d_a)) dist_a(x, y)^(1/p) ]

for all x, y ∈ M_a with dist_a(x, y) ≥ r_a, where C(d_a, p) is a constant depending only on d_a and p, but not on n_a.

Remark 4.3. This is a slightly modified version of Theorem 1 in [20]. As stated in [20], the bound is uniform over x, y ∈ M satisfying dist_a(x, y) ≥ b for a fixed constant b. However, immediately below the statement of Theorem 1, the authors acknowledge that one can weaken this assumption to dist_a(x, y) ≥ r_a as long as n_a r_a^{d_a} / log n_a → ∞, which is the case for our choice of r_a. Note that there is a slight notational discrepancy: what is called d in [20] is, in our notation, d_a.

Corollary 4.4. With assumptions as in Theorem 4.2, there exists a constant θ_0, independent of n_a, such that, with probability at least 1 − ε,

max_{x_α, x_β ∈ X_a} dist^(p)_a(x_α, x_β) ≤ θ_0 C_a n_a^((1−p)/(p d_a)),

where C_a is a constant depending on d_a, p, μ_min,a and diam(M_a), but not on n_a.
Proof. First observe that, because the one-leg path γ_{α→β} = x_α → x_β is trivially a path through X_a, for any x_α, x_β ∈ X_a with dist_a(x_α, x_β) ≤ r_a we have

dist^(p)_a(x_α, x_β) ≤ dist_a(x_α, x_β) ≤ r_a = n_a^((1−p)/(p d_a)).

Hence, as long as C_a ≥ 1, such pairs satisfy the claimed bound deterministically. For the remaining pairs, because X_a ⊂ M_a we may use Theorem 4.2. Indeed, fix any small ε < 1. Then there exists a θ_0 such that, with probability at least 1 − ε,

dist^(p)_a(x_α, x_β) ≤ θ_0 n_a^((1−p)/(p d_a)) [ C(d_a, p) μ_min,a^((1−p)/(p d_a)) dist_a(x_α, x_β)^(1/p) ]

for all x_α, x_β ∈ X_a satisfying dist_a(x_α, x_β) ≥ r_a. We now upper-bound the bracketed quantity. Because M_a is compact and embedded, its diameter (see §2.1) is finite, and thus the bracketed quantity is at most C(d_a, p) μ_min,a^((1−p)/(p d_a)) diam(M_a)^(1/p) =: C_a, proving the corollary.
Finally, it remains to compare the path distance with Euclidean legs, d^(p)_{X_a}, with the path distance with geodesic legs, dist^(p)_a.

Lemma 4.5. For all x, y ∈ X_a, d^(p)_{X_a}(x, y) ≤ dist^(p)_a(x, y).

Proof. Observe that for any x, y ∈ M_a, ||x − y|| ≤ dist_a(x, y). It follows that for any path γ through X_a, L^(p)(γ) ≤ ( Σ_j dist_a(x_{i_j}, x_{i_{j+1}})^p )^{1/p}, and taking minima over all paths yields the lemma.

We can now bound the maximal intra-cluster distance with respect to d^(p)_X.

Theorem 4.6. Define ε_1 to be the maximal p-wspm distance between points in the same cluster:

ε_1 := max_{a ∈ [ℓ]} max_{x_α, x_β ∈ X_a} d^(p)_X(x_α, x_β).

With assumptions as in §2.1, for any 1 ≤ p < ∞, ε_1 = O_n(n^((1−p)/(p d_max))) with probability tending to 1 as n → ∞, where d_max := max_a d_a.

Proof. First, for all x_α, x_β ∈ X_a, observe that:

(12)  d^(p)_X(x_α, x_β) ≤ d^(p)_{X_a}(x_α, x_β) ≤ dist^(p)_a(x_α, x_β),

where the first inequality holds because X_a ⊂ X and the second is Lemma 4.5. Now let C := max_a C_a. Clearly C_a n_a^((1−p)/(p d_a)) ≤ C n_a^((1−p)/(p d_a)), and similarly n_a^((1−p)/(p d_a)) = O_n(n^((1−p)/(p d_max))). Combining these observations, (12) and Corollary 4.4: with probability tending to 1,

max_{x_α, x_β ∈ X_a} d^(p)_X(x_α, x_β) = O_n(n^((1−p)/(p d_max)))

for all a = 1, ..., ℓ. By the union bound over the ℓ clusters, and the definition of ε_1, the theorem follows.

This quantifies an observation made in [19,20], namely that by increasing p we get a tighter upper bound on ε_1, but one that holds with lower probability. We find experimental evidence for this in §6.
In [3] and [22], bounds analogous to those in Theorems 4.1 and 4.6, but for ||·|| and d^(∞)_X respectively, are used to provide bounds on the performance of single-linkage hierarchical clustering and spectral clustering with a full similarity matrix. As the focus of this article is clustering with a k nearest neighbors similarity matrix, we do not pursue this line of inquiry further here.

5. A fast algorithm for p-wspm nearest neighbors. In this section we start from a more general perspective. Let G = (V, E, A) be a weighted graph with weighted adjacency matrix A. Occasionally we shall find it more convenient not to fix an ordering of the vertices, in which case A(u, v) will represent the weight of the edge {u, v}. We assume all edge weights are positive. For any v ∈ V we denote its set of neighbors by N(v). By γ = u → w_1 → ... → w_m → v we shall mean the path from u to v in G through w_1, ..., w_m; this is only valid if {u, w_1}, ..., {w_i, w_{i+1}}, ..., {w_m, v} are all edges in G. In analogy with §2.2, we maintain the convention that for such a path γ, w_0 = u and w_{m+1} = v. Define the length of γ as the sum of its edge weights:

L(γ) := Σ_{i=0}^{m} A(w_i, w_{i+1}),

and similarly define the longest-leg length of γ as:

L^(∞)(γ) := max_{i=0,...,m} A(w_i, w_{i+1}).

For any u, v ∈ V define the shortest path distance as:

d_G(u, v) := min{ L(γ) : γ a path from u to v },

and analogously define the longest-leg path distance as:

d^(∞)_G(u, v) := min{ L^(∞)(γ) : γ a path from u to v }.

Definition 5.1. For any u ∈ V, let N_{k,G}(u) denote the set of k nearest neighbors of u with respect to the edge weights, i.e. the k vertices v minimizing A(u, v). Similarly, let N^{d_G}_{k,G}(u) and N^{d^(∞)_G}_{k,G}(u) denote the sets of k nearest neighbors of u with respect to the distances d_G and d^(∞)_G, respectively.

Definition 5.2. For any graph G, define a directed k nearest neighbors graph G^(k) with directed edges (u, v) whenever v ∈ N_{k,G}(u).
In practice we do not compute the entire edge set of G^(k), but rather just compute the sets N_{k,G}(u) as they become necessary.
We have not specified how to break ties in the definitions of N_{k,G}(v), N^{d_G}_{k,G}(v) and N^{d^(∞)_G}_{k,G}(v). For the results of this section to hold, any method will suffice, as long as we use the same method in all three cases. To simplify the exposition, we shall assume henceforth that all distances are distinct.
Remark 5.4. Let us relate this to the discussion in previous sections. For any set of data points X = {x_1, ..., x_n} ⊂ R^D and any power weighting 1 ≤ p < ∞, one can form a complete graph G on n vertices, where the vertex v_i corresponds to the data point x_i, with edge weights A_ij = ||x_i − x_j||^p. Then d_G(v_i, v_j) = d^(p)_X(x_i, x_j)^p, and, because t → t^(1/p) is increasing, the k nearest neighbors of v_i with respect to d_G correspond exactly to the k nearest neighbors of x_i with respect to d^(p)_X.

Before proceeding, let us briefly review how Dijkstra's algorithm works. For a graph G with non-negative edge weights and a given source vertex s, Dijkstra's algorithm returns a list of pairs of the form (u, d_G(s, u)). Its computational complexity is O(|E| + n log n), so when G is a complete graph the complexity is O(n^2). Intuitively, our proposed algorithm (Algorithm 2) sidesteps this complexity by allowing one to run Dijkstra's algorithm on a much sparser graph, as long as one is only interested in determining the identity of, and path distance to, the k nearest neighbors of s with respect to the path distance d_G (see Lemma 5.6).
The following implementation of Dijkstra's algorithm is as in [11]. The min-priority queue operations decreaseKey, insert and extractMin have their standard definitions (see, for example, Chpt. 6 of [11]). For any vertex s ∈ V and any subset W ⊂ V, we shall also use the shorthand makeQueue(W, s) to denote the process of initializing a min-priority queue with key[s] = 0 and key[v] = +∞ for all v ∈ W \ {s}.

Algorithm 1 Dijkstra's algorithm
1: Input: Graph G, source vertex s.
2: Output: List S of pairs (u, d_G(s, u)).
3: Q ← makeQueue(V, s)
4: while Q is not empty do
5:   u ← extractMin(Q)
6:   Append (u, key[u]) to S   ▷ Once u is extracted, key[u] is the shortest path length from s.
7:   for v ∈ N(u) do   ▷ N(u) is the set of all vertices adjacent to u.
8:     tempDist ← key[u] + A(u, v)
9:     if tempDist < key[v] then decreaseKey(v, tempDist)
10:  end for
11: end while

Lemma 5.5. Suppose that all weights are non-negative: A_ij ≥ 0. If u_i is the i-th vertex to be extracted from Q in line 5, then u_i is the i-th closest vertex to s.
Proof. See, for example, the discussion in [11].
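A compact Python rendition of Algorithm 1 (ours), using a binary heap with lazy deletion in place of decreaseKey; the order in which vertices are appended to S illustrates Lemma 5.5:

```python
import heapq
import math

def dijkstra(A, s):
    """Return pairs (u, d_G(s, u)) in order of increasing distance.
    A is a dict-of-dicts adjacency: A[u][v] = positive edge weight."""
    key = {u: math.inf for u in A}
    key[s] = 0.0
    heap = [(0.0, s)]
    done, S = set(), []
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:          # stale entry left by lazy deletion
            continue
        done.add(u)
        S.append((u, d))       # u extracted => key[u] = d_G(s, u)
        for v, w in A[u].items():
            if v not in done and d + w < key[v]:
                key[v] = d + w
                heapq.heappush(heap, (key[v], v))
    return S
```

On a triangle with weights s-a = 1, a-b = 2, s-b = 4, the vertices come off the queue in the order s, a, b, with d_G(s, b) = 3 via a.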
It follows that, if one is only interested in finding the k nearest neighbors of s in the path distance d_G, one need only iterate through the "while" loop of Algorithm 1 k times. There is a further inefficiency, which was also highlighted in [7]: the "for" loop iterates over all neighbors of u. The graphs we are interested in are, implicitly, fully connected, hence this loop iterates over all n − 1 other vertices at each step. We fix this with the following observation:

Lemma 5.6. For any graph G, let G^(k) denote its directed k nearest neighbors graph (see Definition 5.2). Then:

N^{d_G}_{k,G}(v) = N^{d_{G^(k)}}_{k,G^(k)}(v)  for all v ∈ V.

Note that in the directed graph G^(k), we consider only paths that traverse each edge in the 'correct' direction.
Concretely: the path-distance nearest neighbors in G are the same as the path-distance nearest neighbors in G^(k), hence one can find N^{d_G}_{k,G}(v) by running a Dijkstra-style algorithm on G^(k) instead of G. As each vertex in G^(k) has a small number of out-neighbors (precisely k), this alleviates the computational burden of the "for" loop highlighted above.
Before proving this lemma, let us explain why it may seem counterintuitive. If w ∈ N^{d_G}_{k,G}(v), there is a path γ from v to w that is short (shorter, at least, than the shortest paths to all u ∉ N^{d_G}_{k,G}(v)). In forming G^(k) from G, one deletes a great many edges. Thus it is not clear that γ is still a path in G^(k) (some of its edges may now be "missing"), and hence it would seem possible that w is now far away from v in the shortest-path distance in G^(k). The lemma asserts that this cannot be the case.
Proof. Because N^{d_G}_{k,G}(v) and N^{d_{G^(k)}}_{k,G^(k)}(v) have the same cardinality (i.e. k), to prove equality it suffices to prove one containment. We shall show that N^{d_G}_{k,G}(v) ⊂ N^{d_{G^(k)}}_{k,G^(k)}(v). Let w ∈ N^{d_G}_{k,G}(v), and let γ̃ = v → u_1 → ... → u_m → w be a shortest path from v to w in G. We claim that γ̃ is a path in G^(k). If this is not the case, then there is an edge {u_i, u_{i+1}} that is in γ̃ but such that (u_i, u_{i+1}) is not an edge in G^(k) (we again adopt the convention that u_0 := v and u_{m+1} := w). By the construction of G^(k), this implies that there are k vertices x_1, ..., x_k that are closer to u_i than u_{i+1} is. (Note that the sets {u_0, ..., u_{i−1}} and {x_1, ..., x_k} need not be disjoint.) But then the paths γ_j := v → u_1 → ... → u_i → x_j, for j = 1, ..., k, are all shorter than the initial segment v → u_1 → ... → u_{i+1} of γ̃, and hence shorter than γ̃, as all edge weights are assumed positive. This yields k vertices closer to v than w in the distance d_G, contradicting w ∈ N^{d_G}_{k,G}(v). Hence γ̃ is a path in G^(k), and d_{G^(k)}(v, w) = d_G(v, w). It remains to show that w ∈ N^{d_{G^(k)}}_{k,G^(k)}(v). If this were not the case, there would exist k other vertices w_1, ..., w_k that are closer in the shortest-path distance d_{G^(k)} to v than w. That is, there would be paths γ_1, ..., γ_k from v to w_1, ..., w_k respectively that are shorter than γ̃. But every path in G^(k) is also a path in G, hence w_1, ..., w_k are also closer to v than w in the shortest-path distance d_G. This contradicts the assumption that w ∈ N^{d_G}_{k,G}(v).
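Lemma 5.6 can also be checked empirically: running Dijkstra on the pruned directed graph G^(k) returns the same k path-nearest neighbors as running it on the full complete graph. A brute-force sketch with p = 2 (the function name and structure are ours):

```python
import heapq
import math

def knn_pathdist(points, k, s, pruned):
    """k nearest neighbours of s in the shortest-path distance of the
    complete graph with weights ||x_i - x_j||^2 (p = 2 for concreteness).
    If pruned, restrict edges (u, v) to v among the k Euclidean-nearest
    neighbours of u, i.e. run on the directed graph G^(k)."""
    n = len(points)

    def w(u, v):
        return math.dist(points[u], points[v]) ** 2

    def nbrs(u):
        order = sorted((v for v in range(n) if v != u), key=lambda v: w(u, v))
        return order[:k] if pruned else order

    dist = {s: 0.0}
    heap = [(0.0, s)]
    seen = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        for v in nbrs(u):
            nd = d + w(u, v)
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    order = sorted((v for v in dist if v != s), key=lambda v: dist[v])
    return set(order[:k])
```

For generic point sets (distinct pairwise distances), the pruned and unpruned answers agree for every source vertex, exactly as the lemma predicts.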
There is a final, minor inefficiency in Algorithm 1 that we can improve upon: Q is initialized to contain all the vertices V, when it is only necessary to initialize it to contain the neighbors of s. Combining these three insights, we arrive at Algorithm 2. We call this algorithm Dijkstra-with-pruning, as the key idea, expressed in Lemma 5.6, is to "prune" the neighborhood of each v ∈ V down to the k nearest neighbors of v. Note that we use DecreaseOrInsert as shorthand for the function that decreases key[v] to tempDist if v is already in Q with a larger key, and otherwise inserts v into Q with priority key[v] = tempDist. In fact, the decrease may itself be implemented by inserting a (duplicate) copy of v into Q with priority tempDist, hence DecreaseOrInsert has the same computational complexity as insert (see also [23]). Note that in this implementation the size of Q may grow by one on every iteration of the inner for loop.
Theorem 5.7. For any s and any G with positive weights, Algorithm 2 is correct. That is, S contains precisely the pairs (v, d_G(s, v)) for all v ∈ N^{d_G}_{k,G}(s).
Proof. By only using N_{k,G}(u) in the inner for loop, Algorithm 2 is essentially running Dijkstra's algorithm on G^(k). By Lemma 5.5, the first k elements to be extracted from the queue are indeed the k closest vertices to s in the graph G^(k). That is, S contains the pairs (v, d_{G^(k)}(s, v)) for all v ∈ N^{d_{G^(k)}}_{k,G^(k)}(s), and by Lemma 5.6 these are precisely the pairs (v, d_G(s, v)) for all v ∈ N^{d_G}_{k,G}(s).

Algorithm 2 Dijkstra-with-pruning
1: Input: Graph G, source vertex s.
2: Output: List S containing (v, d_G(s, v)) for all v ∈ N^{d_G}_{k,G}(s).
3: Compute N_{k,G}(s)
4: Q ← makeQueue(N_{k,G}(s)) with key[v] = A(s, v) for each v ∈ N_{k,G}(s)
5: for i = 1, ..., k do
6:   u ← extractMin(Q)
7:   Append (u, key[u]) to S
8:   Compute N_{k,G}(u)
9:   for v ∈ N_{k,G}(u) do
10:    tempDist ← key[u] + A(u, v)
11:    DecreaseOrInsert(v, tempDist)
12:  end for
13: end for
14: Output: S

5.1. Analysis of complexity. Let us determine the computational complexity of Algorithm 2. We shall remain agnostic for the moment about the precise implementation of the min-priority queue, and hence shall use the symbols T_in, T_dk and T_em to denote the computational complexity of insert, decreaseKey and extractMin respectively. As discussed above, the complexity of DecreaseOrInsert is also T_in. Let T_nn denote the cost of a nearest neighbor query in G. Then the cost of a k nearest neighbors query, i.e. the cost of determining N_{k,G}(u) as in line 8, is kT_nn.
Initializing the queue in line 4 requires k insertions, for a cost of kT_in. Precisely k extractMin operations are performed, for a total cost of kT_em. DecreaseOrInsert is performed k^2 times, for a cost of k^2 T_in. Finally, k + 1 k-nearest-neighbor queries are performed, for a cost of (k + 1)kT_nn. This gives a total cost of kT_em + (k + k^2)T_in + (k^2 + k)T_nn. If the min-priority queue is implemented using a Fibonacci heap, insert and decreaseKey both run in constant time (i.e. T_in, T_dk = O(1)), while extractMin costs T_em = O(log(|Q|)). Note that |Q| never exceeds k^2 + k, as at most one element is added to Q during each pass through the inner for loop, which executes k^2 times. Hence T_em = O(log(k)) and we have a net cost of O(k log(k) + k^2 T_nn), where T_nn depends on the specifics of G.
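A runnable sketch of Algorithm 2 (ours): brute-force Euclidean k-NN queries stand in for the k-d tree or cover tree discussed below, and DecreaseOrInsert is realized by lazy insertion into a binary heap.

```python
import heapq
import math

def dijkstra_with_pruning(points, k, p, s):
    """Return [(v, path cost)] for the k path-nearest neighbours of
    points[s], with edge weights ||x_u - x_v||^p (cf. Remark 5.4).
    `knn` below is a stand-in for a fast Euclidean k-NN query."""
    n = len(points)

    def w(u, v):
        return math.dist(points[u], points[v]) ** p

    def knn(u):  # k Euclidean-nearest neighbours of u (brute force)
        return sorted((v for v in range(n) if v != u),
                      key=lambda v: w(u, v))[:k]

    key, heap = {}, []
    for v in knn(s):                        # initialise Q with N_{k,G}(s)
        key[v] = w(s, v)
        heapq.heappush(heap, (key[v], v))
    S, done = [], {s}
    while heap and len(S) < k:
        d, u = heapq.heappop(heap)
        if u in done or d > key[u]:
            continue                        # stale entry (lazy deletion)
        done.add(u)
        S.append((u, d))
        for v in knn(u):                    # prune: only the k-NN of u
            nd = d + w(u, v)
            if nd < key.get(v, math.inf):   # DecreaseOrInsert
                key[v] = nd
                heapq.heappush(heap, (nd, v))
    return S
```

For four collinear unit-spaced points with p = 2 and k = 2, the search started at the leftmost point returns its neighbour at cost 1 and the next point at cost 1 + 1 = 2 (two unit hops), rather than the direct-hop cost 4.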
Let us return to the case of primary interest in this paper, where G is the complete graph on n vertices and A_ij = ||x_i − x_j||^p. By Remark 5.4, running Algorithm 2 from s = v_i computes N^(p)_{k,X}(x_i), the k nearest neighbors of x_i with respect to d^(p)_X. Here, T_nn is equal to the cost of a Euclidean nearest neighbors query on X, namely T_Enn. Because the k log(k) term is dominated by k^2 T_Enn, we get that in this case Algorithm 2 runs in O(k^2 T_Enn), as advertised in the introduction. For a totally general data set, T_Enn = O(Dn). However, if X is intrinsically low-dimensional, which we are assuming, it is possible to speed this up. For example, if X is stored in an efficient data structure such as a k-d tree [5] or a cover tree [6], then T_Enn = O(log(n)). Note that initializing a cover tree requires O(c^{d_max} Dn log(n)) time, where c is a fixed constant [6]. Hence finding N^(p)_{k,X}(x_i) for all x_i ∈ X requires O(k^2 n log(n) + c^{d_max} Dn log(n)).

5.2. Extension to longest-leg path distance. A small modification to Algorithm 2 allows one to compute the k nearest neighbors in the longest-leg path distance: simply change the '+' in line 10 to a 'max'. This guarantees that tempDist represents the longest-leg length of the path s → ... → u → v. For completeness, we present this algorithm below as Algorithm 3. The proof of correctness is analogous to that of Theorem 5.7, and we leave it to the interested reader.
Algorithm 3 Dijkstra-with-pruning for LLPD
1: Input: Graph G, source vertex s.
2: Output: List S containing (v, d^(∞)_G(s, v)) for all v ∈ N^{d^(∞)_G}_{k,G}(s).
3: Compute N_{k,G}(s)
4: Q ← makeQueue(N_{k,G}(s)) with key[v] = A(s, v) for each v ∈ N_{k,G}(s)
5: for i = 1, ..., k do
6:   u ← extractMin(Q)
7:   Append (u, key[u]) to S
8:   Compute N_{k,G}(u)
9:   for v ∈ N_{k,G}(u) do
10:    tempDist ← max(key[u], A(u, v))
11:    DecreaseOrInsert(v, tempDist)
12:  end for
13: end for
14: Output: S

As with Algorithm 2, finding N^(∞)_{k,X}(x_i) for all x_i ∈ X requires O(k^2 n log(n) + c^{d_max} Dn log(n)).
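The corresponding sketch of Algorithm 3 (ours) is identical to the previous one except that 'max' replaces '+' when a path is extended:

```python
import heapq
import math

def llpd_knn(points, k, s):
    """k nearest neighbours of points[s] in the longest-leg path
    distance, via Dijkstra-with-pruning: same as the p-wspm version,
    but a path's cost is the max of its legs rather than their sum."""
    n = len(points)

    def leg(u, v):
        return math.dist(points[u], points[v])

    def knn(u):  # k Euclidean-nearest neighbours of u (brute force)
        return sorted((v for v in range(n) if v != u),
                      key=lambda v: leg(u, v))[:k]

    key, heap = {}, []
    for v in knn(s):
        key[v] = leg(s, v)
        heapq.heappush(heap, (key[v], v))
    S, done = [], {s}
    while heap and len(S) < k:
        d, u = heapq.heappop(heap)
        if u in done or d > key[u]:
            continue
        done.add(u)
        S.append((u, d))
        for v in knn(u):
            nd = max(d, leg(u, v))          # 'max' replaces '+'
            if nd < key.get(v, math.inf):
                key[v] = nd
                heapq.heappush(heap, (nd, v))
    return S
```

For points at 0, 1, 3 and 7 on the line with k = 2, the search from the leftmost point returns its neighbour at longest-leg cost 1 and the point at 3 at cost max(1, 2) = 2.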

5.3. Comparison with results of [9]. Recall the following definition:

Definition 5.9. Let G = (V, E, A) be a weighted graph. A sub-graph H ⊂ G is called a 1-spanner of G if H has the same vertex set as G and, for all u, v ∈ V, we have that d_G(u, v) = d_H(u, v).
When preparing this manuscript for publication, the authors became aware of a result of Chu, Miller and Sheehy [9] showing that, for certain k and p ≥ 2, with high probability the Euclidean k-NN graph G^(k) is a 1-spanner of the complete distance graph on X, and hence can be used to compute p-wspm distances exactly (cf. the discussion in §2.4).

6. Numerical experiments. In this section we verify that using a p-wspm in lieu of the Euclidean distance does indeed result in more accurate clustering, at a modest increase in run time. Specifically, we consider eight data sets, and compare the accuracy of spectral clustering using k-NN graphs constructed with p-wspm's for p = 2, 10 and ∞ against using the Euclidean metric. As a baseline, we also consider spectral clustering with a full similarity matrix based on the Euclidean metric. For notational reasons it is convenient to denote the Euclidean metric as d^(1)_X (cf. §3).

6.1. The data sets.
Three Lines. We draw data uniformly from three horizontal line segments of length 5 in the x-y plane, namely y = 0, y = 1 and y = 2. We draw 500 points from each line to create three clusters. We then embed the data into R^50 by appending zeros to the coordinates, and add i.i.d. random Gaussian noise to each coordinate (with standard deviation σ = 0.14).
Three Moons. This data set is as described in [29] and elsewhere. It has three clusters, generated by sampling points uniformly at random from the upper semi-circle of radius 1 centered at (0, 0), the lower semi-circle of radius 1.5 centered at (1.5, 0.4) and the upper semi-circle of radius 1 centered at (3, 0). As for the Three Lines data set, we draw 500 data points from each semi-circle, embed the data into R^50 by appending zeros, and then add Gaussian noise to each coordinate with standard deviation σ = 0.14.
Three Circles. Here we draw data points uniformly from three concentric circles, of radii 1, 2.25 and 3.5. We draw 222 points from the smallest circle, 500 points from the middle circle and 778 points from the largest circle (the numbers are chosen so that the total number of points is 1500).
As before, we embed this data into R^50 and add i.i.d. Gaussian noise to each component, this time with standard deviation σ = 0.14.
Two-dimensional projections of these data sets are shown in Figure 2. We also considered five real data sets. We focused on image data sets that are suspected to satisfy the manifold hypothesis, namely images of faces and objects taken from different angles and handwritten digits. We obtained most of our data sets from the UCI Machine Learning Repository [14].
OptDigits. This data set consists of downsampled, 8 × 8 greyscale images of the handwritten digits 0-9. There are 150 images of zero, and approximately 550 images of each of the remaining digits, for a total of 5620 images.
USPS. This data set consists of 16 × 16 greyscale images of the handwritten digits 0-9. There are 1100 images per class, for a total of 11 000 images.
MNIST. This data set consists of 28 × 28 greyscale images of the handwritten digits 0-9. We combined the test and training sets to get a total of 70 000 images.

6.2. Preprocessing the data. Zelnik-Manor and Perona [30] propose the following locally scaled kernel for spectral clustering:

A_ij = exp( −||x_i − x_j||^2 / (σ_i σ_j) ),  where σ_i := ||x_i − x_[r,i]||

and x_[r,i] denotes the r-th closest point in X to x_i. It is shown in [30] and elsewhere that this kernel tends to outperform the unscaled Gaussian kernel (and indeed most other choices of kernel function), hence we adopt it for our experiments. We construct the full Euclidean similarity matrix as A^(f,1)_ij = exp(−||x_i − x_j||^2/(σ_i σ_j)), with r = 10. We also construct weighted k-NN similarity matrices for all four metrics using the same kernel, as described in detail in Algorithm 4. All the real data sets were initially vectorized, so the 80 × 80 DrivFace data set becomes vectors in R^6400, and so on. Note that this procedure for constructing weighted k-NN similarity matrices was inspired by [21].
Remark 6.1. We certainly make no claim that the choice of parameters r = 10 and k = 15 is optimal, and indeed tuning them can yield slightly better (or worse) results on certain data sets. However, we observed that changing the parameters had little qualitative effect on the results, and in particular on the ordering of the similarity matrices from least to most accurate (see Table 1). As a sanity check, we also experimented with an unweighted k-NN graph, whereby for p = 1, 2, 10 and ∞ we define A (uw,p) ij = 1 if x j ∈ N k,X (x i ) and A (uw,p) ij = 0 otherwise, with the neighborhoods N k,X (x i ) computed using Algorithm 2 or 3.
Again, the relative ordering of the results changed little, although the accuracy was several points lower for all four metrics. All code to reproduce the experiments is available on the second author's website, and we invite the curious reader to experiment further for themselves.
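A sketch of the unweighted construction, starting from any precomputed distance matrix (we use a plain distance matrix for illustration; the paper obtains p-wspm neighbours from Algorithms 2 and 3 without forming all pairwise distances):

```python
import numpy as np

def unweighted_knn_graph(D, k=15):
    """Unweighted k-NN adjacency matrix from a distance matrix D:
    A_ij = 1 if x_j is among the k nearest neighbours of x_i, else 0.
    Symmetrised so that spectral clustering sees an undirected graph.
    Assumes D has a zero diagonal (each point is its own closest
    point), so column 0 of the argsort is dropped."""
    n = D.shape[0]
    A = np.zeros((n, n))
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]   # k nearest, excluding self
    rows = np.repeat(np.arange(n), k)
    A[rows, nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)                  # symmetrise
```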

6.3. Experimental results.
For each similarity matrix we perform normalized spectral clustering as described in Ng, Jordan and Weiss [25], using freely available code 7 . We compute k nearest neighbors in the p-wspm's using Algorithms 2 and 3, with the data points stored in a k-d tree. We calculate the accuracy by comparing the output of spectral clustering to the ground truth, and record the running time. For the randomly generated data sets (i.e. Three Lines, Three Circles and Three Moons) we ran fifty independent trials and report the mean and standard deviation. For the deterministic data sets (i.e. all the others) we ran ten independent trials and report the mean. The results are displayed in Tables 1 and 2. We do not attempt clustering with the full similarity matrix A (f,1) on MNIST, as the resulting matrix is too large to hold in memory.
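Since spectral clustering returns arbitrary label ids, "comparing to the ground truth" means maximizing agreement over relabelings of the clusters. A brute-force sketch of this standard scoring (our own; fine for a handful of clusters):

```python
from itertools import permutations
import numpy as np

def clustering_accuracy(pred, truth):
    """Fraction of points correctly clustered, maximised over all
    permutations of the predicted label ids.  Brute force over K!
    relabelings, which is fine for small numbers of clusters; the
    Hungarian algorithm is the usual choice for larger K."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    K = int(max(pred.max(), truth.max())) + 1
    best = 0
    for perm in permutations(range(K)):
        relabelled = np.array([perm[p] for p in pred])
        best = max(best, int((relabelled == truth).sum()))
    return best / len(truth)
```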
From these results, we may draw several broad conclusions. Observe that for smaller or low dimensional data sets, constructing the full similarity matrix A (f,1) is fastest. This is due to the overhead cost of constructing the k-d tree incurred by the nearest neighbors methods. The situation is reversed in higher dimensions or for larger data sets. The gap between the run-times for A (1) and those for A (2) , A (10) and A (∞) is attributable to the cost of running Algorithm 2 or 3. Although this gap is relatively large for smaller data sets, it becomes less relevant for larger data sets. In fact, the entire process of spectral clustering is faster with a p-wspm for the MNIST data set. We attribute this to the fact that A (2) , A (10) and A (∞) are more "block-diagonal" than A (1) , hence finding their leading eigenvectors takes less time; this more than offsets the extra time required to construct them.

Table 2. Run time of spectral clustering, in seconds. Note that this includes the time required to construct the similarity matrix. A (1) represents using the Euclidean metric.

With regards to accuracy, observe that for every data set, clustering using the Euclidean metric (either the full version or the k-NN version) is less accurate than using a p-wspm. The catch is that which p-wspm performs best varies. As a general rule, d (∞) X appears best for elongated data (i.e. the Three Lines) or when there is a large gap between the intrinsic dimension of the data and the ambient dimension (i.e. USPS or MNIST). On the other hand, d (2) X appears best for more globular data (i.e. the Three Moons) or when the gap between intrinsic and ambient dimension is less pronounced (i.e. OptDigits). In all cases, d (10) X seems to be a good compromise between these two extremes. Finally, note that the standard deviation of the accuracy is much higher for d (∞) X . This is in agreement with Theorem 4.6, where it is shown that the bound there holds with probability inversely proportional to p.

7 See: https://www.mathworks.com/matlabcentral/fileexchange/34412-fast-and-efficient-spectral-clustering
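For reference, a p-wspm distance can be computed naively by running Dijkstra on the complete graph with edge weights ‖x i − x j ‖ p , following the path-cost definition of §2.2 (we report a 1/p root as one common normalization; the function name is ours). The O(n 2 log n) cost per source is exactly what Algorithms 2 and 3 are designed to avoid when only k nearest neighbours are needed:

```python
import heapq
import numpy as np

def pwspm_from_source(X, src, p=2.0):
    """Naive single-source power weighted shortest path distances:
    cost of a path is the sum of ||step||^p over its hops, minimised
    by Dijkstra on the complete graph; a 1/p root is applied at the
    end.  O(n^2 log n) per source -- for illustration only."""
    n = len(X)
    W = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1) ** p
    dist = np.full(n, np.inf)
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue          # stale heap entry
        for j in range(n):
            nd = d + W[i, j]
            if nd < dist[j]:
                dist[j] = nd
                heapq.heappush(heap, (nd, j))
    return dist ** (1.0 / p)
```

On three collinear points the optimal path takes many small hops, as in Figure 1: with p = 2, two unit hops cost 1 + 1 = 2, beating the direct edge of cost 4.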
6.4. Varying the power weighting. From the analysis of §4 it would appear that taking p as large as possible always yields the best clustering results. However, this is true only in an asymptotic sense, and indeed the results contained in Table 1 indicate that for finite sample sizes this is not always the case. In Figure 3 we show the results of varying p from 1 to 20 for the Three Lines data set, this time with 300 points drawn from each cluster. We do this for three values of the ambient dimension: D = 10, 50 and 100. As is clear, the optimal value of p depends on the dimension. In particular, observe that an intermediate value of p, say p = 14, is optimal when the ambient dimension is 50, but that a smaller power weighting (p = 2) is more appropriate when the ambient dimension is 10. When the ambient dimension is 100, no power weighting performs well, likely because for such a large value of D, and such a small amount of data per cluster, the noise drowns out any cluster structure.

7. Conclusions and future directions. In this paper we argued that p-wspm's are well suited to the problem of clustering high dimensional data when the data is sampled from a disjoint union of low dimensional manifolds. We showed that spectral clustering with a p-wspm outperforms spectral clustering with the Euclidean distance, and that using Algorithm 2 the increase in computational burden is negligible. From the results of §6 it is clear that the geometry of the data manifolds influences which power weighting is optimal, and it would be of interest to analyze this further. Also of interest would be to extend our work to more general data models, for example those that only require the sampling distributions µ a to be supported "near" M a , or those that allow for intersections between the data manifolds.