THE APPROXIMATION ALGORITHM BASED ON THE SEEDING METHOD FOR THE FUNCTIONAL k-MEANS PROBLEM

Abstract. Different from the classical k-means problem, the functional k-means problem involves a kind of dynamic data generated by continuous processes. In this paper, we design an O(ln k)-approximation algorithm based on the seeding method for the functional k-means problem. Moreover, the numerical experiments presented show that this algorithm is more efficient than the functional k-means clustering algorithm.


1. Introduction. The k-means problem aims to separate a given set of n data points into k (k ≤ n) parts so as to minimize the sum of squared distances. It is one of the classic problems in computational geometry and machine learning. Since the k-means problem is NP-hard [3], several algorithms based on different techniques have been presented in the literature. So far, the best approximation factor for the k-means problem is 6.357 + ε, obtained by a primal-dual algorithm [2]. A more popular algorithm, due to Lloyd [14,16], is called Lloyd's algorithm or simply k-means; it performs very well in practice but is a heuristic. It is also listed among the top 10 algorithms in data mining [23]. In order to improve the performance guarantee, an O(ln k)-approximation algorithm based on Lloyd's algorithm, called k-means++, was designed, in which the k initial centers are chosen with specific probabilities [4]. By applying a bi-criteria technique, Wei [22] designed a new seeding algorithm with a constant approximation factor. These two methods have been successfully applied to variants of the k-means problem [10,12,13]. There is also another clustering approach, based on a polyhedral conic functions algorithm [17].
The given data in the standard k-means problem are usually real vectors, so the distance between two points can be expressed by the Euclidean distance and the triangle inequality can be used. In fact, there is a special kind of data generated by continuous processes, also called dynamic data or functional data. The observation points in the functional k-means problem belong to this special kind of data [7,18]. By using the properties of continuous functions, several kinds of clustering methods have been introduced for the functional k-means problem, such as two-stage methods [1,11,19], non-parametric clustering methods [5,21], model-based clustering methods [6], and so on. One can refer to [9] for more information.
In [15], by incorporating derivative information into the distance between two functional data, Meng et al. define a new distance and apply Lloyd's algorithm for the k-means problem to solve the functional k-means problem. Their numerical experiments show that this method is very efficient. However, there is a lack of theoretical analysis explaining how good or bad this algorithm can be; that is, the algorithm presented by Meng et al. is a heuristic. In this paper, we design an approximation algorithm for the functional k-means problem by using the new distance defined in [15] and applying the seeding algorithm for the k-means problem. The approximation ratio we obtain is 8(ln k + 2).
The rest of this paper is organized as follows. In Section 2, the functional k-means problem and some basic notations are presented. In Section 3, we introduce the approximation algorithm based on the seeding method, as well as our main result for the functional k-means problem. The proof of correctness of the algorithm is given in Section 4. In Section 5, numerical experiments on the seeding algorithm for the functional k-means problem are presented. Final remarks are given in Section 6.

2. Preliminaries. In this section, we present the definition of the functional k-means problem, as well as some symbols and notations used in the remainder of the paper. For convenience, their meanings are also summarized in Table 1.
In general, given two real numbers R_1 and R_2 with R_1 ≤ R_2, we use R to denote the interval from R_1 to R_2, i.e., R = [R_1, R_2]. For any t ∈ R, a function x(t) : R → ℝ is called a functional curve; it is a real-valued function and is assumed to be continuous. If x_1(t), x_2(t), ..., x_d(t) are functional curves with the same ground set R, then we obtain the functional sample X(t) = (x_1(t), x_2(t), ..., x_d(t))^T. We denote by F_d(t) the set of all d-dimensional functional samples whose functional curves share the same ground set. Given two functional samples X_i(t), X_j(t) ∈ F_d(t), their similarity metric can be defined in the following way:
$$d(X_i(t), X_j(t)) = \sqrt{\sum_{p=1}^{d} \int_{R} \left[ \left(x_p^i(t) - x_p^j(t)\right)^2 + \left(\nabla x_p^i(t) - \nabla x_p^j(t)\right)^2 \right] dt}, \quad (1)$$
where ∇x_p^i(t) denotes the first-order derivative of x_p^i(t), the p-th functional curve in the i-th functional sample X_i(t). Meng et al. [15] have shown that this similarity metric is a distance metric. Given a functional sample X(t) ∈ F_d(t) and a set of functional samples Γ(t) ⊆ F_d(t), if X(t)_{Γ(t)} is the nearest functional sample to X(t) in Γ(t), then the distance of X(t) to Γ(t) is given by d(X(t), Γ(t)) = d(X(t), X(t)_{Γ(t)}).
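To make the metric concrete, the following is a minimal numerical sketch (ours, not code from [15]). It assumes each functional curve is sampled on a uniform grid over R, approximates the first-order derivatives by finite differences, and approximates the integral in (1) by the trapezoidal rule.

```python
import numpy as np

def functional_distance(X_i, X_j, t):
    """Similarity metric (1) between two functional samples.

    X_i, X_j : arrays of shape (d, m) -- d functional curves sampled
               at the m grid points in t (a uniform grid over R).
    Derivatives are approximated by finite differences and the
    integral over R by the trapezoidal rule.
    """
    dX_i = np.gradient(X_i, t, axis=1)   # first-order derivatives
    dX_j = np.gradient(X_j, t, axis=1)
    integrand = (X_i - X_j) ** 2 + (dX_i - dX_j) ** 2
    # Sum over the d curves of the integral over R.
    return np.sqrt(np.trapz(integrand, t, axis=1).sum())

# Example: two 1-dimensional functional samples on [0, 1].
t = np.linspace(0.0, 1.0, 201)
X1 = np.sin(np.pi * t)[None, :]
X2 = np.cos(1.5 * np.pi * t)[None, :]
print(functional_distance(X1, X2, t))
```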
Assume that Γ(t) is a set of n functional samples in F_d(t) and C(t) is a set of k functional samples in F_d(t) (also called a clustering of functional samples, or a clustering for short). We define the loss function (or potential function) of Γ(t) over C(t) as the sum of the squared similarity distances of each functional sample of Γ(t) to the clustering C(t), i.e.,
$$\Phi(\Gamma(t), C(t)) = \sum_{X(t) \in \Gamma(t)} d^2(X(t), C(t)).$$
The functional k-means problem is to find a clustering C(t)*_{Γ(t)} minimizing the potential function. That is,
$$C(t)^*_{\Gamma(t)} = \arg\min_{C(t) \subseteq F_d(t),\, |C(t)| = k} \Phi(\Gamma(t), C(t)).$$
Given an optimal clustering C(t)*_{Γ(t)} of Γ(t), we write Φ*(Γ(t)) = Φ(Γ(t), C(t)*_{Γ(t)}) for the corresponding optimal cost. In particular, when k = 1, the optimal clustering of Γ(t), denoted by µ(Γ(t)), is given in [15] by
$$\mu(\Gamma(t)) = \frac{1}{n} \sum_{X(t) \in \Gamma(t)} X(t),$$
which is also called the center of mass of the functional samples Γ(t). We will present the corresponding result in Lemma 2.1, together with a simpler proof.
Given a clustering C(t) = {C_1(t), C_2(t), ..., C_k(t)} with k functional samples, Γ(t) can be partitioned into k parts according to the closest distance of the functional samples in Γ(t) to the clustering C(t). In fact, for any C_i(t) ∈ C(t), we use Γ^i_{C(t)}(t) to denote the cluster of functional samples consisting of those points that are closer to C_i(t) than to any other center, i.e.,
$$\Gamma^i_{C(t)}(t) = \{X(t) \in \Gamma(t) : d(X(t), C_i(t)) \le d(X(t), C_j(t)), \ \forall j \ne i\}.$$
Therefore, the loss function can be presented as follows:
$$\Phi(\Gamma(t), C(t)) = \sum_{i=1}^{k} \sum_{X(t) \in \Gamma^i_{C(t)}(t)} d^2(X(t), C_i(t)).$$
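As an illustration of the partition and the potential function, the following minimal sketch (ours) reuses the hypothetical functional_distance helper defined above.

```python
import numpy as np

def assign_clusters(samples, centers, t):
    """Partition the samples by nearest center under the functional metric."""
    labels = [min(range(len(centers)),
                  key=lambda i: functional_distance(X, centers[i], t))
              for X in samples]
    return np.array(labels)

def potential(samples, centers, t):
    """Loss function: sum of squared distances to the nearest center."""
    return sum(min(functional_distance(X, C, t) for C in centers) ** 2
               for X in samples)
```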
Lemma 2.1. Given a set Γ(t) = {X^1(t), ..., X^n(t)} of n functional samples in F_d(t) and an arbitrary functional sample C(t) ∈ F_d(t), we have
$$\sum_{j=1}^{n} d^2(X^j(t), C(t)) = \sum_{j=1}^{n} d^2(X^j(t), \mu(\Gamma(t))) + n \, d^2(\mu(\Gamma(t)), C(t)).$$

Proof. First, applying the definition of the metric between functional samples and writing µ(t) = µ(Γ(t)), we have
$$\sum_{j=1}^{n} d^2(X^j(t), C(t)) = \sum_{j=1}^{n} \sum_{l=1}^{d} \int_R \left[ \left(x_l^j(t) - c_l(t)\right)^2 + \left(\nabla x_l^j(t) - \nabla c_l(t)\right)^2 \right] dt. \quad (2)$$
Decomposing x_l^j(t) − c_l(t) = (x_l^j(t) − µ_l(t)) + (µ_l(t) − c_l(t)) (and similarly for the derivative terms) and expanding the squares, the right-hand side of (2) splits into the two sums in the statement of the lemma plus two cross terms. Then, according to the definition of µ(Γ(t)), we obtain the following result: for any l ∈ {1, 2, ..., d},
$$\mu_l(t) = \frac{1}{n} \sum_{j=1}^{n} x_l^j(t), \qquad \text{so that} \qquad \sum_{j=1}^{n} \left(x_l^j(t) - \mu_l(t)\right) = 0 \quad \text{and} \quad \sum_{j=1}^{n} \left(\nabla x_l^j(t) - \nabla \mu_l(t)\right) = 0.$$
Therefore, by the commutation property of summation and integration, the last two (cross) terms on the right-hand side of (2) are zero, which completes the proof.
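As a quick sanity check of Lemma 2.1 (our sketch, using the hypothetical functional_distance helper above), the identity can be verified numerically; since the finite-difference derivative and the trapezoidal rule are both linear, the discretized identity holds up to floating-point rounding.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 201)
# Five 2-dimensional functional samples and an arbitrary center C.
samples = [np.vstack([np.sin((i + 1) * t), np.cos(i * t)]) for i in range(5)]
C = np.vstack([t, 1.0 - t])

mu = sum(samples) / len(samples)  # center of mass of the samples
lhs = sum(functional_distance(X, C, t) ** 2 for X in samples)
rhs = (sum(functional_distance(X, mu, t) ** 2 for X in samples)
       + len(samples) * functional_distance(mu, C, t) ** 2)
print(np.isclose(lhs, rhs))  # True, up to floating-point rounding
```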
At last, we present the following property, which can be easily verified.

Property 2.2. Given any functional sample set Γ(t) in F_d(t) and two clusterings C(t) ⊆ C'(t) ⊆ F_d(t), we have Φ(Γ(t), C'(t)) ≤ Φ(Γ(t), C(t)). That is, when we add new elements to a clustering, the value of the potential function of Γ(t) over the new clustering will not increase.
3. The approximation algorithm based on the seeding method and our main result. In this section, we introduce an approximation algorithm based on the seeding method for the functional k-means problem. The motivation comes from the following two algorithms: the k-means algorithm solving the functional k-means problem and the seeding algorithm for the k-means problem. From Step 1 to Step 5, we choose an initial functional sample clustering C(t), consisting of k functional sample centers from the functional sample set Γ(t), with very specific probabilities; the idea is that the distances between centers should be as large as possible. Then, from Step 6 to Step 8, the set of observed functional samples Γ(t) is partitioned into k clusters according to the nearest similarity metric distance. In the following steps, from Step 9 to Step 11, we update the clustering by the result given in Lemma 2.1. In fact, the renewed clustering cannot increase the value of the potential function, as is also the case for Lloyd's method for the k-means problem [8]. Therefore, the clustering improves in each iteration, and the algorithm stops when the clustering no longer changes. Consequently, if we can bound the potential function of the initial clustering returned at Step 5, the clustering returned by the algorithm satisfies the same bound. A minimal sketch of the seeding steps is given below; we then present our main result in the following theorem.
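The following Python sketch (ours, not part of the paper's formal algorithm statement) illustrates the D²-style seeding of Steps 1 to 5, reusing the hypothetical functional_distance helper from Section 2; each new center is drawn from Γ(t) with probability proportional to its squared distance to the current clustering.

```python
import numpy as np

def seed_centers(samples, k, t, rng=None):
    """Steps 1-5: choose k initial centers from the samples.

    The first center is uniform; each subsequent center is drawn with
    probability d^2(X, C) / Phi(Gamma, C), favoring far-away samples.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(samples)
    centers = [samples[rng.integers(n)]]           # Step 1: uniform choice
    # Squared distance of every sample to the current clustering.
    d2 = np.array([functional_distance(X, centers[0], t) ** 2
                   for X in samples])
    for _ in range(1, k):                          # Steps 2-5
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(samples[idx])
        # Update the nearest-center squared distances.
        d2 = np.minimum(d2, np.array(
            [functional_distance(X, samples[idx], t) ** 2 for X in samples]))
    return centers
```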
Theorem 3.1. If C(t) is the clustering constructed by Algorithm 1 for Γ(t), then the corresponding potential function satisfies
$$E[\Phi(\Gamma(t), C(t))] \le 8(\ln k + 2)\,\Phi^*(\Gamma(t)).$$

Algorithm 1 The approximation algorithm based on the seeding method for the functional k-means problem.
Input: A set Γ(t) ⊆ F_d(t) with n functional samples, an integer k, and C(t) := ∅.
Output: An approximate functional k-means clustering C(t) of Γ(t).
1: Choose the first functional sample center C_1(t) uniformly at random from Γ(t), and set C(t) := {C_1(t)};
2: for i = 2 to k do
3:   Choose the functional sample center C_i(t) from Γ(t) with probability d²(C_i(t), C(t))/Φ(Γ(t), C(t));
4:   Set C(t) := C(t) ∪ {C_i(t)};
5: end for
6: for i = 1 to k do
7:   Compute the cluster Γ^i_{C(t)}(t) of the functional samples closest to C_i(t);
8: end for
9: for i = 1 to k do
10:   Update the functional sample clustering C(t) by setting C_i(t) := µ(Γ^i_{C(t)}(t)), which is the center functional sample of Γ^i_{C(t)}(t);
11: end for
12: Repeat Step 6 to Step 11 until C(t) no longer changes;
13: Return C(t).

4. Proof of correctness. In this section, we present the proof that the clustering C(t) returned by Algorithm 1 is an 8(ln k + 2)-approximate functional clustering for the functional k-means problem; the proof mainly follows that of k-means++ for the k-means problem. In fact, if we can prove the approximation ratio for the clustering returned by the seeding part (Step 1 to Step 5), then the result of Theorem 3.1 follows, because the value of the loss function cannot increase in the subsequent iterations, by the high-level structure of Algorithm 1. Therefore, we may restrict attention to the first k sampled functional centers in the following discussion. Since the group of clusters Γ_1(t)*, Γ_2(t)*, ..., Γ_k(t)* is a partition of Γ(t) and the first center is chosen uniformly from Γ(t), this chosen center must belong to one of these clusters; that is, there must exist one cluster that is joint with C(t). For this special joint cluster, if we cluster its functional samples to the first chosen center of Algorithm 1, rather than to its center of mass, the following lemma shows that the value of the loss function at most doubles.
Lemma 4.1. Take any functional sample cluster ∆(t) of Γ(t) with respect to the optimal clustering C(t)*_{Γ(t)}, and assume that C(t) is a clustering with only one center, which is sampled uniformly at random from ∆(t). Then
$$E[\Phi(\Delta(t), C(t))] = 2\,\Phi(\Delta(t), C(t)^*_{\Gamma(t)}).$$
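The paper's own proof of Lemma 4.1 is not reproduced here; the following short derivation (ours) indicates how the identity follows from Lemma 2.1, assuming, as above, that the optimal center of ∆(t) is its center of mass µ = µ(∆(t)).

```latex
% Sketch of Lemma 4.1, assuming the optimal center of \Delta(t) is \mu = \mu(\Delta(t)).
\begin{align*}
E[\Phi(\Delta(t), C(t))]
  &= \frac{1}{|\Delta(t)|} \sum_{Q(t) \in \Delta(t)} \; \sum_{X(t) \in \Delta(t)} d^2(X(t), Q(t)) \\
  &= \frac{1}{|\Delta(t)|} \sum_{Q(t) \in \Delta(t)}
     \Big( \sum_{X(t) \in \Delta(t)} d^2(X(t), \mu) + |\Delta(t)| \, d^2(\mu, Q(t)) \Big)
     && \text{(Lemma 2.1)} \\
  &= \sum_{X(t) \in \Delta(t)} d^2(X(t), \mu) + \sum_{Q(t) \in \Delta(t)} d^2(\mu, Q(t))
   = 2\,\Phi(\Delta(t), C(t)^*_{\Gamma(t)}).
\end{align*}
```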
Now, we continue bounding the cost when a new center of functional sample, chosen with a specific probability, is added to the current clustering.

Lemma 4.2. Take any functional sample cluster ∆(t) of Γ(t) with respect to an optimal solution C(t)*_{Γ(t)}. Suppose that C(t) is any clustering of Γ(t) with fewer than k centers. If we sample one center Q(t) from ∆(t) with probability d²(Q(t), C(t))/Φ(∆(t), C(t)) and add Q(t) to C(t), we have
$$E[\Phi(\Delta(t), C(t) \cup \{Q(t)\})] \le 8\,\Phi(\Delta(t), C(t)^*_{\Gamma(t)}).$$

Proof. By the rule of clustering, each functional sample in ∆(t) will be clustered either to a functional sample in C(t) or to Q(t) after adding Q(t) to C(t). Therefore, the contribution of ∆(t) to C(t) ∪ {Q(t)} can be expressed by
$$\Phi(\Delta(t), C(t) \cup \{Q(t)\}) = \sum_{X(t) \in \Delta(t)} \min\{d^2(X(t), C(t)), d^2(X(t), Q(t))\}.$$
Then, we obtain the following result by the mathematical expectation of the discrete random variable Q(t):
$$E[\Phi(\Delta(t), C(t) \cup \{Q(t)\})] = \sum_{Q(t) \in \Delta(t)} \frac{d^2(Q(t), C(t))}{\sum_{X(t) \in \Delta(t)} d^2(X(t), C(t))} \cdot \sum_{X(t) \in \Delta(t)} \min\{d^2(X(t), C(t)), d^2(X(t), Q(t))\}. \quad (3)$$
Since d(·, ·) is a distance metric, the triangle inequality gives d(Q(t), C(t)) ≤ d(Q(t), X(t)) + d(X(t), C(t)) for any X(t) ∈ ∆(t), and hence, by the power-mean inequality, d²(Q(t), C(t)) ≤ 2d²(X(t), Q(t)) + 2d²(X(t), C(t)). Summing over all X(t) ∈ ∆(t), we have
$$d^2(Q(t), C(t)) \le \frac{2}{|\Delta(t)|} \sum_{X(t) \in \Delta(t)} d^2(X(t), Q(t)) + \frac{2}{|\Delta(t)|} \sum_{X(t) \in \Delta(t)} d^2(X(t), C(t)).$$
Thus, (3) can be relaxed as follows:
$$E[\Phi(\Delta(t), C(t) \cup \{Q(t)\})] \le \frac{2}{|\Delta(t)|} \sum_{Q(t) \in \Delta(t)} \sum_{X(t) \in \Delta(t)} \min\{d^2(X(t), C(t)), d^2(X(t), Q(t))\} + \frac{2}{|\Delta(t)|} \sum_{Q(t) \in \Delta(t)} \frac{\sum_{X(t) \in \Delta(t)} d^2(X(t), Q(t))}{\sum_{X(t) \in \Delta(t)} d^2(X(t), C(t))} \cdot \sum_{X(t) \in \Delta(t)} \min\{d^2(X(t), C(t)), d^2(X(t), Q(t))\}. \quad (4)$$
The first term on the right-hand side of inequality (4) can be relaxed, using min{d²(X(t), C(t)), d²(X(t), Q(t))} ≤ d²(X(t), Q(t)), to (2/|∆(t)|) Σ_{Q(t)∈∆(t)} Σ_{X(t)∈∆(t)} d²(X(t), Q(t)), which is less than or equal to 4Φ(∆(t), C(t)*_{Γ(t)}) by Lemma 2.1. The second term on the right-hand side of inequality (4) also can be relaxed to the same quantity because of min{d²(X(t), C(t)), d²(X(t), Q(t))} ≤ d²(X(t), C(t)), which cancels the denominator. Therefore, (3) can be relaxed in the following way:
$$E[\Phi(\Delta(t), C(t) \cup \{Q(t)\})] \le 8\,\Phi(\Delta(t), C(t)^*_{\Gamma(t)}),$$
which completes the proof.

From the above lemmas, we find that the approximation guarantee is a constant if the k centers are chosen from different clusters of a partition of Γ(t) with respect to an optimal solution C(t)*_{Γ(t)}. In the following lemma, we consider the case where more than one center is sampled from the same cluster. Assume Γ_1(t)*, Γ_2(t)*, ..., Γ_k(t)* are the k optimal functional sample clusters with respect to the optimal solution C(t)*_{Γ(t)}. We distinguish them into joint and disjoint clusters according to their relationship with the current clustering C(t) returned by Algorithm 1. For any i ∈ {1, 2, ..., k}, if Γ_i(t)* ∩ C(t) = ∅, then Γ_i(t)* is called a disjoint cluster of C(t); otherwise, it is called a joint cluster of C(t). We use q to denote the number of disjoint functional sample clusters of C(t), N_{C(t)} = ∪{Γ_i(t)* : Γ_i(t)* ∩ C(t) = ∅} to denote the set of all functional samples in the disjoint clusters, and D_{C(t)} = Γ(t) \ N_{C(t)} to denote the set of functional samples in the joint clusters.

Lemma 4.3. Suppose that the current clustering C(t) has q disjoint clusters, and let C'(t) be the clustering obtained from C(t) by adding m ≤ q centers sampled as in Steps 2 to 5 of Algorithm 1. Then
$$E[\Phi(\Gamma(t), C'(t))] \le \left(\Phi(D_{C(t)}, C(t)) + 8\,\Phi(N_{C(t)}, C(t)^*_{\Gamma(t)})\right)(1 + H_m) + \frac{q - m}{q}\,\Phi(N_{C(t)}, C(t)),$$
where H_m = 1 + 1/2 + ··· + 1/m denotes the harmonic number (with H_0 = 0).
Proof. The proof of this lemma is given by induction: assuming that the lemma holds for (q, m − 1) and (q − 1, m − 1), we show that it also holds for (q, m). First, we check two special cases. One is that no functional sample is chosen (i.e., m = 0) while q > 0; the other is (q, m) = (1, 1). Since the former special case follows trivially from Property 2.2, we only need to consider the case q = m = 1: there is exactly one disjoint (optimal) cluster with respect to C(t), and one functional sample, denoted by Q(t), is chosen and added to C(t). Therefore, we need to prove that the following inequality holds:
$$E[\Phi(\Gamma(t), C(t) \cup \{Q(t)\})] \le 2\left(\Phi(D_{C(t)}, C(t)) + 8\,\Phi(N_{C(t)}, C(t)^*_{\Gamma(t)})\right). \quad (5)$$
We prove (5) by distinguishing whether the chosen functional sample Q(t) belongs to N_{C(t)} or to D_{C(t)}. If Q(t) ∈ D_{C(t)}, which happens with probability Φ(D_{C(t)}, C(t))/Φ(Γ(t), C(t)), then by Property 2.2 the new potential is at most Φ(Γ(t), C(t)) = Φ(D_{C(t)}, C(t)) + Φ(N_{C(t)}, C(t)). If Q(t) ∈ N_{C(t)}, which happens with probability Φ(N_{C(t)}, C(t))/Φ(Γ(t), C(t)), then by Lemma 4.2 the conditional expectation of the new potential is at most Φ(D_{C(t)}, C(t)) + 8Φ(N_{C(t)}, C(t)*_{Γ(t)}). Combining the two cases,
$$E[\Phi(\Gamma(t), C(t) \cup \{Q(t)\})] \le \frac{\Phi(D_{C(t)}, C(t))}{\Phi(\Gamma(t), C(t))}\,\Phi(\Gamma(t), C(t)) + \frac{\Phi(N_{C(t)}, C(t))}{\Phi(\Gamma(t), C(t))}\left(\Phi(D_{C(t)}, C(t)) + 8\,\Phi(N_{C(t)}, C(t)^*_{\Gamma(t)})\right) \le 2\,\Phi(D_{C(t)}, C(t)) + 8\,\Phi(N_{C(t)}, C(t)^*_{\Gamma(t)}),$$
and thus (5) is obtained. From now on, we show the inductive step. The main idea is similar to that of the case (q, m) = (1, 1): we condition on whether the first chosen functional sample comes from a disjoint cluster or a joint cluster. Without loss of generality, we assume the functional samples added to C(t) are ordered as Q_1(t), Q_2(t), ..., Q_m(t), and we write C_1(t) = C(t) ∪ {Q_1(t)} and C'(t) = C(t) ∪ {Q_1(t), ..., Q_m(t)}. By the law of total expectation,
$$E[\Phi(\Gamma(t), C'(t))] = \Pr[Q_1(t) \in N_{C(t)}]\,E[\Phi(\Gamma(t), C'(t)) \mid Q_1(t) \in N_{C(t)}] + \Pr[Q_1(t) \in D_{C(t)}]\,E[\Phi(\Gamma(t), C'(t)) \mid Q_1(t) \in D_{C(t)}]. \quad (6)$$
In the following, we bound the two conditional expectations in two cases.

Case 1 (Q_1(t) ∈ N_{C(t)}). Since Q_1(t) ∈ N_{C(t)}, both the joint and the disjoint clusters change when the clustering changes from C(t) to C_1(t). Let ∆(t) denote the disjoint (optimal) cluster containing Q_1(t); then D_{C_1(t)} = D_{C(t)} ∪ ∆(t), N_{C_1(t)} = N_{C(t)} \ ∆(t), and the probability that a particular Q_1(t) is chosen as the first added functional sample is Pr_{Q(t)} = d²(Q_1(t), C(t))/Φ(Γ(t), C(t)). By Lemma 4.2, we can easily obtain the following result:
$$E[\Phi(\Delta(t), C_1(t)) \mid Q_1(t) \in \Delta(t)] \le 8\,\Phi(\Delta(t), C(t)^*_{\Gamma(t)}). \quad (7)$$
Starting from C_1(t), there are q − 1 disjoint clusters, and m − 1 functional samples remain to be added to reach C'(t). Applying the induction hypothesis for (q − 1, m − 1) to C_1(t), using Property 2.2 to bound Φ(D_{C_1(t)}, C_1(t)) ≤ Φ(D_{C(t)}, C(t)) + Φ(∆(t), C_1(t)) and Φ(N_{C_1(t)}, C_1(t)) ≤ Φ(N_{C(t)}, C(t)) − Φ(∆(t), C(t)), and then taking the expectation over the choice of Q_1(t) within ∆(t) together with (7), we obtain
$$E[\Phi(\Gamma(t), C'(t)) \mid Q_1(t) \in \Delta(t)] \le \left(\Phi(D_{C(t)}, C(t)) + 8\,\Phi(N_{C(t)}, C(t)^*_{\Gamma(t)})\right)(1 + H_{m-1}) + \frac{q - m}{q - 1}\left(\Phi(N_{C(t)}, C(t)) - \Phi(\Delta(t), C(t))\right).$$
Since Pr[Q_1(t) ∈ ∆(t) | Q_1(t) ∈ N_{C(t)}] = Φ(∆(t), C(t))/Φ(N_{C(t)}, C(t)), summing over all disjoint clusters ∆(t) ⊆ N_{C(t)} and applying the power-mean inequality Σ_{∆(t)} Φ(∆(t), C(t))² ≥ Φ(N_{C(t)}, C(t))²/q, we get
$$E[\Phi(\Gamma(t), C'(t)) \mid Q_1(t) \in N_{C(t)}] \le \left(\Phi(D_{C(t)}, C(t)) + 8\,\Phi(N_{C(t)}, C(t)^*_{\Gamma(t)})\right)(1 + H_{m-1}) + \frac{q - m}{q}\,\Phi(N_{C(t)}, C(t)).$$
The proof of Case 1 is finished.

Case 2 (Q_1(t) ∈ D_{C(t)}). Since Q_1(t) ∈ D_{C(t)}, the clustering C_1(t) does not change the classification of joint and disjoint clusters; that is, D_{C_1(t)} = D_{C(t)} and N_{C_1(t)} = N_{C(t)}. Moreover, after adding Q_1(t) to C(t), the clustering C'(t) is obtained from C_1(t) by adding m − 1 functional samples, and there are still q disjoint clusters with respect to C_1(t). Therefore, by the induction hypothesis for (q, m − 1) and Property 2.2, we easily get
$$E[\Phi(\Gamma(t), C'(t)) \mid Q_1(t) \in D_{C(t)}] \le \left(\Phi(D_{C(t)}, C(t)) + 8\,\Phi(N_{C(t)}, C(t)^*_{\Gamma(t)})\right)(1 + H_{m-1}) + \frac{q - m + 1}{q}\,\Phi(N_{C(t)}, C(t)).$$
The proof of Case 2 is finished. Then, by taking (6), Case 1, and Case 2 together, and noting that Pr[Q_1(t) ∈ D_{C(t)}]·Φ(N_{C(t)}, C(t)) = Φ(D_{C(t)}, C(t))Φ(N_{C(t)}, C(t))/Φ(Γ(t), C(t)) ≤ Φ(D_{C(t)}, C(t)) and 1/q ≤ 1/m, we have
$$E[\Phi(\Gamma(t), C'(t))] \le \left(\Phi(D_{C(t)}, C(t)) + 8\,\Phi(N_{C(t)}, C(t)^*_{\Gamma(t)})\right)\left(1 + H_{m-1} + \frac{1}{m}\right) + \frac{q - m}{q}\,\Phi(N_{C(t)}, C(t)).$$
Therefore, since H_{m−1} + 1/m = H_m, the lemma holds for (q, m), which completes the induction.

Now, we can present the proof of Theorem 3.1 by using the above lemmas.

Proof of Theorem 3.1. We denote by C(t) the functional sample clustering returned by the seeding part of Algorithm 1, whose elements are sampled from Step 1 to Step 5; by Property 2.2, it suffices to bound E[Φ(Γ(t), C(t))]. Suppose that the current functional clustering is C_1(t), which has only one functional sample center Q(t), i.e., C_1(t) = {Q(t)}. By Algorithm 1, Q(t) is chosen uniformly at random, and Q(t) belongs to some optimal cluster ∆(t). Then there are q = k − 1 disjoint optimal clusters, D_{C_1(t)} = ∆(t), and N_{C_1(t)} = Γ(t) \ ∆(t). Therefore, by Lemma 4.3 applied with q = m = k − 1 and Lemma 4.1, we can easily obtain the following result:
$$E[\Phi(\Gamma(t), C(t))] \le \left(E[\Phi(\Delta(t), C_1(t))] + 8\,\Phi(\Gamma(t) \setminus \Delta(t), C(t)^*_{\Gamma(t)})\right)(1 + H_{k-1}) = \left(2\,\Phi(\Delta(t), C(t)^*_{\Gamma(t)}) + 8\,\Phi(\Gamma(t) \setminus \Delta(t), C(t)^*_{\Gamma(t)})\right)(1 + H_{k-1}) \le 8\,\Phi^*(\Gamma(t))(1 + H_{k-1}) \le 8(\ln k + 2)\,\Phi^*(\Gamma(t)),$$
where the last inequality uses H_{k−1} ≤ 1 + ln k. This completes the proof.
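For completeness, the following minimal sketch (ours, not the authors' implementation) assembles Algorithm 1 end to end from the hypothetical helpers defined earlier (functional_distance, assign_clusters, seed_centers).

```python
import numpy as np

def functional_kmeans(samples, k, t, rng=None):
    """Algorithm 1: D^2 seeding (Steps 1-5) followed by Lloyd-type updates."""
    rng = np.random.default_rng() if rng is None else rng
    centers = seed_centers(samples, k, t, rng)          # Steps 1-5
    prev_labels = None
    while True:
        labels = assign_clusters(samples, centers, t)   # Steps 6-8
        if prev_labels is not None and np.array_equal(labels, prev_labels):
            return centers, labels                      # Steps 12-13: stable
        # Steps 9-11: move each center to the center of mass of its cluster.
        centers = [
            np.mean([samples[j] for j in np.flatnonzero(labels == i)], axis=0)
            if np.any(labels == i) else centers[i]      # keep center if empty
            for i in range(k)
        ]
        prev_labels = labels
```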

5. Numerical experiments. In this section, we test the approximation algorithm based on the seeding method (Algorithm 1) on the data sets Simudata [20] and Sdata [15], and compare it with the standard functional k-means clustering algorithm presented in [15]. We first introduce the two data sets as follows.

• Sdata
There are three clusters in this data set, generated from the functions X_1(t) = cos(1.5πt) + ε(t), X_2(t) = sin(1.5πt) + ε(t), and X_3(t) = sin(πt) + ε(t), where ε(t) is white noise with expectation E[ε(t)] = 0 and variance Var[ε(t)] = 1. Each cluster includes 100 functional curves on [0, 1].

Since we need to calculate derivatives in our distance, we first smooth the functional data using third-order polynomial fitting. We then use two measures to evaluate the effectiveness of the two algorithms: the adjusted Rand index (ARI) and the Davies-Bouldin index (DBI). ARI is an external clustering validation index taking values between 0 and 1. A larger ARI indicates higher consistency between the clustering result and the real class labels; in particular, an ARI of 1 indicates that the clustering result coincides with the real class labels. DBI is an internal clustering validation index; a smaller DBI indicates lower between-cluster similarity, i.e., better-separated clusters. We also report the costs of the initial and returned solutions, as well as the running times of the algorithms. The results are summarized in Table 2, where the notations SeedAlg and FuncAlg denote Algorithm 1 and the functional k-means algorithm in [15], respectively. From these results, we find that the clustering accuracies of the two algorithms are very close, since the relative differences of the ARIs, DBIs, and returned costs of the two algorithms are less than 1.4%. Nevertheless, SeedAlg produces a better initial solution (its value is 38.9% less than that of FuncAlg for Simudata, and 32.9% less for Sdata) and consumes less time (5.2% less than FuncAlg for Simudata, and 11.3% less for Sdata), implying that SeedAlg is competitive in efficiency.

6. Conclusions. At the beginning of this paper, we introduce a new proof of a property of the functional k-means problem when k = 1, using a distance that incorporates derivative information. Then we design an O(ln k)-approximation algorithm based on the seeding method for the functional k-means problem. Moreover, we present numerical experiments showing the effectiveness of this new algorithm compared with the standard functional k-means algorithm, whose first clustering is chosen at random. The results of the numerical experiments show that our algorithm produces a better initial solution and runs faster. In the future, we plan to study parallel seeding algorithms, bi-criteria algorithms, and local search methods for the functional k-means problem.