Sparse generalized canonical correlation analysis via linearized Bregman method

Canonical correlation analysis (CCA) is a powerful statistical tool for detecting mutual information between two sets of multi-dimensional random variables. Generalized CCA (GCCA), a natural extension of CCA, can detect relations among more than two datasets. To make the canonical variates more interpretable, this paper proposes a novel sparse GCCA algorithm based on the linearized Bregman method, which generalizes traditional sparse CCA methods. Experimental results on both a synthetic dataset and real datasets demonstrate the effectiveness and efficiency of the proposed algorithm compared with several state-of-the-art sparse CCA and deep CCA algorithms.


1. Introduction. Canonical correlation analysis (CCA) [16] is a celebrated data analysis tool for detecting the correlation between two sets of multidimensional variables. The two sets of variables can be regarded as two distinct objects or as two views of the same object. CCA aims at finding two canonical variates (weight vectors) such that the projected variables in the lower-dimensional space are maximally correlated. CCA has therefore been widely used in many branches of science and technology, including cross-language document retrieval [23], genomic data analysis [26], and functional magnetic resonance imaging [11]. However, CCA cannot detect the mutual information among more than two datasets. A natural extension of CCA, namely generalized canonical correlation analysis (GCCA) [15,18,22], was proposed to handle this limitation of CCA; tensor CCA [19] was also proposed to deal with multiple datasets.
In practice, a large portion of the features is not informative in high-dimensional data analysis, yet canonical loadings are typically not sparse. To make the canonical loadings more interpretable, many sparse CCA algorithms have been proposed in the literature: a sparse penalized CCA algorithm [24], a sparse CCA method based on penalized matrix decomposition [25], and a sparse CCA method based on a primal-dual framework [13] have all been investigated. Moreover, Chu et al. selected the sparsest CCA solutions from a subset of all solutions via the linearized Bregman method [9]. Chen et al. developed a precision-adjusted iterative thresholding method to estimate the sparse canonical weights [8], while Gao et al. investigated a two-stage sparse CCA method [12], in which the initialization stage is solved by the Alternating Direction Method of Multipliers (ADMM) and a group-Lasso based method is then utilized to find the sparse weights in the refinement stage. Sparse GCCA algorithms, however, are far less mature, although one was introduced in [17].
In this paper, based on the model presented in [22], we propose a novel sparse GCCA algorithm. We first convert the GCCA problem into a linear system of equations, and then apply the linearized Bregman method to pursue sparsity. Experiments on both a simulated dataset and real-world datasets demonstrate the effectiveness of the proposed algorithm.
The rest of this paper is organized as follows. We first introduce some preliminaries about CCA and GCCA, and then present the sparse GCCA algorithm. Next, we report the experimental studies. Finally, we conclude the paper and discuss future work.

2.2. Generalized canonical correlation analysis. To analyze the case with more than two views, the sum-of-correlations (SUMCOR) formulation of generalized CCA was proposed [7]:
$$\max_{W_1,\dots,W_J} \sum_{i \neq j} \operatorname{tr}\!\left(W_i^T X_i X_j^T W_j\right) \quad \text{s.t.} \quad W_j^T X_j X_j^T W_j = I, \; j = 1,\dots,J, \tag{2.2}$$
where $J$ stands for the number of views; this problem has been shown to be NP-hard. Another formulation of GCCA seeks a common latent representation of the different views [18]:
$$\min_{G, W_1,\dots,W_J} \sum_{j=1}^{J} \left\|G - W_j^T X_j\right\|_F^2 \quad \text{s.t.} \quad GG^T = I, \tag{2.3}$$
where $G \in \mathbb{R}^{\ell \times m}$ is a common latent representation of the different views. Problem (2.3) is also called the MAX-VAR formulation of GCCA, since the optimal solution amounts to taking principal eigenvectors of a matrix aggregated from the correlation matrices of the views. In fact, when $J = 2$, problem (2.3) is equivalent to classical CCA. Notice that
$$\sum_{j=1}^{J}\left\|G - W_j^T X_j\right\|_F^2 = \sum_{j=1}^{J}\left(\operatorname{tr}(GG^T) - 2\operatorname{tr}(G X_j^T W_j) + \operatorname{tr}(W_j^T X_j X_j^T W_j)\right);$$
by considering the constraints in Equation (2.2) and $GG^T = I$, minimizing (2.3) amounts to maximizing
$$\sum_{j=1}^{J} \operatorname{tr}\!\left(G X_j^T W_j\right). \tag{2.4}$$
Therefore the target function in Equation (2.4) is a relaxation of the target function in (2.2).
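To make the MAX-VAR formulation (2.3) concrete, the following is a minimal numpy sketch; it relies on the eigenvector characterization of the optimal $G$ noted above (and derived in Section 3). The function and variable names (maxvar_gcca, n_components) are illustrative, not from the paper.

```python
import numpy as np

def maxvar_gcca(views, n_components):
    """MAX-VAR GCCA sketch: views[j] is an (n_j x m) matrix with m shared samples."""
    m = views[0].shape[1]
    # Aggregate M = sum_j Q_j Q_j^T, where X_j = P_j S_j Q_j^T is a reduced SVD;
    # Q_j Q_j^T is the orthogonal projector onto the row space of X_j.
    M = np.zeros((m, m))
    for X in views:
        _, _, Qt = np.linalg.svd(X, full_matrices=False)
        M += Qt.T @ Qt
    # Rows of G are leading eigenvectors of M (np.linalg.eigh sorts ascending).
    _, evecs = np.linalg.eigh(M)
    G = evecs[:, -n_components:].T          # (n_components x m), so G G^T = I
    # Each W_j solves min_W ||G - W^T X_j||_F^2 in the least-squares sense.
    Ws = [np.linalg.lstsq(X.T, G.T, rcond=None)[0] for X in views]
    return G, Ws

# Toy usage: three random views of 50 shared samples.
rng = np.random.default_rng(0)
views = [rng.standard_normal((n_j, 50)) for n_j in (20, 30, 25)]
G, Ws = maxvar_gcca(views, n_components=3)
```

The sparse variant developed in the next section replaces the least-squares step above by an $\ell_1$-regularized solve.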
3. Sparse GCCA algorithm. In order to derive a sparse GCCA algorithm, we need to reformulate problem (2.3). Firstly, we consider how to minimize $\|G - W_j^T X_j\|_F^2$ for each fixed $j$ ($j = 1, \dots, J$). Obviously,
$$\left\|G - W_j^T X_j\right\|_F^2 = \operatorname{tr}(GG^T) - 2\operatorname{tr}(G X_j^T W_j) + \operatorname{tr}(W_j^T X_j X_j^T W_j);$$
taking derivatives with respect to $W_j$ and setting them to zero, we can see that
$$X_j X_j^T W_j = X_j G^T.$$
In practice, $X_j$ is normally a low-rank matrix, e.g., the term-document matrix in Natural Language Processing (NLP). Therefore, to give a compact solution for $W_j$, we use the compact approximation forms of $X_j$. Let the reduced singular value decompositions (SVDs) of $X_j$ ($j = 1, \dots, J$) be
$$X_j = P_j \Sigma_j Q_j^T, \tag{3.1}$$
where $P_j \in \mathbb{R}^{n_j \times r_j}$, $Q_j \in \mathbb{R}^{m \times r_j}$. Then $X_j X_j^T = P_j \Sigma_j^2 P_j^T$ ($j = 1, \dots, J$), and the optimality conditions can be converted into the compact linear systems
$$\Sigma_j P_j^T W_j = Q_j^T G^T, \quad j = 1, \dots, J. \tag{3.2}$$
Stacking the $J$ systems in (3.2) into a single block-diagonal system $AW = B$, we come to
$$\min_{W} \|W\|_1 \quad \text{s.t.} \quad AW = B, \tag{3.3}$$
where $n = n_1 + \cdots + n_J$, $W \in \mathbb{R}^{n \times \ell}$ stacks $W_1, \dots, W_J$, and $\|W\|_1 = \sum_{s=1}^{n} \sum_{\tau=1}^{\ell} |W(s, \tau)|$.

However, there remains the problem of how to compute $G$. Here, we consider the situation where $G$ is fixed and employ the conclusion described in [18] directly, namely that the rows of $G$ are the eigenvectors of the matrix $M = \sum_{j=1}^{J} X_j^T (X_j X_j^T)^{-1} X_j$. Normally, $X_j X_j^T$ is not invertible; in such situations we use a regularization technique or other schemes to modify this term. Notice that Reference [22] and this paper are both based on Equation (2.3); however, [22] converts Equation (2.3) into a generalized eigenvalue problem and uses a least squares method to find the canonical variates, while here we import sparsity, convert Equation (2.3) into Equation (3.3), and then utilize the linearized Bregman method to obtain the canonical variates, as elucidated in the sequel. Recalling the SVDs of $X_j$ in (3.1) and $X_j X_j^T = P_j \Sigma_j^2 P_j^T$, a simple computation leads to
$$M = \sum_{j=1}^{J} Q_j Q_j^T. \tag{3.4}$$
Now the problem is how to solve (3.3). Many numerical methods have been used to solve the vector counterpart
$$\min_{w \in \mathbb{R}^n} \|w\|_1 \quad \text{s.t.} \quad Aw = b.$$
Among these methods, the linearized Bregman method [27] is one of the most powerful approaches for this type of problem. Its iteration procedure can be stated in the compact form
$$v^{k+1} = v^k + A^T (b - A w^k), \qquad w^{k+1} = \delta \cdot \operatorname{shrink}(v^{k+1}, \mu), \tag{3.5}$$
with $v^0 = w^0 = 0$, where $\operatorname{shrink}(v, \mu)_i = \operatorname{sign}(v_i) \max(|v_i| - \mu, 0)$ is the entrywise soft-thresholding operator and $\delta, \mu > 0$ are parameters.
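As a reference point, here is a minimal numpy sketch of iteration (3.5) for the vector problem $\min \|w\|_1$ s.t. $Aw = b$. The default step size is a conservative assumption (the paper's choice $\delta = 0.9$ presumes appropriately scaled systems), and the names are illustrative.

```python
import numpy as np

def shrink(v, mu):
    # Entrywise soft-thresholding: sign(v) * max(|v| - mu, 0).
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

def linearized_bregman(A, b, mu=10.0, delta=None, tol=1e-6, max_iter=50000):
    # Compact linearized Bregman iteration (3.5) with v^0 = w^0 = 0.
    if delta is None:
        # Conservative step size; convergence requires delta small
        # relative to ||A||_2^2.
        delta = 1.0 / np.linalg.norm(A, 2) ** 2
    w = np.zeros(A.shape[1])
    v = np.zeros(A.shape[1])
    norm_b = np.linalg.norm(b)
    for _ in range(max_iter):
        v += A.T @ (b - A @ w)        # accumulate the (dual) residual
        w = delta * shrink(v, mu)     # scaled soft-thresholding step
        if np.linalg.norm(A @ w - b) <= tol * norm_b:
            break
    return w

# Toy sparse recovery: a 3-sparse vector from 60 random measurements.
rng = np.random.default_rng(1)
A = rng.standard_normal((60, 200))
w_true = np.zeros(200); w_true[[3, 50, 120]] = [2.0, -1.5, 3.0]
w_hat = linearized_bregman(A, A @ w_true)
```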
Here, we extend the linearized Bregman iteration (3.5) to the matrix problem stated in (3.3), and arrive at Algorithm 1. In Algorithm 1, the following stopping criterion can be adopted:
$$\frac{\|A W^k - B\|_F}{\|B\|_F} \leq \varepsilon,$$
where $0 < \varepsilon < 1$ is a tolerance parameter. Actually, one can choose a distinct $\varepsilon$ for each view. The algorithm is consistent with the two-view case when choosing $J = 2$.

Definition 1. The linearized Bregman based sparse GCCA (SGCCA) algorithm for computing the elements of the matrix $W$ is defined by
$$V^{k+1} = V^k + A^T (B - A W^k), \qquad W^{k+1} = \delta \cdot \operatorname{shrink}(V^{k+1}, \mu), \tag{3.6}$$
with $V^0 = W^0 = 0$, where $\operatorname{shrink}$ acts entrywise as in (3.5).

The convergence of the proposed Algorithm 1 is presented in the following theorem.

Theorem 1. The sequence $\{W^k\}$ generated by (3.6) converges to the unique solution $W^*_\mu$ of
$$\min_{W} \left\{ \mu \|W\|_1 + \frac{1}{2\delta} \|W\|_F^2 : AW = B \right\},$$
i.e., $\lim_{k \to \infty} \|W^k - W^*_\mu\|_F = 0$. Here $W^*$ denotes the solution of the optimization problem (3.3) that has the minimal Frobenius norm among all solutions of (3.3); as $\mu \to \infty$, $W^*_\mu$ approaches $W^*$.

4. Experiments.

4.1. Experiment on synthetic dataset. In this part, we constructed a synthetic dataset to investigate the consistency of the proposed sparse algorithm for the GCCA problem. We constructed three matrices X, Y, and Z, each generated from a common random vector $u \in \mathbb{R}^{100}$ whose entries are drawn from the standard normal distribution, and corrupted with three random noise matrices.

Settings. We randomly selected half of the data as training data and used the rest for testing; this procedure was repeated 30 times, and we report the average results. GCCA was used as the baseline algorithm. For WGCCA [2], Gaussian weights were utilized, while for the Deep GCCA [3] method we chose epoch = 100.

Results. Plots of X, Y and Z are presented in Figure 1. Table 1 shows the comparison results for the sparsity and the reconstruction error on both the training data and the testing data, where the reconstruction error for testing is defined as
$$\sum_{j=1}^{J} \left\| G - W_j^T X_j^{ts} \right\|_F^2.$$
Note that for multiple datasets, the summation of correlation coefficients is not an appropriate measure of the quality of a GCCA algorithm. To give a more comprehensive understanding of generalized CCA methods, we also report the above quantity for the training data, i.e., replacing $X_j^{ts}$ with $X_j^{tr}$ for $j = 1, \dots, J$. We can see that only WGCCA and sparse GCCA achieve sparse results. Table 1 demonstrates that sparse GCCA obtains the highest sparsity and a comparably small reconstruction error. Deep GCCA obtains the smallest reconstruction error on the testing data but a higher reconstruction error on the training data, which suggests that Deep GCCA does not learn a good representation of $G$ and the canonical variates $W_j$ ($j = 1, \dots, J$).
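To make the reported quantities concrete, here is a minimal sketch of the reconstruction error above together with one common definition of sparsity; the zero-threshold and the exact sparsity definition used in Table 1 are assumptions, as the paper does not reproduce them here.

```python
import numpy as np

def reconstruction_error(G, Ws, views):
    # MAX-VAR objective (2.3): sum_j ||G - W_j^T X_j||_F^2, where views[j]
    # is the (n_j x m) data matrix of the j-th view (train or test split).
    return sum(np.linalg.norm(G - W.T @ X, 'fro') ** 2
               for W, X in zip(Ws, views))

def sparsity(Ws, tol=1e-8):
    # Fraction of (numerically) zero entries across all canonical variates.
    total = sum(W.size for W in Ws)
    zeros = sum(int((np.abs(W) < tol).sum()) for W in Ws)
    return zeros / total
```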

4.2. Experiment on real datasets.
Our experiments are performed on the following five datasets: the first four are from the gene database, and the remaining one is from the JRC-Acquis database. The details are explained below, and the statistics can be found in Table 2.
• Leukemia: gene expression values for 72 samples (47 samples from patients with acute lymphoblastic leukemia, and 25 from patients with acute myeloblastic leukemia).
• Lymphoma: 42 samples of diffuse large B-cell lymphoma, 9 observations of follicular lymphoma, and 11 cases of chronic lymphocytic leukemia; the total sample size is 62, with expression values for 4026 well-measured genes.
• Brain: 42 microarray gene expression profiles from five different tumors of the central nervous system.
• JRC-Acquis [21]: a collection of parallel texts in the following 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian and Swedish (Figure 2 shows a sample French-language document from this corpus).

Settings. The preprocessing procedure for the gene data is described in [10]. For the JRC-Acquis database, we first extracted the "body" part of each XML file and then obtained a bag-of-words representation using the Term Frequency-Inverse Document Frequency (TF-IDF) approach, which is effective for document retrieval tasks (a minimal code sketch of this step is given at the end of this section). After removing numbers, stop-words (English, Spanish, and French, respectively), and rare words (appearing less than twice), we obtained a 12421 × 1000 "English" matrix, a 17203 × 1000 "Spanish" matrix, and a 15650 × 1000 "French" matrix.

For the experiments on the gene datasets, we compared sparse GCCA with CCA, PMD CCA [24], Deep CCA [1], and GCCA, where CCA was considered the baseline algorithm. The two parameters controlling the sparsity of the PMD CCA method were selected from the candidate set 0.01 : 0.01 : 0.5 (between 0.01 and 0.5 with a step increment of 0.01). For the Deep CCA algorithm, we chose epoch = 100 and selected the learning rate from the candidate set {1e−3, 1e−2, 1e−1}. The batch size was chosen as the integer part of ratio × #trnum, where the ratio was selected from the candidate set {1/2, 1/4, 1/8}. For the proposed algorithm, we chose δ = 0.9 and µ = 10.

For the experiments on the JRC-Acquis database, we compared the proposed method with GCCA, Deep GCCA, and WGCCA, where GCCA was considered the baseline algorithm. The parameters for the WGCCA algorithm were the same as stated above. Similarly, we randomly selected half of the data as training data and used the rest for testing; this procedure was repeated 30 times, and we report the average results.

Results. Experimental results for the gene datasets are reported in Table 3, while results for the JRC-Acquis database are reported in Table 4. Table 3 indicates that the proposed sparse GCCA algorithm achieves the highest sparsity, followed by the PMD CCA method (the only exception is the Prostate dataset, where the difference is subtle, only 0.05%), together with comparable accuracy. Accuracy is defined as
$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \delta(ol_i, pl_i),$$
where $\delta(y_1, y_2)$ is the indicator function that equals 1 if $y_1 = y_2$ and 0 otherwise; for a given sample point $x_i$, $ol_i$ and $pl_i$ are the obtained label and the provided label, respectively, and $N$ is the total number of test samples.
Table 4 demonstrates that the proposed algorithm achieves the highest sparsity result and a comparable reconstruction error. Although the Deep GCCA algorithm obtains the smallest reconstruction error on the testing data, it achieves the highest reconstruction error on the training data, which suggests that the Deep GCCA method may lead to over-fitting results. Since the proposed sparse GCCA algorithm is built upon the traditional GCCA method, the experimental results indicate that sparse GCCA obtains a higher sparsity result, while achieving a somewhat lower accuracy or higher reconstruction error than the GCCA method.

Table 3. Comparison results on gene data over 30 training-testing replications: the average sparsity (Spar1 and Spar2 stand for the sparsity of the first view and second view, respectively), classification accuracy, and the summation of correlation coefficients (SCORR).
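As a side note, the bag-of-words/TF-IDF preprocessing described in the Settings above can be sketched with scikit-learn as follows; the tokenization rule, the stop-word lists, and the reading of "appearing less than twice" (here: in fewer than two documents) are assumptions, since the paper does not fix them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_view(documents, stop_words):
    # token_pattern keeps alphabetic tokens only, which also removes numbers;
    # min_df=2 drops words occurring in fewer than two documents, one possible
    # reading of "appearing less than twice".
    vectorizer = TfidfVectorizer(stop_words=stop_words, min_df=2,
                                 token_pattern=r"(?u)\b[a-zA-Z]{2,}\b")
    X = vectorizer.fit_transform(documents)    # (num_docs x vocab), sparse
    return X.T, vectorizer                     # view matrix: (vocab x num_docs)

# english_docs is a hypothetical list of extracted "body" texts.
# scikit-learn ships only an English stop-word list; for Spanish and French,
# explicit stop-word lists would have to be supplied.
# X_en, vec_en = tfidf_view(english_docs, stop_words="english")
```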

5. Conclusion. CCA is a widely discussed and used statistical tool, but it can only deal with two datasets; GCCA has been studied to overcome this shortcoming. However, the traditional GCCA method lacks interpretability for high-dimensional data. To make the canonical variates more interpretable, in this paper, under the framework of GCCA, we reformulate the GCCA problem into a linear system of equations. Employing the idea of the linearized Bregman iteration, we obtain a sparse GCCA algorithm by imposing an $\ell_1$ norm on the canonical variates. Experimental results on both a synthetic dataset and real-world datasets demonstrate the effectiveness of the proposed sparse GCCA algorithm. For future work, we will investigate how to obtain a sparse deep CCA algorithm.
Appendix. In this part, we first give a brief review of the Bregman distance and the linearized Bregman iteration method, and then present the proof of Theorem 1. The Bregman distance is defined by [4]
$$D_F^{u}(w, v) = F(w) - F(v) - \langle u, w - v \rangle, \tag{5.1}$$
where $F$ is a convex function and $u \in \partial F(v)$ is a subgradient in the subdifferential of $F$ at the point $v$. $D_F^u(w, v)$ is not a distance in the usual sense, since in general $D_F^u(w, v) \neq D_F^{u'}(v, w)$, but it is obviously nonnegative. The Bregman iteration for the general problem
$$\min_{w \in \mathbb{R}^n} \{ F(w) : Aw = b \} \tag{5.2}$$
with $F(w) = \mu \|w\|_1$ is given by [20]
$$\begin{cases} b^{k+1} = b^k + \left( b - A w^k \right), \\ w^{k+1} = \arg\min_{w \in \mathbb{R}^n} \frac{1}{2} \|Aw - b^{k+1}\|^2 + \mu \|w\|_1. \end{cases} \tag{5.3}$$
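As a concrete illustration of (5.1) (an example added here, not from the original text): for a quadratic $F$ the Bregman distance reduces to the symmetric squared Euclidean distance, while for the $\ell_1$ norm symmetry fails.

```latex
\begin{align*}
&F(w) = \tfrac{1}{2}\|w\|_2^2,\; u = \nabla F(v) = v:
 \quad D_F^{u}(w,v) = \tfrac{1}{2}\|w\|^2 - \tfrac{1}{2}\|v\|^2
   - \langle v,\, w - v\rangle = \tfrac{1}{2}\|w - v\|^2;\\
&F(w) = |w| \ (n = 1),\; v > 0 > w,\; u = \operatorname{sign}(v) = 1:
 \quad D_F^{u}(w,v) = |w| - |v| - (w - v) = 2|w|.
\end{align*}
```

In the second case, taking $w$ as the base point instead (with subgradient $u' = -1$) gives $D_F^{u'}(v, w) = 2|v|$, so the two directions disagree whenever $|w| \neq |v|$.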
In order to solve the second step of (5.3), one replaces $\frac{1}{2}\|Aw - b\|^2$ by its Taylor expansion around $w^k$ together with a proximity term, where $\delta > 0$ is a fixed parameter. Writing $J(w) = \mu \|w\|_1$ and starting from $w^0 = u^0 = 0$, this yields the linearized Bregman iteration
$$\begin{cases} w^{k+1} = \arg\min_{w \in \mathbb{R}^n} D_J^{u^k}(w, w^k) + \frac{1}{2\delta} \left\| w - \left( w^k - \delta A^T (A w^k - b) \right) \right\|^2, \\ u^{k+1} = u^k - \frac{1}{\delta} \left( w^{k+1} - w^k \right) - A^T (A w^k - b), \end{cases} \tag{5.4}$$
where the second expression, updating $u^{k+1}$, ensures that $u^{k+1} \in \partial J(w^{k+1})$; details can be found in [6]. It was shown in [6,27] that if $F(w) = \|w\|_1$, then (5.4) can be reformulated into the compact form (3.5). In fact, (5.4) is equivalent to gradient descent applied to the dual of the problem
$$\min_{w \in \mathbb{R}^n} \left\{ \mu F(w) + \frac{1}{2\delta} \|w\|^2 : Aw = b \right\}. \tag{5.5}$$
We need the following three lemmas before giving the proof of Theorem 1.
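For completeness, here is a short derivation sketch (added for illustration, under the scaling conventions of (3.5)) of how (5.4) collapses to the compact form when $J(w) = \mu \|w\|_1$:

```latex
% Optimality condition of the w-update in (5.4), with p^{k+1} \in \partial\|w^{k+1}\|_1:
\begin{equation*}
  \frac{1}{\delta} w^{k+1} + \mu\, p^{k+1}
  = u^k + \frac{1}{\delta} w^k + A^T\!\left(b - A w^k\right) =: v^{k+1},
\end{equation*}
% whose unique solution is the scaled soft-thresholding
\begin{equation*}
  w^{k+1} = \delta \,\operatorname{shrink}\!\left(v^{k+1}, \mu\right).
\end{equation*}
% Since u^{k+1} = \mu p^{k+1} = v^{k+1} - w^{k+1}/\delta, the auxiliary variable
% obeys v^{k+2} = v^{k+1} + A^T(b - A w^{k+1}); together with v^1 = A^T b
% (from u^0 = w^0 = 0), this is exactly iteration (3.5).
```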