AN IMPROVED TOTAL VARIATION REGULARIZED RPCA FOR MOVING OBJECT DETECTION WITH DYNAMIC BACKGROUND

Abstract. Dynamic background extraction has been a fundamental research topic in video analysis. In this paper, a novel robust principal component analysis (RPCA) approach for foreground extraction is proposed by decomposing video frames into three parts: a rank-one static background, a dynamic background, and a sparse foreground. First, the static background is represented by a rank-one matrix, which avoids the computation of the singular value decomposition, since the dimensionality of a surveillance video is usually very large. Second, the dynamic background is characterized by the ℓ2,1-norm, which exploits the shared information across frames. Third, the sparse foreground is enhanced by total variation, which preserves the sharp edges that are usually the most important for clear object extraction. An efficient symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM) is established together with a convergence analysis. Extensive experiments on real-world datasets show that our proposed approach outperforms existing state-of-the-art approaches. To the best of our knowledge, this is the first work to integrate joint sparsity and total variation into an RPCA framework.


1. Introduction. In surveillance, video signals are usually captured by stationary cameras and transferred to a data processing center. The number of cameras has increased dramatically over the past decade, resulting in a huge amount of data. Moreover, many challenges remain for video analysis in real-life environments (e.g., irregular object movements, low contrast, high sensor noise, camouflage, occlusions, bad weather, and so on). Therefore, how to extract the foreground (moving objects) in surveillance videos quickly and precisely has become a significant problem.
Traditional approaches to foreground extraction are built on frame differencing, such as nonparametric models like kernel density estimation [12], neural network models [22], and mixture-of-Gaussians models [28]. However, these approaches often produce local misclassifications and are computationally expensive, since they process or manage every pixel of the surveillance video. See, e.g., [1,27] for comprehensive reviews.
Recently, techniques based on compressed sensing (CS) have developed quickly due to their excellent performance. In this framework, foreground extraction was initially formulated as the optimization problem

(1)    min_{L,S}  rank(L) + λ‖S‖_0    s.t.  M = L + S,

where ‖S‖_0 is the cardinality function of S, i.e., the number of nonzero entries, and λ > 0 is a regularization parameter. The data matrix M ∈ R^{m×n} is decomposed (approximately) as the sum of a low-rank matrix L ∈ R^{m×n} modeling the background and a sparse matrix S ∈ R^{m×n} modeling the foreground. The background has relatively small changes over a period of time, so it is modeled by the low-rank matrix L; the foreground consists of moving objects, so it is modeled by the sparse matrix S with most of its entries being zero or nearly zero. Problem (1) is NP-hard due to the discontinuity and nonconvexity of rank(L) and ‖S‖_0; see, e.g., [5,9,10,23]. To solve the problem in a tractable way, Candès et al. [4] proposed the robust principal component analysis (RPCA) model, which replaces the rank with the nuclear norm and the ℓ0 norm with the ℓ1 norm:

(2)    min_{L,S}  ‖L‖_* + λ‖S‖_1    s.t.  M = L + S,

in which ‖L‖_* is the nuclear norm of L (defined as the sum of all singular values) and ‖S‖_1 is the componentwise ℓ1 norm (defined as the sum of the absolute values of all entries). This convex model approximates the original problem (1), and it is proved that, under rather weak assumptions, the solutions L and S of (1) can be recovered exactly by solving (2). RPCA is now one of the most popular approaches for foreground extraction [2].
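As a concrete illustration, the convex model (2) can be solved by a basic two-block ADMM whose subproblems are singular value thresholding (for L) and entrywise soft shrinkage (for S). The sketch below, in Python/NumPy, is not the paper's algorithm; the penalty heuristic, default λ, and iteration count are our own illustrative choices.

```python
import numpy as np

def svt(X, tau):
    # Singular value thresholding: prox of tau * (nuclear norm).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def shrink(X, tau):
    # Soft shrinkage: prox of tau * (l1 norm), applied entrywise.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(M, lam=None, beta=None, iters=200):
    # ADMM for min ||L||_* + lam*||S||_1 s.t. M = L + S.
    m, n = M.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    # Heuristic penalty in the spirit of inexact ALM solvers.
    beta = beta or 0.25 * m * n / (np.abs(M).sum() + 1e-12)
    S = np.zeros_like(M)
    Lam = np.zeros_like(M)
    for _ in range(iters):
        L = svt(M - S + Lam / beta, 1.0 / beta)      # L-step
        S = shrink(M - L + Lam / beta, lam / beta)   # S-step
        Lam += beta * (M - L - S)                    # dual ascent
    return L, S
```

On a small synthetic low-rank-plus-sparse matrix, a few hundred iterations suffice to drive the constraint residual close to zero.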
However, RPCA currently suffers from some prominent issues. For example, the extracted foreground often contains small artifacts that in fact belong to the background. To overcome this shortcoming, much effort has been devoted in the research community. One interesting line of work separates the foreground S into a dynamic background part E and an intrinsic foreground part F. Figure 1 illustrates this decomposition: on the left, the first row shows foreground frames, while the second and third rows show the corresponding dynamic background frames and dynamic foreground frames, respectively; on the right, each column represents a frame of the dynamic background, from which it is not hard to conclude that this matrix has dense elements row-wise and sparse elements column-wise.
With this decomposition, the optimization model can be described as

(3)    min_{L,S,E,F}  ‖L‖_* + λ_1‖S‖_1 + λ_2‖E‖_1 + λ_3‖DF‖_1    s.t.  M = L + S,  S = E + F,

which is called total variation regularized RPCA (TVRPCA) in [6]. Here λ_1, λ_2, λ_3 are the weights for balancing the corresponding terms, and ‖DF‖_1 is the three-dimensional total variation norm defined as

‖DF‖_1 := Σ_{i,j,t} ( |(D_h F)_{i,j,t}| + |(D_v F)_{i,j,t}| + |(D_t F)_{i,j,t}| ),

where the three difference operations at the voxel (i,j,t) along the horizontal, vertical, and temporal directions are

(D_h F)_{i,j,t} = F_{i+1,j,t} − F_{i,j,t},   (D_v F)_{i,j,t} = F_{i,j+1,t} − F_{i,j,t},   (D_t F)_{i,j,t} = F_{i,j,t+1} − F_{i,j,t}.

As shown in [25,26], the added total variation term can remove small artifacts and make images much cleaner. Although [6] established an algorithm based on the alternating direction method of multipliers, no convergence result was presented; in fact, the convergence of that algorithm cannot be guaranteed [7]. In this paper, we attempt to fill this gap and propose an improved total variation regularized RPCA model:

(4)    min_{u,S,E,F}  ‖S‖_1 + λ_1‖E‖_{2,1} + λ_2‖DF‖_1    s.t.  M = u1^T + S,  S = E + F,

where u ∈ R^m and 1 ∈ R^n give the decomposition of the rank-one matrix L = u1^T, and ‖E‖_{2,1} is defined as the sum of the ℓ2-norms of all rows of E. We call this model the rank-one and joint sparsity matrix decomposition.
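The anisotropic 3D total variation ‖DF‖_1 is straightforward to compute with forward differences. The following NumPy sketch assumes periodic boundary conditions (an assumption on our part, consistent with the FFT-based solver used later in the paper); the video tensor F is indexed as height × width × frames.

```python
import numpy as np

def tv3d(F):
    # Anisotropic 3-D total variation ||DF||_1 of a video tensor F
    # (height x width x frames), using forward differences with
    # periodic (circular) boundary conditions.
    dh = np.roll(F, -1, axis=0) - F  # horizontal differences
    dv = np.roll(F, -1, axis=1) - F  # vertical differences
    dt = np.roll(F, -1, axis=2) - F  # temporal differences
    return np.abs(dh).sum() + np.abs(dv).sum() + np.abs(dt).sum()
```

A constant tensor has zero TV, while an isolated unit spike contributes two unit jumps along each of the three directions.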
Compared to previous work, the main innovations of our paper are summarized below:
1. The joint sparsity [21,31] is applied to exploit the structure of the dynamic background part E; see Figure 1 for a detailed illustration. We also reduce the number of hyper-parameters, which helps save computational time.
2. The rank-one approximation [19,29] is used to replace the low-rank part L. The motivation is that, for a surveillance video with a static background, the background matrix should consist of identical columns rather than being a general low-rank matrix.
3. An efficient symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM) with a convergence guarantee is established to optimize the proposed model, taking full advantage of its special structure. Furthermore, the step-length τ can be chosen in (0, (1 + √5)/2), rather than τ = 1 as in [6]. In practice, we always choose τ = 1.618.
The remainder of this paper is organized as follows. In Section 2, we summarize some preliminaries that are useful for further discussion. In Section 3, we propose an efficient algorithm for our model and present the convergence result. In Section 4, we report numerical experiments on various surveillance video datasets to substantiate the superiority of the proposed model over existing ones. Finally, we conclude in Section 5.

2. Preliminaries.
In this section, we summarize some preliminaries that are useful for further discussion. First, we briefly review the ADMM, which is widely used in computer vision, image processing, and statistical learning [3]. Consider a generic convex minimization problem with a separable objective function and linear constraints:

(5)    min_{X,Y}  f(X) + g(Y)    s.t.  A(X) + B(Y) = Z,

where X, Y ∈ R^{m×n} are the variables, Z ∈ R^{m×n} is a given matrix, f, g are closed proper convex functions [24], and A, B : R^{m×n} → R^{m×n} are linear maps. The augmented Lagrangian function for problem (5) is

L_β(X, Y; Λ) := f(X) + g(Y) + ⟨Λ, A(X) + B(Y) − Z⟩ + (β/2)‖A(X) + B(Y) − Z‖²_F,

where β > 0 is a penalty parameter, Λ ∈ R^{m×n} is the Lagrange multiplier, and ⟨·,·⟩ is the inner product of two matrices. To exploit the properties of f and g individually, the ADMM, originally proposed in [15], is summarized in Algorithm 1. In most cases, the resulting subproblems admit closed-form solutions, which makes the ADMM particularly efficient. Under some mild conditions, the sequence {(X^k, Y^k)} generated by the ADMM converges to an optimal solution of (5); see, e.g., [13,14,17].

Algorithm 1: ADMM for model (5). Let τ ∈ (0, (1 + √5)/2) be a step-length and β > 0 be a given parameter. For k = 0, 1, 2, ..., compute
    X^{k+1} = argmin_X L_β(X, Y^k; Λ^k),
    Y^{k+1} = argmin_Y L_β(X^{k+1}, Y; Λ^k),
    Λ^{k+1} = Λ^k + τβ(A(X^{k+1}) + B(Y^{k+1}) − Z).

However, when there are more than two blocks of variables, the ADMM does not have a convergence guarantee. In fact, it was recently proved in [7] that the directly extended ADMM, when applied to a convex optimization problem with three blocks, can diverge, even though it works well in [6] and other applications. This motivates the study of many provably convergent variants of the ADMM for convex problems with more than two blocks; see [8,16,18,20] for example. Let us consider the following convex composite programming problem whose objective involves a nonsmooth term:

(6)    min  Σ_{i=1}^p f_i(X_i) + Σ_{j=1}^q g_j(Y_j)    s.t.  Σ_{i=1}^p A_i(X_i) + Σ_{j=1}^q B_j(Y_j) = Z,

where X_i, Y_j ∈ R^{m×n} are the variables, Z ∈ R^{m×n} is a given matrix, f_1, g_1 are nonsmooth convex functions, f_i, g_j (i = 2, ..., p, j = 2, ..., q) are smooth convex functions, and A_i, B_j : R^{m×n} → R^{m×n} are linear maps. The augmented Lagrangian function for problem (6) is defined analogously to that of (5). With the help of the symmetric Gauss-Seidel (sGS) technique, the sGS-ADMM for (6) was developed in [8,20] and is summarized in Algorithm 2. The idea is that, to make full use of the smooth convex functions, one should first perform a backward GS sweep and then a forward GS sweep. In this paper, we use superscripts k + 1/2 and k + 1 to denote the backward and forward GS sweeps, respectively. We refer to [20] for the convergence of Algorithm 2, which gives a theoretical guarantee for the sGS-ADMM with more than two blocks.
For the sGS-ADMM type algorithms presented above, the main subproblems at each iteration have closed-form solutions. We now review two operators that help express these closed-form solutions conveniently.
Lemma 2.1. For β > 0 and T ∈ R^{m×n}, the problem

    min_{X ∈ R^{m×n}}  ‖X‖_1 + (β/2)‖X − T‖²_F

has a closed-form solution given by the soft-shrinkage operator

    Shrink_{1/β}(T) := sign(T) ∘ max(|T| − 1/β, 0),

where ∘ and sign represent, respectively, the pointwise product and the signum function, and all operations are done componentwise. See [11] for more details.
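A minimal NumPy sketch of the soft-shrinkage operator, written with an explicit threshold ρ (so Shrink_ρ is the proximal mapping of ρ‖·‖_1):

```python
import numpy as np

def shrink(T, rho):
    # Soft shrinkage: closed-form solution of
    #   min_X rho*||X||_1 + 0.5*||X - T||_F^2,
    # applied componentwise: sign(T) * max(|T| - rho, 0).
    return np.sign(T) * np.maximum(np.abs(T) - rho, 0.0)
```

Entries with magnitude below the threshold are set exactly to zero, which is what produces sparsity in the S-updates later.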
Algorithm 2: sGS-ADMM for model (6). At iteration k: Step 1a (backward GS sweep) computes the blocks X_i^{k+1/2} for i = p, ..., 2; Step 2b (forward GS sweep) computes the blocks Y_j^{k+1} for j = 1, ..., q; the remaining steps follow the sGS-ADMM framework of [8,20], ending with the multiplier update.

Lemma 2.2. For β > 0 and T ∈ R^{m×n}, the problem

    min_{X ∈ R^{m×n}}  ‖X‖_{2,1} + (β/2)‖X − T‖²_F

has a closed-form solution X* whose i-th row is given by

    X*_i = max(1 − 1/(β‖T_i‖_2), 0) · T_i,

where T_i represents the i-th row of T for i = 1, ..., m (with X*_i = 0 when T_i = 0).
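The row-wise ℓ2,1 shrinkage of Lemma 2.2 can be sketched as follows (again with an explicit threshold ρ; the small guard against zero rows is our own safeguard, not part of the lemma):

```python
import numpy as np

def shrink_rows(T, rho):
    # Row-wise shrinkage: closed-form solution of
    #   min_X rho*||X||_{2,1} + 0.5*||X - T||_F^2.
    # Each row T_i is scaled by max(1 - rho/||T_i||_2, 0), so rows
    # with small l2-norm are zeroed out jointly.
    norms = np.maximum(np.linalg.norm(T, axis=1, keepdims=True), 1e-12)
    return np.maximum(1.0 - rho / norms, 0.0) * T
```

This operator zeroes entire rows at once, which is exactly the "dense row-wise, sparse column-wise" structure used for the dynamic background E.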
3. Algorithm and convergence analysis. In this section, we apply the sGS-ADMM idea to derive an efficient algorithm with convergence analysis for our proposed model (4). We first introduce an auxiliary variable K and rewrite (4) in the following equivalent form:

(7)    min_{u,F,S,E,K}  ‖S‖_1 + λ_1‖E‖_{2,1} + λ_2‖K‖_1    s.t.  M = u1^T + S,  S = E + F,  K = DF.

The auxiliary variable K liberates F from the nonsmooth term ‖DF‖_1. Let L_β(u, F, S, E, K; Λ_1, Λ_2, Λ_3) be the augmented Lagrangian function of problem (7), defined as

L_β := ‖S‖_1 + λ_1‖E‖_{2,1} + λ_2‖K‖_1 + ⟨Λ_1, M − u1^T − S⟩ + (β_1/2)‖M − u1^T − S‖²_F + ⟨Λ_2, S − E − F⟩ + (β_2/2)‖S − E − F‖²_F + ⟨Λ_3, K − DF⟩ + (β_3/2)‖K − DF‖²_F,

in which Λ_1, Λ_2, Λ_3 ∈ R^{m×n} are the Lagrange multipliers and β_1, β_2, β_3 > 0 are the tuning penalty parameters. As illustrated in Section 2, when there are more than two blocks of variables, the convergence of the directly extended ADMM cannot be guaranteed; hence the symmetric Gauss-Seidel based ADMM (sGS-ADMM) is adopted. Grouping the variables into (u, F) and (S, E, K), the iterative scheme performs a backward sweep on (u, F), an update of (S, E, K), a forward sweep on (u, F), and finally the multiplier updates

Λ_1^{k+1} = Λ_1^k + τβ_1(M − u^{k+1}1^T − S^{k+1}),   Λ_2^{k+1} = Λ_2^k + τβ_2(S^{k+1} − E^{k+1} − F^{k+1}),   Λ_3^{k+1} = Λ_3^k + τβ_3(K^{k+1} − DF^{k+1}).

In the following, we show that the minimization with respect to each variable can be separated into the subproblems below.
Step 1a. For the variable u, the subproblem of L_β with respect to u is solved by

    u^{k+1/2} = Mean( M − S^k + Λ_1^k/β_1 ),

where the operator Mean : R^{m×n} → R^m is defined as Mean(X)_i := (1/n) Σ_{j=1}^n X_{i,j}, i = 1, ..., m, for all X ∈ R^{m×n}.
Compared with a general low-rank decomposition, the rank-one representation avoids computing the SVD, which saves considerable time.
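A sketch of the Mean operator and the resulting u-update. The argument M − S + Λ_1/β_1 reflects our reading of Step 1a and is labeled as an assumption in the code:

```python
import numpy as np

def mean_op(X):
    # Mean : R^{m x n} -> R^m, the row-wise average
    # Mean(X)_i = (1/n) * sum_j X_{i,j}.
    return X.mean(axis=1)

def update_u(M, S, Lam1, beta1):
    # u-step (assumed form): least-squares fit of the rank-one
    # background u*1^T to M - S + Lam1/beta1, which reduces to a
    # row-wise mean -- no SVD needed.
    return mean_op(M - S + Lam1 / beta1)
```

Because the update is a single averaging pass over the data, its cost is linear in the number of pixels, in contrast to the cubic cost of an SVD.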
Step 1b. For the variable F, the subproblem of L_β with respect to F reduces to the linear system

    (β_2 I + β_3 DᵀD) F = β_2(S^k − E^k) + Λ_2^k + Dᵀ(β_3 K^k + Λ_3^k).

Notice that the coefficient matrix β_2 I + β_3 DᵀD is nonsingular whenever β_2, β_3 > 0. This system could be solved by common linear system solvers, such as the Cholesky decomposition or the conjugate gradient method, but doing so at every iteration would be extremely slow. Under periodic boundary conditions for F, DᵀD is a block circulant matrix with circulant blocks and thus is diagonalizable by the 3D discrete Fourier transform (DFT). As a result, F^{k+1/2} can be computed by one forward DFT and one inverse DFT:

    F^{k+1/2} = 𝓕⁻¹( 𝓕(β_2(S^k − E^k) + Λ_2^k + Dᵀ(β_3 K^k + Λ_3^k)) / (β_2 + β_3 |𝓕(D)|²) ),

where 𝓕 is the 3-dimensional DFT, |·|² is the elementwise square, and the division is also performed elementwise.
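The FFT-based solve can be checked numerically: under periodic boundary conditions, the 1-D forward-difference operator along an axis of length n has DᵀD eigenvalues 2 − 2cos(2πk/n) in the DFT basis, and the three axes add up. The sketch below verifies this diagonalization (the axis ordering and the scalar penalties are our illustrative assumptions):

```python
import numpy as np

def dtd_eigenvalues(shape):
    # Eigenvalues of D^T D (sum of the three 1-D periodic
    # forward-difference operators) on the 3-D DFT basis:
    # sum over axes of 2 - 2*cos(2*pi*k/n).
    eig = np.zeros(shape)
    for ax, n in enumerate(shape):
        lam = 2.0 - 2.0 * np.cos(2.0 * np.pi * np.arange(n) / n)
        sh = [1, 1, 1]
        sh[ax] = n
        eig = eig + lam.reshape(sh)
    return eig

def solve_F(rhs, beta2, beta3):
    # Solve (beta2*I + beta3*D^T D) F = rhs with one forward and one
    # inverse 3-D FFT (periodic boundary conditions assumed).
    eig = dtd_eigenvalues(rhs.shape)
    return np.fft.ifftn(np.fft.fftn(rhs) / (beta2 + beta3 * eig)).real
```

Applying the operator explicitly to a random tensor and solving the resulting right-hand side recovers the original tensor to machine precision.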
Step 2a. For the variable S, the optimization subproblem of L_β with respect to S can be transformed into

    S^{k+1} = Shrink_{1/(β_1+β_2)}( (β_1(M − u^{k+1/2}1^T) + Λ_1^k − Λ_2^k + β_2(E^k + F^{k+1/2})) / (β_1 + β_2) ),

which is solved by the soft-shrinkage operator defined in Lemma 2.1.
Step 2b. For the variable E, the subproblem of L_β with respect to E simplifies to

    E^{k+1} = argmin_E  λ_1‖E‖_{2,1} + (β_2/2)‖S^{k+1} − E − F^{k+1/2} + Λ_2^k/β_2‖²_F,

which is solved row-wise by the operator in Lemma 2.2 with parameter λ_1/β_2.

Step 2c. For the variable K, the subproblem of L_β with respect to K is solved by

    K^{k+1} = Shrink_{λ_2/β_3}( DF^{k+1/2} − Λ_3^k/β_3 ).

Step 3a. Similarly to Step 1a, u^{k+1} = Mean(M − S^{k+1} + Λ_1^k/β_1).

Step 3b. Similarly to Step 1b, F^{k+1} is obtained by solving the linear system (14) with the updated variables S^{k+1}, E^{k+1}, K^{k+1}.

Finally, we present our sGS-ADMM based algorithm for solving model (7) (equivalently (4)) in Algorithm 3.
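Putting the pieces together, one iteration of the scheme might look as follows. This is a heavily simplified sketch, not the paper's Algorithm 3: it performs plain cyclic updates rather than the full backward/forward sGS sweeps of Steps 1a-3b, the constraint structure (M = u1ᵀ + S, S = E + F, K = DF) is our reconstruction, and all parameter values are placeholders.

```python
import numpy as np

def shrink(T, rho):
    return np.sign(T) * np.maximum(np.abs(T) - rho, 0.0)

def shrink_rows(T, rho):
    norms = np.maximum(np.linalg.norm(T, axis=1, keepdims=True), 1e-12)
    return np.maximum(1.0 - rho / norms, 0.0) * T

def D(F):
    # Stack of periodic forward differences along the 3 axes.
    return np.stack([np.roll(F, -1, axis=a) - F for a in range(3)])

def Dt(G):
    # Adjoint of D.
    return sum(np.roll(G[a], 1, axis=a) - G[a] for a in range(3))

def dtd_eig(shape):
    eig = np.zeros(shape)
    for ax, n in enumerate(shape):
        sh = [1, 1, 1]; sh[ax] = n
        eig = eig + (2 - 2 * np.cos(2 * np.pi * np.arange(n) / n)).reshape(sh)
    return eig

def sgs_admm_sketch(M, lam1, lam2, beta=(1.0, 1.0, 1.0), tau=1.0, iters=50):
    # M: video tensor, height x width x frames.
    h, w, t = M.shape
    b1, b2, b3 = beta
    S = np.zeros_like(M); E = np.zeros_like(M); F = np.zeros_like(M)
    K = np.zeros((3, h, w, t))
    L1 = np.zeros_like(M); L2 = np.zeros_like(M); L3 = np.zeros_like(K)
    eig = dtd_eig(M.shape)
    for _ in range(iters):
        # u-step: temporal mean (rank-one background)
        u = (M - S + L1 / b1).mean(axis=2)
        B = u[:, :, None] * np.ones(t)           # u * 1^T as a tensor
        # F-step: one forward and one inverse 3-D FFT
        rhs = b2 * (S - E) + L2 + Dt(b3 * K + L3)
        F = np.fft.ifftn(np.fft.fftn(rhs) / (b2 + b3 * eig)).real
        # S-step: soft shrinkage
        T = (b1 * (M - B) + L1 - L2 + b2 * (E + F)) / (b1 + b2)
        S = shrink(T, 1.0 / (b1 + b2))
        # E-step: row-wise shrinkage on the (pixels x frames) matrix
        T = (S - F + L2 / b2).reshape(h * w, t)
        E = shrink_rows(T, lam1 / b2).reshape(h, w, t)
        # K-step: soft shrinkage of the differences
        K = shrink(D(F) - L3 / b3, lam2 / b3)
        # multiplier updates with step-length tau
        L1 += tau * b1 * (M - B - S)
        L2 += tau * b2 * (S - E - F)
        L3 += tau * b3 * (K - D(F))
    return B, S, E, F
```

On a toy video (static background plus one bright spike), the constraint residual M − B − S shrinks as the multipliers accumulate.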

4. Numerical experiments.
In this section, we conduct experiments on synthetic and real-world video datasets to demonstrate the superiority of our proposed model over existing state-of-the-art approaches for foreground extraction. All experiments are performed in MATLAB (R2017a) on a desktop computer with an Intel Core i5-3570M CPU at 3.4 GHz and 8 GB of memory.

4.1. Data sets. We test our model on two popular foreground extraction data sets, namely the SABS data set and the CDnet data set. In particular, we choose 6 representative sequences from these data sets, including both synthetic and real-world videos.

4.1.1. Synthetic Data. The SABS data set is an artificial data set for pixel-wise evaluation of background models. Because it is synthetic, it provides high-quality ground truths, making it easy to evaluate the extracted foregrounds. The data set consists of video sequences for nine different challenges of background subtraction, and we select the NoCamouflage and NoisyNight sequences to evaluate our proposed approach. Both videos contain periodically swaying leaves in the background component and a moving car in the foreground component. See the left two columns of Figure 2.

4.1.2. Real-world Data. The CDnet data set is considered one of the most difficult tracking benchmarks; it consists of 31 real-world videos (over 8000 frames) spanning six categories with diverse motion and change detection challenges. To verify the performance in different scenarios, we choose two categories of videos to test our approach. The first category is bad weather, i.e., the snowfall and skating sequences. This category is challenging for two reasons: on the one hand, snow reduces the contrast of surveillance videos, which makes the distinction between foreground and background difficult; on the other hand, the movements of snowflakes are detected as foreground, which reduces the capability to identify small objects. See the middle two columns of Figure 2. The second category is dynamic background, i.e., the fall and fountain sequences. This category is the most difficult: one challenge comes from the significant dynamic background, such as swaying trees and flowing water; the other is that the moving objects are relatively small and may be occluded by the dynamic background. See the right two columns of Figure 2.

4.2. Experimental settings. We compare our model with two existing popular methods: RPCA and TVRPCA. For RPCA and TVRPCA, the results are generated from the source codes released by their authors, with all parameters set to their defaults. For our proposed method, we empirically set the joint sparsity parameter λ_1 = 5/√mn and the total variation parameter λ_2 = 1/√mn. To accelerate convergence, we set τ = 1.618. For the parameters β_1, β_2, β_3 > 0, we apply a self-adaptive adjustment rule β_i^{k+1} = σβ_i^k (i = 1, 2, 3) with σ = 1.01. In addition, the stopping criterion is set as Tol < 10⁻³.

4.3. Comparison results.
In this subsection, we divide the comparison into two parts: visual results and quantitative results. In the comparison figures, white represents correctly detected foreground, red represents missed pixels, and blue represents false alarms. In addition, some notable differences are highlighted with yellow boxes (best viewed in the zoomed-in version of the PDF).

4.3.1. Visual Results. Figure 3 shows the visual results on surveillance videos from the SABS and CDnet data sets. In the figure, the first column shows the test samples, the second column shows the ground-truth foreground, and the following three columns show the foregrounds extracted by RPCA, TVRPCA, and our proposed model, respectively. It can be seen that RPCA wrongly labels nearby background as moving object regions, since it treats pixels independently without considering their spatial relationships. TVRPCA and our model successfully remove the swaying leaves and water surfaces to different degrees, because the total variation preserves sharp edges and object boundaries while removing small artifacts. It is worth noting that, compared with TVRPCA, our model filters out more dynamic textures and obtains a more accurate foreground. This superiority is due to the joint sparsity, which exploits the structural property of the dynamic background.

4.3.2. Quantitative Results. In this subsection, we present some quantitative results. To evaluate the performance of foreground extraction, we compare the support of the recovered foreground with the support of the ground truth by computing the F-measure

    F-measure = 2RP / (R + P),

where R and P are short for recall and precision, respectively, which are defined as

    R = TP / (TP + FN),    P = TP / (TP + FP).

• TP stands for true positives: the number of true foreground pixels that are recovered;
• FP stands for false positives: the number of background pixels that are misdetected as foreground;
• FN stands for false negatives: the number of true foreground pixels that are missed.
The support of the recovered foreground is obtained by entry-wise thresholding. The F-measure varies between zero and one according to the similarity between the supports of the recovered foreground and the ground truth: the higher the F-measure, the better the recovery accuracy, and it attains the maximum value one exactly when the two supports coincide, i.e., when the foreground is recovered completely. Table 1 summarizes the quantitative results on both synthetic and real-world data sets; the reported F-measure is averaged over all frames. From the table, one can observe that our proposed model improves both recall and precision, which increases the F-measure in all cases. This can be attributed to the integration of joint sparsity and total variation, where the joint sparsity exploits the shared information and the total variation preserves sharp edges and object boundaries by removing small artifacts.
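The recall, precision, and F-measure computation from binary masks can be sketched as follows (the guards against empty masks are our own safeguards):

```python
import numpy as np

def f_measure(pred, gt):
    # pred, gt: boolean foreground masks of equal shape.
    tp = np.logical_and(pred, gt).sum()    # foreground found
    fp = np.logical_and(pred, ~gt).sum()   # background misdetected
    fn = np.logical_and(~pred, gt).sum()   # foreground missed
    recall = tp / max(tp + fn, 1)
    precision = tp / max(tp + fp, 1)
    return 2 * recall * precision / max(recall + precision, 1e-12)
```

For a video sequence, this score would be computed per frame and averaged, as in Table 1.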

5. Conclusion. In this paper, we propose a novel total variation regularized RPCA approach for foreground extraction, in which the total variation models the spatio-temporal correlation of the foreground and the joint sparsity characterizes the dynamic background. Furthermore, we develop an efficient sGS-ADMM algorithm for the proposed model and establish the convergence of the generated sequence. Finally, numerical experiments on real-world datasets demonstrate the effectiveness of the proposed approach.