HOMOGRAPHY ESTIMATION ALONG SHORT VIDEOS BY RECURRENT CONVOLUTIONAL REGRESSION NETWORK

Many moving-camera video processing and analysis tasks require accurate estimation of homography across frames. Estimating the homography between non-adjacent frames can be very challenging when their camera view angles show large differences. In this paper, we propose a new deep-learning based method for homography estimation along videos that exploits temporal dynamics across frames. More specifically, we develop a recurrent convolutional regression network consisting of a convolutional neural network (CNN) and a recurrent neural network (RNN) with long short-term memory (LSTM) cells, followed by a regression layer for estimating the parameters of the homography. In the experiments, we evaluate the proposed method on both synthesized and real-world short videos. The experimental results verify that the proposed method can estimate homographies along short videos better than several existing methods.


1. Introduction.
A homography is an invertible mapping between two images of the same planar surface [16]. Homography estimation is an important problem in computer vision and plays a key role in many video-based applications, such as video stitching [22], video stabilization [19], optical flow estimation [28], action recognition [33,34,35], simultaneous localization and mapping (SLAM) [9], visual odometry (VO) [14], and augmented reality [48]. Many of these applications, such as video stitching [22] and action recognition [33,34,35], require estimating homographies between both adjacent and non-adjacent frames.
We can simply treat a pair of frames, either adjacent or non-adjacent, as two input images and directly compute their homography without considering the information of any other frames. For this purpose, classical methods first detect hand-crafted features [30,11,5,42] on the two images separately, and then estimate the homography with Random sample consensus (RANSAC) [13]. Many template matching algorithms have also been developed to estimate homographies between two images by identifying and matching relevant regions [2,6]. Recently, many deep learning methods [12,8,40] were developed to learn the homography parameters from training image pairs with known homographies. However, non-adjacent video frames may show very different camera view angles, which makes it difficult to identify reliably matched features, either hand-crafted or deeply learned, for estimating homographies.
A video consists of a sequence of frames, which usually exhibit strong temporal relationships. Within a video, the homography transformation between a pair of non-adjacent frames results from a temporal sequence of adjacent-frame homography transformations between these two frames. The homography between non-adjacent frames therefore depends not only on the spatial relationship between these frames, but also on their temporal relationship, and this temporal relationship can be used to help estimate it. For example, one can estimate homographies between adjacent frames and then compose them sequentially to obtain homographies between non-adjacent frames [15]. Template matching can also be applied to temporal sequences in an accumulative way to compute homographies between non-adjacent frames [23]. However, these approaches may accumulate estimation errors sequentially and lead to large errors in computing non-adjacent-frame homographies. We can also run a feature tracker, such as the Lucas-Kanade tracker [31], along the video and then use the features tracked across non-adjacent frames to directly estimate their homographies. This approach requires highly accurate feature tracking over a sequence of frames, which can be difficult in practice.
In this paper, we propose a Recurrent Convolutional Regression Network to estimate homographies along a short video by taking the whole video as the input. With this network, we exploit the temporal dynamics along the whole video in an end-to-end fashion to more accurately estimate the homographies between non-adjacent frames. The proposed network consists of a convolutional neural network (CNN) and a recurrent neural network (RNN) with long short-term memory (LSTM) cells, followed by a regression layer for estimating the parameters of the homography. By employing a recurrent architecture with LSTM, the proposed method does not need exhaustive feature/template matching or feature tracking, and alleviates the large accumulative errors in computing homographies between non-adjacent frames. We also introduce a simple but effective approach to synthesize large-scale videos with ground-truth homographies for network training and performance evaluation. In the experiments, we show that the proposed method can estimate more accurate homographies along videos than several existing methods, on both synthesized and real-world videos.
2. Related work.

2.1. Homography estimation between two images. Conventional methods for estimating the homography between two images first extract hand-crafted features, such as SIFT [30], SURF [5], HOG [11] or ORB [42], from each image. Recently, Barath et al. [3] proposed two general constraints on orientation- and scale-covariant features (e.g., SIFT). The extracted features are then matched between the two input images using various matching algorithms, such as the Fast Library for Approximate Nearest Neighbors (FLANN) [37] or a brute-force (BF) matcher. Finally, the Direct Linear Transform (DLT) with Random sample consensus (RANSAC) [13] is applied to estimate the homography between the images from the feature correspondences.
Template matching has also been applied to estimate the homography between two images [2,6], where a quadrilateral template area is matched between the two images in an iterative way. In [2], the inverse compositional (IC) algorithm is used to exchange the roles of the two images, leading to an optimization with a constant, pre-computed Hessian [36]. In [6], the transformation is estimated by minimizing the sum of squared differences between the correct and estimated templates, using efficient second-order minimization (ESM).
Recently, deep learning has been used for homography estimation between two images. In [12], a CNN built on the VGG architecture [44] is applied to compute the homography. In that network, two approaches were explored: direct regression and a distribution model via classification. Experiments show that direct regression achieves more accurate homography estimation. In [8], a cascaded Lucas-Kanade network is developed to progressively refine the homography estimate, by combining the IC algorithm with a pyramid feature representation. In [40], a hierarchy of twin CNNs is developed to regress the homography, with visual warping between adjacent hierarchical levels. In [39], an unsupervised deep learning method, combining a VGG-based CNN with the DLT, is developed to estimate the homography.
As discussed earlier, the direct application of these methods for estimating homographies between non-adjacent video frames may produce large errors when the non-adjacent frames show very different camera view angles.

2.2. Homography estimation along a video. Garcia-Fidalgo et al. [15] propose to directly compute homographies between adjacent frames, and then compose them sequentially to estimate homographies between non-adjacent frames along the video. In [23], the homography is computed iteratively between each frame and a reference frame using Hyperplane Approximation (HA), a template matching algorithm that finds the relationship between the measured error and the variation of the transformation parameters by using difference decomposition in an off-line processing stage. Such composition or iteration schemes usually suffer from large accumulative errors. Various feature trackers can be used to identify corresponding features across frames for homography estimation. For example, the Lucas-Kanade (LK) tracker [31] first uses "Good Features To Track" [43] for tracking initialization, and then uses the Lucas-Kanade algorithm to calculate the optical flow for tracking features across frames, along with back-tracking for match verification between frames. The point correspondences of the tracked features can then be used to compute homographies. In practice, the accuracy of the tracked features usually degrades when tracking over many frames, which affects the accuracy of the homographies estimated between non-adjacent frames.
In some applications, structures with special geometry can be detected and tracked along a video for estimating homographies. In [20], homography estimation is accomplished by finding conic correspondences with visual features built on ellipse shapes detected and tracked along the video. In [29], the homography for sports and traffic videos is estimated from correspondences of lines detected and tracked along the sports or traffic field. These methods require the presence of structures with pre-specified geometry in the video and can only be used in special applications. In this paper, we develop a deep-learning based method for estimating the homography that is applicable to general videos, without requiring the presence of such structures.

2.3. Other relevant work. Deep learning has also been used to find other kinds of mappings between frames. For example, in [47], a CNN was used to learn a mapping from pixels to optical flow, following signal-processing principles. In [24], a CNN was used to regress the camera pose from frame to frame, utilizing transfer learning from large-scale classification data. In [46], deep multi-view feature learning (DMVFL) was proposed to exploit the collaboration between hand-crafted and deep-learning features for person re-identification. These works differ from ours in that we aim to estimate homographies along a video. From the technical perspective, our method also differs from these methods by using a recurrent architecture with LSTM to exploit the temporal dynamics across frames.

3. Proposed method. In this section, we elaborate on the proposed method. As mentioned earlier, the homography transform is defined for a planar surface, and like much existing work on homography estimation, the proposed method focuses on videos in which all frames reflect the same planar surface. For more general videos containing multiple planes, planar surface detection [41] can be applied first, followed by estimating a homography for each plane separately.
3.1. Overview. Two common parameterizations of the homography between two images are the 4-point parameterization and the matrix parameterization [1]. The 4-point parameterization H 4point is defined by the 8 displacement values of 4 matched corners:

H 4point = (∆x 1 , ∆y 1 , ∆x 2 , ∆y 2 , ∆x 3 , ∆y 3 , ∆x 4 , ∆y 4 ),

where (∆x i , ∆y i ) denote the horizontal and vertical displacements of the i-th matched corner between the two images. The matrix parameterization H matrix is a 3 × 3 matrix that contains both rotational and translational terms. As shown in [12,40], the 4-point parameterization is more suitable than the matrix parameterization for representing the homography, because it is difficult to balance the rotational and translational terms as part of an optimization problem [12]. Besides, H 4point can easily be converted to H matrix via a simple perspective transform. Therefore, we use H 4point to represent homographies in this paper and abbreviate it as H for brevity. Without loss of generality, given an input video consisting of N + 1 frames {F 1 , F 2 , · · · , F N +1 }, we estimate the homographies between the first frame F 1 and its succeeding N frames. Specifically, we take the original video and re-organize it as a sequence of frame pairs {(F t , F t+1 )}, where each (F t , F t+1 ) is a pair of adjacent frames with t = 1, . . . , N . We aim to find a sequence of homographies {H 1,2 , H 1,3 , · · · , H 1,N +1 }, where H 1,t+1 is the homography between frames F 1 and F t+1 , t = 1, · · · , N . Inspired by the success of regressing homographies for image pairs with deep learning methods [12,40], we formulate homography estimation along videos as a regression problem and propose a novel Recurrent Convolutional Regression Network to address it. As shown in Fig. 1, the proposed network contains a convolutional neural network (CNN) and a recurrent neural network (RNN) followed by a regression layer. Since the temporal dynamics of a video are transferred through consecutive adjacent-frame pairs, we first use the CNN to extract features from each adjacent-frame pair.
Then, the features of all adjacent-frame pairs in each video are fed to the RNN such that the temporal dynamics are fully exploited. The regression layer performs the final estimation of the 4-point homography between F 1 and F t+1 for t = 1, · · · , N . We elaborate on the network in the following sections.

3.2. CNN architecture. Following [12], we use a convolutional neural network (CNN) to first extract features from each pair of adjacent frames F t and F t+1 , where t = 1, · · · , N . Figure 2 shows the configuration of the CNN in our proposed network. Specifically, it contains 8 convolutional layers, every two of which are followed by a Batch Normalization (BN) [21] layer and a MaxPooling layer; there is no MaxPooling layer after the last two convolutional layers. The input of the CNN is a pair of adjacent frames of size 128 × 128, each of which is a gray-scale image. The numbers of filters in the convolutional layers are 64, 64, 64, 64, 128, 128, 128, and 128, respectively. The Rectified Linear Unit (ReLU) [38] is used as the activation function to add non-linearity. Each filter in the convolutional layers is of size 3 × 3 and convolved with a stride of 1. MaxPooling is performed over a 3 × 3 window with a stride of 2. After the last convolutional layer, a fully-connected (FC) layer is used to output a 1,024-dimensional feature for each input pair of adjacent frames. During training, we add a Dropout [45] layer with a drop rate of 0.5 after the FC layer to avoid over-fitting. For an (N + 1)-frame video {F 1 , F 2 , · · · , F N +1 }, we extract a sequence of features {x 1 , x 2 , · · · , x N } from the adjacent-frame pairs using the CNN and feed these features to the RNN sequentially.
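A minimal Keras sketch of this CNN configuration is given below. It assumes the two gray-scale frames of a pair are stacked as a 2-channel input (the stacking convention of [12]), 'same' padding throughout, and BN placed after the activation; these details are assumptions, since the text does not fix them.

```python
from tensorflow.keras import layers, models

def build_cnn_feature_extractor():
    """Per-pair feature extractor: 8 conv layers (64x4 then 128x4 filters,
    3x3 kernels, stride 1), BN after every second conv, 3x3/stride-2 max
    pooling after the first three conv pairs (none after the last pair),
    then an FC layer producing the 1,024-d feature, with Dropout(0.5)."""
    inp = layers.Input(shape=(128, 128, 2))   # a stacked pair of gray frames
    x = inp
    for i, filters in enumerate([64, 64, 64, 64, 128, 128, 128, 128]):
        x = layers.Conv2D(filters, 3, strides=1, padding='same',
                          activation='relu')(x)
        if i % 2 == 1:                        # after every second conv layer
            x = layers.BatchNormalization()(x)
            if i < 7:                         # no pooling after the last pair
                x = layers.MaxPooling2D(pool_size=3, strides=2,
                                        padding='same')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation='relu')(x)
    x = layers.Dropout(0.5)(x)                # active only during training
    return models.Model(inp, x)
```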
3.3. RNN architecture and regression layer. An RNN with LSTM cells is well suited to sequence modeling, since the LSTM cell contains specially designed units for remembering important information and forgetting unimportant information along the sequence. Figure 3 shows the architecture of an LSTM cell. Let σ and φ be the logistic sigmoid function and the hyperbolic tangent function, respectively, and let {i, f, o, c, h} be the input gate, forget gate, output gate, memory cell and hidden state, respectively. The LSTM sequentially updates {i, f, o, c, h} at time step t, given input x t , hidden state h t−1 , and cell state c t−1 , as follows:

i t = σ(W xi x t + W hi h t−1 + b i ),
f t = σ(W xf x t + W hf h t−1 + b f ),
o t = σ(W xo x t + W ho h t−1 + b o ),
c t = f t ⊙ c t−1 + i t ⊙ φ(W xc x t + W hc h t−1 + b c ),
h t = o t ⊙ φ(c t ),

where the W's and b's are the network parameters to be learned and t = 1, · · · , N . With the input gate and the forget gate, each LSTM cell can learn to selectively forget its old memories and refresh them with new inputs. In addition, the output gate o t controls how much of the stored memory is passed to the hidden state h t . In this paper, we employ a single RNN layer with LSTM cells to process the features of each adjacent-frame pair extracted by the CNN. At each time step t, the RNN outputs features that contain information from the first frame up to the t-th frame, which exploits the temporal dynamics along the sequence. A Dropout layer with a drop rate of 0.5 is applied after the RNN layer during training. The final regression layer is a fully-connected layer, which takes the RNN features at time step t as input and estimates the homography Ĥ 1,t+1 between the first frame and the (t + 1)-th frame. The estimated homography Ĥ 1,t+1 is an 8-dimensional vector, corresponding to the 8 values of the 4-point homography parameterization.
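Putting the pieces together, a hedged Keras sketch of the CNN-RNN-regression pipeline might look as follows. The stand-in feature extractor, the TimeDistributed wrapping and the plain linear output layer are illustrative assumptions, not the authors' exact implementation.

```python
from tensorflow.keras import layers, models

def build_recurrent_regressor(cnn, n_steps=15):
    """Sketch of the full pipeline: a per-pair CNN applied to every
    adjacent-frame pair, a single LSTM layer (1,024 cells in the paper)
    propagating temporal state, and an FC layer regressing the 8-value
    4-point homography at every time step."""
    inp = layers.Input(shape=(n_steps, 128, 128, 2))  # N adjacent-frame pairs
    x = layers.TimeDistributed(cnn)(inp)              # (batch, N, feat_dim)
    x = layers.LSTM(1024, return_sequences=True)(x)   # one output per step
    x = layers.Dropout(0.5)(x)                        # training-time only
    out = layers.Dense(8)(x)                          # 8 corner displacements
    return models.Model(inp, out)

# Stand-in per-pair feature extractor (the paper uses the 8-conv CNN of
# Section 3.2); kept tiny here so the sketch builds quickly.
cnn = models.Sequential([layers.Input((128, 128, 2)),
                         layers.Flatten(),
                         layers.Dense(64, activation='relu')])
model = build_recurrent_regressor(cnn)  # one 8-d homography per time step
```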
3.5. Data generation. As a deep-learning model, the proposed network requires large-scale training and testing videos with known ground-truth homographies. Previously, random homography transformations were used in many existing deep-learning methods [12,8,40] to construct large numbers of image pairs with known ground-truth homographies for training the networks. However, this approach is not directly applicable to video-based homography estimation, because independent and random warping between different images cannot reflect the temporal continuity and dynamics of videos. In this section, we introduce a simple approach to synthesize large-scale videos with realistic ground-truth homographies. As shown in Fig. 4, we draw four fixed points on a flat white board and then use a hand-held camera to freely record an (N + 1)-frame video of these 4 points. With a clean background (i.e., the white board), we can easily detect and track these four points over the video. We monitor the tracking progress and rectify incorrectly detected locations with manual annotations. The tracked 4-point correspondences are used to calculate the ground-truth homographies {H 1,2 , H 1,3 , · · · , H 1,N +1 } along the recorded video, as shown in Fig. 4. We then take a real image as the original image I 1 , and apply the transforms {H 1,2 , H 1,3 , · · · , H 1,N +1 } to warp I 1 into warped images I 2 , . . . , I N +1 . This leads to a synthesized image sequence I 1 , . . . , I N +1 . More specifically, we make a uniform cropping, shown by the blue boxes in Fig. 5, on both the original and the warped images to exclude blank areas and to ensure an identical size for all the generated frames along a sequence. Finally, the cropped regions are taken as the desired video consisting of the frame sequence F 1 , . . . , F N +1 with ground-truth homographies {H 1,2 , H 1,3 , · · · , H 1,N +1 }, which reflect temporal dynamics of the real world.
We can use different real images for I 1 and construct different ground-truth homographies by moving the hand-held camera in different ways (i.e., with different tilt, pan and zoom) toward the four points. In the later experiments, we use images in the MS-COCO [27] dataset for data generation. Some sample videos are shown in Fig. 6. We can interpret each sample video as observing a natural scene from one view and then from different views.

4. Experiments. In this section, we first describe the experimental setup, and then report the experimental results on the synthesized and real-world video datasets.

4.1. Experimental setup. We implemented the proposed method in Keras [10], an open-source deep learning package. The parameters of the proposed recurrent convolutional regression network are initialized via He's method [17]. To train the network, we use the Adam optimizer [25] with the default parameters, except that ε is set to 0.1. The mini-batch size is set to 16. It takes approximately 20 hours to train our network for 200 epochs on an NVIDIA Tesla P100 GPU. We monitor the validation loss to avoid over-fitting during training. The model that achieves the smallest validation loss is chosen for testing.
Following [12,8,40], we use the corner error as the metric for performance evaluation. The four corners of F 1 are transformed to P kj , j = 1, 2, 3, 4, using the ground-truth homography H 1,k , and to P̂ kj , j = 1, 2, 3, 4, using the estimated homography Ĥ 1,k , where k = 2, . . . , N + 1. The corner error at time k is then

e k = (1/4) Σ j ‖P kj − P̂ kj ‖ 2 , j = 1, . . . , 4, (4)

and the corner error for the whole video is the average of e k over k = 2, . . . , N + 1. A lower corner error indicates better performance in homography estimation.
We compare the proposed method with nine existing homography estimation methods. SIFT+RANSAC, SURF+RANSAC and ORB+RANSAC are three conventional methods based on the hand-crafted features SIFT, SURF and ORB, respectively, all followed by DLT with RANSAC for homography estimation. We implemented them using the corresponding functions in the OpenCV library [7]. IC [2], ESM [6], and HA [23] are three template-matching based homography estimation methods, implemented based on the 2D template tracking code [32]. The LK tracker [31] is a feature-tracking based method, where the tracked features are used for estimating homographies between adjacent or non-adjacent frames; it was also implemented using the corresponding functions in the OpenCV library. HomographyNet is a deep-learning based method for homography estimation, for which we use the architecture and default settings described in [12]. Graph-cut RANSAC [4] is a recent state-of-the-art homography estimation approach, which runs a graph-cut algorithm in the local optimization step to separate inliers and outliers.
These comparison methods, except for HomographyNet, are unsupervised, and we directly apply them to the testing data for evaluation. Furthermore, SIFT+RANSAC, SURF+RANSAC, ORB+RANSAC, and HomographyNet are developed for directly estimating homographies between two images, e.g., any two frames in a video. We can also use them to estimate homographies only between adjacent frames and then compose the results to obtain homographies between non-adjacent frames [15]. We denote these variants with composing operations as SIFT+RANSAC*, SURF+RANSAC*, ORB+RANSAC* and HomographyNet*, respectively.
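The composing operation denoted by "*" amounts to a running matrix product of adjacent-frame homographies; a sketch (assuming homographies that map points of frame t to frame t + 1):

```python
import numpy as np

def compose_homographies(adjacent_Hs):
    """Compose adjacent-frame homographies {H(t,t+1)} into non-adjacent
    ones: H(1,t+1) = H(t,t+1) . H(1,t). Any error in an individual factor
    propagates into every subsequent composed estimate."""
    out, H = [], np.eye(3)
    for H_step in adjacent_Hs:
        H = H_step @ H           # left-multiply by the newest adjacent step
        out.append(H / H[2, 2])  # normalize the homogeneous scale
    return out
```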

4.2. Results on synthesized data. We use the technique developed in Section 3.5 to generate synthesized video data for performance evaluation. An iPhone 6s is used as the hand-held video camera to record videos of four points on a white board for constructing ground-truth homographies, at a 30 fps frame rate and a 1920 × 1080 frame resolution. We collect 9 videos of the four points, each containing 2,200 frames. We annotate every other frame in each video sequence. Starting from every annotated frame in a video, we take 16 consecutive annotated frames as a one-second segment, in which the frame resolution is down-sampled to 320 × 240. This way, we can compute 15 sequential homographies H 1,2 , H 1,3 , . . . , H 1,16 for each video segment, via the getPerspectiveTransform function in the OpenCV library, as the ground truth. Images randomly chosen from the MS-COCO dataset are down-sampled to 320 × 240 and taken as the original images for video generation. The final frames, i.e., the ones cropped by the blue boxes in Fig. 5, have a size of 128 × 128. In total, we construct a dataset of 9 videos, in which each video contributes 1,000 short video sequences of 16 frames with ground-truth homographies. We randomly select 5 videos for training, 2 for testing, and 2 for validation, i.e., the training, validation and testing sets have 5,000, 2,000 and 2,000 short video sequences, respectively.
We first study the impact of the RNN by varying the number of LSTM cells in the proposed network. Specifically, we test four cases, with the results shown in Table 1. We can see that the proposed network learns more effective features and produces a smaller corner error (averaged over all the testing data in the synthesized dataset) as the number of LSTM cells increases, up to 1,024 cells. After that, the corner error increases if we further increase the number of LSTM cells, which we believe is caused by over-fitting. We use 1,024 LSTM cells for all the remaining experiments.

Table 1. Average corner error of the proposed method when using different numbers of LSTM cells.
The average corner errors of the proposed method and the comparison methods on the testing videos of the synthesized dataset are summarized in Fig. 7. We can see that all the conventional methods based on hand-crafted features show comparable performance. ORB+RANSAC shows a higher corner error than SIFT+RANSAC and SURF+RANSAC; one possible reason is that ORB does not have the scale-invariance properties of SIFT and SURF. Graph-cut RANSAC obtains a lower corner error than the other RANSAC-based methods, due to the technical improvement of incorporating graph-cut into RANSAC. Among the template matching methods, ESM has a lower corner error than IC and HA. This may be because ESM has a higher convergence rate and does not need to compute the Hessian [36]. The LK tracker obtains a lower corner error than the other methods based on feature/template matching, due to its utilization of temporal continuity via feature tracking across frames. HomographyNet performs better than the other comparison methods, which verifies the effectiveness of the learned deep features.
Composing homographies between adjacent frames to obtain homographies between non-adjacent frames (the four methods with superscript "*") produces large errors because of the error accumulated frame by frame. The proposed method achieves the best performance. Compared to HomographyNet, the proposed method reduces the corner error by 64.2%, to 1.36. This verifies that the proposed method can effectively exploit the temporal dynamics across frames and learn good features for homography estimation along short videos.

We also present a complexity analysis of the proposed method in terms of the number of parameters and the number of floating-point operations (FLOPs) for processing one pair of frames, as shown in Table 2. We can see that the proposed method introduces 1.7M additional parameters and 16.8M more FLOPs compared with the baseline method HomographyNet. These results show that the proposed method obtains a much lower corner error without introducing too much computational cost.

Table 2. Complexity analysis, where #Param. denotes the number of parameters and FLOPs denotes the number of floating-point operations.

Figure 8 shows the corner error on the test videos over time: for each time k, we compute the average of e k over all the test data, where e k is defined in Eq. 4. We can see that the corner error increases over time for all methods, due to the increased camera-view-angle changes over time. The four methods with superscript "*" produce large accumulated errors when composing adjacent-frame homographies. The corner error of the proposed method is lower than that of all the comparison methods at each time. Furthermore, the performance of the proposed method degrades much more slowly over time than that of all the comparison methods. This results from the use of the RNN with LSTM cells in the proposed method, which effectively exploits the temporal dynamics and contributes to better homography estimation.
We also conduct an experiment to test the robustness of the proposed method against two kinds of challenges that are highly likely to be present in the real world. Specifically, we add color variations and/or Gaussian noise to the original dataset, generating three synthesized video datasets: one with color variations, one with Gaussian noise, and one with both. The original dataset and these three datasets are denoted "Original", "Color variations", "Gaussian noise" and "Both", respectively. For color variations, we randomly enhance the contrast, brightness and color by an enhancement amount between 0.5 and 1.5, in a random order, for a whole sequence, following Howard [18]. For Gaussian noise, we add zero-mean noise with a standard deviation of 0.02 to each sequence. We train a new model on the training data of the "Both" dataset. The original model and the newly trained model are denoted model orig and model both , respectively.
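A hedged sketch of this augmentation for gray-scale sequences is given below. The paper follows Howard [18] with image-enhancement operators; this sketch approximates contrast and brightness enhancement by a per-sequence gain and offset (color enhancement has no gray-scale analogue), which is an assumption for illustration.

```python
import numpy as np

def augment_sequence(frames, rng=None):
    """Apply Howard-style photometric jitter plus Gaussian noise to a whole
    sequence. The enhancement amounts are drawn once per sequence (in
    [0.5, 1.5]) so that all frames share the same appearance change, and
    zero-mean Gaussian noise with sigma = 0.02 is added per frame."""
    if rng is None:
        rng = np.random.default_rng()
    contrast = rng.uniform(0.5, 1.5)
    brightness = rng.uniform(0.5, 1.5)
    out = []
    for f in frames:
        f = f.astype(np.float32) / 255.0
        f = (f - f.mean()) * contrast + f.mean()  # contrast about the mean
        f = f * brightness                        # brightness gain
        f = f + rng.normal(0.0, 0.02, f.shape)    # zero-mean noise, sigma=0.02
        out.append(np.clip(f, 0.0, 1.0))
    return out
```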

Table 3. Performance of the proposed method trained on the "Original" dataset and on the "Both" dataset, and tested on the "Original", "Color variations", "Gaussian noise" and "Both" datasets.
The two models are tested on the test sets of all four datasets. The resulting corner errors are summarized in Table 3. model orig performs worse on the "Color variations", "Gaussian noise" and "Both" datasets than on the "Original" dataset, while model both performs worse on the "Color variations" and "Original" datasets than on the "Both" dataset. This is due to the domain shift brought by the color variations and/or Gaussian noise. Clearly, model orig is more sensitive to Gaussian noise than to color variations. model both performs better on the "Gaussian noise" dataset than on the "Color variations" dataset. Since the "Both" dataset differs from the "Color variations" dataset by Gaussian noise, and from the "Gaussian noise" dataset by color variations, we can conclude that model both is also more sensitive to Gaussian noise. We believe that the batch normalization in the proposed network can absorb some of the domain shift from color variations, while Gaussian noise is more difficult to handle. Note that model both achieves a lower corner error on the "Gaussian noise" dataset than on the "Both" dataset; this may be caused by randomness in the network initialization.

4.3. Results on real-world data. We also evaluate the proposed method on a real-world video dataset [26]. This dataset is designed for planar object tracking and is captured in the wild rather than in a constrained laboratory environment. The videos are annotated at every other frame, and the ground-truth homographies are calculated from the labeled positions of a planar object in each frame. The resolution of each frame is 1280 × 720, and each video is recorded at a 30 fps frame rate. Since the proposed method estimates homographies based on entire frames, we carefully choose the frames in which the entire scene is a planar surface, i.e., in which the ground-truth homographies are valid for the whole frame. As with the synthesized data, starting from every frame, we take 16 consecutive frames as a segment, with a sequence of 15 ground-truth homographies for the segment. In total, we generate 335 non-overlapping video segments. With respect to the homography transformation of the planar object, this dataset involves several challenging factors, including lighting variation, scale change, perspective distortion, occlusion, and out-of-view motion. Samples of the real-world videos are shown in Fig. 9.
In the experiment, the frames are first down-sampled to 128 × 128 and converted to gray-scale to fit the input size of the proposed network. We then apply a three-fold cross-validation scheme for evaluation: we randomly divide the generated real-world video dataset into three subsets of the same size, and each time, two subsets are used for training and the remaining subset for testing. For training, we initialize the parameters of the proposed network with the weights pre-trained on the synthesized video dataset. The same optimizer and mini-batch size as in Section 4.1 are used for fine-tuning. It takes about 150 epochs (20 minutes) to train the network until the training loss converges, on an NVIDIA Tesla P100 GPU.

The performance of the proposed method and the comparison methods on the real-world video data is shown in Fig. 10. Here the corner error is calculated using the four corners of the planar object in each frame. Note that all the methods, except for HomographyNet and the proposed method, compute homographies on the input videos at their original size. Therefore, for HomographyNet and the proposed method, we map the estimated corners back to the original scale before calculating the corner error, for a fair comparison with the other methods. From Fig. 10, we can see that the template matching methods achieve higher corner errors than the hand-crafted feature methods. One possible reason is that the template matching methods need to match the whole template, which can be strongly affected by occlusion and out-of-view cases, while the hand-crafted feature methods do not rely on the whole template. Graph-cut RANSAC obtains a lower corner error than the deep learning method HomographyNet, and shows better performance than the other comparison methods. This may be due to its superiority over the original RANSAC in separating inliers from outliers among the matched feature points via the graph-cut algorithm.
The proposed method outperforms all the comparison methods, reducing the corner error by 40.7%, to 7.49, compared to HomographyNet. These results again verify that, on real-world data, the proposed method can learn visual features more effectively by exploiting the temporal dynamics across frames.
We also evaluate the corner error over time for the proposed method and the comparison methods on the real-world data. As shown in Fig. 11, the corner error of all methods increases over time, due to the increased changes of scene and camera view angle resulting from the camera movement. The proposed method still outperforms all the comparison methods at every time step, and shows much smaller performance degradation over time. This result verifies the effectiveness of the RNN with LSTM cells in the proposed network, which exploits the temporal dynamics and leads to more accurate homography estimation.

Figure 10. Performance of the proposed method and the comparison methods on the real-world video dataset.

Figure 11. Corner errors of the proposed method and the comparison methods over time on the real-world video dataset.

5. Conclusion. In this paper, we propose a novel method, a recurrent convolutional regression network, to estimate the homographies along a short video. The proposed network consists of a CNN, an RNN with LSTM cells and a regression layer that predicts homographies sequentially. To train the network, we introduce a simple but effective approach to synthesize a large video dataset with ground-truth homographies. We evaluate the proposed method on both synthesized and real-world video datasets. The experimental results on these datasets verify that, by exploiting the temporal dynamics across frames via the RNN with LSTM cells, the proposed method can estimate homographies along short videos better than several existing methods.