Homography estimation along short videos by recurrent convolutional regression network

    * Corresponding author: Song Wang
    Abstract: Many moving-camera video processing and analysis tasks require accurate estimation of the homography across frames. Estimating the homography between non-adjacent frames can be very challenging when their camera view angles differ substantially. In this paper, we propose a new deep-learning-based method for homography estimation along videos that exploits temporal dynamics across frames. More specifically, we develop a recurrent convolutional regression network consisting of a convolutional neural network (CNN) and a recurrent neural network (RNN) with long short-term memory (LSTM) cells, followed by a regression layer that estimates the parameters of the homography. In the experiments, we evaluate the proposed method on both synthesized and real-world short videos. The experimental results verify that the proposed method estimates homographies along short videos more accurately than several existing methods.

    Mathematics Subject Classification: Primary: 68U10, 68T10.
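As background to the homography regression described in the abstract: a planar homography is fully determined by four point correspondences. The sketch below is not the paper's network; it is a minimal NumPy implementation of the classical direct linear transform (DLT) solve, the geometric fact underlying 4-point homography parameterizations.

```python
import numpy as np

def homography_from_4pts(src, dst):
    """Estimate a 3x3 homography H with dst ~ H @ src from four point pairs (DLT)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on the 9 entries of H.
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(A, dtype=float)
    # The null space of A (last right singular vector) gives H up to scale.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def warp(H, pts):
    """Apply H to an Nx2 array of points in inhomogeneous coordinates."""
    p = np.hstack([pts, np.ones((len(pts), 1))])
    q = p @ H.T
    return q[:, :2] / q[:, 2:3]
```

With exact correspondences the four points determine H uniquely (up to the fixed scale), so warping the source points with the recovered H reproduces the targets.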

    Figure 1.  Architecture of the proposed recurrent convolutional regression network.

    Figure 2.  Configuration of the CNN used in the proposed method.

    Figure 3.  The architecture of an LSTM cell.

    Figure 4.  Sample frames of the 4 points in a recorded video, with computed homographies.

    Figure 5.  An illustration of constructing a video sequence with ground-truth homographies.

    Figure 6.  Sample videos generated using MS-COCO images with ground-truth homographies. Each row shows frames of a sample video.

    Figure 7.  Comparison of the proposed method to the existing homography estimation methods on the synthesized video dataset.

    Figure 8.  Corner errors of the proposed method and the comparison methods over time on the synthesized video dataset.

    Figure 9.  Sample real-world videos with ground-truth homographies. Each row shows a video with an observed challenge.

    Figure 10.  Performance of the proposed method and the comparison methods on the real-world video dataset.

    Figure 11.  Corner errors of the proposed method and the comparison methods over time on the real-world video dataset.
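Figure 5 illustrates constructing a video sequence with ground-truth homographies. The exact synthesis procedure is not spelled out on this page, so the following is a hedged sketch of the common recipe for this kind of benchmark: randomly perturb the four corners of an image patch and solve the resulting 8x8 linear system for the ground-truth H (the patch size of 128 and maximum corner shift of 32 are assumed defaults, not values from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_homography(w=128, h=128, max_shift=32):
    """Sample a ground-truth homography by randomly perturbing the four patch corners."""
    src = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=float)
    dst = src + rng.uniform(-max_shift, max_shift, size=(4, 2))
    # Fix H[2,2] = 1 and solve the exact 8x8 system for the remaining entries.
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    hvec = np.linalg.solve(np.array(A), np.array(b))
    return np.append(hvec, 1.0).reshape(3, 3)
```

Warping the original patch with the sampled H then yields an image pair whose relating homography is known exactly, which is what makes the synthesized videos usable as supervised training data.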

    Table 1.  Average corner error of the proposed method using different numbers of LSTM memory cells.

    Number of LSTM memory cells | Corner error | Error reduction
    256                         | 2.34         | -
    512                         | 1.44         | 38.5%
    1024                        | 1.36         | 41.9%
    2048                        | 1.37         | 41.4%
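The corner error reported in Tables 1-3 is, by the usual convention in homography-estimation benchmarks, the mean L2 distance between the four image corners warped by the estimated homography and by the ground-truth one. A minimal sketch of that metric, assuming a 128x128 patch (the patch size is an assumption, not stated on this page):

```python
import numpy as np

def corner_error(H_est, H_gt, w=128, h=128):
    """Mean L2 distance between the four patch corners warped by H_est vs. H_gt."""
    corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=float)

    def warp(H, pts):
        # Project points through H in homogeneous coordinates, then dehomogenize.
        p = np.hstack([pts, np.ones((len(pts), 1))])
        q = p @ H.T
        return q[:, :2] / q[:, 2:3]

    return float(np.mean(np.linalg.norm(warp(H_est, corners) - warp(H_gt, corners), axis=1)))
```

For example, an estimate that is a pure translation of (3, 4) pixels away from the ground truth displaces every corner by 5 pixels, giving a corner error of exactly 5.0.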

    Table 2.  Complexity analysis, where #Param. denotes the number of parameters and FLOPs denotes the number of floating-point operations.

    Method        | #Param. | FLOPs
    HomographyNet | 3.4M    | 68.4M
    Proposed      | 5.1M    | 85.2M

    Table 3.  Performance of the proposed method trained on the "Original" dataset (model_orig) and on the "Both" dataset (model_both), tested on the "Original", "Color variations", "Gaussian noise" and "Both" datasets.

               | Original | Color variations | Gaussian noise | Both
    model_orig | 1.36     | 1.41             | 2.18           | 2.42
    model_both | 1.79     | 1.88             | 1.65           | 1.69
