
PDEs on graphs for semi-supervised learning applied to first-person activity recognition in body-worn video

  • Corresponding author: Andrea L. Bertozzi

This work was supported by NIJ grant 2014-R2-CX-0101, NSF grant DMS-1737770, and NSF grant DMS-1952339. The first two authors contributed equally to this work.

Abstract
This paper showcases PDE-based graph methods for modern machine learning applications. We consider a case study of body-worn video classification, chosen because of the large volume of data and the scarcity of training data due to the sensitivity of the information. Many modern artificial intelligence methods rely on deep learning, which typically requires large amounts of training data to be effective. Deep models can also suffer from issues of trust, because heavy use of training data can inadvertently reveal details of the training images and compromise privacy. Our alternative approach is physics-based machine learning: it pairs classical ideas such as optical flow for video analysis with linear mixture models such as non-negative matrix factorization, and with PDE-based graph classification methods that parallel geometric evolution equations such as motion by mean curvature. The upshot is a methodology that works well on video with modest amounts of training data and that can also compress the information about the video scene so that no personal information remains in the compressed data. This makes it possible to give a larger group of people access to the compressed data without compromising privacy: the compressed data retain information about the wearer of the camera while discarding information about people, objects, and places in the scene.
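The non-negative matrix factorization step mentioned in the abstract is standard NMF. As a minimal sketch, here are the Lee-Seung multiplicative updates [23] for a nonnegative feature matrix `V` and reduced rank `k`; the variable names and iteration count are ours, not the authors' implementation:

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9):
    """Lee-Seung multiplicative updates [23] minimizing ||V - W H||_F.

    V: (n_features, n_samples) nonnegative matrix; k: reduced rank.
    Returns nonnegative factors W (n_features, k) and H (k, n_samples);
    the columns of H serve as the reduced feature vectors.
    """
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.random((n, k)) + eps           # positive random init
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Each update is nonnegativity-preserving and non-increasing in the
        # Frobenius objective.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

In the paper's pipeline the columns of `V` would be the motion histograms of the video segments, but any nonnegative data matrix works.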

    Mathematics Subject Classification: Primary: 58F15, 58F17; Secondary: 53C35.

  • Figure 1.  A summary of the proposed method. First, we compute a dense optical flow field for each pair of consecutive frames. We then divide each optical flow field into $ s_x\times s_y $ spatial regions, where each region consists of $ dx\times dy $ pixels, and divide the video into $ s_t $ temporal segments, where each segment consists of $ dt $ frames. For each $ dx \times dy \times dt $ cuboid, we count the number of flow vectors with direction lying within each octant, yielding an $ s_x\times s_y $ array of 8-bin histograms for each segment of video. We reshape and concatenate these histograms into a single feature vector of dimension $ s_x \times s_y \times 8 $ describing the motion that occurs within the video segment. The dimension of the feature vectors is reduced with NMF, and the reduced features are smoothed with a moving-window average operator. Finally, we classify the smoothed features with a semi-supervised MBO scheme.
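The octant-histogram feature described in the Figure 1 caption can be sketched in a few lines of NumPy. This is an illustrative re-implementation under stated assumptions (a uniform spatial grid, and a flow field supplied externally, e.g. by Farnebäck's method [10]); the function and variable names are ours, not the authors' code:

```python
import numpy as np

def segment_feature(flow, s_x, s_y):
    """Octant-histogram motion feature for one video segment (cf. Figure 1).

    flow: array of shape (dt, H, W, 2) holding per-pixel (u, v) optical-flow
    vectors for the dt frame pairs of the segment.  Each frame is split into
    an s_x-by-s_y grid of regions; per region we count the flow vectors whose
    direction falls in each of the 8 octants, then flatten the counts into a
    single feature vector of length s_x * s_y * 8.
    """
    dt, H, W, _ = flow.shape
    ang = np.arctan2(flow[..., 1], flow[..., 0])        # direction in (-pi, pi]
    octant = np.floor((ang + np.pi) / (np.pi / 4)).astype(int) % 8
    row = np.minimum(np.arange(H) * s_x // H, s_x - 1)  # region row per pixel
    col = np.minimum(np.arange(W) * s_y // W, s_y - 1)  # region column per pixel
    hist = np.zeros((s_x, s_y, 8))
    for t in range(dt):
        for i in range(s_x):
            for j in range(s_y):
                block = octant[t][row == i][:, col == j]
                hist[i, j] += np.bincount(block.ravel(), minlength=8)
    return hist.reshape(-1)                             # length s_x * s_y * 8
```

Concatenating one such vector per segment yields the raw feature matrix that is subsequently reduced with NMF and smoothed in time.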

    Figure 2.  Classification results on a contiguous sample of 4000 segments (approximately 13 minutes) from the LAPD body-worn video data set. The results are obtained by running both methods with the parameters described in Section 4.2.

    Figure 3.  Confusion matrices for the LAPD body-worn video data set. The background intensity in cell $ (k, \ell) $ corresponds to the number of data points in class $ k $ that are classified as class $ \ell $ by the algorithm.

    Figure 4.  Confusion matrix for the HUJI EgoSeg data set. The background intensity in cell $ (k, \ell) $ corresponds to the number of data points in class $ k $ that are classified as class $ \ell $ by the algorithm.

    Figure 5.  Classification results on a contiguous sample of 4000 segments (approximately 4 hours) from the testing set of the HUJI EgoSeg data set. The recall for the same experiment is reported in Table 4.

    Figure 6.  Confusion matrices for the LAPD police body-worn video data set. The background intensity of cell $ (k, \ell) $ corresponds to the number of data points in class $ k $ that are classified as class $ \ell $ by the algorithm.
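The confusion matrices in Figures 3, 4, and 6 determine the per-class precision and recall reported in the tables below. With the stated convention that cell $ (k, \ell) $ counts points of true class $ k $ classified as class $ \ell $, the conversion is a short computation (a minimal sketch, not the authors' evaluation code):

```python
import numpy as np

def precision_recall(C):
    """Per-class precision and recall from a confusion matrix C, where
    C[k, l] counts points of true class k classified as class l
    (the convention of Figures 3, 4, and 6)."""
    tp = np.diag(C).astype(float)        # correctly classified per class
    precision = tp / C.sum(axis=0)       # column sum: all points predicted as that class
    recall = tp / C.sum(axis=1)          # row sum: all points truly in that class
    return precision, recall
```

Overall accuracy, as in Table 5, is simply `np.diag(C).sum() / C.sum()`.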

    Table 1.  Experimental Setup

    Column groups: motion feature ($ \Delta T $, FPS, number of segments, window size); NMF ($ \hat{k} $); spectrum of the graph Laplacian ($ N_\mathrm{eig} $, $ \tau_{ij} $, $ N_{sample} $); MBO (batch size, $ \eta $, $ dt $, $ N_{step} $).

    Data set | $ \Delta T $ (sec) | FPS | Number of segments | Window size (segments) | $ \hat{k} $ | $ N_\mathrm{eig} $ | $ \tau_{ij} $ | $ N_{sample} $ | Batch size (segments) | $ \eta $ | $ dt $ | $ N_{step} $
    -------- | ------------------ | --- | ------------------ | ---------------------- | ----------- | ------------------ | ------------- | -------------- | --------------------- | -------- | ------ | ------------
    QUAD | $ 1/60 $ | 60 | 14,399 | - | 50 | 500 | $ \tau = 1 $ | 1000 | - | 300 | 0.1 | 10
    LAPD | $ 1/5 $ | 30 | 274,443 | 5 | 50 | 2000 | $ K = 100 $ | 2000 | 30,000 | 400 | 0.1 | 10
    LAPD [30] | $ 1/5 $ | 30 | 274,443 | - | - | 2000 | $ K = 100 $ | 2000 | 30,000 | 400 | 0.1 | 10
    HUJI | 4 | 15 | 36,421 | 20 | 50 | 400 | $ K = 40 $ | 400 | - | 300 | 0.1 | 10
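The graph and MBO parameters of Table 1 can be connected by a toy implementation. The sketch below follows the semi-supervised graph MBO scheme of [32] in spirit: implicit-Euler diffusion steps in a (possibly truncated) Laplacian eigenbasis, a fidelity pull of strength $ \eta $ toward known labels, and a pointwise threshold to class indicators. It assumes the last MBO columns of Table 1 are the fidelity strength $ \eta $, the time step $ dt $, and the number of diffusion substeps $ N_{step} $; the update details and the outer iteration count are our simplification, not the authors' code:

```python
import numpy as np

def graph_mbo(vals, vecs, labels, mask, n_class,
              dt=0.1, eta=300.0, n_step=10, n_iter=30):
    """Semi-supervised graph MBO sketch (after [32]).

    vals, vecs: leading eigenvalues/eigenvectors of the graph Laplacian.
    labels: (n,) integer class labels, trusted only where mask is True.
    Alternates n_step implicit diffusion substeps of u_t = -L u with a
    fidelity pull toward the known labels, then thresholds each row to
    the nearest class indicator.  Returns the predicted class per node.
    """
    n = len(labels)
    u_known = np.zeros((n, n_class))
    u_known[mask] = np.eye(n_class)[labels[mask]]
    u = np.full((n, n_class), 1.0 / n_class)   # uniform start on unlabeled nodes
    u[mask] = u_known[mask]
    dts = dt / n_step
    for _ in range(n_iter):
        for _ in range(n_step):
            a = vecs.T @ u
            a /= (1.0 + dts * vals)[:, None]          # implicit diffusion step
            u = vecs @ a
            pull = dts * eta * mask[:, None]
            u = (u + pull * u_known) / (1.0 + pull)   # fidelity on labeled nodes
        u = np.eye(n_class)[u.argmax(1)]              # MBO threshold step
    return u.argmax(1)
```

At the scale of Table 1 the eigenpairs would come from the Nyström extension [15] with self-tuning weights [47]; the toy version below simply diagonalizes a small dense Laplacian.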

    Table 2.  Class proportion and precision of the QUAD data set

    Class | Proportion | Precision [21] | Precision [30] | Precision (Ours)
    ----- | ---------- | -------------- | -------------- | ----------------
    Jump | 14.54% | - | 92.51% | 99.07%
    Stand | 13.74% | - | 87.90% | 87.11%
    Walk | 12.75% | - | 84.52% | 98.37%
    Step | 12.65% | - | 93.98% | 98.54%
    Turn Left | 11.25% | - | 89.43% | 96.96%
    Turn Right | 10.16% | - | 92.80% | 96.21%
    Run | 9.00% | - | 92.38% | 96.17%
    Look Up | 8.85% | - | 80.36% | 90.02%
    Look Down | 7.06% | - | 84.59% | 89.00%
    Mean | 11.11% | 95% | 88.74% | 94.49%

    Table 3.  Class proportion, precision, and recall of the selected nine classes in the LAPD body-worn video data set

    Class | Proportion | Precision [30] | Precision (Ours) | Recall [30] | Recall (Ours)
    ----- | ---------- | -------------- | ---------------- | ----------- | -------------
    Stand still | 62.57% | 73.10% | 89.44% | 85.42% | 95.24%
    In stationary car | 16.84% | 41.83% | 93.69% | 43.18% | 89.73%
    Walk | 9.04% | 38.36% | 70.53% | 19.54% | 59.41%
    In moving car | 5.76% | 70.71% | 91.03% | 25.08% | 84.40%
    At car window | 0.64% | 17.23% | 71.45% | 10.94% | 45.28%
    At car trunk | 0.58% | 73.78% | 71.79% | 11.09% | 51.78%
    Run | 0.33% | 96.15% | 75.94% | 11.03% | 53.35%
    Bike | 0.33% | 85.71% | 86.49% | 14.37% | 75.44%
    Motorcycle | 0.08% | 100% | 92.49% | 10.76% | 71.75%
    Mean | 10.68% | 66.32% | 82.54% | 25.71% | 69.60%

    Table 4.  Class proportion and recall of the HUJI EgoSeg data set

    Class | Proportion | Recall [36] | Recall [40] | Recall [43] | Recall [37] | Recall (Ours)
    ----- | ---------- | ----------- | ----------- | ----------- | ----------- | -------------
    Walking | 34% | 83% | 91% | 79% | 89% | 91%
    Sitting | 25% | 62% | 70% | 62% | 84% | 71%
    Standing | 21% | 47% | 44% | 62% | 79% | 47%
    Biking | 8% | 86% | 34% | 36% | 91% | 88%
    Driving | 5% | 74% | 82% | 92% | 100% | 95%
    Static | 4% | 97% | 61% | 100% | 98% | 96%
    Riding Bus | 4% | 43% | 37% | 58% | 82% | 84%
    Mean | 14% | 70% | 60% | 70% | 89% | 82%
    Training data used | - | $ \sim $60% | $ \sim $60% | $ \sim $60% | $ \sim $60% | 6%

    Table 5.  Class proportion, precision, recall, and accuracy on the LAPD body-worn video data set

    Class | Proportion | Precision [30] | Precision (Ours) | Recall [30] | Recall (Ours)
    ----- | ---------- | -------------- | ---------------- | ----------- | -------------
    Stand still | 62.57% | 73.10% | 89.44% | 85.42% | 95.24%
    In stationary car | 16.84% | 41.83% | 93.69% | 43.18% | 89.73%
    Walk | 9.04% | 38.36% | 70.53% | 19.54% | 59.41%
    In moving car | 5.76% | 70.71% | 91.03% | 25.08% | 84.40%
    Obscured camera | 2.80% | 51.65% | 80.82% | 15.93% | 70.46%
    At car window | 0.64% | 17.23% | 71.45% | 10.94% | 45.28%
    At car trunk | 0.58% | 73.78% | 71.79% | 11.09% | 51.78%
    Exit driver | 0.35% | 6.68% | 50.25% | 11.82% | 21.12%
    Exit passenger | 0.34% | 79.69% | 48.08% | 11.59% | 26.29%
    Run | 0.33% | 96.15% | 75.94% | 11.03% | 53.35%
    Bike | 0.33% | 85.71% | 86.49% | 14.37% | 75.44%
    Enter passenger | 0.20% | 5.97% | 45.82% | 13.27% | 24.51%
    Enter driver | 0.12% | 5.72% | 34.33% | 12.30% | 20.91%
    Motorcycle | 0.08% | 100% | 92.49% | 10.76% | 71.75%
    Mean | 7.14% | 53.33% | 71.58% | 21.17% | 56.41%

    Overall accuracy: 65.03% ([30]) vs. 88.15% (Ours).
  • [1] G. Abebe and A. Cavallaro, A long short-term memory convolutional neural network for first-person vision activity recognition, in Proceedings of the IEEE International Conference on Computer Vision, 2017, 1339–1346. doi: 10.1109/ICCVW.2017.159.
    [2] K. Aizawa, K. Ishijima and M. Shiina, Summarizing wearable video, in Proceedings of the 2001 International Conference on Image Processing, vol. 3, IEEE, 2001, 398–401. doi: 10.1109/ICIP.2001.958135.
    [3] J. L. Barron, D. J. Fleet and S. S. Beauchemin, Performance of optical flow techniques, International Journal of Computer Vision, 12 (1994), 43-77.
    [4] A. L. Bertozzi and A. Flenner, Diffuse interface models on graphs for classification of high dimensional data, SIAM Review, 58 (2016), 293-328.  doi: 10.1137/16M1070426.
    [5] A. L. Bertozzi, X. Luo, A. M. Stuart and K. C. Zygalakis, Uncertainty quantification in graph-based classification of high dimensional data, SIAM/ASA Journal on Uncertainty Quantification, 6 (2018), 568-595. doi: 10.1137/17M1134214.
    [6] B. L. Bhatnagar, S. Singh, C. Arora and C. Jawahar, Unsupervised learning of deep feature representation for clustering egocentric actions, in IJCAI, 2017, 1447–1453. doi: 10.24963/ijcai.2017/200.
    [7] J. Budd and Y. V. Gennip, Graph Merriman–Bence–Osher as a semi-discrete implicit Euler scheme for graph Allen–Cahn flow, SIAM Journal on Mathematical Analysis, 52 (2020), 4101-4139. doi: 10.1137/19M1277394.
    [8] T. F. Chan and L. A. Vese, Active contours without edges, IEEE Transactions on Image Processing, 10 (2001), 266-277. doi: 10.1109/83.902291.
    [9] A. G. del Molino, C. Tan, J.-H. Lim and A.-H. Tan, Summarization of egocentric videos: A comprehensive survey, IEEE Transactions on Human-Machine Systems, 47 (2017), 65-76. doi: 10.1109/THMS.2016.2623480.
    [10] G. Farnebäck, Two-frame motion estimation based on polynomial expansion, in Scandinavian Conference on Image Analysis, Springer, 2003, 363–370.
    [11] A. Fathi, J. K. Hodgins and J. M. Rehg, Social interactions: A first-person perspective, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, 1226–1233. doi: 10.1109/CVPR.2012.6247805.
    [12] A. Fathi, A. Farhadi and J. M. Rehg, Understanding egocentric activities, in Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, 407–414. doi: 10.1109/ICCV.2011.6126269.
    [13] A. Fathi, Y. Li and J. M. Rehg, Learning to recognize daily actions using gaze, in European Conference on Computer Vision, Springer, 2012, 314–327. doi: 10.1007/978-3-642-33718-5_23.
    [14] D. Fortun, P. Bouthemy and C. Kervrann, Optical flow modeling and computation: A survey, Computer Vision and Image Understanding, 134 (2015), 1-21. doi: 10.1016/j.cviu.2015.02.008.
    [15] C. Fowlkes, S. Belongie, F. Chung and J. Malik, Spectral grouping using the Nyström method, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26 (2004), 214-225. doi: 10.1109/TPAMI.2004.1262185.
    [16] C. Garcia-Cardona, E. Merkurjev, A. L. Bertozzi, A. Flenner and A. G. Percus, Multiclass data segmentation using diffuse interface methods on graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence, 36 (2014), 1600-1613. doi: 10.1109/TPAMI.2014.2300478.
    [17] G. Gilboa and S. Osher, Nonlocal operators with applications to image processing, Multiscale Modeling & Simulation, 7 (2008), 1005-1028.  doi: 10.1137/070698592.
    [18] B. K. Horn and B. G. Schunck, Determining optical flow, Artificial Intelligence, 17 (1981), 185-203.  doi: 10.1016/0004-3702(81)90024-2.
    [19] G. Iyer, J. Chanussot and A. L. Bertozzi, A graph-based approach for feature extraction and segmentation of multimodal images, in 2017 IEEE International Conference on Image Processing (ICIP), IEEE, 2017, 3320–3324. doi: 10.1109/ICIP.2017.8296897.
    [20] M. Jacobs, E. Merkurjev and S. Esedoglu, Auction dynamics: A volume constrained MBO scheme, Journal of Computational Physics, 354 (2018), 288-310. doi: 10.1016/j.jcp.2017.10.036.
    [21] K. M. Kitani, T. Okabe, Y. Sato and A. Sugimoto, Fast unsupervised ego-action learning for first-person sports videos, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, 3241–3248. doi: 10.1109/CVPR.2011.5995406.
    [22] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems, SIAM, Philadelphia, PA, 1995. doi: 10.1137/1.9781611971217.
    [23] D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, in Advances in Neural Information Processing Systems, 2001, 556–562.
    [24] Y. Li, Z. Ye and J. M. Rehg, Delving into egocentric actions, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, 287–295. doi: 10.1109/CVPR.2015.7298625.
    [25] B. D. Lucas and T. Kanade, An iterative image registration technique with an application to stereo vision, in Proceedings of the 1981 DARPA Image Understanding Workshop, 1981, 121–130.
    [26] X. Luo and A. L. Bertozzi, Convergence of the graph Allen–Cahn scheme, Journal of Statistical Physics, 167 (2017), 934-958.  doi: 10.1007/s10955-017-1772-4.
    [27] M. Ma, H. Fan and K. M. Kitani, Going deeper into first-person activity recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, 1894–1903. doi: 10.1109/CVPR.2016.209.
    [28] Z. Meng, A. Koniges, Y. H. He, S. Williams, T. Kurth, B. Cook, J. Deslippe and A. L. Bertozzi, OpenMP parallelization and optimization of graph-based machine learning algorithms, in International Workshop on OpenMP, Springer, 2016, 17–31. doi: 10.1007/978-3-319-45550-1_2.
    [29] Z. Meng, E. Merkurjev, A. Koniges and A. L. Bertozzi, Hyperspectral image classification using graph clustering methods, Image Processing On Line, 7 (2017), 218-245. doi: 10.5201/ipol.2017.204.
    [30] Z. Meng, J. Sánchez, J.-M. Morel, A. L. Bertozzi and P. J. Brantingham, Ego-motion classification for body-worn videos, in Imaging, Vision and Learning Based on Optimization and PDEs (eds. X.-C. Tai, E. Bae and M. Lysaker), Springer International Publishing, Cham, 2018, 221–239. doi: 10.1007/978-3-319-91274-5_10.
    [31] E. Merkurjev, C. Garcia-Cardona, A. L. Bertozzi, A. Flenner and A. G. Percus, Diffuse interface methods for multiclass segmentation of high-dimensional data, Applied Mathematics Letters, 33 (2014), 29-34. doi: 10.1016/j.aml.2014.02.008.
    [32] E. Merkurjev, T. Kostic and A. L. Bertozzi, An MBO scheme on graphs for classification and image processing, SIAM Journal on Imaging Sciences, 6 (2013), 1903-1930. doi: 10.1137/120886935.
    [33] E. Merkurjev, J. Sunu and A. L. Bertozzi, Graph MBO method for multiclass segmentation of hyperspectral stand-off detection video, in Image Processing (ICIP), 2014 IEEE International Conference on, IEEE, 2014, 689–693. doi: 10.1109/ICIP.2014.7025138.
    [34] F. Özkan, M. A. Arabaci, E. Surer and A. Temizel, Boosted multiple kernel learning for first-person activity recognition, in Signal Processing Conference (EUSIPCO), 2017 25th European, IEEE, 2017, 1050–1054.
    [35] H. Pirsiavash and D. Ramanan, Detecting activities of daily living in first-person camera views, in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, 2847–2854. doi: 10.1109/CVPR.2012.6248010.
    [36] Y. Poleg, C. Arora and S. Peleg, Temporal segmentation of egocentric videos, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, 2537–2544. doi: 10.1109/CVPR.2014.325.
    [37] Y. Poleg, A. Ephrat, S. Peleg and C. Arora, Compact CNN for indexing egocentric videos, in Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, IEEE, 2016, 1–9.
    [38] L. I. Rudin, S. Osher and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: Nonlinear Phenomena, 60 (1992), 259-268. doi: 10.1016/0167-2789(92)90242-F.
    [39] M. S. Ryoo and L. Matthies, First-person activity recognition: What are they doing to me?, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, 2730–2737.
    [40] M. S. Ryoo, B. Rothrock and L. Matthies, Pooled motion features for first-person videos, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, 896–904. doi: 10.1109/CVPR.2015.7298691.
    [41] S. Singh, C. Arora and C. Jawahar, Trajectory aligned features for first person action recognition, Pattern Recognition, 62 (2017), 45-55. doi: 10.1016/j.patcog.2016.07.031.
    [42] E. H. Spriggs, F. De La Torre and M. Hebert, Temporal segmentation and activity classification from first-person sensing, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, 17–24. doi: 10.1109/CVPRW.2009.5204354.
    [43] D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in Computer Vision (ICCV), 2015 IEEE International Conference on, IEEE, 2015, 4489–4497.
    [44] Y. Van Gennip and A. L. Bertozzi, $\Gamma$-convergence of graph Ginzburg-Landau functionals, Advances in Differential Equations, 17 (2012), 1115-1180.
    [45] Y. Van Gennip, N. Guillen, B. Osting and A. L. Bertozzi, Mean curvature, threshold dynamics, and phase field theory on finite graphs, Milan Journal of Mathematics, 82 (2014), 3-65. doi: 10.1007/s00032-014-0216-8.
    [46] X. Wang, L. Gao, J. Song, X. Zhen, N. Sebe and H. T. Shen, Deep appearance and motion learning for egocentric activity recognition, Neurocomputing, 275 (2018), 438-447. doi: 10.1016/j.neucom.2017.08.063.
    [47] L. Zelnik-Manor and P. Perona, Self-tuning spectral clustering, in Advances in Neural Information Processing Systems, 2005, 1601–1608.
    [48] W. Zhu, V. Chayes, A. Tiard, S. Sanchez, D. Dahlberg, A. L. Bertozzi, S. Osher, D. Zosso and D. Kuang, Unsupervised classification in hyperspectral imagery with nonlocal total variation and primal-dual hybrid gradient algorithm, IEEE Transactions on Geoscience and Remote Sensing, 55 (2017), 2786-2798. doi: 10.1109/TGRS.2017.2654486.