# American Institute of Mathematical Sciences

doi: 10.3934/dcdss.2021102
Online First

## A dictionary learning algorithm for compression and reconstruction of streaming data in preset order

Richard Archibald and Hoang Tran*

Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

* Corresponding author: Hoang Tran (tranha@ornl.gov)

Received: February 2021; Revised: June 2021; Early access: September 2021

Fund Project: This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Over the last decade, there has been growing interest in developing and applying dictionary learning (DL) to process massive datasets. Many of these efforts, however, focus on employing DL to compress data and extract a set of important features from it, while treating restoration of the original data from this set as a secondary goal. Moreover, although several methods can process streaming data by updating the dictionary incrementally as new snapshots pass by, most of those algorithms are designed for the setting where the snapshots are drawn randomly from a probability distribution. In this paper, we present a new DL approach to compress and denoise massive datasets in real time, in which the data are streamed in a preset order (examples include videos and temporal experimental data), so that at any time we can only observe a biased sample of the whole dataset. Our approach builds up the dictionary incrementally in a relatively simple manner: if the new snapshot is adequately explained by the current dictionary, we perform sparse coding to find its sparse representation; otherwise, we add the new snapshot to the dictionary, using a Gram-Schmidt process to maintain orthogonality. To compress and denoise noisy datasets, we apply denoising to the snapshot directly, before sparse coding, which deviates from the traditional dictionary learning approach of achieving denoising via sparse coding. Compared to full-batch matrix decomposition methods, where the whole dataset is kept in memory, and other mini-batch approaches, where unbiased sampling is often assumed, our approach has minimal requirements for data sampling and storage: i) each snapshot is seen only once and then discarded, and ii) the snapshots are drawn in a preset order, so they can be highly biased. Through experiments on climate simulations and scanning transmission electron microscopy (STEM) data, we demonstrate that the proposed approach performs competitively with those methods in data reconstruction and denoising.
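The incremental update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the relative-residual threshold `tol`, the coefficient cap `keep`, and the use of simple coefficient truncation in place of a proper sparse-coding solver are all assumptions made for illustration, and the denoising step is omitted. The sketch uses only the Python standard library; vectors are plain lists.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def stream_dictionary(snapshots, tol=0.1, keep=None):
    """Process each snapshot once, in the order given.

    tol  : hypothetical relative-residual threshold deciding whether a
           snapshot is "adequately explained" by the current dictionary.
    keep : hypothetical cap on nonzero coefficients per snapshot
           (None keeps all of them).
    Returns (atoms, codes): orthonormal atoms and one code per snapshot,
    stored as a dict {atom index: coefficient}.
    """
    atoms, codes = [], []
    for x in snapshots:
        norm_x = math.sqrt(dot(x, x))
        # Gram-Schmidt pass: project out each atom and record the coefficient.
        residual, coeffs = list(x), []
        for d in atoms:
            c = dot(d, residual)
            coeffs.append(c)
            residual = [r - c * di for r, di in zip(residual, d)]
        res_norm = math.sqrt(dot(residual, residual))
        if res_norm > tol * norm_x:
            # Not adequately explained: the normalized residual becomes a new
            # atom, which keeps the dictionary orthonormal.
            atoms.append([r / res_norm for r in residual])
            coeffs.append(res_norm)
        # Assumed stand-in for sparse coding: keep the largest coefficients.
        order = sorted(range(len(coeffs)), key=lambda j: -abs(coeffs[j]))
        if keep is not None:
            order = order[:keep]
        codes.append({j: coeffs[j] for j in order})
    return atoms, codes

def reconstruct(atoms, code, dim):
    """Rebuild a snapshot of length dim from its code."""
    x_hat = [0.0] * dim
    for j, c in code.items():
        x_hat = [xi + c * di for xi, di in zip(x_hat, atoms[j])]
    return x_hat
```

In this sketch, compression comes from storing only the atoms and the per-snapshot codes. Each snapshot is read once and then discarded, and no assumption is made about the order in which snapshots arrive.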

Citation: Richard Archibald, Hoang Tran. A dictionary learning algorithm for compression and reconstruction of streaming data in preset order. Discrete & Continuous Dynamical Systems - S, doi: 10.3934/dcdss.2021102

(Climate dataset) The top $20$ components from a dictionary of $40$ components extracted by our algorithm

(Climate dataset) Original and reconstructed images by our algorithm

(STEM dataset) The top $24$ components from a dictionary of $41$ components extracted by our algorithm

(STEM dataset) Original and reconstructed images by our algorithm

(Noisy climate dataset) Noisy and reconstructed images by our algorithm

(Noisy STEM dataset) Noisy and reconstructed images by our algorithm

(Growth of dictionaries) The growth of the dictionary size differs significantly between our two test cases. For the climate dataset, the complicated development of the data requires us to update the dictionary more frequently in later stages. The STEM data, on the other hand, progress in similar cycles, so the dictionary is established early

A comparison between our algorithm and other matrix factorization solvers in $\mathtt{scikit-learn}$ for dictionary learning of $\textbf{climate dataset}$

| Methods | batch size | RRMSE | PSNR |
|---|---|---|---|
| $\mathtt{MiniBatchSparsePCA}$ | 1 | 0.842 | 12.561 |
| $\mathtt{MiniBatchDictionaryLearning}$ | 1 | 0.338 | 26.506 |
| $\textbf{Our method}$ | 1 | 0.068 | 43.484 |
| $\mathtt{IncrementalPCA}$ | 40 | 0.011 | 51.887 |
| $\mathtt{DictionaryLearning}$ | 2000 | 0.013 | 49.971 |

A comparison between our algorithm and other matrix factorization solvers in $\mathtt{scikit-learn}$ for dictionary learning of $\textbf{noisy climate dataset}$

| Methods | batch size | RRMSE | PSNR |
|---|---|---|---|
| $\mathtt{MiniBatchSparsePCA}$ | 1 | 0.857 | 12.418 |
| $\mathtt{MiniBatchDictionaryLearning}$ | 1 | 0.364 | 26.455 |
| $\textbf{Our method}$ | 1 | 0.134 | 28.895 |
| $\mathtt{IncrementalPCA}$ | 61 | 0.183 | 25.996 |
| $\mathtt{DictionaryLearning}$ | 2000 | 0.179 | 26.236 |

A comparison between our algorithm and other matrix factorization solvers in $\mathtt{scikit-learn}$ for dictionary learning of $\textbf{STEM dataset}$

| Methods | batch size | RRMSE | PSNR |
|---|---|---|---|
| $\mathtt{MiniBatchSparsePCA}$ | 1 | 0.647 | 16.546 |
| $\mathtt{MiniBatchDictionaryLearning}$ | 1 | 0.449 | 20.139 |
| $\textbf{Our method}$ | 1 | 0.0814 | 36.949 |
| $\mathtt{IncrementalPCA}$ | 41 | 0.011 | 52.378 |
| $\mathtt{DictionaryLearning}$ | 11616 | 0.132 | 30.878 |

A comparison between our algorithm and other matrix factorization solvers in $\mathtt{scikit-learn}$ for dictionary learning of $\textbf{noisy STEM dataset}$

| Methods | batch size | RRMSE | PSNR |
|---|---|---|---|
| $\mathtt{MiniBatchSparsePCA}$ | 1 | 0.594 | 17.280 |
| $\mathtt{MiniBatchDictionaryLearning}$ | 1 | 0.643 | 16.927 |
| $\textbf{Our method}$ | 1 | 0.211 | 26.327 |
| $\mathtt{IncrementalPCA}$ | 39 | 0.086 | 34.156 |
| $\mathtt{DictionaryLearning}$ | 11616 | 0.152 | 29.423 |
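
The comparisons above report RRMSE and PSNR, but this page does not state how either metric is computed. The sketch below uses commonly seen definitions and should be read as an assumption rather than as the formulas used in the paper: RRMSE is taken as the root-sum-square error divided by the root-sum-square of the original data, and PSNR as $10\log_{10}(\text{peak}^2/\text{MSE})$, with `peak` defaulting to the range of the original data. The function names and the `peak` parameter are hypothetical.

```python
import math

def rrmse(original, reconstructed):
    """Relative root-mean-square error (assumed definition):
    ||x - x_hat||_2 / ||x||_2 over flattened data."""
    num = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    den = sum(a * a for a in original)
    return math.sqrt(num / den)

def psnr(original, reconstructed, peak=None):
    """Peak signal-to-noise ratio in dB (assumed definition):
    10 * log10(peak^2 / MSE). If peak is None, the range of the
    original data is used as the peak value."""
    n = len(original)
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / n
    if peak is None:
        peak = max(original) - min(original)
    if mse == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(peak ** 2 / mse)
```

Under these definitions, a lower RRMSE and a higher PSNR both indicate a reconstruction closer to the original, which matches how the two columns move together in the tables.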
