
Previous Article
Optimal liability ratio and dividend payment strategies under catastrophic risk
 JIMO Home
 This Issue

Next Article
Compensation plan, pricing and production decisions with inventorydependent salvage value, and asymmetric riskaverse sales agent
Performance optimization of paralleldistributed processing with checkpointing for cloud environment
1.  Graduate School of Informatics, Kyoto University, YoshidaHommachi, Sakyoku, Kyoto 6068501, Japan 
2.  Graduate School of Information Science, Nara Institute of Science and Technology, 89165 Takayama, Ikoma, Nara 6300192, Japan 
In cloud computing, the most successful application framework is paralleldistributed processing, in which an enormous task is split into a number of subtasks and those are processed independently on a cluster of machines referred to as workers. Due to its huge system scale, worker failures occur frequently in cloud environment and failed subtasks cause a large processing delay of the task. One of schemes to alleviate the impact of failures is checkpointing method, with which the progress of a subtask is recorded as checkpoint and the failed subtask is resumed by other worker from the latest checkpoint. This method can reduce the processing delay of the task. However, frequent checkpointing is system overhead and hence the checkpoint interval must be set properly. In this paper, we consider the optimal number of checkpoints which minimizes the taskprocessing time. We construct a stochastic model of paralleldistributed processing with checkpointing and approximately derive explicit expressions for the mean taskprocessing time and the optimal number of checkpoints. Numerical experiments reveal that the proposed approximations are sufficiently accurate on typical environment of cloud computing. Furthermore, the derived optimal number of checkpoints outperforms the result of previous study for minimizing the taskprocessing time on paralleldistributed processing.
References:
[1] 
L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of WarehouseScale Machines, Morgan & Claypool Publishers, California, 2009. doi: 10.2200/S00193ED1V01Y200905CAC006. Google Scholar 
[2] 
T. C. Bressoud and M. A. Kozuch, Cluster faulttolerance: An experimental evaluation of checkpointing and MapReduce through simulation in Proc. the IEEE International Conference on Cluster Computing and Workshops (CLUSTER 2009), (2009). doi: 10.1109/CLUSTR.2009.5289185. Google Scholar 
[3] 
C. L. P. Chen and C.Y. Zhang, Dataintensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014), 314347. doi: 10.1016/j.ins.2014.01.015. Google Scholar 
[4] 
T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, MapReduce online, in Proc. the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), (2010). Google Scholar 
[5] 
J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, 22 (2006), 303312. doi: 10.1016/j.future.2004.11.016. Google Scholar 
[6] 
J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008), 107113. doi: 10.1145/1327452.1327492. Google Scholar 
[7] 
J. Dean, Designs, lessons and advice from building large distributed systems, in Keynote Presentation of the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 2009), (2009). Google Scholar 
[8] 
J. Dean and S. Ghemawat, MapReduce: A flexible data processing tool, Communications of the ACM, 53 (2010), 7277. doi: 10.1145/1629175.1629198. Google Scholar 
[9] 
S. Di, Y. Robert, F. Vivien, D. Kondo, C. L. Wang and F. Cappello, Optimization of cloud task processing with checkpointrestart mechanism in Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 13), (2013). doi: 10.1145/2503210.2503217. Google Scholar 
[10] 
L. Fialho, D. Rexachs and E. Luque, What is missing in current checkpoint interval models?, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 322332. doi: 10.1109/ICDCS.2011.12. Google Scholar 
[11] 
B. Javadi, D. Kondo, A. Iosup and D. Epema, The failure trace archive: Enabling the comparison of failure measurements and models of distributed systems, Journal of Parallel and Distributed Computing, 73 (2013), 12081223. doi: 10.1016/j.jpdc.2013.04.002. Google Scholar 
[12] 
H. Jin, Y. Chen, H. Zhu and X.H. Sun, Optimizing HPC faulttolerant environment: An analytical approach, Proc. the 39th International Conference on Parallel Processing (ICPP 2010), (2010), 525534. doi: 10.1109/ICPP.2010.80. Google Scholar 
[13] 
A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert, C. Fetzer and A. Brito, Lowoverhead fault tolerance for highthroughput data processing systems, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 689699. doi: 10.1109/ICDCS.2011.29. Google Scholar 
[14] 
P. Mell and T. Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800145,2011. doi: 10.6028/NIST.SP.800145. Google Scholar 
[15] 
M. Taifi, J. Y. Shi and A. Khreishah, SpotMPI: A framework for auctionbased HPC computing using Amazon spot instances, Proc. the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), (2011), 109120. doi: 10.1007/9783642246692_11. Google Scholar 
[16] 
J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, 17 (1974), 530531. doi: 10.1145/361147.361115. Google Scholar 
[17] 
M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker and I. Stoica, Discretized streams: Faulttolerant streaming computation at scale, Proc. the 24th ACM Symposium on Operating Systems Principles (SOSP 2013), (2013), 423438. doi: 10.1145/2517349.2522737. Google Scholar 
show all references
References:
[1] 
L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of WarehouseScale Machines, Morgan & Claypool Publishers, California, 2009. doi: 10.2200/S00193ED1V01Y200905CAC006. Google Scholar 
[2] 
T. C. Bressoud and M. A. Kozuch, Cluster faulttolerance: An experimental evaluation of checkpointing and MapReduce through simulation in Proc. the IEEE International Conference on Cluster Computing and Workshops (CLUSTER 2009), (2009). doi: 10.1109/CLUSTR.2009.5289185. Google Scholar 
[3] 
C. L. P. Chen and C.Y. Zhang, Dataintensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014), 314347. doi: 10.1016/j.ins.2014.01.015. Google Scholar 
[4] 
T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, MapReduce online, in Proc. the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), (2010). Google Scholar 
[5] 
J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, 22 (2006), 303312. doi: 10.1016/j.future.2004.11.016. Google Scholar 
[6] 
J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008), 107113. doi: 10.1145/1327452.1327492. Google Scholar 
[7] 
J. Dean, Designs, lessons and advice from building large distributed systems, in Keynote Presentation of the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 2009), (2009). Google Scholar 
[8] 
J. Dean and S. Ghemawat, MapReduce: A flexible data processing tool, Communications of the ACM, 53 (2010), 7277. doi: 10.1145/1629175.1629198. Google Scholar 
[9] 
S. Di, Y. Robert, F. Vivien, D. Kondo, C. L. Wang and F. Cappello, Optimization of cloud task processing with checkpointrestart mechanism in Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 13), (2013). doi: 10.1145/2503210.2503217. Google Scholar 
[10] 
L. Fialho, D. Rexachs and E. Luque, What is missing in current checkpoint interval models?, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 322332. doi: 10.1109/ICDCS.2011.12. Google Scholar 
[11] 
B. Javadi, D. Kondo, A. Iosup and D. Epema, The failure trace archive: Enabling the comparison of failure measurements and models of distributed systems, Journal of Parallel and Distributed Computing, 73 (2013), 12081223. doi: 10.1016/j.jpdc.2013.04.002. Google Scholar 
[12] 
H. Jin, Y. Chen, H. Zhu and X.H. Sun, Optimizing HPC faulttolerant environment: An analytical approach, Proc. the 39th International Conference on Parallel Processing (ICPP 2010), (2010), 525534. doi: 10.1109/ICPP.2010.80. Google Scholar 
[13] 
A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert, C. Fetzer and A. Brito, Lowoverhead fault tolerance for highthroughput data processing systems, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 689699. doi: 10.1109/ICDCS.2011.29. Google Scholar 
[14] 
P. Mell and T. Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800145,2011. doi: 10.6028/NIST.SP.800145. Google Scholar 
[15] 
M. Taifi, J. Y. Shi and A. Khreishah, SpotMPI: A framework for auctionbased HPC computing using Amazon spot instances, Proc. the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), (2011), 109120. doi: 10.1007/9783642246692_11. Google Scholar 
[16] 
J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, 17 (1974), 530531. doi: 10.1145/361147.361115. Google Scholar 
[17] 
M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker and I. Stoica, Discretized streams: Faulttolerant streaming computation at scale, Proc. the 24th ACM Symposium on Operating Systems Principles (SOSP 2013), (2013), 423438. doi: 10.1145/2517349.2522737. Google Scholar 
Parameter  Description  Value 
Number of workers  
Subtaskprocessing time  
Time to make a checkpoint  
Mean time between worker failures  
Time to resume a failed subtask  
Number of checkpoints 
Parameter  Description  Value 
Number of workers  
Subtaskprocessing time  
Time to make a checkpoint  
Mean time between worker failures  
Time to resume a failed subtask  
Number of checkpoints 
[1] 
Tsuguhito Hirai, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi. Performance analysis of largescale paralleldistributed processing with backup tasks for cloud computing. Journal of Industrial & Management Optimization, 2014, 10 (1) : 113129. doi: 10.3934/jimo.2014.10.113 
[2] 
Kyosuke Hashimoto, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi. Performance analysis of backuptask scheduling with deadline time in cloud computing. Journal of Industrial & Management Optimization, 2015, 11 (3) : 867886. doi: 10.3934/jimo.2015.11.867 
[3] 
Bin Zheng, Min Fan, Mengqi Liu, ShangChia Liu, Yunqiang Yin. Parallelmachine scheduling with potential disruption and positionaldependent processing times. Journal of Industrial & Management Optimization, 2017, 13 (2) : 697711. doi: 10.3934/jimo.2016041 
[4] 
Jiping Tao, Zhijun Chao, Yugeng Xi. A semionline algorithm and its competitive analysis for a single machine scheduling problem with bounded processing times. Journal of Industrial & Management Optimization, 2010, 6 (2) : 269282. doi: 10.3934/jimo.2010.6.269 
[5] 
Leiyang Wang, Zhaohui Liu. Heuristics for parallel machine scheduling with batch delivery consideration. Journal of Industrial & Management Optimization, 2014, 10 (1) : 259273. doi: 10.3934/jimo.2014.10.259 
[6] 
Weidong Bao, Haoran Ji, Xiaomin Zhu, Ji Wang, Wenhua Xiao, Jianhong Wu. ACObased solution for computation offloading in mobile cloud computing. Big Data & Information Analytics, 2016, 1 (1) : 113. doi: 10.3934/bdia.2016.1.1 
[7] 
Jinsong Xu. Reversible hidden data access algorithm in cloud computing environment. Discrete & Continuous Dynamical Systems  S, 2019, 12 (4&5) : 12191232. doi: 10.3934/dcdss.2019084 
[8] 
Serap Ergün, Bariş Bülent Kırlar, Sırma Zeynep Alparslan Gök, GerhardWilhelm Weber. An application of crypto cloud computing in social networks by cooperative game theory. Journal of Industrial & Management Optimization, 2020, 16 (4) : 19271941. doi: 10.3934/jimo.2019036 
[9] 
Shunfu Jin, Haixing Wu, Wuyi Yue, Yutaka Takahashi. Performance evaluation and Nash equilibrium of a cloud architecture with a sleeping mechanism and an enrollment service. Journal of Industrial & Management Optimization, 2019 doi: 10.3934/jimo.2019060 
[10] 
Hongtruong Pham, Xiwen Lu. The inverse parallel machine scheduling problem with minimum total completion time. Journal of Industrial & Management Optimization, 2014, 10 (2) : 613620. doi: 10.3934/jimo.2014.10.613 
[11] 
MinFan He, LiNing Xing, Wen Li, Shang Xiang, Xu Tan. Double layer programming model to the scheduling of remote sensing data processing tasks. Discrete & Continuous Dynamical Systems  S, 2019, 12 (4&5) : 15151526. doi: 10.3934/dcdss.2019104 
[12] 
Min Zhang, Gang Li. Multiobjective optimization algorithm based on improved particle swarm in cloud computing environment. Discrete & Continuous Dynamical Systems  S, 2019, 12 (4&5) : 14131426. doi: 10.3934/dcdss.2019097 
[13] 
SzeBi Hsu, Christopher A. Klausmeier, ChiuJu Lin. Analysis of a model of two parallel food chains. Discrete & Continuous Dynamical Systems  B, 2009, 12 (2) : 337359. doi: 10.3934/dcdsb.2009.12.337 
[14] 
Arminda MorenoDíaz, Gabriel de Blasio, MorenoDíaz Jr.. Distributed, layered and reliable computing nets to represent neuronal receptive fields. Mathematical Biosciences & Engineering, 2014, 11 (2) : 343361. doi: 10.3934/mbe.2014.11.343 
[15] 
Yining Cao, Chuck Jia, Roger Temam, Joseph Tribbia. Mathematical analysis of a cloud resolving model including the ice microphysics. Discrete & Continuous Dynamical Systems  A, 2020 doi: 10.3934/dcds.2020219 
[16] 
Louis Caccetta, Syarifah Z. Nordin. Mixed integer programming model for scheduling in unrelated parallel processor system with priority consideration. Numerical Algebra, Control & Optimization, 2014, 4 (2) : 115132. doi: 10.3934/naco.2014.4.115 
[17] 
Chengxin Luo. Single machine batch scheduling problem to minimize makespan with controllable setup and jobs processing times. Numerical Algebra, Control & Optimization, 2015, 5 (1) : 7177. doi: 10.3934/naco.2015.5.71 
[18] 
Le Thi Hoai An, Tran Duc Quynh, Kondo Hloindo Adjallah. A difference of convex functions algorithm for optimal scheduling and realtime assignment of preventive maintenance jobs on parallel processors. Journal of Industrial & Management Optimization, 2014, 10 (1) : 243258. doi: 10.3934/jimo.2014.10.243 
[19] 
Chuanli Zhao, Yunqiang Yin, T. C. E. Cheng, ChinChia Wu. Singlemachine scheduling and due date assignment with rejection and positiondependent processing times. Journal of Industrial & Management Optimization, 2014, 10 (3) : 691700. doi: 10.3934/jimo.2014.10.691 
[20] 
Ran Ma, Jiping Tao. An improved 2.11competitive algorithm for online scheduling on parallel machines to minimize total weighted completion time. Journal of Industrial & Management Optimization, 2018, 14 (2) : 497510. doi: 10.3934/jimo.2017057 
2019 Impact Factor: 1.366
Tools
Metrics
Other articles
by authors
[Back to Top]