• Previous Article
    Compensation plan, pricing and production decisions with inventory-dependent salvage value, and asymmetric risk-averse sales agent
  • JIMO Home
  • This Issue
  • Next Article
    Optimal liability ratio and dividend payment strategies under catastrophic risk
October 2018, 14(4): 1423-1442. doi: 10.3934/jimo.2018014

Performance optimization of parallel-distributed processing with checkpointing for cloud environment

1. 

Graduate School of Informatics, Kyoto University, Yoshida-Hommachi, Sakyo-ku, Kyoto 606-8501, Japan

2. 

Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan

Received  December 2016 Revised  August 2017 Published  January 2018

In cloud computing, the most successful application framework is parallel-distributed processing, in which an enormous task is split into a number of subtasks and those are processed independently on a cluster of machines referred to as workers. Due to its huge system scale, worker failures occur frequently in cloud environment and failed subtasks cause a large processing delay of the task. One of schemes to alleviate the impact of failures is checkpointing method, with which the progress of a subtask is recorded as checkpoint and the failed subtask is resumed by other worker from the latest checkpoint. This method can reduce the processing delay of the task. However, frequent checkpointing is system overhead and hence the checkpoint interval must be set properly. In this paper, we consider the optimal number of checkpoints which minimizes the task-processing time. We construct a stochastic model of parallel-distributed processing with checkpointing and approximately derive explicit expressions for the mean task-processing time and the optimal number of checkpoints. Numerical experiments reveal that the proposed approximations are sufficiently accurate on typical environment of cloud computing. Furthermore, the derived optimal number of checkpoints outperforms the result of previous study for minimizing the task-processing time on parallel-distributed processing.

Citation: Tsuguhito Hirai, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi. Performance optimization of parallel-distributed processing with checkpointing for cloud environment. Journal of Industrial & Management Optimization, 2018, 14 (4) : 1423-1442. doi: 10.3934/jimo.2018014
References:
[1]

L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, California, 2009. doi: 10.2200/S00193ED1V01Y200905CAC006.

[2]

T. C. Bressoud and M. A. Kozuch, Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation in Proc. the IEEE International Conference on Cluster Computing and Workshops (CLUSTER 2009), (2009). doi: 10.1109/CLUSTR.2009.5289185.

[3]

C. L. P. Chen and C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014), 314-347. doi: 10.1016/j.ins.2014.01.015.

[4]

T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, MapReduce online, in Proc. the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), (2010).

[5]

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, 22 (2006), 303-312. doi: 10.1016/j.future.2004.11.016.

[6]

J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008), 107-113. doi: 10.1145/1327452.1327492.

[7]

J. Dean, Designs, lessons and advice from building large distributed systems, in Keynote Presentation of the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 2009), (2009).

[8]

J. Dean and S. Ghemawat, MapReduce: A flexible data processing tool, Communications of the ACM, 53 (2010), 72-77. doi: 10.1145/1629175.1629198.

[9]

S. Di, Y. Robert, F. Vivien, D. Kondo, C. -L. Wang and F. Cappello, Optimization of cloud task processing with checkpoint-restart mechanism in Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 13), (2013). doi: 10.1145/2503210.2503217.

[10]

L. FialhoD. Rexachs and E. Luque, What is missing in current checkpoint interval models?, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 322-332. doi: 10.1109/ICDCS.2011.12.

[11]

B. JavadiD. KondoA. Iosup and D. Epema, The failure trace archive: Enabling the comparison of failure measurements and models of distributed systems, Journal of Parallel and Distributed Computing, 73 (2013), 1208-1223. doi: 10.1016/j.jpdc.2013.04.002.

[12]

H. JinY. ChenH. Zhu and X.-H. Sun, Optimizing HPC fault-tolerant environment: An analytical approach, Proc. the 39th International Conference on Parallel Processing (ICPP 2010), (2010), 525-534. doi: 10.1109/ICPP.2010.80.

[13]

A. MartinT. KnauthS. CreutzD. BeckerS. WeigertC. Fetzer and A. Brito, Low-overhead fault tolerance for high-throughput data processing systems, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 689-699. doi: 10.1109/ICDCS.2011.29.

[14]

P. Mell and T. Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-145,2011. doi: 10.6028/NIST.SP.800-145.

[15]

M. TaifiJ. Y. Shi and A. Khreishah, SpotMPI: A framework for auction-based HPC computing using Amazon spot instances, Proc. the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), (2011), 109-120. doi: 10.1007/978-3-642-24669-2_11.

[16]

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, 17 (1974), 530-531. doi: 10.1145/361147.361115.

[17]

M. ZahariaT. DasH. LiT. HunterS. Shenker and I. Stoica, Discretized streams: Fault-tolerant streaming computation at scale, Proc. the 24th ACM Symposium on Operating Systems Principles (SOSP 2013), (2013), 423-438. doi: 10.1145/2517349.2522737.

show all references

References:
[1]

L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, California, 2009. doi: 10.2200/S00193ED1V01Y200905CAC006.

[2]

T. C. Bressoud and M. A. Kozuch, Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation in Proc. the IEEE International Conference on Cluster Computing and Workshops (CLUSTER 2009), (2009). doi: 10.1109/CLUSTR.2009.5289185.

[3]

C. L. P. Chen and C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014), 314-347. doi: 10.1016/j.ins.2014.01.015.

[4]

T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, MapReduce online, in Proc. the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), (2010).

[5]

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, 22 (2006), 303-312. doi: 10.1016/j.future.2004.11.016.

[6]

J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008), 107-113. doi: 10.1145/1327452.1327492.

[7]

J. Dean, Designs, lessons and advice from building large distributed systems, in Keynote Presentation of the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 2009), (2009).

[8]

J. Dean and S. Ghemawat, MapReduce: A flexible data processing tool, Communications of the ACM, 53 (2010), 72-77. doi: 10.1145/1629175.1629198.

[9]

S. Di, Y. Robert, F. Vivien, D. Kondo, C. -L. Wang and F. Cappello, Optimization of cloud task processing with checkpoint-restart mechanism in Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 13), (2013). doi: 10.1145/2503210.2503217.

[10]

L. FialhoD. Rexachs and E. Luque, What is missing in current checkpoint interval models?, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 322-332. doi: 10.1109/ICDCS.2011.12.

[11]

B. JavadiD. KondoA. Iosup and D. Epema, The failure trace archive: Enabling the comparison of failure measurements and models of distributed systems, Journal of Parallel and Distributed Computing, 73 (2013), 1208-1223. doi: 10.1016/j.jpdc.2013.04.002.

[12]

H. JinY. ChenH. Zhu and X.-H. Sun, Optimizing HPC fault-tolerant environment: An analytical approach, Proc. the 39th International Conference on Parallel Processing (ICPP 2010), (2010), 525-534. doi: 10.1109/ICPP.2010.80.

[13]

A. MartinT. KnauthS. CreutzD. BeckerS. WeigertC. Fetzer and A. Brito, Low-overhead fault tolerance for high-throughput data processing systems, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 689-699. doi: 10.1109/ICDCS.2011.29.

[14]

P. Mell and T. Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-145,2011. doi: 10.6028/NIST.SP.800-145.

[15]

M. TaifiJ. Y. Shi and A. Khreishah, SpotMPI: A framework for auction-based HPC computing using Amazon spot instances, Proc. the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), (2011), 109-120. doi: 10.1007/978-3-642-24669-2_11.

[16]

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, 17 (1974), 530-531. doi: 10.1145/361147.361115.

[17]

M. ZahariaT. DasH. LiT. HunterS. Shenker and I. Stoica, Discretized streams: Fault-tolerant streaming computation at scale, Proc. the 24th ACM Symposium on Operating Systems Principles (SOSP 2013), (2013), 423-438. doi: 10.1145/2517349.2522737.

Figure 1.  Processing of a subtask with checkpointing method
Figure 2.  Mean task-processing time with respect to the number of checkpoints for various $M$ ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Figure 3.  Mean task-processing time with respect to the number of checkpoints for various $b$ ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Figure 4.  Mean task-processing time with respect to the number of checkpoints for various $c$ ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Figure 5.  Mean task-processing time with respect to the number of checkpoints for various $f$ ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Figure 6.  Mean task-processing time with respect to the number of checkpoints for various $r$ ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $c = 300$ [sec]): Comparison between the results of analysis and simulation
Figure 7.  Mean task-processing time with respect to $M$ for the optimal number of checkpoints ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation
Figure 8.  Mean task-processing time with respect to $b$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation
Figure 9.  Mean task-processing time with respect to $c$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation
Figure 10.  Mean task-processing time with respect to $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation
Figure 11.  Mean task-processing time with respect to $r$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $b = 24$ [hour], $f = 30$ [day]): Comparison between the results of previous and proposal analyses and simulation
Figure 12.  Mean task-processing time with respect to $M$ for the optimal number of checkpoints ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Figure 13.  Mean task-processing time with respect to $b$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Figure 14.  Mean task-processing time with respect to $c$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Figure 15.  Mean task-processing time with respect to $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Figure 16.  Mean task-processing time with respect to $r$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $b = 24$ [hour], $f = 30$ [day]): Comparison among three distributions for the time intervals between consecutive worker failures
Figure 17.  Mean task-processing time with respect to small $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $f = 1$ to $7$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Table 1.  Parameter set.
ParameterDescriptionValue
$M$Number of workers$10$ to $1,000$
$b$Subtask-processing time$6$ to $120$ [hour]
$c$Time to make a checkpoint$30$ to $3,000$ [sec]
$f$Mean time between worker failures$7$ to $180$ [day]
$r$Time to resume a failed subtask$30$ to $3,000$ [sec]
$K$Number of checkpoints$0$ to $30$
ParameterDescriptionValue
$M$Number of workers$10$ to $1,000$
$b$Subtask-processing time$6$ to $120$ [hour]
$c$Time to make a checkpoint$30$ to $3,000$ [sec]
$f$Mean time between worker failures$7$ to $180$ [day]
$r$Time to resume a failed subtask$30$ to $3,000$ [sec]
$K$Number of checkpoints$0$ to $30$
[1]

Tsuguhito Hirai, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi. Performance analysis of large-scale parallel-distributed processing with backup tasks for cloud computing. Journal of Industrial & Management Optimization, 2014, 10 (1) : 113-129. doi: 10.3934/jimo.2014.10.113

[2]

Kyosuke Hashimoto, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi. Performance analysis of backup-task scheduling with deadline time in cloud computing. Journal of Industrial & Management Optimization, 2015, 11 (3) : 867-886. doi: 10.3934/jimo.2015.11.867

[3]

Bin Zheng, Min Fan, Mengqi Liu, Shang-Chia Liu, Yunqiang Yin. Parallel-machine scheduling with potential disruption and positional-dependent processing times. Journal of Industrial & Management Optimization, 2017, 13 (2) : 697-711. doi: 10.3934/jimo.2016041

[4]

Jiping Tao, Zhijun Chao, Yugeng Xi. A semi-online algorithm and its competitive analysis for a single machine scheduling problem with bounded processing times. Journal of Industrial & Management Optimization, 2010, 6 (2) : 269-282. doi: 10.3934/jimo.2010.6.269

[5]

Weidong Bao, Haoran Ji, Xiaomin Zhu, Ji Wang, Wenhua Xiao, Jianhong Wu. ACO-based solution for computation offloading in mobile cloud computing. Big Data & Information Analytics, 2016, 1 (1) : 1-13. doi: 10.3934/bdia.2016.1.1

[6]

Leiyang Wang, Zhaohui Liu. Heuristics for parallel machine scheduling with batch delivery consideration. Journal of Industrial & Management Optimization, 2014, 10 (1) : 259-273. doi: 10.3934/jimo.2014.10.259

[7]

Hongtruong Pham, Xiwen Lu. The inverse parallel machine scheduling problem with minimum total completion time. Journal of Industrial & Management Optimization, 2014, 10 (2) : 613-620. doi: 10.3934/jimo.2014.10.613

[8]

Arminda Moreno-Díaz, Gabriel de Blasio, Moreno-Díaz Jr.. Distributed, layered and reliable computing nets to represent neuronal receptive fields. Mathematical Biosciences & Engineering, 2014, 11 (2) : 343-361. doi: 10.3934/mbe.2014.11.343

[9]

Sze-Bi Hsu, Christopher A. Klausmeier, Chiu-Ju Lin. Analysis of a model of two parallel food chains. Discrete & Continuous Dynamical Systems - B, 2009, 12 (2) : 337-359. doi: 10.3934/dcdsb.2009.12.337

[10]

Louis Caccetta, Syarifah Z. Nordin. Mixed integer programming model for scheduling in unrelated parallel processor system with priority consideration. Numerical Algebra, Control & Optimization, 2014, 4 (2) : 115-132. doi: 10.3934/naco.2014.4.115

[11]

Chengxin Luo. Single machine batch scheduling problem to minimize makespan with controllable setup and jobs processing times. Numerical Algebra, Control & Optimization, 2015, 5 (1) : 71-77. doi: 10.3934/naco.2015.5.71

[12]

Le Thi Hoai An, Tran Duc Quynh, Kondo Hloindo Adjallah. A difference of convex functions algorithm for optimal scheduling and real-time assignment of preventive maintenance jobs on parallel processors. Journal of Industrial & Management Optimization, 2014, 10 (1) : 243-258. doi: 10.3934/jimo.2014.10.243

[13]

Chuanli Zhao, Yunqiang Yin, T. C. E. Cheng, Chin-Chia Wu. Single-machine scheduling and due date assignment with rejection and position-dependent processing times. Journal of Industrial & Management Optimization, 2014, 10 (3) : 691-700. doi: 10.3934/jimo.2014.10.691

[14]

Ran Ma, Jiping Tao. An improved 2.11-competitive algorithm for online scheduling on parallel machines to minimize total weighted completion time. Journal of Industrial & Management Optimization, 2018, 14 (2) : 497-510. doi: 10.3934/jimo.2017057

[15]

Xianyu Yu, Dar-Li Yang, Dequn Zhou, Peng Zhou. Multi-machine scheduling with interval constrained position-dependent processing times. Journal of Industrial & Management Optimization, 2018, 14 (2) : 803-815. doi: 10.3934/jimo.2017076

[16]

Ji-Bo Wang, Mengqi Liu, Na Yin, Ping Ji. Scheduling jobs with controllable processing time, truncated job-dependent learning and deterioration effects. Journal of Industrial & Management Optimization, 2017, 13 (2) : 1025-1039. doi: 10.3934/jimo.2016060

[17]

Zhichao Geng, Jinjiang Yuan. Scheduling family jobs on an unbounded parallel-batch machine to minimize makespan and maximum flow time. Journal of Industrial & Management Optimization, 2018, 14 (4) : 1479-1500. doi: 10.3934/jimo.2018017

[18]

Xingong Zhang. Single machine and flowshop scheduling problems with sum-of-processing time based learning phenomenon. Journal of Industrial & Management Optimization, 2017, 13 (5) : 1-14. doi: 10.3934/jimo.2018148

[19]

Yazheng Dang, Fanwen Meng, Jie Sun. Convergence analysis of a parallel projection algorithm for solving convex feasibility problems. Numerical Algebra, Control & Optimization, 2016, 6 (4) : 505-519. doi: 10.3934/naco.2016023

[20]

Andrey Yu. Verisokin, Darya V. Verveyko, Eugene B. Postnikov, Anastasia I. Lavrova. Wavelet analysis of phase clusters in a distributed biochemical system. Conference Publications, 2011, 2011 (Special) : 1404-1412. doi: 10.3934/proc.2011.2011.1404

2017 Impact Factor: 0.994

Metrics

  • PDF downloads (60)
  • HTML views (551)
  • Cited by (0)

[Back to Top]