Article Contents
Article Contents

# Performance optimization of parallel-distributed processing with checkpointing for cloud environment

Mathematics Subject Classification: Primary: 68M20, 90B22; Secondary: 60K30.

 Citation:

• Figure 1.  Processing of a subtask with checkpointing method

Figure 2.  Mean task-processing time with respect to the number of checkpoints for various $M$ ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation

Figure 3.  Mean task-processing time with respect to the number of checkpoints for various $b$ ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation

Figure 4.  Mean task-processing time with respect to the number of checkpoints for various $c$ ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation

Figure 5.  Mean task-processing time with respect to the number of checkpoints for various $f$ ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison between the results of analysis and simulation

Figure 6.  Mean task-processing time with respect to the number of checkpoints for various $r$ ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $c = 300$ [sec]): Comparison between the results of analysis and simulation

Figure 7.  Mean task-processing time with respect to $M$ for the optimal number of checkpoints ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation

Figure 8.  Mean task-processing time with respect to $b$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation

Figure 9.  Mean task-processing time with respect to $c$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation

Figure 10.  Mean task-processing time with respect to $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation

Figure 11.  Mean task-processing time with respect to $r$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $b = 24$ [hour], $f = 30$ [day]): Comparison between the results of previous and proposal analyses and simulation

Figure 12.  Mean task-processing time with respect to $M$ for the optimal number of checkpoints ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures

Figure 13.  Mean task-processing time with respect to $b$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures

Figure 14.  Mean task-processing time with respect to $c$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures

Figure 15.  Mean task-processing time with respect to $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures

Figure 16.  Mean task-processing time with respect to $r$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $b = 24$ [hour], $f = 30$ [day]): Comparison among three distributions for the time intervals between consecutive worker failures

Figure 17.  Mean task-processing time with respect to small $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $f = 1$ to $7$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures

Table 1.  Parameter set.

 Parameter Description Value $M$ Number of workers $10$ to $1,000$ $b$ Subtask-processing time $6$ to $120$ [hour] $c$ Time to make a checkpoint $30$ to $3,000$ [sec] $f$ Mean time between worker failures $7$ to $180$ [day] $r$ Time to resume a failed subtask $30$ to $3,000$ [sec] $K$ Number of checkpoints $0$ to $30$
•  L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, California, 2009. doi: 10.2200/S00193ED1V01Y200905CAC006. T. C. Bressoud and M. A. Kozuch, Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation in Proc. the IEEE International Conference on Cluster Computing and Workshops (CLUSTER 2009), (2009). doi: 10.1109/CLUSTR.2009.5289185. C. L. P. Chen  and  C.-Y. Zhang , Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014) , 314-347.  doi: 10.1016/j.ins.2014.01.015. T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, MapReduce online, in Proc. the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), (2010). J. T. Daly , A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, 22 (2006) , 303-312.  doi: 10.1016/j.future.2004.11.016. J. Dean  and  S. Ghemawat , MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008) , 107-113.  doi: 10.1145/1327452.1327492. J. Dean, Designs, lessons and advice from building large distributed systems, in Keynote Presentation of the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 2009), (2009). J. Dean  and  S. Ghemawat , MapReduce: A flexible data processing tool, Communications of the ACM, 53 (2010) , 72-77.  doi: 10.1145/1629175.1629198. S. Di, Y. Robert, F. Vivien, D. Kondo, C. -L. Wang and F. Cappello, Optimization of cloud task processing with checkpoint-restart mechanism in Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 13), (2013). doi: 10.1145/2503210.2503217. L. Fialho , D. Rexachs  and  E. Luque , What is missing in current checkpoint interval models?, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011) , 322-332.  doi: 10.1109/ICDCS.2011.12. B. Javadi , D. Kondo , A. Iosup  and  D. Epema , The failure trace archive: Enabling the comparison of failure measurements and models of distributed systems, Journal of Parallel and Distributed Computing, 73 (2013) , 1208-1223.  doi: 10.1016/j.jpdc.2013.04.002. H. Jin , Y. Chen , H. Zhu  and  X.-H. Sun , Optimizing HPC fault-tolerant environment: An analytical approach, Proc. the 39th International Conference on Parallel Processing (ICPP 2010), (2010) , 525-534.  doi: 10.1109/ICPP.2010.80. A. Martin , T. Knauth , S. Creutz , D. Becker , S. Weigert , C. Fetzer  and  A. Brito , Low-overhead fault tolerance for high-throughput data processing systems, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011) , 689-699.  doi: 10.1109/ICDCS.2011.29. P. Mell and T. Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-145,2011. doi: 10.6028/NIST.SP.800-145. M. Taifi , J. Y. Shi  and  A. Khreishah , SpotMPI: A framework for auction-based HPC computing using Amazon spot instances, Proc. the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), (2011) , 109-120.  doi: 10.1007/978-3-642-24669-2_11. J. W. Young , A first order approximation to the optimum checkpoint interval, Communications of the ACM, 17 (1974) , 530-531.  doi: 10.1145/361147.361115. M. Zaharia , T. Das , H. Li , T. Hunter , S. Shenker  and  I. Stoica , Discretized streams: Fault-tolerant streaming computation at scale, Proc. the 24th ACM Symposium on Operating Systems Principles (SOSP 2013), (2013) , 423-438.  doi: 10.1145/2517349.2522737.

Figures(17)

Tables(1)