PERFORMANCE OPTIMIZATION OF PARALLEL-DISTRIBUTED PROCESSING WITH CHECKPOINTING FOR CLOUD ENVIRONMENT


(Communicated by Wuyi Yue)
Abstract. In cloud computing, the most successful application framework is parallel-distributed processing, in which an enormous task is split into a number of subtasks that are processed independently on a cluster of machines referred to as workers. Because of the huge system scale, worker failures occur frequently in cloud environments, and failed subtasks cause a large processing delay of the task. One scheme for alleviating the impact of failures is checkpointing, in which the progress of a subtask is recorded as a checkpoint and a failed subtask is resumed by another worker from the latest checkpoint. This method can reduce the processing delay of the task. However, frequent checkpointing incurs system overhead, and hence the checkpoint interval must be set properly. In this paper, we consider the optimal number of checkpoints that minimizes the task-processing time. We construct a stochastic model of parallel-distributed processing with checkpointing and approximately derive explicit expressions for the mean task-processing time and the optimal number of checkpoints. Numerical experiments reveal that the proposed approximations are sufficiently accurate in typical cloud computing environments. Furthermore, the derived optimal number of checkpoints outperforms the result of a previous study in minimizing the task-processing time of parallel-distributed processing.
1. Introduction. [...] pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. These beneficial characteristics enable us to use enormous computing resources conveniently and at a low usage fee.
The most successful application framework in cloud computing is a parallel-distributed processing scheme, such as MapReduce [6]. This scheme handles an enormous amount of data with a huge number of computing resources in a parallel-distributed fashion, and is used for many applications such as data mining, machine learning and social network analysis [3]. In the following, we refer to this processing mechanism as large-scale parallel-distributed processing.
In large-scale parallel-distributed processing, an enormous task is split into a number of subtasks and those are processed independently in parallel on a cluster of machines referred to as workers. The task completes when all the subtasks have finished, and thus subtasks which are involved in worker failures cause a large processing delay of the task (the issue of stragglers) [6]. Worker failures occur frequently in large-scale parallel-distributed processing because of its huge system scale [8]. For example, it is reported in [7] that there are thousands of hard drive failures for a new cluster in the first year. In addition, the use of commodity machines for reducing hardware cost decreases the mean time between consecutive failures in data centers [1,7]. Therefore, dealing with worker failures is an important issue. In the following, we refer to the time to complete a task (resp. subtask) as the task-processing (resp. subtask-processing) time.
Checkpointing [13] is one of the well-known solutions for alleviating the impact of worker failures. In this method, the progress of the processing (a checkpoint) is periodically saved while a subtask is being processed. When a worker failure occurs, the subtask processed by the failed worker is resumed by another worker from the latest checkpoint. Checkpointing prevents the failed subtask from being re-executed from the beginning and thus limits the increase in the subtask-processing time. However, excessive creation of checkpoints increases system overhead, whereas a long checkpoint interval wastes computation resources when a worker failure occurs. Therefore, deriving the optimal checkpoint interval is a crucial subject.
In this paper, we evaluate the effect of checkpointing on the task-processing time and consider the optimal number of checkpoints that achieves the shortest task-processing time. We construct a stochastic model for large-scale parallel-distributed processing with checkpointing, in which a task accepted by the service facility is split into subtasks of equal size, and the processing of the task ends when all of the subtasks are completed. Note that the assumption of equally sized subtasks is reasonable when a huge amount of input data is split into data pieces of approximately equal size [6]. Moreover, each subtask's checkpoints are made periodically during its processing, and a subtask is resumed by another worker from the latest checkpoint when the dedicated worker fails. We assume that the time intervals between consecutive failures of each worker follow an exponential distribution. This distribution is one of the most commonly used distributions because of its tractability, and there are various discussions on its validity [2,9,11]. We discuss the validity of this assumption in numerical examples.
For this system, we approximately derive explicit expressions for the mean task-processing time and the optimal number of checkpoints. In numerical examples, we investigate the accuracy of the derived approximations in comparison with the results of Monte Carlo simulation. Moreover, we show the usefulness of the derived optimal number of checkpoints by comparing it with the result of a previous study. Finally, we validate through Monte Carlo simulation experiments the assumption of an exponential distribution for the time intervals between consecutive worker failures.
The rest of this paper is organized as follows. Section 2 reviews previous studies on checkpointing. In Section 3, we illustrate the analytical model for large-scale parallel-distributed processing with checkpointing. For this model, we derive the mean task-processing time and the optimal number of checkpoints in Section 4. In Section 5, we show numerical examples of the derived measures. Finally, Section 6 concludes the paper.
2. Related work. Large-scale parallel-distributed processing is required to minimize the performance degradation due to worker failures. In [4,13,17], various implementations of large-scale parallel-distributed processing are proposed, and they apply checkpointing method to reduce the processing delay of a task caused by worker failures. However, none of these studies fully discuss the optimal checkpoint interval.
From the viewpoint of mathematical modeling, Young [16] proposes the model for single processing with checkpointing, and derives the optimal checkpoint interval with a first order approximation. His result is applied to the determination of checkpoint interval not only for single processing but also for parallel-distributed processing [2,15]. Daly [5] extends Young's result to a higher order approximation, and shows that the optimal checkpoint interval is not affected by the time required to resume a failed task. The results of Young and Daly are simple and insightful. However, it becomes difficult to apply their results to parallel-distributed processing as the system scale grows because their models do not take into account the issue of stragglers.
Fialho et al. [10] and Jin et al. [12] focus on checkpointing for one of the parallel-distributed processing schemes, Message Passing Interface (MPI), and derive the optimal checkpoint interval. In MPI, however, a failed subtask restarts or stops the processing of other subtasks because subtasks depend on each other through data exchange, and the results of these two studies assume this dependency. In MapReduce, on the other hand, data exchange between subtasks is not required during the Map and Reduce steps, so a failed subtask does not disturb the processing of other subtasks but becomes a straggler. To the best of our knowledge, the effect of checkpointing on parallel-distributed processing such as MapReduce has not been fully studied yet. In this paper, therefore, we focus on parallel-distributed processing in which subtasks are processed independently, and derive the optimal number of checkpoints. The derived optimal number of checkpoints is as simple as Young's formula [16], while our formula takes into account the issue of stragglers.
3. Analytical model. In this section, we describe the analytical model consisting of two parts: large-scale parallel-distributed processing and the processing of a subtask with checkpointing.
3.1. Model descriptions of large-scale parallel-distributed processing. We describe a stochastic model for large-scale parallel-distributed processing. When a task is accepted by the server, the task is divided into $M$ ($M = 1, 2, \dots$) subtasks, and each subtask is assigned to a dedicated worker. Let $S_i$ ($i = 1, 2, \dots, M$) denote the subtask-processing time of the $i$-th subtask. The subtask-processing times $\{S_i;\ i = 1, 2, \dots, M\}$ are independent and identically distributed (i.i.d.) random variables which follow a common distribution function $F$, i.e.,
$$F(x) = P(S_i \le x), \qquad x \ge 0, \quad i = 1, 2, \dots, M.$$
We refer to the distribution $F$ as the subtask-processing time distribution.
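We now define $T$ as the task-processing time, which is equal to the maximum of the $M$ subtask-processing times, i.e.,
$$T = \max(S_1, S_2, \dots, S_M).$$
We also define $G$ and $g^{(1)}$ as the distribution function and mean value of the task-processing time $T$. It then follows that
$$G(x) = P(T \le x) = \{F(x)\}^M, \qquad g^{(1)} = \int_0^\infty \{1 - G(x)\}\,dx = \int_0^\infty \left[1 - \{F(x)\}^M\right] dx. \tag{1}$$
In what follows, we refer to $g^{(1)}$ as the mean task-processing time.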
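The relation between the subtask-processing time distribution and the mean task-processing time can be checked numerically. The following is a minimal sketch, not part of the original analysis: it integrates $1 - \{F(x)\}^M$ with the midpoint rule. The names `mean_task_time` and `uniform_cdf`, the truncation point `upper`, and the uniform test distribution are illustrative assumptions.

```python
def mean_task_time(F, M, upper, steps=100_000):
    """Approximate the mean of the maximum of M i.i.d. times with CDF F.

    Uses the midpoint rule on integral_0^upper {1 - F(x)^M} dx, where
    `upper` is a point beyond which F(x) is numerically equal to 1.
    """
    h = upper / steps
    return h * sum(1.0 - F((i + 0.5) * h) ** M for i in range(steps))


def uniform_cdf(x):
    """CDF of Uniform(0, 1), used only as an illustrative test case."""
    return min(max(x, 0.0), 1.0)


# For Uniform(0, 1) subtask times, E[max of M] = M / (M + 1).
print(mean_task_time(uniform_cdf, M=4, upper=1.0))  # approximately 0.8
```

With a single subtask ($M = 1$) the same routine returns the ordinary mean of the subtask-processing time, which makes the cost of waiting for the slowest of $M$ subtasks easy to see by comparison.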

3.2. Model descriptions of the processing of a subtask with checkpointing. We introduce some symbols and assumptions to describe the processing of a subtask with checkpointing (see Fig. 1). That is, the subtask-processing time $S_i$ ($i = 1, 2, \dots, M$) follows these assumptions.
(a) We define $b$ ($b > 0$) as the subtask-processing time without checkpointing and worker failures, i.e., the actual processing time of a subtask. We also define $c$ ($c > 0$) as the time required to make one checkpoint. Moreover, let $K$ ($K = 0, 1, \dots$) denote the number of checkpoints made during the processing of a subtask.
(b) The processing work of a subtask is split into $K + 1$ segments of equal size, and a checkpoint is created at the end of each segment except the last one. Let $\sigma_k$ ($k = 1, 2, \dots, K + 1$) denote the $k$-th time interval, which includes the processing time of the $k$-th segment and the time of creating a checkpoint (if any). We then have
$$\sigma_k = \begin{cases} \dfrac{b}{K+1} + c, & k = 1, 2, \dots, K, \\[2mm] \dfrac{b}{K+1}, & k = K + 1. \end{cases} \tag{2}$$
We refer to $\sigma_k$ as the $k$-th segment-processing time. Let $\tau$ denote the subtask-processing time with $K$ checkpoints and no worker failures. We then have
$$\tau = \sum_{k=1}^{K+1} \sigma_k = b + Kc. \tag{3}$$
(c) The time intervals between consecutive failures for each worker are i.i.d. with an exponential distribution having mean $f$ ($f > 0$). When a worker failure occurs during a segment-processing time, the subtask involved in the failure is resumed by another worker from the checkpoint recorded just before that segment-processing time. We here define $r$ ($r \ge 0$) as the time required to resume the failed subtask.
(d) The magnitudes of the parameters $b$, $c$, $K$, $f$ and $r$ satisfy [...] (4), which is a typical relation for parallel-distributed processing with checkpointing on large-scale computing resources [2,7].
We note that, from these assumptions, the subtask-processing time is given as the sum of $b$, $Kc$, the total computation time lost by worker failures and the total time required to resume the failed subtask.
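Assumptions (a) to (c) can be turned into a small Monte Carlo sampler. The sketch below is an illustration rather than the simulator used in the paper: the function names, the parameter values in the usage lines, and the additional assumption that no failure occurs while a subtask is being resumed are all hypothetical. Because the failure intervals are exponential (memoryless), a fresh time-to-failure is drawn at the start of every segment attempt.

```python
import random


def simulate_subtask(b, c, f, r, K, rng):
    """Sample one subtask-processing time with K checkpoints.

    b: failure-free processing time, c: time per checkpoint,
    f: mean time between worker failures, r: resume time.
    Assumption of this sketch: no failure occurs during the resume time r.
    """
    seg = b / (K + 1)
    total = 0.0
    for k in range(1, K + 2):
        # Every segment except the last ends with a checkpoint of length c.
        sigma_k = seg + (c if k <= K else 0.0)
        while True:
            t_fail = rng.expovariate(1.0 / f)  # exponential with mean f
            if t_fail >= sigma_k:
                total += sigma_k               # segment completed
                break
            total += t_fail + r                # lost work plus resume time


    return total


def simulate_task(M, b, c, f, r, K, rng):
    """Task-processing time: the maximum of M independent subtask times."""
    return max(simulate_subtask(b, c, f, r, K, rng) for _ in range(M))


# Hypothetical parameter values (hours), chosen only for illustration.
rng = random.Random(0)
samples = [simulate_task(100, 24.0, 0.1, 720.0, 0.2, 9, rng) for _ in range(200)]
print(sum(samples) / len(samples))
```

Every sampled value is at least $b + Kc$, and the excess over $b + Kc$ is exactly the lost computation time plus the resume time described in the note above.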

4. Analysis. In this section, we first consider the worker failure probability and the processing delay (defined in Subsections 4.1 and 4.2, respectively). We then derive the subtask-processing time distribution $F$ and the mean task-processing time $g^{(1)}$. Finally, we propose the optimal number of checkpoints which minimizes $g^{(1)}$. In the analysis, in order to express the optimal number of checkpoints as a simple formula, we apply some approximations.

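4.1. Worker failure probability. For simplicity, we refer to an arbitrarily chosen subtask as a tagged subtask. Let $p$ denote the probability that the tagged subtask is involved in worker failures during its processing, i.e., that at least one worker failure happens in the processing of the tagged subtask. We refer to $p$ as the worker failure probability. We assume that the processing of the tagged subtask begins at time 0. We then define $T^{(n)}_{\mathrm{fail}}$ ($n = 1, 2, \dots$) as the $n$-th time interval between consecutive worker failures for the tagged subtask, where $T^{(1)}_{\mathrm{fail}}$ is measured from time 0. By assumption (c), $T^{(1)}_{\mathrm{fail}}, T^{(2)}_{\mathrm{fail}}, \dots$ are i.i.d. with an exponential distribution having mean $f$. We also define $N_{\mathrm{fail}}$ as the number of worker failures happening to the tagged subtask during its processing. From these definitions, we have
$$p = P(N_{\mathrm{fail}} \ge 1) = P\big(T^{(1)}_{\mathrm{fail}} < \tau\big) = 1 - \exp\left(-\frac{\tau}{f}\right) = 1 - \exp\left(-\frac{b + Kc}{f}\right), \tag{5}$$
where the last equality is due to (3). Note here that (4) implies $b + Kc \ll f$ and thus $(b + Kc)/f \ll 1$. Therefore, using the linear approximation $1 - e^{-x} \approx x$ for small $x$, the worker failure probability $p$ in (5) can be estimated in the following way:
$$p \approx \frac{b + Kc}{f}. \tag{6}$$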
Next, we estimate the probability $P(N_{\mathrm{fail}} \ge 2)$: [...], where [...]. Recall here (see assumption (c)) that $T^{(1)}_{\mathrm{fail}}$ and $T^{(2)}_{\mathrm{fail}}$ are i.i.d. with an exponential distribution having mean $f$, and that the event $\{N_{\mathrm{fail}} \ge 2\}$ is equivalent to [...], where we use (3) in the last but one equality. Substituting (5) and (9) into (7) yields [...]. In addition, proceeding as in the derivation of (6), we can readily obtain [...]. Applying (6) and (11) to (10) leads to [...] and thus [...]. In the rest of this section, we assume that the probability $P(N_{\mathrm{fail}} \ge 2)$ is negligible.
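A crude bound shows why this probability can be neglected. The following sketch is not the derivation given in the paper. It assumes that no failure occurs during the resume time $r$ and uses only the fact that, after the first failure, the remaining failure-free work is at most $\tau = b + Kc$.

```latex
% Sketch under stated assumptions; not the paper's own derivation.
% After the first failure the remaining failure-free work is at most \tau,
% and the next failure interval is exponential with mean f, independently
% of the first one.
\begin{align*}
P(N_{\mathrm{fail}} \ge 2)
  &= P(N_{\mathrm{fail}} \ge 1)\,
     P(N_{\mathrm{fail}} \ge 2 \mid N_{\mathrm{fail}} \ge 1) \\
  &\le p \left( 1 - e^{-\tau / f} \right)
   = p^{2}
   \approx \left( \frac{b + Kc}{f} \right)^{2}.
\end{align*}
```

For instance, if $(b + Kc)/f = 0.01$, a subtask suffers two or more failures with probability of at most about $10^{-4}$, two orders of magnitude below the probability of a single failure.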

4.2. Processing delay. We consider the processing delay of the tagged subtask caused by a worker failure, under the assumption that $N_{\mathrm{fail}} = 1$. The processing delay is defined as the sum of the time required to resume a failed subtask and the computation time lost by the worker failure, i.e., the time between the completion of the latest checkpoint and the worker failure. Given that $\tau_{k-1} \le T^{(1)}_{\mathrm{fail}} < \tau_k$ ($k = 1, 2, \dots, K + 1$) and $N_{\mathrm{fail}} = 1$, the processing delay is equal to $r + T^{(1)}_{\mathrm{fail}} - \tau_{k-1}$. It then follows from assumption (c) that [...], where the last equality is due to $\tau_k = \tau_{k-1} + \sigma$ (see (8)). Similarly, for $r \le t < r + \sigma$, [...]. Note here that (2) and (4) imply $c \ll b/(K + 1)$ and $\sigma = b/(K + 1) + c \ll f$, and using the linear approximation, we have [...]. Applying (15) to (13) and (14), we have, for $k = 1, 2, \dots, K + 1$ and $r \le t < r + \sigma$, [...].
4.3. Mean task-processing time. We first consider the subtask-processing time distribution $F$. Let $S$ denote the subtask-processing time of the tagged subtask. From (12), we have [...]. Note here that [...]. Note also that $S = \tau + r + T^{(1)}_{\mathrm{fail}} - \tau_{k-1}$ [...]. Applying (18) and (19) to (17) yields [...]. Next, we derive an approximation of the mean task-processing time $g^{(1)}$. Substituting (20) into (1) and using the binomial theorem, we have [...]. Note here that on the right side of (21), the second and third terms represent the expected values of the time required to resume a failed subtask and the computation time lost by a worker failure, respectively. We now define $R(z)$ as [...]. It then follows from (21) that [...]. Substituting the definitions (2), (3) and (5) of $\sigma$, $\tau$ and $p$, we have [...]. The accuracy of the approximation (22) is numerically investigated in Subsection 5.1.
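The steps above can be mimicked in a few lines of code. The sketch below is one possible closed form and may differ in appearance from the paper's expression (22). It assumes that each subtask fails at most once, with probability $p = 1 - e^{-\tau/f}$, and that the lost computation time is uniform on $[0, \sigma)$ with $\sigma = b/(K+1) + c$ for every segment. Integrating $1 - \{F(x)\}^M$ under these assumptions gives the failure-free time $\tau$, a resume term $r\{1 - (1-p)^M\}$, and a lost-work term $\sigma\{1 - (1 - (1-p)^{M+1})/((M+1)p)\}$. The function name is hypothetical.

```python
import math


def approx_mean_task_time(M, b, c, f, r, K):
    """Approximate mean task-processing time (single-failure sketch).

    Assumptions of this sketch: at most one failure per subtask, and the
    lost computation time is uniform on [0, sigma).
    """
    tau = b + K * c              # failure-free subtask time
    sigma = b / (K + 1) + c      # segment time including one checkpoint
    x = tau / f
    p = -math.expm1(-x)                      # 1 - exp(-tau/f)
    any_fail_m = -math.expm1(-M * x)         # 1 - (1 - p)^M
    any_fail_m1 = -math.expm1(-(M + 1) * x)  # 1 - (1 - p)^(M + 1)
    resume = r * any_fail_m                  # expected resume time
    lost = sigma * (1.0 - any_fail_m1 / ((M + 1) * p))  # expected lost work
    return tau + resume + lost


# Hypothetical parameter values (hours), for illustration only.
for K in (0, 5, 10, 20):
    print(K, approx_mean_task_time(M=100, b=24.0, c=0.1, f=720.0, r=0.2, K=K))
```

For $M = 1$ the lost-work term reduces to $\sigma p / 2$, the familiar "half a segment per failure" rule, while for large $M$ it approaches a full segment $\sigma$ because some subtask is then likely to fail late in its segment.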

4.4. Optimal number of checkpoints. In this subsection, we show a simple and explicit formula for the optimal number of checkpoints, which is expected to minimize the mean task-processing time $g^{(1)}$. To this end, we simplify the approximate expression (22) of $g^{(1)}$. From (4), we have $(b + Kc)/f \ll 1$ and $Kc \ll b$, which lead to [...]. Applying the above approximation to (22), we have [...]. It is easy to see that $g^{(1)}(K)$ takes its minimum value for $K$ such that [...] (23). We now define $K^*_{\mathrm{prop}}$ as the number such that $(d/dK)\, g^{(1)}(K)|_{K = K^*_{\mathrm{prop}}} = 0$, i.e., the solution to (23). We then have [...] (24). We propose $K^*_{\mathrm{prop}}$ as the optimal number of checkpoints. Since we have made several approximations in obtaining $K^*_{\mathrm{prop}}$, the proposed number $K^*_{\mathrm{prop}}$ is not, in general, equal to the exact optimal number of checkpoints, denoted by $K^*$, which minimizes $g^{(1)}$. Nevertheless, under those approximations, we can expect $K^*_{\mathrm{prop}}$ to be a good approximation to the exact optimal number $K^*$, i.e., $K^*_{\mathrm{prop}} \approx K^*$. In Subsection 5.2, we investigate the accuracy of the approximate formula (24) through numerical experiments, where $K^*_{\mathrm{prop}}$ is rounded to the nearest integer because the number of checkpoints must be an integer.
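The minimization can be sketched numerically. The closed form below is an assumption of this sketch and is not necessarily identical to formula (24). It takes $p \approx b/f$ (using $Kc \ll b$), treats the lost computation time as uniform on a segment, and sets the derivative of $Kc + (b/(K+1) + c)\,R(b/f)$ to zero, where $R(z) = 1 - \{1 - (1-z)^{M+1}\}/((M+1)z)$. This gives $K + 1 = \sqrt{b\,R(b/f)/c}$, which does not involve $r$. The mapping of Young's interval $\sqrt{2cf}$ to a number of checkpoints via $b/(K+1) = \sqrt{2cf}$ is likewise an assumption, and all function names are hypothetical.

```python
import math


def k_prop(M, b, c, f):
    """Stationary point of the simplified mean task time (sketch).

    Requires b < f. Independent of the resume time r by construction.
    """
    z = b / f
    lost_fraction = 1.0 - (1.0 - (1.0 - z) ** (M + 1)) / ((M + 1) * z)
    return math.sqrt(b * lost_fraction / c) - 1.0


def k_young(b, c, f):
    """Checkpoint count implied by Young's interval sqrt(2 c f) (assumed mapping)."""
    return b / math.sqrt(2.0 * c * f) - 1.0


def to_checkpoint_count(k):
    """Round to a non-negative integer number of checkpoints."""
    return max(0, round(k))


# Hypothetical parameter values (hours), for illustration only.
b, c, f = 24.0, 1.0 / 60.0, 720.0
for M in (1, 10, 100, 1000):
    print(M, to_checkpoint_count(k_prop(M, b, c, f)), to_checkpoint_count(k_young(b, c, f)))
```

In this sketch the two rules coincide for $M = 1$, and the proposed count grows with $M$ because the slowest of many subtasks is more likely to lose most of a segment, which makes shorter segments worthwhile.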
Finally, we note that the approximation formula (24) is independent of $r$. It is reported by Daly [5] that, for single processing, the optimal number of checkpoints is not affected by the time required to resume a failed subtask. The formula (24) therefore implies that this insensitivity to the time required to resume a failed subtask also holds for parallel-distributed processing. Moreover, we note that if $M = 1$ then $K^*_{\mathrm{prop}}$ reduces to $K^*_{\mathrm{young}}$ [...] (25), where $K^*_{\mathrm{young}}$ is the optimal number of checkpoints according to Young [16]. Therefore, our result can be considered a generalization of Young's result.
5. Numerical examples. First, we verify the proposed approximations of the mean task-processing time and the optimal number of checkpoints in comparison with the results of Monte Carlo simulation. In addition, we show the usefulness of the derived optimal number of checkpoints by comparing it with the result of a previous study. Finally, we validate through Monte Carlo simulation experiments the assumption of an exponential distribution for the time intervals between consecutive worker failures. Table 1 shows the parameter values used in the numerical experiments. We set these parameters according to the literature [2,7] in order to reflect values which are used in real systems.

5.1. Approximation accuracy of the mean task-processing time. In this subsection, we investigate the approximation accuracy of the mean task-processing time. We applied several approximations to derive the mean task-processing time in Section 4. To verify these approximations, we compare the analytical result (22) with the result of Monte Carlo simulation. We calculate the 95% confidence interval of the mean task-processing time in simulation experiments. Figures 2 to 6 show the mean task-processing time with respect to the number of checkpoints for various system parameters. In Fig. 2 (resp. Figs. 3, 4, 5 and 6), the parameter $M$ (resp. $b$, $c$, $f$ and $r$) is set to three different values. In these figures, the analytical result gives a good approximation of the mean task-processing time for most parameter values. On the other hand, we observe gaps between the analytical and simulation results for $M = 1{,}000$ in Fig. 2, $b = 120$ [hour] in Fig. 3 and $f = 7$ [day] in Fig. 5. These gaps arise because the analysis in Subsection 4.1 neglects the possibility of two or more worker failures happening to a subtask. Even for such parameter values, however, the approximation describes the qualitative trend against the number of checkpoints well.
Moreover, in Fig. 6, the mean task-processing time takes the minimum value at K = 13 for each r. This trend means that the optimal number of checkpoints is not affected by the time required to resume a failed subtask, and agrees with the observation of the approximation formula (24).
These results imply that the derived approximation (22) is sufficiently accurate to find the number of checkpoints which minimizes the mean task-processing time, provided that the system parameters are such that two or more worker failures on a single subtask can be ignored.
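The text does not state how the 95% confidence intervals are computed. A common choice, assumed here purely for illustration, is the normal approximation applied to independent simulation replications. The function name `mean_ci95` and the sample values are hypothetical.

```python
import math
import statistics


def mean_ci95(samples):
    """Sample mean with a normal-approximation 95% confidence interval.

    Assumes independent replications and a sample large enough for the
    normal approximation to be reasonable.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Hypothetical task-processing times (hours) from repeated simulation runs.
runs = [25.1, 24.9, 26.3, 25.4, 25.0, 27.2, 25.2, 24.8, 25.9, 25.5]
print(mean_ci95(runs))
```

An analytical value that falls inside the interval for a given number of checkpoints is statistically indistinguishable from the simulated mean at that point, which is the sense in which the comparison in Figs. 2 to 6 can be read.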

5.2. Usefulness of the proposed approximation for the optimal number of checkpoints. In this subsection, we discuss the usefulness of the proposed approximation for the optimal number of checkpoints. To this end, we compare the mean task-processing times calculated in three different ways: the approximation (25) based on Young [16], the proposed approximation (24) and Monte Carlo simulation. In simulation experiments, the mean task-processing time is calculated with the 95% confidence interval for each number of checkpoints, and the minimum one is chosen as the result of simulation. Note here that the approximation derived by Young [16] is extended to a higher order one by Daly [5]. However, it is reported in [5] that there is no significant difference between the results of Young and Daly when $b \ll f$ holds. Moreover, as mentioned in Section 2, Young's approximation is still applied to the determination of the number of checkpoints for parallel-distributed processing [2,15]. Therefore, we choose the result of Young as the baseline for comparison.
Figures 7 to 11 illustrate the mean task-processing times calculated in the three different ways. The horizontal axis represents the parameter $M$ (resp. $b$, $c$, $f$ and $r$) in Fig. 7 (resp. Figs. 8, 9, 10 and 11). Figures 7 to 11 indicate that the result of the proposed approximation agrees well with that of simulation, and outperforms the result of Young. This is because Young's approximation does not take into account the issue of stragglers. We can confirm in Figs. 7 and 9 that this disadvantage is most pronounced for large $M$ and $c$. These results suggest that the proposed approximation (24) for the optimal number of checkpoints is useful for minimizing the mean task-processing time.

5.3. Assumption validation of the analytical model. In this subsection, we validate the assumption of the proposed analytical model. To simplify the analysis, our analytical model assumes that the time intervals between consecutive worker failures follow an exponential distribution. However, there are various discussions on the validity of this assumption, and some studies point out that an exponential distribution fits the distribution of the time intervals between consecutive worker failures poorly [2,9,11]. Therefore, we calculate the mean task-processing time through Monte Carlo simulation when the time intervals between consecutive worker failures follow an exponential distribution, a gamma distribution with shape parameter 0.35 or a Weibull distribution with shape parameter 0.48. We choose the latter two distributions according to measurements of real systems [11]. The mean task-processing time is calculated with the 95% confidence interval for each number of checkpoints, and the minimum one is chosen as the result for each distribution.
[...] in Fig. 12 (resp. Figs. 13, 14, 15 and 16). In these figures, there is no significant difference among the mean task-processing times for the three distributions, and we can consider that the assumption that the time intervals between consecutive worker failures follow an exponential distribution is reasonable.
To examine this insensitivity to the shape of the distribution, we illustrate the mean task-processing time with respect to small $f$ in the cases of the three distributions in Fig. 17. In this figure, the difference between the result of the exponential [...] mean time between worker failures is sufficiently larger than the subtask-processing time (i.e., $b \ll f$). These results indicate that the proposed analytical model can evaluate the mean task-processing time for parallel-distributed processing with checkpointing even when the time intervals between consecutive worker failures follow a more realistic distribution.
6. Conclusion.
In this paper, we evaluated the effect of checkpointing on the task-processing time and considered the optimal number of checkpoints that achieves the shortest task-processing time. We constructed an analytical model for large-scale parallel-distributed processing with checkpointing, and approximately derived explicit expressions for the mean task-processing time and the optimal number of checkpoints. In numerical examples, we confirmed the accuracy of the derived approximations in comparison with the results of Monte Carlo simulation. Moreover, we showed the usefulness of the derived optimal number of checkpoints by comparing it with the result of a previous study. Finally, we validated through Monte Carlo simulation experiments the assumption of the analytical model that the time intervals between consecutive worker failures follow an exponential distribution. The results indicate that the proposed approximations remain sufficiently accurate even when the time intervals between consecutive worker failures follow a more realistic distribution, such as a gamma or Weibull distribution. Furthermore, the derived optimal number of checkpoints is as simple as Young's formula [16], while our formula outperforms Young's formula and is useful for minimizing the task-processing time of parallel-distributed processing.