Network-Based Analysis of a Small Ebola Outbreak

We present a method for estimating epidemic parameters in network-based stochastic epidemic models when the total number of infections is assumed to be small. We illustrate the method by reanalyzing the data from the 2014 Democratic Republic of the Congo (DRC) Ebola outbreak described in Maganga et al. (2014).

1. Introduction. The best known models for the spread of infectious disease in human populations are based on the classical SIR model of Kermack and McKendrick [19]. The same system of ordinary differential equations (ODEs) may be derived as the large population limit of a density-dependent Markov jump process using the methods of Kurtz [20]. This stochastic formulation brings a number of mathematically attractive properties such as explicit likelihood formulas and ease of simulation. Nevertheless, a drawback of the Kermack and McKendrick-type models is that they can be unrealistic in describing the interactions of infectives and susceptibles as they are based on assumptions of homogeneous mixing [16].
In recent years there has been considerable interest in developing alternatives to the classical SIR, for instance, via network-based epidemic models, as reviewed by Pellis et al. [27] and House and Keeling [13]. Pair approximation models have Figure 1. The empirical secondary case distribution in the DRC outbreak dataset (neglecting the index case), as given by Maganga et al. [22].
infection, for example due to population behavioral changes or health interventions (e.g. quarantine).
The development of our methods is motivated by the devastating 2013 − 2015 outbreak of Ebola virus in West Africa. While the West Africa outbreak received considerable attention in fall 2014 due to the dramatic rise in the number of cases, there was an independent Ebola outbreak in the Democratic Republic of the Congo (DRC) occurring at the same time. This much smaller outbreak began July 26, 2014 inÉquateur Province and lasted until November 2014. The index case was a woman living in Inkanamongo village that presumably became infected by consuming bushmeat of an infected animal. Several healthcare workers involved in a postmortem cesarean section on the women subsequently became ill and generated further chains of transmission [22].
2. DRC dataset. There were a total of 69 confirmed infections in the 2014 DRC outbreak. The time series of cumulative case counts was reported by Maganga et al. [22]. However, the analysis method presented here does not require temporal data and instead utilizes the distribution of secondary cases, also given in [22]. The index case is believed to have caused 21 secondary cases, presumably due to her funeral acting as a super-spreading event [22]. Hence, we assume this data point to be an outlier and exclude it from our final analysis, as Maganga et al. also did in their estimation of R 0 . The secondary case distribution for other named contacts is shown in Figure 1. One patient caused three subsequent cases, two patients caused two additional cases, 30 caused one additional case and 11 patients caused zero additional cases. These may be viewed as observations from the post-index offspring distribution of the branching process approximation, i.e. the number of infections caused by an individual who is himself not the index case.
Although, as already mentioned, the temporal data measurements are not explicitly required for parameter estimation in our approach, the available temporal information can be directly incorporated into the likelihood function, or used to inform the parameterization of the prior distributions, which is what was done for the current DRC dataset. Details are given below. 3. Epidemic model. Consider a graph G = V, E where the vertex set, V , corresponds to individuals in a population of size n and the edge set, E, corresponds to potentially infectious contacts occurring between such individuals. The graph structure is given as a realization of a configuration model random graph with a prescribed degree distribution, D, where the probability that a node has degree k is denoted by q k ≡ P (D = k). Each node i ∈ V is assigned d i half-edges that are drawn at random from D. Then, the pool of half-edges is paired off uniformly to form the final network. This link formation model does not exclude the possibility of self-loops or multiple edges but this has a negligible effect in the large graph limit (see, e.g., discussion in [14]).
The general disease framework adopted is the standard compartmental SIR model where, for every time t > 0, each node i ∈ V is classified as susceptible (S), infectious (I), or recovered (R) from the infection. Given some number of initially infectious nodes, a node i becomes infective via transmission along an edge from one of his infectious neighbors. In the Markovian case, we assume that i remains infectious for an exponentially distributed amount of time with rate parameter γ, which we refer to as the recovery rate. More generally, a non-exponential (e.g., gamma) recovery rate leads to the so-called semi-Markov SIR model [15]. While infectious, the infective i attempts to transmit the infection to all of his susceptible neighbors according to an exponential distribution with rate β. In the case that the realization of the infection "clock" for a particular neighbor "rings" before the recovery "clock", then this neighbor becomes infected. The epidemic ends when there are no more infectives.
In addition to these standard SIR dynamics, we allow the network structure to change due to infection status. Specifically, we assume that infectious individuals drop each of their contacts according to an exponential distribution with rate δ. The dropped contacts, which could account for behavioral changes due to disease such as isolation or decreased mobility, cannot be reformed.

Statistical inference.
4.1. Index case offspring distribution. For an index case i with degree d i , at most d i secondary infections can be produced. Conditional on the recovery time of that index case, t i , the probability that infection has passed to any particular neighbor is given by where the first term in the product on the right-hand side is the probability that an edge transmitted infection prior to being dropped, while the second term represents the probability that either an infection or a drop occurred before recovery. Therefore, the total number of secondary infections caused by node i, which we denote ANALYSIS OF A SMALL EBOLA OUTBREAK 5 by X i , conditional on the time of recovery t i , is given by If the recovery time is not known but assumed to follow a distribution f (t i ) ≡ P (T i = t i ), then we may analogously define π di (x i ), the conditional probability of x i offspring given degree d i . The law of total probability implies We will assume identical recovery distributions for all individuals and, thus, the offspring distribution will be the same for all index cases. The final form for the offspring distribution of an index case may be found by supposing that the degree, d i , of the index case is unknown. Let q di ≡ P (D i = d i ) denote the probability that the index case has degree d i . Therefore, the law of total probability gives That is, we sum over all possible degrees that could yield at least x i secondary infections and weight them according to the degree distribution.

4.2.
Post-index case offspring distribution. We now consider the offspring distribution for a post-index case. By definition, such an individual acquired infection from another individual in the network and, thus, has at least one neighbor. Therefore, some adjustments are needed to account for the fact that post-index cases have a degree distribution which differs from D. Let q k denote the probability that a given neighbor in the CM network has degree k. Then it is known [26] that Since at least one of the neighbors of a post-index case has already been infected, he may pass the infection to at most k − 1 of his neighbors. Let X i denote the offspring distribution for a post-index infection i. Similarly to Eq. (2) we derive For a fixed set of parameters the basic reproductive number, R 0 , can be calculated as E(X v ), i.e. the average number of secondary infections caused by a post-index case. That is, R 0 is given by 4.3. Example. To illustrate, we assume that the degree distribution is Poisson with mean parameter λ and the recovery distribution is exponential with rate parameter γ. This implies f (t i ) = γe −γti and Therefore, by Eq. (2), the offspring distribution for an index case is given by and, by Eq. (3), the post-index case offspring distribution is given by This expression has no simple analytical form but it is not hard to approximate the integral numerically since it can be written as an expectation against the recovery distribution. Therefore, a simple Monte Carlo sample from the desired recovery distribution allows for efficient computation of this term.
Using Eq. (4), we can calculate the basic reproductive number in this Markovian case. For an arbitrary degree distribution, R 0 is given by which only differs in the inclusion of δ from the corresponding formula on a static CM graph [24,14]. Here µ = E(D) < ∞ by assumption. The summation in Eq. (5) represents the expected excess degree, i.e. the degree of a node which is necessarily a neighbor of a node, not counting the known edge. In the particular case here of a Poisson degree distribution, R 0 is found to be (cf., e.g., [3] chapter 6) Let Θ = (β, γ, δ, λ) denote the vector of parameters where γ and λ represent the parameters of the recovery time and degree distributions, respectively. The offspring distributions given in Eqs. (2) and (3) allow for explicit formulation of the likelihood for Θ which is given by With the specification of the likelihood, maximum likelihood estimators (MLEs) for the rate parameters can be found by numerical optimization. Given the MLÊ θ = (β,γ,δ,λ), the corresponding estimator for the basic reproductive number can be calculated by application of the continuous mapping theorem to the expression for R 0 given in Eq. (4). For example, under the assumptions of our example in Section 4.3, the estimator following from Eq. (6) would bê Denote the vector of current parameters as Θ = (β, γ, δ, λ). If we denote the parameter prior distribution φ(Θ) then the likelihood function above may be also used to compute the Metropolis-Hastings acceptance probability in the Markov chain Monte Carlo (MCMC) sampler. Let x denote the vector of data counts and τ be the transition kernel. The MCMC algorithm for obtaining the posterior distribution of Θ is then as follows 1. Initiate Θ curr = Θ 0 . 2. Obtain proposal Θ prop from τ (Θ|Θ cur ). 3. Accept (or not) this proposal with Metropolis-Hastings probability given by 4. Return to 2. 5. Repeat until convergence.
In this way, after sampler convergence, we obtain an approximate sample from the posterior distribution of Θ and may subsequently compute its approximate 1−α credibility region, given the observed data x. In particular, the credibility interval of the parameter R 0 given by Eq. (4) may be determined.

4.5.
Generalizations. Note that the likelihood formula (7) is valid in a non-Markovian setting such as an arbitrary recovery time distribution and could further be extended to the scenario where the transmission rate varies with time since infection. Note that in the latter case the formula for R 0 would differ from Eq. (4) due to a form for p ti that differs from Eq. (1) but would remain calculable as E(X v ).
If additional data were available, such as the recovery time or number of contacts of each individual, it could be explicitly incorporated into the likelihood function (7) through the joint distribution of recovery times and degrees.

5.
Analysis of the DRC dataset. To illustrate our method we perform the Bayesian posterior estimation of the model parameters for the 2014 Ebola outbreak in the DRC based on the data described in Section 2. The specific model considered is as given in Section 3, where transmission occurs according to an exponential distribution with rate parameter β, and the degree distribution is Poisson with parameter λ. Recovery time is assumed to follow a gamma distribution Γ(α, β). We note that, while incubation periods for Ebola range from two to 21 days [9], our method does not require consideration of latent exposure since it does not depend on infection timing.
We perform estimation via the MCMC scheme given in Section 4.4. Prior distributions were set to be minimally informative Gaussian distributions and hyperparameters were selected based on previous estimates [22,30]. The prior distribution for λ was taken to be N (16,14) and for δ was taken to be N (.01, .05). To improve mixing of the MCMC scheme, β was estimated on log-scale under an assumed N (log(.02), 4) prior distribution. Lastly, the gamma distribution for the infectious period was chosen to have prior mode (i.e., (α − 1)/β) distributed as N (11, 6) and prior standard deviation ( √ α/β) as N (6, 4). A Gaussian transition kernel τ was used. Central 95% credibility intervals were calculated for the parameters of interest as well as for R 0 . Results are summarized in Figure 2 including histograms of posterior samples.
The 95% credibility interval for the basic reproductive number R 0 is found to be (.589, 1.15) with the R 0 posterior mean of .842. The latter value numerically agrees closely with the moment-based estimate given in Maganga et al. [22] and is also consistent with the relatively small total number of confirmed infections observed during the epidemic. We further note that our interval estimate compares favorably to the 95% R 0 confidence interval of (−.38, 2.06) reported by Maganga et al. [22] indicating that for the DRC data the fully parametric model produces a more precise (shorter) interval. The infection rate is found to have a posterior mean of .0069 and a corresponding 95% credibility interval of (.00232, .0167). The contact drop rate is found to have a posterior mean of .0573 and corresponding credible interval of (.00340, .128). These quantities were not estimated directly by the authors in [22] so comparison here is not possible. The estimated mean for the infectious period is found to be 16.18 days with corresponding credibility interval of (6.65 days, 27.9 days). Maganga et al. did not explicitly estimate infectious period but instead gave an estimate for time from symptom onset to death with mean 11.3 days [22]. Based on a historic 1995 outbreak of Ebola in the DRC, Legrand et al. estimated the time from onset to end of infectiousness for survivors to be 10 days and time from onset to death to be 9.6 days [21]. While our estimate of infectious period is somewhat longer than these, we note that our interpretation of the quantity is the total length of time that an individual could cause infection. This could include several days after death in which the body is being prepared for and undergoing funeral rituals. Lastly, the mean number of contacts is estimated to be 16.2 with corresponding credibility interval of (8.59, 22.8), which is again in close pointwise agreement with the estimate inferred from the number of contacts traced by Maganga et al. [22]. Overall, the point estimate for R 0 and infection and recovery rates are seen to be consistent with a small outbreak behavior observed in the DRC dataset and to agree well with the numerical values reported earlier. However, based on the same data, our parametric model is also seen to yield more precise interval estimates. The final outbreak size of simulated branching processes from the posterior parameter distribution is used as a model diagnostic for fit to the empirical data. Conditioning on the number of infections caused by the index case as the number of independent branches, the posterior parameter samples and corresponding post-index case offspring distributions were used to simulate branching process realizations. The final outbreak sizes were calculated and the distribution is presented in Figure 3. Reasonable agreement is observed between this distribution and the empirical outbreak size of 69 cases [22]. 6. Discussion. We presented here a Bayesian parameter estimation method for a class of stochastic epidemic models on configuration model graphs. The method is based on applying the branching process approximation, and is applicable when the number of infections is small in relation to the size of the population under study. In particular, this includes the case when the total outbreak size is small or when we are at the onset of a large outbreak. The method are flexible, for example allowing for arbitrary degree and recovery distributions, and in principle requires only a knowledge of the distribution of secondary cases, although additional data can be incorporated into the inference procedure, as the likelihood function under branching approximation remains straightforward to evaluate under a wide range of data collection schemes.
We illustrated our approach with the analysis of data from the 2014 DRC Ebola outbreak which was originally described and analyzed in Maganga et al. [22]. Our method, under only weakly informative prior distributions, is seen to produce a considerably tighter credibility interval for R 0 than the moment-based confidence estimate reported in [22]. This demonstrates the utility of the branching process approximation for small epidemics in obtaining more precise estimates of R 0 , which is essential in assessing potential risk of a large outbreak and in determining the level of control efforts (e.g. vaccination or quarantine) needed to mitigate an outbreak. The final size comparison indicates that the observed data is within the range of model predictions, although the direct comparison of observed and model predicted offspring distributions indicates some disagreement in the observed frequency of zeros and ones. The small sample size prohibits definitive conclusions but this may indicate the need to incorporate more complex network dynamics (e.g. distinguishing between multiple types of infectious contacts).
As the statistical estimation methods appear essential to inform public health interventions, we hope that our work here will help in establishing a broader inference framework for epidemic parameters, based on the type of data usually collected in the course of an outbreak. The Bayesian approach is particularly attractive in this context, as it naturally incorporates any prior or historical information. However, the current approach only addresses the inference problem at the epidemic onset and, in particular, is not appropriate when the number of infected individuals comprises a significant portion of the population. We plan to address estimation for such large outbreaks, possibly also incorporating more complex network dynamics, in our future work.