Bayesian Inference for Latent Chain Graphs

In this article we consider Bayesian inference for partially observed Andersson-Madigan-Perlman (AMP) Gaussian chain graph (CG) models. Such models are of particular interest in applications such as biological networks and financial time series. The model itself features a variety of constraints which make both prior modeling and computational inference challenging. We develop a framework for the aforementioned challenges, using a sequential Monte Carlo (SMC) method for statistical inference. Our approach is illustrated on both simulated data as well as real case studies from university graduation rates and a pharmacokinetics study.


Introduction
Two common approaches to probabilistically describe conditional dependence structures are based on undirected graphs (Markov networks) and directed acyclic graphs (DAG) (Bayesian networks), e.g. [25]. The vertices of the graph represent variables, while the presence (absence) of an edge between two vertices indicates possible dependence (independence) between the two corresponding variables. Applications of Markov networks include biological networks, e.g., to study the dependence structure among genes from expression data [9,11] and financial time series for forecasting and predictive portfolio analysis [8,33]; while Bayesian networks occur in expert system research for providing rapid absorption and propagation of evidence [18,24], path analysis for computing implied correlations [34], and psychometrics for causal pathways [6,16].
Chain graphs [17,19,35] provide an elegant unifying point of view on Markov and Bayesian networks. These models allow for edges that are both directed and undirected and do not contain any semi-directed cycles. Although chain graphs were first introduced in the late eighties, most research has focused on Bayesian networks and Gaussian Graphical Models. Recently they are receiving more attention as they have proved to be very useful in applications due to their ability to represent symmetric and non-symmetric relationships between the random variables of interest. However, there exist in the literature several different ways of interpreting chain graphs and what conditional independencies they encode, giving rise to different so-called chain graph interpretations. Whilst a directed edge works as in DAGs when it comes to representing independence, the undirected edge can be understood in different ways, giving rise to the different interpretations. This implies that chain graphs can represent every independence model achievable by any DAG, whereas the opposite does not hold. The most common interpretations are the Lauritzen-Wermuth-Frydenberg (LWF) interpretation, the Andersson-Madigan-Perlman interpretation and the multivariate regression (MVR) interpretation. Each interpretation has its own way of determining conditional independences in a CG and each interpretation subsumes another in terms of representable independence models; see [30]. Moreover, [20] discuss causal interpretation of chain graphs.
This work focuses on chain graphs of the AMP type [2,22] and develops a statistical framework to infer the chain graph structure from data measured on a set of variables of interest with the goal of understanding their association pattern. Many inferential procedures have concentrated on the estimation of the parameters characterizing the graph, given the chain graph topology. For example, [13] consider likelihood-based parameter inference for chain graph models that are directly observed. Nevertheless, in the machine learning literature, methods based on optimization have been proposed to perform inference on the graph structure itself in the context of chain graphs, e.g. [23,26]. We consider the case where not only are the parameters unknown, but the chain graph itself is unobserved. The main objective is to perform Bayesian inference in such a context.
There are several contributions that propose Bayesian methods in related contexts. In [28], the author proposes a method to perform full Bayesian inference for acyclic directed mixed graphs using Markov chain Monte Carlo (MCMC) methods. There are several similarities between the model there (based upon structural equation models (SEM) e.g . [7]) and the one developed here. However, the main difference is given by the structure of the latent graph and the associated constraints in our case. These latter constraints are more stringent and lead to more complexity in the computational strategy. In [32] a similar model to the one presented in [28] is developed, which uses a spike-and-slab prior to induce sparsity in the graph. The computational scheme is expected to outperform that in [28] for a large number of nodes. The prior structure in [32] could in principle be extended to accommodate chain graph structures at the cost of introducing constraints that could make computations infeasible.
The contributions of this article are as follows. Based upon the likelihood method in [13] we develop a new Bayesian model for latent AMP chain graphs. This model can also incorporate covariate information, if available. We introduce a sequential Monte Carlo (SMC) method as in [10] that improves upon MCMC-based methods in this context. Our approach is applied to real case studies from university graduation rates and a pharmacokinetics study. We find the performance of our algorithm to be stable and robust.
This article is structured as follows. In Section 2 we describe our model and prior specifications. In Section 2.3 we introduce the SMC algorithm. In Section 3 we present a simulation study and two real-life applications. In the appendix we detail further elements of our algorithm in Section 2.3.

Likelihood
Let Y = (Y 1 , . . . , Y p ) ∈ R p be a random vector whose elements correspond to p ∈ N nodes of a graph. We assume we have m ∈ N observations on Y : y 1:m , y i ∈ R p .
A graph G = (V, E) is described by a set of nodes V and edges E, with variables Y 1 , . . . , Y p placed at the nodes. The edges define the global conditional independence structure of the distribution. An AMP chain graph is a graph whose every edge is directed or undirected such that it does not contain any semi-directed cycles, that is, it contains no path from a node v to itself with at least one directed edge such that all directed edges have the same orientation. Each graph can be identified by an adjacency matrix. Let A be a p × p matrix with entries (a ij ) 1≤i,j≤p where a ij ∈ {0, . . . , r} (let r = 3, which is a notation used later on) with a ii = 0, for i = j 0 no edge between i and j 1 undirected edge between i and j 2 directed edge from i to j 3 directed edge from j to i and, given the upper-triangular part of A, for j < i Given p, a labelling of the nodes and the adjacency matrix A define a graph G(A). Let C denote the set of possible chain graphs for a set of p vertices. Given A, let Ω be a positive definite p × p (real-valued) matrix, such that if a ij = 1 then ω ij = 0. In addition, given A, for an arbitrary real-valued p × p matrix B, then, if a ij = 2 b ij = 0. where I is the p × p identity matrix. To inherit the AMP chain graph property, Σ(B, Ω) should be a p × p positive definite matrix. The absence of semi-directed cycles implies that the vertex set of a chain graph can be partitioned into so-called chain components such that edges within a chain component are undirected whereas the edges between two chain components are directed and point in the same direction. More precisely, the vertex set of a chain graph G(A) can be partitioned into subsets T (G(A)) such that all edges within each subset τ are un-directed and edges between two different subsets τ = τ are directed. In the following, we assume that the partition τ ∈ T (G(A)) is maximal, that is, any two vertices in a subset τ are connected by an un-directed path. Recall that the parents of a set of nodes X of G is the set pa(X ) = {V j s.t. V j → V i ∈ G, V j ∈ X & V i ∈ X }. Let pa(τ ) be parents of τ , B τ = (B ij ) i∈τ,j∈pa(τ ) and Ω τ = (ω ij ) i,j∈τ sub-matrices of B and Ω, respectively. For a single observation y ∈ R p , y τ | y pa(τ ) , B, Ω ∼ N |τ | (B τ y pa(τ ) , Ω −1 τ ), where, for d ∈ N, N d (µ, Σ) denotes the d−dimensional Gaussian distribution with mean µ and covariance matrix Σ. Then the joint density of y ∈ R p given B, Ω, A is The likelihood is then given by that is, the observations are i.i.d. given B, Ω, A. We remark that this structure is similar to the structural equation model proposed by [29], where also a Gaussian distribution is assumed for the variables corresponding to the nodes of the graph. Moreover, these above assumptions on the precision matrix ensure that the corresponding graph satisfies the AMP Markox property as discussed in [2].

Prior Distribution on the Chain Graph Space
We specify the prior on the set of possible chain graphs, by specifying a prior on the elements of the corresponding adjacency matrix. For 1 ≤ i < j ≤ p, let P(a ij = l | π ij ) = π ij (l), l = 0, . . . , r where π ij = (π ij (0), . . . , π ij (r)). We assume each vector π ij to follow a Dirichlet distribution with parameter α = (α 0 , . . . , α r ) and the π ij to be conditionally independent given α. Note that, if available, covariate information could be incorporated to model π ij , as in [31] for example. Marginally, we have The prior for a graph G(A) then becomes: Note that the prior does not integrate to 1, but has a finite mass.
We assume that given G(A), Ω has a G−Wishart prior distribution ( [21]) of parameters δ, D, and given A the non-zero entries of B are i.i.d. univariate Gaussian random variables with mean ξ and variance κ. Let P be the cone of p × p real-valued positive definite matrices. Then the joint prior for Ω and B is given by: p(Ω, B|A) = I P (Σ(B, Ω))p(Ω|A)p(B|A) which does not integrate to 1.

Posterior Inference
The posterior distribution π of all the unknown of interest is given by: The structure of the model bears some resemblance to the models considered in [28,32]. Besides the graph structure being different (namely, we do not allow bi-directional edges), we also impose the AMP constraint, mathematically represented by the term I P (Σ(B, Ω)) in p(Ω, B|A), which leads to a much more complex structure of the graph space than the aforementioned references. As a result, posterior exploration of graph space is more challenging and computationally demanding and requires carefully devised algorithms, in a sense more sophisticated, than the ones considered in [28,32].
Our approach is to design an adaptive SMC sampler as in [15] (see [10] for the original algorithm and [5] for convergence results). SMC based algorithms offer the advantage of being easily parallelisable, often reducing computation times over serial methods in some scenarios. The strategy involves simulating N samples (particles) to approximate the sequence of densities The motivation for this algorithm is well-documented (see the aforementioned references for details), and SMC algorithms have been successfully employed in may contexts to sample from high dimensional posterior distributions.
In the implementation of the algorithm, we require a Markov kernel (e.g., a MCMC kernel) K t , t ∈ {1, 2, . . . , T } that admits π t as an invariant distribution; this step is detailed in the appendix. The algorithm is summarised in Algorithm 1. For notational convenience, we set u t = (B t , Ω t , (a t,ij ) i<j ). In the initialization stage, one simulates exactly from π 0 using rejection sampling. The sequence of {φ t } is set as proposed in [37], and the choice of parameters for the MCMC kernel follows [15].

Simulation and Real Data Study
In the following numerical experiments, we set δ = 3 and D = I p for the G-Wishart prior, and ξ = 0, κ = 1 for the distribution of the non-zero element of B. The number of particles for the SMC is N = 500. The MCMC steps are Metropolis-Hastings kernels.

Simulated Example
In the simulated example, we assume that the p random variables corresponding to the nodes of the graph are independent, i.e., a ij = 0 for all 1 ≤ i, j ≤ p. This is a benchmark example to evaluate the ability of our algorithm to recover the structure as this is a very simple graph structure. We consider p = 10 vertices and m = 100 observations. To generate the data, we set the precision matrix Ω equal to identity matrix I p , and the entries of matrix B are set to be 0. For the Dirichlet prior, we consider α = (3, 1, 1, 1). This prior implies a higher probability of no connection and assumes the same prior probability for any type of edge between two nodes. The reason we prefer this choice of hyper-parameters to α = (1, 1, 1, 1) (which corresponds to a uniform prior) is due to the computational time required by the initialization step to generate the chain graph. The simulation results are summarized in Figure 1. The effective sample size (ESS) drops very fast as the algorithm begins and goes into a steady lower state after several resampling procedures. The acceptance rate is acceptable as it does not fall below a very small value. Using the weighted samples {W T } N n=1 obtained at the last step of SMC, we estimate the posterior probability P(a ij = 0 | y 1:m , α), 1 ≤ i < j ≤ p by and summarize the results in Table 1. From the table, we can see that all the estimated posterior probabilities are greater than 0.7, and most of them are above 0.9. This suggests that our algorithm is able to recover the structure of the graph (independence) used to generate the data.

University Graduation Rates
We investigate the performance of our algorithm when the latent graph has a more complex structure. We consider data first presented in [14] that stem from a study for college ranking carried out in 1993. After initial analysis, Druzdzel and Glymour focus on p = 8 variables: spend average spending per student, strat student-teacher ratio, salar faculty salary, rejr rejection rate, pacc percentage of admitted students who accept university's offer, tstsc average test scores of incoming students, top10 class standing of incoming freshmen, and apgra average percentage of graduation.
Based on m = 159 universities, the correlation matrix of these eight variables is estimated in [14]. Conditional on this correlation matrix, [13] obtain a chain graph through the SIN model selection with significance level 0.15. To get a chain graph with different AMP and LWF Markov properties, [13] further deleted the undirected edge between top10 and rejr and the undirected edge between salar and top10, and introduced an undirected edge between top10 and rejr. The resulting graph is shown in Figure 2, and the corresponding adjacency matrix is shown in Table 2. This graph has three chain components τ 1 = {spend, strat, salar}, τ 2 = {top10, tstsc, rejr, pacc} and τ 3 = {apgra}.
In the original article [14] provide some insight about the causal relationship between some of the variables. The average spending per student (spend), student-teacher ratio (strat) and faculty salary (salar) are determined based on budget considerations and are not influenced by any of the five remaining variables. Rejection rate (rejr) and percentage of students who accept the university's offer from among those who are offered admission (pacc) precede the average test scores (tstsc) and class standing (top10) of incoming freshmen. The latter two are measures taken over matriculating students. Finally, graduation rate (apgra) does not influence any of the other variables.   Figure 2.
To test the performance of the proposed Bayesian chain graph model, we build an empirical baseline graph through the following procedure. Starting with a graph with only one node denoted by G, each time we add a new node to the existing graph. For every node in the subgraph not containing the new node, we consider the candidate graphs that are generated by adding one of four types of correlation (no edge, undirected edge and directed edges in either direction) between the new node and the node in the subgraph. We choose the graph with smallest BIC value obtained by fitting a structural equation model (SEM). SEM is a multivariate statistical analysis technique that is commonly used to analyse structural relationships. In general, the formulation of the SEM is given by the basic equation v = Av + u, where v and u are vectors of random variables. The parameter matrix A contains regression coefficients and the matrix P = E(uu ) gives covariances among the elements of u. The chain graph model described in this work can be written as Y = BY + U , where U is a multivariate Gaussian random vector with covariance matrix Ω −1 . This is consistent with the structural equation model formulation. Figure 3 shows the derived graph and Table 3 shows the corresponding adjacency matrix.   Figure 3.
We compare posterior inference from our model with the chain graphs in Figure 2 and Figure 3. For the SMC sampler, we set the number of samples N = 5000. For the Dirichlet prior, we first consider a prior based on the analysis of [13], i.e., choosing α = (0.39, 0.25, 0.36, 0.05) by matching the probabilities of each type of edges in Figure 2. More precisely, the number of no-edges, un-directed edges and directed edges from i to j are 11, 7 and 10 according to the graphs in Figure 2. So the corresponding proportions of these three types of edge are the first three components of α. The last component of α is chosen to be 0.05 to ensure the occurrence of directed edge from j to i and the probability of this type of edge is not large compared to the other three types. We also perform posterior inference with α = (1, 1, 1, 1), which is a uniform prior, and with α = (1, 3, 3, 3), which favours more connections. The remaining parameters are specified as in the previous section.
Based on the weighted samples {W   1, 1, 1). This is not surprising since the prior with α = (0.39, 0.25, 0.36, 0.05) is very informative and forces the posterior distribution of the edges to be similar to Figure 2. The estimated chain graph obtained setting the hyper-parameters equal to α = (1, 3, 3, 3) favours more connections compared with other two priors. If we compare our results with the conclusions in [14], the estimated chain graph obtained under the informative prior indeed shows that spend, strat and salar are parents of other variables, and apgra is a child of the others. However, it does not show that rejr and pacc precede tstsc and top10 (the graph just shows the opposite relation). Similarly, the graph obtained under the prior with α = (1, 3, 3, 3) shows that tstsc is a parent of salar, strat and spend, which is a contrast to the available prior information. This can be improved by modifying the ordering of nodes and choosing a small value of α 3 .
Finally, we compare our results with those obtained using SEM. We use package sem in R to fit a SEM to these data. Note that when fitting the SEM model the graph topology is fixed.    (1, 1, 1, 1); (c) posterior estimated chain graph using a Dirichlet prior with α = (1, 3, 3, 3).
with the graph selected by the SIN model selection procedure and the empirical graph, which suggests that our algorithm can take advantage of available prior information and perform better.

Tenofovir study
In this section, we illustrate our Bayesian model for AMP chain graphs on a real data application from the RMP-02/MTN-006 study ( [3]). Tenofovir (TFV) is a medication used to treat chronic hepatitis B and to prevent and treat HIV. TFV 1% gel demonstrated 39% protective efficacy in women using the gel within 12 hours before and after sexual activity in the Centre for the AIDs Programme of Research in South Africa 004 study [1]. Daily dosing of tenofovir disoproxil fumarate (TDF)/emtricitabine provides 62 to 73% protection against HIV transmission in serodiscordant men and women enrolled in the Partners PrEP study [4]. The RMP-02 study was designed to evaluate the systemic safety and biologic effects of oral TFV combined with a gel formulation of the drug for application rectally and vaginally. The study enrolled 18 patients, all of whom received a single oral dose of TFV, and randomized each patient to receive either the gel formulation or a placebo several weeks later. Details about the phase 1 study are given in [3]. [27] present analyses of the ancillary studies into the treatment's biologic effects.
The biologic effects we examine here concern the pharmacokinetics (PK) of TFV and its active metabolite tenofovir diphosphate (TFVdp), as well as the pharmacodynamics of the drug. The PK studies evaluate subjects' TFV and TFVdp concentrations in multiple physiologic compartments (i.e., tissues and cells) across multiple time points during the study. Table 5 lists the compartments.  [27] demonstrate that tissue HIV infectibility (cumulative p24) is correlated with in vivo concentrations of both TFV and TFVdp. Statistically significant, non-linear dose-response relationships with reduced tissue infectibility are found for one TFV compartment and four TFVdp compartments; the dose-response relationships are highly significant for TFVdp in whole rectal tissue, CD4 + MMC , CD4 − MMC and Total MMC compartments. Furthermore, [36] conduct a comprehensive pharmacokinetic study of rectally administered TFV gel that describes the distribution of TFV and TFVdp into various tissue compartments relevant to HIV infection. They argue that TFV rectal fluid concentrations may be reasonable bio-indicators of plasma and rectal tissue concentrations, making it easier to estimate adherence and TFV concentrations in the target tissue. Therefore, the correlations between the TFV and TFVdp in compartments can be helpful in studying HIV suppression procedure and providing a measure of drug efficacy, enabling more advanced population pharmacokinetic modelling methods. In the following analysis, we investigate the correlation structure of p = 7 concentration levels in Table 5 collected at visit 12. From clinical knowledge, we would expect the following associations: • TFV plasma is associated with TFV tissue (blood levels and tissue levels), • TFV tissue is associated with TFV rectal (rectal tissue and rectal fluid), • TFV tissue is associated with Total MMC (rectal tissue and mononuclear cells in rectal tissue), • Total MMC is associated with CD4 + MMC and CD4 − MMC (total and constituents). We normalise the observations of each variable to have zero mean and standard deviation of one. The number of observation is m = 11, and the number of particles is set to N = 5000. We fit the model using as hyper-parameters in the Dirichlet prior both α = (1, 1, 1, 1), which corresponds to the uniform prior, and α = (1, 3, 3, 3), which favours the presence of connections. This latter prior choice is more suitable for small sample sizes. The remaining parameters are set as in the previous section.
Based on weighted samples {W T | y 1:n , α , and is shown in Figure 5 (for both prior settings). Obviously, the chain graph obtained under the prior with α = (1, 3, 3, 3) has more connections. Moreover, this graph also shows that the TFV tissue is related to TFV plasma and TFV rectal , the TFV tissue is associated with Total MMC through TFV plasma , and CD4 + MMC causes Total MMC and CD4 − MMC . This is consistent with the clinical knowledge mentioned before. However, some of these associations are missing in the chain graph obtained under the uniform prior, e.g. the edge between TFV plasma and TFV tissue (they are independent given Total MMC and CD4 + MMC under the AMP chain graph property). We compare our results to those obtained by fitting a SEM model using the sem function in the R package sem. We summarise the results in Table 7. The AIC and BIC values of the model corresponding to the chain graph obtained under the uniform prior are missing, which may due to the singular Hessian matrix when estimating the covariance matrix of the parameter. From the table, it is evident that the p-values of some edges are large. For example, the edge CD4 + MMC → TFV tissue in the chain graph obtained under uniform prior has a p-value 0.704. These results suggest that these relationships are not significant in a SEM model.
When inadequate fit of a structural equation model is observed, model modification is often conducted followed by retesting of the modified model. The most popular statistic is the modification index, which is a chi-square score test statistic with degree of freedom one. The modification index provides an estimated value in which the model's chi-square test statistic would decrease if the corresponding parameter is added to the model and respecified as a free parameter. We perform model modification using the modIndices function in sem package, which calculates modification indices and estimates parameter changes for the fixed and constrained parameters in a structural equation model. Table 5 shows the five largest modification indices for both the A matrix and P matrix for the model corresponding to the chain graph obtained under the Dirichlet prior with α = (1, 3, 3, 3). The modification indices suggest that a better fit to the data would be achieved by adding association between TFV rectal and TFV plasma to the model. The small sample size makes it obviously challenging to estimate associations. Furthermore, the level in rectal tissue may depend on whether or not the patient received the TFV gel or the placebo, which may make the correlation of TFV tissue with other variables unclear.

Conclusions
In this article we propose a novel Bayesian model for latent AMP chain graphs, for which observations are available only on the nodes of the graph. Posterior inference is performed through a specially devised SMC algorithm. We investigate the ability of the model to recover a range of structures, also when prior knowledge is available. The performance of the SMC sampler is stable and consistent in our numerical study. However, the sampler is not suitable when the number of nodes p is large, as the initializing step is difficult. Moreover, the proposed algorithm does not scale well with respect to p. The computational cost of computing the probabilities in the adjacency matrix is O(p 2 ), and the computation of the normalizing constant in the G-Wishart distribution is quite expensive when p is large (approximately O(p 3 )).
Several extensions of this work are possible. First, the algorithm can be extended to large p by choosing a more efficient initial proposal q and by actually exploiting parallel computing techniques. Second, the model can be extended to accommodate multiple groups of observations, allowing borrowing information across groups or time periods. Third, we could avoid sampling Ω and B by using a Laplace approximation when calculating the posterior probability.