A Bayesian nonparametric test for conditional independence

This article introduces a Bayesian nonparametric method for quantifying the relative evidence in a dataset in favour of the dependence or independence of two variables conditional on a third. The approach uses Polya tree priors on spaces of conditional probability densities, accounting for uncertainty in the form of the underlying distributions in a nonparametric way. The Bayesian perspective provides an inherently symmetric probability measure of conditional dependence or independence, a feature particularly advantageous in causal discovery and not employed by any previous procedure of this type.


INTRODUCTION
The random variables X and Y are conditionally independent given Z (written X ⊥ ⊥ Y | Z) if and only if the following relation holds between their conditional densities, for all possible realised values z of Z: A common problem is that of assessing whether or not this relation is true for a given triple of variables. Typically, the setting is that the three densities in (1)-and the marginal density p Z (z)-are all unknown a priori, but we have a finite set of data W : assumed to be drawn from the joint measure p XY Z induced by (X, Y, Z). Notably, this type of analysis is a key component in most common approaches to causal discovery (Pearl, 2009).
Testing for conditional independence with finite data is, however, known to be a hard problem in general. This is particularly true if the unknown densities are assumed continuous and modelled nonparametrically, which is often desirable if nothing is known to the contrary. Bergsma (2004) and Shah and Peters (2018) give results which show that in such a setting, a test for conditional independence with desirable statistical properties cannot in general be constructed.
Nonetheless, many tests exist and are commonly used in practice, despite their various theoretical deficiencies. A classic approach is to form a test statistic from the partial correlation coefficient (Fisher, 1924). This vanishes if X ⊥ ⊥ Y | Z, but only under the strong assumptions that all variables are Gaussian and all dependences linear. Limited extensions to non-Gaussian variables are explored in Harris and Drton (2013) and Ramsey (2014) and to nonlinear dependences in Hoyer et al. (2009) and Peters et al. (2014).
Other approaches include combining a series of unconditional independence tests on the response variables (X, Y ) conditional on multiple individual values z of Z (Huang, 2010), and tests based on measures of statistical distance between estimates of the conditional densities p X|Z and p X|Y Z , which are zero if and only if X ⊥ ⊥ Y | Z White, 2007, 2008). Finally, a range of kernel-based methods Fukumizu et al. (2008); Zhang, K. et al. (2012); Doran et al. (2014); Zhang, Q. et al. (2017); Strobl et al. (2018) also exists, which aim to deal with high-dimensional or sparse problems more effectively than simpler approaches.
All of the methods described in the previous paragraph are frequentist by construction, in that they derive a test statistic and construct a hypothesis test based on either a known null distribution, or by using some other strategy such as a permutation test. Peters et al. (2017, §7.2.1) point out one possible problem with this, namely that "all causal discovery methods that are based on conditional independence tests draw conclusions both from dependences and independences", reminding us that classical hypothesis tests are inherently asymmetric.
Bayesian hypothesis testing circumvents this issue. In the unconditional case, Filippi and Holmes (2017)  In this work, we extend their approach to the setting of conditional independence. In contrast to classical hypothesis testing, this method has the potential to positively evidence both X ⊥ ⊥ Y | Z and X ⊥ ⊥ Y | Z.

Bayesian nonparametric hypothesis testing
Recall that we have a dataset W : . We wish to compare two competing hypotheses H 0 and H 1 , with H 0 the hypothesis of conditional independence and H 1 the contrary. Our aim is to output posterior probabilities of the form p(H 0 |W ) and p(H 1 |W ).
To do so, we use the Bayes Factor (Kass and Raftery, 1995), defined as the ratio of the marginal likelihoods of the two models.
Assuming the prior probabilities of the two hypotheses are equal (ie. p(H 0 ) = p(H 1 ) = 0.5), we can use this to derive the posterior probability of H 1 as The ratio of marginal likelihoods on the right-hand side of (2) can now be expanded as follows.
In this expression, the second equality arises from the assumption that the marginal distribution of Z is the same in both models-reasonable since the two hypotheses are only differentiated by the relationship between X and Y -and the third equality is simply an application of the definition of conditional independence given by (1). We henceforth suppress the explicit marking of the models H 0 and H 1 since it is now clear which of the three terms belongs to which model. The next step is to specify probability models for the three unknown conditional densities on the right-hand side of (4). Continuing to take a Bayesian approach but wanting to make as few assumptions as possible, we model each of them nonparametrically. What this means is that we work directly in a suitably-chosen infinite-dimensional space M (·|·) of conditional density functions, rather than in some finite-dimensional parameter space corresponding to a far smaller set of possible densities. This allows for a much more general model for the data-generating mechanism that we are interested in, which will ultimately reduce the bias in our conclusions. (Ghosal and van der Vaart, 2017).
Consider first the two-dimensional conditional density p XY |Z (corresponding to hypothesis H 1 ). The Bayesian nonparametric approach entails placing a functional prior π on M (·|·) -individual elements of which we call q(·|·)-incorporating the data W through a likelihood function L, then marginalising over M (·|·) such that the conditional marginal likelihood is given by The same procedure is applied to the one-dimensional conditional densities p X|Z and p Y |Z with π, q, M (·|·) and L replaced by their one-dimensional analogues.
Since we wish to assume that the random variables are all continuous, we employ the Pólya tree family of priors. These can be designed to ensure that samples from them are absolutely continuous with probability one and, furthermore, have the advantage that the marginal likelihood in (5) is tractable, in contrast to other nonparametric models of continuous random variables such as the Dirichlet Process Mixture (Escobar and West, 1995).
The specific model we use is a modified version of the conditional Optional Pólya tree (cond-OPT) of Ma (2017), also incorporating ideas from the finite Pólya tree of Lavine (1994) and the multi-dimensional Pólya tree of Paddock (1999). We review these models in the coming sections. The first constructions we explore are designed for modelling random unconditional density functions q(·); later we will see how these can be modified and extended to model random conditional density functions q(·|·).
The classical Pólya tree (PT) essentially defines a random probability measure over a one-dimensional domain Ω ⊆ R. Without loss of generality we take Ω = [0, 1]. The most familiar construction proceeds by recursive binary partitioning of Ω and at each step the assigning of probability mass to the two child From each set C * , a particle of probability mass passes to the left with (random) probability θ * 0 and to the right with probability θ * 1 = 1 − θ * 0 , with all θ being Betadistributed as described in the main text.
sets of a set C ⊆ Ω by means of independent Betadistributed random branching variables θ. This results in a tree structure, shown in Figure 1. Constructed this way, it is helpful to think of the PT as a random histogram on Ω or, for parameter choices which result in continuous distributions almost surely, a random density function. A particle of probability mass can be thought of as cascading down the tree, with the direction it takes at each binary split determined by the random parameters θ.
Concretely, let q denote a probability density on Ω and π(q) a measure over M (·) . 1 Consider a two-bin histogram partitioning Ω = [0, 1] at its halfway point. With C 0 := [0, 0.5) and C 1 := [0.5, 1], define the random branching probability θ 0 ≡ q(C 0 ) ∼ Be(α 1 , α 1 ) for some α 1 > 0. It follows that θ 1 ≡ q(C 1 ) = 1 − θ 0 . Note that in general, the two parameters of this Beta distribution need not be the same, though this symmetrising simplification is common and we adopt it. Indeed, we take the parameters constant within each level of the tree; the subscript on α j denotes the level.
Continue in this fashion, with C 0 = C 00 ∪ C 01 and 1 We abuse notation slightly in this section by writing q both for the probability measure and for its density, the existence of which is always assumed. θ 00 ≡ q(C 00 |C 0 ) ∼ Be(α 2 , α 2 ), θ 000 ≡ q(C 000 |C 00 ) ∼ Be(α 3 , α 3 ) and so on recursively, with each independent Beta random variable θ * determining the probability that the particle enters the set C * at the next level of the tree. Writing ε i for a (single) element of the set {0, 1}, ε m ≡ ε 1 ε 2 . . . ε m for a length-m word from the set {0, 1} m , and ε m 0 and ε m 1 for the appending of respectively a single 0 or 1 onto the end of ε m , the measure of a set C ε1ε2...εm can then be written as Now taking the infinite limit of tree depth m and writing ε * for an element of the set of all possible {0, 1}words of any finite length, it can easily be shown that the set of finite unions of intervals of the form C ε * generates the Borel σ-algebra on Ω. With q constructed in this fashion, the measure π(q) is a Pólya tree. Now substituting Bernoulli mass functions into the infinite-depth version of (6), writing {θ} for the collection of all θ, and denoting the set of all C ε * arising in the recursive partitioning procedure above by Π Ω , the density q(x) of a point x ∈ Ω is Lavine (1994) shows that, under certain conditions on the parameters α, the Pólya tree assigns positive probability to the Kullback-Leibler neighbourhood of any element of the space M (·) , and furthermore that these elements are absolutely continuous with respect to Lebesgue measure. Specifically, the choice α j = cj 2 , with c > 0 and j the level of the set in question within the tree, satisfies this condition and ensures that samples from the PT are almost surely continuous. Throughout this paper we employ this choice with c = 1. 2

Posterior & predictive distributions
Pólya trees benefit from the conjugacy of the Binomial and Beta distributions, allowing a simple expression to be derived for the posterior measure over M (·) after data X 1:N ≡ {X 1 , . . . , X N } have been observed. This conjugacy not only means that the posterior is itself a 2 Berger and Guglielmi (2001) write that c "is very difficult to specify" in nonparametric problems, and it is clear that its value affects inference significantly. c = 1 is a common (though ultimately arbitrary) default choice, sup- Pólya tree, but also that the branching variables θ can easily be marginalised, giving the marginal likelihood (8) In this expression, n ε * is a counting function whose value is the number of data X 1:N in the set C ε * ; ie.
, and {α} is the collection of all parameters α. The predictive distribution of a new point x can also be given in closed form as In this expression, n j (x) counts the number of data amongst the X 1:N which are in the same level-j set as (This notation is borrowed from Hanson (2006).) Finally, we can use a product of terms of the form (9) to rewrite (8) in a form which is easier to work with in practice.
Due to the way in which the product over i is made up of expressions of the form (9), n j (X i ) here counts the number of data in the set {X 1 , . . . , X i−1 } which are the same level-j set as X i .

Multivariate & Truncated Pólya trees
We now consider various generalisations and modifications of the classical Pólya Tree (PT), eventually describing the specific model we use in our method. The PT introduced in Section 2 is defined on a one-dimensional domain, but a multi-dimensional extension-in which sets C ⊆ Ω d are binary-divided in each of d dimensions simultaneously at each stage-is considered in Paddock (1999). In this construction, the children of C are assigned probability mass by means of Dirichlet-distributed random variables θ supported on the 2 d -dimensional simplex, generalising the Beta-distributed θ of the classical PT. 3 In this setting, we have (θ ε j−1 1 , . . . , θ ε j−1 2 d ) ∼ Dir(α j , . . . , α j ) and note that, by definition, Equations (8) and (9) can be similarly modified to the multi-dimensional setting.
A critical point is that in the classical PT model, exact calculation of the quantities (7)-(9) theoretically requires infinite computation, since the tree is of unlimited depth. It is therefore common in practice to truncate the calculation at a finite tree depth J. These truncated (also called 'finite' or 'partially-specified') Pólya trees (TPT) are discussed by Lavine (1994) and Mauldin et al. (1992), who explore the relationship of the various derived quantities to the equivalent ones arising under the full PT. While full Kullback-Leibler support over M (·) is no longer guaranteed, bounds on the pointwise error of the posterior measure (Lavine, 1994) and L 1 error of the predictive density (Hanson, 2006) are available. Hanson and Johnson (2002) formalise the definition of the TPT by specifying a base measure µ that the 'leaf' sets at the bottom level J are taken to follow. If this base measure is uniform then the TPT outputs piecewise-constant measures (ie. random histograms). Hanson (2006) explores the case where µ is a d-dimensional Gaussian measure, whereas for our setup (in which the state space is the bounded set Ω d = [0, 1] d ) it makes sense to take µ uniform. The density function for the multivariate TPT is given by (12) The first fraction in this equation is the normalised base density of the point x within its level-J set. Assuming henceforth that µ is indeed uniform, the predictive distribution corresponding to (9) for the multivariate TPT is given by This leads to the following expression for the multivariate TPT marginal likelihood, analogous to (10): As in (10), n j (X i ) counts the number of data in the set {X 1 , . . . , X i−1 } that are the same level-j set as X i .

Optional Pólya Tree
A different approach to modifying the PT to ensure that computation time is almost surely finite is proposed by Wong and Ma (2010). In this paradigm, the partitioning of Ω is augmented at each step by the drawing of independent Bernoulli-distributed stopping variables S. For a set C * arising in the partitioning of Ω, if the corresponding S * = 1 then C * is divided no further and a uniform distribution is placed on it. If S * = 0 then a binary split takes place as usual. As long as the Bernoulli parameter controlling the probability p(S * ) = 1 is uniformly greater than 0, it is easy to see that this partitioning procedure will result in all of Ω (but for a set of measure zero) being stopped in finite time with probability one.
Given certain technical conditions, a full-support result akin to that for the classical PT is also available in this case. However, also as before, in practice it is common to truncate the OPT at a finite tree depth J. Wong and Ma (2010) also introduce the concept of randomised splitting rules for multi-dimensional problems, generalising the construction of Paddock et al. (2003). We do not employ this feature in our model, though its incorporation would not be difficult.

Conditional Optional Pólya Tree
The optional stopping principle of the OPT model is further leveraged in Ma (2013) and Ma (2017), in which multi-scale mixtures of OPTs are used as models for conditional probability distributions; in Ma (2017) this is called the conditional Optional Pólya Tree (cond-OPT).
The basic idea is to construct a random conditional density q(x|z) by partitioning the predictor space Ω Z using the optional-stopping algorithm described in Section 2.3, then for each set A arising from this procedure to construct an independent (unconditional) OPT random density on the response space Ω X but using only those data X i whose corresponding Z i value lies in A. Finally, the multiple independent OPT models are combined in a weighted sum, giving a random conditional density q(x|z).
The measure constructed by this algorithm is shown in Ma (2017) to have full (total variation) support on the space M (·|·) of conditional density functions supported on Ω X × Ω Z , and as such is a direct generalisation of the unconditional Pólya Tree family of models so far discussed, immediately inheriting many of their strengths. This approach to modelling random conditional density functions forms a central part of our work.

BAYESIAN CONDITIONAL INDEPENDENCE TEST
Having briefly surveyed the Pólya Tree family of Bayesian nonparametric priors and alluded to the way in which they can be extended to nonparametric models of conditional distributions, we now describe our proposed method with much greater precision.
Recall that we are seeking to compare the hypotheses Call the support of X, Y and Z respectively Ω X , Ω Y and Ω Z and assume that Ω := Ω X × Ω Y × Ω Z is a compact subset of R 3 . Without loss of generality, we take Ω = [0, 1] 3 . Transformations which allow data of any (even unbounded) range to be constrained to this domain are easy to find.
We will define three nonparametric priors, one for each of the conditional densities p X|Z (x|z), p Y |Z (y|z) and p XY |Z (x, y|z) appearing in (4) then, incorporating the data W and marginalising the randomness in the posterior, derive the terms we require to form the Bayes Factor. We use p X|Z as our running example.
The first step is to partition the predictor space Ω Z using the optional-stopping binary recursive partitioning procedure described in Section 2.3. This produces a random partition of Ω Z , an intrinsic feature of this scheme. This additional randomness must itself be marginalised in order to calculate the conditional marginal likelihood p X|Z (W ). Following Ma (2017), this is done in practice by constructing a non-random binary partition Π Z and performing a recursive calculation on the resulting tree.
The partition Π Z is formed by the recursive binary splitting of Ω Z , with no further splitting for sets containing either exactly 0 or 1 datapoint Z i , or those at the maximum level J Z , regardless of the number of points in the set. The collection of lowest-level sets remaining at the end of this process are called leaf sets.
The partition having been made, the expression for the conditional marginal likelihood p X|Z (W ) is given by Ma (2017) as Φ X (Ω Z ), where Φ X is a set-valued function defined recursively on subsets A ∈ Π Z by This expression is explained fully in the coming paragraphs. Φ 0 X (A) is the marginal likelihood of the (unconditional) nonparametric model over Ω X calculated using only those data X i such that the corresponding Z i are in the set A. We call the data subset defined in this way W A and denote its cardinality by N A .
For these 'local' models over Ω X , we use truncated Pólya trees (TPT) rather than the optional Pólya trees (OPT) used in Ma (2017). Done this way, recursive calculation is not required-we simply use the formula (14), where the Φ 0 X (A) in the notation of this section is equivalent to p X ({X i : Z i ∈ A}|{α}, Π Ω X ) there. Neither are we required to set any 'local' mixing parameters analogous to the ρ in (15).
In addition, we find in tests that this approach is more robust, particularly when the number of data is small, as it is for most sets A near the bottom of Π Z . The full-support result given by Ma (2017) for the original cond-OPT model is easily modified to the case where the local models are PT instead of OPT.
Returning to (15), the local TPT marginal likelihoods Φ 0 X (A) must be calculated separately for all sets A ∈ Π Z . This is done by constructing partitions Π X,A of Ω X which are calculated using only the data in W A , then applying the formula (14). Note that Π X,A ⊆ Π X (≡ Π X,Ω Z ) in general, and in practice this is achieved most efficiently by constructing the most extensive tree Π X once, then pruning it to find the Π X,A for each A.
All the Φ 0 X (A) having been derived, Φ X (A) can then be calculated recursively, starting from the leaf sets of Π Z and working towards the root Ω Z , with the product in (15) being taken over the children A k of A. ρ ∈ (0, 1) is a mixing parameter which we take to be 0.5; this value is used throughout Ma (2017). This (modified) cond-OPT conditional marginal likelihood Φ X (Ω Z ) is thus seen to be formed of a multi-scale additive mixture of TPT marginal likelihoods.
The equivalent calculation is undertaken to find p Y |Z (W ) ≡ Φ Y (Ω Z ) and-now using the multivariate version of the TPT-p XY |Z (W ) ≡ Φ XY (Ω Z ).
The Bayes Factor (4) is then given by The complete procedure described in this section is summarised in the boxed pseudocode to the right.

Synthetic data
We run some proof-of-concept experiments of our method, using the synthetic datasets constructed based on the formulae in the first row of Figure 2. The measures from which the data are sampled are designed in such a way that every combination of unconditional independence/dependence and conditional independence/dependence is represented, as shown in the second row. In each case, (X, Y, Z) are by construction supported on [0, 1] 3 . Example 3-dimensional scatter plots are given for each model in the third row.
We highlight specifically model 4, which has X ⊥ ⊥ Y and X ⊥ ⊥ Y | Z, though for 90% of the data X ⊥ ⊥ Y | Z.
Noting definition (1) ("for all z"), we would like a partial conditional dependence of this type to be detected, even if it derives from only a small subset of the data.

Bayesian nonparametric test to assess
We vary the number of points N between 1 and 10 5 , and for each of several values of N in this range we run 100 repetitions of our procedure using data generated by different random seeds. The maximum tree depths J Z (in the predictor space) and J X , J Y and J XY (in the response space) are all set at log 2 (N ) , following the rule of thumb in Hanson and Johnson (2002).
We plot the range of test outputs p(H 1 |W ) in the fourth row of Table 2, with the blue line representing the median, and the dark-and light-blue shaded regions representing the (25,75)-percentile range and the (5,95)-percentile range respectively.
In the low-data limit, the test output p(H 1 |W ) → 0.5 as expected, while for values of N of 10 4 and greater the test consistently returns a probability very close to 0 or 1, correctly determining in each case the hypothesis that reflects the ground truth.
In the approximate range N = 10 2 to 10 4 , a relatively large uncertainty is present in the output. In the case of the two examples for which X ⊥ ⊥ Y | Z (models 3 and 4), there is a noticeable tendency in this range to falsely favour H 0 , before p(H 1 |W ) → 1 correctly as N > 10 4 . This is a manifestation of the natural Occam Factor present in the test, favouring the simpler model H 0 where insufficient data exists to conclusively support H 1 -see for example MacKay (2003, §28). The same phenomenon manifested in the unconditional independence testing procedure of Filippi and Holmes (2017) upon which this work builds.

Real data
We now apply our method to a representative real dataset to further illustrate its potential. We consider the California Cooperative Oceanic Fisheries Investigations (CalCOFI) Bottle data, a collection of hydrographic readings from maritime stations off the Cali-fornian coast collected over a period of 70 years. These data (available at calcofi.org) contain numerous examples of variables with highly non-linear or even nonfunctional dependence relations.
The complete dataset consists of N = 864, 863 observations of 74 variables, but with a high incidence of missing data and numerous strong 'trivial' linear correlations. We first remove all variables for which there is at least one other variable with which it has no common data. We then calculate pairwise correlation between the remaining variables, and retain only one representative from groups with pair correlations all greater than 0.99. This leaves six variables, T degC (Temperature), Salnty (Salinity), STheta (Potential density), Oxy mol.Kg (Oxygen in micromoles per kg), R DYNHT (Dynamic height) and R PRES (Pressure). We apply our test to all 60 possible assignments of these six variables to X, Y and Z. Though the number of observations remaining even after pre-processing is not significantly lower than N for any triple, we subsample sets of much smaller cardinality to demonstrate the ability of our test to correctly identify conditional dependence relations in limited data settings. This also serves to effectively eliminate correlation between observations-which would otherwise be strong in this type of time series data-avoiding violation of the i.i.d. data assumption. For the purposes of exposition, we focus on the case where Z is the variable T degC. For each of the 10 possible pairs (X, Y ) made up of the remaining five variables, we subsample N = 1000 observations. Figure 4 gives the graph corresponding to the pairwise conditional dependences found among these five variables conditional on T degC, where an edge indicates that p(H 1 |W ) > 0.99. We repeated this experiment five times with different random seeds, and recovered the same graph each time.

Salnty
In Figure 3 we plot the three 2-dimensional marginal scatter plots of an N = 1000 subsample drawn from a representative triple of variables Salnty, Oxy mol and T degC.

CONCLUSIONS & DISCUSSION
In this article we have defined and demonstrated a new Bayesian nonparametric approach to testing whether or not two random variables are independent conditional on a third. We have done so in a manner that minimises the assumptions required on the unknown joint distribution of (X, Y, Z), by modelling various of its conditional distributions using Pólya trees.
We believe this approach has the potential to be developed and extended in numerous ways, and we hope it will in this way increasingly find application in practical analyses. An extension to the multi-dimensional setting, particularly for the conditioning variable Z, would be of real use and is the subject of current work.
The Bayesian approach our procedure takes provides a framework in which both the hypotheses of conditional independence and conditional dependence can be positively evidenced from given data, unlike the inherently asymmetric hypothesis tests of classical statistics. This is of great importance for causal discovery. A relaxation of the prior assumptions can also straightforwardly allow the inclusion of substantive prior information on the plausibility of an association, something particularly useful when screening large biological datasets. Lastly, the ability to detect dependences of a highly nonlinear or even non-functional nature allows for much greater confidence in the robustness of any inference procedure in which this type of test is embedded.