Flexible online multivariate regression with variational Bayes and the matrix-variate Dirichlet process

Flexible regression methods where interest centres on the way that the whole distribution of a response vector changes with covariates are very useful in some applications. A recently developed technique in this regard uses the matrix-variate Dirichlet process as a prior for a mixing distribution on a coefficient in a multivariate linear regression model. The method is attractive, particularly in the multivariate setting, for the convenient way that it allows for borrowing strength across different component regressions and for its computational simplicity and tractability. The purpose of the present article is to develop fast online variational Bayes approaches to fitting this model and to investigate how they perform compared to MCMC and batch variational methods in a number of scenarios.


Introduction
Flexible modelling of multivariate conditional densities is a fundamental problem in statistics, particularly in regression applications in which there is interest in the ways that the whole distribution of a response vector depends on covariates. In a recent paper Zhang et al. (2010) developed a flexible multivariate regression method using a Dirichlet process prior for a mixing distribution on the coefficient in a multivariate linear model, where the Dirichlet process base prior is a matrix-variate normal distribution. The approach is attractive for its flexibility, the easy way it allows borrowing of strength between regressions for different response variables through the matrix-variate normal base prior, and the computational simplicity and convenience that comes from basing the method on the ordinary Dirichlet process. They refer to the Dirichlet process prior with matrixvariate normal base measure as the matrix-variate Dirichlet process (hereafter MDP), and further applications beyond the multivariate linear regression setup were considered in . The contribution of the present work is to consider fast online approaches to fitting the model of Zhang et al. (2010) using variational Bayes methods, suitable for application in the context of large datasets. We also consider a novel approach to improving the predictive performance of the online algorithm which gives performance comparable in many cases to a batch variational Bayes or MCMC approach.
In Bayesian nonparametrics, the development of suitable prior distributions for regression problems of the kind we consider here, involves the development of dependent prior distributions for sets of distributions indexed by the covariates. A recent survey on the extensive literature on this topic is given by Foti and Williamson (2015). A key early paper is by MacEachern (2000), who introduced the framework of the dependent Dirichlet process and which inspired many later developments. Some of the existing approaches in the literature include starting from the stick breaking representation of a random measure and allowing distribution atoms or weights to be covariate dependent (De Iorio et al., 2004;Gelfand et al., 2005;Griffin and Steel, 2006;Dunson and Park, 2008); consideration of covariate dependent generalizations of the Chinese restaurant process or Pólya urn prediction rule (Blei and Frazier, 2011;Caron et al., 2007); as well as methods that build on normalized completely random measures (Kingman, 1967;Lijoi and Prünster, 2010) and which use their relationship with Poisson processes to introduce covariate dependence in various ways (Rao and Teh, 2009;Chen et al., 2013;Lijoi et al., 2014). The above list of references is by no means exhaustive. For the special case of grouped data, the hierarchical Dirichlet process (Teh et al., 2006) has also been an extremely important development.
As mentioned, in the present work we consider the model of Zhang et al. (2010) which is attractive in the case of multivariate response for the convenient mechanism it represents for borrowing strength across regressions for different components through the matrix-variate normal base prior. Our objective is to develop fast online variational Bayes methods which allow the model of Zhang et al. (2010) to be applied with large datasets. The approach adopted builds on the VSUGS algorithm of  for Dirichlet process mixture models, which is a variational extension of the SUGS algorithm of Wang and Dunson (2011). Lin (2013) independently developed a similar algorithm to that of . The development of fast variational methods for complex Bayesian nonparametric models has been a very active area of recent research, with an important early paper being Blei and Jordan (2006) where a batch variational algorithm for fitting Dirichlet process mixture models was developed. In the online setting, some recent contributions include  and Bryant and Sudderth (2012) who consider online algorithms for the hierarchical Dirichlet process, and various methods inspired by the stochastic variational inference framework of Hoffman et al. (2013) (for example, Wang and Blei (2012)). Kabisa et al. (2016) consider a fast online approach to fitting high-dimensional correlated data with a model incorporating some Bayesian nonparametric components; their method is a variational Bayes algorithm which is similar in approach to methods originally developed by Sato (2001). Luts et al. (2014) consider online approaches to fitting semiparametric regression models in the variational Bayes framework.
The next section describes the matrix-variate Dirichlet process mixture model that is considered throughout the rest of the article. In Section 3, a batch variational algorithm for the model is derived and then Section 4 discusses the VSUGS online algorithm which is able to work efficiently for very large datasets. Section 5 discusses predictive inference and our novel regression adjustment approach. Section 6 considers an application to weak informative prior selection, Section 7 considers predictive performance of the methods in some benchmark data sets and Section 8 concludes.

Matrix-variate Dirichlet process mixture model
We consider the matrix-variate Dirichlet process mixture model of Zhang et al. (2010). Specifically, let y i , i = 1, . . . , n denote a collection of observed m-dimensional response vectors and x i , i = 1, . . . , n denote corresponding p-dimensional vectors of covariates. A common flexible way to model the mean in a multivariate regression for the responses involves using some basis expansion where, denoting the jth element of y i by y ij , where E r (x), r = 1, . . . , N are basis functions and β j = (β 0j , . . . , β N j ) T are coefficients, j = 1, . . . , m. In motivating their approach Zhang et al. (2010) discuss such a basis expansion, and consider setting N = n and E r (x) = K(x, x r ) where K(·, ·) is a kernel function so that the number of basis terms equals the number of observations. Here we will be concerned with an online implementation of their approach where n is not known beforehand, so we will make a fixed choice of both N and the basis functions E r (x), r = 1, . . . , N . We give more details about this later. Write β = [β 1 , . . . , β m ] for the (N + 1) × m matrix of regression coefficients and E i = (1, E 1 (x i ), . . . , E N (x i )) T . Then if we assume i.i.d errors in the regression (1) we can write where the i are the errors having mean 0 and covariance matrix τ Σ say where τ > 0 is a scale parameter. The reason for parametrizing the covariance matrix in this way will become clear later when conjugate prior specifications are considered. Flexible multivariate regression approaches using basis expansions of this type have been considered by many authors. The innovation of Zhang et al. (2010) is to consider a model in which the coefficient β varies randomly between observations. The distribution of this coefficient is estimated from the data, and is given a Dirichlet process prior with a matrix-variate normal distribution as the base measure. That is, the Dirichlet process with matrix-variate normal base measure is used as a prior on the mixing distribution for the coefficient. The clustering property of the Dirichlet process ensures that many observations will share the same coefficient matrix and there is borrowing of strength both between observations and responses in estimating the regression. Precisely, the model is where DP (α, M ) denotes the Dirichlet process with precision parameter α and base measure M . The base measure M in the model is chosen to be a matrix-variate normal distribution N N +1,m (0, Ω ⊗ Σ). An s × t random matrix Z has a matrix-variate normal distribution N s,t (C, V ⊗ W ), where C is an s × t matrix and V and W are s × s and t × t covariance matrices respectively, if its density takes the form In our model following Zhang et al. (2010) it will be assumed that Ω is diagonal, Ω = diag(ω 1 , . . . , ω N +1 ) where ω i ∼ IG(a i , b i ) with a i and b i known. Also, Σ is inverse-Wishart with degrees of freedom ν and scale matrix S. τ is given an inverse gamma prior IG(a τ , b τ ) with a τ and b τ known. The Dirichlet process puts all its mass on a countable collection of points so we can rewrite the model in the following way. Let . We let δ i be an integer valued variable with δ i = j ifβ i = β j . Write δ 1:i = (δ 1 , . . . , δ i ) T . Using the Pólya urn representation for the Dirichlet process we can rewrite the model in the form where p(δ 1:n ) = p(δ 1: , the priors on Ω and Σ are the same as before and the conditional densities p(δ i |δ 1:i−1 , α) are defined by (using similar notation to Zhang et al. (2010)) j is the number of δ k , k < i equal to j and n i is the number of distinct β k appearing up to time i − 1. For the purpose of developing our fast online variational approximation algorithm we will use a truncated Dirichlet process mixture model. In this model the sequence where T is the truncation point and This is the model we discuss in what follows.

Variational inference
Consider a Bayesian model with parameter ξ, prior p(ξ) and likelihood p(y|ξ). Variational Bayes computational methods (Waterhouse et al., 1996;Jordan et al., 1999;Attias, 2000;Ormerod and Wand, 2010) attempt to approximate the posterior density p(ξ|y) by a more tractable and manageable variational density q(ξ), belonging to a convenient family. The choice of q(ξ) within the approximating family is usually made by minimizing the KL divergence between p(ξ|y) and q(ξ). It can be shown that where p(y) = p(ξ)p(y|ξ) dξ. The first and second terms on the RHS of (5) are the variational lower bound L (so-called because it forms a lower bound on log p(y)) and the KL divergence between q(ξ) and q(ξ|y), respectively. From (5), it is clear that minimizing the KL divergence is equivalent to maximizing L. For further background see the references above. Now, suppose that ξ can be partitioned into J subvectors, ξ 1 , .., ξ J . In variational Bayes, an approximating family for the posterior is considered where q(ξ) is assumed to factorize as J j=1 q(ξ j ). For each of the factors q(ξ j ), the lower bound is maximized with the other factors held fixed by choosing q(ξ j ) aŝ where E −ξ j denotes an expectation with respect to i =j q(ξ i ). Expression (6) is the basis of a blockwise gradient descent algorithm for maximizing L where an initial choice is made for the factors and then each factor is updated in turn with the others fixed at current values until convergence. One useful application of the variational approach is to approximate the posterior distribution of parameters in Bayesian nonparametric models. It is well known that there is usually no direct way to compute the posterior distribution in these models and that MCMC sampling methods for such models can be difficult and computationally expensive. These considerations motivated Blei and Jordan (2006) to consider a mean-field variational inference algorithm for Dirichlet process mixture models. Their approach can be implemented for the model of Section 2, since the approach of Zhang et al. (2010) is based on an ordinary Dirichlet process mixture model, and we do implement such an approach later in our examples. Since this is a straightforward application of the algorithm of Blei and Jordan (2006) we do not give further details of their method here. However, we develop an alternative batch variational Bayes algorithm which is also described in the next section. The algorithm of Blei and Jordan (2006) is based on the stick breaking representation of the Dirichlet process; our alternative batch variational Bayes algorithm (like the later sequential algorithm of Section 4) is based on the Pólya urn representation with the unknown mixing distribution integrated out. Although the alternative batch algorithm involves some further approximations, the purpose of developing this method is that it gives a batch algorithm similar to our later online approach, and provides another reference for comparison for the performance of the online algorithm where how much performance is lost through the sequential updating mechanism can be better understood. Also, many of the updating steps in the online algorithm are simple modifications of the corresponding steps for the batch algorithm.
q(δ i = j) will be denoted by q ij . In this subsection we give the mean field updates for all factors except for q(δ 1:n ), which is considered in the next subsection. Technical details of the derivations are found in Appendix A.
For β, we recognize the form of q(β, Σ) as being q(β, whereβ j,i is the ith row ofβ j and ω ij is the ith diagonal element of V −1 j .

Batch mean field update for local parameters
We now factorize q(δ 1:n ) as n i=1 q(δ i ) and consider approximate mean field updates for q(δ i ), i = 1, . . . , n. Using (6), for each δ i , we get where δ =i denotes δ 1:n with δ i omitted. Making the approximation with the case where i < T being handled by using the same expression but conditioning on δ i ≤ i. This approximation is obtained by reordering so that the ith observation is last, taking an expectation in (4) and then restoring the constraint associated with the original ordering by conditioning on δ i ≤ i if i ≤ T . Note that because we order atoms according to their order of occurrence it must be the case that δ i ≤ i for i ≤ T . To get an expression for our approximate mean field update it remains to evaluate E q {log(y i |θ j )} which is

VSUGS for matrix-variate Dirichlet process mixture model
The VSUGS algorithm, proposed by , is an online learning procedure for fast fitting of Dirichlet process mixture models. It uses the variational approximation framework to improve the SUGS algorithm (Wang and Dunson, 2011). The VSUGS algorithm is especially useful for large datasets as computing the full variational batch update or using MCMC might be computationally infeasible. The framework of the VSUGS procedure is as follows. Following , we consider an approximation to the posterior p(δ 1:i−1 , θ 1:T |y i:i−1 ) of the form The algorithm starts atq 1 (δ 1 = 1) = 1,q(θ 1 ) = p(θ 1 |y 1 , δ 1 = 1). Then, at time i, we useq i−1 (θ) andq i−1 (δ 1:i−1 ) as a prior for processing the data point y i . Then for a certain fixed choice ofq i (δ i ) the mean field update for θ reduces to the following approximation of p(θ|y):q For the assignment variables δ i , we follow  and choosê for j ∈ {1, ..., min(i, T )} where T is the pre-specified truncation point for the number of mixture components and One property of the VSUGS procedure is that (9) splits the likelihood contribution from the ith observation among the mixture components. This deviates from the original SUGS algorithm (Wang and Dunson, 2011) which uses a "hard" allocation to mixture components. In the case of conjugate priors, the VSUGS algorithm retains the computational advantages of the original SUGS algorithm. See  for further details.

Sequential update of variational parameters for τ, Σ and Ω
In the batch update of the global parameters the expectations of τ −1 , Σ −1 and Ω −1 with respect to q(τ ), q(Σ) and q(Ω) respectively are required. For an online algorithm like VSUGS, these expectations change when a new data point enters. In order to use (8), it is required to replace the expectation of τ −1 , Σ −1 and Ω −1 with E q i (τ −1 ), E q i (Σ −1 ) and E q i (Ω −1 ) respectively, where E q i represents the variational expectation at time i. Following the derivation of the batch updates, our corresponding online learning update for the variational parameters is As the second term on the RHS of (8) does not include any terms for ω 1 , ..., ω N +1 , there is no online learning required for these parameters. At every step of the online VSUGS algorithm, we continue using the batch update for q i (ω 1:N +1 ). For the full algorithm, we refer to Algorithm 1.

Sequential VSUGS type update for the δ i
Suppose we assimilate observations sequentially and at step i − 1 we have a variational posterior distribution of the form Using the VSUGS approximation, we takê where The integral in (12) can be evaluated as (see Appendix B) j . This last integral does not seem to be easily computable analytically. It is an expectation with respect to q i−1 (τ ), and if this distribution is concentrated around the mean it is reasonable to make the approximation f (τ )q i−1 (τ )dτ = f (E q,i−1 (τ )) for functions f (τ ) and where we have written E q,i−1 (τ ) = τ q i−1 (τ )dτ . Using this approximation here we get that the integral is approximately 2π τ −1 is the expectation of τ −1 with respect to q i−1 (τ ).

Algorithm 1 : VSUGS algorithm for Matrix DPMM
andβ j,k is the kth row ofβ j . end for end for 5 Posterior predictive inference Suppose we are given a new input vector x 0 and wish to predict the response vector y 0 .
Assuming n > T , we replace p(δ 0 = j|y 1:n ) with r n+1 j as defined in (12). Also, replacing p(τ, β j , Σ|y 1:n ) with the corresponding variational posterior q n (τ, β j , Σ), the predictive density becomes The integral in (15) evaluates to a multivariate t-distribution. So an approximate posterior predictive density is obtained as a mixture of multivariate t-densities. For more details, including the parameters of the multivariate-t mixture components, see Appendix C.
5.1 Regression-type adjustment for improving predictive inference One advantage of using the matrix-variate Dirichlet process approach to flexible regression is that avoiding covariate dependence in the mixing weights greatly simplifies computation, something that we have exploited here for implementing an online algorithm. However, this does place a greater burden on the mean functions in the regression mixture components to model the response distribution in a flexible way. Here we consider a method for improving predictive performance of the fitted model, borrowing an idea from the literature on regression adjustment methods for approximate Bayesian computation (Beaumont et al., 2002;Blum, 2010;Blum and François, 2010;Blum and Tran, 2010). The idea below is given in equation (4.1) of Blum and Tran (2010).
Suppose we wish to consider prediction of a new response y * = (y * 1 , . . . , y * m ) T to be observed with corresponding covariate x * = (x * 1 , . . . , x * p ) T . Write N k (x * ) for the k nearest neighbours of x * among the observed covariates {x 1 , . . . , x n }. We write the corresponding values of (x, y) as (x i 1 , y i 1 ), . . . , (x i k , y i k ) so that i 1 , . . . , i k denote the indices of the covariates in N k (x * ). In our fitted regression model, writeF j (.|x) for the marginal distribution function of the jth component of the response in the fitted model at x, andF −1 j (·|x) for its inverse where it is assumed this exists. If the fitted model is correct, . . , k, then marginally y a 1j , . . . , y a kj is a sample fromF j (·|x * ) (if the regression model is correct).
The sample y a r , r = 1, . . . , k can be used to do approximate predictive inference. The advantage of this method compared to usingF j (·|x * ) directly is that by using, in effect, quantile residuals locally around x * to define the particles y a r we are able to adjust for any local misfit of the regression model. This can result in improved predictive inference. Note that by transforming the particles y ir component-wise we are not guaranteed to preserve the correct multivariate dependence structure in the fitted model at x * , but if the copula of the fitted distribution changes only slowly with x over the neighbourhood used the effects of this approximation are minor.

Application to weak informative prior selection
As an application of our proposed methodology, we consider flexible approximation of prior predictive densities as a function of a prior hyperparameter value based on data simulated under a model, when these prior predictive densities are not analytically tractable. Approximating such predictive densities is useful for prior choice. In the application considered here we make use of the way that the MDP mixture model is able to approximate the whole response distribution flexibly. In the next section we will look more closely at the quality of point predictions of the online algorithm compared to those obtained by batch VB and MCMC approaches. Consider a statistical model p(y|ξ) for data y with parameter ξ. Suppose we have a class of priors p(ξ|λ) where λ is a hyperparameter value to be chosen. We also suppose that there is a value λ 0 for λ that has already been chosen tentatively as representing our best current prior knowledge of ξ. For the purpose of sensitivity analysis, we may wish to define a prior that is less informative than p(ξ|λ 0 ), and this might be particularly useful in the case where the information brought by the prior and likelihood seem to be contradictory. Evans and Jang (2011) considered defining the amount of information in a prior p(ξ|λ) relative to p(ξ|λ 0 ) through the idea of prior-data conflict. The notion of weakly informative priors formalized in Evans and Jang (2011) was inspired by previous work of Gelman (2006).
Since the idea of Evans and Jang (2011) is built on the idea of checking for prior-data conflict, this needs to be understood first. Prior-data conflict occurs where the prior puts all its mass out in the tails of the likelihood. A way of testing for prior-data conflict which modifies a suggestion of Box (1980) will be considered here, following Evans and Moshonov (2006). Their idea is that a minimal sufficient statistic value S determines the likelihood, so we can check if the observed likelihood is in conflict with the prior by seeing whether the observed value of the sufficient statistic S obs say lies out in the tails of its prior predictive distribution. A p-value for checking for conflict with the prior p(ξ|λ) can be computed as where S ∼ p(S|λ) and p(S|λ) = p(S|ξ)p(ξ|λ)dξ is the prior predictive distribution of S. If a non-trivial sufficient statistic does not exist it may be reasonable to choose an asymptotically sufficient statistic such as the maximum likelihood estimator or some approximation to it. Note that p(S obs , λ) is calculating the probability that a random draw from p(S|λ) has lower density than the value of S obs and it is small if S obs lies out in the tails of p(S|λ). The above prior-data conflict check can be modified in various ways -for more details see Evans and Moshonov (2006). To use this notion of prior-data conflict checking to define how informative the prior p(ξ|λ) is relative to p(ξ|λ 0 ) Evans and Jang (2011) consider S generated randomly under p(S|λ 0 ) and ask whether for data generated in such a way does doing the analysis under p(ξ|λ) rather than p(ξ|λ 0 ) result in a reduction of the frequency of prior-data conflicts. The occurrence of a conflict is defined by choice of a certain cutoff for a conflict p-value such as (17). It is possible to consider various modifications of the basic idea considering uniformity of reduction of levels of conflict over different p-value cutoffs, see Evans and Jang (2011) for more details.
Following the ideas of Evans and Moshonov (2006) and Jang (2011), Nott et al. (2015) propose modifying a regression adjustment approach used in the approximate Bayesian computation (ABC) literature to approximate prior predictive distributions p(S|λ) for many different λ in a computationally thrifty way when S may be expensive to compute. In particular, they consider the method of Blum and François (2010), which modifies a suggestion of Beaumont et al. (2002), to generate approximate samples from prior predictive densities p(S|λ) and then use these samples for the required computations. This approach is much more computationally efficient than generating a large number of replications of S at each value of λ independently for every value λ of interest on a grid, say. The method starts by generating values λ i , i = 1, ..., n from a pseudo-prior p(λ). Then values (ξ i , S i ), i = 1, .., n are generated for(ξ, S) from p(ξ|λ)p(S|ξ) where S is a minimal sufficient statistic or some asymptotically sufficient statistic. Nott et al. (2015) modify the ABC with regression adjustment method in Blum and François (2010) by reversing the usual role of the parameters and the summary statistics where these methods are used in the ABC context. They fit a regression model with where the i are i.i.d errors with zero mean and variance one and µ(λ) and σ(λ) are flexible mean and standard deviation functions. Blum and François (2010) parametrize µ(·) and σ(·) using neural networks. After fitting the model to the data to obtain estimatesμ(λ) andσ(λ), a sample of p(S|λ) can be obtained approximately by considering the fitted mean for the regression model plus the empirical residuals. The empirical residual for the ith point isˆ i =σ(λ i ) −1 (S i −μ(λ i )), and using such empirical residuals together with the fitted model at λ gives as an approximate sample from p(S|λ) if the regression model is correct. Based on the approximate sample S a i (λ), they use a kernel estimate to approximate p(S|λ). Let this kernel estimate bep(S|λ). Next, suppose that S 0 j , j = 1, . . . , n are draws from p(S|λ 0 ). Then a particle approximation to the distribution of p(S, λ) for S ∼ p(S|λ 0 ) is given by the valuesP (S 0 1 , λ), ...,P (S 0 n , λ) wherê The distribution of the p-value can be used to determine whether p(ξ|λ) is weakly informative relative to p(ξ|λ 0 ) or not. We propose using our approach to assess weak informativity of alternative priors compared to a base prior, similar to the above. However, instead of using the ABC with regression adjustment (18), we propose fitting a matrix-variate Dirichlet process mixture model with S as response and λ as predictors. In applying the MDP prior approach we also employ the regression adjustment method of Section 5.1 to obtain approximate samples from p(S|λ) at any desired value of λ. Kernel estimates of p(S|λ) are then constructed as for the approach of Nott et al. (2015) and the procedure above followed for approximating the distribution of conflict p-values for S generated from p(S|λ 0 ). As observed in Nott et al. (2015) high accuracy is not needed in the regression calculations; the regression calculations are simply a screening computation, and once a candidate value of λ is chosen for a weakly informative prior then for the single finally chosen value we can generate a large sample from the prior predictive distribution and see whether our approximate calculations were good enough. qqq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q qqq q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q qq q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q (b) Figure 1: (a) Estimated degree of weak informativity at γ = 0.05 for the bioassay example. (b) The plot points in grey and black represent estimated distribution of the conflict p-value using VSUGS-adjusted approach and by direct simulation from the prior predictive distribution. Both estimation uses the alternative prior σ 0 = σ 1 = 4.
We consider a bioassay example from Racine et al. (1986) which is also analysed in Gelman et al. (2008), Evans and Jang (2011) and Nott et al. (2015). In this dataset, four groups of five animals were exposed to different level of doses (x i ) and the number of death (y i ) were recorded. Following Nott et al. (2015) and Evans and Jang (2011), we consider a logistic regression setup. It is assumed that the covariate has been transformed to log scale and centred and scaled as in Gelman et al. (2008). The model is y i ∼ Bin(5, p i ) where logit(p i ) = c 0 + c 1 x i . We assume that the priors for c 0 and c 1 are independent and follow Gaussian distributions with zero mean and variances σ 2 0 and σ 2 1 respectively. For our base prior, we consider σ 0 = 10 and σ 1 = 2.5.
Evans and Jang (2011) consider the exact sufficient statistics (y 1 , y 2 , y 3 , y 4 ) for analysis. Nott et al. (2015) consider using the posterior mode (ĉ 0 ,ĉ 1 ) for a prior with σ 0 = σ 1 = 10 as an approximation to the MLE but which unlike the MLE will exist even in degenerate cases. They consider the MLE for the dimension reduction that it brings and as a generic choice applicable in situations where a non-trivial minimal sufficient statistic doesn't exist. For the statistic S used to define the conflict check in the definition of weak informativity, they use a transformation of (ĉ 0 ,ĉ 1 ) to the fitted probabilitiesp 2 andp 3 at x 2 and x 3 respectively. The reasons for this are discussed further in Nott et al. (2015). That is, our approximate sufficient statistic is (p 2 ,p 3 ) wherê p i = 1/(1 + exp(−ĉ 0 +ĉ 1 x i )) for i = 2, 3. Note that, because of the discreteness of the data, strictly the distribution of this statistic is also discrete but continuity may be used as a reasonable approximation when the number of different possible values is large and we do this here. Note also that in the kernel density estimation we ignore any boundary effects due to the bounded support of the statistics.
To use our methodology to investigate weak informativity with respect to the base prior in this example we proceed as follows. First, we generate 400,000 values of (σ 0 , σ 1 ) from a pseudo prior which is uniform distribution on [0.1, 10] × [0.1, 20]. We label these values as σ (i) = (σ We use these probabilities to generate (y 3 ) T and K(σ (i) , σ (j) ) and K(σ (i) , σ (j) ) = exp{−||σ (i) − σ (j) ||/2κ 2 }, where ||.|| is the Euclidean norm and κ 2 is the mean of the euclidean distance among 5000 random samples drawn from σ (1) , ..., σ (400,000) . Zhang et al. (2010) propose a similar choice of the kernel hyperparameter κ 2 and it is verified to be effective in their experimental analysis. We fit the matrix-variate Dirchlet process mixture model withp (i) as our response vector and the basis functions (1, K(σ (i) , σ (1) ), .., K(σ (i) , σ (N ) )) as our covariates. We set α = 100 and T = 4. For our prior, we set a τ = 5, b τ = 0.5, a i = 5, b i = 0.5, S = I 2 + 1 2 1 2 1 T 2 , ν = 3 and M 1 , ..., M T , C 1 , .., C T as zero matrices. We first run Algorithm 1 to fit the model. For each λ σ on a 100 × 100 regular grid on [0.1, 10]×[0.1, 20], we take 1000 nearest neighbours from the set {σ (i) } i=1,...,400,000 using the knnsearch function in matlab. Then, we use these 1000 nearest neighbour to estimate the corresponding (p 2 ,p 3 ) T using the regression adjusted approach proposed in Section 5.1. Note that if we were to generate 1000 samples directly for the prior predictive for each of our 10, 000 grid points directly this would increase the number of prior predictive simulations and the computational effort by an order of magnitude. The collection of λ σ covers the support of the hyperprior and each λ σ is considered as an alternative prior for comparison with the base prior. We follow the suggestion from Evans and Jang (2011) to measure the degree of weak informativity of an alternative prior. That is, we let p γ be the γ% quantile of the conflict p-value distribution for the base prior. Under the alternative prior, let q γ be the probability of a conflict p-value which is less than or equal to p γ . The degree of weak informativity ζ γ is defined as Figure 1(a) plots the degree of weak informativity ζ γ for all λ σ when γ = 0.05. We observe that the plot is very similar to Figure 2 in Nott et al. (2015). From the plot, it seems that σ 0 = σ 1 = 4 is a suitable choice for a weakly informative prior. This conclusion agrees with the parameter choice of Nott et al. (2015). As mentioned earlier, the approximate regression calculations are simply screening calculations where high accuracy is not needed since the quality of the final answer can be checked. Figure 1(b) shows a comparison of the estimated distribution of the conflict p-values based on regression (grey) compared to one based on direct simulation from the prior predictive at the finally chosen λ (black). In the lower tail, which is what matters for declaring the existence of any conflict and defining weak informativity, the two distributions agree very well, and the approximate regression calculations have successfully allowed us to identify a suitable weakly informative prior. Note that standard procedures such as Gibbs sampler proposed in Zhang et al. (2010) are not suitable for use in this application with the matrix-variate DP prior model as we need to generate a large number of data points from the hyperprior to obtain good estimates of the prior predictive distributions p(S|λ) and so a method is needed that is able to handle large datasets.

Empirical comparisons of predictive performance
In this section, we focus on the predictive performance of the various approaches when fitting a model. We consider the Gibbs sampler (Zhang et al., 2010) and three different versions of the variational procedure. The first variational approach, which we call VB (Stick breaking), is the method of Blei and Jordan (2006). The second variational method, which we call VB (Pólya urn) is the batch variational method discussed in Section 3. In the third variational approach, we consider the VSUGS approach discussed in Section 4. In our experiment, the predictors are standardized to have zero mean and unit variance with respect to the training set. We fixed the number of iterations for the Gibbs sampler to be 25000, of which the first 15000 will be discarded as burn in. For the remaining 10000 iterations of the Gibbs sampler, we retain every 10th realization of the parameters. We also fixed the number of iterations for the variational approach using batch updates to 100. For the online variational approach, we first initialize the variational parameters of the assignment variables on a relatively small number of data points using the batch update and then run Algorithm 1.
We measure the performance of the various approaches by considering their root mean square error (RMSE) and mean absolute percentage error (MAPE). Letỹ 1 , ..,ỹ m be our target response values andŷ 1 , ..,ŷ m be their respective fitted value. The error indicators are defined as Similar performance measures are used in Zhang et al. (2010). In our analysis, we compute the RMSE and MAPE both in-sample (for the training set) and out-of-sample (for the test set). Although we are mostly interested in out-ofsample predictive performance, looking at in-sample measures of fit can also be useful here where we are comparing several computational approximations for the same posterior; measures of in-sample fit can be revealing about differences in the quality of posterior approximation even if out-of-sample predictive performance is similar for the different methods. For the RMSE and MAPE of in-sample predictions, they are constructed as follows. For the Gibbs sampler, at each retained MCMC realization r, r = 1, . . . , 1000, we have a cluster allocation for each y 1 , ..., y n . Suppose let's say that for a particular data point y i , the allocation at the r realization is the jth component. Then our corresponding regression coefficient estimate would then be β * i,r = β T j,r , where β 1,r , .., β T,r are the rth MCMC coefficient matrix realizations. Then, our in-sample fitted value for y i is estimated as For all variational procedures, for each y i , we have the corresponding variational posterior probability of the assignments as well as the posterior mean. We use the weighted sum of these coefficient matrices according to the posterior probability to compute an in-sample fitted value.
Out of sample fitted values are obtained from the posterior predictive distributions of the respective procedures. For the Gibbs sampler, we will use the predictive distribution from equation (7) in Zhang et al. (2010). For the VSUGS procedures, we use the posterior predictive mean to fit each y i . Details of the posterior predictive mean can be found in Section 5. For prediction accuracy of the test set, we also consider the adjusted VSUGS approach, which is to use the VSUGS with the regression-type adjustment in Section 5.1.

Energy Data
In this example, we consider the energy efficiency data created by Tsanas and Xifara (2012). This dataset is available at http://archive.ics.uci.edu/ml/datasets.html. Tsanas and Xifara (2012) studied the effect of eight input variables (relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area, glazing area distribution) on two output variables, namely heating load (HL) and cooling load (CL), of residential buildings. The dataset contains 768 instances. We randomly select 100 data points as the test set and use the remaining 668 data points as the training set.
The settings for fitting our model are as follows. We set the number of basis function, N , at 200 and use the same kernel discussed in Section 6.1. Our choice of α is 3, which is set by rounding off the average of the 10000 iterations of α retained from the Gibbs sampler. For comparison of performance, we run both the Gibbs sampler and the variational procedures with a fixed α = 3. For the hyperparameters of the priors, we set a τ = 5, b τ = 0.5, a i = 20, b i = 0.5, S = I 2 + 1 2 1 2 1 T 2 , ν = 3 and M 1 , ..., M T , C 1 , .., C T as zero matrices. We also set the maximum number of possible components as T = 10. For the variational approach with batch update, we initialize each assignment variables randomly to one of the T components and set its variational probability to one. For the matrix VSUGS approach, we initialize the variational parameters for the first 200 assignment variables using the VB(Pólya urn).  Measures of in-sample fit and computation times are presented in Table 2. The two batch VB methods have similar in-sample fits, but the Gibbs sampling and VSUGS approaches produce quite different results. Table 3 considers out-of-sample predictive accuracy. Table 2 shows that all three variational approaches without the regression-type adjustment perform similarly. The RMSE and MAPE of the Gibbs sampler is higher than the rest. Table 2 also shows that using regression-type adjustment significantly improves prediction accuracy with respect to both RMSE and MAPE.

Robot Arm Data
In this subsection, we analyse the performance of our proposed algorithm on the robot arm data. This dataset, available from www.gaussianprocess.org/gpml/data, relates to an inverse dynamics problem for a seven degrees-of-freedom SARCOS anthropomorphic robot arm. The dataset has 21 covariates and 7 responses and has training and test sets of sizes 44448 and 4449 respectively. The 21 covariates consist of 7 joint positions, 7 joint velocities and 7 joint accelerations and the 7 responses consist of 7 joint torques.
We follow the same procedure used in the energy dataset for our settings. The value of α is set to 12. For the hyperparameters of the prior, we set a τ = 5, b τ = 0.5, a i = 20, b i = 0.5, S = I 7 + 1 7 1 7 1 T 7 , ν = 8 and M 1 , ..., M T , C 1 , .., C T as zero matrices. We also set the maximum number of possible components as T = 10. For the matrix VSUGS approach, we initialize the variational parameters of the first 500 data points using VB (Pólya urn).  The performance of the in-sample fits is presented in Table 3, for fitting to a subsample of size 2000. Corresponding out-of-sample fits for test set subsample of size 500 are shown in Table 4. We also initialize the variational parameters of the first 2000 data points using VB (Pólya urn) and then run Algorithm 1 on the full training dataset of size 44448. As the dataset is large, we do not use the Gibbs sampler or the variational procedures using batch updates for the full dataset. The results of out-of-sample predictive performance are presented in Table 5. It is clear that there is a significant prediction accuracy improvement in terms of RMSE and MAPE when using the adjusted VSUGS.   All the algorithms are run on a Mac 3.2Ghz i5 Quad core processor with code written in matlab. For the energy efficiency dataset, as reflected in Table 1, the computation time for Gibbs sampler, VB (Stick breaking), VB (Pólya urn) and matrix VSUGS require approximately 141, 18, 18 and 3 minutes respectively. For the robot arm dataset, Table 6 shows that the Gibbs sampler, VB (Stick breaking), VB (Pólya urn) and matrix VSUGS require 2422, 368, 365 and 93 minutes respectively. The amount of time required for matrix VSUGS to run the full dataset is 584 minutes, which is significantly shorter than the amount of time it takes for the Gibbs sampler to run on a much smaller dataset. In fact, two third of the computation time is spent on initializing the first 2000 data points.

Discussion
In this article, we study variational computational methods for fitting the matrix-variate Dirichlet process mixture model of Zhang et al. (2010), extending the VSUGS approach by  for Dirichlet process mixtures of normal densities. The method we develop is computationally efficient and especially useful as an alternative to MCMC for analysis of medium to large datasets. In order to increase prediction accuracy, we also propose a regression-type adjustment for improving predictive inference. The adjustment approach is shown to be useful in several real applications.

Appendix A -Variational Batch update
In the derivation of the update for each block γ say we will write simply E q (·) for the expectation with respect to the current variational posterior distribution q with γ integrated out and will not denote the dependence on the block explicitly in the notation. The meaning will be clear from the context. We consider the mean field update for (β, Σ) first. We have Apart from constant terms not depending on β, Σ we have where (again apart from terms not depending on β, Σ) This gives (again up to an additive constant) log q(β, Σ) = − (n + N + 2)m + ν + 1 2 log |Σ| − 1 2 tr(SΣ −1 ) To simplify this, writê This means that up to an additive constant Then we recognize the form of q(β, Σ) as being q(β, Σ) = q(Σ)q(β|Σ) where q(Σ) is inverse Wishart, . Next, let's consider the mean field update for q(τ ). We have Apart from additive constants Next, To evaluate the inner conditional expectation, we use the following Lemma (Guptar and Nagar, p. 60).
Using Lemma 1 and some simple algebra we obtain So apart from additive constants log q(τ ) = −(a τ + nm 2 + 1) log τ Hence we recognize that q(τ ) is inverse gamma, Next we consider the variational update for ω. We have that apart from additive constants where β j,i and M j,i denote the ith rows of β j and M j respectively. Writing and noting that β j,i |Σ ∼ N (β j,i , ω ij Σ) whereβ j,i is the ith row ofβ j and ω ij is the ith diagonal element of V −1 j we have So apart from additive constants Appendix B -Sequential update for δ i To help evaluate the integral we use the following lemma.
Lemma 2. (a) Let Z, V , W and C be matrices with Z s × t, V > 0 s × s, W > 0 t × t and C s × t. Then Then Proof. Part a) of the lemma follows easily from the fact that the integrand is an unnormalized matrix normal distribution. Using the fact that the integral of the corresponding normalized density is one and rearranging gives the result. Part b) follows in a similar way by noting that the integrand is an unnormalized inverse Wishart density.
Using Lemma 2 to help evaluate the integral we get j . Next, use part (a) of Lemma 2 and the explicit form for q i−1 (Σ) to get Using part (b) of Lemma 2, this is equal to

Appendix C -Posterior predictive distribution
We now evaluate the integral in (15). Following the same steps to get (14), we have p(y 0 |y 1:n , τ, Σ, β j , δ 0 = j)q n (τ )q n (β j )q n (Σ)dβ j dτ dΣ = 2π . We now show that after simplification of the term inside the determinant we obtain the multivariate t-distribution. We require here the matrix determinant lemma, which states the following.
Lemma 3. For any invertible matrix A and vector u and v we have |A + uv T | = (1 + v T A −1 u)|A|.
This gives us (where we suppress dependence on j in the notation). Using Lemma 3, we have As the term inside the absolute value is a quadratic function in y 0 , the predictive distribution is a multivariate t-distribution. Letting A = µ (n) j Λ −1 E 0 (again suppressing dependence on j in the notation), we have Thus, the jth component in the expression for the posterior predictive density (15) is a multivariate t-density with location − 1 2 A −1 B and variance 1 ν (n) −m−1 A(1 − 1 4 B T A −1 B) −1 −1 . Hence (15) is approximately a mixture of multivariate t-densities.
Suppressing the expectations with respect to q i , evaluating the terms involving τ and ω j gives us