AIMS: AVERAGE INFORMATION MATRIX SPLITTING

Abstract. For linear mixed models whose covariance matrices do not depend linearly on the variance component parameters, we prove that the average of the observed information and the Fisher information can be split into two parts. The essential part enjoys a simple and computationally friendly formula, while the other part, which is expensive to compute, is a random zero matrix and thus negligible.


1. Introduction. Many statistical methods require the estimation of unknown (co-)variance parameters. The estimate is usually obtained by maximizing a log-likelihood function. In principle, one requires the observed information matrix, that is, the negative Hessian matrix of the log-likelihood, to obtain a maximum likelihood estimator via the Newton-Raphson method [11]. The expected value of the observed information matrix is usually referred to as the Fisher information matrix, or simply the information matrix. It keeps the essential information about the unknown parameters and enjoys a simpler formula; therefore it is widely used in many applications [3]. The resulting algorithm is called the Fisher scoring algorithm, which is widely used in informetrics [13] and is now a standard procedure in computational statistics [7, p.30].
The Fisher scoring algorithm is a success in simplifying the approximation of the Hessian matrix of the log-likelihood. Still, evaluating the elements of the Fisher information matrix remains one of the bottlenecks in a log-likelihood maximization procedure, which prohibits the use of the Fisher scoring algorithm for large data sets. In particular, high-throughput technologies in biological science and engineering mean that the size of data sets and of the corresponding statistical models has suddenly increased by several orders of magnitude. Further simplification of the computational procedure is much needed for large-scale statistical models, such as those in genome-wide association studies, which involve many thousands of parameters to be estimated [22].
The aim of this short note is to provide a concise mathematical result: the average of the observed information and the Fisher information can be split into two parts; the essential part enjoys a simple formula and is easy to compute, while the other part, which is expensive to compute, is a random zero matrix and thus negligible. Such a splitting and approximation provides a significant reduction in computation. We should mention that the average information idea was proposed in [12] for covariance matrices which depend linearly on the variance parameters. It resulted in an efficient algorithm for breeding applications [6], followed by [14]. However, these previous results assume that the variance-covariance matrix depends linearly on the underlying variance parameters. Here we prove that similar results can still be obtained even when the variance-covariance matrix is not linearly dependent on the underlying variance parameters.

2. Preliminary. Consider the following widely-used linear mixed model [1, 2, 5, 28, 21]
\[
y = X\tau + Zu + \varepsilon,
\]
where $y \in \mathbb{R}^{n\times 1}$ is the observation, $\tau \in \mathbb{R}^{p\times 1}$ is the vector of fixed effects, $X \in \mathbb{R}^{n\times p}$ is the design matrix corresponding to the fixed effects, $u \in \mathbb{R}^{b\times 1}$ is the vector of unobserved random effects, $Z \in \mathbb{R}^{n\times b}$ is the design matrix corresponding to the random effects, and $\varepsilon \in \mathbb{R}^{n\times 1}$ is the vector of residual errors. The random effects, $u$, and the residual errors, $\varepsilon$, follow multivariate normal distributions such that
\[
E(u) = 0, \qquad E(\varepsilon) = 0, \qquad
\operatorname{var}\begin{pmatrix} u \\ \varepsilon \end{pmatrix}
= \sigma^2 \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix},
\]
where $G \in \mathbb{R}^{b\times b}$ and $R \in \mathbb{R}^{n\times n}$. Typically $G$ and $R$ are parameterized matrices with unknown parameters to be estimated. Precisely, suppose that $G = G(\gamma)$, $R = R(\phi)$, and denote $\kappa = (\gamma^T, \phi^T)^T$, $\theta = (\sigma^2, \kappa^T)^T$. Estimating our main concern, the variance parameters $\theta$, requires a conceptually simple nonlinear iterative procedure: one has to maximize a residual log-likelihood function of the form [18, p.252]
\[
\ell_R(\theta) = -\frac{1}{2}\Big[(n-\nu)\log\sigma^2 + \log\det H + \log\det\big(X^T H^{-1} X\big) + \frac{y^T P y}{\sigma^2}\Big], \tag{3}
\]
where $H = R(\phi) + ZG(\gamma)Z^T$, $\nu = \operatorname{rank}(X)$, and
\[
P = H^{-1} - H^{-1}X\big(X^T H^{-1}X\big)^{-1}X^T H^{-1}.
\]
Here we suppose $X$ is of full rank, say, $\nu = p$. The first derivatives of $\ell_R$ are referred to as the scores for the variance parameters $\theta = (\sigma^2, \kappa^T)^T$ [18, p.252]:
\[
S(\sigma^2) = \frac{\partial \ell_R}{\partial \sigma^2} = -\frac{1}{2}\Big[\frac{n-\nu}{\sigma^2} - \frac{y^T P y}{\sigma^4}\Big], \qquad
S(\kappa_i) = \frac{\partial \ell_R}{\partial \kappa_i} = -\frac{1}{2}\Big[\operatorname{tr}(P\dot H_i) - \frac{y^T P \dot H_i P y}{\sigma^2}\Big], \tag{4}
\]
where $\kappa = (\gamma^T, \phi^T)^T$. We shall denote $\dot H_i = \partial H / \partial \kappa_i$ and $\ddot H_{ij} = \partial^2 H / \partial \kappa_i \partial \kappa_j$.
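To fix notation, the following sketch evaluates $H$, $P$, the residual log-likelihood (3), and the scores (4) with NumPy. It is our own minimal illustration, not code from this note; the helper names (`projection`, `resid_loglik`, `scores`) and the dense `inv`-based implementation are assumptions made for brevity.

```python
import numpy as np

def projection(H, X):
    """P = H^{-1} - H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1}."""
    Hinv = np.linalg.inv(H)          # dense for clarity; use solves in practice
    HX = Hinv @ X
    return Hinv - HX @ np.linalg.solve(X.T @ HX, HX.T)

def resid_loglik(y, X, H, sigma2):
    """Residual log-likelihood (3), up to an additive constant."""
    n, nu = X.shape                  # assumes X has full column rank, nu = p
    _, logdetH = np.linalg.slogdet(H)
    _, logdetXHX = np.linalg.slogdet(X.T @ np.linalg.solve(H, X))
    quad = y @ projection(H, X) @ y
    return -0.5 * ((n - nu) * np.log(sigma2) + logdetH + logdetXHX + quad / sigma2)

def scores(y, X, H, Hdots, sigma2):
    """Scores (4); Hdots is the list of derivatives dH/dkappa_i."""
    n, nu = X.shape
    P = projection(H, X)
    Py = P @ y
    s_sigma2 = -0.5 * ((n - nu) / sigma2 - (y @ Py) / sigma2**2)
    s_kappa = [-0.5 * (np.trace(P @ Hd) - Py @ Hd @ Py / sigma2) for Hd in Hdots]
    return np.array([s_sigma2, *s_kappa])
```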
Algorithm 1 Newton-Raphson method to solve $S(\theta) = 0$.
1: Give an initial guess $\theta_0$
2: for $k = 0, 1, 2, \cdots$ until convergence do
3:   $\theta_{k+1} = \theta_k + I_O(\theta_k)^{-1} S(\theta_k)$
4: end for

The negative Hessian of the log-likelihood function is referred to as the observed information matrix. We will denote this matrix by $I_O$.
Given an initial guess of the variance parameters, $\theta_0$, a standard approach to maximize $\ell_R$, or to find a root of the score equation $S(\theta) = 0$, is the Newton-Raphson method (Algorithm 1), which requires the elements of the observed information matrix.
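Algorithm 1 then takes only a few lines. The sketch below (again ours; it reuses `scores` from the previous sketch and approximates $I_O$ by a finite-difference Jacobian of the score, merely to keep the toy example self-contained) is an illustration, not a practical large-scale solver.

```python
def newton_raphson(theta0, score_fn, tol=1e-8, max_iter=50, h=1e-6):
    """Algorithm 1: theta_{k+1} = theta_k + I_O(theta_k)^{-1} S(theta_k)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        s = score_fn(theta)
        if np.linalg.norm(s) < tol:
            break
        # I_O = -(Jacobian of the score); column i by central differences
        J = np.empty((theta.size, theta.size))
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = h
            J[:, i] = (score_fn(theta + e) - score_fn(theta - e)) / (2 * h)
        theta = theta + np.linalg.solve(-J, s)
    return theta
```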
In particular, the element $I_O(\kappa_i, \kappa_j)$ involves two computationally intensive trace terms, which prohibits the practical use of the (exact) Newton-Raphson method for large data sets.
In practice, the Fisher information matrix, $I = E(I_O)$, is preferred. The elements of the Fisher information matrix have simpler forms than those of the observed information matrix; for example,
\[
I(\kappa_i, \kappa_j) = \frac{1}{2}\operatorname{tr}\big(P\dot H_i P \dot H_j\big).
\]
The corresponding algorithm is referred to as the Fisher scoring algorithm [13]. Still, the element $I(\kappa_i, \kappa_j)$ of the Fisher information matrix involves the computationally expensive trace terms. When the variance-covariance matrix $H$ depends linearly on the variance parameters, say $\ddot H_{ij} = 0$, researchers noticed that the average information matrix enjoys a simpler and more computationally friendly formula [6, 12].
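Concretely, combining the element formulas given in Lemmas 3.4 and 3.5 below, when $\ddot H_{ij} = 0$ the trace terms cancel in the average:
\[
\frac{1}{2}\big(I_O(\kappa_i,\kappa_j) + I(\kappa_i,\kappa_j)\big)
= \frac{1}{4}\Big[-\operatorname{tr}(P\dot H_i P\dot H_j) + \frac{2\,y^T P\dot H_i P\dot H_j P y}{\sigma^2} + \operatorname{tr}(P\dot H_i P\dot H_j)\Big]
= \frac{y^T P\dot H_i P\dot H_j P y}{2\sigma^2},
\]
a single quadratic form in the data, free of trace terms.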
For more general covariance matrices, when $\ddot H_{ij} \neq 0$, we still have the following nice property by the classical matrix splitting [20, p.94]. The average information matrix can be split into two parts as follows [25]:
\[
\frac{1}{2}(I_O + I) = I_A + I_Z,
\]
where the elements of $I_A$ are
\[
I_A(\sigma^2,\sigma^2) = \frac{y^T P y}{2\sigma^6}, \qquad
I_A(\sigma^2,\kappa_i) = \frac{y^T P \dot H_i P y}{2\sigma^4}, \qquad
I_A(\kappa_i,\kappa_j) = \frac{y^T P \dot H_i P \dot H_j P y}{2\sigma^2},
\]
and the elements of $I_Z$ are
\[
I_Z(\sigma^2,\sigma^2) = 0, \qquad
I_Z(\sigma^2,\kappa_i) = \frac{1}{4}\Big[\frac{\operatorname{tr}(P\dot H_i)}{\sigma^2} - \frac{y^T P \dot H_i P y}{\sigma^4}\Big], \qquad
I_Z(\kappa_i,\kappa_j) = \frac{1}{4}\Big[\operatorname{tr}(P\ddot H_{ij}) - \frac{y^T P \ddot H_{ij} P y}{\sigma^2}\Big].
\]
Such a splitting enjoys a nice property, which is presented as our main result.
3. Main result.

Theorem 3.1. Let $I_A$ and $I_Z$ be defined by the splitting above, so that $\frac{1}{2}(I_O + I) = I_A + I_Z$. Then $E(I_Z) = 0$ and, consequently, $E(I_A) = I$.

Such a splitting aims to remove computationally expensive and negligible terms so that a Newton-like method is applicable to large data sets involving thousands of fixed and random effects. It keeps the essential information in the observed information matrix. In this sense, $I_A$ is a good data-dependent (through $y$) approximation to the data-independent Fisher information. A quasi-Newton iterative procedure is obtained by replacing $I_O$ with $I_A$ in Algorithm 1.
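Computationally, every element of $I_A$ is a quadratic form in the vectors $Py$ and $\dot H_i P y$, so no traces of $n \times n$ matrix products are needed. A minimal sketch (ours, continuing the toy NumPy code above; in practice $Py$ would come from sparse solves rather than a dense $P$):

```python
def average_information(y, X, H, Hdots, sigma2):
    """Assemble I_A for theta = (sigma^2, kappa_1, ..., kappa_m)."""
    P = projection(H, X)
    Py = P @ y
    W = [P @ (Hd @ Py) for Hd in Hdots]        # W[j] = P Hdot_j P y
    m = len(Hdots)
    IA = np.empty((m + 1, m + 1))
    IA[0, 0] = (y @ Py) / (2 * sigma2**3)      # y'Py / (2 sigma^6)
    for i in range(m):
        IA[0, i + 1] = IA[i + 1, 0] = Py @ (Hdots[i] @ Py) / (2 * sigma2**2)
        for j in range(m):
            # y'P Hdot_i P Hdot_j P y = (Hdot_i P y)' (P Hdot_j P y)
            IA[i + 1, j + 1] = (Hdots[i] @ Py) @ W[j] / (2 * sigma2)
    return IA
```

Replacing the finite-difference information in the earlier `newton_raphson` sketch with `average_information` yields the quasi-Newton iteration described above.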
3.1. Auxiliary lemmas. The proof of the main result relies on the following lemmas.

Lemma 3.2. Let $P = H^{-1} - H^{-1}X(X^T H^{-1}X)^{-1}X^T H^{-1}$ and $\nu = \operatorname{rank}(X)$. Then (i) $PX = 0$; (ii) $PHP = P$; (iii) $\operatorname{tr}(PH) = n - \nu$; (iv) $P\,E(yy^T) = \sigma^2 PH$.
Proof. The first two items can be verified by direct computation. Since $H$ is a positive definite matrix, there exists $H^{1/2}$ such that
\[
\operatorname{tr}(PH) = \operatorname{tr}\big(H^{1/2} P H^{1/2}\big) = \operatorname{tr}\big(I - \tilde X(\tilde X^T \tilde X)^{-1}\tilde X^T\big) = n - \operatorname{rank}(X) = n - \nu,
\]
where $\tilde X = H^{-1/2}X$. The fourth item follows because
\[
P\,E(yy^T) = P\big(\operatorname{var}(y) + X\tau(X\tau)^T\big) = \sigma^2 PH + PX\tau(X\tau)^T = \sigma^2 PH.
\]

Lemma 3.3. Let $H$ be a parametric matrix of $\kappa$, and let $X$ be a constant matrix. Then the partial derivative of the projection matrix $P$ is
\[
\frac{\partial P}{\partial \kappa_i} = -P\dot H_i P.
\]
Proof. See Lemma B.3 in [8]. The result follows by using the derivative of the inverse of a matrix,
\[
\frac{\partial H^{-1}}{\partial \kappa_i} = -H^{-1}\dot H_i H^{-1}.
\]
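These identities are easy to sanity-check numerically. A throwaway sketch (ours, with arbitrary random data and the hypothetical toy model $H(\kappa) = e^{\kappa}ZZ^T + I$; it reuses `projection` from the first sketch):

```python
rng = np.random.default_rng(0)
n, p = 30, 3
X = rng.standard_normal((n, p))
Z = rng.standard_normal((n, 5))

def H_of(kappa):
    """Toy covariance model, nonlinear in kappa: H = exp(kappa) Z Z^T + I."""
    return np.exp(kappa) * Z @ Z.T + np.eye(n)

kappa, h = 0.3, 1e-6
H = H_of(kappa)
P = projection(H, X)
Hdot = (H_of(kappa + h) - H_of(kappa - h)) / (2 * h)
Pdot = (projection(H_of(kappa + h), X) - projection(H_of(kappa - h), X)) / (2 * h)

assert np.allclose(P @ X, 0)                         # (i)   PX = 0
assert np.allclose(P @ H @ P, P)                     # (ii)  PHP = P
assert np.isclose(np.trace(P @ H), n - p)            # (iii) tr(PH) = n - nu
assert np.allclose(Pdot, -P @ Hdot @ P, atol=1e-6)   # Lemma 3.3
```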

3.2. Formulae of the observed information matrix.
Lemma 3.4. The elements of the observed information matrix for the residual log-likelihood (3) are given by
\[
I_O(\sigma^2, \sigma^2) = \frac{y^T P y}{\sigma^6} - \frac{n-\nu}{2\sigma^4}, \tag{12}
\]
\[
I_O(\sigma^2, \kappa_i) = \frac{y^T P \dot H_i P y}{2\sigma^4}, \tag{13}
\]
\[
I_O(\kappa_i, \kappa_j) = \frac{1}{2}\Big[\operatorname{tr}(P\ddot H_{ij}) - \operatorname{tr}(P\dot H_i P \dot H_j)\Big]
+ \frac{1}{2\sigma^2}\Big[2\,y^T P \dot H_i P \dot H_j P y - y^T P \ddot H_{ij} P y\Big]. \tag{15}
\]
Proof. See Result 4 in [8]. The result in (12) is standard according to the definition. The result in (13) follows from the result in Lemma 3.3 if one uses the score in (4). The first term in (15) follows because
\[
\frac{\partial}{\partial \kappa_j}\operatorname{tr}(P\dot H_i) = \operatorname{tr}(P\ddot H_{ij}) - \operatorname{tr}(P\dot H_j P \dot H_i).
\]
The second term in (15) follows because, using the result in Lemma 3.3, we have
\[
\frac{\partial}{\partial \kappa_j}\big(y^T P\dot H_i P y\big) = y^T P\ddot H_{ij} P y - y^T P\dot H_j P\dot H_i P y - y^T P\dot H_i P\dot H_j P y.
\]
Further note that $\dot H_i$, $\dot H_j$ and $P$ are symmetric, so that $y^T P\dot H_i P\dot H_j P y = y^T P\dot H_j P\dot H_i P y$.
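As a sanity check (our throwaway sketch again, reusing `resid_loglik`, `projection`, and the toy `H_of` from the sketches above), the closed-form element (15) with $i = j$ can be compared against a negative second finite difference of the residual log-likelihood:

```python
y_obs = rng.standard_normal(n)    # an arbitrary stand-in data vector
sigma2, h2 = 1.7, 1e-4            # larger step for second differences

H = H_of(kappa)
P = projection(H, X)
Py = P @ y_obs
Hdot = (H_of(kappa + h2) - H_of(kappa - h2)) / (2 * h2)
Hddot = (H_of(kappa + h2) - 2 * H_of(kappa) + H_of(kappa - h2)) / h2**2

# closed-form element (15) with i = j
IO_closed = (0.5 * (np.trace(P @ Hddot) - np.trace(P @ Hdot @ P @ Hdot))
             + (2 * Py @ Hdot @ P @ Hdot @ Py - Py @ Hddot @ Py) / (2 * sigma2))

# negative second finite difference of the residual log-likelihood (3)
ll = lambda k: resid_loglik(y_obs, X, H_of(k), sigma2)
IO_fd = -(ll(kappa + h2) - 2 * ll(kappa) + ll(kappa - h2)) / h2**2

assert np.isclose(IO_closed, IO_fd, rtol=1e-3)
```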

3.3. Formulae of the Fisher information matrix. The Fisher information matrix, $I$, is the expected value of the observed information matrix, $I = E(I_O)$. The Fisher information matrix enjoys a simpler formula than the observed information matrix and captures the essential information provided by the data; thus it is a natural approximation to the negative Jacobian matrix of the score.
Lemma 3.5. The elements of the Fisher information matrix for the residual log-likelihood function (3) are given by
\[
I(\sigma^2, \sigma^2) = \frac{n-\nu}{2\sigma^4}, \qquad
I(\sigma^2, \kappa_i) = \frac{\operatorname{tr}(P\dot H_i)}{2\sigma^2}, \qquad
I(\kappa_i, \kappa_j) = \frac{1}{2}\operatorname{tr}\big(P\dot H_i P \dot H_j\big).
\]
Proof. The formulas can be found in [18]. Here we supply an alternative proof. First note that $PX = 0$, and according to Lemma 3.2,
\[
E(y^T P y) = E\big(\operatorname{tr}(P y y^T)\big) = \operatorname{tr}\big(P\,E(yy^T)\big) = \sigma^2 \operatorname{tr}(PH) = (n-\nu)\sigma^2. \tag{21}
\]
Therefore
\[
I(\sigma^2,\sigma^2) = E\big(I_O(\sigma^2,\sigma^2)\big) = \frac{(n-\nu)\sigma^2}{\sigma^6} - \frac{n-\nu}{2\sigma^4} = \frac{n-\nu}{2\sigma^4}.
\]
Second, we notice that $PHP = P$. Applying the procedure in (21), we have
\[
E(y^T P\dot H_i P y) = \sigma^2\operatorname{tr}(P\dot H_i P H) = \sigma^2\operatorname{tr}(P\dot H_i),
\]
\[
E(y^T P\dot H_i P\dot H_j P y) = \sigma^2\operatorname{tr}(P\dot H_i P\dot H_j P H) = \sigma^2\operatorname{tr}(P\dot H_i P\dot H_j),
\]
\[
E(y^T P\ddot H_{ij} P y) = \sigma^2\operatorname{tr}(P\ddot H_{ij} P H) = \sigma^2\operatorname{tr}(P\ddot H_{ij}). \tag{23}
\]
Taking expectations in (12), (13) and (15) and combining with (21) and (23) gives the stated formulas.
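Lemma 3.5 also lends itself to a quick Monte Carlo check (ours, continuing the toy model above): draw $y \sim N(X\tau, \sigma^2 H)$ repeatedly and compare the sample mean of $y^T P y$ with $(n-\nu)\sigma^2$.

```python
tau = rng.standard_normal(p)
L = np.linalg.cholesky(sigma2 * H)       # var(y) = sigma^2 H
draws = [X @ tau + L @ rng.standard_normal(n) for _ in range(20000)]
quad = np.array([yk @ P @ yk for yk in draws])
print(quad.mean(), (n - p) * sigma2)     # the two numbers agree closely
```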
Using the Fisher information matrix as an approximation to the negative Jacobian of the score results in the widely-used Fisher scoring algorithm [13].

3.4. Proof of the main result.
Proof. Let $I_A$ and $I_Z$ be as defined in the splitting; then $\frac{1}{2}(I_O + I) = I_A + I_Z$ holds element-wise by Lemmas 3.4 and 3.5, and it remains to show $E(I_Z) = 0$. Applying the result in (21), we have
\[
E\big(I_A(\sigma^2,\sigma^2)\big) = \frac{n-\nu}{2\sigma^4} \quad\text{and}\quad E\big(I_Z(\sigma^2,\sigma^2)\big) = 0.
\]
Applying the result in (23), we have
\[
E\big(I_A(\sigma^2,\kappa_i)\big) = \frac{\operatorname{tr}(P\dot H_i)}{2\sigma^2} \quad\text{and}\quad E\big(I_Z(\sigma^2,\kappa_i)\big) = 0,
\]
and likewise
\[
E\big(I_A(\kappa_i,\kappa_j)\big) = \frac{1}{2}\operatorname{tr}\big(P\dot H_i P \dot H_j\big) \quad\text{and}\quad E\big(I_Z(\kappa_i,\kappa_j)\big) = 0.
\]
Hence $E(I_Z) = 0$ and $E(I_A) = I$.
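A final numerical illustration (ours, reusing `draws`, `P`, `Hddot`, and `sigma2` from the sketches above): the sample mean of the $(\kappa,\kappa)$ element of $I_Z$ is close to zero, as the theorem asserts.

```python
IZ = np.array([0.25 * (np.trace(P @ Hddot) - yk @ P @ Hddot @ P @ yk / sigma2)
               for yk in draws])
print(IZ.mean())    # near zero: I_Z is a random zero matrix
```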
4. Discussion. The average information splitting is one of the key techniques to reduce computation in maximum likelihood methods [22]; other techniques, such as sparse inversion (see the state of the art of sparse inversion algorithms in [27]), should also be implemented to evaluate the score of the log-likelihood. More details can be found in the review report [26]. The Fisher information matrix is preferred not only for finding the variance of an estimator and in Bayesian inference [15], but also for analyzing the asymptotic behavior of maximum likelihood estimates [16, 23, 24]. Besides traditional application fields such as the genetical theory of natural selection and breeding [4], many other fields, including theoretical physics and information geometry, also use Fisher information matrix theory [9, 10, 17, 19]. Therefore the average information matrix splitting technique also holds promise in these directions.