DOI:https://doi.org/10.35566/jbds/v1n2/p2
A Note on Wishart and Inverse Wishart Priors for Covariance Matrix
University of Notre Dame
Keywords: Wishart distribution ● inverse Wishart distribution ● prior distribution ● covariance matrix
In Bayesian analysis, an inverse Wishart (IW) distribution is often used as a prior for the variance-covariance parameter matrix (e.g., Barnard, McCulloch, & Meng, 2000; Gelman et al., 2014; Leonard & Hsu, 1992). The IW prior is popular because it is conjugate to normal data. For illustration, consider a multivariate normal (MN) variable. Let $X=(X_1,X_2,\ldots,X_p)$ denote a vector of $p$ variables with $X|\Sigma\sim MN(0,\Sigma)$, where the mean vector is $\mu=0$ and the variance-covariance matrix is $\Sigma$. The density function is
$$p(x|\Sigma)=(2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\left(-\frac{1}{2}x^{T}\Sigma^{-1}x\right).$$
Given a sample $D=(x_1,\ldots,x_n)$ with $n$ being the sample size, the likelihood function for $\Sigma$ is
$$p(D|\Sigma)\propto|\Sigma|^{-n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}x_i^{T}\Sigma^{-1}x_i\right)=|\Sigma|^{-n/2}\exp\left[-\frac{n}{2}\mathrm{tr}(S\Sigma^{-1})\right],$$
where $S=\sum_{i=1}^{n}x_ix_i^{T}/n$ is the biased sample covariance matrix (the sample is centered at 0). Note that $S$ is also the maximum likelihood estimate of $\Sigma$. To obtain the posterior distribution of $\Sigma$ for Bayesian inference, one needs to specify a prior distribution $p(\Sigma)$. With the prior, the posterior distribution follows from Bayes' theorem:
$$p(\Sigma|D)=\frac{p(D|\Sigma)\,p(\Sigma)}{p(D)}.$$
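The likelihood depends on the data only through $S$ and $n$. A brief sketch of why, using only the symbols defined above: a scalar equals its own trace, and the trace is unchanged when the factors of a product are cyclically permuted.

```latex
\begin{align*}
\sum_{i=1}^{n} x_i^{T}\Sigma^{-1}x_i
  &= \sum_{i=1}^{n}\mathrm{tr}\left(x_i^{T}\Sigma^{-1}x_i\right)
   = \sum_{i=1}^{n}\mathrm{tr}\left(x_i x_i^{T}\Sigma^{-1}\right) \\
  &= \mathrm{tr}\left[\left(\sum_{i=1}^{n}x_i x_i^{T}\right)\Sigma^{-1}\right]
   = n\,\mathrm{tr}\left(S\Sigma^{-1}\right),
\end{align*}
% so that p(D | \Sigma) \propto |\Sigma|^{-n/2} \exp[-n\,\mathrm{tr}(S\Sigma^{-1})/2].
```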
1 The Inverse Wishart Prior
The most commonly used prior for $\Sigma$ is probably the inverse Wishart conjugate prior. The density function of an inverse Wishart distribution $IW(V,m)$ with the scale matrix $V$ and the degrees of freedom $m$ for a $p\times p$ variance-covariance matrix $\Sigma$ is
$$p(\Sigma)=\frac{|V|^{m/2}|\Sigma|^{-(m+p+1)/2}\exp[-\mathrm{tr}(V\Sigma^{-1})/2]}{2^{mp/2}\,\Gamma_p(m/2)},$$
where $\Gamma_p(\cdot)$ is the multivariate gamma function. The inverse Wishart distribution is a multivariate generalization of the inverse gamma distribution. Its mean is
$$E(\Sigma)=\frac{V}{m-p-1},\tag{1}$$
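and the variance of each element of $\Sigma=(\sigma_{ij})$ is
$$\mathrm{Var}(\sigma_{ij})=\frac{(m-p+1)v_{ij}^{2}+(m-p-1)v_{ii}v_{jj}}{(m-p)(m-p-1)^{2}(m-p-3)},$$
where $v_{ij}$ is the $(i,j)$ element of $V$. In particular, for the diagonal elements,
$$\mathrm{Var}(\sigma_{ii})=\frac{2v_{ii}^{2}}{(m-p-1)^{2}(m-p-3)}.\tag{2}$$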
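The moment formulas can be checked numerically. The following is a minimal sketch in plain Python, not the author's implementation; the function names `iw_mean` and `iw_var` are hypothetical, and matrices are represented as nested lists.

```python
def iw_mean(V, m):
    """Mean of IW(V, m): V / (m - p - 1); exists only for m > p + 1."""
    p = len(V)
    if m <= p + 1:
        raise ValueError("the mean exists only for m > p + 1")
    return [[v / (m - p - 1) for v in row] for row in V]


def iw_var(V, m):
    """Element-wise variances of IW(V, m); exist only for m > p + 3."""
    p = len(V)
    if m <= p + 3:
        raise ValueError("the variance exists only for m > p + 3")
    d = m - p  # shorthand for m - p
    denom = d * (d - 1) ** 2 * (d - 3)
    return [[((d + 1) * V[i][j] ** 2 + (d - 1) * V[i][i] * V[j][j]) / denom
             for j in range(p)] for i in range(p)]


V = [[1.0, 0.0], [0.0, 1.0]]   # identity scale matrix, p = 2
print(iw_mean(V, 10))          # diagonal entries equal 1 / 7
print(iw_var(V, 10))           # diagonal entries equal 2 / (7**2 * 5) = 2 / 245
```

For $i=j$ the general variance expression reduces to the diagonal formula, which is what the second printed matrix shows on its diagonal.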
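With an inverse Wishart prior $IW(V_0,m_0)$ based on known $V_0$ and $m_0$, the posterior distribution of $\Sigma$ is
$$p(\Sigma|D)\propto|\Sigma|^{-n/2}\exp\left[-\frac{n}{2}\mathrm{tr}(S\Sigma^{-1})\right]\times|\Sigma|^{-(m_0+p+1)/2}\exp\left[-\frac{1}{2}\mathrm{tr}(V_0\Sigma^{-1})\right]=|\Sigma|^{-(n+m_0+p+1)/2}\exp\left\{-\frac{1}{2}\mathrm{tr}\left[(nS+V_0)\Sigma^{-1}\right]\right\}.$$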
From this, the posterior distribution of $\Sigma$ is also an inverse Wishart distribution:
$$\Sigma|D\sim IW(nS+V_0,\;n+m_0)=IW(V_1,m_1),\tag{3}$$
with the updated scale matrix $V_1=nS+V_0$ and degrees of freedom $m_1=n+m_0$.
1.1 Information in an inverse Wishart prior
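The posterior mean of $\Sigma$ is
$$E(\Sigma|D)=\frac{nS+V_0}{n+m_0-p-1}=\frac{n}{n+m_0-p-1}\,S+\frac{m_0-p-1}{n+m_0-p-1}\cdot\frac{V_0}{m_0-p-1}.\tag{4}$$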
Therefore, the posterior mean is a weighted average of the sample covariance matrix $S$ and the prior mean $V_0/(m_0-p-1)$. As the sample size $n\to\infty$ with $m_0$ and $p$ fixed, the posterior mean approaches the sample covariance matrix $S$.
The information in a prior can be connected to data. For example, if we specify the prior $IW(V_0,m_0)$ with $V_0=n_0S$ and $m_0=n_0$, then the information in the prior is equivalent to that of $n_0$ participants in the sample. Note that if we set $V_0=(m_0-p-1)S$, then $E(\Sigma|D)=S$, meaning the posterior mean is the same as the sample covariance matrix.
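The conjugate update and the weighted-average reading of the posterior mean can be illustrated with a short sketch. All inputs below are hypothetical values chosen for illustration, and the function names are assumptions rather than part of any package.

```python
def iw_posterior(S, n, V0, m0):
    """Conjugate update: IW(V0, m0) prior -> IW(n*S + V0, n + m0) posterior."""
    p = len(S)
    V1 = [[n * S[i][j] + V0[i][j] for j in range(p)] for i in range(p)]
    return V1, n + m0


def iw_posterior_mean(S, n, V0, m0):
    """Posterior mean (n*S + V0) / (n + m0 - p - 1)."""
    p = len(S)
    V1, m1 = iw_posterior(S, n, V0, m0)
    return [[v / (m1 - p - 1) for v in row] for row in V1]


# Hypothetical inputs, for illustration only.
S = [[4.0, 1.0], [1.0, 3.0]]
n, m0, p = 50, 8, 2
V0 = [[2.0, 0.0], [0.0, 2.0]]

post_mean = iw_posterior_mean(S, n, V0, m0)

# The same quantity written as a weighted average of S and the prior mean.
w_prior = (m0 - p - 1) / (n + m0 - p - 1)
prior_mean = [[v / (m0 - p - 1) for v in row] for row in V0]
weighted = [[(1 - w_prior) * S[i][j] + w_prior * prior_mean[i][j]
             for j in range(p)] for i in range(p)]

# Setting V0 = (m0 - p - 1) * S makes the posterior mean equal to S.
V0_data = [[(m0 - p - 1) * s for s in row] for row in S]
mean_data = iw_posterior_mean(S, n, V0_data, m0)

print(post_mean)
print(weighted)
print(mean_data)
```

The first two printed matrices agree up to floating-point rounding, and the third reproduces `S`.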
2 Precision Matrix and the Wishart Prior
In practice, the BUGS program is probably the most widely used software for Bayesian analysis (e.g., Lunn, Jackson, Best, Thomas, & Spiegelhalter, 2012; Ntzoufras, 2009). BUGS uses the precision matrix, defined as the inverse of the covariance matrix, to specify the multivariate normal distribution. Let $P=\Sigma^{-1}$; then the normal density function can be written as
$$p(x|P)=(2\pi)^{-p/2}|P|^{1/2}\exp\left(-\frac{1}{2}x^{T}Px\right).$$
Using the precision matrix has a computational advantage in certain situations because it avoids inverting a matrix when evaluating the density.
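For the precision matrix $P$, a Wishart prior $W(U_0,w_0)$ with the scale matrix $U_0$ and degrees of freedom $w_0$ is used (e.g., Lunn et al., 2012). The density function of the prior is
$$p(P)=\frac{|P|^{(w_0-p-1)/2}\exp[-\mathrm{tr}(U_0^{-1}P)/2]}{2^{w_0p/2}\,\Gamma_p(w_0/2)\,|U_0|^{w_0/2}},$$
where $\Gamma_p(\cdot)$ is the multivariate gamma function. Given the sample $D=(x_1,\ldots,x_n)$, the posterior distribution of $P$ is
$$p(P|D)\propto|P|^{n/2}\exp\left[-\frac{n}{2}\mathrm{tr}(SP)\right]\times|P|^{(w_0-p-1)/2}\exp\left[-\frac{1}{2}\mathrm{tr}(U_0^{-1}P)\right]=|P|^{(n+w_0-p-1)/2}\exp\left\{-\frac{1}{2}\mathrm{tr}\left[(nS+U_0^{-1})P\right]\right\}.$$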
Therefore, the posterior is also a Wishart distribution $W(U_1,w_1)$ with $U_1=(nS+U_0^{-1})^{-1}$ and $w_1=n+w_0$. The posterior mean of $P$ is $E(P|D)=w_1U_1=(n+w_0)(nS+U_0^{-1})^{-1}$. Based on the relationship between Wishart and inverse Wishart distributions (Mardia, Bibby, & Kent, 1982),
$$\Sigma|D=P^{-1}|D\sim IW(U_1^{-1},w_1)=IW(nS+U_0^{-1},\;n+w_0).\tag{5}$$
The posterior mean of $\Sigma$ is
$$E(\Sigma|D)=\frac{nS+U_0^{-1}}{n+w_0-p-1}.$$
Comparing the posterior distributions in Equations (3) and (5), giving an inverse Wishart prior $IW(V_0,m_0)$ to the covariance matrix $\Sigma$ is the same as giving a Wishart prior $W(V_0^{-1},m_0)$ to the precision matrix $P=\Sigma^{-1}$. However, note that
$$[E(P|D)]^{-1}=\frac{nS+U_0^{-1}}{n+w_0}\neq E(\Sigma|D)=\frac{nS+U_0^{-1}}{n+w_0-p-1}.$$
Therefore, one cannot simply invert the posterior mean of the precision matrix to get the posterior mean of the covariance matrix.
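A small numerical sketch makes the gap between the two quantities concrete. The values of `n`, `S`, `U0`, and `w0` below are illustrative assumptions, and `inv2` is a hypothetical helper for inverting a $2\times 2$ matrix.

```python
def inv2(A):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


n, p = 100, 2
S = [[5.0, 2.0], [2.0, 10.0]]    # illustrative sample covariance matrix
U0 = [[1.0, 0.0], [0.0, 1.0]]    # Wishart scale matrix (assumed identity)
w0 = 5                           # Wishart degrees of freedom (assumed)

U0_inv = inv2(U0)
# A = n*S + U0^{-1}, the inverse of the posterior Wishart scale matrix.
A = [[n * S[i][j] + U0_inv[i][j] for j in range(p)] for i in range(p)]

# Posterior of P is W(U1, w1) with U1 = A^{-1} and w1 = n + w0.
U1, w1 = inv2(A), n + w0
mean_P = [[w1 * u for u in row] for row in U1]                 # E(P | D)

inv_mean_P = inv2(mean_P)                                      # A / (n + w0)
mean_Sigma = [[a / (w1 - p - 1) for a in row] for row in A]    # E(Sigma | D)

print(inv_mean_P[0][0], mean_Sigma[0][0])
```

With these inputs the first element is 501/105 (about 4.77) for the inverted precision mean and 501/102 (about 4.91) for the covariance mean; the two differ by the factor $(n+w_0)/(n+w_0-p-1)$.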
3 Numerical Examples
For illustration, we look at a concrete example. Suppose we have a sample of size $n=100$ with the sample covariance matrix ($p=2$)
$$S=\begin{pmatrix}5&2\\2&10\end{pmatrix}.$$
The aim is to estimate $\Sigma$ using Bayesian methods. We now consider the use of different priors and evaluate their influence. Given the connection between the Wishart and inverse Wishart distributions, we focus our discussion on the specification of an inverse Wishart prior for the covariance matrix $\Sigma$.
3.1 Priors based on an identity scale matrix
For an inverse Wishart prior $IW(V_0,m_0)$, we need to specify its scale matrix and degrees of freedom. In practice, an identity matrix has frequently been used as the scale matrix. Therefore, we first set $V_0=I$ and vary the degrees of freedom by letting $m_0=2,5,10,50,100$. Note that when $m_0=2$ the prior mean is not defined, because the mean in Equation (1) requires $m_0>p+1$, but the posterior distribution is still proper with a well-defined mean. The mean and variance of the posterior distribution are given in Table 1. First, when $m_0=2$ or 5, the posterior means are close to the sample covariance matrix. As $m_0$ increases, both the posterior means and the posterior variances become smaller. This is explained by Equation (4): the posterior mean is a weighted average of the sample covariance matrix and the prior mean. Take the element $\Sigma_{11}$ as an example. From the data, $S_{11}=5$. The mean of the inverse Wishart prior is $V_{0,11}/(m_0-3)=1/(m_0-3)$. When $m_0=5$, the prior mean is 0.5, and when $m_0=100$, the prior mean is about 0.01. Furthermore, the weight for the prior mean is $(m_0-p-1)/(n+m_0-p-1)$, which is about 0.02 when $m_0=5$ and increases to about 0.5 when $m_0=100$. Therefore, as $m_0$ increases, the posterior mean is pulled towards the prior mean because the prior mean receives a greater weight.
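The pattern described for $\Sigma_{11}$ can be recomputed directly from the posterior $IW(nS+V_0,\,n+m_0)$. The sketch below evaluates the formulas rather than copying any table; the variable names are assumptions.

```python
n, p = 100, 2
S11 = 5.0      # (1,1) element of the sample covariance matrix
V0_11 = 1.0    # (1,1) element of the identity scale matrix

results = {}
for m0 in (2, 5, 10, 50, 100):
    v1 = n * S11 + V0_11          # (1,1) element of the posterior scale nS + V0
    m1 = n + m0                   # posterior degrees of freedom
    post_mean = v1 / (m1 - p - 1)
    post_var = 2 * v1 ** 2 / ((m1 - p - 1) ** 2 * (m1 - p - 3))
    # Weight on the prior mean in the weighted-average form of the posterior
    # mean; it is negative at m0 = 2, where the prior mean does not exist.
    w_prior = (m0 - p - 1) / (m1 - p - 1)
    results[m0] = (post_mean, post_var, w_prior)
    print(m0, round(post_mean, 3), round(post_var, 3), round(w_prior, 3))
```

Both the posterior mean and the posterior variance of $\Sigma_{11}$ shrink as $m_0$ grows, while the weight on the prior mean rises towards one half at $m_0=100$.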

In the above specification, since $V_0\equiv I$, the prior mean also changes as $m_0$ changes. In practice, e.g., in sensitivity analysis, it can be helpful to fix the prior mean. To achieve this, one can set $V_0=(m_0-p-1)I$. For example, when $m_0=5$ the scale matrix is $2I$, and when $m_0=100$ the scale matrix is $97I$. With this specification, the prior mean is always $I$.
3.2 Priors with the scale matrix formed from data
Another way to specify the prior is to construct the scale matrix for the inverse Wishart distribution based on the sample data. Intuitively, we can set $V_0=S$ and change $m_0$. From the top of Table 2, as $m_0$ increases, the posterior mean deviates from the sample covariance matrix. This is again because the prior mean, which equals $S/(m_0-p-1)$, becomes smaller as $m_0$ increases. To maintain the same prior mean while changing the information in the prior, we set $V_0=(m_0-p-1)S$. With this specification, the prior mean is always $S$ and the posterior mean is also $S$, as shown in the bottom part of Table 2. As the degrees of freedom increase, more information is supplied through the prior and the posterior variance decreases.
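The effect of the data-based scale matrix $V_0=(m_0-p-1)S$ can be sketched as follows. The helper `posterior_summary` is hypothetical, and only values with $m_0>p+1$ are used so that the scale matrix is positive definite.

```python
n, p = 100, 2
S = [[5.0, 2.0], [2.0, 10.0]]


def posterior_summary(m0):
    """Posterior mean of Sigma and variance of Sigma_11 under IW[(m0-p-1)S, m0]."""
    m1 = n + m0
    # Posterior scale: n*S + (m0 - p - 1)*S = (n + m0 - p - 1)*S.
    v1 = [[(n + m0 - p - 1) * s for s in row] for row in S]
    mean = [[v / (m1 - p - 1) for v in row] for row in v1]
    var_11 = 2 * v1[0][0] ** 2 / ((m1 - p - 1) ** 2 * (m1 - p - 3))
    return mean, var_11


for m0 in (5, 10, 50, 100):
    mean, var_11 = posterior_summary(m0)
    print(m0, mean, round(var_11, 4))
```

The posterior mean stays at `S` for every $m_0$, while the variance of $\Sigma_{11}$ simplifies to $2S_{11}^2/(n+m_0-p-3)$ and therefore decreases as $m_0$ grows.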

3.3 Other types of specifications
We now consider several other types of specifications of the scale matrix to illustrate the influence of the prior. In all the specifications, we maintain the same prior mean by setting the prior in the form of $IW[(m_0-p-1)V_0,m_0]$. The priors considered and the associated posterior means and variances are summarized in Table 3.
Prior P1 assumes that $\Sigma_{11}$ is 10 times $\Sigma_{22}$, which is not consistent with the sample data. As expected, the posterior mean is pulled towards the prior mean as $m_0$ increases. Notably, the variance of $\Sigma_{11}$ does not decrease monotonically as $m_0$ increases, contrary to the common assumption that prior information always leads to more precise results. This is because the variance of the inverse Wishart distribution is related to its mean, as shown in Equation (2), and the prior is not consistent with the data.
For priors P2, P3, P4, and the one at the bottom of Table 2, the scale matrices have the same diagonal values and different off-diagonal values. Note that changing the off-diagonal values influences neither the posterior means nor the posterior variances of the diagonal elements, which can also be seen from Equations (1) and (2). As expected, changing the off-diagonal values influences both the posterior means and the posterior variances of the off-diagonal elements. However, the posterior variances are relatively stable.
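The non-monotone behaviour of the posterior variance can be reproduced with a short sketch. The prior mean of 10 for $\Sigma_{11}$ used below is a hypothetical value chosen to conflict with $S_{11}=5$; it is an assumption for illustration and not necessarily the value used in P1.

```python
n, p = 100, 2
S11 = 5.0              # (1,1) element of the sample covariance matrix
prior_mean_11 = 10.0   # hypothetical prior mean for Sigma_11 (an assumption)


def post_var_11(m0):
    """Posterior variance of Sigma_11 under IW[(m0 - p - 1) * V0, m0]."""
    v1 = n * S11 + (m0 - p - 1) * prior_mean_11   # posterior scale element
    m1 = n + m0                                   # posterior degrees of freedom
    return 2 * v1 ** 2 / ((m1 - p - 1) ** 2 * (m1 - p - 3))


for m0 in (5, 10, 50, 100, 1000):
    print(m0, round(post_var_11(m0), 3))
```

Under these assumptions the variance first rises and later falls as $m_0$ grows. The posterior variance of a diagonal element equals $2[E(\sigma_{ii}|D)]^2/(n+m_0-p-3)$, so a prior that pulls the posterior mean upwards can outweigh the gain from the larger degrees of freedom.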
3.4 Using priors for a precision matrix P
The influence of the priors on the precision matrix is the same as for the covariance matrix because of the connection between the Wishart and inverse Wishart distributions: if $\Sigma\sim IW(V_0,m_0)$, then $P=\Sigma^{-1}\sim W(V_0^{-1},m_0)$. If the prior $IW(I,m_0)$ is specified for the covariance matrix, it is equivalent to using $W(I,m_0)$ for the precision matrix. As discussed earlier, to maintain the same prior mean, we can use $IW[(m_0-p-1)I,m_0]$ for $\Sigma$. In this case, the prior for the precision matrix should be $W[I/(m_0-p-1),m_0]$. Similarly, if we specify a prior for $\Sigma$ based on the data using $IW[(m_0-p-1)S,m_0]$, then the prior for the precision matrix would be $W[S^{-1}/(m_0-p-1),m_0]$.
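Translating an inverse Wishart prior on $\Sigma$ into the matching Wishart prior on $P$ only requires inverting the scale matrix. A minimal sketch for $p=2$, with hypothetical helper names and an assumed $m_0=10$:

```python
def inv2(A):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def wishart_prior_from_iw(V0, m0):
    """Sigma ~ IW(V0, m0) corresponds to P = Sigma^{-1} ~ W(V0^{-1}, m0)."""
    return inv2(V0), m0


p, m0 = 2, 10
S = [[5.0, 2.0], [2.0, 10.0]]
V0 = [[(m0 - p - 1) * s for s in row] for row in S]   # IW scale with prior mean S
U0, w0 = wishart_prior_from_iw(V0, m0)                # U0 = S^{-1} / (m0 - p - 1)

# The Wishart prior mean of P is w0 * U0, i.e. m0 / (m0 - p - 1) * S^{-1}.
prior_mean_P = [[w0 * u for u in row] for row in U0]
print(U0)
print(prior_mean_P)
```

Note that the prior mean of $P$ is not $S^{-1}$ even though the prior mean of $\Sigma$ is $S$, for the same reason that the two posterior means are not inverses of each other.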
4 Discussion
Although not without issues, Wishart and inverse Wishart distributions are still commonly used prior distributions for Bayesian analysis involving a covariance matrix (Alvarez, Niemi, & Simpson, 2014; Liu, Zhang, & Grimm, 2016). As we have shown, the inverse Wishart prior has the advantage of conjugacy, which simplifies the posterior distribution. With an inverse Wishart prior and normally distributed data, the posterior distribution is also an inverse Wishart distribution. The posterior mean can be conveniently expressed as a weighted average of the prior mean and the sample covariance matrix. The influence of the prior can also be clearly quantified.

When reliable information is available, an informative inverse Wishart prior can be constructed. For example, previous estimates of the covariance matrix could be available. In this situation, such covariance matrix estimates can be used to construct the scale matrix. If variance estimates for the elements of the covariance matrix are also available, one can determine the degrees of freedom for the inverse Wishart prior based on the variance expression in Equation (2), which can be done using the R package discussed in the Appendix. The degrees of freedom based on each individual element may vary. The overall degrees of freedom for the inverse Wishart distribution can be determined based on the practical research question.
When no reliable information is available, an identity matrix has often been suggested as the scale matrix of the inverse Wishart distribution for the covariance matrix and of the Wishart distribution for the precision matrix (e.g., Congdon, 2014). But as the numerical example shows, how much information such a prior carries depends on the covariance matrix. We believe a better way to specify an uninformative prior is to determine the scale matrix based on the sample covariance matrix. Therefore, we recommend the prior $IW[(m_0-p-1)S,m_0]$. For the precision matrix, one can use $W[S^{-1}/(m_0-p-1),m_0]$.
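One way to read the degrees-of-freedom step is sketched below. It assumes the scale matrix is set to $(m-p-1)$ times the previous estimate, so the variance of a diagonal element becomes $2\,\mathrm{mean}_{ii}^2/(m-p-3)$ and can be solved for $m$. The function name and the inputs are hypothetical, and this is not necessarily how the R package performs the calculation.

```python
def df_from_diagonal(mean_ii, var_ii, p):
    """Degrees of freedom m solving var_ii = 2 * mean_ii**2 / (m - p - 3)."""
    if var_ii <= 0:
        raise ValueError("var_ii must be positive")
    return p + 3 + 2 * mean_ii ** 2 / var_ii


# Hypothetical previous estimates of two variances and of their variances.
p = 2
means = [5.0, 10.0]
variances = [2.0, 10.0]
dfs = [df_from_diagonal(mu, v, p) for mu, v in zip(means, variances)]
print(dfs)   # [30.0, 25.0]
```

The two elements imply different degrees of freedom, which illustrates why a single overall value has to be chosen on substantive grounds.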
Appendix
The R package wishartprior is developed and made available on GitHub to help understand the Wishart and inverse Wishart priors. The URL to the package is https://github.com/johnnyzhz/wishartprior. The package can be used to generate random numbers from an inverse Wishart distribution. It can calculate the mean and variance of Wishart and inverse Wishart distributions. Using the package, one can investigate the influence of priors.
References
Alvarez, I., Niemi, J., & Simpson, M. (2014). Bayesian inference for a covariance matrix. In Annual conference on applied statistics in agriculture (pp. 71–82). Retrieved from arXiv:1408.4050
Barnard, J., McCulloch, R., & Meng, X.-L. (2000). Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica, 10, 1281–1311.
Congdon, P. (2014). Applied Bayesian modeling (2nd ed.). John Wiley & Sons.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). CRC Press.
Leonard, T., & Hsu, J. S. (1992). Bayesian inference for a covariance matrix. The Annals of Statistics, 20(4), 1669–1696. doi: https://doi.org/10.1214/aos/1176348885
Liu, H., Zhang, Z., & Grimm, K. J. (2016). Comparison of inverse Wishart and separation-strategy priors for Bayesian estimation of covariance parameter matrix in growth curve analysis. Structural Equation Modeling: A Multidisciplinary Journal, 23(3), 354–367. doi: https://doi.org/10.1080/10705511.2015.1057285
Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2012). The BUGS book: A practical introduction to Bayesian analysis. CRC Press.
Mardia, K., Bibby, J., & Kent, J. (1982). Multivariate analysis. Academic Press.
Ntzoufras, I. (2009). Bayesian modeling using WinBUGS. John Wiley & Sons.
Open Access Statement. Journal of Behavioral Data Science, ISSN 2575-8306, online ISSN 2574-1284. © International Society for Data Science and Analytics. All content of the Journal of Behavioral Data Science is freely available to download, save, reproduce, and transmit for noncommercial, scholarly, and educational purposes. Reproduction and transmission of journal content for the above purposes should credit the author and original source. Use, reproduction, or distribution of journal content for commercial purposes requires additional permissions from the International Society for Data Science and Analytics; please contact contact@isdsa.org. DOI: 10.35566/jbds