# Seemingly Unrelated Regression (SUR/SURE)¶

Seemingly unrelated regression is a system regression estimator which jointly estimates multiple models. This allows joint hypothesis testing of parameters across models since the parameter covariance is robust to correlation of residuals across models. It can also lead to more precise parameter estimates if some residuals are conditionally homoskedastic and regressors differ across equations.

## Basic Notation¶

Assume $$K$$ series are observed for $$N$$ time periods. Denote the stacked observations of series $$i$$ as $$Y_{i}$$ and its corresponding regressor matrix $$X_{i}$$. The complete model spanning all series can be specified as

\begin{split}\begin{aligned} Y & =X\beta+\epsilon\\ \left[\begin{array}{c} Y_{1}\\ Y_{2}\\ \vdots\\ Y_{K} \end{array}\right] & =\left[\begin{array}{cccc} X_{1} & 0 & 0 & 0\\ 0 & X_{2} & 0 & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & X_{K} \end{array}\right]\left[\begin{array}{c} \beta_{1}\\ \beta_{2}\\ \vdots\\ \beta_{K} \end{array}\right]+\left[\begin{array}{c} \epsilon_{1}\\ \epsilon_{2}\\ \vdots\\ \epsilon_{K} \end{array}\right].\end{aligned}\end{split}

The OLS estimator of $$\beta$$ is then

$\hat{\beta}=\left(X^{\prime}X\right)^{-1}X^{\prime}Y.$

A GLS estimator can be similarly defined as

$\hat{\beta}_{GLS}=\left(X^{\prime}\Omega^{-1}X\right)^{-1}X^{\prime}\Omega^{-1}Y$

where $$\Omega=\Sigma\otimes I_{N}$$ is the joint covariance of the residuals. In practice $$\Sigma$$ is not known as so a feasible GLS (FGLS) is implemented in two steps. The first uses OLS to estimate $$\hat{\epsilon}$$ and then estimates the residual covariance as

$\hat{\Sigma}=N^{-1}\left[\begin{array}{cccc} \hat{\epsilon}_{1} & \hat{\epsilon}_{2} & \ldots & \hat{\epsilon}_{N}\end{array}\right]^{\prime}\left[\begin{array}{cccc} \hat{\epsilon}_{1} & \hat{\epsilon}_{2} & \ldots & \hat{\epsilon}_{N}\end{array}\right].$

The feasible GLS estimator is then

$\hat{\beta}_{FGLS}=\left(X^{\prime}\hat{\Omega}^{-1}X\right)^{-1}X^{\prime}\hat{\Omega}^{-1}Y$

where $$\hat{\Omega}=\hat{\Sigma}\otimes I_{N}$$.

# Covariance Estimation¶

Homoskedastic Data

When residuals are assumed to be homoskedastic, the covariance can be consistently estimated by

$\left(X^{\prime}\Delta^{-1}X\right)^{-1}\left(X^{\prime}\Delta^{-1}\hat{\Omega}\Delta^{-1}X\right)\left(X^{\prime}\Delta^{-1}X\right)^{-1}$

where $$\Delta$$ is the weighting matrix used in the parameter estimator. For example, in OLS $$\Delta=I_{NK}$$ while in FGLS$$\Delta=\hat{\Omega}$$. The estimator supports using FGLS with an assumption that $$\hat{\Sigma}$$ is diagonal or using a user-specified value of $$\Sigma$$. When the FGLS estimator is used, this simplifies to

$\left(X^{\prime}\Delta^{-1}X\right)^{-1}$

Heteroskedastic Data

When residuals are heteroskedastic, the covariance can be consistently estimated by

$\left(X^{\prime}\Delta^{-1}X\right)^{-1}\hat{S}\left(X^{\prime}\Delta^{-1}X\right)^{-1}$

where $$\hat{S}$$ is a covariance estimator that is clustered by time. Define the matrix of scores as

$\begin{split}\hat{g}=\left[\begin{array}{c} \hat{g}_{1}\\ \hat{g}_{2}\\ \vdots\\ \hat{g}_{N} \end{array}\right]=\Delta^{-\frac{1}{2}}X\odot\left(\hat{\epsilon}\iota_{M}^{\prime}\right)\end{split}$

where $$\iota_{M}$$ is a vector of ones with $$M$$ elements where $$M$$ is the number of columns in $$X$$ and $$\odot$$ is element by element multiplication. The clustered-by-time score covariance estimator is then

$\hat{S}=N^{-1}\sum_{i=1}^{N}\Psi_{i}^{\prime}\Psi_{i}$

where

$\Psi_{i}=\sum_{j=1}^{K}\hat{g}_{ij}$

is the sum of the scores across models $$\left(j\right)$$ that have the same observation index $$i$$. This estimator allows arbitrary correlation of residuals with the same observation index.

Debiased Estimators

When the debiased flag is set, a small sample adjustment if applied so that element $$ij$$ of $$\hat{\Sigma}$$ is scaled by

$\frac{T}{\sqrt{\left(T-P_{i}\right)\left(T-P_{j}\right)}}.$

## Other Statistics¶

Goodness of fit

The reported $$R^{2}$$ is always for the data, or weighted data is weights are used, for either OLS or GLS. The means that the reported $$R^{2}$$ for the GLS estimator may be negative

F statistics

When the debiased covariance estimator is used (small sample adjustment) the reported $$F$$ statistics use $$K\left(T-\bar{P}\right)$$ where $$\bar{P}=K^{-1}\sum_{i=1}^{K}P_{i}$$ where $$P_{i}$$ is the number of variables including the constant in model $$i$$. When models include restrictions it may be the case that the covariance is singular. When this occurs, the $$F$$ statistic cannot be calculated.

# Memory efficient calculations¶

The parameter estimates in the SUR model can are

$\left(X^{\prime}\Omega^{-1}X\right)^{-1}\left(X^{\prime}\Omega^{-1}Y\right)^{-1}$

where

$\begin{split}\Omega^{-1}=\left[\begin{array}{cccc} \sigma^{11}I_{T} & \sigma^{12}I_{T} & \ldots & \sigma^{1k}I_{T}\\ \sigma^{21}I_{T} & \sigma^{22}I_{T} & \dots & \sigma^{2k}I_{T}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma^{k2}I_{T} & \sigma^{k2}I_{T} & \dots & \sigma^{kk}I_{T} \end{array}\right]\end{split}$

where $$\sigma^{ij}=\left[\Sigma^{-1}\right]_{ij}$$. Since $$X$$ is block diagonal,

$\begin{split}X^{\prime}\Omega^{-1}=\left[\begin{array}{cccc} \sigma^{11}X_{1}^{\prime} & \sigma^{12}X_{1}^{\prime} & \ldots & \sigma^{1k}X_{1}^{\prime}\\ \sigma^{21}X_{2}^{\prime} & \sigma^{22}X_{2}^{\prime} & \dots & \sigma^{2k}X_{2}^{\prime}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma^{k2}X_{k}^{\prime} & \sigma^{k2}X_{k}^{\prime} & \dots & \sigma^{kk}X_{k}^{\prime} \end{array}\right]\end{split}$

and

$\begin{split}X^{\prime}\Omega^{-1}X=\left[\begin{array}{cccc} \sigma^{11}X_{1}^{\prime}X_{1} & \sigma^{12}X_{1}^{\prime}X_{2} & \ldots & \sigma^{1k}X_{1}^{\prime}X_{k}\\ \sigma^{21}X_{2}^{\prime}X_{1} & \sigma^{22}X_{2}^{\prime}X_{2} & \dots & \sigma^{2k}X_{2}^{\prime}X_{k}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma^{k2}X_{k}^{\prime}X_{1} & \sigma^{k2}X_{k}^{\prime}X_{2} & \dots & \sigma^{kk}X_{k}^{\prime}X_{k} \end{array}\right]\end{split}$

Similarly

$\begin{split}X^{\prime}\Omega^{-1}Y=\left[\begin{array}{c} \sigma^{11}X_{1}^{\prime}Y_{1}+\sigma^{12}X_{1}^{\prime}Y_{2}+\ldots\sigma^{1k}X_{1}^{\prime}Y_{k}\\ \sigma^{21}X_{2}^{\prime}Y_{1}+\sigma^{22}X_{2}^{\prime}Y_{2}+\dots+\sigma^{2k}X_{2}^{\prime}Y_{k}\\ \vdots\\ \sigma^{k2}X_{k}^{\prime}Y_{1}+\sigma^{k2}X_{k}^{\prime}Y_{2}+\dots+\sigma^{kk}X_{k}^{\prime}Y_{k} \end{array}\right]\end{split}$

This suggests that constructing the high dimensional $$\Omega^{-1}$$ can be avoided and many needless multiplications can be avoided by directly computing these two components and the solution can be found using solve.

# Three Stage Least Squares (3SLS)¶

Three-stage least squares extends SUR to the case of endogenous variables. It is the system extension of two-stage least squares (2SLS). Like SUR, 3SLS jointly estimates multiple models which allows joint hypothesis testing of parameters. It can also lead to more precise parameter estimates if some residuals are conditionally homoskedastic and regressors differ across equations.

## Basic Notation¶

Assume $$K$$ series are observed for $$N$$ time periods. Denote the stacked observations of series $$i$$ as $$Y_{i}$$ and its corresponding regressor matrix $$X_{i}$$. The complete model spanning all series can be specified as

\begin{split}\begin{aligned} Y & =X\beta+\epsilon\\ \left[\begin{array}{c} Y_{1}\\ Y_{2}\\ \vdots\\ Y_{K} \end{array}\right] & =\left[\begin{array}{cccc} X_{1} & 0 & 0 & 0\\ 0 & X_{2} & 0 & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & X_{K} \end{array}\right]\left[\begin{array}{c} \beta_{1}\\ \beta_{2}\\ \vdots\\ \beta_{K} \end{array}\right]+\left[\begin{array}{c} \epsilon_{1}\\ \epsilon_{2}\\ \vdots\\ \epsilon_{K} \end{array}\right].\end{aligned}\end{split}

where $$X_{i}$$ contains both exogenous and endogenous variables. For each equation denote the available instruments as $$Z_{i}$$. The instrument include both exogenous regressors and excluded instrumental variables. Define the fitted values of the exogenous variables as

$\hat{X}_{i}=Z_{i}\left(Z_{i}^{\prime}Z_{i}\right)^{-1}Z_{i}^{\prime}X_{i}.$

The IV estimator of $$\beta$$ is then

$\hat{\beta}_{IV}=\left(\hat{X}^{\prime}\hat{X}\right)^{-1}\hat{X}^{\prime}Y$

where

$\begin{split}\hat{X}=\left[\begin{array}{cccc} \hat{X}_{1} & 0 & 0 & 0\\ 0 & \hat{X}_{2} & 0 & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & \hat{X}_{N} \end{array}\right].\end{split}$

A GLS estimator can be similarly defined as

$\hat{\beta}_{GLS}=\left(\hat{X}^{\prime}\Omega^{-1}\hat{X}\right)^{-1}\hat{X}^{\prime}\Omega^{-1}Y.$

Inference is identical to the SUR replacing $$X$$ with $$\hat{X}$$ where model residuals are estimated using $$X$$ and $$Y$$.

# System Generalized Method of Moments (GMM)¶

The system GMM estimator uses a set of moment conditions of the form

\begin{split}\begin{aligned} E\left[g_{i}\right] & =0,i=1,\ldots,T\\ E\left[\begin{array}{c} g_{1i}\\ g_{2i}\\ \vdots\\ g_{Ki} \end{array}\right] & =0\end{aligned}\end{split}

where $$g_{ji}=z_{ji}^{\prime}\left(y_{ji}-x_{ji}\beta\right)$$ where $$j$$ is the equation index, $$z_{ji}$$ is a set of instruments for equation $$j$$ at time period i and $$x_{ji}$$ and $$y_{ji}$$ are similarly defined. Define

$g_{N}\left(\beta\right)=N^{-1}\sum_{i=1}^{N}g_{i}\left(\beta\right),$

then the parameters are estimated to minimize

$\min_{\beta}g_{N}\left(\beta\right)^{\prime}W^{-1}g_{N}\left(\beta\right).$

The solution to this minimization problem is

$\hat{\beta}=\left(X^{\prime}ZW^{-1}Z^{\prime}X\right)^{-1}\left(X^{\prime}ZW^{-1}Z^{\prime}Y\right).$

$$W$$ is the moment weighting matrix and for efficiency should be set to the covariance of the moment conditions. In practice a 2-step estimator is required since estimators of $$W$$ depend on $$\beta$$. In the first step, $$W=N^{-1}Z^{\prime}Z$$ is used to construct an initial estimate of $$\hat{\beta}$$. This would be efficient if residuals were homoskedastic and uncorrelated. This choice will produce estimates that are identical to system 2SLS. Using the initial parameter estimated, the covariance of the moment conditions is computed used one of two forms using the estimated model residuals $$\hat{\epsilon}$$.

## Homoskedastic Weighting¶

When residuals are assumed to be conditionally homoskedastic, then

$\hat{W}=N^{-1}Z^{\prime}\left(\hat{\Sigma}\otimes I_{N}\right)Z$

where $$\hat{\Sigma}$$is the $$K$$ by $$K$$ covariance of the model residuals. This specification assumes that residuals are not correlated across observation index (only across models)

## Heteroskedastic Weighting¶

When residuals are heteroskedastic, the weighting matrix is estimated from

$\hat{W}=N^{-1}\sum\hat{g}_{i}\hat{g}_{i}^{\prime}$

which allows cross-sectional dependence but no dependence across observation index.

The efficient estimates are computed using the same expression only with $$\hat{W}$$,

$\hat{\beta}=\left(X^{\prime}Z\hat{W}^{-1}Z^{\prime}X\right)^{-1}\left(X^{\prime}Z\hat{W}^{-1}Z^{\prime}Y\right).$

Both weighting estimators can be centered which will enforce an exact mean 0 to the moment conditions. When models are over-identified, this can improve accuracy of 2-step estimators.

## Parameter Covariance¶

Covariance estimators are available for either homoskedastic or heteroskedastic data. The only difference is in how the covariance of the moments is computed. In either case, the parameter covariance can be estimated by

$\widehat{Var\left(\hat{\beta}\right)}=N^{-1}\left(\frac{X^{\prime}Z}{N}\hat{W}^{-1}\frac{Z^{\prime}X}{N}\right)^{-1}\left(\frac{X^{\prime}Z}{N}\hat{W}^{-1}\hat{\Omega}\hat{W}^{-1}\frac{Z^{\prime}X}{N}\right)\left(\frac{X^{\prime}Z}{N}\hat{W}^{-1}\frac{Z^{\prime}X}{N}\right)^{-1}.$

The covariance of the scores $$\hat{\Omega}$$ estimated using either the homoskedastic weighting formula or the heteroskedastic weighting formula immediately above. This allow the choice of weighting matrix to be set separately from the score covariance estimator, although in most cases these should be the same, and so the covariance of the estimated parameters will simplify to

$\widehat{Var\left(\hat{\beta}\right)}=N^{-1}\left(\frac{X^{\prime}Z}{N}\hat{W}^{-1}\frac{Z^{\prime}X}{N}\right)^{-1}.$

# Testing Covariance and Correlations¶

Two tests are available to test whether the residual covariance is diagonal. These are useful diagnostics when considering GLS estimation. If the tests reject the null, then the data suggest that GLS estimation should improve efficiency as long as the regressors are not all common. If the null is not rejected, then the covariance is not statistically different from a diagonal covariance and there are unlikely to be gains to using GLS. The Breusch-Pagan test directly examines the correlations of the residuals, and is defined as

$N\left(\sum_{i=1}^{K}\sum_{j=i+1}^{K}\hat{\rho}\right)\sim\chi_{K\left(K-1\right)/2}^{2}.$

The likelihood ratio is defined as the difference between the log determinants of a diagonal covariance matrix and the full unrestricted covariance matrix,

$N\left(\sum_{i=1}^{K}\ln\hat{\sigma}_{i}^{2}-\ln\left|\hat{\Sigma}\right|\right)=N\left(\sum_{i=1}^{K}\ln\left|\hat{\Sigma}\odot I_{K}\right|-\ln\left|\hat{\Sigma}\right|\right)\sim\chi_{K\left(K-1\right)/2}^{2}.$

The asymptotic distribution of the likelihood ratio test requires homoskedasticity.

# System Measures of Fit ($$R^{2}$$)¶

Most measures of fit for systems of equations assume that all equations contains a constant (or equivalent). Caution is needed when interpreting if equations exclude constant terms.

## Overall $$R^{2}$$¶

The overall $$R^{2}$$ is defined as

$R^{2}=1-\frac{\sum_{i=1}^{K}SSR_{i}}{\sum_{i=1}^{K}TSS_{i}}$

where $$TSS_{i}$$ is centered if equation $$i$$ contains a constant and uncentered if it does not. When all equations contain constants, it is identical to Judge’s measure.

## McElroy¶

McElroy’s (1977) measure is defined as

$R^{2}=1-\frac{\epsilon^{\prime}\Omega^{-1}\epsilon}{Y^{\prime}\left(\Sigma^{-1}\otimes\left(I_{N}-\frac{\iota\iota^{\prime}}{N}\right)\right)Y}$

where $$\iota$$ is a $$N$$ by 1 vector of 1s. This is implemented as

$R^{2}=1-\frac{\sum_{i=1}^{N}\sum_{j=1}^{K}\hat{\xi}_{ij}^{2}}{\sum_{i=1}^{N}\sum_{j=1}^{K}\hat{\eta}_{ij}^{2}}$

where

\begin{split}\begin{aligned} \hat{\xi} & =\hat{E}\hat{\Sigma}^{-\frac{1}{2}}\\ \hat{E} & =\left[\begin{array}{cccc} \hat{\epsilon}_{1} & \hat{\epsilon}_{2} & \ldots & \hat{\epsilon}_{N}\end{array}\right]\end{aligned}\end{split}

and

\begin{split}\begin{aligned} \hat{\eta} & =\tilde{Y}\hat{\Sigma}^{-\frac{1}{2}}\\ \tilde{Y} & =\left[\begin{array}{cccc} Y_{1}-\hat{\mu}_{1} & Y_{2}-\hat{\mu}_{2} & \ldots & Y_{N}-\hat{\mu}_{N}\end{array}\right].\end{aligned}\end{split}

where the vector of mean parameters is estimated by fitting a SURE to the data (using user specified weights, if provided) where $$X_{i}=\iota$$ contains only a constant. Greene provides an alternative formulation of this measure as

$R^{2}=1-\frac{K}{\mathrm{tr}\left(\hat{\Sigma}^{-1}\hat{\Psi}\right)}$

where $$\hat{\Psi}=N^{-1}\tilde{Y}^{\prime}\tilde{Y}$$ is the covariance of the demeaned data.

## Berndt¶

Berndt’s measure is defined as

$R^{2}=1-\frac{\left|\hat{\Sigma}\right|}{\left|\hat{\Psi}\right|}.$

## Judge¶

Judge’s measure is the naive OLS $$R^{2}$$ for the system,

$R^{2}=1-\frac{\sum_{i=1}^{N}\sum_{j=1}^{K}\hat{E}_{ij}^{2}}{\sum_{i=1}^{N}\sum_{j=1}^{K}\tilde{Y}_{ij}^{2}}.$

## Dhrymes¶

Dhrymes’ measure of fit is a weighted average of the $$R^{2}$$ of each equation,

$R^{2}=\sum_{i=1}^{K}R_{i}^{2}\frac{\hat{\Psi}_{ii}}{\mathrm{tr}\left(\hat{\Psi}\right)}$

where $$R_{i}^{2}$$ is the coefficient of determination from equation $$i$$.

Sources used in writing the code include [Greene], [Stata] and [Henningsen].