Formulas and Mathematical Detail


Interest is in recovering the parameter vector from the model

\[\begin{aligned} y_{i} & =\beta^{\prime}x_{i}+\epsilon_{i}\end{aligned}\]

where \(\beta\) is \(k\) by 1, \(x_{i}\) is a \(k\) by 1 vector of observable variables and \(\epsilon_{i}\) is a scalar error. \(x_{i}\) can be separated in two types of variables. The \(k_{1}\) by 1 set of variables \(x_{1i}\) are exogenous regressors in the sense that \(E\left[x_{1i}\epsilon_{i}\right]=0\). The \(k_{2}\) by 1 variables \(x_{2i}\) are endogenous. A set of \(p\) instruments is available that satisfy the requirements for validity where \(p\geq k_{2}\). The extended model can be written as

\[\begin{split}\begin{aligned} y_{i} & =\underset{\textrm{exogenous}}{\underbrace{\beta_{1}^{\prime}x_{1i}}}+\underset{\textrm{endogenous}}{\underbrace{\beta_{2}^{\prime}x_{2i}}}+\epsilon_{i}\\ x_{2i} & =\underset{\textrm{exogenous}}{\underbrace{\gamma_{1}^{\prime}z_{1i}}}+\underset{\textrm{instruments}}{\underbrace{\gamma_{2}^{\prime}z_{2i}}}+u_{i}\end{aligned}\end{split}\]

The model can be expressed compactly

\[\begin{split}\begin{aligned} Y & =X_{1}\beta_{1}+X_{2}\beta_{1}+\epsilon=X\beta+\epsilon\\ X_{2} & =Z_{1}\gamma_{1}+Z_{2}\gamma_{2}+u=Z\gamma+u\end{aligned}\end{split}\]

The vector of instruments \(z_{i}\) is \(p\) by 1. There are \(n\) observations for all variables. \(k_{c}=1\) if the model contains a constant (either explicit or implicit, i.e., including all categories of a dummy variable). The constant, if included, is in \(x_{1i}\). \(X\) is the \(n\) by \(k\) matrix if regressors where row \(i\) of \(X\) is \(x_{i}^{\prime}\). \(X\) can be partitioned into \(\left[X_{1}\;X_{2}\right]\). \(Z\) is the \(n\) by \(p\) matrix of instruments. The vector \(y\) is \(n\) by 1. Projection matrices for \(X\) is defined \(P_{X}=X\left(X^{\prime}X\right)^{-1}X^{\prime}\). The projection matrix for \(Z\) is similarly defined only using \(Z\). The annihilator matrix for \(X\) is \(M_{X}=I-P_{X}\).

Parameter Estimation

Two-stage Least Squares (2SLS)

The 2SLS estimator is

\[\begin{aligned} \hat{\beta}_{2SLS} & =\left(X^{\prime}P_{Z}X\right)^{-1}\left(X^{\prime}P_{Z}y\right)\end{aligned}\]

Limited Information Maximum Likelihood and k-class Estimators

The LIML or other k-class estimator is

\[\begin{aligned} \hat{\beta}_{\kappa} & =\left(X^{\prime}\left(I-\kappa M_{Z}\right)X\right)^{-1}\left(X^{\prime}\left(I-\kappa M_{Z}\right)y\right)\end{aligned}\]

where \(\kappa\) is the parameter of the class. When \(\kappa=1\) the 2SLS estimator is recovered. When \(\kappa=0\), the OLS estimator is recovered. The LIML estimator is recovered when \(\kappa\) set to


where \(W=\left[y\:X_{2}\right]\) and \(\mathrm{eig}\) returns the eigenvalues.

Generalized Method of Moments (GMM)

The GMM estimator is defined as

\[\begin{aligned} \hat{\beta}_{GMM} & =\left(X^{\prime}ZWZ^{\prime}X\right)^{-1}\left(X^{\prime}ZWZ^{\prime}y\right)\end{aligned}\]

where \(W\) is a positive definite weighting matrix.

Continuously Updating Generalized Method of Moments (GMM-CUE)

The continuously updating GMM estimator solves the minimization problem


where \(\bar{g}\left(\beta\right)=n^{-1}\sum_{i=1}^{n}z_{i}\hat{\epsilon}_{i}\) and \(\hat{\epsilon}_{i}=y_{i}-x_{i}\beta\). Unlike standard GMM, the weight matrix, \(W\) directly depends on the model parameters \(\beta\) and so it is not possible to use a closed form estimator.

Basic Statistics

Let \(\hat{\epsilon}=y-X\hat{\beta}\). The residual sum of squares (RSS) is \(\hat{\epsilon}^{\prime}\hat{\epsilon}\), the model sum of squares (MSS) is \(\hat{\beta}^{\prime}X^{\prime}X\hat{\beta}\) and the total sum of squares (TSS) is \(\left(y-k_{c}\bar{y}\right)^{\prime}\left(y-k_{c}\bar{y}\right)^{\prime}\)where \(\bar{y}\) is the sample average of \(y\). The model \(R^{2}\)is defined

\[\begin{aligned} R^{2} & =1-\frac{\hat{\epsilon}^{\prime}\hat{\epsilon}}{\left(y-k_{c}\bar{y}\right)^{\prime}\left(y-k_{c}\bar{y}\right)^{\prime}}=1-\frac{RSS}{TSS}\end{aligned}\]

and the adjusted \(R^{2}\) is defined

\[\begin{aligned} \bar{R}^{2} & =1-\left(1-R^{2}\right)\frac{N-k_{c}}{N-k}.\end{aligned}\]

The residual variance is \(s^{2}=n^{-1}\hat{\epsilon}^{\prime}\hat{\epsilon}\) unless the debiased flag is used, in which case a small sample adjusted version is estimated \(s^{2}=\left(n-k\right)^{-1}\hat{\epsilon}^{\prime}\hat{\epsilon}\). The model degree of freedom is \(k\) and the residual degree of freedom is \(n-k\).

The model F-statistic is defined

\[\begin{aligned} F & =\hat{\beta}_{-}^{\prime}\hat{V}_{-}^{-1}\dot{\hat{\beta}_{-}}\end{aligned}\]

where the notation \(\hat{\beta}_{-}\) indicates that the constant is excluded from \(\hat{\beta}\) and \(\hat{V}_{-}\) indicates that the row and column corresponding to a constant are excluded. 1 The test statistic is distributed as \(\chi_{k-k_{c}}^{2}.\) If the debiased flag is set, then the test statistic \(F\) is transformed as \(F/\left(k-k_{c}\right)\) and a \(F_{k-k_{c},n-k}\) distribution is used.

Parameter Covariance Estimation

Two-stage LS, LIML and k-class estimators

Four covariance estimators are available. The first is the standard homoskedastic covariance, defined as

\[\begin{aligned} \hat{\Sigma}=n^{-1}s^{2}\left(\frac{X^{\prime}\left(I-\kappa M_{z}\right)X}{n}\right)^{-1} & =n^{-1}s^{2}\hat{A}.\end{aligned}\]

Note that this estimator can be expressed as

\[\begin{aligned} \hat{\Sigma}=n^{-1}\hat{A}^{-1}\left\{ s^{2}\hat{A}\right\} \hat{A}^{-1} & =n^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}.\end{aligned}\]

All estimators take this form and only differ in how the asymptotic covariance of the scores, \(B\), is estimated. For the homoskedastic covariance estimator, \(\hat{B}=s^{2}\hat{A}.\) The score covariance in the heteroskedasticity robust covariance estimator is

\[\begin{aligned} \hat{B} & =n^{-1}\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}\hat{x}_{i}\hat{x}_{i}^{\prime}=n^{-1}\sum_{i=1}^{n}\hat{\xi}_{i}\hat{\xi}_{i}^{\prime}.\end{aligned}\]

where \(\hat{x_{i}}\) is row \(i\) of \(\hat{X}=P_{Z}X\) and \(\hat{\xi}_{i}=\hat{\epsilon}_{i}\hat{x}_{i}\).

The kernel covariance estimator is robust to both heteroskedasticity and autocorrelation and is defined as

\[\begin{split}\begin{aligned} \hat{B} & =\hat{\Gamma}_{0}+\sum_{i=1}^{n-1}K\left(\frac{i}{h}\right)\left(\hat{\Gamma}_{i}+\hat{\Gamma}_{i}^{\prime}\right)\\ \hat{\Gamma_{j}} & =n^{-1}\sum_{i=j+1}^{n}\hat{\xi}_{i-j}\hat{\xi}_{i}^{\prime}\end{aligned}\end{split}\]

where \(K\left(\frac{i}{h}\right)\) is a kernel weighting function where \(h\) is the kernel bandwidth.

The one-way clustered covariance estimator is defined as

\[\begin{aligned} n^{-1}\sum_{j=1}^{g}\left(\sum_{i\in\mathcal{G}_{j}}\hat{\xi}_{i}\right)\left(\sum_{i\in\mathcal{G}_{j}}\hat{\xi}_{i}\right)^{\prime}\end{aligned}\]

where \(\sum_{i\in\mathcal{G}_{j}}\hat{\xi}_{i}\) is the sum of the scores for all members in group \(\mathcal{G}_{j}\) and \(g\) is the number of groups.

If the debiased flag is used to perform a small-sample adjustment, all estimators except the clustered covariance are rescaled by \(\left(n-k\right)/n\). The clustered covariance is rescaled by \(\left(\left(n-k\right)\left(n-1\right)/n^{2}\right)\left(\left(g-1\right)/g\right)\). 2

Standard Errors

Standard errors are defined as


where \(e_{j}\) is a vector of 0s except in location \(j\) which is 1.


T-statistics test the null \(H_{0}:\beta_{j}=0\) against a 2-sided alternative and are defined as



P-values are computes using 2-sided tests,


If the covariance estimator was debiased, a Student’s t distribution with \(n-k\) degrees of freedom is used,

\[\begin{aligned} Pr\left(\left|z\right|>Z\right) & =2-2t_{n-k}\left(\left|z\right|\right)\end{aligned}\]

where \(t_{n-k}\left(\cdot\right)\) is the CDF of a Student’s T distribution.

Confidence Intervals

Confidence intervals are constructed as

\[CI_{i,1-\alpha}=\hat{\beta}_{i}\pm q_{\alpha/2}\times\hat{\sigma}_{\beta_{i}}\]

where \(q_{\alpha/2}\) is the \(\alpha/2\) quantile of a standard Normal distribution or a Student’s t. The Student’s t is used when a debiased covariance estimator is used.

Linear Hypothesis Tests

Linear hypothesis tests examine the validity of nulls of the form \(H_{0}:R\beta-r=0\) and are implemented using a Wald test statistic


where \(q\) is the \(rank\left(R\right)\) which is usually the number of rows in \(R\) . If the debiased flag is used, then \(W/q\) is reported and critical and p-values are taken from a \(F_{q,n-k}\) distribution.

GMM Covariance estimators

GMM covariance depends on the weighting matrix used in estimation and the assumed covariance of the scores. In most applications these are the same and so the inefficient form,


will collapse to the simpler form


when \(W=S^{-1}\). When an unadjusted (homoskedastic) covariance is used,


where \(\tilde{s}^{2}=n^{-1}\sum_{i=1}^{n}\left(\epsilon_{i}-\bar{\epsilon}\right)^{2}\) subtracts the mean which may be non-zero if the model is overidentified. Like previous covariance estimators, if the debiased flag is used, \(n^{-1}\) is replaced by \(\left(n-k\right)^{-1}\). When a robust (heteroskedastic) covariance is used, the estimator of \(S\) is modified to


If the debiased flag is used, \(n^{-1}\) is replaced by \(\left(n-k\right)^{-1}\).

Kernel covariance estimators of \(S\) take the form

\[\begin{split}\begin{aligned} \hat{S} & =\hat{\Gamma}_{0}+\sum_{i=1}^{n-1}k\left(i/h\right)\left(\hat{\Gamma}_{i}+\hat{\Gamma}_{i}^{\prime}\right)\\ \hat{\Gamma_{j}} & =n^{-1}\sum_{i=j+1}^{n}\hat{\epsilon}_{i-j}\hat{\epsilon}_{i}z_{i-j}^{\prime}z_{i}\end{aligned}\end{split}\]

and \(k\left(\cdot\right)\) is a kernel weighting function with bandwidth \(h\). If the debiased flag is used, \(n^{-1}\) is replaced by \(\left(n-k\right)^{-1}\).

The one-way clustered covariance estimator is defined as


where \(\sum_{i\in\mathcal{G}_{j}}\hat{\epsilon}_{i}z_{i}\) is the sum of the moment conditional for all members in group \(\mathcal{G}_{j}\) and \(g\) is the number of groups. If the debiased flag is used, the \(n^{-1}\) term is replaced by


GMM Weight Estimators

The GMM optimal weight estimators are identical to the the estimators of \(S\) with two notable exceptions. First, they are never debiased and so always use \(n^{-1}\). Second, if the center flag is true, the demeaned moment conditions defined as \(\tilde{g}_{i}=z_{i}\hat{\epsilon}_{i}-\overline{z\epsilon}\) are used in-place of \(g_{i}\) in the robust, kernel and clustered estimators. The unadjusted estimator is always centered, and so this option has no effect.


2SLS and LIML Post-estimation Results


Sargan’s test of over-identifying restrictions examines whether the variance of the IV residuals is similar to that of the OLS residuals. The test statistic is computed


where \(\hat{\epsilon}\) are the IV residuals and \(M_{Z}\) is the annihilator matrix using all exogenous variables.\(\nu\) is the number of overidentifying restrictions, which is the number of instruments minus the number of endogenous variables, \(p-k_{2}\).


Basmann’s test is a small-sample corrected version of Sargan’s test of over-identifying restrictions. It has the same distribution. Let \(s\) be Sargan’s test statistic, then Basmann’s test statistic is


Wooldridge score

Wooldridge’s score test of exogeneity examines the magnitude of the correlation between the OLS residuals and a an appropriately constructed set of residuals of the instruments. Define \(e=M_{X}Y\) and \(\nu=M_{X}M_{Z}X_{2}\). Then the test statistic is computed from the regression


as \(nR^{2}\sim\chi_{k_{2}}^{2}\).

Wooldridge regression

Wooldridge’s regression test of exogeneity is closely related to the score test and is generally more powerful. It also inherits robustness to heteroskedasticity and/or autocorrelation the comes from the choice of covariance estimator used in the model. Define \(R=M_{Z}X_{2}\). The test is implemented in a regression of




where \(\hat{\Sigma}_{\gamma}\) is the block of the covariance matrix corresponding to the \(\gamma\) parameters. \(\hat{\Sigma}\) is estimated using the same covariance estimator as the model fit.

Wooldridge’s Test of Overidentifying restrictions

Wooldridge’s test is a score test examining whether the component of the instrument that is uncorrelated with both the included exogenous and the fitted exogenous is uncorrelated with the IV residuals. Define \(\tilde{Z}=M_{\left[X_{1}\:\hat{X}_{2}\right]}Z_{2,1:q}\) where \(\hat{X_{2}}\) are the fitted values from the first stage regression of the endogenous on all exogenous variables and \(Z_{2,1:q}\) contains any \(q\) columns of \(Z_{2}\), \(q=p-k_{2}\) . The test statistic is computed using a regression of 1s on the test functions \(\hat{\epsilon}_{i}\tilde{z}_{i,j}\) for \(j=1,\ldots,q\) which should have expected value 0.


The test statistic is \(nR^{2}\sim\chi_{q}^{2}\) from the regression.


The Andersen-Rubin test of overidentification examines the magnitude of the LIML \(\hat{\kappa}\)which should be close to unity when the model is not overidentified.


where \(q=p-k_{2}\).

Basman’s F

Basmann’s F test of overidentification also examines the magnitude of the LIML \(\hat{\kappa}\). The test statistic is

\[\hat{\kappa}(n-p)/q\sim F_{q,n-n_{instr}}\]

where \(q=p-k_{2}\).

Durbin and Wu-Haussman

Durbin’s and the Wu-Hausman tests of exogeneity test of exogeneity is depends on the variance of the residuals when come endogenous variables are treated as exogenous against the case where they are treated as endogenous. Durbin’s test statistic is

\[\begin{split}\begin{aligned} \delta= & \hat{\epsilon}'_{e}P_{[z,w]}\hat{\epsilon}_{e}-\hat{\epsilon}'_{c}P_{z}\hat{\epsilon}_{c}\\ D= & \delta/(\hat{\epsilon}'_{e}\hat{\epsilon}_{e})/n\sim\chi_{q}^{2}\end{aligned}\end{split}\]

and the Wu-Hausman test statistic is

\[\begin{aligned} WH= & \frac{\delta/q}{(\hat{\epsilon}'_{e}\hat{\epsilon}_{e}-\delta)/v}\sim F_{q,\nu}\end{aligned}\]

where \(\hat{\epsilon}_{e}\) treats the selected set of endogenous variables as exogenous (efficient estimate) and \(\hat{\epsilon}_{c}\) is a consistent estimator if these variables are endogenous.\(P_{\left[Z\,W\right]}\) is the projection matrix containing all exogenous variables including the instrument as well as the variables being tested for endogeneity \(\left(W\right)\).\(q\) is the number of variables being tested for exogeneity and \(\nu=n-k1-k2-q\).

GMM Post-estimation Results


The J-statistic tests whether the model is over-identified, and is defined


where \(\bar{g}=n^{-1}\sum\hat{\epsilon}_{i}z_{i}\) and \(W\) is a consistent estimator of the variance of \(\sqrt{n}\bar{g}\). The degree of freedom is \(q=p-k_{2}\).


The C-statistic tests exogeneity by treating a the set of endogenous variables as exogenous. In practice this means that are included in the GMM moment conditions, and so a likelihood-ratio-like test statistic can be computed as


where \(J_{e}\) is the J-statistic treating the tested variables as exogenous and \(J_{c}\) leaves then as endogenous. The optimal weighting matrix is computed from the expanded model (efficient) and used to estimate parameters in both models. This ensures that the test statistic is positive.

First-stage Estimation Analysis

Partial R2 and Partial F-statistic

The \(R^{2}\) is reported after orthogonalizing the instruments from included exogenous variables, so that the model estimated is


where \(\tilde{Z}_{2}=M_{X_{1}}\tilde{Z}\). The partial \(F\)-statistic is the F-statistic from this regression. It is implemented as a standard \(F\)-statistic when the data is assumed to be homoskedastic with an \(F_{p_{2},n-p_{2}}\) distribution. In all other cases, a quadratic form is used with an asymptotic \(\chi_{p_{2}}^{2}\) distribution testing \(H_{0}:\gamma=0\).

Shea’s R2

Shea’s \(R^{2}\) is defined as the ratio of the parameter variances under OLS and 2SLS estimation standardized by the unexplained variance under each,


If the estimator under 2SLS was as good as under OLS, both ratios would be 1 and Shea’s \(R^{2}=1\). On the other hand, the worse the \(IV\) fit in terms of either \(R^{2}\) or the parameter variances, the lower the value reported by Shea’s \(R^{2}\).

Kernel Weights and Bandwidth Selection

Kernel weights

In all formulas, \(m\) is the kernel bandwidth parameter.

  • Bartlett

  • Parzen

    \[\begin{split}\begin{aligned} z_{i} & =\frac{i}{m+1},\,i<m\\ w_{i} & =1-6z_{i}^{2}+6z_{i}^{3},z\leq0.5\\ w_{i} & =2(1-z_{i})^{3},z>0.5\end{aligned}\end{split}\]
  • Quadratic-Spectral

    \[\begin{split}\begin{aligned} z_{i} & =\frac{6\pi i}{5m}\\ w_{0} & =1\\ w_{i} & =3(\sin(z_{i})/z_{i}-\cos(z_{i}))/z_{i}^{2},\:i\geq1\end{aligned}\end{split}\]

Optimal BW selection


Constant Detection

Whether a model contains a constant or equivalent is tested using three tests. These are executed in order and so once a constant is detected, the others are not executed. The simplest method to ensure that a constant is correctly detected is to include a columns of 1s.

  1. A column with only 1.0s

  2. A column with a maximum minus minimum equal to 0 and that is not all 0s.

  3. Whether the rank of \(X\) is the same as \(\left[1_{N}\:X\right]\). If these are the same, then the model contains a constant equivalent (e.g., dummies for all categories).


If the model contains an implicit constant, e.g., all categories of a dummy, one of the categories is excluded when computing the test statistic. The choice of category to drop has no effect and is equivalent to reparameterizing the model with a constant and excluding one category of dummy.


This somewhat non-obvious choice is driven by Stata compatibility.


Sources used in writing the code include [Baltagi], [BaumEtAl] and [CamEtAl], [CamTri05], [CamTri09], [Greene], [NewWes94], [Stata], [Wool10] and [Wool12].