# Formulas and Mathematical Detail

## Notation

Interest is in recovering the parameter vector \(\beta\) from the model

\[y_{i}=x_{i}^{\prime}\beta+\epsilon_{i}\]

where \(\beta\) is \(k\) by 1, \(x_{i}\) is a \(k\) by 1 vector of observable variables and \(\epsilon_{i}\) is a scalar error. \(x_{i}\) can be separated into two types of variables. The \(k_{1}\) by 1 set of variables \(x_{1i}\) are exogenous regressors in the sense that \(E\left[x_{1i}\epsilon_{i}\right]=0\). The \(k_{2}\) by 1 variables \(x_{2i}\) are endogenous. A set of \(p\) instruments is available that satisfy the requirements for validity where \(p\geq k_{2}\). The extended model can be written as

\[y_{i}=x_{1i}^{\prime}\beta_{1}+x_{2i}^{\prime}\beta_{2}+\epsilon_{i}\]

where \(\beta_{1}\) and \(\beta_{2}\) are the blocks of \(\beta\) that multiply \(x_{1i}\) and \(x_{2i}\). The model can be expressed compactly as

\[y=X\beta+\epsilon=X_{1}\beta_{1}+X_{2}\beta_{2}+\epsilon\]

The vector of instruments \(z_{i}\) is \(p\) by 1. There are \(n\) observations for all variables. \(k_{c}=1\) if the model contains a constant (either explicit or implicit, i.e., including all categories of a dummy variable). The constant, if included, is in \(x_{1i}\). \(X\) is the \(n\) by \(k\) matrix of regressors where row \(i\) of \(X\) is \(x_{i}^{\prime}\). \(X\) can be partitioned into \(\left[X_{1}\;X_{2}\right]\). \(Z\) is the \(n\) by \(p\) matrix of instruments. The vector \(y\) is \(n\) by 1. The projection matrix for \(X\) is defined as \(P_{X}=X\left(X^{\prime}X\right)^{-1}X^{\prime}\). The projection matrix for \(Z\) is defined in the same way using \(Z\) in place of \(X\). The annihilator matrix for \(X\) is \(M_{X}=I-P_{X}\).

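## Parameter Estimation

### Two-stage Least Squares (2SLS)

The 2SLS estimator is

\[\hat{\beta}_{2SLS}=\left(X^{\prime}Z\left(Z^{\prime}Z\right)^{-1}Z^{\prime}X\right)^{-1}X^{\prime}Z\left(Z^{\prime}Z\right)^{-1}Z^{\prime}y=\left(X^{\prime}P_{Z}X\right)^{-1}X^{\prime}P_{Z}y\]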

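### Limited Information Maximum Likelihood and k-class Estimators

The LIML or other k-class estimator is

\[\hat{\beta}_{\kappa}=\left(X^{\prime}\left(I-\kappa M_{Z}\right)X\right)^{-1}X^{\prime}\left(I-\kappa M_{Z}\right)y\]

where \(\kappa\) is the parameter of the class. When \(\kappa=1\) the 2SLS estimator is recovered. When \(\kappa=0\), the OLS estimator is recovered. The LIML estimator is recovered when \(\kappa\) is set to

\[\hat{\kappa}=\min\mathrm{eig}\left[\left(W^{\prime}M_{Z}W\right)^{-\frac{1}{2}}\left(W^{\prime}M_{X_{1}}W\right)\left(W^{\prime}M_{Z}W\right)^{-\frac{1}{2}}\right]\]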

where \(W=\left[y\:X_{2}\right]\) and \(\mathrm{eig}\) returns the eigenvalues.
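The k-class family can be checked numerically. The sketch below uses simulated data and assumes that \(Z\) stacks the included exogenous regressors \(X_{1}\) with the excluded instruments, and that the LIML \(\kappa\) is the smallest eigenvalue of the symmetric form built from \(M_{Z}\) and \(M_{X_{1}}\). The helper names `annihilator`, `k_class` and `liml_kappa` are illustrative only and are not part of any library API.

```python
import numpy as np


def annihilator(A):
    """Return M_A = I - A (A'A)^{-1} A'."""
    n = A.shape[0]
    return np.eye(n) - A @ np.linalg.solve(A.T @ A, A.T)


def k_class(y, X, Z, kappa):
    """k-class estimate (X'(I - kappa M_Z) X)^{-1} X'(I - kappa M_Z) y."""
    weight = np.eye(y.shape[0]) - kappa * annihilator(Z)
    return np.linalg.solve(X.T @ weight @ X, X.T @ weight @ y)


def liml_kappa(y, X1, X2, Z):
    """Smallest eigenvalue of (W'M_Z W)^{-1/2} (W'M_X1 W) (W'M_Z W)^{-1/2}."""
    W = np.column_stack([y, X2])
    a = W.T @ annihilator(Z) @ W
    b = W.T @ annihilator(X1) @ W
    # a^{-1} b is similar to the symmetric form, so the eigenvalues agree.
    return float(np.min(np.linalg.eigvals(np.linalg.solve(a, b)).real))


# Simulated data (illustrative only): one endogenous regressor, two instruments.
rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal((n, 2))
u = rng.standard_normal(n)
X1 = np.ones((n, 1))                          # included exogenous: a constant
X2 = (z @ np.array([1.0, 1.0]) + u)[:, None]  # endogenous regressor
X = np.column_stack([X1, X2])
Z = np.column_stack([X1, z])                  # assumption: Z contains X1
y = 1.0 + 2.0 * X2[:, 0] + 0.5 * u + rng.standard_normal(n)

beta_ols = k_class(y, X, Z, 0.0)    # kappa = 0 gives OLS
beta_2sls = k_class(y, X, Z, 1.0)   # kappa = 1 gives 2SLS
kappa = liml_kappa(y, X1, X2, Z)
beta_liml = k_class(y, X, Z, kappa)
```

Forming the full \(n\) by \(n\) annihilator is wasteful for large samples; it is used here only to mirror the formulas.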

### Generalized Method of Moments (GMM)

The GMM estimator is defined as

\[\hat{\beta}_{GMM}=\left(X^{\prime}ZWZ^{\prime}X\right)^{-1}X^{\prime}ZWZ^{\prime}y\]

where \(W\) is a positive definite weighting matrix.
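A minimal two-step sketch of linear GMM on simulated data follows. It assumes the closed form \(\left(X^{\prime}ZWZ^{\prime}X\right)^{-1}X^{\prime}ZWZ^{\prime}y\), a first-step weight \(W=\left(Z^{\prime}Z/n\right)^{-1}\) (which reproduces 2SLS) and an uncentered heteroskedasticity-robust second-step weight. The function name `gmm_estimator` is hypothetical.

```python
import numpy as np


def gmm_estimator(y, X, Z, W):
    """Linear GMM estimate (X'Z W Z'X)^{-1} X'Z W Z'y."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))


# Simulated data (illustrative only).
rng = np.random.default_rng(1)
n = 400
z = rng.standard_normal((n, 2))
u = rng.standard_normal(n)
x2 = z @ np.array([1.0, 0.5]) + u
X = np.column_stack([np.ones(n), x2])
Z = np.column_stack([np.ones(n), z])
y = 0.5 + 1.5 * x2 + 0.5 * u + rng.standard_normal(n)

# Step 1: W = (Z'Z/n)^{-1} reproduces the 2SLS estimate.
W1 = np.linalg.inv(Z.T @ Z / n)
beta_1 = gmm_estimator(y, X, Z, W1)

# Step 2: robust weight built from the step-1 residuals.
eps = y - X @ beta_1
g = Z * eps[:, None]                 # row i is eps_i * z_i'
W2 = np.linalg.inv(g.T @ g / n)
beta_2 = gmm_estimator(y, X, Z, W2)
```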

### Continuously Updating Generalized Method of Moments (GMM-CUE)

The continuously updating GMM estimator solves the minimization problem

\[\min_{\beta}\;n\bar{g}\left(\beta\right)^{\prime}W\left(\beta\right)^{-1}\bar{g}\left(\beta\right)\]

where \(\bar{g}\left(\beta\right)=n^{-1}\sum_{i=1}^{n}z_{i}\hat{\epsilon}_{i}\) and \(\hat{\epsilon}_{i}=y_{i}-x_{i}^{\prime}\beta\). Unlike standard GMM, the weight matrix \(W\) depends directly on the model parameters \(\beta\), and so it is not possible to use a closed-form estimator.
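Because no closed form exists, the CUE objective has to be minimized numerically. The sketch below evaluates the objective for a single endogenous regressor with no constant and locates the minimum by a crude grid search. It assumes \(W\left(\beta\right)=n^{-1}\sum\hat{\epsilon}_{i}^{2}z_{i}z_{i}^{\prime}\) (robust and uncentered). The grid search stands in for a proper optimizer, and the name `cue_objective` is illustrative.

```python
import numpy as np


def cue_objective(beta, y, x, Z):
    """n * g_bar(beta)' W(beta)^{-1} g_bar(beta) for a scalar beta."""
    n = y.shape[0]
    eps = y - x * beta
    g = Z * eps[:, None]              # row i is eps_i * z_i'
    g_bar = g.mean(axis=0)
    W = g.T @ g / n                   # weight recomputed at every beta
    return float(n * g_bar @ np.linalg.solve(W, g_bar))


# Simulated data (illustrative only): true coefficient is 1.5.
rng = np.random.default_rng(0)
n = 2000
Z = rng.standard_normal((n, 2))
u = rng.standard_normal(n)
x = Z @ np.array([1.0, 0.5]) + u
y = 1.5 * x + 0.5 * u + rng.standard_normal(n)

# Grid search in place of a numerical optimizer.
grid = np.linspace(0.0, 3.0, 601)
values = np.array([cue_objective(b, y, x, Z) for b in grid])
beta_cue = float(grid[values.argmin()])
```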

## Basic Statistics

Let \(\hat{\epsilon}=y-X\hat{\beta}\). The residual sum of squares (RSS) is \(\hat{\epsilon}^{\prime}\hat{\epsilon}\), the model sum of squares (MSS) is \(\hat{\beta}^{\prime}X^{\prime}X\hat{\beta}\) and the total sum of squares (TSS) is \(\left(y-k_{c}\bar{y}\right)^{\prime}\left(y-k_{c}\bar{y}\right)\) where \(\bar{y}\) is the sample average of \(y\). The model \(R^{2}\) is defined

\[R^{2}=1-\frac{RSS}{TSS}\]

and the adjusted \(R^{2}\) is defined

\[\bar{R}^{2}=1-\left(1-R^{2}\right)\frac{n-k_{c}}{n-k}\]

The residual variance is \(s^{2}=n^{-1}\hat{\epsilon}^{\prime}\hat{\epsilon}\) unless the debiased flag is used, in which case a small-sample adjusted version is estimated, \(s^{2}=\left(n-k\right)^{-1}\hat{\epsilon}^{\prime}\hat{\epsilon}\). The model degrees of freedom is \(k\) and the residual degrees of freedom is \(n-k\).
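These quantities are simple to compute by hand. The sketch below assumes \(R^{2}=1-RSS/TSS\) and an adjusted \(R^{2}\) of \(1-\left(1-R^{2}\right)\left(n-k_{c}\right)/\left(n-k\right)\). The function `basic_stats` and its arguments are hypothetical names used for illustration.

```python
import numpy as np


def basic_stats(y, X, beta_hat, has_const=True, debiased=False):
    """RSS, TSS, R2, adjusted R2 and residual variance for a fitted model."""
    n, k = X.shape
    k_c = 1 if has_const else 0
    resid = y - X @ beta_hat
    rss = float(resid @ resid)
    dev = y - k_c * y.mean()          # demean only if a constant is present
    tss = float(dev @ dev)
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (1.0 - r2) * (n - k_c) / (n - k)
    s2 = rss / (n - k) if debiased else rss / n
    return {"rss": rss, "tss": tss, "r2": r2, "r2_adj": r2_adj, "s2": s2}


# Tiny hand-checkable example: fitted values are 1, 2, 3, 4.
y = np.array([1.0, 2.0, 4.0, 3.0])
X = np.column_stack([np.ones(4), np.arange(4.0)])
beta_hat = np.array([1.0, 1.0])

stats = basic_stats(y, X, beta_hat)
stats_debiased = basic_stats(y, X, beta_hat, debiased=True)
```

Here the residuals are 0, 0, 1 and -1, so RSS is 2, TSS is 5 and \(R^{2}\) is 0.6.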

The model F-statistic is defined

\[F=\hat{\beta}_{-}^{\prime}\hat{V}_{-}^{-1}\hat{\beta}_{-}\]

where the notation \(\hat{\beta}_{-}\) indicates that the constant is excluded from \(\hat{\beta}\) and \(\hat{V}_{-}\) indicates that the row and column corresponding to a constant are excluded. [1] The test statistic is distributed as \(\chi_{k-k_{c}}^{2}\). If the debiased flag is set, then the test statistic \(F\) is transformed as \(F/\left(k-k_{c}\right)\) and an \(F_{k-k_{c},n-k}\) distribution is used.

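## Parameter Covariance Estimation

### Two-stage LS, LIML and k-class estimators

Four covariance estimators are available. The first is the standard homoskedastic covariance, defined as

\[\hat{\Sigma}=n^{-1}s^{2}\hat{A}^{-1},\qquad\hat{A}=n^{-1}X^{\prime}\left(I-\kappa M_{Z}\right)X\]

Note that this estimator can be expressed as

\[\hat{\Sigma}=n^{-1}\hat{A}^{-1}\left(s^{2}\hat{A}\right)\hat{A}^{-1}=n^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}\]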

All estimators take this form and only differ in how the asymptotic covariance of the scores, \(B\), is estimated. For the homoskedastic covariance estimator, \(\hat{B}=s^{2}\hat{A}\). The score covariance in the heteroskedasticity robust covariance estimator is

\[\hat{B}=n^{-1}\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}\hat{x}_{i}\hat{x}_{i}^{\prime}=n^{-1}\sum_{i=1}^{n}\hat{\xi}_{i}\hat{\xi}_{i}^{\prime}\]

where \(\hat{x}_{i}^{\prime}\) is row \(i\) of \(\hat{X}=P_{Z}X\) and \(\hat{\xi}_{i}=\hat{\epsilon}_{i}\hat{x}_{i}\).

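The kernel covariance estimator is robust to both heteroskedasticity and autocorrelation and is defined as

\[\hat{B}=\hat{\Gamma}_{0}+\sum_{i=1}^{n-1}K\left(\frac{i}{h}\right)\left(\hat{\Gamma}_{i}+\hat{\Gamma}_{i}^{\prime}\right),\qquad\hat{\Gamma}_{i}=n^{-1}\sum_{j=i+1}^{n}\hat{\xi}_{j}\hat{\xi}_{j-i}^{\prime}\]

where \(K\left(\frac{i}{h}\right)\) is a kernel weighting function and \(h\) is the kernel bandwidth.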

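The one-way clustered covariance estimator is defined as

\[\hat{B}=n^{-1}\sum_{j=1}^{g}\left(\sum_{i\in\mathcal{G}_{j}}\hat{\xi}_{i}\right)\left(\sum_{i\in\mathcal{G}_{j}}\hat{\xi}_{i}\right)^{\prime}\]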

where \(\sum_{i\in\mathcal{G}_{j}}\hat{\xi}_{i}\) is the sum of the scores for all members in group \(\mathcal{G}_{j}\) and \(g\) is the number of groups.

If the debiased flag is used to perform a small-sample adjustment, all estimators except the clustered covariance are rescaled by \(\left(n-k\right)/n\). The clustered covariance is rescaled by \(\left(\left(n-k\right)\left(n-1\right)/n^{2}\right)\left(\left(g-1\right)/g\right)\). [2]
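The sandwich structure \(n^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}\) can be sketched for 2SLS (\(\kappa=1\)) as follows. The code assumes \(\hat{A}=n^{-1}X^{\prime}P_{Z}X\), uses simulated data and omits the debiased rescaling. The function name `tsls_covariances` is illustrative.

```python
import numpy as np


def tsls_covariances(y, X, Z):
    """Homoskedastic and heteroskedasticity-robust 2SLS covariances."""
    n = y.shape[0]
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)      # P_Z X
    beta = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)   # 2SLS estimate
    eps = y - X @ beta                                 # residuals use X, not X_hat
    s2 = eps @ eps / n
    A = X_hat.T @ X_hat / n                            # equals X'P_Z X / n
    A_inv = np.linalg.inv(A)
    xi = X_hat * eps[:, None]                          # scores eps_i * x_hat_i
    B_homo = s2 * A
    B_robust = xi.T @ xi / n
    cov_homo = A_inv @ B_homo @ A_inv / n
    cov_robust = A_inv @ B_robust @ A_inv / n
    return beta, cov_homo, cov_robust


# Simulated data (illustrative only).
rng = np.random.default_rng(2)
n = 300
z = rng.standard_normal((n, 2))
u = rng.standard_normal(n)
x2 = z @ np.array([1.0, 1.0]) + u
X = np.column_stack([np.ones(n), x2])
Z = np.column_stack([np.ones(n), z])
y = 1.0 + 2.0 * x2 + 0.5 * u + rng.standard_normal(n)

beta, cov_homo, cov_robust = tsls_covariances(y, X, Z)
se_homo = np.sqrt(np.diag(cov_homo))
se_robust = np.sqrt(np.diag(cov_robust))
```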

### Standard Errors

Standard errors are defined as

\[s.e.\left(\hat{\beta}_{j}\right)=\sqrt{e_{j}^{\prime}\hat{\Sigma}e_{j}}\]

where \(e_{j}\) is a vector of 0s except in location \(j\) which is 1.

### T-statistics

T-statistics test the null \(H_{0}:\beta_{j}=0\) against a 2-sided alternative and are defined as

\[z=\frac{\hat{\beta}_{j}}{s.e.\left(\hat{\beta}_{j}\right)}\]

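### P-values

P-values are computed using 2-sided tests,

\[Pr\left(\left|z\right|>Z\right)=2-2\Phi\left(\left|z\right|\right)\]

where \(\Phi\left(\cdot\right)\) is the CDF of a standard Normal distribution. If the covariance estimator was debiased, a Student’s t distribution with \(n-k\) degrees of freedom is used,

\[Pr\left(\left|z\right|>Z\right)=2-2t_{n-k}\left(\left|z\right|\right)\]

where \(t_{n-k}\left(\cdot\right)\) is the CDF of a Student’s t distribution.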

### Confidence Intervals

Confidence intervals are constructed as

\[\hat{\beta}_{j}\pm q_{\alpha/2}\times s.e.\left(\hat{\beta}_{j}\right)\]

where \(q_{\alpha/2}\) is the \(\alpha/2\) quantile of a standard Normal distribution or a Student’s t. The Student’s t is used when a debiased covariance estimator is used.
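For the Normal case, the t-statistic, 2-sided p-value and confidence interval can be sketched with the standard library alone. The Student’s t variant is not covered here because the standard library has no t distribution. The function name `z_inference` is hypothetical.

```python
from statistics import NormalDist


def z_inference(beta_j, se_j, alpha=0.05):
    """t-statistic, 2-sided Normal p-value and (1 - alpha) confidence interval."""
    norm = NormalDist()
    z = beta_j / se_j
    p_value = 2.0 - 2.0 * norm.cdf(abs(z))
    q = norm.inv_cdf(alpha / 2.0)     # alpha/2 quantile, a negative number
    ci = (beta_j + q * se_j, beta_j - q * se_j)
    return z, p_value, ci


z_stat, p_value, ci = z_inference(1.96, 1.0)
```

With an estimate of 1.96 and a standard error of 1, the p-value is very close to 0.05 and the lower end of the 95% interval sits just above zero.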

### Linear Hypothesis Tests

Linear hypothesis tests examine the validity of nulls of the form \(H_{0}:R\beta-r=0\) and are implemented using a Wald test statistic

\[W=\left(R\hat{\beta}-r\right)^{\prime}\left[R\hat{\Sigma}R^{\prime}\right]^{-1}\left(R\hat{\beta}-r\right)\sim\chi_{q}^{2}\]

where \(q\) is \(\mathrm{rank}\left(R\right)\), which is usually the number of rows in \(R\). If the debiased flag is used, then \(W/q\) is reported and critical values and p-values are taken from an \(F_{q,n-k}\) distribution.
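A small worked example of the Wald statistic follows. It assumes the quadratic form \(\left(R\hat{\beta}-r\right)^{\prime}\left[R\hat{\Sigma}R^{\prime}\right]^{-1}\left(R\hat{\beta}-r\right)\), where \(\hat{\Sigma}\) is the parameter covariance (already scaled by \(n^{-1}\)). The function name `wald_statistic` is illustrative.

```python
import numpy as np


def wald_statistic(beta_hat, cov, R, r):
    """Return the Wald statistic and its degrees of freedom q = rank(R)."""
    diff = R @ beta_hat - r
    q = int(np.linalg.matrix_rank(R))
    stat = float(diff @ np.linalg.solve(R @ cov @ R.T, diff))
    return stat, q


beta_hat = np.array([1.0, 2.0])
cov = np.diag([0.25, 1.0])

# Joint null: both coefficients are zero.
stat_joint, q_joint = wald_statistic(beta_hat, cov, np.eye(2), np.zeros(2))

# Single restriction: beta_1 - beta_2 = 0.
R = np.array([[1.0, -1.0]])
stat_diff, q_diff = wald_statistic(beta_hat, cov, R, np.zeros(1))
```

The joint statistic is \(1/0.25+4/1=8\) with 2 degrees of freedom, and the difference statistic is \(1/1.25=0.8\) with 1 degree of freedom.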

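### GMM Covariance estimators

GMM covariance depends on the weighting matrix used in estimation and the assumed covariance of the scores. In most applications these are the same and so the inefficient form,

\[\hat{\Sigma}=n^{-1}\left[\left(\frac{X^{\prime}Z}{n}\right)W\left(\frac{Z^{\prime}X}{n}\right)\right]^{-1}\left[\left(\frac{X^{\prime}Z}{n}\right)WSW\left(\frac{Z^{\prime}X}{n}\right)\right]\left[\left(\frac{X^{\prime}Z}{n}\right)W\left(\frac{Z^{\prime}X}{n}\right)\right]^{-1}\]

will collapse to the simpler form

\[\hat{\Sigma}=n^{-1}\left[\left(\frac{X^{\prime}Z}{n}\right)S^{-1}\left(\frac{Z^{\prime}X}{n}\right)\right]^{-1}\]

when \(W=S^{-1}\). When an unadjusted (homoskedastic) covariance is used,

\[\hat{S}=\tilde{s}^{2}n^{-1}\sum_{i=1}^{n}z_{i}z_{i}^{\prime}\]

where \(\tilde{s}^{2}=n^{-1}\sum_{i=1}^{n}\left(\epsilon_{i}-\bar{\epsilon}\right)^{2}\) subtracts the mean, which may be non-zero if the model is overidentified. Like previous covariance estimators, if the debiased flag is used, \(n^{-1}\) is replaced by \(\left(n-k\right)^{-1}\). When a robust (heteroskedastic) covariance is used, the estimator of \(S\) is modified to

\[\hat{S}=n^{-1}\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}z_{i}z_{i}^{\prime}\]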

If the debiased flag is used, \(n^{-1}\) is replaced by \(\left(n-k\right)^{-1}\).

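Kernel covariance estimators of \(S\) take the form

\[\hat{S}=\hat{\Gamma}_{0}+\sum_{i=1}^{n-1}k\left(\frac{i}{h}\right)\left(\hat{\Gamma}_{i}+\hat{\Gamma}_{i}^{\prime}\right)\]

where \(\hat{\Gamma}_{i}=n^{-1}\sum_{j=i+1}^{n}\hat{\epsilon}_{j}\hat{\epsilon}_{j-i}z_{j}z_{j-i}^{\prime}\) and \(k\left(\cdot\right)\) is a kernel weighting function with bandwidth \(h\). If the debiased flag is used, \(n^{-1}\) is replaced by \(\left(n-k\right)^{-1}\).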

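The one-way clustered covariance estimator is defined as

\[\hat{S}=n^{-1}\sum_{j=1}^{g}\left(\sum_{i\in\mathcal{G}_{j}}\hat{\epsilon}_{i}z_{i}\right)\left(\sum_{i\in\mathcal{G}_{j}}\hat{\epsilon}_{i}z_{i}\right)^{\prime}\]

where \(\sum_{i\in\mathcal{G}_{j}}\hat{\epsilon}_{i}z_{i}\) is the sum of the moment conditions for all members in group \(\mathcal{G}_{j}\) and \(g\) is the number of groups. If the debiased flag is used, the \(n^{-1}\) term is replaced by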

### GMM Weight Estimators

The GMM optimal weight estimators are identical to the estimators of \(S\) with two notable exceptions. First, they are never debiased and so always use \(n^{-1}\). Second, if the center flag is true, the demeaned moment conditions defined as \(\tilde{g}_{i}=z_{i}\hat{\epsilon}_{i}-\overline{z\epsilon}\) are used in place of \(g_{i}\) in the robust, kernel and clustered estimators. The unadjusted estimator is always centered, and so this option has no effect.

## Post-estimation

### 2SLS and LIML Post-estimation Results

**Sargan**

Sargan’s test of over-identifying restrictions examines whether the variance of the IV residuals is similar to that of the OLS residuals. The test statistic is computed as

\[s=n\left(1-\frac{\hat{\epsilon}^{\prime}M_{Z}\hat{\epsilon}}{\hat{\epsilon}^{\prime}\hat{\epsilon}}\right)\sim\chi_{\nu}^{2}\]

where \(\hat{\epsilon}\) are the IV residuals and \(M_{Z}\) is the annihilator matrix using all exogenous variables. \(\nu\) is the number of overidentifying restrictions, which is the number of instruments minus the number of endogenous variables, \(p-k_{2}\).

**Basmann**

Basmann’s test is a small-sample corrected version of Sargan’s test of over-identifying restrictions. It has the same distribution. Let \(s\) be Sargan’s test statistic, then Basmann’s test statistic is

\[s\frac{n-p}{n-s}\]
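The two statistics can be sketched from a vector of IV residuals and the instrument matrix. The code assumes Sargan’s statistic is \(n\left(1-\hat{\epsilon}^{\prime}M_{Z}\hat{\epsilon}/\hat{\epsilon}^{\prime}\hat{\epsilon}\right)\) and Basmann’s is \(s\left(n-p\right)/\left(n-s\right)\) with \(p\) taken as the number of columns of \(Z\). Both the formulas and the function name `sargan_basmann` are assumptions made for illustration.

```python
import numpy as np


def sargan_basmann(eps, Z):
    """Sargan and Basmann statistics from residuals eps and instruments Z."""
    n, p = Z.shape
    fitted = Z @ np.linalg.lstsq(Z, eps, rcond=None)[0]   # P_Z eps
    resid = eps - fitted                                  # M_Z eps
    sargan = n * (1.0 - (resid @ resid) / (eps @ eps))
    basmann = sargan * (n - p) / (n - sargan)
    return float(sargan), float(basmann)


# Illustrative inputs: random "residuals" and four instruments.
rng = np.random.default_rng(3)
n = 200
Z = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
eps = rng.standard_normal(n)
s_stat, b_stat = sargan_basmann(eps, Z)

# Residuals that are exactly orthogonal to Z give statistics of zero.
eps_orth = eps - Z @ np.linalg.lstsq(Z, eps, rcond=None)[0]
s_zero, b_zero = sargan_basmann(eps_orth, Z)
```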

**Wooldridge score**

Wooldridge’s score test of exogeneity examines the magnitude of the correlation between the OLS residuals and an appropriately constructed set of residuals of the instruments. Define \(e=M_{X}y\) and \(\nu=M_{X}M_{Z}X_{2}\). Then the test statistic is computed from the regression

as \(nR^{2}\sim\chi_{k_{2}}^{2}\).

**Wooldridge regression**

Wooldridge’s regression test of exogeneity is closely related to the score test and is generally more powerful. It also inherits the robustness to heteroskedasticity and/or autocorrelation that comes from the choice of covariance estimator used in the model. Define \(R=M_{Z}X_{2}\). The test is implemented in a regression of

\[y=X\beta+R\gamma+\eta\]

as

\[\hat{\gamma}^{\prime}\hat{\Sigma}_{\gamma}^{-1}\hat{\gamma}\sim\chi_{k_{2}}^{2}\]

where \(\hat{\Sigma}_{\gamma}\) is the block of the covariance matrix corresponding to the \(\gamma\) parameters. \(\hat{\Sigma}\) is estimated using the same covariance estimator as the model fit.

**Wooldridge’s Test of Overidentifying restrictions**

Wooldridge’s test is a score test examining whether the component of the instrument that is uncorrelated with both the included exogenous and the fitted endogenous variables is uncorrelated with the IV residuals. Define \(\tilde{Z}=M_{\left[X_{1}\:\hat{X}_{2}\right]}Z_{2,1:q}\) where \(\hat{X}_{2}\) are the fitted values from the first stage regression of the endogenous on all exogenous variables and \(Z_{2,1:q}\) contains any \(q\) columns of \(Z_{2}\), \(q=p-k_{2}\). The test statistic is computed using a regression of 1s on the test functions \(\hat{\epsilon}_{i}\tilde{z}_{i,j}\) for \(j=1,\ldots,q\), which should have expected value 0,

\[1=\gamma_{1}\hat{\epsilon}_{i}\tilde{z}_{i,1}+\ldots+\gamma_{q}\hat{\epsilon}_{i}\tilde{z}_{i,q}+\eta_{i}\]

The test statistic is \(nR^{2}\sim\chi_{q}^{2}\) from the regression.

**Anderson-Rubin**

The Anderson-Rubin test of overidentification examines the magnitude of the LIML \(\hat{\kappa}\), which should be close to unity when the model is not overidentified.

\[n\ln\left(\hat{\kappa}\right)\sim\chi_{q}^{2}\]

where \(q=p-k_{2}\).

**Basmann’s F**

Basmann’s F test of overidentification also examines the magnitude of the LIML \(\hat{\kappa}\). The test statistic is

where \(q=p-k_{2}\).

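**Durbin and Wu-Hausman**

Durbin’s and the Wu-Hausman tests of exogeneity compare the variance of the residuals when some endogenous variables are treated as exogenous against the case where they are treated as endogenous. Durbin’s test statistic is

\[\frac{\hat{\epsilon}_{e}^{\prime}P_{\left[Z\,W\right]}\hat{\epsilon}_{e}-\hat{\epsilon}_{c}^{\prime}P_{Z}\hat{\epsilon}_{c}}{\hat{\epsilon}_{e}^{\prime}\hat{\epsilon}_{e}/n}\sim\chi_{q}^{2}\]

and the Wu-Hausman test statistic is

\[\frac{\left(\hat{\epsilon}_{e}^{\prime}P_{\left[Z\,W\right]}\hat{\epsilon}_{e}-\hat{\epsilon}_{c}^{\prime}P_{Z}\hat{\epsilon}_{c}\right)/q}{\left(\hat{\epsilon}_{e}^{\prime}\hat{\epsilon}_{e}-\left[\hat{\epsilon}_{e}^{\prime}P_{\left[Z\,W\right]}\hat{\epsilon}_{e}-\hat{\epsilon}_{c}^{\prime}P_{Z}\hat{\epsilon}_{c}\right]\right)/\nu}\sim F_{q,\nu}\]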

where \(\hat{\epsilon}_{e}\) treats the selected set of endogenous variables as exogenous (efficient estimate) and \(\hat{\epsilon}_{c}\) is a consistent estimator if these variables are endogenous. \(P_{\left[Z\,W\right]}\) is the projection matrix containing all exogenous variables including the instruments as well as the variables being tested for endogeneity \(\left(W\right)\). \(q\) is the number of variables being tested for exogeneity and \(\nu=n-k_{1}-k_{2}-q\).

### GMM Post-estimation Results

**J-stat**

The J-statistic tests whether the model is over-identified, and is defined

\[J=n\bar{g}^{\prime}W^{-1}\bar{g}\sim\chi_{q}^{2}\]

where \(\bar{g}=n^{-1}\sum\hat{\epsilon}_{i}z_{i}\) and \(W\) is a consistent estimator of the variance of \(\sqrt{n}\bar{g}\). The degrees of freedom is \(q=p-k_{2}\).

**C-stat**

The C-statistic tests exogeneity by treating a set of endogenous variables as exogenous. In practice this means that they are included in the GMM moment conditions, and so a likelihood-ratio-like test statistic can be computed as

\[C=J_{e}-J_{c}\]

where \(J_{e}\) is the J-statistic treating the tested variables as exogenous and \(J_{c}\) leaves them as endogenous. The optimal weighting matrix is computed from the expanded model (efficient) and used to estimate parameters in both models. This ensures that the test statistic is non-negative.

## First-stage Estimation Analysis

**Partial R2 and Partial F-statistic**

The \(R^{2}\) is reported after orthogonalizing the instruments from included exogenous variables, so that the model estimated is

where \(\tilde{Z}_{2}=M_{X_{1}}\tilde{Z}\). The partial \(F\)-statistic is the F-statistic from this regression. It is implemented as a standard \(F\)-statistic when the data is assumed to be homoskedastic with an \(F_{p_{2},n-p_{2}}\) distribution. In all other cases, a quadratic form is used with an asymptotic \(\chi_{p_{2}}^{2}\) distribution testing \(H_{0}:\gamma=0\).

**Shea’s R2**

Shea’s \(R^{2}\) is defined as the ratio of the parameter variances under OLS and 2SLS estimation standardized by the unexplained variance under each,

If the estimator under 2SLS were as good as under OLS, both ratios would be 1 and Shea’s \(R^{2}=1\). On the other hand, the worse the IV fit in terms of either \(R^{2}\) or the parameter variances, the lower the value reported by Shea’s \(R^{2}\).

## Kernel Weights and Bandwidth Selection

**Kernel weights**

In all formulas, \(m\) is the kernel bandwidth parameter.

Bartlett

\[w_{i}=1-\frac{i}{m+1},\,i<m\]

Parzen

\[\begin{split}\begin{aligned} z_{i} & =\frac{i}{m+1},\,i<m\\ w_{i} & =1-6z_{i}^{2}+6z_{i}^{3},\,z_{i}\leq0.5\\ w_{i} & =2(1-z_{i})^{3},\,z_{i}>0.5\end{aligned}\end{split}\]

Quadratic-Spectral

\[\begin{split}\begin{aligned} z_{i} & =\frac{6\pi i}{5m}\\ w_{0} & =1\\ w_{i} & =3(\sin(z_{i})/z_{i}-\cos(z_{i}))/z_{i}^{2},\:i\geq1\end{aligned}\end{split}\]
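The three kernels translate directly into per-lag weight functions. The sketch below takes a lag \(i\) and bandwidth \(m\) and returns \(w_{i}\). The range of lags to evaluate is left to the caller, and the function names are illustrative.

```python
import math


def bartlett(i, m):
    """Bartlett weight for lag i with bandwidth m."""
    return 1.0 - i / (m + 1)


def parzen(i, m):
    """Parzen weight for lag i with bandwidth m."""
    z = i / (m + 1)
    if z <= 0.5:
        return 1.0 - 6.0 * z**2 + 6.0 * z**3
    return 2.0 * (1.0 - z) ** 3


def quadratic_spectral(i, m):
    """Quadratic-Spectral weight for lag i with bandwidth m."""
    if i == 0:
        return 1.0
    z = 6.0 * math.pi * i / (5.0 * m)
    return 3.0 * (math.sin(z) / z - math.cos(z)) / z**2


bartlett_weights = [bartlett(i, 3) for i in range(3)]
parzen_weights = [parzen(i, 3) for i in range(3)]
```

With \(m=3\), the Bartlett weights for lags 0, 1 and 2 are 1, 0.75 and 0.5. The two Parzen branches agree at \(z_{i}=0.5\), where both give 0.25.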

**Optimal BW selection**

TODO

## Constant Detection

Whether a model contains a constant or equivalent is determined using three tests. These are executed in order, and once a constant is detected the remaining tests are not executed. The simplest method to ensure that a constant is correctly detected is to include a column of 1s.

1. A column with only 1.0s.
2. A column with a maximum minus minimum equal to 0 and that is not all 0s.
3. Whether the rank of \(X\) is the same as the rank of \(\left[1_{N}\:X\right]\). If these are the same, then the model contains a constant equivalent (e.g., dummies for all categories).
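The three checks can be sketched as a single function that stops at the first positive result. This is a minimal illustration of the logic described above, not the library’s implementation, and the name `has_constant` is hypothetical.

```python
import numpy as np


def has_constant(X):
    """Return True if X contains a constant or an equivalent."""
    X = np.asarray(X, dtype=float)
    # Test 1: a column containing only 1.0s.
    if np.any(np.all(X == 1.0, axis=0)):
        return True
    # Test 2: a column with zero range that is not all zeros.
    spread = X.max(axis=0) - X.min(axis=0)
    if np.any((spread == 0) & np.any(X != 0, axis=0)):
        return True
    # Test 3: adding a column of ones does not increase the rank.
    augmented = np.column_stack([np.ones(X.shape[0]), X])
    return bool(np.linalg.matrix_rank(augmented) == np.linalg.matrix_rank(X))


x = np.array([0.3, -1.2, 0.8, 2.1])
with_ones = np.column_stack([np.ones(4), x])
with_twos = np.column_stack([2.0 * np.ones(4), x])
all_dummies = np.column_stack([[1, 0, 1, 0], [0, 1, 0, 1], x])
no_constant = np.array([[1.0, 2.0], [3.0, 5.0], [2.0, 7.0], [0.0, 1.0]])
```

The first three matrices are detected by tests 1, 2 and 3 respectively, while `no_constant` fails all three.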

[1] If the model contains an implicit constant, e.g., all categories of a dummy, one of the categories is excluded when computing the test statistic. The choice of category to drop has no effect and is equivalent to reparameterizing the model with a constant and excluding one category of dummy.

[2] This somewhat non-obvious choice is driven by Stata compatibility.

References

Sources used in writing the code include [Baltagi], [BaumEtAl], [CamEtAl], [CamTri05], [CamTri09], [Greene], [NewWes94], [Stata], [Wool10] and [Wool12].