Formulas and Mathematical Detail¶
Notation¶
Interest is in recovering the parameter vector \(\beta\) from the model
\[y_{i}=x_{i}^{\prime}\beta+\epsilon_{i}\]
where \(\beta\) is \(k\) by 1, \(x_{i}\) is a \(k\) by 1 vector of observable variables and \(\epsilon_{i}\) is a scalar error. \(x_{i}\) can be separated into two types of variables. The \(k_{1}\) by 1 set of variables \(x_{1i}\) are exogenous regressors in the sense that \(E\left[x_{1i}\epsilon_{i}\right]=0\). The \(k_{2}\) by 1 variables \(x_{2i}\) are endogenous. A set of \(p\) instruments is available that satisfies the requirements for validity, where \(p\geq k_{2}\). The extended model can be written as
\[y_{i}=x_{1i}^{\prime}\beta_{1}+x_{2i}^{\prime}\beta_{2}+\epsilon_{i}\]
The model can be expressed compactly as
\[y=X\beta+\epsilon\]
The vector of instruments \(z_{i}\) is \(p\) by 1. There are \(n\) observations for all variables. \(k_{c}=1\) if the model contains a constant (either explicit or implicit, i.e., including all categories of a dummy variable). The constant, if included, is in \(x_{1i}\). \(X\) is the \(n\) by \(k\) matrix of regressors where row \(i\) of \(X\) is \(x_{i}^{\prime}\). \(X\) can be partitioned into \(\left[X_{1}\;X_{2}\right]\). \(Z\) is the \(n\) by \(p\) matrix of instruments. The vector \(y\) is \(n\) by 1. The projection matrix for \(X\) is defined as \(P_{X}=X\left(X^{\prime}X\right)^{-1}X^{\prime}\). The projection matrix for \(Z\) is defined similarly, using \(Z\). The annihilator matrix for \(X\) is \(M_{X}=I-P_{X}\).
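The following numpy sketch makes the notation concrete. It is illustrative only: the data are simulated and all names (`proj`, `annihilate`, ...) are ours, not part of any library API.

```python
# Illustrative numpy sketch of the notation above, with simulated data.
import numpy as np

rng = np.random.default_rng(0)
n, k1, k2, p = 500, 2, 1, 3  # observations, exogenous, endogenous, instruments

x1 = np.column_stack([np.ones(n), rng.standard_normal((n, k1 - 1))])  # incl. constant
z = rng.standard_normal((n, p))                                       # instruments
x2 = z @ rng.standard_normal((p, k2)) + rng.standard_normal((n, k2))  # endogenous
X = np.column_stack([x1, x2])                                         # X = [X1 X2]

def proj(A):
    """Projection matrix P_A = A (A'A)^{-1} A'."""
    return A @ np.linalg.solve(A.T @ A, A.T)

def annihilate(A):
    """Annihilator matrix M_A = I - P_A."""
    return np.eye(A.shape[0]) - proj(A)

P_Z = proj(z)          # n by n projection onto the instruments
M_X = annihilate(X)    # n by n annihilator for the regressors
```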
Parameter Estimation¶
Two-stage Least Squares (2SLS)¶
The 2SLS estimator is
\[\hat{\beta}_{2SLS}=\left(X^{\prime}P_{Z}X\right)^{-1}\left(X^{\prime}P_{Z}y\right)\]
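A minimal sketch of this formula, computing \(P_{Z}X\) without forming the \(n\) by \(n\) projection matrix; the name `tsls` is ours.

```python
# Sketch of the 2SLS closed form above; illustrative, not library code.
import numpy as np

def tsls(y, X, Z):
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)      # P_Z X
    return np.linalg.solve(Xhat.T @ X, Xhat.T @ y)    # (X'P_Z X)^{-1} X'P_Z y
```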
Limited Information Maximum Likelihood and k-class Estimators¶
The LIML or other k-class estimator is
\[\hat{\beta}_{\kappa}=\left(X^{\prime}\left(I-\kappa M_{Z}\right)X\right)^{-1}\left(X^{\prime}\left(I-\kappa M_{Z}\right)y\right)\]
where \(\kappa\) is the parameter of the class. When \(\kappa=1\) the 2SLS estimator is recovered. When \(\kappa=0\), the OLS estimator is recovered. The LIML estimator is recovered when \(\kappa\) is set to
\[\hat{\kappa}=\min\mathrm{eig}\left[\left(W^{\prime}M_{Z}W\right)^{-1}\left(W^{\prime}M_{X_{1}}W\right)\right]\]
where \(W=\left[y\:X_{2}\right]\) and \(\mathrm{eig}\) returns the eigenvalues.
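A sketch of the \(\hat{\kappa}\) computation, assuming \(M_{Z}\) annihilates all exogenous variables (included exogenous and instruments) while \(M_{X_{1}}\) annihilates only the included exogenous; scipy's generalized symmetric eigenvalue solver finds the smallest eigenvalue directly. All names are illustrative.

```python
# Sketch of the LIML kappa described above.
import numpy as np
from scipy.linalg import eigvalsh

def _resid(mat, on):
    """Residuals of `mat` after projecting onto `on`, i.e. M_on @ mat."""
    return mat - on @ np.linalg.lstsq(on, mat, rcond=None)[0]

def liml_kappa(y, x1, x2, z):
    W = np.column_stack([y, x2])            # W = [y X2]
    exog_all = np.column_stack([x1, z])     # assumed: all exogenous variables
    num = W.T @ _resid(W, x1)               # W' M_X1 W
    den = W.T @ _resid(W, exog_all)         # W' M_Z W
    return eigvalsh(num, den).min()         # smallest generalized eigenvalue
```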
Generalized Method of Moments (GMM)¶
The GMM estimator is defined as
\[\hat{\beta}_{GMM}=\left(X^{\prime}ZWZ^{\prime}X\right)^{-1}\left(X^{\prime}ZWZ^{\prime}y\right)\]
where \(W\) is a positive definite weighting matrix.
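A sketch of the closed form, with the weight `W` supplied by the caller; setting \(W=\left(Z^{\prime}Z\right)^{-1}\) reproduces the 2SLS estimator.

```python
# Sketch of the linear GMM closed form above; names are ours.
import numpy as np

def gmm(y, X, Z, W):
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))
```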
Continuously Updating Generalized Method of Moments (GMM-CUE)¶
The continuously updating GMM estimator solves the minimization problem
\[\min_{\beta}\;n\,\bar{g}\left(\beta\right)^{\prime}W\left(\beta\right)\bar{g}\left(\beta\right)\]
where \(\bar{g}\left(\beta\right)=n^{-1}\sum_{i=1}^{n}z_{i}\hat{\epsilon}_{i}\) and \(\hat{\epsilon}_{i}=y_{i}-x_{i}^{\prime}\beta\). Unlike standard GMM, the weight matrix \(W\left(\beta\right)\) directly depends on the model parameters \(\beta\), and so it is not possible to use a closed-form estimator.
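A sketch of the CUE objective, assuming a centered heteroskedasticity-robust weight recomputed at every candidate \(\beta\); in practice the objective is minimized numerically, indicated here with scipy.

```python
# Sketch of the continuously updated objective described above.
import numpy as np
from scipy.optimize import minimize

def cue_objective(beta, y, X, Z):
    n = len(y)
    eps = y - X @ beta
    g = Z * eps[:, None]                       # moment conditions z_i * eps_i
    gbar = g.mean(0)
    S = (g - gbar).T @ (g - gbar) / n          # S(beta), centered and robust
    return n * gbar @ np.linalg.solve(S, gbar)

# e.g., starting from a first-step estimate beta0:
# res = minimize(cue_objective, beta0, args=(y, X, Z), method="BFGS")
```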
Basic Statistics¶
Let \(\hat{\epsilon}=y-X\hat{\beta}\). The residual sum of squares (RSS) is \(\hat{\epsilon}^{\prime}\hat{\epsilon}\), the model sum of squares (MSS) is \(\hat{\beta}^{\prime}X^{\prime}X\hat{\beta}\) and the total sum of squares (TSS) is \(\left(y-k_{c}\bar{y}\right)^{\prime}\left(y-k_{c}\bar{y}\right)\), where \(\bar{y}\) is the sample average of \(y\). The model \(R^{2}\) is defined as
\[R^{2}=1-\frac{RSS}{TSS}\]
and the adjusted \(R^{2}\) is defined as
\[\bar{R}^{2}=1-\left(1-R^{2}\right)\frac{n-k_{c}}{n-k}.\]
The residual variance is \(s^{2}=n^{-1}\hat{\epsilon}^{\prime}\hat{\epsilon}\) unless the debiased flag is used, in which case a small-sample adjusted version is estimated, \(s^{2}=\left(n-k\right)^{-1}\hat{\epsilon}^{\prime}\hat{\epsilon}\). The model degrees of freedom are \(k\) and the residual degrees of freedom are \(n-k\).
The model F-statistic is defined as
\[F=\hat{\beta}_{-}^{\prime}\hat{V}_{-}^{-1}\hat{\beta}_{-}\]
where the notation \(\hat{\beta}_{-}\) indicates that the constant is excluded from \(\hat{\beta}\) and \(\hat{V}_{-}\) indicates that the row and column corresponding to the constant are excluded. [1] The test statistic is distributed as \(\chi_{k-k_{c}}^{2}\). If the debiased flag is set, then the test statistic \(F\) is transformed as \(F/\left(k-k_{c}\right)\) and an \(F_{k-k_{c},n-k}\) distribution is used.
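The statistics above fit together as in the sketch below, assuming the constant occupies column 0 of \(X\) so that \(k_{c}=1\); `cov` is the estimated parameter covariance, and all names are ours.

```python
# Sketch of RSS/TSS, R2, residual variance and the model F-statistic.
import numpy as np
from scipy import stats

def fit_stats(y, X, beta_hat, cov, debiased=False):
    n, k = X.shape
    eps = y - X @ beta_hat
    rss = eps @ eps
    tss = ((y - y.mean()) ** 2).sum()          # k_c = 1 assumed
    r2 = 1 - rss / tss
    s2 = rss / (n - k) if debiased else rss / n
    b, V = beta_hat[1:], cov[1:, 1:]           # exclude the constant
    stat = b @ np.linalg.solve(V, b)           # Wald form of the model F
    if debiased:
        stat /= k - 1                          # k - k_c with k_c = 1
        pval = stats.f.sf(stat, k - 1, n - k)
    else:
        pval = stats.chi2.sf(stat, k - 1)
    return r2, s2, stat, pval
```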
Parameter Covariance Estimation¶
Two-stage LS, LIML and k-class estimators¶
Four covariance estimators are available. The first is the standard homoskedastic covariance, defined as
\[\hat{\Sigma}=n^{-1}s^{2}\left(n^{-1}X^{\prime}P_{Z}X\right)^{-1}\]
Note that this estimator can be expressed as
\[\hat{\Sigma}=n^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}\;\textrm{ where }\;\hat{A}=n^{-1}X^{\prime}P_{Z}X.\]
All estimators take this form and differ only in how the asymptotic covariance of the scores, \(B\), is estimated. For the homoskedastic covariance estimator, \(\hat{B}=s^{2}\hat{A}\). The score covariance in the heteroskedasticity-robust covariance estimator is
\[\hat{B}=n^{-1}\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}\hat{x}_{i}^{\prime}\hat{x}_{i}=n^{-1}\sum_{i=1}^{n}\hat{\xi}_{i}^{\prime}\hat{\xi}_{i}\]
where \(\hat{x}_{i}\) is row \(i\) of \(\hat{X}=P_{Z}X\) and \(\hat{\xi}_{i}=\hat{\epsilon}_{i}\hat{x}_{i}\).
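A sketch of the sandwich form \(n^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}\) covering the homoskedastic and heteroskedasticity-robust cases; `iv_cov` is our name, not the library API.

```python
# Sketch of the 2SLS parameter covariance estimators described above.
import numpy as np

def iv_cov(y, X, Z, beta_hat, robust=True):
    n = len(y)
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)   # X-hat = P_Z X
    A = Xhat.T @ X / n                             # A-hat = n^{-1} X'P_Z X
    eps = y - X @ beta_hat
    if robust:
        xi = Xhat * eps[:, None]                   # xi_i = eps_i * xhat_i
        B = xi.T @ xi / n
    else:
        B = (eps @ eps / n) * A                    # B-hat = s^2 A-hat
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv / n
```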
The kernel covariance estimator is robust to both heteroskedasticity and autocorrelation and is defined as
\[\hat{B}=n^{-1}\left(\sum_{i=1}^{n}\hat{\xi}_{i}^{\prime}\hat{\xi}_{i}+\sum_{j=1}^{n-1}K\left(\frac{j}{h}\right)\sum_{i=1}^{n-j}\left(\hat{\xi}_{i}^{\prime}\hat{\xi}_{i+j}+\hat{\xi}_{i+j}^{\prime}\hat{\xi}_{i}\right)\right)\]
where \(K\left(\frac{j}{h}\right)\) is a kernel weighting function and \(h\) is the kernel bandwidth.
The one-way clustered covariance estimator is defined as
\[\hat{B}=n^{-1}\sum_{j=1}^{g}\left(\sum_{i\in\mathcal{G}_{j}}\hat{\xi}_{i}\right)^{\prime}\left(\sum_{i\in\mathcal{G}_{j}}\hat{\xi}_{i}\right)\]
where \(\sum_{i\in\mathcal{G}_{j}}\hat{\xi}_{i}\) is the sum of the scores for all members in group \(\mathcal{G}_{j}\) and \(g\) is the number of groups.
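A sketch of the clustered \(\hat{B}\); `xi` holds the scores \(\hat{\xi}_{i}\) row by row and `groups` holds one integer label per observation.

```python
# Sketch of the one-way clustered score covariance described above.
import numpy as np

def clustered_B(xi, groups):
    n = xi.shape[0]
    labels = np.unique(groups)
    sums = np.vstack([xi[groups == g].sum(0) for g in labels])  # per-group sums
    return sums.T @ sums / n
```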
If the debiased flag is used to perform a small-sample adjustment, all estimators except the clustered covariance are rescaled by \(\left(n-k\right)/n\). The clustered covariance is rescaled by \(\left(\left(n-k\right)\left(n-1\right)/n^{2}\right)\left(\left(g-1\right)/g\right)\). [2]
Standard Errors¶
Standard errors are defined as
\[\widehat{s.e.}\left(\hat{\beta}_{j}\right)=\sqrt{e_{j}^{\prime}\hat{\Sigma}e_{j}}\]
where \(e_{j}\) is a vector of 0s except in location \(j\), which is 1, and \(\hat{\Sigma}\) is the estimated parameter covariance.
T-statistics¶
T-statistics test the null \(H_{0}:\beta_{j}=0\) against a 2-sided alternative and are defined as
\[z_{j}=\frac{\hat{\beta}_{j}}{\widehat{s.e.}\left(\hat{\beta}_{j}\right)}\]
P-values¶
P-values are computed using 2-sided tests,
\[p_{j}=2-2\Phi\left(\left|z_{j}\right|\right)\]
where \(\Phi\left(\cdot\right)\) is the CDF of a standard Normal distribution.
If the covariance estimator was debiased, a Student’s t distribution with \(n-k\) degrees of freedom is used,
\[p_{j}=2-2\,t_{n-k}\left(\left|z_{j}\right|\right)\]
where \(t_{n-k}\left(\cdot\right)\) is the CDF of a Student’s T distribution.
Confidence Intervals¶
Confidence intervals are constructed as
\[\left[\hat{\beta}_{j}+q_{\alpha/2}\times\widehat{s.e.}\left(\hat{\beta}_{j}\right),\;\hat{\beta}_{j}+q_{1-\alpha/2}\times\widehat{s.e.}\left(\hat{\beta}_{j}\right)\right]\]
where \(q_{\alpha/2}\) is the \(\alpha/2\) quantile of a standard Normal distribution or a Student’s t. The Student’s t is used when a debiased covariance estimator is used.
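The standard-error, t-statistic, p-value and confidence-interval formulas above fit in a few lines; the helper below is an illustrative sketch, not library code.

```python
# Sketch of the inference formulas above; `debiased` switches between
# the standard Normal and the Student's t with n - k degrees of freedom.
import numpy as np
from scipy import stats

def inference(beta_hat, cov, n, k, alpha=0.05, debiased=False):
    se = np.sqrt(np.diag(cov))                    # sqrt(e_j' Sigma e_j)
    tstats = beta_hat / se
    dist = stats.t(n - k) if debiased else stats.norm()
    pvals = 2 * dist.sf(np.abs(tstats))           # 2-sided p-values
    q = dist.ppf(1 - alpha / 2)
    ci = np.column_stack([beta_hat - q * se, beta_hat + q * se])
    return se, tstats, pvals, ci
```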
Linear Hypothesis Tests¶
Linear hypothesis tests examine the validity of nulls of the form \(H_{0}:R\beta-r=0\) and are implemented using a Wald test statistic
\[W=\left(R\hat{\beta}-r\right)^{\prime}\left(R\hat{\Sigma}R^{\prime}\right)^{-1}\left(R\hat{\beta}-r\right)\sim\chi_{q}^{2}\]
where \(q=\mathrm{rank}\left(R\right)\), which is usually the number of rows of \(R\). If the debiased flag is used, then \(W/q\) is reported and critical values and p-values are taken from an \(F_{q,n-k}\) distribution.
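A sketch of the Wald test, assuming \(R\) has full row rank so that \(q\) equals its number of rows.

```python
# Sketch of the Wald test of H0: R beta - r = 0 described above.
import numpy as np
from scipy import stats

def wald_test(beta_hat, cov, R, r, n=None, k=None, debiased=False):
    diff = R @ beta_hat - r
    stat = diff @ np.linalg.solve(R @ cov @ R.T, diff)
    q = R.shape[0]
    if debiased:                                  # requires n and k
        return stat / q, stats.f.sf(stat / q, q, n - k)
    return stat, stats.chi2.sf(stat, q)
```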
GMM Covariance estimators¶
GMM covariance depends on the weighting matrix used in estimation and the assumed covariance of the scores. In most applications these are the same, and so the inefficient form,
\[n^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}\;\textrm{ with }\;\hat{A}=\frac{X^{\prime}Z}{n}W\frac{Z^{\prime}X}{n},\quad\hat{B}=\frac{X^{\prime}Z}{n}W\hat{S}W\frac{Z^{\prime}X}{n},\]
will collapse to the simpler form
\[n^{-1}\left(\frac{X^{\prime}Z}{n}\hat{S}^{-1}\frac{Z^{\prime}X}{n}\right)^{-1}\]
when \(W=S^{-1}\). When an unadjusted (homoskedastic) covariance is used,
\[\hat{S}=\tilde{s}^{2}n^{-1}Z^{\prime}Z\]
where \(\tilde{s}^{2}=n^{-1}\sum_{i=1}^{n}\left(\hat{\epsilon}_{i}-\bar{\epsilon}\right)^{2}\) subtracts the mean, which may be non-zero if the model is overidentified. Like previous covariance estimators, if the debiased flag is used, \(n^{-1}\) is replaced by \(\left(n-k\right)^{-1}\). When a robust (heteroskedastic) covariance is used, the estimator of \(S\) is modified to
\[\hat{S}=n^{-1}\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}z_{i}z_{i}^{\prime}\]
If the debiased flag is used, \(n^{-1}\) is replaced by \(\left(n-k\right)^{-1}\).
Kernel covariance estimators of \(S\) take the form
\[\hat{S}=\hat{\Gamma}_{0}+\sum_{j=1}^{n-1}k\left(\frac{j}{h}\right)\left(\hat{\Gamma}_{j}+\hat{\Gamma}_{j}^{\prime}\right)\;\textrm{ where }\;\hat{\Gamma}_{j}=n^{-1}\sum_{i=j+1}^{n}\hat{\epsilon}_{i}z_{i}z_{i-j}^{\prime}\hat{\epsilon}_{i-j}\]
and \(k\left(\cdot\right)\) is a kernel weighting function with bandwidth \(h\). If the debiased flag is used, \(n^{-1}\) is replaced by \(\left(n-k\right)^{-1}\).
The one-way clustered covariance estimator is defined as
\[\hat{S}=n^{-1}\sum_{j=1}^{g}\left(\sum_{i\in\mathcal{G}_{j}}\hat{\epsilon}_{i}z_{i}\right)\left(\sum_{i\in\mathcal{G}_{j}}\hat{\epsilon}_{i}z_{i}\right)^{\prime}\]
where \(\sum_{i\in\mathcal{G}_{j}}\hat{\epsilon}_{i}z_{i}\) is the sum of the moment conditions for all members in group \(\mathcal{G}_{j}\) and \(g\) is the number of groups. If the debiased flag is used, the \(n^{-1}\) term is replaced by
\[\frac{g}{g-1}\frac{n-1}{n-k}n^{-1}.\]
GMM Weight Estimators¶
The GMM optimal weight estimators are identical to the estimators of \(S\) with two notable exceptions. First, they are never debiased and so always use \(n^{-1}\). Second, if the center flag is true, the demeaned moment conditions defined as \(\tilde{g}_{i}=z_{i}\hat{\epsilon}_{i}-\overline{z\epsilon}\) are used in place of \(g_{i}\) in the robust, kernel and clustered estimators. The unadjusted estimator is always centered, and so this option has no effect.
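A sketch of the robust weight estimator with the optional centering; note the fixed \(n^{-1}\) scaling, since weight estimators are never debiased.

```python
# Sketch of the robust GMM weight estimator described above.
import numpy as np

def gmm_weight(eps, Z, center=False):
    g = Z * eps[:, None]            # g_i = z_i * eps_i
    if center:
        g = g - g.mean(0)           # demeaned moment conditions g-tilde
    S = g.T @ g / len(eps)          # always n^{-1}, never debiased
    return np.linalg.inv(S)         # W = S^{-1}
```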
Post-estimation¶
2SLS and LIML Post-estimation Results¶
Sargan
Sargan’s test of over-identifying restrictions examines whether the variance of the IV residuals is similar to that of the OLS residuals. The test statistic is computed as
\[s=n\left(1-\frac{\hat{\epsilon}^{\prime}M_{Z}\hat{\epsilon}}{\hat{\epsilon}^{\prime}\hat{\epsilon}}\right)\sim\chi_{\nu}^{2}\]
where \(\hat{\epsilon}\) are the IV residuals and \(M_{Z}\) is the annihilator matrix using all exogenous variables. \(\nu\) is the number of overidentifying restrictions, which is the number of instruments minus the number of endogenous variables, \(p-k_{2}\).
Basmann
Basmann’s test is a small-sample corrected version of Sargan’s test of over-identifying restrictions and has the same distribution. Let \(s\) be Sargan’s test statistic; then Basmann’s test statistic is
\[\tilde{s}=s\frac{n-p}{n-s}\]
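A sketch of both statistics; `exog_all` stacks all exogenous variables (included exogenous and instruments) and the Basmann rescaling follows the form given above, which should be treated as an assumption.

```python
# Sketch of Sargan's and Basmann's overidentification statistics.
import numpy as np
from scipy import stats

def sargan_basmann(eps, exog_all, nu):
    """eps: IV residuals; nu: overidentifying restrictions (p - k2)."""
    n, ncols = exog_all.shape
    u = eps - exog_all @ np.linalg.lstsq(exog_all, eps, rcond=None)[0]  # M_Z eps
    s = n * (1 - (u @ u) / (eps @ eps))      # Sargan
    basmann = s * (n - ncols) / (n - s)      # assumed small-sample form
    return s, basmann, stats.chi2.sf(s, nu)
```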
Wooldridge score
Wooldridge’s score test of exogeneity examines the magnitude of the correlation between the OLS residuals and an appropriately constructed set of residuals of the instruments. Define \(e=M_{X}y\) and \(\nu=M_{X}M_{Z}X_{2}\). Then the test statistic is computed from the regression
\[e_{i}=\nu_{i}^{\prime}\gamma+\eta_{i}\]
as \(nR^{2}\sim\chi_{k_{2}}^{2}\).
Wooldridge regression
Wooldridge’s regression test of exogeneity is closely related to the score test and is generally more powerful. It also inherits any robustness to heteroskedasticity and/or autocorrelation that comes from the choice of covariance estimator used in the model. Define \(R=M_{Z}X_{2}\). The test is implemented in a regression of
\[y_{i}=x_{i}^{\prime}\beta+r_{i}^{\prime}\gamma+\eta_{i}\]
where \(r_{i}\) is row \(i\) of \(R\), and the null \(H_{0}:\gamma=0\) is tested as
\[\hat{\gamma}^{\prime}\hat{\Sigma}_{\gamma}^{-1}\hat{\gamma}\sim\chi_{k_{2}}^{2}\]
where \(\hat{\Sigma}_{\gamma}\) is the block of the covariance matrix corresponding to the \(\gamma\) parameters. \(\hat{\Sigma}\) is estimated using the same covariance estimator as the model fit.
Wooldridge’s Test of Overidentifying restrictions
Wooldridge’s test is a score test examining whether the component of the instruments that is uncorrelated with both the included exogenous variables and the fitted endogenous variables is also uncorrelated with the IV residuals. Define \(\tilde{Z}=M_{\left[X_{1}\:\hat{X}_{2}\right]}Z_{2,1:q}\) where \(\hat{X}_{2}\) are the fitted values from the first-stage regression of the endogenous variables on all exogenous variables and \(Z_{2,1:q}\) contains any \(q\) columns of \(Z_{2}\), \(q=p-k_{2}\). The test statistic is computed using a regression of 1s on the test functions \(\hat{\epsilon}_{i}\tilde{z}_{i,j}\) for \(j=1,\ldots,q\), which should have expected value 0.
The test statistic is \(nR^{2}\sim\chi_{q}^{2}\) from the regression.
Anderson-Rubin
The Anderson-Rubin test of overidentification examines the magnitude of the LIML \(\hat{\kappa}\), which should be close to unity when the overidentifying restrictions are valid. The test statistic is
\[n\ln\left(\hat{\kappa}\right)\sim\chi_{q}^{2}\]
where \(q=p-k_{2}\).
Basmann’s F
Basmann’s F test of overidentification also examines the magnitude of the LIML \(\hat{\kappa}\). The test statistic is
\[\frac{n-p}{q}\left(\hat{\kappa}-1\right)\sim F_{q,n-p}\]
where \(q=p-k_{2}\).
Durbin and Wu-Hausman
Durbin’s and the Wu-Hausman tests of exogeneity compare the variance of the residuals when some endogenous variables are treated as exogenous against the case where they are treated as endogenous. Durbin’s test statistic is
\[D=n\frac{\hat{\epsilon}_{e}^{\prime}P_{\left[Z\,W\right]}\hat{\epsilon}_{e}-\hat{\epsilon}_{c}^{\prime}P_{Z}\hat{\epsilon}_{c}}{\hat{\epsilon}_{e}^{\prime}\hat{\epsilon}_{e}}\sim\chi_{q}^{2}\]
and the Wu-Hausman test statistic is
\[WH=\frac{\left(\hat{\epsilon}_{e}^{\prime}P_{\left[Z\,W\right]}\hat{\epsilon}_{e}-\hat{\epsilon}_{c}^{\prime}P_{Z}\hat{\epsilon}_{c}\right)/q}{\left(\hat{\epsilon}_{e}^{\prime}\hat{\epsilon}_{e}-\hat{\epsilon}_{e}^{\prime}P_{\left[Z\,W\right]}\hat{\epsilon}_{e}+\hat{\epsilon}_{c}^{\prime}P_{Z}\hat{\epsilon}_{c}\right)/\nu}\sim F_{q,\nu}\]
where \(\hat{\epsilon}_{e}\) treats the selected set of endogenous variables as exogenous (the efficient estimate) and \(\hat{\epsilon}_{c}\) is a consistent estimate allowing these variables to be endogenous. \(P_{\left[Z\,W\right]}\) is the projection matrix containing all exogenous variables, including the instruments, as well as the variables being tested for endogeneity \(\left(W\right)\). \(q\) is the number of variables being tested for exogeneity and \(\nu=n-k_{1}-k_{2}-q\).
GMM Post-estimation Results¶
J-stat
The J-statistic tests whether the model is over-identified and is defined as
\[J=n\bar{g}^{\prime}W^{-1}\bar{g}\sim\chi_{q}^{2}\]
where \(\bar{g}=n^{-1}\sum_{i=1}^{n}z_{i}\hat{\epsilon}_{i}\) and \(W\) is a consistent estimator of the variance of \(\sqrt{n}\bar{g}\). The degrees of freedom are \(q=p-k_{2}\).
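A sketch of the J-statistic; `S` plays the role of \(W\) above, a consistent estimator of the variance of \(\sqrt{n}\bar{g}\), and the name `j_stat` is ours.

```python
# Sketch of the J-statistic described above.
import numpy as np
from scipy import stats

def j_stat(eps, Z, S, q):
    gbar = (Z * eps[:, None]).mean(0)                 # g-bar
    stat = len(eps) * gbar @ np.linalg.solve(S, gbar) # n g-bar' S^{-1} g-bar
    return stat, stats.chi2.sf(stat, q)
```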
C-stat
The C-statistic tests exogeneity by treating a set of endogenous variables as exogenous. In practice this means that they are included in the GMM moment conditions, and so a likelihood-ratio-like test statistic can be computed as
\[C=J_{e}-J_{c}\]
where \(J_{e}\) is the J-statistic treating the tested variables as exogenous and \(J_{c}\) leaves them as endogenous. \(C\) is distributed \(\chi^{2}\) with degrees of freedom equal to the number of variables tested. The optimal weighting matrix is computed from the expanded model (efficient) and used to estimate the parameters in both models. This ensures that the test statistic is positive.
First-stage Estimation Analysis¶
Partial R2 and Partial F-statistic
The \(R^{2}\) is reported after orthogonalizing the instruments to the included exogenous variables, so that the model estimated is
\[x_{2i}=\tilde{z}_{2i}^{\prime}\gamma+\eta_{i}\]
where \(\tilde{Z}_{2}=M_{X_{1}}Z_{2}\). The partial \(F\)-statistic is the \(F\)-statistic from this regression. It is implemented as a standard \(F\)-statistic with an \(F_{p_{2},n-p_{2}}\) distribution when the data are assumed to be homoskedastic. In all other cases, a quadratic form with an asymptotic \(\chi_{p_{2}}^{2}\) distribution is used to test \(H_{0}:\gamma=0\).
Shea’s R2
Shea’s \(R^{2}\) is defined as the ratio of the parameter variances under OLS and 2SLS estimation, standardized by the unexplained variance under each,
\[R_{s,j}^{2}=\frac{\hat{V}_{OLS,jj}/\left(1-R_{OLS}^{2}\right)}{\hat{V}_{IV,jj}/\left(1-R_{IV}^{2}\right)}.\]
If the estimator under 2SLS was as good as under OLS, both ratios would be 1 and Shea’s \(R^{2}=1\). On the other hand, the worse the \(IV\) fit in terms of either \(R^{2}\) or the parameter variances, the lower the value reported by Shea’s \(R^{2}\).
Kernel Weights and Bandwidth Selection¶
Kernel weights
In all formulas, \(m\) is the kernel bandwidth parameter.
Bartlett
\[w_{i}=1-\frac{i}{m+1},\;i<m\]
Parzen
\[\begin{split}\begin{aligned} z_{i} & =\frac{i}{m+1},\;i<m\\ w_{i} & =1-6z_{i}^{2}+6z_{i}^{3},\;z_{i}\leq0.5\\ w_{i} & =2(1-z_{i})^{3},\;z_{i}>0.5\end{aligned}\end{split}\]
Quadratic-Spectral
\[\begin{split}\begin{aligned} z_{i} & =\frac{6\pi i}{5m}\\ w_{0} & =1\\ w_{i} & =3(\sin(z_{i})/z_{i}-\cos(z_{i}))/z_{i}^{2},\;i\geq1\end{aligned}\end{split}\]
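The three weight functions transcribe directly into the sketch below; `lags` (the number of weights returned) is our parameter for the Quadratic-Spectral kernel, which never truncates.

```python
# Transcription of the three kernel weight formulas above.
import numpy as np

def bartlett(m):
    i = np.arange(m + 1)
    return 1 - i / (m + 1)

def parzen(m):
    z = np.arange(m + 1) / (m + 1)
    return np.where(z <= 0.5, 1 - 6 * z**2 + 6 * z**3, 2 * (1 - z) ** 3)

def quadratic_spectral(m, lags):
    z = 6 * np.pi * np.arange(1, lags + 1) / (5 * m)
    w = 3 * (np.sin(z) / z - np.cos(z)) / z**2
    return np.concatenate([[1.0], w])          # w_0 = 1
```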
Optimal BW selection
TODO
Constant Detection¶
Whether a model contains a constant or equivalent is tested using three checks. These are executed in order, and once a constant is detected, the remaining checks are not executed (a code sketch follows the list below). The simplest method to ensure that a constant is correctly detected is to include a column of 1s.
1. A column with only 1.0s.
2. A column whose maximum minus minimum equals 0 and that is not all 0s.
3. Whether the rank of \(X\) is the same as that of \(\left[1_{N}\:X\right]\). If these are the same, then the model contains a constant equivalent (e.g., dummies for all categories).
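The three ordered checks transcribe into a short function; `has_constant` is our name for this sketch, not the library API.

```python
# Sketch of the three ordered constant-detection checks above.
import numpy as np

def has_constant(X):
    if np.any(np.all(X == 1.0, axis=0)):             # 1. a column of 1.0s
        return True
    ptp = X.max(0) - X.min(0)                        # 2. constant, non-zero column
    if np.any((ptp == 0) & ~np.all(X == 0, axis=0)):
        return True
    ones = np.ones((X.shape[0], 1))                  # 3. rank comparison
    return np.linalg.matrix_rank(np.hstack([ones, X])) == np.linalg.matrix_rank(X)
```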
References
Sources used in writing the code include [Baltagi], [BaumEtAl] and [CamEtAl], [CamTri05], [CamTri09], [Greene], [NewWes94], [Stata], [Wool10] and [Wool12].