Further Examples¶
Linear Instrumental-Variables Regression¶
These examples follow those in Chapter 6 of Microeconometrics Using Stata by Cameron & Trivedi.
The first step is to import the main estimator for linear IV models:
IV2SLS
- standard two-stage least squaresIVLIML
- Limited information maximum likelihood and k-class estimatorsIVGMM
- Generalized method of moment estimationIVGMMCUE
- Generalized method of moment estimation using continuously updating
[1]:
from linearmodels import IV2SLS, IVGMM, IVGMMCUE, IVLIML
Importing data¶
The data uses comes from the Medical Expenditure Panel Survey (MEPS) and includes data on out-of-pocket drug expenditure (in logs), individual characteristics, whether an individual was insured through an employer or union (a likely endogenous variable), and some candidate instruments including the percentage of income from Social Security Income, the size of the individual”s firm and whether the firm has multiple locations.
[2]:
from linearmodels.datasets import meps
data = meps.load()
data = data.dropna()
print(meps.DESCR)
age Age
age2 Age-squared
black Black
blhisp Black or Hispanic
drugexp Presc-drugs expense
educyr Years of education
fair Fair health
female Female
firmsz Firm size
fph fair or poor health
good Good health
hi_empunion Insured thro emp/union
hisp Hiapanic
income Income
ldrugexp log(drugexp)
linc log(income)
lowincome Low income
marry Married
midincome Middle income
msa Metropolitan stat area
multlc Multiple locations
poor Poor health
poverty Poor
priolist Priority list cond
private Private insurance
ssiratio SSI/Income ratio
totchr Total chronic cond
vegood V-good health
vgh vg or good health
Next the data – dependent, endogenous and controls – are summarized. The controls are grouped into a list to simplify model building.
[3]:
controls = ["totchr", "female", "age", "linc", "blhisp"]
print(data[["ldrugexp", "hi_empunion"] + controls].describe(percentiles=[]))
ldrugexp hi_empunion totchr female age \
count 10089.000000 10089.000000 10089.000000 10089.000000 10089.000000
mean 6.481361 0.382198 1.860938 0.577064 75.051740
std 1.362052 0.485949 1.292858 0.494050 6.682109
min 0.000000 0.000000 0.000000 0.000000 65.000000
50% 6.678342 0.000000 2.000000 1.000000 74.000000
max 10.180172 1.000000 9.000000 1.000000 91.000000
linc blhisp
count 10089.000000 10089.000000
mean 2.743275 0.163544
std 0.913143 0.369880
min -6.907755 0.000000
50% 2.743160 0.000000
max 5.744476 1.000000
It is also worth examining the instruments.
[4]:
instruments = ["ssiratio", "lowincome", "multlc", "firmsz"]
print(data[instruments].describe(percentiles=[]))
ssiratio lowincome multlc firmsz
count 10089.000000 10089.000000 10089.000000 10089.000000
mean 0.536544 0.187432 0.062048 0.140529
std 0.367818 0.390277 0.241254 2.170389
min 0.000000 0.000000 0.000000 0.000000
50% 0.504522 0.000000 0.000000 0.000000
max 9.250620 1.000000 1.000000 50.000000
And finally the simple correlation between the endogenous variable and the instruments. Instruments must be correlated to be relevant (but also must be exogenous, which can”t be examined using simple correlation). The correlation of firmsz
is especially low, which might lead to the weak instruments problem if used exclusively.
[5]:
data[["hi_empunion"] + instruments].corr()
[5]:
hi_empunion | ssiratio | lowincome | multlc | firmsz | |
---|---|---|---|---|---|
hi_empunion | 1.000000 | -0.212431 | -0.116419 | 0.119849 | 0.037352 |
ssiratio | -0.212431 | 1.000000 | 0.253946 | -0.190433 | -0.044578 |
lowincome | -0.116419 | 0.253946 | 1.000000 | -0.062465 | -0.008232 |
multlc | 0.119849 | -0.190433 | -0.062465 | 1.000000 | 0.187275 |
firmsz | 0.037352 | -0.044578 | -0.008232 | 0.187275 | 1.000000 |
add_constant
from statsmodels
is used to simplify the process of adding a constant column to the data.
[6]:
from statsmodels.api import OLS, add_constant
data["const"] = 1
controls = ["const"] + controls
2SLS as OLS¶
Before examining the IV estimators, it is worth noting that 2SLS nests the OLS estimator, so that a call to IV2SLS
using None
for the endogenous and instruments will produce OLS estimates of parameters.
The OLS estimates indicate that insurance through an employer or union leads to an increase in out-of-pocket drug expenditure.
[7]:
ivolsmod = IV2SLS(data.ldrugexp, data[["hi_empunion"] + controls], None, None)
res_ols = ivolsmod.fit()
print(res_ols)
OLS Estimation Summary
==============================================================================
Dep. Variable: ldrugexp R-squared: 0.1770
Estimator: OLS Adj. R-squared: 0.1765
No. Observations: 10089 F-statistic: 2262.6
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:01 Distribution: chi2(6)
Cov. Estimator: robust
Parameter Estimates
===============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-------------------------------------------------------------------------------
hi_empunion 0.0739 0.0260 2.8441 0.0045 0.0230 0.1248
const 5.8611 0.1570 37.320 0.0000 5.5533 6.1689
totchr 0.4404 0.0094 47.049 0.0000 0.4220 0.4587
female 0.0578 0.0254 2.2797 0.0226 0.0081 0.1075
age -0.0035 0.0019 -1.8228 0.0683 -0.0073 0.0003
linc 0.0105 0.0137 0.7646 0.4445 -0.0164 0.0373
blhisp -0.1513 0.0341 -4.4353 0.0000 -0.2182 -0.0844
===============================================================================
Just identified 2SLS¶
The just identified two-stage LS estimator uses as many instruments as endogenous variables. In this example there is one of each, using the SSI ratio as the instrument. The with the instrument, the effect of insurance through employer or union has a strong negative effect on drug expenditure.
[8]:
ivmod = IV2SLS(data.ldrugexp, data[controls], data.hi_empunion, data.ssiratio)
res_2sls = ivmod.fit()
print(res_2sls.summary)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: ldrugexp R-squared: 0.0640
Estimator: IV-2SLS Adj. R-squared: 0.0634
No. Observations: 10089 F-statistic: 2000.9
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:01 Distribution: chi2(6)
Cov. Estimator: robust
Parameter Estimates
===============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-------------------------------------------------------------------------------
const 6.7872 0.2688 25.246 0.0000 6.2602 7.3141
totchr 0.4503 0.0102 44.157 0.0000 0.4303 0.4703
female -0.0204 0.0326 -0.6257 0.5315 -0.0843 0.0435
age -0.0132 0.0030 -4.4092 0.0000 -0.0191 -0.0073
linc 0.0870 0.0226 3.8436 0.0001 0.0426 0.1314
blhisp -0.2174 0.0395 -5.5052 0.0000 -0.2948 -0.1400
hi_empunion -0.8976 0.2211 -4.0592 0.0000 -1.3310 -0.4642
===============================================================================
Endogenous: hi_empunion
Instruments: ssiratio
Robust Covariance (Heteroskedastic)
Debiased: False
Multiple Instruments¶
Using multiple instruments only requires expanding the data array in the instruments input.
[9]:
ivmod = IV2SLS(
data.ldrugexp, data[controls], data.hi_empunion, data[["ssiratio", "multlc"]]
)
res_2sls_robust = ivmod.fit()
Alternative covariance estimators¶
All estimator allow for three types of parameter covariance estimator:
"unadjusted"
is the classic homoskedastic estimator"robust"
is robust to heteroskedasticity"clustered"
allows one- or two-way clustering to account for additional sources of dependence between the model scores"kernel"
produces a heteroskedasticity-autocorrelation robust covariance estimator
The default is "robust"
.
These are all passed using the keyword input cov_type
. Using clustered requires also passing the clustering variable(s).
[10]:
ivmod = IV2SLS(
data.ldrugexp, data[controls], data.hi_empunion, data[["ssiratio", "multlc"]]
)
res_2sls_std = ivmod.fit(cov_type="unadjusted")
GMM Estimation¶
GMM estimation can be more efficient than 2SLS when there are more than one instrument. By default, 2-step efficient GMM is used (assuming the weighting matrix is correctly specified). It is possible to iterate until convergence using the optional keyword input iter_limit
, which is naturally 2 by default. Generally, GMM-CUE would be preferred to using multiple iterations of standard GMM.
The default weighting matrix is robust to heteroskedasticity (but not clustering).
[11]:
ivmod = IVGMM(
data.ldrugexp, data[controls], data.hi_empunion, data[["ssiratio", "multlc"]]
)
res_gmm = ivmod.fit()
Changing the weighting matrix structure in GMM estimation¶
The weighting matrix in the GMM objective function can be altered when creating the model. This example uses clustered weight by age. The covariance estimator should usually match the weighting matrix, and so clustering is also used here.
[12]:
ivmod = IVGMM(
data.ldrugexp,
data[controls],
data.hi_empunion,
data[["ssiratio", "multlc"]],
weight_type="clustered",
clusters=data.age,
)
res_gmm_clustered = ivmod.fit(cov_type="clustered", clusters=data.age)
Continuously updating GMM¶
The continuously updating GMM estimator simultaneously optimizes the moment conditions and the weighting matrix. It can be more efficient (in the second order sense) than standard 2-step GMM, although it can also be fragile. Here the optional input display
is used to produce the output of the non-linear optimizer used to estimate the parameters.
[13]:
ivmod = IVGMMCUE(
data.ldrugexp, data[controls], data.hi_empunion, data[["ssiratio", "multlc"]]
)
res_gmm_cue = ivmod.fit(cov_type="robust", display=True)
Current function value: 1.045365
Iterations: 10
Function evaluations: 420
Gradient evaluations: 51
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/scipy/optimize/_minimize.py:726: OptimizeWarning: Desired error not necessarily achieved due to precision loss.
res = _minimize_bfgs(fun, x0, args, jac, callback, **options)
Comparing results¶
The function compare
can be used to compare the results of multiple models, possibly with different variables, estimators and/or instruments. Usually a dictionary or OrderedDict
is used to hold results since the keys are used as model names. The advantage of an OrderedDict
is that it will preserve the order of the models in the presentation.
With the expectation of the OLS estimate, the parameter estimates are fairly consistent. Standard errors vary slightly although the conclusions reached are not sensitive to the choice of covariance estimator either. T-stats are reported in parentheses.
[14]:
from collections import OrderedDict
from linearmodels.iv.results import compare
res = OrderedDict()
res["OLS"] = res_ols
res["2SLS"] = res_2sls
res["2SLS-Homo"] = res_2sls_std
res["2SLS-Hetero"] = res_2sls_robust
res["GMM"] = res_gmm
res["GMM Cluster(Age)"] = res_gmm_clustered
res["GMM-CUE"] = res_gmm_cue
print(compare(res))
Model Comparison
==========================================================================================================================
OLS 2SLS 2SLS-Homo 2SLS-Hetero GMM GMM Cluster(Age) GMM-CUE
--------------------------------------------------------------------------------------------------------------------------
Dep. Variable ldrugexp ldrugexp ldrugexp ldrugexp ldrugexp ldrugexp ldrugexp
Estimator OLS IV-2SLS IV-2SLS IV-2SLS IV-GMM IV-GMM IV-GMM
No. Observations 10089 10089 10089 10089 10089 10089 10089
Cov. Est. robust robust unadjusted robust robust clustered robust
R-squared 0.1770 0.0640 0.0414 0.0414 0.0406 0.0292 0.0388
Adj. R-squared 0.1765 0.0634 0.0409 0.0409 0.0400 0.0286 0.0382
F-statistic 2262.6 2000.9 1882.3 1955.4 1952.6 1700.8 1949.2
P-value (F-stat) 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
================== =========== =========== ============ =========== =========== =========== ===========
hi_empunion 0.0739 -0.8976 -0.9899 -0.9899 -0.9933 -1.0359 -1.0002
(2.8441) (-4.0592) (-5.1501) (-4.8386) (-4.8530) (-5.0683) (-4.8810)
const 5.8611 6.7872 6.8752 6.8752 6.8778 6.7277 6.8847
(37.320) (25.246) (28.030) (26.660) (26.658) (13.299) (26.657)
totchr 0.4404 0.4503 0.4512 0.4512 0.4510 0.4482 0.4510
(47.049) (44.157) (42.942) (43.769) (43.738) (33.833) (43.701)
female 0.0578 -0.0204 -0.0278 -0.0278 -0.0282 -0.0245 -0.0288
(2.2797) (-0.6257) (-0.8933) (-0.8653) (-0.8752) (-0.8398) (-0.8928)
age -0.0035 -0.0132 -0.0141 -0.0141 -0.0142 -0.0119 -0.0142
(-1.8228) (-4.4092) (-5.0834) (-4.8753) (-4.8773) (-1.8928) (-4.8969)
linc 0.0105 0.0870 0.0943 0.0943 0.0945 0.0957 0.0950
(0.7646) (3.8436) (4.4400) (4.3079) (4.3142) (6.4934) (4.3333)
blhisp -0.1513 -0.2174 -0.2237 -0.2237 -0.2231 -0.2091 -0.2236
(-4.4353) (-5.5052) (-5.7805) (-5.6514) (-5.6344) (-4.1662) (-5.6421)
==================== ============= ============= ============== ============= ============= ============= =============
Instruments ssiratio ssiratio ssiratio ssiratio ssiratio ssiratio
multlc multlc multlc multlc multlc
--------------------------------------------------------------------------------------------------------------------------
T-stats reported in parentheses
Testing endogeneity¶
The Durbin test is a classic of endogeneity which compares OLS estimates with 2SLS and exploits the fact that OLS estimates will be relatively efficient. Durbin”s test is not robust to heteroskedasticity.
[15]:
res_2sls.durbin()
[15]:
Durbin test of exogeneity
H0: All endogenous variables are exogenous
Statistic: 25.2819
P-value: 0.0000
Distributed: chi2(1)
WaldTestStatistic, id: 0x7f82f271e4a0
The Wu-Hausman test is a variant of the Durbin test that uses a slightly different form.
[16]:
res_2sls.wu_hausman()
[16]:
Wu-Hausman test of exogeneity
H0: All endogenous variables are exogenous
Statistic: 25.3253
P-value: 0.0000
Distributed: F(1,10081)
WaldTestStatistic, id: 0x7f82f62850f0
The test statistic can be directly replicated using the squared t-stat in a 2-stage approach where the first stage regresses the endogenous variable on the controls and instrument and the second stage regresses the dependent variable on the controls, the endogenous regressor and the residuals. If the regressor was in fact exogenous, the residuals should not be correlated with the dependent variable.
[17]:
import pandas as pd
step1 = IV2SLS(data.hi_empunion, data[["ssiratio"] + controls], None, None).fit()
resids = step1.resids
exog = pd.concat([data[["hi_empunion"] + controls], resids], axis=1)
step2 = IV2SLS(data.ldrugexp, exog, None, None).fit(cov_type="unadjusted")
print(step2.tstats.residual**2)
25.34541049972333
Wooldridge”s regression-based test of exogeneity is robust to heteroskedasticity since it inherits the covariance estimator from the model. Here there is little difference.
[18]:
res_2sls.wooldridge_regression
[18]:
Wooldridge's regression test of exogeneity
H0: Endogenous variables are exogenous
Statistic: 26.4542
P-value: 0.0000
Distributed: chi2(1)
WaldTestStatistic, id: 0x7f82f1f3a650
Wooldridge”s score test is an alternative to the regression test, although it usually has slightly less power since it is an LM rather than a Wald type test.
[19]:
res_2sls.wooldridge_score
[19]:
Wooldridge's score test of exogeneity
H0: Endogenous variables are exogenous
Statistic: 24.9350
P-value: 0.0000
Distributed: chi2(1)
WaldTestStatistic, id: 0x7f82f1f3a770
Exogeneity Testing¶
When there is more than one instrument (the model is overidentified), the J test can be used in GMM models to test whether the model is overidentified – in other words, whether the instruments are actually exogenous (assuming they are relevant). In the case with 2 instruments there is no evidence that against the null.
[20]:
res_gmm.j_stat
[20]:
H0: Expected moment conditions are equal to 0
Statistic: 1.0475
P-value: 0.3061
Distributed: chi2(1)
WaldTestStatistic, id: 0x7f82f244bdc0
When all instruments are included the story changes, and some of the additional instrument (lowincome
or firmsz
) appear to be endogenous.
[21]:
ivmod = IVGMM(data.ldrugexp, data[controls], data.hi_empunion, data[instruments])
res_gmm_all = ivmod.fit()
res_gmm_all.j_stat
[21]:
H0: Expected moment conditions are equal to 0
Statistic: 11.5903
P-value: 0.0089
Distributed: chi2(3)
WaldTestStatistic, id: 0x7f82f1f39c30
Single Instrument Regressions¶
It can be useful to run the just identified regressions to see how the IV estimate varies by instrument. The OLS model is included for comparison. The coefficient when using lowincome
is very similar to the OLS as is the \(R^2\) which indicates this variable may be endogenous. The coefficient using firmsz
is also very different, but this is probably due to the low correlation between firmsz
and the endogenous regressor so that this is a weak instrument.
[22]:
od = OrderedDict()
for col in instruments:
od[col] = IV2SLS(data.ldrugexp, data[controls], data.hi_empunion, data[col]).fit(
cov_type="robust"
)
od["OLS"] = res_ols
print(compare(od))
Model Comparison
==========================================================================================
ssiratio lowincome multlc firmsz OLS
------------------------------------------------------------------------------------------
Dep. Variable ldrugexp ldrugexp ldrugexp ldrugexp ldrugexp
Estimator IV-2SLS IV-2SLS IV-2SLS IV-2SLS OLS
No. Observations 10089 10089 10089 10089 10089
Cov. Est. robust robust robust robust robust
R-squared 0.0640 0.1768 -0.0644 -0.9053 0.1770
Adj. R-squared 0.0634 0.1763 -0.0651 -0.9064 0.1765
F-statistic 2000.9 2250.6 1734.1 950.64 2262.6
P-value (F-stat) 0.0000 0.0000 0.0000 0.0000 0.0000
================== =========== =========== =========== =========== ===========
const 6.7872 5.8201 7.2145 8.7267 5.8611
(25.246) (15.267) (16.326) (6.4196) (37.320)
totchr 0.4503 0.4399 0.4548 0.4710 0.4404
(44.157) (43.834) (39.193) (23.163) (47.049)
female -0.0204 0.0613 -0.0565 -0.1842 0.0578
(-0.6257) (1.6066) (-1.2596) (-1.5312) (2.2797)
age -0.0132 -0.0031 -0.0177 -0.0335 -0.0035
(-4.4092) (-0.7521) (-3.7086) (-2.3433) (-1.8228)
linc 0.0870 0.0071 0.1223 0.2473 0.0105
(3.8436) (0.2280) (3.2965) (2.1876) (0.7646)
blhisp -0.2174 -0.1484 -0.2479 -0.3559 -0.1513
(-5.5052) (-3.5639) (-5.0727) (-3.2425) (-4.4353)
hi_empunion -0.8976 0.1170 -1.3459 -2.9323 0.0739
(-4.0592) (0.3254) (-3.1760) (-2.0908) (2.8441)
==================== ============= ============= ============= ============= =============
Instruments ssiratio lowincome multlc firmsz
------------------------------------------------------------------------------------------
T-stats reported in parentheses
First Stage Diagnostics¶
First stage diagnostics are available to assess whether the instruments appear to be credible for the endogenous regressor. The Partial F-statistic is the F-statistic for all instruments once controls have been partialed out. In the case of a single instrument, it is just the squared t-stat.
[23]:
print(res_2sls.first_stage)
First Stage Estimation Results
======================================
hi_empunion
--------------------------------------
R-squared 0.0761
Partial R-squared 0.0179
Shea's R-squared 0.0179
Partial F-statistic 65.806
P-value (Partial F-stat) 4.441e-16
Partial F-stat Distn chi2(1)
========================== ===========
const 1.0290
(17.705)
totchr 0.0128
(3.4896)
female -0.0734
(-7.6226)
age -0.0086
(-12.184)
linc 0.0484
(7.3266)
blhisp -0.0627
(-5.1084)
ssiratio -0.1916
(-8.1121)
--------------------------------------
T-stats reported in parentheses
T-stats use same covariance type as original model
The F-statistic actually has a \(chi^2\) distribution since it is just a Wald test that all of the coefficients are 0. This breaks the “rule-of-thumb” but it can be applied by dividing the F-stat by the number of instruments.
[24]:
ivmod = IV2SLS(data.ldrugexp, data[controls], data.hi_empunion, data[instruments])
res_2sls_all = ivmod.fit()
print(res_2sls_all.first_stage)
First Stage Estimation Results
======================================
hi_empunion
--------------------------------------
R-squared 0.0821
Partial R-squared 0.0243
Shea's R-squared 0.0243
Partial F-statistic 179.47
P-value (Partial F-stat) 0.0000
Partial F-stat Distn chi2(4)
========================== ===========
const 0.9899
(16.959)
totchr 0.0133
(3.6494)
female -0.0724
(-7.5497)
age -0.0080
(-11.206)
linc 0.0410
(6.3552)
blhisp -0.0676
(-5.5369)
ssiratio -0.1690
(-7.3289)
lowincome -0.0637
(-5.1947)
multlc 0.1151
(5.4799)
firmsz 0.0037
(1.9286)
--------------------------------------
T-stats reported in parentheses
T-stats use same covariance type as original model
LIML¶
The LIML estimator and related k-class estimators can be used through IVLIML
. LIML can have better finite sample properties if the model is not strongly identified. By default the \(\kappa\) parameter is estimated. In this dataset it is very close to 1 and to the results for LIML are similar to 2SLS (they would be exact if \(\kappa=1\)).
[25]:
ivmod = IVLIML(
data.ldrugexp, data[controls], data.hi_empunion, data[["ssiratio", "multlc"]]
)
res_liml = ivmod.fit(cov_type="robust")
print(compare({"2SLS": res_2sls_robust, "LIML": res_liml, "GMM": res_gmm}))
Model Comparison
==============================================================
2SLS LIML GMM
--------------------------------------------------------------
Dep. Variable ldrugexp ldrugexp ldrugexp
Estimator IV-2SLS IV-LIML IV-GMM
No. Observations 10089 10089 10089
Cov. Est. robust robust robust
R-squared 0.0414 0.0400 0.0406
Adj. R-squared 0.0409 0.0394 0.0400
F-statistic 1955.4 1952.3 1952.6
P-value (F-stat) 0.0000 0.0000 0.0000
================== =========== =========== ===========
const 6.8752 6.8807 6.8778
(26.660) (26.577) (26.658)
totchr 0.4512 0.4513 0.4510
(43.769) (43.730) (43.738)
female -0.0278 -0.0283 -0.0282
(-0.8653) (-0.8776) (-0.8752)
age -0.0141 -0.0142 -0.0142
(-4.8753) (-4.8781) (-4.8773)
linc 0.0943 0.0947 0.0945
(4.3079) (4.3114) (4.3142)
blhisp -0.2237 -0.2241 -0.2231
(-5.6514) (-5.6531) (-5.6344)
hi_empunion -0.9899 -0.9957 -0.9933
(-4.8386) (-4.8361) (-4.8530)
==================== ============= ============= =============
Instruments ssiratio ssiratio ssiratio
multlc multlc multlc
--------------------------------------------------------------
T-stats reported in parentheses
The estimated value of \(\kappa\).
[26]:
print(res_liml.kappa)
1.0001153166806434
IV2SLS as OLS¶
As one final check, the “OLS” version of IV2SLS
is compared to statsmodels
OLS
command. The parameters are identical.
[27]:
import pandas as pd
ivolsmod = IV2SLS(data.ldrugexp, data[["hi_empunion"] + controls], None, None)
res_ivols = ivolsmod.fit()
sm_ols = res_ols.params
sm_ols.name = "sm"
print(pd.concat([res_ivols.params, sm_ols], axis=1))
parameter sm
hi_empunion 0.073879 0.073879
const 5.861131 5.861131
totchr 0.440381 0.440381
female 0.057806 0.057806
age -0.003529 -0.003529
linc 0.010482 0.010482
blhisp -0.151307 -0.151307