Basic Examples¶
These examples are based on Chapter 15 of Introduction to Econometrics by Jeffrey Wooldridge and demonstrate the basic use of the IV estimators (primarily IV2SLS – the two-stage least squares estimator)
Wages of Married Women¶
This first example examines the effect of education on the wages of women. Education is a classic endogenous variable since it has signaling value beyond the actual direct effect (among other reasons).
This first block imports the data and uses the DESCR
attribute to describe the dataset. add_constant
from statsmodels is used to add a variable named const
to the DataFrame. Observations with missing values are dropped.
[1]:
import numpy as np
from linearmodels.datasets import mroz
from statsmodels.api import add_constant
print(mroz.DESCR)
data = mroz.load()
data = data.dropna()
data = add_constant(data, has_constant="add")
T.A. Mroz (1987), "The Sensitivity of an Empirical Model of Married Women's
Hours of Work to Economic and Statistical Assumptions," Econometrica 55,
765-799.
nlf 1 if in labor force, 1975
hours hours worked, 1975
kidslt6 # kids < 6 years
kidsge6 # kids 6-18
age woman's age in yrs
educ years of schooling
wage estimated wage from earns., hours
repwage reported wage at interview in 1976
hushrs hours worked by husband, 1975
husage husband's age
huseduc husband's years of schooling
huswage husband's hourly wage, 1975
faminc family income, 1975
mtr fed. marginal tax rate facing woman
motheduc mother's years of schooling
fatheduc father's years of schooling
unem unem. rate in county of resid.
city =1 if live in SMSA
exper actual labor mkt exper
nwifeinc (faminc - wage*hours)/1000
lwage log(wage)
expersq exper^2
Since OLS is a special case of 2SLS, IV2SLS can be used to estimate a model using OLS by setting endog
and instruments
variables to None
.
This first regression uses OLS to estimate the effect of education on the log of wage. It indicates that 1 year of education increases wage by 10%.
[2]:
from linearmodels.iv import IV2SLS
res_ols = IV2SLS(np.log(data.wage), data[["const", "educ"]], None, None).fit(
cov_type="unadjusted"
)
print(res_ols)
OLS Estimation Summary
==============================================================================
Dep. Variable: wage R-squared: 0.1179
Estimator: OLS Adj. R-squared: 0.1158
No. Observations: 428 F-statistic: 57.196
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:06 Distribution: chi2(1)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const -0.1852 0.1848 -1.0022 0.3163 -0.5474 0.1770
educ 0.1086 0.0144 7.5628 0.0000 0.0805 0.1368
==============================================================================
Two-stage least squares is estimated “as-if” two regressions are run. Here the first stage regression where OLS is used to fit the value on the instrument (in this case the education of the subject”s father). The fitted value is saved for use later.
This first stage regression indicates that there is a strong relationship and the first stage easily passes the rule of thumb test where the F-statistic is at least 10.
[3]:
res_first = IV2SLS(data.educ, data[["const", "fatheduc"]], None, None).fit(
cov_type="unadjusted"
)
print(res_first)
data["educ_hat"] = data.educ - res_first.resids
OLS Estimation Summary
==============================================================================
Dep. Variable: educ R-squared: 0.1726
Estimator: OLS Adj. R-squared: 0.1706
No. Observations: 428 F-statistic: 89.258
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:06 Distribution: chi2(1)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 10.237 0.2753 37.186 0.0000 9.6975 10.777
fatheduc 0.2694 0.0285 9.4476 0.0000 0.2135 0.3253
==============================================================================
The second stage uses the instrument to fit the model. This model indicates a much lower effect of education.
[4]:
res_second = IV2SLS(np.log(data.wage), data[["const"]], data.educ, data.fatheduc).fit(
cov_type="unadjusted"
)
print(res_second)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: wage R-squared: 0.0934
Estimator: IV-2SLS Adj. R-squared: 0.0913
No. Observations: 428 F-statistic: 2.8487
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0914
Time: 17:55:06 Distribution: chi2(1)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 0.4411 0.4451 0.9911 0.3216 -0.4312 1.3134
educ 0.0592 0.0351 1.6878 0.0914 -0.0095 0.1279
==============================================================================
Endogenous: educ
Instruments: fatheduc
Unadjusted Covariance (Homoskedastic)
Debiased: False
The fitted value can be used with OLS to estimate the same parameters. Note that the other values reported such as t-statistics or \(R^2\) are not correct.
The compare
function is used to compare the different models. The 2SLS coefficient on education and the direct coefficient on educ_hat are identical.
[5]:
from linearmodels.iv import compare
res_direct = IV2SLS(np.log(data.wage), data[["const", "educ_hat"]], None, None).fit(
cov_type="unadjusted"
)
print(compare({"OLS": res_ols, "2SLS": res_second, "Direct": res_direct}))
Model Comparison
=================================================================
OLS 2SLS Direct
-----------------------------------------------------------------
Dep. Variable wage wage wage
Estimator OLS IV-2SLS OLS
No. Observations 428 428 428
Cov. Est. unadjusted unadjusted unadjusted
R-squared 0.1179 0.0934 0.0060
Adj. R-squared 0.1158 0.0913 0.0037
F-statistic 57.196 2.8487 2.5982
P-value (F-stat) 3.941e-14 0.0914 0.1070
================== ============ ============ ============
const -0.1852 0.4411 0.4411
(-1.0022) (0.9911) (0.9465)
educ 0.1086 0.0592
(7.5628) (1.6878)
educ_hat 0.0592
(1.6119)
==================== ============== ============== ==============
Instruments fatheduc
-----------------------------------------------------------------
T-stats reported in parentheses
Wages of Men¶
This next example examines the returns to education for men and uses the number of siblings as an instrument.
[6]:
from linearmodels.datasets import wage
men = wage.load()
print(wage.DESCR)
men = men[["educ", "wage", "sibs", "exper"]]
men = add_constant(men)
men = men.dropna()
M. Blackburn and D. Neumark (1992), "Unobserved Ability, Efficiency Wages, and
Interindustry Wage Differentials," Quarterly Journal of Economics 107, 1421-1436.
wage monthly earnings
hours average weekly hours
IQ IQ score
KWW knowledge of world work score
educ years of education
exper years of work experience
tenure years with current employer
age age in years
married =1 if married
black =1 if black
south =1 if live in south
urban =1 if live in SMSA
sibs number of siblings
brthord birth order
meduc mother's education
feduc father's education
lwage natural log of wage
OLS estimates indicate a small effect of education for men in this sample.
[7]:
res_ols = IV2SLS(np.log(men.wage), men[["const", "educ"]], None, None).fit(
cov_type="unadjusted"
)
print(res_ols)
OLS Estimation Summary
==============================================================================
Dep. Variable: wage R-squared: 0.0973
Estimator: OLS Adj. R-squared: 0.0964
No. Observations: 934 F-statistic: 100.71
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:06 Distribution: chi2(1)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 5.9733 0.0814 73.419 0.0000 5.8139 6.1328
educ 0.0598 0.0060 10.035 0.0000 0.0481 0.0715
==============================================================================
The first stage regressed the endogenous variable on the instrument. There is a strong, negative relationship here.
[8]:
res_first = IV2SLS(men.educ, men[["const", "sibs"]], None, None).fit(
cov_type="unadjusted"
)
print(res_first.summary)
OLS Estimation Summary
==============================================================================
Dep. Variable: educ R-squared: 0.0576
Estimator: OLS Adj. R-squared: 0.0566
No. Observations: 934 F-statistic: 57.107
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:07 Distribution: chi2(1)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 14.143 0.1131 125.02 0.0000 13.921 14.365
sibs -0.2287 0.0303 -7.5569 0.0000 -0.2880 -0.1694
==============================================================================
The second stage indicates a much strong relationship than the OLS estimate.
[9]:
res = IV2SLS(np.log(men.wage), men.const, men.educ, men.sibs).fit(cov_type="unadjusted")
print(res.summary)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: wage R-squared: -0.0090
Estimator: IV-2SLS Adj. R-squared: -0.0101
No. Observations: 934 F-statistic: 21.715
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:07 Distribution: chi2(1)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 5.1310 0.3539 14.497 0.0000 4.4373 5.8248
educ 0.1224 0.0263 4.6599 0.0000 0.0709 0.1738
==============================================================================
Endogenous: educ
Instruments: sibs
Unadjusted Covariance (Homoskedastic)
Debiased: False
Weighted IV¶
All IV estimator support weighted which extends the concept of WLS in an OLS model to IV estimation. The weights are applied to the dependent variables, the matrix of regressors (endogenous and exogenous) and the matrix of instrument (exogenous and instruments) which allows for GLS-type estimation. In particular, if the variance of model residuals was \(\sigma^2_i\), then setting \(w_i = 1 / \sigma^2_i\) would produce GLS estimates.
This example shows how a feasible GLS-type of estimator could be used.
[10]:
res2 = res.resids**2
fgls_mod = IV2SLS(np.log(res2), men[["const", "sibs", "exper"]], None, None)
fgls_res = fgls_mod.fit()
sigma2_hat = np.exp(np.log(res2) - fgls_res.resids)
print(fgls_res)
OLS Estimation Summary
==============================================================================
Dep. Variable: residual R-squared: 0.0045
Estimator: OLS Adj. R-squared: 0.0024
No. Observations: 934 F-statistic: 3.9903
Date: Fri, Jul 19 2024 P-value (F-stat) 0.1360
Time: 17:55:07 Distribution: chi2(2)
Cov. Estimator: robust
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const -2.9315 0.2224 -13.179 0.0000 -3.3675 -2.4956
sibs -0.0651 0.0335 -1.9452 0.0518 -0.1307 0.0005
exper 0.0084 0.0158 0.5311 0.5953 -0.0226 0.0394
==============================================================================
The variance of the squared residuals is not particularly well explained by the data, and so the GLS-type estimates and the usual IV estimates don”t differ by much.
[11]:
res_gls = IV2SLS(
np.log(men.wage), men.const, men.educ, men.sibs, weights=1 / sigma2_hat
).fit(cov_type="unadjusted")
print(res_gls)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: wage R-squared: 0.0090
Estimator: IV-2SLS Adj. R-squared: 0.0080
No. Observations: 934 F-statistic: 23.520
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:07 Distribution: chi2(1)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 5.2031 0.3231 16.105 0.0000 4.5699 5.8363
educ 0.1166 0.0241 4.8497 0.0000 0.0695 0.1638
==============================================================================
Endogenous: educ
Instruments: sibs
Unadjusted Covariance (Homoskedastic)
Debiased: False
Smoking and birth weight¶
This example examines the effect of smoking on the birth weight of babies.
[12]:
from linearmodels.datasets import birthweight
data = birthweight.load()
print(birthweight.DESCR)
data = add_constant(data)
J. Mullahy (1997), "Instrumental-Variable Estimation of Count Data Models:
Applications to Models of Cigarette Smoking Behavior," Review of Economics
and Statistics 79, 596-593.
faminc 1988 family income, $1000s
cigtax cig. tax in home state, 1988
cigprice cig. price in home state, 1988
bwght birth weight, ounces
fatheduc father's yrs of educ
motheduc mother's yrs of educ
parity birth order of child
male =1 if male child
white =1 if white
cigs cigs smked per day while preg
lbwght log of bwght
bwghtlbs birth weight, pounds
packs packs smked per day while preg
faminc log(faminc)
The first stage regresses the number of packs smoked on the cigarette price. There is a very weak relationship – so weak that this is an example of a _weak instrument.
[13]:
res = IV2SLS(data.packs, data[["const", "cigprice"]], None, None).fit(
cov_type="unadjusted"
)
print(res)
OLS Estimation Summary
==============================================================================
Dep. Variable: packs R-squared: 0.0015
Estimator: OLS Adj. R-squared: 0.0008
No. Observations: 1388 F-statistic: 2.0731
Date: Fri, Jul 19 2024 P-value (F-stat) 0.1499
Time: 17:55:07 Distribution: chi2(1)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const -2.9746 3.2361 -0.9192 0.3580 -9.3172 3.3681
cigprice 0.0356 0.0247 1.4398 0.1499 -0.0129 0.0840
==============================================================================
Despite the weak relationship between the price and the number of pack smoked, the 2SLS can be estimated, although substantial caution is warranted to interpret the results. Note the very negative \(R^2\) (-150%).
[14]:
res = IV2SLS(np.log(data.bwght), data.const, data.packs, data.cigprice).fit(
cov_type="unadjusted"
)
print(res)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: bwght R-squared: -1.5224
Estimator: IV-2SLS Adj. R-squared: -1.5242
No. Observations: 1388 F-statistic: 1.1356
Date: Fri, Jul 19 2024 P-value (F-stat) 0.2866
Time: 17:55:07 Distribution: chi2(1)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 4.7203 0.0381 123.80 0.0000 4.6456 4.7951
packs 0.0238 0.0223 1.0656 0.2866 -0.0199 0.0675
==============================================================================
Endogenous: packs
Instruments: cigprice
Unadjusted Covariance (Homoskedastic)
Debiased: False
Proximity and education¶
This next example uses a well-known dataset that uses proximity to a 4 year college as an instrument for education.
[15]:
from linearmodels.datasets import card
data = card.load()
print(card.DESCR)
data = add_constant(data)
D. Card (1995), "Using Geographic Variation in College Proximity to Estimate
the Return to Schooling," in Aspects of Labour Market Behavior: Essays in
Honour of John Vanderkamp. Ed. L.N. Christophides, E.K. Grant, and R.
Swidinsky, 201-222. Toronto: University of Toronto Press.
id person identifier
nearc2 =1 if near 2 yr college, 1966
nearc4 =1 if near 4 yr college, 1966
educ years of schooling, 1976
age in years
fatheduc father's schooling
motheduc mother's schooling
weight NLS sampling weight, 1976
momdad14 =1 if live with mom, dad at 14
sinmom14 =1 if with single mom at 14
step14 =1 if with step parent at 14
reg661 =1 for region 1, 1966
reg662 =1 for region 2, 1966
reg663 =1 for region 3, 1966
reg664 =1 for region 4, 1966
reg665 =1 for region 5, 1966
reg666 =1 for region 6, 1966
reg667 =1 for region 7, 1966
reg668 =1 for region 8, 1966
reg669 =1 for region 9, 1966
south66 =1 if in south in 1966
black =1 if black
smsa =1 in in SMSA, 1976
south =1 if in south, 1976
smsa66 =1 if in SMSA, 1966
wage hourly wage in cents, 1976
enroll =1 if enrolled in school, 1976
KWW knowledge world of work score
IQ IQ score
married =1 if married, 1976
libcrd14 =1 if lib. card in home at 14
exper age - educ - 6
lwage log(wage)
xpersq exper**2
Lists are used to hold the groups of variable in this large model and missing values are dropped.
[16]:
dep = ["wage"]
endog = ["educ"]
exog = [
"const",
"exper",
"expersq",
"black",
"smsa",
"south",
"smsa66",
"reg662",
"reg663",
"reg664",
"reg665",
"reg666",
"reg667",
"reg668",
"reg669",
]
instr = ["nearc4"]
data = data[dep + exog + endog + instr].dropna()
The first stage estimate shows a very large F-statistic. Note that when there are many exogenous variables the results cannot be directly interpreted. It is better to use the first_stage
information from a 2SLS estimate.
[17]:
res = IV2SLS(data.educ, data[instr + exog], None, None).fit()
print(res)
OLS Estimation Summary
==============================================================================
Dep. Variable: educ R-squared: 0.4771
Estimator: OLS Adj. R-squared: 0.4745
No. Observations: 3010 F-statistic: 3693.4
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:07 Distribution: chi2(15)
Cov. Estimator: robust
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
nearc4 0.3199 0.0848 3.7702 0.0002 0.1536 0.4862
const 16.638 0.2148 77.456 0.0000 16.217 17.059
exper -0.4125 0.0320 -12.896 0.0000 -0.4752 -0.3498
expersq 0.0009 0.0017 0.5100 0.6100 -0.0025 0.0042
black -0.9355 0.0923 -10.138 0.0000 -1.1164 -0.7547
smsa 0.4022 0.1109 3.6255 0.0003 0.1848 0.6196
south -0.0516 0.1416 -0.3645 0.7155 -0.3291 0.2259
smsa66 0.0255 0.1103 0.2309 0.8174 -0.1908 0.2417
reg662 -0.0786 0.1854 -0.4242 0.6714 -0.4420 0.2847
reg663 -0.0279 0.1789 -0.1562 0.8759 -0.3785 0.3226
reg664 0.1172 0.2070 0.5660 0.5714 -0.2886 0.5230
reg665 -0.2726 0.2237 -1.2186 0.2230 -0.7111 0.1659
reg666 -0.3028 0.2361 -1.2826 0.1996 -0.7656 0.1599
reg667 -0.2168 0.2389 -0.9077 0.3640 -0.6850 0.2513
reg668 0.5239 0.2562 2.0449 0.0409 0.0218 1.0260
reg669 0.2103 0.1988 1.0575 0.2903 -0.1794 0.6000
==============================================================================
The OLS estimate indicates an increase of 7% for a year of education.
[18]:
res = IV2SLS(np.log(data.wage), data[exog + endog], None, None).fit()
print(res)
OLS Estimation Summary
==============================================================================
Dep. Variable: wage R-squared: 0.2998
Estimator: OLS Adj. R-squared: 0.2963
No. Observations: 3010 F-statistic: 1377.0
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:07 Distribution: chi2(15)
Cov. Estimator: robust
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 4.6208 0.0740 62.417 0.0000 4.4757 4.7659
exper 0.0848 0.0067 12.592 0.0000 0.0716 0.0980
expersq -0.0023 0.0003 -7.1798 0.0000 -0.0029 -0.0017
black -0.1990 0.0181 -10.985 0.0000 -0.2345 -0.1635
smsa 0.1364 0.0192 7.1159 0.0000 0.0988 0.1739
south -0.1480 0.0280 -5.2917 0.0000 -0.2028 -0.0932
smsa66 0.0262 0.0185 1.4153 0.1570 -0.0101 0.0626
reg662 0.0964 0.0350 2.7531 0.0059 0.0278 0.1650
reg663 0.1445 0.0337 4.2850 0.0000 0.0784 0.2107
reg664 0.0551 0.0411 1.3402 0.1802 -0.0255 0.1356
reg665 0.1280 0.0428 2.9912 0.0028 0.0441 0.2119
reg666 0.1405 0.0450 3.1223 0.0018 0.0523 0.2287
reg667 0.1180 0.0455 2.5934 0.0095 0.0288 0.2071
reg668 -0.0564 0.0505 -1.1183 0.2634 -0.1553 0.0425
reg669 0.1186 0.0387 3.0658 0.0022 0.0428 0.1944
educ 0.0747 0.0036 20.540 0.0000 0.0676 0.0818
==============================================================================
The 2SLS estimate of the effect is nearly double. However, there is some reason to be concerned about the strength of the instrument despite the F-statistic in the first stage regression.
[19]:
res_2sls = IV2SLS(np.log(data.wage), data[exog], data[endog], data[instr]).fit()
print(res_2sls)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: wage R-squared: 0.2382
Estimator: IV-2SLS Adj. R-squared: 0.2343
No. Observations: 3010 F-statistic: 840.83
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:07 Distribution: chi2(15)
Cov. Estimator: robust
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 3.6662 0.9085 4.0352 0.0001 1.8855 5.4468
exper 0.1083 0.0233 4.6376 0.0000 0.0625 0.1540
expersq -0.0023 0.0003 -6.7128 0.0000 -0.0030 -0.0017
black -0.1468 0.0524 -2.8031 0.0051 -0.2494 -0.0441
smsa 0.1118 0.0311 3.5995 0.0003 0.0509 0.1727
south -0.1447 0.0291 -4.9775 0.0000 -0.2016 -0.0877
smsa66 0.0185 0.0205 0.9035 0.3663 -0.0217 0.0587
reg662 0.1008 0.0365 2.7644 0.0057 0.0293 0.1722
reg663 0.1483 0.0355 4.1760 0.0000 0.0787 0.2178
reg664 0.0499 0.0435 1.1471 0.2514 -0.0354 0.1352
reg665 0.1463 0.0491 2.9794 0.0029 0.0500 0.2425
reg666 0.1629 0.0516 3.1553 0.0016 0.0617 0.2641
reg667 0.1346 0.0504 2.6689 0.0076 0.0357 0.2334
reg668 -0.0831 0.0571 -1.4552 0.1456 -0.1950 0.0288
reg669 0.1078 0.0410 2.6317 0.0085 0.0275 0.1881
educ 0.1315 0.0540 2.4353 0.0149 0.0257 0.2373
==============================================================================
Endogenous: educ
Instruments: nearc4
Robust Covariance (Heteroskedastic)
Debiased: False
The property first_stage
can be used to show a large set of first stage diagnostics. These results show a much lower partial \(R^2\) that has measures the unique effect of the instrument on the endogenous controlling for the exogenous regressors. This is much smaller than the naive first stage \(R^2\) of 47%. The partial F-statistic is also much smaller, although it technically over the rule-of-thumb of 10 for a single instrument.
The instrument is a dummy variable and being close to a 4 year college is only worth 0.3 years of education on average.
[20]:
print(res_2sls.first_stage)
First Stage Estimation Results
======================================
educ
--------------------------------------
R-squared 0.4771
Partial R-squared 0.0044
Shea's R-squared 0.0044
Partial F-statistic 14.214
P-value (Partial F-stat) 0.0002
Partial F-stat Distn chi2(1)
========================== ===========
const 16.638
(77.456)
exper -0.4125
(-12.896)
expersq 0.0009
(0.5100)
black -0.9355
(-10.138)
smsa 0.4022
(3.6255)
south -0.0516
(-0.3645)
smsa66 0.0255
(0.2309)
reg662 -0.0786
(-0.4242)
reg663 -0.0279
(-0.1562)
reg664 0.1172
(0.5660)
reg665 -0.2726
(-1.2186)
reg666 -0.3028
(-1.2826)
reg667 -0.2168
(-0.9077)
reg668 0.5239
(2.0449)
reg669 0.2103
(1.0575)
nearc4 0.3199
(3.7702)
--------------------------------------
T-stats reported in parentheses
T-stats use same covariance type as original model
Formula interface¶
This model was large and so it might be simpler to use a formula. While formulas are discussed in detail in another notebook, they use the formula language of formulaic with an augmentation to specify the endogenous and instrumental variables. The generic form is
dependent ~ exog + [endog ~ instr]
where each block can contain multiple variables.
Here the model is compared to the direct parameterization using DataFrames by differencing the \(R^2\).
[21]:
import numpy as np
formula = (
"np.log(wage) ~ 1 + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + "
"reg665 + reg666 + reg667 + reg668 + reg669 + [educ ~ nearc4]"
)
mod = IV2SLS.from_formula(formula, data)
res_formula = mod.fit(cov_type="unadjusted")
print(res_formula.rsquared - res_2sls.rsquared)
0.0
Categorical Variables¶
pandas categorical
s are automatically treated as factors and expanded to dummies. The first is always dropped. This next block constructs a categorical from the region dummies and then uses it instead of the individual dummies. The model is identical.
[22]:
data["reg"] = "661" # The default region, which was omitted
for i in range(2, 10):
region = "reg66" + str(i)
data.loc[data[region] == 1, "reg"] = region[3:]
data["reg"] = data["reg"].astype("category")
data.describe()
res_cat = IV2SLS(
np.log(data.wage),
data[["const", "exper", "expersq", "black", "smsa", "south", "smsa66", "reg"]],
data.educ,
data.nearc4,
).fit()
print(res_cat)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: wage R-squared: 0.2382
Estimator: IV-2SLS Adj. R-squared: 0.2343
No. Observations: 3010 F-statistic: 840.83
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:07 Distribution: chi2(15)
Cov. Estimator: robust
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 3.6662 0.9085 4.0352 0.0001 1.8855 5.4468
exper 0.1083 0.0233 4.6376 0.0000 0.0625 0.1540
expersq -0.0023 0.0003 -6.7128 0.0000 -0.0030 -0.0017
black -0.1468 0.0524 -2.8031 0.0051 -0.2494 -0.0441
smsa 0.1118 0.0311 3.5995 0.0003 0.0509 0.1727
south -0.1447 0.0291 -4.9775 0.0000 -0.2016 -0.0877
smsa66 0.0185 0.0205 0.9035 0.3663 -0.0217 0.0587
reg.662 0.1008 0.0365 2.7644 0.0057 0.0293 0.1722
reg.663 0.1483 0.0355 4.1760 0.0000 0.0787 0.2178
reg.664 0.0499 0.0435 1.1471 0.2514 -0.0354 0.1352
reg.665 0.1463 0.0491 2.9794 0.0029 0.0500 0.2425
reg.666 0.1629 0.0516 3.1553 0.0016 0.0617 0.2641
reg.667 0.1346 0.0504 2.6689 0.0076 0.0357 0.2334
reg.668 -0.0831 0.0571 -1.4552 0.1456 -0.1950 0.0288
reg.669 0.1078 0.0410 2.6317 0.0085 0.0275 0.1881
educ 0.1315 0.0540 2.4353 0.0149 0.0257 0.2373
==============================================================================
Endogenous: educ
Instruments: nearc4
Robust Covariance (Heteroskedastic)
Debiased: False
Post-estimation diagnostics¶
Common post-estimation diagnostics are to test the assumption of endogeneity and to examine if instruments are valid (when there are more instruments than endogenous variables).
[23]:
data = mroz.load()
data = data.dropna()
data = add_constant(data, has_constant="add")
data["lnwage"] = np.log(data.wage)
dep = "lnwage"
exog = ["const", "exper", "expersq"]
endog = ["educ"]
instr = ["fatheduc", "motheduc"]
The first step is to fit the model using 2SLS.
[24]:
res = IV2SLS(data[dep], data[exog], data[endog], data[instr]).fit(cov_type="unadjusted")
print(res)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: lnwage R-squared: 0.1357
Estimator: IV-2SLS Adj. R-squared: 0.1296
No. Observations: 428 F-statistic: 24.653
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:07 Distribution: chi2(3)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 0.0481 0.3985 0.1207 0.9039 -0.7329 0.8291
exper 0.0442 0.0134 3.3038 0.0010 0.0180 0.0704
expersq -0.0009 0.0004 -2.2485 0.0245 -0.0017 -0.0001
educ 0.0614 0.0313 1.9622 0.0497 7.043e-05 0.1227
==============================================================================
Endogenous: educ
Instruments: fatheduc, motheduc
Unadjusted Covariance (Homoskedastic)
Debiased: False
Wooldridge”s regression test of exogeneity uses regression residual where the endogenous variables are regressed on the exogenous and the instrument to test for endogeneity. IF the endogenous variable is actually exogenous these residuals should not be correlated with the variable of interest.
[25]:
res.wooldridge_regression
[25]:
Wooldridge's regression test of exogeneity
H0: Endogenous variables are exogenous
Statistic: 2.8256
P-value: 0.0928
Distributed: chi2(1)
WaldTestStatistic, id: 0x7ff918d63cd0
This test can be easily implemented using two regression. The first one constructs the residuals and the second re-fits the model using 2SLS but including the residuals.
Note that the p-value of the t-state on residuals
is the same as the P-value of the previous test – this is not an accident.
[26]:
v = IV2SLS(data[endog], data[exog + instr], None, None).fit().resids
import pandas as pd
res_direct = IV2SLS(
data[dep], pd.concat([v, data[exog]], axis=1), data[endog], data[instr]
).fit(cov_type="unadjusted")
print(res_direct)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: lnwage R-squared: 0.1624
Estimator: IV-2SLS Adj. R-squared: 0.1544
No. Observations: 428 F-statistic: 82.954
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:07 Distribution: chi2(4)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
residual 0.0582 0.0346 1.6810 0.0928 -0.0097 0.1260
const 0.0481 0.3923 0.1226 0.9024 -0.7207 0.8169
exper 0.0442 0.0132 3.3559 0.0008 0.0184 0.0700
expersq -0.0009 0.0004 -2.2840 0.0224 -0.0017 -0.0001
educ 0.0614 0.0308 1.9932 0.0462 0.0010 0.1218
==============================================================================
Endogenous: educ
Instruments: fatheduc, motheduc
Unadjusted Covariance (Homoskedastic)
Debiased: False
Since this regression has two instrument it is possible to test for overidentification. Wooldridge”s overidentification test uses a regression to test whether the 2SLS residuals are uncorrelated with the instruments, which should be the case if the model is correct and the instruments are not needed in the original model.
[27]:
res.wooldridge_overid
[27]:
Wooldridge's score test of overidentification
H0: Model is not overidentified.
Statistic: 0.4435
P-value: 0.5055
Distributed: chi2(1)
WaldTestStatistic, id: 0x7ff918d63e20
A naive version of this test can be directly implemented. This direct implementation is different from the formal test but would be consistent if the model was overidentified.
[28]:
u = res.resids
res = IV2SLS(u, data[["exper", "expersq"] + instr], None, None).fit()
The test is \(n\times R^2\), and has the same \(\chi^2_1\) distribution. The test statistic is slightly smaller but the conclusions are the same.
[29]:
res.nobs * res.rsquared
[29]:
0.3719811106150277
Husband”s education can be used as an additional instrument, and its validity tested.
[30]:
instr = ["fatheduc", "motheduc", "huseduc"]
res = IV2SLS(data[dep], data[exog], data[endog], data[instr]).fit(cov_type="unadjusted")
Testing overidentification does not indicate any difference from the previous result.
[31]:
res.wooldridge_overid
[31]:
Wooldridge's score test of overidentification
H0: Model is not overidentified.
Statistic: 1.0421
P-value: 0.5939
Distributed: chi2(2)
WaldTestStatistic, id: 0x7ff918d61840
Directly testing using two regression would reach the same conclusion.
[32]:
u = res.resids
res = IV2SLS(u, data[["exper", "expersq"] + instr], None, None).fit()
res.nobs * res.rsquared
[32]:
1.1126588681969989
Panel IV¶
Instrumental variable regression can also be used with panel data. This example makes use of first differences to eliminate a year-specific effect and then uses
[33]:
from linearmodels.datasets import jobtraining
data = jobtraining.load()
print(jobtraining.DESCR)
data.head()
data = data.where(data.year.isin((1987, 1988)))
data = data.dropna(how="all", axis=0).sort_values(["fcode", "year"])
print(data.describe())
data = data.set_index("fcode")
data = data[["year", "hrsemp", "grant", "scrap", "lscrap"]]
H. Holzer, R. Block, M. Cheatham, and J. Knott (1993), "Are Training Subsidies
Effective? The Michigan Experience," Industrial and Labor Relations Review 46,
625-636.
year 1987, 1988, or 1989
fcode firm code number
employ # employees at plant
sales annual sales, $
avgsal average employee salary
scrap scrap rate (per 100 items)
rework rework rate (per 100 items)
tothrs total hours training
union =1 if unionized
grant =1 if received grant
d89 =1 if year = 1989
d88 =1 if year = 1988
totrain total employees trained
hrsemp tothrs/totrain
lscrap log(scrap)
lemploy log(employ)
lsales log(sales)
lrework log(rework)
lhrsemp log(1 + hrsemp)
lscrap_1 lagged lscrap; missing 1987
grant_1 lagged grant; assumed 0 in 1987
clscrap lscrap - lscrap_1; year > 1987
cgrant grant - grant_1
clemploy lemploy - lemploy[t-1]
clsales lavgsal - lavgsal[t-1]
lavgsal log(avgsal)
clavgsal lavgsal - lavgsal[t-1]
cgrant_1 cgrant[t-1]
chrsemp hrsemp - hrsemp[t-1]
lhrsemp lhrsemp - lhrsemp[t-1]
year fcode employ sales avgsal \
count 314.000000 314.000000 290.000000 2.450000e+02 267.000000
mean 1987.500000 415708.885350 56.903448 5.679815e+06 18326.284644
std 0.500798 4025.063452 70.843101 6.977975e+06 6507.584305
min 1987.000000 410032.000000 4.000000 1.100000e+05 4237.000000
25% 1987.000000 410604.000000 15.000000 1.527000e+06 13664.000000
50% 1987.500000 418084.000000 27.500000 2.900000e+06 17160.000000
75% 1988.000000 419309.000000 70.000000 7.000000e+06 22000.000000
max 1988.000000 419486.000000 500.000000 4.700000e+07 40563.000000
scrap rework tothrs union grant ... \
count 108.000000 82.000000 276.000000 314.000000 314.000000 ...
mean 4.199722 3.883537 25.648551 0.197452 0.114650 ...
std 6.188094 6.145294 45.694834 0.398712 0.319107 ...
min 0.010000 0.000000 0.000000 0.000000 0.000000 ...
25% 0.767500 0.332500 0.000000 0.000000 0.000000 ...
50% 1.555000 1.500000 8.000000 0.000000 0.000000 ...
75% 5.000000 4.267500 39.250000 0.000000 0.000000 ...
max 30.000000 40.000000 290.000000 1.000000 1.000000 ...
grant_1 clscrap cgrant clemploy clsales lavgsal \
count 314.0 54.000000 314.000000 144.000000 119.000000 267.000000
mean 0.0 -0.168993 0.114650 0.096876 0.128995 9.754753
std 0.0 0.589774 0.319107 0.251180 0.489422 0.353818
min 0.0 -2.502170 0.000000 -0.411099 -1.982874 8.351611
25% 0.0 -0.350983 0.000000 -0.031946 0.023286 9.522509
50% 0.0 -0.115118 0.000000 0.064538 0.116219 9.750337
75% 0.0 0.092537 0.000000 0.182321 0.215391 9.998797
max 0.0 2.397895 1.000000 1.673976 2.896701 10.610610
clavgsal cgrant_1 chrsemp clhrsemp
count 131.000000 157.0 125.000000 125.000000
mean 0.058601 0.0 7.423050 0.642580
std 0.074255 0.0 19.301290 1.235229
min -0.080411 0.0 -47.753490 -2.338574
25% 0.020135 0.0 0.000000 0.000000
50% 0.057159 0.0 0.272727 0.039375
75% 0.086255 0.0 10.000000 1.063067
max 0.568912 0.0 107.027800 4.394449
[8 rows x 30 columns]
[34]:
deltas = data.loc[data.year == 1988] - data.loc[data.year == 1987]
deltas = add_constant(deltas, has_constant="add")
deltas = deltas.dropna()
print(deltas.describe())
const year hrsemp grant scrap lscrap
count 45.0 45.0 45.000000 45.000000 45.000000 45.000000
mean 1.0 1.0 10.812321 0.377778 -0.817556 -0.185697
std 0.0 0.0 20.523825 0.490310 2.496392 0.626858
min 1.0 1.0 -19.850180 0.000000 -10.000000 -2.502169
25% 1.0 1.0 0.000000 0.000000 -1.000000 -0.355820
50% 1.0 1.0 1.846154 0.000000 -0.110000 -0.167054
75% 1.0 1.0 15.333330 1.000000 0.090000 0.054067
max 1.0 1.0 80.000000 1.000000 5.000000 2.397895
The first stage indicates a relatively strong relationship between grant and the number of hours employed. Note that grant is a dummy and so the coefficient is just the difference in means.
[35]:
mod = IV2SLS(deltas.hrsemp, deltas[["const", "grant"]], None, None)
print(mod.fit(cov_type="unadjusted"))
OLS Estimation Summary
==============================================================================
Dep. Variable: hrsemp R-squared: 0.3408
Estimator: OLS Adj. R-squared: 0.3255
No. Observations: 45 F-statistic: 23.266
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0000
Time: 17:55:07 Distribution: chi2(1)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 1.5806 3.1139 0.5076 0.6117 -4.5225 7.6837
grant 24.437 5.0662 4.8235 0.0000 14.507 34.367
==============================================================================
Here a formula is used to specify the model since it is cleaner. Note that the []
contains the endogenous variables and instruments.
[36]:
mod = IV2SLS.from_formula("lscrap ~ 1 + [hrsemp ~ grant]", deltas)
res_iv = mod.fit(cov_type="unadjusted")
print(res_iv)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: lscrap R-squared: 0.0159
Estimator: IV-2SLS Adj. R-squared: -0.0070
No. Observations: 45 F-statistic: 3.3464
Date: Fri, Jul 19 2024 P-value (F-stat) 0.0674
Time: 17:55:07 Distribution: chi2(1)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
Intercept -0.0327 0.1241 -0.2632 0.7924 -0.2759 0.2106
hrsemp -0.0142 0.0077 -1.8293 0.0674 -0.0293 0.0010
==============================================================================
Endogenous: hrsemp
Instruments: grant
Unadjusted Covariance (Homoskedastic)
Debiased: False
The 2SLS estimate is nearly twice as large as the OLS estimate and is slightly more significant.
[37]:
res_ols = IV2SLS.from_formula("lscrap ~ 1 + hrsemp", deltas).fit(cov_type="unadjusted")
print(compare({"Panel OLS": res_ols, "Panel IV": res_iv}))
Model Comparison
==================================================
Panel OLS Panel IV
--------------------------------------------------
Dep. Variable lscrap lscrap
Estimator OLS IV-2SLS
No. Observations 45 45
Cov. Est. unadjusted unadjusted
R-squared 0.0619 0.0159
Adj. R-squared 0.0401 -0.0070
F-statistic 2.9707 3.3464
P-value (F-stat) 0.0848 0.0674
================== ============ ============
Intercept -0.1035 -0.0327
(-1.0208) (-0.2632)
hrsemp -0.0076 -0.0142
(-1.7236) (-1.8293)
==================== ============== ==============
Instruments grant
--------------------------------------------------
T-stats reported in parentheses
[38]:
mod = IV2SLS.from_formula("lscrap ~ 1 + [hrsemp ~ grant]", deltas)
res_iv = mod.fit(cov_type="unadjusted")
n = deltas.shape[0]
pred_exog = pd.DataFrame(np.ones((n, 1)), index=deltas.index)
res_iv.predict(exog=pred_exog, endog=deltas[["hrsemp"]])
[38]:
predictions | |
---|---|
fcode | |
410523.0 | -0.016018 |
410563.0 | -0.032668 |
410565.0 | -0.032668 |
410566.0 | 0.037845 |
410567.0 | -0.032668 |
410577.0 | -0.040453 |
410593.0 | -0.032668 |
410596.0 | -0.060975 |
410606.0 | -0.010335 |
410626.0 | -0.011102 |
418011.0 | -0.154386 |
418021.0 | -0.119765 |
418035.0 | -0.293087 |
418045.0 | -0.089281 |
418054.0 | -0.058797 |
418065.0 | -0.202584 |
418076.0 | -0.227847 |
418091.0 | -0.193340 |
418097.0 | -0.067874 |
418107.0 | -1.065849 |
418118.0 | -0.342269 |
418125.0 | -0.339713 |
418140.0 | -1.164922 |
418163.0 | -0.598795 |
418168.0 | -0.032668 |
418177.0 | -0.400651 |
418237.0 | -0.117587 |
419198.0 | -0.998414 |
419201.0 | -0.152554 |
419242.0 | -0.469395 |
419268.0 | -0.032668 |
419272.0 | 0.084601 |
419289.0 | -0.032668 |
419297.0 | 0.248274 |
419307.0 | -0.036528 |
419339.0 | -0.032668 |
419343.0 | -0.042576 |
419357.0 | -0.386498 |
419378.0 | -0.434747 |
419381.0 | 0.027988 |
419388.0 | -0.049180 |
419409.0 | -0.249684 |
419459.0 | -0.033183 |
419482.0 | -0.032668 |
419483.0 | -0.032668 |