Basic Examples

These examples are based on Chapter 15 of Introductory Econometrics by Jeffrey Wooldridge and demonstrate the basic use of the IV estimators (primarily IV2SLS, the two-stage least squares estimator).

Wages of Married Women

This first example examines the effect of education on the wages of women. Education is a classic endogenous variable since it has signaling value beyond the actual direct effect (among other reasons).

This first block imports the data and uses the DESCR attribute to describe the dataset. add_constant from statsmodels is used to add a variable named const to the DataFrame. Observations with missing values are dropped.

[1]:
import numpy as np
from statsmodels.api import add_constant
from linearmodels.datasets import mroz
print(mroz.DESCR)
data = mroz.load()
data = data.dropna()
data = add_constant(data, has_constant='add')

T.A. Mroz (1987), "The Sensitivity of an Empirical Model of Married Women's
Hours of Work to Economic and Statistical Assumptions," Econometrica 55,
765-799.

nlf        1 if in labor force, 1975
hours      hours worked, 1975
kidslt6    # kids < 6 years
kidsge6    # kids 6-18
age        woman's age in yrs
educ       years of schooling
wage       estimated wage from earns., hours
repwage    reported wage at interview in 1976
hushrs     hours worked by husband, 1975
husage     husband's age
huseduc    husband's years of schooling
huswage    husband's hourly wage, 1975
faminc     family income, 1975
mtr        fed. marginal tax rate facing woman
motheduc   mother's years of schooling
fatheduc   father's years of schooling
unem       unem. rate in county of resid.
city       =1 if live in SMSA
exper      actual labor mkt exper
nwifeinc   (faminc - wage*hours)/1000
lwage      log(wage)
expersq    exper^2

Since OLS is a special case of 2SLS, IV2SLS can be used to estimate a model using OLS by setting endog and instruments variables to None.

This first regression uses OLS to estimate the effect of education on the log of wage. It indicates that an additional year of education increases wage by approximately 10.9%.

[2]:
from linearmodels.iv import IV2SLS
res_ols = IV2SLS(np.log(data.wage), data[['const','educ']], None, None).fit(cov_type='unadjusted')
print(res_ols)
                            OLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.1179
Estimator:                        OLS   Adj. R-squared:                 0.1158
No. Observations:                 428   F-statistic:                    57.196
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:28   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const         -0.1852     0.1848    -1.0022     0.3163     -0.5474      0.1770
educ           0.1086     0.0144     7.5628     0.0000      0.0805      0.1368
==============================================================================

Two-stage least squares is estimated “as-if” two regressions are run. Here, the first-stage regression uses OLS to fit the endogenous variable on the instrument (in this case, the education of the subject’s father). The fitted value is saved for use later.

This first-stage regression indicates that there is a strong relationship, and the first stage easily passes the rule-of-thumb test that the F-statistic be at least 10.

[3]:
res_first = IV2SLS(data.educ, data[['const','fatheduc']], None, None).fit(cov_type='unadjusted')
print(res_first)
data['educ_hat'] = data.educ - res_first.resids
                            OLS Estimation Summary
==============================================================================
Dep. Variable:                   educ   R-squared:                      0.1726
Estimator:                        OLS   Adj. R-squared:                 0.1706
No. Observations:                 428   F-statistic:                    89.258
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:28   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          10.237     0.2753     37.186     0.0000      9.6975      10.777
fatheduc       0.2694     0.0285     9.4476     0.0000      0.2135      0.3253
==============================================================================

The second stage uses the instrument to fit the model. This model indicates a much lower effect of education.

[4]:
res_second = IV2SLS(np.log(data.wage), data[['const']], data.educ, data.fatheduc).fit(cov_type='unadjusted')
print(res_second)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.0934
Estimator:                    IV-2SLS   Adj. R-squared:                 0.0913
No. Observations:                 428   F-statistic:                    2.8487
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0914
Time:                        10:34:28   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          0.4411     0.4451     0.9911     0.3216     -0.4312      1.3134
educ           0.0592     0.0351     1.6878     0.0914     -0.0095      0.1279
==============================================================================

Endogenous: educ
Instruments: fatheduc
Unadjusted Covariance (Homoskedastic)
Debiased: False

The fitted value can be used with OLS to estimate the same parameters. Note that the other reported values, such as t-statistics and \(R^2\), are not correct.
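This equivalence, and the reason the second-stage standard errors are wrong, can be seen in a small self-contained simulation (pure NumPy, independent of the dataset above; all parameter values are illustrative): the coefficients from an OLS regression on the first-stage fitted values match the closed-form just-identified IV estimator, but the second-stage residuals are built from the fitted values rather than the original regressor.

```python
import numpy as np

# Simulated IV setup (illustrative values only)
rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)                        # instrument
u = rng.standard_normal(n)                        # structural error
x = 0.5 * z + 0.8 * u + rng.standard_normal(n)    # endogenous regressor
y = 1.0 + 0.25 * x + u                            # structural equation

Z = np.column_stack([np.ones(n), z])              # constant + instrument
X = np.column_stack([np.ones(n), x])              # constant + regressor

# "As-if" two stages: fit x on Z by OLS, then regress y on the fitted value
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
X_hat = np.column_stack([np.ones(n), x_hat])
beta_two_stage = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Closed-form just-identified IV: beta = (Z'X)^{-1} Z'y
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

# Point estimates agree; however, the second-stage residuals y - X_hat @ beta
# differ from the correct IV residuals y - X @ beta, so standard errors taken
# from the direct OLS regression are not valid
```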

The compare function is used to compare the different models. The 2SLS coefficient on educ and the coefficient on educ_hat from the direct OLS regression are identical.

[5]:
from linearmodels.iv import compare
res_direct = IV2SLS(np.log(data.wage), data[['const','educ_hat']], None, None).fit(cov_type='unadjusted')
print(compare({'OLS':res_ols, '2SLS':res_second, 'Direct': res_direct}))
                         Model Comparison
=================================================================
                                OLS           2SLS         Direct
-----------------------------------------------------------------
Dep. Variable                  wage           wage           wage
Estimator                       OLS        IV-2SLS            OLS
No. Observations                428            428            428
Cov. Est.                unadjusted     unadjusted     unadjusted
R-squared                    0.1179         0.0934         0.0060
Adj. R-squared               0.1158         0.0913         0.0037
F-statistic                  57.196         2.8487         2.5982
P-value (F-stat)          3.941e-14         0.0914         0.1070
==================     ============   ============   ============
const                       -0.1852         0.4411         0.4411
                          (-1.0022)       (0.9911)       (0.9465)
educ                         0.1086         0.0592
                           (7.5628)       (1.6878)
educ_hat                                                   0.0592
                                                         (1.6119)
==================== ============== ============== ==============
Instruments                               fatheduc
-----------------------------------------------------------------

T-stats reported in parentheses

Wages of Men

This next example examines the returns to education for men and uses the number of siblings as an instrument.

[6]:
from linearmodels.datasets import wage
men = wage.load()
print(wage.DESCR)
men = men[['educ','wage','sibs','exper']]
men = add_constant(men)
men = men.dropna()

M. Blackburn and D. Neumark (1992), "Unobserved Ability, Efficiency Wages, and
Interindustry Wage Differentials," Quarterly Journal of Economics 107, 1421-1436.

wage                     monthly earnings
hours                    average weekly hours
IQ                       IQ score
KWW                      knowledge of world work score
educ                     years of education
exper                    years of work experience
tenure                   years with current employer
age                      age in years
married                  =1 if married
black                    =1 if black
south                    =1 if live in south
urban                    =1 if live in SMSA
sibs                     number of siblings
brthord                  birth order
meduc                    mother's education
feduc                    father's education
lwage                    natural log of wage

OLS estimates indicate a small effect of education for men in this sample.

[7]:
res_ols = IV2SLS(np.log(men.wage), men[['const', 'educ']], None, None).fit(cov_type='unadjusted')
print(res_ols)
                            OLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.0973
Estimator:                        OLS   Adj. R-squared:                 0.0964
No. Observations:                 934   F-statistic:                    100.71
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:28   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          5.9733     0.0814     73.419     0.0000      5.8139      6.1328
educ           0.0598     0.0060     10.035     0.0000      0.0481      0.0715
==============================================================================

The first stage regresses the endogenous variable on the instrument. There is a strong, negative relationship here.

[8]:
res_first = IV2SLS(men.educ, men[['const', 'sibs']], None, None).fit(cov_type='unadjusted')
print(res_first.summary)
                            OLS Estimation Summary
==============================================================================
Dep. Variable:                   educ   R-squared:                      0.0576
Estimator:                        OLS   Adj. R-squared:                 0.0566
No. Observations:                 934   F-statistic:                    57.107
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:28   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          14.143     0.1131     125.02     0.0000      13.921      14.365
sibs          -0.2287     0.0303    -7.5569     0.0000     -0.2880     -0.1694
==============================================================================

The second stage indicates a much stronger relationship than the OLS estimate.

[9]:
res = IV2SLS(np.log(men.wage), men.const, men.educ, men.sibs).fit(cov_type='unadjusted')
print(res.summary)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                     -0.0090
Estimator:                    IV-2SLS   Adj. R-squared:                -0.0101
No. Observations:                 934   F-statistic:                    21.715
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:28   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          5.1310     0.3539     14.497     0.0000      4.4373      5.8248
educ           0.1224     0.0263     4.6599     0.0000      0.0709      0.1738
==============================================================================

Endogenous: educ
Instruments: sibs
Unadjusted Covariance (Homoskedastic)
Debiased: False

Weighted IV

All IV estimators support weights, which extend the concept of WLS in an OLS model to IV estimation. The weights are applied to the dependent variable, the matrix of regressors (endogenous and exogenous) and the matrix of instruments (exogenous and instruments), which allows for GLS-type estimation. In particular, if the variance of the model residuals is \(\sigma^2_i\), then setting \(w_i = 1 / \sigma^2_i\) produces GLS estimates.
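The mechanics of this weighting can be sketched in pure NumPy with simulated data (this is an illustration of the estimator algebra, not linearmodels' internal implementation): scaling the dependent variable, the regressors, and the instruments by \(\sqrt{w_i}\) and then applying ordinary 2SLS reproduces the weighted moment-condition estimator.

```python
import numpy as np

# Simulated just-identified IV model with hypothetical weights
rng = np.random.default_rng(1)
n = 2000
z = rng.standard_normal(n)
u = rng.standard_normal(n)
x = 0.6 * z + 0.5 * u + rng.standard_normal(n)
y = 2.0 + 0.3 * x + u
w = rng.uniform(0.5, 2.0, n)          # hypothetical weights, e.g. 1/sigma_i^2

X = np.column_stack([np.ones(n), x])  # exogenous (constant) + endogenous
Z = np.column_stack([np.ones(n), z])  # exogenous (constant) + instrument

# Weighted IV by scaling every input by sqrt(w_i), then ordinary 2SLS
s = np.sqrt(w)[:, None]
beta_scaled = np.linalg.solve((s * Z).T @ (s * X), (s * Z).T @ (np.sqrt(w) * y))

# The same estimate from the weighted moment conditions (Z'WX)^{-1} Z'Wy
beta_weighted = np.linalg.solve(Z.T @ (w[:, None] * X), Z.T @ (w * y))
```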

This example shows how a feasible GLS-type of estimator could be used.

[10]:
res2 = res.resids ** 2
fgls_mod = IV2SLS(np.log(res2), men[['const','sibs','exper']], None, None)
fgls_res = fgls_mod.fit()
sigma2_hat = np.exp(np.log(res2) - fgls_res.resids)
print(fgls_res)
                            OLS Estimation Summary
==============================================================================
Dep. Variable:               residual   R-squared:                      0.0045
Estimator:                        OLS   Adj. R-squared:                 0.0024
No. Observations:                 934   F-statistic:                    3.9903
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.1360
Time:                        10:34:28   Distribution:                  chi2(2)
Cov. Estimator:                robust

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const         -2.9315     0.2224    -13.179     0.0000     -3.3675     -2.4956
sibs          -0.0651     0.0335    -1.9452     0.0518     -0.1307      0.0005
exper          0.0084     0.0158     0.5311     0.5953     -0.0226      0.0394
==============================================================================

The squared residuals are not particularly well explained by the data, and so the GLS-type estimates and the usual IV estimates do not differ by much.

[11]:
res_gls = IV2SLS(np.log(men.wage), men.const, men.educ, men.sibs, weights=1 / sigma2_hat).fit(cov_type='unadjusted')
print(res_gls)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.0090
Estimator:                    IV-2SLS   Adj. R-squared:                 0.0080
No. Observations:                 934   F-statistic:                    23.520
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:28   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          5.2031     0.3231     16.105     0.0000      4.5699      5.8363
educ           0.1166     0.0241     4.8497     0.0000      0.0695      0.1638
==============================================================================

Endogenous: educ
Instruments: sibs
Unadjusted Covariance (Homoskedastic)
Debiased: False

Smoking and birth weight

This example examines the effect of smoking on the birth weight of babies.

[12]:
from linearmodels.datasets import birthweight
data = birthweight.load()
print(birthweight.DESCR)
data = add_constant(data)

J. Mullahy (1997), "Instrumental-Variable Estimation of Count Data Models:
Applications to Models of Cigarette Smoking Behavior," Review of Economics
and Statistics 79, 586-593.

faminc                   1988 family income, $1000s
cigtax                   cig. tax in home state, 1988
cigprice                 cig. price in home state, 1988
bwght                    birth weight, ounces
fatheduc                 father's yrs of educ
motheduc                 mother's yrs of educ
parity                   birth order of child
male                     =1 if male child
white                    =1 if white
cigs                     cigs smked per day while preg
lbwght                   log of bwght
bwghtlbs                 birth weight, pounds
packs                    packs smked per day while preg
lfaminc                  log(faminc)

The first stage regresses the number of packs smoked on the cigarette price. The relationship is very weak; so weak, in fact, that this is an example of a weak instrument.

[13]:
res = IV2SLS(data.packs, data[['const','cigprice']], None, None).fit(cov_type='unadjusted')
print(res)
                            OLS Estimation Summary
==============================================================================
Dep. Variable:                  packs   R-squared:                      0.0015
Estimator:                        OLS   Adj. R-squared:                 0.0008
No. Observations:                1388   F-statistic:                    2.0731
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.1499
Time:                        10:34:28   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const         -2.9746     3.2361    -0.9192     0.3580     -9.3172      3.3681
cigprice       0.0356     0.0247     1.4398     0.1499     -0.0129      0.0840
==============================================================================

Despite the weak relationship between the price and the number of packs smoked, the 2SLS estimates can still be computed, although substantial caution is warranted when interpreting the results. Note the very negative \(R^2\) (about -152%).

[14]:
res = IV2SLS(np.log(data.bwght), data.const, data.packs, data.cigprice).fit(cov_type='unadjusted')
print(res)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                  bwght   R-squared:                     -1.5224
Estimator:                    IV-2SLS   Adj. R-squared:                -1.5242
No. Observations:                1388   F-statistic:                    1.1356
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.2866
Time:                        10:34:28   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          4.7203     0.0381     123.80     0.0000      4.6456      4.7951
packs          0.0238     0.0223     1.0656     0.2866     -0.0199      0.0675
==============================================================================

Endogenous: packs
Instruments: cigprice
Unadjusted Covariance (Homoskedastic)
Debiased: False
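The hazards of a weak first stage can be illustrated with a small Monte Carlo sketch (pure NumPy, simulated data with illustrative parameter values, not the birth weight data): as the first-stage coefficient shrinks, the sampling distribution of the just-identified 2SLS slope becomes far more dispersed.

```python
import numpy as np

rng = np.random.default_rng(2)

def iv_slope(first_stage_coef, n=500):
    # One simulated just-identified 2SLS slope estimate
    z = rng.standard_normal(n)
    u = rng.standard_normal(n)
    x = first_stage_coef * z + 0.7 * u + rng.standard_normal(n)
    y = 0.1 * x + u
    Z = np.column_stack([np.ones(n), z])
    X = np.column_stack([np.ones(n), x])
    return np.linalg.solve(Z.T @ X, Z.T @ y)[1]

strong = np.array([iv_slope(1.0) for _ in range(500)])
weak = np.array([iv_slope(0.05) for _ in range(500)])

# The weak-instrument estimates are spread over a far wider range
print(strong.std(), weak.std())
```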

Proximity and education

This next example uses a well-known dataset in which proximity to a 4-year college serves as an instrument for education.

[15]:
from linearmodels.datasets import card
data = card.load()
print(card.DESCR)
data = add_constant(data)

D. Card (1995), "Using Geographic Variation in College Proximity to Estimate
the Return to Schooling," in Aspects of Labour Market Behavior:  Essays in
Honour of John Vanderkamp.  Ed. L.N. Christophides, E.K. Grant, and R.
Swidinsky, 201-222.  Toronto: University of Toronto Press.

id                       person identifier
nearc2                   =1 if near 2 yr college, 1966
nearc4                   =1 if near 4 yr college, 1966
educ                     years of schooling, 1976
age                      in years
fatheduc                 father's schooling
motheduc                 mother's schooling
weight                   NLS sampling weight, 1976
momdad14                 =1 if live with mom, dad at 14
sinmom14                 =1 if with single mom at 14
step14                   =1 if with step parent at 14
reg661                   =1 for region 1, 1966
reg662                   =1 for region 2, 1966
reg663                   =1 for region 3, 1966
reg664                   =1 for region 4, 1966
reg665                   =1 for region 5, 1966
reg666                   =1 for region 6, 1966
reg667                   =1 for region 7, 1966
reg668                   =1 for region 8, 1966
reg669                   =1 for region 9, 1966
south66                  =1 if in south in 1966
black                    =1 if black
smsa                     =1 in in SMSA, 1976
south                    =1 if in south, 1976
smsa66                   =1 if in SMSA, 1966
wage                     hourly wage in cents, 1976
enroll                   =1 if enrolled in school, 1976
KWW                      knowledge world of work score
IQ                       IQ score
married                  =1 if married, 1976
libcrd14                 =1 if lib. card in home at 14
exper                    age - educ - 6
lwage                    log(wage)
xpersq                   exper**2

Lists are used to hold the groups of variables in this large model, and observations with missing values are dropped.

[16]:
dep = ['wage']
endog = ['educ']
exog = ['const','exper','expersq','black','smsa','south','smsa66','reg662',
       'reg663','reg664','reg665','reg666','reg667','reg668','reg669']
instr = ['nearc4']
data = data[dep + exog + endog + instr].dropna()

The first-stage estimate shows a very large F-statistic. Note that when there are many exogenous variables, this statistic cannot be directly interpreted as evidence of instrument strength. It is better to use the first_stage information from a 2SLS estimate.

[17]:
res = IV2SLS(data.educ, data[instr+exog], None, None).fit()
print(res)
                            OLS Estimation Summary
==============================================================================
Dep. Variable:                   educ   R-squared:                      0.4771
Estimator:                        OLS   Adj. R-squared:                 0.4745
No. Observations:                3010   F-statistic:                    3693.4
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:28   Distribution:                 chi2(15)
Cov. Estimator:                robust

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
nearc4         0.3199     0.0848     3.7702     0.0002      0.1536      0.4862
const          16.638     0.2148     77.456     0.0000      16.217      17.059
exper         -0.4125     0.0320    -12.896     0.0000     -0.4752     -0.3498
expersq        0.0009     0.0017     0.5100     0.6100     -0.0025      0.0042
black         -0.9355     0.0923    -10.138     0.0000     -1.1164     -0.7547
smsa           0.4022     0.1109     3.6255     0.0003      0.1848      0.6196
south         -0.0516     0.1416    -0.3645     0.7155     -0.3291      0.2259
smsa66         0.0255     0.1103     0.2309     0.8174     -0.1908      0.2417
reg662        -0.0786     0.1854    -0.4242     0.6714     -0.4420      0.2847
reg663        -0.0279     0.1789    -0.1562     0.8759     -0.3785      0.3226
reg664         0.1172     0.2070     0.5660     0.5714     -0.2886      0.5230
reg665        -0.2726     0.2237    -1.2186     0.2230     -0.7111      0.1659
reg666        -0.3028     0.2361    -1.2826     0.1996     -0.7656      0.1599
reg667        -0.2168     0.2389    -0.9077     0.3640     -0.6850      0.2513
reg668         0.5239     0.2562     2.0449     0.0409      0.0218      1.0260
reg669         0.2103     0.1988     1.0575     0.2903     -0.1794      0.6000
==============================================================================

The OLS estimate indicates an increase of about 7.5% in wage for each additional year of education.

[18]:
res = IV2SLS(np.log(data.wage), data[exog + endog], None, None).fit()
print(res)
                            OLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.2998
Estimator:                        OLS   Adj. R-squared:                 0.2963
No. Observations:                3010   F-statistic:                    1377.0
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:28   Distribution:                 chi2(15)
Cov. Estimator:                robust

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          4.6208     0.0740     62.417     0.0000      4.4757      4.7659
exper          0.0848     0.0067     12.592     0.0000      0.0716      0.0980
expersq       -0.0023     0.0003    -7.1798     0.0000     -0.0029     -0.0017
black         -0.1990     0.0181    -10.985     0.0000     -0.2345     -0.1635
smsa           0.1364     0.0192     7.1159     0.0000      0.0988      0.1739
south         -0.1480     0.0280    -5.2917     0.0000     -0.2028     -0.0932
smsa66         0.0262     0.0185     1.4153     0.1570     -0.0101      0.0626
reg662         0.0964     0.0350     2.7531     0.0059      0.0278      0.1650
reg663         0.1445     0.0337     4.2850     0.0000      0.0784      0.2107
reg664         0.0551     0.0411     1.3402     0.1802     -0.0255      0.1356
reg665         0.1280     0.0428     2.9912     0.0028      0.0441      0.2119
reg666         0.1405     0.0450     3.1223     0.0018      0.0523      0.2287
reg667         0.1180     0.0455     2.5934     0.0095      0.0288      0.2071
reg668        -0.0564     0.0505    -1.1183     0.2634     -0.1553      0.0425
reg669         0.1186     0.0387     3.0658     0.0022      0.0428      0.1944
educ           0.0747     0.0036     20.540     0.0000      0.0676      0.0818
==============================================================================

The 2SLS estimate of the effect is nearly double the OLS estimate. However, there is some reason to be concerned about the strength of the instrument despite the F-statistic in the first-stage regression.

[19]:
res_2sls = IV2SLS(np.log(data.wage), data[exog], data[endog], data[instr]).fit()
print(res_2sls)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.2382
Estimator:                    IV-2SLS   Adj. R-squared:                 0.2343
No. Observations:                3010   F-statistic:                    840.83
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:28   Distribution:                 chi2(15)
Cov. Estimator:                robust

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          3.6662     0.9085     4.0352     0.0001      1.8855      5.4468
exper          0.1083     0.0233     4.6376     0.0000      0.0625      0.1540
expersq       -0.0023     0.0003    -6.7128     0.0000     -0.0030     -0.0017
black         -0.1468     0.0524    -2.8031     0.0051     -0.2494     -0.0441
smsa           0.1118     0.0311     3.5995     0.0003      0.0509      0.1727
south         -0.1447     0.0291    -4.9775     0.0000     -0.2016     -0.0877
smsa66         0.0185     0.0205     0.9035     0.3663     -0.0217      0.0587
reg662         0.1008     0.0365     2.7644     0.0057      0.0293      0.1722
reg663         0.1483     0.0355     4.1760     0.0000      0.0787      0.2178
reg664         0.0499     0.0435     1.1471     0.2514     -0.0354      0.1352
reg665         0.1463     0.0491     2.9794     0.0029      0.0500      0.2425
reg666         0.1629     0.0516     3.1553     0.0016      0.0617      0.2641
reg667         0.1346     0.0504     2.6689     0.0076      0.0357      0.2334
reg668        -0.0831     0.0571    -1.4552     0.1456     -0.1950      0.0288
reg669         0.1078     0.0410     2.6317     0.0085      0.0275      0.1881
educ           0.1315     0.0540     2.4353     0.0149      0.0257      0.2373
==============================================================================

Endogenous: educ
Instruments: nearc4
Robust Covariance (Heteroskedastic)
Debiased: False

The property first_stage can be used to show a large set of first-stage diagnostics. These results show a much lower partial \(R^2\), which measures the unique effect of the instrument on the endogenous variable after controlling for the exogenous regressors. This is much smaller than the naive first-stage \(R^2\) of 47%. The partial F-statistic is also much smaller, although it is technically above the rule-of-thumb value of 10 for a single instrument.

The instrument is a dummy variable, and being close to a 4-year college is worth only 0.3 years of education on average.

[20]:
print(res_2sls.first_stage)
    First Stage Estimation Results
======================================
                                  educ
--------------------------------------
R-squared                       0.4771
Partial R-squared               0.0044
Shea's R-squared                0.0044
Partial F-statistic             14.214
P-value (Partial F-stat)        0.0002
Partial F-stat Distn           chi2(1)
========================== ===========
const                           16.638
                              (77.456)
exper                          -0.4125
                             (-12.896)
expersq                         0.0009
                              (0.5100)
black                          -0.9355
                             (-10.138)
smsa                            0.4022
                              (3.6255)
south                          -0.0516
                             (-0.3645)
smsa66                          0.0255
                              (0.2309)
reg662                         -0.0786
                             (-0.4242)
reg663                         -0.0279
                             (-0.1562)
reg664                          0.1172
                              (0.5660)
reg665                         -0.2726
                             (-1.2186)
reg666                         -0.3028
                             (-1.2826)
reg667                         -0.2168
                             (-0.9077)
reg668                          0.5239
                              (2.0449)
reg669                          0.2103
                              (1.0575)
nearc4                          0.3199
                              (3.7702)
--------------------------------------

T-stats reported in parentheses
T-stats use same covariance type as original model
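The partial \(R^2\) can also be computed by hand via the Frisch-Waugh-Lovell approach: residualize both the endogenous variable and the instrument against the exogenous regressors, then compute the \(R^2\) of the residual-on-residual regression. A minimal sketch on synthetic data (all variable names and values here are illustrative, not from the dataset above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Exogenous regressors (with a constant), an instrument, and an endogenous
# variable built from both -- purely synthetic, for illustration only
exog = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
instr = rng.standard_normal(n) + exog[:, 1]
endog = 0.3 * instr + exog @ np.array([1.0, 0.5, -0.2]) + rng.standard_normal(n)

def resid(y, x):
    # Residuals from an OLS projection of y on x
    beta, *_ = np.linalg.lstsq(x, y, rcond=None)
    return y - x @ beta

# Residualize both variables against the exogenous regressors
e_endog = resid(endog, exog)
e_instr = resid(instr, exog)
# Partial R-squared: R-squared of the residual-on-residual regression
partial_r2 = 1 - resid(e_endog, e_instr[:, None]).var() / e_endog.var()
print(partial_r2)
```

This isolates the instrument's explanatory power that is not already captured by the exogenous regressors, which is why it can be far below the naive first stage \(R^2\).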

Formula interface

This model is large, so it might be simpler to use a formula. While formulas are discussed in detail in another notebook, they use the formula language of patsy with an augmentation to specify the endogenous variables and the instruments. The generic form is

dependent ~ exog + [endog ~ instr]

where each block can contain multiple variables.

Here the formula-based model is compared to the direct parameterization using DataFrames by differencing the \(R^2\) values.

[21]:
import numpy as np
formula = ('np.log(wage) ~ 1 + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + '
           'reg665 + reg666 + reg667 + reg668 + reg669 + [educ ~ nearc4]')
mod = IV2SLS.from_formula(formula, data)
res_formula = mod.fit(cov_type='unadjusted')
print(res_formula.rsquared - res_2sls.rsquared)
0.0

Categorical Variables

pandas categoricals are automatically treated as factors and expanded to dummy variables, with the first category always dropped. This next block constructs a categorical from the region dummies and then uses it in place of the individual dummies. The model is identical.
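The drop-first expansion convention can be previewed with pandas.get_dummies — shown here only as an illustration of how a categorical expands; linearmodels performs the expansion internally:

```python
import pandas as pd

# A toy categorical mirroring the 'reg' variable constructed below
reg = pd.Series(['661', '662', '663', '661'], dtype='category')
# Expand to dummies, dropping the first category ('661') to avoid collinearity
dummies = pd.get_dummies(reg, prefix='reg', drop_first=True)
print(dummies.columns.tolist())  # -> ['reg_662', 'reg_663']
```

Dropping the first category leaves it as the baseline absorbed by the constant, matching the treatment of reg661 in the model below.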

[22]:
data['reg'] = '661'  # The default region, which was omitted
for i in range(2, 10):
    region = 'reg66' + str(i)
    data.loc[data[region] == 1, 'reg'] = region[3:]
data['reg'] = data['reg'].astype('category')
data.describe()
res_cat = IV2SLS(np.log(data.wage),
                 data[['const', 'exper', 'expersq', 'black', 'smsa', 'south', 'smsa66', 'reg']],
                 data.educ, data.nearc4).fit()
print(res_cat)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.2382
Estimator:                    IV-2SLS   Adj. R-squared:                 0.2343
No. Observations:                3010   F-statistic:                    840.83
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:29   Distribution:                 chi2(15)
Cov. Estimator:                robust

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          3.6662     0.9085     4.0352     0.0001      1.8855      5.4468
exper          0.1083     0.0233     4.6376     0.0000      0.0625      0.1540
expersq       -0.0023     0.0003    -6.7128     0.0000     -0.0030     -0.0017
black         -0.1468     0.0524    -2.8031     0.0051     -0.2494     -0.0441
smsa           0.1118     0.0311     3.5995     0.0003      0.0509      0.1727
south         -0.1447     0.0291    -4.9775     0.0000     -0.2016     -0.0877
smsa66         0.0185     0.0205     0.9035     0.3663     -0.0217      0.0587
reg.662        0.1008     0.0365     2.7644     0.0057      0.0293      0.1722
reg.663        0.1483     0.0355     4.1760     0.0000      0.0787      0.2178
reg.664        0.0499     0.0435     1.1471     0.2514     -0.0354      0.1352
reg.665        0.1463     0.0491     2.9794     0.0029      0.0500      0.2425
reg.666        0.1629     0.0516     3.1553     0.0016      0.0617      0.2641
reg.667        0.1346     0.0504     2.6689     0.0076      0.0357      0.2334
reg.668       -0.0831     0.0571    -1.4552     0.1456     -0.1950      0.0288
reg.669        0.1078     0.0410     2.6317     0.0085      0.0275      0.1881
educ           0.1315     0.0540     2.4353     0.0149      0.0257      0.2373
==============================================================================

Endogenous: educ
Instruments: nearc4
Robust Covariance (Heteroskedastic)
Debiased: False

Post-estimation diagnostics

Common post-estimation diagnostics test the assumption of exogeneity and examine whether the instruments are valid (when there are more instruments than endogenous variables).

[23]:
data = mroz.load()
data = data.dropna()
data = add_constant(data, has_constant='add')
data['lnwage'] = np.log(data.wage)
dep = 'lnwage'
exog = ['const','exper','expersq']
endog = ['educ']
instr = ['fatheduc','motheduc']

The first step is to fit the model using 2SLS.

[24]:
res = IV2SLS(data[dep],data[exog],data[endog],data[instr]).fit(cov_type='unadjusted')
print(res)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                 lnwage   R-squared:                      0.1357
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1296
No. Observations:                 428   F-statistic:                    24.653
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:29   Distribution:                  chi2(3)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          0.0481     0.3985     0.1207     0.9039     -0.7329      0.8291
exper          0.0442     0.0134     3.3038     0.0010      0.0180      0.0704
expersq       -0.0009     0.0004    -2.2485     0.0245     -0.0017     -0.0001
educ           0.0614     0.0313     1.9622     0.0497   7.043e-05      0.1227
==============================================================================

Endogenous: educ
Instruments: fatheduc, motheduc
Unadjusted Covariance (Homoskedastic)
Debiased: False

Wooldridge's regression test of exogeneity uses the residuals from a regression of the endogenous variables on the exogenous regressors and the instruments. If the endogenous variable is actually exogenous, these residuals should be uncorrelated with the variable of interest.

[25]:
res.wooldridge_regression
[25]:
Wooldridge's regression test of exogeneity
H0: Endogenous variables are exogenous
Statistic: 2.8256
P-value: 0.0928
Distributed: chi2(1)
WaldTestStatistic, id: 0x7f241dbdc110

This test can be easily implemented using two regressions. The first constructs the residuals and the second re-fits the model using 2SLS while including the residuals.

Note that the p-value of the t-stat on the residuals is the same as the p-value of the previous test – this is not an accident.

[26]:
import pandas as pd

v = IV2SLS(data[endog], data[exog + instr], None, None).fit().resids
res_direct = IV2SLS(data[dep], pd.concat([v, data[exog]], axis=1),
                    data[endog], data[instr]).fit(cov_type='unadjusted')
print(res_direct)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                 lnwage   R-squared:                      0.1624
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1544
No. Observations:                 428   F-statistic:                    82.954
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:29   Distribution:                  chi2(4)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
residual       0.0582     0.0346     1.6810     0.0928     -0.0097      0.1260
const          0.0481     0.3923     0.1226     0.9024     -0.7207      0.8169
exper          0.0442     0.0132     3.3559     0.0008      0.0184      0.0700
expersq       -0.0009     0.0004    -2.2840     0.0224     -0.0017     -0.0001
educ           0.0614     0.0308     1.9932     0.0462      0.0010      0.1218
==============================================================================

Endogenous: educ
Instruments: fatheduc, motheduc
Unadjusted Covariance (Homoskedastic)
Debiased: False

Since this regression has two instruments, it is possible to test for overidentification. Wooldridge's overidentification test uses a regression to test whether the 2SLS residuals are uncorrelated with the instruments, which should be the case if the model is correctly specified and the instruments are not needed in the original model.

[27]:
res.wooldridge_overid
[27]:
Wooldridge's score test of overidentification
H0: Model is not overidentified.
Statistic: 0.4435
P-value: 0.5055
Distributed: chi2(1)
WaldTestStatistic, id: 0x7f241db56b50

A naive version of this test can be implemented directly. This direct implementation differs from the formal test but is asymptotically equivalent when the model is overidentified.

[28]:
u = res.resids
res = IV2SLS(u, data[['exper','expersq'] + instr], None, None).fit()

The test statistic is \(n\times R^2\) and has the same \(\chi^2_1\) distribution. The statistic is slightly smaller but the conclusion is the same.

[29]:
res.nobs * res.rsquared
[29]:
0.3719811106150277

Husband’s education can be used as an additional instrument, and its validity tested.

[30]:
instr = ['fatheduc','motheduc','huseduc']
res = IV2SLS(data[dep], data[exog], data[endog], data[instr]).fit(cov_type='unadjusted')

Testing overidentification does not indicate any difference from the previous result.

[31]:
res.wooldridge_overid
[31]:
Wooldridge's score test of overidentification
H0: Model is not overidentified.
Statistic: 1.0421
P-value: 0.5939
Distributed: chi2(2)
WaldTestStatistic, id: 0x7f241dd02c90

Directly testing using two regressions reaches the same conclusion.

[32]:
u = res.resids
res = IV2SLS(u, data[['exper', 'expersq'] + instr], None, None).fit()
res.nobs * res.rsquared
[32]:
1.112658868197094

Panel IV

Instrumental variable regression can also be used with panel data. This example makes use of first differences to eliminate a firm-specific effect and then uses the indicator for receiving a grant as an instrument for the change in hours of training per employee.

[33]:
from linearmodels.datasets import jobtraining
data = jobtraining.load()
print(jobtraining.DESCR)
data.head()
data = data.where(data.year.isin((1987,1988)))
data = data.dropna(how='all',axis=0).sort_values(['fcode','year'])
print(data.describe())
data = data.set_index('fcode')
data = data[['year','hrsemp','grant','scrap','lscrap']]

H. Holzer, R. Block, M. Cheatham, and J. Knott (1993), "Are Training Subsidies
Effective? The Michigan Experience," Industrial and Labor Relations Review 46,
625-636.

year                     1987, 1988, or 1989
fcode                    firm code number
employ                   # employees at plant
sales                    annual sales, $
avgsal                   average employee salary
scrap                    scrap rate (per 100 items)
rework                   rework rate (per 100 items)
tothrs                   total hours training
union                    =1 if unionized
grant                    =1 if received grant
d89                      =1 if year = 1989
d88                      =1 if year = 1988
totrain                  total employees trained
hrsemp                   tothrs/totrain
lscrap                   log(scrap)
lemploy                  log(employ)
lsales                   log(sales)
lrework                  log(rework)
lhrsemp                  log(1 + hrsemp)
lscrap_1                 lagged lscrap; missing 1987
grant_1                  lagged grant; assumed 0 in 1987
clscrap                  lscrap - lscrap_1; year > 1987
cgrant                   grant - grant_1
clemploy                 lemploy - lemploy[t-1]
clsales                  lavgsal - lavgsal[t-1]
lavgsal                  log(avgsal)
clavgsal                 lavgsal - lavgsal[t-1]
cgrant_1                 cgrant[t-1]
chrsemp                  hrsemp - hrsemp[t-1]
clhrsemp                 lhrsemp - lhrsemp[t-1]

              year          fcode      employ         sales        avgsal  \
count   314.000000     314.000000  290.000000  2.450000e+02    267.000000
mean   1987.500000  415708.885350   56.903448  5.679815e+06  18326.284644
std       0.500798    4025.063452   70.843101  6.977975e+06   6507.584305
min    1987.000000  410032.000000    4.000000  1.100000e+05   4237.000000
25%    1987.000000  410604.000000   15.000000  1.527000e+06  13664.000000
50%    1987.500000  418084.000000   27.500000  2.900000e+06  17160.000000
75%    1988.000000  419309.000000   70.000000  7.000000e+06  22000.000000
max    1988.000000  419486.000000  500.000000  4.700000e+07  40563.000000

            scrap     rework      tothrs       union       grant  ...  \
count  108.000000  82.000000  276.000000  314.000000  314.000000  ...
mean     4.199722   3.883537   25.648551    0.197452    0.114650  ...
std      6.188094   6.145294   45.694834    0.398712    0.319107  ...
min      0.010000   0.000000    0.000000    0.000000    0.000000  ...
25%      0.767500   0.332500    0.000000    0.000000    0.000000  ...
50%      1.555000   1.500000    8.000000    0.000000    0.000000  ...
75%      5.000000   4.267500   39.250000    0.000000    0.000000  ...
max     30.000000  40.000000  290.000000    1.000000    1.000000  ...

       grant_1    clscrap      cgrant    clemploy     clsales     lavgsal  \
count    314.0  54.000000  314.000000  144.000000  119.000000  267.000000
mean       0.0  -0.168993    0.114650    0.096876    0.128995    9.754753
std        0.0   0.589774    0.319107    0.251180    0.489422    0.353818
min        0.0  -2.502170    0.000000   -0.411099   -1.982874    8.351611
25%        0.0  -0.350983    0.000000   -0.031946    0.023286    9.522509
50%        0.0  -0.115118    0.000000    0.064538    0.116220    9.750337
75%        0.0   0.092537    0.000000    0.182321    0.215391    9.998797
max        0.0   2.397895    1.000000    1.673976    2.896701   10.610610

         clavgsal  cgrant_1     chrsemp    clhrsemp
count  131.000000     157.0  125.000000  125.000000
mean     0.058601       0.0    7.423050    0.642580
std      0.074255       0.0   19.301290    1.235229
min     -0.080411       0.0  -47.753490   -2.338574
25%      0.020135       0.0    0.000000    0.000000
50%      0.057159       0.0    0.272727    0.039375
75%      0.086255       0.0   10.000000    1.063067
max      0.568912       0.0  107.027800    4.394449

[8 rows x 30 columns]
[34]:
deltas = data.loc[data.year == 1988] - data.loc[data.year == 1987]
deltas = add_constant(deltas, has_constant='add')
deltas = deltas.dropna()
print(deltas.describe())
       const  year     hrsemp      grant      scrap     lscrap
count   45.0  45.0  45.000000  45.000000  45.000000  45.000000
mean     1.0   1.0  10.812321   0.377778  -0.817556  -0.185697
std      0.0   0.0  20.523825   0.490310   2.496392   0.626858
min      1.0   1.0 -19.850180   0.000000 -10.000000  -2.502169
25%      1.0   1.0   0.000000   0.000000  -1.000000  -0.355820
50%      1.0   1.0   1.846154   0.000000  -0.110000  -0.167054
75%      1.0   1.0  15.333330   1.000000   0.090000   0.054067
max      1.0   1.0  80.000000   1.000000   5.000000   2.397895
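The subtraction above relies on the fcode index aligning the 1987 and 1988 rows. An alternative way to form within-firm first differences is groupby(...).diff(), sketched here on a tiny synthetic long panel (the values are made up for illustration):

```python
import pandas as pd

# Tiny synthetic long panel: two firms observed in 1987 and 1988
panel = pd.DataFrame({
    'fcode': [1, 1, 2, 2],
    'year': [1987, 1988, 1987, 1988],
    'hrsemp': [10.0, 25.0, 5.0, 7.0],
})
# Within-firm first differences; the 1987 rows become NaN and are dropped
deltas = panel.sort_values(['fcode', 'year']).groupby('fcode').diff().dropna()
print(deltas['hrsemp'].tolist())  # -> [15.0, 2.0]
```

This avoids relying on index alignment and generalizes to panels with more than two periods.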

The first stage indicates a relatively strong relationship between grant and the change in hours of training per employee. Note that grant is a dummy variable, and so its coefficient is just the difference in means between the two groups.

[35]:
mod = IV2SLS(deltas.hrsemp, deltas[['const','grant']], None, None)
print(mod.fit(cov_type='unadjusted'))
                            OLS Estimation Summary
==============================================================================
Dep. Variable:                 hrsemp   R-squared:                      0.3408
Estimator:                        OLS   Adj. R-squared:                 0.3255
No. Observations:                  45   F-statistic:                    23.266
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0000
Time:                        10:34:29   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          1.5806     3.1139     0.5076     0.6117     -4.5225      7.6837
grant          24.437     5.0662     4.8235     0.0000      14.507      34.367
==============================================================================
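The claim that the slope on a dummy regressor (with a constant) equals the difference in group means is an algebraic identity, which a quick numpy check confirms on synthetic data (variable names reuse grant/hrsemp for familiarity only):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic dummy treatment and outcome
grant = rng.integers(0, 2, size=200).astype(float)
hrsemp = 2.0 + 24.0 * grant + rng.standard_normal(200)

# OLS of the outcome on a constant and the dummy
X = np.column_stack([np.ones_like(grant), grant])
beta, *_ = np.linalg.lstsq(X, hrsemp, rcond=None)
# The slope equals the difference in group means exactly
diff_means = hrsemp[grant == 1].mean() - hrsemp[grant == 0].mean()
print(np.isclose(beta[1], diff_means))  # -> True
```

The intercept is the mean of the grant == 0 group, so the fitted values are just the two group means.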

Here a formula is used to specify the model since it is cleaner. Note that the [] block contains the endogenous variable and its instrument.

[36]:
mod = IV2SLS.from_formula('lscrap ~ 1 + [hrsemp ~ grant]', deltas)
res_iv = mod.fit(cov_type='unadjusted')
print(res_iv)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                 lscrap   R-squared:                      0.0159
Estimator:                    IV-2SLS   Adj. R-squared:                -0.0070
No. Observations:                  45   F-statistic:                    3.3464
Date:                Tue, Feb 04 2020   P-value (F-stat)                0.0674
Time:                        10:34:29   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept     -0.0327     0.1241    -0.2632     0.7924     -0.2759      0.2106
hrsemp        -0.0142     0.0077    -1.8293     0.0674     -0.0293      0.0010
==============================================================================

Endogenous: hrsemp
Instruments: grant
Unadjusted Covariance (Homoskedastic)
Debiased: False

The 2SLS estimate is nearly twice as large as the OLS estimate and is slightly more significant.

[37]:
res_ols = IV2SLS.from_formula('lscrap ~ 1 + hrsemp', deltas).fit(cov_type='unadjusted')
print(compare({'Panel OLS':res_ols,'Panel IV': res_iv}))
                 Model Comparison
==================================================
                          Panel OLS       Panel IV
--------------------------------------------------
Dep. Variable                lscrap         lscrap
Estimator                       OLS        IV-2SLS
No. Observations                 45             45
Cov. Est.                unadjusted     unadjusted
R-squared                    0.0619         0.0159
Adj. R-squared               0.0401        -0.0070
F-statistic                  2.9707         3.3464
P-value (F-stat)             0.0848         0.0674
==================     ============   ============
Intercept                   -0.1035        -0.0327
                          (-1.0208)      (-0.2632)
hrsemp                      -0.0076        -0.0142
                          (-1.7236)      (-1.8293)
==================== ============== ==============
Instruments                                  grant
--------------------------------------------------

T-stats reported in parentheses