# Basic Examples¶

These examples are based on Chapter 15 of Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge and demonstrate the basic use of the IV estimators (primarily IV2SLS – the two-stage least squares estimator).

## Wages of Married Women¶

This first example examines the effect of education on the wages of women. Education is a classic endogenous variable since it has signaling value beyond the actual direct effect (among other reasons).

This first block imports the data and uses the DESCR attribute to describe the dataset. add_constant from statsmodels is used to add a variable named const to the DataFrame. Observations with missing values are dropped.

[1]:

import numpy as np
from statsmodels.api import add_constant

from linearmodels.datasets import mroz

print(mroz.DESCR)
data = mroz.load()
data = data.dropna()
data = add_constant(data, has_constant="add")


T.A. Mroz (1987), "The Sensitivity of an Empirical Model of Married Women's
Hours of Work to Economic and Statistical Assumptions," Econometrica 55,
765-799.

inlf       1 if in labor force, 1975
hours      hours worked, 1975
kidslt6    # kids < 6 years
kidsge6    # kids 6-18
age        woman's age in yrs
educ       years of schooling
wage       estimated wage from earns., hours
repwage    reported wage at interview in 1976
hushrs     hours worked by husband, 1975
husage     husband's age
huseduc    husband's years of schooling
huswage    husband's hourly wage, 1975
faminc     family income, 1975
mtr        fed. marginal tax rate facing woman
motheduc   mother's years of schooling
fatheduc   father's years of schooling
unem       unem. rate in county of resid.
city       =1 if live in SMSA
exper      actual labor mkt exper
nwifeinc   (faminc - wage*hours)/1000
lwage      log(wage)
expersq    exper^2



Since OLS is a special case of 2SLS, IV2SLS can be used to estimate a model using OLS by setting the endog and instruments inputs to None.

This first regression uses OLS to estimate the effect of education on the log of wage. It indicates that an additional year of education increases wages by about 10%.

[2]:

from linearmodels.iv import IV2SLS

res_ols = IV2SLS(np.log(data.wage), data[["const", "educ"]], None, None).fit(
    cov_type="unadjusted"
)
print(res_ols)

                            OLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.1179
No. Observations:                 428   F-statistic:                    57.196
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:06   Distribution:                  chi2(1)

Parameter Estimates
==============================================================================
Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const         -0.1852     0.1848    -1.0022     0.3163     -0.5474      0.1770
educ           0.1086     0.0144     7.5628     0.0000      0.0805      0.1368
==============================================================================


Two-stage least squares is estimated “as-if” two regressions are run. Here the first-stage regression uses OLS to fit the endogenous variable on the instrument (in this case the education of the subject's father). The fitted value is saved for use later.

This first stage regression indicates that there is a strong relationship, and the first stage easily passes the rule-of-thumb test that the F-statistic be at least 10.

[3]:

res_first = IV2SLS(data.educ, data[["const", "fatheduc"]], None, None).fit(
    cov_type="unadjusted"
)
print(res_first)
# The first-stage fitted values are the observed values less the residuals
data["educ_hat"] = data.educ - res_first.resids

                            OLS Estimation Summary
==============================================================================
Dep. Variable:                   educ   R-squared:                      0.1726
No. Observations:                 428   F-statistic:                    89.258
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:06   Distribution:                  chi2(1)

Parameter Estimates
==============================================================================
Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          10.237     0.2753     37.186     0.0000      9.6975      10.777
fatheduc       0.2694     0.0285     9.4476     0.0000      0.2135      0.3253
==============================================================================
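
Since there is exactly one instrument, the rule-of-thumb F-statistic is just the squared t-statistic on fatheduc. The check below is a minimal sketch that reuses the first-stage result computed above.

# With a single instrument, the first-stage F-statistic equals the squared
# t-statistic on the instrument; the rule of thumb asks for a value above 10.
first_stage_f = res_first.tstats["fatheduc"] ** 2
print(first_stage_f)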


The second stage uses the instrument to estimate the model by 2SLS. It indicates a much lower effect of education.

[4]:

res_second = IV2SLS(np.log(data.wage), data[["const"]], data.educ, data.fatheduc).fit(
    cov_type="unadjusted"
)
print(res_second)

                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.0934
No. Observations:                 428   F-statistic:                    2.8487
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0914
Time:                        17:55:06   Distribution:                  chi2(1)

Parameter Estimates
==============================================================================
Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          0.4411     0.4451     0.9911     0.3216     -0.4312      1.3134
educ           0.0592     0.0351     1.6878     0.0914     -0.0095      0.1279
==============================================================================

Endogenous: educ
Instruments: fatheduc
Debiased: False


The fitted value can be used with OLS to estimate the same parameters. Note that the other values reported, such as the t-statistics and $$R^2$$, are not correct since they ignore the estimation error in the first stage.

The compare function is used to compare the different models. The 2SLS coefficient on education and the direct coefficient on educ_hat are identical.

[5]:

from linearmodels.iv import compare

res_direct = IV2SLS(np.log(data.wage), data[["const", "educ_hat"]], None, None).fit(
    cov_type="unadjusted"
)
print(compare({"OLS": res_ols, "2SLS": res_second, "Direct": res_direct}))

                         Model Comparison
=================================================================
                                OLS           2SLS         Direct
-----------------------------------------------------------------
Dep. Variable                  wage           wage           wage
Estimator                       OLS        IV-2SLS            OLS
No. Observations                428            428            428
R-squared                    0.1179         0.0934         0.0060
F-statistic                  57.196         2.8487         2.5982
P-value (F-stat)          3.941e-14         0.0914         0.1070
==================     ============   ============   ============
const                       -0.1852         0.4411         0.4411
                          (-1.0022)       (0.9911)       (0.9465)
educ                         0.1086         0.0592
                           (7.5628)       (1.6878)
educ_hat                                                   0.0592
                                                         (1.6119)
==================== ============== ============== ==============
Instruments                               fatheduc
-----------------------------------------------------------------

T-stats reported in parentheses
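
A minimal sketch using the results above shows why only the coefficients agree: the point estimates on educ and educ_hat are identical, but the standard error (and hence the t-statistic) from the direct two-step OLS is not the correct 2SLS standard error.

# The two-step point estimate matches the 2SLS estimate exactly ...
print(res_second.params["educ"], res_direct.params["educ_hat"])
# ... but the directly computed standard errors differ
print(res_second.std_errors["educ"], res_direct.std_errors["educ_hat"])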


## Wages of Men¶

This next example examines the returns to education for men and uses the number of siblings as an instrument.

[6]:

from linearmodels.datasets import wage

men = wage.load()
print(wage.DESCR)
men = men[["educ", "wage", "sibs", "exper"]]
men = add_constant(men, has_constant="add")
men = men.dropna()


M. Blackburn and D. Neumark (1992), "Unobserved Ability, Efficiency Wages, and
Interindustry Wage Differentials," Quarterly Journal of Economics 107, 1421-1436.

wage                     monthly earnings
hours                    average weekly hours
IQ                       IQ score
KWW                      knowledge of world work score
educ                     years of education
exper                    years of work experience
tenure                   years with current employer
age                      age in years
married                  =1 if married
black                    =1 if black
south                    =1 if live in south
urban                    =1 if live in SMSA
sibs                     number of siblings
brthord                  birth order
meduc                    mother's education
feduc                    father's education
lwage                    natural log of wage



OLS estimates indicate a small effect of education for men in this sample.

[7]:

res_ols = IV2SLS(np.log(men.wage), men[["const", "educ"]], None, None).fit(
    cov_type="unadjusted"
)
print(res_ols)

                            OLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.0973
No. Observations:                 934   F-statistic:                    100.71
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:06   Distribution:                  chi2(1)

Parameter Estimates
==============================================================================
Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          5.9733     0.0814     73.419     0.0000      5.8139      6.1328
educ           0.0598     0.0060     10.035     0.0000      0.0481      0.0715
==============================================================================


The first stage regresses the endogenous variable on the instrument. There is a strong, negative relationship here.

[8]:

res_first = IV2SLS(men.educ, men[["const", "sibs"]], None, None).fit(
    cov_type="unadjusted"
)
print(res_first.summary)

                            OLS Estimation Summary
==============================================================================
Dep. Variable:                   educ   R-squared:                      0.0576
No. Observations:                 934   F-statistic:                    57.107
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:07   Distribution:                  chi2(1)

Parameter Estimates
==============================================================================
Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          14.143     0.1131     125.02     0.0000      13.921      14.365
sibs          -0.2287     0.0303    -7.5569     0.0000     -0.2880     -0.1694
==============================================================================


The second stage indicates a much stronger relationship than the OLS estimate.

[9]:

res = IV2SLS(np.log(men.wage), men.const, men.educ, men.sibs).fit(cov_type="unadjusted")
print(res.summary)

                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                     -0.0090
No. Observations:                 934   F-statistic:                    21.715
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:07   Distribution:                  chi2(1)

Parameter Estimates
==============================================================================
Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          5.1310     0.3539     14.497     0.0000      4.4373      5.8248
educ           0.1224     0.0263     4.6599     0.0000      0.0709      0.1738
==============================================================================

Endogenous: educ
Instruments: sibs
Debiased: False
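
Because this model is just identified (one instrument and one endogenous variable), the 2SLS coefficient can also be computed as a Wald ratio: the reduced-form effect of sibs on log wages divided by the first-stage effect of sibs on educ. The sketch below reuses the men DataFrame from above.

# Reduced form: regress the dependent variable on the instrument
reduced_form = IV2SLS(np.log(men.wage), men[["const", "sibs"]], None, None).fit()
# First stage: regress the endogenous variable on the instrument
first_stage = IV2SLS(men.educ, men[["const", "sibs"]], None, None).fit()
# The ratio reproduces the 2SLS estimate of roughly 0.12
print(reduced_form.params["sibs"] / first_stage.params["sibs"])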


## Weighted IV¶

All IV estimators support weights, which extend the concept of WLS in an OLS model to IV estimation. The weights are applied to the dependent variable, the matrix of regressors (exogenous and endogenous) and the matrix of instruments (exogenous regressors and instruments), which allows for GLS-type estimation. In particular, if the variance of the model residuals is $$\sigma^2_i$$, then setting $$w_i = 1 / \sigma^2_i$$ would produce GLS estimates.
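
The sketch below illustrates the mechanics with an arbitrary, purely illustrative set of weights: if weights enter as in WLS, then weighted IV estimation should produce the same coefficients as scaling the dependent variable, the regressors, and the instruments by the square root of the weights and running unweighted IV on the transformed data.

# Illustrative weights only -- not derived from the data or the text
w = 1.0 / (1.0 + men.exper)
root_w = np.sqrt(w)
weighted = IV2SLS(np.log(men.wage), men[["const"]], men.educ, men.sibs, weights=w).fit()
manual = IV2SLS(
    root_w * np.log(men.wage),
    men[["const"]].multiply(root_w, axis=0),
    root_w * men.educ,
    root_w * men.sibs,
).fit()
# The point estimates should coincide (weights only matter up to scale)
print(weighted.params)
print(manual.params)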

This example shows how a feasible GLS-type of estimator could be used.

[10]:

res2 = res.resids**2
fgls_mod = IV2SLS(np.log(res2), men[["const", "sibs", "exper"]], None, None)
fgls_res = fgls_mod.fit()
# Exponentiate the fitted values of the log-variance regression to get the variance estimate
sigma2_hat = np.exp(np.log(res2) - fgls_res.resids)
print(fgls_res)

                            OLS Estimation Summary
==============================================================================
Dep. Variable:               residual   R-squared:                      0.0045
No. Observations:                 934   F-statistic:                    3.9903
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.1360
Time:                        17:55:07   Distribution:                  chi2(2)
Cov. Estimator:                robust

Parameter Estimates
==============================================================================
Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const         -2.9315     0.2224    -13.179     0.0000     -3.3675     -2.4956
sibs          -0.0651     0.0335    -1.9452     0.0518     -0.1307      0.0005
exper          0.0084     0.0158     0.5311     0.5953     -0.0226      0.0394
==============================================================================


The squared residuals are not particularly well explained by these variables, and so the GLS-type estimates and the usual IV estimates don't differ by much.

[11]:

res_gls = IV2SLS(
    np.log(men.wage), men.const, men.educ, men.sibs, weights=1 / sigma2_hat
).fit(cov_type="unadjusted")
print(res_gls)

                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.0090
No. Observations:                 934   F-statistic:                    23.520
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:07   Distribution:                  chi2(1)

Parameter Estimates
==============================================================================
Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          5.2031     0.3231     16.105     0.0000      4.5699      5.8363
educ           0.1166     0.0241     4.8497     0.0000      0.0695      0.1638
==============================================================================

Endogenous: educ
Instruments: sibs
Debiased: False
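
A side-by-side view of the unweighted IV estimates and the feasible GLS-type estimates (a quick sketch reusing compare and the results above) makes the small difference easy to see.

# res holds the unweighted IV estimates from above, res_gls the weighted ones
print(compare({"IV": res, "Weighted IV": res_gls}))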


## Smoking and birth weight¶

This example examines the effect of smoking on the birth weight of babies.

[12]:

from linearmodels.datasets import birthweight

data = birthweight.load()
print(birthweight.DESCR)
data = add_constant(data)


J. Mullahy (1997), "Instrumental-Variable Estimation of Count Data Models:
Applications to Models of Cigarette Smoking Behavior," Review of Economics
and Statistics 79, 586-593.

faminc                   1988 family income, $1000s
cigtax                   cig. tax in home state, 1988
cigprice                 cig. price in home state, 1988
bwght                    birth weight, ounces
fatheduc                 father's yrs of educ
motheduc                 mother's yrs of educ
parity                   birth order of child
male                     =1 if male child
white                    =1 if white
cigs                     cigs smked per day while preg
lbwght                   log of bwght
bwghtlbs                 birth weight, pounds
packs                    packs smked per day while preg
lfaminc                  log(faminc)


The first stage regresses the number of packs smoked on the cigarette price. There is a very weak relationship – so weak that this is an example of a *weak instrument*.

[13]:

res = IV2SLS(data.packs, data[["const", "cigprice"]], None, None).fit(
    cov_type="unadjusted"
)
print(res)

                            OLS Estimation Summary
==============================================================================
Dep. Variable:                  packs   R-squared:                      0.0015
Estimator:                        OLS   Adj. R-squared:                 0.0008
No. Observations:                1388   F-statistic:                    2.0731
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.1499
Time:                        17:55:07   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const         -2.9746     3.2361    -0.9192     0.3580     -9.3172      3.3681
cigprice       0.0356     0.0247     1.4398     0.1499     -0.0129      0.0840
==============================================================================


Despite the weak relationship between the price and the number of packs smoked, the 2SLS estimates can still be computed, although substantial caution is warranted when interpreting the results. Note the very negative $$R^2$$ (-150%).

[14]:

res = IV2SLS(np.log(data.bwght), data.const, data.packs, data.cigprice).fit(
    cov_type="unadjusted"
)
print(res)

                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                  bwght   R-squared:                     -1.5224
Estimator:                    IV-2SLS   Adj. R-squared:                -1.5242
No. Observations:                1388   F-statistic:                    1.1356
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.2866
Time:                        17:55:07   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          4.7203     0.0381     123.80     0.0000      4.6456      4.7951
packs          0.0238     0.0223     1.0656     0.2866     -0.0199      0.0675
==============================================================================

Endogenous: packs
Instruments: cigprice
Unadjusted Covariance (Homoskedastic)
Debiased: False


## Proximity and education¶

This next example uses a well-known dataset that uses proximity to a 4 year college as an instrument for education.

[15]:

from linearmodels.datasets import card

data = card.load()
print(card.DESCR)
data = add_constant(data)


D. Card (1995), "Using Geographic Variation in College Proximity to Estimate
the Return to Schooling," in Aspects of Labour Market Behavior: Essays in
Honour of John Vanderkamp. Ed. L.N. Christophides, E.K. Grant, and R.
Swidinsky, 201-222. Toronto: University of Toronto Press.

id                       person identifier
nearc2                   =1 if near 2 yr college, 1966
nearc4                   =1 if near 4 yr college, 1966
educ                     years of schooling, 1976
age                      in years
fatheduc                 father's schooling
motheduc                 mother's schooling
weight                   NLS sampling weight, 1976
momdad14                 =1 if live with mom, dad at 14
sinmom14                 =1 if with single mom at 14
step14                   =1 if with step parent at 14
reg661                   =1 for region 1, 1966
reg662                   =1 for region 2, 1966
reg663                   =1 for region 3, 1966
reg664                   =1 for region 4, 1966
reg665                   =1 for region 5, 1966
reg666                   =1 for region 6, 1966
reg667                   =1 for region 7, 1966
reg668                   =1 for region 8, 1966
reg669                   =1 for region 9, 1966
south66                  =1 if in south in 1966
black                    =1 if black
smsa                     =1 if in SMSA, 1976
south                    =1 if in south, 1976
smsa66                   =1 if in SMSA, 1966
wage                     hourly wage in cents, 1976
enroll                   =1 if enrolled in school, 1976
KWW                      knowledge world of work score
IQ                       IQ score
married                  =1 if married, 1976
libcrd14                 =1 if lib. card in home at 14
exper                    age - educ - 6
lwage                    log(wage)
expersq                  exper**2


Lists are used to hold the groups of variables in this large model, and missing values are dropped.

[16]:

dep = ["wage"]
endog = ["educ"]
exog = [
    "const",
    "exper",
    "expersq",
    "black",
    "smsa",
    "south",
    "smsa66",
    "reg662",
    "reg663",
    "reg664",
    "reg665",
    "reg666",
    "reg667",
    "reg668",
    "reg669",
]
instr = ["nearc4"]
data = data[dep + exog + endog + instr].dropna()

The first stage estimate shows a very large F-statistic. Note that when there are many exogenous variables the first-stage results cannot be directly interpreted as a measure of instrument strength. It is better to use the first_stage information from a 2SLS estimate.

[17]:

res = IV2SLS(data.educ, data[instr + exog], None, None).fit()
print(res)

                            OLS Estimation Summary
==============================================================================
Dep. Variable:                   educ   R-squared:                      0.4771
Estimator:                        OLS   Adj. R-squared:                 0.4745
No. Observations:                3010   F-statistic:                    3693.4
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:07   Distribution:                 chi2(15)
Cov. Estimator:                robust

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
nearc4         0.3199     0.0848     3.7702     0.0002      0.1536      0.4862
const          16.638     0.2148     77.456     0.0000      16.217      17.059
exper         -0.4125     0.0320    -12.896     0.0000     -0.4752     -0.3498
expersq        0.0009     0.0017     0.5100     0.6100     -0.0025      0.0042
black         -0.9355     0.0923    -10.138     0.0000     -1.1164     -0.7547
smsa           0.4022     0.1109     3.6255     0.0003      0.1848      0.6196
south         -0.0516     0.1416    -0.3645     0.7155     -0.3291      0.2259
smsa66         0.0255     0.1103     0.2309     0.8174     -0.1908      0.2417
reg662        -0.0786     0.1854    -0.4242     0.6714     -0.4420      0.2847
reg663        -0.0279     0.1789    -0.1562     0.8759     -0.3785      0.3226
reg664         0.1172     0.2070     0.5660     0.5714     -0.2886      0.5230
reg665        -0.2726     0.2237    -1.2186     0.2230     -0.7111      0.1659
reg666        -0.3028     0.2361    -1.2826     0.1996     -0.7656      0.1599
reg667        -0.2168     0.2389    -0.9077     0.3640     -0.6850      0.2513
reg668         0.5239     0.2562     2.0449     0.0409      0.0218      1.0260
reg669         0.2103     0.1988     1.0575     0.2903     -0.1794      0.6000
==============================================================================


The OLS estimate indicates an increase of about 7% in wages for each additional year of education.

[18]:

res = IV2SLS(np.log(data.wage), data[exog + endog], None, None).fit()
print(res)

                            OLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.2998
Estimator:                        OLS   Adj. R-squared:                 0.2963
No. Observations:                3010   F-statistic:                    1377.0
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:07   Distribution:                 chi2(15)
Cov. Estimator:                robust

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          4.6208     0.0740     62.417     0.0000      4.4757      4.7659
exper          0.0848     0.0067     12.592     0.0000      0.0716      0.0980
expersq       -0.0023     0.0003    -7.1798     0.0000     -0.0029     -0.0017
black         -0.1990     0.0181    -10.985     0.0000     -0.2345     -0.1635
smsa           0.1364     0.0192     7.1159     0.0000      0.0988      0.1739
south         -0.1480     0.0280    -5.2917     0.0000     -0.2028     -0.0932
smsa66         0.0262     0.0185     1.4153     0.1570     -0.0101      0.0626
reg662         0.0964     0.0350     2.7531     0.0059      0.0278      0.1650
reg663         0.1445     0.0337     4.2850     0.0000      0.0784      0.2107
reg664         0.0551     0.0411     1.3402     0.1802     -0.0255      0.1356
reg665         0.1280     0.0428     2.9912     0.0028      0.0441      0.2119
reg666         0.1405     0.0450     3.1223     0.0018      0.0523      0.2287
reg667         0.1180     0.0455     2.5934     0.0095      0.0288      0.2071
reg668        -0.0564     0.0505    -1.1183     0.2634     -0.1553      0.0425
reg669         0.1186     0.0387     3.0658     0.0022      0.0428      0.1944
educ           0.0747     0.0036     20.540     0.0000      0.0676      0.0818
==============================================================================


The 2SLS estimate of the effect is nearly double. However, there is some reason to be concerned about the strength of the instrument despite the F-statistic in the first stage regression.

[19]:

res_2sls = IV2SLS(np.log(data.wage), data[exog], data[endog], data[instr]).fit()
print(res_2sls)

                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.2382
Estimator:                    IV-2SLS   Adj. R-squared:                 0.2343
No. Observations:                3010   F-statistic:                    840.83
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:07   Distribution:                 chi2(15)
Cov. Estimator:                robust

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          3.6662     0.9085     4.0352     0.0001      1.8855      5.4468
exper          0.1083     0.0233     4.6376     0.0000      0.0625      0.1540
expersq       -0.0023     0.0003    -6.7128     0.0000     -0.0030     -0.0017
black         -0.1468     0.0524    -2.8031     0.0051     -0.2494     -0.0441
smsa           0.1118     0.0311     3.5995     0.0003      0.0509      0.1727
south         -0.1447     0.0291    -4.9775     0.0000     -0.2016     -0.0877
smsa66         0.0185     0.0205     0.9035     0.3663     -0.0217      0.0587
reg662         0.1008     0.0365     2.7644     0.0057      0.0293      0.1722
reg663         0.1483     0.0355     4.1760     0.0000      0.0787      0.2178
reg664         0.0499     0.0435     1.1471     0.2514     -0.0354      0.1352
reg665         0.1463     0.0491     2.9794     0.0029      0.0500      0.2425
reg666         0.1629     0.0516     3.1553     0.0016      0.0617      0.2641
reg667         0.1346     0.0504     2.6689     0.0076      0.0357      0.2334
reg668        -0.0831     0.0571    -1.4552     0.1456     -0.1950      0.0288
reg669         0.1078     0.0410     2.6317     0.0085      0.0275      0.1881
educ           0.1315     0.0540     2.4353     0.0149      0.0257      0.2373
==============================================================================

Endogenous: educ
Instruments: nearc4
Robust Covariance (Heteroskedastic)
Debiased: False


The property first_stage can be used to show a large set of first stage diagnostics. These results show a much lower partial $$R^2$$ that measures the unique effect of the instrument on the endogenous variable after controlling for the exogenous regressors. This is much smaller than the naive first stage $$R^2$$ of 47%. The partial F-statistic is also much smaller, although it is technically above the rule-of-thumb of 10 for a single instrument. The instrument is a dummy variable and being close to a 4 year college is only worth 0.3 years of education on average.
[20]:

print(res_2sls.first_stage)

       First Stage Estimation Results
======================================
                                  educ
--------------------------------------
R-squared                       0.4771
Partial R-squared               0.0044
Shea's R-squared                0.0044
Partial F-statistic             14.214
P-value (Partial F-stat)        0.0002
Partial F-stat Distn           chi2(1)
==========================  ==========
const                           16.638
                              (77.456)
exper                          -0.4125
                             (-12.896)
expersq                         0.0009
                              (0.5100)
black                          -0.9355
                             (-10.138)
smsa                            0.4022
                              (3.6255)
south                          -0.0516
                             (-0.3645)
smsa66                          0.0255
                              (0.2309)
reg662                         -0.0786
                             (-0.4242)
reg663                         -0.0279
                             (-0.1562)
reg664                          0.1172
                              (0.5660)
reg665                         -0.2726
                             (-1.2186)
reg666                         -0.3028
                             (-1.2826)
reg667                         -0.2168
                             (-0.9077)
reg668                          0.5239
                              (2.0449)
reg669                          0.2103
                              (1.0575)
nearc4                          0.3199
                              (3.7702)
--------------------------------------
T-stats reported in parentheses
T-stats use same covariance type as original model


## Formula interface¶

This model was large and so it might be simpler to use a formula. While formulas are discussed in detail in another notebook, they use the formula language of formulaic with an augmentation to specify the endogenous and instrumental variables. The generic form is

dependent ~ exog + [endog ~ instr]

where each block can contain multiple variables. Here the model is compared to the direct parameterization using DataFrames by differencing the $$R^2$$.

[21]:

import numpy as np

formula = (
    "np.log(wage) ~ 1 + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + "
    "reg665 + reg666 + reg667 + reg668 + reg669 + [educ ~ nearc4]"
)
mod = IV2SLS.from_formula(formula, data)
res_formula = mod.fit(cov_type="unadjusted")
print(res_formula.rsquared - res_2sls.rsquared)

0.0

## Categorical Variables¶

pandas categoricals are automatically treated as factors and expanded into dummies. The first category is always dropped. This next block constructs a categorical from the region dummies and then uses it instead of the individual dummies. The model is identical.

[22]:

data["reg"] = "661"  # The default region, which was omitted
for i in range(2, 10):
    region = "reg66" + str(i)
    data.loc[data[region] == 1, "reg"] = region[3:]
data["reg"] = data["reg"].astype("category")
data.describe()

res_cat = IV2SLS(
    np.log(data.wage),
    data[["const", "exper", "expersq", "black", "smsa", "south", "smsa66", "reg"]],
    data.educ,
    data.nearc4,
).fit()
print(res_cat)

                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                   wage   R-squared:                      0.2382
Estimator:                    IV-2SLS   Adj. R-squared:                 0.2343
No. Observations:                3010   F-statistic:                    840.83
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:07   Distribution:                 chi2(15)
Cov. Estimator:                robust

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          3.6662     0.9085     4.0352     0.0001      1.8855      5.4468
exper          0.1083     0.0233     4.6376     0.0000      0.0625      0.1540
expersq       -0.0023     0.0003    -6.7128     0.0000     -0.0030     -0.0017
black         -0.1468     0.0524    -2.8031     0.0051     -0.2494     -0.0441
smsa           0.1118     0.0311     3.5995     0.0003      0.0509      0.1727
south         -0.1447     0.0291    -4.9775     0.0000     -0.2016     -0.0877
smsa66         0.0185     0.0205     0.9035     0.3663     -0.0217      0.0587
reg.662        0.1008     0.0365     2.7644     0.0057      0.0293      0.1722
reg.663        0.1483     0.0355     4.1760     0.0000      0.0787      0.2178
reg.664        0.0499     0.0435     1.1471     0.2514     -0.0354      0.1352
reg.665        0.1463     0.0491     2.9794     0.0029      0.0500      0.2425
reg.666        0.1629     0.0516     3.1553     0.0016      0.0617      0.2641
reg.667        0.1346     0.0504     2.6689     0.0076      0.0357      0.2334
reg.668       -0.0831     0.0571    -1.4552     0.1456     -0.1950      0.0288
reg.669        0.1078     0.0410     2.6317     0.0085      0.0275      0.1881
educ           0.1315     0.0540     2.4353     0.0149      0.0257      0.2373
==============================================================================

Endogenous: educ
Instruments: nearc4
Robust Covariance (Heteroskedastic)
Debiased: False


## Post-estimation diagnostics¶

Common post-estimation diagnostics are to test the assumption of endogeneity and to examine whether the instruments are valid (when there are more instruments than endogenous variables).

[23]:

data = mroz.load()
data = data.dropna()
data = add_constant(data, has_constant="add")
data["lnwage"] = np.log(data.wage)

dep = "lnwage"
exog = ["const", "exper", "expersq"]
endog = ["educ"]
instr = ["fatheduc", "motheduc"]

The first step is to fit the model using 2SLS.

[24]:

res = IV2SLS(data[dep], data[exog], data[endog], data[instr]).fit(cov_type="unadjusted")
print(res)

                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                 lnwage   R-squared:                      0.1357
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1296
No. Observations:                 428   F-statistic:                    24.653
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:07   Distribution:                  chi2(3)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          0.0481     0.3985     0.1207     0.9039     -0.7329      0.8291
exper          0.0442     0.0134     3.3038     0.0010      0.0180      0.0704
expersq       -0.0009     0.0004    -2.2485     0.0245     -0.0017     -0.0001
educ           0.0614     0.0313     1.9622     0.0497   7.043e-05      0.1227
==============================================================================

Endogenous: educ
Instruments: fatheduc, motheduc
Unadjusted Covariance (Homoskedastic)
Debiased: False


Wooldridge's regression test of exogeneity uses the residuals from a regression of the endogenous variables on the exogenous variables and the instruments. If the endogenous variable is actually exogenous, these residuals should have no explanatory power in the original model.

[25]:

res.wooldridge_regression

[25]:

Wooldridge's regression test of exogeneity
H0: Endogenous variables are exogenous
Statistic: 2.8256
P-value: 0.0928
Distributed: chi2(1)
WaldTestStatistic, id: 0x7ff918d63cd0

This test can be easily implemented using two regressions. The first constructs the residuals and the second re-fits the model using 2SLS but also includes the residuals. Note that the p-value of the t-stat on the residuals is the same as the p-value of the previous test – this is not an accident.
[26]:

import pandas as pd

v = IV2SLS(data[endog], data[exog + instr], None, None).fit().resids
res_direct = IV2SLS(
    data[dep], pd.concat([v, data[exog]], axis=1), data[endog], data[instr]
).fit(cov_type="unadjusted")
print(res_direct)

                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                 lnwage   R-squared:                      0.1624
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1544
No. Observations:                 428   F-statistic:                    82.954
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:07   Distribution:                  chi2(4)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
residual       0.0582     0.0346     1.6810     0.0928     -0.0097      0.1260
const          0.0481     0.3923     0.1226     0.9024     -0.7207      0.8169
exper          0.0442     0.0132     3.3559     0.0008      0.0184      0.0700
expersq       -0.0009     0.0004    -2.2840     0.0224     -0.0017     -0.0001
educ           0.0614     0.0308     1.9932     0.0462      0.0010      0.1218
==============================================================================

Endogenous: educ
Instruments: fatheduc, motheduc
Unadjusted Covariance (Homoskedastic)
Debiased: False


Since this regression has two instruments it is possible to test for overidentification. Wooldridge's overidentification test uses a regression to test whether the 2SLS residuals are uncorrelated with the instruments, which should be the case if the model is correct and the instruments are not needed in the original model.

[27]:

res.wooldridge_overid

[27]:

Wooldridge's score test of overidentification
H0: Model is not overidentified.
Statistic: 0.4435
P-value: 0.5055
Distributed: chi2(1)
WaldTestStatistic, id: 0x7ff918d63e20

A naive version of this test can be implemented directly. This direct implementation is different from the formal test but would be consistent if the model was overidentified.

[28]:

u = res.resids
res = IV2SLS(u, data[["exper", "expersq"] + instr], None, None).fit()

The test statistic is $$n\times R^2$$ and has the same $$\chi^2_1$$ distribution. The test statistic is slightly smaller but the conclusion is the same.

[29]:

res.nobs * res.rsquared

[29]:

0.3719811106150277

Husband's education can be used as an additional instrument, and its validity tested.

[30]:

instr = ["fatheduc", "motheduc", "huseduc"]
res = IV2SLS(data[dep], data[exog], data[endog], data[instr]).fit(cov_type="unadjusted")

Testing overidentification does not indicate any difference from the previous result.

[31]:

res.wooldridge_overid

[31]:

Wooldridge's score test of overidentification
H0: Model is not overidentified.
Statistic: 1.0421
P-value: 0.5939
Distributed: chi2(2)
WaldTestStatistic, id: 0x7ff918d61840

Directly testing with the two-regression approach reaches the same conclusion.

[32]:

u = res.resids
res = IV2SLS(u, data[["exper", "expersq"] + instr], None, None).fit()
res.nobs * res.rsquared

[32]:

1.1126588681969989

## Panel IV¶

Instrumental variable regression can also be used with panel data. This example makes use of first differences to eliminate a firm-specific effect and then uses the receipt of a job training grant as an instrument for the change in hours of training per employee.

[33]:

from linearmodels.datasets import jobtraining

data = jobtraining.load()
print(jobtraining.DESCR)
data.head()
data = data.where(data.year.isin((1987, 1988)))
data = data.dropna(how="all", axis=0).sort_values(["fcode", "year"])
print(data.describe())
data = data.set_index("fcode")
data = data[["year", "hrsemp", "grant", "scrap", "lscrap"]]


H. Holzer, R. Block, M. Cheatham, and J. Knott (1993), "Are Training Subsidies
Effective? The Michigan Experience," Industrial and Labor Relations Review 46,
625-636.

year                     1987, 1988, or 1989
fcode                    firm code number
employ                   # employees at plant
sales                    annual sales, $
avgsal                   average employee salary
scrap                    scrap rate (per 100 items)
rework                   rework rate (per 100 items)
tothrs                   total hours training
union                    =1 if unionized
d89                      =1 if year = 1989
d88                      =1 if year = 1988
totrain                  total employees trained
hrsemp                   tothrs/totrain
lscrap                   log(scrap)
lemploy                  log(employ)
lsales                   log(sales)
lrework                  log(rework)
lhrsemp                  log(1 + hrsemp)
lscrap_1                 lagged lscrap; missing 1987
grant_1                  lagged grant; assumed 0 in 1987
clscrap                  lscrap - lscrap_1; year > 1987
cgrant                   grant - grant_1
clemploy                 lemploy - lemploy[t-1]
clsales                  lavgsal - lavgsal[t-1]
lavgsal                  log(avgsal)
clavgsal                 lavgsal - lavgsal[t-1]
cgrant_1                 cgrant[t-1]
chrsemp                  hrsemp - hrsemp[t-1]
clhrsemp                 lhrsemp - lhrsemp[t-1]

year          fcode      employ         sales        avgsal  \
count   314.000000     314.000000  290.000000  2.450000e+02    267.000000
mean   1987.500000  415708.885350   56.903448  5.679815e+06  18326.284644
std       0.500798    4025.063452   70.843101  6.977975e+06   6507.584305
min    1987.000000  410032.000000    4.000000  1.100000e+05   4237.000000
25%    1987.000000  410604.000000   15.000000  1.527000e+06  13664.000000
50%    1987.500000  418084.000000   27.500000  2.900000e+06  17160.000000
75%    1988.000000  419309.000000   70.000000  7.000000e+06  22000.000000
max    1988.000000  419486.000000  500.000000  4.700000e+07  40563.000000

scrap     rework      tothrs       union       grant  ...  \
count  108.000000  82.000000  276.000000  314.000000  314.000000  ...
mean     4.199722   3.883537   25.648551    0.197452    0.114650  ...
std      6.188094   6.145294   45.694834    0.398712    0.319107  ...
min      0.010000   0.000000    0.000000    0.000000    0.000000  ...
25%      0.767500   0.332500    0.000000    0.000000    0.000000  ...
50%      1.555000   1.500000    8.000000    0.000000    0.000000  ...
75%      5.000000   4.267500   39.250000    0.000000    0.000000  ...
max     30.000000  40.000000  290.000000    1.000000    1.000000  ...

grant_1    clscrap      cgrant    clemploy     clsales     lavgsal  \
count    314.0  54.000000  314.000000  144.000000  119.000000  267.000000
mean       0.0  -0.168993    0.114650    0.096876    0.128995    9.754753
std        0.0   0.589774    0.319107    0.251180    0.489422    0.353818
min        0.0  -2.502170    0.000000   -0.411099   -1.982874    8.351611
25%        0.0  -0.350983    0.000000   -0.031946    0.023286    9.522509
50%        0.0  -0.115118    0.000000    0.064538    0.116219    9.750337
75%        0.0   0.092537    0.000000    0.182321    0.215391    9.998797
max        0.0   2.397895    1.000000    1.673976    2.896701   10.610610

clavgsal  cgrant_1     chrsemp    clhrsemp
count  131.000000     157.0  125.000000  125.000000
mean     0.058601       0.0    7.423050    0.642580
std      0.074255       0.0   19.301290    1.235229
min     -0.080411       0.0  -47.753490   -2.338574
25%      0.020135       0.0    0.000000    0.000000
50%      0.057159       0.0    0.272727    0.039375
75%      0.086255       0.0   10.000000    1.063067
max      0.568912       0.0  107.027800    4.394449

[8 rows x 30 columns]

[34]:

deltas = data.loc[data.year == 1988] - data.loc[data.year == 1987]
deltas = add_constant(deltas, has_constant="add")
deltas = deltas.dropna()
print(deltas.describe())

       const  year     hrsemp      grant      scrap     lscrap
count   45.0  45.0  45.000000  45.000000  45.000000  45.000000
mean     1.0   1.0  10.812321   0.377778  -0.817556  -0.185697
std      0.0   0.0  20.523825   0.490310   2.496392   0.626858
min      1.0   1.0 -19.850180   0.000000 -10.000000  -2.502169
25%      1.0   1.0   0.000000   0.000000  -1.000000  -0.355820
50%      1.0   1.0   1.846154   0.000000  -0.110000  -0.167054
75%      1.0   1.0  15.333330   1.000000   0.090000   0.054067
max      1.0   1.0  80.000000   1.000000   5.000000   2.397895


The first stage indicates a relatively strong relationship between grant and the change in hours of training per employee (hrsemp). Note that grant is a dummy and so the coefficient is just the difference in means.

[35]:

mod = IV2SLS(deltas.hrsemp, deltas[["const", "grant"]], None, None)
print(mod.fit(cov_type="unadjusted"))

                            OLS Estimation Summary
==============================================================================
Dep. Variable:                 hrsemp   R-squared:                      0.3408
No. Observations:                  45   F-statistic:                    23.266
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0000
Time:                        17:55:07   Distribution:                  chi2(1)

Parameter Estimates
==============================================================================
Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          1.5806     3.1139     0.5076     0.6117     -4.5225      7.6837
grant          24.437     5.0662     4.8235     0.0000      14.507      34.367
==============================================================================
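
Since grant is a dummy variable, the first-stage slope can be verified directly as the difference in the mean change in hrsemp between grant recipients and non-recipients. This is a small sketch reusing the deltas DataFrame.

# The difference in group means reproduces the coefficient on grant (about 24.4)
group_means = deltas.groupby("grant")["hrsemp"].mean()
print(group_means.loc[1.0] - group_means.loc[0.0])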


Here a formula is used to specify the model since it is cleaner. Note that the [] contains the endogenous variables and instruments.

[36]:

mod = IV2SLS.from_formula("lscrap ~ 1 + [hrsemp ~ grant]", deltas)
res_iv = mod.fit(cov_type="unadjusted")
print(res_iv)

                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                 lscrap   R-squared:                      0.0159
No. Observations:                  45   F-statistic:                    3.3464
Date:                Fri, Jul 19 2024   P-value (F-stat)                0.0674
Time:                        17:55:07   Distribution:                  chi2(1)

Parameter Estimates
==============================================================================
Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept     -0.0327     0.1241    -0.2632     0.7924     -0.2759      0.2106
hrsemp        -0.0142     0.0077    -1.8293     0.0674     -0.0293      0.0010
==============================================================================

Endogenous: hrsemp
Instruments: grant
Debiased: False


The 2SLS estimate is nearly twice as large as the OLS estimate and is slightly more significant.

[37]:

res_ols = IV2SLS.from_formula("lscrap ~ 1 + hrsemp", deltas).fit(cov_type="unadjusted")
print(compare({"Panel OLS": res_ols, "Panel IV": res_iv}))

                 Model Comparison
==================================================
                          Panel OLS       Panel IV
--------------------------------------------------
Dep. Variable                lscrap         lscrap
Estimator                       OLS        IV-2SLS
No. Observations                 45             45
R-squared                    0.0619         0.0159
F-statistic                  2.9707         3.3464
P-value (F-stat)             0.0848         0.0674
==================     ============   ============
Intercept                   -0.1035        -0.0327
                          (-1.0208)      (-0.2632)
hrsemp                      -0.0076        -0.0142
                          (-1.7236)      (-1.8293)
==================== ============== ==============
Instruments                                  grant
--------------------------------------------------

T-stats reported in parentheses

The predict method produces predictions given exogenous and endogenous values. Here a column of ones (the constant) and the observed hrsemp values are supplied, which reproduces the in-sample fitted values.

[38]:

mod = IV2SLS.from_formula("lscrap ~ 1 + [hrsemp ~ grant]", deltas)
n = deltas.shape[0]
pred_exog = pd.DataFrame(np.ones((n, 1)), index=deltas.index)
res_iv.predict(exog=pred_exog, endog=deltas[["hrsemp"]])

[38]:

predictions
fcode
410523.0 -0.016018
410563.0 -0.032668
410565.0 -0.032668
410566.0 0.037845
410567.0 -0.032668
410577.0 -0.040453
410593.0 -0.032668
410596.0 -0.060975
410606.0 -0.010335
410626.0 -0.011102
418011.0 -0.154386
418021.0 -0.119765
418035.0 -0.293087
418045.0 -0.089281
418054.0 -0.058797
418065.0 -0.202584
418076.0 -0.227847
418091.0 -0.193340
418097.0 -0.067874
418107.0 -1.065849
418118.0 -0.342269
418125.0 -0.339713
418140.0 -1.164922
418163.0 -0.598795
418168.0 -0.032668
418177.0 -0.400651
418237.0 -0.117587
419198.0 -0.998414
419201.0 -0.152554
419242.0 -0.469395
419268.0 -0.032668
419272.0 0.084601
419289.0 -0.032668
419297.0 0.248274
419307.0 -0.036528
419339.0 -0.032668
419343.0 -0.042576
419357.0 -0.386498
419378.0 -0.434747
419381.0 0.027988
419388.0 -0.049180
419409.0 -0.249684
419459.0 -0.033183
419482.0 -0.032668
419483.0 -0.032668