Using formulas to specify models¶
Basic Usage¶
Formulas provide an alternative method to specify a model. The formulas used here utilize formulaic (documentation) are similar to those in statsmodels, although they use an enhanced syntax to allow identification of endogenous regressors. The basis formula syntax for a single variable regression would be
y ~ 1 + x
where the 1
indicates that a constant should be included and x
is the regressor. In the context of an instrumental variables model, it is necessary to mark variables as endogenous and to provide a list of instruments that are included only in the model for the endogenous variables. In a basic single regressor model, this would be specified using []
to surround an inner model.
y ~ 1 + [x ~ z]
In this expression, x
is now marked as endogenous and z
is an instrument. Any exogenous variable will automatically be used when instrumenting x
so there is no need to repeat these here (in this example, the “first stage” would include a constant and z).
Multiple Endogenous Variables¶
Multiple endogenous variables are specified in a similar manner. The basic concept is that any model can be expressed as
dep ~ exog + [ endog ~ instruments]
and it must be the case that
dep ~ exog + endog
and
dep ~ exog + instruments
are valid formulaic formulas. This means that multiple endogenous regressors or instruments should be joined with +
, but that the first endogenous or first instrument should not have a leading +
. A simple example with 2 endogenous variables and 3 instruments would be
y ~ 1 + x1 + x2 + x3 + [ x4 + x5 ~ z1 + z2 + z3]
In this example, the “submodels” y ~ 1 + x1 + x2 +x3 + x4 + x5
and y ~ 1 + x1 + x2 + x3 + z1 + z2 +z3
are both valid formulaic expressions.
Standard formulaic¶
Aside from this change, the standard rules of formulaic apply, and so it is possible to use mathematical expression or other formulaic-specific features. See the formulaic quickstart for some examples of what is possible.
MEPS data¶
This example shows the use of formulas to estimate both IV and OLS models using the medical expenditure panel survey. The model measures the effect of various characteristics on the log of drug expenditure and instruments the variable that measures where a subject was insured through a union with their social security to income ratio.
This first block imports the data and numpy.
[1]:
from linearmodels.datasets import meps
from linearmodels.iv import IV2SLS
data = meps.load()
data = data.dropna()
print(meps.DESCR)
age Age
age2 Age-squared
black Black
blhisp Black or Hispanic
drugexp Presc-drugs expense
educyr Years of education
fair Fair health
female Female
firmsz Firm size
fph fair or poor health
good Good health
hi_empunion Insured thro emp/union
hisp Hiapanic
income Income
ldrugexp log(drugexp)
linc log(income)
lowincome Low income
marry Married
midincome Middle income
msa Metropolitan stat area
multlc Multiple locations
poor Poor health
poverty Poor
priolist Priority list cond
private Private insurance
ssiratio SSI/Income ratio
totchr Total chronic cond
vegood V-good health
vgh vg or good health
Estimating a model with a formula¶
This model uses a formula which is input using the from_formula
interface. Unlike direct initialization, this interface takes the formula and a DataFrame containing the data necessary to evaluate the formula.
[2]:
formula = (
"ldrugexp ~ 1 + totchr + female + age + linc + blhisp + [hi_empunion ~ ssiratio]"
)
mod = IV2SLS.from_formula(formula, data)
[3]:
iv_res = mod.fit(cov_type="robust")
print(iv_res)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: ldrugexp R-squared: 0.0640
Estimator: IV-2SLS Adj. R-squared: 0.0634
No. Observations: 10089 F-statistic: 2000.9
Date: Sun, Oct 20 2024 P-value (F-stat) 0.0000
Time: 22:07:58 Distribution: chi2(6)
Cov. Estimator: robust
Parameter Estimates
===============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-------------------------------------------------------------------------------
Intercept 6.7872 0.2688 25.246 0.0000 6.2602 7.3141
totchr 0.4503 0.0102 44.157 0.0000 0.4303 0.4703
female -0.0204 0.0326 -0.6257 0.5315 -0.0843 0.0435
age -0.0132 0.0030 -4.4092 0.0000 -0.0191 -0.0073
linc 0.0870 0.0226 3.8436 0.0001 0.0426 0.1314
blhisp -0.2174 0.0395 -5.5052 0.0000 -0.2948 -0.1400
hi_empunion -0.8976 0.2211 -4.0592 0.0000 -1.3310 -0.4642
===============================================================================
Endogenous: hi_empunion
Instruments: ssiratio
Robust Covariance (Heteroskedastic)
Debiased: False
Mathematical expression in formulas¶
Standard formulaic syntax, such as using mathematical expressions, can be readily used.
[4]:
formula = (
"np.log(drugexp) ~ 1 + totchr + age + linc + blhisp + [hi_empunion ~ ssiratio]"
)
mod = IV2SLS.from_formula(formula, data)
iv_res2 = mod.fit(cov_type="robust")
OLS¶
Omitting the block that marks a variable as endogenous will produce OLS – just like using None
for both endog
and instruments
.
[5]:
formula = "ldrugexp ~ 1 + totchr + female + age + linc + blhisp + hi_empunion"
ols = IV2SLS.from_formula(formula, data)
ols_res = ols.fit(cov_type="robust")
print(ols_res)
OLS Estimation Summary
==============================================================================
Dep. Variable: ldrugexp R-squared: 0.1770
Estimator: OLS Adj. R-squared: 0.1765
No. Observations: 10089 F-statistic: 2262.6
Date: Sun, Oct 20 2024 P-value (F-stat) 0.0000
Time: 22:07:58 Distribution: chi2(6)
Cov. Estimator: robust
Parameter Estimates
===============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-------------------------------------------------------------------------------
Intercept 5.8611 0.1570 37.320 0.0000 5.5533 6.1689
totchr 0.4404 0.0094 47.049 0.0000 0.4220 0.4587
female 0.0578 0.0254 2.2797 0.0226 0.0081 0.1075
age -0.0035 0.0019 -1.8228 0.0683 -0.0073 0.0003
linc 0.0105 0.0137 0.7646 0.4445 -0.0164 0.0373
blhisp -0.1513 0.0341 -4.4353 0.0000 -0.2182 -0.0844
hi_empunion 0.0739 0.0260 2.8441 0.0045 0.0230 0.1248
===============================================================================
Comparing results¶
The function compare
can be used to compare the result of multiple models. Here dropping female
from the IV regression improves the \(R^2\).
[6]:
from linearmodels.iv import compare
print(compare({"IV": iv_res, "OLS": ols_res, "IV-formula": iv_res2}))
Model Comparison
====================================================================
IV OLS IV-formula
--------------------------------------------------------------------
Dep. Variable ldrugexp ldrugexp np.log(drugexp)
Estimator IV-2SLS OLS IV-2SLS
No. Observations 10089 10089 10089
Cov. Est. robust robust robust
R-squared 0.0640 0.1770 0.0659
Adj. R-squared 0.0634 0.1765 0.0654
F-statistic 2000.9 2262.6 2004.3
P-value (F-stat) 0.0000 0.0000 0.0000
================== =========== =========== =================
Intercept 6.7872 5.8611 6.7709
(25.246) (37.320) (26.353)
totchr 0.4503 0.4404 0.4498
(44.157) (47.049) (44.453)
female -0.0204 0.0578
(-0.6257) (2.2797)
age -0.0132 -0.0035 -0.0132
(-4.4092) (-1.8228) (-4.4237)
linc 0.0870 0.0105 0.0873
(3.8436) (0.7646) (3.8348)
blhisp -0.2174 -0.1513 -0.2168
(-5.5052) (-4.4353) (-5.5252)
hi_empunion -0.8976 0.0739 -0.8892
(-4.0592) (2.8441) (-4.1653)
==================== ============= ============= ===================
Instruments ssiratio ssiratio
--------------------------------------------------------------------
T-stats reported in parentheses