Using formulas to specify models

Basic Usage

Formulas provide an alternative method to specify a model. The formulas used here are parsed by formulaic (documentation) and are similar to those in statsmodels, although they use an enhanced syntax that allows endogenous regressors to be identified. The basic formula syntax for a single-variable regression is

y ~ 1 + x

where the 1 indicates that a constant should be included and x is the regressor. In the context of an instrumental variables model, it is necessary to mark variables as endogenous and to provide a list of instruments that are included only in the model for the endogenous variables. In a basic single regressor model, this would be specified using [] to surround an inner model.

y ~ 1 + [x ~ z]

In this expression, x is now marked as endogenous and z is an instrument. All exogenous variables are automatically used when instrumenting x, so there is no need to repeat them inside the brackets (in this example, the “first stage” would include a constant and z).
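To make the role of the instrument concrete, the following is a minimal sketch of two-stage least squares computed by hand for the single-regressor case, using only the standard library. The data and the helper simple_ols are hypothetical and constructed for illustration: the error u is built into both x and y, so x is endogenous, while z is uncorrelated with u. It is not how linearmodels computes its estimates internally.

```python
from statistics import fmean


def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on a constant and x."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical toy data: z is the instrument, u is an unobserved error.
z = [-1.0, 1.0, -1.0, 1.0]
u = [-1.0, -1.0, 1.0, 1.0]  # orthogonal to z by construction
x = [zi + ui for zi, ui in zip(z, u)]  # x depends on u, so it is endogenous
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]  # true slope on x is 2

# First stage: regress x on a constant and z, then form fitted values.
a0, a1 = simple_ols(z, x)
x_hat = [a0 + a1 * zi for zi in z]

# Second stage: regress y on a constant and the fitted values.
iv_const, iv_slope = simple_ols(x_hat, y)

# Naive OLS of y on x for comparison.
ols_const, ols_slope = simple_ols(x, y)

print("2SLS slope:", iv_slope)
print("OLS slope: ", ols_slope)
```

In this constructed data set the two-stage slope equals the true value of 2, while the OLS slope is pushed away from it because x and the error move together.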

Multiple Endogenous Variables

Multiple endogenous variables are specified in a similar manner. The basic concept is that any model can be expressed as

dep ~ exog + [ endog ~ instruments]

and it must be the case that

dep ~ exog + endog

and

dep ~ exog + instruments

are valid formulaic formulas. This means that multiple endogenous regressors or instruments should be joined with +, but that the first endogenous variable and the first instrument should not have a leading +. A simple example with 2 endogenous variables and 3 instruments would be

y ~ 1 + x1 + x2 + x3  + [ x4 + x5 ~ z1 + z2 + z3]

In this example, the “submodels” y ~ 1 + x1 + x2 + x3 + x4 + x5 and y ~ 1 + x1 + x2 + x3 + z1 + z2 + z3 are both valid formulaic expressions.
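The rule about the two submodels can be sketched with a small string-manipulation helper. The function split_iv_formula below is hypothetical and written only for illustration; it is not the parser used by linearmodels, and it assumes a single bracketed block placed at the end of the formula.

```python
import re


def split_iv_formula(formula):
    """Split 'dep ~ exog + [endog ~ instr]' into its two submodel strings."""
    match = re.search(r"\[([^\]~]+)~([^\]]+)\]", formula)
    if match is None:
        raise ValueError("no [endog ~ instruments] block found")
    # Normalize whitespace inside each side of the inner model.
    endog, instruments = (" ".join(part.split()) for part in match.groups())
    # Text before the block, minus the '+' that joins the block to the
    # exogenous terms.
    outer = formula[: match.start()].rstrip().rstrip("+")
    outer = " ".join(outer.split())
    return f"{outer} + {endog}", f"{outer} + {instruments}"


formula = "y ~ 1 + x1 + x2 + x3  + [ x4 + x5 ~ z1 + z2 + z3]"
with_endog, with_instr = split_iv_formula(formula)
print(with_endog)  # y ~ 1 + x1 + x2 + x3 + x4 + x5
print(with_instr)  # y ~ 1 + x1 + x2 + x3 + z1 + z2 + z3
```

If either printed string would not be accepted as an ordinary formula, for example because the inner model starts with a leading +, the bracketed formula is not valid either.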

Standard formulaic

Aside from this change, the standard rules of formulaic apply, and so it is possible to use mathematical expressions or other formulaic-specific features. See the formulaic quickstart for some examples of what is possible.

MEPS data

This example shows the use of formulas to estimate both IV and OLS models using the Medical Expenditure Panel Survey (MEPS). The model measures the effect of various characteristics on the log of drug expenditure and instruments the variable that measures whether a subject was insured through an employer or union with their social security to income ratio.

This first block imports the data and the IV2SLS estimator, drops observations with missing values, and prints the variable descriptions.

[1]:
from linearmodels.datasets import meps
from linearmodels.iv import IV2SLS

data = meps.load()
data = data.dropna()
print(meps.DESCR)

age               Age
age2              Age-squared
black             Black
blhisp            Black or Hispanic
drugexp           Presc-drugs expense
educyr            Years of education
fair              Fair health
female            Female
firmsz            Firm size
fph               fair or poor health
good              Good health
hi_empunion       Insured thro emp/union
hisp              Hiapanic
income            Income
ldrugexp          log(drugexp)
linc              log(income)
lowincome         Low income
marry             Married
midincome         Middle income
msa               Metropolitan stat area
multlc            Multiple locations
poor              Poor health
poverty           Poor
priolist          Priority list cond
private           Private insurance
ssiratio          SSI/Income ratio
totchr            Total chronic cond
vegood            V-good health
vgh               vg or good health

Estimating a model with a formula

This model uses a formula which is input using the from_formula interface. Unlike direct initialization, this interface takes the formula and a DataFrame containing the data necessary to evaluate the formula.

[2]:
formula = (
    "ldrugexp ~ 1 + totchr + female + age + linc + blhisp + [hi_empunion ~ ssiratio]"
)
mod = IV2SLS.from_formula(formula, data)
[3]:
iv_res = mod.fit(cov_type="robust")
print(iv_res)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:               ldrugexp   R-squared:                      0.0640
Estimator:                    IV-2SLS   Adj. R-squared:                 0.0634
No. Observations:               10089   F-statistic:                    2000.9
Date:                Sun, Oct 20 2024   P-value (F-stat)                0.0000
Time:                        22:07:58   Distribution:                  chi2(6)
Cov. Estimator:                robust

                              Parameter Estimates
===============================================================================
             Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------
Intercept       6.7872     0.2688     25.246     0.0000      6.2602      7.3141
totchr          0.4503     0.0102     44.157     0.0000      0.4303      0.4703
female         -0.0204     0.0326    -0.6257     0.5315     -0.0843      0.0435
age            -0.0132     0.0030    -4.4092     0.0000     -0.0191     -0.0073
linc            0.0870     0.0226     3.8436     0.0001      0.0426      0.1314
blhisp         -0.2174     0.0395    -5.5052     0.0000     -0.2948     -0.1400
hi_empunion    -0.8976     0.2211    -4.0592     0.0000     -1.3310     -0.4642
===============================================================================

Endogenous: hi_empunion
Instruments: ssiratio
Robust Covariance (Heteroskedastic)
Debiased: False

Mathematical expressions in formulas

Standard formulaic syntax, such as using mathematical expressions, can be readily used.

[4]:
formula = (
    "np.log(drugexp) ~ 1 + totchr + age + linc + blhisp + [hi_empunion ~ ssiratio]"
)
mod = IV2SLS.from_formula(formula, data)
iv_res2 = mod.fit(cov_type="robust")

OLS

Omitting the block that marks a variable as endogenous will produce OLS – just like using None for both endog and instruments.

[5]:
formula = "ldrugexp ~ 1 + totchr + female + age + linc + blhisp + hi_empunion"
ols = IV2SLS.from_formula(formula, data)
ols_res = ols.fit(cov_type="robust")
print(ols_res)
                            OLS Estimation Summary
==============================================================================
Dep. Variable:               ldrugexp   R-squared:                      0.1770
Estimator:                        OLS   Adj. R-squared:                 0.1765
No. Observations:               10089   F-statistic:                    2262.6
Date:                Sun, Oct 20 2024   P-value (F-stat)                0.0000
Time:                        22:07:58   Distribution:                  chi2(6)
Cov. Estimator:                robust

                              Parameter Estimates
===============================================================================
             Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------
Intercept       5.8611     0.1570     37.320     0.0000      5.5533      6.1689
totchr          0.4404     0.0094     47.049     0.0000      0.4220      0.4587
female          0.0578     0.0254     2.2797     0.0226      0.0081      0.1075
age            -0.0035     0.0019    -1.8228     0.0683     -0.0073      0.0003
linc            0.0105     0.0137     0.7646     0.4445     -0.0164      0.0373
blhisp         -0.1513     0.0341    -4.4353     0.0000     -0.2182     -0.0844
hi_empunion     0.0739     0.0260     2.8441     0.0045      0.0230      0.1248
===============================================================================

Comparing results

The function compare can be used to compare the results of multiple models. Here dropping female from the IV regression slightly raises the \(R^2\), from 0.0640 to 0.0659.

[6]:
from linearmodels.iv import compare

print(compare({"IV": iv_res, "OLS": ols_res, "IV-formula": iv_res2}))
                          Model Comparison
====================================================================
                                IV           OLS          IV-formula
--------------------------------------------------------------------
Dep. Variable             ldrugexp      ldrugexp     np.log(drugexp)
Estimator                  IV-2SLS           OLS             IV-2SLS
No. Observations             10089         10089               10089
Cov. Est.                   robust        robust              robust
R-squared                   0.0640        0.1770              0.0659
Adj. R-squared              0.0634        0.1765              0.0654
F-statistic                 2000.9        2262.6              2004.3
P-value (F-stat)            0.0000        0.0000              0.0000
==================     ===========   ===========   =================
Intercept                   6.7872        5.8611              6.7709
                          (25.246)      (37.320)            (26.353)
totchr                      0.4503        0.4404              0.4498
                          (44.157)      (47.049)            (44.453)
female                     -0.0204        0.0578
                         (-0.6257)      (2.2797)
age                        -0.0132       -0.0035             -0.0132
                         (-4.4092)     (-1.8228)           (-4.4237)
linc                        0.0870        0.0105              0.0873
                          (3.8436)      (0.7646)            (3.8348)
blhisp                     -0.2174       -0.1513             -0.2168
                         (-5.5052)     (-4.4353)           (-5.5252)
hi_empunion                -0.8976        0.0739             -0.8892
                         (-4.0592)      (2.8441)           (-4.1653)
==================== ============= ============= ===================
Instruments               ssiratio                          ssiratio
--------------------------------------------------------------------

T-stats reported in parentheses