Using formulas to specify models

Models can be specified with formulas using mostly standard formulaic syntax. Because specifying a system is more involved than specifying a single model, two methods are available:

  • Dictionary of formulas

  • Single formula separated using {}

These examples use data on fringe benefits from F. Vella (1993), “A Simple Estimator for Simultaneous Models with Censored Endogenous Regressors”, which appears in Wooldridge (2002). The model consists of two equations, one for hourly wage and the other for hourly benefits. The two equations share several regressors (educ, exper, expersq, nrtheast and male); union, south and nrthcen appear only in the benefits equation, and married appears only in the earnings equation.

[1]:
import numpy as np
import pandas as pd
from linearmodels.datasets import fringe

data = fringe.load()

Dictionary

The dictionary syntax is virtually identical to standard formulaic syntax: each equation is specified as a key-value pair in which the key is the equation label and the value is the formula. Using an OrderedDict is recommended because it preserves equation order in the results (on Python 3.7 and later, a plain dict also preserves insertion order). Keys must be strings.

[2]:
from collections import OrderedDict

formula = OrderedDict()
formula["benefits"] = (
    "hrbens ~ educ + exper + expersq + union + south + nrtheast + nrthcen + male"
)
formula["earnings"] = "hrearn ~ educ + exper + expersq + nrtheast + married + male"
[3]:
from linearmodels.system import SUR

mod = SUR.from_formula(formula, data)
print(mod.fit(cov_type="unadjusted"))
                           System GLS Estimation Summary
===================================================================================
Estimator:                        GLS   Overall R-squared:                   0.6951
No. Equations.:                     2   McElroy's R-squared:                 0.2197
No. Observations:                 616   Judge's (OLS) R-squared:             0.1873
Date:                Fri, Jul 19 2024   Berndt's R-squared:                  0.3775
Time:                        17:54:35   Dhrymes's R-squared:                 0.6950
                                        Cov. Estimator:                  unadjusted
                                        Num. Constraints:                      None
                Equation: benefits, Dependent Variable: hrbens
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.0346     0.0050     6.8605     0.0000      0.0247      0.0445
exper          0.0350     0.0063     5.5974     0.0000      0.0228      0.0473
expersq       -0.0006     0.0001    -4.4326     0.0000     -0.0009     -0.0003
union          0.3682     0.0474     7.7659     0.0000      0.2753      0.4612
south         -0.1775     0.0586    -3.0289     0.0025     -0.2923     -0.0626
nrtheast      -0.1224     0.0714    -1.7132     0.0867     -0.2624      0.0176
nrthcen       -0.1433     0.0621    -2.3074     0.0210     -0.2650     -0.0216
male           0.2518     0.0490     5.1435     0.0000      0.1559      0.3478
                Equation: earnings, Dependent Variable: hrearn
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.3547     0.0357     9.9280     0.0000      0.2847      0.4247
exper         -0.0750     0.0492    -1.5260     0.1270     -0.1714      0.0213
expersq        0.0038     0.0011     3.4600     0.0005      0.0016      0.0059
nrtheast      -0.7127     0.4473    -1.5934     0.1111     -1.5894      0.1640
married        0.4472     0.3920     1.1410     0.2539     -0.3210      1.2155
male           1.8899     0.3904     4.8408     0.0000      1.1247      2.6551
==============================================================================

Covariance Estimator:
Homoskedastic (Unadjusted) Covariance (Debiased: False, GLS: True)
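
Because the dictionary form is ordinary Python, the formula strings can also be assembled programmatically. The following is a minimal sketch under stated assumptions: the helper build_formula and the names common and generated are hypothetical and not part of linearmodels. It only rebuilds the two formula strings used above.

```python
from collections import OrderedDict


def build_formula(dep, regressors):
    """Join a dependent variable and its regressors into a formula string.

    Hypothetical helper for illustration; not a linearmodels function.
    """
    return f"{dep} ~ " + " + ".join(regressors)


# Regressors that open both equations, in the order used above
common = ["educ", "exper", "expersq"]

generated = OrderedDict()
generated["benefits"] = build_formula(
    "hrbens", common + ["union", "south", "nrtheast", "nrthcen", "male"]
)
generated["earnings"] = build_formula(
    "hrearn", common + ["nrtheast", "married", "male"]
)

for label, eq in generated.items():
    print(f"{label}: {eq}")
```

The resulting mapping has the same keys and values as the hand-written dictionary, so it could be passed to SUR.from_formula in the same way.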

Curly Braces

The same system can be expressed as a single string by surrounding each equation with braces {}.

[4]:
braces_formula = """
{hrbens ~ educ + exper + expersq + union + south + nrtheast + nrthcen + male}
{hrearn ~ educ + exper + expersq + nrtheast + married + male}
"""
braces_mod = SUR.from_formula(braces_formula, data)
braces_res = braces_mod.fit(cov_type="unadjusted")
print(braces_res)
                           System GLS Estimation Summary
===================================================================================
Estimator:                        GLS   Overall R-squared:                   0.6951
No. Equations.:                     2   McElroy's R-squared:                 0.2197
No. Observations:                 616   Judge's (OLS) R-squared:             0.1873
Date:                Fri, Jul 19 2024   Berndt's R-squared:                  0.3775
Time:                        17:54:35   Dhrymes's R-squared:                 0.6950
                                        Cov. Estimator:                  unadjusted
                                        Num. Constraints:                      None
                 Equation: hrbens, Dependent Variable: hrbens
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.0346     0.0050     6.8605     0.0000      0.0247      0.0445
exper          0.0350     0.0063     5.5974     0.0000      0.0228      0.0473
expersq       -0.0006     0.0001    -4.4326     0.0000     -0.0009     -0.0003
union          0.3682     0.0474     7.7659     0.0000      0.2753      0.4612
south         -0.1775     0.0586    -3.0289     0.0025     -0.2923     -0.0626
nrtheast      -0.1224     0.0714    -1.7132     0.0867     -0.2624      0.0176
nrthcen       -0.1433     0.0621    -2.3074     0.0210     -0.2650     -0.0216
male           0.2518     0.0490     5.1435     0.0000      0.1559      0.3478
                 Equation: hrearn, Dependent Variable: hrearn
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.3547     0.0357     9.9280     0.0000      0.2847      0.4247
exper         -0.0750     0.0492    -1.5260     0.1270     -0.1714      0.0213
expersq        0.0038     0.0011     3.4600     0.0005      0.0016      0.0059
nrtheast      -0.7127     0.4473    -1.5934     0.1111     -1.5894      0.1640
married        0.4472     0.3920     1.1410     0.2539     -0.3210      1.2155
male           1.8899     0.3904     4.8408     0.0000      1.1247      2.6551
==============================================================================

Covariance Estimator:
Homoskedastic (Unadjusted) Covariance (Debiased: False, GLS: True)

Labeled Formulas

When using the curly brace formula specification, the equation names are determined by the dependent variable names. When names are repeated, as can happen in some datasets (e.g. a SUR on the GDP of multiple countries), the equation labels are modified until they are unique. This can produce meaningless equation labels, so an equation label can be passed explicitly using the syntax

{label : dep ~ exog}
[5]:
labeled_formula = """
{benefits: hrbens ~ educ + exper + expersq + union + south + nrtheast + nrthcen + male}
{earnings: hrearn ~ educ + exper + expersq + nrtheast + married + male}
"""
labels_mod = SUR.from_formula(labeled_formula, data)
labeled_res = labels_mod.fit(cov_type="unadjusted")

print("Unlabeled")
print(braces_res.equation_labels)
print("Labeled")
print(labeled_res.equation_labels)
Unlabeled
['hrbens', 'hrearn']
Labeled
['benefits', 'earnings']
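
The labeled brace syntax and the dictionary syntax carry the same information, so one can be generated from the other. The sketch below is illustrative only: to_braces is a hypothetical helper, not a linearmodels function. It renders a label-to-formula mapping as a labeled brace string that is intended to match labeled_formula above, apart from the leading and trailing newlines.

```python
def to_braces(equations):
    """Render {label: formula} pairs as one brace-delimited system formula.

    Hypothetical helper for illustration; not a linearmodels function.
    """
    # Doubled braces in an f-string produce literal "{" and "}"
    return "\n".join(f"{{{label}: {eq}}}" for label, eq in equations.items())


equations = {
    "benefits": "hrbens ~ educ + exper + expersq + union + south + nrtheast + nrthcen + male",
    "earnings": "hrearn ~ educ + exper + expersq + nrtheast + married + male",
}

generated_formula = to_braces(equations)
print(generated_formula)
```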

Other Options

Estimation Weights

SUR supports weights which are assumed to be proportional to the inverse variance of the data so that

\[V(y_i \times \sqrt{w_i}) = \sigma^2 \,\,\forall i.\]

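Weights can be passed using a DataFrame in which each column contains the weights for one equation. In the example below the columns are named after the equation labels.

Here the results are printed to show that the estimates differ from those of the unweighted GLS model. The weights are drawn at random without a fixed seed, so rerunning the cell produces different numbers.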

[6]:
random_weights = np.random.chisquare(5, size=(616, 2))
random_weights = pd.DataFrame(random_weights, columns=["benefits", "earnings"])
weighted_mod = SUR.from_formula(formula, data, weights=random_weights)
print(weighted_mod.fit())
                           System GLS Estimation Summary
===================================================================================
Estimator:                        GLS   Overall R-squared:                   0.7049
No. Equations.:                     2   McElroy's R-squared:                 0.1779
No. Observations:                 616   Judge's (OLS) R-squared:             0.0647
Date:                Fri, Jul 19 2024   Berndt's R-squared:                  0.1331
Time:                        17:54:35   Dhrymes's R-squared:                 0.7049
                                        Cov. Estimator:                      robust
                                        Num. Constraints:                      None
                Equation: benefits, Dependent Variable: hrbens
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.0411     0.0049     8.4678     0.0000      0.0316      0.0507
exper          0.0334     0.0055     6.0259     0.0000      0.0226      0.0443
expersq       -0.0006     0.0001    -4.4290     0.0000     -0.0008     -0.0003
union          0.3810     0.0524     7.2738     0.0000      0.2783      0.4837
south         -0.2692     0.0560    -4.8101     0.0000     -0.3789     -0.1595
nrtheast      -0.2606     0.0655    -3.9765     0.0001     -0.3891     -0.1322
nrthcen       -0.2197     0.0617    -3.5637     0.0004     -0.3406     -0.0989
male           0.2704     0.0467     5.7934     0.0000      0.1789      0.3619
                Equation: earnings, Dependent Variable: hrearn
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.4996     0.1173     4.2613     0.0000      0.2698      0.7295
exper         -0.4112     0.2574    -1.5975     0.1101     -0.9157      0.0933
expersq        0.0129     0.0069     1.8642     0.0623     -0.0007      0.0265
nrtheast      -1.5760     0.6630    -2.3771     0.0175     -2.8755     -0.2766
married        0.6833     0.4094     1.6691     0.0951     -0.1191      1.4857
male           2.1146     0.2482     8.5198     0.0000      1.6281      2.6010
==============================================================================

Covariance Estimator:
Heteroskedastic (Robust) Covariance (Debiased: False, GLS: True)
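
Since the weights in this example are random, a seeded generator makes the weighted results repeatable. This is a minimal sketch, assuming NumPy's Generator API; the seed value and the names rng, nobs, labels and seeded_weights are illustrative choices. The numbers it produces will not match the output printed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (arbitrary choice) for repeatable draws
nobs = 616  # number of observations reported in the summaries above
labels = ["benefits", "earnings"]  # one column per equation label

# Chi-square draws are strictly positive, as weights need to be
seeded_weights = pd.DataFrame(
    rng.chisquare(5, size=(nobs, len(labels))), columns=labels
)
print(seeded_weights.head())
```

The resulting DataFrame could then be passed as weights=seeded_weights in place of random_weights.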

Prespecified Residual Covariance

As with a SUR constructed without formulas, a prespecified residual covariance can be passed for use in the GLS step. This is done with the keyword argument sigma of the from_formula method; it otherwise behaves the same as passing sigma to the standard SUR.
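
A sketch of how a prespecified covariance might be built and passed follows. The standard deviations and correlation are made-up values chosen only to give a valid 2 by 2 covariance matrix, and fit_with_sigma is a hypothetical wrapper. The import of linearmodels is kept inside the function so that the covariance construction can be run on its own.

```python
import numpy as np

# Assumed (made-up) residual standard deviations and correlation
sd_benefits, sd_earnings, corr = 0.5, 4.0, 0.3

# One row and column per equation, in equation order
sigma = np.array(
    [
        [sd_benefits**2, corr * sd_benefits * sd_earnings],
        [corr * sd_benefits * sd_earnings, sd_earnings**2],
    ]
)

# A covariance matrix must be symmetric and positive definite
assert np.allclose(sigma, sigma.T)
assert np.all(np.linalg.eigvalsh(sigma) > 0)


def fit_with_sigma(formula, data, sigma):
    """Fit a SUR from a formula using a fixed residual covariance.

    Hypothetical wrapper; the sigma keyword is the one described above.
    """
    from linearmodels.system import SUR  # local import keeps the sketch loadable

    mod = SUR.from_formula(formula, data, sigma=sigma)
    return mod.fit(cov_type="unadjusted")
```

With the objects defined earlier, fit_with_sigma(formula, data, sigma) would fit the two-equation system using this covariance in the GLS step.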