Three-stage Least Squares (3SLS)

This example demonstrates how a system of simultaneous equations can be jointly estimated using three-stage least squares (3SLS). The simultaneous equations model the wage and number of hours worked. The two equations are

\[\begin{split}\begin{eqnarray} hours & = & \beta_0 + \beta_1 \ln(wage) + \beta_2 educ + \beta_3 age + \beta_4 kidslt6 + \beta_5 nwifeinc + \epsilon^h_i \\ \ln(wage) & = & \gamma_0 + \gamma_1 hours + \gamma_2 educ + \gamma_3 educ^2 + \gamma_4 exper + \epsilon^w_i \end{eqnarray}\end{split}\]

Each equation has a single exogenous variables. The instruments for the endogenous variables are the regressors that appear in one equation but not the other.

Data

The data set is the MORZ data set from Wooldridge (2002).

[1]:
from linearmodels.datasets import mroz

data = mroz.load()

Here the relevant variables are selected and missing observations are dropped to avoid warnings.

[2]:
data = data[
    ["hours", "educ", "age", "kidslt6", "nwifeinc", "lwage", "exper", "expersq"]
]
data = data.dropna()

The main models are imported:

  • IV2SLS - single equation 2-stage least squares

  • IV3SLS - system estimation using instrumental variables

  • SUR - system estimation without endogenous variables

[3]:
from linearmodels import IV2SLS, IV3SLS, SUR, IVSystemGMM

Formulas

These examples use the formula interface. This is usually simpler when models have exogenous regressors, endogenous regressors and instruments. The syntax is the same as in the 2SLS models.

[4]:
hours = "hours ~ educ + age + kidslt6 + nwifeinc + [lwage ~ exper + expersq]"

hours_mod = IV2SLS.from_formula(hours, data)
hours_res = hours_mod.fit(cov_type="unadjusted")
print(hours_res)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                  hours   R-squared:                      0.1903
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1807
No. Observations:                 428   F-statistic:                    399.30
Date:                Tue, Apr 16 2024   P-value (F-stat)                0.0000
Time:                        17:36:09   Distribution:                  chi2(5)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ          -99.299     48.997    -2.0266     0.0427     -195.33     -3.2666
age            19.429     6.2770     3.0952     0.0020      7.1261      31.732
kidslt6       -51.616     183.96    -0.2806     0.7790     -412.18      308.95
nwifeinc      -11.445     6.6787    -1.7137     0.0866     -24.535      1.6449
lwage          1626.3     472.19     3.4442     0.0006      700.85      2551.8
==============================================================================

Endogenous: lwage
Instruments: exper, expersq
Unadjusted Covariance (Homoskedastic)
Debiased: False

The \(\ln\) wage model can be similarly specified and estimated

[5]:
lwage = "lwage ~ educ + exper + expersq + [hours ~ age + kidslt6 + nwifeinc]"

lwage_mod = IV2SLS.from_formula(lwage, data)
lwage_res = lwage_mod.fit(cov_type="unadjusted")
print(lwage_res)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                  lwage   R-squared:                      0.7582
Estimator:                    IV-2SLS   Adj. R-squared:                 0.7559
No. Observations:                 428   F-statistic:                    1362.4
Date:                Tue, Apr 16 2024   P-value (F-stat)                0.0000
Time:                        17:36:09   Distribution:                  chi2(4)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.0875     0.0162     5.3892     0.0000      0.0557      0.1193
exper          0.0524     0.0299     1.7501     0.0801     -0.0063      0.1110
expersq       -0.0009     0.0006    -1.4898     0.1363     -0.0021      0.0003
hours         -0.0003     0.0003    -0.8666     0.3862     -0.0009      0.0004
==============================================================================

Endogenous: hours
Instruments: age, kidslt6, nwifeinc
Unadjusted Covariance (Homoskedastic)
Debiased: False

A system can be specified using a dictionary for formulas. The dictionary keys are used as equation labels. Aside from this simple change, the syntax is identical.

Here the model is estimated using method="ols" which will just simultaneously estimate the two equations but will produce estimates that are identical to separate equations.

[6]:
equations = dict(hours=hours, lwage=lwage)
system_2sls = IV3SLS.from_formula(equations, data)
system_2sls_res = system_2sls.fit(method="ols", cov_type="unadjusted")
print(system_2sls_res)
                           System OLS Estimation Summary
===================================================================================
Estimator:                        OLS   Overall R-squared:                   0.1903
No. Equations.:                     2   McElroy's R-squared:                 0.1276
No. Observations:                 428   Judge's (OLS) R-squared:            -2.0961
Date:                Tue, Apr 16 2024   Berndt's R-squared:                 -0.7279
Time:                        17:36:09   Dhrymes's R-squared:                 0.1903
                                        Cov. Estimator:                  unadjusted
                                        Num. Constraints:                      None
                  Equation: hours, Dependent Variable: hours
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ          -99.299     48.997    -2.0266     0.0427     -195.33     -3.2666
age            19.429     6.2770     3.0952     0.0020      7.1261      31.732
kidslt6       -51.616     183.96    -0.2806     0.7790     -412.18      308.95
nwifeinc      -11.445     6.6787    -1.7137     0.0866     -24.535      1.6449
lwage          1626.3     472.19     3.4442     0.0006      700.85      2551.8

==============
 Instruments
--------------
exper, expersq

                  Equation: lwage, Dependent Variable: lwage
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.0875     0.0162     5.3892     0.0000      0.0557      0.1193
exper          0.0524     0.0299     1.7501     0.0801     -0.0063      0.1110
expersq       -0.0009     0.0006    -1.4898     0.1363     -0.0021      0.0003
hours         -0.0003     0.0003    -0.8666     0.3862     -0.0009      0.0004
======================
     Instruments
----------------------
age, kidslt6, nwifeinc
----------------------

Covariance Estimator:
Homoskedastic (Unadjusted) Covariance (Debiased: False, GLS: False)

Using method="gls" will use GLS estimates which can be more efficient than the usual estimates. Here only the first equation changes. This is due to the structure of the problem.

[7]:
equations = dict(hours=hours, lwage=lwage)
system_3sls = IV3SLS.from_formula(equations, data)
system_3sls_res = system_3sls.fit(method="gls", cov_type="unadjusted")
print(system_3sls_res)
                           System GLS Estimation Summary
===================================================================================
Estimator:                        GLS   Overall R-squared:                   0.0120
No. Equations.:                     2   McElroy's R-squared:                 0.0873
No. Observations:                 428   Judge's (OLS) R-squared:            -2.7778
Date:                Tue, Apr 16 2024   Berndt's R-squared:                 -0.7279
Time:                        17:36:10   Dhrymes's R-squared:                 0.0120
                                        Cov. Estimator:                  unadjusted
                                        Num. Constraints:                      None
                  Equation: hours, Dependent Variable: hours
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ          -109.90     48.052    -2.2870     0.0222     -204.08     -15.716
age            13.651     5.5456     2.4617     0.0138      2.7822      24.521
kidslt6       -196.61     170.04    -1.1563     0.2476     -529.88      136.66
nwifeinc      -6.4136     5.4646    -1.1736     0.2405     -17.124      4.2969
lwage          1872.7     461.13     4.0611     0.0000      968.88      2776.5

==============
 Instruments
--------------
exper, expersq

                  Equation: lwage, Dependent Variable: lwage
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.0859     0.0159     5.3864     0.0000      0.0546      0.1171
exper          0.0550     0.0295     1.8622     0.0626     -0.0029      0.1128
expersq       -0.0010     0.0006    -1.7539     0.0794     -0.0022      0.0001
hours         -0.0003     0.0003    -0.8398     0.4010     -0.0009      0.0004
======================
     Instruments
----------------------
age, kidslt6, nwifeinc
----------------------

Covariance Estimator:
Homoskedastic (Unadjusted) Covariance (Debiased: False, GLS: True)

Direct Model Specification

The model can be directly specified using a dictionary of dictionaries where the inner dictionaries contain the 4 components of the model:

  • dependent - The dependent variable

  • exog - Exogenous regressors

  • endog - Endogenous regressors

  • instruments - Instrumental variables

The estimates are the same. This interface is more useful for programmatically generating and estimating models.

[8]:
hours = {
    "dependent": data[["hours"]],
    "exog": data[["educ", "age", "kidslt6", "nwifeinc"]],
    "endog": data[["lwage"]],
    "instruments": data[["exper", "expersq"]],
}

lwage = {
    "dependent": data[["lwage"]],
    "exog": data[["educ", "exper", "expersq"]],
    "endog": data[["hours"]],
    "instruments": data[["age", "kidslt6", "nwifeinc"]],
}

equations = dict(hours=hours, lwage=lwage)
system_3sls = IV3SLS(equations)
system_3sls_res = system_3sls.fit(cov_type="unadjusted")
print(system_3sls_res)
                           System GLS Estimation Summary
===================================================================================
Estimator:                        GLS   Overall R-squared:                   0.0120
No. Equations.:                     2   McElroy's R-squared:                 0.0873
No. Observations:                 428   Judge's (OLS) R-squared:            -2.7778
Date:                Tue, Apr 16 2024   Berndt's R-squared:                 -0.7279
Time:                        17:36:10   Dhrymes's R-squared:                 0.0120
                                        Cov. Estimator:                  unadjusted
                                        Num. Constraints:                      None
                  Equation: hours, Dependent Variable: hours
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ          -109.90     48.052    -2.2870     0.0222     -204.08     -15.716
age            13.651     5.5456     2.4617     0.0138      2.7822      24.521
kidslt6       -196.61     170.04    -1.1563     0.2476     -529.88      136.66
nwifeinc      -6.4136     5.4646    -1.1736     0.2405     -17.124      4.2969
lwage          1872.7     461.13     4.0611     0.0000      968.88      2776.5

==============
 Instruments
--------------
exper, expersq

                  Equation: lwage, Dependent Variable: lwage
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.0859     0.0159     5.3864     0.0000      0.0546      0.1171
exper          0.0550     0.0295     1.8622     0.0626     -0.0029      0.1128
expersq       -0.0010     0.0006    -1.7539     0.0794     -0.0022      0.0001
hours         -0.0003     0.0003    -0.8398     0.4010     -0.0009      0.0004
======================
     Instruments
----------------------
age, kidslt6, nwifeinc
----------------------

Covariance Estimator:
Homoskedastic (Unadjusted) Covariance (Debiased: False, GLS: True)

System GMM Estimation

System GMM is an alternative to 3SLS estimation. It is the natural extension to GMM estimation of IV models. It makes weaker assumptions about instruments than 3SLS does. In particular, instruments are assumed exogenous on an equation-by-equation basis rather than the 3SLS assumption that all instruments are exogenous in all equations.

The system GMM estimator is similar to the 3SLS estimator except that it requires making a choice about the moment weighting estimator. Valid options for the weighting estimator are "unadjusted" or "homoskedastic" which assumes that residuals are conditionally homoskedastic or "robust" or "heteroskedastic" which allows for conditional heteroskedasticity.

The System GMM estimator also supports iterative application where it is possible to iterate until convergence.

Here the examples make use of the same data as in the 3SLS example and only use the formula interface. The default uses 2-step (efficient) GMM.

[9]:
equations = dict(
    hours="hours ~ educ + age + kidslt6 + nwifeinc + [lwage ~ exper + expersq]",
    lwage="lwage ~ educ + exper + expersq + [hours ~ age + kidslt6 + nwifeinc]",
)
system_gmm = IVSystemGMM.from_formula(equations, data, weight_type="unadjusted")
system_gmm_res = system_gmm.fit(cov_type="unadjusted")
print(system_gmm_res)
                    System 2-Step System GMM Estimation Summary
===================================================================================
Estimator:          2-Step System GMM   Overall R-squared:                   0.0121
No. Equations.:                     2   McElroy's R-squared:                 0.0871
No. Observations:                 428   Judge's (OLS) R-squared:            -2.7776
Date:                Tue, Apr 16 2024   Berndt's R-squared:                 -0.7268
Time:                        17:36:10   Dhrymes's R-squared:                 0.0121
                                        Cov. Estimator:                  unadjusted
                                        Num. Constraints:                      None
                  Equation: hours, Dependent Variable: hours
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ          -109.89     48.038    -2.2876     0.0222     -204.05     -15.741
age            13.653     5.5440     2.4626     0.0138      2.7868      24.519
kidslt6       -196.57     169.99    -1.1564     0.2475     -529.74      136.60
nwifeinc      -6.4149     5.4631    -1.1742     0.2403     -17.122      4.2925
lwage          1872.6     461.00     4.0621     0.0000      969.08      2776.2

==============
 Instruments
--------------
exper, expersq

                  Equation: lwage, Dependent Variable: lwage
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.0859     0.0159     5.3866     0.0000      0.0546      0.1171
exper          0.0550     0.0295     1.8623     0.0626     -0.0029      0.1128
expersq       -0.0010     0.0006    -1.7540     0.0794     -0.0022      0.0001
hours         -0.0003     0.0003    -0.8398     0.4010     -0.0009      0.0004
======================
     Instruments
----------------------
age, kidslt6, nwifeinc
----------------------

Covariance Estimator:
GMM Homoskedastic (Unadjusted) Covariance
Weight Estimator:
Homoskedastic (Unadjusted) Weighting (Debiased: False, Center: False)

Robust weighting can be used by setting the weight_type. The number of iterations can be set using iter_limit. Overall the parameters do not meaningfully change.

[10]:
system_gmm = IVSystemGMM.from_formula(equations, data, weight_type="robust")
system_gmm_res = system_gmm.fit(cov_type="robust", iter_limit=100)
print("Number of iterations: " + str(system_gmm_res.iterations))
print(system_gmm_res)
Number of iterations: 20
                    System Iterative System GMM Estimation Summary
=====================================================================================
Estimator:         Iterative System GMM   Overall R-squared:                  -0.0345
No. Equations.:                       2   McElroy's R-squared:                -0.2256
No. Observations:                   428   Judge's (OLS) R-squared:            -2.9557
Date:                  Tue, Apr 16 2024   Berndt's R-squared:                 -2.0361
Time:                          17:36:10   Dhrymes's R-squared:                -0.0345
                                          Cov. Estimator:                      robust
                                          Num. Constraints:                      None
                  Equation: hours, Dependent Variable: hours
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ          -118.31     57.508    -2.0572     0.0397     -231.02     -5.5947
age            15.027     5.9903     2.5086     0.0121      3.2864      26.768
kidslt6       -227.22     216.54    -1.0493     0.2940     -651.64      197.20
nwifeinc      -8.2578     5.3146    -1.5538     0.1202     -18.674      2.1587
lwage          1933.5     592.77     3.2618     0.0011      771.69      3095.3

==============
 Instruments
--------------
exper, expersq

                  Equation: lwage, Dependent Variable: lwage
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.0967     0.0190     5.0999     0.0000      0.0595      0.1338
exper          0.0720     0.0363     1.9840     0.0473      0.0009      0.1430
expersq       -0.0013     0.0007    -1.8475     0.0647     -0.0028   8.133e-05
hours         -0.0005     0.0004    -1.2298     0.2188     -0.0013      0.0003
======================
     Instruments
----------------------
age, kidslt6, nwifeinc
----------------------

Covariance Estimator:
GMM Heteroskedastic (Robust) Covariance
Weight Estimator:
Heteroskedastic (Robust) Weighting (Debiased: False, Center: False)

Simultaneous Equations

Consider the simple simultaneous equation model

\[\begin{split}\begin{align} y_1 = y_2 + x_1 + \epsilon_1 \\ y_2 = -y_1 + x_2 + \epsilon_1 \\ \end{align}\end{split}\]

Substituting each equation in to the other, the reduced form of this model is

\[\begin{split}\begin{align} y_1 = \frac{1}{2}\left(x_1 + x_2 + \epsilon_1 + \epsilon_2\right) \\ y_1 = \frac{1}{2}\left(x_2 - x_1 + \epsilon_2 - \epsilon_1\right) \\ \end{align}\end{split}\]

Data from this model is simple to simulate.

[11]:
import numpy as np
import pandas as pd

rs = np.random.default_rng(20220224)
e = rs.standard_normal((50000, 2))
x = rs.standard_normal((50000, 2))
y2 = x[:, 1] / 2 - x[:, 0] / 2 + e[:, 1] / 2 - e[:, 0] / 2
y1 = x[:, 1] / 2 + x[:, 0] / 2 + e[:, 1] / 2 + e[:, 0] / 2
df = pd.DataFrame(np.column_stack([y1, y2, x]), columns=["y1", "y2", "x1", "x2"])
in_sample = df.iloc[:-10000]
oos = df.iloc[-10000:]
mod = IV3SLS.from_formula(
    dict(y1="y1 ~  x1 + [y2 ~ x2]", y2="y2 ~ x2 + [y1 ~ x1]"), data=df
)
res = mod.fit()
print(res)
                           System GLS Estimation Summary
===================================================================================
Estimator:                        GLS   Overall R-squared:                   0.0013
No. Equations.:                     2   McElroy's R-squared:                 0.0014
No. Observations:               50000   Judge's (OLS) R-squared:             0.0013
Date:                Tue, Apr 16 2024   Berndt's R-squared:                  0.0027
Time:                        17:36:10   Dhrymes's R-squared:                 0.0013
                                        Cov. Estimator:                      robust
                                        Num. Constraints:                      None
                     Equation: y1, Dependent Variable: y1
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
x1             0.9992     0.0063     157.74     0.0000      0.9868      1.0116
y2             1.0053     0.0090     111.60     0.0000      0.9877      1.0230

===========
Instruments
-----------
         x2

                     Equation: y2, Dependent Variable: y2
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
x2             0.9970     0.0063     158.73     0.0000      0.9847      1.0093
y1            -0.9809     0.0088    -111.59     0.0000     -0.9982     -0.9637
===========
Instruments
-----------
         x1
-----------

Covariance Estimator:
Heteroskedastic (Robust) Covariance (Debiased: False, GLS: True)

Out-of-sample predictions can be made using the predict method along with a DataFrame containing the values to use for the predictions. The predictions make use of the out-of-sample values stored above.

[12]:
res.predict(data=oos, dataframe=True)
[12]:
y1 y2
40000 -0.125377 0.024812
40001 0.489893 -0.707565
40002 -0.189191 -0.727702
40003 -2.081779 -2.756181
40004 1.628805 0.227368
... ... ...
49995 -0.293820 -0.140007
49996 -0.462017 1.848988
49997 0.698136 1.510755
49998 0.862183 0.732983
49999 1.118024 1.147715

10000 rows × 2 columns

These predictions can be compared to making the predicted values using the estimated parameters.

[13]:
y1_pred = res.params.y1_x1 * oos.x1 + res.params.y1_y2 * oos.y2
y2_pred = res.params.y2_x2 * oos.x2 + res.params.y2_y1 * oos.y1
pred_df = pd.DataFrame({"y1": y1_pred, "y2": y2_pred})
pred_df
[13]:
y1 y2
40000 -0.125377 0.024812
40001 0.489893 -0.707565
40002 -0.189191 -0.727702
40003 -2.081779 -2.756181
40004 1.628805 0.227368
... ... ...
49995 -0.293820 -0.140007
49996 -0.462017 1.848988
49997 0.698136 1.510755
49998 0.862183 0.732983
49999 1.118024 1.147715

10000 rows × 2 columns