Three-stage Least Squares (3SLS)¶
This example demonstrates how a system of simultaneous equations can be jointly estimated using three-stage least squares (3SLS). The simultaneous equations model the wage and number of hours worked. The two equations are
Each equation has a single exogenous variables. The instruments for the endogenous variables are the regressors that appear in one equation but not the other.
Data¶
The data set is the MORZ data set from Wooldridge (2002).
[1]:
from linearmodels.datasets import mroz
data = mroz.load()
Here the relevant variables are selected and missing observations are dropped to avoid warnings.
[2]:
data = data[
["hours", "educ", "age", "kidslt6", "nwifeinc", "lwage", "exper", "expersq"]
]
data = data.dropna()
The main models are imported:
IV2SLS
- single equation 2-stage least squaresIV3SLS
- system estimation using instrumental variablesSUR
- system estimation without endogenous variables
[3]:
from linearmodels import IV2SLS, IV3SLS, IVSystemGMM
Formulas¶
These examples use the formula interface. This is usually simpler when models have exogenous regressors, endogenous regressors and instruments. The syntax is the same as in the 2SLS models.
[4]:
hours = "hours ~ educ + age + kidslt6 + nwifeinc + [lwage ~ exper + expersq]"
hours_mod = IV2SLS.from_formula(hours, data)
hours_res = hours_mod.fit(cov_type="unadjusted")
print(hours_res)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: hours R-squared: 0.1903
Estimator: IV-2SLS Adj. R-squared: 0.1807
No. Observations: 428 F-statistic: 399.30
Date: Wed, Nov 06 2024 P-value (F-stat) 0.0000
Time: 15:12:59 Distribution: chi2(5)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
educ -99.299 48.997 -2.0266 0.0427 -195.33 -3.2666
age 19.429 6.2770 3.0952 0.0020 7.1261 31.732
kidslt6 -51.616 183.96 -0.2806 0.7790 -412.18 308.95
nwifeinc -11.445 6.6787 -1.7137 0.0866 -24.535 1.6449
lwage 1626.3 472.19 3.4442 0.0006 700.85 2551.8
==============================================================================
Endogenous: lwage
Instruments: exper, expersq
Unadjusted Covariance (Homoskedastic)
Debiased: False
The \(\ln\) wage model can be similarly specified and estimated
[5]:
lwage = "lwage ~ educ + exper + expersq + [hours ~ age + kidslt6 + nwifeinc]"
lwage_mod = IV2SLS.from_formula(lwage, data)
lwage_res = lwage_mod.fit(cov_type="unadjusted")
print(lwage_res)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: lwage R-squared: 0.7582
Estimator: IV-2SLS Adj. R-squared: 0.7559
No. Observations: 428 F-statistic: 1362.4
Date: Wed, Nov 06 2024 P-value (F-stat) 0.0000
Time: 15:12:59 Distribution: chi2(4)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
educ 0.0875 0.0162 5.3892 0.0000 0.0557 0.1193
exper 0.0524 0.0299 1.7501 0.0801 -0.0063 0.1110
expersq -0.0009 0.0006 -1.4898 0.1363 -0.0021 0.0003
hours -0.0003 0.0003 -0.8666 0.3862 -0.0009 0.0004
==============================================================================
Endogenous: hours
Instruments: age, kidslt6, nwifeinc
Unadjusted Covariance (Homoskedastic)
Debiased: False
A system can be specified using a dictionary for formulas. The dictionary keys are used as equation labels. Aside from this simple change, the syntax is identical.
Here the model is estimated using method="ols"
which will just simultaneously estimate the two equations but will produce estimates that are identical to separate equations.
[6]:
equations = dict(hours=hours, lwage=lwage)
system_2sls = IV3SLS.from_formula(equations, data)
system_2sls_res = system_2sls.fit(method="ols", cov_type="unadjusted")
print(system_2sls_res)
System OLS Estimation Summary
===================================================================================
Estimator: OLS Overall R-squared: 0.1903
No. Equations.: 2 McElroy's R-squared: 0.1276
No. Observations: 428 Judge's (OLS) R-squared: -2.0961
Date: Wed, Nov 06 2024 Berndt's R-squared: -0.7279
Time: 15:12:59 Dhrymes's R-squared: 0.1903
Cov. Estimator: unadjusted
Num. Constraints: None
Equation: hours, Dependent Variable: hours
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
educ -99.299 48.997 -2.0266 0.0427 -195.33 -3.2666
age 19.429 6.2770 3.0952 0.0020 7.1261 31.732
kidslt6 -51.616 183.96 -0.2806 0.7790 -412.18 308.95
nwifeinc -11.445 6.6787 -1.7137 0.0866 -24.535 1.6449
lwage 1626.3 472.19 3.4442 0.0006 700.85 2551.8
==============
Instruments
--------------
exper, expersq
Equation: lwage, Dependent Variable: lwage
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
educ 0.0875 0.0162 5.3892 0.0000 0.0557 0.1193
exper 0.0524 0.0299 1.7501 0.0801 -0.0063 0.1110
expersq -0.0009 0.0006 -1.4898 0.1363 -0.0021 0.0003
hours -0.0003 0.0003 -0.8666 0.3862 -0.0009 0.0004
======================
Instruments
----------------------
age, kidslt6, nwifeinc
----------------------
Covariance Estimator:
Homoskedastic (Unadjusted) Covariance (Debiased: False, GLS: False)
Using method="gls"
will use GLS estimates which can be more efficient than the usual estimates. Here only the first equation changes. This is due to the structure of the problem.
[7]:
equations = dict(hours=hours, lwage=lwage)
system_3sls = IV3SLS.from_formula(equations, data)
system_3sls_res = system_3sls.fit(method="gls", cov_type="unadjusted")
print(system_3sls_res)
System GLS Estimation Summary
===================================================================================
Estimator: GLS Overall R-squared: 0.0120
No. Equations.: 2 McElroy's R-squared: 0.0873
No. Observations: 428 Judge's (OLS) R-squared: -2.7778
Date: Wed, Nov 06 2024 Berndt's R-squared: -0.7279
Time: 15:13:00 Dhrymes's R-squared: 0.0120
Cov. Estimator: unadjusted
Num. Constraints: None
Equation: hours, Dependent Variable: hours
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
educ -109.90 48.052 -2.2870 0.0222 -204.08 -15.716
age 13.651 5.5456 2.4617 0.0138 2.7822 24.521
kidslt6 -196.61 170.04 -1.1563 0.2476 -529.88 136.66
nwifeinc -6.4136 5.4646 -1.1736 0.2405 -17.124 4.2969
lwage 1872.7 461.13 4.0611 0.0000 968.88 2776.5
==============
Instruments
--------------
exper, expersq
Equation: lwage, Dependent Variable: lwage
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
educ 0.0859 0.0159 5.3864 0.0000 0.0546 0.1171
exper 0.0550 0.0295 1.8622 0.0626 -0.0029 0.1128
expersq -0.0010 0.0006 -1.7539 0.0794 -0.0022 0.0001
hours -0.0003 0.0003 -0.8398 0.4010 -0.0009 0.0004
======================
Instruments
----------------------
age, kidslt6, nwifeinc
----------------------
Covariance Estimator:
Homoskedastic (Unadjusted) Covariance (Debiased: False, GLS: True)
Direct Model Specification¶
The model can be directly specified using a dictionary of dictionaries where the inner dictionaries contain the 4 components of the model:
dependent - The dependent variable
exog - Exogenous regressors
endog - Endogenous regressors
instruments - Instrumental variables
The estimates are the same. This interface is more useful for programmatically generating and estimating models.
[8]:
hours = {
"dependent": data[["hours"]],
"exog": data[["educ", "age", "kidslt6", "nwifeinc"]],
"endog": data[["lwage"]],
"instruments": data[["exper", "expersq"]],
}
lwage = {
"dependent": data[["lwage"]],
"exog": data[["educ", "exper", "expersq"]],
"endog": data[["hours"]],
"instruments": data[["age", "kidslt6", "nwifeinc"]],
}
equations = dict(hours=hours, lwage=lwage)
system_3sls = IV3SLS(equations)
system_3sls_res = system_3sls.fit(cov_type="unadjusted")
print(system_3sls_res)
System GLS Estimation Summary
===================================================================================
Estimator: GLS Overall R-squared: 0.0120
No. Equations.: 2 McElroy's R-squared: 0.0873
No. Observations: 428 Judge's (OLS) R-squared: -2.7778
Date: Wed, Nov 06 2024 Berndt's R-squared: -0.7279
Time: 15:13:00 Dhrymes's R-squared: 0.0120
Cov. Estimator: unadjusted
Num. Constraints: None
Equation: hours, Dependent Variable: hours
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
educ -109.90 48.052 -2.2870 0.0222 -204.08 -15.716
age 13.651 5.5456 2.4617 0.0138 2.7822 24.521
kidslt6 -196.61 170.04 -1.1563 0.2476 -529.88 136.66
nwifeinc -6.4136 5.4646 -1.1736 0.2405 -17.124 4.2969
lwage 1872.7 461.13 4.0611 0.0000 968.88 2776.5
==============
Instruments
--------------
exper, expersq
Equation: lwage, Dependent Variable: lwage
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
educ 0.0859 0.0159 5.3864 0.0000 0.0546 0.1171
exper 0.0550 0.0295 1.8622 0.0626 -0.0029 0.1128
expersq -0.0010 0.0006 -1.7539 0.0794 -0.0022 0.0001
hours -0.0003 0.0003 -0.8398 0.4010 -0.0009 0.0004
======================
Instruments
----------------------
age, kidslt6, nwifeinc
----------------------
Covariance Estimator:
Homoskedastic (Unadjusted) Covariance (Debiased: False, GLS: True)
System GMM Estimation¶
System GMM is an alternative to 3SLS estimation. It is the natural extension to GMM estimation of IV models. It makes weaker assumptions about instruments than 3SLS does. In particular, instruments are assumed exogenous on an equation-by-equation basis rather than the 3SLS assumption that all instruments are exogenous in all equations.
The system GMM estimator is similar to the 3SLS estimator except that it requires making a choice about the moment weighting estimator. Valid options for the weighting estimator are "unadjusted"
or "homoskedastic"
which assumes that residuals are conditionally homoskedastic or "robust"
or "heteroskedastic"
which allows for conditional heteroskedasticity.
The System GMM estimator also supports iterative application where it is possible to iterate until convergence.
Here the examples make use of the same data as in the 3SLS example and only use the formula interface. The default uses 2-step (efficient) GMM.
[9]:
equations = dict(
hours="hours ~ educ + age + kidslt6 + nwifeinc + [lwage ~ exper + expersq]",
lwage="lwage ~ educ + exper + expersq + [hours ~ age + kidslt6 + nwifeinc]",
)
system_gmm = IVSystemGMM.from_formula(equations, data, weight_type="unadjusted")
system_gmm_res = system_gmm.fit(cov_type="unadjusted")
print(system_gmm_res)
System 2-Step System GMM Estimation Summary
===================================================================================
Estimator: 2-Step System GMM Overall R-squared: 0.0121
No. Equations.: 2 McElroy's R-squared: 0.0871
No. Observations: 428 Judge's (OLS) R-squared: -2.7776
Date: Wed, Nov 06 2024 Berndt's R-squared: -0.7268
Time: 15:13:00 Dhrymes's R-squared: 0.0121
Cov. Estimator: unadjusted
Num. Constraints: None
Equation: hours, Dependent Variable: hours
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
educ -109.89 48.038 -2.2876 0.0222 -204.05 -15.741
age 13.653 5.5440 2.4626 0.0138 2.7868 24.519
kidslt6 -196.57 169.99 -1.1564 0.2475 -529.74 136.60
nwifeinc -6.4149 5.4631 -1.1742 0.2403 -17.122 4.2925
lwage 1872.6 461.00 4.0621 0.0000 969.08 2776.2
==============
Instruments
--------------
exper, expersq
Equation: lwage, Dependent Variable: lwage
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
educ 0.0859 0.0159 5.3866 0.0000 0.0546 0.1171
exper 0.0550 0.0295 1.8623 0.0626 -0.0029 0.1128
expersq -0.0010 0.0006 -1.7540 0.0794 -0.0022 0.0001
hours -0.0003 0.0003 -0.8398 0.4010 -0.0009 0.0004
======================
Instruments
----------------------
age, kidslt6, nwifeinc
----------------------
Covariance Estimator:
GMM Homoskedastic (Unadjusted) Covariance
Weight Estimator:
Homoskedastic (Unadjusted) Weighting (Debiased: False, Center: False)
Robust weighting can be used by setting the weight_type
. The number of iterations can be set using iter_limit
. Overall the parameters do not meaningfully change.
[10]:
system_gmm = IVSystemGMM.from_formula(equations, data, weight_type="robust")
system_gmm_res = system_gmm.fit(cov_type="robust", iter_limit=100)
print("Number of iterations: " + str(system_gmm_res.iterations))
print(system_gmm_res)
Number of iterations: 20
System Iterative System GMM Estimation Summary
=====================================================================================
Estimator: Iterative System GMM Overall R-squared: -0.0345
No. Equations.: 2 McElroy's R-squared: -0.2256
No. Observations: 428 Judge's (OLS) R-squared: -2.9557
Date: Wed, Nov 06 2024 Berndt's R-squared: -2.0361
Time: 15:13:00 Dhrymes's R-squared: -0.0345
Cov. Estimator: robust
Num. Constraints: None
Equation: hours, Dependent Variable: hours
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
educ -118.31 57.508 -2.0572 0.0397 -231.02 -5.5947
age 15.027 5.9903 2.5086 0.0121 3.2864 26.768
kidslt6 -227.22 216.54 -1.0493 0.2940 -651.64 197.20
nwifeinc -8.2578 5.3146 -1.5538 0.1202 -18.674 2.1587
lwage 1933.5 592.77 3.2618 0.0011 771.69 3095.3
==============
Instruments
--------------
exper, expersq
Equation: lwage, Dependent Variable: lwage
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
educ 0.0967 0.0190 5.0999 0.0000 0.0595 0.1338
exper 0.0720 0.0363 1.9840 0.0473 0.0009 0.1430
expersq -0.0013 0.0007 -1.8475 0.0647 -0.0028 8.133e-05
hours -0.0005 0.0004 -1.2298 0.2188 -0.0013 0.0003
======================
Instruments
----------------------
age, kidslt6, nwifeinc
----------------------
Covariance Estimator:
GMM Heteroskedastic (Robust) Covariance
Weight Estimator:
Heteroskedastic (Robust) Weighting (Debiased: False, Center: False)
Simultaneous Equations¶
Consider the simple simultaneous equation model
Substituting each equation in to the other, the reduced form of this model is
Data from this model is simple to simulate.
[11]:
import numpy as np
import pandas as pd
rs = np.random.default_rng(20220224)
e = rs.standard_normal((50000, 2))
x = rs.standard_normal((50000, 2))
y2 = x[:, 1] / 2 - x[:, 0] / 2 + e[:, 1] / 2 - e[:, 0] / 2
y1 = x[:, 1] / 2 + x[:, 0] / 2 + e[:, 1] / 2 + e[:, 0] / 2
df = pd.DataFrame(np.column_stack([y1, y2, x]), columns=["y1", "y2", "x1", "x2"])
in_sample = df.iloc[:-10000]
oos = df.iloc[-10000:]
mod = IV3SLS.from_formula(
dict(y1="y1 ~ x1 + [y2 ~ x2]", y2="y2 ~ x2 + [y1 ~ x1]"), data=df
)
res = mod.fit()
print(res)
System GLS Estimation Summary
===================================================================================
Estimator: GLS Overall R-squared: 0.0013
No. Equations.: 2 McElroy's R-squared: 0.0014
No. Observations: 50000 Judge's (OLS) R-squared: 0.0013
Date: Wed, Nov 06 2024 Berndt's R-squared: 0.0027
Time: 15:13:00 Dhrymes's R-squared: 0.0013
Cov. Estimator: robust
Num. Constraints: None
Equation: y1, Dependent Variable: y1
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
x1 0.9992 0.0063 157.74 0.0000 0.9868 1.0116
y2 1.0053 0.0090 111.60 0.0000 0.9877 1.0230
===========
Instruments
-----------
x2
Equation: y2, Dependent Variable: y2
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
x2 0.9970 0.0063 158.73 0.0000 0.9847 1.0093
y1 -0.9809 0.0088 -111.59 0.0000 -0.9982 -0.9637
===========
Instruments
-----------
x1
-----------
Covariance Estimator:
Heteroskedastic (Robust) Covariance (Debiased: False, GLS: True)
Out-of-sample predictions can be made using the predict
method along with a DataFrame
containing the values to use for the predictions. The predictions make use of the out-of-sample values stored above.
[12]:
res.predict(data=oos, dataframe=True)
[12]:
y1 | y2 | |
---|---|---|
40000 | -0.125377 | 0.024812 |
40001 | 0.489893 | -0.707565 |
40002 | -0.189191 | -0.727702 |
40003 | -2.081779 | -2.756181 |
40004 | 1.628805 | 0.227368 |
... | ... | ... |
49995 | -0.293820 | -0.140007 |
49996 | -0.462017 | 1.848988 |
49997 | 0.698136 | 1.510755 |
49998 | 0.862183 | 0.732983 |
49999 | 1.118024 | 1.147715 |
10000 rows × 2 columns
These predictions can be compared to making the predicted values using the estimated parameters.
[13]:
y1_pred = res.params.y1_x1 * oos.x1 + res.params.y1_y2 * oos.y2
y2_pred = res.params.y2_x2 * oos.x2 + res.params.y2_y1 * oos.y1
pred_df = pd.DataFrame({"y1": y1_pred, "y2": y2_pred})
pred_df
[13]:
y1 | y2 | |
---|---|---|
40000 | -0.125377 | 0.024812 |
40001 | 0.489893 | -0.707565 |
40002 | -0.189191 | -0.727702 |
40003 | -2.081779 | -2.756181 |
40004 | 1.628805 | 0.227368 |
... | ... | ... |
49995 | -0.293820 | -0.140007 |
49996 | -0.462017 | 1.848988 |
49997 | 0.698136 | 1.510755 |
49998 | 0.862183 | 0.732983 |
49999 | 1.118024 | 1.147715 |
10000 rows × 2 columns