Using formulas to specify models
All of the models can be specified using formulas. The formulas used here utilize formulaic and are similar to those in statsmodels. The basic formula syntax for a single-variable regression would be
y ~ 1 + x
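To make concrete what this formula asks for, here is a minimal sketch in plain Python of the two numbers that `y ~ 1 + x` estimates: the coefficient on the `1` term (the intercept) and the coefficient on `x` (the slope). The data values are made up for illustration and no estimation library is used.

```python
# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: sum of cross-deviations divided by the sum of squared deviations of x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx

# Intercept: the coefficient on the "1" term in the formula.
intercept = y_bar - slope * x_bar

print(f"intercept={intercept:.4f}, slope={slope:.4f}")
```

Dropping the `1 +` from the formula corresponds to forcing the intercept to zero.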
The formulas used with BetweenOLS, PooledOLS and RandomEffects are completely standard and are identical to statsmodels. FirstDifferenceOLS is nearly identical, with the caveat that the model cannot include an intercept.
PanelOLS, which implements effects (entity, time or other), has a small extension to the formula which allows entity effects or time effects (or both) to be specified as part of the formula. While it is not possible to specify other effects using the formula interface, these can be included as an optional parameter when using a formula.
Loading and preparing data
When using formulas, a pandas DataFrame with a MultiIndex whose levels are entity and time, in that order, is required. Here the Grunfeld data, from “The Determinants of Corporate Investment”, provided by statsmodels, is used to illustrate the use of formulas. This dataset contains data on firm investment, market value and the stock of plant capital.
set_index is used to set the index using variables from the dataset.
[1]:
from statsmodels.datasets import grunfeld
data = grunfeld.load_pandas().data
data = data.set_index(["firm", "year"])
print(data.head())
                       invest   value  capital
firm           year
General Motors 1935.0   317.6  3078.5      2.8
               1936.0   391.8  4661.7     52.6
               1937.0   410.6  5387.1    156.9
               1938.0   257.7  2792.2    209.2
               1939.0   330.8  4313.2    203.4
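The same preparation step can be sketched on a small, made-up long-format frame. The column names and values below are hypothetical; the point is only that the entity variable becomes the first index level and the time variable the second.

```python
import pandas as pd

# Hypothetical long-format data: one row per (entity, time) observation.
raw = pd.DataFrame(
    {
        "firm": ["A", "A", "B", "B"],
        "year": [2000, 2001, 2000, 2001],
        "invest": [1.0, 1.5, 2.0, 2.2],
    }
)

# Entity is the outer (first) level and time is the inner (second) level.
panel = raw.set_index(["firm", "year"])

print(panel.index.names)
print(panel.index.nlevels)
```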
PanelOLS with Entity Effects
Entity effects are specified using the special command EntityEffects. By default a constant is not included, and so if a constant is desired, 1 + should be included in the formula. When including effects, the estimated coefficients on the regressors and the within fit are identical whether or not a constant is included.
PanelOLS with Entity Effects and a constant
The constant can be explicitly included using the 1 + notation. When a constant is included in the model, an additional constraint is imposed that the sum of the effects is 0. This allows the constant to be identified using the grand mean of the dependent variable and the regressors.
[2]:
from linearmodels import PanelOLS
mod = PanelOLS.from_formula("invest ~ value + capital + EntityEffects", data=data)
print(mod.fit())
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                   invest   R-squared:                      0.7667
Estimator:                     PanelOLS   R-squared (Between):            0.8223
No. Observations:                   220   R-squared (Within):             0.7667
Date:                  Fri, Jul 19 2024   R-squared (Overall):            0.8132
Time:                          17:54:59   Log-likelihood                 -1167.4
Cov. Estimator:              Unadjusted
                                          F-statistic:                    340.08
Entities:                            11   P-value                         0.0000
Avg Obs:                         20.000   Distribution:                 F(2,207)
Min Obs:                         20.000
Max Obs:                         20.000   F-statistic (robust):           340.08
                                          P-value                         0.0000
Time periods:                        20   Distribution:                 F(2,207)
Avg Obs:                         11.000
Min Obs:                         11.000
Max Obs:                         11.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
value          0.1101     0.0113     9.7461     0.0000      0.0879      0.1324
capital        0.3100     0.0165     18.744     0.0000      0.2774      0.3426
==============================================================================

F-test for Poolability: 49.207
P-value: 0.0000
Distribution: F(10,207)

Included effects: Entity
[3]:
mod = PanelOLS.from_formula("invest ~ 1 + value + capital + EntityEffects", data=data)
print(mod.fit())
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                   invest   R-squared:                      0.7667
Estimator:                     PanelOLS   R-squared (Between):            0.8193
No. Observations:                   220   R-squared (Within):             0.7667
Date:                  Fri, Jul 19 2024   R-squared (Overall):            0.8071
Time:                          17:54:59   Log-likelihood                 -1167.4
Cov. Estimator:              Unadjusted
                                          F-statistic:                    340.08
Entities:                            11   P-value                         0.0000
Avg Obs:                         20.000   Distribution:                 F(2,207)
Min Obs:                         20.000
Max Obs:                         20.000   F-statistic (robust):           340.08
                                          P-value                         0.0000
Time periods:                        20   Distribution:                 F(2,207)
Avg Obs:                         11.000
Min Obs:                         11.000
Max Obs:                         11.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept     -55.272     10.891    -5.0750     0.0000     -76.743     -33.800
value          0.1101     0.0113     9.7461     0.0000      0.0879      0.1324
capital        0.3100     0.0165     18.744     0.0000      0.2774      0.3426
==============================================================================

F-test for Poolability: 49.207
P-value: 0.0000
Distribution: F(10,207)

Included effects: Entity
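The identification argument can be illustrated with a small plain-Python sketch. It is not the library's implementation: it demeans a made-up balanced panel by entity to get the slope, recovers the constant from the grand means, and then checks that the implied entity effects sum to zero. All data values and names are hypothetical.

```python
# Hypothetical balanced panel: two entities, three periods each.
panel = {
    "A": {"x": [1.0, 2.0, 3.0], "y": [3.0, 5.5, 7.0]},
    "B": {"x": [2.0, 4.0, 6.0], "y": [9.0, 12.5, 17.0]},
}


def mean(values):
    return sum(values) / len(values)


# Within (entity-demeaned) slope: pool the demeaned cross-products.
num = 0.0
den = 0.0
for obs in panel.values():
    x_bar, y_bar = mean(obs["x"]), mean(obs["y"])
    num += sum((x - x_bar) * (y - y_bar) for x, y in zip(obs["x"], obs["y"]))
    den += sum((x - x_bar) ** 2 for x in obs["x"])
slope = num / den

# Constant identified from the grand means of y and x.
all_x = [x for obs in panel.values() for x in obs["x"]]
all_y = [y for obs in panel.values() for y in obs["y"]]
intercept = mean(all_y) - slope * mean(all_x)

# Entity effects implied by the slope and the constant.
effects = {
    name: mean(obs["y"]) - slope * mean(obs["x"]) - intercept
    for name, obs in panel.items()
}

print(slope, intercept, effects)
print("sum of effects:", sum(effects.values()))
```

The slope is computed before the constant enters, which is why the coefficients on value and capital are the same in the two summaries above.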
Panel with Entity and Time Effects
Time effects can be similarly included using TimeEffects. In many models, time effects can be consistently estimated and so they could be equivalently included in the set of regressors using a categorical variable.
[4]:
mod = PanelOLS.from_formula(
    "invest ~ 1 + value + capital + EntityEffects + TimeEffects", data=data
)
print(mod.fit())
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                   invest   R-squared:                      0.7253
Estimator:                     PanelOLS   R-squared (Between):            0.7944
No. Observations:                   220   R-squared (Within):             0.7566
Date:                  Fri, Jul 19 2024   R-squared (Overall):            0.7856
Time:                          17:54:59   Log-likelihood                 -1153.0
Cov. Estimator:              Unadjusted
                                          F-statistic:                    248.15
Entities:                            11   P-value                         0.0000
Avg Obs:                         20.000   Distribution:                 F(2,188)
Min Obs:                         20.000
Max Obs:                         20.000   F-statistic (robust):           248.15
                                          P-value                         0.0000
Time periods:                        20   Distribution:                 F(2,188)
Avg Obs:                         11.000
Min Obs:                         11.000
Max Obs:                         11.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept     -72.394     12.732    -5.6861     0.0000     -97.509     -47.278
value          0.1167     0.0129     9.0219     0.0000      0.0912      0.1422
capital        0.3514     0.0210     16.696     0.0000      0.3099      0.3930
==============================================================================

F-test for Poolability: 18.476
P-value: 0.0000
Distribution: F(29,188)

Included effects: Entity, Time
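For intuition about what removing both sets of effects does, the following plain-Python sketch applies the textbook two-way demeaning for a balanced panel (subtract the entity mean and the period mean, add back the grand mean) and then fits a slope. This illustrates the idea and is not a description of how PanelOLS computes its estimates. The data are constructed so that y equals 2 times x plus an entity effect plus a time effect; all values are hypothetical.

```python
# Hypothetical balanced panel: rows are entities, columns are time periods.
x = [[1.0, 2.0, 4.0], [2.0, 5.0, 3.0], [6.0, 1.0, 2.0]]
entity_fx = [1.0, -2.0, 3.0]
time_fx = [0.5, 1.0, -1.5]
n, t = len(x), len(x[0])

# y = 2 * x + entity effect + time effect, with no noise.
y = [[2.0 * x[i][s] + entity_fx[i] + time_fx[s] for s in range(t)] for i in range(n)]


def two_way_demean(z):
    """Subtract entity and period means and add back the grand mean."""
    grand = sum(map(sum, z)) / (n * t)
    row = [sum(r) / t for r in z]
    col = [sum(z[i][s] for i in range(n)) / n for s in range(t)]
    return [[z[i][s] - row[i] - col[s] + grand for s in range(t)] for i in range(n)]


xd, yd = two_way_demean(x), two_way_demean(y)

# Slope from a regression of demeaned y on demeaned x (no constant needed).
num = sum(xd[i][s] * yd[i][s] for i in range(n) for s in range(t))
den = sum(xd[i][s] ** 2 for i in range(n) for s in range(t))
slope = num / den

print(slope)
```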
Between OLS
The other panel models are straightforward and are included for completeness.
[5]:
from linearmodels import BetweenOLS, FirstDifferenceOLS, PooledOLS
mod = BetweenOLS.from_formula("invest ~ 1 + value + capital", data=data)
print(mod.fit())
                         BetweenOLS Estimation Summary
================================================================================
Dep. Variable:                   invest   R-squared:                      0.8644
Estimator:                   BetweenOLS   R-squared (Between):            0.8644
No. Observations:                    11   R-squared (Within):             0.4195
Date:                  Fri, Jul 19 2024   R-squared (Overall):            0.7616
Time:                          17:54:59   Log-likelihood                 -61.997
Cov. Estimator:              Unadjusted
                                          F-statistic:                    25.500
Entities:                            11   P-value                         0.0003
Avg Obs:                         20.000   Distribution:                   F(2,8)
Min Obs:                         20.000
Max Obs:                         20.000   F-statistic (robust):           25.500
                                          P-value                         0.0003
Time periods:                        20   Distribution:                   F(2,8)
Avg Obs:                         11.000
Min Obs:                         11.000
Max Obs:                         11.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept     -7.3825     40.444    -0.1825     0.8597     -100.65      85.881
value          0.1346     0.0269     5.0065     0.0010      0.0726      0.1966
capital        0.0297     0.1746     0.1700     0.8692     -0.3730      0.4323
==============================================================================
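The summary reports 11 observations because the between estimator works with one averaged observation per entity. A minimal plain-Python sketch of that idea, on made-up data with a single regressor, averages each entity over time and then runs an ordinary regression on the averages. The names and values are hypothetical.

```python
# Hypothetical panel: three entities, two periods each.
panel = {
    "A": {"x": [0.0, 2.0], "y": [1.0, 3.0]},
    "B": {"x": [1.0, 3.0], "y": [5.0, 3.0]},
    "C": {"x": [2.0, 4.0], "y": [6.0, 8.0]},
}


def mean(values):
    return sum(values) / len(values)


# Step 1: collapse each entity to its time average.
x_means = [mean(obs["x"]) for obs in panel.values()]
y_means = [mean(obs["y"]) for obs in panel.values()]

# Step 2: regress the averaged y on the averaged x, with a constant.
x_bar, y_bar = mean(x_means), mean(y_means)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_means, y_means))
s_xx = sum((x - x_bar) ** 2 for x in x_means)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

print(len(x_means), slope, intercept)
```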
First Difference OLS
The first difference model cannot include a constant: differencing removes any time-invariant term, so the constant is not identified.
[6]:
mod = FirstDifferenceOLS.from_formula("invest ~ value + capital", data=data)
print(mod.fit())
                     FirstDifferenceOLS Estimation Summary
================================================================================
Dep. Variable:                   invest   R-squared:                      0.4287
Estimator:           FirstDifferenceOLS   R-squared (Between):            0.8643
No. Observations:                   209   R-squared (Within):             0.7539
Date:                  Fri, Jul 19 2024   R-squared (Overall):            0.8461
Time:                          17:54:59   Log-likelihood                 -1071.1
Cov. Estimator:              Unadjusted
                                          F-statistic:                    77.679
Entities:                            11   P-value                         0.0000
Avg Obs:                         20.000   Distribution:                 F(2,207)
Min Obs:                         20.000
Max Obs:                         20.000   F-statistic (robust):           77.679
                                          P-value                         0.0000
Time periods:                        20   Distribution:                 F(2,207)
Avg Obs:                         11.000
Min Obs:                         11.000
Max Obs:                         11.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
value          0.0891     0.0078     11.348     0.0000      0.0736      0.1045
capital        0.2786     0.0449     6.1990     0.0000      0.1900      0.3673
==============================================================================
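A short plain-Python sketch shows why the constant drops out, and why the summary reports 209 observations (11 entities times 19 differences). The data are constructed as y = 5 + 2 x plus an entity effect; after differencing within each entity, neither the 5 nor the entity effect remains, and a regression through the origin recovers the slope. All names and values are hypothetical.

```python
# Hypothetical panel: two entities, three periods each.
x = {"A": [1.0, 3.0, 4.0], "B": [2.0, 2.5, 6.0]}
entity_fx = {"A": 1.0, "B": -3.0}

# y = 5 + 2 * x + entity effect, with no noise.
y = {k: [5.0 + 2.0 * v + entity_fx[k] for v in xs] for k, xs in x.items()}


def diff(series):
    """First differences within one entity: value at t minus value at t-1."""
    return [b - a for a, b in zip(series, series[1:])]


# Difference within each entity, then stack the results.
dx = [d for k in x for d in diff(x[k])]
dy = [d for k in y for d in diff(y[k])]

# Regression through the origin on the differenced data.
slope = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

print(len(dx), slope)
```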
Pooled OLS
The pooled OLS estimator is a special case of PanelOLS when there are no effects. It is effectively identical to OLS (or WLS) in statsmodels but is included for completeness.
[7]:
mod = PooledOLS.from_formula("invest ~ 1 + value + capital", data=data)
print(mod.fit())
                          PooledOLS Estimation Summary
================================================================================
Dep. Variable:                   invest   R-squared:                      0.8179
Estimator:                    PooledOLS   R-squared (Between):            0.8426
No. Observations:                   220   R-squared (Within):             0.7357
Date:                  Fri, Jul 19 2024   R-squared (Overall):            0.8179
Time:                          17:54:59   Log-likelihood                 -1301.3
Cov. Estimator:              Unadjusted
                                          F-statistic:                    487.28
Entities:                            11   P-value                         0.0000
Avg Obs:                         20.000   Distribution:                 F(2,217)
Min Obs:                         20.000
Max Obs:                         20.000   F-statistic (robust):           487.28
                                          P-value                         0.0000
Time periods:                        20   Distribution:                 F(2,217)
Avg Obs:                         11.000
Min Obs:                         11.000
Max Obs:                         11.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept     -38.410     8.4134    -4.5654     0.0000     -54.992     -21.828
value          0.1145     0.0055     20.753     0.0000      0.1037      0.1254
capital        0.2275     0.0242     9.3904     0.0000      0.1798      0.2753
==============================================================================