Examples¶
These examples cover the models available for estimating panel models. The initial examples all ignore covariance options and so use the default classic covariance which is appropriate for homoskedastic data. The alternative covariance options are described at the end of this document.
Loading data¶
These examples all make use of the wage panel from
Vella and M. Verbeek (1998), “Whose Wages Do Unions Raise? A Dynamic Model of Unionism and Wage Rate Determination for Young Men,” Journal of Applied Econometrics 13, 163-183.
The data set consists of wages and characteristics for men during the 1980s. The entity identifier is nr
and the time identified is year
. This data is used extensively in Chapter 14 of Introduction to Econometrics by Jeffrey Wooldridge.
Here a MultiIndex
DataFrame
is used to hold the data in a format that can be understood as a panel. Before setting the index, a year Categorical
is created which facilitated making dummies.
[1]:
import pandas as pd
from linearmodels.datasets import wage_panel
data = wage_panel.load()
year = pd.Categorical(data.year)
data = data.set_index(["nr", "year"])
data["year"] = year
print(wage_panel.DESCR)
print(data.head())
F. Vella and M. Verbeek (1998), "Whose Wages Do Unions Raise? A Dynamic Model
of Unionism and Wage Rate Determination for Young Men," Journal of Applied
Econometrics 13, 163-183.
nr person identifier
year 1980 to 1987
black =1 if black
exper labor market experience
hisp =1 if Hispanic
hours annual hours worked
married =1 if married
educ years of schooling
union =1 if in union
lwage log(wage)
expersq exper^2
occupation Occupation code
black exper hisp hours married educ union lwage expersq \
nr year
13 1980 0 1 0 2672 0 14 0 1.197540 1
1981 0 2 0 2320 0 14 1 1.853060 4
1982 0 3 0 2940 0 14 0 1.344462 9
1983 0 4 0 2960 0 14 0 1.433213 16
1984 0 5 0 3071 0 14 0 1.568125 25
occupation year
nr year
13 1980 9 1980
1981 9 1981
1982 9 1982
1983 9 1983
1984 5 1984
Basic regression on panel data¶
PooledOLS
is just plain OLS that understands that various panel data structures. It is useful as a base model. Here the log wage is modeled using all of the variables and time dummies.
[2]:
import statsmodels.api as sm
from linearmodels.panel import PooledOLS
exog_vars = ["black", "hisp", "exper", "expersq", "married", "educ", "union", "year"]
exog = sm.add_constant(data[exog_vars])
mod = PooledOLS(data.lwage, exog)
pooled_res = mod.fit()
print(pooled_res)
PooledOLS Estimation Summary
================================================================================
Dep. Variable: lwage R-squared: 0.1893
Estimator: PooledOLS R-squared (Between): 0.2066
No. Observations: 4360 R-squared (Within): 0.1692
Date: Tue, Sep 24 2024 R-squared (Overall): 0.1893
Time: 09:29:52 Log-likelihood -2982.0
Cov. Estimator: Unadjusted
F-statistic: 72.459
Entities: 545 P-value 0.0000
Avg Obs: 8.0000 Distribution: F(14,4345)
Min Obs: 8.0000
Max Obs: 8.0000 F-statistic (robust): 72.459
P-value 0.0000
Time periods: 8 Distribution: F(14,4345)
Avg Obs: 545.00
Min Obs: 545.00
Max Obs: 545.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 0.0921 0.0783 1.1761 0.2396 -0.0614 0.2455
black -0.1392 0.0236 -5.9049 0.0000 -0.1855 -0.0930
hisp 0.0160 0.0208 0.7703 0.4412 -0.0248 0.0568
exper 0.0672 0.0137 4.9095 0.0000 0.0404 0.0941
expersq -0.0024 0.0008 -2.9413 0.0033 -0.0040 -0.0008
married 0.1083 0.0157 6.8997 0.0000 0.0775 0.1390
educ 0.0913 0.0052 17.442 0.0000 0.0811 0.1016
union 0.1825 0.0172 10.635 0.0000 0.1488 0.2161
year.1981 0.0583 0.0304 1.9214 0.0548 -0.0012 0.1178
year.1982 0.0628 0.0332 1.8900 0.0588 -0.0023 0.1279
year.1983 0.0620 0.0367 1.6915 0.0908 -0.0099 0.1339
year.1984 0.0905 0.0401 2.2566 0.0241 0.0119 0.1691
year.1985 0.1092 0.0434 2.5200 0.0118 0.0243 0.1942
year.1986 0.1420 0.0464 3.0580 0.0022 0.0509 0.2330
year.1987 0.1738 0.0494 3.5165 0.0004 0.0769 0.2707
==============================================================================
Estimating parameters with uncorrelated effects¶
When modeling panel data it is common to consider models beyond what OLS will efficiently estimate. The most common are error component models which add an additional term to the standard OLS model,
where \(\alpha_i\) affects all values of entity i. When the \(\alpha_i\) are uncorrelated with the regressors in \(x_{it}\), a random effects model can be used to efficiently estimate parameters of this model.
Random effects¶
The random effects model is virtually identical to the pooled OLS model except that is accounts for the structure of the model and so is more efficient. Random effects uses a quasi-demeaning strategy which subtracts the time average of the within entity values to account for the common shock.
[3]:
from linearmodels.panel import RandomEffects
mod = RandomEffects(data.lwage, exog)
re_res = mod.fit()
print(re_res)
RandomEffects Estimation Summary
================================================================================
Dep. Variable: lwage R-squared: 0.1806
Estimator: RandomEffects R-squared (Between): 0.1853
No. Observations: 4360 R-squared (Within): 0.1799
Date: Tue, Sep 24 2024 R-squared (Overall): 0.1828
Time: 09:29:52 Log-likelihood -1622.5
Cov. Estimator: Unadjusted
F-statistic: 68.409
Entities: 545 P-value 0.0000
Avg Obs: 8.0000 Distribution: F(14,4345)
Min Obs: 8.0000
Max Obs: 8.0000 F-statistic (robust): 68.409
P-value 0.0000
Time periods: 8 Distribution: F(14,4345)
Avg Obs: 545.00
Min Obs: 545.00
Max Obs: 545.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 0.0234 0.1514 0.1546 0.8771 -0.2735 0.3203
black -0.1394 0.0480 -2.9054 0.0037 -0.2334 -0.0453
hisp 0.0217 0.0428 0.5078 0.6116 -0.0622 0.1057
exper 0.1058 0.0154 6.8706 0.0000 0.0756 0.1361
expersq -0.0047 0.0007 -6.8623 0.0000 -0.0061 -0.0034
married 0.0638 0.0168 3.8035 0.0001 0.0309 0.0967
educ 0.0919 0.0107 8.5744 0.0000 0.0709 0.1129
union 0.1059 0.0179 5.9289 0.0000 0.0709 0.1409
year.1981 0.0404 0.0247 1.6362 0.1019 -0.0080 0.0889
year.1982 0.0309 0.0324 0.9519 0.3412 -0.0327 0.0944
year.1983 0.0202 0.0417 0.4840 0.6284 -0.0616 0.1020
year.1984 0.0430 0.0515 0.8350 0.4037 -0.0580 0.1440
year.1985 0.0577 0.0615 0.9383 0.3482 -0.0629 0.1782
year.1986 0.0918 0.0716 1.2834 0.1994 -0.0485 0.2321
year.1987 0.1348 0.0817 1.6504 0.0989 -0.0253 0.2950
==============================================================================
The model fit is fairly similar, although the return to experience has changed substantially, as has its significance. This is partially explainable by the inclusion of the year dummies which will fit the trend in experience and so only the cross-sectional differences matter. The quasi-differencing in the random effects estimator depends on a quantity that depends on the relative variance of the idiosyncratic shock and the common shock. This can be accessed using variance_decomposition
.
[4]:
re_res.variance_decomposition
[4]:
Effects 0.106946
Residual 0.123324
Percent due to Effects 0.464438
Name: Variance Decomposition, dtype: float64
The coefficient \(\theta_i\) determines how much demeaning takes place. When this value is 1, the RE model reduces to the pooled model since this occurs when there is no variance in the effects. When panels are unbalanced it will vary across entities, but in this balanced panel all values are the same.
[5]:
re_res.theta.head()
[5]:
theta | |
---|---|
nr | |
13 | 0.645059 |
17 | 0.645059 |
18 | 0.645059 |
45 | 0.645059 |
110 | 0.645059 |
The between estimator¶
The between estimator is an alternative, usually less efficient estimator, can can be used to estimate model parameters. It is particular simple since it first computes the time averages of \(y\) and \(x\) and then runs a simple regression using these averages.
The year dummies are dropped since the averaging removes differences due to the year. expersq
was also dropped since it is fairly co-linear with exper
. These results are broadly similar to the previous models.
[6]:
from linearmodels.panel import BetweenOLS
exog_vars = ["black", "hisp", "exper", "married", "educ", "union"]
exog = sm.add_constant(data[exog_vars])
mod = BetweenOLS(data.lwage, exog)
be_res = mod.fit()
print(be_res)
BetweenOLS Estimation Summary
================================================================================
Dep. Variable: lwage R-squared: 0.2155
Estimator: BetweenOLS R-squared (Between): 0.2155
No. Observations: 545 R-squared (Within): 0.1141
Date: Tue, Sep 24 2024 R-squared (Overall): 0.1686
Time: 09:29:52 Log-likelihood -194.54
Cov. Estimator: Unadjusted
F-statistic: 24.633
Entities: 545 P-value 0.0000
Avg Obs: 8.0000 Distribution: F(6,538)
Min Obs: 8.0000
Max Obs: 8.0000 F-statistic (robust): 24.633
P-value 0.0000
Time periods: 8 Distribution: F(6,538)
Avg Obs: 545.00
Min Obs: 545.00
Max Obs: 545.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 0.2836 0.1784 1.5897 0.1125 -0.0668 0.6340
black -0.1414 0.0489 -2.8915 0.0040 -0.2375 -0.0453
hisp 0.0100 0.0426 0.2355 0.8139 -0.0737 0.0938
exper 0.0278 0.0113 2.4538 0.0144 0.0055 0.0501
married 0.1416 0.0412 3.4346 0.0006 0.0606 0.2226
educ 0.0913 0.0107 8.5159 0.0000 0.0702 0.1123
union 0.2587 0.0460 5.6214 0.0000 0.1683 0.3491
==============================================================================
Other options¶
The fit
method of BetweenOLS
takes an optional input reweight
which uses GLS-like weighting when panels are unbalanced. In this dataset the panel is balanced and so using this option would have no effect. By default this is False
and so the averages are directly used in OLS even when there are difference numbers of observations across entities.
Handling correlated effects¶
When effects are correlated with the regressors the RE and BE estimators are not consistent. The usual solution is to use Fixed Effects which are available in PanelOLS
. Fixed effects are called entity_effects
when applied to entities and time_effects
when applied to the time dimension:
Including fixed effects¶
Entity effects are included by setting entity_effects=True
. This is equivalent to including dummies for each entity. In this panel, this would add 545 dummy variables and estimation of the model would be considerably slower. PanelOLS
does not actually use dummy variables and instead uses group-wise demeaning to achieve the same effect.
Time-invariant Variables¶
Time-invariant variables cannot be included when using entity effects since, once demeaned, these will all be 0. Here exper
is also excluded since once entity effects and time dummies are incorporated, exper
is perfectly co-linear.
[7]:
from linearmodels.panel import PanelOLS
exog_vars = ["expersq", "union", "married", "year"]
exog = sm.add_constant(data[exog_vars])
mod = PanelOLS(data.lwage, exog, entity_effects=True)
fe_res = mod.fit()
print(fe_res)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: lwage R-squared: 0.1806
Estimator: PanelOLS R-squared (Between): -0.0052
No. Observations: 4360 R-squared (Within): 0.1806
Date: Tue, Sep 24 2024 R-squared (Overall): 0.0807
Time: 09:29:52 Log-likelihood -1324.8
Cov. Estimator: Unadjusted
F-statistic: 83.851
Entities: 545 P-value 0.0000
Avg Obs: 8.0000 Distribution: F(10,3805)
Min Obs: 8.0000
Max Obs: 8.0000 F-statistic (robust): 83.851
P-value 0.0000
Time periods: 8 Distribution: F(10,3805)
Avg Obs: 545.00
Min Obs: 545.00
Max Obs: 545.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 1.4260 0.0183 77.748 0.0000 1.3901 1.4620
expersq -0.0052 0.0007 -7.3612 0.0000 -0.0066 -0.0038
union 0.0800 0.0193 4.1430 0.0000 0.0421 0.1179
married 0.0467 0.0183 2.5494 0.0108 0.0108 0.0826
year.1981 0.1512 0.0219 6.8883 0.0000 0.1082 0.1942
year.1982 0.2530 0.0244 10.360 0.0000 0.2051 0.3008
year.1983 0.3544 0.0292 12.121 0.0000 0.2971 0.4118
year.1984 0.4901 0.0362 13.529 0.0000 0.4191 0.5611
year.1985 0.6175 0.0452 13.648 0.0000 0.5288 0.7062
year.1986 0.7655 0.0561 13.638 0.0000 0.6555 0.8755
year.1987 0.9250 0.0688 13.450 0.0000 0.7902 1.0599
==============================================================================
F-test for Poolability: 9.1568
P-value: 0.0000
Distribution: F(544,3805)
Included effects: Entity
Time Effects¶
Time effect can be added using time_effects=True
. Here the time dummies are removed. Note that the core coefficients are identical. The only change is in the test statistic for poolability since not the “effects” include both entity and time, whereas before only entity were included.
Effects vs Dummies¶
For variable which can be consistently estimated, such as time effects in the usual large N, fixed T panel, it is not necessary to include these as effects. They can instead be implemented as dummy variables.
[8]:
exog_vars = ["expersq", "union", "married"]
exog = sm.add_constant(data[exog_vars])
mod = PanelOLS(data.lwage, exog, entity_effects=True, time_effects=True)
fe_te_res = mod.fit()
print(fe_te_res)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: lwage R-squared: 0.0216
Estimator: PanelOLS R-squared (Between): -0.0052
No. Observations: 4360 R-squared (Within): -0.4809
Date: Tue, Sep 24 2024 R-squared (Overall): -0.2253
Time: 09:29:53 Log-likelihood -1324.8
Cov. Estimator: Unadjusted
F-statistic: 27.959
Entities: 545 P-value 0.0000
Avg Obs: 8.0000 Distribution: F(3,3805)
Min Obs: 8.0000
Max Obs: 8.0000 F-statistic (robust): 27.959
P-value 0.0000
Time periods: 8 Distribution: F(3,3805)
Avg Obs: 545.00
Min Obs: 545.00
Max Obs: 545.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 1.8706 0.0378 49.430 0.0000 1.7964 1.9448
expersq -0.0052 0.0007 -7.3612 0.0000 -0.0066 -0.0038
union 0.0800 0.0193 4.1430 0.0000 0.0421 0.1179
married 0.0467 0.0183 2.5494 0.0108 0.0108 0.0826
==============================================================================
F-test for Poolability: 10.067
P-value: 0.0000
Distribution: F(551,3805)
Included effects: Entity, Time
Other Options¶
Some additional options are available when using PanelOLS
and effects. The use_lsdv
can be used to force the model to be estimated using dummy variables. This is not necessary and will be slower in most circumstances. This options is primarily included for testing. auto_df
instructs PanelOLS
to determine the degree of freedom adjustment to make. Unlike most other estimators, the treatment of effects depends on the covariance estimator. For example, when using a classic covariance
estimator, effects count as lost degrees of freedom. When using entity effects and clustering by entity, they do not. Setting auto_df=False
will allow the entities to be counted or not, depending on the value of count_effects
.
Other Effects¶
PanelOLS
can be used to estimated models with up to 2-effects fixed effects. These can include any combination of
Entity effects
Time effects
Other effects
Other effects are specified using the other_effects
input of PanelOLS
. The input should have the same number of observations as the engog
and can have 1 or two columns. Below we reproduce the model fitted with time effects only using other_effects
to set the effect.
[9]:
from linearmodels.panel.data import PanelData
exog_vars = ["expersq", "union", "married"]
exog = sm.add_constant(data[exog_vars])
# Use the `PanelData` structure to simplify constructing the time IDs
time_ids = pd.DataFrame(
PanelData(data.lwage).time_ids, index=data.index, columns=["Other Effect"]
)
mod = PanelOLS(data.lwage, exog, entity_effects=True, other_effects=time_ids)
fe_oe_res = mod.fit()
print(fe_oe_res)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: lwage R-squared: 0.0216
Estimator: PanelOLS R-squared (Between): -0.0052
No. Observations: 4360 R-squared (Within): -0.4809
Date: Tue, Sep 24 2024 R-squared (Overall): -0.2253
Time: 09:29:53 Log-likelihood -1324.8
Cov. Estimator: Unadjusted
F-statistic: 27.959
Entities: 545 P-value 0.0000
Avg Obs: 8.0000 Distribution: F(3,3805)
Min Obs: 8.0000
Max Obs: 8.0000 F-statistic (robust): 27.959
P-value 0.0000
Time periods: 8 Distribution: F(3,3805)
Avg Obs: 545.00
Min Obs: 545.00
Max Obs: 545.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 1.8706 0.0378 49.430 0.0000 1.7964 1.9448
expersq -0.0052 0.0007 -7.3612 0.0000 -0.0066 -0.0038
union 0.0800 0.0193 4.1430 0.0000 0.0421 0.1179
married 0.0467 0.0183 2.5494 0.0108 0.0108 0.0826
==============================================================================
F-test for Poolability: 10.067
P-value: 0.0000
Distribution: F(551,3805)
Included effects: Entity, Other Effect (Other Effect)
Model includes 1 other effect
Other Effect Observations per group (Other Effect):
Avg Obs: 545.00, Min Obs: 545.00, Max Obs: 545.00, Groups: 8
Using first differences¶
First differencing is an alternative to using fixed effects when there might be correlation. When using first differences, time-invariant variables must be excluded. Additionally, only one linear time-trending variable can be included since this will look like a constant. This variable will soak up all time-trends in the data, and so interpretations of these variable can be challenging.
[10]:
from linearmodels.panel import FirstDifferenceOLS
exog_vars = ["exper", "expersq", "union", "married"]
exog = data[exog_vars]
mod = FirstDifferenceOLS(data.lwage, exog)
fd_res = mod.fit()
print(fd_res)
FirstDifferenceOLS Estimation Summary
================================================================================
Dep. Variable: lwage R-squared: 0.0268
Estimator: FirstDifferenceOLS R-squared (Between): 0.5491
No. Observations: 3815 R-squared (Within): 0.1763
Date: Tue, Sep 24 2024 R-squared (Overall): 0.5328
Time: 09:29:53 Log-likelihood -2305.5
Cov. Estimator: Unadjusted
F-statistic: 26.208
Entities: 545 P-value 0.0000
Avg Obs: 8.0000 Distribution: F(4,3811)
Min Obs: 8.0000
Max Obs: 8.0000 F-statistic (robust): 26.208
P-value 0.0000
Time periods: 8 Distribution: F(4,3811)
Avg Obs: 545.00
Min Obs: 545.00
Max Obs: 545.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
exper 0.1158 0.0196 5.9096 0.0000 0.0773 0.1542
expersq -0.0039 0.0014 -2.8005 0.0051 -0.0066 -0.0012
union 0.0428 0.0197 2.1767 0.0296 0.0042 0.0813
married 0.0381 0.0229 1.6633 0.0963 -0.0068 0.0831
==============================================================================
Comparing models¶
Model results can be compared using compare
. compare
accepts lists of results, a dictionary of results where the key is interpreted as the model name.
[11]:
from linearmodels.panel import compare
print(compare({"BE": be_res, "RE": re_res, "Pooled": pooled_res}))
Model Comparison
===============================================================
BE RE Pooled
---------------------------------------------------------------
Dep. Variable lwage lwage lwage
Estimator BetweenOLS RandomEffects PooledOLS
No. Observations 545 4360 4360
Cov. Est. Unadjusted Unadjusted Unadjusted
R-squared 0.2155 0.1806 0.1893
R-Squared (Within) 0.1141 0.1799 0.1692
R-Squared (Between) 0.2155 0.1853 0.2066
R-Squared (Overall) 0.1686 0.1828 0.1893
F-statistic 24.633 68.409 72.459
P-value (F-stat) 0.0000 0.0000 0.0000
===================== ============ =============== ============
const 0.2836 0.0234 0.0921
(1.5897) (0.1546) (1.1761)
black -0.1414 -0.1394 -0.1392
(-2.8915) (-2.9054) (-5.9049)
hisp 0.0100 0.0217 0.0160
(0.2355) (0.5078) (0.7703)
exper 0.0278 0.1058 0.0672
(2.4538) (6.8706) (4.9095)
married 0.1416 0.0638 0.1083
(3.4346) (3.8035) (6.8997)
educ 0.0913 0.0919 0.0913
(8.5159) (8.5744) (17.442)
union 0.2587 0.1059 0.1825
(5.6214) (5.9289) (10.635)
expersq -0.0047 -0.0024
(-6.8623) (-2.9413)
year.1981 0.0404 0.0583
(1.6362) (1.9214)
year.1982 0.0309 0.0628
(0.9519) (1.8900)
year.1983 0.0202 0.0620
(0.4840) (1.6915)
year.1984 0.0430 0.0905
(0.8350) (2.2566)
year.1985 0.0577 0.1092
(0.9383) (2.5200)
year.1986 0.0918 0.1420
(1.2834) (3.0580)
year.1987 0.1348 0.1738
(1.6504) (3.5165)
---------------------------------------------------------------
T-stats reported in parentheses
Covariance options¶
Heteroskedasticity Robust Covariance¶
White”s robust covariance can be used by setting cov_type="robust
. This estimator adds some robustness against certain types of specification issues but should not be used when using fixed effects (entity effects) since it is no longer robust. Instead a clustered covariance is required.
[12]:
exog_vars = ["black", "hisp", "exper", "expersq", "married", "educ", "union"]
exog = sm.add_constant(data[exog_vars])
mod = PooledOLS(data.lwage, exog)
robust = mod.fit(cov_type="robust")
Clustered by Entity¶
The usual variable to cluster are are entity or entity and time. The can be implemented using cov_type="clustered"
and the additional keyword arguments cluster_entity=True
and/or cluster_time=True
.
[13]:
clust_entity = mod.fit(cov_type="clustered", cluster_entity=True)
This next example clusters by both.
[14]:
clust_entity_time = mod.fit(
cov_type="clustered", cluster_entity=True, cluster_time=True
)
An OrderedDict
is used to hold the results for comparing models. This allows the models to be named as well as for the order of the models to be specified. A standard dict
will produce effectively random order.
Clustering on entity reduced the t-stats across the board. This suggests there is important correlation in the residuals per entity. Clustering by both also decreases the t-stats which suggests that there is cross-sectional dependence in the data. Note: clustering by entity addresses correlation across time and clustering by time controls for correlation between entities in a time period.
[15]:
from collections import OrderedDict
res = OrderedDict()
res["Robust"] = robust
res["Entity"] = clust_entity
res["Entity-Time"] = clust_entity_time
print(compare(res))
Model Comparison
=========================================================
Robust Entity Entity-Time
---------------------------------------------------------
Dep. Variable lwage lwage lwage
Estimator PooledOLS PooledOLS PooledOLS
No. Observations 4360 4360 4360
Cov. Est. Robust Clustered Clustered
R-squared 0.1866 0.1866 0.1866
R-Squared (Within) 0.1679 0.1679 0.1679
R-Squared (Between) 0.2027 0.2027 0.2027
R-Squared (Overall) 0.1866 0.1866 0.1866
F-statistic 142.61 142.61 142.61
P-value (F-stat) 0.0000 0.0000 0.0000
===================== =========== =========== ===========
const -0.0347 -0.0347 -0.0347
(-0.5360) (-0.2892) (-0.3145)
black -0.1438 -0.1438 -0.1438
(-5.9045) (-2.8727) (-3.0067)
hisp 0.0157 0.0157 0.0157
(0.7952) (0.4008) (0.4428)
exper 0.0892 0.0892 0.0892
(8.7881) (7.1728) (6.3223)
expersq -0.0028 -0.0028 -0.0028
(-4.1934) (-3.2747) (-3.1571)
married 0.1077 0.1077 0.1077
(7.0525) (4.1314) (4.8989)
educ 0.0994 0.0994 0.0994
(21.626) (10.802) (12.296)
union 0.1801 0.1801 0.1801
(11.087) (6.5343) (6.6732)
---------------------------------------------------------
T-stats reported in parentheses
Other clusters¶
Other clusters can be used by directly passing integer arrays (1 or 2 columns, or a 1-d array) using the input clusters
. This example clustered by occupation, which is probably not a reliable variable to cluster on since there are only 9 groups and the usual theory for clustered standard errors requires that the number of clusters is large.
[16]:
clust_entity = mod.fit(cov_type="clustered", clusters=data.occupation)
print(data.occupation.value_counts())
print(clust_entity)
occupation
5 934
6 881
9 509
4 486
1 453
7 401
2 399
3 233
8 64
Name: count, dtype: int64
PooledOLS Estimation Summary
================================================================================
Dep. Variable: lwage R-squared: 0.1866
Estimator: PooledOLS R-squared (Between): 0.2027
No. Observations: 4360 R-squared (Within): 0.1679
Date: Tue, Sep 24 2024 R-squared (Overall): 0.1866
Time: 09:29:53 Log-likelihood -2989.2
Cov. Estimator: Clustered
F-statistic: 142.61
Entities: 545 P-value 0.0000
Avg Obs: 8.0000 Distribution: F(7,4352)
Min Obs: 8.0000
Max Obs: 8.0000 F-statistic (robust): 116.58
P-value 0.0000
Time periods: 8 Distribution: F(7,4352)
Avg Obs: 545.00
Min Obs: 545.00
Max Obs: 545.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const -0.0347 0.1479 -0.2346 0.8145 -0.3247 0.2553
black -0.1438 0.0297 -4.8469 0.0000 -0.2020 -0.0857
hisp 0.0157 0.0266 0.5892 0.5557 -0.0365 0.0679
exper 0.0892 0.0134 6.6513 0.0000 0.0629 0.1155
expersq -0.0028 0.0009 -3.2442 0.0012 -0.0046 -0.0011
married 0.1077 0.0139 7.7322 0.0000 0.0804 0.1350
educ 0.0994 0.0112 8.8846 0.0000 0.0775 0.1213
union 0.1801 0.0320 5.6323 0.0000 0.1174 0.2428
==============================================================================