Examples

These examples cover the models available for estimating panel models. The initial examples all ignore covariance options and so use the default classic covariance which is appropriate for homoskedastic data. The alternative covariance options are described at the end of this document.

Loading data

These examples all make use of the wage panel from

  1. Vella and M. Verbeek (1998), “Whose Wages Do Unions Raise? A Dynamic Model of Unionism and Wage Rate Determination for Young Men,” Journal of Applied Econometrics 13, 163-183.

The data set consists of wages and characteristics for men during the 1980s. The entity identifier is nr and the time identified is year. This data is used extensively in Chapter 14 of Introduction to Econometrics by Jeffrey Wooldridge.

Here a MultiIndex DataFrame is used to hold the data in a format that can be understood as a panel. Before setting the index, a year Categorical is created which facilitated making dummies.

[1]:
import pandas as pd

from linearmodels.datasets import wage_panel

data = wage_panel.load()
year = pd.Categorical(data.year)
data = data.set_index(["nr", "year"])
data["year"] = year
print(wage_panel.DESCR)
print(data.head())

F. Vella and M. Verbeek (1998), "Whose Wages Do Unions Raise? A Dynamic Model
of Unionism and Wage Rate Determination for Young Men," Journal of Applied
Econometrics 13, 163-183.

nr                       person identifier
year                     1980 to 1987
black                    =1 if black
exper                    labor market experience
hisp                     =1 if Hispanic
hours                    annual hours worked
married                  =1 if married
educ                     years of schooling
union                    =1 if in union
lwage                    log(wage)
expersq                  exper^2
occupation               Occupation code

         black  exper  hisp  hours  married  educ  union     lwage  expersq  \
nr year
13 1980      0      1     0   2672        0    14      0  1.197540        1
   1981      0      2     0   2320        0    14      1  1.853060        4
   1982      0      3     0   2940        0    14      0  1.344462        9
   1983      0      4     0   2960        0    14      0  1.433213       16
   1984      0      5     0   3071        0    14      0  1.568125       25

         occupation  year
nr year
13 1980           9  1980
   1981           9  1981
   1982           9  1982
   1983           9  1983
   1984           5  1984

Basic regression on panel data

PooledOLS is just plain OLS that understands that various panel data structures. It is useful as a base model. Here the log wage is modeled using all of the variables and time dummies.

[2]:
import statsmodels.api as sm

from linearmodels.panel import PooledOLS

exog_vars = ["black", "hisp", "exper", "expersq", "married", "educ", "union", "year"]
exog = sm.add_constant(data[exog_vars])
mod = PooledOLS(data.lwage, exog)
pooled_res = mod.fit()
print(pooled_res)
                          PooledOLS Estimation Summary
================================================================================
Dep. Variable:                  lwage   R-squared:                        0.1893
Estimator:                  PooledOLS   R-squared (Between):              0.2066
No. Observations:                4360   R-squared (Within):               0.1692
Date:                Sun, Jan 12 2025   R-squared (Overall):              0.1893
Time:                        13:43:36   Log-likelihood                   -2982.0
Cov. Estimator:            Unadjusted
                                        F-statistic:                      72.459
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                 F(14,4345)
Min Obs:                       8.0000
Max Obs:                       8.0000   F-statistic (robust):             72.459
                                        P-value                           0.0000
Time periods:                       8   Distribution:                 F(14,4345)
Avg Obs:                       545.00
Min Obs:                       545.00
Max Obs:                       545.00

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          0.0921     0.0783     1.1761     0.2396     -0.0614      0.2455
black         -0.1392     0.0236    -5.9049     0.0000     -0.1855     -0.0930
hisp           0.0160     0.0208     0.7703     0.4412     -0.0248      0.0568
exper          0.0672     0.0137     4.9095     0.0000      0.0404      0.0941
expersq       -0.0024     0.0008    -2.9413     0.0033     -0.0040     -0.0008
married        0.1083     0.0157     6.8997     0.0000      0.0775      0.1390
educ           0.0913     0.0052     17.442     0.0000      0.0811      0.1016
union          0.1825     0.0172     10.635     0.0000      0.1488      0.2161
year.1981      0.0583     0.0304     1.9214     0.0548     -0.0012      0.1178
year.1982      0.0628     0.0332     1.8900     0.0588     -0.0023      0.1279
year.1983      0.0620     0.0367     1.6915     0.0908     -0.0099      0.1339
year.1984      0.0905     0.0401     2.2566     0.0241      0.0119      0.1691
year.1985      0.1092     0.0434     2.5200     0.0118      0.0243      0.1942
year.1986      0.1420     0.0464     3.0580     0.0022      0.0509      0.2330
year.1987      0.1738     0.0494     3.5165     0.0004      0.0769      0.2707
==============================================================================

Estimating parameters with uncorrelated effects

When modeling panel data it is common to consider models beyond what OLS will efficiently estimate. The most common are error component models which add an additional term to the standard OLS model,

\[y_{it} = x_{it}\beta + \alpha_{i} + \epsilon_{it}\]

where \(\alpha_i\) affects all values of entity i. When the \(\alpha_i\) are uncorrelated with the regressors in \(x_{it}\), a random effects model can be used to efficiently estimate parameters of this model.

Random effects

The random effects model is virtually identical to the pooled OLS model except that is accounts for the structure of the model and so is more efficient. Random effects uses a quasi-demeaning strategy which subtracts the time average of the within entity values to account for the common shock.

[3]:
from linearmodels.panel import RandomEffects

mod = RandomEffects(data.lwage, exog)
re_res = mod.fit()
print(re_res)
                        RandomEffects Estimation Summary
================================================================================
Dep. Variable:                  lwage   R-squared:                        0.1806
Estimator:              RandomEffects   R-squared (Between):              0.1853
No. Observations:                4360   R-squared (Within):               0.1799
Date:                Sun, Jan 12 2025   R-squared (Overall):              0.1828
Time:                        13:43:37   Log-likelihood                   -1622.5
Cov. Estimator:            Unadjusted
                                        F-statistic:                      68.409
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                 F(14,4345)
Min Obs:                       8.0000
Max Obs:                       8.0000   F-statistic (robust):             68.409
                                        P-value                           0.0000
Time periods:                       8   Distribution:                 F(14,4345)
Avg Obs:                       545.00
Min Obs:                       545.00
Max Obs:                       545.00

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          0.0234     0.1514     0.1546     0.8771     -0.2735      0.3203
black         -0.1394     0.0480    -2.9054     0.0037     -0.2334     -0.0453
hisp           0.0217     0.0428     0.5078     0.6116     -0.0622      0.1057
exper          0.1058     0.0154     6.8706     0.0000      0.0756      0.1361
expersq       -0.0047     0.0007    -6.8623     0.0000     -0.0061     -0.0034
married        0.0638     0.0168     3.8035     0.0001      0.0309      0.0967
educ           0.0919     0.0107     8.5744     0.0000      0.0709      0.1129
union          0.1059     0.0179     5.9289     0.0000      0.0709      0.1409
year.1981      0.0404     0.0247     1.6362     0.1019     -0.0080      0.0889
year.1982      0.0309     0.0324     0.9519     0.3412     -0.0327      0.0944
year.1983      0.0202     0.0417     0.4840     0.6284     -0.0616      0.1020
year.1984      0.0430     0.0515     0.8350     0.4037     -0.0580      0.1440
year.1985      0.0577     0.0615     0.9383     0.3482     -0.0629      0.1782
year.1986      0.0918     0.0716     1.2834     0.1994     -0.0485      0.2321
year.1987      0.1348     0.0817     1.6504     0.0989     -0.0253      0.2950
==============================================================================

The model fit is fairly similar, although the return to experience has changed substantially, as has its significance. This is partially explainable by the inclusion of the year dummies which will fit the trend in experience and so only the cross-sectional differences matter. The quasi-differencing in the random effects estimator depends on a quantity that depends on the relative variance of the idiosyncratic shock and the common shock. This can be accessed using variance_decomposition.

[4]:
re_res.variance_decomposition
[4]:
Effects                   0.106946
Residual                  0.123324
Percent due to Effects    0.464438
Name: Variance Decomposition, dtype: float64

The coefficient \(\theta_i\) determines how much demeaning takes place. When this value is 1, the RE model reduces to the pooled model since this occurs when there is no variance in the effects. When panels are unbalanced it will vary across entities, but in this balanced panel all values are the same.

[5]:
re_res.theta.head()
[5]:
theta
nr
13 0.645059
17 0.645059
18 0.645059
45 0.645059
110 0.645059

The between estimator

The between estimator is an alternative, usually less efficient estimator, can can be used to estimate model parameters. It is particular simple since it first computes the time averages of \(y\) and \(x\) and then runs a simple regression using these averages.

The year dummies are dropped since the averaging removes differences due to the year. expersq was also dropped since it is fairly co-linear with exper. These results are broadly similar to the previous models.

[6]:
from linearmodels.panel import BetweenOLS

exog_vars = ["black", "hisp", "exper", "married", "educ", "union"]
exog = sm.add_constant(data[exog_vars])
mod = BetweenOLS(data.lwage, exog)
be_res = mod.fit()
print(be_res)
                         BetweenOLS Estimation Summary
================================================================================
Dep. Variable:                  lwage   R-squared:                        0.2155
Estimator:                 BetweenOLS   R-squared (Between):              0.2155
No. Observations:                 545   R-squared (Within):               0.1141
Date:                Sun, Jan 12 2025   R-squared (Overall):              0.1686
Time:                        13:43:37   Log-likelihood                   -194.54
Cov. Estimator:            Unadjusted
                                        F-statistic:                      24.633
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                   F(6,538)
Min Obs:                       8.0000
Max Obs:                       8.0000   F-statistic (robust):             24.633
                                        P-value                           0.0000
Time periods:                       8   Distribution:                   F(6,538)
Avg Obs:                       545.00
Min Obs:                       545.00
Max Obs:                       545.00

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          0.2836     0.1784     1.5897     0.1125     -0.0668      0.6340
black         -0.1414     0.0489    -2.8915     0.0040     -0.2375     -0.0453
hisp           0.0100     0.0426     0.2355     0.8139     -0.0737      0.0938
exper          0.0278     0.0113     2.4538     0.0144      0.0055      0.0501
married        0.1416     0.0412     3.4346     0.0006      0.0606      0.2226
educ           0.0913     0.0107     8.5159     0.0000      0.0702      0.1123
union          0.2587     0.0460     5.6214     0.0000      0.1683      0.3491
==============================================================================

Other options

The fit method of BetweenOLS takes an optional input reweight which uses GLS-like weighting when panels are unbalanced. In this dataset the panel is balanced and so using this option would have no effect. By default this is False and so the averages are directly used in OLS even when there are difference numbers of observations across entities.

Handling correlated effects

When effects are correlated with the regressors the RE and BE estimators are not consistent. The usual solution is to use Fixed Effects which are available in PanelOLS. Fixed effects are called entity_effects when applied to entities and time_effects when applied to the time dimension:

Including fixed effects

Entity effects are included by setting entity_effects=True. This is equivalent to including dummies for each entity. In this panel, this would add 545 dummy variables and estimation of the model would be considerably slower. PanelOLS does not actually use dummy variables and instead uses group-wise demeaning to achieve the same effect.

Time-invariant Variables

Time-invariant variables cannot be included when using entity effects since, once demeaned, these will all be 0. Here exper is also excluded since once entity effects and time dummies are incorporated, exper is perfectly co-linear.

[7]:
from linearmodels.panel import PanelOLS

exog_vars = ["expersq", "union", "married", "year"]
exog = sm.add_constant(data[exog_vars])
mod = PanelOLS(data.lwage, exog, entity_effects=True)
fe_res = mod.fit()
print(fe_res)
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                  lwage   R-squared:                        0.1806
Estimator:                   PanelOLS   R-squared (Between):             -0.0052
No. Observations:                4360   R-squared (Within):               0.1806
Date:                Sun, Jan 12 2025   R-squared (Overall):              0.0807
Time:                        13:43:37   Log-likelihood                   -1324.8
Cov. Estimator:            Unadjusted
                                        F-statistic:                      83.851
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                 F(10,3805)
Min Obs:                       8.0000
Max Obs:                       8.0000   F-statistic (robust):             83.851
                                        P-value                           0.0000
Time periods:                       8   Distribution:                 F(10,3805)
Avg Obs:                       545.00
Min Obs:                       545.00
Max Obs:                       545.00

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          1.4260     0.0183     77.748     0.0000      1.3901      1.4620
expersq       -0.0052     0.0007    -7.3612     0.0000     -0.0066     -0.0038
union          0.0800     0.0193     4.1430     0.0000      0.0421      0.1179
married        0.0467     0.0183     2.5494     0.0108      0.0108      0.0826
year.1981      0.1512     0.0219     6.8883     0.0000      0.1082      0.1942
year.1982      0.2530     0.0244     10.360     0.0000      0.2051      0.3008
year.1983      0.3544     0.0292     12.121     0.0000      0.2971      0.4118
year.1984      0.4901     0.0362     13.529     0.0000      0.4191      0.5611
year.1985      0.6175     0.0452     13.648     0.0000      0.5288      0.7062
year.1986      0.7655     0.0561     13.638     0.0000      0.6555      0.8755
year.1987      0.9250     0.0688     13.450     0.0000      0.7902      1.0599
==============================================================================

F-test for Poolability: 9.1568
P-value: 0.0000
Distribution: F(544,3805)

Included effects: Entity

Time Effects

Time effect can be added using time_effects=True. Here the time dummies are removed. Note that the core coefficients are identical. The only change is in the test statistic for poolability since not the “effects” include both entity and time, whereas before only entity were included.

Effects vs Dummies

For variable which can be consistently estimated, such as time effects in the usual large N, fixed T panel, it is not necessary to include these as effects. They can instead be implemented as dummy variables.

[8]:
exog_vars = ["expersq", "union", "married"]
exog = sm.add_constant(data[exog_vars])
mod = PanelOLS(data.lwage, exog, entity_effects=True, time_effects=True)
fe_te_res = mod.fit()
print(fe_te_res)
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                  lwage   R-squared:                        0.0216
Estimator:                   PanelOLS   R-squared (Between):             -0.0052
No. Observations:                4360   R-squared (Within):              -0.4809
Date:                Sun, Jan 12 2025   R-squared (Overall):             -0.2253
Time:                        13:43:37   Log-likelihood                   -1324.8
Cov. Estimator:            Unadjusted
                                        F-statistic:                      27.959
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                  F(3,3805)
Min Obs:                       8.0000
Max Obs:                       8.0000   F-statistic (robust):             27.959
                                        P-value                           0.0000
Time periods:                       8   Distribution:                  F(3,3805)
Avg Obs:                       545.00
Min Obs:                       545.00
Max Obs:                       545.00

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          1.8706     0.0378     49.430     0.0000      1.7964      1.9448
expersq       -0.0052     0.0007    -7.3612     0.0000     -0.0066     -0.0038
union          0.0800     0.0193     4.1430     0.0000      0.0421      0.1179
married        0.0467     0.0183     2.5494     0.0108      0.0108      0.0826
==============================================================================

F-test for Poolability: 10.067
P-value: 0.0000
Distribution: F(551,3805)

Included effects: Entity, Time

Other Options

Some additional options are available when using PanelOLS and effects. The use_lsdv can be used to force the model to be estimated using dummy variables. This is not necessary and will be slower in most circumstances. This options is primarily included for testing. auto_df instructs PanelOLS to determine the degree of freedom adjustment to make. Unlike most other estimators, the treatment of effects depends on the covariance estimator. For example, when using a classic covariance estimator, effects count as lost degrees of freedom. When using entity effects and clustering by entity, they do not. Setting auto_df=False will allow the entities to be counted or not, depending on the value of count_effects.

Other Effects

PanelOLS can be used to estimated models with up to 2-effects fixed effects. These can include any combination of

  • Entity effects

  • Time effects

  • Other effects

Other effects are specified using the other_effects input of PanelOLS. The input should have the same number of observations as the engog and can have 1 or two columns. Below we reproduce the model fitted with time effects only using other_effects to set the effect.

[9]:
from linearmodels.panel.data import PanelData

exog_vars = ["expersq", "union", "married"]
exog = sm.add_constant(data[exog_vars])
# Use the `PanelData` structure to simplify constructing the time IDs
time_ids = pd.DataFrame(
    PanelData(data.lwage).time_ids, index=data.index, columns=["Other Effect"]
)

mod = PanelOLS(data.lwage, exog, entity_effects=True, other_effects=time_ids)
fe_oe_res = mod.fit()
print(fe_oe_res)
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                  lwage   R-squared:                        0.0216
Estimator:                   PanelOLS   R-squared (Between):             -0.0052
No. Observations:                4360   R-squared (Within):              -0.4809
Date:                Sun, Jan 12 2025   R-squared (Overall):             -0.2253
Time:                        13:43:37   Log-likelihood                   -1324.8
Cov. Estimator:            Unadjusted
                                        F-statistic:                      27.959
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                  F(3,3805)
Min Obs:                       8.0000
Max Obs:                       8.0000   F-statistic (robust):             27.959
                                        P-value                           0.0000
Time periods:                       8   Distribution:                  F(3,3805)
Avg Obs:                       545.00
Min Obs:                       545.00
Max Obs:                       545.00

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          1.8706     0.0378     49.430     0.0000      1.7964      1.9448
expersq       -0.0052     0.0007    -7.3612     0.0000     -0.0066     -0.0038
union          0.0800     0.0193     4.1430     0.0000      0.0421      0.1179
married        0.0467     0.0183     2.5494     0.0108      0.0108      0.0826
==============================================================================

F-test for Poolability: 10.067
P-value: 0.0000
Distribution: F(551,3805)

Included effects: Entity, Other Effect (Other Effect)
Model includes 1 other effect
Other Effect Observations per group (Other Effect):
Avg Obs: 545.00, Min Obs: 545.00, Max Obs: 545.00, Groups: 8

Using first differences

First differencing is an alternative to using fixed effects when there might be correlation. When using first differences, time-invariant variables must be excluded. Additionally, only one linear time-trending variable can be included since this will look like a constant. This variable will soak up all time-trends in the data, and so interpretations of these variable can be challenging.

[10]:
from linearmodels.panel import FirstDifferenceOLS

exog_vars = ["exper", "expersq", "union", "married"]
exog = data[exog_vars]
mod = FirstDifferenceOLS(data.lwage, exog)
fd_res = mod.fit()
print(fd_res)
                     FirstDifferenceOLS Estimation Summary
================================================================================
Dep. Variable:                  lwage   R-squared:                        0.0268
Estimator:         FirstDifferenceOLS   R-squared (Between):              0.5491
No. Observations:                3815   R-squared (Within):               0.1763
Date:                Sun, Jan 12 2025   R-squared (Overall):              0.5328
Time:                        13:43:37   Log-likelihood                   -2305.5
Cov. Estimator:            Unadjusted
                                        F-statistic:                      26.208
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                  F(4,3811)
Min Obs:                       8.0000
Max Obs:                       8.0000   F-statistic (robust):             26.208
                                        P-value                           0.0000
Time periods:                       8   Distribution:                  F(4,3811)
Avg Obs:                       545.00
Min Obs:                       545.00
Max Obs:                       545.00

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exper          0.1158     0.0196     5.9096     0.0000      0.0773      0.1542
expersq       -0.0039     0.0014    -2.8005     0.0051     -0.0066     -0.0012
union          0.0428     0.0197     2.1767     0.0296      0.0042      0.0813
married        0.0381     0.0229     1.6633     0.0963     -0.0068      0.0831
==============================================================================

Comparing models

Model results can be compared using compare. compare accepts lists of results, a dictionary of results where the key is interpreted as the model name.

[11]:
from linearmodels.panel import compare

print(compare({"BE": be_res, "RE": re_res, "Pooled": pooled_res}))
                        Model Comparison
===============================================================
                                BE              RE       Pooled
---------------------------------------------------------------
Dep. Variable                lwage           lwage        lwage
Estimator               BetweenOLS   RandomEffects    PooledOLS
No. Observations               545            4360         4360
Cov. Est.               Unadjusted      Unadjusted   Unadjusted
R-squared                   0.2155          0.1806       0.1893
R-Squared (Within)          0.1141          0.1799       0.1692
R-Squared (Between)         0.2155          0.1853       0.2066
R-Squared (Overall)         0.1686          0.1828       0.1893
F-statistic                 24.633          68.409       72.459
P-value (F-stat)            0.0000          0.0000       0.0000
===================== ============ =============== ============
const                       0.2836          0.0234       0.0921
                          (1.5897)        (0.1546)     (1.1761)
black                      -0.1414         -0.1394      -0.1392
                         (-2.8915)       (-2.9054)    (-5.9049)
hisp                        0.0100          0.0217       0.0160
                          (0.2355)        (0.5078)     (0.7703)
exper                       0.0278          0.1058       0.0672
                          (2.4538)        (6.8706)     (4.9095)
married                     0.1416          0.0638       0.1083
                          (3.4346)        (3.8035)     (6.8997)
educ                        0.0913          0.0919       0.0913
                          (8.5159)        (8.5744)     (17.442)
union                       0.2587          0.1059       0.1825
                          (5.6214)        (5.9289)     (10.635)
expersq                                    -0.0047      -0.0024
                                         (-6.8623)    (-2.9413)
year.1981                                   0.0404       0.0583
                                          (1.6362)     (1.9214)
year.1982                                   0.0309       0.0628
                                          (0.9519)     (1.8900)
year.1983                                   0.0202       0.0620
                                          (0.4840)     (1.6915)
year.1984                                   0.0430       0.0905
                                          (0.8350)     (2.2566)
year.1985                                   0.0577       0.1092
                                          (0.9383)     (2.5200)
year.1986                                   0.0918       0.1420
                                          (1.2834)     (3.0580)
year.1987                                   0.1348       0.1738
                                          (1.6504)     (3.5165)
---------------------------------------------------------------

T-stats reported in parentheses

Covariance options

Heteroskedasticity Robust Covariance

White”s robust covariance can be used by setting cov_type="robust. This estimator adds some robustness against certain types of specification issues but should not be used when using fixed effects (entity effects) since it is no longer robust. Instead a clustered covariance is required.

[12]:
exog_vars = ["black", "hisp", "exper", "expersq", "married", "educ", "union"]
exog = sm.add_constant(data[exog_vars])
mod = PooledOLS(data.lwage, exog)
robust = mod.fit(cov_type="robust")

Clustered by Entity

The usual variable to cluster are are entity or entity and time. The can be implemented using cov_type="clustered" and the additional keyword arguments cluster_entity=True and/or cluster_time=True.

[13]:
clust_entity = mod.fit(cov_type="clustered", cluster_entity=True)

This next example clusters by both.

[14]:
clust_entity_time = mod.fit(
    cov_type="clustered", cluster_entity=True, cluster_time=True
)

An OrderedDict is used to hold the results for comparing models. This allows the models to be named as well as for the order of the models to be specified. A standard dict will produce effectively random order.

Clustering on entity reduced the t-stats across the board. This suggests there is important correlation in the residuals per entity. Clustering by both also decreases the t-stats which suggests that there is cross-sectional dependence in the data. Note: clustering by entity addresses correlation across time and clustering by time controls for correlation between entities in a time period.

[15]:
from collections import OrderedDict

res = OrderedDict()
res["Robust"] = robust
res["Entity"] = clust_entity
res["Entity-Time"] = clust_entity_time
print(compare(res))
                     Model Comparison
=========================================================
                           Robust      Entity Entity-Time
---------------------------------------------------------
Dep. Variable               lwage       lwage       lwage
Estimator               PooledOLS   PooledOLS   PooledOLS
No. Observations             4360        4360        4360
Cov. Est.                  Robust   Clustered   Clustered
R-squared                  0.1866      0.1866      0.1866
R-Squared (Within)         0.1679      0.1679      0.1679
R-Squared (Between)        0.2027      0.2027      0.2027
R-Squared (Overall)        0.1866      0.1866      0.1866
F-statistic                142.61      142.61      142.61
P-value (F-stat)           0.0000      0.0000      0.0000
===================== =========== =========== ===========
const                     -0.0347     -0.0347     -0.0347
                        (-0.5360)   (-0.2892)   (-0.3145)
black                     -0.1438     -0.1438     -0.1438
                        (-5.9045)   (-2.8727)   (-3.0067)
hisp                       0.0157      0.0157      0.0157
                         (0.7952)    (0.4008)    (0.4428)
exper                      0.0892      0.0892      0.0892
                         (8.7881)    (7.1728)    (6.3223)
expersq                   -0.0028     -0.0028     -0.0028
                        (-4.1934)   (-3.2747)   (-3.1571)
married                    0.1077      0.1077      0.1077
                         (7.0525)    (4.1314)    (4.8989)
educ                       0.0994      0.0994      0.0994
                         (21.626)    (10.802)    (12.296)
union                      0.1801      0.1801      0.1801
                         (11.087)    (6.5343)    (6.6732)
---------------------------------------------------------

T-stats reported in parentheses

Other clusters

Other clusters can be used by directly passing integer arrays (1 or 2 columns, or a 1-d array) using the input clusters. This example clustered by occupation, which is probably not a reliable variable to cluster on since there are only 9 groups and the usual theory for clustered standard errors requires that the number of clusters is large.

[16]:
clust_entity = mod.fit(cov_type="clustered", clusters=data.occupation)
print(data.occupation.value_counts())
print(clust_entity)
occupation
5    934
6    881
9    509
4    486
1    453
7    401
2    399
3    233
8     64
Name: count, dtype: int64
                          PooledOLS Estimation Summary
================================================================================
Dep. Variable:                  lwage   R-squared:                        0.1866
Estimator:                  PooledOLS   R-squared (Between):              0.2027
No. Observations:                4360   R-squared (Within):               0.1679
Date:                Sun, Jan 12 2025   R-squared (Overall):              0.1866
Time:                        13:43:38   Log-likelihood                   -2989.2
Cov. Estimator:             Clustered
                                        F-statistic:                      142.61
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                  F(7,4352)
Min Obs:                       8.0000
Max Obs:                       8.0000   F-statistic (robust):             116.58
                                        P-value                           0.0000
Time periods:                       8   Distribution:                  F(7,4352)
Avg Obs:                       545.00
Min Obs:                       545.00
Max Obs:                       545.00

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const         -0.0347     0.1479    -0.2346     0.8145     -0.3247      0.2553
black         -0.1438     0.0297    -4.8469     0.0000     -0.2020     -0.0857
hisp           0.0157     0.0266     0.5892     0.5557     -0.0365      0.0679
exper          0.0892     0.0134     6.6513     0.0000      0.0629      0.1155
expersq       -0.0028     0.0009    -3.2442     0.0012     -0.0046     -0.0011
married        0.1077     0.0139     7.7322     0.0000      0.0804      0.1350
educ           0.0994     0.0112     8.8846     0.0000      0.0775      0.1213
union          0.1801     0.0320     5.6323     0.0000      0.1174      0.2428
==============================================================================