Data Formats for Panel Data Analysis

Panel data can be passed to the models in two primary formats:

  • MultiIndex DataFrames where the outer index is the entity and the inner is the time index. This requires using pandas.

  • 3D structures where dimension 0 (outer) is the variable, dimension 1 is the time index and dimension 2 is the entity index. It is also possible to use a 2D data structure with dimensions (t, n), which is treated as a 3D data structure with dimensions (1, t, n). These 3D data structures can be pandas, NumPy or xarray.
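
The two layouts described above can be sketched with a small synthetic panel. The entity labels, years and the variable name x are made up for illustration; only pandas and NumPy are assumed.

```python
import numpy as np
import pandas as pd

# Synthetic panel: 2 entities observed over 3 time periods (illustrative values)
entities = ["a", "b"]
years = [2000, 2001, 2002]

# Format 1: MultiIndex DataFrame, entity is the outer level and time the inner level
index = pd.MultiIndex.from_product([entities, years], names=["entity", "time"])
mi_frame = pd.DataFrame({"x": np.arange(6.0)}, index=index)

# Format 2: a 2D array with shape (t, n); rows are time periods, columns are entities
wide = mi_frame["x"].unstack(level="entity").to_numpy()

# The same data as a 3D array with shape (1, t, n): one variable, t periods, n entities
cube = wide[np.newaxis, :, :]

print(mi_frame.shape, wide.shape, cube.shape)
```

Both objects hold the same six observations; only the arrangement differs.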

MultiIndex DataFrames

The most precise data format to use is a MultiIndex DataFrame, since only a single column can preserve every data type that might appear in a panel. For example, a single Categorical variable cannot span multiple columns, which is what a 3D layout such as the (now removed) pandas Panel would require.
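
As a rough illustration of the point about types, the sketch below builds a small MultiIndex DataFrame with one float column and one Categorical column, then converts it to a NumPy array. The column names wage and sector and all values are hypothetical.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[1, 2], [2000, 2001]], names=["entity", "time"])
frame = pd.DataFrame(
    {
        "wage": [10.0, 11.0, 9.5, 9.8],
        "sector": pd.Categorical(["retail", "retail", "mining", "mining"]),
    },
    index=index,
)

# Each column keeps its own dtype inside the DataFrame
print(frame.dtypes)

# A single NumPy array has one dtype, so mixed columns fall back to generic objects
as_array = frame.to_numpy()
print(as_array.dtype)
```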

This example uses the job training data to construct a MultiIndex DataFrame using the set_index command. The entity index is fcode and the time index is year.

[1]:
from linearmodels.datasets import jobtraining

data = jobtraining.load()
print(data.head())
   year   fcode  employ       sales   avgsal  scrap  rework  tothrs  union  \
0  1987  410032   100.0  47000000.0  35000.0    NaN     NaN    12.0      0
1  1988  410032   131.0  43000000.0  37000.0    NaN     NaN     8.0      0
2  1989  410032   123.0  49000000.0  39000.0    NaN     NaN     8.0      0
3  1987  410440    12.0   1560000.0  10500.0    NaN     NaN    12.0      0
4  1988  410440    13.0   1970000.0  11000.0    NaN     NaN    12.0      0

   grant  ...  grant_1  clscrap  cgrant  clemploy   clsales    lavgsal  \
0      0  ...        0      NaN       0       NaN       NaN  10.463100
1      0  ...        0      NaN       0  0.270027 -0.088949  10.518670
2      0  ...        0      NaN       0 -0.063013  0.130621  10.571320
3      0  ...        0      NaN       0       NaN       NaN   9.259130
4      0  ...        0      NaN       0  0.080043  0.233347   9.305651

   clavgsal  cgrant_1   chrsemp  clhrsemp
0       NaN       NaN       NaN       NaN
1  0.055570       0.0 -8.946565 -1.165385
2  0.052644       0.0  0.198597  0.047832
3       NaN       NaN       NaN       NaN
4  0.046520       0.0  0.000000  0.000000

[5 rows x 30 columns]

Here set_index is used to set the MultiIndex using the firm code (entity) and year (time).

[2]:
orig_mi_data = data.set_index(["fcode", "year"])
# Subset to the relevant columns and drop missing to avoid warnings
mi_data = orig_mi_data[["lscrap", "hrsemp"]]
mi_data = mi_data.dropna(axis=0, how="any")

print(mi_data.head())
               lscrap    hrsemp
fcode  year
410523 1987 -2.813411  20.00000
       1988 -2.995732  18.82353
       1989 -2.995732  24.48980
410538 1989  0.932164  25.00000
410563 1987  1.791759   0.00000

The MultiIndex DataFrame can be used to initialize the model. When only referencing a single series, the MultiIndex Series representation can be used.
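
A minimal sketch of the difference between the Series and the one-column DataFrame representations, using a hypothetical two-entity frame. Both selections keep the two-level (entity, time) index.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[1, 2], [2000, 2001]], names=["entity", "time"])
frame = pd.DataFrame(
    {"y": [1.0, 2.0, 3.0, 4.0], "x": [0.1, 0.2, 0.3, 0.4]}, index=index
)

as_series = frame["y"]    # Series; attribute access (frame.y) is equivalent here
as_frame = frame[["y"]]   # one-column DataFrame

print(type(as_series).__name__, as_series.index.nlevels)
print(type(as_frame).__name__, as_frame.index.nlevels)
```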

[3]:
from linearmodels import PanelOLS

mod = PanelOLS(mi_data.lscrap, mi_data.hrsemp, entity_effects=True)
print(mod.fit())
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                 lscrap   R-squared:                        0.0528
Estimator:                   PanelOLS   R-squared (Between):             -0.0379
No. Observations:                 140   R-squared (Within):               0.0528
Date:                Fri, Jul 19 2024   R-squared (Overall):             -0.0288
Time:                        17:54:46   Log-likelihood                   -90.459
Cov. Estimator:            Unadjusted
                                        F-statistic:                      5.0751
Entities:                          48   P-value                           0.0267
Avg Obs:                       2.9167   Distribution:                    F(1,91)
Min Obs:                       1.0000
Max Obs:                       3.0000   F-statistic (robust):             5.0751
                                        P-value                           0.0267
Time periods:                       3   Distribution:                    F(1,91)
Avg Obs:                       46.667
Min Obs:                       46.000
Max Obs:                       48.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
hrsemp        -0.0054     0.0024    -2.2528     0.0267     -0.0102     -0.0006
==============================================================================

F-test for Poolability: 17.094
P-value: 0.0000
Distribution: F(47,91)

Included effects: Entity

NumPy arrays

3D NumPy arrays can be used to handle panel data where the three axes are 0 (items), 1 (time) and 2 (entity). NumPy arrays are not usually the best format for data since the results all use generic variable names.

pandas removed its Panel data structure in version 0.25.
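
One way to arrive at the (items, time, entity) layout is to pivot each variable with unstack and stack the resulting tables. This is a sketch on a synthetic unbalanced panel rather than the job training data, and the variable names y and x are placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic unbalanced panel: entity 2 is not observed in 2001
index = pd.MultiIndex.from_tuples(
    [(1, 2000), (1, 2001), (2, 2000)], names=["entity", "time"]
)
frame = pd.DataFrame({"y": [1.0, 2.0, 3.0], "x": [0.5, 0.6, 0.7]}, index=index)

# unstack moves entity to the columns, giving one (time, entity) table per variable;
# combinations that were never observed become NaN
layers = [frame[name].unstack(level="entity").to_numpy() for name in frame.columns]

# Stack along a new leading axis: (items, time, entity)
panel = np.stack(layers, axis=0)
print(panel.shape)
```

Because the array must be rectangular, the unobserved (entity, time) pair appears as NaN instead of being absent.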

[4]:
import numpy as np

np_data = np.asarray(orig_mi_data)
np_lscrap = np_data[:, orig_mi_data.columns.get_loc("lscrap")]
np_hrsemp = np_data[:, orig_mi_data.columns.get_loc("hrsemp")]
nentity = mi_data.index.levels[0].shape[0]
ntime = mi_data.index.levels[1].shape[0]
np_lscrap = np_lscrap.reshape((nentity, ntime)).T
np_hrsemp = np_hrsemp.reshape((nentity, ntime)).T
np_hrsemp.shape = (1, ntime, nentity)
[5]:
# Warnings are inevitable when using NumPy with missing data
# since the arrays must be rectangular, and not ragged
res = PanelOLS(np_lscrap, np_hrsemp, entity_effects=True).fit()
print(res)
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                    Dep   R-squared:                        0.0528
Estimator:                   PanelOLS   R-squared (Between):             -0.0379
No. Observations:                 140   R-squared (Within):               0.0528
Date:                Fri, Jul 19 2024   R-squared (Overall):             -0.0288
Time:                        17:54:46   Log-likelihood                   -90.459
Cov. Estimator:            Unadjusted
                                        F-statistic:                      5.0751
Entities:                          48   P-value                           0.0267
Avg Obs:                       2.9167   Distribution:                    F(1,91)
Min Obs:                       1.0000
Max Obs:                       3.0000   F-statistic (robust):             5.0751
                                        P-value                           0.0267
Time periods:                       3   Distribution:                    F(1,91)
Avg Obs:                       46.667
Min Obs:                       46.000
Max Obs:                       48.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Exog          -0.0054     0.0024    -2.2528     0.0267     -0.0102     -0.0006
==============================================================================

F-test for Poolability: 17.094
P-value: 0.0000
Distribution: F(47,91)

Included effects: Entity
/home/runner/work/linearmodels/linearmodels/linearmodels/panel/model.py:1260: MissingValueWarning:
Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)

xarray DataArrays

xarray is a relatively new entrant into the set of packages used for data structures. The data structures provided by xarray are relevant in the context of panel models since pandas Panel has been removed, and so the only labeled 3D data format that remains viable is an xarray DataArray. DataArrays are similar to the former pandas Panel, although DataArrays use somewhat different notation. In principle it is possible to express the same information in a DataArray as one could in a Panel.

[6]:
da = mi_data.to_xarray()
da.keys()
[6]:
KeysView(<xarray.Dataset> Size: 3kB
Dimensions:  (fcode: 48, year: 3)
Coordinates:
  * fcode    (fcode) int64 384B 410523 410538 410563 ... 419459 419482 419483
  * year     (year) int64 24B 1987 1988 1989
Data variables:
    lscrap   (fcode, year) float64 1kB -2.813 -2.996 -2.996 ... 3.219 3.401
    hrsemp   (fcode, year) float64 1kB 20.0 18.82 24.49 nan ... 0.0 0.0 3.101)
[7]:
res = PanelOLS(da["lscrap"].T, da["hrsemp"].T, entity_effects=True).fit()
print(res)
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                    Dep   R-squared:                        0.0528
Estimator:                   PanelOLS   R-squared (Between):             -0.0379
No. Observations:                 140   R-squared (Within):               0.0528
Date:                Fri, Jul 19 2024   R-squared (Overall):             -0.0288
Time:                        17:54:46   Log-likelihood                   -90.459
Cov. Estimator:            Unadjusted
                                        F-statistic:                      5.0751
Entities:                          48   P-value                           0.0267
Avg Obs:                       2.9167   Distribution:                    F(1,91)
Min Obs:                       1.0000
Max Obs:                       3.0000   F-statistic (robust):             5.0751
                                        P-value                           0.0267
Time periods:                       3   Distribution:                    F(1,91)
Avg Obs:                       46.667
Min Obs:                       46.000
Max Obs:                       48.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Exog          -0.0054     0.0024    -2.2528     0.0267     -0.0102     -0.0006
==============================================================================

F-test for Poolability: 17.094
P-value: 0.0000
Distribution: F(47,91)

Included effects: Entity

Conversion of Categorical and Strings to Dummies

Categorical or string variables are treated as factors and so are converted to dummies. The first category is always dropped. If this is not desirable, you should manually convert the data to dummies before estimating a model.
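
A minimal sketch of the manual route, assuming pandas.get_dummies is used to build the indicator columns. The small frame below is synthetic and only mimics the fcode/year index; the choice of 1989 as the reference year is arbitrary.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[1, 2], [1987, 1988, 1989]], names=["fcode", "year"]
)
frame = pd.DataFrame({"hrsemp": [20.0, 18.8, 24.5, 0.0, 1.0, 3.1]}, index=index)

# Year values as a Series that shares the panel's MultiIndex
years = pd.Series(frame.index.get_level_values("year").to_numpy(), index=frame.index)

# One indicator column per year, stored as floats so they can be used as regressors
dummies = pd.get_dummies(years, prefix="year", dtype=float)

# Drop the last year instead of the first so that 1989 is the reference category
exog = pd.concat([frame, dummies.drop(columns="year_1989")], axis=1)
print(exog.columns.tolist())
```

Since the dummies are already numeric, no further conversion should be needed and the omitted category is the one chosen here.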

[8]:
import pandas as pd

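# String version of the year taken from the index
year_str = mi_data.reset_index()[["year"]].astype("str")
# Categorical version built from the same strings
year_cat = pd.Categorical(year_str.iloc[:, 0])
# Align on the MultiIndex so the column assignment matches rows correctly
year_str.index = mi_data.index
# A Categorical carries no index and is assigned by position
mi_data["year_str"] = year_str
mi_data["year_cat"] = year_cat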

Here year has been converted to a string which is then used in the model to produce year dummies.

[9]:
print("Exogenous variables")
print(mi_data[["hrsemp", "year_str"]].head())
print(mi_data[["hrsemp", "year_str"]].dtypes)

res = PanelOLS(
    mi_data[["lscrap"]], mi_data[["hrsemp", "year_str"]], entity_effects=True
).fit()
print(res)
Exogenous variables
               hrsemp year_str
fcode  year
410523 1987  20.00000     1987
       1988  18.82353     1988
       1989  24.48980     1989
410538 1989  25.00000     1989
410563 1987   0.00000     1987
hrsemp      float64
year_str     object
dtype: object
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                 lscrap   R-squared:                        0.1985
Estimator:                   PanelOLS   R-squared (Between):             -0.1240
No. Observations:                 140   R-squared (Within):               0.1985
Date:                Fri, Jul 19 2024   R-squared (Overall):             -0.0934
Time:                        17:54:46   Log-likelihood                   -78.765
Cov. Estimator:            Unadjusted
                                        F-statistic:                      7.3496
Entities:                          48   P-value                           0.0002
Avg Obs:                       2.9167   Distribution:                    F(3,89)
Min Obs:                       1.0000
Max Obs:                       3.0000   F-statistic (robust):             7.3496
                                        P-value                           0.0002
Time periods:                       3   Distribution:                    F(3,89)
Avg Obs:                       46.667
Min Obs:                       46.000
Max Obs:                       48.000

                               Parameter Estimates
=================================================================================
               Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
---------------------------------------------------------------------------------
hrsemp           -0.0024     0.0024    -1.0058     0.3172     -0.0071      0.0023
year_str.1988    -0.1591     0.1146    -1.3888     0.1684     -0.3868      0.0685
year_str.1989    -0.4620     0.1176    -3.9297     0.0002     -0.6957     -0.2284
=================================================================================

F-test for Poolability: 19.649
P-value: 0.0000
Distribution: F(47,89)

Included effects: Entity

Using categoricals has the same effect.

[10]:
print("Exogenous variables")
print(mi_data[["hrsemp", "year_cat"]].head())
print(mi_data[["hrsemp", "year_cat"]].dtypes)

res = PanelOLS(
    mi_data[["lscrap"]], mi_data[["hrsemp", "year_cat"]], entity_effects=True
).fit()
print(res)
Exogenous variables
               hrsemp year_cat
fcode  year
410523 1987  20.00000     1987
       1988  18.82353     1988
       1989  24.48980     1989
410538 1989  25.00000     1989
410563 1987   0.00000     1987
hrsemp       float64
year_cat    category
dtype: object
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                 lscrap   R-squared:                        0.1985
Estimator:                   PanelOLS   R-squared (Between):             -0.1240
No. Observations:                 140   R-squared (Within):               0.1985
Date:                Fri, Jul 19 2024   R-squared (Overall):             -0.0934
Time:                        17:54:47   Log-likelihood                   -78.765
Cov. Estimator:            Unadjusted
                                        F-statistic:                      7.3496
Entities:                          48   P-value                           0.0002
Avg Obs:                       2.9167   Distribution:                    F(3,89)
Min Obs:                       1.0000
Max Obs:                       3.0000   F-statistic (robust):             7.3496
                                        P-value                           0.0002
Time periods:                       3   Distribution:                    F(3,89)
Avg Obs:                       46.667
Min Obs:                       46.000
Max Obs:                       48.000

                               Parameter Estimates
=================================================================================
               Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
---------------------------------------------------------------------------------
hrsemp           -0.0024     0.0024    -1.0058     0.3172     -0.0071      0.0023
year_cat.1988    -0.1591     0.1146    -1.3888     0.1684     -0.3868      0.0685
year_cat.1989    -0.4620     0.1176    -3.9297     0.0002     -0.6957     -0.2284
=================================================================================

F-test for Poolability: 19.649
P-value: 0.0000
Distribution: F(47,89)

Included effects: Entity