Data Formats for Panel Data Analysis

There are two primary ways to express panel data:

  • MultiIndex DataFrames where the outer index is the entity and the inner is the time index. This requires using pandas.

  • 3D structures where dimension 0 (outer) is the variable, dimension 1 is the time index and dimension 2 is the entity index. It is also possible to use a 2D data structure with dimensions (t, n), which is treated as a 3D data structure with dimensions (1, t, n). These 3D data structures can be NumPy arrays or xarray DataArrays (pandas Panel was removed in pandas 0.25).
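
The two layouts can be compared on a small example. The sketch below uses a made-up panel (the entity labels, years and the variables y and x are hypothetical) and builds both a MultiIndex DataFrame and the matching (variable, time, entity) array from it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: 2 entities observed over 3 time periods
entities = ["A", "B"]
years = [2000, 2001, 2002]
index = pd.MultiIndex.from_product([entities, years], names=["entity", "time"])

# Format 1: MultiIndex DataFrame, entity is the outer level and time the inner
mi_df = pd.DataFrame(
    {"y": np.arange(6.0), "x": np.arange(6.0) * 10},
    index=index,
)

# Format 2: 3D array with axes (variable, time, entity).
# unstack moves entity to the columns, giving a (time, entity) block per variable
arr_3d = np.stack([mi_df[col].unstack("entity").to_numpy() for col in ["y", "x"]])

# A 2D (time, entity) array holds a single variable
arr_2d = arr_3d[0]

print(mi_df.shape)   # (6, 2)
print(arr_3d.shape)  # (2, 3, 2)
print(arr_2d.shape)  # (3, 2)
```

The DataFrame stores one row per entity-time pair, while the array stores one (time, entity) slab per variable and carries no labels.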

MultiIndex DataFrames

The most precise data format to use is a MultiIndex DataFrame. It is the most precise because each variable is stored in a single column that keeps its own dtype, so all types within a panel are preserved. For example, it is not possible to span a single Categorical variable across multiple columns, which is what a 3D container such as the former pandas Panel would require.
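
A minimal illustration of the dtype point, using a hypothetical two-entity panel with one numeric and one Categorical column (the names wage and sector are made up for this sketch): the DataFrame keeps a separate dtype per column, while converting everything to one array forces a single common dtype.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["A", "B"], [2000, 2001]], names=["entity", "time"]
)
df = pd.DataFrame(
    {
        "wage": [10.0, 11.0, 9.5, 9.8],
        "sector": pd.Categorical(["retail", "retail", "tech", "tech"]),
    },
    index=index,
)

# Each column keeps its own dtype in the DataFrame
print(df.dtypes)

# Converting to a single array forces one common dtype (object here)
as_array = np.asarray(df)
print(as_array.dtype)
```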

This example uses the job training data to construct a MultiIndex DataFrame using the set_index command. The entity index is fcode and the time index is year.

[1]:
from linearmodels.datasets import jobtraining
data = jobtraining.load()
print(data.head())
   year   fcode  employ       sales   avgsal  scrap  rework  tothrs  union  \
0  1987  410032   100.0  47000000.0  35000.0    NaN     NaN    12.0      0
1  1988  410032   131.0  43000000.0  37000.0    NaN     NaN     8.0      0
2  1989  410032   123.0  49000000.0  39000.0    NaN     NaN     8.0      0
3  1987  410440    12.0   1560000.0  10500.0    NaN     NaN    12.0      0
4  1988  410440    13.0   1970000.0  11000.0    NaN     NaN    12.0      0

   grant  ...  grant_1  clscrap  cgrant  clemploy   clsales    lavgsal  \
0      0  ...        0      NaN       0       NaN       NaN  10.463100
1      0  ...        0      NaN       0  0.270027 -0.088949  10.518670
2      0  ...        0      NaN       0 -0.063013  0.130621  10.571320
3      0  ...        0      NaN       0       NaN       NaN   9.259130
4      0  ...        0      NaN       0  0.080043  0.233347   9.305651

   clavgsal  cgrant_1   chrsemp  clhrsemp
0       NaN       NaN       NaN       NaN
1  0.055570       0.0 -8.946565 -1.165385
2  0.052644       0.0  0.198597  0.047832
3       NaN       NaN       NaN       NaN
4  0.046520       0.0  0.000000  0.000000

[5 rows x 30 columns]

Here set_index is used to set the MultiIndex using the firm code (entity) and year (time).

[2]:
mi_data = data.set_index(['fcode', 'year'])
print(mi_data.head())
             employ       sales   avgsal  scrap  rework  tothrs  union  grant  \
fcode  year
410032 1987   100.0  47000000.0  35000.0    NaN     NaN    12.0      0      0
       1988   131.0  43000000.0  37000.0    NaN     NaN     8.0      0      0
       1989   123.0  49000000.0  39000.0    NaN     NaN     8.0      0      0
410440 1987    12.0   1560000.0  10500.0    NaN     NaN    12.0      0      0
       1988    13.0   1970000.0  11000.0    NaN     NaN    12.0      0      0

             d89  d88  ...  grant_1  clscrap  cgrant  clemploy   clsales  \
fcode  year            ...
410032 1987    0    0  ...        0      NaN       0       NaN       NaN
       1988    0    1  ...        0      NaN       0  0.270027 -0.088949
       1989    1    0  ...        0      NaN       0 -0.063013  0.130621
410440 1987    0    0  ...        0      NaN       0       NaN       NaN
       1988    0    1  ...        0      NaN       0  0.080043  0.233347

               lavgsal  clavgsal  cgrant_1   chrsemp  clhrsemp
fcode  year
410032 1987  10.463100       NaN       NaN       NaN       NaN
       1988  10.518670  0.055570       0.0 -8.946565 -1.165385
       1989  10.571320  0.052644       0.0  0.198597  0.047832
410440 1987   9.259130       NaN       NaN       NaN       NaN
       1988   9.305651  0.046520       0.0  0.000000  0.000000

[5 rows x 28 columns]

The MultiIndex DataFrame can be used to initialize the model. When only a single series is referenced, the MultiIndex Series representation can be used.

[3]:
from linearmodels import PanelOLS
mod = PanelOLS(mi_data.lscrap, mi_data.hrsemp, entity_effects=True)
print(mod.fit())
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                 lscrap   R-squared:                        0.0528
Estimator:                   PanelOLS   R-squared (Between):             -0.0379
No. Observations:                 140   R-squared (Within):               0.0528
Date:                Tue, Feb 04 2020   R-squared (Overall):             -0.0288
Time:                        10:34:42   Log-likelihood                   -90.459
Cov. Estimator:            Unadjusted
                                        F-statistic:                      5.0751
Entities:                          48   P-value                           0.0267
Avg Obs:                       2.9167   Distribution:                    F(1,91)
Min Obs:                       1.0000
Max Obs:                       3.0000   F-statistic (robust):             5.0751
                                        P-value                           0.0267
Time periods:                       3   Distribution:                    F(1,91)
Avg Obs:                       46.667
Min Obs:                       46.000
Max Obs:                       48.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
hrsemp        -0.0054     0.0024    -2.2528     0.0267     -0.0102     -0.0006
==============================================================================

F-test for Poolability: 17.094
P-value: 0.0000
Distribution: F(47,91)

Included effects: Entity
/home/travis/build/bashtage/linearmodels/linearmodels/utility.py:549: MissingValueWarning:
Inputs contain missing values. Dropping rows with missing observations.
  warnings.warn(missing_value_warning_msg, MissingValueWarning)

NumPy arrays

3D NumPy arrays can be used to hold panel data, where axis 0 is items (variables), axis 1 is time and axis 2 is entity. NumPy arrays are usually not the best format for data since the results all use generic variable names.

pandas removed its Panel data structure in version 0.25.

[4]:
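import numpy as np

# Convert the whole DataFrame to a 2D array and pull out the two columns
np_data = np.asarray(mi_data)
np_lscrap = np_data[:, mi_data.columns.get_loc('lscrap')]
np_hrsemp = np_data[:, mi_data.columns.get_loc('hrsemp')]
nentity = mi_data.index.levels[0].shape[0]
ntime = mi_data.index.levels[1].shape[0]
# Assumes rows are ordered by entity and then time: reshape to
# (entity, time) and transpose to get (time, entity)
np_lscrap = np_lscrap.reshape((nentity, ntime)).T
np_hrsemp = np_hrsemp.reshape((nentity, ntime)).T
# The dependent variable stays 2D (time, entity); the regressor is made 3D
np_hrsemp.shape = (1, ntime, nentity)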
[5]:
res = PanelOLS(np_lscrap, np_hrsemp, entity_effects=True).fit()
print(res)
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                    Dep   R-squared:                        0.0528
Estimator:                   PanelOLS   R-squared (Between):             -0.0379
No. Observations:                 140   R-squared (Within):               0.0528
Date:                Tue, Feb 04 2020   R-squared (Overall):             -0.0288
Time:                        10:34:42   Log-likelihood                   -90.459
Cov. Estimator:            Unadjusted
                                        F-statistic:                      5.0751
Entities:                          48   P-value                           0.0267
Avg Obs:                       2.9167   Distribution:                    F(1,91)
Min Obs:                       1.0000
Max Obs:                       3.0000   F-statistic (robust):             5.0751
                                        P-value                           0.0267
Time periods:                       3   Distribution:                    F(1,91)
Avg Obs:                       46.667
Min Obs:                       46.000
Max Obs:                       48.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Exog          -0.0054     0.0024    -2.2528     0.0267     -0.0102     -0.0006
==============================================================================

F-test for Poolability: 17.094
P-value: 0.0000
Distribution: F(47,91)

Included effects: Entity
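
The reshape used to build the NumPy inputs appears to rely on every entity having a row for every time period, with rows sorted by entity and then time. As a hedged alternative, the sketch below uses unstack on a made-up unbalanced series (entity "B" has no 2001 row) to show how label-based alignment produces the (time, entity) layout and marks absent cells as NaN. The data and names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced panel: entity "B" has no row for 2001
index = pd.MultiIndex.from_tuples(
    [("A", 2000), ("A", 2001), ("B", 2000)], names=["entity", "time"]
)
y = pd.Series([1.0, 2.0, 3.0], index=index, name="y")

# unstack aligns on the index labels and fills missing cells with NaN
y_2d = y.unstack("entity").to_numpy()   # shape (time, entity)
y_3d = y_2d[np.newaxis, :, :]           # shape (1, time, entity)

print(y_2d)
print(y_3d.shape)
```

A plain reshape of the three values would fail here because 3 is not a multiple of the number of time periods times the number of entities.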

xarray DataArrays

xarray is a relatively new entrant into the set of packages used for data structures. The data structures provided by xarray are relevant in the context of panel models since pandas Panel was removed in pandas 0.25, and so the only labeled 3D data format that remains viable is an xarray DataArray. DataArrays are similar to the former pandas Panel, although DataArrays use somewhat different notation. In principle it is possible to express the same information in a DataArray as one could in a Panel.

[6]:
da = mi_data.to_xarray()
da.keys()
[6]:
KeysView(<xarray.Dataset>
Dimensions:   (fcode: 157, year: 3)
Coordinates:
  * fcode     (fcode) int64 410032 410440 410495 410500 ... 419482 419483 419486
  * year      (year) int64 1987 1988 1989
Data variables:
    employ    (fcode, year) float64 100.0 131.0 123.0 12.0 ... 80.0 90.0 100.0
    sales     (fcode, year) float64 4.7e+07 4.3e+07 4.9e+07 ... 8.5e+06 9.9e+06
    avgsal    (fcode, year) float64 3.5e+04 3.7e+04 3.9e+04 ... 1.7e+04 1.8e+04
    scrap     (fcode, year) float64 nan nan nan nan nan ... 30.0 nan nan nan
    rework    (fcode, year) float64 nan nan nan nan nan ... nan nan nan nan nan
    tothrs    (fcode, year) float64 12.0 8.0 8.0 12.0 12.0 ... 20.0 0.0 0.0 40.0
    union     (fcode, year) int64 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 1 1 0 0 0
    grant     (fcode, year) int64 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
    d89       (fcode, year) int64 0 0 1 0 0 1 0 0 1 0 0 ... 1 0 0 1 0 0 1 0 0 1
    d88       (fcode, year) int64 0 1 0 0 1 0 0 1 0 0 1 ... 0 0 1 0 0 1 0 0 1 0
    totrain   (fcode, year) float64 100.0 50.0 50.0 12.0 ... 20.0 0.0 0.0 90.0
    hrsemp    (fcode, year) float64 12.0 3.053 3.252 12.0 ... 3.101 0.0 0.0 36.0
    lscrap    (fcode, year) float64 nan nan nan nan nan ... 3.401 nan nan nan
    lemploy   (fcode, year) float64 4.605 4.875 4.812 2.485 ... 4.382 4.5 4.605
    lsales    (fcode, year) float64 17.67 17.58 17.71 ... 15.76 15.96 16.11
    lrework   (fcode, year) float64 nan nan nan nan nan ... nan nan nan nan nan
    lhrsemp   (fcode, year) float64 2.565 1.4 1.447 2.565 ... 0.0 0.0 3.611
    lscrap_1  (fcode, year) float64 nan nan nan nan nan ... 3.219 nan nan nan
    grant_1   (fcode, year) int64 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
    clscrap   (fcode, year) float64 nan nan nan nan nan ... 0.1823 nan nan nan
    cgrant    (fcode, year) int64 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
    clemploy  (fcode, year) float64 nan 0.27 -0.06301 nan ... nan 0.1178 0.1054
    clsales   (fcode, year) float64 nan -0.08895 0.1306 ... nan 0.1942 0.1525
    lavgsal   (fcode, year) float64 10.46 10.52 10.57 9.259 ... 9.68 9.741 9.798
    clavgsal  (fcode, year) float64 nan 0.05557 0.05264 ... nan 0.06063 0.05716
    cgrant_1  (fcode, year) float64 nan 0.0 0.0 nan 0.0 ... 0.0 0.0 nan 0.0 0.0
    chrsemp   (fcode, year) float64 nan -8.947 0.1986 nan ... 3.101 nan 0.0 36.0
    clhrsemp  (fcode, year) float64 nan -1.165 0.04783 nan ... nan 0.0 3.611)
[7]:
res = PanelOLS(da['lscrap'].T, da['hrsemp'].T, entity_effects=True).fit()
print(res)
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                    Dep   R-squared:                        0.0528
Estimator:                   PanelOLS   R-squared (Between):             -0.0379
No. Observations:                 140   R-squared (Within):               0.0528
Date:                Tue, Feb 04 2020   R-squared (Overall):             -0.0288
Time:                        10:34:42   Log-likelihood                   -90.459
Cov. Estimator:            Unadjusted
                                        F-statistic:                      5.0751
Entities:                          48   P-value                           0.0267
Avg Obs:                       2.9167   Distribution:                    F(1,91)
Min Obs:                       1.0000
Max Obs:                       3.0000   F-statistic (robust):             5.0751
                                        P-value                           0.0267
Time periods:                       3   Distribution:                    F(1,91)
Avg Obs:                       46.667
Min Obs:                       46.000
Max Obs:                       48.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Exog          -0.0054     0.0024    -2.2528     0.0267     -0.0102     -0.0006
==============================================================================

F-test for Poolability: 17.094
P-value: 0.0000
Distribution: F(47,91)

Included effects: Entity

Conversion of Categorical and Strings to Dummies

Categorical or string variables are treated as factors and so are converted to dummies. The first category is always dropped. If this is not desirable, you should manually convert the data to dummies before estimating a model.

[8]:
import pandas as pd

# String version of the year level, realigned to the MultiIndex
year_str = mi_data.reset_index()[['year']].astype('str')
year_str.index = mi_data.index
# Categorical version; a Categorical has no index and is assigned by position
year_cat = pd.Categorical(year_str.iloc[:, 0])
mi_data['year_str'] = year_str
mi_data['year_cat'] = year_cat

Here year has been converted to a string, which is then used in the model to produce year dummies. The first category, 1987, is dropped.

[9]:
print('Exogenous variables')
print(mi_data[['hrsemp','year_str']].head())
print(mi_data[['hrsemp','year_str']].dtypes)

res = PanelOLS(mi_data[['lscrap']], mi_data[['hrsemp','year_str']], entity_effects=True).fit()
print(res)
Exogenous variables
                hrsemp year_str
fcode  year
410032 1987  12.000000     1987
       1988   3.053435     1988
       1989   3.252033     1989
410440 1987  12.000000     1987
       1988  12.000000     1988
hrsemp      float64
year_str     object
dtype: object
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                 lscrap   R-squared:                        0.1985
Estimator:                   PanelOLS   R-squared (Between):             -0.1240
No. Observations:                 140   R-squared (Within):               0.1985
Date:                Tue, Feb 04 2020   R-squared (Overall):             -0.0934
Time:                        10:34:42   Log-likelihood                   -78.765
Cov. Estimator:            Unadjusted
                                        F-statistic:                      7.3496
Entities:                          48   P-value                           0.0002
Avg Obs:                       2.9167   Distribution:                    F(3,89)
Min Obs:                       1.0000
Max Obs:                       3.0000   F-statistic (robust):             7.3496
                                        P-value                           0.0002
Time periods:                       3   Distribution:                    F(3,89)
Avg Obs:                       46.667
Min Obs:                       46.000
Max Obs:                       48.000

                               Parameter Estimates
=================================================================================
               Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
---------------------------------------------------------------------------------
hrsemp           -0.0024     0.0024    -1.0058     0.3172     -0.0071      0.0023
year_str.1988    -0.1591     0.1146    -1.3888     0.1684     -0.3868      0.0685
year_str.1989    -0.4620     0.1176    -3.9297     0.0002     -0.6957     -0.2284
=================================================================================

F-test for Poolability: 19.649
P-value: 0.0000
Distribution: F(47,89)

Included effects: Entity

Using categoricals has the same effect.

[10]:
print('Exogenous variables')
print(mi_data[['hrsemp','year_cat']].head())
print(mi_data[['hrsemp','year_cat']].dtypes)

res = PanelOLS(mi_data[['lscrap']], mi_data[['hrsemp','year_cat']], entity_effects=True).fit()
print(res)
Exogenous variables
                hrsemp year_cat
fcode  year
410032 1987  12.000000     1987
       1988   3.053435     1988
       1989   3.252033     1989
410440 1987  12.000000     1987
       1988  12.000000     1988
hrsemp       float64
year_cat    category
dtype: object
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                 lscrap   R-squared:                        0.1985
Estimator:                   PanelOLS   R-squared (Between):             -0.1240
No. Observations:                 140   R-squared (Within):               0.1985
Date:                Tue, Feb 04 2020   R-squared (Overall):             -0.0934
Time:                        10:34:43   Log-likelihood                   -78.765
Cov. Estimator:            Unadjusted
                                        F-statistic:                      7.3496
Entities:                          48   P-value                           0.0002
Avg Obs:                       2.9167   Distribution:                    F(3,89)
Min Obs:                       1.0000
Max Obs:                       3.0000   F-statistic (robust):             7.3496
                                        P-value                           0.0002
Time periods:                       3   Distribution:                    F(3,89)
Avg Obs:                       46.667
Min Obs:                       46.000
Max Obs:                       48.000

                               Parameter Estimates
=================================================================================
               Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
---------------------------------------------------------------------------------
hrsemp           -0.0024     0.0024    -1.0058     0.3172     -0.0071      0.0023
year_cat.1988    -0.1591     0.1146    -1.3888     0.1684     -0.3868      0.0685
year_cat.1989    -0.4620     0.1176    -3.9297     0.0002     -0.6957     -0.2284
=================================================================================

F-test for Poolability: 19.649
P-value: 0.0000
Distribution: F(47,89)

Included effects: Entity
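
When the automatically dropped first category is not the desired base, the dummies can be built by hand before estimation. The following is a minimal sketch using pandas.get_dummies on a hypothetical two-entity panel. The choice of 1989 as the base year and the column prefix are assumptions made for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["A", "B"], [1987, 1988, 1989]], names=["entity", "time"]
)
exog = pd.DataFrame({"hrsemp": [12.0, 3.0, 3.2, 12.0, 12.0, 10.0]}, index=index)

# Build string year labels from the time level of the index
labels = index.get_level_values("time").astype(str).to_numpy()
year = pd.Series(labels, index=index)

# One dummy per year, then drop the category chosen as the base (1989 here)
dummies = pd.get_dummies(year, prefix="year", dtype=float).drop(columns="year_1989")

# The numeric dummies can now be passed as ordinary regressors
exog = pd.concat([exog, dummies], axis=1)
print(exog)
```

Because the resulting columns are numeric, they are used as given rather than being expanded into dummies again.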