Data Formats for Panel Data Analysis
There are two primary methods to express data:
MultiIndex DataFrames where the outer index is the entity and the inner is the time index. This requires using pandas.
3D structures where dimension 0 (outer) is the variable, dimension 1 is the time index and dimension 2 is the entity index. It is also possible to use a 2D data structure with dimensions (t, n), which is treated as a 3D data structure with dimensions (1, t, n). These 3D data structures can be NumPy arrays or xarray DataArrays; the pandas Panel that once filled this role was removed in pandas 0.25.
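As a minimal sketch of the dimension convention described above, the snippet below promotes a 2D (t, n) array to the equivalent (1, t, n) layout and stacks two variables into a (2, t, n) array. The sizes and values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical panel: 3 time periods (t) and 4 entities (n)
t, n = 3, 4
x_2d = np.arange(t * n, dtype=float).reshape(t, n)

# Add a leading "variable" axis so the array has shape (1, t, n)
x_3d = x_2d[np.newaxis, :, :]

# Stack two variables along axis 0 to get shape (2, t, n)
z_3d = np.stack([x_2d, x_2d**2], axis=0)

print(x_2d.shape, x_3d.shape, z_3d.shape)
```

Axis 0 indexes variables, axis 1 indexes time and axis 2 indexes entities, so `z_3d[1, :, 0]` is the time series of the second variable for the first entity.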
MultiIndex DataFrames
The most precise data format to use is a MultiIndex DataFrame. This is the most precise since only single columns can preserve all types within a panel. For example, it is not possible to span a single Categorical variable across multiple columns when using a pandas Panel.
This example uses the job training data to construct a MultiIndex DataFrame using the set_index command. The entity index is fcode and the time index is year.
[1]:
from linearmodels.datasets import jobtraining
data = jobtraining.load()
print(data.head())
year fcode employ sales avgsal scrap rework tothrs union \
0 1987 410032 100.0 47000000.0 35000.0 NaN NaN 12.0 0
1 1988 410032 131.0 43000000.0 37000.0 NaN NaN 8.0 0
2 1989 410032 123.0 49000000.0 39000.0 NaN NaN 8.0 0
3 1987 410440 12.0 1560000.0 10500.0 NaN NaN 12.0 0
4 1988 410440 13.0 1970000.0 11000.0 NaN NaN 12.0 0
grant ... grant_1 clscrap cgrant clemploy clsales lavgsal \
0 0 ... 0 NaN 0 NaN NaN 10.463100
1 0 ... 0 NaN 0 0.270027 -0.088949 10.518670
2 0 ... 0 NaN 0 -0.063013 0.130621 10.571320
3 0 ... 0 NaN 0 NaN NaN 9.259130
4 0 ... 0 NaN 0 0.080043 0.233347 9.305651
clavgsal cgrant_1 chrsemp clhrsemp
0 NaN NaN NaN NaN
1 0.055570 0.0 -8.946565 -1.165385
2 0.052644 0.0 0.198597 0.047832
3 NaN NaN NaN NaN
4 0.046520 0.0 0.000000 0.000000
[5 rows x 30 columns]
Here set_index is used to set the MultiIndex using the firm code (entity) and year (time).
[2]:
mi_data = data.set_index(["fcode", "year"])
print(mi_data.head())
employ sales avgsal scrap rework tothrs union grant \
fcode year
410032 1987 100.0 47000000.0 35000.0 NaN NaN 12.0 0 0
1988 131.0 43000000.0 37000.0 NaN NaN 8.0 0 0
1989 123.0 49000000.0 39000.0 NaN NaN 8.0 0 0
410440 1987 12.0 1560000.0 10500.0 NaN NaN 12.0 0 0
1988 13.0 1970000.0 11000.0 NaN NaN 12.0 0 0
d89 d88 ... grant_1 clscrap cgrant clemploy clsales \
fcode year ...
410032 1987 0 0 ... 0 NaN 0 NaN NaN
1988 0 1 ... 0 NaN 0 0.270027 -0.088949
1989 1 0 ... 0 NaN 0 -0.063013 0.130621
410440 1987 0 0 ... 0 NaN 0 NaN NaN
1988 0 1 ... 0 NaN 0 0.080043 0.233347
lavgsal clavgsal cgrant_1 chrsemp clhrsemp
fcode year
410032 1987 10.463100 NaN NaN NaN NaN
1988 10.518670 0.055570 0.0 -8.946565 -1.165385
1989 10.571320 0.052644 0.0 0.198597 0.047832
410440 1987 9.259130 NaN NaN NaN NaN
1988 9.305651 0.046520 0.0 0.000000 0.000000
[5 rows x 28 columns]
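Because the entity must be the outer index level and time the inner level, the order of the keys passed to set_index matters. The following sketch uses a small hypothetical DataFrame (the names `firm`, `year` and `y` are made up for illustration) to show the intended ordering and one way to repair an index that was built as (time, entity).

```python
import pandas as pd

# Hypothetical long-format data: two firms observed over two years
df = pd.DataFrame(
    {
        "year": [1987, 1988, 1987, 1988],
        "firm": ["a", "a", "b", "b"],
        "y": [1.0, 2.0, 3.0, 4.0],
    }
)

# Entity first (outer level), time second (inner level)
panel = df.set_index(["firm", "year"])

# If an index was built as (time, entity), swap the levels and re-sort
wrong_order = df.set_index(["year", "firm"])
fixed = wrong_order.swaplevel().sort_index()

print(list(panel.index.names))
print(list(fixed.index.names))
```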
The MultiIndex DataFrame can be used to initialize the model. When only referencing a single series, the MultiIndex Series representation can be used.
[3]:
from linearmodels import PanelOLS
mod = PanelOLS(mi_data.lscrap, mi_data.hrsemp, entity_effects=True)
print(mod.fit())
PanelOLS Estimation Summary
================================================================================
Dep. Variable: lscrap R-squared: 0.0528
Estimator: PanelOLS R-squared (Between): -0.0379
No. Observations: 140 R-squared (Within): 0.0528
Date: Tue, Sep 26 2023 R-squared (Overall): -0.0288
Time: 08:55:34 Log-likelihood -90.459
Cov. Estimator: Unadjusted
F-statistic: 5.0751
Entities: 48 P-value 0.0267
Avg Obs: 2.9167 Distribution: F(1,91)
Min Obs: 1.0000
Max Obs: 3.0000 F-statistic (robust): 5.0751
P-value 0.0267
Time periods: 3 Distribution: F(1,91)
Avg Obs: 46.667
Min Obs: 46.000
Max Obs: 48.000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
hrsemp -0.0054 0.0024 -2.2528 0.0267 -0.0102 -0.0006
==============================================================================
F-test for Poolability: 17.094
P-value: 0.0000
Distribution: F(47,91)
Included effects: Entity
/home/runner/work/linearmodels/linearmodels/linearmodels/panel/model.py:1214: MissingValueWarning:
Inputs contain missing values. Dropping rows with missing observations.
super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
NumPy arrays
3D NumPy arrays can be used to handle panel data where the three axes are items (axis 0), time (axis 1) and entity (axis 2). NumPy arrays are not usually the best format for data since the results all use generic variable names.
pandas removed its Panel data structure in version 0.25.
[4]:
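import numpy as np

# Convert the MultiIndex DataFrame to a 2D array with shape (nobs, nvar)
np_data = np.asarray(mi_data)
# Select the columns holding the dependent and exogenous variables
np_lscrap = np_data[:, mi_data.columns.get_loc("lscrap")]
np_hrsemp = np_data[:, mi_data.columns.get_loc("hrsemp")]
nentity = mi_data.index.levels[0].shape[0]
ntime = mi_data.index.levels[1].shape[0]
# Rows are ordered by entity and then time, so reshape to (nentity, ntime)
# and transpose to get the (ntime, nentity) layout
np_lscrap = np_lscrap.reshape((nentity, ntime)).T
np_hrsemp = np_hrsemp.reshape((nentity, ntime)).T
# Add the leading variable axis so the exogenous data is 3D
np_hrsemp.shape = (1, ntime, nentity)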
[5]:
res = PanelOLS(np_lscrap, np_hrsemp, entity_effects=True).fit()
print(res)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: Dep R-squared: 0.0528
Estimator: PanelOLS R-squared (Between): -0.0379
No. Observations: 140 R-squared (Within): 0.0528
Date: Tue, Sep 26 2023 R-squared (Overall): -0.0288
Time: 08:55:34 Log-likelihood -90.459
Cov. Estimator: Unadjusted
F-statistic: 5.0751
Entities: 48 P-value 0.0267
Avg Obs: 2.9167 Distribution: F(1,91)
Min Obs: 1.0000
Max Obs: 3.0000 F-statistic (robust): 5.0751
P-value 0.0267
Time periods: 3 Distribution: F(1,91)
Avg Obs: 46.667
Min Obs: 46.000
Max Obs: 48.000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
Exog -0.0054 0.0024 -2.2528 0.0267 -0.0102 -0.0006
==============================================================================
F-test for Poolability: 17.094
P-value: 0.0000
Distribution: F(47,91)
Included effects: Entity
/home/runner/work/linearmodels/linearmodels/linearmodels/panel/model.py:1214: MissingValueWarning:
Inputs contain missing values. Dropping rows with missing observations.
super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
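The reshape used in the NumPy example appears to rely on every entity having a row for every time period, stored in entity-then-time order. When that cannot be assumed, one alternative is to let pandas align the data before converting to an array. The sketch below is an illustration on hypothetical data (the names `firm`, `year` and `y` are not from the job training dataset) and is not how the library itself converts inputs.

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced panel: firm "b" has no 1988 observation
idx = pd.MultiIndex.from_tuples(
    [("a", 1987), ("a", 1988), ("b", 1987), ("c", 1987), ("c", 1988)],
    names=["firm", "year"],
)
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx, name="y")

# Unstack the entity level: rows become time, columns become entity.
# Missing (entity, time) pairs are filled with NaN.
y_wide = y.unstack(level="firm")

# 2D array with shape (t, n)
y_2d = y_wide.to_numpy()
# 3D array with shape (1, t, n)
y_3d = y_2d[np.newaxis]

print(y_wide)
print(y_3d.shape)
```

The NaN entries mark the missing observations, which matches how the examples above carry missing values until the model drops them.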
xarray DataArrays
xarray is a relatively new entrant into the set of packages used for data structures. The data structures provided by xarray are relevant in the context of panel models since the pandas Panel has been removed, and so the only labeled 3D data format that remains viable is an xarray DataArray. DataArrays are similar to the pandas Panel, although DataArrays use somewhat different notation. In principle it is possible to express the same information in a DataArray as one could in a Panel.
[6]:
da = mi_data.to_xarray()
da.keys()
[6]:
KeysView(<xarray.Dataset>
Dimensions: (fcode: 157, year: 3)
Coordinates:
* fcode (fcode) int64 410032 410440 410495 410500 ... 419482 419483 419486
* year (year) int64 1987 1988 1989
Data variables: (12/28)
employ (fcode, year) float64 100.0 131.0 123.0 12.0 ... 80.0 90.0 100.0
sales (fcode, year) float64 4.7e+07 4.3e+07 4.9e+07 ... 8.5e+06 9.9e+06
avgsal (fcode, year) float64 3.5e+04 3.7e+04 3.9e+04 ... 1.7e+04 1.8e+04
scrap (fcode, year) float64 nan nan nan nan nan ... 30.0 nan nan nan
rework (fcode, year) float64 nan nan nan nan nan ... nan nan nan nan nan
tothrs (fcode, year) float64 12.0 8.0 8.0 12.0 12.0 ... 20.0 0.0 0.0 40.0
... ...
clsales (fcode, year) float64 nan -0.08895 0.1306 ... nan 0.1942 0.1525
lavgsal (fcode, year) float64 10.46 10.52 10.57 9.259 ... 9.68 9.741 9.798
clavgsal (fcode, year) float64 nan 0.05557 0.05264 ... nan 0.06063 0.05716
cgrant_1 (fcode, year) float64 nan 0.0 0.0 nan 0.0 ... 0.0 0.0 nan 0.0 0.0
chrsemp (fcode, year) float64 nan -8.947 0.1986 nan ... 3.101 nan 0.0 36.0
clhrsemp (fcode, year) float64 nan -1.165 0.04783 nan ... nan 0.0 3.611)
[7]:
res = PanelOLS(da["lscrap"].T, da["hrsemp"].T, entity_effects=True).fit()
print(res)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: Dep R-squared: 0.0528
Estimator: PanelOLS R-squared (Between): -0.0379
No. Observations: 140 R-squared (Within): 0.0528
Date: Tue, Sep 26 2023 R-squared (Overall): -0.0288
Time: 08:55:34 Log-likelihood -90.459
Cov. Estimator: Unadjusted
F-statistic: 5.0751
Entities: 48 P-value 0.0267
Avg Obs: 2.9167 Distribution: F(1,91)
Min Obs: 1.0000
Max Obs: 3.0000 F-statistic (robust): 5.0751
P-value 0.0267
Time periods: 3 Distribution: F(1,91)
Avg Obs: 46.667
Min Obs: 46.000
Max Obs: 48.000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
Exog -0.0054 0.0024 -2.2528 0.0267 -0.0102 -0.0006
==============================================================================
F-test for Poolability: 17.094
P-value: 0.0000
Distribution: F(47,91)
Included effects: Entity
/home/runner/work/linearmodels/linearmodels/linearmodels/panel/model.py:1214: MissingValueWarning:
Inputs contain missing values. Dropping rows with missing observations.
super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
Conversion of Categorical and Strings to Dummies
Categorical or string variables are treated as factors and so are converted to dummies. The first category is always dropped. If this is not desirable, you should manually convert the data to dummies before estimating a model.
[8]:
import pandas as pd
year_str = mi_data.reset_index()[["year"]].astype("str")
year_cat = pd.Categorical(year_str.iloc[:, 0])
year_str.index = mi_data.index
year_cat.index = mi_data.index
mi_data["year_str"] = year_str
mi_data["year_cat"] = year_cat
Here year has been converted to a string which is then used in the model to produce year dummies.
[9]:
print("Exogenous variables")
print(mi_data[["hrsemp", "year_str"]].head())
print(mi_data[["hrsemp", "year_str"]].dtypes)
res = PanelOLS(
mi_data[["lscrap"]], mi_data[["hrsemp", "year_str"]], entity_effects=True
).fit()
print(res)
Exogenous variables
hrsemp year_str
fcode year
410032 1987 12.000000 1987
1988 3.053435 1988
1989 3.252033 1989
410440 1987 12.000000 1987
1988 12.000000 1988
hrsemp float64
year_str object
dtype: object
/home/runner/work/linearmodels/linearmodels/linearmodels/panel/model.py:1214: MissingValueWarning:
Inputs contain missing values. Dropping rows with missing observations.
super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: lscrap R-squared: 0.1985
Estimator: PanelOLS R-squared (Between): -0.1240
No. Observations: 140 R-squared (Within): 0.1985
Date: Tue, Sep 26 2023 R-squared (Overall): -0.0934
Time: 08:55:34 Log-likelihood -78.765
Cov. Estimator: Unadjusted
F-statistic: 7.3496
Entities: 48 P-value 0.0002
Avg Obs: 2.9167 Distribution: F(3,89)
Min Obs: 1.0000
Max Obs: 3.0000 F-statistic (robust): 7.3496
P-value 0.0002
Time periods: 3 Distribution: F(3,89)
Avg Obs: 46.667
Min Obs: 46.000
Max Obs: 48.000
Parameter Estimates
=================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
---------------------------------------------------------------------------------
hrsemp -0.0024 0.0024 -1.0058 0.3172 -0.0071 0.0023
year_str.1988 -0.1591 0.1146 -1.3888 0.1684 -0.3868 0.0685
year_str.1989 -0.4620 0.1176 -3.9297 0.0002 -0.6957 -0.2284
=================================================================================
F-test for Poolability: 19.649
P-value: 0.0000
Distribution: F(47,89)
Included effects: Entity
Using categoricals has the same effect.
[10]:
print("Exogenous variables")
print(mi_data[["hrsemp", "year_cat"]].head())
print(mi_data[["hrsemp", "year_cat"]].dtypes)
res = PanelOLS(
mi_data[["lscrap"]], mi_data[["hrsemp", "year_cat"]], entity_effects=True
).fit()
print(res)
Exogenous variables
hrsemp year_cat
fcode year
410032 1987 12.000000 1987
1988 3.053435 1988
1989 3.252033 1989
410440 1987 12.000000 1987
1988 12.000000 1988
hrsemp float64
year_cat category
dtype: object
/home/runner/work/linearmodels/linearmodels/linearmodels/panel/model.py:1214: MissingValueWarning:
Inputs contain missing values. Dropping rows with missing observations.
super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: lscrap R-squared: 0.1985
Estimator: PanelOLS R-squared (Between): -0.1240
No. Observations: 140 R-squared (Within): 0.1985
Date: Tue, Sep 26 2023 R-squared (Overall): -0.0934
Time: 08:55:34 Log-likelihood -78.765
Cov. Estimator: Unadjusted
F-statistic: 7.3496
Entities: 48 P-value 0.0002
Avg Obs: 2.9167 Distribution: F(3,89)
Min Obs: 1.0000
Max Obs: 3.0000 F-statistic (robust): 7.3496
P-value 0.0002
Time periods: 3 Distribution: F(3,89)
Avg Obs: 46.667
Min Obs: 46.000
Max Obs: 48.000
Parameter Estimates
=================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
---------------------------------------------------------------------------------
hrsemp -0.0024 0.0024 -1.0058 0.3172 -0.0071 0.0023
year_cat.1988 -0.1591 0.1146 -1.3888 0.1684 -0.3868 0.0685
year_cat.1989 -0.4620 0.1176 -3.9297 0.0002 -0.6957 -0.2284
=================================================================================
F-test for Poolability: 19.649
P-value: 0.0000
Distribution: F(47,89)
Included effects: Entity
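When dropping the first category is not desirable, the dummies can be built manually before estimation, as noted at the start of this section. A minimal sketch using pandas.get_dummies on hypothetical year labels is shown below; the choice of 1989 as the omitted base category is an assumption made only for illustration.

```python
import pandas as pd

# Hypothetical year labels stored as strings
years = pd.Series(["1987", "1988", "1989", "1987"], name="year")

# Build the full set of dummies as floats, one column per category
dummies = pd.get_dummies(years, prefix="year", dtype=float)

# Drop the chosen base category instead of the first one
dummies = dummies.drop(columns="year_1989")

print(dummies.columns.tolist())
```

The remaining numeric columns could then be joined to the other exogenous variables, so that the model receives ordinary numeric regressors and performs no automatic conversion.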