Data Formats for Panel Data Analysis
There are two primary methods to express data:
MultiIndex DataFrames where the outer index is the entity and the inner is the time index. This requires using pandas.
3D structures where dimension 0 (outer) is the variable, dimension 1 is the time index and dimension 2 is the entity index. It is also possible to use a 2D data structure with dimensions (t, n), which is treated as a 3D data structure having dimensions (1, t, n). These 3D data structures can be NumPy arrays or xarray DataArrays (pandas removed its Panel type in version 0.25).
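The two layouts can be sketched on a toy panel. This is a minimal illustration using only pandas and NumPy; the entity labels, variable names and values are hypothetical and are not part of the job training data used below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: 2 entities observed over 3 periods
entities = ["A", "B"]
periods = [2001, 2002, 2003]
index = pd.MultiIndex.from_product([entities, periods], names=["entity", "time"])
df = pd.DataFrame({"x": np.arange(6.0), "y": np.arange(6.0) * 10}, index=index)

# 2D (t, n) layout of one variable: rows are time, columns are entities
x_2d = df["x"].unstack(level="entity").to_numpy()

# The same variable with a leading variable axis: (1, t, n)
x_3d = x_2d[np.newaxis, :, :]

# All variables stacked along axis 0: (nvar, t, n)
panel_3d = np.stack([df[c].unstack(level="entity").to_numpy() for c in df.columns])

print(x_2d.shape, x_3d.shape, panel_3d.shape)
```

The MultiIndex DataFrame holds one row per (entity, time) pair, while the array layouts hold one (t, n) slab per variable.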
MultiIndex DataFrames
The most precise data format to use is a MultiIndex DataFrame. It is the most precise because only single columns can preserve all types within a panel. For example, it is not possible to span a single Categorical variable across multiple columns when using a pandas Panel.
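The point about type preservation can be seen in a small sketch. The column names and values here are hypothetical; the example only shows that a DataFrame keeps a separate dtype per column, while a single array covering the same data falls back to a generic object dtype.

```python
import pandas as pd

# Hypothetical toy panel with one float column and one categorical column
index = pd.MultiIndex.from_product(
    [["A", "B"], [2001, 2002]], names=["entity", "time"]
)
df = pd.DataFrame(
    {
        "wage": [1.0, 1.5, 2.0, 2.5],
        "sector": pd.Categorical(["mfg", "mfg", "svc", "svc"]),
    },
    index=index,
)

# Each column keeps its own dtype
print(df.dtypes)

# A single array spanning both columns cannot keep both types
as_array = df.to_numpy()
print(as_array.dtype)
```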
This example uses the job training data to construct a MultiIndex DataFrame using the set_index command. The entity index is fcode and the time index is year.
[1]:
from linearmodels.datasets import jobtraining
data = jobtraining.load()
print(data.head())
year fcode employ sales avgsal scrap rework tothrs union \
0 1987 410032 100.0 47000000.0 35000.0 NaN NaN 12.0 0
1 1988 410032 131.0 43000000.0 37000.0 NaN NaN 8.0 0
2 1989 410032 123.0 49000000.0 39000.0 NaN NaN 8.0 0
3 1987 410440 12.0 1560000.0 10500.0 NaN NaN 12.0 0
4 1988 410440 13.0 1970000.0 11000.0 NaN NaN 12.0 0
grant ... grant_1 clscrap cgrant clemploy clsales lavgsal \
0 0 ... 0 NaN 0 NaN NaN 10.463100
1 0 ... 0 NaN 0 0.270027 -0.088949 10.518670
2 0 ... 0 NaN 0 -0.063013 0.130621 10.571320
3 0 ... 0 NaN 0 NaN NaN 9.259130
4 0 ... 0 NaN 0 0.080043 0.233347 9.305651
clavgsal cgrant_1 chrsemp clhrsemp
0 NaN NaN NaN NaN
1 0.055570 0.0 -8.946565 -1.165385
2 0.052644 0.0 0.198597 0.047832
3 NaN NaN NaN NaN
4 0.046520 0.0 0.000000 0.000000
[5 rows x 30 columns]
Here set_index is used to set the MultiIndex using the firm code (entity) and year (time).
[2]:
orig_mi_data = data.set_index(["fcode", "year"])
# Subset to the relevant columns and drop missing to avoid warnings
mi_data = orig_mi_data[["lscrap", "hrsemp"]]
mi_data = mi_data.dropna(axis=0, how="any")
print(mi_data.head())
lscrap hrsemp
fcode year
410523 1987 -2.813411 20.00000
1988 -2.995732 18.82353
1989 -2.995732 24.48980
410538 1989 0.932164 25.00000
410563 1987 1.791759 0.00000
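Dropping rows with missing values generally leaves an unbalanced panel, which is why the number of observations per entity can differ. The following sketch, on a hypothetical long-format frame with made-up firm codes and values, sets the index the same way, checks that each (entity, time) pair appears once, and counts the remaining observations per entity.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per (firm, year)
long = pd.DataFrame(
    {
        "firm": [1, 1, 1, 2, 2, 2],
        "year": [1987, 1988, 1989] * 2,
        "y": [0.1, np.nan, 0.3, 0.4, 0.5, 0.6],
        "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# Entity is the outer index level, time is the inner level
panel = long.set_index(["firm", "year"])
assert panel.index.is_unique  # each (firm, year) pair should appear once

# Keep the columns of interest and drop rows with any missing value
clean = panel[["y", "x"]].dropna(axis=0, how="any")

# Observations remaining for each firm
obs_per_firm = clean.groupby(level="firm").size()
print(obs_per_firm)
```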
The MultiIndex DataFrame can be used to initialize the model. When only referencing a single series, the MultiIndex Series representation can be used.
[3]:
from linearmodels import PanelOLS
mod = PanelOLS(mi_data.lscrap, mi_data.hrsemp, entity_effects=True)
print(mod.fit())
PanelOLS Estimation Summary
================================================================================
Dep. Variable: lscrap R-squared: 0.0528
Estimator: PanelOLS R-squared (Between): -0.0379
No. Observations: 140 R-squared (Within): 0.0528
Date: Wed, Nov 06 2024 R-squared (Overall): -0.0288
Time: 15:13:07 Log-likelihood -90.459
Cov. Estimator: Unadjusted
F-statistic: 5.0751
Entities: 48 P-value 0.0267
Avg Obs: 2.9167 Distribution: F(1,91)
Min Obs: 1.0000
Max Obs: 3.0000 F-statistic (robust): 5.0751
P-value 0.0267
Time periods: 3 Distribution: F(1,91)
Avg Obs: 46.667
Min Obs: 46.000
Max Obs: 48.000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
hrsemp -0.0054 0.0024 -2.2528 0.0267 -0.0102 -0.0006
==============================================================================
F-test for Poolability: 17.094
P-value: 0.0000
Distribution: F(47,91)
Included effects: Entity
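The difference between the Series and DataFrame representations is small. In this sketch, with hypothetical data, selecting a column by name returns a Series and selecting with a list returns a one-column DataFrame. Both carry the same MultiIndex, and the Series name matches the column name, which is consistent with the summary above reporting lscrap as the dependent variable.

```python
import pandas as pd

# Hypothetical toy panel
index = pd.MultiIndex.from_product([[1, 2], [1987, 1988]], names=["firm", "year"])
df = pd.DataFrame(
    {"y": [0.1, 0.2, 0.3, 0.4], "x": [1.0, 2.0, 3.0, 4.0]}, index=index
)

y_series = df["y"]   # Series; equivalent to df.y
y_frame = df[["y"]]  # one-column DataFrame

print(type(y_series).__name__, y_series.name)
print(type(y_frame).__name__, list(y_frame.columns))
print(y_series.index.equals(df.index), y_frame.index.equals(df.index))
```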
NumPy arrays
3D NumPy arrays can be used to handle panel data where axis 0 is items (variables), axis 1 is time and axis 2 is entity. NumPy arrays are not usually the best format for data since the results all use generic variable names.
pandas removed its Panel data structure in version 0.25.
[4]:
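import numpy as np
# Convert the full panel to a 2D array with one row per (entity, time) pair
np_data = np.asarray(orig_mi_data)
np_lscrap = np_data[:, orig_mi_data.columns.get_loc("lscrap")]
np_hrsemp = np_data[:, orig_mi_data.columns.get_loc("hrsemp")]
# index.levels keeps every firm and year from the original data,
# including those removed from mi_data by dropna
nentity = mi_data.index.levels[0].shape[0]
ntime = mi_data.index.levels[1].shape[0]
# Rows are ordered by entity and then time, so reshape and transpose to (t, n)
np_lscrap = np_lscrap.reshape((nentity, ntime)).T
np_hrsemp = np_hrsemp.reshape((nentity, ntime)).T
# Add a leading variable axis so the regressor has shape (1, t, n)
np_hrsemp = np_hrsemp[np.newaxis, :, :]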
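The reshape-and-transpose step relies on the rows being balanced and sorted by entity and then time. The following sketch checks that assumption on a hypothetical toy series by comparing the result with pandas' own unstack.

```python
import numpy as np
import pandas as pd

# Hypothetical balanced panel, sorted by firm and then year
nentity, ntime = 3, 2
index = pd.MultiIndex.from_product(
    [[10, 20, 30], [1987, 1988]], names=["firm", "year"]
)
s = pd.Series(np.arange(6.0), index=index)

# Flat values are entity-major: all years of firm 10, then firm 20, ...
flat = s.to_numpy()

# Reshape to (n, t), then transpose to the (t, n) layout
tn = flat.reshape((nentity, ntime)).T

# pandas produces the same (t, n) layout when the firm level is unstacked
expected = s.unstack(level="firm").to_numpy()
print(np.array_equal(tn, expected))
```

If the data were unbalanced or sorted differently, the reshape would either fail or silently misalign observations, so a check like this is worth running before estimation.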
[5]:
# Warnings are inevitable when using NumPy with missing data
# since the arrays must be rectangular, and not ragged
res = PanelOLS(np_lscrap, np_hrsemp, entity_effects=True).fit()
print(res)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: Dep R-squared: 0.0528
Estimator: PanelOLS R-squared (Between): -0.0379
No. Observations: 140 R-squared (Within): 0.0528
Date: Wed, Nov 06 2024 R-squared (Overall): -0.0288
Time: 15:13:07 Log-likelihood -90.459
Cov. Estimator: Unadjusted
F-statistic: 5.0751
Entities: 48 P-value 0.0267
Avg Obs: 2.9167 Distribution: F(1,91)
Min Obs: 1.0000
Max Obs: 3.0000 F-statistic (robust): 5.0751
P-value 0.0267
Time periods: 3 Distribution: F(1,91)
Avg Obs: 46.667
Min Obs: 46.000
Max Obs: 48.000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
Exog -0.0054 0.0024 -2.2528 0.0267 -0.0102 -0.0006
==============================================================================
F-test for Poolability: 17.094
P-value: 0.0000
Distribution: F(47,91)
Included effects: Entity
/home/runner/work/linearmodels/linearmodels/linearmodels/panel/model.py:1260: MissingValueWarning:
Inputs contain missing values. Dropping rows with missing observations.
super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
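The warning arises because a rectangular array has a cell for every (time, entity) combination, so unobserved combinations must hold NaN. A minimal sketch with hypothetical values shows how an unbalanced MultiIndex Series is padded when converted to a (t, n) array.

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced panel: firm 2 is not observed in 1987
index = pd.MultiIndex.from_tuples(
    [(1, 1987), (1, 1988), (2, 1988)], names=["firm", "year"]
)
s = pd.Series([0.5, 0.7, 1.1], index=index, name="y")

# Unstacking pads the missing (year, firm) cell with NaN
tn = s.unstack(level="firm")
arr = tn.to_numpy()
print(arr)

# Number of usable observations in the rectangular array
n_obs = int(np.isfinite(arr).sum())
print(arr.shape, n_obs)
```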
xarray DataArrays
xarray is a relatively new entrant into the set of packages used for data structures. The data structures provided by xarray are relevant in the context of panel models since pandas removed Panel in version 0.25, and so the only labeled 3D data format that remains viable is an xarray DataArray. DataArrays are similar to the former pandas Panel, although DataArrays use somewhat different notation. In principle it is possible to express the same information in a DataArray as one could in a Panel.
[6]:
da = mi_data.to_xarray()
da.keys()
[6]:
KeysView(<xarray.Dataset> Size: 3kB
Dimensions: (fcode: 48, year: 3)
Coordinates:
* fcode (fcode) int64 384B 410523 410538 410563 ... 419459 419482 419483
* year (year) int64 24B 1987 1988 1989
Data variables:
lscrap (fcode, year) float64 1kB -2.813 -2.996 -2.996 ... 3.219 3.401
hrsemp (fcode, year) float64 1kB 20.0 18.82 24.49 nan ... 0.0 0.0 3.101)
[7]:
res = PanelOLS(da["lscrap"].T, da["hrsemp"].T, entity_effects=True).fit()
print(res)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: Dep R-squared: 0.0528
Estimator: PanelOLS R-squared (Between): -0.0379
No. Observations: 140 R-squared (Within): 0.0528
Date: Wed, Nov 06 2024 R-squared (Overall): -0.0288
Time: 15:13:07 Log-likelihood -90.459
Cov. Estimator: Unadjusted
F-statistic: 5.0751
Entities: 48 P-value 0.0267
Avg Obs: 2.9167 Distribution: F(1,91)
Min Obs: 1.0000
Max Obs: 3.0000 F-statistic (robust): 5.0751
P-value 0.0267
Time periods: 3 Distribution: F(1,91)
Avg Obs: 46.667
Min Obs: 46.000
Max Obs: 48.000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
Exog -0.0054 0.0024 -2.2528 0.0267 -0.0102 -0.0006
==============================================================================
F-test for Poolability: 17.094
P-value: 0.0000
Distribution: F(47,91)
Included effects: Entity
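The transpose used above puts time on the first dimension and entity on the second. A small sketch with hypothetical data shows the dimension order produced by to_xarray and the equivalent transpose by dimension name, which avoids relying on the original order. It assumes xarray is installed.

```python
import pandas as pd

# Hypothetical toy panel with firm as the outer level and year as the inner
index = pd.MultiIndex.from_product(
    [[1, 2], [1987, 1988, 1989]], names=["firm", "year"]
)
df = pd.DataFrame({"y": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}, index=index)

# to_xarray returns an xarray Dataset; each column becomes a DataArray
ds = df.to_xarray()
y = ds["y"]  # dims follow the index levels: (firm, year)

# Reorder to (time, entity) by name rather than with .T
y_tn = y.transpose("year", "firm")
print(y.dims, y_tn.dims, y_tn.shape)
```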
Conversion of Categorical and Strings to Dummies
Categorical or string variables are treated as factors and so are converted to dummies. The first category is always dropped. If this is not desirable, you should manually convert the data to dummies before estimating a model.
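One way to convert the data manually is pandas' get_dummies, which lets you choose which category to omit. The sketch below uses a hypothetical toy panel and drops the last year rather than the first; the column names are illustrative only.

```python
import pandas as pd

# Hypothetical toy panel
index = pd.MultiIndex.from_product(
    [[1, 2], [1987, 1988, 1989]], names=["firm", "year"]
)
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)

# Year labels as strings, aligned with the panel's index
years = pd.Series(
    [str(y) for y in df.index.get_level_values("year")], index=df.index
)

# One dummy column per year, stored as floats
dummies = pd.get_dummies(years, prefix="year", dtype=float)

# Omit 1989 as the base category instead of the first year
dummies = dummies.drop(columns="year_1989")

# Numeric regressors ready to pass to a model
exog = pd.concat([df, dummies], axis=1)
print(exog)
```

Because every column in exog is already numeric, no further automatic conversion is needed.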
[8]:
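import pandas as pd
# Year as strings, re-indexed to match mi_data's (fcode, year) MultiIndex
year_str = mi_data.reset_index()[["year"]].astype("str")
year_str.index = mi_data.index
# Categorical version of the same values; it has no index and is assigned positionally
year_cat = pd.Categorical(year_str.iloc[:, 0])
mi_data["year_str"] = year_str
mi_data["year_cat"] = year_cat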
Here year has been converted to a string which is then used in the model to produce year dummies.
[9]:
print("Exogenous variables")
print(mi_data[["hrsemp", "year_str"]].head())
print(mi_data[["hrsemp", "year_str"]].dtypes)
res = PanelOLS(
mi_data[["lscrap"]], mi_data[["hrsemp", "year_str"]], entity_effects=True
).fit()
print(res)
Exogenous variables
hrsemp year_str
fcode year
410523 1987 20.00000 1987
1988 18.82353 1988
1989 24.48980 1989
410538 1989 25.00000 1989
410563 1987 0.00000 1987
hrsemp float64
year_str object
dtype: object
PanelOLS Estimation Summary
================================================================================
Dep. Variable: lscrap R-squared: 0.1985
Estimator: PanelOLS R-squared (Between): -0.1240
No. Observations: 140 R-squared (Within): 0.1985
Date: Wed, Nov 06 2024 R-squared (Overall): -0.0934
Time: 15:13:07 Log-likelihood -78.765
Cov. Estimator: Unadjusted
F-statistic: 7.3496
Entities: 48 P-value 0.0002
Avg Obs: 2.9167 Distribution: F(3,89)
Min Obs: 1.0000
Max Obs: 3.0000 F-statistic (robust): 7.3496
P-value 0.0002
Time periods: 3 Distribution: F(3,89)
Avg Obs: 46.667
Min Obs: 46.000
Max Obs: 48.000
Parameter Estimates
=================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
---------------------------------------------------------------------------------
hrsemp -0.0024 0.0024 -1.0058 0.3172 -0.0071 0.0023
year_str.1988 -0.1591 0.1146 -1.3888 0.1684 -0.3868 0.0685
year_str.1989 -0.4620 0.1176 -3.9297 0.0002 -0.6957 -0.2284
=================================================================================
F-test for Poolability: 19.649
P-value: 0.0000
Distribution: F(47,89)
Included effects: Entity
Using categoricals has the same effect.
[10]:
print("Exogenous variables")
print(mi_data[["hrsemp", "year_cat"]].head())
print(mi_data[["hrsemp", "year_cat"]].dtypes)
res = PanelOLS(
mi_data[["lscrap"]], mi_data[["hrsemp", "year_cat"]], entity_effects=True
).fit()
print(res)
Exogenous variables
hrsemp year_cat
fcode year
410523 1987 20.00000 1987
1988 18.82353 1988
1989 24.48980 1989
410538 1989 25.00000 1989
410563 1987 0.00000 1987
hrsemp float64
year_cat category
dtype: object
PanelOLS Estimation Summary
================================================================================
Dep. Variable: lscrap R-squared: 0.1985
Estimator: PanelOLS R-squared (Between): -0.1240
No. Observations: 140 R-squared (Within): 0.1985
Date: Wed, Nov 06 2024 R-squared (Overall): -0.0934
Time: 15:13:07 Log-likelihood -78.765
Cov. Estimator: Unadjusted
F-statistic: 7.3496
Entities: 48 P-value 0.0002
Avg Obs: 2.9167 Distribution: F(3,89)
Min Obs: 1.0000
Max Obs: 3.0000 F-statistic (robust): 7.3496
P-value 0.0002
Time periods: 3 Distribution: F(3,89)
Avg Obs: 46.667
Min Obs: 46.000
Max Obs: 48.000
Parameter Estimates
=================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
---------------------------------------------------------------------------------
hrsemp -0.0024 0.0024 -1.0058 0.3172 -0.0071 0.0023
year_cat.1988 -0.1591 0.1146 -1.3888 0.1684 -0.3868 0.0685
year_cat.1989 -0.4620 0.1176 -3.9297 0.0002 -0.6957 -0.2284
=================================================================================
F-test for Poolability: 19.649
P-value: 0.0000
Distribution: F(47,89)
Included effects: Entity
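With a categorical, which category counts as "first" depends on the category order. The sketch below illustrates this using pandas' get_dummies with drop_first=True on hypothetical values. That the model's automatic conversion follows the category order in the same way is an assumption, not something confirmed here, so check the resulting parameter names before relying on it.

```python
import pandas as pd

# Hypothetical year labels
years = ["1987", "1988", "1989", "1987"]

# Default categories are sorted, so "1987" comes first
default_cat = pd.Series(pd.Categorical(years))

# Explicit category order that puts "1989" first
reordered = pd.Series(pd.Categorical(years, categories=["1989", "1987", "1988"]))

# drop_first removes the first category in each case
d1 = pd.get_dummies(default_cat, prefix="year", drop_first=True)
d2 = pd.get_dummies(reordered, prefix="year", drop_first=True)

print(list(d1.columns))
print(list(d2.columns))
```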