Absorbing Regression¶
An absorbing regression is a model of the form

\[y_i = x_i \beta + z_i \gamma + \epsilon_i\]

where interest is on \(\beta\) and not \(\gamma\). \(z_i\) may be high-dimensional, and may grow with the sample size (i.e., a matrix of fixed effects).
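The estimator works by partialling the absorbed variables out of both the dependent variable and the included regressors, so that only \(\beta\) and its covariance are ever computed; by the Frisch-Waugh-Lovell theorem the resulting slopes match those from the full regression. A minimal sketch of the algebra, writing \(Z\) for the matrix of absorbed regressors and \(M_Z\) for its annihilator:

\[
M_Z = I - Z\left(Z'Z\right)^{-}Z', \qquad
\hat{\beta} = \left((M_Z X)'(M_Z X)\right)^{-1}(M_Z X)' M_Z y .
\]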
This notebook shows how this type of model can be fit on a simulated data set that mirrors ones used in practice. There are two absorbed effects: one for the worker's state (small) and one for the worker's firm (large).
[1]:
import numpy as np
import pandas as pd
rs = np.random.RandomState(0)
nobs = 1_000_000
state_id = rs.randint(50, size=nobs)
state_effects = rs.standard_normal(state_id.max() + 1)
state_effects = state_effects[state_id]
# 5 workers/firm, on average
firm_id = rs.randint(nobs // 5, size=nobs)
firm_effects = rs.standard_normal(firm_id.max() + 1)
firm_effects = firm_effects[firm_id]
cats = pd.DataFrame(
{"state": pd.Categorical(state_id), "firm": pd.Categorical(firm_id)}
)
eps = rs.standard_normal(nobs)
x = rs.standard_normal((nobs, 2))
x = np.column_stack([np.ones(nobs), x])
y = x.sum(1) + firm_effects + state_effects + eps
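Before fitting, a quick check shows why absorbing is useful here: the firm effect alone corresponds to roughly 200,000 dummy variables, far too many to build as a dense regressor matrix. A small sketch using the variables created above:

# Number of dummy columns that explicit fixed effects would require
n_state = cats["state"].nunique()
n_firm = cats["firm"].nunique()
print(f"State dummies: {n_state}, firm dummies: {n_firm}")
# A dense dummy matrix would hold nobs * (n_state + n_firm) float64 values
print(f"Dense dummy matrix: ~{nobs * (n_state + n_firm) * 8 / 1e9:,.0f} GB")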
Including a constant¶
The estimator can estimate an intercept even when all dummies are included. This is done using a mathematical trick, and the intercept is not usually meaningful; it is computed as if the dummies had been orthogonalized to a constant.
[2]:
from linearmodels.iv.absorbing import AbsorbingLS
mod = AbsorbingLS(y, x, absorb=cats)
print(mod.fit())
Absorbing LS Estimation Summary
==================================================================================
Dep. Variable: dependent R-squared: 0.8377
Estimator: Absorbing LS Adj. R-squared: 0.7975
No. Observations: 1000000 F-statistic: 1.962e+06
Date: Tue, Sep 24 2024 P-value (F-stat): 0.0000
Time: 09:29:31 Distribution: chi2(2)
Cov. Estimator: robust R-squared (No Effects): 0.6664
Variables Absorbed: 1.987e+05
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
exog.0 0.9477 0.0009 1057.8 0.0000 0.9460 0.9495
exog.1 0.9994 0.0010 990.89 0.0000 0.9974 1.0014
exog.2 1.0008 0.0010 989.09 0.0000 0.9989 1.0028
==============================================================================
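The printed summary is convenient for inspection, but the estimates can also be retrieved programmatically from the results object (a small sketch using the standard linearmodels results attributes):

res = mod.fit()
# Point estimates, standard errors and 95% confidence intervals as pandas objects
print(res.params)
print(res.std_errors)
print(res.conf_int())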
Excluding the constant¶
If the constant is dropped, the other coefficients are identical since the dummies span the constant.
[3]:
from linearmodels.iv.absorbing import AbsorbingLS
mod = AbsorbingLS(y, x[:, 1:], absorb=cats)
print(mod.fit())
Absorbing LS Estimation Summary
==================================================================================
Dep. Variable: dependent R-squared: 0.8377
Estimator: Absorbing LS Adj. R-squared: 0.7975
No. Observations: 1000000 F-statistic: 1.962e+06
Date: Tue, Sep 24 2024 P-value (F-stat): 0.0000
Time: 09:29:39 Distribution: chi2(2)
Cov. Estimator: robust R-squared (No Effects): 0.6664
Variables Absorbed: 1.987e+05
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
exog.0 0.9994 0.0010 990.89 0.0000 0.9974 1.0014
exog.1 1.0008 0.0010 989.09 0.0000 0.9989 1.0028
==============================================================================
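As a sanity check, the absorbing estimator should agree with OLS on explicitly constructed dummy variables. The firm effect is far too large to expand into dummies, but the state effect is not, so a comparison absorbing only the state effect on a subsample is feasible (a sketch using the simulated data from above):

import numpy as np
import pandas as pd
from linearmodels.iv.absorbing import AbsorbingLS

sub = slice(0, 50_000)  # keep the explicit dummy matrix small

# OLS with explicit state dummies (no separate constant; the dummies span it)
dummies = pd.get_dummies(cats["state"].iloc[sub]).to_numpy(dtype=float)
regressors = np.column_stack([x[sub, 1:], dummies])
slopes_ols = np.linalg.lstsq(regressors, y[sub], rcond=None)[0][:2]

# The same two slopes, absorbing the state effect instead of including dummies
res_abs = AbsorbingLS(y[sub], x[sub, 1:], absorb=cats[["state"]].iloc[sub]).fit()

# By the Frisch-Waugh-Lovell theorem the slope estimates should agree
print(slopes_ols)
print(res_abs.params.to_numpy())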
Optimization Options¶
The residuals from the absorbed variables are estimated using either HDFE or LSMR, depending on the variables included in the regression. HDFE is used when:
the model is unweighted; and
the absorbed regressors are all categorical (i.e., fixed effects).
If these conditions are not satisfied, then LSMR is used. LSMR can be used by setting method="lsmr"
even when the conditions for HDFE are satisfied.
[4]:
import datetime as dt
from linearmodels.iv.absorbing import AbsorbingLS
mod = AbsorbingLS(y, x[:, 1:], absorb=cats)
start = dt.datetime.now()
res = mod.fit(use_cache=False, method="lsmr")
print(f"LSMR Second: {(dt.datetime.now() - start).total_seconds()}")
start = dt.datetime.now()
res = mod.fit()
print(f"HDFE Second: {(dt.datetime.now() - start).total_seconds()}")
LSMR Second: 4.799927
HDFE Second: 1.532557
LSMR is iterative and does not have a closed form. The tolerance can be set using absorb_options, which is a dictionary of options passed to the solver. See scipy.sparse.linalg.lsmr for details on the available options.
[5]:
mod = AbsorbingLS(y, x[:, 1:], absorb=cats)
res = mod.fit(method="lsmr", absorb_options={"show": True})
LSMR Least-squares solution of Ax = b
The matrix A has 1000000 rows and 198676 columns
damp = 0.00000000000000e+00
atol = 1.00e-08 conlim = 1.00e+08
btol = 1.00e-08 maxiter = 198676
itn x(1) norm r norm Ar compatible LS norm A cond A
0 0.00000e+00 2.417e+03 2.126e+03 1.0e+00 3.6e-04
1 1.83446e+02 1.677e+03 6.377e+02 6.9e-01 3.1e-01 1.2e+00 1.0e+00
2 2.09819e+02 1.563e+03 1.668e+02 6.5e-01 6.2e-02 1.7e+00 1.2e+00
3 2.17247e+02 1.553e+03 6.446e+01 6.4e-01 2.1e-02 2.0e+00 1.3e+00
4 2.30067e+02 1.551e+03 6.539e-01 6.4e-01 1.9e-04 2.2e+00 1.5e+00
5 2.30071e+02 1.551e+03 2.662e-01 6.4e-01 7.0e-05 2.4e+00 1.5e+00
6 2.29984e+02 1.551e+03 4.542e-03 6.4e-01 1.1e-06 2.6e+00 1.6e+00
7 2.29984e+02 1.551e+03 3.999e-03 6.4e-01 9.6e-07 2.7e+00 2.5e+00
8 2.29985e+02 1.551e+03 3.882e-03 6.4e-01 8.7e-07 2.9e+00 6.5e+00
9 2.29985e+02 1.551e+03 3.882e-03 6.4e-01 8.3e-07 3.0e+00 5.1e+02
10 2.29990e+02 1.551e+03 3.882e-03 6.4e-01 7.8e-07 3.2e+00 5.5e+02
13 3.86056e+02 1.551e+03 4.003e-05 6.4e-01 7.2e-09 3.6e+00 8.7e+01
LSMR finished
The least-squares solution is good enough, given atol
istop = 2 normr = 1.6e+03
normA = 3.6e+00 normAr = 4.0e-05
itn = 13 condA = 8.7e+01
normx = 2.3e+03
13 3.86056e+02 1.551e+03 4.003e-05
6.4e-01 7.2e-09 3.6e+00 8.7e+01
LSMR Least-squares solution of Ax = b
The matrix A has 1000000 rows and 198676 columns
damp = 0.00000000000000e+00
atol = 1.00e-08 conlim = 1.00e+08
btol = 1.00e-08 maxiter = 198676
itn x(1) norm r norm Ar compatible LS norm A cond A
0 0.00000e+00 1.001e+03 4.470e+02 1.0e+00 4.5e-04
1 6.08847e-02 8.953e+02 4.167e+00 8.9e-01 4.7e-03 1.0e+00 1.0e+00
2 2.15243e-01 8.953e+02 1.484e+00 8.9e-01 1.1e-03 1.5e+00 1.1e+00
3 2.41447e-01 8.953e+02 6.412e-01 8.9e-01 4.0e-04 1.8e+00 1.4e+00
4 2.88004e-01 8.953e+02 6.266e-03 8.9e-01 3.1e-06 2.2e+00 1.3e+00
5 2.87319e-01 8.953e+02 2.921e-03 8.9e-01 1.4e-06 2.4e+00 1.4e+00
6 2.87000e-01 8.953e+02 9.868e-04 8.9e-01 4.2e-07 2.6e+00 1.4e+00
7 2.87108e-01 8.953e+02 9.866e-04 8.9e-01 4.2e-07 2.6e+00 6.2e+01
8 2.87255e-01 8.953e+02 9.866e-04 8.9e-01 3.9e-07 2.9e+00 1.3e+02
9 3.62098e-01 8.953e+02 9.857e-04 8.9e-01 3.7e-07 3.0e+00 7.2e+02
10 9.04265e-01 8.953e+02 9.789e-04 8.9e-01 3.4e-07 3.2e+00 3.1e+01
12 3.99381e+01 8.953e+02 2.309e-05 8.9e-01 7.4e-09 3.5e+00 3.1e+01
LSMR finished
The least-squares solution is good enough, given atol
istop = 2 normr = 9.0e+02
normA = 3.5e+00 normAr = 2.3e-05
itn = 12 condA = 3.1e+01
normx = 6.0e+02
12 3.99381e+01 8.953e+02 2.309e-05
8.9e-01 7.4e-09 3.5e+00 3.1e+01
LSMR Least-squares solution of Ax = b
The matrix A has 1000000 rows and 198676 columns
damp = 0.00000000000000e+00
atol = 1.00e-08 conlim = 1.00e+08
btol = 1.00e-08 maxiter = 198676
itn x(1) norm r norm Ar compatible LS norm A cond A
0 0.00000e+00 1.000e+03 4.472e+02 1.0e+00 4.5e-04
1 -9.29233e-01 8.946e+02 4.623e+00 8.9e-01 5.2e-03 1.0e+00 1.0e+00
2 -7.45416e-01 8.946e+02 1.328e+00 8.9e-01 9.9e-04 1.5e+00 1.1e+00
3 -8.34706e-01 8.946e+02 1.107e-01 8.9e-01 7.1e-05 1.7e+00 1.5e+00
4 -8.41880e-01 8.946e+02 7.140e-03 8.9e-01 3.6e-06 2.2e+00 1.8e+00
5 -8.42126e-01 8.946e+02 4.176e-03 8.9e-01 2.0e-06 2.4e+00 1.8e+00
6 -8.41939e-01 8.946e+02 3.077e-03 8.9e-01 1.3e-06 2.6e+00 2.4e+00
7 -8.41643e-01 8.946e+02 3.077e-03 8.9e-01 1.3e-06 2.6e+00 1.9e+02
8 -8.40968e-01 8.946e+02 3.077e-03 8.9e-01 1.2e-06 2.9e+00 4.2e+02
9 5.02689e-01 8.946e+02 3.060e-03 8.9e-01 1.1e-06 3.0e+00 7.6e+02
10 1.66462e+01 8.946e+02 2.851e-03 8.9e-01 9.9e-07 3.2e+00 7.9e+01
12 1.22889e+02 8.946e+02 2.454e-05 8.9e-01 7.8e-09 3.5e+00 7.9e+01
LSMR finished
The least-squares solution is good enough, given atol
istop = 2 normr = 8.9e+02
normA = 3.5e+00 normAr = 2.5e-05
itn = 12 condA = 7.9e+01
normx = 1.3e+03
12 1.22889e+02 8.946e+02 2.454e-05
8.9e-01 7.8e-09 3.5e+00 7.9e+01