Absorbing Regression

An absorbing regression is a model of the form

\[y_i = x_i \beta + z_i \gamma +\epsilon_i\]

where interest is in \(\beta\) and not \(\gamma\). \(z_i\) may be high-dimensional and may grow with the sample size (e.g., a matrix of fixed effects).
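
Estimation exploits the Frisch–Waugh–Lovell theorem: \(\beta\) can be recovered without ever constructing the potentially enormous dummy matrix for \(z_i\). Partialling \(z_i\) out of both \(y_i\) and \(x_i\) and then regressing the residuals on the residuals yields the same estimate as the full regression,

\[\hat{\beta} = \left(\tilde{X}'\tilde{X}\right)^{-1}\tilde{X}'\tilde{y}, \qquad \tilde{X} = X - Z\left(Z'Z\right)^{-}Z'X,\]

with \(\tilde{y}\) defined analogously. The absorbing estimator computes these residuals iteratively rather than forming \(\left(Z'Z\right)^{-}\) explicitly.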

This notebook shows how this type of model can be fit using a simulated data set that mirrors those used in practice. There are two effects: one for the worker's state (small) and one for the worker's firm (large).

[1]:
import numpy as np
import pandas as pd

rs = np.random.RandomState(0)
nobs = 1_000_000
# 50 states
state_id = rs.randint(50, size=nobs)
state_effects = rs.standard_normal(state_id.max() + 1)
state_effects = state_effects[state_id]
# 5 workers/firm, on average
firm_id = rs.randint(nobs // 5, size=nobs)
firm_effects = rs.standard_normal(firm_id.max() + 1)
firm_effects = firm_effects[firm_id]
cats = pd.DataFrame(
    {"state": pd.Categorical(state_id), "firm": pd.Categorical(firm_id)}
)
eps = rs.standard_normal(nobs)
x = rs.standard_normal((nobs, 2))
# Prepend a constant column to the two continuous regressors
x = np.column_stack([np.ones(nobs), x])
# True coefficients are all 1 on the constant and both regressors
y = x.sum(1) + firm_effects + state_effects + eps

Including a constant

The estimator can estimate an intercept even when a full set of dummies is included. This is done using a mathematical trick, and the resulting intercept is not usually meaningful: the estimate is produced as if the dummies had been orthogonalized to a constant.
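
A minimal sketch of the orthogonalization (illustrative only, not the library's internal code): demeaning each dummy column makes the dummies orthogonal to a column of ones, so a constant can be included alongside a full set of dummies without perfect collinearity.

import numpy as np
import pandas as pd

# Three groups with a full set of dummy columns
g = np.array([0, 0, 1, 1, 2, 2])
d = pd.get_dummies(g).to_numpy(dtype=float)
# Orthogonalize the dummies to the constant by demeaning each column
d_tilde = d - d.mean(0)
# Every demeaned column is orthogonal to the column of ones
print(np.ones(6) @ d_tilde)  # [0. 0. 0.]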

[2]:
from linearmodels.iv.absorbing import AbsorbingLS

mod = AbsorbingLS(y, x, absorb=cats)
print(mod.fit())
                         Absorbing LS Estimation Summary
==================================================================================
Dep. Variable:              dependent   R-squared:                          0.8377
Estimator:               Absorbing LS   Adj. R-squared:                     0.7975
No. Observations:             1000000   F-statistic:                     1.962e+06
Date:                Tue, Sep 24 2024   P-value (F-stat):                   0.0000
Time:                        09:29:31   Distribution:                      chi2(2)
Cov. Estimator:                robust   R-squared (No Effects):             0.6664
                                        Variables Absorbed:              1.987e+05
                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exog.0         0.9477     0.0009     1057.8     0.0000      0.9460      0.9495
exog.1         0.9994     0.0010     990.89     0.0000      0.9974      1.0014
exog.2         1.0008     0.0010     989.09     0.0000      0.9989      1.0028
==============================================================================

Excluding the constant

If the constant is dropped, the other coefficients are identical since the dummies span the constant.
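
To see why the dummies span the constant, note that the indicator columns for a single categorical variable sum to one in every row, so the column of ones is a linear combination of the dummies.

import numpy as np
import pandas as pd

# Each observation belongs to exactly one state, so the state
# dummies sum to 1 in every row
state_dummies = pd.get_dummies(cats["state"]).to_numpy()
assert np.all(state_dummies.sum(1) == 1)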

[3]:
from linearmodels.iv.absorbing import AbsorbingLS

mod = AbsorbingLS(y, x[:, 1:], absorb=cats)
print(mod.fit())
                         Absorbing LS Estimation Summary
==================================================================================
Dep. Variable:              dependent   R-squared:                          0.8377
Estimator:               Absorbing LS   Adj. R-squared:                     0.7975
No. Observations:             1000000   F-statistic:                     1.962e+06
Date:                Tue, Sep 24 2024   P-value (F-stat):                   0.0000
Time:                        09:29:39   Distribution:                      chi2(2)
Cov. Estimator:                robust   R-squared (No Effects):             0.6664
                                        Variables Absorbed:              1.987e+05
                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exog.0         0.9994     0.0010     990.89     0.0000      0.9974      1.0014
exog.1         1.0008     0.0010     989.09     0.0000      0.9989      1.0028
==============================================================================

Optimization Options

The residuals from the absorbed variables are estimated using either HDFE (high-dimensional fixed effects) or LSMR, depending on the variables included in the regression. HDFE is used when:

  • the model is unweighted; and

  • the absorbed regressors are all categorical (i.e., fixed effects).

If these conditions are not satisfied, then LSMR is used. LSMR can also be forced by setting method="lsmr" even when the conditions for HDFE are satisfied.
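
For example, a weighted model violates the first condition, so the fit below uses LSMR automatically. This is a sketch that assumes the weights keyword accepted by AbsorbingLS; the weights themselves are arbitrary.

# Positive observation weights make the model weighted, so the
# HDFE conditions fail and LSMR is selected automatically
w = rs.chisquare(5, size=nobs) / 5
weighted_mod = AbsorbingLS(y, x[:, 1:], absorb=cats, weights=w)
weighted_res = weighted_mod.fit()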

[4]:
import datetime as dt

from linearmodels.iv.absorbing import AbsorbingLS

mod = AbsorbingLS(y, x[:, 1:], absorb=cats)

start = dt.datetime.now()
res = mod.fit(use_cache=False, method="lsmr")
print(f"LSMR Second: {(dt.datetime.now() - start).total_seconds()}")

start = dt.datetime.now()
res = mod.fit()
print(f"HDFE Second: {(dt.datetime.now() - start).total_seconds()}")
LSMR Seconds: 4.799927
HDFE Seconds: 1.532557

LSMR is iterative and does not have a closed-form solution. The convergence tolerance can be set using absorb_options, a dictionary of keyword arguments passed to the solver. See scipy.sparse.linalg.lsmr for details on the available options.
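
For example, tighter convergence tolerances can be requested by forwarding scipy's atol and btol (a sketch; smaller tolerances generally require more iterations):

# atol and btol are passed through to scipy.sparse.linalg.lsmr
res_tight = mod.fit(method="lsmr", absorb_options={"atol": 1e-10, "btol": 1e-10})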

[5]:
mod = AbsorbingLS(y, x[:, 1:], absorb=cats)
res = mod.fit(method="lsmr", absorb_options={"show": True})

LSMR            Least-squares solution of  Ax = b

The matrix A has 1000000 rows and 198676 columns
damp = 0.00000000000000e+00

atol = 1.00e-08                 conlim = 1.00e+08

btol = 1.00e-08             maxiter =   198676


   itn      x(1)       norm r    norm Ar  compatible   LS      norm A   cond A
     0  0.00000e+00  2.417e+03  2.126e+03   1.0e+00  3.6e-04
     1  1.83446e+02  1.677e+03  6.377e+02   6.9e-01  3.1e-01  1.2e+00  1.0e+00
     2  2.09819e+02  1.563e+03  1.668e+02   6.5e-01  6.2e-02  1.7e+00  1.2e+00
     3  2.17247e+02  1.553e+03  6.446e+01   6.4e-01  2.1e-02  2.0e+00  1.3e+00
     4  2.30067e+02  1.551e+03  6.539e-01   6.4e-01  1.9e-04  2.2e+00  1.5e+00
     5  2.30071e+02  1.551e+03  2.662e-01   6.4e-01  7.0e-05  2.4e+00  1.5e+00
     6  2.29984e+02  1.551e+03  4.542e-03   6.4e-01  1.1e-06  2.6e+00  1.6e+00
     7  2.29984e+02  1.551e+03  3.999e-03   6.4e-01  9.6e-07  2.7e+00  2.5e+00
     8  2.29985e+02  1.551e+03  3.882e-03   6.4e-01  8.7e-07  2.9e+00  6.5e+00
     9  2.29985e+02  1.551e+03  3.882e-03   6.4e-01  8.3e-07  3.0e+00  5.1e+02
    10  2.29990e+02  1.551e+03  3.882e-03   6.4e-01  7.8e-07  3.2e+00  5.5e+02
    13  3.86056e+02  1.551e+03  4.003e-05   6.4e-01  7.2e-09  3.6e+00  8.7e+01

LSMR finished
The least-squares solution is good enough, given atol
istop =       2    normr = 1.6e+03
    normA = 3.6e+00    normAr = 4.0e-05
itn   =      13    condA = 8.7e+01
    normx = 2.3e+03
    13  3.86056e+02   1.551e+03  4.003e-05
   6.4e-01  7.2e-09   3.6e+00  8.7e+01

LSMR            Least-squares solution of  Ax = b

The matrix A has 1000000 rows and 198676 columns
damp = 0.00000000000000e+00

atol = 1.00e-08                 conlim = 1.00e+08

btol = 1.00e-08             maxiter =   198676


   itn      x(1)       norm r    norm Ar  compatible   LS      norm A   cond A
     0  0.00000e+00  1.001e+03  4.470e+02   1.0e+00  4.5e-04
     1  6.08847e-02  8.953e+02  4.167e+00   8.9e-01  4.7e-03  1.0e+00  1.0e+00
     2  2.15243e-01  8.953e+02  1.484e+00   8.9e-01  1.1e-03  1.5e+00  1.1e+00
     3  2.41447e-01  8.953e+02  6.412e-01   8.9e-01  4.0e-04  1.8e+00  1.4e+00
     4  2.88004e-01  8.953e+02  6.266e-03   8.9e-01  3.1e-06  2.2e+00  1.3e+00
     5  2.87319e-01  8.953e+02  2.921e-03   8.9e-01  1.4e-06  2.4e+00  1.4e+00
     6  2.87000e-01  8.953e+02  9.868e-04   8.9e-01  4.2e-07  2.6e+00  1.4e+00
     7  2.87108e-01  8.953e+02  9.866e-04   8.9e-01  4.2e-07  2.6e+00  6.2e+01
     8  2.87255e-01  8.953e+02  9.866e-04   8.9e-01  3.9e-07  2.9e+00  1.3e+02
     9  3.62098e-01  8.953e+02  9.857e-04   8.9e-01  3.7e-07  3.0e+00  7.2e+02
    10  9.04265e-01  8.953e+02  9.789e-04   8.9e-01  3.4e-07  3.2e+00  3.1e+01
    12  3.99381e+01  8.953e+02  2.309e-05   8.9e-01  7.4e-09  3.5e+00  3.1e+01

LSMR finished
The least-squares solution is good enough, given atol
istop =       2    normr = 9.0e+02
    normA = 3.5e+00    normAr = 2.3e-05
itn   =      12    condA = 3.1e+01
    normx = 6.0e+02
    12  3.99381e+01   8.953e+02  2.309e-05
   8.9e-01  7.4e-09   3.5e+00  3.1e+01

LSMR            Least-squares solution of  Ax = b

The matrix A has 1000000 rows and 198676 columns
damp = 0.00000000000000e+00

atol = 1.00e-08                 conlim = 1.00e+08

btol = 1.00e-08             maxiter =   198676


   itn      x(1)       norm r    norm Ar  compatible   LS      norm A   cond A
     0  0.00000e+00  1.000e+03  4.472e+02   1.0e+00  4.5e-04
     1 -9.29233e-01  8.946e+02  4.623e+00   8.9e-01  5.2e-03  1.0e+00  1.0e+00
     2 -7.45416e-01  8.946e+02  1.328e+00   8.9e-01  9.9e-04  1.5e+00  1.1e+00
     3 -8.34706e-01  8.946e+02  1.107e-01   8.9e-01  7.1e-05  1.7e+00  1.5e+00
     4 -8.41880e-01  8.946e+02  7.140e-03   8.9e-01  3.6e-06  2.2e+00  1.8e+00
     5 -8.42126e-01  8.946e+02  4.176e-03   8.9e-01  2.0e-06  2.4e+00  1.8e+00
     6 -8.41939e-01  8.946e+02  3.077e-03   8.9e-01  1.3e-06  2.6e+00  2.4e+00
     7 -8.41643e-01  8.946e+02  3.077e-03   8.9e-01  1.3e-06  2.6e+00  1.9e+02
     8 -8.40968e-01  8.946e+02  3.077e-03   8.9e-01  1.2e-06  2.9e+00  4.2e+02
     9  5.02689e-01  8.946e+02  3.060e-03   8.9e-01  1.1e-06  3.0e+00  7.6e+02
    10  1.66462e+01  8.946e+02  2.851e-03   8.9e-01  9.9e-07  3.2e+00  7.9e+01
    12  1.22889e+02  8.946e+02  2.454e-05   8.9e-01  7.8e-09  3.5e+00  7.9e+01

LSMR finished
The least-squares solution is good enough, given atol
istop =       2    normr = 8.9e+02
    normA = 3.5e+00    normAr = 2.5e-05
itn   =      12    condA = 7.9e+01
    normx = 1.3e+03
    12  1.22889e+02   8.946e+02  2.454e-05
   8.9e-01  7.8e-09   3.5e+00  7.9e+01