Absorbing Regression

An absorbing regression is a model of the form

\[y_i = x_i \beta + z_i \gamma +\epsilon_i\]

where interest is in \(\beta\) and not \(\gamma\). \(z_i\) may be high-dimensional, and its dimension may grow with the sample size (e.g., a matrix of fixed effects).
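
Since only \(\beta\) is of interest, \(z_i\) can be partialled out of both \(y_i\) and \(x_i\) before regressing one residual on the other (the Frisch-Waugh-Lovell result). The sketch below checks this on a small simulated problem using dense NumPy least squares. The sizes, variable names, and the `residualize` helper are illustrative assumptions, and this is not how the library itself computes the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, groups = 2000, 50  # hypothetical sizes, small enough for dense algebra

# One set of group dummies (z) and one regressor of interest (x)
group = rng.integers(groups, size=n)
z = np.zeros((n, groups))
z[np.arange(n), group] = 1.0
# Make x correlated with the group effects so that omitting z would matter
x = rng.standard_normal((n, 1)) + 0.5 * rng.standard_normal(groups)[group, None]
y = x[:, 0] + rng.standard_normal(groups)[group] + rng.standard_normal(n)

# Full regression of y on [x, z]; keep only the coefficient on x
beta_full = np.linalg.lstsq(np.column_stack([x, z]), y, rcond=None)[0][0]


def residualize(v):
    """Return v minus its least-squares projection on the dummies z."""
    return v - z @ np.linalg.lstsq(z, v, rcond=None)[0]


# Absorb z from both sides, then regress residual on residual
y_res, x_res = residualize(y), residualize(x)
beta_absorbed = np.linalg.lstsq(x_res, y_res, rcond=None)[0][0]

print(beta_full, beta_absorbed)
```

The two estimates agree up to floating point error. With tens of thousands of dummies, the dense projection used here becomes impractical, which is why a sparse iterative solver is used in the cells that follow.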

This notebook shows how this type of model can be fit to a simulated data set that mirrors some used in practice. There are two effects: one for the state of the worker (small) and one for the worker's firm (large).

[1]:
import numpy as np
import pandas as pd
rs = np.random.RandomState(0)
nobs = 250000
# State effects: 50 categories, one effect per state mapped to each worker
state_id = rs.randint(50, size=nobs)
state_effects = rs.standard_normal(state_id.max()+1)
state_effects = state_effects[state_id]
# 5 workers/firm, on average
firm_id = rs.randint(nobs//5, size=nobs)
firm_effects = rs.standard_normal(firm_id.max()+1)
firm_effects = firm_effects[firm_id]
# Categorical variables to absorb
cats = pd.DataFrame({'state': pd.Categorical(state_id), 'firm': pd.Categorical(firm_id)})
eps = rs.standard_normal(nobs)
# Two random regressors plus a constant in the first column
x = rs.standard_normal((nobs,2))
x = np.column_stack([np.ones(nobs), x])
# All coefficients in the simulated model, including the constant, are 1
y = x.sum(1) + firm_effects + state_effects + eps

Including a constant

The estimator can estimate an intercept even when all dummies are included. This is done using a mathematical trick, and the intercept is not usually meaningful. The estimate is computed as if the dummies were orthogonalized to a constant.

[2]:
from linearmodels.iv.absorbing import AbsorbingLS

mod = AbsorbingLS(y, x, absorb=cats)
print(mod.fit())
                         Absorbing LS Estimation Summary
==================================================================================
Dep. Variable:              dependent   R-squared:                          0.8462
Estimator:               Absorbing LS   Adj. R-squared:                     0.8080
No. Observations:              250000   F-statistic:                     4.944e+05
Date:                Fri, Jan 10 2020   P-value (F-stat):                   0.0000
Time:                        15:02:17   Distribution:                      chi2(2)
Cov. Estimator:                robust   R-squared (No Effects):             0.6665
                                        Varaibles Absorbed:               4.97e+04
                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exog.0         0.7737     0.0018     432.47     0.0000      0.7702      0.7772
exog.1         1.0012     0.0020     498.01     0.0000      0.9973      1.0051
exog.2         0.9985     0.0020     494.86     0.0000      0.9946      1.0025
==============================================================================

Excluding the constant

If the constant is dropped, the other coefficients are identical since the dummies span the constant.

[3]:
from linearmodels.iv.absorbing import AbsorbingLS

mod = AbsorbingLS(y, x[:,1:], absorb=cats)
print(mod.fit())
                         Absorbing LS Estimation Summary
==================================================================================
Dep. Variable:              dependent   R-squared:                          0.8462
Estimator:               Absorbing LS   Adj. R-squared:                     0.8080
No. Observations:              250000   F-statistic:                     4.944e+05
Date:                Fri, Jan 10 2020   P-value (F-stat):                   0.0000
Time:                        15:02:17   Distribution:                      chi2(2)
Cov. Estimator:                robust   R-squared (No Effects):             0.6665
                                        Varaibles Absorbed:               4.97e+04
                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exog.0         1.0012     0.0020     498.01     0.0000      0.9973      1.0051
exog.1         0.9985     0.0020     494.86     0.0000      0.9946      1.0025
==============================================================================
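
The statement that the dummies span the constant can be checked directly: each observation belongs to exactly one category, so the dummy columns sum to a column of ones. The snippet below is a minimal sketch on a made-up seven-observation example. The data and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical category labels for seven observations
state_id = np.array([0, 1, 2, 1, 0, 2, 2])

# Full set of dummies: row i has a single 1 in column state_id[i]
dummies = np.eye(state_id.max() + 1)[state_id]
ones = np.ones(len(state_id))

# Each row contains exactly one 1, so the columns add up to the constant
row_sums = dummies.sum(axis=1)

# Projecting the constant on the dummies leaves nothing behind
coef = np.linalg.lstsq(dummies, ones, rcond=None)[0]
resid = ones - dummies @ coef

# Appending the constant does not increase the rank of the design
rank_dummies = np.linalg.matrix_rank(dummies)
rank_with_const = np.linalg.matrix_rank(np.column_stack([ones, dummies]))

print(row_sums, np.abs(resid).max(), rank_dummies, rank_with_const)
```

Because the constant lies in the column space of the dummies, absorbing the dummies removes it as well, which is consistent with the slope estimates being the same in the two fits above.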

Optimization Options

LSMR is an iterative solver, so the effects are removed numerically up to a tolerance rather than through a closed-form expression. The tolerance can be set using lsmr_options, which is a dictionary. See scipy.sparse.linalg.lsmr for details on the options.

Below, use_cache is set to False to ensure that LSMR is run. By default, the exogenous variables with the effects purged are cached. LSMR is run once for the dependent variable and once for each column in exog.

[4]:
from linearmodels.iv.absorbing import AbsorbingLS

mod = AbsorbingLS(y, x[:,1:], absorb=cats)
res = mod.fit(use_cache=False, lsmr_options={'show': True})

LSMR            Least-squares solution of  Ax = b

The matrix A has   250000 rows  and    49702 cols
damp = 0.00000000000000e+00

atol = 1.00e-08                 conlim = 1.00e+08

btol = 1.00e-08             maxiter =    49702


   itn      x(1)       norm r    norm Ar  compatible   LS      norm A   cond A
     0  0.00000e+00  1.205e+03  1.030e+03   1.0e+00  7.1e-04
     1  1.49975e+01  8.358e+02  3.105e+02   6.9e-01  3.1e-01  1.2e+00  1.0e+00
     2 -1.34026e+00  7.825e+02  9.467e+01   6.5e-01  7.2e-02  1.7e+00  1.1e+00
     3 -1.27956e+00  7.758e+02  3.632e+01   6.4e-01  2.4e-02  2.0e+00  1.3e+00
     4 -3.42453e+00  7.745e+02  7.430e-01   6.4e-01  4.3e-04  2.2e+00  1.4e+00
     5 -3.42346e+00  7.745e+02  3.038e-01   6.4e-01  1.6e-04  2.4e+00  1.4e+00
     6 -3.34778e+00  7.745e+02  1.196e-02   6.4e-01  5.8e-06  2.6e+00  1.5e+00
     7 -3.34682e+00  7.745e+02  1.145e-02   6.4e-01  5.6e-06  2.7e+00  3.9e+00
     8 -3.34463e+00  7.745e+02  1.135e-02   6.4e-01  5.1e-06  2.9e+00  1.1e+01
     9 -3.34066e+00  7.745e+02  1.135e-02   6.4e-01  4.9e-06  3.0e+00  2.7e+02
    10 -3.27727e+00  7.745e+02  1.134e-02   6.4e-01  4.6e-06  3.2e+00  1.4e+02
    14  6.73835e+01  7.745e+02  1.395e-05   6.4e-01  4.8e-09  3.8e+00  4.0e+01

LSMR finished
The least-squares solution is good enough, given atol
istop =       2    normr = 7.7e+02
    normA = 3.8e+00    normAr = 1.4e-05
itn   =      14    condA = 4.0e+01
    normx = 1.1e+03
    14  6.73835e+01   7.745e+02  1.395e-05
   6.4e-01  4.8e-09   3.8e+00  4.0e+01

LSMR            Least-squares solution of  Ax = b

The matrix A has   250000 rows  and    49702 cols
damp = 0.00000000000000e+00

atol = 1.00e-08                 conlim = 1.00e+08

btol = 1.00e-08             maxiter =    49702


   itn      x(1)       norm r    norm Ar  compatible   LS      norm A   cond A
     0  0.00000e+00  4.998e+02  2.222e+02   1.0e+00  8.9e-04
     1  9.84373e-01  4.477e+02  4.487e+00   9.0e-01  1.0e-02  1.0e+00  1.0e+00
     2  7.02800e-01  4.477e+02  1.358e+00   9.0e-01  2.0e-03  1.5e+00  1.1e+00
     3  7.45878e-01  4.477e+02  3.404e-02   9.0e-01  4.4e-05  1.7e+00  1.5e+00
     4  7.47005e-01  4.477e+02  1.167e-02   9.0e-01  1.3e-05  2.1e+00  1.5e+00
     5  7.46062e-01  4.477e+02  8.204e-03   9.0e-01  8.2e-06  2.2e+00  1.7e+00
     6  7.46033e-01  4.477e+02  6.687e-03   9.0e-01  6.0e-06  2.5e+00  1.9e+00
     7  7.46672e-01  4.477e+02  6.686e-03   9.0e-01  5.6e-06  2.6e+00  9.3e+01
     8  7.48764e-01  4.477e+02  6.686e-03   9.0e-01  5.2e-06  2.9e+00  1.9e+02
     9  2.38297e+00  4.477e+02  6.554e-03   9.0e-01  4.9e-06  3.0e+00  2.1e+02
    10  1.37603e+01  4.477e+02  5.544e-03   9.0e-01  3.9e-06  3.2e+00  4.2e+01
    13  4.24118e+01  4.477e+02  1.348e-06   9.0e-01  8.4e-10  3.6e+00  4.2e+01

LSMR finished
The least-squares solution is good enough, given atol
istop =       2    normr = 4.5e+02
    normA = 3.6e+00    normAr = 1.3e-06
itn   =      13    condA = 4.2e+01
    normx = 4.7e+02
    13  4.24118e+01   4.477e+02  1.348e-06
   9.0e-01  8.4e-10   3.6e+00  4.2e+01

LSMR            Least-squares solution of  Ax = b

The matrix A has   250000 rows  and    49702 cols
damp = 0.00000000000000e+00

atol = 1.00e-08                 conlim = 1.00e+08

btol = 1.00e-08             maxiter =    49702


   itn      x(1)       norm r    norm Ar  compatible   LS      norm A   cond A
     0  0.00000e+00  4.993e+02  2.227e+02   1.0e+00  8.9e-04
     1 -1.20186e+00  4.469e+02  4.551e+00   9.0e-01  1.0e-02  1.0e+00  1.0e+00
     2 -1.16915e+00  4.469e+02  1.577e+00   9.0e-01  2.3e-03  1.5e+00  1.2e+00
     3 -1.19516e+00  4.469e+02  7.706e-01   9.0e-01  9.3e-04  1.9e+00  1.4e+00
     4 -1.33274e+00  4.469e+02  1.378e-02   9.0e-01  1.4e-05  2.2e+00  1.2e+00
     5 -1.33265e+00  4.469e+02  6.946e-03   9.0e-01  6.5e-06  2.4e+00  1.4e+00
     6 -1.33214e+00  4.469e+02  3.654e-03   9.0e-01  3.1e-06  2.6e+00  1.6e+00
     7 -1.33174e+00  4.469e+02  3.653e-03   9.0e-01  3.1e-06  2.6e+00  6.4e+01
     8 -1.33107e+00  4.469e+02  3.653e-03   9.0e-01  2.9e-06  2.9e+00  1.2e+02
     9 -1.04042e+00  4.469e+02  3.629e-03   9.0e-01  2.7e-06  3.0e+00  2.8e+02
    10  1.28105e+00  4.469e+02  3.437e-03   9.0e-01  2.4e-06  3.2e+00  3.1e+01
    13  2.14317e+01  4.469e+02  1.210e-06   9.0e-01  7.5e-10  3.6e+00  3.1e+01

LSMR finished
The least-squares solution is good enough, given atol
istop =       2    normr = 4.5e+02
    normA = 3.6e+00    normAr = 1.2e-06
itn   =      13    condA = 3.1e+01
    normx = 3.2e+02
    13  2.14317e+01   4.469e+02  1.210e-06
   9.0e-01  7.5e-10   3.6e+00  3.1e+01
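
To see what the tolerance options control, the sketch below calls scipy.sparse.linalg.lsmr directly on a small sparse dummy matrix with a loose and a tight setting of atol and btol. The problem size, the variable names, and the comparison against group means are assumptions made for illustration. The sketch does not reproduce the library's internal preprocessing.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsmr

rng = np.random.default_rng(0)
n, groups = 10_000, 200  # hypothetical sizes

# Sparse dummy matrix with a single nonzero entry per row
group = rng.integers(groups, size=n)
A = sp.csr_matrix((np.ones(n), (np.arange(n), group)), shape=(n, groups))
b = rng.standard_normal(groups)[group] + rng.standard_normal(n)

# lsmr returns (x, istop, itn, normr, normar, norma, conda, normx)
loose = lsmr(A, b, atol=1e-2, btol=1e-2)
tight = lsmr(A, b, atol=1e-10, btol=1e-10)

# With a single set of dummies, the exact least-squares solution is the group mean
counts = np.bincount(group, minlength=groups)
means = np.bincount(group, weights=b, minlength=groups) / counts

print("loose: itn =", loose[2], "max error =", np.abs(loose[0] - means).max())
print("tight: itn =", tight[2], "max error =", np.abs(tight[0] - means).max())
```

The same keyword names (atol, btol, conlim, maxiter, show) are the ones documented for scipy.sparse.linalg.lsmr, so a dictionary such as {'atol': 1e-10, 'btol': 1e-10} is the kind of value lsmr_options is described as accepting.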