# Using formulas to specify models

## Basic Usage

Formulas provide an alternative method to specify a model. The formulas used here utilize [formulaic](https://github.com/matthewwardrop/formulaic/) ([documentation](https://matthewwardrop.github.io/formulaic/)) and are similar to those in [statsmodels](http://www.statsmodels.org), although they use an enhanced syntax to allow identification of endogenous regressors. The basic formula syntax for a single-variable regression is

```
y ~ 1 + x
```

where the `1` indicates that a constant should be included and `x` is the regressor. In the context of an instrumental variables model, it is necessary to mark variables as endogenous and to provide a list of instruments that are included only in the model for the endogenous variables. In a basic single regressor model, this would be specified using `[]` to surround an inner model.

```
y ~ 1 + [x ~ z]
```

In this expression, `x` is now marked as endogenous and `z` is an instrument. Any exogenous variable is automatically used when instrumenting `x`, so there is no need to repeat the exogenous variables inside the brackets (in this example, the "first stage" would include a constant and `z`).
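To make the role of the instrument concrete, the following is a minimal pure-Python sketch of the textbook two-stage calculation that a formula such as `y ~ 1 + [x ~ z]` describes. The toy data, the confounder `u`, and the helpers `mean` and `slope` are hypothetical and chosen so the arithmetic is exact. It illustrates the idea only and is not a description of how linearmodels computes its estimates.

```python
# Toy data (hypothetical): u is an unobserved confounder that makes x endogenous.
z = [-1.0, -1.0, 1.0, 1.0]  # instrument, constructed to be uncorrelated with u
u = [-1.0, 1.0, -1.0, 1.0]  # confounder that enters both x and y
x = [zi + ui for zi, ui in zip(z, u)]  # endogenous regressor
y = [1.0 + 2.0 * xi + 3.0 * ui for xi, ui in zip(x, u)]  # true slope on x is 2


def mean(values):
    return sum(values) / len(values)


def slope(regressor, outcome):
    """Slope from a least-squares fit of outcome on a constant and regressor."""
    r_bar, o_bar = mean(regressor), mean(outcome)
    num = sum((r - r_bar) * (o - o_bar) for r, o in zip(regressor, outcome))
    den = sum((r - r_bar) ** 2 for r in regressor)
    return num / den


# First stage: regress x on a constant and z, then form the fitted values.
first_stage_slope = slope(z, x)
first_stage_const = mean(x) - first_stage_slope * mean(z)
x_hat = [first_stage_const + first_stage_slope * zi for zi in z]

# Second stage: regress y on a constant and the first-stage fitted values.
iv_slope = slope(x_hat, y)

# Naive OLS of y on x ignores the endogeneity and is biased in this toy setup.
ols_slope = slope(x, y)

print(f"OLS slope: {ols_slope:.2f}, IV slope: {iv_slope:.2f}")
```

In this constructed example the two-stage estimate recovers the slope of 2 used to generate `y`, while the naive regression of `y` on `x` does not, because `x` is correlated with the omitted `u`.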

## Multiple Endogenous Variables
Multiple endogenous variables are specified in a similar manner. The basic concept is that any model can be expressed as 
```
dep ~ exog + [ endog ~ instruments]
```

and it must be the case that 

```
dep ~ exog + endog
```
and
```
dep ~ exog + instruments
```

are valid formulaic formulas. This means that multiple endogenous regressors or instruments should be joined with `+`, but that the first endogenous variable and the first instrument should not have a leading `+`. A simple example with 2 endogenous variables and 3 instruments would be

```
y ~ 1 + x1 + x2 + x3 + [ x4 + x5 ~ z1 + z2 + z3]
```

In this example, the "submodels" `y ~ 1 + x1 + x2 + x3 + x4 + x5` and `y ~ 1 + x1 + x2 + x3 + z1 + z2 + z3` are both valid formulaic expressions.
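The substitution rule above can be sketched with a few lines of string handling. The helper `split_iv_formula` below is hypothetical and written only for illustration. It assumes a single bracketed block, performs no validation of individual terms, and is not the parser used by linearmodels.

```python
import re


def split_iv_formula(formula):
    """Return the two submodels implied by 'dep ~ exog + [endog ~ instruments]'.

    Hypothetical helper for illustration only: it assumes exactly one
    bracketed block and does not check that the terms are valid.
    """
    match = re.search(r"\[(.*?)~(.*?)\]", formula)
    if match is None:
        raise ValueError("formula has no [endog ~ instruments] block")
    endog, instruments = (part.strip() for part in match.groups())
    prefix = formula[: match.start()]
    suffix = formula[match.end():]
    # Substitute each side of the inner model for the bracketed block,
    # then normalize whitespace.
    with_endog = " ".join((prefix + endog + suffix).split())
    with_instruments = " ".join((prefix + instruments + suffix).split())
    return with_endog, with_instruments


formula = "y ~ 1 + x1 + x2 + x3 + [ x4 + x5 ~ z1 + z2 + z3]"
for submodel in split_iv_formula(formula):
    print(submodel)
```

Seen this way, the reason for the leading-`+` rule is visible: a block written as `[ + x4 ~ z1]` would expand to a submodel containing `+ + x4`.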

## Standard formulaic
Aside from this change, the standard rules of formulaic apply, so it is possible to use mathematical expressions and other formulaic-specific features. See the [formulaic quickstart](https://matthewwardrop.github.io/formulaic/guides/quickstart/) for some examples of what is possible.

## MEPS data

This example shows the use of formulas to estimate both IV and OLS models using the [Medical Expenditure Panel Survey](https://meps.ahrq.gov). The model measures the effect of various characteristics on the log of drug expenditure. It instruments the variable that measures whether a subject was insured through an employer or union (`hi_empunion`) with their social security to income ratio (`ssiratio`).

This first block imports NumPy and `IV2SLS`, loads the data, and drops rows with missing values.

In [None]:
import numpy as np
from linearmodels.datasets import meps
from linearmodels.iv import IV2SLS

data = meps.load()
data = data.dropna()
print(meps.DESCR)

### Estimating a model with a formula

This model uses a formula which is input using the `from_formula` interface. Unlike direct initialization, this interface takes the formula and a DataFrame containing the data necessary to evaluate the formula.

In [None]:
formula = (
 "ldrugexp ~ 1 + totchr + female + age + linc + blhisp + [hi_empunion ~ ssiratio]"
)
mod = IV2SLS.from_formula(formula, data)

In [None]:
iv_res = mod.fit(cov_type="robust")
print(iv_res)

### Mathematical expression in formulas

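Standard formulaic syntax, such as mathematical expressions, can be used directly. Here the dependent variable is written as `np.log(drugexp)` instead of the `ldrugexp` column used above, and `female` is omitted from the regressors.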

In [None]:
formula = (
 "np.log(drugexp) ~ 1 + totchr + age + linc + blhisp + [hi_empunion ~ ssiratio]"
)
mod = IV2SLS.from_formula(formula, data)
iv_res2 = mod.fit(cov_type="robust")

### OLS

Omitting the block that marks a variable as endogenous produces OLS estimates, just like passing `None` for both `endog` and `instruments`.

In [None]:
formula = "ldrugexp ~ 1 + totchr + female + age + linc + blhisp + hi_empunion"
ols = IV2SLS.from_formula(formula, data)
ols_res = ols.fit(cov_type="robust")
print(ols_res)

### Comparing results

The function `compare` can be used to compare the results of multiple models. Here, dropping `female` from the IV regression (the `IV-formula` column) improves the $R^2$.

In [None]:
from linearmodels.iv import compare

print(compare({"IV": iv_res, "OLS": ols_res, "IV-formula": iv_res2}))