{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using formulas to specify models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage\n",
    "\n",
    "Formulas provide an alternative method to specify a model.  The formulas used here utilize [formulaic](https://github.com/matthewwardrop/formulaic/) ([documentation](https://matthewwardrop.github.io/formulaic/)) are similar to those in [statsmodels](http://www.statsmodels.org), although they use an enhanced syntax to allow identification of endogenous regressors.  The basis formula syntax for a single variable regression would be\n",
    "\n",
    "```\n",
    "y ~ 1 + x\n",
    "```\n",
    "\n",
    "where the `1` indicates that a constant should be included and `x` is the regressor.  In the context of an instrumental variables model, it is necessary to mark variables as endogenous and to provide a list of instruments that are included only in the model for the endogenous variables.  In a basic single regressor model, this would be specified using `[]` to surround an inner model.\n",
    "\n",
    "```\n",
    "y ~ 1 + [x ~ z]\n",
    "```\n",
    "\n",
    "In this expression, `x` is now marked as endogenous and `z` is an instrument.  Any exogenous variable will automatically be used when instrumenting `x` so there is no need to repeat these here (in this example, the \"first stage\" would include a constant and z).\n",
    "\n",
    "## Multiple Endogenous Variables\n",
    "Multiple endogenous variables are specified in a similar manner.  The basic concept is that any model can be expressed as \n",
    "```\n",
    "dep ~ exog + [ endog ~ instruments]\n",
    "```\n",
    "\n",
    "and it must be the case that \n",
    "\n",
    "```\n",
    "dep ~ exog + endog\n",
    "```\n",
    "and\n",
    "```\n",
    "dep ~ exog + instruments\n",
    "```\n",
    "\n",
    "are valid formulaic formulas. This means that multiple endogenous regressors or instruments should be joined with `+`, but that the first endogenous or first instrument should not have a leading `+`.  A simple example with 2 endogenous variables and 3 instruments would be\n",
    "\n",
    "```\n",
    "y ~ 1 + x1 + x2 + x3  + [ x4 + x5 ~ z1 + z2 + z3]\n",
    "```\n",
    "\n",
    "In this example, the \"submodels\" `y ~ 1 + x1 + x2 +x3 + x4 + x5` and `y ~ 1 + x1 + x2 + x3 + z1 + z2 +z3` are both valid formulaic expressions.\n",
    "\n",
    "## Standard formulaic\n",
    "Aside from this change, the standard rules of formulaic apply, and so it is possible to use mathematical expression or other formulaic-specific features. See the [formulaic quickstart](https://matthewwardrop.github.io/formulaic/guides/quickstart/) for some examples of what is possible."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## MEPS data\n",
    "\n",
    "This example shows the use of formulas to estimate both IV and OLS models using the [medical expenditure panel survey](https://meps.ahrq.gov). The model measures the effect of various characteristics on the log of drug expenditure and instruments the variable that measures where a subject was insured through a union with their social security to income ratio.\n",
    "\n",
    "This first block imports the data and numpy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from linearmodels.datasets import meps\n",
    "from linearmodels.iv import IV2SLS\n",
    "\n",
    "data = meps.load()\n",
    "data = data.dropna()\n",
    "print(meps.DESCR)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Estimating a model with a formula\n",
    "\n",
    "This model uses a formula which is input using the `from_formula` interface. Unlike direct initialization, this interface takes the formula and a DataFrame containing the data necessary to evaluate the formula."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "formula = (\n",
    "    \"ldrugexp ~ 1 + totchr + female + age + linc + blhisp + [hi_empunion ~ ssiratio]\"\n",
    ")\n",
    "mod = IV2SLS.from_formula(formula, data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iv_res = mod.fit(cov_type=\"robust\")\n",
    "print(iv_res)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Mathematical expression in formulas\n",
    "\n",
    "Standard formulaic syntax, such as using mathematical expressions, can be readily used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "formula = (\n",
    "    \"np.log(drugexp) ~ 1 + totchr + age + linc + blhisp + [hi_empunion ~ ssiratio]\"\n",
    ")\n",
    "mod = IV2SLS.from_formula(formula, data)\n",
    "iv_res2 = mod.fit(cov_type=\"robust\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### OLS\n",
    "\n",
    "Omitting the block that marks a variable as endogenous will produce OLS -- just like using `None` for both `endog` and `instruments`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "formula = \"ldrugexp ~ 1 + totchr + female + age + linc + blhisp + hi_empunion\"\n",
    "ols = IV2SLS.from_formula(formula, data)\n",
    "ols_res = ols.fit(cov_type=\"robust\")\n",
    "print(ols_res)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Comparing results\n",
    "\n",
    "The function `compare` can be used to compare the result of multiple models.  Here dropping `female` from the IV regression improves the $R^2$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from linearmodels.iv import compare\n",
    "\n",
    "print(compare({\"IV\": iv_res, \"OLS\": ols_res, \"IV-formula\": iv_res2}))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}