{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using formulas to specify models\n", "\n", "Formulas can be used to specify models using mostly standard [formulaic](https://matthewwardrop.github.io/formulaic/) syntax. Since specifying a system is more complicated than specifying a single model, there are two methods available to specify a system:\n", "\n", "* Dictionary of formulas\n", "* Single formula with each equation enclosed in braces `{}`\n", "\n", "These examples use data on fringe benefits from F. Vella (1993), \"A Simple Estimator for Simultaneous Models with Censored\n", "Endogenous Regressors\", which appears in Wooldridge (2002). The model consists of two equations, one for hourly earnings and the other for hourly benefits. The two equations share most of their regressors, but each also includes some that the other omits." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from linearmodels.datasets import fringe\n", "\n", "data = fringe.load()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dictionary\n", "\n", "The dictionary syntax is virtually identical to standard [formulaic syntax](https://matthewwardrop.github.io/formulaic/): each equation is specified as a key-value pair in which the key is the equation label and the value is the formula. It is recommended to use an `OrderedDict`, which will preserve equation order in results. Keys **must** be strings." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import OrderedDict\n", "\n", "formula = OrderedDict()\n", "formula[\n", " \"benefits\"\n", "] = \"hrbens ~ educ + exper + expersq + union + south + nrtheast + nrthcen + male\"\n", "formula[\"earnings\"] = \"hrearn ~ educ + exper + expersq + nrtheast + married + male\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from linearmodels.system import SUR\n", "\n", "mod = SUR.from_formula(formula, data)\n", "print(mod.fit(cov_type=\"unadjusted\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Curly Braces {}\n", "\n", "The same formula can be expressed in a single string by surrounding each equation with braces `{}`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "braces_formula = \"\"\"\n", "{hrbens ~ educ + exper + expersq + union + south + nrtheast + nrthcen + male}\n", "{hrearn ~ educ + exper + expersq + nrtheast + married + male}\n", "\"\"\"\n", "braces_mod = SUR.from_formula(braces_formula, data)\n", "braces_res = braces_mod.fit(cov_type=\"unadjusted\")\n", "print(braces_res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Labeled Formulas\n", "\n", "When using the curly brace formula specification, the equation names are determined by the dependent variable names. When names are repeated as is the case in some datasets (e.g. a SUR on GDP of multiple countries) then the equation labels will be modified until they are unique. 
This can produce meaningless equation labels, and so it is possible to pass an equation label using the syntax\n", "\n", "```\n", "{label : dep ~ exog}\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labeled_formula = \"\"\"\n", "{benefits: hrbens ~ educ + exper + expersq + union + south + nrtheast + nrthcen + male}\n", "{earnings: hrearn ~ educ + exper + expersq + nrtheast + married + male}\n", "\"\"\"\n", "labels_mod = SUR.from_formula(labeled_formula, data)\n", "labeled_res = labels_mod.fit(cov_type=\"unadjusted\")\n", "\n", "print(\"Unlabeled\")\n", "print(braces_res.equation_labels)\n", "print(\"Labeled\")\n", "print(labeled_res.equation_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other Options" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Estimation Weights\n", "\n", "SUR supports weights, which are assumed to be proportional to the inverse variance of the data so that\n", "\n", "$$ V(y_i \\times \\sqrt{w_i}) = \\sigma^2 \\,\\,\\forall i.$$\n", "\n", "Weights can be passed using a `DataFrame` where each column holds the weights for one equation and is named after that equation's label.\n", "\n", "Here the results are printed to show that the estimates differ from those in the standard GLS model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random_weights = np.random.chisquare(5, size=(616, 2))\n", "random_weights = pd.DataFrame(random_weights, columns=[\"benefits\", \"earnings\"])\n", "weighted_mod = SUR.from_formula(formula, data, weights=random_weights)\n", "print(weighted_mod.fit())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prespecified Residual Covariance\n", "\n", "Like a standard SUR, it is possible to pass a prespecified residual covariance for use in the GLS step. This is done using the keyword argument `sigma` in the `from_formula` method, and is otherwise identical to passing one to the standard SUR." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.0" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 4 }