{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Forecasting with Exogenous Regressors\n", "\n", "This notebook provides examples of the accepted data structures for passing the expected value of exogenous variables when these are included in the mean. For example, consider an AR(1) with 2 exogenous variables. The mean dynamics are\n", "\n", "$$ Y_t = \\phi_0 + \\phi_1 Y_{t-1} + \\beta_0 X_{0,t} + \\beta_1 X_{1,t} + \\epsilon_t. $$\n", "\n", "The $h$-step forecast, $E_{T}[Y_{t+h}]$, depends on the conditional expectation of $X_{0,T+h}$ and $X_{1,T+h}$,\n", "\n", "$$ E_{T}[Y_{T+h}] = \\phi_0 + \\phi_1 E_{T}[Y_{T+h-1}] + \\beta_0 E_{T}[X_{0,T+h}] +\\beta_1 E_{T}[X_{1,T+h}] $$\n", "\n", "where $E_{T}[Y_{T+h-1}]$ has been recursively computed.\n", "\n", "In order to construct forecasts up to some horizon $h$, it is necessary to pass $2\\times h$ values ($h$ for each series). If using the features of `forecast` that allow many forecast to be specified, it necessary to supply $n \\times 2 \\times h$ values.\n", "\n", "There are two general purpose data structures that can be used for any number of exogenous variables and any number steps ahead:\n", "\n", "* `dict` - The values can be pass using a `dict` where the keys are the variable names and the values are 2-dimensional arrays. This is the most natural generalization of a pandas `DataFrame` to 3-dimensions.\n", "* `array` - The vales can alternatively be passed as a 3-d NumPy `array` where dimension 0 tracks the regressor index, dimension 1 is the time period and dimension 2 is the horizon.\n", "\n", "When a model contains a single exogenous regressor it is possible to use a 2-d array or `DataFrame` where dim0 tracks the time period where the forecast is generated and dimension 1 tracks the horizon.\n", "\n", "In the special case where a model contains a single regressor _and_ the horizon is 1, then a 1-d array or pandas Series can be used." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn\n", "\n", "seaborn.set_style(\"darkgrid\")\n", "plt.rc(\"figure\", figsize=(16, 6))\n", "plt.rc(\"savefig\", dpi=90)\n", "plt.rc(\"font\", family=\"sans-serif\")\n", "plt.rc(\"font\", size=14)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Simulating data\n", "\n", "Two $X$ variables are simulated and are assumed to follow independent AR(1) processes. The data is then assumed to follow an ARX(1) with 2 exogenous regressors and GARCH(1,1) errors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from arch.univariate import ARX, GARCH, ZeroMean, arch_model\n", "\n", "burn = 250\n", "\n", "x_mod = ARX(None, lags=1)\n", "x0 = x_mod.simulate([1, 0.8, 1], nobs=1000 + burn).data\n", "x1 = x_mod.simulate([2.5, 0.5, 1], nobs=1000 + burn).data\n", "\n", "resid_mod = ZeroMean(volatility=GARCH())\n", "resids = resid_mod.simulate([0.1, 0.1, 0.8], nobs=1000 + burn).data\n", "\n", "phi1 = 0.7\n", "phi0 = 3\n", "y = 10 + resids.copy()\n", "for i in range(1, y.shape[0]):\n", " y[i] = phi0 + phi1 * y[i - 1] + 2 * x0[i] - 2 * x1[i] + resids[i]\n", "\n", "x0 = x0.iloc[-1000:]\n", "x1 = x1.iloc[-1000:]\n", "y = y.iloc[-1000:]\n", "y.index = x0.index = x1.index = np.arange(1000)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Plotting the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "ax = pd.DataFrame({\"ARX\": y}).plot(legend=False)\n", "ax.legend(frameon=False)\n", "_ = ax.set_xlim(0, 999)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Forecasting the X values\n", "\n", "The forecasts of $Y$ depend on forecasts of $X_0$ and $X_1$. Both of these follow simple AR(1), and so we can construct the forecasts for all time horizons. Note that the value in position `[i,j]` is the time-`i` forecast for horizon `j+1`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "x0_oos = np.empty((1000, 10))\n", "x1_oos = np.empty((1000, 10))\n", "for i in range(10):\n", " if i == 0:\n", " last = x0\n", " else:\n", " last = x0_oos[:, i - 1]\n", " x0_oos[:, i] = 1 + 0.8 * last\n", " if i == 0:\n", " last = x1\n", " else:\n", " last = x1_oos[:, i - 1]\n", " x1_oos[:, i] = 2.5 + 0.5 * last\n", "\n", "x1_oos[-1]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Fitting the model\n", "\n", "Next, the model is fit. The parameters are precisely estimated." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "exog = pd.DataFrame({\"x0\": x0, \"x1\": x1})\n", "mod = arch_model(y, x=exog, mean=\"ARX\", lags=1)\n", "res = mod.fit(disp=\"off\")\n", "print(res.summary())" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Using a `dict`\n", "\n", "The first approach uses a dict to pass the two variables. The key consideration here is the the keys of the dictionary must **exactly** match the variable names (`x0` and `x1` here). The dictionary here contains only the final row of the forecast values since `forecast` will only make forecasts beginning from the final in-sample observation by default.\n", "\n", "### Using `DataFrame`\n", "\n", "While these examples make use of NumPy arrays, these can be `DataFrames`. This allows the index to be used to track the forecast origination point, which can be a helpful device." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "exog_fcast = {\"x0\": x0_oos[-1:], \"x1\": x1_oos[-1:]}\n", "forecasts = res.forecast(horizon=10, x=exog_fcast)\n", "forecasts.mean.T.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using an `array`\n", "\n", "An array can alternatively be used. This frees the restriction on matching the variable names although the order must match instead. The forecast values are 2 (variables) by 1 (forecast) by 10 (horizon)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "exog_fcast = np.array([x0_oos[-1:], x1_oos[-1:]])\n", "print(f\"The shape is {exog_fcast.shape}\")\n", "array_forecasts = res.forecast(horizon=10, x=exog_fcast)\n", "print(array_forecasts.mean - forecasts.mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Producing multiple forecasts\n", "\n", "`forecast` can produce multiple forecasts using the same fit model. Here the model is fit to the first 500 observations and then forecasting for the remaining values are produced. It must be the case that the `x` values passed for `forecast` have the same number of rows as the table of forecasts produced.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "res = mod.fit(disp=\"off\", last_obs=500)\n", "exog_fcast = {\"x0\": x0_oos[-500:], \"x1\": x1_oos[-500:]}\n", "multi_forecasts = res.forecast(start=500, horizon=10, x=exog_fcast)\n", "multi_forecasts.mean.tail(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot of the final 5 forecast paths shows the the mean reversion of the process." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "_ = multi_forecasts.mean.tail().T.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previous example made use of dictionaries where each of the values was a 500 (number of forecasts) by 10 (horizon) array. The alternative format can be used where `x` is a 3-d array with shape 2 (variables) by 500 (forecasts) by 10 (horizon)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "exog_fcast = np.array([x0_oos[-500:], x1_oos[-500:]])\n", "print(exog_fcast.shape)\n", "array_multi_forecasts = res.forecast(start=500, horizon=10, x=exog_fcast)\n", "np.max(np.abs(array_multi_forecasts.mean - multi_forecasts.mean))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `x` input array sizes\n", "\n", "While the natural shape of the `x` data is the number of forecasts, it is also possible to pass an `x` that has the same shape as the `y` used to construct the model. The may simplify tracking the origin points of the forecast. Values are are not needed are ignored. In this example, the out-of-sample values are 2 by 1000 (original number of observations) by 10. Only the final 500 are used. \n", "\n", "