Forecast reconciliation

What if our forecasts should follow linear constraints?

Imagine you work at a retail company and you are responsible for forecasting total sales for the coming weeks. The company is spread across the country, and forecasts must be made for every city and state.

However, when the national and regional sales teams meet, your forecasts aren't consistent. "How can the sum of state-level sales be larger than the national figure?" Which forecasts should they base their decisions on?

Forecast reconciliation methods are simple and effective solutions to that problem. When you have a hierarchy of time series that respects linear constraints (such as state sales summing to national sales), you may forecast each level independently and then reconcile the results.

Let's suppose our company is based in Europe and has two branches, one in France and one in Italy, and we were asked for a seven-day-ahead forecast.

The intuition

Consider this dataset containing two years of sales of a company:

Figure 1. A synthetic dataset of sales. The company has branches in two countries: France and Italy. Each curve was built as the cumulative sum of draws from a Gaussian distribution.

Total sales in Europe are the sum of French and Italian sales, so our series should follow the linear constraint

y_{europe}(t) = y_{france}(t) + y_{italy}(t)

Note that this constraint defines a 2D plane in a three-dimensional space: to be considered coherent, our observations and forecasts must lie on that plane.
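To make this concrete, here is a minimal sketch of how a coherent synthetic dataset like this one could be generated (the seed, drifts, and starting levels are arbitrary assumptions, not the values behind the figures):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)                              # arbitrary seed
days = pd.date_range("2022-01-01", periods=730, freq="D")    # two years of daily sales

# Bottom-level series: cumulative sums of Gaussian draws with a small drift
france = 100 + rng.normal(loc=0.5, scale=2.0, size=len(days)).cumsum()
italy = 80 + rng.normal(loc=0.3, scale=1.5, size=len(days)).cumsum()

sales = pd.DataFrame({"france": france, "italy": italy}, index=days)
sales["europe"] = sales["france"] + sales["italy"]           # coherent by construction

# Every observation lies on the plane y_europe(t) = y_france(t) + y_italy(t)
assert np.allclose(sales["europe"], sales["france"] + sales["italy"])
```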

Figure 2. Visualizing our dataset. The axes are our sales in Italy, France and Europe. Each point represents a daily observation. You can play with the cursor to visualize the plane that emerges from the hierarchical structure of our data.

Applying well-known statistical forecasting methods such as ARIMA and ETS won't necessarily generate estimates that adhere to that constraint. The figure below illustrates how forecasts of an ARIMA model can diverge from it.

Figure 3. Visualizing 14-days-ahead forecasts of ARIMA. The blue plane represents the coherent subspace, where observations should lie to respect the linear constraint defined by the problem.
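For reference, base forecasts like the ones in Figure 3 come from fitting one model per series, independently. A minimal sketch using statsmodels' ARIMA (the order (1, 1, 1) is an arbitrary choice, and `sales` is the hypothetical DataFrame from the previous snippet):

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

horizon = 14
base_forecasts = {}
for column in ["europe", "france", "italy"]:
    # One independent ARIMA model per level of the hierarchy
    fitted = ARIMA(sales[column], order=(1, 1, 1)).fit()
    base_forecasts[column] = fitted.forecast(steps=horizon)

base_forecasts = pd.DataFrame(base_forecasts)
# Nothing constrains base_forecasts["europe"] to equal france + italy,
# so these forecasts are, in general, incoherent.
```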

It seems we can't simply rely on those methods: forecasting each level independently is not enough. How can we guarantee forecasts that are both accurate and coherent?

Forecasting coherently

The literature provides two main approaches to forecast coherently: the single-level approach and reconciliation methods (Athanasopoulos et al., 2020).

Single-level approaches

Single-level approaches choose one level of the hierarchy to produce estimates and then expand the forecasts to the other levels.

The bottom-up approach requires forecasts for the most disaggregated levels of the hierarchy. Let $n$ be the total number of series and $m$ the number of bottom-level series; in our sales example, $n = 3$ and $m = 2$. The coherent forecasts are given by:

y_{T+h|T} = S\,b_{T+h|T}

where $y_{T+h|T} \in \mathbb{R}^n$ holds the h-step-ahead forecasts for all series and $b_{T+h|T} \in \mathbb{R}^m$ the h-step-ahead forecasts for the bottom-level ones. $S$ is the summing matrix that captures the hierarchy constraints. For our sales forecasting problem it is:

S = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}

and

y_t = \begin{pmatrix} y^{europe}_t \\ y^{france}_t \\ y^{italy}_t \end{pmatrix}, \quad b_t = \begin{pmatrix} y^{france}_t \\ y^{italy}_t \end{pmatrix}
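In code, the bottom-up approach boils down to a single matrix product with S. A minimal numpy sketch with made-up bottom-level forecasts:

```python
import numpy as np

# Summing matrix: rows follow the (europe, france, italy) order
S = np.array([[1, 1],
              [1, 0],
              [0, 1]])

# Hypothetical h-step-ahead bottom-level forecasts for (france, italy)
b_hat = np.array([120.0, 80.0])

y_coherent = S @ b_hat       # coherent forecasts for (europe, france, italy)
print(y_coherent)            # [200. 120.  80.]
```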

These bottom levels, however, are usually the most volatile, with lower signal-to-noise ratios. Another approach is the top-down one, which forecasts the top-level series $\hat{t}_{T+h|T}$ and then disaggregates it to build forecasts for the lower levels of the hierarchy:

\hat{b}_{T+h|T} = p\,\hat{t}_{T+h|T}

The proportions $p \in \mathbb{R}^m$ may be estimated from historical observations (Gross & Sohl, 1990). This approach, on the other hand, cannot capture the particularities of the other levels of the hierarchy. Single-level approaches have the advantage of being simpler, but they don't exploit the information available in the other levels. Reconciliation methods address this by using the correlations among all series to produce coherent estimates.
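A corresponding top-down sketch, where the proportions are taken as the historical average share of each country in European sales (one of the choices discussed by Gross & Sohl, 1990; all numbers here are hypothetical):

```python
import numpy as np

S = np.array([[1, 1],
              [1, 0],
              [0, 1]])

# Historical average shares of (france, italy) in total European sales
p = np.array([0.6, 0.4])

# Hypothetical h-step-ahead forecast of the top-level (European) series
t_hat = 210.0

b_hat = p * t_hat            # disaggregated bottom-level forecasts: [126., 84.]
y_coherent = S @ b_hat       # coherent forecasts for (europe, france, italy)
```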

Forecast reconciliation

Remember how our observations lay on a plane in Figure 2 and incoherent forecasts fell outside it in Figure 3. Let $\mathbf{s} \subset \mathbb{R}^n$ be this coherent subspace, an m-dimensional space in which the linear constraints hold. Our h-step-ahead forecasts $y_{T+h|T}$ are coherent if $y_{T+h|T} \in \mathbf{s}$. If we take our base forecasts and project them onto the coherent subspace, we get coherent forecasts.

The work of Panagiotelis et al., 2021 provides this geometric view of forecast reconciliation. They reframe the reconciliation problem as "what should our projection matrix be so that our error metric $E$ is reduced?". If you want to ensure your forecasts won't be worse after reconciliation, you may enjoy the properties of orthogonal projection. This procedure, however, treats all errors in the hierarchy as equally important, which is not always the case. The forecast of total sales, for example, might be more important than sales in a specific country. Or maybe France plays a critical role in the business.

In such cases, you may go for oblique projections. They work as if the less relevant axes were stretched before proceeding with an orthogonal projection. There's also a special kind of oblique projection that minimizes the expected error, using an estimate of the error covariance matrix (Wickramasuriya et al., 2019). It is called MinT reconciliation.

Orthogonal projection

The orthogonal projection has the interesting property of being the solution in $\mathbf{s}$ closest to the base forecasts $\hat{y}_t$. In other words, it solves the optimization problem:

\min \limits_{\tilde{y}_t \in \mathbf{s}} \vert\vert \hat{y}_t - \tilde{y}_t \vert\vert

where $\tilde{y}_t$ is the reconciled forecast. Hereafter, we'll use a tilde for forecasts after reconciliation and a hat for base forecasts. The projection matrix $P_s \in \mathbb{R}^{n \times n}$ can be built directly from the matrix $S$ that captures the hierarchical structure:

P_s = SG = S(S^TS)^{-1}S^T

where $G \in \mathbb{R}^{m \times n}$ maps $\hat{y}_{T+h|T}$ to $\mathbb{R}^m$, producing new bottom-level forecasts.

We can easily reconcile our forecasts:

\tilde{y}_{T+h|T} = P_s \hat{y}_{T+h|T}
Figure 4. Projecting base forecasts onto the coherent subspace. The yellow dots are the reconciled forecasts, obtained by orthogonal projection of the red dots.

If one of the requirements is to change the base forecasts as little as possible, orthogonal projection is the way to go. However, it treats all errors in the hierarchy as equally important.
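A minimal numpy sketch of orthogonal-projection reconciliation for our three-series hierarchy (the base forecasts are hypothetical and deliberately incoherent):

```python
import numpy as np

S = np.array([[1, 1],
              [1, 0],
              [0, 1]])

# Orthogonal projection onto the coherent subspace: P_s = S (S^T S)^{-1} S^T
P_s = S @ np.linalg.inv(S.T @ S) @ S.T

# Hypothetical incoherent base forecasts for (europe, france, italy)
y_hat = np.array([205.0, 120.0, 80.0])

y_tilde = P_s @ y_hat
# The reconciled forecasts are coherent again: europe == france + italy
assert np.isclose(y_tilde[0], y_tilde[1] + y_tilde[2])
```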

Oblique projection

When you want to avoid some errors in the hierarchy more than others, oblique projection is a good choice. It can be seen as an orthogonal projection under another norm, $\vert\vert v \vert\vert_W = \sqrt{v^TWv}$, where $W$ is a positive definite matrix. In that case, the projection matrix is:

P_s = S(S^TWS)^{-1}S^TW = SG_W

Note that $W = \mathbb{I}$ recovers the orthogonal projection in Euclidean space. Pythagoras' theorem guarantees that $\tilde{y}_t$ will always be at least as close to the true $y_t$ as the base forecast $\hat{y}_t$ in terms of $\vert\vert \cdot \vert\vert_W$:

Figure 5. Visualizing Pythagoras theorem and its impacts on orthogonal projection.

If, for example, we apply an oblique projection to our sales forecasts with $W = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 1 \end{bmatrix}$, we obtain the following result:

Figure 6. Another approach for projecting base forecasts onto the coherent subspace. The green dots are forecasts reconciled by oblique projection of the red dots, our base forecasts.
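The same sketch with an oblique projection and the W used above. Larger diagonal entries of W mean the corresponding coordinates of the base forecasts are adjusted less during reconciliation:

```python
import numpy as np

S = np.array([[1, 1],
              [1, 0],
              [0, 1]])

# Adjustments to europe and france are penalized four times as much as adjustments to italy
W = np.diag([4.0, 4.0, 1.0])

# Oblique projection onto the coherent subspace: P_s = S (S^T W S)^{-1} S^T W
P_s = S @ np.linalg.inv(S.T @ W @ S) @ S.T @ W

y_hat = np.array([205.0, 120.0, 80.0])   # same hypothetical base forecasts as before
y_tilde = P_s @ y_hat
assert np.isclose(y_tilde[0], y_tilde[1] + y_tilde[2])   # still coherent
```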

MinT reconciliation

While those solutions guarantee estimates that are no worse, they aren't optimal in terms of expected error. They assume a distance metric and find the coherent forecasts that change the base forecasts as little as possible. The MinT reconciliation approach, on the other hand, minimizes the expected distance between the reconciled forecasts and the true values, even if that means changing the base forecasts considerably. The optimization problem it tries to solve is

\begin{align*} \min_{\tilde{y}_{T + h | T} \in \mathbf{s}} \mathbb{E}[\,\vert\vert y_{T + h | T} - \tilde{y}_{T + h | T} \vert\vert\,] &= \min \mathbb{E}[\,\vert\vert \tilde{e}_{T + h | T} \vert\vert\,] \\ &= \min_G \mathbb{E}[\,\vert\vert SG\hat{e}_{T + h | T} \vert\vert\,] \\ &= \min_G tr(SG\Sigma_hG^TS^T) \end{align*}

(1)

where

\Sigma_h = \mathbb{E}[(\hat{e}_{T+h|T} - \mathbb{E}[\hat{e}_{T+h|T}])(\hat{e}_{T+h|T} - \mathbb{E}[\hat{e}_{T+h|T}])^T]

$\Sigma_h$ is the covariance matrix of the h-step-ahead base forecast errors. Equation 1 says that, by minimizing the trace of the covariance matrix of $\tilde{e}_{T+h|T}$, we reduce the expected error after reconciliation. The optimal $\tilde{y}_{T+h|T}$ is obtained by an oblique projection of $\hat{y}_{T+h|T}$:

\tilde{y}_{T+h|T} = P_s^{MinT} \hat{y}_{T+h|T}

where

\begin{align*} P^{MinT}_s &= SG_{MinT} \\ &= S(S^T\Sigma_h^{-1} S)^{-1}S^T\Sigma_h^{-1} \end{align*}
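A sketch of MinT reconciliation with numpy, assuming we already have a matrix of in-sample base-forecast errors (rows are time points, columns follow the (europe, france, italy) order) from which an estimate of $\Sigma_h$ can be computed; estimator choices are discussed next:

```python
import numpy as np

S = np.array([[1, 1],
              [1, 0],
              [0, 1]])

def mint_reconcile(y_hat, sigma):
    """Reconcile base forecasts y_hat given an estimate sigma of the error covariance."""
    sigma_inv = np.linalg.inv(sigma)
    G = np.linalg.inv(S.T @ sigma_inv @ S) @ S.T @ sigma_inv
    return S @ G @ y_hat

# Hypothetical in-sample base-forecast errors, shape (T, 3)
errors = np.random.default_rng(0).normal(size=(200, 3))
sigma_sample = np.cov(errors, rowvar=False)     # naive sample covariance estimate

y_hat = np.array([205.0, 120.0, 80.0])
y_tilde = mint_reconcile(y_hat, sigma_sample)
```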

Still, we have to estimate $\Sigma_h$, which is unknown. The literature provides two choices (Athanasopoulos et al., 2020):

  • $\Sigma_h = \frac{1}{T}\sum\limits_{t=1}^T \hat{e}_{t+h|t}\,\hat{e}_{t+h|t}^T$, the naive unconstrained estimator, also known as the sample covariance matrix. While this method is simple and works well for small hierarchies, it does not provide reliable results when the number of series grows. The reconciliation method with this estimator is called MinT(sample).

  • An improved estimator of the covariance matrix proposed by Schäfer & Strimmer, 2005 and by Ledoit & Wolf, 2004 (which has one of the funniest titles I've ever seen). The shrinkage estimator is $\Sigma^{(shrunk)}_h = \lambda\,diag(\Sigma_h) + (1 - \lambda)\Sigma_h$, and the shrinkage intensity parameter $\lambda$ can be computed from the estimated correlations $\hat{r}_{ij}$:

\lambda = \frac{\sum\limits_{i\neq j}Var(\hat{r}_{ij})}{\sum\limits_{i\neq j}\hat{r}^2_{ij}}

This estimator safeguards the positive-definiteness of the covariance matrix and works better when the number of observations is not considerably larger than the number of series in the hierarchy. The reconciliation method with this estimator is called MinT(shrink).
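A hedged sketch of the shrinkage estimator (function and variable names are mine, not from any library; the variance of each correlation coefficient is estimated following Schäfer & Strimmer, 2005). The resulting matrix can be plugged into the MinT sketch above in place of the sample covariance:

```python
import numpy as np

def shrunk_covariance(errors):
    """Shrink the sample covariance of in-sample errors toward its diagonal,
    as in MinT(shrink). `errors` has shape (T, n): T time points, n series."""
    T, n = errors.shape
    sigma = np.cov(errors, rowvar=False)                    # sample covariance
    x = (errors - errors.mean(axis=0)) / errors.std(axis=0, ddof=1)
    r = (x.T @ x) / (T - 1)                                 # sample correlation matrix
    # Estimated variance of each correlation coefficient
    w = x[:, :, None] * x[:, None, :]                       # shape (T, n, n)
    var_r = T / (T - 1) ** 3 * ((w - w.mean(axis=0)) ** 2).sum(axis=0)
    off_diag = ~np.eye(n, dtype=bool)
    lam = np.clip(var_r[off_diag].sum() / (r[off_diag] ** 2).sum(), 0.0, 1.0)
    # Shrink off-diagonal entries toward zero, keep the variances untouched
    return lam * np.diag(np.diag(sigma)) + (1.0 - lam) * sigma
```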

Figure 7. Visualizing MinT(sample) reconciliation. The rosé points are the reconciled forecasts.

Conclusion

This post was meant as a brief introduction to hierarchical forecasting. Several advances in reconciliation methods have been made in recent years, and we should expect more to come. Implementing these methods is fairly simple, and the benefits are clear. The fable package in R implements them, but it should be straightforward to code them from scratch in other languages. Feel free to contact me if you need some help.

Appendix

Proof of orthogonal projection matrix

The optimization problem is:

\min_{\tilde{y} \in \mathbf{s}} \vert\vert \hat{y} - \tilde{y} \vert\vert = \min_x \vert\vert \hat{y} - Sx \vert\vert

where $\mathbf{s} = span(S)$. Minimizing the norm is equivalent to minimizing its square, which we expand:

\vert\vert \hat{y} - Sx \vert\vert^2 = (\hat{y} - Sx)^T(\hat{y} - Sx) = \hat{y}^T\hat{y} - 2\hat{y}^TSx + x^TS^TSx

\begin{align*} \frac{d\vert\vert \hat{y} - Sx \vert\vert^2}{dx} &= -2S^T\hat{y} + 2 S^TSx = 0 \\ &\Rightarrow S^T\hat{y} = S^TSx \\ &\Rightarrow x = (S^TS)^{-1}S^T\hat{y} \end{align*}

So

\tilde{y} = Sx = S(S^TS)^{-1}S^T\hat{y}

Proof of oblique projection matrix

The optimization problem is

\min_{\tilde{y} \in \mathbf{s}} \vert\vert \hat{y} - \tilde{y} \vert\vert_W = \min_x \vert\vert \hat{y} - Sx \vert\vert_W

where $\mathbf{s} = span(S)$. Again, expanding the squared norm:

\vert\vert \hat{y} - Sx \vert\vert_W^2 = (\hat{y} - Sx)^TW(\hat{y} - Sx) = \hat{y}^TW\hat{y} - 2\hat{y}^TWSx + x^TS^TWSx

\begin{align*} \frac{d\vert\vert \hat{y} - Sx \vert\vert_W^2}{dx} &= -2S^TW\hat{y} + 2 S^TWSx = 0 \\ &\Rightarrow S^TW\hat{y} = S^TWSx \\ &\Rightarrow x = (S^TWS)^{-1}S^TW\hat{y} \end{align*}

So

\tilde{y} = Sx = S(S^TWS)^{-1}S^TW\hat{y}

Proof of MinT

The optimization problem is

\min_{\tilde{y} \in \mathbf{s}} \mathbb{E}[\,\vert\vert y - \tilde{y}\vert\vert\,] = \min \mathbb{E}[\,\vert\vert \tilde{e} \vert\vert\,]

But

\begin{align*} \tilde{e} &= y - \tilde{y} = y - SG\hat{y} \\ &= y - SG(y - \hat{e}) \\ &= (\mathbb{I} - SG)y + SG\hat{e} \\ &= SG\hat{e} \end{align*}

because $y \in N(\mathbb{I} - SG)$: the true values are coherent, so they lie in $span(S)$ and are left unchanged by the projection $SG$.

Thus, assuming the base forecasts are unbiased (so that $\mathbb{E}[\tilde{e}] = 0$):

\begin{align*} \mathbb{E}[\,\vert\vert \tilde{e} \vert\vert^2] &= tr(\mathbb{E}[\;(\tilde{e} - \mathbb{E}[\tilde{e}])(\tilde{e} - \mathbb{E}[\tilde{e}])^T\;]) \\ &= tr(\mathbb{E}[\;(SG\hat{e} - \mathbb{E}[SG\hat{e}])(SG\hat{e} - \mathbb{E}[SG\hat{e}])^T\;]) \\ &= tr(SG\mathbb{E}[\;(\hat{e} - \mathbb{E}[\hat{e}])(\hat{e} - \mathbb{E}[\hat{e}])^T\;]G^TS^T) \\ &= tr(SG\,\Sigma\, G^TS^T) \end{align*}

The optimization problem can be written in terms of minimization of the trace:

\min_{\tilde{y} \in \mathbf{s}} \mathbb{E}[\,\vert\vert y - \tilde{y}\vert\vert\,] = \min \mathbb{E}[\,\vert\vert \tilde{e} \vert\vert\,] = \min_G tr(SG\,\Sigma\, G^TS^T)

Following the proof in the appendix of Wickramasuriya et al., 2019, the optimal G is:

G = (S^T\Sigma^{-1} S)^{-1}S^T\Sigma^{-1}
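For completeness, here is a sketch of why this is the minimizer. Without any constraint the trace would be trivially minimized by $G = 0$, so the minimization is carried out subject to $GS = \mathbb{I}$, which (since $S$ has full column rank) is equivalent to the unbiasedness condition used in the original proof. With a Lagrange multiplier matrix $\Lambda$:

\begin{align*} \mathcal{L}(G, \Lambda) &= tr(SG\Sigma G^TS^T) - 2\,tr(\Lambda^T(GS - \mathbb{I})) \\ \frac{\partial \mathcal{L}}{\partial G} &= 2S^TSG\Sigma - 2\Lambda S^T = 0 \;\Rightarrow\; G = (S^TS)^{-1}\Lambda S^T\Sigma^{-1} \\ GS = \mathbb{I} &\;\Rightarrow\; \Lambda = S^TS(S^T\Sigma^{-1}S)^{-1} \;\Rightarrow\; G = (S^T\Sigma^{-1}S)^{-1}S^T\Sigma^{-1} \end{align*}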

References

  1. Athanasopoulos, G., Gamakumara, P., Panagiotelis, A., Hyndman, R. J., & Affan, M. (2020). Hierarchical Forecasting. Advanced Studies in Theoretical and Applied Econometrics, 52(February), 689–719. https://doi.org/10.1007/978-3-030-31150-6_21
  2. Panagiotelis, A., Athanasopoulos, G., Gamakumara, P., & Hyndman, R. J. (2021). Forecast reconciliation: A geometric view with new insights on bias correction. International Journal of Forecasting, 37(1), 343–359. https://doi.org/10.1016/j.ijforecast.2020.06.004
  3. Gross, C. W., & Sohl, J. E. (1990). Disaggregation methods to expedite product line forecasting. Journal of Forecasting, 9(3), 233–254. https://doi.org/10.1002/for.3980090304
  4. Wickramasuriya, S. L., Athanasopoulos, G., & Hyndman, R. J. (2019). Optimal Forecast Reconciliation for Hierarchical and Grouped Time Series Through Trace Minimization. Journal of the American Statistical Association, 114(526), 804–819. https://doi.org/10.1080/01621459.2018.1448825
  5. Banerjee, S., & Roy, A. (2014). Linear algebra and matrix analysis for statistics. Crc Press.
  6. Schäfer, J., & Strimmer, K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1).
  7. Ledoit, O., & Wolf, M. (2004). Honey, I shrunk the sample covariance matrix. Journal of Portfolio Management, 30(4), 1–22. https://doi.org/10.3905/jpm.2004.110