Top Links
Journal of Biostatistics and Biometric Applications
ISSN: 2455-765X
Comparison of Statistical Techniques for Forecasting Malaria Cases in Ghana
Copyright: © 2019 Twumasi-Ankrah S. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Related article at Pubmed, Google Scholar
Background and Aim: The purpose of this study was to determine an appropriate statistical technique for forecasting the monthly Malaria cases in Ghana.
Methods and Materials: Monthly data spanning from January 2008 to December 2017 were obtained from the District Health Information Management System (DHIMS) 2, Ghana Health Service. The four competing forecasting techniques that were applied to the Malaria cases data were the Seasonal Autoregressive Integrated Moving Average (SARIMA), Artificial Neural Network (ANN), Exponential smoothing (ETS) and a Combination technique. The four competing forecasting techniques were compared using their respective forecast accuracy measures in order to choose the appropriate technique for forecasting Malaria cases in Ghana.
Results: It was observed that the SARIMA technique was the appropriate statistical technique. The “best” model for forecasting the monthly malaria cases in Ghana was SARIMA (2, 1, 0) (2, 0, 0)12 which passed all the required diagnostic tests.
Conclusion: A two-year monthly forecast from the “best” model revealed that, in 2018, we should expect a decrease in Malaria cases in the last quarter but should expect an increase in Malaria cases during the first half of 2019.
Keywords: Malaria Cases; SARIMA; Artificial Neural Network; Forecast Accuracy; Exponential Smoothing
Malaria is one of the key micro parasitic diseases causing human mortality in the world [1,2]. According to [3], there were an estimated 216 million cases of malaria in 91 countries in 2016, this accounted for about 80% of the global malaria burden; and an increase of 5 million cases over 2015 [3]. About 90% of the world’s 300 - 500 million malaria cases and 1.5 - 2.7 million deaths due to malaria annually are from the sub-Saharan Africa [4,5]. The rate of malaria as indicated by [5] still records 40% of all outpatient attendance, with the most vulnerable people being under 5 children and pregnant women [6,7]. There were an estimated 445,000 deaths from malaria globally in 2016 as compared to 446, 000 estimated deaths in 2015 [3]. The Sub-Saharan Africa, accounted for 91% of all malaria deaths in 2016, followed by the South - East Asia Region (6%) [3].
In 1995, the estimated total cost of malaria to Africa was US$ 1.8 billion and US$ 2 billion in 1997 [8]. In 2016, 74% of an estimated US$ 2.7 billion invested in malaria control and elimination efforts globally was spent on Sub-Sahara African countries [3]. Thus, malaria is not only a public health problem but also a developmental problem.
In Ghana, malaria is the number one cause of morbidity accounting for 40 - 60% of out-patient [6]. According to [18], the burden of malaria in Ghana in 2002 was estimated at US$ 2.63 per capita or US$ 13.51 per households [18]. The 2015–2020 Ghana Malaria Strategic Plan which was recently adopted aims at reducing malaria burden by 75.0% [9,10]. According to [11] there are few readily available mathematical models to support decisions for planning and evaluation of these strategies remains a challenge [11]. Thus, forecast of malaria cases is vital for its control and intervention. Because of its devastating effects, some researchers in Ghana have tried to come up with appropriate models to forecast malaria cases, which will help policy makers. The gaps in some of the research articles on forecasting malaria cases in Ghana are discussed.
Takyi Appiah S, et al., indicated that the ARIMA (2, 1, 1) model was the “best” fitted model for forecasting the number of malaria cases in the Ejisu-Juaben Municipality for a period of two years from 2014 and 2016. The authors employed only the graphical method to examine whether the data was stationary or not, which is not evidence enough [12].
Alhassan EA, et al., identified an ARIMA (1, 0, 1) model for forecasting malaria cases in Kasena Nankana municipality in Ghana. They indicated that malaria cases in the Navrongo municipality were growing at a constant quadratic rate. However, their “best” model did not capture the trend nature of the data, hence assumed a non-stationary data as stationary [13].
Furthermore, Bosson-Amedenu S used the Box-Jenkins (ARIMA) methodology on a monthly malaria incidence data of children below the age of five in Edum Banso. The choice of the order of the differencing was not justified since only the graphical approach was used to check for stationarity [14].
Again, Anokye R, ε., used the quadratic model for forecasting the half year incidence of Malaria whiles ARIMA (1, 1, 2) was used for forecasting monthly malaria incidence for the years 2018 and 2019 in Kumasi Metropolis [15]. A major drawback on their work is that, in the ARIMA modelling, the authors treated their data as a stationary data whereas the data was non-stationary and thus require differencing (that is, their data exhibited a quadratic trend and was also a seasonal data).
From the above literatures, it is obvious that several of these studies only considered the graphical method to decide whether a series was stationary and not conducting a formal test of hypothesis like the unit root test (e.g. the Augmented Dickey Fuller Test). Also, none of these studies considered different forecasting techniques in order to select the most appropriate one; and again, the above studies did not consider forecasting malaria cases for the entire country. Hence this study seeks to develop and validate an appropriate forecasting technique that can be used to forecast malaria transmission in Ghana. This will bring out a solid model by comparing several competing models using different forecasting techniques. This study employs three different forecasting techniques: Autoregressive Integrated Moving Average (ARIMA) model, Artificial Neural Network (ANN) and Exponential smoothing approach (ETS) to model and forecast malaria cases in Ghana.
Data used for the study are monthly malaria cases spanning from January 2008 to December 2017 and were obtained from the Ghana Health Service database called the District Health Information Management System (DHIMS)2. For the purpose of cross validation of forecasting techniques, we used the data period from 2008 to 2015 to fit the model while 2016 and 2017 period was used for out-sample forecast accuracy measure.
The Unit Root Test: The unit root test considered in this study is the Augmented Dickey-Fuller (ADF) test which has the null hypothesis that the time series has a unit root (that is, it is not stationary). Thus, the alternative hypothesis is that the time series does not have a unit root (that is, it is stationary). This test is made up of approximating the regression model below:
Where εt is a pure white noise error term and the ADF test follows an asymptotic distribution.
Autoregressive Integrated Moving Average (ARIMA) Model: A non-stationary ARMA (p + d, q) process is written in a lag operator form as:
where the roots of φ'(L)yt is the lag operator form of autoregressive model, while θ(L)εt is the lag operator form of the moving average. In similar instances, equation (2) can be written as a stationary process ωt so that φ'(L)ωt=θ(L)εt, where ωt=∇dYt.It can be said that yt is an ARIMA (p,d,q) and ωt is an ARMA (p,q). Conditionally, the tail off of the ACF and the PACF implies a mixed ARMA model.
Exponential Smoothing Technique (ETS): The various parts or components of time series include the trend (τ), cycle (c), seasonal (s), and irregular or error (ε) parts. All these could be added together in a diverse number of ways. A purely additive model can be expressed as
Where the three parts are combined to form the series that is seen. The trend component, an amalgamation of a level term (τl) and a growth term (τg) at all times, is the beginning component in exponential smoothing. If the error part is overlooked, there are precisely fifteen exponential smoothing methods as shown in the table below.
To be clear, according to Hyndman RJ, Athanasopoulos G, the simple exponential smoothing method is defined by cell (N, N), Holt’s linear method by cell (A, N), the damped trend method by cell (Ad, N), Holt-Winters’ additive method by cell (A, A), and Holt-Winters’ multiplicative method is given by cell (A, M) [16].
In this study, the Corrected Akaike information criterion (AICc) was used, and it is given by:
where L^(θk) is the likelihood of the fitted model,Kp1=+representing size of model, p = total of parameters, and n is the total number of observation.
When the ‘best’ model is selected, it is important to access the model for likely inconsistencies. Having fitted the model, residuals are found to access the likely inconsistencies, as well as help to modify the selected model. It is necessary that the residuals satisfy the model assumptions. If this is not the case, then the model which has been fitted will not be statistically correct.
Root Mean Square Error (RMSE): RMSE is a measure of spread of the forecast errors about the actual data points. This implies that the RMSE informs how far or near the forecasted values of an estimated model are from the real data points. The formula is given as:
where Y^t is the forecasted values, Yt the actual data points, an N is the sample size.
Mean Absolute Percentage Error (MAPE): MAPE is a measure of the size of error of a forecast in percentage. It is used to measure the accuracy of a forecast using the formula below:
For replication purposes, the principles were followed:
1) The series is plotted to observe the various features
2) Augmented Dickey-Fuller test is used to test for stationarity in the series
3) For each of the forecasting techniques, competing models are fitted to the malaria cases series; and the “best” model is selected by the minimum information criterion.
4) The “best” model from each of the forecasting techniques in step (3) is compared using their predictive performance or lowest accuracy measure to select the “best” method for forecasting malaria cases in Ghana.
5) The “best” forecasting technique in step (4) is used to forecast the malaria cases in Ghana.
Firstly, we employed three different univariate time series techniques in this section to model and forecast malaria incidence in Ghana. The time series techniques are ARIMA, ETS and ANN. We compared these techniques by using their predictive performance or lowest accuracy measure to select the “best” method for forecasting malaria cases in Ghana. The R software, specifically forecast 8.3 package Hyndman R, et al., is used to run the time series models. For each forecasting technique, appropriate competing models are constructed and their AICc’s recorded. The model with the least AICc is chosen as the ‘best’ model for forecasting the malaria time series data [17].
Time Series Plot: Generally, in Figure 1, there is a strong upward trend in malaria cases from 2008 to 2012, whereas from 2013 to 2017, the trend changes to a decreasing one. This is an indication that the malaria cases series is not stationary.
Seasonal Plot: In Figure 2, there is an observed sharp decrease in malaria cases at the start of each year (i.e., January and February) but relatively higher as compared to the cases at the end of the previous year. An upward trend in malaria cases tends to occur from March every year through to the middle (July) of each year where the annual peak mostly occurs. Finally, a downward trend is observed typically in the last quarter of every year as the number of malaria cases at the start of that year reduces by the end of the same quarter. Remarkably, malaria cases tend to be evidently lower during the dry season in Ghana than the wet season.
Test of Stationarity: As observed in Figures 1 and 2, the trend and seasonality imply that our time series data is non-stationary. This is confirmed by the Augmented Dickey-Fuller test in Table 2 which resulted in a p-value (0.3315) greater than 5% significance. Thus, we do not have enough evidence to reject the null hypothesis that the data was non-stationary. Nevertheless, a first difference of the Malaria data made it stationary, as confirmed by the same test.
Model Selection: In model building, the standard practice is to fit more than one model to the dataset in order to choose the “best” model. With the “differencing” information acquired at the test of stationarity in Table 2, a set of competing models and their respective information criterion are given in Table 3. The model with the minimum AICc value among the competing models is chosen as the “best” model, therefore SARIMA (0, 1, 2) (0, 1, 1)12 is the “best” ARIMA model.
The appropriate exponential smoothing technique for the malaria series is the Holt-Winters’ seasonal method. This is mainly because of the observed trend and seasonality in Figure 1 and 2. For the analysis, five competing exponential smoothing models were fitted to the Malaria data, all exhibiting either multiplicative error or additive error. Table 4 summarizes each of the competing models and their respective information criterion.
Conclusively, the ‘best’ model is ETS (M, Ad, M), because it has the least AICc value. This implies that the model that best fits the malaria data using the exponential smoothing technique is one that has a multiplicative error, a damped additive trend, and a multiplicative seasonality.
Several competing artificial neural networks were constructed after setting seed; and NNAR (1, 1, 2) [12] model is considered as the “best”. In Figure 3, the forecast values over a two year period, that is January, 2016, to December, 2017 from NNAR (1, 1, 2) [12] is plotted. It can be observed that the forecast does not bear much resemblance to the original malaria cases series.
To select the appropriate forecasting technique for malaria cases in Ghana, the respective “best” forecasting models were separately used to forecast the next two years. The forecast values were then compared to the actual data (which is the test data), and the accuracy of each of the ‘best’ forecast technique was computed using the Root-Mean-Square Error (RSME) and the Mean Absolute Percentage Error (MAPE). From Table 5, the forecast values of the three original methods (ARIMA, ETS, NNAR) are combined by simple arithmetic mean to form the Combination technique. The RMSE and MAPE of the combined method and the other three methods are noted. It is obvious that the SARIMA model results have the least RMSE and MAPE values among the other three forecasting techniques. Thus, it is concluded that the SARIMA (0, 1, 2) (0, 1, 1)12 model is the ‘best’ model for forecasting the malaria time series data in Ghana.
From Table 6, SARIMA (0, 1, 2) (0, 1, 1)12 was selected as the ‘best’ forecasting technique because it had the least forecast accuracy measure. From the diagnostic checking, the chosen forecast technique, SARIMA (0, 1, 2) (0, 1, 1)12, fitted the malaria cases series well.
In Figure 4, the diagnostic checks on the residuals of the chosen forecasting method [SARIMA (0, 1, 2) (0, 1, 1)12] is presented. This is done to see if it does not violate any of the assumptions. From Figure 4, we observed the following:
a) There is no apparent trend in the plot of the standardized residuals over the years.
b) As well, a plot of the ACF of the residuals confirms that none of their lags are statistically significant implying that the residuals are not correlated.
c) The Normal Q-Q plot shows that most of the errors are normally distributed except for a few at either ends that move away from the normal line. Again, potential cases of outliers are observed.
d) Finally, from the plot of the p-values for the Ljung-Box statistic, it is observed that all the p-values are above the 5% significance level. This indicates that the null hypothesis cannot be rejected, as well; the plotted lags are not significantly different from zero.
In Figure 5, monthly forecast of Malaria cases in Ghana for two years that is 2018 to June 2019 is presented. In 2018, we should expect a decrease in Malaria cases in the last quarter but should expect an increase in Malaria cases during the first half of 2019.
Figure 6 gives the actual data from 2008 to 2017 and forecast values from 2016 to 2017 of malaria cases; it is obvious that the suitable model captures the data well especially for 2016.
The main aim of the study is to identify appropriate statistical technique for forecasting the monthly Malaria cases in Ghana. Thus, strict statistical procedures were followed in order to achieve a suitable model for forecasting. Firstly, the monthly malaria cases data were assessed to know the pattern or characteristics of the data, and this is clearly shown in Figure 1 and 2.
For building suitable statistical models, four competing forecasting techniques were compared in order to choose the appropriate technique. Four competing forecasting techniques (that is SARIMA, ETS, NNAR and a combination of the three) were applied to malaria cases data spanning from January 2008 to December 2017. And for each forecasting technique, the principle of model selection was applied in order to select an appropriate model. That is, suitable model is selected under each forecasting technique and the suitable models from the four techniques are finally compared to choose the overall suitable model (Table 3 and 4). The SARIMA technique is the most suitable statistical model for forecasting malaria incidence in Ghana (Table 5). The suitable model for forecasting (i.e. SARIMA (2, 1, 0) (2, 0, 0)) passed all the needed diagnostic tests [18].
Forecast of malaria cases is vital for its control and intervention. This can diminish the huge impact of morbidity and mortality caused by the malady. National control and aversion methodologies will be greatly upgraded through better capacity to forecast future trends in the disease incidence. This research sought to identify appropriate statistical technique for forecasting the monthly Malaria cases in Ghana. Therefore, four competing forecasting techniques were compared in order to choose the appropriate technique. Four competing forecasting techniques (that is SARIMA, ETS, NNAR and a combination of the three) were applied to malaria cases data spanning from January 2008 to December 2017. Our findings reveal that the SARIMA technique is the appropriate statistical model for forecasting malaria incidence in Ghana. The “best” model for forecasting is SARIMA (2, 1, 0) (2, 0, 0) which passed all the needed diagnostic tests. A two year monthly forecast from the “best” model revealed that, in 2018, we should expect a decrease in Malaria cases in the last quarter but should expect an increase in Malaria cases during the first half of 2019.
Figure 1: Poly Nomenclature |
Figure 2: Triangles Parallelograms and Quandrilaterals |
Figure 3: GADJ |
Figure 4: Polycon Concepts |
Figure 5: Adjacent factors |
Figure 6: Polycon GADJ:linear(j, j+1) |
Trend Component |
Seasonal Component |
||
N (None) |
A (Additive) |
M (Multiplicative) |
|
N (None) |
N, N |
N, A |
N, M |
A (Additive) |
A, N |
A, A |
A, M |
Ad (Additive Damped) |
Ad, N |
Ad, A |
Ad, M |
M (Multiplicative) |
M, N |
M, A |
M, M |
Md(Multiplicative Damped) |
Md, N |
Md, A |
Md, M |
Variable |
Order of Differencing |
P-value |
Conclusion |
Malaria Cases |
I (0) (original data) |
0.3315 |
The data is not stationary |
Differenced Malaria Cases |
I (1) (differed data) |
0.01 |
The data is stationary |
Model |
AICc |
(1, 1, 0) (1, 1, 0) |
-178.28 |
(0, 1, 1) (1, 1, 0) |
-181.52 |
(0, 1, 2) (1, 1, 0) |
-182.12 |
(0, 1, 1) (0, 1, 1) |
-187.88 |
(0, 1, 2) (0, 1, 1)* |
-188.15 |
(1, 1, 1) (1, 1, 0) |
-180.79 |
(1, 1, 1) (0, 1, 1) |
-187.33 |
(1, 1, 1) (0, 1, 0) |
-162.85 |
(1, 1, 2) (0, 1, 1) |
-186.06 |
(1, 1, 2) (0, 1, 1) |
-180.66 |
Table 3: Competing models and their AICc values
Model |
AICc |
BIC |
M, N, M |
88.7279 |
123.7425 |
M, Ad, M * |
-19.0453 |
21.5477 |
M, Md, M |
-18.6612 |
21.9317 |
A, A, N |
43.5431 |
56.3655 |
A, Ad, A |
-18.9906 |
21.6024 |
Table 4: Competing Models with respective AICc and BIC
MODEL |
RMSE |
MAPE |
SARIMA(0, 1, 2) (0, 1, 1)12* |
0.07727 |
0.4569 |
ETS (M, Ad, M) |
0.1231 |
0.8141 |
NNAR (1, 1, 2) |
0.1818 |
1.2421 |
Combination (ARIMA, ETS, NNAR) |
0.1207 |
0.7958 |
Table 5: Forecast Accuracy Measures of Competing Forecast Techniques
Variable |
Estimates |
Standard Error |
P – Value |
|
Malaria Cases |
MA(1) |
-0.2471 |
0.1042 |
0.0198 |
MA(2) |
-0.1575 |
0.0987 |
0.1140 |
|
SMA(1) |
-0.6235 |
0.1212 |
0.0000 |