WCCI 2012 IEEE World Congress on Computational Intelligence (IJCNN), June 10-15, 2012, Brisbane, Australia
Autocorrelation and partial autocorrelation functions to improve neural networks models on univariate time series forecasting

João Henrique F. Flores, Paulo Martins Engel, Rafael C. Pinto
Instituto de Informática, UFRGS
[email protected], [email protected], [email protected]

Abstract—This paper proposes the autocorrelation function (acf) and the partial autocorrelation function (pacf), as used in classical time series analysis, as tools to guide and improve the construction of the input layer of artificial neural network (ANN) models for univariate time series, especially by reducing the number of input-layer neurons and by helping the user understand the behaviour of the series. Although the acf and pacf are considered linear functions, this paper shows that they can be used even on nonlinear time series. The ANNs used in this work are the Incremental Gaussian Mixture Network (IGMN), chosen because it is a deterministic model, and the multilayer perceptron (MLP), the ANN model most widely used for time series forecasting.
Index Terms—Artificial Neural Network, Time Series, Autocorrelation Function, Partial Autocorrelation Function, Forecasting, IGMN, MLP.

I. INTRODUCTION

Artificial neural network (ANN) models have been used in time series forecasting for a long time [1]–[3]. However, in many papers the models are presented only in their final form, with the forecasting errors and other descriptive data, but with no analytical information on how that model was reached. The models are often obtained by repetition, searching for better performance, for the best data for the input layer, and among other configuration possibilities [4]. In summary, the models are mostly obtained by trial and error. In other works the models are obtained automatically. The most common automatic methods for neural network modeling are based on error minimization, on information criteria such as the AIC (Akaike Information Criterion) and the BIC (Bayesian Information Criterion), or on both, as seen in [5] and [6].

The classical statistical approach to univariate time series forecasting is substantially different. It offers tools and tests that help, and to some extent confirm, the construction of a model [7], [8]. The Box-Jenkins algorithm [9], [10] relies mainly on the autocorrelation and partial autocorrelation functions to obtain the information needed to build a model, such as seasonal effects and the autocorrelation window. The Box-Jenkins algorithm is, however, still a subjective way to obtain a classical statistical model: it is not uncommon for different users to obtain different models for the same data [7], [11].


In this work we chose the Incremental Gaussian Mixture Network (IGMN) as the main vehicle for the proposed use of the autocorrelation function (acf) and partial autocorrelation function (pacf), because this neural network model is deterministic with respect to its parameters and uses the same probability distribution (Gaussian) as the classical statistical models [12]. The acf and pacf were also tested with the multilayer perceptron (MLP), a randomly initialized model, to extend the results to one of the most common neural network models for time series forecasting.

The major drawback of the classical statistical approach is the number of available models, covering almost every single characteristic a time series can present. Some families, such as the autoregressive conditional heteroskedastic models (the ARCH and GARCH family), have more than one hundred variations [7], [8]. The same does not occur with neural network models [13]. However, other works, such as [4], reject the use of the acf and pacf in ANN forecasting models because these functions are linear. Other works use the acf, pacf and even Fourier analysis, but in hybrid models, closer to the classical statistical approach, as in [14] and [15]. In the classical statistical approach, however, the non-linearity of a series can be characterized by a non-constant mean and/or variance, and even then there are specific models to address these situations, as seen in [7] and [11], which still use the acf and pacf. Therefore, the objective of this work is to verify the ability of the acf and pacf to reduce the data in the input layer, using only the relevant, i.e., statistically significant, lags, without increasing the errors, even though these functions are considered linear.

II. INCREMENTAL GAUSSIAN MIXTURE NETWORK

The Incremental Gaussian Mixture Network (IGMN) [12] is a supervised algorithm built on an incremental approximation of the EM algorithm [16], the IGMM (Incremental Gaussian Mixture Model) [17]. It creates and continually adjusts a probabilistic model consistent with all sequentially presented data, after each data point presentation, and without the need to store any past data points. Its learning process is aggressive, or "one-shot", meaning that only a single scan through the data is necessary to obtain a consistent model.

IGMN adopts a Gaussian mixture model of distribution components (known as a cortical region) that can be expanded to accommodate new information from an input data point, or reduced if spurious components are identified along the learning process. Each data point assimilated by the model contributes to the sequential update of the model parameters based on the maximization of the likelihood of the data. The parameters are updated through the accumulation of relevant information extracted from each data point. Differently from the IGMM, however, the IGMN is capable of supervised learning, simply by assigning any of its input vector elements as outputs (any element can be used to predict any other element, as in autoassociative neural networks [18]). This architecture is depicted in figure 1.

Fig. 1. An example of IGMN with 3 input nodes and 5 Gaussian components. Any input element can be predicted by using any other element, which means that the input vector can actually be divided into input and output elements.

A. Learning

The algorithm starts with no components, which are created as necessary (see subsection II-B). Given an input x (a single instantaneous data point), the IGMN processing step is as follows. First, the likelihood for each component j is computed:

\bar{p}(x|j) = \exp\left(-\tfrac{1}{2}(x - \mu_j)^T C_j^{-1} (x - \mu_j)\right) \qquad (1)

p(x|j) = \frac{\bar{p}(x|j)}{(2\pi)^{D/2}\sqrt{|C_j|}} \qquad (2)

where D is the input dimensionality, \mu_j the jth component mean, C_j its covariance matrix and \bar{p}(x|j) is the likelihood before normalization (each value of \bar{p}(x|j) lies in the interval [0, 1], with 1 being a perfect match between x and \mu_j). If any \bar{p}(x|j) is greater than a threshold \tau_{max} (e.g., 0.999), it is assumed that the model already represents the current data point x well, and therefore no learning takes place. Besides speeding up the algorithm in some cases, this condition avoids excessive shrinking of the Gaussian components when very similar data points are seen many times (e.g., robot sonar data along a long corridor), which could cause numerical problems as well as disturb the efficiency of the novelty criterion used for component creation.

Supposing that the learning procedure goes on, posterior probabilities are calculated for each component as follows:

p(j|x) = \frac{p(x|j)\,p(j)}{\sum_{q=1}^{M} p(x|q)\,p(q)} \quad \forall j \qquad (3)

where M is the number of components. Now, the parameters of the algorithm must be updated according to the following equations:

v_j(t) = v_j(t-1) + 1 \qquad (4)

sp_j(t) = sp_j(t-1) + p(j|x) \qquad (5)

e_j = x - \mu_j \qquad (6)

\omega_j = \frac{p(j|x)}{sp_j} \qquad (7)

\Delta\mu_j = \omega_j e_j \qquad (8)

\mu_j(t) = \mu_j(t-1) + \Delta\mu_j \qquad (9)

C_j(t) = C_j(t-1) - \Delta\mu_j \Delta\mu_j^T + \omega_j\left(e_j e_j^T - C_j(t-1)\right) \qquad (10)

p(j) = \frac{sp_j}{\sum_{q=1}^{M} sp_q} \qquad (11)

where sp_j and v_j are the accumulator and the age of component j, respectively, and p(j) is its prior probability.

B. Creating New Components

If, instead of having any \bar{p}(x|j) greater than a threshold \tau_{max}, we have all \bar{p}(x|j) below some threshold \tau_{min} (e.g., 0.001), or there are no components, and a stability criterion is satisfied, which means having all v_j greater than some age_{min} (e.g., D + 1, where D is the input space dimensionality), then a new component j is created and initialized as follows:

\mu_j = x; \quad sp_j = 1; \quad v_j = 1; \quad p(j) = \frac{1}{\sum_{i=1}^{M} sp_i}; \quad C_j = \sigma_{ini}^2 I

where M already includes the new component and \sigma_{ini} can be obtained by:

\sigma_{ini} = 3\,\delta\,\mathrm{std}(x) \qquad (12)

where \delta is a manually chosen scaling factor (e.g., 0.01) and std is the standard deviation of the dataset (the constant 3 is chosen to better cover the full range of values in the dataset according to a Gaussian distribution). Note that the IGMN is an online and incremental algorithm, and therefore we may not have the entire dataset available to extract descriptive statistics. In this case the standard deviation can be just an estimate, without impacting the algorithm.

C. Removing Spurious Components

A component j is removed whenever v_j > v_{min} and sp_j < sp_{min}, where v_{min} and sp_{min} are manually chosen (e.g., 5.0 and 3.0, respectively). In that case, p(q) must also be adjusted for all q \in M, q \neq j, using equation (11). In other words, each component is given some time v_{min} to show its importance to the model in the form of an accumulation of its posterior probabilities sp_j.
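To make the learning cycle above concrete, the following is a minimal, illustrative sketch of a single IGMN learning step covering equations (1)–(12) and the spurious-component rule. It is not the authors' implementation: the class and method names, and any defaults beyond the example values quoted in the text, are assumptions.

```python
# Minimal sketch of the IGMN learning step described above (not the authors' code).
# Parameter defaults (tau_max, tau_min, delta, v_min, sp_min) follow the example
# values quoted in the text; everything else is an illustrative assumption.
import numpy as np

class IGMNSketch:
    def __init__(self, data_std, delta=0.01, tau_max=0.999, tau_min=0.001,
                 v_min=5.0, sp_min=3.0):
        self.sigma_ini = 3.0 * delta * data_std          # eq. (12)
        self.tau_max, self.tau_min = tau_max, tau_min
        self.v_min, self.sp_min = v_min, sp_min
        self.mu, self.C, self.sp, self.v = [], [], [], []

    def _likelihoods(self, x):
        # eqs. (1) and (2): unnormalized and normalized Gaussian likelihoods
        p_bar, p = [], []
        D = x.size
        for mu, C in zip(self.mu, self.C):
            d = x - mu
            p_bar.append(np.exp(-0.5 * d @ np.linalg.solve(C, d)))
            p.append(p_bar[-1] / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(C))))
        return np.array(p_bar), np.array(p)

    def learn(self, x):
        x = np.asarray(x, dtype=float)
        if not self.mu:
            self._create(x)
            return
        p_bar, p_x_j = self._likelihoods(x)
        if p_bar.max() > self.tau_max:            # data point already well represented
            return
        stable = all(v > x.size + 1 for v in self.v)
        if p_bar.max() < self.tau_min and stable: # novelty: create a new component
            self._create(x)
            return
        prior = np.array(self.sp) / np.sum(self.sp)       # eq. (11)
        post = p_x_j * prior
        post /= post.sum()                                # eq. (3)
        for j, w in enumerate(post):                      # eqs. (4)-(10)
            self.v[j] += 1
            self.sp[j] += w
            e = x - self.mu[j]                            # eq. (6)
            omega = w / self.sp[j]                        # eq. (7)
            dmu = omega * e                               # eq. (8)
            self.mu[j] = self.mu[j] + dmu                 # eq. (9)
            self.C[j] = (self.C[j] - np.outer(dmu, dmu)
                         + omega * (np.outer(e, e) - self.C[j]))  # eq. (10)
        self._remove_spurious()

    def _create(self, x):
        self.mu.append(x.copy())
        self.C.append(self.sigma_ini ** 2 * np.eye(x.size))
        self.sp.append(1.0)
        self.v.append(1.0)

    def _remove_spurious(self):
        keep = [j for j in range(len(self.mu))
                if not (self.v[j] > self.v_min and self.sp[j] < self.sp_min)]
        self.mu = [self.mu[j] for j in keep]
        self.C = [self.C[j] for j in keep]
        self.sp = [self.sp[j] for j in keep]
        self.v = [self.v[j] for j in keep]
```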

D. Recalling

In the IGMN, any element can be predicted by any other element, and this is the main addition in relation to the unsupervised IGMM algorithm. This is done by reconstructing data from the target elements (x_t, a slice of the entire input vector x) by estimating the posterior probabilities using only the given elements (x_i, also a slice of the entire input vector x), as follows:

p(j|x_i) = \frac{p(x_i|j)\,p(j)}{\sum_{q=1}^{M} p(x_i|q)\,p(q)} \quad \forall j \qquad (13)

It is similar to (3), except that it uses a modified input vector x_i with the target elements x_t removed from the calculations. After that, x_t can be reconstructed using the following equation:

\hat{x}_t = \sum_{j=1}^{M} p(j|x_i)\left(\mu_{j,t} + C_{j,ti}\,C_{j,i}^{-1}\,(x_i - \mu_{j,i})\right) \qquad (14)

where C_{j,ti} is the submatrix of the jth component covariance matrix associating the unknown and known parts of the data, C_{j,i} is the submatrix corresponding to the known part only, and \mu_{j,i} is the jth component's mean without the unknown elements. The recalling procedure is depicted in figure 2.

Fig. 2. An example of IGMN with 3 input nodes and 5 Gaussian components. Two of the input elements were selected for estimating the third one. The different color intensities inside each Gaussian component represent their different posterior probabilities after seeing data x_i (only the given elements), and are used to weight the contributions of each component to the final result.
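The recalling step in (13) and (14) amounts to a posterior-weighted conditional Gaussian estimate. A short sketch is given below, assuming the trained component parameters are already available; the function name and argument layout are illustrative only.

```python
# Sketch of the IGMN recalling step (eqs. 13-14), assuming trained parameters are given.
import numpy as np

def igmn_recall(x_i, known, target, mu, C, prior):
    """Estimate the target elements of the input vector from the known ones.

    x_i    : values of the known elements (1-D array)
    known  : indices of the known elements; target: indices to be estimated
    mu, C  : lists of component means and covariance matrices
    prior  : array of component priors p(j)
    """
    D_i = len(known)
    post = np.empty(len(mu))
    for j, (m, cov) in enumerate(zip(mu, C)):
        d = x_i - m[known]
        C_i = cov[np.ix_(known, known)]
        lik = np.exp(-0.5 * d @ np.linalg.solve(C_i, d)) / (
            (2 * np.pi) ** (D_i / 2) * np.sqrt(np.linalg.det(C_i)))
        post[j] = lik * prior[j]                   # numerator of eq. (13)
    post /= post.sum()                             # eq. (13)

    x_t = np.zeros(len(target))
    for j, (m, cov) in enumerate(zip(mu, C)):
        C_ti = cov[np.ix_(target, known)]          # cross-covariance submatrix
        C_i = cov[np.ix_(known, known)]            # known-part submatrix
        cond_mean = m[target] + C_ti @ np.linalg.solve(C_i, x_i - m[known])
        x_t += post[j] * cond_mean                 # eq. (14)
    return x_t
```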

III. CLASSICAL STATISTICAL APPROACH

The Box-Jenkins algorithm uses two graphical tools for building classical statistical models [7]: the autocorrelation function (acf) and the partial autocorrelation function (pacf). These functions show the intensity of the temporal autocorrelation, each in its own manner. A classical model is considered adjusted when both the acf and the pacf show no significant autocorrelation. The acf is obtained from the linear correlation of each value x_t of the series with the values at different lags, such as x_{t-1}, x_{t-2} and so on. The pacf yields almost the same result, but removes the interference of the intermediate values: in the acf, for example, the correlation between x_t and x_{t-2} suffers the interference of x_{t-1}, and the pacf removes that interference. Each of these functions not only has its own interpretation, but also a combined interpretation. For example, observe figure 3.

Fig. 3. Example of the acf (a) and pacf (b) of a second-order autoregressive (AR) model

In the graphs of figure 3 one can note two different correlations beyond the statistical significance limits, the dashed lines. By definition, the acf graph shows the correlation at every lag, including lag zero, which is not taken into account. In figure 3 it is also possible to observe additional significant correlations in the pacf graph, even for a second-order autoregressive model. This is due to common sampling variation and the possible interaction of the autoregressive (AR) term with the moving average (MA) term. Not only, but also because of this effect, it is recommended to use both functions.

By definition, autoregressive components can be observed when the pacf is statistically different from zero at lags m = 1, 2, \ldots, p and zero thereafter. In this case an AR(p) is used, according to (15), where the \phi are the autoregressive parameters, x_t is the observation at time t and \epsilon_t is the white noise at time t.

x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \ldots + \phi_p x_{t-p} + \epsilon_t \qquad (15)
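To illustrate how this rule would be applied in practice, the sketch below simulates a second-order autoregressive process like the one behind figure 3 and lists the lags that fall outside the usual ±1.96/√n band. The use of statsmodels here is an assumption for illustration; it is not part of the original experiments.

```python
# Sketch: reading the AR order off the pacf of a simulated AR(2), as in figure 3.
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

rng = np.random.default_rng(0)
n = 500
x = np.zeros(n)
for t in range(2, n):
    # AR(2): x_t = 0.6 x_{t-1} + 0.3 x_{t-2} + eps_t
    x[t] = 0.6 * x[t - 1] + 0.3 * x[t - 2] + rng.normal()

limit = 1.96 / np.sqrt(n)             # Gaussian significance band (the dashed lines)
r = acf(x, nlags=20)
phi = pacf(x, nlags=20)

print("acf lags beyond the band: ", [k for k in range(1, 21) if abs(r[k]) > limit])
print("pacf lags beyond the band:", [k for k in range(1, 21) if abs(phi[k]) > limit])
# For an AR(2), the pacf list should contain essentially lags 1 and 2, while the
# acf decays slowly over many lags.
```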

The seasonal autoregressive component has a pacf that is statistically zero, except at the lags m = s, 2s, \ldots, Ps, in which case an AR_s(P) model is obtained, according to (16), where s represents the seasonal period.

x_t = \phi_s x_{t-s} + \phi_{2s} x_{t-2s} + \ldots + \phi_{Ps} x_{t-Ps} + \epsilon_t \qquad (16)

The autoregressive behavior refers to the correlation that exists between the current and the past values. However, some precautions are necessary. The limits of acceptable correlation (represented by the dashed lines in figure 3) are constructed assuming that the data, or the residuals, follow a Gaussian probability distribution. If this assumption cannot be verified, the amplitude of the limits changes and the limits must be calculated by a non-parametric approach. In this paper, the series presented are also used in other works applying the classical statistical approach, and so these verifications are not repeated here.

The other model component to be detected is the moving average (MA) component. This component can be identified using, mainly, the acf. When the acf is statistically different from zero at lags m = 1, 2, \ldots, q and equal to zero thereafter, the model is an MA(q), according to (17), where the \gamma are the moving average parameters and \epsilon_t is the residual at time t.

x_t = \gamma_1 \epsilon_{t-1} + \gamma_2 \epsilon_{t-2} + \ldots + \gamma_q \epsilon_{t-q} + \epsilon_t \qquad (17)

It is important to note, however, that \epsilon_t is a random error, but \epsilon_{t-1}, \epsilon_{t-2}, \ldots, \epsilon_{t-q} are not, as they have already been observed in previous periods. Also, by definition, x_{t-1} is the real observation at time (t-1) and \hat{x}_{t-1} is the model's estimated value at time (t-1). Therefore, the residual \epsilon_{t-1} is obtained as \epsilon_{t-1} = x_{t-1} - \hat{x}_{t-1}. The same procedure applies to times (t-2), (t-3), \ldots, (t-q).

The seasonal moving average component has an acf equal to zero, except at lags m = s, 2s, \ldots, Qs, in which case an MA_s(Q) model is suggested, according to equation (18).

x_t = \gamma_s \epsilon_{t-s} + \gamma_{2s} \epsilon_{t-2s} + \ldots + \gamma_{Qs} \epsilon_{t-Qs} + \epsilon_t \qquad (18)
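Analogously to the AR case, the cut-off rule for MA components can be checked numerically. The fragment below, again only an illustrative sketch assuming statsmodels, simulates an MA(2) process and lists the acf lags beyond the Gaussian band.

```python
# Sketch: the acf of a simulated MA(2) cuts off after lag q = 2, mirroring the rule above.
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
n = 500
eps = rng.normal(size=n)
x = eps.copy()
x[2:] += 0.7 * eps[1:-1] + 0.4 * eps[:-2]   # MA(2): x_t = eps_t + 0.7 eps_{t-1} + 0.4 eps_{t-2}

limit = 1.96 / np.sqrt(n)
r = acf(x, nlags=10)
print([k for k in range(1, 11) if abs(r[k]) > limit])   # expected: roughly lags 1 and 2
```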

The Seasonal Autoregressive Integrated Moving Average (SARIMA) model combines the effects of an AR(p), an AR_s(P), an MA(q) and an MA_s(Q) component and a differencing (or integrated) component. This model is written as SARIMA(p, i, q) \times (P, I, Q)_s, where the i component stands for the integrated component, I for the seasonal integrated component and s for the seasonal period, as seen before. A SARMA model, according to (19), is similar to a SARIMA model, but without the integrated components. The integrated, or seasonal integrated, components are used in the classical statistical approach when the time series is not stationary, namely a time series with non-constant mean and/or variance. Mainly because of the capabilities of the neural models, the integrated component is not relevant here and is therefore not used in this work.

x_t = \phi_1 x_{t-1} + \ldots + \phi_p x_{t-p} + \phi_s x_{t-s} + \ldots + \phi_{Ps} x_{t-Ps} + \gamma_1 \epsilon_{t-1} + \ldots + \gamma_q \epsilon_{t-q} + \gamma_s \epsilon_{t-s} + \ldots + \gamma_{Qs} \epsilon_{t-Qs} + \epsilon_t \qquad (19)

The MA component refers to the correlation that exists between the current values and the residuals obtained from previous values. In the neural network models chosen to verify the use of the acf and pacf on univariate time series, IGMN and MLP, this behavior is not presented explicitly to the model. It is, however, possible that the IGMN and the MLP can model this behavior implicitly; it is shown in [7], [8] that a higher-order AR process can approximate a low-order MA process. This is, however, not explored in this work.

IV. TIME SERIES DATA

To show the capabilities of the acf and pacf, two different time series were used: the air passengers data, as seen in [9], and the monthly sunspot data. Both series are presented next, with their respective acf and pacf graphs.

A. The air passengers data

The air passengers time series consists of 144 observations of the monthly total of international airline passengers between 1949 and 1960, in thousands (see figure 4). In the classical statistical approach this series can be modelled with a reasonably simple model, a SARIMA(0, 1, 1) \times (0, 1, 1)_{12}, even though it is considered a nonlinear time series. The majority of ANN models applied to this series use a 12-lag or 24-lag window. Figure 5 shows the acf and pacf of the series. Observing the pacf graph, one can notice the statistically significant lags that will be used in the IGMN and MLP models. The acf graph shows only a non-stationary time series, which is why integration coefficients are commonly used in the SARIMA model. This is also characteristic of a nonlinear time series.
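For readers who wish to reproduce this kind of lag inspection, the fragment below sketches how the significant pacf lags of the untransformed air passengers series could be listed. Loading the dataset through statsmodels' Rdatasets interface is an assumption made for illustration, and the resulting lag set is not claimed to be the exact one used in the experiments reported later.

```python
# Sketch: list statistically significant pacf lags of the (untransformed) air passengers series.
# Loading the dataset through statsmodels' Rdatasets interface is an illustrative assumption.
import numpy as np
from statsmodels.datasets import get_rdataset
from statsmodels.tsa.stattools import pacf

series = get_rdataset("AirPassengers").data["value"].to_numpy(dtype=float)
n = len(series)

phi = pacf(series, nlags=24)
limit = 1.96 / np.sqrt(n)                     # Gaussian significance band
significant = [k for k in range(1, 25) if abs(phi[k]) > limit]
print("candidate input lags:", significant)   # these lags would feed the ANN input layer
```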

B. The sunspots data

The sunspots time series represents the monthly mean relative sunspot numbers from 1749 to 1983. This series is commonly used to verify the capabilities of different models, as it is considered a nonlinear time series [19].

Fig. 4. The air passengers time series data

Fig. 6. The monthly sunspot numbers time series data

The classical statistical models for this series range from an AR(9) to more complex models. Usually, the ANN models applied to this series use a 12-lag window or even more. Observing the graphs in figure 7, one can see that the acf graph is similar to that of the air passengers data, but the pacf shows that the statistically significant lags are at the beginning of the series and again only after a long gap.

Fig. 7. The acf (a) and pacf (b) of the sunspots data

V. EXPERIMENTS AND RESULTS

Fig. 5. The acf (a) and pacf (b) of the air passengers data

As discussed before, the ANNs applied to the time series in the experiments are the Incremental Gaussian Mixture Network (IGMN) and the multilayer perceptron (MLP). The models generated by these ANNs are compared separately, using the same parameters except for the input layer. The acf and pacf are used at two different stages: at the beginning of the modeling, to obtain the representative lag window for the input (the representative time delays needed), and after the modeling, to analyse the residuals. This is the same procedure used in classical statistical modeling. The only remark is that the original air passengers data are used, not the differenced or log-transformed data. For the purpose of this work, the time series should be nonlinear, and that is one of the reasons to use the original data, namely to test the capabilities of the acf and pacf as tools to improve the building of the ANN model. The other major reason is to be able to compare both models. The same is done with the sunspot series. After the modeling with the acf and pacf, another IGMN and another MLP model are generated as the traditional method to compare with, using the whole lag window, or all time delays.

The lag window, or time-delay, representation used in this work is as follows: a lag window of 1 is the input of x_t to forecast x_{t+1}, a lag window of 2 is the input of x_{t-1} and x_t to forecast x_{t+1}, and so on. Using the acf and pacf we apply the term representative lag window, because the model does not use the whole lag window, but just some elements of it, the representative lags. A model can then be described as using x_{t-6}, x_{t-1} and x_t to forecast x_{t+1}; this is not a lag window of 7, which would be described by x_{t-6}, x_{t-5}, \ldots, x_{t-1} and x_t to forecast x_{t+1}.

To assess the contribution of the acf and pacf functions to produce more compact models, two models with distinct input layers were generated for each kind of ANN (IGMN and MLP). One of them has a compact input layer corresponding to the representative lag window, with the delays determined using the acf and pacf. The other one has an input layer corresponding to the proper (whole) lag window, as seen above. A sketch of this construction is given below.

The IGMN is a deterministic network in relation to the model parameters: the trained model parameters are always the same for the selected network configuration. On the other hand, the MLP (or TDNN) is a random model and it is necessary to run many simulations on each time series to collect appropriate statistics, since for the MLP the trained model parameters (the weights) depend on a random initialization process.
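The difference between the two input layers amounts to which delayed copies of the series are kept as columns of the input matrix. The helper below (a hypothetical name, not taken from the paper) builds both variants for one-step-ahead forecasting.

```python
# Sketch: build ANN input/target pairs from a whole lag window vs. a representative lag set.
# build_lagged is a hypothetical helper, not code from the paper.
import numpy as np

def build_lagged(series, lags):
    """Return (inputs, targets) where each input row holds series[t - lag] for the given
    lags and the target is series[t + 1] (one-step-ahead forecasting)."""
    series = np.asarray(series, dtype=float)
    start = max(lags)
    rows = [[series[t - lag] for lag in lags] for t in range(start, len(series) - 1)]
    targets = series[start + 1:]
    return np.array(rows), targets

series = np.arange(100, dtype=float)                         # stand-in data
X_full, y = build_lagged(series, lags=list(range(0, 15)))    # whole window x_t ... x_{t-14}
X_repr, _ = build_lagged(series, lags=[0, 12, 14])           # representative lags x_t, x_{t-12}, x_{t-14}
print(X_full.shape, X_repr.shape)                            # (85, 15) vs (85, 3) input units
```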

The results obtained for the trained models are reported with the following measures: the root mean square error (RMSE), as in (20), and the normalized mean square error with respect to the trivial solution (NMSET), as in (21). These results are obtained using only the forecasts, not the whole series: in both series the last observations are removed and used to compute the RMSE and NMSET of one-step-ahead forecasts. For the MLP models, the results are the median computed over 30 different models and are statistically tested to verify whether the difference is significant or not; the test used is the Wilcoxon rank sum test over two samples. The MLP results also include the median absolute deviation (MAD).

RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n} (x_t - \hat{x}_t)^2} \qquad (20)

NMSET = \frac{\sum_{t=2}^{n} (x_t - \hat{x}_t)^2}{\sum_{t=2}^{n} (x_{t-1} - x_t)^2} \qquad (21)

where x_t is the real observation at time t and \hat{x}_t is the predicted value at time t.
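A direct transcription of (20) and (21) into code, useful for checking the values reported in the tables below, could look like the following sketch; the index ranges follow the equations as printed.

```python
# Sketch: the two error measures of eqs. (20) and (21) as printed above.
import numpy as np

def rmse(x, x_hat):
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return np.sqrt(np.mean((x - x_hat) ** 2))            # eq. (20)

def nmset(x, x_hat):
    # Normalized by the trivial (previous-value) forecast, eq. (21).
    # A value below 1 means the model beats the "tomorrow equals today" forecast.
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    num = np.sum((x[1:] - x_hat[1:]) ** 2)
    den = np.sum((x[:-1] - x[1:]) ** 2)
    return num / den
```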

The first series to be modeled corresponds to the air passengers data. As shown in section IV, two different IGMN models were obtained: a model with a lag window of 14 observations, i.e., x_t, x_{t-1}, \ldots, x_{t-13}, x_{t-14}, and a model with only the lags x_t, x_{t-12} and x_{t-14}. The last 30 observations were used to verify the RMSE and NMSET. The same lags are used in the MLP models. The results are shown in table I.

TABLE I
THE AIR PASSENGERS FORECASTS RESULTS

Models                          Clusters   RMSE      NMSET
IGMN x_t, x_{t-12}, x_{t-14}        1      18.2622   0.1205
IGMN x_t, ..., x_{t-14}             9      47.2595   0.8069
MLP  x_t, x_{t-12}, x_{t-14}        -      48.6395   1.1323
MLP  x_t, ..., x_{t-14}             -      47.3708   1.1875

Table I shows a better performance for the IGMN model that uses the information obtained from the acf and pacf graphs. The IGMN model with the significant lags, x_t, x_{t-12} and x_{t-14}, has 1 cluster instead of 9 and reached an RMSE of 18.3 and an NMSET of 0.12. Figure 8 presents the final acf and pacf, computed on the residuals of the representative-lags model. Observing the acf and pacf in figure 8, one can easily see that not all the lags are non-significant, as, for example, lags 13 to 15 in the pacf graph. This indicates that the IGMN has difficulty in modeling the MA component. The air passengers data, as described before, is basically a moving average process, and it is difficult to model that only implicitly from the data. This could also explain the poor results obtained with the MLP models.

Fig. 8. The final acf (a) and pacf (b) of the air passengers data

Table II presents the results obtained by the MLP models, with the p-values of the Wilcoxon rank sum test. The results in table II show that, for the MLP models, the effect of using the acf and pacf is not statistically significant. Although the MLP model obtained without the acf and pacf has better results (an RMSE of 47.4 instead of 48.6, for example), this difference is not significant for any of the measures used, as the p-values of the two Wilcoxon tests indicate.

TABLE II
RESULTS OF THE MLP MODELS IN THE AIR PASSENGERS DATA

                          MLP x_t, x_{t-12}, x_{t-14}   MLP x_t, ..., x_{t-14}
Median RMSE                        48.6395                    47.3708
MAD RMSE                           13.2125                     9.6766
Median NMSET                        1.1323                     1.1875
MAD NMSET                           0.5240                     0.5577
RMSE Wilcoxon p-value               0.7731                     0.7731
NMSET Wilcoxon p-value              0.7394                     0.7394

The other series analysed corresponds to the sunspots data. The same steps followed for the air passengers data are repeated here. The models obtained are an IGMN with a lag window of 6 steps, i.e., x_t, x_{t-1}, \ldots, x_{t-5}, x_{t-6}, an IGMN model with only the lags x_t, x_{t-3} and x_{t-6}, and the two corresponding MLP models. The last 93 observations were used to verify the RMSE and NMSET. The results are shown in table III.

TABLE III
THE MONTHLY SUNSPOTS FORECASTS RESULTS

Models                          Clusters   RMSE      NMSET
IGMN x_t, x_{t-3}, x_{t-6}         11      20.0275   0.9169
IGMN x_t, ..., x_{t-6}             12      22.4864   1.1558
MLP  x_t, x_{t-3}, x_{t-6}          -      19.6717   1.1307
MLP  x_t, ..., x_{t-6}              -      19.6503   1.1329

The IGMN model with the significant lags produced smaller NMSET errors than the other three models, but with respect to the RMSE the difference may not be relevant, as one can see in table III. It is important to note that, unlike in the air passengers data, the MLP models obtained better RMSE results than both IGMN models. Yet the IGMN model with the significant lags is the only model with an NMSET below 1. Figure 9 shows the acf and pacf graphs for the IGMN model with the significant lags.

Fig. 9. The final acf (a) and pacf (b) of the sunspots data

Both the acf and the pacf graphs in figure 9 show only three significant lags that were not well modeled. This may indicate the same problem that occurred with the air passengers data: the moving average process is not well adjusted implicitly in the models. Table IV presents the results obtained by the MLP models. The results in table IV show less variation than the MLP models obtained with the air passengers data. But, as seen in table IV, here also the acf and pacf do not show any improvement over the conventional models.

TABLE IV
RESULTS OF THE MLP MODELS IN THE SUNSPOTS DATA

                          MLP x_t, x_{t-3}, x_{t-6}   MLP x_t, ..., x_{t-6}
Median RMSE                        19.6717                  19.6503
MAD RMSE                            0.1960                   0.1740
Median NMSET                        1.1307                   1.1329
MAD NMSET                           0.0227                   0.0201
RMSE Wilcoxon p-value               0.7958                   0.7958
NMSET Wilcoxon p-value              0.7845                   0.7845
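For completeness, a sketch of how two sets of per-run errors could be compared with the Wilcoxon rank sum test is given below; the 30-run arrays are placeholders, and scipy's ranksums is assumed as the test implementation (the paper does not state which one was used).

```python
# Sketch: Wilcoxon rank sum comparison of two sets of per-run errors (placeholder data).
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(1)
rmse_repr = rng.normal(48.6, 13.2, size=30)   # hypothetical 30 runs, representative lags
rmse_full = rng.normal(47.4, 9.7, size=30)    # hypothetical 30 runs, whole lag window

stat, p_value = ranksums(rmse_repr, rmse_full)
print(f"Wilcoxon rank sum p-value: {p_value:.4f}")   # large p-value -> no significant difference
```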

The Wilcoxon tests also show no statistical difference between the two models in either RMSE or NMSET.

VI. DISCUSSION AND CONCLUSIONS

Although this is a reasonably simple and quite restricted comparison, the IGMN models built using the information from the acf and pacf graphs showed similar, if not better, performance with fewer input units in both series. It is important to note that the models presented in this work are not the best possible IGMN models for each time series; they are simply better than the standard model presented here, i.e., the whole-lag-window model. It was not the goal of this paper to find the best network configuration for modeling the chosen time series, but to show that the acf and pacf can be used to reduce the number of inputs of the neural models while maintaining, or even improving, their performance. This is best shown with the IGMN.

Another important conclusion is related to the fact that the moving average component of the process is not being appropriately modeled. Apparently neither model adjusts the MA component implicitly. Future work will focus on the modeling of the MA component.

Some restrictions to the conclusions also apply, including the fact that both series can be considered nonlinear in their original form. The models obtained by the classical statistical approach use some preprocessing of the data to adapt to its assumptions, like stationarity and linear behavior. Although the acf and pacf are considered linear, the results shown in tables I and III indicate that these functions can be used, with even better results in the case of the IGMN models. All the models presented in the experiments were generated without any preprocessing of the data. Because of this, the final acf and pacf graphs are not fully adjusted; the MA component may also be causing the significant lags in those graphs. The interpretation of the acf and pacf is better when the time series (preprocessed or not) is stationary. Even so, the experiments demonstrate the capabilities of the IGMN model, which, even using non-preprocessed data, could achieve final acf and pacf graphs as seen in figures 8 and 9.

Comparing the two networks, on the air passengers data the IGMN obtained better results than the MLP models, while on the sunspots data the two models are very similar. This could be due to the fact that the MLP models cannot handle the trend present in the air passengers data; on the other hand, as discussed in [12], the IGMN can learn a growing trend in the time series. In the sunspots data there is no visible trend, and so the two models are similar. This could also be due to the MA component; a more specific investigation of the MA component is planned as future work.

Regarding the acf and pacf graphs, the dashed lines, i.e., the upper and lower limits of statistical significance, are obtained under the assumption that the residuals follow a Gaussian probability distribution. If this assumption fails, other limits must be obtained. The two chosen series satisfy this assumption, as shown in other statistical works. But to extend these experiments to other time series, it is important to verify this assumption about the probability distribution of the residuals.

REFERENCES

[1] G. P. Zhang and V. L. Berardi, "Time series forecasting with neural network ensembles: an application for exchange rate prediction," Journal of the Operational Research Society, vol. 52, pp. 652–664, 2001.
[2] M. Qi and G. P. Zhang, "An investigation of model selection criteria for neural network time series forecasting," European Journal of Operational Research, vol. 132, pp. 666–680, 2001.
[3] P. Auer, H. Burgsteiner, and W. Maass, "A learning rule for very simple universal approximators consisting of a single layer of perceptrons," Neural Networks, vol. 21, pp. 786–795, 2008.
[4] G. Zhang, B. E. Patuwo, and M. Y. Hu, "Forecasting with artificial neural networks: The state of the art," International Journal of Forecasting, vol. 14, no. 1, p. 29, 1998.
[5] S. D. Balkin and J. K. Ord, "Automatic neural network modeling for univariate time series," International Journal of Forecasting, vol. 16, pp. 509–515, 2000.
[6] D.-Y. Chiu and P.-J. Chen, "Dynamically exploring internal mechanism of stock market by fuzzy-based support vector machines with high dimension input space and genetic algorithm," Expert Systems with Applications, vol. 36, p. 9, 2009.
[7] W. Enders, Applied Econometric Time Series, 1st ed. New York: Wiley, 1995.
[8] J. D. Hamilton, Time Series Analysis, 1st ed. Princeton: Princeton University Press, 1994.
[9] G. Box and G. M. Jenkins, Time Series Analysis, 1st ed. San Francisco: Holden-Day, 1976.
[10] G. Box, G. M. Jenkins, and G. Reinsel, Time Series Analysis, 3rd ed. New York: Prentice Hall, 1994.
[11] R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications, 1st ed. New York: Springer-Verlag, 2000.
[12] M. Heinen, "A connectionist approach for incremental function approximation and on-line tasks," Ph.D. dissertation, Universidade Federal do Rio Grande do Sul - Instituto de Informática, 2011.
[13] S. Haykin, Redes Neurais, 2nd ed. Porto Alegre: Bookman, 2001.
[14] G. P. Zhang, "Time series forecasting using a hybrid ARIMA and neural network model," Neurocomputing, vol. 50, no. 1, p. 16, 2003.
[15] T. Taskaya-Temizel and M. C. Casey, "A comparative study of autoregressive neural network hybrids," Neural Networks, vol. 18, no. 1, p. 8, 2005.
[16] A. Dempster, N. Laird, D. Rubin et al., "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[17] P. Engel and M. Heinen, "Incremental learning of multivariate Gaussian mixture models," Advances in Artificial Intelligence – SBIA 2010, pp. 82–91, 2011.
[18] D. Rumelhart and J. McClelland, Parallel Distributed Processing. MIT Press, 1986.
[19] S. Makridakis, S. C. Wheelwright, and R. J. Hyndman, Forecasting Methods and Applications, 3rd ed. New York: Wiley, 1998.