Forecasting Container Throughput of Qingdao Port ...

J Syst Sci Complex (2015) 28: 105–121

Forecasting Container Throughput of Qingdao Port with a Hybrid Model

HUANG Anqiang · LAI Kinkeung · LI Yinhua · WANG Shouyang

HUANG Anqiang
School of Economy and Management, Beihang University, Beijing 100191, China. Email: [email protected].
LAI Kinkeung
Department of Management Sciences, City University of Hong Kong, Kowloon, Hong Kong, China. Email: [email protected].
LI Yinhua
Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100080, China. Email: [email protected].
WANG Shouyang
Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China. Email: [email protected].
This paper was recommended for publication by Editor TANG Xijin.

DOI: 10.1007/s11424-014-3188-4
Received: 23 August 2013 / Revised: 15 January 2014
© The Editorial Office of JSSC & Springer-Verlag Berlin Heidelberg 2015

Abstract This paper proposes a hybrid forecasting method to forecast the container throughput of Qingdao Port. To eliminate the influence of outliers, the local outlier factor (lof) is extended to detect outliers in time series, and then different dummy variables are constructed to capture the effect of outliers based on domain knowledge. Next, a hybrid forecasting model combining projection pursuit regression (PPR) and the genetic programming (GP) algorithm is proposed. Finally, the hybrid model is applied to forecasting the container throughput of Qingdao Port, and the results show that the proposed method significantly outperforms ANN, SARIMA, and PPR models.

Keywords Container throughput forecast, genetic programming algorithm, outlier processing, projection pursuit regression.

1 Introduction

As the indispensable base of development planning and decision making by enterprises and government, economic forecasting has attracted much attention. High-quality data is one of the key factors of good forecasting performance; it is therefore necessary to process outliers before running the forecasting algorithm. An outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism[1]. There are multiple categories of outlier detection approaches, such as Bayesian approaches[2–6], wavelet based approaches[7–11], distance based approaches[12–15], and intelligent algorithm based approaches[16–21]. Early studies on outlier detection are surveyed in [22–24], and [25–30] thoroughly review the more recent studies on this issue. However, prior studies seldom address how to deal with outliers after they are detected. Simply discarding them may be an easy choice, but this approach is not suitable for time series, because outliers in time series may carry highly valuable information, e.g., structural breakpoints. Therefore, it is necessary to find a more reasonable way to process outliers in time series.

Another concern with time series forecasting lies in the fact that linear models, e.g., SARIMA and VAR, cannot satisfactorily fit the complex and nonlinear relationships among economic variables in the real world. Projection pursuit regression (PPR), developed in [31], is a powerful nonlinear modeling method, as presented in [32–35], and has been in wide use. For example, [36] applied an extended PPR to exploratory data analysis and illustrated its efficiency through several studies of real and simulated data. [37] argued for the high performance of PPR in the prediction of retention times of peptides in RPLC. More recent studies can be found in [38–41]. However, when applying standard PPR, it is difficult to determine the ridge functions, so they are usually chosen by rule of thumb, which renders the result heavily dependent on expertise.

To solve the above-mentioned problems, this paper extends the local outlier factor (lof), frequently used for cross-sectional data, to outlier detection for time series, and then constructs different types of dummy variables to capture the effect of outliers based on domain knowledge. Subsequently, this paper takes the lead in proposing a hybrid forecasting model (PPR-GP) combining projection pursuit regression and the genetic programming algorithm.

The remainder of this paper is organized as follows. Section 2 discusses the outlier processing algorithm incorporating domain knowledge. Section 3 presents the details of PPR-GP. Section 4 applies PPR-GP to forecasting the container throughput of Qingdao Port and compares its performance with that of other models. Conclusions are presented in Section 5.

2 Outlier Processing Algorithm Incorporating Domain Knowledge

Although statistical models play an indispensable role in many research areas, they are not a universal remedy, for at least two reasons: (a) a statistical model is an abstraction and simplification of the relationships among variables in the real world, and thus cannot reflect all the information they carry; (b) some special types of information cannot be analyzed by statistical models. To solve this problem, domain knowledge is frequently used. A substantial body of research, such as [42, 43], has shown that incorporating domain knowledge significantly improves the performance of statistical models. Therefore, this section proposes an outlier-processing framework that incorporates domain knowledge into the outlier processing algorithm.

2.1 Outlier Processing Framework

Under the conventional outlier processing framework, outliers are simply discarded, which probably leads to a loss of valuable information. Especially for time series, if several of the latest observations are identified as outliers and removed, the latest information on the development trend will be lost. Therefore, the newly proposed framework advocates that different approaches should be applied to different types of outliers. Figure 1 compares the conventional framework of outlier processing with the newly proposed framework. Under the new framework, the outlier detection algorithm is first applied to the original data. If there are outliers, their type is identified with the help of domain knowledge, and then corresponding methods are employed to process them. Specifically, Section 2.3 constructs the corresponding type of dummy variable for each type of outlier.

[Figure 1: Comparison of the conventional and the proposed outlier processing framework. (a) The conventional framework: Data, Outlier detection, Outliers exist? (Yes/No), Drop outliers, Further data analysis. (b) The proposed framework: Data, Outlier detection, Outliers exist? (Yes/No), Judge the type of outliers, Handle outliers in suitable ways, Further data analysis.]

2.2 An Outlier Detection Algorithm for Time Series

lof, proposed by [44], is an effective index for outlier detection on cross-sectional data. In order to apply lof to time series data, this section first transforms the time series into cross-sectional data by mapping it to a set of three-dimensional vectors, and then detects outliers by lof following [44]. Figure 2 presents three main causes of a time series outlier, i.e., amplitude shift, level shift, and direction frequency shift. A real outlier is frequently caused by a mix of them, which implies that an observation would be an outlier when it differs significantly from others in the above three aspects. Therefore, the vector $[a, l, f]$ is able to describe the characteristics of a time series, where $a$ denotes the amplitude, $l$ denotes the level, and $f$ denotes the direction shift frequency. $[a, l, f]$ can be computed as follows:

$$a = \frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})^2, \tag{1}$$

$$l = \frac{1}{N}\sum_{i=1}^{N} y_i, \tag{2}$$

$$f = \frac{1}{N-1}\sum_{i=2}^{N} I(y_i - y_{i-1}), \tag{3}$$

where $N$ is the length of the series, $y_i$ is the $i$-th element in the series, $\bar{y}$ is the mean of the series, $I(x) = 1$ when $x > 0$, and $I(x) = -1$ when $x \le 0$.
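As a minimal sketch of Equations (1)–(3), the mapping from one subseries to its vector [a, l, f] might be written as follows. The function name `subpart_features` is hypothetical, and the amplitude follows Equation (1) as printed, i.e., the mean squared deviation without a square root.

```python
def subpart_features(y):
    """Map a subseries y to the vector [a, l, f] of Eqs. (1)-(3)."""
    n = len(y)
    level = sum(y) / n                                    # Eq. (2): mean level
    # Eq. (1) as printed: mean squared deviation (no square root assumed)
    amplitude = sum((v - level) ** 2 for v in y) / n
    # Eq. (3): I(x) = 1 for x > 0 and -1 otherwise, averaged over N - 1 steps
    signs = [1 if y[i] - y[i - 1] > 0 else -1 for i in range(1, n)]
    frequency = sum(signs) / (n - 1)
    return [amplitude, level, frequency]


a, l, f = subpart_features([1, 2, 3, 2])
```

For the toy series above, two rises and one fall give f = 1/3, while a strictly increasing series would give f = 1.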

[Figure 2: Three main causes of time series outliers. (a) Amplitude shift; (b) Level shift; (c) Direction frequency shift.]

The proposed outlier detection algorithm first divides the original time series into several non-overlapping subparts. Next, it maps every subpart to a three-dimensional vector $[a, l, f]$ and generates a cross-sectional data set by collecting all of the vectors. After that, it computes lof for every vector in the data set and identifies the subparts containing outliers by lof. More details are explained as follows:

1) Remove seasonality and the temporal trend from the original data. Time series from the real economy usually contain seasonality, where data experience regular or predictable changes that recur every calendar year. It is necessary to remove seasonality from the original data before running the outlier detection algorithm, lest seasonal observations be mistaken for outliers. In order to highlight the impact of outliers, the temporal trend should also be removed. Our investigation resorts to the widely used X-12 program to obtain the adjusted series $S$.

2) Divide the series $S$ into $K$ non-overlapping subparts denoted by $S = \{S_1, S_2, \cdots, S_K\}$, the lengths of which are written as $[n_1, n_2, \cdots, n_K]$. Then map $S_i$ ($i = 1, 2, \cdots, K$) to $v_i = [a_i, l_i, f_i]$.

3) Compute the $k$-distance of $S_i$ ($i = 1, 2, \cdots, K$). Denote the Euclidean distance from $S_i$ to $S_j$ as

$$D(i, j) = \sqrt{(a_i - a_j)^2 + (l_i - l_j)^2 + (f_i - f_j)^2}. \tag{4}$$

According to [44], the $k$-distance of $S_i$, denoted by $D_k(i)$, is the Euclidean distance $D(i, p)$ satisfying the following constraints:
i) at least $k$ subparts $S_q \in S \setminus \{S_i\}$ satisfy $D(i, q) \le D(i, p)$;
ii) at most $k - 1$ subparts $S_q \in S \setminus \{S_i\}$ satisfy $D(i, q) < D(i, p)$.

4) Compute the lof of $S_i$. Denote the reachability distance[44] of $S_i$ from $S_j$ by $RD(i, j)$, and compute it as

$$RD(i, j) = \max\{D(i, j), D_k(j)\}. \tag{5}$$

Given the set of $MinPts$ nearest neighbors of $S_i$, denoted by $N_{MinPts}(S_i)$, the local reachability density[44] of $S_i$ is defined as

$$lrd_{S_i} = \frac{|N_{MinPts}(S_i)|}{\sum_{S_j \in N_{MinPts}(S_i)} RD(i, j)}, \tag{6}$$

where $|N_{MinPts}(S_i)|$ denotes the number of elements in $N_{MinPts}(S_i)$ and $MinPts$ is a parameter set by rule of thumb. The lof of $S_i$ is then computed as

$$lof_{S_i} = \frac{\sum_{S_j \in N_{MinPts}(S_i)} lrd_{S_j}}{lrd_{S_i} \times |N_{MinPts}(S_i)|}. \tag{7}$$

5) Identify outliers in light of lof. A value of $lof_{S_i}$ less than or close to 1 indicates that $S_i$ is an inlier or comparable to its neighbors; thus $lof_{S_i} \le 1$ means that $S_i$ contains no outliers. Conversely, a value significantly larger than 1 indicates outliers.
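Steps 3) to 5) can be sketched in plain Python as below. This is an illustrative reading of Equations (4)–(7), not the authors' implementation: the function name `lof_scores` is hypothetical, k is assumed to equal MinPts, ties at the k-distance are all counted as neighbors, and duplicate vectors (zero reachability sums) are not handled.

```python
import math


def lof_scores(vectors, min_pts):
    """Local outlier factor of each vector, following Eqs. (4)-(7)."""
    n = len(vectors)
    # Eq. (4): pairwise Euclidean distances
    dist = [[math.dist(p, q) for q in vectors] for p in vectors]
    k_dist, neighbors = [], []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        kd = others[min_pts - 1]                 # k-distance with k = MinPts (assumption)
        k_dist.append(kd)
        neighbors.append([j for j in range(n) if j != i and dist[i][j] <= kd])
    lrd = []
    for i in range(n):
        # Eq. (5): reachability distance, summed over the neighborhood
        reach_sum = sum(max(dist[i][j], k_dist[j]) for j in neighbors[i])
        lrd.append(len(neighbors[i]) / reach_sum)          # Eq. (6)
    # Eq. (7): mean neighbor density relative to the point's own density
    return [sum(lrd[j] for j in neighbors[i]) / (lrd[i] * len(neighbors[i]))
            for i in range(n)]


# Four tightly grouped feature vectors and one distant vector (toy data).
points = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0), (10, 10, 0)]
scores = lof_scores(points, min_pts=2)
```

In this toy data set the four grouped vectors score about 1, while the distant vector scores far above 1, which is the pattern Step 5) uses to flag a subpart.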

2.3 Outlier Processing with Domain Knowledge

Traditionally, outliers are categorized into innovative outliers[45] (IO) and additive outliers[46] (AO). In this section, a new type of outlier called the temporary outlier (TO) is proposed, in which the data deviate from the normal level for a limited period. The three types of outliers are presented in Figure 3.

[Figure 3: Examples of three types of outliers. (a) Innovative outlier; (b) Additive outlier; (c) Temporary outlier.]

Outliers should not be discarded arbitrarily, lest valuable information be lost. Domain knowledge, a situated set of relevant knowledge involved in problem solving[47], should be employed to help with outlier processing. In this paper, domain knowledge refers to any relevant information about outliers, such as their cause, their type, and so on. For example, outliers are probably innovative if they are caused by a significant increase in a factory's output resulting from a technical upgrade and infrastructure improvement. Outliers may be temporary if they are caused by a transitory decrease in a factory's output owing to a natural disaster whose negative effects persist for a certain period. An outlier may be additive if it is caused by a sudden decline in output owing to an accidental machinery breakdown that can be fixed within a very short time. Domain knowledge can be obtained from online news, academic papers, research reports, expert judgement, etc. With the above domain knowledge, suitable methods can be selected to process outliers. The outlier processing approach incorporating domain knowledge for time series can be summarized in the following steps:

1) Identify the subseries containing outliers using the outlier detection algorithm proposed in Section 2.2.

2) Search for domain knowledge about the outliers through multiple channels, such as the Internet, books, research reports, expert judgement, and statistics.

3) With the help of domain knowledge, identify the type of outliers. Take the left chart in Figure 3 as an example: the container throughput of Singapore Port dropped abruptly in 2009. Information on the Internet indicates that the main reason is the US financial crisis. According to academic papers and research reports of the World Bank, the IMF, and the UN Department of Economic and Social Affairs, the impact of the financial crisis is expected to persist over our forecast horizon. Consequently, the outliers are innovative outliers. Other types of outliers can be identified in a similar way.

4) Construct dummy variables for the forecasting model according to the type of outliers. The values of the dummy variables depend on the type of outliers. For innovative outliers, the entries before the outlier are 0 and the remaining entries are 1. For temporary outliers, the entries in the affected period are 1 and the remaining entries are 0. For additive outliers, the corresponding entry is 1 and the remaining entries are 0.
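The dummy-variable rules of Step 4) can be illustrated with a short sketch. The function name `outlier_dummy`, its arguments, and the zero-based indexing are assumptions made for this example only.

```python
def outlier_dummy(n, kind, start, end=None):
    """Dummy variable of length n for one outlier episode (zero-based indices)."""
    if kind == "innovative":      # 0 before the outlier, 1 from then on
        return [0 if t < start else 1 for t in range(n)]
    if kind == "temporary":       # 1 inside [start, end], 0 elsewhere
        return [1 if start <= t <= end else 0 for t in range(n)]
    if kind == "additive":        # 1 at the single affected observation
        return [1 if t == start else 0 for t in range(n)]
    raise ValueError(f"unknown outlier type: {kind}")


# Monthly series starting in Jan. 2005 with 88 observations:
# Jan. 2009 is index 48 and Jun. 2010 is index 65.
temporary = outlier_dummy(88, "temporary", 48, 65)
```

The resulting list can be appended to the explanatory variables of any of the forecasting models as one extra regressor.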

3 Hybrid of PPR and GP

After outliers in the time series have been dealt with, a suitable nonlinear model has to be employed for a successful forecast. Although PPR performs satisfactorily in fitting nonlinearity, it still suffers from the difficulty of determining suitable ridge functions. This paper uses a hybrid of PPR and GP to solve this problem.

3.1 Projection Pursuit Regression

The projection pursuit regression (PPR) function, according to [35], can be written in the form

$$y_t = X_t\beta + \sum_{j=1}^{M} w_j \varphi_j(X_t \phi_j) + r_t, \tag{8}$$

where $\phi_j$ ($j = 1, 2, \cdots, M$) are the projection vectors, $\varphi_j(\cdot)$ are ridge functions linearly combined with weights $w_j$ and added to the linear part $X_t\beta$ to form $y_t$, and $r_t$ is a stochastic term. The parameters in Equation (8) can be estimated by the following steps:

1) Let the superscript $(v)$ denote the $v$-th round and set $v = 0$. Estimate $\beta^{(v)}$ by ordinary least squares, compute the residual $r^{(v)}$, and initialize $w^{(v)}$ and $\phi^{(v)}$. In particular, let

$$\beta^{(v)} = \left(X^{(v)T} X^{(v)}\right)^{-1}\left(X^{(v)T} Y\right), \tag{9}$$

$$r^{(v)} = Y - X^{(v)}\beta^{(v)}, \tag{10}$$

$$\phi^{(v)} \sim U(-1, 1), \tag{11}$$

$$w^{(v)} \sim U(-1, 1), \tag{12}$$

where $U(\cdot)$ is the uniform distribution and the superscript $T$ is the transpose operator. In order to explore the nature of the local optima, repeated runs are important and necessary when estimating projection pursuit models.

2) Find the projection vectors $\phi_j^{(v+1)}$ and weights $w_j^{(v+1)}$ maximizing the goodness-of-fit $R^{2(v+1)}$. The expression of $R^{2(v+1)}$ is

$$R^{2(v+1)} = 1 - \frac{\sum_{t=1}^{n}\left[r_t^{(v)} - \sum_{j=1}^{M} w_j^{(v+1)} \varphi_j^{(v+1)}\left(X_t \phi_j^{(v+1)}\right)\right]^2}{\sum_{t}^{n} r_t^{(v+1)}}. \tag{13}$$

$\varphi_j^{(v+1)}$ are ridge functions, which are difficult to determine and are usually chosen subjectively by experts.

3) If $R^{2(v+1)}$ is large enough, stop the algorithm; otherwise, go to Step 4.

4) Compute the residual in the $(v+1)$-th round in the form of

$$r^{(v+1)} = r^{(v)} - \sum_{j=1}^{M} w_j^{(v+1)} \varphi_j^{(v+1)}\left(X \phi_j^{(v+1)}\right). \tag{14}$$

Update $v = v + 1$ and go back to Step 2. More details about PPR can be found in [35].

3.2 Genetic Programming Algorithm

Owing to its powerful ability to search for optima, the genetic programming (GP) algorithm proposed by [48] has been successfully applied to an increasing number of areas, including optimal control, symbolic regression, programming solving, differential equation solving, game strategy searching, evolution of spontaneous behavior, and so forth; see [49–54], among others. The GP algorithm, presented in Figure 4, includes four main blocks, i.e., initialization, reproduction, crossover, and mutation. The initial generation consists of a set of randomly generated mathematical expressions, each represented by a binary tree. If none of the mathematical expressions reaches the threshold of goodness-of-fit, the reproduction algorithm generates a set of individuals by randomly selecting individuals from the present generation to enter the next generation. Then, with a certain probability, the crossover algorithm selects some pairs of individuals from the set and swaps some parts of them. Next, the mutation algorithm randomly selects some individuals and replaces part of each with a newly generated part in order to avoid falling into a local optimum. Through the above steps, a new generation of mathematical expressions is obtained. After this, the goodness-of-fit of the new generation is computed and checked against the predefined threshold. This cycle repeats until the goodness-of-fit of the latest generation reaches the threshold.
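The cycle described above can be sketched as a toy symbolic-regression loop. Everything here is a simplified assumption rather than the authors' algorithm: trees are nested tuples over one variable `x`, reproduction uses binary tournaments, crossover only swaps the right branches of two trees, mutation only replaces the right branch, and a generation cap `max_gen` is added so the loop always ends.

```python
import random

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def random_tree(rng, depth=2):
    """Random expression: a leaf ("x" or 1.0) or a tuple (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["x", 1.0])
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x):
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree


def r_squared(tree, xs, ys):
    mean = sum(ys) / len(ys)
    sse = sum((y - evaluate(tree, x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - sse / sum((y - mean) ** 2 for y in ys)


def run_gp(xs, ys, threshold=0.9, npop=100, p_cross=0.4, p_mut=0.01,
           max_gen=30, seed=0):
    rng = random.Random(seed)
    fit = lambda t: r_squared(t, xs, ys)
    pop = [random_tree(rng) for _ in range(npop)]        # initialization
    best = max(pop, key=fit)
    for _ in range(max_gen):
        if fit(best) >= threshold:                        # threshold reached
            break
        # Reproduction: binary tournament (an assumption of this sketch)
        pop = [max(rng.sample(pop, 2), key=fit) for _ in range(npop)]
        # Crossover: swap the right branches of neighbouring pairs
        for i in range(0, npop - 1, 2):
            a, b = pop[i], pop[i + 1]
            if rng.random() < p_cross and isinstance(a, tuple) and isinstance(b, tuple):
                pop[i], pop[i + 1] = (a[0], a[1], b[2]), (b[0], b[1], a[2])
        # Mutation: replace the right branch with a fresh random subtree
        for i, t in enumerate(pop):
            if rng.random() < p_mut:
                pop[i] = (t[0], t[1], random_tree(rng, 1)) if isinstance(t, tuple) else random_tree(rng)
        best = max(pop + [best], key=fit)                 # keep the best so far
    return best, fit(best)


xs = [float(i) for i in range(1, 11)]
ys = [x * x + x for x in xs]
best, score = run_gp(xs, ys)
```

The default crossover and mutation probabilities mirror the values 0.4 and 0.01 reported in the paper's parameter settings; the other defaults are arbitrary choices for a quick run.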

[Figure 4: The chart of the GP algorithm. Boxes: Initialize population; Evaluate functions; Meet the threshold? (Yes: End; No: Reproduction, Crossover, Mutation, New generation).]

Figure 5 is a visualized example of a binary tree representing the mathematical expression $y_{t-3} + y_{t-2} \times y_{t-3} + y_{t-2} \div y_{t-8}$. Figures 6 and 7 visually describe the crossover and mutation processes, respectively. Details of the GP algorithm can be found in [47].

[Figure 5: A binary tree representing the function $y_{t-3} + y_{t-2} \times y_{t-3} + y_{t-2} \div y_{t-8}$.]

[Figure 6: An example of crossover.]

[Figure 7: An example of mutation. Panels: Original tree; New randomly generated tree; New tree after mutation.]
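To make the tree representation concrete, the expression of Figure 5 can be encoded as nested tuples and evaluated recursively. This is a sketch under assumptions: the tuple encoding, the use of integer leaves k to stand for the lag y_{t-k}, the particular nesting (one valid parse under standard operator precedence), and the toy lag values are all illustrative and not taken from the paper.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def evaluate(node, lags):
    """Evaluate a tree whose integer leaves k stand for the lagged value y_{t-k}."""
    if isinstance(node, int):
        return lags[node]
    op, left, right = node
    return OPS[op](evaluate(left, lags), evaluate(right, lags))


def crossover(a, b):
    """Swap the right branches of two trees (a simplified crossover)."""
    return (a[0], a[1], b[2]), (b[0], b[1], a[2])


# y_{t-3} + y_{t-2} * y_{t-3} + y_{t-2} / y_{t-8}
tree = ("+", ("+", 3, ("*", 2, 3)), ("/", 2, 8))
lags = {2: 4.0, 3: 3.0, 8: 2.0}          # toy values for y_{t-2}, y_{t-3}, y_{t-8}
value = evaluate(tree, lags)             # 3 + 4*3 + 4/2

other = ("-", 1, ("*", 4, 5))            # y_{t-1} - y_{t-4} * y_{t-5}
child_a, child_b = crossover(tree, other)
```

Because every subtree is itself a valid expression, swapping or replacing a branch always yields another valid expression, which is what makes crossover and mutation straightforward on this representation.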


3.3 A Hybrid of PPR and GP

The standard PPR model has the limitation that the selection of ridge functions relies heavily on expertise; this approach therefore tends to introduce the biases and inconsistency inherent in subjective judgement. To solve this problem by taking advantage of the powerful searching ability of GP, this section proposes a hybrid of PPR and GP (PPR-GP), described as follows:

1) Let the superscript $(v)$ denote the $v$-th round and set $v = 0$. Compute the residual $r^{(v)}$ as described in Step 1 in Section 3.1.

2) Initialize the first generation of binary trees representing expressions composed of the explanatory variables ($X_t$), the dummy variables capturing the outliers' effects, elementary functions (e.g., $\sin(x)$, $x^m$, $\log(x)$, $e^x$, $\sqrt{x}$), and mathematical operators (i.e., $+$, $-$, $\times$, $\div$).

3) Define $R^{2(v+1)}$ computed by Equation (13) as the fitness function and run the GP algorithm to optimize it.

4) If the optimum of $R^{2(v+1)}$ reaches the predefined threshold, the algorithm stops; otherwise, update $v = v + 1$ and go back to Step 3).

The advantage of PPR-GP over standard PPR is that PPR-GP automatically searches for the optimal ridge functions, and can therefore effectively avoid the biases and inconsistency inherent in experts' subjective judgement.
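The role of the fitness function in Step 3) can be illustrated with a deliberately small stand-in. In the sketch below, an exhaustive search over three hand-picked candidate functions replaces the GP search, the weight of each candidate is fitted by least squares, and the goodness-of-fit is normalised by the residual sum of squares; all three choices, as well as the names `ridge_fitness` and `select_ridge`, are assumptions of this example rather than details given in the paper.

```python
import math

# Hypothetical candidate pool; in PPR-GP the GP search would generate such expressions.
CANDIDATES = {"sin": math.sin, "square": lambda u: u * u, "exp": math.exp}


def ridge_fitness(residual, projected, phi):
    """Least-squares weight w for the ridge function phi and its goodness-of-fit."""
    z = [phi(u) for u in projected]
    w = sum(a * r for a, r in zip(z, residual)) / sum(a * a for a in z)
    sse = sum((r - w * a) ** 2 for a, r in zip(z, residual))
    # Assumption: normalise by the sum of squared residuals of the previous round.
    return w, 1.0 - sse / sum(r * r for r in residual)


def select_ridge(residual, projected):
    """Return the candidate with the highest fitness, with its weight and fitness."""
    scored = {name: ridge_fitness(residual, projected, phi)
              for name, phi in CANDIDATES.items()}
    name = max(scored, key=lambda k: scored[k][1])
    w, fit = scored[name]
    return name, w, fit


projected = [i / 4 for i in range(-8, 9)]            # toy projected inputs
residual = [2.0 * math.sin(u) for u in projected]    # toy residual from the linear step
name, w, fit = select_ridge(residual, projected)
```

Replacing the fixed candidate pool with an evolving population of expression trees turns this selection step into the GP search that the hybrid relies on.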

4 Forecasting Container Throughput of Qingdao Port

PPR-GP is used to forecast the container throughput of Qingdao Port. In order to verify its effectiveness, BP-ANN, well known as a powerful nonlinear model (e.g., [55]), SARIMA, widely used as an effective linear model (e.g., [56]), and standard PPR are taken as benchmarks.

4.1 Data Description

The monthly data of Qingdao Port's container throughput used in this study are obtained from the CEIC macroeconomic database (http://www.ceicdata.com), covering the period from January 2005 to April 2012 with 88 observations, as illustrated in Figure 8. The data from January 2005 to April 2011 are used as the training set, and the remainder, from May 2011 to April 2012, are used as the test set to evaluate the out-of-sample forecasting performance.

4.2 Evaluation Criteria

[Figure 8: Qingdao Port's container throughput during Jan. 2005 – Apr. 2012. Vertical axis: thousand TEU; horizontal axis: month, 2005M01 to 2012M01.]

For comparison purposes, three evaluation criteria are used in this section, i.e., normalized mean squared error (NMSE), mean absolute percentage error (MAPE), and the correct direction forecast rate (CDFR). Given $T$ pairs of the observed value $y_i$ and the predicted value $\hat{y}_i$, NMSE measures how accurately $\hat{y}_i$ approximates $y_i$ and is calculated by

$$\mathrm{NMSE} = \frac{\sum_{i=1}^{T}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{T}(y_i - \bar{y})^2}, \tag{15}$$

where $\bar{y}$ is the mean of $y_i$. MAPE is defined as

$$\mathrm{MAPE} = \frac{1}{T}\sum_{i=1}^{T}\frac{|y_i - \hat{y}_i|}{y_i}. \tag{16}$$

However, NMSE and MAPE cannot provide direct suggestions to decision makers. Many decision makers, such as investors, are much more interested in the direction of the change, and consequently CDFR is proposed, which is expressed as

$$\mathrm{CDFR} = \frac{\sum_{i=2}^{T}\mathrm{CDF}_i}{T}, \tag{17}$$

where

$$\mathrm{CDF}_i = \begin{cases} 1, & \text{if } (y_i - y_{i-1})(\hat{y}_i - y_{i-1}) > 0, \\ 0, & \text{if } (y_i - y_{i-1})(\hat{y}_i - y_{i-1}) < 0. \end{cases}$$
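The three criteria can be computed with a few lines of Python. This sketch follows Equations (15)–(17) under one stated assumption: in the direction indicator, the predicted change is measured from the previous observed value, which is one reading of the printed formula. The function names are hypothetical, and the results are fractions, so MAPE and CDFR need to be multiplied by 100 to be read as percentages.

```python
def nmse(actual, predicted):
    """Eq. (15): squared error normalised by the variation of the observations."""
    mean = sum(actual) / len(actual)
    num = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return num / sum((y - mean) ** 2 for y in actual)


def mape(actual, predicted):
    """Eq. (16): mean absolute percentage error, as a fraction."""
    return sum(abs(y - p) / y for y, p in zip(actual, predicted)) / len(actual)


def cdfr(actual, predicted):
    """Eq. (17): share of correctly forecast directions, divided by T as printed."""
    hits = sum(
        1
        for i in range(1, len(actual))
        # Assumption: predicted change is taken relative to the last observed value.
        if (actual[i] - actual[i - 1]) * (predicted[i] - actual[i - 1]) > 0
    )
    return hits / len(actual)


actual = [100.0, 110.0, 105.0, 120.0]        # toy observations
predicted = [98.0, 108.0, 112.0, 118.0]      # toy forecasts
results = (nmse(actual, predicted), mape(actual, predicted), cdfr(actual, predicted))
```

Because the sum in Equation (17) starts at i = 2 while the divisor is T, a forecast of T points can reach at most (T − 1)/T on this measure.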

4.3 Parameter Settings

Before running the PPR, BP-ANN, and PPR-GP algorithms, some parameters have to be determined. In standard PPR, three parameters need to be predefined. The parameter ‘nterms’ denotes the number of terms to be included in the final model. The parameter ‘optlevel’, the level of optimization, controls how thoroughly the models are refitted during this process. At level 1, the projection directions are not refitted, but the ridge functions and the regression coefficients are. At level 2, all the terms are refitted. At level 3, PPR rebalances the contributions from each regressor at each step, and is thus less likely to converge to a saddle point of the sum of squares criterion. The parameter ‘nspan’ defines the number of observations in the span of the running line smoother. In this investigation, the parameter vector [nterms, optlevel, nspan] is defined as [6, 3, 3].

In BP-ANN, ‘ninputs’, ‘nhiddens’, and ‘noutputs’ are the numbers of input, hidden, and output nodes and are set to 6, 10, and 1, respectively. The type of transfer function is set to ‘tansig’. The threshold of goodness-of-fit is set to 0.9 and the maximum number of iterations to 3000. For more robustness, 5-fold cross-validation is employed.

In PPR-GP, the parameter ‘npop’ is the number of individuals in the initial generation, which is set to 1000. ‘reprod’, ‘crossover’, and ‘mutation’ are the probabilities of reproduction, crossover, and mutation. The parameter vector [crossover, mutation] is set to [0.4, 0.01]. Instead of a constant probability, PPR-GP reproduces individuals with a dynamically adjusted probability. Let $s_k$ denote the number of times the $k$-th individual has been selected as a candidate for the next generation, and let $r$ denote the number of individuals that have not been selected. Define $P_k$ as

$$P_k = \frac{R_k^2}{\sum_{j=1}^{npop} R_j^2}, \tag{18}$$

where $R_j^2$ is the goodness-of-fit of the $j$-th individual. The reproduction probability of the $k$-th individual can then be computed by the equation

$$P_k^{adj} = \frac{npop \times P_k - s_k}{r}, \tag{19}$$

where $P_k^{adj}$ is the adjusted reproduction probability of the $k$-th individual.
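A small worked example may help with Equations (18) and (19). The sketch below simply evaluates the two formulas; the function name `adjusted_reproduction_probs`, its argument names, and the toy numbers are assumptions for illustration.

```python
def adjusted_reproduction_probs(r2, selected, remaining):
    """Eqs. (18)-(19): fitness shares adjusted by how often each individual was picked."""
    npop = len(r2)
    total = sum(r2)
    base = [value / total for value in r2]                       # Eq. (18)
    return [(npop * p - s) / remaining                           # Eq. (19)
            for p, s in zip(base, selected)]


# Three individuals; the first has already been selected once, two remain unselected.
probs = adjusted_reproduction_probs([0.5, 0.3, 0.2], selected=[1, 0, 0], remaining=2)
```

Here the fitness shares are 0.5, 0.3, and 0.2, so the adjusted values are (1.5 − 1)/2 = 0.25, 0.9/2 = 0.45, and 0.6/2 = 0.30: an individual that has already been selected sees its probability reduced relative to its raw fitness share.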

4.4 Outlier Processing

Following the outlier processing method described in Section 2, the temporal trend is first removed from the original series, and then the X-12 algorithm is employed to eliminate the seasonal effects. The newly generated series is divided into 15 subparts, each with 6 observations except the last, which has 4. We compute $lof_i$ ($i = 1, 2, \cdots, 15$) and find that $lof_9$, $lof_{10}$, and $lof_{11}$ are significantly larger than 1; therefore, the time series in the period from Jan. 2009 to Jun. 2010 contains outliers. We collect news, academic papers, and research reports from important institutions such as the IMF, the World Bank, and the UN Department of Economic and Social Affairs for this period. The materials show that the U.S. subprime crisis is the main driver of the outliers and that its effects will taper off in 2010. Therefore, the outliers are judged to be temporary, and the corresponding dummy variables are constructed using the method in Subsection 2.3.

4.5 Comparison of Different Models

We first compare standard models against outlier-processed models of BP-ANN, SARIMA, and PPR to validate the proposed outlier processing method. Subsequently, PPR-GP is compared with BP-ANN, SARIMA, and PPR on the same outlier-processed data to verify the superiority of PPR-GP. After that, to demonstrate the overall advantage of the outlier-processed PPR-GP over standard models, the outlier-processed PPR-GP is compared with standard models of SARIMA, BP-ANN, and PPR. Here, ‘standard models’ means the models built on the original data, and ‘outlier-processed models’ means the models built on the outlier-processed data.

1) Outlier-processed models vs. standard models

Figure 9 visually describes the outputs of the 6 different models, where SARIMA1, BP-ANN1, and PPR1 are outlier-processed models, while SARIMA2, BP-ANN2, and PPR2 are standard models. Table 1 compares their performance in terms of NMSE, MAPE, and CDFR. From Figure 9 and Table 1, we can conclude that outlier processing indeed enhances the forecasting performance of SARIMA, BP-ANN, and PPR, which supports the value of the proposed outlier processing algorithm.

[Figure 9: A graphical comparison between outlier-processed models and standard models. Series: True value, SARIMA1, SARIMA2, BP-ANN1, BP-ANN2, PPR1, PPR2. Vertical axis: thousand TEU; horizontal axis: month, 2011M05 to 2012M04.]

Table 1 Performance of outlier-processed models and standard models

Models     NMSE   MAPE(%)   CDFR(%)
SARIMA1    0.25   1.50      83.3
SARIMA2    0.41   2.29      75.0
BP-ANN1    0.21   1.94      91.7
BP-ANN2    0.30   2.46      83.3
PPR1       0.13   1.80      91.7
PPR2       0.20   1.92      83.3

Notes: model1 and model2 respectively denote the outlier-processed model and the standard model.

2) Comparison among different outlier-processed models

In this subpart, the data are preprocessed by the proposed outlier processing method, and on this basis the forecasting performances of PPR-GP, SARIMA, BP-ANN, and PPR are compared. Figure 10 visually illustrates the projected values generated by the 4 forecasting models, and Table 2 compares their forecasting performances. Figure 10 and Table 2 clearly show that PPR-GP performs best in terms of all three evaluation criteria.

[Figure 10: A graphical comparison among different outlier-processed models. Series: True value, SARIMA1, BP-ANN1, PPR1, PPR-GP1. Vertical axis: thousand TEU; horizontal axis: month, 2011M05 to 2012M04.]

Table 2 Performance of different outlier-processed models

Models     NMSE   MAPE(%)   CDFR(%)
PPR-GP1    0.10   1.50      91.7
SARIMA1    0.25   1.88      83.3
BP-ANN1    0.21   1.94      91.7
PPR1       0.13   1.80      91.7

Notes: model1 denotes the outlier-processed model.

3) Comparison of outlier-processed PPR-GP against three standard models

In order to present the superiority of outlier-processed PPR-GP over standard models of BP-ANN, SARIMA, and PPR, their forecasting performances are compared and listed in Table 3. The results show that outlier-processed PPR-GP significantly outperforms the others, owing to the contribution of outlier processing and the advantage of standard PPR-GP over other standard models.

Table 3 Performance of outlier-processed PPR-GP and three standard models

Models     NMSE   MAPE(%)   CDFR(%)
PPR-GP1    0.10   1.50      91.7
SARIMA2    0.41   2.29      75.0
BP-ANN2    0.30   2.46      83.3
PPR2       0.20   1.92      83.3

Notes: model1 and model2 respectively denote the outlier-processed model and the standard model.

5 Conclusion

Considering that most prior studies on outliers focus only on outlier detection and seldom pay attention to further processing, this paper proposes a new outlier processing algorithm that can effectively detect outliers in time series and deal with different types of outliers with different approaches based on domain knowledge. Subsequently, a hybrid of PPR and GP is proposed, which not only effectively fits nonlinear relationships but also solves the difficulty of determining ridge functions. The outlier-processed PPR-GP model is applied to forecasting the container throughput of Qingdao Port, and three other models, SARIMA, BP-ANN, and PPR, are employed as benchmarks. In the first step, SARIMA, BP-ANN, and PPR are built on both the original data and the outlier-processed data. The comparison results indicate that outlier processing can significantly improve the performance of the models. In the second step, PPR-GP, SARIMA, BP-ANN, and PPR are compared on the same data. The results show that PPR-GP performs best in terms of the same evaluation criteria. In the third step, we compare the outlier-processed PPR-GP model with standard models of SARIMA, BP-ANN, and PPR, and the results suggest that the former significantly outperforms all of its rivals, owing to the aggregate contribution of the outlier processing algorithm and the PPR-GP model. This method has also been used to forecast the container throughput of the global top 20 ports and has obtained satisfactory forecasting performance. Detailed information can be found in the book series on the outlook of container throughput of the global top 20 ports, issued by the Center for Forecasting Science, Chinese Academy of Sciences. It is notable that the proposed forecasting method can be applied not only to container throughput prediction but also to other prediction areas, such as the volume of foreign trade, foreign exchange rates, and so on.

References

[1] Hawkins D, Identification of Outliers, Chapman and Hall, London, 1980.
[2] Pettit L I and Smith A F M, Outliers and Influential Observations in Linear Models, Bayesian Statistics 2 (eds. by Bernardo J M, DeGroot M H, Lindley D V, and Smith A F M), North-Holland, Amsterdam, 1985.
[3] McCulloch R E and Tsay R S, Bayesian analysis of autoregressive time series via the Gibbs sampler, Journal of Time Series Analysis, 1994, 15(2): 235–250.
[4] Chaloner K and Brant P, A Bayesian approach to outlier detection and residual analysis, Biometrika, 1998, 75(4): 651–659.
[5] Giuli M E D, Maggi M A, and Tarantola C, Bayesian outlier detection in capital asset pricing model, Statistical Modelling, 2010, 10(4): 379–390.
[6] Shotwell M S and Slatey E H, Bayesian outlier detection with Dirichlet process mixtures, Bayesian Analysis, 2011, 6(4): 665–690.
[7] Sardy S, Tseng P, and Bruce A, Robust wavelet denoising, IEEE Transactions on Signal Processing, 2011, 49(6): 1146–1152.
[8] Struzik Z R and Siebes A P J M, Wavelet transform based multifractal formalism in outlier detection and localization for financial time series, Physica A, 2002, 309(3–4): 388–402.
[9] Ranta R, Louis-Dorr V, Heinrich C, and Wolf D, Iterative wavelet-based denoising methods and robust outlier detection, IEEE Signal Processing Letters, 2005, 12(8): 557–560.
[10] Bilen C and Huzurbazar S, Wavelet-based detection of outliers in time series, Journal of Computational and Graphical Statistics, 2002, 11(2): 311–327.
[11] Grané A and Veiga H, Wavelet-based detection of outliers in financial time series, Computational Statistics and Data Analysis, 2010, 54(11): 2580–2593.
[12] Knorr E M, Ng R T, and Tucakov V, Distance-based outliers: Algorithms and applications, International Journal on Very Large Data Bases, 2000, 8(3–4): 237–253.
[13] Bay S D and Schwabacher M, Mining distance-based outliers in near linear time with randomization and a simple pruning rule, Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 2003.
[14] Angiulli F, Basta S, and Pizzuti C, Distance-based detection and prediction of outliers, IEEE Transaction on Knowledge and Data Engineering, 2006, 18(2): 145–160.
[15] Pasha M Z and Umesh N, A comparative study on outlier detection techniques, International Journal of Computer Applications, 2013, 66(24): 23–27.
[16] Baragona R, Battaglia F, and Calzini C, Genetic algorithms for the identification of additive and innovation outliers in time series, Computational Statistics & Data Analysis, 2001, 37(1): 1–12.
[17] Tolvi J, Genetic algorithms for outlier detection and variable selection in linear regression models, Soft Computing, 2004, 8(8): 527–533.
[18] Ozlem G A, Serdar K, and Aybars U, Genetic algorithms for outlier detection in multiple regression with different information criteria, Journal of Statistical Computation and Simulation, 2011, 81(1): 29–47.
[19] Raja P V and Bhaskaran V M, An effective genetic algorithm for outlier detection, International Journal of Computer Applications, 2012, 38(6): 30–33.
[20] Markou M and Singh S, Novelty detection: A review-part 1: Statistical approaches, Signal Processing, 2003, 83(12): 2481–2497.
[21] Markou M and Singh S, Novelty detection: A review-part 2: Neural network based approaches, Signal Processing, 2003, 83(12): 2499–2521.
[22] Beckman R J and Cook R D, Outliers in statistical data, Technometrics, 1983, 25(2): 119–149.
[23] Hawkins D M, Bradu D, and Kass G V, Location of several outliers in multiple regression data using elemental sets, Technometrics, 1984, 26(3): 197–208.
[24] Barnett V and Lewis T, Outliers in Statistical Data, John Wiley & Sons, Chichester, 1984.
[25] Patcha A and Park J M, An overview of outlier detection techniques: Existing solutions and latest technological trends, Computer Networks, 2007, 51(12): 3448–3470.
[26] Cousineau D and Chartier S, Outlier detection and treatment: A review, International Journal of Psychological Research, 2010, 3(1): 58–67.
[27] Hodge V J and Austin J, A survey of outlier detection methodologies, Artificial Intelligence Review, 2004, 22(2): 85–126.
[28] Singh K and Upadhyaya S, Outlier detection: Applications and techniques, International Journal of Computer Science Issue, 2012, 9(1): 1694–0814.
[29] Zhang J, Advancements of outlier detection: A survey, ICST Transactions on Scalable Information Systems, 2013, 13(1–3): 1–24.
[30] Pahuja D and Yadav R, Outlier detection for different applications: Review, International Journal of Engineering Research & Technology, 2013, 2(3): 1–13.
[31] Friedman J H and Stuetzle W, Projection pursuit regression, Journal of the American Statistical Association, 1981, 76(376): 817–823.
[32] Du H, Wang J, Zhang X, Yao X, and Hu Z, Prediction of retention times of peptides in RPLC by using radial basis function neural networks and projection pursuit regression, Chemometrics and Intelligent Laboratory Systems, 2008, 92(1): 92–99.
[33] Diaconis P and Shahshahani M, On nonlinear functions of linear combinations, SIAM Journal on Scientific and Statistical Computing, 1984, 5(1): 175–191.
[34] Aldrin M, Moderate projection pursuit regression for multivariate response data, Computational Statistics & Data Analysis, 1996, 21(5): 501–531.
[35] Lingjærde O C and Liestøl K, Generalized projection pursuit regression, SIAM Journal on Scientific and Statistical Computing, 1998, 20(3): 844–857.
[36] Posse C, Projection pursuit exploratory data analysis, Computational Statistics & Data Analysis, 1998, 20(6): 669–687.
[37] Du H Y, Wang J, Zhang X Y, Yao X J, and Hu Z D, Prediction of retention times of peptides in RPLC by using radial basis function neural networks and projection pursuit regression, Chemometrics and Intelligent Laboratory Systems, 2008, 92(1): 92–99.
[38] Du H, Wang J, Hu Z, Yao X, and Zhang X, Prediction of fungicidal activities of rice blast disease based on least-squares support vector machines and projection pursuit regression, Journal of Agricultural and Food Chemistry, 2008, 56(22): 10785–10792.
[39] Liu P and Long W, Current mathematical methods used in QSAR/QSPR studies, International Journal of Molecular Sciences, 2009, 10(5): 1978–1998.
Guo Q J and Yang J G, Application of projection pursuit regression to thermal error modeling of a CNC machine tool, International Journal of Advanced Manufacturing Technology, 2011, 55(5): 623–629. Guo Q J, Yu S S, and He L, Research on tool wear monitoring method based on project pursuit regression for a CNC machine tool, Research Journal of Applied Sciences, Engineering, and Technology, 2014, 7(3): 438–441. Fildes R and Stekler H, The state of macroeconomic forecasting, Journal of Macroeconomics, 2002, 24(4): 435–468. Huang A Q, Xiao J, and Wang S Y, A combined forecast method integrating contextual knowledge, International Journal of Knowledge and Systems Science, 2011, 2(4): 39–53. Breunig M M, Kriegel H P, Ng R T, and Sander J, LOF: Identifying Density-Based Local Outliers, Proceedings of the 29th ACM SIGMOD International Conference on Management Data, Dallas, Texas, USA, 2000. Fox A J, Outliers in time series, Journal of Royal Statistical Society, Series B, 1972, 34(3): 350–363. Denby L and Martin R D, Robust estimation of the first order autoregressive parameter, Journal of the American Statistical Association, 1979, 74(365): 140–146. Brezillion P and Pomerol J, Contextual knowledge sharing and cooperation in intelligent assistant


[48] [49]

[50] [51] [52] [53] [54] [55] [56]

121

systems, Le Travail Humain, 1999, 62(3): 223–246. Koza J R, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, 1992. Koza J R, Mydlowec W, Lanza G, Yu J, and Keane M A, Reverse engineering and automatic synthesis of metabolic pathways from observed data using genetic programming, Pacific Symposium on Biocomputing, 2001, 6: 434–445. Fonlupt C, Solving the ocean color problem using a genetic programming approach, Applied Soft Computing, 2001, 1(1): 63–72. Sugimoto M, Kikuchi S, and Tomita M, Reverse engineering of biochemical equations from time course data by means of genetic programming, BioSystems, 2005, 80(2): 155–164. Worzel W P, Yu J J, Almal A A, and Chinnaiyan A M, Applications of genetic programming in cancer research, International Journal of Biochemistry & Cell Biology, 2009, 41(2): 405–413. Tsai H C, Using weighted genetic programming to program squat wall strengths and tune associated formulas, Engineering Applications of Artificial Intelligence, 2011, 24(3): 526–533. Forouzanfar M, Doustmohammadi A, Hasanzade S, and Shakouri G H, Transport energy demand forecast using multi-level genetic programming, Applied Energy, 2012, 91(1): 496–503. Kumru M and Kumru P Y, Using artificial neural networks to forecast operation times in metal industry, International Journal of Computer Integrated Manufacturing, 2014, 27(1): 48–59. Mombeni H A, Rezaei S, Nadarajah S, and Emami M, Estimation of water demand in Iran based on SARIMA models, Environmental Modeling & Assessment, 2013, 18(5): 559–565.