Electricity Load Forecasting Based on Autocorrelation Analysis Rohen Sood, Irena Koprinska, and Vassilios G. Agelidis

Abstract—We present new approaches for 5-minute ahead electricity load forecasting. They were evaluated on data from the Australian electricity market operator for 2006-2008. After examining the load characteristics using autocorrelation analysis with a 4-week sliding window, we selected 51 features. Using this feature set with linear regression and support vector regression, we achieved an improvement of 7.56% in the Mean Absolute Percentage Error (MAPE) over the industry model, which uses a backpropagation neural network. We then investigated the application of a number of methods for further feature subset selection. Using subsets of 38 and 14 of these features with the same algorithms, we were able to achieve improvements of 6.53% and 4.81% in MAPE, respectively, over the industry model.

I. INTRODUCTION

Accurate electricity load forecasting is needed for numerous decisions, including the optimal commitment of generators and setting the minimum reserve. The goal is to ensure reliable supply while keeping the operating cost low. It is estimated that reducing the Mean Absolute Percentage Error (MAPE) by 1% for a 10 GW generator saves US$1.6 million per year [1].

In this paper we consider 5-minute ahead load forecasting. This task is classified as Very Short-Term Load Forecasting (VSTLF), i.e. forecasting hours or minutes ahead. VSTLF is very important in deregulated and competitive energy markets, such as the Australian national electricity market. Market participants offer to supply specific amounts of electrical energy at a certain price. All bids are collated by the market operator, and shortfalls of supply against expected demand are published and used by the participants for rebids. The allocation of the generators is based on reducing the total costs, i.e. the cheapest generator is allocated first, the second cheapest next, and so on until the electricity demand is met. The market operator also determines the minimum generation reserve that should be kept. In addition, accurate VSTLF is required by electricity retailers for effective load shifting between transmission substations of the retail networks, especially during periods of abnormal peak [2]. Finally, forecasting errors have significant implications for profits, market shares and shareholder values [3].

This paper is an extension of our previous work [4], where we reported the performance of several prediction algorithms on a slightly different dataset. As input variables we used the load from the day of the prediction and the load on the same day of the previous week. However, the electricity load exhibits complex weekly and daily patterns which were not fully encompassed by the input variables previously used. In this paper we apply autocorrelation analysis with a 4-week-back sliding window to systematically identify the important variables for accurate electricity load forecasting. We use data from 2006-2008 for the state of New South Wales (NSW) in Australia, provided by the Australian electricity market operator [5]. Our contribution can be summarized as follows:
1. We study the load characteristics using autocorrelation analysis and a 4-week-back sliding window. Based on this analysis we select an appropriate set of variables (features).
2. We investigate whether further feature selection can reduce the number of features and improve, or at least maintain, accuracy. A smaller feature set also allows for faster training and utilization of the prediction model, and a better understanding of the underlying process. We study three feature selection methods: the state-of-the-art correlation-based [13] and RReliefF [15] methods, and also an expert selection method.
3. We compare the performance of four prediction algorithms: Back-Propagation Neural Network (BPNN), Support Vector Regression (SVR), Linear Regression (LR) and Least Median Squares Regression (LMS). BPNN is the algorithm predominantly used by the research community and industry forecasters for VSTLF. LR and LMS are standard linear regression algorithms; Hippert et al. [6] reviewed neural networks for load forecasting and noted the need for systematic testing and comparison of BPNN with standard linear regression techniques. SVR was chosen as it is a state-of-the-art prediction algorithm.

The paper is organised as follows. Section II reviews previous research on VSTLF. Section III presents the autocorrelation analysis and the selected feature set. Section IV describes the benchmark feature sets we used for comparison and the methods we used for further feature selection. Sections V and VI summarize the prediction algorithms and the evaluation procedure, respectively. The results are presented and discussed in Section VII. Finally, the last section provides conclusions and future directions.

Rohen Sood is with the School of Electrical and Information Engineering, University of Sydney, NSW 2006, Australia (e-mail: [email protected]). Irena Koprinska is with the School of Information Technologies, University of Sydney, NSW 2006, Australia (phone: +61 2 9351 3764, fax: +61 2 9351 3838, e-mail: [email protected]). Vassilios G. Agelidis is with the Centre for Energy Research and Policy Analysis, University of New South Wales, Sydney, NSW 2052, Australia (e-mail: [email protected]).

II. PREVIOUS WORK ON VSTLF

There are only a few published papers on VSTLF. In contrast, short-term load forecasting (from a day to several weeks ahead) has been an active area of research, and a variety of prediction algorithms (e.g. regression, time series and neural networks) have been applied [7, 8]. Neural networks, in particular BPNNs, are the most popular prediction algorithms for both short-term forecasting and VSTLF. BPNNs are attractive as they can model non-linear input/output relationships; it is believed that the electricity load depends on many variables and that the relationship is non-linear. Another advantage of BPNNs is that they can learn this relationship from a sample of input/output training examples.
One of the first VSTLF studies was conducted by Liu et al. [9], who applied BPNN, fuzzy logic rules and a simple autoregressive model. The task was to predict the load every minute for 30 points ahead using the previous 30 minute-by-minute data. It was found that BPNN and fuzzy rules were much more accurate than the autoregressive model. Charytoniuk and Chen [10] used BPNN to predict the load 10 minutes ahead using the load in the previous 20-90 minutes. Instead of forecasting the actual load, the task was shifted to forecasting load differences. The authors argued that predicting the load difference is more robust and less sensitive to the requirement of having training data that is representative of all possible loads. They experimented with different BPNN architectures and input variables and were able to achieve a MAPE of 0.4-1%. The proposed method was implemented in a power utility in the United States and showed good accuracy and reliability. Shamsollahi et al. [11] used a BPNN for 5-minute ahead load prediction of the electricity market in New England, USA. Their approach is simple and very effective.
The data was preprocessed by applying a logarithmic difference of the consecutive loads; the BPNN's architecture consisted of 1 hidden layer with 8 nodes and 1 output node. The stopping criterion for the BPNN training was based on a validation set, which is a standard approach used to prevent overtraining. An excellent MAPE of 0.12% was achieved and the approach was integrated in the New England energy system. Chen and York [12] used a complex hierarchical architecture of BPNNs for 15-minute ahead prediction. To predict the load for each day of the week, 5 BPNNs were used to cover different time intervals of the 24-hour period, and their decisions were combined using another BPNN. The reported MAPE results are between 0.28% and 0.87%. Taylor [2] used minute-by-minute British electricity load data to predict the load 10-30 minutes ahead. He studied a number of statistical methods based on Autoregressive Integrated Moving Average (ARIMA) and exponential smoothing. Some of the methods captured both the intraday and intraweek cycles, others ignored both of them or captured only the intraweek cycle. The best forecasting method was an adaptation of Holt-Winters smoothing for double

seasonality, achieving a MAPE of about 0.4% for 30-minute ahead prediction. The MAPE results for 5-minute ahead prediction were also reported; they were around 0.25% for the best performing methods (double seasonal Holt-Winters, restricted intraday cycle smoothing, ARIMA).
In our previous work [4] we investigated SVR for 5-minute ahead load forecasting. Similarly to BPNN, SVR can form a non-linear mapping and can learn from examples. It is also able to overcome two of the main drawbacks of BPNN: overtraining and the need to tune many parameters. We found that SVR was more accurate than BPNN. However, the simpler LR and LMS produced similar accuracy to SVR and were faster to train. We also studied the performance of four feature sets based on historical load data from the prediction day and the same day of the previous week, and found set FS1 (see Section IV) to perform best. However, given the complex weekly and daily pattern of the electricity load, other variables, not limited to the load on the two days listed above, are likely to be useful and improve accuracy. We investigate this question using autocorrelation analysis, as described in the next section.

III. AUTOCORRELATION ANALYSIS

The autocorrelation coefficient measures the correlation of a time series with itself at different time lags. It is defined as:

$$r_k = \frac{\sum_{t=k+1}^{n} (X_t - \bar{X})(X_{t-k} - \bar{X})}{\sum_{t=1}^{n} (X_t - \bar{X})^2},$$

where Xt is the value of the time series at time t and X̄ is the mean value. Thus, r1 shows how the X values 1 lag apart (i.e. the successive values) relate to each other, r2 indicates how the X values 2 lags apart relate to each other, and so on. The autocorrelations at different lags together form the autocorrelation function. This function is used to investigate the cyclic nature of the time series. Spikes, i.e. values close to 1 or -1, indicate high positive or negative autocorrelation, and values close to 0 indicate lack of autocorrelation.
Fig. 1 shows the autocorrelation function graph for August 2006 for NSW. The graph for 2007 is very similar and not shown due to space limitations. As can be seen, there are weekly and daily patterns in the electricity load. In particular, there are 6 clusters of high spikes; they are listed below in decreasing order of their height, starting with the highest:
1) the 1st group occurs at the lags closest to the value to be predicted, i.e. lags 1, 2, 3, 4, etc., corresponding to the loads 5, 10, 15, 20, etc. minutes before, respectively;
2) the 2nd group is around lag 288, i.e. 1 day before;
3) the 3rd group is around lag 576, i.e. 2 days before;
4) the 4th group is around lag 2016, i.e. 1 week before;
5) the 5th group is around lag 864, i.e. 3 days before;
6) the 6th group is around lag 1728, i.e. 6 days before.
Based on this analysis, to predict the load Xt+1 we selected the variables listed in Table 1.
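The autocorrelation formula above can be sketched in a few lines; the snippet below is an illustrative check on synthetic data with a daily cycle (288 five-minute lags per day, as in the paper), not the code used in the experiments — the function name and the synthetic load are ours:

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation r_k for k = 1..max_lag (the formula above)."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    denom = np.sum(dev ** 2)
    return np.array([np.sum(dev[k:] * dev[:-k]) / denom
                     for k in range(1, max_lag + 1)])

# Synthetic 4-week load with a daily cycle: 288 five-minute lags per day.
rng = np.random.default_rng(0)
t = np.arange(288 * 28)
load = 8000 + 1500 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 50, t.size)

r = autocorrelation(load, 600)
# A spike close to 1 is expected around lag 288 (the load 1 day before).
print(r[287])
```

On real load data the spike clusters listed above would appear as local maxima of `r` around lags 288, 576, 864, 1728 and 2016.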

Fig. 1. Autocorrelation function of the electricity load for NSW for August 2006 (1 lag = 5 minutes)

TABLE 1. SELECTED VARIABLES BASED ON THE AUTOCORRELATION ANALYSIS TO PREDICT THE LOAD Xt+1

Peak cluster | Variables | Description
1 | Xt,...,Xt-8 | Load on the forecast day at times t,...,t-8
2 | XDt+2,...,XDt-4 | Load 1 day before at times t+2,...,t-4
3 | XD2t+2,...,XD2t-4 | Load 2 days before at times t+2,...,t-4
4 | XXt+6,...,XXt-7 | Load 7 days before at times t+6,...,t-7
5 | XD3t+2,...,XD3t-4 | Load 3 days before at times t+2,...,t-4
6 | XD6t+2,...,XD6t-4 | Load 6 days before at times t+2,...,t-4
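Table 1 maps onto concrete lag offsets behind the target Xt+1 (1 day = 288 five-minute lags, so cluster 2 covers lags 287-293, and so on). A minimal sketch of building the resulting 51-column design matrix from a load series — the function names and the dummy series are ours, not the paper's:

```python
import numpy as np

DAY = 288  # five-minute lags per day

def fs_all_lags():
    """Lag offsets (in 5-min steps) behind the target X_{t+1}, per Table 1."""
    lags = list(range(1, 10))                          # Xt ... Xt-8
    for d in (1, 2, 3, 6):                             # 1, 2, 3 and 6 days before
        lags += list(range(d * DAY - 1, d * DAY + 6))  # times t+2 ... t-4
    lags += list(range(7 * DAY - 5, 7 * DAY + 9))      # 7 days before, t+6 ... t-7
    return sorted(lags)

def make_dataset(load, lags):
    """One row per predictable point: lagged features aligned with target y."""
    load = np.asarray(load, dtype=float)
    m = max(lags)
    X = np.column_stack([load[m - k:len(load) - k] for k in lags])
    y = load[m:]
    return X, y

lags = fs_all_lags()
assert len(lags) == 51          # matches the 51 features of FS-all
X, y = make_dataset(np.arange(3000.0), lags)
print(X.shape)                  # (976, 51): 3000 - 2024 usable rows
```

The largest offset (2024 lags, just over 7 days) determines how much history each training row consumes.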

IV. FEATURE SELECTION

Feature selection is the process of removing irrelevant and redundant features and selecting a small set of informative features that are necessary and sufficient for good prediction. It helps to avoid overfitting and also reduces the data dimensionality, which means faster building and utilization of the prediction model. In some cases the accuracy can also be improved, or at least maintained.

A. Feature sets FS-all, FS1 and FS-industry

FS-all is based on the autocorrelation analysis from the previous section and includes all 51 variables from Table 1. Its performance will be compared with two other sets: FS1 and FS-industry. The three feature sets, FS-all, FS1 and FS-industry, are described in Table 2. FS1 is the feature set that we found to work best in our previous work [4]. To predict the load at a given time, it uses the loads from the previous 5 lags on that day, and also the load at exactly the same time 7 days ago and the previous 5 lags 7 days ago.

TABLE 2. FEATURE SETS FS-ALL, FS1 AND FS-INDUSTRY

FS-all (51 features), predicting Xt+1 based on:
- Xt,...,Xt-8: load on the forecast day at times t,...,t-8 (9 features)
- XDt+2,...,XDt-4: load 1 day before at times t+2,...,t-4 (7 features)
- XD2t+2,...,XD2t-4: load 2 days before at times t+2,...,t-4 (7 features)
- XXt+6,...,XXt-7: load 7 days before at times t+6,...,t-7 (14 features)
- XD3t+2,...,XD3t-4: load 3 days before at times t+2,...,t-4 (7 features)
- XD6t+2,...,XD6t-4: load 6 days before at times t+2,...,t-4 (7 features)
Example: the load at 10:30 on Monday will be predicted using the load on the same day at times 10:25, 10:20, 10:15, 10:10, 10:05, 10:00, 9:55, 9:50, 9:45, the load 1 day before (Sunday) at times 10:35, 10:30, 10:25, 10:20, 10:15, 10:10, 10:05, the load 2 days before (Saturday) at times 10:35, 10:30, 10:25, 10:20, 10:15, 10:10, 10:05, the load 3 days before (Friday) at times 10:35, 10:30, 10:25, 10:20, 10:15, 10:10, 10:05, the load 6 days before (Tuesday) at times 10:35, 10:30, 10:25, 10:20, 10:15, 10:10, 10:05, and the load 7 days before (Monday) at times 10:55, 10:50, 10:45, 10:40, 10:35, 10:30, 10:25, 10:20, 10:15, 10:10, 10:05, 10:00, 9:55, 9:50.

FS1 (11 features), predicting Xt+1 based on:
- Xt,...,Xt-4: load on the forecast day at times t,...,t-4 (5 features)
- XXt+1,...,XXt-4: load on the same weekday the previous week at times t+1,...,t-4 (6 features)
Example: the load at 10:30 on Monday will be predicted using the load on the same day at times 10:25, 10:20, 10:15, 10:10, 10:05 and the load 7 days before (Monday) at times 10:30, 10:25, 10:20, 10:15, 10:10 and 10:05.

FS-industry (9 features), predicting ln(Xt+1/Xt), which is then transformed back to Xt+1:
- ln(Xt/Xt-1),...,ln(Xt-3/Xt-4): logarithmic load differences from the forecast day (4 features)
- ln(XXt+1/XXt),...,ln(XXt-3/XXt-4): logarithmic load differences from the same day 1 week before (5 features)
Example: the load at 10:30 on Monday will be predicted based on the logarithmic differences of successive load values from the same day at times 10:25, 10:20, 10:15, 10:10, 10:05, and also of the load values on Monday 1 week before at times 10:30, 10:25, 10:20, 10:15, 10:10 and 10:05.
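The log-difference transform of FS-industry in Table 2 and its inverse can be sketched as follows; this is a minimal illustration of the scheme, with arbitrary example load values and an arbitrary predicted log-difference of 0.002 (the names are ours, not the paper's):

```python
import math

def to_log_diffs(loads):
    """ln(X_i / X_{i-1}) for consecutive loads, as in FS-industry."""
    return [math.log(b / a) for a, b in zip(loads, loads[1:])]

def forecast_load(x_t, predicted_log_diff):
    """Invert the target transform: a predicted ln(X_{t+1}/X_t) -> X_{t+1}."""
    return x_t * math.exp(predicted_log_diff)

recent = [8000.0, 8050.0, 8100.0, 8075.0, 8120.0]  # 5 consecutive loads (MW)
feats = to_log_diffs(recent)                        # yields 4 features
x_next = forecast_load(recent[-1], 0.002)           # back-transform a prediction
print(x_next)
```

The same transform would be applied to the loads from the same day one week before to obtain the remaining 5 features.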

FS-industry is a typical feature set used by industry. It uses the same variables (11 previous loads) as FS1 but in transformed form. The loads are first pre-processed by applying a natural logarithm conversion; the 9 features are then computed as the differences between the successive loads, i.e. the first feature will be ln(Xt) - ln(Xt-1) = ln(Xt/Xt-1), as shown in Table 2. This transformation is the same as in [11]; the rationale behind it is to improve the data stationarity and highlight the look-ahead characteristic. The variable to be predicted is ln(Xt+1/Xt), which is transformed back to Xt+1. In contrast, all the other feature sets use actual (untransformed) previous loads to directly predict Xt+1. The feature set FS-industry is typically employed in conjunction with BPNN; we will call this combination the "industry model" and use it as a baseline.

B. Further Feature Reduction of FS-all

Our goal was to investigate whether further feature selection applied to FS-all can be beneficial, i.e. whether the number of features can be reduced while maintaining similar accuracy or improving the accuracy. We applied three methods for feature selection: correlation-based, the RReliefF algorithm and manual expert selection. The features were selected using the training data only, to ensure realistic evaluation and fair comparison of the feature selection methods.
1) Correlation-Based Feature Selection (CFS)
CFS is a simple, fast and efficient feature subset selection method developed by Hall [13]. It searches for the "best" subset of features, where "best" is defined by a heuristic based on: 1) how good the individual features are at predicting the class and 2) how much they correlate with the other features. Good subsets contain features that are highly correlated with the class and uncorrelated with each other.
2) RReliefF
The Relief family of algorithms is used for selecting a subset of features.
These algorithms produce a ranking of all features; the top N highly ranked features are then selected using a user-defined threshold. The base algorithm is Relief [14], which is used for two-class classification problems. The key idea is that high quality features should distinguish between instances from different classes and should have similar values for instances from the same class. More specifically, Relief ranks the features based on how well they distinguish between instances that are near to each other. It randomly selects an instance Ri from the data and finds the nearest neighbor H (nearest hit) from the same class and the nearest neighbor M (nearest miss) from the other class. It then updates the quality score of each feature f by comparing the feature values of Ri with those of H and M. If Ri and H have different values of f, two instances from the same class are separated by f (not desirable), so the score of f is decreased. If Ri and M have different values of f, two instances from different classes are separated by f (desirable), so the score of f is increased. The process is

repeated for m randomly selected instances. ReliefF is an extension of Relief for more than 2 classes. RReliefF [15] is an extension of ReliefF for regression problems. In regression the predicted value is numeric, so hits and misses cannot be used. Instead, they are replaced with the probability that the predicted values are different, based on the relative distance between these values. We used k = 10 neighbors and m = all instances (17,856), i.e. all instances in the training data were sampled (see Section VI), which increases the reliability of the feature scores.
3) Manual Expert Feature Selection (MEFS)
We also performed a manual feature reduction based on the performance of all feature sets, including the sets selected by CFS and RReliefF, and also using our expert knowledge along with trial and error.

V. PREDICTION ALGORITHMS

As in our previous work [4], we applied four prediction algorithms: SVR, LR, LMS and BPNN. As already mentioned, BPNN is the most popular algorithm for load forecasting used by the research community and industry forecasters. LR and LMS are standard linear regression algorithms. Hippert et al. [6] reviewed neural networks for load forecasting and noted the need for systematic testing and comparison of BPNN with standard techniques such as LR. SVR was chosen as it is a state-of-the-art algorithm shown to be successful in many applications [16, 17]. Below we briefly summarize the main ideas behind these algorithms and the parameters used. We used their WEKA implementations [18].
SVR is an extension of the support vector machine algorithm [19] for numeric prediction. SVR produces a decision boundary that can be expressed in terms of a few support vectors and can be used with kernel functions to create complex non-linear decision boundaries while reducing the computational complexity. Similarly to LR, SVR tries to find a function that best fits the training data.
In contrast to LR, it defines a tube around the regression line, using a user-specified parameter ε, within which errors are ignored, and it also tries to maximize the flatness of the line [18, 20]. In addition to ε, the other main parameter is C; it controls the trade-off between training error and model complexity. We used a polynomial kernel, ε = 0.001 and C = 1.
LR is a standard statistical method. It approximates the relationship between the outcome and the independent variables with a straight line. The least squares method is used to find the line which best fits the data. We used the M5 method for variable selection: starting with all variables, at each step the weakest variable is removed based on its standardized coefficient, until no improvement is observed in the error estimate.
LMS is an extension of LR [21] that uses least squares regression functions generated from random sub-samples of the data. The final prediction model is the least squares regression with the lowest median squared error. We used sample size = 4 for the size of the sub-samples.
BPNN is a classical neural network trained with the backpropagation algorithm. We used a network with 1 hidden layer, sigmoid transfer functions, learning rate = 0.3 and momentum = 0.2. The number of hidden neurons was set to the average of the number of input and output neurons. The training stopped when the maximum number of 500 epochs was reached. We experimented with other parameters but found these parameters to perform well.

VI. EVALUATION PROCEDURE AND PERFORMANCE METRICS

To evaluate and compare the performance of the prediction models, we created the models using data from August 2006 and 2007 and tested them on data from August 2008. The training data consisted of 17,856 instances and the test data consisted of 8,928 instances. We used the following accuracy performance metrics:
1) Mean absolute error (MAE):

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |L\_actual_i - L\_forecast_i|$$

where L_actuali and L_forecasti are the actual and forecasted electricity loads, respectively, at the 5-minute interval i, and n is the total number of predicted loads.
2) Relative absolute error (RAE):

$$RAE = \frac{MAE}{MAE_{ZeroR}} \cdot 100\ [\%]$$

It measures the MAE of the prediction model relative to the MAE of a very simple predictor, ZeroR. ZeroR predicts the mean value in the training data and is used as a baseline.
3) Mean absolute percentage error (MAPE), an extension of MAE defined as:

$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{|L\_actual_i - L\_forecast_i|}{L\_actual_i} \cdot 100\ [\%]$$

Previous work mainly reports MAPE, which is also the metric used by industry. However, MAPE is not sufficient when comparing prediction algorithms [6]; we address this by also reporting MAE, which is the standard metric used in such cases.

VII. RESULTS AND DISCUSSION

A. Using the FS-all, FS1 and FS-industry Feature Sets

Table 3 shows the performance of the four prediction algorithms using feature sets FS-all, FS1 and FS-industry. The number of features is shown in brackets. The most accurate prediction was achieved by LR and SVR with FS-all: MAE = 26.64 MW and MAPE = 0.269%. For comparison, the industry model (BPNN with the industry feature set) achieved MAE = 28.88 MW and MAPE = 0.291%. This means that using FS-all with LR or SVR reduces the error of the industry model by 7.76% in MAE and 7.56% in MAPE.
The small RAE values indicate that all four prediction models improved over the baseline of simply predicting the mean value (ZeroR). With FS-all, the RAE values are between 2.13% and 2.66%, which means that the MAE of the prediction models was only 2.13-2.66% of the MAE of the ZeroR baseline — a significant improvement.

TABLE 3. PREDICTIVE PERFORMANCE USING FEATURE SETS FS-ALL, FS1 AND FS-INDUSTRY

                 LR     LMS    BPNN   SVR
FS-all (51)
MAE [MW]         26.64  27.42  33.28  26.65
RAE [%]          2.13   2.19   2.66   2.31
MAPE [%]         0.269  0.277  0.332  0.269
FS1 (11)
MAE [MW]         29.35  29.58  36.21  29.43
RAE [%]          2.34   2.36   2.89   2.35
MAPE [%]         0.296  0.298  0.365  0.297
FS-industry (9)
MAE [MW]         29.41  29.57  28.88  29.42
RAE [%]          2.35   2.37   2.31   2.35
MAPE [%]         0.297  0.299  0.291  0.297

Fig. 2 presents the MAPE results from Table 3 for visual comparison across the three feature sets, for each prediction algorithm. A comparison across the algorithms shows that on feature sets FS-all and FS1 the most accurate algorithm was LR, closely followed by SVR and LMS, and that BPNN was the least accurate algorithm, significantly behind. On FS-industry, the four algorithms performed similarly, with BPNN even slightly better than the others.
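The three metrics are straightforward to compute. The sketch below illustrates them with an ordinary least-squares linear model fitted to synthetic lagged load data; this numpy stand-in and its data are ours — the actual experiments used the WEKA implementations [18]:

```python
import numpy as np

def mae(actual, forecast):
    return np.mean(np.abs(actual - forecast))

def rae(actual, forecast, train_mean):
    """MAE relative to the ZeroR baseline (predicting the training mean)."""
    baseline = np.full_like(actual, train_mean)
    return 100.0 * mae(actual, forecast) / mae(actual, baseline)

def mape(actual, forecast):
    return 100.0 * np.mean(np.abs(actual - forecast) / actual)

# Synthetic load with a daily cycle; features are the 5 previous lags + bias.
rng = np.random.default_rng(1)
t = np.arange(2000)
load = 8000 + 1500 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 30, t.size)
X = np.column_stack([load[4:-1], load[3:-2], load[2:-3],
                     load[1:-4], load[:-5], np.ones(t.size - 5)])
y = load[5:]

n_train = 1500                                   # train on the first part only
w, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
pred = X[n_train:] @ w
print(round(mape(y[n_train:], pred), 3), "% MAPE")
```

As in the paper's setup, the model is fitted on the earlier part of the series and evaluated on a later, unseen part.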


Fig. 2. A comparison across feature sets FS-all, FS1 and FS-industry – MAPE [%]

For all feature sets, LR was the fastest to build the prediction model, followed by LMS, BPNN and SVR. For example, using FS-all, it took 3 seconds to build the LR model, 2 minutes to build the LMS model, 45 minutes to build the BPNN model and 16 hours to build the SVR model. The time to build the model is important if the model needs to be re-trained on-line or very often. The time to utilise the built model (i.e. make predictions on new data) was not an issue; all models were very fast. To conclude, the feature set based on autocorrelation analysis, FS-all, outperformed the benchmark feature sets. This confirmed our hypothesis that there are useful features other than the loads on the day of the prediction and the loads on the same day of the previous week. LR and SVR with FS-all were found to perform best, achieving 7.56%

improvement in MAPE over the industry model, with LR being considerably faster than SVR.

B. Applying Feature Selection to FS-all

We investigate whether it is possible to select a subset of the 51 features in FS-all and improve accuracy or maintain similar accuracy. We apply three feature selection methods: CFS, RReliefF and MEFS. Table 4 shows the feature subsets they selected. Table 5 shows the accuracy results they achieved.

TABLE 4. FEATURE SUBSETS SELECTED BY CFS, RRELIEFF AND MEFS

Feature subset | Selected features
CFS (3) | Xt, XDt+1, XXt+1
R7 (7) | Xt, Xt-1, Xt-2, Xt-3, Xt-4, Xt-5, Xt-8
R11 (11) | R7 + Xt-6, XDt+2, Xt-7, XDt+1
R13 (13) | R11 + XXt+6, XD6t+2
R16 (16) | R13 + XXt+5, XD6t+1, XDt
R24 (24) | R16 + XD2t+2, XXt+4, XD3t+2, XXt+3, XXt+2, XD2t+1, XXt+1, XD3t+1
R30 (30) | R24 + XD6t, XXt-7, XDt-1, XD6t-4, XD6t-1, XXt-6
R38 (38) | R30 + XXt, XD3t, XD2t, XD6t-3, XD3t-4, XDt-2, XD6t-2, XXt-5
R46 (46) | R38 + XDt-4, XDt-3, XD3t-3, XD3t-1, XXt-1, XD3t-2, XXt-4, XD2t-1
MEFS1 (11) | Xt, XDt+1, XDt, XD2t, XD3t+1, XD3t, XD6t+1, XD6t, XXt+1, XXt, XDt-2
MEFS2 (14) | Xt, Xt-3, Xt-6, XDt+1, XDt, XD2t+1, XD2t, XD3t+1, XD3t, XD6t+1, XD6t, XXt+1, XXt, XXt-2

Predictive performance of the MEFS subsets (continuation of Table 5):

                   LR     LMS    BPNN   SVR
FS-all-MEFS1 (11)
MAE [MW]           27.79  27.97  28.31  27.81
RAE [%]            2.22   2.23   2.26   2.22
MAPE [%]           0.281  0.283  0.286  0.282
FS-all-MEFS2 (14)
MAE [MW]           27.37  28.59  28.19  27.37
RAE [%]            2.19   2.28   2.25   2.18
MAPE [%]           0.277  0.289  0.284  0.277

CFS selected only three features: the load on the day of the prediction 1 lag before (Xt), and the loads the previous day (XDt+1) and the previous week (XXt+1) at the time of the prediction. We used best-first forward search to generate and evaluate the feature subsets, as the search space is too big for an exhaustive search algorithm. Fig. 3 compares the MAPE performance of the original and reduced feature sets, and also compares them with the benchmark FS1 and the industry model. The error of the reduced feature set is higher than the error of both benchmark feature sets for all prediction algorithms. Thus, CFS was not successful in selecting a subset of the 51 features in FS-all without decreasing the accuracy.
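The "best subset" heuristic that CFS optimizes is Hall's merit function, Merit = k·r̄cf / sqrt(k + k(k-1)·r̄ff), where k is the subset size, r̄cf the mean feature-class correlation and r̄ff the mean feature-feature correlation [13]. A sketch with illustrative numbers (not taken from our data) shows why highly redundant lag features score poorly:

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Hall's CFS merit heuristic: reward class correlation,
    penalize inter-feature correlation."""
    return (k * mean_feature_class_corr /
            math.sqrt(k + k * (k - 1) * mean_feature_feature_corr))

# Three highly class-correlated but strongly redundant lag features...
redundant = cfs_merit(3, 0.95, 0.99)
# ...versus three slightly weaker but much less correlated features.
diverse = cfs_merit(3, 0.90, 0.40)
print(redundant, diverse)
```

Because consecutive 5-minute loads are themselves highly autocorrelated, this penalty explains why CFS kept only three widely spaced features here.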


Fig. 3. A comparison of the MAPE performance of the CFS feature subset with the original set FS-all, set FS1 and the industry model

While CFS directly produces a feature subset, RReliefF ranks all features and requires a user threshold to select the top N features and form the subset. Fig. 4 shows RReliefF's ranking of all 51 features.


Ranking (highest to lowest weight): Xt, Xt-1, Xt-2, Xt-3, Xt-4, Xt-5, Xt-8, Xt-6, XDt+2, Xt-7, XDt+1, XXt+6, XD6t+2, XXt+5, XD6t+1, XDt, XD2t+2, XXt+4, XD3t+2, XXt+3, XXt+2, XD2t+1, XXt+1, XD3t+1, XD6t, XXt-7, XDt-1, XD6t-4, XD6t-1, XXt-6, XXt, XD3t, XD2t, XD6t-3, XD3t-4, XDt-2, XD6t-2, XXt-5, XDt-4, XDt-3, XD3t-3, XD3t-1, XXt-1, XD3t-2, XXt-4, XD2t-1, XXt-2, XXt-3, XD2t-4, XD2t-2, XD2t-3


TABLE 5. PREDICTIVE PERFORMANCE USING THE FEATURE SETS SELECTED BY CFS, RRELIEFF AND MEFS

                   LR     LMS    BPNN   SVR
FS-all-CFS (3)
MAE [MW]           50.95  50.29  55.24  50.30
RAE [%]            4.07   4.02   4.41   4.02
MAPE [%]           0.516  0.509  0.556  0.509
FS-all-R7 (7)
MAE [MW]           33.65  33.74  36.79  33.65
RAE [%]            2.69   2.70   2.94   2.69
MAPE [%]           0.340  0.341  0.368  0.340
FS-all-R11 (11)
MAE [MW]           32.17  33.33  42.63  32.82
RAE [%]            2.62   2.66   3.40   2.62
MAPE [%]           0.331  0.336  0.432  0.331
FS-all-R13 (13)
MAE [MW]           30.37  30.87  32.48  30.42
RAE [%]            2.43   2.46   2.59   2.43
MAPE [%]           0.307  0.311  0.326  0.308
FS-all-R16 (16)
MAE [MW]           30.00  30.00  31.47  30.04
RAE [%]            2.40   2.40   2.51   2.40
MAPE [%]           0.303  0.303  0.317  0.304
FS-all-R24 (24)
MAE [MW]           29.50  30.89  31.98  29.53
RAE [%]            2.36   2.47   2.55   2.36
MAPE [%]           0.298  0.312  0.322  0.298
FS-all-R30 (30)
MAE [MW]           28.15  28.35  29.81  28.17
RAE [%]            2.25   2.26   2.38   2.25
MAPE [%]           0.284  0.286  0.298  0.285
FS-all-R38 (38)
MAE [MW]           26.89  27.18  30.08  26.92
RAE [%]            2.15   2.17   2.40   2.15
MAPE [%]           0.272  0.275  0.300  0.272
FS-all-R46 (46)
MAE [MW]           26.75  26.87  30.66  26.79
RAE [%]            2.14   2.25   2.45   2.14
MAPE [%]           0.270  0.271  0.306  0.271


Fig. 4. Ranking of the FS-all features using RReliefF

The most informative feature is Xt, the load on the day of the prediction one lag before the prediction time, which is consistent with CFS's selection. The next seven highly ranked features correspond to the load on the same day at the preceding lags. The second ranked feature by CFS, Xt, is only ranked 11th by RReliefF, and the third ranked feature by CFS, XXt+1, is only ranked 23rd by RReliefF. Fig. 4 also shows that there is a sharp drop in the feature score down to Xt-8 (the 7th ranked feature), followed by a further quick decrease down to XD6t+2 (the 13th ranked feature), and a gradual and smaller decrease after that. Based on this, we selected the sets R7, R11, R13, R16, R20, R24, R30, R38 and R46, consisting of the top 7, 11, 13, 16, 20, 24, 30, 38 and 46 features, respectively. Fig. 5 shows graphically the MAPE results for the RReliefF subsets and compares them with the original feature set FS-all and the benchmarks FS1 and the industry model. The best subsets selected by RReliefF were R46 for LR, LMS and SVR (MAPE=0.270%, 0.271% and 0.271%, respectively) and R30 for BPNN (MAPE=0.298%). For all prediction algorithms except BPNN, the MAPE improved as the number of selected features increased from 7 to 46 (with one exception for LMS). LR, LMS and SVR with 30, 38 and 46 features outperformed both the industry model and the benchmark FS1, and with 38 and 46 features these algorithms produced accuracy similar to the original feature set FS-all (51 features). Thus, for LR, LMS and SVR, RReliefF was able to produce smaller feature subsets that outperformed both the industry model and FS1 and achieved results similar to the full set FS-all. For BPNN, RReliefF was less successful: the subsets with 13, 16, 20, 24, 30, 38 and 46 features achieved better accuracy than FS1 and FS-all, but none of them outperformed the industry model.

The last feature selection method we used was MEFS, the manual expert feature selection. We chose two feature subsets, MEFS1 and MEFS2, listed in Table 4. The motivation was to select subsets with an underlying pattern that is not present in the CFS and RReliefF subsets. The main idea behind the MEFS feature sets was to use a small time window around the prediction time, on the prediction day and on some days from the past week, i.e. to utilize the daily and weekly patterns of electricity load. More specifically, MEFS2 uses the load from the lag before the prediction time on the same day and 1, 2, 3, 6 and 7 days ago, and the load at the time being predicted from 1, 2, 6 and 7 days ago. MEFS1 has the same central idea but uses a smaller number of features.
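Assuming 5-minute sampling (288 observations per day), the daily and weekly lag structure described above can be sketched as follows. The exact composition of MEFS2 in Table 4 (14 features) is not reproduced here, so the offsets below cover only the lags named in the text and are illustrative.

```python
import numpy as np

SAMPLES_PER_DAY = 288  # 5-minute intervals per day

def mefs2_style_features(load, t):
    """Illustrative MEFS2-style feature vector for predicting load[t + 1].

    Takes the load one lag before the prediction time on the same day and
    1, 2, 3, 6 and 7 days earlier, plus the load at the predicted time
    1, 2, 6 and 7 days earlier (a subset of the 14 features in Table 4).
    """
    lag_before = [0, 1, 2, 3, 6, 7]   # days back, load at lag t
    same_time = [1, 2, 6, 7]          # days back, load at lag t + 1
    feats = [load[t - d * SAMPLES_PER_DAY] for d in lag_before]
    feats += [load[t + 1 - d * SAMPLES_PER_DAY] for d in same_time]
    return np.array(feats)

# toy two-week series of synthetic 5-minute loads (MW)
n = 14 * SAMPLES_PER_DAY
series = 5000.0 + 500.0 * np.sin(2 * np.pi * np.arange(n) / SAMPLES_PER_DAY)
x = mefs2_style_features(series, t=13 * SAMPLES_PER_DAY)
print(x.shape)  # (10,)
```

Because the features are simply previous load values, no transformation of the raw series is needed before prediction.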

Fig. 5. A comparison of the MAPE performance of the RReliefF feature subsets (R7-R46) with the original set FS-all, set FS1 and the industry model

Fig. 6. A comparison of the MAPE performance of the MEFS feature subsets with the original set FS-all, set FS1 and the industry model

Fig. 6 shows graphically the MAPE results for MEFS1 and MEFS2 and compares them with the original set FS-all and the benchmarks FS1 and the industry model. In general, MEFS1 and MEFS2 achieved similar results; the bigger set MEFS2 was slightly better for all algorithms except LMS. Both MEFS1 and MEFS2 produced better accuracy than the industry model with all algorithms. This means that the industry model can be improved even while keeping the BPNN algorithm, simply by replacing the industry feature set with MEFS2; the MAPE decreases from 0.291% to 0.284%. This is unlikely to increase the computation: although the industry model uses only 9 features, they require logarithmic and difference transformations, whereas the 14 features of MEFS2 are simply previous loads and hence do not require any additional pre-processing.

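For reference, MAPE, the error measure used in all of these comparisons, is straightforward to compute; a minimal sketch:

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

# e.g. two 5-minute forecasts against actual loads (MW)
print(round(mape([8000.0, 8100.0], [8040.0, 8059.5]), 4))  # 0.5
```

Because the error is normalized by the actual load at each interval, MAPE values remain comparable across periods with different demand levels.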
Another important observation is that both MEFS1 and MEFS2 performed very similarly to the original set FS-all (and with BPNN they even outperformed it). This is significant because MEFS1 and MEFS2 use only a fraction of the features in FS-all (11 and 14 versus 51). In addition, a comparison between MEFS1 and FS1, both of which contain 11 features, shows that MEFS1 is the better subset. To conclude, MEFS was successful in reducing the number of features in FS-all from 51 to 14 while producing very similar results. Overall, the best three reduced feature subsets were: 1) R46 (46 features) with LR, LMS and SVR; 2) R38 (38 features) with the same prediction algorithms; and 3) MEFS2 (14 features) with LR and SVR. All of them achieved accuracy similar to the original and best performing set FS-all (51 features) and also outperformed the industry model and the benchmark FS1.
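The subset-evaluation procedure behind these comparisons can be approximated as below. A full RReliefF implementation is beyond this sketch, so absolute correlation with the target stands in for the relevance score; the ranking, top-k truncation and least-squares fit mirror how the R7-R46 subsets were assessed, but the data and scoring here are illustrative only.

```python
import numpy as np

def mape(y_true, y_pred):
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def eval_topk(X_tr, y_tr, X_te, y_te, scores, k):
    """Fit ordinary least squares on the k highest-scoring features
    and return the test MAPE."""
    idx = np.argsort(scores)[::-1][:k]                      # top-k indices
    A = np.column_stack([X_tr[:, idx], np.ones(len(X_tr))])  # add intercept
    w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    B = np.column_stack([X_te[:, idx], np.ones(len(X_te))])
    return mape(y_te, B @ w)

# toy data: 10 candidate features, only features 0 and 3 are informative
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = 5000.0 + 80.0 * X[:, 0] + 50.0 * X[:, 3] + rng.normal(0, 5, 400)

# stand-in relevance score: absolute correlation with the target
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(10)])
err = eval_topk(X[:300], y[:300], X[300:], y[300:], scores, k=2)
print(round(float(err), 3))
```

Sweeping k over increasing values and plotting the resulting MAPE per algorithm reproduces the shape of the Fig. 5 comparison.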

VIII. CONCLUSION

We presented new approaches for 5-minute ahead electricity load forecasting based on autocorrelation analysis. Using a 4-week sliding window, we selected 51 features corresponding to the previous load on the day of the forecast and also 1, 2, 3, 6 and 7 days before. Using this feature set with LR and SVR we achieved a forecasting error of MAPE=0.297%, an improvement of 7.56% over the industry model (based on BPNN), and also outperformed the previous best model, which uses feature set FS1 (11 features).

We then investigated whether it is possible to select a subset of the 51 features and improve, or at least maintain, the accuracy. We applied three feature selection methods: the state-of-the-art CFS and RReliefF, and our own manual expert selection (MEFS). The subset selected by CFS was inadequately small (3 features) and was not competitive in accuracy with the original set, the benchmark set FS1 or the industry model. RReliefF was successful for LR, LMS and SVR: the selected subsets with 30, 38 and 46 features outperformed both the industry model and the benchmark FS1. For example, the subset with 38 features, when used with LR and SVR, achieved MAPE=0.272% (an improvement of 6.53% over the industry model). In addition, the subsets with 38 and 46 features achieved results similar to the original feature set. The third method, MEFS, was also successful. The selected subset of 14 features, when used with LR and SVR, achieved MAPE=0.277% (an improvement of 4.81% over the industry model), outperformed the benchmark FS1 and was only slightly less accurate than the original feature set.

Overall, we found LR and SVR to be the most accurate prediction algorithms and BPNN to be the least accurate. LR was also considerably faster than SVR. Hence, the best performing prediction algorithm in terms of both accuracy and time was LR.

Future work will include extending our study to all months and evaluating a year-round prediction model. We also plan to explore the potential of ensembles of prediction algorithms [22] to improve accuracy. Another avenue for future work is expanding the forecast horizon to more than one point ahead.

REFERENCES

[1] B. F. Hobbs, S. Jitprapaikulsarn, S. Konda et al., "Analysis of the value for unit commitment of improved load forecasts," IEEE Trans. Power Systems, vol. 14, no. 4, pp. 1342-1348, 1999.
[2] J. W. Taylor, "An evaluation of methods for very short-term load forecasting using minute-by-minute British data," Int. J. Forecasting, vol. 24, pp. 645-658, 2008.
[3] D. W. Bunn, "Forecasting loads and prices in competitive power markets," Proceedings of the IEEE, vol. 88, no. 2, 2000.
[4] A. Setiawan, I. Koprinska, and V. G. Agelidis, "Very short-term electricity load demand forecasting using support vector regression," in Int. Joint Conf. on Neural Networks (IJCNN), Atlanta, USA, 2009, pp. 2888-2894.
[5] AEMO, 2009; www.aemo.com.au.
[6] H. S. Hippert, C. E. Pedreira, and R. C. Souza, "Neural networks for short-term load forecasting: a review and evaluation," IEEE Trans. Power Systems, vol. 16, no. 1, pp. 44-55, 2001.
[7] G. Gross and F. D. Galiana, "Short-term load forecasting," Proceedings of the IEEE, vol. 75, no. 12, pp. 1558-1573, 1987.
[8] E. A. Feinberg and D. Genethliou, "Load forecasting," in Applied Mathematics for Restructured Electric Power Systems: Optimization, Control and Computational Intelligence, Springer, 2005, pp. 269-285.
[9] K. Liu, S. Subbarayan, R. R. Shoults et al., "Comparison of very short-term load forecasting techniques," IEEE Trans. Power Systems, vol. 11, no. 2, pp. 877-882, 1996.
[10] W. Charytoniuk and M.-S. Chen, "Very short-term load forecasting using artificial neural networks," IEEE Trans. Power Systems, vol. 15, no. 1, pp. 263-268, 2000.
[11] P. Shamsollahi, K. W. Cheung, Q. Chen et al., "A neural network based very short term load forecaster for the interim ISO New England electricity market system," in 22nd IEEE PES Int. Conf. on Power Industry Computer Applications (PICA), 2001, pp. 217-222.
[12] D. Chen and M. York, "Neural network based very short term load prediction," in IEEE/PES Transmission and Distribution Conf. & Expo, 2008, pp. 1-9.
[13] M. A. Hall, "Correlation-based feature selection for discrete and numeric class machine learning," in 17th Int. Conf. on Machine Learning (ICML), 2000, pp. 359-366.
[14] K. Kira and L. Rendell, "A practical approach to feature selection," in 9th Int. Conf. on Machine Learning (ICML), 1992, pp. 249-256.
[15] M. R. Sikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, pp. 23-69, 2003.
[16] T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms, Springer, 2002.
[17] A. J. Smola and B. Scholkopf, "A tutorial on support vector regression," Neuro-COLT2 Technical Report NC2-TR-1998-030, 2003.
[18] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, 2005.
[19] V. Vapnik, Statistical Learning Theory, Wiley, 1998.
[20] S. K. Shevade, S. S. Keerthi, C. Bhattacharyya et al., "Improvements to the SMO algorithm for SVM regression," IEEE Trans. Neural Networks, vol. 11, no. 5, 2000.
[21] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, Wiley, 1987.
[22] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, 2004.