Research article

Received: 23 September 2015; Revised: 3 January 2016; Accepted: 1 February 2016; Published online in Wiley Online Library: 4 March 2016

(wileyonlinelibrary.com) DOI: 10.1002/cem.2792

Robust online algorithm for adaptive linear regression parameter estimation and prediction

Shekhar Sharma^a, Swanand Khare^b and Biao Huang^a*

In this article, we focus on adaptive linear regression methods and propose a new technique. The article begins with a review of the online passive aggressive algorithm (OPAA), an adaptive linear regression algorithm from the machine learning literature. We highlight the strengths and weaknesses of OPAA and compare it with other popular adaptive regression techniques such as moving window and recursive least squares, recursive partial least squares, and just-in-time or locally weighted regression. Modifications to OPAA are proposed to make it more robust and better suited for industrial soft-sensor applications. The new algorithm is called the smoothed passive aggressive algorithm (SPAA), and like OPAA, it follows a cautious parameter update strategy but is more robust. The trade-off between SPAA's computational complexity and accuracy can be easily controlled by manipulating just two tuning parameters. We also demonstrate that the SPAA framework is quite flexible and a number of variants are easily formulated. Application of SPAA to estimate the time-varying parameters of a numerically simulated autoregressive with exogenous terms (ARX) model and to predict the Reid vapor pressure of the bottoms flow from an industrial column demonstrates its superior performance over OPAA and comparable performance with the other popular algorithms. Copyright © 2016 John Wiley & Sons, Ltd.

Keywords: adaptive linear regression; robust estimation; least squares; least absolute deviation; moving window

1. INTRODUCTION


Mathematical models are implemented in a variety of industries, such as the steel, refinery, and pharmaceutical industries, for the purpose of process control, product quality estimation, or fault detection. The purpose of these models, also called soft sensors, is to estimate process variables that are either not possible to measure using hardware analyzers or are too expensive to measure [1]. In the presence of sufficient process knowledge, white box models that are based on first principles are used for these mathematical models. However, with the complexity of modern processes, this is rarely the case, and the necessary information required for white box models is generally not available. In such cases, black box modeling is a suitable alternative. Black box models are data-driven models and can be built solely on the available data. Grey-box modeling is a middle path that combines available process know-how with data-driven modeling techniques to build useful mathematical models [1]. Among the data-based approaches, global models, also called offline or batch models, based on offline training using historic data, such as artificial neural networks, principal component analysis, and other linear/nonlinear models, have been the traditional choice [2]. However, because most industrial systems have time-varying characteristics to some extent, models that are built offline are unable to cope with new operating conditions. Typical causes of such time-varying behavior can be changes in the quality of the feed, catalyst degradation, equipment wear and tear, and natural causes such as changes in weather conditions [2]. These points highlight the need for model maintenance and retraining of model parameters. Hence, equipping soft sensors with some adaptation strategy has become almost a necessity. However, adaptation or retraining of complex nonlinear models is generally not convenient. As an example, fuzzy sets, artificial neural networks, or neuro-fuzzy networks, which are some of the popular nonlinear modeling approaches, generally have complicated model structures and as a consequence are difficult to adapt [2]. The complexity and high computation load associated with such models also limit their applications. On the other hand, adaptive versions of linear models are simple and relatively easy to implement, thus making them a popular choice for soft-sensor applications. Blockwise linear least squares or moving/sliding window least squares (MWLS) and recursive least squares (RLS) are two such popular adaptive linear regression techniques. For high-dimensional and/or correlated data, moving window and recursive versions of principal component analysis and partial least squares (PLS), which are essentially linear techniques, have found wide application.

* Correspondence to: Biao Huang, Department of Chemical and Materials Engineering, University of Alberta, Edmonton, Alberta T6G 2G6, Canada. E-mail: [email protected]

a S. Sharma, B. Huang: Department of Chemical and Materials Engineering, University of Alberta, Edmonton, Alberta T6G 2G6, Canada

b S. Khare: Department of Mathematics, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India


Dayal et al. [3] have proposed a fast and exponentially weighted recursive PLS algorithm that provides improved estimates in many situations. Through case studies, it was shown that because the exponentially weighted recursive PLS algorithm can handle correlated variables and does not suffer from short data windows, it outperforms the recursive and window-based least squares algorithms. Another popular approach to building soft sensors is the just-in-time (JIT) [4] modeling approach, also known as locally weighted learning [5,6]. In JIT modeling, a local model is built only when a prediction for a query is requested; the model is then discarded after the prediction is made. The local modeling methods typically used in JIT soft sensors are locally weighted least squares (LWLS) and locally weighted PLS.

In this article, we concentrate on adaptive regression techniques for soft-sensor applications. Furthermore, among the various soft-sensor applications of online prediction, process monitoring, fault detection, and isolation, the focus is on online prediction and parameter estimation in the presence of unknown process drifts or parameter changes. Least squares cost function-based regression techniques, though widely used, are not robust. Robustness refers to the insensitivity of the estimates produced by a method to outlying or abnormal data points or model misspecifications [7]. Least squares estimates are optimal and coincide with the maximum likelihood estimate under the assumption of independent and normally distributed error terms [7–9]. However, in practical scenarios, this assumption does not always hold, and just a single outlying observation can distort the least squares estimate [9]. Therefore, robust alternative regression techniques are needed. L1, or least absolute deviation (LAD), regression and the online passive aggressive algorithm (OPAA) [10] are two such adaptive linear regression methods, although they are not as widely used. However, as will be shown later in the article, both LAD and OPAA in their current forms have certain disadvantages when it comes to industrial applications. In this article, we propose an improved OPAA algorithm called the smoothed passive aggressive algorithm (SPAA) that overcomes some of these shortcomings, making it suitable for practical implementation, though at the expense of additional computational cost. SPAA is not only robust but also follows a cautious parameter update strategy; it is not influenced by minor disturbances because it skips the parameter update step in such cases. Further, it will be shown that the proposed SPAA framework is a more general framework, and for specific values of the tuning parameters, it reduces to OPAA or a moving window version of the LAD algorithm.

The rest of the article is organized as follows. First, some of the popular adaptive regression techniques are described. Next, a comparative analysis of these techniques is carried out and the shortcomings discussed. An improved algorithm to address the shortcomings highlighted is then proposed. Finally, advantages of the proposed method are demonstrated by application to a numerical simulation and an industrial case study, followed by the concluding remarks.

2. ADAPTIVE REGRESSION


2.1. Recursive least squares

Consider the case where the relation between the output $y \in \mathbb{R}^{n\times 1}$ (the difficult-to-measure variable) and the input $X \in \mathbb{R}^{n\times m}$ (the easy-to-measure variables) can be reasonably approximated by the following linear form:

$$y = X\beta + \epsilon \qquad (1)$$

where $\epsilon$, the residual vector, consists of independent and Gaussian-distributed entries and $n$ and $m$ denote the number of samples and variables, respectively. The regression vector, $\beta$, can be estimated recursively as follows [12]:

$$P_{n+1} = P_n - \frac{P_n x_{n+1}^T x_{n+1} P_n}{1 + x_{n+1} P_n x_{n+1}^T} \qquad (2)$$

$$\hat{\beta}_{n+1} = \hat{\beta}_n + P_{n+1} x_{n+1}^T \left( y_{n+1} - x_{n+1}\hat{\beta}_n \right) \qquad (3)$$

where $P_n = (X_n^T X_n)^{-1}$. This is the recursive formulation of the linear least squares regression algorithm called RLS. It is a simple and computationally efficient technique for the cases where the regression vector, $\beta$, is a function of time. However, as the value of $n$ becomes larger, the influence of new observations decreases, and the ability to track the changes in $\beta$ is lost. To mitigate this, a forgetting factor, $\lambda$, is introduced into Eqn (2) as follows:

$$P_{n+1} = \frac{1}{\lambda}\left( P_n - \frac{P_n x_{n+1}^T x_{n+1} P_n}{\lambda + x_{n+1} P_n x_{n+1}^T} \right) \qquad (4)$$

where $\lambda \in [0, 1]$. The previous equation is the RLS algorithm with a forgetting factor and is a widely used adaptive regression technique. The role of $\lambda$ is to discount the past data and emphasize new samples. Hence, it is the bias/variance trade-off parameter, and appropriate selection of its value is critical.
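For illustration, a minimal Python sketch of one RLS step with a forgetting factor, following Eqns (2)–(4), is given below. The function name, the batch initialisation, and the synthetic data are our own choices and are not part of the original text.

```python
import numpy as np

def rls_update(beta, P, x_new, y_new, lam=1.0):
    """One recursive least squares step, Eqns (2)-(4).

    beta : (m,) current estimate of the regression vector
    P    : (m, m) current (X^T X)^{-1}-like matrix
    x_new: (m,) new row of predictors;  y_new: scalar new response
    lam  : forgetting factor in (0, 1]; lam = 1 gives plain RLS (Eqn (2))
    """
    x = x_new.reshape(-1, 1)                                   # column vector
    # Eqn (4): update P, discounting past data by lam
    P = (P - (P @ x @ x.T @ P) / (lam + x.T @ P @ x)) / lam
    # Eqn (3): correct beta in proportion to the prediction error
    beta = beta + (P @ x).ravel() * (y_new - x_new @ beta)
    return beta, P

# usage: initialise from a small batch, then update sample by sample
rng = np.random.default_rng(0)
X0, y0 = rng.normal(size=(50, 3)), rng.normal(size=50)
P = np.linalg.inv(X0.T @ X0)
beta = P @ X0.T @ y0
beta, P = rls_update(beta, P, rng.normal(size=3), 0.7, lam=0.95)
```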

2.2. Recursive partial least squares

In cases where the input dimension is large or the variables are correlated, PLS has been used as a popular alternative [2]. The aim of the PLS method is to project the data to latent variables such that the covariance between the input and output is maximized. For the data set $\{X \in \mathbb{R}^{n\times m}, Y \in \mathbb{R}^{n\times k}\}$, the PLS formulation is as follows [2]:

$$X = TP^T + E \qquad (5)$$

$$Y = UQ^T + F \qquad (6)$$

where $T \in \mathbb{R}^{n\times l}$, $U \in \mathbb{R}^{n\times l}$, $P \in \mathbb{R}^{m\times l}$, and $Q \in \mathbb{R}^{k\times l}$ are the score and loading matrices, while $E$ and $F$ are the input and output data residuals. $l\ (\le m)$ is the number of latent variables. The vectors $t$, $u$, $p$, and $q$, which make up the score and loading matrices, can be calculated sequentially by either the nonlinear iterative partial least squares (NIPALS) algorithm [2] or the kernel algorithm [13].


Ordinary least squares is one of the most popular regression techniques. However, for parameter estimation in the presence of unknown parameter changes, its adaptive versions, namely, the sliding or moving window form and RLS, have been widely used [11].

Recursive PLS and JIT, or locally weighted regression, are also popular techniques for building adaptive soft sensors. Another adaptive linear regression technique, less well known from an industrial application point of view, is OPAA. A brief overview of these algorithms is provided in the following sections.

Similar to the situation described earlier in RLS, when the data are available regularly, such as in online applications, it can be beneficial to utilize the newly available data by updating the PLS model recursively. Furthermore, if the system is time varying, recent data need to be assigned higher weight than older data. Dayal and MacGregor [13] achieved this adaptation by updating the covariance matrices used in the kernel algorithm as follows:

$$(X^T X)_t = \lambda (X^T X)_{t-1} + x_t^T x_t \qquad (7)$$

$$(X^T Y)_t = \lambda (X^T Y)_{t-1} + x_t^T y_t \qquad (8)$$

where $x_t$ and $y_t$ are the new data samples received at time $t$, $(X^T X)_t$ and $(X^T Y)_t$ represent the updated covariance matrices at time $t$, and $\lambda \in [0, 1]$ is the forgetting factor used to discount past data. This version of the PLS algorithm is called the recursive exponentially weighted PLS. Readers are referred to Dayal and MacGregor [13] for the MATLAB code.
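The covariance-matrix recursion in Eqns (7) and (8) is small enough to state directly. The sketch below shows only the update step and assumes that a separate kernel-PLS routine (not reproduced here) recomputes the latent-variable model from the updated matrices; the function name is our own.

```python
import numpy as np

def update_covariances(XtX, XtY, x_t, y_t, lam=0.98):
    """Exponentially weighted covariance update, Eqns (7)-(8).

    XtX : (m, m) running X^T X;  XtY : (m, k) running X^T Y
    x_t : (m,) new predictor sample;  y_t : (k,) new response sample
    lam : forgetting factor in [0, 1]
    """
    XtX = lam * XtX + np.outer(x_t, x_t)   # Eqn (7)
    XtY = lam * XtY + np.outer(x_t, y_t)   # Eqn (8)
    return XtX, XtY
    # a kernel PLS algorithm would then be rerun on (XtX, XtY)
```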

2.3. Moving/sliding window least squares

The moving window formulation of the ordinary least squares algorithm is another popular adaptive parameter estimation technique. In this formulation, the model parameters are estimated again when a given number, $s$ (step size), of new data samples have been collected. The number of samples used for model training is $w$, the window size [2]. The moving window formulation of the least squares estimate for the system described by Eqn (1) is written as follows:

$$\begin{bmatrix} y_{n+s-w+1} \\ y_{n+s-w+2} \\ \vdots \\ y_{n+s} \end{bmatrix} = \begin{bmatrix} x_{n+s-w+1} \\ x_{n+s-w+2} \\ \vdots \\ x_{n+s} \end{bmatrix} \beta_w + \begin{bmatrix} \epsilon_{n+s-w+1} \\ \epsilon_{n+s-w+2} \\ \vdots \\ \epsilon_{n+s} \end{bmatrix} \qquad (9)$$

Writing Eqn (9) in matrix form,

$$\hat{\beta}_w = \left( X_w^T X_w \right)^{-1} X_w^T y_w \qquad (10)$$

where $X_w$ and $y_w$ denote the latest $w$ samples and $\hat{\beta}_w$ denotes the least squares estimate on this window of $w$ points. The length of the window signifies the size of the database that is used for parameter estimation, and the step size signifies the frequency of the estimation. Together, the parameters $w$ and $s$ play a role similar to that of $\lambda$ in RLS, and selecting their values is critical, as an inappropriate setting can lead to a decrease in performance [14].
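A minimal sketch of the moving window estimate in Eqn (10) is given below; the bookkeeping of when to re-estimate (every s samples) is one possible arrangement and is not taken from the original text.

```python
import numpy as np

def mwls_estimate(X_hist, y_hist, w):
    """Eqn (10): ordinary least squares on the latest w samples only."""
    Xw, yw = X_hist[-w:], y_hist[-w:]
    beta_w, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta_w

# the estimate would be recomputed every s new samples, e.g. w = 30, s = 5
```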

2.4. Just-in-time modeling

In JIT modeling, the historical data are stored, and a local model is built only when a request for the prediction of a query variable is received. Once a query is received, the key steps involved during prediction in the JIT modeling method can be summarized as follows [15]: (1) select samples in the database that are relevant to the query based on some similarity criterion, (2) build a local model on the relevant samples thus selected, and (3) calculate the output based on the local model and query. The local model is typically discarded after the prediction has been made. Many types of models can be used under the JIT framework, but due to their simplicity and low computational load, locally linear models such as LWLS, described in the succeeding texts, or locally weighted partial least squares are preferred. Because of their local model structure, JIT models can handle nonlinear systems well [16]. To adapt to time-varying systems, on the other hand, it is necessary to modify the database, and hence, database maintenance is an important element of JIT soft sensors. In the typical approach, all new samples are stored in the database. However, over time, the database size will become too large, requiring increased memory space and computation load for similarity calculations. In an alternative approach, the moving window database approach can be utilized, wherein the database size is limited to a fixed number, called the window size, $w$. Any new sample is then stored and the oldest one removed, thus keeping the database size fixed.

2.5. Locally weighted least squares

For the database, query, and weighting matrix given as $X \in \mathbb{R}^{n\times m}$, $y \in \mathbb{R}^{n\times 1}$, $x_q \in \mathbb{R}^{1\times m}$, and $W = \mathrm{diag}(w_1, \ldots, w_N)$, the equation involved in LWLS to calculate the local regression parameters is as follows [6]:

$$\hat{\beta} = \left( X^T W X \right)^{-1} X^T W y \qquad (11)$$

Locally weighted regression has, therefore, become a popular tool for building soft sensors because of its ability to model complex nonlinear systems and track abrupt changes using simple locally linear models [17].
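The locally weighted estimate in Eqn (11) is a weighted least squares fit. The sketch below assumes the weights have already been computed from some similarity measure (the Gaussian kernel used later in Section 4 is one option); the function name is illustrative only.

```python
import numpy as np

def lwls_predict(X, y, weights, x_q):
    """Eqn (11): locally weighted least squares prediction for query x_q."""
    W = np.diag(weights)                            # W = diag(w_1, ..., w_N)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_q @ beta                               # local model is then discarded
```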

2.6. Online passive aggressive algorithm

Here, we introduce the OPAA for regression problems [10], an adaptive parameter tracking algorithm from the machine-learning literature. Similar to the scenario for the formulation of the RLS algorithm, suppose we have $n$ observations and an estimate for the regression vector in Eqn (1) as $\hat{\beta}_n$. The recursive update to $\hat{\beta}_n$ is then obtained as follows. Calculate the loss $l_{n+1}$ based on the error of the current prediction from the following loss function [10]:

$$l(\hat{\beta}_n) = \begin{cases} 0 & \text{when } |y_{n+1} - x_{n+1}\hat{\beta}_n| \le \varepsilon \\ |y_{n+1} - x_{n+1}\hat{\beta}_n| - \varepsilon & \text{otherwise} \end{cases} \qquad (12)$$

where $x_{n+1}\hat{\beta}_n$ gives the prediction, $\hat{y}_{n+1}$, corresponding to the latest predictor variable $x_{n+1}$. This loss is 0 when the predicted target deviates from the true target by less than $\varepsilon$ and otherwise grows linearly with $|y_{n+1} - \hat{y}_{n+1}|$. The threshold parameter, $\varepsilon$, is a positive real number that controls the sensitivity of the algorithm to inaccuracies in the prediction.


Next, find the new updated regression vector, $\hat{\beta}_{n+1}$, such that the loss for the current term is 0 while minimizing the distance of $\hat{\beta}_{n+1}$ from $\hat{\beta}_n$ [10]. Hence, the update to $\hat{\beta}_n$ is made as follows:

$$\hat{\beta}_{n+1} = \arg\min_{\beta} \frac{1}{2}\left\| \beta - \hat{\beta}_n \right\|_2^2 \quad \text{s.t.} \quad l(\beta) = 0 \qquad (13)$$

Using Lagrangian optimization, the closed form expression for the updated regression vector is obtained as follows:

$$\hat{\beta}_{n+1} = \hat{\beta}_n + \operatorname{sign}\left( y_{n+1} - \hat{y}_{n+1} \right) x_{n+1} \tau_{n+1} \qquad (14)$$

where $\tau_{n+1} = \dfrac{l(\hat{\beta}_n)}{\|x_{n+1}\|^2}$.

From Eqn (14), we see that the change in $\hat{\beta}_n$ is proportional to $\tau_{n+1}$. Crammer et al. [10] also introduced two variants of the update strategy called OPAA-I and OPAA-II, respectively. The only difference is in the computation of $\tau_{n+1}$, which for the two cases is as follows:

$$\text{OPAA-I:} \quad \tau_{n+1} = \min\left\{ C, \frac{l(\hat{\beta}_n)}{\|x_{n+1}\|^2} \right\} \qquad (15)$$

$$\text{OPAA-II:} \quad \tau_{n+1} = \frac{l(\hat{\beta}_n)}{\|x_{n+1}\|^2 + \frac{1}{2C}} \qquad (16)$$

where $C$ is a positive parameter that controls the aggressiveness of the update to $\hat{\beta}_n$. For very large values of $C$, OPAA-I and OPAA-II reduce to the original OPAA algorithm, while smaller $C$ values cause a less aggressive update.
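Collecting Eqns (12) and (14)–(16), one OPAA step can be written compactly as below. Packaging the three update rules behind a single variant argument is our own choice and not part of the original text.

```python
import numpy as np

def opaa_update(beta, x, y, eps, C=np.inf, variant="PA"):
    """One online passive aggressive step for regression.

    beta    : (m,) current regression vector;  x : (m,) new predictors;  y : scalar
    eps     : insensitivity threshold (Eqn (12))
    C       : aggressiveness parameter for the PA-I / PA-II variants
    variant : "PA", "PA-I" or "PA-II"
    """
    err = y - x @ beta
    loss = max(0.0, abs(err) - eps)            # Eqn (12)
    if loss == 0.0:
        return beta                            # passive: no update
    if variant == "PA":
        tau = loss / (x @ x)                   # Eqn (14)
    elif variant == "PA-I":
        tau = min(C, loss / (x @ x))           # Eqn (15)
    else:                                      # "PA-II"
        tau = loss / (x @ x + 1.0 / (2.0 * C)) # Eqn (16)
    return beta + np.sign(err) * tau * x       # aggressive update, Eqn (14)
```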

3. MOTIVATION

Before proceeding further, the following remarks about the parameter update strategies used in RLS/MWLS and OPAA are in order:

1. The form of the update to $\hat{\beta}$ in both RLS and OPAA is very similar. In both cases, the magnitude of the change to $\hat{\beta}$ is proportional to the error in prediction using the previous estimate of $\hat{\beta}$. However, one difference is readily observed. Whereas in RLS, the update occurs for each and every prediction error, and for every step size in MWLS, in the case of OPAA, the update occurs only when the prediction deviates from the target by more than the threshold, $\varepsilon$. This can be an advantage in situations where it is undesirable for minor process fluctuations to cause changes in the parameter estimates.

2. In the case of RLS and RPLS, the update term takes into account the history of the input variables in the form of the covariance matrix. In OPAA, the historical data are taken into account indirectly through the constraint on the change of the updated parameters from the previous ones.

3. The RLS and RPLS algorithms minimize the squared error cost function considering all the samples in the database (time weighted in the case of the exponentially weighted forms), and MWLS does the same for the samples in the window. OPAA, on the other hand, minimizes, within a threshold, the absolute value of the prediction error for the latest sample. Thus, the LS algorithms will be sensitive to large errors because of the squared error cost function. OPAA, due to the absolute deviation loss term and cautious update, will tend to be more robust. However, because it updates $\hat{\beta}$ based on the performance of a single sample, it is still vulnerable to arbitrary process fluctuations. All JIT models, with LWLS or locally weighted partial least squares as the local models, will be sensitive not only to outlying values in the output but also to outliers in the inputs because of the Euclidean distance-based similarity calculations. The sensitivity to input outliers, that is, leverage points, however, could be reduced by using the absolute distance for the similarity calculations.

From the aforementioned observations, we see that OPAA has certain attractive features as a predictive algorithm. However, in its current form, it is not suited for use in practical soft-sensor design. We propose improvements to OPAA and call it the smoothed passive aggressive algorithm (SPAA) for reasons that will be evident later.

3.1. Smoothed passive aggressive algorithm

It was previously highlighted that OPAA uses a single observation for its parameter update. It is a generally accepted rule that overfitting occurs when the ratio of model parameters to training data size is very high. Because, in OPAA, the regression parameter has to always fit the latest observation, it is equivalent to using a single sample for model training. It has also been pointed out by Kim et al. [16] that overfitting occurs when the bandwidth of the local model used in JIT models is very low, that is, when very few points participate in building the local model. For industrial applications, therefore, it is generally desirable to have a minimum sample size for model training to avoid overfitting and improve robustness. We deal with this issue by considering a moving window-based loss function. The new formulation based on a window reduces to the original formulation if $w$ is set to 1. The new loss term is formulated as follows:

$$l_e(\hat{\beta}_n) = \begin{cases} 0 & \text{when } \sum_{i=n-w+2}^{n+1} \left| y_i - x_i\hat{\beta}_n \right| \le sad_w + e \\ \sum_{i=n-w+2}^{n+1} \left| y_i - x_i\hat{\beta}_n \right| - \left( sad_w + e \right) & \text{otherwise} \end{cases} \qquad (17)$$

Here, $sad_w$ is the sum of the absolute deviations given by the LAD solution for the set of $w$ points contained in the window, $\{(x_{n-w+2}, y_{n-w+2}), \ldots, (x_{n+1}, y_{n+1})\}$, that is,

$$sad_w = \sum_{i=n-w+2}^{n+1} \left| y_i - x_i\beta_{lad} \right| \qquad (18)$$

$$\beta_{lad} = \arg\min_{\beta} \sum_{i=n-w+2}^{n+1} \left| y_i - x_i\beta \right| \qquad (19)$$

$e$, the threshold parameter for SPAA, is a percentage of the $sad_w$ value for the current window. The reader is reminded that the threshold $\varepsilon$ in OPAA described earlier is not calculated as a percentage, although both serve a similar purpose.
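As a concrete illustration of the windowed loss in Eqns (17)–(19), the sketch below evaluates $l_e$ for a given window. It assumes that a separate LAD solver (such as the linear program given later in Eqns (21)–(22)) has already supplied $sad_w$; the function and argument names are our own.

```python
import numpy as np

def spaa_loss(beta_hat, Xw, yw, sad_w, e_percent):
    """Windowed SPAA loss, Eqn (17).

    beta_hat : (m,) current regression vector
    Xw, yw   : the latest w samples
    sad_w    : sum of absolute deviations of the window's LAD fit, Eqn (18)
    e_percent: threshold e expressed as a percentage of sad_w
    """
    e = e_percent / 100.0 * sad_w
    sad_current = np.sum(np.abs(yw - Xw @ beta_hat))
    return max(0.0, sad_current - (sad_w + e))   # zero means no update is needed
```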

Finally, the update to $\hat{\beta}$, the regression vector, is obtained as follows:

$$\hat{\beta}_{n+1} = \arg\min_{\beta} \frac{1}{2}\left\| \beta - \hat{\beta}_n \right\|_2^2 \;\; \text{s.t.} \;\; l_e\left( \beta, X_w, y_w \right) = 0 \quad \text{or} \quad \hat{\beta}_{n+1} = \arg\min_{\beta} \frac{1}{2}\left\| \beta - \hat{\beta}_n \right\|_2^2 \;\; \text{s.t.} \;\; \sum_{i=n-w+2}^{n+1} \left| y_i - x_i\beta \right| \le sad_w + e \qquad (20)$$

Before analyzing SPAA, the solution strategy for Eqn (20) is discussed in the succeeding texts. The loss function in SPAA requires the LAD solution for the current window, which can be easily calculated by converting the optimization to a linear program [18] as follows:

$$\beta_{lad} = \arg\min_{\beta} \sum_{i=n-w+2}^{n+1} \left| y_i - x_i\beta \right| \quad \text{or} \quad \beta_{lad} = \arg\min_{\beta, u_i} \sum_{i=n-w+2}^{n+1} u_i \;\; \text{s.t.} \;\; u_i - \left| y_i - x_i\beta \right| = 0 \qquad (21)$$

The aforementioned problem can be written as the following linear program:

$$\beta_{lad} = \arg\min_{\beta, u_i} \sum_{i=n-w+2}^{n+1} u_i \quad \text{s.t.} \quad u_i \ge y_i - x_i\beta, \;\; u_i \ge -\left( y_i - x_i\beta \right) \qquad (22)$$

where the $u_i$ are artificial variables introduced to convert the problem to a linear program. The inequality constraints in Eqn (22) force each $u_i$ to equal $|y_i - x_i\beta|$ upon minimization. Once $\beta_{lad}$ is found, the $sad_w$ required in evaluating the loss function in SPAA can be obtained.
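A minimal sketch of the window LAD fit of Eqns (21)–(22), using a general-purpose linear programming solver, is given below; the particular solver and the function name are our own choices made for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(Xw, yw):
    """Window LAD fit via the linear program of Eqns (21)-(22).

    Decision variables are [beta (m values), u (w values)]; only the u_i
    enter the objective.  Returns beta_lad and sad_w (Eqn (18)).
    """
    w, m = Xw.shape
    c = np.concatenate([np.zeros(m), np.ones(w)])      # minimise sum(u_i)
    I = np.eye(w)
    #  y_i - x_i beta <= u_i   and   -(y_i - x_i beta) <= u_i
    A_ub = np.vstack([np.hstack([-Xw, -I]),
                      np.hstack([ Xw, -I])])
    b_ub = np.concatenate([-yw, yw])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * m + [(0, None)] * w)
    beta_lad = res.x[:m]
    sad_w = np.sum(np.abs(yw - Xw @ beta_lad))
    return beta_lad, sad_w
```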

Finally, the overall optimization in Eqn (20) can be reduced to a set of linear inequality constraints with a quadratic objective function. Similar to Eqn (22) previously, we write Eqn (20) as follows:

$$\hat{\beta}_{n+1} = \arg\min_{\beta} \frac{1}{2}\left\| \beta - \hat{\beta}_n \right\|_2^2 \quad \text{subject to} \quad u_i \ge y_i - x_i\beta, \;\; u_i \ge -\left( y_i - x_i\beta \right), \;\; \sum_{i=n-w+2}^{n+1} u_i \le sad_w + e \qquad (23)$$

The variables for this optimization are the $u_i$ and $\beta$. Because we already know that there exists a $\beta = \beta_{lad}$ that leads to $\sum_{i=n-w+2}^{n+1} |y_i - x_i\beta_{lad}| = sad_w$, it is guaranteed that we can always find a $\beta$ that will satisfy the inequalities in Eqn (23). As $e \ge 0$, we have the following:

$$sad_w \le \sum_i \left| y_i - x_i\beta \right| \le \sum_i u_i \le sad_w + e$$

However, unlike Eqn (22), Eqn (23) does not always force each $u_i$ to equal $|y_i - x_i\beta|$, but the original requirements as per Eqn (20) are always met. The inequality constraints ensure that the new total deviation, $\sum_i |y_i - x_i\beta|$, is within the acceptable relaxed limit of $(sad_w + e)$. The use of the threshold parameter, $e$, causes the existence of multiple solutions for $\hat{\beta}$ and the $u_i$ satisfying the constraints. The objective function then causes the selection of that $\hat{\beta}$ which is closest to $\hat{\beta}_n$, hence giving a unique solution. When $e$ is set to 0, $\hat{\beta}$ equals $\beta_{lad}$. Depending on whether $\beta_{lad}$ is unique or not (there are situations where $\beta_{lad}$ may be non-unique), the objective function then becomes inactive or active. Hence, the formulation in Eqn (20), among other things, always ensures a unique update. We see that, given some initial estimate, $\hat{\beta}$ is learnt adaptively from the data. There are two parameters, $w$ and $e$, the window size and the threshold, respectively, that are needed for the update equation. These can be learnt offline by minimizing the prediction error on the training data set. The significance of these parameters will be discussed in the following section.

3.2. Analysis

Let us now perform a comparative analysis of the proposed SPAA algorithm with the original form and with the RLS and MWLS algorithms. It is also interesting to note that SPAA with a window size of 1 reduces to OPAA, and with $e$ set to 0%, it reduces to a moving window LAD algorithm.

3.2.1. Smoothed passive aggressive algorithm and online passive aggressive algorithm

The OPAA algorithm, as shown in Eqn (13), contains two terms in the update expression for the regression vector: the left-hand/objective term and the right-hand/constraint term. The constraint term defines a feasible space for $\hat{\beta}$. This space contains $\hat{\beta}$ values, all of which satisfy the required performance in prediction accuracy. In OPAA, this requirement in prediction accuracy is measured by the ability of $\hat{\beta}$ to correctly predict the latest response variable, that is, $y_{n+1}$. From this space of possible $\hat{\beta}$ values, the one closest to its previous value is picked. The sensitivity/threshold parameter, $\varepsilon$, can be considered as a relaxation to the requirement in prediction accuracy, and its value will depend on the particular application. For example, when an appropriate value of $\varepsilon$ is selected, noise or minor process fluctuations will not cause any undesirable changes in the model parameters. This happens because $\hat{\beta}_n$ correctly predicts $y_{n+1}$ within the desired threshold $\varepsilon$ and forms a part of the feasible space for $\hat{\beta}$. The objective is then minimized for the case when $\hat{\beta}_n$ remains unchanged, that is, $\hat{\beta}_{n+1} = \hat{\beta}_n$ and no update occurs. Only when large changes in the system occur will the decrease in the accuracy of $\hat{\beta}_n$ be large enough to cause a change in its value. The optimum value of $\varepsilon$ can be learnt by minimizing the error on the training data. The key point to note here is that the model update is based only on the most recent sample. This is true in the case of the two variants, OPAA-I and OPAA-II, as well. Hence, a single abnormal data sample can have a large impact on the performance. In contrast to OPAA, the prediction performance over the latest $w$ observations is considered for parameter updates in SPAA. Generally, in practical applications, model training is always carried out over a minimum number of samples to avoid issues like overfitting and poor robustness. From Eqn (20), we can see that this performance is measured by the sum of the absolute deviations in predicting the points in a window. We can consider OPAA as a specific case of SPAA with $w$ as 1. The use of a window instead of a single sample to assess the performance of the model makes it more robust to arbitrary fluctuations in the system. It acts in a way similar to the forgetting factor used in RLS, and its selection is a trade-off between prediction bias and prediction variance, or model robustness and model adaptability.


Models with a larger window size will be more robust and slower to adapt, whereas those with a smaller window size will be less robust and faster to adapt. Hence, SPAA with an appropriately selected window size performs better than OPAA or either of its variants, OPAA-I or OPAA-II. This will be demonstrated in the applications section later on. The downside of using a window formulation is that the closed form update expression of the original algorithm is lost and computational complexity is increased. Although using a window will lead to robustness, SPAA without the sensitivity parameter, $e$, will still be subjected to updates caused by noise or minor disturbances. This is evident from Eqn (22), because the $sad_w$ for every window will almost always be different. The use of $e$, as of $\varepsilon$ in OPAA, causes the loss term to become 0 for minor fluctuations and avoids unnecessary updates. Again, offline optimization on the training data is one way to select the $w$ and $e$ values. Expert process knowledge may be used to restrict $w$ within a meaningful range. For values of $e$ larger than the optimal one, the loss term will be 0 more frequently and hence will lead to fewer updates, which means lower computation because the optimization in Eqn (20) is no longer required. This will be reflected in a decreased performance in terms of R and root mean square error of prediction (rmsep) values. However, because of the window formulation, the algorithm would still be able to capture a trend change, though with a slight delay and bias depending on $e$. Hence, the SPAA formulation gives a general framework where the $w$ and $e$ values can be tweaked as per the application at hand. For example, in situations where capturing the overall trend with increased robustness is preferred rather than exactly predicting the reference values (possibly laboratory values in the case of soft-sensor applications), an $e$ value greater than the optimal one could be chosen. This trade-off between computation, robustness, and accuracy can be tuned to match the requirements.

3.2.2. SPAA, RLS, MWLS, RPLS, and LWLS

As will be shown in the next section, the LAD estimate is the maximum likelihood estimate when the residuals are assumed to follow a Laplace distribution. All other methods mentioned previously, that is, RLS, MWLS, RPLS, and LWLS, minimize a cost function based on the sum of the squares of the errors. The least squares estimates so obtained are therefore based on the residuals following a Gaussian distribution. The Laplace distribution has a thicker tail than the Gaussian distribution and hence assigns a larger probability to bigger residuals. The LAD estimate, and consequently the SPAA algorithm based on it, is therefore more robust to outlying values in the output than the least squares algorithms. Furthermore, the regression parameter vector is updated for every new sample in the case of RLS and RPLS and for every update step in the case of MWLS. In the case of JIT, typically a new local model is built for every new query. SPAA, on the other hand, can be said to follow a smart update because it updates the regression parameters only when the average performance over the latest $w$ points is unsatisfactory. Also, because SPAA uses the absolute deviation loss function and does a cautious update, it is to be expected that the SPAA algorithm will have a smaller number of predictions with high relative error values compared with the other algorithms. On the computation side, however, SPAA has a higher complexity.

3.2.3. Smoothed passive aggressive algorithm and least absolute deviation

The smoothed passive aggressive algorithm can also be seen as a modified implementation of the LAD solution. The LAD estimate is the maximum likelihood estimate when the residuals are assumed to follow a Laplace (or double exponential) distribution [19]. Consider a linear system similar to Eqn (1) where the errors are independent and identically distributed according to the zero-mean Laplace distribution:

$$y = X\beta + \epsilon, \quad \text{where } \epsilon_i \sim \mathcal{L}(0, b) \qquad (24)$$

The density function of the errors is given as follows:

$$f(\epsilon_i) = \frac{1}{2b}\exp\left( -\frac{|\epsilon_i|}{b} \right) \qquad (25)$$

where $b$ is the scale parameter. Writing the likelihood function, we have the following:

$$l_{\beta,b} = f\left( y, X \mid \beta, b \right) = f\left( y \mid X, \beta, b \right) f\left( X \mid \beta, b \right) \qquad (26)$$

Considering that the predictor variables and the regression parameters are independent of each other, the last term in the aforementioned equation is a constant and can be denoted as $c$.

$$l_{\beta,b} = c\, f\left( y_1, y_2, \ldots, y_n \mid X, \beta, b \right) = \frac{c}{(2b)^n}\exp\left( -\frac{\sum |y_i - x_i\beta|}{b} \right) \qquad (27)$$

Finally, the log-likelihood function is given as follows:

$$L_{\beta,b} = \ln l_{\beta,b} = -\frac{\sum |y_i - x_i\beta|}{b} + \ln c - n\ln 2b \qquad (28)$$

Thus, maximizing the likelihood is equivalent to minimizing $\sum |y_i - x_i\beta|$, which is the LAD solution. Because the Laplace distribution has fatter tails compared with the Gaussian distribution, the LAD solution is more robust to outliers in the output compared with the least squares solution. However, the LAD method also has certain characteristics that make it unsuitable for implementation, the primary one of which is the non-uniqueness of the solution, together with the computational complexity compared with least squares [20]. In the case of the least squares solution, the condition required for uniqueness is for the input matrix to have full column rank. However, there is no particular defined condition that can guarantee a unique LAD solution [21]. Furthermore, the simplex method, which is typically used to solve linear programs, becomes slow for a large number of observations. However, up to a few hundred observations, LAD regression is competitive with LS [22]. The proposed algorithm, SPAA, is able to retain the desirable characteristics of the LAD solution while tackling the aforementioned disadvantages. Although the size of the window, $w$, will vary from application to application, it will rarely be large enough for computation to become a factor in its implementation. Furthermore, the form of the optimization in Eqn (20) inherently deals with the issue of multiple solutions by selecting the regression vector with the least distance from its previous value.

3.3. Smoothed passive aggressive algorithm variants

The SPAA framework is quite flexible, and a number of variations are possible within it. These variations come at almost no increase in computational complexity. When to use a particular variant depends on the application and the availability of expert process knowledge. As mentioned earlier, like OPAA, the SPAA formulation consists of two components, the objective term that is to be minimized and the constraint term. The variations in these two terms are now discussed, respectively.

3.3.1. Objective function variations

There are numerous ways to describe closeness in mathematical terms besides the one used for $\hat{\beta}$ in OPAA. We discuss some alternatives in the succeeding texts.

(a) Sum of squares of fractional change. Instead of minimizing $\frac{1}{2}\left\| \beta - \hat{\beta}_n \right\|_2^2$ in the objective, one could minimize $\frac{1}{2}\sum_j \left( \frac{\beta_j - \hat{\beta}_{n,j}}{\hat{\beta}_{n,j}} \right)^2$, that is, the sum of squares of the fractional changes in the regression parameters.

If the update in $\hat{\beta}$ is brought about by a disturbance or outlier, the former should perform better, whereas for a system change, the latter should be better. This is because in the case of a disturbance/outlier, one would expect the parameter to return to normal, and the error will be minimized when the change in the regression coefficients (and not the fractional change) due to the disturbance is minimal. On the other hand, a system change implies a new value of $\hat{\beta}$. In this case, among all $\hat{\beta}$ that satisfy the constraint, selecting the one that minimizes the relative change in regression coefficients makes more sense. In applications where regression coefficients indicate a physical relationship between response and explanatory variables, it is more reasonable to accept, say, a 5% change in a coefficient with a large magnitude than to accept a 100% change in one with a much smaller magnitude. Selecting among the two depends on the expected noise/disturbance and the trend changes anticipated in the system, and use of expert/prior knowledge, if available, is recommended.

(b) L1 norm. Here, instead of minimizing the L2 norm in the objective, one could minimize the L1 norm. Hence, the update to $\hat{\beta}$ now becomes the following:

$$\hat{\beta}_{n+1} = \arg\min_{\beta} \left| \beta - \hat{\beta}_n \right|_1 \quad \text{s.t.} \quad \sum_{i=n-w+2}^{n+1} \left| y_i - x_i\beta \right| \le sad_w + e \qquad (29)$$

$|\beta - \hat{\beta}_n|_1$ is the L1 norm of the vector $(\beta - \hat{\beta}_n)$ and is the same as $\sum_{j=1}^m |\beta_j - \hat{\beta}_{n,j}|$, where $m$ is the dimension of $\hat{\beta}$. Without going into details, we directly write the form of the final update expression. The reader can verify this using the techniques employed previously.

$$\hat{\beta}_{n+1} = \arg\min_{\beta, z_j, u_i} \sum_{j=1}^m z_j \quad \text{s.t.} \quad z_j \ge \beta_j - \hat{\beta}_{n,j}, \;\; z_j \ge -\left( \beta_j - \hat{\beta}_{n,j} \right), \;\; u_i \ge y_i - x_i\beta, \;\; u_i \ge -\left( y_i - x_i\beta \right), \;\; \sum_{i=n-w+2}^{n+1} u_i \le sad_w + e \qquad (30)$$

where the $z_j$ and $u_i$ are artificial variables introduced for conversion into a linear program. The consequence of using the L1 instead of the L2 norm is that, because of the sparseness property of the L1 norm, an update with changes in fewer of the regression coefficients will be preferred, whereas with the L2 norm, the update will almost always be one where all the coefficients change to some extent. Also, because the prediction is a linear function of $\hat{\beta}$, using the L1 norm will cause less deviation from the previous $\hat{\beta}$ value. Again, this may or may not be desirable depending on whether the update is caused by a disturbance or a process change, and use of expert judgment will be required to select among the two.

3.3.2. Constraint term variations

Lastly, as pointed out by Fisher [23], the use of LAD for regression is very flexible. Because the SPAA algorithm, in its constraint/loss term, uses a slightly modified version of the absolute cost function, it also inherits the same flexibility. A number of constraints can be easily incorporated within the existing optimization problem without any significant increase in computational requirement. For example, if it is required to weigh the data unequally, this can be performed by simply transforming the data as $(\tilde{x}_i, \tilde{y}_i) = w_i(x_i, y_i)$ [24]. The weights, $w_i$, could be time weights [25], or they could be a robust measure of the distances in the input space [24]. Also, in the case of industrial applications, use of prior or expert knowledge regarding the system can be implemented easily. The regression coefficients in the case of real-world systems quantify the physical relationship between the response and explanatory variables. Hence, suppose that based on experience or historical evidence from similar scenarios elsewhere, one wanted to restrict the fractional change in any particular regression coefficient, $j$, to a value $p$; this can be very easily incorporated into the existing framework by adding the additional inequality constraint as follows:

$$\beta_j \le (1 + p)\hat{\beta}_{n,j} \quad \text{and} \quad \beta_j \ge (1 - p)\hat{\beta}_{n,j} \qquad (31)$$
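As a hedged illustration only: assuming the SLSQP-based update sketched earlier in Section 3.1, the bound in Eqn (31) can be appended to its constraint list as shown below. The helper name and its arguments are hypothetical, and the bounds are written for a positive coefficient.

```python
def fractional_change_bounds(cons, beta_n, j, p):
    """Append Eqn (31): (1 - p)*beta_n[j] <= beta_j <= (1 + p)*beta_n[j],
    assuming beta_n[j] > 0, to an existing SLSQP constraint list."""
    cons.append({"type": "ineq", "fun": lambda z: (1 + p) * beta_n[j] - z[j]})
    cons.append({"type": "ineq", "fun": lambda z: z[j] - (1 - p) * beta_n[j]})
    return cons
```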

4. APPLICATIONS

In this section, the SPAA algorithm is tested on a numerically simulated data set and an industrial data set, respectively, and its performance is compared with OPAA, RLS, MWLS, RPLS, and JIT (with LWLS as the local model). Because time-varying systems are considered, two database strategies are employed in the JIT model for comparison. In the first case, every new query is stored in the database, and in the second case, a moving window database is maintained. The two JIT models are denoted as JITdb1 and JITdb2, respectively. The results are consistent with the observations and comments made earlier and clearly bring out the advantages of SPAA. The correlation coefficient, R, and the rmsep defined in Eqn (32) are used to quantify the performance of the various algorithms.

$$\text{rmsep} = \sqrt{\frac{\sum_{i=1}^{n_s}\left( \hat{y}_i - y_i \right)^2}{n_s}} \qquad (32)$$

where $\hat{y}_i$ is the prediction corresponding to the true value $y_i$ and $n_s$ is the number of test samples.


For JIT, the Euclidean distance $d_i$ and the Gaussian kernel [5] are used to calculate the similarity between the query, $x_q$, and the database samples, $x_i$, and the sample weights, $w_i$, respectively, as follows:

$$d_i = \sqrt{\left( x_i - x_q \right)\left( x_i - x_q \right)^T} \qquad (33)$$

$$w_i = e^{-d_i/\sigma^2} \qquad (34)$$

where $\sigma$ is the smoothing parameter or bandwidth parameter. For the case studies, all algorithms are trained on the training data set and the optimum parameters ($w$, $\varepsilon$, $C$, $e$, $\lambda$, and $\sigma$) are selected by minimizing the rmsep on it.

4.1. Numerical simulation

A single input–single output parameter-varying ARX model is simulated as follows:

$$y_t = a_0 + a_1 y_{t-1} + a_2 u_t + a_3 u_{t-1} + \epsilon_t \qquad (35)$$

where $u_t$, the input, is a uniformly distributed random number in the interval [−2, +2], $a_0$, $a_1$, $a_2$, and $a_3$ are the ARX model parameters, and $\epsilon_t$ is zero-mean white noise. For the training set, a total of 600 samples are generated, and the ARX parameters are varied with a combination of ramp and step changes as shown in Table I (all ramp changes are linear). The least squares solution for the first 20 points is used as the initial value of the ARX parameters. Next, for the test set, two cases are considered (Table II). (i) Case I: The same equation, Eqn (35), used for the training set is used to generate the test set. (ii) Case II: The sequence in which the three ARX models appear in the training set is changed for the test set generation in this case. The parameter variation for case II, as shown in Table II, is the same as in Table I but with rows 1 and 2 interchanged.
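A minimal sketch of how such a data set can be generated from Eqn (35) and a parameter schedule like Table I is given below; the noise level and the segment construction are our own choices since the text does not specify them.

```python
import numpy as np

def simulate_arx(params, noise_std=0.1, seed=0):
    """Simulate y_t = a0 + a1*y_{t-1} + a2*u_t + a3*u_{t-1} + eps_t (Eqn (35)).

    params : (N, 4) array with one row [a0, a1, a2, a3] per time step,
             e.g. built from the ramp/step schedule of Table I.
    """
    rng = np.random.default_rng(seed)
    N = len(params)
    u = rng.uniform(-2.0, 2.0, size=N)           # input on [-2, +2]
    y = np.zeros(N)
    for t in range(1, N):
        a0, a1, a2, a3 = params[t]
        y[t] = (a0 + a1 * y[t - 1] + a2 * u[t] + a3 * u[t - 1]
                + rng.normal(0.0, noise_std))
    return u, y

# e.g. first segment of Table I: a0 ramps from -1 to 0, a2 from +1 to +0.25
seg1 = np.column_stack([np.linspace(-1.0, 0.0, 200),
                        np.full(200, 0.50),
                        np.linspace(1.0, 0.25, 200),
                        np.full(200, 0.25)])
```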

In both test cases, outliers are added to the output, $y$. It is then observed how the different algorithms perform in the presence of extreme values. For the test cases, the results are reported based on comparison of the predicted output with the outlier-free output. Figure 1 presents the scatter plot of the outputs with outliers against the clean data. Before the results are presented, Figures 2 and 3 illustrate the effects of the parameters $e$ and $w$. From Figures 2 and 3, one can see that $w$ and $e$ affect parameter estimation, and hence prediction, differently.

Table I. ARX model parameter variation: training set.

Time, t   | a0            | a1    | a2             | a3
001:200   | −1.00 → 0.00  | +0.50 | +1.00 → +0.25  | +0.25
201:400   | 0.00          | −0.50 | +0.25          | +1.00
401:600   | 0.00 → +1.00  | +0.75 | +0.25 → −0.75  | −0.50

Table II. ARX model parameter variation: test set II.

Time, t   | a0             | a1    | a2             | a3
001:200   | 0.00           | −0.50 | +0.25          | +1.00
201:400   | 0.00 → −1.00   | +0.50 | +0.25 → +1.00  | +0.25
401:600   | −1.00 → +1.00  | +0.75 | +1.00 → −0.75  | −0.50


Figure 1. Contaminated versus clean output for test cases I and II.


Figure 2. Effect of w on estimating a2 , for fixed e(= 0).

Figure 3. Effect of e on estimating a2 , for fixed w(= 10).


While both have a smoothing effect on the estimation and prediction, an increase in $w$ causes an increasing influence of past samples on the current estimation. Thus, a large window leads to a more robust but less adaptive estimation. Once a trend change occurs, a large window size leads to a bias in the estimation. However, this bias is removed when the sliding window moves forward and leaves all past samples behind. This role is quite similar to that of $\lambda$ in RLS and RPLS. On the other hand, an increase in $e$ results in fewer parameter updates, but it also causes a delay in detecting a trend change. As soon as a trend change is detected, by activation of the loss function, the estimation of the updated parameters is based only on the current window. The bias is caused by the presence of the threshold because, according to the formulation, the updated parameter does not have to fit the LAD solution for the current window exactly. Tables III and IV show the performance of the various algorithms for test cases I and II. As discussed earlier, SPAA with $e$ = 0 is effectively a moving window LAD algorithm. The rmsep values of SPAA ($e$ = 0) for test cases I and II are 0.658 and 0.697, respectively, which are higher than those of SPAA ($e$ = 20) for the corresponding cases.


For the variants of OPAA, rmsep values of 0.912/0.944 for OPAA-I and 1.076/1.166 for OPAA-II are obtained for test cases I and II. Although the OPAA variants show improvement, the performance is still much less accurate than that of the other algorithms, because the update is based on a single observation as pointed out earlier. Another observation is that more than one combination of $C$ and $\varepsilon$ can lead to the least rmsep on the training set. In such cases, it is difficult to select the appropriate set of parameters for OPAA-I/OPAA-II. For the JIT models, we observe that the prediction accuracy of JITdb1 is not good and is only slightly better than that of OPAA. JITdb2, on the other hand, performs better but gives the same result as MWLS. This is because the system under study is a linear time-varying one. In JITdb1, because all samples are stored in the database, old samples that are no longer relevant also participate in the local modeling and deteriorate the prediction. For JITdb2, offline training results in a $\sigma$ value of 0, indicating the linearity of the system, and the optimum window of 30 reduces its performance to be the same as the MWLS algorithm. The rmsep values for the parameters in the case of the JIT models are not presented because they are essentially applied for prediction applications and not parameter estimation, as there is no single global model that is built or maintained. In the case of RPLS, offline validation results in selection of three latent factors (the dimension of the input variable) and a forgetting factor of 0.94, similar to that of RLS. This is because the input variables have very low correlation and consequently, the RPLS reduces to the RLS algorithm and gives a similar performance. Obviously, the JIT and RPLS algorithms are suited for nonlinear static systems and linear time-varying systems with correlated variables, respectively. Because both these algorithms utilize the least squares cost function, they are not robust and therefore are unable to provide improvement over the RLS and MWLS algorithms in the presence of outliers, as is clearly indicated by the results presented. The computation load of JIT modeling methods is dependent on the local model complexity and the database maintenance strategy employed. However, for the case studies considered here, we point out that the computation load of SPAA is, comparatively, heavier than that of the other algorithms, including JIT. However, it is, by itself, not a limiting factor as far as application is concerned, and the gain in accuracy and robustness against outliers far outweighs the minor computational disadvantage. Also, the greater the number/degree of outlying values, the greater will be the difference in performance between SPAA and the other algorithms. Therefore, for these same numerical simulations, when the degree/number of outlying values added to the output was increased, the difference between the performance of SPAA and the other algorithms also increased. The results for only two cases (cases I and II) were presented here to avoid repetition and convey the core advantage.


Table III. Results: numerical simulation – case I.

Model                                  | λ/ε/e/w/l/σ     | R     | rmsep (y) | Updates | rmsep (a0/a1/a2/a3)
Recursive least squares                | 0.94/-/-/-/-/-  | 0.941 | 0.695     | 580     | 0.38/0.33/0.19/0.30
Recursive partial least squares        | 0.94/-/-/-/3/-  | 0.941 | 0.695     | 580     | 0.38/0.33/0.19/0.30
Moving/sliding window least squares    | -/-/-/30/-/-    | 0.935 | 0.731     | 580     | 0.44/0.33/0.22/0.32
JITdb1                                 | -/-/-/-/-/0.2   | 0.851 | 1.153     | –       | -/-/-/-
JITdb2                                 | -/-/-/30/-/0.0  | 0.935 | 0.731     | –       | -/-/-/-
Smoothed passive aggressive algorithm  | -/-/20/20/-/-   | 0.947 | 0.648     | 125     | 0.28/0.30/0.16/0.29
Online passive aggressive algorithm    | -/0.8/-/-/-/-   | 0.845 | 1.223     | 301     | 0.51/0.42/0.54/0.50

JIT, just-in-time; rmsep, root mean square error of prediction.

Table IV. Results: numerical simulation – case II.

Model                                  | λ/ε/e/w/l/σ     | R     | rmsep (y) | Updates | rmsep (a0/a1/a2/a3)
Recursive least squares                | 0.94/-/-/-/-/-  | 0.903 | 0.757     | 580     | 0.34/0.38/0.28/0.29
RPLS                                   | 0.94/-/-/-/3/-  | 0.903 | 0.757     | 580     | 0.34/0.38/0.28/0.29
Moving/sliding window least squares    | -/-/-/30/-/-    | 0.892 | 0.802     | 580     | 0.51/0.44/0.29/0.35
JITdb1                                 | -/-/-/-/-/0.2   | 0.738 | 1.184     | –       | -/-/-/-
JITdb2                                 | -/-/-/30/-/0.0  | 0.892 | 0.802     | –       | -/-/-/-
Smoothed passive aggressive algorithm  | -/-/20/20/-/-   | 0.926 | 0.657     | 126     | 0.29/0.27/0.21/0.21
Online passive aggressive algorithm    | -/0.8/-/-/-/-   | 0.752 | 1.375     | 306     | 0.95/0.77/0.53/0.54

JIT, just-in-time; rmsep, root mean square error of prediction.


The results clearly show that SPAA performs the best among all the algorithms. It is also interesting to note that the number of updates in the SPAA algorithm for both cases is significantly lower than the rest, including OPAA-I and OPAA-II. Figures 4–11 show the tracking performance of RLS, MWLS, SPAA, OPAA, and its variants for the ARX parameters for both test cases. These figures demonstrate more clearly why the SPAA algorithm performs as it does. For clarity, the comparison of SPAA with RLS, MWLS, OPAA, and OPAA-I is shown separately (the performance of OPAA-II is quite similar to OPAA-I and is not included in the figures). Based on the figures, it can be seen that the performance of SPAA is much more robust than that of the other algorithms. However, SPAA also leads to a bias in the parameter estimation. This is possibly the reason why the difference in performance is not larger in terms of the R or rmsep values. However, for industrial applications, it is generally not necessary to track the reference (or laboratory) values exactly; merely tracking the trend of the output is satisfactory. This is because the reference values (such as laboratory data) can have occasional large errors. It is also evident that, compared with the other algorithms, the variance of the parameter estimates in SPAA is quite low. In this regard, SPAA's advantage is brought out more clearly.

4.2. Industrial case study

In this section, an industrial data set from an oil sands processing plant located in Alberta, Canada, is used to assess the performance of SPAA. The output is the Reid vapor pressure of the bottoms of a de-propanizer column, which is part of the upgrading unit of an oil sands processing plant.


Figure 4. a0 tracking performance, smoothed passive aggressive algorithm (SPAA) versus recursive least squares (RLS) and moving/sliding window least squares (MWLS) for test cases I and II (left to right, respectively). - True, .. RLS, - MWLS, and - - SPAA.

Figure 5. a1 tracking performance, smoothed passive aggressive algorithm (SPAA) versus recursive least squares (RLS) and moving/sliding window least squares (MWLS) for test cases I and II (left to right, respectively). - True, .. RLS, - MWLS, and - - SPAA.


Figure 6. a2 tracking performance, smoothed passive aggressive algorithm (SPAA) versus recursive least squares (RLS) and moving/sliding window least squares (MWLS) for test cases I and II (left to right, respectively). - True, .. RLS, - MWLS, and - - SPAA.


Figure 7. a3 tracking performance, smoothed passive aggressive algorithm (SPAA) versus recursive least squares (RLS) and moving/sliding window least squares (MWLS) for test cases I and II (left to right, respectively). - True, .. RLS, - MWLS, and - - SPAA.

Figure 8. a0 tracking performance, smoothed passive aggressive algorithm (SPAA) versus online passive aggressive algorithm (OPAA) and OPAA-I for test cases I and II (left to right, respectively). - True, - OPAA, .. OPAA-I, and - - SPAA.


Figure 9. a1 tracking performance, smoothed passive aggressive algorithm (SPAA) versus online passive aggressive algorithm (OPAA) and OPAA-I for test cases I and II (left to right, respectively). - True, - OPAA, .. OPAA-I, and - - SPAA.


Figure 10. a2 tracking performance, smoothed passive aggressive algorithm (SPAA) versus online passive aggressive algorithm (OPAA) and OPAA-I for test cases I and II (left to right, respectively). - True, - OPAA, .. OPAA-I, and - - SPAA.

Figure 11. a3 tracking performance, smoothed passive aggressive algorithm (SPAA) versus online passive aggressive algorithm (OPAA) and OPAA-I for test cases I and II (left to right, respectively). - True, - OPAA, .. OPAA-I, and - - SPAA.


Figure 12. Reid vapor pressure (RVP) normalized values, test set.


The inputs used are the flowrate, temperature, and pressure associated with the de-propanizer column. Due to proprietary reasons, further details are not given regarding the process, and normalized values are used for the inputs and output. The system is approximated by the following linear model:

$$y = X\beta \qquad (36)$$

Because the system is considered time varying, random samples are not used to form the training and test sets. Instead, the first 700 samples form the training set, and the next 350 samples form the test set. The least squares solution based on the first 50 points is used as the initial estimate for $\beta$. Figure 12 shows the plot of the normalized Reid vapor pressure values for the test set. The test set contains potential outliers based on the three-sigma edit rules [26], because the explanatory variables for the corresponding points in the test set are within the normal operating range. Table V shows the performance of the algorithms on the test set. While calculating the R and rmsep values, the potential outliers were not included in the calculation in order to check the performance with respect to the outlier-free data.

SPAA performs better than RLS, RPLS, MWLS, and JIT in the test case in the presence of potential outliers. Although there is good correlation between the input variables, $\{\rho_{X_1,X_2} = -0.40,\ \rho_{X_1,X_3} = 0.70,\ \rho_{X_2,X_3} = -0.15\}$, RPLS does not lead to any improvement over RLS and gives a similar result because offline training causes selection of the maximum number of latent factors (= 3). Both JITdb1 and JITdb2 are unable to match the prediction accuracy of SPAA and in fact have the lowest performance. A section of the prediction trend is displayed in Figure 13. The JIT model predictions are not included to keep the figure easy to view. It shows that SPAA is able to track the reference trend better in comparison with the other methods. The number of updates to $\beta$ in SPAA is also lower than the other methods. This is also reflected in a smaller variance in prediction for SPAA as compared with any of RLS, RPLS, MWLS, JITdb1, or JITdb2, which is 0.167, 0.173, 0.187, 0.286, and 0.186 in that order. The number of updates in SPAA can be further controlled by appropriately tuning the value of the sensitivity parameter, $e$. The effect of the sensitivity parameter is illustrated by Figure 14, which displays the parameter estimates for the regression vector of the linear model in Eqn (36). Although the chosen value of $e$ is small at 1.5%, it is still significant because it is a percentage value and the window size of 30 is large.

Table V. Results: industrial case study.

Model                                  | λ    | e   | w  | l | σ   | R     | rmsep | Updates
Recursive least squares                | 0.95 | –   | –  | – | –   | 0.597 | 0.414 | 348
RPLS                                   | 0.95 | –   | –  | 3 | –   | 0.597 | 0.414 | 348
Moving/sliding window least squares    | –    | –   | 30 | – | –   | 0.571 | 0.432 | 348
JITdb1                                 | –    | –   | –  | – | 0.8 | 0.403 | 0.618 | –
JITdb2                                 | –    | –   | 75 | – | 0.1 | 0.446 | 0.494 | –
Smoothed passive aggressive algorithm  | –    | 1.5 | 30 | – | –   | 0.622 | 0.401 | 207

JIT, just-in-time; rmsep, root mean square error of prediction.

Figure 13. Prediction comparison on a section of test data. SPAA, smoothed passive aggressive algorithm; RLS, recursive least squares; MWLS, moving/sliding window least squares.


Figure 14. Clockwise from top left, parameter estimates for the constant term, flowrate, temperature, and pressure, respectively. - Smoothed passive aggressive algorithm (SPAA) (e = 0) and - SPAA (e = 1.5).

5. CONCLUSION

In this study, a new method called SPAA, which improves upon an existing adaptive linear regression algorithm, OPAA, to make it more robust and suitable for practical applications, has been proposed. Compared with OPAA, RLS, RPLS, MWLS, and JIT or locally weighted modeling, SPAA is more robust and follows a cautious parameter update strategy. Also, the SPAA framework is a more general one, and OPAA and moving window LAD are realized from it at specific values of the tuning parameters. Further, a number of variations are possible in the SPAA framework with little or no additional computational complexity. The advantages of the method are demonstrated by application to an industrial and a numerically simulated data set.

Acknowledgements

This work was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada and Alberta Innovates Technology Futures (AITF).

REFERENCES


1. Kadlec P, Gabrys B, Strandt S. Data-driven soft sensors in the process industry. Computers & Chemical Engineering 2009; 33(4): 795–814. 2. Kadlec P, Grbi´c R, Gabrys B. Review of adaptation mechanisms for data-driven soft sensors. Computers & Chemical Engineering 2011; 35(1): 1–24. 3. Dayal BS, MacGregor JF. Recursive exponentially weighted PLS and its applications to adaptive control and prediction. Journal of Process Control 1997; 7(3): 169–179.


4. Cybenko G. Just-in-time learning and estimation. Nato ASI Series F Computer and Systems Sciences 1996; 153: 423–434. 5. Atkeson C, Schaal S, Moore A. Locally weighted learning. Artificial Intelligence Review 1997; 11(1-5): 11–73. 6. Schaal S, Atkeson CG, Vijayakumar S. Scalable techniques from nonparametric statistics for real time robot learning. Applied Intelligence 2002; 17(1): 49–60. 7. Pynnönen S, Salmi T. A report on least absolute deviation regression with ordinary linear programming. Finnish Journal of Business Economics 1994; 43(1): 33–49. 8. Narula SC, Wellington JF. The minimum sum of absolute errors regression: a state of the art survey. International Statistical Review/Revue Internationale de Statistique 1982; 50: 317–326. 9. Huber PJ. Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics 1973; 1: 799–821. 10. Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y. Online passive-aggressive algorithms. The Journal of Machine Learning Research 2006; 7: 551–585. 11. Jiang J, Zhang Y. A revisit to block and recursive least squares for parameter estimation. Computers & Electrical Engineering 2004; 30(5): 403–416. 12. Ogunnaike BA. Random Phenomena: Fundamentals of Probability and Statistics for Engineers. CRC Press: Boca Raton, 2010. 13. Dayal B, MacGregor JF. Improved PLS algorithms. Journal of Chemometrics 1997; 11(1): 73–85. 14. Kadlec P. On Robust and Adaptive Soft Sensors. Bournemouth University, School of Design, Engineering & Computing: Bournemouth, UK, 2009. 15. Cheng C, Chiu MS. A new data-based methodology for nonlinear process modeling. Chemical Engineering Science 2004; 59 (13): 2801–2810. 16. Kim S, Kano M, Hasebe S, Takinami A, Seki T. Long-term industrial applications of inferential control based on just-in-time soft-sensors: economical impact and challenges. Industrial & Engineering Chemistry Research 2013; 52(35): 12346–12356.


17. Ge Z, Song Z. A comparative study of just-in-time-learning based methods for online soft sensor modeling. Chemometrics and Intelligent Laboratory Systems 2010; 104(2): 306–317. 18. Vanderbei RJ. Linear Programming: Foundations and Extensions. Department of Operations Research and Financial Engineering, Princeton University: Princeton, 2001. 19. Blattberg R, Sargent T. Regression with non-Gaussian stable disturbances: some sampling results. Econometrica 1971; 39(3): 501–510. 20. Dielman TE. Least absolute value regression: recent contributions. Journal of Statistical Computation and Simulation 2005; 75(4): 263–286. 21. Planitz M, Gates J. Strict discrete approximation in the L1 and L∞ norms. Applied Statistics 1991; 40: 113–122.

22. Portnoy S, Koenker R. The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Statistical Science 1997; 12(4): 279–300. 23. Fisher WD. A note on curve fitting with minimum deviations by linear programming. Journal of the American Statistical Association 1961; 56(294): 359–362. 24. Giloni A, Simonoff JS, Sengupta B. Robust weighted LAD regression. Computational Statistics & Data Analysis 2006; 50(11): 3124–3140. 25. Cipra T. Robust exponential smoothing. Journal of Forecasting 1992; 11(1): 57. 26. Fortuna L, Graziani S, Rizzo A, Xibilia MG. Soft Sensors for Monitoring and Control of Industrial Processes. Springer: London, 2007.
