Comparison of Data Mining Algorithms for On-line NIR Models of CCS in Australia

J. Sexton1, Y. Everingham1, D. Donald2

1 College of Science and Engineering, James Cook University
2 MicroAg
[email protected]

Introduction

On-line NIR analysis systems play an important role in assessing the quality of sugarcane in Australia. In particular, commercial cane sugar (CCS) is directly used to calculate the payment made to growers. Within the Australian sugarcane industry, partial least squares regression has been the primary chemometric algorithm for building these NIR models. While interest in machine learning algorithms has grown, there has been little research into their application to cane quality measures. The objective of this research was to compare partial least squares regression (PLSR), support vector regression (SVR), artificial neural networks (ANN) and gradient boosted trees (GBT) for their ability to estimate CCS from NIR data collected by an on-line cane analysis system.

Model comparison

All four methods performed well, with a minimum R2 of 0.85 (GBT; explaining 85% of observed variance). The ANN model had the lowest RMSE and highest R2, followed by SVR (Figure 3). However, there was little difference between PLSR, SVR and ANN performance. This suggests SVR or ANN could replace PLSR without loss of performance, but may not provide a strong improvement. The GBT model identified 2250 nm, 2064 nm, 1846 nm, 1500 nm and 1270 nm as the most influential regions (relative influence >1%).
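The two skill measures used in the comparison can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes R2 is the coefficient of determination (1 minus the ratio of residual to total sum of squares), which the poster does not state explicitly, and the CCS values in the example are made up.

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

def r_squared(observed, predicted):
    """Coefficient of determination (assumed definition of R2)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)        # unexplained variation
    ss_tot = np.sum((observed - observed.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)

# Hypothetical validation values, for illustration only.
observed_ccs = [12.0, 13.0, 14.0, 15.0]
predicted_ccs = [12.5, 12.5, 14.5, 14.5]

example_rmse = rmse(observed_ccs, predicted_ccs)
example_r2 = r_squared(observed_ccs, predicted_ccs)
```

Under this definition, an R2 of 0.85 means the model leaves 15% of the observed variance in CCS unexplained.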

Background

Commercial Cane Sugar (CCS) is the primary quality measure used in the Australian sugar industry and is calculated in laboratory analysis as:

$$\mathrm{CCS} = \frac{3 \times P_{ij}\,(95 - \%\mathrm{Fibre}) - B_{ij}\,(97 - \%\mathrm{Fibre})}{200}$$

Where:
Pij = Pol in juice: an indication of sucrose measured as a percentage
Bij = Brix: a measure of all dissolved sugars measured as a percentage
Fibre = Fibre content as a percentage.

Many Australian sugar cane mills use NIR analysis of shredded cane (Figure 1) to quickly estimate these cane quality measures.

• Partial Least Squares regression (PLSR) is the main algorithm used and performs well for estimating Bij, Pij and CCS (Staunton et al. 2004)
• Support Vector Regression (SVR) has proven as effective as PLSR for Bij and Pij in Australia (Sexton et al. 2017) and Japan (Tange et al. 2015)
• Artificial Neural Networks (ANN) are less common but have been used to build Vis/SWNIR models that classify brix from sugarcane rind scans in Australia (Nawi et al. 2013)
• Gradient Boosted Trees (GBT) have been used in vis-NIR analysis of soils (Araujo et al. 2014) but are not well represented in agricultural studies

Figure 1. Shredded cane in an Australian sugar cane mill.
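The CCS calculation from the Background can be written as a small function. This sketch assumes the form CCS = [3 × Pij × (95 − %Fibre) − Bij × (97 − %Fibre)] / 200, with all three inputs given as percentages. The function name and the input values are hypothetical and chosen only to show the arithmetic.

```python
def ccs(pol_in_juice, brix_in_juice, fibre):
    """Commercial cane sugar from pol in juice, brix in juice and fibre (all in %)."""
    pol_term = 3.0 * pol_in_juice * (95.0 - fibre)
    brix_term = brix_in_juice * (97.0 - fibre)
    return (pol_term - brix_term) / 200.0

# Illustrative inputs only (not taken from the study data).
example_ccs = ccs(pol_in_juice=20.0, brix_in_juice=22.0, fibre=15.0)
# pol term: 3 * 20 * 80 = 4800; brix term: 22 * 82 = 1804; (4800 - 1804) / 200 = 14.98
```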

Figure 3. Validation RMSE and R2 for each data mining method.

Error comparison

Predicted vs observed values of CCS were plotted to investigate errors visually (Figure 4). Spread in the data reflects overall method performance, with the lowest spread for ANN and the largest spread for GBT predictions. Highlighting the 10 samples with the largest error for each method (red) showed that some samples were always difficult to estimate regardless of the algorithm used. Furthermore, the predicted values for these samples were similar for each model.
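One way to check which samples are difficult for every algorithm is to take the largest absolute errors per model and intersect the resulting sets. The sketch below is an assumption about how such a check could be done, not the procedure used for Figure 4. The function names and the toy predictions are hypothetical.

```python
import numpy as np

def worst_samples(observed, predicted, n=10):
    """Indices of the n samples with the largest absolute prediction error."""
    abs_err = np.abs(np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float))
    order = np.argsort(-abs_err, kind="stable")  # largest error first
    return set(order[:n].tolist())

def always_difficult(observed, predictions_by_model, n=10):
    """Samples that fall in the worst-n set of every model."""
    worst_sets = [worst_samples(observed, pred, n) for pred in predictions_by_model.values()]
    return set.intersection(*worst_sets)

# Toy example with two hypothetical models and n=2.
observed_ccs = [13.0, 13.0, 13.0, 13.0, 13.0]
toy_predictions = {
    "model_a": [13.1, 15.0, 13.0, 12.0, 13.2],  # worst two: samples 1 and 3
    "model_b": [12.9, 14.5, 13.6, 13.1, 13.0],  # worst two: samples 1 and 2
}
shared = always_difficult(observed_ccs, toy_predictions, n=2)
```

In this toy case only sample 1 appears in both worst-error sets, so it is the only sample flagged as difficult for both models.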

Methodology

PLSR, SVR, ANN and GBT were used to build regression models of CCS from NIR data collected in an Australian sugar mill (Figure 1).

• Data were collected from a single sugar cane mill on-line analysis system for the 2006 crush season
• Lab CCS data with paired NIR scans (1100 nm – 2400 nm) were collected
• Data were partitioned into roughly equal training and validation sets (Table 1), with no samples from the same farm appearing in both sets
• NIR data were pre-processed using a first derivative and standard normal variate correction
• A Mahalanobis distance was used to identify atypical spectra, and samples with fewer than 3 'clean' spectra were removed from analysis. Final spectra were the average of all clean spectra
• Cross-validation was used to tune hyper-parameters for each algorithm:
  PLSR: no. of latent variables
  SVR: cost and gamma (radial basis function, epsilon = 0.1)
  ANN: no. of nodes, decay (single hidden layer)
  GBT: no. of trees, depth, shrinkage, min. no. of observations
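The spectral pre-processing step can be sketched as below. Several details are assumptions, since the poster does not give them: the derivative is taken as a simple finite difference between adjacent wavelengths (a Savitzky-Golay derivative is a common alternative), and it is applied before the standard normal variate (SNV) correction, following the order in which the two steps are named. The function names are hypothetical.

```python
import numpy as np

def first_derivative(spectra):
    """Finite-difference first derivative along the wavelength axis (assumed method)."""
    return np.diff(spectra, axis=1)

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def preprocess(spectra):
    """First derivative followed by SNV (order is an assumption)."""
    spectra = np.asarray(spectra, dtype=float)
    return snv(first_derivative(spectra))

# Synthetic spectra for illustration: 4 scans, 50 wavelengths.
rng = np.random.default_rng(0)
raw_spectra = rng.normal(size=(4, 50)).cumsum(axis=1)
processed = preprocess(raw_spectra)
```

After SNV, each processed spectrum has zero mean and unit standard deviation, which removes additive and multiplicative scatter differences between scans.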

Final parameter sets were chosen as the 'simplest' model within one standard error of the minimum cross-validated RMSE. Final models were built on the full training set and applied to the validation set. Model skill was compared based on validation RMSE and R2.

Figure 2. Outline of Methodology.

Table 1. CCS reference data statistics.

|               | Training    | Validation  | Total       |
|---------------|-------------|-------------|-------------|
| N             | 1899        | 1894        | 3793        |
| CCS Mean (SD) | 13.1 (1.68) | 13.4 (1.33) | 13.2 (1.52) |
| CCS Range     | 7.6 – 16.9  | 8.5 – 17.0  | 7.6 – 17.0  |

Figure 4. Predicted vs observed CCS values for the validation data set. Red numbers identify the 10 samples with the largest errors for each model.

Conclusions

• PLSR, SVR and ANN models performed similarly, with ANN having a slightly higher R2 and lower RMSE
• The GBT model was able to identify influential regions of the spectra and could be used as a feature selection technique
• Some samples were difficult to estimate regardless of the algorithm used
• Results suggest that better understanding which samples the models struggle with may be more important than which algorithm is used

References

1. Staunton, S., D. Mackintosh and G. Peatey (2004). The application of network NIR calibration equations at the Maryborough Sugar Factory. Proceedings of the Australian Society of Sugar Cane Technologists 26: 1-14.
2. Sexton, J., Y. Everingham and D. Donald (2017). Detailed trait characterisation is needed for simulation of cultivar responses to water stress. Proceedings of the Australian Society of Sugar Cane Technologists 29: 557-567.
3. Tange, R. I., M. A. Rasmussen, E. Taira and R. Bro (2015). Application of support vector regression for simultaneous modelling of near infrared spectra from multiple process steps. Journal of Near Infrared Spectroscopy 23(2).
4. Araújo, S. R., J. Wetterlind, J. A. M. Demattê and B. Stenberg (2014). Improving the prediction performance of a large tropical vis-NIR spectroscopic soil library from Brazil by clustering into smaller subsets or use of data mining calibration techniques. European Journal of Soil Science 65(5): 718-729.
5. Nawi, N. M., G. Chen, T. Jensen and S. A. Mehdizadeh (2013). Prediction and classification of sugar content of sugarcane based on skin scanning using visible and shortwave near infrared. Biosystems Engineering 115(2): 154-161.

Acknowledgements

This research was funded by Sugar Research Australia and James Cook University as part of a PhD project (2014/109). Special thanks are extended to Steven Staunton of SRA for contributions to this research.