Comparison of Data Mining Algorithms for On-line NIR Models of CCS in Australia
J. Sexton1, Y. Everingham1, D. Donald2
1College of Science and Engineering, James Cook University
2MicroAg
Introduction
On-line NIR analysis systems play an important role in assessing the quality of sugarcane in Australia. In particular, commercial cane sugar (CCS) is used directly to calculate the payment made to growers. Within the Australian sugarcane industry, partial least squares regression has been the primary chemometric algorithm for building these NIR models. While interest in machine learning algorithms has grown, there has been little research into their application to cane quality measures. The objective of this research was to compare partial least squares regression, support vector regression, artificial neural networks and gradient boosted trees for their ability to estimate CCS from NIR data collected by an on-line cane analysis system.
Model comparison
All four methods performed well, with a minimum R2 of 0.85 (GBT; explaining 85% of observed variance). The ANN model had the lowest RMSE and highest R2, followed by SVR (Figure 3). However, there was little difference in performance between PLSR, SVR and ANN. This suggests SVR or ANN could replace PLSR without loss of performance, but may not provide a strong improvement. The GBT model identified 2250 nm, 2064 nm, 1846 nm, 1500 nm and 1270 nm as the most influential regions (relative influence >1%).
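The two skill measures used in this comparison can be sketched as follows. This is a minimal illustration, assuming R2 is the coefficient of determination (1 minus the ratio of residual to total sum of squares); the poster does not state which definition was used, and the function names are hypothetical.

```python
import numpy as np


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


def r_squared(observed, predicted):
    """Coefficient of determination (assumed definition of R2)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)        # residual sum of squares
    ss_tot = np.sum((observed - observed.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)


# Made-up CCS values, for illustration only
obs = [12.0, 13.0, 14.0, 15.0]
pred = [12.5, 12.5, 14.5, 14.5]
print(rmse(obs, pred), r_squared(obs, pred))
```

Under this definition, an R2 of 0.85 means the residual sum of squares is 15% of the total sum of squares in the validation set.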
Background
Commercial Cane Sugar (CCS) is the primary quality measure used in the Australian sugar industry and is calculated in laboratory analysis as:
$$\mathrm{CCS} = \frac{3 \times P_{ij}\,(95 - \%\mathrm{Fibre}) - B_{ij}\,(97 - \%\mathrm{Fibre})}{200}$$
Where:
Pij = Pol in juice: an indication of sucrose, measured as a percentage
Bij = Brix in juice: a measure of all dissolved sugars, measured as a percentage
Fibre = Fibre content as a percentage
Many Australian sugar cane mills use NIR analysis of shredded cane (Figure 1) to quickly estimate these cane quality measures.
• Partial Least Squares regression (PLSR) is the main algorithm used and performs well for estimating Bij, Pij and CCS (Staunton et al. 2004)
• Support Vector Regression (SVR) has proven as effective as PLSR for Bij and Pij in Australia (Sexton et al. 2017) and Japan (Tange et al. 2015)
• Artificial Neural Networks (ANN) are less common but have been used to build Vis/SWNIR models that classify brix from sugarcane rind scans in Australia (Nawi et al. 2013)
• Gradient Boosted Trees (GBT) have been used in vis-NIR analysis of soils (Araújo et al. 2014) but are not well represented in agricultural studies
Figure 1. Shredded cane in an Australian sugar cane mill.
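As a worked illustration of the CCS formula given in the Background, the calculation can be written as a small function. The function name, argument names and input values are hypothetical and chosen only to show the arithmetic.

```python
def ccs(pol_in_juice, brix_in_juice, fibre):
    """Commercial cane sugar from Pij, Bij and Fibre (all in percent).

    Implements CCS = [3 * Pij * (95 - Fibre) - Bij * (97 - Fibre)] / 200.
    """
    return (3.0 * pol_in_juice * (95.0 - fibre)
            - brix_in_juice * (97.0 - fibre)) / 200.0


# Illustrative inputs only (not measured data):
# 3 * 20 * 80 = 4800; 22 * 82 = 1804; (4800 - 1804) / 200 = 14.98
print(ccs(pol_in_juice=20.0, brix_in_juice=22.0, fibre=15.0))
```

Because Pij enters with three times the weight of Bij, errors in the estimated Pol in juice have the larger effect on the calculated CCS.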
Figure 3. Validation RMSE and R2 for each data mining method.
Error comparison
Predicted vs observed values of CCS were plotted to investigate errors visually (Figure 4). Spread in the data reflects overall method performance, with the smallest spread for ANN predictions and the largest spread for GBT predictions. Highlighting the 10 samples with the largest error for each method (red) showed that some samples were always difficult to estimate, regardless of the algorithm used. Furthermore, the predicted values for these samples were similar for each model.
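The check for samples that are difficult for every algorithm can be sketched as below. This is an illustrative reconstruction rather than the analysis code used for the poster; the function names are hypothetical and the data are made up.

```python
import numpy as np


def largest_error_samples(observed, predicted, n=10):
    """Indices of the n samples with the largest absolute error, largest first."""
    errors = np.abs(np.asarray(observed, dtype=float)
                    - np.asarray(predicted, dtype=float))
    return np.argsort(errors)[::-1][:n]


def shared_difficult_samples(observed, predictions_by_model, n=10):
    """Indices that appear in the top-n error list of every model."""
    top_sets = [
        set(largest_error_samples(observed, pred, n).tolist())
        for pred in predictions_by_model.values()
    ]
    return sorted(set.intersection(*top_sets))


# Made-up values: sample 2 is poorly estimated by both hypothetical models
observed = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
predictions = {
    "model_a": np.array([10.0, 11.0, 15.0, 13.0, 14.5]),
    "model_b": np.array([10.2, 11.0, 9.0, 13.0, 14.0]),
}
print(shared_difficult_samples(observed, predictions, n=2))
```

Samples returned by the intersection are candidates for closer inspection, since their errors do not depend on the choice of algorithm.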
Methodology
PLSR, SVR, ANN and GBT were used to build regression models of CCS from NIR data collected in an Australian sugar mill (Figure 1).
• Data were collected from the on-line analysis system of a single sugar cane mill for the 2006 crush season
• Lab CCS data with paired NIR scans (1100 nm to 2400 nm) were collected
• Data were partitioned into roughly equal training and validation sets (Table 1), with no samples from the same farm appearing in both sets
• NIR data were pre-processed using a first derivative and standard normal variate correction
• A Mahalanobis distance was used to identify atypical spectra, and samples with fewer than 3 "clean" spectra were removed from analysis. The final spectrum for each sample was the average of its clean spectra
• Cross-validation was used to tune hyper-parameters for each algorithm
PLSR: No. latent variables
SVR: cost and gamma (radial basis function, epsilon=0.1)
ANN: No. nodes, decay (single hidden layer)
GBT: No. trees, depth, shrinkage, min no. obs.
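The pre-processing step listed above (first derivative and standard normal variate correction) can be sketched as follows. This is a minimal sketch under stated assumptions: the derivative is a simple finite difference and is applied before SNV, whereas the poster does not specify the derivative method (Savitzky-Golay filtering is a common alternative) or the order of the two steps. Function names and the wavelength grid are hypothetical.

```python
import numpy as np


def first_derivative(spectra, wavelengths):
    """Finite-difference first derivative along the wavelength axis.

    spectra: array of shape (n_samples, n_wavelengths).
    """
    return np.gradient(spectra, wavelengths, axis=1)


def standard_normal_variate(spectra):
    """Centre and scale each spectrum (row) by its own mean and standard deviation."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def preprocess(spectra, wavelengths):
    """Assumed order: first derivative, then SNV."""
    spectra = np.asarray(spectra, dtype=float)
    return standard_normal_variate(first_derivative(spectra, wavelengths))


# Synthetic spectra on a hypothetical 1100-2400 nm grid, for illustration only
rng = np.random.default_rng(0)
wavelengths = np.linspace(1100.0, 2400.0, 50)
spectra = rng.random((5, 50))
processed = preprocess(spectra, wavelengths)
print(processed.shape)
```

After SNV, every spectrum has zero mean and unit standard deviation, which removes additive and multiplicative scatter differences between scans before model fitting.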
• Final parameter sets were chosen as the "simplest" model within one standard error of the minimum cross-validation RMSE
• Final models were built on the full training set and applied to the validation set
• Model skill was compared based on validation RMSE and R2
Figure 2. Outline of Methodology.
Table 1. CCS reference data statistics.
                 Training       Validation     Total
N                1899           1894           3793
CCS Mean (SD)    13.1 (1.68)    13.4 (1.33)    13.2 (1.52)
CCS Range        7.6 - 16.9     8.5 - 17.0     7.6 - 17.0
Figure 4. Predicted vs observed CCS values for the validation data set. Red numbers identify the 10 samples with the largest errors for each model.
Conclusions
• PLSR, SVR and ANN models performed similarly, with ANN having a slightly higher R2 and lower RMSE
• The GBT model was able to identify influential regions of the spectra and could be used as a feature selection technique
• Some samples were difficult to estimate regardless of the algorithm used
• Results suggest that better understanding which samples the models struggle with may be more important than which algorithm is used
References
1. Staunton, S., D. Mackintosh and G. Peatey (2004). The application of network NIR calibration equations at the Maryborough Sugar Factory. Proceedings of the Australian Society of Sugar Cane Technologists 26: 1-14.
2. Sexton, J., Y. Everingham and D. Donald (2017). Detailed trait characterisation is needed for simulation of cultivar responses to water stress. Proceedings of the Australian Society of Sugar Cane Technologists 29: 557-567.
3. Tange, R. I., M. A. Rasmussen, E. Taira and R. Bro (2015). Application of support vector regression for simultaneous modelling of near infrared spectra from multiple process steps. Journal of Near Infrared Spectroscopy 23(2).
4. Araújo, S. R., J. Wetterlind, J. A. M. Demattê and B. Stenberg (2014). Improving the prediction performance of a large tropical vis-NIR spectroscopic soil library from Brazil by clustering into smaller subsets or use of data mining calibration techniques. European Journal of Soil Science 65(5): 718-729.
5. Nawi, N. M., G. Chen, T. Jensen and S. A. Mehdizadeh (2013). Prediction and classification of sugar content of sugarcane based on skin scanning using visible and shortwave near infrared. Biosystems Engineering 115(2): 154-161.
Acknowledgements
This research was funded by Sugar Research Australia and James Cook University as part of a PhD project (2014/109). Special thanks are extended to Steven Staunton of SRA for contributions to this research.