278711 A Comprehensive Evaluation of Different Variable Selection Methods for Soft Sensor Development
In recent years, rapid developments in technology have facilitated the collection of vast amounts of data from industrial processes. These data have been used in many areas, such as data-driven soft sensor development and process monitoring, to control and optimize the process. The performance of these data-driven schemes can be greatly improved by selecting only the vital variables that strongly affect the primary variables, rather than using all the available process variables. Consequently, variable selection has become one of the most important practical concerns in data-driven approaches. By identifying irrelevant and redundant variables, variable selection can improve prediction performance, reduce computational load and model complexity, provide better insight into the nature of the process, and lower the cost of measurements [1-2].
This work presents a comprehensive evaluation of different variable selection methods for soft sensor development. Seven algorithms are investigated: stepwise regression, PLS-BETA, PLS-VIP, UVE-PLS, PLS-SA, PCA-SA and GA, as discussed below. Stepwise regression methods are often used for variable selection in linear regression. In this procedure, individual predictor (secondary) variables are introduced into the model sequentially so that the relation of each to the primary variables can be assessed. Partial Least Squares (PLS) regression provides two model-parameter-based selection criteria: the regression coefficients estimated by PLS (PLS-BETA) and the variable importance in projection (PLS-VIP). Both are discussed. Another model-parameter-based method, Uninformative Variable Elimination by PLS (UVE-PLS), is also related to the regression coefficients. However, instead of considering only the magnitude of the regression coefficients, it examines their reliability. Variable selection algorithms based on sensitivity analysis (PLS-SA and PCA-SA) are also studied. In these approaches, the importance of a secondary variable is given by its sensitivity, defined as the change in the primary variables when that secondary variable is varied over its allowable range [6-7]. Furthermore, the properties of genetic algorithms (GA), which have recently been proposed for variable selection applications, are also investigated.
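To make the two PLS-based criteria concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a single primary variable, autoscaled secondary variables, and the commonly used VIP definition with unit-norm weight vectors. The function names `pls1_fit` and `vip_scores`, the toy data, and the choice of two latent variables are all illustrative assumptions.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model (NIPALS) on autoscaled X and centered y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xk = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yk = y - y.mean()
    n, m = Xk.shape
    W = np.zeros((m, n_components))  # unit-norm weight vectors
    P = np.zeros((m, n_components))  # X loadings
    T = np.zeros((n, n_components))  # scores
    q = np.zeros(n_components)       # y loadings
    for a in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        tt = t @ t
        p = Xk.T @ t / tt
        q[a] = yk @ t / tt
        # Deflate X and y before extracting the next component.
        Xk = Xk - np.outer(t, p)
        yk = yk - q[a] * t
        W[:, a], P[:, a], T[:, a] = w, p, t
    # Regression coefficients with respect to the autoscaled X (PLS-BETA).
    beta = W @ np.linalg.solve(P.T @ W, q)
    return W, T, q, beta

def vip_scores(W, T, q):
    """Variable importance in projection for each column of X (PLS-VIP)."""
    m = W.shape[0]
    ssy = q ** 2 * np.sum(T ** 2, axis=0)  # y variation explained per component
    return np.sqrt(m * (W ** 2 @ ssy) / ssy.sum())

# Toy data (assumed): only the first two of six variables drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

W, T, q, beta = pls1_fit(X, y, n_components=2)
vip = vip_scores(W, T, q)
selected_by_vip = np.flatnonzero(vip > 1.0)  # "VIP > 1" is a common rule of thumb
ranking_by_beta = np.argsort(-np.abs(beta))  # PLS-BETA ranks by |coefficient|
```

Because the mean of the squared VIP scores equals one by construction, a cutoff of 1 is a frequently quoted rule of thumb. Any such cutoff, like the coefficient threshold for PLS-BETA, remains a tuning choice.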
The algorithms of these variable selection methods and their characteristics will be presented. In addition, their strengths and limitations when applied to soft sensor development are studied. The prediction performance of soft sensor models developed with these variable selection methods is compared using PLS.
A simple simulation case is used to investigate the properties of the selected variable selection methods. The dataset is generated to mimic typical characteristics of process data, such as the magnitude of the correlations between variables and the signal-to-noise ratio. In addition, the algorithms are applied to an industrial soft sensor case study. In both cases, independent test sets are used to provide a fair comparison and analysis of the different algorithms. The final performances are compared to demonstrate the advantages and disadvantages of the different methods and to provide useful insights to practitioners in the field.
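A simulated dataset with a prescribed correlation level and signal-to-noise ratio could be generated along the following lines. This is only a sketch under stated assumptions, not the simulation used in this work: the predictors are equicorrelated with unit variance, the signal-to-noise ratio is defined as a ratio of variances, and the function name `simulate_process_data`, its parameters, and the train/test split sizes are hypothetical.

```python
import numpy as np

def simulate_process_data(n_samples, n_relevant, n_irrelevant, rho, snr, seed=0):
    """Generate correlated predictors and a noisy response.

    rho : pairwise correlation between all predictors (assumed equal)
    snr : var(signal) / var(noise) in the response (assumed definition)
    """
    rng = np.random.default_rng(seed)
    m = n_relevant + n_irrelevant
    # Equicorrelation matrix: ones on the diagonal, rho elsewhere.
    cov = np.full((m, m), rho) + (1.0 - rho) * np.eye(m)
    # Colour independent standard normals with the Cholesky factor of cov.
    X = rng.standard_normal((n_samples, m)) @ np.linalg.cholesky(cov).T
    # Only the first n_relevant variables affect the primary variable.
    coef = np.concatenate([np.ones(n_relevant), np.zeros(n_irrelevant)])
    signal = X @ coef
    noise_std = np.sqrt(signal.var() / snr)
    y = signal + noise_std * rng.standard_normal(n_samples)
    return X, y, signal

X, y, signal = simulate_process_data(
    n_samples=2000, n_relevant=3, n_irrelevant=7, rho=0.5, snr=10.0
)

# Hold out an independent test set that is never used for variable selection.
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

empirical_rho = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
empirical_snr = signal.var() / (y - signal).var()
```

Keeping the test rows out of both the selection step and the model fit is what makes the later comparison fair, since selecting variables on the full dataset would leak information into the reported test error [2].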
- Andersen, C. M., & Bro, R. (2010). Variable selection in regression — a tutorial. Journal of Chemometrics, 24(11-12), 728-737.
- Reunanen, J. (2003). Overfitting in making comparisons between variable selection methods. The Journal of Machine Learning Research, 3, 1371-1382. Retrieved from http://dl.acm.org/citation.cfm?id=944978
- Ma, M.-D., Ko, J.-W., Wang, S.-J., Wu, M.-F., Jang, S.-S., Shieh, S.-S., & Wong, D. S.-H. (2009). Development of adaptive soft sensor based on statistical identification of key variables. Control Engineering Practice, 17(9), 1026-1034.
- Chong, I.-G., & Jun, C.-H. (2005). Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems, 78(1-2), 103-112.
- Centner, V., Massart, D. L., de Noord, O. E., de Jong, S., Vandeginste, B. M., & Sterna, C. (1996). Elimination of uninformative variables for multivariate calibration. Analytical Chemistry, 68(21), 3851-3858.
- Arciniegas, F. A., Embrechts, M., & Rueda, I. E. A. (2006). Variable Selection with Partial Least Squares Sensitivity Analysis: An Application to Currency Crises’ Real Effects. SSRN eLibrary. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=909508
- Zamprogna, E., Barolo, M., & Seborg, D. E. (2005). Optimal selection of soft sensor inputs for batch distillation columns using principal component analysis. Journal of Process Control, 15(1), 39-52.
- Chiang, L. H., & Pell, R. J. (2004). Genetic algorithms combined with discriminant analysis for key variable identification. Journal of Process Control, 14, 143-155.