469990 An Information Entropy Based Consistency Index for Evaluating the Performance of Variable Selection Methods

Tuesday, November 15, 2016: 10:18 AM
Carmel I (Hotel Nikko San Francisco)
Qinghua He, Department of Chemical Engineering, Tuskegee University, Tuskegee, AL

An information entropy based consistency index for evaluating the performance of variable selection methods

Q. Peter He

Department of Chemical Engineering, Auburn University, Auburn, AL 36849, USA

Data-driven soft sensors have been widely used in both academic research and industrial applications for predicting hard-to-measure variables or replacing physical sensors to reduce cost. It has been shown that the performance of these data-driven soft sensors could be greatly improved by selecting only the vital variables that strongly affect the primary variables, rather than using all the available process variables.

In the past few decades, many variable selection approaches have been reported for various applications and soft sensor modeling methods. To evaluate the performance of these variable selection methods, several performance indices have been proposed in the literature. The most common are the average mean absolute percentage error (MAPE), the coefficient of determination (R²), and the geometric mean of selection sensitivity and specificity (G). Among them, only G directly measures the accuracy of the variable selection results, while MAPE and R² measure the effects of variable selection indirectly, through the prediction performance of a soft sensor such as PLS. However, when information on the true relevant variables is not available, which is the case for most industrial applications, selection sensitivity and specificity (and therefore G) cannot be obtained, and no direct metric for variable selection exists in the literature.
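To make the dependence of G on ground truth concrete, the following is a minimal Python sketch of the geometric mean of selection sensitivity and specificity. The function name `selection_g` and the index-based encoding of variables are illustrative assumptions, not taken from the cited work.

```python
import math

def selection_g(selected, relevant, n_variables):
    """Geometric mean of selection sensitivity and specificity.

    selected: indices chosen by a variable selection method
    relevant: indices of the truly relevant variables (ground truth)
    n_variables: total number of candidate variables
    (Hypothetical helper; the encoding is an assumption for illustration.)
    """
    selected = set(selected)
    relevant = set(relevant)
    irrelevant = set(range(n_variables)) - relevant
    if not relevant or not irrelevant:
        raise ValueError("need at least one relevant and one irrelevant variable")

    # Sensitivity: fraction of relevant variables that were selected.
    sensitivity = len(selected & relevant) / len(relevant)
    # Specificity: fraction of irrelevant variables that were left out.
    specificity = len(irrelevant - selected) / len(irrelevant)
    return math.sqrt(sensitivity * specificity)

# Example: 10 candidates, variables 0-3 are relevant, the method picks 0, 1, 5.
g = selection_g(selected=[0, 1, 5], relevant=[0, 1, 2, 3], n_variables=10)
print(round(g, 4))
```

The `relevant` argument is exactly the information that is usually missing in industrial data sets, which is why G cannot be computed there.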

We recently reported an entropy-based variable selection index to assess variable selection performance, which does not require the ground truth of variable relevance [1]. The index evaluates the consistency of the variable selection results and is termed the consistency index. It was shown that the consistency index describes the variable selection performance well for both simulated (with ground truth) and industrial (without ground truth) case studies. However, the consistency index does not fully agree with a common expectation, namely that consistency should be lowest when a variable is selected for a model 50% of the time. Instead, the minimum (the lowest consistency) occurs at a selection probability of 0.3679 (see Figure 1). In addition, the consistency index cannot make use of the ground truth even when it is available.

To address these limitations, in this work we propose a modified consistency index based on information entropy. It has a symmetric response curve, as shown in Figure 2, and can be applied to all cases, whether or not information on the true relevant variables is available. Simulated and industrial case studies are provided to compare the performance of the proposed index with that of the existing indices.
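The abstract does not give the formula of either index, so the following Python sketch only illustrates the kind of asymmetry and symmetry described, under stated assumptions. It assumes (hypothetically) that a single-term entropy expression p ln p drives the earlier index, since that term has its minimum at p = 1/e ≈ 0.3679, and it uses one minus the binary entropy as one possible symmetric alternative. Neither function should be read as the definition used in [1] or in this work.

```python
import math

def single_term_consistency(p):
    """Assumed form: p * ln(p), which is lowest at p = 1/e (not at 0.5)."""
    return p * math.log(p) if p > 0 else 0.0

def binary_entropy_consistency(p):
    """Assumed symmetric form: 1 - H(p), with H the binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 1.0  # always or never selected: fully consistent
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return 1.0 - h

# Locate the lowest-consistency selection probability on a fine grid.
grid = [i / 10000 for i in range(10001)]
p_min_single = min(grid, key=single_term_consistency)
p_min_binary = min(grid, key=binary_entropy_consistency)

print(p_min_single, 1 / math.e)  # the two values should be close
print(p_min_binary)              # lowest consistency at the midpoint
```

Under these assumptions, the binary-entropy form treats selection probabilities p and 1 - p identically, which is the symmetric behavior the proposed index is described as having.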

Figure 1. Probability vs. consistency based on [1]

Figure 2. Probability vs. consistency based on the proposed index

References:

[1] Z. X. Wang, Q. P. He, and J. Wang, “Comparison of variable selection methods for PLS-based soft sensor modeling,” J. Process Control, vol. 26, pp. 56–72, 2015.
