- 2:18 PM

Selection of Molecular Descriptor Subsets for Property Prediction

Inga Paster1, Neima Brauner2, and Mordechai Shacham1. (1) Ben Gurion University of the Negev, Chemical Engineering Department, Beer-Sheva, 84105, Israel, (2) School of Engineering, Tel-Aviv University, Tel-Aviv, 69978, Israel

Pure-compound property data are currently available for only a small fraction of the compounds of interest in such diverse areas as chemistry and chemical engineering, environmental engineering and environmental impact assessment, and hazard and operability analysis. Therefore, methods for reliable prediction of property data are needed. Current methods used to predict physical and thermodynamic properties can be classified into "group contribution" methods (see, for example, Marrero and Gani, 2001), methods based on the "corresponding-states principle" (Poling et al., 2001), "asymptotic behavior" correlations (Marano and Holder, 1997) and Quantitative Structure Property Relationships (QSPRs; Dearden, 2003).

Recently we have developed the Targeted QSPR (TQSPR) method (Shacham et al., 2007; Brauner et al., 2008), which enables prediction of properties within the experimental error level. Unlike traditional QSPR methods, the TQSPR method is targeted to a particular compound, or a group of compounds, and relies on the identification of a relatively small number of structurally similar compounds. Hence, it can provide accurate predictions and estimates of the prediction error, while avoiding the need to model the highly nonlinear relationships between molecular descriptors and properties, which may require large amounts of experimental data.

A large database containing molecular descriptor data (up to 1600 molecular descriptors per compound) and physical property data for a large number of compounds is used for developing the TQSPR. First, a "target compound" representative of the desired "applicability domain" is selected. The correlation coefficients (or other similarity measures used in clustering algorithms, e.g. Euclidean distance) between the vector of molecular descriptors of the target compound and those of the other compounds are used to select a subset of compounds similar to the target (the "similarity group"). The first n compounds (n on the order of 10) with the highest correlation coefficient values, and for which target property data are available, serve as the "training set" for the development of the TQSPR. The target compound and the rest of the compounds in the similarity group serve either as the "validation set" (members for which target property values are available) or as the "predictive set" (compounds for which the property has to be predicted). The superiority of the method over conventional QSPRs depends on the ability of the algorithm to select a proper training set with a high level of structural similarity to the target compound and the target property.
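The similarity-group selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_similarity_group`, its parameters, and the default `n_train=10` are assumptions made for the example. It also assumes that the descriptor values are used as given (any scaling is left to the caller) and that no compound has a constant descriptor vector.

```python
import numpy as np

def select_similarity_group(descriptors, target_idx, has_property, n_train=10):
    """Rank compounds by correlation with the target and pick a training set.

    descriptors  : (n_compounds, n_descriptors) array, one row per compound
    target_idx   : row index of the target compound
    has_property : boolean sequence, True where the target property is known
    n_train      : hypothetical training-set size (assumed, not from the source)
    """
    X = np.asarray(descriptors, dtype=float)
    # Center each compound's descriptor vector on its own mean.
    Xc = X - X.mean(axis=1, keepdims=True)
    tc = Xc[target_idx]
    # Pearson correlation between the target row and every row.
    # Assumes no row is constant (otherwise the denominator is zero).
    r = (Xc @ tc) / (np.linalg.norm(Xc, axis=1) * np.linalg.norm(tc))
    # Most similar compounds first, with the target itself removed.
    ranked = [int(i) for i in np.argsort(-r) if i != target_idx]
    # Training set: the first n_train ranked compounds with known property data.
    training = [i for i in ranked if has_property[i]][:n_train]
    return r, ranked, training

# Tiny made-up example: compound 0 is the target.
demo_descriptors = [
    [1, 2, 3, 4],   # target
    [2, 4, 6, 8],   # perfectly correlated with the target
    [1, 2, 3, 5],   # very similar, but no property data
    [4, 3, 2, 1],   # anti-correlated
    [1, 3, 2, 4],   # moderately similar
]
demo_has_property = [False, True, False, True, True]
demo_r, demo_ranked, demo_training = select_similarity_group(
    demo_descriptors, 0, demo_has_property, n_train=2
)
```

In this toy data set the ranked compounds without property data (such as compound 2) would fall into the predictive set, while ranked compounds with data that are not used for training could serve for validation.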

Using the training set, molecular descriptors whose linear combination can represent the target property within experimental error level are identified. A stepwise regression procedure is used to derive the linear regression model (containing typically one to four descriptors, out of the 1600 descriptors in the database) which is used eventually as the TQSPR model.
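A generic forward stepwise regression gives a feel for this step. The abstract does not specify the authors' selection or stopping criteria, so the sketch below is an assumption: at each step it adds the descriptor that most reduces the root-mean-square residual of a least-squares fit with intercept, and it stops when the residual falls below a hypothetical tolerance `exp_error` or when `max_terms` descriptors have been selected.

```python
import numpy as np

def forward_stepwise(X, y, max_terms=4, exp_error=1e-6):
    """Greedy forward selection of descriptors for a linear model.

    X         : (n_compounds, n_descriptors) training-set descriptor matrix
    y         : (n_compounds,) target property values
    max_terms : maximum number of descriptors in the model (assumed default)
    exp_error : hypothetical stand-in for the experimental error level
    Returns (selected column indices, [intercept, coefficients...], RMS residual).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    selected = []
    coef = np.array([y.mean()])                      # intercept-only model
    rms = float(np.sqrt(np.mean((y - y.mean()) ** 2)))
    while len(selected) < max_terms and rms > exp_error:
        best = None
        for j in range(p):
            if j in selected:
                continue
            # Fit intercept + already selected descriptors + candidate j.
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            err = float(np.sqrt(np.mean((y - A @ beta) ** 2)))
            if best is None or err < best[0]:
                best = (err, j, beta)
        if best is None or best[0] >= rms:           # no improvement: stop
            break
        rms, j, coef = best
        selected.append(j)
    return selected, coef, rms

# Synthetic check: the property depends on descriptor 2 only.
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(10, 5))
demo_y = 3.0 + 2.0 * demo_X[:, 2]
demo_selected, demo_coef, demo_rms = forward_stepwise(demo_X, demo_y)
```

With a training set of about ten compounds, keeping `max_terms` small limits the risk of overfitting, which is in line with the one to four descriptors mentioned above.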

In the present work the optimal composition of the molecular descriptor subset used for the similarity group selection is investigated. The criterion for optimality is that the group selected on the basis of the clustering measures consists of compounds that are known to be structurally similar (such as members of a homologous series). Various criteria for stepwise selection of the descriptors to be included in the TQSPR are also evaluated with regard to the precision of the property prediction for the target compound. The resulting TQSPR models are linear and typically consist of one or two descriptors.

The results of the optimized algorithm are compared with predictions obtained with an 8-descriptor QSPR and a neural network for the melting point temperatures of normal alkanes, 1-alkenes, 1-alkanols, n-alkylbenzenes and n-alkanoic acids. Melting point temperatures were chosen for the comparison because they can be measured accurately but are notoriously difficult to predict (see, for example, Hughes et al., 2008). For most of the compounds the optimized version of the TQSPR method enables prediction of the melting point within the experimental error level, with a significantly lower prediction error than either the QSPR or the neural network predictions. The only exceptions are some low carbon number compounds (ethanoic acid, for example) whose level of similarity with other compounds in the database is rather low.


1. Brauner, N., Cholakov, G. St., Kahrs, O., Stateva, R. P. and Shacham, M., "Linear QSPRs for Predicting Pure Compound Properties in Homologous Series", AIChE J., 54(4), 978-990 (2008).

2. Dearden, J. C., "Quantitative Structure-Property Relationships for Prediction of Boiling Point, Vapor Pressure, and Melting Point", Environmental Toxicology and Chemistry, 22(8), 1696-1709 (2003).

3. Hughes, L. D., Palmer, D. S., Nigsch, F. and Mitchell, J. B. O., "Why Are Some Properties More Difficult To Predict than Others? A Study of QSPR Models of Solubility, Melting Point, and Log P", J. Chem. Inf. Model., 48, 220-232 (2008).

4. Marano, J.J., Holder, G.D., "General Equations for Correlating the Thermo-physical Properties of n-Paraffins, n-Olefins and other Homologous Series. 2. Asymptotic Behavior Correlations for PVT Properties", Ind. Eng. Chem. Res., 36, 1887-1894 (1997).

5. Marrero, J., Gani, R., "Group-contribution based estimation of pure component properties", Fluid Phase Equilibria, 183-184, 183-208 (2001).

6. Poling, B. E., Prausnitz, J. M., O'Connell, J. P., Properties of Gases and Liquids, 5th Ed., McGraw-Hill, New York (2001).

7. Shacham, M., Kahrs, O., Cholakov, G. St., Stateva, R. P., Marquardt, W. and Brauner, N., "The Role of the Dominant Descriptor in Targeted Quantitative Structure Property Relationships", Chem. Eng. Sci., 62(22), 6222-6233 (2007).