255431 Self-Consistency Analysis of Physical Property and Molecular Descriptor Databases Using a Variety of Prediction Techniques
Pure-compound property data are widely used in process design, simulation and optimization, environmental impact assessment, hazard and operability analysis, and many other areas of chemistry and chemical engineering. At present, pure-compound property databases (such as the DIPPR database, Rowley et al., 2010) serve as the main sources of property data. To provide maximal benefit to users, these databases typically contain both experimental and predicted data. Both types of data are associated with certain levels of uncertainty. For experimental data the challenge is selecting the "best" value from several reported values, while for predicted values only an average "prediction error" is reported, which is based on a particular training set of compounds and may not be applicable to the particular compound considered. New, more accurate prediction techniques are continuously being developed, which makes it possible to replace older, less accurate predicted values with more accurate ones.
However, periodic screening of the database to identify data points that need to be replaced represents a great challenge because of the huge amount of data involved. To make this task easier, a system is required that can screen all the data in the database and flag potentially erroneous or low-precision data. Once such a point has been identified, various state-of-the-art property prediction techniques, such as the multi-level group contribution method (Marrero and Gani, 2001), asymptotic behavior correlations (ABCs, Marano and Holder, 1997), Targeted Quantitative Structure Property Relationships (TQSPRs, Brauner et al., 2006) and the Reference Series method (Shacham et al., 2012), can be used to determine whether the particular value needs to be, and can be, replaced by a more accurate one.
QSPR-type property prediction methods require molecular descriptors to represent the structure of the molecule. In recent years, computer programs that can calculate several thousand descriptors have emerged. Checking the accuracy and consistency of the molecular descriptors, and the correctness of the associated molecular structure files (which are often provided in MOL format), represents an additional major challenge. Flagging potentially erroneous property data points can also help in identifying incorrect MOL files.
In the system we have developed, the single-descriptor version of the TQSPR method (Shacham et al., 2007) is used for the initial screening of the database: every compound in the database is selected in turn as the target compound, and all its available constant properties are predicted. If the difference between the recommended database value and the predicted value is considerably higher than the uncertainty assigned to the property in the database, the data point is flagged as potentially erroneous. The flagged data points require additional analysis, as the large differences can have various sources. One potential source is the unsuitability of the prediction technique used for the screening. The single-descriptor TQSPR method (like most, if not all, other prediction techniques) may not provide accurate predictions for compounds with low carbon numbers (nC), such as the first members of homologous series, for which properties are known to change irregularly, or for solid properties at nC ≤ 20, where odd and even nC compounds follow different trends. Another potential source is an incorrect MOL file or some erroneous molecular descriptors for the target compound. In many cases, however, the large differences between the predicted and the recommended database values are caused by improper selection of the "recommended" value from the available experimental data, or by low accuracy or inconsistency of the available data.
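The screening loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout, the function names and the tolerance `factor` (a stand-in for "considerably higher") are assumptions made for illustration, the data are synthetic, and the selection of a similarity group of predictive compounds and of the dominant descriptor used by the TQSPR method is omitted. Here each target is simply predicted from a straight line fitted to all remaining compounds.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str               # hypothetical compound label
    descriptor: float       # value of a single molecular descriptor
    value: float            # recommended database value of the property
    uncertainty_pct: float  # uncertainty assigned in the database, percent


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def screen(records, factor=3.0):
    """Take each compound in turn as the target, predict its property from
    the other compounds, and flag it when the deviation from the database
    value exceeds factor * (assigned uncertainty). `factor` is an assumed
    tolerance, not a value taken from the abstract."""
    flagged = []
    for i, target in enumerate(records):
        others = records[:i] + records[i + 1:]
        slope, intercept = fit_line(
            [r.descriptor for r in others], [r.value for r in others]
        )
        predicted = slope * target.descriptor + intercept
        deviation_pct = abs(predicted - target.value) / abs(target.value) * 100
        if deviation_pct > factor * target.uncertainty_pct:
            flagged.append((target.name, predicted, deviation_pct))
    return flagged


# Synthetic data: value = 100 + 20 * descriptor, except "C4", whose
# database value was deliberately altered from 180 to 230.
records = [
    Record("C1", 1.0, 120.0, 5.0),
    Record("C2", 2.0, 140.0, 5.0),
    Record("C3", 3.0, 160.0, 5.0),
    Record("C4", 4.0, 230.0, 5.0),
    Record("C5", 5.0, 200.0, 5.0),
    Record("C6", 6.0, 220.0, 5.0),
]
flagged = screen(records)
for name, predicted, deviation_pct in flagged:
    print(f"{name}: predicted {predicted:.1f}, deviation {deviation_pct:.1f}%")
```

With these synthetic numbers only the altered record is flagged. The altered value also shifts the predictions for its neighbors, which is one reason a flagged point calls for further analysis rather than automatic replacement.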
We evaluated the proposed technique by applying it to a database that contains constant physical property data for 1798 compounds. The database includes numerical values and data uncertainties for 32 properties (critical properties, normal melting and boiling temperatures, heat of formation, flammability limits, etc.). All the property data are from the DIPPR database (Rowley et al., 2010). The database also contains 3224 molecular descriptors generated by the Dragon software, version 5.5 (DRAGON is copyrighted by TALETE srl, http://www.talete.mi.it), from minimized 3-D molecular models. The molecular structure (MOL) files were provided by Rowley, 2010.
The results of this evaluation will be presented in the extended abstract and the presentation. Some typical cases where the property or the molecular structure databases needed updating will be discussed in more detail. These examples include cases where the predicted values for long-chain substances exceeded the accepted maximal ("infinite") values of some properties, cases where the recommended property values for a homologous series were inconsistent with the accepted values of another series, and cases where incorrect MOL files prevented satisfactory TQSPR predictions.
1. Brauner, N.; Stateva, R. P.; Cholakov, G. St.; Shacham, M. Structurally “Targeted” Quantitative Structure-Property Relationship Method for Property Prediction. Ind. Eng. Chem. Res. 2006, 45, 8430-8437.
2. Marano, J. J.; Holder, G. D. General Equations for Correlating the Thermo-physical Properties of n-Paraffins, n-Olefins and other Homologous Series. 2. Asymptotic Behavior Correlations for PVT Properties. Ind. Eng. Chem. Res. 1997, 36, 1895.
3. Marrero, J.; Gani, R. Group-contribution based estimation of pure component properties. Fluid Phase Equilibria 2001, 183.
4. Rowley, R. L.; Wilding, W. V.; Oscarson, J. L.; Yang, Y.; Zundel, N. A. DIPPR Data Compilation of Pure Chemical Properties. Design Institute for Physical Properties (http://www.aiche.org/dippr), Brigham Young University, Provo, Utah, 2010.
5. Rowley, R. L. Personal communication, 2010.
6. Shacham, M.; Kahrs, O.; Cholakov, G. St.; Stateva, R.; Marquardt, W.; Brauner, N. The Role of the Dominant Descriptor in Targeted Quantitative Structure Property Relationships. Chem. Eng. Sci. 2007, 62 (22), 6222-6233.
7. Shacham, M.; Paster, I.; Brauner, N. Property Prediction and Consistency Analysis by a Reference Series Method. AIChE J., accepted for publication (2012).