386157 Determining Optimal Groups for Group Contribution Methods

Wednesday, November 19, 2014: 2:42 PM
403 (Hilton Atlanta)
Nick Austin (1), Nick Sahinidis (1) and Daniel W. Trahan (2); (1) Chemical Engineering, Carnegie Mellon University, Pittsburgh, PA, (2) The Dow Chemical Company, Freeport, TX

Given the sheer number of chemical compounds in use today, measuring the properties of every new or untested compound is beyond current laboratory resources. This motivates the use of quantitative structure-activity relationships (QSARs), which are models that predict properties based solely on the chemical structure of a compound. QSARs often take the form of group contribution methods, which assume that a property of a compound can be modelled from the number of occurrences of molecular subunits, or “groups”. Many group contribution methods are derived by regression over large datasets, but many of these regression problems require the identity of the groups to be specified first, a requirement that introduces bias into model building. While group contribution methods have seen a substantial degree of success [1], many new methods are still based to some degree on the original UNIFAC groups [2]. Some recent approaches have applied best subset selection methods to alleviate this bias [3,4], but these methods must still specify a larger candidate set from which to select the best components. We propose a method that aims to investigate the entire space of subgroups and select the set of groups that provides the best predictive accuracy for a given dataset. This is essentially optimal group selection for a given model. The method shows promise for improving group contribution methods for properties that have historically been difficult to predict and design for, as well as for determining important substructures for more complicated properties like toxicity or bioactivity.
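
To make the regression setting concrete, the following is a minimal sketch of a linear group contribution fit combined with exhaustive best subset selection over a fixed candidate list, i.e. the baseline described for [3,4], not the method proposed in this abstract. Everything in it is an assumption made for illustration: the candidate groups, the group counts per compound, the synthetic contribution values used to generate the property, and the helper names `fit_contributions` and `best_subset`.

```python
import itertools
import numpy as np

# Hypothetical candidate groups; the abstract does not specify any.
GROUPS = ["CH3", "CH2", "OH", "C=O"]

# Rows are compounds, columns are occurrence counts of each candidate group.
counts = np.array([
    [2, 0, 0, 0],  # ethane
    [2, 1, 0, 0],  # propane
    [2, 2, 0, 0],  # butane
    [1, 0, 1, 0],  # methanol
    [1, 1, 1, 0],  # ethanol
    [1, 2, 1, 0],  # 1-propanol
    [2, 0, 0, 1],  # acetone
    [2, 1, 0, 1],  # butanone
], dtype=float)

# Synthetic property values (NOT measured data): generated from made-up
# contributions in which the C=O group has no effect.
true_contrib = np.array([10.0, 5.0, 30.0, 0.0])
y = counts @ true_contrib


def fit_contributions(X, y):
    """Least-squares group contributions and residual sum of squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((X @ coef - y) ** 2))
    return coef, rss


def best_subset(X, y, names, size):
    """Try every subset of `size` candidate groups; keep the lowest-RSS fit."""
    best = None
    for cols in itertools.combinations(range(X.shape[1]), size):
        coef, rss = fit_contributions(X[:, list(cols)], y)
        if best is None or rss < best[2]:
            best = ([names[c] for c in cols], coef, rss)
    return best


selected, coef, rss = best_subset(counts, y, GROUPS, size=3)
print(selected, np.round(coef, 3), rss)
```

The sketch also shows the limitation the abstract points out: the search can only choose among the columns of `counts`, so the candidate groups must still be fixed in advance. A realistic study would also score subsets on held-out data rather than on the training residual used here.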

[1] Group-contribution based estimation of pure component properties. Jorge Marrero and Rafiqul Gani. Fluid Phase Equilibria. 2001. 183, 183-208.
[2] Group-contribution estimation of activity coefficients in nonideal liquid mixtures. Aage Fredenslund, Russell L. Jones, and John M. Prausnitz. AIChE Journal. 1975. 21(6), 1086-99.
[3] Choosing Feature Selection and Learning Algorithms in QSAR. Martin Eklund, Ulf Norinder, Scott Boyer, and Lars Carlsson. Journal of Chemical Information and Modeling. 2014. 54(3), 837-843.
[4] Towards Optimal Descriptor Subset Selection with Support Vector Machines in Classification and Regression. Holger Fröhlich, Jörg K. Wegner, and Andreas Zell. QSAR & Combinatorial Science. 2004. 23(5), 311-318.

Session: Product and Molecular Design
Group/Topical: Computing and Systems Technology Division