Chemoinformatics methodology and large data sets of molecular information are becoming integrated into computer-aided chemical engineering process design software. Computer-aided design and/or selection of molecules with target properties from QSAR models is perceived as a large-scale computational combinatorial problem. Information on molecular structure, and inference of molecular properties, rests mostly on two approaches: structure coding based on graph theory (the extended valence sequences of Faulon et al. [1]) and chemical molecular descriptors [2]. Software tools for the automatic calculation of chemoinformatic data are available, but the required inverse modelling, from target properties to molecular structures, is difficult and remains an open problem. Chemoinformatic mappings lack systematic formal mathematical properties: they are nonlinear, discontinuous and highly synergistic, so linear and nonlinear continuous models generalise poorly and are mostly limited to specific cases. Here, models based on decision trees and random forests are applied, and their accuracy is evaluated for inverse classification from chemoinformatic data to molecular structures. The test molecules are alkanes, alkenes, acetones, aromatics, organic acids and halogenated hydrocarbons in the range C1-C12, and a set of binary ionic liquids (cations: imidazolium, pyridinium, quinolinium, ammonium, phosphonium). The results indicate that molecular descriptors outperform the graph-based approach for the prediction of molecular properties, whereas the accuracy of the inverse mapping is favoured by the graph-based extended valences.
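The inverse classification described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a simplified height-1 atom-environment count as a stand-in for the extended valence (signature) coding, a single greedy decision tree in place of a random forest, and the structural family as the classification target. The toy molecules, the family labels and all function names (`height1_signatures`, `build_tree`, `predict`, `vectorize`) are hypothetical and chosen for illustration only.

```python
from collections import Counter

BOND = {1: "-", 2: "=", 3: "#"}

# Hydrogen-suppressed graphs: (atom symbols, bonds as (i, j, bond order)).
# Toy molecules for illustration only; not the data set of the study.
MOLECULES = {
    "propane": (["C", "C", "C"], [(0, 1, 1), (1, 2, 1)]),
    "butane": (["C", "C", "C", "C"], [(0, 1, 1), (1, 2, 1), (2, 3, 1)]),
    "propene": (["C", "C", "C"], [(0, 1, 2), (1, 2, 1)]),
    "1-butene": (["C", "C", "C", "C"], [(0, 1, 2), (1, 2, 1), (2, 3, 1)]),
    "chloroethane": (["C", "C", "Cl"], [(0, 1, 1), (1, 2, 1)]),
    "1-chloropropane": (["C", "C", "C", "Cl"], [(0, 1, 1), (1, 2, 1), (2, 3, 1)]),
}
FAMILY = {
    "propane": "alkane", "butane": "alkane",
    "propene": "alkene", "1-butene": "alkene",
    "chloroethane": "haloalkane", "1-chloropropane": "haloalkane",
}


def height1_signatures(atoms, bonds):
    """Count each atom's environment: its element plus its sorted bonded neighbours."""
    neighbours = [[] for _ in atoms]
    for i, j, order in bonds:
        neighbours[i].append(BOND[order] + atoms[j])
        neighbours[j].append(BOND[order] + atoms[i])
    return Counter(f"{a}({''.join(sorted(n))})" for a, n in zip(atoms, neighbours))


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels):
    """Grow a binary tree greedily; leaves are class labels, nodes are tuples."""
    if len(set(labels)) == 1:
        return labels[0]
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows})[:-1]:  # keeps both sides non-empty
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            score = sum(len(s) * gini([labels[i] for i in s]) for s in (left, right))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # identical rows with different labels: majority vote
        return Counter(labels).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left]),
            build_tree([rows[i] for i in right], [labels[i] for i in right]))


def predict(tree, row):
    """Walk from the root to a leaf and return the class label stored there."""
    while isinstance(tree, tuple):
        f, t, low, high = tree
        tree = low if row[f] <= t else high
    return tree


names = sorted(MOLECULES)
counts = {m: height1_signatures(*MOLECULES[m]) for m in names}
VOCAB = sorted(set().union(*counts.values()))  # one feature per observed environment


def vectorize(signature_counts):
    """Map environment counts onto the fixed feature order given by VOCAB."""
    return [signature_counts.get(s, 0) for s in VOCAB]


tree = build_tree([vectorize(counts[m]) for m in names], [FAMILY[m] for m in names])

# Query molecule outside the training set: pentane (hypothetical test case).
pentane = (["C"] * 5, [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1)])
print(predict(tree, vectorize(height1_signatures(*pentane))))
```

With only six training molecules the learned splits are not chemically meaningful in general; the sketch only shows how graph-derived counts can feed a tree classifier. A random forest would average many such trees grown on resampled data and random feature subsets.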
1. Jean-Loup Faulon, Donald P. Visco, Ramdas S. Pophale, The Signature Molecular Descriptor. 1. Using Extended Valence Sequences in QSAR and QSPR Studies, J. Chem. Inf. Comput. Sci., 43 (2003) 707-720
2. Chun Wei Yap, PaDEL-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints, J. Comput. Chem., 32 (2011) 1466-1474
3. Ola Spjuth, Jonathan Alvarsson, Arvid Berg, Martin Eklund, Stefan Kuhn, Carl Mäsak, Gilleain Torrance, Johannes Wagener, Egon L. Willighagen, Christoph Steinbeck, Jarl E. S. Wikberg, Bioclipse 2: A Scriptable Integration Platform for the Life Sciences, BMC Bioinformatics, 10 (2009) 397, doi:10.1186/1471-2105-10-397
4. R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Group/Topical: Spring Meeting Poster Session and Networking Reception