Wednesday, November 18, 2020
Computational Molecular Science and Engineering Forum (21) (Poster Gallery)
Machine learning has been used extensively to predict molecular properties and design molecules [1-6]. When setting up a machine learning model, the modeler must make a series of decisions about how the model will operate. One must choose a learning algorithm (e.g., random forest vs. neural network), a featurization method (how the molecules in the data set will be described to the learning algorithm), training set size, hyperparameter values, validation method, etc. These decisions are not independent, and together they affect the cost and efficacy of the machine learning model. In this work, we trained a series of machine learning models spanning a wide range of the above choices. For example, one instance could be a random forest model with 10-fold cross-validation and Morgan molecular fingerprint featurization to predict logP. Each model was used to predict molecular properties and was evaluated based on error and prediction uncertainty. The model parameters, performance, and molecular datasets were stored in a property graph database (PGDB). Graph topology algorithms were used to identify model features, including molecular fragments, that most impact a model’s performance. The PGDB enhances the explainability of machine learning models by enabling visualization and efficient queries of relationships between modeling choices, data, and model performance.
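The idea of linking modeling choices to model performance in a property graph can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the node labels, relationship names, choice lists, and the `placeholder_score` function are all hypothetical, the graph is held in plain dictionaries rather than a real graph database, and the error values are synthetic stand-ins rather than measured results.

```python
from collections import defaultdict
from itertools import product

# Hypothetical modeling choices; the names are illustrative only.
ALGORITHMS = ["random_forest", "neural_network"]
FEATURIZERS = ["morgan_fingerprint", "descriptor_set"]
TRAIN_SIZES = [100, 1000]


def build_graph(score_fn):
    """Return (nodes, edges) covering every combination of modeling choices."""
    nodes = {}   # node id -> property dict
    edges = []   # (run node id, relationship name, choice node id)
    for algo, feat, size in product(ALGORITHMS, FEATURIZERS, TRAIN_SIZES):
        run_id = f"run:{algo}:{feat}:{size}"
        nodes[run_id] = {"label": "ModelRun", "error": score_fn(algo, feat, size)}
        for label, rel, value in [
            ("Algorithm", "USES_ALGORITHM", algo),
            ("Featurizer", "USES_FEATURIZER", feat),
            ("TrainSize", "USES_TRAIN_SIZE", size),
        ]:
            choice_id = f"{label}:{value}"
            # Each choice gets one shared node, so runs are connected through it.
            nodes.setdefault(choice_id, {"label": label, "value": value})
            edges.append((run_id, rel, choice_id))
    return nodes, edges


def mean_error_by_choice(nodes, edges, relationship):
    """Average the error of all runs linked to each choice via `relationship`."""
    errors = defaultdict(list)
    for run_id, rel, choice_id in edges:
        if rel == relationship:
            errors[choice_id].append(nodes[run_id]["error"])
    return {choice: sum(vals) / len(vals) for choice, vals in errors.items()}


def placeholder_score(algo, feat, size):
    """Synthetic stand-in for a real train-and-evaluate step; NOT measured data."""
    return 1.0 / size + (0.1 if feat == "descriptor_set" else 0.0)


nodes, edges = build_graph(placeholder_score)
summary = mean_error_by_choice(nodes, edges, "USES_FEATURIZER")
```

Here `summary` maps each featurizer node to the mean placeholder error of the runs attached to it, which is the kind of relationship query between modeling choices and performance that a property graph makes straightforward. In practice `placeholder_score` would be replaced by actual model training and evaluation, and the dictionaries by a graph database.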


See more of this Session: Poster Session: Computational Molecular Science and Engineering Forum (CoMSEF)
See more of this Group/Topical: Computational Molecular Science and Engineering Forum