606412 Combining Strategic Training Data Selection and Feature Engineering to Reach Accurate and Efficient Molecular Property Prediction

Wednesday, November 18, 2020
Computational Molecular Science and Engineering Forum (21) (Poster Gallery)
Bowen Li, Chemical Engineering, Lehigh University, Bethlehem, PA and Srinivas Rangarajan, Department of Chemical and Biomolecular Engineering, Lehigh University, Bethlehem, PA

Organic molecular design problems, such as drug discovery or material design, aim to identify molecules with desired properties from the chemical space, wherein the number of potential compounds is estimated to reach 10^60. The size of the chemical space makes it infeasible to evaluate each molecule by experiment or high-level quantum chemistry. In recent decades, the integration of machine learning methods with virtual screening has made the exploration of chemical space practical, owing to the high efficiency and low cost of these methods. While many machine learning models reach high accuracy with hundreds of thousands of training molecules, only a handful of studies have focused on optimizing model performance under a tight computational budget. In this work, we propose a strategy to obtain accurate machine learning predictions with a minimum number of training data points. Specifically, we address the problem in three parts. First, we demonstrate the efficacy of a method that adaptively builds a compact training set by systematically balancing exploitation via experimental design and exploration of the space via cheminformatics-based diversity maximization procedures. Second, we expand this procedure with nonlinear and locally linear dimensionality reduction methods to leverage data embeddings. Third, we focus on improving model accuracy under the constraint of a small training set, which we achieve by progressively incorporating nonlinearity into our modified group additivity approach.
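
As an illustration of the exploration step only, diversity maximization can be sketched with a generic greedy max-min picker. This is an assumption for illustration, not the procedure used in this work, which the abstract does not specify. The function name maxmin_select, the Euclidean distance, and the toy feature vectors are all hypothetical; in practice the rows would be molecular descriptors or fingerprints.

```python
import numpy as np


def maxmin_select(features, n_select, seed_index=0):
    """Greedy max-min diversity selection (illustrative sketch).

    Starting from seed_index, repeatedly add the candidate whose distance
    to its nearest already-selected point is largest.
    """
    features = np.asarray(features, dtype=float)
    n_points = features.shape[0]
    if not 0 < n_select <= n_points:
        raise ValueError("n_select must be between 1 and the number of points")

    selected = [seed_index]
    # Distance from every candidate to its nearest selected point so far.
    min_dist = np.linalg.norm(features - features[seed_index], axis=1)

    while len(selected) < n_select:
        # The most "novel" candidate is the one farthest from the selected set.
        next_index = int(np.argmax(min_dist))
        selected.append(next_index)
        new_dist = np.linalg.norm(features - features[next_index], axis=1)
        min_dist = np.minimum(min_dist, new_dist)

    return selected


# Toy 2-D "descriptors" (hypothetical data, not from this work).
toy_features = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [5.0, 0.0]]
print(maxmin_select(toy_features, 3))  # prints [0, 2, 3]
```

In an adaptive scheme of the kind described above, a picker like this would supply the exploration candidates, while a model-driven experimental design criterion would supply the exploitation candidates; how the two are balanced is specific to this work and is not reproduced here.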
