420230 Big Data Analysis for Selecting Clinically Relevant Biomarkers: A Global Optimization Framework

Wednesday, November 11, 2015: 2:36 PM
250A (Salt Palace Convention Center)
Yannis A. Guzman (1,2,3) and Christodoulos A. Floudas (2,3); (1) Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ; (2) Texas A&M Energy Institute, Texas A&M University, College Station, TX; (3) Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, TX

Biomarkers are measurable indicators of biological processes that can be applied in clinical settings for disease diagnosis and prognosis, risk-factor assessment, disease staging, and assessment of treatment efficacy. High-throughput -omics platforms bring both the advantages and the challenges of big data to personalized medicine. Typical studies generate expansive datasets with thousands of candidate biomarkers (i.e., data features). For a biomarker to be accepted into clinical practice, it must pass large-scale, often expensive clinical validation stages [1]; the ultimate success of a discovery-phase biomarker study therefore lies in its ability to produce a small subset of biomarkers with the greatest probability of success in large-scale targeted studies [1,2]. This high data dimensionality per sample is almost always coupled with a comparatively low number of samples in the discovery phase, yielding a statistically difficult feature selection problem with a high probability of overfitting or of selecting data artifacts as meaningful candidates [3].

We present a novel optimization model that selects the optimal subset of candidate biomarkers. The model is self-regularized and can be solved to global optimality to obtain the best subset of biomarkers with respect to the objective function. We demonstrate the utility and power of the model by applying it to four well-known, expansive cancer genomics datasets from the literature [4-7] that study differences between diseased and healthy patients or between different types of cancer. All datasets have an extremely low (< 0.02) samples-to-features ratio. We used an established evaluation protocol [8] to compare our method to the current state of the art [9]. Very small subsets of genes selected by the model classify new samples from each of the disease systems with high sensitivity, specificity, and accuracy, and are selected robustly and stably in the face of random data permutations [8,10,11]. The model constitutes a general methodology that applies to any scenario, biomedical or otherwise, where an optimally descriptive set of features must be extracted from a vast pool of mostly irrelevant possibilities.
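
As an illustration of how selection stability can be quantified, the following is a minimal Python sketch of the consistency index of Kuncheva [11], averaged over all pairs of selected subsets. It is not the optimization model presented here, and the abstract does not state exactly how stability was computed; the function names, the equal-subset-size requirement, and the toy inputs are assumptions made for illustration only.

```python
from itertools import combinations


def kuncheva_index(subset_a, subset_b, n_features):
    """Consistency index for two equal-size feature subsets.

    For subsets of size k drawn from n features with r features in common:
        I_C = (r * n - k**2) / (k * (n - k))
    The value is 1 for identical subsets and is near 0 when the overlap
    is what random selection would be expected to produce.
    """
    a, b = set(subset_a), set(subset_b)
    k = len(a)
    if len(b) != k:
        raise ValueError("subsets must have the same size")
    if not 0 < k < n_features:
        raise ValueError("subset size must satisfy 0 < k < n_features")
    r = len(a & b)  # number of features shared by both subsets
    return (r * n_features - k * k) / (k * (n_features - k))


def stability(subsets, n_features):
    """Average pairwise consistency index over a collection of subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets")
    total = sum(kuncheva_index(a, b, n_features) for a, b in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Hypothetical feature indices selected on three resampled datasets.
    selected = [[0, 1, 2], [1, 2, 3], [0, 1, 2]]
    print(stability(selected, n_features=10))
```

Each subset would typically be the set of features selected on one resampling of the training data; values close to 1 indicate that nearly the same features are chosen every time.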


1. Rifai, N., Gillette, M. A., & Carr, S. A. (2006). Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nature biotechnology, 24(8), 971-983.

2. Srinivas, P. R., Verma, M., Zhao, Y., & Srivastava, S. (2002). Proteomics for cancer biomarker discovery. Clinical chemistry, 48(8), 1160-1169.

3. Rubingh, C. M., Bijlsma, S., Derks, E. P., Bobeldijk, I., Verheij, E. R., Kochhar, S., & Smilde, A. K. (2006). Assessing the performance of statistical validation tools for megavariate metabolomics data. Metabolomics, 2(2), 53-61.

4. Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745-6750.

5. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., ... & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.

6. Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., ... & Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769), 503-511.

7. Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., ... & Sellers, W. R. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer cell, 1(2), 203-209.

8. Abeel, T., Helleputte, T., Van de Peer, Y., Dupont, P., & Saeys, Y. (2010). Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics, 26(3), 392-398.

9. Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine learning, 46(1-3), 389-422.

10. He, Z., & Yu, W. (2010). Stable feature selection for biomarker discovery. Computational biology and chemistry, 34(4), 215-225.

11. Kuncheva, L. I. (2007, February). A stability index for feature selection. In Artificial intelligence and applications (pp. 421-427).
