461205 A Method for Learning a Sparse Classification Model in the Presence of Missing Data

Tuesday, November 15, 2016: 2:00 PM
Carmel I (Hotel Nikko San Francisco)
Kristen Severson, Chemical Engineering, MIT, Cambridge, MA, Brinda Monian, Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, J. Christopher Love, Department of Chemical Engineering, The David H. Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA and Richard D. Braatz, Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA

This work addresses two issues in building classification models: selecting a subset from a large number of possible predictors and learning in the presence of missing data. Computational and interpretability concerns motivate interest in learning sparse models. Modern datasets often have many measurements but few replicates, which can lead to numerical issues. Furthermore, models with many parameters may not provide insight to the analyst. This setting is particularly common in biological and medical studies. Missing data can occur for a wide variety of reasons and is common in the analysis of real datasets. This work focuses on supervised generative binary classification models, specifically linear discriminant analysis (LDA). The parameters are found using an expectation maximization (EM) algorithm [1]. EM both handles missing observations naturally and allows the introduction of priors that promote sparsity. An algorithm that finds a sparse LDA model for datasets with and without missing data is presented.
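To make the EM idea concrete, the following is a minimal sketch of EM for LDA when some feature values are missing. It is not the authors' algorithm: it omits the sparsity-promoting prior entirely, and the function names (`fit_lda_em`, `predict_lda`), the `ridge` regularizer, and the use of NaN to mark missing entries are assumptions made for illustration. The E-step fills each missing block with its conditional Gaussian mean given the observed features and the class label, and the M-step re-estimates the class means and the shared covariance, including the conditional covariance of the filled-in entries.

```python
import numpy as np

def fit_lda_em(X, y, n_iter=50, ridge=1e-6):
    """EM for LDA with a shared covariance; NaN marks a missing entry.

    Illustrative sketch only: no sparsity prior is applied.
    """
    X = np.asarray(X, dtype=float)
    classes, idx = np.unique(y, return_inverse=True)
    n, p = X.shape
    miss = np.isnan(X)

    # Initialise from class-wise means of the observed entries.
    mu = np.vstack([np.nanmean(X[idx == k], axis=0) for k in range(len(classes))])
    X_hat = np.where(miss, mu[idx], X)
    R = X_hat - mu[idx]
    Sigma = R.T @ R / n + ridge * np.eye(p)

    for _ in range(n_iter):
        extra = np.zeros((p, p))  # summed conditional covariances of missing blocks
        # E-step: conditional mean and covariance of missing given observed.
        for i in np.flatnonzero(miss.any(axis=1)):
            m, o = miss[i], ~miss[i]
            k = idx[i]
            if o.any():
                S_oo = Sigma[np.ix_(o, o)]
                S_mo = Sigma[np.ix_(m, o)]
                A = np.linalg.solve(S_oo, S_mo.T).T  # S_mo @ inv(S_oo)
                X_hat[i, m] = mu[k, m] + A @ (X[i, o] - mu[k, o])
                cond_cov = Sigma[np.ix_(m, m)] - A @ S_mo.T
            else:
                X_hat[i, m] = mu[k, m]
                cond_cov = Sigma[np.ix_(m, m)]
            extra[np.ix_(m, m)] += cond_cov
        # M-step: update class means and the pooled covariance.
        mu = np.vstack([X_hat[idx == k].mean(axis=0) for k in range(len(classes))])
        R = X_hat - mu[idx]
        Sigma = (R.T @ R + extra) / n + ridge * np.eye(p)

    priors = np.bincount(idx) / n
    return classes, mu, Sigma, priors

def predict_lda(model, X_new):
    """Classify fully observed rows with the linear discriminant scores."""
    classes, mu, Sigma, priors = model
    W = np.linalg.solve(Sigma, mu.T)  # column k is inv(Sigma) @ mu_k
    b = -0.5 * np.sum(mu.T * W, axis=0) + np.log(priors)
    return classes[np.argmax(np.asarray(X_new, dtype=float) @ W + b, axis=1)]
```

A sparse variant would modify the M-step so that a prior shrinks some discriminant coefficients to zero; the abstract does not specify that prior, so it is left out here.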

To test the algorithm, a case study on the classification of two types of acute leukemia is presented. The dataset consists of gene expression data from a microarray. It is a public benchmark problem and has been widely studied [2]. Missing data is artificially added to be representative of missing data in microarrays [3]. The proposed approach is compared to the nearest shrunken centroids algorithm [4] and sparse linear discriminant analysis [5]. For the comparison methods, missing data is handled with complete case analysis, mean imputation, and k-nearest neighbor imputation, all common approaches in the field. The proposed approach outperforms the alternative methods in this case study.
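The baseline treatments of missing data named above can be sketched as follows. This is an illustrative sketch, not the code used in the study: the helper names are hypothetical, entries are removed completely at random (an assumption, since the abstract does not say how the missingness pattern was generated), and the k-nearest neighbor variant uses an unweighted average of the donor rows.

```python
import numpy as np

def mask_at_random(X, frac, rng):
    """Return a copy of X with roughly `frac` of the entries set to NaN."""
    X = np.array(X, dtype=float)
    X[rng.random(X.shape) < frac] = np.nan
    return X

def complete_cases(X, y):
    """Complete case analysis: keep only rows with no missing entries."""
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], np.asarray(y)[keep]

def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X = np.array(X, dtype=float)
    return np.where(np.isnan(X), np.nanmean(X, axis=0), X)

def knn_impute(X, k=5):
    """Replace each NaN with the average over the k nearest donor rows."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    fallback = mean_impute(X)
    out = X.copy()
    for i in np.flatnonzero(miss.any(axis=1)):
        # Mean squared difference over the columns observed in both rows.
        both = ~miss & ~miss[i]
        diff = np.where(both, X - X[i], 0.0)
        counts = both.sum(axis=1)
        dist = np.full(len(X), np.inf)
        ok = counts > 0
        dist[ok] = (diff[ok] ** 2).sum(axis=1) / counts[ok]
        dist[i] = np.inf  # a row cannot be its own donor
        order = np.argsort(dist)
        for j in np.flatnonzero(miss[i]):
            donors = [r for r in order if np.isfinite(dist[r]) and not miss[r, j]][:k]
            out[i, j] = X[donors, j].mean() if donors else fallback[i, j]
    return out
```

Complete case analysis discards every sample with at least one missing value, which can leave very few samples when there are many measurements per sample; the two imputation functions keep all rows.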

[1] A. P. Dempster, N. M. Laird and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B Methodological, 39: 1-39, 1977.

[2] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286: 531-537, 1999.

[3] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17: 520-525, 2001.

[4] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 99: 6567-6572, 2002.

[5] K. Sjöstrand, L. H. Clemmensen, R. Larsen, and B. Ersbøll. SpaSM: A Matlab toolbox for sparse statistical modeling. Journal of Statistical Software, 2012.

