461205 A Method for Learning a Sparse Classification Model in the Presence of Missing Data
To test the algorithm, a case study on the classification of two types of acute leukemia is presented. The dataset consists of microarray gene expression data; it is a public benchmark problem that has been widely studied [2]. Missing values are introduced artificially so as to be representative of the missing data found in microarrays [3]. The proposed approach is compared with the nearest shrunken centroids algorithm [4] and sparse linear discriminant analysis [5]. Missing data is handled with complete case analysis, mean imputation, and k-nearest neighbor imputation, all common approaches in the field. The proposed approach outperforms the alternative methods.
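The three common missing-data strategies named above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the code used in the study: the function names are hypothetical, missing entries are assumed to be coded as NaN, rows are assumed to be samples and columns genes, and the k-nearest neighbor distance (mean squared difference over co-observed features) is one reasonable choice among several.

```python
import numpy as np


def complete_case(X, y):
    """Complete case analysis: drop every row with at least one missing value."""
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]


def mean_impute(X):
    """Mean imputation: fill each missing entry with its column's observed mean."""
    col_means = np.nanmean(X, axis=0)
    X_filled = X.copy()
    rows, cols = np.where(np.isnan(X_filled))
    X_filled[rows, cols] = col_means[cols]
    return X_filled


def knn_impute(X, k=2):
    """k-nearest neighbor imputation (assumed variant).

    Each missing entry is replaced by the average of that column over the
    k closest rows in which the column is observed. Distance is the mean
    squared difference over the features observed in both rows.
    """
    X_filled = X.copy()
    missing = np.isnan(X)
    n_rows = X.shape[0]
    for i in np.where(missing.any(axis=1))[0]:
        shared = ~missing[i] & ~missing          # features observed in both rows
        diff = np.where(shared, X - X[i], 0.0)   # ignore non-shared features
        counts = shared.sum(axis=1)
        dist = np.full(n_rows, np.inf)
        ok = counts > 0
        dist[ok] = (diff[ok] ** 2).sum(axis=1) / counts[ok]
        dist[i] = np.inf                         # a row is not its own neighbor
        for j in np.where(missing[i])[0]:
            candidates = np.where(~missing[:, j] & np.isfinite(dist))[0]
            if candidates.size == 0:
                continue                         # no usable neighbor: leave NaN
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            X_filled[i, j] = X[nearest, j].mean()
    return X_filled
```

Complete case analysis discards information, which is costly when there are few samples and many genes. The two imputation functions keep every sample but fill the gaps before any classifier is trained.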
[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39: 1-38, 1977.
[2] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286: 531-537, 1999.
[3] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17: 520-525, 2001.
[4] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 99: 6567-6572, 2002.
[5] K. Sjöstrand, L. H. Clemmensen, R. Larsen, and B. Ersbøll. SpaSM: A Matlab toolbox for sparse statistical modeling. Journal of Statistical Software, 2012.