- An Empirical Study of Optimism and Selection Bias in Binary Classification with Microarray Data
- Abstract:
Motivation: Feature subset selection is an important part of binary classification with gene expression data. Once feature subsets have been obtained, the models built from them must be evaluated. This paper considers how best to evaluate the prediction rules formed from these models so that both optimism bias and selection bias, each of which makes misclassification error rates look better than they are, are properly taken into account.
Results: An empirical study is presented in which 10-fold cross-validation is applied (a) internally and (b) externally to the feature selection process. These procedures are applied to three supervised learning algorithms and six published two-class microarray datasets. We find that when no cross-validation is performed, optimism bias is present, but it is generally small. We also find that when feature selection is not repeated at each stage of the cross-validation process, selection bias is present, but again it is generally small. Considering all datasets, classifiers, and gene subset sizes together, the average optimism, selection, and total (optimism plus selection) bias estimates are only 4%, 3%, and 7%, respectively. For five of the six datasets, the misclassification rates and bias estimates were very consistent, suggesting that these results should generalize well to other clinical microarray datasets. The same should hold with respect to classifiers, since the three classifiers used in this study behave in different ways and there is no clear reason to suspect that the results depend on the method of classification.
Availability: Datasets are available from the authors upon request.
Contact: mlecocke@stat.rice.edu and khess@mdanderson.org
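The three error estimates contrasted in the Results paragraph can be illustrated with a small sketch. It is not the authors' code and uses none of their datasets or classifiers. Everything in it is an assumption made for illustration: the simulated pure-noise data, the mean-difference gene ranking, the nearest-centroid classifier, and all function names. It also assumes one reading of the abstract's terms: "internal" cross-validation selects genes once on all samples and cross-validates only the classifier, "external" cross-validation repeats gene selection inside every training fold, optimism bias is the internal estimate minus the resubstitution error, and selection bias is the external estimate minus the internal one.

```python
import random

def make_noise_data(n=40, p=200, seed=0):
    """Simulated data (an assumption): labels are independent of every feature."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]  # balanced two-class labels
    return X, y

def class_means(X, y, rows, feats):
    """Per-class mean of each feature in feats; rows must contain both classes."""
    means = {}
    for c in (0, 1):
        members = [i for i in rows if y[i] == c]
        means[c] = [sum(X[i][j] for i in members) / len(members) for j in feats]
    return means

def select_features(X, y, rows, k):
    """Keep the k features with the largest absolute difference in class means."""
    feats = list(range(len(X[0])))
    m = class_means(X, y, rows, feats)
    return sorted(feats, key=lambda j: -abs(m[0][j] - m[1][j]))[:k]

def n_errors(X, y, train, test, feats):
    """Fit a nearest-centroid rule on train; count misclassified test samples."""
    m = class_means(X, y, train, feats)
    wrong = 0
    for i in test:
        dist = {c: sum((X[i][j] - m[c][t]) ** 2 for t, j in enumerate(feats))
                for c in (0, 1)}
        pred = 0 if dist[0] <= dist[1] else 1
        wrong += int(pred != y[i])
    return wrong

def make_folds(n, n_folds, seed):
    """Shuffle the sample indices and deal them into n_folds disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def bias_estimates(X, y, k=10, n_folds=10, seed=1):
    n = len(y)
    rows = list(range(n))
    # Genes chosen once, using every sample (including future test samples).
    global_feats = select_features(X, y, rows, k)
    resub = n_errors(X, y, rows, rows, global_feats) / n  # no cross-validation
    wrong_internal = wrong_external = 0
    for test in make_folds(n, n_folds, seed):
        held_out = set(test)
        train = [i for i in rows if i not in held_out]
        # Internal CV: reuse the globally selected genes in every fold.
        wrong_internal += n_errors(X, y, train, test, global_feats)
        # External CV: redo gene selection using the training fold only.
        fold_feats = select_features(X, y, train, k)
        wrong_external += n_errors(X, y, train, test, fold_feats)
    internal, external = wrong_internal / n, wrong_external / n
    return {
        "resubstitution": resub,
        "internal_cv": internal,
        "external_cv": external,
        "optimism": internal - resub,
        "selection": external - internal,
        "total": external - resub,
    }

if __name__ == "__main__":
    X, y = make_noise_data()
    for name, value in bias_estimates(X, y).items():
        print(f"{name:>15}: {value:.3f}")
```

Because the simulated labels carry no signal, any apparent accuracy in the resubstitution or internal estimates is an artifact of reusing samples. This is a deliberately extreme setting, and the bias values it produces should not be compared with the averages reported for the six real datasets.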
- Subject Area:
- Categorical Data Analysis, Microarrays, Multivariate Analysis
- Suggested Citation:
- Michael L. Lecocke and Kenneth Hess,
"An Empirical Study of Optimism and Selection Bias in Binary Classification with Microarray Data"
(December 2004).
UT MD Anderson Cancer Center Department of Biostatistics Working Paper Series.
Working Paper 3.
http://www.bepress.com/mdandersonbiostat/paper3