Classifying Gene Expression Profiles from Pairwise mRNA Comparisons

Donald Geman, Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute and Department of Applied Mathematics and Statistics, Johns Hopkins University
Christian d'Avignon, Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute and Department of Biomedical Engineering, Johns Hopkins University
Daniel Q. Naiman, Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute and Department of Applied Mathematics and Statistics, Johns Hopkins University
Raimond L. Winslow, Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute and Department of Biomedical Engineering, Johns Hopkins University

Abstract

We present a new approach to molecular classification based on mRNA comparisons. Our method, referred to as the top-scoring pair(s) (TSP) classifier, is motivated by current technical and practical limitations in using gene expression microarray data for class prediction, for example, to detect disease, identify tumors, or predict treatment response. Accurate statistical inference from such data is difficult because the number of observations, typically tens, is small relative to the number of genes, typically thousands. Moreover, conventional machine learning methods lead to decisions that are usually very difficult to interpret in simple or biologically meaningful terms. In contrast, the TSP classifier provides decision rules that i) involve very few genes and only relative expression values (e.g., comparing the mRNA counts within a single pair of genes); ii) are both accurate and transparent; and iii) provide specific hypotheses for follow-up studies. In particular, the TSP classifier achieves prediction rates on standard cancer data that are as high as those of previous studies that use considerably more genes and more complex procedures. Finally, the TSP classifier is parameter-free, thus avoiding the over-fitting and inflated performance estimates that result when not all aspects of learning a predictor are properly cross-validated.
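To make the idea of a decision rule built from a single within-sample comparison concrete, here is a minimal sketch in Python. The abstract does not spell out how pairs are scored, so the sketch assumes that a pair (i, j) is scored by the absolute difference between the two classes in how often gene i is expressed below gene j, and that ties between pairs go to the first pair encountered. The names `fit_tsp` and `predict_tsp` and the toy data are hypothetical and serve only as illustration.

```python
from itertools import combinations

def fit_tsp(X, y):
    """Pick one gene pair from labeled profiles (illustrative sketch).

    X: list of expression profiles (one list of values per sample).
    y: list of class labels, each 0 or 1.
    Assumed score: |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|.
    """
    rows0 = [row for row, label in zip(X, y) if label == 0]
    rows1 = [row for row, label in zip(X, y) if label == 1]
    best = None
    for i, j in combinations(range(len(X[0])), 2):
        # Fraction of samples in each class where gene i is below gene j.
        p0 = sum(row[i] < row[j] for row in rows0) / len(rows0)
        p1 = sum(row[i] < row[j] for row in rows1) / len(rows1)
        score = abs(p0 - p1)
        # Strict '>' keeps the first pair on ties (an assumption of this sketch).
        if best is None or score > best[0]:
            best = (score, i, j, 0 if p0 > p1 else 1)
    _, i, j, class_if_less = best
    return i, j, class_if_less

def predict_tsp(rule, row):
    """Classify one profile using only the ordering within the chosen pair."""
    i, j, class_if_less = rule
    return class_if_less if row[i] < row[j] else 1 - class_if_less

# Toy data (made up): gene 0 is below gene 1 in class 0 and above it in class 1.
X = [[1, 5, 3], [2, 6, 9], [1, 4, 0], [7, 2, 3], [8, 1, 9], [6, 3, 0]]
y = [0, 0, 0, 1, 1, 1]
rule = fit_tsp(X, y)
predictions = [predict_tsp(rule, row) for row in X]
```

Because the rule depends only on which of two values is larger within a sample, it is unchanged by any monotone rescaling of that sample's measurements, and the sketch has no tuning parameters to cross-validate.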

Submitted: June 14, 2004 · Accepted: August 17, 2004 · Published: August 30, 2004

Recommended Citation

Geman, Donald; d'Avignon, Christian; Naiman, Daniel Q.; and Winslow, Raimond L. (2004) "Classifying Gene Expression Profiles from Pairwise mRNA Comparisons," Statistical Applications in Genetics and Molecular Biology: Vol. 3 : Iss. 1, Article 19.
DOI: 10.2202/1544-6115.1071
Available at: http://www.bepress.com/sagmb/vol3/iss1/art19


ISSN: 1544-6115 ©1999-2010 The Berkeley Electronic Press™ All rights reserved.
