Statistical Applications in Genetics and Molecular Biology
Copyright (c) 2009 Berkeley Electronic Press. All rights reserved.
http://www.bepress.com/sagmb
Recent documents in Statistical Applications in Genetics and Molecular Biology
Thu, 02 Jul 2009 11:27:38 PDT

A Multivariate Growth Curve Model for Ranking Genes in Replicated Time Course Microarray Data
http://www.bepress.com/sagmb/vol8/iss1/art33
Wed, 01 Jul 2009 15:39:40 PDT
The gene ranking problem in time course microarray experiments is challenging because gene expression levels at different time points are correlated: expression values at successive time points are usually taken from the same organism, tissue or culture. Moreover, the time dependency of gene expression values is usually of interest and is often the biological problem that motivates the experiment. We propose a multivariate growth curve model for ranking genes and estimating mean gene expression profiles in replicated time course microarray data. The approach takes the within-individual correlation as well as the temporal ordering into consideration. Moreover, time is incorporated as a continuous variable in the model to account for the temporal pattern. Polynomial profiles are assumed to describe the time dependence, and a transformation incorporating information across the genes is used. A moderated likelihood ratio test is then applied to the transformed data to obtain a statistic for ranking genes according to the difference in expression profiles among biological groups. The methodology is presented in a general setup and can be used for one-sample as well as multi-sample problems. The estimation is done in a multivariate framework in which information from all the groups involved is used for better inference. Moreover, the within-individual correlation as well as information across genes enter the estimation through a moderated covariance matrix.
We assess the performance of our method using simulation studies and illustrate the results with publicly available real time course microarray data.
Jemila S. Hamid
General Biostatistics; Genetics; Longitudinal Data Analysis and Time Series; Microarrays; Multivariate Analysis; Statistical Models

Estimation of Selection Intensity under Overdominance by Bayesian Methods
http://www.bepress.com/sagmb/vol8/iss1/art32
Tue, 30 Jun 2009 19:33:53 PDT
A balanced pattern in the allele frequencies of polymorphic loci is a potential sign of selection, particularly of overdominance. Although this type of selection is of some interest in population genetics, there exist no likelihood-based approaches specifically tailored to make inference on selection intensity. To fill this gap, we present Bayesian methods to estimate selection intensity under k-allele models with overdominance. Our model allows for an arbitrary number of loci and alleles within a locus. The neutral and selected variability within each locus are modeled with corresponding k-allele models. To estimate the posterior distribution of the mean selection intensity in a multilocus region, a hierarchical setup between loci is used. The methods are demonstrated with data at the Human Leukocyte Antigen loci from world-wide populations.
Erkan Ozge Buzbas
General Biostatistics; Genetics; Statistical Models; Statistical Theory and Methods

Model Selection Based on FDR-Thresholding Optimizing the Area under the ROC-Curve
http://www.bepress.com/sagmb/vol8/iss1/art31
Thu, 25 Jun 2009 14:08:55 PDT
We evaluate variable selection by multiple tests controlling the false discovery rate (FDR) to build a linear score for prediction of clinical outcome in high-dimensional data. Quality of prediction is assessed by the receiver operating characteristic curve (ROC) for prediction in independent patients.
Thus we try to combine both goals: prediction and controlled structure estimation. We show that the FDR-threshold which provides the ROC-curve with the largest area under the curve (AUC) varies widely across parameter constellations that are not known in advance. Hence, we investigated a new cross validation procedure based on the maximum rank correlation estimator to determine the optimal selection threshold. This procedure (i) allows choosing an appropriate selection criterion, (ii) provides an estimate of the FDR close to the true FDR and (iii) is simple and computationally feasible for rather moderate to small sample sizes. Low estimates of the cross validated AUC (the estimates generally being positively biased) and large estimates of the cross validated FDR may indicate a lack of sufficiently prognostic variables and/or too small sample sizes. The method is applied to an oncology dataset.
Alexandra C. Graf
General Biostatistics

Adaptive Transmission Disequilibrium Test for Family Trio Design
http://www.bepress.com/sagmb/vol8/iss1/art30
Tue, 23 Jun 2009 13:48:37 PDT
The transmission disequilibrium test (TDT) is a standard method to detect association using the family trio design. It is optimal for an additive genetic model. Other TDT-type tests optimal for recessive and dominant models have also been developed. Association tests using family data, including the TDT-type statistics, have been unified into a class of more comprehensive and flexible family-based association tests (FBAT). TDT-type tests have high efficiency when the genetic model is known or correctly specified, but may lose power if the model is mis-specified. Hence tests that are robust to genetic model mis-specification yet efficient are preferred. The constrained likelihood ratio test (CLRT) and MAX-type tests have been shown to be efficiency robust. In this paper we propose a new efficiency robust procedure, referred to as adaptive TDT (aTDT).
It uses the Hardy-Weinberg disequilibrium coefficient to identify the potential genetic model underlying the data and then applies the TDT-type test (or FBAT for general applications) corresponding to the selected model. Simulation demonstrates that aTDT is efficiency robust to model mis-specifications and generally outperforms the MAX test and CLRT in terms of power. We also show that aTDT has power close to that of the optimal TDT-type test based on a single genetic model, while being much more robust. Applications to real and simulated data from the Genetic Analysis Workshop (GAW) illustrate the use of our adaptive TDT.
Min Yuan
Genetics

A Non-Homogeneous Hidden-State Model on First Order Differences for Automatic Detection of Nucleosome Positions
http://www.bepress.com/sagmb/vol8/iss1/art29
Fri, 19 Jun 2009 13:15:30 PDT
The ability to map individual nucleosomes accurately across genomes enables the study of relationships between dynamic changes in nucleosome positioning/occupancy and gene regulation. However, the highly heterogeneous nature of nucleosome densities across genomes and short linker regions pose challenges in mapping nucleosome positions based on high-throughput microarray data of micrococcal nuclease (MNase) digested DNA. Previous work relies on additional detrending and careful visual examination to detect low-signal nucleosomes, which may exist in a subpopulation of cells. We propose a non-homogeneous hidden-state model based on first-order differences of experimental data along genomic coordinates that bypasses the need for local detrending and can automatically detect nucleosome positions of various occupancy levels. Our proposed approach is applicable to both low- and high-resolution MNase-Chip and MNase-Seq (high-throughput sequencing) data, and is able to map nucleosome-linker boundaries accurately. This automated algorithm is also computationally efficient and requires only a simple preprocessing step.
We provide several examples illustrating the pitfalls of existing methods and the difficulties of detrending the observed hybridization signals, and we demonstrate the advantages of using first-order differences in detecting nucleosome occupancies via simulations and case studies involving MNase-Chip and MNase-Seq data of nucleosome occupancy in the yeast S. cerevisiae.
Pei Fen Kuan
Computational Biology/Bioinformatics

Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data
http://www.bepress.com/sagmb/vol8/iss1/art28
Tue, 09 Jun 2009 11:18:40 PDT
In recent work, several authors have introduced methods for sparse canonical correlation analysis (sparse CCA). Suppose that two sets of measurements are available on the same set of observations. Sparse CCA is a method for identifying sparse linear combinations of the two sets of variables that are highly correlated with each other. It has been shown to be useful in the analysis of high-dimensional genomic data, when two sets of assays are available on the same set of samples. In this paper, we propose two extensions to the sparse CCA methodology. (1) Sparse CCA is an unsupervised method; that is, it does not make use of outcome measurements that may be available for each observation (e.g., survival time or cancer subtype). We propose an extension to sparse CCA, which we call sparse supervised CCA, which results in the identification of linear combinations of the two sets of variables that are correlated with each other and associated with the outcome. (2) It is becoming increasingly common for researchers to collect data on more than two assays on the same set of samples; for instance, SNP, gene expression, and DNA copy number measurements may all be available. We develop sparse multiple CCA in order to extend the sparse CCA methodology to the case of more than two data sets.
We demonstrate these new methods on simulated data and on a recently published and publicly available diffuse large B-cell lymphoma data set.
Daniela M. Witten
Computational Biology/Bioinformatics; General Biostatistics; Genetics; Laboratory and Basic Science Research; Microarrays; Multivariate Analysis; Statistical Models; Statistical Theory and Methods

Bayesian Unsupervised Learning with Multiple Data Types
http://www.bepress.com/sagmb/vol8/iss1/art27
Fri, 05 Jun 2009 15:16:05 PDT
We propose Bayesian generative models for unsupervised learning with two types of data and an assumed dependency of one type of data on the other. We consider two algorithmic approaches, based on a correspondence model, where latent variables are shared across datasets. These models indicate the appropriate number of clusters in addition to indicating relevant features in both types of data. We evaluate the model on artificially created data. We then apply the method to a breast cancer dataset consisting of gene expression and microRNA array data derived from the same patients. We assume partial dependence of gene expression on microRNA expression in this study. The method ranks genes within subtypes which have statistically significant abnormal expression and ranks the associated abnormally expressed microRNAs. We report a genetic signature for the basal-like subtype of breast cancer found across a number of previous gene expression array studies. Using the two algorithmic approaches, we find that this signature also arises from clustering on the microRNA expression data and appears to be derived from these data.
Phaedra Agius
Computation; Computational Biology/Bioinformatics; Microarrays; Statistical Models; Statistical Theory and Methods; Survival Analysis

A Parametric Model for Analyzing Anticipation in Genetically Predisposed Families
http://www.bepress.com/sagmb/vol8/iss1/art26
Tue, 02 Jun 2009 16:09:51 PDT
Anticipation, i.e., a decreasing age-at-onset in subsequent generations, has been observed in a number of genetically triggered diseases. The impact of anticipation is generally studied in affected parent-child pairs. These analyses are restricted to pairs in which both individuals have been affected and are sensitive to right truncation of the data. We propose a normal random effects model that allows for right-censored observations and includes covariates, and draw statistical inference based on the likelihood function. We applied the model to the hereditary nonpolyposis colorectal cancer (HNPCC)/Lynch syndrome family cohort from the national Danish HNPCC register. Age-at-onset was analyzed in 824 individuals from 2-4 generations in 125 families with proven disease-predisposing mutations. A significant effect from anticipation was identified, with a mean of 3 years earlier age-at-onset per generation. The suggested model corrects for incomplete observations and considers families rather than affected pairs, and thereby allows for studies of large sample sets, facilitates subgroup analyses and provides generation effect estimates.
Klaus Larsen
Genetics

Increase of Rejection Rate in Case-Control Studies with the Differential Genotyping Error Rates
http://www.bepress.com/sagmb/vol8/iss1/art25
Thu, 07 May 2009 13:49:21 PDT
Genotyping error adversely affects the statistical power of case-control association studies and introduces bias in the estimated parameters when the same error mechanism and probabilities apply to both affected and unaffected individuals; that is, when there is non-differential genotype misclassification. Simulation studies have shown that differential genotype misclassification leads to a rejection rate that is higher than the nominal significance level (type I error rate) for some tests of association. This study extends previous work by examining this issue analytically using the non-centrality parameter of the asymptotic distribution of the chi-squared test and linear trend test (LTT) when there is no difference between case and control genotype frequencies, but there is differential misclassification with SNP data. The parameters examined are the minor allele frequency (MAF) and sample size. When MAF is less than 0.2, differential genotyping errors lead to a rejection rate much larger than the nominal significance level. As the MAF decreases to zero, the increase in the rejection rate becomes larger. The errors that most increase the rejection rate are differential recording of the more common homozygote as the other homozygote and differential recording of the more common homozygote as the heterozygote. The rejection rate increases as the sample size increases for fixed differential genotyping error rates and nominal significance level for each test.
Kwangmi Ahn
Categorical Data Analysis; Design of Experiments and Sample Surveys; Genetics

Incorporating Duplicate Genotype Data into Linear Trend Tests of Genetic Association: Methods and Cost-Effectiveness
http://www.bepress.com/sagmb/vol8/iss1/art24
Tue, 05 May 2009 10:33:50 PDT
The genome-wide association (GWA) study is an increasingly popular way to attempt to identify the causal variants in human disease. Duplicate genotyping (or re-genotyping) a portion of the samples in a GWA study is common, though it is typical for these data to be ignored in subsequent tests of genetic association. We demonstrate a method for including duplicate genotype data in linear trend tests of genetic association which yields increased power. We also consider the cost-effectiveness of collecting duplicate genotype data and find that when the cost of genotyping relative to phenotyping and sample acquisition is less than or equal to the genotyping error rate, it is more powerful to duplicate genotype the entire sample than to spend the same money on increasing the sample size. Duplicate genotyping is particularly cost-effective when SNP minor allele frequencies are low. Practical advice for the implementation of duplicate genotyping is provided. Free software is provided to compute asymptotic and permutation based tests of association using duplicate genotype data as well as to aid in the duplicate genotyping design decision.
Bryce Borchers
Genetics
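
For readers unfamiliar with the linear trend test (LTT) mentioned in the two entries above, the following is a minimal sketch of the standard Cochran-Armitage trend statistic for a 2 x 3 case-control genotype table. It is not the software described in either article and does not incorporate duplicate genotype data or genotyping error; the function name `linear_trend_test`, the default additive weights (0, 1, 2) and the example counts are assumptions made purely for illustration.

```python
import math


def linear_trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage linear trend test (illustrative sketch).

    cases, controls: genotype counts ordered by number of copies of one
    allele (e.g. AA, Aa, aa). weights: scores per genotype; (0, 1, 2) is
    the usual additive coding and is an assumption here.
    Returns (chi-square statistic on 1 df, asymptotic p-value).
    """
    if not (len(cases) == len(controls) == len(weights)):
        raise ValueError("cases, controls and weights must have equal length")

    totals = [r + s for r, s in zip(cases, controls)]  # column totals n_i
    R = sum(cases)       # number of cases
    S = sum(controls)    # number of controls
    N = R + S            # total sample size

    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * n for w, n in zip(weights, totals))
    sum_w2n = sum(w * w * n for w, n in zip(weights, totals))

    # Trend score T = N * sum(w_i r_i) - R * sum(w_i n_i); its variance
    # under the null is (R * S / N) * (N * sum(w_i^2 n_i) - sum(w_i n_i)^2).
    T = N * sum_wr - R * sum_wn
    denom = R * S * (N * sum_w2n - sum_wn ** 2)
    if denom == 0:
        raise ValueError("statistic undefined: empty group or no genotype variation")

    chi2 = N * T ** 2 / denom
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical genotype counts, for illustration only.
    stat, p = linear_trend_test(cases=[10, 20, 30], controls=[30, 20, 10])
    print(f"chi2 = {stat:.3f}, p = {p:.3g}")
```

When case and control genotype proportions are identical the trend score is zero, so the statistic is 0 and the p-value is 1; departures from the nominal level under differential genotyping error, as studied above, would show up as an inflated rejection rate of this test.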