<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0">
<channel>
<title>Statistical Applications in Genetics and Molecular Biology</title>
<copyright>Copyright (c) 2012 Berkeley Electronic Press All rights reserved.</copyright>
<link>http://www.bepress.com/sagmb</link>
<description>Recent documents in Statistical Applications in Genetics and Molecular Biology</description>
<language>en-us</language>
<lastBuildDate>Sun, 12 Feb 2012 01:36:42 PST</lastBuildDate>
<ttl>3600</ttl>


	
		
	

	
		
	

	
		
	







<item>
<title>Graph Selection with GGMselect</title>
<link>http://www.bepress.com/sagmb/vol11/iss3/art3</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss3/art3</guid>
<pubDate>Fri, 10 Feb 2012 10:30:38 PST</pubDate>
<description>
	<![CDATA[
	<p>Applications to the inference of biological networks have raised strong interest in the problem of graph estimation in high-dimensional Gaussian graphical models. To handle this problem, we propose a two-stage procedure which first builds a family of candidate graphs from the data, and then selects one graph among this family according to a dedicated criterion. This estimation procedure is shown to be consistent in a high-dimensional setting, and its risk is controlled by a non-asymptotic oracle-like inequality. The procedure is tested on a real data set concerning gene expression data, and its performance is assessed on the basis of a large numerical study.</p>
<p>The procedure is implemented in the R package GGMselect, available on CRAN.</p>

	]]>
</description>

<author>Christophe Giraud et al.</author>


<category>Statistical Theory and Methods</category>

</item>






<item>
<title>Sample Size Calculations for Designing Clinical Proteomic Profiling Studies Using Mass Spectrometry</title>
<link>http://www.bepress.com/sagmb/vol11/iss3/art2</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss3/art2</guid>
<pubDate>Fri, 10 Feb 2012 10:30:35 PST</pubDate>
<description>
	<![CDATA[
	<p>In cancer clinical proteomics, MALDI and SELDI profiling are used to search for biomarkers of potentially curable early-stage disease. A given number of samples must be analysed in order to detect clinically relevant differences between cancers and controls, with adequate statistical power. From clinical proteomic profiling studies, expression data for each peak (protein or peptide) from two or more clinically defined groups of subjects are typically available. Typically, both exposure and confounder information on each subject are also available, and usually the samples are not from randomized subjects. Moreover, the data are usually available in replicate. At the design stage, however, covariates are not typically available and are often ignored in sample size calculations. This leads to the use of insufficient numbers of samples and reduced power when there are imbalances in the numbers of subjects between different phenotypic groups. A method is proposed for accommodating information on covariates, data imbalances and design characteristics, such as the technical replication and the observational nature of these studies, in sample size calculations. It assumes knowledge of a joint distribution for the protein expression values and the covariates. When discretized covariates are considered, the effect of the covariates enters the calculations as a function of the proportions of subjects with specific attributes. This makes it relatively straightforward (even when pilot data on subject covariates are unavailable) to specify and to adjust for the effect of the expected heterogeneities. The new method suggests certain experimental designs which lead to the use of a smaller number of samples when planning a study. Analysis of data from the proteomic profiling of colorectal cancer reveals that fewer samples are needed when a study is balanced than when it is unbalanced, and when the IMAC30 chip-type is used.
The method is implemented in the clippda package and is available in R at: http://www.bioconductor.org/help/bioc-views/release/bioc/html/clippda.html.</p>

	]]>
</description>

<author>Stephen O. Nyangoma et al.</author>


<category>Computational Biology/Bioinformatics</category>

</item>






<item>
<title>A New Approach for the Joint Analysis of Multiple ChIP-Seq Libraries with Application to Histone Modification</title>
<link>http://www.bepress.com/sagmb/vol11/iss3/art1</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss3/art1</guid>
<pubDate>Fri, 10 Feb 2012 10:30:31 PST</pubDate>
<description>
	<![CDATA[
	<p>Most approaches for analyzing ChIP-Seq data are focused on inferring exact protein binding sites from a single library. However, multiple ChIP-Seq libraries derived from differing cell lines or tissue types from the same individual are frequently available. In such a situation, a separate analysis for each tissue or cell line may be inefficient. Here, we describe a novel method to analyze such data that intelligently uses the joint information from multiple related ChIP-Seq libraries. We present our method as a two-stage procedure. First, a separate single cell line analysis is performed for each cell line. Here, we use a novel mixture regression approach to infer the subset of genes that are most likely to be involved in protein binding in each cell line. In the second step, we combine the separate single cell line analyses using an Empirical Bayes algorithm that implicitly incorporates inter-cell line correlation. We demonstrate the usefulness of our method using both simulated data and real H3K4me3 and H3K27me3 histone methylation libraries.</p>

	]]>
</description>

<author>John P. Ferguson et al.</author>


<category>Categorical Data Analysis</category>

<category>Computational Biology/Bioinformatics</category>

<category>Multivariate Analysis</category>

<category>Statistical Models</category>

</item>






<item>
<title>The Inheritance Procedure: Multiple Testing of Tree-structured Hypotheses</title>
<link>http://www.bepress.com/sagmb/vol11/iss1/art11</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss1/art11</guid>
<pubDate>Sat, 21 Jan 2012 19:49:38 PST</pubDate>
<description>
	<![CDATA[
	<p>Hypothesis tests in bioinformatics can often be set in a tree structure in a very natural way, e.g. when tests are performed at probe, gene, and chromosome level. Exploiting this graph structure in a multiple testing procedure may result in a gain in power or increased interpretability of the results.</p>
<p>We present the inheritance procedure, a method of familywise error control for hypotheses structured in a tree. The method starts testing at the top of the tree, following up on those branches in which it finds significant results, and following up on leaf nodes in the neighborhood of those leaves. The method is a uniform improvement over a recently proposed method by Meinshausen. The inheritance procedure has been implemented in the globaltest package which is available on www.bioconductor.org.</p>

	]]>
</description>

<author>Jelle J. Goeman et al.</author>


<category>Genetics</category>

<category>Microarrays</category>

<category>Statistical Theory and Methods</category>

</item>






<item>
<title>Optimality Criteria for the Design of 2-Color Microarray Studies</title>
<link>http://www.bepress.com/sagmb/vol11/iss1/art10</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss1/art10</guid>
<pubDate>Fri, 13 Jan 2012 18:15:22 PST</pubDate>
<description>
	<![CDATA[
	<p>We discuss the definition and application of design criteria for evaluating the efficiency of 2-color microarray designs. First, we point out that design optimality criteria are defined differently for the regression and block design settings. This has caused some confusion in the literature and warrants clarification. Linear models for microarray data analysis have equivalent formulations as ANOVA or regression models. However, this equivalence does not extend to design criteria. We discuss optimality criteria, and argue against applying regression-style D-optimality to the microarray design problem. We further disfavor E- and D-optimality (as defined in block design) because they are not attuned to scientific questions of interest.</p>

	]]>
</description>

<author>Kathleen F. Kerr</author>


<category>Microarrays</category>

</item>






<item>
<title>Improving Pedigree-based Linkage Analysis by Estimating Coancestry Among Families</title>
<link>http://www.bepress.com/sagmb/vol11/iss2/art11</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss2/art11</guid>
<pubDate>Fri, 06 Jan 2012 12:39:23 PST</pubDate>
<description>
	<![CDATA[
	<p>We present a method for improving the power of linkage analysis by detecting chromosome segments shared identical by descent (IBD) by individuals not known to be related.  Existing Markov chain Monte Carlo methods sample descent patterns on pedigrees conditional on observed marker data. These patterns can be stored as IBD graphs, which express shared ancestry only, rather than specific family relationships.  A model for IBD between unrelated individuals allows the estimation of coancestry between individuals in different pedigrees. IBD graphs on separate pedigrees can then be combined using these estimates. We report results from analyses of three sets of simulated marker data on two different pedigrees.  We show that when families share a gene for a trait due to shared ancestry on the order of tens of generations, our method can detect a linkage signal when independent analyses of the families do not.</p>

	]]>
</description>

<author>Chris Glazner et al.</author>


<category>Statistical Models</category>

</item>






<item>
<title>Candidate Pathway Based Analysis for Cleft Lip with or without Cleft Palate</title>
<link>http://www.bepress.com/sagmb/vol11/iss2/art10</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss2/art10</guid>
<pubDate>Fri, 06 Jan 2012 12:39:21 PST</pubDate>
<description>
	<![CDATA[
	<p>The objective of this research was to identify potential biological pathways associated with non-syndromic cleft lip with or without cleft palate (NSCL/P), and to explore the potential biological mechanisms underlying these associated pathways on risk of NSCL/P. This project was based on the dataset of a previously published genome-wide association (GWA) study on NSCL/P (Beaty et al. 2010). Case-parent trios used here originated from an international consortium (The Gene, Environment Association Studies consortium, GENEVA) formed in 2007. A total of 5,742 individuals from 1,908 CL/P case-parent trios (1,591 complete trios and 317 incomplete trios where one parent was missing) were collected and genotyped using the Illumina Human610-Quad array. Candidate pathways were selected using a list of 356 genes that may be related to oral clefts. In total, 42 candidate pathways, which included 1,564 genes and 40,208 SNPs, were tested. Using a pathway-based analysis approach proposed by Wang et al. (2007), we conducted a permutation-based test to assess the statistical significance of the nominal p-values of the 42 candidate pathways. The analysis revealed several pathways yielding nominally significant p-values. However, after controlling for the familywise error rate, none of these pathways retained statistical significance. Nominal p-values of these pathways were concentrated at the lower tail of the distribution, with more low p-values than expected. A permutation-based test for examining this type of distribution pattern yielded an overall p-value of 0.029. Thus, while this pathway-based analysis did not yield a clear significant result for any particular pathway, we conclude that one or more of the genes and pathways considered here likely do play a role in oral clefting.</p>

	]]>
</description>

<author>Tian-Xiao Zhang et al.</author>


<category>Computational Biology/Bioinformatics</category>

<category>Genetics</category>

</item>






<item>
<title>A Model-Based Analysis to Infer the Functional Content of a Gene List</title>
<link>http://www.bepress.com/sagmb/vol11/iss2/art9</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss2/art9</guid>
<pubDate>Fri, 06 Jan 2012 12:39:18 PST</pubDate>
<description>
	<![CDATA[
	<p>An important challenge in statistical genomics concerns integrating experimental data with exogenous information about gene function. A number of statistical methods are available to address this challenge, but most do not accommodate complexities in the functional record.  To infer activity of a functional category (e.g., a gene ontology term), most methods use gene-level data on that category, but do not use other functional properties of the same genes.  Not doing so creates undue errors in inference. Recent developments in model-based category analysis aim to overcome this difficulty, but in attempting to do so they are faced with serious computational problems.  This paper investigates statistical properties and the structure of posterior computation in one such model for the analysis of functional category data.  We examine the graphical structures underlying posterior computation in the original parameterization and in a new parameterization aimed at leveraging elements of the model.  We characterize identifiability of the underlying activation states, describe a new prior distribution, and introduce approximations that aim to support numerical methods for posterior inference.</p>

	]]>
</description>

<author>Michael A. Newton et al.</author>


<category>Computation</category>

<category>Computational Biology/Bioinformatics</category>

<category>Genetics</category>

</item>






<item>
<title>Querying Genomic Databases: Refining the Connectivity Map</title>
<link>http://www.bepress.com/sagmb/vol11/iss2/art8</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss2/art8</guid>
<pubDate>Fri, 06 Jan 2012 12:39:15 PST</pubDate>
<description>
	<![CDATA[
	<p>The advent of high-throughput biotechnologies, which can efficiently measure gene expression on a global basis, has led to the creation and population of correspondingly rich databases and compendia.  Such repositories have the potential to add enormous scientific value beyond that provided by individual studies which, due largely to cost considerations, are typified by small sample sizes. Accordingly, substantial effort has been invested in devising analysis schemes for utilizing gene-expression repositories. Here, we focus on one such scheme, the <em>Connectivity Map</em> (cmap), that was developed with the express purpose of identifying drugs with putative efficacy against a given disease, where the disease in question is characterized by a (differential) gene-expression signature. Initial claims surrounding cmap intimated that such tools might lead to new, previously unanticipated applications of existing drugs.  However, further application suggests that its primary utility is in connecting a disease condition whose biology is largely unknown to a drug whose mechanisms of action are well understood, making cmap a tool for enhancing biological knowledge.</p>
<p>The success of the Connectivity Map is belied by its simplicity. The aforementioned signature serves as an unordered query which is applied to a customized database of (differential) gene-expression experiments designed to elicit response to a wide range of drugs, across a spectrum of concentrations, durations, and cell lines. Such application is effected by computing a per-experiment score that measures "closeness" between the signature and the experiment. Top-scoring experiments, and the attendant drug(s), are then deemed relevant to the disease underlying the query. Inference supporting such elicitations is pursued via re-sampling. In this paper, we revisit two key aspects of the Connectivity Map implementation. Firstly, we develop new approaches to measuring closeness for the common scenario wherein the query constitutes an <em>ordered</em> list. These involve using metrics proposed for analyzing <em>partially</em> ranked data, which are of interest in their own right and not widely used. Secondly, we advance an alternate inferential approach based on generating empirical null distributions that exploit the scope, and capture dependencies, embodied by the database. Using these refinements we undertake a comprehensive re-evaluation of Connectivity Map findings that, in general terms, reveals that accommodating ordered queries is less critical than the mode of inference.</p>

	]]>
</description>

<author>Mark R. Segal et al.</author>


<category>Computational Biology/Bioinformatics</category>

<category>Microarrays</category>

<category>Statistical Theory and Methods</category>

</item>






<item>
<title>Adjusting for Spurious Gene-by-Environment Interaction Using Case-Parent Triads</title>
<link>http://www.bepress.com/sagmb/vol11/iss2/art7</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss2/art7</guid>
<pubDate>Fri, 06 Jan 2012 12:39:11 PST</pubDate>
<description>
	<![CDATA[
	<p>In the case-parent trio design, unrelated children affected with a disease are genotyped along with their parents.  Information may also be collected on environmental factors in the children.  The design permits estimation and testing of genetic effects and gene-by-environment interaction.  Recently, it has been demonstrated that when genotypes are measured at a non-causal test locus, population stratification can create spurious interaction. That is, the environmental factor can appear to modify the disease risk associated with genotypes at the test locus without modifying the disease risk of genotypes at the causal locus.  One design-based approach that is robust to spurious interaction requires the environmental factor to also be available on an unaffected sibling of the affected child.  We explore the source of spurious interaction and suggest an alternate approach that mitigates its effects using case-parent triads.  Our approach is based on adjusting the risk model using ancestry informative markers or random markers measured on the affected child and does not require data on unaffected siblings.  We apply an approach to generating case-parent data, implemented in a freely-available R package soon to be released on the Comprehensive R Archive Network (CRAN).</p>

	]]>
</description>

<author>Ji-Hyung Shin et al.</author>


<category>Epidemiology</category>

<category>Genetics</category>

</item>






<item>
<title>A Family-Based Probabilistic Method for Capturing De Novo Mutations from High-Throughput Short-Read Sequencing Data</title>
<link>http://www.bepress.com/sagmb/vol11/iss2/art6</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss2/art6</guid>
<pubDate>Fri, 06 Jan 2012 12:39:07 PST</pubDate>
<description>
	<![CDATA[
	<p>Recent advances in high-throughput DNA sequencing technologies and associated statistical analyses have enabled in-depth analysis of whole-genome sequences. As this technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes. The presence of a <em>de novo</em> mutation within the pedigree is indicated by a violation of Mendelian inheritance laws. Here, we present a method for probabilistically inferring genotypes across a pedigree using high-throughput sequencing data and producing the posterior probability of <em>de novo</em> mutation at each genomic site examined. This framework can be used to disentangle the effects of germline and somatic mutational processes and to simultaneously estimate the effect of sequencing error and the initial genetic variation in the population from which the founders of the pedigree arise. This approach is examined in detail through simulations and areas for method improvement are noted. By applying this method to data from members of a well-defined nuclear family with accurate pedigree information, the stage is set to make the most direct estimates of the human mutation rate to date.</p>

	]]>
</description>

<author>Reed A. Cartwright et al.</author>


<category>Genetics</category>

<category>Statistical Models</category>

</item>






<item>
<title>Bayesian Sparsity-Path-Analysis of Genetic Association Signal using Generalized t Priors</title>
<link>http://www.bepress.com/sagmb/vol11/iss2/art5</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss2/art5</guid>
<pubDate>Fri, 06 Jan 2012 12:39:04 PST</pubDate>
<description>
	<![CDATA[
	<p>We explore the use of generalized t priors on regression coefficients to help understand the nature of association signal within “hit regions” of genome-wide association studies. The particular generalized t distribution we adopt is a Student distribution on the absolute value of its argument. For low degrees of freedom, we show that the generalized t exhibits “sparsity-prior” properties with some attractive features over other common forms of sparse priors and includes the well-known double-exponential distribution as the degrees of freedom tend to infinity. We pay particular attention to graphical representations of posterior statistics obtained from sparsity-path-analysis (SPA) where we sweep over the setting of the scale (shrinkage/precision) parameter in the prior to explore the space of posterior models obtained over a range of complexities, from very sparse models with all coefficient distributions heavily concentrated around zero, to models with diffuse priors and coefficients distributed around their maximum likelihood estimates. The SPA plots are akin to LASSO plots of maximum a posteriori (MAP) estimates but they characterise the complete marginal posterior distributions of the coefficients plotted as a function of the precision of the prior. Generating posterior distributions over a range of prior precisions is computationally challenging but naturally amenable to sequential Monte Carlo (SMC) algorithms indexed on the scale parameter. We show how SMC simulation on graphics processing units (GPUs) provides very efficient inference for SPA. We also present a scale-mixture representation of the generalized t prior that leads to an expectation-maximization (EM) algorithm to obtain MAP estimates should only these be required.</p>

	]]>
</description>

<author>Anthony Lee et al.</author>


<category>General Biostatistics</category>

<category>Statistical Models</category>

<category>Statistical Theory and Methods</category>

</item>






<item>
<title>Principal Components of Heritability for High Dimension Quantitative Traits and General Pedigrees</title>
<link>http://www.bepress.com/sagmb/vol11/iss2/art4</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss2/art4</guid>
<pubDate>Fri, 06 Jan 2012 12:38:59 PST</pubDate>
<description>
	<![CDATA[
	<p>For many complex disorders, genetically relevant disease definition is still unclear. For this reason, researchers tend to collect large numbers of items related directly or indirectly to the disease diagnosis. Since the measured traits may not all be influenced by genetic factors, researchers are faced with the problem of choosing which traits or combinations of traits to consider in linkage analysis. To combine items, one can subject the data to a principal component analysis. However, when family data are collected, principal component analysis does not take family structure into account. In order to deal with these issues, Ott & Rabinowitz (1999) introduced the principal components of heritability (PCH), which capture the familial information across traits by calculating linear combinations of traits that maximize heritability. The calculation of the PCHs is based on the estimation of the genetic and the environmental components of variance. In the genetic context, the standard estimators of the variance components are Lange's maximum likelihood estimators, which require complex numerical calculations. The objectives of this paper are the following: i) to review some standard strategies available in the literature to estimate variance components for unbalanced data in mixed models; ii) to propose an ANOVA method for a genetic random effect model to estimate the variance components, which can be applied to general pedigrees and high dimensional family data within the PCH framework; iii) to elucidate the connection between PCH analysis and Linear Discriminant Analysis. We use computer simulations to show that the proposed method has asymptotic properties similar to those of Lange's method when the number of traits is small, and we study the efficiency of our method when the number of traits is large. Finally, a data analysis involving schizophrenia and bipolar quantitative traits is presented to illustrate the PCH methodology.</p>

	]]>
</description>

<author>Karim Oualkacha et al.</author>


<category>General Biostatistics</category>

<category>Genetics</category>

<category>Multivariate Analysis</category>

<category>Statistical Models</category>

</item>






<item>
<title>Gene Filtering in the Analysis of Illumina Microarray Experiments</title>
<link>http://www.bepress.com/sagmb/vol11/iss2/art3</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss2/art3</guid>
<pubDate>Fri, 06 Jan 2012 12:17:33 PST</pubDate>
<description>
	<![CDATA[
	<p>Illumina bead arrays are microarrays that contain a random number of technical replicates (beads) for every probe (bead type) within the same array. Typically around 30 beads are placed at random positions on the array surface, which opens unique opportunities for quality control. Most preprocessing methods for Illumina bead arrays are ported from the Affymetrix microarray platform and ignore the availability of the technical replicates. The large number of beads for a particular bead type on the same array, however, should be highly correlated; otherwise they just measure noise and can be removed from the downstream analysis. Hence, filtering bead types can be considered an important step of the preprocessing procedure for the Illumina platform. This paper proposes a filtering method for Illumina bead arrays, which builds upon the mixed model framework. Bead types are called informative/non-informative (I/NI) based on a trade-off between within-array and between-array variability. The method is illustrated on a publicly available Illumina Spike-in data set (Dunning et al., 2008), and we also show that filtering results in a more powerful analysis of differentially expressed genes.</p>

	]]>
</description>

<author>Anyiawung Chiara Forcheh et al.</author>


<category>Computational Biology/Bioinformatics</category>

<category>Microarrays</category>

</item>






<item>
<title>A Generalized Hidden Markov Model for Determining Sequence-based Predictors of Nucleosome Positioning</title>
<link>http://www.bepress.com/sagmb/vol11/iss2/art2</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss2/art2</guid>
<pubDate>Fri, 06 Jan 2012 12:17:30 PST</pubDate>
<description>
	<![CDATA[
	<p>Chromatin structure, in terms of positioning of nucleosomes and nucleosome-free regions in the DNA, has been found to have an immense impact on various cell functions and processes, ranging from transcriptional regulation to growth and development. Although numerous experimental and computational approaches have been developed in the past few years to determine the intrinsic relationship between chromatin structure (nucleosome positioning) and DNA sequence features, there is as yet no universally accurate approach to predict nucleosome positioning from the underlying DNA sequence alone. We here propose an alternative approach to predicting nucleosome positioning from sequence, making use of characteristic sequence differences and inherent dependencies in overlapping sequence features. Our nucleosomal positioning prediction algorithm, based on the idea of generalized hierarchical hidden Markov models (HGHMMs), was used to predict nucleosomal state based on the DNA sequence in yeast chromosome III, and compared with two other existing methods. The HGHMM method performed favorably among the three models in terms of specificity and sensitivity, and provided estimates that were largely consistent with predictions from the method of Yuan and Liu (2008). However, all the methods still give higher than desirable misclassification rates, indicating that sequence-based features may provide only limited information towards understanding positioning of nucleosomes. The method is implemented in the open-source statistical software R, and is freely available from the authors’ website.</p>

	]]>
</description>

<author>Carlee Moser et al.</author>


<category>Computational Biology/Bioinformatics</category>

</item>






<item>
<title>Special Issue on Computational Statistical Methods for Genomics and Systems Biology</title>
<link>http://www.bepress.com/sagmb/vol11/iss2/art1</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss2/art1</guid>
<pubDate>Fri, 06 Jan 2012 12:17:28 PST</pubDate>
<description>
	<![CDATA[
	<p>We provide a brief editorial introduction to a special issue of <i>Statistical Applications in Genetics and Molecular Biology</i> dedicated to the workshop on "Computational Statistical Methods for Genomics and Systems Biology", held at the Centre de recherches mathématiques in Montreal in April 2011.</p>

	]]>
</description>

<author>Aurélie Labbe et al.</author>


<category>Computational Biology/Bioinformatics</category>

<category>General Biostatistics</category>

<category>Genetics</category>

<category>Microarrays</category>

<category>Multivariate Analysis</category>

<category>Statistical Models</category>

<category>Statistical Theory and Methods</category>

</item>






<item>
<title>Stopping-Time Resampling and Population Genetic Inference under Coalescent Models</title>
<link>http://www.bepress.com/sagmb/vol11/iss1/art9</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss1/art9</guid>
<pubDate>Fri, 06 Jan 2012 12:00:27 PST</pubDate>
<description>
	<![CDATA[
	<p>To extract full information from samples of DNA sequence data, it is necessary to use sophisticated model-based techniques such as importance sampling under the coalescent. However, these are limited in the size of datasets they can handle efficiently. Chen and Liu (2000) introduced the idea of <em>stopping-time resampling</em> and showed that it can dramatically improve the efficiency of importance sampling methods under a finite-alleles coalescent model. In this paper, a new framework is developed for designing stopping-time resampling schemes under more general models. It is implemented on data both from infinite sites and stepwise models of mutation, and extended to incorporate crossover recombination. A simulation study shows that this new framework offers a substantial improvement in the accuracy of likelihood estimation over a range of parameters, while a direct application of the scheme of Chen and Liu (2000) can actually diminish the estimate. The method imposes no additional computational burden and is robust to the choice of parameters.</p>

	]]>
</description>

<author>Paul A. Jenkins</author>


<category>Genetics</category>

<category>Statistical Models</category>

<category>Statistical Theory and Methods</category>

</item>






<item>
<title>A Mixture-Model Approach for Parallel Testing for Unequal Variances</title>
<link>http://www.bepress.com/sagmb/vol11/iss1/art8</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss1/art8</guid>
<pubDate>Fri, 06 Jan 2012 12:00:23 PST</pubDate>
<description>
	<![CDATA[
	<p>Testing for unequal variances is usually performed in order to check the validity of the assumptions that underlie standard tests for differences between means (the t-test and ANOVA). However, existing methods for testing for unequal variances (Levene's test and Bartlett's test) are notoriously non-robust to normality assumptions, especially for small sample sizes. Moreover, although these methods were designed to deal with one hypothesis at a time, modern applications (such as to microarrays and fMRI experiments) often involve parallel testing over a large number of levels (genes or voxels). In these settings, a shift in variance may also be biologically relevant, perhaps even more so than a change in the mean. This paper proposes a parsimonious model for parallel testing of the equal variance hypothesis. It is designed to work well when the number of tests is large, typically much larger than the sample sizes. The tests are implemented using an empirical Bayes estimation procedure which 'borrows information' across levels. The method is shown to be quite robust to deviations from normality, and to substantially increase the power to detect differences in variance over the more traditional approaches even when the normality assumption is valid.</p>

	]]>
</description>

<author>Haim Y. Bar et al.</author>


<category>General Biostatistics</category>

<category>Microarrays</category>

<category>Statistical Models</category>

</item>






<item>
<title>Fast Identification of Biological Pathways Associated with a Quantitative Trait Using Group Lasso with Overlaps</title>
<link>http://www.bepress.com/sagmb/vol11/iss1/art7</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss1/art7</guid>
<pubDate>Fri, 06 Jan 2012 12:00:19 PST</pubDate>
<description>
	<![CDATA[
	<p>Where causal SNPs (single nucleotide polymorphisms) tend to accumulate within biological pathways, the incorporation of prior pathways information into a statistical model is expected to increase the power to detect true associations in a genetic association study. Most existing pathways-based methods rely on marginal SNP statistics and do not fully exploit the dependence patterns among SNPs within pathways.</p>
<p>We use a sparse regression model, with SNPs grouped into pathways, to identify causal pathways associated with a quantitative trait. Notable features of our “pathways group lasso with adaptive weights” (P-GLAW) algorithm include the incorporation of all pathways in a single regression model, an adaptive pathway weighting procedure that accounts for factors biasing pathway selection, and the use of a bootstrap sampling procedure for the ranking of important pathways. P-GLAW takes account of the presence of overlapping pathways and uses a novel combination of techniques to optimise model estimation, making it fast to run, even on whole genome datasets.</p>
<p>In a comparison study with an alternative pathways method based on univariate SNP statistics, our method demonstrates high sensitivity and specificity for the detection of important pathways, showing the greatest relative gains in performance where marginal SNP effect sizes are small.</p>

	]]>
</description>

<author>Matt Silver et al.</author>


<category>Statistical Models</category>

</item>






<item>
<title>MicroRNA Transcription Start Site Prediction with Multi-objective Feature Selection</title>
<link>http://www.bepress.com/sagmb/vol11/iss1/art6</link>
<guid isPermaLink="true">http://www.bepress.com/sagmb/vol11/iss1/art6</guid>
<pubDate>Fri, 06 Jan 2012 12:00:13 PST</pubDate>
<description>
	<![CDATA[
	<p>MicroRNAs (miRNAs) are non-coding, short (21-23 nt) regulators of protein-coding genes that are generally transcribed first into primary miRNA (pri-miR), followed by the generation of precursor miRNA (pre-miR). This finally leads to the production of the mature miRNA. A large amount of information is available on the pre- and mature miRNAs. However, very little is known about the pri-miRs, due to a lack of knowledge about their transcription start sites (TSSs). Based on the genomic loci, miRNAs can be categorized into two types: intragenic (intra-miR) and intergenic (inter-miR). While it is well established that intra-miRs are commonly transcribed in conjunction with their host genes, the transcription machinery of inter-miRs is poorly understood. Although it is assumed that miRNA promoters are similar in structure to gene promoters, since both are transcribed by RNA polymerase II (Pol II), computational validations show that gene promoter prediction methods perform poorly on miRNAs. In this paper, we concentrate on the problem of TSS prediction for miRNAs. The present study begins with the identification of positive and negative promoter samples from recently published data stemming from RNA-sequencing studies. From these samples of experimentally validated miRNA TSSs, a number of standard sequence features are extracted. Furthermore, to account for potential footprints related to promoter regulation by CpG dinucleotide targeted DNA methylation, a number of novel features are defined. We develop a support vector machine (SVM) with RBF kernel for the prediction of miRNA TSSs trained on human miRNA promoters. A novel feature reduction technique based on archived multi-objective simulated annealing (AMOSA) identifies the final set of features. The resulting model trained on miRNA promoters shows improved performance over the one trained on protein-coding gene promoters in terms of classification accuracy, sensitivity and specificity. Results are also reported for a completely independent biologically validated test set. In one part of the investigation, the proposed approach is used to predict protein-coding gene TSSs. It shows a significantly improved performance when compared to previously published gene TSS prediction methods.</p>

	]]>
</description>

<author>Malay Bhattacharyya et al.</author>


<category>Computational Biology/Bioinformatics</category>

</item>





</channel>
</rss>

