MLBio+Laboratory Machine Learning in Biomedical Informatics



Bioinformatics

Bioinformatics - RSS feed of articles
Updated: 1 year 24 weeks ago

Fast simulation of reconstructed phylogenies under global, time-dependent birth-death processes

Sat, 03/30/2013 - 01:10

Motivation: Diversification rates and patterns may be inferred from reconstructed phylogenies. Both the time-dependent and the diversity-dependent birth-death process can produce the same observed patterns of diversity over time. To develop and test new models describing the macro-evolutionary process of diversification, generic and fast algorithms to simulate under these models are necessary. Simulations are not only important for testing and developing models but also play an influential role in the assessment of model fit.

Results: In the present paper I consider as the model a global, time-dependent birth-death process where each species has the same rates but rates may vary over time. For this model I derive the likelihood of the speciation times from a reconstructed phylogenetic tree and show that each speciation event is independent and identically distributed. This fact can be used to simulate efficiently reconstructed phylogenetic trees when conditioning on the number of species, the time of the process or both. I show the usability of the simulation by approximating the posterior predictive distribution of a birth-death process with decreasing diversification rates applied on a published bird phylogeny (family Cettiidae).

Availability: The methods described in this manuscript are implemented in the R package TESS, available from the repository CRAN (http://cran.r-project.org/web/packages/TESS/).

Contact: hoehna@math.su.se
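
As a rough illustration of the i.i.d. property described in the Results above, the sketch below draws speciation times by inverse-transform sampling for the constant-rate special case only, conditioning on the age of the process. It is not the TESS implementation and does not cover time-dependent rates. The function names, the restriction to a speciation rate larger than the extinction rate, and the closed-form distribution function used here are assumptions of this example.

```python
import math
import random


def _g(t, lam, mu):
    # Integral from 0 to t of the (unnormalized) speciation-time density
    # for a constant-rate birth-death process with lam > mu >= 0.
    r = lam - mu
    e = math.exp(-r * t)
    return (1.0 - e) / (lam - mu * e)


def speciation_time_cdf(t, lam, mu, age):
    """CDF of one speciation time (time before present), truncated at `age`."""
    return _g(t, lam, mu) / _g(age, lam, mu)


def draw_speciation_time(u, lam, mu, age):
    """Invert the CDF above for a uniform draw u in [0, 1]."""
    r = lam - mu
    c = u * _g(age, lam, mu)
    x = (1.0 - c * lam) / (1.0 - c * mu)  # x equals exp(-r * t)
    return -math.log(x) / r


def simulate_speciation_times(n_draws, lam, mu, age, rng=None):
    """Return n_draws independent speciation times, sorted from young to old.

    How many draws a tree needs depends on the conditioning (for example
    whether `age` is the time of origin or of the root); the caller chooses.
    """
    if not lam > mu >= 0.0:
        raise ValueError("this sketch assumes lam > mu >= 0")
    rng = rng or random.Random()
    return sorted(draw_speciation_time(rng.random(), lam, mu, age)
                  for _ in range(n_draws))
```

Because every draw is independent, a whole set of speciation times costs one uniform random number per event, which is what makes this kind of simulation fast compared with simulating the full process forward in time and discarding trees of the wrong size.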

FunFrame: functional gene ecological analysis pipeline

Fri, 03/29/2013 - 08:22

Summary: Pyrosequencing of 16S rDNA is widely used to study microbial communities, and a rich set of software tools support this analysis. Pyrosequencing of protein-coding genes, which can help elucidate functional differences among microbial communities, significantly lags behind 16S rDNA in availability of sequence analysis software. In both settings, frequent homopolymer read errors inflate the estimation of microbial diversity, and de-noising is required to reduce that bias. Here we describe FunFrame, an R-based data-analysis pipeline that uses recently described algorithms to de-noise functional gene pyrosequences and performs ecological analysis on de-noised sequence data. The novelty of this pipeline is that it provides users a unified set of tools, adapted from disparate sources and designed for different applications, that can be used to examine a particular protein coding gene of interest. We evaluated FunFrame on functional genes from four PCR-amplified clones with sequence depths ranging from 9084 to 14 494 sequences. FunFrame produced from one to nine OTUs for each clone, resulting in an error rate ranging from 0 to 0.18%. Importantly, FunFrame reduced spurious diversity while retaining more sequences than a commonly used de-noising method that discards sequences with frameshift errors.

Availability: Software, documentation and a complete set of sample data files are available at http://faculty.www.umb.edu/jennifer.bowen/software/FunFrame.zip.

Contact: Jennifer.Bowen@umb.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

NTFD--a stand-alone application for the non-targeted detection of stable isotope-labeled compounds in GC/MS data

Fri, 03/29/2013 - 08:22

Summary: Most current stable isotope-based methodologies are targeted and focus only on the well-described aspects of metabolic networks. Here, we present NTFD (non-targeted tracer fate detection), software for the non-targeted analysis of all detectable compounds derived from a stable isotope-labeled tracer present in a GC/MS dataset. In contrast to traditional metabolic flux analysis approaches, NTFD does not depend on any a priori knowledge or library information. To obtain dynamic information on metabolic pathway activity, NTFD determines mass isotopomer distributions for all detected and labeled compounds. These data provide information on relative fluxes in a metabolic network. The graphical user interface allows users to import GC/MS data in netCDF format and export all information into a tab-separated format.

Availability: NTFD is C++- and Qt4-based, and it is freely available under an open-source license. Pre-compiled packages for the installation on Debian- and Redhat-based Linux distributions, as well as Windows operating systems, along with example data, are provided for download at http://ntfd.mit.edu/.

Contact: gregstep@mit.edu

Enabling interspecies epigenomic comparison with CEpBrowser

Fri, 03/29/2013 - 08:22

Summary: We developed the Comparative Epigenome Browser (CEpBrowser) to allow the public to perform multi-species epigenomic analysis. The web-based CEpBrowser integrates, manages and visualizes sequencing-based epigenomic datasets. Five key features were developed to maximize the efficiency of interspecies epigenomic comparisons.

Availability: CEpBrowser is a web application implemented with PHP, MySQL, C and Apache. URL: http://www.cepbrowser.org/.

Contact: szhong@ucsd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

FishingCNV: a graphical software package for detecting rare copy number variations in exome sequencing data

Thu, 03/28/2013 - 08:49

Summary: Rare copy number variations (CNVs) are frequent causes of genetic diseases. We developed a graphical software package based on a novel approach that can consistently identify CNVs of all types (homozygous deletions, heterozygous deletions, heterozygous duplications) from exome sequencing data without the need of a paired control. The algorithm compares coverage depth in a test sample against a background distribution of control samples and uses principal component analysis to remove batch effects. It is user friendly and can be run on a personal computer.

Availability and Implementation: The main scripts are implemented in R (2.15), and the GUI is created using Java 1.6. It can be run on all major operating systems. A non-GUI version for pipeline implementation is also available. The program is freely available online: https://sourceforge.net/projects/fishingcnv/

Contact: yuhao.shi@mail.mcgill.ca

Supplementary Information:

MCScanX-transposed: detecting transposed gene duplications based on multiple colinearity scans

Thu, 03/28/2013 - 08:49

Summary: Gene duplication occurs via different modes such as segmental and single-gene duplications. Transposed gene duplication, a specific form of single-gene duplication, ‘copies’ a gene from an ancestral chromosomal location to a novel location. MCScanX is a toolkit for detection and evolutionary analysis of gene colinearity. We have developed MCScanX-transposed, a software package to detect transposed gene duplications that occurred within different epochs, based on execution of MCScanX within and between related genomes. MCScanX-transposed can also be used for integrative analysis of gene duplication modes for a genome and to annotate a gene family of interest with gene duplication modes.

Availability: MCScanX-transposed is freely available at http://chibba.pgml.uga.edu/mcscan2/transposed/

Contact: paterson@plantbio.uga.edu

Density-based hierarchical clustering of pyro-sequences on a large scale - the case of fungal ITS1

Thu, 03/28/2013 - 08:49

Motivation: Analysis of millions of pyro-sequences is currently playing a crucial role in the advance of environmental microbiology. Taxonomy-independent, i.e. unsupervised, clustering of these sequences is essential for the definition of Operational Taxonomic Units. For this application, reproducibility and robustness should be the most sought-after qualities, but they have so far largely been overlooked.

Results: Over one million hyper-variable ITS1 sequences of fungal origin have been analyzed. The ITS1 sequences were first properly extracted from 454 reads using generalized profiles. Then, otupipe, cd-hit-454, ESPRIT-Tree and DBC454, a new algorithm presented here, were used to analyze the sequences. A numerical assay was developed to measure the reproducibility and robustness of these algorithms. DBC454 was the most robust, closely followed by ESPRIT-Tree. DBC454 features density-based hierarchical clustering, which complements the other methods by providing insights into the structure of the data.

Availability and Implementation: An executable is freely available for non-commercial users at ftp://ftp.vital-it.ch/tools/dbc454. It is designed to run under MPI on a cluster of 64-bit Linux machines running Red Hat 4.x, or on a multi-core OSX system.

Contact: dbc454@vital-it.ch

An accessible database for mouse and human whole transcriptome qPCR primers

Thu, 03/28/2013 - 07:41

Motivation: Real-time quantitative PCR (qPCR) is an important tool in quantitative studies of DNA and RNA molecules, especially in transcriptome studies, where different primer combinations allow identification of specific transcripts such as splice variants or precursor mRNA (pre-mRNA). Several software tools that implement various rules for optimal primer design are available. Nevertheless, since designing qPCR primers needs to be done manually, the repeated task is tedious, time-consuming and prone to errors.

Results: We used a set of rules to automatically design primers for all possible exon-exon and intron-exon junctions in the Human and Mouse transcriptomes. The resulting database is included as a track in the UCSC genome browser, making it widely accessible and easy to use.

Availability: The database is available from the UCSC genome browser (http://genome.ucsc.edu/), track name "Whole Transcriptome qPCR Primers" for the hg19 (Human) and mm10 (Mouse) genome versions. Batch query is available in: http://www.weizmann.ac.il/complex/compphys/software/Amit/primers/batchqueryqpcrprimers.htm

Contact: eytan.domany@weizmann.ac.il

Improved ancestry inference using weights from external reference panels

Thu, 03/28/2013 - 07:41

Motivation: Inference of ancestry using genetic data is motivated by applications in genetic association studies, population genetics and personal genomics. Here, we provide methods and software for improved ancestry inference using genome-wide SNP weights from external reference panels. This approach makes it possible to leverage the rich ancestry information that is available from large external reference panels, without the administrative and computational complexities of re-analyzing the raw genotype data from the reference panel in subsequent studies.

Results: We extensively validate our approach in multiple African-American, Latino-American and European-American data sets, making use of genome-wide SNP weights derived from large reference panels, including HapMap 3 populations and 6,546 European Americans from the Framingham Heart Study. We show empirically that our approach provides much greater accuracy than either the prevailing Ancestry-Informative Markers (AIMs) approach or the analysis of genome-wide target genotypes without a reference panel. For example, in an independent set of 1,636 European American GWAS samples, we attained prediction accuracy (R²) of 1.000 and 0.994 for the first two principal components (PCs) using our method, compared to 0.418 and 0.407 using 150 published AIMs or 0.955 and 0.003 by applying PCA directly to the target samples. We finally show that the higher accuracy in inferring ancestry using our method leads to more effective correction for population stratification in association studies.

Availability: The SNPweights software is available online at http://www.hsph.harvard.edu/faculty/alkes-price/software/.

Contact: aprice@hsph.harvard.edu; cychen@mail.harvard.edu.
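
To make the idea of reusing SNP weights from a reference panel concrete, the sketch below projects one target sample onto reference principal components as a weighted sum of normalized genotypes. It is an illustration only, not the SNPweights software: the function name, the input layout and the particular genotype normalization (centering by twice the reference allele frequency and scaling by the square root of p(1 - p)) are assumptions of this example.

```python
import math


def project_onto_pcs(genotypes, ref_allele_freqs, snp_weights):
    """Project one sample onto reference PCs using precomputed SNP weights.

    genotypes        -- allele counts (0, 1 or 2) per SNP, or None if missing
    ref_allele_freqs -- reference-panel allele frequency per SNP (hypothetical input)
    snp_weights      -- per SNP, a list with one weight per principal component
    """
    n_pcs = len(snp_weights[0])
    scores = [0.0] * n_pcs
    for g, p, weights in zip(genotypes, ref_allele_freqs, snp_weights):
        # Skip missing genotypes and monomorphic SNPs; skipping is the same
        # as imputing the reference mean, which contributes zero.
        if g is None or p <= 0.0 or p >= 1.0:
            continue
        z = (g - 2.0 * p) / math.sqrt(p * (1.0 - p))
        for k in range(n_pcs):
            scores[k] += weights[k] * z
    return scores
```

The point of the design is that only the per-SNP weights and allele frequencies need to be shared, so a later study can place its samples on the reference axes without access to the raw reference genotypes.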

CytoHiC: a cytoscape plugin for visual comparison of Hi-C networks

Mon, 03/25/2013 - 05:32

Summary: With the introduction of the Hi-C method, new and fundamental properties of the nuclear architecture are emerging. The ability to interpret data generated by this method, which aims to capture the physical proximity between and within chromosomes, is crucial for uncovering the three-dimensional structure of the nucleus. Providing researchers with tools for interactive visualization of Hi-C data can help in gaining new and important insights. Specifically, visual comparison can pinpoint changes in spatial organization between Hi-C datasets, originating from different cell lines or different species, or normalized by different methods. Here, we present CytoHiC, a Cytoscape plugin, which allows users to view and compare spatial maps of genomic landmarks, based on normalized Hi-C datasets. CytoHiC was developed to support intuitive visual comparison of Hi-C data and integration of additional genomic annotations.

Availability: The CytoHiC plugin, source code, user manual, example files and documentation are available at: http://apps.cytoscape.org/apps/cytohicplugin

Contact: yolisha@gmail.com or ys388@cam.ac.uk

RCPedia: a database of retrocopied genes

Mon, 03/25/2013 - 05:32

Motivation: Retrocopies are copies of mature RNAs that are usually devoid of regulatory sequences and introns. They have routinely been classified as processed pseudo-genes with little or no biological relevance. However, recent findings have revealed functional roles for retrocopies, as well as their high frequency in some organisms, such as primates. Despite their increasing importance, there is no user-friendly and publicly available resource for the study of retrocopies.

Results: Here, we present RCPedia, an integrative and user-friendly database designed for the study of retrocopied genes. RCPedia contains a complete catalogue of the retrocopies that are known to be present in human and five other primate genomes, their genomic context, inter-species conservation and gene expression data. RCPedia also offers a streamlined data representation and an efficient query system.

Availability and implementation: RCPedia is available at http://www.bioinfo.mochsl.org.br/rcpedia.

Contact: pgalante@mochsl.org.br

Supplementary information: Supplementary data are available at Bioinformatics online.

A temporal switch model for estimating transcriptional activity in gene expression

Mon, 03/25/2013 - 05:32

Motivation: The analysis and mechanistic modelling of time series gene expression data provided by techniques such as microarrays, NanoString, reverse transcription–polymerase chain reaction and advanced sequencing are invaluable for developing an understanding of the variation in key biological processes. We address this by proposing the estimation of a flexible dynamic model, which decouples temporal synthesis and degradation of mRNA and, hence, allows for transcriptional activity to switch between different states.

Results: The model is flexible enough to capture a variety of observed transcriptional dynamics, including oscillatory behaviour, in a way that is compatible with the demands imposed by the quality, time-resolution and quantity of the data. We show that the timing and number of switch events in transcriptional activity can be estimated alongside individual gene mRNA stability with the help of a Bayesian reversible jump Markov chain Monte Carlo algorithm. To demonstrate the methodology, we focus on modelling the wild-type behaviour of a selection of 200 circadian genes of the model plant Arabidopsis thaliana. The results support the idea that using a mechanistic model to identify transcriptional switch points is likely to strongly contribute to efforts in elucidating and understanding key biological processes, such as transcription and degradation.

Contact: B.F.Finkenstadt@Warwick.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

FTFlex: accounting for binding site flexibility to improve fragment-based identification of druggable hot spots

Mon, 03/25/2013 - 05:32

Computational solvent mapping finds binding hot spots, determines their druggability and provides information for drug design. While mapping of a ligand-bound structure yields more accurate results, usually the apo structure serves as the starting point in design. The FTFlex algorithm, implemented as a server, can modify an apo structure to yield mapping results that are similar to those of the respective bound structure. Thus, FTFlex is an extension of our FTMap server, which only considers rigid structures. FTFlex identifies flexible residues within the binding site and determines alternative conformations using a rotamer library. In cases where the mapping results of the apo structure were in poor agreement with those of the bound structure, FTFlex was able to yield a modified apo structure, which led to improved FTMap results. In cases where the mapping results of the apo and bound structures were in good agreement, no new structure was predicted.

Availability: FTFlex is freely available as a web-based server at http://ftflex.bu.edu/.

Contact: vajda@bu.edu or midas@bu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

EBARDenovo: highly accurate de novo assembly of RNA-Seq with efficient chimera-detection

Sat, 03/23/2013 - 04:07

Motivation: High-accuracy de novo assembly of the short sequencing reads from RNA-Seq technology is very challenging. We introduce a de novo assembly algorithm, EBARDenovo, which stands for Extension, Bridging And Repeat-sensing Denovo. This algorithm uses an efficient chimera-detection function to abrogate the effect of aberrant chimeric reads in RNA-Seq data.

Results: EBARDenovo resolves the complications of RNA-Seq assembly arising from sequencing errors, repetitive sequences and aberrant chimeric amplicons. In a series of assembly experiments, our algorithm is the most accurate among the examined programs, including de Bruijn graph assemblers, Trinity and Oases.

Availability and implementation: EBARDenovo is available at http://ebardenovo.sourceforge.net/. This software package (with patent pending) is free of charge for academic use only.

Contact: cykao@csie.ntu.edu.tw, htchu@asia.edu.tw or postergrey@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

DeconRNASeq: a statistical framework for deconvolution of heterogeneous tissue samples based on mRNA-Seq data

Sat, 03/23/2013 - 04:07

Summary: For heterogeneous tissues, measurements of gene expression through mRNA-Seq data are confounded by relative proportions of cell types involved. In this note, we introduce an efficient pipeline: DeconRNASeq, an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data. It adopts a globally optimized non-negative decomposition algorithm through quadratic programming for estimating the mixing proportions of distinctive tissue types in next-generation sequencing data. We demonstrated the feasibility and validity of DeconRNASeq across a range of mixing levels and sources using mRNA-Seq data mixed in silico at known concentrations. We validated our computational approach for various benchmark data, with high correlation between our predicted cell proportions and the real fractions of tissues. Our study provides a rigorous, quantitative and high-resolution tool as a prerequisite to use mRNA-Seq data. The modularity of package design allows an easy deployment of custom analytical pipelines for data from other high-throughput platforms.

Availability: DeconRNASeq is written in R, and is freely available at http://bioconductor.org/packages.

Contact: tinggong@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.
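
The deconvolution idea in the Summary above can be sketched as a constrained least-squares problem: find non-negative mixing proportions that best reconstruct the mixed expression profile from known cell-type signatures. The example below is not the DeconRNASeq code. It swaps the quadratic-programming solver for plain projected gradient descent, and the function names, the sum-to-one constraint and the fixed iteration count are assumptions of this sketch.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) == 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0.0:
            theta = candidate  # keep the value from the largest valid k
    return [max(x - theta, 0.0) for x in v]


def estimate_proportions(signatures, mixture, n_iter=5000):
    """Minimize ||signatures * p - mixture||^2 over the probability simplex.

    signatures -- one row per gene, one column per cell type
    mixture    -- observed expression per gene in the mixed sample
    """
    n_genes = len(signatures)
    n_types = len(signatures[0])
    frob_sq = sum(x * x for row in signatures for x in row)
    if frob_sq == 0.0:
        raise ValueError("signature matrix is all zeros")
    # 2 * ||S||_F^2 bounds the gradient's Lipschitz constant, so this step is safe.
    step = 1.0 / (2.0 * frob_sq)
    p = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [sum(signatures[i][j] * p[j] for j in range(n_types)) - mixture[i]
                    for i in range(n_genes)]
        grad = [2.0 * sum(signatures[i][j] * residual[i] for i in range(n_genes))
                for j in range(n_types)]
        p = project_to_simplex([p[j] - step * grad[j] for j in range(n_types)])
    return p
```

A dedicated quadratic-programming solver would converge faster and more reliably on real data; projected gradient descent is used here only because it fits in a few lines without external libraries.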

ChIP-PED enhances the analysis of ChIP-seq and ChIP-chip data

Sat, 03/23/2013 - 01:07

Motivation: Although chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) or tiling array hybridization (ChIP-chip) is increasingly used to map genome-wide binding sites of transcription factors (TFs), it still remains difficult to generate a quality ChIPx (i.e. ChIP-seq or ChIP-chip) dataset because of the tremendous amount of effort required to develop effective antibodies and efficient protocols. Moreover, most laboratories are unable to easily obtain ChIPx data for one or more TF(s) in more than a handful of biological contexts. Thus, standard ChIPx analyses primarily focus on analyzing data from one experiment, and the discoveries are restricted to a specific biological context.

Results: We propose to enrich this existing data analysis paradigm by developing a novel approach, ChIP-PED, which superimposes ChIPx data on large amounts of publicly available human and mouse gene expression data containing a diverse collection of cell types, tissues and disease conditions to discover new biological contexts with potential TF regulatory activities. We demonstrate ChIP-PED using a number of examples, including a novel discovery that MYC, a human TF, plays an important functional role in pediatric Ewing sarcoma cell lines. These examples show that ChIP-PED increases the value of ChIPx data by allowing one to expand the scope of possible discoveries made from a ChIPx experiment.

Availability: http://www.biostat.jhsph.edu/~gewu/ChIPPED/

Contact: hji@jhsph.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

SplicingCompass: differential splicing detection using RNA-Seq data

Sat, 03/23/2013 - 01:07

Motivation: Alternative splicing is central for cellular processes and substantially increases transcriptome and proteome diversity. Aberrant splicing events often have pathological consequences and are associated with various diseases and cancer types. The emergence of next-generation RNA sequencing (RNA-seq) provides an exciting new technology to analyse alternative splicing on a large scale. However, algorithms that enable the analysis of alternative splicing from short-read sequencing are not fully established yet and there are still no standard solutions available for a variety of data analysis tasks.

Results: We present a new method and software to predict genes that are differentially spliced between two different conditions using RNA-seq data. Our method uses geometric angles between the high dimensional vectors of exon read counts. With this, differential splicing can be detected even if the splicing events are composed of higher complexity and involve previously unknown splicing patterns. We applied our approach to two case studies including neuroblastoma tumour data with favourable and unfavourable clinical courses. We show the validity of our predictions as well as the applicability of our method in the context of patient clustering. We verified our predictions by several methods including simulated experiments and complementary in silico analyses. We found a significant number of exons with specific regulatory splicing factor motifs for predicted genes and a substantial number of publications linking those genes to alternative splicing. Furthermore, we could successfully exploit splicing information to cluster tissues and patients. Finally, we found additional evidence of splicing diversity for many predicted genes in normalized read coverage plots and in reads that span exon–exon junctions.

Availability: SplicingCompass is licensed under the GNU GPL and freely available as a package in the statistical language R at http://www.ichip.de/software/SplicingCompass.html

Contact: m.aschoff@dkfz.de or r.koenig@dkfz.de

Supplementary information: Supplementary data are available at Bioinformatics online.
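
The geometric quantity mentioned in the Results above, the angle between two vectors of exon read counts, can be computed as follows. This is only a sketch of that single ingredient, not the SplicingCompass method itself, which builds a statistical comparison between conditions on top of such angles; the function name and the example counts are hypothetical.

```python
import math


def exon_count_angle(counts_a, counts_b):
    """Angle in radians between two exon read-count vectors of one gene."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both vectors must cover the same exons")
    dot = sum(a * b for a, b in zip(counts_a, counts_b))
    norm_a = math.sqrt(sum(a * a for a in counts_a))
    norm_b = math.sqrt(sum(b * b for b in counts_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("angle is undefined for an all-zero count vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


# Hypothetical counts: the second sample has twice the expression but the
# same exon usage; the third sample largely skips the middle exon.
same_usage = exon_count_angle([100, 50, 80], [200, 100, 160])
changed_usage = exon_count_angle([100, 50, 80], [100, 2, 80])
```

The angle ignores the overall length of the vectors, so a uniform change in expression level leaves it near zero, while a change in the relative usage of exons increases it.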

HExpoChem: a systems biology resource to explore human exposure to chemicals

Sat, 03/23/2013 - 01:07

Summary: Humans are exposed to diverse hazardous chemicals daily. Although an exposure to these chemicals is suspected to have adverse effects on human health, mechanistic insights into how they interact with the human body are still limited. Therefore, acquisition of curated data and development of computational biology approaches are needed to assess the health risks of chemical exposure. Here we present HExpoChem, a tool based on environmental chemicals and their bioactivities on human proteins with the objective of aiding the qualitative exploration of human exposure to chemicals. The chemical–protein interactions have been enriched with a quality-scored human protein–protein interaction network, a protein–protein association network and a chemical–chemical interaction network, thus allowing the study of environmental chemicals through formation of protein complexes and phenotypic outcomes enrichment.

Availability: HExpoChem is available at http://www.cbs.dtu.dk/services/HExpoChem-1.0/.

Contact: karine@cbs.dtu.dk

Supplementary information: Supplementary data are available at Bioinformatics online.

Brain: biomedical knowledge manipulation

Sat, 03/23/2013 - 01:07

Summary: Brain is a Java software library facilitating the manipulation and creation of ontologies and knowledge bases represented with the Web Ontology Language (OWL).

Availability and implementation: The Java source code and the library are freely available at https://github.com/loopasam/Brain and on the Maven Central repository (GroupId: uk.ac.ebi.brain). The documentation is available at https://github.com/loopasam/Brain/wiki.

Contact: croset@ebi.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

PeptideLocator: prediction of bioactive peptides in protein sequences

Fri, 03/22/2013 - 01:49

Motivation: Peptides play important roles in signalling, regulation and immunity within an organism. Many have successfully been used as therapeutic products often mimicking naturally occurring peptides. Here we present PeptideLocator for the automated prediction of functional peptides in a protein sequence.

Results: We have trained a machine learning algorithm to predict bioactive peptides within protein sequences. PeptideLocator performs well on training data, achieving an area under the curve of 0.92 when tested in 5-fold cross-validation on a set of 2202 redundancy-reduced, peptide-containing protein sequences. It has predictive power when applied to antimicrobial peptides, cytokines, growth factors, peptide hormones, toxins, venoms and other peptides. It can be applied to refine the choice of experimental investigations in functional studies of proteins.

Availability and implementation: PeptideLocator is freely available for academic users at http://bioware.ucd.ie/.

Contact: denis.shields@ucd.ie

Supplementary information: Supplementary data are available at Bioinformatics online.


