Bioinformatics Journal

Bioinformatics - RSS feed of current issue
  • Shiny-phyloseq: Web application for interactive microbiome analysis with provenance tracking
    [Jan 2015]

    Summary: We have created a Shiny-based Web application, called Shiny-phyloseq, for dynamic interaction with microbiome data that runs on any modern Web browser and requires no programming, increasing the accessibility and decreasing the entrance requirement to using phyloseq and related R tools. Along with a data- and context-aware dynamic interface for exploring the effects of parameter and method choices, Shiny-phyloseq also records the complete user input and subsequent graphical results of a user’s session, allowing the user to archive, share and reproduce the sequence of steps that created their result—without writing any new code themselves.

    Availability and implementation: Shiny-phyloseq is implemented entirely in the R language. It can be hosted/launched by any system with R installed, including Windows, Mac OS and most Linux distributions. Information technology administrators can also host Shiny-phyloseq from a remote server, in which case users need only have a Web browser installed. Shiny-phyloseq is provided free of charge under a GPL-3 open-source license through GitHub at http://joey711.github.io/shiny-phyloseq/.

    Contact: mcmurdie@alumni.stanford.edu.

    Categories: Journal Articles
  • PhaseTank: genome-wide computational identification of phasiRNAs and their regulatory cascades
    [Jan 2015]

    Summary: Emerging evidence has revealed phased siRNAs (phasiRNAs) as important endogenous regulators in plants. However, integrated prediction tools for phasiRNAs are still limited. In this article, we introduce a stand-alone package, PhaseTank, for systematically characterizing phasiRNAs and their regulatory networks. (i) It can identify phasiRNA/tasiRNA functional cascades (miRNA/phasiRNA->PHAS loci->phasiRNA->target) with high sensitivity and specificity. (ii) With a single command, it generates comprehensive annotation and quantification of the predicted PHAS genes from any given sequences. (iii) PhaseTank does not depend on prior sequence-homology information and is not restricted to particular organisms.

    Availability and implementation: PhaseTank is a free and open-source tool. The package is available at http://phasetank.sourceforge.net/.

    Contact: weibojin@gmail.com or guoql.karen@gmail.com.

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • KDDN: an open-source Cytoscape app for constructing differential dependency networks with significant rewiring
    [Jan 2015]

    Summary: We have developed an integrated molecular network learning method, within a well-grounded mathematical framework, to construct differential dependency networks with significant rewiring. This knowledge-fused differential dependency networks (KDDN) method, implemented as a Java Cytoscape app, can be used to optimally integrate prior biological knowledge with measured data to simultaneously construct both common and differential networks, to quantitatively assign model parameters and significant rewiring p-values and to provide user-friendly graphical results. The KDDN algorithm is computationally efficient and provides users with parallel computing capability using ubiquitous multi-core machines. We demonstrate the performance of KDDN on various simulations and real gene expression datasets, and further compare the results with those obtained by the most relevant peer methods. The acquired biologically plausible results provide new insights into network rewiring as a mechanistic principle and illustrate KDDN’s ability to detect them efficiently and correctly. Although the principal application here involves microarray gene expressions, our methodology can be readily applied to other types of quantitative molecular profiling data.

    Availability: Source code and compiled package are freely available for download at http://apps.cytoscape.org/apps/kddn

    Contact: yuewang@vt.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • MTide: an integrated tool for the identification of miRNA-target interaction in plants
    [Jan 2015]

    Motivation: Small RNA sequencing and degradome sequencing (also known as parallel analysis of RNA ends) have provided rich information on the microRNA (miRNA) and its cleaved mRNA targets on a genome-wide scale in plants, but no computational tools have been developed to effectively and conveniently deconvolute the miRNA–target interaction (MTI).

    Results: A freely available package, MTide, was developed by combining modified versions of miRDeep2 and CleaveLand4 with other useful scripts to explore MTI in a comprehensive way. By searching for targets of a complete set of miRNAs, we can facilitate large-scale identification of miRNA targets, allowing us to discover regulatory interaction networks.

    Availability and implementation: http://bis.zju.edu.cn/MTide

    Contact: mchen@zju.edu.cn

    Categories: Journal Articles
  • PHI-DAC: protein homology database through dihedral angle conservation
    [Jan 2015]

    Finding related conformations in the Protein Data Bank is essential in many areas of bioscience. To assist this task, we designed a dihedral angle database for searching protein segment homologs. The search engine relies on encoding the protein coordinates into text characters representing the amino acid sequence and dihedral angles. Its high speed and interactive nature are expected to assist scientists in discovering conformation homologs and evolutionary kinship. Query times last a few seconds, and the search engine is freely available at http://tarshish.md.biu.ac.il/~samsona

    Contact: avraham.samson@biu.ac.il

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
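
The PHI-DAC abstract above describes encoding protein coordinates as text characters that represent dihedral angles. The following is a minimal sketch of that general idea, not the PHI-DAC encoding itself: the 30-degree bin width, the A to L alphabet, and the function names `dihedral`, `encode_angle` and `encode_segment` are all assumptions made for illustration.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four 3D points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    b2_len = math.sqrt(_dot(b2, b2))
    y = b2_len * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def encode_angle(deg, bin_width=30):
    """Map an angle in [-180, 180] to a letter (hypothetical scheme).

    With 30-degree bins, -180..-150 maps to 'A', ..., 150..180 maps to 'L';
    +180 wraps around to 'A' because it is the same angle as -180.
    """
    n_bins = 360 // bin_width
    index = int((deg + 180) // bin_width) % n_bins
    return chr(ord("A") + index)


def encode_segment(angles, bin_width=30):
    """Encode a series of dihedral angles as a searchable text string."""
    return "".join(encode_angle(a, bin_width) for a in angles)


# Four points whose dihedral angle is a quarter turn.
quarter_turn = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
segment_code = encode_segment([-60, -45, 120])
```

Once segments are strings, ordinary substring or pattern matching can find segments with similar backbone geometry, which is one plausible reason a text encoding makes such searches fast.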
  • PIP-DB: the Protein Isoelectric Point database
    [Jan 2015]

    Summary: A protein’s isoelectric point or pI corresponds to the solution pH at which its net surface charge is zero. Since the early days of solution biochemistry, the pI has been recorded and reported, and thus literature reports of pI abound. The Protein Isoelectric Point database (PIP-DB) has collected and collated these data to provide an increasingly comprehensive database for comparison and benchmarking purposes. A web application has been developed to warehouse this database and provide public access to this unique resource. PIP-DB is a web-enabled SQL database with an HTML GUI front-end. PIP-DB is fully searchable across a range of properties.

    Availability and implementation: The PIP-DB database and documentation are available at http://www.pip-db.org.

    Contact: d.r.flower@aston.ac.uk

    Categories: Journal Articles
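
The PIP-DB summary above defines the pI as the pH at which the net charge is zero. As a rough illustration of that definition, the sketch below estimates a theoretical pI from sequence alone using the Henderson-Hasselbalch equation and bisection. PIP-DB itself collects experimentally reported values; this calculation, the pKa table (illustrative values only, since published pKa sets differ) and the function names are assumptions, and the model ignores structure and surface exposure.

```python
# Illustrative pKa values (an assumption; published pKa sets differ).
POSITIVE_PKA = {"N_TERM": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"C_TERM": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence, ph):
    """Approximate net charge of a peptide at a given pH."""
    groups = ["N_TERM", "C_TERM"] + list(sequence.upper())
    charge = 0.0
    for group in groups:
        if group in POSITIVE_PKA:
            # Fraction of the basic group that is protonated (+1).
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[group]))
        elif group in NEGATIVE_PKA:
            # Fraction of the acidic group that is deprotonated (-1).
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[group] - ph))
    return charge


def isoelectric_point(sequence, tol=1e-6):
    """Find the pH where net_charge crosses zero, by bisection on [0, 14].

    The net charge decreases monotonically with pH, so bisection converges.
    """
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI is at a higher pH
        else:
            high = mid
    return (low + high) / 2


example_pi = isoelectric_point("ACDEFGHIK")
```

Comparing such sequence-based estimates against measured values is the kind of benchmarking the summary mentions.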
  • Summary of the BioLINK SIG 2013 meeting at ISMB/ECCB 2013
    [Jan 2015]

    The ISMB Special Interest Group on Linking Literature, Information and Knowledge for Biology (BioLINK) organized a one-day workshop at ISMB/ECCB 2013 in Berlin, Germany. The theme of the workshop was ‘Roles for text mining in biomedical knowledge discovery and translational medicine’. This summary reviews the outcomes of the workshop. Meeting themes included concept annotation methods and applications, extraction of biological relationships and the use of text-mined data for biological data analysis.

    Availability and implementation: All articles are available at http://biolinksig.org/proceedings-online/.

    Contact: karin.verspoor@unimelb.edu.au

    Categories: Journal Articles
  • METAINTER: meta-analysis of multiple regression models in genome-wide association studies
    [Jan 2015]

    Motivation: Meta-analysis of summary statistics is an essential approach to guarantee the success of genome-wide association studies (GWAS). Application of the fixed or random effects model to single-marker association tests is a standard practice. More complex methods of meta-analysis involving multiple parameters have not been used frequently, a gap that could be explained by the lack of a respective meta-analysis pipeline. Meta-analysis based on combining p-values can be applied to any association test. However, to be powerful, meta-analysis methods for high-dimensional models should incorporate additional information such as study-specific properties of parameter estimates, their effect directions, standard errors and covariance structure.

    Results: We adapted the ‘method for the synthesis of linear regression slopes’, recently proposed in the educational sciences, to the case of multiple logistic regression and implemented it in a meta-analysis tool called METAINTER. The software handles models with an arbitrary number of parameters and can be applied directly to the results of single-SNP (single nucleotide polymorphism) tests, global haplotype tests, and tests for and under gene–gene or gene–environment interaction. Via simulations for two-SNP models, we have shown that the proposed meta-analysis method has a correct type I error rate. Moreover, power estimates come close to those of the joint analysis of the entire sample. We conducted a real data analysis of six GWAS of type 2 diabetes, available from dbGaP (http://www.ncbi.nlm.nih.gov/gap). For each study, a genome-wide interaction analysis of all SNP pairs was performed by logistic regression tests. The results were then meta-analyzed with METAINTER.

    Availability: The software is freely available and distributed under the conditions specified on http://metainter.meb.uni-bonn.de

    Contact: vait@imbie.meb.uni-bonn.de

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
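
The METAINTER abstract above argues that meta-analysis of multi-parameter models should use study-specific parameter estimates together with their standard errors and covariance structure. A generic way to do that is multivariate inverse-variance (generalized least squares) pooling, sketched below. This is offered as an assumption about the general technique, not as METAINTER's exact procedure; `inverse` and `pool_estimates` are hypothetical helper names.

```python
def inverse(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def pool_estimates(estimates, covariances):
    """Fixed-effect pooling of per-study parameter vectors.

    pooled = (sum_i V_i^-1)^-1 * sum_i (V_i^-1 b_i), where b_i is the
    estimate vector and V_i the covariance matrix of study i.
    Returns (pooled estimates, covariance of the pooled estimates).
    """
    k = len(estimates[0])
    weight_sum = [[0.0] * k for _ in range(k)]
    weighted = [0.0] * k
    for b, v in zip(estimates, covariances):
        w = inverse(v)
        for i in range(k):
            for j in range(k):
                weight_sum[i][j] += w[i][j]
                weighted[i] += w[i][j] * b[j]
    pooled_cov = inverse(weight_sum)
    pooled = [sum(pooled_cov[i][j] * weighted[j] for j in range(k))
              for i in range(k)]
    return pooled, pooled_cov


# Two hypothetical studies, each reporting two regression coefficients.
pooled, pooled_cov = pool_estimates(
    estimates=[[1.0, 2.0], [3.0, 4.0]],
    covariances=[[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]],
)
```

With a single parameter per study, this reduces to the familiar inverse-variance weighted mean used for single-marker tests.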
  • A two-stage statistical procedure for feature selection and comparison in functional analysis of metagenomes
    [Jan 2015]

    Motivation: With the advance of new sequencing technologies producing massive short-read data, metagenomics is rapidly growing, especially in the fields of environmental biology and medical science. Metagenomic data are not only high dimensional, with a large number of features and a limited number of samples, but also complex, with a large number of zeros and a skewed distribution. Efficient computational and statistical tools are needed to deal with these unique characteristics of metagenomic sequencing data. In metagenomic studies, one main objective is to assess whether and how multiple microbial communities differ under various environmental conditions.

    Results: We propose a two-stage statistical procedure for selecting informative features and identifying differentially abundant features between two or more groups of microbial communities. In the functional analysis of metagenomes, the features may refer to pathways, subsystems, functional roles and so on. In the first stage of the proposed procedure, the informative features are selected using the elastic net, which reduces the dimension of the metagenomic data. In the second stage, the differentially abundant features are detected using generalized linear models with a negative binomial distribution. Compared with other available methods, the proposed approach demonstrates better performance in most of the comprehensive simulation studies. The new method is also applied to two real metagenomic datasets related to human health. Our findings are consistent with those in previous reports.

    Availability: R code and two example datasets are available at http://cals.arizona.edu/~anling/software.htm

    Contact: anling@email.arizona.edu

    Supplementary information: Supplementary file is available at Bioinformatics online.

    Categories: Journal Articles
  • HTSeq--a Python framework to work with high-throughput sequencing data
    [Jan 2015]

    Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed.

    Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.

    Availability and implementation: HTSeq is released as open-source software under the GNU General Public Licence and is available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq.

    Contact: sanders@fs.tum.de

    Categories: Journal Articles
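
The HTSeq abstract above describes htseq-count as counting the overlap of reads with genes. The sketch below illustrates that idea in plain Python and deliberately does not use the HTSeq API. The data layout, the half-open coordinate convention, the rule that a read overlapping more than one gene is set aside as ambiguous, and the labels `no_feature` and `ambiguous` are assumptions chosen for illustration.

```python
from collections import Counter


def count_reads(genes, reads):
    """Count reads per gene.

    genes: dict mapping gene name -> (chrom, start, end), half-open interval.
    reads: iterable of (chrom, start, end) alignments, half-open interval.
    A read is counted for a gene only if it overlaps exactly one gene.
    """
    counts = Counter()
    for chrom, start, end in reads:
        hits = {
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and g_start < end
        }
        if not hits:
            counts["no_feature"] += 1
        elif len(hits) > 1:
            counts["ambiguous"] += 1
        else:
            counts[hits.pop()] += 1
    return counts


# Hypothetical annotation and alignments.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [
    ("chr1", 90, 120),    # overlaps geneA only
    ("chr1", 185, 195),   # overlaps both genes
    ("chr1", 400, 450),   # overlaps nothing
    ("chr2", 100, 150),   # wrong chromosome
    ("chr1", 250, 280),   # overlaps geneB only
]
gene_counts = count_reads(genes, reads)
```

The linear scan over all genes is fine for a toy example; a real tool would use an interval index so that each read lookup does not touch every gene.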
  • Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance
    [Jan 2015]

    Motivation: Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic analysis.

    Results: Sigma provides not only accurate strain-level inferences, but also three unique capabilities: (i) Sigma quantifies the statistical uncertainty of its inferences, which includes hypothesis testing of identified genomes and confidence interval estimation of their relative abundances; (ii) Sigma enables strain variant calling by assigning metagenomic reads to their most likely reference genomes; and (iii) Sigma supports parallel computing for fast analysis of large datasets. The algorithm performance was evaluated using simulated mock communities and fecal samples with spike-in pathogen strains.

    Availability and Implementation: Sigma was implemented in C++, with source code and binaries freely available at http://sigma.omicsbio.org.

    Contact: panc@ornl.gov

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
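
The Sigma abstract above mentions estimating the relative abundances of reference genomes and assigning reads to their most likely genome. One standard way to approach both tasks is expectation-maximization over a mixture model, sketched below. This is a generic illustration under stated assumptions, not Sigma's actual model: the input format (one row of per-genome likelihoods per read) and the function names are hypothetical, and no uncertainty quantification is attempted.

```python
def estimate_abundances(likelihoods, n_iter=200):
    """EM estimate of mixture proportions.

    likelihoods[r][g] is the likelihood of read r given genome g.
    Returns one relative abundance per genome, summing to 1.
    """
    n_genomes = len(likelihoods[0])
    abundances = [1.0 / n_genomes] * n_genomes
    for _ in range(n_iter):
        totals = [0.0] * n_genomes
        for row in likelihoods:
            # E-step: share each read among genomes in proportion to
            # abundance * likelihood.
            weighted = [a * l for a, l in zip(abundances, row)]
            norm = sum(weighted)
            if norm == 0:
                continue  # read matches no genome; ignore it
            for g, w in enumerate(weighted):
                totals[g] += w / norm
        total = sum(totals)
        if total == 0:
            break
        # M-step: abundances are the normalized expected read counts.
        abundances = [t / total for t in totals]
    return abundances


def assign_reads(likelihoods, abundances):
    """Assign each read to its most probable genome index (None if no match)."""
    assignments = []
    for row in likelihoods:
        scores = [a * l for a, l in zip(abundances, row)]
        if not any(scores):
            assignments.append(None)
        else:
            assignments.append(max(range(len(scores)), key=scores.__getitem__))
    return assignments


# Three reads unique to genome 0, one unique to genome 1, one shared.
toy_likelihoods = [[1, 0], [1, 0], [1, 0], [0, 1], [1, 1]]
toy_abundances = estimate_abundances(toy_likelihoods)
toy_assignments = assign_reads(toy_likelihoods, toy_abundances)
```

In this toy input the shared read is split according to the evidence from the uniquely mapped reads, which is the behaviour that lets closely related strains be told apart.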
  • LongTarget: a tool to predict lncRNA DNA-binding motifs and binding sites via Hoogsteen base-pairing analysis
    [Jan 2015]

    Motivation: In mammalian cells, many genes are silenced by genome methylation. DNA methyltransferases and polycomb repressive complexes, which both lack sequence-specific DNA-binding motifs, are recruited by long non-coding RNA (lncRNA) to specific genomic sites to methylate DNA and chromatin. Increasing evidence indicates that many lncRNAs contain DNA-binding motifs that can bind to DNA by forming RNA:DNA triplexes. The identification of lncRNA DNA-binding motifs and binding sites is essential for deciphering lncRNA functions and correct and erroneous genome methylation; however, such identification is challenging because lncRNAs may contain thousands of nucleotides. No computational analysis of typical lncRNAs has been reported. Here, we report a computational method and program (LongTarget) to predict lncRNA DNA-binding motifs and binding sites. We used this program to analyse multiple antisense lncRNAs, including those that control well-known imprinting clusters, and obtained results agreeing with experimental observations and epigenetic marks. These results suggest that it is feasible to predict many lncRNA DNA-binding motifs and binding sites genome-wide.

    Availability and implementation: Website of LongTarget: lncrna.smu.edu.cn, or contact: hao.zhu@ymail.com.

    Contact: zhuhao@smu.edu.cn

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Consensus Genotyper for Exome Sequencing (CGES): improving the quality of exome variant genotypes
    [Jan 2015]

    Motivation: The development of cost-effective next-generation sequencing methods has spurred the development of high-throughput bioinformatics tools for detection of sequence variation. With many disparate variant-calling algorithms available, investigators must ask, ‘Which method is best for my data?’ Machine learning research has shown that so-called ensemble methods that combine the output of multiple models can dramatically improve classifier performance. Here we describe a novel variant-calling approach based on an ensemble of variant-calling algorithms, which we term the Consensus Genotyper for Exome Sequencing (CGES). CGES uses a two-stage voting scheme among four algorithm implementations. While our ensemble method can accept variants generated by any variant-calling algorithm, we used GATK2.8, SAMtools, FreeBayes and Atlas-SNP2 in building CGES because of their performance, widespread adoption and diverse but complementary algorithms.

    Results: We apply CGES to 132 samples sequenced at the Hudson Alpha Institute for Biotechnology (HAIB, Huntsville, AL) using the Nimblegen Exome Capture and Illumina sequencing technology. Our sample set consisted of 40 complete trios, two families of four, one parent–child duo and two unrelated individuals. CGES yielded the fewest total variant calls ($N_{CGES} = 139\,897$), the highest Ts/Tv ratio (3.02), the lowest Mendelian error rate across all genotypes (0.028%), the highest rediscovery rate from the Exome Variant Server (EVS; 89.3%) and 1000 Genomes (1KG; 84.1%) and the highest positive predictive value (PPV; 96.1%) for a random sample of previously validated de novo variants. We describe these and other quality control (QC) metrics from consensus data and explain how the CGES pipeline can be used to generate call sets of varying quality stringency, including consensus calls present across all four algorithms, calls that are consistent across any three out of four algorithms, calls that are consistent across any two out of four algorithms or a more liberal set of all calls made by any algorithm.

    Availability and implementation: To enable accessible, efficient and reproducible analysis, we implement CGES both as a stand-alone command line tool available for download from GitHub and as a set of Galaxy tools and workflows configured to execute on parallel computers.

    Contact: trubetskoy@uchicago.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
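
The CGES results above describe call sets of varying stringency: calls made by all four algorithms, by any three, by any two, or by any one. The sketch below shows how such sets can be derived from per-caller vote counts, along with a simple Ts/Tv calculation. It is an illustration only: the abstract does not detail the two-stage voting scheme, so this is plain vote thresholding, and the variant tuple layout `(chrom, pos, ref, alt, genotype)`, the caller labels and the function names are assumptions.

```python
from collections import Counter

# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def consensus_calls(call_sets, min_votes):
    """Return the calls reported identically by at least min_votes callers.

    call_sets: dict mapping caller name -> iterable of
    (chrom, pos, ref, alt, genotype) tuples.
    """
    votes = Counter()
    for calls in call_sets.values():
        votes.update(set(calls))  # each caller votes at most once per call
    return {call for call, n in votes.items() if n >= min_votes}


def ts_tv_ratio(calls):
    """Transition/transversion ratio for a set of single-nucleotide calls."""
    ts = sum(1 for _, _, ref, alt, _ in calls if (ref, alt) in TRANSITIONS)
    tv = len(calls) - ts
    return ts / tv if tv else float("inf")


# Hypothetical calls from four callers.
v1 = ("chr1", 100, "A", "G", "0/1")
v2 = ("chr1", 200, "C", "T", "1/1")
v3 = ("chr2", 50, "A", "C", "0/1")
v4 = ("chr2", 75, "G", "T", "0/1")
call_sets = {
    "caller_1": [v1, v2, v3, v4],
    "caller_2": [v1, v2, v3],
    "caller_3": [v1, v2],
    "caller_4": [v1],
}
strict = consensus_calls(call_sets, min_votes=4)
liberal = consensus_calls(call_sets, min_votes=1)
```

Because the genotype is part of the tuple, two callers that report the same site with different genotypes do not count as agreeing, which matches the abstract's wording about calls being "consistent" across algorithms.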
  • Proteomic analysis and prediction of human phosphorylation sites in subcellular level reveal subcellular specificity
    [Jan 2015]

    Motivation: Protein phosphorylation is the most common post-translational modification (PTM) regulating major cellular processes through highly dynamic and complex signaling pathways. Large-scale comparative phosphoproteomic studies have frequently been done on whole cells or organs by conventional bottom-up mass spectrometry approaches, i.e. at the phosphopeptide level. Using this approach, there is no way to know from where the phosphopeptide signal originated. Also, as a consequence of the scale of these studies, important information on the localization of phosphorylation sites in subcellular compartments (SCs) is not surveyed.

    Results: Here, we present a first account of the emerging field of subcellular phosphoproteomics where a support vector machine (SVM) approach was combined with a novel algorithm of discrete wavelet transform (DWT) to facilitate the identification of compartment-specific phosphorylation sites and to unravel the intricate regulation of protein phosphorylation. Our data reveal that the subcellular phosphorylation distribution is compartment type dependent and that the phosphorylation displays site-specific sequence motifs that diverge between SCs.

    Availability and implementation: The method and database both are available as a web server at: http://bioinfo.ncu.edu.cn/SubPhos.aspx.

    Contact: jdqiu@ncu.edu.cn

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Comprehensive large-scale assessment of intrinsic protein disorder
    [Jan 2015]

    Motivation: Intrinsically disordered regions are key for the function of numerous proteins. Due to the difficulties in experimental disorder characterization, many computational predictors have been developed with various disorder flavors. Their performance is generally measured on small sets mainly from experimentally solved structures, e.g. Protein Data Bank (PDB) chains. MobiDB has only recently started to collect disorder annotations from multiple experimental structures.

    Results: MobiDB annotates disorder for UniProt sequences, allowing us to conduct the first large-scale assessment of fast disorder predictors on 25 833 different sequences with X-ray crystallographic structures. In addition to a comprehensive ranking of predictors, this analysis produced the following interesting observations. (i) The predictors cluster according to their disorder definition, with a consensus giving more confidence. (ii) Previous assessments appear over-reliant on data annotated at the PDB chain level and performance is lower on entire UniProt sequences. (iii) Long disordered regions are harder to predict. (iv) Depending on the structural and functional types of the proteins, differences in prediction performance of up to 10% are observed.

    Availability: The datasets are available at http://mobidb.bio.unipd.it/lsd.

    Contact: silvio.tosatto@unipd.it

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Hybrid Bayesian-rank integration approach improves the predictive power of genomic dataset aggregation
    [Jan 2015]

    Motivation: Modern molecular technologies allow the collection of large amounts of high-throughput data on the functional attributes of genes. Often multiple technologies and study designs are used to address the same biological question such as which genes are overexpressed in a specific disease state. Consequently, there is considerable interest in methods that can integrate across datasets to present a unified set of predictions.

    Results: An important aspect of data integration is being able to account for the fact that datasets may differ in how accurately they capture the biological signal of interest. While many methods to address this problem exist, they always rely either on dataset internal statistics, which reflect data structure and not necessarily biological relevance, or external gold standards, which may not always be available. We present a new rank aggregation method for data integration that requires neither external standards nor internal statistics but relies on Bayesian reasoning to assess dataset relevance. We demonstrate that our method outperforms established techniques and significantly improves the predictive power of rank-based aggregations. We show that our method, which does not require an external gold standard, provides reliable estimates of dataset relevance and allows the same set of data to be integrated differently depending on the specific signal of interest.

    Availability: The method is implemented in R and is freely available at http://www.pitt.edu/~mchikina/BIRRA/

    Contact: mchikina@pitt.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Translating bioinformatics in oncology: guilt-by-profiling analysis and identification of KIF18B and CDCA3 as novel driver genes in carcinogenesis
    [Jan 2015]

    Motivation: Co-regulated genes are not identified in traditional microarray analyses, but may theoretically be closely functionally linked [guilt-by-association (GBA), guilt-by-profiling]. However, bioinformatics procedures for guilt-by-profiling/association analysis have yet to be applied to large-scale cancer biology.

    We analyzed 2158 full cancer transcriptomes from 163 diverse cancer entities with regard to the similarity of their gene expression, using Pearson’s correlation coefficient (CC). Subsequently, 428 highly co-regulated genes (|CC| ≥ 0.8) were subjected to unsupervised clustering to obtain small co-regulated networks. A major subnetwork containing 61 closely co-regulated genes showed highly significant enrichment of cancer bio-functions. All genes except kinesin family member 18B (KIF18B) and cell division cycle associated 3 (CDCA3) were of confirmed relevance for tumor biology. Therefore, we independently analyzed their differential regulation in multiple tumors and found severe deregulation in liver, breast, lung, ovarian and kidney cancers, thus proving our GBA hypothesis. Overexpression of KIF18B and CDCA3 in hepatoma cells and subsequent microarray analysis revealed significant deregulation of central cell cycle regulatory genes. Consistently, RT-PCR and proliferation assay confirmed the role of both genes in cell cycle progression.

    Finally, the prognostic significance of the identified KIF18B- and CDCA3-dependent predictors (P = 0.01, P = 0.04) was demonstrated in three independent HCC cohorts and several other tumors.

    In summary, we proved the efficacy of large-scale guilt-by-profiling/association strategies in oncology. We identified two novel oncogenes and functionally characterized them. The strong prognostic importance of downstream predictors for HCC and many other tumors indicates the clinical relevance of our findings.

    Contact: andreas.teufel@ukr.de

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
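
The abstract above selects highly co-regulated genes by thresholding Pearson's correlation coefficient at |CC| ≥ 0.8. The sketch below shows that selection step on a toy expression table. The gene names, expression values and function names are hypothetical, and the subsequent unsupervised clustering step is not reproduced here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coregulated_pairs(expression, threshold=0.8):
    """Return (gene1, gene2, cc) for all pairs with |cc| >= threshold.

    expression: dict mapping gene name -> list of expression values,
    one value per sample, in the same sample order for every gene.
    """
    pairs = []
    for g1, g2 in combinations(sorted(expression), 2):
        cc = pearson(expression[g1], expression[g2])
        if abs(cc) >= threshold:
            pairs.append((g1, g2, cc))
    return pairs


# Toy expression profiles across four samples.
expression = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],    # perfectly correlated with g1
    "g3": [4, 3, 2, 1],    # perfectly anti-correlated with g1
    "g4": [1, -1, 1, -1],  # weakly correlated with the others
}
pairs = coregulated_pairs(expression)
```

Using the absolute value keeps strongly anti-correlated genes as well, which matches the |CC| notation in the abstract. Note that the all-pairs loop grows quadratically with the number of genes.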
  • GLAD: a mixed-membership model for heterogeneous tumor subtype classification
    [Jan 2015]

    Motivation: Genomic analyses of many solid cancers have demonstrated extensive genetic heterogeneity between as well as within individual tumors. However, statistical methods for classifying tumors by subtype based on genomic biomarkers generally entail an all-or-none decision, which may be misleading for clinical samples containing a mixture of subtypes and/or normal cell contamination.

    Results: We have developed a mixed-membership classification model, called GLAD, that simultaneously learns a sparse biomarker signature for each subtype as well as a distribution over subtypes for each sample. We demonstrate the accuracy of this model on simulated data, in-vitro mixture experiments, and clinical samples from the Cancer Genome Atlas (TCGA) project. We show that many TCGA samples are likely a mixture of multiple subtypes.

    Availability: A python module implementing our algorithm is available from http://genomics.wpi.edu/glad/

    Contact: pjflaherty@wpi.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • PROPER: comprehensive power evaluation for differential expression using RNA-seq
    [Jan 2015]

    Motivation: RNA-seq has become a routine technique in differential expression (DE) identification. Scientists face a number of experimental design decisions, including the sample size. The power for detecting differential expression is affected by several factors, including the fraction of DE genes, distribution of the magnitude of DE, distribution of gene expression level, sequencing coverage and the choice of type I error control. The complexity and flexibility of RNA-seq experiments, the high-throughput nature of transcriptome-wide expression measurements and the unique characteristics of RNA-seq data make the power assessment particularly challenging.

    Results: We propose prospective power assessment instead of a direct sample size calculation by making assumptions on all of these factors. Our power assessment tool includes two components: (i) a semi-parametric simulation that generates data based on actual RNA-seq experiments with flexible choices on baseline expressions, biological variations and patterns of DE; and (ii) a power assessment component that provides a comprehensive view of power. We introduce the concepts of stratified power and false discovery cost, and demonstrate the usefulness of our method in experimental design (such as sample size and sequencing depth), as well as analysis plan (gene filtering).

    Availability: The proposed method is implemented in a freely available R software package PROPER.

    Contact: hao.wu@emory.edu, zhijin_wu@brown.edu.

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles