Bioinformatics Journal

Bioinformatics - RSS feed of current issue
  • Comparing DNA integration site clusters with scan statistics
    [May 2014]

    Motivation: Gene therapy with retroviral vectors can induce adverse effects when those vectors integrate in sensitive genomic regions. Vectors that target sensitive regions less frequently are therefore preferred, motivating the search for localized clusters of integration sites and the comparison of the clusters formed by integration of different vectors. Scan statistics allow the discovery of spatial differences in clustering and the calculation of false discovery rates, providing statistical methods for comparing retroviral vectors.

    Results: A scan statistic for comparing two vectors using multiple window widths is proposed with software to detect clustering differentials and compute false discovery rates. Application to several sets of experimentally determined HIV integration sites demonstrates the software. Simulated datasets of various sizes and signal strengths are used to determine the power to discover clusters and evaluate a convenient lower bound. This provides a toolkit for planning evaluations of new gene therapy vectors.

    Availability and implementation: The geneRxCluster R package containing a simple tutorial and usage hints is available from http://www.bioconductor.org.

    Contact: ccberry@ucsd.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
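
    A minimal illustrative sketch of the two-sample window-scan idea described in this entry, in Python: count integration sites from each vector in sliding genomic windows and score each window for departure from the pooled proportion. The window width, step and binomial test are assumptions for illustration; this is not the geneRxCluster implementation.

      # Illustrative two-sample window scan for integration-site clustering.
      # A sketch of the general idea only, not the geneRxCluster method.
      import numpy as np
      from scipy.stats import binomtest

      def window_scan(sites_a, sites_b, width=10_000, step=5_000):
          """Score sliding windows for enrichment of vector A vs vector B sites."""
          sites_a, sites_b = np.sort(sites_a), np.sort(sites_b)
          p0 = len(sites_a) / (len(sites_a) + len(sites_b))  # pooled proportion
          start = min(sites_a[0], sites_b[0])
          end = max(sites_a[-1], sites_b[-1])
          results = []
          for left in range(int(start), int(end), step):
              right = left + width
              a = np.count_nonzero((sites_a >= left) & (sites_a < right))
              b = np.count_nonzero((sites_b >= left) & (sites_b < right))
              if a + b == 0:
                  continue
              pval = binomtest(a, a + b, p0).pvalue  # departure from pooled rate
              results.append((left, right, a, b, pval))
          return results

      # Example with simulated positions on one chromosome
      rng = np.random.default_rng(0)
      hits = window_scan(rng.integers(0, 1_000_000, 300), rng.integers(0, 1_000_000, 300))
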
  • Integrative gene set analysis of multi-platform data with sample heterogeneity
    [May 2014]

    Motivation: Gene set analysis is a popular method for large-scale genomic studies. Because genes that have common biological features are analyzed jointly, gene set analysis often achieves better power and generates more biologically informative results. With the advancement of technologies, genomic studies with multi-platform data have become increasingly common. Several strategies have been proposed that integrate genomic data from multiple platforms to perform gene set analysis. To evaluate the performances of existing integrative gene set methods under various scenarios, we conduct a comparative simulation analysis based on The Cancer Genome Atlas breast cancer dataset.

    Results: We find that existing methods for gene set analysis are less effective when sample heterogeneity exists. To address this issue, we develop three methods for multi-platform genomic data with heterogeneity: two non-parametric methods, multi-platform Mann–Whitney statistics and multi-platform outlier robust T-statistics, and a parametric method, multi-platform likelihood ratio statistics. Using simulations, we show that the proposed multi-platform Mann–Whitney statistics method has higher power for heterogeneous samples and comparable performance for homogeneous samples when compared with the existing methods. Our real data applications to two datasets of The Cancer Genome Atlas also suggest that the proposed methods are able to identify novel pathways that are missed by other strategies.

    Availability and implementation: http://www4.stat.ncsu.edu/~jytzeng/Software/Multiplatform_gene_set_analysis/

    Contact: john.hu@omicsoft.com, jhu7@ncsu.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
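
    A minimal sketch of the rank-based idea behind the multi-platform Mann–Whitney statistic in this entry: compute a standardized Mann–Whitney statistic per platform for genes in a set versus the background, then combine across platforms. The simple averaging used to combine platforms and all names are assumptions; the published method may differ.

      # Illustrative multi-platform rank statistic for a gene set (sketch only).
      import numpy as np
      from scipy.stats import mannwhitneyu

      def platform_stat(scores, in_set):
          """Standardized Mann-Whitney U comparing set genes vs background on one platform."""
          u = mannwhitneyu(scores[in_set], scores[~in_set]).statistic
          n1, n2 = in_set.sum(), (~in_set).sum()
          mu, sd = n1 * n2 / 2, np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
          return (u - mu) / sd

      def multi_platform_stat(score_matrix, in_set):
          """Average standardized statistics across platforms (rows = platforms)."""
          return np.mean([platform_stat(row, in_set) for row in score_matrix])

      rng = np.random.default_rng(1)
      scores = rng.normal(size=(3, 500))            # 3 platforms x 500 genes (per-gene scores)
      in_set = np.zeros(500, dtype=bool); in_set[:30] = True
      print(multi_platform_stat(scores, in_set))
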
  • Supercomputing for the parallelization of whole genome analysis
    [May 2014]

    Motivation: The declining cost of generating DNA sequence is promoting an increase in whole genome sequencing, especially as applied to the human genome. Whole genome analysis requires the alignment and comparison of raw sequence data, and results in a computational bottleneck because of limited ability to analyze multiple genomes simultaneously.

    Results: We adapted a Cray XE6 supercomputer to achieve the parallelization required for concurrent analysis of multiple genomes. This approach not only markedly reduces computational time but also increases the usable sequence per genome. Relying on publicly available software, the Cray XE6 has the capacity to align and call variants on 240 whole genomes in ~50 h. Multisample variant calling is also accelerated.

    Availability and implementation: The MegaSeq workflow is designed to harness the size and memory of the Cray XE6, housed at Argonne National Laboratory, for whole genome analysis in a platform designed to better match current and emerging sequencing volume.

    Contact: emcnally@uchicago.edu

    Categories: Journal Articles
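
    The entry describes fanning out per-genome alignment and variant-calling jobs across many nodes. Below is a minimal single-machine sketch of the same embarrassingly parallel pattern using a process pool; the command is a placeholder, not the MegaSeq workflow or its Cray XE6 job scheduling.

      # Minimal sketch of per-sample parallel processing (single machine).
      # MegaSeq itself targets a Cray XE6; the command below is a placeholder.
      import subprocess
      from concurrent.futures import ProcessPoolExecutor

      def process_genome(sample_id):
          # Placeholder step; a real workflow would run alignment and variant calling here.
          cmd = ["echo", f"align-and-call {sample_id}"]
          return sample_id, subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

      if __name__ == "__main__":   # required for process pools on spawn-based platforms
          samples = [f"genome_{i:03d}" for i in range(8)]
          with ProcessPoolExecutor(max_workers=4) as pool:
              for sample, out in pool.map(process_genome, samples):
                  print(sample, "->", out)
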
  • Visualization and probability-based scoring of structural variants within repetitive sequences
    [May 2014]

    Motivation: Repetitive sequences account for approximately half of the human genome. Accurately ascertaining sequences in these regions with next-generation sequencers is challenging, and requires a different set of analytical techniques than for reads originating from unique sequences. Complicating the matter are repetitive regions subject to programmed rearrangements, as is the case with the antigen-binding domains in the immunoglobulin (Ig) and T-cell receptor (TCR) loci.

    Results: We developed a probability-based score and visualization method to aid in distinguishing true structural variants from alignment artifacts. We demonstrate the usefulness of this method in its ability to separate real structural variants from false positives generated with existing upstream analysis tools. We validated our approach using both target-capture and whole-genome experiments. Capture sequencing reads were generated from primary lymphoid tumors, cancer cell lines and an EBV-transformed lymphoblast cell line over the Ig and TCR loci. Whole-genome sequencing reads were from a lymphoblastoid cell-line.

    Availability: We implement our method as an R package available at https://github.com/Eitan177/targetSeqView. Code to reproduce the figures and results is also available.

    Contact: ehalper2@jhmi.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition
    [May 2014]

    Motivation: Nucleosome positioning participates in many cellular activities and plays significant roles in regulating cellular processes. With the avalanche of genome sequences generated in the post-genomic age, it is highly desirable to develop automated methods for rapidly and effectively identifying nucleosome positioning. Although some computational methods have been proposed, most are species specific and neglect the intrinsic local structural properties that might play important roles in determining nucleosome positioning on a DNA sequence.

    Results: Here, a predictor called ‘iNuc-PseKNC’ was developed for predicting nucleosome positioning in the Homo sapiens, Caenorhabditis elegans and Drosophila melanogaster genomes. In the new predictor, DNA sequence samples are formulated with a novel feature vector called ‘pseudo k-tuple nucleotide composition’, into which six DNA local structural properties are incorporated. Rigorous cross-validation tests on three stringent benchmark datasets showed that iNuc-PseKNC achieved overall success rates of 86.27%, 86.90% and 79.97%, respectively, in predicting nucleosome positioning for the three genomes. Meanwhile, results obtained by iNuc-PseKNC on various benchmark datasets used by previous investigators for different genomes also indicate that the current predictor remarkably outperforms its counterparts.

    Availability: A user-friendly web-server, iNuc-PseKNC is freely accessible at http://lin.uestc.edu.cn/server/iNuc-PseKNC.

    Contact: hlin@uestc.edu.cn, wchen@gordonlifescience.org, kcchou@gordonlifescience.org

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
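
    A minimal sketch of the plain k-tuple composition part of the feature vector described in this entry; the ‘pseudo’ correlation terms that incorporate the six DNA local structural properties are omitted, and all names are illustrative.

      # Illustrative k-tuple (k-mer) composition features for a DNA sequence.
      # The full pseudo k-tuple composition also adds correlation terms based on
      # local structural properties, which are omitted in this sketch.
      from itertools import product
      import numpy as np

      def ktuple_composition(seq, k=3):
          """Return normalized frequencies of all 4**k k-tuples, in lexicographic order."""
          seq = seq.upper()
          kmers = ["".join(p) for p in product("ACGT", repeat=k)]
          index = {kmer: i for i, kmer in enumerate(kmers)}
          counts = np.zeros(len(kmers))
          for i in range(len(seq) - k + 1):
              kmer = seq[i:i + k]
              if kmer in index:                 # skip tuples containing N, etc.
                  counts[index[kmer]] += 1
          total = counts.sum()
          return counts / total if total else counts

      features = ktuple_composition("ACGTACGTTTGACGATCGATCG", k=3)
      print(features.shape)   # (64,)
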
  • Unraveling the outcome of 16S rDNA-based taxonomy analysis through mock data and simulations
    [May 2014]

    Motivation: 16S rDNA pyrosequencing is a powerful approach that requires extensive usage of computational methods for delineating microbial compositions. Previously, it was shown that outcomes of studies relying on this approach vastly depend on the choice of pre-processing and clustering algorithms used. However, obtaining insights into the effects and accuracy of these algorithms is challenging due to difficulties in generating samples of known composition with high enough diversity. Here, we use in silico microbial datasets to better understand how the experimental data are transformed into taxonomic clusters by computational methods.

    Results: We were able to qualitatively replicate the raw experimental pyrosequencing data after rigorously adjusting existing simulation software. This allowed us to simulate datasets of real-life complexity, which we used to assess the influence and performance of two widely used pre-processing methods along with 11 clustering algorithms. We show that the choice, order and mode of the pre-processing methods have a larger impact on the accuracy of the clustering pipeline than the clustering methods themselves. Without pre-processing, the difference between the performances of clustering methods is large. Depending on the clustering algorithm, the optimal analysis pipeline resulted in significant underestimations of the expected number of clusters (minimum: 3.4%; maximum: 13.6%), allowing us to make quantitative estimates of the bacterial complexity of real microbiome samples.

    Contact: a.may@vu.nl or b.brandt@acta.nl

    Supplementary information: Supplementary data are available at Bioinformatics online. The simulated datasets are available via http://www.ibi.vu.nl/downloads.

    Categories: Journal Articles
  • SegAnnDB: interactive Web-based genomic segmentation
    [May 2014]

    Motivation: DNA copy number profiles characterize regions of chromosome gains, losses and breakpoints in tumor genomes. Although many models have been proposed to detect these alterations, it is not clear which model is appropriate before visual inspection of the signal, noise and models for a particular profile.

    Results: We propose SegAnnDB, a Web-based computer vision system for genomic segmentation: the user first visually inspects the profiles and manually annotates altered regions, and SegAnnDB then determines the precise alteration locations using a mathematical model of the data and annotations. SegAnnDB facilitates collaboration between biologists and bioinformaticians, and uses the University of California, Santa Cruz genome browser to visualize copy number alterations alongside known genes.

    Availability and implementation: The breakpoints project on INRIA GForge hosts the source code; an Amazon Machine Image can be launched, and a demonstration Web site is available at http://bioviz.rocq.inria.fr.

    Contact: toby@sg.cs.titech.ac.jp

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
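
    A minimal sketch of the annotation-guided idea behind this entry: segment a copy-number profile over a grid of penalties and keep the penalty whose breakpoints best agree with manually annotated regions. The greedy binary-segmentation routine, agreement score and toy data below are assumptions for illustration, not the SegAnnDB model.

      # Sketch: choose a segmentation penalty consistent with manual annotations.
      # Simple greedy split + agreement score for illustration only.
      import numpy as np

      def best_split(y):
          """Return (gain, index) for the single split that most reduces squared error."""
          n = len(y)
          total = ((y - y.mean()) ** 2).sum()
          best = (0.0, None)
          for i in range(2, n - 1):
              sse = ((y[:i] - y[:i].mean()) ** 2).sum() + ((y[i:] - y[i:].mean()) ** 2).sum()
              best = max(best, (total - sse, i), key=lambda t: t[0])
          return best

      def segment(y, penalty):
          """Greedy binary segmentation: keep splitting while the gain exceeds the penalty."""
          breaks, queue = [], [(0, len(y))]
          while queue:
              lo, hi = queue.pop()
              if hi - lo < 4:
                  continue
              gain, i = best_split(y[lo:hi])
              if i is not None and gain > penalty:
                  breaks.append(lo + i)
                  queue += [(lo, lo + i), (lo + i, hi)]
          return sorted(breaks)

      def agreement(breaks, annotated_regions):
          """Count annotated regions containing exactly one predicted breakpoint."""
          return sum(sum(lo <= b < hi for b in breaks) == 1 for lo, hi in annotated_regions)

      rng = np.random.default_rng(2)
      profile = np.concatenate([rng.normal(0, .3, 100), rng.normal(1.5, .3, 60), rng.normal(0, .3, 100)])
      annotations = [(80, 120), (140, 180)]   # curator-marked "one breakpoint here" regions
      best_pen = max([1, 5, 10, 20, 50], key=lambda p: agreement(segment(profile, p), annotations))
      print(best_pen, segment(profile, best_pen))
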
  • An efficient algorithm for accurate computation of the Dirichlet-multinomial log-likelihood function
    [May 2014]

    Summary: The Dirichlet-multinomial (DMN) distribution is a fundamental model for multicategory count data with overdispersion. This distribution has many uses in bioinformatics, including applications to metagenomics data, transcriptomics and alternative splicing. The DMN distribution reduces to the multinomial distribution when the overdispersion parameter is 0. Unfortunately, numerical computation of the DMN log-likelihood function by conventional methods is unstable in the neighborhood of zero overdispersion. An alternative formulation circumvents this instability, but it leads to long runtimes that make it impractical for the large count data common in bioinformatics. We have developed a new method for computing the DMN log-likelihood that solves the instability problem without incurring long runtimes. The new approach is composed of a novel formula and an algorithm to extend its applicability. Our numerical experiments show that this new method improves both the accuracy of log-likelihood evaluation and the runtime by several orders of magnitude, especially in the high-count situations that are common in deep sequencing data. Using real metagenomic data, our method achieves a manyfold runtime improvement. Our method increases the feasibility of using the DMN distribution to model many high-throughput problems in bioinformatics. We include an R package giving access to this method and a vignette applying this approach to metagenomic data.

    Availability and implementation: An implementation of the algorithm together with a vignette describing its use is available in Supplementary data.

    Contact: pengyu.bio@gmail.com or cashaw@bcm.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
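
    For reference, the conventional closed-form DMN log-likelihood (up to the multinomial coefficient) can be written with log-gamma functions, as sketched below; as the entry notes, direct evaluation of this form becomes numerically unstable as the overdispersion approaches zero, which is the problem the proposed method addresses. Variable names are illustrative, and this is not the paper's new algorithm.

      # Conventional Dirichlet-multinomial log-likelihood via log-gamma functions.
      # Standard textbook form, shown for reference; the paper's contribution is a
      # different evaluation strategy that avoids instability near zero overdispersion.
      import numpy as np
      from scipy.special import gammaln

      def dmn_loglik(counts, alpha):
          """log P(counts | alpha) for one sample, ignoring the multinomial coefficient."""
          counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
          n, a0 = counts.sum(), alpha.sum()
          return (gammaln(a0) - gammaln(n + a0)
                  + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

      print(dmn_loglik([10, 3, 0, 7], [2.0, 1.0, 0.5, 1.5]))
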
  • Analysis of gene expression data using a linear mixed model/finite mixture model approach: application to regional differences in the human brain
    [May 2014]

    Motivation: Gene expression data exhibit common information over the genome. This article shows how such data can be analysed from an efficient whole-genome perspective. The methods have also been developed so that users with limited expertise in bioinformatics and statistical computing can use and modify the procedure for their own needs. The method outlined first fits a large-scale linear mixed model to the expression data genome-wide, and then uses finite mixture models to separate differentially expressed (DE) from non-DE transcripts. These methods are illustrated through application to an exceptional UK Brain Expression Consortium dataset covering 12 regions of frozen post-mortem human brain.

    Results: Fitting linear mixed models has allowed variation in gene expression between different biological states (e.g. brain regions, gender, age) to be investigated. The model can be extended to allow for differing levels of variation between different biological states. Predicted values of the random effects show the effects of each transcript in a particular biological state. Using the UK Brain Expression Consortium data, this approach yielded striking patterns of co-regional gene expression. Fitting the finite mixture model to the effects within each state provides a convenient method to filter transcripts that are DE: these DE transcripts can then be extracted for advanced functional analysis.

    Availability: The data for all regions except HYPO and SPCO are available at the Gene Expression Omnibus (GEO) site, accession number GSE46706. R code for the analysis is available in the Supplementary file.

    Contact: peter.thomson@sydney.edu.au

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
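
    A minimal sketch of the second stage only: fit a finite (here two-component Gaussian) mixture to per-transcript effect estimates and flag transcripts with high posterior probability of belonging to the non-null component. In the entry the effects come from the linear mixed model; here they are simulated, and the component count and 0.9 cutoff are assumptions.

      # Sketch of the mixture-model filtering step applied to per-transcript effects.
      # Effects are simulated here; the 2-component choice and 0.9 cutoff are illustrative.
      import numpy as np
      from sklearn.mixture import GaussianMixture

      rng = np.random.default_rng(3)
      effects = np.concatenate([rng.normal(0, 0.2, 9000),      # non-DE bulk
                                rng.normal(1.0, 0.4, 1000)])   # DE tail
      gm = GaussianMixture(n_components=2, random_state=0).fit(effects.reshape(-1, 1))
      de_component = np.argmax(gm.means_.ravel())               # component with larger mean
      posterior = gm.predict_proba(effects.reshape(-1, 1))[:, de_component]
      de_transcripts = np.where(posterior > 0.9)[0]
      print(len(de_transcripts), "transcripts flagged as DE")
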
  • Learning phenotype densities conditional on many interacting predictors
    [May 2014]

    Motivation: Estimating a phenotype distribution conditional on a set of discrete-valued predictors is a commonly encountered task. For example, interest may be in how the density of a quantitative trait varies with single nucleotide polymorphisms and patient characteristics. The subset of important predictors is not usually known in advance. This becomes more challenging with a high-dimensional predictor set when there is the possibility of interaction.

    Results: We demonstrate a novel non-parametric Bayes method based on a tensor factorization of predictor-dependent weights for Gaussian kernels. The method uses multistage predictor selection for dimension reduction, providing succinct models for the phenotype distribution. The resulting conditional density morphs flexibly with the selected predictors. In a simulation study and an application to molecular epidemiology data, we demonstrate advantages over commonly used methods.

    Availability and implementation: MATLAB code available at https://googledrive.com/host/0Bw6KIFB-k4IOOWQ0dFJtSVZxNE0/ktdctf.html

    Contact: dave.kessler@gmail.com

    Categories: Journal Articles
  • Complete enumeration of elementary flux modes through scalable demand-based subnetwork definition
    [May 2014]

    Motivation: Elementary flux mode analysis (EFMA) decomposes complex metabolic network models into tractable biochemical pathways, which have been used for rational design and analysis of metabolic and regulatory networks. However, application of EFMA has often been limited to targeted or simplified metabolic network representations due to computational demands of the method.

    Results: Division of biological networks into subnetworks enables the complete enumeration of elementary flux modes (EFMs) for metabolic models of a broad range of complexities, including genome-scale. Here, subnetworks are defined using serial dichotomous suppression and enforcement of flux through model reactions. Rules for selecting appropriate reactions to generate subnetworks are proposed and tested; three test cases, including both prokaryotic and eukaryotic network models, verify the efficacy of these rules and demonstrate completeness and reproducibility of EFM enumeration. Division of models into subnetworks is demand-based and automated; computationally intractable subnetworks are further divided until the entire solution space is enumerated. To demonstrate the strategy’s scalability, the splitting algorithm was implemented using an EFMA software package (EFMTool) and Windows PowerShell on a 50 node Microsoft high performance computing cluster. Enumeration of the EFMs in a genome-scale metabolic model of a diatom, Phaeodactylum tricornutum, identified ~2 billion EFMs. The output represents an order of magnitude increase in EFMs computed compared with other published algorithms and demonstrates a scalable framework for EFMA of most systems.

    Availability and implementation: http://www.chbe.montana.edu/RossC.

    Contact: rossc@erc.montana.edu or kristopher.hunt@erc.montana.edu

    Supplementary Information: Supplemental materials are available at Bioinformatics online.

    Categories: Journal Articles
  • Identifying critical transitions of complex diseases based on a single sample
    [May 2014]

    Motivation: Unlike traditional diagnosis of an existing disease state, detecting the pre-disease state just before the serious deterioration of a disease is a challenging task, because the state of the system may show little apparent change or symptoms before this critical transition during disease progression. By exploring the rich interaction information provided by high-throughput data, the dynamical network biomarker (DNB) can identify the pre-disease state, but this requires multiple samples to reach a correct diagnosis for one individual, thereby restricting its clinical application.

    Results: In this article, we have developed a novel computational approach based on the DNB theory and the differential distributions between the expressions of DNB and non-DNB molecules, which can detect the pre-disease state reliably even from a single sample taken from one individual, by compensating for insufficient samples with existing datasets from population studies. Our approach has been validated by the successful identification of pre-disease samples from individuals before the emergence of disease symptoms for acute lung injury, influenza and breast cancer.

    Contact: lnchen@sibs.ac.cn.

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Event trigger identification for biomedical events extraction using domain knowledge
    [May 2014]

    Motivation: In molecular biology, molecular events describe observable alterations of biomolecules, such as binding of proteins or RNA production. These events might be responsible for drug reactions or the development of certain diseases. As such, biomedical event extraction, the process of automatically detecting descriptions of molecular interactions in research articles, has recently attracted substantial research interest. Event trigger identification, detecting the words describing the event types, is a crucial and prerequisite step in the pipeline process of biomedical event extraction. Taking the event types as classes, event trigger identification can be viewed as a classification task. For each word in a sentence, a trained classifier predicts, based on context features, whether the word corresponds to an event type and, if so, which one. Therefore, a well-designed feature set with a good level of discrimination and generalization is crucial for the performance of event trigger identification.

    Results: In this article, we propose a novel framework for event trigger identification. In particular, we learn biomedical domain knowledge from a large text corpus built from Medline and embed it into word features using neural language modeling. The embedded features are then combined with syntactic and semantic context features using multiple kernel learning. The combined feature set is used to train the event trigger classifier. Experimental results on the gold-standard corpus show that the proposed framework achieves an improvement of >2.5% in F-score over the state-of-the-art approach, demonstrating its effectiveness.

    Availability and implementation: The source code for the proposed framework is freely available and can be downloaded at http://cse.seu.edu.cn/people/zhoudeyu/ETI_Sourcecode.zip.

    Contact: d.zhou@seu.edu.cn

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
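
    A minimal stand-in for the kernel-combination step described in this entry: build one kernel from embedding features and one from context features, average them with fixed weights (true multiple kernel learning would learn these weights), and train an SVM on the combined kernel. The feature matrices below are random placeholders rather than Medline-derived data, and the 0.5/0.5 weights are assumptions.

      # Sketch of combining word-embedding and context features via kernel averaging.
      # Real MKL learns the kernel weights; fixed weights and random features used here.
      import numpy as np
      from sklearn.svm import SVC
      from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

      rng = np.random.default_rng(4)
      emb = rng.normal(size=(200, 50))        # word-embedding features per token
      ctx = rng.normal(size=(200, 30))        # syntactic/semantic context features
      y = rng.integers(0, 2, 200)             # trigger vs non-trigger label

      K = 0.5 * linear_kernel(emb) + 0.5 * rbf_kernel(ctx)   # combined training kernel
      clf = SVC(kernel="precomputed").fit(K, y)

      # Prediction needs the kernel between new tokens and the training tokens:
      new_emb, new_ctx = rng.normal(size=(5, 50)), rng.normal(size=(5, 30))
      K_new = 0.5 * linear_kernel(new_emb, emb) + 0.5 * rbf_kernel(new_ctx, ctx)
      print(clf.predict(K_new))
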
  • AMASS: a database for investigating protein structures
    [May 2014]

    Motivation: Modern techniques have produced many sequence annotation databases and protein structure portals, but these Web resources are rarely integrated in ways that permit straightforward exploration of protein functional residues and their co-localization.

    Results: We have created the AMASS database, which maps 1D sequence annotation databases to 3D protein structures with an intuitive visualization interface. Our platform also provides an analysis service that screens mass spectrometry sequence data for post-translational modifications that reside in functionally relevant locations within protein structures. The system is built on the premise that functional residues such as active sites, cancer mutations and post-translational modifications within proteins may co-localize and share common functions.

    Availability and implementation: The AMASS database is implemented with Biopython and Apache as a freely available Web server at amass-db.org.

    Contact: clinton.mielke@gmail.com

    Categories: Journal Articles
  • The cleverSuite approach for protein characterization: predictions of structural properties, solubility, chaperone requirements and RNA-binding abilities
    [May 2014]

    Motivation: The recent shift towards high-throughput screening is posing new challenges for the interpretation of experimental results. Here we propose the cleverSuite approach for large-scale characterization of protein groups.

    Description: The central part of the cleverSuite is the cleverMachine (CM), an algorithm that performs statistics on protein sequences by comparing their physico-chemical propensities. The second element is called cleverClassifier and builds on top of the models generated by the CM to allow classification of new datasets.

    Results: We applied the cleverSuite to predict secondary structure properties, solubility, chaperone requirements and RNA-binding abilities. Using cross-validation and independent datasets, the cleverSuite reproduces experimental findings with great accuracy and provides models that can be used for future investigations.

    Availability: The intuitive interface for dataset exploration, analysis and prediction is available at http://s.tartaglialab.com/clever_suite.

    Contact: gian.tartaglia@crg.es

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
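
    A minimal sketch of the propensity-comparison idea behind this entry: score each sequence with one physico-chemical scale (approximate Kyte–Doolittle hydrophobicity here) and test whether two protein groups differ. The cleverMachine aggregates many scales with its own statistics; the sequences and the single scale below are illustrative assumptions.

      # Sketch: compare one physico-chemical propensity (hydrophobicity) between two protein groups.
      # Values are approximate Kyte-Doolittle hydropathy scores.
      import numpy as np
      from scipy.stats import mannwhitneyu

      KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
            "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
            "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

      def mean_hydrophobicity(seq):
          vals = [KD[aa] for aa in seq.upper() if aa in KD]
          return np.mean(vals) if vals else 0.0

      group1 = ["MKTAYIAKQR", "GAVLIMFWP"]        # placeholder sequences
      group2 = ["DDEEKKRRNN", "STQNHRKDE"]
      s1, s2 = ([mean_hydrophobicity(s) for s in g] for g in (group1, group2))
      print(mannwhitneyu(s1, s2).pvalue)
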
  • A benchmark for comparison of cell tracking algorithms
    [May 2014]

    Motivation: Automatic tracking of cells in multidimensional time-lapse fluorescence microscopy is an important task in many biomedical applications. A novel framework for objective evaluation of cell tracking algorithms has been established under the auspices of the IEEE International Symposium on Biomedical Imaging 2013 Cell Tracking Challenge. In this article, we present the logistics, datasets, methods and results of the challenge and lay down the principles for future uses of this benchmark.

    Results: The main contributions of the challenge include the creation of a comprehensive video dataset repository and the definition of objective measures for comparison and ranking of the algorithms. With this benchmark, six algorithms covering a variety of segmentation and tracking paradigms have been compared and ranked based on their performance on both synthetic and real datasets. Given the diversity of the datasets, we do not declare a single winner of the challenge. Instead, we present and discuss the results for each individual dataset separately.

    Availability and implementation: The challenge Web site (http://www.codesolorzano.com/celltrackingchallenge) provides access to the training and competition datasets, along with the ground truth of the training videos. It also provides access to Windows and Linux executable files of the evaluation software and most of the algorithms that competed in the challenge.

    Contact: codesolorzano@unav.es

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • bwtool: a tool for bigWig files
    [May 2014]

    BigWig is a compressed, indexed, binary format for genome-wide signal data, whether computed (e.g. GC percent) or experimental (e.g. ChIP-seq/RNA-seq read depth). bwtool is designed to read bigWig files rapidly and efficiently, providing functionality for extracting data and summarizing it in several ways, globally or at specific regions. Additionally, the tool enables conversion of the positions of signal data from one genome assembly to another, also known as ‘lifting’. We believe bwtool will be useful for analysts who frequently work with bigWig data, which is becoming a standard format for representing functional signals along genomes. The article includes supplementary examples of running the software.

    Availability and implementation: The C source code is freely available under the GNU public license v3 at http://cromatina.crg.eu/bwtool.

    Contact: andrew.pohl@crg.eu, andypohl@gmail.com

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
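
    For readers working in Python rather than with the C tool itself, the same kind of bigWig extraction and summarization can be sketched with the pyBigWig library (a separate project, not part of bwtool); the file path and intervals below are placeholders.

      # Sketch of extracting and summarizing bigWig signal in Python with pyBigWig.
      # pyBigWig is a separate project; bwtool itself is a C command-line tool.
      import pyBigWig

      bw = pyBigWig.open("signal.bw")                     # placeholder path
      print(bw.chroms())                                  # chromosomes and lengths
      print(bw.stats("chr1", 0, 100_000, type="mean"))    # summary over a region
      print(bw.values("chr1", 0, 10))                     # per-base values
      bw.close()
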
  • HiBrowse: multi-purpose statistical analysis of genome-wide chromatin 3D organization
    [May 2014]

    Summary: Recently developed methods that couple next-generation sequencing with chromosome conformation capture-based techniques, such as Hi-C and ChIA-PET, allow for characterization of genome-wide chromatin 3D structure. Understanding the organization of chromatin in three dimensions is a crucial next step in the unraveling of global gene regulation, and methods for analyzing such data are needed. We have developed HiBrowse, a user-friendly web-tool consisting of a range of hypothesis-based and descriptive statistics, using realistic assumptions in null-models.

    Availability and implementation: HiBrowse is supported by all major browsers, and is freely available at http://hyperbrowser.uio.no/3d. Software is implemented in Python, and source code is available for download by following instructions on the main site.

    Contact: jonaspau@ifi.uio.no

    Supplementary Information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • LPmerge: an R package for merging genetic maps by linear programming
    [May 2014]

    Summary: Consensus genetic maps constructed from multiple populations are an important resource for both basic and applied research, including genome-wide association analysis, genome sequence assembly and studies of evolution. The LPmerge software uses linear programming to efficiently minimize the mean absolute error between the consensus map and the linkage maps from each population. This minimization is performed subject to linear inequality constraints that ensure the ordering of the markers in the linkage maps is preserved. When marker order is inconsistent between linkage maps, a minimum set of ordinal constraints is deleted to resolve the conflicts.

    Availability and implementation: LPmerge is on CRAN at http://cran.r-project.org/web/packages/LPmerge.

    Contact: endelman@wisc.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
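
    A minimal sketch of the linear-programming idea described in this entry: minimize the absolute error between consensus and population map positions, subject to ordering constraints on the consensus, using standard L1 auxiliary variables. The toy data and naming are illustrative; LPmerge's actual formulation and its deletion of conflicting ordinal constraints are more involved.

      # Sketch of consensus-map merging as a linear program (L1 error, order constraints).
      import numpy as np
      from scipy.optimize import linprog

      markers = ["m1", "m2", "m3", "m4"]
      # observed positions (cM) of each marker in two population maps
      maps = {"pop1": [0.0, 5.0, 12.0, 20.0],
              "pop2": [0.0, 7.0, 10.0, 22.0]}

      m = len(markers)
      obs = [(j, pos) for pop in maps.values() for j, pos in enumerate(pop) if not np.isnan(pos)]
      k = len(obs)
      # variable vector x = [c_1..c_m, e_1..e_k]; minimize the sum of errors e
      cost = np.concatenate([np.zeros(m), np.ones(k)])
      A_ub, b_ub = [], []
      for r, (j, pos) in enumerate(obs):
          row = np.zeros(m + k); row[j] = 1;  row[m + r] = -1   #  c_j - e_r <= pos
          A_ub.append(row); b_ub.append(pos)
          row = np.zeros(m + k); row[j] = -1; row[m + r] = -1   # -c_j - e_r <= -pos
          A_ub.append(row); b_ub.append(-pos)
      for j in range(m - 1):                                    # keep consensus order: c_j <= c_{j+1}
          row = np.zeros(m + k); row[j] = 1; row[j + 1] = -1
          A_ub.append(row); b_ub.append(0.0)
      res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                    bounds=[(0, None)] * (m + k), method="highs")
      print(dict(zip(markers, np.round(res.x[:m], 2))))
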
  • StochHMM: a flexible hidden Markov model tool and C++ library
    [May 2014]

    Summary: Hidden Markov models (HMMs) are probabilistic models that are well suited to many classification problems in computational biology. StochHMM provides a command-line program and C++ library that can implement a traditional HMM from a simple text file. StochHMM provides researchers with the flexibility to create higher-order emissions and to integrate additional data sources and/or user-defined functions at multiple points within the HMM framework. Additional features include user-defined alphabets, the ability to handle ambiguous characters in an emission-dependent manner, user-defined weighting of state paths and the ability to tie transition probabilities to sequence.

    Availability and implementation: StochHMM is implemented in C++ and is available under the MIT License. Software, source code, documentation and examples can be found at http://github.com/KorfLab/StochHMM.

    Contact: ifkorf@ucdavis.edu

    Categories: Journal Articles
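
    As a reminder of what such an HMM tool computes, a minimal Viterbi decoder for a small discrete HMM is sketched below in plain Python/NumPy; it illustrates standard HMM decoding only, not StochHMM's text-file model format or its extended features, and the toy two-state model is an assumption.

      # Minimal Viterbi decoding for a discrete HMM (standard HMM decoding;
      # StochHMM adds higher-order emissions, user-defined functions, etc.).
      import numpy as np

      def viterbi(obs, start_p, trans_p, emit_p):
          """Return the most likely state path for an observation index sequence."""
          n_states, T = len(start_p), len(obs)
          logv = np.full((T, n_states), -np.inf)
          back = np.zeros((T, n_states), dtype=int)
          logv[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
          for t in range(1, T):
              for s in range(n_states):
                  scores = logv[t - 1] + np.log(trans_p[:, s])
                  back[t, s] = np.argmax(scores)
                  logv[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
          path = [int(np.argmax(logv[-1]))]
          for t in range(T - 1, 0, -1):
              path.append(back[t, path[-1]])
          return path[::-1]

      # Toy 2-state GC-content-like model over a DNA sequence mapped to indices
      alphabet = {"A": 0, "C": 1, "G": 2, "T": 3}
      seq = [alphabet[c] for c in "ATGCGCGCATAT"]
      start = np.array([0.5, 0.5])
      trans = np.array([[0.9, 0.1], [0.1, 0.9]])
      emit = np.array([[0.35, 0.15, 0.15, 0.35],   # AT-rich state
                       [0.15, 0.35, 0.35, 0.15]])  # GC-rich state
      print(viterbi(seq, start, trans, emit))
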