Bioinformatics Journal

Bioinformatics - RSS feed of current issue
• Toward better understanding of artifacts in variant calling from high-coverage samples[Oct 2014]

Motivation: Whole-genome high-coverage sequencing has been widely used for personal and cancer genomics as well as in various research areas. However, in the absence of an unbiased whole-genome truth set, the global error rate of variant calls and the leading causal artifacts remain unclear despite great efforts in the evaluation of variant calling methods.

Results: We made 10 single nucleotide polymorphism and INDEL call sets with two read mappers and five variant callers, on both a haploid human genome and a diploid genome at similar coverage. By investigating false heterozygous calls in the haploid genome, we identified erroneous realignment in low-complexity regions and an incomplete reference genome with respect to the sample as the two major sources of errors, which call for continued improvements in these two areas. We estimated that the error rate of raw genotype calls is as high as 1 in 10–15 kb, but the error rate of post-filtered calls is reduced to 1 in 100–200 kb without significantly compromising sensitivity.

Availability and implementation: BWA-MEM alignment and raw variant calls are available at http://bit.ly/1g8XqRt; scripts and miscellaneous data at https://github.com/lh3/varcmp.

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• H3M2: detection of runs of homozygosity from whole-exome sequencing data[Oct 2014]

Motivation: Runs of homozygosity (ROH) are sizable chromosomal stretches of homozygous genotypes, ranging in length from tens of kilobases to megabases. ROHs can be relevant for population and medical genetics, playing a role in predisposition to both rare and common disorders. ROHs are commonly detected with single nucleotide polymorphism (SNP) microarrays, but attempts have been made to use whole-exome sequencing (WES) data. Currently available methods, developed for the analysis of uniformly spaced SNP-array maps, are not easily adapted to the sparse and non-uniform marker distribution of the WES target design.

Results: To meet the need for an approach specifically tailored to WES data, we developed H3M2, an original algorithm based on a heterogeneous hidden Markov model that incorporates inter-marker distances to detect ROH from WES data. We evaluated the ability of H3M2 to correctly identify ROHs on synthetic chromosomes and examined its accuracy in detecting ROHs of different lengths (short, medium and long) from real 1000 Genomes Project data. H3M2 turned out to be more accurate than GERMLINE and PLINK, two state-of-the-art algorithms, especially in the detection of short and medium ROHs.
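The distance-aware idea behind a heterogeneous HMM can be sketched as a toy two-state Viterbi decoder in which the probability of switching between ROH and non-ROH states grows with the gap between consecutive markers. The emission model and every parameter value below are hypothetical illustrations, not the published H3M2 settings:

```python
import math

def viterbi_roh(genotypes, positions, p_het_roh=0.01, p_het_norm=0.35,
                switch=0.05, scale=100_000):
    """Toy distance-aware two-state Viterbi (state 0 = ROH, 1 = non-ROH).

    genotypes: 0 (homozygous) / 1 (heterozygous) calls per marker.
    positions: genomic coordinates of the markers (same length).
    The switch probability rises with inter-marker distance, mimicking
    the heterogeneous-HMM idea; all parameters are hypothetical.
    """
    emit = {0: (math.log(1 - p_het_roh), math.log(1 - p_het_norm)),
            1: (math.log(p_het_roh), math.log(p_het_norm))}
    v = [emit[genotypes[0]][0] + math.log(0.5),
         emit[genotypes[0]][1] + math.log(0.5)]
    back = []
    dists = [b - a for a, b in zip(positions, positions[1:])]
    for g, d in zip(genotypes[1:], dists):
        p_sw = max(switch * (1 - math.exp(-d / scale)), 1e-12)
        lsw, lst = math.log(p_sw), math.log(1 - p_sw)
        ptr, new = [], []
        for s in (0, 1):
            stay = v[s] + lst          # remain in state s
            move = v[1 - s] + lsw      # distance-dependent switch
            if stay >= move:
                new.append(stay + emit[g][s]); ptr.append(s)
            else:
                new.append(move + emit[g][s]); ptr.append(1 - s)
        v = new
        back.append(ptr)
    state = 0 if v[0] >= v[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]  # 0 = ROH, 1 = non-ROH
```

A long homozygous stretch followed by a heterozygote-rich stretch is segmented into an ROH call and a non-ROH call, despite single-marker noise being tolerated by the emission probabilities.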

Availability and implementation: H3M2 is a collection of bash, R and Fortran scripts and is freely available at https://sourceforge.net/projects/h3m2/.

Contact: albertomagi@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• CNV-guided multi-read allocation for ChIP-seq[Oct 2014]

Motivation: In chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) and other short-read sequencing experiments, a considerable fraction of the short reads align to multiple locations on the reference genome (multi-reads). Inferring the origin of multi-reads is critical for accurately mapping reads to repetitive regions. Current state-of-the-art multi-read allocation algorithms rely on the read counts in the local neighborhood of the alignment locations and ignore the variation in the copy numbers of these regions. Copy-number variation (CNV) can directly affect the read densities and, therefore, bias allocation of multi-reads.

Results: We propose cnvCSEM (CNV-guided ChIP-Seq by expectation-maximization algorithm), a flexible framework that incorporates CNV in multi-read allocation. cnvCSEM eliminates the CNV bias in multi-read allocation by initializing the read allocation algorithm with CNV-aware initial values. Our data-driven simulations illustrate that cnvCSEM leads to higher read coverage with satisfactory accuracy and lower loss in read-depth recovery (estimation). We evaluate the biological relevance of the cnvCSEM-allocated reads and the resultant peaks with the analysis of several ENCODE ChIP-seq datasets.
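The CNV-aware allocation idea can be illustrated with a toy EM-style loop that initializes read densities as coverage divided by copy number, so that a high-copy region does not soak up multi-reads merely because it is duplicated. The function and update scheme are simplified stand-ins, not the published cnvCSEM algorithm:

```python
def allocate_multireads(unique_counts, copy_number, multireads, n_iter=20):
    """Toy EM-style multi-read allocation with a CNV-aware prior.

    unique_counts[r]: uniquely mapped reads in region r.
    copy_number[r]:  estimated copy number of region r.
    multireads:      list of candidate-region lists, one per multi-read.
    Returns fractional allocation weights per multi-read.
    Loosely inspired by the cnvCSEM idea; not the published algorithm.
    """
    # CNV-aware initialization: divide coverage by copy number
    density = {r: unique_counts[r] / copy_number[r] for r in unique_counts}
    weights = []
    for _ in range(n_iter):
        weights = []
        extra = {r: 0.0 for r in density}
        # E-step: split each multi-read proportionally to local density
        for regions in multireads:
            tot = sum(density[r] for r in regions)
            w = [density[r] / tot if tot else 1 / len(regions)
                 for r in regions]
            weights.append(w)
            for r, wi in zip(regions, w):
                extra[r] += wi
        # M-step: refresh densities with the allocated reads
        density = {r: (unique_counts[r] + extra[r]) / copy_number[r]
                   for r in unique_counts}
    return weights
```

In the example below, region A has 9x the unique coverage of region B but 3x its copy number, so the CNV-aware weights give A a smaller share than a naive coverage-proportional split (0.9) would.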

Availability and implementation: Available at http://www.stat.wisc.edu/~qizhang/

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• Learning protein-DNA interaction landscapes by integrating experimental data through computational models[Oct 2014]

Motivation: Transcriptional regulation is directly enacted by the interactions between DNA and many proteins, including transcription factors (TFs), nucleosomes and polymerases. A critical step in deciphering transcriptional regulation is to infer, and eventually predict, the precise locations of these interactions, along with their strength and frequency. While recent datasets yield great insight into these interactions, individual data sources often provide only partial information regarding one aspect of the complete interaction landscape. For example, chromatin immunoprecipitation (ChIP) reveals the binding positions of a protein, but only for one protein at a time. In contrast, nucleases like MNase and DNase can be used to reveal binding positions for many different proteins at once, but cannot easily determine the identities of those proteins. Currently, few statistical frameworks jointly model these different data sources to reveal an accurate, holistic view of the in vivo protein–DNA interaction landscape.

Results: Here, we develop a novel statistical framework that integrates different sources of experimental information within a thermodynamic model of competitive binding to jointly learn a holistic view of the in vivo protein–DNA interaction landscape. We show that our framework learns an interaction landscape with increased accuracy, explaining multiple sets of data in accordance with thermodynamic principles of competitive DNA binding. The resulting model of genomic occupancy provides a precise mechanistic vantage point from which to explore the role of protein–DNA interactions in transcriptional regulation.

Availability and implementation: The C source code for compete and Python source code for MCMC-based inference are available at http://www.cs.duke.edu/~amink.

Contact: amink@cs.duke.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• A multiobjective method for robust identification of bacterial small non-coding RNAs[Oct 2014]

Motivation: Small non-coding RNAs (sRNAs) have major roles in post-transcriptional regulation in prokaryotes. Because only a relatively small number of sRNAs have been experimentally validated, and in few species, computational algorithms are needed that can robustly encode the available knowledge and use it to predict sRNAs within and across species.

Results: We present a novel methodology designed to identify bacterial sRNAs by incorporating the knowledge encoded by different sRNA prediction methods and optimally aggregating them as potential predictors. Because some of these methods emphasize specificity, whereas others emphasize sensitivity while detecting sRNAs, their optimal aggregation constitutes trade-off solutions between these two contradictory objectives that enhance their individual merits. Many non-redundant optimal aggregations uncovered by using multiobjective optimization techniques are then combined into a multiclassifier, which ensures robustness during detection and prediction even in genomes with distinct nucleotide composition. By training with sRNAs in Salmonella enterica Typhimurium, we were able to successfully predict sRNAs in Sinorhizobium meliloti, as well as in multiple and poorly annotated species. The proposed methodology, like a meta-analysis approach, may begin to lay a possible foundation for developing robust predictive methods across a wide spectrum of genomic variability.
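The sensitivity/specificity trade-off navigated by the multiobjective search can be illustrated by extracting the non-dominated (Pareto-optimal) set among candidate aggregations. This generic sketch only computes the trade-off front; the actual optimization of aggregation weights is beyond it:

```python
def pareto_front(points):
    """Return the non-dominated points for two objectives to maximize.

    Each point is (sensitivity, specificity); a point is dominated if
    another point is at least as good in both objectives and better in
    one. Illustrates the trade-off set a multiobjective optimizer
    searches; a simplified stand-in for the published method.
    """
    front = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front
```

A balanced candidate that is strictly worse than another in both objectives is discarded, while the specificity-heavy and sensitivity-heavy extremes both survive as trade-off solutions.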

Availability and implementation: Scripts created for the experimentation are available at http://m4m.ugr.es/SupInfo/sRNAOS/sRNAOSscripts.zip.

Contact: delval@decsai.ugr.es

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ARBitrator: a software pipeline for on-demand retrieval of auto-curated nifH sequences from GenBank[Oct 2014]

Motivation: Studies of the biochemical functions and activities of uncultivated microorganisms in the environment require analysis of DNA sequences for phylogenetic characterization and for the development of sequence-based assays for the detection of microorganisms. The numbers of sequences for genes that are indicators of environmentally important functions such as nitrogen (N2) fixation have been rapidly growing over the past few decades. Obtaining these sequences from the National Center for Biotechnology Information’s GenBank database is problematic because of annotation errors, nomenclature variation and paralogues; moreover, GenBank’s structure and tools are not conducive to searching solely by function. For some genes, such as the nifH gene commonly used to assess community potential for N2 fixation, manual collection and curation are becoming intractable because of the large number of sequences in GenBank and the large number of highly similar paralogues. If analysis is to keep pace with sequence discovery, an automated retrieval and curation system is necessary.

Results: ARBitrator uses a two-step process: broad collection of potential homologues, followed by screening with a best-hit strategy against conserved domains. As of November 20, 2012, it identified 34 420 nifH sequences in GenBank, with a false-positive rate of ~0.033%. ARBitrator rapidly updates a public nifH sequence database, and we show that it can be adapted for other genes.

Availability and implementation: Java source and executable code are freely available to non-commercial users at http://pmc.ucsc.edu/~wwwzehr/research/database/.

Contact: zehrj@ucsc.edu

Supplementary information: Supplementary information is available at Bioinformatics online.

Categories: Journal Articles
• Efficient initial volume determination from electron microscopy images of single particles[Oct 2014]

Motivation: Structural information of macromolecular complexes provides key insights into the way they carry out their biological functions. The reconstruction process leading to the final 3D map requires an approximate initial model. Generation of an initial model is still an open and challenging problem in single-particle analysis.

Results: We present a fast and efficient approach to obtain a reliable, low-resolution estimate of the 3D structure of a macromolecule, without any a priori knowledge, addressing the well-known problem of initial volume estimation in single-particle analysis. The input of the algorithm is a set of class average images obtained from individual projections of a biological object at random and unknown orientations in transmission electron microscopy. The proposed method begins with a non-linear dimensionality reduction step, which automatically selects small representative sets of class average images capturing most of the structural information of the particle under study. These reduced sets are then used to generate volumes from random orientation assignments. The best volume is determined from these guesses using a random sample consensus (RANSAC) approach. We have tested the proposed algorithm, which we term 3D-RANSAC, with simulated and experimental data, obtaining satisfactory results under the low signal-to-noise conditions typical of cryo-electron microscopy.
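The consensus principle behind RANSAC is easiest to see on a minimal model: repeatedly fit from a random minimal sample and keep the hypothesis with the largest inlier set. The line-fitting example below is purely illustrative; 3D-RANSAC applies the same principle to candidate volumes rather than lines:

```python
import random

def ransac_line(points, n_iter=200, tol=0.5, seed=0):
    """Minimal RANSAC: fit y = a*x + b by sampling two points at a time
    and keeping the model with the largest consensus (inlier) set.
    The line model is only for illustration of the consensus idea.
    """
    rng = random.Random(seed)
    best, best_inliers = None, []
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # degenerate sample, cannot define a line
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in points
                   if abs(y - (a * x + b)) < tol]
        if len(inliers) > len(best_inliers):
            best, best_inliers = (a, b), inliers
    return best, best_inliers
```

Gross outliers never accumulate a large consensus set, so the recovered model ignores them entirely rather than being dragged toward them as a least-squares fit would be.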

Availability: The algorithm is freely available as part of the Xmipp 3.1 package [http://xmipp.cnb.csic.es].

Contact: jvargas@cnb.csic.es

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• Intensity drift removal in LC/MS metabolomics by common variance compensation[Oct 2014]

Liquid chromatography coupled to mass spectrometry (LC/MS) has become widely used in metabolomics. Several artefacts have been identified during the acquisition step in large LC/MS metabolomics experiments, including ion suppression, carryover and changes in sensitivity and intensity, and several sources have been proposed as responsible for these effects. In this context, drift in peak intensity is one of the most frequent artefacts and may even constitute the main source of variance in the data, leading to misleading statistical results when the samples are analysed. In this article, we propose a methodology based on a common variance analysis before data normalization to address this issue. The methodology was tested and compared with four other methods by calculating the Dunn and Silhouette indices of the quality control classes, and it performed better than any of the other four. As far as we know, this is the first time such an approach has been applied in the metabolomics context.
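A much simpler, widely used drift correction helps to see the problem being solved: fit a trend of intensity against injection order and divide it out. This regression sketch is shown only for orientation; the article's common-variance method is more involved than this:

```python
def drift_correct(order, intensity):
    """Remove a linear intensity drift across injection order by
    dividing each measurement by its fitted trend value, re-scaled to
    the overall mean. A simple regression-based correction for
    illustration, not the article's common-variance method.
    """
    n = len(order)
    mx = sum(order) / n
    my = sum(intensity) / n
    # Ordinary least-squares slope of intensity vs injection order
    slope = (sum((x - mx) * (y - my) for x, y in zip(order, intensity))
             / sum((x - mx) ** 2 for x in order))
    fitted = [my + slope * (x - mx) for x in order]
    return [y * my / f for y, f in zip(intensity, fitted)]
```

A run whose intensities rise steadily with injection order is flattened to its mean level, so downstream statistics no longer see run order as the dominant source of variance.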

Availability and implementation: The source code of the methods is available as the R package intCor at http://b2slab.upc.edu/software-and-downloads/intensity-drift-correction/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• Fast and accurate imputation of summary statistics enhances evidence of functional enrichment[Oct 2014]

Motivation: Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov models (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available.

Results: In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1–5%) variants [increasing to 87% (60%) when summary linkage disequilibrium information is available from target samples] versus the gold standard of 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and it is computationally very fast. As an empirical demonstration, we apply our method to seven case–control phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of χ2 association statistics) compared with HMM-based imputation from individual-level genotypes at the 227 (176) published single nucleotide polymorphisms (SNPs) in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of four lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses.
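The core of Gaussian imputation from summary statistics is the conditional-normal formula: an untyped SNP's z-score is predicted as z_t = r_to' (R_oo + λI)^(-1) z_o, where R is the reference-panel LD matrix and λ regularizes for the panel's finite sample size. A toy two-SNP version with a hand-coded 2x2 inverse, shown as a sketch rather than the paper's implementation:

```python
def impute_z(ld_to, ld_oo, z_obs, lam=0.1):
    """Impute an untyped SNP's z-score from two typed SNPs.

    ld_to: LD (correlation) between the target SNP and the two
           observed SNPs; ld_oo: 2x2 LD matrix of the observed SNPs;
    z_obs: observed z-scores; lam: ridge term standing in for
    finite-panel regularization. Toy 2-SNP sketch for illustration.
    """
    a, b = ld_oo[0][0] + lam, ld_oo[0][1]
    c, d = ld_oo[1][0], ld_oo[1][1] + lam
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    # w = ld_to' (ld_oo + lam*I)^-1
    w = [ld_to[0] * inv[0][0] + ld_to[1] * inv[1][0],
         ld_to[0] * inv[0][1] + ld_to[1] * inv[1][1]]
    return w[0] * z_obs[0] + w[1] * z_obs[1]
```

With uncorrelated observed SNPs and no regularization, the imputed z-score reduces to the target-observed correlation times the observed z-score, which matches intuition: an SNP in LD 0.8 with a strong signal inherits 80% of it.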

Availability and implementation: Publicly available software package available at http://bogdan.bioinformatics.ucla.edu/software/.

Supplementary information: Supplementary materials are available at Bioinformatics online.

Categories: Journal Articles
• Fast spatial ancestry via flexible allele frequency surfaces[Oct 2014]

Motivation: Unique modeling and computational challenges arise in locating the geographic origin of individuals based on their genetic backgrounds. Single-nucleotide polymorphisms (SNPs) vary widely in informativeness, allele frequencies change non-linearly with geography and reliable localization requires evidence to be integrated across a multitude of SNPs. These problems become even more acute for individuals of mixed ancestry. It is hardly surprising that matching genetic models to computational constraints has limited the development of methods for estimating geographic origins. We attack these related problems by borrowing ideas from image processing and optimization theory. Our proposed model divides the region of interest into pixels and operates SNP by SNP. We estimate allele frequencies across the landscape by maximizing a product of binomial likelihoods penalized by nearest neighbor interactions. Penalization smooths allele frequency estimates and promotes estimation at pixels with no data. Maximization is accomplished by a minorize–maximize (MM) algorithm. Once allele frequency surfaces are available, one can apply Bayes’ rule to compute the posterior probability that each pixel is the pixel of origin of a given person. Placement of admixed individuals on the landscape is more complicated and requires estimation of the fractional contribution of each pixel to a person’s genome. This estimation problem also succumbs to a penalized MM algorithm.
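The Bayes step described above is straightforward to sketch for a single SNP: given an allele frequency surface over pixels, the posterior that a pixel is the person's origin is the Hardy–Weinberg binomial likelihood of the observed genotype at each pixel, normalized under a flat prior. The paper multiplies such likelihoods across many SNPs; this toy version uses one:

```python
def pixel_posterior(genotype, freq_surface, prior=None):
    """Posterior probability that each pixel is a person's origin,
    from one SNP: binomial likelihood of the genotype (0/1/2 copies
    of the allele) given each pixel's allele frequency, combined by
    Bayes' rule. Flat prior by default; a single-SNP sketch only.
    """
    coef = {0: 1, 1: 2, 2: 1}[genotype]  # binomial coefficient C(2, g)
    lik = [coef * f ** genotype * (1 - f) ** (2 - genotype)
           for f in freq_surface]
    if prior is None:
        prior = [1 / len(lik)] * len(lik)
    post = [l * p for l, p in zip(lik, prior)]
    tot = sum(post)
    return [p / tot for p in post]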

Results: We applied the model to the Population Reference Sample (POPRES) data. The model gives better localization for both unmixed and admixed individuals than existing methods despite using just a small fraction of the available SNPs. Computing times are comparable with the best competing software.

Availability and implementation: Software will be freely available as the OriGen package in R.

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• Drug repositioning by integrating target information through a heterogeneous network model[Oct 2014]

Motivation: The emergence of network medicine not only offers more opportunities for better and more complete understanding of the molecular complexities of diseases, but also serves as a promising tool for identifying new drug targets and establishing new relationships among diseases that enable drug repositioning. Computational approaches for drug repositioning by integrating information from multiple sources and multiple levels have the potential to provide great insights to the complex relationships among drugs, targets, disease genes and diseases at a system level.

Results: In this article, we have proposed a computational framework based on a heterogeneous network model and applied the approach on drug repositioning by using existing omics data about diseases, drugs and drug targets. The novelty of the framework lies in the fact that the strength between a disease–drug pair is calculated through an iterative algorithm on the heterogeneous graph that also incorporates drug-target information. Comprehensive experimental results show that the proposed approach significantly outperforms several recent approaches. Case studies further illustrate its practical usefulness.
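As a generic stand-in for the iterative scoring on a heterogeneous graph, a random-walk-with-restart style propagation conveys the idea: scores injected at a seed node (e.g. a disease) flow through the network and accumulate at reachable nodes (e.g. drugs). This sketch is not the paper's algorithm, only the family it belongs to:

```python
def propagate(adj, seed, alpha=0.5, n_iter=50):
    """Random-walk-with-restart style score propagation.

    adj: graph as {node: [neighbor, ...]}; seed: restart node;
    alpha: restart probability. Returns a steady-state score per node.
    A generic sketch of iterative heterogeneous-network scoring.
    """
    score = {n: 0.0 for n in adj}
    score[seed] = 1.0
    for _ in range(n_iter):
        new = {}
        for n in adj:
            # mass arriving at n from each neighbor, split by degree
            spread = sum(score[m] / len(adj[m]) for m in adj[n])
            new[n] = (1 - alpha) * spread + (alpha if n == seed else 0.0)
        score = new
    return score
```

On a chain a–b–c seeded at a, the converged scores decay with distance from the seed, which is the behavior that makes such walks useful for ranking candidate disease–drug pairs.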

Availability and implementation: http://cbc.case.edu

Contact: jingli@cwru.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• MAGNA: Maximizing Accuracy in Global Network Alignment[Oct 2014]

Motivation: Biological network alignment aims to identify similar regions between networks of different species. Existing methods compute node similarities to rapidly identify, from among possible alignments, those that score highly with respect to overall node similarity. However, alignment accuracy is then evaluated with some other measure, different from the node similarity used to construct the alignments; typically, one measures the amount of conserved edges. Thus, the existing methods align similar nodes between networks hoping to conserve many edges (after the alignment is constructed!).

Results: Instead, we introduce MAGNA to directly ‘optimize’ edge conservation while the alignment is constructed, without decreasing the quality of node mapping. MAGNA uses a genetic algorithm and our novel function for ‘crossover’ of two ‘parent’ alignments into a superior ‘child’ alignment to simulate a ‘population’ of alignments that ‘evolves’ over time; the ‘fittest’ alignments survive and proceed to the next ‘generation’, until the alignment accuracy cannot be optimized further. While we optimize our new and superior measure of the amount of conserved edges, MAGNA can optimize any alignment accuracy measure, including a combined measure of both node and edge conservation. In systematic evaluations against state-of-the-art methods (IsoRank, MI-GRAAL and GHOST), on both synthetic networks and real-world biological data, MAGNA outperforms all of the existing methods, in terms of both node and edge conservation as well as both topological and biological alignment accuracy.

Availability: Software: http://nd.edu/~cone/MAGNA

Contact: tmilenko@nd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• Improving peak detection in high-resolution LC/MS metabolomics data using preexisting knowledge and machine learning approach[Oct 2014]

Motivation: Peak detection is a key step in the preprocessing of untargeted metabolomics data generated from high-resolution liquid chromatography-mass spectrometry (LC/MS). The common practice is to use filters with predetermined parameters to select peaks in the LC/MS profile. This rigid approach can cause suboptimal performance when the choice of peak model and parameters does not suit the data characteristics.

Results: Here we present a method that learns directly from various data features of the extracted ion chromatograms (EICs) to differentiate true peak regions from noise regions in the LC/MS profile. It utilizes the knowledge of known metabolites, as well as robust machine learning approaches. Unlike currently available methods, this new approach does not assume a parametric peak shape model and allows maximum flexibility. We demonstrate the superiority of the new approach using real data. Because matching to known metabolites entails uncertainties and cannot be considered a gold standard, we also developed a probabilistic receiver operating characteristic (pROC) approach that can incorporate uncertainties.
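Shape-model-free peak detection rests on describing an EIC region by features a classifier can learn from. The three features below (apex prominence, signal-to-noise against the region's edges, and local roughness) are illustrative choices, not the feature set used by apLCMS:

```python
from statistics import mean, stdev

def eic_features(intensities):
    """Shape-free features of an extracted ion chromatogram region
    that a classifier could learn from. Feature choices here are
    illustrative only, not the published apLCMS feature set.
    """
    apex = max(intensities)
    base = min(intensities)
    # noise estimated from the flanks of the region (first/last 3 points)
    noise = stdev(intensities[:3] + intensities[-3:]) or 1e-9
    jumps = [abs(b - a) for a, b in zip(intensities, intensities[1:])]
    return {"prominence": apex - base,
            "snr": (apex - base) / noise,
            "roughness": mean(jumps) / (apex - base or 1e-9)}
```

A clean triangular peak scores high prominence and signal-to-noise with low roughness, while a jittery baseline region scores the opposite, which is exactly the separation a trained classifier exploits.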

Availability and implementation: The new peak detection approach is implemented as part of the apLCMS package available at http://web1.sph.emory.edu/apLCMS/

Contact: tyu8@emory.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• The Amordad database engine for metagenomics[Oct 2014]

Motivation: Several technical challenges in metagenomic data analysis, including assembling metagenomic sequence data or identifying operational taxonomic units, are both significant and well known. These forms of analysis are increasingly cited as conceptually flawed, given the extreme variation within traditionally defined species and rampant horizontal gene transfer. Furthermore, computational requirements of such analysis have hindered content-based organization of metagenomic data at large scale.

Results: In this article, we introduce the Amordad database engine for alignment-free, content-based indexing of metagenomic datasets. Amordad places the metagenome comparison problem in a geometric context, and uses an indexing strategy that combines random hashing with a regular nearest neighbor graph. This framework allows refinement of the database over time by continual application of random hash functions, with the effect of each hash function encoded in the nearest neighbor graph. This eliminates the need to explicitly maintain the hash functions in order for query efficiency to benefit from the accumulated randomness. Results on real and simulated data show that Amordad can support logarithmic query time for identifying similar metagenomes even as the database size reaches into the millions.
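The random-hashing half of such an index can be sketched with random-hyperplane hashing: each bit of a signature records which side of a random hyperplane a feature vector falls on, so vectors at a small angle collide in most bits. This generic LSH sketch omits Amordad's nearest neighbor graph entirely:

```python
import random

def lsh_signature(vec, n_bits=16, seed=42):
    """Random-hyperplane hash of a feature vector: bit i is 1 when the
    vector lies on the positive side of the i-th random hyperplane.
    Similar vectors agree in most bits. A generic LSH sketch, not the
    Amordad index itself.
    """
    rng = random.Random(seed)
    bits = []
    for _ in range(n_bits):
        plane = [rng.gauss(0, 1) for _ in range(len(vec))]
        bits.append(int(sum(p * x for p, x in zip(plane, vec)) >= 0))
    return tuple(bits)
```

Because the hash depends only on the vector's direction, a metagenome profile and a scaled copy of it receive identical signatures, which is the property that makes bucketing by signature useful for sublinear-time lookup.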

Contact: andrewds@usc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• COSMOS: Python library for massively parallel workflows[Oct 2014]

Summary: Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services.

Availability and implementation: Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at http://lpm.hms.harvard.edu and http://wall-lab.stanford.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• GATB: Genome Assembly & Analysis Tool Box[Oct 2014]

Motivation: Efficient and fast next-generation sequencing (NGS) algorithms are essential to analyze the terabytes of data generated by the NGS machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation.

Results: We propose an open-source library dedicated to genome assembly and analysis to speed up the development of efficient software. The library is based on a recent optimized de Bruijn graph implementation that allows complex genomes to be processed on desktop computers using fast algorithms with low memory footprints.
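For readers unfamiliar with the underlying structure, a de Bruijn graph links overlapping (k-1)-mers through the k-mers that contain them. The uncompressed toy construction below is only to show the structure; GATB's value is precisely in replacing this naive representation with heavily optimized, low-memory ones:

```python
def de_bruijn(reads, k):
    """Build a tiny de Bruijn graph: nodes are (k-1)-mers and an edge
    from prefix to suffix is recorded for every k-mer in the reads.
    Naive construction for illustration; real assemblers compress this.
    """
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
    return graph
```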

Availability and implementation: The GATB library is written in C++ and is available at http://gatb.inria.fr under the A-GPL license.

Contact: lavenier@irisa.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• bammds: a tool for assessing the ancestry of low-depth whole-genome data using multidimensional scaling (MDS)[Oct 2014]

Summary: We present bammds, a practical tool that allows visualization of samples sequenced by second-generation sequencing when compared with a reference panel of individuals (usually genotypes) using a multidimensional scaling algorithm. Our tool is aimed at determining the ancestry of unknown samples—typical of ancient DNA data—particularly when only low amounts of data are available for those samples.
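The input to any such MDS projection is a pairwise genetic distance matrix between samples and reference individuals. A simplified allele-sharing distance over genotype vectors, shown as a stand-in for the distances a tool like bammds derives from sequence data:

```python
def pairwise_dist(genotypes):
    """Pairwise allele-sharing distance matrix from genotype vectors
    (0/1/2 alternate-allele counts; None = missing call), the kind of
    matrix an MDS projects to two dimensions. Simplified stand-in:
    mean |g_i - g_j| / 2 over sites called in both samples.
    """
    n = len(genotypes)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = [(a, b) for a, b in zip(genotypes[i], genotypes[j])
                      if a is not None and b is not None]
            dij = sum(abs(a - b) for a, b in shared) / (2 * len(shared))
            d[i][j] = d[j][i] = dij
    return d
```

Identical samples sit at distance 0 and fully opposite homozygotes at distance 1, giving the MDS step a bounded, symmetric matrix to embed.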

Availability and implementation: The software package is available under the GNU General Public License v3 and is freely available, together with test datasets, at https://savannah.nongnu.org/projects/bammds/. It uses R (http://www.r-project.org/), GNU parallel (http://www.gnu.org/software/parallel/) and samtools (https://github.com/samtools/samtools).

Contact: bammds-users@nongnu.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• RAPIDR: an analysis package for non-invasive prenatal testing of aneuploidy[Oct 2014]

Non-invasive prenatal testing (NIPT) of fetal aneuploidy using cell-free fetal DNA is becoming part of routine clinical practice. RAPIDR (Reliable Accurate Prenatal non-Invasive Diagnosis R package) is an easy-to-use open-source R package that implements several published NIPT analysis methods. The input to RAPIDR is a set of sequence alignment files in the BAM format, and the outputs are calls for aneuploidy, including trisomies 13, 18, 21 and monosomy X as well as fetal sex. RAPIDR has been extensively tested with a large sample set as part of the RAPID project in the UK. The package contains quality control steps to make it robust for use in the clinical setting.
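One of the widely published NIPT analysis approaches of the kind RAPIDR implements is a z-score test: compare the test sample's chromosome-21 read fraction against euploid reference samples. The cutoff below is an illustrative convention, not RAPIDR's validated threshold:

```python
from statistics import mean, stdev

def aneuploidy_z(test_fraction, reference_fractions, cutoff=3.0):
    """NIPT-style z-score: compare a test sample's chr21 read fraction
    with euploid reference samples; z above the cutoff suggests
    trisomy. Illustrative sketch of one published approach; the
    cutoff here is a convention, not a clinical threshold.
    """
    z = ((test_fraction - mean(reference_fractions))
         / stdev(reference_fractions))
    return z, z > cutoff
```

A test fraction far above the reference distribution is flagged, while one inside it is not; in practice the reference set is large and subject to the quality control steps the package provides.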

Availability and implementation: RAPIDR is implemented in R and can be freely downloaded via CRAN from here: http://cran.r-project.org/web/packages/RAPIDR/index.html.

Contact: kitty.lo@ucl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• Genome editing assessment using CRISPR Genome Analyzer (CRISPR-GA)[Oct 2014]

Summary: Clustered regularly interspaced short palindromic repeats (CRISPR)-based technologies have revolutionized human genome engineering and opened countless possibilities in basic science, synthetic biology and gene therapy. Despite the enormous potential of these tools, their performance is far from perfect, and a careful posterior analysis of the gene editing experiment is essential. However, there are as yet no computational tools for genome editing assessment, and current experimental tools lack sensitivity and flexibility.

We present a platform to assess the quality of a genome editing experiment with only three mouse clicks. The method evaluates next-generation sequencing data to quantify and characterize insertions, deletions and homologous recombination. CRISPR Genome Analyzer provides a report for the selected locus, which includes a quantification of the edited site and an analysis of the different alterations detected. The platform maps the reads, estimates and locates insertions and deletions, computes the allele replacement efficiency and provides a report integrating all the information.
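After mapping, the simplest editing metric recoverable from the alignments is the fraction of reads carrying an insertion or deletion, readable directly from CIGAR strings. This sketch covers only that one metric; CRISPR-GA additionally localizes events and measures allele replacement:

```python
import re

def indel_fraction(cigars):
    """Fraction of reads whose alignment CIGAR contains an insertion
    (I) or deletion (D) operation, a crude proxy for editing
    efficiency. Simplified sketch, not the CRISPR-GA pipeline.
    """
    edited = 0
    for cig in cigars:
        ops = re.findall(r"(\d+)([MIDNSHP=X])", cig)
        if any(op in "ID" for _, op in ops):
            edited += 1
    return edited / len(cigars)
```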

Availability and implementation: CRISPR-GA Web is available at http://crispr-ga.net. Documentation on CRISPR-GA instructions can be found at http://crispr-ga.net/documentation.html

Contact: mguell@genetics.med.harvard.edu

Categories: Journal Articles
• repfdr: a tool for replicability analysis for genome-wide association studies[Oct 2014]

Motivation: Identification of single nucleotide polymorphisms that are associated with a phenotype in more than one study is of great scientific interest in genome-wide association study (GWAS) research. The empirical Bayes approach for discovering whether results have been replicated across studies has been shown to be reliable and close to optimal in terms of power.

Results: The R package repfdr provides a flexible implementation of the empirical Bayes approach for replicability analysis and meta-analysis, to be used when several studies examine the same set of null hypotheses. The usefulness of the package for the GWAS community is discussed.

Availability and implementation: The R package repfdr can be downloaded from CRAN.

Contact: ruheller@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles