# Bioinformatics Journal

Bioinformatics - RSS feed of current issue
• ### SIST: stress-induced structural transitions in superhelical DNA[Jan 2015]

Summary: Supercoiling imposes stress on a DNA molecule that can drive susceptible sequences into alternative non-B form structures. This phenomenon occurs frequently in vivo and has been implicated in biological processes, such as replication, transcription, recombination and translocation. SIST is a software package that analyzes sequence-dependent structural transitions in kilobase length superhelical DNA molecules. The numerical algorithms in SIST are based on a statistical mechanical model that calculates the equilibrium probability of transition for each base pair in the domain. They are extensions of the original stress-induced duplex destabilization (SIDD) method, which analyzes stress-driven DNA strand separation. SIST also includes algorithms to analyze B-Z transitions and cruciform extrusion. The SIST pipeline has an option to use the DZCBtrans algorithm, which analyzes the competition among these three transitions within a superhelical domain.

Availability and implementation: The package and additional documentation are freely available at https://bitbucket.org/benhamlab/sist_codes.

Contact: dzhabinskaya@ucdavis.edu

Categories: Journal Articles
• ### The RNA shapes studio[Jan 2015]

Motivation: Abstract shape analysis, first proposed in 2004, allows one to extract several relevant structures from the folding space of an RNA sequence, which is preferable to focusing on a single structure of minimal free energy. We report recent extensions to this approach.

Results: We have rebuilt the original RNAshapes as a repository of components that allows us to integrate several established tools for RNA structure analysis: RNAshapes, RNAalishapes and pknotsRG, including its recent extension pKiss. As a spin-off, we obtain heretofore unavailable functionality: e.g. with pKiss, we can now perform abstract shape analysis for structures holding pseudoknots up to the complexity of kissing hairpin motifs. The new tool pAliKiss can predict kissing hairpin motifs from aligned sequences. Along with the integration, the functionality of the tools was also extended in manifold ways.

Availability and implementation: As before, the tool is available on the Bielefeld Bioinformatics server at http://bibiserv.cebitec.uni-bielefeld.de/rnashapesstudio.

Categories: Journal Articles
• ### CompMap: a reference-based compression program to speed up read mapping to related reference sequences[Jan 2015]

Summary: Exhaustive mapping of next-generation sequencing data to a set of relevant reference sequences has become an important task in pathogen discovery and metagenomic classification. However, the runtime and memory usage increase as the number of reference sequences and the repeat content among these sequences increase. In many applications, read mapping time dominates the entire application. We developed CompMap, a reference-based compression program, to speed up this process. CompMap enables the generation of a non-redundant representative sequence for the input sequences. We have demonstrated that reads can be mapped to this representative sequence with much reduced time and memory usage, and the mapping to the original reference sequences can be recovered with high accuracy.

Availability and implementation: CompMap is implemented in C and freely available at http://csse.szu.edu.cn/staff/zhuzx/CompMap/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ### ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data[Jan 2015]

Summary: Whole-exome sequencing (WES) has extensively been used in cancer genome studies; however, the use of WES data in the study of loss of heterozygosity or more generally allelic imbalance (AI) has so far been very limited, which highlights the need for user-friendly and flexible software that can handle low-quality datasets. We have developed a statistical approach, ExomeAI, for the detection of recurrent AI events using WES datasets, specifically where matched normal samples are not available.

Availability: ExomeAI is a web-based application, publicly available at: http://genomequebec.mcgill.ca/exomeai.

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ### MulRF: a software package for phylogenetic analysis using multi-copy gene trees[Jan 2015]

Summary: MulRF is a platform-independent software package for phylogenetic analysis using multi-copy gene trees. It seeks the species tree that minimizes the Robinson–Foulds (RF) distance to the input trees using a generalization of the RF distance to multi-labeled trees. The underlying generic tree distance measure and fast running time make MulRF useful for inferring phylogenies from large collections of gene trees, in which multiple evolutionary processes as well as phylogenetic error may contribute to gene tree discord. MulRF implements several features for customizing the species tree search and assessing the results, and it provides a user-friendly graphical user interface (GUI) with tree visualization. The species tree search is implemented in C++ and the GUI in Java Swing.
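
As a rough illustration of the distance that the species tree search minimizes, the following minimal sketch computes the standard Robinson–Foulds distance between rooted, singly-labeled trees and sums it over a collection of gene trees. It does not implement MulRF's generalization to multi-labeled trees; the nested-tuple tree encoding and the function names `clusters`, `rf_distance` and `total_rf` are assumptions made for this example only.

```python
def clusters(tree):
    """Return the non-trivial leaf clusters of a rooted tree.

    A tree is a nested tuple whose leaves are strings (assumed encoding),
    e.g. ((("A", "B"), "C"), "D").
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    # The root cluster is shared by every tree on the same leaf set.
    found.discard(all_leaves)
    return found


def rf_distance(tree_a, tree_b):
    """Count the clusters present in exactly one of the two trees."""
    return len(clusters(tree_a) ^ clusters(tree_b))


def total_rf(species_tree, gene_trees):
    """Score a candidate species tree against all gene trees."""
    return sum(rf_distance(species_tree, gene) for gene in gene_trees)


t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")
print(rf_distance(t1, t2))  # 2
```

A search procedure would evaluate `total_rf` for many candidate species trees and keep the one with the lowest score.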

Availability: MulRF’s executable as well as sample datasets and manual are available at http://genome.cs.iastate.edu/CBL/MulRF/, and the source code is available at https://github.com/ruchiherself/MulRFRepo.

Contact: ruchic@ufl.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ### Tabhu: tools for antibody humanization[Jan 2015]

Summary: Antibodies are rapidly becoming essential tools in clinical practice, given their ability to recognize their cognate antigens with high specificity and affinity, and a high yield at reasonable costs in model animals. Unfortunately, when administered to human patients, xenogeneic antibodies can elicit unwanted and dangerous immunogenic responses. Antibody humanization methods are designed to produce molecules with a better safety profile while still maintaining their ability to bind the antigen. This can be accomplished by grafting the non-human regions determining the antigen specificity into a suitable human template. Unfortunately, this procedure may result in a partial or complete loss of affinity of the grafted molecule, which can be restored by back-mutating some of the residues of human origin to the corresponding murine ones. This trial-and-error procedure is hard and involves expensive and time-consuming experiments. Here we present Tools for antibody humanization (Tabhu), a web server for antibody humanization. Tabhu includes tools for human template selection, grafting, back-mutation evaluation, antibody modelling and structural analysis, helping the user in all the critical steps of the humanization experiment protocol.

Availability: http://www.biocomputing.it/tabhu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ### Rcount: simple and flexible RNA-Seq read counting[Jan 2015]

Summary: Analysis of differential gene expression by RNA sequencing (RNA-Seq) is frequently done using feature counts, i.e. the number of reads mapping to a gene. However, commonly used count algorithms (e.g. HTSeq) do not address the problem of reads aligning with multiple locations in the genome (multireads) or reads aligning with positions where two or more genes overlap (ambiguous reads). Rcount specifically addresses these issues. Furthermore, Rcount allows the user to assign priorities to certain feature types (e.g. higher priority for protein-coding genes compared to rRNA-coding genes) or to add flanking regions.
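
The idea of resolving ambiguous reads through feature-type priorities can be sketched as follows. This is only an illustration of the concept, not Rcount's actual algorithm or interface: the function name `resolve_overlaps`, the input layout and the priority values are all hypothetical.

```python
def resolve_overlaps(overlaps, priority):
    """Pick the gene(s) a read should count for.

    overlaps: list of (gene_id, feature_type) pairs the read overlaps.
    priority: dict mapping feature_type to an integer, higher wins
              (hypothetical scheme; unknown types default to 0).
    Returns a sorted list of candidate genes. One entry means the read
    is resolved; several entries mean it is still ambiguous.
    """
    if not overlaps:
        return []
    best = max(priority.get(ftype, 0) for _, ftype in overlaps)
    return sorted({gene for gene, ftype in overlaps
                   if priority.get(ftype, 0) == best})


priority = {"protein_coding": 2, "rRNA": 1}
read_overlaps = [("geneA", "protein_coding"), ("geneB", "rRNA")]
print(resolve_overlaps(read_overlaps, priority))  # ['geneA']
```

Reads that remain ambiguous after this step would need a further rule, which this sketch deliberately leaves open.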

Availability and implementation: Rcount provides a fast and easy-to-use graphical user interface requiring no command line or programming skills. It is implemented in C++ using the SeqAn (www.seqan.de) and the Qt libraries (qt-project.org). Source code and 64 bit binaries for (Ubuntu) Linux, Windows (7) and MacOSX are released under the GPLv3 license and are freely available on github.com/MWSchmid/Rcount.

Contact: marcschmid@gmx.ch

Supplementary information: Test data, genome annotation files, useful Python and R scripts and a step-by-step user guide (including run-time and memory usage tests) are available on github.com/MWSchmid/Rcount.

Categories: Journal Articles
• ### VCF2Networks: applying genotype networks to single-nucleotide variants data[Jan 2015]

Summary: A wealth of large-scale genome sequencing projects opens the doors to new approaches to study the relationship between genotype and phenotype. One such opportunity is the possibility to apply genotype networks analysis to population genetics data. Genotype networks are a representation of the set of genotypes associated with a single phenotype, and they allow one to estimate properties such as the robustness of the phenotype to mutations, and the ability of its associated genotypes to evolve new adaptations. So far, though, genotype networks analysis has rarely been applied to population genetics data. To help fill this gap, here we present VCF2Networks, a tool to determine and study genotype network structure from single-nucleotide variant data.
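
To make the notion of a genotype network concrete, here is a minimal sketch that does not use VCF2Networks itself. It assumes, as is common in the genotype network literature, that genotypes are encoded as equal-length strings and that two genotypes are connected when they differ at exactly one position; the function names are chosen for this example.

```python
from itertools import combinations


def build_genotype_network(genotypes):
    """Return an adjacency dict linking genotypes that differ at one site."""
    nodes = sorted(set(genotypes))
    edges = {g: set() for g in nodes}
    for a, b in combinations(nodes, 2):
        if len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1:
            edges[a].add(b)
            edges[b].add(a)
    return edges


def count_components(edges):
    """Count connected components with an iterative depth-first search."""
    seen, components = set(), 0
    for start in edges:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(edges[node] - seen)
    return components


network = build_genotype_network(["000", "001", "011", "110"])
print(count_components(network))  # 2
```

Properties such as the number of components or the average degree can then be compared between populations or genomic regions.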

Availability and implementation: VCF2Networks is available at https://bitbucket.org/dalloliogm/vcf2networks.

Contact: giovanni.dallolio@kcl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ### NOVA: a software to analyze complexome profiling data[Jan 2015]

Summary: We introduce nova, a software for the analysis of complexome profiling data. nova supports the investigation of the composition of complexes, cluster analysis of the experimental data, visual inspection and comparison of experiments and many other features.

Availability and implementation: nova is licensed under the Artistic License 2.0. It is freely available at http://www.bioinformatik.uni-frankfurt.de. nova requires at least Java 7 and runs under Linux, Microsoft Windows and Mac OS.

Categories: Journal Articles
• ### GeneNet Toolbox for MATLAB: a flexible platform for the analysis of gene connectivity in biological networks[Jan 2015]

Summary: We present GeneNet Toolbox for MATLAB (also available as a set of standalone applications for Linux). The toolbox, available as command-line or with a graphical user interface, enables biologists to assess connectivity among a set of genes of interest (‘seed-genes’) within a biological network of their choosing. Two methods are implemented for calculating the significance of connectivity among seed-genes: ‘seed randomization’ and ‘network permutation’. Options include restricting analyses to a specified subnetwork of the primary biological network, and calculating connectivity from the seed-genes to a second set of interesting genes. Pre-analysis tools help the user choose the best connectivity-analysis algorithm for their network. The toolbox also enables visualization of the connections among seed-genes. GeneNet Toolbox functions execute in reasonable time for very large networks (~10 million edges) on a desktop computer.
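
The abstract does not spell out how 'seed randomization' works, so the following is only a loose illustration of the general idea of testing connectivity against randomly drawn gene sets. It assumes that connectivity is the number of direct edges among the seed-genes and that null sets are drawn uniformly at random; the toolbox itself may use different statistics and sampling schemes. All names are hypothetical.

```python
import random
from itertools import combinations


def edges_among(genes, adjacency):
    """Count direct edges between members of a gene set."""
    return sum(1 for a, b in combinations(sorted(genes), 2)
               if b in adjacency.get(a, ()))


def seed_randomization_p(adjacency, seeds, n_draws=1000, rng=None):
    """Empirical p-value for the seed set's direct connectivity.

    adjacency: symmetric dict gene -> set of neighbours, one key per gene.
    Returns (observed_edges, p_value).
    """
    rng = rng or random.Random(0)
    universe = sorted(adjacency)
    observed = edges_among(seeds, adjacency)
    at_least = sum(
        1 for _ in range(n_draws)
        if edges_among(rng.sample(universe, len(seeds)), adjacency) >= observed
    )
    # Add-one correction keeps the estimate strictly positive.
    return observed, (1 + at_least) / (1 + n_draws)
```

Calling `seed_randomization_p(adjacency, {"a", "b", "c"})` on a small network returns the observed edge count together with the estimated p-value.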

Availability and implementation: GeneNet Toolbox is open source and freely available from http://avigailtaylor.github.io/gntat14.

Supplementary information: Supplementary data are available at Bioinformatics online.

Contact: avigail.taylor@dpag.ox.ac.uk

Categories: Journal Articles
• ### FungiFun2: a comprehensive online resource for systematic analysis of gene lists from fungal species[Jan 2015]

Summary: Systematically extracting biological meaning from omics data is a major challenge in systems biology. Enrichment analysis is often used to identify characteristic patterns in candidate lists. FungiFun is a user-friendly Web tool for functional enrichment analysis of fungal genes and proteins. The novel tool FungiFun2 uses a completely revised data management system and thus allows enrichment analysis for 298 currently available fungal strains published in standard databases. FungiFun2 offers a modern Web interface and creates interactive tables, charts and figures, which users can directly manipulate to their needs.

Availability and implementation: FungiFun2, examples and tutorials are publicly available at https://elbe.hki-jena.de/fungifun/.

Categories: Journal Articles
• ### OrthoInspector 2.0: Software and database updates[Jan 2015]

Summary: We previously developed OrthoInspector, a package incorporating an original algorithm for the detection of orthology and inparalogy relations between different species. We have added new functionalities to the package. While its original algorithm was not modified, performing similar orthology predictions, we facilitated the prediction of very large databases (thousands of proteomes), refurbished its graphical interface, added new visualization tools for comparative genomics/protein family analysis and facilitated its deployment in a network environment. Finally, we have released three online databases of precomputed orthology relationships.

Availability: Package and databases are freely available at http://lbgi.fr/orthoinspector with all major browsers supported.

Contact: odile.lecompte@unistra.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ### SNP-SIG 2013: the state of the art of genomic variant interpretation[Jan 2015]

Categories: Journal Articles
• ### Using the plurality of codon positions to identify deleterious variants in human exomes[Jan 2015]

Motivation: A codon position could perform different or multiple roles in alternative transcripts of a gene. For instance, a non-synonymous position in one transcript could be a synonymous site in another. Alternatively, a position could remain as non-synonymous in multiple transcripts. Here we examined the impact of codon position plurality on the frequency of deleterious single-nucleotide variations (SNVs) using data from 6500 human exomes.

Results: Our results showed that the proportion of deleterious SNVs was more than 2-fold higher in positions that remain non-synonymous in multiple transcripts compared with that observed in positions that are non-synonymous in one or some transcript(s) and synonymous or intronic in other(s). Furthermore, we observed a positive relationship between the fraction of deleterious non-synonymous SNVs and the number of proteins (alternative splice variants) affected. These results demonstrate that the plurality of codon positions is an important attribute, which could be useful in identifying mutations associated with diseases.

Contact: s.subramanian@griffith.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ### Novel function discovery with GeneMANIA: a new integrated resource for gene function prediction in Escherichia coli[Jan 2015]

Motivation: The model bacterium Escherichia coli is among the best studied prokaryotes, yet nearly half of its proteins are still of unknown biological function. This is despite a wealth of available large-scale physical and genetic interaction data. To address this, we extended the GeneMANIA function prediction web application developed for model eukaryotes to support E.coli.

Results: We integrated 48 distinct E.coli functional interaction datasets and used the GeneMANIA algorithm to produce thousands of novel functional predictions and prioritize genes for further functional assays. Our analysis achieved cross-validation performance comparable to that reported for eukaryotic model organisms, and revealed new functions for previously uncharacterized genes in specific bioprocesses, including components required for cell adhesion, iron–sulphur complex assembly and ribosome biogenesis. The GeneMANIA approach for network-based function prediction provides an innovative new tool for probing mechanisms underlying bacterial bioprocesses.

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ### Pooled assembly of marine metagenomic datasets: enriching annotation through chimerism[Jan 2015]

Motivation: Despite advances in high-throughput sequencing, marine metagenomic samples remain largely opaque. A typical sample contains billions of microbial organisms from thousands of genomes and quadrillions of DNA base pairs. Its derived metagenomic dataset underrepresents this complexity by orders of magnitude because of the sparseness and shortness of sequencing reads. Read shortness and sequencing errors pose a major challenge to accurate species and functional annotation. This includes distinguishing known from novel species. Often the majority of reads cannot be annotated and thus cannot help our interpretation of the sample.

Results: Here, we demonstrate quantitatively how careful assembly of marine metagenomic reads within, but also across, datasets can alleviate this problem. For 10 simulated datasets, each with species complexity modeled on a real counterpart, chimerism remained within the same species for most contigs (97%). For 42 real pyrosequencing (‘454’) datasets, assembly increased the proportion of annotated reads, and even more so when datasets were pooled, by on average 1.6% (max 6.6%) for species, 9.0% (max 28.7%) for Pfam protein domains and 9.4% (max 22.9%) for PANTHER gene families. Our results outline exciting prospects for data sharing in the metagenomics community. While chimeric sequences should be avoided in other areas of metagenomics (e.g. biodiversity analyses), conservative pooled assembly is advantageous for annotation specificity and sensitivity. Intriguingly, our experiment also found potential prospects for (low-cost) discovery of new species in ‘old’ data.

Contact: dgerloff@ffame.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ### Genome measures used for quality control are dependent on gene function and ancestry[Jan 2015]

Motivation: The transition/transversion (Ti/Tv) ratio and heterozygous/nonreference-homozygous (het/nonref-hom) ratio have been commonly computed in genetic studies as a quality control (QC) measurement. Additionally, these two ratios are helpful in our understanding of the patterns of DNA sequence evolution.

Results: To thoroughly understand these two genomic measures, we performed a study using 1000 Genomes Project (1000G) released genotype data (N = 1092). An additional two datasets (N = 581 and N = 6) were used to validate our findings from the 1000G dataset. We compared the two ratios among continental ancestry, genome regions and gene functionality. We found that the Ti/Tv ratio can be used as a quality indicator for single nucleotide polymorphisms inferred from high-throughput sequencing data. The Ti/Tv ratio varies greatly by genome region and functionality, but not by ancestry. The het/nonref-hom ratio varies greatly by ancestry, but not by genome regions and functionality. Furthermore, extreme guanine + cytosine content (either high or low) is negatively associated with the Ti/Tv ratio magnitude. Thus, when performing QC assessment using these two measures, care must be taken to apply the correct thresholds based on ancestry and genome region. Failure to take these considerations into account at the QC stage will bias any following analysis.
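
For readers unfamiliar with the Ti/Tv ratio, the following minimal sketch shows how it can be computed from a list of single-nucleotide variants. Transitions are purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) changes, and all other substitutions are transversions. The input format, a list of `(ref, alt)` base pairs, and the function names are assumptions for this example.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
VALID = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True if both bases are purines or both are pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Compute the transition/transversion ratio of (ref, alt) pairs."""
    transitions = transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or ref not in VALID or alt not in VALID:
            continue  # skip non-variants and anything that is not a single base
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return transitions / transversions


print(ti_tv_ratio([("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]))  # 3.0
```

In practice the ratio would be computed separately per genome region or functional class, since the results above indicate that it varies across them.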

Contact: yan.guo@vanderbilt.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ### Log-odds sequence logos[Jan 2015]

Motivation: DNA and protein patterns are usefully represented by sequence logos. However, the methods for logo generation in common use lack a proper statistical basis, and are non-optimal for recognizing functionally relevant alignment columns.

Results: We redefine the information at a logo position as a per-observation multiple alignment log-odds score. Such scores are positive or negative, depending on whether a column’s observations are better explained as arising from relatedness or chance. Within this framework, we propose distinct normalized maximum likelihood and Bayesian measures of column information. We illustrate these measures on High Mobility Group B (HMGB) box proteins and a dataset of enzyme alignments. Particularly in the context of protein alignments, our measures improve the discrimination of biologically relevant positions.

Availability and implementation: Our new measures are implemented in an open-source Web-based logo generation program, which is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/logoddslogo/index.html. A stand-alone version of the program is also available from this site.

Contact: altschul@ncbi.nlm.nih.gov

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ### Integrative data analysis indicates an intrinsic disordered domain character of Argonaute-binding motifs[Jan 2015]

Motivation: Argonaute-interacting WG/GW proteins are characterized by the presence of repeated sequence motifs containing glycine (G) and tryptophan (W). The motifs seem to be remarkably adaptive to amino acid substitutions and their sequences show non-contiguity. Our previous approach to the detection of GW domains, based on scoring their gross amino acid composition, allowed annotation of several novel proteins involved in gene silencing. The accumulation of new experimental data and more advanced applications revealed some deficiency of the algorithm in prediction selectivity. Additionally, W-motifs, though critical in gene regulation, have not yet been annotated in any available online resources.

Results: We present an improved set of computational tools allowing efficient management and annotation of W-based motifs involved in gene silencing. The new prediction algorithms provide novel functionalities by annotation of the W-containing domains at the local sequence motif level rather than by overall compositional properties. This approach represents a significant improvement over the previous method in terms of prediction sensitivity and selectivity. Application of the algorithm allowed annotation of a comprehensive list of putative Argonaute-interacting proteins across eukaryotes. An in-depth characterization of the domains’ properties indicates their intrinsically disordered character. In addition, we created a knowledge-based portal (whub) that provides access to tools and information on RNAi-related tryptophan-containing motifs.

Availability and implementation: The web portal and tools are freely available at http://www.comgen.pl/whub.

Contact: wmk@amu.edu.pl

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles
• ### An alternative approach to multiple testing for methylation QTL mapping reduces the proportion of falsely identified CpGs[Jan 2015]

Introduction: An increasing number of studies investigate the influence of local genetic variation on DNA methylation levels, so-called in cis methylation quantitative trait loci (meQTLs). A common multiple testing approach in genome-wide cis meQTL studies limits the false discovery rate (FDR) among all CpG–SNP pairs to 0.05 and reports on CpGs from the significant CpG–SNP pairs. However, a statistical test for each CpG is not performed, potentially increasing the proportion of CpGs falsely reported on. Here, we present an alternative approach that properly controls for multiple testing at the CpG level.

Results: We performed cis meQTL mapping for varying window sizes using publicly available single-nucleotide polymorphism (SNP) and 450 kb data, extracting the CpGs from the significant CpG–SNP pairs (FDR < 0.05). Using a new bait-and-switch simulation approach, we show that up to 50% of the CpGs found in the simulated data may be false-positive results. We present an alternative two-step multiple testing approach using the Simes and Benjamini–Hochberg procedures that does control the FDR among the CpGs, as confirmed by the bait-and-switch simulation. This approach indicates the use of window sizes in cis meQTL mapping studies that are significantly smaller than commonly adopted.
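
One plausible reading of the two-step approach is sketched below: the p-values of all CpG–SNP pairs belonging to one CpG are first combined into a single Simes p-value, and the Benjamini–Hochberg procedure is then applied across CpGs. This is a sketch under that assumption rather than the authors' script (which is provided in R as supplementary material); the input layout and function names are hypothetical.

```python
def simes_p(pvalues):
    """Simes combination: min over ranks i of m * p_(i) / i."""
    ordered = sorted(pvalues)
    m = len(ordered)
    return min(m * p / rank for rank, p in enumerate(ordered, start=1))


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the keys rejected at FDR level alpha.

    pvalues: dict mapping a hypothesis label to its p-value.
    """
    ranked = sorted(pvalues.items(), key=lambda item: item[1])
    n = len(ranked)
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= alpha * rank / n:
            cutoff = rank  # largest rank that passes the step-up threshold
    return {key for key, _ in ranked[:cutoff]}


def significant_cpgs(pairs, alpha=0.05):
    """pairs: dict CpG -> list of p-values of its CpG-SNP pairs."""
    per_cpg = {cpg: simes_p(ps) for cpg, ps in pairs.items() if ps}
    return benjamini_hochberg(per_cpg, alpha)


pairs = {"cg1": [0.001, 0.2, 0.5], "cg2": [0.04, 0.6], "cg3": [0.3, 0.9]}
print(significant_cpgs(pairs))  # {'cg1'}
```

Because each CpG contributes exactly one p-value to the second step, the FDR is controlled among CpGs rather than among CpG–SNP pairs.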

Discussion: Our approach to cis meQTL mapping properly controls the FDR at the CpG level, is computationally fast and can also be applied to cis eQTL studies.

Availability and implementation: An exemplary R script for performing the Simes procedure is available as supplementary material.

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles