MLBio+Laboratory Machine Learning in Biomedical Informatics

BMC Bioinformatics

The latest research articles published by BMC Bioinformatics

Homology-based prediction of interactions between proteins using Averaged One-Dependence Estimators

Sun, 06/22/2014 - 20:00
Background: Identification of protein-protein interactions (PPIs) is essential for a better understanding of biological processes, pathways and functions. However, experimental identification of the complete set of PPIs in a cell/organism ("an interactome") is still a difficult task. To circumvent limitations of current high-throughput experimental techniques, it is necessary to develop high-performance computational methods for predicting PPIs. Results: In this article, we propose a new computational method to predict interaction between a given pair of protein sequences using features derived from known homologous PPIs. The proposed method is capable of predicting interaction between two proteins (of unknown structure) using Averaged One-Dependence Estimators (AODE) and three features calculated for the protein pair: (a) sequence similarities to a known interacting protein pair (FSeq), (b) statistical propensities of domain pairs observed in interacting proteins (FDom) and (c) a sum of edge weights along the shortest path between homologous proteins in a PPI network (FNet). Feature vectors were defined to lie in a half-space of the symmetrical high-dimensional feature space to make them independent of the protein order. The predictability of the method was assessed by a 10-fold cross validation on a recently created human PPI dataset with randomly sampled negative data, and the best model achieved an Area Under the Curve of 0.79 (pAUC0.5% = 0.16). In addition, the AODE trained on all three features (named PSOPIA) showed better prediction performance on a separate independent data set than a recently reported homology-based method. Conclusions: Our results suggest that FNet, a feature representing proximity in a known PPI network between two proteins that are homologous to a target protein pair, contributes to the prediction of whether the target proteins interact or not. PSOPIA will help identify novel PPIs and estimate complete PPI networks. 
The method proposed in this article is freely available on the web at
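
The abstract states that feature vectors are restricted to a half-space so that a protein pair gets the same representation regardless of the order of its two members. The exact construction is not given here, so the following is only a minimal sketch of one way to obtain the same order independence: it assumes each protein contributes a fixed-length tuple of numbers and picks a canonical order by lexicographic comparison. The function name `canonical_pair_vector` and the comparison rule are illustrative assumptions, not the PSOPIA implementation.

```python
from typing import Sequence, Tuple


def canonical_pair_vector(a: Sequence[float], b: Sequence[float]) -> Tuple[float, ...]:
    """Concatenate two per-protein feature tuples in a canonical order.

    Hypothetical helper: the lexicographically smaller tuple always comes
    first, so (a, b) and (b, a) map to the same concatenated vector.
    """
    ta, tb = tuple(a), tuple(b)
    if len(ta) != len(tb):
        raise ValueError("both proteins must have the same number of features")
    # Tuple comparison in Python is lexicographic, which gives a total order.
    first, second = (ta, tb) if ta <= tb else (tb, ta)
    return first + second
```

With this rule, `canonical_pair_vector(x, y)` and `canonical_pair_vector(y, x)` return the same tuple, so a classifier trained on the result never sees the two orderings as different examples.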

CicArMiSatDB: the chickpea microsatellite database

Fri, 06/20/2014 - 20:00
Background: Chickpea (Cicer arietinum) is a widely grown legume crop in tropical, sub-tropical and temperate regions. Molecular breeding approaches seem to be essential for enhancing crop productivity in chickpea. Until recently, only a limited number of molecular markers was available for molecular breeding in chickpea. However, recent advances in genomics have facilitated the large-scale development of markers, especially SSRs (simple sequence repeats), the markers of choice in any breeding program. The recent availability of the genome sequence opens new avenues for accelerating molecular breeding approaches for chickpea improvement. Description: In order to assist genetic studies and breeding applications, we have developed a user-friendly relational database named the Chickpea Microsatellite Database (CicArMiSatDB). This database provides detailed information on SSRs along with their features in the genome. SSRs have been classified and made accessible through an easy-to-use web interface. Conclusions: This database is expected to help the chickpea community in particular, and the legume community in general, to select SSRs of a particular type or from a specific region in the genome to advance both basic genomics research and applied aspects of crop improvement.

SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information

Thu, 06/19/2014 - 20:00
Background: The recent introduction of the Pacific Biosciences RS single molecule sequencing technology has opened new doors to scaffolding genome assemblies in a cost-effective manner. The long read sequence information promises to enhance the quality of incomplete and inaccurate draft assemblies constructed from Next Generation Sequencing (NGS) data. Results: Here we propose a novel hybrid assembly methodology that aims to scaffold pre-assembled contigs in an iterative manner using PacBio RS long read information as a backbone. On a test set comprising six bacterial draft genomes, assembled using either a single Illumina MiSeq or Roche 454 library, we show that even a 50x coverage of uncorrected PacBio RS long reads is sufficient to drastically reduce the number of contigs. Comparisons to the AHA scaffolder indicate our strategy is better able to produce (nearly) complete bacterial genomes. Conclusions: The current work describes our SSPACE-LongRead software, which is designed to upgrade incomplete draft genomes using single molecule sequences. We conclude that the recent advances of the PacBio sequencing technology and chemistry, in combination with the limited computational resources required to run our program, allow genomes to be scaffolded in a fast and reliable manner.

Exploiting large-scale drug-protein interaction information for computational drug repurposing

Thu, 06/19/2014 - 20:00
Background: Despite increased investment in pharmaceutical research and development, fewer and fewer new drugs are entering the marketplace. This has prompted studies in repurposing existing drugs for use against diseases with unmet medical needs. A popular approach is to develop a classification model based on drugs with and without a desired therapeutic effect. For this approach to be statistically sound, it requires a large number of drugs in both classes. However, given few or no approved drugs for the diseases of highest medical urgency and interest, different strategies need to be investigated. Results: We developed a computational method termed "drug-protein interaction-based repurposing" (DPIR) that is potentially applicable to diseases with very few approved drugs. The method, based on genome-wide drug-protein interaction information and Bayesian statistics, first identifies drug-protein interactions associated with a desired therapeutic effect. Then, it uses key drug-protein interactions to score other drugs for their potential to have the same therapeutic effect. Conclusions: Detailed cross-validation studies using United States Food and Drug Administration-approved drugs for hypertension, human immunodeficiency virus, and malaria indicated that DPIR provides robust predictions. It achieves high levels of enrichment of drugs approved for a disease even with models developed based on a single drug known to treat the disease. Analysis of our model predictions also indicated that the method is potentially useful for understanding molecular mechanisms of drug action and for identifying protein targets that may potentiate the desired therapeutic effects of other drugs (combination therapies).

Automated identification of cell-type-specific genes in the mouse brain by image computing of expression patterns

Thu, 06/19/2014 - 20:00
Background: Differential gene expression patterns in cells of the mammalian brain result in the morphological, connectional, and functional diversity of cells. A wide variety of studies have shown that certain genes are expressed only in specific cell-types. Analysis of cell-type-specific gene expression patterns can provide insights into the relationship between genes, connectivity, brain regions, and cell-types. However, automated methods for identifying cell-type-specific genes are lacking to date. Results: Here, we describe a set of computational methods for identifying cell-type-specific genes in the mouse brain by automated image computing of in situ hybridization (ISH) expression patterns. We applied invariant image feature descriptors to capture local gene expression information from cellular-resolution ISH images. We then built image-level representations by applying vector quantization on the image descriptors. We employed regularized learning methods for classifying genes specifically expressed in different brain cell-types. These methods can also rank image features based on their discriminative power. We used a data set of 2,872 genes from the Allen Brain Atlas in the experiments. Results showed that our methods are predictive of cell-type-specificity of genes. Our classifiers achieved AUC values of approximately 87% when the enrichment level is set to 20. In addition, we showed that the highly-ranked image features captured the relationship between cell-types. Conclusions: Overall, our results showed that automated image computing methods could potentially be used to identify cell-type-specific genes in the mouse brain.
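
The step of building an image-level representation by vector quantization can be illustrated with a small sketch. It assumes a codebook of centroids has already been learned (for example by clustering descriptors from training images); the abstract does not say how the authors built theirs. Each local descriptor is assigned to its nearest centroid, and the image is summarized as a normalized histogram of those assignments. The names `squared_distance` and `quantize` are hypothetical and the code uses only the Python standard library.

```python
from typing import List, Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def quantize(descriptors: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[float]:
    """Return a normalized histogram of nearest-centroid assignments.

    descriptors: local feature vectors extracted from one image.
    codebook:    centroids assumed to have been learned beforehand.
    """
    counts = [0] * len(codebook)
    for d in descriptors:
        # Index of the centroid closest to this descriptor.
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        # An image without descriptors yields an all-zero histogram.
        return [0.0] * len(codebook)
    return [c / total for c in counts]
```

The resulting fixed-length histogram is what a downstream classifier would consume, independent of how many descriptors each image produced.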

Stronger findings from mass spectral data through multi-peak modeling

Wed, 06/18/2014 - 20:00
Background: Mass spectrometry-based metabolomic analysis depends upon the identification of spectral peaks by their mass and retention time. Statistical analysis that follows the identification currently relies on one main peak of each compound. However, a compound present in the sample typically produces several spectral peaks due to its isotopic properties and the ionization process of the mass spectrometer device. In this work, we investigate the extent to which these additional peaks can be used to increase the statistical strength of differential analysis. Results: We present a Bayesian approach for integrating data of multiple detected peaks that come from one compound. We demonstrate the approach through a simulated experiment and validate it on ultra performance liquid chromatography-mass spectrometry (UPLC-MS) experiments for metabolomics and lipidomics. Peaks that are likely to be associated with one compound can be clustered by the similarity of their chromatographic shape. Changes of concentration between sample groups can be inferred more accurately when multiple peaks are available. Conclusion: When the sample-size is limited, the proposed multi-peak approach improves the accuracy at inferring covariate effects. An R implementation and data are available at

Automated peptide mapping and protein-topographical annotation of proteomics data

Wed, 06/18/2014 - 20:00
Background: In quantitative proteomics, peptide mapping is a valuable approach to combine positional quantitative information with topographical and domain information of proteins. Quantitative proteomic analysis of cell surface shedding is an exemplary application area of this approach. Results: We developed ImproViser for fully automated peptide mapping of quantitative proteomics data in the protXML data. The tool generates sortable and graphically annotated output, which can be easily shared with further users. As an exemplary application, we show its usage in the proteomic analysis of regulated intramembrane proteolysis. Conclusion: ImproViser is the first tool to enable automated peptide mapping of the widely-used protXML format.

A unifying model of genome evolution under parsimony

Wed, 06/18/2014 - 20:00
Background: Parsimony and maximum likelihood methods of phylogenetic tree estimation and parsimony methods for genome rearrangements are central to the study of genome evolution yet to date they have largely been pursued in isolation. Results: We present a data structure called a history graph that offers a practical basis for the analysis of genome evolution. It conceptually simplifies the study of parsimonious evolutionary histories by representing both substitutions and double cut and join (DCJ) rearrangements in the presence of duplications. The problem of constructing parsimonious history graphs thus subsumes related maximum parsimony problems in the fields of phylogenetic reconstruction and genome rearrangement. We show that tractable functions can be used to define upper and lower bounds on the minimum number of substitutions and DCJ rearrangements needed to explain any history graph. These bounds become tight for a special type of unambiguous history graph called an ancestral variation graph (AVG), which constrains in its combinatorial structure the number of operations required. We finally demonstrate that for a given history graph G, a finite set of AVGs describe all parsimonious interpretations of G, and this set can be explored with a few sampling moves. Conclusion: This theoretical study describes a model in which the inference of genome rearrangements and phylogeny can be unified under parsimony.

A model-based information sharing protocol for profile Hidden Markov Models used for HIV-1 recombination detection

Wed, 06/18/2014 - 20:00
Background: In many applications, a family of nucleotide or protein sequences classified into several subfamilies has to be modeled. Profile Hidden Markov Models (pHMMs) are widely used for this task, modeling each subfamily separately by one pHMM. However, a major drawback of this approach is the difficulty of dealing with subfamilies composed of very few sequences. One of the most crucial bioinformatical tasks affected by the problem of small-size subfamilies is the subtyping of human immunodeficiency virus type 1 (HIV-1) sequences, i.e., HIV-1 subtypes for which only a small number of sequences is known. Results: To deal with small samples for particular subfamilies of HIV-1, we introduce a novel model-based information sharing protocol. It estimates the emission probabilities of the pHMM modeling a particular subfamily not only based on the nucleotide frequencies of the respective subfamily but also incorporating the nucleotide frequencies of all available subfamilies. To this end, the underlying probabilistic model mimics the pattern of commonality and variation between the subtypes with regard to the biological characteristics of HI viruses. In order to implement the proposed protocol, we make use of an existing HMM architecture and its associated inference engine. Conclusions: We apply the modified algorithm to classify HIV-1 sequence data in the form of partial HIV-1 sequences and semi-artificial recombinants. Thereby, we demonstrate that the performance of pHMMs can be significantly improved by the proposed technique. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.
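
The idea of estimating a subfamily's emission probabilities from its own nucleotide frequencies together with those of all subfamilies can be illustrated with a simple shrinkage estimate. This is not the protocol from the article, whose probabilistic model is not specified in the abstract; it is a sketch that assumes a single alignment column, a pooled frequency vector used as a prior, and a hypothetical `weight` parameter that controls how strongly the pooled frequencies are mixed in.

```python
from typing import Dict, List

ALPHABET = "ACGT"


def pooled_frequencies(all_counts: List[Dict[str, int]]) -> Dict[str, float]:
    """Nucleotide frequencies pooled over every subfamily (one column)."""
    totals = {s: sum(c.get(s, 0) for c in all_counts) for s in ALPHABET}
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No observations at all: fall back to a uniform distribution.
        return {s: 1.0 / len(ALPHABET) for s in ALPHABET}
    return {s: totals[s] / grand_total for s in ALPHABET}


def shared_emissions(sub_counts: Dict[str, int],
                     all_counts: List[Dict[str, int]],
                     weight: float = 4.0) -> Dict[str, float]:
    """Shrink one subfamily's emission estimate toward the pooled frequencies.

    weight acts like a number of pseudo-observations drawn from the pooled
    distribution (an illustrative assumption, not a value from the article).
    """
    if weight <= 0:
        raise ValueError("weight must be positive")
    prior = pooled_frequencies(all_counts)
    n = sum(sub_counts.get(s, 0) for s in ALPHABET)
    return {s: (sub_counts.get(s, 0) + weight * prior[s]) / (n + weight)
            for s in ALPHABET}
```

For a subfamily with only a handful of sequences, the estimate stays close to the pooled frequencies and never assigns zero probability to a nucleotide seen elsewhere; as the subfamily's own counts grow, its observed frequencies dominate.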

Detecting protein complexes in protein interaction networks using a ranking algorithm with a refined merging procedure

Wed, 06/18/2014 - 20:00
Background: Developing suitable methods for the identification of protein complexes remains an active research area. It is important since it allows better understanding of cellular functions as well as malfunctions and it consequently leads to producing more effective cures for diseases. In this context, various computational approaches were introduced to complement high-throughput experimental methods which typically involve large datasets, are expensive in terms of time and cost, and are usually subject to spurious interactions. Results: In this paper, we propose ProRank+, a method which detects protein complexes in protein interaction networks. The presented approach is mainly based on a ranking algorithm which sorts proteins according to their importance in the interaction network, and a merging procedure which refines the detected complexes in terms of their protein members. ProRank+ was compared to several state-of-the-art approaches in order to show its effectiveness. It was able to detect more protein complexes with higher quality scores. Conclusions: The experimental results achieved by ProRank+ show its ability to detect protein complexes in protein interaction networks. Eventually, the method could potentially identify previously-undiscovered protein complexes. The datasets and source codes are freely available for academic purposes at
