The latest research articles published by BMC Bioinformatics
Background: Protein complexes are basic cellular entities that carry out the functions of their components. In yeast protein complex databases such as CYC2008, heterodimeric complexes are the most common type of known protein complex. Although a number of methods have been proposed that try to predict protein sets forming complexes of arbitrary type simultaneously, they often fail to predict heterodimeric complexes. Results: In this paper, we design several features characterizing heterodimeric protein complexes based on genomic data sets, and propose a supervised learning method for the prediction of heterodimeric protein complexes. The method learns the parameters of the features, which are embedded in a naïve Bayes classifier. The log-likelihood ratio derived from the naïve Bayes classifier, with parameter values obtained by maximum likelihood estimation, scores a given pair of proteins and is used to predict whether the pair forms a heterodimeric complex. Five-fold cross-validation shows good performance on yeast. The trained classifiers also show higher predictive power than various existing algorithms on yeast data sets under both approximate and exact matching criteria. Conclusions: Heterodimeric protein complex prediction is a harder problem than general heteromeric complex prediction because heterodimeric complexes are topologically simpler. However, it turns out that their predictability can be improved by designing features specialized for heterodimeric protein complexes. Thus, the design of more sophisticated features for heterodimeric protein complexes, together with the accumulation of more accurate and useful genome-wide data sets, will lead to higher predictability of heterodimeric protein complexes. Our tool can be downloaded from http://imi.kyushu-u.ac.jp/~om/.
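The scoring scheme described above reduces to summing per-feature log-likelihood ratios. A minimal sketch in Python, assuming hypothetical binary features and toy probability tables (the paper's actual features and maximum-likelihood estimates are not reproduced here):

```python
import math

# Toy conditional probability tables, as would be obtained by maximum
# likelihood estimation on labelled protein pairs. Feature names and
# values are hypothetical stand-ins for the paper's designed features.
LIKELIHOODS = {
    "coexpressed": {"complex": {True: 0.8, False: 0.2},
                    "background": {True: 0.3, False: 0.7}},
    "colocalized": {"complex": {True: 0.9, False: 0.1},
                    "background": {True: 0.4, False: 0.6}},
}

def llr_score(pair_features):
    """Naive-Bayes log-likelihood ratio: positive scores favour the
    pair being a heterodimeric complex."""
    score = 0.0
    for name, value in pair_features.items():
        tables = LIKELIHOODS[name]
        score += math.log(tables["complex"][value] / tables["background"][value])
    return score

print(llr_score({"coexpressed": True, "colocalized": True}))  # > 0: predicted complex
```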
Background: Disulfide engineering is an important biotechnological tool that has advanced a wide range of research. The introduction of novel disulfide bonds into proteins has been used extensively to improve protein stability, modify functional characteristics, and assist in the study of protein dynamics. Successful use of this technology is greatly enhanced by software that can predict pairs of residues likely to form a disulfide bond if mutated to cysteines. Results: We previously developed and distributed software for this purpose: Disulfide by Design (DbD). The original DbD program has been widely used; however, it has a number of limitations, including a Windows platform dependency. Here, we introduce Disulfide by Design 2.0 (DbD2), a web-based, platform-independent application that significantly extends functionality, visualization, and analysis capabilities beyond the original program. Among the enhancements to the software is the ability to analyze the B-factor of protein regions involved in predicted disulfide bonds. Importantly, this feature facilitates the identification of potential disulfides that are not only likely to form but are also expected to provide improved thermal stability to the protein. Conclusions: DbD2 provides platform-independent access and significantly extends the original functionality of DbD. A web server hosting DbD2 is provided at http://cptweb.cpt.wayne.edu/DbD2/.
wKinMut: An integrated tool for the analysis and interpretation of mutations in human protein kinases
Background: Protein kinases are involved in many relevant physiological functions, and a broad number of mutations in this superfamily have been reported in the literature to affect protein function and stability. Unfortunately, exploring the phenotypic consequences of each individual mutation remains a considerable challenge. Results: The wKinMut web server offers direct prediction of the potential pathogenicity of mutations using a number of methods, including our recently developed predictor, which combines information from a range of diverse sources: physicochemical properties and functional annotations from FireDB and SwissProt; kinase-specific characteristics such as membership in specific kinase groups, annotation with disease-associated GO terms, and occurrence of the mutation in PFAM domains; and the relevance of the residues in determining kinase subfamily specificity, from S3Det. This predictor yields interesting results that compare favourably with other methods in the field when applied to protein kinases. Together with the predictions, wKinMut offers a number of integrated services for the analysis of mutations, including: classification of the kinase; information about associations of the kinase with other proteins, extracted from iHop; mapping of the mutations onto PDB structures; pathogenicity records from a number of databases; and the classification of mutations in large-scale cancer studies. Importantly, wKinMut is connected to the SNP2L system, which extracts mentions of mutations directly from the literature, thereby increasing the chances of finding interesting functional information associated with the studied mutations. Conclusions: wKinMut facilitates the exploration of the information available about individual mutations by integrating prediction approaches with the automatic extraction of information from the literature (text mining) and several state-of-the-art databases. wKinMut has been used during the last year for the analysis of the consequences of mutations in the context of a number of cancer genome projects, including a recent analysis of Chronic Lymphocytic Leukemia cases, and is publicly available at http://wkinmut.bioinfo.cnio.es.
Motivation: Within Flux Balance Analysis, the investigation of complex subtasks, such as finding the optimal perturbation of a network or an optimal combination of drugs, often requires setting up a bilevel optimization problem. In order to keep these nested optimization problems linear and convex, an ON/OFF description of the effect of the perturbation (i.e., a Boolean variable) is normally used. This restriction may not be realistic when one wants, for instance, to describe the partial inhibition of a reaction induced by a drug. Results: In this paper we present a formulation of the bilevel optimization which overcomes the oversimplified ON/OFF modeling while preserving the linear nature of the problem. A case study is considered: the search for the best multi-drug treatment which modulates an objective reaction and has minimal perturbation on the whole network. Drug inhibition is described and modulated through a convex combination of a fixed number of Boolean variables. The results obtained from applying the algorithm to the core metabolism of E. coli highlight the possibility of finding a broader spectrum of drug combinations compared to simple ON/OFF modeling. Conclusions: The method we have presented is capable of treating partial inhibition inside a bilevel optimization without losing the linearity property, and with reasonable computational performance even on large metabolic networks. The finer-grained representation of the perturbation enlarges the repertoire of synergistic drug combinations for tasks such as selective perturbation of cellular metabolism. This may encourage the use of the approach in other cases where more realistic modeling is required.
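The convex-combination trick can be sketched with any MILP modeller. Below is a minimal illustration using PuLP (an assumption; the paper does not prescribe a solver), with a single hypothetical capacity constraint rather than a full bilevel flux-balance program:

```python
from pulp import LpProblem, LpVariable, LpBinary, LpMaximize, lpSum

# One hypothetical reaction with capacity v_max; the drug effect is a
# convex combination of Boolean variables over fixed inhibition levels,
# which keeps the program linear.
levels = [0.0, 0.25, 0.5, 0.75, 1.0]
v_max = 10.0

prob = LpProblem("partial_inhibition", LpMaximize)
v = LpVariable("v", lowBound=0)
y = [LpVariable(f"y{k}", cat=LpBinary) for k in range(len(levels))]

prob += v                                   # objective: maximise the flux
prob += lpSum(y) == 1                       # exactly one level is active
inhibition = lpSum(l * yk for l, yk in zip(levels, y))
prob += v <= v_max * (1 - inhibition)       # partially inhibited capacity
prob += inhibition >= 0.5                   # demand at least 50% inhibition

prob.solve()
print(v.value(), [yk.value() for yk in y])  # v = 5.0 at the 0.5 level
```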
Transferring functional annotations of membrane transporters on the basis of sequence similarity and sequence motifs
Background: Membrane transporters catalyze the transport of small solute molecules across biological barriers such as lipid bilayer membranes. Experimental identification of the transported substrates is very tedious. Once a particular transport mechanism has been identified in one organism, it is thus highly desirable to transfer this information to related transporter sequences in other organisms based on bioinformatics evidence. Results: We present a thorough benchmark of the level of sequence identity at which membrane transporters from Escherichia coli, Saccharomyces cerevisiae, and Arabidopsis thaliana belong to the same families of the Transporter Classification (TC) system, and of the level at which these membrane transporters mediate transport of the same substrate. We found that two membrane transporter sequences from different organisms that align with a normalized BLAST expectation value better than 1e-8 are highly likely to belong to the same TC family (F-measure around 90%). Enriched sequence motifs identified by MEME at thresholds below 1e-12 support accurate classification into TC families for about two thirds of the sequences (F-measure 80% and higher). For the comparison of transported substrates, we focused on the four largest substrate classes: amino acids, sugars, metal ions, and phosphate. At the same identity thresholds, the nature of the transported substrates was more divergent (F-measure 40-75%) than TC family membership. Conclusions: We suggest an acceptable threshold of 1e-8 for BLAST and HMMER, at which at least three quarters of the sequences are classified according to the TC system with reasonably high accuracy. Researchers who wish to apply these thresholds in their own studies should multiply them by the size of the database they search against. Our findings should be useful to those who wish to transfer transporter functional annotations across species.
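In practice, annotation transfer at such a threshold is a one-line filter. The sketch below, with hypothetical hit records, also shows the F-measure used in the benchmark and the database-size adjustment mentioned above:

```python
def transfer(hits, threshold=1e-8, db_size_factor=1.0):
    """Transfer the TC family of the best hit when its E-value clears the
    (database-size adjusted) threshold. hits: query -> (evalue, family)."""
    return {q: fam for q, (e, fam) in hits.items()
            if e <= threshold * db_size_factor}

def f_measure(calls, truth):
    """Harmonic mean of precision and recall over transferred labels."""
    tp = sum(truth.get(q) == fam for q, fam in calls.items())
    if not tp:
        return 0.0
    precision, recall = tp / len(calls), tp / len(truth)
    return 2 * precision * recall / (precision + recall)

# Illustrative data only: query -> (E-value, TC family of best hit).
hits = {"q1": (1e-30, "2.A.1"), "q2": (1e-9, "2.A.1"), "q3": (1e-3, "1.A.1")}
truth = {"q1": "2.A.1", "q2": "2.A.6", "q3": "1.A.1"}
calls = transfer(hits)
print(calls, f_measure(calls, truth))   # q3 is (correctly) left unannotated
```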
Background: Proteins perform their functions in associated cellular locations. Therefore, the study of protein function can be facilitated by predictions of protein location. Protein location can be predicted either from the sequence of a protein alone, by identification of targeting peptide sequences and motifs, or by homology to proteins of known location. A third, complementary approach exploits the differences in amino acid composition of proteins associated with different cellular locations, and can be useful if motif and homology information are missing. Here we expand this approach, taking into account amino acid composition at different levels of amino acid exposure. Results: Our method has two stages. In stage one, we trained multiple Support Vector Machines (SVMs) to score eukaryotic protein sequences for membership in each of three categories: nuclear, cytoplasmic, and extracellular, plus an extra nucleocytoplasmic category, accounting for the fact that a large number of proteins shuttle between those two locations. In stage two, we use an artificial neural network (ANN) to propose a category from the scores given to the four locations in stage one. The method reaches an accuracy of 68% when using as input 3D-derived values of amino acid exposure. Calibration of the method using predicted values of amino acid exposure allows proteins without 3D information to be classified with an accuracy of 62%, and discerns proteins in different locations even if they share high levels of identity. Conclusions: In this study we explored the relationship between residue exposure and protein subcellular location. We developed a new algorithm for subcellular location prediction that uses residue exposure signatures. Our algorithm uses a novel approach to address the multiclass classification problem. The algorithm is implemented as the web server 'NYCE' and can be accessed at http://cbdm.mdc-berlin.de/~amer/nyce.
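The two-stage architecture (per-location SVM scores fed to an ANN) can be sketched with scikit-learn. The random matrix below is a stand-in for composition-at-exposure features; nothing here reproduces NYCE's trained models:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

LOCATIONS = ["nuclear", "cytoplasmic", "extracellular", "nucleocytoplasmic"]

# X: composition features (e.g. amino-acid frequencies at exposure levels),
# y: integer location labels. Random stand-in data for illustration only.
rng = np.random.default_rng(0)
X = rng.random((200, 40))
y = rng.integers(0, 4, 200)

# Stage 1: one SVM per location, scoring membership vs the rest.
svms = [SVC(kernel="rbf").fit(X, (y == k).astype(int))
        for k in range(len(LOCATIONS))]
scores = np.column_stack([svm.decision_function(X) for svm in svms])

# Stage 2: an ANN turns the four scores into a final location call.
ann = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=0).fit(scores, y)
print(LOCATIONS[ann.predict(scores[:1])[0]])
```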
Background: A novel highly conserved protein domain, DUF162 [Pfam: PF02589], can be mapped to two proteins: LutB and LutC. Both proteins are encoded by the highly conserved LutABC operon, which has been implicated in lactate utilization in bacteria. Based on our analysis of its sequence and structure, and on recent experimental evidence reported by other groups, we hereby redefine DUF162 as the LUD domain family. Results: JCSG solved the first crystal structure [PDB:2G40] from the LUD domain family: the LutC protein, encoded by ORF DR_1909 of Deinococcus radiodurans. LutC shares features with domains in the functionally diverse ISOCOT superfamily. We have observed that the LUD domain is increased in abundance in the human gut microbiome. Conclusions: We propose a model for substrate and cofactor binding and regulation in the LUD domain. The significance of LUD-containing proteins in the human gut microbiome, and the implication of lactate metabolism in the radiation resistance of Deinococcus radiodurans, are discussed.
Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data
Background: Gene expression profiling and other genome-scale measurement technologies provide comprehensive information about molecular changes resulting from a chemical or genetic perturbation, or a disease state. A critical challenge is the development of methods to interpret these large-scale data sets, identifying specific biological mechanisms that can provide experimentally verifiable hypotheses and lead to an understanding of disease and drug action. Results: We present a detailed description of Reverse Causal Reasoning (RCR), a reverse-engineering methodology to infer mechanistic hypotheses from molecular profiling data. This methodology requires prior knowledge in the form of small networks that causally link a key upstream controller node, representing a biological mechanism, to downstream measurable quantities. These small directed networks are generated from a knowledge base of literature-curated qualitative biological cause-and-effect relationships expressed as a network. The small mechanism networks are evaluated as hypotheses to explain observed differential measurements. We provide a simple implementation of this methodology, Whistle, specifically geared towards the analysis of gene expression data and using prior knowledge expressed in Biological Expression Language (BEL). We present Whistle analyses for three transcriptomic data sets using a publicly available knowledge base. The mechanisms inferred by Whistle are consistent with the expected biology for each data set. Conclusions: Reverse Causal Reasoning yields mechanistic insights into the interpretation of gene expression profiling data that are distinct from, and complementary to, the results of analyses using ontology or pathway gene sets. This reverse-engineering algorithm provides an evidence-driven approach to the development of models of disease, drug action, and drug toxicity.
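The core of evaluating such a mechanism hypothesis is a concordance test: how many downstream measurements move in the direction the signed causal edges predict. A simplified stand-in (not the Whistle implementation), assuming SciPy is available:

```python
from scipy.stats import binomtest

def concordance(hyp_edges, observed):
    """Score an 'upstream controller increases' hypothesis: count downstream
    genes whose measured direction matches the signed causal edge, and test
    against a 50/50 null. Unmeasured genes are skipped."""
    matches = trials = 0
    for gene, sign in hyp_edges.items():    # sign: +1 activates, -1 represses
        direction = observed.get(gene)      # +1 up, -1 down, None unmeasured
        if direction is None:
            continue
        trials += 1
        matches += (direction == sign)
    return matches, trials, binomtest(matches, trials, 0.5).pvalue

# Illustrative hypothesis network and differential measurements.
edges = {"g1": 1, "g2": 1, "g3": -1, "g4": 1}
obs = {"g1": 1, "g2": 1, "g3": -1, "g4": -1}
print(concordance(edges, obs))   # (3, 4, 0.625)
```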
zipHMMlib: a highly optimised HMM library exploiting repetitions in the input to speed up the forward algorithm
Background: Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms. Calculating the likelihood of a model using the forward algorithm has worst-case time complexity linear in the length of the sequence and quadratic in the number of states in the model. For genome analysis, however, the length runs to millions or billions of observations, and hundreds of likelihood evaluations are often needed during maximisation. A time-efficient forward algorithm is therefore a key ingredient in an efficient hidden Markov model library. Results: We have built a software library for efficiently computing the likelihood of a hidden Markov model. The library exploits commonly occurring substrings in the input to reuse computations in the forward algorithm. In a preprocessing step our library identifies common substrings and builds a structure over the computations in the forward algorithm which can be reused. This analysis can be saved between uses of the library and is independent of concrete hidden Markov models, so one round of preprocessing can be reused across a number of different models. Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model. In one particular case the analysis was performed in less than 8 minutes, compared to 9.6 hours for the previously fastest library. Conclusions: We have implemented the preprocessing procedure and forward algorithm as a C++ library, zipHMM, with Python bindings for use in scripts. The library is available at http://birc.au.dk/software/ziphmm/.
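The trick can be sketched in a few lines: write each forward step as multiplication by a per-symbol matrix, then give a frequent symbol pair its own precomputed matrix, so a repeated substring's work is done once. A toy version of the idea (not zipHMM's actual data structures):

```python
import numpy as np
from collections import Counter

def forward_loglik(pi, b, M, obs):
    """Scaled forward algorithm; each symbol s advances alpha by a
    precomputed operator M[s] = A @ diag(b[s])."""
    alpha = pi * b[obs[0]]
    loglik = np.log(alpha.sum()); alpha = alpha / alpha.sum()
    for s in obs[1:]:
        alpha = alpha @ M[s]
        z = alpha.sum(); loglik += np.log(z); alpha = alpha / z
    return loglik

def compress_once(obs, M):
    """Replace the most frequent adjacent pair with a fresh symbol whose
    operator is the product of the pair's operators."""
    head, tail = obs[0], list(obs[1:])      # first symbol stays literal
    (s1, s2), _ = Counter(zip(tail, tail[1:])).most_common(1)[0]
    new = max(M) + 1
    M[new] = M[s1] @ M[s2]
    out, i = [head], 0
    while i < len(tail):
        if i + 1 < len(tail) and (tail[i], tail[i + 1]) == (s1, s2):
            out.append(new); i += 2
        else:
            out.append(tail[i]); i += 1
    return out

A = np.array([[0.9, 0.1], [0.2, 0.8]])
b = {0: np.array([0.7, 0.1]), 1: np.array([0.3, 0.9])}
M = {s: A @ np.diag(bs) for s, bs in b.items()}
pi = np.array([0.5, 0.5])
obs = [0, 1, 0, 1, 0, 1, 0, 1]
zipped = compress_once(list(obs), M)
print(np.isclose(forward_loglik(pi, b, M, obs),
                 forward_loglik(pi, b, M, zipped)))  # True, fewer steps
```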
MetSizeR: selecting the optimal sample size for metabolomic studies using an analysis-based approach
Background: Determining sample sizes for metabolomic experiments is important, but due to the complexity of these experiments there are currently no standard methods for sample size estimation in metabolomics. Since pilot studies are rarely done in metabolomics, existing sample size estimation approaches, which rely on pilot data, cannot be applied. Results: In this article, an analysis-based approach called MetSizeR is developed to estimate sample size for metabolomic experiments even when experimental pilot data are not available. The key motivation for MetSizeR is that it considers the type of analysis the researcher intends to use when estimating sample size. MetSizeR uses information about the data analysis technique and prior expert knowledge of the metabolomic experiment to simulate pilot data from a statistical model. Permutation-based techniques are then applied to the simulated pilot data to estimate the required sample size. Conclusions: The MetSizeR methodology, and a publicly available software package which implements the approach, are illustrated through real metabolomic applications. Sample size estimates, informed by the intended statistical analysis technique, are provided along with their associated uncertainty.
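The approach can be illustrated end to end in a few lines: simulate pilot data from a simple model, compute permutation p-values, and scan candidate sample sizes. This is a toy stand-in for MetSizeR's model-based simulation (the package itself uses PPCA-type models and targets the FDR rather than the power criterion used here):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_pilot(n_per_group, n_metab=60, n_diff=10, effect=1.2):
    """Simulated pilot spectra: most metabolites are noise, the first
    n_diff shift by 'effect' in the second group (toy model)."""
    x = rng.normal(size=(2 * n_per_group, n_metab))
    x[n_per_group:, :n_diff] += effect
    return x

def perm_pvalues(x, n_perm=200):
    """Permutation p-values for a group mean-difference statistic."""
    n = x.shape[0] // 2
    obs = np.abs(x[n:].mean(0) - x[:n].mean(0))
    exceed = np.zeros(x.shape[1])
    for _ in range(n_perm):
        xp = rng.permutation(x, axis=0)     # shuffle group labels
        exceed += np.abs(xp[n:].mean(0) - xp[:n].mean(0)) >= obs
    return (exceed + 1) / (n_perm + 1)

# Report, per candidate n, the average fraction of truly changed
# metabolites reaching p < 0.05 over repeated simulated pilots.
for n in (3, 5, 8, 12, 20):
    power = np.mean([perm_pvalues(simulate_pilot(n))[:10] < 0.05
                     for _ in range(20)])
    print(n, round(float(power), 2))
```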
Background: Phylogenetic comparative analyses usually rely on a single consensus phylogenetic tree in order to study evolutionary processes. However, most phylogenetic trees are incomplete with regard to species sampling, which may critically compromise analyses. Some approaches have been proposed to integrate non-molecular phylogenetic information into incomplete molecular phylogenies. An expanded tree approach consists of adding missing species at random locations within their clade. The information contained in the topology of the resulting expanded trees can be captured by the pairwise phylogenetic distances between species and stored in a matrix for further statistical analysis. Thus, the random expansion and processing of multiple phylogenetic trees can be used to estimate phylogenetic uncertainty through a simulation procedure. Because of the computational burden involved, such analyses are of limited applicability unless the procedure is efficiently implemented. Results: In this paper, we present efficient algorithms and implementations for randomly expanding and processing phylogenetic trees, so that the simulations involved in comparative phylogenetic analysis with uncertainty can be conducted in reasonable time. We propose algorithms both for randomly expanding trees and for calculating distance matrices. The source code, written in C++, is freely available; it may be used as a standalone program or as a shared object in the R system. The software can also be used as a web service through the link http://purl.oclc.org/NET/sunplin/. Conclusions: We compare our implementations to similar solutions and show that significant performance gains can be obtained. Our results open up the possibility of accounting for phylogenetic uncertainty in evolutionary and ecological analyses of large datasets.
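The expansion step itself is small: choose a random edge inside the species' clade and split it with a new internal node carrying the missing species. A self-contained sketch (not the paper's C++ implementation; branch-length handling is simplified):

```python
import random

class Node:
    def __init__(self, name=None, children=None, length=1.0):
        self.name, self.length, self.children = name, length, children or []

def edges(node):
    """Yield every (parent, child) edge in the subtree."""
    for child in node.children:
        yield node, child
        yield from edges(child)

def expand(clade, species):
    """Attach 'species' at a random location within its clade by
    splitting a random edge with a new internal node."""
    parent, child = random.choice(list(edges(clade)))
    graft = Node(children=[child, Node(species, length=child.length / 2)],
                 length=child.length / 2)
    child.length /= 2
    parent.children[parent.children.index(child)] = graft

def newick(node):
    if not node.children:
        return f"{node.name}:{node.length:g}"
    return "(" + ",".join(newick(c) for c in node.children) + f"):{node.length:g}"

random.seed(4)
clade = Node(children=[Node("sp_A"), Node("sp_B")])
expand(clade, "sp_C")          # one random expansion; repeating over many
print(newick(clade) + ";")     # trees estimates phylogenetic uncertainty
```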
Background: To access the large amount of information in the biomedical literature about genes implicated in various cancers both efficiently and accurately, the aid of text mining (TM) systems is invaluable. Current TM systems target either gene-cancer relations or biological processes involving genes and cancers, but the former produce information that is not comprehensive enough to explain how a gene affects a cancer, and the latter do not provide a concise summary of gene-cancer relations. Results: In this paper, we present a corpus for the development of TM systems that specifically target gene-cancer relations while still capturing complex information in biomedical sentences. We describe CoMAGC, a corpus with multi-faceted annotations of gene-cancer relations. In CoMAGC, a piece of annotation is composed of four semantically orthogonal concepts that together express 1) how a gene changes, 2) how a cancer changes, and 3) the causality between the gene and the cancer. The multi-faceted annotations are shown to have high inter-annotator agreement. In addition, we show that the annotations in CoMAGC allow us to infer the prospective roles of genes in cancers and to classify the genes into three classes according to the inferred roles. We encode the mapping between multi-faceted annotations and gene classes into 10 inference rules. The inference rules produce results with high accuracy, as measured against human annotations. CoMAGC consists of 821 sentences on prostate, breast, and ovarian cancers. Among the various types of gene change, the corpus currently covers changes in gene expression levels. The corpus is available at http://biopathway.org/CoMAGC under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0). Conclusions: The corpus will be an important resource for the development of advanced TM systems on gene-cancer relations.
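The inference-rule idea can be shown in miniature: combine the gene-change, cancer-change, and causality facets of an annotation into a prospective gene class. The rule table below is illustrative only, not the paper's actual set of 10 rules:

```python
def infer_role(gene_change, cancer_change, causality):
    """Map one multi-faceted annotation to a prospective gene class
    (toy rules; CoMAGC encodes ten such mappings)."""
    if not causality:
        return "unclassified"
    if (gene_change, cancer_change) == ("increased", "progression"):
        return "oncogene"
    if (gene_change, cancer_change) == ("decreased", "progression"):
        return "tumour suppressor"
    if (gene_change, cancer_change) == ("decreased", "regression"):
        return "oncogene"
    return "biomarker only"

print(infer_role("increased", "progression", True))   # oncogene
```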
3DScapeCS: application of three dimensional, parallel, dynamic network visualization in Cytoscape
Background: The exponential growth of enormous biological data sets from various sources, such as protein-protein interaction (PPI), genome sequence scaffolding, mass spectrometry (MS) molecular networking, and metabolic flux, demands efficient means of visualization and interpretation beyond conventional two-dimensional tools. Results: We developed a 3D Cytoscape Client/Server (3DScapeCS) plugin, which adopts Cytoscape for interpreting different types of data and UbiGraph for three-dimensional visualization. In five case studies, the extra dimension proved useful in accommodating, visualizing, and distinguishing large-scale networks with many crossing connections. Conclusions: Evaluation on several experimental data sets using 3DScapeCS and its special features, including multilevel graph layout, time-course data animation, and parallel visualization, has proven its usefulness in visualizing complex data and helping users draw insightful conclusions.
PlantTFcat: an online plant transcription factor and transcriptional regulator categorization and analysis tool
Background: Plants regulate intrinsic gene expression through transcription factors (TFs), transcriptional regulators (TRs), chromatin regulators (CRs), and the basal transcription machinery. An understanding of plant gene regulatory mechanisms at a systems level requires the identification of these regulatory elements on a genomic scale. Results: Here, we present PlantTFcat, a high-performance web-based analysis tool that is designed to identify and categorize plant TF/TR/CR genes from genome-scale protein and nucleic acid sequences by systematically analyzing InterProScan domain patterns in protein sequences. The comprehensive prediction logics that are included in PlantTFcat are based on relationships between gene families and conserved domains from 108 published plant TF/TR/CR families. These prediction logics effectively distinguish TF/TR/CR families with common conserved domains. Our systematic performance evaluations indicate that PlantTFcat annotates known TF/TR/CR families with high coverage and sensitivity. Conclusions: PlantTFcat provides an analysis tool to identify and categorize plant TF/TR/CR genes on a genomic scale. PlantTFcat is freely available to the public at http://plantgrn.noble.org/PlantTFcat/.
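Categorization of this kind boils down to matching a sequence's detected domain set against per-family required/forbidden domain patterns. A sketch with placeholder accessions (not PlantTFcat's real logic table):

```python
# Domain-pattern rules in the spirit of PlantTFcat: a family call requires
# certain InterPro domains and may be vetoed by others. Accessions below
# are placeholders, not real InterPro identifiers.
RULES = {
    "FamilyA": {"require": {"IPR_X"}, "forbid": set()},
    "FamilyB": {"require": {"IPR_X", "IPR_Y"}, "forbid": {"IPR_Z"}},
}

def categorize(domains):
    """Return the family whose required domains are all present and whose
    forbidden domains are absent, preferring the most specific rule; this
    is how families sharing a common domain can still be distinguished."""
    hits = [fam for fam, r in RULES.items()
            if r["require"] <= domains and not (r["forbid"] & domains)]
    return max(hits, key=lambda f: len(RULES[f]["require"]), default=None)

print(categorize({"IPR_X", "IPR_Y"}))   # FamilyB
```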
Design of RNA splicing analysis null models for post hoc filtering of Drosophila head RNA-Seq data with the splicing analysis kit (Spanki)
Background: The production of multiple transcript isoforms from one gene is a major source of transcriptome complexity. RNA-Seq experiments, in which transcripts are converted to cDNA and sequenced, allow the resolution and quantification of alternative transcript isoforms. However, methods to analyze splicing are underdeveloped, and errors resulting in incorrect splicing calls occur in every experiment. Results: We used RNA-Seq data to develop sequencing and aligner error models. By applying these error models to known input from simulations, we found that errors result from false alignment to minor splice motifs and antisense strands, shifted junction positions, paralog joining, and repeat-induced gaps. Using a series of quantitative and qualitative filters, we eliminated diagnosed errors in the simulation, and applied the same filters to RNA-Seq data from Drosophila melanogaster heads. We used high-confidence junction detections to specifically interrogate local splicing differences between transcripts. This method outperformed commonly used RNA-Seq methods in identifying known alternative splicing events in the Drosophila sex determination pathway. We describe a flexible software package, the Splicing Analysis Kit (Spanki), that performs these tasks and is available at http://www.cbcb.umd.edu/software/spanki. Conclusions: Splice-junction-centric analysis of RNA-Seq data provides advantages in specificity for the detection of alternative splicing. Our software provides tools to better understand error profiles in RNA-Seq data and improve inference from this new technology. The splice-junction-centric approach that this software enables will provide more accurate estimates of differentially regulated splicing than current tools.
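Post hoc filtering of junction calls is conceptually simple: keep junctions whose read support and splice-motif class clear thresholds. The record fields below are illustrative, not Spanki's actual schema:

```python
# Hypothetical junction records in the spirit of Spanki's filters:
# coverage, splice motif, and read-offset entropy per junction.
junctions = [
    {"id": "chr2L:5000-5600", "cov": 42, "motif": "GT..AG", "entropy": 3.1},
    {"id": "chr2L:7000-7090", "cov": 2,  "motif": "AT..AC", "entropy": 0.4},
]

def high_confidence(j, min_cov=5, min_entropy=1.0, motifs=("GT..AG",)):
    """Keep junctions with enough diverse read support and a major splice
    motif; rare motifs and low-entropy read pile-ups are error-prone."""
    return (j["cov"] >= min_cov and j["entropy"] >= min_entropy
            and j["motif"] in motifs)

print([j["id"] for j in junctions if high_confidence(j)])
```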
Background: Multi-cellular segmentation of bright field microscopy images is an essential computational step when quantifying collective migration of cells in vitro. Despite the availability of various tools and algorithms, no publicly available benchmark has been proposed for evaluation and comparison of the different alternatives. Description: A uniform framework is presented to benchmark algorithms for multi-cellular segmentation in bright field microscopy images. A freely available set of 171 manually segmented images from diverse origins was partitioned into 8 datasets and used to evaluate three leading designated tools. Conclusions: The presented benchmark resource for evaluating segmentation algorithms of bright field images is the first public annotated dataset for this purpose. This annotated dataset of diverse examples allows fair evaluation and comparison of future segmentation methods. Scientists are encouraged to assess new algorithms on this benchmark, and to contribute additional annotated datasets.
Background: Constrained minimal cut sets (cMCSs) have recently been introduced as a framework to enumerate minimal genetic intervention strategies for targeted optimization of metabolic networks. Two different algorithmic schemes (an adapted Berge algorithm and binary integer programming) have been proposed to compute cMCSs from elementary modes. However, in their original formulations, the two algorithms are not fully comparable. Results: Here we show that with a small extension to the integer program, both methods become equivalent. Furthermore, based on well-known preprocessing procedures for integer programming, we present efficient preprocessing steps that can be used with both algorithms. We then benchmark the numerical performance of the algorithms on several realistic medium-scale metabolic models. The benchmark calculations reveal (i) that these preprocessing steps can lead to an enormous speed-up with both algorithms, and (ii) that the adapted Berge algorithm outperforms the binary integer approach. Conclusions: Generally, both of our new implementations are at least one order of magnitude faster than other currently available implementations.
Automated analysis of phylogenetic clusters
Background: As sequence data sets used for the investigation of pathogen transmission patterns increase in size, automated tools and standardized methods for cluster analysis have become necessary. We have developed an automated Cluster Picker which identifies monophyletic clades meeting user-input criteria for bootstrap support and maximum genetic distance within large phylogenetic trees. A second tool, the Cluster Matcher, automates the process of linking genetic data to epidemiological or clinical data, and matches clusters between runs of the Cluster Picker. Results: We explore the effect of different bootstrap and genetic distance thresholds on clusters identified in a data set of publicly available HIV sequences, and compare these results to those of a previously published tool for cluster identification. To demonstrate their utility, we then use the Cluster Picker and Cluster Matcher together to investigate how clusters in the data set changed over time. We find that clusters containing sequences from more than one UK location at the first time point (multiple origin) were significantly more likely to grow than those representing only a single location. Conclusions: The Cluster Picker and Cluster Matcher can rapidly process phylogenetic trees containing tens of thousands of sequences. Together these tools will facilitate comparisons of pathogen transmission dynamics between studies and countries.
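The Cluster Picker criteria translate directly into a tree traversal. The sketch below assumes the ete3 library and uses patristic (tree) distances, whereas the actual tool measures pairwise genetic distances on the alignment:

```python
from ete3 import Tree

def pick_clusters(tree, min_support=0.9, max_dist=0.045):
    """Greedy preorder scan for monophyletic clades with high support and
    small diameter; a simplified re-implementation of the Cluster Picker
    idea, not the published tool."""
    clusters, claimed = [], set()
    for node in tree.traverse("preorder"):
        if node.is_leaf() or node.support < min_support:
            continue
        leaves = node.get_leaf_names()
        if claimed.intersection(leaves):
            continue   # already inside a reported cluster
        diameter = max(node.get_distance(a, b)
                       for i, a in enumerate(leaves) for b in leaves[i + 1:])
        if diameter <= max_dist:
            clusters.append(leaves)
            claimed.update(leaves)
    return clusters

t = Tree("((A:0.01,B:0.01)0.99:0.1,(C:0.2,D:0.02)0.5:0.1);")
print(pick_clusters(t))   # [['A', 'B']]: supported, tight clade
```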
DCE@urLAB: a dynamic contrast-enhanced MRI pharmacokinetic analysis tool for preclinical data
Background: DCE@urLAB is a software application for the analysis of dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) data. The tool incorporates a friendly graphical user interface (GUI) to interactively select and analyze a region of interest (ROI) within the image set, taking into account the tissue concentration of the contrast agent (CA) and its effect on pixel intensity. Results: Pixel-wise, model-based quantitative parameters are estimated by fitting DCE-MRI data to several pharmacokinetic models using the Levenberg-Marquardt algorithm (LMA). DCE@urLAB also includes the semi-quantitative parametric and heuristic analysis approaches commonly used in practice. The application has been programmed in the Interactive Data Language (IDL) and tested both with publicly available simulated data and with preclinical studies of tumor-bearing mouse brains. Conclusions: A user-friendly solution for applying pharmacokinetic and non-quantitative analysis to DCE-MRI data in preclinical studies has been implemented and tested. The proposed tool has been specially designed for easy selection of multi-pixel ROIs. A public release of DCE@urLAB, together with the open source code and sample datasets, is available at http://www.die.upm.es/im/archives/DCEurLAB/.
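Pixel-wise fitting of this kind is a nonlinear least-squares problem per voxel. A sketch of fitting the standard Tofts model with SciPy's Levenberg-Marquardt (synthetic arterial input function and data; DCE@urLAB itself is implemented in IDL):

```python
import numpy as np
from scipy.optimize import curve_fit

# Simple biexponential population AIF; parameters are illustrative.
def aif(t):
    return 5.0 * (np.exp(-0.2 * t) - np.exp(-2.0 * t))

t = np.linspace(0, 10, 200)          # minutes
dt = t[1] - t[0]

def tofts(t, ktrans, ve):
    """Standard Tofts model: Ct = Ktrans * Cp (*) exp(-Ktrans/ve * t)."""
    irf = np.exp(-(ktrans / ve) * t)
    return ktrans * np.convolve(aif(t), irf)[:t.size] * dt

# Synthetic 'tissue curve' with noise, then a Levenberg-Marquardt fit
# (scipy's curve_fit uses LM when no bounds are given).
rng = np.random.default_rng(0)
ct = tofts(t, 0.25, 0.4) + rng.normal(0, 0.01, t.size)
(ktrans, ve), _ = curve_fit(tofts, t, ct, p0=(0.1, 0.2))
print(round(ktrans, 3), round(ve, 3))   # recovers ~0.25, ~0.4
```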
A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms
Background: An in silico vaccine discovery pipeline for eukaryotic pathogens typically consists of several computational tools to predict protein characteristics. The aim of the in silico approach to discovering subunit vaccines is to use predicted characteristics to identify proteins which are worthy of laboratory investigation. A major challenge is that these predictions inherently contain hidden inaccuracies and contradictions. This study focuses on how to reduce the number of false candidates using machine learning algorithms rather than relying on expensive laboratory validation. Proteins from Toxoplasma gondii, Plasmodium sp., and Caenorhabditis elegans were used as training and test datasets. Results: The results show that machine learning algorithms can effectively distinguish expected true from expected false vaccine candidates (with an average sensitivity and specificity of 0.97 and 0.98, respectively) for proteins observed to induce immune responses experimentally. Conclusions: Vaccine candidates from an in silico approach can only be truly validated in a laboratory. Given any in silico output and appropriate training data, the number of false candidates allocated for validation can be dramatically reduced using a pool of machine learning algorithms. This will ultimately save time and money in the laboratory.
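A pool of classifiers over pipeline-derived protein features is straightforward with scikit-learn. The random matrix below stands in for the real predicted characteristics (signal peptides, transmembrane domains, epitopes, etc.); it is a sketch of the pooling idea, not the paper's exact algorithm set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Rows: candidate proteins; columns: scores from pipeline tools.
# Labels: 1 = expected true antigen, 0 = expected false candidate.
# Random stand-in data; real features come from the pipeline output.
rng = np.random.default_rng(0)
X = rng.random((300, 12))
y = (X[:, :3].mean(1) + 0.1 * rng.standard_normal(300) > 0.5).astype(int)

# A soft-voting pool: averaging class probabilities across diverse
# learners damps the inaccuracies of any single predictor.
pool = VotingClassifier([
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True)),
], voting="soft")

print(cross_val_score(pool, X, y, cv=5, scoring="f1").mean())
```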
Background: An in silico vaccine discovery pipeline for eukaryotic pathogens typically consists of several computational tools to predict protein characteristics. The aim of the in silico approach to discovering subunit vaccines is to use predicted characteristics to identify proteins which are worthy of laboratory investigation. A major challenge is that these predictions are inherent with hidden inaccuracies and contradictions. This study focuses on how to reduce the number of false candidates using machine learning algorithms rather than relying on expensive laboratory validation. Proteins from Toxoplasma gondii, Plasmodium sp., and Caenorhabditis elegans were used as training and test datasets. Results: The results show that machine learning algorithms can effectively distinguish expected true from expected false vaccine candidates (with an average sensitivity and specificity of 0.97 and 0.98 respectively), for proteins observed to induce immune responses experimentally. Conclusions: Vaccine candidates from an in silico approach can only be truly validated in a laboratory. Given any in silico output and appropriate training data, the number of false candidates allocated for validation can be dramatically reduced using a pool of machine learning algorithms. This will ultimately save time and money in the laboratory.