The latest research articles published by BMC Bioinformatics
Background: Signatures are short sequences that are unique and not similar to any other sequence in a database, and they can be used as the basis for identifying different species. Although several signature discovery algorithms have been proposed in the past, these algorithms require the entire database to be loaded into memory, restricting the amount of data they can process and making them unable to handle databases with large amounts of data. Moreover, these algorithms use sequential models and have slow discovery speeds, so their efficiency can be improved. Results: In this research, we introduce the use of a divide-and-conquer strategy in signature discovery and propose a parallel signature discovery algorithm that runs on a computer cluster. The algorithm applies the divide-and-conquer strategy to overcome the inability of existing algorithms to process large databases, and uses a parallel computing mechanism to improve the efficiency of signature discovery. Even when run with only the memory of an ordinary personal computer, the algorithm can still process large databases, such as the human whole-genome EST database, that existing algorithms could not handle. Conclusions: The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as next-generation sequencing and other analyses of large databases. The implementation of the proposed algorithm is available at http://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm.
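The core idea of chunked, memory-bounded signature discovery can be sketched in a few lines. This is not the paper's DDCSD algorithm, just a minimal single-machine illustration: a signature is approximated here as a k-mer that occurs in exactly one sequence of the database, and k-mers are counted chunk by chunk so the whole database never has to reside in memory at once.

```python
from collections import Counter
from itertools import islice

def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def find_signatures(db, k, chunk_size=100):
    """Count k-mers chunk by chunk (divide-and-conquer over the input),
    then report the k-mers unique to a single sequence as its candidate
    signatures."""
    counts = Counter()
    items = iter(db.items())
    while True:
        chunk = list(islice(items, chunk_size))
        if not chunk:
            break
        for _, seq in chunk:
            for km in set(kmers(seq, k)):  # count each k-mer once per sequence
                counts[km] += 1
    unique = {km for km, c in counts.items() if c == 1}
    return {name: sorted(unique.intersection(kmers(seq, k)))
            for name, seq in db.items()}
```

In a cluster setting, each worker would count its own chunk and the per-chunk counters would be merged, which is what makes the strategy parallelizable.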
CLAP: A web-server for automatic classification of proteins with special reference to multi-domain proteins
Background: The function of a protein can be deciphered with higher accuracy from its structure than from its amino acid sequence. Due to the huge gap between the available protein sequence and structural space, tools that can generate functionally homogeneous clusters using only sequence information hold great importance. For this, traditional alignment-based tools work well in most cases, and clustering is performed on the basis of sequence similarity. However, in the case of multi-domain proteins, alignment quality may be poor due to the varied lengths of the proteins, domain shuffling or circular permutations. Multi-domain proteins are ubiquitous in nature, hence alignment-free tools, which overcome the shortcomings of alignment-based protein comparison methods, are required. Further, existing tools classify proteins using only domain-level information and hence miss the information encoded in the tethered regions or accessory domains. Our method, on the other hand, takes into account the full-length sequence of a protein, consolidating the complete sequence information to understand a given protein better. Results: Our web server, CLAP (Classification of Proteins), is one such alignment-free software for automatic classification of protein sequences. It utilizes a pattern-matching algorithm that assigns local matching scores (LMS) to residues that are part of the matched patterns between two sequences being compared. CLAP works on full-length sequences and does not require prior domain definitions. Pilot studies undertaken previously on protein kinases and immunoglobulins have shown that CLAP yields clusters with high functional and domain-architectural similarity. Moreover, parsing at a statistically determined cut-off resulted in clusters that agree with the sub-family-level classification of that particular domain family.
Conclusions: CLAP is a useful protein-clustering tool, independent of domain assignment, domain order, sequence length and domain diversity. Our method can be used for any set of protein sequences, yielding functionally relevant clusters with high domain architectural homogeneity. The CLAP web server is freely available for academic use at http://nslab.mbu.iisc.ernet.in/clap/.
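Why does an alignment-free comparison tolerate domain shuffling and circular permutation? A toy version makes it clear. The sketch below is not CLAP's actual LMS scheme, only a stand-in: it scores two sequences by the fraction of residues covered by k-mers the sequences share, which is insensitive to the order in which those shared regions occur.

```python
def shared_kmer_coverage(s1, s2, k=3):
    """Toy alignment-free similarity: the mean fraction of residues in
    each sequence that fall inside a k-mer shared by both. A stand-in
    for CLAP's local matching scores, whose pattern-matching scheme is
    more elaborate."""
    grams = lambda s: {s[i:i + k] for i in range(len(s) - k + 1)}
    shared = grams(s1) & grams(s2)

    def coverage(s):
        hit = [False] * len(s)
        for i in range(len(s) - k + 1):
            if s[i:i + k] in shared:
                hit[i:i + k] = [True] * k
        return sum(hit) / len(s)

    return (coverage(s1) + coverage(s2)) / 2
```

Note that two sequences whose "domains" are swapped (e.g. "ABCXYZ" vs "XYZABC") still score 1.0 under this measure, whereas a global alignment would penalize the rearrangement heavily.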
Background: Guide-trees are used as part of an essential heuristic to enable the calculation of multiple sequence alignments. They have been the focus of much method development, but there has been little systematic effort to determine which guide-trees, if any, give the best alignments. Some guide-tree construction schemes are based on pair-wise distances amongst unaligned sequences; others try to emulate an underlying evolutionary tree and involve various iteration methods. Results: We explore all possible guide-trees for a set of protein alignments of up to eight sequences. We find that pairwise-distance-based default guide-trees sometimes outperform evolutionary guide-trees, as measured against structure-derived reference alignments. However, default guide-trees fall far short of the optimum attainable scores. On average, chained guide-trees perform better than balanced ones, but are no better than default guide-trees for small alignments. Conclusions: Alignment methods that use consistency or hidden Markov models to make alignments are less susceptible to sub-optimal guide-trees than simpler methods that essentially perform conventional sequence alignment between profiles. The latter appear to be affected positively by evolutionary guide-trees for difficult alignments and negatively for easy alignments. One phylogeny-aware alignment program can strongly discriminate between good and bad guide-trees. The results for randomly chained guide-trees improve with the number of sequences.
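The two tree shapes the abstract compares are easy to picture with a toy nested-tuple representation (an assumption of this sketch, not the format any aligner uses): a chained (caterpillar) tree adds one sequence at a time, while a balanced tree joins halves of the sequence set.

```python
from functools import reduce

def chained_guide_tree(labels):
    """Fully chained (caterpillar) guide tree: (((A,B),C),D)..."""
    return reduce(lambda tree, leaf: (tree, leaf), labels)

def balanced_guide_tree(labels):
    """Fully balanced guide tree: ((A,B),(C,D)) for four leaves."""
    if len(labels) == 1:
        return labels[0]
    mid = len(labels) // 2
    return (balanced_guide_tree(labels[:mid]),
            balanced_guide_tree(labels[mid:]))
```

A progressive aligner walks such a tree bottom-up, aligning the two children at each internal node; the chained shape therefore aligns profiles of very unequal size, which is part of why the two shapes behave differently.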
Systematic identification of transcriptional and post-transcriptional regulations in human respiratory epithelial cells during influenza A virus infection
Background: Respiratory epithelial cells are the primary target of influenza virus infection in humans. However, the molecular mechanisms of airway epithelial cell responses to viral infection are not fully understood. Revealing genome-wide transcriptional and post-transcriptional regulatory relationships can further advance our understanding of this problem, which motivates the development of novel and more efficient computational methods to simultaneously infer the transcriptional and post-transcriptional regulatory networks. Results: Here we propose a novel framework named SITPR to investigate the interactions among transcription factors (TFs), microRNAs (miRNAs) and target genes. Briefly, a background regulatory network on a genome-wide scale (~23,000 nodes and ~370,000 potential interactions) is constructed from curated knowledge and algorithm predictions, to which the identification of transcriptional and post-transcriptional regulatory relationships is anchored. To reduce the dimension of the associated computing problem down to an affordable size, several topological and data-based approaches are used. Furthermore, we propose the constrained LASSO formulation and combine it with the dynamic Bayesian network (DBN) model to identify the activated regulatory relationships from time-course expression data. Our simulation studies on networks of different sizes suggest that the proposed framework can effectively determine the genuine regulations among TFs, miRNAs and target genes; we also compare SITPR with several selected state-of-the-art algorithms to further evaluate its performance. By applying the SITPR framework to mRNA and miRNA expression data generated from human lung epithelial A549 cells in response to A/Mexico/InDRE4487/2009 (H1N1) virus infection, we are able to detect the activated transcriptional and post-transcriptional regulatory relationships as well as the significant regulatory motifs.
Conclusion: Compared with other representative state-of-the-art algorithms, the proposed SITPR framework can more effectively identify activated transcriptional and post-transcriptional regulations simultaneously from a given background network. The idea of SITPR is generally applicable to the analysis of gene regulatory networks in human cells. The results obtained for human respiratory epithelial cells suggest the importance of transcriptional and post-transcriptional regulation, as well as their synergies, in the innate immune responses against IAV infection.
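The LASSO step at the heart of such network inference can be sketched with plain coordinate descent. This is an illustration only, not SITPR's actual constrained-LASSO/DBN formulation: the `nonneg` argument is a hypothetical, crude analogue of constraining regulations whose sign (e.g. activation) is known from the background network.

```python
def lasso_cd(X, y, lam, nonneg=(), n_iter=500):
    """Coordinate-descent LASSO with optional sign constraints:
    coefficients whose indices appear in `nonneg` are clamped to be
    non-negative. X is a list of rows; y a list of responses."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with feature j's contribution removed
            r = [y[i] - sum(beta[m] * X[i][m] for m in range(p) if m != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n)) or 1.0
            # soft-thresholding update
            if rho > lam:
                beta[j] = (rho - lam) / z
            elif rho < -lam:
                beta[j] = (rho + lam) / z
            else:
                beta[j] = 0.0
            if j in nonneg:
                beta[j] = max(beta[j], 0.0)
    return beta
```

The L1 penalty drives irrelevant regulator coefficients exactly to zero, which is what lets such a formulation select a sparse set of "activated" regulations from a large candidate network.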
Background: Proteins dynamically interact with each other to perform their biological functions. The dynamic operations of protein-protein interaction (PPI) networks are also reflected in the dynamic formation of protein complexes. Existing protein complex detection algorithms usually overlook the inherent temporal nature of protein interactions within PPI networks. Systematically analyzing temporal protein complexes can not only improve the accuracy of protein complex detection, but also strengthen our biological knowledge of the dynamic protein assembly processes underlying cellular organization. Results: In this study, we propose a novel computational method to predict temporal protein complexes. In particular, we first construct a series of dynamic PPI networks by joint analysis of time-course gene expression data and protein interaction data. Then a Time Smooth Overlapping Complex Detection model (TS-OCD) is proposed to detect temporal protein complexes from these dynamic PPI networks. TS-OCD can naturally capture the smoothness of networks between consecutive time points and detect overlapping protein complexes at each time point. Finally, a nonnegative matrix factorization based algorithm is introduced to merge very similar temporal complexes across different time points. Conclusions: Extensive experimental results demonstrate that the proposed method is more effective at detecting temporal protein complexes than state-of-the-art complex detection techniques.
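The final merging step addresses a simple problem: the same complex detected at several time points should be reported once. TS-OCD does this with nonnegative matrix factorization; the sketch below substitutes a much simpler greedy Jaccard-similarity merge (named plainly as a stand-in, not the paper's method) to show the shape of the operation.

```python
def merge_temporal_complexes(complexes, threshold=0.5):
    """Greedily merge protein complexes detected at different time
    points whenever their Jaccard similarity reaches `threshold`.
    A simple stand-in for TS-OCD's NMF-based merging step."""
    merged = []
    for members in complexes:
        members = set(members)
        for group in merged:
            # Jaccard similarity between the candidate and this group
            if len(members & group) / len(members | group) >= threshold:
                group |= members
                break
        else:
            merged.append(members)
    return merged
```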
Visualization and Correction of Automated Segmentation, Tracking and Lineaging from 5-D Stem Cell Image Sequences
Background: Neural stem cells are motile and proliferative cells that undergo mitosis, dividing to produce daughter cells and ultimately generating differentiated neurons and glia. Understanding the mechanisms controlling neural stem cell proliferation and differentiation will play a key role in the emerging fields of regenerative medicine and cancer therapeutics. Stem cell studies in vitro from 2-D image data are well established. Visualizing and analyzing large three-dimensional images of intact tissue is a challenging task. It becomes more difficult as the dimensionality of the image data increases to include time and additional fluorescence channels. There is a pressing need for 5-D image analysis and visualization tools to study cellular dynamics in the intact niche and to quantify the role that environmental factors play in determining cell fate. Results: We present an application that integrates visualization and quantitative analysis of 5-D (x, y, z, t, channel) and large montage confocal fluorescence microscopy images. The image sequences show stem cells together with blood vessels, enabling quantification of the dynamic behaviors of stem cells in relation to their vascular niche, with applications in developmental and cancer biology. Our application automatically segments, tracks, and lineages the image sequence data and then allows the user to view and edit the results of the automated algorithms in a stereoscopic 3-D window while simultaneously viewing the stem cell lineage tree in a 2-D window. Using the GPU to store and render the image sequence data enables a hybrid computational approach. An inference-based approach utilizing user-provided edits to automatically correct related mistakes executes interactively on the system CPU while the GPU handles 3-D visualization tasks. Conclusions: By exploiting commodity computer gaming hardware, we have developed an application that can be run in the laboratory to facilitate rapid iteration through biological experiments.
We combine unsupervised image analysis algorithms with an interactive visualization of the results. Our validation interface allows each data set to be corrected to 100% accuracy, ensuring that downstream data analysis is accurate and verifiable. Our tool is the first to combine all of these aspects, leveraging the synergies obtained by using validation information from stereo visualization to improve the low-level image processing tasks.
tigaR: integrative significance analysis of temporal differential gene expression induced by genomic abnormalities
Background: To determine which changes in the host cell genome are crucial for cervical carcinogenesis, a longitudinal in vitro model system of HPV-transformed keratinocytes was profiled in a genome-wide manner. Four cell lines affected with either HPV16 or HPV18 were assayed at 8 sequential time points for gene expression (mRNA) and gene copy number (DNA) using high-resolution microarrays. Available methods for temporal differential expression analysis are not designed for integrative genomic studies. Results: Here, we present a method that allows for the identification of differential gene expression associated with DNA copy number changes over time. The temporal variation in gene expression is described by a generalized linear mixed model employing low-rank thin-plate splines. Model parameters are estimated with an empirical Bayes procedure, which exploits integrated nested Laplace approximation for fast computation. Iteratively, posteriors of hyperparameters and model parameters are estimated. The empirical Bayes procedure shrinks multiple dispersion-related parameters. Shrinkage leads to more stable estimates of the model parameters, better control of false positives and improvement of reproducibility. In addition, to make estimates of the DNA copy number more stable, model parameters are also estimated in a multivariate way using triplets of features, imposing a spatial prior for the copy number effect. Conclusion: With the proposed method for analysis of time-course multilevel molecular data, more profound insight may be gained through the identification of temporal differential expression induced by DNA copy number abnormalities. In particular, in the analysis of an integrative oncogenomics study with a time-course set-up our method finds genes previously reported to be involved in cervical carcinogenesis. Furthermore, the proposed method yields improvements in sensitivity, specificity and reproducibility compared to existing methods. 
Finally, the proposed method is able to handle count (RNA-seq) data from time-course experiments, as demonstrated on a real data set.
Background: In the past, a number of methods have been developed for predicting post-translational modifications in proteins. In contrast, limited attempts have been made to understand post-transcriptional modifications. Recently, it has been shown that tRNA modifications play a direct role in genome structure and codon usage. This study is an attempt to understand kingdom-wise tRNA modifications, particularly uridine modifications (UMs), as the majority of modifications are uridine-derived. Results: A three-step strategy was applied to develop an efficient method for the prediction of UMs. In the first step, we developed a common prediction model for all kingdoms using a dataset from MODOMICS-2008. Support Vector Machine (SVM) based prediction models were developed and evaluated by five-fold cross-validation. Among the approaches tried, a hybrid approach combining binary and structural information achieved the highest area under the curve (AUC) of 0.936. In the second step, we used the tRNA sequences newly added in MODOMICS-2012 as an independent dataset to evaluate the kingdom-wise prediction performance of the common model developed in the first step, achieving AUC values between 0.910 and 0.949. In the third and last step, we used different datasets from MODOMICS-2012 to develop kingdom-wise individual prediction models, achieving AUC values between 0.915 and 0.987. Conclusions: The hybrid approach is efficient not only for predicting kingdom-wise modifications but also for classifying them into the two most prominent UMs: pseudouridine (Y) and dihydrouridine (D). A web server called tRNAmod (http://crdd.osdd.net/raghava/trnamod/) has been developed, which predicts UMs from both tRNA sequences and whole genomes.
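The "binary" part of the hybrid feature approach refers to one-hot encoding of the sequence window around a candidate site, a standard way to turn residues into SVM input. The function below is illustrative only; tRNAmod's actual hybrid features also include structural information, and the exact window and alphabet used are assumptions of this sketch.

```python
def binary_encode(window, alphabet="ACGU"):
    """Binary (one-hot) encoding of a sequence window centred on a
    candidate uridine: each position contributes one indicator bit
    per letter of the alphabet."""
    return [1 if ch == a else 0 for ch in window for a in alphabet]
```

A window of length w thus becomes a fixed-length vector of w × |alphabet| bits, which is the form a kernel-based classifier like an SVM expects.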
MAE-FMD: Multi-agent evolutionary method for functional module detection in protein-protein interaction networks
Background: Studies of functional modules in a protein-protein interaction (PPI) network contribute greatly to the understanding of biological mechanisms. With the development of computing science, computational approaches have played an important role in detecting functional modules. Results: We present a new approach that uses multi-agent evolution to detect functional modules in PPI networks. The proposed approach consists of two stages: solution construction for the agents in a population, and the evolutionary process of computational agents in a lattice environment, where each agent corresponds to a candidate solution to the functional module detection problem in a PPI network. First, the approach uses a connection-based encoding scheme to model an agent and employs a random-walk behavior that merges topological characteristics with functional information to construct a solution. Next, it applies several evolutionary operators, i.e., competition, crossover, and mutation, to realize information exchange among agents as well as solution evolution. Systematic experiments were conducted on three benchmark testing sets of yeast networks. Experimental results show that the approach is more effective than several other existing algorithms. Conclusions: The algorithm achieves outstanding recall, F-measure, sensitivity and accuracy while remaining competitive on other measures, so it can be applied to biological studies that require high accuracy.
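The random-walk behavior used to build a candidate solution can be illustrated with a bare-bones walk over an adjacency map. This sketch omits the functional-information weighting that MAE-FMD merges into the walk; it only shows the topological skeleton, and the graph representation is an assumption of the example.

```python
import random

def random_walk(adj, start, steps, seed=0):
    """Seeded random walk over a PPI adjacency map (node -> set of
    neighbours); the visited path is the raw material from which an
    agent's candidate module would be built."""
    rng = random.Random(seed)
    node, path = start, [start]
    for _ in range(steps):
        neighbours = sorted(adj.get(node, ()))
        if not neighbours:
            break
        node = rng.choice(neighbours)
        path.append(node)
    return path
```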
Background: Structure-based drug design is an iterative process, following cycles of structural biology, computer-aided design, synthetic chemistry and bioassay. In favorable circumstances, this process can lead to hundreds of protein-ligand crystal structures. In addition, molecular dynamics simulations are increasingly being used to further explore the conformational landscape of these complexes. Currently, methods capable of analyzing ensembles of crystal structures and MD trajectories are limited and usually rely upon least-squares superposition of coordinates. Results: Novel methodologies are described for the analysis of multiple structures of the same or related proteins. Statistical approaches that rely upon residue equivalence, but not superposition, are developed. Tasks that can be performed include the identification of hinge regions, allosteric conformational changes and transient binding sites. The approaches are tested on crystal structures of CDK2 and other CMGC protein kinases and a simulation of p38alpha. Known relationships between interactions and conformational changes are highlighted, and new ones are revealed. A transient but druggable allosteric pocket in CDK2 is predicted to occur under the CMGC insert. Furthermore, an evolutionarily conserved conformational link from the location of this pocket, via the alphaEF-alphaF loop, to phosphorylation sites on the activation loop is discovered. Conclusions: New methodologies are described and validated for the superposition-independent conformational analysis of large collections of structures or simulation snapshots of the same protein. The methodologies are encoded in a Python package called Polyphony, which is released as open source to accompany this paper [http://wrpitt.bitbucket.org/polyphony/].
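Why does superposition-independence matter? Any statistic computed from internal (residue-to-residue) distances is automatically invariant to how the structures are placed in space, so no least-squares fit is needed. The sketch below is in the spirit of, but far simpler than, Polyphony's statistics: it compares two conformations by the largest change in any residue-pair distance.

```python
import math

def internal_distances(coords):
    """Pairwise distance matrix over residue coordinates; it depends
    only on internal geometry, not on the frame of reference."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def conformational_difference(c1, c2):
    """Largest change in any residue-pair distance between two
    conformations of the same protein (residues assumed equivalent
    by index), a superposition-independent comparison."""
    d1, d2 = internal_distances(c1), internal_distances(c2)
    return max(abs(a - b)
               for row1, row2 in zip(d1, d2)
               for a, b in zip(row1, row2))
```

A rigid translation or rotation of either structure leaves the result unchanged, which is exactly the property that removes the need for coordinate superposition.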
A Generalizable NLP Framework for Fast Development of Pattern-based Biomedical Relation Extraction Systems
Background: Text mining is increasingly used in the biomedical domain because of its ability to automatically gather information from large numbers of scientific articles. One important task in biomedical text mining is relation extraction, which aims to identify designated relations among biological entities reported in literature. A relation extraction system achieving high performance is expensive to develop because of the substantial time and effort required for its design and implementation. Here, we report a novel framework to facilitate the development of a pattern-based biomedical relation extraction system. It has several unique design features: (1) leveraging syntactic variations possible in a language and automatically generating extraction patterns in a systematic manner, (2) applying sentence simplification to improve the coverage of extraction patterns, and (3) identifying referential relations between a syntactic argument of a predicate and the actual target expected in the relation extraction task. Results: A relation extraction system derived using the proposed framework achieved overall F-scores of 72.66% for the Simple events and 55.57% for the Binding events on the BioNLP-ST 2011 GE test set, comparing favorably with the top performing systems that participated in the BioNLP-ST 2011 GE task. We obtained similar results on the BioNLP-ST 2013 GE test set (80.07% and 60.58%, respectively). We conducted additional experiments on the training and development sets to provide a more detailed analysis of the system and its individual modules. This analysis indicates that without increasing the number of patterns, simplification and referential relation linking play a key role in the effective extraction of biomedical relations. Conclusions: In this paper, we present a novel framework for fast development of relation extraction systems. The framework requires only a list of triggers as input, and does not need information from an annotated corpus.
Thus, we reduce the involvement of domain experts, who would otherwise have to provide manual annotations and help design hand-crafted patterns. We demonstrate how our framework is used to develop a system that achieves state-of-the-art performance on a public benchmark corpus.
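What "requires only a list of triggers" means in practice can be shown with a minimal pattern matcher. This is not the paper's framework: real systems generate many syntactic variants per trigger and simplify sentences first, while the sketch below handles only the base "X <trigger> Y" form, and its trigger list is purely illustrative.

```python
import re

def extract_relations(sentence,
                      triggers=("activates", "binds", "phosphorylates")):
    """Minimal pattern-based relation extraction: for each trigger
    verb, capture the word before and after it as the relation's
    arguments."""
    relations = []
    for trigger in triggers:
        for m in re.finditer(rf"(\w+) {trigger} (\w+)", sentence):
            relations.append((m.group(1), trigger, m.group(2)))
    return relations
```

Sentence simplification earns its keep here: a pattern this rigid misses "IL2 strongly activates STAT5", but matches again once the adverb is stripped by a simplification step.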
Background: Short oligonucleotides can be used as markers to tag and track DNA sequences. For example, barcoding techniques (i.e. Multiplex Identifiers or Indexing) use short oligonucleotides to distinguish between reads from different DNA samples pooled for high-throughput sequencing. A similar technique called molecule tagging uses the same principles but is applied to individual DNA template molecules. Each template molecule is tagged with a unique oligonucleotide prior to polymerase chain reaction. The resulting amplicon sequences can be traced back to their original templates by their oligonucleotide tag. Consensus building from sequences sharing the same tag enables inference of the original template molecules, thereby reducing the effects of sequencing error and polymerase chain reaction bias. Several independent groups have developed similar protocols for molecule tagging; however, user-friendly software for building consensus sequences from molecule-tagged reads is not readily available or is highly specific to a particular protocol. Results: MT-Toolbox recognizes oligonucleotide tags in amplicons and infers the correct template sequence. On a set of molecule-tagged test reads, MT-Toolbox generates consensus sequences with, on average, 0.00047 errors per base. MT-Toolbox includes a graphical user interface, a command line interface, and options for speed and accuracy maximization. It can be run in serial on a standard personal computer or in parallel on a Load Sharing Facility based cluster system. An optional plugin provides features for common 16S metagenome profiling analysis such as chimera filtering, building operational taxonomic units, contaminant removal, and taxonomy assignments. Conclusions: MT-Toolbox provides an accessible, user-friendly environment for analysis of molecule-tagged reads, thereby reducing technical errors and polymerase chain reaction bias. These improvements reduce noise and allow for greater precision in single amplicon sequencing experiments.
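The consensus-building idea is simple to demonstrate: reads that share a molecule tag are assumed to descend from one template, so a per-position majority vote cancels independent sequencing errors. The sketch below assumes equal-length, pre-aligned reads; MT-Toolbox's actual consensus model is more sophisticated (e.g. quality-aware).

```python
from collections import Counter

def tag_consensus(reads_by_tag):
    """Majority-vote consensus per molecule tag: for each column of
    the (equal-length) reads sharing a tag, keep the most common base."""
    consensus = {}
    for tag, reads in reads_by_tag.items():
        consensus[tag] = "".join(
            Counter(column).most_common(1)[0][0]
            for column in zip(*reads))
    return consensus
```

With three reads per tag, a random error at one position in one read is outvoted two-to-one, which is the mechanism behind the low per-base error rate reported above.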
Impact of variance components on reliability of absolute quantification using digital PCR
Background: Digital polymerase chain reaction (dPCR) is an increasingly popular technology for detecting and quantifying target nucleic acids. Its advertised strength is high-precision absolute quantification without needing reference curves. The standard data analytic approach follows a seemingly straightforward theoretical framework but ignores sources of variation in the data generating process. These stem from both technical and biological factors, where we distinguish features that are 1) hard-wired in the equipment, 2) user-dependent and 3) provided by manufacturers but may be adapted by the user. The impact of the corresponding variance components on the accuracy and precision of target concentration estimators presented in the literature is studied through simulation. Results: We reveal how system-specific technical factors influence accuracy as well as precision of concentration estimates. We find that a well-chosen sample dilution level and modifiable settings such as the fluorescence cut-off for target copy detection have a substantial impact on reliability and can be adapted to the sample analysed in ways that matter. User-dependent technical variation, including pipette inaccuracy and specific sources of sample heterogeneity, leads to a steep increase in uncertainty of estimated concentrations. Users can discover this through replicate experiments and derived variance estimation. Finally, the detection performance can be improved by optimizing the fluorescence intensity cut point, as suboptimal thresholds reduce the accuracy of concentration estimates considerably. Conclusions: Like any other technology, dPCR is subject to variation induced by natural perturbations, systematic settings as well as user-dependent protocols. Corresponding uncertainty may be controlled with an adapted experimental design.
Our findings point to modifiable key sources of uncertainty that form an important starting point for the development of guidelines on dPCR design and data analysis with correct precision bounds. Besides clever choices of sample dilution levels, experiment-specific tuning of machine settings can greatly improve results. Well-chosen data-driven fluorescence intensity thresholds in particular result in major improvements in target presence detection. We call on manufacturers to provide sufficiently detailed output data that allows users to maximize the potential of the method in their setting and obtain high precision and accuracy for their experiments.
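The "seemingly straightforward" estimator the abstract critiques is the standard Poisson model: if targets distribute randomly over partitions, the mean number of copies per partition is lambda = -ln(negatives/total), and concentration follows by dividing by the partition volume. The sketch below implements just that textbook estimator; it deliberately ignores the technical variance components the study quantifies.

```python
import math

def dpcr_concentration(n_total, n_negative, partition_volume_nl):
    """Standard dPCR Poisson estimator: mean copies per partition
    lambda = -ln(negatives/total); concentration is lambda per unit
    partition volume (here copies per nanolitre)."""
    lam = -math.log(n_negative / n_total)
    return lam / partition_volume_nl
```

The formula also shows why dilution level matters: with very few negative partitions the log term explodes and the estimate becomes unstable, while with almost all partitions negative it loses precision, so there is a sweet spot in between.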
Background: In the past, a number of peptides have been reported to possess highly diverse properties, ranging from cell-penetrating, tumor-homing, anticancer, anti-hypertensive and antiviral to antimicrobial. Owing to their excellent specificity, low toxicity, rich chemical diversity and availability from natural sources, the FDA has approved a number of peptide-based drugs, and several more are at various stages of drug development. Though peptides are proven good drug candidates, their usage is still hindered mainly by their high susceptibility to protease degradation. We have developed an in silico method to predict the half-life of peptides in an intestine-like environment and to design better peptides with optimized physicochemical properties and half-life. Results: In this study, we used 10mer (HL10) and 16mer (HL16) peptide datasets to develop prediction models for peptide half-life in an intestine-like environment. First, SVM-based models were developed on the HL10 dataset, which achieved maximum correlations R/R2 of 0.57/0.32, 0.68/0.46, and 0.69/0.47 using amino acid, dipeptide and tripeptide composition, respectively. Secondly, models developed on the HL16 dataset showed maximum R/R2 of 0.91/0.82, 0.90/0.39, and 0.90/0.31 using amino acid, dipeptide and tripeptide composition, respectively. Furthermore, models developed on selected features achieved correlations (R) of 0.70 and 0.98 on the HL10 and HL16 datasets, respectively. Preliminary analysis suggests a role for charged residues and amino acid size in peptide half-life/stability. Based on the above models, we have developed a web server named HLP (Half Life Prediction) for predicting and designing peptides with a desired half-life. The web server provides three facilities: (i) half-life prediction, (ii) physicochemical property calculation and (iii) design of mutant peptides.
Conclusion: In summary, this study describes a web server, HLP, developed to assist the scientific community in predicting the intestinal half-life of peptides and in designing mutant peptides with better half-life and physicochemical properties. HLP models were trained using a dataset of peptides whose half-lives were determined experimentally in a crude intestinal protease preparation. Thus, the HLP server will help in designing peptides with the potential to be administered via the oral route (http://www.imtech.res.in/raghava/hlp/).
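The simplest of the feature sets mentioned above, amino acid composition, maps a peptide of any length to a fixed 20-dimensional vector of residue fractions, which is what makes it usable as SVM regression input. The function below is a generic illustration of that encoding, not HLP's exact feature pipeline.

```python
def aa_composition(peptide, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Amino acid composition feature vector: the fraction of each of
    the 20 standard residues in the peptide."""
    return [peptide.count(a) / len(peptide) for a in alphabet]
```

Dipeptide and tripeptide composition extend the same idea to 400- and 8000-dimensional vectors of residue-pair and residue-triple frequencies.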
Background: The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed, but many of them use the number of read pairs supporting a link between two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features. We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a tool's general performance. Results: We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single stand-alone scaffolder outperforms the others on all datasets. However, BESST compares favorably with the other tested scaffolders on the GAGE datasets and, moreover, outperforms the other methods when the library insert size distribution is wide. Conclusion: We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.
Background: Chromatin immunoprecipitation (ChIP) followed by next-generation sequencing (ChIP-Seq) has been widely used to identify genomic loci of transcription factor (TF) binding and histone modifications. ChIP-Seq data analysis involves multiple steps, from read mapping and peak calling to data integration and interpretation. It remains challenging and time-consuming to process large amounts of ChIP-Seq data derived from different antibodies or experimental designs using the same approach. To address this challenge, there is a need for a comprehensive analysis pipeline with flexible settings to accelerate the utilization of this powerful technology in epigenetics research. Results: We have developed a highly integrative pipeline, termed HiChIP, for systematic analysis of ChIP-Seq data. HiChIP incorporates several open source software packages selected based on internal assessments and published comparisons. It also includes a set of tools developed in-house. This workflow enables the analysis of both paired-end and single-end ChIP-Seq reads, with or without replicates, for the characterization and annotation of both punctate and diffuse binding sites. The main functionality of HiChIP includes: (a) read quality checking; (b) read mapping and filtering; (c) peak calling and peak consistency analysis; and (d) result visualization. In addition, this pipeline contains modules for generating binding profiles over selected genomic features, de novo motif finding from transcription factor (TF) binding sites and functional annotation of peak-associated genes. Conclusions: HiChIP is a comprehensive analysis pipeline that can be configured to analyze ChIP-Seq data derived from different antibodies and experimental designs. Using public ChIP-Seq data, we demonstrate that HiChIP is a fast and reliable pipeline for processing large amounts of ChIP-Seq data.
SPARQLGraph: a web-based platform for graphically querying biological Semantic Web databases
Background: The Semantic Web has established itself as a framework for using and sharing data across applications and database boundaries. Here, we present a web-based platform for querying biological Semantic Web databases in a graphical way. Results: SPARQLGraph offers an intuitive drag & drop query builder, which converts the visual graph into a query and executes it on a public endpoint. The tool integrates several publicly available Semantic Web databases, including the databases of the recently released EBI RDF platform. Furthermore, it provides several predefined template queries for answering biological questions. Users can easily create and save new query graphs, which can also be shared with other researchers. Conclusions: This new graphical way of creating queries for biological Semantic Web databases considerably facilitates usability as it removes the requirement of knowing specific query languages and database structures. The system is freely available at http://sparqlgraph.i-med.ac.at.
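The core idea of the drag & drop builder, turning a visual graph into a SPARQL query, can be sketched as a direct mapping from graph edges to triple patterns. The function below is a hypothetical toy version of that conversion step, not SPARQLGraph's implementation; the prefix URIs in the usage example are real UniProt/RDF namespaces, but the edge list is made up.

```python
def graph_to_sparql(edges, prefixes=None):
    """Convert a visual query graph, given as (subject, predicate, object)
    triples where names starting with '?' are variables, into a SPARQL
    SELECT query string."""
    prefixes = prefixes or {}
    lines = ["PREFIX %s: <%s>" % (p, uri) for p, uri in sorted(prefixes.items())]
    # every variable mentioned anywhere in the graph is projected
    variables = sorted({t for triple in edges for t in triple if t.startswith("?")})
    lines.append("SELECT %s WHERE {" % " ".join(variables))
    for s, p, o in edges:
        lines.append("  %s %s %s ." % (s, p, o))
    lines.append("}")
    return "\n".join(lines)

query = graph_to_sparql(
    [("?protein", "rdf:type", "up:Protein")],
    {"up": "http://purl.uniprot.org/core/",
     "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"})
```

The resulting string could then be POSTed to a public SPARQL endpoint, which is what the actual tool does after the graphical step.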
Background: UniFrac is a well-known tool for comparing microbial communities and assessing statistically significant differences between communities. In this paper we identify a discrepancy in the UniFrac methodology that causes semantically equivalent inputs to produce different outputs in tests of statistical significance. Results: The phylogenetic trees that are input into UniFrac may or may not contain abundance counts. An isomorphic transform can be defined that will convert trees between these two formats without altering the semantic meaning of the trees. UniFrac produces different outputs for these equivalent forms of the same input tree. This is illustrated using metagenomics data from a lake sediment study. Conclusions: Results from the UniFrac tool can vary greatly for the same input depending on the arbitrary choice of input format. Practitioners should be aware of this issue and use the tool with caution to ensure consistency and validity in their analyses. We provide a script to transform inputs between equivalent formats to help researchers achieve this consistency.
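The isomorphic transform the abstract refers to, converting a tree whose leaves carry abundance counts into the equivalent tree where each observation is its own leaf, can be sketched on a minimal nested-tuple tree representation. This representation (leaves as (name, count) tuples, internal nodes as lists) is an assumption for illustration, not the format UniFrac or the authors' script actually uses.

```python
def expand_counts(node):
    """Transform a tree whose leaves carry abundance counts into the
    semantically equivalent tree in which each observation becomes its own
    count-1 leaf (attached by implicit zero-length branches).  Leaves are
    (name, count) tuples; internal nodes are lists of children."""
    if isinstance(node, tuple):
        name, count = node
        if count == 1:
            return node
        # replace a multi-count leaf by `count` distinct count-1 leaves
        return [("%s_%d" % (name, i), 1) for i in range(count)]
    return [expand_counts(child) for child in node]
```

Because the two forms encode exactly the same community, a distance measure should give identical results on both; the paper's point is that UniFrac's significance tests do not.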
CRF-based models of protein surfaces improve protein-protein interaction site predictions
Background: The identification of protein-protein interaction sites is a computationally challenging task and important for understanding the biology of protein complexes. There is a rich literature in this field. A broad class of approaches assigns to each candidate residue a real-valued score that measures how likely it is that the residue belongs to the interface. The prediction is obtained by thresholding this score. Some probabilistic models classify the residues on the basis of the posterior probabilities. In this paper, we introduce pairwise conditional random fields (pCRFs) in which edges are not restricted to the backbone, as in the case of the linear-chain CRFs utilized by Li et al. (2007). In fact, any 3D neighborhood relation can be modeled. On the grounds of a generalized Viterbi inference algorithm and a piecewise training process for pCRFs, we demonstrate how to utilize pCRFs to enhance a given residue-wise score-based protein-protein interface predictor on the surface of the protein under study. The features of the pCRF are based solely on the interface prediction scores of the predictor whose performance is to be improved. Results: We performed three sets of experiments with synthetic scores assigned to the surface residues of proteins taken from the data set PlaneDimers compiled by Zellner et al. (2011), from the list published by Keskin et al. (2004) and from the very recent data set due to Cukuroglu et al. (2014). In that way we demonstrated that our pCRF-based enhancer is effective, given that the interface residue score distribution and the non-interface residue score distribution are unimodal. Moreover, the pCRF-based enhancer is also successfully applicable if the distributions are unimodal only over a certain sub-domain. The improvement is then restricted to that domain. Thus we were able to improve the prediction of the PresCont server devised by Zellner et al. (2011) on PlaneDimers.
Conclusions: Our results strongly suggest that pCRFs form a methodological framework to improve residue-wise score-based protein-protein interface predictors, given the scores are appropriately distributed. A prototypical implementation of our method is accessible at http://ppicrf.informatik.uni-goettingen.de/index.html.
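The enhancement idea, combining per-residue predictor scores (unary terms) with agreement terms along arbitrary 3D neighborhood edges, can be illustrated on a tiny pairwise model. The sketch below uses brute-force enumeration for MAP inference on three residues; the paper uses a generalized Viterbi algorithm and piecewise training, neither of which is reproduced here, and the potentials are invented for illustration.

```python
from itertools import product

def map_labeling(scores, edges, smoothness=0.5):
    """Brute-force MAP inference for a tiny pairwise model over surface
    residues.  `scores[i]` is the base predictor's interface score for
    residue i (used as the unary potential); `edges` lists pairs of
    spatially neighboring residues whose labels are rewarded for agreeing.
    Returns the 0/1 labeling maximizing unary + pairwise terms."""
    n = len(scores)
    best, best_val = None, float("-inf")
    for labels in product((0, 1), repeat=n):
        # unary: score favors label 1, (1 - score) favors label 0
        val = sum(s if l else 1.0 - s for s, l in zip(scores, labels))
        # pairwise: reward spatial neighbors carrying the same label
        val += smoothness * sum(labels[i] == labels[j] for i, j in edges)
        if val > best_val:
            best, best_val = labels, val
    return list(best)
```

With neighborhood edges present, a residue with an ambiguous score (0.45) surrounded by strongly predicted interface neighbors is pulled into the interface; without the edges, plain thresholding of the unary scores leaves it out. That smoothing-by-spatial-context is the intuition behind using pCRFs as an enhancer.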
Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks
Background: Time series models can play an important role in disease prediction. Incidence data can be used to predict the future occurrence of disease events. Developments in modeling approaches provide an opportunity to compare different time series models for predictive power. Results: We applied ARIMA and Random Forest time series models to incidence data of outbreaks of highly pathogenic avian influenza (H5N1) in Egypt, available through the online EMPRES-I system. We found that the Random Forest model outperformed the ARIMA model in predictive ability. Furthermore, we found that the Random Forest model is effective for predicting outbreaks of H5N1 in Egypt. Conclusions: Random Forest time series modeling provides enhanced predictive ability over existing time series models for the prediction of infectious disease outbreaks. This result, along with those showing the concordance between bird and human outbreaks (Rabinowitz et al. 2012), provides a new approach to predicting these dangerous outbreaks in bird populations based on existing, freely available data. Our analysis uncovers the time-series structure of outbreak severity for highly pathogenic avian influenza (H5N1) in Egypt.
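Using a regression model such as Random Forest for time-series forecasting requires recasting the incidence series as a supervised-learning table of lagged features. The function below sketches that standard lag-embedding step; it is a generic illustration of the technique, not the paper's exact feature construction, and the lag count is an assumption.

```python
def lag_embed(series, n_lags):
    """Convert an incidence time series into supervised-learning form:
    each row holds the previous `n_lags` counts as features and the next
    count as the regression target.  The resulting (X, y) pair can be fed
    to any regressor, e.g. a Random Forest."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])   # window of past observations
        y.append(series[t])              # value to predict
    return X, y
```

Once embedded this way, model comparison reduces to ordinary out-of-sample regression error, which is how a Random Forest can be pitted against ARIMA on the same incidence data.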