BMC Bioinformatics

The latest research articles published by BMC Bioinformatics
  • Spot quantification in two dimensional gel electrophoresis image analysis: comparison of different approaches and presentation of a novel compound fitting algorithm
    [Jun 2014]

    Background: Various computer-based methods exist for the detection and quantification of protein spots in two dimensional gel electrophoresis images. Area-based methods are commonly used for spot quantification: an area is assigned to each spot and the sum of the pixel intensities in that area, the so-called volume, is used as a measure of spot signal. Other methods use the optical density, i.e. the intensity of the most intense pixel of a spot, or calculate the volume from the parameters of a fitted function. Results: In this study we compare the performance of different spot quantification methods using synthetic and real data. We propose a ready-to-use algorithm for spot detection and quantification that uses fitting of two dimensional Gaussian function curves for the extraction of data from two dimensional gel electrophoresis (2-DE) images. The algorithm implements fitting using logical compounds and is computationally efficient. The applicability of the compound fitting algorithm was evaluated for various simulated data and compared with other quantification approaches. We provide evidence that even if an incorrect bell-shaped function is used, the fitting method is superior to other approaches, especially when spots overlap. Finally, we validated the method with experimental data of urea-based 2-DE of Abeta peptides and re-analyzed published data sets. Our methods showed higher precision and accuracy than other approaches when applied to exposure time series and standard gels. Conclusion: Compound fitting as a quantification method for 2-DE spots shows several advantages over other approaches and could be combined with various spot detection methods. The algorithm was scripted in MATLAB (MathWorks) and is available as a supplemental file.
    Categories: Journal Articles
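
    A minimal sketch of the fit-based quantification described above, assuming a SciPy stack: fit a single two-dimensional Gaussian to a spot patch and report its analytic volume. This is not the authors' MATLAB compound-fitting algorithm, which fits logically coupled groups of overlapping spots; all names here are illustrative.

      # Fit one 2-D Gaussian to a spot region; fitted volume = 2*pi*amp*sx*sy.
      import numpy as np
      from scipy.optimize import curve_fit

      def gauss2d(xy, amp, x0, y0, sx, sy, offset):
          x, y = xy
          g = amp * np.exp(-((x - x0) ** 2 / (2 * sx ** 2)
                             + (y - y0) ** 2 / (2 * sy ** 2))) + offset
          return g.ravel()

      def fit_spot(patch):
          """Fit a Gaussian to a 2-D intensity patch; return the fitted volume."""
          ny, nx = patch.shape
          x, y = np.meshgrid(np.arange(nx), np.arange(ny))
          p0 = [patch.max() - patch.min(), nx / 2, ny / 2, nx / 4, ny / 4, patch.min()]
          popt, _ = curve_fit(gauss2d, (x, y), patch.ravel(), p0=p0)
          amp, _, _, sx, sy, _ = popt
          return 2 * np.pi * amp * abs(sx * sy)  # background (offset) excluded

      # Synthetic spot: the fit recovers the volume 2*pi*100*3*3 ~ 5655.
      x, y = np.meshgrid(np.arange(21), np.arange(21))
      spot = 100 * np.exp(-((x - 10) ** 2 + (y - 10) ** 2) / (2 * 3.0 ** 2)) + 5
      print(fit_spot(spot))
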
  • PBHoney: Identifying Genomic Variants via Long-Read Discordance and Interrupted Mapping
    [Jun 2014]

    Background: As resequencing projects become more prevalent across a larger number of species, accurate variant identification will further elucidate the nature of genetic diversity and become increasingly relevant in genomic studies. However, the identification of larger genomic variants via DNA sequencing is limited by both the incomplete information provided by sequencing reads and the nature of the genome itself. Long-read sequencing technologies provide high-resolution access to structural variants often inaccessible to shorter reads. Results: We present PBHoney, software that considers both intra-read discordance and soft-clipped tails of long reads (> 10,000 bp) to identify structural variants. As a proof of concept, we identify four structural variants and two genomic features in a strain of Escherichia coli with PBHoney and validate them via de novo assembly. PBHoney is available for download at http://sourceforge.net/projects/pb-jelly/. Conclusions: Implementing two variant-identification approaches that exploit the high mappability of long reads, PBHoney is demonstrated as being effective at detecting larger structural variants using whole-genome Pacific Biosciences RS II Continuous Long Reads. Furthermore, PBHoney is able to discover two genomic features: the existence of Rac-Phage in the isolate, and evidence of E. coli's circular genome.
    Categories: Journal Articles
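
    One of PBHoney's two signals, interrupted mapping, can be made concrete with a toy CIGAR parser: a long soft-clipped tail on a long read suggests the alignment stopped at a breakpoint. PBHoney itself works on BAM alignments; this self-contained sketch is illustrative only, and the 200 bp clip threshold is an assumed value.

      # Flag long soft-clipped tails in alignment CIGAR strings.
      import re

      CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

      def tail_clips(cigar):
          """Return (left_clip, right_clip) soft-clip lengths of a CIGAR string."""
          ops = CIGAR_OP.findall(cigar)
          left = int(ops[0][0]) if ops and ops[0][1] == "S" else 0
          right = int(ops[-1][0]) if ops and ops[-1][1] == "S" else 0
          return left, right

      def has_interrupted_mapping(cigar, min_clip=200):
          """A long clipped tail on a long read hints at a nearby breakpoint."""
          return any(clip >= min_clip for clip in tail_clips(cigar))

      print(tail_clips("500S2000M3D1500M"))             # (500, 0)
      print(has_interrupted_mapping("500S2000M3D1500M"))  # True
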
  • A novel method for gathering and prioritizing disease candidate genes based on construction of a set of disease-related MeSH® terms
    [Jun 2014]

    Background: Understanding the molecular mechanisms involved in disease is critical for the development of more effective and individualized strategies for prevention and treatment. The amount of disease-related literature, including new genetic information on the molecular mechanisms of disease, is rapidly increasing. Extracting beneficial information from literature can be facilitated by computational methods such as the knowledge-discovery approach. Several methods for mining gene-disease relationships using computational methods have been developed; however, there has been a lack of research evaluating specific disease candidate genes. Results: We present a novel method for gathering and prioritizing specific disease candidate genes. Our approach involved the construction of a set of Medical Subject Headings (MeSH) terms for the effective retrieval of publications related to a disease candidate gene. Information regarding the relationships between genes and publications was obtained from the gene2pubmed database. The set of genes was prioritized using a "weighted literature score" based on the number of publications and weighted by the number of genes occurring in a publication. Using our method for the disease states of pain and Alzheimer's disease, a total of 1101 pain candidate genes and 2810 Alzheimer's disease candidate genes were gathered and prioritized. The precision was 0.30 and the recall was 0.89 in the case study of pain. The precision was 0.04 and the recall was 0.6 in the case study of Alzheimer's disease. The precision-recall curve indicated that the performance of our method was superior to that of other publicly available tools. Conclusions: Our method, which involved the use of a set of MeSH terms related to disease candidate genes and a novel weighted literature score, improved the accuracy of gathering and prioritizing candidate genes by focusing on a specific disease.
    Categories: Journal Articles
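
    The weighted literature score invites a short sketch. The abstract says publications are weighted by the number of genes occurring in them; the concrete 1/n_genes contribution below is an assumed form of that idea, and the toy gene-to-publication mapping is illustrative input.

      # Score genes so that a paper about one gene counts more than a survey
      # listing hundreds; rank genes by the sum of their publication weights.
      from collections import defaultdict

      def weighted_literature_score(gene2pubmed):
          """gene2pubmed: dict gene -> set of PubMed IDs (as in the gene2pubmed file)."""
          pub_gene_count = defaultdict(int)
          for pmids in gene2pubmed.values():
              for pmid in pmids:
                  pub_gene_count[pmid] += 1
          scores = {g: sum(1.0 / pub_gene_count[p] for p in pmids)
                    for g, pmids in gene2pubmed.items()}
          return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

      demo = {"OPRM1": {1, 2, 3}, "TRPV1": {2, 4}, "GAPDH": {3}}
      print(weighted_literature_score(demo))
      # [('OPRM1', 2.0), ('TRPV1', 1.5), ('GAPDH', 0.5)]
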
  • Design of a flexible component gathering algorithm for converting cell-based models to graph representations for use in evolutionary search
    [Jun 2014]

    Background: The ability of science to produce experimental data has outpaced the ability to effectively visualize and integrate the data into a conceptual framework that can further higher-order understanding. Multidimensional and shape-based observational data of regenerative biology presents a particularly daunting challenge in this regard. Large amounts of data are available in regenerative biology, but little progress has been made in understanding how organisms such as planaria robustly achieve and maintain body form. An example of this kind of data can be found in a new repository (PlanformDB) that encodes descriptions of planaria experiments and morphological outcomes using a graph formalism. Results: We are developing a model discovery framework that uses a cell-based modeling platform combined with evolutionary search to automatically search for and identify plausible mechanisms for the biological behavior described in PlanformDB. To automate the evolutionary search we developed a way to compare the output of the modeling platform to the morphological descriptions stored in PlanformDB. We used a flexible connected component algorithm to create a graph representation of the virtual worm from the robust, cell-based simulation data. These graphs can then be validated and compared with target data from PlanformDB using the well-known graph-edit distance calculation, which provides a quantitative metric of similarity between graphs. The graph edit distance calculation was integrated into a fitness function that was able to guide automated searches for unbiased models of planarian regeneration. We present a cell-based model of the planarian that can regenerate anatomical regions following bisection of the organism, and show that the automated model discovery framework is capable of searching for and finding models of planarian regeneration that match experimental data stored in PlanformDB. Conclusion: The work presented here, including our algorithm for converting cell-based models into graphs for comparison with data stored in an external data repository, has made feasible the automated development, training, and validation of computational models using morphology-based data. This work is part of an ongoing project to automate the search process, which will greatly expand our ability to identify, consider, and test biological mechanisms in the field of regenerative biology.
    Categories: Journal Articles
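
    An illustrative fitness function built on graph edit distance, along the lines described above (not the authors' code): networkx provides an exact graph_edit_distance, and the 1/(1+d) transform into a fitness is an assumed choice.

      # Score a simulated morphology graph against a target graph; higher is
      # better, and identical graphs score 1.0.
      import networkx as nx

      def regeneration_fitness(model_graph, target_graph):
          d = nx.graph_edit_distance(model_graph, target_graph)
          return 1.0 / (1.0 + d)

      # Toy example: a simulated worm missing one region scores below a match.
      target = nx.path_graph(["head", "trunk", "tail"])
      partial = nx.path_graph(["head", "trunk"])
      print(regeneration_fitness(target, target))   # 1.0
      print(regeneration_fitness(partial, target))  # < 1.0
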
  • Integrating the interactome and the transcriptome of Drosophila
    [Jun 2014]

    Background: Networks of interacting genes and gene products mediate most cellular and developmental processes. High throughput screening methods combined with literature curation are identifying many of the protein-protein interactions (PPI) and protein-DNA interactions (PDI) that constitute these networks. Most of the detection methods, however, fail to identify the in vivo spatial or temporal context of the interactions. Thus, the interaction data are a composite of the individual networks that may operate in specific tissues or developmental stages. Genome-wide expression data may be useful for filtering interaction data to identify the subnetworks that operate in specific spatial or temporal contexts. Here we take advantage of the extensive interaction and expression data available for Drosophila to analyze how interaction networks may be unique to specific tissues and developmental stages. Results: We ranked genes on a scale from ubiquitously expressed to tissue or stage specific and examined their interaction patterns. Interestingly, ubiquitously expressed genes have many more interactions among themselves than do non-ubiquitously expressed genes in both PPI and PDI networks. While the PDI network is enriched for interactions between tissue-specific transcription factors and their tissue-specific targets, a preponderance of the PDI interactions are between ubiquitously and non-ubiquitously expressed genes and proteins. In contrast to PDI, PPI networks are depleted for interactions among tissue- or stage-specific proteins, which instead interact primarily with widely expressed proteins. In light of these findings, we present an approach to filter interaction data based on gene expression levels normalized across tissues or developmental stages. We show that this filter (the percent maximum or pmax filter) can be used to identify subnetworks that function within individual tissues or developmental stages. Conclusions: These observations suggest that protein networks are frequently organized into hubs of widely expressed proteins to which are attached various tissue- or stage-specific proteins. This is consistent with earlier analyses of human PPI data and suggests a similar organization of interaction networks across species. This organization implies that tissue or stage specific networks can be best identified from interactome data by using filters designed to include both ubiquitously expressed and specifically expressed genes and proteins.
    Categories: Journal Articles
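
    A sketch of the percent-maximum (pmax) idea: scale each gene's expression to its own maximum across tissues, then keep only interactions whose partners both clear a cutoff in the tissue of interest. Variable names and the 0.5 cutoff are illustrative, not taken from the paper.

      # pmax-style filtering of an interaction edge list by tissue expression.
      import numpy as np

      def pmax(expr):
          """expr: genes x tissues array -> same shape, each row scaled to its max."""
          return expr / expr.max(axis=1, keepdims=True)

      def filter_edges(edges, gene_index, expr, tissue, cutoff=0.5):
          p = pmax(expr)
          keep = lambda g: p[gene_index[g], tissue] >= cutoff
          return [(a, b) for a, b in edges if keep(a) and keep(b)]

      expr = np.array([[10.0, 100.0],   # geneA: peaks in tissue 1
                       [50.0, 50.0]])   # geneB: ubiquitously expressed
      idx = {"geneA": 0, "geneB": 1}
      print(filter_edges([("geneA", "geneB")], idx, expr, tissue=0))  # dropped
      print(filter_edges([("geneA", "geneB")], idx, expr, tissue=1))  # kept
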
  • QMachine: commodity supercomputing in web browsers
    [Jun 2014]

    Background: Ongoing advancements in cloud computing provide novel opportunities in scientific computing, especially for distributed workflows. Modern web browsers can now be used as high-performance workstations for querying, processing, and visualizing genomic "Big Data" from sources like The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) without local software installation or configuration. The design of QMachine (QM) was driven by the opportunity to use this pervasive computing model in the context of the Web of Linked Data in Biomedicine. Results: QM is an open-source, publicly available web service that acts as a messaging system for posting tasks and retrieving results over HTTP. The illustrative application described here distributes the analyses of 20 Streptococcus pneumoniae genomes for shared suffixes. Because all analytical and data retrieval tasks are executed by volunteer machines, few server resources are required. Any modern web browser can submit those tasks and/or volunteer to execute them without installing any extra plugins or programs. A client library provides high-level distribution templates including MapReduce. This stark departure from the current reliance on expensive server hardware running "download and install" software has already gathered substantial community interest, as QM received more than 2.2 million API calls from 87 countries in 12 months. Conclusions: QM was found adequate to deliver the sort of scalable bioinformatics solutions that computation- and data-intensive workflows require. Paradoxically, the sandboxed execution of code by web browsers was also found to enable them, as compute nodes, to address critical privacy concerns that characterize biomedical environments.
    Categories: Journal Articles
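
    The volunteer-compute pattern QM implements can be caricatured as a poll-run-post loop against a message queue. QM's actual clients are JavaScript running in the browser and its API is not reproduced here; the endpoint URL, JSON schema, and task body below are all hypothetical, written in Python only so the sketches in this listing share one language.

      # Hypothetical worker loop: fetch one task, execute it, post the result.
      import requests

      QUEUE = "https://example.org/api"  # hypothetical queue endpoint

      def work_once():
          task = requests.get(f"{QUEUE}/tasks/next").json()   # hypothetical schema
          result = sum(task["values"])                        # stand-in computation
          requests.post(f"{QUEUE}/results/{task['id']}", json={"result": result})
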
  • SMARTPOP: inferring the impact of social dynamics on genetic diversity through high speed simulations
    [Jun 2014]

    Background: Social behavior has long been known to influence patterns of genetic diversity, but the effect of social processes on population genetics remains poorly quantified - partly due to limited community-level genetic sampling (which is increasingly being remedied), and partly to a lack of fast simulation software to jointly model genetic evolution and complex social behavior, such as marriage rules. Results: To fill this gap, we have developed SMARTPOP - a fast, forward-in-time genetic simulator - to facilitate large-scale statistical inference on interactions between social factors, such as mating systems, and population genetic diversity. By simultaneously modeling genetic inheritance and dynamic social processes at the level of the individual, SMARTPOP can simulate a wide range of genetic systems (autosomal, X-linked, Y chromosomal and mitochondrial DNA) under a range of mating systems and demographic models. Specifically designed to enable resource-intensive statistical inference tasks, such as Approximate Bayesian Computation, SMARTPOP has been coded in C++ and is heavily optimized for speed and reduced memory usage. Conclusion: SMARTPOP rapidly simulates population genetic data under a wide range of demographic scenarios and social behaviors, thus allowing quantitative analyses to address complex socio-ecological questions.
    Categories: Journal Articles
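
    A stripped-down, forward-in-time simulation conveys what SMARTPOP does at scale (SMARTPOP adds mating rules, multiple genetic systems, and heavy C++ optimization). The Wright-Fisher-style resampling and infinite-alleles mutation below are standard simplifying assumptions, not the tool's actual model.

      # Forward simulation of mitochondrial lineages under drift and mutation.
      import random

      def simulate_mtdna(n=200, generations=100, mu=0.01, seed=1):
          random.seed(seed)
          lineages = list(range(n))          # founding mitochondrial types
          next_type = n
          for _ in range(generations):
              new = []
              for _ in range(n):
                  child = random.choice(lineages)   # inherit a mother's type
                  if random.random() < mu:          # infinite-alleles mutation
                      child, next_type = next_type, next_type + 1
                  new.append(child)
              lineages = new
          return len(set(lineages))                 # surviving mtDNA diversity

      print(simulate_mtdna())  # distinct lineages after drift plus mutation
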
  • Scan for Motifs: a webserver for the analysis of post-transcriptional regulatory elements in the 3′ untranslated regions (3′ UTRs) of mRNAs
    [Jun 2014]

    Background: Gene expression in vertebrate cells may be controlled post-transcriptionally through regulatory elements in mRNAs. These are usually located in the untranslated regions (UTRs) of mRNA sequences, particularly the 3′ UTRs. Results: Scan for Motifs (SFM) simplifies the process of identifying a wide range of regulatory elements on alignments of vertebrate 3′ UTRs. SFM includes identification of both RNA Binding Protein (RBP) sites and targets of miRNAs. In addition to searching pre-computed alignments, the tool provides users the flexibility to search their own sequences or alignments. The regulatory elements may be filtered by expected value cutoffs and are cross-referenced back to their respective sources and literature. The output is an interactive graphical representation, highlighting potential regulatory elements and overlaps between them. The output also provides simple statistics and links to related resources for complementary analyses. The overall process is intuitive and fast. As SFM is a free web-application, the user does not need to install any software or databases. Conclusions: Visualisation of the binding sites of different classes of effectors that bind to 3′ UTRs will facilitate the study of regulatory elements in 3′ UTRs.
    Categories: Journal Articles
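
    A minimal motif scan in the spirit of SFM: match known regulatory element patterns against a 3′ UTR and report their positions. Real SFM queries curated RBP-site and miRNA-target databases with expected-value filtering; the two patterns below are illustrative only.

      # Scan a 3'UTR for two classic regulatory element patterns.
      import re

      MOTIFS = {
          "ARE (AU-rich element)": re.compile(r"AUUUA"),
          "PAS (polyadenylation signal)": re.compile(r"AAUAAA"),
      }

      def scan_utr(utr):
          hits = []
          for name, pat in MOTIFS.items():
              hits += [(m.start(), m.end(), name) for m in pat.finditer(utr)]
          return sorted(hits)

      utr = "GCGAUUUAAAUAAAGC"
      for start, end, name in scan_utr(utr):
          print(f"{start:>3}-{end:<3} {name}")
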
  • MiningABs: mining associated biomarkers across multi-connected gene expression datasets
    [Jun 2014]

    Background: Human disease often arises as a consequence of alterations in a set of associated genes rather than alterations to a set of unassociated individual genes. Most previous microarray-based meta-analyses identified disease-associated genes or biomarkers independent of genetic interactions. Therefore, in this study, we present the first meta-analysis method capable of taking gene combination effects into account to efficiently identify associated biomarkers (ABs) across different microarray platforms. Results: We propose a new meta-analysis approach called MiningABs to mine ABs across different array-based datasets. The similarity between paired probe sequences is quantified as a bridge to connect these datasets together. The ABs can be subsequently identified from an "improved" common logit model (c-LM) by combining several sibling-like LMs in a heuristic genetic algorithm selection process. Our approach is evaluated with two sets of gene expression datasets: i) 4 esophageal squamous cell carcinoma and ii) 3 hepatocellular carcinoma datasets. Based on an unbiased reciprocal test, we demonstrate that each gene in a group of ABs is required to maintain high cancer sample classification accuracy, and we observe that ABs are not limited to genes common to all platforms. Investigating the ABs using Gene Ontology (GO) enrichment, literature survey, and network analyses indicated that our ABs are not only strongly related to cancer development but also highly connected in a diverse network of biological interactions. Conclusions: The proposed meta-analysis method called MiningABs is able to efficiently identify ABs from different independently performed array-based datasets, and we show its validity in cancer biology via GO enrichment, literature survey and network analyses. We postulate that the ABs may facilitate novel target and drug discovery, leading to improved clinical treatment. Java source code, tutorial, example and related materials are available at "http://sourceforge.net/projects/miningabs/".
    Categories: Journal Articles
  • hsphase: an R package for pedigree reconstruction, detection of recombination events, phasing and imputation of half-sib family groups
    [Jun 2014]

    Background: Identification of recombination events, and of which chromosomal segments contributed to an individual, is useful for a number of applications in genomic analyses, including haplotyping, imputation, signatures of selection, and improved estimates of relationship and probability of identity by descent. Genotypic data on half-sib family groups are widely available in livestock genomics. This structure makes it possible to identify recombination events accurately even with only a few individuals, and it lends itself well to a range of applications such as parentage assignment and pedigree verification. Results: Here we present hsphase, an R package that exploits the genetic structure found in half-sib livestock data to identify and count recombination events, and to impute and phase un-genotyped sires and phase their offspring. The package also allows reconstruction of family groups (pedigree inference), identification of pedigree errors and parentage assignment. Additional functions in the package allow identification of genomic mapping errors, imputation of paternal high-density genotypes from low-density genotypes, and evaluation of phasing results, either from hsphase or from other phasing programs. Various diagnostic plotting functions permit rapid visual inspection of results and evaluation of datasets. Conclusion: The hsphase package provides a suite of functions for analysis and visualization of genomic structures in half-sib family groups, implemented in the widely used R programming environment. Low-level functions were implemented in C++ and parallelized to improve performance. hsphase was primarily designed for use with high-density SNP array data, but it is fast enough to run directly on sequence data once they become more widely available. The package is available (GPL 3) from the Comprehensive R Archive Network (CRAN) or from http://www-personal.une.edu.au/~cgondro2/hsphase.htm.
    Categories: Journal Articles
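
    The recombination-counting core of a half-sib analysis can be sketched as follows, under the strong simplifying assumptions that the sire's two haplotypes and each offspring's paternal allele are already known at informative SNPs; hsphase infers these from the family data, so treat this only as an illustration of the final counting step.

      # Count sire-haplotype origin switches along one offspring's chromosome.
      def count_recombinations(paternal_alleles, sire_hap0, sire_hap1):
          origins = []
          for a, h0, h1 in zip(paternal_alleles, sire_hap0, sire_hap1):
              if h0 == h1:
                  continue                 # uninformative: sire homozygous here
              origins.append(0 if a == h0 else 1)
          # Each change of origin between adjacent informative SNPs is one
          # candidate recombination event.
          return sum(o1 != o2 for o1, o2 in zip(origins, origins[1:]))

      #                          SNP:  1  2  3  4  5  6
      print(count_recombinations([0, 0, 1, 1, 0, 1],
                                 [0, 0, 0, 0, 0, 0],
                                 [1, 1, 1, 1, 1, 1]))  # 3 origin switches
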
  • BLASTGrabber: a bioinformatic tool for visualization, analysis and sequence selection of massive BLAST data
    [May 2014]

    Background: Advances in sequencing efficiency have vastly increased the sizes of biological sequence databases, including many thousands of genome-sequenced species. The BLAST algorithm remains the main search engine for retrieving sequence information, and must consequently handle data on an unprecedented scale. This has been possible due to high-performance computers and parallel processing. However, the raw BLAST output from contemporary searches involving thousands of queries is ill-suited for direct human processing. Few programs attempt to directly visualize and interpret BLAST output; those that do often provide only a basic structuring of BLAST data. Results: Here we present a bioinformatics application named BLASTGrabber suitable for high-throughput sequencing analysis. BLASTGrabber, being implemented as a Java application, is OS-independent and includes a user-friendly graphical user interface. Text or XML-formatted BLAST output files can be directly imported, displayed and categorized based on BLAST statistics. Query names and FASTA headers can be analysed by text-mining. In addition to visualizing sequence alignments, BLAST data can be ordered as an interactive taxonomy tree. All modes of analysis support selection, export and storage of data. A Java interface-based plugin structure facilitates the addition of customized third-party functionality. Conclusion: The BLASTGrabber application introduces new ways of visualizing and analysing massive BLAST output data by integrating taxonomy identification, text-mining capabilities and generic multi-dimensional rendering of BLAST hits. The program aims at a non-expert audience in terms of computer skills; the combination of new functionalities makes the program flexible and useful for a broad range of operations.
    Categories: Journal Articles
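
    The kind of triage BLASTGrabber automates can be approximated on tabular BLAST output (-outfmt 6): parse the hits and bin them by e-value decade. BLASTGrabber itself is a Java GUI that also handles XML output, text mining and taxonomy trees; this sketch covers only the core categorization idea.

      # Histogram of BLAST hits per e-value decade from -outfmt 6 output.
      import csv
      import math
      from collections import Counter

      FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

      def evalue_histogram(path):
          bins = Counter()
          with open(path) as fh:
              for row in csv.DictReader(fh, fieldnames=FIELDS, delimiter="\t"):
                  e = float(row["evalue"])
                  exponent = math.floor(math.log10(e)) if e > 0 else -300
                  bins[exponent] += 1
          return dict(sorted(bins.items()))

      # Usage (hypothetical file name): evalue_histogram("results.blast6")
      # -> {-50: 12, -10: 834, -3: 2011} i.e. counts of hits per decade.
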
  • A practical approximation algorithm for solving massive instances of hybridization number for binary and nonbinary trees
    [May 2014]

    Background: Reticulate events play an important role in determining evolutionary relationships. The problem of computing the minimum number of such events to explain discordance between two phylogenetic trees is a hard computational problem. Even for binary trees, exact solvers struggle to solve instances with reticulation number larger than 40-50. Results: Here we present CYCLEKILLER and NONBINARYCYCLEKILLER, the first methods to produce solutions verifiably close to optimality for instances with hundreds or even thousands of reticulations. Conclusions: Using simulations, we demonstrate that these algorithms run quickly for large and difficult instances, producing solutions that are very close to optimality. As a spin-off from our simulations we also present TERMINUSEST, which is the fastest exact method currently available that can handle nonbinary trees: this is used to measure the accuracy of the NONBINARYCYCLEKILLER algorithm. All three methods are based on extensions of previous theoretical work (SIDMA 26(4):1635-1656, TCBB 10(1):18-25, SIDMA 28(1):49-66) and are publicly available. We also apply our methods to real data. Keywords: hybridization number, phylogenetic networks, approximation algorithms, directed feedback vertex set.
    Categories: Journal Articles
  • Automated ensemble assembly and validation of microbial genomes
    [May 2014]

    Background: The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible. Results: To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers. Conclusions: Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.
    Categories: Journal Articles
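
    One of the validation metrics such an ensemble pipeline can rank assemblies by is N50, the contig length at which half the total assembly size is reached. iMetAMOS combines several metrics plus gene and contaminant annotation; this sketch shows only the single-metric core, and the contig lengths are toy input.

      # Compute N50 per assembly and pick the best-scoring assembler.
      def n50(contig_lengths):
          total = sum(contig_lengths)
          running = 0
          for length in sorted(contig_lengths, reverse=True):
              running += length
              if running * 2 >= total:
                  return length

      assemblies = {
          "assembler_A": [900, 500, 400, 200],  # hypothetical contig lengths
          "assembler_B": [700, 700, 600],
      }
      best = max(assemblies, key=lambda a: n50(assemblies[a]))
      print({a: n50(ls) for a, ls in assemblies.items()}, "->", best)
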
  • Effective filtering strategies to improve data quality from population-based whole exome sequencing studies
    [May 2014]

    Background: Genotypes generated in next generation sequencing studies contain errors which can significantly impact the power to detect signals in common and rare variant association tests. These genotyping errors are not explicitly filtered by the standard GATK Variant Quality Score Recalibration (VQSR) tool and thus remain a source of errors in whole exome sequencing (WES) projects that follow GATK's recommended best practices. Therefore, additional data filtering methods are required to effectively remove these errors before performing association analyses with complex phenotypes. Here we empirically derive thresholds for genotype and variant filters that, when used in conjunction with the VQSR tool, achieve higher data quality than when using VQSR alone. Results: The detailed filtering strategies improve the concordance of sequenced genotypes with array genotypes from 99.33% to 99.77%; improve the percent of discordant genotypes removed from 10.5% to 69.5%; and improve the Ti/Tv ratio from 2.63 to 2.75. We also demonstrate that managing batch effects by separating samples based on different target capture and sequencing chemistry protocols results in a final data set containing 40.9% more high-quality variants. In addition, imputation is an important component of WES studies and is used to estimate common variant genotypes to generate additional markers for association analyses. As such, we demonstrate filtering methods for imputed data that improve genotype concordance from 79.3% to 99.8% while removing 99.5% of discordant genotypes. Conclusions: The described filtering methods are advantageous for large population-based WES studies designed to identify common and rare variation associated with complex diseases. Compared to data processed through standard practices, these strategies result in substantially higher quality data for common and rare association analyses.
    Categories: Journal Articles
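
    The Ti/Tv ratio cited above (improved from 2.63 to 2.75) is a standard call-set quality proxy: transitions (A<->G, C<->T) divided by transversions, with values around 3.0 expected for coding regions and lower genome-wide. A minimal computation, on toy calls:

      # Transition/transversion ratio over a set of single-nucleotide variants.
      TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

      def titv_ratio(snvs):
          """snvs: list of (ref, alt) single-nucleotide pairs."""
          ti = sum((ref, alt) in TRANSITIONS for ref, alt in snvs)
          tv = len(snvs) - ti
          return ti / tv if tv else float("inf")

      calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
      print(titv_ratio(calls))  # 3.0
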
  • The discriminant power of RNA features for pre-miRNA recognition
    [May 2014]

    Background: Computational discovery of microRNAs (miRNA) is based on pre-determined sets of features from miRNA precursors (pre-miRNA). Some feature sets are composed of sequence-structure patterns commonly found in pre-miRNAs, while others are a combination of more sophisticated RNA features. In this work, we analyze the discriminant power of seven feature sets, which are used in six pre-miRNA prediction tools. The analysis is based on the classification performance achieved with these feature sets for the training algorithms used in these tools. We also evaluate feature discrimination through the F-score and feature importance in the induction of random forests. Results: Small or non-significant differences were found among the estimated classification performances of classifiers induced using sets with diversification of features, despite the wide differences in their dimension. Inspired by these results, we obtained a lower-dimensional feature set, which achieved a sensitivity of 90% and a specificity of 95%. These estimates are within 0.1% of the maximal values obtained with any feature set (SELECT, Section "Results and discussion") while being 34 times faster to compute. Even compared to another feature set (FS2, see Section "Results and discussion"), which is the computationally least expensive feature set of those from the literature that perform within 0.1% of the maximal values, it is 34 times faster to compute. The results obtained by the tools used as references in the experiments carried out showed that five out of these six tools have lower sensitivity or specificity. Conclusion: In miRNA discovery the number of putative miRNA loci is in the order of millions. Analysis of putative pre-miRNAs using a computationally expensive feature set would be wasteful or even unfeasible for large genomes. In this work, we propose a relatively inexpensive feature set and explore most of the learning aspects implemented in current ab-initio pre-miRNA prediction tools, which may lead to the development of efficient ab-initio pre-miRNA discovery tools. The material to reproduce the main results from this paper can be downloaded from http://bioinformatics.rutgers.edu/Static/Software/discriminant.tar.gz.
    Categories: Journal Articles
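
    The F-score used for feature discrimination is, in the definition of Chen and Lin that is usual in this setting (the paper may differ in details), the gap between class means relative to within-class variance, computed per feature. A sketch:

      # Per-feature F-score: (between-class mean gap) / (within-class variance).
      import numpy as np

      def f_score(x_pos, x_neg):
          """x_pos, x_neg: 1-D arrays of one feature's values in each class."""
          m, mp, mn = np.mean(np.r_[x_pos, x_neg]), x_pos.mean(), x_neg.mean()
          between = (mp - m) ** 2 + (mn - m) ** 2
          within = x_pos.var(ddof=1) + x_neg.var(ddof=1)
          return between / within

      rng = np.random.default_rng(0)
      good = f_score(rng.normal(1.0, 0.3, 100), rng.normal(-1.0, 0.3, 100))
      weak = f_score(rng.normal(0.1, 1.0, 100), rng.normal(0.0, 1.0, 100))
      print(f"discriminative: {good:.2f}, uninformative: {weak:.3f}")
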
  • Protein-specific prediction of mRNA binding using RNA sequences, binding motifs and predicted secondary structures
    [Apr 2014]

    Background: RNA-binding proteins interact with specific RNA molecules to regulate important cellular processes. It is therefore necessary to identify the RNA interaction partners in order to understand the precise functions of such proteins. Protein-RNA interactions are typically characterized using in vivo and in vitro experiments but these may not detect all binding partners. Therefore, computational methods that capture the protein-dependent nature of such binding interactions could help to predict potential binding partners in silico. Results: We have developed three methods to predict whether an RNA can interact with a particular RNA-binding protein using support vector machines and different features based on the sequence (the Oli method), the motif score (the OliMo method) and the secondary structure (the OliMoSS method). We applied these approaches to different experimentally-derived datasets and compared the predictions with RNAcontext and RPISeq. Oli outperformed OliMoSS and RPISeq, confirming our protein-specific predictions and suggesting that tetranucleotide frequencies are appropriate discriminative features. Oli and RNAcontext were the most competitive methods in terms of the area under the curve. A precision-recall curve analysis achieved higher precision values for Oli. On a second experimental dataset including real negative binding information, Oli outperformed RNAcontext with a precision of 0.73 vs. 0.59. Conclusions: Our experiments showed that features based on primary sequence information are sufficiently discriminating to predict specific RNA-protein interactions. Sequence motifs and secondary structure information were not necessary to improve these predictions. Finally we confirmed that protein-specific experimental data concerning RNA-protein interactions are valuable sources of information that can be used for the efficient training of models for in silico predictions. The scripts are available upon request to the corresponding author.
    Categories: Journal Articles
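
    The tetranucleotide features credited with most of Oli's discriminative power are simple to compute; the feature ordering and normalization below are illustrative choices, not taken from the authors' scripts (which are available on request).

      # Build a 256-dimensional 4-mer frequency vector for an RNA sequence,
      # suitable as SVM input.
      from itertools import product

      KMERS = ["".join(p) for p in product("ACGU", repeat=4)]  # 256 features

      def tetranucleotide_freqs(seq):
          counts = dict.fromkeys(KMERS, 0)
          for i in range(len(seq) - 3):
              kmer = seq[i:i + 4]
              if kmer in counts:      # skip windows with ambiguous bases
                  counts[kmer] += 1
          n = max(len(seq) - 3, 1)
          return [counts[k] / n for k in KMERS]

      vec = tetranucleotide_freqs("AUGGCUAUGGCUAUGGCU")
      print(len(vec), max(vec))  # 256 features; fraction of the top 4-mer
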
  • Transcript mapping based on dRNA-seq data
    [Apr 2014]

    Background: RNA-seq and its variant differential RNA-seq (dRNA-seq) are today routine methods for transcriptome analysis in bacteria. While expression profiling and transcriptional start site prediction are standard tasks, the problem of identifying transcriptional units in a genome-wide fashion is still not solved for prokaryotic systems. Results: We present RNASEG, an algorithm for the prediction of transcriptional units based on dRNA-seq data. A key feature of the algorithm is that, based on the data, it distinguishes between transcribed and un-transcribed genomic segments. Furthermore, the program provides many different predictions in a single run, which can be used to infer the significance of transcriptional units in a consensus procedure. We show the performance of our method based on a well-studied dRNA-seq data set for Helicobacter pylori. Conclusions: With our algorithm it is possible to identify operons and 5'- and 3'-UTRs in an automated fashion. This alleviates the need for labour-intensive manual inspection and enables large-scale studies in the area of comparative transcriptomics.
    Categories: Journal Articles
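
    The segmentation task can be made concrete with a toy thresholding pass that splits a coverage vector into transcribed and un-transcribed runs. RNASEG fits a proper segmentation model over the dRNA-seq libraries; the fixed cutoff below is only a stand-in to make the idea tangible.

      # Split a per-base coverage vector into on/off transcription segments.
      import numpy as np

      def segments(coverage, cutoff=5):
          """Yield (start, end, state) half-open runs along the genome."""
          state = coverage[0] >= cutoff
          start = 0
          for i, c in enumerate(coverage):
              if (c >= cutoff) != state:
                  yield start, i, "transcribed" if state else "silent"
                  start, state = i, not state
          yield start, len(coverage), "transcribed" if state else "silent"

      cov = np.array([0, 1, 9, 12, 14, 8, 0, 0, 22, 25, 3])
      print(list(segments(cov)))
      # [(0, 2, 'silent'), (2, 6, 'transcribed'), (6, 8, 'silent'),
      #  (8, 10, 'transcribed'), (10, 11, 'silent')]
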
  • Accelerating the scoring module of mass spectrometry-based peptide identification using GPUs
    [Apr 2014]

    Background: Tandem mass spectrometry-based database searching is currently the main method for protein identification in shotgun proteomics. The explosive growth of protein and peptide databases, which is a result of genome translations, enzymatic digestions, and post-translational modifications (PTMs), is making computational efficiency in database searching a serious challenge. Profile analysis shows that most search engines spend 50%-90% of their total time on the scoring module, and that the spectrum dot product (SDP) based scoring module is the most widely used. As general-purpose, high-performance parallel hardware, graphics processing units (GPUs) are promising platforms for speeding up database searches in the protein identification process. Results: We designed and implemented a parallel SDP-based scoring module on GPUs that exploits the efficient use of GPU registers, constant memory and shared memory. Compared with the CPU-based version, we achieved a 30 to 60 times speedup using a single GPU. We also implemented our algorithm on a GPU cluster and achieved an approximately linear speedup. Conclusions: Our GPU-based SDP algorithm can significantly improve the speed of the scoring module in mass spectrometry-based protein identification. The algorithm can be easily implemented in many database search engines such as X!Tandem, SEQUEST, and pFind. A software tool implementing this algorithm is available at http://www.comp.hkbu.edu.hk/~youli/ProteinByGPU.html
    Categories: Journal Articles
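
    The kernel being accelerated is easy to state on the CPU: bin both spectra and take a normalized dot product. This NumPy version mirrors the arithmetic only; the paper's contribution is mapping exactly this hot loop onto GPU registers, constant memory and shared memory. Bin width and peak lists are illustrative.

      # Spectrum dot product (cosine similarity) between binned spectra.
      import numpy as np

      def binned(peaks, bin_width=0.5, max_mz=2000.0):
          """peaks: list of (m/z, intensity) -> dense intensity vector."""
          vec = np.zeros(int(max_mz / bin_width))
          for mz, intensity in peaks:
              vec[int(mz / bin_width)] += intensity
          return vec

      def sdp(spec_a, spec_b):
          return spec_a @ spec_b / (np.linalg.norm(spec_a) * np.linalg.norm(spec_b))

      experimental = binned([(500.2, 90.0), (740.6, 40.0), (1100.1, 70.0)])
      theoretical = binned([(500.3, 1.0), (740.7, 1.0), (990.0, 1.0)])
      print(f"SDP = {sdp(experimental, theoretical):.3f}")
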
  • SMOQ: a tool for predicting the absolute residue-specific quality of a single protein model with support vector machines
    [Apr 2014]

    Background: It is important to predict the quality of a protein structural model before its native structure is known. Methods that can predict the absolute local quality of individual residues in a single protein model are rare, yet such predictions are particularly needed for using, ranking and refining protein models. Results: We developed a machine learning tool (SMOQ) that can predict the distance deviation of each residue in a single protein model. SMOQ uses support vector machines (SVM) with protein sequence and structural features (i.e. basic feature set), including amino acid sequence, secondary structures, solvent accessibilities, and residue-residue contacts to make predictions. We also trained an SVM model with two new additional features (profiles and SOV scores) on 20 CASP8 targets and found that including them only improves the performance when the real deviations between the native structure and the model are higher than 5 Å. The SMOQ tool finally released uses the basic feature set trained on 85 CASP8 targets. Moreover, SMOQ implements a way to convert predicted local quality scores into a global quality score. SMOQ was tested on the 84 CASP9 single-domain targets. The average difference between the residue-specific distance deviation predicted by our method and the actual distance deviation on the test data is 2.637 Å. The global quality prediction accuracy of the tool is comparable to other good tools on the same benchmark. Conclusion: SMOQ is a useful tool for protein single-model quality assessment. Its source code and executable are available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/.
    Categories: Journal Articles
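
    For the local-to-global conversion step, one common convention in model quality assessment is the mean S-score, S_i = 1/(1 + (d_i/d0)^2) with d0 = 3.8 Å. The abstract does not state SMOQ's exact transform, so treat this purely as an illustration of how per-residue deviations can collapse into a single model score.

      # Collapse predicted per-residue deviations (in Angstroms) into one score.
      def global_quality(deviations, d0=3.8):
          return sum(1.0 / (1.0 + (d / d0) ** 2) for d in deviations) / len(deviations)

      good_model = [0.5, 1.0, 0.8, 1.2]   # small deviations everywhere
      poor_model = [6.0, 9.0, 7.5, 12.0]  # large deviations everywhere
      print(f"good: {global_quality(good_model):.2f}, "
            f"poor: {global_quality(poor_model):.2f}")
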
  • ConSole: using modularity of contact maps to locate Solenoid domains in protein structures
    [Apr 2014]

    Background: Periodic proteins, characterized by the presence of multiple repeats of short motifs, form an interesting and seldom-studied group. Due to the often extreme divergence of their sequences, detection and analysis of such motifs are performed more reliably at the structural level. Yet few algorithms have been developed for the detection and analysis of structures of periodic proteins. Results: ConSole recognizes modularity in protein contact maps, allowing for precise identification of repeats in solenoid protein structures, an important subgroup of periodic proteins. Tests on benchmarks show that ConSole has higher recognition accuracy compared with Raphael, the only other publicly available solenoid structure detection tool. As a next step of ConSole analysis, we show how detection of solenoid repeats in structures can be used to improve sequence recognition of these motifs and to detect subtle irregularities of repeat lengths in three solenoid protein families. Conclusions: The ConSole algorithm provides a fast and accurate tool to recognize solenoid protein structures as a whole and to identify individual solenoid repeat units from a structure. ConSole is available as a web-based, interactive server and is available for download at http://console.sanfordburnham.org.
    Categories: Journal Articles
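
    A contact map, the object ConSole operates on, is straightforward to construct from C-alpha coordinates. The 8 Å cutoff and the raw boolean map below are generic, illustrative choices; ConSole's actual contribution, detecting the block modularity of solenoid repeats in such maps, is not reproduced here.

      # Build a binary residue-residue contact map from C-alpha coordinates.
      import numpy as np

      def contact_map(ca_coords, cutoff=8.0):
          """ca_coords: (N, 3) array of C-alpha positions -> (N, N) boolean map."""
          diff = ca_coords[:, None, :] - ca_coords[None, :, :]
          dist = np.sqrt((diff ** 2).sum(-1))
          return dist < cutoff

      # Toy coil of 60 residues; the off-diagonal bands in its contact map are
      # the kind of periodic regularity a solenoid detector keys on.
      t = np.arange(60) * 0.6
      coords = np.c_[8 * np.cos(t), 8 * np.sin(t), 1.5 * t]
      cmap = contact_map(coords)
      print(cmap.shape, int(cmap.sum()))
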