The latest research articles published by BMC Bioinformatics
Background: Next-generation sequencing (NGS) has changed genomics significantly, and a growing number of applications rely on sequencing with different platforms. Now, in 2012, after a decade of development and evolution, NGS has been accepted in a variety of research fields. Determining sequencing errors is essential if next-generation sequencing is to move beyond research use only. This study describes the overall performance of the 454 system across multiple GS Junior runs using an in-house established and validated diagnostic assay for human leukocyte antigen (HLA) exon sequencing. Based on these data, we extracted, evaluated and characterized the errors and variants of 60 HLA loci per run with respect to their adjacent sequence context. Results: We determined an overall error rate of 0.18% in a total of 118,484,408 bases; 31.3% of all reads analyzed (n = 349,503) contain one or more errors. The largest group are deletions, which account for 50% of the errors. Incorrect bases are not distributed equally along sequences and tend to be more frequent at sequence ends, although certain positions in the middle or at the beginning of a read also accumulate errors. Typically, the quality score at the actual error position is lower than the adjacent scores. Conclusions: We present the first error assessment of a human next-generation sequencing diagnostic assay based on amplicon sequencing. The improvements in sequence quality and error rate made over the years are evident, and both have now reached a level at which diagnostic applications become feasible. Our error rates are better than those previously published, and we confirm and quantify the often-described relationship between homopolymers and errors. Nevertheless, a certain depth of coverage is needed, in particular for challenging areas of the sequencing target. Furthermore, the use of error-correcting tools is not essential but may contribute to the capacity and efficiency of a sequencing run.
Background: Biomedical events are key to understanding physiological processes and disease, and wide-coverage extraction is required for comprehensive automatic analysis of statements describing biomedical systems in the literature. In turn, the training and evaluation of extraction methods requires manually annotated corpora. However, as manual annotation is time-consuming and expensive, any single event-annotated corpus can only cover a limited number of semantic types. Although the combined use of several such corpora could potentially allow an extraction system to achieve broad semantic coverage, there has been little research into learning from multiple corpora with partially overlapping semantic annotation scopes. Results: We propose a method for learning from multiple corpora with partial semantic annotation overlap, and implement this method to improve our existing event extraction system, EventMine. An evaluation using seven event-annotated corpora, covering 65 event types in total, shows that learning from overlapping corpora can produce a single, corpus-independent, wide-coverage extraction system that outperforms systems trained on single corpora and exceeds previously reported results on two established event extraction tasks from the BioNLP Shared Task 2011. Conclusions: The proposed method allows the training of a wide-coverage, state-of-the-art event extraction system from multiple corpora with partial semantic annotation overlap. The resulting single model makes broad-coverage extraction straightforward in practice by removing the need to either select a subset of compatible corpora or semantic types, or to merge results from several models trained on different individual corpora. Multi-corpus learning also allows annotation efforts to focus on covering additional semantic types, rather than aiming for exhaustive coverage in any single annotation effort, or extending the coverage of semantic types annotated in existing corpora.
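As an editor's aside, the core trick the abstract describes, learning from corpora whose annotation scopes only partially overlap, can be sketched as label masking during training-data generation: a corpus that does not annotate a given event type should never supply negatives for it. The sketch below illustrates that idea in Python; it is not EventMine's actual code, and all names are hypothetical.

```python
# Sketch of multi-corpus learning with partially overlapping annotation
# scopes: when generating training examples from a corpus, only emit
# negative labels for event types that corpus actually annotates.

def training_labels(candidate_types, gold_types, corpus_scope):
    """Yield (event_type, label) pairs for one candidate trigger.

    candidate_types: event types the model can predict
    gold_types:      types annotated for this candidate in the gold data
    corpus_scope:    set of types this corpus annotates at all
    """
    for t in candidate_types:
        if t in gold_types:
            yield t, 1            # annotated positive
        elif t in corpus_scope:
            yield t, 0            # reliable negative: in scope, not annotated
        # types outside the corpus scope are skipped entirely, so the absence
        # of an out-of-scope annotation is never treated as a negative

# Example: a corpus annotating only {"Phosphorylation", "Binding"} cannot
# provide negatives for "Regulation".
print(list(training_labels(
    candidate_types={"Phosphorylation", "Binding", "Regulation"},
    gold_types={"Binding"},
    corpus_scope={"Phosphorylation", "Binding"},
)))
```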
Background: MicroRNAs (miRNAs) are identified in nearly all plants, where they play important roles in development and stress responses through target mRNA cleavage or translational repression. MiRNAs exert their functions by sequence complementarity with target genes, and hence their targets can be predicted using bioinformatics algorithms. In the past two decades, microarray technology has been employed to study genes involved in important biological processes, including biotic and abiotic responses and specific tissues and developmental stages, many of which are miRNA targets. Despite their value to plant biologists, miRNA target genes are difficult to access without pre-processing and the assistance of suitable analytical and visualization tools, because they are embedded in a large body of microarray data scattered across public databases. Description: The Plant MiRNA Target Expression Database (PMTED) is designed to retrieve and analyze expression profiles of miRNA targets represented in the plethora of existing, manually curated microarray data. It provides a Basic Information query function for miRNAs and their target sequences, gene ontology, and differential expression profiles. It also provides searching and browsing functions for a global Meta-network among species, bioprocesses, conditions, and miRNAs (meta-terms curated from well-annotated microarray experiments). Networks are displayed through a Cytoscape Web-based graphical interface. In addition to conserved miRNAs, PMTED provides a target prediction portal for user-defined novel miRNAs and retrieval of the corresponding target expression profiles. Hypotheses suggested by miRNA-target networks should provide starting points for further experimental validation. Conclusions: PMTED exploits value-added microarray data to study the contextual significance of miRNA target genes and should assist functional investigation of both miRNAs and their targets. PMTED will be updated over time and is freely available for non-commercial use at http://pmted.agrinome.org.
Region-based progressive localization of cell nuclei in microscopic images with data adaptive modeling
Background: Segmenting cell nuclei in microscopic images has become one of the most important routines in modern biological applications. Given the vast amount of data, automatic localization, i.e. detection and segmentation, of cell nuclei is highly desirable compared to time-consuming manual processing. However, automated segmentation is challenging due to large intensity inhomogeneities in the cell nuclei and the background. Results: We present a new method for automated progressive localization of cell nuclei using data-adaptive models that can better handle the inhomogeneity problem. We perform localization in a three-stage approach: first, identify all regions of interest with contrast-enhanced salient region detection; then, process the clusters to identify true cell nuclei with probability estimation via feature-distance profiles of reference regions; and finally, refine the contours of detected regions with a regional contrast-based graphical model. The proposed region-based progressive localization (RPL) method is evaluated on three different datasets, the first two containing grayscale images and the third comprising color images with cytoplasm in addition to cell nuclei. We demonstrate performance improvement over the state of the art. For example, compared to the second-best approach, on the first dataset our method achieves reductions of 2.8 and 3.7 in Hausdorff distance and false negatives, respectively; on the second dataset, which has larger intensity inhomogeneity, our method achieves a 5% increase in Dice coefficient and Rand index; on the third dataset, our method achieves a 4% increase in object-level accuracy. Conclusions: To tackle the intensity inhomogeneities in cell nuclei and background, a region-based progressive localization method is proposed for cell nuclei localization in fluorescence microscopy images. The RPL method is demonstrated to be highly effective on three different public datasets, with on average 3.5% and 7% improvement in region- and contour-based segmentation performance over the state of the art.
Background: Different genome annotation services have been developed in recent years and are widely used. However, the functional annotation results from different services often disagree, and a scheme that obtains consensus functional annotations by integrating different results is in demand. Results: This article presents a semi-automated scheme capable of comparing functional annotations from different sources and consequently obtaining a consensus genome functional annotation result. In this study, we used four automated annotation services to annotate a newly sequenced genome, Arcobacter butzleri ED-1. Our scheme is divided into annotation comparison and annotation determination sections. In the functional annotation comparison section, we employed gene synonym lists to tackle term-difference problems, and multiple techniques from information retrieval to preprocess the functional annotations. Based on the comparison results, we designed a decision tree to obtain a consensus functional annotation result. Experimental results show that our approach can greatly reduce the workload of manual comparison by automatically comparing 87% of the functional annotations. In addition, it automatically determined 87% of the functional annotations, leaving only 13% of the genes for manual curation. We applied this approach across six phylogenetically different genomes in order to assess the consistency of its performance. The results showed that our scheme is able to automatically perform, on average, 73% and 86% of the annotation comparison and determination tasks, respectively. Conclusions: We propose a semi-automatic and effective scheme to compare and determine genome functional annotations. It greatly reduces the manual work required in genome functional annotation. As this scheme does not require any specific biological knowledge, it is readily applicable to genome annotation comparison and genome re-annotation projects.
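The comparison step the abstract describes (synonym handling plus information-retrieval-style preprocessing before a consensus decision) might look roughly like the following sketch. The synonym table, similarity measure and threshold are illustrative assumptions, not the paper's exact scheme.

```python
# Minimal sketch of annotation comparison: normalize each functional
# annotation string (case folding, tokenization, synonym mapping), then
# compare annotations from different services and flag disagreements.

import re

SYNONYMS = {"putative": "hypothetical", "protein": ""}  # toy synonym list

def normalize(annotation):
    tokens = re.findall(r"[a-z0-9]+", annotation.lower())
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return {t for t in tokens if t}

def jaccard(a, b):
    a, b = normalize(a), normalize(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def consensus(annotations, threshold=0.8):
    """Return an annotation agreed on by all services, or None for manual curation."""
    first = annotations[0]
    if all(jaccard(first, other) >= threshold for other in annotations[1:]):
        return first
    return None  # flag for manual comparison

print(consensus(["putative membrane protein", "hypothetical membrane protein"]))
```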
Background: A Gene Reference Into Function (GeneRIF) describes novel functionality of genes. GeneRIFs are available from the National Center for Biotechnology Information (NCBI) Gene database. GeneRIF indexing is performed manually, and the intention of our work is to provide methods to support creating the GeneRIF entries. The creation of GeneRIF entries involves the identification of the genes mentioned in MEDLINE citations and the sentences describing a novel function. Results: We have compared several learning algorithms and several features extracted or derived from MEDLINE sentences to determine if a sentence should be selected for GeneRIF indexing. Features are derived from the sentences themselves or using mechanisms that augment the information provided by them: assigning a discourse label using a previously trained model, for example. We show that machine learning approaches with specific feature combinations achieve results close to one of the annotators. We have evaluated different feature sets and learning algorithms. In particular, Naive Bayes achieves better performance with a selection of features similar to one used in related work, which considers the location of the sentence, its discourse, and the functional terminology in it. Conclusions: The current performance is at a level similar to human annotation, showing that machine learning can be used to automate the task of sentence selection for GeneRIF annotation. The current experiments are limited to the human species. We would like to see how the methodology can be extended to other species, specifically the normalization of gene mentions in other species.
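A minimal sketch of the task as described, classifying MEDLINE sentences for GeneRIF-worthiness with Naive Bayes over sentence features, is given below. It assumes scikit-learn and uses toy data; the paper's discourse and functional-terminology features are not reproduced, only bag-of-words plus a sentence-position feature.

```python
# Hedged sketch of sentence selection as supervised classification with
# Naive Bayes. Data and labels here are purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from scipy.sparse import hstack, csr_matrix

sentences = [
    "This gene regulates apoptosis in epithelial cells.",
    "Samples were collected from 40 patients.",
    "Overexpression of the gene suppresses tumor growth.",
    "Statistical analysis was performed with SPSS.",
]
positions = [[0.9], [0.1], [0.8], [0.3]]   # relative position in the abstract
labels = [1, 0, 1, 0]                       # 1 = GeneRIF-worthy sentence

vec = CountVectorizer()
X = hstack([vec.fit_transform(sentences), csr_matrix(positions)])
clf = MultinomialNB().fit(X, labels)

test = ["The protein promotes cell migration."]
X_test = hstack([vec.transform(test), csr_matrix([[0.85]])])
print(clf.predict(X_test))
```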
Oral cancer prognosis based on clinicopathologic and genomic markers using a hybrid of feature selection and machine learning methods
Background: Machine learning techniques are becoming a useful alternative to conventional medical diagnosis and prognosis, as they handle noisy and incomplete data well and can attain significant results despite small sample sizes. Traditionally, clinicians make prognostic decisions based on clinicopathologic markers. However, it is not easy for even the most skilful clinician to reach an accurate prognosis using these markers alone, so there is a need for genomic markers to improve the accuracy of prognosis. The main aim of this research is to apply a hybrid of feature selection and machine learning methods to oral cancer prognosis based on the correlation of clinicopathologic and genomic markers. Results: In the first stage of this research, five feature selection methods were proposed and evaluated on the oral cancer prognosis dataset. In the second stage, the models built from the features selected by each method were tested on the proposed classifiers. Four types of classifiers were chosen, namely ANFIS, artificial neural network, support vector machine and logistic regression. k-fold cross-validation was applied to all classifiers because of the small sample size. The hybrid model of ReliefF-GA-ANFIS with three input features (drink, invasion and p63) achieved the best accuracy for oral cancer prognosis (accuracy = 93.81%; AUC = 0.90). Conclusions: The results revealed that prognosis is superior when both clinicopathologic and genomic markers are present. The selected features can be investigated further to validate their potential as a significant prognostic signature in oral cancer studies.
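The two-stage design (feature selection, then k-fold cross-validated classification) can be illustrated as follows. Logistic regression and univariate selection stand in for ANFIS and the ReliefF-GA wrapper, which have no standard scikit-learn implementations; the data are synthetic.

```python
# Illustrative sketch: feature selection followed by cross-validated
# classification on a small-sample dataset.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(56, 20))          # small sample, many candidate markers
y = rng.integers(0, 2, size=56)

model = Pipeline([
    ("select", SelectKBest(f_classif, k=3)),   # keep 3 features, cf. drink/invasion/p63
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())
```

Placing the selector inside the pipeline matters: features are then re-selected within each training fold, avoiding selection bias in the cross-validated AUC, a real concern at this sample size.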
Accounting for immunoprecipitation efficiencies in the statistical analysis of ChIP-seq data
Background: ImmunoPrecipitation (IP) efficiencies may vary largely between different antibodies and between repeated experiments with the same antibody. These differences have a large impact on the quality of ChIP-seq data: a more efficient experiment will necessarily lead to a higher signal-to-background ratio, and therefore to an apparently larger number of enriched regions, than a less efficient experiment. In this paper, we show how IP efficiencies can be explicitly accounted for in the joint statistical modelling of ChIP-seq data. Results: We fit a latent mixture model to eight experiments on two proteins, from two laboratories where different antibodies are used for the two proteins. We use the model parameters to estimate the efficiencies of individual experiments, and find that these differ clearly between the laboratories and amongst technical replicates from the same lab. When we account for ChIP efficiency, we find more regions bound in the more efficient experiments than in the less efficient ones, at the same false discovery rate. A priori knowledge of the same number of binding sites across experiments can also be included in the model for a more robust detection of differentially bound regions between two different proteins. Conclusions: We propose a statistical model for the detection of enriched and differentially bound regions from multiple ChIP-seq data sets. The framework that we present accounts explicitly for IP efficiencies in ChIP-seq data, and allows replicates and experiments from different proteins to be modelled jointly, rather than individually, leading to more robust biological conclusions.
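One plausible form such a latent mixture model could take, with the IP efficiency entering as a multiplicative factor on the signal component, is sketched below. This is an illustration of the idea only, not the paper's exact specification; all symbols are assumptions.

```latex
% Two-component mixture for read counts y_{je} in genomic window j of
% experiment e, with latent enrichment indicator Z_j and a per-experiment
% efficiency alpha_e scaling the signal mean:
\begin{aligned}
  y_{je} \mid Z_j = 0 &\sim \mathrm{NegBin}(\mu_{0e}, \phi_{0e})
      && \text{(background)} \\
  y_{je} \mid Z_j = 1 &\sim \mathrm{NegBin}(\alpha_e \mu_{1}, \phi_{1e})
      && \text{(enriched)} \\
  \Pr(Z_j = 1) &= \pi
\end{aligned}
```

Under a formulation like this, two experiments with the same binding sites but different efficiencies differ only in \(\alpha_e\), so enrichment calls can be put on a common footing.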
MAPs: a database of modular antibody parts for predicting tertiary structures and designing affinity matured antibodies
Background: The de novo design of a novel protein with a particular function remains a formidable challenge, with only isolated and hard-to-repeat successes to date. Due to their many structurally conserved features, antibodies are a family of proteins amenable to predictable rational design. Design algorithms must consider the structural diversity of possible naturally occurring antibodies. The human immune system samples this design space (2 × 10^12) by randomly combining variable, diversity, and joining genes in a process known as V-(D)-J recombination. Description: By analyzing structural features found in affinity matured antibodies, a database of Modular Antibody Parts (MAPs) analogous to the variable, diversity, and joining genes has been constructed for the prediction of antibody tertiary structures. The database contains 929 parts constructed from an analysis of 1168 human, humanized, chimeric, and mouse antibody structures, and encompasses all currently observed structural diversity of antibodies. Conclusions: The generation of 260 antibody structures shows that the MAPs database can be used to reliably predict antibody tertiary structures with an average all-atom RMSD of 1.9 Å. Using the broadly neutralizing anti-influenza antibody CH65 and the anti-HIV antibody 4E10 as examples, promising starting antibodies for affinity maturation are identified and amino acid changes are traced as antibody affinity maturation occurs.
Identifying cancer mutation targets across thousands of samples: MuteProc, a high throughput mutation analysis pipeline
Background: In the past decade, bioinformatics tools have matured enough to reliably perform sophisticated primary analysis of Next Generation Sequencing (NGS) data, such as mapping, assembly and variant calling. However, there is still a dire need for improvements in higher-level analysis, such as NGS data organization, analysis of mutation patterns and Genome Wide Association Studies (GWAS). Results: We present a high-throughput pipeline for identifying cancer mutation targets, capable of processing billions of variations across thousands of samples. This pipeline is coupled with our Human Variation Database to provide more complex downstream analysis of the variations hosted in the database. Most notably, these analyses include finding significantly mutated regions across multiple genomes and regions with mutational preferences within certain types of cancers. The results of the analyses are presented in HTML summary reports that incorporate gene annotations from various resources for the reported regions. Conclusions: MuteProc is available for download through the Vancouver Short Read Analysis Package on Sourceforge: http://vancouvershortr.sourceforge.net. Instructions for use and a tutorial are provided on the accompanying wiki pages at https://sourceforge.net/apps/mediawiki/vancouvershortr/index.php?title=Pipeline_introduction.
GWAS on your notebook: fast semi-parallel linear and logistic regression for genome-wide association studies
Background: Genome-wide association studies have become very popular for identifying genetic contributions to phenotypes. Millions of SNPs are tested for their association with diseases and traits using linear or logistic regression models. This conceptually simple strategy encounters the following computational issues: a large number of tests, and very large genotype files (many gigabytes) that cannot be loaded directly into software memory. One solution applied on a grand scale is cluster computing, involving large-scale resources. We show how to speed up the computations using matrix operations in pure R code. Results: We improve speed: computation time is reduced from six hours to 10-15 minutes. Our approach can handle essentially an unlimited number of covariates efficiently, using projections. Data files in GWAS are vast, and reading them into computer memory becomes an important issue. However, much improvement can be made if the data are structured beforehand in a way that allows easy access to blocks of SNPs. We propose several solutions based on the R packages ff and ncdf. We also adapted the semi-parallel computations for logistic regression. We show that in a typical GWAS setting, where SNP effects are very small, we do not lose any precision and our computations are a few hundred times faster than standard procedures. Conclusions: We provide very fast algorithms for GWAS written in pure R code. We also show how to rearrange SNP data for fast access.
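The semi-parallel idea, replacing a per-SNP regression loop with a handful of matrix operations after projecting out covariates, can be sketched as follows. The abstract's implementation is in pure R; this illustration uses Python/NumPy, and the function name is hypothetical.

```python
# Sketch of "semi-parallel" single-SNP linear regression via matrix
# operations. Covariates are removed by projection, after which every SNP's
# effect estimate and t-statistic come from vectorized expressions.

import numpy as np

def semi_parallel_gwas(y, snps, covariates):
    """y: (n,) phenotype; snps: (n, m) genotypes; covariates: (n, p)."""
    n, m = snps.shape
    # Project phenotype and genotypes onto the orthogonal complement of the
    # covariate space (equivalent to adjusting every regression for them).
    Q, _ = np.linalg.qr(covariates)
    y_res = y - Q @ (Q.T @ y)
    g_res = snps - Q @ (Q.T @ snps)

    gty = g_res.T @ y_res                       # all m cross-products at once
    gtg = np.einsum("ij,ij->j", g_res, g_res)   # per-SNP sums of squares
    beta = gty / gtg
    df = n - covariates.shape[1] - 1
    sigma2 = (y_res @ y_res - beta * gty) / df
    se = np.sqrt(sigma2 / gtg)
    return beta, beta / se                      # effect sizes, t-statistics

# Toy example with an intercept as the only covariate:
rng = np.random.default_rng(1)
n, m = 1000, 5000
snps = rng.integers(0, 3, size=(n, m)).astype(float)
y = 0.2 * snps[:, 0] + rng.normal(size=n)
beta, t = semi_parallel_gwas(y, snps, np.ones((n, 1)))
print(t[0])
```

The same one-pass structure is what makes block-wise reading of SNP files (ff, ncdf) effective: each block of SNPs is processed with a few BLAS calls rather than thousands of model fits.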
Background: Two-channel (or two-color) microarrays are cost-effective platforms for comparative analysis of gene expression. They are traditionally analysed in terms of the log-ratios (M-values) of the two channel intensities at each spot, but this analysis does not use all the information available in the separate channel observations. Mixed models have been proposed to analyse intensities from the two channels as separate observations, but such models can be complex to use and the gain in efficiency over the log-ratio analysis is difficult to quantify. Mixed models yield test statistics for which the null distributions can be specified only approximately, and some approaches do not borrow strength between genes. Results: This article reformulates the mixed model to clarify the relationship with the traditional log-ratio analysis, to facilitate information borrowing between genes, and to obtain an exact distributional theory for the resulting test statistics. The mixed model is transformed to operate on the M-values and A-values (average log-expression for each spot) instead of on the log-expression values. The log-ratio analysis is shown to ignore information contained in the A-values. The relative efficiency of the log-ratio analysis is shown to depend on the size of the intra-spot correlation. A new separate-channel analysis method is proposed that assumes a constant intra-spot correlation coefficient across all genes. This approach permits the mixed model to be transformed into an ordinary linear model, allowing the data analysis to use a well-understood empirical Bayes analysis pipeline for linear modeling of microarray data. This yields statistically powerful test statistics with an exact distributional theory. The log-ratio, mixed model and common correlation methods are compared using three case studies. The results show that separate-channel analyses that borrow strength between genes are more powerful than log-ratio analyses, and the common correlation analysis is the most powerful of all. Conclusions: The common correlation method proposed in this article for separate-channel analysis of two-channel microarray data is no more difficult to apply in practice than the traditional log-ratio analysis. It provides an intuitive and powerful means to conduct analyses and make comparisons that might otherwise not be possible.
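For readers unfamiliar with the notation, the M- and A-values are the standard transformation of the red and green channel intensities R and G at each spot (the model details beyond this are the paper's):

```latex
M = \log_2 R - \log_2 G, \qquad
A = \tfrac{1}{2}\left(\log_2 R + \log_2 G\right)
```

The traditional analysis uses only M; the abstract's point is that A carries additional information whose value depends on the intra-spot correlation.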
Background: Many models have been proposed to detect copy number alterations in chromosomal copy number profiles, but it is usually not obvious which is most effective for a given data set. Furthermore, most methods have a smoothing parameter that determines the number of breakpoints and must be chosen using various heuristics. Results: We present three contributions to copy number profile smoothing model selection. First, we propose to select the model and degree of smoothness that maximize agreement with visual breakpoint region annotations. Second, we develop cross-validation procedures to estimate the error of the trained models. Third, we apply these methods to compare 17 smoothing models on a new database of 575 annotated neuroblastoma copy number profiles, which we make available as a public benchmark for testing new algorithms. Conclusions: Whereas previous studies have been qualitative or limited to simulated data, our annotation-guided approach is quantitative and suggests which algorithms are fastest and most accurate in practice on real data. In the neuroblastoma data, the equivalent pelt.n and cghseg.k methods were the best breakpoint detectors, and exhibited reasonable computation times.
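The annotation-guided selection criterion can be sketched generically: for each candidate smoothing penalty, run the detector and count disagreements with annotated "breakpoint" and "normal" regions, then keep the penalty with the fewest. The segmenter below is a toy placeholder, not one of the 17 compared models, and the annotation encoding is an assumption.

```python
# Sketch of annotation-guided smoothing-parameter selection.

def annotation_errors(breakpoints, annotations):
    """annotations: list of (start, end, expected) with expected in {0, 1}."""
    errors = 0
    for start, end, expected in annotations:
        found = sum(start <= b < end for b in breakpoints)
        if expected == 0 and found > 0:
            errors += 1      # false breakpoint in a region annotated as normal
        if expected == 1 and found == 0:
            errors += 1      # missed breakpoint in an annotated breakpoint region
    return errors

def pick_penalty(profile, annotations, segment, penalties):
    """segment(profile, penalty) -> list of breakpoint positions."""
    return min(penalties,
               key=lambda lam: annotation_errors(segment(profile, lam), annotations))

# Toy demo with a threshold-based "segmenter":
segment = lambda prof, lam: [i for i in range(1, len(prof))
                             if abs(prof[i] - prof[i - 1]) > lam]
profile = [0, 0, 0, 1, 1, 1, 1, 0, 0]
annotations = [(2, 5, 1), (5, 7, 0)]
print(pick_penalty(profile, annotations, segment, [0.5, 1.5]))
```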
Protein complex detection using interaction reliability assessment and weighted clustering coefficient
Background: Predicting protein complexes from protein-protein interaction data is a fundamental problem in computational biology. The identification and characterization of the protein complexes implicated are crucial to understanding molecular events under normal and abnormal physiological conditions. Meanwhile, large datasets of experimentally detected protein-protein interactions have been produced using high-throughput experimental techniques. However, such experimental data are usually liable to contain a large number of spurious interactions, so it is essential to validate these interactions before exploiting them to predict protein complexes. Results: In this paper, we propose a novel graph mining algorithm (PEWCC) to identify such protein complexes. The algorithm first assesses the reliability of the interaction data, then predicts protein complexes based on the concept of the weighted clustering coefficient. To demonstrate the effectiveness of the proposed method, the performance of PEWCC was compared to that of several other methods. PEWCC was able to detect more matched complexes than any of the state-of-the-art methods, with higher quality scores. Conclusions: The higher accuracy achieved by PEWCC in detecting protein complexes is a valid argument in favor of the proposed method. The datasets and programs are freely available at http://faculty.uaeu.ac.ae/nzaki/Research.htm.
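For intuition, a weighted clustering coefficient of the general kind PEWCC builds on can be computed as below. This is one common triangle-weight definition, not necessarily the paper's exact formula, and edge weights are assumed to lie in [0, 1], e.g. interaction reliabilities.

```python
# Sketch of a weighted clustering coefficient: the weight of closed
# triangles around a node, relative to the weight of all neighbor pairs.
# The graph is a dict of node -> {neighbor: weight} maps.

def weighted_clustering(graph, v):
    nbrs = list(graph[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    closed = total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            u, w = nbrs[i], nbrs[j]
            pair = graph[v][u] * graph[v][w]
            total += pair
            if w in graph[u]:                  # neighbors u and w are connected
                closed += pair * graph[u][w]
    return closed / total if total else 0.0

g = {
    "a": {"b": 0.9, "c": 0.8, "d": 0.4},
    "b": {"a": 0.9, "c": 0.7},
    "c": {"a": 0.8, "b": 0.7},
    "d": {"a": 0.4},
}
print(weighted_clustering(g, "a"))
```

Down-weighting unreliable edges in this way is what lets a reliability-assessment step feed directly into complex detection.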
Background: Calcium (Ca2+) propagates within tissues, serving as an important information carrier. In particular, ciliary beat frequency in oviduct cells is partially regulated by Ca2+ changes, so measuring calcium density and characterizing the traveling wave play a key role in understanding such biological phenomena. However, current methods to measure propagation velocities and other wave characteristics involve several manual or time-consuming procedures, which limits the amount of information that can be extracted and the statistical quality of the analysis. Results: Our work provides a framework based on image-processing procedures that enables a fast, automatic and robust characterization of data from two-filter fluorescence Ca2+ experiments. We calculate the mean velocity of the wave front, and use theoretical models to extract meaningful parameters such as wave amplitude, decay rate and time of excitation. Conclusions: Measurements made by different operators showed a high degree of reproducibility. The framework also extends to single-filter fluorescence experiments, allowing higher sampling rates and thus increased accuracy in velocity measurements.
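Once the wave front has been localized in each frame, the mean velocity reduces to a line fit of front position against time; a minimal sketch, with an assumed frame rate and toy front positions, follows. Extracting the front from the fluorescence images is the hard part and is not shown here.

```python
# Illustrative estimate of mean wave-front velocity from per-frame front
# positions (values below are invented for the example).

import numpy as np

frame_interval_s = 0.1                       # assumed acquisition rate: 10 Hz
front_position_um = np.array([5.0, 9.8, 15.1, 20.2, 24.9])

t = np.arange(len(front_position_um)) * frame_interval_s
velocity, _ = np.polyfit(t, front_position_um, 1)   # slope of position vs time
print(f"mean front velocity ~ {velocity:.1f} um/s")
```

The abstract's point about sampling rates follows directly: halving the frame interval doubles the number of points constraining the slope over the same transit, tightening the velocity estimate.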
Background: Motifs are significant patterns in DNA, RNA, and protein sequences that play an important role in biological processes and functions, such as the identification of open reading frames, RNA transcription, and protein binding. Several versions of the motif search problem have been studied in the literature. One such version is called the Planted Motif Search (PMS), or (l, d)-motif search. PMS is known to be NP-complete, and the time complexities of most planted motif search algorithms depend exponentially on the alphabet size. Recently, a new version of the motif search problem was introduced by Kuksa and Pavlovic; we call this version the Motif Stems Search (MSS) problem. A motif stem is an l-mer (for some relevant value of l) with some wildcard characters, and hence corresponds to a set of l-mers (without wildcards), some of which are (l, d)-motifs. Kuksa and Pavlovic presented an efficient algorithm to find motif stems for inputs from large alphabets. Ideally, the number of stems output should be as small as possible, since the stems form a superset of the motifs. Results: In this paper we propose an efficient algorithm for MSS and evaluate it on both synthetic and real data. This evaluation reveals that our algorithm is much faster than Kuksa and Pavlovic's algorithm. Conclusions: Our MSS algorithm outperforms the algorithm of Kuksa and Pavlovic in terms of both run time and the number of stems output. Specifically, the stems output by our algorithm form a proper (and much smaller) subset of the stems output by Kuksa and Pavlovic's algorithm.
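To make the stem concept concrete: a stem is an l-mer with wildcards, and it covers every l-mer obtained by substituting alphabet symbols for the wildcards. The sketch below expands a stem and checks which covered l-mers are (l, d)-motifs by brute force; the efficient algorithms discussed in the abstract exist precisely to avoid this kind of exhaustive work. The '*' wildcard notation is an illustrative choice.

```python
# Expand a motif stem and test covered l-mers for the (l, d)-motif property.

from itertools import product

ALPHABET = "ACGT"

def expand_stem(stem):
    """Yield every concrete l-mer covered by a stem such as 'AC*T'."""
    slots = [ALPHABET if c == "*" else c for c in stem]
    for combo in product(*slots):
        yield "".join(combo)

def is_ld_motif(candidate, sequences, d):
    """True if every sequence contains an l-mer within Hamming distance d."""
    l = len(candidate)
    def near(s):
        return any(sum(a != b for a, b in zip(candidate, s[i:i + l])) <= d
                   for i in range(len(s) - l + 1))
    return all(near(s) for s in sequences)

seqs = ["ACGTACGT", "ACCTACGA", "TACGTTTT"]
for lmer in expand_stem("AC*T"):
    if is_ld_motif(lmer, seqs, d=1):
        print(lmer)
```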
Disk-based k-mer counting on a PC
Background: The k-mer counting problem, which is to build a histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications, including de Bruijn graph genome assembly, fast multiple sequence alignment and repeat detection. Results: We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, from a gzipped FASTQ input file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm, for most tested settings of this problem and mammalian-size data, can accomplish this task in comparable time. Our solution is also among the most memory-frugal; most competitive algorithms cannot work efficiently on a PC with 16 GB of memory for such massive data. Conclusions: By making use of cheap disk space and exploiting CPU and I/O parallelism, we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow at least some bioinformatics problems with massive data to be solved on a commodity personal computer. Keywords: k-mer counting, de Bruijn graph genome assemblers, multiple sequence alignment, repeat detection. Availability: KMC is freely available at http://sun.aei.polsl.pl/kmc.
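The disk-based strategy itself is simple to sketch: stream the reads, scatter k-mers into temporary bin files, then count each (much smaller) bin in memory. The toy version below ignores everything that makes KMC fast, such as parallelism, compact binary encodings and careful I/O, and it merges all bin counts into one table for demonstration, which a real disk-based counter would avoid to cap memory use.

```python
# Toy two-phase disk-based k-mer counter: (1) distribute k-mers to bin
# files by a hash of the k-mer, (2) count each bin independently.

import os
from collections import Counter

def count_kmers(reads, k, n_bins=16, tmpdir="kmc_bins"):
    os.makedirs(tmpdir, exist_ok=True)
    bins = [open(os.path.join(tmpdir, f"bin{i}"), "w") for i in range(n_bins)]
    try:
        for read in reads:                       # phase 1: scatter to disk
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                bins[hash(kmer) % n_bins].write(kmer + "\n")
    finally:
        for f in bins:
            f.close()
    counts = Counter()                           # phase 2: count bin by bin
    for i in range(n_bins):                      # (a real counter would emit
        with open(os.path.join(tmpdir, f"bin{i}")) as f:   # each bin's counts
            counts.update(line.strip() for line in f)      # separately)
    return counts

print(count_kmers(["ACGTACGT", "CGTACG"], k=4).most_common(3))
```

Because all occurrences of a given k-mer land in the same bin, peak memory is bounded by the largest bin rather than the full k-mer spectrum, which is the point of trading disk space for RAM.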
Background: Molecular pathways represent an ensemble of interactions occurring among molecules within the cell and between cells. The identification of similarities between molecular pathways across organisms and functions plays a critical role in understanding complex biological processes. To infer such novel information, the comparison of molecular pathways requires accounting for imperfect matches (flexibility) and efficiently handling complex network topologies. To date, these characteristics are only partially available in tools designed to compare molecular interaction maps. Results: Our approach, MIMO (Molecular Interaction Maps Overlap), addresses the first problem by allowing the introduction of gaps and mismatches between query and template pathways, and permits, when necessary, supervised queries incorporating a priori biological information. It addresses the second issue by relying directly on the rich graph topology described in the Systems Biology Markup Language (SBML) standard, and uses multidigraphs to efficiently handle multiple queries on biological graph databases. The algorithm has been successfully used here to highlight the contact points between various human pathways in the Reactome database. Conclusions: MIMO offers a flexible and efficient graph-matching tool for comparing complex biological pathways.
Background: Scientists rarely reuse expert knowledge of phylogeny, in spite of years of effort to assemble a great "Tree of Life" (ToL). A notable exception involves the use of Phylomatic, which provides tools to generate custom phylogenies from a large, pre-computed, expert phylogeny of plant taxa. This suggests great potential for a more generalized system that, starting with a query consisting of a list of any known species, would rectify non-standard names, identify expert phylogenies containing the implicated taxa, prune away unneeded parts, and supply branch lengths and annotations, resulting in a custom phylogeny suited to the user's needs. Such a system could become a sustainable community resource if implemented as a distributed system of loosely coupled parts that interact through clearly defined interfaces. Results: With the aim of building such a "phylotastic" system, the NESCent Hackathons, Interoperability, Phylogenies (HIP) working group recruited 2 dozen scientist-programmers to a weeklong programming hackathon in June 2012. During the hackathon (and a three-month follow-up period), 5 teams produced designs, implementations, documentation, presentations, and tests including: (1) a generalized scheme for integrating components; (2) proof-of-concept pruners and controllers; (3) a meta-API for taxonomic name resolution services; (4) a system for storing, finding, and retrieving phylogenies using semantic web technologies for data exchange, storage, and querying; (5) an innovative new service, DateLife.org, which synthesizes pre-computed, time-calibrated phylogenies to assign ages to nodes; and (6) demonstration projects. These outcomes are accessible via a public code repository (GitHub.com), a website (www.phylotastic.org), and a server image. Conclusions: Approximately 9 person-months of effort (centered on a software development hackathon) resulted in the design and implementation of proof-of-concept software for 4 core phylotastic components, 3 controllers, and 3 end-user demonstration tools. While these products have substantial limitations, they suggest considerable potential for a distributed system that makes phylogenetic knowledge readily accessible in computable form. Widespread use of phylotastic systems will create an electronic marketplace for sharing phylogenetic knowledge that will spur innovation in other areas of the ToL enterprise, such as annotation of sources and methods and third-party methods of quality assessment.
Investigating the concordance of Gene Ontology terms reveals the intra- and inter-platform reproducibility of enrichment analysis
Background: Reliability and reproducibility of differentially expressed genes (DEGs) are essential for the biological interpretation of microarray data. The MicroArray Quality Control (MAQC) project launched by the US Food and Drug Administration (FDA) showed that the lists of DEGs generated by intra- and inter-platform comparisons can reach a high level of concordance, which mainly depends on the statistical criteria used for ranking and selecting DEGs. In general, combining fold-change ranking with a non-stringent p-value cutoff produces reproducible lists of DEGs. For further interpretation of gene expression data, statistical methods of gene enrichment analysis provide powerful tools for associating DEGs with prior biological knowledge, e.g. Gene Ontology (GO) terms and pathways, and are widely used in genome-wide research. Although DEG lists generated from the same compared conditions have proved reliable, reproducible enrichment results remain crucial to discovering the molecular mechanisms that differentiate the two conditions. It is therefore important to know whether enrichment results remain reproducible when the lists of DEGs are generated by different statistical criteria from inter-laboratory and cross-platform comparisons. In our study, we used the MAQC data sets to systematically assess the intra- and inter-platform concordance of GO terms enriched by Gene Set Enrichment Analysis (GSEA) and LRpath. Results: In intra-platform comparisons, the percentage of overlapping enriched GO terms was as high as ~80% when the input lists of DEGs were generated by fold-change ranking and Significance Analysis of Microarrays (SAM), whereas the percentage decreased by about 20% when the lists of DEGs were generated by fold-change ranking and t-test, or by SAM and t-test. Similar results were found in inter-platform comparisons. Conclusions: Our results demonstrate that DEG lists with a high level of concordance can ensure high concordance of enrichment results. Importantly, based on lists of DEGs generated by the straightforward method of combining fold-change ranking with a non-stringent p-value cutoff, enrichment analysis will produce reproducible enriched GO terms for biological interpretation.
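The concordance measure discussed here boils down to the percentage of enriched GO terms shared between two analyses; a minimal sketch follows. The study's exact overlap definition (e.g. the denominator used) may differ from this illustration.

```python
# Sketch of GO-term concordance between two enrichment analyses: the
# percentage of shared enriched terms relative to the shorter list.

def overlap_percentage(terms_a, terms_b):
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))

platform1 = ["GO:0006915", "GO:0008283", "GO:0006955"]
platform2 = ["GO:0006915", "GO:0006955", "GO:0007049"]
print(f"{overlap_percentage(platform1, platform2):.0f}% overlapping GO terms")
```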