BMC Bioinformatics

The latest research articles published by BMC Bioinformatics
  • Multi-TGDR, a multi-class regularization method, identifies the metabolic profiles of hepatocellular carcinoma and cirrhosis infected with hepatitis B or hepatitis C virus
    [Apr 2014]

    Background: Over the last decade, metabolomics has evolved into a mainstream enterprise utilized by many laboratories globally. Like other "omics" data, metabolomics data are characterized by a small sample size relative to the number of features evaluated. Thus, the selection of an optimal subset of features with a supervised classifier is imperative. We extended an existing feature selection algorithm, threshold gradient descent regularization (TGDR), to handle multi-class classification of "omics" data, and propose two such extensions, referred to as multi-TGDR. Both multi-TGDR frameworks were used to analyze a metabolomics dataset that compares the metabolic profiles of hepatocellular carcinoma (HCC) infected with hepatitis B (HBV) or C virus (HCV) with those of cirrhosis induced by HBV/HCV infection; the goal was to improve early-stage diagnosis of HCC. Results: We applied the two multi-TGDR frameworks, which determine TGDR thresholds either globally across classes or locally for each class, to the HCC metabolomics data. Multi-TGDR global selected 45 metabolites with a 0% misclassification rate (the error rate on the training data) and a 3.82% 5-fold cross-validation (CV-5) predictive error rate. Multi-TGDR local selected 48 metabolites with a 0% misclassification rate and a 5.34% CV-5 error rate. Conclusions: One important advantage of multi-TGDR local is that it allows inference about which features are related specifically to which class or classes. Thus, we recommend multi-TGDR local: it has similar predictive performance and requires the same computing time as multi-TGDR global, but may provide class-specific inference.
    Categories: Journal Articles
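
    To make the core idea concrete, here is a minimal sketch of the underlying TGDR update for a binary logistic loss; the paper's multi-TGDR generalizes this to several classes with a global or per-class threshold. The function name and parameter values are illustrative, not taken from the paper.

    ```python
    import numpy as np

    def tgdr(X, y, tau=0.9, delta=0.01, n_steps=500):
        """Threshold gradient descent regularization (binary sketch).

        tau in [0, 1] controls sparsity: at each step only coefficients
        whose gradient magnitude is within tau of the maximum are updated.
        X: (n, p) feature matrix; y: 0/1 labels.
        """
        n, p = X.shape
        beta = np.zeros(p)
        for _ in range(n_steps):
            prob = 1.0 / (1.0 + np.exp(-X @ beta))
            grad = X.T @ (y - prob) / n                      # log-likelihood gradient
            mask = np.abs(grad) >= tau * np.abs(grad).max()  # thresholding step
            beta[mask] += delta * grad[mask]                 # small step on survivors
        return beta
    ```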
  • Quantum Coupled Mutation Finder: Predicting functionally or structurally important sites in proteins using quantum Jensen-Shannon divergence and CUDA programming
    [Apr 2014]

    Background: The identification of functionally or structurally important non-conserved residue sites in protein MSAs is an important challenge for understanding the structural basis and molecular mechanism of protein functions. Despite the rich literature on compensatory mutations as well as sequence conservation analysis for the detection of those important residues, previous methods often rely on classical information-theoretic measures. However, these measures usually do not take into account dis/similarities of amino acids, which are likely to be crucial for those residues. In this study, we present a new method, the Quantum Coupled Mutation Finder (QCMF), that incorporates significant dis/similar amino acid pair signals in the prediction of functionally or structurally important sites. Results: The result of this study is twofold. First, using the essential sites of two human proteins, namely epidermal growth factor receptor (EGFR) and glucokinase (GCK), we tested the QCMF method. QCMF includes two metrics based on quantum Jensen-Shannon divergence to measure both sequence conservation and compensatory mutations. We found that QCMF reaches an improved performance in identifying essential sites from MSAs of both proteins, with a significantly higher Matthews correlation coefficient (MCC) value in comparison to previous methods. Second, using a data set of 153 proteins, we made a pairwise comparison between QCMF and three conventional methods. This comparison study strongly suggests that QCMF complements the conventional methods for the identification of correlated mutations in MSAs. Conclusions: QCMF utilizes the notion of entanglement, which is a major resource of quantum information, to model significant dissimilar and similar amino acid pair signals in the detection of functionally or structurally important sites. Our results suggest that, on the one hand, QCMF significantly outperforms the previous method, which mainly focuses on dissimilar amino acid signals, in detecting essential sites in proteins. On the other hand, it is complementary to the existing methods for the identification of correlated mutations. QCMF is computationally intensive; to ensure a feasible computation time, we leveraged Compute Unified Device Architecture (CUDA). The QCMF server is freely accessible at http://qcmf.informatik.uni-goettingen.de/.
    Categories: Journal Articles
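
    The divergence QCMF builds on can be stated compactly. Below is a minimal sketch of the quantum Jensen-Shannon divergence between two density matrices; how QCMF derives density matrices from amino acid pairs in an MSA is beyond this sketch, and the function names are illustrative.

    ```python
    import numpy as np

    def von_neumann_entropy(rho):
        """S(rho) = -Tr(rho log2 rho), computed from the eigenvalues."""
        w = np.linalg.eigvalsh(rho)
        w = w[w > 1e-12]                  # drop numerically zero eigenvalues
        return -np.sum(w * np.log2(w))

    def qjsd(rho, sigma):
        """Quantum Jensen-Shannon divergence between density matrices."""
        mix = (rho + sigma) / 2.0
        return von_neumann_entropy(mix) - 0.5 * (
            von_neumann_entropy(rho) + von_neumann_entropy(sigma))
    ```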
  • A local average distance descriptor for flexible protein structure comparison
    [Apr 2014]

    Background: Protein structures are flexible and often show conformational changes upon binding to other molecules to exert biological functions. As protein structures correlate with characteristic functions, structure comparison allows classification and prediction of proteins of undefined function. However, most comparison methods treat proteins as rigid bodies and cannot effectively retrieve similarities of proteins with large conformational changes. Results: In this paper, we propose a novel descriptor, local average distance (LAD), based on either geodesic distances (GDs) or Euclidean distances (EDs), for pairwise flexible protein structure comparison. The proposed method was compared with 7 structural alignment methods and 7 shape descriptors on two datasets comprising hinge bending motions from the MolMovDB, and the results show that our method outperformed all other methods at retrieving similar structures, as measured by precision-recall curves, retrieval success rate, R-precision, mean average precision and F1-measure. Conclusions: Both ED- and GD-based LAD descriptors are effective for searching deformed structures and overcome the problem of self-connection caused by a large bending motion. We have also demonstrated that the ED-based LAD is more robust than the GD-based descriptor. The proposed algorithm provides an alternative approach for searching structure databases, discovering previously unknown conformational relationships, and reorganizing protein structure classification.
    Categories: Journal Articles
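
    As a rough illustration of the flavor of such distance-based descriptors (not the published LAD definition, which also supports geodesic distances and a local averaging scheme), one can summarize a structure as each residue's average distance to all others:

    ```python
    import numpy as np

    def average_distance_profile(coords):
        """Per-residue average Euclidean distance to all other residues.

        coords: (n, 3) array of residue coordinates. Returns a 1-D profile
        two structures can be compared on; illustrative only.
        """
        coords = np.asarray(coords, float)
        d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        return d.mean(axis=1)
    ```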
  • The number of reduced alignments between two DNA sequences
    [Mar 2014]

    Background: In this study we consider DNA sequences as mathematical strings. Total and reduced alignments between two DNA sequences have been considered in the literature to measure their similarity. Explicit representations of some alignments have already been obtained. Results: We present exact, explicit and computable formulas for the number of different possible alignments between two DNA sequences, and a new formula for a class of reduced alignments. Conclusions: A unified approach to a wide class of alignments between two DNA sequences has been provided. The formula is computable and, if complemented by software development, will provide deeper insight into the theory of sequence alignment and give rise to new comparison methods.
    Categories: Journal Articles
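
    For context, the total number of alignments between sequences of lengths m and n satisfies the classical recurrence f(i,j) = f(i-1,j) + f(i,j-1) + f(i-1,j-1) (the Delannoy numbers), which a short dynamic program computes directly. The paper's reduced-alignment formulas impose further constraints not reproduced here.

    ```python
    def count_alignments(m, n):
        """Count all alignments of two sequences of lengths m and n.

        f(i,j) = f(i-1,j) + f(i,j-1) + f(i-1,j-1), with f(0,j) = f(i,0) = 1.
        """
        f = [[1] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                f[i][j] = f[i-1][j] + f[i][j-1] + f[i-1][j-1]
        return f[m][n]

    print(count_alignments(3, 3))  # 63 alignments for two length-3 sequences
    ```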
  • SPiCE: a web-based tool for sequence-based protein classification and exploration
    [Mar 2014]

    Background: Amino acid sequences and features extracted from such sequences have been used to predict many protein properties, such as subcellular localization or solubility, using classifier algorithms. Although software tools are available for both feature extraction and classifier construction, their application is not straightforward, requiring users to install various packages and to convert data into different formats. This lack of easily accessible software hampers quick, explorative use of sequence-based classification techniques by biologists. Results: We have developed the web-based software tool SPiCE for exploring sequence-based features of proteins in predefined classes. It offers data upload/download, sequence-based feature calculation, data visualization and protein classifier construction and testing in a single integrated, interactive environment. To illustrate its use, two example datasets are included showing the identification of differences in amino acid composition between proteins yielding low and high production levels in fungi and low and high expression levels in yeast, respectively. Conclusions: SPiCE is an easy-to-use online tool for extracting and exploring sequence-based features of sets of proteins, allowing non-experts to apply advanced classification techniques. The tool is available at http://helix.ewi.tudelft.nl/spice.
    Categories: Journal Articles
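
    A minimal sketch of the simplest kind of sequence-based feature SPiCE works with, the amino acid composition of a protein; the example sequence is an arbitrary placeholder.

    ```python
    from collections import Counter

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def aa_composition(seq):
        """Fraction of each of the 20 standard amino acids in a sequence."""
        counts = Counter(seq.upper())
        total = sum(counts[a] for a in AMINO_ACIDS)
        return {a: counts[a] / total for a in AMINO_ACIDS}

    print(aa_composition("MKVLAAGLLALA")["A"])  # alanine fraction: 4/12
    ```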
  • DAFS: a data-adaptive flag method for RNA-sequencing data to differentiate genes with low and high expression
    [Mar 2014]

    Background: Next-generation sequencing (NGS) has advanced the application of high-throughput sequencing technologies in genetic and genomic variation analysis. Due to its large dynamic range of expression levels, RNA-seq is better able to detect transcripts with low expression. It is clear that genes with no mapped reads are not expressed; however, there is ongoing debate about the level of abundance that constitutes biologically meaningful expression. To date, there is no consensus on the definition of low expression. Since random variation is high in regions with low expression, and distributions of transcript expression are affected by numerous experimental factors, methods to differentiate low from high expression in a sample are critical to interpreting classes of abundance levels in RNA-seq data. Results: A data-adaptive approach was developed to estimate the lower bound of high expression for RNA-seq data. The Kolmogorov-Smirnov statistic and multivariate adaptive regression splines were used to determine the optimal cutoff value separating transcripts with high and low expression. Results from the proposed method were compared to results obtained by estimating the theoretical cutoff of a fitted two-component mixture distribution. The robustness of the proposed method was demonstrated by analyzing RNA-seq datasets that varied by sequencing depth, species, scale of measurement, and empirical density shape. Conclusions: The analysis of real and simulated data presented here illustrates the need to employ data-adaptive methodology in lieu of arbitrary cutoffs to distinguish low from high expression in RNA-seq data. Our results also reveal the drawbacks of characterizing the data by a two-component mixture distribution when classes of gene expression are not well separated. The ability to ascertain stably expressed RNA-seq data is essential in the filtering step of data analysis, and methodologies that consider the underlying data structure demonstrate superior performance in preserving most of the interpretable and meaningful data. The proposed algorithm for classifying low and high regions of transcript abundance promises wide application in the continuing development of RNA-seq analysis.
    Categories: Journal Articles
  • Differential meta-analysis of RNA-seq data from multiple studies
    [Mar 2014]

    Background: High-throughput sequencing is now regularly used for studies of the transcriptome (RNA-seq), particularly for comparisons among experimental conditions. For the time being, a limited number of biological replicates are typically considered in such experiments, leading to low detection power for differential expression. As sequencing costs continue to decrease, it is likely that additional follow-up studies will be conducted to re-address the same biological question. Results: We demonstrate how p-value combination techniques previously used for microarray meta-analyses can be used for the differential analysis of RNA-seq data from multiple related studies. These techniques are compared to a negative binomial generalized linear model (GLM) including a fixed study effect, on simulated data and on real data from human melanoma cell lines. The GLM with fixed study effect performed well for low inter-study variation and small numbers of studies, but was outperformed by the meta-analysis methods for moderate to large inter-study variability and larger numbers of studies. Conclusions: The p-value combination techniques illustrated here are a valuable tool for performing differential meta-analyses of RNA-seq data by appropriately accounting for biological and technical variability within studies as well as additional study-specific effects. An R package, metaRNASeq, is available on R-Forge.
    Categories: Journal Articles
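
    The two classical combination rules in question are easy to state. Below is a minimal sketch (in Python rather than the package's R) of Fisher's method and the weighted inverse-normal (Stouffer) method, assuming one-sided p-values from independent studies.

    ```python
    import numpy as np
    from scipy import stats

    def fisher_combine(pvals):
        """Fisher's method: -2 * sum(log p) ~ chi2 with 2k df under H0."""
        stat = -2.0 * np.sum(np.log(pvals))
        return stats.chi2.sf(stat, df=2 * len(pvals))

    def inverse_normal_combine(pvals, weights=None):
        """Stouffer / inverse-normal combination; weights are often
        chosen as sqrt(study sample size)."""
        z = stats.norm.isf(np.asarray(pvals, float))   # one-sided z-scores
        w = np.ones_like(z) if weights is None else np.asarray(weights, float)
        return stats.norm.sf(np.sum(w * z) / np.sqrt(np.sum(w ** 2)))

    print(fisher_combine([0.01, 0.03]))   # combined evidence from two studies
    ```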
  • Consistency of metagenomic assignment programs in simulated and real data
    [Mar 2014]

    Background: Metagenomics is the genomic study of uncultured environmental samples, which has been greatly facilitated by the advent of shotgun-sequencing technologies. One of the main focuses of metagenomics is the discovery of previously uncultured microorganisms, which makes the assignment of sequences to a particular taxon both a challenge and a crucial step. Recently, several methods have been developed to perform this task, based on different methodologies such as sequence composition or sequence similarity. The sequence composition methods have the ability to assign the whole dataset completely. However, their use in metagenomics and the study of their performance with real data is limited. In this work, we assess the consistency of three different methods (BLAST + Lowest Common Ancestor, Phymm, and Naïve Bayesian Classifier) in assigning real and simulated sequence reads. Results: In both real and simulated data, BLAST + Lowest Common Ancestor (BLAST + LCA), Phymm, and Naïve Bayesian Classifier consistently assign a larger number of reads at higher taxonomic levels than at lower levels, and discrepancies increase at lower taxonomic levels. In simulated data, assignments on which all three methods agreed showed greater precision than assignments based on Phymm or the Bayesian Classifier alone, with the BLAST + LCA algorithm performing best. In addition, assignment consistency in real data increased with sequence read length, in agreement with previously published simulation results. Conclusions: The use and combination of different approaches is advisable when assigning metagenomic reads. Although sensitivity may be reduced, reliability can be increased by using only reads consistently assigned to the same taxa by at least two methods, and by training the programs using all available information.
    Categories: Journal Articles
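
    For readers unfamiliar with the BLAST + LCA step, a minimal sketch of lowest-common-ancestor assignment follows; the lineage lists are illustrative, and real pipelines work with NCBI taxonomy identifiers rather than rank names.

    ```python
    def lowest_common_ancestor(lineages):
        """Assign a read to the deepest taxon shared by the lineages
        (root-to-leaf lists) of all its good BLAST hits."""
        lca = None
        for ranks in zip(*lineages):          # walk down rank by rank
            if len(set(ranks)) == 1:
                lca = ranks[0]
            else:
                break
        return lca

    hits = [["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"],
            ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"]]
    print(lowest_common_ancestor(hits))  # -> 'Gammaproteobacteria'
    ```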
  • TSSAR: TSS annotation regime for dRNA-seq data
    [Mar 2014]

    Background: Differential RNA sequencing (dRNA-seq) is a high-throughput screening technique designed to examine the architecture of bacterial operons in general and the precise position of transcription start sites (TSS) in particular. Hitherto, dRNA-seq data were analyzed by visualizing the sequencing reads mapped to the reference genome and manually annotating reliable positions. This is very labor intensive and, due to the subjectivity, biased. Results: Here, we present TSSAR, a tool for automated de novo TSS annotation from dRNA-seq data that respects the statistics of dRNA-seq libraries. TSSAR uses the premise that the number of sequencing reads starting at a certain genomic position within a transcriptionally active region follows a Poisson distribution with a parameter that depends on the local strength of expression. The difference of two dRNA-seq library counts thus follows a Skellam distribution. This provides a statistical basis to identify significantly enriched primary transcripts. Conclusions: Having an automated and efficient tool for analyzing dRNA-seq data facilitates the use of the dRNA-seq technique and promotes its application to more sophisticated analyses. For instance, monitoring the plasticity and dynamics of the transcriptomal architecture triggered by different stimuli and growth conditions becomes possible. The main asset of a novel tool for dRNA-seq analysis that reaches out to a broad user community is usability. As such, we provide TSSAR both as an intuitive RESTful Web service (http://rna.tbi.univie.ac.at/TSSAR), together with a set of post-processing and analysis tools, and as a stand-alone version for use in high-throughput dRNA-seq data analysis pipelines.
    Categories: Journal Articles
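
    The statistical premise translates directly into code. A minimal sketch, assuming the local Poisson rates of the two libraries are already estimated; how TSSAR estimates them, and its pre- and post-processing, are not reproduced here.

    ```python
    from scipy import stats

    def tss_enrichment_pvalue(k_treated, k_untreated, mu1, mu2):
        """If the read-start counts of the two dRNA-seq libraries at a
        position are Poisson(mu1) and Poisson(mu2), their difference
        follows a Skellam(mu1, mu2) distribution; the p-value asks how
        extreme the observed difference is."""
        diff = k_treated - k_untreated
        return stats.skellam.sf(diff - 1, mu1, mu2)   # P(D >= diff)

    # e.g. local rates 5 and 5, observed counts 30 vs 4 at one position
    print(tss_enrichment_pvalue(30, 4, 5.0, 5.0))
    ```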
  • Multi-model inference using mixed effects from a linear regression based genetic algorithm
    [Mar 2014]

    Background: Different high-dimensional regression methodologies exist for the selection of variables to predict a continuous variable. To improve variable selection when clustered observations are present in the training data, an extension towards mixed-effects modeling (MM) is needed, but may not always be straightforward to implement. In this article, we developed such an MM extension (GA-MM-MMI) for automated variable selection by a linear regression based genetic algorithm (GA) using multi-model inference (MMI). We exemplify our approach by training a linear regression model for prediction of resistance to the integrase inhibitor Raltegravir (RAL) on a genotype-phenotype database, with many integrase mutations as candidate covariates. The genotype-phenotype pairs in this database were derived from a limited number of subjects, with multiple data points from the same subject, and with an intra-class correlation of 0.92. Results: In generating the RAL model, we took computational efficiency into account by optimizing the GA parameters one by one and by using tournament selection. To derive the main GA parameters we used three repeats of 5-fold cross-validation. The number of integrase mutations to be used as covariates in the mixed-effects models was set to 25 (chrom.size), and a GA solution was accepted when the mixed-model R2 exceeded 0.95 (goal.fitness). We tested three different MMI approaches to combine the results of 100 GA solutions into one GA-MM-MMI model. When evaluating GA-MM-MMI performance on two unseen data sets, a more parsimonious and interpretable model was found (GA-MM-MMI TOP18: a mixed-effects model containing the 18 most prevalent mutations in the GA solutions, refitted on the training data) with better predictive accuracy (R2) than GA-ordinary least squares (GA-OLS) and the Least Absolute Shrinkage and Selection Operator (LASSO). Conclusions: We have demonstrated improved performance when using GA-MM-MMI for selection of mutations on a genotype-phenotype data set. As we have largely automated setting the GA parameters, the method should be applicable to similar datasets with clustered observations.
    Categories: Journal Articles
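
    As a small illustration of the tournament selection mentioned above (a generic GA building block, not the authors' implementation; the names and the candidate-set representation are assumptions):

    ```python
    import random

    def tournament_select(population, fitness, k=3):
        """Pick one parent: sample k candidate variable subsets at random
        and return the one with the best fitness (e.g. cross-validated R2)."""
        contenders = random.sample(range(len(population)), k)
        return population[max(contenders, key=lambda i: fitness[i])]

    # population: list of candidate covariate subsets (e.g. 25 mutations each)
    # fitness:    matching list of model-fit scores for those subsets
    ```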
  • Estimation of protein function using template-based alignment of enzyme active sites
    [Mar 2014]

    Background: The accumulation of protein structural data occurs more rapidly than it can be characterized by traditional laboratory means. This has motivated widespread efforts to predict enzyme function computationally. The most useful and accurate strategies employed to date are based on detecting motifs in novel structures that correspond to a specific function. Functional residues are critical components of predictively useful motifs. We have implemented a novel method that complements current approaches by detecting motifs solely on the basis of distance restraints between catalytic residues. Results: ProMOL is a plugin for the PyMOL molecular graphics environment that can be used to create active site motifs for enzymes. A library of 181 active site motifs has been created with ProMOL, based on definitions published in the Catalytic Site Atlas (CSA). Searches with ProMOL produce useful Enzyme Commission (EC) class suggestions in better than 50% of cases for level 1 searches in EC classes 1, 4 and 5, and produce some useful results for other classes. A set of 261 additional motifs automatically translated from Jonathan Barker's JESS motif set [Bioinformatics 19:1644-1649, 2003] and a set of NMR motifs are under development. Alignments are evaluated by visual superposition, Levenshtein distance and root-mean-square deviation (RMSD), and are reasonably consistent with related search methods. Conclusion: The ProMOL plugin for PyMOL provides ready access to template-based local alignments. Recent improvements to ProMOL, including the expanded motif library, RMSD calculations and output selection formatting, have greatly increased the program's usability and speed and have improved the way results are presented.
    Categories: Journal Articles
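
    The RMSD evaluation step is standard and worth making concrete. A minimal sketch of RMSD after optimal superposition (the Kabsch algorithm), assuming the motif and candidate-site atoms are already matched one-to-one; this is a generic implementation, not ProMOL's own code.

    ```python
    import numpy as np

    def kabsch_rmsd(P, Q):
        """RMSD between (n, 3) coordinate sets after optimal rotation."""
        P = P - P.mean(axis=0)                    # center both point sets
        Q = Q - Q.mean(axis=0)
        V, S, Wt = np.linalg.svd(P.T @ Q)
        d = np.sign(np.linalg.det(V @ Wt))        # avoid improper rotation
        R = V @ np.diag([1.0, 1.0, d]) @ Wt       # optimal rotation matrix
        diff = P @ R - Q
        return np.sqrt((diff ** 2).sum() / len(P))
    ```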
  • A graph theoretic approach to utilizing protein structure to identify non-random somatic mutations
    [Mar 2014]

    Background: It is well known that the development of cancer is caused by the accumulation of somatic mutations within the genome. For oncogenes specifically, current research suggests that there is a small set of "driver" mutations that are primarily responsible for tumorigenesis. Further, due to some recent pharmacological successes in treating these driver mutations and their resulting tumors, a variety of methods have been developed to identify potential driver mutations using methods such as machine learning and mutational clustering. We propose a novel methodology that increases our power to identify mutational clusters by taking into account protein tertiary structure via a graph theoretical approach. Results: We have designed and implemented GraphPAC (Graph Protein Amino acid Clustering) to identify mutational clustering while considering protein spatial structure. Using GraphPAC, we are able to detect novel clusters in proteins that are known to exhibit mutation clustering as well as identify clusters in proteins without evidence of prior clustering based on current methods. Specifically, by utilizing the spatial information available in the Protein Data Bank (PDB) along with the mutational data in the Catalogue of Somatic Mutations in Cancer (COSMIC), GraphPAC identifies new mutational clusters in well known oncogenes such as EGFR and KRAS. Further, by utilizing graph theory to account for the tertiary structure, GraphPAC discovers clusters in DPP4, NRP1 and other proteins not identified by existing methods. The R package is available at: http://bioconductor.org/packages/release/bioc/html/GraphPAC.html. Conclusion: GraphPAC provides an alternative to iPAC and an extension to current methodology when identifying potential activating driver mutations by utilizing a graph theoretic approach when considering protein tertiary structure.
    Categories: Journal Articles
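
    To convey the graph-theoretic step, here is a rough sketch of linearizing a protein by spatial proximity; GraphPAC solves this reordering as a traveling-salesman problem with insertion heuristics, whereas the greedy nearest-neighbour walk below is only an illustrative stand-in. Mutation clustering can then be run on the remapped linear order.

    ```python
    import numpy as np

    def nearest_neighbor_order(coords, start=0):
        """Reorder residues so spatially adjacent residues become adjacent
        in the ordering, via a greedy walk over pairwise 3D distances."""
        coords = np.asarray(coords, float)
        d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        unvisited = set(range(len(coords))) - {start}
        order = [start]
        while unvisited:
            nxt = min(unvisited, key=lambda j: d[order[-1], j])
            order.append(nxt)
            unvisited.remove(nxt)
        return order
    ```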
  • ISAAC - InterSpecies Analysing Application using Containers
    [Jan 2014]

    Background: Information about genes, transcripts and proteins is spread over a wide variety of databases. Different tools have been developed using these databases to identify biological signals in gene lists from large-scale analyses. Most of them search for enrichment of specific features, but they do not allow an explorative walk through different views or modification of the gene lists as new hypotheses emerge. Results: To fill this niche, we have developed ISAAC, the InterSpecies Analysing Application using Containers. The central idea of this web-based tool is to enable the analysis of sets of genes, transcripts and proteins under different biological viewpoints and to interactively modify these sets at any point of the analysis. Detailed history and snapshot information allows tracing of each action, and one can easily switch back to previous states and perform new analyses. Currently, sets can be viewed in the context of genomes, protein functions, protein interactions, pathways, regulation, diseases and drugs. Additionally, users can switch between species with an automatic, orthology-based translation of existing gene sets. As today's research is usually performed in larger teams and consortia, ISAAC provides group-based functionality: sets as well as analysis results can be exchanged between members of a group. Conclusions: ISAAC fills the gap between primary databases and tools for the analysis of large gene lists. With its highly modular, JavaEE-based design, the implementation of new modules is straightforward. Furthermore, ISAAC comes with an extensive web-based administration interface, including tools for the integration of third-party data, so a local installation is easily feasible. In summary, ISAAC is tailor-made for highly explorative, interactive analyses of gene, transcript and protein sets in a collaborative environment.
    Categories: Journal Articles
  • Large-scale combining signals from both biomedical literature and the FDA Adverse Event Reporting System (FAERS) to improve post-marketing drug safety signal detection
    [Jan 2014]

    Background: Independent data sources can be used to augment post-marketing drug safety signal detection. The vast amount of publicly available biomedical literature contains rich side effect information for drugs at all clinical stages. In this study, we present a large-scale signal boosting approach that combines over 4 million records in the US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) with over 21 million biomedical articles. Results: The datasets comprise 4,285,097 records from FAERS and 21,354,075 MEDLINE articles. We first extracted all drug-side effect (SE) pairs from FAERS. Our study implemented a total of seven signal ranking algorithms. We then compared these ranking algorithms before and after they were boosted with signals from MEDLINE sentences or abstracts. Finally, we manually curated all drug-cardiovascular (CV) pairs that appeared in both data sources and investigated whether our approach can detect true signals that have not been included in FDA drug labels. We extracted a total of 2,787,797 drug-SE pairs from FAERS, with a low initial precision of 0.025. Combining signals from both FAERS and MEDLINE significantly improved the precision from 0.025 to 0.371 for top-ranked pairs, a 13.8-fold elevation in precision. We showed by manual curation that drug-SE pairs appearing in both data sources were highly enriched with true signals, many of which have not yet been included in FDA drug labels. Conclusions: We have developed an efficient and effective drug safety signal ranking and strengthening approach. We demonstrate that combining information from FAERS and the biomedical literature at scale can contribute significantly to drug safety surveillance.
    Categories: Journal Articles
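
    For orientation, disproportionality statistics are the usual starting point for ranking drug-SE pairs in spontaneous-report data. The proportional reporting ratio sketched below is one standard example, not necessarily among the seven algorithms the study implemented.

    ```python
    def proportional_reporting_ratio(a, b, c, d):
        """PRR from a 2x2 contingency table of FAERS-style reports.

        a: reports with the drug and the side effect (SE)
        b: reports with the drug, without the SE
        c: reports of other drugs with the SE
        d: reports of other drugs without the SE
        """
        return (a / (a + b)) / (c / (c + d))

    print(proportional_reporting_ratio(10, 990, 100, 99900))  # PRR = 10.0
    ```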
  • Joint probabilistic-logical refinement of multiple protein feature predictors
    [Jan 2014]

    Background: Computational methods for the prediction of protein features from sequence are a long-standing focus of bioinformatics. A key observation is that several protein features are closely inter-related, that is, they are conditioned on each other. Researchers invested a lot of effort into designing predictors that exploit this fact. Most existing methods leverage inter-feature constraints by including known (or predicted) correlated features as inputs to the predictor, thus conditioning the result. Results: By including correlated features as inputs, existing methods only rely on one side of the relation: the output feature is conditioned on the known input features. Here we show how to jointly improve the outputs of multiple correlated predictors by means of a probabilistic-logical consistency layer. The logical layer enforces a set of weighted first-order rules encoding biological constraints between the features, and improves the raw predictions so that they least violate the constraints. In particular, we show how to integrate three stand-alone predictors of correlated features: subcellular localization (Loctree [J Mol Biol 348:85-100, 2005]), disulfide bonding state (Disulfind [Nucleic Acids Res 34:W177-W181, 2006]), and metal bonding state (MetalDetector [Bioinformatics 24:2094-2095, 2008]), in a way that takes into account the respective strengths and weaknesses, and does not require any change to the predictors themselves. We also compare our methodology against two alternative refinement pipelines based on state-of-the-art sequential prediction methods. Conclusions: The proposed framework is able to improve the performance of the underlying predictors by removing rule violations. We show that different predictors offer complementary advantages, and our method is able to integrate them using non-trivial constraints, generating more consistent predictions. In addition, our framework is fully general, and could in principle be applied to a vast array of heterogeneous predictions without requiring any change to the underlying software. On the other hand, the alternative strategies are more specific and tend to favor one task at the expense of the others, as shown by our experimental evaluation. The ultimate goal of our framework is to seamlessly integrate full prediction suites, such as Distill [BMC Bioinformatics 7:402, 2006] and PredictProtein [Nucleic Acids Res 32:W321-W326, 2004].
    Categories: Journal Articles
  • OncomiRdbB: a comprehensive database of microRNAs and their targets in breast cancer
    [Jan 2014]

    Background: Given the estimate that 30% of our genes are controlled by microRNAs, it is essential that we understand the precise relationship between microRNAs and their targets. OncomiRs are microRNAs (miRNAs) that have frequently been shown to be deregulated in cancer. However, although several oncomiRs have been identified and characterized, there is as yet no comprehensive compilation of this data, which has rendered it underutilized by cancer biologists. There is therefore an unmet need for bioinformatic platforms to speed the identification of novel therapeutic targets. Description: We describe here OncomiRdbB, a comprehensive database of oncomiRs mined from different existing databases for mouse and humans, along with novel oncomiRs that we have validated in human breast cancer samples. The database also lists their respective predicted targets, identified using miRanda, along with their IDs, sequences, chromosome location and detailed description. The database can be queried by search strings including microRNA name, sequence, accession number, target genes and organisms. The microRNA networks and their hubs, with respective targets at the 3'UTR, 5'UTR and exons of different pathway genes, were also deciphered using the 'R' algorithm. Conclusion: OncomiRdbB is a comprehensive and integrated database of oncomiRs and their targets in breast cancer, with multiple query options that will help enhance both understanding of the biology of breast cancer and the development of new and innovative microRNA-based diagnostic tools and targets of therapeutic significance. OncomiRdbB is freely available for download at http://tdb.ccmb.res.in/OncomiRdbB/index.htm
    Categories: Journal Articles
  • Fold change rank ordering statistics: a new method for detecting differentially expressed genes
    [Jan 2014]

    Background: Different methods have been proposed for identifying differentially expressed (DE) genes in microarray data. Methods based on statistical tests that incorporate expression level variability are used more commonly than those based on fold change (FC). However, FC based results are more reproducible and biologically relevant. Results: We propose a new method based on fold change rank ordering statistics (FCROS). We exploit the variation in calculated FC levels using combinatorial pairs of biological conditions in the datasets. A statistic is associated with the ranks of the FC values for each gene, and the resulting probability is used to identify the DE genes within an error level. The FCROS method is deterministic, has a low computational runtime, and also avoids the multiple testing problem that usually arises with microarray datasets. Conclusion: We compared the performance of FCROS with that of other methods using synthetic and real microarray datasets. We found that FCROS is well suited for DE gene identification from noisy datasets when compared with existing FC based methods.
    Categories: Journal Articles
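
    A rough sketch of the rank-ordering idea, following the abstract rather than the paper's exact estimator; the function name, normalization, and one-sided sign convention are assumptions, and positive expression values are required.

    ```python
    import numpy as np
    from itertools import product
    from scipy import stats

    def fcros_scores(X, Y):
        """For every (control, test) sample pair, rank genes by fold change;
        each gene's average rank fraction across all pairs is approximately
        normal, yielding a probability for calling differential expression.
        X, Y: (genes, samples) arrays for the two conditions."""
        n_genes = X.shape[0]
        ranks = []
        for i, j in product(range(X.shape[1]), range(Y.shape[1])):
            fc = Y[:, j] / X[:, i]                       # pairwise fold changes
            ranks.append(stats.rankdata(fc) / n_genes)   # rank fractions in (0, 1]
        avg = np.mean(ranks, axis=0)
        z = (avg - avg.mean()) / avg.std()
        return stats.norm.sf(z)   # small values flag up-regulated genes
    ```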
  • gsGator: an integrated web platform for cross-species gene set analysis
    [Jan 2014]

    Background: Gene set analysis (GSA) is useful for deducing the biological significance of gene lists using a priori defined gene sets such as gene ontology (GO) terms or pathways. Phenotypic annotation is sparse for human genes, but is far more abundant for other model organisms such as mouse, fly, and worm. Often, GSA needs to be done highly interactively, by combining or modifying gene lists or inspecting gene-gene interactions in a molecular network. Description: We developed gsGator, a web-based platform for functional interpretation of gene sets with useful features such as cross-species GSA, simultaneous analysis of multiple gene sets, and a fully integrated network viewer for visualizing both GSA results and molecular networks. An extensive set of gene annotation information is amassed, including GO & pathways, genomic annotations, protein-protein interactions, transcription factor-target (TF-target) relations, miRNA targeting, and phenotype information for various model organisms. By combining the functionalities of Set Creator, Set Operator and Network Navigator, users can perform highly flexible and interactive GSA, creating a new gene list from any combination of existing gene sets (intersection, union and difference) or expanding genes interactively along molecular networks such as protein-protein interaction and TF-target relations. We also demonstrate the utility of interactive and cross-species GSA in gsGator with several usage examples for interpreting genome-wide association study (GWAS) results. gsGator is freely available at http://gsGator.ewha.ac.kr. Conclusions: Interactive and cross-species GSA in gsGator greatly extends the scope and utility of GSA, leading to novel insights via conserved functional gene modules across different species.
    Categories: Journal Articles
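
    The Set Operator combinations correspond directly to ordinary set algebra; a tiny illustration (the gene symbols are placeholders):

    ```python
    # Build a new gene list from existing sets before running GSA.
    upregulated = {"TP53", "BRCA1", "MYC", "EGFR"}
    gwas_hits   = {"EGFR", "MYC", "FTO"}

    print(upregulated & gwas_hits)   # intersection: genes in both sets
    print(upregulated | gwas_hits)   # union: genes in either set
    print(upregulated - gwas_hits)   # difference: up-regulated, not GWAS hits
    ```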
  • GIANT: pattern analysis of molecular interactions in 3D structures of protein-small ligand complexes
    [Jan 2014]

    Background: Interpretation of the binding modes of protein-small ligand complexes from 3D structure data is essential for understanding selective ligand recognition by proteins. It is often performed by visual inspection and sometimes depends largely on a priori knowledge about typical interactions such as hydrogen bonds and pi-pi stacking. Because this can introduce biases due to scientists' subjective perspectives, more objective viewpoints considering a wide range of interactions are required. Description: In this paper, we present a web server for analyzing protein-small ligand interactions on the basis of patterns of atomic contacts, or "interaction patterns", obtained from statistical analyses of 3D structures of protein-ligand complexes in our previous study. This server can guide visual inspection by providing information about interaction patterns for each atomic contact in 3D structures. Users can visually investigate which atomic contacts in user-specified 3D structures of protein-small ligand complexes are statistically overrepresented. The server consists of two main components: "Complex Analyzer" and "Pattern Viewer". The former provides a 3D structure viewer with annotations of interacting amino acid residues, ligand atoms, and interacting pairs of these. In the annotations of interacting pairs, the interaction pattern assigned to each contact and the statistical preferences of the patterns are presented. "Pattern Viewer" provides details of each interaction pattern: users can see visual representations of the probability density functions of interactions and a list of protein-ligand complexes showing similar interactions. Conclusions: Using our new web server, GIANT, users can interactively analyze protein-small ligand binding modes with statistically determined interaction patterns rather than relying on a priori knowledge. It is freely available at http://giant.hgc.jp/.
    Categories: Journal Articles
  • kruX: Matrix-based non-parametric eQTL discovery
    [Jan 2014]

    Background: The Kruskal-Wallis test is a popular non-parametric statistical test for identifying expression quantitative trait loci (eQTLs) from genome-wide data due to its robustness against variations in the underlying genetic model and expression trait distribution, but testing billions of marker-trait combinations one by one can become computationally prohibitive. Results: We developed kruX, an algorithm implemented in Matlab, Python and R that uses matrix multiplications to simultaneously calculate the Kruskal-Wallis test statistic for several million marker-trait combinations at once. kruX is more than ten thousand times faster than computing associations one by one on a typical human dataset. We used kruX and a dataset of more than 500k SNPs and 20k expression traits measured in 102 human blood samples to compare eQTLs detected by the Kruskal-Wallis test to eQTLs detected by the parametric ANOVA and linear model methods. We found that the Kruskal-Wallis test is more robust against data outliers and heterogeneous genotype group sizes and detects a higher proportion of non-linear associations, but is more conservative in calling additive linear associations. Conclusion: kruX enables the use of robust non-parametric methods for massive eQTL mapping without the need for a high-performance computing infrastructure and is freely available from http://krux.googlecode.com.
    Categories: Journal Articles
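
    The matrix trick generalizes the textbook Kruskal-Wallis formula H = 12/(n(n+1)) * sum_g R_g^2/n_g - 3(n+1), where R_g is the rank sum within genotype group g. A minimal dense-matrix sketch follows; kruX itself adds tie and missing-data corrections plus sparse-matrix optimizations, and the sketch assumes complete 0/1/2 genotype data with no ties.

    ```python
    import numpy as np
    from scipy import stats

    def kruskal_wallis_matrix(E, G):
        """Kruskal-Wallis statistics for all trait x marker pairs at once.

        E: expression (traits x samples); G: genotypes (markers x samples),
        coded 0/1/2. Returns a (traits x markers) matrix of H statistics,
        ~ chi2(df = groups - 1) under the null.
        """
        n = E.shape[1]
        R = np.apply_along_axis(stats.rankdata, 1, E)   # rank each trait's samples
        H = np.zeros((E.shape[0], G.shape[0]))
        for g in (0, 1, 2):
            I = (G == g).T.astype(float)    # samples x markers group indicator
            counts = I.sum(axis=0)          # group size per marker
            S = R @ I                       # rank sums per trait/marker pair
            with np.errstate(divide="ignore", invalid="ignore"):
                H += np.where(counts > 0, S ** 2 / counts, 0.0)
        return 12.0 / (n * (n + 1)) * H - 3.0 * (n + 1)
    ```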