Bioinformatics Journal

Bioinformatics - RSS feed of current issue
  • ThunderSTORM: a comprehensive ImageJ plug-in for PALM and STORM data analysis and super-resolution imaging
    [Aug 2014]

    Summary: ThunderSTORM is an open-source, interactive and modular plug-in for ImageJ designed for automated processing, analysis and visualization of data acquired by single-molecule localization microscopy methods such as photo-activated localization microscopy and stochastic optical reconstruction microscopy. ThunderSTORM offers an extensive collection of processing and post-processing methods so that users can easily adapt the process of analysis to their data. ThunderSTORM also offers a set of tools for creation of simulated data and quantitative performance evaluation of localization algorithms using Monte Carlo simulations.

    Availability and implementation: ThunderSTORM and the online documentation are both freely accessible at https://code.google.com/p/thunder-storm/

    Contact: guy.hagen@lf1.cuni.cz

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Retraction:
    [Aug 2014]

    Categories: Journal Articles
  • Circular RNAs are depleted of polymorphisms at microRNA binding sites
    [Aug 2014]

    Motivation: Circular RNAs (circRNAs) are an abundant class of highly stable RNAs that can affect gene regulation by binding and preventing microRNAs (miRNAs) from regulating their messenger RNA (mRNA) targets. Mammals have thousands of circRNAs with predicted miRNA binding sites, but only two circRNAs have been verified as being actual miRNA sponges. As it is unclear whether these thousands of predicted miRNA binding sites are functional, we investigated whether miRNA seed sites within human circRNAs are under selective pressure.

    Results: Using SNP data from the 1000 Genomes Project, we found a significant decrease in SNP density at miRNA seed sites compared with flanking sequences and random sites. This decrease was similar to that of miRNA seed sites in 3' untranslated regions, suggesting that many of the predicted miRNA binding sites in circRNAs are functional and under similar selective pressure as miRNA binding sites in mRNAs.

    Contact: pal.satrom@ntnu.no

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
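
The comparison in the circRNA entry above comes down to SNP density (SNPs per base) at seed sites versus flanking sequence. The following is a minimal sketch of that quantity only. The coordinates, the 10-nt flank width and the helper name `snp_density` are illustrative assumptions, not taken from the article, and the article's significance testing is not reproduced.

```python
from bisect import bisect_left

def snp_density(snp_positions, intervals):
    """SNPs per base over half-open [start, end) intervals.

    snp_positions must be sorted; intervals are assumed non-overlapping.
    """
    total_bases = 0
    total_snps = 0
    for start, end in intervals:
        total_bases += end - start
        # Count positions p with start <= p < end using binary search.
        total_snps += bisect_left(snp_positions, end) - bisect_left(snp_positions, start)
    return total_snps / total_bases if total_bases else 0.0

# Toy data (hypothetical): one 7-nt seed site with a 10-nt flank on each side.
snps = sorted([95, 98, 103, 112, 115, 119])
seed_sites = [(100, 107)]
flanks = [(90, 100), (107, 117)]

seed_density = snp_density(snps, seed_sites)
flank_density = snp_density(snps, flanks)
print(f"seed: {seed_density:.3f} SNPs/base, flank: {flank_density:.3f} SNPs/base")
```

A lower density at seed sites than in flanks is the kind of signal the abstract describes; whether a difference is significant needs a proper test over many sites.
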
  • ReadXplorer--visualization and analysis of mapped sequences
    [Aug 2014]

    Motivation: Fast algorithms and well-arranged visualizations are required for the comprehensive analysis of the ever-growing size of genomic and transcriptomic next-generation sequencing data.

    Results: ReadXplorer is a software tool offering straightforward visualization and extensive analysis functions for genomic and transcriptomic DNA sequences mapped on a reference. A unique specialty of ReadXplorer is the quality classification of the read mappings. It is incorporated in all analysis functions and displayed in ReadXplorer's various synchronized data viewers for (i) the reference sequence, its base coverage as (ii) normalizable plot and (iii) histogram, (iv) read alignments and (v) read pairs. ReadXplorer's analysis capability covers RNA secondary structure prediction, single nucleotide polymorphism and deletion–insertion polymorphism detection, genomic feature and general coverage analysis. Especially for RNA-Seq data, it offers differential gene expression analysis, transcription start site and operon detection as well as RPKM value and read count calculations. Furthermore, ReadXplorer can combine or superimpose coverage of different datasets.

    Availability and implementation: ReadXplorer is available as open-source software at http://www.readxplorer.org along with a detailed manual.

    Contact: rhilker@mikrobio.med.uni-giessen.de

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
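
The ReadXplorer entry above mentions RPKM values. As a reminder of what that number means, here is a small sketch of the commonly used definition (reads per kilobase of feature per million mapped reads). The function name and the counts are hypothetical, and this is not ReadXplorer's own code, whose details may differ.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million).
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)

# Hypothetical counts: 500 reads on a 2,000 bp gene, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)
```
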
  • Multiscale DNA partitioning: statistical evidence for segments
    [Aug 2014]

    Motivation: DNA segmentation, i.e. the partitioning of DNA into compositionally homogeneous segments, is a basic task in bioinformatics. Different algorithms have been proposed for various partitioning criteria such as Guanine/Cytosine (GC) content, local ancestry in population genetics or copy number variation. A critical component of any such method is the choice of an appropriate number of segments. Some methods use model selection criteria and do not provide a suitable error control. Other methods that are based on simulating a statistic under a null model provide suitable error control only if the correct null model is chosen.

    Results: Here, we focus on partitioning with respect to GC content and propose a new approach that provides statistical error control: as in statistical hypothesis testing, it guarantees with a user-specified probability that the number of identified segments does not exceed the number of actually present segments. The method is based on a statistical multiscale criterion, making it a segmentation method that searches for segments of any length (on all scales) simultaneously. It is also accurate in localizing segments: under benchmark scenarios, our approach leads to a segmentation that is more accurate than the approaches discussed in the comparative review of Elhaik et al. In our real data examples, we find segments that often correspond well to features taken from standard University of California at Santa Cruz (UCSC) genome annotation tracks.

    Availability and implementation: Our method is implemented in function smuceR of the R-package stepR available at http://www.stochastik.math.uni-goettingen.de/smuce.

    Contact: andreas.futschik@jku.at or thomas.hotz@tu-ilmenau.de

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
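
The segmentation entry above partitions DNA by GC content. The sketch below only computes the raw signal such a method works on, the GC fraction in fixed windows. It is not the multiscale method of the article and does not call smuceR or stepR; the window size, the toy sequence and the function name are assumptions made for illustration.

```python
def gc_fraction_windows(sequence, window):
    """GC fraction of consecutive non-overlapping windows (the last may be shorter)."""
    if window <= 0:
        raise ValueError("window must be positive")
    seq = sequence.upper()
    fractions = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = sum(1 for base in chunk if base in "GC")
        fractions.append(gc / len(chunk))
    return fractions

# Toy sequence with three compositionally different stretches.
toy = "ATATATATAT" + "GCGCGCGCGC" + "ATGCATGCAT"
profile = gc_fraction_windows(toy, 10)
print(profile)
```

A fixed window size is exactly the kind of single-scale choice that a multiscale criterion avoids, so this profile should be read as input data, not as a segmentation.
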
  • Detecting clustering and ordering binding patterns among transcription factors via point process models
    [Aug 2014]

    Motivation: Recent development in ChIP-Seq technology has generated binding data for many transcription factors (TFs) in various cell types and cellular conditions. This opens great opportunities for studying combinatorial binding patterns among a set of TFs active in a particular cellular condition, which is a key component for understanding the interaction between TFs in gene regulation.

    Results: As a first step to the identification of combinatorial binding patterns, we develop statistical methods to detect clustering and ordering patterns among binding sites (BSs) of a pair of TFs. Testing procedures based on Ripley’s K-function and its generalizations are developed to identify binding patterns from large collections of BSs in ChIP-Seq data. We have applied our methods to the ChIP-Seq data of 91 pairs of TFs in mouse embryonic stem cells. Our methods have detected clustering binding patterns between most TF pairs, which is consistent with the findings in the literature, and have identified significant ordering preferences, relative to the direction of target gene transcription, among the BSs of seven TFs. More interestingly, our results demonstrate that the identified clustering and ordering binding patterns between TFs are associated with the expression of the target genes. These findings provide new insights into co-regulation between TFs.

    Availability and implementation: See ‘www.stat.ucla.edu/~zhou/TFKFunctions/’ for source code.

    Contact: zhou@stat.ucla.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
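
The transcription factor entry above builds its tests on Ripley's K-function. To make the idea concrete, here is a naive cross-type K estimate for two sets of positions on a line. It omits edge correction and is not the article's generalized statistic or testing procedure. The positions, region length and function name are hypothetical.

```python
def cross_k_1d(sites_a, sites_b, r, region_length):
    """Naive cross-type K estimate at distance r on a line of the given length.

    Counts pairs (a, b) with |a - b| <= r and rescales by the region length
    and the two sample sizes. No edge correction is applied.
    """
    if not sites_a or not sites_b:
        raise ValueError("both site lists must be non-empty")
    close_pairs = sum(1 for a in sites_a for b in sites_b if abs(a - b) <= r)
    return region_length * close_pairs / (len(sites_a) * len(sites_b))

# Hypothetical binding-site positions for two TFs on a 1,000 bp region.
tf1 = [100, 500, 900]
tf2 = [110, 520, 880]
k = cross_k_1d(tf1, tf2, 50, 1000)
print(k, "vs. baseline", 2 * 50)
```

For independent, uniformly scattered sites on a line the expected value is about 2r, so an estimate well above 2r points toward clustering of the two site sets.
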
  • Efficient Bayesian inference under the structured coalescent
    [Aug 2014]

    Motivation: Population structure significantly affects evolutionary dynamics. Such structure may be due to spatial segregation, but may also reflect any other gene-flow-limiting aspect of a model. In combination with the structured coalescent, this fact can be used to inform phylogenetic tree reconstruction, as well as to infer parameters such as migration rates and subpopulation sizes from annotated sequence data. However, conducting Bayesian inference under the structured coalescent is impeded by the difficulty of constructing Markov Chain Monte Carlo (MCMC) sampling algorithms (samplers) capable of efficiently exploring the state space.

    Results: In this article, we present a new MCMC sampler capable of sampling from posterior distributions over structured trees: timed phylogenetic trees in which lineages are associated with the distinct subpopulation in which they lie. The sampler includes a set of MCMC proposal functions that offer significant mixing improvements over a previously published method. Furthermore, its implementation as a BEAST 2 package ensures maximum flexibility with respect to model and prior specification. We demonstrate the usefulness of this new sampler by using it to infer migration rates and effective population sizes of H3N2 influenza between New Zealand, New York and Hong Kong from publicly available hemagglutinin (HA) gene sequences under the structured coalescent.

    Availability and implementation: The sampler has been implemented as a publicly available BEAST 2 package that is distributed under version 3 of the GNU General Public License at http://compevol.github.io/MultiTypeTree.

    Contact: tgvaughan@gmail.com

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • KDETREES: non-parametric estimation of phylogenetic tree distributions
    [Aug 2014]

    Motivation: Although the majority of gene histories found in a clade of organisms are expected to be generated by a common process (e.g. the coalescent process), it is well known that numerous other coexisting processes (e.g. horizontal gene transfers, gene duplication and subsequent neofunctionalization) will cause some genes to exhibit a history distinct from those of the majority of genes. Such ‘outlying’ gene trees are considered to be biologically interesting, and identifying these genes has become an important problem in phylogenetics.

    Results: We propose and implement kdetrees, a non-parametric method for estimating distributions of phylogenetic trees, with the goal of identifying trees that are significantly different from the rest of the trees in the sample. Our method compares favorably with a similar recently published method, featuring an improvement of one polynomial order of computational complexity (to quadratic in the number of trees analyzed), with simulation studies suggesting only a small penalty to classification accuracy. Application of kdetrees to a set of Apicomplexa genes identified several unreliable sequence alignments that had escaped previous detection, as well as a gene independently reported as a possible case of horizontal gene transfer. We also analyze a set of genes from Epichloë, fungi symbiotic with grasses, successfully identifying a contrived instance of paralogy.

    Availability and implementation: Our method for estimating tree distributions and identifying outlying trees is implemented as the R package kdetrees and is available for download from CRAN.

    Contact: ruriko.yoshida@uky.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
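
The kdetrees entry above flags trees that sit in low-density regions of tree space. The sketch below shows the general idea with a leave-one-out kernel density score computed from a pairwise distance matrix, which takes a number of steps quadratic in the number of trees. The Gaussian kernel, the fixed bandwidth and the toy distances are assumptions; the kdetrees package makes its own choices of tree distance, kernel and bandwidth.

```python
import math

def density_scores(dist, bandwidth):
    """Leave-one-out Gaussian kernel density score for each item.

    dist is a symmetric matrix of pairwise distances (list of lists).
    """
    n = len(dist)
    return [
        sum(math.exp(-(dist[i][j] / bandwidth) ** 2 / 2) for j in range(n) if j != i)
        for i in range(n)
    ]

# Hypothetical pairwise tree distances; tree 3 is far from the other three.
dist = [
    [0.0, 1.0, 1.2, 5.0],
    [1.0, 0.0, 0.9, 5.2],
    [1.2, 0.9, 0.0, 4.8],
    [5.0, 5.2, 4.8, 0.0],
]
scores = density_scores(dist, bandwidth=1.0)
# The lowest-scoring tree is the outlier candidate.
outlier = min(range(len(scores)), key=scores.__getitem__)
print(outlier, [round(s, 4) for s in scores])
```
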
  • Improving B-cell epitope prediction and its application to global antibody-antigen docking
    [Aug 2014]

    Motivation: Antibodies are currently the most important class of biopharmaceuticals. Development of such antibody-based drugs depends on costly and time-consuming screening campaigns. Computational techniques such as antibody–antigen docking hold the potential to facilitate the screening process by rapidly providing a list of initial poses that approximate the native complex.

    Results: We have developed a new method, EpiPred, to identify the epitope region on the antigen, given the structures of the antibody and the antigen. The method combines conformational matching of the antibody–antigen structures and a specific antibody–antigen score. We have tested the method on both a large non-redundant set of antibody–antigen complexes and on homology models of the antibodies and/or the unbound antigen structure. On a non-redundant test set, our epitope prediction method achieves 44% recall at 14% precision against 23% recall at 14% precision for a background random distribution. We use our epitope predictions to rescore the global docking results of two rigid-body docking algorithms: ZDOCK and ClusPro. In both cases, including our epitope prediction increases the number of near-native poses found among the top decoys.

    Availability and implementation: Our software is available from http://www.stats.ox.ac.uk/research/proteins/resources.

    Contact: deane@stats.ox.ac.uk

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
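
The epitope prediction entry above reports recall and precision. For readers unfamiliar with the two measures, this short sketch computes them for sets of residue indices. The residue numbers are made up and the function is illustrative; it is not part of EpiPred.

```python
def precision_recall(predicted, true):
    """Precision and recall of a predicted residue set against the true set."""
    predicted, true = set(predicted), set(true)
    true_positives = len(predicted & true)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true) if true else 0.0
    return precision, recall

# Hypothetical residue indices.
true_epitope = {10, 11, 12, 13, 14}
predicted_epitope = {12, 13, 20, 21, 22, 23, 24, 25}
p, r = precision_recall(predicted_epitope, true_epitope)
print(f"precision={p:.2f} recall={r:.2f}")
```
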
  • Redundancy-weighting for better inference of protein structural features
    [Aug 2014]

    Motivation: Structural knowledge, extracted from the Protein Data Bank (PDB), underlies numerous potential functions and prediction methods. The PDB, however, is highly biased: many proteins have more than one entry, while entire protein families are represented by a single structure, or even not at all. The standard solution to this problem is to limit the studies to non-redundant subsets of the PDB. While alleviating biases, this solution hides the many-to-many relations between sequences and structures. That is, non-redundant datasets conceal the diversity of sequences that share the same fold and the existence of multiple conformations for the same protein. A particularly disturbing aspect of non-redundant subsets is that they hardly benefit from the rapid pace of protein structure determination, as most newly solved structures fall within existing families.

    Results: In this study we explore the concept of redundancy-weighted datasets, originally suggested by Miyazawa and Jernigan. Redundancy-weighted datasets include all available structures and associate them (or features thereof) with weights that are inversely proportional to the number of their homologs. Here, we provide the first systematic comparison of redundancy-weighted datasets with non-redundant ones. We test three weighting schemes and show that the distributions of structural features that they produce are smoother (having higher entropy) compared with the distributions inferred from non-redundant datasets. We further show that these smoothed distributions are both more robust and more correct than their non-redundant counterparts.

    We suggest that the better distributions, inferred using redundancy-weighting, may improve the accuracy of knowledge-based potentials and increase the power of protein structure prediction methods. Consequently, they may enhance model-driven molecular biology.

    Contact: cheny@il.ibm.com or chen.keasar@gmail.com

    Categories: Journal Articles
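
The redundancy-weighting entry above assigns each structure a weight inversely proportional to its number of homologs. Below is a minimal sketch of one such scheme, using 1 / family size as the weight, together with the entropy of the resulting feature distribution. The family labels, features and function names are hypothetical, and the article's three weighting schemes are not reproduced here.

```python
import math
from collections import Counter, defaultdict

def redundancy_weights(family_of):
    """Weight each structure by 1 / (number of structures in its family)."""
    sizes = Counter(family_of.values())
    return {s: 1.0 / sizes[fam] for s, fam in family_of.items()}

def weighted_distribution(feature_of, weights):
    """Normalized weighted frequency of each feature value."""
    totals = defaultdict(float)
    for structure, feature in feature_of.items():
        totals[feature] += weights[structure]
    norm = sum(totals.values())
    return {feature: w / norm for feature, w in totals.items()}

def entropy_bits(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Toy data: family A is over-represented with three entries, family B has one.
family_of = {"s1": "famA", "s2": "famA", "s3": "famA", "s4": "famB"}
feature_of = {"s1": "helix", "s2": "helix", "s3": "helix", "s4": "sheet"}

weights = redundancy_weights(family_of)
dist = weighted_distribution(feature_of, weights)
print(dist, entropy_bits(dist))
```

In this toy case every structure is kept, yet each family contributes equally to the feature distribution, which is the property the weighting is meant to provide.
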
  • Structural and energetic determinants of tyrosylprotein sulfotransferase sulfation specificity
    [Aug 2014]

    Motivation: Tyrosine sulfation is a type of post-translational modification (PTM) catalyzed by tyrosylprotein sulfotransferases (TPST). The modification plays a crucial role in mediating protein–protein interactions in many biologically important processes. There is no well-defined sequence motif for TPST sulfation, and the underlying determinants of TPST sulfation specificity remain elusive. Here, we perform molecular modeling to uncover the structural and energetic determinants of TPST sulfation specificity.

    Results: We estimate the binding affinities between TPST and peptides around tyrosines of both sulfated and non-sulfated proteins to differentiate them. We find that better differentiation is achieved after including energy costs associated with local unfolding of the tyrosine-containing peptide in a host protein, which depends on both the peptide’s secondary structures and solvent accessibility. Local unfolding renders buried peptides with ordered structures thermodynamically available for TPST binding. Our results suggest that both thermodynamic availability of the peptide and its binding affinity to the enzyme are important for TPST sulfation specificity, and their interplay results in great variations in sequences and structures of sulfated peptides. We expect our method to be useful in predicting potential sulfation sites and transferable to other TPST variants. Our study may also shed light on other PTM systems without well-defined sequence and structural specificities.

    Availability and implementation: All the data and scripts used in the work are available at http://dlab.clemson.edu/research/Sulfation.

    Contact: fding@clemson.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • On non-detects in qPCR data
    [Aug 2014]

    Motivation: Quantitative real-time PCR (qPCR) is one of the most widely used methods to measure gene expression. Despite extensive research in qPCR laboratory protocols, normalization and statistical analysis, little attention has been given to qPCR non-detects—those reactions failing to produce a minimum amount of signal.

    Results: We show that the common methods of handling qPCR non-detects lead to biased inference. Furthermore, we show that non-detects do not represent data missing completely at random and likely represent missing data occurring not at random. We propose a model of the missing data mechanism and develop a method to directly model non-detects as missing data. Finally, we show that our approach results in a sizeable reduction in bias when estimating both absolute and differential gene expression.

    Availability and implementation: The proposed algorithm is implemented in the R package, nondetects. This package also contains the raw data for the three example datasets used in this manuscript. The package is freely available at http://mnmccall.com/software and as part of the Bioconductor project.

    Contact: mccallm@gmail.com

    Categories: Journal Articles
  • A power set-based statistical selection procedure to locate susceptible rare variants associated with complex traits with sequencing data
    [Aug 2014]

    Motivation: Existing association methods for rare variants from sequencing data have focused on aggregating variants in a gene or a genetic region because analysing individual rare variants is underpowered. However, these existing rare variant detection methods are not able to identify which of the rare variants in a gene or a genetic region are associated with the complex diseases or traits. Once phenotypic associations of a gene or a genetic region are identified, the natural next step in the association study with sequencing data is to locate the susceptible rare variants within the gene or the genetic region.

    Results: In this article, we propose a power set-based statistical selection procedure that is able to identify the locations of the potentially susceptible rare variants within a disease-related gene or a genetic region. The selection performance of the proposed selection procedure was evaluated through simulation studies, where we demonstrated the feasibility and superior power over several comparable existing methods. In particular, the proposed method is able to handle the mixed effects when both risk and protective variants are present in a gene or a genetic region. The proposed selection procedure was also applied to the sequence data on the ANGPTL gene family from the Dallas Heart Study to identify potentially susceptible rare variants within the trait-related genes.

    Availability and implementation: An R package ‘rvsel’ can be downloaded from http://www.columbia.edu/~sw2206/ and http://statsun.pusan.ac.kr.

    Contact: sw2206@columbia.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
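
The rare-variant entry above selects variants by searching over the power set of variants in a gene. The sketch below illustrates only the enumeration step: it lists every non-empty subset and keeps the one with the best score. The scoring function is a placeholder invented for this example (absolute effect minus a per-variant penalty) and is not the article's statistic or the rvsel package; the variant names and effects are hypothetical.

```python
from itertools import chain, combinations

def nonempty_subsets(items):
    """Return an iterator over every non-empty subset of items (2**n - 1 tuples)."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, size) for size in range(1, len(items) + 1)
    )

def toy_score(subset, effects, penalty=0.1):
    """Placeholder score: total absolute effect minus a per-variant penalty."""
    return sum(abs(effects[v]) for v in subset) - penalty * len(subset)

# Hypothetical effects: v1 is a risk variant, v2 is protective, v3 is near neutral.
effects = {"v1": 0.8, "v2": -0.6, "v3": 0.05}
candidates = list(nonempty_subsets(effects))
best = max(candidates, key=lambda s: toy_score(s, effects))
print(len(candidates), best)
```

Using absolute effects lets risk and protective variants both count toward a subset, which mirrors the mixed-effects setting the abstract mentions. The number of subsets doubles with each added variant, so exhaustive enumeration is only practical for small variant sets.
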
  • EnsembleGASVR: a novel ensemble method for classifying missense single nucleotide polymorphisms
    [Aug 2014]

    Motivation: Single nucleotide polymorphisms (SNPs) are considered the most frequently occurring DNA sequence variations. Several computational methods have been proposed for the classification of missense SNPs as neutral or disease associated. However, existing computational approaches fail to select relevant features, choosing them arbitrarily without sufficient documentation. Moreover, they are limited by missing values and by imbalance between the learning datasets, and most of them do not support their predictions with confidence scores.

    Results: To overcome these limitations, a novel ensemble computational methodology is proposed. EnsembleGASVR implements a two-step algorithm, which in its first step applies a novel evolutionary embedded algorithm to locate close to optimal Support Vector Regression models. In its second step, these models are combined to extract a universal predictor, which is less prone to overfitting issues, systematizes the rebalancing of the learning sets and uses an internal approach for solving the missing values problem without loss of information. Confidence scores support all the predictions and the model becomes tunable by modifying the classification thresholds. An extensive study was performed for collecting the most relevant features for the problem of classifying SNPs, and a superset of 88 features was constructed. Experimental results show that the proposed framework outperforms well-known algorithms in terms of classification performance in the examined datasets. Finally, the proposed algorithmic framework was able to uncover the significant role of certain features such as the solvent accessibility feature, and the top-scored predictions were further validated by linking them with disease phenotypes.

    Availability and implementation: Datasets and codes are freely available on the Web at http://prlab.ceid.upatras.gr/EnsembleGASVR/dataset-codes.zip. All the required information about the article is available through http://prlab.ceid.upatras.gr/EnsembleGASVR/site.html

    Contact: mavroudi@ceid.upatras.gr

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Regulatory interactions maintaining self-renewal of human embryonic stem cells as revealed through a systems analysis of PI3K/AKT pathway
    [Aug 2014]

    Motivation: Maintenance of the self-renewal state in human embryonic stem cells (hESCs) is the foremost critical step for regenerative therapy applications. The insulin-mediated PI3K/AKT pathway is well appreciated as being the central pathway supporting hESC self-renewal; however, the regulatory interactions in the pathway that maintain cell state are not yet known. Identification of these regulatory pathway components will be critical for designing targeted interventions to facilitate a completely defined platform for hESC propagation and differentiation. Here, we have developed a systems analysis approach to identify regulatory components that control the PI3K/AKT pathway in self-renewing hESCs.

    Results: A detailed mathematical model was adopted to explain the complex regulatory interactions in the PI3K/AKT pathway. We evaluated globally sensitive processes of the pathway in a computationally efficient manner by replacing the detailed model with a surrogate meta-model. Our mathematical analysis, supported by experimental validation, reveals that negative regulators of the molecules IRS1 and PIP3 primarily govern the steady state of the pathway in hESCs. Among the regulators, negative feedback via IRS1 reduces the sensitivity of various reactions associated with the direct trunk of the pathway and also constrains the propagation of parameter uncertainty to the levels of post-receptor signaling molecules. Furthermore, our results suggest that inhibition of negative feedback can significantly increase p-AKT levels and thereby better support hESC self-renewal. Our integrated mathematical modeling and experimental workflow demonstrates the significant advantage of computationally efficient meta-model approaches to detect sensitive targets from signaling pathways.

    Availability and implementation: FORTRAN codes for the PI3K/AKT pathway and the RS-HDMR implementation are available from the authors upon request.

    Contact: ipb1@pitt.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Modeling disease progression using dynamics of pathway connectivity
    [Aug 2014]

    Motivation: Disease progression is driven by dynamic changes in both the activity and connectivity of molecular pathways. Understanding these dynamic events is critical for disease prognosis and effective treatment. Compared with activity dynamics, connectivity dynamics is poorly explored.

    Results: We describe the M-module algorithm to identify gene modules with common members but varied connectivity across multiple gene co-expression networks (aka M-modules). We introduce a novel metric to capture the connectivity dynamics of an entire M-module. We find that M-modules with dynamic connectivity have distinct topological and biochemical properties compared with static M-modules and hub genes. We demonstrate that incorporation of module connectivity dynamics significantly improves disease stage prediction. We identify different sets of M-modules that are important for specific disease stage transitions and offer new insights into the molecular events underlying disease progression. Besides modeling disease progression, the algorithm and metric introduced here are broadly applicable to modeling dynamics of molecular pathways.

    Availability and implementation: M-module is implemented in R. The source code is freely available at http://www.healthcare.uiowa.edu/labs/tan/M-module.zip.

    Contact: kai-tan@uiowa.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • A comparison of algorithms for the pairwise alignment of biological networks
    [Aug 2014]

    Motivation: As biological inquiry produces ever more network data, such as protein–protein interaction networks, gene regulatory networks and metabolic networks, many algorithms have been proposed for the purpose of pairwise network alignment—finding a mapping from the nodes of one network to the nodes of another in such a way that the mapped nodes can be considered to correspond with respect to both their place in the network topology and their biological attributes. This technique is helpful in identifying previously undiscovered homologies between proteins of different species and revealing functionally similar subnetworks. In the past few years, a wealth of different aligners has been published, but few of them have been compared with one another, and no comprehensive review of these algorithms has yet appeared.

    Results: We present the problem of biological network alignment, provide a guide to existing alignment algorithms and comprehensively benchmark existing algorithms on both synthetic and real-world biological data, finding dramatic differences between existing algorithms in the quality of the alignments they produce. Additionally, we find that many of these tools are inconvenient to use in practice, and there remains a need for easy-to-use cross-platform tools for performing network alignment.

    Contact: cclark@uccs.edu, jkalita@uccs.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • A systems-level integrative framework for genome-wide DNA methylation and gene expression data identifies differential gene expression modules under epigenetic control
    [Aug 2014]

    Motivation: There is a growing number of studies generating matched Illumina Infinium HumanMethylation450 and gene expression data, yet there is a corresponding shortage of statistical tools aimed at their integrative analysis. Such integrative tools are important for the discovery of epigenetically regulated gene modules or molecular pathways, which play key roles in cellular differentiation and disease.

    Results: Here, we present a novel functional supervised algorithm, called Functional Epigenetic Modules (FEM), for the integrative analysis of Infinium 450k DNA methylation and matched or unmatched gene expression data. The algorithm identifies gene modules of coordinated differential methylation and differential expression in the context of a human interactome. We validate the FEM algorithm on simulated and real data, demonstrating how it successfully retrieves an epigenetically deregulated gene, previously known to drive endometrial cancer development. Importantly, in the same cancer, FEM identified a novel epigenetically deregulated hotspot, directly upstream of the well-known progesterone receptor tumour suppressor pathway. In the context of cellular differentiation, FEM successfully identifies known endothelial cell subtype-specific gene expression markers, as well as a novel gene module whose overexpression in blood endothelial cells is mediated by DNA hypomethylation. The systems-level integrative framework presented here could be used to identify novel key genes or signalling pathways, which drive cellular differentiation or disease through an underlying epigenetic mechanism.

    Availability and implementation: FEM is freely available as an R-package from http://sourceforge.net/projects/funepimod.

    Contact: andrew@picb.ac.cn

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • The cell behavior ontology: describing the intrinsic biological behaviors of real and model cells seen as active agents
    [Aug 2014]

    Motivation: Currently, there are no ontologies capable of describing both the spatial organization of groups of cells and the behaviors of those cells. The lack of a formalized method for describing the spatiality and intrinsic biological behaviors of cells makes it difficult to adequately describe cells, tissues and organs as spatial objects in living tissues, in vitro assays and in computational models of tissues.

    Results: We have developed an OWL-2 ontology to describe the intrinsic physical and biological characteristics of cells and tissues. The Cell Behavior Ontology (CBO) provides a basis for describing the spatial and observable behaviors of cells and extracellular components suitable for describing in vivo, in vitro and in silico multicell systems. Using the CBO, a modeler can create a meta-model of a simulation of a biological model and link that meta-model to experiment or simulation results. Annotation of a multicell model and its computational representation, using the CBO, makes the statement of the underlying biology explicit. The formal representation of such biological abstraction facilitates the validation, falsification, discovery, sharing and reuse of both models and experimental data.

    Availability and implementation: The CBO, developed using Protégé 4, is available at http://cbo.biocomplexity.indiana.edu/cbo/ and at BioPortal (http://bioportal.bioontology.org/ontologies/CBO).

    Contact: jsluka@indiana.edu or Glazier@indiana.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment
    [Aug 2014]

    Summary: Phenome-wide association studies (PheWAS) have been used to replicate known genetic associations and discover new phenotype associations for genetic variants. This PheWAS implementation allows users to translate ICD-9 codes to PheWAS case and control groups, perform analyses using these and/or other phenotypes with covariate adjustments and plot the results. We demonstrate the methods by replicating a PheWAS on rs3135388 (near HLA-DRB, associated with multiple sclerosis) and performing a novel PheWAS using an individual’s maximum white blood cell count (WBC) as a continuous measure. Our results for rs3135388 replicate known associations with more significant results than the original study on the same dataset. Our PheWAS of WBC found expected results, including associations with infections, myeloproliferative diseases and associated conditions, such as anemia. These results demonstrate the performance of the improved classification scheme and the flexibility of PheWAS encapsulated in this package.

    Availability and implementation: This R package is freely available under the GNU General Public License (GPL-3) from http://phewascatalog.org. It is implemented in native R and is platform independent.

    Contact: phewas@vanderbilt.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles