The latest research articles published by BMC Bioinformatics
Background: Information about genes, transcripts and proteins is spread over a wide variety of databases. Different tools have been developed using these databases to identify biological signals in gene lists from large-scale analyses. Mostly, they search for enrichments of specific features. However, these tools do not allow an exploratory walk through different views, nor modification of the gene lists as new hypotheses emerge. Results: To fill this niche, we have developed ISAAC, the InterSpecies Analysing Application using Containers. The central idea of this web-based tool is to enable the analysis of sets of genes, transcripts and proteins under different biological viewpoints and to interactively modify these sets at any point of the analysis. Detailed history and snapshot information allows tracing of each action. Furthermore, one can easily switch back to previous states and perform new analyses. Currently, sets can be viewed in the context of genomes, protein functions, protein interactions, pathways, regulation, diseases and drugs. Additionally, users can switch between species, with an automatic, orthology-based translation of existing gene sets. As today's research is usually performed in larger teams and consortia, ISAAC provides group-based functionalities in which sets as well as analysis results can be exchanged between group members. Conclusions: ISAAC fills the gap between primary databases and tools for the analysis of large gene lists. With its highly modular, Java EE based design, the implementation of new modules is straightforward. Furthermore, ISAAC comes with an extensive web-based administration interface, including tools for the integration of third-party data, so a local installation is easily feasible. In summary, ISAAC is tailor-made for highly exploratory, interactive analyses of gene, transcript and protein sets in a collaborative environment.
Large-scale combining signals from both biomedical literature and the FDA Adverse Event Reporting System (FAERS) to improve post-marketing drug safety signal detection
Background: Independent data sources can be used to augment post-marketing drug safety signal detection. The vast amount of publicly available biomedical literature contains rich side effect information for drugs at all clinical stages. In this study, we present a large-scale signal boosting approach that combines over 4 million records in the US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) and over 21 million biomedical articles. Results: The datasets comprise 4,285,097 records from FAERS and 21,354,075 MEDLINE articles. We first extracted all drug-side effect (SE) pairs from FAERS. Our study implemented a total of seven signal ranking algorithms. We then compared these ranking algorithms before and after they were boosted with signals from MEDLINE sentences or abstracts. Finally, we manually curated all drug-cardiovascular (CV) pairs that appeared in both data sources and investigated whether our approach can detect true signals that have not been included in FDA drug labels. We extracted a total of 2,787,797 drug-SE pairs from FAERS with a low initial precision of 0.025. The ranking algorithm combined signals from both FAERS and MEDLINE, significantly improving the precision from 0.025 to 0.371 for top-ranked pairs, representing a 13.8-fold elevation in precision. We showed by manual curation that drug-SE pairs appearing in both data sources were highly enriched with true signals, many of which have not yet been included in FDA drug labels. Conclusions: We have developed an efficient and effective drug safety signal ranking and strengthening approach. We demonstrate that combining information from FAERS and the biomedical literature at large scale can significantly contribute to drug safety surveillance.
Background: Computational methods for the prediction of protein features from sequence are a long-standing focus of bioinformatics. A key observation is that several protein features are closely inter-related, that is, they are conditioned on each other. Researchers have invested a lot of effort into designing predictors that exploit this fact. Most existing methods leverage inter-feature constraints by including known (or predicted) correlated features as inputs to the predictor, thus conditioning the result. Results: By including correlated features as inputs, existing methods only rely on one side of the relation: the output feature is conditioned on the known input features. Here we show how to jointly improve the outputs of multiple correlated predictors by means of a probabilistic-logical consistency layer. The logical layer enforces a set of weighted first-order rules encoding biological constraints between the features, and improves the raw predictions so that they least violate the constraints. In particular, we show how to integrate three stand-alone predictors of correlated features: subcellular localization (Loctree [J Mol Biol 348:85-100, 2005]), disulfide bonding state (Disulfind [Nucleic Acids Res 34:W177-W181, 2006]), and metal bonding state (MetalDetector [Bioinformatics 24:2094-2095, 2008]), in a way that takes into account the respective strengths and weaknesses, and does not require any change to the predictors themselves. We also compare our methodology against two alternative refinement pipelines based on state-of-the-art sequential prediction methods. Conclusions: The proposed framework is able to improve the performance of the underlying predictors by removing rule violations. We show that different predictors offer complementary advantages, and our method is able to integrate them using non-trivial constraints, generating more consistent predictions.
In addition, our framework is fully general, and could in principle be applied to a vast array of heterogeneous predictions without requiring any change to the underlying software. On the other hand, the alternative strategies are more specific and tend to favor one task at the expense of the others, as shown by our experimental evaluation. The ultimate goal of our framework is to seamlessly integrate full prediction suites, such as Distill [BMC Bioinformatics 7:402, 2006] and PredictProtein [Nucleic Acids Res 32:W321-W326, 2004].
Background: Given the estimate that 30% of our genes are controlled by microRNAs, it is essential that we understand the precise relationship between microRNAs and their targets. OncomiRs are microRNAs (miRNAs) that have frequently been shown to be deregulated in cancer. However, although several oncomiRs have been identified and characterized, there is as yet no comprehensive compilation of these data, which has left them underutilized by cancer biologists. There is therefore an unmet need for bioinformatic platforms that speed the identification of novel therapeutic targets. Description: We describe here OncomiRdbB, a comprehensive database of oncomiRs mined from different existing databases for mouse and human, along with novel oncomiRs that we have validated in human breast cancer samples. The database also lists their respective predicted targets, identified using miRanda, along with their IDs, sequences, chromosome locations and detailed descriptions. The database can be queried by search strings including microRNA name, sequence, accession number, target genes and organism. The microRNA networks and their hubs, with respective targets in the 3′UTR, 5′UTR and exons of different pathway genes, were also deciphered using the 'R' algorithm. Conclusion: OncomiRdbB is a comprehensive and integrated database of oncomiRs and their targets in breast cancer, with multiple query options that will help enhance both our understanding of the biology of breast cancer and the development of new and innovative microRNA-based diagnostic tools and targets of therapeutic significance. OncomiRdbB is freely available for download at http://tdb.ccmb.res.in/OncomiRdbB/index.htm
Fold change rank ordering statistics: a new method for detecting differentially expressed genes
Background: Different methods have been proposed for analyzing differentially expressed (DE) genes in microarray data. Methods based on statistical tests that incorporate expression level variability are used more commonly than those based on fold change (FC). However, FC-based results are more reproducible and biologically relevant. Results: We propose a new method based on fold change rank ordering statistics (FCROS). We exploit the variation in calculated FC levels using combinatorial pairs of biological conditions in the datasets. A statistic is associated with the ranks of the FC values for each gene, and the resulting probability is used to identify the DE genes within an error level. The FCROS method is deterministic, requires a low computational runtime and also solves the problem of multiple testing which usually arises with microarray datasets. Conclusion: We compared the performance of FCROS with that of other methods using synthetic and real microarray datasets. We found that FCROS is well suited for DE gene identification from noisy datasets when compared with existing FC-based methods.
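The pairwise fold-change ranking at the heart of FCROS can be sketched in a few lines. This is a simplified illustration, not the published implementation: `fcros_ranks` is a hypothetical helper, and the statistic and error-level step that FCROS applies to the rank distribution is omitted.

```python
import numpy as np

def fcros_ranks(control, treated):
    """Average fold-change ranks over all control/treated sample pairs.

    control, treated: 2D arrays (genes x samples) of expression values.
    Returns the per-gene mean rank (in [1, n_genes]); genes whose mean
    rank sits near 1 or n_genes are candidate down-/up-regulated genes.
    """
    n_genes = control.shape[0]
    rank_sums = np.zeros(n_genes)
    n_pairs = 0
    for i in range(control.shape[1]):
        for j in range(treated.shape[1]):
            fc = treated[:, j] / control[:, i]     # fold change per gene
            ranks = fc.argsort().argsort() + 1     # rank 1 = smallest FC
            rank_sums += ranks
            n_pairs += 1
    return rank_sums / n_pairs
```

Because each pair of samples contributes its own ranking, genes with consistent fold changes accumulate extreme average ranks, which is what makes the approach robust to noise in any single comparison.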
Background: Gene set analysis (GSA) is useful in deducing the biological significance of gene lists using a priori defined gene sets such as gene ontology (GO) terms or pathways. Phenotypic annotation is sparse for human genes, but is far more abundant for model organisms such as mouse, fly, and worm. Often, GSA needs to be done highly interactively by combining or modifying gene lists or inspecting gene-gene interactions in a molecular network. Description: We developed gsGator, a web-based platform for functional interpretation of gene sets with useful features such as cross-species GSA, simultaneous analysis of multiple gene sets, and a fully integrated network viewer for visualizing both GSA results and molecular networks. An extensive set of gene annotation information has been amassed, including GO & pathways, genomic annotations, protein-protein interactions, transcription factor-target (TF-target) relations, miRNA targeting, and phenotype information for various model organisms. By combining the functionalities of Set Creator, Set Operator and Network Navigator, users can perform highly flexible and interactive GSA by creating a new gene list from any combination of existing gene sets (intersection, union and difference) or by expanding genes interactively along molecular networks such as protein-protein interactions and TF-target relations. We also demonstrate the utility of the interactive and cross-species GSA implemented in gsGator with several usage examples for interpreting genome-wide association study (GWAS) results. gsGator is freely available at http://gsGator.ewha.ac.kr. Conclusions: Interactive and cross-species GSA in gsGator greatly extends the scope and utility of GSA, leading to novel insights via conserved functional gene modules across different species.
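The set algebra underlying a Set Operator-style tool can be illustrated with plain Python sets; the gene symbols and set names below are invented for the example and are not taken from gsGator.

```python
# Two hypothetical gene lists: GWAS hits and a pathway's member genes.
gwas_hits = {"TCF7L2", "FTO", "KCNJ11", "PPARG"}
pathway_set = {"PPARG", "TCF7L2", "INS", "GCK"}

combined = gwas_hits | pathway_set   # union: all genes in either list
shared = gwas_hits & pathway_set     # intersection: genes in both lists
gwas_only = gwas_hits - pathway_set  # difference: hits outside the pathway
```

The same three operations, applied interactively and chained with network-based expansion, are what let a user refine a gene list step by step during an analysis session.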
GIANT: pattern analysis of molecular interactions in 3D structures of protein-small ligand complexes
Background: Interpretation of the binding modes of protein-small ligand complexes from 3D structure data is essential for understanding selective ligand recognition by proteins. It is often performed by visual inspection and sometimes depends largely on a priori knowledge about typical interactions such as hydrogen bonds and pi-pi stacking. Because this can introduce biases due to scientists' subjective perspectives, more objective viewpoints considering a wide range of interactions are required. Description: In this paper, we present a web server for analyzing protein-small ligand interactions on the basis of patterns of atomic contacts, or "interaction patterns", obtained from the statistical analyses of 3D structures of protein-ligand complexes in our previous study. This server can guide visual inspection by providing information about interaction patterns for each atomic contact in 3D structures. Users can visually investigate which atomic contacts in user-specified 3D structures of protein-small ligand complexes are statistically overrepresented. The server consists of two main components: "Complex Analyzer" and "Pattern Viewer". The former provides a 3D structure viewer with annotations of interacting amino acid residues, ligand atoms, and interacting pairs of these. In the annotations of interacting pairs, the assignment of each contact to an interaction pattern and the statistical preferences of the patterns are presented. "Pattern Viewer" provides details of each interaction pattern. Users can see visual representations of the probability density functions of interactions, and a list of protein-ligand complexes showing similar interactions. Conclusions: Users can interactively analyze protein-small ligand binding modes with statistically determined interaction patterns rather than relying on a priori knowledge, using our new web server GIANT, which is freely available at http://giant.hgc.jp/.
Background: The Kruskal-Wallis test is a popular non-parametric statistical test for identifying expression quantitative trait loci (eQTLs) from genome-wide data due to its robustness against variations in the underlying genetic model and expression trait distribution, but testing billions of marker-trait combinations one-by-one can become computationally prohibitive. Results: We developed kruX, an algorithm implemented in Matlab, Python and R that uses matrix multiplications to simultaneously calculate the Kruskal-Wallis test statistic for several millions of marker-trait combinations at once. kruX is more than ten thousand times faster than computing associations one-by-one on a typical human dataset. We used kruX and a dataset of more than 500k SNPs and 20k expression traits measured in 102 human blood samples to compare eQTLs detected by the Kruskal-Wallis test to eQTLs detected by the parametric ANOVA and linear model methods. We found that the Kruskal-Wallis test is more robust against data outliers and heterogeneous genotype group sizes and detects a higher proportion of non-linear associations, but is more conservative for calling additive linear associations. Conclusion: kruX enables the use of robust non-parametric methods for massive eQTL mapping without the need for a high-performance computing infrastructure and is freely available from http://krux.googlecode.com.
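The core trick, replacing per-marker loops with one matrix product over rank sums, can be sketched roughly as follows. This is a simplified illustration assuming no tied expression values and ignoring kruX's handling of ties and missing genotypes; `kw_matrix` is an invented name, not the library's API.

```python
import numpy as np

def kw_matrix(traits, genotypes):
    """Kruskal-Wallis H for every trait against one marker via a single
    matrix multiplication (kruX-style sketch; assumes no tied values).

    traits: (n_traits, n_samples) expression matrix.
    genotypes: (n_samples,) integer genotype codes, e.g. 0/1/2.
    """
    n = traits.shape[1]
    ranks = traits.argsort(axis=1).argsort(axis=1) + 1.0       # ranks 1..n per trait
    groups = np.unique(genotypes)
    X = (genotypes[None, :] == groups[:, None]).astype(float)  # (g, n) indicators
    sizes = X.sum(axis=1)                                      # samples per genotype
    R = ranks @ X.T                                            # (n_traits, g) rank sums
    return 12.0 / (n * (n + 1)) * ((R ** 2) / sizes).sum(axis=1) - 3.0 * (n + 1)
```

Because the genotype indicator matrix is shared across traits, one `ranks @ X.T` product yields the per-group rank sums for every trait at once, which is what makes the all-pairs scan feasible without a cluster.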
Background: New sequencing technologies make it possible to scan very long and dense genetic sequences, yielding datasets of genetic markers that are an order of magnitude larger than previously available. Such genetic sequences are characterized by common alleles interspersed with multiple rarer alleles. This situation has renewed interest in the identification of haplotypes carrying the rare risk alleles. However, large-scale explorations of the linkage-disequilibrium (LD) pattern to identify haplotype blocks are not easy to perform, because traditional algorithms have at least Theta(n^2) time and memory complexity. Results: We derived three incremental optimizations of the widely used haplotype block recognition algorithm proposed by Gabriel et al. in 2002. Our most efficient solution, called MIG++, has only Theta(n) memory complexity and, on a genome-wide scale, omits >80% of the calculations, which makes it an order of magnitude faster than the original algorithm. Unlike existing software, MIG++ analyzes the LD between SNPs at any distance, avoiding restrictions on the maximal block length. The haplotype block partition of the entire HapMap II CEPH dataset was obtained in 457 hours. By replacing the standard likelihood-based D' variance estimator with an approximated estimator, the runtime was further improved. While producing a coarser partition, the approximate method made it possible to obtain the full-genome haplotype block partition of the entire 1000 Genomes Project CEPH dataset in 44 hours, with no restrictions on allele frequency or long-range correlations. These experiments showed that LD-based haplotype blocks can span more than one million base pairs in both the HapMap II and 1000 Genomes datasets. An application to the North American Rheumatoid Arthritis Consortium (NARAC) dataset shows how MIG++ can support genome-wide haplotype association studies.
Conclusions: MIG++ enables LD-based haplotype block recognition on genetic sequences of any length and density. In the next-generation sequencing era, this can help identify haplotypes that carry rare variants of interest. The low computational requirements open up the possibility of including the haplotype block structure in genome-wide association scans, downstream analyses, and visual interfaces for online genome browsers.
A generic classification-based method for segmentation of nuclei in 3D images of early embryos
Background: Studying how individual cells spatially and temporally organize within the embryo is a fundamental issue in modern developmental biology to better understand the first stages of embryogenesis. In order to perform high-throughput analyses in three-dimensional microscopic images, it is essential to be able to automatically segment, classify and track cell nuclei. Many 3D/4D segmentation and tracking algorithms have been reported in the literature. Most of them are specific to particular models or acquisition systems and often require the fine tuning of parameters. Results: We present a new automatic algorithm to segment and simultaneously classify cell nuclei in 3D/4D images. Segmentation relies on training samples that are interactively provided by the user and on an iterative thresholding process. This algorithm can correctly segment nuclei even when they are touching, and remains effective under temporal and spatial intensity variations. The segmentation is coupled to a classification of nuclei according to cell cycle phases, allowing biologists to quantify the effect of genetic perturbations and drug treatments. Robust 3D geometrical shape descriptors are used as training features for classification. Segmentation and classification results of three complete datasets are presented. In our working dataset of the Caenorhabditis elegans embryo, only 21 nuclei out of 3,585 were not detected, the overall F-score for segmentation reached 0.99, and more than 95% of the nuclei were classified in the correct cell cycle phase. No merging of nuclei was found. Conclusion: We developed a novel generic algorithm for segmentation and classification in 3D images. The method, referred to as Adaptive Generic Iterative Thresholding Algorithm (AGITA), is freely available as an ImageJ plug-in.
iMir: an integrated pipeline for high-throughput analysis of small non-coding RNA data obtained by smallRNA-Seq
Background: Qualitative and quantitative analysis of small non-coding RNAs by next generation sequencing (smallRNA-Seq) is a novel technology increasingly used to investigate, with high sensitivity and specificity, RNA populations comprising microRNAs and other regulatory small transcripts. Analysis of smallRNA-Seq data to gather biologically relevant information, i.e. detection and differential expression analysis of known and novel non-coding RNAs, target prediction, etc., requires the implementation of multiple statistical and bioinformatics tools from different sources, each focusing on a specific step of the analysis pipeline. As a consequence, the analytical workflow is slowed down by the need for continuous intervention by the operator, a critical factor when large numbers of datasets need to be analyzed at once. Results: We designed a novel modular pipeline (iMir) for the comprehensive analysis of smallRNA-Seq data, comprising specific tools for adapter trimming, quality filtering, differential expression analysis, biological target prediction and other useful options, integrating multiple open source modules and resources in an automated workflow. As statistics is crucial in deep-sequencing data analysis, we devised and integrated into iMir tools based on different statistical approaches to allow the operator to analyze data rigorously. The pipeline proved to be more efficient and time-saving than currently available methods and, in addition, flexible enough to allow the user to select the preferred combination of analytical steps.
We present here the results obtained by applying this pipeline to the simultaneous analysis of 6 smallRNA-Seq datasets from either exponentially growing or growth-arrested human breast cancer MCF-7 cells, which led to the rapid and accurate identification, quantitation and differential expression analysis of ~450 miRNAs, including several novel miRNAs and isomiRs, as well as identification of the putative mRNA targets of differentially expressed miRNAs. In addition, iMir also allowed the identification of ~70 piRNAs (piwi-interacting RNAs), some of which are differentially expressed in proliferating vs growth-arrested cells. Conclusion: The integrated data analysis pipeline described here is based on a reliable, flexible and fully automated workflow, useful for rapidly and efficiently analyzing high-throughput smallRNA-Seq data, such as those produced by the most recent high-performance next generation sequencers. iMir is available at http://www.labmedmolge.unisa.it/inglese/research/imir.
Detection of attractors of large Boolean networks via exhaustive enumeration of appropriate subspaces of the state space
Background: Boolean models are increasingly used to study biological signaling networks. In a Boolean network, nodes represent biological entities such as genes, proteins or protein complexes, and edges indicate activating or inhibiting influences of one node on another. Depending on the input of activators or inhibitors, Boolean networks categorize nodes as either active or inactive. The formalism is appealing because for many biological relationships we lack quantitative information about binding constants or kinetic parameters and can only rely on a qualitative description of the type "A activates (or inhibits) B". A central aim of Boolean network analysis is the determination of attractors (steady states and/or cycles). This problem is known to be computationally complex, its most important parameter being the number of network nodes. Various algorithms tackle it with considerable success. In this paper we present an algorithm that extends the size of analyzable networks thanks to simple and intuitive arguments. Results: We present lnet, a software package which, in fully asynchronous updating mode and without any network reduction, detects the fixed states of Boolean networks with up to 150 nodes, and a good part of any cycles present for networks with up to half that number of nodes. The algorithm goes through a complete enumeration of the states of appropriately selected subspaces of the entire network state space. The size of these relevant subspaces is small compared to the full network state space, allowing the analysis of large networks. The subspaces scanned for the analysis of cycles are larger, reducing the size of accessible networks. Importantly, inherent in cycle detection is a classification scheme based on the number of non-frozen nodes of the cycle member states, with cycles characterized by fewer non-frozen nodes being easier to detect. It is further argued that these detectable cycles are also the biologically more important ones.
Furthermore, lnet also provides standard Boolean analysis features such as node loop detection. Conclusion: lnet is a software package that facilitates the analysis of large Boolean networks. Its intuitive approach helps to better understand the network in question.
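The brute-force baseline that lnet improves on, exhaustive enumeration of states to find fixed points, can be sketched for a toy network. Fixed points coincide under synchronous and asynchronous updating, so the sketch sidesteps the update-scheme question; lnet's actual contribution, restricting the scan to selected subspaces instead of all 2^n states, is not reproduced here, and the network below is invented.

```python
from itertools import product

def fixed_points(update_fns):
    """Exhaustively enumerate the fixed states of a small Boolean network.

    update_fns: list of functions; update_fns[i](state) returns the next
    value of node i given the full state tuple. A state is fixed when
    every node maps to its current value. Illustrative only: this full
    2^n scan is exactly what becomes infeasible for large networks.
    """
    n = len(update_fns)
    return [s for s in product((0, 1), repeat=n)
            if all(f(s) == s[i] for i, f in enumerate(update_fns))]

# Toy 3-node network: B activates A, A activates B, B inhibits C.
net = [lambda s: s[1], lambda s: s[0], lambda s: 1 - s[1]]
```

For this network the two fixed states are "A and B off, C on" and "A and B on, C off"; the exponential growth of the state space with node count is what motivates subspace selection.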
Gene set bagging for estimating the probability a statistically significant result will replicate
Background: Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA methylation regions, and other high-throughput features associated with illness. We propose a new approach, called gene set bagging, for measuring the probability that a gene set replicates in future studies. Gene set bagging involves resampling the original high-throughput data, performing gene-set analysis on the resampled data, and confirming that biological categories replicate in the bagged samples. Results: Using both simulated and publicly-available genomics data, we demonstrate that significant categories in a gene set enrichment analysis may be unstable when subjected to resampling. We show our method estimates the replication probability (R), the probability that a gene set will replicate as a significant result in future studies, and show in simulations that this method reflects replication better than each set's p-value. Conclusions: Our results suggest that gene lists based on p-values are not necessarily stable, and therefore additional steps like gene set bagging may improve biological inference on gene sets.
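The resampling loop at the core of gene set bagging can be sketched generically. This is a hedged illustration, not the authors' implementation: `replication_probability` is an invented name, and the set-level significance test is left as a user-supplied stand-in for a full gene-set enrichment analysis.

```python
import numpy as np

def replication_probability(data, labels, is_significant, n_boot=100, seed=0):
    """Gene set bagging sketch: resample samples with replacement, rerun a
    gene-set significance test, and report R, the fraction of bagged
    replicates in which the set remains significant.

    data: (features, samples); labels: (samples,) group labels;
    is_significant(data, labels) -> bool is any set-level test.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_boot):
        idx = rng.integers(0, data.shape[1], size=data.shape[1])  # bootstrap draw
        if is_significant(data[:, idx], labels[idx]):
            hits += 1
    return hits / n_boot
```

A set whose significance survives most bootstrap replicates (R near 1) is expected to replicate in future studies, whereas a set that is significant only in the original sample will show a low R even with a small p-value.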
Background: DNA methylation is indispensable for normal human genome function. An increasingly large amount of DNA methylome data is being released into the public domain, providing an opportunity to investigate the relationships between the DNA methylome, genome function, and human phenotypes. The Illumina HumanMethylation450 (450K) is one of the most popular platforms for assessing DNA methylation, with over 10,000 samples available in the public domain. However, accessing all these data requires downloading each individual experiment, and due to inconsistent annotation, finding the right data can be a challenge. Description: Here we introduce 'Marmal-aid', the first standardised database for DNA methylation (freely available at http://marmal-aid.org). In Marmal-aid, the majority of publicly available Illumina HumanMethylation450 data is incorporated into a single repository, allowing for re-processing of data including normalisation and imputation of missing values. The database is accessible in two ways: (1) using an R package, allowing incorporation into existing analysis pipelines, which can then be easily queried to gain insight into the functionality of certain CpG sites; this is aimed at bioinformaticians with experience in R. (2) Using a graphical interface allowing general biologists to query a pre-defined set of tissues (currently 15), providing a reference database of the methylation state in these tissues for the 450,000 CpG sites profiled by the Illumina HumanMethylation450. Conclusion: Marmal-aid is the largest publicly available Illumina HumanMethylation450 methylation database, combining Illumina HumanMethylation450 data from a number of sources into a single location with a single common annotation format. This allows for automated extraction using the R package and inclusion in existing analysis pipelines. Marmal-aid also provides an easy-to-use GUI to visualise methylation data in user-defined genomic regions for various reference tissues.
Background: Probabilistic assignment of ambiguously mapped fragments produced by high-throughput sequencing experiments has been demonstrated to greatly improve accuracy in the analysis of RNA-Seq and ChIP-Seq, and is an essential step in many other sequence census experiments. A maximum likelihood method using the expectation-maximization (EM) algorithm for optimization is commonly used to solve this problem. However, batch EM-based approaches do not scale well with the size of sequencing datasets, which have been increasing dramatically over the past few years. Thus, current approaches to fragment assignment rely on heuristics or approximations for tractability. Results: We present an implementation of a distributed EM solution to the fragment assignment problem using Spark, a data analytics framework that can scale by leveraging compute clusters within datacenters ("the cloud"). We demonstrate that our implementation easily scales to billions of sequenced fragments, while providing the exact maximum likelihood assignment of ambiguous fragments. The accuracy of the method is shown to be an improvement over the most widely used tools available, and it can be run in a constant amount of time when cluster resources are scaled linearly with the amount of input data. Conclusions: The cloud offers one solution for the difficulties faced in the analysis of massive high-throughput sequencing data, which continue to grow rapidly. Researchers in bioinformatics must follow developments in distributed systems, such as new frameworks like Spark, for ways to port existing methods to the cloud and help them scale to the datasets of the future. Our software, eXpress-D, is freely available at http://github.com/adarob/express-d.
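The E-step/M-step pair that a distributed implementation parallelizes can be sketched in plain, single-machine NumPy. This is a simplified model in which every compatible fragment-transcript alignment is equally likely (real likelihoods weight alignments by position and length), and `em_assign` is an invented name, not the eXpress-D API.

```python
import numpy as np

def em_assign(compat, n_iter=200):
    """EM sketch for fragment assignment: each read is compatible with a
    set of transcripts (compat[r, t] = 1) and EM estimates transcript
    abundances theta together with per-read responsibilities.

    Returns (theta, responsibilities). A distributed version runs the
    same two steps, with the E-step sharded across read partitions.
    """
    n_reads, n_tx = compat.shape
    theta = np.full(n_tx, 1.0 / n_tx)
    for _ in range(n_iter):
        w = compat * theta                      # E-step: unnormalized posteriors
        resp = w / w.sum(axis=1, keepdims=True) # normalize per read
        theta = resp.sum(axis=0) / n_reads      # M-step: expected read fractions
    return theta, resp
```

With three reads unique to transcript A, one unique to transcript B, and two ambiguous reads, EM converges to the exact maximum likelihood abundances (0.75, 0.25): the ambiguous reads are split in proportion to the abundances implied by the unique ones.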
Sample size calculation based on exact test for assessing differential expression analysis in RNA-seq data
Background: Sample size calculation is an important issue in the experimental design of biomedical research. For RNA-seq experiments, a sample size calculation method based on the Poisson model has been proposed; however, when there are biological replicates, RNA-seq data can exhibit variation significantly greater than the mean (i.e. over-dispersion). The Poisson model cannot appropriately model over-dispersion, and in such cases the negative binomial model has been used as a natural extension of the Poisson model. Because the field currently lacks a sample size calculation method based on the negative binomial model for assessing differential expression analysis of RNA-seq data, we propose such a method. Results: We propose a sample size calculation method based on the exact test for assessing differential expression analysis of RNA-seq data. Conclusions: The proposed sample size calculation method is straightforward and not computationally intensive. Simulation studies to evaluate the performance of the proposed method are presented; the results indicate that our method works well, achieving the desired power.
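The exact-test calculation itself has no simple closed form, but the flavor of an over-dispersion-aware sample size formula can be sketched with a textbook normal approximation for negative binomial counts, where the variance of a log mean count is roughly (1/mu + phi)/n per group. This is an assumed approximation for illustration, not the authors' exact-test method, and `nb_sample_size` is an invented name.

```python
from math import ceil, log
from statistics import NormalDist

def nb_sample_size(fold_change, mu, dispersion, alpha=0.05, power=0.8):
    """Per-group sample size to detect a given fold change in negative
    binomial RNA-seq counts, via the normal approximation
    n = 2 (z_{1-a/2} + z_{1-b})^2 (1/mu + phi) / ln(rho)^2,
    where mu is the mean count and phi the dispersion. A larger phi
    (more over-dispersion) inflates n, which the Poisson model misses.
    """
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha / 2), z(power)
    return ceil(2 * (za + zb) ** 2 * (1 / mu + dispersion) / log(fold_change) ** 2)
```

Setting the dispersion to zero recovers the Poisson-style requirement, making explicit how ignoring biological replication under-estimates the needed sample size.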
Background: Modern biological science generates a vast amount of data, the analysis of which presents a major challenge to researchers. Data are commonly represented in tables stored as plain text files and require line-by-line parsing for analysis, which is time consuming and error prone. Furthermore, there is no simple means of indexing these files so that rows containing particular values can be quickly found. Results: We introduce a new data format and software library called wormtable, which provides efficient access to tabular data in Python. Wormtable stores data in a compact binary format, provides random access to rows, and enables sophisticated indexing on columns within these tables. Files written in existing formats can be easily converted to wormtable format, and we provide conversion utilities for the VCF and GTF formats. Conclusions: Wormtable’s simple API allows users to process large tables orders of magnitude more quickly than is possible when parsing text. Furthermore, the indexing facilities provide efficient access to subsets of the data along with providing useful methods of summarising columns. Since third-party libraries or custom code are no longer needed to parse complex plain text formats, analysis code can also be substantially simpler as well as being uniform across different data formats. These benefits of reduced code complexity and greatly increased performance allow users much greater freedom to explore their data.
Background: Many potentially life-threatening infectious viruses are highly mutable in nature. Characterizing the fittest variants within a quasispecies from infected patients is expected to allow unprecedented opportunities to investigate the relationship between quasispecies diversity and disease epidemiology. The advent of next-generation sequencing technologies has allowed the study of virus diversity with high-throughput sequencing, although these methods come with higher rates of errors which can artificially inflate diversity. Results: Here we introduce a novel computational approach that incorporates base quality scores from next-generation sequencers to reconstruct viral genome sequences while simultaneously inferring the number of variants present within a quasispecies. Comparisons on simulated and clinical data on dengue virus suggest that the novel approach provides a more accurate inference of the underlying number of variants within the quasispecies, which is vital for clinical efforts in mapping within-host viral diversity. Sequence alignments generated by our approach also exhibit lower error rates. Conclusions: The ability to infer the viral quasispecies colony that is present within a human host provides the potential for a more accurate classification of the viral phenotype. Understanding the genomics of viruses will be relevant not just to studying how to control or even eradicate viral infectious diseases, but also to learning about the innate protection in the human host against viruses.
The causal effect of red blood cell folate on genome-wide methylation in cord blood: a Mendelian randomization approach
Background: Investigation of the biological mechanism by which folate acts to affect fetal development can inform appraisal of expected benefits and risk management. This research is ethically imperative given the ubiquity of folic acid fortified products in the US. Considering that folate is an essential component in the one-carbon metabolism pathway that provides methyl groups for DNA methylation, epigenetic modifications provide a putative molecular mechanism mediating the effect of folic acid supplementation on neonatal and pediatric outcomes. Results: In this study we use a Mendelian randomization (MR) approach to assess the effect of red blood cell (RBC) folate on genome-wide DNA methylation in cord blood. Site-specific CpG methylation within the proximal promoter regions of approximately 14,500 genes was analyzed using the Illumina Infinium HumanMethylation27 BeadChip for 50 infants from the Epigenetic Birth Cohort at Brigham and Women's Hospital in Boston. Using methylenetetrahydrofolate reductase genotype as the instrument, the MR approach identified 7 CpG loci with a significant (mostly positive) association between RBC folate and methylation level. Among the genes in closest proximity to this significant subset of CpG loci, several enriched biologic processes were involved in nucleic acid transport and metabolic processing. Compared to the standard ordinary least squares regression method, our estimates were demonstrated to be more robust to unmeasured confounding. Conclusions: To the authors' knowledge, this is the largest genome-wide analysis of the effects of folate on methylation pattern, and the first to employ Mendelian randomization to assess the effects of an exposure on epigenetic modifications. These results can help guide future analyses of the causal effects of periconceptional folate levels on candidate pathways.
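The study above uses a genetic variant as an instrument for the exposure. As a hedged sketch of the general technique (the paper's actual model is more involved), the simplest Mendelian randomization estimator is the Wald ratio: the regression slope of the outcome on the instrument divided by the slope of the exposure on the instrument. Here G stands for an instrument such as an MTHFR genotype coded 0/1/2, X for an exposure such as RBC folate, and Y for an outcome such as CpG methylation; the data below are invented toy values.

```python
# Wald-ratio Mendelian randomization, a minimal stdlib-only sketch.
# This is the textbook single-instrument estimator, not the study's
# exact analysis pipeline.

def slope(g, v):
    """Ordinary least-squares slope of v regressed on g."""
    n = len(g)
    mg, mv = sum(g) / n, sum(v) / n
    num = sum((gi - mg) * (vi - mv) for gi, vi in zip(g, v))
    den = sum((gi - mg) ** 2 for gi in g)
    return num / den

def wald_ratio(g, x, y):
    """Causal effect of exposure x on outcome y, instrumented by g."""
    return slope(g, y) / slope(g, x)

# Toy data: genotype g (0/1/2), exposure x rises with g, and the
# outcome y is constructed as exactly half of x.
g = [0, 0, 1, 1, 2, 2]
x = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
y = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(wald_ratio(g, x, y))  # recovers the causal effect 0.5
```

The key property, as the abstract notes, is robustness to unmeasured confounding: a confounder that shifts both x and y does not bias the estimate as long as it is independent of the genotype.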
Background: Ongoing technical developments in next-generation sequencing have led to an increase in detected disease-related mutations. Many bioinformatics approaches exist to analyse these variants, and among these, methods that use 3D structure information generally outperform those that do not. 3D structure information today is available for about twenty percent of the human exome, and homology modelling can double that fraction. This percentage is rapidly increasing, so that we can expect to analyse the majority of all human exome variants in the near future using protein structure information. Results: We collected a test dataset of well-described mutations in proteins for which 3D-structure information is available. This test dataset was used to analyse the possibilities and the limitations of methods based on sequence information alone, hybrid methods, machine learning based methods, and structure based methods. Conclusions: Our analysis shows that the use of structural features improves the classification of mutations. This study suggests strategies for future analyses of disease causing mutations, and it suggests which bioinformatics approaches should be developed to make progress in this field.