The latest research articles published by BMC Bioinformatics
A Generalizable NLP Framework for Fast Development of Pattern-based Biomedical Relation Extraction Systems
Background: Text mining is increasingly used in the biomedical domain because of its ability to automatically gather information from large amount of scientific articles. One important task in biomedical text mining is relation extraction, which aims to identify designated relations among biological entities reported in literature. A relation extraction system achieving high performance is expensive to develop because of the substantial time and effort required for its design and implementation. Here, we report a novel framework to facilitate the development of a pattern-based biomedical relation extraction system. It has several unique design features: (1) leveraging syntactic variations possible in a language and automatically generating extraction patterns in a systematic manner, (2) applying sentence simplification to improve the coverage of extraction patterns, and (3) identifying referential relations between a syntactic argument of a predicate and the actual target expected in the relation extraction task. Results: A relation extraction system derived using the proposed framework achieved overall F-scores of 72.66% for the Simple events and 55.57% for the Binding events on the BioNLP-ST 2011 GE test set, comparing favorably with the top performing systems that participated in the BioNLP-ST 2011 GE task. We obtained similar results on the BioNLP-ST 2013 GE test set (80.07% and 60.58%, respectively). We conducted additional experiments on the training and development sets to provide a more detailed analysis of the system and its individual modules. This analysis indicates that without increasing the number of patterns, simplification and referential relation linking play a key role in the effective extraction of biomedical relations. Conclusions: In this paper, we present a novel framework for fast development of relation extraction systems. The framework requires only a list of triggers as input, and does not need information from an annotated corpus. Thus, we reduce the involvement of domain experts, who would otherwise have to provide manual annotations and help with the design of hand crafted patterns. We demonstrate how our framework is used to develop a system which achieves state-of-the-art performance on a public benchmark corpus.
Background: Short oligonucleotides can be used as markers to tag and track DNA sequences. For example, barcoding techniques (i.e. Multiplex Identifiers or Indexing) use short oligonucleotides to distinguish between reads from different DNA samples pooled for high-throughput sequencing. A similar technique called molecule tagging uses the same principles but is applied to individual DNA template molecules. Each template molecule is tagged with a unique oligonucleotide prior to polymerase chain reaction. The resulting amplicon sequences can be traced back to their original templates by their oligonucleotide tag. Consensus building from sequences sharing the same tag enables inference of original template molecules thereby reducing effects of sequencing error and polymerase chain reaction bias. Several independent groups have developed similar protocols for molecule tagging; however, user-friendly software for build consensus sequences from molecule tagged reads is not readily available or is highly specific for a particular protocol. Results: MT-Toolbox recognizes oligonucleotide tags in amplicons and infers the correct template sequence. On a set of molecule tagged test reads, MT-Toolbox generates sequences having on average 0.00047 errors per base. MT-Toolbox includes a graphical user interface, command line interface, and options for speed and accuracy maximization. It can be run in serial on a standard personal computer or in parallel on a Load Sharing Facility based cluster system. An optional plugin provides features for common 16S metagenome profiling analysis such as chimera filtering, building operational taxonomic units, contaminant removal, and taxonomy assignments. Conclusions: MT-Toolbox provides an accessible, user-friendly environment for analysis of molecule tagged reads thereby reducing technical errors and polymerase chain reaction bias. These improvements reduce noise and allow for greater precision in single amplicon sequencing experiments.
Impact of variance components on reliability of absolute quantification using digital PCR
Background: Digital polymerase chain reaction (dPCR) is an increasingly popular technology for detecting andquantifying target nucleic acids. Its advertised strength is high precision absolute quantification withoutneeding reference curves. The standard data analytic approach follows a seemingly straightforwardtheoretical framework but ignores sources of variation in the data generating process. Thesestem from both technical and biological factors, where we distinguish features that are 1) hard-wiredin the equipment, 2) user-dependent and 3) provided by manufacturers but may be adapted by theuser. The impact of the corresponding variance components on the accuracy and precision of targetconcentration estimators presented in the literature is studied through simulation. Results: We reveal how system-specific technical factors influence accuracy as well as precision of concentrationestimates. We find that a well-chosen sample dilution level and modifiable settings such asthe fluorescence cut-off for target copy detection have a substantial impact on reliability and can beadapted to the sample analysed in ways that matter. User-dependent technical variation, includingpipette inaccuracy and specific sources of sample heterogeneity, leads to a steep increase in uncertaintyof estimated concentrations. Users can discover this through replicate experiments and derivedvariance estimation. Finally, the detection performance can be improved by optimizing the fluorescenceintensity cut point as suboptimal thresholds reduce the accuracy of concentration estimatesconsiderably. Conclusions: Like any other technology, dPCR is subject to variation induced by natural perturbations, systematicsettings as well as user-dependent protocols. Corresponding uncertainty may be controlled with anadapted experimental design. Our findings point to modifiable key sources of uncertainty that forman important starting point for the development of guidelines on dPCR design and data analysis with correct precision bounds. Besides clever choices of sample dilution levels, experiment-specific tuningof machine settings can greatly improve results. Well-chosen data-driven fluorescence intensitythresholds in particular result in major improvements in target presence detection. We call on manufacturersto provide sufficiently detailed output data that allows users to maximize the potential of themethod in their setting and obtain high precision and accuracy for their experiments.
Background: In past, a number of peptides have been reported to possess highly diverse properties ranging from cell penetrating, tumor homing, anticancer, anti-hypertensive, antiviral to antimicrobials. Owing to their excellent specificity, low-toxicity, rich chemical diversity and availability from natural sources, FDA has successfully approved a number of peptide-based drugs and several are in various stages of drug development. Though peptides are proven good drug candidates, their usage is still hindered mainly because of their high susceptibility towards proteases degradation. We have developed an in silico method to predict the half-life of peptides in intestine-like environment and to design better peptides having optimized physicochemical properties and half-life. Results: In this study, we have used 10mer (HL10) and 16mer (HL16) peptides dataset to develop prediction models for peptide half-life in intestine-like environment. First, SVM based models were developed on HL10 dataset which achieved maximum correlation R/R2 of 0.57/0.32, 0.68/0.46, and 0.69/0.47 using amino acid, dipeptide and tripeptide composition, respectively. Secondly, models developed on HL16 dataset showed maximum R/R2 of 0.91/0.82, 0.90/0.39, and 0.90/0.31 using amino acid, dipeptide and tripeptide composition, respectively. Furthermore, models that were developed on selected features, achieved a correlation (R) of 0.70 and 0.98 on HL10 and HL16 dataset, respectively. Preliminary analysis suggests the role of charged residue and amino acid size in peptide half-life/stability. Based on above models, we have developed a web server named HLP (Half Life Prediction), for predicting and designing peptides with desired half-life. The web server provides three facilities; i) half-life prediction, ii) physicochemical properties calculation and iii) designing mutant peptides. Conclusion: In summary, this study describes a web server 'HLP' that has been developed for assisting scientific community for predicting intestinal half-life of peptides and to design mutant peptides with better half-life and physicochemical properties. HLP models were trained using a dataset of peptides whose half-lives have been determined experimentally in crude intestinal proteases preparation. Thus, HLP server will help in designing peptides possessing the potential to be administered via oral route (http://www.imtech.res.in/raghava/hlp/ ).
Background: The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, thus making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed but many of them use the number of read pairs supporting a linking of two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features.We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance. Results: We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single stand-alone scaffolder outperforms the others on all datasets. However, BESST fares favorably to the other tested scaffolders on GAGE datasets and, moreover, outperforms the other methods when library insert size distribution is wide. Conclusion: We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.
Background: Chromatin immunoprecipitation (ChIP) followed by next-generation sequencing (ChIP-Seq) has been widely used to identify genomic loci of transcription factor (TF) binding and histone modifications. ChIP-Seq data analysis involves multiple steps from read mapping and peak calling to data integration and interpretation. It remains challenging and time-consuming to process large amounts of ChIP-Seq data derived from different antibodies or experimental designs using the same approach. To address this challenge, there is a need for a comprehensive analysis pipeline with flexible settings to accelerate the utilization of this powerful technology in epigenetics research. Results: We have developed a highly integrative pipeline, termed HiChIP for systematic analysis of ChIP-Seq data. HiChIP incorporates several open source software packages selected based on internal assessments and published comparisons. It also includes a set of tools developed in-house. This workflow enables the analysis of both paired-end and single-end ChIP-Seq reads, with or without replicates for the characterization and annotation of both punctate and diffuse binding sites. The main functionality of HiChIP includes: (a) read quality checking; (b) read mapping and filtering; (c) peak calling and peak consistency analysis; and (d) result visualization. In addition, this pipeline contains modules for generating binding profiles over selected genomic features, de novo motif finding from transcription factor (TF) binding sites and functional annotation of peak associated genes. Conclusions: HiChIP is a comprehensive analysis pipeline that can be configured to analyze ChIP-Seq data derived from varying antibodies and experiment designs. Using public ChIP-Seq data we demonstrate that HiChIP is a fast and reliable pipeline for processing large amounts of ChIP-Seq data.
SPARQLGraph: a web-based platform for graphically querying biological Semantic Web databases
Background: Semantic Web has established itself as a framework for using and sharing data across applications and database boundaries. Here, we present a web-based platform for querying biological Semantic Web databases in a graphical way. Results: SPARQLGraph offers an intuitive drag & drop query builder, which converts the visual graph into a query and executes it on a public endpoint. The tool integrates several publicly available Semantic Web databases, including the databases of the just recently released EBI RDF platform. Furthermore, it provides several predefined template queries for answering biological questions. Users can easily create and save new query graphs, which can also be shared with other researchers. Conclusions: This new graphical way of creating queries for biological Semantic Web databases considerably facilitates usability as it removes the requirement of knowing specific query languages and database structures. The system is freely available at http://sparqlgraph.i-med.ac.at.
Background: UniFrac is a well-known tool for comparing microbial communities and assessing statistically significant differences between communities. In this paper we identify a discrepancy in the UniFrac methodology that causes semantically equivalent inputs to produce different outputs in tests of statistical significance. Results: The phylogenetic trees that are input into UniFrac may or may not contain abundance counts. An isomorphic transform can be defined that will convert trees between these two formats without altering the semantic meaning of the trees. UniFrac produces different outputs for these equivalent forms of the same input tree. This is illustrated using metagenomics data from a lake sediment study. Conclusions: Results from the UniFrac tool can vary greatly for the same input depending on the arbitrary choice of input format. Practitioners should be aware of this issue and use the tool with caution to ensure consistency and validity in their analyses. We provide a script to transform inputs between equivalent formats to help researchers achieve this consistency.
CRF-based models of protein surfaces improve protein-protein interaction site predictions
Background: The identification of protein-protein interaction sites is a computationally challenging task and importantfor understanding the biology of protein complexes. There is a rich literature in this field. A broadclass of approaches assign to each candidate residue a real-valued score that measures how likely it isthat the residue belongs to the interface. The prediction is obtained by thresholding this score.Some probabilistic models classify the residues on the basis of the posterior probabilities. In thispaper, we introduce pairwise conditional random fields (pCRFs) in which edges are not restrictedto the backbone as in the case of linear-chain CRFs utilized by Li et al. (2007). In fact, any 3Dneighborhoodrelation can be modeled. On grounds of a generalized Viterbi inference algorithm anda piecewise training process for pCRFs, we demonstrate how to utilize pCRFs to enhance a givenresidue-wise score-based protein-protein interface predictor on the surface of the protein under study.The features of the pCRF are solely based on the interface predictions scores of the predictor theperformance of which shall be improved. Results: We performed three sets of experiments with synthetic scores assigned to the surface residues ofproteins taken from the data set PlaneDimers compiled by Zellner et al. (2011), from the list publishedby Keskin et al. (2004) and from the very recent data set due to Cukuroglu et al. (2014). That way wedemonstrated that our pCRF-based enhancer is effective given the interface residue score distributionand the non-interface residue score are unimodal.Moreover, the pCRF-based enhancer is also successfully applicable, if the distributions are only unimodalover a certain sub-domain. The improvement is then restricted to that domain. Thus we wereable to improve the prediction of the PresCont server devised by Zellner et al. (2011) on PlaneDimers. Conclusions: Our results strongly suggest that pCRFs form a methodological framework to improve residue-wisescore-based protein-protein interface predictors given the scores are appropriately distributed. A prototypicalimplementation of our method is accessible at http://ppicrf.informatik.uni-goettingen.de/index.html.
Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks
Background: Time series models can play an important role in disease prediction. Incidence data can be used to predict the future occurrence of disease events. Developments in modeling approaches provide an opportunity to compare different time series models for predictive power. Results: We applied ARIMA and Random Forest time series models to incidence data of outbreaks of highly pathogenic avian influenza (H5N1) in Egypt, available through the online EMPRES-I system. We found that the Random Forest model outperformed the ARIMA model in predictive ability. Furthermore, we found that the Random Forest model is effective for predicting outbreaks of H5N1 in Egypt. Conclusions: Random Forest time series modeling provides enhanced predictive ability over existing time series models for the prediction of infectious disease outbreaks. This result, along with those showing the concordance between bird and human outbreaks (Rabinowitz et al. 2012), provides a new approach to predicting these dangerous outbreaks in bird populations based on existing, freely available data. Our analysis uncovers the time-series structure of outbreak severity for highly pathogenic avain influenza (H5N1) in Egypt.
Spot quantification in two dimensional gel electrophoresis image analysis: comparison of different approaches and presentation of a novel compound fitting algorithm
Background: Various computer-based methods exist for the detection and quantification of protein spots in two dimensional gel electrophoresis images. Area-based methods are commonly used for spot quantification: an area is assigned to each spot and the sum of the pixel intensities in that area, the so-called volume, is used a measure for spot signal. Other methods use the optical density, i.e. the intensity of the most intense pixel of a spot, or calculate the volume from the parameters of a fitted function. Results: In this study we compare the performance of different spot quantification methods using synthetic and real data. We propose a ready-to-use algorithm for spot detection and quantification that uses fitting of two dimensional Gaussian function curves for the extraction of data from two dimensional gel electrophoresis (2-DE) images. The algorithm implements fitting using logical compounds and is computationally efficient. The applicability of the compound fitting algorithm was evaluated for various simulated data and compared with other quantification approaches. We provide evidence that even if an incorrect bell-shaped function is used, the fitting method is superior to other approaches, especially when spots overlap. Finally, we validated the method with experimental data of urea-based 2-DE of Abeta peptides andre-analyzed published data sets. Our methods showed higher precision and accuracy than other approaches when applied to exposure time series and standard gels. Conclusion: Compound fitting as a quantification method for 2-DE spots shows several advantages over other approaches and could be combined with various spot detection methods.The algorithm was scripted in MATLAB (Mathworks) and is available as a supplemental file.
Background: As resequencing projects become more prevalent across a larger number of species, accurate variantidentification will further elucidate the nature of genetic diversity and become increasingly relevantin genomic studies. However, the identification of larger genomic variants via DNA sequencing islimited by both the incomplete information provided by sequencing reads and the nature of the genomeitself. Long-read sequencing technologies provide high-resolution access to structural variants ofteninaccessible to shorter reads. Results: We present PBHoney, software that considers both intra-read discordance and soft-clipped tails of longreads (> 10, 000 bp) to identify structural variants. As a proof of concept, we identify four structuralvariants and two genomic features in a strain of Escherichia coli with PBHoney and validate them viade novo assembly. PBHoney is available for download at http://sourceforge.net/projects/pb-jelly/; Conclusions: Implementing two variant-identification approaches that exploit the high mappability of long reads,PBHoney is demonstrated as being effective at detecting larger structural variants using wholegenomePacific Biosciences RS II Continuous Long Reads. Furthermore, PBHoney is able to discovertwo genomic features: the existence of Rac-Phage in isolate; evidence of E. coli¿s circular genome.
A novel method for gathering and prioritizing disease candidate genes based on construction of a set of disease-related MeSH(R) terms
Background: Understanding the molecular mechanisms involved in disease is critical for the development of more effective and individualized strategies for prevention and treatment. The amount of disease-related literature, including new genetic information on the molecular mechanisms of disease, is rapidly increasing. Extracting beneficial information from literature can be facilitated by computational methods such as the knowledge-discovery approach. Several methods for mining gene-disease relationships using computational methods have been developed, however, there has been a lack of research evaluating specific disease candidate genes. Results: We present a novel method for gathering and prioritizing specific disease candidate genes. Our approach involved the construction of a set of Medical Subject Headings (MeSH) terms for the effective retrieval of publications related to a disease candidate gene. Information regarding the relationships between genes and publications was obtained from the gene2pubmed database. The set of genes was prioritized using a "weighted literature score" based on the number of publications and weighted by the number of genes occurring in a publication. Using our method for the disease states of pain and Alzheimer's disease, a total of 1101 pain candidate genes and 2810 Alzheimer's disease candidate genes were gathered and prioritized. The precision was 0.30 and the recall was 0.89 in the case study of pain. The precision was 0.04 and the recall was 0.6 in the case study of Alzheimer's disease. The precision-recall curve indicated that the performance of our method was superior to that of other publicly available tools. Conclusions: Our method, which involved the use of a set of MeSH terms related to disease candidate genes and a novel weighted literature score, improved the accuracy of gathering and prioritizing candidate genes by focusing on a specific disease.
Design of a flexible component gathering algorithm for converting cell-based models to graph representations for use in evolutionary search
Background: The ability of science to produce experimental data has outpaced the ability to effectively visualize and integrate the data into a conceptual framework that can further higher order understanding. Multidimensional and shape-based observational data of regenerative biology presents a particularly daunting challenge in this regard. Large amounts of data are available in regenerative biology, but little progress has been made in understanding how organisms such as planaria robustly achieve and maintain body form. An example of this kind of data can be found in a new repository (PlanformDB) that encodes descriptions of planaria experiments and morphological outcomes using a graph formalism. Results: We are developing a model discovery framework that uses a cell-based modeling platform combined with evolutionary search to automatically search for and identify plausible mechanisms for the biological behavior described in PlanformDB. To automate the evolutionary search we developed a way to compare the output of the modeling platform to the morphological descriptions stored in PlanformDB. We used a flexible connected component algorithm to create a graph representation of the virtual worm from the robust, cell-based simulation data. These graphs can then be validated and compared with target data from PlanformDB using the well-known graph-edit distance calculation, which provides a quantitative metric of similarity between graphs. The graph edit distance calculation was integrated into a fitness function that was able to guide automated searches for unbiased models of planarian regeneration. We present a cell-based model of planarian that can regenerate anatomical regions following bisection of the organism, and show that the automated model discovery framework is capable of searching for and finding models of planarian regeneration that match experimental data stored in PlanformDB. Conclusion: The work presented here, including our algorithm for converting cell-based models into graphs for comparison with data stored in an external data repository, has made feasible the automated development, training, and validation of computational models using morphology-based data. This work is part of an ongoing project to automate the search process, which will greatly expand our ability to identify, consider, and test biological mechanisms in the field of regenerative biology.
Background: Networks of interacting genes and gene products mediate most cellular and developmental processes. High throughput screening methods combined with literature curation are identifying many of the protein-protein interactions (PPI) and protein-DNA interactions (PDI) that constitute these networks. Most of the detection methods, however, fail to identify the in vivo spatial or temporal context of the interactions. Thus, the interaction data are a composite of the individual networks that may operate in specific tissues or developmental stages. Genome-wide expression data may be useful for filtering interaction data to identify the subnetworks that operate in specific spatial or temporal contexts. Here we take advantage of the extensive interaction and expression data available for Drosophila to analyze how interaction networks may be unique to specific tissues and developmental stages. Results: We ranked genes on a scale from ubiquitously expressed to tissue or stage specific and examined their interaction patterns. Interestingly, ubiquitously expressed genes have many more interactions among themselves than do non-ubiquitously expressed genes both in PPI and PDI networks. While the PDI network is enriched for interactions between tissue-specific transcription factors and their tissue-specific targets, a preponderance of the PDI interactions are between ubiquitous and non-ubiquitously expressed genes and proteins. In contrast to PDI, PPI networks are depleted for interactions among tissue- or stage- specific proteins, which instead interact primarily with widely expressed proteins. In light of these findings, we present an approach to filter interaction data based on gene expression levels normalized across tissues or developmental stages. We show that this filter (the percent maximum or pmax filter) can be used to identify subnetworks that function within individual tissues or developmental stages. Conclusions: These observations suggest that protein networks are frequently organized into hubs of widely expressed proteins to which are attached various tissue- or stage-specific proteins. This is consistent with earlier analyses of human PPI data and suggests a similar organization of interaction networks across species. This organization implies that tissue or stage specific networks can be best identified from interactome data by using filters designed to include both ubiquitously expressed and specifically expressed genes and proteins.
Background: Ongoing advancements in cloud computing provide novel opportunities in scientific computing, especially for distributed workflows. Modern web browsers can now be used as high-performance workstations for querying, processing, and visualizing genomics' "Big Data" from sources like The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) without local software installation or configuration. The design of QMachine (QM) was driven by the opportunity to use this pervasive computing model in the context of the Web of Linked Data in Biomedicine. Results: QM is an open-sourced, publicly available web service that acts as a messaging system for posting tasks and retrieving results over HTTP. The illustrative application described here distributes the analyses of 20 Streptococcus pneumoniae genomes for shared suffixes. Because all analytical and data retrieval tasks are executed by volunteer machines, few server resources are required. Any modern web browser can submit those tasks and/or volunteer to execute them without installing any extra plugins or programs. A client library provides high-level distribution templates including MapReduce. This stark departure from the current reliance on expensive server hardware running "download and install" software has already gathered substantial community interest, as QM received more than 2.2 million API calls from 87 countries in 12 months. Conclusions: QM was found adequate to deliver the sort of scalable bioinformatics solutions that computation- and data-intensive workflows require. Paradoxically, the sandboxed execution of code by web browsers was also found to enable them, as compute nodes, to address critical privacy concerns that characterize biomedical environments.
SMARTPOP: inferring the impact of social dynamics on genetic diversity through high speed simulations
Background: Social behavior has long been known to influence patterns of genetic diversity, but the effect of social processes on population genetics remains poorly quantified - partly due to limited community-level genetic sampling (which is increasingly being remedied), and partly to a lack of fast simulation software to jointly model genetic evolution and complex social behavior, such as marriage rules. Results: To fill this gap, we have developed SMARTPOP - a fast, forward-in-time genetic simulator - to facilitate large-scale statistical inference on interactions between social factors, such as mating systems, and population genetic diversity. By simultaneously modeling genetic inheritance and dynamic social processes at the level of the individual, SMARTPOP can simulate a wide range of genetic systems (autosomal, X-linked, Y chromosomal and mitochondrial DNA) under a range of mating systems and demographic models. Specifically designed to enable resource-intensive statistical inference tasks, such as Approximate Bayesian Computation, SMARTPOP has been coded in C++ and is heavily optimized for speed and reduced memory usage. Conclusion: SMARTPOP rapidly simulates population genetic data under a wide range of demographic scenarios and social behaviors, thus allowing quantitative analyses to address complex socio-ecological questions.
Scan for Motifs: a webserver for the analysis of post-transcriptional regulatory elements in the 3[prime] untranslated regions (3[prime] UTRs) of mRNAs
Background: Gene expression in vertebrate cells may be controlled post-transcriptionally through regulatory elements in mRNAs. These are usually located in the untranslated regions (UTRs) of mRNA sequences, particularly the 3[prime]UTRs. Results: Scan for Motifs (SFM) simplifies the process of identifying a wide range of regulatory elements on alignments of vertebrate 3[prime]UTRs. SFM includes identification of both RNA Binding Protein (RBP) sites and targets of miRNAs. In addition to searching pre-computed alignments, the tool provides users the flexibility to search their own sequences or alignments. The regulatory elements may be filtered by expected value cutoffs and are cross-referenced back to their respective sources and literature. The output is an interactive graphical representation, highlighting potential regulatory elements and overlaps between them. The output also provides simple statistics and links to related resources for complementary analyses. The overall process is intuitive and fast. As SFM is a free web-application, the user does not need to install any software or databases. Conclusions: Visualisation of the binding sites of different classes of effectors that bind to 3[prime]UTRs will facilitate the study of regulatory elements in 3[prime] UTRs.
Background: Human disease often arises as a consequence of alterations in a set of associated genes rather than alterations to a set of unassociated individual genes. Most previous microarray-based meta-analyses identified disease-associated genes or biomarkers independent of genetic interactions. Therefore, in this study, we present the first meta-analysis method capable of taking gene combination effects into account to efficiently identify associated biomarkers (ABs) across different microarray platforms. Results: We propose a new meta-analysis approach called MiningABs to mine ABs across different array-based datasets. The similarity between paired probe sequences is quantified as a bridge to connect these datasets together. The ABs can be subsequently identified from an "improved" common logit model (c-LM) by combining several sibling-like LMs in a heuristic genetic algorithm selection process. Our approach is evaluated with two sets of gene expression datasets: i) 4 esophageal squamous cell carcinoma and ii) 3 hepatocellular carcinoma datasets. Based on an unbiased reciprocal test, we demonstrate that each gene in a group of ABs is required to maintain high cancer sample classification accuracy, and we observe that ABs are not limited to genes common to all platforms. Investigating the ABs using Gene Ontology (GO) enrichment, literature survey, and network analyses indicated that our ABs are not only strongly related to cancer development but also highly connected in a diverse network of biological interactions. Conclusions: The proposed meta-analysis method called MiningABs is able to efficiently identify ABs from different independently performed array-based datasets, and we show its validity in cancer biology via GO enrichment, literature survey and network analyses. We postulate that the ABs may facilitate novel target and drug discovery, leading to improved clinical treatment. Java source code, tutorial, example and related materials are available at "http://sourceforge.net/projects/miningabs/".
hsphase: an R package for pedigree reconstruction, detection of recombination events, phasing and imputation of half-sib family groups
Background: Identification of recombination events and which chromosomal segments contributed to an individual is useful for a number of applications in genomic analyses including haplotyping, imputation, signatures of selection, and improved estimates of relationship and probability of identity by descent. Genotypic data on half-sib family groups are widely available in livestock genomics. This structure makes it possible to identify recombination events accurately even with only a few individuals and it lends itself well to a range of applications such as parentage assignment and pedigree verification. Results: Here we present hsphase, an R package that exploits the genetic structure found in half-sib livestock data to identify and count recombination events, impute and phase un-genotyped sires and phase its offspring. The package also allows reconstruction of family groups (pedigree inference), identification of pedigree errors and parentage assignment. Additional functions in the package allow identification of genomic mapping errors, imputation of paternal high density genotypes from low density genotypes, evaluation of phasing results either from hsphase or from other phasing programs. Various diagnostic plotting functions permit rapid visual inspection of results and evaluation of datasets. Conclusion: The hsphase package provides a suite of functions for analysis and visualization of genomic structures in half-sib family groups implemented in the widely used R programming environment. Low level functions were implemented in C++ and parallelized to improve performance. hsphase was primarily designed for use with high density SNP array data but it is fast enough to run directly on sequence data once they become more widely available. The package is available (GPL 3) from the Comprehensive R Archive Network (CRAN) or from http://www-personal.une.edu.au/~cgondro2/hsphase.htm.