MLBio+Laboratory Machine Learning in Biomedical Informatics



BMC Bioinformatics

The latest research articles published by BMC Bioinformatics

PlanktoVision -- an automated analysis system for the identification of phytoplankton

Tue, 03/26/2013 - 20:00
Background: Phytoplankton communities are often used as a marker for the determination of freshwater quality. The routine analysis, however, is very time-consuming and expensive, as it is carried out manually by trained personnel. The goal of this work is to develop a system for an automated analysis. Results: A novel open source system for the automated recognition of phytoplankton by the use of microscopy and image analysis was developed. It integrates the segmentation of the organisms from the background, the calculation of a large range of features, and a neural network for the classification of imaged organisms into different groups of plankton taxa. The analysis of samples containing 10 different taxa showed an average recognition rate of 94.7% and an average error rate of 5.5%. The presented system has a flexible framework that easily allows it to be expanded to include additional taxa in the future. Conclusions: The implemented automated microscopy and the new open source image analysis system - PlanktoVision - showed classification results that were comparable to or better than those of existing systems, and the exclusion of non-plankton particles could be greatly improved. The software package is published as free software and is available to anyone to help make the analysis of water quality more reproducible and cost-effective.
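
As a rough illustration of the pipeline described above (segmentation, feature calculation, neural-network classification), here is a minimal sketch using scikit-image and scikit-learn. The thresholding scheme, feature set, and network size are illustrative assumptions, not PlanktoVision's actual implementation.

```python
# Sketch of a segment -> features -> classify pipeline; parameters are assumptions.
import numpy as np
from skimage import filters, measure
from sklearn.neural_network import MLPClassifier

def extract_features(image):
    """Segment objects from the background and compute simple shape features."""
    threshold = filters.threshold_otsu(image)   # global intensity threshold
    labels = measure.label(image > threshold)   # connected components = organisms
    features = []
    for region in measure.regionprops(labels):
        features.append([region.area, region.eccentricity,
                         region.solidity, region.perimeter])
    return np.array(features)

# Hypothetical usage: one feature row per imaged organism, one taxon label each.
# X_train = np.vstack([extract_features(img) for img in training_images])
# clf = MLPClassifier(hidden_layer_sizes=(50,)).fit(X_train, y_train)
# predicted_taxa = clf.predict(extract_features(new_image))
```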

Computing minimal nutrient sets from metabolic networks via linear constraint solving

Tue, 03/26/2013 - 20:00
Background: As more complete genome sequences become available, bioinformatics challenges arise in how to exploit genome sequences to make phenotypic predictions. One type of phenotypic prediction is to determine sets of compounds that will support the growth of a bacterium from the metabolic network inferred from the genome sequence of that organism. Results: We present a method for computationally determining alternative growth media for an organism based on its metabolic network and transporter complement. Our method predicted 787 alternative anaerobic minimal nutrient sets for Escherichia coli K-12 MG1655 from the EcoCyc database. The program automatically partitioned the nutrients within these sets into 21 equivalence classes, most of which correspond to compounds serving as sources of carbon, nitrogen, phosphorus, and sulfur, or combinations of these essential elements. The nutrient sets were predicted with 72.5% accuracy as evaluated by comparison with 91 growth experiments. Novel aspects of our approach include (a) exhaustive consideration of all combinations of nutrients rather than assuming that all element sources can substitute for one another (an assumption that can be invalid in general); (b) leveraging the notion of a machinery-duplicating constraint, namely, that all intermediate metabolites used in active reactions must be produced in increasing concentrations to prevent successive dilution from cell division; (c) the use of Satisfiability Modulo Theory solvers rather than Linear Programming solvers, because our approach cannot be formulated as linear programming; and (d) the use of Binary Decision Diagrams to produce an efficient implementation. Conclusions: Our method for generating minimal nutrient sets from the metabolic network and transporters of an organism combines linear constraint solving with binary decision diagrams to efficiently produce solution sets for given growth problems.
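
The constraint-solving idea can be illustrated with a toy encoding in Z3's Python API. This is not the authors' implementation, which adds the machinery-duplicating constraint, minimality, and binary decision diagrams; the two-reaction network and compound names below are invented.

```python
# Toy SMT encoding: a boolean per compound meaning "supplied or producible".
from z3 import Bools, Solver, Implies, And, Or, is_true, sat

glc, nh4, pyr, ala = Bools('glc nh4 pyr ala')

s = Solver()
s.add(Implies(glc, pyr))            # invented reaction: glc -> pyr
s.add(Implies(And(pyr, nh4), ala))  # invented reaction: pyr + nh4 -> ala
s.add(ala)                          # growth requires the biomass precursor

# Enumerate satisfying media over the candidate nutrients; a real
# implementation would additionally enforce minimality of each set.
while s.check() == sat:
    m = s.model()
    supplied = [str(v) for v in (glc, nh4, pyr)
                if is_true(m.evaluate(v, model_completion=True))]
    print(supplied)
    # Block this assignment so the next check finds a different medium.
    s.add(Or([v != m.evaluate(v, model_completion=True)
              for v in (glc, nh4, pyr)]))
```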

Using cited references to improve the retrieval of related biomedical documents

Tue, 03/26/2013 - 20:00
Background: A popular query from scientists reading a biomedical abstract is to search for topic-related documents in bibliographic databases. Such a query is challenging because the amount of information attached to a single abstract is small, whereas classification-based retrieval algorithms are optimally trained with large sets of relevant documents. As a solution to this problem, we propose a query expansion method that extends the information related to a manuscript using its cited references. Results: Data on cited references and text sections in 249,108 full-text biomedical articles were extracted from the Open Access subset of the PubMed Central(R) database (PMC-OA). Of the five standard sections of a scientific article, the Introduction and Discussion sections contained most of the citations (mean = 10.2 and 9.9 citations, respectively). A large proportion of articles (98.4%) and their cited references (79.5%) were indexed in the PubMed(R) database. Using the MedlineRanker abstract classification tool, cited references allowed accurate retrieval of the citing document in a test set of 10,000 documents and also of documents related to six biomedical topics defined by particular MeSH(R) terms from the entire PMC-OA (p-value < 0.01). Classification performance was sensitive to the topic and also to the text sections from which the references were selected. Classifiers trained on the baseline (i.e., only text from the query document and not from the references) were outperformed in almost all cases. The best performance was often obtained when using all cited references, though using the references from the Introduction and Discussion sections led to similarly good results. This query expansion method performed significantly better than pseudo-relevance feedback in 4 out of 6 topics. Conclusions: The retrieval of documents related to a single document can be significantly improved by using the references cited by this document (p-value < 0.01). Using references from the Introduction and Discussion performs almost as well as using all references, which might be useful for methods that require reduced datasets due to computational limitations. Cited references from particular sections might not be appropriate for all topics. Our method could be a better alternative to pseudo-relevance feedback, though it is limited by full-text availability.
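
As a schematic of the query-expansion step, the following sketch treats the query abstract plus its cited references as an expanded positive training set for a simple TF-IDF classifier. scikit-learn stands in for MedlineRanker here, and all input variables are assumed.

```python
# Query expansion via cited references, sketched with a TF-IDF classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rank_related(query_abstract, cited_reference_abstracts,
                 candidate_abstracts, background_abstracts):
    # Expanded positive set: the query document plus its cited references.
    positives = [query_abstract] + cited_reference_abstracts
    texts = positives + background_abstracts
    labels = [1] * len(positives) + [0] * len(background_abstracts)

    vec = TfidfVectorizer(stop_words='english')
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

    # Rank candidates by their probability of belonging to the topic.
    scores = clf.predict_proba(vec.transform(candidate_abstracts))[:, 1]
    return sorted(zip(scores, candidate_abstracts), reverse=True)
```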

A systematic comparison of the MetaCyc and KEGG pathway databases

Tue, 03/26/2013 - 20:00
Background: The MetaCyc and KEGG projects have developed large metabolic pathway databases that are used for a variety of applications, including genome analysis and metabolic engineering. We present a comparison of the compound, reaction, and pathway content of MetaCyc version 16.0 and a KEGG version downloaded on Feb-27-2012 to increase understanding of their relative sizes, their degree of overlap, and their scope. To assess their overlap, we must know the correspondences between compounds, reactions, and pathways in MetaCyc and those in KEGG. We devoted significant effort to computational and manual matching of these entities, and we evaluated the accuracy of the correspondences. Results: KEGG contains 179 module pathways versus 1,846 base pathways in MetaCyc; KEGG contains 237 map pathways versus 296 super pathways in MetaCyc. KEGG pathways contain 3.3 times as many reactions on average as do MetaCyc pathways, and the databases employ different conceptualizations of metabolic pathways. KEGG contains 8,692 reactions versus 10,262 for MetaCyc. 6,174 KEGG reactions are components of KEGG pathways versus 6,348 for MetaCyc. KEGG contains 16,586 compounds versus 11,991 for MetaCyc. 6,912 KEGG compounds act as substrates in KEGG reactions versus 8,891 for MetaCyc. MetaCyc contains a broader set of database attributes than does KEGG, such as relationships from a compound to enzymes that it regulates, identification of spontaneous reactions, and the expected taxonomic range of metabolic pathways. MetaCyc contains many pathways not found in KEGG, from plants, fungi, metazoa, and actinobacteria; KEGG contains pathways not found in MetaCyc, for xenobiotic degradation, glycan metabolism, and metabolism of terpenoids and polyketides. MetaCyc contains fewer unbalanced reactions, which facilitates metabolic modeling such as flux-balance analysis. MetaCyc includes generic reactions that may be instantiated computationally. Conclusions: KEGG contains significantly more compounds than does MetaCyc, whereas MetaCyc contains significantly more reactions and pathways than does KEGG; in particular, KEGG modules are quite incomplete. The numbers of reactions occurring in pathways in the two databases are quite similar.
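
The bookkeeping behind such an overlap analysis can be sketched as set operations over curated ID correspondences of the kind the authors describe building. All identifiers below are placeholders, not real database entries.

```python
# Overlap counting given a curated KEGG -> MetaCyc reaction-ID mapping.
kegg_reactions = {"R00001", "R00200", "R00300"}
metacyc_reactions = {"RXN-1", "RXN-2", "RXN-9"}
kegg_to_metacyc = {"R00200": "RXN-2"}   # correspondences from matching effort

shared = {k for k in kegg_reactions
          if kegg_to_metacyc.get(k) in metacyc_reactions}
kegg_only = kegg_reactions - shared
metacyc_only = metacyc_reactions - set(kegg_to_metacyc.values())
print(len(shared), len(kegg_only), len(metacyc_only))
```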

A benchmark server using high resolution protein structure data, and benchmark results for membrane helix predictions

Tue, 03/26/2013 - 20:00
Background: Helical membrane proteins are vital for the interaction of cells with their environment. Predicting the location of membrane helices in protein amino acid sequences provides substantial understanding of their structure and function and identifies membrane proteins in sequenced genomes. Currently there is no comprehensive benchmark tool for evaluating prediction methods, and there is no publication comparing all available prediction tools. Current benchmark literature is outdated, as recently determined membrane protein structures are not included. Current literature is also limited to global assessments, as specialised benchmarks for predicting specific classes of membrane proteins were not previously carried out. Description: We present a benchmark server at http://sydney.edu.au/pharmacy/sbio/software/TMH_benchmark.shtml that uses recent high-resolution protein structural data to provide a comprehensive assessment of the accuracy of existing membrane helix prediction methods. The server further allows a user to compare uploaded predictions generated by novel methods, permitting the comparison of these novel methods against all existing methods compared by the server. Benchmark metrics include sensitivity and specificity of predictions for membrane helix location and orientation, among others. The server allows for customised evaluations, such as assessing prediction method performance for specific helical membrane protein subtypes. We report results for custom benchmarks that illustrate how the server may be used for specialised benchmarks. Which prediction method performs best depends on which measure is being benchmarked. The OCTOPUS membrane helix prediction method is consistently one of the highest performing methods across all measures in the benchmarks that we performed. Conclusions: The benchmark server allows general and specialised assessment of existing and novel membrane helix prediction methods. Users can employ this benchmark server to determine the most suitable method for the type of prediction the user needs to perform, be it general whole-genome annotation or the prediction of specific types of helical membrane protein. Creators of novel prediction methods can use this benchmark server to evaluate the performance of their new methods. The benchmark server will be a valuable tool for researchers seeking to extract more sophisticated information from the large and growing protein sequence databases.
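
As an example of the kind of per-residue metric such a benchmark reports, a minimal sensitivity/specificity computation might look as follows. The 'M'/'o' annotation strings are an assumed encoding, not the server's format.

```python
# Per-residue sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP)
# for membrane-helix ('M') versus non-helix residues.
def helix_sensitivity_specificity(predicted, observed):
    """Compare two equal-length per-residue annotations, 'M' = in a helix."""
    pairs = list(zip(predicted, observed))
    tp = sum(p == 'M' and o == 'M' for p, o in pairs)
    fn = sum(p != 'M' and o == 'M' for p, o in pairs)
    tn = sum(p != 'M' and o != 'M' for p, o in pairs)
    fp = sum(p == 'M' and o != 'M' for p, o in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

# Example: a prediction that misses the first residue of an observed helix.
print(helix_sensitivity_specificity("ooMMMMoo", "oMMMMMoo"))  # (0.8, 1.0)
```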

Differential expression analysis for paired RNA-seq data

Tue, 03/26/2013 - 20:00
Background: RNA-Seq technology measures transcript abundance by generating sequence reads and counting their frequencies across different biological conditions. To identify differentially expressed genes between two conditions, it is important to consider the experimental design as well as the distributional properties of the data. In many RNA-Seq studies, the expression data are obtained as multiple pairs, e.g., pre- vs. post-treatment samples from the same individual. We seek to incorporate this paired structure into the analysis. Results: We present a Bayesian hierarchical mixture model for RNA-Seq data to separately account for the variability within and between individuals in a paired data structure. The method assumes a Poisson distribution for the data, mixed with a gamma distribution to account for variability between pairs. The effect of differential expression is modeled by a two-component mixture model. The performance of this approach is examined on simulated and real data. Conclusions: In this setting, our proposed model provides higher sensitivity than existing methods to detect differential expression. Application to real RNA-Seq data demonstrates the usefulness of this method for detecting expression alterations in genes with low average expression levels or shorter transcript lengths.
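
The distributional setup described above, Poisson counts whose rates vary across pairs according to a gamma distribution, can be simulated in a few lines. All parameter values below are invented for illustration.

```python
# Simulate paired counts for one gene: a gamma-distributed rate per
# individual (between-pair variability) with Poisson sampling (within-pair
# variability), and a fold change applied to the post-treatment sample.
import numpy as np

rng = np.random.default_rng(0)
n_pairs, fold_change = 10, 2.0

rates = rng.gamma(shape=5.0, scale=20.0, size=n_pairs)  # per-individual rates
pre_counts = rng.poisson(rates)                  # pre-treatment counts
post_counts = rng.poisson(rates * fold_change)   # post-treatment, DE gene
print(pre_counts, post_counts, sep='\n')
```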

TPMS: a set of utilities for querying collections of gene trees

Tue, 03/26/2013 - 20:00
Background: The information in large collections of phylogenetic trees is useful for many comparative genomic studies. Therefore, there is a need for flexible tools that allow exploration of such collections in order to retrieve relevant data as quickly as possible. Results: In this paper, we present TPMS (Tree Pattern-Matching Suite), a set of programs for handling and retrieving gene trees according to different criteria. The programs in the suite include utilities for tree collection building, specific tree-pattern search strategies, and tree rooting. Use of TPMS is illustrated through three examples: a systematic search for incongruencies in a large tree collection, a short study of the Coelomata/Ecdysozoa controversy, and an evaluation of the level of support for a recently published mammal phylogeny. Conclusions: TPMS is a powerful suite for quickly retrieving sets of trees that match complex patterns in large collections, and for rooting trees using more rigorous approaches than the classical midpoint method. As it consists of a set of command-line programs, it can easily be integrated into any sequence analysis pipeline for automated use.
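
TPMS itself is a set of command-line programs whose exact invocation is not shown here. As a point of reference for the midpoint-rooting baseline it aims to improve on, the same operation in the independent ete3 toolkit looks like this.

```python
# Classical midpoint rooting with ete3 (shown only for comparison; the
# newick string is an invented five-taxon example).
from ete3 import Tree

t = Tree("((A:1,B:1):1,((C:1,D:1):1,E:2):3);")
t.set_outgroup(t.get_midpoint_outgroup())  # root at the tree's midpoint
print(t)
```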

LASAGNA: A novel algorithm for transcription factor binding site alignment

Sat, 03/23/2013 - 20:00
Background: Scientists routinely scan DNA sequences for transcription factor (TF) binding sites (TFBSs). Most of the available tools rely on position-specific scoring matrices (PSSMs) constructed from aligned binding sites. Because of the resolutions of assays used to obtain TFBSs, databases such as TRANSFAC, ORegAnno and PAZAR store unaligned variable-length DNA segments containing binding sites of a TF. These DNA segments need to be aligned to build a PSSM. While the TRANSFAC database provides scoring matrices for TFs, nearly 78% of the TFs in the public release do not have matrices available. As work on TFBS alignment algorithms has been limited, it is highly desirable to have an alignment algorithm tailored to TFBSs. Results: We designed a novel algorithm named LASAGNA, which is aware of the lengths of input TFBSs and utilizes position dependence. Results on 189 TFs of 5 species in the TRANSFAC database showed that our method significantly outperformed ClustalW2 and MEME. We further compared a PSSM method dependent on LASAGNA to an alignment-free TFBS search method. Results on 89 TFs whose binding sites can be located in genomes showed that our method is significantly more precise at fixed recall rates. Finally, we describe LASAGNA-ChIP, a more sophisticated version for ChIP (chromatin immunoprecipitation) experiments. Under the one-per-sequence model, it showed performance comparable with MEME in discovering motifs in ChIP-seq peak sequences. Conclusions: We conclude that the LASAGNA algorithm is simple and effective in aligning variable-length binding sites. It has been integrated into a user-friendly web tool for TFBS search and visualization called LASAGNA-Search. The tool currently stores precomputed PSSM models for 189 TFs and 133 TFs built from TFBSs in the TRANSFAC Public database (release 7.0) and the ORegAnno database (08Nov10 dump), respectively. The web tool is available at: http://biogrid.engr.uconn.edu/lasagna_search/.
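
Downstream of the alignment step that LASAGNA addresses, a PSSM is typically built from column-wise nucleotide frequencies and used to scan sequences. The following sketch shows that standard downstream step (with a pseudocount and uniform background as assumptions), not LASAGNA's alignment algorithm itself.

```python
# Build a log-odds PSSM from aligned sites, then scan a sequence with it.
import math

def build_pssm(aligned_sites, pseudocount=1.0, background=0.25):
    pssm = []
    for i in range(len(aligned_sites[0])):
        column = [site[i] for site in aligned_sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / \
                   (len(column) + 4 * pseudocount)
            scores[base] = math.log2(freq / background)  # log-odds score
        pssm.append(scores)
    return pssm

def best_hit(pssm, sequence):
    """Return (score, offset) of the best-scoring window in the sequence."""
    w = len(pssm)
    return max((sum(pssm[i][sequence[j + i]] for i in range(w)), j)
               for j in range(len(sequence) - w + 1))

pssm = build_pssm(["TATAAT", "TATGAT", "TACAAT"])  # toy aligned sites
print(best_hit(pssm, "GGCTATAATGC"))               # finds TATAAT at offset 3
```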

Non-negative matrix factorization by maximizing correntropy for cancer clustering

Sat, 03/23/2013 - 20:00
Background: Non-negative matrix factorization (NMF) has been shown to be a powerful tool for clustering gene expression data, which are widely used to classify cancers. NMF aims to find two non-negative matrices whose product closely approximates the original matrix. Traditional NMF methods minimize either the l2 norm or the Kullback-Leibler distance between the product of the two matrices and the original matrix. Correntropy was recently shown to be an effective similarity measure due to its stability to outliers and noise. Results: We propose a maximum correntropy criterion (MCC)-based NMF method (NMF-MCC) for gene expression data-based cancer clustering. Instead of minimizing the l2 norm or the Kullback-Leibler distance, NMF-MCC maximizes the correntropy between the product of the two matrices and the original matrix. The optimization problem can be solved by an expectation conditional maximization algorithm. Conclusions: Extensive experiments on six cancer benchmark sets demonstrate that the proposed method is significantly more accurate than state-of-the-art methods in cancer clustering.
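
For contrast with the correntropy objective, here is the classical l2-norm NMF with Lee-Seung multiplicative updates, the kind of baseline that NMF-MCC replaces (the paper instead maximizes correntropy via expectation conditional maximization). The random data and rank are illustrative.

```python
# Baseline l2-norm NMF: factor non-negative V (genes x samples) as W @ H.
import numpy as np

def nmf_l2(V, rank, n_iter=200, eps=1e-10):
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(n_iter):
        # Multiplicative updates that monotonically reduce ||V - WH||_F.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Cluster assignment per sample: the dominant row of H.
V = np.abs(np.random.default_rng(1).normal(size=(100, 12)))
W, H = nmf_l2(V, rank=3)
print(H.argmax(axis=0))  # cluster labels for the 12 samples
```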

SMOTE for high-dimensional class-imbalanced data

Thu, 03/21/2013 - 20:00
Background: Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. The Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data. Results: While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is first reduced by some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how our findings impact class prediction for high-dimensional data. Conclusions: In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
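
The SMOTE interpolation step itself is compact enough to sketch in numpy: each synthetic sample lies on the segment between a minority-class sample and one of its k nearest minority-class neighbours (Euclidean distance, matching the setting discussed above). This is a didactic sketch, not the implementation evaluated in the paper.

```python
# Generate synthetic minority samples by interpolating between neighbours.
import numpy as np

def smote(X_minority, n_synthetic, k=5, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-neighbours
    neighbours = np.argsort(d, axis=1)[:, :k]    # k nearest per sample
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))
        j = rng.choice(neighbours[i])
        synthetic.append(X[i] + rng.random() * (X[j] - X[i]))  # interpolate
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(size=(10, 4))
print(smote(X_min, n_synthetic=20).shape)  # (20, 4)
```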

