Nucleic Acids Research
POGO-DB (http://pogo.ece.drexel.edu/) provides an easy platform for comparative microbial genomics. POGO-DB allows users to compare genomes using pre-computed metrics that were derived from extensive computationally intensive BLAST comparisons of >2000 microbes. These metrics include (i) average protein sequence identity across all orthologs shared by two genomes, (ii) genomic fluidity (a measure of gene content dissimilarity), (iii) number of ‘orthologs’ shared between two genomes, (iv) pairwise identity of the 16S ribosomal RNA genes and (v) pairwise identity of an additional 73 marker genes present in >90% prokaryotes. Users can visualize these metrics against each other in a 2D plot for exploratory analysis of genome similarity and of how different aspects of genome similarity relate to each other. The results of these comparisons are fully downloadable. In addition, users can download raw BLAST results for all or user-selected comparisons. Therefore, we provide users with full flexibility to carry out their own downstream analyses, by creating easy access to data that would normally require heavy computational resources to generate. POGO-DB should prove highly useful for researchers interested in comparative microbiology and benefit the microbiome/metagenomic communities by providing the information needed to select suitable phylogenetic marker genes within particular lineages.
Ribosomal Database Project (RDP; http://rdp.cme.msu.edu/) provides the research community with aligned and annotated rRNA gene sequence data, along with tools to allow researchers to analyze their own rRNA gene sequences in the RDP framework. RDP data and tools are utilized in fields as diverse as human health, microbial ecology, environmental microbiology, nucleic acid chemistry, taxonomy and phylogenetics. In addition to aligned and annotated collections of bacterial and archaeal small subunit rRNA genes, RDP now includes a collection of fungal large subunit rRNA genes. RDP tools, including Classifier and Aligner, have been updated to work with this new fungal collection. The use of high-throughput sequencing to characterize environmental microbial populations has exploded in the past several years, and as sequence technologies have improved, the sizes of environmental datasets have increased. With release 11, RDP is providing an expanded set of tools to facilitate analysis of high-throughput data, including both single-stranded and paired-end reads. In addition, most tools are now available as open source packages for download and local use by researchers with high-volume needs or who would like to develop custom analysis pipelines.
SILVA (from Latin silva, forest, http://www.arb-silva.de) is a comprehensive resource for up-to-date quality-controlled databases of aligned ribosomal RNA (rRNA) gene sequences from the Bacteria, Archaea and Eukaryota domains and supplementary online services. SILVA provides a manually curated taxonomy for all three domains of life, based on representative phylogenetic trees for the small- and large-subunit rRNA genes. This article describes the improvements the SILVA taxonomy has undergone in the last 3 years. Specifically we are focusing on the curation process, the various resources used for curation and the comparison of the SILVA taxonomy with Greengenes and RDP-II taxonomies. Our comparisons not only revealed a reasonable overlap between the taxa names, but also points to significant differences in both names and numbers of taxa between the three resources.
The COLOMBOS database (http://www.colombos.net) features comprehensive organism-specific cross-platform gene expression compendia of several bacterial model organisms and is supported by a fully interactive web portal and an extensive web API. COLOMBOS was originally published in PLoS One, and COLOMBOS v2.0 includes both an update of the expression data, by expanding the previously available compendia and by adding compendia for several new species, and an update of the surrounding functionality, with improved search and visualization options and novel tools for programmatic access to the database. The scope of the database has also been extended to incorporate RNA-seq data in our compendia by a dedicated analysis pipeline. We demonstrate the validity and robustness of this approach by comparing the same RNA samples measured in parallel using both microarrays and RNA-seq. As far as we know, COLOMBOS currently hosts the largest homogenized gene expression compendia available for seven bacterial model organisms.
We have recently developed a new version of the DOOR operon database, DOOR 2.0, which is available online at http://csbl.bmb.uga.edu/DOOR/ and will be updated on a regular basis. DOOR 2.0 contains genome-scale operons for 2072 prokaryotes with complete genomes, three times the number of genomes covered in the previous version published in 2009. DOOR 2.0 has a number of new features, compared with its previous version, including (i) more than 250 000 transcription units, experimentally validated or computationally predicted based on RNA-seq data, providing a dynamic functional view of the underlying operons; (ii) an integrated operon-centric data resource that provides not only operons for each covered genome but also their functional and regulatory information such as their cis-regulatory binding sites for transcription initiation and termination, gene expression levels estimated based on RNA-seq data and conservation information across multiple genomes; (iii) a high-performance web service for online operon prediction on user-provided genomic sequences; (iv) an intuitive genome browser to support visualization of user-selected data; and (v) a keyword-based Google-like search engine for finding the needed information intuitively and rapidly in this database.
Virus Variation (http://www.ncbi.nlm.nih.gov/genomes/VirusVariation/) is a comprehensive, web-based resource designed to support the retrieval and display of large virus sequence datasets. The resource includes a value added database, a specialized search interface and a suite of sequence data displays. Virus-specific sequence annotation and database loading pipelines produce consistent protein and gene annotation and capture sequence descriptors from sequence records then map these metadata to a controlled vocabulary. The database supports a metadata driven, web-based search interface where sequences can be selected using a variety of biological and clinical criteria. Retrieved sequences can then be downloaded in a variety of formats or analyzed using a suite of tools and displays. Over the past 2 years, the pre-existing influenza and Dengue virus resources have been combined into a single construct and West Nile virus added to the resultant resource. A number of improvements were incorporated into the sequence annotation and database loading pipelines, and the virus-specific search interfaces were updated to support more advanced functions. Several new features have also been added to the sequence download options, and a new multiple sequence alignment viewer has been incorporated into the resource tool set. Together these enhancements should support enhanced usability and the inclusion of new viruses in the future.
CyanoBase and RhizoBase: databases of manually curated annotations for cyanobacterial and rhizobial genomes
To understand newly sequenced genomes of closely related species, comprehensively curated reference genome databases are becoming increasingly important. We have extended CyanoBase (http://genome.microbedb.jp/cyanobase), a genome database for cyanobacteria, and newly developed RhizoBase (http://genome.microbedb.jp/rhizobase), a genome database for rhizobia, nitrogen-fixing bacteria associated with leguminous plants. Both databases focus on the representation and reusability of reference genome annotations, which are continuously updated by manual curation. Domain experts have extracted names, products and functions of each gene reported in the literature. To ensure effectiveness of this procedure, we developed the TogoAnnotation system offering a web-based user interface and a uniform storage of annotations for the curators of the CyanoBase and RhizoBase databases. The number of references investigated for CyanoBase increased from 2260 in our previous report to 5285, and for RhizoBase, we perused 1216 references. The results of these intensive annotations are displayed on the GeneView pages of each database. Advanced users can also retrieve this information through the representational state transfer-based web application programming interface in an automated manner.
Bacterial infectious diseases are the result of multifactorial processes affected by the interplay between virulence factors and host targets. The host-Pseudomonas and Coxiella interaction database (HoPaCI-DB) is a publicly available manually curated integrative database (http://mips.helmholtz-muenchen.de/HoPaCI/) of host–pathogen interaction data from Pseudomonas aeruginosa and Coxiella burnetii. The resource provides structured information on 3585 experimentally validated interactions between molecules, bioprocesses and cellular structures extracted from the scientific literature. Systematic annotation and interactive graphical representation of disease networks make HoPaCI-DB a versatile knowledge base for biologists and network biology approaches.
PortEco: a resource for exploring bacterial biology through high-throughput data and analysis tools
PortEco (http://porteco.org) aims to collect, curate and provide data and analysis tools to support basic biological research in Escherichia coli (and eventually other bacterial systems). PortEco is implemented as a ‘virtual’ model organism database that provides a single unified interface to the user, while integrating information from a variety of sources. The main focus of PortEco is to enable broad use of the growing number of high-throughput experiments available for E. coli, and to leverage community annotation through the EcoliWiki and GONUTS systems. Currently, PortEco includes curated data from hundreds of genome-wide RNA expression studies, from high-throughput phenotyping of single-gene knockouts under hundreds of annotated conditions, from chromatin immunoprecipitation experiments for tens of different DNA-binding factors and from ribosome profiling experiments that yield insights into protein expression. Conditions have been annotated with a consistent vocabulary, and data have been consistently normalized to enable users to find, compare and interpret relevant experiments. PortEco includes tools for data analysis, including clustering, enrichment analysis and exploration via genome browsers. PortEco search and data analysis tools are extensively linked to the curated gene, metabolic pathway and regulation content at its sister site, EcoCyc.
SporeWeb: an interactive journey through the complete sporulation cycle of Bacillus subtilis
Bacterial spores are a continuous problem for both food-based and health-related industries. Decades of scientific research dedicated towards understanding molecular and gene regulatory aspects of sporulation, spore germination and spore properties have resulted in a wealth of data and information. To facilitate obtaining a complete overview as well as new insights concerning this complex and tightly regulated process, we have developed a database-driven knowledge platform called SporeWeb (http://sporeweb.molgenrug.nl) that focuses on gene regulatory networks during sporulation in the Gram-positive bacterium Bacillus subtilis. Dynamic features allow the user to navigate through all stages of sporulation with review-like descriptions, schematic overviews on transcriptional regulation and detailed information on all regulators and the genes under their control. The Web site supports data acquisition on sporulation genes and their expression, regulon network interactions and direct links to other knowledge platforms or relevant literature. The information found on SporeWeb (including figures and tables) can and will be updated as new information becomes available in the literature. In this way, SporeWeb offers a novel, convenient and timely reference, an information source and a data acquisition tool that will aid in the general understanding of the dynamics of the complete sporulation cycle.
SubtiWiki-a database for the model organism Bacillus subtilis that links pathway, interaction and expression information
Genome annotation and access to information from large-scale experimental approaches at the genome level are essential to improve our understanding of living cells and organisms. This is even more the case for model organisms that are the basis to study pathogens and technologically important species. We have generated SubtiWiki, a database for the Gram-positive model bacterium Bacillus subtilis (http://subtiwiki.uni-goettingen.de/). In addition to the established companion modules of SubtiWiki, SubtiPathways and SubtInteract, we have now created SubtiExpress, a third module, to visualize genome scale transcription data that are of unprecedented quality and density. Today, SubtiWiki is one of the most complete collections of knowledge on a living organism in one single resource.
MycoCosm is a fungal genomics portal (http://jgi.doe.gov/fungi), developed by the US Department of Energy Joint Genome Institute to support integration, analysis and dissemination of fungal genome sequences and other ‘omics’ data by providing interactive web-based tools. MycoCosm also promotes and facilitates user community participation through the nomination of new species of fungi for sequencing, and the annotation and analysis of resulting data. By efficiently filling gaps in the Fungal Tree of Life, MycoCosm will help address important problems associated with energy and the environment, taking advantage of growing fungal genomics resources.
The Aspergillus Genome Database: multispecies curation and incorporation of RNA-Seq data to improve structural gene annotations
The Aspergillus Genome Database (AspGD; http://www.aspgd.org) is a freely available web-based resource that was designed for Aspergillus researchers and is also a valuable source of information for the entire fungal research community. In addition to being a repository and central point of access to genome, transcriptome and polymorphism data, AspGD hosts a comprehensive comparative genomics toolbox that facilitates the exploration of precomputed orthologs among the 20 currently available Aspergillus genomes. AspGD curators perform gene product annotation based on review of the literature for four key Aspergillus species: Aspergillus nidulans, Aspergillus oryzae, Aspergillus fumigatus and Aspergillus niger. We have iteratively improved the structural annotation of Aspergillus genomes through the analysis of publicly available transcription data, mostly expressed sequenced tags, as described in a previous NAR Database article (Arnaud et al. 2012). In this update, we report substantive structural annotation improvements for A. nidulans, A. oryzae and A. fumigatus genomes based on recently available RNA-Seq data. Over 26 000 loci were updated across these species; although those primarily comprise the addition and extension of untranslated regions (UTRs), the new analysis also enabled over 1000 modifications affecting the coding sequence of genes in each target genome.
The Candida Genome Database: The new homology information page highlights protein similarity and phylogeny
The Candida Genome Database (CGD, http://www.candidagenome.org/) is a freely available online resource that provides gene, protein and sequence information for multiple Candida species, along with web-based tools for accessing, analyzing and exploring these data. The goal of CGD is to facilitate and accelerate research into Candida pathogenesis and biology. The CGD Web site is organized around Locus pages, which display information collected about individual genes. Locus pages have multiple tabs for accessing different types of information; the default Summary tab provides an overview of the gene name, aliases, phenotype and Gene Ontology curation, whereas other tabs display more in-depth information, including protein product details for coding genes, notes on changes to the sequence or structure of the gene and a comprehensive reference list. Here, in this update to previous NAR Database articles featuring CGD, we describe a new tab that we have added to the Locus page, entitled the Homology Information tab, which displays phylogeny and gene similarity information for each locus.
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is the community resource for genomic, gene and protein information about the budding yeast Saccharomyces cerevisiae, containing a variety of functional information about each yeast gene and gene product. We have recently added regulatory information to SGD and present it on a new tabbed section of the Locus Summary entitled ‘Regulation’. We are compiling transcriptional regulator–target gene relationships, which are curated from the literature at SGD or imported, with permission, from the YEASTRACT database. For nearly every S. cerevisiae gene, the Regulation page displays a table of annotations showing the regulators of that gene, and a graphical visualization of its regulatory network. For genes whose products act as transcription factors, the Regulation page also shows a table of their target genes, accompanied by a Gene Ontology enrichment analysis of the biological processes in which those genes participate. We additionally synthesize information from the literature for each transcription factor in a free-text Regulation Summary, and provide other information relevant to its regulatory function, such as DNA binding site motifs and protein domains. All of the regulation data are available for querying, analysis and download via YeastMine, the InterMine-based data warehouse system in use at SGD.
LoQAtE--Localization and Quantitation ATlas of the yeast proteomE. A new tool for multiparametric dissection of single-protein behavior in response to biological perturbations in yeast
Living organisms change their proteome dramatically to sustain a stable internal milieu in fluctuating environments. To study the dynamics of proteins during stress, we measured the localization and abundance of the Saccharomyces cerevisiae proteome under various growth conditions and genetic backgrounds using the GFP collection. We created a database (DB) called ‘LoQAtE’ (Localizaiton and Quantitation Atlas of the yeast proteomE), available online at http://www.weizmann.ac.il/molgen/loqate/, to provide easy access to these data. Using LoQAtE DB, users can get a profile of changes for proteins of interest as well as querying advanced intersections by either abundance changes, primary localization or localization shifts over the tested conditions. Currently, the DB hosts information on 5330 yeast proteins under three external perturbations (DTT, H2O2 and nitrogen starvation) and two genetic mutations [in the chaperonin containing TCP1 (CCT) complex and in the proteasome]. Additional conditions will be uploaded regularly. The data demonstrate hundreds of localization and abundance changes, many of which were not detected at the level of mRNA. LoQAtE is designed to allow easy navigation for non-experts in high-content microscopy and data are available for download. These data should open up new perspectives on the significant role of proteins while combating external and internal fluctuations.
YeastNet v3: a public database of data-specific and integrated functional gene networks for Saccharomyces cerevisiae
Saccharomyces cerevisiae, i.e. baker’s yeast, is a widely studied model organism in eukaryote genetics because of its simple protocols for genetic manipulation and phenotype profiling. The high abundance of publicly available data that has been generated through diverse ‘omics’ approaches has led to the use of yeast for many systems biology studies, including large-scale gene network modeling to better understand the molecular basis of the cellular phenotype. We have previously developed a genome-scale gene network for yeast, YeastNet v2, which has been used for various genetics and systems biology studies. Here, we present an updated version, YeastNet v3 (available at http://www.inetbio.org/yeastnet/), that significantly improves the prediction of gene–phenotype associations. The extended genome in YeastNet v3 covers up to 5818 genes (~99% of the coding genome) wired by 362 512 functional links. YeastNet v3 provides a new web interface to run the tools for network-guided hypothesis generations. YeastNet v3 also provides edge information for all data-specific networks (~2 million functional links) as well as the integrated networks. Therefore, users can construct alternative versions of the integrated network by applying their own data integration algorithm to the same data-specific links.
Antibiotic resistance has become a major human health concern due to widespread use, misuse and overuse of antibiotics. In addition to antibiotics, antibacterial biocides and metals can contribute to the development and maintenance of antibiotic resistance in bacterial communities through co-selection. Information on metal and biocide resistance genes, including their sequences and molecular functions, is, however, scattered. Here, we introduce BacMet (http://bacmet.biomedicine.gu.se)—a manually curated database of antibacterial biocide- and metal-resistance genes based on an in-depth review of the scientific literature. The BacMet database contains 470 experimentally verified resistance genes. In addition, the database also contains 25 477 potential resistance genes collected from public sequence repositories. All resistance genes in the BacMet database have been organized according to their molecular function and induced resistance phenotype.
mVOC: a database of microbial volatiles
Scents are well known to be emitted from flowers and animals. In nature, these volatiles are responsible for inter- and intra-organismic communication, e.g. attraction and defence. Consequently, they influence and improve the establishment of organisms and populations in ecological niches by acting as single compounds or in mixtures. Despite the known wealth of volatile organic compounds (VOCs) from species of the plant and animal kingdom, in the past, less attention has been focused on volatiles of microorganisms. Although fast and affordable sequencing methods facilitate the detection of microbial diseases, however, the analysis of signature or fingerprint volatiles will be faster and easier. Microbial VOCs (mVOCs) are presently used as marker to detect human diseases, food spoilage or moulds in houses. Furthermore, mVOCs exhibited antagonistic potential against pathogens in vitro, but their biological roles in the ecosystems remain to be investigated. Information on volatile emission from bacteria and fungi is presently scattered in the literature, and no public and up-to-date collection on mVOCs is available. To address this need, we have developed mVOC, a database available online at http://bioinformatics.charite.de/mvoc.
Ensembl (http://www.ensembl.org) creates tools and data resources to facilitate genomic analysis in chordate species with an emphasis on human, major vertebrate model organisms and farm animals. Over the past year we have increased the number of species that we support to 77 and expanded our genome browser with a new scrollable overview and improved variation and phenotype views. We also report updates to our core datasets and improvements to our gene homology relationships from the addition of new species. Our REST service has been extended with additional support for comparative genomics and ontology information. Finally, we provide updated information about our methods for data access and resources for user training.