Nucleic Acids Research
The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of annotated genomic, transcript and protein sequence records derived from data in public sequence archives and from computation, curation and collaboration (http://www.ncbi.nlm.nih.gov/refseq/). We report here on growth of the mammalian and human subsets, changes to NCBI’s eukaryotic annotation pipeline and modifications affecting transcript and protein records. Recent changes to NCBI’s eukaryotic genome annotation pipeline provide higher throughput, and the addition of RNAseq data to the pipeline results in a significant expansion of the number of transcripts and novel exons annotated on mammalian RefSeq genomes. Recent annotation changes include reporting supporting evidence for transcript records, modification of exon feature annotation and the addition of a structured report of gene and sequence attributes of biological interest. We also describe a revised protein annotation policy for alternatively spliced transcripts with more divergent predicted proteins and we summarize the current status of the RefSeqGene project.
The University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a large collection of organisms, primarily vertebrates, with an emphasis on the human and mouse genomes. The Browser’s web-based tools provide an integrated environment for visualizing, comparing, analysing and sharing both publicly available and user-generated genomic data sets. As of September 2013, the database contained genomic sequence and a basic set of annotation ‘tracks’ for ~90 organisms. Significant new annotations include a 60-species multiple alignment conservation track on the mouse, updated UCSC Genes tracks for human and mouse, and several new sets of variation and ENCODE data. New software tools include a Variant Annotation Integrator that returns predicted functional effects of a set of variants uploaded as a custom track, an extension to UCSC Genes that displays haplotype alleles for protein-coding genes and an expansion of data hubs that includes the capability to display remotely hosted user-provided assembly sequence in addition to annotation data. To improve European access, we have added a Genome Browser mirror (http://genome-euro.ucsc.edu) hosted at Bielefeld University in Germany.
The Vertebrate Genome Annotation (VEGA) database (http://vega.sanger.ac.uk), initially designed as a community resource for browsing manual annotation of the human genome project, now contains five reference genomes (human, mouse, zebrafish, pig and rat). Its introduction pages have been redesigned to enable the user to easily navigate between whole genomes and smaller multi-species haplotypic regions of interest such as the major histocompatibility complex. The VEGA browser is unique in that annotation is updated via the Human And Vertebrate Analysis aNd Annotation (HAVANA) update track every 2 weeks, allowing single gene updates to be made publicly available to the research community quickly. The user can now access different haplotypic subregions more easily, such as those from the non-obese diabetic mouse, and display them in a more intuitive way using the comparative tools. We also highlight how the user can browse manually annotated updated patches from the Genome Reference Consortium (GRC).
FlyBase (http://flybase.org) is the leading website and database of Drosophila genes and genomes. Whether you are using the fruit fly Drosophila melanogaster as an experimental system or wish to understand Drosophila biological knowledge in relation to human disease or to other model systems, FlyBase can help you successfully find the information you are looking for. Here, we demonstrate some of our more advanced searching systems and highlight some of our new tools for searching the wealth of data on FlyBase. The first section explores gene function in FlyBase, using our TermLink tool to search with Controlled Vocabulary terms and our new RNA-Seq Search tool to search gene expression. The second section of this article describes a few ways to search genomic data in FlyBase, using our BLAST server and the new implementation of GBrowse 2, as well as our new FeatureMapper tool. Finally, we move on to discuss our most powerful search tool, QueryBuilder, before describing pre-computed cuts of the data and how to query the database programmatically.
WormBase 2014: new views of curated biology
WormBase (http://www.wormbase.org/) is a highly curated resource dedicated to supporting research using the model organism Caenorhabditis elegans. With an electronic history predating the World Wide Web, WormBase contains information ranging from the sequence and phenotype of individual alleles to genome-wide studies generated using next-generation sequencing technologies. In recent years, we have expanded the contents to include data on additional nematodes of agricultural and medical significance, bringing the knowledge of C. elegans to bear on these systems and providing support for underserved research communities. Manual curation of the primary literature remains a central focus of the WormBase project, providing users with reliable, up-to-date and highly cross-linked information. In this update, we describe efforts to organize the original atomized and highly contextualized curated data into integrated syntheses of discrete biological topics. Next, we discuss our experiences coping with the vast increase in available genome sequences made possible through next-generation sequencing platforms. Finally, we describe some of the features and tools of the new WormBase Web site that help users better find and explore data of interest.
WormQTLHD--a web database for linking human disease to natural variation data in C. elegans
Interactions between proteins are highly conserved across species. As a result, the molecular basis of multiple diseases affecting humans can be studied in model organisms that offer many alternative experimental opportunities. One such organism—Caenorhabditis elegans—has been used to produce much molecular quantitative genetics and systems biology data over the past decade. We present WormQTLHD (Human Disease), a database that quantitatively and systematically links expression Quantitative Trait Loci (eQTL) findings in C. elegans to gene–disease associations in man. WormQTLHD, available online at http://www.wormqtl-hd.org, is a user-friendly set of tools to reveal functionally coherent, evolutionary conserved gene networks. These can be used to predict novel gene-to-gene associations and the functions of genes underlying the disease of interest. We created a new database that links C. elegans eQTL data sets to human diseases (34 337 gene–disease associations from OMIM, DGA, GWAS Central and NHGRI GWAS Catalogue) based on overlapping sets of orthologous genes associated to phenotypes in these two species. We utilized QTL results, high-throughput molecular phenotypes, classical phenotypes and genotype data covering different developmental stages and environments from WormQTL database. All software is available as open source, built on MOLGENIS and xQTL workbench.
The International Mouse Phenotyping Consortium Web Portal, a unified point of access for knockout mice and related phenotyping data
The International Mouse Phenotyping Consortium (IMPC) web portal (http://www.mousephenotype.org) provides the biomedical community with a unified point of access to mutant mice and rich collection of related emerging and existing mouse phenotype data. IMPC mouse clinics worldwide follow rigorous highly structured and standardized protocols for the experimentation, collection and dissemination of data. Dedicated ‘data wranglers’ work with each phenotyping center to collate data and perform quality control of data. An automated statistical analysis pipeline has been developed to identify knockout strains with a significant change in the phenotype parameters. Annotation with biomedical ontologies allows biologists and clinicians to easily find mouse strains with phenotypic traits relevant to their research. Data integration with other resources will provide insights into mammalian gene function and human disease. As phenotype data become available for every gene in the mouse, the IMPC web portal will become an invaluable tool for researchers studying the genetic contributions of genes to human diseases.
The Mouse Genome Database: integration of and access to knowledge about the laboratory mouse
The Mouse Genome Database (MGD) (http://www.informatics.jax.org) is the community model organism database resource for the laboratory mouse, a premier animal model for the study of genetic and genomic systems relevant to human biology and disease. MGD maintains a comprehensive catalog of genes, functional RNAs and other genome features as well as heritable phenotypes and quantitative trait loci. The genome feature catalog is generated by the integration of computational and manual genome annotations generated by NCBI, Ensembl and Vega/HAVANA. MGD curates and maintains the comprehensive listing of functional annotations for mouse genes using the Gene Ontology, and MGD curates and integrates comprehensive phenotype annotations including associations of mouse models with human diseases. Recent improvements include integration of the latest mouse genome build (GRCm38), improved access to comparative and functional annotations for mouse genes with expanded representation of comparative vertebrate genomes and new loads of phenotype data from high-throughput phenotyping projects. All MGD resources are freely available to the research community.
The Gene Expression Database (GXD; http://www.informatics.jax.org/expression.shtml) is an extensive and well-curated community resource of mouse developmental expression information. GXD collects different types of expression data from studies of wild-type and mutant mice, covering all developmental stages and including data from RNA in situ hybridization, immunohistochemistry, RT-PCR, northern blot and western blot experiments. The data are acquired from the scientific literature and from researchers, including groups doing large-scale expression studies. Integration with the other data in Mouse Genome Informatics (MGI) and interconnections with other databases places GXD’s gene expression information in the larger biological and biomedical context. Since the last report, the utility of GXD has been greatly enhanced by the addition of new data and by the implementation of more powerful and versatile search and display features. Web interface enhancements include the capability to search for expression data for genes associated with specific phenotypes and/or human diseases; new, more interactive data summaries; easy downloading of data; direct searches of expression images via associated metadata; and new displays that combine image data and their associated annotations. At present, GXD includes >1.4 million expression results and 250 000 images that are accessible to our search tools.
Mouse Phenome Database
The Mouse Phenome Database (MPD; phenome.jax.org) was launched in 2001 as the data coordination center for the international Mouse Phenome Project. MPD integrates quantitative phenotype, gene expression and genotype data into a common annotated framework to facilitate query and analysis. MPD contains >3500 phenotype measurements or traits relevant to human health, including cancer, aging, cardiovascular disorders, obesity, infectious disease susceptibility, blood disorders, neurosensory disorders, drug addiction and toxicity. Since our 2012 NAR report, we have added >70 new data sets, including data from Collaborative Cross lines and Diversity Outbred mice. During this time we have completely revamped our homepage, improved search and navigational aspects of the MPD application, developed several web-enabled data analysis and visualization tools, annotated phenotype data to public ontologies, developed an ontology browser and released new single nucleotide polymorphism query functionality with much higher density coverage than before. Here, we summarize recent data acquisitions and describe our latest improvements.
EMAGE (http://www.emouseatlas.org/emage/) is a freely available database of in situ gene expression patterns that allows users to perform online queries of mouse developmental gene expression. EMAGE is unique in providing both text-based descriptions of gene expression plus spatial maps of gene expression patterns. This mapping allows spatial queries to be accomplished alongside more traditional text-based queries. Here, we describe our recent progress in spatial mapping and data integration. EMAGE has developed a method of spatially mapping 3D embryo images captured using optical projection tomography, and through the use of an IIP3D viewer allows users to view arbitrary sections of raw and mapped 3D image data in the context of a web browser. EMAGE now includes enhancer data, and we have spatially mapped images from a comprehensive screen of transgenic reporter mice that detail the expression of mouse non-coding genomic DNA fragments with enhancer activity. We have integrated the eMouseAtlas anatomical atlas and the EMAGE database so that a user of the atlas can query the EMAGE database easily. In addition, we have extended the atlas framework to enable EMAGE to spatially cross-index EMBRYS whole mount in situ hybridization data. We additionally report on recent developments to the EMAGE web interface, including new query and analysis capabilities.
Proper selection of the translation initiation site (TIS) on mRNAs is crucial for the production of desired protein products. Recent studies using ribosome profiling technology uncovered a surprising variety of potential TIS sites in addition to the annotated start codon. The prevailing alternative translation reshapes the landscape of the proteome in terms of diversity and complexity. To identify the hidden coding potential of the transcriptome in mammalian cells, we developed global translation initiation sequencing (GTI-Seq) that maps genome-wide TIS positions at nearly a single nucleotide resolution. To facilitate studies of alternative translation, we created a database of alternative TIS sites identified from human and mouse cell lines based on multiple GTI-Seq replicates. The TISdb, available at http://tisdb.human.cornell.edu, includes 6991 TIS sites from 4961 human genes and 9973 TIS sites from 5668 mouse genes. The TISdb website provides a simple browser interface for query of high-confidence TIS sites and their associated open reading frames. The output of search results provides a user-friendly visualization of TIS information in the context of transcript isoforms. Together, the information in the database provides an easy reference for alternative translation in mammalian cells and will support future investigation of novel translational products.
GeneProf data: a resource of curated, integrated and reusable high-throughput genomics experiments
GeneProf Data (http://www.geneprof.org) is an open web resource for analysed functional genomics experiments. We have built up a large collection of completely processed RNA-seq and ChIP-seq studies by carefully and transparently reanalysing and annotating high-profile public data sets. GeneProf makes these data instantly accessible in an easily interpretable, searchable and reusable manner and thus opens up the path to the advantages and insights gained from genome-scale experiments to a broader scientific audience. Moreover, GeneProf supports programmatic access to these data via web services to further facilitate the reuse of experimental data across tools and laboratories.
We describe the development of GWIPS-viz (http://gwips.ucc.ie), an online genome browser for viewing ribosome profiling data. Ribosome profiling (ribo-seq) is a recently developed technique that provides genome-wide information on protein synthesis (GWIPS) in vivo. It is based on the deep sequencing of ribosome-protected messenger RNA (mRNA) fragments, which allows the ribosome density along all mRNA transcripts present in the cell to be quantified. Since its inception, ribo-seq has been carried out in a number of eukaryotic and prokaryotic organisms. Owing to the increasing interest in ribo-seq, there is a pertinent demand for a dedicated ribo-seq genome browser. GWIPS-viz is based on The University of California Santa Cruz (UCSC) Genome Browser. Ribo-seq tracks, coupled with mRNA-seq tracks, are currently available for several genomes: human, mouse, zebrafish, nematode, yeast, bacteria (Escherichia coli K12, Bacillus subtilis), human cytomegalovirus and bacteriophage lambda. Our objective is to continue incorporating published ribo-seq data sets so that the wider community can readily view ribosome profiling information from multiple studies without the need to carry out computational processing.
The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.
JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles
JASPAR (http://jaspar.genereg.net) is the largest open-access database of matrix-based nucleotide profiles describing the binding preference of transcription factors from multiple species. The fifth major release greatly expands the heart of JASPAR—the JASPAR CORE subcollection, which contains curated, non-redundant profiles—with 135 new curated profiles (74 in vertebrates, 8 in Drosophila melanogaster, 10 in Caenorhabditis elegans and 43 in Arabidopsis thaliana; a 30% increase in total) and 43 older updated profiles (36 in vertebrates, 3 in D. melanogaster and 4 in A. thaliana; a 9% update in total). The new and updated profiles are mainly derived from published chromatin immunoprecipitation-seq experimental datasets. In addition, the web interface has been enhanced with advanced capabilities in browsing, searching and subsetting. Finally, the new JASPAR release is accompanied by a new BioPython package, a new R tool package and a new R/Bioconductor data package to facilitate access for both manual and automated methods.
Transcription factor binding sites (TFBSs) are most commonly characterized by the nucleotide preferences at each position of the DNA target. Whereas these sequence motifs are quite accurate descriptions of DNA binding specificities of transcription factors (TFs), proteins recognize DNA as a three-dimensional object. DNA structural features refine the description of TF binding specificities and provide mechanistic insights into protein–DNA recognition. Existing motif databases contain extensive nucleotide sequences identified in binding experiments based on their selection by a TF. To utilize DNA shape information when analysing the DNA binding specificities of TFs, we developed a new tool, the TFBSshape database (available at http://rohslab.cmb.usc.edu/TFBSshape/), for calculating DNA structural features from nucleotide sequences provided by motif databases. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE. As demonstrated for the basic helix-loop-helix and homeodomain TF families, our TFBSshape database can be used to compare, qualitatively and quantitatively, the DNA binding specificities of closely related TFs and, thus, uncover differential DNA binding specificities that are not apparent from nucleotide sequence alone.
CollecTF: a database of experimentally validated transcription factor-binding sites in Bacteria
The influx of high-throughput data and the need for complex models to describe the interaction of prokaryotic transcription factors (TF) with their target sites pose new challenges for TF-binding site databases. CollecTF (http://collectf.umbc.edu) compiles data on experimentally validated, naturally occurring TF-binding sites across the Bacteria domain, placing a strong emphasis on the transparency of the curation process, the quality and availability of the stored data and fully customizable access to its records. CollecTF integrates multiple sources of data automatically and openly, allowing users to dynamically redefine binding motifs and their experimental support base. Data quality and currency are fostered in CollecTF by adopting a sustainable model that encourages direct author submissions in combination with in-house validation and curation of published literature. CollecTF entries are periodically submitted to NCBI for integration into RefSeq complete genome records as link-out features, maximizing the visibility of the data and enriching the annotation of RefSeq files with regulatory information. Seeking to facilitate comparative genomics and machine-learning analyses of regulatory interactions, in its initial release CollecTF provides domain-wide coverage of two TF families (LexA and Fur), as well as extensive representation for a clinically important bacterial family, the Vibrionaceae.
The YEASTRACT database: an upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae
The YEASTRACT (http://www.yeastract.com) information system is a tool for the analysis and prediction of transcription regulatory associations in Saccharomyces cerevisiae. Last updated in June 2013, this database contains over 200 000 regulatory associations between transcription factors (TFs) and target genes, including 326 DNA binding sites for 113 TFs. All regulatory associations stored in YEASTRACT were revisited and new information was added on the experimental conditions in which those associations take place and on whether the TF is acting on its target genes as activator or repressor. Based on this information, new queries were developed allowing the selection of specific environmental conditions, experimental evidence or positive/negative regulatory effect. This release further offers tools to rank the TFs controlling a gene or genome-wide response by their relative importance, based on (i) the percentage of target genes in the data set; (ii) the enrichment of the TF regulon in the data set when compared with the genome; or (iii) the score computed using the TFRank system, which selects and prioritizes the relevant TFs by walking through the yeast regulatory network. We expect that with the new data and services made available, the system will continue to be instrumental for yeast biologists and systems biology researchers.
OnTheFly: a database of Drosophila melanogaster transcription factors and their binding sites
We present OnTheFly (http://bhapp.c2b2.columbia.edu/OnTheFly/index.php), a database comprising a systematic collection of transcription factors (TFs) of Drosophila melanogaster and their DNA-binding sites. TFs predicted in the Drosophila melanogaster genome are annotated and classified and their structures, obtained via experiment or homology models, are provided. All known preferred TF DNA-binding sites obtained from the B1H, DNase I and SELEX methodologies are presented. DNA shape parameters predicted for these sites are obtained from a high throughput server or from crystal structures of protein–DNA complexes where available. An important feature of the database is that all DNA-binding domains and their binding sites are fully annotated in a eukaryote using structural criteria and evolutionary homology. OnTheFly thus provides a comprehensive view of TFs and their binding sites that will be a valuable resource for deciphering non-coding regulatory DNA.