Nucleic Acids Research
Allostery is the most direct and efficient way for regulation of biological macromolecule function and is induced by the binding of a ligand at an allosteric site topographically distinct from the orthosteric site. AlloSteric Database (ASD, http://mdl.shsmu.edu.cn/ASD) has been developed to provide comprehensive information on allostery. Owing to the inherent high receptor selectivity and lower target-based toxicity, allosteric regulation is expected to assume a more prominent role in drug discovery and bioengineering, leading to the rapid growth of allosteric findings. In this updated version, ASD v2.0 has expanded to 1286 allosteric proteins, 565 allosteric diseases and 22 008 allosteric modulators. A total of 907 allosteric site-modulator structural complexes and >200 structural pairs of orthosteric/allosteric sites in the allosteric proteins were constructed for researchers to develop allosteric site and pathway tools in response to community demands. Up-to-date allosteric pathways were manually curated in the updated version. In addition, both the front-end and the back-end of ASD have been redesigned and enhanced to allow more efficient access. Taken together, these updates are useful for facilitating the investigation of allosteric mechanisms, allosteric target identification and allosteric drug discovery.
We have compiled MultitaskProtDB, available online at http://wallace.uab.es/multitask, to provide a repository where the many multitasking proteins found in the literature can be stored. Multitasking or moonlighting is the capability of some proteins to execute two or more biological functions. Usually, multitasking proteins are experimentally revealed by serendipity. This ability of proteins to perform multiple functions helps us to understand one of the ways used by cells to perform many complex functions with a limited number of genes. Even so, the study of this phenomenon is complex because, among other things, there has been no database of moonlighting proteins. The existence of such a tool facilitates the collection and dissemination of these important data. This work reports the database, MultitaskProtDB, which is designed as a user-friendly web page containing >288 multitasking proteins with their NCBI and UniProt accession numbers, canonical and additional biological functions, monomeric/oligomeric states, PDB codes when available and bibliographic references. This database also serves to gain insight into some characteristics of multitasking proteins such as frequencies of the different pairs of functions, phylogenetic conservation and so forth.
The Structure-Function Linkage Database
The Structure–Function Linkage Database (SFLD, http://sfld.rbvi.ucsf.edu/) is a manually curated classification resource describing structure–function relationships for functionally diverse enzyme superfamilies. Members of such superfamilies are diverse in their overall reactions yet share a common ancestor and some conserved active site features associated with conserved functional attributes such as a partial reaction. Thus, despite their different functions, members of these superfamilies ‘look alike’, making them easy to misannotate. To address this complexity and enable rational transfer of functional features to unknowns only for those members for which we have sufficient functional information, we subdivide superfamily members into subgroups using sequence information, and lastly into families, sets of enzymes known to catalyze the same reaction using the same mechanistic strategy. Browsing and searching options in the SFLD provide access to all of these levels. The SFLD offers manually curated as well as automatically classified superfamily sets, both accompanied by search and download options for all hierarchical levels. Additional information includes multiple sequence alignments, tab-separated files of functional and other attributes, and sequence similarity networks. The latter provide a new and intuitively powerful way to visualize functional trends mapped to the context of sequence similarity.
We report an integrated database, the Compendium of Protein Lysine Modifications (CPLM; http://cplm.biocuckoo.org), for protein lysine modifications (PLMs), which occur at active ε-amino groups of specific lysine residues in proteins and are critical for orchestrating various biological processes. The CPLM database was updated from our previously developed Compendium of Protein Lysine Acetylation (CPLA), which contained 7151 lysine acetylation sites in 3311 proteins. Here, we manually collected experimentally identified substrates and sites for 12 types of PLMs, including acetylation, ubiquitination, sumoylation, methylation, butyrylation, crotonylation, glycation, malonylation, phosphoglycerylation, propionylation, succinylation and pupylation. In total, the CPLM database contains 203 972 modification events on 189 919 modified lysines in 45 748 proteins for 122 species. With this dataset, we identified a total of 76 types of co-occurrences of various PLMs on the same lysine residues, and the most abundant PLM crosstalk is between acetylation and ubiquitination. Up to 53.5% of acetylation and 33.1% of ubiquitination events co-occur at 10 746 lysine sites. Thus, the various PLM crosstalks suggest that a considerable proportion of lysines are competitively and dynamically regulated in a complicated manner. Taken together, the CPLM database can serve as a useful resource for further research on PLMs.
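The co-occurrence analysis described above can be illustrated with a minimal sketch. The abstract does not say how co-occurrence types were counted, so this version assumes that each distinct combination of two or more PLM types on the same lysine is one type. The records, accession labels and function names are hypothetical toy values, not CPLM data or the CPLM download format.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (protein accession, lysine position, PLM type).
records = [
    ("P1", 10, "acetylation"),
    ("P1", 10, "ubiquitination"),
    ("P1", 25, "acetylation"),
    ("P2", 7, "acetylation"),
    ("P2", 7, "ubiquitination"),
    ("P2", 7, "sumoylation"),
    ("P3", 3, "methylation"),
]


def group_by_site(records):
    """Map each (protein, position) lysine site to the set of PLM types seen there."""
    types_by_site = defaultdict(set)
    for protein, position, plm in records:
        types_by_site[(protein, position)].add(plm)
    return types_by_site


def cooccurrence_types(records):
    """Count lysine sites for each distinct combination of two or more PLM types."""
    combos = Counter()
    for plms in group_by_site(records).values():
        if len(plms) >= 2:
            combos[tuple(sorted(plms))] += 1
    return combos


def cooccurring_fraction(records, plm, partner):
    """Fraction of sites carrying `plm` that also carry `partner`."""
    sites = [s for s in group_by_site(records).values() if plm in s]
    if not sites:
        return 0.0
    return sum(1 for s in sites if partner in s) / len(sites)


combos = cooccurrence_types(records)
ac_with_ub = cooccurring_fraction(records, "acetylation", "ubiquitination")
ub_with_ac = cooccurring_fraction(records, "ubiquitination", "acetylation")
```

Note that the fraction is asymmetric: the same set of shared sites gives a different percentage for each modification because the denominators differ, which is why the abstract quotes two percentages for one set of 10 746 sites.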
The 2014 Nucleic Acids Research Database Issue and an updated NAR online Molecular Biology Database Collection
The 2014 Nucleic Acids Research Database Issue includes descriptions of 58 new molecular biology databases and recent updates to 123 databases previously featured in NAR or other journals. For convenience, the issue is now divided into eight sections that reflect major subject categories. Among the highlights of this issue are six databases of the transcription factor binding sites in various organisms and updates on such popular databases as CAZy, Database of Genomic Variants (DGV), dbGaP, DrugBank, KEGG, miRBase, Pfam, Reactome, SEED, TCDB and UniProt. There is a strong block of structural databases, which includes, among others, the new RNA Bricks database, updates on PDBe, PDBsum, ArchDB, Gene3D, ModBase, Nucleic Acid Database and the recently revived iPfam database. An update on the NCBI’s MMDB describes VAST+, an improved tool for protein structure comparison. Two articles highlight the development of the Structural Classification of Proteins (SCOP) database: one describes SCOPe, which automates assignment of new structures to the existing SCOP hierarchy; the other one describes the first version of SCOP2, with its more flexible approach to classifying protein structures. This issue also includes a collection of articles on bacterial taxonomy and metagenomics, which includes updates on the List of Prokaryotic Names with Standing in Nomenclature (LPSN), Ribosomal Database Project (RDP), the Silva/LTP project and several new metagenomics resources. The NAR online Molecular Biology Database Collection, http://www.oxfordjournals.org/nar/database/c/, has been expanded to 1552 databases. The entire Database Issue is freely available online on the Nucleic Acids Research website (http://nar.oxfordjournals.org/).
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, PubReader, Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link, Primer-BLAST, COBALT, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, the Genetic Testing Registry, Genome and related tools, the Map Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, ClinVar, MedGen, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Probe, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool, Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All these resources can be accessed through the NCBI home page.
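As an illustration of programmatic access through the Entrez Programming Utilities mentioned above, the following sketch builds an ESearch request URL without sending it. The base URL and the `db`, `term`, `retmax` and `retmode` parameters follow NCBI's public E-utilities documentation rather than this abstract. The helper name `esearch_url` and the example query are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base URL from NCBI's public E-utilities documentation (not stated in the abstract).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that would return matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Hypothetical example query against the nucleotide database.
url = esearch_url("nucleotide", "BRCA1[Gene Name] AND Homo sapiens[Organism]")
print(url)
```

The URL can be fetched with any HTTP client. NCBI asks heavy users to identify themselves and limit their request rate, so the usage guidelines are worth checking before running this in a loop.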
Molecular biology has been at the heart of the ‘big data’ revolution from its very beginning, and the need for access to biological data is a common thread running from the 1965 publication of Dayhoff’s ‘Atlas of Protein Sequence and Structure’ through the Human Genome Project in the late 1990s and early 2000s to today’s population-scale sequencing initiatives. The European Bioinformatics Institute (EMBL-EBI; http://www.ebi.ac.uk) is one of three organizations worldwide that provide free access to comprehensive, integrated molecular data sets. Here, we summarize the principles underpinning the development of these public resources and provide an overview of EMBL-EBI’s database collection to complement the reviews of individual databases provided elsewhere in this issue.
The U.S. Department of Energy (DOE) Joint Genome Institute (JGI), a national user facility, serves the diverse scientific community by providing integrated high-throughput sequencing and computational analysis to enable system-based scientific approaches in support of DOE missions related to clean energy generation and environmental characterization. The JGI Genome Portal (http://genome.jgi.doe.gov) provides unified access to all JGI genomic databases and analytical tools. The JGI maintains extensive data management systems and specialized analytical capabilities to manage and interpret complex genomic data. A user can search, download and explore multiple data sets available for all DOE JGI sequencing projects including their status, assemblies and annotations of sequenced genomes. Here we describe major updates of the Genome Portal in the past 2 years with a specific emphasis on efficient handling of the rapidly growing amount of diverse genomic data accumulated in JGI.
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for over 280 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assign accession numbers upon data receipt. Daily data exchange with the European Nucleotide Archive and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI home page: www.ncbi.nlm.nih.gov.
The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is a repository for the world's public domain nucleotide sequence data output. ENA content covers a spectrum of data types including raw reads, assembly data and functional annotation. ENA has faced dramatic growth in genome assembly submission rates, data volumes and complexity of datasets. This has prompted a broad reworking of assembly submission services. We are now reaching the end of this major programme of work, and many enhancements to components of the submission service have already been made available over the year. In this article, we briefly review ENA content and growth over 2013, describe our rapidly developing services for genome assembly information and outline further major developments over the last year.
The DNA Data Bank of Japan (DDBJ; http://www.ddbj.nig.ac.jp) maintains and provides archival, retrieval and analytical resources for biological information. This database content is shared with the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). DDBJ launched a new nucleotide sequence submission system for receiving traditional nucleotide sequences. We expect that the new submission system will help many submitters to enter accurate annotation and will reduce the time needed for data input. In addition, DDBJ has started a new service, the Japanese Genotype–phenotype Archive (JGA), with our partner institute, the National Bioscience Database Center (NBDC). JGA permanently archives and shares all types of individual human genetic and phenotypic data. We also introduce improvements in the DDBJ services and databases made during the past year.
The BioSamples database at the EBI (http://www.ebi.ac.uk/biosamples) provides an integration point for BioSamples information between technology specific databases at the EBI, projects such as ENCODE and reference collections such as cell lines. The database delivers a unified query interface and API to query sample information across EBI’s databases and provides links back to assay databases. Sample groups are used to manage related samples, e.g. those from an experimental submission, or a single reference collection. Infrastructural improvements include a new user interface with ontological and key word queries, a new query API, a new data submission API, complete RDF data download and a supporting SPARQL endpoint, accessioning at the point of submission to the European Nucleotide Archive and European Genotype Phenotype Archives and improved query response times.
NGSmethDB: an updated genome resource for high quality, single-cytosine resolution methylomes
The updated release of ‘NGSmethDB’ (http://bioinfo2.ugr.es/NGSmethDB) is a repository for single-base whole-genome methylome maps for the best-assembled eukaryotic genomes. Short-read data sets from NGS bisulfite-sequencing projects of cell lines, fresh and pathological tissues are first pre-processed and aligned to the corresponding reference genome, and then the cytosine methylation levels are profiled. One major improvement is the application of a unique bioinformatics protocol to all data sets, thereby assuring the comparability of all values with each other. We implemented stringent quality controls to minimize important error sources, such as sequencing errors, bisulfite failures, clonal reads or single nucleotide variants (SNVs). This leads to reliable and high-quality methylomes, all obtained under uniform settings. Another significant improvement is the detection in parallel of SNVs, which might be crucial for many downstream analyses (e.g. SNVs and differential-methylation relationships). A next-generation methylation browser allows fast and smooth scrolling and zooming, thus speeding data download/upload, at the same time requiring fewer server resources. Several data mining tools allow the comparison/retrieval of methylation levels in different tissues or genome regions. NGSmethDB methylomes are also available as native tracks through a UCSC hub, which allows comparison with a wide range of third-party annotations, in particular phenotype or disease annotations.
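The per-cytosine profiling step described above can be sketched as a read-count ratio with a coverage filter. This is a minimal illustration, not the NGSmethDB protocol: the function name, the minimum-coverage threshold of 10 reads and the toy counts are all assumptions. It relies only on the standard bisulfite-sequencing convention that reads retaining a C at a cytosine position are counted as methylated and reads showing a T as unmethylated.

```python
def methylation_level(methylated_reads, unmethylated_reads, min_coverage=10):
    """Return the methylation ratio at one cytosine, or None if coverage is too low.

    min_coverage is a hypothetical quality threshold, not a documented
    NGSmethDB setting.
    """
    coverage = methylated_reads + unmethylated_reads
    if coverage < min_coverage:
        return None  # too few reads to estimate the level reliably
    return methylated_reads / coverage


# Hypothetical toy counts: position -> (reads with C, reads with T).
counts = {101: (8, 2), 150: (3, 1), 212: (0, 25)}
levels = {pos: methylation_level(c, t) for pos, (c, t) in counts.items()}
```

Applying one function with one threshold to every data set is what makes the resulting values comparable, which is the point of the uniform protocol the abstract emphasizes.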
Approximately half of all human transcripts contain at least one upstream translational initiation site that precedes the main coding sequence (CDS) and gives rise to an upstream open reading frame (uORF). We generated uORFdb, publicly available at http://cbdm.mdc-berlin.de/tools/uorfdb, to serve as a comprehensive literature database on eukaryotic uORF biology. Upstream ORFs affect downstream translation by interfering with the unrestrained progression of ribosomes across the transcript leader sequence. Although the first uORF-related translational activity was observed >30 years ago, and an increasing number of studies link defective uORF-mediated translational control to the development of human diseases, the features that determine uORF-mediated regulation of downstream translation are not well understood. The uORFdb was manually curated from all uORF-related literature listed at the PubMed database. It categorizes individual publications by a variety of denominators including taxon, gene and type of study. Furthermore, the database can be filtered for multiple structural and functional uORF-related properties to allow convenient and targeted access to the complex field of eukaryotic uORF biology.
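The definition of a uORF given above can be made concrete with a small scanning sketch. It rests on simplifying assumptions: only ATG is treated as a start codon, a uORF must end at an in-frame stop codon, and the sequence is given as DNA (T rather than U). The function name and the toy transcript are hypothetical, and this is not how uORFdb itself was built, since that database is a manually curated literature collection.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(transcript, cds_start):
    """Find ORFs whose ATG begins upstream of the main CDS start.

    Returns (start, end) index pairs, 0-based with end exclusive and the
    stop codon included. A uORF may extend past cds_start (overlapping the
    CDS); candidates without an in-frame stop codon are skipped.
    """
    seq = transcript.upper()
    uorfs = []
    for start in range(min(cds_start, len(seq) - 2)):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start codon until a stop is found.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


# Hypothetical toy transcript: the main CDS (ATG GCT TAA) starts at index 16.
transcript = "GGATGAAATAGCCGCCATGGCTTAA"
uorfs = find_uorfs(transcript, cds_start=16)
```

In this toy transcript the scan reports a single uORF, the nine-nucleotide stretch `ATGAAATAG` in the leader sequence.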
We describe an update of the miRBase database (http://www.mirbase.org/), the primary microRNA sequence repository. The latest miRBase release (v20, June 2013) contains 24 521 microRNA loci from 206 species, processed to produce 30 424 mature microRNA products. The rate of deposition of novel microRNAs and the number of researchers involved in their discovery continue to increase, driven largely by small RNA deep sequencing experiments. In the face of these increases, and a range of microRNA annotation methods and criteria, maintaining the quality of the microRNA sequence data set is a significant challenge. Here, we describe recent developments of the miRBase database to address this issue. In particular, we describe the collation and use of deep sequencing data sets to assign levels of confidence to miRBase entries. We now provide a high confidence subset of miRBase entries, based on the pattern of mapped reads. The high confidence microRNA data set is available alongside the complete microRNA collection at http://www.mirbase.org/. We also describe embedding microRNA-specific Wikipedia pages on the miRBase website to encourage the microRNA community to contribute and share textual and functional information.
Ever-growing interest in microRNAs has greatly increased the number of resources and research papers devoted to the field and, as a result, it is becoming harder to find miRNA data of interest. To mitigate this problem, we created the miRNEST database (http://mirnest.amu.edu.pl), an integrative microRNA resource. In its updated version, named miRNEST 2.0, the database is complemented with our extensive miRNA predictions from deep sequencing libraries, data from plant degradome analyses, results of pre-miRNA classification with HuntMi and miRNA splice site information. We also added download and upload options and improved the user interface to make it easier to browse through miRNA records.
miRTarBase update 2014: an information resource for experimentally validated miRNA-target interactions
MicroRNAs (miRNAs) are small non-coding RNA molecules capable of negatively regulating gene expression to control many cellular mechanisms. The miRTarBase database (http://mirtarbase.mbc.nctu.edu.tw/) provides the most current and comprehensive information on experimentally validated miRNA-target interactions. The database was launched in 2010 with data sourced from >100 published studies on the identification of miRNA targets, molecular networks of miRNA targets and systems biology, and the current release (2013, version 4) includes significant expansions and enhancements over the initial release (2010, version 1). This article reports the current status of and recent improvements to the database, including (i) a 14-fold increase in miRNA-target interaction entries, (ii) a miRNA-target network, (iii) expression profiles of miRNAs and their target genes, (iv) miRNA target-associated diseases and (v) additional utilities including an upgrade reminder and an error reporting/user feedback system.