Annotated Sequence Database

Sofiel Kustra

Aug 4, 2024, 8:58:26 PM

to celitelo

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.

A GenBank release occurs every two months and is available from the FTP site. The release notes for the current version of GenBank provide detailed information about the release and notifications of upcoming changes to GenBank. Release notes for previous GenBank releases are also available. GenBank growth statistics for both the traditional GenBank divisions and the WGS division are available from each release.

The GenBank database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims, and therefore cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in GenBank.

The most important source of new data for GenBank is direct submissions from a variety of individuals, including researchers, using one of our submission tools. Following submission, data are subject to automated and manual processing to ensure data integrity and quality and are subsequently made available to the public. On rare occasions, data may be removed from public view. More details about this process can be found on the NLM GenBank and SRA Data Processing page.

Some authors are concerned that the appearance of their data in GenBank prior to publication will compromise their work. GenBank will, upon request, withhold release of new submissions for a specified period of time. However, if the accession number or sequence data appears in print or online prior to the specified date, your sequence will be released. In order to prevent the delay in the appearance of published sequence data, we urge authors to inform us of the appearance of the published data. As soon as it is available, please send the full publication data--all authors, title, journal, volume, pages and date--to the following address: upd...@ncbi.nlm.nih.gov

If you are submitting human sequences to GenBank, do not include any data that could reveal the personal identity of the source. GenBank assumes that the submitter has received any necessary informed consent authorizations required prior to submitting sequences.

The Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq sequences form a foundation for medical, functional, and diversity studies. They provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.

NCBI provides RefSeqs for taxonomically diverse organisms including archaea, bacteria, eukaryotes, and viruses. Reference sequences are provided for genomes, transcripts, and proteins. RefSeq also includes some targeted loci projects: RefSeqGene, fungal ITS, and rRNA loci. New or updated records are added to the collection as data become publicly available.

RefSeq is accessible via BLAST, Entrez, and the NCBI FTP site (RefSeq releases and RefSeq Genomes). Information is also available in NCBI's Assembly, Genomes and Gene resources, and for some organisms additional information is available in NCBI's genome browser, Genome Data Viewer. Special properties have been defined to facilitate Entrez-based retrieval. See also: Entrez Query Hints.
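As a rough illustration of Entrez-based retrieval, the sketch below builds (but does not send) an E-utilities search URL restricted to RefSeq records. The endpoint, the `nuccore` database name, and the `srcdb_refseq[prop]` filter are assumptions taken from NCBI's public E-utilities conventions rather than from the text above, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed public E-utilities search endpoint (not named in the text above).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_refseq_search_url(term, db="nuccore", retmax=20):
    """Return an esearch URL that limits `term` to RefSeq records.

    `srcdb_refseq[prop]` is assumed to be the Entrez property that
    selects RefSeq records; check the Entrez Query Hints before relying on it.
    """
    query = f"({term}) AND srcdb_refseq[prop]"
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Only builds the URL; no network request is made here.
url = build_refseq_search_url("BRCA1[gene] AND Homo sapiens[orgn]")
print(url)
```

Keeping URL construction separate from the request itself makes the query easy to inspect before any traffic is sent to NCBI.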

RefSeq records are derived from publicly available sequence data; varying levels of validation, additional annotation, and manual curation are applied to the RefSeq record. NCBI Reference Sequences are provided through the separate processes described below.

This page provides a brief overview of the RefSeq production processes. Also see: NCBI Handbook, RefSeq chapter; NCBI Handbook, Genome Annotation chapter; RefSeq Prokaryotic Genomes; Eukaryotic genome annotation policy.

For some organisms, the annotated RefSeq records are provided by collaborating groups. Depending on the organism, collaborations may be established at the whole-genome level, or smaller collaborations may be established for gene families.

Whole-genome collaborations include records for Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. When such a collaboration is established, the primary sequence level review is carried out by the collaborating group. Processing of annotated genome data submitted by collaborations is semi-automated; data is provided by a collaborating group and validated at NCBI to detect obvious errors (e.g., the annotated CDS location is not capable of encoding the provided protein), and to apply the annotation in a more uniform way. NCBI processing may integrate additional information such as nomenclature or other descriptive data. Additional manual curation of these records is not carried out by NCBI staff. NCBI may update the records to correct a general format problem, but otherwise these records are only updated when the collaborating group provides an update. Should errors be reported, then NCBI staff relays that information to the collaborating group.
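The kind of check mentioned above (an annotated CDS location that cannot encode the provided protein) can be sketched in a few lines. This is only a toy version under stated assumptions: a single-interval CDS on the forward strand, 0-based half-open coordinates, and the standard genetic code. It is not NCBI's actual validation code, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDDDEEEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    # Ignore any trailing partial codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # "X" for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def cds_encodes_protein(genomic, start, end, protein):
    """True if genomic[start:end] translates to exactly `protein`."""
    return translate(genomic[start:end]) == protein.upper()


genomic = "CCATGGCCAAATGACC"
print(cds_encodes_protein(genomic, 2, 14, "MAK"))  # annotated location agrees
print(cds_encodes_protein(genomic, 3, 15, "MAK"))  # off-by-one location fails
```

A real pipeline would also have to handle multi-exon features, the reverse strand, and alternative genetic codes, all of which are left out here.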

RefSeq records that are supplied by collaboration include an indication of the submitting group on the record, either as a direct submission Reference citation, in the COMMENT block, or both. The RefSeq status (e.g., REVIEWED) is either indicated by the collaborating group or inferred from the supplied annotation.

NCBI provides annotation for some assembled genomic sequence data, including human, mouse, rat, honey bee, chicken, chimpanzee, and others. This pipeline is automated and data is refreshed periodically. The model RefSeq records produced from this pipeline have a distinguishing accession prefix (XM_, XR_, XP_), are derived from the genomic sequence, have varying levels of transcript or protein homology support, and are not subject to further manual curation.
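Because model records are recognizable by their accession prefix, a script can separate them from other records with a simple lookup. The sketch below assumes accessions of the form PREFIX_digits.version; the molecule-type labels are my own shorthand, the example accessions are made up, and the function names are hypothetical.

```python
# Prefixes of model RefSeq records produced by the automated pipeline.
MODEL_PREFIXES = {
    "XM": "model mRNA",
    "XR": "model non-coding RNA",
    "XP": "model protein",
}


def accession_prefix(accession):
    """Return the part before the underscore, or None if there is none."""
    prefix, sep, _ = accession.partition("_")
    return prefix.upper() if sep else None


def is_model_record(accession):
    """True if the accession carries one of the model-record prefixes."""
    return accession_prefix(accession) in MODEL_PREFIXES


# Made-up accessions, for illustration only.
for acc in ["XM_000001.1", "XP_000002.3", "NM_000003.2", "U00004"]:
    label = MODEL_PREFIXES.get(accession_prefix(acc), "not a model record")
    print(acc, "->", label)
```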

RefSeq transcript and protein records for a subset of organisms, primarily mammals, are curated by NCBI staff. Curation is an ongoing process and some records have not been reviewed yet; the curation status is indicated on the RefSeq record in the COMMENT block. Some records representing genomic regions (accession prefix NG_) are provided specifically to support more comprehensive genome-level annotation. The curated RefSeq records are created via a process that includes automated computational methods, collaboration, and manual data review by NCBI staff. This process is further described in the NCBI Handbook, RefSeq chapter.

A combined approach uses both collaborator-supplied sequence information and automated BLAST analysis to provide an initial RefSeq record. Records are subject to validation to correct annotation errors and provide annotation in a more consistent format. Descriptive information, including Official Nomenclature and additional citations, is applied to the records. These initial records have a PROVISIONAL, PREDICTED, or INFERRED status.

Additional manual curation is applied to this set of RefSeq records to provide the optimal sequence record, and to fix sequence errors including mis-association with a locus (as might occur for closely related gene families), chimeric sequences, vector or linker contamination, or apparent sequencing errors. Both the nucleotide and protein sequence record may change due to this process. Sequence level review is carried out primarily by NCBI staff but some records are provided via collaboration. These records have a VALIDATED status. Additional annotation, a summary description, and other functional information may be applied, as available, during the sequence review process. These records have a REVIEWED status.

Since there is a strong manual curation component in this pipeline, input from the research community is especially welcome to further improve the quality of this dataset. The RefSeq records generated by this pipeline are used as a reagent in the genome assembly & annotation pipeline (see above).

In the field of bioinformatics, a sequence database is a type of biological database that is composed of a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other polymer sequences stored on a computer. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and is growing at an exponential rate.[1] Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.

Searching in a sequence database involves looking for similarities between a query string and the genomic or protein sequences in the database, and finding the database sequence that "best" matches the query (based on criteria which vary depending on the search method). The number of matches/hits is used to formulate a score that determines the similarity between the sequence query and the sequences in the sequence database.[2] The main goal is to have a good balance between the two criteria.
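To make the idea of a hit-based score concrete, here is a deliberately simple sketch that counts the distinct k-mers a query shares with each database sequence and reports the highest-scoring entry. It is a toy stand-in, not BLAST or any other production search method; the sequences, the choice of k=3, and the function names are all made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score(query, subject, k=3):
    """Score = number of distinct k-mers shared by query and subject."""
    return len(kmers(query, k) & kmers(subject, k))


def best_match(query, database, k=3):
    """Return (name of best-scoring sequence, dict of all scores)."""
    scored = {name: score(query, seq, k) for name, seq in database.items()}
    best = max(scored, key=scored.get)
    return best, scored


# Tiny made-up database of nucleotide sequences.
database = {
    "seq1": "ATGGCCAAATGA",
    "seq2": "TTTTTTTTTTTT",
    "seq3": "ATGGCGAAATGA",
}
best, scored = best_match("GGCCAAA", database)
print(best, scored)
```

Real search tools weight matches, mismatches, and gaps and attach statistical significance to each hit, which this count does not attempt.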

The need for sequence databases originated in 1950 when Frederick Sanger reported the primary structure of insulin. He won his second Nobel Prize for creating methods for sequencing nucleic acids, and his comparative approach sparked other protein biochemists to begin collecting amino acid sequences, thus marking the beginning of molecular databases.[3]