Greetings GenBank Users,
GenBank Release 224.0 is now available via FTP from the National Center
for Biotechnology Information (NCBI):
Ftp Site          Directory  Contents
----------------  ---------  ---------------------------------------
ftp.ncbi.nih.gov  genbank    GenBank Release 224.0 flatfiles
                  ncbi-asn1  ASN.1 data used to create Release 224.0
Close-of-data for GenBank 224.0 occurred on 02/13/2018. Uncompressed,
the Release 224.0 flatfiles require roughly 871 GB (sequence files only).
The ASN.1 data require approximately 719 GB.
Recent statistics for 'traditional' sequences (including non-bulk-oriented
TSA, and excluding WGS, bulk-oriented TSA, TLS, and the CON-division):
Release  Date      Base Pairs     Entries
223      Dec 2017  249722163594   206293625
224      Feb 2018  253630708098   207040555
Recent statistics for WGS sequencing projects:
Release  Date      Base Pairs     Entries
223      Dec 2017  2466098053327  551063065
224      Feb 2018  2608532210351  564286852
Recent statistics for bulk-oriented TSA sequencing projects:
Release  Date      Base Pairs     Entries
223      Dec 2017  181394660188   201559502
224      Feb 2018  193940551226   214324264
Recent statistics for bulk-oriented TLS sequencing projects:
Release  Date      Base Pairs     Entries
223      Dec 2017  4458042616     12695198
224      Feb 2018  4531966831     12819978
During the 60 days between the close dates for GenBank Releases 223.0
and 224.0, the 'traditional' portion of GenBank grew by 3,908,544,504
basepairs and by 746,930 sequence records. During that same period,
62,840 records were updated. An average of 13,496 'traditional' records
were added and/or updated per day.
Between releases 223.0 and 224.0, the WGS component of GenBank grew by
142,434,157,024 basepairs and by 13,223,787 sequence records.
Between releases 223.0 and 224.0, the TSA component of GenBank grew by
12,545,891,038 basepairs and by 12,764,762 sequence records.
Between releases 223.0 and 224.0, the TLS component of GenBank grew by
73,924,215 basepairs and by 124,780 sequence records.
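
The growth figures above follow directly from the totals in the statistics
tables. A minimal Python sketch of that arithmetic, using only numbers quoted
in this announcement; the dictionary layout and function names are
illustrative and not part of any NCBI tool:

```python
# Totals copied from the statistics tables in this announcement.
# component -> ((base pairs, entries) at 223.0, (base pairs, entries) at 224.0)
TOTALS = {
    "traditional": ((249722163594, 206293625), (253630708098, 207040555)),
    "WGS": ((2466098053327, 551063065), (2608532210351, 564286852)),
    "TSA": ((181394660188, 201559502), (193940551226, 214324264)),
    "TLS": ((4458042616, 12695198), (4531966831, 12819978)),
}


def growth(component):
    """Return (added base pairs, added records) between the two releases."""
    (bp_old, n_old), (bp_new, n_new) = TOTALS[component]
    return bp_new - bp_old, n_new - n_old


def daily_average(added, updated, days):
    """Records added and/or updated per day, rounded to a whole record."""
    return round((added + updated) / days)


for name in TOTALS:
    base_pairs, records = growth(name)
    print(f"{name:12s} +{base_pairs:,} bp  +{records:,} records")

# 746,930 added plus 62,840 updated 'traditional' records over 60 days.
print(daily_average(746930, 62840, 60))
```

The last line reproduces the per-day average quoted for the 'traditional'
records.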
For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 224.0 and Upcoming Changes) have been appended
below for your convenience.
Release 224.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.
As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (e.g., February 15 2018, 224.0). If they are
not, interrupt the remaining transfers and then request assistance from
the NCBI Service Desk.
A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix or Linux platform, using csh/tcsh:
    set files = `ls gb*.*`
    foreach i ($files)
        head -10 $i | grep Release
    end
Or, if the files are compressed, perhaps (in place of the head command;
zcat or gzip -dc can stand in where gzcat is not installed):
        gzcat $i | head -10 | grep Release
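
For sites that do not use csh, the same header check can be sketched in
Python. This is only an illustration under stated assumptions: the function
names are hypothetical, files ending in .gz are assumed to be gzip-compressed,
and the release number is assumed to appear within the first ten lines, as in
the shell loop above.

```python
import glob
import gzip


def release_lines(path, max_lines=10):
    """Return the lines mentioning 'Release' among the first max_lines of a file."""
    opener = gzip.open if path.endswith(".gz") else open
    found = []
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for line_number, line in enumerate(handle):
            if line_number >= max_lines:
                break
            if "Release" in line:
                found.append(line.strip())
    return found


def check_release(paths, expected="224.0"):
    """Map each path to True when its header names the expected release."""
    return {
        path: any(expected in line for line in release_lines(path))
        for path in paths
    }


if __name__ == "__main__":
    # Same file pattern as the csh example: every gb*.* file in this directory.
    for path, ok in sorted(check_release(glob.glob("gb*.*")).items()):
        print("ok      " if ok else "MISMATCH", path)
```

Any file reported as MISMATCH would be a candidate for re-transfer.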
If you encounter problems while ftp'ing or uncompressing Release
224.0, please send email outlining your difficulties to:
in...@ncbi.nlm.nih.gov
Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky
GenBank
NCBI/NLM/NIH/HHS
1.3 Important Changes in Release 224.0
1.3.1 Organizational changes
The total number of sequence data files increased by 36 with this release:
- the BCT division is now composed of 450 files (+22)
- the CON division is now composed of 362 files (-1)
- the ENV division is now composed of 101 files (+1)
- the INV division is now composed of 161 files (+2)
- the PAT division is now composed of 323 files (+3)
- the PLN division is now composed of 168 files (+2)
- the VRL division is now composed of 53 files (+2)
- the VRT division is now composed of 85 files (+5)
The decrease in the number of CON division files is due to re-assembled
WGS sequence projects that did not include the submission of new
scaffolds/CON-records.
1.3.2 GSS File Header Problem
GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped by the first, it does not know how to number its own
output files.
There is thus a discrepancy between the filenames and file headers for 130
of the GSS flatfiles in Release 224.0. Consider gbgss175.seq:
    GBGSS1.SEQ Genetic Sequence Data Bank
    February 15 2018
    NCBI-GenBank Flat File Release 224.0
    GSS Sequences (Part 1)
    87375 loci, 64103840 bases, from 87375 reported sequences
Here, the file number and the part number in the header are both "1", though
the file has been renamed as "175" based on the number of files dumped from
the other system. Files gbgss175.seq.gz through gbgss304.seq.gz are affected.
We hope to resolve this discrepancy at some point, but its priority is much
lower than that of many other tasks.
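
Anyone validating headers automatically may want to detect this mismatch
rather than treat it as a transfer error. A minimal Python sketch, assuming
the first header line always begins with the uppercase file name, as in the
gbgss175.seq example; the function names are illustrative only:

```python
import re

# Number embedded in the first header line, e.g. "GBGSS1.SEQ ...".
HEADER_NAME = re.compile(r"^GBGSS(\d+)\.SEQ\b")
# Number embedded in the file name on disk, compressed or not.
FILE_NAME = re.compile(r"^gbgss(\d+)\.seq(?:\.gz)?$")


def gss_numbers(filename, first_header_line):
    """Return (number in filename, number in header), or None if unrecognized."""
    name_match = FILE_NAME.match(filename)
    header_match = HEADER_NAME.match(first_header_line.strip())
    if not (name_match and header_match):
        return None
    return int(name_match.group(1)), int(header_match.group(1))


def header_offset(filename, first_header_line):
    """Difference between the filename number and the header number."""
    numbers = gss_numbers(filename, first_header_line)
    return None if numbers is None else numbers[0] - numbers[1]


# The example from this section: the file is named 175, the header says 1.
print(header_offset("gbgss175.seq", "GBGSS1.SEQ Genetic Sequence Data Bank"))

# gbgss175 through gbgss304 inclusive accounts for the 130 affected files.
print(len(range(175, 304 + 1)))
```

A non-zero offset on a GSS file in the affected range is expected for this
release and does not by itself indicate a damaged file.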
1.4 Upcoming Changes
1.4.1 New source feature qualifier : /submitter_seqid
Data submitters typically have their own identifiers for genomic contigs and
scaffolds of Whole Genome Shotgun (WGS) sequencing projects, the RNA sequences
of Transcriptome Shotgun Assembly (TSA) sequencing projects, and the genomic
loci of Targeted Locus Study (TLS) sequencing projects. These identifiers can
be very simple (contig01, contig02, etc.), or they can have a bit more
meaning/structure. Examples of the latter include:
gcontig_1106166512749 (ABDU01000001)
CCB157_001 (BDDQ01000001)
Lo7_v2_contig_2306 (CCJQ010001199)
The INSDC has decided that it would be helpful to provide these identifiers
in a formalized way, since they may be known to, or used by, parties other than
the submitters themselves. For example, a submitter may have made them public
in data products, displayed them in genome browsers, cited them in an
analysis, or mentioned them on websites.
These submitter identifiers will be provided via a new qualifier of the
source feature: /submitter_seqid. The value format for the qualifier will
be free text.
A complete definition of the qualifier will be provided when it becomes
available. The earliest possible implementation date falls within the
two-month period following GenBank Release 224.0 (February 15th 2018), but
a more realistic timeframe is April 15th 2018.
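
Parsers that read flatfiles line by line may need to pick up the new
qualifier once it appears. A minimal Python sketch follows; since the complete
definition is not yet available, it assumes the value will be written as a
double-quoted string on a single line, like other free-text qualifiers. The
sample source feature below is hypothetical apart from the identifier, which
is taken from the examples above.

```python
import re

# Assumption: /submitter_seqid="value" on one line, value in double quotes.
QUALIFIER = re.compile(r'^\s*/submitter_seqid="([^"]*)"\s*$')


def submitter_seqids(flatfile_lines):
    """Collect every /submitter_seqid value found in the given lines."""
    values = []
    for line in flatfile_lines:
        match = QUALIFIER.match(line)
        if match:
            values.append(match.group(1))
    return values


# Hypothetical source feature; the location and organism are placeholders.
sample = [
    "     source          1..1500",
    '                     /organism="Example organism"',
    '                     /submitter_seqid="gcontig_1106166512749"',
]

print(submitter_seqids(sample))
```

Values that wrap across several lines are not handled here and would need
the usual continuation-line logic of a full flatfile parser.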
1.4.2 New /gap_type value : "contamination"
When contamination is discovered in a sequence record, removing the
bases from the sequence data can be problematic (especially at the 5' end)
because the length of the sequence changes. If there exist higher-level
scaffold/CON-division records (possibly chromosomes), the resulting change
in length requires an adjustment to the coordinate system of the scaffold/
chromosome, and the features annotated on it. The impact of such a change
on both data submitters and users can impose quite a burden.
To address this, the INSDC has decided to introduce a new Gap Type for
the assembly_gap feature: contamination. When sequence contamination
is discovered, the submitter will have the option of replacing the affected
base pairs with Ns, via a terminal assembly_gap feature. For example:
     assembly_gap    1..2956
                     /estimated_length=2956
                     /gap_type="contamination"
                     /note="contamination masked with Ns"
An updated definition of the /gap_type qualifier will be provided when
it becomes available. The earliest possible implementation date falls
within the two-month period following GenBank Release 224.0 (February 15th
2018), but a more realistic timeframe is April 15th 2018.
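
The reason masking is preferred over removal can be shown with a toy example.
The Python sketch below is illustrative only: the ten-base sequence, the
feature position and the function names are made up, and real records would
be handled by the submission tools rather than by code like this.

```python
def mask_prefix(sequence, length):
    """Replace the first `length` bases with N; the total length is unchanged."""
    if not 0 <= length <= len(sequence):
        raise ValueError("mask length out of range")
    return "N" * length + sequence[length:]


def trim_prefix(sequence, length):
    """Remove the first `length` bases; every downstream coordinate shifts."""
    return sequence[length:]


seq = "ACGTACGTAC"     # made-up sequence whose first 4 bases are contamination
feature_start = 7      # 1-based coordinate of some annotated base

masked = mask_prefix(seq, 4)
trimmed = trim_prefix(seq, 4)

# Masking: the annotated base is still found at coordinate 7.
print(masked, masked[feature_start - 1] == seq[feature_start - 1])

# Trimming: the same base now sits at coordinate 7 - 4 = 3.
print(trimmed, trimmed[feature_start - 1 - 4] == seq[feature_start - 1])
```

With masking, annotation on the record and on any scaffold or chromosome
built from it keeps its coordinates; with trimming, each such coordinate
would have to be recomputed.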