GenBank Release 219.0 Available : April 19 2017


Cavanaugh, Mark (NIH/NLM/NCBI) [E]

Apr 20, 2017, 7:19:13 AM
to 'genbankb@net.bio.net' (genbankb@net.bio.net)
Greetings GenBank Users,

GenBank Release 219.0 is now available via FTP from the National Center
for Biotechnology Information (NCBI):

Ftp Site           Directory   Contents
----------------   ---------   ---------------------------------------
ftp.ncbi.nih.gov   genbank     GenBank Release 219.0 flatfiles
                   ncbi-asn1   ASN.1 data used to create Release 219.0

Close-of-data for GenBank 219.0 occurred on 04/14/2017. Uncompressed,
the Release 219.0 flatfiles require roughly 818 GB (sequence files only).
The ASN.1 data require approximately 685 GB.

Recent statistics for 'traditional' sequences (including non-bulk-oriented
TSA, and excluding WGS, bulk-oriented TSA, TLS, and the CON-division):

Release  Date      Base Pairs     Entries

218      Feb 2017  228719437638   199341377
219      Apr 2017  231824951552   200877884

Recent statistics for WGS sequencing projects:

Release  Date      Base Pairs      Entries

218      Feb 2017  1892966308635   409490397
219      Apr 2017  2035032639807   451840147

Recent statistics for bulk-oriented TSA sequencing projects:

Release  Date      Base Pairs     Entries

218      Feb 2017  133517212104   151431485
219      Apr 2017  149038907599   165068542

Recent statistics for bulk-oriented TLS sequencing projects:

Release  Date      Base Pairs   Entries

218      Feb 2017  636923295    1438349
219      Apr 2017  636923295    1438349 (unchanged)

During the 60 days between the close dates for GenBank Releases 218.0
and 219.0, the 'traditional' portion of GenBank grew by 3,105,513,914
basepairs and by 1,536,507 sequence records. During that same period,
173,862 records were updated. An average of 28,506 'traditional' records
were added and/or updated per day.
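
The growth figures and the daily average quoted above can be reproduced from
the 'traditional' table. The following is only a minimal Python sketch: the
names ReleaseStats, growth and daily_average are illustrative and not part of
any NCBI tool, and the average is assumed to be rounded to the nearest whole
record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseStats:
    """Totals for one release (hypothetical helper type)."""
    base_pairs: int
    entries: int


# Figures copied from the 'traditional' statistics table in this announcement.
REL_218 = ReleaseStats(base_pairs=228_719_437_638, entries=199_341_377)
REL_219 = ReleaseStats(base_pairs=231_824_951_552, entries=200_877_884)
UPDATED_RECORDS = 173_862      # records updated between the two close dates
DAYS_BETWEEN_CLOSES = 60       # days between the 218.0 and 219.0 close dates


def growth(old: ReleaseStats, new: ReleaseStats) -> tuple[int, int]:
    """Return (base pairs added, records added) between two releases."""
    return new.base_pairs - old.base_pairs, new.entries - old.entries


def daily_average(added: int, updated: int, days: int) -> int:
    """Average number of records added and/or updated per day, rounded."""
    return round((added + updated) / days)


added_bp, added_records = growth(REL_218, REL_219)
print(added_bp, added_records)
print(daily_average(added_records, UPDATED_RECORDS, DAYS_BETWEEN_CLOSES))
```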

Between releases 218.0 and 219.0, the WGS component of GenBank grew by
142,066,331,172 basepairs and by 42,349,750 sequence records.

Between releases 218.0 and 219.0, the TSA component of GenBank grew by
15,521,695,495 basepairs and by 13,637,057 sequence records.

Between releases 218.0 and 219.0, the TLS component of GenBank was
unchanged.

For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 219.0 and Upcoming Changes) have been appended
below for your convenience.

* * * IMPORTANT * * *

The files of this GenBank Release are the first for which integer NCBI
"GI" sequence identifiers are no longer presented in the GenBank, GenPept,
and FASTA sequence formats. Users who rely on GIs need to transition to
Accession.Version identifiers. The following NCBI News articles may be of
interest:

https://www.ncbi.nlm.nih.gov/news/12-23-2016-ncbi-insights-bulk-converting-gis/
https://www.ncbi.nlm.nih.gov/news/12-06-2016-ncbi-insights-convert-gi-accver/
https://www.ncbi.nlm.nih.gov/news/10-17-2016-gi-numbers-removed/

* * * IMPORTANT * * *

Release 219.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (e.g., April 15 2017, 219.0). If they are
not, interrupt the remaining transfers and then request assistance from
the NCBI Service Desk.

A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix or Linux platform, using csh/tcsh :

set files = `ls gb*.*`
foreach i ($files)
    head -10 $i | grep Release
end

Or, if the files are compressed, the body of the loop might instead be (on
systems that lack gzcat, 'gzip -dc' does the same job):

gzcat $i | head -10 | grep Release
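
For sites without csh, the same header check could be sketched in Python. This
is a hedged example rather than an NCBI-supplied script: the function names
header_release_lines and stale_files are made up for illustration, and it
assumes compressed files end in ".gz" and that the release number appears in
the first ten lines as "Release 219.0".

```python
import glob
import gzip


def header_release_lines(path, max_lines=10):
    """Return the lines among the first max_lines of path that mention 'Release'."""
    # Assumption: gzip-compressed files are recognisable by a .gz suffix.
    opener = gzip.open if path.endswith(".gz") else open
    found = []
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for line_number, line in enumerate(handle):
            if line_number >= max_lines:
                break
            if "Release" in line:
                found.append(line.rstrip("\n"))
    return found


def stale_files(pattern="gb*.*", expected="219.0"):
    """List files matching pattern whose header does not name the expected release."""
    wanted = "Release " + expected
    stale = []
    for path in sorted(glob.glob(pattern)):
        if not any(wanted in line for line in header_release_lines(path)):
            stale.append(path)
    return stale
```

Calling stale_files() in the download directory returns the files whose
headers do not mention the expected release; an empty list suggests the
transfer is consistent.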

If you encounter problems while ftp'ing or uncompressing Release
219.0, please send email outlining your difficulties to:

in...@ncbi.nlm.nih.gov

Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky, Sergey Zhdanov,
GenBank
NCBI/NLM/NIH/HHS

1.3 Important Changes in Release 219.0

1.3.1 GI sequence identifiers have been removed from GenBank/GenPept/FASTA
formats and the FASTA header has been simplified

As of March 15 2017, the integer sequence identifiers known as "GIs" were
no longer included in the GenBank, GenPept, and FASTA formats for GenBank
Update products. The FASTA header has been further simplified, to report only
the sequence Accession.Version for records that originate within the
International Sequence Database Collaboration (INSDC).

And as of April 15 2017, this GenBank Release 219.0 is the first one to
follow that same policy.

Previously-assigned GI sequence identifiers will continue to exist
'behind the scenes', and NCBI services which accept GIs as inputs will
continue to be supported. NCBI will be adding support for Accession.Version
identifiers to any services that currently do not support them. As NCBI
makes this transition, we encourage any users who have workflows that
depend on GIs to make use of Accession.Version identifiers instead.

The FASTA format has also been changed for sequence records originating
within the INSDC, to report only the Accession.Version and the record title.
This will improve compatibility with other file types provided by NCBI and
others, including GFF3, Gene, and dbSNP download files. This FASTA format
change has already been made for the redesigned genomes FTP site based on
user requests to have a single consistent sequence identifier for both GFF3
and FASTA formats.

At this time, we plan to continue to provide database source information in
the FASTA header/definition line for non-INSDC sources of sequence data,
including UniProt, PDB structures, PIR, and Patent sequences.

Example 1 : An INSDC nucleotide record

In the sample record below, nucleotide sequence AF123456 was assigned a
GI of 6633795, and the protein translated from its coding region feature
was assigned a GI of 6633796 :

LOCUS AF123456 1510 bp mRNA linear VRT 12-APR-2012
DEFINITION Gallus gallus doublesex and mab-3 related transcription factor 1
(DMRT1) mRNA, partial cds.
ACCESSION AF123456
VERSION AF123456.2 GI:6633795
....
CDS <1..936
/gene="DMRT1"
/note="cDMRT1"
/codon_start=1
/product="doublesex and mab-3 related transcription factor
1"
/protein_id="AAF19666.1"
/db_xref="GI:6633796"
/translation="PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSL
IAERQRVMAVQVALRRQQAQEEELGISHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPA
HSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSDLVVDSTYYSSFYQPSLYPYY
NNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQWQMKGMEN
RHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDS
GLGCLSSSESTKGDLECEPHQEPGAFAVSPVLEGE"

The Accession.Version is now the sole sequence version identifier. The GI
value on the VERSION line and the GI /db_xref qualifier for the coding region
feature are no longer displayed:

LOCUS AF123456 1510 bp mRNA linear VRT 12-APR-2012
DEFINITION Gallus gallus doublesex and mab-3 related transcription factor 1
(DMRT1) mRNA, partial cds.
ACCESSION AF123456
VERSION AF123456.2
....
CDS <1..936
/gene="DMRT1"
/note="cDMRT1"
/codon_start=1
/product="doublesex and mab-3 related transcription factor
1"
/protein_id="AAF19666.1"
/translation="PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSL
IAERQRVMAVQVALRRQQAQEEELGISHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPA
HSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSDLVVDSTYYSSFYQPSLYPYY
NNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQWQMKGMEN
RHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDS
GLGCLSSSESTKGDLECEPHQEPGAFAVSPVLEGE"

Example 2 : A GenPept record for an INSDC sequence

The GenPept display format previously included GI identifiers in the VERSION
lines (note that the coding region feature for GenPept has never included any
mention of the protein GI identifiers) :

LOCUS AAF19666 311 aa linear VRT 12-APR-2012
DEFINITION doublesex and mab-3 related transcription factor 1, partial [Gallus
gallus].
ACCESSION AAF19666
VERSION AAF19666.1 GI:6633796
DBSOURCE accession AF123456.2
....
CDS 1..311
/gene="DMRT1"
/coded_by="AF123456.2:<1..936"

The VERSION line no longer includes the GI identifier:

LOCUS AAF19666 311 aa linear VRT 12-APR-2012
DEFINITION doublesex and mab-3 related transcription factor 1, partial [Gallus
gallus].
ACCESSION AAF19666
VERSION AAF19666.1
DBSOURCE accession AF123456.2
....
CDS 1..311
/gene="DMRT1"
/coded_by="AF123456.2:<1..936"
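
Parsers that read the VERSION line may need to accept both the old and the new
layout while archived releases are still in use. The sketch below is one way
to do that in Python; parse_version_line is a hypothetical helper, and the
regular expression is an assumption based only on the two examples above.

```python
import re

# Matches "VERSION <Accession.Version>" with an optional trailing "GI:<digits>".
_VERSION_RE = re.compile(
    r"^VERSION\s+(?P<accver>\S+\.\d+)(?:\s+GI:(?P<gi>\d+))?\s*$"
)


def parse_version_line(line):
    """Return (accession_version, gi) where gi is None if no GI is present."""
    match = _VERSION_RE.match(line)
    if match is None:
        raise ValueError("not a recognised VERSION line: %r" % line)
    gi = match.group("gi")
    return match.group("accver"), (int(gi) if gi is not None else None)


# Old layout (Release 218.0 and earlier) and new layout (Release 219.0 onward).
print(parse_version_line("VERSION     AAF19666.1  GI:6633796"))
print(parse_version_line("VERSION     AAF19666.1"))
```

Code written this way keys everything on the Accession.Version and treats the
GI as optional metadata.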

Example 3: FASTA format for an INSDC nucleotide and protein sequence

Previously, the FASTA display for most products included GI and database
source information (e.g., 'gb' for GenBank, 'emb' for ENA, 'dbj' for
DDBJ), using the '|' character as a delimiter:

>gi|6633795|gb|AF123456.2| Gallus gallus doublesex and mab-3 related transcription factor 1 (DMRT1) mRNA, partial cds
CCGGCGGCGGGCAAGAAGCTGCCGCGTCTGCCCAAGTGTGCCCGCTGCCGCAACCACGGCTACTCCTCGC
CGCTGAAGGGGCACAAGCGGTTCTGCATGTGGCGGGACTGCCAGTGCAAGAAGTGCAGCCTGATCGCCGA
[....]

>gi|6633796|gb|AAF19666.1| doublesex and mab-3 related transcription factor 1, partial [Gallus gallus]
PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSLIAERQRVMAVQVALRRQQAQEEELGI
SHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPAHSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSD
LVVDSTYYSSFYQPSLYPYYNNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQ
WQMKGMENRHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDSGLGC
LSSSESTKGDLECEPHQEPGAFAVSPVLEGE

Since March 15 2017, and with this April 2017 GenBank Release, just the
Accession.Version and the record title are provided:

>AF123456.2 Gallus gallus doublesex and mab-3 related transcription factor 1 (DMRT1) mRNA, partial cds
CCGGCGGCGGGCAAGAAGCTGCCGCGTCTGCCCAAGTGTGCCCGCTGCCGCAACCACGGCTACTCCTCGC
CGCTGAAGGGGCACAAGCGGTTCTGCATGTGGCGGGACTGCCAGTGCAAGAAGTGCAGCCTGATCGCCGA
[....]

>AAF19666.1 doublesex and mab-3 related transcription factor 1, partial [Gallus gallus]
PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSLIAERQRVMAVQVALRRQQAQEEELGI
SHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPAHSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSD
LVVDSTYYSSFYQPSLYPYYNNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQ
WQMKGMENRHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDSGLGC
LSSSESTKGDLECEPHQEPGAFAVSPVLEGE
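
Users with archived FASTA files in the old style may want to bring their
headers in line with the new one. The following Python sketch shows one
possible conversion; simplify_defline is an illustrative name, and the pattern
assumes the old INSDC header shape shown above (gi, then one of gb/emb/dbj,
then the Accession.Version). Headers from other sources are returned
unchanged.

```python
import re

# Old-style INSDC definition line: >gi|<GI>|<gb|emb|dbj>|<Accession.Version>| <title>
_OLD_INSDC_DEFLINE = re.compile(
    r"^>gi\|(?P<gi>\d+)\|(?P<db>gb|emb|dbj)\|(?P<accver>[^|\s]+)\|\s*(?P<title>.*)$"
)


def simplify_defline(line):
    """Rewrite an old-style INSDC FASTA header as '>Accession.Version title'."""
    text = line.rstrip("\n")
    match = _OLD_INSDC_DEFLINE.match(text)
    if match is None:
        # Not an old-style INSDC header (or already simplified): leave it alone.
        return text
    title = match.group("title")
    return ">" + match.group("accver") + (" " + title if title else "")


print(simplify_defline(
    ">gi|6633796|gb|AAF19666.1| doublesex and mab-3 related "
    "transcription factor 1, partial [Gallus gallus]"
))
```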

Please direct any inquiries about these changes to the NCBI Service Desk:

in...@ncbi.nlm.nih.gov

1.3.2 Organizational changes

The total number of sequence data files increased by 42 with this release:

- the BCT division is now composed of 350 files (+20)
- the CON division is now composed of 359 files (+3)
- the ENV division is now composed of 97 files (+2)
- the EST division is now composed of 483 files (+2)
- the INV division is now composed of 153 files (+1)
- the PAT division is now composed of 290 files (+7)
- the PHG division is now composed of 4 files (+1)
- the PLN division is now composed of 145 files (+2)
- the PRI division is now composed of 56 files (+1)
- the SYN division is now composed of 10 files (+1)
- the TSA division is now composed of 230 files (+1)
- the VRL division is now composed of 48 files (+1)
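
As a quick consistency check, the per-division increases listed above can be
summed and compared with the stated total. A minimal Python sketch (the
dictionary simply restates the list; the names are illustrative):

```python
# Per-division increase in file count, copied from the list above.
FILE_INCREASES = {
    "BCT": 20, "CON": 3, "ENV": 2, "EST": 2, "INV": 1, "PAT": 7,
    "PHG": 1, "PLN": 2, "PRI": 1, "SYN": 1, "TSA": 1, "VRL": 1,
}


def total_increase(increases):
    """Sum the per-division increases in the number of sequence data files."""
    return sum(increases.values())


print(total_increase(FILE_INCREASES))  # prints 42, matching the stated total
```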

1.3.3 Invalid flatfile entry in GenBank 218.0 corrected : KX396599

A user at Chemical Abstracts Service helpfully reported a formatting error
for GenBank sequence record KX396599 in GenBank Release 218.0. The record
appeared with the KEYWORDS linetype in the wrong column:

LOCUS KX396599 8308 bp DNA linear PLN 18-JAN-2017
DEFINITION Marshallia obovata retrotransposon del/tekay, complete sequence.
ACCESSION KX396599
VERSION KX396599.1 GI:1131742074

KEYWORDS.
SOURCE Marshallia obovata

This was caused by the presence of an invalid keyword, consisting of just
a period. The data problem was fixed, but time constraints prevented us from
issuing a patch for the affected file (gbpln130.seq). Nonetheless, we
appreciate the scrutiny provided by GenBank users, and we do follow up on all
problem reports. Thank you CAS!

1.3.4 GSS File Header Problem

GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped by the first, it does not know how to number its own
output files.

There is thus a discrepancy between the filenames and file headers for 130
of the GSS flatfiles in Release 219.0. Consider gbgss174.seq :

GBGSS1.SEQ Genetic Sequence Data Bank
April 15 2017

NCBI-GenBank Flat File Release 219.0

GSS Sequences (Part 1)

87034 loci, 63855245 bases, from 87034 reported sequences

Here, the filename and part number in the header are "1", though the file
has been renamed as "174" based on the number of files dumped from the other
system. We hope to resolve this discrepancy at some point, but the priority
is certainly much lower than many other tasks.
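
Scripts that validate headers against filenames will therefore report
mismatches for the affected GSS files. One way to detect (and then tolerate)
them is sketched below in Python. The helper names are hypothetical, and the
sketch assumes the first whitespace-separated token of the header's first line
is the file label, as in the gbgss174.seq example above.

```python
import os


def header_label(first_line):
    """Return the lowercased file label that opens a flatfile header line."""
    fields = first_line.split()
    return fields[0].lower() if fields else ""


def expected_label(path):
    """Return the lowercased file name, without any trailing .gz suffix."""
    name = os.path.basename(path).lower()
    return name[:-3] if name.endswith(".gz") else name


def name_matches_header(path, first_line):
    """True if the label in the header agrees with the file's actual name."""
    return expected_label(path) == header_label(first_line)


# The example from this section: the header says GBGSS1.SEQ, the file is gbgss174.seq.
print(name_matches_header("gbgss174.seq", "GBGSS1.SEQ Genetic Sequence Data Bank"))
```

A validation script might log such mismatches for gbgss files as warnings
rather than treating them as transfer errors.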

1.4 Upcoming Changes

1.4.1 No changes impacting GenBank Release content are currently planned.


