
GenBank FTP site changes for WGS, TSA, and TLS data products : July 11 2017


Cavanaugh, Mark (NIH/NLM/NCBI) [E]

May 16, 2017, 9:40:00 AM
to 'genbankb@net.bio.net' (genbankb@net.bio.net)
Greetings GenBank Users,

This announcement describes several related changes for areas of the NCBI FTP
site used for WGS, TSA, and TLS data products. Some have already been made,
some are in progress but not yet announced, and others have not yet been
implemented. All of the changes are expected to be in place as of July 11 2017.


1.) Subdirectory Structure Will Be Introduced For WGS, TSA, and TLS FTP Areas

Comprehensive FTP areas dedicated to WGS, TSA, and TLS sequencing projects
are provided at the NCBI FTP site. They are intended for large-scale sequence
data customers interested in data for *all* such projects, for all organisms
(bacteria to eukaryotes), annotated or unannotated, high-coverage or low-
coverage. READMEs that describe the FTP products, file organization, and
terminology are available:

ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs/README.genbank.wgs
ftp://ftp.ncbi.nlm.nih.gov/genbank/tsa/README.genbank.tsa
ftp://ftp.ncbi.nlm.nih.gov/genbank/tls/README.genbank.tls

Given the target audience, these FTP areas have been completely "flat",
without any sub-structure, so that the data files for *every* WGS/TSA/TLS
project are all available in a single convenient place.

However, we recently exceeded the maximum number of files that can be present
in the WGS areas, requiring us to introduce new sub-directories and to move
some WGS projects to them. The new WGS sub-directories are "/A" and "/C".

WGS projects with project codes starting with the letters "A" (e.g., AAAB,
ABBA) and "C" (e.g., CAAA, CAAB), and which were last updated *prior* to
January 1 2016, have been relocated to the sub-directories.

This greatly reduced the number of files at the top level of the WGS FTP
areas and allowed us to continue operating as before, but it was just a
stop-gap solution. So effective July 11 2017, we will be introducing a
sub-directory for every unique initial WGS project-code letter (/A, /B, /C,
etc), and the data files for *every* project will be moved to the
sub-directory appropriate for its project code.

While the number of files for TSA and TLS projects is much smaller than WGS,
we will use the same approach for them so that all three project types are
handled in the same way.

There are several advantages of this change: the per-directory limit on the
number of files will not be an issue going forward; the overall structure is
still relatively "flat", thus it remains easy for large-scale customers to
navigate and mirror; the sub-directories provide a natural mechanism for
scheduling FTP or mirroring tasks, one project-code prefix letter at a time.

There are separate WGS/TSA/TLS FTP areas for flatfile-oriented data products
(GenBank flatfiles; GenPept flatfiles; FASTA) and for ASN.1 data products.
Using TSA as a model, here is a summary of the directory structure that will be
introduced on July 11 :

Original 'top-level' directory               New sub-directories

ftp://ftp.ncbi.nlm.nih.gov/genbank/tsa/      genbank/tsa/G    (GAAA-GZZZ projects)
                                             genbank/tsa/H    (HAAA-HZZZ projects)
                                             genbank/tsa/I    (IAAA-IZZZ projects)

ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1/tsa/    ncbi-asn1/tsa/G  (GAAA-GZZZ projects)
                                             ncbi-asn1/tsa/H  (HAAA-HZZZ projects)
                                             ncbi-asn1/tsa/I  (IAAA-IZZZ projects)

Similarly, eight new sub-directories will be introduced for the WGS FTP areas:
/A, /B, /C, /F, /J, /L, /M, and /N.

Only one TLS project-code prefix is currently in use, so only one sub-directory
(/K) will be introduced for the TLS FTP areas.
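
As an illustration only, the mapping from a project code to its new
sub-directory can be sketched in a few lines of Python. The function name and
the idea of passing the FTP area as a string are assumptions made for this
example; they are not part of any NCBI tool.

```python
def project_subdir(area: str, project_code: str) -> str:
    """Return the sub-directory expected to hold a project's data files.

    area is a comprehensive FTP area such as "genbank/wgs" or
    "ncbi-asn1/tsa"; project_code is a code such as "MPXR".
    Hypothetical helper: the sub-directory is simply the first letter
    of the project code, as described in the announcement.
    """
    code = project_code.strip().upper()
    if not code or not code[0].isalpha():
        raise ValueError(f"unexpected project code: {project_code!r}")
    return f"{area.rstrip('/')}/{code[0]}"


print(project_subdir("genbank/wgs", "MPXR"))    # genbank/wgs/M
print(project_subdir("ncbi-asn1/tsa", "GAAA"))  # ncbi-asn1/tsa/G
```

A mirroring job could call such a helper once per project code, or iterate
over the prefix letters to process one sub-directory at a time.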


2.) Sixty-Day Symlink Areas For Recent WGS, TSA, and TLS Projects

As an additional convenience for our large-scale customers, we also provide
an alternative method for staying current with each day's new and updated
WGS/TSA/TLS projects, based on symbolic links.

Within the flatfile-oriented and ASN.1-oriented FTP areas for GenBank
Incremental Updates, there are sub-directories for WGS/TSA/TLS projects:

ftp://ftp.ncbi.nlm.nih.gov/genbank/daily-nc/wgs/
ftp://ftp.ncbi.nlm.nih.gov/genbank/daily-nc/tsa/
ftp://ftp.ncbi.nlm.nih.gov/genbank/daily-nc/tls/

ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1/daily-nc/wgs/
ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1/daily-nc/tsa/
ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1/daily-nc/tls/

These directories contain WGS/TSA/TLS data files for just those projects new
or updated within the past 60 days, implemented as symbolic links to the
*actual* data files located in the comprehensive WGS/TSA/TLS FTP areas.

Since the number of these links is much smaller than the overall set of
WGS/TSA/TLS data files, some customers will find it easier to process just the
contents of these directories, on a daily basis. Note that there is no
sub-directory structure: symbolic links for all projects, regardless of
project-code prefix letter, are commingled. Here is an example of what the
symbolic links for WGS project MPXR will look like as of July 11:

lrwxrwxrwx 1 gbupdate stats.wgs.MPXR -> ../../wgs/M/stats.wgs.MPXR
lrwxrwxrwx 1 gbupdate wgs.MPXR.1.fsa_aa.gz -> ../../wgs/M/wgs.MPXR.1.fsa_aa.gz
lrwxrwxrwx 1 gbupdate wgs.MPXR.1.fsa_nt.gz -> ../../wgs/M/wgs.MPXR.1.fsa_nt.gz
lrwxrwxrwx 1 gbupdate wgs.MPXR.1.gbff.gz -> ../../wgs/M/wgs.MPXR.1.gbff.gz
lrwxrwxrwx 1 gbupdate wgs.MPXR.1.gnp.gz -> ../../wgs/M/wgs.MPXR.1.gnp.gz
lrwxrwxrwx 1 gbupdate wgs.MPXR.mstr.gbff.gz -> ../../wgs/M/wgs.MPXR.mstr.gbff.gz
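
For customers who want to check their mirrors, the link targets in the listing
above follow a predictable pattern. The Python sketch below reproduces that
pattern from a filename alone. It assumes that project codes consist of
uppercase letters and that TSA and TLS files follow the same naming scheme as
the WGS example; the function name is hypothetical.

```python
import re

# Matches names such as "wgs.MPXR.1.gbff.gz", "wgs.MPXR.mstr.gbff.gz",
# and "stats.wgs.MPXR". Assumption: project codes are uppercase letters.
_DATA_FILE = re.compile(r"^(?:stats\.)?(wgs|tsa|tls)\.([A-Z]+)(?:\.|$)")


def expected_link_target(filename: str) -> str:
    """Return the relative symlink target expected for a daily-nc entry."""
    match = _DATA_FILE.match(filename)
    if match is None:
        raise ValueError(f"not a recognised data file name: {filename!r}")
    proj_type, code = match.groups()
    # Two levels up from daily-nc/<type>/, then into <type>/<first letter>/.
    return f"../../{proj_type}/{code[0]}/{filename}"


print(expected_link_target("wgs.MPXR.1.gbff.gz"))
# ../../wgs/M/wgs.MPXR.1.gbff.gz
```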

As described in the READMEs mentioned in Section (1), daily WGS/TSA/TLS
"Project Lists" provide the project codes of new, re-assembled, updated, and
refreshed projects. Importantly, projects described as 'refreshed' in the
Project Lists will not be represented by symbolic links in these 'daily-nc'
WGS/TSA/TLS FTP areas.

NCBI periodically performs maintenance activities, not driven by submitters,
on extremely large numbers of projects. Consider a taxonomic lineage change
that impacts all bacterial WGS projects, or a change to the INSDC Feature
Table Document which impacts the annotation for many projects. While such
events will result in a "refresh" of the project data files in the
comprehensive WGS/TSA/TLS FTP areas, symlinks pointing to the refreshed files
will not be established.

Essentially, the content of these 'daily-nc' WGS/TSA/TLS FTP areas is focused
on new projects, re-assembled projects, and (submitter) updated projects, since
those are the most important events. Customers with an interest in processing
refreshed projects have the option of identifying them via the daily Project
Lists and then obtaining them directly from the comprehensive WGS/TSA/TLS FTP
areas.


3.) Per-file File Size Limits Will Be Increased for WGS, TSA, and TLS projects

As explained in the README files mentioned in Section (1), data products for
large WGS/TSA/TLS projects are split into a series of numbered files, so that
the uncompressed size of each file is fairly small. We will be increasing the
per-file uncompressed size limit by a factor of 4 on July 11 2017, to
approximately 1000MB.

We currently target approximately 250MB as the uncompressed size for ASN.1
files. GenBank flatfiles generated from the ASN.1 exhibit greater size
variation, because the ASN.1 representation is usually more efficient. Here are
file size statistics for the ABBA WGS project:

compressed uncompressed ratio uncompressed_name
57911193 262153918 77.9% wgs.ABBA.1.bbs
57201651 262192398 78.2% wgs.ABBA.2.bbs
55345471 262240781 78.9% wgs.ABBA.3.bbs
54917695 262413854 79.1% wgs.ABBA.4.bbs
55491102 262261436 78.8% wgs.ABBA.5.bbs
55453991 262164036 78.8% wgs.ABBA.6.bbs
56181408 262150045 78.6% wgs.ABBA.7.bbs
56073338 262382300 78.6% wgs.ABBA.8.bbs
56901610 262155098 78.3% wgs.ABBA.9.bbs
55902522 262220122 78.7% wgs.ABBA.10.bbs
55362004 262173885 78.9% wgs.ABBA.11.bbs
55094239 262154436 79.0% wgs.ABBA.12.bbs
57349780 262165088 78.1% wgs.ABBA.13.bbs
63627982 262151291 75.7% wgs.ABBA.14.bbs
67977134 262149696 74.1% wgs.ABBA.15.bbs
4921323 15290887 67.8% wgs.ABBA.16.bbs

compressed uncompressed ratio uncompressed_name
88643783 290501062 69.5% wgs.ABBA.1.gbff
89438726 284391650 68.6% wgs.ABBA.2.gbff
90695439 276356887 67.2% wgs.ABBA.3.gbff
90857003 274837376 66.9% wgs.ABBA.4.gbff
90643043 276818347 67.3% wgs.ABBA.5.gbff
90879837 276106059 67.1% wgs.ABBA.6.gbff
90452122 279001644 67.6% wgs.ABBA.7.gbff
90259343 279179762 67.7% wgs.ABBA.8.gbff
90178329 281619649 68.0% wgs.ABBA.9.gbff
90595106 278149575 67.4% wgs.ABBA.10.gbff
90549917 276016214 67.2% wgs.ABBA.11.gbff
90475153 275278862 67.1% wgs.ABBA.12.gbff
89920341 282606996 68.2% wgs.ABBA.13.gbff
79701383 379323738 79.0% wgs.ABBA.14.gbff
61552402 577391512 89.3% wgs.ABBA.15.gbff
3332210 36833099 91.0% wgs.ABBA.16.gbff

With the increase in the file size limit, the total number of files for the
ABBA project would decrease by a factor of four, and the new file size
statistics would be:

compressed uncompressed ratio uncompressed_name
225298566 1048603012 78.5% wgs.ABBA.1.bbs
223127691 1048609318 78.7% wgs.ABBA.2.bbs
223236836 1048647105 78.7% wgs.ABBA.3.bbs
194021307 802493392 75.8% wgs.ABBA.4.bbs

compressed uncompressed ratio uncompressed_name
345653109 1063030610 67.5% wgs.ABBA.1.gbff
348641406 1062911676 67.2% wgs.ABBA.2.gbff
348666570 1062945874 67.2% wgs.ABBA.3.gbff
219579562 817096952 73.1% wgs.ABBA.4.gbff

Notice that the compressed file sizes remain well below 500MB. This means
that we can reduce the overall number of files for large projects, and yet the
size of individual FTP transfers remains reasonable and manageable. Fewer files
will definitely benefit our large-scale WGS/TSA/TLS customers.
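
The file counts quoted above can be approximated with simple arithmetic. The
Python sketch below sums the sixteen uncompressed .bbs sizes listed for ABBA
and divides by the size limit. It assumes that "250MB" and "1000MB" mean 250
and 1000 multiples of 1024*1024 bytes, which is only inferred from the
observed file sizes, and it ignores the fact that real files are split at
record boundaries and so slightly exceed the limit.

```python
MIB = 1024 * 1024  # assumption: the announced "MB" limits are binary megabytes

# Uncompressed sizes of wgs.ABBA.1.bbs through wgs.ABBA.16.bbs, from the
# statistics above.
ABBA_BBS_SIZES = [
    262153918, 262192398, 262240781, 262413854,
    262261436, 262164036, 262150045, 262382300,
    262155098, 262220122, 262173885, 262154436,
    262165088, 262151291, 262149696, 15290887,
]


def estimated_file_count(total_bytes: int, limit_mb: int) -> int:
    """Estimate how many files a project is split into for a given limit."""
    limit_bytes = limit_mb * MIB
    # Integer ceiling division; at least one file is always produced.
    return max(1, -(-total_bytes // limit_bytes))


total = sum(ABBA_BBS_SIZES)
print(estimated_file_count(total, 250))   # 16
print(estimated_file_count(total, 1000))  # 4
```

This is only an estimate for planning purposes; the actual number of files for
any project is determined by NCBI's processing.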

Large projects processed on or after July 11 will be split based on the
increased file size limit. Projects currently split into multiple files based
on the older file size limit will remain as-is until they are either updated or
"refreshed" by a maintenance process (see Section 2).

4.) Changes For Daily WGS/TSA/TLS Project List Files

Due to the large volume of WGS, TSA, and TLS projects, multiple distinct
processing runs are already required in order to provide FTP data products for
each day's new and updated projects. As an example, consider the Project Lists
for May 2 2017 in the WGS FTP area for flatfile-oriented products:

ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs

-rw-r--r-- 1 gbupdate 1828 May 2 02:46 proj_list.2017.0502
-rw-r--r-- 1 gbupdate 12 May 2 05:16 proj_list.2017.0502.050501

The second/additional Project List makes use of a suffix of the form .HHMMSS,
where HH is a two-digit hour value (05), MM is a two-digit minute value (05),
and SS is a two-digit second value (01).

As of July 11 2017, all Project List files will include the .HHMMSS suffix in
their filenames.

We will also be changing the naming convention of all project list files so
that they can be distinguished more easily from the actual WGS, TSA, and TLS
data files. The new uniform convention will be:

proj_list.PROJ_TYPE.YYYY.MMDD.HHMMSS

where "PROJ_TYPE" will be wgs, tsa, or tls, YYYY is a four-digit year, MMDD is
a two-digit month followed by a two-digit day, and HHMMSS is as described
above. Here are some examples of this change:

Old Project List Filename         New Project List Filename

proj_list.2017.0502.050501        proj_list.wgs.2017.0502.050501
tsa.proj_list.2017.0502.050501    proj_list.tsa.2017.0502.050501
tls.proj_list.2016.1111.050502    proj_list.tls.2016.1111.050502

Reserving the 'wgs', 'tsa', and 'tls' filename prefixes for the actual
WGS/TSA/TLS sequence data files will simplify file management for large-scale
customers who process significant numbers of sequence projects.
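
Scripts that locate Project Lists by filename will need to be adjusted. The
following Python sketch shows one possible way to translate old-style names
and to parse new-style names. The function names are hypothetical, and the
treatment of an unprefixed old name as a WGS list is an assumption based on
the examples above. Old names without an .HHMMSS suffix are deliberately left
unconverted, since this announcement does not say how they map.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

_OLD = re.compile(r"^(?:(tsa|tls)\.)?proj_list\.(\d{4})\.(\d{4})\.(\d{6})$")
_NEW = re.compile(r"^proj_list\.(wgs|tsa|tls)\.(\d{4})\.(\d{4})\.(\d{6})$")


def to_new_name(old_name: str) -> Optional[str]:
    """Translate an old Project List filename to the new convention."""
    match = _OLD.match(old_name)
    if match is None:
        return None  # not an old-style name with an .HHMMSS suffix
    prefix, year, mmdd, hhmmss = match.groups()
    proj_type = prefix or "wgs"  # assumption: no prefix means WGS
    return f"proj_list.{proj_type}.{year}.{mmdd}.{hhmmss}"


def parse_new_name(name: str) -> Optional[Tuple[str, datetime]]:
    """Split a new-style name into its project type and timestamp."""
    match = _NEW.match(name)
    if match is None:
        return None
    proj_type, year, mmdd, hhmmss = match.groups()
    stamp = datetime.strptime(year + mmdd + hhmmss, "%Y%m%d%H%M%S")
    return proj_type, stamp


print(to_new_name("tsa.proj_list.2017.0502.050501"))
# proj_list.tsa.2017.0502.050501
```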


