GenBank Release 221.0 Problem: 21 records with illegally line-wrapped text qualifiers in the Feature Table

Cavanaugh, Mark (NIH/NLM/NCBI) [E]

Sep 2, 2017, 11:52:41 AM
To: 'genbankb@net.bio.net' (genbankb@net.bio.net)
Greetings GenBank Users,

Staff from Chemical Abstracts Service detected invalid flatfile formatting
in one of the GenBank 221.0 release files and reported it to NCBI on Friday,
August 25th.

NCBI confirmed the problem and isolated a test case. CAS ultimately found that
five of the release files were impacted:

gbhtg86.seq.gz
gbpri2.seq.gz
gbpri24.seq.gz
gbpri3.seq.gz
gbpri56.seq.gz

NCBI identified a total of 21 affected records among these five files.
Accession lists are provided below.

The formatting problem is the presence of completely empty lines within
the Feature Table section. They are caused by text qualifiers whose values
include one or more blank lines, inserted to make the text more
readable. Here is an example:

LOCUS       AC202819               70919 bp    DNA     linear   HTG 27-JUN-2008
DEFINITION  Gossypium hirsutum chromosome UNKNOWN clone ZMMBBb-510L8, ***
            SEQUENCING IN PROGRESS ***, 21 unordered pieces.
ACCESSION   AC202819
VERSION     AC202819.1
....
     misc_feature    1..4949
                     /note=";
                     This clone was previously submitted as Zea mays as part of
                     the Maize sequencing project. It appears that the original
                     maize BAC library (ZMMBBb) is contaminated with Gossypium
                     hirsutum cv Maaxacotton. During the mapping project, these
                     clones clustered together to form small unanchored mapping
                     contigs and clones from these contigs were chosen for
                     sequencing. In order to make the best use of this data the
                     clones have been reclassified as cotton and left in the
                     public domain for potential use.

                     assembly_name:Contig100

                     SOURCE INFORMATION:
                     The ZMMBBb Corn BAC Library was constructed by Jeff
                     Tomkins at Clemson University Genomics Institute from Zea
                     mays cultivar B73. For more information about this library
                     or to obtain a clone, please refer to the online ordering
                     system at the CUGI BAC/EST Resource Center
                     (https://www.genome.clemson.edu)."

Because the blank lines lie within the body of a line-wrapped value for
the /note qualifier, they should *not* actually be completely empty. Rather,
they should consist of 21 space characters, followed by a newline:

                     public domain for potential use.
                     
^^^^^^^^^^^^^^^^^^^^^
                     assembly_name:Contig100
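
For anyone holding an unpatched copy, the padding rule described above can be sketched in a few lines of Python. This is only an illustration, not NCBI's actual fix: the function name pad_blank_feature_lines is hypothetical, and the sketch assumes that the feature table begins at the FEATURES line and ends at the next line with a keyword in column 1 (such as ORIGIN, CONTIG or //).

```python
QUALIFIER_INDENT = " " * 21  # qualifier text starts in column 22


def pad_blank_feature_lines(lines):
    """Return a copy of `lines` (no trailing newlines) in which completely
    empty lines inside a feature table are replaced by 21 spaces."""
    fixed = []
    in_features = False
    for line in lines:
        if line.startswith("FEATURES"):
            in_features = True
        elif in_features and line[:1] not in ("", " "):
            # Assumption: any other keyword in column 1 ends the feature table.
            in_features = False
        if in_features and line == "":
            line = QUALIFIER_INDENT
        fixed.append(line)
    return fixed
```

Empty lines outside the feature table are left untouched, so the sketch changes nothing in a record that is already well-formed.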

These leading spaces were erroneously trimmed, and the result technically
breaks the GenBank flatfile specification. Most likely, only those who perform
a fairly deep parse of the flatfile structure will be impacted by this error.
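
To check whether a local copy of a release file still carries the problem, a scan along the following lines may help. It is a rough sketch under stated assumptions rather than a full flatfile parser: the name find_empty_feature_lines is made up for this example, records are assumed to end with a // line, and the feature table is assumed to run from the FEATURES line to the next keyword in column 1.

```python
def find_empty_feature_lines(lines):
    """Yield (accession, line_number) for each completely empty line
    found inside a feature table. Line numbers are 1-based."""
    accession = None
    in_features = False
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text.startswith("ACCESSION"):
            fields = text.split()
            accession = fields[1] if len(fields) > 1 else None
        elif text.startswith("FEATURES"):
            in_features = True
        elif text.startswith("//"):
            # End of record: reset state for the next entry.
            in_features = False
            accession = None
        elif in_features and text[:1] not in ("", " "):
            # Assumption: another keyword in column 1 ends the feature table.
            in_features = False
        elif in_features and text == "":
            yield accession, number
```

The function accepts any iterable of text lines, so a handle opened with Python's gzip.open(path, "rt") on one of the .seq.gz files can be passed in directly.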

The five GenBank release files were patched on Friday, Sept 1, 2017, and then
installed at the FTP site:

-rw-r--r-- 1 gbupdate giprog 63705600 Sep 1 13:46 gbpri56.seq.gz
-rw-r--r-- 1 gbupdate giprog 80114309 Sep 1 13:46 gbpri3.seq.gz
-rw-r--r-- 1 gbupdate giprog 19750385 Sep 1 13:46 gbpri24.seq.gz
-rw-r--r-- 1 gbupdate giprog 65577608 Sep 1 13:46 gbpri2.seq.gz
-rw-r--r-- 1 gbupdate giprog 83466269 Sep 1 13:46 gbhtg86.seq.gz

In addition to fixing the flatfile generator bug, we have implemented
stricter flatfile parse-checks during GenBank Release processing, which
will prevent similar problems in the future.

We would like to thank our customers at Chemical Abstracts Service for
alerting us to this problem. We appreciate the scrutiny of our products
that GenBank users provide, and welcome error reports.

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS

htg86:

AC202819
AC202820
AC202822
AC202823
AC202824
AC202825
AC202827
AC202829
AC202830
AC202831

pri2:

AC002116
AC003107
AC004143
AC004151
AC004472
AC004602
AC004770

pri3:

AC005175
AC005306

pri24:

AF037222

pri56:

U95626
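
For those who track records by accession, the lists above can be turned into a simple lookup. The sketch below is only a convenience: the names AFFECTED and affected_file are hypothetical, and it assumes that each list heading (htg86, pri2, and so on) corresponds to the release file of the same name given earlier in this message.

```python
# Accessions copied from the lists above, keyed by release file name.
AFFECTED = {
    "gbhtg86.seq.gz": [
        "AC202819", "AC202820", "AC202822", "AC202823", "AC202824",
        "AC202825", "AC202827", "AC202829", "AC202830", "AC202831",
    ],
    "gbpri2.seq.gz": [
        "AC002116", "AC003107", "AC004143", "AC004151",
        "AC004472", "AC004602", "AC004770",
    ],
    "gbpri3.seq.gz": ["AC005175", "AC005306"],
    "gbpri24.seq.gz": ["AF037222"],
    "gbpri56.seq.gz": ["U95626"],
}


def affected_file(accession):
    """Return the release file listing `accession`, or None if it is not
    among the affected records. A version suffix such as '.1' is ignored."""
    base = accession.split(".")[0].upper()
    for filename, accessions in AFFECTED.items():
        if base in accessions:
            return filename
    return None


# Sanity check against the total stated in this message.
total_affected = sum(len(accessions) for accessions in AFFECTED.values())
```

Here total_affected comes to 21, matching the number of affected records reported above.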