[Bioperl-l] sets of sequences - how to read?

4 views
Skip to first unread message

Carnë Draug

unread,
May 15, 2013, 9:53:55 PM5/15/13
to bioperl mailing list
Hi

when accessing entrez gene using eutils to get multiple genes, NCBI
now returns an Entrezgene-Set[1] rather than a list of EntrezGene.
This change must have happened sometime on the last 2 months. Compare:

use Bio::DB::EUtilities;

my %sets = (
eutil => 'efetch',
db => 'gene',
retmode => 'text',
rettype => 'asn1',
email => 'biop...@lists.open-bio.org',
);

## this mimics the previous behaviour of the NCBI server but the
multiple requests will annoy their servers
my @ids = (3014, 85235);
my $response;
foreach (@ids) {
my $fetcher = Bio::DB::EUtilities->new(%sets, id => $_);
$response .= $fetcher->get_Response->content;
}
print $fetcher->get_Response->content;

## this used to be the right way to do it, but now returns an Entrezgene-Set
my $fetcher = Bio::DB::EUtilities->new(%sets, id => \@ids);
$response .= $fetcher->get_Response->content;
print $fetcher->get_Response->content;

There is no module to read these Entrezgene-Set in Perl at the moment,
since Bio::ASN1::EntrezGene; is not able to handle them. I have
contacted the module author and set him a fix[2] and he said he'll try
to look into it next week.

However, even with the fix there is another problem. How would one
access a set of sequences using the Bio::SeqIO API? There is no method
to do that. One could say, to ignore them, and make next_seq return
the next sequence of the set. But then we are losing data. After all,
it's perfectly viable to have multiple Entrezgene-Set in one file.
What would be the right way to do this?

Carnë

[1] http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/IEB/ToolBox/CPP_DOC/asn_spec/Entrezgene-Set.html
[2] https://github.com/carandraug/bio-asn1-entrezgene/commit/69d505056d8b7897df6271ffb7a5f39d58873c6b

_______________________________________________
Bioperl-l mailing list
Biop...@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Fields, Christopher J

unread,
May 17, 2013, 12:08:04 AM5/17/13
to Carnë Draug, bioperl mailing list
This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... <looks at watch>... now.

My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1.

chris

Carnë Draug

unread,
May 17, 2013, 1:12:24 AM5/17/13
to Fields, Christopher J, bioperl mailing list
On 17 May 2013 05:08, Fields, Christopher J <cjfi...@illinois.edu> wrote:
> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... <looks at watch>... now.
>
> My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1.

:s I'm not sure I understood your suggestion. I think the problem is
just the introduction of a new concept, a "set" of stuff (genes in
this case), and how should SeqIO handle multiple sets.

Carnë

Fields, Christopher J

unread,
May 17, 2013, 1:37:53 PM5/17/13
to Carnë Draug, bioperl mailing list
On May 17, 2013, at 12:12 AM, Carnë Draug <carandr...@gmail.com>
wrote:

> On 17 May 2013 05:08, Fields, Christopher J <cjfi...@illinois.edu> wrote:
>> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... <looks at watch>... now.
>>
>> My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1.
>
> :s I'm not sure I understood your suggestion. I think the problem is
> just the introduction of a new concept, a "set" of stuff (genes in
> this case), and how should SeqIO handle multiple sets.
>
> Carnë


(note: critical point in this is Bio::ASN1::Entrezgene would allow this, I'm not sure it would. Otherwise this is all really hand-wavy)

To me a 'set of stuff', particularly when the 'stuff' is stored sequentially in a flat file, is a simple 'database' or 'store' of similar items, where the class allows one the ability to look up particular members in the set, but also could store higher level information about the set as a whole if needed. If it were me, I would implement a method particular to Bio::SeqIO::entrezgene that specifically creates and returns this ( next_geneset(), for instance ); next_seq() could then be implemented to iterate through the items in that database/store.

Two useful things come out of this. First, if the data for the Entrez Gene file/chunk are parsed to store offsets per ID, one would only need to parse out the chunks needed (offset of ID to next offset), then pass that into the parser and create objects on the fly. This would probably be as fast or faster than (for instance) the greedy method of parsing the entire file and storing everything in objects up-front, then iterating through those objects one at a time, which I think is current behavior.

Second: if an index is created, the upfront cost is already paid (you could reuse the same index when parsing the same data).

An analogous example might be storing all FASTQ data in a sequencing run; I don't want to expend the effort to parse all the FASTQ data, but I may want to run operations on individual items in the set as well as store additional information about the data (barcodes per run, lanes, overall quality stats, etc).

Does that make sense? The pieces for this are lying around (Bio::Index::* for instance has methods for indexing flat files, and classes like Bio::DB::Fasta).

chris

Carnë Draug

unread,
Jun 17, 2013, 3:37:43 AM6/17/13
to Fields, Christopher J, bioperl mailing list
On 17 May 2013 05:08, Fields, Christopher J <cjfi...@illinois.edu> wrote:
> On May 15, 2013, at 8:53 PM, Carnë Draug <carandr...@gmail.com> wrote:
>> Hi
>>
>> when accessing entrez gene using eutils to get multiple genes, NCBI
>> now returns an Entrezgene-Set[1] rather than a list of EntrezGene.
>> This change must have happened sometime on the last 2 months.
>>
>> [...]
>>
>> Carnë
>>
>> [1] http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/IEB/ToolBox/CPP_DOC/asn_spec/Entrezgene-Set.html
>
> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... <looks at watch>... now.

Hi,

for those interested, I have contacted NCBI about this and they have
reverted the change (see conversation below). Still, entrezgene-set is
a thing so the issue of reading such things still exists.

Carnë


---------- Forwarded message ----------
Date: 17 May 2013 00:36
Subject: Entrezegene-Set: recent changes to E-utilities

Hi

I believe there was a recent change to the E-utilities service. When
fetching multiple ASN1 entrezegene records from the gene database, it
now returns an Entrezgene-Set instead of the typical list of
Entrezgene records, one after the other.

For example, here's an example Entrezgene-Set:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=3014,85235&rettype=asn1&retmode=text

which used to be the same as a concatenation of:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=3014&rettype=asn1&retmode=text
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=85235&rettype=asn1&retmode=text

This is something new. I don't know exactly when it was introduced but
must have been sometime in the last 2 months.

I don't know about other programming languages, but at least in Perl
there is no module able to parse this files. I have already contacted
the author of the module responsible for reading the non-set
Entrezgene with a patch but who knows when it will made available. The
only workaround is to make multiple requests, one for each UID, which
will obviously annoy your servers.

As far as I am aware, there was no notification of this change to
E-utilities, which worked fine for many years. We did have a lot of
code that worked fine for years, until it started to fail last month.
And no one using perl will be able to parse them until a fix is
released. Is there anyway this change can be reverted?


---------- Forwarded message ----------
Date: 23 May 2013 04:53
Subject: Re: Entrezegene-Set: recent changes to E-utilities

Thanks very much for your report. I will discuss this with the Gene
development team to see why this change occurred and get back to you.
Out of curiosity, have you considered using the XML format for Gene
(&retmode=xml)? There are a variety of XML parsers for Perl that should be
able to read Gene XML.


---------- Forwarded message ----------
Date: 24 May 2013 13:01
Subject: Re: Entrezegene-Set: recent changes to E-utilities

thank you for looking into this.

While there are several XML parsers for perl, there is not one that
will return a Bio::Seq object (a Bio::SeqIO compliant). Of course I
could use one of the XML parsers to write write my own but then I
could as well fix the entrezgene parser to deal with Entrezgene-sets
which is what I'm doing. I already proposed a patch to them but the
inclusion of a new concept, of a set of sequences, does not really fit
in the design of Bio::Seq.

Please do let me know of more news on this. Thank you again,


---------- Forwarded message ----------
Date: 13 June 2013 22:08
Subject: Re: Entrezegene-Set: recent changes to E-utilities

The fix for this should now be live. Let us know if you have further
problems with this.

Regards,

Fields, Christopher J

unread,
Jun 17, 2013, 8:48:56 AM6/17/13
to Carnë Draug, bioperl mailing list
The best thing to do in this case is to try contacting the author for Bio::ASN1::EntrezGene to see if the code can be updated. He indicated interest in putting the code on github and giving BIOPERLML co-maint last time I heard; we could easily do that, but I'm not sure if it is currently hosted anywhere else.

chris
Reply all
Reply to author
Forward
0 new messages