Has anyone tried implementing Storage on GAE using the Blobstore (files) API to avoid the 1MB limit? Any idea if it's possible, and how it would affect search performance?
Background - We are deep in a project where we are forced by the client to use AppEngine. We need to upload and index a data set of ~7,000 very small documents. It will be done once in a few months, so the cost and performance of indexing are not a big issue.
However, we get stuck after a while on the 1MB limit. Is that normal for that number of documents?
Would love to hear anyone's experience with that. Thanks, H
> Has anyone tried implementing Storage on GAE using the Blobstore (files)
> API to avoid the 1MB limit?
> Any idea if it's possible and how would it affect search performance?
I haven't tried it, but it should work. Whoosh uses a write-once, read-many design so in theory it should play well with Blobstore. I have no idea how it would affect performance, but it should be a straightforward change to one file so at least you could try it and see.
The biggest issue with this is that I don't use or know AppEngine and don't have much interest in it, so help from someone would get this done a lot faster. If you have experience with AppEngine and Blobstore and can help me update whoosh.filedb.gae to use Blobstore, let me know.
(A more ambitious project would be to use the codec API in Whoosh 3 to actually store the index data in the AppEngine database directly instead of faking files, but that would require more help from someone with much more interest in and knowledge of AppEngine than I :)
Thanks for the answer, Matt. I am no big expert on GAE or Whoosh, but I do have a big interest in making this work. So I guess I am that person... I will hopefully post soon about my progress.
But as for the other half of my question: does it sound right to you that I am hitting the 1MB limit after indexing such a small data set? It currently happens after around 1,500 extremely short documents.
> Thanks for the answer, Matt.
> I am no big expert on GAE or Whoosh, but I do have a big interest in
> making this work.
> So I guess I am that person...
Cool! Please clone/pull the latest repo and take a look at the current BlobProperty implementation in src/whoosh/filedb/gae.py.
The current implementation loads the entire property into memory using BytesIO. Hopefully the Blobstore API provides a more file-like access.
I've added docstrings for the base Storage class in src/whoosh/filedb/filestore.py, so hopefully it's fairly straightforward. Anything you need to know about the Whoosh side, let me know.
Also, your new storage class should have the following class attribute:
supports_mmap = False
> But as for the other half of my question:
> does it sound right to you that I am hitting the 1MB limit after indexing such a small data set?
> It currently happens after around 1,500 extremely short documents.
It sounds odd, but I'd have to try the data to see. It might have something to do with the new "compound segment" format, where Whoosh writes separate files and then combines them into a single file, then deletes the original files. If you're using the default branch of the repo, you can try opening a writer with the compound=False keyword arg to prevent this, e.g.:
I have already spent some time on it. I haven't seen the docstrings but
managed to get the general idea.
Things were looking promising until I hit a problem:
the files API only supports the 'r' and 'a' file modes. No 'w'.
So although the _File class provides methods like seek() and tell(), they
only work in read mode.
My code failed in HashWriter.__init__(), and then I gave up on this
direction since I figured it might fail in other cases as well.
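To illustrate the problem described above: when a stream can only be appended to, one possible workaround for the missing tell() is to count the written bytes locally. This is only a minimal sketch. The class name AppendOnlyWriter is hypothetical and is not part of Whoosh or the App Engine files API, and io.BytesIO merely stands in for an append-mode file so the example runs. It does not help with code that genuinely needs to seek backwards while writing.

```python
import io


class AppendOnlyWriter(object):
    """Hypothetical wrapper: tracks the write offset itself, so tell()
    works even when the wrapped stream only supports write()."""

    def __init__(self, stream):
        self._stream = stream
        self._pos = 0

    def write(self, data):
        self._stream.write(data)
        self._pos += len(data)

    def tell(self):
        # Answered locally; the wrapped stream is never asked.
        return self._pos

    def seek(self, offset, whence=0):
        # Only a no-op seek to the current position can be honoured.
        if whence == 0 and offset == self._pos:
            return
        raise io.UnsupportedOperation("append-only stream cannot seek")


# Stand-in for an append-mode file; BytesIO is used only so the sketch runs.
out = AppendOnlyWriter(io.BytesIO())
out.write(b"header")
out.write(b"payload")
```

After the two writes, out.tell() reports the combined length of both pieces, while any attempt to seek elsewhere raises an error instead of silently misbehaving.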
What I want to try next is something more similar to the existing GAE
implementation, i.e. keep using datastore entities, but store the keys of
blob files instead of the blob values.
But first I'll try the compound=False suggestion just to see where it gets
me.
Stay tuned...
On Thu, Sep 13, 2012 at 8:04 PM, Guido van Rossum <gu...@python.org> wrote:
> Actually the files API also supports writing files, however you have
> to use a different API to open the file. Check out the create() call
> here:
Oh :)
I have looked at it but there's not much info there...
And I did use that API, but in the inline docs I saw I can only use "r" or
"a".
So you are saying I can also pass "w" and it will work?
On Thursday, September 13, 2012, Guido van Rossum <gu...@python.org> wrote:
> No. Read the example code I pointed to. Follow it exactly.
On Thu, Sep 13, 2012 at 1:32 PM, Hovav Oppenheim <hov...@gmail.com> wrote:
> Well I have... But that example shows exactly my problem. The only write
> mode available is 'a' and then seek() and tell() throw errors.
> I guess until then the only solution is to buffer the entire thing in
> memory before writing...
> On Thu, Sep 13, 2012 at 1:55 PM, Matt Chaput <m...@whoosh.ca> wrote:
> > On 13/09/2012 4:39 PM, Guido van Rossum wrote:
> >> Ah. Yes. You can only append. Also, each write() call becomes a
> >> separate API request so it's best to write large chunks. (256 KB works
> >> well.)
> > I'll see if I can change the on-disk formats in the Whoosh 3.0 codec to
> > not require seeking while writing. My instinct is this should be possible.
> Thanks Matt and Guido!
> But the writes will be a lot bigger than 256KB.
> My original problem was the 1MB limit on Datastore puts.
> Will I hit a limit with the files API too?
I'm pretty sure he just meant that ideally we should buffer writes in blocks of 256KB. We can do that after we've got the basic API calls working.
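As a rough illustration of what buffering writes in 256KB blocks could look like, here is a minimal sketch. The ChunkedWriter class and its calls counter are hypothetical names made up for this example, and io.BytesIO stands in for whatever stream the files API would provide. The only assumption about the underlying stream is that it has a write() method.

```python
import io

CHUNK_SIZE = 256 * 1024  # the 256 KB figure mentioned above


class ChunkedWriter(object):
    """Hypothetical helper: collects small writes and forwards them to
    the wrapped stream in CHUNK_SIZE blocks."""

    def __init__(self, stream, chunk_size=CHUNK_SIZE):
        self._stream = stream
        self._chunk_size = chunk_size
        self._buffer = bytearray()
        self.calls = 0  # underlying write() calls, kept for demonstration

    def write(self, data):
        self._buffer.extend(data)
        # Forward every complete block that has accumulated so far.
        while len(self._buffer) >= self._chunk_size:
            self._emit(self._chunk_size)

    def _emit(self, size):
        self._stream.write(bytes(self._buffer[:size]))
        del self._buffer[:size]
        self.calls += 1

    def close(self):
        # Flush whatever is left, which may be shorter than one block.
        if self._buffer:
            self._emit(len(self._buffer))


sink = io.BytesIO()
writer = ChunkedWriter(sink)
for _ in range(3000):
    writer.write(b"x" * 1000)  # 3,000,000 bytes written in small pieces
writer.close()
```

Here 3,000 small writes collapse into a handful of large ones, which matters if every underlying write() really does become a separate API request.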
You shouldn't worry about the seeking problem, I'm mostly finished fixing my local copy to use only serial writes. You might want to focus just on getting your new storage class to work in isolation for now (that is, able to write data >1MB and read it back out again), until I check in my changes.
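For the "write data >1MB and read it back out again" check, something along these lines might serve as a starting point. It is a sketch under stated assumptions, not the GAE or Whoosh API: ChunkedStore, write_file, read_file, record_sizes and the file name "segment.seg" are all hypothetical, a plain dict stands in for the real backend, and the per-record cap is an assumed value chosen to stay under 1MB.

```python
MAX_RECORD_BYTES = 1000 * 1000  # assumed per-record cap, kept under 1 MB


class ChunkedStore(object):
    """Hypothetical in-memory stand-in for a backend with a per-record
    size cap: each file is split across several numbered records."""

    def __init__(self, max_bytes=MAX_RECORD_BYTES):
        self._max = max_bytes
        self._records = {}  # (name, index) -> bytes
        self._counts = {}   # name -> number of records

    def write_file(self, name, data):
        # Drop records left over from an earlier, longer version.
        for i in range(self._counts.get(name, 0)):
            del self._records[(name, i)]
        parts = [data[i:i + self._max]
                 for i in range(0, len(data), self._max)] or [b""]
        for index, part in enumerate(parts):
            self._records[(name, index)] = part
        self._counts[name] = len(parts)

    def read_file(self, name):
        return b"".join(self._records[(name, i)]
                        for i in range(self._counts[name]))

    def record_sizes(self, name):
        return [len(self._records[(name, i)])
                for i in range(self._counts[name])]


store = ChunkedStore()
payload = bytes(range(256)) * 10000  # 2,560,000 bytes, well over the cap
store.write_file("segment.seg", payload)
restored = store.read_file("segment.seg")
```

Comparing restored with payload, and checking that no entry in record_sizes() exceeds the cap, gives a quick round-trip test that does not depend on the indexing code at all.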
Sorry for disappearing for such a long time. I decided to hold off on this experiment since compound=False did the job, and I also figured that keeping a big file on GAE that needs to load on every instance invocation wouldn't be a good direction to explore.
Anyhow, now I'm at a dead end again... I'm adding more data, and I now reach the 1MB hard limit on some of the blobs. Is there any way in the lib to tell the segment writer not to reach that limit and create new files instead?
Thanks a lot, Hovav
BTW I'm really hoping to get to experimenting with the files API so we know for sure how feasible it is. It's just those goddamn deadlines.... :)