GAE using Blobstore API?

hovavo

Sep 9, 2012, 3:55:41 AM
to who...@googlegroups.com
Has anyone tried implementing Storage on GAE using the Blobstore (files) API to avoid the 1MB limit? 
Any idea whether it's possible, and how it would affect search performance?

Background -
We are deep in a project where we are forced by the client to use AppEngine.
We need to upload and index a data set of ~7,000 very small documents. 
This will be done once every few months, so the cost and performance of indexing are not a big issue.

However, after a while we get stuck on the 1MB limit. Is that normal for that number of documents?

Would love to hear anyone's experience with that.
Thanks
H

Matt Chaput

Sep 10, 2012, 2:04:22 PM
to who...@googlegroups.com
On 09/09/2012 3:55 AM, hovavo wrote:
> Has anyone tried implementing Storage on GAE using the Blobstore (files)
> API to avoid the 1MB limit?
> Any idea whether it's possible, and how it would affect search performance?

I haven't tried it, but it should work. Whoosh uses a write-once,
read-many design so in theory it should play well with Blobstore. I have
no idea how it would affect performance, but it should be a
straightforward change to one file so at least you could try it and see.

The biggest issue with this is that I don't use or know AppEngine and
don't have much interest in it, so help from someone would get this done
a lot faster. If you have experience with AppEngine and Blobstore and
can help me update whoosh.filedb.gae to use Blobstore, let me know.

(A more ambitious project would be to use the codec API in Whoosh 3 to
actually store the index data in the AppEngine database directly instead
of faking files, but that would require more help from someone with much
more interest in and knowledge of AppEngine than I :)

Cheers,

Matt

hovavo

Sep 11, 2012, 3:30:55 AM
to who...@googlegroups.com
Thanks for the answer, Matt.
I am no big expert on GAE or Whoosh, but I do have a big interest in making this work.
So I guess I am that person...
I will hopefully post soon about my progress.

But as for the other half of my question:
Does it sound right to you that I am hitting the 1MB limit after indexing such a small data set?
It currently happens after around 1500 extremely short documents.

Thanks
H

Matt Chaput

Sep 12, 2012, 4:59:49 PM
to who...@googlegroups.com
On 11/09/2012 3:30 AM, hovavo wrote:
> Thanks for the answer, Matt.
> I am no big expert on GAE or Whoosh, but I do have a big interest in
> making this work.
> So I guess I am that person...

Cool! Please clone/pull the latest repo and take a look at the current
BlobProperty implementation in src/whoosh/filedb/gae.py.

The current implementation loads the entire property into memory using
BytesIO. Hopefully the Blobstore API provides a more file-like access.

I've added docstrings for the base Storage class in
src/whoosh/filedb/filestore.py, so hopefully it's fairly
straightforward. Anything you need to know about the Whoosh side, let me
know.

Also, your new storage class should have the following class attribute:

supports_mmap = False

> But as for the other half of my question:
> Does it sound right to you that I am hitting the 1MB limit after indexing such a small data set?
> It currently happens after around 1500 extremely short documents.

It sounds odd, but I'd have to try the data to see. It might have
something to do with the new "compound segment" format, where Whoosh
writes separate files and then combines them into a single file, then
deletes the original files. If you're using the default branch of the
repo, you can try opening a writer with the compound=False keyword arg
to prevent this, e.g.:

w = myindex.writer(compound=False)

Cheers,

Matt

Hovav Oppenheim

Sep 13, 2012, 12:46:44 PM
to who...@googlegroups.com
I already spent some time on it. I haven't seen the docstrings, but I managed to get the general idea.
Things were looking promising until I hit a problem:
The files API only supports the 'r' and 'a' file modes. There is no 'w'.
So although the _File class provides methods like seek() and tell(), they only work in read mode.
My code failed in HashWriter.__init__(), and then I gave up on this direction, since I figured it might fail in other cases as well.

What I want to try next is something more similar to the existing GAE implementation, i.e., keep using datastore entities, but store the keys of blob files instead of the blob values.

But first I'll try the compound=False suggestion, just to see where it gets me.
Stay tuned...
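
The datastore-plus-blob-key idea above could look roughly like the sketch below. It is only an illustration under stated assumptions: FakeBlobs and BlobKeyIndex are hypothetical names, plain dictionaries stand in for both the blob service and the datastore, and none of this is the App Engine or Whoosh API.

```python
import uuid


class FakeBlobs(object):
    """Hypothetical stand-in for a write-once blob service."""

    def __init__(self):
        self._data = {}

    def create(self, data):
        # Each blob is written once and addressed by an opaque key.
        key = uuid.uuid4().hex
        self._data[key] = bytes(data)
        return key

    def read(self, key):
        return self._data[key]

    def delete(self, key):
        del self._data[key]


class BlobKeyIndex(object):
    """Maps file names to blob keys, as a small datastore entity might."""

    def __init__(self, blobs):
        self._blobs = blobs
        self._keys = {}  # file name -> blob key (all the "entity" stores)

    def save(self, name, data):
        # "Overwriting" means creating a new blob and re-pointing the key.
        old = self._keys.get(name)
        self._keys[name] = self._blobs.create(data)
        if old is not None:
            self._blobs.delete(old)

    def load(self, name):
        return self._blobs.read(self._keys[name])

    def entity_size(self, name):
        # Size of what the entity holds: the key, not the file contents.
        return len(self._keys[name])


blobs = FakeBlobs()
index = BlobKeyIndex(blobs)
index.save("segment.seg", b"\0" * (3 * 1024 * 1024))  # well over 1MB
```

The point of the indirection is that the entity only ever holds a short key, so its size does not grow with the file it points to.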



--
···································
Hovav Oppenheim
Bacon Oppenheim Ltd
M: 052-3834118
···································


Guido van Rossum

Sep 13, 2012, 1:04:15 PM
to who...@googlegroups.com
Actually the files API also supports writing files, however you have
to use a different API to open the file. Check out the create() call
here: https://developers.google.com/appengine/docs/python/blobstore/overview#Writing_Files_to_the_Blobstore

--
--Guido van Rossum (python.org/~guido)

Hovav Oppenheim

Sep 13, 2012, 1:14:24 PM
to who...@googlegroups.com
Oh :)
I have looked at it, but there's not much info there...
And I did use that API, but in the inline docs I saw that I can only use "r" or "a".
So are you saying I can also pass "w" and it will work?

Guido van Rossum

Sep 13, 2012, 2:15:10 PM
to who...@googlegroups.com
No. Read the example code I pointed to. Follow it exactly.

Hovav Oppenheim

Sep 13, 2012, 4:32:57 PM
to who...@googlegroups.com
Well I have... But that example shows exactly my problem. The only write mode available is 'a' and then seek() and tell() throw errors.

Guido van Rossum

Sep 13, 2012, 4:39:33 PM
to who...@googlegroups.com
Ah. Yes. You can only append. Also, each write() call becomes a
separate API request so it's best to write large chunks. (256 KB works
well.)
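
A minimal sketch of that chunked, append-only pattern, assuming only that the underlying file object exposes an append-style write(). ChunkedAppender and CountingSink are hypothetical names; CountingSink is a stand-in that records the size of each call, not the App Engine files API.

```python
CHUNK_SIZE = 256 * 1024  # 256 KB, the chunk size suggested above


class ChunkedAppender(object):
    """Collect small writes and pass them on in chunk_size pieces."""

    def __init__(self, sink, chunk_size=CHUNK_SIZE):
        self._sink = sink  # any object with an append-style write(bytes)
        self._chunk_size = chunk_size
        self._pending = bytearray()

    def write(self, data):
        self._pending.extend(data)
        # Emit only full chunks; keep the remainder for the next call.
        while len(self._pending) >= self._chunk_size:
            self._sink.write(bytes(self._pending[:self._chunk_size]))
            del self._pending[:self._chunk_size]

    def close(self):
        # Flush whatever is left as one final, possibly short, chunk.
        if self._pending:
            self._sink.write(bytes(self._pending))
            self._pending = bytearray()


class CountingSink(object):
    """Hypothetical append-only file: records the size of each write()."""

    def __init__(self):
        self.calls = []

    def write(self, data):
        self.calls.append(len(data))


sink = CountingSink()
out = ChunkedAppender(sink)
for _ in range(10000):
    out.write(b"x" * 100)  # 1,000,000 bytes in 10,000 small writes
out.close()
print(len(sink.calls))  # number of write() calls that reached the sink
```

With this wrapper, the number of calls that reach the sink depends on the total size rather than on how many small writes the caller makes.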

Matt Chaput

Sep 13, 2012, 4:55:16 PM
to who...@googlegroups.com
On 13/09/2012 4:39 PM, Guido van Rossum wrote:
> Ah. Yes. You can only append. Also, each write() call becomes a
> separate API request so it's best to write large chunks. (256 KB works
> well.)

I'll see if I can change the on-disk formats in the Whoosh 3.0 codec to
not require seeking while writing. My instinct is this should be possible.

Matt

Guido van Rossum

Sep 13, 2012, 5:00:16 PM
to who...@googlegroups.com
I guess until then the only solution is to buffer the entire thing in
memory before writing...
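
One way to sketch that buffering workaround: keep the whole file in an io.BytesIO so that seek() and tell() work, and only append to the real file on close(). BufferedSeekableWriter is a hypothetical name, and the sink here is a plain BytesIO standing in for an append-only blob file.

```python
import io
import struct


class BufferedSeekableWriter(object):
    """File-like writer that supports seek()/tell() by buffering in memory.

    Nothing reaches the sink until close() is called.
    """

    def __init__(self, sink, chunk_size=256 * 1024):
        self._sink = sink  # any object with an append-style write(bytes)
        self._chunk_size = chunk_size
        self._buffer = io.BytesIO()

    def write(self, data):
        return self._buffer.write(data)

    def seek(self, offset, whence=io.SEEK_SET):
        return self._buffer.seek(offset, whence)

    def tell(self):
        return self._buffer.tell()

    def close(self):
        # Append the finished contents to the sink in sequential chunks.
        data = self._buffer.getvalue()
        for start in range(0, len(data), self._chunk_size):
            self._sink.write(data[start:start + self._chunk_size])
        self._buffer.close()


sink = io.BytesIO()  # stand-in for an append-only blob file
w = BufferedSeekableWriter(sink)
w.write(struct.pack("!I", 0))        # placeholder for the payload length
w.write(b"payload")
end = w.tell()
w.seek(0)
w.write(struct.pack("!I", end - 4))  # go back and patch the header
w.close()
```

The trade-off is memory: the whole file has to fit in RAM until close().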

Hovav Oppenheim

Sep 14, 2012, 12:46:38 AM
to who...@googlegroups.com
Thanks Matt and Guido!
But the writes will be a lot bigger than 256KB. 
My original problem was the 1MB limit on Datastore puts. 

Will I hit a limit with the files API too?


Matt Chaput

Sep 14, 2012, 12:52:48 AM
to who...@googlegroups.com
> Thanks Matt and Guido!
> But the writes will be a lot bigger than 256KB.
> My original problem was the 1MB limit on Datastore puts.
>
> Will I hit a limit with the files API too?

I'm pretty sure he just meant that ideally we should buffer writes in blocks of 256KB. We can do that after we've got the basic API calls working.

You shouldn't worry about the seeking problem; I'm mostly finished fixing my local copy to use only serial writes. You might want to focus just on getting your new storage class to work in isolation for now (that is, able to write data >1MB and read it back out again), until I check in my changes.
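
A rough sketch of such an isolation check follows. InMemoryStorage is a hypothetical stand-in, and its method names (create_file, open_file, file_length) are an assumption modeled on the base Storage class mentioned earlier in the thread; a real Blobstore-backed class would replace it.

```python
import io
import os


class InMemoryStorage(object):
    """Hypothetical storage backend that keeps files in a dictionary."""

    def __init__(self):
        self._files = {}

    def create_file(self, name):
        storage = self

        class _Writer(io.BytesIO):
            def close(self):
                # Publish the bytes under `name` once writing finishes.
                if not self.closed:
                    storage._files[name] = self.getvalue()
                io.BytesIO.close(self)

        return _Writer()

    def open_file(self, name):
        return io.BytesIO(self._files[name])

    def file_length(self, name):
        return len(self._files[name])


def check_round_trip(storage, name="roundtrip.bin", size=2 * 1024 * 1024):
    """Write `size` random bytes (2MB by default) and read them back."""
    payload = os.urandom(size)
    f = storage.create_file(name)
    f.write(payload)
    f.close()
    g = storage.open_file(name)
    try:
        return g.read() == payload
    finally:
        g.close()


print(check_round_trip(InMemoryStorage()))
```

Running the same check_round_trip() against the new storage class would exercise exactly the ">1MB in, same bytes out" requirement, independently of the indexing code.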

Thanks!

Matt

Hovav Oppenheim

Sep 14, 2012, 12:55:15 AM
to who...@googlegroups.com
Ok thanks a lot Matt. Will do.


Matt Chaput

Sep 14, 2012, 3:32:39 PM
to who...@googlegroups.com
On 14/09/2012 12:55 AM, Hovav Oppenheim wrote:
> Ok thanks a lot Matt. Will do.

Just pushed my changes to bitbucket.

Cheers,

Matt

hovavo

Dec 5, 2012, 10:57:18 AM
to who...@googlegroups.com
Sorry for disappearing for such a long time.
I decided to hold off on this experiment, since compound=False did the job,
and I also figured that keeping a big file on GAE that needs to load on every instance invocation wouldn't be a good direction to explore.

Anyhow, now I'm at a dead end again...
I'm adding more data, and I now reach the 1MB hard limit on some of the blobs.
Is there any way in the lib to tell the segment writer not to reach that limit and to create new files instead?

Thanks a lot,
Hovav

BTW I'm really hoping to get to experimenting with the file API so we know for sure how feasible it is.
It's just those goddamn deadlines.... :)