GAE using Blobstore API?
17 messages
From: hovavo <hov...@gmail.com>
Date: Sun, 9 Sep 2012 00:55:41 -0700 (PDT)
Local: Sun, Sep 9 2012 3:55 am
Subject: GAE using Blobstore API?

Has anyone tried implementing Storage on GAE using the Blobstore (files)
API to avoid the 1MB limit?
Any idea if it's possible and how would it affect search performance?

Background -
We are deep in a project where we are forced by the client to use AppEngine.
We need to upload and index a data set of ~7,000 very small documents.
This will be done once every few months, so the cost and performance of
indexing are not a big issue.

However, after a while we hit the 1MB limit. Is that normal for
that number of documents?

Would love to hear anyone's experience with that.
Thanks
H


 
From: Matt Chaput <m...@whoosh.ca>
Date: Mon, 10 Sep 2012 14:04:22 -0400
Local: Mon, Sep 10 2012 2:04 pm
Subject: Re: [Whoosh] GAE using Blobstore API?
On 09/09/2012 3:55 AM, hovavo wrote:

> Has anyone tried implementing Storage on GAE using the Blobstore (files)
> API to avoid the 1MB limit?
> Any idea if it's possible and how would it affect search performance?

I haven't tried it, but it should work. Whoosh uses a write-once,
read-many design so in theory it should play well with Blobstore. I have
no idea how it would affect performance, but it should be a
straightforward change to one file so at least you could try it and see.

The biggest issue with this is that I don't use or know AppEngine and
don't have much interest in it, so help from someone would get this done
a lot faster. If you have experience with AppEngine and Blobstore and
can help me update whoosh.filedb.gae to use Blobstore, let me know.

(A more ambitious project would be to use the codec API in Whoosh 3 to
actually store the index data in the AppEngine database directly instead
of faking files, but that would require more help from someone with much
more interest in and knowledge of AppEngine than I :)

Cheers,

Matt


 
From: hovavo <hov...@gmail.com>
Date: Tue, 11 Sep 2012 00:30:55 -0700 (PDT)
Local: Tues, Sep 11 2012 3:30 am
Subject: Re: GAE using Blobstore API?

Thanks for the answer Matt.
I am no big expert on GAE or Whoosh, but I do have a big interest in making
this work.
So I guess I am that person...
Will hopefully post soon about my progress.

But as for the other half of my question -
Does it sound right to you that I am hitting the 1MB limit after indexing
such a small data set?
It currently happens after around 1500 extremely short documents.

Thanks
H


 
From: Matt Chaput <m...@whoosh.ca>
Date: Wed, 12 Sep 2012 16:59:49 -0400
Local: Wed, Sep 12 2012 4:59 pm
Subject: Re: [Whoosh] Re: GAE using Blobstore API?
On 11/09/2012 3:30 AM, hovavo wrote:

> Thanks for the answer Matt.
> I am no big expert on GAE or Whoosh, but I do have a big interest in
> making this work.
> So I guess I am that person...

Cool! Please clone/pull the latest repo and take a look at the current
BlobProperty implementation in src/whoosh/filedb/gae.py.

The current implementation loads the entire property into memory using
BytesIO. Hopefully the Blobstore API provides a more file-like access.

I've added docstrings for the base Storage class in
src/whoosh/filedb/filestore.py, so hopefully it's fairly
straightforward. Anything you need to know about the Whoosh side, let me
know.

Also, your new storage class should have the following class attribute:

   supports_mmap = False

> But as for the other half of my question -
> Does it sound right to you that I am hitting the 1MB limit after indexing such a small data set?
> It currently happens after around 1500 extremely short documents.

It sounds odd, but I'd have to try the data to see. It might have
something to do with the new "compound segment" format, where Whoosh
writes separate files and then combines them into a single file, then
deletes the original files. If you're using the default branch of the
repo, you can try opening a writer with the compound=False keyword arg
to prevent this, e.g.:

   w = myindex.writer(compound=False)

Cheers,

Matt


 
From: Hovav Oppenheim <hov...@gmail.com>
Date: Thu, 13 Sep 2012 19:46:44 +0300
Local: Thurs, Sep 13 2012 12:46 pm
Subject: Re: [Whoosh] Re: GAE using Blobstore API?

I already spent some time on it. Haven't seen the docstrings but managed
to get the general idea.
Things were looking promising until I hit a problem -
The files API only supports 'r' and 'a' file modes. No 'w'.
So although the _File class provides methods like seek() and tell(), they
only work in read mode.
My code failed in HashWriter.__init__(), and then I gave up on this
direction since I figured it may fail in other cases as well.

What I want to try next is something more similar to the existing gae
implementation, i.e. keep using datastore entities, but store keys of blob
files instead of the blob value.

But first I'll try the compound=False suggestion just to see where it gets
me.
Stay tuned...

--
···································
Hovav Oppenheim
Bacon Oppenheim Ltd <http://www.baconoppenheim.com/>
M: 052-3834118
E: hov...@gmail.com
···································

 
From: Guido van Rossum <gu...@python.org>
Date: Thu, 13 Sep 2012 10:04:15 -0700
Local: Thurs, Sep 13 2012 1:04 pm
Subject: Re: [Whoosh] Re: GAE using Blobstore API?
Actually the files API also supports writing files, however you have
to use a different API to open the file. Check out the create() call
here: https://developers.google.com/appengine/docs/python/blobstore/overvie...

--
--Guido van Rossum (python.org/~guido)

 
From: Hovav Oppenheim <hov...@gmail.com>
Date: Thu, 13 Sep 2012 20:14:24 +0300
Local: Thurs, Sep 13 2012 1:14 pm
Subject: Re: [Whoosh] Re: GAE using Blobstore API?

Oh :)
I have looked at it but there's not much info there...
And I did use that API, but in the inline docs I saw I can only use "r" or
"a".
So you are saying I can also pass "w" and it will work?

On Thu, Sep 13, 2012 at 8:04 PM, Guido van Rossum <gu...@python.org> wrote:


 
From: Guido van Rossum <gu...@python.org>
Date: Thu, 13 Sep 2012 11:15:10 -0700
Local: Thurs, Sep 13 2012 2:15 pm
Subject: Re: [Whoosh] Re: GAE using Blobstore API?
No. Read the example code I pointed to. Follow it exactly.

--
--Guido van Rossum (python.org/~guido)

 
From: Hovav Oppenheim <hov...@gmail.com>
Date: Thu, 13 Sep 2012 23:32:57 +0300
Local: Thurs, Sep 13 2012 4:32 pm
Subject: Re: [Whoosh] GAE using Blobstore API?

Well I have... But that example shows exactly my problem. The only write
mode available is 'a' and then seek() and tell() throw errors.

On Thursday, September 13, 2012, Guido van Rossum <gu...@python.org> wrote:

https://developers.google.com/appengine/docs/python/blobstore/overvie...


 
From: Guido van Rossum <gu...@python.org>
Date: Thu, 13 Sep 2012 13:39:33 -0700
Local: Thurs, Sep 13 2012 4:39 pm
Subject: Re: [Whoosh] GAE using Blobstore API?
Ah. Yes. You can only append. Also, each write() call becomes a
separate API request so it's best to write large chunks. (256 KB works
well.)

--
--Guido van Rossum (python.org/~guido)
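
The advice above (append only, and send large chunks because each write() is a separate request) can be sketched as a small wrapper that batches many small writes into few large ones. This is only an illustration, not Whoosh or App Engine code: `ChunkedAppender` and the `sink` callable are hypothetical names, and `sink` stands in for whatever append call the real API provides. The 256 KB block size is the figure suggested above.

```python
class ChunkedAppender(object):
    """Collect small writes and pass them on in large blocks.

    `sink` is a hypothetical callable taking one bytes argument; it stands
    in for a single append request to the real storage API.
    """

    def __init__(self, sink, block_size=256 * 1024):
        self._sink = sink
        self._block_size = block_size
        self._pending = bytearray()

    def write(self, data):
        self._pending.extend(data)
        # Emit as many full blocks as are available.
        while len(self._pending) >= self._block_size:
            self._sink(bytes(self._pending[:self._block_size]))
            del self._pending[:self._block_size]

    def close(self):
        # Flush whatever is left, which may be shorter than one block.
        if self._pending:
            self._sink(bytes(self._pending))
            self._pending = bytearray()


# Usage: a list's append method plays the role of the storage API.
requests = []
out = ChunkedAppender(requests.append)
for _ in range(1000):
    out.write(b"x" * 1000)  # 1000 small writes, 1,000,000 bytes in total
out.close()
```

With these numbers the 1000 small writes reach the sink as four requests: three full 256 KB blocks and one shorter final block.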

 
From: Matt Chaput <m...@whoosh.ca>
Date: Thu, 13 Sep 2012 16:55:16 -0400
Local: Thurs, Sep 13 2012 4:55 pm
Subject: Re: [Whoosh] GAE using Blobstore API?
On 13/09/2012 4:39 PM, Guido van Rossum wrote:

> Ah. Yes. You can only append. Also, each write() call becomes a
> separate API request so it's best to write large chunks. (256 KB works
> well.)

I'll see if I can change the on-disk formats in the Whoosh 3.0 codec to
not require seeking while writing. My instinct is this should be possible.

Matt


 
From: Guido van Rossum <gu...@python.org>
Date: Thu, 13 Sep 2012 14:00:16 -0700
Local: Thurs, Sep 13 2012 5:00 pm
Subject: Re: [Whoosh] GAE using Blobstore API?
I guess until then the only solution is to buffer the entire thing in
memory before writing...

--
--Guido van Rossum (python.org/~guido)
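
The "buffer the entire thing in memory" workaround might look roughly like the sketch below: seek() and tell() operate on an in-memory buffer, and the finished bytes are sent to the append-only target only on close(). The names `BufferedFile` and `sink` are hypothetical and not part of Whoosh or App Engine; `sink` stands in for one append request. The cost is that the whole file is held in memory, as with the BytesIO approach mentioned earlier in the thread.

```python
import io
import struct


class BufferedFile(object):
    """Seekable write buffer that is sent to an append-only sink on close."""

    def __init__(self, sink, block_size=256 * 1024):
        self._sink = sink              # hypothetical append callable
        self._block_size = block_size
        self._buf = io.BytesIO()

    def write(self, data):
        return self._buf.write(data)

    def seek(self, pos, whence=0):
        return self._buf.seek(pos, whence)

    def tell(self):
        return self._buf.tell()

    def close(self):
        # Only now does any data leave memory, strictly front to back.
        data = self._buf.getvalue()
        for start in range(0, len(data), self._block_size):
            self._sink(data[start:start + self._block_size])
        self._buf.close()


# Usage: write a placeholder, then seek back and fill it in later.
chunks = []
f = BufferedFile(chunks.append)
f.write(b"\x00" * 4)                 # placeholder for a length field
f.write(b"payload")
end = f.tell()
f.seek(0)
f.write(struct.pack("!I", end - 4))  # patch in the payload length
f.close()
```

The sink never sees a seek: it receives the already patched bytes in order.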

 
From: Hovav Oppenheim <hov...@gmail.com>
Date: Fri, 14 Sep 2012 07:46:38 +0300
Local: Fri, Sep 14 2012 12:46 am
Subject: Re: [Whoosh] GAE using Blobstore API?

Thanks Matt and Guido!
But the writes will be a lot bigger than 256KB.
My original problem was the 1MB limit on Datastore puts.

Will I hit a limit with the files API too?

On Fri, Sep 14, 2012 at 12:00 AM, Guido van Rossum <gu...@python.org> wrote:


 
From: Matt Chaput <m...@whoosh.ca>
Date: Fri, 14 Sep 2012 00:52:48 -0400
Local: Fri, Sep 14 2012 12:52 am
Subject: Re: [Whoosh] GAE using Blobstore API?

> Thanks Matt and Guido!
> But the writes will be a lot bigger than 256KB.
> My original problem was the 1MB limit on Datastore puts.

> Will I hit a limit with the files API too?

I'm pretty sure he just meant that ideally we should buffer writes in blocks of 256KB. We can do that after we've got the basic API calls working.

You shouldn't worry about the seeking problem; I'm mostly finished fixing my local copy to use only serial writes. You might want to focus just on getting your new storage class to work in isolation for now (that is, able to write data >1MB and read it back out again), until I check in my changes.

Thanks!

Matt
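
One way to exercise a storage class in isolation, as suggested above, is a round-trip check: write more than 1MB, read it back, and compare. The sketch below is not a Whoosh storage class and does not call App Engine. A plain dict stands in for the datastore, and each file is split into parts below an assumed per-entity size limit. `ChunkedStore`, its methods, the key layout, and the part size are all hypothetical.

```python
class ChunkedStore(object):
    """Store named byte strings as several parts, each under a size limit."""

    def __init__(self, backend=None, part_size=1000 * 1000):
        # `backend` is a dict standing in for the real datastore.
        self._backend = {} if backend is None else backend
        self._part_size = part_size

    def put(self, name, data):
        parts = [data[i:i + self._part_size]
                 for i in range(0, len(data), self._part_size)]
        for n, part in enumerate(parts):
            self._backend[(name, n)] = part
        # Record the part count after the parts themselves are stored.
        self._backend[(name, "count")] = len(parts)

    def get(self, name):
        count = self._backend[(name, "count")]
        return b"".join(self._backend[(name, n)] for n in range(count))


# Round trip: 2,560,000 bytes, well over the 1MB limit discussed here.
backend = {}
store = ChunkedStore(backend)
original = bytes(bytearray(range(256))) * 10000
store.put("segment", original)
restored = store.get("segment")
```

A real implementation would also need to remove stale parts when a file is rewritten with fewer parts, which this sketch leaves out.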


 
From: Hovav Oppenheim <hov...@gmail.com>
Date: Fri, 14 Sep 2012 07:55:15 +0300
Local: Fri, Sep 14 2012 12:55 am
Subject: Re: [Whoosh] GAE using Blobstore API?

Ok thanks a lot Matt. Will do.


 
From: Matt Chaput <m...@whoosh.ca>
Date: Fri, 14 Sep 2012 15:32:39 -0400
Local: Fri, Sep 14 2012 3:32 pm
Subject: Re: [Whoosh] GAE using Blobstore API?
On 14/09/2012 12:55 AM, Hovav Oppenheim wrote:

> Ok thanks a lot Matt. Will do.

Just pushed my changes to bitbucket.

Cheers,

Matt


 
From: hovavo <hov...@gmail.com>
Date: Wed, 5 Dec 2012 07:57:18 -0800 (PST)
Local: Wed, Dec 5 2012 10:57 am
Subject: Re: [Whoosh] GAE using Blobstore API?

Sorry for disappearing for such a long time.
I decided to wait with this experiment since compound=False did the job,
and I also figured keeping a big file on GAE that needs to load on every
instance invocation won't be a good direction to explore.

Anyhow, now I'm at a dead end again...
I'm adding more data, and I now reach the 1MB hard limit on some of the
blobs.
Is there any way in the lib to tell the segment writer not to reach that
limit and create new files instead?

Thanks a lot,
Hovav

BTW I'm really hoping to get to experimenting with the file API so we know
for sure how feasible it is.
It's just those goddamn deadlines.... :)


 