Massive datastore batch put, how to?

Stakka

Aug 12, 2009, 5:36:41 PM
to Google App Engine
I'm working on a browser-based accounting app which has a feature to
import ledger transactions through file uploads. I'm currently only
running on the local dev server, but from what I've read, datastore
puts -- even batch puts -- are very slow and CPU (quota) intensive
when deployed live.

How do I overcome this problem if the user uploads a large file with
thousands of transactions?

I've seen solutions where you batch-put entities in chunks of 500.
That only works if you run a custom upload tool on your own computer,
not from a browser, since the request is limited to 30 seconds. Am I
forced to use the Task Queue? And if so, where do I store the raw
uploaded file -- or, preferably, the parsed interim transaction
entities -- while the task isn't executing?

Funny that App Engine has a 10-megabyte request (file upload) size
limit when storing 10 megabytes' worth of entities seems to be so hard.

Nick Johnson (Google)

Aug 13, 2009, 1:44:37 PM
to google-a...@googlegroups.com
Hi Stakka,

My suggestion would be to do something like this:
- Split the uploaded file into 'jobs'. One job per 500KB of input
might be about right; it depends on your processing overhead. In any
case, each job needs to be less than 1MB.
- Insert the jobs into the datastore.
- Add a task queue task for each job.
- Have each task process its part of the total data.
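
In code, the first three steps might look roughly like this -- a
sketch only, where the "Job" kind, the worker URL, and the chunk size
are illustrative choices of mine, not anything prescribed by the SDK:

import com.google.appengine.api.datastore.Blob;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;
import java.util.Arrays;

public class UploadSplitter {
  // ~500k of input per job; each Job entity stays well under the 1MB limit.
  private static final int JOB_SIZE = 500 * 1024;

  public static void enqueueJobs(byte[] upload) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    for (int off = 0; off < upload.length; off += JOB_SIZE) {
      int end = Math.min(off + JOB_SIZE, upload.length);
      // A real splitter would cut on record boundaries, not raw byte offsets.
      Entity job = new Entity("Job");
      job.setProperty("data", new Blob(Arrays.copyOfRange(upload, off, end)));
      Key key = ds.put(job);
      // One task per stored job; the worker fetches the Job entity,
      // parses its slice, and writes the resulting entities.
      QueueFactory.getDefaultQueue().add(
          TaskOptions.Builder.withUrl("/tasks/process-job")
              .param("jobKey", KeyFactory.keyToString(key)));
    }
  }
}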

-Nick Johnson
--
Nick Johnson, Developer Programs Engineer, App Engine

Juraj Vitko

Aug 14, 2009, 10:48:42 AM
to Google App Engine
I think you need to write your own Flash- or Java-applet-based
chunked uploader. Or use an existing one and let us know, so that we
can use it too.

Stakka

Aug 14, 2009, 3:24:22 PM
to Google App Engine
Thanks for the tip, but why build a web app if a Java applet is
required? That wouldn't be a good solution. Also, the uploaded file
needs to be parsed in its entirety (CRC check, value references,
etc.), and it's not XML.

I think I have to parse the file server-side, populate (Java) Entity
objects, and serialize as many as I can into 1MB blobs. When that is
done, start a task that puts the deserialized entities into the
datastore in batches of 500. The response to the file upload request
will have to contain some unique task URL that the browser can poll
(AJAX) to display the progress.
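
Roughly what I have in mind for the staging step -- a sketch only,
where ImportChunk, LedgerRow and the size cap are names and numbers
I'm inventing for illustration:

import com.google.appengine.api.datastore.Blob;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class ImportStager {
  // One parsed ledger transaction; must be Serializable to go into a blob.
  public static class LedgerRow implements Serializable {
    String account;
    long amountCents;
    // ... remaining fields elided
  }

  // Stay under the 1MB entity size limit.
  private static final int MAX_BLOB_BYTES = 1000 * 1000;

  // Pack parsed rows into interim ImportChunk entities of at most ~1MB each.
  // Re-serializing the whole batch per row is wasteful but keeps the sketch
  // simple; assumes no single row exceeds the cap on its own.
  public static void stageChunks(List<LedgerRow> rows) throws IOException {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    List<LedgerRow> pending = new ArrayList<LedgerRow>();
    for (LedgerRow row : rows) {
      pending.add(row);
      if (serialize(pending).length > MAX_BLOB_BYTES && pending.size() > 1) {
        // The last row pushed us over the limit; flush without it.
        pending.remove(pending.size() - 1);
        flush(ds, pending);
        pending.clear();
        pending.add(row);
      }
    }
    if (!pending.isEmpty()) flush(ds, pending);
  }

  private static void flush(DatastoreService ds, List<LedgerRow> rows)
      throws IOException {
    Entity chunk = new Entity("ImportChunk");
    chunk.setProperty("rows", new Blob(serialize(rows)));
    ds.put(chunk);
  }

  private static byte[] serialize(List<LedgerRow> rows) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(new ArrayList<LedgerRow>(rows));
    out.close();
    return bytes.toByteArray();
  }
}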

Before I commit to such an elaborate solution, I'll have to test the
batch-put performance to see if GAE is even suitable for this kind of
app. http://groups.google.com/group/google-appengine/browse_thread/thread/41276f470e12037c#

Users of an online app shouldn't have to wait hours for a simple data
import just because it's hosted on GAE. If the app were using an SQL
database, this would only take a minute.

Juraj Vitko

Aug 15, 2009, 1:57:53 PM
to Google App Engine
I agree with everything you said. Just one thing to consider: first
storing the uploaded data, then retrieving it for reprocessing, and
then storing the processed data again will consume additional
resources/quota for your app.

GAE really appears to be designed for apps with a very high
read-to-write ratio. I would say, if you don't need to handle more
than a thousand concurrent users, you'd be better off renting a
server. Into this I've factored additional hassles you may not know
about yet, like index size and count limits, single-entity-group
write limits, and transaction limitations. All of these are possible
to work around, but I have yet to see whether those workarounds are
feasible in terms of the final price I will be paying to run the app.

Stakka

Aug 16, 2009, 6:36:42 PM
to Google App Engine
I implemented a rough version of my solution, and it seems to work up
to ~15k entities. Above that I hit the undocumented transaction write
limit you mentioned, when trying to commit 36408 entities serialized
into 24 blobs of 999960 bytes:

java.lang.IllegalArgumentException: datastore transaction or write too
big.

Well, the datastore seems fast enough for large dataset writes, but
all the limitations really make it troublesome to implement. Also,
the potential loss of data integrity while processing in multiple
requests/tasks without transactions is risky. Costly too: a
15k-entity "upload" consumes about 30 minutes of CPU quota.

Mark Jones

Aug 21, 2009, 10:57:09 AM
to Google App Engine
That is similar to what I am seeing. Writing to the datastore is VERY
expensive: 130K items consumed nearly 6.5 hours of CPU for me. Not
very efficient.

Stakka

Aug 21, 2009, 4:29:34 PM
to Google App Engine
I wonder how Google thinks app providers should bill their customers?
A flat fee isn't feasible when a single user action costs so much of
the app's profit. Limiting the number of actions (imports) is a
possibility, but that's bad for the app's productivity.

Google should remove the CPU cost of internal API calls. It might be
fairer if they billed for the number of API calls in combination with
the megabyte size of the data in each call. That would also encourage
good app design; a batch put would be cheaper than many single puts,
small entities with few indexes cheaper than large entities with many
indexes, etc.

Google says App Engine is great for small startups and scales to
millions of users. But how will a startup afford to reach the scale
where GAE shines? With regular hosting you at least know the cost in
advance, and during heavy load it's just slower.

I really like GAE, and I'd like to use it! But with all the missing
JRE classes, the many limits, and the cost of datastore CPU, I
probably can't for my current project. ;(