Best way to update 400,000 entities at once?


Keith Lea

unread,
Feb 7, 2014, 4:43:33 PM
to google-a...@googlegroups.com
Hi everyone,

I'm a long-time App Engine user for my app's backend, but I'm really still a novice when it comes to the datastore.

I'd like to add a new property (and index) for all entities of a certain type. I have about 400,000 of this type of entity in the datastore, and I'd like to load each one, add a property, and save it back to the datastore. 400,000 times.

This will obviously take a long time, so I'd really like to split it up into ~100 tasks that each take 1/100th of the entities (~4,000 entities) and perform this operation.

But I really don't know how to do this using queries, and the Java MapReduce library is overwhelmingly complicated. 

So how can I create 100 tasks that each take a unique chunk of the entities to operate on? Is this called "sharding"? Is there a way for a task to say "give me entity #200,000 thru #204,000"? (My entity's keys are strings, which were generated by my application and generally look like "928348-com.example-iOS".)

I'm using Java and Objectify btw. Thanks for any help or guidance!!

Keith

Philip Lodine

unread,
Feb 12, 2014, 11:00:35 AM
to google-a...@googlegroups.com
Check out the documentation on App Engine's map/reduce infrastructure -- it takes care of mapping across all of your entities and doing what you want to each of them.

Best,

Phil

de Witte

unread,
Feb 27, 2014, 8:04:02 AM
to google-a...@googlegroups.com
Use a backend instance and keep it running until done.

Or

Use two tasks. One for retrieving 1000 keys at a time, and a second one to update the entities in batches of 1000.

Done it for 300,000 entities in less than 20 minutes. ~300 tasks.
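
Roughly, the second (update) task can do one batch get and one batch put per chunk. An untested sketch with Objectify; MyEntity and the "migrated" flag are placeholders for your own kind and new property:

    import com.googlecode.objectify.Key;
    import static com.googlecode.objectify.ObjectifyService.ofy;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;

    // Handler body for the update task; the first task passes it ~1000 web-safe key strings.
    void updateBatch(List<String> webSafeKeys) {
        List<Key<MyEntity>> keys = new ArrayList<Key<MyEntity>>();
        for (String s : webSafeKeys) {
            keys.add(Key.<MyEntity>create(s));              // web-safe string -> Objectify key
        }
        Collection<MyEntity> entities = ofy().load().keys(keys).values();  // one batch get
        for (MyEntity e : entities) {
            e.setMigrated(true);                            // the new property you are adding
        }
        ofy().save().entities(entities).now();              // one batch put for the whole chunk
    }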

On Friday, February 7, 2014 at 10:43:33 PM UTC+1, Keith Lea wrote:

Tom Kaitchuck

unread,
Feb 28, 2014, 3:49:46 PM
to google-a...@googlegroups.com
This is exactly the sort of task MapReduce was meant for. It should really be a lot easier than managing the partitioning, error recovery, etc. yourself.
Take a look at our new docs: https://developers.google.com/appengine/docs/java/dataprocessing/mapreduce_library
Hopefully they make it less overwhelming.



de Witte

unread,
Mar 1, 2014, 4:29:27 AM
to google-a...@googlegroups.com
I'm sorry, but it is not, or the documentation is not clear.

From the overview page it looks like I have to create 6 classes... that is a lot of code and work.

This piece of code from the example page doesn't make me happy. Change the tutorial to a real-life scenario, for example updating 300,000 entities.



On Friday, February 28, 2014 at 9:49:46 PM UTC+1, Tom Kaitchuck wrote:

Lorenzo Bugiani

unread,
Mar 2, 2014, 10:38:32 AM
to google-a...@googlegroups.com
Well, if you don't need distributed termination (or, in other words, you don't need to know when the entire job is finished), then you can simply use tasks and cursors.
A main task can execute a query limited to the first N results, then launch a task that updates those entities. While that task performs the update, the main task fetches the query cursor and executes another query asking for the first N results after the cursor, then launches another task that updates those entities, and so on...

Read about cursors here for the Java documentation or here for the Python docs.
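
In Java/Objectify, one pass of that main task could look something like this untested sketch; MyEntity, the task URLs and the parameter names are placeholders:

    import com.google.appengine.api.datastore.Cursor;
    import com.google.appengine.api.datastore.QueryResultIterator;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;
    import com.google.common.base.Joiner;
    import com.googlecode.objectify.Key;
    import com.googlecode.objectify.cmd.Query;
    import static com.googlecode.objectify.ObjectifyService.ofy;
    import java.util.ArrayList;
    import java.util.List;

    // One pass: fetch the next 1000 keys, hand them to an update task,
    // then re-enqueue this task with the cursor so the next pass resumes there.
    void runOnePass(String webSafeCursor) {
        Query<MyEntity> q = ofy().load().type(MyEntity.class).limit(1000);
        if (webSafeCursor != null) {
            q = q.startAt(Cursor.fromWebSafeString(webSafeCursor));
        }
        QueryResultIterator<Key<MyEntity>> it = q.keys().iterator();

        List<String> keys = new ArrayList<String>();
        while (it.hasNext()) {
            keys.add(it.next().getString());                // web-safe key strings for the payload
        }
        if (keys.isEmpty()) {
            return;                                         // nothing left: the job is done
        }
        QueueFactory.getDefaultQueue().add(TaskOptions.Builder
            .withUrl("/tasks/update-batch")                 // placeholder handler: does the load + set + save
            .param("keys", Joiner.on(',').join(keys)));
        QueueFactory.getDefaultQueue().add(TaskOptions.Builder
            .withUrl("/tasks/iterate")                      // placeholder handler: runs this method again
            .param("cursor", it.getCursor().toWebSafeString()));
    }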

John Wheeler

unread,
Mar 4, 2014, 11:41:48 AM
to google-a...@googlegroups.com
Research Brett Slatkin's fan-out pattern.

Brandon Wirtz

unread,
Mar 4, 2014, 12:02:50 PM
to google-a...@googlegroups.com

We just learned that there is a limit to how fast you can defer to the task queue. You can only have 10 async requests open at a time, so if you try to defer 100 things, numbers 11-100 will fail if the scheduler for task 1 doesn't respond by the time you get to 11.

 

This is really annoying since, if the scheduler responds half as fast as you try to add them, you will get something like 1-12, 15-18, 20-23, 27-29, and so on and so forth. We spent a bit of time figuring out why our updates would have chunks missing, especially since on days the defers were running fast, everything would work.
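
For what it's worth, if the "10 in flight" figure above is really the constraint, one workaround is to cap how many async adds you keep open at once. An untested Java sketch of that idea (the limit of 10 comes from the post above, not from documentation):

    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskHandle;
    import com.google.appengine.api.taskqueue.TaskOptions;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;
    import java.util.concurrent.Future;

    // Keep at most 10 async adds in flight; block on the oldest one before issuing more.
    void addAllThrottled(List<TaskOptions> tasks) throws Exception {
        Queue queue = QueueFactory.getDefaultQueue();
        Deque<Future<TaskHandle>> inFlight = new ArrayDeque<Future<TaskHandle>>();
        for (TaskOptions t : tasks) {
            if (inFlight.size() >= 10) {
                inFlight.removeFirst().get();               // wait for the oldest add to complete
            }
            inFlight.addLast(queue.addAsync(t));
        }
        for (Future<TaskHandle> f : inFlight) {
            f.get();                                        // drain the remaining adds
        }
    }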

Kaan Soral

unread,
Mar 4, 2014, 2:16:15 PM
to google-a...@googlegroups.com
I haven't noticed anything as critical as the issue you mentioned, but I did notice some fishy behavior around taskqueue.BulkAdd - especially when an excessive number of tasks is involved, as you mentioned.

I was forced to switch to my own implementation of defer and deferred-task routines, as App Engine only recently started using ndb.toplevel. As far as I recall, defer just adds the task to a task queue, so basically we can consider that you are talking about the task queue.

It might be a good idea to start a new discussion about this subject and clearly define these limitations.

I started utilizing a governor async container and flush routines that .wait() when the async elements reach a limit; it might be a good idea to apply this to task queues too.
I'm manually batching task queue calls in groups of 50 - groups of 500 are a bit problematic - but it's too much trouble for defer-like tasks, where there is no order.
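
The Java analogue of batching the adds is passing a whole list to one Queue.add() call instead of adding tasks one at a time; an untested sketch, using batches of 50 as above:

    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;
    import com.google.common.collect.Lists;
    import java.util.List;

    // One add() RPC per 50 tasks instead of one RPC per task.
    void enqueueInBatches(List<TaskOptions> allTasks) {
        Queue queue = QueueFactory.getDefaultQueue();
        for (List<TaskOptions> chunk : Lists.partition(allTasks, 50)) {
            queue.add(chunk);
        }
    }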

(I didn't refine the above post, so it might be a bit confusing, out of touch, not well-defined)

Jeff Schnitzer

unread,
Mar 4, 2014, 3:32:33 PM
to Google App Engine
When I batch update entities, I have two tasks: the first iterates through the entity keys and enqueues an update task for each entity that needs to be updated. That approach will probably get you to 400k within the 10-minute run of the iteration task, especially if you submit the tasks in batches of 100 or so. The Google Guava Lists.partition() method is handy here.
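
An untested sketch of that first task, using Objectify for the keys-only iteration; MyEntity and the /tasks/update-batch handler are placeholders:

    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;
    import com.google.common.base.Joiner;
    import com.google.common.collect.Lists;
    import com.googlecode.objectify.Key;
    import static com.googlecode.objectify.ObjectifyService.ofy;
    import java.util.ArrayList;
    import java.util.List;

    // Walk all keys (keys-only, so it's cheap) and enqueue one update task per 100 keys.
    void enqueueUpdateTasks() {
        List<String> allKeys = new ArrayList<String>();
        for (Key<MyEntity> k : ofy().load().type(MyEntity.class).keys()) {
            allKeys.add(k.getString());                     // web-safe key string
        }
        Queue queue = QueueFactory.getDefaultQueue();
        for (List<String> batch : Lists.partition(allKeys, 100)) {
            queue.add(TaskOptions.Builder
                .withUrl("/tasks/update-batch")             // placeholder: batch get, set property, batch put
                .param("keys", Joiner.on(',').join(batch)));
        }
    }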

I haven't needed to do this yet, but you can also take the approach of splitting up the work. This is fairly easy if you have simple IDs without parent keys. You start with the lowest key and the highest key, then you use a splitter task that divides and conquers the address space, spawning new splitter tasks until you reach what you consider to be a sufficiently small space to spawn iterator tasks. Just be careful to use the task queue's name mechanism to prevent duplicate tasks from being enqueued when some of these tasks inevitably retry.
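
The name mechanism looks roughly like this (untested; the /tasks/split handler and the naming scheme are made up, the point is only that the name is deterministic for a given range):

    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskAlreadyExistsException;
    import com.google.appengine.api.taskqueue.TaskOptions;

    // Name the splitter task after the range it covers; if a retried parent tries
    // to enqueue the same range again, the add is rejected instead of duplicated.
    void enqueueSplitter(String loKey, String hiKey) {
        TaskOptions split = TaskOptions.Builder
            .withUrl("/tasks/split")                        // placeholder handler
            .param("lo", loKey)
            .param("hi", hiKey)
            .taskName("split-" + loKey.hashCode() + "-" + hiKey.hashCode());
        try {
            QueueFactory.getDefaultQueue().add(split);
        } catch (TaskAlreadyExistsException e) {
            // already enqueued by an earlier attempt of this task; safe to ignore
        }
    }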

I don't quite know what you do when you have parent keys. Presumably the map/reduce library has solved that - it would be worth looking into. It basically uses the same divide-and-conquer approach.

Jeff

Wolfram Gürlich

unread,
Mar 5, 2014, 7:18:34 AM
to google-a...@googlegroups.com
Instead of iterating over all keys at once, you could also use the hidden __scatter__ property to determine proper split points for your key range in advance. That is what the mapreduce library does.

Lorenzo Bugiani

unread,
Mar 5, 2014, 9:24:43 AM
to google-a...@googlegroups.com
I haven't understood what you can do with the __scatter__ property.
As the MapReduce docs say, "We do not allow retrieving this value directly (it's stripped from the entity before it's returned to the application by the Datastore)"...

Also, I haven't understood what's wrong with using tasks and cursors.
A task can simply iterate over the data, fetching N entities at a time (in this case 1000), then launch a task that updates them (by itself or by splitting the work again). At most, if 10 minutes aren't enough to iterate over all the data, this "main" task can fetch only a piece of the data, start an update task, then restart itself on the next chunk of data...

Obviously, if it is possible to split the work using only key information (for example, keys that are numbers from 1 to 400,000), this works better, because fetching data by key is always better than querying for the same data...


2014-03-05 13:18 GMT+01:00 Wolfram Gürlich <w.gue...@gmail.com>:
Instead of iterating over all keys at once, you could also use the hidden __scatter__ property to determine proper split points for your key range in advance. That is what the mapreduce library does.

Barry Hunter

unread,
Mar 5, 2014, 10:14:54 AM
to google-appengine
On 5 March 2014 14:24, Lorenzo Bugiani <canema...@gmail.com> wrote:
I haven't understood what you can do with the __scatter__ property.
As the MapReduce docs say, "We do not allow retrieving this value directly (it's stripped from the entity before it's returned to the application by the Datastore)"...

Suppose you want to break the whole dataset into N shards.

In theory, if you just take the first N keys from a keys-only query sorted by __scatter__, you get N keys evenly spread out throughout the whole dataset.

Each of those keys can then be used as the 'first' key in a standard datastore query, to get all the documents in that shard (using a greater-than __key__ filter, with the results in __key__ order). It works sort of like how a cursor actually works under the hood.
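
With the low-level Java datastore API, picking the split points could look roughly like this untested sketch; the kind name is a placeholder, and "__scatter__" is the reserved property the mapreduce wiki describes:

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.FetchOptions;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.Query;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Ask for (numShards - 1) keys ordered by __scatter__; only a random sample of
    // entities carry that property, so the keys come back spread across the key space.
    List<Key> findSplitPoints(String kind, int numShards) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Query q = new Query(kind).addSort("__scatter__").setKeysOnly();
        List<Key> splits = new ArrayList<Key>();
        for (Entity e : ds.prepare(q).asIterable(FetchOptions.Builder.withLimit(numShards - 1))) {
            splits.add(e.getKey());
        }
        Collections.sort(splits);     // Key is Comparable; the sorted splits define the ranges
        return splits;
    }

    // Each shard then runs an ordinary query over its half-open key range, e.g.
    //   __key__ >= splits.get(i - 1)  and  __key__ < splits.get(i)
    // with the results sorted by __key__. The first and last shards leave one end open.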

 

Also, I haven't understood what's wrong with using tasks and cursors.

The main 'advantage' of using __scatter__ is that you can get all the data to set up ALL the tasks at once. Say you want 200 shards: one query retrieving 200 keys, and you can immediately create all 200 tasks*. You can then begin processing those tasks in any order, even concurrently. (Although if you have lots of shards to create, you may use cursors for 'looping' that initial query!)

Using a 'while loop' and cursors, you have to run each query and get all the data (keys only or the actual documents) in order to get the cursor that begins the next task. Even if you do keys-only queries to get all the cursors (and add the cursors to the tasks, so they can get all the documents for real), you are downloading much more data than you need.


So with the 'nibble away' approach, you either have to be very inefficient in creating all the initial tasks (so you can get a progress report), or you have to create the 'next' task only after running the for-real query (i.e. getting the next cursor in the query), in which case you can't parallelize.


Both approaches have their pros and cons. One is much simpler and easier to understand; the other is more complex, but should be more efficient.




* Actually the mapreduce lib does oversample to improve the results, so it's not quite that efficient.
 
A task can simply iterate over the data, fetching N entities at a time (in this case 1000), then launch a task that updates them (by itself or by splitting the work again). At most, if 10 minutes aren't enough to iterate over all the data, this "main" task can fetch only a piece of the data, start an update task, then restart itself on the next chunk of data...

Yes, for small batch runs it's not going to make a big difference, but the bigger the whole dataset to be iterated, the more effect any efficiency gains have.

 

Obviously, if it is possible to split the work using only key information (for example, keys that are numbers from 1 to 400,000), this works better, because fetching data by key is always better than querying for the same data...

That's what __scatter__ does allow :)


Lorenzo Bugiani

unread,
Mar 5, 2014, 10:32:54 AM
to google-a...@googlegroups.com
Ok thanks!

I've heard of __scatter__ now for the first time! :D


Jeff Schnitzer

unread,
Mar 5, 2014, 3:27:46 PM
to Google App Engine
Wow, I had no idea this existed. It's brilliant!

This seems to be the only formal documentation:

https://code.google.com/p/appengine-mapreduce/wiki/ScatterPropertyImplementation

Perhaps this should be added to the main GAE documentation? I never would have found it.

Jeff

Kaan Soral

unread,
Mar 5, 2014, 3:34:14 PM
to google-a...@googlegroups.com, je...@infohazard.org
3-4 years ago I also shared your enthusiasm, but it's much, much more logical to just implement a scatter manually; the preset probability results in a diluted scatter pool after a while. The idea of scatters is obviously nice, though.

Lorenzo Bugiani

unread,
Mar 5, 2014, 3:36:56 PM
to google-a...@googlegroups.com
Well, the documentation says that __scatter__ is intended only for MapReduce and could change in the future, so I think it isn't a good idea to write production software based on it now.

Maybe when the MapReduce library is GA, we could use __scatter__ in a safe way...