Mass updating with .put_multi(list_of_entities)


Adam Bradley

Jun 20, 2012, 3:57:20 PM
to appengine-...@googlegroups.com
Let's say I'd like to update 1000 entities as quickly as possible. Is there a recommended number of entities I should stay below when using .put_multi()? Would it be better to break up the 1000 entities into 10 async calls?

For example, let's say the "list_of_entities" list has 1000 entities in it which are ready to be updated. Which method would you prefer, balancing all the factors involved?

Method 1):
ndb.put_multi(list_of_entities) 


Method 2):
put_list = []
futures = []
for entity in list_of_entities:
    put_list.append(entity)
    if len(put_list) >= 100:  # flush in batches of 100
        futures.extend(ndb.put_multi_async(put_list))
        put_list = []
if put_list:
    futures.extend(ndb.put_multi_async(put_list))
ndb.Future.wait_all(futures)  # without waiting, the async puts may never complete


And the same question goes for .delete_multi()
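For reference, .delete_multi() takes keys rather than entities, so (if I have it right) the batch version there would be something like:

keys = [entity.key for entity in list_of_entities]  # delete wants Key objects
ndb.delete_multi(keys)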

Thanks

Guido van Rossum

Jun 20, 2012, 4:49:46 PM
to appengine-...@googlegroups.com
I'd use method (1), but I'd look into configuring the ndb Context to
set max_entity_groups_per_rpc around 100 (assuming you're using the
HRD, which you should). The call is something like

from google.appengine.datastore import datastore_rpc
from google.appengine.ext import ndb

config = datastore_rpc.Configuration(max_entity_groups_per_rpc=100)
ctx = ndb.Context(config=config)
ndb.set_context(ctx)
--
--Guido van Rossum (python.org/~guido)

Adam Bradley

Jun 20, 2012, 5:19:27 PM
to appengine-...@googlegroups.com
Perfect, thanks for pointing me in the right direction. Looks like the default is 10 entity groups per RPC.

In case anyone else is interested, from the datastore_rpc docstring:
For a non-transactional operation that involves more entity groups than the maximum, the operation will be performed by executing multiple, asynchronous rpcs to the datastore, each of which has no more entity groups represented than the maximum. So, if a put() operation has 8 entity groups and the maximum is 3, we will send 3 rpcs, 2 with 3 entity groups and 1 with 2 entity groups. This is a performance optimization - in many cases multiple, small, concurrent rpcs will finish faster than a single large rpc. The optimal value for this property will be application-specific, so experimentation is encouraged.
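So with the maximum at 100, my 1000-entity put should go out as ten RPCs of 100 entity groups each (assuming each entity lives in its own entity group). The split from the docstring is just this arithmetic (a sketch of the math, not what ndb literally runs):

import math

def batch_sizes(num_entity_groups, max_per_rpc):
    # Fewest RPCs such that none carries more than max_per_rpc groups.
    num_rpcs = int(math.ceil(num_entity_groups / float(max_per_rpc)))
    base, extra = divmod(num_entity_groups, num_rpcs)
    return [base + 1] * extra + [base] * (num_rpcs - extra)

print(batch_sizes(8, 3))       # [3, 3, 2] -- the docstring's example
print(batch_sizes(1000, 100))  # ten batches of 100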

Thanks for your help,
Adam




Guido van Rossum

Jun 20, 2012, 5:55:08 PM
to appengine-...@googlegroups.com
Beware that on the HRD another default takes precedence: 

  DEFAULT_MAX_ENTITY_GROUPS_PER_HIGH_REP_READ_RPC = 1

You can use Appstats to see how many RPCs are *actually* made.
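If you don't have Appstats wired up yet, the standard recipe is an appengine_config.py at the app root (plus the appstats builtin in app.yaml to serve /_ah/stats):

# appengine_config.py -- record RPCs for every request
def webapp_add_wsgi_middleware(app):
    from google.appengine.ext.appstats import recording
    return recording.appstats_wsgi_middleware(app)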

Steve

Jun 22, 2012, 3:37:01 AM
to appengine-...@googlegroups.com
I did a quick google but didn't see a good answer.  Could you please briefly explain the difference between the two:

DEFAULT_MAX_ENTITY_GROUPS_PER_HIGH_REP_READ_RPC 
DEFAULT_MAX_ENTITY_GROUPS_PER_RPC 

Thanks!

Guido van Rossum

Jun 22, 2012, 9:38:45 AM
to appengine-...@googlegroups.com
When reading using the HR datastore, the first (i.e. 1) is used if no
explicit value is specified. When writing, or when using the M/S datastore,
the second is used. I think that you can override either by passing
max_entity_groups_per_rpc=N on your API calls, e.g.
ndb.get_multi(keys, max_entity_groups_per_rpc=100). But please verify
for yourself using Appstats. The logic is excruciatingly complex.
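Something along these lines, if the pass-through works the way I remember (check the actual fan-out in Appstats):

# Per-call override -- the option rides along as a context-options kwarg
entities = ndb.get_multi(keys, max_entity_groups_per_rpc=100)   # reads
ndb.put_multi(entities, max_entity_groups_per_rpc=100)          # writes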

MikeDSP

Mar 27, 2013, 5:58:22 PM
to appengine-...@googlegroups.com, adambr...@gmail.com
What kind of performance times did you see?

I have been playing with a data import of roughly 60,000 entities, using batches of 1000 at a time.
The model is an ndb.Expando with a few dynamic fields (roughly 6-7) and indexing disabled.
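For reference, the model is along these lines (field names invented here; _default_indexed = False is the Expando switch that leaves dynamic properties unindexed):

from google.appengine.ext import ndb

class ImportRow(ndb.Expando):
    # Dynamic properties are written unindexed, skipping the
    # per-property index writes during the import.
    _default_indexed = False

row = ImportRow()
row.field_a = 1       # roughly 6-7 dynamic fields like these
row.field_b = 'text'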

I'm getting (using Guido's method) approx 9-10s per 1000 entities (1 batch), which is kind of expensive :/

I've tried a few variations: currently I'm using a single put_multi(), and I've also tried breaking it down into put_multi() calls at 100- and 500-entity intervals, but there's not much difference in performance (more than likely none).
Appstats says it's the datastore_v3.Put calls that are the cost factor.

Is anybody else able to import these volumes with lower cost or time?

Thanks,
Mike