Datastore: Problem updating entities

Filipe Caldas

Aug 11, 2017, 9:02:53 AM
to Google App Engine, Filipe Caldas
Hi,

  I am currently trying to update a kind in my database to add a new field (indexed=0); the kind has more than 10M entities.

  I tried to use MapReduce for App Engine and launched a fairly simple job where the mapper only sets the property and yields an operation.db.Put(). The only problem is that some of the shards failed, so the job was stopped and automatically restarted.
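
  For reference, the mapper was essentially just this (a minimal sketch; the model class and the new_field name are illustrative):

    # Mapper for the appengine-mapreduce library.
    from mapreduce import operation as op

    def add_default_field(entity):
        # Set the new (unindexed) property to its default value.
        entity.new_field = 0
        # Yield the mutation instead of calling put() directly, so the
        # framework can batch the writes per shard.
        yield op.db.Put(entity)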

  The problem is that launching this job on 10M entities cost me about $100, and the job still did not finish (the retry was going slowly, so I don't think they billed much for that).
  
The extra annoying thing is that I know of no other way to update these properties "fast" enough (the MapReduce took over 7 hours to fail on 10M entities). I know Beam/Dataflow is apparently the way to go, but the documentation on basic operations like updating Datastore entities is still very poor (I am not sure it can even be done).

  So, my question is: is there a fast and *safe* way to update my entities that does not consist of doing 10M fetches and puts in sequence?

  Bonus question: does anyone know why I was billed for 70M reads on only 10M entities?

Best regards,
[Attachments: cost.png, shards_fail.png]

Shivam(Google Cloud Support)

Aug 11, 2017, 4:46:20 PM
to Google App Engine, fil...@lipex.com

There should be no actual need to mass-put a new property on all of your entities and set it to a default value, since Datastore supports entities both with and without a given property (as you have noticed with the failed MapReduce job).


You can assume that if an entity does not have the property, it is equal to the default value. You can then set this value directly in your application at read time: if the property exists, read and use it; otherwise use a hard-coded default and set the value then in your code (i.e., only when the entity is being read).
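
With ndb, for instance, this can be expressed as a property default on the model, so entities migrate lazily whenever they are read and re-written (a sketch; the model and property names are illustrative):

    from google.appengine.ext import ndb

    class Item(ndb.Model):
        # Entities written before this property existed simply read
        # back the default; no mass update is required.
        new_field = ndb.IntegerProperty(default=0, indexed=False)

    def read_and_migrate(item_id):
        item = Item.get_by_id(item_id)
        if item is None:
            return None
        value = item.new_field   # 0 for old, unmigrated entities
        item.put()               # persists the property on the next write
        return value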


Updating existing entities is documented here.


Without knowing exactly what happened, it is not possible to determine the reason for the 70M reads. However, I would recommend viewing this post, which might answer your question.


Filipe Caldas

Aug 15, 2017, 6:00:37 AM
to Google App Engine, fil...@lipex.com
The job was actually doing slightly more than setting a property to a default value: it was also doing a .strip() on one of the fields due to an error in our insert scripts. So in some cases there is a need to do a mass update on all entities; it definitely doesn't happen often, but we would rather not re-insert all the entities in the table.

The documented method of updating entities works fine, but as many other users have noticed, for any case where the number of rows is big (>10M) it would take over a week to finish. It is definitely much cheaper to run than the MapReduce, but it takes too long.

The way we found to do it "safely", in the sense that we can be sure the task will be done in a limited amount of time, was to instead use a VM that spawns about 5 threads and reads/updates the entities on Datastore in parallel (and even this is still taking about 2 days to finish for 12M entities).
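
The worker is roughly along these lines (a sketch using the google-cloud-datastore client; the kind and field names are illustrative, and marking the field as unindexed is omitted):

    from concurrent.futures import ThreadPoolExecutor
    from google.cloud import datastore

    client = datastore.Client()  # shared across the worker threads
    BATCH = 500                  # put_multi accepts at most 500 entities

    def update_batch(entities):
        for entity in entities:
            entity['new_field'] = 0
        client.put_multi(entities)

    def run():
        batch, futures = [], []
        with ThreadPoolExecutor(max_workers=5) as pool:
            for entity in client.query(kind='MyKind').fetch():
                batch.append(entity)
                if len(batch) == BATCH:
                    futures.append(pool.submit(update_batch, batch))
                    batch = []
            if batch:
                futures.append(pool.submit(update_batch, batch))
            for f in futures:
                f.result()  # re-raise failures so nothing fails silently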

Shivam(Google Cloud Support)

Aug 15, 2017, 1:04:58 PM
to Google App Engine, fil...@lipex.com
The job could tend to be slow for such an amount of entities when following the example here. The proper solution for a Datastore MapReduce in the cloud would be Datastore I/O using Dataflow.


Dataflow SDKs provide an API for reading data from and writing data to a Google Cloud Datastore database. Its programming model is designed to simplify the mechanics of large-scale data processing. When you program with a Dataflow SDK, you are essentially creating a data processing job to be executed by one of the Cloud Dataflow runner services. This model lets you concentrate on the logical composition of your data processing job, rather than the physical orchestration of parallel processing. You can focus on what you need your job to do instead of exactly how that job gets executed.


If you choose to stick with MapReduce on App Engine, it is recommended to file any issues you experience directly with the engineering team on their GitHub repository.

Filipe Caldas

Aug 16, 2017, 5:11:37 AM
to Google App Engine, fil...@lipex.com
Hi Shivam,

  Is it possible to use Python instead of Java on Dataflow to do the update on Datastore? If so, where can I find an example?

Best regards,

Shivam(Google Cloud Support)

Aug 17, 2017, 10:53:47 AM
to Google App Engine, fil...@lipex.com

You can view the Python example, which imports these Datastore libraries. For further issues with the code, you may contact them directly.
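
As a rough illustration, an update pipeline built on that example could look like this (a sketch against the Python SDK's v1 Datastore I/O; the project id, kind, and field names are placeholders):

    import apache_beam as beam
    from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
    from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore
    from google.cloud.proto.datastore.v1 import entity_pb2, query_pb2
    from googledatastore import helper as datastore_helper

    PROJECT = 'my-project'  # placeholder project id

    def set_field(entity):
        # Beam elements must not be mutated in place, so work on a copy.
        updated = entity_pb2.Entity()
        updated.CopyFrom(entity)
        datastore_helper.add_properties(updated, {'new_field': 0})
        return updated

    query = query_pb2.Query()
    query.kind.add().name = 'MyKind'

    with beam.Pipeline() as p:
        (p
         | 'read' >> ReadFromDatastore(PROJECT, query)
         | 'update' >> beam.Map(set_field)
         | 'write' >> WriteToDatastore(PROJECT))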

