Options to synchronize a moderate amount of data with Datastore?

Attila-Mihaly Balazs

Jul 11, 2018, 2:18:32 PM
to Google App Engine
Hello,

I'm trying to do the following periodically (let's say once a week):

- download a couple of public datasets
- merge them together, resulting in a dictionary (I'm using Python) of ~2.5m entries
- upload the result to Cloud Datastore so that I have it as "reference data" for other things running in the project

I've put together a Python script using google-cloud-datastore, however the performance is abysmal: it takes around 10 hours (!) to do this. What I'm doing:

- iterate over the entries from the datastore
- look them up in my dictionary and decide whether they need to be updated, or deleted (if no longer present in the dictionary)
- write them back / delete them as needed
- insert any new elements from the dictionary

I already batch the requests (using .put_multi, .delete_multi, etc.).
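
For reference, this is roughly what the script does (simplified sketch; "ReferenceData" is a placeholder kind name, and I'm assuming named keys derived from the dictionary keys):

from google.cloud import datastore

client = datastore.Client()
BATCH = 500  # put_multi / delete_multi are limited to 500 entities per call

def sync(reference):
    """reference: dict mapping key name -> dict of properties."""
    to_put, to_delete = [], []

    def flush(force=False):
        if to_put and (force or len(to_put) >= BATCH):
            client.put_multi(to_put)
            del to_put[:]
        if to_delete and (force or len(to_delete) >= BATCH):
            client.delete_multi(to_delete)
            del to_delete[:]

    seen = set()
    for entity in client.query(kind='ReferenceData').fetch():
        name = entity.key.name
        seen.add(name)
        if name not in reference:
            to_delete.append(entity.key)       # no longer in the dataset -> delete
        elif dict(entity) != reference[name]:
            entity.update(reference[name])     # changed -> update
            to_put.append(entity)
        flush()

    for name, props in reference.items():      # new entries -> insert
        if name not in seen:
            entity = datastore.Entity(key=client.key('ReferenceData', name))
            entity.update(props)
            to_put.append(entity)
            flush()

    flush(force=True)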

Some things I considered:

- Use Dataflow. The problem is that each task would have to load the dataset (my "dictionary") into memory, which is time- and memory-consuming
- Use the managed import / export. The problem is that it produces / consumes some undocumented binary format (I would guess entities serialized as protocol buffers?)
- Use multiple threads locally to mitigate the latency. The problem is that the google-cloud-datastore library has limited support for cursors (it doesn't have an "advance cursor by X" method, for example), so I don't have a way to efficiently divide the entities in Datastore into chunks that could be processed by different threads (see the sketch below)
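
To illustrate that last point: the closest workaround I can think of is filtering on __key__ ranges, something like the sketch below. But the split points are guesses (I don't know the key distribution), so the chunks wouldn't be balanced:

from concurrent.futures import ThreadPoolExecutor
from google.cloud import datastore

client = datastore.Client()

def fetch_chunk(start_key, end_key):
    # Read only the entities whose key falls in [start_key, end_key)
    query = client.query(kind='ReferenceData')
    if start_key is not None:
        query.add_filter('__key__', '>=', start_key)
    if end_key is not None:
        query.add_filter('__key__', '<', end_key)
    return list(query.fetch())

# Guessed split points -- this is exactly the part I can't do properly
boundaries = [client.key('ReferenceData', prefix) for prefix in ('g', 'n', 't')]
ranges = list(zip([None] + boundaries, boundaries + [None]))

with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
    chunks = list(pool.map(lambda r: fetch_chunk(*r), ranges))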

Any suggestions on how I could improve the performance?

Attila

Jean Juste-constant

Jul 12, 2018, 6:05:24 PM
to Google App Engine
I do believe Dataflow would be the best option here if configured with many workers (which can be split based on your current batch requests). I'm not sure what type of datasets your 'dictionary' is using, but, correct me if I'm wrong, my understanding of your current script is that you are querying one Datastore entity at a time against the 2.5M entries in your dictionary, and repeating the process until all Datastore entities have been checked against the dictionary. Are you creating the Datastore keys based on your dictionary entries, or are you allowing Datastore to generate the keys? If it's the former, there's a possibility that you are experiencing a 'hotspot' issue due to a narrow key range, as explained here

Regarding the memory: adapting the dictionary into a file (yes, it will be big) and adding it to Cloud Storage should be the ideal way to create the PCollection to be used. You can delete the file once the whole process is finished to minimize costs. Depending on the source of these public datasets, you could also skip the dictionary creation entirely and simply store the data in a file. However, I'm not sure if the dictionary is being used somewhere else, like the "reference data".
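
As a rough sketch (assuming you write the merged data to Cloud Storage as newline-delimited JSON; the kind name, bucket and exact Beam module paths will depend on your setup and Beam version), the pipeline could look something like this:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.types import Key, Entity
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore

PROJECT = 'my-project'  # placeholder

def to_entity(record):
    # Build a Datastore entity keyed on the record's id
    key = Key(['ReferenceData', record['id']], project=PROJECT)
    entity = Entity(key)
    entity.set_properties({k: v for k, v in record.items() if k != 'id'})
    return entity

options = PipelineOptions(runner='DataflowRunner', project=PROJECT,
                          temp_location='gs://my-bucket/tmp')
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/reference.jsonl')
     | 'Parse' >> beam.Map(json.loads)
     | 'ToEntity' >> beam.Map(to_entity)
     | 'Write' >> WriteToDatastore(PROJECT))

Note that this only covers writing/upserting the entities; removing entries that disappeared from the dataset would need a separate step.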

I don't fully understand the undocumented binary format you mentioned; would you be able to provide an example?