Hello,
I'm trying to do the following periodically (let's say once a week):
- download a couple of public datasets
- merge them together, resulting in a dictionary (I'm using Python) of ~2.5m entries
- upload the result to Cloud Datastore so that I have it as "reference data" for other things running in the project
I've put together a Python script using google-cloud-datastore, but the performance is abysmal: it takes around 10 hours (!) to do this. What I'm doing:
- iterate over the existing entities in Datastore
- look them up in my dictionary and decide if they need to be updated or deleted (if no longer present in the dictionary)
- write them back / delete them as needed
- insert any new elements from the dictionary
I already batch the requests (using .put_multi, .delete_multi, etc.).
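Roughly, the core of the script looks like the sketch below (names like reference_dict, "ReferenceData" and the assumption that keys are string names are just placeholders for this post):

from google.cloud import datastore

client = datastore.Client()
BATCH = 500  # Datastore's per-call limit for put_multi / delete_multi

reference_dict = {}  # ID -> property dict, built from the downloaded datasets
to_put, to_delete, seen = [], [], set()

# 1. Walk the existing entities and diff them against the dictionary.
for entity in client.query(kind="ReferenceData").fetch():
    entity_id = entity.key.name
    seen.add(entity_id)
    if entity_id not in reference_dict:              # gone from the source data
        to_delete.append(entity.key)
    elif dict(entity) != reference_dict[entity_id]:  # changed -> rewrite
        entity.update(reference_dict[entity_id])
        to_put.append(entity)

# 2. Insert entries that only exist in the dictionary.
for entity_id, props in reference_dict.items():
    if entity_id not in seen:
        new_entity = datastore.Entity(key=client.key("ReferenceData", entity_id))
        new_entity.update(props)
        to_put.append(new_entity)

# 3. Flush the mutations in batches of 500.
for i in range(0, len(to_put), BATCH):
    client.put_multi(to_put[i:i + BATCH])
for i in range(0, len(to_delete), BATCH):
    client.delete_multi(to_delete[i:i + BATCH])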
Some things I considered:
- Use Dataflow. The problem is that each task would have to load the dataset (my "dictionary") into memory, which is time- and memory-consuming
- Use the managed import / export. Problem is that it produces / consumes some undocumented binary format (I would guess entities serialized as protocol buffers?)
- Use multiple threads locally to mitigate the latency. The problem is that the google-cloud-datastore library has limited support for cursors (it doesn't have an "advance cursor by X" method, for example), so I don't have a way to efficiently divide the entities from Datastore into chunks that could be processed by different threads
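To illustrate the last point: as far as I can tell, the only way to page with cursors today is the pattern below, where a cursor only becomes available after a page has actually been fetched, so I can't pre-compute N start cursors and hand one to each worker thread (kind name and process() are placeholders):

from google.cloud import datastore

client = datastore.Client()
query = client.query(kind="ReferenceData")

cursor = None
while True:
    page_iter = query.fetch(start_cursor=cursor, limit=500)
    entities = list(next(page_iter.pages))  # pulls one page of up to 500 entities
    if not entities:
        break
    process(entities)                       # placeholder for the diff/update work
    cursor = page_iter.next_page_token      # cursor is only known after the fetch
    if cursor is None:
        break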
Any suggestions on how I could improve the performance?
Attila