We use Cloud Storage to store large Elasticsearch results (from aggregations, so scan+scroll won't work here).
To process these large aggregations in parallel, we store them as multiline JSON dumps written from a managed VM.
To perform that parallel processing, many App Engine instances open this file at once, and as a result they hit the URLFetch rate limit because of this documented limitation:
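For context, a newline-delimited JSON dump of aggregation buckets looks roughly like this (a minimal sketch; the field names are illustrative, not our actual schema). Each line is an independent JSON document, which is what lets separate instances parse their own slice of the file:

```python
import json

# Illustrative stand-in for the dump: one aggregation bucket per line.
dump = "\n".join([
    json.dumps({"key": "bucket-a", "doc_count": 120}),
    json.dumps({"key": "bucket-b", "doc_count": 45}),
])

# Any worker can parse its lines independently of the others.
records = [json.loads(line) for line in dump.splitlines()]
```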
"…and the calls count against your URL fetch quota, as the library uses the URL Fetch service to interact with Cloud Storage."
Here's the resulting exception:
Here's the code that opens the file:
import cloudstorage as gcs

def open_file(path, mode, **kwargs):
    f = gcs.open(path, mode=mode, **kwargs)
    if not f:
        raise Exception("File could not be opened: %s" % path)
    return f
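As a stopgap we have considered wrapping the open call in exponential backoff so transient over-quota errors are retried rather than failing the request outright. This does not lift the quota itself. Below is a minimal pure-Python sketch of that wrapper; the `flaky` function, the exception type, and the delay values are illustrative stand-ins for the real `gcs.open` call and the actual over-quota exception:

```python
import time

def with_backoff(fn, retriable=(Exception,), max_retries=5, initial_delay=0.5):
    """Call fn(), retrying with exponential backoff on retriable errors.

    In our case fn would be a closure around gcs.open and retriable
    would include the over-quota exception raised via URLFetch.
    """
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return fn()
        except retriable:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the original error
            time.sleep(delay)
            delay *= 2  # double the wait between attempts

# Illustrative usage: a call that fails twice, then succeeds.
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("simulated rate-limit error")
    return "opened"

result = with_backoff(flaky, retriable=(IOError,), initial_delay=0.01)
```

The GAE `cloudstorage` library also accepts a `retry_params` argument for built-in retries, but that still routes every attempt through URLFetch, so it does not address the quota problem described below.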
--
We need a way to communicate with Cloud Storage that bypasses the URLFetch quotas and rate limits; otherwise it is impossible for us to run this parallel processing effectively.
Is there a method of reading GCS files from App Engine that does not route through URLFetch, in the same way that the Datastore API does not incur URLFetch rate limits?