Hey folks,
I have a very simple use case, and created a composite index as suggested by Google datastore, but as soon as deployed my query works but is getting progressively slower from ~200ms to ~2s within 2 hours. I am trying to see what I am doing wrong or can do differently to keep the performance to initial levels.
Use Case: I am receiving file paths for a dataset (think of it as customer name) with empty batch id, and assigning a batch id (after which another process uses them but that's out of our scope)
There are about 300 million File records in total in File kind right now - they are all already processed and have a set batchId.
Every few minutes, the receiver process will insert about 10,000 new File records with empty batch_id for a given dataset and do nothing until it receives another set of files within 3-5 minutes.
On the processing side, the background processor queries all Files with empty batch_id and assigns them a batch_id and updates the datastore record.
Once the query is done and files are assigned batchId's and updated in the datastore, the query will return 0 records until we receive more files in the next few minutes.
So the number of files this query will return is going to be between 0 to 10,000 at any moment regardless of how many total files in the table. And since I have a composite index defined for it that partitions it for the given dataset first and then with batchId, I don't understand how it is getting progressively worse from 200ms to 2s when the query result changes only between 0 to 10,000 records every few minutes and not progressively getting bigger. Neither the table will change much in that 2 hours window, the record count will change maybe from 300 Mil to 300.5 Mil.
so there is a File Kind with three properties dataset, batchId, arrivalInstant
(Key - being the file path/name) there is no ancestor entity.
- dataset (String) is separation between different customers, each customer has a different dataset name and we always perform the queries and operations on dataset level, meaning each dataset is processed on a different thread)
- batchId (String) is empty by default when a File record is written to Datastore
- arrivalInstant (Integer) is timestamp of file arrival at the server.
The query done is as below
kind("File")
filter("dataset = ", dataset)
filter("batchId = ", "")
order("-arrivalInstant") // order by latest file first
For this query, I created a composite index as below
- kind: File
properties:
- name: dataset
- name: batchId
- name: arrivalInstant
direction: desc
Note: For the record, as another experiment, I swapped the positions of dataset and batchId, to do the query as below with the index below it, but I got the same exact gradually worsening performance.
Query
kind("File")
filter("batchId = ", "") //filtering with batchId first
filter("dataset = ", dataset)
order("-arrivalInstant") // order by latest file first
Index:
- kind: File
properties:
- name: batchId //Index to partition by batchId first as well
- name: dataset
- name: arrivalInstant
direction: desc