Query getting progressively slower even though there is an index

Muratti Andolini

unread,

Jan 27, 2023, 5:26:49 PM1/27/23

to Google Cloud Datastore

Hey folks,

I have a very simple use case, and created a composite index as suggested by Google datastore, but as soon as deployed my query works but is getting progressively slower from ~200ms to ~2s within 2 hours. I am trying to see what I am doing wrong or can do differently to keep the performance to initial levels.

Use Case: I am receiving file paths for a dataset (think of it as customer name) with empty batch id, and assigning a batch id (after which another process uses them but that's out of our scope)

There are about 300 million File records in total in File kind right now - they are all already processed and have a set batchId.

Every few minutes, the receiver process will insert about 10,000 new File records with empty batch_id for a given dataset and do nothing until it receives another set of files within 3-5 minutes.

On the processing side, the background processor queries all Files with empty batch_id and assigns them a batch_id and updates the datastore record.

Once the query is done and files are assigned batchId's and updated in the datastore, the query will return 0 records until we receive more files in the next few minutes.

So the number of files this query will return is going to be between 0 to 10,000 at any moment regardless of how many total files in the table. And since I have a composite index defined for it that partitions it for the given dataset first and then with batchId, I don't understand how it is getting progressively worse from 200ms to 2s when the query result changes only between 0 to 10,000 records every few minutes and not progressively getting bigger. Neither the table will change much in that 2 hours window, the record count will change maybe from 300 Mil to 300.5 Mil.

so there is a File Kind with three properties dataset, batchId, arrivalInstant

(Key - being the file path/name) there is no ancestor entity.

- dataset (String) is separation between different customers, each customer has a different dataset name and we always perform the queries and operations on dataset level, meaning each dataset is processed on a different thread)

- batchId (String) is empty by default when a File record is written to Datastore

- arrivalInstant (Integer) is timestamp of file arrival at the server.

The query done is as below

kind("File")

filter("dataset = ", dataset)

filter("batchId = ", "")

order("-arrivalInstant") // order by latest file first

For this query, I created a composite index as below

- kind: File

properties:

- name: dataset

- name: batchId

- name: arrivalInstant

direction: desc

Note: For the record, as another experiment, I swapped the positions of dataset and batchId, to do the query as below with the index below it, but I got the same exact gradually worsening performance.

Query

kind("File")

filter("batchId = ", "") //filtering with batchId first

filter("dataset = ", dataset)

order("-arrivalInstant") // order by latest file first

Index:

- kind: File

properties:

- name: batchId //Index to partition by batchId first as well

- name: dataset

- name: arrivalInstant

direction: desc

Jim Morrison

unread,

Jan 27, 2023, 5:36:55 PM1/27/23

to Muratti Andolini, Google Cloud Datastore

You are skipping over deleted index data with your index scans. You can find details at https://firebase.google.com/docs/firestore/best-practices#avoid_skipping_over_deleted_data . Your best bet would be to add a minimum arrivalInstant to your queries to avoid skipping over the tombstoned index entries.

Jim

--
You received this message because you are subscribed to the Google Groups "Google Cloud Datastore" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gcd-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gcd-discuss/91975b53-ad32-4581-bcec-b1099e05a76bn%40googlegroups.com.

--

Jim

Muratti Andolini

unread,

Jan 27, 2023, 5:48:14 PM1/27/23

to Google Cloud Datastore

Thanks for the hint. Just curious though - I am not deleting any File records, only updating. In that case are index entries are updated or deleted and re-created? Trying to understand if it applies to my use case here.

Jim Morrison

unread,

Jan 29, 2023, 12:03:59 PM1/29/23

to Muratti Andolini, Google Cloud Datastore

Updating a record means that if an field changes values (or any field in a composite index entry) the previous index entry needs to be deleted and a new one created.

Jim

To view this discussion on the web visit https://groups.google.com/d/msgid/gcd-discuss/c82cdfa6-2f4b-40e2-96f0-5deab8a9bda4n%40googlegroups.com.