Query getting progressively slower even though there is an index


Muratti Andolini

Jan 27, 2023, 5:26:49 PM
to Google Cloud Datastore
Hey folks, 
I have a very simple use case and created a composite index as suggested by Google Datastore. The query works as soon as it is deployed, but it gets progressively slower, going from ~200ms to ~2s within 2 hours. I am trying to figure out what I am doing wrong, or what I could do differently, to keep performance at its initial level.

Use case: I receive file paths for a dataset (think of it as a customer name) with an empty batch id, and I assign them a batch id (after which another process uses them, but that is out of scope here).

There are about 300 million File records in the File kind right now; they have all already been processed and have a batchId set.

Every few minutes, the receiver process inserts about 10,000 new File records with an empty batchId for a given dataset, then does nothing until the next set of files arrives 3-5 minutes later.

On the processing side, the background processor queries all Files with an empty batchId, assigns each a batchId, and updates the Datastore record.
 
Once the query is done and the files have been assigned batchIds and updated in Datastore, the query returns 0 records until we receive more files a few minutes later.
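For context, the processor's loop is essentially the following (a pseudocode sketch only, in the same spirit as the query syntax further down; run(), put(), and newBatchId() are placeholders, not real API calls):

for file in run(query):           // query = the empty-batchId query shown below
    file.batchId = newBatchId()   // mint and assign the new batch id
    put(file)                     // rewrite the entity and, with it, its index entries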

So the number of files this query returns is between 0 and 10,000 at any moment, regardless of how many files are in the kind in total. And since I have a composite index that partitions first by dataset and then by batchId, I don't understand how the query gets progressively worse from 200ms to 2s when the result set only ever varies between 0 and 10,000 records every few minutes and is not growing. Nor does the kind itself change much in that 2-hour window; the record count goes from roughly 300 million to 300.5 million.

So there is a File kind with three properties: dataset, batchId, and arrivalInstant.
(The key is the file path/name; there is no ancestor entity.)

- dataset (String) separates the different customers; each customer has a different dataset name, and we always perform queries and operations at the dataset level (meaning each dataset is processed on a different thread)
- batchId (String) is empty by default when a File record is written to Datastore
- arrivalInstant (Integer) is the timestamp of the file's arrival at the server.

The query being run is as below:
kind("File")
filter("dataset = ", dataset)
filter("batchId = ", "")
order("-arrivalInstant") // order by latest file first

For this query, I created a composite index as below:
- kind: File
  properties:
    - name: dataset
    - name: batchId
    - name: arrivalInstant
      direction: desc


Note: As another experiment, I also swapped the positions of dataset and batchId, running the query below with the matching index below it, but I got exactly the same gradually worsening performance.

Query
kind("File")
filter("batchId = ", "") //filtering with batchId first
filter("dataset = ", dataset)
order("-arrivalInstant") // order by latest file first

Index:
- kind: File
  properties:
    - name: batchId  # index partitions by batchId first this time
    - name: dataset
    - name: arrivalInstant
      direction: desc

[Attachment: Image 1-27-23 at 2.12 PM.jpg]

Jim Morrison

Jan 27, 2023, 5:36:55 PM
to Muratti Andolini, Google Cloud Datastore
You are skipping over deleted index data with your index scans. You can find details at https://firebase.google.com/docs/firestore/best-practices#avoid_skipping_over_deleted_data. Your best bet would be to add a minimum arrivalInstant to your queries to avoid skipping over the tombstoned index entries.
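For example, something like the following (a sketch only, written in the same pseudocode style as the original post; lastBatchCutoff is a hypothetical value such as the largest arrivalInstant that has already been batched, or the start time of the previous processing run):

kind("File")
filter("dataset = ", dataset)
filter("batchId = ", "")
filter("arrivalInstant > ", lastBatchCutoff) // inequality on the same property as the sort order
order("-arrivalInstant")

The existing (dataset, batchId, arrivalInstant desc) composite index can still serve this query, since the inequality is on the same property as the first sort order; the lower bound simply stops the scan before it walks through the tombstoned entries left behind by earlier batches.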

Jim


Muratti Andolini

Jan 27, 2023, 5:48:14 PM
to Google Cloud Datastore
Thanks for the hint. Just curious though: I am not deleting any File records, only updating them. In that case, are the index entries updated in place, or deleted and re-created? Trying to understand whether this applies to my use case here.

Jim Morrison

Jan 29, 2023, 12:03:59 PM
to Muratti Andolini, Google Cloud Datastore
Updating a record means that if a field changes value (or any field that is part of a composite index entry changes), the previous index entry needs to be deleted and a new one created.
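For this schema that looks roughly like the following (illustrative values only, not real data):

composite index (dataset, batchId, arrivalInstant desc), entry for one File
  before the update: ("customerA", "",         1674850000) -> /File/path1   // deleted on update, left behind as a tombstone
  after the update:  ("customerA", "batch-42", 1674850000) -> /File/path1   // newly written entry

So each processing run leaves on the order of 10,000 tombstones under the (dataset, "") prefix that the next empty-batchId scan still has to walk past until they are garbage-collected, which is why the query slows down even though the live result set stays small.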

Jim


