We're running a chat system on realtime database and we clear some of the oldest threads daily, for business purposes. We have a `/conversations` endpoint in the db, where all conversations are stored, this grew very large overtime, I'm not sure how large, but it's considerably large since we store close to around 150GB of data in the realtime db - but that amount includes denormalized data duplications for each conversation metadata, so this endpoint clearly hold less than 150GB (20-30GB I assume).
Conversations objects under `/conversations` have a createdAt timestamp field, which is indexed using ".indexOn" from security rules. We query by this child field using Python Firebase Admin SDK, and the query is as follow:
Using a Python generator:
firebase_admin
.db
.reference('/conversations')
.order_by_child('createdAt')
.start_at(min_created_at) <- updates after fully consumed
.end_at(max_created_at)
.limit_to_first(batch_size)
We have an archiving task that currently runs daily. What happens is that the first run of this query blocks (freezes) the db for around 5 minutes, if the 5 minutes mark is exceeded then the task fails with a "Failed to establish a connection httpsconnectionpool". After the db unblocks, the same query runs instantaneous, its execution time being around 0ms. After profiling using firebase database:profile, it shows that the blocking query is taking 300 seconds to execute and any subsequent query of this type takes 0-2ms. Blocking the entire database for 5 minutes is as you can guess, unacceptable.
I'm writing here to debate my proposed theory and fix to this and understand from the firebase team why this happens.
My theory: assuming this is an expensive query (understandable, millions of items to be filtered), and rarely executed, there's some caching involved at some level, which is discarded at some point since it's not executing anymore. We've found that waiting 24h between executions and 6h between executions do have the same outcome and the first request blocks the db. A fix for this theory (if it's a valid one) would be making this a minute task, and querying every minute, to keep the cache valid and execution times low.