What's the best way to process a large queue?

T3db0t

May 29, 2019, 5:12:35 PM
to MongoDB Stitch Users
I have a collection of 1.7M records that are just IDs from Wikidata. For each of those IDs, I need to write a record to a new collection and then delete the ID from the queue. (Technically it's not a queue, since no new records are coming in; they're all already there.) Given the number of records, I'd like to do this with some kind of controllable concurrency. I'd like to be able to run a script from wherever (including my laptop), pick whichever record is next, and process it. They don't need to be processed in any order.

I have some ideas for how to approach this, but I figured I'd throw it out here to see what I'm missing.
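
For what it's worth, here's roughly the kind of claim-and-process loop I was picturing running from my laptop with the Node driver. The connection string, collection names, and processWikidataId() are all placeholders (untested sketch):

// Rough sketch: N concurrent workers, each atomically claiming the next ID.
const { MongoClient } = require('mongodb');

const CONCURRENCY = 8; // controllable concurrency

// Placeholder for whatever actually builds the new record from a Wikidata ID.
async function processWikidataId(idDoc) {
  return { wikidataId: idDoc._id };
}

async function worker(queue, output) {
  while (true) {
    // Atomically claim an unclaimed ID so two workers never grab the same one.
    const res = await queue.findOneAndUpdate(
      { claimedAt: { $exists: false } },
      { $set: { claimedAt: new Date() } }
    );
    if (!res.value) break; // nothing left to claim
    const record = await processWikidataId(res.value);
    await output.insertOne(record);                // write the new record...
    await queue.deleteOne({ _id: res.value._id }); // ...then delete the ID from the queue
  }
}

async function main() {
  const client = await MongoClient.connect('mongodb+srv://...', { useNewUrlParser: true });
  const db = client.db('mydb');
  await Promise.all(
    Array.from({ length: CONCURRENCY }, () =>
      worker(db.collection('wikidata_ids'), db.collection('processed'))
    )
  );
  await client.close();
}

main().catch(console.error);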

—Ted

Adam Chelminski

Jun 3, 2019, 1:52:38 PM
to MongoDB Stitch Users
Hi Ted,

There are definitely many ways to approach this, but here are two things you should keep in mind if you want to do this purely with Stitch:

1. If you want to process the records from a Stitch function, keep in mind that functions have a 60-second execution time limit; they're not intended for long-running operations. However, you could set up a Stitch function that runs on a scheduled trigger (see https://docs.mongodb.com/stitch/triggers/scheduled-triggers/) and processes a batch of IDs on each run. Measure how long your per-ID processing takes and size the batches so each run finishes well under 60 seconds (there's a rough sketch after point 2).

2. If you're starting off with a fresh collection, you could instead set up a database trigger that runs a single-document processing function for every insert into the collection (see https://docs.mongodb.com/stitch/triggers/database-triggers/). That way, the collection itself is the queue. Just be aware that triggers can be suspended during an intermittent outage of Stitch or your Atlas cluster, so you'll still want a function or scheduled trigger to drain the queue in case any insert events are missed and those documents never get processed. There's a sketch of this kind of trigger function below as well.
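
For option 1, the scheduled-trigger function might look roughly like this. It's just a sketch: the database/collection names, the "mongodb-atlas" service name, and buildRecord() are assumptions you'd replace with your own.

// Sketch of a scheduled-trigger function (option 1).
// Assumes the linked cluster service is named "mongodb-atlas" and that
// BATCH_SIZE has been measured to complete well under the 60-second limit.
exports = async function() {
  const db = context.services.get("mongodb-atlas").db("mydb");
  const queue = db.collection("wikidata_ids");  // the 1.7M-ID collection
  const output = db.collection("processed");    // the new collection

  const BATCH_SIZE = 200; // tune so one run stays under 60 seconds

  // Placeholder for whatever per-ID record you actually need to build.
  const buildRecord = (idDoc) => ({ wikidataId: idDoc._id });

  const batch = await queue.find({}).limit(BATCH_SIZE).toArray();
  for (const idDoc of batch) {
    await output.insertOne(buildRecord(idDoc)); // write the new record...
    await queue.deleteOne({ _id: idDoc._id });  // ...then remove the ID from the queue
  }
  return `processed ${batch.length} IDs`;
};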
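
And for option 2, the insert-trigger function receives the change event for each newly inserted ID. Again just a sketch with placeholder names:

// Sketch of the database-trigger function (option 2).
// For insert events, changeEvent.fullDocument is the document that was inserted.
exports = async function(changeEvent) {
  const db = context.services.get("mongodb-atlas").db("mydb");
  const queue = db.collection("wikidata_ids");
  const output = db.collection("processed");

  const idDoc = changeEvent.fullDocument;
  await output.insertOne({ wikidataId: idDoc._id }); // placeholder record shape
  await queue.deleteOne({ _id: idDoc._id });         // remove the processed ID
};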

Let me know if you have any questions.

-Adam