I have a question about how other people are handling this problem -- or maybe about where the development roadmap comes down on the following issue:
The majority of luigi targets are filesystem targets (my impression) -- which makes it relatively straightforward to implement the 'exists' method. For the postgres target included in the library, however, you have this whole table_updates table which keeps track of which tasks have run. As I understand it, this table allows easy lookups/checkpointing to determine which tasks have run and what data exists in the postgres database.
But how should I be thinking about this sort of checkpointing as I implement other database targets or database-oriented tasks? If I want to dump some data into mongo or redshift or whatever else -- should I have one central database collection (e.g., postgres's 'table_updates') that keeps track of the tasks that have run? It feels a little silly to have to reimplement this sort of checkpoint feature for each additional datastore. I mean -- why have a mongo updates/checkpoint/task collection that does the exact same thing as the postgres table?
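To make the "one central checkpoint store" idea concrete, here is a minimal sketch of what such a target might look like. Everything here is hypothetical -- the class name, the sqlite marker db standing in for a central store, and the update_id scheme are all made up for illustration; a real implementation would subclass luigi.Target and the actual writes would go to mongo/redshift, with only the completion marker living in the shared table:

```python
import sqlite3
import tempfile

class MarkerStoreTarget:
    """Hypothetical target whose exists() consults a central marker table
    (in the spirit of luigi's postgres table_updates) rather than the
    datastore that holds the actual output data."""

    def __init__(self, marker_db, update_id):
        self.marker_db = marker_db  # path to the shared checkpoint database
        self.update_id = update_id  # unique id for one task run

    def _connect(self):
        conn = sqlite3.connect(self.marker_db)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS table_updates ("
            "update_id TEXT PRIMARY KEY, "
            "inserted TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
        )
        return conn

    def touch(self):
        # Called at the end of the task's run(), after the real write
        # (to mongo, redshift, etc.) has succeeded.
        with self._connect() as conn:
            conn.execute(
                "INSERT OR REPLACE INTO table_updates (update_id) VALUES (?)",
                (self.update_id,),
            )

    def exists(self):
        # The scheduler's completeness check: has this update_id been recorded?
        with self._connect() as conn:
            row = conn.execute(
                "SELECT 1 FROM table_updates WHERE update_id = ?",
                (self.update_id,),
            ).fetchone()
        return row is not None

# Usage sketch (update_id is an invented naming convention):
_db = tempfile.NamedTemporaryFile(suffix=".db", delete=False).name
target = MarkerStoreTarget(_db, "DumpToMongo(date=2015-01-01)")
before = target.exists()  # task has not run yet
target.touch()            # record completion in the central marker table
after = target.exists()   # now the scheduler would consider the task done
```

The appeal is that any datastore-backed task could reuse the same marker table, so only touch()/exists() ever need to know about the checkpoint store -- but I'm not sure whether this is the direction the library intends.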
But then should the solution be that my luigi scheduler has access to an all-encompassing datastore that it checks for tasks that don't leave behind a file in a filesystem? This is the sort of thing I feel like I could really easily overcomplicate or screw up if left to my own devices. Feedback/thoughts/avenues much appreciated.
Grayson