Keeping track of tasks that write to database targets


badgley

Aug 7, 2013, 3:00:16 AM
to luigi...@googlegroups.com
Just had a question about how other people are handling this problem -- or maybe where the development roadmap comes down on the following issue: 

The majority of luigi targets are filesystem targets (my impression) -- which makes it relatively straightforward to implement their 'exists' methods. For the postgres target included in the library, however, you have this whole table_updates table which keeps track of what tasks have run. As I understand it, this table allows easy lookups/checkpointing to determine what tasks have run and what data exists in the postgres database.

But how should I be thinking about this sort of checkpointing as I implement other database targets or database-oriented tasks? If I want to dump some data into mongo or redshift or whatever else -- should I have one central database collection (e.g., postgres 'table_updates') that keeps track of the tasks that have run? It feels a little silly to have to reimplement this sort of checkpoint-type feature for each additional datastore. I mean -- why have a mongo updates/checkpoint/task collection that does the exact same thing as the postgres table?

But then should the solution be that my luigi scheduler has access to an all-encompassing datastore that it checks for tasks that don't leave behind a file in a filesystem? This is the sort of stuff that I feel like I could really easily overcomplicate or screw up if left to my own devices. Feedback/thoughts/avenues much appreciated.


Grayson

Erik Bernhardsson

Aug 7, 2013, 7:09:22 PM
to badgley, luigi...@googlegroups.com
On Wed, Aug 7, 2013 at 3:00 AM, badgley <grayson...@gmail.com> wrote:
> Just had a question about how other people are handling this problem -- or maybe where the development roadmap comes down on the following issue:
>
> The majority of luigi targets are filesystem targets (my impression) -- which makes it relatively straightforward to implement their 'exists' methods. For the postgres target included in the library, however, you have this whole table_updates table which keeps track of what tasks have run. As I understand it, this table allows easy lookups/checkpointing to determine what tasks have run and what data exists in the postgres database.
>
> But how should I be thinking about this sort of checkpointing as I implement other database targets or database-oriented tasks? If I want to dump some data into mongo or redshift or whatever else -- should I have one central database collection (e.g., postgres 'table_updates') that keeps track of the tasks that have run? It feels a little silly to have to reimplement this sort of checkpoint-type feature for each additional datastore. I mean -- why have a mongo updates/checkpoint/task collection that does the exact same thing as the postgres table?

Yeah, I kind of agree.
 

> But then should the solution be that my luigi scheduler has access to an all-encompassing datastore that it checks for tasks that don't leave behind a file in a filesystem? This is the sort of stuff that I feel like I could really easily overcomplicate or screw up if left to my own devices. Feedback/thoughts/avenues much appreciated.


We have some Cassandra import tasks, and what we usually do is just write an empty checkpoint file after we're done.

You're bringing up a good point, which is that Target objects serve two purposes:
(1) existence checks
(2) easy semantics for reading and modifying data inside the run() method

If you don't need (2), then you can just create some dummy file somewhere to mark that the ingestion was successful.
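The dummy-file idea reduces to a marker on shared storage plus an existence check -- in Luigi you would typically return a luigi.LocalTarget (or an HDFS target) from the task's output() and touch it at the end of run(). A minimal stdlib-only sketch, with a made-up path layout:

```python
import os

def checkpoint_path(task_name, date):
    # Hypothetical layout; the marker can live anywhere all workers can see.
    return os.path.join("/tmp/checkpoints", "%s_%s.done" % (task_name, date))

def mark_complete(path):
    """Touch an empty marker file -- call this only after the real work
    (e.g. the Cassandra import) has actually succeeded."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

def is_complete(path):
    # This is all a Target's exists() needs to do for purpose (1);
    # the file's contents are never read, so purpose (2) does not apply.
    return os.path.exists(path)
```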
 

> Grayson




--
Erik Bernhardsson
Engineering Manager, Spotify, New York

Joe Ennever

Aug 7, 2013, 7:21:57 PM
to Erik Bernhardsson, badgley, luigi...@googlegroups.com
We have our own 'PostgresTarget' that works well for most of our Redshift tasks. Its main parameters are the table and a predicate. The exists() method returns True if "SELECT COUNT(*) FROM table WHERE predicate" is > 0. The predicate is usually something like "date = %s" % self.date. Since most of our Redshift tasks load in data a day at a time, they don't run if there is already data for that day. If you need to re-run a particular day, you can just delete the data, instead of deleting the data and the dummy file. Also, accidentally deleting the dummy file won't create duplicate data.
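A sketch of that idea in plain DB-API terms. Everything here is hedged: 'PredicateTarget' and its arguments are invented names (not Foursquare's actual class), 'connect' stands for any DB-API connection factory (e.g. psycopg2.connect for Redshift/Postgres), and the comparison uses count > 0, i.e. the target exists once at least one row matches.

```python
class PredicateTarget:
    """Invented name for the idea behind a predicate-based target:
    the target 'exists' when any row matches table + predicate."""

    def __init__(self, connect, table, predicate):
        self.connect = connect      # any DB-API connection factory
        self.table = table          # e.g. "daily_events"
        self.predicate = predicate  # e.g. "date = '2013-08-07'"

    def count_query(self):
        # NOTE: table and predicate are string-interpolated, not
        # parameterized -- fine for trusted task code, not user input.
        return "SELECT COUNT(*) FROM %s WHERE %s" % (self.table, self.predicate)

    def exists(self):
        conn = self.connect()
        try:
            cur = conn.cursor()
            cur.execute(self.count_query())
            (count,) = cur.fetchone()
            return count > 0
        finally:
            conn.close()
```

Re-running a day then only requires deleting that day's rows; there is no separate marker file to keep in sync with the data.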

If your tasks don't fit into that model, then I agree with Erik's solution.
Joe Ennever
Server Engineer
Foursquare

Dennis Jakobsen

Aug 7, 2013, 7:55:54 PM
to Joe Ennever, Erik Bernhardsson, badgley, luigi...@googlegroups.com
How does the scheduler handle multiple nodes? Say you use local files to track whether a task has completed: it is critical that the exists() method be called on the machine where the task executed last time and created its local file, or that task would run again.

Erik Bernhardsson

Aug 7, 2013, 7:56:33 PM
to Dennis Jakobsen, Joe Ennever, badgley, luigi...@googlegroups.com
We usually just put the file in HDFS to achieve that.