Luigi Execution - Crontab and Terminal


Dillon Stadther

Feb 3, 2016, 8:52:51 AM
to Luigi
I have reached the point where I am putting the dozens of Luigi jobs I've written into production. I'm now running into unexpected walls, so I'd like to hear how others execute their jobs.

Presently, the majority of my jobs are executed by one of two wrapper tasks. These wrapper tasks are scheduled in crontab as follows:

00 05 * * * cd /home/user/path/to/my/luigi/jobs/; python five_utc.py  # wrapper task
00 06 * * * cd /home/user/path/to/my/luigi/jobs/; python emr_job.py
00 10 * * * cd /home/user/path/to/my/luigi/jobs/; python ten_utc.py   # wrapper task
00 11 * * * cd /home/user/path/to/my/luigi/jobs/; python another_job.py

[I have first changed directories so that my client.cfg will be identified (it is located in the same directory as my luigi files).]

The first wrapper (at 5 UTC) launches and runs with no issue. However, the second (at 10 UTC) never runs. The files themselves are identical except for which jobs are yielded within requires().

Note: Both five_utc.py and ten_utc.py successfully run from terminal using the exact commands above.

Has anyone encountered this issue, or a similar one, and can help?


Also, I know that luigi can be executed with 'luigi <task> --module <filename>'. However, I cannot get this to run successfully. I get an import error "no module named <whatever>". This made me wonder: do luigi users have to install their tasks on their systems (i.e. create __init__.py and setup.py, then 'sudo python setup.py install')?


Any help would be greatly appreciated!

Thanks

Dave Buchfuhrer

Feb 3, 2016, 12:03:46 PM
to Dillon Stadther, Luigi
I generally try to avoid installing any Python package at the system level. Instead, I install everything in a virtualenv and use that in my cron jobs, via the luigi/python binary found in the virtualenv's bin folder. I've also recently moved to using http://pantsbuild.github.io/ to build portable binaries that wrap up all of your source code and external libraries. This makes it possible to build and deploy from a CI without having to maintain a consistent environment on each machine.

I've had more success using a single file as my main runner and calling luigi.run() within it, but you should be able to use the recommended luigi command if you're using your virtualenv's luigi binary. With the single-file approach you can pass the WrapperTask's name on the command line, like "python pipeline.py TenUtc --args". I'm not sure what you're doing in your example files, so it's hard to say what's going wrong there.

--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Lars Albertsson

Feb 4, 2016, 7:14:12 AM
to Dave Buchfuhrer, Dillon Stadther, Luigi
Regarding packaging Luigi jobs, I would recommend using setuptools to
build a package that can be installed with pip, e.g. a source tarball.
As Dave suggests, install it in a virtualenv. Your package can then
depend on external python packages, including Luigi, and the versions
do not need to be consistent across systems and your pipeline
packages.

You can also package non-python files, needed by your job, into the
tarball. It could be jars for Spark/Hadoop jobs or other files. If you
bundle them with your Luigi pipeline package, it becomes
self-contained, which eliminates risks of mixing old jars with new
Luigi job definitions or vice versa, e.g. when changing job invocation
parameters.

Regards,



Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109

Uldis Barbans

Feb 4, 2016, 7:17:32 AM
to Dillon Stadther, Luigi

Hi!

The more robust way is to trigger hourly or even more frequently, instead of a particular time of day, and invoke something like

luigi --module your.module RangeDaily --of YourActualTask --start 2016-01-01

as documented in http://luigi.readthedocs.org/en/latest/api/luigi.tools.range.html.

That way you’re saying “finish please as soon as possible, when the dependencies have come in” and “finish the tasks for yesterday (and contiguously back) even if something major blocked them from executing all day”.
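To illustrate the "contiguously back" idea, here is a minimal plain-Python sketch. It is not Luigi's implementation: the function name missing_days and the is_complete callback are made up for illustration, and the real range tool has more options (see the linked docs).

```python
from datetime import date, timedelta


def missing_days(start, today, is_complete):
    """Return every date d with start <= d < today where is_complete(d) is False."""
    days = []
    d = start
    while d < today:
        if not is_complete(d):
            days.append(d)
        d += timedelta(days=1)
    return days


# Hypothetical state: pretend only Jan 1 and Jan 2 have finished so far.
done = {date(2016, 1, 1), date(2016, 1, 2)}
todo = missing_days(date(2016, 1, 1), date(2016, 1, 5), lambda d: d in done)
print(todo)
```

Each invocation recomputes the list, so it does not matter how often cron fires: days that are already complete are skipped, and days that were blocked earlier are picked up on the next run.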

On Wed, Feb 3, 2016 at 2:52 PM, Dillon Stadther <dlsta...@gmail.com> wrote:


The first job (at 5 UTC) launches and runs with no issue. However, the second (at 10 UTC) never runs. The files themselves are identical with the exception of which jobs are yielded within requires( ). 
Note: Both five_utc.py and ten_utc.py successfully run from terminal using the exact commands above.
Has anyone encountered this issue or a similar and can help?

If you can share console output of the 10 UTC invocation, we can help decipher it.

A classic reason might be – if something is already running with a particular task ID, concurrent runs are precluded by the central orchestrator (if you’re using one).



Also, I know that luigi can be executed with 'luigi <task> --module <filename>'. However, I cannot get this to run successfully. I get an import error "no module named <whatever>". This made me wonder: do luigi users have to install their tasks on their systems (i.e. create __init__.py and setup.py, then 'sudo python setup.py install')?

One does not necessarily have to install, but the modules need to be on PYTHONPATH. Typically something like PYTHONPATH=. before python.
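A small sketch of why the search path matters, using only the standard library. The module name my_tasks_demo and the temporary directory are made up for illustration; adding a directory to sys.path has the same effect as listing it in PYTHONPATH before Python starts.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway directory holding a hypothetical module, my_tasks_demo.py.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "my_tasks_demo.py"), "w") as f:
    f.write("NAME = 'my_tasks_demo'\n")


def try_import(name):
    """Return the imported module, or None if Python cannot find it."""
    importlib.invalidate_caches()
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# The directory is not on the search path yet, so the import fails.
# This is the same "no module named ..." situation as in the question.
before = try_import("my_tasks_demo")

# Put the directory on the search path, as PYTHONPATH would at startup.
sys.path.insert(0, workdir)
after = try_import("my_tasks_demo")

print(before is None, after is not None)
```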




--
Uldis "ulzha" Barbans - Software Engineer, Data Infrastructure - Spotify AB

Dillon Stadther

Feb 4, 2016, 8:37:27 AM
to Luigi, dlsta...@gmail.com
As time allows, I will try to round up all my dependencies and move to a virtualenv and install via setuptools as suggested by Dave and Lars.

Regarding the luigi range tools, am I correct that it will make sure that it is run for all days between --start and today? I have specific times of the day (5 and 10 UTC) specified because we use Luigi to perform our daily data warehouse ETL for which some external and internal scripts must run and complete prior to our execution. A number of our dependencies are, in fact, constantly being updated, but we only want a daily snapshot of them.

The issue with the 10 UTC invocation is that it does not occur via crontab when scheduled (it does not even show up in the cron logs 'grep CRON /var/log/syslog').

For giggles, I just changed the crontab execution time to be 2 minutes into the future, expecting the usual fate. However, I was completely caught off guard when it worked! The cron log for today is below (sorry for the size):

Basically, the cron log shows that it never executed anything from the ubuntu user between 6 UTC and 13:05 UTC (where there should have been a cron execution at 10 UTC).

Uldis Barbans

Feb 4, 2016, 10:03:10 AM
to Dillon Stadther, Luigi

On Thu, Feb 4, 2016 at 2:37 PM, Dillon Stadther <dlsta...@gmail.com> wrote:

As time allows, I will try to round up all my dependencies and move to a virtualenv and install via setuptools as suggested by Dave and Lars.

Regarding the luigi range tools, am I correct that it will make sure that it is run for all days between --start and today? I have specific times of the day (5 and 10 UTC) specified because we use Luigi to perform our daily data warehouse ETL for which some external and internal scripts must run and complete prior to our execution. A number of our dependencies are, in fact, constantly being updated, but we only want a daily snapshot of them.

That's a typical use for Luigi's completeness checking. You can execute whatever logic you need in an ExternalTask's complete() to check whether the preconditions for the ETL are met. For the ETL job itself, include the date in the output (create a dummy marker output if that isn't easy to do natively in the data warehouse), so that Luigi sees the ETL job as complete on subsequent retries and does nothing.
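The dated-marker idea can be sketched in plain Python as below. This is not Luigi code: MARKER_DIR, marker_path, is_complete and run_etl are hypothetical names, and in a Luigi task the dated path would instead be returned from output() so that the framework performs the check.

```python
import os
import tempfile
from datetime import date

MARKER_DIR = tempfile.mkdtemp()  # hypothetical location for marker files


def marker_path(day):
    """One marker per day, with the date in the file name."""
    return os.path.join(MARKER_DIR, "etl_{}.done".format(day.isoformat()))


def is_complete(day):
    """A day counts as complete once its marker file exists."""
    return os.path.exists(marker_path(day))


def run_etl(day, work):
    """Run work(day) unless that day's marker already exists.

    Returns True if the work ran, False if it was skipped.
    """
    if is_complete(day):
        return False
    work(day)
    # Write the marker only after the work has succeeded.
    with open(marker_path(day), "w") as f:
        f.write("done\n")
    return True


calls = []
first = run_etl(date(2016, 2, 4), calls.append)
second = run_etl(date(2016, 2, 4), calls.append)  # retry for the same day
print(first, second, len(calls))
```

Because the marker name contains the date, a retry later the same day is a no-op, while the next day's run has no marker yet and proceeds normally.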


The issue with the 10 UTC invocation is that it does not occur via crontab when scheduled (it does not even show up in the cron logs 'grep CRON /var/log/syslog').

For giggles, I just changed the crontab execution time to be 2 minutes into the future, expecting the usual fate. However, I was completely caught off guard when it worked! The cron log for today is below (sorry for the size):

Basically, the cron log shows that it never executed anything from the ubuntu user between 6 UTC and 13:05 UTC (where there should have been a cron execution at 10 UTC).

Weird. I won’t be able to help troubleshoot that.

Dave Buchfuhrer

Feb 4, 2016, 11:12:21 AM
to Uldis Barbans, Dillon Stadther, Luigi
You probably just had a bug in your crontab. Maybe a weird unicode lookalike character or something. I've had bugs like that before when people sent me code over Slack and quote marks got changed into prettier versions. They looked identical in my terminal but re-typing them fixed the bug.
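One way to check for lookalike characters is to scan the crontab text for anything outside ASCII. The sketch below is only an illustration: find_non_ascii is a made-up helper, and the sample text with a no-break space is hypothetical, not the actual crontab from this thread.

```python
def find_non_ascii(text):
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


# Hypothetical crontab content: the second entry has a no-break space (U+00A0)
# between "10" and "*", which looks like an ordinary space in a terminal.
sample = (
    "00 05 * * * cd /home/user/path/to/my/luigi/jobs/; python five_utc.py\n"
    "00 10\u00a0* * * cd /home/user/path/to/my/luigi/jobs/; python ten_utc.py\n"
)

for line_no, col, ch in find_non_ascii(sample):
    print("line {}, column {}: U+{:04X}".format(line_no, col, ord(ch)))
```

To check a real crontab, the output of 'crontab -l' could be read from standard input and passed to the same function.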

Yu Vicki Fu

Oct 1, 2016, 4:59:01 PM
to Luigi, dlsta...@gmail.com
Thanks, Uldis, for pointing this out.
But how do I add luigi --module your.module RangeDaily --of YourActualTask --start 2016-01-01
to a cron job? With cron you can set the time at which something runs, but the command you gave only works by day.
Please give a more detailed example. For instance, if I want a job to run at the 4th minute of every hour, how do I set that up with the luigi command?
Thanks,
Vicky Fu

On Thursday, February 4, 2016 at 7:17:32 AM UTC-5, Uldis Barbans wrote: