Scheduler fails tasks that aren't actually failed


bespin

Nov 15, 2016, 8:18:34 PM
to Luigi
Issue: The scheduler sometimes records a task as failed mid-run, even though there was no apparent failure.

Details:
I have a simple ETL workflow.  GetData() --> BuildIndexes() --> RunReports() --> AggregateResults()

    GetData() : Extracts data from SQL into a proprietary database format.
    BuildIndexes() : Creates an index on the output results table from GetData().
    RunReports() : Runs a count on the results table from GetData().
    AggregateResults() : Compresses the database and moves it somewhere.


BuildIndexes() starts when its requires() is satisfied. After a minute or so the scheduler logs BuildIndexes() as a failure while the worker process is still chugging along. The worker process continues to run and eventually finishes successfully by writing the appropriate output(). The worker then reports back and the scheduler changes the failure to a success. The node in the graph looks like this over time: yellow --> blue --> red --> green. Nothing is ever logged to screen that indicates an issue.


At this point DEBUG says "There are no more tasks to run at this time" and the summary ends like this:

    ===== Luigi Execution Summary =====

    Scheduled 4 tasks of which:
    * 2 ran successfully:
        - 1 BuildIndexes(gdb=C:\Path\to\file.gdb, fc=points, ndx_name=ndx_field1, field=field1)
        - 1 GetData(...)
    * 2 were left pending, among these:
        * 2 was not granted run permission by the scheduler:
            - 1 AggregateResults()
            - 1 BuildIndexes(gdb=C:\Path\to\file.gdb, fc=points, ndx_name=ndx_field2, field=field2)

    This progress looks :| because there were tasks that were not granted run permission by the scheduler

    ===== Luigi Execution Summary =====
*Sometimes one index will finish and the second will 'fail'. In the example above, the first finished successfully without getting flagged as failed.


Notes:
1) Only happens with the central scheduler; using --local-scheduler always results in complete success without any stops.
2) It appears to be related to the size of the GetData() extract. If I reduce the query to fewer than ~5 million records it always completes successfully; anything above ~5 million records appears to fail. GetData() takes around 15 minutes for this pull; indexing takes only around 5.
3) Adding a long time.sleep(2000) at the start of BuildIndexes().run() seems to work around the issue.
4) Config is pretty stock; the changes are mostly related to email settings.
5) Windows 2012 R2 with Luigi 2.3.3


At this point I'm just fishing for ideas on where to look. What is Luigi picking up on that makes it think there is a failure? Any ideas?

Attached is a graph + simplified demo.

foo.py
dag_01.jpg

Dave Buchfuhrer

Nov 15, 2016, 8:44:55 PM
to bespin, Luigi
It sounds like you might not be pinging the scheduler, and your worker is timing out and getting killed. You can test this by watching the workers tab to see if the last-contacted time keeps getting updated. You could also try adding --workers 2 to your command line to move your work into a different subprocess. Let us know if this fixes your problem; that would indicate we should always be using a subprocess.
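
A rough stdlib illustration of the idea (a toy model, not Luigi's actual implementation): the keep-alive pings come from a separate thread in the worker process, and moving the task body into a subprocess, which is roughly what --workers 2 does, lets that thread keep checking in while the work runs:

```python
# Toy model of a worker keep-alive thread; not Luigi's real code.
import multiprocessing
import threading
import time

def count_pings_during_work(work_seconds=0.5, ping_interval=0.05):
    """Run the 'task' in a subprocess while a ping thread keeps checking in."""
    pings = []
    stop = threading.Event()

    def ping():
        # stand-in for the keep-alive thread updating "Last Checkin"
        while not stop.is_set():
            pings.append(time.time())
            time.sleep(ping_interval)

    t = threading.Thread(target=ping, daemon=True)
    t.start()

    # the task body runs in a separate process, so nothing here blocks pings
    worker = multiprocessing.Process(target=time.sleep, args=(work_seconds,))
    worker.start()
    worker.join()

    stop.set()
    t.join()
    return len(pings)

if __name__ == '__main__':
    # pings keep arriving for the whole duration of the "task"
    print(count_pings_during_work())
```
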

--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

bespin

Nov 15, 2016, 11:15:23 PM
to Luigi, bes...@gmail.com
You might be on to something.


> You can test this by watching the workers tab to see if the last contacted time keeps getting updated.

I think "Last Contacted" = "Last Checkin"?  I've attached a screenshot to make sure we're talking about the same thing.

When running the GetData() task, the "Last Checkin" continually updates. Not sure what the poll rate is, but it updates every time I hit F5; I assumed that was just from the browser refresh. When it gets to BuildIndexes() it stops refreshing, though. The "Time" on the task description updates to the start time and the minute counter starts to increase, but the "Last Checkin" value stays the same. After ~90 seconds the worker fails and disappears from the workers tab.

Not sure if you noticed, but in my Notes I mentioned that adding a long time.sleep() somehow makes BuildIndexes() complete without issue. Not sure how that fits into the theory that it's a worker timeout.


> You could also try adding --workers 2 to your command prompt

Unfortunately on Windows we can't run workers > 1. That's a whole other issue I need to resolve. See issue.




luigi1.jpg

Dave Buchfuhrer

Nov 16, 2016, 2:13:24 AM
to bespin, Luigi
Yeah, I meant last checkin. This should update every second, which is faster than you can reasonably refresh. It looks like you've verified the issue. Try adding --workers 2 to your command prompt and let us know if that fixes it.


nvado...@squarespace.com

Dec 1, 2016, 1:50:43 PM
to Luigi, bes...@gmail.com

I don't see an update for this, but I had this exact problem and using two workers (vs one) fixed the issue for me.

Is there a GitHub issue related to the problem? I didn't see one when I checked.

cam...@novilabs.com

May 8, 2017, 12:13:30 PM
to Luigi, bes...@gmail.com, nvado...@squarespace.com
This is happening to me. Did a GitHub issue ever get created?

bespin

Jun 20, 2017, 5:36:36 PM
to Luigi, bes...@gmail.com
You're correct: now that workers > 1 is fixed on Windows, I've verified that using --workers 2 resolves this issue. One worker is still a problem up through the current version, 2.6.2.
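
For anyone still stuck on a single worker: if I recall correctly, the scheduler prunes a worker that hasn't checked in within its worker_disconnect_delay (60 seconds by default), which would line up with the ~90 second failures above. Raising it in luigi.cfg might paper over the problem. This is an untested sketch; verify the option name against your Luigi version's docs:

```ini
# luigi.cfg -- untested sketch; check option names against your Luigi version
[scheduler]
# how long the scheduler waits without a ping before pruning a worker (seconds)
worker_disconnect_delay = 600
```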