Issue: The scheduler sometimes records a task as failed while the task is still running, even though there is no apparent failure.
Details:
I have a simple ETL workflow: GetData() --> BuildIndexes() --> RunReports() --> AggregateResults()
GetData() : Extracts data from SQL into a proprietary database format.
BuildIndexes() : Creates an index on the results table output by GetData().
RunReports() : Runs a count on the results table from GetData().
AggregateResults() : Compresses the database and moves it somewhere.
BuildIndexes() starts once its requires() is satisfied. After a minute or so the scheduler logs BuildIndexes() as a failure while the worker process is still chugging along. The worker continues to run and eventually finishes successfully, writing the appropriate output(). The worker then reports back and the scheduler changes the failure to a success. Over time the node in the graph goes yellow --> blue --> red --> green. Nothing is ever logged to the screen that indicates an issue.
At this point DEBUG says "There are no more tasks to run at this time" and the summary ends like this:
===== Luigi Execution Summary =====
Scheduled 4 tasks of which:
* 2 ran successfully:
- 1 BuildIndexes(gdb=C:\Path\to\file.gdb, fc=points, ndx_name=ndx_field1, field=field1)
- 1 GetData(...)
* 2 were left pending, among these:
* 2 was not granted run permission by the scheduler:
- 1 AggregateResults()
- 1 BuildIndexes(gdb=C:\Path\to\file.gdb, fc=points, ndx_name=ndx_field2, field=field2)
This progress looks :| because there were tasks that were not granted run permission by the scheduler
===== Luigi Execution Summary =====
*Sometimes one index finishes and the second 'fails'. In the example above, the first index finished successfully without being flagged as failed.
Notes:
1) This only happens with the central scheduler; running with --local-scheduler always completes successfully, without any stops.
2) It appears to be related to the size of the GetData() extract. If I reduce the query to extract fewer than ~5 million records, the run always seems to complete successfully; anything above 5 million records appears to fail. GetData() takes around 15 minutes for this pull; indexing takes only around 5.
3) Adding a long time.sleep(2000) at the start of BuildIndexes().run() seems to work around the issue.
4) The config is pretty stock; changes are mostly related to email settings.
5) Windows Server 2012 R2 with Luigi 2.3.3.
At this point I'm just fishing for ideas on where to look. What is Luigi picking up on that makes it think there is a failure? Any ideas?
A graph and a simplified demo are attached.