Schedule more than 1k tasks with Luigi


Nahal Mirzaie

Jun 14, 2021, 2:39:54 PM
to Luigi
Hi, 
First of all, thank you for such a great module. 

I'm using Luigi to parallelize my CellProfiler pipelines for analyzing High-Throughput Screening images. My pipeline is quite simple and consists of these tasks:
step 1)
create_metadata
calculate_illumination_pattern
step 2)
calculate_well_features
step 3)
calculate_profile

The problem is that my image dataset contains 1340 wells, so my calculate_profile task requires 1340 calculate_well_features tasks to be done. This StackOverflow answer suggested creating a single task and using multiprocessing.Pool inside it for scheduling more than 1k jobs with Luigi.

My questions are: first, does Luigi really have a problem with scheduling more than 1k tasks? Second, what is the best practice for structuring my workflow with Luigi?

This is my first conversation in google groups, so I'm looking forward to your feedback.

Best,
Nahal

Eamonn Faherty

Jun 16, 2021, 5:10:17 AM
to Luigi
I run workflows with 30-40k tasks regularly without issue. The workflows I run perform very few actions themselves; they call APIs and chain responses between tasks using Luigi's requires().

Do keep an eye on memory consumption with a large workflow. Deeply nested requires() chains can lead to large amounts of memory being needed. When I got into that situation, I built a preprocessor to split my workflow into smaller parts.

Eamonn Faherty

Jun 16, 2021, 5:11:54 AM
to Luigi
I have also found that, at that scale, it can take 45-50 minutes for my tasks to be registered before execution starts.

Lars Albertsson

Jun 17, 2021, 3:29:32 PM
to Luigi
1k tasks should not be a problem. What problems are you experiencing?

In a large-scale batch processing system, there is typically concurrency exploitation at two levels. Luigi is useful for coarse concurrency, where concurrent tasks can be heterogeneous and have different dependencies. It scales well to thousands of items, but not to millions (as in the SO post). Tools such as Spark or hand-rolled multiprocessing/multithreading are better at handling fine-grained concurrency, since they do not add overhead per item, but their items must be homogeneous in structure and dependencies. A typical mature batch environment combines coarse concurrency with fine-grained concurrency, for example using Luigi to schedule Spark jobs.