Lots of inputs


Burner Account

Jun 19, 2024, 7:23:49 PM
to Luigi
Hi all,

So I have some tasks that I want to do and they form a DAG, which is why I was drawn to Luigi.

[Attachment: signal-2024-06-19-191352_002.jpeg — sketch of the task DAG]

Each task needs to process tens of thousands of inputs, and for the most part every input can be processed in parallel, since these different "chains" don't depend on each other. Let's say I have to process 10,000 inputs, like so:

[Attachment: signal-2024-06-19-191407_002.jpeg — sketch of the 10,000 independent per-input chains]

My question is whether Luigi would be adequate to handle this use case, and if so, what kinds of things I would need to make sure it runs successfully. What if I had 1,000,000 inputs or more?

Maybe this would be too many files. If I have a chain of roughly 8 tasks and I need to run it 10,000 times, I would need to create 80,000 files or file-like targets to use as outputs/inputs.

One way would be for a single A task to process all 10,000 inputs at once, since Luigi's docs do say that a few tasks that each handle a lot of data is a valid design. However, I want A(x) and A(y) to be two different task invocations: if A(42) isn't done running but A(84) is, I want B(84) to start right away, taking advantage of parallelism wherever possible. My doubt about whether Luigi can handle this comes from https://luigi.readthedocs.io/en/stable/design_and_limitations.html:
> The assumption is that each task is a sizable chunk of work. While you can probably schedule a few thousand jobs, it’s not meant to scale beyond tens of thousands.

I know that there are many integrations, such as Spark. Would any of them help me achieve my goals? And could I still achieve them with plain Luigi, without bringing in something heavier like Spark?

Best,
Plumber Appreciator