The quick answer is that S3's eventual consistency is not a sound
foundation for ETL pipelines, with Luigi or anything else. See
https://berlinbuzzwords.de/17/session/what-does-rename-do for a deeper
explanation.

To get a stable solution, you must either use a file system with
stronger consistency or add a layer that provides consistency. The
video linked above mentions one solution in progress, made by
Hortonworks. Amazon has another (EMRFS), but it is a leaky abstraction,
and I have had bad experiences with it. Netflix has made another
(S3mper). More solutions along the same lines are in the works.

AFAIK, however, none of these solutions currently works well enough.
Your options are basically:
1. Live with the problems, and mitigate them with pragmatic hacks,
such as the ones suggested in another mail in this thread.
2. Use HDFS. I have heard rumours of setups where pipelines run in
EMR, using HDFS between jobs, and only the end results are copied back
to S3. In theory this is no better, since you hit the same problem as
soon as one pipeline reads another pipeline's output, but it might
work in practice.
3. Use a managed file system with stronger consistency. From the docs,
it seems EFS might be sufficient. Azure has managed HDFS, but I think
no other provider does.
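
As an illustration of the kind of hack meant in option 1, here is a
minimal sketch of polling for an output before a downstream job reads
it. This is my own example, not necessarily what the other mail
suggests. The function wait_until_visible and its exists argument are
hypothetical names; exists stands for whatever existence check your
storage client offers. It only narrows the window in which a job sees
stale state and does not make S3 consistent.

```python
import time


def wait_until_visible(exists, path, attempts=5, delay=1.0, sleep=time.sleep):
    """Poll until exists(path) is true; return False if it never is.

    `exists` is a hypothetical callable supplied by the caller, e.g. a
    thin wrapper around a storage client's existence check. The delay
    doubles after each failed attempt (simple exponential backoff).
    """
    for attempt in range(attempts):
        if exists(path):
            return True
        if attempt < attempts - 1:
            # Back off before asking again; no sleep after the last try.
            sleep(delay)
            delay *= 2
    return False
```

A downstream task would call this on each input path and fail, so that
it gets retried later, when the function returns False.
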
Lars Albertsson
Data engineering consultant
www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109
Calendar:
http://www.mapflat.com/calendar