Doing HDFS operations efficiently using Python 3


Arash Rouhani Kalleh

unread,
Nov 27, 2015, 4:58:17 AM
to luigi...@googlegroups.com
Hi Luigi users,

Long time maintainer, first time user. ;)

I'm currently using luigi with Python 3 for orchestration. Most of my inputs reside on HDFS, and I started out using the "hadoopcli" option for checking whether my inputs exist yet. Now I realize this has become a CPU bottleneck during scheduling, since it starts and shuts down JVMs all the time (that's my theory, at least).

In the Python 2 world there is snakebite, but how do Python 3 users work around this? Do you live with it? Do you optimize the scheduling phase using globs to spawn fewer `hadoop fs` processes? Has anyone used the hdfs Python library[1]?
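For context, the glob idea boils down to replacing one `hadoop fs -test -e` (one JVM spawn) per path with a single `hadoop fs -ls` over a glob and parsing its output. A rough sketch of what I mean — the helper names and the glob are made up, and the parsing assumes no spaces in path names:

```python
import subprocess

def parse_ls_paths(ls_output):
    """Extract paths from `hadoop fs -ls` output.

    Assumes the usual format: a "Found N items" header followed by one
    line per entry, with the path as the last whitespace-separated
    field (so this breaks on paths containing spaces).
    """
    paths = set()
    for line in ls_output.splitlines():
        if not line or line.startswith("Found "):
            continue
        paths.add(line.split()[-1])
    return paths

def exist_many(wanted_paths, glob_pattern):
    """Check many paths with a single `hadoop fs -ls` (one JVM spawn)."""
    result = subprocess.run(
        ["hadoop", "fs", "-ls", glob_pattern],
        stdout=subprocess.PIPE, universal_newlines=True,
    )
    found = parse_ls_paths(result.stdout)
    return {p: p in found for p in wanted_paths}
```

So a scheduling round over, say, hundreds of daily partitions costs one JVM startup instead of hundreds.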

Cheers,
Arash

Elias Freider

unread,
Nov 27, 2015, 7:07:38 PM
to Arash Rouhani Kalleh, luigi...@googlegroups.com
Hi Arash, hope you are doing well in VN :)

Before snakebite existed, we (at Spotify) used webhdfs/httpfs for faster existence checks. If I'm not mistaken, it uses a single process that acts as an HTTP proxy for a bunch of HDFS commands. The original code we used for that was most likely in some Spotify-internal package and didn't conform to newer versions of the luigi filesystem APIs anyway, but there is a webhdfs module that I believe some Foursquare guys originally contributed around the same time as part of luigi.contrib [1].

I believe it originally wrapped a third-party webhdfs Python library called whoops, but it seems to have been refactored to use a different one since. It might not implement the entire HdfsFileSystem interface, but with a little effort you should definitely be able to use it (or some other httpfs-based solution, like overriding exists() on HdfsTarget with an HTTP call) to speed up your existence checks. Let us know if you figure something out, as it might help others in the same situation :)
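To make the exists()-via-HTTP idea concrete: WebHDFS answers GETFILESTATUS with 200 if a path exists and 404 if it doesn't, so an existence check is one cheap HTTP round-trip with no JVM spawn. A minimal sketch, assuming a namenode at a placeholder host and the default WebHDFS port of that era (50070):

```python
import urllib.error
import urllib.request

# Placeholder -- point this at your own namenode's WebHDFS endpoint.
WEBHDFS_BASE = "http://namenode.example.com:50070/webhdfs/v1"

def webhdfs_status_url(path):
    """Build the GETFILESTATUS URL for an absolute HDFS path."""
    return "{0}{1}?op=GETFILESTATUS".format(WEBHDFS_BASE, path)

def webhdfs_exists(path):
    """Existence check via one HTTP GET: 200 -> True, 404 -> False."""
    try:
        urllib.request.urlopen(webhdfs_status_url(path))
        return True
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False
        raise
```

You could then wire webhdfs_exists() into an HdfsTarget subclass's exists(), as Elias suggests above.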

/Elias (original co-author, long time mailing list lurker)



Arash Rouhani Kalleh

unread,
Dec 6, 2015, 11:13:50 PM
to Elias Freider, luigi...@googlegroups.com
On Sat, Nov 28, 2015 at 7:07 AM, Elias Freider <elias....@gmail.com> wrote:
Hi Arash, hope you are doing well in VN :)

Cảm ơn (thanks)! :)

Before snakebite existed, we (at Spotify) used webhdfs/httpfs for faster existence checks. If I'm not mistaken, it uses a single process that acts as an HTTP proxy for a bunch of HDFS commands. The original code we used for that was most likely in some Spotify-internal package and didn't conform to newer versions of the luigi filesystem APIs anyway, but there is a webhdfs module that I believe some Foursquare guys originally contributed around the same time as part of luigi.contrib [1].

I believe it originally wrapped a third-party webhdfs Python library called whoops, but it seems to have been refactored to use a different one since. It might not implement the entire HdfsFileSystem interface, but with a little effort you should definitely be able to use it (or some other httpfs-based solution, like overriding exists() on HdfsTarget with an HTTP call) to speed up your existence checks. Let us know if you figure something out, as it might help others in the same situation :)

Thanks for the summary!

I took some time to rewrite some of luigi's webhdfs integration in this now-merged PR: https://github.com/spotify/luigi/pull/1438

The main advantages over the old webhdfs module:

 * It's now integrated with the rest of luigi's hdfs modules. You can select it by setting client: webhdfs in the [hdfs] section of your luigi.cfg.
 * Travis actually runs some tests against the minicluster to ensure that the code will not suddenly break.
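For anyone who wants to try it, the switch described above looks like this in luigi.cfg (only the client key is from the discussion; your namenode connection details still need to be configured for your own cluster, per the luigi docs):

```ini
[hdfs]
client: webhdfs
```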

I hope it will be useful for others too.

Cheers,
Arash