Source Tap reading from the local file system and Sink writing to HDFS


Amit

Mar 3, 2015, 11:36:51 AM
to cascadi...@googlegroups.com
Hello,

Could someone please help me understand whether we can read local files, process them, and write them to HDFS?

I went through the API doc (cascading.tap.hadoop.Hfs), which says:
Use the Hfs class if the 'kind' of resource is unknown at design time. To use, prefix a scheme to the 'stringPath'. Where hdfs://... will denote Dfs, and file://... will denote Lfs.
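
For illustration, here is how I read that scheme-prefix behavior (an untested sketch; the paths and the NameNode address are made up):

    // assumes: import cascading.scheme.hadoop.TextLine;
    //          import cascading.tap.Tap;
    //          import cascading.tap.hadoop.Hfs;
    Tap local  = new Hfs(new TextLine(), "file:///data/input");              // behaves like Lfs
    Tap remote = new Hfs(new TextLine(), "hdfs://namenode:8020/data/input"); // behaves like Dfs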

However, subsequent documentation says:

By default Cascading on Hadoop will assume any source or sink Tap using the file:// URI scheme intends to read files from the local client filesystem (for example when using the Lfs Tap) where the Hadoop job jar is started, so it will force any MapReduce jobs reading or writing to file:// resources to run in Hadoop "standalone mode" so that the file can be read.

Does this mean that if we read the files from the local disk, we cannot use the cluster to process them?
 
Until now I have always used Source and Sink Taps that were both local or both HDFS, never mixed, hence the question.

Regards,
Amit

Andre Kelpe

Mar 3, 2015, 12:59:13 PM
to cascadi...@googlegroups.com
For the local file, use Lfs instead of Hfs and it should work:
http://docs.cascading.org/cascading/2.6/javadoc/cascading/tap/hadoop/Lfs.html
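
A minimal sketch of a Flow wired that way (untested; the class name, paths, and NameNode URI are placeholders, and real processing steps would replace the pass-through pipe):

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tap.hadoop.Lfs;

    public class LocalToHdfsCopy {
      public static void main(String[] args) {
        // source on the local filesystem, sink on HDFS
        Tap source = new Lfs(new TextLine(), "/local/path/input");
        Tap sink = new Hfs(new TextLine(), "hdfs://namenode:8020/path/output", SinkMode.REPLACE);

        // pass-through pipe; Each/Every processing steps would go here
        Pipe copy = new Pipe("copy");

        FlowDef flowDef = FlowDef.flowDef()
            .addSource(copy, source)
            .addTailSink(copy, sink);

        Flow flow = new HadoopFlowConnector(new Properties()).connect(flowDef);
        flow.complete();
      }
    }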

- André



--
André Kelpe
an...@concurrentinc.com
http://concurrentinc.com

Amit

Mar 3, 2015, 2:34:43 PM
to cascadi...@googlegroups.com
Thanks for your response.

Reading local files, processing them, and writing them to HDFS:

Would this be an acceptable pattern for such a use case? Based on the Java documentation for Lfs, I believe it may be slower:

Note that using an Lfs Tap instance in a Flow will force a portion, if not the whole, of the Flow to be executed in "local" mode, forcing the Flow to execute in the current JVM. Mixing with Dfs and other Tap types is possible, providing a means to implement complex file/data management functions.


However, I am not able to figure out which "portion" of the processing would execute in local mode and which "portion" on the Hadoop cluster, so I need to determine whether this is even an acceptable solution.

Consider a case where I want to move a number of files from the local filesystem to HDFS and, while doing so, perform some kind of processing on them.

Regards,
Amit

Andre Kelpe

Mar 4, 2015, 4:09:32 AM
to cascadi...@googlegroups.com
"portion" refers to the fact, that a flow can be many map/reduce jobs.
Remember that Cascading maps your higher level logic onto the building
blocks of the computational fabric, which is in this case map/reduce.
That means that the jobs involving local files will run local, but not
subsequent jobs of the same flow.
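
As a rough sketch of how you could make that boundary explicit (untested; names, paths, and the NameNode URI are made up): a Checkpoint pipe forces a job boundary, so the first job reads the local file in standalone mode, while the job after the boundary reads the intermediate data (which lands on HDFS when running on Hadoop) and is free to run on the cluster:

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Checkpoint;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tap.hadoop.Lfs;
    import cascading.tuple.Fields;

    public class LocalIngestThenCluster {
      public static void main(String[] args) {
        Tap source = new Lfs(new TextLine(new Fields("line")), "/local/input");
        Tap sink = new Hfs(new TextLine(), "hdfs://namenode:8020/output", SinkMode.REPLACE);

        // job 1 reads file://, so it runs in standalone mode
        Pipe ingest = new Pipe("ingest");

        // Checkpoint forces a job boundary; intermediate data goes to a temporary tap
        Pipe staged = new Checkpoint("staged", ingest);

        // job 2 reads the intermediate data, so it can run on the cluster
        Pipe grouped = new GroupBy(staged, new Fields("line"));

        FlowDef flowDef = FlowDef.flowDef()
            .addSource(ingest, source)
            .addTailSink(grouped, sink);

        Flow flow = new HadoopFlowConnector(new Properties()).connect(flowDef);
        flow.complete();
      }
    }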

If you want to see how your logic maps to physical jobs, give Driven
a try: http://cascading.io/driven/

- André