[Scalding]: DistributedCacheFile error: Data is missing from one or more paths


SK

Apr 16, 2014, 5:10:58 PM
to cascadi...@googlegroups.com

I have a large CSV file that I have placed in the distributed cache using the following code:
 
val book_file = DistributedCacheFile("/path/to/file/on/hdfs/Books.csv")

Then this file is read to get some information as follows:

val book_names =
  Csv(book_file.path, skipHeader = true, separator = ";", fields = book_format)
    .read
    .project('id, 'name)

However, when I run the code in hdfs mode, I get the following error:

Exception in thread "main" com.twitter.scalding.InvalidSourceException: [com.twitter.scalding.CsvWrappedArray(./Books.csv-db872e0b620ec7716244dd4b341f094b)]
Data is missing from one or more paths in: List(./Books.csv-db872e0b620ec7716244dd4b341f094b)
    at com.twitter.scalding.FileSource.validateTaps(FileSource.scala:121)

So it looks like the mapper nodes are not able to resolve the path to the file in the distributed cache. I have read the online how-to on DistributedCache and followed the example there.
In that example, the file in the distributed cache is passed to an external Java lookup service.
In my code above, however, I am reading the cached file directly within my Scalding code. I am not sure why the nodes are unable to get the correct local path. I would appreciate any help in fixing this error.
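
For comparison, here is a minimal sketch of the other usage pattern, where the cached file is read with plain file I/O instead of being wrapped in a Csv source. The trace above is raised in thread "main" by validateTaps, which suggests the path check may be happening at job submission rather than on the mappers, so this may be the distinction that matters. The object name BookLookup, the function loadBookNames, and the header column names "id" and "name" are all hypothetical (book_format is not shown above).

```scala
import scala.io.Source

object BookLookup {
  // Parses a semicolon-separated file with a header row into an id -> name map.
  // Assumption: the header contains columns literally named "id" and "name".
  // Note: a plain split does not handle quoted fields that contain ';'.
  def loadBookNames(localPath: String): Map[String, String] = {
    val source = Source.fromFile(localPath, "UTF-8")
    try {
      // Skip blank lines; the first remaining line is treated as the header.
      val lines = source.getLines().filter(_.trim.nonEmpty)
      if (!lines.hasNext) Map.empty[String, String]
      else {
        val header = lines.next().split(";", -1).map(_.trim)
        val idIdx = header.indexOf("id")
        val nameIdx = header.indexOf("name")
        require(idIdx >= 0 && nameIdx >= 0, "header must contain 'id' and 'name' columns")
        lines
          .map(_.split(";", -1))
          // Drop rows too short to contain both columns.
          .filter(cols => cols.length > math.max(idIdx, nameIdx))
          .map(cols => cols(idIdx).trim -> cols(nameIdx).trim)
          .toMap // strict, so the file is fully read before it is closed
      }
    } finally source.close()
  }
}
```

Inside a job, something like BookLookup.loadBookNames(book_file.path) would presumably have to be called from task-side code (for example, lazily inside a map function), since the "./Books.csv-..." link should only exist in a task's working directory. I have not verified this on a cluster.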

Thanks