I want to feed a terabyte-scale CSV file to Hadoop as MapReduce input, and I have run into a problem:
1) If I use the Hadoop HDFS Java API, I can write the CSV file directly to HDFS, and the data will be stored in blocks across the cluster. But after that I can no longer use an R command such as:
mydata<-read.csv("/somedirectory/test.csv")
because there is no actual CSV file named "test.csv" on any single node of the cluster. If I do:
mydata<-from.dfs("hdfs://rhadoop:8020/tmp/test.csv")
I don't get a data frame that can be used as MapReduce input, only a text blob assembled from the distributed blocks, with each line representing a row of the original test.csv file.
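To be concrete about what I mean by "only a text file": each row comes back as a raw line of text that I would have to parse into columns myself. A minimal, Hadoop-free Java sketch of that parsing step (the column names, separator, and simple-split parsing are made up for illustration; a real CSV with quoted fields would need a proper parser):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvLines {
    // Parse newline-delimited CSV text (like the blob reassembled from
    // HDFS blocks) into rows of fields. Assumes a plain comma separator
    // with no quoted or escaped fields.
    static List<String[]> toRows(String text) {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) {
                rows.add(line.split(",", -1)); // -1 keeps trailing empty fields
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample content standing in for test.csv
        String blob = "id,name,value\n1,alice,3.5\n2,bob,4.2\n";
        List<String[]> rows = toRows(blob);
        System.out.println(rows.size());              // header + 2 data rows
        System.out.println(Arrays.toString(rows.get(1)));
    }
}
```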
2) I considered copying my "test.csv" file to the remote Hadoop cluster's namenode via FTP (or some other means), so that I could then do
mydata<-read.csv("/somedirectory/test.csv")
But the file is so large that I doubt this is an appropriate approach (wouldn't it take too much disk space on the namenode? And isn't it a single point of failure if the namenode dies?). How does the Java FileSystem.copyFromLocalFile() method work? Does it move the whole file to the namenode first, after which the namenode streams the blocks out to the cluster? Or is the file divided into blocks that are transported one by one, with the namenode forwarding each incoming block to some datanode before accepting the next?
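To make the second mechanism I'm describing concrete, here is a pure-local Java sketch of the chunk-and-stream idea: the file is read in block-sized pieces and each block is shipped as soon as it is read, so the whole file never has to sit on one machine. Everything here is made up for illustration (the tiny BLOCK_SIZE, the sendBlock stand-in); it has no Hadoop dependency and is not a claim about what copyFromLocalFile actually does:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BlockUploadSketch {
    // Hypothetical block size for the demo; real HDFS blocks are tens of MB.
    static final int BLOCK_SIZE = 4;

    // Stand-in for shipping one block off to a datanode.
    static void sendBlock(byte[] block, int len) {
        // In the scenario I'm asking about, this would transmit `len`
        // bytes to whichever node is supposed to store this block.
    }

    // Read the source in block-sized chunks and ship each chunk as it is
    // read, so the full file is never staged in one place.
    static int upload(InputStream src) throws IOException {
        byte[] buf = new byte[BLOCK_SIZE];
        int blocks = 0;
        int n;
        while ((n = src.read(buf)) > 0) {
            sendBlock(buf, n);
            blocks++;
        }
        return blocks;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789".getBytes();  // 10 bytes -> blocks of 4, 4, 2
        System.out.println(upload(new ByteArrayInputStream(data)));
    }
}
```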
How should I handle this?
I may not have explained this clearly; if you need more information, just let me know. Thank you!