I want to feed a terabyte-scale CSV file to Hadoop as MapReduce input, and I have run into a problem:
1) If I use the Hadoop HDFS Java API, I can write the CSV file directly to HDFS, and the data will be stored in blocks across the cluster. But after that I can no longer use an R command such as:
mydata<-read.csv("/somedirectory/test.csv")
because there is no actual CSV file named "test.csv" on any single node of the cluster. If I do:
mydata<-from.dfs("hdfs://rhadoop:8020/tmp/test.csv")
I don't get a data frame that can be used as MapReduce input, only a text blob assembled from the distributed blocks, with each line representing a row of the original test.csv file.
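To be concrete about what I mean by "only a text file": each row comes back as a raw line of text that I would have to parse into columns myself. A minimal, Hadoop-free Java sketch of that parsing step (the column names, separator, and simple-split parsing are made up for illustration; a real CSV with quoted fields would need a proper parser):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvLines {
    // Parse newline-delimited CSV text (like the blob reassembled from
    // HDFS blocks) into rows of fields. Assumes a plain comma separator
    // with no quoted or escaped fields.
    static List<String[]> toRows(String text) {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) {
                rows.add(line.split(",", -1)); // -1 keeps trailing empty fields
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample content standing in for test.csv
        String blob = "id,name,value\n1,alice,3.5\n2,bob,4.2\n";
        List<String[]> rows = toRows(blob);
        System.out.println(rows.size());              // header + 2 data rows
        System.out.println(Arrays.toString(rows.get(1)));
    }
}
```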
2) I considered copying my "test.csv" file to the remote Hadoop cluster's namenode via FTP (or some other means), so that I could then do
mydata<-read.csv("/somedirectory/test.csv")
But the file is so large that I doubt this is an appropriate approach (wouldn't it take too much disk space on the namenode? And isn't it a single point of failure if the namenode dies?). How does the Java FileSystem.copyFromLocalFile() method work? Does it move the whole file to the namenode first, after which the namenode streams the blocks out to the cluster? Or is the file divided into blocks that are transported one by one, with the namenode forwarding each incoming block to some datanode before accepting the next?
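To make the second mechanism I'm describing concrete, here is a pure-local Java sketch of the chunk-and-stream idea: the file is read in block-sized pieces and each block is shipped as soon as it is read, so the whole file never has to sit on one machine. Everything here is made up for illustration (the tiny BLOCK_SIZE, the sendBlock stand-in); it has no Hadoop dependency and is not a claim about what copyFromLocalFile actually does:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BlockUploadSketch {
    // Hypothetical block size for the demo; real HDFS blocks are tens of MB.
    static final int BLOCK_SIZE = 4;

    // Stand-in for shipping one block off to a datanode.
    static void sendBlock(byte[] block, int len) {
        // In the scenario I'm asking about, this would transmit `len`
        // bytes to whichever node is supposed to store this block.
    }

    // Read the source in block-sized chunks and ship each chunk as it is
    // read, so the full file is never staged in one place.
    static int upload(InputStream src) throws IOException {
        byte[] buf = new byte[BLOCK_SIZE];
        int blocks = 0;
        int n;
        while ((n = src.read(buf)) > 0) {
            sendBlock(buf, n);
            blocks++;
        }
        return blocks;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789".getBytes();  // 10 bytes -> blocks of 4, 4, 2
        System.out.println(upload(new ByteArrayInputStream(data)));
    }
}
```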
How should I handle this?
I may not have explained this clearly; if you need more information, just let me know. Thank you!