RHadoop Query


Chandan Nishad

Mar 18, 2016, 1:25:48 PM
to RHadoop
 
Hi everyone,

I am trying to read data from HDFS into R. After importing the data, I see only 7,40,726 rows, but the file actually contains around 11,55,000 rows and is only 105 MB in size.
Below is the code I have written to read the data.

Sys.setenv("HADOOP_PREFIX"="/opt/hadoop")
Sys.setenv("HADOOP_CMD"="/opt/hadoop/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar")

library(rmr2)
library(rhdfs)
hdfs.init()
library(rJava)

# To read the data from HDFS.
hdfs.defaults()
f = hdfs.file("/tmp/projectdata/churndata/Apr_data1.csv","r",buffersize=104857600)
m = hdfs.read(f)
c = rawToChar(m)
data = read.table(textConnection(c), sep = ",",fill = TRUE);

Kindly let me know what I should do in order to read the complete data (11,55,000 rows).
Waiting for your kind response.

Thank you,
Chandan

Ranjit Mishra

Apr 6, 2016, 9:11:14 AM
to RHadoop
Not sure why this is happening, but I would suggest a workaround: download the file from HDFS (the Hortonworks and Cloudera interfaces offer this option in the GUI itself) and then try to read it in R. You can also do an 'hdfs.get' to copy the file to the local box, and then read it from there.
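
For reference, the 'hdfs.get' workaround could look roughly like the sketch below. It reuses the HDFS path from the original post; the helper names read_local_csv and read_hdfs_csv are made up for illustration. The thread does not establish why rows go missing, so the quote = "" and comment.char = "" arguments are only an assumption: read.table by default treats ', " and # as special characters, and stray ones in the data can merge or truncate lines, which is worth ruling out.

```r
# Hypothetical helper: read a headerless CSV from the local disk.
# quote = "" and comment.char = "" are assumptions, not a confirmed fix:
# they stop read.table from treating ', " and # in the data as special.
read_local_csv <- function(path) {
  read.table(path, sep = ",", fill = TRUE,
             quote = "", comment.char = "",
             stringsAsFactors = FALSE)
}

# Hypothetical helper: copy a file out of HDFS, then read the local copy.
# Requires a working rhdfs setup (HADOOP_CMD set as in the original post).
read_hdfs_csv <- function(hdfs_path) {
  library(rhdfs)
  hdfs.init()
  local_path <- tempfile(fileext = ".csv")
  on.exit(unlink(local_path))        # remove the local copy afterwards
  hdfs.get(hdfs_path, local_path)    # HDFS -> local file system
  read_local_csv(local_path)
}
```

It would be called as data <- read_hdfs_csv("/tmp/projectdata/churndata/Apr_data1.csv"), after which nrow(data) can be compared with the expected row count. Reading a local copy also avoids holding the whole file in a single in-memory buffer.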