MapReduce job from an HDFS directory in RStudio


Iman neustre

Jul 27, 2015, 1:52:28 PM7/27/15
to RHadoop

I have set up Sqoop to fetch data from a remote database, and it works fine: I can see the data split into blocks in my HDFS. Now I want to explore this data with RHadoop. I have my algorithm, and when I try it on a CSV file it works. The problem is how to make the mapreduce job work on the HDFS blocks.

When I put a CSV on HDFS, the mapreduce takes a very long time, about 4 minutes per million rows. This CSV is only an example; the real data will be millions of rows.


The blocks are in /Data, which is an HDFS directory:

hdfs.data.root <- '/Data'
hdfs.data <- file.path(hdfs.data.root)

# run a mapreduce job: count occurrences of the values in column 6
job <- mapreduce(input = hdfs.data,
                 map = function(k, v) keyval(v[6], 1),
                 reduce = function(k, v) keyval(k, length(v)))

In the end, it fails.

Antonio Piccolboni

Aug 5, 2015, 6:26:53 PM8/5/15
to rha...@googlegroups.com
You did not specify the input format. That will do it. Another thing, instead of trying one job for 7 years (4E6 minutes) you could try a run on 400 rows, also on hdfs, same format, same everything, and see if it works. That will allow you to learn what the problem is as opposed to depending on this forum.
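[Editor's note] To make the input-format point concrete, here is a sketch of how the job could be written with rmr2's input.format argument. It assumes the Sqoop export produced comma-delimited text files with no header under /Data; the part-m-00000 file name is Sqoop's usual name for the first map task's output and may differ in your cluster.

```r
library(rmr2)

# Assumption: Sqoop wrote comma-delimited text files under /Data.
# make.input.format("csv", ...) forwards extra arguments to read.table,
# so the map function receives a data frame per input chunk.
csv.format <- make.input.format("csv", sep = ",")

# Debug on a single split first (as suggested above), then point
# input at the whole directory once it works.
job <- mapreduce(input = "/Data/part-m-00000",  # hypothetical split name
                 input.format = csv.format,
                 map = function(k, v) keyval(v[6], 1),
                 reduce = function(k, v) keyval(k, length(v)))
from.dfs(job)  # pull the (key, count) pairs back from HDFS to inspect
```

Once the small run succeeds, switching input back to "/Data" runs the same job over every split in the directory.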

