MapReduce job from an HDFS directory in RStudio


Iman neustre

Jul 27, 2015, 1:52:28 PM7/27/15
to RHadoop

I have set up Sqoop to fetch data from a remote database, and it works fine: I can see the data split into blocks in my HDFS. Now I want to explore this data with RHadoop. I have my algorithm, and when I try it on a CSV file it works. The problem is how to make the mapreduce job work on the HDFS blocks.

When I put a CSV on HDFS, the mapreduce takes a very long time, about 4 minutes per million rows. This CSV is only an example; the real data will be millions of rows.


The blocks are in /Data, which is an HDFS directory:

hdfs.data.root <- '/Data'
hdfs.data <- file.path(hdfs.data.root)

# run a mapreduce job: count occurrences of the values in column 6
job <- mapreduce(input = hdfs.data,
                 map = function(k, v) keyval(v[6], 1),
                 reduce = function(k, v) keyval(k, length(v)))

In the end, it fails.

Antonio Piccolboni

Aug 5, 2015, 6:26:53 PM8/5/15
to rha...@googlegroups.com
You did not specify the input format. That will do it. Another thing, instead of trying one job for 7 years (4E6 minutes) you could try a run on 400 rows, also on hdfs, same format, same everything, and see if it works. That will allow you to learn what the problem is as opposed to depending on this forum.
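[Editor's note] To make the input-format point concrete, here is a sketch of how the job could be written with rmr2's input.format argument. It assumes the Sqoop export produced comma-delimited text files with no header under /Data; the part-m-00000 file name is Sqoop's usual name for the first map task's output and may differ in your cluster.

```r
library(rmr2)

# Assumption: Sqoop wrote comma-delimited text files under /Data.
# make.input.format("csv", ...) forwards extra arguments to read.table,
# so the map function receives a data frame per input chunk.
csv.format <- make.input.format("csv", sep = ",")

# Debug on a single split first (as suggested above), then point
# input at the whole directory once it works.
job <- mapreduce(input = "/Data/part-m-00000",  # hypothetical split name
                 input.format = csv.format,
                 map = function(k, v) keyval(v[6], 1),
                 reduce = function(k, v) keyval(k, length(v)))
from.dfs(job)  # pull the (key, count) pairs back from HDFS to inspect
```

Once the small run succeeds, switching input back to "/Data" runs the same job over every split in the directory.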

