using filename as key for reducer


Raf

Oct 5, 2014, 9:03:12 PM
Hello,

I am very new to rhadoop and having a little trouble.

I have a large number of files, each containing a number of vectors. I wish to perform some vector operations on each vector and then in the reducer stage sum up all vectors from the same file before outputting a filename + a single vector as my output.

Currently I am mimicking the wordcount example to some extent; however, I want to emit the filename as the key and the processed vector as the data.

something like:

map = function(., line) {
  # `vector` stands for the processed vector derived from `line`
  keyval(Sys.getenv("mapreduce_map_input_file"), vector)
}

reduce = function(filename, vectors) {
  keyval(filename, sum(vectors))
}

When I run this, the vectors appear to be processed correctly; however, I am not getting back the filenames. I am getting unique ids of some kind (1, 2, 3, ...) that seem to correspond to the vector, not the file.

That is, I am outputting each vector rather than each file plus its associated vector.

I have also tried "map_input_file" and MAPREDUCE_MAP_INPUT_FILE as the env values.
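
The names tried above follow the Hadoop Streaming convention of exposing job configuration properties to the task environment with dots replaced by underscores. A minimal sketch of a lookup that tries each name from the post in turn; `input_file` is a hypothetical helper, not part of rmr2, and which name is actually set depends on the Hadoop version:

```r
# Hypothetical helper (not part of rmr2): return the first non-empty
# environment variable among the candidate names, or NA if none is set.
input_file <- function(candidates = c("mapreduce_map_input_file",
                                      "map_input_file",
                                      "MAPREDUCE_MAP_INPUT_FILE")) {
  for (name in candidates) {
    value <- Sys.getenv(name, unset = "")
    if (nzchar(value)) return(value)
  }
  NA_character_
}

# Outside a Hadoop task none of these is normally set, so simulate one
# (the path is made up for illustration).
Sys.setenv(map_input_file = "hdfs://example/data/a.txt")
current_file <- input_file()
print(current_file)
```

Inside a map function the helper would be called once per task, since every record of one input split comes from the same file.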

Antonio Piccolboni

Oct 6, 2014, 12:06:03 PM
to RHadoop Google Group
Yes, that's the normal output. In the wordcount example, did you consider what happens to the output of mapreduce? What functions can be called on it? Do any of the examples call print on it? If you answer those questions I think it should solve your problem.


Antonio
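
To make the hint concrete: `mapreduce` returns a handle to the job's output rather than the data itself, and `from.dfs` reads that output back into R. A small sketch on the local backend, guarded so that it only runs where rmr2 is installed; the file names and numbers are made-up toy data, and a real job would read its input from HDFS instead of `to.dfs`:

```r
# Runs only if rmr2 is available; otherwise the block is skipped.
if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)
  rmr.options(backend = "local")   # no Hadoop cluster needed

  # Toy input: three values keyed by the (made-up) file they came from.
  input <- to.dfs(keyval(c("a.txt", "a.txt", "b.txt"), c(1, 2, 3)))

  out <- mapreduce(
    input  = input,
    map    = function(k, v) keyval(k, v),
    reduce = function(k, vv) keyval(k, sum(vv))
  )

  # Printing `out` shows only the handle; from.dfs fetches the key-value pairs.
  result <- from.dfs(out)
  print(keys(result))
  print(values(result))
}
```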

On Sun, Oct 5, 2014 at 6:03 PM, Raf <rafael....@gmail.com> wrote:
When I run this on a simple one-file example in backend mode, however, I get:

function () 
{
    fname
}
<environment: 0x29d0818>

as output... which I don't really understand.

I have also tried "map_input_file" and MAPREDUCE_MAP_INPUT_FILE as the env values.


Raf

Oct 7, 2014, 1:40:59 AM
to rha...@googlegroups.com, ant...@piccolboni.info
OK, I have it outputting the filename now,

but it does not appear to be combining keys in the reducer. I am getting a line for each vector component and many copies of each key, rather than a key with its associated vector...

Antonio Piccolboni

Oct 7, 2014, 11:32:53 AM
to RHadoop Google Group
keyval does recycling; just try keyval(1, 1:10). So maybe what you want is keyval(one.key, list(vector)).
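
A sketch of how that suggestion could fit the reducer from the first post, assuming the vectors for one file arrive as a list of equal-length numeric vectors (an assumption about the data, not something confirmed in the thread). `sum_vectors` and `reduce_sketch` are hypothetical names; note that `sum()` collapses everything to a single number, whereas `Reduce` with `+` adds element-wise:

```r
# Hypothetical helper: element-wise sum of a list of equal-length vectors.
sum_vectors <- function(vectors) Reduce(`+`, vectors)

vectors <- list(c(1, 2, 3), c(10, 20, 30))  # toy data for one file
total   <- sum_vectors(vectors)             # one vector: 11 22 33
scalar  <- sum(unlist(vectors))             # one number: 66

# Reducer sketch: wrap the vector in list() so keyval sees one value
# for the one key instead of recycling the key across components.
reduce_sketch <- function(filename, vectors) {
  keyval(filename, list(sum_vectors(vectors)))
}

# Only evaluated where rmr2 is installed.
if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)
  kv <- reduce_sketch("a.txt", vectors)
  print(keys(kv))
  print(values(kv))
}
```

For the reducer to receive a list in the first place, the map step would presumably need the same wrapping, i.e. emit `keyval(filename, list(vector))`.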