Hi,
I am new to the RHadoop packages (rmr2/rhdfs) and I am having trouble controlling the input that each of my mappers receives. Let me detail the setup.
Main Objective:
I have a large number of individual files that I want my mappers to read and sample from, with a strict one-to-one pairing: each file goes to exactly one mapper, and each mapper gets exactly one file. The sampled data is then passed to the reducer for further analysis. To simulate this setup I have created a small example.
Example Setup:
I am using the wordcount example. I have two data files, data.txt and data1.txt, stored at the HDFS location /user/root/wordcount/data/ . I have another file that lists these file names with their full paths; I called it file_list.txt and stored it at another HDFS location, /user/root/wordcount/files/file_list.txt . I provide file_list.txt as the input to the job, so that a mapper receives a line from this file (i.e., the location of one data file). The mapper uses hdfs.get to pull the data file to the local directory and then processes it. The mapper's output is given to the reducer, which summarizes the word counts.
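For concreteness, file_list.txt contains one HDFS path per line:

/user/root/wordcount/data/data.txt
/user/root/wordcount/data/data1.txt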
Problem:
Of course, the above solution does not work as intended. The issue I am having is that a single mapper receives both lines of input (both data file locations) and processes both files itself.
Is there a way I can force each mapper to get only one line of input? When I used the "HadoopStreaming" package in R, hsLineReader had an option called "chunkSize" that I could use to feed exactly one line to the mapper at a time.
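For reference, this is roughly how I used that option (from memory, so the details may be slightly off):

library(HadoopStreaming)

# FUN is called with a character vector of at most chunkSize lines read
# from the connection; with chunkSize = 1 it sees one line per call.
hsLineReader(file = "", chunkSize = 1, FUN = function(line) {
  cat(line, sep = "\n")
})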
Here is my code (edited to make it runnable):
Sys.setenv(HADOOP_HOME="/usr/lib/hadoop")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
library(rmr2)
library(rhdfs)
hdfs.init()
map <- function(k, lines) {
  # Each map task runs in a fresh R process on the task node, so the
  # driver's session settings are not inherited.
  Sys.setenv(HADOOP_HOME="/usr/lib/hadoop") # do I need to set this up again?
  Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
  library(rhdfs)
  hdfs.init()
  # Copy the listed data files from HDFS to the local working directory.
  # Based on some debugging, length(lines) is 2 here; I would ideally
  # like this to be 1.
  hdfs.get(lines, '.')
  word_list <- character(0) # added to make it work
  file_names <- basename(lines)
  for (fn in file_names) {
    file_handle <- file(fn, 'r')
    file_lines <- readLines(file_handle)
    close(file_handle)
    words <- unlist(strsplit(file_lines, "\\s+")) # split each line on whitespace
    word_list <- c(word_list, words)
  }
  return(keyval(word_list, 1))
}
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text",
            map = map, reduce = reduce)
}
## read input file from folder wordcount/files
## save result in folder wordcount/out
## Submit job
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'files')
hdfs.out <- file.path(hdfs.root, 'out')
out <- wordcount(hdfs.data, hdfs.out)
# output processing follows
..
...
....
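One idea I have not been able to verify: whether rmr2 can pass Hadoop streaming's NLineInputFormat through its backend.parameters argument, along the lines of the untested sketch below. I do not know whether rmr2 forwards these options in the right order, and the linespermap property name differs between Hadoop versions (mapred.line.input.format.linespermap on older releases).

# Untested sketch: ask streaming to give each mapper exactly one line
# via NLineInputFormat. wordcount_nline is just a hypothetical variant
# of the wordcount() driver above.
wordcount_nline <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text",
            map = map, reduce = reduce,
            backend.parameters = list(hadoop = list(
              D = "mapreduce.input.lineinputformat.linespermap=1",
              inputformat = "org.apache.hadoop.mapred.lib.NLineInputFormat")))
}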
Any help would be greatly appreciated.
Thanks,
Sudhamsh.