Hi All,
In the previous post we tackled a very simple problem where there was only
a single word per line. Let's remove this restriction and take a real
scenario, where we have thousands of files. (Example: you can use the
Gutenberg data set. The command for downloading the Gutenberg data is:
wget -w 2 -m
http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en
This will download all the data sets which are in English.)
The problem in the previous post was very easy, since we needed to deal
with a single word only. And since the values come to the map expression
in the form of an R list, we just need to unlist them to form vectors and
simply iterate.
Now what do we do if we have a lot of words in a single line itself? First
we should consider what our delimiter should be: a space, ':', '@',
etc. Usually, if it is plain text then we can take a space as our
delimiter, but if it is an XML file then we need to choose something
else.
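As a side note, strsplit() accepts a regular expression, so the delimiter can be more flexible than a single literal character. A small sketch (the sample strings are made up for illustration):

```r
# Split on a single literal space -- the plain-text case
strsplit("sachin ramesh tendulkar", " ")

# Split on any of several delimiters at once using a character class:
# one or more of space, comma, colon or '@'
strsplit("user@example.org: hello, world", "[ ,:@]+")
```

The second call returns "user", "example.org", "hello" and "world" in one pass, which is handy when a file mixes delimiters.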
What next??
map.values are in a list, so we need to do the following things:
1. Unlist the map.values
2. Split the strings
3. Unlist them again to form a unified vector
You might be wondering: one round of unlisting will form a vector, so why
are we doing it a second time? To understand this, let's do a small
activity.
For simplicity, assume that we have the following lines in our file:
sachin ramesh tendulkar
Paul cohelo
ken follet
Jeffery Archer
Now, as you know, these lines come to our map expression as a list. To see
this, open the R console and first make an R list out of these strings:
> str1<-"sachin ramesh tendulkar"
> str2<-"Paul cohelo"
> str3<-"ken follet"
> str4<-"Jeffery Archer"
> myList<-list(str1,str2,str3,str4)# Formed a list of four strings
> myList
[[1]]
[1] "sachin ramesh tendulkar"
[[2]]
[1] "Paul cohelo"
[[3]]
[1] "ken follet"
[[4]]
[1] "Jeffery Archer"
> Doing_Unlist<-unlist(myList)# Doing unlist will form the vector
> Doing_Unlist
[1] "sachin ramesh tendulkar" "Paul cohelo"
[3] "ken follet" "Jeffery Archer"
> strsplit(Doing_Unlist," ") # Splitting each string with space as the delimiter; this again forms a list
[[1]]
[1] "sachin" "ramesh" "tendulkar"
[[2]]
[1] "Paul" "cohelo"
[[3]]
[1] "ken" "follet"
[[4]]
[1] "Jeffery" "Archer"
> unlist(strsplit(Doing_Unlist," ")) # So we unlist again and get a proper vector
[1] "sachin"    "ramesh"    "tendulkar" "Paul"      "cohelo"    "ken"
[7] "follet"    "Jeffery"   "Archer"
>
This is the only new logic involved. Now the R code is pretty
straightforward:
-----------------------------------------------
Generic_WordCount.R
------------------------------------------------
library(Rhipe)

map <- expression({
  # Unlist the incoming lines, split each on a space, then unlist again
  words_vector <- unlist(strsplit(unlist(map.values), " "))
  lapply(words_vector, function(i) rhcollect(i, 1))
})

reduce <- expression(
  pre    = {total <- 0},
  reduce = {total <- sum(total, unlist(reduce.values))},
  post   = {rhcollect(reduce.key, total)}
)

mapred <- list(rhipe_map_buff_size = 20, mapred.job.tracker = 'local')

job_object <- rhmr(map = map, reduce = reduce, inout = c("text", "sequence"),
                   ifolder = "/sample_1", ofolder = "/output_02",
                   mapred = mapred, jobname = "word_count")
rhex(job_object)
You can run the code and verify the results.
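To inspect the output back in R, RHIPE's rhread() can read the sequence-file output into a list of key/value pairs. A sketch, assuming the output path from the job above (verify the exact return shape against your RHIPE version):

```r
# Read the key/value pairs written by the job (sequence-file output)
result <- rhread("/output_02")

# Each element is a list(key, value); turn it into a small data frame
counts <- data.frame(word  = sapply(result, function(kv) kv[[1]]),
                     count = sapply(result, function(kv) unlist(kv[[2]])))
head(counts[order(-counts$count), ])
```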
There are some flaws in this code:
(1) Since the delimiter is a space, assume you have a sentence like
this: "time, is very precious, so we need to utilize this precious time
because time once gone will never come back."
What do you think the counts of the words "time" and "precious" should
be?
Answer: The count for the word "time" will not be 3; it will be 2, and the
count for the word "precious" will be 1, since there will be two keys
for time:
Key1---"time"
Key2---"time,"----> Notice the comma after the word time. Similarly for
the word "precious".
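A minimal sketch of one possible fix (not part of the original code): strip punctuation with gsub() and the [[:punct:]] character class before splitting, so "time," and "time" collapse to the same key.

```r
sentence <- "time, is very precious, so we need to utilize this precious time because time once gone will never come back."

# Remove punctuation so "time," and "time" become the same key
clean <- gsub("[[:punct:]]", "", sentence)
words <- unlist(strsplit(clean, " "))
sum(words == "time")      # now all three occurrences count
sum(words == "precious")  # and both occurrences of "precious"
```

In the MapReduce job this cleaning step would go inside the map expression, just before the strsplit() call.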
(2) Moreover, we are using a single space as the delimiter, so if there is
more than one space between two words, each extra space produces an empty
string that gets counted as a word. (Check this out.)
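A sketch of the fix for this one: split on a run of whitespace instead of a single literal space.

```r
line <- "hello   world"   # three spaces between the words

# Splitting on a single space yields empty strings for the extra spaces
unlist(strsplit(line, " "))            # "hello" "" "" "world"

# Splitting on one-or-more whitespace characters avoids that
unlist(strsplit(line, "[[:space:]]+")) # "hello" "world"
```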
I guess you are getting the idea of how we are exploiting R's
capabilities on a distributed platform like Hadoop.
One can also use the various Hadoop-related parameters here, like a
combiner, a partitioner, and the number of map and reduce tasks. Using
these in pseudo-distributed mode won't have much effect on performance.
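Such knobs can be passed through the same mapred list we used above. The property names below are standard Hadoop ones, but treat the exact spelling and the values as an illustrative assumption to verify against your Hadoop/RHIPE versions:

```r
# Hypothetical tuning of the word-count job; values are illustrative only
mapred <- list(
  rhipe_map_buff_size = 20,
  mapred.job.tracker  = 'local',
  mapred.map.tasks    = 8,   # hint for the number of map tasks
  mapred.reduce.tasks = 4    # number of reduce tasks
)
```

On a real cluster these settings shape parallelism; in local/pseudo mode they are mostly ignored.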
In the coming posts we will try to solve real-world problems like data
mining, k-means clustering, generating the Mandelbrot set, Monte Carlo
simulation, etc. We will also use graphics in the k-means clustering
algorithm to see how the centers move in every iteration.
Enjoy !!!
Regards,
Som Shekhar