Another Hadoop problem solved using Rhipe

Shekhar

Nov 2, 2011, 2:24:01 AM
to Bangalore R Users - BRU
Problem Description:
Here we are going to solve a simple problem: summing up the
attributes of advertisement data.
Basically, each record describes an image and tells whether that
image is an advertisement (ad) or a non-advertisement (nonad).

The data can be downloaded from: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements

Each record in the data has 1558 attributes. Some of the attributes
may be missing, which is indicated by a "?" mark. The last entry
shows whether the record is "ad", i.e. an advertisement, or "nonad",
i.e. a non-advertisement. You can find more details on the site.

Now the job is to add up the numerical attributes, including the
binary attributes, separately for the ad and nonad records.

Regards,
Som Shekhar

Shekhar

Nov 17, 2011, 1:51:17 AM
to Bangalore R Users - BRU
Hi,
Before looking at the code, I would suggest understanding the
structure of the data first.

R Version: 2.13.0
Rhipe Version: 0.66 (http://ml.stat.purdue.edu/rhipe/download/Rhipe_0.66.tar.gz)
Protobuf: 2.4.1 (http://protobuf.googlecode.com/files/protobuf-2.4.1.tar.gz)

The R Code for solving the above problem is as follows:

R Script name: Add.R

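library(Rhipe)
rhinit()

# Map: emit one pair per record; key = class label, value = sum of the record.
map <- expression({
  attr_val <- unlist(map.values)
  lapply(attr_val, function(i) {
    attr <- unlist(strsplit(i, ","))        # split the record on commas
    attr <- gsub("\\s", "", attr)           # remove the uneven whitespace
    questionMark <- which(attr == "?")      # positions of missing attributes
    attr[questionMark] <- "0"               # count missing attributes as zero
    rhcollect(attr[length(attr)],
              sum(as.numeric(attr[1:(length(attr) - 1)])))
  })
})

# Reduce (also used as the combiner): keep a running total per key.
reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- sum(total, unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)

mapred <- list(rhipe_map_buff_size = 20, mapred.job.tracker = "local")

job_object <- rhmr(map = map, reduce = reduce,
                   inout = c("text", "sequence"),
                   ifolder = "/data", ofolder = "/RhipeOut",
                   mapred = mapred, combiner = TRUE,
                   jobname = "AddAttributes")

rhex(job_object)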

Running the job:

> source("Add.R")

Results:
> rhread("/RhipeOut/part-r-00000")
RHIPE: Read 2 pairs occupying 45 bytes, deserializing
[[1]]
[[1]][[1]]
[1] "ad."

[[1]][[2]]
[1] 162549.9


[[2]]
[[2]][[1]]
[1] "nonad."

[[2]][[2]]
[1] 410103.5
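
The nested list printed above can be awkward to work with. Here is a
minimal sketch of flattening it into a data frame. It assumes the result
is a list of key/value pairs shaped exactly like the output shown; the
list is hard-coded here (with the rounded values as displayed) so the
snippet runs without Hadoop, and the names "result" and "totals" are only
illustrative.

```r
# Hard-coded copy of the structure printed above; in a real session this
# would be the return value of rhread().
result <- list(list("ad.", 162549.9), list("nonad.", 410103.5))

# Pull the key and the value out of every pair.
totals <- data.frame(
  label = vapply(result, function(kv) kv[[1]], character(1)),
  total = vapply(result, function(kv) kv[[2]], numeric(1)),
  stringsAsFactors = FALSE
)
print(totals)
```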

Shekhar

Nov 22, 2011, 6:57:09 AM
to Bangalore R Users - BRU
A few of my friends asked me to explain the code, so here it goes:

1. Understanding the data:

As explained in the two posts above, each record consists of 1558
attributes separated by commas.
Some of the records may have missing attributes. The missing attributes
are represented by a question mark "?". The values of the first three
attributes are floats and the rest of the attributes are binary. The
last field of the record tells whether the record is "ad", i.e. an
advertisement, or "nonad", i.e. a non-advertisement.
Moreover, the fields are padded with uneven spaces, so you need to take
care while processing them, otherwise the code will throw errors.
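
To see why the spaces matter, here is a small sketch using made-up
fields (not taken from the real data set). A padded " ?" does not
compare equal to "?", so the missing value is only detected once the
whitespace has been stripped.

```r
# Hypothetical raw fields, as they might look right after strsplit().
raw <- c("125", " 125", "1.0 ", " ?", "0", " ad.")

# Without trimming, the padded "?" is not recognised as missing.
missing_raw <- which(raw == "?")

# After removing all whitespace, the same comparison finds it.
clean <- gsub("\\s", "", raw)
missing_clean <- which(clean == "?")

print(length(missing_raw))
print(missing_clean)
print(clean[length(clean)])
```

The same cleaning also keeps the key consistent, so " ad." and "ad." do
not end up as two different keys.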

2. Understanding Algorithm:

map <- expression({
  attr_val <- unlist(map.values)
  lapply(attr_val, function(i) {
    attr <- unlist(strsplit(i, ","))
    attr <- gsub("\\s", "", attr)
    questionMark <- which(attr == "?")
    attr[questionMark] <- "0"
    rhcollect(attr[length(attr)],
              sum(as.numeric(attr[1:(length(attr) - 1)])))
  })
})

The values coming into the map expression are in the form of a list,
so the first step is to unlist them. After unlisting, the variable
"attr_val" holds a batch of records. At this point you can use either
a for loop or lapply. A for loop looks messy to me and is generally no
faster, so I have used lapply.
The lapply takes one record at a time and processes it.

Once the record is inside lapply, the first step is to split the
string and unlist the result, which gives you a character vector. Now
you need to add up the attributes, but keep in mind that:
(A) There are uneven spaces between the values. For example, the
variable "attr" will contain values of this form:
"1"," 2","3 ",... something like this.

(B) In a record there might be some missing attributes, which are
marked by a question mark and which you need to ignore.

(C) The last field tells you whether the record belongs to "ad" or
"nonad".

After getting the vector of strings, first remove the unwanted spaces,
then find which values in the vector are question marks and replace
them with zero. Once that is done, emit the last value of the vector
as the key and the sum of the remaining values as the value.
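
The steps above can be tried locally without Hadoop. This is only a
sketch: "collect" is a hypothetical stand-in for rhcollect that stores
the emitted pairs in a local list, "map_record" is an illustrative name,
and the input lines are made up and much shorter than the real records.

```r
# Stand-in for rhcollect(): appends each emitted pair to a local list.
# (Assumption for local testing; Rhipe's rhcollect hands pairs to Hadoop.)
emitted <- list()
collect <- function(key, value) {
  emitted[[length(emitted) + 1]] <<- list(key, value)
}

# Same steps as the map expression, applied to a single record.
map_record <- function(line) {
  attr <- gsub("\\s", "", unlist(strsplit(line, ",")))  # split, then trim
  attr[attr == "?"] <- "0"                              # missing -> zero
  n <- length(attr)
  collect(attr[n], sum(as.numeric(attr[-n])))           # key = label
}

# Hypothetical input lines (made-up values).
lines <- c("10, 20,?,1,ad.", "5,5 , 0,1,nonad.", " 1,?,?,0,ad.")
invisible(lapply(lines, map_record))
str(emitted)
```

Each input line produces exactly one pair, so the reduce step later sees
one partial sum per record for each key.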

reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- sum(total, unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)

I have written the reduce in this form so that it can also be used as
the combiner, which speeds up the process. For every key we unlist the
values and add them to the running total.
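
Reusing the reduce as a combiner is safe here because addition does not
care how the values are grouped. A rough local sketch of that idea
follows; the values are made up, "reduce_batches" is an illustrative
helper, and the split into two "mappers" is an assumption for the
example.

```r
# Hypothetical per-record sums emitted by the map step for one key.
values <- c(31, 1, 12.5, 7, 0.5)

# Same logic as the reduce expression: fold each batch into a running total.
reduce_batches <- function(batches) {
  total <- 0
  for (b in batches) total <- sum(total, unlist(b))
  total
}

# Without a combiner: the reducer sees every value.
direct <- reduce_batches(list(values))

# With a combiner: each mapper pre-sums its own values,
# and the reducer only adds up the partial sums.
partial_1 <- reduce_batches(list(values[1:2]))
partial_2 <- reduce_batches(list(values[3:5]))
combined  <- reduce_batches(list(partial_1, partial_2))

print(c(direct, combined))
```

Both routes give the same total, but with the combiner far fewer values
have to travel from the mappers to the reducer.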

Hope I made it clear.

Regards,
Som Shekhar
