1. Understanding the data:
As explained in the above two posts, each record consists of 1558
attributes separated by commas.
Some of the records may have missing attributes. The missing attributes
are represented by a question mark "?". The values of the first three
attributes are floats and the rest of the attributes are binary. The
last field of the record tells whether the record is "ad", i.e. an
advertisement, or "nonad", i.e. a non-advertisement.
Moreover, the fields have uneven spacing around them, so you need to take
care while processing them, otherwise you can get errors or wrong results.
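To make the layout concrete, here is a small sketch in plain R (no Hadoop needed). The record below is made up and heavily shortened, and the variable names are only for illustration:

```r
# Made-up, heavily shortened record; a real one has far more fields.
record <- "125,  125 , 1.0, 1,0 , ?, 0,ad"

# Split on commas, then strip the uneven whitespace around each field.
fields <- gsub("\\s", "", unlist(strsplit(record, ",")))

label     <- fields[length(fields)]    # last field: "ad" or "nonad"
values    <- fields[-length(fields)]   # every attribute before the label
n_missing <- sum(values == "?")        # attributes marked as missing

# as.numeric() on a "?" would give NA with a warning, so convert
# only the fields that are actually present.
numeric_values <- as.numeric(values[values != "?"])
```

Without the whitespace stripping, the missing field would be " ?" rather than "?" and the comparison would not find it.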
2. Understanding the algorithm:
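map <- expression({
  # map.values arrives as a list; flatten it to a vector of record strings
  attr_val <- unlist(map.values)
  lapply(attr_val, function(i) {
    # Split the record on commas and strip the uneven whitespace
    attr <- unlist(strsplit(i, ","))
    attr <- gsub("\\s", "", attr)
    # Replace missing attributes ("?") with "0" so they do not affect the sum
    questionMark <- which(attr == "?")
    attr[questionMark] <- "0"
    # Emit the label (last field) as key and the sum of the attributes as value
    rhcollect(attr[length(attr)],
              sum(as.numeric(attr[1:(length(attr) - 1)])))
  })
})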
The values coming to the map expression will be in the form of a list,
so the first step is to unlist them. After unlisting, the variable
"attr_val" holds a bunch of records. Here you can use either a for
loop or lapply. To me a for loop looks messier and is likely to be
slower, so I have used lapply.
lapply takes one record at a time and processes it.
Once the record is inside lapply, the first step is to split the
string and unlist the result, which gives you a vector. Now you need to
add up the attributes, but keep in mind that
(A) There are uneven spaces between the values. For example, the
variable "attr" will contain values of this form:
"1"," 2","3 ",... something like this.
(B) In a record there might be some missing attributes, which are
marked by a question mark and must not contribute to the sum.
(C) The last field tells you whether the record belongs to "ad" or
"nonad".
After getting the vector of strings, first remove the unwanted spaces,
then find which values in the vector are question marks and replace
them with zero. Once that is done, emit the last value of the vector as
the key and the sum of the remaining values as the value.
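If you want to check these steps without running a job, the per-record logic can be tried in plain R. This is only a sketch: record_to_pair is a hypothetical helper that returns the key and value as a list instead of calling rhcollect, and the three records are made up and much shorter than real ones.

```r
# Hypothetical helper: same steps as the map expression, but it returns
# the (key, value) pair instead of emitting it with rhcollect.
record_to_pair <- function(record) {
  attr <- gsub("\\s", "", unlist(strsplit(record, ",")))  # split and trim
  attr[attr == "?"] <- "0"                                # missing -> "0"
  n <- length(attr)
  list(key = attr[n], value = sum(as.numeric(attr[seq_len(n - 1)])))
}

# Made-up, shortened records with uneven spaces and missing values.
records <- c("1, 2,3 ,ad", "4,?, 1,nonad", " 0.5,1,?,ad")
pairs <- lapply(records, record_to_pair)

for (p in pairs) cat(p$key, p$value, "\n")
```

Each record gives exactly one pair, so the number of pairs emitted equals the number of records read.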
reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- sum(total, unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)
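I have written the reduce in this form so that I can use a combiner,
which will speed up the process. Here, for every key, "pre" sets the
total to zero, "reduce" unlists the values and adds them to the total,
and "post" emits the key with its final total.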
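The reason a combiner is safe here is that a sum can be built from partial sums. A small sketch in plain R, using made-up pairs and a made-up split of those pairs over two map tasks (all names are only for illustration):

```r
# Made-up (key, value) pairs, as a map step might emit them.
keys   <- c("ad", "nonad", "ad", "nonad", "ad")
values <- c(6, 5, 1.5, 2, 3)

# Without a combiner: sum all values per key in one go.
direct <- tapply(values, keys, sum)

# With a combiner: each (hypothetical) map task first sums its own
# values per key, then the reduce sums those partial sums.
task     <- c(1, 1, 1, 2, 2)
partial  <- aggregate(values, by = list(key = keys, task = task), FUN = sum)
combined <- tapply(partial$x, partial$key, sum)

print(direct)
print(combined)
```

Both ways give the same total per key, which is why the same reduce expression can also serve as the combiner.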
Hope I made it clear.
Regards,
Som Shekhar