Hi!
Let's assume an example where I analyse web logs and emit the number of
hits per requesting IP address, and that I want both daily
and monthly numbers. Logs would be parsed on a daily basis.
The daily figure is easy: just emit key=<ip> and value=1,
then reduce that into the total number of hits per IP.
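
To make the daily job concrete, here is a rough sketch of what I have in
mind. It assumes the requesting IP is the first whitespace-separated field
of each log line (my real log format may differ), and run_local is only a
hypothetical helper that imitates the shuffle step so the two functions can
be tried without Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # value is one raw log line; ASSUMPTION: the IP is the first field.
    fields = value.split()
    if fields:
        yield fields[0], 1


def reducer(key, values):
    # values iterates over every 1 emitted for this IP.
    yield key, sum(values)


def run_local(lines):
    # Hypothetical stand-in for the shuffle/sort phase, for local testing only.
    mapped = sorted(kv for i, line in enumerate(lines) for kv in mapper(i, line))
    result = {}
    for ip, group in groupby(mapped, key=itemgetter(0)):
        for k, total in reducer(ip, (v for _, v in group)):
            result[k] = total
    return result
```

As far as I understand, with dumbo these two functions would be passed to
dumbo.run(mapper, reducer) instead of the local helper.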
For the monthly numbers, I think I want to do some kind of map-side
join to get a scalable solution. I've been looking at the map-side join
package in Hadoop, but I'm curious whether I can do this with dumbo.
http://dumbotics.com/2009/03/20/join-keys/ seems to do something that
is similar, but I'm a bit confused, so I would appreciate some hints on
how to solve my particular problem.
Thanks!
\EF