statistical reducer

Marcel Mitsuto F. S.

Nov 14, 2012, 2:48:38 PM

to dumbo...@googlegroups.com

I have a case where I need to reduce HTTP response codes, HTTP response times, and the ISP (looked up in MaxMind's GeoIPISP.dat) to get a glimpse of how different ISPs bring traffic to the web farm.

I'm looking for examples of how to statistically reduce response times to get the mean and standard deviation per HTTP response code, by ISP, but that is proving a little hard to accomplish. There is not much information out there on how to write such a reducer.

Has anyone here solved this, or does anyone have pointers I could follow up on?

Tobias Speckbacher

Nov 14, 2012, 3:34:31 PM

to dumbo...@googlegroups.com

Using a combination of ISP and response code as the output key from the mapper would allow you to calculate stats per response code per ISP.

Is this what you are looking for?

-Tobias

--
You received this message because you are subscribed to the Google Groups "dumbo-user" group.
To view this discussion on the web visit https://groups.google.com/d/msg/dumbo-user/-/Z0_9h6Muu08J.
To post to this group, send email to dumbo...@googlegroups.com.
To unsubscribe from this group, send email to dumbo-user+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/dumbo-user?hl=en.

Klaas Bosteels

Nov 14, 2012, 3:38:59 PM

to dumbo...@googlegroups.com

The statsreducer and statscombiner included with dumbo might be of interest:

https://github.com/klbostee/dumbo/blob/master/dumbo/lib/__init__.py

-K
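
A minimal sketch of the partial-sums idea that a stats reducer and combiner pair can rely on, written as stand-alone Python rather than against dumbo itself. The names to_partial, merge_partials and finalize are hypothetical, and the tuple layout (count, sum, sum of squares, min, max) is an assumption made for illustration; check the linked file for the value layout that statsreducer actually expects.

```python
from math import sqrt

def to_partial(x):
    """Mapper-side value for one observation (hypothetical layout):
    (count, sum, sum of squares, min, max)."""
    return (1, x, x * x, x, x)

def merge_partials(partials):
    """Combiner step: merge partial tuples into one partial tuple.
    The output has the same shape as the input, so it can be applied repeatedly."""
    count = total = total_sq = 0
    lowest, highest = float("inf"), float("-inf")
    for n, s, sq, lo, hi in partials:
        count += n
        total += s
        total_sq += sq
        lowest = min(lowest, lo)
        highest = max(highest, hi)
    return (count, total, total_sq, lowest, highest)

def finalize(partial):
    """Reducer step: turn the merged partial into (count, mean, std, min, max).
    Uses the population variance E[x^2] - E[x]^2, clamped at 0 against rounding."""
    count, total, total_sq, lowest, highest = partial
    mean = float(total) / count
    variance = max(float(total_sq) / count - mean * mean, 0.0)
    return (count, mean, sqrt(variance), lowest, highest)
```

Because merging only adds sums and keeps extremes, the same merge can run map-side as a combiner and again in the reducer without changing the final mean or standard deviation.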

Marcel Mitsuto F. S.

Nov 14, 2012, 3:53:55 PM

to dumbo...@googlegroups.com

I'm worried that the sumsreducer will end up summing all the HTTP response codes, as in the Apache log analysis example, where the bytes-transferred column got summed. I want to analyze:

1) HTTP response codes, where I'd reduce hits by ISP (I know how to do this);

2) mean HTTP response time, by ISP and by HTTP response code (I still don't know how to achieve this).

I'm still trying to wrap my head around the whole MapReduce M.O. Is there a way to reduce both items above in a single map and reduce run?

I'll read through the dumbo __init__.py and experiment with those reducers, thanks!

Tobias Speckbacher

Nov 14, 2012, 4:02:41 PM

to dumbo...@googlegroups.com

Simplistically, you could do something like this in your mapper:

    value = {'isp': isp, 'response_code': code, 'response_time': time}
    key = '::'.join([isp,code])
    yield key, value

In your reducer you will then receive all the values associated with a given combination of ISP and response code.

-Tobias
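
A minimal sketch of what a reducer over those grouped values could look like, assuming each value is the dict shown above. The name mean_std_reducer is hypothetical, and the running update used here is Welford's online algorithm, swapped in so the values are streamed once and never collected into a list.

```python
from math import sqrt

def mean_std_reducer(key, values):
    """Hypothetical reducer: `values` iterates over the dicts emitted for one
    'isp::code' key. Yields (key, (count, mean, population std))."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for value in values:
        x = float(value['response_time'])
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count:
        yield key, (count, mean, sqrt(m2 / count))
```

Emitting a full dict per log line is heavier than emitting a small tuple, so on large inputs a compact value such as (1, time) is likely the cheaper choice.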

Marcel Mitsuto F. S.

Nov 15, 2012, 11:32:46 AM

to dumbo...@googlegroups.com

Hello

I'm running into all sorts of errors trying to map this out.

First I got a MemoryError, so I stopped yielding key, value as advised and instead yielded only the key:

    mykey = '::'.join([isp,code,time])
    yield mykey, 1

The map phase then reaches 10%, but some tasks start getting killed with:

izip argument #1 must support iteration

Until the whole job gets killed.

So I went and removed time from mykey and did:

    mykey = '::'.join([isp,code])
    yield mykey, (1, int(time))

Only to run into memory errors again. After some googling I tried:

run(Mapper,sumsreducer,combiner=sumsreducer,buffersize=4096)

And apparently the job now ends successfully, BUT the numbers do not add up:

$ dumbo cat input.log -hadoop /usr -hadooplib /usr/share/hadoop | wc -l

13595587

$ dumbo cat ip2isp -hadoop /usr -hadooplib /usr/share/hadoop | awk '{sum+=$2} END {print sum}'

I was expecting the sum of column #2 from ip2isp (the -output path from the dumbo start command) to be equal to the number of log lines in input.log.

Here is the code:

import re
from socket import inet_ntoa
from struct import pack

class Mapper:
    def __init__(self):
        # Each log line: <ip as integer> <response code> <response time>
        self.regex = re.compile(r'(?P<ip>[\d]+)\s(?P<response>[\d]+)\s(?P<time>[\d]+)$')
        # Imported here so pygeoip is only needed where the mapper actually runs
        from pygeoip import GeoIP, MEMORY_CACHE
        self.geoip = GeoIP("/usr/share/geoip/GeoIPISP.dat", flags=MEMORY_CACHE)

    def __call__(self, key, value):
        mo = self.regex.match(value)
        if mo:
            # Convert the integer IP to dotted-quad form before the ISP lookup
            ip = inet_ntoa(pack("!L", int(mo.group("ip"))))
            isp = str(self.geoip.org_by_addr(ip))
            code, time = mo.group("response"), mo.group("time")
            mykey = '::'.join([isp, code])
            # Value is (hits, response time) so both columns can be summed per key
            yield mykey, (1, int(time))

if __name__ == "__main__":
    from dumbo import run, sumsreducer
    run(Mapper, sumsreducer, combiner=sumsreducer, buffersize=4096)

Any advice?

Thanks!

Marcel Mitsuto

Nov 15, 2012, 1:13:32 PM

to dumbo...@googlegroups.com

Sorry to reply to my own post,

I was mistakenly splitting the results on whitespace in awk instead of on tabs. Everything looks fine!

thanks!
/marcel
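
A minimal sketch of the same check done in Python, splitting on the tab only. It assumes each output line looks like `ISP::code<TAB>(hits, total_time)`; that line format, and the name summarize, are assumptions made for illustration and not something confirmed in this thread.

```python
from ast import literal_eval
from collections import defaultdict

def summarize(lines):
    """Parse 'ISP::code<TAB>(hits, total_time)' lines (assumed format).
    Returns (total hits, hits per ISP, mean response time per (ISP, code))."""
    total_hits = 0
    hits_by_isp = defaultdict(int)
    mean_time = {}
    for line in lines:
        # Split on the first tab only: ISP names may contain spaces
        key, raw_value = line.rstrip("\n").split("\t", 1)
        hits, total_time = literal_eval(raw_value)
        isp, code = key.rsplit("::", 1)
        total_hits += hits
        hits_by_isp[isp] += hits
        mean_time[(isp, code)] = float(total_time) / hits
    return total_hits, dict(hits_by_isp), mean_time
```

With (hits, total time) summed per key, a single job gives both the hits per ISP and the mean response time per ISP and response code; total_hits is the figure to compare against the number of input log lines that matched the mapper's regex.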

Gilles

Nov 16, 2012, 2:58:24 AM

to dumbo...@googlegroups.com

Marcel,

You can keep the version with

    mykey = '::'.join([isp,code,time])
    yield mykey, 1

But in that case you should use sumreducer and not sumsreducer:

run(Mapper,sumreducer,combiner=sumreducer,buffersize=4096)

-Gilles
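
A minimal sketch of why the value shape has to match the reducer. The two functions below are hypothetical stand-ins written for illustration, not dumbo's own code: one adds plain numbers, the other adds tuples column by column.

```python
def sum_scalars(key, values):
    """Stand-in for a reducer that adds plain numbers, e.g. values of 1."""
    yield key, sum(values)

def sum_columns(key, values):
    """Stand-in for a reducer that adds tuples element-wise,
    e.g. values of (1, response_time)."""
    yield key, tuple(sum(column) for column in zip(*values))

# Plain counts work with the scalar version
print(list(sum_scalars('isp::200::35', [1, 1, 1])))

# (hits, time) tuples work with the column version
print(list(sum_columns('isp::200', [(1, 100), (1, 250)])))

# Plain counts fed to the column version fail: an int cannot be unpacked into columns
try:
    list(sum_columns('isp::200::35', [1, 1]))
except TypeError as error:
    print('TypeError:', error)
```

Under that reading, the "must support iteration" failure earlier in the thread is what you would expect when a tuple-summing reducer receives a bare 1 as its value.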
