You might want to try defaultdict instead of dict to see whether that
would be faster:
from collections import defaultdict
import fileinput
import sys

match_counter = defaultdict(int)
for line in fileinput.input(sys.argv[1:]):
    ip = line.split()[0]
    match_counter[ip] += 1
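For what it's worth, on a few made-up sample lines the defaultdict version gives the same counts as the explicit if/else (the log data here is invented, just to show the behaviour):

```python
from collections import defaultdict

# Invented sample log lines, first field is the IP
lines = [
    "1.2.3.4 - - GET /index.html",
    "5.6.7.8 - - GET /foo",
    "1.2.3.4 - - GET /bar",
]

match_counter = defaultdict(int)
for line in lines:
    ip = line.split()[0]
    match_counter[ip] += 1  # missing keys start at int() == 0

print(dict(match_counter))  # {'1.2.3.4': 2, '5.6.7.8': 1}
```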
> Hi all,
>
> I'm parsing a 4.1 GB Apache log to get stats about how many times an IP
> requests something from the server.
>
> The first design of the algorithm was
>
> for line in fileinput.input(sys.argv[1:]):
>     ip = line.split()[0]
>     if match_counter.has_key(ip):
>         match_counter[ip] += 1
>     else:
>         match_counter[ip] = 1
>
> And it took 3 min 58 s to give me the stats
>
> Then I tried a generator solution like
>
> def generateit():
>     for line in fileinput.input(sys.argv[1:]):
>         yield line.split()[0]
>
> for ip in generateit():
>     ... the same if statement
>
> Instead of being faster it took 4 min 20 s
>
> Should I leave fileinput behind?
> Am I using generators with the wrong approach?
What's fileinput? A file-like object (unlikely)? Also, what's
fileinput.input? I guess the reason you don't see much difference
(and it is in fact slower) lies in what fileinput.input does.
Generators excel at processing huge data sets, since they don't have to
create huge intermediate lists, which eat up memory. Given infinite
memory, a generator solution is almost always slower than a
straightforward solution using lists. However, in real life we don't
have infinite memory; hogging memory with a huge intermediate list would
make the system start swapping, and swapping is very slow and a big hit
to performance. That is how a generator can be faster than a list.
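A tiny illustration of the trade-off (the numbers are arbitrary): both
forms compute the same result, but only one builds the intermediate list.

```python
# Summing squares: the list comprehension materialises every value first;
# the generator expression hands them to sum() one at a time.
nums = range(100000)

via_list = sum([n * n for n in nums])  # builds a 100000-element list
via_gen = sum(n * n for n in nums)     # constant extra memory

assert via_list == via_gen
```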
> Hi all,
>
> I'm parsing a 4.1 GB Apache log to get stats about how many times an IP
> requests something from the server.
>
> The first design of the algorithm was
>
> for line in fileinput.input(sys.argv[1:]):
>     ip = line.split()[0]
>     if match_counter.has_key(ip):
>         match_counter[ip] += 1
>     else:
>         match_counter[ip] = 1
nitpick:
dict.has_key is usually replaced with

if ip in match_counter: ...

Also, after investigating your code further, I see that you've used
generators unnecessarily; the first version is simpler, and you haven't
avoided building any huge intermediate list by using a generator this
way. You won't get any performance improvement from it, and instead take
a performance hit from the function-call overhead and name lookups.
Fileinput is a standard module distributed with Python.

From the manual:

11.2 fileinput -- Iterate over lines from multiple input streams

This module implements a helper class and functions to quickly write a
loop over standard input or a list of files.

The typical use is:

import fileinput
for line in fileinput.input():
    process(line)
...
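To see it in action on something small (throwaway temp files here, not real logs):

```python
import fileinput
import os
import tempfile

# Two throwaway files standing in for the log files
paths = []
for chunk in ("a b\nc d\n", "e f\n"):
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write(chunk)
    paths.append(path)

# fileinput.input() chains all the files into one stream of lines
lines = list(fileinput.input(paths))
print(len(lines))  # 3 lines across both files

for path in paths:
    os.remove(path)
```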
> Generators excel at processing huge data sets, since they don't have to
> create huge intermediate lists, which eat up memory. Given infinite
> memory, a generator solution is almost always slower than a
> straightforward solution using lists. However, in real life we don't
> have infinite memory; hogging memory with a huge intermediate list
> would make the system start swapping, and swapping is very slow and a
> big hit to performance. That is how a generator can be faster than a
> list.
>
>
This can be a little faster still:
match_counter = defaultdict(int)
for line in fileinput.input(sys.argv[1:]):
    ip = line.split(None, 1)[0]
    match_counter[ip] += 1
Bye,
bearophile
I'd say it's just slightly more than a nitpick nowadays: since Python
3.0 does not support has_key, it's time to get used to not using it :)
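For anyone migrating, the replacements look like this (the dict contents
here are made up):

```python
match_counter = {"1.2.3.4": 2}

# Python 2 only: match_counter.has_key("1.2.3.4")
# Python 2 and 3: the `in` operator
assert "1.2.3.4" in match_counter
assert "5.6.7.8" not in match_counter

# dict.get with a default sidesteps the membership test entirely
assert match_counter.get("5.6.7.8", 0) == 0
```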
--RDM
> This can be a little faster still:
>
> match_counter = defaultdict(int)
> for line in fileinput.input(sys.argv[1:]):
>     ip = line.split(None, 1)[0]
>     match_counter[ip] += 1
>
> Bye,
> bearophile
Or maybe (untested):

match_counter = defaultdict(int)
for line in fileinput.input(sys.argv[1:]):
    ip = line[:line.index(' ')]
    match_counter[ip] += 1

Or even (untested, I've never tried this ;):

from itertools import count

match_counter = defaultdict(count)
for line in fileinput.input(sys.argv[1:]):
    ip = line[:line.index(' ')]
    match_counter[ip].next()

match_total = dict((key, val()) for key, val in match_counter.iteritems())
Obviously the variable 'ip' could be removed in all cases.
--
Arnaud
> match_total = dict((key, val()) for key, val in match_counter.iteritems())

Sorry, I meant

match_total = dict((key, val.next())
                   for key, val in match_counter.iteritems())
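For the record, under Python 3 the same trick would be spelled with
next() and items() instead of .next() and .iteritems(). A sketch on
made-up sample lines:

```python
from collections import defaultdict
from itertools import count

lines = ["1.2.3.4 GET /a", "5.6.7.8 GET /b", "1.2.3.4 GET /c"]

match_counter = defaultdict(count)  # each new key gets a fresh count() iterator
for line in lines:
    ip = line[:line.index(' ')]
    next(match_counter[ip])  # advance the iterator once per hit

# the next value of each iterator is exactly the number of hits so far
match_total = {key: next(val) for key, val in match_counter.items()}
print(match_total)  # {'1.2.3.4': 2, '5.6.7.8': 1}
```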
--
Arnaud
Yes. fileinput is slow because it does a lot more than just read
files.
> > Am i using generators with the wrong aproach?
No need for a generator here. The time is being lost with fileinput,
split, and the counting code. Try this instead:
import collections
import sys

match_counter = collections.defaultdict(int)
for filename in sys.argv[1:]:
    for line in open(filename):
        ip, sep, rest = line.partition(' ')
        match_counter[ip] += 1
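A self-contained version of the same idea on made-up sample lines (the
data is invented, just to show it counts correctly):

```python
import collections

# Invented sample lines standing in for the log files
sample = ["1.2.3.4 GET /a", "5.6.7.8 GET /b", "1.2.3.4 GET /c"]

match_counter = collections.defaultdict(int)
for line in sample:
    ip, sep, rest = line.partition(' ')  # split once, at the first space
    match_counter[ip] += 1

print(dict(match_counter))  # {'1.2.3.4': 2, '5.6.7.8': 1}
```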
If you're on *nix, there's a fast command line approach:
cut -d' ' -f1 filelist | sort | uniq -c
split() might split the string into many parts, but you want only the
first part (at most 1 split), so the extra splits are unnecessary and
waste time. split(None, 1) and partition(' ') do at most 1 split, so
they don't do unnecessary work.
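Concretely, on a made-up log line:

```python
line = '1.2.3.4 - - [10/Oct/2008] "GET / HTTP/1.0" 200'

# split() breaks at every run of whitespace
assert line.split()[0] == "1.2.3.4"

# split(None, 1) stops after the first split: just two pieces
assert line.split(None, 1)[0] == "1.2.3.4"
assert len(line.split(None, 1)) == 2

# partition(' ') returns a (head, separator, tail) tuple, no list at all
assert line.partition(' ')[0] == "1.2.3.4"
```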
Here's a very quick test:
>>> timeit.Timer("a='fsdjflksjdf as fsd jkfl sdj'; b=a.partition(' ')[0]").timeit(5000000)
2.9831948280334473
>>> timeit.Timer("a='fsdjflksjdf as fsd jkfl sdj'; b=a[:a.find(' ')]").timeit(5000000)
3.0122888088226318
>>> timeit.Timer("a='fsdjflksjdf as fsd jkfl sdj'; b=a.split(None, 1)[0]").timeit(5000000)
4.5993850231170654
>>> timeit.Timer("a='fsdjflksjdf as fsd jkfl sdj'; b=a.split()[0]").timeit(5000000)
5.2067980766296387
So .split(None, 1) is quite a bit slower than .partition(' ') apparently.
--
Arnaud