You might want to try defaultdict instead of dict to see whether that
would be faster:
from collections import defaultdict
import fileinput
import sys

match_counter = defaultdict(int)
for line in fileinput.input(sys.argv[1:]):
    ip = line.split()[0]
    match_counter[ip] += 1
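For what it's worth, on a few made-up sample lines the defaultdict version gives the same counts as the explicit if/else (the log data here is invented, just to show the behaviour):

```python
from collections import defaultdict

# Invented sample log lines, first field is the IP
lines = [
    "1.2.3.4 - - GET /index.html",
    "5.6.7.8 - - GET /foo",
    "1.2.3.4 - - GET /bar",
]

match_counter = defaultdict(int)
for line in lines:
    ip = line.split()[0]
    match_counter[ip] += 1  # missing keys start at int() == 0

print(dict(match_counter))  # {'1.2.3.4': 2, '5.6.7.8': 1}
```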
> Hi all,
>
> I'm parsing a 4.1 GB Apache log to get stats about how many times an IP
> requests something from the server.
>
> The first design of the algorithm was
>
> for line in fileinput.input(sys.argv[1:]):
>     ip = line.split()[0]
>     if match_counter.has_key(ip):
>         match_counter[ip] += 1
>     else:
>         match_counter[ip] = 1
>
> And it took 3 min 58 s to give me the stats
>
> Then I tried a generator solution like
>
> def generateit():
>     for line in fileinput.input(sys.argv[1:]):
>         yield line.split()[0]
>
> for ip in generateit():
>     ... the same if statement
>
> Instead of being faster it took 4 min 20 s
>
> Should I leave fileinput behind?
> Am I using generators with the wrong approach?
What's fileinput? A file-like object (unlikely)? Also, what's
fileinput.input? I guess the reason you don't see much difference
(and it is in fact slower) lies in what fileinput.input does.
Generators excel at processing huge data sets, since they don't have to
create huge intermediate lists, which eat up memory. Given infinite
memory, a generator solution is almost always slower than a
straightforward solution using lists. However, in real life we don't
have infinite memory; hogging memory with a huge intermediate list would
make the system start swapping, and swapping is very slow and a big hit
to performance. That is how a generator can be faster than a list.
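A tiny illustration of the trade-off (the numbers are arbitrary): both
forms compute the same result, but only one builds the intermediate list.

```python
# Summing squares: the list comprehension materialises every value first;
# the generator expression hands them to sum() one at a time.
nums = range(100000)

via_list = sum([n * n for n in nums])  # builds a 100000-element list
via_gen = sum(n * n for n in nums)     # constant extra memory

assert via_list == via_gen
```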
> Hi all,
>
> I'm parsing a 4.1 GB Apache log to get stats about how many times an IP
> requests something from the server.
>
> The first design of the algorithm was
>
> for line in fileinput.input(sys.argv[1:]):
>     ip = line.split()[0]
>     if match_counter.has_key(ip):
>         match_counter[ip] += 1
>     else:
>         match_counter[ip] = 1
nitpick:
dict.has_key is usually replaced with

if ip in match_counter: ...

Also, after investigating your code further, I see that you've used
generators unnecessarily; the first version is simpler, and you haven't
avoided building any huge intermediate list by using a generator this
way. You won't get any performance improvement from it, and instead take
a performance hit from the function-call overhead and name lookups.
Fileinput is a standard module distributed with Python.

From the manual:

11.2 fileinput -- Iterate over lines from multiple input streams

This module implements a helper class and functions to quickly write a
loop over standard input or a list of files.

The typical use is:

import fileinput
for line in fileinput.input():
    process(line)
...
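To see it in action on something small (throwaway temp files here, not real logs):

```python
import fileinput
import os
import tempfile

# Two throwaway files standing in for the log files
paths = []
for chunk in ("a b\nc d\n", "e f\n"):
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write(chunk)
    paths.append(path)

# fileinput.input() chains all the files into one stream of lines
lines = list(fileinput.input(paths))
print(len(lines))  # 3 lines across both files

for path in paths:
    os.remove(path)
```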
> Generators excel at processing huge data sets, since they don't have to
> create huge intermediate lists, which eat up memory. Given infinite
> memory, a generator solution is almost always slower than a
> straightforward solution using lists. However, in real life we don't
> have infinite memory; hogging memory with a huge intermediate list
> would make the system start swapping, and swapping is very slow and a
> big hit to performance. That is how a generator can be faster than a
> list.
>
>
This can be a little faster still:
match_counter = defaultdict(int)
for line in fileinput.input(sys.argv[1:]):
    ip = line.split(None, 1)[0]
    match_counter[ip] += 1
Bye,
bearophile
I'd say it's just slightly more than a nitpick nowadays: since Python
3.0 does not support has_key, it's time to get used to not using it :)
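For anyone migrating, the replacements look like this (the dict contents
here are made up):

```python
match_counter = {"1.2.3.4": 2}

# Python 2 only: match_counter.has_key("1.2.3.4")
# Python 2 and 3: the `in` operator
assert "1.2.3.4" in match_counter
assert "5.6.7.8" not in match_counter

# dict.get with a default sidesteps the membership test entirely
assert match_counter.get("5.6.7.8", 0) == 0
```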
--RDM
> This can be a little faster still:
>
> match_counter = defaultdict(int)
> for line in fileinput.input(sys.argv[1:]):
>     ip = line.split(None, 1)[0]
>     match_counter[ip] += 1
>
> Bye,
> bearophile
Or maybe (untested):

match_counter = defaultdict(int)
for line in fileinput.input(sys.argv[1:]):
    ip = line[:line.index(' ')]
    match_counter[ip] += 1

Or even (untested, I've never tried this ;):

from itertools import count

match_counter = defaultdict(count)
for line in fileinput.input(sys.argv[1:]):
    ip = line[:line.index(' ')]
    match_counter[ip].next()

match_total = dict((key, val()) for key, val in match_counter.iteritems())
Obviously the variable 'ip' could be removed in all cases.
--
Arnaud
> match_total = dict((key, val()) for key, val in match_counter.iteritems())

Sorry, I meant

match_total = dict((key, val.next())
                   for key, val in match_counter.iteritems())
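For the record, under Python 3 the same trick would be spelled with
next() and items() instead of .next() and .iteritems(). A sketch on
made-up sample lines:

```python
from collections import defaultdict
from itertools import count

lines = ["1.2.3.4 GET /a", "5.6.7.8 GET /b", "1.2.3.4 GET /c"]

match_counter = defaultdict(count)  # each new key gets a fresh count() iterator
for line in lines:
    ip = line[:line.index(' ')]
    next(match_counter[ip])  # advance the iterator once per hit

# the next value of each iterator is exactly the number of hits so far
match_total = {key: next(val) for key, val in match_counter.items()}
print(match_total)  # {'1.2.3.4': 2, '5.6.7.8': 1}
```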
--
Arnaud
Yes. fileinput is slow because it does a lot more than just read
files.
> > Am i using generators with the wrong aproach?
No need for a generator here. The time is being lost with fileinput,
split, and the counting code. Try this instead:
import collections
import sys

match_counter = collections.defaultdict(int)
for filename in sys.argv[1:]:
    for line in open(filename):
        ip, sep, rest = line.partition(' ')
        match_counter[ip] += 1
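A self-contained version of the same idea on made-up sample lines (the
data is invented, just to show it counts correctly):

```python
import collections

# Invented sample lines standing in for the log files
sample = ["1.2.3.4 GET /a", "5.6.7.8 GET /b", "1.2.3.4 GET /c"]

match_counter = collections.defaultdict(int)
for line in sample:
    ip, sep, rest = line.partition(' ')  # split once, at the first space
    match_counter[ip] += 1

print(dict(match_counter))  # {'1.2.3.4': 2, '5.6.7.8': 1}
```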
If you're on *nix, there's a fast command line approach:
cut -d' ' -f1 filelist | sort | uniq -c
split() might split the string into many parts, but you want only the
first part (at most 1 split), so the extra splits are unnecessary and
waste time. split(None, 1) and partition(' ') do at most 1 split, so
they don't do unnecessary work.
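Concretely, on a made-up log line:

```python
line = '1.2.3.4 - - [10/Oct/2008] "GET / HTTP/1.0" 200'

# split() breaks at every run of whitespace
assert line.split()[0] == "1.2.3.4"

# split(None, 1) stops after the first split: just two pieces
assert line.split(None, 1)[0] == "1.2.3.4"
assert len(line.split(None, 1)) == 2

# partition(' ') returns a (head, separator, tail) tuple, no list at all
assert line.partition(' ')[0] == "1.2.3.4"
```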
Here's a very quick test:
>>> timeit.Timer("a='fsdjflksjdf as fsd jkfl sdj'; b=a.partition(' ')[0]").timeit(5000000)
2.9831948280334473
>>> timeit.Timer("a='fsdjflksjdf as fsd jkfl sdj'; b=a[:a.find(' ')]").timeit(5000000)
3.0122888088226318
>>> timeit.Timer("a='fsdjflksjdf as fsd jkfl sdj'; b=a.split(None, 1)[0]").timeit(5000000)
4.5993850231170654
>>> timeit.Timer("a='fsdjflksjdf as fsd jkfl sdj'; b=a.split()[0]").timeit(5000000)
5.2067980766296387
So .split(None, 1) is quite a bit slower than .partition(' ') apparently.
--
Arnaud