Newsgroups: comp.lang.python
From: Paul Rubin <http://phr...@NOSPAM.invalid>
Date: 18 Jan 2008 09:58:57 -0800
Local: Fri, Jan 18 2008 12:58 pm
Subject: Re: Efficient processing of large nuumeric data file

David Sanders <dpsand...@gmail.com> writes:
> The data files are large (~100 million lines), and this code takes a
> long time to run (compared to just doing wc -l, for example).

wc is written in carefully optimized C and will almost certainly
run faster than any Python program.

> Am I doing something very inefficient?  (Any general comments on my
> pythonic (or otherwise) style are also appreciated!)  Is
> "line.split()" efficient, for example?

Your implementation's efficiency is not too bad.  Stylistically it's
not quite fluent but there's nothing to really criticize--you may
develop a more concise style with experience, or maybe not.
One small optimization you could make is to use collections.defaultdict
to hold the counters instead of a regular dict, so you can get rid of
the test for whether a key is in the dict.  
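
A minimal sketch of the defaultdict idea.  Your actual file layout isn't
shown here, so this assumes each line holds a key followed by an integer
count, separated by whitespace; the function name count_keys is made up
for illustration.

```python
from collections import defaultdict

def count_keys(lines):
    """Sum the counts per key; assumes each line looks like 'key count'."""
    totals = defaultdict(int)  # a missing key starts at 0, so no "in" test
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        totals[fields[0]] += int(fields[1])
    return totals
```

You would call it as count_keys(open(filename)), which reads the file one
line at a time rather than loading it all at once.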

Keep an eye on your program's memory consumption as it runs.  The
overhead of a pair of Python ints and a dictionary cell to hold them
is some dozens of bytes at minimum.  If you have a lot of distinct
keys and not enough memory to hold them all in the large dict, your
system may be thrashing.  If that is happening, the two basic
solutions are 1) buy more memory; or, 2) divide the input into smaller
pieces, attack them separately, and merge the results.
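
One way to do the divide-and-merge approach, as a rough sketch: spread
the lines over a few temporary files by hashing the key, then count each
file separately.  Every occurrence of a key lands in the same file, so
the per-file results never overlap and "merging" is just concatenating
them.  The same assumed 'key count' line layout applies, and the name
count_in_partitions and the default of 4 partitions are arbitrary.

```python
import os
import tempfile
from collections import defaultdict

def count_in_partitions(lines, num_parts=4):
    """Yield (key, total) pairs, holding one partition's keys in memory at a time."""
    tmpdir = tempfile.mkdtemp()
    paths = [os.path.join(tmpdir, "part%d.txt" % i) for i in range(num_parts)]
    files = [open(p, "w") for p in paths]
    try:
        # Pass 1: route each line to a partition chosen by its key's hash.
        for line in lines:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            part = hash(fields[0]) % num_parts
            files[part].write("%s %s\n" % (fields[0], fields[1]))
    finally:
        for f in files:
            f.close()
    # Pass 2: count each partition on its own, then discard it.
    for path in paths:
        totals = defaultdict(int)
        with open(path) as f:
            for line in f:
                key, count = line.split()
                totals[key] += int(count)
        os.remove(path)
        for item in totals.items():
            yield item
    os.rmdir(tmpdir)
```

This costs an extra pass over the data on disk, so it is only worth doing
if the single big dict really doesn't fit in memory.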

If I were writing this program and didn't have to run it too often,
I'd probably use the Unix "sort" utility to sort the input (that
utility does an external disk sort if the input is large enough to
require it), then make a single pass over the sorted output to count up
each group of keys (see itertools.groupby for a convenient way to do
that).
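
The single pass over sorted input might look roughly like this, again
assuming 'key count' lines; count_sorted is a made-up name.  groupby only
groups adjacent items, which is exactly why the input has to be sorted by
key first.

```python
from itertools import groupby

def count_sorted(lines):
    """Yield (key, total) from lines that are already sorted by key."""
    # Split each non-blank line into its whitespace-separated fields.
    rows = (line.split() for line in lines if line.strip())
    # Consecutive rows sharing a first field form one group.
    for key, group in groupby(rows, key=lambda fields: fields[0]):
        yield key, sum(int(fields[1]) for fields in group)
```

Only one group is in memory at a time, so this works no matter how many
distinct keys there are.  When producing the sorted file, something like
"LC_ALL=C sort -k1,1" is probably the safer invocation, since some
locales ignore punctuation when comparing and could leave equal keys
non-adjacent.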


 