Newsgroups: comp.lang.python
From: George Sakkis <george.sak...@gmail.com>
Date: Fri, 18 Jan 2008 09:50:40 -0800 (PST)
Local: Fri, Jan 18 2008 12:50 pm
Subject: Re: Efficient processing of large numeric data file
On Jan 18, 12:15 pm, David Sanders <dpsand...@gmail.com> wrote:
> Hi,
> I am processing large files of numerical data. Each line is either a
> My question is how to process such files efficiently to obtain a
> -------------------
> import sys
> if num_args < 2:
> name = args[1]
> hist = {} # dictionary for histogram
> for line in file:
> if len(data) == 1:
> if first in hist: # add the information to the histogram
> num+=count
> keys = hist.keys()
> print "# i fraction hist[i]"
> The data files are large (~100 million lines), and this code takes a
> Am I doing something very inefficient? (Any general comments on my

Without further information, I don't see anything particularly
inefficient. What may help here is if you have any a priori knowledge
about the data, specifically:

- How often does a single number occur compared to a pair of numbers?
  [...] Similarly if the pair is much more frequent than the single
  number, [...]

- What proportion of the first numbers is unique? If it's small [...]

> Is a dictionary the right way to do this? In any given file, there is
> an upper bound on the data, so it seems to me that some kind of array
> (numpy?) would be more efficient, but the upper bound changes in each
> file.

Yes, dict is the right data structure; since Python 2.5,
collections.defaultdict is an alternative. numpy is good for
processing numeric data once they are already in arrays, not for
populating them.

George
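
For illustration, here is a minimal sketch of the collections.defaultdict approach mentioned in the reply. It is not the original poster's script. It assumes each line holds either a single integer or an integer followed by a repeat count, and the helper name build_histogram and the sample data are made up for the example.

```python
from collections import defaultdict


def build_histogram(lines):
    """Accumulate counts from lines of the form 'value' or 'value count'."""
    hist = defaultdict(int)  # missing keys start at 0, so no 'in' test is needed
    total = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        value = int(fields[0])
        # Assumption: a second field, when present, is a repeat count.
        count = int(fields[1]) if len(fields) > 1 else 1
        hist[value] += count
        total += count
    return hist, total


# Hypothetical sample input standing in for the lines of a data file.
sample = ["3", "5 2", "3 4", "7"]
hist, total = build_histogram(sample)

print("# i fraction hist[i]")
for value in sorted(hist):
    print(value, hist[value] / total, hist[value])
```

Because the function only iterates over its argument, an open file object can be passed in place of the list, so the data is read line by line rather than loaded into memory at once.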