Newsgroups: comp.lang.python
From: Tim Chase <python.l...@tim.thechases.com>
Date: Fri, 18 Jan 2008 12:06:56 -0600
Local: Fri, Jan 18 2008 1:06 pm
Subject: Re: Efficient processing of large nuumeric data file
> for line in file:

The first thing I would try is just doing a

    for line in file:

to see how much time is consumed merely by iterating over the [...]

>     data = line.split()
>     first = int(data[0])
>     if len(data) == 1:
>     [...]

Well, some experiments I might try:

    try:
        [...]

or possibly

    first = int(data[0])
    [...]

or even

    # pad it to contain at least two items
    [...]

I don't know how efficient len() is (if it's internally linearly [...]

I'm not sure any of them is more or less "pythonic", but they [...]

>     if first in hist:    # add the information to the histogram
>         hist[first]+=count
>     else:
>         hist[first]=count

This might also be written as

    hist[first] = hist.get(first, 0) + count

> Is a dictionary the right way to do this?  In any given file, there is
> an upper bound on the data, so it seems to me that some kind of array
> (numpy?) would be more efficient, but the upper bound changes in each
> file.

I'm not sure an array would net you great savings here, since the
upper bound seems to be an unknown.  If "first" has a known maximum
(surely, the program generating this file has an idea of the range of
allowed values), you could just create an array the length of the span
of numbers, initialized to zero, which would reduce the hist.get()
call to just

    hist[first] += count

and then you'd iterate over hist (which would already be sorted [...]

Otherwise, your code looks good...the above just riff on various [...]

-tkc
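As a rough sketch of the variants discussed above (parsing a one- or two-field line with try/except or by padding, accumulating with hist.get(), and using a preallocated array when a maximum is known), something like the following could serve as a starting point. It assumes Python 3, non-blank lines holding one or two integers, and non-negative values; the function names and the max_value parameter are made up for illustration and are not from the original code.

```python
def parse_try(line):
    """Assume two fields; fall back to a count of 1 if unpacking fails."""
    data = line.split()
    try:
        first, count = map(int, data)
    except ValueError:
        # Assumption: the failure means only one field was present.
        first, count = int(data[0]), 1
    return first, count


def parse_pad(line):
    """Pad to at least two items, keep the first two, convert both."""
    data = line.split()
    first, count = map(int, (data + [1])[:2])
    return first, count


def histogram_dict(lines):
    """Dictionary histogram using dict.get() with a default of 0."""
    hist = {}
    for line in lines:
        first, count = parse_pad(line)
        hist[first] = hist.get(first, 0) + count
    return hist


def histogram_list(lines, max_value):
    """List histogram; max_value is a hypothetical known upper bound."""
    hist = [0] * (max_value + 1)
    for line in lines:
        first, count = parse_pad(line)
        hist[first] += count
    return hist


if __name__ == "__main__":
    sample = ["3", "5 2", "3 4", "0"]
    print(histogram_dict(sample))
    counts = histogram_list(sample, 5)
    # The list is already in index order; skip the empty slots.
    print([(value, c) for value, c in enumerate(counts) if c])
```

Which parsing variant is faster depends on the data, so timing each one against a real file (for example with the timeit module) would be the way to decide.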