Newsgroups: comp.lang.python
From: Matimus <mccre...@gmail.com>
Date: Fri, 18 Jan 2008 09:55:56 -0800 (PST)
Local: Fri, Jan 18 2008 12:55 pm
Subject: Re: Efficient processing of large numeric data file
On Jan 18, 9:15 am, David Sanders <dpsand...@gmail.com> wrote:
> Hi,
> I am processing large files of numerical data. Each line is either a
> My question is how to process such files efficiently to obtain a
> -------------------
> import sys
> if num_args < 2:
> name = args[1]
> hist = {} # dictionary for histogram
> for line in file:
> if len(data) == 1:
> if first in hist: # add the information to the histogram
> num+=count
> keys = hist.keys()
> print "# i fraction hist[i]"
> The data files are large (~100 million lines), and this code takes a
> Am I doing something very inefficient? (Any general comments on my
> Is a dictionary the right way to do this? In any given file, there is

My first suggestion is to wrap your code in a function. Code inside a
function runs much faster in Python than module-level code, because
local variable lookups are cheaper than global ones, so that will give
you a speedup right away.

My second suggestion is to look into using defaultdict for your
histogram. A dictionary is a very appropriate way to store this data.
There has been some mention of a bag type, which would do exactly what
you need, but unfortunately there is not a built-in bag type (yet).

I would write it something like this:

from collections import defaultdict

def get_hist(file_name):

HTH

Matt
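For illustration, here is a minimal sketch of the two suggestions above (do the work inside a function, and count with defaultdict). It is not the code from the original post, whose function body did not survive. The input format is an assumption inferred from the quoted fragments: each non-empty line holds either a single integer, or an integer followed by a repeat count. The returned total and the Python 3 syntax are also choices made for this sketch.

```python
from collections import defaultdict


def get_hist(file_name):
    """Build a histogram from a file of integers.

    Assumed format (not confirmed by the original post): each non-empty
    line is either "value" or "value count".
    Returns (hist, total), where hist maps value -> number of occurrences
    and total is the sum of all counts.
    """
    hist = defaultdict(int)  # missing keys start at 0
    total = 0
    with open(file_name) as f:
        for line in f:
            data = line.split()
            if not data:
                continue  # skip blank lines
            first = int(data[0])
            # A lone value counts once; a pair carries its own count.
            count = int(data[1]) if len(data) > 1 else 1
            hist[first] += count
            total += count
    return hist, total
```

Calling get_hist on a file name gives back the counts and the total, so the fraction for a value i is hist[i] / total. Later Python versions (2.7 and 3.1 onward) also provide collections.Counter, which plays the role of the bag type mentioned above.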