Hi Peter ...
Welcome to the group.
> I want to do the above with Python. I've written something using xlrd
> which works, but it's a bit slow.
>
> There are a few complications with my data. The first is that it's
> rather voluminous - each file is about 70MB in size. The second is
> that each tab (there are usually 7) has a header of information that
> contains some funky unicode stuff,
It depends on one's perspective, I suppose; mine is that Unicode is not
funky, it is "a thing of beauty and a joy forever" :-)
> so I can't use any nice efficiency
> savings like list comprehensions on it - or at least I can't figure
> out how to do that. My code is posted below. It takes about a minute
> to run on a quad-core Xeon with 8GB RAM. About half the time is in
> reading in the file and half in writing.
>
Using psyco in the simple brute-force fashion should cut about 70% off
your open time, but add a little to your dump time. Recoding your dump
routine (see attached file) cut about 30% off the dump time on my setup.
I note that your code assumes that there are no commas, quotes, or
newlines in your text cells. I've used the csv module which does the
"right thing", and does it at C speed instead of Python speed.
Note: Anyone borrowing this code to write a general-purpose xls2csv
routine will need to do better with date, boolean and error cell-types
... see the xlrd-supplied runxlrd.py for a few clues.
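As a rough idea of what "doing better" could look like, here is a minimal sketch written for Python 3. It is not the attached file and not what runxlrd.py does. The helper name cell_to_text is hypothetical, the constants are copied from xlrd's XL_CELL_* values so the sketch runs without xlrd installed, and the date arithmetic assumes the 1900 date system; in real code xlrd.xldate_as_tuple(value, book.datemode) is the safer route.

```python
import datetime

# Numeric values of xlrd's XL_CELL_* cell-type constants, copied here so
# the sketch runs without xlrd; real code would import them from xlrd.
XL_CELL_DATE = 3
XL_CELL_BOOLEAN = 4
XL_CELL_ERROR = 5

# Excel's error codes; xlrd offers the same mapping as error_text_from_code.
ERROR_TEXT = {0x00: "#NULL!", 0x07: "#DIV/0!", 0x0F: "#VALUE!", 0x17: "#REF!",
              0x1D: "#NAME?", 0x24: "#NUM!", 0x2A: "#N/A"}


def cell_to_text(ctype, value):
    """Hypothetical helper: turn one (ctype, value) pair into CSV-ready data."""
    if ctype == XL_CELL_DATE:
        # Assumes the 1900 date system and a serial >= 61 (after Excel's
        # fictitious 1900-02-29); xlrd.xldate_as_tuple covers the general case.
        moment = datetime.datetime(1899, 12, 30) + datetime.timedelta(days=value)
        return moment.isoformat(sep=" ")
    if ctype == XL_CELL_BOOLEAN:
        return "TRUE" if value else "FALSE"
    if ctype == XL_CELL_ERROR:
        return ERROR_TEXT.get(value, "#ERR%d" % value)
    return value  # text and number cells pass through unchanged
```

With xlrd itself, the two arguments would come from cell.ctype and cell.value for each cell in a row.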
HTH,
John
Proper handling of embedded quotes/commas/newlines is an even better
reason :-)
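To make the quoting point concrete, here is a minimal sketch (written for Python 3, with a made-up row; not the attached file) of what the csv module does with awkward cells:

```python
import csv
import io

# Hypothetical row containing a comma, a double quote and a newline.
row = ["plain", "a,b", 'say "hi"', "line1\nline2"]

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
text = buf.getvalue()
print(text)

# Reading the text back recovers the original cells unchanged.
round_trip = next(csv.reader(io.StringIO(text)))
```

Joining the same cells with a bare ",".join(row) would produce a line that no CSV reader could split back into four fields.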
>
> The input performance led me on a merry tour through psycho, its lack
> of 64 bit support and its spiritual successor PyPy. This looks very
> interesting, but somewhat less of a quick fix than psycho would be if
> I was running in 32 bit. Is there a drop in replacement for
> pscyho.full() I can use on a 64 bit system?
Given that the author of psycho/pscyho/psyco says on the website that he
is not going to do it and is concentrating on PyPy instead, and that this
is a very narrowly specialised area of expertise, my guess before
googling would have to be "No". Note that you should be able to use psyco
with a 32-bit Python on a 64-bit system; the website says: """A common
question I get is whether Psyco will work on 64-bit x86 architectures.
The answer is no, unless you have a Python compiled in 32-bit
compatibility mode."""
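A quick way to check which kind of Python build you are running, and to use psyco only when it is actually available, might look like this. It is a sketch only; the variable names are illustrative, and on any Python where psyco is not installed the import simply fails and the script carries on without it.

```python
import struct
import sys

# Pointer size in bits: 32 for a 32-bit Python build, 64 for a 64-bit one,
# regardless of the operating system underneath.
bits = struct.calcsize("P") * 8

try:
    import psyco          # only exists for 32-bit x86 builds of Python 2
    psyco.full()          # the "simple brute-force" use: compile everything
    have_psyco = True
except ImportError:
    have_psyco = False    # fall back to plain interpreted Python

print("%d-bit Python %s, psyco %s" % (
    bits, sys.version.split()[0], "on" if have_psyco else "off"))
```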
>
> This is a little off topic, but I also noticed OpenOffice could launch
> and load a 70MB file in less than 30 seconds.
Given it's written in C++, you'd hope so :-)
Here are some rough timings (seconds) for opening a 120 MB xls file on
my laptop (single-core 2 GHz AMD Turion Mobile chip running Windows XP
SP2). Times exclude all setup and discard the first run (i.e. the GUI or
Python program is already loaded and the input file is highly likely to
be read from memory cache).
  11.5  xlrd; Python 2.5.2; psyco [time includes 1.0 sec for
        import psyco; psyco.full()]
  15    Excel 2003 SP3
  25    OOo Calc 2.0.3
  31    xlrd; Python 2.6.0b1; no psyco
  33    Gnumeric 1.9.1 (Notes: repeated runs blew memory; also should
        add (considerable) extra time for closing the workbook!)
  46.7  xlrd; Python 2.5.2; no psyco
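Timings of this warm-cache kind can be gathered with a small harness along the following lines. This is a sketch for Python 3, not the script used for the numbers above. The helper name time_repeated is hypothetical and the workload is a stand-in.

```python
import time


def time_repeated(func, runs=4):
    """Hypothetical helper: call func `runs` times, return timings after the first."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings[1:]  # discard the first (cold-cache) run


# Stand-in workload; for the xlrd rows this would be something like
# lambda: xlrd.open_workbook(path), with path pointing at your own file.
warm = time_repeated(lambda: sum(range(100000)))
print(["%.4f" % t for t in warm])
```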
I'm guessing that most of the explanation for the Python 2.6 speed-up is
this: """All of the functions in the struct module have been rewritten
in C, thanks to work at the Need For Speed sprint. (Contributed by
Raymond Hettinger.)""" [Thanks, Raymond!]
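For context on why struct speed would matter: an xls file is a long stream of small binary records, each starting with a 2-byte record code and a 2-byte length, and a pure-Python reader has to decode every one of them. A minimal illustration follows; the sample bytes are a hand-written BIFF8 BOF record header, not taken from any file in this thread, and how much of xlrd's time goes into such calls is an assumption rather than a measurement.

```python
import struct

# First four bytes of a BIFF8 workbook stream: a BOF record header.
# Layout: 2-byte record code, 2-byte payload length, both little-endian.
header = b"\x09\x08\x10\x00"
code, length = struct.unpack("<HH", header)
print(hex(code), length)
```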
Which version of Python are you currently using?
Cheers,
John