Export each tab of an Excel file as csv

39 views
Skip to first unread message

psaf...@googlemail.com

unread,
Aug 27, 2008, 10:40:02 AM8/27/08
to python-excel
I want to do the above with Python. I've written something using xlrd
which works, but it's a bit slow.

There are a few complications with my data. The first is that it's
rather voluminous - each file is about 70MB in size. The second is
that each tab (there are usually 7) has a header of information that
contains some funky unicode stuff, so I can't use any nice efficiency
savings like list comprehensions on it - or at least I can't figure
out how to do that. My code is posted below. It takes about a minute
to run on a quad-core Xeon with 8GB RAM. About half the time is in
reading in the file and half in writing.

Any ideas welcome,

Peter

import xlrd
import sys
import os

def dumpOneTab(tab, name):
fh = open(name, "w")
for rownum in range(tab.nrows):
line = []
for colnum in range(tab.ncols):
cellval = tab.cell_value(rowx=rownum, colx=colnum)
try:
cellval = str(cellval)
except:
cellval = "--garbled--"
line += [cellval]
print >> fh, ",".join(line)
fh.close()


if len(sys.argv) != 2:
print "usage: ./rip_xls.py <file.xls>"
sys.exit(1)

filename = sys.argv[1]
filebase = os.path.basename(filename)[:-4]

print "reading workbook..."
xls = xlrd.open_workbook(filename)

for sheetname in xls.sheet_names():
dumpfile = "%s-%s.csv" % (filebase, sheetname)
print "dumping tab", dumpfile, "..."
sheet = xls.sheet_by_name(sheetname)
dumpOneTab(sheet, dumpfile)

John Machin

unread,
Aug 27, 2008, 8:07:35 PM8/27/08
to python...@googlegroups.com
psaf...@googlemail.com wrote:

Hi Peter ...

Welcome to the group.


> I want to do the above with Python. I've written something using xlrd
> which works, but it's a bit slow.
>
> There are a few complications with my data. The first is that it's
> rather voluminous - each file is about 70MB in size. The second is
> that each tab (there are usually 7) has a header of information that
> contains some funky unicode stuff,

It depends on one's perspective, I suppose; mine is that Unicode is not
funky, it is "a thing of beauty and a joy forever" :-)

> so I can't use any nice efficiency
> savings like list comprehensions on it - or at least I can't figure
> out how to do that. My code is posted below. It takes about a minute
> to run on a quad-core Xeon with 8GB RAM. About half the time is in
> reading in the file and half in writing.
>

Using psyco in the simple brute-force fashion should cut about 70% off
your open time, but add a little to your dump time. Recoding your dump
routine (see attached file) cut about 30% off the dump time on my setup.

I note that your code assumes that there are no commas, quotes, or
newlines in your text cells. I've used the csv module which does the
"right thing", and does it at C speed instead of Python speed.

Note: Anyone borrowing this code to write a general-purpose xls2csv
routine will need to do better with date, boolean and error cell-types
... see the xlrd-supplied runxlrd.py for a few clues.

HTH,
John

rip_xls2.py

psaf...@googlemail.com

unread,
Aug 28, 2008, 9:21:28 AM8/28/08
to python-excel
Many thanks for your speedy and helpful reply.

I have tried your improved version, which certainly improves the
output performance as advertised; certainly a good reason to make use
of the csv module.

The input performance led me on a merry tour through psycho, its lack
of 64 bit support and its spiritual successor PyPy. This looks very
interesting, but somewhat less of a quick fix than psycho would be if
I was running in 32 bit. Is there a drop in replacement for
pscyho.full() I can use on a 64 bit system?

This is a little off topic, but I also noticed OpenOffice could launch
and load a 70MB file in less than 30 seconds. I wondered if OO was
scriptable to any extent for this kind of task. I've posted to one of
their mailing lists to ask:

http://www.openoffice.org/servlets/ReadMsg?list=users&msgNo=185240

Peter

Rakesh Singh

unread,
Aug 29, 2008, 3:22:47 AM8/29/08
to python...@googlegroups.com
Hi,

If psycho does not work for you, perhaps try Parallel Python to use those 4 cores you have.
http://www.parallelpython.com/

It should give you a great performance boost.

Cheers

Rakesh

psaf...@googlemail.com

unread,
Aug 29, 2008, 7:06:29 AM8/29/08
to python-excel
I'm not sure that can help me in this case. The majority of execution
time is taken with the single open_workbook() call. Unless xlrd has
implemented this in a parallel fashion (which seems rather unlikely),
I can't make use of more than one core on this operation.

Peter

On Aug 29, 8:22 am, "Rakesh Singh" <rzs0...@gmail.com> wrote:
> Hi,
>
> If psycho does not work for you, perhaps try Parallel Python to use those 4
> cores you have.http://www.parallelpython.com/
>
> It should give you a great performance boost.
>
> Cheers
>
> Rakesh
>
> On Thu, Aug 28, 2008 at 3:21 PM, psaff...@googlemail.com <

John Machin

unread,
Aug 30, 2008, 10:48:32 PM8/30/08
to python...@googlegroups.com
psaf...@googlemail.com wrote:
>
> I have tried your improved version, which certainly improves the
> output performance as advertised; certainly a good reason to make use
> of the csv module.

Proper handling of embedded quotes/commas/newlines is an even better
reason :-)

>
> The input performance led me on a merry tour through psycho, its lack
> of 64 bit support and its spiritual successor PyPy. This looks very
> interesting, but somewhat less of a quick fix than psycho would be if
> I was running in 32 bit. Is there a drop in replacement for
> pscyho.full() I can use on a 64 bit system?

Given that the author of psycho/pscyho/psyco says on the website that he
is not going to do it but is concentrating on PyPy and it's a very
narrowly specialised area of expertise, my guess before googling would
have to be "No". Note that you should be able to use psyco with a 32-bit
Python on a 64-bit system ... """A common question I get is whether
Psyco will work on 64-bit x86 architectures. The answer is no, unless
you have a Python compiled in 32-bit compatibility mode"""

>
> This is a little off topic, but I also noticed OpenOffice could launch
> and load a 70MB file in less than 30 seconds.

Given it's written in C++, you'd hope so :-)

Here are some rough timings (seconds) for opening a 120 Mb xls file on
my laptop (single core 2 GHz AMD Turion Mobile chip running Windows XP
SP2). Times exclude all setup and discard 1st run (i.e. gui or python
program is already loaded and the input file is highly likely to be read
from memory cache).

11.5 xlrd; Python 2.5.2; psyco [time includes 1.0 sec for import psyco;
psyco.full()]
15 Excel 2003 SP3
25 OOo Calc 2.0.3
31 xlrd; Python 2.6.0b1; no psyco
33 Gnumeric 1.9.1 (Notes: repeated runs blew memory; Also should add
(considerable) extra time for closing the workbook!)
46.7 xlrd; Python 2.5.2; no psyco

I'm guessing that most of the explanation for the Python 2.6 speed-up is
this: """All of the functions in the struct module have been rewritten
in C, thanks to work at the Need For Speed sprint. (Contributed by
Raymond Hettinger.)""" [Thanks, Raymond!]

Which version of Python are you currently using?

Cheers,
John

psaf...@googlemail.com

unread,
Sep 4, 2008, 10:37:51 AM9/4/08
to python-excel

On Aug 31, 3:48 am, John Machin <sjmac...@lexicon.net> wrote:
>
> Proper handling of embedded quotes/commas/newlines is an even better
> reason :-)
>

I'm not too worried about this because my data is quite regular, but I
guess that comes under the category of "famous last words".

> Given that the author of psycho/pscyho/psyco says on the website that he

It doesn't work on my system, therefore I am determined to spell it
wrong.

> Which version of Python are you currently using?
>

So it looks like I might benefit from building my own version of
Python, either 2.6 or 2.5 with 32 bit compatibility (or even 2.6 with
32 bit). At the moment, my needs are not so urgent that I can be
bothered to step outside the safety of my package manager, but I may
revisit this next time I get performance problems.

Thanks for your guidance,

Peter
Reply all
Reply to author
Forward
0 new messages