xlwt hacks


John Harrison

Oct 5, 2008, 3:28:06 PM
to python-excel
As promised I've made some changes to my copy of xlwt in order to
facilitate rows to repeat at top functionality. In contrast to my
earlier discussion with John Machin on this topic I've put the
functionality in the worksheet class rather than the workbook class.

I've also made two other changes. One makes great improvements to
both the speed and especially the memory consumption of xlwt when
writing out large spreadsheets. The other allows a user to insert
worksheets at any location in a workbook rather than simply appending
them to the end of the current list of sheets.

I've posted details of the changes and reasons for them on my blog at:
http://blog.insightvr.com/?p=30

I've also posted a diff of my changes at:
http://insightvr.com/download/xlwtHack.diff

I hope that some of these are useful to other users of xlwt and I'm
happy to participate in any process needed for them to be incorporated
into xlwt. If they're rejected, I understand that as well and I'll
just keep using them privately.

Many thanks to John Machin and others for making this library
available!

Chris Withers

Oct 5, 2008, 5:13:11 PM
to python...@googlegroups.com
John Harrison wrote:
> As promised I've made some changes to my copy of xlwt in order to
> facilitate rows to repeat at top functionality.

I don't really understand this, so I'm not quite so interested ;-)

> I've also made two other changes. One makes great improvements to
> both the speed

...I think this is likely incidental; swapping would be my guess as the
cause of the slowdown.

> and especially the memory consumption of xlwt when
> writing out large spreadsheets.

Now this is really interesting!
(read: I might be looking for this myself soon, the xlwt memory
structures certainly are memory hungry...)

> The other allows a user to insert
> worksheets at any location in a workbook rather than simply appending
> them to the end of the current list of sheets.

As is this!

> I've also posted a diff of my changes at:
> http://insightvr.com/download/xlwtHack.diff

John, can you see any reasons not just to merge and commit this patch?
If the answer is "no", then I'll happily get it merged and committed.

cheers,

Chris

--
Simplistix - Content Management, Zope & Python Consulting
- http://www.simplistix.co.uk

John Harrison

Oct 5, 2008, 5:50:36 PM
to python-excel
> > As promised I've made some changes to my copy of xlwt in order to
> > facilitate rows to repeat at top functionality.  
>
> I don't really understand this, so I'm not quite so interested ;-)
>

Excel allows you to specify that certain rows (or columns) repeat at
the top of each printed page. A while back I posted on my attempts to
make this work. John Machin replied with improvements. What I've got
in the diff is what I feel is a further improvement to what John did
on two fronts:

1. I think that the methods belong in the worksheet rather than the
workbook since they apply to a specific worksheet.
2. They are compatible with my sheet insertion code.

I should note that the code also allows you to specify multiple "print
areas" on each sheet, such that only the specified areas are printed.
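For readers unfamiliar with the feature, the per-sheet settings involved can be pictured with a small stand-in class. This is an illustrative sketch only; the names are hypothetical and are not the actual patch API:

```python
# Illustrative stand-in for per-sheet print settings -- hypothetical
# names, not the actual xlwt patch API.
class Worksheet:
    """Minimal sketch of a sheet holding its own print settings."""

    def __init__(self, name):
        self.name = name
        self.rows_to_repeat = None   # (first_row, last_row), 0-based, or None
        self.cols_to_repeat = None   # (first_col, last_col), 0-based, or None
        self.print_areas = []        # list of (row1, row2, col1, col2) blocks

    def set_rows_to_repeat(self, first_row, last_row):
        # These rows reappear at the top of every printed page.
        self.rows_to_repeat = (first_row, last_row)

    def add_print_area(self, row1, row2, col1, col2):
        # Only the listed blocks are printed; multiple areas are allowed.
        self.print_areas.append((row1, row2, col1, col2))

sheet = Worksheet("Report")
sheet.set_rows_to_repeat(0, 1)     # repeat the two heading rows
sheet.add_print_area(0, 99, 0, 9)  # print just the block A1:J100
```

Putting these on the worksheet rather than the workbook matches point 1 above: the settings describe one sheet, not the whole file.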

> > I've also made two other changes.  One makes great improvements to
> > both the speed
>
> ...I think this is likely incidental, swapping would be my guess as the
> cause of the slowdown.

I agree, but on my laptop the increase is significant. On my 100,000
row test (code on my blog, which I linked to) the improvement is
better than 2x. I'm running a dual core MacBook Pro with 3 gigs of
ram.
>
> > and especially the memory consumption of xlwt when
> > writing out large spreadsheets.  
>
> Now this is really interesting!
> (read: I might be looking for this myself soon, the xlwt memory
> structures certainly are memory hungry...)
>

Yeah, memory consumption is greatly improved, 35x in my simple 100,000
row test. More rows, more improvement. Ugh, just realized that I'm
testing at the wrong point in my code for memory consumption. Need to
check that out, but it is still a big improvement.

> > I've also posted a diff of my changes at:
> >http://insightvr.com/download/xlwtHack.diff
>
> John, can you see any reasons not just to merge and commit this patch?
> If the answer is "no", then I'll happily get it merged and committed.

Chris,

That is the question that I'm wondering about. For how I use xlwt my
changes hurt nothing and help greatly. I believe that the worksheet
insertion code and the rows to repeat code (includes print areas) will
not break any compatibility with current functionality.

The row flush code is possibly more problematic. My understanding
(from memory) is that the api does not currently support deleting a
row or a cell in a row. If that is the case then my code should be
backwards compatible. The concern is in how the sheet keeps track of
the max/min rows and cols. My code has to keep track of these as they
are added to a sheet rather than figuring this out when rows are
written.
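The incremental bookkeeping described above might look something like this minimal sketch (hypothetical names, not the patch itself): bounds are updated as each cell arrives, so rows can later be discarded without losing the information needed for the DIMENSIONS record.

```python
# Hypothetical sketch of incremental min/max bounds tracking.
class BoundsTracker:
    def __init__(self):
        self.first_row = self.first_col = None
        self.last_row = self.last_col = None

    def note_cell(self, row, col):
        # Update bounds as each cell is written, so flushed rows can be
        # discarded later without losing the sheet's dimensions.
        if self.first_row is None:
            self.first_row = self.last_row = row
            self.first_col = self.last_col = col
            return
        self.first_row = min(self.first_row, row)
        self.last_row = max(self.last_row, row)
        self.first_col = min(self.first_col, col)
        self.last_col = max(self.last_col, col)

bounds = BoundsTracker()
for r, c in [(0, 0), (5, 3), (2, 7)]:
    bounds.note_cell(r, c)
```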

So I don't think that the row flush changes break compatibility.
However that doesn't mean that it is safe for everyone to start using
this. It depends on how you write your rows. You don't want to flush
when you might write more cells to a row that you've already written
to.

Also, I don't know if there is any functionality that needs to
reference a row in some way after it is written, other than to write it
out. Flushing the rows to the temp file would be problematic in this
case.

Basically you can't flush the rows if you're going to use that row
again. But you have to call the flush as it isn't automatic, and if
you don't flush at all the only difference should be the max/min code.

I think that my concern boils down to something that would be resolved
by adequate documentation.

I hope that makes sense.

later,
John

John Machin

Oct 5, 2008, 9:14:59 PM
to python...@googlegroups.com
Chris Withers wrote:
> John Harrison wrote:
>> As promised I've made some changes to my copy of xlwt in order to
>> facilitate rows to repeat at top functionality.
>
> I don't really understand this, so I'm not quite so interested ;-)

It's so that you get a heading at the top of every printed page. It's
one of those things that end-users understand and appreciate :-)

>
>> I've also made two other changes. One makes great improvements to
>> both the speed
>
> ...I think this is likely incidental, swapping would be my guess as the
> cause of the slowdown.
>
>> and especially the memory consumption of xlwt when
>> writing out large spreadsheets.
>
> Now this is really interesting!
> (read: I might be looking for this myself soon, the xlwt memory
> structures certainly are memory hungry...)

Good work. Looks great at first glance. Callers would need to be warned
about the consequences of putting more data in a row after it had been
flushed.

>
>> The other allows a user to insert
>> worksheets at any location in a workbook rather than simply appending
>> them to the end of the current list of sheets.
>
> As is this!

Ummm, what about:

summary_sheet = wb.add_sheet("Summary")
data_sheet = wb.add_sheet("Data")
# write stuff to data_sheet, accumulate summary info
# write summary info to summary_sheet

?

>
>> I've also posted a diff of my changes at:
>> http://insightvr.com/download/xlwtHack.diff
>
> John, can you see any reasons not just to merge and commit this patch?
> If the answer is "no", then I'll happily get it merged and committed.
>

Presuming "John" means me, not John Harrison:

Reasons not to just merge and commit this patch (now, willy-nilly):
(1) It hasn't been reviewed; also I'd like to see some examples and/or
test files and/or a paragraph or two for the docs
(2) It should be 3 quite separate patches
(3) Christophe Tronche and I are currently working on a patch to provide
major improvement to xlwt's formula capabilities; as CT has evidently
some difficulty/problem with svn, he is supplying me with patches
against the version 0.7.0 tar.gz, and I would very much prefer for svn
not to be updated until this patch is ready -- a couple of days at most.
(4) I am somewhat concerned by "Also, I’d mucked up the worksheet index
concept by allowing a user to insert a worksheet earlier in the list. So
you could invalidate an index prior to writing out the xls file". In
particular I'd like to investigate the effect on the formula patch (the
main effect of which is to support 3D references to other worksheets
e.g. Sheet1:Sheet3!A1:Z99) and its use of worksheet indexes.

Cheers,
John

John Harrison

Oct 5, 2008, 10:47:45 PM
to python-excel

> >> I've also made two other changes.  One makes great improvements to
> >> both the speed
>
> > ...I think this is likely incidental, swapping would be my guess as the
> > cause of the slowdown.
>
> >> and especially the memory consumption of xlwt when
> >> writing out large spreadsheets.  
>
> > Now this is really interesting!
> > (read: I might be looking for this myself soon, the xlwt memory
> > structures certainly are memory hungry...)
>
> Good work. Looks great at first glance. Callers would need to be warned
> about the consequences of putting more data in a row after it had been
> flushed.

Right, as I said I think documentation is the key here.


> >> The other allows a user to insert
> >> worksheets at any location in a workbook rather than simply appending
> >> them to the end of the current list of sheets.
>
> > As is this!
>
> Ummm, what about:
>
> summary_sheet = wb.add_sheet("Summary")
> data_sheet = wb.add_sheet("Data")
> # write stuff to data_sheet, accumulate summary info
> # write summary info to summary_sheet
>
> ?

Unfortunately I don't even know how many sheets I'm going to need
because I'm using a giant iterator and don't know how many times I'll
go over the 65k row limit until I've already been through all the
data.
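That use case, where the sheet count isn't known up front, can be sketched with a stub workbook standing in for xlwt. The 65,536-row limit per BIFF8 sheet is the only fixed point here; every name below is illustrative:

```python
# Sketch of splitting a long iterator across 65,536-row sheets.
# The stubs stand in for xlwt's Workbook/Worksheet; names are illustrative.
MAX_ROWS = 65536  # BIFF8 row limit per sheet

def write_in_chunks(workbook, records, base_name="Data"):
    sheet, row = None, MAX_ROWS  # force a new sheet on the first record
    count = 0
    for record in records:
        if row >= MAX_ROWS:
            count += 1
            sheet = workbook.add_sheet("%s%d" % (base_name, count))
            row = 0
        sheet.write(row, 0, record)
        row += 1
    return count  # number of sheets actually needed

class StubSheet:
    def __init__(self, name):
        self.name, self.cells = name, {}
    def write(self, r, c, value):
        self.cells[(r, c)] = value

class StubBook:
    def __init__(self):
        self.sheets = []
    def add_sheet(self, name):
        sheet = StubSheet(name)
        self.sheets.append(sheet)
        return sheet

book = StubBook()
n = write_in_chunks(book, range(150000))  # 150,000 rows -> 3 sheets
```

Only after the iterator is exhausted do you know how many data sheets exist, which is why a summary sheet can't simply be created first.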

> >> I've also posted a diff of my changes at:
> >>http://insightvr.com/download/xlwtHack.diff
>
> > John, can you see any reasons not just to merge and commit this patch?
> > If the answer is "no", then I'll happily get it merged and committed.
>
> Presuming "John" means me, not John Harrison:
>
> Reasons not to just merge and commit this patch (now, willy-nilly):
> (1) It hasn't been reviewed; also I'd like to see some examples and/or
> test files and/or a paragraph or two for the docs

I agree.

> (2) It should be 3 quite separate patches

I agree.

> (3) Christophe Tronche and I are currently working on a patch to provide
> major improvement to xlwt's formula capabilities; as CT has evidently
> some difficulty/problem with svn, he is supplying me with patches
> against the version 0.7.0 tar.gz, and I would very much prefer for svn
> not to be updated until this patch is ready -- a couple of days at most.

I was unaware of this effort. I'm happy to revise after this is done.

> (4) I am somewhat concerned by "Also, I’d mucked up the worksheet index
> concept by allowing a user to insert a worksheet earlier in the list. So
> you could invalidate an index prior to writing out the xls file". In
> particular I'd like to investigate the effect on the formula patch (the
> main effect of which is to support 3D references to other worksheets
> e.g. Sheet1:Sheet3!A1:Z99) and its use of worksheet indexes.

Correct. This would have side effects on 3D references. I did not
know they were in progress. This might fall under the same "needs to
be well documented" category as the row flush code, or it might need
much more work, or it might just be a bad fit for inclusion in xlwt.

Ch. Tronche

Oct 6, 2008, 11:58:38 AM
to python-excel


On 6 oct, 03:14, John Machin <sjmac...@lexicon.net> wrote:
> CT has evidently
> some difficulty/problem with svn, he is supplying me with patches
> against the version 0.7.0 tar.gz,

I've made progress on this side, I can now get code from the svn
(silly, but I just didn't know the syntax). I can adapt the patch
against svn if you prefer, but it would take a few days.

Regards,
Ch. Tronche

> Cheers,
> John

John Machin

Oct 7, 2008, 2:48:08 AM
to python...@googlegroups.com

Sorry, I don't understand. AFAICT [*] the xlwt files in the svn
repository are the same as in the version 0.7.0 tar.gz, as expected.

* = TortoiseSVN: check for modifications, click on "Check Repository",
shows tags updated, no mention of source files changed.

Chris Withers

Oct 7, 2008, 4:18:45 AM
to python...@googlegroups.com
John Machin wrote:
> * = TortoiseSVN: check for modifications, click on "Check Repository",
> shows tags updated, no mention of source files changed.

I had a look but couldn't find this Check Repository option.
If you're using "check for modifications", that checks whether you've
made any changes locally, *not* whether the repository has changed.

To get changes that have been committed to the repository, right-click
and select "SVN Update". That will update your checkout with any changes
committed to the repository.

If you just want to see if anything's been committed, right-click ->
TortoiseSVN -> show log is quite handy. That'll show you a log of all
commits where you can spot any new ones and when they were committed.
Better yet, if you want to see what changes there were between two
revisions, hold down Ctrl and click on the two revisions, then
right-click and select 'compare revisions'.

All that said, xlwt trunk does indeed seem to be 0.7.0....

Ch. Tronche

Oct 7, 2008, 12:56:21 PM
to python-excel
Sorry, I wrongly assumed they were different!

Regards,
Ch. Tronche

Chris Withers

Oct 9, 2008, 11:24:53 AM
to python...@googlegroups.com
John Harrison wrote:
> Excel allows you to specify that certain rows (or columns) repeat at
> the top of each printed page. A while back I posted on my attempts to
> make this work. John Machin replied with improvements. What I've got
> in the diff is what I feel is a further improvement to what John did
> on two fronts:

OK, this stuff I'm not *that* interested in, but I don't mind merging as
part of the work to merge the memory usage stuff...

>>> I've also made two other changes. One makes great improvements to
>>> both the speed
>> ...I think this is likely incidental, swapping would be my guess as the
>> cause of the slowdown.
>
> I agree, but on my laptop the increase is significant.

You're missing my point; if you ran your new code over a data set that
didn't consume more than the physical memory of the box, you'd likely
see no speed increase. In fact, you might even see a slight slow down ;-)

>>> and especially the memory consumption of xlwt when
>>> writing out large spreadsheets.
>> Now this is really interesting!
>> (read: I might be looking for this myself soon, the xlwt memory
>> structures certainly are memory hungry...)

scratch might, I *do* need this ;-)

> Yeah, memory consumption is greatly improved, 35x in my simple 100,000
> row test. More rows, more improvement. Ugh, just realized that I'm
> testing at the wrong point in my code for memory consumption. Need to
> check that out, but it is still a big improvement.

I noticed in my tests with heapy (glad to see you're using it as well!)
that a lot of memory was taken up by formatting objects. Did you see
this memory usage go down with your patch?

> That is the question that I'm wondering about. For how I use xlwt my
> changes hurt nothing and help greatly. I believe that the worksheet
> insertion code and the rows to repeat code (includes print areas) will
> not break any compatibility with current functionality.

Good :-)

> The row flush code is possibly more problematic. My understanding
> (from memory) is that the api does not currently support deleting a
> row or a cell in a row.

Well, from my understanding of the purpose of xlwt, it's for
serialisation of data to .xls format, so it should never support
deleting a row or a cell.

> backwards compatible. The concern is in how the sheet keeps track of
> the max/min rows and cols. My code has to keep track of these as they
> are added to a sheet rather than figuring this out when rows are
> written.

OK, but what difference does this make?

> So I don't think that the row flush changes break compatibility.
> However that doesn't mean that it is safe for everyone to start using
> this. It depends on how you write your rows. You don't want to flush
> when you might write more cells to a row that you've already written
> to.

OK, I think some checking and exception raising should solve this.
You should only flush when you're sure you don't want to write more
cells to a column you've already written. xlwt should raise an exception
if you do.

> Also I don't know if there is any functionality that needs to
> reference a row in some way after it is written other than to write it
> out. Flushing the rows to the temp file would be problematic in this
> case.

Right, does anyone know if there are problems with this?

> Basically you can't flush the rows if you're going to use that row
> again. But you have to call the flush as it isn't automatic, and if
> you don't flush at all the only difference should be the max/min code.

Right, from this perspective I think it's fine.

> I think that my concern boils down to something that would be resolved
> by adequate documentation.

...and some exception raising when you do silly things.

As I said earlier, I now need this code, so I'm happy to merge it.
I'm also interested in writing some unit tests, but that's a subject for
a separate post.

Chris Withers

Oct 9, 2008, 11:38:27 AM
to python...@googlegroups.com
John Machin wrote:
>
> It's so that you get a heading at the top of every printed page. It's
> one of those things that end-users understand and appreciate :-)

yes, we hate end users, but that'll be another separate post arriving
shortly ;-)

>>> and especially the memory consumption of xlwt when
>>> writing out large spreadsheets.
>> Now this is really interesting!
>> (read: I might be looking for this myself soon, the xlwt memory
>> structures certainly are memory hungry...)
>
> Good work. Looks great at first glance. Callers would need to be warned
> about the consequences of putting more data in a row after it had been
> flushed.

That brings up the subject of documentation ;-) It would be good if
there existed some meta-docs on how to generate documentation for xlrd
and xlwt. I know I started by hand for xlwt unaware that John had used a
tool for xlrd (the style of which I was copying). I'm also wondering if
maybe we should use Sphinx* for the docs for both?

>>> The other allows a user to insert
>>> worksheets at any location in a workbook rather than simply appending
>>> them to the end of the current list of sheets.
>> As is this!
>
> Ummm, what about:
>
> summary_sheet = wb.add_sheet("Summary")
> data_sheet = wb.add_sheet("Data")
> # write stuff to data_sheet, accumulate summary info
> # write summary info to summary_sheet
>
> ?

Not sure what the point you're trying to make here is John, could you
elaborate?

> Presuming "John" means me, not John Harrison:

You are the relevant John, yes ;-)

> Reasons not to just merge and commit this patch (now, willy-nilly):
> (1) It hasn't been reviewed;

Well, I need to use it in a production setting, so it'll be getting
*plenty* of review...

> also I'd like to see some examples and/or
> test files

I'm going to post about testing shortly...

> and/or a paragraph or two for the docs

See earlier references to Sphinx and documentation in general. I'm happy
to stick with the current tool, but some brief meta-docs on what it is
and how to use it would be great.

> (2) It should be 3 quite separate patches

Indeed, although I'm happy to merge the "uber-patch" if that's all
that's available...

> (3) Christophe Tronche and I are currently working on a patch to provide
> major improvement to xlwt's formula capabilities; as CT has evidently
> some difficulty/problem with svn, he is supplying me with patches
> against the version 0.7.0 tar.gz, and I would very much prefer for svn
> not to be updated until this patch is ready -- a couple of days at most.

svn is really rather good at this kind of thing ;-)

If you right-click on the folder that is your xlwt checkout and then
do TortoiseSVN -> show log, you'll get the log listing.

Right-click on a revision (I think 3843 is the one that corresponds
to the 0.7.0 release) and select "update item to revision"; your
checkout will then be exactly as the source tree was at revision 3843,
so your patch should merge fine.

If you then do "SVN update", all later changes will be merged in. If
there are incompatible changes, the affected file(s) will show in a
conflicted state. You can then manually fix the conflicts before marking
them as resolved and committing your changes.

> (4) I am somewhat concerned by "Also, I’d mucked up the worksheet index
> concept by allowing a user to insert a worksheet earlier in the list. So
> you could invalidate an index prior to writing out the xls file".

I couldn't find the email this was quoted from. Which one was it? Not
sure of the original context...

> In
> particular I'd like to investigate the effect on the formula patch (the
> main effect of which is to support 3D references to other worksheets
> e.g. Sheet1:Sheet3!A1:Z99) and its use of worksheet indexes.

OK, well, it's the row flushing that I really really need asap. I assume
the row flushing bits are okay to merge with your proviso above?

cheers,

Chris

*http://sphinx.pocoo.org/

John Machin

Oct 9, 2008, 7:38:34 PM
to python...@googlegroups.com
Chris Withers wrote:
> John Harrison wrote:
>
> I noticed in my tests with heapy (glad to see you're using it as well!)
> that a lot of memory was taken up by formatting objects. Did you see
> this memory usage go down with your patch?

If it does go down, that could mean that your code may be creating new
XFStyle objects for each cell, rather than sharing them. It may also be
a memory leak. How are you creating your XFStyle objects? From an input
xls file using your replacement for the xlrd2wtmap gadget, or from scratch?

When one does
shell-prompt> runxlrd.py -f1 hdr empty_file_created_by_Excel_2003.xls
one gets:
[snip]
FORMATs: 8, FONTs: 4, XFs: 21
These are counts of the actual occurrences in the file.
What do you get when you do that with your output xls file? With your
input file, if any? How do those numbers compare with "the lot of
memory" that you are seeing with heapy? What is a "lot"?

Please report a second set of results after using:
output_book = xlwt.Workbook(style_compression=2)

>> The row flush code is possibly more problematic. My understanding
>> (from memory) is that the api does not currently support deleting a
>> row or a cell in a row.
>
> Well, from my understanding of the purpose of xlwt, it's for
> serialisation of data to .xls format, so it should never support
> deleting a row or a cell.

I agree entirely. xlwt is for creating xls files, serially or randomly.
John H has presented a reasonable use-case for inserting new sheets
out-of-creation-order-wise, but that's rather different to deleting a
row or cell.

>
>> backwards compatible. The concern is in how the sheet keeps track of
>> the max/min rows and cols. My code has to keep track of these as they
>> are added to a sheet rather than figuring this out when rows are
>> written.
>
> OK, but what difference does this make?
>
>> So I don't think that the row flush changes break compatibility.
>> However that doesn't mean that it is safe for everyone to start using
>> this. It depends on how you write your rows. You don't want to flush
>> when you might write more cells to a row that you've already written
>> to.
>
> OK, I think some checking and exception raising should solve this.
> You should only flush when you're sure you don't want to write more
> cells to a column you've already written.

s/column/row/

> xlwt should raise an exception
> if you do.

My quick read of John H's patch gave me the impression that after a row
was flushed, the corresponding Row object was deleted [can't check;
insightvr.com website seems to be down]. We'd have to keep track of the
state (non-existent, created, flushed) of all possible rows in each
sheet, and check this on each Row.write etc call. Instead of deleting a
Row object and maintaining a separate state collection, it might be
better for the Row class to have a shrink_after_flushing method which
would set self.shrunk to True and delete all cell references.
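The shrink-instead-of-delete idea could be sketched roughly like this (illustrative only; names and the exception class are hypothetical, not the committed xlwt code):

```python
# Sketch of the suggestion: keep a shrunken Row instead of deleting it,
# so later writes can be detected. Not the actual xlwt implementation.
class FlushedRowError(Exception):
    pass

class Row:
    def __init__(self, idx):
        self.idx = idx
        self.cells = {}
        self.shrunk = False

    def write(self, col, value):
        if self.shrunk:
            raise FlushedRowError("row %d already flushed" % self.idx)
        self.cells[col] = value

    def shrink_after_flushing(self):
        # The record has been emitted to the temp file; drop the cell
        # references so memory can be reclaimed, but remember the state.
        self.shrunk = True
        self.cells = {}

row = Row(0)
row.write(0, "heading")
row.shrink_after_flushing()
try:
    row.write(1, "late cell")
    late_write_ok = True
except FlushedRowError:
    late_write_ok = False
```

Keeping the state on the Row itself avoids maintaining a separate collection that must be checked on every cell-writing call.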

>
>> Also I don't know if there is any functionality that needs to
>> reference a row in some way after it is written other than to write it
>> out. Flushing the rows to the temp file would be problematic in this
>> case.
>
> Right, does anyone know if there are problems with this?

There are some places in the BIFF emission where lengths/sizes/offsets
have to be patched in, but not to do with rows/cells (IIRC). I doubt
there are any problems -- but checking this would be part of the review
that anyone would expect to be done before the patch is merged in.

>
>> Basically you can't flush the rows if you're going to use that row
>> again. But you have to call the flush as it isn't automatic, and if
>> you don't flush at all the only difference should be the max/min code.
>
> Right, from this perspective I think it's fine.
>
>> I think that my concern boils down to something that would be resolved
>> by adequate documentation.
>
> ...and some exception raising when you do silly things.
>
> As I said earlier, I now need this code so I'm happy to merge it

Only the row-flushing part, please (*NOT* the worksheet-inserting stuff
and the head-rows stuff).

Will you add the "exception raising when you do silly things" code, as well?

Cheers,
John

John Machin

Oct 10, 2008, 6:24:24 AM
to python...@googlegroups.com
Chris Withers wrote:

> John Machin wrote:
>
> That brings up the subject of documentation ;-) It would be good if
> there existed some meta-docs on how to generate documentation for xlrd
> and xlwt. I know I started by hand for xlwt unaware that John had used a
> tool for xlrd (the style of which I was copying). I'm also wondering if
> maybe we should use Sphinx* for the docs for both?

Please start a new thread.

>>>> The other allows a user to insert
>>>> worksheets at any location in a workbook rather than simply appending
>>>> them to the end of the current list of sheets.
>>> As is this!
>> Ummm, what about:
>>
>> summary_sheet = wb.add_sheet("Summary")
>> data_sheet = wb.add_sheet("Data")
>> # write stuff to data_sheet, accumulate summary info
>> # write summary info to summary_sheet
>>
>> ?
>
> Not sure what the point you're trying to make here is John, could you
> elaborate?

The point was that if you know you will have only one summary sheet, you
can create it first (instead of inserting it later) and fill it in
after/while you fill in the data sheets. John H has subsequently
remarked that his use-case involved a variable number of summary sheets
(or words to that effect).


>
>> Presuming "John" means me, not John Harrison:
>
> You are the relevent John, yes ;-)
>
>> Reasons not to just merge and commit this patch (now, willy-nilly):
>> (1) It hasn't been reviewed;
>
> Well, I need to use it in a production setting, so it'll be getting
> *plenty* of review...

By "review" I meant reading the patch and the patched code closely to
check that it's doing something useful and that it has covered all the
bolt-holes -- see below.

>
>> also I'd like to see some examples and/or
>> test files
>
> I'm going to post about testing shortly...
>
>> and/or a paragraph or two for the docs

My point was that a patch should be accompanied by an example file that
shows Noddies like me how it's intended to be used and what the expected
outcome is, instead of having to reverse engineer that from the patch.
We got that for the row-flush part of the tripartite patch, but not the
other two.

>
> See earlier references to Sphinx and documentation in general. I'm happy
> to stick with the current tool, but some brief meta-docs on what it is
> and how to use it would be great.

See earlier reference for the need for a separate thread on documentation.


>
>> (2) It should be 3 quite separate patches
>
> Indeed, although I'm happy to merge the "uber-patch" if that's all
> that's available...

I'm not so happy, so I'm working on a cut-down patch with just the
row-flush stuff in it before anyone goes all Gadarene :-)

>
>> (4) I am somewhat concerned by "Also, I’d mucked up the worksheet index
>> concept by allowing a user to insert a worksheet earlier in the list. So
>> you could invalidate an index prior to writing out the xls file".
>
> I couldn't find the email this was quoted from. Which one was it? Not
> sure of the original context...

Quoted from John H's blog.

>
> OK, well, it's the row flushing that I really really need asap. I assume
> the row flushing bits are okay to merge with your proviso above?

Not quite. Tracking the leftmost and rightmost columns used needs to be
done in the Row.__adjust_bound_col_idx method (which is called by *all*
cell-writing methods), not just in the Worksheet.write method. If you
are using the Row.set_cell_<datatype> methods (like xlrd2wtmap does),
you might find that the DIMENSIONS record would have strange contents,
and some xls readers might sigh about it; e.g. xlrd with verbosity >= 1:

NOTE *** sheet 0 (u'foo0'): DIMENSIONS R,C = 10000,1 should be 10000,20

What does your "asap" translate into? Committed by when you get in to
work on Monday OK?

Cheers,
John

John Harrison

Oct 10, 2008, 12:42:51 PM
to python-excel
hey everybody,

I don't have time at the moment to do a detailed response to all the
comments. I have the patches as separate svn commits on my system and
I'm happy to break them out into three patches. I've got some minor
changes to make anyhow.

I'd need to add some code for exception handling, basically just a
dict that keeps track of flushed row numbers and raises an exception
if you try to write to a flushed row. Does that sound acceptable?
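The proposed bookkeeping might look roughly like this sketch (hypothetical names; a dict rather than a set matches the proposal, and Python 2.3 has no built-in set type):

```python
# Sketch of the proposed guard: a dict of flushed row indexes consulted
# before any row is created or reused. Names are hypothetical.
class FlushedRowError(Exception):
    pass

class Sheet:
    def __init__(self):
        self.rows = {}
        self.flushed = {}  # a dict, for Python 2.3 compatibility

    def row(self, idx):
        if idx in self.flushed:
            raise FlushedRowError("row %d was already flushed" % idx)
        return self.rows.setdefault(idx, {})

    def flush_rows(self):
        # A real implementation would serialise each row to the temp
        # file here before discarding it.
        for idx in self.rows:
            self.flushed[idx] = 1
        self.rows = {}

sheet = Sheet()
sheet.row(0)[0] = "a"
sheet.flush_rows()
try:
    sheet.row(0)
    reuse_ok = True
except FlushedRowError:
    reuse_ok = False
```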

I'm a bit under the gun at the moment on two projects so I might not
have the patches out till Monday. What is the best way to submit
them?

thanks,
John

John Machin

Oct 10, 2008, 7:00:34 PM
to python...@googlegroups.com
John Harrison wrote:
> hey everybody,
>
> I don't have time at the moment to do a detailed response to all the
> comments. I have the patches as separate svn commits on my system and
> I'm happy to break them out into three patches. I've got some minor
> changes to make anyhow.
Hi John,

As stated two messages ago, I have already started on the row_flush caper.

I have fixed the problem with the first/last row/col being handled in
the wrong place, leading to the possibility of a bad DIMENSIONS record.
This definitely needed fixing; although xlrd (with verbosity=0), OOo
Calc, and Gnumeric don't complain, Excel 200[37] pops up a dialogue box
saying "File error: data may have been lost" -- not exactly what we want
the consumers of xlwt output files to experience. There is an existing
bug in pyExcelerator/xlwt: the DIMENSIONS record is incorrect for an
empty worksheet (needs special case code); I've fixed that as well.

I have also enhanced your example script to take command-line args for
the number of rows and columns and the flushing interval (with 0
implying don't flush). Unfortunately heapy isn't available on Windows
AFAICT so I'm stuck with watching the gauge in the Task Manager [anyone
have any better ideas?].

Example [Python 2.5.2, Windows XP SP2, one-core 2GHz CPU, enough memory
to avoid swapping]:
250,000 rows, 20 columns, no flushing: 232 seconds, ~510MB
250,000 rows, 20 columns, flushing every 1000 rows: 184 seconds, process
uses only tens of MB until the final save, when it spikes up to ~100MB.
100,000 rows, 20 columns, no flushing: 79 seconds, ~200MB
100,000 rows, 20 columns, flushing every 1000 rows: 72 seconds, spikes
up to ~45MB.

Looks good to me.

Doing anything major about the size of those spikes is best left to a
separate exercise. However I'm experimenting with the
Worksheet.get_biff_data method to see if using ''.join(list_of_strings)
is better than the += approach. Currently (whether we are flushing or
not) all of the row & cell records in the sheet are added in, then a few
small records are added in. If += is not being optimised, then memory is
needed for an extra copy of the row/cell records.
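
A toy illustration of the difference (not xlwt code): repeated += can recopy the whole accumulator on each step, while ''.join() assembles the result in one pass. CPython does sometimes optimise += in place, which is the "if += is not being optimised" caveat.

```python
def build_with_concat(chunks):
    data = ''
    for chunk in chunks:
        data += chunk       # may copy everything accumulated so far
    return data

def build_with_join(chunks):
    return ''.join(chunks)  # one pass, one final allocation

# Both produce the same string; only the copying behaviour differs.
chunks = ['x' * 100] * 1000
assert build_with_concat(chunks) == build_with_join(chunks)
```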

>
> I'd need to add some code for exception handling, basically just a
> dict that keeps track of flushed row numbers and raises an exception
> if you try to write to a flushed row. Does that sound acceptable?

More precisely: if you try to create a Row object for a flushed
row-index. A set is logically preferable but it would involve a
conditional import for Python 2.3 so a dict is what I'll be using.
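
As a rough sketch (names hypothetical; not xlwt's actual code), the guard could look like this, with a dict standing in for a set for Python 2.3 compatibility:

```python
class FlushedRowGuard:
    """Remember which row indices have been flushed and refuse to
    hand out a Row object for any of them again."""

    def __init__(self):
        # dict used as a set: Python 2.3 has no built-in set type
        self._flushed = {}

    def mark_flushed(self, row_index):
        self._flushed[row_index] = 1

    def check(self, row_index):
        # Called before creating a Row object for row_index.
        if row_index in self._flushed:
            raise Exception(
                "row %d has already been flushed" % row_index)
```

In the real code the check would live wherever Row objects are created.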

>
> I'm a bit under the gun at the moment on two projects so I might not
> have the patches out till Monday. What is the best way to submit
> them?
>

If you have any minor changes to the flushing stuff, please let me know
ASAP. Otherwise, yes, I'd very much appreciate separate patches (and
example file and API docs) for the print titles stuff and the sheet
insertion stuff, but there's no hurry. Physical delivery: attachments to
a posting to this group is OK with me; however putting them on your
website and posting links might suit other people better -- I don't
really mind.

I have made a few minor changes myself e.g. s/== None/is None/ :-)

Cheers,
John

John Harrison

unread,
Oct 10, 2008, 11:40:20 PM10/10/08
to python-excel
John,

> I have also enhanced your example script to take command-line args for
> the number of rows and columns and the flushing interval (with 0
> implying don't flush).

Good idea.

> Example [Python 2.5.2, Windows XP SP2, one-core 2GHz CPU, enough memory
> to avoid swapping]:
> 250,000 rows, 20 columns, no flushing: 232 seconds, ~510MB
> 250,000 rows, 20 columns, flushing every 1000 rows: 184 seconds, process
> uses only tens of MB until the final save, when it spikes up to ~100MB.
> 100,000 rows, 20 columns, no flushing: 79 seconds, ~200MB
> 100,000 rows, 20 columns, flushing every 1000 rows: 72 seconds, spikes
> up to ~45MB.
>
> Looks good to me.
>
> Doing anything major about the size of those spikes is best left to a
> separate exercise.

I am aware of the final spikes. My understanding is that this puts the
entire file's data in memory. I had considered writing some code that
would just stream the biff data directly into a file instead of
holding it in a giant string.

> > I'd need to add some code for exception handling, basically just a
> > dict that keeps track of flushed row numbers and raises an exception
> > if you try to write to a flushed row.  Does that sound acceptable?
>
> More precisely: if you try to create a Row object for a flushed
> row-index. A set is logically preferable but it would involve a
> conditional import for Python 2.3 so a dict is what I'll be using.

Right. Well if you're well on your way with this then I'm happy to
let you code it.

> > I'm a bit under the gun at the moment on two projects so I might not
> > have the patches out till Monday.  What is the best way to submit
> > them?
>
> If you have any minor changes to the flushing stuff, please let me know
> ASAP. Otherwise, yes, I'd very much appreciate separate patches (and
> example file and API docs) for the print titles stuff and the sheet
> insertion stuff, but there's no hurry. Physical delivery: attachments to
> a posting to this group is OK with me; however putting them on your
> website and posting links might suit other people better -- I don't
> really mind.

I do not have any changes to the flushing stuff. It has been working
well for us. We're putting it into production next week and will
migrate to your version when it is released.

I have made some minor changes to the other stuff. I'll put together
separate patches for the rows to repeat stuff and the insert sheet
stuff with examples and zip them up. I'll probably post to my blog
and post links.

Many thanks!
John

John Machin

unread,
Oct 11, 2008, 1:47:23 AM10/11/08
to python...@googlegroups.com
John Harrison wrote:
> John,

>
>
>> Example [Python 2.5.2, Windows XP SP2, one-core 2GHz CPU, enough memory
>> to avoid swapping]:
>> 250,000 rows, 20 columns, no flushing: 232 seconds, ~510MB
>> 250,000 rows, 20 columns, flushing every 1000 rows: 184 seconds, process
>> uses only tens of MB until the final save, when it spikes up to ~100MB.
>> 100,000 rows, 20 columns, no flushing: 79 seconds, ~200MB
>> 100,000 rows, 20 columns, flushing every 1000 rows: 72 seconds, spikes
>> up to ~45MB.
>>
>> Looks good to me.
>>
>> Doing anything major about the size of those spikes is best left to a
>> separate exercise.
>
> I am aware of the final spikes. My understanding is that this puts the
> entire file's data in memory.

No, just the BIFF data aka the Workbook stream. This then has to be
embedded in an OLE2 Compound Document. See CompoundDoc.py.

> I had considered writing some code that
> would just stream the biff data directly into a file instead of
> holding it in a giant string.

One problem to be overcome: the file needs to be randomly accessible
because the header stuff in the first 512 bytes can't be known until the
size of the Workbook stream is known. The argument to Workbook.save is
either the name of a file to be created (no problem) or a file-like
object. Up to now, the file-like object has required a write method and
nothing else ...
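
A minimal sketch of the seek-back pattern that raises the bar for the file-like object (the single length field here stands in for the real 512-byte OLE2 header, which is far more involved):

```python
import io
import struct

def save_with_patched_header(workbook_stream, out):
    header_pos = out.tell()
    out.write(struct.pack('<I', 0))   # placeholder: size not yet known
    out.write(workbook_stream)        # stream out the BIFF data
    end_pos = out.tell()
    out.seek(header_pos)              # needs seek(), not just write()
    out.write(struct.pack('<I', len(workbook_stream)))
    out.seek(end_pos)

buf = io.BytesIO()
save_with_patched_header(b'BIFF' * 4, buf)
```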

Let's put this in perspective: your row flushing idea gives a reduction
from ~500MB to about ~100MB for a 250,000-row file created by your
example script. Be happy. The size of the file is 75,602,432 bytes. Some
of that 100MB may be caused by fragmentation due to inefficient creation
of that giant string in Workbook.get_biff_data:
[snipped]
    sheets = ''
    for sheet in self.__worksheets:
        data = sheet.get_biff_data()
        sheets += data
    return before + bundlesheets + after + ext_sst + eof + sheets

I'd like to see the effect of a good dose of "".join() applied to that
method before we jump into streaming BIFF pieces into a CompDoc writer.
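
A sketch of that rewrite, using the names from the snippet above (the real method builds those leading pieces itself, so treat this as illustrative only):

```python
def get_biff_data_joined(worksheets, before, bundlesheets, after,
                         ext_sst, eof):
    # Same record order as the += version, but each sheet's data is
    # appended to a list and the string is assembled exactly once.
    pieces = [before, bundlesheets, after, ext_sst, eof]
    for sheet in worksheets:
        pieces.append(sheet.get_biff_data())
    return ''.join(pieces)
```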

>
> I do not have any changes to the flushing stuff. It has been working
> well for us. We're putting it into production next week and will
> migrate to your version when it is released.
>
> I have made some minor changes to the other stuff. I'll put together
> separate patches for the rows to repeat stuff and the insert sheet
> stuff with examples and zip them up. I'll probably post to my blog
> and post links.
>
> Many thanks!

And to you!

Cheers,
John

Chris Withers

unread,
Oct 16, 2008, 6:31:13 AM10/16/08
to python...@googlegroups.com
John Machin wrote:
> Chris Withers wrote:
>> John Machin wrote:
>>
>> That brings up the subject of documentation ;-) It would be good if
>> there existed some meta-docs on how to generate documentation for xlrd
>> and xlwt. I know I started by hand for xlwt unaware that John had used a
>> tool for xlrd (the style of which I was copying). I'm also wondering if
>> maybe we should use Sphynx* for the docs for both?
>
> Please start a new thread.

Well, given the response I had to the thread I started on testing... ;-)
I did actually take a look at Sphinx (http://sphinx.pocoo.org/) and it
seems to be for authoring static documentation, for which I'm quite
happy to use just plain restructured text for docs that end up on PyPI
and plain text for other stuff.

There's epydoc (http://epydoc.sourceforge.net/) for generation of api
docs, but I don't like its frame-based layout
(http://epydoc.sourceforge.net/stdlib/).

pydoctor (http://codespeak.net/~mwh/pydoctor/) would be my choice so far
although I haven't actually used it, I just liked its output:
http://starship.python.net/crew/mwh/nevowapi/

John M, what was/is the tool you're currently using for xlrd?

cheers,

Chris

Chris Withers

unread,
Oct 16, 2008, 6:34:30 AM10/16/08
to python...@googlegroups.com
John Machin wrote:
>>> also I'd like to see some examples and/or
>>> test files
>> I'm going to post about testing shortly...
>>
>>> and/or a paragraph or two for the docs
>
> My point was that a patch should be accompanied by an example file that
> shows Noddies like me how it's intended to be used and what the expected
> outcome is, instead of having to reverse engineer that from the patch.

Well, actually, it should come with some unit tests and a docstring for
the changed or added methods. *and* it should come with a doctest that
shows how it should be used...

> I'm not so happy, so I'm working on a cut-down patch with just the
> row-flush stuff in it before anyone goes all Gadarene :-)

Thanks for doing this :-)

cheers,

Chris

John Machin

unread,
Oct 16, 2008, 6:48:03 AM10/16/08
to python...@googlegroups.com
Chris Withers wrote:
> John Machin wrote:
>> Chris Withers wrote:
>>> John Machin wrote:
>>>
>>> That brings up the subject of documentation ;-) It would be good if
>>> there existed some meta-docs on how to generate documentation for xlrd
>>> and xlwt. I know I started by hand for xlwt unaware that John had used a
>>> tool for xlrd (the style of which I was copying). I'm also wondering if
>>> maybe we should use Sphynx* for the docs for both?
>> Please start a new thread.
>
> Well, given the response I had to the thread I started on testing... ;-)

Sorry, I was busy at the time and didn't get back to it. Watch this space.

> I did actually take a look at Sphinx (http://sphinx.pocoo.org/) and it
> seems to be for authoring static documentation, for which I'm quite
> happy to use just plain restructured text for docs that end up on PyPI
> and plain text for other stuff.
>
> There's epydoc (http://epydoc.sourceforge.net/) for generation of api
> docs, but I don't like its frame-based layout
> (http://epydoc.sourceforge.net/stdlib/).
>
> pydoctor (http://codespeak.net/~mwh/pydoctor/) would be my choice so far
> although I haven't actually used it, I just liked its output:
> http://starship.python.net/crew/mwh/nevowapi/
>
> John M, what was/is the tool you're currently using for xlrd?

PythonDoc; see http://effbot.org/zone/pythondoc.htm

I'll have a look at the ones you mentioned.

Cheers,
John

John Machin

unread,
Oct 16, 2008, 7:39:02 AM10/16/08
to python...@googlegroups.com
Chris Withers wrote:
> John Machin wrote:
>>>> also I'd like to see some examples and/or
>>>> test files
>>> I'm going to post about testing shortly...
>>>
>>>> and/or a paragraph or two for the docs
>> My point was that a patch should be accompanied by an example file that
>> shows Noddies like me how it's intended to be used and what the expected
>> outcome is, instead of having to reverse engineer that from the patch.
>
> Well, actually, it should come with some unit tests and a docstring for
> the changed or added methods. *and* it should come with a doctest that
> shows how it should be used...

How about you produce the unit tests and the docstrings and a doctest
for Worksheet.flush_row_data?

>
>> I'm not so happy, so I'm working on a cut-down patch with just the
>> row-flush stuff in it before anyone goes all Gadarene :-)
>
> Thanks for doing this :-)

No need for thanks. Just test it :-)

Chris Withers

unread,
Oct 16, 2008, 7:42:11 AM10/16/08
to python...@googlegroups.com
John Machin wrote:
>> John M, what was/is the tool you're currently using for xlrd?
>
> PythonDoc; see http://effbot.org/zone/pythondoc.htm

Aha.

From what I've seen so far, I'd go with pydoctor, I prefer the
navigation and summary structures it builds...

...and besides, Michael is in NZ, so merely spitting distance away
should you need any bugs fixing ;-)

Chris Withers

unread,
Oct 16, 2008, 7:44:01 AM10/16/08
to python...@googlegroups.com
John Machin wrote:
>>> My point was that a patch should be accompanied by an example file that
>>> shows Noddies like me how it's intended to be used and what the expected
>>> outcome is, instead of having to reverse engineer that from the patch.
>> Well, actually, it should come with some unit tests and a docstring for
>> the changed or added methods. *and* it should come with a doctest that
>> shows how it should be used...
>
> How about you produce the unit tests and the docstrings and a doctest
> for Worksheet.flush_row_data?

If I end up needing it, I promise I will ;-)
(pending help and feedback on the testing thread...)

..and xlutils will certainly have appropriate examples and tests!

John Machin

unread,
Oct 16, 2008, 9:10:16 PM10/16/08
to python...@googlegroups.com

pydoctor sucks stuff in like Darth Vader's tractor beam on steroids. It
requires Nevow. Nevow requires Twisted. That's just to get pydoctor
installed. Then when you run it you get:
"""
Error trying to import 'epytext' parser:

ImportError: No module named epydoc.markup.epytext

Using plain text formatting only.
"""

Sheesh.


Chris Withers

unread,
Oct 17, 2008, 9:54:36 AM10/17/08
to python...@googlegroups.com, Michael Hudson
John Machin wrote:
> pydoctor sucks stuff in like Darth Vader's tractor beam on steroids. It
> requires Nevow. Nevow requires Twisted. That's just to get pydoctor
> installed. Then when you run it you get:
> """
> Error trying to import 'epytext' parser:
>
> ImportError: No module named epydoc.markup.epytext
>
> Using plain text formatting only.

Ouch. This sucks. Michael, why did you make pydoctor suck so much? ;-)

John Machin

unread,
Oct 17, 2008, 9:02:21 PM10/17/08
to python...@googlegroups.com, Michael Hudson
Chris Withers wrote:
> John Machin wrote:
>> pydoctor sucks stuff in like Darth Vader's tractor beam on steroids. It
>> requires Nevow. Nevow requires Twisted. That's just to get pydoctor
>> installed. Then when you run it you get:
>> """
>> Error trying to import 'epytext' parser:
>>
>> ImportError: No module named epydoc.markup.epytext
>>
>> Using plain text formatting only.
>
> Ouch. This sucks. Michael, why did you make pydoctor suck so much? ;-)
>

and it also (silently) installed zope.interface

[exits, muttering something about a distant waterway]

Chris Withers

unread,
Oct 21, 2008, 9:05:21 AM10/21/08
to python...@googlegroups.com, michael...@canonical.com
John Machin wrote:
> and it also (silently) installed zope.interface
>
> [exits, muttering something about a distant waterway]

reference lost on me ;-)

...but really, I wish zope.interface was in the python core. It's such a
useful library.

cheers,
