I don't really understand this, so I'm not quite so interested ;-)
> I've also made two other changes. One makes great improvements to
> both the speed
...I think this is likely incidental, swapping would be my guess as the
cause of the slowdown.
> and especially the memory consumption of xlwt when
> writing out large spreadsheets.
Now this is really interesting!
(read: I might be looking for this myself soon, the xlwt memory
structures certainly are memory hungry...)
> The other allows a user to insert
> worksheets at any location in a workbook rather than simply appending
> them to the end of the current list of sheets.
As is this!
> I've also posted a diff of my changes at:
> http://insightvr.com/download/xlwtHack.diff
John, can you see any reasons not just to merge and commit this patch?
If the answer is "no", then I'll happily get it merged and committed.
cheers,
Chris
--
Simplistix - Content Management, Zope & Python Consulting
- http://www.simplistix.co.uk
It's so that you get a heading at the top of every printed page. It's
one of those things that end-users understand and appreciate :-)
>
>> I've also made two other changes. One makes great improvements to
>> both the speed
>
> ...I think this is likely incidental, swapping would be my guess as the
> cause of the slowdown.
>
>> and especially the memory consumption of xlwt when
>> writing out large spreadsheets.
>
> Now this is really interesting!
> (read: I might be looking for this myself soon, the xlwt memory
> structures certainly are memory hungry...)
Good work. Looks great at first glance. Callers would need to be warned
about the consequences of putting more data in a row after it had been
flushed.
>
>> The other allows a user to insert
>> worksheets at any location in a workbook rather than simply appending
>> them to the end of the current list of sheets.
>
> As is this!
Ummm, what about:
summary_sheet = wb.add_sheet("Summary")
data_sheet = wb.add_sheet("Data")
# write stuff to data_sheet, accumulate summary info
# write summary info to summary_sheet
?
>
>> I've also posted a diff of my changes at:
>> http://insightvr.com/download/xlwtHack.diff
>
> John, can you see any reasons not just to merge and commit this patch?
> If the answer is "no", then I'll happily get it merged and committed.
>
Presuming "John" means me, not John Harrison:
Reasons not to just merge and commit this patch (now, willy-nilly):
(1) It hasn't been reviewed; also I'd like to see some examples and/or
test files and/or a paragraph or two for the docs
(2) It should be 3 quite separate patches
(3) Christophe Tronche and I are currently working on a patch to provide
major improvement to xlwt's formula capabilities; as CT has evidently
some difficulty/problem with svn, he is supplying me with patches
against the version 0.7.0 tar.gz, and I would very much prefer for svn
not to be updated until this patch is ready -- a couple of days at most.
(4) I am somewhat concerned by "Also, I’d mucked up the worksheet index
concept by allowing a user to insert a worksheet earlier in the list. So
you could invalidate an index prior to writing out the xls file". In
particular I'd like to investigate the effect on the formula patch (the
main effect of which is to support 3D references to other worksheets
e.g. Sheet1:Sheet3!A1:Z99) and its use of worksheet indexes.
Cheers,
John
Sorry, I don't understand. AFAICT [*] the xlwt files in the svn
repository are the same as in the version 0.7.0 tar.gz, as expected.
* = TortoiseSVN: check for modifications, click on "Check Repository",
shows tags updated, no mention of source files changed.
I had a look but couldn't find this Check Repository option.
If you're using "check for modifications", that checks whether you've
made any changes locally, *not* whether anything has changed in the
repository.
To get changes that have been committed to the repository, right-click
and select "SVN Update". That will update your checkout with any changes
committed to the repository.
If you just want to see if anything's been committed, right-click ->
TortoiseSVN -> show log is quite handy. That'll show you a log of all
commits where you can spot any new ones and when they were committed.
Better yet, if you want to see what changes there were between the two
revisions, hold down Ctrl and click on the two revisions, then
right-click and select 'compare revisions'.
All that said, xlwt trunk does indeed seem to be 0.7.0....
OK, this stuff I'm not *that* interested in, but I don't mind merging as
part of the work to merge the memory usage stuff...
>>> I've also made two other changes. One makes great improvements to
>>> both the speed
>> ...I think this is likely incidental, swapping would be my guess as the
>> cause of the slowdown.
>
> I agree, but on my laptop the increase is significant.
You're missing my point; if you ran your new code over a data set that
didn't consume more than the physical memory of the box, you'd likely
see no speed increase. In fact, you might even see a slight slowdown ;-)
>>> and especially the memory consumption of xlwt when
>>> writing out large spreadsheets.
>> Now this is really interesting!
>> (read: I might be looking for this myself soon, the xlwt memory
>> structures certainly are memory hungry...)
scratch might, I *do* need this ;-)
> Yeah, memory consumption is greatly improved, 35x in my simple 100,000
> row test. More rows, more improvement. Ugh, just realized that I'm
> testing at the wrong point in my code for memory consumption. Need to
> check that out, but it is still a big improvement.
I noticed in my tests with heapy (glad to see you're using it as well!)
that a lot of memory was taken up by formatting objects. Did you see
this memory usage go down with your patch?
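For anyone wanting to make the same kind of measurement without heapy: on recent Pythons the standard-library tracemalloc module can answer the per-cell-style vs shared-style question. A rough sketch (XFStyleLike is a made-up stand-in here, *not* xlwt's real XFStyle class):

```python
import tracemalloc

class XFStyleLike(object):
    """Made-up stand-in for a formatting object; NOT xlwt's real XFStyle."""
    def __init__(self):
        self.num_format_str = 'General'
        self.font = object()
        self.borders = object()

def measure(shared):
    """Return bytes allocated while building 10,000 cells' worth of styles."""
    tracemalloc.start()
    one = XFStyleLike()
    styles = [one if shared else XFStyleLike() for _ in range(10000)]
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current

per_cell = measure(shared=False)   # a fresh style object for every cell
shared = measure(shared=True)      # one style object referenced 10,000 times
print(per_cell, shared)
```

Sharing a single style object should show dramatically lower allocation than creating one per cell, which is the pattern the question above is probing for.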
> That is the question that I'm wondering about. For how I use xlwt my
> changes hurt nothing and help greatly. I believe that the worksheet
> insertion code and the rows to repeat code (includes print areas) will
> not break any compatibility with current functionality.
Good :-)
> The row flush code is possibly more problematic. My understanding
> (from memory) is that the api does not currently support deleting a
> row or a cell in a row.
Well, from my understanding of the purpose of xlwt, it's for
serialisation of data to .xls format, so it should never support
deleting a row or a cell.
> backwards compatible. The concern is in how the sheet keeps track of
> the max/min rows and cols. My code has to keep track of these as they
> are added to a sheet rather than figuring this out when rows are
> written.
OK, but what difference does this make?
> So I don't think that the row flush changes break compatibility.
> However that doesn't mean that it is safe for everyone to start using
> this. It depends on how you write your rows. You don't want to flush
> when you might write more cells to a row that you've already written
> to.
OK, I think some checking and exception raising should solve this.
You should only flush when you're sure you don't want to write more
cells to a column you've already written. xlwt should raise an exception
if you do.
> Also I don't know if there is any functionality that needs to
> reference a row in some way after it is written other than to write it
> out. Flushing the rows to the temp file would be problematic in this
> case.
Right, does anyone know if there are problems with this?
> Basically you can't flush the rows if you're going to use that row
> again. But you have to call the flush as it isn't automatic, and if
> you don't flush at all the only difference should be the max/min code.
Right, from this perspective I think it's fine.
> I think that my concern boils down to something that would be resolved
> by adequate documentation.
...and some exception raising when you do silly things.
As I said earlier, I now need this code so I'm happy to merge it.
I'm also interested in writing some unit tests, but that's a subject for
a separate post.
yes, we hate end users, but that'll be another separate post arriving
shortly ;-)
>>> and especially the memory consumption of xlwt when
>>> writing out large spreadsheets.
>> Now this is really interesting!
>> (read: I might be looking for this myself soon, the xlwt memory
>> structures certainly are memory hungry...)
>
> Good work. Looks great at first glance. Callers would need to be warned
> about the consequences of putting more data in a row after it had been
> flushed.
That brings up the subject of documentation ;-) It would be good if
there existed some meta-docs on how to generate documentation for xlrd
and xlwt. I know I started by hand for xlwt, unaware that John had used
a tool for xlrd (the style of which I was copying). I'm also wondering
if maybe we should use Sphinx for the docs for both?
>>> The other allows a user to insert
>>> worksheets at any location in a workbook rather than simply appending
>>> them to the end of the current list of sheets.
>> As is this!
>
> Ummm, what about:
>
> summary_sheet = wb.add_sheet("Summary")
> data_sheet = wb.add_sheet("Data")
> # write stuff to data_sheet, accumulate summary info
> # write summary info to summary_sheet
>
> ?
Not sure what the point you're trying to make here is John, could you
elaborate?
> Presuming "John" means me, not John Harrison:
You are the relevant John, yes ;-)
> Reasons not to just merge and commit this patch (now, willy-nilly):
> (1) It hasn't been reviewed;
Well, I need to use it in a production setting, so it'll be getting
*plenty* of review...
> also I'd like to see some examples and/or
> test files
I'm going to post about testing shortly...
> and/or a paragraph or two for the docs
See earlier references to Sphinx and documentation in general. I'm happy
to stick with the current tool, but some brief meta-docs on what it is
and how to use it would be great.
> (2) It should be 3 quite separate patches
Indeed, although I'm happy to merge the "uber-patch" if that's all
that's available...
> (3) Christophe Tronche and I are currently working on a patch to provide
> major improvement to xlwt's formula capabilities; as CT has evidently
> some difficulty/problem with svn, he is supplying me with patches
> against the version 0.7.0 tar.gz, and I would very much prefer for svn
> not to be updated until this patch is ready -- a couple of days at most.
svn is really rather good at this kind of thing ;-)
If you right-click on the folder that is your xlwt checkout and then do
TortoiseSVN -> show log, you'll get the log listing.
If you right-click on a revision (I think 3843 is the one that
corresponds to the 0.7.0 release) and select "update item to revision",
your checkout will be exactly as the source tree was at revision 3843,
so your patch should merge fine.
If you then do "SVN update", all later changes will be merged in. If
there are incompatible changes, the file(s) affected will show in a
conflicted state. You can then manually fix the conflicts before marking
them as resolved and committing your changes.
> (4) I am somewhat concerned by "Also, I’d mucked up the worksheet index
> concept by allowing a user to insert a worksheet earlier in the list. So
> you could invalidate an index prior to writing out the xls file".
I couldn't find the email this was quoted from. Which one was it? Not
sure of the original context...
> In
> particular I'd like to investigate the effect on the formula patch (the
> main effect of which is to support 3D references to other worksheets
> e.g. Sheet1:Sheet3!A1:Z99) and its use of worksheet indexes.
OK, well, it's the row flushing that I really really need asap. I assume
the row flushing bits are okay to merge with your proviso above?
cheers,
Chris
If it does go down, that could mean that your code may be creating new
XFStyle objects for each cell, rather than sharing them. It may also be
a memory leak. How are you creating your XFStyle objects? From an input
xls file using your replacement for the xlrd2wtmap gadget, or from scratch?
When one does
shell-prompt> runxlrd.py -f1 hdr empty_file_created_by_Excel_2003.xls
one gets:
[snip]
FORMATs: 8, FONTs: 4, XFs: 21
These are counts of the actual occurrences in the file.
What do you get when you do that with your output xls file? With your
input file, if any? How do those numbers compare with "the lot of
memory" that you are seeing with heapy? What is a "lot"?
Please report a second set of results after using:
output_book = xlwt.Workbook(style_compression=2)
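For the archives, the suggestion above in context might look something like this (assuming xlwt's documented Workbook(style_compression=2) parameter and XFStyle class; the sheet and format details are just for illustration):

```python
import io
import xlwt

# style_compression=2 asks xlwt to deduplicate identical FONT/XF records
# on output; reusing one XFStyle object per distinct format helps even more.
book = xlwt.Workbook(style_compression=2)
sheet = book.add_sheet('Data')

money = xlwt.XFStyle()
money.num_format_str = '0.00'

for rowx in range(100):
    sheet.write(rowx, 0, rowx * 1.5, money)   # same style object every row

out = io.BytesIO()
book.save(out)   # save also accepts a filename
```

Running runxlrd.py with -f1 against the resulting file should then show small FORMAT/FONT/XF counts rather than one per cell.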
>> The row flush code is possibly more problematic. My understanding
>> (from memory) is that the api does not currently support deleting a
>> row or a cell in a row.
>
> Well, from my understanding of the purpose of xlwt, it's for
> serialisation of data to .xls format, so it should never support
> deleting a row or a cell.
I agree entirely. xlwt is for creating xls files, serially or randomly.
John H has presented a reasonable use-case for inserting new sheets
out-of-creation-order-wise, but that's rather different to deleting a
row or cell.
>
>> backwards compatible. The concern is in how the sheet keeps track of
>> the max/min rows and cols. My code has to keep track of these as they
>> are added to a sheet rather than figuring this out when rows are
>> written.
>
> OK, but what difference does this make?
>
>> So I don't think that the row flush changes break compatibility.
>> However that doesn't mean that it is safe for everyone to start using
>> this. It depends on how you write your rows. You don't want to flush
>> when you might write more cells to a row that you've already written
>> to.
>
> OK, I think some checking and exception raising should solve this.
> You should only flush when you're sure you don't want to write more
> cells to a column you've already written.
s/column/row/
> xlwt should raise an exception
> if you do.
My quick read of John H's patch gave me the impression that after a row
was flushed, the corresponding Row object was deleted [can't check;
insightvr.com website seems to be down]. We'd have to keep track of the
state (non-existent, created, flushed) of all possible rows in each
sheet, and check this on each Row.write etc call. Instead of deleting a
Row object and maintaining a separate state collection, it might be
better for the Row class to have a shrink_after_flushing method which
would set self.shrunk to True and delete all cell references.
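In code, that idea might look roughly like this (a simplified sketch with an invented class name, not xlwt's actual Row implementation):

```python
class RowSketch(object):
    """Simplified stand-in for xlwt's Row, illustrating the
    shrink_after_flushing idea: mark the row as flushed and drop
    its cell references instead of deleting the Row object."""
    def __init__(self, rowx):
        self.rowx = rowx
        self.shrunk = False
        self._cells = {}

    def write(self, colx, value):
        if self.shrunk:
            raise Exception("row %d already flushed; cannot add cells"
                            % self.rowx)
        self._cells[colx] = value

    def shrink_after_flushing(self):
        # called after the row's BIFF records go to the temp file
        self.shrunk = True
        self._cells = {}   # free the cell objects for garbage collection
```

The Row object itself survives, so no separate state collection is needed; every cell-writing method just checks self.shrunk first.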
>
>> Also I don't know if there is any functionality that needs to
>> reference a row in some way after it is written other than to write it
>> out. Flushing the rows to the temp file would be problematic in this
>> case.
>
> Right, does anyone know if there are problems with this?
There are some places in the BIFF emission where lengths/sizes/offsets
have to be patched in, but not to do with rows/cells (IIRC). I doubt
there are any problems -- but checking this would be part of the review
that anyone would expect to be done before the patch is merged in.
>
>> Basically you can't flush the rows if you're going to use that row
>> again. But you have to call the flush as it isn't automatic, and if
>> you don't flush at all the only difference should be the max/min code.
>
> Right, from this perspective I think it's fine.
>
>> I think that my concern boils down to something that would be resolved
>> by adequate documentation.
>
> ...and some exception raising when you do silly things.
>
> As I said earlier, I now need this code so I'm happy to merge it
Only the row-flushing part, please (*NOT* the worksheet-inserting stuff
and the head-rows stuff).
Will you add the "exception raising when you do silly things" code, as well?
Cheers,
John
Please start a new thread.
>>>> The other allows a user to insert
>>>> worksheets at any location in a workbook rather than simply appending
>>>> them to the end of the current list of sheets.
>>> As is this!
>> Ummm, what about:
>>
>> summary_sheet = wb.add_sheet("Summary")
>> data_sheet = wb.add_sheet("Data")
>> # write stuff to data_sheet, accumulate summary info
>> # write summary info to summary_sheet
>>
>> ?
>
> Not sure what the point you're trying to make here is John, could you
> elaborate?
The point was that if you know you will have only one summary sheet, you
can create it first (instead of inserting it later) and fill it in
after/while you fill in the data sheets. John H has subsequently
remarked that his use-case involved a variable number of summary sheets
(or words to that effect).
>
>> Presuming "John" means me, not John Harrison:
>
> You are the relevant John, yes ;-)
>
>> Reasons not to just merge and commit this patch (now, willy-nilly):
>> (1) It hasn't been reviewed;
>
> Well, I need to use it in a production setting, so it'll be getting
> *plenty* of review...
By "review" I meant reading the patch and the patched code closely to
check that it's doing something useful and that it has covered all the
bolt-holes -- see below.
>
>> also I'd like to see some examples and/or
>> test files
>
> I'm going to post about testing shortly...
>
>> and/or a paragraph or two for the docs
My point was that a patch should be accompanied by an example file that
shows Noddies like me how it's intended to be used and what the expected
outcome is, instead of having to reverse engineer that from the patch.
We got that for the row-flush part of the tripartite patch, but not the
other two.
>
> See earlier references to Sphinx and documentation in general. I'm happy
> to stick with the current tool, but some brief meta-docs on what it is
> and how to use it would be great.
See earlier reference for the need for a separate thread on documentation.
>
>> (2) It should be 3 quite separate patches
>
> Indeed, although I'm happy to merge the "uber-patch" if that's all
> that's available...
I'm not so happy, so I'm working on a cut-down patch with just the
row-flush stuff in it before anyone goes all Gadarene :-)
>
>> (4) I am somewhat concerned by "Also, I’d mucked up the worksheet index
>> concept by allowing a user to insert a worksheet earlier in the list. So
>> you could invalidate an index prior to writing out the xls file".
>
> I couldn't find the email this was quoted from. Which one was it? Not
> sure of the original context...
Quoted from John H's blog.
>
> OK, well, it's the row flushing that I really really need asap. I assume
> the row flushing bits are okay to merge with your proviso above?
Not quite. Tracking the leftmost and rightmost columns used needs to be
done in the Row.__adjust_bound_col_idx method (which is called by *all*
cell-writing methods), not just in the Worksheet.write method. If you
are using the Row.set_cell_<datatype> methods (like xlrd2wtmap does),
you might find that the DIMENSIONS record would have strange contents,
and some xls readers might sigh about it; e.g. xlrd with verbosity >= 1:
NOTE *** sheet 0 (u'foo0'): DIMENSIONS R,C = 10000,1 should be 10000,20
What does your "asap" translate into? Committed by the time you get in
to work on Monday, OK?
Cheers,
John
As stated two messages ago, I have already started on the row_flush caper.
I have fixed the problem with the first/last row/col being handled in
the wrong place, leading to the possibility of a bad DIMENSIONS record.
This definitely needed fixing; although xlrd (with verbosity=0), OOo
Calc, and Gnumeric don't complain, Excel 200[37] pops up a dialogue box
saying "File error: data may have been lost" -- not exactly what we want
the consumers of xlwt output files to experience. There is an existing
bug in pyExcelerator/xlwt: the DIMENSIONS record is incorrect for an
empty worksheet (needs special case code); I've fixed that as well.
I have also enhanced your example script to take command-line args for
the number of rows and columns and the flushing interval (with 0
implying don't flush). Unfortunately heapy isn't available on Windows
AFAICT so I'm stuck with watching the gauge in the Task Manager [anyone
have any better ideas?].
Example [Python 2.5.2, Windows XP SP2, one-core 2GHz CPU, enough memory
to avoid swapping]:
250,000 rows, 20 columns, no flushing: 232 seconds, ~510MB
250,000 rows, 20 columns, flushing every 1000 rows: 184 seconds, process
uses only tens of MB until the final save, when it spikes up to ~100MB.
100,000 rows, 20 columns, no flushing: 79 seconds, ~200MB
100,000 rows, 20 columns, flushing every 1000 rows: 72 seconds, spikes
up to ~45MB.
Looks good to me.
Doing anything major about the size of those spikes is best left to a
separate exercise. However I'm experimenting with the
Worksheet.get_biff_data method to see if using ''.join(list_of_strings)
is better than the += approach. Currently (whether we are flushing or
not) all of the row & cell records in the sheet are added in, then a few
small records are added in. If += is not being optimised, then memory is
needed for an extra copy of the row/cell records.
>
> I'd need to add some code for exception handling, basically just a
> dict that keeps track of flushed row numbers and raises an exception
> if you try to write to a flushed row. Does that sound acceptable?
More precisely: if you try to create a Row object for a flushed
row-index. A set is logically preferable but it would involve a
conditional import for Python 2.3 so a dict is what I'll be using.
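The dict-based tracking might look like this (a toy sketch with invented names, not the actual patch):

```python
class SheetSketch(object):
    """Toy model of a worksheet that records flushed row indexes in a
    dict (a dict rather than a set, to stay compatible with Python 2.3)."""
    def __init__(self):
        self._rows = {}
        self._flushed = {}          # flushed row-index -> 1

    def row(self, rowx):
        # refuse to re-create a Row object for an index already flushed
        if rowx in self._flushed:
            raise Exception("row index %d has been flushed" % rowx)
        return self._rows.setdefault(rowx, {})

    def flush_row_data(self):
        # the real code would serialise each row to the temp file here
        for rowx in self._rows:
            self._flushed[rowx] = 1
        self._rows = {}
```

The check lives in the row-lookup path rather than in every cell-writing method, matching John's "if you try to create a Row object for a flushed row-index" formulation.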
>
> I'm a bit under the gun at the moment on two projects so I might not
> have the patches out till Monday. What is the best way to submit
> them?
>
If you have any minor changes to the flushing stuff, please let me know
ASAP. Otherwise, yes, I'd very much appreciate separate patches (and
example file and API docs) for the print titles stuff and the sheet
insertion stuff, but there's no hurry. Physical delivery: attachments to
a posting to this group is OK with me; however putting them on your
website and posting links might suit other people better -- I don't
really mind.
I have made a few minor changes myself e.g. s/== None/is None/ :-)
Cheers,
John
No, just the BIFF data aka the Workbook stream. This then has to be
embedded in an OLE2 Compound Document. See CompoundDoc.py.
> I had considered writing some code that
> would just stream the biff data directly into a file instead of
> holding it in a giant string.
One problem to be overcome: the file needs to be randomly accessible
because the header stuff in the first 512 bytes can't be known until the
size of the Workbook stream is known. The argument to Workbook.save is
either the name of a file to be created (no problem) or a file-like
object. Up to now, the file-like object has required a write method and
nothing else ...
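The random-access requirement can be seen in miniature below: the header can only be patched once the stream size is known, which needs seek as well as write. (The byte layout here is purely illustrative, *not* the real OLE2 compound document header.)

```python
import io
import struct

buf = io.BytesIO()
buf.write(b'\x00' * 512)            # placeholder 512-byte header
buf.write(b'...BIFF records would stream out here...')
stream_size = buf.tell() - 512      # only known after the stream is written
buf.seek(0)                         # a write-only file object fails here
buf.write(struct.pack('<I', stream_size))   # patch size into the header
```

A caller passing a pipe or socket (write-only, unseekable) as the file-like object would break at the seek, which is why streaming BIFF directly to the output isn't a drop-in change.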
Let's put this in perspective: your row flushing idea gives a reduction
from ~500MB to ~100MB for a 250,000-row file created by your
example script. Be happy. The size of the file is 75,602,432 bytes. Some
of that 100MB may be caused by fragmentation due to inefficient creation
of that giant string in Workbook.get_biff_data:
[snipped]
    sheets = ''
    for sheet in self.__worksheets:
        data = sheet.get_biff_data()
        sheets += data
    return before + bundlesheets + after + ext_sst + eof + sheets
I'd like to see the effect of a good dose of "".join() applied to that
method before we jump into streaming BIFF pieces into a CompDoc writer.
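The "".join() version of that loop might look like this (sheet objects are faked for illustration, and the function name is invented):

```python
class FakeSheet(object):
    """Stand-in for xlwt's Worksheet, returning canned BIFF data."""
    def __init__(self, data):
        self._data = data
    def get_biff_data(self):
        return self._data

def assemble(worksheets, before, bundlesheets, after, ext_sst, eof):
    # collect every piece first, then do a single join: one final
    # allocation instead of copying the growing string on every +=
    pieces = [before, bundlesheets, after, ext_sst, eof]
    for sheet in worksheets:
        pieces.append(sheet.get_biff_data())
    return ''.join(pieces)

sheets = [FakeSheet('AAA'), FakeSheet('BBB')]
result = assemble(sheets, 'pre', 'bundle', 'post', 'sst', 'eof')
```

With += the interpreter may make a fresh copy of the accumulated string on each iteration (CPython sometimes optimises this in-place, but it isn't guaranteed), so for multi-hundred-MB workbooks the join form avoids a potential extra full copy of the row/cell records.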
>
> I do not have any changes to the flushing stuff. It has been working
> well for us. We're putting it into production next week and will
> migrate to your version when it is released.
>
> I have made some minor changes to the other stuff. I'll put together
> separate patches for the rows to repeat stuff and the insert sheet
> stuff with examples and zip them up. I'll probably post to my blog
> and post links.
>
> Many thanks!
And to you!
Cheers,
John
Well, given the response I had to the thread I started on testing... ;-)
I did actually take a look at Sphinx (http://sphinx.pocoo.org/) and it
seems to be for authoring static documentation, for which I'm quite
happy to use just plain reStructuredText for docs that end up on PyPI
and plain text for other stuff.
There's epydoc (http://epydoc.sourceforge.net/) for generation of api
docs, but I don't like its frame-based layout
(http://epydoc.sourceforge.net/stdlib/).
pydoctor (http://codespeak.net/~mwh/pydoctor/) would be my choice so far
although I haven't actually used it, I just liked its output:
http://starship.python.net/crew/mwh/nevowapi/
John M, what was/is the tool you're currently using for xlrd?
cheers,
Chris
Well, actually, it should come with some unit tests and a docstring for
the changed or added methods. *and* it should come with a doctest that
shows how it should be used...
> I'm not so happy, so I'm working on a cut-down patch with just the
> row-flush stuff in it before anyone goes all Gadarene :-)
Thanks for doing this :-)
cheers,
Chris
Sorry, I was busy at the time and didn't get back to it. Watch this space.
> I did actually take a look at Sphinx (http://sphinx.pocoo.org/) and it
> seems to be for authoring static documentation, for which I'm quite
> happy to use just plain reStructuredText for docs that end up on PyPI
> and plain text for other stuff.
>
> There's epydoc (http://epydoc.sourceforge.net/) for generation of api
> docs, but I don't like its frame-based layout
> (http://epydoc.sourceforge.net/stdlib/).
>
> pydoctor (http://codespeak.net/~mwh/pydoctor/) would be my choice so far
> although I haven't actually used it, I just liked its output:
> http://starship.python.net/crew/mwh/nevowapi/
>
> John M, what was/is the tool you're currently using for xlrd?
PythonDoc; see http://effbot.org/zone/pythondoc.htm
I'll have a look at the ones you mentioned.
Cheers,
John
How about you produce the unit tests and the docstrings and a doctest
for Worksheet.flush_row_data?
>
>> I'm not so happy, so I'm working on a cut-down patch with just the
>> row-flush stuff in it before anyone goes all Gadarene :-)
>
> Thanks for doing this :-)
No need for thanks. Just test it :-)
Aha.
From what I've seen so far, I'd go with pydoctor, I prefer the
navigation and summary structures it builds...
...and besides, Michael is in NZ, so merely spitting distance away
should you need any bugs fixing ;-)
If I end up needing it, I promise I will ;-)
(pending help and feedback on the testing thread...)
..and xlutils will certainly have appropriate examples and tests!
pydoctor sucks stuff in like Darth Vader's tractor beam on steroids. It
requires Nevow. Nevow requires Twisted. That's just to get pydoctor
installed. Then when you run it you get:
"""
Error trying to import 'epytext' parser:
ImportError: No module named epydoc.markup.epytext
Using plain text formatting only.
"""
Sheesh.
Ouch. This sucks. Michael, why did you make pydoctor suck so much? ;-)
and it also (silently) installed zope.interface
[exits, muttering something about a distant waterway]
reference lost on me ;-)
...but really, I wish zope.interface was in the python core. It's such a
useful library.
cheers,