any way to utilize optimizations when I need access to remove_sheet() ?

243 views
Skip to first unread message

axf...@gmail.com

unread,
Sep 4, 2014, 7:22:12 PM9/4/14
to openpyx...@googlegroups.com
Hi,

OpenPyxl is dying on me when trying to work with a workbook that has one sheet containing about 500k rows and another sheet containing about 60k rows -- Python's memory usage steadily climbs to around 1gb then is automatically terminated. I could conceivably raise this limit to around 2gb, but I'm afraid that might not do the trick anyhow, and after seeing that there are apparently some optimized methods available, I'd like to try using them. From reviewing https://pythonhosted.org/openpyxl/optimized.html, however, it doesn't seem like either will necessarily suit my use case -- apparently the optimized writer method isn't available when doing load_workbook(), and the optimized reader method creates read-only worksheets? If I'm wrong about the former, I can't tell how to invoke it in my code based on the documentation, and I'm indeed getting an error when trying the latter. Here's the relevant snippet of my code:

workbook = openpyxl.load_workbook(workbookname)
sheets = workbook.get_sheet_names()
totalsheets = len(sheets)
if totalsheets!=1:
workbooks = {}
for y in sheets:
if y != "DV-IDENTITY-0":
workbooks[y] = copy.deepcopy(workbook)
for x in sheets:
if x != y:
workbooks[y].remove_sheet(workbooks[y].get_sheet_by_name(x))
cleanname = re.sub(r'[^a-zA-Z0-9\-_]','',y)
filename = csvlist_filepath+cleanname+'.xlsx'
workbooks[y].save(filename)

Works fine on small files. When I try adding use_iterators = True as an argument to that first load_workbook() line, however, openpyxl dies with 

Traceback (most recent call last):
  File "csvnormalize.py", line 36, in <module>
    workbooks[y].save(filename)
  File "c:\Users\Alex\Dropbox\Python27\lib\site-packages\openpyxl\workbook\workbook.py", line 267, in save
    save_workbook(self, filename)
  File "c:\Users\Alex\Dropbox\Python27\lib\site-packages\openpyxl\writer\excel.py", line 185, in save_workbook
    writer.save(filename)
  File "c:\Users\Alex\Dropbox\Python27\lib\site-packages\openpyxl\writer\excel.py", line 168, in save
    self.write_data(archive)
  File "c:\Users\Alex\Dropbox\Python27\lib\site-packages\openpyxl\writer\excel.py", line 86, in write_data
    archive.writestr(ARC_WORKBOOK, write_workbook(self.workbook))
  File "c:\Users\Alex\Dropbox\Python27\lib\site-packages\openpyxl\writer\workbook.py", line 231, in write_workbook
    {'activeTab': '%d' % workbook.get_index(workbook.get_active_sheet()),
  File "c:\Users\Alex\Dropbox\Python27\lib\site-packages\openpyxl\workbook\workbook.py", line 131, in get_active_sheet
    return self.active
  File "c:\Users\Alex\Dropbox\Python27\lib\site-packages\openpyxl\workbook\workbook.py", line 136, in active
    return self.worksheets[self._active_sheet_index]
IndexError: list index out of range

Any ideas appreciated!

Adam Morris

unread,
Sep 4, 2014, 9:03:44 PM9/4/14
to openpyx...@googlegroups.com, axf...@gmail.com
The error seems to suggest that maybe the active sheet is being deleted.  Before saving, try setting this to the first sheet, like:

wb._active_sheet_index = 0


Cheers,
Adam

Charlie Clark

unread,
Sep 5, 2014, 2:57:37 AM9/5/14
to openpyx...@googlegroups.com
Hiya Alex,

Am .09.2014, 01:22 Uhr, schrieb <axf...@gmail.com>:

> OpenPyxl is dying on me when trying to work with a workbook that has one
> sheet containing about 500k rows and another sheet containing about 60k
> rows -- Python's memory usage steadily climbs to around 1gb then is
> automatically terminated. I could conceivably raise this limit to around
> 2gb, but I'm afraid that might not do the trick anyhow, and after seeing
> that there are apparently some optimized methods available, I'd like to
> tryusing them.

Memory use is driven by a number of things but mainly the number of cells
and what they contain. You are working with some pretty large files and if
those rows are wide then you *will* have to increase the amount of memory
available to Python.

The implementation for cells is already very good - I ran some comparisons
with dictionaries and they are only slightly faster and smaller.

> Here's the relevant snippet of my code:
> workbook = openpyxl.load_workbook(workbookname)
> sheets = workbook.get_sheet_names()
> totalsheets = len(sheets)
> if totalsheets!=1:
> workbooks = {}
> for y in sheets:
> if y != "DV-IDENTITY-0":
> workbooks[y] = copy.deepcopy(workbook)
> for x in sheets:
> if x != y:
> workbooks[y].remove_sheet(workbooks[y].get_sheet_by_name(x))
> cleanname = re.sub(r'[^a-zA-Z0-9\-_]','',y)
> filename = csvlist_filepath+cleanname+'.xlsx'
> workbooks[y].save(filename)

> Works fine on small files. When I try adding use_iterators = True as an
> argument to that first load_workbook() line, however, openpyxl dies with

The optimised modes are for when you are either reading or writing only. I
don't think that is your case so you will have to load everything into
memory.

Using copy.deepcopy is not advised on workbooks and I wouldn't be
surprised to see it add to your memory woes. See
https://bitbucket.org/openpyxl/openpyxl/issue/171/ for more information.

The traceback is probably misleading. Read-only workbooks and worksheets
inherit from the standard classes which means that methods are available
unless they are explicitly disabled or removed: you simply shouldn't be
able to remove a worksheet from a read-only workbook. Work is ongoing to
clean this up either to make things work, or remove the method, or provide
you an error. Starting with 2.1 we also refer to read-only or write-only
workbooks.

It seems there is some demand for a kind of template function which we
might be able to offer in a future version. Like the other optimisations
this will involve some kind of compromise (ie. some things won't be
possible). It looks to me at the moment like an extension of the existing
write-only workbooks.

Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226

axf...@gmail.com

unread,
Sep 5, 2014, 12:26:53 PM9/5/14
to openpyx...@googlegroups.com
Hi guys,

Thanks for the responses. I agree that the traceback was probably misleading -- after trying Adam's fix, I got "ValueError: can only parse strings," which seems totally unrelated; I think you're right in that I currently can't utilize either of the optimised modes. Unfortunately, I just tried running the script without the optimizations on a 16gb machine just to see if it could at least complete, and I had to switch to tty1 to kill it because it started hammering swap so hard after it ran out of memory :(

For now I'll impose a strict filesize limit on the sheets our script will handle, but suffice it to say I would love an optimized method that would permit me to remove worksheets while still not being able to directly edit cell contents.

Charlie Clark

unread,
Sep 5, 2014, 1:24:51 PM9/5/14
to openpyx...@googlegroups.com
Am .09.2014, 18:26 Uhr, schrieb <axf...@gmail.com>:

> Hi guys,

> Thanks for the responses. I agree that the traceback was probably
> misleading -- after trying Adam's fix, I got "ValueError: can only parse
> strings," which seems totally unrelated; I think you're right in that I
> currently can't utilize either of the optimised modes. Unfortunately, I
> just tried running the script without the optimizations on a 16gb machine
> just to see if it could at least complete, and I had to switch to tty1 to
> kill it because it started hammering swap so hard after it ran out of
> memory.

That doesn't sound right. The benchmark code has some pretty big sheets in
it and never gets above about 400 MB. Sounds like garbage collection isn't
happening. You'll have to do some memory profiling to see what's going on.
I'm afraid I've not experience with memory profiling myself but Adam did
put some notes in the documentation. If you're still using
copy.deepcopy(workbook) probably now is a good time to stop.

> For now I'll impose a strict filesize limit on the sheets our script will
> handle, but suffice it to say I would love an optimized method that would
> permit me to remove worksheets while still not being able to directly
> edit
> cell contents.

It would probably help a lot if you could provide more complete code and
some data samples. Otherwise we can only really nod sagely, shrug or wave
our hands.

axf...@gmail.com

unread,
Sep 5, 2014, 1:36:08 PM9/5/14
to openpyx...@googlegroups.com
Ah, certainly :)



I seem to recall there was a reason when I wrote this some months ago that I used the deepcopy-and-prune-sheets-off method rather than instantiating a new workbook each time and copying sheets into it, but I can't seem to remember what that was. If you're confident that that's the best place to optimize, I'll take another look. Thanks!

axf...@gmail.com

unread,
Sep 5, 2014, 1:36:50 PM9/5/14
to openpyx...@googlegroups.com, axf...@gmail.com
Oh, and for the record, this was python 2.7.6.

Charlie Clark

unread,
Sep 5, 2014, 1:49:39 PM9/5/14
to openpyx...@googlegroups.com
Am .09.2014, 19:36 Uhr, schrieb <axf...@gmail.com>:

> Ah, certainly
> Code:
> https://github.com/axfelix/islandora_solution_pack_data/blob/master/csvnormalize.py

Grumble, GPL… ;-)

> Data: http://lib-hf1.lib.sfu.ca/5to1Lvl20130328.xlsx

> I seem to recall there was a reason when I wrote this some months ago
> that
> I used the deepcopy-and-prune-sheets-off method rather than
> instantiating a
> new workbook each time and copying sheets into it, but I can't seem to
> remember what that was. If you're confident that that's the best place to
> optimize, I'll take another look. Thanks!

Well, it certianly cannot work with read-only worksheets! I can't say with
any confidence what the cause is but Jared seemed to finger that and
suggest that using shutil seemed to work better. I won't really be able to
look at this before next week but maybe someone else will.

Regarding a future implementation: I can imagine something where we might
allow a worksheet to be "frozen" and essentially write it to a tempfile
and delete all the cells. That seems like the only way to allow templates
for really large files. 2.1 includes a lot more code-sharing between
implementations which should make this less hairy.

Adam Morris

unread,
Sep 5, 2014, 9:13:01 PM9/5/14
to openpyx...@googlegroups.com, axf...@gmail.com
If you just need to delete sheets, you could do it somewhat manually -

1. Unzip the excel file
2. Parse the xl/workbook.xml for the sheet names / ids
3. Delete the sheets from the xl/worksheets folder
4. Delete / reorder the sheets in the xml in the xl/workbook.xml file
5. Zip everything back up

Not sure how excel would respond with unnecessary entries in the string table, or any cross-sheet formulae, but it might be worth a shot.

A few months ago, I wrote a quick utility for editing excel documents @ https://bitbucket.org/amorris/editpyxl -- It's very limited in the sense that you can't add cells or edit shared text, but it works with excel files in a different way, and perhaps could be a starting point for what you want to do.  It's interface is very similar to openpyxl.

--Adam



On Thursday, September 4, 2014 7:22:12 PM UTC-4, axf...@gmail.com wrote:

Charlie Clark

unread,
Sep 6, 2014, 1:57:08 AM9/6/14
to openpyx...@googlegroups.com
Am .09.2014, 03:13 Uhr, schrieb Adam Morris <w...@myemptybucket.com>:

> If you just need to delete sheets, you could do it somewhat manually -
> 1. Unzip the excel file
> 2. Parse the xl/workbook.xml for the sheet names / ids
> 3. Delete the sheets from the xl/worksheets folder
> 4. Delete / reorder the sheets in the xml in the xl/workbook.xml file
> 5. Zip everything back up
> Not sure how excel would respond with unnecessary entries in the string
> table, or any cross-sheet formulae, but it might be worth a shot.

That won't work as the metadata is spread throughout the archive.

If that's what needs to be done, it's easier to load the workbook in
openpyxl and save it under a different name.

axf...@gmail.com

unread,
May 22, 2015, 1:22:15 PM5/22/15
to openpyx...@googlegroups.com
Hi folks!

Sorry to bump this old thread, but as I just noticed that there have been some major updates to openpyxl -- such that my old code on longer works with the newest version in pip :) -- I was wonder if this use case, removing sheets when using optimized methods so everything isn't being loaded cell by cell, is now better supported. A simple "yes, with method x" would be great and I can take it from there.

Charlie Clark

unread,
May 22, 2015, 1:29:14 PM5/22/15
to openpyx...@googlegroups.com
Am .05.2015, 19:22 Uhr, schrieb <axf...@gmail.com>:

> Hi folks!

> Sorry to bump this old thread, but as I just noticed that there have been
> some major updates to openpyxl -- such that my old code on longer works
> with the newest version in pip -- I was wonder if this use case,
> removing sheets when using optimized methods so everything isn't being
> loaded cell by cell, is now better supported. A simple "yes, with method
> x" would be great and I can take it from there.

Simple answer: no. There have indeed been major changes but these
essentially involve the following areas:

* improving reliability
* improving and adding features especially formatting
* improving performance especially when dumping large files
* harmonising the API across the different modes so that you can chop and
change them with minimal changes to your code

We could do with something like a lazy load mode that would allow
destructive operations so that we could do some of things that xlutils
does. You could probably hack the necessary code to save a workbook after
removing a worksheet into read-only mode and then maybe submit a pull
request.

Alex Garnett

unread,
May 22, 2015, 1:33:03 PM5/22/15
to openpyx...@googlegroups.com
Thanks for the quick response! I'll look into it if I get the chance.


-alex

--
You received this message because you are subscribed to a topic in the Google Groups "openpyxl-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/openpyxl-users/tCrIdbgtKRc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to openpyxl-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply all
Reply to author
Forward
0 new messages