Memory usage Openpyxl v2.0.4 vs v2.2.4

405 views
Skip to first unread message

Frank

unread,
Jul 7, 2015, 2:49:36 PM7/7/15
to openpyx...@googlegroups.com
Hi openpyxl-users,

I recently upgraded from 2.0.4 to 2.2.4 and noticed a significant increase in memory usage when opening the same workbook (with load_workbook):

Excel file = 35 MB
2.0.4, 32bit = 800 MB
2.0.4, 64bit = 1400 MB
2.2.4, 32bit > 1700 MB # python complains about memory and gives up
2.2.4, 64bit = 3100 MB

I know there were a lot of changes between the two versions, but I wonder if anyone has any insight into this? 

I'm aware of the optimized writer and reader. The memory usage becomes an issue on the 32bit machine, since as far as I know I can't apply conditional formatting and use the optimized writer.

I imagine this is working as intended, offering more features taking up more memory. Or maybe I'm just doing it wrong? 

Anyway, thanks for the amazing module it has helped a lot in cases where Excel is unfortunately required!

-Frank

Charlie Clark

unread,
Jul 7, 2015, 3:14:25 PM7/7/15
to openpyx...@googlegroups.com
Hiya,

Am .07.2015, 20:49 Uhr, schrieb Frank <rippo...@gmail.com>:

> Hi openpyxl-users,

> I recently upgraded from 2.0.4 to 2.2.4 and noticed a significant
> increase in memory usage when opening the same workbook (with
> load_workbook):

> Excel file = 35 MB
> 2.0.4, 32bit = 800 MB
> 2.0.4, 64bit = 1400 MB
> 2.2.4, 32bit > 1700 MB # python complains about memory and gives up
> 2.2.4, 64bit = 3100 MB

Thanks for the numbers. You can get a reasonable idea of how much memory
is required by opening the XML files contained in the XLSX archive.

> I know there were a lot of changes between the two versions, but I wonder
> if anyone has any insight into this?

I don't think this should really be happening so I'm not sure why it is.
It is probably related to the changes in the way styles are handled,
though that should actually reduce memory use.

> I'm aware of the optimized writer and reader. The memory usage becomes an
> issue on the 32bit machine, since as far as I know I can't apply
> conditional formatting and use the optimized writer.

That's really a synthetic limitation: the code just never interacts with
ConditionalFormatting. It would be fairly easy to activate it. Do you have
a lot of such formats?

> I imagine this is working as intended, offering more features taking up
> more memory. Or maybe I'm just doing it wrong?

The idea is to increase features without necessarily increasing memory
use. I've just run a comparison of the size of a cell between 2.0 and 2.3
using the Pympler module. In most situations, cells will determine memory
use. They're slightly larger in 2.2 and 2.3 because of direct access to
formatting objects such as font and border, but still pretty small

on MacOS 64-bit:
2.0 27 kB
2.3 40 kB

That wouldn't explain what you're seeing which is about 100% more. But I'm
not much of an expert on memory profiling, which is probably what's
required here.

> Anyway, thanks for the amazing module it has helped a lot in cases where
> Excel is unfortunately required!

Always glad to hear that! Currently, I'm also accepting donations: got my
phone and computer nicked at PyCon. :-(

Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226

Charlie Clark

unread,
Jul 14, 2015, 6:30:45 AM7/14/15
to openpyx...@googlegroups.com
Am .07.2015, 20:49 Uhr, schrieb Frank <rippo...@gmail.com>:

> Hi openpyxl-users,
>
> I recently upgraded from 2.0.4 to 2.2.4 and noticed a significant
> increase
> in memory usage when opening the same workbook (with load_workbook):
>
> Excel file = 35 MB
> 2.0.4, 32bit = 800 MB
> 2.0.4, 64bit = 1400 MB
> 2.2.4, 32bit > 1700 MB # python complains about memory and gives up
> 2.2.4, 64bit = 3100 MB

I've just done a comparison between 2.0 and 2.3 using the file supplied
with a bug:
https://bitbucket.org/openpyxl/openpyxl/issues/494/load_workbook-giving-valueerror-invalid

You'll need a checkout to read it.

2.2 64bit 2.9 GB
2.3 64bit 3.4 GB

It's a huge file (I can't even read it properly with Excel 2011 for Mac).
Memory use in 2.3 is higher but seems closer to the difference suggested
by looking at what Pympler says about the Cell class. The API is now more
or less how it should be and I'd love to be able to reduce memory use
(we're already using cells) if anyone has any suggestions for the 2.4
branch.

Charlie Clark

unread,
Jul 14, 2015, 4:28:42 PM7/14/15
to openpyx...@googlegroups.com
Am .07.2015, 12:30 Uhr, schrieb Charlie Clark
<charli...@clark-consulting.eu>:

> 2.2 64bit 2.9 GB
> 2.3 64bit 3.4 GB

Sorry, that's wrong. The first number is for 2.0.

Delving deeper into what's actually allocated

from pympler.muppy import print_summary
print_summary()

and you can see what's actually allocated and cells are occupying around
190 bytes each. You can see them in comparison with other built-in types.
That's pretty good. This did, however, help me trackdown an intended
optimisation that wasn't working as intended.

2.3 now using around 2.8 GB, though MacOS seems to think it's doing a good
job of compressing that.

types | # objects | total
size
=============================================== | =========== |
============
<class 'openpyxl.cell.cell.Cell | 5757408 | 966.36
MB
<class 'dict | 3416 | 362.36
MB
<class 'int | 6356498 | 169.74
MB
<class 'str | 1565499 | 153.08
MB
<class 'float | 800438 | 18.32
MB
<class 'code | 5445 | 765.94
KB
<class 'type | 716 | 693.28
KB
<class 'set | 577 | 340.72
KB
<class 'openpyxl.descriptors.MetaSerialisable | 231 | 221.80
KB
<class 'tuple | 2428 | 162.84
KB
<class 'weakref | 1656 | 129.38
KB
<class 'wrapper_descriptor | 1291 | 100.86
KB
<class 'list | 681 | 98.45
KB
<class 'method_descriptor | 1178 | 82.83
KB
function (__init__) | 519 | 68.93
KB

I make this around 1.75 GB in total. I don't know enough to say where the
discrepancy between what the OS reports and what muppy sees is coming from.

2.0 for the same file on the same machine.

types | # objects | total
size
================================================= | =========== |
============
<class 'openpyxl.cell.cell.Cell | 5677969 |
606.47 MB
<class 'dict | 2779 |
392.13 MB
<class 'str | 5828464 |
284.25 MB
<class 'int | 6762127 |
180.57 MB
class 'openpyxl.worksheet.dimensions.RowDimension | 484615 |
25.88 MB
<class 'float | 800423 |
18.32 MB
<class 'code | 5317 |
747.93 KB
<class 'type | 726 |
701.34 KB
<class 'set | 453 |
232.59 KB
<class 'list | 1653 |
184.86 KB
<class 'tuple | 1908 |
127.98 KB
<class 'weakref | 1471 |
114.92 KB
<class 'wrapper_descriptor | 1291 |
100.86 KB
<class 'method_descriptor | 1178 |
82.83 KB
<class 'getset_descriptor | 950 |
66.80 KB

That's around 165 MB more for 2.3, which isn't visible to the system.
Obviously, it would be nice to be able to reduce the memory footprint, but
I'm not that sure how feasible that is while maintaining the flexibility
of the API. Maybe we could make styleable cells a lazy property and use it
to switch between classes.

However, where memory is a concern and you want to be able to edit files,
the best solution is to chain tick-tock between read-only and write-only
modes.

wb1 = load_workbook("source.xlsx", read_only=True)
wb2 = Workbook(write_only=True)

for ws1 in wb1:
wb2.create_sheet(ws1.title)
for row in ws1:
# do whatever is required to the cells
ws2.append(row)

wb2.save("changed.xlsx")

Charlie Clark

unread,
Aug 16, 2015, 3:45:34 PM8/16/15
to openpyx...@googlegroups.com
Am .07.2015, 20:49 Uhr, schrieb Frank <rippo...@gmail.com>:

> Hi openpyxl-users,
>
> I recently upgraded from 2.0.4 to 2.2.4 and noticed a significant
> increase
> in memory usage when opening the same workbook (with load_workbook):
>
> Excel file = 35 MB
> 2.0.4, 32bit = 800 MB
> 2.0.4, 64bit = 1400 MB
> 2.2.4, 32bit > 1700 MB # python complains about memory and gives up
> 2.2.4, 64bit = 3100 MB
>
> I know there were a lot of changes between the two versions, but I wonder
> if anyone has any insight into this?
>
> I'm aware of the optimized writer and reader. The memory usage becomes an
> issue on the 32bit machine, since as far as I know I can't apply
> conditional formatting and use the optimized writer.
>
> I imagine this is working as intended, offering more features taking up
> more memory. Or maybe I'm just doing it wrong?

Some more on this. I've just done some work on memory optimisation and,
depending upon the file, I'm getting similar numbers in 2.3b2 as with
previous versions. On a simple file with just numbers memory use should be
less than 2.0. Using the memory_profiler module I get similar numbers to
what the OS (Mac OS 10.10.5, 64-bit) reports. 2.2 certainly does use the
most memory but I don't see anything like as much variance as you are
between versions.

2.2 and 2.3 have a lot more functionality and should, in general, be
faster than previous versions. I do have one file where 2.2 is turning out
to be by far the fastest and with the lowest memory footprint. This will
require more investigation but I suspect it may be related to the new
tokeniser expanding lots of shared formulae.

Some profiling obtained using the following script:

import time
from memory_profiler import memory_usage
from openpyxl import load_workbook

fname = "openpyxl/benchmarks/files/very_large.xlsx" # has 1 million cells

def pixel():
start = time.clock()
wb = load_workbook(fname)
end = time.clock()
print("openpyxl {:0.1f} s".format(end - start))
use = memory_usage(proc=-1, interval=1)[0]
print("Memory use %.1f MB" % use)
from pympler.muppy import print_summary
print_summary()


if __name__ == '__main__':
from openpyxl import __version__
print()
print(__version__)
pixel()

2.0.5
openpyxl 37.5 s
Memory use 447.1 MB
types | # objects | total size
================================= | =========== | ============
<class 'openpyxl.cell.cell.Cell | 1000000 | 106.81 MB
<class 'str | 1016954 | 50.63 MB
<class 'dict | 2196 | 49.91 MB
<class 'int | 1491119 | 39.82 MB
<class 'type | 716 | 682.81 KB
<class 'code | 4659 | 655.37 KB
<class '_io.BufferedWriter | 2 | 256.34 KB
<class 'collections.OrderedDict | 8 | 254.03 KB
<class 'set | 425 | 211.47 KB
<class 'list | 1607 | 168.67 KB
<class '_io.BufferedReader | 1 | 128.17 KB
<class 'tuple | 1905 | 127.47 KB
<class 'weakref | 1389 | 108.52 KB
<class 'wrapper_descriptor | 1275 | 99.61 KB
<class 'method_descriptor | 1171 | 82.34 KB


2.1.5
openpyxl 40.2 s
Memory use 510.6 MB
types | # objects |
total size
======================================================= | =========== |
============
<class 'openpyxl.cell.cell.Cell | 1000000 |
114.44 MB
<class 'str | 2017245 |
102.32 MB
<class 'dict | 4248
| 50.32 MB
<class 'int | 1491148
| 39.82 MB
<class 'type | 726 |
692.72 KB
<class 'code | 4780 |
672.40 KB
<class '_io.BufferedWriter | 2 |
256.34 KB
<class 'openpyxl.worksheet.dimensions.DimensionHolder | 1 |
244.98 KB
<class 'set | 439 |
226.53 KB
<class 'list | 1614 |
169.23 KB
<class 'tuple | 1953 |
130.26 KB
<class '_io.BufferedReader | 1 |
128.17 KB
<class 'weakref | 1382 |
107.97 KB
<class 'wrapper_descriptor | 1275
| 99.61 KB
<class 'method_descriptor | 1171
| 82.34 KB

2.2.6
openpyxl 38.1 s
Memory use 515.6 MB
types | # objects |
total size
======================================================= | =========== |
============
<class 'openpyxl.cell.cell.Cell | 1000000 |
175.48 MB
<class 'str | 1989515 |
101.00 MB
<class 'dict | 2339
| 49.91 MB
<class 'int | 1490160
| 39.80 MB
<class 'type | 729 |
695.42 KB
<class 'code | 4845 |
681.54 KB
<class '_io.BufferedWriter | 2 |
256.34 KB
<class 'openpyxl.worksheet.dimensions.DimensionHolder | 1 |
244.98 KB
<class 'set | 451 |
231.16 KB
<class 'tuple | 2006 |
133.63 KB
<class '_io.BufferedReader | 1 |
128.17 KB
<class 'openpyxl.worksheet.dimensions.ColumnDimension | 1001 |
125.12 KB
<class 'openpyxl.worksheet.dimensions.RowDimension | 1001 |
125.12 KB
<class 'weakref | 1400 |
109.38 KB
<class 'wrapper_descriptor | 1275
| 99.61 KB

2.3.0-b2
openpyxl 33.5 s
Memory use 399.8 MB
types | # objects | total
size
=============================================== | =========== |
============
<class 'openpyxl.cell.cell.Cell | 1000000 | 106.81
MB
<class 'dict | 3108 | 50.16
MB
<class 'int | 1490351 | 39.80
MB
<class 'str | 17566 | 1.82
MB
<class 'type | 709 | 676.48
KB
<class 'code | 4792 | 674.09
KB
<class 'set | 545 | 304.72
KB
<class '_io.BufferedWriter | 2 | 256.34
KB
<class 'openpyxl.descriptors.MetaSerialisable | 230 | 220.56
KB
<class 'tuple | 2425 | 162.29
KB
<class '_io.BufferedReader | 2 | 128.34
KB
<class 'weakref | 1576 | 123.12
KB
<class 'wrapper_descriptor | 1275 | 99.61
KB
<class 'method_descriptor | 1171 | 82.34
KB
<class 'list | 622 | 77.70
KB
Reply all
Reply to author
Forward
0 new messages