RAM memory shortage?


tzanou...@hotmail.com

Nov 9, 2015, 5:31:09 PM
to openpyxl-users
Hi there,

I've recently installed openpyxl and tried the optimized reader using the following code:

def xlsx2(fname):
    from openpyxl import load_workbook
    wb = load_workbook(fname, read_only=True)
    ws = wb['big_data']
    for row in ws.rows:
        for cell in row:
            print cell.value

But my system runs out of memory and crashes. Is it because my memory is not enough or a faulty installation? Can someone give me a hint or, even better, suggest a solution?
The file I'm reading: 92 MB, ~480,000 rows
Specs: Python 2.7.3, openpyxl 2.3.0, RAM 4 GB

Thank you very much!

Charlie Clark

Nov 9, 2015, 6:14:57 PM
to openpyx...@googlegroups.com
On 9.11.2015, 23:31, <tzanou...@hotmail.com> wrote:

> But my system runs out of memory and crashes. Is it because my memory is
> not enough or a faulty installation? Can someone give me a hint or, even
> better, suggest a solution?
> The file I'm reading: 92 MB, ~480,000 rows
> Specs: Python 2.7.3, openpyxl 2.3.0, RAM 4 GB

Read-only mode should only be about 30 MB and stay at that unless you keep
things in memory explicitly. The same file in standard mode can be
expected to take about 4 GB.

Any chance you can provide the file? 480,000 rows should be nowhere near
92MB so you must have something special going on. Possibly lots and lots
of distinct strings.

In the meantime: get on a 64-bit system with more RAM.

Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226

tzanou...@gmail.com

Nov 10, 2015, 10:30:34 AM
to openpyxl-users

> Read-only mode should only be about 30 MB and stay at that unless you keep
> things in memory explicitly. The same file in standard mode can be
> expected to take about 4 GB.
>
> Any chance you can provide the file? 480,000 rows should be nowhere near
> 92MB so you must have something special going on. Possibly lots and lots
> of distinct strings.
>
> In the meantime: get on a 64-bit system with more RAM.

Unfortunately that seems to be the issue: the .xlsx contains lots of strings. I've tried a smaller version of that file and it works within the limits of my RAM.
So I guess I'm going to get some more RAM.

Thanks for your response

Charlie Clark

Nov 10, 2015, 11:44:31 AM
to openpyx...@googlegroups.com
On 10.11.2015, 16:30, <tzanou...@gmail.com> wrote:

> Unfortunately that seems to be the issue: the .xlsx contains lots of
> strings. I've tried a smaller version of that file and it works
> within the limits of my RAM.
> So I guess I'm going to get some more RAM.

We're not done yet. I'm not exactly sure when you're running out of
memory: merely parsing the shared strings or when populating the
dictionary of them.

Can you check the size of the sharedStrings.xml file in your archive? This
will give us a guide to memory use. Because it's never been a problem in
the past we load this file into memory and keep it there while we populate
the lookup dictionary (technically a list). We could switch to using an
iterative parser so we'd only need to worry about the size of the lookup
table. This could still get pretty big – duplicates are unfortunately
allowed – but should still be significantly smaller than the equivalent
XML.
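
For what it's worth, the iterative approach can be sketched with the stdlib's xml.etree.ElementTree.iterparse. This is not openpyxl's actual implementation, just an illustration of how the full tree can be avoided; the spreadsheetML namespace and the <si>/<t> layout follow the OOXML file format, and richly formatted strings (multiple <r> runs per <si>) are ignored here for brevity:

```python
# Sketch of an iterative sharedStrings parser using stdlib iterparse.
import io
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def iter_shared_strings(source):
    """Yield each shared string without holding the whole tree in memory."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "si":
            # A plain <si> keeps its text in a single <t> child.
            t = elem.find(NS + "t")
            yield t.text if t is not None and t.text else ""
            elem.clear()  # free the element once its text is extracted

# Tiny inline example of the sharedStrings.xml layout:
xml = (b'<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
       b'<si><t>hello</t></si><si><t>world</t></si></sst>')
strings = list(iter_shared_strings(io.BytesIO(xml)))
```

With this shape, peak memory is bounded by the lookup list being built, not by the XML document itself.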

tzanou...@gmail.com

Nov 12, 2015, 11:00:10 AM
to openpyxl-users

> We're not done yet. I'm not exactly sure when you're running out of
> memory: merely parsing the shared strings or when populating the
> dictionary of them.
>
> Can you check the size of the sharedStrings.xml file in your archive? This
> will give us a guide to memory use. Because it's never been a problem in
> the past we load this file into memory and keep it there while we populate
> the lookup dictionary (technically a list). We could switch to using an
> iterative parser so we'd only need to worry about the size of the lookup
> table. This could still get pretty big – duplicates are unfortunately
> allowed – but should still be significantly smaller than the equivalent
> XML.

I've just seen your reply; could you elaborate on how to check the size of the sharedStrings?
I've also encountered another problem: I tried the same code on Windows and this is the output:
Traceback (most recent call last):
  File "C:\Python27\APP-related-wikis-url_associationv2.py", line 135, in <module>
    main("hpdepth0-8-1-2015.xlsx", "homepages.txt", "Wiki-RelatedIds.csv")
  File "C:\Python27\APP-related-wikis-url_associationv2.py", line 92, in main
    wb = load_workbook(fxlsx, read_only=True) #load xlsx file
  File "C:\Python27\lib\openpyxl\reader\excel.py", line 191, in load_workbook
    shared_strings = read_string_table(archive.read(strings_path))
  File "C:\Python27\lib\openpyxl\reader\strings.py", line 16, in read_string_table
    root = fromstring(text=xml_source)
  File "lxml.etree.pyx", line 3092, in lxml.etree.fromstring (src\lxml\lxml.etree.c:70473)
  File "parser.pxi", line 1828, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:106307)
  File "parser.pxi", line 1716, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:105098)
  File "parser.pxi", line 1086, in lxml.etree._BaseParser._parseDoc (src\lxml\lxml.etree.c:99780)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94254)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95690)
  File "parser.pxi", line 620, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94757)
XMLSyntaxError: internal error: Huge input lookup, line 2, column 363349999

EDIT: Hm, I've tested again with half the lines (~200,000) and it's working on Windows too.

Charlie Clark

Nov 12, 2015, 11:14:32 AM
to openpyx...@googlegroups.com
On 12.11.2015, 17:00, <tzanou...@gmail.com> wrote:

> I've just seen your reply; could you elaborate on how to check the size of
> the sharedStrings?
> I've also encountered another problem: I tried the same code on Windows and
> this is the output:

An Excel file is just a zip file so you can just look at the archive ToC:

$ unzip -l hpdepth0-8-1-2015.xlsx

Archive:  hpdepth0-8-1-2015.xlsx
  Length      Date    Time    Name
---------  ---------- -----   ----
      240  01-01-1980 00:00   xl/sharedStrings.xml

This is the file where all the strings are stored. It's fairly meaningless
on its own (essentially just a long list of strings) without the worksheets,
but it would be nice to have it to work with for memory tests.

The error message you're seeing is lxml-specific. It *might* disappear if
you remove lxml but the best solution is going to be switching to
iterparse. We've avoided this so far because we haven't come across
situations with lots and lots of text. Excel is spectacularly unsuited to
storing lots of text but that isn't going to stop people using it for that
purpose.

Charlie Clark

Nov 14, 2015, 11:59:17 AM
to openpyx...@googlegroups.com
On 9.11.2015, 23:31, <tzanou...@hotmail.com> wrote:

> But my system runs out of memory and crashes. Is it because my memory is
> not enough or a faulty installation? Can someone give me a hint or, even
> better, suggest a solution?
> The file I'm reading: 92 MB, ~480,000 rows
> Specs: Python 2.7.3, openpyxl 2.3.0, RAM 4 GB

In the absence of a genuine test file I've cooked up the following example:

from openpyxl import Workbook

wb = Workbook(write_only=True)
ws = wb.create_sheet()  # write-only workbooks start with no sheets

for r in range(10000):
    row = [str(c + r*1000) for c in range(1000)]
    ws.append(row)

wb.save("big.xlsx")

This will produce a file of around 70 MB, crucially with a sharedStrings.xml
of around 230 MB. In memory this is around 3.9 GB (lxml) or 2.8 GB (no
lxml) on my system when it is read by an XML parser, so it should provoke
an out-of-memory error on your system. It's worth noting that even if this
is replaced by an iterative parser you'll still need around 2 GB for the
lookup table.
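
That lookup-table figure is easy to sanity-check: in a Python list of strings, the per-string object overhead dominates the payload for short strings. A rough stdlib measurement (exact numbers vary between interpreter versions and string contents, so treat the projection as an order-of-magnitude estimate):

```python
# Rough per-entry cost of a lookup list of ~10 million short strings.
import sys

sample = [str(n) for n in range(100000)]  # 100k stand-in strings

# Average object size of one string plus one list slot (a pointer).
per_string = sum(sys.getsizeof(s) for s in sample) / len(sample)
per_slot = sys.getsizeof(sample) / len(sample)
bytes_per_entry = per_string + per_slot

# Projected total for a 10-million-entry shared-strings table.
projected_gb = bytes_per_entry * 10 * 10**6 / 1024**3
```

Even for tiny strings the per-entry cost comes to several dozen bytes, so a table of millions of longer, distinct strings lands comfortably in the gigabyte range.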

I've switched the strings parser to use an iterative parser. You might
want to check out the 2.3 branch and try that.