
Out of memory while reading excel file


Mahmood Naderan

May 10, 2017, 10:09:38 AM
Hello,

The following code, which uses openpyxl and numpy, fails to read a large Excel (xlsx) file. The file is 20 MB and contains 100K rows and 50 columns.



from openpyxl import load_workbook
import numpy as np

W = load_workbook(fname, read_only=True)   # fname is the path to the xlsx file
p = W.worksheets[0]
m = p.max_row
n = p.max_column

a = np.array([[i.value for i in j] for j in p.rows])



How can I fix that? I am stuck at this problem. For medium-sized files (16K rows and 50 columns) it works fine.

Regards,

Mahmood

Peter Otten

May 10, 2017, 10:52:06 AM
The docs at

https://openpyxl.readthedocs.io/en/default/optimized.html#read-only-mode

promise "(near) constant memory consumption" for the sample script below:

from openpyxl import load_workbook
wb = load_workbook(filename='large_file.xlsx', read_only=True)
ws = wb['big_data']

for row in ws.rows:
    for cell in row:
        print(cell.value)

If you change only the file and worksheet name to your needs -- does the
script run to completion in reasonable time (redirect stdout to /dev/null)
and with reasonable memory usage?

If it does you may be wasting memory elsewhere; otherwise you might need to
convert the xlsx file to csv using your spreadsheet application before
processing the data in Python.
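
A minimal sketch of one way to run that check, not from the original posts; the file and sheet names are placeholders, and tracemalloc only tracks allocations made through Python's own allocator, which does cover the cell objects openpyxl creates:

import tracemalloc
from openpyxl import load_workbook

tracemalloc.start()

wb = load_workbook(filename='large_file.xlsx', read_only=True)  # placeholder file name
ws = wb['big_data']                                             # placeholder sheet name

cells = 0
for row in ws.rows:           # rows are produced lazily in read-only mode
    for cell in row:
        _ = cell.value        # touch the value without keeping a reference
        cells += 1

current, peak = tracemalloc.get_traced_memory()
print(cells, "cells read, peak traced memory:", peak // 2**20, "MiB")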

Mahmood Naderan

May 10, 2017, 11:13:22 AM
Thanks for your reply. The openpyxl part (reading the workbook) works fine. I printed some debug information and found that when it reaches the np.array, after some 10 seconds, the memory usage goes high.


So, I think numpy is unable to manage the memory.



Regards,
Mahmood

Peter Otten

May 10, 2017, 11:48:52 AM
Mahmood Naderan via Python-list wrote:

> Thanks for your reply. The openpyxl part (reading the workbook) works
> fine. I printed some debug information and found that when it reaches the
> np.array, after some 10 seconds, the memory usage goes high.
>
>
> So, I think numpy is unable to manage the memory.

Hm, I think numpy is designed to manage huge arrays if you have enough RAM.

Anyway: are all values of the same type? Then the numpy array may be kept
much smaller than in the general case (I think). You can also avoid the
intermediate list of lists:

wb = load_workbook(filename='beta.xlsx', read_only=True)
ws = wb['alpha']

a = numpy.zeros((ws.max_row, ws.max_column), dtype=float)
for y, row in enumerate(ws.rows):
    a[y] = [cell.value for cell in row]


Mahmood Naderan

May 10, 2017, 1:18:48 PM
Well actually cells are treated as strings and not integer or float numbers.

One way to work around this is to get the number of rows, split them into 4 or 5 arrays, and process them separately. However, I was looking for a better solution.

I have read that large Excel files are on the order of a million rows; mine is about 100K. Currently, the task manager shows about 4 GB of RAM usage while working with numpy.

Regards,
Mahmood
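
A rough sketch of the chunking idea mentioned above, not code from the thread; the chunk size, the file name and the process() handler are placeholders for whatever the downstream step actually does:

import numpy as np
from openpyxl import load_workbook

CHUNK = 20000                      # rows per block (placeholder)

def process(block):
    print(block.shape)             # placeholder for the real per-chunk work

wb = load_workbook('large_file.xlsx', read_only=True)   # placeholder file name
ws = wb.worksheets[0]

rows = []
for row in ws.rows:
    rows.append([cell.value for cell in row])
    if len(rows) == CHUNK:
        process(np.array(rows, dtype=object))
        rows = []
if rows:                           # last, partially filled chunk
    process(np.array(rows, dtype=object))

Only one chunk is held in memory at a time, which is the point of the exercise.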


Irmen de Jong

May 10, 2017, 1:26:30 PM
On 10-5-2017 17:12, Mahmood Naderan wrote:

> So, I think numpy is unable to manage the memory.

That assumption is very likely to be incorrect.


>> np.array([[i.value for i in j] for j in p.rows])

I think the problem is in the way you feed your Excel data into the numpy array
constructor. The code above builds many redundant Python lists from the data you already
have in memory, before even calling the numpy array function.

I strongly suggest finding a proven piece of code to read large excel files like the
example from Peter Otten's reply.

Irmen

Peter Otten

May 10, 2017, 2:31:20 PM
Mahmood Naderan via Python-list wrote:

> Well actually cells are treated as strings and not integer or float
> numbers.

May I ask why you are using numpy when you are dealing with strings? If you
provide a few details about what you are trying to achieve someone may be
able to suggest a workable approach.

Back-of-the-envelope considerations: 4GB / 5E6 cells amounts to

>>> 2**32 / (100000 * 50)
858.9934592

about 850 bytes per cell, with an overhead of

>>> sys.getsizeof("")
49

that would leave room for about 800 ASCII chars per cell, or down to 200 chars in the
worst case (4 bytes per character). If your strings are much smaller, the problem lies elsewhere.
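
A quick way to test that hypothesis, sketched here with the same placeholder file name as before; sys.getsizeof reports the size of each str object including the per-object overhead:

import sys
from openpyxl import load_workbook

wb = load_workbook('large_file.xlsx', read_only=True)   # placeholder file name
ws = wb.worksheets[0]

total = count = 0
for row in ws.rows:
    for cell in row:
        if isinstance(cell.value, str):
            total += sys.getsizeof(cell.value)
            count += 1

print("string cells:", count)
print("average bytes per string cell:", total / count if count else 0)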

Mahmood Naderan

May 10, 2017, 4:30:35 PM
Hi
I will try your code... Meanwhile, I have to say that, as you pointed out earlier and as stated in the documentation, numpy is designed to handle large arrays, and that is the reason I chose it. If there is a better option, please let me know.




Regards,
Mahmood


Mahmood Naderan

May 11, 2017, 1:21:23 AM
Hi,
I am confused by that. If numpy is not suitable for my case and may have a large overhead, what is the alternative? Or do you mean that numpy is a good choice here and we can reduce its overhead?


Regards,
Mahmood

Mahmood Naderan

May 11, 2017, 1:27:35 AM
> a = numpy.zeros((ws.max_row, ws.max_column), dtype=float)
> for y, row in enumerate(ws.rows):
>     a[y] = [cell.value for cell in row]



Peter,

When I used this code, it gave me an error that it cannot convert a string to float for the first cell. All cells are strings.


Regards,
Mahmood

Mahmood Naderan

May 11, 2017, 2:04:23 AM
Hi,
I used an old-fashioned coding style to create a matrix and read/add the cells.

W = load_workbook(fname, read_only = True)
p = W.worksheets[0]
m = p.max_row
n = p.max_column
arr = np.empty((m, n), dtype=object)
for r in range(1, m):
    for c in range(1, n):
        d = p.cell(row=r, column=c)
        arr[r, c] = d.value


However, the operation is very slow. I printed the row number to see how things were going. It took 2 minutes to add 200 rows and about 10 minutes to add the next 200 rows.

Regards,
Mahmood

Peter Otten

May 11, 2017, 2:47:00 AM
Mahmood Naderan via Python-list wrote:

For string values you have to adapt the dtype:

a = numpy.empty((ws.max_row, ws.max_column), dtype=object)
for y, row in enumerate(ws.rows):
    a[y] = [cell.value for cell in row]

If that completes and is fast enough, fine.

But again, for non-numeric data numpy doesn't make much sense IMHO -- if you
tell us what you're up to we may be able to suggest a better approach.

Mahmood Naderan

May 11, 2017, 3:25:00 AM
I wrote this:

a = np.zeros((p.max_row, p.max_column), dtype=object)
for y, row in enumerate(p.rows):
    for cell in row:
        print(cell.value)
        a[y] = cell.value
        print(a[y])


For one of the cells, I see

NM_198576.3
['NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3']


There are 50 copies of NM_198576.3 in a[y], and 50 is the number of columns in my Excel file (p.max_column).



The Excel file looks like

CHR1 11,202,100 NM_198576.3 PASS 3.08932 G|B|C - . . .



Note that in each row, some cells are just '-' or '.'. I want to read all cells as strings. Then I will write the matrix to a file and my main code (Java) will process that. I chose openpyxl for reading Excel files because Apache POI (a Java package for manipulating Excel files) consumes a huge amount of memory even for medium-sized files.

So my Python script only transforms an xlsx file into a txt file, keeping the cell positions and formats.

Any suggestion?

Regards,
Mahmood

Peter Otten

May 11, 2017, 3:53:36 AM
Mahmood Naderan via Python-list wrote:

> I wrote this:
>
> a = np.zeros((p.max_row, p.max_column), dtype=object)
> for y, row in enumerate(p.rows):
>     for cell in row:
>         print(cell.value)
>         a[y] = cell.value

In the line above you overwrite the row in the numpy array with the cell
value. In combination with numpy's "broadcasting" you end up with all values
in a row set to the rightmost cell in the spreadsheet row, just like in

>>> import numpy
>>> a = numpy.array([[0, 0, 0]])
>>> a
array([[0, 0, 0]])
>>> for x in 1, 2, 3:
...     a[0] = x
...
>>> a
array([[3, 3, 3]])


The correct code:

for y, row in enumerate(ws.rows):
    a[y] = [cell.value for cell in row]

I think I posted it before ;)

> So my Python script only transforms an xlsx file into a txt file, keeping
> the cell positions and formats.

What kind of text file?

> Any suggestion?

In that case there's no need to load the data into memory. For example, to
convert xlsx to csv:

#!/usr/bin/env python3
from openpyxl import load_workbook
import csv

source = "beta.xlsx"
dest = "gamma.csv"
sheet = 'alpha'

wb = load_workbook(filename=source, read_only=True)
ws = wb[sheet]

with open(dest, "w") as outstream:
csv.writer(outstream).writerows(
[cell.value for cell in row]
for row in ws.rows
)


Mahmood Naderan

May 11, 2017, 4:51:15 AM
Thanks. That code is simple and it works. However, there are things to consider. With the CSV format, cells in a row are separated by ',' and for some cells it writes "" around the cell content.

So, if the excel looks like


CHR1 11,232,445


The output file looks like

CHR1,"11,232,445"


Is it possible to use <space> as the delimiting character and omit the ""? I ask because my Java code, which has to read the output file, would otherwise have to do some extra work (using space as the delimiter is the default and much easier to work with). I want

a[0][0] = CHR1
a[0][1] = 11,232,445

And both are strings. Is that possible?

Regards,
Mahmood

Mahmood Naderan

May 11, 2017, 5:01:57 AM
Excuse me, I changed

csv.writer(outstream)

to

csv.writer(outstream, delimiter=' ')


It puts space between cells and omits "" around some content. However, between two lines there is an extra empty line. In other words, the first line is the first row of the Excel file, the second line is empty ("\n"), and the third line is the second row of the Excel file.

Any thought?

Regards,
Mahmood

Peter Otten

May 11, 2017, 6:43:13 AM
Mahmood Naderan via Python-list wrote:

> Excuse me, I changed
>
> csv.writer(outstream)
>
> to
>
> csv.writer(outstream, delimiter=' ')
>
>
> It puts space between cells and omits "" around some content.

If your data doesn't contain any spaces, that's fine. Otherwise you need a way
to distinguish between a space as a delimiter and a space inside a field, e.g.
by escaping it:

>>> w = csv.writer(sys.stdout, delimiter=" ", quoting=csv.QUOTE_NONE,
...     escapechar="\\")
>>> w.writerow(["a", "b c"])
a b\ c
8

> However,
> between two lines there is an extra empty line. In other words, the first line
> is the first row of the Excel file, the second line is empty ("\n"), and the
> third line is the second row of the Excel file.
>
> Any thought?

In text mode Windows translates "\n" to b"\r\n" in the file. Python allows
you to override that:

>>> help(open)
Help on built-in function open in module io:

open(...)
    open(file, mode='r', buffering=-1, encoding=None, errors=None,
         newline=None, closefd=True, opener=None) -> file object

<snip>

newline controls how universal newlines works (it only applies to text
mode). It can be None, '', '\n', '\r', and '\r\n'. It works as
follows:

<snip>

* On output, if newline is None, any '\n' characters written are
translated to the system default line separator, os.linesep. If
newline is '' or '\n', no translation takes place. If newline is any
of the other legal values, any '\n' characters written are translated
to the given string.

So you need to specify newlines:

with open(dest, "w", newline="") as outstream:
...
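
Putting the pieces from this thread together, a sketch of the whole conversion with a space delimiter, no quoting, and newline translation disabled; the file and sheet names are placeholders:

#!/usr/bin/env python3
import csv
from openpyxl import load_workbook

source = "beta.xlsx"    # placeholder
dest = "gamma.txt"      # placeholder
sheet = "alpha"         # placeholder

wb = load_workbook(filename=source, read_only=True)
ws = wb[sheet]

with open(dest, "w", newline="") as outstream:   # newline="" avoids the extra blank lines
    writer = csv.writer(outstream, delimiter=" ",
                        quoting=csv.QUOTE_NONE, escapechar="\\")
    writer.writerows(
        [cell.value for cell in row]
        for row in ws.rows
    )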


Mahmood Naderan

May 11, 2017, 11:32:36 AM
Thanks a lot for suggestions. It is now solved.


Regards,
Mahmood

Pavol Lisy

May 12, 2017, 3:14:08 PM
But the lineterminator parameter
(https://docs.python.org/3.6/library/csv.html#csv.Dialect.lineterminator)
defaults to \r\n on Linux too!

import csv, io

b = io.StringIO()
w = csv.writer(b)
w.writerows([["a", "b c"], ['a', 'b,c']])
b.getvalue() # 'a,b c\r\na,"b,c"\r\n'

b = io.StringIO()
w = csv.writer(b, lineterminator='\n')
w.writerows([["a", "b c"], ['a', 'b,c']])
b.getvalue() # 'a,b c\na,"b,c"\n'

PL.

Peter Otten

May 12, 2017, 4:03:43 PM
Pavol Lisy wrote:

> On 5/11/17, Peter Otten <__pet...@web.de> wrote:
>> Mahmood Naderan via Python-list wrote:

>>> between two lines there is an extra empty line. In other words, the first
>>> line is the first row of the Excel file, the second line is empty ("\n"),
>>> and the third line is the second row of the Excel file.
>>>
>>> Any thought?
>>
>> In text mode Windows translates "\n" to b"\r\n" in the file. Python
>> allows you to override that:

>> So you need to specify newlines:
>>
>> with open(dest, "w", newline="") as outstream:
>> ...
>>
>
> But the lineterminator parameter
> (https://docs.python.org/3.6/library/csv.html#csv.Dialect.lineterminator)
> defaults to \r\n on Linux too!
>
> b = io.StringIO()
> w = csv.writer(b)
> w.writerows([["a", "b c"], ['a', 'b,c']])
> b.getvalue() # 'a,b c\r\na,"b,c"\r\n'

I don't have a Windows system to test, but doesn't that mean that on Windows

with open("tmp.csv", "w") as f:
csv.writer(f).writerows([["one"], ["two"]])
with open("tmp.csv", "rb") as f:
print(f.read())

would produce

b"one\r\r\ntwo\r\r\n"

? How is that avoided?

eryk sun

May 12, 2017, 4:45:53 PM
On Fri, May 12, 2017 at 8:03 PM, Peter Otten <__pet...@web.de> wrote:
> I don't have a Windows system to test, but doesn't that mean that on Windows
>
> with open("tmp.csv", "w") as f:
> csv.writer(f).writerows([["one"], ["two"]])
> with open("tmp.csv", "rb") as f:
> print(f.read())
>
> would produce
>
> b"one\r\r\ntwo\r\r\n"
>
> ? How is that avoided?

Python 3 doesn't use the platform's standard I/O implementation.
However, it tries to be consistent with the platform by using
os.linesep as the default for translating newlines. That's not
necessarily compatible with the csv module, so it requires disabling
newline translation by passing newline="" to open(). Otherwise it does
cause the problem that you suppose it would, which is documented:

https://docs.python.org/3/library/csv.html#csv.writer

codew...@gmail.com

May 12, 2017, 6:48:36 PM
Try opening the destination file in binary mode:

open(dest, 'wb')

I ran into extra newlines when using csv.writerows() recently.
Since the default mode for open() is text, I imagine you get the
extra newlines because both the csv module and the file object are adding them.
Switching to binary mode fixed it for me.

Regards,
Igor.