Efficient chunked reading of large files

David Orme

Jan 21, 2022, 1:22:20 PM
to openpyxl-users
Hi,

I have a package that checks the format of data stored in an Excel file. This includes scanning the contents of fields in data frames to check that they are compatible with provided field metadata. So my basic flow is to create Field objects and then pass chunks of data to be validated. I've deliberately taken a chunked approach to avoid having to load all rows at once for larger files (e.g. 480872 rows).

What I'm trying to figure out from the openpyxl source is the most efficient way of doing this. I'm using ReadOnlyWorkbooks, so data is only loaded as required, and I was looking to see if there was a way to read in a block of a given number of rows. I thought that might be more efficient, but it looks like there is only _iter_rows, in which case, to keep memory usage down, should I just iterate over a chunk of rows to ingest the data, validate it and then do the next chunk?

Is that right, or is there a secret efficient method to read a set block?

Thanks,
David

Charlie Clark

Jan 22, 2022, 11:35:00 AM
to openpyxl-users

I would think ws.iter_rows(min_row=…, max_row=…, min_col=…, max_col=…, values_only=True) is all you need. This will read the rows directly from the XML and return only the data in tuples. Even for a huge worksheet this will not require a lot of memory. As it's a generator, you can easily wrap it to give you as many rows as you need – this is preferable to looping over the worksheet multiple times – but you'll have the overhead of calling the iterator.
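
For example, something along these lines should work – a minimal, untested sketch, where the file name and chunk size are placeholders and validate() stands in for your own checking code:

    from itertools import islice

    from openpyxl import load_workbook

    def iter_row_chunks(ws, chunk_size, **kwargs):
        # Wrap the row generator so each iteration yields a list
        # of up to chunk_size row tuples.
        rows = ws.iter_rows(values_only=True, **kwargs)
        while True:
            chunk = list(islice(rows, chunk_size))
            if not chunk:
                return
            yield chunk

    wb = load_workbook("big_file.xlsx", read_only=True)  # placeholder name
    ws = wb.active
    for chunk in iter_row_chunks(ws, 1000):
        validate(chunk)  # your validation step goes here
    wb.close()

Only one chunk of rows is held in memory at a time, and the worksheet is read through exactly once.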

How you do it depends on how much memory you have: what are your limits?

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
D-40489 Düsseldorf
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226

David Orme

Jan 24, 2022, 6:39:11 AM
to openpyxl-users
Hi Charlie,

That's great, thanks. It seemed like creating the generator and then wrapping it to create chunks of lists of rows was going to be the canonical way to do it, but I just wanted to check I wasn't missing anything.

At the moment, it mostly runs on desktop machines, but it also needs to be able to run behind small webserver instances with, for example, 1 or 2 GB RAM.

Cheers,
David

Charlie Clark

Jan 24, 2022, 6:49:46 AM
to openpyxl-users
On 24 Jan 2022, at 12:39, David Orme wrote:

> Hi Charlie,
>
> That's great, thanks. It seemed like creating the generator and then
> wrapping it to create chunks of lists of rows was going to be the
> canonical way to do it, but I just wanted to check I wasn't missing
> anything.
>
> At the moment, it mostly runs on desktop machines, but it also needs
> to be able to run behind small webserver instances with, for example,
> 1 or 2 GB RAM.

You'll get a lot into 1 GB of RAM, but run your own tests to be sure; contention could be an issue there. Find out what you can cope with comfortably and work with that.
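
If you want to measure it yourself, a rough sketch using the standard library's tracemalloc module (the file name is a placeholder, and note this only counts allocations made by Python itself, not the interpreter's baseline footprint):

    import tracemalloc

    from openpyxl import load_workbook

    tracemalloc.start()
    wb = load_workbook("big_file.xlsx", read_only=True)
    ws = wb.active
    for row in ws.iter_rows(values_only=True):
        pass  # replace with your validation
    current, peak = tracemalloc.get_traced_memory()
    print(f"peak: {peak / 1024 ** 2:.1f} MiB")
    wb.close()
    tracemalloc.stop()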