Reading an Excel .xlsx file is extremely slow with openpyxl


Deac-33 Lancaster

May 6, 2020, 4:59:53 AM
to python-excel
Like many folks I need to read both .xls files (I call them S files, using xlrd) and .xlsx files (the X files, using openpyxl). 
In my case I read Excel files of about 30,000 rows and just copy all data read to a .csv file, no other processing so just Input/Output.

But the X file operations are much, much slower: reading a 30,000-row .xlsx file now takes 2 minutes, compared to
1/2 second for .xls with xlrd.

Is openpyxl that much slower or do I need to do something, like release some resource at the end of each row?

BTW, I have made several great improvements by using read_only=True and reading a row at a time instead of cell by cell, as shown in the following code segment. Thanks to blog.davep.org:
	https://blog.davep.org/2018/06/02/a_little_speed_issue_with_openpyxl.html

```
import openpyxl

wb = openpyxl.load_workbook("excel_file.xlsx", data_only=True, read_only=True)
sheet = wb.active
for row in sheet.rows:
    for cell in row:
        cell_from_excel = cell.value
```

Charlie Clark

May 6, 2020, 5:02:27 AM
to python-excel
On 2 May 2020, at 22:31, Deac-33 Lancaster wrote:

Using different e-mail addresses and reposting the same e-mail to a
moderated list will endear you to no one.

> Like many folks I need to read both .xls files (I call them S files,
> using xlrd) and .xlsx files (the X files, using openpyxl).
>
> In my case I read Excel files of about 30,000 rows and just copy all
> data read to a .csv file, no other processing so just Input/Output.
>
> But the X file operations are much, much slower: reading a 30,000-row
> .xlsx file now takes 2 minutes, compared to
> 1/2 second for .xls with xlrd.

Everything you need to know is covered by the openpyxl documentation:

https://openpyxl.readthedocs.io/en/stable/performance.html#read-performance

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226

Deac-33 Lancaster

May 7, 2020, 3:52:18 AM
to python-excel
Charlie,

Thanks much for the comment and link.  I guess I’m not smart enough to understand from the link what I need to know.  :-)   

I see how performance improves some, from 66 sec to 55 sec, with later versions of Python and openpyxl, but it’s still about a minute for their example, slow compared to xlrd.

I failed to say that I’m using Python 3.8 and openpyxl 3.0.3.

OptimizationData takes 20 seconds, but I don’t know what I should do with that.

Should I use other load_workbook() options, in addition to data_only=True and read_only=True, to improve my performance?

-deac
 😎

Charlie Clark

May 7, 2020, 4:58:47 AM
to python-excel
On 6 May 2020, at 20:50, Deac-33 Lancaster wrote:

> Charlie,
>
> Thanks much for the comment and link. I guess I’m not smart enough to
> understand from the link what I need to know. :-)
>
> I see how performance improves some, from 66 sec to 55 sec, with later
> versions of Python and openpyxl, but it’s still about a minute for their
> example, slow compared to xlrd.

Then there is additional stuff in the file, I assume. The last time I
benchmarked against xlrd, openpyxl was faster in read-only, values-only
mode. This is somewhat to be expected, because xlrd is no longer being
maintained and so can't take advantage of any new performance features in
Python.

Charlie Clark

May 7, 2020, 11:52:07 AM
to python-excel
On 7 May 2020, at 10:58, Charlie Clark wrote:

> Then there is additional stuff in the file, I assume. The last time I
> benchmarked against xlrd, openpyxl was faster in read-only, values-only
> mode. This is somewhat to be expected, because xlrd is no longer being
> maintained and so can't take advantage of any new performance features in
> Python.

Just ran the benchmarks with Python 3.8 on my MacBook Pro 2015.

Versions:
python: 3.8.2
xlrd: 1.2.0
openpyxl: 3.0.4

xlrd
Workbook loaded 58.57s
OptimizationData 2.59s
Output Model 0.00s
>>DATA>> 0.00s
Store days 0% 1.99s
Store days 100% 1.82s
Total time 64.97s

openpyxl, read-only, values only
Workbook loaded 0.96s
OptimizationData 24.17s
Output Model 0.00s
>>DATA>> 0.00s
Store days 0% 23.32s
Store days 100% 17.95s
Total time 66.41s
0 cells in total

Parallelised Read
Workbook loaded 1.25s
Output Model 1.16s
>>DATA>> 1.13s
OptimizationData 25.65s
Store days 0% 26.22s
Store days 100% 18.51s
Total time 46.00s

Depending upon your hardware, parallelising can really give you a boost
and would allow you to stream worksheets to CSV in parallel.
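For example, something along these lines (an untested sketch: one worker process per file, active sheet only, and the glob pattern and names are made up):

```
# Sketch: convert every .xlsx in the current directory to .csv,
# one worker process per file.
import csv
import glob
from multiprocessing import Pool

from openpyxl import load_workbook

def xlsx_to_csv(path):
    wb = load_workbook(path, read_only=True, data_only=True)
    ws = wb.active
    with open(path[:-len(".xlsx")] + ".csv", "w", newline="") as out:
        writer = csv.writer(out)
        for row in ws.iter_rows(values_only=True):
            writer.writerow(row)
    wb.close()  # read-only workbooks keep the file handle open until closed
    return path

if __name__ == "__main__":
    with Pool() as pool:
        for done in pool.imap_unordered(xlsx_to_csv, glob.glob("*.xlsx")):
            print("converted", done)
```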

Deac-33 Lancaster

May 7, 2020, 2:18:33 PM
to python-excel
Wow, thanks for all of the work, Charlie.  Looks like parallelising would be the way to go.
Many thanks, -deac33

Deac-33 Lancaster

May 7, 2020, 7:38:33 PM
to python-excel
Charlie,

Looks like your .xlsx file (or files) went from 66 sec to 46 sec. How many files was that, and how many rows in each, more or less?

I'm processing one file at a time, about 33,000 rows each, in each directory, and though I could probably figure out how to parallelize it a bit, I'm not sure it's worth the effort.

It just seems that being 120 times slower for openpyxl on xlsx than for xlrd on xls on comparably sized files suggests I'm doing something else wrong.

thanks for your patience,
-deac33

Charlie Clark

Jun 25, 2020, 9:23:58 AM
to python-excel

On 2 May 2020, at 22:31, Deac-33 Lancaster wrote:

> Like many folks I need to read both .xls files (I call them S files, using xlrd) and .xlsx files (the X files, using openpyxl).
>
> In my case I read Excel files of about 30,000 rows and just copy all data read to a .csv file, no other processing so just Input/Output.
>
> But the X file operations are much, much slower: reading a 30,000-row .xlsx file now takes 2 minutes, compared to
> 1/2 second for .xls with xlrd.
>
> Is openpyxl that much slower or do I need to do something, like release some resource at the end of each row?

Seeing as xlrd can also read XLSX files, you can test with that as well. Basically, openpyxl and xlrd have similar parsing performance because they use the same XML libraries for reading. openpyxl includes some performance benchmarks on a real world file:

https://openpyxl.readthedocs.io/en/latest/performance.html#read-performance

Processing XML is much slower than processing plaintext or binary formats, so a comparison between XLSX and XLS is not really useful.

But openpyxl can also read much more of the OOXML specification, such as images and charts, but also pivot tables and external links. It's a much bigger library, so it can take longer to load.

Your sample file can be loaded and read in < 1s on my 2015 MacBook Pro, so I reckon the problem has nothing to do with parsing speed.

John Yeung

Jun 25, 2020, 5:02:21 PM
to python-excel
On Thu, May 7, 2020 at 7:38 PM Deac-33 Lancaster <dea...@gmail.com> wrote:
>
> It just seems that being 120 times slower for openpyxl on xlsx than for xlrd on xls on comparably sized files suggests I'm doing something else wrong.

The first thing is: If you were already using xlrd, then why introduce
openpyxl at all? Whatever code you are using to read .xls files can be
used *verbatim* for reading .xlsx files. Literally no change
whatsoever, except possibly whatever you need to allow both types of
files, if you've artificially limited it to ending with '.xls'. Or, if
you haven't updated xlrd in many years (maybe 8 years or so), then it
is time to do that.

Why don't you try that first, and report on the performance you get
from using xlrd for both types of files?
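Something like this (a rough sketch; the helper function is just for illustration):

```
import xlrd

def read_rows(path):
    # xlrd 1.2.0 opens both .xls and .xlsx with the same call.
    book = xlrd.open_workbook(path)
    sheet = book.sheet_by_index(0)
    for rownum in range(sheet.nrows):
        yield sheet.row_values(rownum)
```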

John Y.

raf

Jun 25, 2020, 7:44:09 PM
to python...@googlegroups.com
John Yeung wrote:

> On Thu, May 7, 2020 at 7:38 PM Deac-33 Lancaster <dea...@gmail.com> wrote:
> >
> > It just seems that being 120 times slower for openpyxl on xlsx
> > than for xlrd on xls on comparably sized files suggests I'm doing
> > something else wrong.
>
> The first thing is: If you were already using xlrd, then why introduce
> openpyxl at all? Whatever code you are using to read .xls files can be
> used *verbatim* for reading .xlsx files. Literally no change
> whatsoever, except possibly whatever you need to allow both types of
> files, if you've artificially limited it to ending with '.xls'. Or, if
> you haven't updated xlrd in many years (maybe 8 years or so), then it
> is time to do that.

That's not true if you need to know how many digits to round numbers to
so as to match the displayed value of numeric cells. xlrd doesn't support
obtaining the necessary formatting information for .xlsx files.

But if the number of decimal places isn't important, then you are right.

> Why don't you try that first, and report on the performance you get
> from using xlrd for both types of files?
>
> John Y.

cheers,
raf

Deac-33 Lancaster

Jun 25, 2020, 8:18:54 PM
to python-excel
Thank you both Raf and John Yeung.   

I'm getting errors trying to use xlrd directly on xlsx files so I'll need some time to get it to work at all before I can test the timing.

I'm using xlrd 1.2.0 and it's erroring on opening with "formatting_info=True", and when that's removed it errors on using xf_list.  So I'll work on getting past those errors and then come back here for the next step.   :-)

thanks mucho, -deac


John Yeung

Jun 25, 2020, 11:26:21 PM
to python-excel
> > Whatever code you are using to read .xls files can be
> > used *verbatim* for reading .xlsx files.
>
> That's not true if you need to know how many digits to round numbers to
> so as to match the displayed value of numeric cells. xlrd doesn't support
> obtaining the necessary formatting information for .xlsx files.
>
> But if the number of decimal places isn't important, then you are right.

I keep forgetting about the formatting data because I purposely stay
away from it. You are right that xlrd won't retrieve .xlsx formatting.
I imagine some people are interested in more than just the decimal
places, too.

One reason I decided never to rely on the formatting data is that
Excel seemed to round a little differently than Python. Even though
they would only differ in edge cases, it was enough to turn me off of
trying to replicate that aspect of Excel files. And, since I didn't
need to know the intended number of decimal places very often, it was
enough for me to just manually open a sample workbook in genuine Excel
for those rare cases.

John Y.

John Yeung

Jun 25, 2020, 11:34:08 PM
to python-excel
On Thu, Jun 25, 2020 at 8:18 PM Deac-33 Lancaster <dea...@gmail.com> wrote:
>
> I'm using xlrd 1.2.0 and it's erroring on opening with "formatting_info=True", and when that's removed it errors on using xf_list. So I'll work on getting past those errors and then come back here for the next step. :-)

Don't bother. What raf was saying is that it won't work.

What do you need the formatting info for? (Here is where I start to
probe for deeper understanding of the situation. It's not my intention
to challenge you or dissuade you from doing what you're doing. But
there are times when we have to take a step back and see if there are
other paths which will work better for the larger goal. This may or
may not be one of those times.)

John Y.

Deac-33 Lancaster

Jun 26, 2020, 1:01:13 AM
to python-excel
John asked:
> What do you need the formatting info for?

I wrote the program for a friend who uses the numbers written to csv for some statistical analysis.   He needs the values to be the same precision and values as the csv which is manually exported from the xlsx file.  (He has thousands of files so manually exporting has become very laborious.)

My openpyxl program is correct for the xlsx files; it's just way too slow. I am now able to use openpyxl to read the Excel formatting and output the data correctly (except for the occasional difference in Excel and Python rounding that you mentioned, but that result is acceptable).

My xlrd program for xls files has the same functionality but runs in 0.5 seconds, where the openpyxl program takes 2 minutes for the same-size Excel file.

I even reduced the program to just read each Excel row and not write, and the times are the same, so writing to csv is not the problem.

All I need now is to get the openpyxl version of the program to work on xlsx files in a second or so instead of 2 minutes.  

thanks so much for your help,
-deac


Charlie Clark

Jun 26, 2020, 5:50:13 AM
to python-excel
I'm sorry, but I really don't understand your problem and particularly
your approach to solving it, which seems to be based largely on "it's
too slow". It sounds to me like you should probably look at writing your
own code based on the parser of either openpyxl or xlrd and add to it
the code you need for the particular transformation you want.

EOM.

John Yeung

Jun 26, 2020, 12:19:47 PM
to python-excel
> I wrote the program for a friend who uses the numbers written to csv for some statistical analysis. He needs the values to be the same precision and values as the csv which is manually exported from the xlsx file. (He has thousands of files so manually exporting has become very laborious.)

And the formatting changes from workbook to workbook? You can't just
look at one, get the precision of each column, and hard-code those
precisions into your program? It's a little hard to imagine (though I
can do it) that you have many thousands of workbooks, all different. I
would think at most there are a few different workbook formats, and
some kind of distinguishing features (other than the formatting data)
to let you tell which format each workbook follows.

But, if it really is the case that you *must* dynamically determine
the formats, there are probably no other viable ways to get the job
done quicker. I have a mathematician friend who routinely runs
programs for days on end. You posted this problem on May 6, so you
could (in principle) have been running your program continuously since
then, which would have been enough time to process tens of thousands
of .xlsx files by now, even at 2 or 3 minutes per file.

Is your code proprietary? Can you share it? It is conceivable that
you've inadvertently coded something in an especially inefficient way.

Assuming your code is using openpyxl as efficiently as possible,
significantly speeding things up would probably require custom
lowish-level coding (I could imagine possibly a purpose-built parser
which doesn't really look at XML per se, and just does text searches
for the specific things you need) and maybe even switching to another,
faster language (like C++ or Rust). Those options would probably take
you more time to implement than just waiting for your existing program
to trundle along as best it can.

Quick question: How long does it take to open one of these workbooks
manually in the Excel GUI? How long does it take for Excel to write it
out to a CSV?

And: Are new workbooks being generated that your program will have to
process? Or is this just one static repository that, once done, is
just done?

John Y.

raf

Jun 26, 2020, 6:32:46 PM
to python...@googlegroups.com
Python (decimal.Decimal) supports 8 different ways of rounding:

ROUND_CEILING
ROUND_DOWN
ROUND_FLOOR
ROUND_HALF_DOWN
ROUND_HALF_EVEN
ROUND_HALF_UP
ROUND_UP
ROUND_05UP

You'd think that whatever Excel is doing would probably be one of
those, but maybe not. And maybe xlrd doesn't use Decimal at all.
It would presumably slow things down, but I always think it's worth
avoiding real floats/doubles (at least if the numbers represent
monetary amounts). I'd have also thought that the way real floats
are rounded would be standardized, but I guess not.
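For example, here is how two of those modes treat a tie:

```
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

d = Decimal("2.5")
print(d.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))  # 2 (ties go to even)
print(d.quantize(Decimal("1"), rounding=ROUND_HALF_UP))    # 3 (ties go away from zero)
```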

cheers,
raf

John Yeung

Jun 26, 2020, 8:16:27 PM
to python-excel
> Python (decimal.Decimal) supports 8 different ways of rounding:
>
> ROUND_CEILING
> ROUND_DOWN
> ROUND_FLOOR
> ROUND_HALF_DOWN
> ROUND_HALF_EVEN
> ROUND_HALF_UP
> ROUND_UP
> ROUND_05UP
>
> You'd think that whatever Excel is doing would probably be one of
> those but maybe not. And maybe xlrd doesn't use Decimal at all.

Neither Excel nor xlrd uses a decimal type. In Excel, every number is
a (double-precision) float. This includes dates, times, and in a way,
Booleans.

Excel uses floats for everything for the same reason virtually
everything else in the computing universe uses floats: It's a
*marvelous* engineering compromise between precision and performance.
Like, freakishly good.

> It would presumably slow things down but I always think it's worth
> avoiding real floats/doubles (at least if the numbers represent
> monetary amounts).

And it's for this reason mainframes commonly have native support for
decimal arithmetic. Databases and business-oriented programming
languages also generally either default to or have first-class support
for decimals.

> I'd've also thought that the way that real floats
> were rounded would be standardized but I guess not.

It's mostly standard. The issue isn't so much rounding, per se, but
the mismatch between human base-10 thinking and computer base-2 float
values. Standard rounding is straightforward, but the results can be a
little surprising to humans who don't fully understand floats.

Excel is the one which deviates more from standard (computing)
behavior. It adds a little bit of custom rounding logic (for display
purposes) to make its numbers seem more humanistic and less binary.
It's kind of neat, actually.
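A rough illustration (Python's 15-significant-digit formatting here is only an approximation of Excel's display rules):

```
x = 0.1 + 0.2
print(repr(x))      # 0.30000000000000004  (the full binary-float value)
print(f"{x:.15g}")  # 0.3                  (rounded to 15 significant digits)
```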

John Y.

Brendan Ford

Jun 27, 2020, 4:19:39 AM
to python...@googlegroups.com
Just a general comment: for reporting purposes I find the pandas read_excel function faster than openpyxl.
I usually load data from SQL, manipulate the data in Python, then load the data from the previous Excel file I created with pandas, and use that to create the new Excel file. Pretty fast. Then again, I don't have 33k rows; columns are where the speed comes into play. Tons of columns slow everything down.


Charlie Clark

Jun 27, 2020, 4:23:43 AM
to python...@googlegroups.com
On 27 Jun 2020, at 1:30, Brendan Ford wrote:

> Just a general comment: for reporting purposes I find the pandas
> read_excel function faster than openpyxl.

Which is funny because Pandas switched to using openpyxl for reading
XLSX files a while back…

Charlie Clark

Jun 27, 2020, 4:39:12 AM
to python-excel

On 27 Jun 2020, at 2:16, John Yeung wrote:

> Excel is the one which deviates more from standard (computing)
> behavior. It adds a little bit of custom rounding logic (for display
> purposes) to make its numbers seem more humanistic and less binary.
> It's kind of neat, actually.

This is one of the reasons why Excel is not really suitable for scientific data. Precision is limited to 15 digits for numbers, so if you need anything else you must use strings. Datetimes are even worse, but there is at least the option to serialise using the ISO format, though you must provide a date format to stop Excel treating them as serials. :-(

Of course, the advantage for Excel of being able to treat the value of every cell as a number is obvious: memory allocation can be much lower.

Charlie Clark

Jun 27, 2020, 4:48:03 AM
to python-excel
On 26 Jun 2020, at 18:19, John Yeung wrote:

> Assuming your code is using openpyxl as efficiently as possible,
> significantly speeding things up would probably require custom
> lowish-level coding (I could imagine possibly a purpose-built parser
> which doesn't really look at XML per se, and just does text searches
> for the specific things you need) and maybe even switching to another,
> faster language (like C++ or Rust). Those options would probably take
> you more time to implement than just waiting for your existing program
> to trundle along as best it can.

Both openpyxl and xlrd use the standard library's xml.etree library for
parsing and this is written in C so there's little performance to be
gained by switching to C++ or Rust. I did try compiling with Cython and
did see something like 2 x performance due to the way Cython can
optimise loops and function calls, but basically XML parsing requires
regexes, which are a limiting factor, plus the boundary of going from
text to Python. I mean the reason these libraries exist is to provide a
rich API for handling the data.

XLSX to CSV could probably be handled by a streaming processor if you
can treat all data as numbers, though you still need to work out how to
handle the sparse matrix implementation. But, yeah, the work required to
do this would probably significantly outweigh the benefits.
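To give a flavour of the streaming idea, a very rough sketch that ignores shared strings, cell types and the sparseness handling mentioned above (the file name is made up):

```
import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

with zipfile.ZipFile("book.xlsx") as z:
    with z.open("xl/worksheets/sheet1.xml") as src:
        for _, elem in ET.iterparse(src):
            if elem.tag == NS + "c":              # a cell element
                v = elem.find(NS + "v")           # the raw stored value, if any
                if v is not None:
                    print(elem.get("r"), v.text)  # cell reference and value
                elem.clear()                      # keep memory use flat
```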

John Yeung

Jun 27, 2020, 10:58:24 AM
to python-excel
On Sat, Jun 27, 2020 at 4:48 AM Charlie Clark
<charli...@clark-consulting.eu> wrote:
>
> Both openpyxl and xlrd use the standard library's xml.etree library for
> parsing and this is written in C so there's little performance to be
> gained by switching to C++ or Rust.

There might not be too much extra performance from processing the XML
in a different language, but (1) I wasn't limiting the conversation to
improvement of that area, and (2) then again, there might be.

There is a lot that goes into processing XML and a lot that goes into
the interface between C and Python-space objects (the latter of which
especially might improve from cutting Python out of the picture).
There is a lot of room for optimization, depending on what you're
optimizing for. This discussion from the makers of lxml gives a decent
picture of the complexity:

https://lxml.de/performance.html

> I did try compiling with Cython and
> did see something like 2 x performance due to the way Cython can
> optimise loops and function calls

Double the speed just by speeding up loops and function calls is
nothing to sneeze at.

> but basically XML parsing requires
> regexes, which are a limiting factor, plus the boundary of going from
> text to Python. I mean the reason these libraries exist is to provide a
> rich API for handling the data.

All of this supports the idea that there very well could be areas of
significant performance gain by not using Python.

> But, yeah, the work required to
> do this would probably significantly outweigh the benefits.

We agree on that.

John Y.

Deac-33 Lancaster

Jun 27, 2020, 8:51:25 PM
to python-excel
John, thanks much for the response.

 
On Friday, June 26, 2020 at 9:19:47 AM UTC-7, John Yeung wrote:
 
> And the formatting changes from workbook to workbook? You can't just
> look at one, get the precision of each column, and hard-code those
> precisions into your program?

The sources for this water well data are several government agencies who not only have different formatting but change their formatting from time to time. With over 1500 different files from the past 12 years, I was finally able to read the formatting as done in the Excel spreadsheet and apply it acceptably for my friend's analysis; it's just slow. In response to a later question: new files are still coming in all the time.

> But, if it really is the case that you *must* dynamically determine
> the formats, there are probably no other viable ways to get the job
> done quicker. I have a mathematician friend who routinely runs
> programs for days on end.
 
You're right, I invited him to start the program and let it cook for days but he got distracted with another approach.  If this current effort to improve my speed fails I'll just let him live with that approach.  

He's doing this as a community service for water control and I'm doing my part as a favor for a mountaineering friend (and to have fun with Python).  But my fun time is getting tired.  Ha!

> Is your code proprietary? Can you share it? It is conceivable that
> you've inadvertently coded something in an especially inefficient way.

Not at all proprietary, and I'd welcome anyone willing to take a look at the attached file to find any improvement at all. I'd love to find any improvement in performance or style.
 
> Assuming your code is using openpyxl as efficiently as possible,
> significantly speeding things up would probably require custom
> lowish-level coding

You are correct that I'm not interested in going that far.

> Quick question: How long does it take to open one of these workbooks
> manually in the Excel GUI? How long does it take for Excel to write it
> out to a CSV?
 
Good questions, but I don't know. He has Windows with the full Office suite (in Colorado) and I'm on a Mac (in California), just using the online version at onedrive.live.com when I want to look at the data. I don't think it can export to csv. I can ask him to test it when he gets back from his current trip to the mountains.

Much thanks for any help!!
-Deac Lancaster 
wells_module_copy.py

John Yeung

Jun 28, 2020, 12:49:56 AM
to python-excel
On Sat, Jun 27, 2020 at 8:51 PM Deac-33 Lancaster <dea...@gmail.com> wrote:
>
> Not at all proprietary and anyone willing to take a look at the attached file to find any improvement at all is highly sought after. I'd love to find any improvement in performance or style.

Thank you for sharing the code! It's enlightening.

Now, are the Excel workbooks shareable as well? If so, how big are
they, in terms of kilobytes or megabytes?

There are a lot of suggestions I could make regarding your code, but
the biggest one is probably that you don't need to call the
`decimal_precision_of_myformat_str` function so much. Right now,
you're calling it twice on every row of data. But you only ever look
at the format of cells E5 and F5, no matter which row you're actually
processing, so you really only need to call it two times for the whole
sheet.
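In other words, something like this (a hypothetical reconstruction; `decimal_precision_of_myformat_str` and the E5/F5 look-ups are from your code, and the loop body is elided):

```
# Look the two formats up once, before the row loop, not once per row.
prec_e = decimal_precision_of_myformat_str(sheet["E5"].number_format)
prec_f = decimal_precision_of_myformat_str(sheet["F5"].number_format)

for row in sheet.iter_rows(values_only=True):
    ...  # round values with prec_e / prec_f and write the CSV row
```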

I mocked up my own well data (just lots of random numbers) and got a
surprisingly massive speedup just by moving those two function calls
out of the loop.

I say surprising because I first tried to get a quick estimate of how
much time I'd save by simply calling
`decimal_precision_of_myformat_str` 60,000 times (not mocking up any
data, not involving workbooks at all, just calling that function in a
loop with made-up format strings, approximately as many times as would
presumably be done by your program on your data), and it wasn't much
time at all. Those 60,000 calls took less than one second on my
machine.

So I wasn't too hopeful when I did make up my own "large" workbook and
ran your program (as-is) on it, and it indeed took minutes to process.
Saving one second wouldn't even be noticeable. So imagine my surprise
when I implemented the simple change I described and achieved a
20-fold improvement in speed!

Give it a try, and let us know how it goes.

John Y.

John Yeung

Jun 28, 2020, 12:59:49 AM
to python-excel
On Sun, Jun 28, 2020 at 12:49 AM John Yeung <gallium....@gmail.com> wrote:
>
> So imagine my surprise
> when I implemented the simple change I described and achieved a
> 20-fold improvement in speed!

Ah, what's slow is the retrieval of the number format string (within
openpyxl), not your function to figure out the precision from it.

That's what's critical to move out of the loop. (But while you're at
it, might as well move the function calls out as well.)

Maybe that's something openpyxl can improve with caching.
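On your side, if the same format strings recur, something like functools.lru_cache would at least amortise the parsing, though it wouldn't touch the retrieval cost inside openpyxl:

```
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_precision(format_str):
    # Delegates to your existing parser; a repeated format string
    # then costs a dict lookup instead of a re-parse.
    return decimal_precision_of_myformat_str(format_str)
```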

John Y.

Deac-33 Lancaster

Jun 28, 2020, 2:42:59 PM
to python-excel
The bottom line:
----------------------
 AHA, INDEED.    BRILLIANT, YOU FOUND IT!!!
Yes, since the formats for E5 and F5 won't change for the worksheet, I now see how this could be called only once each.  
I just moved that retrieval for E5 & F5 out of the loop and it ran in 3 1/2 seconds instead of 2 minutes!!!   YOU SOLVED MY PROBLEM!!!

I'm wondering how you determined that it was the retrieval of formatting data that was so slow?   GREAT FIND!!!

THANK YOU, THANK YOU, THANK YOU!!
-deac

-------------------------------------------------
-------------------------------------------------
John,

For the record I'll go ahead and respond to your other questions anyway, but you've helped exactly as I had hoped for.  
I'll also gladly take any other suggestions I can get.   


These are my notes from working through your message before I finally understood what you were saying and fixed it.

On Saturday, June 27, 2020 at 9:49:56 PM UTC-7, John Yeung wrote: 
 
> Now, are the Excel workbooks shareable as well? If so, how big are
> they, in terms of kilobytes or megabytes?
 
The sample workbook I'm using is 1908_bishop_080119.xlsx and is 2 MB. That may be too big to post here, so I'll put it in a Dropbox shared folder for you or anyone to look at. I also put the current output file there, 1908_bishop_080119.csv.
 
There are a lot of suggestions I could make regarding your code, but
the biggest one is probably that you don't need to call the
`decimal_precision_of_myformat_str` function so much. Right now,
you're calling it twice on every row of data. But you only ever look
at the format of cells E5 and F5, no matter which row you're actually
processing, so you really only need to call it two times for the whole
sheet.
 
You're right, E5 and F5 are two primary pieces of data needed for his analysis.  Now I understand how I can take the time consumers out of the loop.


> Ah, what's slow is the retrieval of the number format string (within
> openpyxl), not your function to figure out the precision from it.
> That's what's critical to move out of the loop. (But while you're at
> it, might as well move the function calls out as well.)

 AHA, INDEED.    BRILLIANT, YOU FOUND IT!!!
Yes, since the format for E5 (and for F5) won't change for the worksheet, I see how this could be called only once each.  I just switched my retrieval of the number format string for E5 & F5 out of the loop and it ran in 3 1/2 seconds instead of 2 minutes!!!   

Robert Silva

Aug 18, 2020, 5:54:41 AM
to python-excel
Hi deac, seems you've got it sorted now.

Just wanted to share how I'm using openpyxl, and it's decently quick.

  • first (which you are already using, btw): wb = load_workbook(filename, read_only=True)
  • second: for row in ws.iter_rows(min_row=2, values_only=True):

iter_rows makes it really quick.
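Putting the two together (the filename is made up):

```
from openpyxl import load_workbook

wb = load_workbook("wells.xlsx", read_only=True)
ws = wb.active
for row in ws.iter_rows(min_row=2, values_only=True):
    print(row)  # each row comes back as a plain tuple of values
wb.close()
```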

Hope it helps, 
Robert