Like many folks I need to read both .xls files (I call them S files, using xlrd) and .xlsx files (the X files, using openpyxl). In my case I read Excel files of about 30,000 rows and just copy all the data read to a .csv file, with no other processing, so it's just input/output. But the X file operations are much, much slower: reading a 30,000-row .xlsx file now takes 2 minutes compared to 1/2 second for .xls with xlrd. Is openpyxl that much slower, or do I need to do something, like release some resource at the end of each row?

BTW, I have already made several great improvements by using read_only=True and reading a row at a time instead of cell by cell, as shown in the following code segment. Thanks to blog.davep.org: https://blog.davep.org/2018/06/02/a_little_speed_issue_with_openpyxl.html

```
import openpyxl

wb = openpyxl.load_workbook("excel_file.xlsx", data_only=True, read_only=True)
sheet = wb.active
for row in sheet.rows:
    for cell in row:
        cell_from_excel = cell.value
```
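For what it's worth, the same copy-to-CSV job can be pushed a step further with `values_only=True`, which yields tuples of plain values instead of cell objects and so skips some per-cell overhead. A minimal sketch (the file names are placeholders, not the poster's actual files):

```python
import csv
import openpyxl

def xlsx_to_csv(xlsx_path, csv_path):
    # Read-only mode streams rows; values_only=True returns tuples of
    # plain values, skipping cell-object creation entirely.
    wb = openpyxl.load_workbook(xlsx_path, data_only=True, read_only=True)
    try:
        with open(csv_path, "w", newline="") as f:
            writer = csv.writer(f)
            for row in wb.active.iter_rows(values_only=True):
                writer.writerow(row)
    finally:
        wb.close()  # read-only mode keeps the source file open until closed
```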
I failed to say that I'm using Python 3.8 and openpyxl 3.0.3.
On 2 May 2020, at 22:31, Deac-33 Lancaster wrote:
Like many folks I need to read both .xls files (I call them S files, using xlrd) and .xlsx files (the X files, using openpyxl).
In my case I read Excel files of about 30,000 rows and just copy all data read to a .csv file, no other processing so just Input/Output.
But the X file operations are much much slower, for reading a 30,000 row .xlsx file it now takes 2 minutes compared to
1/2 second for .xls with xlrd. Is openpyxl that much slower or do I need to do something, like release some resource at the end of each row?
Seeing as xlrd can also read XLSX files, you can test with that as well. Basically, openpyxl and xlrd have similar parsing performance because they use the same XML libraries for reading. openpyxl includes some performance benchmarks on a real world file:
https://openpyxl.readthedocs.io/en/latest/performance.html#read-performance
Processing XML is much slower than processing plaintext or binary formats, so a comparison between XLSX and XLS is not really useful.
But openpyxl can also read much more of the OOXML specification, such as images, charts, pivot tables and external links. It's a much bigger library, so it can take longer to load.
Your sample file can be loaded and read in < 1s on my 2015 MacBook Pro, so I reckon the problem has nothing to do with parsing speed.
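One way to verify this on your own machine is to time the bare parse-and-iterate pass separately from everything else. A sketch of such a check (not part of the thread's original code): if this alone takes minutes, the time really is going into parsing; if it finishes in seconds, look at the rest of your pipeline.

```python
import time
import openpyxl

def time_parse(path):
    # Time only openpyxl's parsing and row iteration, with no other work,
    # to see whether XML parsing itself is the bottleneck.
    start = time.perf_counter()
    wb = openpyxl.load_workbook(path, data_only=True, read_only=True)
    n_rows = sum(1 for _ in wb.active.iter_rows(values_only=True))
    wb.close()
    return n_rows, time.perf_counter() - start
```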
--
You received this message because you are subscribed to the Google Groups "python-excel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-excel...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/python-excel/20200626223239.mxak6c5dyf7lv2v3%40raf.org.
On 27 Jun 2020, at 2:16, John Yeung wrote:
Excel is the one which deviates more from standard (computing)
behavior. It adds a little bit of custom rounding logic (for display
purposes) to make its numbers seem more humanistic and less binary.
It's kind of neat, actually.
This is one of the reasons why Excel is not really suitable for scientific data. Precision is limited to 15 digits for numbers, so if you need anything more you must use strings. Datetimes are even worse, but there is at least the option to serialise using the ISO format, though you must provide a date format to stop Excel treating them as serials. :-(
Of course, the advantage for Excel of being able to treat the value of every cell as a number is obvious: memory allocation can be much lower.
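The display rounding John describes is easy to mimic in Python: Excel stores numbers as IEEE 754 doubles but shows at most 15 significant digits, which is what hides binary artefacts like the classic 0.1 + 0.2 result from the user.

```python
# Excel stores numbers as IEEE 754 doubles but displays at most
# 15 significant digits, which hides binary artefacts from the user.
x = 0.1 + 0.2
full = repr(x)        # the double's full decimal value
shown = f"{x:.15g}"   # Excel-style: at most 15 significant digits
print(full)   # 0.30000000000000004
print(shown)  # 0.3
```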
On Friday, June 26, 2020 at 9:19:47 AM UTC-7, John Yeung wrote:
And the formatting changes from workbook to workbook? You can't just look at one, get the precision of each column, and hard-code those precisions into your program?
But, if it really is the case that you *must* dynamically determine the formats, there are probably no other viable ways to get the job done quicker. I have a mathematician friend who routinely runs programs for days on end.
Is your code proprietary? Can you share it? It is conceivable that you've inadvertently coded something in an especially inefficient way.
Assuming your code is using openpyxl as efficiently as possible, significantly speeding things up would probably require custom lowish-level coding.
Quick question: How long does it take to open one of these workbooks manually in the Excel GUI? How long does it take for Excel to write it out to a CSV?
Now, are the Excel workbooks shareable as well? If so, how big are they, in terms of kilobytes or megabytes?
There are a lot of suggestions I could make regarding your code, but the biggest one is probably that you don't need to call the `decimal_precision_of_myformat_str` function so much. Right now, you're calling it twice on every row of data. But you only ever look at the format of cells E5 and F5, no matter which row you're actually processing, so you really only need to call it two times for the whole sheet.
Ah, what's slow is the retrieval of the number format string (within openpyxl), not your function to figure out the precision from it. That's what's critical to move out of the loop. (But while you're at it, might as well move the function calls out as well.)
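The poster's code isn't shown in the thread, so the sketch below is only an illustration of the hoisting idea: `decimal_precision_of_myformat_str` is reimplemented here as a hypothetical regex-based stand-in, and the point is that the two `number_format` lookups happen once, before the row loop, instead of twice per row.

```python
import re

def decimal_precision_of_myformat_str(fmt):
    # Hypothetical stand-in for the poster's helper (their version isn't
    # shown in the thread): count the decimal places in a number format
    # string such as "0.00" -> 2.
    m = re.search(r"\.(0+)", fmt)
    return len(m.group(1)) if m else 0

def precisions_once(sheet):
    # The formats of E5 and F5 never change while the sheet is processed,
    # so retrieve each number format string once, outside the row loop.
    # Retrieving cell.number_format is the slow openpyxl operation the
    # thread identifies, so it must not run per row.
    prec_e = decimal_precision_of_myformat_str(sheet["E5"].number_format)
    prec_f = decimal_precision_of_myformat_str(sheet["F5"].number_format)
    return prec_e, prec_f
```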