Like many folks I need to read both .xls files (using xlrd) and .xlsx files (using openpyxl), in both cases files of about 30,000 rows. And in both cases I'm just copying all excel data read and writing it out to a .csv file, no other processing so just Input/Output.
But the xlsx file operations are over 200 times slower than for .xls, for example reading a 30,000 row .xlsx file now takes 2 minutes compared to 1/2 second for .xls with xlrd. We have thousands of files to process so the time per file matters.
Is openpyxl that much slower or do I need to do something, like release some resource at the end of each row?
BTW, I have made several great improvements by using read_only=True and reading a row at a time instead of cell by cell as shown in the following code segment. Thanks to blog.davep.org https://blog.davep.org/2018/06/02/a_little_speed_issue_with_openpyxl.html
wb = openpyxl.load_workbook("excel_file.xlsx", data_only=True, read_only=True) sheet = wb.active for row in sheet.rows: for cell in row: cell_from_excel = cell.value