On 19 May 2025, at 13:27, 吴少博 wrote:
MemoryError Is the line of code "cells = list(cells)" necessary? This may
cause a MemoryError when performing insertions on tables with more than
100,000 rows.
If you think there is a problem with the code then the best thing to do is to write a test case and submit a bug report. Simply cherry-picking a couple of lines of code and asking a question isn't helpful.
Regarding memory: cells aren't that big, so that even 100,000 rows shouldn't be much of an issue unless the rows are very wide. I'm not sure if it's the case with more recent versions of Python, but at the time of writing, it wasn't possible to sort generators, so instantiation is required.
I've never really benchmarked this code, and it's certainly possible that is could be rewritten to be more efficient. But that would really be putting lipstick on a pig: Excel's model isn't suited to moving large amounts of data; and while the implementation is far from perfect, if I wanted to insert rows into a large sheet, I'd probably create a new sheet with the rows and copy the other rows in, or do everything in another process (such as Polars or Pandas) and dump that into another sheet. Especially with large files, it's best thinking of Excel as a slightly smarter format than CSV.
But if you think you can improve on this, then feel free to submit a bug report with a sample file and benchmark code, and then a PR. NB. look at some of the changes on more recent branches, as I am hoping to reduce memory use (and performance) of cells by removing the coordinate attributes, because AFAIK the overhead of updating these when, say moving (1,1):Cell(A1)
to (2,2):Cell(B2)
, is the limiting factor, but the complexity of the alternatives should not be ignored.
Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D- 40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226
--
You received this message because you are subscribed to the Google Groups "openpyxl-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openpyxl-user...@googlegroups.com.
To view this discussion, visit https://groups.google.com/d/msgid/openpyxl-users/B8EF0568-298A-416E-841E-6D893DCE08B5%40clark-consulting.eu.
On 19 May 2025, at 15:38, Charlie Clark wrote:
I've been thinking about this recently and revisited it.
If you think there is a problem with the code then the best thing to do is to write a test case and submit a bug report. Simply cherry-picking a couple of lines of code and asking a question isn't helpful.
This remains the case… but as there was no follow up, I wrote my own, which really points to the underlying problems.
Regarding memory: cells aren't that big, so that even 100,000 rows shouldn't be much of an issue unless the rows are very wide. I'm not sure if it's the case with more recent versions of Python, but at the time of writing, it wasn't possible to sort generators, so instantiation is required.
It's also worth noting that the list of the cells just contains copies of the rows of the cells, so it's not really a lot of memory, even if you replace it with something like all(cells)
. The problems really become visible if you have a "small" worksheet (one that doesn't have many real cells) but has many rows because openpyxl will backfill the worksheet in order to be able to overwrite rows.
You can see this if you profile the following code:
wb = Workbook()
ws = wb.active
ws["JJ10000"] = 1
ws.insert_rows(5, 15)
This shows us what's happening:
ncalls tottime percall cumtime percall filename:lineno(function)
2698920 8.339 0.000 12.343 0.000 worksheet.py:804(_move_cell)
2698920 7.803 0.000 9.393 0.000 cell.py:104(__init__)
5397841 7.640 0.000 22.881 0.000 worksheet.py:255(_get_cell)
2698920 3.905 0.000 5.847 0.000 worksheet.py:269(_add_cell)
2698920 1.698 0.000 22.647 0.000 worksheet.py:224(cell)
2708916 1.580 0.000 24.227 0.000 worksheet.py:450(<genexpr>)
8096760 1.323 0.000 1.323 0.000 cell.py:115(column)
8096760 1.309 0.000 1.309 0.000 cell.py:121(row)
1 1.282 1.282 38.828 38.828 worksheet.py:692(_move_cells)
5397840 1.244 0.000 1.244 0.000 coordinate.py:14(__init__)
2698923 0.955 0.000 0.955 0.000 {built-in method builtins.max}
2698920 0.881 0.000 0.881 0.000 styleable.py:132(__init__)
9997 0.780 0.000 25.007 0.003 worksheet.py:448(_cells_by_row)
Openpyxl is creating all the rows and then it's moving them. Now, in a worksheet that actually has this many cells, the cell creation will not be be needed here, but you'd just have the cost when opening the file or creating the rows.
The backfilling is not necessarily always necessary and here it's obviously counterproductive. It might be possible to improve this by determining which rows will be affected – taking into consideration that the implementation relies on overwriting cells, including with empty ones. This would no longer be required in an implementation that was less dependent upon cell coordinates.