insert_rows MemoryError

145 views
Skip to first unread message

吴少博

unread,
May 19, 2025, 7:27:08 AMMay 19
to openpyxl-users
  File "/python3.11/site-packages/openpyxl/worksheet/worksheet.py", line 706, in _move_cells
    cells = list(cells)
            ^^^^^^^^^^^
  File "/python3.11/site-packages/openpyxl/worksheet/worksheet.py", line 450, in _cells_by_row
    yield tuple(cells)
          ^^^^^^^^^^^^
  File "/python3.11/site-packages/openpyxl/worksheet/worksheet.py", line 446, in <genexpr>
    cells = (self.cell(row=row, column=column) for column in range(min_col, max_col + 1))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/python3.11/site-packages/openpyxl/worksheet/worksheet.py", line 244, in cell
    cell = self._get_cell(row, column)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/python3.11/site-packages/openpyxl/worksheet/worksheet.py", line 261, in _get_cell
    self._add_cell(cell)
  File "/python3.11/site-packages/openpyxl/worksheet/worksheet.py", line 272, in _add_cell
    self._cells[(row, column)] = cell
MemoryError Is the line of code "cells = list(cells)" necessary? This may cause a MemoryError when performing insertions on tables with more than 100,000 rows.

Charlie Clark

unread,
May 19, 2025, 9:38:23 AMMay 19
to openpyxl-users

On 19 May 2025, at 13:27, 吴少博 wrote:

MemoryError Is the line of code "cells = list(cells)" necessary? This may
cause a MemoryError when performing insertions on tables with more than
100,000 rows.

If you think there is a problem with the code then the best thing to do is to write a test case and submit a bug report. Simply cherry-picking a couple of lines of code and asking a question isn't helpful.

Regarding memory: cells aren't that big, so that even 100,000 rows shouldn't be much of an issue unless the rows are very wide. I'm not sure if it's the case with more recent versions of Python, but at the time of writing, it wasn't possible to sort generators, so instantiation is required.

I've never really benchmarked this code, and it's certainly possible that is could be rewritten to be more efficient. But that would really be putting lipstick on a pig: Excel's model isn't suited to moving large amounts of data; and while the implementation is far from perfect, if I wanted to insert rows into a large sheet, I'd probably create a new sheet with the rows and copy the other rows in, or do everything in another process (such as Polars or Pandas) and dump that into another sheet. Especially with large files, it's best thinking of Excel as a slightly smarter format than CSV.

But if you think you can improve on this, then feel free to submit a bug report with a sample file and benchmark code, and then a PR. NB. look at some of the changes on more recent branches, as I am hoping to reduce memory use (and performance) of cells by removing the coordinate attributes, because AFAIK the overhead of updating these when, say moving (1,1):Cell(A1) to (2,2):Cell(B2), is the limiting factor, but the complexity of the alternatives should not be ignored.

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D- 40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226

john pendreich

unread,
May 19, 2025, 10:38:13 AMMay 19
to openpyx...@googlegroups.com
if you think about programming in the "old days", software like WORDSTAR and spellBinder constantly dumped: to disk to avoid memory shortage. Perhaps consider doing the same-  every 500? lines or soswitch to read and write to a "swap" file.

cheers,

John K.Pendreich
H:0044 207 221 4880 M:0044 7895 392530
F:0044 8445 070315 O:0044 203 290 9784
E:john.pe...@gmail.com
A:56A PALACE COURT LONDON W2 4JB UK


--
You received this message because you are subscribed to the Google Groups "openpyxl-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openpyxl-user...@googlegroups.com.
To view this discussion, visit https://groups.google.com/d/msgid/openpyxl-users/B8EF0568-298A-416E-841E-6D893DCE08B5%40clark-consulting.eu.

Charlie Clark

unread,
May 19, 2025, 11:49:01 AMMay 19
to openpyx...@googlegroups.com
On 19 May 2025, at 16:37, john pendreich wrote:

> if you think about programming in the "old days", software like WORDSTAR
> and spellBinder constantly dumped: to disk to avoid memory shortage.
> Perhaps consider doing the same- every 500? lines or soswitch to read and
> write to a "swap" file.

This is the kind of bikeshedding I wanted to avoid… write_only mode already does this… But it would not be possible in this case, precisely because it's about rearranging the order, and rows must be stored in order. Visicalc, Lotus 1-2-3 and earlier versions of Excel were extremely constrained by memory and, therefore, used hard limits on row lengths (256 IIRC) and number of rows (16384), because they had to be able to map all cells into memory.

Charlie Clark

unread,
Jul 22, 2025, 12:11:40 PMJul 22
to openpyxl-users

On 19 May 2025, at 15:38, Charlie Clark wrote:

I've been thinking about this recently and revisited it.

If you think there is a problem with the code then the best thing to do is to write a test case and submit a bug report. Simply cherry-picking a couple of lines of code and asking a question isn't helpful.

This remains the case… but as there was no follow up, I wrote my own, which really points to the underlying problems.

Regarding memory: cells aren't that big, so that even 100,000 rows shouldn't be much of an issue unless the rows are very wide. I'm not sure if it's the case with more recent versions of Python, but at the time of writing, it wasn't possible to sort generators, so instantiation is required.

It's also worth noting that the list of the cells just contains copies of the rows of the cells, so it's not really a lot of memory, even if you replace it with something like all(cells). The problems really become visible if you have a "small" worksheet (one that doesn't have many real cells) but has many rows because openpyxl will backfill the worksheet in order to be able to overwrite rows.

You can see this if you profile the following code:

wb = Workbook()
    ws = wb.active
    ws["JJ10000"] = 1
    ws.insert_rows(5, 15)

This shows us what's happening:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2698920    8.339    0.000   12.343    0.000 worksheet.py:804(_move_cell)
  2698920    7.803    0.000    9.393    0.000 cell.py:104(__init__)
  5397841    7.640    0.000   22.881    0.000 worksheet.py:255(_get_cell)
  2698920    3.905    0.000    5.847    0.000 worksheet.py:269(_add_cell)
  2698920    1.698    0.000   22.647    0.000 worksheet.py:224(cell)
  2708916    1.580    0.000   24.227    0.000 worksheet.py:450(<genexpr>)
  8096760    1.323    0.000    1.323    0.000 cell.py:115(column)
  8096760    1.309    0.000    1.309    0.000 cell.py:121(row)
        1    1.282    1.282   38.828   38.828 worksheet.py:692(_move_cells)
  5397840    1.244    0.000    1.244    0.000 coordinate.py:14(__init__)
  2698923    0.955    0.000    0.955    0.000 {built-in method builtins.max}
  2698920    0.881    0.000    0.881    0.000 styleable.py:132(__init__)
     9997    0.780    0.000   25.007    0.003 worksheet.py:448(_cells_by_row)

Openpyxl is creating all the rows and then it's moving them. Now, in a worksheet that actually has this many cells, the cell creation will not be be needed here, but you'd just have the cost when opening the file or creating the rows.

The backfilling is not necessarily always necessary and here it's obviously counterproductive. It might be possible to improve this by determining which rows will be affected – taking into consideration that the implementation relies on overwriting cells, including with empty ones. This would no longer be required in an implementation that was less dependent upon cell coordinates.

Reply all
Reply to author
Forward
0 new messages