Cannot detect all strikethrough in within a cell

230 views
Skip to first unread message

Feng, Youyang

unread,
May 8, 2025, 9:34:13 PMMay 8
to openpyx...@googlegroups.com
So I wrote this code below to detect and remove all strikethroughs of a given sheet while leaving other regular content unmodified.
   wb = xl.load_workbook(file, rich_text=True, data_only=True)    # ignore all strikethrough texts    ws = wb[sheet]    for j in range(1, ws.max_column+1):        for i in range(1, ws.max_row+1):            col = get_column_letter(j)            row = str(i)            curval = []            cell = ws[col + row]            if isinstance(cell.value, CellRichText):                for text_block in cell.value:                    if isinstance(text_block, TextBlock):                        if text_block.font.strike is None or text_block.font.strike is False:                            curval.append(text_block)                    else:                        curval.append(text_block)                ws[col + row] = CellRichText(curval)            else:                if cell.value:                    if cell.font.strike is True:                        ws[col + row].value = ''
Similar situation happened around cells with different format and content, although I cannot generalize their share pattern. So far, I have come up with some possible cases where strikethrough does not work on a TextBlock:
  1. The textblock has overlapped styles, such as combination of underlines and strikethroughs.
  2. The cell has a new line within but that "\n" is not strikethrough-ed.
  3. The textblock was assigned unsupported font family, therefore parse into plaintext before detection.
Any suggestion?


Charlie Clark

unread,
May 9, 2025, 4:55:08 AMMay 9
to 'Feng, Youyang' via openpyxl-users

On 9 May 2025, at 3:33, 'Feng, Youyang' via openpyxl-users wrote:

Hi Youyang,

Similar situation happened around cells with different format and content, although I cannot generalize their share pattern. So far, I have come up with some possible cases where strikethrough does not work on a TextBlock:

I'm sorry, but the AI generated code is very poor. I suggest you change your agent. I'm not going to debug such code but will make the following observations:

  • uses ad hoc cell access in such situations is cumbersome and slow. There is an API for a reason.
  • I don't understand the collection logic but I would normally separate this from assignment anyway

I just tried Mistral for the task and the code looks a lot better – both more likely to be what you want and easier to read – but ChatGPT also seems okay.

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D- 40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226

Yuuya

unread,
May 11, 2025, 11:10:44 PMMay 11
to openpyxl-users
IMG_9048.jpegHello, I tried mistral-ai, the code does look more readable, but i noticed a weird behavior of openpyxl:

when a cell is not a CellRichText, cell.font is determined by existance rather than style of each text. Check out the screenshot above, where left is the original file and right is result. The last 3 cells are not CellRichText according to the code, but the content of entire cell is removed, including text without strikethroughs.

Code for reference:


from openpyxl import load_workbook

from openpyxl.styles import Font, colors

from openpyxl.cell.rich_text import TextBlock, CellRichText

from pathlib import Path

def remove_strikethrough_text(file_path, sheet_name):

    # Load the workbook

    workbook = load_workbook(file_path)

    sheet = workbook[sheet_name]

 

    # Traverse all cells in the sheet

    for row in sheet.iter_rows():

        for cell in row:

            if isinstance(cell.value, CellRichText):

                # Handle rich text formatting

                new_rich_text = []

                for part in cell.value:

                    if not part.font.strikethrough:

                        new_rich_text.append(part)

                    else:

                        # Create a new TextBlock without strikethrough

                        new_part = TextBlock(part.font, part.text)

                        new_part.font = Font(strikethrough=False)

                        new_rich_text.append(new_part)

                cell.value = CellRichText(new_rich_text)

            else:

                # Check if the entire cell content is strikethrough

                print(cell.value)

                if cell.font and cell.font.strikethrough:

                    cell.value = ''

 

    # Save the modified workbook

    workbook.save(Path(file_path).stem + '_stripped.xlsx')

 

# Example usage

file_path = 'strike4.xlsx'

sheet_name = 'Sheet1'

remove_strikethrough_text(file_path, sheet_name)

Charlie Clark

unread,
May 15, 2025, 5:07:17 AMMay 15
to 'Yuuya' via openpyxl-users
On 12 May 2025, at 5:10, 'Yuuya' via openpyxl-users wrote:

> [image: IMG_9048.jpeg]Hello, I tried mistral-ai, the code does look more
> readable, but i noticed a weird behavior of openpyxl:

It's not Openpyxl that's weird, it's OOXML and this is covered in the Openpyxl documentation.

Charlie

--
Charlie Clark
Sengelsweg 34
Düsseldorf
D- 40489
Mobile: +49-178-782-6226

Yuuya

unread,
May 17, 2025, 5:08:26 AMMay 17
to openpyxl-users

I have searched around the documentation and did not find any pages specifically dedicated to this behavior, would you please elaborate a little bit or provide a link to your reference? Thank you.

Charlie Clark

unread,
May 19, 2025, 6:22:11 AMMay 19
to 'Yuuya' via openpyxl-users
On 17 May 2025, at 11:08, 'Yuuya' via openpyxl-users wrote:

> I have searched around the documentation and did not find any pages
> specifically dedicated to this behavior, would you please elaborate a
> little bit or provide a link to your reference? Thank you.

https://openpyxl.pages.heptapod.net/openpyxl/rich_text.html

And in the code

https://foss.heptapod.net/openpyxl/openpyxl/-/blob/branch/default/openpyxl/cell/text.py#L65

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D- 40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226

Yuuya

unread,
May 20, 2025, 11:36:30 PMMay 20
to openpyxl-users

Hello, 

 

I have read the document you referenced but not sure if I have a full grasp of it. Below are my understandings:

  1. A RichText is essentially a list of str and TextBlock.
  2. We can initialize InlineFont with Font, but since Font is determined by existence, for RichText with mixed formatting where for example some TextBlocks have a property while others dont, that property for Font will always be True.

 

However, this does not answer my question that why some RichText are recognized as str. 

For cell A1, it appears that openpyxl reads the first chunk underline and strike as str, losing all formatting it contains, i.e., u=True and strike=True.

 

Like A3, where plain text with strikethrough is recognized as str.

 

How does openpyxl determine if a part in a CellRichText is a TextBlock or not?

 

Also, when I tried to process the processed file again, it removes A3 cell but still keeps A1 untouched.

 

Any thoughts?

 

 

Below is my updated code. 

 

 

from openpyxl import load_workbook

from openpyxl.styles import Font, colors

from openpyxl.cell.rich_text import TextBlock, CellRichText

from openpyxl.cell.text import InlineFont

from pathlib import Path

def remove_strikethrough_text(file_path, sheet_name):

    # Load the workbook

    workbook = load_workbook(file_path, rich_text=True)

    sheet = workbook[sheet_name]

 

    # Traverse all cells in the sheet

    for row in sheet.iter_rows():

        for cell in row:

            if isinstance(cell.value, CellRichText):

                # Handle rich text formatting

                new_rich_text = []

                for part in cell.value:

                    if isinstance(part, TextBlock):

                        if not part.font.strike or part.font.strike == False or part.font.strike is None:

                            new_rich_text.append(part)

                    # str does not have font, add directly.

                    else: # isinstance(part, str):

                        new_rich_text.append(part)

                    # else:

                    #     print(part)

                    #     # Create a new TextBlock without strikethrough

                    #     new_part = TextBlock(part.font, part.text)

                    #     new_part.font = InlineFont(strike=False)

                    #     new_rich_text.append(new_part)

                cell.value = CellRichText(new_rich_text)

            else:

                # Check if the entire cell content is strikethrough

                if cell.font and cell.font.strikethrough:

                    cell.value = ''

 

    # Save the modified workbook

    workbook.save(Path(file_path).stem + '_stripped.xlsx')

 

# Example usage

file_path = 'strike4.xlsx'

sheet_name = 'Sheet1'

remove_strikethrough_text(file_path, sheet_name)

Charlie Clark

unread,
May 21, 2025, 8:32:33 AMMay 21
to 'Yuuya' via openpyxl-users
On 21 May 2025, at 5:36, 'Yuuya' via openpyxl-users wrote:

> Any thoughts?

Yes, try and work your way through some code examples. Openpyxl implements the specification as it is and it is this that is cause of the problems.

Main QQ

unread,
May 22, 2025, 8:45:40 PMMay 22
to openpyx...@googlegroups.com
I'm confused here, are you indicating that openpyxl may read some formatted text as string?

I have read through the documentation regrding RichText, tried a lot of methods and codes to test around, but unfortunately not much luck so far, which is why I end up asking for help here. So far, the primary issue is openpyxl reads a plain text with format as regular string instead of a RichText, which is exactly what I am confused about.

If you notice any issues in my code and would love to help me out here, I would be really appreaciative.

Thank you.
 

From: openpyx...@googlegroups.com <openpyx...@googlegroups.com> on behalf of Charlie Clark <charli...@clark-consulting.eu>
Sent: Wednesday, May 21, 2025 20:32
To: openpyxl-users <openpyx...@googlegroups.com>
Subject: Re: [openpyxl-users] Cannot detect all strikethrough in within a cell
 
--
You received this message because you are subscribed to a topic in the Google Groups "openpyxl-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/openpyxl-users/a9e6KFTof74/unsubscribe.
To unsubscribe from this group and all its topics, send an email to openpyxl-user...@googlegroups.com.
To view this discussion, visit https://groups.google.com/d/msgid/openpyxl-users/2C66095C-17A7-4BF6-83FA-0056E261EC47%40clark-consulting.eu.

Charlie Clark

unread,
May 23, 2025, 5:32:12 AMMay 23
to 'Main QQ' via openpyxl-users

On 23 May 2025, at 2:43, 'Main QQ' via openpyxl-users wrote:

Hi Youyang,

I'm confused here, are you indicating that openpyxl may read some formatted text as string?

The main issue is that in a command line, it's difficult to have both the formatting and the text itself. Openpyxl tries make this easier but the actual OOXML is essentially unreadable.

I have read through the documentation regrding RichText, tried a lot of methods and codes to test around, but unfortunately not much luck so far, which is why I end up asking for help here. So far, the primary issue is openpyxl reads a plain text with format as regular string instead of a RichText, which is exactly what I am confused about.

This is the problem with OOXML. Rich text was added later, and was a mistake in my view. If you have text in a cell and format the cell this is completely different to using rich text, which has to be added explicitly. So, if you have both in your files then you'll have to check whether you have rich text or not.

For example:

c1 = ws["A1"]
c1.value = "Simple text"
c1.font = Font(strikethrough=True)

Is completely different to a cell with rich text.

strike = InlineFont(strike=True)
bold = InlineFont(b=True)
italic = InlineFont(i=True)
rich = CellRichText()
for word, style in zip(["bold", "italic", "strike"], [bold, italic, strike]):
    rich.append(TextBlock(style, word + " "))
c2 = ws["B1"]
c2.value = rich

Any formatting applied to c2 will be ignored.

If you notice any issues in my code and would love to help me out here, I would be really appreaciative.

I don't have your files so your code is not really usable for me. In any case, as I said, you should really work with RichText objects to get familiar with them.

Yuuya

unread,
May 26, 2025, 10:35:16 AMMay 26
to openpyxl-users
Hello,

Thank you for your explanation. Although creating a new cell was not my intention, with your words everything definitely makes more sense now. FYI, all files I work with have rich text upon creation.

Attached is a minimal runnable code I wrote to describe the situation I found. 

So far, based on my test cases, I have managed to narrow it down to one case:

Strikethrogh detection, both for cell and for TextBlock within,  will not work if: assuming the cell is splitable by spaces, the first chunk has strikethrough, BUT ALSO other styles like underlines and italic (i.e., style overlapping). 

Not sure if it is the only case, but from my experiment this is the only pattern I found.

The result I get from the code file provided:

```
A1
A2
A3
comb
A4
combi
A5
combi
A6
combi
A7
combi
A8
mb
A9
mb
A10
co
A11
combined
A12
combined
A13
combined
combined
A14
combi
A15
bined
A16
bined
aaa
A17
aa
combined
```

Thank you.
strike6.xlsx
test.ipynb
Reply all
Reply to author
Forward
0 new messages