How to get all the text and images from this html?

David S

Sep 15, 2022, 7:41:43 PM

to beautifulsoup

Hello.

Thank you for providing this help for BeautifulSoup.

I have been unsuccessful at getting the soup object to contain all the text from the document here. This is just one of several similar documents with which I'm having this same issue. All of these HTML documents were generated from .DOCX using the Microsoft Word "Save As Web Page Filtered" option shown in the screenshot below.

Here is the code I am using:

```python
from pathlib import Path
from bs4 import BeautifulSoup
import re

html_path = Path('path/to/file/Doc10000000.htm')
charset = 'windows-1252'  # I have verified this is the correct encoding from the original .docx file that this .htm document was generated from

with html_path.open('r', encoding=charset, newline=None, errors='replace') as f:
    # read data and remove newline characters at the end of each line
    data = ''.join([re.sub(r'\n$', '', line) for line in f])

soup = BeautifulSoup(data, parser='lxml')
```

Troubleshooting steps I've taken:

(1) Replaced the 'lxml' parser with each of the other parsers listed in the BeautifulSoup documentation to see whether that fixes the problem.

(2) Used the diagnose function to troubleshoot. What's interesting about the output of diagnose(data) is that it contains all the text and images that I am interested in, even though the soup object does not. This inclines me to believe that the parser is not what is causing the issue.

(3) Opened the HTML file in a browser to check whether the HTML is poorly formed. Google Chrome has no problem opening the file, with all the text and images properly formatted.

Thank you again for your help.

Regards,

David

Isaac Muse

Sep 15, 2022, 9:11:22 PM

to beautifulsoup

Without something testable, I'm afraid it is going to be very difficult to help. I can only guess what your problem may or may not be. The document format could be off, you could be analyzing the document incorrectly, or it could be an issue with BeautifulSoup.

If it is a parsing issue in BeautifulSoup, it is impossible to debug it without a failing document. Also, it would be interesting to see how you are attempting to extract the text and images as well.

David S

Sep 15, 2022, 10:52:21 PM

to beautifulsoup

Hi Faceless,

Thanks for the quick reply.
A testable document can be found here: https://drive.google.com/drive/folders/1SKpKnu7gZpsH2OlCvRMrPEUnvbL2iIIi. It's the same document I provided in my previous comment. Or, maybe you are looking for something else? Please let me know.

Regarding your last statement about how I am "attempting to extract the text and images": the text and image references are not found in the soup object, so I am unable to extract them in any manner.

I look forward to hearing more after you have the chance to look at the document. Thanks again.

Regards,

David

Isaac Muse

Sep 16, 2022, 12:41:13 AM

to beautifulsoup

So, I'm getting a ton of text out:

```python
from pathlib import Path
from bs4 import BeautifulSoup
import re

html_path = Path('Doc10000000.htm')
charset = 'windows-1252'

with html_path.open('r', encoding=charset, newline=None, errors='replace') as f:
    data = ''.join([re.sub(r'\n$', '', line) for line in f])

soup = BeautifulSoup(data, 'lxml')

print(soup.text)
print(soup.select('img'))
```

Truncated, but you get the idea.

```text

MRTRS RXCVXCRTMXR RXRTCRTX RR RXMVCRMRRTR RR MRTRCMRSRY R RR VCRYTRRMÓR RR YRCMMRMXY Rzgzqxzz #: RC1876031_______ Txeex Rxxezxvx: 1 bx xvzdzz bx 2009 Txeex bx Rxqxqxexóg: 1 bx xvzdzz bx 2010 Rzgzqxzz bx ezgxxbxgexxxxbxb #: 6535402 RXMVCRRXC: Mgzxx Rzqqzqxzxzg (g zzbxd dad daqdxbxxqxxd g xxxxxxbxd bx Mgzxx, x xxd qax xg xz daexdxvz dx xxd bxgzqxgxqé xgbxdzxgzxqxgzx ezqz xx “Rzqqqxbzq” z “Mgzxx”). MRTRS RXCVXCRTMXR VXCRPRYR RDCRRMRRT – DXXRY RRR YRCMMRRY Rvqxxqxgz #: RC1876031_________ Yxvgxzaqx Rxzx: Ravadz 1dz 2009 Rxqxqxzxzg Rxzx: Ravadz 1dz 2010 RRRR # 6535402 MXRRC: Mgzxx Rzqqzqxzxzg (xgb xxx Mgzxx daqdxbxxqxxd xgb xxxxxxxzxd, exqxxgxxzxq "Magxq" zq "Mgzxx"). VCXMRRRXC: Mxxvzd YR.(xg xz daexdxvz xx "Vqzvxxbzq"). Rxqxeexóg: Mzqxgz 1145, 3xq Vxdz Maxgzd Rxqxd, Rqvxgzxgx. R1091RRC (xg xz daexdxvz xx "Vqzvxxbzq"). Rgxxzd qax dx xzagzxg x xgezqqzqxg xx qqxdxgzx (Mxqqax ezg agx “Y” xzd qax qxdaxzxg xqxxexqxxd): Téqqxgzd g ezgbxexzgxd bxx Rzgzqxzz bx Rzqqqxvxgzx bx Mxzxqxxxxd g bx qqxdzxexóg bx dxqvxexzd. R Rxdeqxqexóg bxx Vqzbaezz g Rxxxgbxqxz bx qxvzd. M Rdzégbxqxd bx Rxdxqqxñz / Cxqaxqxqxxgzz bx exxxbxb. R Rxdqzdxexzgxd ezqqxxqxgzxqxxd. R Rxqxezxvxd qxqx ag xqqxxgzx bx zqxqxzz xxqqx bx xxezezx g bx bqzvxd. R Sxexgexx bx adz bx dzxzwxqx. T Vqzzxeexóg bx xzd xezxvzd bx xgxzqqxexóg bxx Rzqqqxbzq. D

```

So, I'm not quite sure what the issue is. Is the content purposely garbled/encoded like this? I don't so much care what the data actually is, only whether it is supposed to be like what is shown.

Now, I didn't get images with `lxml`, but switching to `html.parser` or `html5lib`, I was able to get images, so `lxml` seems to have an issue with something in the file:

```python

[<img border="0width=328" height="183" id="Picture 1" src="Doc10000000_files/image001.png"/>, <img src="Doc10000000_files/image002.png" width="329height=1"/>, <img height='1src="Doc10000000_files/image002.png"' width="329"/>, <img border="0width=329" height="193" id="Picture 6" src="Doc10000000_files/image003.jpg"/>, <img height='1src="Doc10000000_files/image002.png"' width="329"/>, <img height='1src="Doc10000000_files/image002.png"' width="329"/>, <img height='1src="Doc10000000_files/image002.png"' width="329"/>, <img height='1src="Doc10000000_files/image002.png"' width="329"/>]

```
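One thing that stands out in that output is the fused attributes, such as `border="0width=328"` and `width="329height=1"`. That pattern is consistent with the script stripping each trailing newline and joining the lines with an empty string, so attributes that were wrapped onto separate lines in the file run together. The sketch below reproduces the effect with Python's standard-library `html.parser` module (the module behind BeautifulSoup's `html.parser` option). The sample markup and the `ImgCollector` and `img_attrs` names are made up for illustration and are not taken from the real document.

```python
import re
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the attributes of every <img> start tag as dicts."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def img_attrs(markup):
    """Parse markup and return one attribute dict per <img> tag."""
    collector = ImgCollector()
    collector.feed(markup)
    collector.close()
    return collector.images


# Hypothetical sample: one <img> tag wrapped across three lines.
sample = (
    '<p><img border=0\n'
    'width=328 height=183\n'
    'src="Doc_files/image001.png"></p>\n'
)

# Same approach as the script in this thread: drop each line's trailing
# newline, then join the lines with an empty string.
joined = ''.join(
    re.sub(r'\n$', '', line) for line in sample.splitlines(keepends=True)
)

print("joined without newlines:", img_attrs(joined))
print("newlines left in place: ", img_attrs(sample))
```

If this is what is happening, reading the file with `html_path.read_text(encoding=charset)` and passing the text through unchanged, or joining the lines with a space, would keep the attributes apart. Whether it also explains the missing images under `lxml` has not been verified here.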

David S

Sep 20, 2022, 3:21:30 PM

to beautifulsoup

Thank you for the feedback. Sorry for not replying sooner. Somehow I missed the notification of your reply.

Is it possible to get different results because I'm on a Windows machine? When I run the same code, the line `print(soup.select('img'))` gives me an empty list, regardless of which of the 3 parsers I use.

If you think this is possibly the problem, then I'll create a Linux Docker image and see if that fixes it.

Note: The .html contents are purposely garbled/encoded for the sake of privacy.

Regards,

David

Isaac Muse

Sep 20, 2022, 3:55:22 PM

to beautifulsoup

Running the exact script that I posted, just changing the string from lxml to html5lib or html.parser, I got the same results on Windows. I’m not sure if it could be caused by a specific version of libraries, but I’m using the latest of them.
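To compare library versions between two machines, something like the following could be run on each. It is only a sketch that uses the standard-library `importlib.metadata`, and the `installed_versions` helper name is made up for this example. The package names are the PyPI distribution names, which can differ from the import names (`beautifulsoup4` rather than `bs4`).

```python
from importlib import metadata


def installed_versions(packages=("beautifulsoup4", "lxml", "html5lib")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Posting this output alongside the Python version (`python --version`) makes it easier to tell whether two environments are actually running the same parser code.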

David S

Sep 23, 2022, 8:41:58 PM

to beautifulsoup

Thank you very much for all of your help!

I ran the file in a container with the latest versions of all the packages, and I was able to use the 'html.parser' to find the img tags.