How to get all the text and images from this html?


David S

Sep 15, 2022, 7:41:43 PM
to beautifulsoup
Hello. 

Thank you for providing this help for BeautifulSoup.

I have been unsuccessful at getting the soup object to contain all the text from the document here. This is just one of several similar documents with which I'm having this same issue. All of these HTML documents were generated from .DOCX using the Microsoft Word "Save As Web Page Filtered" option shown in the screenshot below.

docx_save_as_webpage_filtered.PNG

Here is the code I am using:

```python
from pathlib import Path
from bs4 import BeautifulSoup
import re

html_path = Path('path/to/file/Doc10000000.htm')
# I have verified this is the correct encoding from the original .docx file
# that this .htm document was generated from
charset = 'windows-1252'
with html_path.open('r', encoding=charset, newline=None, errors='replace') as f:
    # read data and remove newline characters at the end of each line
    data = ''.join([re.sub(r'\n$', '', line) for line in f])
    soup = BeautifulSoup(data, parser='lxml')
```

Troubleshooting steps I've taken:
(1) Replaced the 'lxml' parser with each of the other parsers listed in the BeautifulSoup documentation to see if that fixes the problem.
(2) Used the diagnose function to troubleshoot. Interestingly, the output of diagnose(data) contains all the text and images that I am interested in, even though the soup object does not. This leads me to believe that the parser is not causing the issue.
(3) Opened the HTML file in a browser to check whether the HTML is poorly formed. Google Chrome has no problem opening the file, with all the text and images properly formatted.
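
Step (1) can be made repeatable with a small helper. The following is a minimal sketch, not code from the thread: `count_images` is a hypothetical name, and it assumes the parser name is passed as the second positional argument (the `features` parameter), which is how the BeautifulSoup documentation specifies a parser. Parsers that are not installed are reported as `None` instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_images(markup, parsers=('lxml', 'html.parser', 'html5lib')):
    """Return {parser name: number of <img> tags found}, or None if unavailable."""
    counts = {}
    for name in parsers:
        try:
            # The parser name goes in the second positional argument (features).
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Raised when the requested parser library is not installed.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all('img'))
    return counts


# Hypothetical two-image fragment, used only to exercise the helper.
counts = count_images('<p><img src="a.png"><img src="b.png"></p>')
for name, n in counts.items():
    print(f'{name}: {n}')
```

Running it on the real `data` string would show at a glance whether the parsers disagree about the number of images.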

Thank you again for your help.

Regards,
David

Isaac Muse

Sep 15, 2022, 9:11:22 PM
to beautifulsoup
Without something testable, I'm afraid it is going to be very difficult to help. I can only guess what your problem may or may not be. The document format could be off, you could be analyzing the document incorrectly, or it could be an issue with BeautifulSoup.

If it is a parsing issue in BeautifulSoup, it is impossible to debug it without a failing document. Also, it would be interesting to see how you are attempting to extract the text and images as well.

David S

Sep 15, 2022, 10:52:21 PM
to beautifulsoup
Hi Faceless,

Thanks for the quick reply.
A testable document can be found here: https://drive.google.com/drive/folders/1SKpKnu7gZpsH2OlCvRMrPEUnvbL2iIIi. It's the same document I provided in my previous comment. If you are looking for something else, please let me know.
Regarding your last statement about how I am "attempting to extract the text and images": the text and image references are not found in the soup object, so I am unable to extract them in any manner.

I look forward to hearing more after you have the chance to look at the document. Thanks again.

Regards,
David

Isaac Muse

Sep 16, 2022, 12:41:13 AM
to beautifulsoup
So, I'm getting a ton of text out:

```python
from pathlib import Path
from bs4 import BeautifulSoup
import re

html_path = Path('Doc10000000.htm')
charset = 'windows-1252'

with html_path.open('r', encoding=charset, newline=None, errors='replace') as f:
    data = ''.join([re.sub(r'\n$', '', line) for line in f])
    soup = BeautifulSoup(data, 'lxml')

print(soup.text)
print(soup.select('img'))
```

Truncated, but you get the idea.

```text
   MRTRS RXCVXCRTMXR   RXRTCRTX RR RXMVCRMRRTR  RR MRTRCMRSRY R RR VCRYTRRMÓR RR YRCMMRMXY     Rzgzqxzz #: RC1876031_______ Txeex Rxxezxvx:  1 bx xvzdzz  bx 2009 Txeex bx Rxqxqxexóg: 1 bx xvzdzz  bx 2010 Rzgzqxzz bx ezgxxbxgexxxxbxb  #: 6535402   RXMVCRRXC: Mgzxx Rzqqzqxzxzg (g zzbxd dad daqdxbxxqxxd  g xxxxxxbxd bx Mgzxx, x xxd qax xg xz daexdxvz dx xxd bxgzqxgxqé xgbxdzxgzxqxgzx  ezqz xx “Rzqqqxbzq” z “Mgzxx”).    MRTRS RXCVXCRTMXR VXCRPRYR RDCRRMRRT  – DXXRY RRR YRCMMRRY     Rvqxxqxgz #: RC1876031_________ Yxvgxzaqx Rxzx: Ravadz 1dz  2009 Rxqxqxzxzg Rxzx: Ravadz 1dz 2010 RRRR # 6535402   MXRRC:  Mgzxx  Rzqqzqxzxzg (xgb xxx Mgzxx daqdxbxxqxxd xgb xxxxxxxzxd, exqxxgxxzxq "Magxq"  zq "Mgzxx").       VCXMRRRXC: Mxxvzd  YR.(xg xz daexdxvz xx "Vqzvxxbzq"). Rxqxeexóg: Mzqxgz 1145, 3xq Vxdz  Maxgzd Rxqxd, Rqvxgzxgx. R1091RRC (xg xz daexdxvz xx "Vqzvxxbzq"). Rgxxzd qax dx xzagzxg x xgezqqzqxg xx qqxdxgzx  (Mxqqax ezg agx “Y” xzd qax qxdaxzxg xqxxexqxxd):   Téqqxgzd g ezgbxexzgxd bxx Rzgzqxzz bx Rzqqqxvxgzx  bx Mxzxqxxxxd g bx qqxdzxexóg bx dxqvxexzd.     R     Rxdeqxqexóg bxx Vqzbaezz g Rxxxgbxqxz  bx qxvzd.    M Rdzégbxqxd bx Rxdxqqxñz / Cxqaxqxqxxgzz  bx exxxbxb.  R     Rxdqzdxexzgxd ezqqxxqxgzxqxxd.  R     Rxqxezxvxd qxqx ag xqqxxgzx bx zqxqxzz  xxqqx bx xxezezx g bx bqzvxd.  R     Sxexgexx bx adz bx dzxzwxqx.  T     Vqzzxeexóg bx xzd xezxvzd bx xgxzqqxexóg  bxx Rzqqqxbzq.  D
```

So, I'm not quite sure what the issue is. Is the content purposely garbled/encoded like this? I don't so much care what the data actually is, only whether it is supposed to be like what is shown. 

Now, I didn't get images with `lxml`, but switching to `html.parser` or `html5lib`, I was able to get images, so `lxml` seems to have an issue with something in the file:

```python
[<img border="0width=328" height="183" id="Picture 1" src="Doc10000000_files/image001.png"/>, <img src="Doc10000000_files/image002.png" width="329height=1"/>, <img height='1src="Doc10000000_files/image002.png"' width="329"/>, <img border="0width=329" height="193" id="Picture 6" src="Doc10000000_files/image003.jpg"/>, <img height='1src="Doc10000000_files/image002.png"' width="329"/>, <img height='1src="Doc10000000_files/image002.png"' width="329"/>, <img height='1src="Doc10000000_files/image002.png"' width="329"/>, <img height='1src="Doc10000000_files/image002.png"' width="329"/>]
```
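
The fused attributes in that output (`border="0width=328"`, `height='1src=...'`) suggest one possible cause that the thread does not confirm: if the source file wraps unquoted attributes across lines, then deleting each trailing newline without putting a space in its place glues neighbouring attributes together, and `src` ends up buried inside another attribute's value. The sketch below reproduces the effect with `html.parser` on a hypothetical fragment (the markup and file name are invented for illustration) and compares it with parsing the text unchanged.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment imitating HTML whose unquoted attributes wrap
# onto the next line; this is an assumption about the real file's layout.
sample = '<p><img border=0\nwidth=328 height=183\nsrc="Doc_files/image001.png"></p>'

# Same newline stripping as in the thread: lines are joined with nothing between them.
joined = ''.join(re.sub(r'\n$', '', line) for line in sample.splitlines(keepends=True))

fused = BeautifulSoup(joined, 'html.parser').img   # newlines removed
intact = BeautifulSoup(sample, 'html.parser').img  # markup left as-is

print('src' in fused.attrs)   # the src attribute is swallowed by "height"
print(intact.attrs)           # border, width, height and src stay separate
```

If this is what happens with the real document, passing the file contents to BeautifulSoup unmodified, or replacing each newline with a space rather than deleting it, should keep the attributes separate.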

David S

Sep 20, 2022, 3:21:30 PM
to beautifulsoup
Thank you for the feedback. Sorry for not replying sooner. Somehow I missed the notification of your reply.

Is it possible that I get different results because I'm on a Windows machine? When I run the same code, print(soup.select('img')) gives an empty list, regardless of which of the 3 parsers I use.
If you think this could be the problem, I'll create a Linux Docker image and see if that fixes it.

Note: The .html contents are purposely garbled/encoded for the sake of privacy.

Regards,
David

Isaac Muse

Sep 20, 2022, 3:55:22 PM
to beautifulsoup

Running the exact script that I posted, just changing the string from lxml to html5lib or html.parser, I got the same results on Windows. I’m not sure if it could be caused by a specific version of libraries, but I’m using the latest of them.
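
To compare library versions between two machines, something like the following could be run on each. It is only a sketch: `installed_versions` is a hypothetical helper, and the package names are the usual PyPI distribution names for BeautifulSoup and the parsers discussed here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=('beautifulsoup4', 'lxml', 'html5lib', 'soupsieve')):
    """Return {distribution name: version string}, or None if not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions()
for name, ver in versions.items():
    print(f'{name}: {ver or "not installed"}')
```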

David S

Sep 23, 2022, 8:41:58 PM
to beautifulsoup
Thank you very much for all of your help! 

I ran the file in a container with the latest versions of all the packages, and I was able to use the 'html.parser' to find the img tags.

Regards,
David 
