lxml and html.parser output differs


Per Göttlicher

Apr 12, 2024, 8:48:26 AM
to beautifulsoup
Hi,
When I parse a webpage with a SoupStrainer applied, I get different output depending on whether I use lxml or html.parser.
When using html.parser I correctly get empty output when nothing matches the strainer, but with lxml I get "<!DOCTYPE html>". I do not know why this happens, and it makes the rest of my code fail every time because I do not expect "<!DOCTYPE html>" to be in the soup, as it does not match the strainer.
Is there anything I can do besides explicitly catching this case?

Per Göttlicher

Apr 12, 2024, 8:52:30 AM
to beautifulsoup

The SoupStrainer in question:
SoupStrainer("form", attrs={"class": "suggestion_form"})

I wouldn't expect this to parse "<!DOCTYPE html>" at all. It should be omitted in the output.
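
For reference, "explicitly catching this case" could look roughly like the sketch below. It is only an illustration: the helper name strain_without_doctype and the default parser argument are made up for this example, and dropping top-level Doctype nodes after parsing is a workaround, not something Beautiful Soup does on its own.

```python
from bs4 import BeautifulSoup, Doctype, SoupStrainer

# The strainer from this thread.
FORM_STRAINER = SoupStrainer("form", attrs={"class": "suggestion_form"})


def strain_without_doctype(markup, parser_name="html.parser"):
    """Hypothetical helper: parse with the strainer, then drop any doctype.

    Assumption: the only unwanted leftover is a top-level Doctype node.
    """
    soup = BeautifulSoup(markup, parser_name, parse_only=FORM_STRAINER)
    # Copy the list first, because extract() mutates soup.contents.
    for node in list(soup.contents):
        if isinstance(node, Doctype):
            node.extract()
    return soup
```

Calling strain_without_doctype(page, "lxml") should then give the same top-level contents as the html.parser run, whether or not the installed version lets the doctype through.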

Isaac Muse

Apr 15, 2024, 12:04:02 PM
to beautifulsoup
Can you post a minimal example?

leonardr

Apr 17, 2024, 8:37:10 AM
to beautifulsoup
Thanks for writing in with your issue. You've found a bug in Beautiful Soup. I've filed it as issue 2062000.

Leonard

Per Göttlicher

Apr 17, 2024, 8:56:31 AM
to beauti...@googlegroups.com
Thanks for filing the bug report! I was unsure if this was actually a bug or just a weird quirk of lxml as there are documented differences between parsers.
Just to have a minimal example in this thread:
<!DOCTYPE html>
<html>
</html>

The exact strainer is pretty irrelevant but as an example:
SoupStrainer("p")
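
Put together as a script, the document and strainer above could be compared across both parsers roughly like this. It is a sketch: the names MARKUP, STRAINER and strained_output are made up for the example, and it assumes lxml may or may not be installed. As reported earlier in this thread, html.parser gives an empty result while lxml kept "<!DOCTYPE html>"; what lxml prints may depend on the installed Beautiful Soup version.

```python
from bs4 import BeautifulSoup, FeatureNotFound, SoupStrainer

# The minimal document from this message.
MARKUP = "<!DOCTYPE html>\n<html>\n</html>\n"

# Nothing in MARKUP matches this strainer.
STRAINER = SoupStrainer("p")


def strained_output(parser_name):
    """Parse MARKUP with the named parser, keeping only strainer matches."""
    soup = BeautifulSoup(MARKUP, parser_name, parse_only=STRAINER)
    return str(soup)


if __name__ == "__main__":
    for parser_name in ("html.parser", "lxml"):
        try:
            print(parser_name, "->", repr(strained_output(parser_name)))
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is missing.
            print(parser_name, "-> parser not installed")
```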

Per
