What is beautifulsoup built on?


Julius Hamilton

Sep 20, 2021, 8:24:32 AM
to beautifulsoup
Hey,

I would like to write a script which extracts the text content from similarly structured webpages (the documentation pages for Microsoft Visual Basic for Applications).

I understand Beautiful Soup is a standard tool designed precisely for this, i.e. that it could select a specific HTML node to extract text from.

However, I was curious: what is the purest, most ubiquitous underlying tool for this application, just for the simple action of going to an HTML node and retrieving the text there? What is Beautiful Soup built on? I thought it might be XPath, an XML parser, but so far it seems like XPath is used inside HTML documents rather than outside them.

Thanks very much.

leonardr

Sep 20, 2021, 9:55:33 AM
to beautifulsoup


Hi,

Beautiful Soup uses an external HTML parser (either html.parser, html5lib, or lxml) to parse documents into an internal DOM. For the thing you're talking about -- going to a node of the DOM after the document has been parsed -- Beautiful Soup exposes a custom Python API. Sometimes other developers copy this API for Beautiful Soup-like libraries in other languages, but it's ultimately something I made up in 2004.
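To make the two stages concrete (an external parser builds the tree, then Beautiful Soup's own API walks it), here is a minimal sketch. The markup, the `id="main"` value, and the variable names are made up for illustration; a real documentation page will be structured differently.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a documentation page.
html = """
<html><body>
  <div id="main">
    <h1>Example heading</h1>
    <p>Returns or sets the value of the <b>specified</b> range.</p>
  </div>
</body></html>
"""

# Stage 1: hand the document to a parser. "html.parser" ships with Python;
# "lxml" or "html5lib" can be named instead if those packages are installed.
soup = BeautifulSoup(html, "html.parser")

# Stage 2: navigate the parsed tree with Beautiful Soup's Python API
# and pull the text out of the chosen nodes.
main = soup.find("div", id="main")
title = main.h1.get_text(strip=True)
body = main.p.get_text(" ", strip=True)

print(title)
print(body)
```

Switching parsers only changes the second argument to `BeautifulSoup(...)`; the navigation code stays the same.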

There are other, language-independent ways of navigating the DOM: CSS selectors (which Beautiful Soup supports through the soupsieve package) and XPath (which Beautiful Soup doesn't support but lxml does).
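As a rough side-by-side of those two options, the sketch below selects the same node once with a CSS selector through Beautiful Soup and once with XPath through lxml used directly. It assumes both `bs4` and `lxml` are installed; the markup, the `summary` class, and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup
from lxml import html as lxml_html

# Made-up page used for both approaches.
page = (
    '<html><body><div id="main">'
    "<h1>Example heading</h1>"
    '<p class="summary">First paragraph.</p>'
    "</div></body></html>"
)

# CSS selector through Beautiful Soup's select_one().
soup = BeautifulSoup(page, "html.parser")
css_text = soup.select_one("div#main p.summary").get_text()

# XPath through lxml itself; Beautiful Soup is not involved here.
tree = lxml_html.fromstring(page)
xpath_text = tree.xpath('//div[@id="main"]/p[@class="summary"]/text()')[0]

print(css_text)
print(xpath_text)
```

Both expressions address the same paragraph, so which one to use is mostly a matter of which library is already in the project.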

Leonard

Rogerio Cardin

Oct 3, 2021, 11:00:13 AM
to beauti...@googlegroups.com
Hi everyone, I'm Brazilian and would like an opportunity to do Python development using the Beautiful Soup library. If anyone knows of one and is willing to help me, I will work for free in exchange for the knowledge.

Thank you

Rogerio Cardin
 

 

