Possible to obtain root element from lxml?

Jeroen Janssens

May 10, 2012, 2:11:53 PM
to beauti...@googlegroups.com
Dear all,

Let me start by saying that Beautiful Soup 4 is a wonderful package. I have used BS4 to implement a script that extracts the main text from a news article and it works great, especially with Unicode. That script is now being merged with a bunch of other scripts that use plain lxml and no BS4.

In terms of efficiency, it would be best if the html would only be parsed once. There may be two possibilities to accomplish this. First, BS4 could be given the tree as parsed by lxml. Second, the tree should be obtainable once BS4 has processed and parsed the html. I think that the former is not possible since BS4 does some Unicode magic to the html (correct me if I'm wrong about this). So I'm now hoping that the latter is possible.

I have looked at the source code of BS4 (builder/_lxml.py) and at the lxml documentation. The documentation mentions that after the parser has been fed some markup, parser.close() should return the root element. However, I'm only getting "None". Any help on this or some other way to parse the html only once is much appreciated. Thank you.

Best wishes,

Jeroen

leonardr

May 24, 2012, 12:26:09 PM
to beautifulsoup
> In terms of efficiency, it would be best if the html would only be parsed
> once. There may be two possibilities to accomplish this. First, BS4 could
> be given the tree as parsed by lxml. Second, the tree should be obtainable
> once BS4 has processed and parsed the html. I think that the former is not
> possible since BS4 does some Unicode magic to the html (correct me if I'm
> wrong about this). So I'm now hoping that the latter is possible.
>
> I have looked at the source code of BS4 (builder/_lxml.py) and at the lxml
> documentation. The documentation mentions that after the parser has been
> fed some markup, parser.close() should return the root element. However,
> I'm only getting "None". Any help on this or some other way to parse the
> html only once is much appreciated. Thank you.

Beautiful Soup uses lxml's target parser interface, which does not
create any in-memory data structure on its own. From
http://lxml.de/parsing.html#the-target-parser-interface:

"Note that the parser does not build a tree when using a parser
target. The result of the parser run is whatever the target object
returns from its .close() method."

At the end of the process, there is no lxml root element, only a
Beautiful Soup root element.
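
To illustrate the quoted note, here is a minimal sketch of a parser target. It uses the standard library's xml.etree.ElementTree.XMLParser so that it runs without extra packages; lxml documents the same target interface, so the class should also work with lxml.etree.XMLParser(target=...), though that is not tested here. TagCollector is a made-up name for this example, not part of either library.

```python
from xml.etree.ElementTree import XMLParser


class TagCollector:
    """Hypothetical parser target that records tag names instead of building a tree."""

    def __init__(self):
        self.tags = []

    def start(self, tag, attrib):
        # Called once for every opening tag.
        self.tags.append(tag)

    def end(self, tag):
        # Called once for every closing tag; nothing to do here.
        pass

    def data(self, text):
        # Called for text content; ignored in this sketch.
        pass

    def close(self):
        # Whatever is returned here becomes the result of parser.close().
        return self.tags


parser = XMLParser(target=TagCollector())
parser.feed("<html><body><p>Hello</p></body></html>")
result = parser.close()
print(result)
```

Here result is a plain list of tag names and no element tree is ever created. A target whose close() returns nothing would make parser.close() return None in the same way.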

You could write your own target parser that passed each incoming event
to a Beautiful Soup tree builder *and* an ElementTree-compatible tree
builder. This would give you two different trees from a single pass
through the document. But it would require some hacking.
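
A rough sketch of that idea, under some loud assumptions: TeeTarget and OutlineBuilder are hypothetical names, OutlineBuilder merely stands in for a Beautiful Soup tree builder, and the standard library's XMLParser and TreeBuilder are used instead of lxml so the example runs anywhere. Only start, end, data and close are forwarded; comments, processing instructions and doctypes would need the same treatment.

```python
from xml.etree.ElementTree import TreeBuilder, XMLParser


class OutlineBuilder:
    """Hypothetical stand-in for a Beautiful Soup tree builder."""

    def __init__(self):
        self.lines = []
        self.depth = 0

    def start(self, tag, attrib):
        # Record the tag, indented by its nesting depth.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def end(self, tag):
        self.depth -= 1

    def data(self, text):
        pass

    def close(self):
        return self.lines


class TeeTarget:
    """Forwards every parser event to each of the wrapped targets."""

    def __init__(self, *targets):
        self.targets = targets

    def start(self, tag, attrib):
        for target in self.targets:
            # Give each target its own copy of the attributes.
            target.start(tag, dict(attrib))

    def end(self, tag):
        for target in self.targets:
            target.end(tag)

    def data(self, text):
        for target in self.targets:
            target.data(text)

    def close(self):
        # One result per wrapped target, in the order they were given.
        return tuple(target.close() for target in self.targets)


parser = XMLParser(target=TeeTarget(OutlineBuilder(), TreeBuilder()))
parser.feed("<html><body><p>Hello</p></body></html>")
outline, root = parser.close()
print(outline, root.tag)
```

A single pass over the markup yields both results: outline from the stand-in builder and root, an ElementTree element, from TreeBuilder. Wiring in the real Beautiful Soup builder would take more work, since it expects to be driven by a BeautifulSoup object.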

Leonard