> In terms of efficiency, it would be best if the html would only be parsed
> once. There may be two possibilities to accomplish this. First, BS4 could
> be given the tree as parsed by lxml. Second, the tree should be obtainable
> once BS4 has processed and parsed the html. I think that the former is not
> possible since BS4 does some Unicode magic to the html (correct me if I'm
> wrong about this). So I'm now hoping that the latter is possible.
>
> I have looked at the source code of BS4 (builder/_lxml.py) and at the lxml
> documentation. The documentation mentions that after the parser has been
> fed some markup, parser.close() should return the root element. However,
> I'm only getting "None". Any help on this or some other way to parse the
> html only once is much appreciated. Thank you.

Beautiful Soup uses lxml's target parser interface, which does not
build any in-memory tree of its own. From
http://lxml.de/parsing.html#the-target-parser-interface:

  "Note that the parser does not build a tree when using a parser
  target. The result of the parser run is whatever the target object
  returns from its .close() method."

Here Beautiful Soup is the target, so parser.close() hands back
whatever Beautiful Soup's close() method returns, which is why you
are getting None. At the end of the process, there is no lxml root
element, only a Beautiful Soup root element.
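
To illustrate the quoted note, here is a minimal sketch of a parser
target. The TagCollector class is made up for this example and is not
part of lxml or Beautiful Soup. The import falls back to the standard
library's xml.etree.ElementTree, which offers the same target
interface, in case lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # The stdlib parser exposes the same ElementTree-style target interface.
    import xml.etree.ElementTree as etree


class TagCollector:
    """Hypothetical target: records tag names instead of building a tree."""

    def __init__(self):
        self.tags = []

    def start(self, tag, attrib):
        self.tags.append(tag)

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        # Whatever is returned here becomes the result of parser.close().
        return self.tags


parser = etree.XMLParser(target=TagCollector())
parser.feed("<doc><p>hi</p><p>there</p></doc>")
result = parser.close()
print(result)  # ['doc', 'p', 'p']
```

The result is a plain list, not an element. If close() returned
nothing, parser.close() would give None. lxml's etree.HTMLParser takes
a target= argument in the same way.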

You could write your own target parser that passes each incoming event
to a Beautiful Soup tree builder *and* an ElementTree-compatible tree
builder. That would give you two different trees from a single pass
through the document, but it would require some hacking.
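
A rough sketch of that idea, under assumptions: TeeTarget and
OutlineTarget are names invented here. OutlineTarget is only a
stand-in for the Beautiful Soup side; wiring in Beautiful Soup's real
lxml tree builder is the hacking part and is not shown. Comments,
processing instructions and doctypes are not forwarded in this sketch.

```python
try:
    from lxml import etree
except ImportError:
    # The stdlib parser exposes the same ElementTree-style target interface.
    import xml.etree.ElementTree as etree


class OutlineTarget:
    """Stand-in for a second tree builder: records (depth, tag) pairs."""

    def __init__(self):
        self.depth = 0
        self.outline = []

    def start(self, tag, attrib):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def end(self, tag):
        self.depth -= 1

    def data(self, data):
        pass

    def close(self):
        return self.outline


class TeeTarget:
    """Hypothetical target that forwards every event to several targets."""

    def __init__(self, *targets):
        self.targets = targets

    def start(self, tag, attrib):
        for target in self.targets:
            # Give each target its own copy of the attributes.
            target.start(tag, dict(attrib))

    def end(self, tag):
        for target in self.targets:
            target.end(tag)

    def data(self, data):
        for target in self.targets:
            target.data(data)

    def close(self):
        # One result per wrapped target, in the order they were given.
        return tuple(target.close() for target in self.targets)


tee = TeeTarget(etree.TreeBuilder(), OutlineTarget())
parser = etree.XMLParser(target=tee)
parser.feed("<doc><p>hi</p><br/></doc>")
root, outline = parser.close()

print(root.tag, [child.tag for child in root])
print(outline)
```

Here root is an ordinary element tree built by etree.TreeBuilder, and
outline is whatever the second target chose to build, both produced
from a single pass over the markup.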
Leonard