I'm trying to use BeautifulSoup 4.1.3 with Jython (Jython 2.7beta1, since the stable Jython release corresponds to CPython 2.5).
As part of my upgrade from BeautifulSoup 3 to 4, I see that I need to install a parser. Since I'm on Jython, it has to be one with no C dependency.
How do I convince BeautifulSoup to use the native Python HTML parser? Passing 'html.parser' fails (see below), I assume because no module of that name exists in Python 2.7, where the stdlib module is called HTMLParser. I tried 'HTMLParser' as well and that doesn't work either.
>>> BeautifulSoup.BeautifulSoup(content, 'html.parser')
raise FeatureNotFound(
BeautifulSoup.FeatureNotFound: Couldn't find a tree builder with the features you requested: html.parser. Do you need to install a parser library?
That pretty much leaves me with html5lib. That won't work either, because inputstream.py in that library contains characters in the range 0xD800-0xDFFF (unpaired surrogates), and Jython cannot read the file. The Jython bug has been closed as won't-fix: http://bugs.jython.org/issue1836
I've also opened an issue with html5lib: https://code.google.com/p/html5lib/issues/detail?id=220
So I'm back to the native Python html parser - how do I use it in 2.7 without calling it 'html.parser'?
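To take BeautifulSoup out of the picture, here is a minimal sketch that exercises the standard-library parser directly. It assumes the module is importable as HTMLParser on Python 2.x and as html.parser on Python 3; LinkCollector and collect_links are made-up names for illustration, and whether this behaves the same on Jython 2.7beta1 is not verified here.

```python
# Import the stdlib HTML parser under whichever name this interpreter uses.
try:
    from HTMLParser import HTMLParser  # Python 2.x module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        # Explicit base call; HTMLParser is an old-style class on Python 2.
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


def collect_links(markup):
    """Feed markup to the stdlib parser and return the hrefs it saw."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

If this runs, the stdlib parser itself is available, and the remaining question is only how to get BeautifulSoup to select it.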
I suppose since it's Jython there should be a way to use a Java-native parser, right?