Re: cannot find an html parser in Jython


Aaron DeVore

Mar 3, 2013, 8:12:58 AM
to beauti...@googlegroups.com
Just don't pass in a parser argument. Beautiful Soup will automatically pick one that it prefers.
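
A minimal sketch of that suggestion, assuming Beautiful Soup 4 is importable as `bs4`. The helper name `parse_with_default` is made up for illustration. It builds the soup with no parser argument and reports which tree builder Beautiful Soup chose; newer releases may print a warning when no parser is named.

```python
def parse_with_default(markup):
    """Parse markup without naming a parser and report the chosen builder."""
    # Imported inside the function so a missing install shows up as an
    # ImportError at call time rather than when this file is loaded.
    from bs4 import BeautifulSoup

    # No second argument: Beautiful Soup selects a tree builder from
    # whatever parser libraries it can find in this interpreter.
    soup = BeautifulSoup(markup)

    # soup.builder is the tree builder instance that was selected.
    return soup, type(soup.builder).__name__
```

Calling `parse_with_default(content)` returns the soup together with the builder's class name, which shows whether the built-in parser or an installed third-party one was picked.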


On Fri, Mar 1, 2013 at 8:03 PM, straz <st...@strassmann.com> wrote:

I'm trying to use BeautifulSoup 4.1.3 with Jython (I'm using Jython 2.7beta1, since the stable release corresponds to CPython 2.5).

As part of my upgrade of BeautifulSoup from 3 to 4, I see I need to install a parser. And since I'm using Jython, I need to install one with no C dependency.

How do I convince BeautifulSoup to use the native Python HTML parser? I can't use 'html.parser' because that name is not valid in Python 2.7. I tried 'HTMLParser' and that doesn't work either.

>>> BeautifulSoup.BeautifulSoup(content, 'html.parser')
    raise FeatureNotFound(
BeautifulSoup.FeatureNotFound: Couldn't find a tree builder with the features you requested: html.parser. Do you need to install a parser library?

That pretty much leaves me trying to use html5lib. That won't work either, because inputstream.py in that library contains characters in the range 0xD800-0xDFFF, known as "unpaired surrogates", and Jython cannot read the file. The Jython bug has been closed as won't-fix: http://bugs.jython.org/issue1836

I've opened this as an issue with html5lib (https://code.google.com/p/html5lib/issues/detail?id=220)

So I'm back to the native Python HTML parser. How do I use it in 2.7 without calling it 'html.parser'?

I suppose since it's Jython there should be a way to use a Java-native parser, right?


Leonard Richardson

Mar 3, 2013, 9:18:00 AM
to beauti...@googlegroups.com
I don't know if this will help. If Jython doesn't have HTMLParser
there's probably no parser it can use.

However, it's possible there's just a problem _loading_ HTMLParser
from the name 'html.parser'. So try Aaron's suggestion, and if that
doesn't work see what happens when you 'import HTMLParser'.
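
A rough way to run that check, using only the standard library. The names `TextCollector` and `extract_text` are made up for illustration. The sketch tries the Python 2 module name first, which is what Jython 2.x would use, falls back to the Python 3 name, and then feeds a small document through the parser to confirm that it runs and not just that it imports.

```python
try:
    import HTMLParser as parser_module   # Python 2.x / Jython 2.x module name
except ImportError:
    import html.parser as parser_module  # Python 3 module name


class TextCollector(parser_module.HTMLParser):
    """Collects text nodes so we can see that parsing really happened."""

    def __init__(self):
        # Explicit base-class call: HTMLParser is an old-style class on
        # Python 2, so super() cannot be used there.
        parser_module.HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    """Feed markup through the standard-library parser and return its text."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)
```

If the import at the top fails under Jython, the traceback should show whether the module is missing or merely fails to load.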

Leonard