Alternatives for JSoup

Amrutha

Jul 24, 2012, 8:10:25 PM

to common...@googlegroups.com

Hey guys,

I've been running MapReduce jobs over the ARC files, and I find that parsing with JSoup consumes a lot of time, though I do like JSoup's API. Is it slow because of the volume of data in a single ARC file? I would greatly appreciate any suggestions for a faster alternative to JSoup. I tried a few other parsers today, but none of them worked well for me.

Thanks,

Amrutha

Chris Stephens

Aug 6, 2012, 2:11:51 PM

to common...@googlegroups.com

Hi Amrutha,

I didn't see anyone respond to this. I've heard a number of people say good things about Apache Tika. If you are still having trouble with JSoup, I'd suggest giving Tika a try.

- Chris

Tom Morris

Aug 6, 2012, 2:26:11 PM

to common...@googlegroups.com

Boilerpipe might be another alternative. More answers here:
http://www.quora.com/What-open-source-projects-can-be-used-for-extracting-relevant-content-from-various-webpages

Tom

jaidee...@gmail.com

Aug 6, 2012, 9:55:07 PM

to common...@googlegroups.com

Hi Amrutha,

The right parser depends on what you want to do with the pages. JSoup is particularly good when you want to extract information from specific tags. It has a very rich selector syntax and gives you all the DOM functions. However, I have seen it throw stack overflow errors on complex pages.

If you want to extract text, however, Apache Tika or Boilerpipe is a better choice, as others have suggested.

NekoHTML is also an alternative.

Thanks,

Jaideep

--
Jaideep Dhok

Pete Warden

Aug 6, 2012, 10:08:38 PM

to common...@googlegroups.com

I would also consider getting down-and-dirty with regexes if speed really is a bottleneck for you. You can get surprisingly far with some simple tag-stripping functionality if you ignore anything inside head, script, and similar tags, and you're only using the results in aggregate. The usual caveats apply, of course, but it's worth evaluating to see if the benefits of a brute-force-and-ignorance approach outweigh the problems for your application.
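
For anyone who wants to try the tag-stripping idea from Java (for example inside a Hadoop mapper), here is a minimal sketch. The class name `TagStripper`, the method `stripTags`, and the exact patterns are illustrative assumptions, not code taken from cc2text or from any parser library. It relies only on `java.util.regex`.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // Elements whose contents are not visible text: drop the tags and everything inside.
    private static final Pattern IGNORED_BLOCKS = Pattern.compile(
            "<(head|script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // HTML comments, which may span several lines.
    private static final Pattern COMMENTS = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Any remaining tag; assumes no literal '>' inside attribute values.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns a rough plain-text rendering of the given HTML. */
    static String stripTags(String html) {
        String text = IGNORED_BLOCKS.matcher(html).replaceAll(" ");
        text = COMMENTS.matcher(text).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        // Collapse the runs of whitespace left behind by the removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>T</title><script>var x = 1;</script></head>"
                + "<body><h1>Hello</h1><p class=\"a\">World <b>again</b></p></body></html>";
        System.out.println(stripTags(html)); // prints "Hello World again"
    }
}
```

This sketch does not decode entities such as `&amp;`, and it will mishandle malformed markup, so it only fits cases where the text is used in aggregate. It also does nothing for selecting tags by class or attribute, where a real parser is still the safer choice.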

I've done something a little less ugly with hpricot in Ruby here:

https://github.com/petewarden/cc2text

cheers,

Pete

Amrutha Rajiv

Aug 7, 2012, 11:53:05 AM

to common...@googlegroups.com

Thanks, all of you, for your suggestions.
What I need to do is get information from tags with specific class
values or attributes. JSoup has awesome functionality that supports
this. I compared it to some other parsers like Jericho, HTMLParser,
and Validator.nu, and JSoup turned out to be faster than all of
them. NekoHTML ended up giving me a stack overflow error. I wasn't
successful in using Apache Tika for this task, but I think I'll
give it another go. I'll try using regexes and Boilerpipe as
well; it will definitely be great if any of these end up being
faster than JSoup.

Thanks,
Amrutha
