Re: Lightweight lib/way to strip html from text


Michael Klishin

Sep 6, 2012, 10:45:57 PM
to clo...@googlegroups.com
2012/9/6 jamieorc <jami...@gmail.com>:
> Hey all, I'm looking for a lightweight way to strip html from a long String of text and leave just the text. I've come across JSoup, but at over 300kb for the lib, not quite lightweight.
>
> Suggestions?

JSoup is a good way to do it. If you need to identify the "main" part of a Web page, Boilerplate
is a great library. Because Boilerplate is such a pain to get started with (both dependency- and documentation-wise), I highly suggest that you use Crawlista for this:


300kb does not sound like a lot. The JVM will only load the classes that are actually used.
--
MK

Richard Lyman

Sep 6, 2012, 10:55:19 PM
to clo...@googlegroups.com
On Thu, Sep 6, 2012 at 11:41 AM, jamieorc <jami...@gmail.com> wrote:
> Hey all, I'm looking for a lightweight way to strip html from a long String
> of text and leave just the text. I've come across JSoup, but at over 300kb
> for the lib, not quite lightweight.
>
> Suggestions?
>
> Cheers,
> Jamie
>

When you say 'html' do you mean any html that a modern or even older
browser would accept, or is a very restricted set of very clean html?

-Rich
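
The distinction asked about above matters in practice. As a rough sketch only, and not something anyone in the thread proposed: if the input really is a restricted set of clean HTML, a regex over the JDK's own java.util.regex may be enough, with no extra library. The class name NaiveHtmlStripper and the method stripTags are hypothetical names chosen for illustration, and only a handful of entities are decoded.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; assumes small, clean, well-formed HTML.
public class NaiveHtmlStripper {
    // Anything that looks like a tag: '<', a run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String stripTags(String html) {
        // Replace each tag with a space so adjacent blocks do not run together.
        String noTags = TAG.matcher(html).replaceAll(" ");
        // Decode only a few common entities; "&amp;" goes last so that
        // "&amp;lt;" ends up as "&lt;" rather than "<".
        String decoded = noTags
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&nbsp;", " ")
                .replace("&amp;", "&");
        // Collapse runs of whitespace and trim the ends.
        return WHITESPACE.matcher(decoded).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Clean input: prints "Hello, world Bye"
        System.out.println(stripTags("<p>Hello, world</p><p>Bye</p>"));
        // Messy input: the script body leaks into the output.
        System.out.println(stripTags("<script>if (a > b) alert('x');</script><p>Hi</p>"));
        // A '>' inside an attribute value cuts the tag short.
        System.out.println(stripTags("<a title=\"a > b\">link</a>"));
    }
}
```

The last two calls show where this breaks down: script content is kept as if it were text, and a '>' inside an attribute value ends the match early. For anything a browser would accept, a real parser such as the ones suggested in this thread is the safer choice.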

Timo Mihaljov

Sep 10, 2012, 2:53:53 PM
to clo...@googlegroups.com
On 06.09.2012 20:41, jamieorc wrote:
> Hey all, I'm looking for a lightweight way to strip html from a long
> String of text and leave just the text. I've come across JSoup, but at
> over 300kb for the lib, not quite lightweight.
>
> Suggestions?

I've found Jericho HTML Parser to be fast, robust, and well documented:
http://jericho.htmlparser.net/docs/index.html

Its TextExtractor class seems to do exactly what you need:
http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html
http://jericho.htmlparser.net/samples/console/src/ExtractText.java

--
Timo

Denis Labaye

Sep 11, 2012, 5:31:05 AM
to clo...@googlegroups.com
Hi, 

This thread on the Enlive mailing list may be of some interest to you:

[enlive] How to select all user visible text from webpage? 
https://groups.google.com/forum/#!msg/enlive-clj/rrY08JdI4Tc/FmDuNjc6w_oJ


Denis

On Thu, Sep 6, 2012 at 7:41 PM, jamieorc <jami...@gmail.com> wrote:
> Hey all, I'm looking for a lightweight way to strip html from a long String of text and leave just the text. I've come across JSoup, but at over 300kb for the lib, not quite lightweight.
>
> Suggestions?
>
> Cheers,
> Jamie
