Parsing HTML to get text content for email

Jarrod Swart

Mar 17, 2014, 6:44:15 PM
to enliv...@googlegroups.com
Is there a better way to do this? I am trying to parse simple HTML for emails to extract the text so I can send that as well.

(view formatted code on refheap if you prefer: https://www.refheap.com/60158)


(require '[net.cgrand.enlive-html :as e])

(def html "<html><body><h1>Header</h1><p>Some text here!</p><a href=\"sparkles.com\">link text</a></body></html>")

(defn html->text
  "Given a string of HTML, parse it and return a seq of strings representing the body text."
  [html-str]
  (->> (e/select (e/html-snippet html-str) [:body :*])
       (map e/text)))

Ed Bowler

Mar 17, 2014, 6:59:32 PM
to enliv...@googlegroups.com
Hi Jarrod,

I've done something similar in the past with code like:

(import 'java.io.StringReader)

(defn html->text [html-str]
  (->> html-str StringReader. e/html-resource e/texts))

Hope this helps,

Ed


--
You received this message because you are subscribed to the Google Groups "Enlive" group.
To unsubscribe from this group and stop receiving emails from it, send an email to enlive-clj+...@googlegroups.com.
To post to this group, send email to enliv...@googlegroups.com.
Visit this group at http://groups.google.com/group/enlive-clj.
For more options, visit https://groups.google.com/d/optout.

Jarrod Swart

Mar 17, 2014, 9:57:55 PM
to enliv...@googlegroups.com
Yeah I like it, except it returns one node and then grabs all the text at once.  

My version with select grabs all the inner nodes, so that I can interpose "\n\n" in between them.  Any way I can replicate that?

My version: string -> 5 nodes -> 3 strings ("Header" "Some text here!" "link text")
Your version: string -> 1 node -> one string ("HeaderSome text here!link text")

This is what I'm doing with the html->text function:

(defn join-lines [strs]
  (apply str (interpose "\n\n" strs)))

(join-lines (html->text html))

Any alternatives or better techniques would be great.  The select on body that I'm doing seems odd.  I basically just need to load the string and get all the nodes in the body separately.

Ed Bowler

Mar 18, 2014, 6:51:18 AM
to enliv...@googlegroups.com
Hi Jarrod,

It looks like enlive's texts and text functions are concatenating the strings together. I think you'll get what you want if you do something like this:

(require '[net.cgrand.xml :as xml])

(defn text [node]
  (cond
    (string? node) [node]
    (xml/tag? node) (map text (:content node))
    :else [""]))
    
(defn texts [nodes]
  (flatten (map text nodes)))

(defn html->text [html-str]
  (->> html-str StringReader. e/html-resource texts))

Make sense?

Ed


Christophe Grand

Mar 18, 2014, 8:41:09 AM
to enlive-clj
Something like this (untested) should work:

(sj/join "\n\n" (e/select (StringReader. html-str) [:body e/text-node]))
--
On Clojure http://clj-me.cgrand.net/
Clojure Programming http://clojurebook.com
Training, Consulting & Contracting http://lambdanext.eu/

Ed Bowler

Mar 18, 2014, 9:14:06 AM
to enliv...@googlegroups.com
Fair point. Though I think it should be:

(sj/join "\n\n" (e/select (e/html-resource (StringReader. html-str)) [:body e/text-node]))
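
For anyone reading along later, here is a minimal self-contained sketch of that one-liner. The thread never shows the requires, so it is an assumption that `e` aliases net.cgrand.enlive-html and `sj` aliases clojure.string; the namespace name is made up for illustration.

```clojure
(ns example.email-text            ; hypothetical namespace name
  (:require [net.cgrand.enlive-html :as e]   ; assumed alias for `e`
            [clojure.string :as sj])         ; assumed alias for `sj`
  (:import (java.io StringReader)))

(defn html->text
  "Parse html-str and join the text nodes found under <body>,
  putting a blank line between each one."
  [html-str]
  (sj/join "\n\n"
           (e/select (e/html-resource (StringReader. html-str))
                     [:body e/text-node])))

;; Example call using the sample document from the first message.
(html->text "<html><body><h1>Header</h1><p>Some text here!</p><a href=\"sparkles.com\">link text</a></body></html>")
```

If the real email HTML is indented, whitespace-only text nodes between tags may be selected too, so filtering out blank strings before joining might be needed.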

Jarrod Swart

Mar 19, 2014, 9:15:15 AM
to enliv...@googlegroups.com
Awesome, thanks for the help!