Autodetecting char encoding

j-g-faustus

Jun 24, 2010, 3:13:18 PM

to Enlive

Hi,

I had an issue where European characters in templates got garbled, and
found that Enlive parsing assumes UTF-8 encoding.

I changed the "AutoDetector" to read the char encoding from the XML or
HTML declaration, which worked swimmingly in my case.
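
To illustrate the idea on its own (a minimal sketch, not Enlive code): the declared encoding can be pulled out of the start of a document with a regular expression. The function name `declared-encoding` is made up for this example, and the pattern is deliberately loose, so it will also match an unrelated `charset=` that appears early in the text.

```clojure
;; Hypothetical helper: find the encoding named in an XML declaration
;; or an HTML meta tag. Returns nil when nothing is declared.
(defn declared-encoding
  [^String head]
  (second (re-find #"(?:encoding|charset)\s*=\s*[\"']?([-\w]+)" head)))

(declared-encoding "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>")
;; => "ISO-8859-1"

(declared-encoding
  "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">")
;; => "windows-1252"

(declared-encoding "<html><body>no declaration</body></html>")
;; => nil
```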

Here is my implementation for anyone else with the same problem. It's
not widely tested yet, but it should illustrate the idea.
It goes in file net.cgrand.enlive-html.clj, function startparse-tagsoup:
---
(defn- startparse-tagsoup [s ch]
  (doto (org.ccil.cowan.tagsoup.Parser.)
    ....
    (.setProperty "http://www.ccil.org/~cowan/tagsoup/properties/auto-detector"
      (proxy [org.ccil.cowan.tagsoup.AutoDetector] []
        (autoDetectingReader [#^java.io.InputStream is]
          ; Autodetection by looking up the char encoding tag. 2000 bytes is
          ; hopefully enough to include the HTML header up to the content type tag.
          (let [ps   (java.io.PushbackInputStream. is 2000)
                buff (bytes (byte-array 2000))
                ; .read returns -1 on an empty stream, so clamp to 0.
                len  (max 0 (.read ps buff))
                ; Decode the peeked bytes as ISO-8859-1: every byte maps to a
                ; char, so the result doesn't depend on the platform default.
                enc  (get (re-find
                            #"(encoding|charset)\s*=\s*[\"']?([-\w]+)"
                            (String. buff 0 len "ISO-8859-1"))
                          2)]
            ; Push the peeked bytes back so the parser sees the whole stream.
            (.unread ps buff 0 len)
            (java.io.InputStreamReader. ps (or enc "UTF-8"))))))
    ....
---
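
For trying the peek-and-unread step outside of Enlive, here is a standalone sketch. The name `detecting-reader` and the sample document are assumptions made for this example; it also assumes a single `.read` call returns the whole header, which holds for in-memory streams but is not guaranteed for network streams.

```clojure
(import '(java.io ByteArrayInputStream InputStreamReader PushbackInputStream))

;; Hypothetical standalone version: peek at the first bytes, look for a
;; declared encoding, push the bytes back, and wrap the stream in a Reader.
(defn detecting-reader
  [^java.io.InputStream is]
  (let [n    2000
        ps   (PushbackInputStream. is n)
        buff (byte-array n)
        len  (int (max 0 (.read ps buff)))   ; -1 (empty stream) becomes 0
        head (String. buff 0 len "ISO-8859-1")
        enc  (or (second (re-find #"(?:encoding|charset)\s*=\s*[\"']?([-\w]+)" head))
                 "UTF-8")]                   ; same fallback as in the snippet above
    (.unread ps buff 0 len)
    (InputStreamReader. ps enc)))

;; A Latin-1 encoded document containing a non-ASCII character (e with acute).
(def sample "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><p>caf\u00e9</p>")

(def round-trip
  (slurp (detecting-reader (ByteArrayInputStream. (.getBytes sample "ISO-8859-1")))))

(= sample round-trip)
;; => true
```

Note that an unknown or misspelled encoding name in the document makes the `InputStreamReader` constructor throw `UnsupportedEncodingException`, so a more defensive version might catch that and fall back to UTF-8.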

I'd also like to thank Christophe Grand for Enlive; it's probably the
best templating concept I've ever seen.

Regards
jf