Stack overflow while processing XML

mkrajnak

Nov 17, 2009, 11:05:46 PM
to Clojure
I am processing a large XML file (13 MB) using clojure.xml/parse
and clojure.contrib.zip-filter.xml with Clojure 1.0.0.

The XML file contains information on 13,000 Japanese characters, and
I'm extracting about 200 of them.

At its core it extracts a very small subset of elements using:

(xml-> kdic :character [:literal #(contains? kcset (text %))] node)

Where kcset is a set of desired characters.
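
A note on why kcset holds strings rather than characters: text returns a String, so a set of Character values would never match it. A minimal sketch, with made-up sample characters standing in for the real input:

```clojure
;; Characters as they would come out of (slurp file); sample data only.
(def kchars (set "日本"))

;; The same characters converted to one-character strings.
(def kcset (set (map str kchars)))

(println (contains? kchars "日"))  ; false: "日" is a String, the set holds Characters
(println (contains? kcset "日"))   ; true
```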

My understanding is that this returns a lazy-seq, so counting it
should return about 200 (not 13000). In practice, counting it throws
a stack overflow instead.
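
To illustrate the expectation, here is a small sketch of how count interacts with a lazy filter. It uses a plain vector rather than the zipper, and the names wanted, examined and matching are invented for the example. The count is the number of matches, but every input element still has to be examined to get there:

```clojure
(def wanted #{"a" "c"})
(def examined (atom 0))   ; how many elements the predicate has seen

(defn matching
  "Lazily keeps the items found in wanted, recording each test."
  [items]
  (filter (fn [s] (swap! examined inc) (contains? wanted s)) items))

(def matches (matching ["a" "b" "c" "d"]))

(println @examined)        ; 0: nothing has been realized yet
(println (count matches))  ; 2: only the matching items are counted
(println @examined)        ; 4: count forced a walk over every input
```

So laziness changes when the work happens, not how much input has to be walked.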

At the end of this post I have a relatively short version of the
program which throws the stack overflow. In this case it has a
(count ...) call which causes the stack overflow. In the full program
I tried a few variations like so:

(dorun (for [knode knodes] (print-kinfo knode)))

This was meant to print the information, but it also throws a stack
overflow before reaching the end of the list.
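
As an aside, doseq is the more usual way to write a side-effecting loop like this. The sketch below assumes a hypothetical print-kinfo and uses sample nodes in the map shape that clojure.xml produces. It is only a tidier spelling: the sequence is still realized in the same way, so it would not by itself be expected to avoid the overflow.

```clojure
;; Hypothetical stand-in for the real print-kinfo.
(defn print-kinfo [knode]
  (println (:tag knode) (:content knode)))

;; Sample nodes; illustration only.
(def knodes [{:tag :character :attrs nil :content ["日"]}
             {:tag :character :attrs nil :content ["本"]}])

;; Runs the body once per element for its side effects and returns nil,
;; without building an intermediate sequence of results.
(doseq [knode knodes]
  (print-kinfo knode))
```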

The stack trace is at the end as well.

Thanks!


Here's the short version of the program:

(ns kanji.prkanji
  (:use clojure.xml)
  (:use [clojure.zip :only (xml-zip node)])
  (:use clojure.contrib.zip-filter.xml)
  (:import java.lang.Character$UnicodeBlock)
  (:import java.io.File))

(def CJK Character$UnicodeBlock/CJK_UNIFIED_IDEOGRAPHS)

(defn filter-for-kanji
  [chars]
  (filter #(= CJK (Character$UnicodeBlock/of %)) chars))

(defn get-unique-kanji
  [chars]
  (let [kchars (filter-for-kanji chars)]
    (set kchars)))

(defn print-kinfos
  [knodes]
  (count knodes))
;; this is what I would normally do: (dorun (for [knode knodes] (print-kinfo knode))))

(defn get-kdic-info
  [kdic kchars]
  (let [kcset (set (map str kchars))]
    (xml-> kdic :character [:literal #(contains? kcset (text %))] node)))

(defn load-kdic
  [fname]
  (xml-zip (parse (File. fname))))

(defn process-file
  [file]
  (let [kchars (get-unique-kanji (slurp file))]
    (print-kinfos
      (get-kdic-info
        (load-kdic "kanji/kdic-data.xml") kchars))))

(process-file (second *command-line-args*))

And here's the top of the stack trace:

Exception in thread "main" java.lang.StackOverflowError (prkanji.clj:0)
at clojure.lang.Compiler.eval(Compiler.java:4543)
at clojure.lang.Compiler.load(Compiler.java:4857)
at clojure.lang.Compiler.loadFile(Compiler.java:4824)
at clojure.main$load_script__5833.invoke(main.clj:206)
at clojure.main$init_opt__5836.invoke(main.clj:211)
at clojure.main$initialize__5846.invoke(main.clj:239)
at clojure.main$null_opt__5868.invoke(main.clj:264)
at clojure.main$legacy_script__5883.invoke(main.clj:295)
at clojure.lang.Var.invoke(Var.java:346)
at clojure.main.legacy_script(main.java:34)
at clojure.lang.Script.main(Script.java:20)
Caused by: java.lang.StackOverflowError
at clojure.lang.Cons.next(Cons.java:37)
at clojure.lang.RT.boundedLength(RT.java:1117)
at clojure.lang.AFn.applyToHelper(AFn.java:168)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at clojure.core$apply__3243.doInvoke(core.clj:390)
at clojure.lang.RestFn.invoke(RestFn.java:443)
at clojure.core$mapcat__3842.doInvoke(core.clj:1528)
at clojure.lang.RestFn.invoke(RestFn.java:428)
at clojure.contrib.zip_filter$descendants__48$fn__50.invoke(zip_filter.clj:63)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.core$seq__3133.invoke(core.clj:103)
at clojure.core$map__3815$fn__3817.invoke(core.clj:1502)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.Cons.next(Cons.java:37)
at clojure.lang.RT.boundedLength(RT.java:1117)
at clojure.lang.RestFn.applyTo(RestFn.java:135)
at clojure.core$apply__3243.doInvoke(core.clj:390)
at clojure.lang.RestFn.invoke(RestFn.java:428)
at clojure.core$mapcat__3842.doInvoke(core.clj:1528)
at clojure.lang.RestFn.invoke(RestFn.java:428)
at clojure.contrib.zip_filter$mapcat_chain__65$fn__67.invoke(zip_filter.clj:88)
at clojure.lang.ArraySeq.reduce(ArraySeq.java:116)
at clojure.core$reduce__3319.invoke(core.clj:536)
at clojure.contrib.zip_filter$mapcat_chain__65.invoke(zip_filter.clj:89)
at clojure.contrib.zip_filter.xml$xml__GT___119.doInvoke(xml.clj:75)
at clojure.lang.RestFn.invoke(RestFn.java:460)
at clojure.contrib.zip_filter.xml$text__102.invoke(xml.clj:43)
at kanji.prkanji$get_kdic_info__147$fn__149.invoke(prkanji.clj:36)
at clojure.contrib.zip_filter$fixup_apply__60.invoke(zip_filter.clj:76)
at clojure.contrib.zip_filter$mapcat_chain__65$fn__67$fn__69.invoke(zip_filter.clj:88)
at clojure.core$map__3815$fn__3817.invoke(core.clj:1503)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.core$seq__3133.invoke(core.clj:103)
at clojure.core$spread__3240.invoke(core.clj:383)
at clojure.core$apply__3243.doInvoke(core.clj:390)
at clojure.lang.RestFn.invoke(RestFn.java:428)
at clojure.core$mapcat__3842.doInvoke(core.clj:1528)
at clojure.lang.RestFn.invoke(RestFn.java:428)
at clojure.contrib.zip_filter$mapcat_chain__65$fn__67.invoke(zip_filter.clj:88)
at clojure.lang.APersistentVector$Seq.reduce(APersistentVector.java:476)
at clojure.core$reduce__3319.invoke(core.clj:536)
at clojure.contrib.zip_filter$mapcat_chain__65.invoke(zip_filter.clj:89)
at clojure.contrib.zip_filter.xml$xml__GT___119.doInvoke(xml.clj:75)
at clojure.lang.RestFn.applyTo(RestFn.java:144)
at clojure.core$apply__3243.doInvoke(core.clj:390)
at clojure.lang.RestFn.invoke(RestFn.java:443)
at clojure.contrib.zip_filter.xml$seq_test__111$fn__113.invoke(xml.clj:55)
at clojure.contrib.zip_filter$fixup_apply__60.invoke(zip_filter.clj:76)
at clojure.contrib.zip_filter$mapcat_chain__65$fn__67$fn__69.invoke(zip_filter.clj:88)
at clojure.core$map__3815$fn__3817.invoke(core.clj:1503)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.Cons.next(Cons.java:37)
at clojure.lang.RT.next(RT.java:560)
at clojure.core$next__3117.invoke(core.clj:50)
at clojure.core$concat__3255$cat__3269$fn__3270.invoke(core.clj:428)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)

Alex Osborne

Nov 18, 2009, 7:03:55 AM
to clo...@googlegroups.com
mkrajnak wrote:
> I am processing a very large xml file, 13MB, using clojure.xml.parse
> and clojure.contrib.zip-filter.xml with clojure 1.0.0.

clojure.xml.parse loads the whole document into memory at once so it's
only really suitable for small (at most a megabyte or two) XML
documents. Have a look at something like Xom instead:

http://www.xom.nu/

If you're looking for an example of usage from Clojure, Mark Triggs has
a nifty wrapper for Xom that efficiently turns an XML document into a
lazy-seq (using a queue) which he routinely uses on multi-gigabyte XML
files:

http://github.com/marktriggs/xml-picker-seq
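
For a rough idea of what a streaming pass looks like, here is a minimal sketch using the JDK's built-in StAX API (javax.xml.stream) through Java interop. This is not Xom and not the xml-picker-seq wrapper, just the same general idea of never holding the whole tree in memory. The function name wanted-literals and the sample document are invented for the example; the character/literal element names come from the xml-> query in the original post.

```clojure
(import '(javax.xml.stream XMLInputFactory XMLStreamConstants XMLStreamReader)
        '(java.io StringReader))

(defn wanted-literals
  "Streams over the XML in reader and returns a vector of the <literal>
  texts that are members of kcset. Only one event is held at a time."
  [^java.io.Reader reader kcset]
  (let [^XMLStreamReader xr (.createXMLStreamReader (XMLInputFactory/newInstance) reader)]
    (loop [found []]
      (if (.hasNext xr)
        (let [event (.next xr)]
          (if (and (= event XMLStreamConstants/START_ELEMENT)
                   (= "literal" (.getLocalName xr)))
            ;; getElementText reads up to the matching end tag.
            (let [text (.getElementText xr)]
              (recur (if (contains? kcset text) (conj found text) found)))
            (recur found)))
        found))))

;; Tiny stand-in for the real dictionary file; illustration only.
(def sample-xml
  (str "<kanjidic>"
       "<character><literal>日</literal></character>"
       "<character><literal>本</literal></character>"
       "</kanjidic>"))

(println (wanted-literals (StringReader. sample-xml) #{"日"}))
```

Because this uses loop/recur rather than nested lazy sequences, the stack depth stays flat regardless of document size. For a real file, a java.io.Reader opened on the file would replace the StringReader.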