I have a question about processing big XML files with lazy-xml. I'm trying to analyze
StackOverflow dumps with Clojure, and when analyzing a 1.6 GB XML file with
posts, I get a Java stack overflow, although I give Java enough memory
(1 GB of heap).

My code looks like this:

(ns stackoverflow
  (:import java.io.File)
  (:use clojure.contrib.lazy-xml))

(def so-base "..../data-sets/stack-overflow/2009-12/122009 SO")
(def posts-file (File. (str so-base "/posts.xml")))

(defn count-post-entries [xml]
  (loop [counter 0
         lst xml]
    (if (nil? lst)
      counter
      (let [elem (first lst)
            ;; next (unlike rest) returns nil at the end of the seq,
            ;; so the nil? check above can terminate the loop
            rst (next lst)]
        (if (and (= (:type elem) :start-element) (= (:name elem) :row))
          (recur (+ 1 counter) rst)
          (recur counter rst))))))

and run it with
(stackoverflow/count-post-entries (clojure.contrib.lazy-xml/parse-seq stackoverflow/posts-file))

I don't collect any real data here, so I expect Clojure to discard the
already processed data.

The same stack overflow happens when I use reduce:

(reduce (fn [counter elem]
          (if (and (= (:type elem) :start-element) (= (:name elem) :row))
            (+ 1 counter)
            counter))
        0 (clojure.contrib.lazy-xml/parse-seq stackoverflow/posts-file))

So the question remains: how can I process big XML files in constant space
(given that I don't collect much data during processing)?

--
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/ http://xtalk.msk.su/~ott/
http://alexott-ru.blogspot.com/
On Wed, Jan 6, 2010 at 9:06 AM, Alex Ott <ale...@gmail.com> wrote:
> Hello all
>
> I have a question about processing big XML files with lazy-xml. I'm trying to analyze
> StackOverflow dumps with Clojure, and when analyzing a 1.6 GB XML file with
> posts, I get a Java stack overflow, although I give Java enough memory
> (1 GB of heap).
Someone asked this question a while back, and a suggestion given was
to use Mark Triggs' XOM wrapper:
http://github.com/marktriggs/xml-picker-seq
Thread:
http://groups.google.com/group/clojure/browse_thread/thread/365ca7aaaf8d55b7?pli=1
Cheers,
Graham
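
For comparison with the lazy-seq approach in the question, here is a minimal sketch of counting <row> elements in constant space by pulling events directly from the JDK's StAX parser (javax.xml.stream) through Java interop. It is not the lazy-xml or xml-picker-seq API, and the names count-start-elements and count-rows-in-file are hypothetical, chosen only for this illustration. It assumes a reasonably recent Clojure and a JDK that ships StAX; some JDK versions also enforce XML processing limits that a very large dump may exceed, which this sketch does not handle.

```clojure
(import '(java.io File FileInputStream InputStream)
        '(javax.xml.stream XMLInputFactory XMLStreamConstants XMLStreamReader))

(defn count-start-elements
  "Pulls events from a StAX reader one at a time and counts the start
  tags whose local name equals tag-name. Only the counter is retained,
  so no sequence of parsed events is ever built up."
  [^XMLStreamReader reader ^String tag-name]
  (loop [counter 0]
    (if (.hasNext reader)
      ;; getLocalName is only called when the event is a start element
      (if (and (= (.next reader) XMLStreamConstants/START_ELEMENT)
               (= (.getLocalName reader) tag-name))
        (recur (inc counter))
        (recur counter))
      counter)))

(defn count-rows-in-file
  "Hypothetical helper: counts <row> start tags in an XML file."
  [^File file]
  (with-open [^InputStream in (FileInputStream. file)]
    (let [^XMLInputFactory factory (XMLInputFactory/newInstance)
          ^XMLStreamReader reader (.createXMLStreamReader factory in)]
      (try
        (count-start-elements reader "row")
        (finally (.close reader))))))
```

With the posts-file definition from the question, this would be called as (count-rows-in-file posts-file). The trade-off is that the code works with parser events imperatively rather than with a Clojure sequence, so there is no lazy seq whose head could be retained by accident.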