parsing/processing of big xml files...

Alex Ott

Jan 6, 2010, 9:06:26 AM
to Clojure ML
Hello all,

I have a question about processing big XML files with lazy-xml. I'm trying to analyze
the StackOverflow dumps with Clojure, and when I process the 1.6 GB XML file of
posts, I get a Java stack overflow, although I give Java plenty of memory
(1 GB of heap).

My code looks like this:


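(ns stackoverflow
  (:import java.io.File)
  (:use clojure.contrib.lazy-xml))

(def so-base "..../data-sets/stack-overflow/2009-12/122009 SO")

(def posts-file (File. (str so-base "/posts.xml")))

;; Count the :row start-element events in a lazy-xml event seq.
(defn count-post-entries [xml]
  (loop [counter 0
         lst (seq xml)]
    ;; seq/next return nil at the end of the input; rest never returns nil,
    ;; so testing (nil? lst) after rest would loop forever on a finite seq.
    (if (nil? lst)
      counter
      (let [elem (first lst)
            rst (next lst)]
        (if (and (= (:type elem) :start-element) (= (:name elem) :row))
          (recur (+ 1 counter) rst)
          (recur counter rst))))))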

and run it with

(stackoverflow/count-post-entries (clojure.contrib.lazy-xml/parse-seq stackoverflow/posts-file))

I don't collect any real data here, so I expect Clojure to discard
data that has already been processed.

The same stack overflow happens when I use reduce:

(reduce (fn [counter elem]
          (if (and (= (:type elem) :start-element) (= (:name elem) :row))
            (+ 1 counter)
            counter))
        0 (clojure.contrib.lazy-xml/parse-seq stackoverflow/posts-file))

So the question remains open: how can I process big XML files in constant
space (assuming I don't collect much data during processing)?

--
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/ http://xtalk.msk.su/~ott/
http://alexott-ru.blogspot.com/

Graham Fawcett

Jan 6, 2010, 3:27:17 PM
to clo...@googlegroups.com
Hi Alex,

On Wed, Jan 6, 2010 at 9:06 AM, Alex Ott <ale...@gmail.com> wrote:
> Hello all
>
> I have question about processing big XML files with lazy-xml.  I'm trying to analyze
> StackOverflow dumps with Clojure, and when analyzing 1.6Gb XML file with
> posts, i get java stack overflow, although i provide enough memory for java
> (1Gb of heap).

Someone asked this question a while back, and a suggestion given was
to use Mark Triggs' XOM wrapper:

http://github.com/marktriggs/xml-picker-seq

Thread:
http://groups.google.com/group/clojure/browse_thread/thread/365ca7aaaf8d55b7?pli=1

Cheers,
Graham
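
For comparison, here is a minimal sketch of counting elements in constant space with the JDK's StAX pull parser (javax.xml.stream, available since Java 6) called directly from Clojure. It is neither the lazy-xml nor the xml-picker-seq API, only an illustration of the streaming idea: the parser hands over one event at a time and no sequence of past events is ever built. The function name count-start-tags is hypothetical, and the input is assumed to be a well-formed document supplied as a java.io.InputStream.

```clojure
(import '(javax.xml.stream XMLInputFactory XMLStreamConstants)
        '(java.io InputStream ByteArrayInputStream FileInputStream))

(defn count-start-tags
  "Counts the start tags whose local name equals tag-name.
  Hypothetical helper: pulls one event at a time from a StAX reader,
  so memory use does not grow with the size of the document."
  [^InputStream in tag-name]
  (let [factory (XMLInputFactory/newInstance)
        stream  (.createXMLStreamReader factory in)]
    (try
      (loop [n 0]
        (if (.hasNext stream)
          (let [event (.next stream)]
            (recur (if (and (= event XMLStreamConstants/START_ELEMENT)
                            (= (.getLocalName stream) tag-name))
                     (inc n)
                     n)))
          n))
      (finally
        (.close stream)))))

;; Small in-memory example:
(count-start-tags
  (ByteArrayInputStream.
    (.getBytes "<posts><row Id=\"1\"/><row Id=\"2\"/><other/></posts>" "UTF-8"))
  "row")
;; => 2
```

For a file on disk, the same function could be called as (with-open [in (FileInputStream. posts-file)] (count-start-tags in "row")), where posts-file is the File defined in the original post. Only a counter and the parser's current event are live at any moment, which is the property the original question asks for.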
