Parsing large XML files

Peter Ullah

Dec 17, 2013, 5:57:32 AM
to clo...@googlegroups.com

Hi all, 

I'm attempting to parse a large (500MB) XML file; specifically, I am trying to extract various parts using XPath. I've been using the examples presented here: http://clojure-doc.org/articles/tutorials/parsing_xml_with_zippers.html
and all was going well when tested against small files. However, now that I am using the larger file, Fireplace/Vim just hangs, my laptop gets hot, and then I get a memory exception.

I've been playing around with various other libraries such as clojure.data.xml and found that the following works perfectly well for parsing... but when I come to search inside root, things start to snarl up again.

(ns example.core
  (:require [clojure.java.io :as java.io]
            [clojure.data.xml :as data.xml]))

(def large-file "/path-to-large-file")

;; clojure.data.xml returns quickly with no problems, whereas clojure.xml/parse from the link above causes problems.
(def root
  (-> large-file
      java.io/reader   ; data.xml/parse expects a Reader or InputStream, not a path string
      data.xml/parse))

(class root) ;clojure.data.xml.Element

Does anyone know a way of searching within root that won't consume the heap?

Forgive me, I'm new to Clojure and these forums, I've searched through previous posts but not managed to answer my own question.

Thanks in advance.

Ryan Senior

Dec 17, 2013, 7:45:50 AM
to clo...@googlegroups.com
As far as I know, using zippers like that will need the whole XML data structure to be in memory.  data.xml returns fast because it's lazy (uses pull parsing).  Until you start traversing down the structure, it won't parse more of it.  data.xml should also be fully streaming, so it shouldn't require the full 500 MB XML file in memory unless you're doing something to require that.

Traversing the structure that data.xml emits directly should not consume heap, but you wouldn't be able to use XPath.  I've not used it, but there is an XPath wrapper library here: https://github.com/kyleburton/clj-xpath. Briefly looking at the code, it looks like it's using DOM parsing, so it would consume heap. You could bump your max heap (-Xmx from the command line) if you had the extra memory and weren't worried about the docs getting larger.

-Ryan
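
A minimal sketch of what traversing the emitted structure directly might look like, assuming the records of interest are direct children of the root element. `matching-children` and `sample` are hypothetical names used only for illustration; the helper relies on nothing more than elements being map-like with :tag and :content keys, as described in this thread.

```clojure
(defn matching-children
  "Lazily returns the direct children of `element` whose :tag is `tag`.
  Text nodes (plain strings) are not maps and are skipped."
  [element tag]
  (filter #(and (map? %) (= tag (:tag %))) (:content element)))

;; Stand-in for a parsed document. data.xml elements are records with the
;; same :tag / :attrs / :content keys, so they can be passed in the same way.
(def sample
  {:tag :head
   :attrs {}
   :content ["\n  "
             {:tag :title :attrs {} :content ["This is some text"]}
             {:tag :body  :attrs {}
              :content [{:tag :h1 :attrs {} :content ["This is a header"]}]}]})

;; Handle one matching child at a time instead of collecting them all.
(doseq [el (matching-children sample :title)]
  (println (apply str (:content el))))
```

Against the real file, one would presumably pass the result of data.xml/parse over a reader opened with with-open and consume it inside that scope. Binding the root with def keeps every realized child reachable from the var, which is one way to end up needing the whole document in memory.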

Matching Socks

Dec 17, 2013, 8:01:28 PM
to clo...@googlegroups.com
On general Java principles, you can "stream" a large XML file with either SAX or StAX and pluck what you like from it without wasting memory on the rest.  If the file is a long series of small sections that could be examined separately, you might use SAX to partition the file and then subject each section to orthodox methods.  Or here's a small library that pulls a lazy sequence of stuff (Clojure data structures) from a file using StAX:

   https://github.com/pbwolf/drainclog
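
For reference, a rough sketch of the StAX approach using only the JDK's javax.xml.stream classes through Java interop. It does not use drainclog. `element-texts` is a hypothetical helper, and it assumes the elements of interest contain only text.

```clojure
(import '[javax.xml.stream XMLInputFactory XMLStreamConstants XMLStreamReader]
        '[java.io Reader StringReader])

(defn element-texts
  "Streams XML from `rdr` and returns a vector with the text of every
  element whose local name is `tag-name`. Assumes those elements contain
  only text (getElementText throws if it meets a nested element)."
  [^Reader rdr ^String tag-name]
  (let [^XMLStreamReader xsr (.createXMLStreamReader (XMLInputFactory/newInstance) rdr)]
    (loop [acc []]
      (if (.hasNext xsr)
        ;; `and` short-circuits, so getLocalName is only called on start tags.
        (if (and (= XMLStreamConstants/START_ELEMENT (.next xsr))
                 (= tag-name (.getLocalName xsr)))
          (recur (conj acc (.getElementText xsr)))
          (recur acc))
        acc))))

(element-texts
 (StringReader. "<head><title>This is some text</title><body><h1>This is a header</h1></body></head>")
 "h1")
;; => ["This is a header"]
```

Only the matched text is accumulated, so memory use should track the number of matches rather than the size of the file. For a 500MB input, the reader would come from the file itself, e.g. clojure.java.io/reader inside with-open.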


danneu

Dec 18, 2013, 2:23:21 AM
to clo...@googlegroups.com
Good question. Every lib that came to mind when I saw clojure.data.xml/parse's
tree of Elements {:tag _, :attrs _, :content _} only works on zippers, which
apparently sit in memory.

One option is to use `clojure.data.xml/source-seq` to get back a lazy sequence
of Events {:type _, :name _, :attrs _, :str _} where the event :type is either
:start-element, :end-element, or :characters.

For example, "<strong>Hello</strong>" would parse into the events
[:start-element "strong"], [:characters "Hello"], [:end-element "strong"]. You
could use loop/recur to manage state as you consume the sequence.

That's actually how I'm used to working with SAX parsers anyways. Here are some
naive Ruby examples if it's new to you: https://gist.github.com/danneu/3977120.

Of course, I imagine the ideal solution would involve some way to express selectors on the
Element tree like I'm used to doing with raynes/laser on zippers: https://github.com/Raynes/laser/blob/master/docs/guide.md#screen-scraping.
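
A small sketch of the loop/recur idea over an event sequence, as described in the post above. `texts-of` and `strong-events` are hypothetical names. The events here are plain maps with the :type / :name / :str keys mentioned in the thread, with keyword element names, so the block runs without data.xml; in practice the sequence would come from `clojure.data.xml/source-seq`, assuming a data.xml version that provides it.

```clojure
(defn texts-of
  "Walks `events` with loop/recur and returns the text found inside each
  element named `tag`. `inside?` is the only state carried between steps.
  Assumes the target elements hold only text, with no nested elements."
  [events tag]
  (loop [events  (seq events)
         inside? false
         acc     []]
    (if-let [e (first events)]
      (case (:type e)
        :start-element (recur (next events) (= tag (:name e)) acc)
        :end-element   (recur (next events) false acc)
        :characters    (recur (next events)
                              inside?
                              (if inside? (conj acc (:str e)) acc))
        ;; Any other event type is ignored.
        (recur (next events) inside? acc))
      acc)))

;; Events for "<strong>Hello</strong>".
(def strong-events
  [{:type :start-element :name :strong :attrs nil :str nil}
   {:type :characters    :name nil     :attrs nil :str "Hello"}
   {:type :end-element   :name :strong :attrs nil :str nil}])

(texts-of strong-events :strong)
;; => ["Hello"]
```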

Peter Ullah

Dec 19, 2013, 1:09:12 PM
to clo...@googlegroups.com
Thank you everyone for your advice. I found it useful and think that I am part-way to a solution using clojure.data.xml/source-seq as suggested by danneu.

I'll post what I have done so far in the hope it might help someone else... comments on style welcome.

Solution:

Given the following XML,

<head>
  <title>This is some text</title>
  <body>
     <h1>This is a header</h1>
  </body>
</head>

data.xml/source-seq will return a lazy seq of data.xml.Event items 

#clojure.data.xml.Event{:type :start-element, :name :head, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :characters, :name nil, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :start-element, :name :title, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :characters, :name nil, :attrs nil, :str This is some text}
#clojure.data.xml.Event{:type :end-element, :name :title, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :start-element, :name :body, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :start-element, :name :h1, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :characters, :name nil, :attrs nil, :str This is a header}
#clojure.data.xml.Event{:type :end-element, :name :h1, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :end-element, :name :body, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :end-element, :name :head, :attrs nil, :str nil}

This is perfect for finding elements with a particular name, but completely useless if I want to find an element based on its location. So I maintain a stack where each :start-element pushes the element name and each :end-element pops it.

(require 'clojure.string)

;; `xml` is the lazy seq of events returned by data.xml/source-seq.
(let [stack          (atom [])
      search-pattern "vmware/collectionHost/Object/Property/Property"]
  (doseq [x (take 100 xml)] ; just test with the first 100 events in the seq
    (case (:type x)
      :start-element (swap! stack conj (name (:name x)))
      :end-element   (swap! stack pop)
      nil)
    ;; Print the current path whenever it matches the search pattern.
    (when (= search-pattern (clojure.string/join "/" @stack))
      (println (clojure.string/join "/" @stack)))))

This is a work in progress and does not take account of attributes on the elements, but I would appreciate any comments.

Thanks

Pete
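
The stack idea from the post above can also be sketched without an atom, by threading the stack through reduce and returning the matched text instead of printing. `texts-at-path` and `head-events` are hypothetical names. The events are plain maps shaped like the Event output shown earlier, so the block runs without data.xml, and the path is given as a vector of keywords rather than a "/"-joined string.

```clojure
(defn texts-at-path
  "Returns the text of every :characters event seen while the stack of
  open element names equals `path` (a vector of keywords)."
  [events path]
  (:found
   (reduce (fn [state e]
             (case (:type e)
               :start-element (update-in state [:stack] conj (:name e))
               :end-element   (update-in state [:stack] pop)
               :characters    (if (and (= path (:stack state)) (:str e))
                                (update-in state [:found] conj (:str e))
                                state)
               ;; Any other event type leaves the state unchanged.
               state))
           {:stack [] :found []}
           events)))

;; Events for the <head> example document, whitespace events omitted.
(def head-events
  [{:type :start-element :name :head}
   {:type :start-element :name :title}
   {:type :characters    :str "This is some text"}
   {:type :end-element   :name :title}
   {:type :start-element :name :body}
   {:type :start-element :name :h1}
   {:type :characters    :str "This is a header"}
   {:type :end-element   :name :h1}
   {:type :end-element   :name :body}
   {:type :end-element   :name :head}])

(texts-at-path head-events [:head :body :h1])
;; => ["This is a header"]
```

Because the state is an ordinary value, the same reducing function could be extended to compare attributes by looking at (:attrs e) on :start-element events.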