[groovy-user] More on XmlSlurper vs XmlParser

Frederic Laruelle

Sep 28, 2010, 9:37:50 PM

to user

There's something fundamental about the difference between XmlSlurper/GPathResult and XmlParser/Node that I'm having a tough time grasping...
Can someone explain why

def xml = '<html><head><title>Title</title></head><body><h1>Header</h1></body></html>'

def root = new XmlSlurper().parseText(xml)
new XmlNodePrinter(preserveWhitespace:true).print(root.body[0])

Outputs 'Header'

while

def xml = '<html><head><title>Title</title></head><body><h1>Header</h1></body></html>'
def root = new XmlParser().parseText(xml)
new XmlNodePrinter(preserveWhitespace:true).print(root.body[0])

outputs '

<body>
  <h1>Header</h1>
</body>

'

How can I obtain the same (latter) result with XmlSlurper?

Tks!

Fred~

Frederic Laruelle

Sep 28, 2010, 9:41:15 PM

to user

There's something fundamental about the difference between XmlSlurper/GPathResult and XmlParser/Node that I'm having a tough time grasping...
Can someone explain why

def xml = '<html><head><title>Title</title></head><body><h1>Header</h1></body></html>'


def root = new XmlSlurper().parseText(xml)


new XmlNodePrinter(preserveWhitespace:true).print(root.body[0])

Robert Anderson

Sep 28, 2010, 10:02:10 PM

to us...@groovy.codehaus.org

I think this site can answer your question:

http://www.tutkiun.com/2009/10/xmlparser-and-xmlslurper.html

"XMLParser stores intermediate results after parsing documents. But on the other hand, XMLSlurper does not stores internal results after processing XML documents."
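To make the type difference concrete, and to address the original question of getting markup back out of an XmlSlurper result, here is a minimal sketch. It assumes Groovy 3 or later, where the XML classes live in groovy.xml (on older versions, drop the two parser imports, since groovy.util is imported by default). Using XmlUtil.serialize is a suggestion added here, not something proposed in the thread, and the exact whitespace of its output depends on the JDK's XML transformer.

```groovy
import groovy.xml.XmlParser
import groovy.xml.XmlSlurper
import groovy.xml.XmlUtil

def xml = '<html><head><title>Title</title></head><body><h1>Header</h1></body></html>'

// XmlParser builds a tree of groovy.util.Node objects.
def parsed = new XmlParser().parseText(xml)
assert parsed instanceof Node

// XmlSlurper returns a lazily evaluated GPathResult, not a Node.
def slurped = new XmlSlurper().parseText(xml)
assert !(slurped instanceof Node)

// XmlUtil.serialize accepts a GPathResult and returns the markup as a String
// (prefixed with an XML declaration).
def bodyMarkup = XmlUtil.serialize(slurped.body[0])
println bodyMarkup
```

Since XmlNodePrinter is written for Node objects, going through XmlUtil (or a StreamingMarkupBuilder) is probably the safer route when the input is a GPathResult.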

Cheers,

--
Robert Anderson Nogueira de Oliveira
_________________________
MSN: ranop...@hotmail.com

"Ausência de evidência não é evidência de ausência." (Carl Sagan)

Frederic Laruelle

Sep 29, 2010, 3:24:06 PM

to us...@groovy.codehaus.org

Tks Robert,

this is a great post ...

Can you help me understand how the difference you pointed out ("storing intermediate results" vs. not) explains the following behavior?

given this example

def xml = '<html><head><title>Title</title></head><body><h1><i>The</i>Header</h1></body></html>'

XmlParser

def root = new XmlParser().parseText(xml)

new XmlNodePrinter(preserveWhitespace:true).print(root.body.h1[0])

output >>

<h1>
  <i>The</i>
Header</h1>

println (root.body.h1[0].text())

output >>

Header

XmlSlurper

def root = new XmlSlurper().parseText(xml)

new XmlNodePrinter(preserveWhitespace:true).print(root.body.h1[0])

output >>

TheHeader

println (root.body.h1[0].text())

output >>

TheHeader

For the project I am working on, I need the following two outputs from the root.body.h1[0] node:

1. properly spaced text contents: "The Header"

2. DOM fragment:

<h1>
  <i>The</i>
Header</h1>

Paulk suggested using XmlSlurper for #1 in a different thread (still missing the spacing though; any ideas for that?)

And based on this thread, it seems XmlParser would help with #2...

Does that mean I need to parse the same doc twice, once with XmlParser and once with XmlSlurper?

Isn't that somewhat sub-optimal?

Is there a way to meet both of my requirements with one parser?

Tks!

Fred~

Robert Anderson

Sep 29, 2010, 5:00:44 PM

to us...@groovy.codehaus.org

Hi Frederic!

You can do it with regex. For instance:

groovy> xmls =     [
groovy>                '<html><head><title>Title</title></head><body><h1><i>The</i>Header 1</h1></body></html>',
groovy>                '<html><head><title>Title</title></head><body><h1><i>The Super </i>Header 2</h1></body></html>',
groovy>                '<html><head><title>Title</title></head><body><h1><i>The Power</i>Header</h1></body></html>'
groovy>            ]
groovy> xmls.each { doc ->
groovy>     h1Content = (doc =~ /<h1>.*<\/h1>/)
groovy>     h1Text = h1Content[0]
groovy>     h1WithoutXml = h1Text.replaceAll("<(.|\n)*?>", ' ').trim()
groovy>     println h1Text
groovy>     println h1WithoutXml
groovy> }

<h1><i>The</i>Header 1</h1>
The Header 1
<h1><i>The Super </i>Header 2</h1>
The Super Header 2
<h1><i>The Power</i>Header</h1>
The Power Header

Greetings,

--
Robert Anderson Nogueira de Oliveira
_________________________
MSN: ranop...@hotmail.com

"Ausência de evidência não é evidência de ausência." (Carl Sagan)

Frederic Laruelle

Sep 29, 2010, 5:10:15 PM

to us...@groovy.codehaus.org

Tks Robert.

I was hoping for something a little more systematic. :-)

This helps with getting properly spaced text from XmlSlurper results...

but doesn't answer whether I can get my 2nd requirement met with XmlSlurper...

or can get both requirements met with one parser...

Do I have a unique use case here? (I would have thought otherwise.)

Fred~

Robert Anderson

Sep 29, 2010, 10:21:26 PM

to us...@groovy.codehaus.org

Frederic,

You can meet both requirements with XmlParser alone, so one parser :)

groovy> xml = '<html><head><title>Title</title></head><body><h1><i>The</i>Header</h1></body></html>'
groovy> html = new XmlParser().parseText(xml)
groovy> writer = new StringWriter()
groovy> new XmlNodePrinter(new PrintWriter(writer)).print(html.body.h1[0])
groovy> h1XML = writer.toString()
groovy> println h1XML
groovy> h1Content = h1XML.replaceAll("<(.|\n)*?>|(\n)", '').trim().replaceAll(/\s+/,' ')
groovy> println h1Content

<h1>
  <i>
    The
  </i>
  Header
</h1>

The Header
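The regex step can also be avoided by walking the Node tree that XmlParser has already built. The following is a minimal sketch under a few assumptions: Groovy 3 or later for the groovy.xml imports (on older versions, drop both imports, since the classes sit in the default-imported groovy.util), and textParts is a hypothetical helper name introduced here, not a Groovy API.

```groovy
import groovy.xml.XmlParser
import groovy.xml.XmlNodePrinter

// Hypothetical helper: collect the trimmed text pieces of a node in document order.
// The children of a parsed Node are either nested Node objects or plain Strings.
List<String> textParts(Node node) {
    node.children().collect { child ->
        child instanceof Node ? textParts(child) : child.toString().trim()
    }.flatten().findAll { it }   // drop empty strings
}

def xml = '<html><head><title>Title</title></head><body><h1><i>The</i>Header</h1></body></html>'
def h1 = new XmlParser().parseText(xml).body.h1[0]

// Requirement 1: properly spaced text content.
def h1Text = textParts(h1).join(' ')
println h1Text   // prints "The Header"

// Requirement 2: the DOM fragment, serialized from the same parsed tree.
def writer = new StringWriter()
new XmlNodePrinter(new PrintWriter(writer)).print(h1)
def h1Fragment = writer.toString()
println h1Fragment
```

Both outputs come from a single parse, and the text extraction no longer depends on how XmlNodePrinter happens to indent its output.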

An interview about XmlSlurper:

http://groovy.dzone.com/news/john-wilson-groovy-and-xml

Cheers,

--
Robert Anderson Nogueira de Oliveira
_________________________
MSN: ranop...@hotmail.com

"Ausência de evidência não é evidência de ausência." (Carl Sagan)

Frederic Laruelle

Sep 30, 2010, 3:59:19 AM

to us...@groovy.codehaus.org

Nice job,

Tks Robert!

Fred~
