[groovy-user] More on XmlSlurper vs XmlParser


Frederic Laruelle

Sep 28, 2010, 9:37:50 PM
to user
There's something fundamental about the difference between XmlSlurper/GPathResult and XmlParser/Node that I'm having a tough time grasping...
Can someone explain why

def xml = '<html><head><title>Title</title></head><body><h1>Header</h1></body></html>'
def root = new XmlSlurper().parseText(xml)
new XmlNodePrinter(preserveWhitespace:true).print(root.body[0])
outputs 'Header'

while

def xml = '<html><head><title>Title</title></head><body><h1>Header</h1></body></html>'  
def root = new XmlParser().parseText(xml)  
new XmlNodePrinter(preserveWhitespace:true).print(root.body[0])

outputs '
<body>
  <h1>Header</h1>
</body>
'

How can I obtain the latter result with XmlSlurper?

Tks!

Fred~

Frederic Laruelle

Sep 28, 2010, 9:41:15 PM
to user
There's something fundamental about the difference between XmlSlurper/GPathResult and XmlParser/Node that I'm having a tough time grasping...
Can someone explain why

def xml = '<html><head><title>Title</title></head><body><h1>Header</h1></body></html>'

def root = new XmlSlurper().parseText(xml)

new XmlNodePrinter(preserveWhitespace:true).print(root.body[0])
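
A likely explanation, sketched under assumptions: XmlSlurper.parseText returns a GPathResult rather than a groovy.util.Node, so XmlNodePrinter.print(Node) does not apply to it. The call most likely falls through to Groovy's general-purpose print(Object) method, which writes the object's toString(), and for a GPathResult that is its text content. One way to get markup back from a slurped node is groovy.xml.XmlUtil.serialize. The XmlSlurper import below assumes Groovy 3 or later; on older versions the class lives in groovy.util and needs no import.

```groovy
import groovy.xml.XmlSlurper   // assumption: Groovy 3+; drop this import on Groovy 2.x and earlier
import groovy.xml.XmlUtil

def xml = '<html><head><title>Title</title></head><body><h1>Header</h1></body></html>'
def root = new XmlSlurper().parseText(xml)
def body = root.body[0]

// toString() on a GPathResult is its text content, which is what a plain print shows
def bodyText = body.toString()
println bodyText

// XmlUtil.serialize turns the GPathResult back into markup.
// The result starts with an XML declaration; exact indentation depends on the JDK.
def bodyMarkup = XmlUtil.serialize(body)
println bodyMarkup
```

The serialized string can be post-processed (for example, to strip the XML declaration) if only the fragment is wanted.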

Robert Anderson

Sep 28, 2010, 10:02:10 PM
to us...@groovy.codehaus.org
I think this site can answer your question:


"XMLParser stores intermediate results after parsing documents. But on the other hand, XMLSlurper does not stores internal results after processing XML documents."

Cheers,

--
Robert Anderson Nogueira de Oliveira
_________________________
MSN: ranop...@hotmail.com

"Absence of evidence is not evidence of absence." (Carl Sagan)

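To make the quoted distinction a little more concrete, here is a small sketch (an illustration, not taken from the linked site). XmlParser builds a tree of groovy.util.Node objects whose children are held in lists, while XmlSlurper hands back GPathResult objects that are evaluated lazily as you navigate. Class names are compared by simple name because the slurper support package differs between Groovy versions; the imports assume Groovy 3 or later.

```groovy
import groovy.xml.XmlParser    // assumption: Groovy 3+
import groovy.xml.XmlSlurper   // assumption: Groovy 3+

def xml = '<html><head><title>Title</title></head><body><h1>Header</h1></body></html>'

// XmlParser: an in-memory tree of Node objects; navigation returns a NodeList
def parsed = new XmlParser().parseText(xml)
def parsedRootType = parsed.getClass().simpleName
def parsedBodyType = parsed.body.getClass().simpleName

// XmlSlurper: GPathResult subclasses; navigation returns another path result
def slurped = new XmlSlurper().parseText(xml)
def slurpedRootType = slurped.getClass().simpleName
def slurpedBodyType = slurped.body.getClass().simpleName

println "XmlParser:  root=${parsedRootType}, body=${parsedBodyType}"
println "XmlSlurper: root=${slurpedRootType}, body=${slurpedBodyType}"
```
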
Frederic Laruelle

Sep 29, 2010, 3:24:06 PM
to us...@groovy.codehaus.org
Tks Robert,
this is a great post...

Can you help me understand how the difference you pointed out ("storing intermediate results" vs. not)
explains the following behavior?

Given this example:
def xml = '<html><head><title>Title</title></head><body><h1><i>The</i>Header</h1></body></html>'

XmlParser
def root = new XmlParser().parseText(xml)

new XmlNodePrinter(preserveWhitespace:true).print(root.body.h1[0])
output >>
<h1>
  <i>The</i>
Header</h1>
println(root.body.h1[0].text())
output >>
Header

XmlSlurper
def root = new XmlSlurper().parseText(xml)

new XmlNodePrinter(preserveWhitespace:true).print(root.body.h1[0])
output >>
TheHeader
println(root.body.h1[0].text())
output >>
TheHeader

For the project I am working on, I need the following two outputs from the root.body.h1[0] node:
1. properly spaced text contents: "The Header"
2. DOM fragment:
<h1>
  <i>The</i>
Header</h1>
Paulk suggested using XmlSlurper for #1 in a different thread (the spacing is still missing, though; any ideas for that?)
And based on this thread, it seems XmlParser would help with #2...

Does that mean I need to parse the same doc twice, once with XmlParser and once with XmlSlurper?
Isn't that somewhat sub-optimal?

Is there a way to meet both of my requirements with one parser?

Tks!

Fred~

Robert Anderson

Sep 29, 2010, 5:00:44 PM
to us...@groovy.codehaus.org
Hi Frederic!
 
You can do it with regex. For instance:
 
groovy> xmls =     [
groovy>                '<html><head><title>Title</title></head><body><h1><i>The</i>Header 1</h1></body></html>',
groovy>                '<html><head><title>Title</title></head><body><h1><i>The Super </i>Header 2</h1></body></html>',
groovy>                '<html><head><title>Title</title></head><body><h1><i>The Power</i>Header</h1></body></html>'
groovy>            ]
groovy> xmls.each { doc ->
groovy>     h1Content = (doc =~ /<h1>.*<\/h1>/)
groovy>     h1Text = h1Content[0]
groovy>     h1WithoutXml = h1Text.replaceAll("<(.|\n)*?>", ' ').trim()
groovy>     println h1Text
groovy>     println h1WithoutXml
groovy> }
 
<h1><i>The</i>Header 1</h1>
The Header 1
<h1><i>The Super </i>Header 2</h1>
The Super  Header 2
<h1><i>The Power</i>Header</h1>
The Power Header
 
Greetings,

--
Robert Anderson Nogueira de Oliveira
_________________________
MSN: ranop...@hotmail.com

"Absence of evidence is not evidence of absence." (Carl Sagan)


Frederic Laruelle

Sep 29, 2010, 5:10:15 PM
to us...@groovy.codehaus.org
Tks Robert.
I was hoping for something a little more systematic. :-)
This helps with getting properly spaced text from XmlSlurper results...
but doesn't answer whether I can get my 2nd requirement met with XmlSlurper...
or whether I can get both requirements met with one parser...

Is my use case really that unusual? (I would have thought otherwise)

Fred~

Robert Anderson

Sep 29, 2010, 10:21:26 PM
to us...@groovy.codehaus.org
Frederic,
 
XmlParser and one parser :)
 
groovy> xml = '<html><head><title>Title</title></head><body><h1><i>The</i>Header</h1></body></html>'
groovy> html = new XmlParser().parseText(xml)
groovy> writer = new StringWriter()
groovy> new XmlNodePrinter(new PrintWriter(writer)).print(html.body.h1[0]) 
groovy> h1XML = writer.toString()
groovy> println h1XML
groovy> h1Content = h1XML.replaceAll("<(.|\n)*?>|(\n)", '').trim().replaceAll(/\s+/,' ') 
groovy> println h1Content
 
<h1>
  <i>
    The
  </i>
  Header
</h1>
The Header
 
An interview about XmlSlurper:
 
 
 
Cheers,

--
Robert Anderson Nogueira de Oliveira
_________________________
MSN: ranop...@hotmail.com

"Absence of evidence is not evidence of absence." (Carl Sagan)

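A regex-free variant of the same single-parse idea, as a rough sketch: parse once with XmlParser, print the fragment with XmlNodePrinter, and build the spaced text by walking the node's children in document order. textParts is a hypothetical helper, not part of Groovy; the imports assume Groovy 3 or later (on older versions both classes live in groovy.util and need no import).

```groovy
import groovy.xml.XmlParser        // assumption: Groovy 3+
import groovy.xml.XmlNodePrinter   // assumption: Groovy 3+

// Hypothetical helper: collect the non-empty text pieces under a node, in document order
List<String> textParts(Node node) {
    def parts = []
    node.children().each { child ->
        if (child instanceof Node) {
            parts.addAll(textParts(child))      // descend into child elements
        } else {
            def piece = child.toString().trim() // text children are plain strings
            if (piece) parts << piece
        }
    }
    return parts
}

def xml = '<html><head><title>Title</title></head><body><h1><i>The</i>Header</h1></body></html>'
def h1 = new XmlParser().parseText(xml).body.h1[0]

// Requirement 2: the DOM fragment, captured in a string
def writer = new StringWriter()
def printer = new XmlNodePrinter(new PrintWriter(writer))
printer.preserveWhitespace = true
printer.print(h1)
def h1Fragment = writer.toString()

// Requirement 1: text content with a space between pieces
def h1Text = textParts(h1).join(' ')

println h1Fragment
println h1Text
```

Joining with a single space is a design choice for this example; mixed content that already carries its own spacing may need a different join rule.
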

Frederic Laruelle

Sep 30, 2010, 3:59:19 AM
to us...@groovy.codehaus.org
Nice job,
Tks Robert!

Fred~