Parsing XML / XHTML

Andrew

Oct 12, 2011, 7:43:22 PM
to gosu-lang
Hi,

Is anyone able to point me towards an example of using Gosu to parse
XML or XHTML?

Currently I have a few projects that use TagSoup and groovy to parse
HTML, but I am starting up a new project and would like to see if I
can do it using Gosu this time.

Andrew.

Brian Chang

Oct 12, 2011, 7:58:35 PM
to gosu...@googlegroups.com
Here you go!  Yes, it's hidden away a little bit (for now), and I'm not sure if the tutorial is entirely up-to-date.  We will be more than happy to help you along on this list.

--
You received this message because you are subscribed to the Google Groups "gosu-lang" group.
To post to this group, send email to gosu...@googlegroups.com.
To unsubscribe from this group, send email to gosu-lang+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/gosu-lang?hl=en.




--
Brian
bcha...@gmail.com

Alan Keefer

Oct 12, 2011, 8:45:47 PM
to gosu...@googlegroups.com
There's also a class built-in to Gosu called SimpleXmlNode, which is
waaayyyy less full-featured than XmlElement (no handling of namespaces
or prefixes or anything), but depending on your use case it might also
be acceptable.

-Alan

Peter Rexer

Oct 12, 2011, 10:27:18 PM
to gosu...@googlegroups.com
One example I've put together is a project on GitHub for integrating Jira with AgileZen.  See https://github.com/prexer/Jira2AgileZen.  As that project shows, you can just drop a few XSD files somewhere in your classpath and then easily access XML documents that conform to those XSDs from Gosu.  The project is also useful as a template, since the classpath arguments for the Gosu editor are already set up, so it behaves pretty well.  Jira has SOAP/RPC APIs, so I used Axis to generate a jar for accessing them (SOAP APIs are pretty complex beasts), and for the AgileZen REST APIs I created a few XSDs.

Hope that helps. 

Andrew Myers

Oct 13, 2011, 2:13:12 AM
to gosu...@googlegroups.com
Thanks all.

My parsing previously has been quite "ad hoc".  Because I'm parsing XML it's been something like an XPath expression to get the value from, say, the 1st cell in the 3rd row of the 2nd table on the page, or something similar.

So I'm not sure how or if an xsd fits into this scenario?

When the kids are asleep later I'll have a try and see how I go anyway.

Regards,
Andrew

Sent from my mobile

Carson Gross

Oct 13, 2011, 2:23:19 AM
to gosu...@googlegroups.com
If you've got an XSD for your content, that's definitely the way to go.  Maybe Dlank can chime in with advice as well.

Also, if it is feasible, setting the project up on Github would let us help out in case stuff gets straight crazy, like at the end of this video:

http://www.youtube.com/watch?v=WXtpNm_a4Us

Cheers,
Carson

Andrew Myers

Oct 13, 2011, 2:39:40 AM
to gosu...@googlegroups.com
Will see if I can share it.  In that earlier email I meant "because I'm parsing *HTML*"...

Sent from my mobile

Dana Lank

Oct 13, 2011, 10:59:53 AM
to gosu-lang
Parsing HTML is a little bit tricky. You can't use an XML parser
unless it happens to be XHTML. You should be able to use TagSoup from
Gosu, just place it into your classpath. This prints all text content
from http://google.com:

uses org.xml.sax.helpers.DefaultHandler
uses java.net.URL

// TagSoup's JAXP factory yields a SAX parser that tolerates malformed HTML
var factory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl()
var is = new URL( "http://google.com" ).openStream()
factory.newSAXParser().parse( is, new DefaultHandler() {
  // Called for each run of text; only ch[start .. start+length) is valid
  override function characters( ch : char[], start : int, length : int ) {
    print( new String( ch, start, length ) )
  }
} )
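
One detail worth checking in any SAX handler: characters() is handed a shared buffer, and only the slice from start to start + length belongs to the current event. Here is a minimal Java sketch of that, using only the JDK's SAX parser on a small well-formed XHTML string rather than TagSoup and a live URL. The class name TextExtractor and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of TagSoup or Gosu.
public class TextExtractor {

    // Collects all character data from a well-formed XML/XHTML string.
    public static String extractText(String xhtml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Append only the slice that belongs to this event.
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // prints "Hi there"
        System.out.println(extractText("<p>Hi <b>there</b></p>"));
    }
}
```

Since TagSoup's SAXFactoryImpl is a JAXP factory, swapping it in for SAXParserFactory.newInstance() should leave the handler unchanged.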

Dana

Andrew Myers

Oct 13, 2011, 10:19:48 PM
to gosu...@googlegroups.com
Hi Dana,
 
Thanks for this.  This seems close to how I'm doing it now in Groovy.  In Groovy I use TagSoup and XmlSlurper, and do something like this:
 
 
        Car result = new Car()
        def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
        def html = slurper.parseText(inputData) // inputData has previously been fetched from a web page using HTTPClient
 
        html.'**'.findAll { it.@class == 'right halves' }.table[0].each { item ->
            result.power = Integer.parseInt(item.tr[2].td[0].text())
            result.handling = Integer.parseInt(item.tr[2].td[1].text())
            result.acceleration = Integer.parseInt(item.tr[2].td[2].text())
        }
 
So it's looking like I can do something similar with Gosu using your suggestion.  I will try this out tonight :-)
 
Regards,
Andrew.

Carson Gross

Oct 13, 2011, 11:57:00 PM
to gosu...@googlegroups.com
I've been helping Andrew get started on this, and here is the equivalent Gosu:

https://github.com/carsongross/XML-Demo/blob/master/test.gsp

It's ugly because I'm reserializing the HTML to XHTML, which is kind of stupid, but, hey, the java XML APIs are so brain dead that I don't feel too bad about it.  The core gosu looks like this:

var cars =
  html.Descendents
      .where( \ elt -> elt.Attributes["class"] == "right halves" and elt.Name == "table" )
      .map( \ s -> new Car() {
        :Power =  s.children("tr")[2].children("td")[0].Text.toInt(),
        :Handling =  s.children("tr")[2].children("td")[1].Text.toInt(),
        :Acceleration =  s.children("tr")[2].children("td")[2].Text.toInt()
      })

Which isn't too bad.  But, shhhh, I had to cheat and add that children() method via an enhancement:

https://github.com/carsongross/XML-Demo/blob/master/src/hacks/SimpleX...

I wouldn't mind if we made that part of the core SimpleXmlNode API.  Or, maybe we should make SimpleXmlNode extend Map<String, List<SimpleXmlNode>>, so you could say this:

var cars =
  html.Descendents
      .where( \ elt -> elt.Attributes["class"] == "right halves" and elt.Name == "table" )
      .map( \ s -> new Car() {
        :Power =  s["tr"][2]["td"][0].Text.toInt(),
        :Handling =  s["tr"][2]["td"][1].Text.toInt(),
        :Acceleration =  s["tr"][2]["td"][2].Text.toInt()
      })

Which, in my opinion anyway, doesn't suck, but maybe it's a bit too much of a hack...
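
For comparison, here is a rough Java sketch of the same lookup against the JDK's DOM API, assuming the page has already been converted to well-formed XHTML. The children() helper plays the role of the enhancement method. The class name, method names, and sample markup are made up for illustration and are not taken from the XML-Demo project.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative only.
public class CarTableDemo {

    // Direct child elements of parent with the given tag name
    // (getElementsByTagName would also return deeper descendants).
    public static List<Element> children(Element parent, String name) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && n.getNodeName().equals(name)) {
                result.add((Element) n);
            }
        }
        return result;
    }

    // Reads the first three cells of the third row of the first
    // table whose class attribute is "right halves".
    public static int[] readStats(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        NodeList tables = doc.getElementsByTagName("table");
        for (int i = 0; i < tables.getLength(); i++) {
            Element table = (Element) tables.item(i);
            if ("right halves".equals(table.getAttribute("class"))) {
                List<Element> cells = children(children(table, "tr").get(2), "td");
                int[] stats = new int[3];
                for (int c = 0; c < 3; c++) {
                    stats[c] = Integer.parseInt(cells.get(c).getTextContent().trim());
                }
                return stats;
            }
        }
        throw new IllegalArgumentException("no table with class 'right halves'");
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><table class=\"right halves\">"
                + "<tr><td>a</td></tr><tr><td>b</td></tr>"
                + "<tr><td>7</td><td>5</td><td>9</td></tr>"
                + "</table></body></html>";
        int[] s = readStats(page);
        // prints "7 5 9" (power, handling, acceleration in the thread's terms)
        System.out.println(s[0] + " " + s[1] + " " + s[2]);
    }
}
```

The amount of ceremony here is a fair illustration of why a one-line children() enhancement is attractive.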

Cheers,
Carson

Alan Keefer

Oct 14, 2011, 12:08:53 AM
to gosu...@googlegroups.com
Yeah, the whole point of SimpleXmlNode is to be a totally brain-dead
way to, essentially, treat an Xml document like JSON: just a bunch of
stuff with name/value pairs for attributes and a list of child
elements you can add to/remove from. There are likely improvements to
be made for common cases, but I wouldn't want to make it implement
Map, since the interpretation there is hugely ambiguous: i.e. there
are 10 different things you could think that it means, so pinning it
to "children with this name" is totally arbitrary and unlikely to be
what someone expects.

In general my inclination is to keep the implementation brain-dead
unless there's a clear indication that people are actually using the
API in a particular fashion, and I'm not sure we've got enough data
for that yet.

So for now, I'd say just use your enhancement method. That's kind of
what they're there for: extending an API in a way that makes sense
for your particular use case, but which might not really make sense
for every client of that API.
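
The shape described here (a name, attribute name/value pairs, and a list of children) can be sketched in a few lines. This is a hypothetical Java class: SimpleNode and its methods are invented for illustration and are not Gosu's SimpleXmlNode API. It also shows why a Map-style lookup is ambiguous, since the same key could plausibly mean children or an attribute.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a "brain-dead" XML node; not the Gosu class.
public class SimpleNode {
    public final String name;
    public String text = "";
    public final Map<String, String> attributes = new LinkedHashMap<>();
    public final List<SimpleNode> children = new ArrayList<>();

    public SimpleNode(String name) {
        this.name = name;
    }

    // Adds a child and returns this node so calls can be chained.
    public SimpleNode add(SimpleNode child) {
        children.add(child);
        return this;
    }

    // One possible meaning of node["x"]: direct children named x.
    public List<SimpleNode> childrenNamed(String childName) {
        List<SimpleNode> result = new ArrayList<>();
        for (SimpleNode c : children) {
            if (c.name.equals(childName)) {
                result.add(c);
            }
        }
        return result;
    }

    // Another possible meaning of node["x"]: the attribute x (null if absent).
    public String attribute(String attrName) {
        return attributes.get(attrName);
    }

    public static void main(String[] args) {
        SimpleNode row = new SimpleNode("tr");
        row.attributes.put("td", "an attribute that happens to be called td");
        row.add(new SimpleNode("td")).add(new SimpleNode("td"));
        System.out.println(row.childrenNamed("td").size()); // prints 2
        System.out.println(row.attribute("td"));
    }
}
```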

-Alan

Carson Gross

Oct 14, 2011, 12:14:34 AM
to gosu...@googlegroups.com
Yeah.  Thinking about it, if anything, I would expect the node to act like a map for the attributes, rather than the child nodes, but both are probably insane.

Just thinking out loud.

Cheers,
Carson

Dana Lank

Oct 20, 2011, 3:35:24 PM
to gosu-lang
If you adopted XPath syntax, you could look up node["childName"]
or node["@attributeName"]... Or even
node["childName/grandchildName/@attributeName"]
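
The JDK already ships an XPath engine, so path-style lookups like these can be tried out directly. Here is a minimal Java sketch, with a made-up XPathLookupDemo class and sample document. It is not a proposal for how SimpleXmlNode would implement indexing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; names and sample XML are illustrative only.
public class XPathLookupDemo {

    // Evaluates an XPath expression relative to the document's root element
    // and returns the result as a string ("" when nothing matches).
    public static String lookup(String xml, String expression) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, root);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<node id=\"n1\"><childName>"
                + "<grandchildName attributeName=\"v\"/>"
                + "</childName></node>";
        // Attribute of the context node: prints "n1"
        System.out.println(lookup(xml, "@id"));
        // Attribute of a grandchild: prints "v"
        System.out.println(lookup(xml, "childName/grandchildName/@attributeName"));
    }
}
```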

Dana

Dana Lank

Oct 20, 2011, 3:36:45 PM
to gosu-lang
Although the 3rd example would be silly, since
node["childName"]["grandchildName"]["@attributeName"] would be the same thing.