Sitemap parser example

Adrianogf

May 3, 2013, 12:27:56 PM
to crawler...@googlegroups.com
Hi.

Could anyone give me an example of how to parse a sitemap using the SiteMap class? I'm having trouble working it out from the Javadoc alone.

Thanks.
Adrianogf

Lewis John Mcgibbney

May 3, 2013, 2:39:17 PM
to crawler...@googlegroups.com
Hi Adrian,
Off the top of my head, one resource you can look at is our tests for the SiteMapParser [0].
I am going to work on pushing an example project to GitHub that shows general API usage; hopefully, in the process, I can contribute as much Javadoc back as possible.
Is there any particular functionality you are looking for, or something you are trying to do that is not obvious?
Thanks
Lewis


Adrianogf

May 7, 2013, 2:32:17 PM
to crawler...@googlegroups.com
Hi, thanks. The parser tests helped a lot.

I'm trying to get the URLs from http://www.tricae.com.br/sitemap.xml, but I'm getting an exception. How can I fix that?

[Fatal Error] :3890:57: The entity "Acirc" was referenced, but not declared.
crawlercommons.sitemaps.UnknownFormatException: Error parsing XML for http://www.tricae.com.br/sitemap.xml
    at crawlercommons.sitemaps.SiteMapParser.processXml(SiteMapParser.java:212)
    at crawlercommons.sitemaps.SiteMapParser.processXml(SiteMapParser.java:121)
    at crawlercommons.sitemaps.SiteMapParser.parseSiteMap(SiteMapParser.java:96)
    at com.indekse.Teste.main(Teste.java:51)
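
The `[Fatal Error]` line names the cause: `Acirc` is an HTML named entity, and XML itself predefines only `&amp;`, `&lt;`, `&gt;`, `&quot;` and `&apos;`, so a sitemap without a DTD that uses `&Acirc;` is not well-formed. A minimal sketch that reproduces this with the JDK's built-in DOM parser alone, without crawler-commons; the class name, method name and sample strings are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class UndeclaredEntityDemo {

    /** Returns the parser's error message, or null if the XML is well-formed. */
    static String parseError(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // Raised for well-formedness violations such as an undeclared entity.
            return e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected for in-memory input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample documents, not taken from the real sitemap.
        String ok = "<urlset><url><loc>http://example.com/a?x=1&amp;y=2</loc></url></urlset>";
        String bad = "<urlset><url><loc>http://example.com/1&Acirc;&ordm;.html</loc></url></urlset>";
        System.out.println("ok:  " + parseError(ok));
        System.out.println("bad: " + parseError(bad));
    }
}
```

If this fails in the same way, the problem lies in the sitemap content rather than in how the parser is being called.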

Lewis John Mcgibbney

May 8, 2013, 2:15:02 PM
to crawler...@googlegroups.com
Hi Adrianogf,
I am really sorry but I cannot get around to this today. I am too busy.
I would like to help, and will try my best when I can.
Please update this thread if you make some progress as it would be nice to see what is causing the Exception.
Lewis

Lewis John Mcgibbney

May 8, 2013, 2:17:55 PM
to crawler...@googlegroups.com
Actually, upon closer inspection, I can't even open this XML in Firefox!
This particular sitemap entry is the one my browser chokes on:
  <url>
    <lastmod>2011-10-28</lastmod>
    <loc>http://www.tricae.com.br//Triciclo-Meu-1&Acirc;&ordm;-Tico-Tico-Europa--Bandeirante-1302.html</loc>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
Maybe we need to add some kind of normalization to the sitemap parser?
Any ideas, guys?
Lewis
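
One possible shape for such a normalization step, sketched with the JDK only: before parsing, rewrite every named entity that XML does not predefine, turning known HTML entities into numeric character references and escaping the ampersand of unknown ones. This is not crawler-commons code. `SitemapEntityNormalizer`, its methods and the three-entry entity table are hypothetical, and a real implementation would need the full HTML entity list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SitemapEntityNormalizer {

    // Matches named references such as &Acirc; (numeric ones like &#186; are left alone).
    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // The only named entities XML defines without a DTD.
    private static final Set<String> XML_PREDEFINED =
            new HashSet<>(Arrays.asList("amp", "lt", "gt", "quot", "apos"));

    // Assumption: a tiny illustrative subset of the HTML entity table.
    private static final Map<String, Integer> HTML_ENTITIES = new HashMap<>();
    static {
        HTML_ENTITIES.put("Acirc", 0xC2);
        HTML_ENTITIES.put("ordm", 0xBA);
        HTML_ENTITIES.put("nbsp", 0xA0);
    }

    /** Rewrites undeclared named entities so the text can be parsed as XML. */
    static String normalize(String xml) {
        // Limitation: this also rewrites matches inside CDATA sections and comments.
        Matcher m = NAMED_ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String replacement;
            if (XML_PREDEFINED.contains(name)) {
                replacement = m.group();                          // keep &amp; etc. untouched
            } else if (HTML_ENTITIES.containsKey(name)) {
                replacement = "&#" + HTML_ENTITIES.get(name) + ";"; // numeric character reference
            } else {
                replacement = "&amp;" + name + ";";               // unknown name: keep it as literal text
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Normalizes the sitemap text, parses it, and returns the content of every <loc> element. */
    static List<String> extractLocs(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Sitemaps need no DOCTYPE; refusing one avoids external-entity tricks in untrusted input.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(normalize(xml))));
            NodeList locs = doc.getElementsByTagName("loc");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < locs.getLength(); i++) {
                urls.add(locs.item(i).getTextContent().trim());
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalStateException("Sitemap is still not parseable after normalization", e);
        }
    }
}
```

Calling `SitemapEntityNormalizer.extractLocs(...)` on the entry above should yield one URL containing the characters `Â` and `º`. That pair looks like a mis-encoded `º`, so normalization would only make the sitemap parseable; it would not repair the underlying encoding problem in the URL.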