Sitemap parser performance test

Sebastian Nagel

Mar 8, 2017, 8:41:13 AM
to crawler-commons
Hi,

In #116 we discussed the performance of the DOM and SAX implementations of the sitemap parser.
To make this reproducible, I wrote a class to test the parser on sitemaps read from WARC files:
  https://github.com/sebastian-nagel/sitemap-performance-test/

WARC files containing sitemaps (various types, including garbage!) are here:
  s3://commoncrawl-seeds/sitemaps/test/
    sitemap-test-2017-03-03.warc.gz  (1.5 GB)
    sitemap-test-2017-03-04.warc.gz  (450 MB)

First results with these WARCs:
  sitemap-test-2017-03-03.warc.gz
    5' 38'' DOM
    3' 43'' SAX
  sitemap-test-2017-03-04.warc.gz
    5' 49'' DOM
    4' 52'' SAX

- Java properties to run the tests: -Dwarc.index=false -Dsitemap.strict=false -Dsitemap.partial=true
- SAX: -Dsitemap.useSax=true vs. DOM: -Dsitemap.useSax=false
- current 0.8-SNAPSHOT with fixes for #151 and #154
- should be extensible to measure memory usage
- and also to compare whether DOM and SAX parser extract the same set of URLs
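The last point, checking that the DOM and SAX parsers extract the same set of URLs, boils down to a symmetric set difference. A minimal sketch (the function name and sample URLs are illustrative, not part of crawler-commons or the test project):

```python
def compare_url_sets(dom_urls, sax_urls):
    """Return URLs found by only one parser, as (dom_only, sax_only)."""
    dom, sax = set(dom_urls), set(sax_urls)
    return sorted(dom - sax), sorted(sax - dom)

# Example: two hypothetical extraction results from the same sitemap
dom_only, sax_only = compare_url_sets(
    ["http://example.com/a", "http://example.com/b"],
    ["http://example.com/b", "http://example.com/c"],
)
print(dom_only)  # ['http://example.com/a']
print(sax_only)  # ['http://example.com/c']
```

Empty differences in both directions would indicate the two implementations agree on that sitemap.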

Best,
Sebastian

Ken Krugler

Mar 12, 2017, 4:12:12 PM
to crawler...@googlegroups.com
Hi Sebastian,

Thanks for posting; the results look good.

Is this where you ran across the issue of the SAX parser generating invalid URLs from broken XML?

Regards,

— Ken
--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr



Sebastian Nagel

Mar 13, 2017, 4:46:42 AM
to crawler-commons
> Is this where you ran across the issue of the SAX parser generating invalid URLs from broken XML?

You mean #153? No, this was seen in the logs of a Hadoop job fetching and parsing a large batch of sitemaps.
Instead, this test surfaced #154 and made #151 reproducible. It's now easy to check for regressions by comparing
the number of successfully parsed documents and extracted URLs after a code change.
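That regression check amounts to comparing per-run counters before and after a code change. A small sketch, where the counter names (`parsed_docs`, `extracted_urls`) are illustrative and not taken from the actual test harness:

```python
def check_counts(before, after):
    """Flag any counter that dropped between two runs (keys are illustrative)."""
    return [
        f"regression in {key}: {before[key]} -> {after[key]}"
        for key in ("parsed_docs", "extracted_urls")
        if after[key] < before[key]
    ]

# Example: hypothetical counts from runs before and after a code change
before = {"parsed_docs": 1000, "extracted_urls": 50000}
after = {"parsed_docs": 1000, "extracted_urls": 49000}
print(check_counts(before, after))
# ['regression in extracted_urls: 50000 -> 49000']
```

An empty result means neither the number of successfully parsed documents nor the number of extracted URLs decreased.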

If time permits, I'll adapt this project to also test the robots.txt parser - Common Crawl already provides WARC files
with all robots.txt responses.

Sebastian