Sitemap parser performance test

Sebastian Nagel

Mar 8, 2017, 8:41:13 AM
to crawler-commons
Hi,

In #116 we discussed the performance of the DOM and SAX implementations of the sitemap parser.
To make this reproducible, I wrote a class to test the parser on sitemaps read from WARC files:
  https://github.com/sebastian-nagel/sitemap-performance-test/

WARC files containing sitemaps (various types, including garbage!) are here:
  s3://commoncrawl-seeds/sitemaps/test/
    sitemap-test-2017-03-03.warc.gz  (1.5 GB)
    sitemap-test-2017-03-04.warc.gz  (450 MB)

First results with these WARCs:
  sitemap-test-2017-03-03.warc.gz
    5' 38'' DOM
    3' 43'' SAX
  sitemap-test-2017-03-04.warc.gz
    5' 49'' DOM
    4' 52'' SAX

- Java properties to run the tests: -Dwarc.index=false -Dsitemap.strict=false -Dsitemap.partial=true
- SAX: -Dsitemap.useSax=true vs. DOM: -Dsitemap.useSax=false
- current 0.8-SNAPSHOT with fixes for #151 and #154
- should be extensible to measure memory usage
- and also to compare whether DOM and SAX parser extract the same set of URLs
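The last point, checking that the DOM and SAX parsers extract the same set of URLs, boils down to a symmetric set difference. A minimal sketch (the function name and sample URLs are illustrative, not part of crawler-commons or the test project):

```python
def compare_url_sets(dom_urls, sax_urls):
    """Return URLs found by only one parser, as (dom_only, sax_only)."""
    dom, sax = set(dom_urls), set(sax_urls)
    return sorted(dom - sax), sorted(sax - dom)

# Example: two hypothetical extraction results from the same sitemap
dom_only, sax_only = compare_url_sets(
    ["http://example.com/a", "http://example.com/b"],
    ["http://example.com/b", "http://example.com/c"],
)
print(dom_only)  # ['http://example.com/a']
print(sax_only)  # ['http://example.com/c']
```

Empty differences in both directions would indicate the two implementations agree on that sitemap.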

Best,
Sebastian

Ken Krugler

Mar 12, 2017, 4:12:12 PM
to crawler...@googlegroups.com
Hi Sebastian,

Thanks for posting; the results look good.

Is this where you ran across the issue of the SAX parser generating invalid URLs from broken XML?

Regards,

— Ken
--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr



Sebastian Nagel

Mar 13, 2017, 4:46:42 AM
to crawler-commons
> Is this where you ran across the issue of the SAX parser generating invalid URLs from broken XML?

You mean #153? No, this was seen in the logs of a Hadoop job fetching and parsing a large batch of sitemaps.
Instead, this test surfaced #154 and made #151 reproducible. It's now easy to check for regressions by comparing
the number of successfully parsed documents and extracted URLs after a code change.
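That regression check amounts to comparing per-run counters before and after a code change. A small sketch, where the counter names (`parsed_docs`, `extracted_urls`) are illustrative and not taken from the actual test harness:

```python
def check_counts(before, after):
    """Flag any counter that dropped between two runs (keys are illustrative)."""
    return [
        f"regression in {key}: {before[key]} -> {after[key]}"
        for key in ("parsed_docs", "extracted_urls")
        if after[key] < before[key]
    ]

# Example: hypothetical counts from runs before and after a code change
before = {"parsed_docs": 1000, "extracted_urls": 50000}
after = {"parsed_docs": 1000, "extracted_urls": 49000}
print(check_counts(before, after))
# ['regression in extracted_urls: 50000 -> 49000']
```

An empty result means neither the number of successfully parsed documents nor the number of extracted URLs decreased.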

If time permits, I'll adapt this project to also test the robots.txt parser - Common Crawl already provides WARC files
with all robots.txt responses.

Sebastian