Hi,
in #116 we discussed the performance of the DOM and SAX implementations of the sitemap parser.
To make it reproducible, I've written a class to test the parser with sitemaps read from WARC files:
https://github.com/sebastian-nagel/sitemap-performance-test/
WARC files containing sitemaps (various types, including garbage!) are here:
s3://commoncrawl-seeds/sitemaps/test/
sitemap-test-2017-03-03.warc.gz (1.5 GB)
sitemap-test-2017-03-04.warc.gz (450 MB)
First results with these WARCs:
sitemap-test-2017-03-03.warc.gz
5' 38'' DOM
3' 43'' SAX
sitemap-test-2017-03-04.warc.gz
5' 49'' DOM
4' 52'' SAX
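
To make the numbers above easier to compare, here is a small sketch that converts the minute/second timings to seconds and computes the DOM-to-SAX ratio. The class and method names are only for illustration and are not part of the test project.

```java
import java.util.Locale;

public class SitemapTimings {
    /** Converts a minutes/seconds pair such as 5' 38'' to total seconds. */
    public static int toSeconds(int minutes, int seconds) {
        return minutes * 60 + seconds;
    }

    /** Ratio of DOM time to SAX time; values above 1.0 mean SAX was faster. */
    public static double ratio(int domSeconds, int saxSeconds) {
        return (double) domSeconds / saxSeconds;
    }

    public static void main(String[] args) {
        // Timings copied from the results listed above.
        int dom0303 = toSeconds(5, 38), sax0303 = toSeconds(3, 43);
        int dom0304 = toSeconds(5, 49), sax0304 = toSeconds(4, 52);

        System.out.printf(Locale.ROOT, "2017-03-03: DOM %d s, SAX %d s, ratio %.2f%n",
                dom0303, sax0303, ratio(dom0303, sax0303));
        System.out.printf(Locale.ROOT, "2017-03-04: DOM %d s, SAX %d s, ratio %.2f%n",
                dom0304, sax0304, ratio(dom0304, sax0304));
    }
}
```

In these two runs, DOM took roughly 1.5 times and 1.2 times as long as SAX, respectively.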
- Java properties to run the tests: -Dwarc.index=false -Dsitemap.strict=false -Dsitemap.partial=true
- SAX: -Dsitemap.useSax=true vs. DOM: -Dsitemap.useSax=false
- current 0.8-SNAPSHOT with fixes for #151 and #154
- should be extensible to measure memory usage
- and also to compare whether the DOM and SAX parsers extract the same set of URLs
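
A rough sketch of what the last two extensions could look like, assuming the URLs extracted by each parser have already been collected into plain sets of strings. The class, the helper names, and the example URLs are hypothetical and are not taken from the test project or from crawler-commons. The heap figure is only a coarse estimate, since garbage collection timing affects it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class ParserComparison {
    /** Returns the URLs present in a but missing from b, sorted for stable output. */
    public static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> diff = new TreeSet<>(a);
        diff.removeAll(b);
        return diff;
    }

    /** Approximate heap currently in use, in bytes. */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        // Hypothetical URL sets standing in for the output of the two parsers.
        Set<String> domUrls = new HashSet<>(Arrays.asList(
                "http://example.com/a", "http://example.com/b"));
        Set<String> saxUrls = new HashSet<>(Arrays.asList(
                "http://example.com/b", "http://example.com/c"));

        System.out.println("Only in DOM: " + onlyIn(domUrls, saxUrls));
        System.out.println("Only in SAX: " + onlyIn(saxUrls, domUrls));
        System.out.println("Used heap (bytes): " + usedHeapBytes());
    }
}
```

If both difference sets are empty, the two parsers extracted the same URLs for that input.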
Best,
Sebastian