Yesterday I ran a few benchmarks probing DOM performance, and they are now available in the
benchmarks repository on GitHub. Several DOM implementations are tested: css4j-dom4j, plain DOM4J, Css4j's native DOM, and the DOM implementation that comes with the JDK (a Xerces DOM fork).
Currently, the tests only measure the speed at which a document is parsed into a DOM implementation. One test uses the
validator.nu HTML5 parser to parse a small (38 KB) HTML file. The other two parse XML files with the SAX parser bundled with the JDK: one parses a small (38 KB) XHTML file (not the same one as in the HTML test), and the other parses a larger (1 MB) XML file.
I also ran another test (not shown here) that parses the small XHTML file with the
validator.nu HTML5 parser, and found that the HTML parser is nearly 40% slower than the XML parser on the same file.
The tests are run on 4 CPU cores.
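To make "parsing a document into a DOM" concrete, here is a minimal sketch using only the JDK's own DocumentBuilder. It is not the code from the benchmarks repository: the class name ParseThroughputSketch, the tiny in-memory document and the naive timing loop are all assumptions made for illustration, and unlike JMH it performs no warm-up, forking or error estimation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of the benchmarks repository.
class ParseThroughputSketch {

    // Parse the given bytes into a JDK DOM document.
    static Document parse(DocumentBuilder builder, byte[] xml) {
        try {
            return builder.parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    // Naive throughput estimate: number of parses divided by elapsed seconds.
    static double opsPerSecond(byte[] xml, int iterations) {
        DocumentBuilder builder;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            builder = factory.newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException("No DOM builder available", e);
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            parse(builder, xml);
        }
        long elapsedNanos = System.nanoTime() - start;
        return iterations / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        // A tiny stand-in document; the real tests use 38 KB and 1 MB files.
        byte[] xml = ("<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p id=\"a\">Hello</p></body></html>")
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(opsPerSecond(xml, 1000) + " ops/s");
    }
}
```

A loop like this gives only a rough figure, since the JIT compiler is still warming up during the first iterations. That is the main reason to prefer a harness such as JMH for the actual numbers.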
HTML build
A small HTML file (the Css4j usage guide) is parsed with the HTML5 parser. Results (higher is better):
Implementation  Mode   Cnt    Score     Error  Units
Css4j-DOM4J     thrpt   32  321,760 ±   7,279  ops/s
Css4j DOM       thrpt   32  309,080 ±   2,576  ops/s
JDK             thrpt   32  359,725 ±  11,114  ops/s
Stand-alone DOM4J could not be tested, as it is not DOM-compliant enough to be used with the HTML parser.
XML build (38 KB file)
Results (higher is better):
Implementation  Mode   Cnt    Score     Error  Units
Css4j-DOM4J     thrpt   32  612,553 ±   1,988  ops/s
Css4j DOM       thrpt   32  505,988 ±   5,809  ops/s
DOM4J           thrpt   32  672,043 ±   2,660  ops/s
JDK             thrpt   32  696,178 ±   2,391  ops/s
XML build (1 MB file)
Results (higher is better):
Implementation  Mode   Cnt    Score     Error  Units
Css4j-DOM4J     thrpt   32   88,693 ±   3,131  ops/s
Css4j DOM       thrpt   32   64,077 ±   1,827  ops/s
DOM4J           thrpt   32  114,941 ±   0,839  ops/s
JDK             thrpt   32  136,875 ±   1,183  ops/s
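The scores are easier to compare as fractions of the JDK score. The following sketch does that arithmetic for the 1 MB results. It assumes the scores use a decimal comma (consistent with an error figure such as 0,839) and uses Locale.GERMANY only as a convenient decimal-comma locale. The class name RelativeThroughputSketch is made up for this example.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical helper for reading the score tables.
class RelativeThroughputSketch {

    // Assumption: scores were printed with a decimal comma.
    private static final Locale DECIMAL_COMMA = Locale.GERMANY;

    // Convert a score such as "64,077" to the number 64.077.
    static double parseScore(String score) {
        try {
            return NumberFormat.getInstance(DECIMAL_COMMA).parse(score).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a score: " + score, e);
        }
    }

    // Throughput of one implementation as a fraction of a baseline.
    static double relative(String score, String baseline) {
        return parseScore(score) / parseScore(baseline);
    }

    public static void main(String[] args) {
        String jdk = "136,875"; // JDK score for the 1 MB file
        for (String score : new String[] {"88,693", "64,077", "114,941"}) {
            System.out.printf(Locale.ROOT, "%s -> %.2f of the JDK score%n",
                    score, relative(score, jdk));
        }
    }
}
```

Under that reading, the slowest entry (64,077 ops/s) runs at a little under half the JDK's throughput on the large file.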
Profiling
I did some profiling to identify performance bottlenecks, and the results were interesting. JMH allows some basic profiling with command lines like:
java -jar build/benchmarks.jar XMLBuildBenchmark -prof "stack:lines=5;top=3;detailLine=true;period=1"
and something stands out, whenever DOM4J is involved (either stand-alone or in css4j-dom4j):
Secondary result "io.sf.carte.doc.style.css.mark.XMLBuildBenchmark.markBuildDOM4J: stack":
Stack profiler:
....[Thread state distributions]....................................................................
72,3% BLOCKED
27,6% RUNNABLE
....[Thread state: BLOCKED].........................................................................
 72,3% 100,0% java.util.Collections$SynchronizedMap.get
              org.dom4j.tree.QNameCache.get
              org.dom4j.DocumentFactory.createQName
              org.dom4j.tree.NamespaceStack.createQName
              org.dom4j.tree.NamespaceStack.pushQName
Yes, DOM4J has some contention problems. The performance on many-core systems is bad, as explained in DOM4J's issue #114, which claims a 6x improvement on a 64-core machine when replacing the original QNameCache with a non-synchronized version. None of the other contenders shows a similar issue with a BLOCKED state in the current benchmarks.
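The BLOCKED state comes from every lookup going through a single lock. The sketch below contrasts a cache wrapped in Collections.synchronizedMap with one backed by ConcurrentHashMap, whose reads do not take a global lock. It is neither DOM4J's QNameCache nor the patch from the issue. The class NameCacheSketch, its methods and the string stand-in for a QName are assumptions, and ConcurrentHashMap is simply one common way to avoid this kind of contention.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache, loosely modelled on a name-to-object lookup.
class NameCacheSketch {

    // Every get() and put() acquires the same monitor, so parser threads
    // that look up names at the same time queue up behind each other.
    private final Map<String, String> synchronizedCache =
            Collections.synchronizedMap(new HashMap<>());

    // Lookups on a ConcurrentHashMap do not take a global lock.
    private final Map<String, String> concurrentCache = new ConcurrentHashMap<>();

    // Check-then-put lookup against the synchronized map.
    String getSynchronized(String name) {
        String cached = synchronizedCache.get(name);
        if (cached == null) {
            cached = "q:" + name; // stand-in for creating a QName object
            synchronizedCache.put(name, cached);
        }
        return cached;
    }

    // The same lookup expressed as a single atomic operation.
    String getConcurrent(String name) {
        return concurrentCache.computeIfAbsent(name, n -> "q:" + n);
    }

    public static void main(String[] args) throws InterruptedException {
        NameCacheSketch cache = new NameCacheSketch();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 100_000; i++) {
                    cache.getConcurrent("p");
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println(cache.getConcurrent("p"));
    }
}
```

With only a handful of distinct element names in a typical document, nearly every call is a cache hit, so the cost of the lock dominates as the number of threads grows.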
The performance of Css4j's native DOM is disappointing though, and the profiling shows one cause:
....[Thread state: RUNNABLE]........................................................................
 72,3%  72,5% java.lang.String.intern
              io.sf.carte.doc.dom.DOMDocument.createElementNS
              io.sf.carte.doc.dom.DOMDocument.createElementNS
              io.sf.carte.doc.dom.XMLDocumentBuilder$MyContentHandler.startElement
              com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement
 26,2%  26,3% java.lang.String.intern
              io.sf.carte.doc.dom.DOMDocument.createAttributeNS
              io.sf.carte.doc.dom.XMLDocumentBuilder$MyContentHandler.setAttributes
              io.sf.carte.doc.dom.XMLDocumentBuilder$MyContentHandler.startElement
              com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement
That's because the native DOM interns the local names of elements and attributes to use less memory (and be a bit faster in some operations).
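For readers unfamiliar with interning, here is a small illustration of the trade-off. It is not code from Css4j: InternSketch and sameInstance are names invented for the example. Interning maps equal strings to one canonical instance, which saves memory and allows identity comparison, at the price of a lookup in the JVM's string table on every call.

```java
// Hypothetical demo of String.intern(), unrelated to Css4j's actual code.
class InternSketch {

    // Identity comparison: true only if both references point to the same object.
    static boolean sameInstance(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        // Two equal local names arriving from a parser as separate objects.
        String first = new String("div");
        String second = new String("div");

        System.out.println(sameInstance(first, second)); // false: two distinct objects
        System.out.println(first.equals(second)); // true: same characters

        // After interning, both refer to the single canonical "div".
        System.out.println(sameInstance(first.intern(), second.intern())); // true
    }
}
```

A parser calls createElementNS and createAttributeNS for every element and attribute, so the string table lookup sits on the hottest path of document building.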
I prepared a fe-performance branch that interns only namespace URIs, not local names, and the speed improved by about 4%. I plan more improvements to that experimental branch: the native DOM has more features than the other implementations and can be allowed to be a bit slower, but probably not by that much.