Thanks
Radu
When you compare PDF to HTML conversion performance, you need to make sure that you are using exactly the same parameters (otherwise you may be comparing apples and oranges). For example: what are the exact parameters that you pass in the call to DocPub (http://www.pdftron.com/docpub/downloads.html), and what options do you pass to the other solution?
PDFNet, which is used in DocPub CLI (or Convert::ToHtml), is significantly faster than poppler on all counts. However, settings such as resolution/DPI, flattening, JPEG vs. PNG output, and text optimization parameters all have significant performance implications.
Another thing to bear in mind is the conversion quality. There are many ways to convert PDF to HTML (http://blog.pdftron.com/2013/08/08/how-to-integrate-a-pdf-viewer-in-html5-apps/). For example, you could just rasterize the PDF to PNG or SVG and wrap it in HTML, or you could use a quick and dirty text/graphics separator, or something that produces an accurate replica for most files.
DocPub CLI (or pdftron.PDF.Convert::ToHtml) is unique in that it fits in the last category (taking care of blending & transparency, overlapping text content, optimizing text runs, etc.) and can produce accurate output for any PDF (rather than working on a PDF subset). For a bit more info about our conversion process see http://blog.pdftron.com/2013/11/15/high-quality-epub-html-from-pdf/.
On Tuesday, February 4, 2014 8:25:11 PM UTC-8, (unknown) wrote:
It looks like your numbers are off due to different conversion settings. For example, if you use PNG as the default for image backgrounds, the conversion will be much slower than with JPEG.
I re-tested your test file (pdf_reference_1-7.pdf) on our end and found that DocPub/PDFNet is actually significantly faster:
----
Test environment: Windows 7, 64-bit, 16 GB RAM, CPU i7 3.4 GHz
Download DocPub (http://www.pdftron.com/docpub/downloads.html). Btw. the perf of pdftron.PDF.Convert.ToHtml() should be identical (if you use all the same options). The command-line was:
-
docpub64 -f html --time --dpi 144 --flatten off --prefer_jpg pdf_reference_1-7.pdf
It took me 79.74 seconds.
Note: The undocumented '--time' option reports the conversion time. Given that the GPL solution doesn't do any flattening, flattening should be disabled (--flatten off), though for your test file it would not make a significant difference.
Running:
pdf2htmlEX -o EX --split-pages 1 pdf_reference_1-7.pdf
took 119.5 seconds in the best run (out of 5). The default resolution is 144.
----
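If you want to repeat this kind of comparison yourself, a small harness along the following lines may help. It is only a sketch: the helper names best_of and compare are made up for illustration, the command lines are copied from above, and it assumes both tools are on your PATH. It measures wall-clock time for the whole process (including startup), so the numbers will not match the '--time' output exactly.

```python
import subprocess
import sys
import time

# Command lines as quoted in this post; adjust paths and options to your setup.
COMMANDS = {
    "docpub": ["docpub64", "-f", "html", "--time", "--dpi", "144",
               "--flatten", "off", "--prefer_jpg", "pdf_reference_1-7.pdf"],
    "pdf2htmlEX": ["pdf2htmlEX", "-o", "EX", "--split-pages", "1",
                   "pdf_reference_1-7.pdf"],
}


def best_of(cmd, runs=5):
    """Run cmd `runs` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises on a non-zero exit status, so a failed
        # conversion is never counted as a fast one.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(commands, runs=5):
    """Return {label: best time in seconds} and print a one-line report per tool."""
    results = {}
    for label, cmd in commands.items():
        results[label] = best_of(cmd, runs)
        print(f"{label}: {results[label]:.2f} s (best of {runs})", file=sys.stdout)
    return results
```

Calling compare(COMMANDS) runs each converter five times and reports the best run, which is how the pdf2htmlEX figure above was taken. Keep the DPI, flattening and image format options aligned between the two command lines, otherwise the numbers are not comparable.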
As another example, for a typical magazine (http://goo.gl/UHACFz):
docpub64 -f html --time --dpi 144 --flatten off SRD0512.pdf
takes 52.491 seconds.
whereas converting the same file with pdf2htmlEX takes 450 seconds.
After all, if the output is not accurate or reliable, it does not matter how long the conversion takes. From this perspective the two solutions can't be compared. To give you an idea, take a look at page 1142 in your test file:
As another example, see the attached INITIAL.PDF. Text in the DocPub/PDFNet HTML output can be selected and searched, and it displays with the correct font. pdf2htmlEX displays text as images, with incorrect fonts, etc. These samples are just the tip of the iceberg. You may need to run extensive, time-consuming tests (hopefully automated) on a functional test suite in order to detect these kinds of issues. Unfortunately it is not as simple as running a perf test :( but we do this for every product release ... :)
Page 1: Text is off.
Random white lines:
Page 24: Text is incorrectly positioned and overflows columns...
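One of these checks (is the text still real, searchable text, or was it turned into an image?) is easy to automate. The sketch below is an assumption about how such a check could look, not how our test suite works: TextCollector and missing_phrases are hypothetical names, and it uses only the Python standard library. It extracts the visible text from a converted HTML page and reports which expected phrases are missing. Note that a converter may split a phrase across several elements, in which case a simple substring match like this one can report false misses.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def missing_phrases(html, expected_phrases):
    """Return the expected phrases that do not occur as text in the HTML."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # Join text nodes and collapse runs of whitespace before matching.
    text = " ".join(" ".join(parser.chunks).split())
    return [phrase for phrase in expected_phrases if phrase not in text]
```

For a page that was rasterized to an image, every expected phrase comes back as missing, which flags the page for a manual look.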