How to solve display bugs (e.g. "navigaon" instead of "navigation")?

Heck Lennon

May 11, 2018, 4:36:30 AM

to pdf2htmlEX

Hello,

Google shows I'm not the only one struggling with display bugs, where some words are displayed incorrectly even though they come out fine when copied and pasted:

https://s14.postimg.cc/3nl9o1oap/pdf2htmlex.display.bugs.png

Per the FAQ and archives, I tried a few things, but still no cigar:

apt-get install ttfautohint

pdf2htmlEX --zoom 1.3 --external-hint-tool=ttfautohint --tounicode 0 --optimize-text 1 --space-as-offset 0 --correct-text-visibility 1 myfile.pdf

=>

1. I still get a bunch of "ToUnicode CMap is not valid and got dropped for font" messages.

2. The output still shows "navigaon" instead of "navigation".

Is there something else I could try?

Thank you.

pdf2htmlEX 0.14.6 on Debian 9.4

David Hedley

May 11, 2018, 4:44:00 AM

to Heck Lennon, pdf2htmlEX

It looks like you are being affected by the ligature issue where your web browser sees letter groups such as “ti” and replaces them with a specific “ti” ligature which isn’t present in the custom font. This is really a browser bug (Chrome seems particularly bad at this), but you can work around it in JavaScript. Take a look at: https://github.com/coolwanglu/pdf2htmlEX/issues/675
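As a rough illustration of what "splitting up potential ligatures" can mean, here is a minimal sketch. It is not the code from issue 675 and not part of pdf2htmlEX. The function name splitLigatures and the list of letter pairs are assumptions made for this example, and it assumes that a zero-width non-joiner (U+200C) between two letters stops the browser from substituting a ligature, which should be checked against the fonts pdf2htmlEX actually embeds.

```javascript
// Hypothetical helper; not the code from issue 675 and not part of pdf2htmlEX.
const ZWNJ = "\u200C"; // ZERO WIDTH NON-JOINER

// Assumed list of ligature-prone lowercase pairs: ff, fi, fl, ft, ti, tt.
// The lookahead matches only the first letter of a pair, so overlapping
// groups such as "ffi" get a separator at every boundary.
const LIGATURE_BOUNDARY = /f(?=[filt])|t(?=[it])/g;

function splitLigatures(text) {
  // Append a ZWNJ to the first letter of each ligature-prone pair.
  return text.replace(LIGATURE_BOUNDARY, (letter) => letter + ZWNJ);
}

// "navigation" has one "ti" pair, so one separator is added.
console.log(splitLigatures("navigation").length); // 11: ten letters plus one ZWNJ
```

Running the function twice on the same text adds nothing further, because after the first pass no listed pair is directly adjacent any more.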

Best wishes

David

--

David Hedley

CTO

Mobile: +44 (0)7971 681088

--
You received this message because you are subscribed to the Google Groups "pdf2htmlEX" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pdf2htmlex+...@googlegroups.com.
To post to this group, send email to pdf2h...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pdf2htmlex/20555609-111f-4a34-9087-57c047020bb4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Heck Lennon

May 11, 2018, 5:13:17 AM

to pdf2htmlEX

Thanks much.

Yes indeed, the ligature issue occurs in Chrome but not Firefox, although there are other display issues that occur with both browsers.

https://s14.postimg.cc/5kr6k3eap/pdf2htmlex.display.bugs.lists.png

This didn't solve the issue with Chrome:

~# pdf2htmlEX --external-hint-tool=ttfautohint --zoom 1.3 --tounicode 0 --optimize-text 1 --space-as-offset 0 --correct-text-visibility 1 --decompose-ligature 1 myfile.pdf

The link you mentioned says that "The only robust solution to this is to process all text on the page and split up any potential ligatures. You can either post-process the HTML files to split potential ligatures, or do it on the fly in javascript as follows:"

Since I only have the PDF outputs, I can't edit the original documents.

So I'd like to try editing the HTML output. By default, everything seems to be turned into a big Base64 blob. How do I get some actual HTML text so that I can search for and replace ligatures without introducing side effects?

font-family:ff1;src:url('data:application/font-woff;base64,d09GRgABAAAAButwABMAAAAOdyQAAgAlAAAAAAAAAAAAAAAAAAAAAAAAAABGRlRNAAboWAAAABwAAAAcgJFjAkdERUYABrDoAAAB6AAAApJAgkZeR1BPUwAGvggAACpPAAB2GoWryvhHU1VCAAay0AAACzUAABSgjy/Eqk1BVEgABuh0AAAC+wAABj7vU2+3T1MvMgAAAiQAAABWAAAAVl0seH9jbWFwAAAnlAAADvoAABh6

blahblah
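The Base64 data quoted above is the embedded WOFF font (a data: URL inside a font-family/src declaration), not the page text; the text itself sits in ordinary elements elsewhere in the same HTML file. pdf2htmlEX also documents an --embed-font 0 option that should write fonts to separate files, which makes the HTML easier to read. The following is a minimal sketch of post-processing the HTML as a string so that only text between tags is changed. The function name splitLigaturesInHtml, the list of letter pairs, and the use of U+200C as a separator are assumptions for illustration; the regular expression is a simplification, not a full HTML parser.

```javascript
// Hypothetical post-processing sketch; not part of pdf2htmlEX.
const ZWNJ = "\u200C"; // ZERO WIDTH NON-JOINER

// Assumed ligature-prone pairs: ff, fi, fl, ft, ti, tt.
const LIGATURE_BOUNDARY = /f(?=[filt])|t(?=[it])/g;

// Matches, in order of preference: whole <style>/<script> blocks, comments,
// any other tag, character entities, and finally runs of plain text.
const HTML_CHUNK =
  /<(style|script)\b[\s\S]*?<\/\1\s*>|<!--[\s\S]*?-->|<[^>]*>|&[#\w]+;|[^<&]+/gi;

function splitLigaturesInHtml(html) {
  return html.replace(HTML_CHUNK, (chunk) => {
    // Leave markup, embedded CSS/JS (including Base64 fonts) and entities alone.
    if (chunk.startsWith("<") || chunk.startsWith("&")) {
      return chunk;
    }
    // Plain text run: separate the ligature-prone letter pairs.
    return chunk.replace(LIGATURE_BOUNDARY, (letter) => letter + ZWNJ);
  });
}

const sample = '<style>.ti{color:red}</style><div class="ti">navigation</div>';
console.log(splitLigaturesInHtml(sample));
```

Because tags, style blocks and script blocks are passed through untouched, class names such as "ti" and the embedded font data are not modified; only visible text gains separators. In Node, the input would typically be read with fs.readFileSync and the result written back with fs.writeFileSync.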

Thank you.

David Hedley

May 11, 2018, 5:25:15 AM

to Heck Lennon, pdf2htmlEX

You don’t edit the PDF, you need to change the HTML output. You can either do this by post-processing the HTML, or on the fly as each page is displayed, using the JavaScript I included in my response to that issue.

If you intend to use the pdf2htmlEX output on the web and want to support all modern browsers, then you are going to have to run some JavaScript on each page as it is displayed in order to work around various browser incompatibilities. This will require a bit of development work on your part.
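A minimal sketch of the on-the-fly variant might look like the following. It is an illustration under stated assumptions, not the script from issue 675: the function name splitLigaturesInNode, the list of letter pairs, and the U+200C separator are all hypothetical choices, and the example simply runs once when the document has loaded. If pages are loaded lazily, the same function would have to be called on each page element as it appears; that hook is not shown here.

```javascript
// Hypothetical on-the-fly sketch; not part of pdf2htmlEX.
const ZWNJ = "\u200C"; // ZERO WIDTH NON-JOINER
const LIGATURE_BOUNDARY = /f(?=[filt])|t(?=[it])/g; // assumed pairs: ff fi fl ft ti tt

const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE

function splitLigaturesInNode(node) {
  if (node.nodeType === TEXT_NODE) {
    // Rewrite the text in place; element structure is not touched.
    node.nodeValue = node.nodeValue.replace(
      LIGATURE_BOUNDARY,
      (letter) => letter + ZWNJ
    );
    return;
  }
  if (node.nodeType === ELEMENT_NODE && /^(script|style)$/i.test(node.nodeName)) {
    return; // never modify embedded code or CSS
  }
  // Copy the child list first so the walk is unaffected by live updates.
  for (const child of Array.from(node.childNodes || [])) {
    splitLigaturesInNode(child);
  }
}

// In a browser, process the whole document once it has been parsed.
if (typeof document !== "undefined") {
  document.addEventListener("DOMContentLoaded", () => {
    splitLigaturesInNode(document.body);
  });
}
```

The walk only reads nodeType, nodeName, nodeValue and childNodes, so it can be exercised outside a browser with plain objects of the same shape.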

I wouldn’t use the “decompose-ligature” option. It is possible (however remote) that a custom font file will not contain the decomposed characters (i.e. it might contain the “ti” glyph but not the “t” or “i” glyphs) and I’m not sure how pdf2htmlEX will cope with that scenario, particularly with custom font encodings.

If you send me the PDF page that is rendering incorrectly (as shown in your screenshot), I’ll see if my branch of pdf2htmlEX has already fixed the issues.

David
