Font misalignment in converted pdf

Raja Suba

May 2, 2016, 1:35:14 AM
to pdf2htmlEX, coolw...@gmail.com
FONT ISSUES WITH PDF TO HTML CONVERSION

1. All "ti", "fi", "tt" characters are missing

SAMPLE SCREENSHOT

<img width="515" alt="screen shot 2016-04-25 at 4 13 44 pm" src="https://cloud.githubusercontent.com/assets/7858534/14781540/b8f02d7c-0b00-11e6-8435-b788a3b61678.png">

2. Font overlapping issue

SAMPLE SCREENSHOT

<img width="712" alt="screen shot 2016-04-25 at 4 16 07 pm" src="https://cloud.githubusercontent.com/assets/7858534/14781585/0d554b7c-0b01-11e6-85ce-a0afb475b283.png">

- _NOTE: I don't get this issue with Firefox. I get the above issues in the Chrome and Safari browsers._

I AM USING

- Using the **0.13.6** version of pdf2htmlEX 
- Using the following command to convert pdf to html

pdf2htmlEX --split-pages 1 --zoom 3 --fit-width 920 --correct-text-visibility 1 --dest-dir $1 $2 2>&1

TRIED WITH

Using the **--fallback 1** option solves all of the above problems, but:

1. The fallback option reduces the clarity of the text. Please refer to the screenshots below.

SCREENSHOT WITH FALLBACK ENABLED

<img width="809" alt="screen shot 2016-04-25 at 4 21 20 pm" src="https://cloud.githubusercontent.com/assets/7858534/14781729/e450dcae-0b01-11e6-865e-830090ef3c42.png">

SCREENSHOT WITH FALLBACK DISABLED

<img width="800" alt="screen shot 2016-04-25 at 4 22 07 pm" src="https://cloud.githubusercontent.com/assets/7858534/14781728/e4496924-0b01-11e6-960c-ed71b9180ee7.png">

2. The table in the page disappears and is replaced with empty space.

PLEASE CLARIFY

1. Could you explain a bit more about the fallback option?

2. I have tried the approach above (using fallback). Please suggest a different approach if you would recommend one.

Kindly help me in this regard.

Lu Wang

May 2, 2016, 6:09:52 AM
to Raja Suba, pdf2htmlEX
Regarding the 1st issue, check whether `--tounicode -1` helps.


regards,
- Lu

Raja Suba

May 2, 2016, 8:43:26 AM
to pdf2htmlEX, rajasu...@gmail.com
Hi Lu,

Thanks for replying. `--tounicode 1` doesn't help me :( . Please suggest an alternative to sort this out.

Regards,
Rajasuba S.

Raja Suba

Jun 5, 2016, 3:52:46 AM
to pdf2htmlEX, coolw...@gmail.com

Hi,


The above issue occurs only in WebKit-based browsers such as Chrome and Safari, which apply ligatures to this text, whereas Firefox does not appear to.

A ligature is a combination of two or more letters joined as a single glyph.
(For more info: https://en.wikipedia.org/wiki/Typographic_ligature)

Reason for the issue


The missing characters are caused by the ligature support in these browsers. Let me explain how:

   1. During conversion, our tool uses poppler to convert characters to glyphs for rendering. When these browsers come across character sequences like tt, tf, ti, ff and fi, they treat them as ligatures and look for a single glyph corresponding to tt rather than two separate t glyphs.

   2. Since the corresponding ligature glyphs are not available, the browsers skip those characters and render the rest, which is why we find the characters missing.


This could be solved by

  1. Disabling ligatures in these browsers by embedding CSS in the generated content

Refer:

#543
http://stackoverflow.com/questions/19591746/prevent-ligatures-in-safari-mavericks-ios7-via-css
https://developer.mozilla.org/en/docs/Web/CSS/font-feature-settings
http://caniuse.com/#feat=font-feature
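
The CSS approach described above can be sketched as a small post-processing step on the generated HTML. This is only a sketch under assumptions: the helper name `inject_no_ligatures` and the blanket `*` selector are hypothetical and not part of pdf2htmlEX, and the rule simply uses the standard `font-variant-ligatures` and `font-feature-settings` properties covered in the links above.

```python
import re

# CSS that asks the browser not to substitute ligature glyphs.
# The "*" selector is an assumption; a narrower selector may be enough.
NO_LIGATURES_CSS = """<style>
* {
  font-variant-ligatures: none;
  -webkit-font-feature-settings: "liga" 0, "clig" 0;
  font-feature-settings: "liga" 0, "clig" 0;
}
</style>
"""


def inject_no_ligatures(html: str) -> str:
    """Insert the override just before </head> (hypothetical helper)."""
    # Skip documents that already carry the override.
    if "font-variant-ligatures: none" in html:
        return html
    match = re.search(r"</head>", html, flags=re.IGNORECASE)
    if match is None:
        # No </head> found: prepend the style block instead.
        return NO_LIGATURES_CSS + html
    # Placing it last in <head> lets it follow any earlier style rules.
    return html[:match.start()] + NO_LIGATURES_CSS + html[match.start():]
```

To use it, read each generated HTML file as text, pass it through `inject_no_ligatures`, and write the result back. The same `<style>` block could also be pasted into the output by hand.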

Please correct me if I am wrong.

Lu Wang

Jun 20, 2016, 5:31:30 PM
to Raja Suba, pdf2htmlEX
Maybe you can try to disable ligatures with CSS.

regards,
- Lu

Lu Wang

Jun 21, 2016, 5:17:58 AM
to Raja Suba, pdf2htmlEX
Ah, I just read the last few paragraphs of your previous email; somehow they were folded.

Thanks for your message.



regards,
- Lu

On Tue, Jun 21, 2016 at 5:32 AM, Raja Suba <rajasu...@gmail.com> wrote:
Hi Lu,

I tried the same (disabling ligatures with CSS) and it works fine for me in all cases.

Thanks and Regards,
Rajasuba S.

David Hedley

Oct 11, 2016, 5:10:33 AM
to pdf2htmlEX, coolw...@gmail.com

Issue 1 is due to dumb Chrome trying to use ligatures in a font that does not contain them.
You can use the JavaScript I posted at https://github.com/coolwanglu/pdf2htmlEX/issues/675
to fix the issue.

It's hard to know what the problem with issue 2 is without seeing the PDF, but I would guess it's a word-spacing issue, which Chrome/WebKit also has problems with.

You could try my branch https://github.com/davidhedley/pdf2htmlEX, in which I have rewritten how correct-text-visibility works. It also has a few fixes which may work for you.
