Why do Unicode fonts have to be embedded?

Pierre Clouthier

Dec 11, 2017, 6:13:13 PM
to libHaru
Why is it necessary to embed Unicode fonts?

The following works fine and results in a small document:

const char *ret = HPDF_LoadTTFontFromFile (pdf, "Arial", HPDF_FALSE);
HPDF_GetFont (pdf, ret, "StandardEncoding");


However, if I want to use a Unicode font, I have to do:

const char *ret = HPDF_LoadTTFontFromFile (pdf, "Arial", HPDF_TRUE);
HPDF_GetFont (pdf, ret, "UTF-8");

If I use "HPDF_FALSE" to prevent font embedding, the font does not display properly.

Using "HPDF_TRUE" bloats the document by 200K.

I understand that it might not be possible to read the document on another computer if the latter does not have a particular font. However, in the case of common fonts, this would not be a problem.

Is it possible to use Unicode fonts and not embed them? How?

spa...@idecon.it

Dec 13, 2017, 3:44:39 AM
to libHaru
Hi Pierre, 
Unicode is used to support languages that don't use the Latin alphabet (e.g. Russian, Chinese, and so on).
Basically, to benefit from Unicode, you have to embed a Unicode font (and I don't think Arial is one).
Once you have the Unicode font, you also have to take into account that your program itself must be built for Unicode.
If you don't need to support such languages, for example because you just want to produce PDFs in French, it's not necessary to embed the font in the document.

Best regards, 

Davide

Pierre Clouthier

Dec 18, 2017, 3:50:43 PM
to libHaru
Hi Davide -

Thanks for your reply. 

I make international software, so I do need Unicode.

The Amyuni PDF software manages to display Unicode characters without storing the entire font. With Amyuni, I can create a Unicode PDF that's only 42K (see attached) because the fonts are not embedded.

Haru includes the fonts, which bloats the file to 360K. 

The point is that it is possible to do Unicode without embedding the font.

(BTW I'm switching from Amyuni to Haru because there are other problems with a driver-based solution).
arabic Amyuni.pdf
arabic Haru.pdf

Adalbert Michelic

Dec 19, 2017, 4:52:40 AM
to lib...@googlegroups.com
Hi Pierre,

On 12/18/17 21:50, Pierre Clouthier wrote:
> The Amyuni PDF software manages to display Unicode characters without
> storing the entire font. With Amyuni, I can create a Unicode PDF
> that's only 42K (see attached) because the fonts are not embedded.

The fonts are also embedded in your PDF from Amyuni. I can see
Arial (~14 kiB) and ArialUnicodeMS (~22 kiB).

> Haru includes the fonts, which bloats the file to 360K.

The PDF from libharu embeds these fonts:
* ArialUnicodeMS (~73 kiB),
* Arial-ItalicMT (~24 kiB),
* Arial-BoldMT,Bold (~40 kiB),
* Arial-BoldItalicMT,Bold (~25 kiB),
* ArialMT (~36 kiB)

So you're using many more fonts in your example with Haru.

Neither program embeds the full font - only the necessary glyphs
are embedded (the sizes above are glyph data). There is, however, a
difference between libharu and Amyuni: the width tables and the
/CIDToGIDMap.

While Amyuni strips down the width table (/W) to the necessary glyphs
(that makes 297 bytes for ArialUnicodeMS), libharu doesn't - that
adds another 23 kiB just for ArialUnicodeMS.
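
To illustrate what a stripped-down /W array looks like, here is a minimal sketch in plain C. It is independent of libharu and is not the code from either library; the GlyphWidth type and the write_compact_w_array function are hypothetical names, and the CIDs and widths mentioned afterwards are illustrative values only. The sketch emits entries only for the glyphs actually used and groups consecutive CIDs into the "first [w1 w2 ...]" form of a /W array.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record: one used glyph and its advance width
 * in 1/1000 text-space units. */
typedef struct {
    unsigned cid;
    unsigned width;
} GlyphWidth;

/* Append formatted text at *pos; returns -1 if the buffer is too small. */
static int append(char *out, size_t out_size, size_t *pos, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(out + *pos, out_size - *pos, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= out_size - *pos)
        return -1;
    *pos += (size_t)n;
    return 0;
}

/*
 * Write a /W array covering only the glyphs in `used`, which must be sorted
 * by ascending CID without duplicates. Returns the number of characters
 * written (excluding the terminating NUL), or -1 if `out` is too small.
 */
int write_compact_w_array(const GlyphWidth *used, size_t count,
                          char *out, size_t out_size)
{
    size_t pos = 0;
    size_t i = 0;

    assert(out != NULL && out_size > 0);
    assert(used != NULL || count == 0);

    if (append(out, out_size, &pos, "/W [") < 0)
        return -1;

    while (i < count) {
        /* Start a new run at the current CID. */
        if (append(out, out_size, &pos, " %u [%u",
                   used[i].cid, used[i].width) < 0)
            return -1;
        i++;

        /* Extend the run while the CIDs stay consecutive. */
        while (i < count && used[i].cid == used[i - 1].cid + 1) {
            if (append(out, out_size, &pos, " %u", used[i].width) < 0)
                return -1;
            i++;
        }
        if (append(out, out_size, &pos, "]") < 0)
            return -1;
    }

    if (append(out, out_size, &pos, " ]") < 0)
        return -1;
    return (int)pos;
}
```

For used glyphs 3, 4, 5 and 36 with widths 278, 278, 355 and 667, the function writes `/W [ 3 [278 278 355] 36 [667] ]`, which shows why a table restricted to the used glyphs stays within a few hundred bytes.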

Similarly for the /CIDToGIDMap: Amyuni just sets it to /Identity (an
implicit table - 1=1, 2=2, etc.), whereas libharu puts the full table
in the PDF - adding 72 kiB just for ArialUnicodeMS.

> The point is that it is possible to do Unicode without embedding the
> font.

No - the font has to be embedded. Otherwise you're stuck with one of
the default encodings - StandardEncoding, MacRomanEncoding,
WinAnsiEncoding, PDFDocEncoding, or MacExpertEncoding.

So the font is embedded for both, but libharu isn't very good at
removing the unnecessary parts.

We're internally using a forked version of libharu, which produces
smaller files. Here's the link for reference:
https://github.com/amichelic/libharu

Our application however is doing the complete text layout and glyph
positioning in a different component and is only feeding a list of
glyph ids to libharu (instead of text). So it's not feasible to use
this version in a "normal" application.

So the only thing that our version really addresses is the
size of the /W array. The /CIDToGIDMap is set to /Identity for our
application, because we're already laying out individual glyphs. On
the other hand, Copy&Paste doesn't yet work for the produced files,
as I haven't yet implemented a proper /ToUnicode map.

I will have a look at whether I can separate the patch for reducing
the /W array from our patch collection and submit it as a pull request.
But this won't shave that much off your file.

However, it won't be possible to cut down the /CIDToGIDMap with
the way libharu writes the PDFs. This is a binary array and cannot
easily be reduced in size.

The way to get around this is to map all characters to glyphs in
the PDF driver libharu (or in our case, the output driver) and use
them, so that the /CIDToGIDMap can be set to /Identity. But this
needs a /ToUnicode map for Copy&Paste to work.
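
As an illustration of what such a /ToUnicode map contains, here is a minimal sketch in plain C, not taken from libharu or from the fork mentioned above. It assumes text is written as 2-byte glyph ids and that every glyph maps to a single code point in the Basic Multilingual Plane; the GlyphToUnicode type and write_tounicode_cmap function are hypothetical names. Sections are capped at 100 entries each, the usual limit for CMap sections.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record: glyph id (used as the 2-byte character code)
 * and the BMP code point it stands for. */
typedef struct {
    unsigned short gid;
    unsigned short unicode;
} GlyphToUnicode;

/* Append formatted text at *pos; returns -1 if the buffer is too small. */
static int append(char *out, size_t out_size, size_t *pos, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(out + *pos, out_size - *pos, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= out_size - *pos)
        return -1;
    *pos += (size_t)n;
    return 0;
}

/*
 * Write the body of a ToUnicode CMap stream for the given glyphs.
 * Returns the number of characters written, or -1 if `out` is too small.
 */
int write_tounicode_cmap(const GlyphToUnicode *map, size_t count,
                         char *out, size_t out_size)
{
    size_t pos = 0;
    size_t i = 0;

    assert(out != NULL && out_size > 0);
    assert(map != NULL || count == 0);

    if (append(out, out_size, &pos,
               "/CIDInit /ProcSet findresource begin\n"
               "12 dict begin\n"
               "begincmap\n"
               "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS)"
               " /Supplement 0 >> def\n"
               "/CMapName /Adobe-Identity-UCS def\n"
               "/CMapType 2 def\n"
               "1 begincodespacerange\n"
               "<0000> <FFFF>\n"
               "endcodespacerange\n") < 0)
        return -1;

    while (i < count) {
        /* Emit at most 100 mappings per bfchar section. */
        size_t chunk = count - i;
        if (chunk > 100)
            chunk = 100;

        if (append(out, out_size, &pos, "%u beginbfchar\n",
                   (unsigned)chunk) < 0)
            return -1;
        for (size_t j = 0; j < chunk; j++, i++) {
            if (append(out, out_size, &pos, "<%04X> <%04X>\n",
                       (unsigned)map[i].gid, (unsigned)map[i].unicode) < 0)
                return -1;
        }
        if (append(out, out_size, &pos, "endbfchar\n") < 0)
            return -1;
    }

    if (append(out, out_size, &pos,
               "endcmap\n"
               "CMapName currentdict /CMap defineresource pop\n"
               "end\n"
               "end\n") < 0)
        return -1;
    return (int)pos;
}
```

The resulting text would be stored as a stream object and referenced from the font dictionary's /ToUnicode entry. Ligatures and other glyphs that stand for several characters would need multi-code-point destinations, which this sketch leaves out.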


Thanks,
Adalbert

Adalbert Michelic

Dec 19, 2017, 6:57:23 AM
to lib...@googlegroups.com
Following up on my own email:

On 12/19/17 10:46, Adalbert Michelic wrote:
> So the only thing that our version really addresses is the
> size of the /W array. The /CIDToGIDMap is set to /Identity for our
> application, because we're already laying out individual glyphs. On
> the other hand, Copy&Paste doesn't yet work for the produced files,
> as I haven't yet implemented a proper /ToUnicode map.
>
> I will have a look at whether I can separate the patch for reducing
> the /W array from our patch collection and submit it as a pull request.
> But this won't shave that much off your file.

I've now separated the changes - my suggestions to reduce the size
of the /W array are in https://github.com/libharu/libharu/pull/171


Thanks,
Adalbert

Pierre Clouthier

Dec 19, 2017, 8:29:55 AM
to libHaru
Hi Adalbert -

Thank you very much for the work. I was wrong to say "Amyuni doesn't embed the fonts", my knowledge of PDF is rudimentary :o)  I understand (as you explain) that Amyuni embeds a smaller subset of the fonts.

I downloaded and used your new version of hpdf_font_cid.c. It reduced the size of my test file from 360K to 300K (see attached). That's a good start. 

A few comments:

[1] My program tells Haru to embed Arial, Arial Bold, Arial Italic, etc. whereas Amyuni only has Arial. That's because I am naïvely assuming that I need all these fonts to do bold & italic. I don't understand how Amyuni is achieving bold & italic without including these specific font variations in the PDF file. I am not aware of a PDF "bold" or "italic" command.

[2]  "map all characters to glyphs in the PDF driver libharu and set /CIDToGIDMap to /Identity"
I would be very interested in pursuing this. What can I do to help?

I am presently pursuing gradients (Shading). I am able to define a Shading Dictionary and add it to Resources, but I don't know where to invoke "sh", how to specify RGB values or rectangle coordinates.

Thanks again for your submission.

PS: I am a Québécois, currently living in Nova Scotia.
arabic Haru new font compaction.pdf

spa...@idecon.it

Dec 20, 2017, 2:20:05 AM
to libHaru
Hi Pierre, 
I suspect that Amyuni works slightly differently, but on the same principle. In my own development, I've seen a significant size reduction when only a portion of the font is included.
I suspect Amyuni does the same: if it selects the characters actually used before embedding the font, you gain a significant benefit in terms of a smaller PDF footprint.
I couldn't do such subsetting myself, because I needed to be fast on a WinCE device, but if you work on a reasonably powerful platform and can sacrifice a little speed, it's a really good solution.
Best regards, 

Davide

Adalbert Michelic

Dec 20, 2017, 3:48:48 AM
to lib...@googlegroups.com
Hello Pierre,

On 12/19/17 14:29, Pierre Clouthier wrote:
> Thank you very much for the work. I was wrong to say "Amyuni doesn't
> embed the fonts", my knowledge of PDF is rudimentary :o)  I understand
> (as you explain) that Amyuni embeds a smaller subset of the fonts.
>
> I downloaded and used your new version of hpdf_font_cid.c. It reduced
> the size of my test file from 360K to 300K (see attached). That's a
> good start.

That's nice to hear :)

> A few comments:
>
> [1] My program tells Haru to embed Arial, Arial Bold, Arial Italic,
> etc. whereas Amyuni only has Arial. That's because I am naïvely
> assuming that I need all these fonts to do bold & italic. I don't
> understand how Amyuni is achieving bold & italic without including
> these specific font variations in the PDF file. I am not aware of a
> PDF "bold" or "italic" command.

I can see how Amyuni simulates a bold font. But is there any actual
text in italic? I can't see any Latin characters in italic style. As
for the Arabic characters, I can't tell whether any are in italic :)

Amyuni just simulates a bold font. It does this by using the Tr
command. libharu defines the HPDF_Page_SetTextRenderingMode function
to output this command. By setting the text rendering mode to 2
(HPDF_FILL_THEN_STROKE) and setting a certain line width, characters
will appear bolder. First, the glyphs are drawn as usual, then the
outline is drawn again with a certain line width (your example uses
a width of 0.24). So then the glyphs will appear a bit bolder.

You can do the same in libharu. To switch back to "normal" mode, use
HPDF_Page_SetTextRenderingMode(page, HPDF_FILL).
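
To make the mechanism concrete, here is a minimal sketch in plain C that writes the raw content-stream operators involved, rather than calling libharu. The function name write_fake_bold_text and the font resource name passed to it are hypothetical, and the text is assumed to be in a single-byte encoding. In libharu the two relevant operators would correspond to HPDF_Page_SetTextRenderingMode (for Tr) and HPDF_Page_SetLineWidth (for w).

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Append formatted text at *pos; returns -1 if the buffer is too small. */
static int append(char *out, size_t out_size, size_t *pos, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(out + *pos, out_size - *pos, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= out_size - *pos)
        return -1;
    *pos += (size_t)n;
    return 0;
}

/*
 * Write a text object that fills and then strokes each glyph outline
 * ("2 Tr") with the given line width, which makes the text look bolder.
 * Rendering mode 0 (fill only) is restored before the text object ends;
 * the line width stays changed for whatever is drawn afterwards.
 * Returns the number of characters written, or -1 if `out` is too small.
 */
int write_fake_bold_text(const char *font_res, double font_size,
                         double x, double y, double stroke_width,
                         const char *text, char *out, size_t out_size)
{
    size_t pos = 0;

    assert(font_res != NULL && text != NULL);
    assert(out != NULL && out_size > 0);

    if (append(out, out_size, &pos,
               "BT\n/%s %.2f Tf\n2 Tr\n%.2f w\n%.2f %.2f Td\n(",
               font_res, font_size, stroke_width, x, y) < 0)
        return -1;

    /* Parentheses and backslashes must be escaped in PDF literal strings. */
    for (const char *p = text; *p != '\0'; p++) {
        if (*p == '(' || *p == ')' || *p == '\\') {
            if (append(out, out_size, &pos, "\\") < 0)
                return -1;
        }
        if (append(out, out_size, &pos, "%c", *p) < 0)
            return -1;
    }

    if (append(out, out_size, &pos, ") Tj\n0 Tr\nET\n") < 0)
        return -1;
    return (int)pos;
}
```

The stroke uses the current stroking colour, so it should match the fill colour for the effect to look like a heavier weight of the same text.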

However, using a real bold font will typically produce visually
more appealing results, as bold fonts normally aren't just bolder
copies of the normal fonts, but are optimized.

But if you really depend on a small file, that's of course a
way it can be done (the typographical nerd in the back of my head is
now screaming "Noooo, you can't do that, that's totally wrong" ;) ).

I don't know if an italic font can be produced in a similar way -
if you are able to produce it in Amyuni, you can send the file to
me and I can find out how they're doing it.

> [2]  "map all characters to glyphs in the PDF driver libharu and set
> /CIDToGIDMap to /Identity"
> I would be very interested in pursuing this. What can I do to help?

There are two possibilities:
a) resolve all characters to glyphs in your application, position
   them on the page, use the master branch from my libharu
   repository, and call the function HPDF_SetTTFontGIDMode(pdf,
   font_name) for every loaded font.
   After calling this function the strings written have to only
   consist of glyph ids and things like automatic line breaking
   will no longer work.
   Our application is only using the HPDF_Page_ShowTextNextLine
   and HPDF_Page_ShowText functions.
   We are using FreeType, GNU FriBidi, and HarfBuzz to produce
   the correct string of glyphs and create a text layout.
   This allows us to correctly handle texts with mixed left-to-
   right and right-to-left parts, shape Arabic texts, enable/
   disable various features of the used fonts (e.g. using either
   proportional or tabular numbers), etc.

b) Use a similar approach, but resolve the texts to glyphs in
   libharu.
   It's probably enough to intercept the texts in the
   InternalWriteText and InternalShowTextNextLine functions. The
   function HPDF_TTFontDef_GetGlyphid can be used for a simple
   mapping. This is, however, no longer enough with Arabic fonts,
   where a full text shaping engine like HarfBuzz should be used
   (because there's no simple char->glyph mapping, but instead
   glyphs have to be selected according to the context).
   Translating characters to glyphs can probably be done by
   using the font_attr->encoder->encode_text_fn - however, after
   playing around a bit with this and trying to get this cleanly
   into libharu, I gave up and reverted to approach a).

Unfortunately, I don't have deep enough insight into libharu to get
these changes in cleanly (without breaking other functions), which
is why I haven't done more in that direction.
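
The "simple mapping" case from b) can be sketched in plain C as follows. This is an illustration only and does not use libharu: the CharToGlyph table is a hypothetical stand-in for the font's character-to-glyph lookup, and the glyph ids in any sample data are made up. With /CIDToGIDMap set to /Identity, each glyph id is written as a 2-byte big-endian code, shown here as a PDF hex string. No shaping or bidirectional reordering is done, so this would not be sufficient for Arabic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical table entry standing in for the font's cmap lookup. */
typedef struct {
    unsigned codepoint;
    unsigned short gid;
} CharToGlyph;

/* Linear lookup; unknown characters map to glyph 0 (.notdef). */
static unsigned short lookup_gid(const CharToGlyph *cmap, size_t cmap_len,
                                 unsigned codepoint)
{
    for (size_t i = 0; i < cmap_len; i++) {
        if (cmap[i].codepoint == codepoint)
            return cmap[i].gid;
    }
    return 0;
}

/*
 * Convert code points to a PDF hex string of 2-byte glyph ids,
 * e.g. "<00240025>". Returns the string length, or -1 if `out`
 * cannot hold '<', four hex digits per glyph, '>' and the NUL.
 */
int codepoints_to_gid_hex(const unsigned *codepoints, size_t count,
                          const CharToGlyph *cmap, size_t cmap_len,
                          char *out, size_t out_size)
{
    size_t pos = 0;

    assert(out != NULL);
    assert(codepoints != NULL || count == 0);

    if (out_size < 4 * count + 3)
        return -1;

    out[pos++] = '<';
    for (size_t i = 0; i < count; i++) {
        unsigned gid = lookup_gid(cmap, cmap_len, codepoints[i]);
        /* gid fits in 16 bits, so this always writes exactly 4 digits. */
        snprintf(out + pos, out_size - pos, "%04X", gid);
        pos += 4;
    }
    out[pos++] = '>';
    out[pos] = '\0';
    return (int)pos;
}
```

A real implementation would take the glyph ids from the font itself (or from a shaping engine, as in approach a) and would also record which glyphs were used, so that the width table and the /ToUnicode map can be limited to them.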


Thanks,
Adalbert

Adalbert Michelic

Dec 20, 2017, 4:28:40 AM
to lib...@googlegroups.com
Hello again,

On 12/19/17 10:46, Adalbert Michelic wrote:
> However, it won't be possible to cut down the /CIDToGIDMap with
> the way libharu writes the PDFs. This is a binary array and cannot
> easily be reduced in size.

On second thought, I think I'm wrong here.

Currently the CIDToGIDMap contains the mapping for all characters in
the font - up to 65536 entries, if the font has that many glyphs. It
should however be possible to reduce it in two aspects:
a) if only glyphs from the first 300 entries are used, the table can
   be shorter,
b) unused entries can be set to 0, so that the compression algorithm
   has an easier job.

So I did a quick and experimental patch:
   https://github.com/amichelic/libharu/commit/3026ab

However, I didn't test the patch. Try the produced PDF in a couple of
different readers to see if it breaks somewhere.
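
The two reductions described above can be sketched in plain C as follows. This is not the code from the linked commit; the function name build_trimmed_cid_to_gid_map and its parameters are hypothetical. A /CIDToGIDMap stream stores two bytes (a big-endian glyph id) per CID, so a full 65536-entry table is 131072 bytes before compression. The sketch cuts the table off after the highest used CID and writes 0 for every unused entry.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Build a reduced CIDToGIDMap.
 *   cid_to_gid : full mapping, one glyph id per CID (num_cids entries)
 *   used       : one flag per CID, non-zero if the CID occurs in the text
 *   out        : receives 2 bytes per CID, big-endian
 * Returns the number of bytes written. Returns 0 if no CID is used or if
 * `out` is too small to hold the table.
 */
size_t build_trimmed_cid_to_gid_map(const unsigned short *cid_to_gid,
                                    const unsigned char *used,
                                    size_t num_cids,
                                    unsigned char *out, size_t out_size)
{
    size_t last = 0;
    int any_used = 0;
    size_t needed;

    assert(cid_to_gid != NULL && used != NULL && out != NULL);

    /* Reduction a): find the highest CID that is actually used. */
    for (size_t cid = 0; cid < num_cids; cid++) {
        if (used[cid]) {
            last = cid;
            any_used = 1;
        }
    }
    if (!any_used)
        return 0;

    needed = (last + 1) * 2;
    if (needed > out_size)
        return 0;

    /* Reduction b): unused entries become 0, which compresses well. */
    for (size_t cid = 0; cid <= last; cid++) {
        unsigned gid = used[cid] ? cid_to_gid[cid] : 0u;
        out[2 * cid]     = (unsigned char)(gid >> 8);
        out[2 * cid + 1] = (unsigned char)(gid & 0xFFu);
    }
    return needed;
}
```

How much this saves depends on the text: the table can only be cut off after the highest used CID, so a single character with a high code keeps it long, although the zeroed entries should still compress well.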


Thanks,
Adalbert

Pierre Clouthier

Dec 20, 2017, 8:52:20 AM
to libHaru
Hi Adalbert -

I have incorporated the code (Pull Req.171/3026ab), and the file size is now at 211KB, down from 300KB in the previous version, and from 360KB in the original. Looking good. This is a major enhancement.

The italic font is thrown in by default because some charts use it. I created an example of a chart that includes italic (see attached). It uses an ugly font called Euphemia, for which there is no native italic. I believe it (or me?) is reverting to Arial for the italic parts.

Your explanation of "boldifying" reminds me of the days when we simulated bold on impact printers by freezing the line advance and overstriking twice. The hammer mechanism would usually be slightly offset and inaccurate, and would smear and fatten the letter.

The reason I'm preoccupied with file size is that my users will frequently share charts in PDF format, as email attachments, and size matters. The charts can also include images, which further bloat the file. I realize that images have nothing to do with text (and I also convert images to thumbnails), but everything counts.

Thank you very much for your code and detailed explanations. 







arabic Haru new font compaction PullReq171(3026ab).pdf
Haru italic example.pdf