dynamic subsetting

Robin Becker

Jan 12, 2021, 10:04:42 AM
to fonttools
Hi, I hope this is the right place to ask. I maintain reportlab, and although we have some code that attempts to make use of TrueType fonts, I believe its subsetting is not very good.
I would like to make use of the subsetting capabilities of fontTools.

I tried a few small example subsets and found that the result seems to contain more glyphs than the code points I requested. Our current code tries to implement subsetting on the fly by keeping a list of the Unicode code points that have been used. When a subset is full, say at 255 codes, we start a new subset. The cmap is exactly parallel to the subset.

Looking at my example via ttx for /usr/share/fonts/TTF/NotoSansMyanmar-Regular.ttf with unicodes 0x20 0x1000 0x103c 0x1031 0x1038, I find that the glyph order shows 7 glyphs: .notdef, space, ka, medial_ra, medial_ra.w2, _e and visarga. However, the cmap (in two variants) shows just my 5 desired code points. I'm confused about what ordinal I should use for my PDF text. Normally the input text is translated via the subset ordering and can use bytes 0<=b<=255. Is the cmap the right ordering, or is it the glyphOrder? Currently we make cmap[0] the .notdef, but even though .notdef seems to be added to the subset font, it doesn't appear in the cmap.
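
The mismatch can be made explicit by comparing the glyph order with the cmap. The following is a minimal sketch: the glyph names are the ones from the ttx dump above, while the pairing of code points with glyph names is assumed from those names, and the helpers `unmapped_glyphs` and `load_order_and_cmap` are hypothetical names introduced here. `TTFont.getGlyphOrder()` and `TTFont.getBestCmap()` are existing fontTools calls.

```python
def unmapped_glyphs(glyph_order, cmap):
    """Return the glyphs in the font that no code point maps to."""
    mapped = set(cmap.values())
    return [name for name in glyph_order if name not in mapped]


def load_order_and_cmap(path):
    """Read glyph order and Unicode cmap from a font file (needs fontTools)."""
    # Imported lazily so that unmapped_glyphs() has no third-party dependency.
    from fontTools.ttLib import TTFont

    font = TTFont(path)
    return font.getGlyphOrder(), font.getBestCmap()


# Values as described in the ttx dump; the code point to glyph name
# pairing is an assumption based on the glyph names.
glyph_order = [".notdef", "space", "ka", "medial_ra", "medial_ra.w2", "_e", "visarga"]
cmap = {
    0x20: "space",
    0x1000: "ka",
    0x103C: "medial_ra",
    0x1031: "_e",
    0x1038: "visarga",
}

extra = unmapped_glyphs(glyph_order, cmap)
print(extra)  # the glyphs that exist in the subset but are not in the cmap
```

With the data above, the two glyphs that are not reachable through the cmap are .notdef and medial_ra.w2, which accounts for the difference between 7 glyphs and 5 code points.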

Andreas Eigendorf

Jan 12, 2021, 12:51:33 PM
to fonttools
Hello Robin,

I think this is the intended behaviour. There are a lot of cases which result in a glyph set which is larger than the character set.
Suppose you want a font containing only the character ä (adieresis, code point 0x00E4). If the corresponding glyph is built as a composite of two components, a and dieresis, you will find three glyphs in the subsetted font: ä, a and dieresis. If the font implements a small caps feature that you want to keep, you will find an additional glyph in the subsetted font: a.sc. If the small caps variant is itself built as a composite, you may find even more glyphs.

For complex scripts like Arabic you also need positional variants for many characters to render the text correctly (initial, medial, final and isolated forms of the same character). This is another case that results in a glyph set larger than the requested character set. The fontTools subsetter handles all this automatically, which is great!
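
The composite and feature cases above can be modeled as a closure over the glyph set. This is only a toy model in plain Python, not the fontTools implementation: `glyph_closure` and the two dictionaries are hypothetical, with the glyph names taken from the adieresis example.

```python
def glyph_closure(requested, components, substitutions):
    """Expand a set of glyph names with every glyph they pull in.

    components:    glyph name -> glyphs it is built from (composites)
    substitutions: glyph name -> glyphs a kept feature may replace it with
    """
    result = set()
    todo = list(requested)
    while todo:
        name = todo.pop()
        if name in result:
            continue
        result.add(name)
        # Follow both kinds of reference until nothing new turns up.
        todo.extend(components.get(name, ()))
        todo.extend(substitutions.get(name, ()))
    return result


components = {"adieresis": ["a", "dieresis"]}
substitutions = {"a": ["a.sc"]}  # a small caps feature that is kept

print(sorted(glyph_closure({"adieresis"}, components, {})))
print(sorted(glyph_closure({"adieresis"}, components, substitutions)))
```

One requested character yields three glyphs from the composite alone, and four once the small caps substitution is kept.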

I hope this helps.
All the best

梁海 Liang Hai

Jan 14, 2021, 6:31:26 AM
to ro...@reportlab.com, fonttools
Hi Robin,

Mmm, either I didn’t get what you meant, or you need to refresh your mindset to understand that glyphs in a modern font are really not meant to be mapped one-to-one to encoding units (Unicode code points, for example). It’s only a special case when you do see that happen in fonts that are simple enough.

Then, make sure to check out the fontTools.subset’s fairly informative documentation, which does explain how you can tailor the tool’s behavior to meet your exact expectation:


Note especially this section:

Glyph set expansion:
These options control how additional glyphs are added to the subset.
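
As a rough sketch of how such options might be passed, the helper below builds an argument list for the subsetter. `pyftsubset_args` and `run_subset` are hypothetical names; `--unicodes`, `--output-file`, `--notdef-glyph` and `--layout-features` are documented pyftsubset options, and `fontTools.subset.main` accepts the argument list. Dropping all layout features is shown only to illustrate the mechanism; for a script like Myanmar it would remove the substitutions the text needs to render correctly.

```python
def pyftsubset_args(font_path, unicodes, out_path, keep_layout=True):
    """Build a pyftsubset-style argument list for the given code points."""
    args = [
        font_path,
        # Code points are given as comma-separated hex numbers.
        "--unicodes=" + ",".join(f"{cp:04X}" for cp in unicodes),
        "--output-file=" + out_path,
        "--notdef-glyph",  # keep .notdef in the subset
    ]
    if not keep_layout:
        # An empty list drops all OpenType layout features, so no glyphs
        # are added on their behalf. Composite components are still kept.
        args.append("--layout-features=")
    return args


def run_subset(args):
    """Run the subsetter with the given arguments (needs fontTools)."""
    from fontTools import subset  # imported lazily; third-party dependency

    subset.main(args)


args = pyftsubset_args("in.ttf", [0x20, 0x1000, 0x103C], "out.ttf", keep_layout=False)
print(args)
```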

I'm confused about what ordinal I should use for my PDF text. 

I’m confused though, about what you’re trying to do here. If you’re trying to subset a font to embed into PDF in order to back those glyph strings in PDF, of course you keep track of the glyph entities, I suppose?

Are you trying to manually get the resulting glyph sequence for a Myanmar Unicode character string? Don’t do that. Use a proper text shaping library like HarfBuzz. You feed HarfBuzz a text string plus some additional information, and you get back a sequence of glyphs and some additional information.
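
A minimal sketch of that workflow, assuming the third-party `uharfbuzz` binding is installed: shape each string, then collect the glyph IDs that were actually used. `shape_to_gids` and `used_gids` are hypothetical helper names, and no shaping results are shown here because they depend on the font.

```python
def shape_to_gids(font_path, text):
    """Shape text with HarfBuzz and return the resulting glyph IDs in order."""
    import uharfbuzz as hb  # imported lazily; third-party dependency

    blob = hb.Blob.from_file_path(font_path)
    font = hb.Font(hb.Face(blob))
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()  # infer script, language and direction
    hb.shape(font, buf)
    # After shaping, GlyphInfo.codepoint holds a glyph ID, not a Unicode value.
    return [info.codepoint for info in buf.glyph_infos]


def used_gids(runs):
    """Collect the distinct glyph IDs across shaped runs, keeping .notdef (0)."""
    gids = {0}
    for run in runs:
        gids.update(run)
    return sorted(gids)


# Example with made-up glyph IDs standing in for two shaped runs.
print(used_gids([[3, 5, 3], [7]]))
```

The collected IDs could then be handed to the subsetter through `Subsetter.populate(gids=...)`. Glyph IDs are normally renumbered by subsetting, so the `--retain-gids` option may be needed if the IDs from shaping are to remain valid afterwards.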

梁海 Liang Hai

Robin Becker

Jan 14, 2021, 7:39:22 AM
to 梁海 Liang Hai, fonttools
Hi, I think I've got this now. Looking at our existing subset code, I just need to rely on the cmap for external use. I assume that's going to be in the same order that I provide. I probably just need to ensure that the map I want stays in sync with the map that fontTools produces for a specified unicode list. One problem is that I may want gaps in the list. E.g. to keep a Latin alphabet mostly readable when used, I would like to have code 32 -> space, code 33 -> A, etc., which leaves a gap between, say, notdef at 0 and 31. I will have to look at the fontTools subsetting code to see if that's easy to overcome. If readability is not an issue then I think it works easily.
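
One way to keep such gaps is to assign the one-byte codes independently of the font and let printable ASCII keep its own values. This is only a sketch under that assumption, not reportlab's or fontTools's scheme: the class `SubsetEncoding` is hypothetical, and how the byte codes are later tied to glyphs in the subset font is outside its scope.

```python
class SubsetEncoding:
    """Assign one-byte codes (0..255) to Unicode code points for one subset."""

    def __init__(self):
        self.code_of = {}  # code point -> byte code
        # Code 0 is reserved for .notdef and 32..126 for printable ASCII,
        # so other characters take the remaining slots in ascending order.
        self._free = [c for c in range(1, 256) if not 32 <= c <= 126]

    def assign(self, cp):
        """Return the byte code for a code point, allocating one if needed."""
        if cp in self.code_of:
            return self.code_of[cp]
        if 32 <= cp <= 126:
            code = cp  # readable: ASCII keeps its own code
        elif self._free:
            code = self._free.pop(0)
        else:
            raise OverflowError("subset full; start a new subset")
        self.code_of[cp] = code
        return code

    def encode(self, text):
        """Translate a string into the subset's byte codes."""
        return bytes(self.assign(ord(ch)) for ch in text)


enc = SubsetEncoding()
encoded = enc.encode(" A\u1000\u103c\u1000")
print(encoded)
```

Here the space and A keep codes 32 and 65, while the two Myanmar code points take the first free codes, 1 and 2, leaving the rest of the range below 32 unused until needed.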
Robin Becker