[reportlab-users] CJK fonts


Glenn Linderman

Jun 18, 2011, 1:58:19 AM
to reportl...@lists2.reportlab.com
So in generating some documents from UTF-8 text files, most of which are in European languages, but one is in Chinese (Simplified script), I discovered that the Times-New-Roman font that I'd been using for the European languages doesn't contain the CJK characters.  So the Chinese text I have also has some European characters.

Does the ReportLab API provide a way of selecting multiple fonts at the same time, so that characters not in one font will be found in another?

And any clues when a version of ReportLab will work with Python 3.x?

Robin Becker

Jun 20, 2011, 5:32:52 AM
to reportlab-users
On 18/06/2011 06:58, Glenn Linderman wrote:
> So in generating some documents from UTF-8 text files, most of which are in
> European languages, but one is in Chinese (Simplified script), I discovered that
> the Times-New-Roman font that I'd been using for the European languages doesn't
> contain the CJK characters. So the Chinese text I have also has some European
> characters.
>
> Does the ReportLab API provide a way of selecting multiple fonts at the same
> time, so that characters not in one font will be found in another?
.........

For the T1 fonts we have such a mechanism via a list of substitution fonts.
That's used in pdfbase.pdfmetrics.unicode2T1 to fix up any encoding issues from
the available fonts. That's reasonable for T1 because the maximum number of
glyphs is 256.

In the TTF fonts the assumption is that they cover all of utf8/unicode and we
make lazy subset fonts so we don't get errors at the right time; in fact we only
detect that the font lacks a glyph when we are building the subset. That means
we might end up trying to build a subset for a different font in the middle of
building subsets. I'm not sure how feasible it would be to do that.
--
Robin Becker
_______________________________________________
reportlab-users mailing list
reportl...@lists2.reportlab.com
http://two.pairlist.net/mailman/listinfo/reportlab-users

Glenn Linderman

Jun 20, 2011, 3:44:14 PM
to Robin Becker, reportlab-users
On 6/20/2011 2:32 AM, Robin Becker wrote:
> On 18/06/2011 06:58, Glenn Linderman wrote:
>> So in generating some documents from UTF-8 text files, most of which are in
>> European languages, but one is in Chinese (Simplified script), I discovered that
>> the Times-New-Roman font that I'd been using for the European languages doesn't
>> contain the CJK characters. So the Chinese text I have also has some European
>> characters.
>>
>> Does the ReportLab API provide a way of selecting multiple fonts at the same
>> time, so that characters not in one font will be found in another?
> .........
>
> For the T1 fonts we have such a mechanism via a list of substitution fonts.
> That's used in pdfbase.pdfmetrics.unicode2T1 to fix up any encoding issues from
> the available fonts. That's reasonable for T1 because the maximum number of
> glyphs is 256.
>
> In the TTF fonts the assumption is that they cover all of utf8/unicode and we
> make lazy subset fonts so we don't get errors at the right time; in fact we only
> detect that the font lacks a glyph when we are building the subset. That means
> we might end up trying to build a subset for a different font in the middle of
> building subsets. I'm not sure how feasible it would be to do that.

So I'm just a brand new reportlab user, struggling with the need to use Python 2 (my focus is Python 3, I'm a fairly new Python user as well, and started with 3) to use reportlab at all, and now trying to figure out why my PDF with Chinese is mostly blank characters instead of Chinese characters.

So I think I'm using TTF fonts, I have no clue that T1 fonts would even work for Chinese because of the glyph limit... but in my text editor and browser, they or Windows do appropriate font substitutions, and the Chinese, since it is UTF-8, "just works".

I did earlier discover that the use of the "basic postscript font set" didn't work for all European languages, because it seems to be limited to the character repertoire of Latin-1 or something, even though I was feeding in Unicode.  So I had to select Times-New-Roman instead of Times Roman... and that got me started down the path of TTF fonts, I guess.

Being rather ignorant of Windows font APIs (I attempted to research Windows font APIs some time back, discovered there were at least 4 different font APIs available, couldn't figure out which were the new ones, or the recommended ones, and never did figure out any of them, since I didn't know which one to study), I wouldn't know either why it "just works" in the text editor and browser, and why it couldn't "just work" in reportlab... even if the embedded subset font would happen to contain characters from a substituted font, because that is what is available on the machine that is creating the PDF.

Is there any good reference material for Windows font APIs?  I'm not even sure what Chinese font is in use on my computer to be substituted in for the characters that are not in Times-New-Roman, nor how to determine that, as a first step to specifying it for use in reportlab.  Whatever it is, it must come with Windows.

Is there any good reference material for how to solve my problem above using reportlab?  I could probably figure out and code a solution, if I knew where to start...

Is it possible somehow to tell reportlab what subsets of what fonts should be included before rendering, as a way to avoid building a subset of one font while building a subset of a different font?  Or is it simply not possible to create a PDF with need for multiple fonts in reportlab?

Robin Becker

Jun 21, 2011, 5:32:51 AM
to reportlab-users
...........

> So I think I'm using TTF fonts, I have no clue that T1 fonts would even work for
> Chinese because of the glyph limit... but in my text editor and browser, they or
> Windows do appropriate font substitutions, and the Chinese, since it is UTF-8,
> "just works".

Unless you are specifically declaring and registering a font it will not be used
in reportlab unless it is one of the standard 14. So for example this is from a
Japanese based test and uses msmincho.ttc

> from reportlab.pdfbase import pdfmetrics
> from reportlab.pdfbase.ttfonts import TTFont
>
> # c is assumed to be an existing reportlab.pdfgen.canvas.Canvas
> # Try each known file name for the font in turn.
> try:
>     msmincho = TTFont('MS Mincho', 'msmincho.ttc', subfontIndex=0, asciiReadable=0)
>     fn = ' file=msmincho.ttc subfont 0'
> except Exception:
>     try:
>         msmincho = TTFont('MS Mincho', 'msmincho.ttf', asciiReadable=0)
>         fn = 'file=msmincho.ttf'
>     except Exception:
>         # Ubuntu - works on Lucid Lynx if xpdf-japanese installed
>         try:
>             msmincho = TTFont('MS Mincho', 'ttf-japanese-mincho.ttf')
>             fn = 'file=ttf-japanese-mincho.ttf'
>         except Exception:
>             msmincho = None
> if msmincho is None:
>     # No candidate file was found: report it using a standard font.
>     c.setFont('Helvetica', 12)
>     c.drawString(100, 600, 'Cannot find msmincho.ttf or msmincho.ttc')
> else:
>     pdfmetrics.registerFont(msmincho)
>     c.setFont('MS Mincho', 30)

The same could be done for some of the MS fonts eg pmingliu.ttf or simsun.ttf,
but I am certainly no expert here.


...........


> Being rather ignorant of Windows font APIs (I attempted to research Windows font
> APIs some time back, discovered there were at least 4 different font APIs
> available, couldn't figure out which were the new ones, or the recommended ones,
> and never did figure out any of them, since I didn't know which one to study), I
> wouldn't know either why it "just works" in the text editor and browser, and why
> it couldn't "just work" in reportlab... even if the embedded subset font would
> happen to contain characters from a substituted font, because that is what is
> available on the machine that is creating the PDF.
>

I don't believe we are using any specific API to obtain/find fonts.

> Is there any good reference material for Windows font APIs? I'm not even sure
> what Chinese font is in use on my computer to be substituted in for the
> characters that are not in Times-New-Roman, nor how to determine that, as a
> first step to specifying it for use in reportlab. Whatever it is, it must come
> with Windows.
>

I know that there is an Asian font pack for Acrobat Reader (but that allows the
meta information to be rendered properly). In addition you can always add Asian
language support from the Windows OS, but which fonts that provides I don't
actually know.

> Is there any good reference material for how to solve my problem above using
> reportlab? I could probably figure out and code a solution, if I knew where to
> start...
>

Ask here, others have certainly faced and overcome these problems

Jonathan Liu

Jun 26, 2011, 6:53:23 PM
to reportl...@reportlab.com

iText supports specifying a list of fonts that can be selected that are
checked in order for rendering text using the FontSelector class
(http://api.itextpdf.com/itext/com/itextpdf/text/pdf/FontSelector.html).

I use FontSelector in iText quite extensively and it would be good if I
could use ReportLab instead of calling Java from Python to generate PDFs.

Does ReportLab plan to include an equivalent to this in a future version
or is there a way to emulate it?

Glenn Linderman

Jan 4, 2015, 12:19:18 AM
to Robin Becker, reportlab-users
On 6/20/2011 2:32 AM, Robin Becker wrote:
> In the TTF fonts the assumption is that they cover all of utf8/unicode and we
> make lazy subset fonts so we don't get errors at the right time; in fact we only
> detect that the font lacks a glyph when we are building the subset. That means
> we might end up trying to build a subset for a different font in the middle of
> building subsets. I'm not sure how feasible it would be to do that.

Three and a half years later (yeah, I got distracted), I've finally learned enough to understand and appreciate what you said there.

So if you don't detect the missing glyph until building the subset, that means you can detect it then.  For efficiency, I can understand why it might be deferred, allowing and assuming the user specifies the right font for the particular characters.

But this user doesn't always know. Now I could make an assumption about certain character ranges being in certain fonts, but is there a way I can ask reportlab "Does this character exist in this font?"
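
The range-based guess mentioned above can be sketched as a small lookup table. This is only an illustration: the table `RANGE_FONTS`, the helper `font_for_char`, and the pairing of these Unicode blocks with SimSun are assumptions, and a real font may lack characters inside a block it is assumed to cover.

```python
# Hypothetical table: (first code point, last code point, font name).
# The ranges are real Unicode blocks; the font choices are assumptions.
RANGE_FONTS = [
    (0x3000, 0x303F, "SimSun"),  # CJK Symbols and Punctuation
    (0x4E00, 0x9FFF, "SimSun"),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF, "SimSun"),  # Halfwidth and Fullwidth Forms
]
DEFAULT_FONT = "Times-New-Roman"


def font_for_char(ch, ranges=RANGE_FONTS, default=DEFAULT_FONT):
    """Return the font name guessed for one character from its code point."""
    cp = ord(ch)
    for first, last, name in ranges:
        if first <= cp <= last:
            return name
    return default


if __name__ == "__main__":
    for ch in "A\u00e9\u4e2d\uff0c":
        print(repr(ch), font_for_char(ch))
```

A guess like this says nothing about whether the glyph is really present, which is why a direct "does this font contain this character" query would still be preferable.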

Andy Robinson

Jan 4, 2015, 4:02:04 AM
to reportlab-users, Robin Becker
On 4 January 2015 at 05:18, Glenn Linderman <v+py...@g.nevcal.com> wrote:
> But this user doesn't always know. Now I could make an assumption about
> certain character ranges being in certain fonts, but is there a way I can
> ask reportlab "Does this character exist in this font?"

Give us some time to remind ourselves how it works - maybe another 3.5
years ;-) Remember that we don't draw the text, Adobe Reader (or
whatever you use instead) does, and it's quite happy being fed IDs of
glyphs which don't exist and/or they may introduce font substitution
mechanisms of their own at any time. So while it's useful to know
your font might lack a glyph, it doesn't mean it won't get displayed
somehow.

There is no convenient high-level API but we ought to add one. Robin,
how feasible would that be?

On this front, I note that Just van Rossum's fonttools package is
being updated; Behdad Esfahbod, who's very well known for his work on
rendering Persian text and now works for Google, has been bringing it
up to date:

https://github.com/behdad/fonttools/

This is not what we use for speed reasons but it's probably the right
tool if you want to query and understand what's in a font.



--
Andy Robinson

Glenn Linderman

Jan 5, 2015, 2:01:21 PM
to reportlab-users
On 1/4/2015 1:01 AM, Andy Robinson wrote:
> On 4 January 2015 at 05:18, Glenn Linderman <v+py...@g.nevcal.com> wrote:
>> But this user doesn't always know. Now I could make an assumption about
>> certain character ranges being in certain fonts, but is there a way I can
>> ask reportlab "Does this character exist in this font?"
> Give us some time to remind ourselves how it works - maybe another 3.5
> years ;-)

:)  Well, I did learn a lot about fonts in the 3.5 years... enough to find a workaround myself, after sending the above... but... not an obvious API!


> Remember that we don't draw the text, Adobe Reader (or
> whatever you use instead) does, and it's quite happy being fed IDs of
> glyphs which don't exist and/or they may introduce font substitution
> mechanisms of their own at any time.  So while it's useful to know
> your font might lack a glyph, it doesn't mean it won't get displayed
> somehow.

Right now, there are no substitutions. If I create something now, I want it viewable now, not in 3.5 years :) Even though I've been waiting 3.5 years to create it :)

And since by the time Reader (or Sumatra, which is what I use instead) gets the character codes, they've been substituted out of their original positions and fonts, and they would hardly know what to substitute with...


> There is no convenient high-level API but we ought to add one.  Robin,
> how feasible would that be?

The workaround I figured out is as follows. When registering the fonts for use, save the result of the TTFont call as, for example, theFont.  Then a character can be checked for existence in theFont, by doing

if ord( character ) in theFont.face.charToGlyph:
    # it exists
    pass
else:
    # doesn't exist, use some other font
    pass

This can be done to help in generating the <font> directives to pass in to reportlab.
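
A minimal sketch of that idea, under stated assumptions: the helpers `pick_font` and `wrap_in_font_tags` are hypothetical names, and the `coverage` mapping (font name to a set of code points) is a stand-in for real fonts. With registered fonts, something like `set(theFont.face.charToGlyph)` could presumably supply each set, but the toy data below avoids depending on any font file.

```python
from itertools import groupby
from xml.sax.saxutils import escape


def pick_font(ch, fonts, coverage):
    """Return the first font in `fonts` whose coverage contains `ch`.

    Falls back to the first font when no font covers the character.
    """
    cp = ord(ch)
    for name in fonts:
        if cp in coverage[name]:
            return name
    return fonts[0]


def wrap_in_font_tags(text, fonts, coverage):
    """Split `text` into same-font runs and wrap each run in a <font> tag."""
    parts = []
    for name, run in groupby(text, key=lambda ch: pick_font(ch, fonts, coverage)):
        # escape() protects &, < and > inside the paragraph markup.
        parts.append('<font name="%s">%s</font>' % (name, escape("".join(run))))
    return "".join(parts)


if __name__ == "__main__":
    # Toy coverage sets standing in for real fonts (assumption).
    coverage = {
        "Times-New-Roman": set(range(0x20, 0x250)),
        "SimSun": set(range(0x20, 0x7F)) | set(range(0x4E00, 0xA000)),
    }
    fonts = ["Times-New-Roman", "SimSun"]
    print(wrap_in_font_tags("Hello \u4e2d\u6587 & more", fonts, coverage))
```

The order of `fonts` acts as the priority list, so characters present in several fonts stay in the preferred one and only the uncovered runs switch font.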

In HTML/CSS, font substitutions are done automatically, based on the existence of characters in the font. So one can specify a sequence of fonts to check to find the characters. This is convenient, and reportlab's platypus looks a lot like HTML :)  HTML/CSS apparently also provides a way to define that a particular subset of characters should be used from a particular font, although I haven't bothered to learn all those details as yet, but that could also be useful in some circumstances that are more complex than mine.

For the sequence of fonts, CSS allows a comma separated list of fonts, and it will search for the character in the fonts using that specified order... if still not found, it does its own substitution if it can find the character in some other font from its default set of fonts, but if found, it uses the first one found, giving the user a fair bit of control.  And more, if the user specifies the character ranges, however that is done.

So for generating the HTML version of my data, I just use  "Times New Roman", "SimSun" and get the characters I need.  This has filled the 3.5 year gap, but printing HTML doesn't produce as pretty of results as directly generated PDF files, but there were other priorities.

Lacking the priority list of fonts in reportlab, I can use the above technique to figure out how to add appropriate <font> wrappers for text subsets. The above technique doesn't seem to be a documented API, I had to do lots of digging to find it. A documented API would be better, but the HTML/CSS features would be even better.


> On this front, I note that Just van Rossum's fonttools package is
> being updated; Behdad Esfahbod, who's very well known for his work on
> rendering Persian text and now works for Google, has been bringing it
> up to date:
>
>     https://github.com/behdad/fonttools/
>
> This is not what we use for speed reasons but it's probably the right
> tool if you want to query and understand what's in a font.

Thanks for the tip.

Andy Robinson

Jan 5, 2015, 3:01:15 PM
to reportlab-users
Thanks for all this. Our font guru is out this week, but we could
certainly expose something a bit friendlier to test if a font contains
a character (or, more usefully, a set of characters).

In principle it would be a really nice feature to allow font
substitution, but the issue is performance. We have always strived to
make reportlab as fast as we reasonably can when making big documents,
as it's used to do things like generating millions of pages of manuals
per month when editors change content, or making 30-40 page legal
agreements on the fly. The slowest part in most real-world documents
is paragraph-wrapping, which requires us looking up the width of every
character to size the words. There is a 'stringWidth' function which
is highly optimised and if it needs to be able to stop, raise an
exception and backtrack to generate invisible font fragments, it
could slow things down a lot. Maybe we can find a way to handle it
so the default case still performs fast though. I can't promise when
we will look at it but clearly we should....

- Andy



--
Andy Robinson
Managing Director
ReportLab Europe Ltd.
Thornton House, Thornton Road, Wimbledon, London SW19 4NG, UK
Tel +44-20-8405-6420
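
One way the default case might stay fast is to attempt the plain per-character lookup first and only enter the fallback logic when a lookup fails. The sketch below is not ReportLab's `stringWidth`; the function `string_width`, the `missing_width` parameter, and the width tables (code point to width in 1/1000 em) are all hypothetical stand-ins for real font metrics.

```python
def string_width(text, fonts, size, missing_width=500):
    """Width of `text` at `size`, using a priority list of width tables.

    Each entry of `fonts` is a dict mapping code point -> width in
    1/1000 em. `missing_width` is used when no table has the character.
    """
    primary = fonts[0]
    try:
        # Fast path: one dict lookup per character, no fallback logic.
        return sum(primary[ord(ch)] for ch in text) * size / 1000.0
    except KeyError:
        pass  # at least one character is missing from the primary font

    # Slow path: per character, use the first table that has the glyph.
    total = 0
    for ch in text:
        cp = ord(ch)
        for table in fonts:
            if cp in table:
                total += table[cp]
                break
        else:
            total += missing_width
    return total * size / 1000.0


if __name__ == "__main__":
    latin = {ord("a"): 500, ord("b"): 600}
    cjk = {ord("\u4e2d"): 1000}
    print(string_width("ab", [latin, cjk], 10))
    print(string_width("a\u4e2d", [latin, cjk], 10))
```

Text fully covered by the first font never reaches the slow path, so the extra cost is paid only by strings that actually need a substitute font.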

Glenn Linderman

Jan 5, 2015, 4:06:27 PM
to reportl...@lists2.reportlab.com
On 1/5/2015 12:01 PM, Andy Robinson wrote:
> The slowest part in most real-world documents
> is paragraph-wrapping, which requires us looking up the width of every
> character to size the words.

Which reminds me that what I'm not seeing in any of the font or font subsetting code is anything about kerning... is kerning a lost art with modern scalable fonts, or is it just not bothered with except in high-end typesetting applications?  I sort of expected to find some kerning tables in the fonts, but haven't. At least, not yet.

Andy Robinson

Jan 5, 2015, 5:38:57 PM
to reportlab-users
ReportLab doesn't do it. Again, it would slow things to a crawl so
was never our focus, and in 15 years no customer has asked. On the
other hand, for document titles and one-page documents, it would not
be hard to have a 'draw kerned string' routine; PDF provides all the
primitives for it.

I honestly have no idea how many of the popular fonts these days have
kerning tables in them. Now that Google have made their fonts archive
available as a repo, it would be interesting to write an app to
explore them and run queries like that...

- Andy
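
As an illustration of the PDF primitive involved: the `TJ` operator takes an array of strings and numbers, where each number (in thousandths of a text-space unit) is subtracted from the current horizontal position. The sketch below builds such an operation from a kerning-pair table. The helper names `pdf_escape` and `kerned_tj` and the pair table are hypothetical, this is not a ReportLab API, and it assumes a simple font where each character is written directly as one byte of a literal string.

```python
def pdf_escape(s):
    """Escape backslash and parentheses for a PDF literal string."""
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def kerned_tj(text, kern_pairs, units_per_em=1000):
    """Build a TJ operation for `text` from a kerning-pair table.

    `kern_pairs` maps (left, right) character pairs to an adjustment in
    font units; negative values pull the pair together. TJ numbers are
    subtracted from the pen position, hence the sign flip below.
    """
    items = []
    run = ""
    for i, ch in enumerate(text):
        run += ch
        nxt = text[i + 1] if i + 1 < len(text) else None
        kern = kern_pairs.get((ch, nxt), 0)
        if kern:
            # Close the current string and emit the adjustment after it.
            items.append("(%s)" % pdf_escape(run))
            items.append("%g" % (-kern * 1000.0 / units_per_em))
            run = ""
    if run:
        items.append("(%s)" % pdf_escape(run))
    return "[%s] TJ" % " ".join(items)


if __name__ == "__main__":
    pairs = {("A", "V"): -80, ("V", "A"): -80}  # made-up kerning values
    print(kerned_tj("AVA", pairs))
```

The pair lookup happens once per character, which is cheap for a title or a single page but adds up across a long document, in line with the performance concern raised above.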

Philip Wong

Apr 13, 2023, 3:55:32 AM
to reportlab-users
Hi there,

I was recently trying to use ReportLab with mingliu.ttc, which has two layers of fonts. The first layer of the font file embeds 3 fonts, and each embedded font has different faces such as bold, italic and bold-italic.

The existing function can only select a font in the first layer, not a face in the second layer. For example:

pdfmetrics.registerFont(TTFont('mingliu', 'mingliu.ttc', subfontIndex=1))

How can I select the different faces of such a 'mingliu' font?

Thanks
