Problems rendering U+02B9 and Asian characters with UTF-8

Matthias Kreier

Jan 26, 2024, 6:04:55 AM
to reportlab-users
I was happy to see that reportlab supports Unicode, including Russian and Vietnamese, once I register a font file that contains the respective characters. I used the rather new Aptos font from Microsoft for my project.

But some characters are rendered only as boxes, such as U+02B9 (MODIFIER LETTER PRIME). It seems to be included in the font, since Word and Google Docs render it with this font without problems. And recently I discovered that Asian languages like Japanese, Chinese, Korean and Taiwanese are not that easy, either.

I tried some of the examples in the /demos folder, like test_multibyte_jpn.py, and they do work, but this approach dates from 2002, when Unicode and UTF-8 were still young. The user guide https://docs.reportlab.com/reportlab/userguide/ch3_fonts/ states: "This is the easy way to do it. No special handling at all is needed to work with Asian TrueType fonts."

I could not find any similar problem here in this forum, so I posted a new question. I created a test file example.py to visualize the problem. The code reads:

# example.py
from reportlab.pdfgen import canvas
my_canvas = canvas.Canvas("example.pdf")
text = ["Russian and Vietnamese works: Мафуса́л и Trận Đại Hồng Thủy",
"U+02B9 : MODIFIER LETTER PRIME does not work: Me·thuʹse·lah ",
"Japanese does not work: これは日本語のテキストです。",
"Neither does Chinese, Korean or Taiwanese:",
"这是一篇中文文本。| 한국어로 된 글입니다.|這是台灣文字。",
"The hack from /demos/test_multibyte_jpn works: 本語"]
msg = u'\u6771\u4EAC : Unicode font'.encode('utf8')
text.append(msg)

# simple drawing works only for latin characters
for i in range(7):
    my_canvas.drawString(50, 780 - 16 * i, text[i])

# with an imported font some of the UTF-8 features work
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
pdfmetrics.registerFont(TTFont('Aptos', '../aptos.ttf'))
my_canvas.setFont("Aptos", 14)
for i in range(7):
    my_canvas.drawString(50, 660 - 16 * i, text[i])

# now using the msmincho.ttc font
pdfmetrics.registerFont(TTFont('MS Mincho','msmincho.ttc'))
my_canvas.setFont('MS Mincho', 14)
for i in range(7):
    my_canvas.drawString(50, 540 - 16 * i, text[i])

my_canvas.save()
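
Before blaming the renderer, it can help to list exactly which non-ASCII code points a test string contains. The following is a minimal sketch using only the standard library; the helper name describe_non_ascii is made up for illustration and is not part of reportlab.

```python
import unicodedata


def describe_non_ascii(s):
    """Return (code point, Unicode name) pairs for each distinct non-ASCII character in s."""
    seen = []
    for ch in s:
        if ord(ch) > 0x7F and ch not in seen:
            seen.append(ch)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in seen]


# The second test string, written with escapes so the code points are unambiguous.
for code, name in describe_non_ascii("Me\u00b7thu\u02b9se\u00b7lah"):
    print(code, name)
# U+00B7 MIDDLE DOT
# U+02B9 MODIFIER LETTER PRIME
```

Running this over each entry of the text list shows which code points a candidate font would need to cover.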


The result looks like this:
Screenshot 2024-01-26 at 18.03.47.png
The second block uses the imported font, which is how I would like to use it.

Thanks!



Matthias Kreier

Jan 26, 2024, 7:41:43 AM
to reportlab-users
A little update. I downloaded FontForge and opened both fonts. Under Element > Font Info ... > Unicode Ranges it states for aptos.ttf:

Basic Multilingual Plane U+0000 - U+FFFD 1086/64082

That is only 1086 glyphs, and no CJK-specific range is highlighted below it. For comparison, msmincho.ttc has some 15747:

Basic Multilingual Plane U+0000-U+FFFD 15747/64082
CJK Symbols and Punctuation U+3000-U+303F 44/64
Enclosed CJK Letters and Months U+3200-U+32FF 174/255
CJK Compatibility U+3300-U+33FF 249/256
CJK Unified Ideographs Extension A U+3400-U+4DBF 164/6592
CJK Unified Ideographs U+4E00-U+9FFF 12579/20992
CJK Compatibility ideographs U+F900-U+FAFF 98/472
CJK Compatibility Forms U+FE30-U+FE4F 2/32
CJK Unified Ideographs Extension B U+20000-U+2A6DF 303/42720

It looks like neither the CJK characters nor U+02B9 are encoded in my aptos.ttf. I have to find a different font file. It just makes me wonder: how do Word and Google Docs render these documents in this font even though the respective glyphs are not present in the file?
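
The same coverage check can probably be scripted instead of done by hand in FontForge. This is a rough sketch under one assumption: that reportlab's TTFont object exposes the parsed character map as face.charToGlyph, a dict keyed by code point. That attribute is internal and may differ between versions. The helper names font_coverage and missing_chars are made up for illustration.

```python
def font_coverage(ttf_path):
    """Return the set of code points that a TrueType font maps to glyphs.

    Assumption: TTFont(...).face.charToGlyph is a dict keyed by code point.
    """
    from reportlab.pdfbase.ttfonts import TTFont  # lazy import; requires reportlab
    return set(TTFont("probe", ttf_path).face.charToGlyph)


def missing_chars(text, covered):
    """Return the distinct characters of text whose code points are not in covered."""
    missing = []
    for ch in text:
        if ord(ch) not in covered and ch not in missing:
            missing.append(ch)
    return missing


# Toy coverage set standing in for font_coverage("../aptos.ttf"):
# printable ASCII plus the middle dot, but without U+02B9.
toy = set(range(0x20, 0x7F)) | {0x00B7}
print([f"U+{ord(c):04X}" for c in missing_chars("Me\u00b7thu\u02b9se\u00b7lah", toy)])
# ['U+02B9']
```

With a real font file, missing_chars(some_text, font_coverage(path)) would list the characters likely to come out as boxes.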

Matthias Kreier

Jan 26, 2024, 8:08:30 AM
to reportlab-users
Further progress: if I download specific Noto fonts, the Japanese characters are rendered. They all seem to be included in the Chinese and Korean packages as well, but not vice versa: some Chinese characters are missing in the Korean package, and no Korean characters are in the Japanese or Chinese ones. All Latin and Cyrillic (Russian) characters are there, though, and even the diacritics for Vietnamese seem to be complete.

Still no solution for U+02B9 (MODIFIER LETTER PRIME). It is not included in any of these character sets.
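
Since no single font here covers everything, one possible workaround is manual font fallback, which is presumably what Word and Google Docs do behind the scenes when the selected font lacks a glyph. The sketch below splits a string into runs and assigns each run to the first font in a priority list that covers it. The function name split_runs and the coverage sets are hypothetical; real coverage would have to be read from the font files.

```python
def split_runs(text, fonts):
    """Split text into (font_name, substring) runs.

    fonts is an ordered list of (name, covered_code_points). Each character
    goes to the first font that covers it, or to the first font if none does.
    """
    runs = []
    for ch in text:
        name = next((n for n, cov in fonts if ord(ch) in cov), fonts[0][0])
        if runs and runs[-1][0] == name:
            # Same font as the previous character: extend the current run.
            runs[-1] = (name, runs[-1][1] + ch)
        else:
            runs.append((name, ch))
    return runs


# Hypothetical coverage sets, for illustration only.
latin = set(range(0x20, 0x250))
cjk = set(range(0x3000, 0xA000))
print(split_runs("Tokyo \u6771\u4EAC!", [("Aptos", latin), ("MS Mincho", cjk)]))
# [('Aptos', 'Tokyo '), ('MS Mincho', '東京'), ('Aptos', '!')]
```

Each run could then be drawn with its own setFont and drawString call, advancing the x position by pdfmetrics.stringWidth(run, font_name, font_size) after each run.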
