Khmer script not correctly rendered


Matthias Kreier

May 20, 2024, 2:54:47 PM
to reportlab-users
I use Khmer script in my project. The text is UTF-8 encoded and I use the Noto Sans Khmer TTF font file. For comparison, here is the same text with the same font in Word (left) and as the result of a reportlab run (right):

Screenshot 2024-05-21 015018.png
The text should be: 
54មនុស្ស
12ចៅក្រម
19ហោរា
53ស្តេច
82រយៈពេល
37ព្រឹត្តិការណ៍
18វត្ថុឬវត្ថុ

Matthias Kreier

May 21, 2024, 2:28:19 AM
to reportlab-users
Here is a simple example for the syllable ssa "ស្ស". In reportlab the result is Screenshot 2024-05-21 132037.png. Some further explanation:

The Khmer syllable "ស្ស" (ssa) consists of a base consonant followed by a subscript consonant. Here is a detailed breakdown of the Unicode sequence:

1. Base Consonant: ស (SA) - U+179F
2. Subscript Consonant: ្ស (subscript SA) - U+17D2 (KHMER SIGN COENG) + U+179F (SA again; there is no separate code point for the subscript form)

Full Unicode Sequence

Putting these together, the full Unicode sequence for "ស្ស" is:

U+179F (ស)
U+17D2 (្)
U+179F (ស, rendered as the subscript ្ស)


UTF-8 Encoding

To represent this sequence in UTF-8, each Unicode code point is converted to its corresponding UTF-8 byte sequence:

- U+179F (ស) in UTF-8: E1 9E 9F
- U+17D2 (្) in UTF-8: E1 9F 92
- U+179F (ស, displayed as subscript SA) in UTF-8: E1 9E 9F

Full UTF-8 Sequence

Combining these, the UTF-8 encoding for the sequence "ស្ស" is:

E1 9E 9F E1 9F 92 E1 9E 9F
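
The byte sequence above can be checked with a few lines of Python. This is only a small sketch using the standard library; the variable names are my own and not taken from any project code.

```python
import unicodedata

# "ស្ស" written as explicit code points: SA, COENG, SA
syllable = "\u179f\u17d2\u179f"

# Print code point, official Unicode name and UTF-8 bytes for each character
for ch in syllable:
    utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch):<18}  {utf8_hex}")

# UTF-8 bytes of the whole syllable
full_hex = " ".join(f"{b:02X}" for b in syllable.encode("utf-8"))
print(full_hex)
```

Note that the first and third characters are identical at this level. Nothing in the encoded text marks the second SA as a subscript; that decision is left to the rendering step.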


Rendering Process

1. Base Consonant: The rendering engine identifies the base consonant ស (U+179F).
2. Subscript Consonant: It recognizes the subscript sign (KHMER SIGN COENG, U+17D2) and attaches the following consonant to the base consonant in its subscript form.
3. Combination: The engine renders the subscript consonant properly positioned under the base consonant.
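
The three steps above can be sketched in plain Python as a labelling pass over the code points. This is a deliberately simplified illustration, not how a real shaping engine works: it only assumes that Khmer consonants occupy U+1780 to U+17A2 and that a consonant directly after COENG takes the subscript form. The function names are hypothetical.

```python
COENG = "\u17d2"  # KHMER SIGN COENG


def is_khmer_consonant(ch):
    # Khmer consonants KA..QA occupy U+1780..U+17A2
    return "\u1780" <= ch <= "\u17a2"


def label_code_points(text):
    """Label each code point as 'base', 'coeng', 'subscript' or 'other'."""
    labels = []
    prev_was_coeng = False
    for ch in text:
        if ch == COENG:
            labels.append(("coeng", ch))
            prev_was_coeng = True
            continue
        if is_khmer_consonant(ch):
            # A consonant right after COENG is drawn below the base
            role = "subscript" if prev_was_coeng else "base"
        else:
            role = "other"  # vowel signs, digits, Latin text, ...
        labels.append((role, ch))
        prev_was_coeng = False
    return labels


roles = [role for role, _ in label_code_points("\u179f\u17d2\u179f")]
print(roles)
```

A real engine additionally has to pick the subscript glyph from the font and position it, which this sketch does not attempt.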

In summary, the Unicode sequence for "ស្ស" involves a base consonant followed by a subscript sign and another consonant, encoded and rendered according to the rules of the Khmer script. The UTF-8 encoding ensures each character is correctly represented in byte form, which the rendering engine interprets to display the correct combined character.


Matthias Kreier

May 30, 2024, 9:16:08 AM
to reportlab-users
Digging around over the last 10 days gave me a better understanding of the problem. Some languages, scripts and glyphs need substitution tables and ligatures to properly render the text intended by a Unicode sequence. It is not an easy task. According to a Microsoft document I found, software supports some 634 language tags to render these languages properly in one of 173 scripts. Luckily most of the heavy lifting is already done, or is in a constant process of refinement, in fontTools and HarfBuzz. fpdf2, another Python project for creating PDF documents, solved this problem in 2022 by including these tools. That might be an option for reportlab as well, given the required manpower (from the company or the community).

I documented my findings here. I know that implementing the ligature rendering process will require some time and work. Otherwise I might have to move my project to another Python library. Andy probably knows what's best for his company.

Here is some example code that solves the problem:

# Example: rendering Khmer with fpdf2, first without and then with text shaping
from fpdf import FPDF

pdf = FPDF(orientation="P", unit="mm", format="A4")
pdf.add_page()
pdf.add_font("noto", style="", fname="../../fonts/NotoKhmer.ttf")

# First pass: no text shaping
pdf.set_font("noto", size=32)
pdf.cell(text="King        - ស្តេច", new_x="LMARGIN", new_y="NEXT")
pdf.cell(text="Prophet - ហោរា",     new_x="LMARGIN", new_y="NEXT")

pdf.set_font("Helvetica", size=12)
pdf.cell(h=20, text="Now using __text_shaping__ with **uharfbuzz**:", markdown=True, new_x="LMARGIN", new_y="NEXT")

# Second pass: the same text with the HarfBuzz shaping engine enabled
pdf.set_font("noto", size=32)
pdf.set_text_shaping(use_shaping_engine=True, script="khmr", language="khm")
pdf.cell(text="King        - ស្តេច", new_x="LMARGIN", new_y="NEXT")
pdf.cell(text="Prophet - ហោរា",     new_x="LMARGIN", new_y="NEXT")

pdf.output("example_fpdf.pdf")

And the output:

Screenshot 2024-05-30 201445.png

Matthias Kreier

Jun 22, 2024, 1:20:41 AM
to reportlab-users
The investigation of this phenomenon continued in my GitHub project in this issue: https://github.com/kreier/timeline/issues/35

It seems this problem could be solved if a text shaping engine like HarfBuzz were integrated into reportlab. For simpler combined characters and for Arabic (with import arabic_reshaper and reshaped = arabic_reshaper.reshape(exam_name)) this is already done in reportlab. In a post in this forum from 2005, Andy noted:

> We are trying to work out the right font descriptors and sequences of bytes to put in the PDF file so that the right stuff magically happens on screen.

And I think with HarfBuzz this would actually be possible. Going back to the example mentioned above (and in my issue 35): the Khmer word for years, ឆ្នាំ, is represented by five Unicode code points: '\u1786\u17D2\u1793\u17B6\u17C6'. But the glyphs that have to be referenced in the PDF are uni178617B6, uni17D21793 and uni17C6. While the last one looks like a code point, the other two are not Unicode code points but glyphs in the font file for these specific ligatures. We also need a little more information: by how much the "cursor" should move forward after each glyph (the first one has a width of 923, the others have zero) and how the glyphs should be positioned relative to the first glyph. This information would have to be written into the content stream of the PDF file (I don't know how this stream is generated in reportlab :( ), but all of it is provided by HarfBuzz.
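
To make this concrete, here is a rough sketch of how the glyph names, advances and offsets could be read from HarfBuzz through the uharfbuzz Python bindings. It is not reportlab code. The helper name shape and the font file name are my own assumptions, and the exact glyphs you get depend on the font file you use.

```python
import os


def shape(text, font_path):
    """Return (glyph name, cluster, x_advance, x_offset, y_offset) per glyph."""
    import uharfbuzz as hb  # third-party; imported here so the rest runs without it

    blob = hb.Blob.from_file_path(font_path)
    font = hb.Font(hb.Face(blob))
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()  # detects script and direction from the text
    hb.shape(font, buf)
    # After shaping, info.codepoint holds a glyph ID, not a Unicode code point
    return [
        (font.glyph_to_string(info.codepoint), info.cluster,
         pos.x_advance, pos.x_offset, pos.y_offset)
        for info, pos in zip(buf.glyph_infos, buf.glyph_positions)
    ]


word = "\u1786\u17d2\u1793\u17b6\u17c6"  # ឆ្នាំ
code_points = [f"U+{ord(ch):04X}" for ch in word]
print(code_points)

font_path = "NotoSansKhmer-Regular.ttf"  # hypothetical path, adjust to your font
if os.path.exists(font_path):
    for row in shape(word, font_path):
        print(row)
```

The advances and offsets are reported in font units, so they would still have to be scaled to the font size before being used for positioning on the page.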

I'm not sure whether functions like instanceStringWidthTTF would still work, since they take a UTF-8 encoded string as their text argument, but uni178617B6 and uni17D21793 are not Unicode code points and therefore cannot be represented in UTF-8. It's probably a lot of work. But it looked like Robin Becker (@replarobin) was interested in starting this project. I still have not received a response to my request to sign up for the official mailing list and can't post there, so I am posting this little update here.
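
The difference between a glyph name and a code point can be shown with a small helper. It assumes the font follows the common naming convention of "uni" followed by groups of four uppercase hex digits. Glyph names are just labels chosen by the font designer, so this is not guaranteed for every font, and the function name is hypothetical.

```python
import re


def uni_name_to_code_points(glyph_name):
    """Split a 'uniXXXXYYYY...' glyph name into code points, or return None."""
    match = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", glyph_name)
    if match is None:
        return None  # name does not follow the assumed convention
    digits = match.group(1)
    return [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]


for name in ("uni178617B6", "uni17D21793", "uni17C6"):
    cps = uni_name_to_code_points(name)
    is_single = len(cps) == 1  # only single-code-point glyphs map back to one character
    print(name, [f"U+{cp:04X}" for cp in cps], is_single)
```

Only uni17C6 corresponds to a single character. The other two names each stand for one glyph that covers two code points, which is why a width function that works on a text string cannot address them directly.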

Finally, a little visual of how the text shaping would work, replacing the five Unicode code points with three glyphs for the example above:

khmer_shape.png