Local language font doesn't render properly on PDF. श्री renders as श् री

Nikhil Mistry

Apr 25, 2024, 12:35:54 AM
to reportlab-users

I'm having trouble rendering some characters, such as श्री, in a PDF file when using the Mangal or Arial Unicode fonts: श्री renders as श् री. The same text renders correctly in HTML, but not in the PDF. BTW, I'm generating the PDF from a Django application. Any suggestions?

Matthias Kreier

May 21, 2024, 2:49:42 AM
to reportlab-users
It looks like reportlab does not fully handle the more complex Unicode sequences used by some Asian scripts. I have a similar challenge with Khmer and syllables like "ស្ស". For your Devanagari syllable "श्री" (shri) I found the following explanation (asking ChatGPT): It consists of a base consonant followed by a virama, a second consonant, and a dependent vowel sign. Here’s a detailed breakdown of the Unicode sequence:
  1. Base Consonant: श (SHA) - U+0936
  2. Sign: ् (VIRAMA) - U+094D (suppresses the inherent vowel of the preceding consonant and lets it join the next one)
  3. Consonant: र (RA) - U+0930 (following a virama it takes its subjoined "rakar" form and merges with the preceding consonant; the repha form drawn above a consonant occurs only when र comes before the virama)
  4. Dependent Vowel Sign: ी (II) - U+0940 (attached to the combined consonant form)
Unicode Sequence
  1. Base Consonant: U+0936 (श)
  2. Virama: U+094D (्)
  3. Consonant: U+0930 (र)
  4. Dependent Vowel Sign: U+0940 (ी)
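
As a quick check of the breakdown above, Python's standard unicodedata module can list the code points and their official names. This is only a small sketch; the helper name describe is made up for this illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# "श्री" written with escapes so the sequence is unambiguous
shri = "\u0936\u094d\u0930\u0940"

for code_point, name in describe(shri):
    print(code_point, name)
```

The string holds four separate code points. Nothing in the string itself says which glyphs to draw; that is decided by the font's shaping rules, if the renderer applies them.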
It looks like reportlab is not recognizing this combination. Instead it renders the base consonant with a visible virama, and then the next consonant with the following dependent vowel sign, which gives श् री. I'm no expert, but judging from this ChatGPT response, the special role of the virama (as in "please combine this with the following consonant") does not seem to be implemented in reportlab. The answer I got on how to render it is the following:

Rendering Process
  1. Base Consonant: The rendering engine identifies the base consonant श (U+0936).
  2. Virama: The virama (U+094D) indicates that the inherent vowel in the base consonant is suppressed, which facilitates the combination with the next consonant.
  3. Consonant (Rakar): The consonant र (U+0930) is combined with the preceding consonant श using the virama. In Devanagari script, when र follows a virama in this way, it is rendered in its subjoined "rakar" form, and श्र is drawn as a single conjunct.
  4. Dependent Vowel Sign: Finally, the dependent vowel sign ी (U+0940) is applied to the combined consonant form.

In summary, the Unicode sequence for "श्री" involves a base consonant followed by a virama, another consonant that joins the first one in its subjoined form, and a dependent vowel sign. The UTF-8 encoding only stores each code point as bytes; it is the rendering engine that has to interpret the sequence and select the correct combined glyph.
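
To make the role of the virama concrete, here is a rough sketch of how a renderer could group code points into units that must be shaped together. It is a deliberate simplification, not the Unicode segmentation algorithm and not what HarfBuzz or reportlab actually do. It assumes that combining marks attach to the previous character and that a sign with canonical combining class 9 (the class of the Devanagari virama and the Khmer coeng) pulls in the next character. The function name split_clusters is hypothetical.

```python
import unicodedata

# Canonical combining class 9 is used for virama-like signs
# (Devanagari VIRAMA U+094D, Khmer COENG U+17D2).
VIRAMA_CLASS = 9


def split_clusters(text):
    """Group code points into approximate shaping clusters (simplified)."""
    clusters = []
    join_next = False  # True right after a virama-like sign
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and (is_mark or join_next):
            clusters[-1] += ch  # stays in the current cluster
        else:
            clusters.append(ch)  # starts a new cluster
        join_next = unicodedata.combining(ch) == VIRAMA_CLASS
    return clusters


shri = "\u0936\u094d\u0930\u0940"      # श्री
khmer = "\u179f\u17d2\u179f"           # ស្ស
print(split_clusters(shri))
print(split_clusters(khmer))
```

Under these assumptions each of the two syllables comes out as a single cluster. A renderer that draws one glyph per code point ignores this grouping, which matches the broken output described in this thread.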

I hope someone at reportlab is reading these forum posts and has an idea how to improve the rendering of these complex composite characters in some Asian languages. Some combinations do work, like "កាំ" (kâm), a sequence of a base consonant, a dependent vowel and a diacritical mark.

My interpretation was confirmed by Copilot: the virama is used to form conjunct consonants. Here is the shorter answer from Copilot:

The word “श्री” is a ligature used in the Devanagari script, which is used to write Hindi, Sanskrit, and several other South Asian languages. It’s a combination of the consonant “श” and the sequence “्री”.

In Unicode, each character has a unique identifier known as a code point. The Unicode code points for “श” and for the three parts of “्री” are as follows:

  • “श” : U+0936
  • “्” : U+094D (this is called a Virama, which is used to form conjunct consonants)
  • “र” : U+0930
  • “ी” : U+0940

So, the sequence for “श्री” would be U+0936, U+094D, U+0930, U+0940. This sequence represents the individual characters that make up the ligature. When these Unicode points are rendered in the correct sequence, they form the ligature “श्री”.

Again, let's hope someone from reportlab reads these comments here.

Matthias

Matthias Kreier

May 30, 2024, 9:01:01 AM
to reportlab-users
It's related to my Khmer support issue. Many scripts need the right glyph substitutions to render combined characters correctly. The more you read about it, the more complicated it gets. But there is hope. The HarfBuzz project is intended to solve this problem. Another PDF-generating project called fpdf2 had a similar problem in 2022. The solution was to use the fontTools library and uharfbuzz. With that, your character renders correctly. I wrote a short program for demonstration:

# Requires fpdf2 and uharfbuzz (pip install fpdf2 uharfbuzz)
from fpdf import FPDF

pdf = FPDF(orientation="P", unit="mm", format="A4")
pdf.add_page()
# Adjust the path to wherever your Devanagari font file is stored
pdf.add_font("noto", style="", fname="../../fonts/NotoDevanagari.ttf")

# First line: text shaping is off (the default)
pdf.set_font("noto", size=32)
pdf.cell(text="Devanagari syllable shri - श्री", new_x="LMARGIN", new_y="NEXT")

pdf.set_font("Helvetica", size=12)
pdf.cell(h=20, text="Now using __text_shaping__ with **uharfbuzz**:", markdown=True, new_x="LMARGIN", new_y="NEXT")

# Second line: same text with the shaping engine enabled
pdf.set_font("noto", size=32)
pdf.set_text_shaping(use_shaping_engine=True, script="deva", language="hin")
pdf.cell(text="Devanagari syllable shri - श्री", new_x="LMARGIN", new_y="NEXT")
pdf.output("example_fpdf.pdf")

The result is:

[Image: Screenshot 2024-05-30 195927.png]
An older post here from 2015 mentions the related work and awareness of this issue: https://groups.google.com/g/reportlab-users/c/scxAhaReanI/m/IYSaDfoH9ZkJ 
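
Since the original question mentions Django, where the deployment environment may differ from the development machine, it can help to check that uharfbuzz is importable before turning shaping on. The sketch below is only a suggestion: the helper names shaping_available and configure_shaping are made up, and the only fpdf2 call used is the set_text_shaping call from the demonstration program above.

```python
import importlib.util


def shaping_available():
    """Return True if the uharfbuzz package can be found for import."""
    return importlib.util.find_spec("uharfbuzz") is not None


def configure_shaping(pdf, script="deva", language="hin"):
    """Enable text shaping on an fpdf2 FPDF object if uharfbuzz is installed.

    Returns True if shaping was enabled, False otherwise.
    """
    if not shaping_available():
        return False
    pdf.set_text_shaping(use_shaping_engine=True, script=script, language=language)
    return True
```

A view could call configure_shaping(pdf) right after set_font and log a warning when it returns False, so unshaped output does not go unnoticed.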