Rendering of indic text

Anurag Bansal

Apr 11, 2022, 12:28:51 PM

to pdfium

Hello,

I have been trying to create PDFs with Indic text (Hindi, Tamil, Punjabi, etc.) using PDFium, and I ran into a problem when inserting Indic text into a newly created PDF. The text does not render properly: the script's ligatures are ignored.

Here is an example sentence in Hindi: 'मैं घोषणा, पुष्टि और सहमत हूँ कि'. The attached PDF shows how this sentence is actually rendered.

Comparing the two, the text above and the text in the document are not the same, so the rendering is incorrect. I tried other PDF creation libraries first, without success. I switched to PDFium because the Google document translation API produces a Hindi translation of an English input document with correct ligatures, and the document properties of that output show it was created using PDFium.

I have also attached a file named 'sample.py' which uses the pypdfium2 library to create the pdf attached.

Am I missing something that causes the document to be created incorrectly, or does PDFium not support creating such PDFs at all?

Regards,

Anurag Bansal

sample.py

example.pdf

Lei Zhang

Apr 12, 2022, 4:50:06 PM

to Anurag Bansal, pdfium

For Indic languages and others that require text shaping, one cannot
simply use FPDFText_SetText() as is. Instead, one needs to use a text
shaping engine like HarfBuzz to position the glyphs in the correct
order and locations. FPDFText_SetCharcodes() is designed to work
with HarfBuzz to help do this.
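
For readers following along in Python, here is a minimal sketch of the shaping step, assuming the third-party uharfbuzz binding. The names `ShapedGlyph`, `to_shaped_glyphs`, and `shape_text` are hypothetical helpers made up for illustration, and the font path is whatever font file you use. After shaping, the HarfBuzz field called `codepoint` holds a glyph index in the font rather than a Unicode code point.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ShapedGlyph:
    glyph_id: int   # glyph index in the font (HarfBuzz's "codepoint" field after shaping)
    cluster: int    # index into the input text that this glyph came from
    x_advance: int  # how far the pen moves after this glyph, in font units
    y_advance: int
    x_offset: int   # shift applied to this glyph only, in font units
    y_offset: int


def to_shaped_glyphs(infos: Iterable, positions: Iterable) -> List[ShapedGlyph]:
    """Pair HarfBuzz glyph infos with their matching glyph positions."""
    return [
        ShapedGlyph(i.codepoint, i.cluster,
                    p.x_advance, p.y_advance, p.x_offset, p.y_offset)
        for i, p in zip(infos, positions)
    ]


def shape_text(font_path: str, text: str):
    """Shape `text` with the font at `font_path`; returns (glyphs, upem)."""
    import uharfbuzz as hb  # third-party dependency: pip install uharfbuzz

    blob = hb.Blob.from_file_path(font_path)
    face = hb.Face(blob)
    font = hb.Font(face)

    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()  # guess script, language, and direction
    hb.shape(font, buf)

    return to_shaped_glyphs(buf.glyph_infos, buf.glyph_positions), face.upem
```

The glyph list is in visual order, so it may contain more or fewer entries than the input string has characters.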

geisserml

Apr 12, 2022, 5:16:38 PM

to pdfium

As this is pypdfium2, you'll probably need some kind of Python binding to HarfBuzz. It looks like the official recommendation is to use PyGObject [1] [2].

There also seems to be at least one third-party binding [3] that does not depend on GTK, but I don't know if it works.

[1] https://harfbuzz.github.io/integration-python.html

[2] https://lazka.github.io/pgi-docs/#HarfBuzz-0.0

[3] https://github.com/ldo/harfpy

geisserml

Apr 13, 2022, 3:27:35 PM

to pdfium

Oh, and I overlooked uharfbuzz [1], a Cython-based Python binding for HarfBuzz. It is hosted in the official HarfBuzz organisation on GitHub and looks like the most promising solution to me.

geisserml

Apr 13, 2022, 3:27:53 PM

to pdfium

[1] https://github.com/harfbuzz/uharfbuzz

Anurag Bansal

Apr 13, 2022, 4:43:56 PM

to pdfium

Thanks for the reply!

Is there an example of how these two can be used together? I understand that HarfBuzz can be used to extract the codepoint, cluster, and position info, but how exactly does FPDFText_SetCharcodes() take its input, and how is it combined with HarfBuzz to write to the PDF? The documentation of FPDFText_SetCharcodes() isn't helping much. Sorry for the trouble.

Lei Zhang

Apr 13, 2022, 4:55:24 PM

to Anurag Bansal, pdfium

fpdfsdk/fpdf_edit_embeddertest.cpp has a test case to make sure
FPDFText_SetCharcodes() works, though it doesn't integrate with
HarfBuzz. How to use HarfBuzz is out of scope from the PDFium
perspective, but hb_buffer_get_glyph_positions() may be a good
starting point.
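
On the input question: in the C API, FPDFText_SetCharcodes() takes a text object, a pointer to an array of `uint32_t` charcodes, and a count. A rough sketch of building that array from Python with `ctypes` follows. The helpers `make_charcode_array` and `set_charcodes` are hypothetical, and `pdfium_c` stands for whichever module exposes the raw PDFium functions in your binding (an assumption, since the location differs between pypdfium2 versions). Whether a HarfBuzz glyph ID can be passed directly as a charcode depends on how the font was loaded into the document, so treat that mapping as an assumption to verify as well.

```python
import ctypes
from typing import Sequence


def make_charcode_array(glyph_ids: Sequence[int]):
    """Pack integers into a C uint32_t array, the type FPDFText_SetCharcodes() expects."""
    return (ctypes.c_uint32 * len(glyph_ids))(*glyph_ids)


def set_charcodes(pdfium_c, text_object, glyph_ids: Sequence[int]) -> bool:
    """Call FPDFText_SetCharcodes(text_object, charcodes, count) on a raw binding.

    `pdfium_c` is an assumed handle to the module that exposes the raw
    PDFium C functions; `text_object` is an existing FPDF_PAGEOBJECT.
    """
    charcodes = make_charcode_array(glyph_ids)
    ok = pdfium_c.FPDFText_SetCharcodes(text_object, charcodes, len(charcodes))
    return bool(ok)
```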

Anurag Bansal

Apr 14, 2022, 12:23:04 PM

to pdfium

I implemented the proposed solution and was able to get the glyphs and the glyph order as intended.

However, in this version an extra space sometimes (not always) appears after a glyph rendered into the PDF. This did not happen in the previous implementation, where the glyphs and character order were incorrect but the spacing was fine.

The current version was created using the following steps:

1. The input string (the one in the attached PDF, mentioned at the start of the thread) was passed to the HarfBuzz shaping function, which returned the codepoint values for the glyphs. The font I used was 'Noto Sans', downloaded from Google Fonts.

2. The codepoints were then combined into a single array containing only the codepoints, which was passed to the FPDFText_SetCharcodes() function.

Here is what the text should look like: ' मैं घोषणा, पुष्टि और सहमत हूँ कि'. The attached file 'example.pdf' contains the old version created without HarfBuzz, and 'example new.pdf' contains the text created using the two steps above.

How can I improve 'example new.pdf'? As it stands, the new version is very close to the intended output.

example.pdf

sample.py

example new.pdf

geisserml

Apr 22, 2022, 12:59:05 PM

to pdfium

Were you able to figure out how to resolve the whitespace issue? If not, could a member of the PDFium team or community please give a hint on how to address this problem?

Anurag Bansal

Apr 26, 2022, 12:07:54 PM

to pdfium

Hello again,

Yes. Thanks. I was able to solve the issue.

The issue was solved when I inserted one character at a time into the PDF, using the offset and advance values. I initially had trouble interpreting the advance and offset values, and found that this was because I did not know how to extract the 'upem' (units per em) value for a given font using HarfBuzz.
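
The script itself was not posted, so the following is only a sketch of the arithmetic described above, not the actual solution. It assumes the advances and offsets are in font units (HarfBuzz's default when no font scale is set) and converts them to PDF points with `font_size / upem`. The name `place_glyphs` and the tuple layout are made up for illustration.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Placement(NamedTuple):
    glyph_id: int
    x: float  # absolute position in points
    y: float


def place_glyphs(
    glyphs: Sequence[Tuple[int, int, int, int, int]],
    upem: int,
    font_size: float,
    origin: Tuple[float, float] = (0.0, 0.0),
) -> List[Placement]:
    """Turn shaped glyphs into absolute positions, one per glyph.

    Each glyph is (glyph_id, x_advance, y_advance, x_offset, y_offset),
    with all four metrics in font units.
    """
    scale = font_size / upem  # font units -> points
    pen_x, pen_y = origin
    placements = []
    for glyph_id, x_adv, y_adv, x_off, y_off in glyphs:
        # The offset shifts only this glyph; it does not move the pen.
        placements.append(
            Placement(glyph_id, pen_x + x_off * scale, pen_y + y_off * scale)
        )
        # The advance moves the pen to where the next glyph starts.
        pen_x += x_adv * scale
        pen_y += y_adv * scale
    return placements
```

Each placement would then become its own single-glyph text object, moved to `(x, y)` with a translation such as `FPDFPageObj_Transform(obj, 1, 0, 0, 1, x, y)`. Combining marks typically have a zero advance and a non-zero offset, which is why the font's built-in spacing alone may not position them correctly.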

I think support for this could be added either to PDFium itself or to its Python wrapper (pypdfium2), so that text can be placed correctly using HarfBuzz, with proper spacing.

Thanks again for all the help.

Regards,

Anurag Bansal

geisserml

Apr 26, 2022, 12:30:43 PM

to pdfium

Thanks for the reply, that's good to hear!

If you are willing to contribute your code as a support module for pypdfium2, I'll certainly be happy to integrate it. You may fork the repository on GitHub and make a pull request, or send me a patch by email.

If you have any questions about the code base, feel free to ask.

geisserml

May 2, 2022, 11:47:28 AM

to pdfium

It would also be sufficient if you could just send me your updated script, then I can do the remaining integration work.

Kind regards
