How is RTL/LTR text handled in PDFium?

Edouard Belval

Nov 4, 2020, 12:18:04 PM

to pdfium

I see that ICU is in the third_party modules, but I don't see it being used anywhere. Does PDFium handle RTL text? If so, how can I get the characters as they are printed?

One example is parentheses, which are sometimes mirrored when extracting characters.
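To make "mirrored" concrete: Unicode marks some characters, parentheses included, with the Bidi_Mirrored property, which means a renderer draws the mirrored glyph when the character sits in a right-to-left run. The short Python sketch below only inspects those Unicode properties with the standard `unicodedata` module; it does not call PDFium, and `describe` is a hypothetical helper name used for illustration.

```python
import unicodedata


def describe(ch):
    """Return (bidi class, has Bidi_Mirrored property) for one character."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))


# An opening parenthesis, a Latin letter, an Arabic letter (BEH), a closing parenthesis.
for ch in "(a\u0628)":
    print(f"U+{ord(ch):04X}", describe(ch))
```

Parentheses are direction-neutral (class `ON`) and mirrored, so which glyph is displayed depends on the direction of the surrounding run, while letters carry a strong direction (`L` or `AL`) and are never mirrored.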

Edouard Belval

Nov 5, 2020, 10:11:04 AM

to pdfium

Reading this message, I think more context might be beneficial to get a good answer.

I am trying to extract Arabic text from PDFs containing a mix of Arabic and English. I have two different samples: one in which the parentheses are inverted (and the text object is labelled as RTL) and one in which they aren't.

I tried extracting the characters with two methods:

1. Getting the Unicode code points directly from the text object, which, as I understand it, skips some preprocessing that is done in text_page
2. Getting the Unicode code point from FPDFText_GetUnicode

You can see the results here: https://imgur.com/a/PmYZZqG (Top: 1st method, Bottom: 2nd method). When we skip the preprocessing, I end up with mirrored parentheses, which are simple enough to fix. However, when I use the second method, only the right parenthesis gets inverted, and I am not sure I understand the logic behind it. My more "practical" questions are:

1. What processing is done in the text page that seemingly (and wrongly) mirrors only one of the parentheses?
2. How can I tell whether a document was created with parentheses that get mirrored at render time or with parentheses that were already in LTR order?

I am using Skia if it matters in this situation. I attached the PDF used to generate the image.
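The "simple enough to fix" case from the first method (both parentheses mirrored in a text object labelled RTL) could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the characters and a per-character RTL flag are assumed to have been extracted already, `swap_mirrored` and `MIRROR` are hypothetical names, and nothing here reflects what PDFium does internally.

```python
import unicodedata

# Assumption: only the common ASCII bracket pairs need swapping; extend as needed.
MIRROR = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def swap_mirrored(chars, rtl_flags):
    """Swap paired brackets for characters that came from an RTL text object.

    chars     -- extracted characters, one per entry
    rtl_flags -- parallel booleans, True if the character's text object was RTL
    """
    out = []
    for ch, is_rtl in zip(chars, rtl_flags):
        # Only touch characters Unicode marks as mirrored, and only in RTL objects.
        if is_rtl and unicodedata.mirrored(ch):
            ch = MIRROR.get(ch, ch)
        out.append(ch)
    return "".join(out)


# Hypothetical extraction result: an Arabic word wrapped in inverted parentheses.
sample = ")\u0645\u062B\u0627\u0644("
print(swap_mirrored(sample, [True] * len(sample)))
```

This handles the uniform case only. It does not explain or correct the second method's output, where just one of the two parentheses comes back inverted.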