How is RTL/LTR text handled in PDFium?

Edouard Belval

Nov 4, 2020, 12:18:04 PM
to pdfium
I see that ICU is in the third_party modules, but I don't see it being used anywhere. Does PDFium handle RTL text? If so, how can I get the characters as they are printed?

One example is parentheses, which are sometimes mirrored when characters are extracted.

Edouard Belval

Nov 5, 2020, 10:11:04 AM
to pdfium
Rereading my message, I think more context would help in getting a good answer.

I am trying to extract Arabic text from PDFs containing a mix of Arabic and English. I have two different samples: one in which the parentheses are inverted (and the text object is labelled as RTL) and one in which they aren't.

I tried extracting the characters with two methods:
  1. Getting the Unicode code points directly from the text object, which, as I understand it, skips some preprocessing that is done in text_page
  2. Getting the Unicode code points from FPDFText_GetUnicode
You can see the results here: https://imgur.com/a/PmYZZqG (Top: 1st method, Bottom: 2nd method). When I skip the preprocessing, I end up with mirrored parentheses, which are simple enough to fix. However, when I use the second method, only the right parenthesis gets inverted, and I don't understand the logic behind that. My more "practical" questions are:
  1. What processing is done in text_page that seemingly (and wrongly) mirrors only one of the parentheses?
  2. How can I tell whether a document was created with parentheses that are mirrored at render time or with parentheses that were already LTR?
I am using Skia, in case it matters in this situation. I have attached the PDF used to generate the image.
odd.pdf
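
For reference, the "simple enough to fix" step mentioned for the first method could look roughly like the sketch below. It is only an illustration under stated assumptions: `swap_mirrored_brackets` and `MIRROR_PAIRS` are hypothetical names, not part of PDFium, and the table covers only a few ASCII pairs (Unicode's BidiMirroring.txt lists the full set).

```python
# Hypothetical helper, not part of PDFium: swaps bracket pairs in a run of
# characters that was extracted without any bidi mirroring applied.
# The table is a small illustrative subset of the mirrored pairs that
# Unicode defines in BidiMirroring.txt.
MIRROR_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def swap_mirrored_brackets(text: str) -> str:
    """Return text with each listed bracket replaced by its counterpart."""
    return "".join(MIRROR_PAIRS.get(ch, ch) for ch in text)


# An Arabic word whose surrounding parentheses came out of extraction mirrored.
raw = ")\u0645\u0631\u062d\u0628\u0627("
fixed = swap_mirrored_brackets(raw)
```

A swap like this would presumably be applied only to runs already identified as RTL (for example, the text objects labelled RTL in the first sample); applying it to the second sample, where the parentheses are already correct, would break them.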