Rereading this message, I think more context would help in getting a good answer.
I am trying to extract Arabic text from PDFs containing a mix of Arabic and English. I have two different samples: one in which the parentheses are inverted (and the text object is labelled as RTL) and one in which they aren't.
I attempted to extract the characters using two methods:
- Getting the Unicode code points directly from the text object, which, as I understand it, skips some preprocessing that is done in text_page
- Getting the Unicode code points from FPDFText_GetUnicode
You can see the results here:
https://imgur.com/a/PmYZZqG (Top: 1st method, Bottom: 2nd method). When I skip the preprocessing, I end up with mirrored parentheses, which are simple enough to fix. However, when I use the second method, only the right parenthesis gets inverted, and I don't understand the logic behind it. My more "practical" questions are:
- What processing is done in text_page that seemingly (and wrongly) mirrors only one of the parentheses?
- How can I tell whether a document was created with parentheses that are mirrored at render time or with parentheses that were already LTR?
I am using Skia, in case it matters here. I have attached the PDF used to generate the image.
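
To make the "simple enough to fix" part concrete, here is a minimal sketch of the kind of swap I have in mind for the first method. The helper name `unmirror_brackets` and the `is_rtl` flag are hypothetical and not part of PDFium. It assumes that every bracket in a text object labelled RTL is stored in its mirrored form, which is exactly the assumption I cannot verify for the second sample.

```python
# Hypothetical helper, not a PDFium API: swaps paired brackets in the
# code points read directly from a text object that is labelled RTL.
MIRROR_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
}
_MIRROR_TABLE = str.maketrans(MIRROR_PAIRS)


def unmirror_brackets(text: str, is_rtl: bool) -> str:
    """Return text with paired brackets swapped when the object is RTL.

    Assumption: an RTL-labelled object stores all of its brackets mirrored.
    LTR objects are returned unchanged.
    """
    if not is_rtl:
        return text
    return text.translate(_MIRROR_TABLE)
```

This works on the output of the first method because both parentheses are mirrored consistently there. It would not help with the FPDFText_GetUnicode output, where only one of the two is inverted.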