Trouble with text extraction of special writing systems like Hindi

geisserml

May 31, 2022, 6:40:16 AM
to pdfium
Hello,

I'm currently working with PDFium's text extraction features in pypdfium2. It works well for Latin script, but I have problems with Hindi text.
When trying to extract text from the attached PDF, I get something like "मၝघोषणाᆸपुჹ჋और सहमत ီ ჸकᇆ", although it should be "मैं घोषणा, पुष्टि और सहमत हूँ कि:" instead.
Now I'm wondering if this is a known limitation of PDFium itself, or if I'm doing something wrong when decoding the data provided by PDFium. This is my current code:
```python3
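import ctypes
# Buffer of 16-bit code units, with one spare slot at the end
c_array = (ctypes.c_ushort * (n_chars+1))()
pdfium.FPDFText_GetBoundedText(*args, ctypes.cast(c_array, ctypes.POINTER(ctypes.c_ushort)), n_chars)
# Interpret the raw buffer as UTF-16LE and drop the spare trailing slot
text = bytes(c_array).decode("utf-16-le")[:-1]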
```
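
One way to rule out the decoding step is to round-trip a known string through the same kind of buffer without involving PDFium at all. The sketch below is a self-contained check, not pypdfium2 code: `decode_utf16_buffer` is a hypothetical helper name, and the buffer is filled by hand rather than by `FPDFText_GetBoundedText`. Like the snippet above, it assumes a little-endian host.

```python
import ctypes

def decode_utf16_buffer(c_array, n_units):
    """Decode the first n_units UTF-16 code units of a c_ushort array."""
    raw = bytes(c_array)[: n_units * 2]  # each code unit is 2 bytes
    return raw.decode("utf-16-le")

expected = "मैं घोषणा, पुष्टि और सहमत हूँ कि:"
encoded = expected.encode("utf-16-le")
n_units = len(encoded) // 2

# Simulate what a C API would write: the code units, then one zeroed slot.
c_array = (ctypes.c_ushort * (n_units + 1))()
ctypes.memmove(c_array, encoded, len(encoded))

roundtrip = decode_utf16_buffer(c_array, n_units)
print(roundtrip == expected)
```

If the round trip succeeds, the `ctypes` plus `decode("utf-16-le")` approach is probably sound for this text, which would point at the data coming out of the library rather than at the decoding.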

Thanks!
hindi.pdf

geisserml

May 31, 2022, 8:49:57 AM
to pdfium
I just tried opening the document with Chromium and copying the text, which returns the same gibberish as pypdfium2, so it looks like this is a bug in PDFium. Shall I file a report at monorail?
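
To see what was actually returned, it can help to list the code points of the extracted string and flag anything outside the Devanagari block (U+0900 to U+097F). This is a small diagnostic sketch using only the standard library; `non_devanagari` is a hypothetical helper, and the sample string with a stray U+10F9 is made up for illustration rather than taken from the PDF.

```python
import unicodedata

DEVANAGARI = range(0x0900, 0x0980)  # Unicode Devanagari block

def non_devanagari(text):
    """Return (char, code point, name) for non-ASCII chars outside Devanagari."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 0x7F and ord(ch) not in DEVANAGARI
    ]

good = "\u0918\u094b\u0937\u0923\u093e"   # घोषणा
bad = "\u0918\u094b\u10f9\u0923\u093e"    # made-up sample with one Georgian letter

print(non_devanagari(good))
for entry in non_devanagari(bad):
    print(entry)
```

An empty list for the extracted text would mean every non-ASCII character is at least in the right block; entries from unrelated scripts would suggest that the wrong code points are already present in what the extractor returns.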

Miklos Vajna

Jun 19, 2022, 7:51:28 AM
to geisserml, pdfium
Hi,

On Tue, May 31, 2022 at 05:49:57AM -0700, geisserml <geis...@gmail.com> wrote:
> I just tried opening the document with Chromium and copying the text, which
> returns the same gibberish as pypdfium2, so it looks like this is a bug in
> PDFium. Shall I file a report at monorail?

I think so: in general, bug reports are tracked at
<https://bugs.chromium.org/p/pdfium/issues/entry>, not on this mailing
list.

Regards,

Miklos

Lei Zhang

Aug 4, 2022, 3:53:26 PM
to geisserml, pdfium
Firefox does the same thing. Can any PDF software correctly extract the text?
> --
> You received this message because you are subscribed to the Google Groups "pdfium" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to pdfium+un...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/pdfium/29af8595-6640-447b-aa38-4bad0bdaf250n%40googlegroups.com.

geisserml

Aug 17, 2022, 9:13:33 AM
to pdfium
Maybe not. So far I've tried PDFium, pdf.js, Poppler and MuPDF, none of which can extract the sentence correctly (though some results look less bad than others).
I'll see if I can get access to a non-Linux device at some point to see how Adobe products behave.

Lei Zhang

Aug 17, 2022, 6:59:12 PM
to geisserml, pdfium
Acrobat Reader DC gives tofu boxes with question marks inside for
every character in this PDF.

Justin Pierce

Jan 25, 2023, 12:29:54 PM
to pdfium
We use PDFium in our commercial software, and this is the result.

Make sure you are using wstring (C++) and properly converting from the unsigned short array that PDFium outputs.
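
In Python terms, the "proper conversion" concern is mostly about treating the buffer as UTF-16 rather than widening each 16-bit value on its own. The sketch below contrasts the two approaches; all three function names are hypothetical, and the input is built by hand instead of coming from PDFium. Devanagari lies entirely in the Basic Multilingual Plane, so both approaches should agree for Hindi text, and a difference only shows up for characters that need surrogate pairs.

```python
import ctypes
import struct

def to_units(text):
    """Build a c_ushort array holding the UTF-16 code units of text."""
    raw = text.encode("utf-16-le")
    n = len(raw) // 2
    return (ctypes.c_ushort * n)(*struct.unpack(f"<{n}H", raw))

def widen_naively(units):
    """Map each 16-bit code unit to one character (breaks surrogate pairs)."""
    return "".join(chr(u) for u in units)

def decode_utf16(units):
    """Decode the code units as UTF-16LE, combining surrogate pairs."""
    return b"".join(u.to_bytes(2, "little") for u in units).decode("utf-16-le")

hindi = to_units("\u0915\u093f")   # कि, BMP only
emoji = to_units("\U0001F600")     # outside the BMP, two code units

print(widen_naively(hindi) == decode_utf16(hindi))           # True
print(len(widen_naively(emoji)), len(decode_utf16(emoji)))   # 2 1
```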

res.jpg

Justin Pierce

Jan 25, 2023, 12:29:54 PM
to pdfium
What is the result when you simply write this into a C++ file and save it to disk or print it to the console?
