PDFium text extraction duplicates text

Harry Bego

Jun 6, 2019, 1:26:47 PM
to pdfium
When using pdfium's text extraction features I find that in rare cases fragments of text are duplicated.

I have one PDF for which the extracted text of almost every page comes out duplicated.

Is this a known issue?

Thank you for any clues!
Harry

Lei Zhang

Jun 11, 2019, 3:28:44 PM
to Harry Bego, pdfium
Assuming you are using FPDFText_GetText(), you can search our bug
tracker [1] for the known issues that mention that particular
function.

If you can share a PDF, we can look into why that PDF has duplicate
text. You can always file a bug for this, and we can discuss the issue
on the bug.

[1] https://bugs.chromium.org/p/pdfium/issues/list?can=2&q=FPDFText_GetText

Harry Bego

Jun 13, 2019, 12:31:16 PM
to pdfium
Thank you for your reply. I have uploaded a PDF to my DropBox at 
https://www.dropbox.com/s/jem5wvfzm0dsrvy/gunners.pdf?dl=0

When I extract text from it, the text of most pages is duplicated.

The bug tracker does not list related issues for FPDFText_GetText.
I use Erik Salaj's TPdf implementation for C++Builder VCL; I'm not sure whether it calls FPDFText_GetText.
I contacted Erik, and he confirmed that the pages are duplicated for this PDF.

Thanks for your help!
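
For anyone trying to pin down which pages are affected, a quick check on the extracted text may help. The following is only a minimal sketch and is not taken from PDFium or TPdf: `IsDoubledText` is a hypothetical helper name, and it assumes the page text has already been extracted into a `std::string` by whatever API is in use. It only detects the simplest case, where a page's text is exactly the same content twice in a row with nothing in between.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of PDFium or TPdf).
// Returns true when `text` is non-empty and its second half is an
// exact copy of its first half, i.e. the whole text appears twice.
bool IsDoubledText(const std::string& text) {
    const std::size_t n = text.size();
    // An empty or odd-length string cannot be an exact doubling.
    if (n == 0 || n % 2 != 0) {
        return false;
    }
    const std::size_t half = n / 2;
    // Compare the range [0, half) with the range [half, n).
    return text.compare(0, half, text, half, half) == 0;
}
```

Running this over each page's text would flag fully doubled pages. It will not catch duplicated fragments inside a page, or a doubling where a separator such as an extra line break sits between the two copies.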