How to avoid extracting hidden text

hb...@planet.nl

Jan 27, 2022, 10:12:50 AM
to pdfium

Hi all, I’m using FPDFText_GetUnicode() to extract text from a PDF.

I’m finding that this also extracts hidden text.

Is there a way to find out if text is hidden, so that I can discard hidden text?

I tried FPDFText_GetTextRenderMode but that doesn’t work here.

Thanks for any hints!

Lei Zhang

Jan 27, 2022, 11:54:46 AM
to hb...@planet.nl, pdfium
What is your definition of hidden text?

hb...@planet.nl

Jan 27, 2022, 12:39:41 PM
to pdfium
Text that is not shown in the rendered PDF. The attached image shows a page in Adobe Reader with a hidden text block selected. No text is shown in Reader, but I can copy and paste it, and both FPDFText_GetUnicode() and FPDFText_GetText() read the characters.

On Thursday, January 27, 2022 at 17:54:46 UTC+1, Lei Zhang wrote:
HiddenText.png

Lei Zhang

Jan 27, 2022, 1:38:44 PM
to hb...@planet.nl, pdfium
Have you looked into how that text is being hidden? There are likely
multiple methods to do so, and the answer would depend on how it's
hidden.

hb...@planet.nl

Jan 28, 2022, 4:41:23 AM
to pdfium

This is probably a proof with the recent editing history included. I don't have the source file, just the PDF. The extracted text contains repeated fragments and paragraphs, some of them from adjacent pages.

On Thursday, January 27, 2022 at 19:38:44 UTC+1, Lei Zhang wrote:

Lei Zhang

Jan 28, 2022, 3:08:38 PM
to hb...@planet.nl, pdfium
It may be helpful to share the PDF, or just the relevant page from the PDF.

hb...@planet.nl

Jan 31, 2022, 8:55:40 AM
to pdfium
Dropbox link to ESS.pdf below. See page 148 (physical page 163).

https://www.dropbox.com/s/27mx5ur5ecgqqpf/ESS.pdf?dl=0

Lei Zhang

Feb 1, 2022, 1:08:56 PM
to hb...@planet.nl, pdfium
There's a clipping path applied to the hidden text. One way to examine
the PDF and determine this is to use FPDFPage_GetObject() to get
objects of type FPDF_PAGEOBJ_TEXT. Then use FPDFPageObj_GetClipPath()
and related APIs to find out what the clipping path is.


On Mon, Jan 31, 2022 at 5:55 AM hb...@planet.nl <hb...@planet.nl> wrote:
>
> Dropbox link to ESS.pdf below. See page 148 (physical page 163).
>
> https://www.dropbox.com/s/27mx5ur5ecgqqpf/ESS.pdf?dl=0

hb...@planet.nl

Feb 3, 2022, 7:11:01 AM
to pdfium

Thank you. OK, I can determine whether clipping paths have been applied (see code below). But how is this going to help? How do I determine which text objects are invisible? And do I then need to reconstruct a Text_Page so that I can keep using FPDFText_GetUnicode() or FPDFText_GetText()?

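    // Needs fpdf_edit.h and fpdf_transformpage.h; Pdf_Page is a loaded FPDF_PAGE.
    int nobjects = FPDFPage_CountObjects(Pdf_Page);

    for (int i = 0; i < nobjects; ++i)
    {
        FPDF_PAGEOBJECT pageobj = FPDFPage_GetObject(Pdf_Page, i);

        // Only text objects are of interest here.
        if (FPDFPageObj_GetType(pageobj) == FPDF_PAGEOBJ_TEXT)
        {
            // Returns NULL on failure; the caller does not own the handle.
            FPDF_CLIPPATH clippath = FPDFPageObj_GetClipPath(pageobj);

            if (clippath != NULL)
            {
                // Returns -1 on failure.
                int npaths = FPDFClipPath_CountPaths(clippath);

                if (npaths != -1)
                {
                    ; // what next?
                }
            }
        }
    }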
On Tuesday, February 1, 2022 at 19:08:56 UTC+1, Lei Zhang wrote:

Lei Zhang

Feb 8, 2022, 10:35:04 PM
to hb...@planet.nl, pdfium
After calling FPDFClipPath_CountPaths(), use
FPDFClipPath_CountPathSegments() to get the number of segments for
some path. Then use FPDFClipPath_GetPathSegment() to get the segments.
FPDFPathSegment_GetPoint() and related APIs work with the segments.
Hope that helps connect the dots.
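
To make that last step more concrete, here is a minimal sketch in plain C of one possible way to use the segment points once they have been collected. It is not PDFium code and not necessarily how PDFium itself decides visibility: Point, Rect, points_bounds(), rects_overlap() and text_is_clipped_away() are hypothetical helpers made up for illustration. It assumes the points returned by FPDFPathSegment_GetPoint() for one clip path have been copied into an array, that the text object's bounds are known (for example from FPDFPageObj_GetBounds()), and that both are in the same coordinate space. It only compares bounding boxes, so it is a conservative heuristic that can miss text hidden by a non-rectangular clip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical holder for a point reported by FPDFPathSegment_GetPoint(). */
typedef struct { float x, y; } Point;

/* Hypothetical axis-aligned rectangle, e.g. filled from FPDFPageObj_GetBounds(). */
typedef struct { float left, bottom, right, top; } Rect;

/* Compute the bounding box of n points. Returns false if there are none. */
static bool points_bounds(const Point *pts, size_t n, Rect *out)
{
    assert(out != NULL);
    if (pts == NULL || n == 0)
        return false;
    out->left = out->right = pts[0].x;
    out->bottom = out->top = pts[0].y;
    for (size_t i = 1; i < n; ++i) {
        if (pts[i].x < out->left)   out->left = pts[i].x;
        if (pts[i].x > out->right)  out->right = pts[i].x;
        if (pts[i].y < out->bottom) out->bottom = pts[i].y;
        if (pts[i].y > out->top)    out->top = pts[i].y;
    }
    return true;
}

/* True if the two rectangles share a region of non-zero area. */
static bool rects_overlap(const Rect *a, const Rect *b)
{
    return a->left < b->right && b->left < a->right &&
           a->bottom < b->top && b->bottom < a->top;
}

/*
 * Heuristic: the text counts as clipped away when its bounds do not overlap
 * the bounding box of the clip path points. With no points there is nothing
 * to test, so no claim is made (assumption: treat the text as visible).
 */
static bool text_is_clipped_away(const Rect *text, const Point *clip_pts,
                                 size_t n)
{
    Rect clip;
    assert(text != NULL);
    if (!points_bounds(clip_pts, n, &clip))
        return false;
    return !rects_overlap(text, &clip);
}
```

A clip path object can hold several paths, and PDF clipping intersects them, so a text object could be treated as hidden as soon as the test reports true for any one path. Text that passes the test may still be hidden by other means, so this is a first filter only.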