How to avoid extracting hidden text

hb...@planet.nl

Jan 27, 2022, 10:12:50 AM

to pdfium

Hi all, I’m using FPDFText_GetUnicode() to extract text from a PDF.

I’m finding that this also extracts hidden text.

Is there a way to find out if text is hidden, so that I can discard hidden text?

I tried FPDFText_GetTextRenderMode but that doesn’t work here.

Thanks for any hints!

Lei Zhang

Jan 27, 2022, 11:54:46 AM

to hb...@planet.nl, pdfium

What is your definition of hidden text?

> --
> You received this message because you are subscribed to the Google Groups "pdfium" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to pdfium+un...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/pdfium/b5f45897-2042-40d9-b4e4-3752181be31en%40googlegroups.com.

hb...@planet.nl

Jan 27, 2022, 12:39:41 PM

to pdfium

Text that is not shown in the rendered PDF. The attached image shows a page in Adobe Reader with a hidden text block selected. Reader displays no text there, but I can copy and paste it, and both FPDFText_GetUnicode() and FPDFText_GetText() read the characters.

On Thursday, January 27, 2022 at 17:54:46 UTC+1, Lei Zhang wrote:

HiddenText.png

Lei Zhang

Jan 27, 2022, 1:38:44 PM

to hb...@planet.nl, pdfium

Have you looked into how that text is being hidden? There are likely
multiple methods to do so, and the answer would depend on how it's
hidden.

hb...@planet.nl

Jan 28, 2022, 4:41:23 AM

to pdfium

This is probably a proof that still includes its recent editing history. I don't have the source file, just the PDF. The extracted text contains repeated fragments and paragraphs, some of them from adjacent pages.

On Thursday, January 27, 2022 at 19:38:44 UTC+1, Lei Zhang wrote:

Lei Zhang

Jan 28, 2022, 3:08:38 PM

to hb...@planet.nl, pdfium

It may be helpful to share the PDF, or just the relevant page from the PDF.

hb...@planet.nl

Jan 31, 2022, 8:55:40 AM

to pdfium

Dropbox link to ESS.pdf below. See page 148 (physical page 163).

https://www.dropbox.com/s/27mx5ur5ecgqqpf/ESS.pdf?dl=0

Lei Zhang

Feb 1, 2022, 1:08:56 PM

to hb...@planet.nl, pdfium

There's a clipping path applied to the hidden text. One way to examine
the PDF and determine this is to use FPDFPage_GetObject() to get
objects of type FPDF_PAGEOBJ_TEXT. Then use FPDFPageObj_GetClipPath()
and related APIs to find out what the clipping path is.

hb...@planet.nl

Feb 3, 2022, 7:11:01 AM

to pdfium

Thank you. OK, I can now determine whether a clipping path has been applied to a text object; see the code below. But how does this help? How do I determine which text objects are invisible? And do I then need to reconstruct a Text_Page so that I can keep using FPDFText_GetUnicode() or FPDFText_GetText()?

int nobjects = (int)FPDFPage_CountObjects(Pdf_Page);

for (int i = 0; i < nobjects; ++i)
{
    FPDF_PAGEOBJECT pageobj = FPDFPage_GetObject(Pdf_Page, i);
    if (FPDFPageObj_GetType(pageobj) == FPDF_PAGEOBJ_TEXT)
    {
        FPDF_CLIPPATH clippath = FPDFPageObj_GetClipPath(pageobj);
        if (clippath != NULL)
        {
            int npaths = FPDFClipPath_CountPaths(clippath);
            if (npaths != -1)
            {
                ; // what next?
            }
        }
    }
}

On Tuesday, February 1, 2022 at 19:08:56 UTC+1, Lei Zhang wrote:

Lei Zhang

Feb 8, 2022, 10:35:04 PM

to hb...@planet.nl, pdfium

After calling FPDFClipPath_CountPaths(), use
FPDFClipPath_CountPathSegments() to get the number of segments for
some path. Then use FPDFClipPath_GetPathSegment() to get the segments.
FPDFPathSegment_GetPoint() and related APIs work with the segments.
Hope that helps connect the dots.
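
One possible way to use those segment points, sketched here as an assumption and not as a method confirmed anywhere in this thread: collect the points returned by FPDFPathSegment_GetPoint() for the clip path, take their bounding box, and compare it with the text object's bounds from FPDFPageObj_GetBounds(). If the two rectangles do not overlap, the text is probably clipped away. The types and helper names below (point_t, rect_t, bounds_of_points, rects_overlap, text_is_clipped_away) are hypothetical and contain only the geometry, with no PDFium calls.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper types; not part of the PDFium API. */
typedef struct { float x, y; } point_t;
typedef struct { float left, bottom, right, top; } rect_t;

/* Bounding box of n points (n must be at least 1). In a real program the
 * points would be the values filled in by FPDFPathSegment_GetPoint() for
 * each segment of the clip path. */
rect_t bounds_of_points(const point_t *pts, size_t n)
{
    rect_t r = { pts[0].x, pts[0].y, pts[0].x, pts[0].y };
    for (size_t i = 1; i < n; ++i) {
        if (pts[i].x < r.left)   r.left   = pts[i].x;
        if (pts[i].x > r.right)  r.right  = pts[i].x;
        if (pts[i].y < r.bottom) r.bottom = pts[i].y;
        if (pts[i].y > r.top)    r.top    = pts[i].y;
    }
    return r;
}

/* True when the two rectangles share some interior area. */
bool rects_overlap(rect_t a, rect_t b)
{
    return a.left < b.right && b.left < a.right &&
           a.bottom < b.top && b.bottom < a.top;
}

/* Assumption: a text object whose bounds lie entirely outside the clip
 * path's bounding box is treated as hidden. */
bool text_is_clipped_away(rect_t text_bounds, rect_t clip_bounds)
{
    return !rects_overlap(text_bounds, clip_bounds);
}
```

This is only a conservative approximation. A bounding box ignores the actual shape of the path, Bezier control points can lie outside the curve, and the sketch assumes the clip path points and the text bounds are expressed in the same coordinate space, which should be checked against the actual PDF.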
