Correct way to match text objects/rects to character indexes


Nikita Rybak

Sep 5, 2023, 1:49:24 PM
to pdf...@googlegroups.com
Hi there,

If I have a text rectangle and/or object, obtained through either FPDFText_GetRect or FPDFPage_GetObject, is there a reliable way to find its start and end character indexes, i.e. the kind one could pass into FPDFText_GetText, FPDFText_GetFillColor and similar methods?

I could iterate through all the rectangles/objects on the page and keep track of the character count (index += len(rectText);), but I'm not sure it would account for all the corner cases (e.g. ligatures, modifier symbols, Chinese/Japanese characters, or some niche PDF features I've never heard of). I'm concerned that I'd end up with a solution that works 97% of the time and fails badly on less common documents.
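For comparison, a geometry-based alternative to running counts could look roughly like the sketch below. It assumes per-character boxes have already been collected (for example via FPDFText_GetCharBox for each index up to FPDFText_CountChars) and converted to (left, bottom, right, top) tuples. That tuple layout, the tolerance value, and the function names are illustrative assumptions, not part of pdfium. Characters without a usable box would likely need separate handling.

```python
from typing import List, Sequence, Tuple

# Assumed layout: (left, bottom, right, top) in PDF page coordinates.
Box = Tuple[float, float, float, float]


def center_inside(char_box: Box, rect: Box, tol: float = 0.5) -> bool:
    """Return True if the centre of char_box lies within rect grown by tol."""
    left, bottom, right, top = char_box
    cx, cy = (left + right) / 2, (bottom + top) / 2
    r_left, r_bottom, r_right, r_top = rect
    return (r_left - tol <= cx <= r_right + tol
            and r_bottom - tol <= cy <= r_top + tol)


def rect_to_char_runs(rect: Box, char_boxes: Sequence[Box]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index runs of characters centred in rect.

    More than one run means the rect covers characters that are not
    contiguous in index order, which is worth flagging rather than hiding.
    """
    runs: List[Tuple[int, int]] = []
    for i, box in enumerate(char_boxes):
        if not center_inside(box, rect):
            continue
        if runs and runs[-1][1] == i - 1:
            runs[-1] = (runs[-1][0], i)  # extend the current run
        else:
            runs.append((i, i))  # start a new run
    return runs
```

Because the indexes come from the character list itself, a ligature or combining mark cannot shift the count the way len(rectText) might; whether the boxes line up well enough on unusual documents would still need testing.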

And a follow-up question: are FPDFText_GetRect rectangles always the same as the text objects received through FPDFPage_GetObject? That seems to be the case in my testing (after discarding 0-length text objects), but I want to make sure I'm not making wrong assumptions.


Kind regards,
Nikita

geisserml

Sep 9, 2023, 5:11:41 PM
to pdfium
A related question is why we need `FPDFText_GetSchCount()` and can't just use the length of the input search text.

@OP By the way, may I ask why/how you are using pdfium's rectangle API? When I tried it, I didn't find it very useful, since the target is apparently not clearly defined: the rects may group anything from a few letters to one or more words. Distinct APIs to get words, lines and paragraphs would seem more useful to me, but this kind of layout analysis was considered out of scope by the pdfium team (there was a feature request in the bug tracker). Anyway, it's not clear to me what purpose the rects API serves; however, it was not created in pdfium but inherited from Foxit, so we probably can't reach the person who wrote the code.

Nikita Rybak

Sep 11, 2023, 3:35:05 AM
to pdfium
@geisserml I'm actually working on grouping text into paragraph-level blocks (paragraphs, table cells, footers, etc.) as part of a general text extraction process. Using the rect API was a helpful first step, plus I wasn't sure whether calling the char API on every symbol from Python (pypdfium2) would tank performance.
I think it might be safer to avoid it though, in case there are some dragons lurking.

geisserml

Sep 11, 2023, 6:29:38 AM
to pdfium
Interesting. The rect API could indeed be a useful intermediary here, if you need only the grouping by proximity and not words specifically.
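To make "grouping by proximity" concrete, here is a rough sketch of one way it could be done on rects. It is not how pdfium or pypdfium2 does anything: the (left, bottom, right, top) tuple layout, the function name and the max_gap threshold are all assumptions for illustration, and it ignores columns entirely.

```python
from typing import List, Tuple

# Assumed layout: (left, bottom, right, top) in PDF page coordinates.
Box = Tuple[float, float, float, float]


def group_into_blocks(rects: List[Box], max_gap: float = 6.0) -> List[List[Box]]:
    """Greedily group rects top-to-bottom into blocks.

    A new block starts whenever the vertical gap between the lowest edge
    of the current block and the top of the next rect exceeds max_gap.
    """
    blocks: List[List[Box]] = []
    # PDF y grows upward, so sort by top edge descending, then left to right.
    for rect in sorted(rects, key=lambda r: (-r[3], r[0])):
        if blocks:
            block_bottom = min(r[1] for r in blocks[-1])
            if block_bottom - rect[3] <= max_gap:
                blocks[-1].append(rect)
                continue
        blocks.append([rect])
    return blocks
```

A real page would need at least a horizontal check for multi-column layouts, and the threshold would probably have to scale with font size.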
I'm not sure about the performance question; we would have to benchmark that. I'd hazard a guess that the difference would be small compared to any analysis work you'll be doing afterwards in Python, though (e.g. the TOC API also makes many separate FFI calls and it's still very fast).