Extracting text blocks


Guy Rosin

Apr 20, 2023, 3:52:45 PM
to pdfium
Hi,
I'm using pypdfium2 (a Python wrapper around pdfium).
It seems text extraction behaves very differently when using `get_rect()` (which wraps `FPDFText_GetRect()`) and `get_text_range()` (which wraps `FPDFText_GetText()`).
Generally, I see that `get_text_range()` extracts text as expected, but sequentially calling `get_rect()` results in weird text: the rectangles come in an unnatural order, so the original text can't be reconstructed from them.

Here's my Python code:
```python
# Walk over every text rectangle and print the text it bounds.
for i in range(text_page.count_rects()):
    rect = text_page.get_rect(i)
    text = text_page.get_text_bounded(*rect)
    print(text)
```
The output for the attached PDF page is attached.

[My ultimate goal is to extract text blocks, which are more abstract and natural than text rectangles. For example a text line can be composed of multiple rectangles, and I'm trying to merge close rectangles into blocks. This functionality is included for example in pymupdf (https://pymupdf.readthedocs.io/en/latest/textpage.html#TextPage.extractBLOCKS)].
weird_pdf.pdf
output.txt
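
A minimal sketch of the "merge close rectangles" idea described above, not a pdfium or pypdfium2 feature. It assumes each rectangle is a `(left, bottom, right, top)` tuple in PDF coordinates (y axis pointing up), which is what `get_rect()` is assumed to return here. The function name `group_into_lines` and the `min_overlap` threshold are hypothetical.

```python
from typing import List, Tuple

# Assumed layout: (left, bottom, right, top), PDF coordinates with y pointing up.
Rect = Tuple[float, float, float, float]


def group_into_lines(rects: List[Rect], min_overlap: float = 0.5) -> List[List[Rect]]:
    """Group rectangles whose vertical extents overlap into lines, top to bottom."""
    lines: List[List[Rect]] = []
    # Visit rectangles from the top of the page downwards.
    for rect in sorted(rects, key=lambda r: -r[3]):
        _, bottom, _, top = rect
        for line in lines:
            line_bottom = min(r[1] for r in line)
            line_top = max(r[3] for r in line)
            overlap = min(top, line_top) - max(bottom, line_bottom)
            # Join the line if enough of this rectangle's height overlaps it.
            if overlap >= min_overlap * (top - bottom):
                line.append(rect)
                break
        else:
            # No existing line matched, so this rectangle starts a new one.
            lines.append([rect])
    # Within each line, order rectangles left to right.
    return [sorted(line, key=lambda r: r[0]) for line in lines]


rects = [(100, 700, 150, 712), (10, 700, 90, 712), (10, 680, 120, 692)]
lines = group_into_lines(rects)
print(lines)
```

The union of each line's rectangles could then be passed to `get_text_bounded()`, and adjacent lines could be merged into blocks with a similar distance check.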

geisserml

Apr 24, 2023, 3:26:02 PM
to pdfium

Frozen Forest

Apr 24, 2023, 11:48:28 PM
to pdfium
If you want to extract all text from all pages, this code worked as expected. I know that it is C++ (in fact it is my Unreal Engine plugin's code), but maybe it helps you create a similar approach.

bool UPDF_ReaderBPLibrary::PDF_Get_Texts(TArray<FString>& Out_Texts, UPARAM(ref)UPDFiumDoc*& In_PDF)
{
    if (Global_bIsLibInitialized == false)
    {
        return false;
    }

    if (IsValid(In_PDF) == false)
    {
        return false;
    }

    if (!In_PDF->Document)
    {
        return false;
    }

    unsigned short* CharBuffer = (unsigned short*)malloc(0x2000 * sizeof(unsigned short));
    for (int32 PageIndex = 0; PageIndex < FPDF_GetPageCount(In_PDF->Document); PageIndex++)
    {
        FPDF_PAGE PDF_Page = FPDF_LoadPage(In_PDF->Document, PageIndex);
        FPDF_TEXTPAGE PDF_TextPage = FPDFText_LoadPage(PDF_Page);
        FPDF_ClosePage(PDF_Page);

        FString PageText;
        int CharCount = FPDFText_CountChars(PDF_TextPage);
        for (int32 CharIndex = 0; CharIndex < CharCount; CharIndex++)
        {
            FPDFText_GetText(PDF_TextPage, CharIndex, CharCount, CharBuffer);
            PageText = PageText + (char*)CharBuffer;
        }

        Out_Texts.Add(PageText);
        FPDFText_ClosePage(PDF_TextPage);
    }

    return true;
}

On Monday, April 24, 2023 at 10:26:02 PM UTC+3 geisserml wrote:

geisserml

Apr 25, 2023, 8:07:04 AM
to pdfium
This isn't really related to the reporter's questions (I suggest that you read [1]).

Apart from that, I'm very confused by your code.
a) The innermost for-loop seems to be gravely wrong.
b) I don't think you should close the FPDF_PAGE before the FPDF_TEXTPAGE.
c) A static size buffer is inflexible and may truncate text if not large enough.

Did you test your code??

[1] https://bugs.chromium.org/p/pdfium/issues/detail?id=2025#c3

Frozen Forest

Apr 25, 2023, 9:12:59 AM
to pdfium
B and C) Yes, I tested it, and I use the same workflow for FPDFLink_GetURL() and FPDFText_GetBoundedText() without a problem. If you are an Unreal Engine user, I can send you my plugin.
A) I know that the inner loop seems odd, but if I don't use it I just get the first character (or whatever the index points to) of the target page.

As for the static size buffer, 0x2000 * sizeof(unsigned short) equals 16384 bytes. I thought it would be enough for most cases, but I changed it after your comment. I moved it inside the first for loop and free it when I'm done.
unsigned short* CharBuffer = (unsigned short*)malloc(static_cast<size_t>(CharCount) * 4);
I used 4 because UTF8 char units take 1 to 4 bytes.

If you have better suggestions, I will be happy.
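
One possible explanation for point A, offered as an assumption rather than something confirmed in the thread: if the buffer holds UTF-16LE text, every ASCII character is followed by a zero byte, so reading the buffer as a NUL-terminated `char*` string stops after the first character. The sketch below reproduces that effect with `ctypes`; no pdfium call is involved, and the sample text is made up.

```python
import ctypes

text = "Hello"
# UTF-16LE: two bytes per code unit, followed here by a two-byte terminator.
raw = text.encode("utf-16-le") + b"\x00\x00"
buf = ctypes.create_string_buffer(raw, len(raw))

# `.value` reads the buffer like a C `char*`: it stops at the first zero byte,
# which is the high byte of the first ASCII character.
as_c_string = buf.value

# Decoding the same bytes as UTF-16LE recovers the whole text.
as_utf16 = raw[:-2].decode("utf-16-le")

print(as_c_string, as_utf16)
```

Under that assumption, the inner loop would append one character per call, which would explain why it appears necessary.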
On Tuesday, April 25, 2023 at 3:07:04 PM UTC+3 geisserml wrote:

geisserml

Apr 25, 2023, 1:05:33 PM
to pdfium
a) That's definitely not the expected behavior and I still think there's something seriously wrong with your code. (A pdfium dev could take a look to confirm.)
b) Even if it seems to work, I think it's recommended to close objects in reverse order to opening.
c) Your new approach looks better, but may still allocate more memory than actually necessary. E. g. if your string consists of ASCII chars only, you allocate 6 bytes too much per character.
     It's not explicitly mentioned in the docs and I haven't tested yet, but possibly one may call FPDFText_GetText() with result=NULL first to just get the required buffer size.
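
To illustrate point b), here is a small sketch of closing resources in reverse order of opening. It uses Python's `contextlib.ExitStack`, which runs registered callbacks last-in, first-out. The strings `"page"` and `"text_page"` are hypothetical stand-ins for the handles returned by `FPDF_LoadPage()` and `FPDFText_LoadPage()`; no pdfium call is made.

```python
from contextlib import ExitStack

closed = []  # records the order in which the stand-in handles are closed

with ExitStack() as stack:
    page = "page"  # stand-in for an FPDF_PAGE handle
    stack.callback(closed.append, page)       # registered first, runs last
    text_page = "text_page"  # stand-in for an FPDF_TEXTPAGE handle
    stack.callback(closed.append, text_page)  # registered last, runs first

print(closed)  # ['text_page', 'page']
```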

geisserml

Apr 25, 2023, 1:08:32 PM
to pdfium
c) correcting myself: 3, not 6 (too tired, sorry)

Frozen Forest

Apr 25, 2023, 1:52:41 PM
to pdfium
Yes, a way to get the exact buffer size would be a much better approach than guessing it. I don't claim my approach was perfect.

I tried to test the inner loop again. If GetText() works as expected (actually, we pass the first char index and the char count up front), then maybe the conversion from the unsigned short pointer to FString (via a const char pointer, or std::string with c_str()) is wrong.

Also, is there any documentation? For example, when I try to load external fonts, the created PDF is reported as broken.
On Tuesday, April 25, 2023 at 8:08:32 PM UTC+3 geisserml wrote:

geisserml

Apr 25, 2023, 3:34:26 PM
to pdfium
Hmm, unlike other APIs, FPDFText_GetText() doesn't seem to like being called with a NULL buffer first. That results in a segfault for me.
So with that limitation, allocating 4 BPC sounds like the right thing to do as caller.

Jeroen Bobbeldijk

Apr 25, 2023, 3:39:43 PM
to pdfium
This might help: https://github.com/klippa-app/go-pdfium/blob/c2eecc81ac2a4ac784347438ad0f94d7dd8ec8b4/internal/implementation_cgo/text.go#L53
So:
 - No, you don't need the inner loop, you can get all the chars at once
 - Yes, you can know the max buffer size, it's UTF16-LE, which is max 2 bytes per char, and you need to add 1 char for the NULL terminator, so (CharCount+1)*2
 - FPDFText_GetText returns the amount of chars written to the buffer, you can then use that to slice your buffer to the correct size and convert the UTF16-LE to UTF-8
 - It looks like Unity FString can be created from WIDECHAR (UTF16-LE), so you might just need something like Out_Texts.Add(FString((WIDECHAR*)CharBuffer)); (I don't really know C/C++, so it's probably not fully correct.)

For documentation you can mostly rely on the function docs in the header files, it's not much but mostly is sufficient.
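
The single-call approach from the list above can be sketched as follows. This does not call pdfium: `fake_get_text` is a hypothetical stand-in for `FPDFText_GetText()`, assumed to write UTF-16LE code units plus a terminator and to return the number of units written including that terminator. `len(source)` stands in for `FPDFText_CountChars()`, which only matches for text without surrogate pairs.

```python
import ctypes


def fake_get_text(source: str, start: int, count: int, buffer) -> int:
    """Hypothetical stand-in for FPDFText_GetText(): fills `buffer` with
    UTF-16LE code units plus a terminator and returns the unit count."""
    units = source[start:start + count].encode("utf-16-le") + b"\x00\x00"
    if len(units) > ctypes.sizeof(buffer):
        raise ValueError("buffer too small")
    ctypes.memmove(buffer, units, len(units))
    return len(units) // 2  # terminator included


def extract_text(source: str) -> str:
    char_count = len(source)  # stands in for FPDFText_CountChars()
    # (CharCount + 1) * 2 bytes: two bytes per unit plus the terminator.
    buffer = ctypes.create_string_buffer((char_count + 1) * 2)
    # One call for the whole page, no per-character loop.
    written = fake_get_text(source, 0, char_count, buffer)
    # Slice off the terminator, then decode the UTF-16LE bytes.
    return buffer.raw[:(written - 1) * 2].decode("utf-16-le")


print(extract_text("Hello, pdfium"))
```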

On Tuesday, April 25, 2023 at 7:52:41 PM UTC+2 frozenfor...@gmail.com wrote:

geisserml

Apr 25, 2023, 3:40:23 PM
to pdfium
Oh, and I forgot, this isn't UTF-8, but UTF-16(LE), where 1 char unit takes 2 or 4 bytes, so we're allocating max. 2 bytes too much per character. Sorry for the confusion again.
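
To make the byte counts discussed here concrete, this sketch compares how many bytes a single character needs in UTF-8 and in UTF-16LE. It only uses Python's built-in codecs; the sample characters are arbitrary choices.

```python
samples = {
    "ASCII": "A",
    "Latin-1": "\u00e9",        # e with acute accent
    "CJK": "\u4e2d",            # a Chinese character inside the BMP
    "non-BMP": "\U0001F600",    # an emoji outside the BMP
}

# Map each label to (bytes in UTF-8, bytes in UTF-16LE).
sizes = {
    label: (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
    for label, ch in samples.items()
}

for label, (utf8_bytes, utf16_bytes) in sizes.items():
    print(f"{label}: UTF-8 {utf8_bytes} bytes, UTF-16LE {utf16_bytes} bytes")
```

Only the character outside the Basic Multilingual Plane needs 4 bytes in UTF-16, because it is encoded as a surrogate pair of two 2-byte code units.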

geisserml

Apr 25, 2023, 3:41:26 PM
to pdfium
(Note for readers: I posted before seeing @jerb...'s message)

Frozen Forest

Apr 25, 2023, 3:44:49 PM
to pdfium
My engine is not Unity. I am using Unreal Engine, and it is fully C/C++. I will try again with your Out_Texts.Add(FString((WIDECHAR*)CharBuffer)) suggestion and share the result. The engine shouldn't matter for char types, but I mention it just in case.
On Tuesday, April 25, 2023 at 10:39:43 PM UTC+3 jerb...@gmail.com wrote:

Jeroen Bobbeldijk

Apr 25, 2023, 3:46:32 PM
to pdfium
@geisserml, UTF-16 can be 4 bytes too? Then my code would be incorrect too, I haven't seen that happen, if that's the case then it would indeed be impossible to figure out the buffer size (and I would need to change some code lol).

Jeroen Bobbeldijk

Apr 25, 2023, 3:47:02 PM
to pdfium
Sorry, I mixed up Unity and Unreal :) But the code applies to Unreal.

Jeroen Bobbeldijk

Apr 25, 2023, 3:52:12 PM
to pdfium
Pdfium source says:

  // UFT16LE_Encode doesn't handle surrogate pairs properly, so it is expected
  // the number of items to stay the same.
  ByteString byte_str = str.ToUTF16LE();
  size_t byte_str_len = byte_str.GetLength();
  size_t ret_count = byte_str_len / kBytesPerCharacter;

So now it's completely unclear to me whether the required buffer size could be more than (CharCount+1)*2, maybe someone from pdfium could clarify.

Frozen Forest

Apr 25, 2023, 3:54:47 PM
to pdfium
Nope, it doesn't work.
I mean, it doesn't crash, but the text comes out looking like Chinese characters.

PDF is this.

Result is this. Each red text line is for a single page.
The bottom one is for the second page,
the one above it is for the third, and so on.
Screenshot (344).png
On Tuesday, April 25, 2023 at 10:47:02 PM UTC+3 jerb...@gmail.com wrote:

Frozen Forest

Apr 25, 2023, 3:59:35 PM
to pdfium
For game engine cases, buffer size shouldn't matter, because we mostly call this function only once and target computers have really powerful hardware. As I said before, even 8192 * 2 bytes (2 being the size of an unsigned short) is enough for 4096 chars per page in the worst case. Does a PDF have 4096 chars per page?

The problem is converting the buffer to an FString, a const char pointer, or a std::string without a secondary loop.

On Tuesday, April 25, 2023 at 10:54:47 PM UTC+3 Frozen Forest wrote:

Jeroen Bobbeldijk

Apr 25, 2023, 4:04:52 PM
to pdfium
I don't know your exact code, but FString can be constructed from the original format just fine if you look at the different constructors: https://docs.unrealengine.com/5.1/en-US/API/Runtime/Core/Containers/FString/
Perhaps PageText.AppendChars((WIDECHAR*)CharBuffer, CharCount) works better. I can't really help you with the exact implementation, but the loop over CharCount is definitely not needed.

Frozen Forest

Apr 25, 2023, 4:35:49 PM
to pdfium
Nope, that didn't work either.

I'm not really attached to secondary for-loops :D I couldn't find any documentation, and your suggestions don't work. I know that it "can" be done without a secondary loop, but we (you, too) don't know that solution. As long as I don't get any crash and there is no perceivable impact on the user's performance, I won't stop using the full feature, and I can't say my code is a complete mess. For bigger PDF files, I can just write async code. Again, optimization is not our ultimate target.

This is the whole code with your suggestion.

for (int32 PageIndex = 0; PageIndex < FPDF_GetPageCount(In_PDF->Document); PageIndex++)
{
    FPDF_PAGE PDF_Page = FPDF_LoadPage(In_PDF->Document, PageIndex);
    FPDF_TEXTPAGE PDF_TextPage = FPDFText_LoadPage(PDF_Page);

    int CharCount = FPDFText_CountChars(PDF_TextPage);
    //unsigned short* CharBuffer = (unsigned short*)malloc(0x2000 * sizeof(unsigned short));
    unsigned short* CharBuffer = (unsigned short*)malloc((static_cast<size_t>(CharCount) + 1) * 2);
    Out_Texts.Add(FString((WIDECHAR*)CharBuffer));
    FPDFText_ClosePage(PDF_TextPage);
    FPDF_ClosePage(PDF_Page);
    free(CharBuffer);
}

return true;

On Tuesday, April 25, 2023 at 11:04:52 PM UTC+3 jerb...@gmail.com wrote:

Jeroen Bobbeldijk

Apr 25, 2023, 4:41:30 PM
to pdfium
Well the main issue here is that you're not calling FPDFText_GetText anymore, so it makes sense that it doesn't work ;)

I get that you got it working with the secondary loop, but it feels like it worked by luck rather than because you actually know what's going on.
So yes, I get what you're saying: it works, and it doesn't matter a bit that you're calling FPDFText_GetText way too often or allocating too much memory. But in my opinion you should know why the code behaves the way it does if you ever want to fix bugs in it.

Frozen Forest

Apr 25, 2023, 4:52:47 PM
to pdfium
you can call me moron :D

On Tuesday, April 25, 2023 at 11:41:30 PM UTC+3 jerb...@gmail.com wrote:

Frozen Forest

Apr 25, 2023, 5:14:01 PM
to pdfium
Wow. So these are the correct versions for web links and text selection. I am both grateful and embarrassed :D

FPDF_PAGELINK PDF_Links = FPDFLink_LoadWebLinks(PDF_TextPage);
int32 Links_Count = FPDFLink_CountWebLinks(PDF_Links);

if (Links_Count == 0)
{
    FPDFLink_CloseWebLinks(PDF_Links);
    FPDFText_ClosePage(PDF_TextPage);

    return false;
}

for (int32 Index_Link = 0; Index_Link < Links_Count; Index_Link++)
{
    int CharLenght = FPDFLink_GetURL(PDF_Links, Index_Link, NULL, NULL);
    unsigned short* CharBuffer = (unsigned short*)malloc(CharLenght);
    FPDFLink_GetURL(PDF_Links, Index_Link, CharBuffer, CharLenght);

    FString LinkText;
    LinkText.AppendChars((WIDECHAR*)CharBuffer, CharLenght);

    Out_Links.Add(LinkText);
    free(CharBuffer);
}

FPDFLink_CloseWebLinks(PDF_Links);
FPDFText_ClosePage(PDF_TextPage);
FPDF_ClosePage(PDF_Page);

FPDF_PAGE PDF_Page = FPDF_LoadPage(In_PDF->Document, PageIndex);
FPDF_TEXTPAGE PDF_TextPage = FPDFText_LoadPage(PDF_Page);

int CharLenght = FPDFText_GetBoundedText(PDF_TextPage, Start.X, Start.Y, End.X, End.Y, NULL, NULL);
unsigned short* CharBuffer = (unsigned short*)malloc(CharLenght);
FPDFText_GetBoundedText(PDF_TextPage, Start.X, Start.Y, End.X, End.Y, CharBuffer, CharLenght);

FString SelectedText;
SelectedText.AppendChars((WIDECHAR*)CharBuffer, CharLenght);

Out_Text = SelectedText;

FPDFText_ClosePage(PDF_TextPage);
FPDF_ClosePage(PDF_Page);
free(CharBuffer);

On Tuesday, April 25, 2023 at 11:52:47 PM UTC+3 Frozen Forest wrote:

Jeroen Bobbeldijk

Apr 25, 2023, 5:25:08 PM
to pdfium
Good to hear! Don't be embarrassed, we all need to learn somehow.
You might be writing out of memory bounds now, though: you malloc CharLenght bytes, but since a character is 2 bytes, you need to allocate CharLenght * 2.

Frozen Forest

Apr 25, 2023, 6:19:26 PM
to pdfium
Okay, thanks!

Other than that, I have two problems plus one more.
One of them is not directly about pdfium: for some reason the latest libpng breaks the iCCP chunks of bitmaps rendered with FPDF_RenderPageBitmap(). As a result, engines like Unreal crash on Android when they see a warning about that chunk.
Exactly the same code works with a pdfium build from 2018 on Android, and the latest version works on Windows.

The other two are loading external fonts and loading images.
TArray<uint8> Array_Bytes;
FFileHelper::LoadFileToArray(Array_Bytes, *Path);
FPDF_FONT Font = FPDFText_LoadFont(In_PDF->Document, Array_Bytes.GetData(), Array_Bytes.GetAllocatedSize(), FontType, bIsCid);

GetData() returns the array as a uint8 pointer, and this breaks the PDF. When I open it, it says the PDF is broken.

////

And this is my image loading code. The top function is for loading a JPEG file.
The bottom function is for loading a BGRA texture from memory.
Neither of them adds anything to the texture.


Also, there is an active bug with a "won't fix" status, like this.
On Wednesday, April 26, 2023 at 12:25:08 AM UTC+3 jerb...@gmail.com wrote:

Jeroen Bobbeldijk

Apr 26, 2023, 2:13:21 AM
to pdfium

Let's pick this up in other threads, I feel like we kinda took over the initial post, sorry for that OP :)

geisserml

Apr 26, 2023, 8:34:27 AM
to pdfium
Thanks @jerbob for helping @frozenforest write their code.

Responding to comments [1] and [2], which were somewhat addressed to me:
a) Yes, I think in UTF-16 a character may also take 4 bytes ("surrogate pair").
b) Agreed it would be good if a pdfium dev could clarify the situation. It kind of sounds like pdfium currently does not handle surrogate pairs (properly)? At least this API doesn't seem to have been designed with that in mind. (FWIW, in our code, we also just allocate 2 BPC as generally assuming 4 BPC would unnecessarily consume memory.)

[1] https://groups.google.com/g/pdfium/c/HwqzzGWWXVU/m/HDK5xfL7AgAJ
[2] https://groups.google.com/g/pdfium/c/HwqzzGWWXVU/m/Rp1E2kH8AgAJ

geisserml

Apr 26, 2023, 8:44:26 AM
to pdfium
@frozenforest: Image insertion (as described in Bug 656) works for me. If you're experiencing problems with that, I guess that'll be an issue with your own code again.

Frozen Forest

Apr 26, 2023, 4:46:36 PM
to pdfium
@geisserml did you try LoadFile or just image insertion? I ask because I can insert a bitmap that contains a blue rectangle into the PDF, but I can't import a JPEG with LoadFile. If you can share some sample code for it, I would appreciate it.

On Wednesday, April 26, 2023 at 3:44:26 PM UTC+3 geisserml wrote:

geisserml

Apr 26, 2023, 5:19:42 PM
to pdfium
In pypdfium2, we use both. See [1] for an FPDFImageObj_LoadJpegFile{Inline}() example.

[1] https://github.com/pypdfium2-team/pypdfium2/blob/0c1a20de1c7962164afd59383ac7d2c99a8f07a4/src/pypdfium2/_helpers/pageobjects.py#L194

Jeroen Bobbeldijk

Apr 26, 2023, 5:37:46 PM
to pdfium
@geisserml, it sounds to me that pdfium doesn't recognize surrogate pairs and that it will just be handled as 2 chars in both FPDFText_CountChars and FPDFText_GetText, which might be for the better.

geisserml

Apr 26, 2023, 6:20:26 PM
to pdfium
@jerbob OK. If that's the case, the caller should effectively end up with the correct result, right?
However, it sounds like single-char APIs such as FPDFText_GetUnicode() then have a problem.
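
If a single-char API did hand back the two halves of a surrogate pair separately, as speculated above and not confirmed for pdfium, a caller could recombine them itself. The sketch below shows the standard UTF-16 arithmetic on a plain list of code units; the helper name `combine_surrogates` is hypothetical and nothing here calls pdfium.

```python
from typing import List, Sequence


def combine_surrogates(units: Sequence[int]) -> List[int]:
    """Turn a sequence of UTF-16 code units into Unicode code points."""
    code_points: List[int] = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # High surrogate followed by low surrogate: one code point above U+FFFF.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP character (or a lone surrogate, passed through unchanged).
            code_points.append(unit)
            i += 1
    return code_points


units = [0x41, 0xD83D, 0xDE00]  # "A" followed by a surrogate pair
points = combine_surrogates(units)
print([hex(p) for p in points])
```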