Is it possible to encode an image with the same content as a PDF page so that it matches the page rendered with FPDF_RenderPageBitmap?


Something Something

Sep 6, 2025, 1:30:01 AM
to pdfium
I'm trying to encode a PDF page into a numpy array using PDFium:

py::array_t<uint8_t> render_page_helper(FPDF_PAGE page, int target_width = 0, int target_height = 0, int dpi = 80) {
    int width, height;

    if (target_width > 0 && target_height > 0) {
        width = target_width;
        height = target_height;
    } else {
        // Page size is in PDF points (1/72 inch); scale to the requested DPI.
        width = static_cast<int>(FPDF_GetPageWidth(page) * dpi / 72.0);
        height = static_cast<int>(FPDF_GetPageHeight(page) * dpi / 72.0);
    }

    // Third argument enables an alpha channel, i.e. a 32-bit BGRA bitmap.
    FPDF_BITMAP bitmap = FPDFBitmap_Create(width, height, 1);
    if (!bitmap) throw std::runtime_error("Failed to create bitmap");

    // White background, then render the page into the bitmap.
    FPDFBitmap_FillRect(bitmap, 0, 0, width, height, 0xFFFFFFFF);
    FPDF_RenderPageBitmap(bitmap, page, 0, 0, width, height, 0, 0);

    const int stride = FPDFBitmap_GetStride(bitmap);
    uint8_t* buffer = static_cast<uint8_t*>(FPDFBitmap_GetBuffer(bitmap));

    // Return a numpy array with shape (height, width, 4) = BGRA.
    // Passing the strides explicitly handles any row padding; this
    // py::array_t constructor copies the data, so destroying the bitmap
    // afterwards is safe.
    auto result = py::array_t<uint8_t>(
        {height, width, 4},  // shape
        {stride, 4, 1},      // strides in bytes
        buffer);
    FPDFBitmap_Destroy(bitmap);
    return result;
}

This returns the array to Python, where I then convert it to RGB:

# Drop alpha, convert BGRA → RGB
np_arr_rgb = np_array[:, :, [2, 1, 0]]  # (H, W, 3)

However, given an image whose content is identical to the PDF page, I have no idea how to encode it so that it yields the same numpy array, since image rendering methods differ a lot from PDF rendering. Right now, I'm using stb_image to handle it:

py::array_t<uint8_t> render_image(const std::string& filename, int target_width = 224, int target_height = 224) {
    int width, height, channels;
    unsigned char* pixel_data = stbi_load(filename.c_str(), &width, &height, &channels, 0);
    if (!pixel_data) throw std::runtime_error("Failed to load image");

    // Temporary resized buffer, keeping the original channel count.
    std::vector<uint8_t> resized(static_cast<size_t>(target_width) * target_height * channels);
    int ok = stbir_resize_uint8(pixel_data, width, height, 0,
                                resized.data(), target_width, target_height, 0, channels);
    stbi_image_free(pixel_data);
    if (!ok) throw std::runtime_error("Failed to resize image");

    // Always return RGB (3 channels).
    py::array_t<uint8_t> result({target_height, target_width, 3});
    auto buf = result.mutable_unchecked<3>();

    for (int y = 0; y < target_height; ++y) {
        for (int x = 0; x < target_width; ++x) {
            int idx = (y * target_width + x) * channels;
            if (channels == 1) {
                // Grayscale: replicate the single channel into R, G and B.
                // (Reading idx + 1 / idx + 2 here would run past the buffer.)
                buf(y, x, 0) = buf(y, x, 1) = buf(y, x, 2) = resized[idx];
            } else {
                buf(y, x, 0) = resized[idx + 0]; // R
                buf(y, x, 1) = resized[idx + 1]; // G
                buf(y, x, 2) = resized[idx + 2]; // B
            }
        }
    }

    return result;
}

But this doesn't give a good enough result.
So what can I do to encode an image so that it produces the same array as a PDF page rendered with FPDF_RenderPageBitmap?
Thank you.

Something Something

Sep 6, 2025, 1:33:54 AM
to pdfium
And I forgot to mention: both the PDF files and the images are rendered at 224x224, so the dpi argument is omitted in render_page_helper.

At 12:30:01 UTC+7 on Saturday, September 6, 2025, Something Something wrote:

geisserml

Sep 6, 2025, 5:26:06 AM
to pdfium
Interesting (is that with pybind11?), but may I ask why you are doing this in the first place, instead of following the established procedures?
There are existing Python APIs, e.g. from Pillow or cv2, to load an image into a numpy array in an efficient and format-agnostic manner. You don't need to write your own C++ code for that.

Also, for PDF rendering, I could suggest pypdfium2 (ctypes-based), which has an API to get a numpy array view of a rendered bitmap (disclaimer: I'm the author).

Unfortunately I'm not a C++ developer, so I cannot help with your image question, which also isn't directly related to pdfium – you might want to ask that on e.g. Stack Overflow instead.

Something Something

Sep 6, 2025, 6:50:44 AM
to pdfium
Yes, it is with pybind11. I opted for rendering images into numpy arrays with C++ because Pillow gave similarly subpar results with slightly worse performance, and because keeping all rendering logic (both for PDFs and images) in one place makes the codebase easier to manage. Thanks for the suggestion nonetheless.

Also, thanks for suggesting pypdfium2; I'll definitely check it out, since having everything in Python would simplify setup quite a bit. That said, performance is a key concern for my use case, so I'm curious whether pypdfium2 could handle encoding around 2 million PDF files within a reasonable timeframe.

And I did ask this question on Stack Overflow, and even on Reddit, but so far no one has answered, so I figured I'd ask on a more library-focused forum.
At 16:26:06 UTC+7 on Saturday, September 6, 2025, geisserml wrote:

Something Something

Sep 7, 2025, 3:49:54 AM
to pdfium
After some fiddling around, I think I've cracked it. The results didn't match because I was rendering the PDF page at 224x224 directly to maximize performance, and rendering directly at a small target size behaves very differently from rendering at a high DPI and then downscaling, which is what I did with the images.

Delving deeper, I had three options for encoding images: Pillow, cv2, and my custom render_image() function. Of those three, cv2 gives the best result (based on Euclidean distance; the closer to zero, the better), while Pillow and render_image() are tied behind it. Furthermore, cv2 has the advantage of resizing an array directly rather than requiring a conversion to an Image first, so both the image and the PDF can go through the same resizing pipeline.

However, this method isn't perfect: the Euclidean distance between an image and a PDF with identical content can never be truly zero, only very close, and the higher the DPI chosen when rendering the PDF page, the smaller it gets. So you have to find a sweet spot between performance and accuracy.
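For reference, the comparison metric described above can be sketched as follows (a hypothetical helper; the actual arrays would come from the two rendering pipelines):

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L2 distance between two equally-shaped uint8 image arrays."""
    # Promote to float first so the uint8 subtraction cannot wrap around.
    return float(np.linalg.norm(a.astype(np.float64) - b.astype(np.float64)))

# Identical content -> distance 0; any pixel difference raises it.
a = np.zeros((2, 2, 3), dtype=np.uint8)
b = a.copy()
assert euclidean_distance(a, b) == 0.0
b[0, 0, 0] = 3
assert euclidean_distance(a, b) == 3.0
```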

If anyone has more suggestions for further improving the pipeline, I'd love to hear them.
At 17:50:44 UTC+7 on Saturday, September 6, 2025, Something Something wrote:

geisserml

Sep 7, 2025, 6:21:02 AM
to pdfium
All right! I did some reading yesterday and was going to suggest cv2, since it's built around numpy arrays and loads images as arrays directly.
Reputedly, going through Pillow is in fact less efficient due to memory-layout differences, API complications and so on, which implies memory copies, though apparently one can achieve a better interface than the libraries' built-in functions, as described in [1].
But these are only side notes, as you seem to have found this out on your own already.

As for pypdfium2 and 2 million PDF files, hmm, I never tested it on such a large scale since I don't have a use case involving so many files.
I guess the only way to find out is to try. It also depends on how good your integration is (e.g. proper use of multiprocessing and avoiding caller-side copies), what you consider a 'reasonable timeframe', and what hardware you have.
pypdfium2 uses ctypes ABI/FFI bindings, which are said to be somewhat less efficient than API-mode bindings, but I've never benchmarked the difference, mainly for lack of an API-based binding to compare against. This might not be overly relevant, though, at least for rendering, given that pypdfium2 only calls a few high-level APIs and most time is spent in pdfium itself. However, I imagine it might be a different story for tasks that involve many API calls, such as chars, pageobjects or outline.

Regarding improvements to your pipeline, it's difficult to say without knowing the full use case.
Providing a broader picture of the task or end result you're actually trying to achieve, and perhaps sharing more code, might be needed to get more advice.
Admittedly, some bits appear rather confusing to me, e.g. the fixed size of 224x224, or why you are so intent on pixel-identical image and PDF renderings.

Here are some hints though:
- Your code above suggests (by its comments) that it would memcpy the rendered buffer. I'd recommend using FPDFBitmap_CreateEx() to render into the target buffer directly, or else taking care that the numpy array is constructed in a zero-copy fashion. (For reference, pypdfium2's PdfBitmap.to_numpy() API should be zero-copy.)
- You've indicated that you render BGRA but convert to RGB on the caller side. Note that pdfium can also output RGB directly by using FPDFBitmap_CreateEx() with FPDFBitmap_BGR and rendering with FPDF_REVERSE_BYTE_ORDER. Though with the caveat that FPDFBitmap_BGR should not be used where FPDFPage_HasTransparency() returns true, due to [2].
- I'm not super familiar with numpy, but I would try to reshape the view of the buffer instead of creating a copy wherever possible. E.g. in the transparency case, I think you could convert from RGBA to RGB merely by setting the right shape and strides: shape = (height, width, 3) but strides = (width*4 + maybe_padding, 4, 1), with the strides given in bytes. Strides are a pretty powerful feature in numpy.
- I think you've made an excellent point with "render at target DPI" vs "decode at full resolution, then downscale". Again, I'm not much into that topic, but I'd try to figure out whether there are means to decode the image at a lower resolution directly, without first decoding the full resolution, at least for some formats.
- Try to exploit any existing, mature Python APIs before resorting to custom binary extensions. In particular, a task like reading an image into a numpy array is something needed by many callers, so I felt certain you'd find a good existing API somewhere.
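The strides idea in the list above can be sketched in numpy (hypothetical dimensions; `raw` stands in for a rendered RGBA buffer whose rows are padded):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Hypothetical bitmap geometry: 3x4 RGBA pixels, rows padded to 20 bytes.
height, width, stride = 3, 4, 20
raw = np.arange(height * stride, dtype=np.uint8)  # stand-in pixel buffer

# Zero-copy RGB view: 3 channels per pixel, but the strides still step
# 4 bytes per pixel and `stride` bytes per row (alpha and padding skipped).
rgb = as_strided(raw, shape=(height, width, 3), strides=(stride, 4, 1))

assert rgb.base is not None         # a view, not a copy
assert rgb[1, 0, 0] == raw[stride]  # row 1 starts `stride` bytes in
assert rgb[0, 1, 0] == raw[4]       # pixel 1 starts 4 bytes in
```

Note that `as_strided` trusts the caller: the shape and strides must actually fit inside the buffer, or reads go out of bounds.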

That's all for now, sorry this got so long. I hope at least part of this information is of some use to you.

geisserml

Sep 7, 2025, 6:28:40 AM
to pdfium
Oh, and in case you are dealing with scanned documents, whose pages are effectively images embedded in a PDF container, you might want to extract the images and use the image pipeline.
Or even vice versa: convert the images to PDF using a lossless tool like josch's img2pdf, and use the PDF pipeline for everything.

Something Something

Sep 9, 2025, 11:05:16 AM
to pdfium
Hi, sorry for the late response. I've taken your advice, and it really does wonders for performance. Especially FPDFBitmap_CreateEx(), where I only need to render in RGB, removing the need to convert the array afterwards.

Regarding what I said about a 'reasonable timeframe' for rendering a large set of files: I was aiming for maybe under 3 hours for 1 million files, using consumer-grade hardware of the kind found in a typical office. I don't know whether that sounds unreasonable, because I haven't tested at that magnitude yet, but based on data collected from a smaller sample with my laptop as the test environment, I think it's doable.

About the confusing bits: the render results are later fed into a pretrained feature-extractor model to obtain feature vectors, which are in turn fed into an indexing engine to build a search engine. That explains why I need the size to be 224x224, since it's a commonly accepted input size for pretrained models. Moreover, the need for pixel-identical image and PDF renderings stems from the same purpose: a large gap in similarity between two files that share the same content may throw off the indexing engine, squeezing in visually unrelated items, which can cause inaccuracy in the queried results later on.
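To illustrate the concern with a toy sketch (made-up dimensions and random vectors, not the actual model or index): as long as two renderings of the same content produce nearby feature vectors, a nearest-neighbour lookup still retrieves the right item.

```python
import numpy as np

# Toy setup: 100 stored feature vectors of dimension 8 (made-up numbers).
rng = np.random.default_rng(0)
index = rng.normal(size=(100, 8))

# A query whose rendering differs only slightly from stored item 42:
query = index[42] + 0.01 * rng.normal(size=8)

# Brute-force nearest neighbour by Euclidean distance.
nearest = int(np.argmin(np.linalg.norm(index - query, axis=1)))
assert nearest == 42  # the small rendering gap keeps retrieval correct
```

A large rendering gap would push the query vector away from its stored counterpart, which is exactly the failure mode described above.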

Also, thanks a lot for recommending extracting images from documents to use the image pipeline, and vice versa. However, that method is less suitable for our use case due to two key limitations. First, the app is geared more towards digital drawings and engineering sketches created in vector format, so the image-extraction method wouldn't really work. Second, converting the images into PDF files doesn't work either: from what I understand, PDFium uses multiple renderers for different parts of a page, which is the core reason I don't think pixel perfection between the two formats is possible. Feeding it a page containing an image of a wall of text will invoke the image renderer rather than the text renderer, so the output differs from that of a page of pure text.

Lastly, about my claim that the higher the DPI, the closer the distance between the files: I've done some experiments, and here is my report on the relation between DPI and the Euclidean distance of the files, done on 2 samples:

[Attachment: relation_dpi_dist.JPG – plot of DPI vs. Euclidean distance for the two samples]
I hope this finds whoever needs it. And again, thank you for your response.
At 17:28:40 UTC+7 on Sunday, September 7, 2025, geisserml wrote: