All right! I'd done some reading yesterday and was going to suggest cv2 since it's built all around numpy arrays, and loads images as arrays directly.
Reputedly, going through Pillow is less efficient (or not efficient at all) because of memory layout differences, API complications and so on, which imply memory copies. Apparently one can build a better interface than the libraries' built-in functions, though, as described in [1].
But those are only side notes, as you seem to have found this out on your own already.
As for pypdfium2 and 2 million PDF files, hmm, I never tested it on such a large scale since I don't have a use case involving so many files.
I guess the only way to find out is to try. It also depends on how good your integration is (e.g. proper use of multiprocessing and avoiding caller-side copies), what you consider a 'reasonable timeframe', and what hardware you have.
pypdfium2 uses ctypes ABI/FFI bindings, which are said to be somewhat less efficient than API mode bindings, but I never benchmarked the difference, mainly for lack of an API-based binding to compare to. This might not be overly relevant though, at least for rendering, given that pypdfium2 only calls a few high-level APIs and most time is spent in pdfium itself. However, I imagine it might be a different story with tasks that involve many API calls, like chars, pageobjects or outline.
Regarding improvements to your pipeline, it's a bit difficult to give advice without knowing the full use case.
Providing a broader picture of what task or end result you're actually trying to achieve, and maybe sharing more code, might be needed to get more advice.
Admittedly, some bits appear rather confusing to me, e.g. the fixed size of 224x224, or why you are so intent on pixel-identical image and PDF renderings.
Here are some hints though:
- Your above code suggests (by comment) that it would memcpy the rendered buffer. I'd recommend using FPDFBitmap_CreateEx() to render into the target buffer directly, or else take care that the numpy array is constructed in a zero-copy fashion. (For reference, pypdfium2's PdfBitmap.to_numpy() API should be zero-copy.)
- You've indicated that you render BGRA but convert to RGB on the caller side. Note that pdfium can also output RGB directly by using FPDFBitmap_CreateEx() with FPDFBitmap_BGR and rendering with FPDF_REVERSE_BYTE_ORDER. The caveat is that FPDFBitmap_BGR should not be used where FPDFPage_HasTransparency() returns true, due to [2].
- I'm not super familiar with numpy, but I would try to reshape the view of the buffer instead of creating a copy wherever that's possible. For example, I think in the transparency case you could convert from RGBA to RGB merely by setting the right shape and strides: shape = (height, width, 3) but strides = (width*4 + maybe_padding, 4, 1), i.e. the row stride in bytes, 4 bytes per pixel, and 1 byte per channel. Strides are a pretty powerful feature in numpy.
- I think you've made an excellent point with "render at target DPI" vs "decode at full resolution but downscale". Again, I'm not much into that topic, but I'd try to figure out whether there is a way to load the image at a lower resolution directly, without first decoding the full resolution, at least for some formats.
- Try to exploit any existing, mature Python APIs before resorting to custom binary extensions. In particular, a task like reading an image into a numpy array is something needed by many callers, so I felt certain you'd find a good existing API somewhere.
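To illustrate the shape/strides hint, here is a minimal sketch using only ctypes and numpy. The buffer is a made-up stand-in for a rendered bitmap (the sizes, the padding and the pixel values are my assumptions, not taken from your code), so treat it as a demonstration of the idea rather than a drop-in for your pipeline:

```python
import ctypes
import numpy as np

# Hypothetical bitmap geometry: 4 bytes per pixel, rows padded to 24 bytes.
width, height = 5, 3
stride = width * 4 + 4  # row stride in bytes, including 4 bytes of padding

# Stand-in for a buffer that a renderer would have filled (here: RGBA).
buffer = (ctypes.c_ubyte * (stride * height))()
for y in range(height):
    for x in range(width):
        base = y * stride + x * 4
        buffer[base + 0] = 10 + x   # R
        buffer[base + 1] = 20 + y   # G
        buffer[base + 2] = 30       # B
        buffer[base + 3] = 255      # A

# Zero-copy view that exposes only the first 3 channels of each pixel:
# step `stride` bytes per row, 4 bytes per pixel, 1 byte per channel.
rgb = np.ndarray(
    shape=(height, width, 3),
    dtype=np.uint8,
    buffer=buffer,
    strides=(stride, 4, 1),
)

# The same trick via slicing: a 4-channel view, then a reversed channel slice.
# If the bytes were BGRA, this would yield RGB without copying.
four = np.ndarray(
    shape=(height, width, 4),
    dtype=np.uint8,
    buffer=buffer,
    strides=(stride, 4, 1),
)
reversed_channels = four[..., 2::-1]

# Both are views on the original memory, not copies.
shares_memory = np.shares_memory(rgb, four)
```

Note that such a view is not C-contiguous, so any consumer that requires contiguous memory (or an explicit np.ascontiguousarray() call) will still end up copying. Whether the view actually saves anything depends on what you do with the array afterwards.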
That's all for now, sorry this got so long. I hope at least part of this information is of some use to you.