Q:
How can I determine whether or not a PDF is a scanned image, and thus contains no selectable text?
A:
In general, each page of a scanned PDF contains a single large image.
Sometimes PDF creators will run scanned images through OCR and overlay
invisible, selectable text over the image, so a scanned page can still
contain selectable text.
The easiest way to determine whether a PDF page contains any selectable text is to run TextExtractor (
https://www.pdftron.com/pdfnet/samplecode.html#TextExtract)
over the page and see what it finds. This is probably the simplest
solution for your use case, since it sounds like you're only
interested in whether the page contains selectable text.
An alternative method, one that would let you be more sure that this is
a scanned PDF and not just a page without text, would be to use
ElementReader (
https://www.pdftron.com/pdfnet/samplecode.html#ElementReaderAdv).
You could use ElementReader to check for text elements on the page.
Additionally, you could check whether the page contains a single image.
You could also check whether that image is monochrome, which is common
for scanned PDFs.
(See also:
https://groups.google.com/d/msg/pdfnet-sdk/Wq_aDhzRYQw/qk8-7EgI2ZIJ).
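
To make the checks above concrete, here is a minimal sketch of just the decision logic in Python. It does not call PDFNet; it assumes you have already gathered, per page, whether any text was found (for example with TextExtractor or ElementReader), how many images the page contains, and the bit depth and channel count of the image. The names classify_page and is_monochrome and the returned labels are hypothetical, and "monochrome" is taken here to mean a 1-bit, single-channel image, which is an assumption rather than anything defined by the SDK.

```python
def is_monochrome(bits_per_component: int, components: int) -> bool:
    """Assumption: treat a 1-bit, single-channel image as monochrome."""
    return bits_per_component == 1 and components == 1


def classify_page(has_text: bool, image_count: int,
                  single_image_is_monochrome: bool = False) -> str:
    """Heuristic label for one page, based on counts gathered elsewhere.

    has_text: True if any text was found on the page.
    image_count: number of image elements found on the page.
    single_image_is_monochrome: only meaningful when image_count == 1.
    """
    if has_text:
        # Could be ordinary text, or OCR text overlaid on a scan.
        return "has selectable text"
    if image_count == 1:
        # No text and exactly one image: the typical scanned-page layout.
        if single_image_is_monochrome:
            return "likely scanned (monochrome image)"
        return "likely scanned"
    # No text, and zero or several images: not enough evidence either way.
    return "no text, not a single-image page"


if __name__ == "__main__":
    mono = is_monochrome(bits_per_component=1, components=1)
    print(classify_page(has_text=False, image_count=1,
                        single_image_is_monochrome=mono))
```

These are heuristics only: a page with OCR text layered over a scan falls into the first branch, so if you need to separate that case from ordinary text pages you would have to inspect the text further.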