image reading

Aniket Gadge

Apr 3, 2021, 6:30:35 PM

to django...@googlegroups.com

How to read the test from image and pdf.

Kasper Laudrup

Apr 3, 2021, 6:49:27 PM

to django...@googlegroups.com

On 03/04/2021 07.54, Aniket Gadge wrote:
> How to read the test from image and pdf.

Something like this should do it (in bash though and untested):

$ ls foo.{pdf,img} | xargs grep "the test" || echo "Unable to read 'the test'"

Kind regards,

Kasper Laudrup

Ryan Nowakowski

Apr 7, 2021, 12:39:44 AM

to django...@googlegroups.com

I'm going to assume you mean "text" here and not "test".

For recognizing text in images, I've used the Tesseract project with pretty good success. For more information about this, you can Google "OCR" or "optical character recognition".
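As a rough sketch of that OCR step, the Tesseract command-line tool can be driven from Python using only the standard library. This assumes the `tesseract` binary is installed and on the PATH; the function names and the `scan.png` file name are made up for illustration.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for the tesseract CLI.

    Passing "stdout" as the output base makes tesseract print the
    recognized text instead of writing it to a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on an image file and return the recognized text.

    Assumption: the tesseract binary is installed and on the PATH.
    Raises subprocess.CalledProcessError if tesseract exits with an error.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `ocr_image("scan.png")` would then return whatever text Tesseract recognized as a plain string. How accurate that is depends a lot on the image quality.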

For parsing text in a PDF, it depends on how the text is encoded in the PDF. You can tell by trying to copy and paste the text manually when the PDF is open in your PDF reader. If you can't copy and paste the text, that means it's probably embedded in an image inside the PDF. In that case, use the Tesseract method recommended above.

If you can copy and paste the text, that means it's actual text in the PDF file itself. In that case, there are a few different PDF parsing libraries in Python that you can try to use to grab the text.
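A minimal sketch of that approach, assuming the third-party `pypdf` package (one of several possible libraries, installed with `pip install pypdf`). The helper names and the 20-character threshold are arbitrary choices for illustration, not anything prescribed by the library.

```python
def extract_pdf_text(pdf_path):
    """Return the embedded text of every page, joined by newlines.

    Assumption: the third-party pypdf package is installed.
    """
    # Imported here so looks_scanned() below can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty string for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def looks_scanned(extracted_text, min_chars=20):
    """Heuristic: almost no extractable text suggests image-only pages.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    return len(extracted_text.strip()) < min_chars
```

This is the programmatic version of the copy-and-paste check: if `looks_scanned(extract_pdf_text(path))` is true, the PDF probably contains only images and needs OCR instead.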
